# 📚 **Import Necessary Libraries**

### 🔧 **Data Manipulation**
- `pandas` (`pd`): For handling and analyzing data in tabular format.
- `numpy` (`np`): For numerical computations and array manipulations.

### 📊 **Visualization**
- `plotly.express` (`px`): For creating expressive and interactive plots.
- `plotly.graph_objects` (`go`): For more customizable visualizations.

### 🧪 **Model Development**
- **Data Splitting and Preprocessing**:
  - `train_test_split`: To split the dataset into training and testing subsets.
  - `LabelEncoder`: For encoding categorical labels into numerical values.
  - `StandardScaler`: For standardizing features by removing the mean and scaling to unit variance.

- **Evaluation Metrics**:
  - `accuracy_score`: To calculate the accuracy of predictions.
  - `classification_report`: To generate a detailed classification report.
  - `confusion_matrix`: To evaluate model performance using confusion matrices.

### 🤖 **Machine Learning Models**
- `XGBClassifier`: Extreme Gradient Boosting for efficient and powerful classification.
- `SVC`: Support Vector Classifier for linear and non-linear classification.
- `RandomForestClassifier`: An ensemble learning method using decision trees.
- `LogisticRegression`: For binary or multinomial logistic regression.
- `MLPClassifier`: A neural network model for classification.

### 💾 **Model Saving**
- `joblib`: For saving and loading trained models efficiently.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
import joblib

# 📂 **Load and Preprocess Dataset**

### 🔄 **Steps**:
1. **Specify Dataset Path**:
   - `file_path = "dataset/breast-cancer.csv"`
   - This points to the dataset file in CSV format.

2. **Load the Dataset**:
   - Use `pd.read_csv(file_path)` to load the dataset into a DataFrame:
     ```python
     data = pd.read_csv(file_path)
     ```

### 💡 **Note**:
- Ensure the file exists at the specified location.
- Replace `"dataset/breast-cancer.csv"` with the actual path to your dataset if necessary.
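The existence check in the note above can be made explicit in code. The following is a minimal sketch, not part of the original notebook: `load_dataset` and its `required_columns` parameter are hypothetical names, and the demonstration writes a tiny throwaway CSV with toy values instead of reading the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_dataset(path, required_columns=("diagnosis",)):
    """Load a CSV file, failing early if the file or expected columns are missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    frame = pd.read_csv(path)
    missing = [col for col in required_columns if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


# Demonstration on a throwaway file (toy values, not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("id,diagnosis,radius_mean\n1,M,17.99\n2,B,11.42\n")
    demo = load_dataset(demo_path)

print(demo.shape)
```

Failing with a clear message at load time is usually easier to debug than a `KeyError` several cells later.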

In [2]:
# Load and Preprocess Dataset 
file_path = "dataset/breast-cancer.csv"  
data = pd.read_csv(file_path)

# 🔑 **Encode the Target Variable**

### 🎯 **Target: `diagnosis`**
- The target variable `diagnosis` contains categorical labels (`M` for Malignant and `B` for Benign).

### 🧰 **Encoding Steps**:
1. **Initialize a Label Encoder**:
   - Use `LabelEncoder()` to convert categorical labels into numerical values.
     ```python
     label_encoder = LabelEncoder()
     ```

2. **Transform the `diagnosis` Column**:
   - Apply the encoder to the `diagnosis` column.
     ```python
     data['diagnosis'] = label_encoder.fit_transform(data['diagnosis'])
     ```

3. **Encoding Output** (`LabelEncoder` assigns integers in sorted label order, so `B` comes before `M`):
   - `M` (Malignant) → `1`
   - `B` (Benign) → `0`

### 💡 **Why Encode?**
- Machine learning models require numerical input for predictions, so categorical data must be transformed into numerical representations.
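To see where the `M` → `1`, `B` → `0` mapping comes from, it can help to inspect the fitted encoder. This is a small sketch on toy labels (the list `labels` is made up for illustration); it relies on `LabelEncoder` storing its classes in sorted order in `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "B", "B", "M"]  # toy labels, not the real column

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels; a label's position is its integer code.
print(encoder.classes_.tolist())
print(encoded.tolist())

# inverse_transform maps integer codes back to the original labels,
# which is useful when presenting predictions to a reader.
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())
```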

In [3]:
# Encode the target variable ('diagnosis')
label_encoder = LabelEncoder()
data['diagnosis'] = label_encoder.fit_transform(data['diagnosis'])  # M=1, B=0

In [4]:
data.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 1 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | 1 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | 1 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


# 🗑️ **Drop the 'id' Column**

### 📋 **Purpose**:
- The `id` column serves as a unique identifier for rows but does not contribute to the prediction task.
- It is unnecessary for model training and can be safely removed.

### 🚀 **Code**:
1. **Check for the Column**:
   - Ensure the `id` column exists in the dataset:
     ```python
     if 'id' in data.columns:
     ```

2. **Drop the Column**:
   - Remove the `id` column using the `drop` method:
     ```python
     data = data.drop(columns=['id'])
     ```

### 💡 **Note**:
- This ensures compatibility with datasets where the `id` column might not always be present.
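An alternative to the explicit `if` check is the `errors="ignore"` option of `DataFrame.drop`, which turns the call into a no-op when the column is missing. The sketch below uses a toy frame (`frame` and its values are made up for illustration).

```python
import pandas as pd

# Toy frame standing in for the real dataset.
frame = pd.DataFrame(
    {"id": [1, 2], "diagnosis": [1, 0], "radius_mean": [17.99, 11.42]}
)

# errors="ignore" skips labels that are not present instead of raising KeyError.
without_id = frame.drop(columns=["id"], errors="ignore")

# Dropping again is harmless: the column is already gone.
dropped_again = without_id.drop(columns=["id"], errors="ignore")

print(list(without_id.columns))
```

Both styles behave the same here; the one-liner is simply shorter when several optional columns need to be removed.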

In [5]:
# Drop the 'id' column 
if 'id' in data.columns:
    data = data.drop(columns=['id'])

# ✂️ **Split Data into Features (`X`) and Target (`y`)**

### 🎯 **Objective**:
- Separate the dataset into:
  - **Features (`X`)**: Input variables used for prediction.
  - **Target (`y`)**: The output variable (`diagnosis`) to predict.

### 🚀 **Code**:
1. **Drop the Target Column for Features (`X`)**:
   - Exclude the `diagnosis` column from the dataset to create the features set:
     ```python
     X = data.drop(columns=['diagnosis'])
     ```

2. **Extract the Target (`y`)**:
   - Assign the `diagnosis` column as the target:
     ```python
     y = data['diagnosis']
     ```

### 📊 **Result**:
- `X`: Contains all columns except `diagnosis` (features).
- `y`: Contains the `diagnosis` column (target variable).

### 💡 **Why Split?**
- This separation is crucial for training machine learning models, which learn patterns in the features (`X`) to predict the target (`y`).

In [6]:
# Split data into features (X) and target (y)
X = data.drop(columns=['diagnosis'])
y = data['diagnosis']

# 📏 **Standardize the Features**

### 🎯 **Objective**:
- Scale the features (`X`) to have a mean of 0 and a standard deviation of 1, so that no feature dominates simply because of its units or range.

### 🔧 **Steps**:
1. **Initialize the Scaler**:
   - Use `StandardScaler` from `sklearn.preprocessing` to standardize the features:
     ```python
     scaler = StandardScaler()
     ```

2. **Fit and Transform the Features**:
   - Fit the scaler to the data and transform the features:
     ```python
     X = scaler.fit_transform(X)
     ```

3. **Save the Scaler**:
   - Save the scaler object for consistent preprocessing during deployment:
     ```python
     joblib.dump(scaler, 'scaler.pkl')
     ```

### 💡 **Why Standardize?**
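- Standardization puts all features on a comparable scale, which helps scale-sensitive models such as `SVC`, `LogisticRegression`, and `MLPClassifier` and keeps features with larger ranges from dominating training; tree-based models are largely unaffected.
- Note that fitting the scaler on the full dataset before splitting lets test-set statistics influence preprocessing; fitting it on the training split only gives a cleaner performance estimate.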

### 🗂️ **Result**:
- `X`: Standardized feature set with mean ≈ 0 and standard deviation ≈ 1.
- Scaler saved as `scaler.pkl` for use during inference.
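The cells in this section fit the scaler on the full feature matrix, before the train/test split. A commonly recommended variant is to split first and fit the scaler on the training rows only, so that no test-set statistics reach the preprocessing step. The sketch below illustrates that order on synthetic data; `X_demo`, `y_demo`, and the other names are assumptions for illustration, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# 1. Split first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2. Learn mean and standard deviation from the training rows only.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# 3. Reuse the training statistics on the test rows (transform, not fit_transform).
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The test features will not have exactly zero mean and unit variance after this step, which is expected: they are scaled with the statistics the model saw during training, exactly as new data would be at inference time.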

In [7]:
# Standardize the features and save the scaler for deployment
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [8]:
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']

# ✂️ **Split into Training and Test Sets**

### 🎯 **Objective**:
- Divide the dataset into:
  - **Training Set**: Used to train the model.
  - **Test Set**: Used to evaluate the model's performance on unseen data.

### 🔧 **Code**:
1. **Use `train_test_split`**:
   - Split the standardized features (`X`) and target (`y`) into training and test sets:
     ```python
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
     ```

2. **Parameters**:
   - `test_size=0.2`: Reserves 20% of the data for testing.
   - `random_state=42`: Ensures reproducibility of the split.

### 📊 **Result**:
- `X_train`, `y_train`: Training features and target.
- `X_test`, `y_test`: Test features and target.

### 💡 **Why Split?**
- To evaluate the model's generalization ability on unseen data.
- Comparing training and test performance reveals overfitting; the split itself does not prevent it.
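When the classes are not equally frequent, a plain random split can leave the test set with a slightly different class ratio than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses toy labels with a 70/30 imbalance; all names and counts are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 140 samples of class 0 and 60 of class 1.
y_toy = np.array([0] * 140 + [1] * 60)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions (approximately) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"class-1 share in train: {y_tr.mean():.3f}")
print(f"class-1 share in test:  {y_te.mean():.3f}")
```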

In [9]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 📊 **Class Distribution Visualization**

### 🎯 **Objective**:
- Visualize the distribution of the target variable (`diagnosis`) to understand the balance between `Benign` and `Malignant` cases.

### 🔧 **Code**:
1. **Calculate Class Counts**:
   - Use `value_counts` to count the number of occurrences for each class:
     ```python
     # sort_index() orders the counts as 0 (Benign), 1 (Malignant) to match the x labels
     class_counts = data['diagnosis'].value_counts().sort_index()
     ```

2. **Create a Bar Chart**:
   - Use `plotly.express` to visualize the class distribution:
     ```python
     fig = px.bar(
         x=['Benign', 'Malignant'], 
         y=class_counts.values, 
         labels={'x': 'Diagnosis', 'y': 'Count'}, 
         title='Class Distribution'
     )
     ```

3. **Display the Chart**:
   - Render the plot:
     ```python
     fig.show()
     ```

### 📊 **Visualization Output**:
- A bar chart showing the count of `Benign` and `Malignant` cases.

### 💡 **Why Visualize?**
- To check for class imbalance, which can affect model performance.
- If classes are imbalanced, techniques like oversampling, undersampling, or class weighting might be required.
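As a rough illustration of the class-weighting option mentioned above, the sketch below computes class proportions and "balanced" weights for toy labels with a 70/30 split (the labels are made up; the real counts come from `class_counts`). It uses `compute_class_weight` from `sklearn.utils.class_weight`, which computes `n_samples / (n_classes * count)` for each class.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 70 samples of class 0, 30 of class 1.
y_toy = np.array([0] * 70 + [1] * 30)

classes, counts = np.unique(y_toy, return_counts=True)
proportions = counts / counts.sum()

# "balanced" gives the rarer class a proportionally larger weight.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

for cls, share, weight in zip(classes, proportions, weights):
    print(f"class {cls}: share={share:.2f}, balanced weight={weight:.3f}")
```

Several scikit-learn classifiers used later, including `SVC`, `RandomForestClassifier`, and `LogisticRegression`, accept `class_weight='balanced'` directly, so the weights rarely need to be computed by hand.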

In [10]:
# Class Distribution Visualization
# sort_index() orders the counts as 0 (Benign), 1 (Malignant) to match the x labels
class_counts = data['diagnosis'].value_counts().sort_index()
fig = px.bar(
    x=['Benign', 'Malignant'], 
    y=class_counts.values, 
    labels={'x': 'Diagnosis', 'y': 'Count'}, 
    title='Class Distribution'
)
fig.show()

# 🌡️ **Correlation Heatmap**

### 🎯 **Objective**:
- Visualize the correlation between features to identify relationships or redundancies in the dataset.

### 🔧 **Code**:
1. **Compute the Correlation Matrix**:
   - Use the `corr` method to calculate pairwise correlations between features:
     ```python
     correlation_matrix = data.corr()
     ```

2. **Create the Heatmap**:
   - Use `plotly.express.imshow` to create an interactive heatmap:
     ```python
     fig = px.imshow(
         correlation_matrix, 
         title='Feature Correlation Heatmap', 
         color_continuous_scale='Viridis'
     )
     ```

3. **Display the Heatmap**:
   - Render the heatmap:
     ```python
     fig.show()
     ```

### 🌈 **Visualization Output**:
- A heatmap showing correlation coefficients:
  - Values range from `-1` (perfect negative correlation) to `1` (perfect positive correlation).
  - `0` indicates no correlation.

### 💡 **Why Visualize?**
- Identify highly correlated features that may cause multicollinearity, which can make the coefficients of linear models such as logistic regression unstable and harder to interpret.
- Aid in feature selection by removing redundant features.
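Reading redundant pairs off a heatmap by eye gets difficult with many features. A possible complement is to list the pairs whose absolute correlation exceeds a threshold. The helper below is a sketch under assumptions: `highly_correlated_pairs` and the `0.9` threshold are hypothetical choices, and the demonstration uses synthetic columns, not the real dataset.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic example: "perimeter" is almost a linear function of "radius".
rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.0, size=300)
toy = pd.DataFrame(
    {
        "radius": radius,
        "perimeter": 2 * np.pi * radius + rng.normal(0.0, 0.5, size=300),
        "texture": rng.normal(19.0, 4.0, size=300),
    }
)

pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data, the same call could be applied to `data.drop(columns=['diagnosis'])` to restrict the search to the features.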

In [11]:
# Correlation Heatmap
correlation_matrix = data.corr()
fig = px.imshow(
    correlation_matrix, 
    title='Feature Correlation Heatmap', 
    color_continuous_scale='Viridis'
)
fig.show()

# 🔍 **Important Features Visualization**

### 🎯 **Objective**:
- Visualize the relationship between two key features, `radius_mean` and `texture_mean`, and how they vary by target class (`diagnosis`).

### 🔧 **Code**:
1. **Create a Scatter Plot**:
   - Use `plotly.express.scatter` to plot `radius_mean` vs. `texture_mean`:
     ```python
     fig = px.scatter(
         data, 
         x='radius_mean', 
         y='texture_mean', 
         color='diagnosis', 
         labels={'diagnosis': 'Diagnosis'},
         title='Radius Mean vs Texture Mean'
     )
     ```

2. **Display the Scatter Plot**:
   - Render the plot:
     ```python
     fig.show()
     ```

### 📊 **Visualization Output**:
- A scatter plot with:
  - **X-axis**: `radius_mean`
  - **Y-axis**: `texture_mean`
  - Points colored by the encoded `diagnosis` value (`0` = Benign, `1` = Malignant).

### 💡 **Why Visualize?**
- Explore how these features differentiate between classes.
- Identify feature separability and potential decision boundaries for classification.

In [12]:
# Important Features Visualization
fig = px.scatter(
    data, 
    x='radius_mean', 
    y='texture_mean', 
    color='diagnosis', 
    labels={'diagnosis': 'Diagnosis'},
    title='Radius Mean vs Texture Mean'
)
fig.show()

# 🤖 **Initialize Models**

### 🎯 **Objective**:
- Set up a dictionary of machine learning models for training and evaluation.

### 🔧 **Code**:
1. **Define the Models**:
   - Initialize a collection of popular classification algorithms:
     ```python
     models = {
         'XGBoost': XGBClassifier(eval_metric='logloss', use_label_encoder=False, random_state=42),
         'SVC': SVC(kernel='linear', probability=True, random_state=42),
         'Random Forest': RandomForestClassifier(random_state=42),
         'Logistic Regression': LogisticRegression(random_state=42),
         'Neural Network': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
     }
     ```

2. **Included Models**:
   - **XGBoost**:
     - Extreme Gradient Boosting for high-performance classification.
     - Parameters:
       - `eval_metric='logloss'`: Evaluation metric for optimization.
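       - `use_label_encoder=False`: Turned off XGBoost's built-in label encoding in older releases; recent versions ignore the parameter and emit the "not used" warning shown in the training output.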
   - **SVC**:
     - Support Vector Classifier with a linear kernel.
     - Parameters:
       - `kernel='linear'`: Uses a linear decision boundary.
       - `probability=True`: Enables probability estimation.
   - **Random Forest**:
     - An ensemble of decision trees.
     - Parameter:
       - `random_state=42`: Ensures reproducibility.
   - **Logistic Regression**:
     - A simple and interpretable binary classifier.
   - **Neural Network**:
     - Multi-Layer Perceptron (MLP) with 2 hidden layers of sizes `100` and `50`.
     - Parameters:
       - `max_iter=500`: Sets the maximum number of iterations.

### 💡 **Why Use Multiple Models?**
- To compare the performance of various algorithms and select the best-performing one for deployment.

In [13]:
# Initialize models (training happens in the next section)

models = {
    'XGBoost': XGBClassifier(eval_metric='logloss', use_label_encoder=False, random_state=42),
    'SVC': SVC(kernel='linear', probability=True, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
}

# 🏋️‍♂️ **Train and Evaluate Models**

### 🎯 **Objective**:
- Train each model on the training dataset and evaluate its performance on the test dataset.

### 🔧 **Code**:
1. **Initialize Performance Dictionary**:
   - Create an empty dictionary to store the accuracy of each model:
     ```python
     model_performance = {}
     ```

2. **Iterate Over Models**:
   - For each model in the `models` dictionary:
     - Train the model using `X_train` and `y_train`:
       ```python
       model.fit(X_train, y_train)
       ```
     - Make predictions on the test set:
       ```python
       y_pred = model.predict(X_test)
       ```
     - Calculate accuracy:
       ```python
       accuracy = accuracy_score(y_test, y_pred)
       ```
     - Store the accuracy in the `model_performance` dictionary:
       ```python
       model_performance[name] = accuracy
       ```

3. **Print Performance Details**:
   - Display the accuracy and classification report for each model:
     ```python
     print(f"{name} Accuracy: {accuracy:.4f}")
     print(classification_report(y_test, y_pred))
     ```

### 📊 **Output**:
- **Accuracy**:
  - Numerical accuracy score for each model.
- **Classification Report**:
  - Precision, recall, F1-score, and support for each class.
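The precision, recall, and F1 values in the classification report all derive from the confusion matrix, which this notebook imports (`confusion_matrix`) but does not display. The sketch below shows the relationship on hand-made toy predictions; the label lists are invented for illustration and are not model output.

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (1 = positive class).
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, rows are true classes and columns are predicted classes,
# so ravel() yields: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In a diagnostic setting, the false negatives (malignant cases predicted as benign) are often the costliest errors, so recall for class `1` deserves attention alongside overall accuracy.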

### 💡 **Why Iterate?**
- To compare the performance of different models systematically.
- To identify the best-performing model for deployment.
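With a test set of only 114 samples, accuracy differences of one percentage point correspond to a single prediction, so a single split may not separate the models reliably. One possible complement is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses scikit-learn's bundled breast cancer dataset as a stand-in for the CSV (its label encoding is its own and may differ from this notebook's), compares only two of the five models, and wraps each in a pipeline so the scaler is refit inside every fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the dataset shipped with scikit-learn.
X_cv, y_cv = load_breast_cancer(return_X_y=True)

candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
    ),
    "Random Forest": make_pipeline(
        StandardScaler(), RandomForestClassifier(random_state=42)
    ),
}

cv_scores = {}
for name, estimator in candidates.items():
    # 5-fold cross-validation; each fold fits scaler and model on 4/5 of the data.
    scores = cross_val_score(estimator, X_cv, y_cv, cv=5, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

The standard deviation across folds gives a sense of how much of the gap between two models is noise.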

In [14]:
model_performance = {}  # Store model performances

In [15]:
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predict on the test set
    accuracy = accuracy_score(y_test, y_pred)  # Calculate accuracy
    model_performance[name] = accuracy

    # Print performance details
    print(f"{name} Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_pred))


Parameters: { "use_label_encoder" } are not used.




XGBoost Accuracy: 0.9561
              precision    recall  f1-score   support

           0       0.96      0.97      0.97        71
           1       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

SVC Accuracy: 0.9561
              precision    recall  f1-score   support

           0       0.97      0.96      0.96        71
           1       0.93      0.95      0.94        43

    accuracy                           0.96       114
   macro avg       0.95      0.96      0.95       114
weighted avg       0.96      0.96      0.96       114

Random Forest Accuracy: 0.9649
              precision    recall  f1-score   support

           0       0.96      0.99      0.97        71
           1       0.98      0.93      0.95        43

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       1

# 📈 **Compare Model Performances**

### 🎯 **Objective**:
- Visualize and compare the accuracy of all trained models.
- Highlight the best-performing model.

### 🔧 **Code**:
1. **Create a Bar Chart**:
   - Use `plotly.express.bar` to create a bar chart of model performances:
     ```python
     fig = px.bar(
         x=list(model_performance.keys()),
         y=list(model_performance.values()),
         labels={'x': 'Model', 'y': 'Accuracy'},
         title='Model Performance Comparison'
     )
     ```

2. **Identify the Best Model**:
   - Find the model with the highest accuracy:
     ```python
     best_model_name = max(model_performance, key=model_performance.get)
     best_accuracy = model_performance[best_model_name]
     ```

3. **Highlight the Best Model**:
   - Add a marker to emphasize the best-performing model:
     ```python
     fig.add_trace(go.Scatter(
         x=[best_model_name],
         y=[best_accuracy],
         mode='markers+text',
         text=['Best Model'],
         textposition='top center',
         marker=dict(color='red', size=12)
     ))
     ```

4. **Display the Chart**:
   - Render the performance comparison plot:
     ```python
     fig.show()
     ```

### 📊 **Visualization Output**:
- A bar chart comparing the accuracy of each model.
- A highlighted marker indicating the best model with its accuracy.

### 💡 **Why Visualize?**
- Quickly identify the top-performing model for further use.
- Gain insights into the relative performance of all models.

In [17]:
# Compare Model Performances

# Create bar chart for model performances
fig = px.bar(
    x=list(model_performance.keys()),
    y=list(model_performance.values()),
    labels={'x': 'Model', 'y': 'Accuracy'},
    title='Model Performance Comparison'
)

# Highlight the best model
best_model_name = max(model_performance, key=model_performance.get)
best_accuracy = model_performance[best_model_name]
fig.add_trace(go.Scatter(
    x=[best_model_name],
    y=[best_accuracy],
    mode='markers+text',
    text=['Best Model'],
    textposition='top center',
    marker=dict(color='red', size=12)
))
fig.show()

# 💾 **Save the Best Model**

### 🎯 **Objective**:
- Save the best-performing model to a file for future deployment or inference.

### 🔧 **Code**:
1. **Select the Best Model**:
   - Retrieve the best model using its name:
     ```python
     best_model = models[best_model_name]
     ```

2. **Save the Model**:
   - Use `joblib.dump` to save the model as a `.pkl` file:
     ```python
     joblib.dump(best_model, 'best_model.pkl')
     ```

3. **Print Confirmation**:
   - Confirm the saved model and its filename:
     ```python
     print(f"Best model ({best_model_name}) saved as 'best_model.pkl'.")
     ```

### 🗂️ **Output**:
- A file named `best_model.pkl` containing the serialized model.

### 💡 **Why Save?**
- Ensures the best model can be reused without retraining.
- Facilitates integration into applications or APIs for real-time predictions.
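At inference time the saved model needs inputs that were scaled the same way as during training, which is why the scaler is stored separately as `scaler.pkl`. One way to avoid keeping two files in sync is to save preprocessing and model together as a single pipeline. The sketch below illustrates that idea on synthetic data; the file name `pipeline_demo.pkl` and all variable names are hypothetical, and it is not how this notebook saves its artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the real features and labels.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Scaler and classifier are fitted and stored as one object.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipeline.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact = str(Path(tmp_dir) / "pipeline_demo.pkl")  # hypothetical file name
    joblib.dump(pipeline, artifact)
    restored = joblib.load(artifact)

# The restored pipeline accepts raw (unscaled) features and reproduces the predictions.
same_predictions = np.array_equal(pipeline.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

As with any pickle-based format, files written by `joblib.dump` should only be loaded from trusted sources, and ideally with the same library versions that produced them.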

In [18]:
# Save Best Model 
best_model = models[best_model_name]
joblib.dump(best_model, 'best_model.pkl')  # Save the best model for deployment
print(f"Best model ({best_model_name}) saved as 'best_model.pkl'.")

Best model (Logistic Regression) saved as 'best_model.pkl'.
