### Logistic Regression: Breast Cancer Prediction

Earlier results showed that **scaling** improves model performance, so this section evaluates **Logistic Regression** on scaled data. The goal is to determine whether Logistic Regression outperforms other models (e.g., SVM) on the breast cancer diagnosis dataset.

#### Steps Performed:
1. **Data Preprocessing**:
   - Removing the `ID` column, which is irrelevant for prediction (this step is commented out in the code below).
   - Encoded the `diagnosis` column (`M` = 1, `B` = 0).
   - Standardized the features using `StandardScaler` so that every feature has zero mean and unit variance.

2. **Train-Test-Validation Split**:
   - Split the data into **train (80%)**, **test (15%)**, and **validation (5%)** sets.

3. **Model Development**:
   - Trained a **Logistic Regression** model on the scaled training data.
   - Evaluated the model using **10-fold cross-validation** to assess generalization.

4. **Performance Evaluation**:
   - Evaluated the model on the **test set** and **validation set** using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

5. **Hyperparameter Tuning**:
   - Used **Randomized Search** to find the optimal hyperparameters for Logistic Regression.

#### Next Steps:
- Compare the performance of Logistic Regression with other models (e.g., SVM) to determine the best-performing model for this dataset.
- Extend the **Randomized Search** hyperparameter tuning to further optimize Logistic Regression.


## Importing the libraries

In [158]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from scipy.stats import uniform

In [160]:
# Load the dataset
data = pd.read_csv('cancer_data.csv')

In [162]:
# Step 1: Remove the ID column
# (left commented out; uncomment only if the CSV contains an 'ID' column)
#data.drop(columns=['ID'], inplace=True)

In [163]:
# Step 2: Encode the Diagnosis column (M = 1, B = 0)
label_encoder = LabelEncoder()
data['diagnosis'] = label_encoder.fit_transform(data['diagnosis'])
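
The `M` = 1, `B` = 0 mapping follows from `LabelEncoder` sorting the class labels before numbering them. A minimal sketch that checks this on a few toy labels (the `labels` series is a hypothetical stand-in for the `diagnosis` column, not data from `cancer_data.csv`):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for the `diagnosis` column
labels = pd.Series(["M", "B", "B", "M"], name="diagnosis")

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is "B" and index 1 is "M"
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Inspecting `label_encoder.classes_` after fitting on the real column is a quick way to confirm which class the positive label refers to before reading precision and recall.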

In [164]:
# Step 3: Choose Diagnosis as the target variable
X = data.drop(columns=['diagnosis'])  # Features
y = data['diagnosis']  # Target

In [168]:
# Step 4: Standardize the features (important for Logistic Regression)
# Note: the scaler is fit on all rows, including those later held out in Step 5
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
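
The cell above fits `StandardScaler` on the full dataset before the split, so the scaling statistics include rows that later land in the test and validation sets. One possible way to avoid this is to wrap the scaler and the model in a `Pipeline`, so the scaler is fit on training data only. The sketch below is an illustration under assumptions, not the notebook's method: it uses scikit-learn's bundled breast cancer dataset as a stand-in for `cancer_data.csv`, and the `*_demo` and `pipe` names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for cancer_data.csv
X_demo, y_raw = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_raw  # bundled target is 0 = malignant; flip so malignant = 1

X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is re-fit on the training part of every fold, so held-out
# rows never influence the mean and variance used for scaling
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe_scores = cross_val_score(pipe, X_tr, y_tr, cv=10, scoring="accuracy")
print("Mean CV accuracy:", pipe_scores.mean())

# Fit on the training split only, then score on the untouched hold-out rows
pipe.fit(X_tr, y_tr)
holdout_accuracy = pipe.score(X_hold, y_hold)
print("Hold-out accuracy:", holdout_accuracy)
```

The same pipeline object can be passed as the `estimator` to `RandomizedSearchCV`, with parameter names prefixed by the step name (for example `logisticregression__C`).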

In [169]:
# Step 5: Split the data into train (80%), test (15%), and validation (5%)
X_train, X_temp, y_train, y_temp = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 * 20% = 5%
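
The two-stage split can be checked with a little arithmetic: the first call holds out 20% of the rows, and the second call sends 25% of that hold-out (0.25 × 20% = 5%) to validation, leaving 15% for testing. The sketch below reproduces the resulting set sizes on placeholder data. The row count of 569 is an assumption, chosen because it is consistent with the 85 test rows and 29 validation rows in the reported confusion matrices; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed row count; placeholder features and labels, not the real data
n_rows = 569
X_toy = np.arange(n_rows).reshape(-1, 1)
y_toy = np.arange(n_rows) % 2

# Same two calls as in Step 5: 80% train, then 20% split into 15% + 5%
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
X_te, X_va, y_te, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)

sizes = {"train": len(X_tr), "test": len(X_te), "validation": len(X_va)}
for name, size in sizes.items():
    print(f"{name}: {size} rows ({size / n_rows:.1%})")
```

Passing `stratify=y` to both calls would additionally keep the class ratio similar across the three sets; that is an optional change, not something the cell above does.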

In [172]:
# Step 6: Develop Logistic Regression model
log_reg = LogisticRegression(random_state=42)

In [None]:
# Step 7: Evaluate the model using 10-fold cross-validation on the training set
cv_scores = cross_val_score(log_reg, X_train, y_train, cv=10, scoring='accuracy')
print("10-Fold Cross-Validation Accuracy Scores:", cv_scores)
print("Mean Cross-Validation Accuracy:", np.mean(cv_scores))

10-Fold Cross-Validation Accuracy Scores: [0.97826087 0.97826087 0.97826087 0.95652174 1.         1.
 0.97777778 0.97777778 0.93333333 0.93333333]
Mean Cross-Validation Accuracy: 0.9713526570048309

In [174]:
# Step 8: Hyperparameter Tuning using Randomized Search
param_dist = {
    'C': uniform(0.01, 100),  # Inverse regularization strength, sampled uniformly from [0.01, 100.01]
    'penalty': ['l2'],  # Regularization type (L2 is supported by all four solvers below)
    'solver': ['lbfgs', 'liblinear', 'sag', 'saga']  # Solvers for optimization
}
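
Note that `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, so `uniform(0.01, 100)` covers 0.01 to 100.01 and only about 1% of the draws fall below 1. The sketch below illustrates this and compares it with `scipy.stats.loguniform`, a possible alternative that spreads samples evenly across orders of magnitude. The log-uniform variant is a suggestion for comparison only and is not what the search in this notebook uses; all variable names are hypothetical.

```python
from scipy.stats import loguniform, uniform

# scipy's uniform(loc, scale) samples from [loc, loc + scale]
c_uniform = uniform(0.01, 100)
lower, upper = c_uniform.ppf(0.0), c_uniform.ppf(1.0)
print("Uniform bounds:", lower, upper)

uniform_draws = c_uniform.rvs(size=1000, random_state=42)
share_uniform_small = (uniform_draws < 1).mean()
print("Share of uniform draws below 1:", share_uniform_small)

# Alternative (not used in the cells above): sample C on a log scale
c_log = loguniform(0.01, 100)
log_draws = c_log.rvs(size=1000, random_state=42)
share_log_small = (log_draws < 1).mean()
print("Share of log-uniform draws below 1:", share_log_small)
```

Because `C` acts multiplicatively on the regularization term, a log-scale search tends to explore strongly regularized settings (small `C`) far more often than a linear-scale search over the same range.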

In [198]:
# Set up Randomized Search with Cross-Validation
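random_search = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=param_dist,
    n_iter=1000,         # Number of parameter settings sampled
    scoring='accuracy',  # Metric used to rank the candidates
    cv=10,               # 10-fold cross-validation per candidate
    verbose=1,
    n_jobs=-1,           # Use all available CPU cores
    random_state=42      # Reproducible sampling
)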


In [None]:
random_search.fit(X_train, y_train)

Fitting 10 folds for each of 1000 candidates, totalling 10000 fits


In [None]:
# Best parameters and score
print("Best Parameters:", random_search.best_params_)
print("Best Cross-Validation Accuracy:", random_search.best_score_)

In [184]:
# Step 9: Take the final model with the best parameters
# RandomizedSearchCV uses refit=True by default, so best_estimator_ has
# already been refit on X_train and needs no further fit call
final_model = random_search.best_estimator_
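
With the default `refit=True`, `RandomizedSearchCV` retrains the best candidate on all the data passed to `fit`, so `best_estimator_` is ready to use as soon as the search finishes. A minimal sketch of that behaviour, using a small synthetic dataset and a reduced search (`X_syn`, `y_syn`, `search`, and the smaller `n_iter` and `cv` values are hypothetical, chosen only to keep the example fast):

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data, used only to keep the sketch fast
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=42)

search = RandomizedSearchCV(
    estimator=LogisticRegression(random_state=42, max_iter=1000),
    param_distributions={"C": uniform(0.01, 100)},
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_syn, y_syn)

# refit=True is the default, so best_estimator_ is already trained on
# all of X_syn with the best parameters and can predict right away
print("refit:", search.refit)
print("Coefficient shape:", search.best_estimator_.coef_.shape)
print("Predictions:", search.best_estimator_.predict(X_syn[:5]))
```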



In [186]:
# Step 10: Evaluate the model on the test set
y_test_pred = final_model.predict(X_test)
y_test_pred_prob = final_model.predict_proba(X_test)[:, 1]  # Needed for ROC-AUC

In [188]:
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_roc_auc = roc_auc_score(y_test, y_test_pred_prob)

print("\nTest Set Performance:")
print(f"Accuracy: {test_accuracy:.4f}")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")
print(f"F1-Score: {test_f1:.4f}")
print(f"ROC-AUC: {test_roc_auc:.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))


Test Set Performance:
Accuracy: 0.9882
Precision: 1.0000
Recall: 0.9706
F1-Score: 0.9851
ROC-AUC: 0.9983
Confusion Matrix:
 [[51  0]
 [ 1 33]]
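
As a cross-check, the reported test metrics can be recomputed by hand from the confusion matrix above. In scikit-learn's layout, rows are true labels and columns are predicted labels, so with `B` = 0 and `M` = 1 the matrix reads as 51 true negatives, 0 false positives, 1 false negative, and 33 true positives. A short sketch of the arithmetic:

```python
import numpy as np

# Confusion matrix reported for the test set: rows are true labels,
# columns are predicted labels (0 = benign, 1 = malignant)
cm = np.array([[51, 0],
               [1, 33]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 84 of 85 predictions are correct
precision = tp / (tp + fp)        # no benign case was flagged as malignant
recall = tp / (tp + fn)           # 1 of 34 malignant cases was missed
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
```

The single error is a false negative, that is, a malignant case predicted as benign, which is why recall is the only one of these metrics below 1 apart from the accuracy and F1 values derived from it.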


In [190]:
# Step 11: Predict on the unseen validation set
y_val_pred = final_model.predict(X_val)
y_val_pred_prob = final_model.predict_proba(X_val)[:, 1]

val_accuracy = accuracy_score(y_val, y_val_pred)
val_precision = precision_score(y_val, y_val_pred)
val_recall = recall_score(y_val, y_val_pred)
val_f1 = f1_score(y_val, y_val_pred)
val_roc_auc = roc_auc_score(y_val, y_val_pred_prob)

print("\nValidation Set Performance:")
print(f"Accuracy: {val_accuracy:.4f}")
print(f"Precision: {val_precision:.4f}")
print(f"Recall: {val_recall:.4f}")
print(f"F1-Score: {val_f1:.4f}")
print(f"ROC-AUC: {val_roc_auc:.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred))


Validation Set Performance:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-Score: 1.0000
ROC-AUC: 1.0000
Confusion Matrix:
 [[20  0]
 [ 0  9]]
