### Support Vector Machines (SVM): Breast Cancer Prediction
This project focuses on predicting breast cancer diagnosis (malignant or benign) using machine learning. The dataset contains features computed from cell nuclei images, and the goal is to evaluate the impact of **data scaling** and **hyperparameter tuning** on model performance, particularly for **Support Vector Machines (SVM)**. Below are the steps performed:

1. **Data Preprocessing**:
   - Removed the `ID` column as it is irrelevant for prediction.
   - Encoded the `Diagnosis` column (`M` = 1, `B` = 0).
   - Checked for and handled missing values (if any).
   - Split the data into **train (80%)**, **test (15%)**, and **validation (5%)** sets.

2. **Feature Scaling**:
   - Applied **StandardScaler** to standardize features, ensuring consistent scaling for algorithms sensitive to feature magnitudes (e.g., Logistic Regression, SVM).

3. **Model Development**:
   - Trained a **Support Vector Machine (SVM)** classifier with an RBF kernel.
   - Evaluated using **10-fold cross-validation** to assess generalization.
   - Tested the model on the **test set** and validated on the **validation set**.

4. **Impact of Scaling**:
   - Compared model performance with and without scaling to evaluate its importance.

5. **Hyperparameter Tuning for SVM**:
   - Used **GridSearchCV** (an exhaustive, cross-validated grid search) to look for better SVM hyperparameters.
   - Evaluated whether tuning improves SVM performance.

6. **Performance Metrics**:
   - Measured accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices for all models.
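
The metrics listed in step 6 can be gathered with one small helper. The sketch below is only illustrative: `summarize_metrics` is a hypothetical name that does not appear in the notebook, and the toy arrays are made up to keep the example self-contained.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)


def summarize_metrics(y_true, y_pred, y_score):
    """Return the step-6 metrics in one dictionary (hypothetical helper)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 reports 0.0 without a warning when no positives are predicted
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # ROC-AUC needs continuous scores (probabilities or decision values)
        "roc_auc": roc_auc_score(y_true, y_score),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }


# Toy labels, predictions and scores (assumed values, not notebook data)
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1])
y_score = np.array([0.1, 0.2, 0.4, 0.9])

metrics = summarize_metrics(y_true, y_pred, y_score)
for name, value in metrics.items():
    print(name, value)
```

Calling the same helper for the test and validation sets would avoid repeating the metric code in each evaluation cell.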

### Key Questions to Address:
- Does **scaling the data** improve model performance?
- Does **hyperparameter tuning** enhance the performance of SVM?

In [86]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

In [88]:
# Load the dataset
data = pd.read_csv('cancer_data.csv')

In [90]:
# Step 1: Remove the ID column
#data.drop(columns=['ID'], inplace=True)

In [92]:
# Step 2: Encode the Diagnosis column (M = 1, B = 0)
label_encoder = LabelEncoder()
data['diagnosis'] = label_encoder.fit_transform(data['diagnosis'])
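
`LabelEncoder` assigns integer codes in sorted class order, which is why `B` becomes 0 and `M` becomes 1. A minimal check on toy labels (the `labels` series is made up for illustration), plus an explicit mapping as a possible alternative that does not depend on sort order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the diagnosis column (assumed values)
labels = pd.Series(["M", "B", "B", "M"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the index is the code
print(list(encoder.classes_))
print(list(encoded))

# Alternative: spell the mapping out so the intent is visible in the code
explicit = labels.map({"B": 0, "M": 1})
print(explicit.tolist())
```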

In [94]:
# Step 3: Choose Diagnosis as the target variable
X = data.drop(columns=['diagnosis'])  # Features
y = data['diagnosis']  # Target

### Performance of Support Vector Machine without data scaling

In [97]:
# Step 5: Split the data into train (80%), test (15%), and validation (5%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 * 20% = 5%
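
The two-stage split can be sanity-checked on placeholder data. In the sketch below the row count of 569 is an assumption, chosen so that the split reproduces the 85 test and 29 validation samples that appear in the confusion matrices of this notebook; `X_demo` and `y_demo` are dummy arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 569  # assumed row count, see the note above
X_demo = np.arange(n_samples).reshape(-1, 1)  # dummy feature column
y_demo = np.zeros(n_samples, dtype=int)       # dummy labels

# Stage 1: hold out 20% of the rows
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Stage 2: 25% of the held-out rows become validation, the rest test
X_te, X_va, y_te, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42)

sizes = (len(X_tr), len(X_te), len(X_va))
print("train/test/validation sizes:", sizes)
```

Passing `stratify=y` to `train_test_split` is an option worth considering here, since it keeps the malignant/benign ratio similar across the three subsets.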

In [99]:
# Step 6: Develop SVM model (RBF Kernel)
svm_model = SVC(kernel='rbf', probability=True, random_state=42)

In [101]:
# Step 7: Evaluate the model using 10-fold cross-validation on the training set
cv_scores = cross_val_score(svm_model, X_train, y_train, cv=10, scoring='accuracy')
print("10-Fold Cross-Validation Accuracy Scores:", cv_scores)
print("Mean Cross-Validation Accuracy:", np.mean(cv_scores))

10-Fold Cross-Validation Accuracy Scores: [0.63043478 0.63043478 0.63043478 0.60869565 0.58695652 0.64444444
 0.62222222 0.6        0.62222222 0.62222222]
Mean Cross-Validation Accuracy: 0.6198067632850242


In [103]:
# Step 8: Train the model on the full training set
svm_model.fit(X_train, y_train)

In [105]:
# Step 9: Evaluate the model on the test set
y_test_pred = svm_model.predict(X_test)
y_test_pred_prob = svm_model.predict_proba(X_test)[:, 1]  # Needed for ROC-AUC
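
`roc_auc_score` only needs a continuous score per sample, so the signed distances from `decision_function` can be used instead of probabilities. The sketch below uses synthetic data from `make_classification` (an assumption, not the notebook's dataset) and an `SVC` created without `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem used only for illustration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# No probability=True: the extra internal calibration step is skipped
clf = SVC(kernel="rbf").fit(X_tr, y_tr)

# Signed distance to the decision boundary, one value per test sample
scores = clf.decision_function(X_te)
auc = roc_auc_score(y_te, scores)
print(f"ROC-AUC from decision_function: {auc:.4f}")
```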

In [107]:
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_roc_auc = roc_auc_score(y_test, y_test_pred_prob)

print("\nTest Set Performance:")
print(f"Accuracy: {test_accuracy:.4f}")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")
print(f"F1-Score: {test_f1:.4f}")
print(f"ROC-AUC: {test_roc_auc:.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))


Test Set Performance:
Accuracy: 0.6000
Precision: 0.0000
Recall: 0.0000
F1-Score: 0.0000
ROC-AUC: 0.3760
Confusion Matrix:
 [[51  0]
 [34  0]]




In [109]:
# Step 10: Predict on the unseen validation set
y_val_pred = svm_model.predict(X_val)
y_val_pred_prob = svm_model.predict_proba(X_val)[:, 1]

val_accuracy = accuracy_score(y_val, y_val_pred)
val_precision = precision_score(y_val, y_val_pred)
val_recall = recall_score(y_val, y_val_pred)
val_f1 = f1_score(y_val, y_val_pred)
val_roc_auc = roc_auc_score(y_val, y_val_pred_prob)

print("\nValidation Set Performance:")
print(f"Accuracy: {val_accuracy:.4f}")
print(f"Precision: {val_precision:.4f}")
print(f"Recall: {val_recall:.4f}")
print(f"F1-Score: {val_f1:.4f}")
print(f"ROC-AUC: {val_roc_auc:.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred))


Validation Set Performance:
Accuracy: 0.6897
Precision: 0.0000
Recall: 0.0000
F1-Score: 0.0000
ROC-AUC: 0.3944
Confusion Matrix:
 [[20  0]
 [ 9  0]]




### Results Without Scaling
- **10-Fold CV Accuracy**: `[0.630, 0.630, 0.630, 0.609, 0.587, 0.644, 0.622, 0.600, 0.622, 0.622]` (Mean: **61.98%**).  
- **Test Set**: Accuracy: **60.00%**, Precision: **0.00%**, Recall: **0.00%**, F1-Score: **0.00%**, ROC-AUC: **37.60%**. Confusion Matrix: `[[51 0] [34 0]]`.  
- **Validation Set**: Accuracy: **68.97%**, Precision: **0.00%**, Recall: **0.00%**, F1-Score: **0.00%**, ROC-AUC: **39.44%**. Confusion Matrix: `[[20 0] [9 0]]`.  

### Key Issues:
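- The model predicts **all samples as benign** (no positive predictions). Recall and F1-score are therefore 0, and precision is undefined and reported as 0.
- Accuracy only matches the share of benign samples (51/85 = 60% on the test set, 20/29 ≈ 69% on the validation set), and a ROC-AUC below 0.5 means the scores rank samples worse than chance.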
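
To put the 60% test accuracy in context, it helps to compare it with a majority-class baseline. The sketch below builds labels with the same class counts as the test confusion matrix (51 benign, 34 malignant) and scores scikit-learn's `DummyClassifier`; the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Same class counts as the test confusion matrix: 51 benign, 34 malignant
y_demo = np.array([0] * 51 + [1] * 34)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, ignored by the model

# Always predicts the most frequent class (benign)
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(f"Majority-class accuracy: {baseline_acc:.4f}")
```

A model that cannot beat this baseline has not learned anything useful from the features.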

### Reason:
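- **Lack of scaling**: the RBF kernel is based on distances between samples, so features with large numeric ranges dominate and the remaining features are effectively ignored. With mildly imbalanced classes, the model then falls back to predicting the majority (benign) class. Note that the `ID` drop in Step 1 is commented out; if an identifier column is still present in `X`, its large values would make this effect worse.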
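
The effect can be reproduced on synthetic data. This is a sketch under stated assumptions: the data comes from `make_classification`, and one uninformative column with values up to about 1e6 is appended to mimic a feature on a much larger scale. None of it is the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Five informative features on a small scale
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=5, n_redundant=0,
    random_state=0)

# Append one pure-noise feature on a huge scale (assumed, for illustration)
rng = np.random.default_rng(0)
noise = rng.uniform(0, 1e6, size=(X_demo.shape[0], 1))
X_demo = np.hstack([X_demo, noise])

# Unscaled: kernel distances are dominated by the noise column
unscaled_acc = cross_val_score(SVC(kernel="rbf"), X_demo, y_demo, cv=5).mean()

# Scaled: every feature contributes on a comparable scale
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_acc = cross_val_score(scaled_svm, X_demo, y_demo, cv=5).mean()

print(f"Unscaled CV accuracy: {unscaled_acc:.3f}")
print(f"Scaled CV accuracy:   {scaled_acc:.3f}")
```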

### Solution:
- I will apply **feature scaling** (e.g., `StandardScaler`) and address **class imbalance** to improve performance.  

---

### Next Steps:
1. **Apply Scaling**:
   - I will use `StandardScaler` to standardize the features. If performance does not improve, I will try `MinMaxScaler` next.

2. **Address Class Imbalance**:
   - If scaling alone does not improve performance, I will apply class-imbalance techniques such as **oversampling**, **undersampling**, or **class weights**.

3. **Re-evaluate the Model**:
   - Train and test the model again after scaling and balancing the data to observe improvements in performance.
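
Of the imbalance options in step 2, class weights need the least extra code. The sketch below shows how scikit-learn's `balanced` heuristic turns class counts into weights. The labels are a toy 6:4 example, not the notebook's data, and `weighted_svm` is a hypothetical variable name.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 6 benign (0) and 4 malignant (1), assumed for illustration
y_demo = np.array([0] * 6 + [1] * 4)

# 'balanced' uses n_samples / (n_classes * count_per_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights)))

# The same heuristic can be requested directly when building the model
weighted_svm = SVC(kernel="rbf", class_weight="balanced", random_state=42)
```

The minority class gets the larger weight, so misclassifying a malignant sample costs more during training.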

### Performance of Support Vector Machine with data scaling

In [112]:
# Step 4: Standardize the features (important for SVM)
scaler = StandardScaler()
T_scaled = scaler.fit_transform(X)
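
One caveat with this step: the scaler is fitted on all rows before the split, so the means and standard deviations also reflect the test and validation samples. A stricter variant, sketched here on random placeholder data with hypothetical names (`scaler_demo`, `X_tr_s`, and so on), fits the scaler on the training subset only and reuses it for the other two.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random placeholder data (assumed), 200 rows and 3 features
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, using the same 80/15/5 scheme as the notebook
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
X_te, X_va, y_te, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42)

# Learn mean and standard deviation from the training rows only
scaler_demo = StandardScaler().fit(X_tr)

# Apply the training statistics to every subset
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)
X_va_s = scaler_demo.transform(X_va)

print("Rows seen by the scaler:", scaler_demo.n_samples_seen_)
```

Wrapping `StandardScaler` and `SVC` in a scikit-learn `Pipeline` gives the same guarantee inside `cross_val_score`, because the scaler is then refitted on the training part of each fold.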

In [114]:
# Step 5: Split the data into train (80%), test (15%), and validation (5%)
X_train, X_temp, y_train, y_temp = train_test_split(T_scaled, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 * 20% = 5%

In [116]:
# Step 6: Develop SVM model (RBF Kernel)
svm_model = SVC(kernel='rbf', probability=True, random_state=42)

In [118]:
# Step 7: Evaluate the model using 10-fold cross-validation on the training set
cv_scores = cross_val_score(svm_model, X_train, y_train, cv=10, scoring='accuracy')
print("10-Fold Cross-Validation Accuracy Scores:", cv_scores)
print("Mean Cross-Validation Accuracy:", np.mean(cv_scores))

10-Fold Cross-Validation Accuracy Scores: [0.97826087 0.97826087 0.97826087 0.95652174 0.97826087 1.
 1.         0.97777778 0.97777778 0.88888889]
Mean Cross-Validation Accuracy: 0.971400966183575


In [120]:
# Step 8: Train the model on the full training set
svm_model.fit(X_train, y_train)

In [122]:
# Step 9: Evaluate the model on the test set
y_test_pred = svm_model.predict(X_test)
y_test_pred_prob = svm_model.predict_proba(X_test)[:, 1]  # Needed for ROC-AUC

In [124]:
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_roc_auc = roc_auc_score(y_test, y_test_pred_prob)

In [126]:
print("\nTest Set Performance:")
print(f"Accuracy: {test_accuracy:.4f}")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")
print(f"F1-Score: {test_f1:.4f}")
print(f"ROC-AUC: {test_roc_auc:.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))


Test Set Performance:
Accuracy: 0.9765
Precision: 1.0000
Recall: 0.9412
F1-Score: 0.9697
ROC-AUC: 0.9965
Confusion Matrix:
 [[51  0]
 [ 2 32]]


In [128]:
# Step 10: Predict on the unseen validation set
y_val_pred = svm_model.predict(X_val)
y_val_pred_prob = svm_model.predict_proba(X_val)[:, 1]

val_accuracy = accuracy_score(y_val, y_val_pred)
val_precision = precision_score(y_val, y_val_pred)
val_recall = recall_score(y_val, y_val_pred)
val_f1 = f1_score(y_val, y_val_pred)
val_roc_auc = roc_auc_score(y_val, y_val_pred_prob)

print("\nValidation Set Performance:")
print(f"Accuracy: {val_accuracy:.4f}")
print(f"Precision: {val_precision:.4f}")
print(f"Recall: {val_recall:.4f}")
print(f"F1-Score: {val_f1:.4f}")
print(f"ROC-AUC: {val_roc_auc:.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred))


Validation Set Performance:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-Score: 1.0000
ROC-AUC: 1.0000
Confusion Matrix:
 [[20  0]
 [ 0  9]]


### Results Summary
Using **StandardScaler** substantially improves SVM performance. Key results:  
- **10-Fold CV Accuracy**: `[0.978, 0.978, 0.978, 0.957, 0.978, 1.0, 1.0, 0.978, 0.978, 0.889]` (Mean: **97.14%**).  
- **Test Set**: Accuracy: **97.65%**, Precision: **100%**, Recall: **94.12%**, F1-Score: **96.97%**, ROC-AUC: **99.65%**. Confusion Matrix: `[[51 0] [2 32]]`.  
- **Validation Set**: Accuracy: **100%**, Precision: **100%**, Recall: **100%**, F1-Score: **100%**, ROC-AUC: **100%**. Confusion Matrix: `[[20 0] [0 9]]`.  
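
**Key Takeaways**: Standardizing the features lifts the RBF SVM from majority-class guessing (about 62% mean CV accuracy) to about 97% mean CV accuracy. With proper preprocessing, SVM performs strongly on this breast cancer dataset, although the perfect validation scores rest on only 29 samples and should be read with caution.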

### SVM Performance with Hyperparameter Tuning

In [143]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

In [145]:
# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto', 0.1, 1, 10],
    'degree': [2, 3, 4]  # Only for 'poly' kernel
}


In [147]:
# Initialize SVM
svm = SVC()

In [149]:
# Perform Grid Search
grid_search = GridSearchCV(estimator=svm, param_grid=param_grid, scoring='accuracy', cv=10, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 180 candidates, totalling 1800 fits
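
The 180 candidates include many redundant fits, because `gamma` is ignored by the linear kernel and `degree` is only used by the polynomial kernel. `GridSearchCV` also accepts a list of dictionaries, so each kernel can get only the parameters it uses. The sketch below counts candidates with `ParameterGrid`; `per_kernel_grid` is a hypothetical name and the values are copied from the grid above.

```python
from sklearn.model_selection import ParameterGrid

# Original grid: every combination, including ones a kernel ignores
full_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto', 0.1, 1, 10],
    'degree': [2, 3, 4],
}

# One sub-grid per kernel, listing only the parameters that kernel uses
per_kernel_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100],
     'gamma': ['scale', 'auto', 0.1, 1, 10]},
    {'kernel': ['poly'], 'C': [0.1, 1, 10, 100],
     'gamma': ['scale', 'auto', 0.1, 1, 10], 'degree': [2, 3, 4]},
]

n_full = len(ParameterGrid(full_grid))
n_reduced = len(ParameterGrid(per_kernel_grid))
print(f"Full grid: {n_full} candidates, per-kernel grid: {n_reduced} candidates")
```

Passing `per_kernel_grid` as `param_grid` would search the same distinct models with fewer than half the fits.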


In [151]:
# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

Best Parameters: {'C': 1, 'degree': 2, 'gamma': 'scale', 'kernel': 'rbf'}
Best Cross-Validation Accuracy: 0.971400966183575
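
The selected parameters appear to coincide with the defaults of `SVC` in recent scikit-learn versions (where `gamma` defaults to `'scale'`), which would explain why the best score equals the untuned result. The check below copies the best parameters from the output above; `degree` is excluded because the RBF kernel does not use it.

```python
from sklearn.svm import SVC

# Default hyperparameters of an untuned SVC
defaults = SVC().get_params()

# Best parameters as printed by the grid search above
best_params = {'C': 1, 'degree': 2, 'gamma': 'scale', 'kernel': 'rbf'}

# degree only affects the 'poly' kernel, so it is irrelevant for 'rbf'
effective = {k: v for k, v in best_params.items() if k != 'degree'}

matches_defaults = all(defaults[k] == v for k, v in effective.items())
print("Best parameters equal the SVC defaults:", matches_defaults)
```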


After hyperparameter tuning of the Support Vector Machine (SVM) model, we observed **no improvement** in cross-validation performance on this dataset: the best cross-validation accuracy (0.9714) is identical to the mean cross-validation accuracy of the untuned SVM on scaled data. Below are the details: