# Breast Cancer Classification using SVM

## Project Objective
Classify tumors as malignant or benign using the scikit-learn Breast Cancer Wisconsin dataset, with appropriate preprocessing, model tuning, evaluation, and a saved model that is ready for deployment.

## 1. Data Exploration & Understanding

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Display info
print(f"Features: {data.feature_names}")
print(f"Target Classes: {data.target_names}")
print(f"Dataset Shape: {df.shape}")
df.head()

In [None]:
# Class Balance
sns.countplot(x=df['target'])
plt.title('Class Distribution (0=Malignant, 1=Benign)')
plt.show()

# Summary Statistics
df.describe()

## 2. Data Preprocessing
Scaling is critical for SVMs because both the margin and the RBF kernel are computed from distances between samples. A feature with a much larger range than the others dominates those distances, so the smaller-range features have almost no effect on the decision boundary.
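
As a rough illustration of that effect, the sketch below uses three made-up samples (not taken from the dataset) with one large-range and one small-range feature. It measures how much of the squared distance between two samples comes from the small-range feature, before and after `StandardScaler`. The names `X_demo` and `small_feature_share` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 has a large range, column 1 a small one.
X_demo = np.array([
    [1000.0, 0.10],
    [1010.0, 0.20],
    [2000.0, 0.10],
])

def small_feature_share(X, i, j):
    """Fraction of the squared Euclidean distance between rows i and j
    that comes from the second (small-range) column."""
    sq_diff = (X[i] - X[j]) ** 2
    return float(sq_diff[1] / sq_diff.sum())

# Before scaling, the small-range column barely contributes.
small_share_raw = small_feature_share(X_demo, 0, 1)

# After standardization, both columns are on a comparable scale.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
small_share_scaled = small_feature_share(X_demo_scaled, 0, 1)

print(f"Share from small-range feature (raw):    {small_share_raw:.4f}")
print(f"Share from small-range feature (scaled): {small_share_scaled:.4f}")
```

Rows 0 and 1 differ mostly in the small-range column relative to its spread, yet the raw distance is driven almost entirely by the large-range column.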

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## 3. Baseline Model (Linear SVM)

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_scaled, y_train)
y_pred_linear = svm_linear.predict(X_test_scaled)

print("Linear SVM Results:")
print(classification_report(y_test, y_pred_linear))

## 4. Advanced Model (RBF SVM) & 5. Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

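# probability=True enables predict_proba (needed for the ROC curve later), at the cost of slower fitting
grid = GridSearchCV(SVC(probability=True, random_state=42), param_grid, refit=True, verbose=2, cv=5)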
grid.fit(X_train_scaled, y_train)

print(f"Best Params: {grid.best_params_}")
print(f"Best Score: {grid.best_score_}")
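
One caveat with searching on `X_train_scaled`: the scaler was fit on the full training split, so every validation fold has already influenced the scaling statistics. A possible variant, sketched here with the same split and grid, puts the scaler and the SVC in a `Pipeline` so that scaling is refit inside each fold. The names `pipe`, `pipe_param_grid`, and `pipe_grid` are illustrative, and `probability=True` is left out only to keep the search fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler and classifier travel together, so each CV fold fits its own scaler.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", random_state=42)),
])

# Step-prefixed names route each parameter to the "svc" step.
pipe_param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

pipe_grid = GridSearchCV(pipe, pipe_param_grid, cv=5)
pipe_grid.fit(X_train, y_train)  # raw features; scaling happens inside the pipeline

print(f"Best Params: {pipe_grid.best_params_}")
print(f"Best CV Score: {pipe_grid.best_score_:.3f}")
print(f"Test Accuracy: {pipe_grid.score(X_test, y_test):.3f}")
```

With this layout, `pipe_grid.best_estimator_` is already a complete pipeline that accepts unscaled inputs.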

## 6. Model Evaluation
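Missed cancers (predicting benign when the tumor is malignant) are the most costly errors in medical diagnosis because they can delay treatment. Note the label convention: in this dataset malignant is label 0 and benign is label 1, and scikit-learn treats label 1 as the positive class by default. The figure to watch is therefore the recall of class 0 in the classification report, which is the share of malignant tumors the model caught.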
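
To make the label convention concrete, here is a minimal sketch that counts missed malignant cases directly and computes malignant recall with `pos_label=0`. It trains a stand-in RBF SVM with default `C` and `gamma` (an assumption made only so the example runs on its own), and the names `model`, `missed_malignant`, and `false_alarms` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Stand-in model with default hyperparameters, for illustration only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# labels=[0, 1] fixes the order: index 0 = malignant, index 1 = benign.
# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
missed_malignant = cm[0, 1]  # true malignant, predicted benign
false_alarms = cm[1, 0]      # true benign, predicted malignant

# Recall with malignant (label 0) as the positive class.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

print(f"Missed malignant cases: {missed_malignant}")
print(f"False alarms: {false_alarms}")
print(f"Malignant recall: {malignant_recall:.3f}")
```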

In [None]:
from sklearn.metrics import confusion_matrix

best_model = grid.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))

## 7. ROC Curve & AUC

In [None]:
from sklearn.metrics import roc_curve, auc

y_prob = best_model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
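
The curve above treats label 1 (benign) as the positive class and relies on `predict_proba`, which needs `probability=True`. An alternative, sketched here on a stand-in model with default hyperparameters, ranks samples with `decision_function` instead and also reports the AUC with malignant as the positive class. The names `roc_model`, `auc_benign`, and `auc_malignant` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# No probability=True here: ROC only needs a ranking score.
roc_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
roc_model.fit(X_train, y_train)

# For a binary SVC, larger decision_function values favor classes_[1] (benign).
benign_scores = roc_model.decision_function(X_test)
auc_benign = roc_auc_score(y_test, benign_scores)

# Flip the labels and negate the scores to make malignant the positive class.
auc_malignant = roc_auc_score(y_test == 0, -benign_scores)

print(f"AUC (benign as positive):    {auc_benign:.3f}")
print(f"AUC (malignant as positive): {auc_malignant:.3f}")
```

The two AUC values coincide for a binary problem; what changes with the positive class is how the axes of the curve are read, since the true positive rate becomes the malignant detection rate.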

## 8. Model Saving

In [None]:
import joblib
from sklearn.pipeline import Pipeline

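# Both steps are already fitted, so the pipeline can be saved without calling fit again.
# Bundling the scaler means the saved model accepts raw (unscaled) features.
pipeline = Pipeline([
    ('scaler', scaler),
    ('svc', best_model)
])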

joblib.dump(pipeline, 'svm_breast_cancer_model.pkl')
print("Model saved!")
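
For completeness, a saved pipeline of this kind could be reloaded and used roughly as follows. This is a sketch that fits its own stand-in pipeline (default hyperparameters, trained on the full dataset) and writes it to a temporary directory so that it runs on its own; `demo_pipeline`, `model_path`, and `loaded` are hypothetical names.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()

# Stand-in for the tuned pipeline; hyperparameters here are placeholders.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", random_state=42)),
])
demo_pipeline.fit(data.data, data.target)

# Save and reload from a temporary location.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svm_breast_cancer_model.pkl")
    joblib.dump(demo_pipeline, model_path)
    loaded = joblib.load(model_path)

# Raw, unscaled features go straight in; the pipeline applies the scaler.
sample = data.data[:5]
pred = loaded.predict(sample)
labels = data.target_names[pred]  # maps 0/1 to "malignant"/"benign"

print(list(labels))
```

Because joblib files are pickles, they should only be loaded from trusted sources and ideally with the same scikit-learn version that wrote them.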

## Bonus: Interview Explanations
**Why SVM?** SVMs are effective in high-dimensional spaces (30 features here) and can still work when the number of dimensions exceeds the number of samples (not the case here, but true in general).  
**Linear vs RBF:** A linear kernel fits a flat separating hyperplane, which works well when the classes are (close to) linearly separable. The RBF kernel implicitly maps the data to a higher-dimensional space, where a linear separator corresponds to a non-linear boundary in the original features.  
**C & Gamma:** C controls the misclassification penalty (high C = low bias, high variance). Gamma sets how far the influence of a single training example reaches (high gamma = short reach, more complex boundary).
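
The "reach" that gamma controls can be read off the RBF kernel, `exp(-gamma * ||a - b||^2)`. The sketch below evaluates it by hand for two made-up points at a fixed distance and checks the result against scikit-learn's `rbf_kernel`. The helper `rbf_similarity` and the sample points are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_similarity(a, b, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

# Two made-up points with squared distance 1^2 + 2^2 = 5.
a = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])

gammas = [0.001, 0.1, 1.0]
similarities = [rbf_similarity(a, b, g) for g in gammas]

for g, s in zip(gammas, similarities):
    # Cross-check against scikit-learn's implementation.
    reference = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=g)[0, 0]
    print(f"gamma={g}: similarity={s:.4f} (sklearn: {reference:.4f})")
```

As gamma grows, the similarity between the same two points falls toward zero, so each training example only influences predictions in its immediate neighborhood and the boundary can bend more tightly around individual samples.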