# 🩺 Cancer Diagnosis Prediction

This project trains and compares several machine learning models for predicting **breast cancer diagnosis (Benign vs. Malignant)** from a clinical dataset.  
The dataset includes demographic, tumor, and clinical features such as age, menopause, tumor size, breast quadrant, and history.

---

## 📂 Dataset
- Source: `breast-cancer-dataset.csv` (read in the notebook from `/kaggle/input/breast-cancer-prediction/`)
- Shape: **213 samples × 11 columns**
- Target: `Diagnosis Result` (Benign / Malignant)
- Example columns:
  - `Age`
  - `Menopause`
  - `Tumor Size (cm)`
  - `Breast Quadrant`
  - `History`
  - `Diagnosis Result` (target)

---

## ⚙️ Preprocessing
1. **Missing values** handled:
   - Numeric → median imputation
   - Categorical → most frequent imputation
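2. **Standardization**: StandardScaler on numeric features (in the recorded run only `Age` and `Menopause` were detected as numeric)
3. **One-Hot Encoding**: categorical features converted to binary features (in the recorded run this also covered `Year`, `Tumor Size (cm)`, `Inv-Nodes`, `Metastasis`, and `History`, which were read as text columns)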
4. **Train/Validation/Test split**:  
   - Train: 64%  
   - Validation: 16%  
   - Test: 20%
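
The 64/16/20 proportions follow from two successive 80/20 splits: 20% of all rows are held out for testing, and 20% of the remaining 80% (16% of the total) are held out for validation. A minimal sketch on placeholder data with the same 213 rows; the features and labels here are synthetic stand-ins (an assumption for illustration), not the real records:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 213
X_demo = np.arange(n_samples).reshape(-1, 1)  # placeholder features
y_demo = np.arange(n_samples) % 2             # synthetic labels, illustration only

# Stage 1: hold out 20% of all rows for testing.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# Stage 2: hold out 20% of the remaining rows for validation.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42, stratify=y_rest
)

split_sizes = {
    "train": len(X_train_demo),
    "val": len(X_val_demo),
    "test": len(X_test_demo),
}
for name, size in split_sizes.items():
    print(f"{name}: {size} rows ({size / n_samples:.1%})")
```

The sizes come out to 136, 34, and 43 rows, which is why the percentages are approximate rather than exact.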

---

## 🤖 Models Trained
The following models were benchmarked:

- Logistic Regression
- Random Forest
- Support Vector Machine (RBF Kernel)
- XGBoost
- K-Nearest Neighbors (KNN)
- Naive Bayes
- Gradient Boosting (sklearn)

---

## 📊 Results

| Model               | Validation Accuracy | Test Accuracy |
|----------------------|---------------------|---------------|
| Logistic Regression  | 0.9706              | **0.9070**    |
| Random Forest        | 0.9706              | 0.8837        |
| SVM (RBF Kernel)     | 0.9706              | **0.9070**    |
| XGBoost              | 0.8824              | 0.7907        |
| KNN                  | 0.9412              | **0.9070**    |
| Naive Bayes          | 0.9706              | **0.9070**    |
| Gradient Boosting    | 0.9118              | 0.7907        |

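✅ **Best Model:** Logistic Regression (Validation Accuracy = **97.06%**). Random Forest, SVM, and Naive Bayes tie at the same validation accuracy; Logistic Regression is kept because it is trained first and a tie does not replace the current best.  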
🎉 Saved as `best_model.pkl`
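
With only 34 validation and 43 test samples, every accuracy in the table corresponds to a small whole number of correct predictions, so a single sample moves validation accuracy by about 2.9 points. A short sketch that converts the reported accuracies back to counts; the split sizes are taken from the notebook output, and the helper name `correct_count` is hypothetical:

```python
N_VAL, N_TEST = 34, 43  # split sizes reported by the notebook

# (validation accuracy, test accuracy) as listed in the results table
reported = {
    "Logistic Regression": (0.9706, 0.9070),
    "Random Forest": (0.9706, 0.8837),
    "SVM (RBF Kernel)": (0.9706, 0.9070),
    "XGBoost": (0.8824, 0.7907),
    "KNN": (0.9412, 0.9070),
    "Naive Bayes": (0.9706, 0.9070),
    "Gradient Boosting": (0.9118, 0.7907),
}


def correct_count(accuracy, n_samples):
    """Recover the number of correct predictions from a rounded accuracy."""
    return round(accuracy * n_samples)


counts = {
    name: (correct_count(val, N_VAL), correct_count(test, N_TEST))
    for name, (val, test) in reported.items()
}
for name, (val_ok, test_ok) in counts.items():
    print(f"{name}: {val_ok}/{N_VAL} validation, {test_ok}/{N_TEST} test")
```

The four models at 0.9706 each miss exactly one validation sample, so this validation set cannot separate them.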

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [4]:
df = pd.read_csv("/kaggle/input/breast-cancer-prediction/breast-cancer-dataset.csv")  # adjust the path to your copy of the dataset
print(df.head())


   S/N  Year  Age  Menopause Tumor Size (cm) Inv-Nodes Breast Metastasis  \
0    1  2019   40          1               2         0  Right          0   
1    2  2019   39          1               2         0   Left          0   
2    3  2019   45          0               4         0   Left          0   
3    4  2019   26          1               3         0   Left          0   
4    5  2019   21          1               1         0  Right          0   

  Breast Quadrant History Diagnosis Result  
0     Upper inner       0           Benign  
1     Upper outer       0           Benign  
2     Lower outer       0           Benign  
3     Lower inner       1           Benign  
4     Upper outer       1           Benign  


In [5]:
# Quick check
print("Dataset shape:", df.shape)
print("\nColumns:\n", df.columns)

Dataset shape: (213, 11)

Columns:
 Index(['S/N', 'Year', 'Age', 'Menopause', 'Tumor Size (cm)', 'Inv-Nodes',
       'Breast', 'Metastasis', 'Breast Quadrant', 'History',
       'Diagnosis Result'],
      dtype='object')


In [6]:
# 2. Define target and features
target = "Diagnosis Result"
X = df.drop(columns=[target, "S/N"])  # drop target + serial number
y = df[target].map({"Benign": 0, "Malignant": 1})  # convert to numeric

In [7]:
# 3. Identify categorical & numeric features
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

print("\nCategorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)


Categorical columns: ['Year', 'Tumor Size (cm)', 'Inv-Nodes', 'Breast', 'Metastasis', 'Breast Quadrant', 'History']
Numeric columns: ['Age', 'Menopause']
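
Most measurement columns land in the categorical list because they contain a non-numeric placeholder; the `Year_#` and `Tumor Size (cm)_#` columns in the encoded output later in this notebook suggest that the placeholder is `#`. One possible way to handle this, assuming `#` marks a missing value (an assumption, not something confirmed here), is to coerce those columns to numbers before detecting dtypes. The sketch uses a tiny hand-made frame rather than the real file, and `numeric_like` is a hypothetical list chosen for illustration:

```python
import pandas as pd

# Tiny stand-in frame; the values are made up for illustration.
demo = pd.DataFrame({
    "Age": [40, 39, 45],
    "Tumor Size (cm)": ["2", "#", "4"],
    "Inv-Nodes": ["0", "1", "#"],
    "Breast": ["Right", "Left", "#"],
})

# Columns assumed to hold numbers apart from the '#' placeholder.
numeric_like = ["Tumor Size (cm)", "Inv-Nodes"]

cleaned = demo.copy()
for col in numeric_like:
    # errors="coerce" turns anything unparseable (here '#') into NaN.
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")

# Keep text columns as text, but mark the placeholder as missing.
cleaned["Breast"] = cleaned["Breast"].mask(cleaned["Breast"] == "#")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```

After a step like this, `select_dtypes` would place the coerced columns in the numeric group, so median imputation and scaling would apply to them instead of one-hot encoding.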


In [8]:
# 4. Train/validation/test split
# First split train/test (80/20), then split the remaining train set into train/val (80/20),
# which gives roughly 64% / 16% / 20% of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

print("\nDataset split:")
print("Train:", X_train.shape, "Validation:", X_val.shape, "Test:", X_test.shape)



Dataset split:
Train: (136, 9) Validation: (34, 9) Test: (43, 9)


In [9]:
# 5. Preprocessing
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # handle missing
    ("scaler", StandardScaler())  # normalize
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))  # dense output
])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols)
    ]
)

In [10]:
# Inspection only: fit on the full dataset to preview the encoded columns.
# The preprocessor is refit on X_train alone in the next cell, before any modeling.
X_preprocessed = preprocessor.fit_transform(X)

# Get feature names
# Numeric feature names stay the same
numeric_features = numeric_cols

# Get the new one-hot encoded feature names
categorical_features = preprocessor.named_transformers_["cat"]["onehot"].get_feature_names_out(categorical_cols)

# Combine all feature names
all_features = list(numeric_features) + list(categorical_features)

# Create a DataFrame with processed data
X_encoded_df = pd.DataFrame(X_preprocessed, columns=all_features)

# Display the first 5 rows
print(X_encoded_df.head())


        Age  Menopause  Year_#  Year_2019  Year_2020  Tumor Size (cm)_#  \
0  0.015356   0.707107     0.0        1.0        0.0                0.0   
1 -0.055749   0.707107     0.0        1.0        0.0                0.0   
2  0.370884  -1.414214     0.0        1.0        0.0                0.0   
3 -0.980123   0.707107     0.0        1.0        0.0                0.0   
4 -1.335651   0.707107     0.0        1.0        0.0                0.0   

   Tumor Size (cm)_1  Tumor Size (cm)_10  Tumor Size (cm)_12  \
0                0.0                 0.0                 0.0   
1                0.0                 0.0                 0.0   
2                0.0                 0.0                 0.0   
3                0.0                 0.0                 0.0   
4                1.0                 0.0                 0.0   

   Tumor Size (cm)_14  ...  Metastasis_1  Breast Quadrant_#  \
0                 0.0  ...           0.0                0.0   
1                 0.0  ...           0

In [11]:
# 6. Fit the preprocessor on the training split only, then apply it to validation and test
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_val_preprocessed = preprocessor.transform(X_val)
X_test_preprocessed = preprocessor.transform(X_test)

print("\nFinal processed feature matrix shapes:")
print("Train:", X_train_preprocessed.shape)
print("Validation:", X_val_preprocessed.shape)
print("Test:", X_test_preprocessed.shape)


Final processed feature matrix shapes:
Train: (136, 35)
Validation: (34, 35)
Test: (43, 35)


In [12]:
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Dictionary of models to train
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF Kernel)": SVC(kernel="rbf", probability=True, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),  # use_label_encoder dropped: deprecated in recent XGBoost releases
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

results = {}
best_model = None
best_val_acc = 0
best_model_name = ""

# =========================
# Training loop
# =========================
for name, model in models.items():
    print(f"\n🔹 Training {name}...")
    model.fit(X_train_preprocessed, y_train)

    # Validation performance
    y_val_pred = model.predict(X_val_preprocessed)
    val_acc = accuracy_score(y_val, y_val_pred)

    # Test performance
    y_test_pred = model.predict(X_test_preprocessed)
    test_acc = accuracy_score(y_test, y_test_pred)

    # Store results
    results[name] = {
        "Validation Accuracy": val_acc,
        "Test Accuracy": test_acc,
        "Classification Report (Test)": classification_report(y_test, y_test_pred, target_names=["Benign","Malignant"])
    }

    # Track best model
    if val_acc > best_val_acc:  # strict ">" keeps the earlier model when accuracies tie
        best_val_acc = val_acc
        best_model = model
        best_model_name = name

# =========================
# Results summary
# =========================
print("\n================= Model Comparison =================")
for name, metrics in results.items():
    print(f"\n{name}")
    print(f"Validation Accuracy: {metrics['Validation Accuracy']:.4f}")
    print(f"Test Accuracy: {metrics['Test Accuracy']:.4f}")
    print(metrics["Classification Report (Test)"])

print("\n✅ Best model:", best_model_name, f"(Validation Accuracy: {best_val_acc:.4f})")

# =========================
# Save best model
# =========================
joblib.dump(best_model, "best_model.pkl")
print("🎉 Best model saved as best_model.pkl")


🔹 Training Logistic Regression...

🔹 Training Random Forest...

🔹 Training SVM (RBF Kernel)...

🔹 Training XGBoost...

🔹 Training KNN...

🔹 Training Naive Bayes...

🔹 Training Gradient Boosting...


Logistic Regression
Validation Accuracy: 0.9706
Test Accuracy: 0.9070
              precision    recall  f1-score   support

      Benign       0.88      0.96      0.92        24
   Malignant       0.94      0.84      0.89        19

    accuracy                           0.91        43
   macro avg       0.91      0.90      0.90        43
weighted avg       0.91      0.91      0.91        43


Random Forest
Validation Accuracy: 0.9706
Test Accuracy: 0.8837
              precision    recall  f1-score   support

      Benign       0.85      0.96      0.90        24
   Malignant       0.94      0.79      0.86        19

    accuracy                           0.88        43
   macro avg       0.89      0.87      0.88        43
weighted avg       0.89      0.88      0.88        43


SVM (RBF K

In [13]:
import joblib
# Save the preprocessing pipeline
joblib.dump(preprocessor, "preprocessor.pkl")
print("🎉 Preprocessing pipeline saved as preprocessor.pkl")

🎉 Preprocessing pipeline saved as preprocessor.pkl
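
Because the model and the preprocessor are saved as separate files, both have to be loaded and applied in the same order at prediction time. A minimal sketch of that round trip, using a small synthetic frame and a temporary directory instead of the real artifacts; the toy columns, the toy values, and the `predict_diagnosis` helper are assumptions for illustration:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (made up) standing in for the real dataset.
toy_train = pd.DataFrame({
    "Age": [25, 32, 41, 47, 55, 63, 70, 38],
    "Breast": ["Left", "Right", "Left", "Right", "Left", "Right", "Left", "Right"],
})
toy_labels = [0, 0, 0, 1, 1, 1, 1, 0]  # 0 = Benign, 1 = Malignant

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Breast"]),
])
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_preprocessor.fit_transform(toy_train), toy_labels)


def predict_diagnosis(new_rows, preprocessor_path, model_path):
    """Load both artifacts and apply them in the order used for training."""
    loaded_preprocessor = joblib.load(preprocessor_path)
    loaded_model = joblib.load(model_path)
    features = loaded_preprocessor.transform(new_rows)  # transform only, never refit
    return loaded_model.predict(features)


with tempfile.TemporaryDirectory() as tmp:
    pre_path = os.path.join(tmp, "preprocessor.pkl")
    model_path = os.path.join(tmp, "best_model.pkl")
    joblib.dump(toy_preprocessor, pre_path)
    joblib.dump(toy_model, model_path)

    new_patients = pd.DataFrame({"Age": [30, 66], "Breast": ["Right", "Left"]})
    predictions = predict_diagnosis(new_patients, pre_path, model_path)

print(predictions)
```

An alternative design is to wrap the preprocessor and the classifier in a single `sklearn.pipeline.Pipeline` and save one file, which removes the risk of pairing mismatched artifacts.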
