
# Lab-Based Quiz 4  
# Time: 1 hour
## Comparing Individual Models vs Ensemble Methods  
### Dataset: Breast Cancer Wisconsin (Diagnostic)


## Objective

In this lab, you will:

1. Train and evaluate the following classifiers individually:
   - Decision Tree
   - Naive Bayes
   - K-Nearest Neighbors (KNN)
   - Logistic Regression
   - SGD Classifier

2. Build ensemble models using:
   - Hard Voting
   - Soft Voting
   - Bagging
   - Random Forest
   - AdaBoost

3. Compare all models using appropriate evaluation metrics.

4. Decide which metric best represents performance in this case study and justify your reasoning.

5. Write your own summary explaining your findings.

**Important**:
- Do NOT hard-code results.
- Do NOT use Gemini or any LLM model to do this for you.
- All answers must be written in your own words.
- Replace each TODO with working code.

# Part 1 – Load and Explore the Dataset

The Breast Cancer Wisconsin dataset is a binary classification dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)

df['target'] = breast_cancer.target

df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


## Basic Exploration

1. How many samples and features are there?
2. What are the target classes?
3. Is the dataset balanced?

In [2]:
print("Dataset shape:", df.shape)


print("\nClass distribution:")
print(df['target'].value_counts())


print("\nClass percentage:")
print(df['target'].value_counts(normalize=True) * 100)

Dataset shape: (569, 31)

Class distribution:
target
1    357
0    212
Name: count, dtype: int64

Class percentage:
target
1    62.741652
0    37.258348
Name: proportion, dtype: float64


### Write your answers below:

- Number of samples:
- Number of features:
- Class distribution:
- Is the dataset balanced?

Number of samples: 569

Number of features: 30

Class distribution: 357 benign (label 1), 212 malignant (label 0)

Is the dataset balanced?

The dataset is moderately imbalanced but not severely: about 62.7% of the samples are benign and 37.3% are malignant.
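To double-check which integer label corresponds to which diagnosis, the label names can be read from the dataset object itself. The snippet below is a minimal sketch that reloads the data so it runs on its own; the variable names `label_names` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the human-readable name of integer label i
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_names)

# Count how many samples carry each integer label
counts = np.bincount(data.target)
for label, count in enumerate(counts):
    print(f"{label} ({label_names[label]}): {count} samples")
```

Knowing the mapping matters later, because scikit-learn's precision, recall, and F1 functions treat label 1 as the positive class unless told otherwise.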

# Part 2 – Train/Test Split and Scaling

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('target', axis=1)
y = df['target']

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)


X_test_scaled = scaler.transform(X_test)

### In 2–3 sentences:
Why is scaling necessary for some models but not for others?

# Part 3 – Train Individual Models

For EACH model:

* Train

* Predict

* Compute:

  * Accuracy

  * Precision

  * Recall

  * F1-score

  * Confusion Matrix

Store results for later comparison

In [4]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

results = {}

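def evaluate_model(name, model, X_train, X_test):
    """Fit `model`, predict on X_test, and store its metrics in `results`.

    Relies on the global y_train and y_test defined in Part 2.
    Precision, recall, and F1 use scikit-learn's default positive class (label 1).
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1-score": f1_score(y_test, y_pred),
        "Confusion Matrix": confusion_matrix(y_test, y_pred)
    }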

In [5]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
evaluate_model("Decision Tree", dt, X_train, X_test)

In [6]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
evaluate_model("Naive Bayes", nb, X_train, X_test)

In [7]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
evaluate_model("KNN", knn, X_train_scaled, X_test_scaled)

In [8]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=5000)
evaluate_model("Logistic Regression", lr, X_train_scaled, X_test_scaled)

In [9]:
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(random_state=42)
evaluate_model("SGD", sgd, X_train_scaled, X_test_scaled)

Scaling is necessary for models like KNN, Logistic Regression, and SGD because they rely on distance calculations or gradient-based optimization. If features are not scaled, features with larger magnitudes dominate the distances or the gradient updates, regardless of how informative they are. Tree-based models like Decision Trees do not require scaling because they split on per-feature thresholds rather than on distances.
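The difference in feature magnitudes is easy to see in this dataset. The following sketch, which repeats the same split as Part 2 so that it is self-contained, compares two raw features and then checks what `StandardScaler` does to the training data. The choice of `mean area` and `mean smoothness` as the two example columns is only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Two raw features on very different scales
raw_means = X_train[["mean area", "mean smoothness"]].mean()
print(raw_means)

# Fit the scaler on the training data only, as in Part 2
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# After scaling, every training feature has mean ~0 and standard deviation ~1
scaled_means = X_train_scaled.mean(axis=0)
scaled_stds = X_train_scaled.std(axis=0)
print(np.round(scaled_means[:3], 6), np.round(scaled_stds[:3], 6))
```

In a Euclidean distance computed on the raw data, a difference in `mean area` would outweigh a difference in `mean smoothness` by several orders of magnitude, which is the effect scaling removes.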

# Part 4 – Voting Classifier

In [10]:
from sklearn.ensemble import VotingClassifier

# Hard Voting
hard_voting = VotingClassifier(
    estimators=[
        ('lr', lr),
        ('knn', knn),
        ('dt', dt)
    ],
    voting='hard'
)

evaluate_model("Hard Voting", hard_voting, X_train_scaled, X_test_scaled)

# Soft Voting
soft_voting = VotingClassifier(
    estimators=[
        ('lr', lr),
        ('knn', knn),
        ('nb', nb)
    ],
    voting='soft'
)

evaluate_model("Soft Voting", soft_voting, X_train_scaled, X_test_scaled)

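Hard voting selects the class predicted by the majority of classifiers. Soft voting averages the predicted probabilities from each classifier and chooses the class with the highest average probability, so every estimator must support `predict_proba`. Soft voting often performs better when the classifiers produce reliable probabilities, because it takes their confidence into account; in this run, however, both voting classifiers produced identical test results.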
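A small worked example makes the difference concrete. The probabilities below are made-up values for a single hypothetical sample, not outputs from the models in this notebook, and the code imitates the two voting rules by hand rather than calling `VotingClassifier`.

```python
import numpy as np

# Hypothetical P(class 1) for one sample from three classifiers
proba_class1 = np.array([0.45, 0.45, 0.95])

# Hard voting: each classifier casts one vote for its most likely class
votes = (proba_class1 >= 0.5).astype(int)
hard_prediction = int(np.bincount(votes, minlength=2).argmax())

# Soft voting: average the probabilities, then pick the most likely class
mean_proba_class1 = float(proba_class1.mean())
soft_prediction = int(mean_proba_class1 >= 0.5)

print("votes:", votes, "-> hard prediction:", hard_prediction)
print("mean P(class 1):", round(mean_proba_class1, 4), "-> soft prediction:", soft_prediction)
```

Two classifiers lean weakly toward class 0 and one is strongly confident in class 1, so the majority vote picks class 0 while the averaged probability picks class 1.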

# Part 5 – Bagging and Boosting

In [11]:
from sklearn.ensemble import BaggingClassifier

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)

evaluate_model("Bagging", bagging, X_train, X_test)

In [12]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
evaluate_model("Random Forest", rf, X_train, X_test)

In [13]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(random_state=42)
evaluate_model("AdaBoost", ada, X_train, X_test)

# Part 6 – Model Comparison

In [14]:
comparison_df = pd.DataFrame(results).T
comparison_df = comparison_df.drop(columns=["Confusion Matrix"])

comparison_df = comparison_df.sort_values(by="F1-score", ascending=False)

comparison_df

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Naive Bayes | 0.973684 | 0.959459 | 1.0 | 0.97931 |
| Logistic Regression | 0.973684 | 0.972222 | 0.985915 | 0.979021 |
| Hard Voting | 0.973684 | 0.972222 | 0.985915 | 0.979021 |
| Soft Voting | 0.973684 | 0.972222 | 0.985915 | 0.979021 |
| Random Forest | 0.964912 | 0.958904 | 0.985915 | 0.972222 |
| AdaBoost | 0.964912 | 0.958904 | 0.985915 | 0.972222 |
| SGD | 0.964912 | 0.985507 | 0.957746 | 0.971429 |
| Bagging | 0.95614 | 0.958333 | 0.971831 | 0.965035 |
| Decision Tree | 0.947368 | 0.957746 | 0.957746 | 0.957746 |
| KNN | 0.947368 | 0.957746 | 0.957746 | 0.957746 |


In [15]:
# Extra (not required by the quiz): print each model's confusion matrix.
# Rows are true labels and columns are predicted labels, in the order [0, 1].

for model_name, metrics in results.items():
    print(f"Confusion Matrix for {model_name}")
    print(metrics["Confusion Matrix"])
    print()

Confusion Matrix for Decision Tree
[[40  3]
 [ 3 68]]

Confusion Matrix for Naive Bayes
[[40  3]
 [ 0 71]]

Confusion Matrix for KNN
[[40  3]
 [ 3 68]]

Confusion Matrix for Logistic Regression
[[41  2]
 [ 1 70]]

Confusion Matrix for SGD
[[42  1]
 [ 3 68]]

Confusion Matrix for Hard Voting
[[41  2]
 [ 1 70]]

Confusion Matrix for Soft Voting
[[41  2]
 [ 1 70]]

Confusion Matrix for Bagging
[[40  3]
 [ 2 69]]

Confusion Matrix for Random Forest
[[40  3]
 [ 1 70]]

Confusion Matrix for AdaBoost
[[40  3]
 [ 1 70]]



# Part 7 – Metric Selection and Interpretation

Answer the following:

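1. Which model achieved the highest accuracy?

  ans - Naive Bayes, Logistic Regression, Hard Voting, and Soft Voting tied at 0.9737

2. Which model achieved the highest recall?

  ans - Naive Bayes (1.00, with label 1 as the positive class)

3. Which model achieved the highest F1-score?

  ans - Naive Bayes (0.9793)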

The most important metric in this case is Recall.

In medical diagnosis, missing a cancer case (a false negative) can be life-threatening. A high recall ensures that most patients who actually have cancer are correctly identified. Even if precision is slightly lower, it is safer to flag more cases than to miss someone who truly has the disease.

If a model has high accuracy but low recall, it means it performs well overall but fails to detect many actual cancer cases. In this medical context, that would be very dangerous because patients with cancer might be incorrectly classified as healthy and not receive timely treatment.
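Recall for each class can be read directly off a confusion matrix, which is a useful check on which class a reported recall refers to. The sketch below uses the Naive Bayes matrix printed in Part 6 and assumes scikit-learn's conventions: rows are true labels, columns are predicted labels, and in this dataset label 0 is malignant and label 1 is benign. The variable names are illustrative.

```python
import numpy as np

# Naive Bayes confusion matrix copied from the Part 6 output
# rows = true label, columns = predicted label, label order [0, 1]
cm = np.array([[40, 3],
               [0, 71]])

# Recall for a class = correctly predicted samples of that class / all true samples of it
per_class_recall = cm.diagonal() / cm.sum(axis=1)
recall_malignant, recall_benign = per_class_recall

print(f"Recall, malignant (label 0): {recall_malignant:.4f}")
print(f"Recall, benign (label 1):    {recall_benign:.4f}")
```

The recall of 1.00 reported in the comparison table matches the benign row, since `recall_score` uses label 1 as the positive class by default. To score the models on detecting malignant cases instead, one option is to pass `pos_label=0` to `recall_score`.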

# Part 8 – Final Reflection (Required)

## Final Reflection (1-2 paragraphs)

Write a short summary in your own words explaining:

- What you observed when comparing individual models vs ensemble models.
- Whether ensemble methods improved performance.
- Which model you would choose for real-world deployment.
- Which metric best represents performance in this case study and why.
- Any surprising findings.


**My summary**

When comparing the individual models to the ensemble models, I observed that several ensemble methods performed very well, but they did not drastically outperform the best individual model. In my results, Naive Bayes achieved the highest F1-score, highest recall, and tied for highest accuracy. While ensemble methods like Random Forest, AdaBoost, and Voting classifiers performed strongly and were very consistent, they did not significantly exceed the performance of the top individual model. This shows that although ensemble methods often improve stability and generalization, a well-suited single model can sometimes perform just as well on a structured dataset like this one.

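For real-world deployment, the metrics as computed point to Naive Bayes, which achieved a recall of 1.00. However, scikit-learn's `recall_score` treats label 1 as the positive class by default, and in this dataset label 1 is benign, so that score means every benign case in the test set was identified, not every cancer case. The confusion matrices show that Naive Bayes classified 3 of the 43 malignant test cases as benign, while SGD missed only 1, and Logistic Regression and both voting classifiers missed 2. In medical diagnosis, recall on the malignant class is the most important metric because missing a cancer case (false negative) can have serious consequences, so I would confirm the final choice using recall computed with malignant as the positive class. Accuracy alone is not enough, since a model can appear accurate while still failing to detect actual positive cases. One surprising finding was that ensemble methods did not clearly outperform the best individual models, even though ensembles are generally expected to perform better. This highlights that model performance depends heavily on the dataset and problem context.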