# Day 5 — Skewed Classes & Evaluation Metrics
### Machine Learning Roadmap — Week 3
### Author — N Manish Kumar
---

In many real-world problems, one class is much rarer than the other.
Examples include:
- Cancer detection
- Fraud detection
- Spam filtering

In such cases, a model can achieve very high accuracy by simply predicting
the majority class, while still failing to detect important minority cases.

Therefore, accuracy alone is not a reliable metric for evaluating models
on imbalanced datasets.

In this notebook, we will:
- Create an imbalanced classification dataset
- Train a classifier and observe misleading accuracy
- Compute Precision, Recall, and F1-score
- Understand which metric is more important for different applications

---

## 2. Create an Imbalanced Dataset and Train/Test Split

In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create an imbalanced dataset (roughly 90% class 0, 10% class 1; exact counts are printed below)
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    weights=[0.9, 0.1],
    random_state=42
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Class distribution in full dataset:", np.bincount(y))
print("Class distribution in training set:", np.bincount(y_train))
print("Class distribution in test set:", np.bincount(y_test))

Class distribution in full dataset: [1791  209]
Class distribution in training set: [1433  167]
Class distribution in test set: [358  42]


---
## 3. Train Classifier and Evaluate Accuracy

We now train a simple Logistic Regression classifier on the imbalanced
training data and evaluate it using accuracy.

Accuracy measures the percentage of correct predictions, but it does not
distinguish between different types of errors.

On imbalanced datasets, accuracy can appear high even when the model
performs poorly on the minority class.
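To make the point about error types concrete, here is a small sketch in plain Python. The two prediction lists are hypothetical (they are not outputs of any model in this notebook); only the class counts mirror the 358/42 test split printed above. Both predictors reach the same accuracy while making opposite kinds of mistakes.

```python
# Hypothetical labels: 358 negatives (0) followed by 42 positives (1).
y_true = [0] * 358 + [1] * 42

# Predictor A: always says "negative", so it misses every positive (42 false negatives).
pred_a = [0] * 400

# Predictor B: finds every positive but also flags 42 negatives (42 false positives).
pred_b = [1] * 42 + [0] * 316 + [1] * 42


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def error_counts(y_true, y_pred):
    """Return (false_positives, false_negatives)."""
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return fp, fn


acc_a, acc_b = accuracy(y_true, pred_a), accuracy(y_true, pred_b)
errors_a, errors_b = error_counts(y_true, pred_a), error_counts(y_true, pred_b)

print("A: accuracy", acc_a, "(FP, FN) =", errors_a)
print("B: accuracy", acc_b, "(FP, FN) =", errors_b)
```

Accuracy treats the two predictors as equivalent, even though A would be useless for detecting rare cases and B would raise many false alarms.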


In [2]:
# Train logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

Accuracy: 0.895


---
## 4. Evaluate Precision, Recall, and F1-Score

Accuracy alone does not tell us how well the model detects minority-class
samples.

We therefore compute:
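- Precision: Of all samples predicted positive, the fraction that are actually positive, TP / (TP + FP)
- Recall: Of all actual positives, the fraction the model correctly detected, TP / (TP + FN)
- F1-score: Harmonic mean of precision and recall, which is high only when both are high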

These metrics give a clearer picture of model performance on imbalanced data.
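The three definitions can be written directly in terms of confusion-matrix counts. The sketch below uses plain Python; the function names are illustrative and not part of scikit-learn. The counts (TP = 2, FP = 2, FN = 40) are the ones implied by the results this notebook prints for the logistic regression model on its 42 positive test samples.

```python
def precision_from_counts(tp, fp):
    """Of everything predicted positive, the fraction that is truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall_from_counts(tp, fn):
    """Of all actual positives, the fraction that was detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_from_pr(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Counts inferred from this notebook's printed metrics (an assumption, not a new measurement).
tp, fp, fn = 2, 2, 40

p = precision_from_counts(tp, fp)
r = recall_from_counts(tp, fn)
f1_value = f1_from_pr(p, r)

print("Precision:", p)
print("Recall:", r)
print("F1-score:", f1_value)
```

The guards against a zero denominator play the same role as the `zero_division=0` argument used with the scikit-learn metric functions in Section 5.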


In [3]:
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Precision: 0.5
Recall: 0.047619047619047616
F1-score: 0.08695652173913043


### Interpretation

Precision indicates how reliable positive predictions are, while recall
indicates how many actual positive cases the model successfully detected.

Here, a recall of about 0.048 means the model detected only 2 of the 42
positive test samples, even though its accuracy is 0.895. The precision of 0.5
rests on just 4 positive predictions, 2 of which were correct.

F1-score summarizes the balance between precision and recall and is often
more informative than accuracy in such scenarios.
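One way to see why F1 is informative here: as a harmonic mean, it is pulled toward the smaller of the two values, so a model cannot hide poor recall behind acceptable precision. A quick sketch comparing it with a plain average, using the precision and recall printed above (variable names are illustrative):

```python
precision_value = 0.5
recall_value = 2 / 42  # the recall printed above (about 0.0476)

# A simple average is dominated by the larger value.
arithmetic_mean = (precision_value + recall_value) / 2

# The harmonic mean (the F1-score) stays close to the smaller value.
harmonic_mean = 2 * precision_value * recall_value / (precision_value + recall_value)

print("Arithmetic mean:", round(arithmetic_mean, 3))
print("Harmonic mean (F1):", round(harmonic_mean, 3))
```

The plain average comes out near 0.27, while the harmonic mean stays near 0.09, which better reflects a model that misses most positives.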

---
## 5. Compare with Dummy Baseline (Always Predict Majority Class)

To understand how misleading accuracy can be, we compare our trained model
with a very simple baseline that always predicts the majority class.

If this dummy model achieves similar accuracy, it means accuracy alone
is not a useful metric for this problem.


In [5]:
from sklearn.dummy import DummyClassifier

# Dummy model that always predicts the most frequent class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

y_dummy_pred = dummy.predict(X_test)

dummy_acc = accuracy_score(y_test, y_dummy_pred)
dummy_precision = precision_score(y_test, y_dummy_pred, zero_division=0)
dummy_recall = recall_score(y_test, y_dummy_pred, zero_division=0)
dummy_f1 = f1_score(y_test, y_dummy_pred, zero_division=0)

print("Dummy Accuracy:", dummy_acc)
print("Dummy Precision:", dummy_precision)
print("Dummy Recall:", dummy_recall)
print("Dummy F1:", dummy_f1)

Dummy Accuracy: 0.895
Dummy Precision: 0.0
Dummy Recall: 0.0
Dummy F1: 0.0


### Interpretation

The dummy model achieves an accuracy of 0.895 by predicting only the majority
class, but its recall for the minority class is zero, meaning it completely
fails to detect rare cases.

Our trained model has exactly the same accuracy (0.895) as the dummy model,
which confirms that accuracy is not sufficient for evaluating performance on
imbalanced datasets.

This highlights the importance of using precision, recall, and F1-score.
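The dummy model's accuracy can also be predicted without fitting anything: it equals the share of the majority class in the test set. A minimal sketch in plain Python, rebuilding the labels from the test-set class counts printed earlier ([358, 42]); the variable names are illustrative.

```python
from collections import Counter

# Rebuild the test labels from the printed class counts: 358 zeros, 42 ones.
y_test_labels = [0] * 358 + [1] * 42

counts = Counter(y_test_labels)
majority_class, majority_count = counts.most_common(1)[0]

# A model that always predicts the majority class.
baseline_pred = [majority_class] * len(y_test_labels)

# Its accuracy is simply the majority-class share.
baseline_accuracy = majority_count / len(y_test_labels)

# Recall on the minority class (label 1): true positives / actual positives.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_test_labels, baseline_pred))
minority_recall = true_positives / counts[1]

print("Majority class:", majority_class)
print("Baseline accuracy:", baseline_accuracy)
print("Minority-class recall:", minority_recall)
```

Checking this majority-class share before training gives a floor that any useful model's accuracy should clearly exceed.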

---
## 6. Choosing the Right Metric Based on Application

Different applications care about different types of errors.

Examples:
- Medical diagnosis → Missing a positive case is very dangerous → prioritize Recall
- Spam detection → Flagging a legitimate email as spam (a false alarm) is costly → prioritize Precision
- Both error types matter about equally → use F1-score

Therefore, model evaluation should focus on the metric that aligns with
real-world costs of mistakes, not just accuracy.
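One way to encode these priorities in a single number is the F-beta score, a generalization of F1 that is not otherwise used in this notebook: beta > 1 weights recall more heavily, beta < 1 weights precision more heavily, and beta = 1 recovers F1. scikit-learn provides it as `sklearn.metrics.fbeta_score`; the plain-Python function below is only an illustrative sketch of the formula, applied to the precision and recall values from Section 4.

```python
def fbeta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the logistic regression model from Section 4.
p, r = 0.5, 2 / 42

f1_score_value = fbeta(p, r, beta=1)    # balanced
f2_score_value = fbeta(p, r, beta=2)    # recall-oriented (e.g. medical diagnosis)
f_half_score_value = fbeta(p, r, beta=0.5)  # precision-oriented (e.g. spam filtering)

print("F1:  ", round(f1_score_value, 4))
print("F2:  ", round(f2_score_value, 4))
print("F0.5:", round(f_half_score_value, 4))
```

Because recall is the weak spot of this model, the recall-oriented F2 comes out lowest and the precision-oriented F0.5 comes out highest.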


In [6]:
metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "Dummy Baseline"],
    "Accuracy": [acc, dummy_acc],
    "Precision": [precision, dummy_precision],
    "Recall": [recall, dummy_recall],
    "F1-score": [f1, dummy_f1]
})

metrics_df

| | Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.895 | 0.5 | 0.047619 | 0.086957 |
| 1 | Dummy Baseline | 0.895 | 0.0 | 0.0 | 0.0 |


### Interpretation

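Comparing all metrics together shows that:
- Accuracy is identical for the two models (0.895), so it cannot tell them apart
- Precision, recall, and F1-score do separate them, and they also show that the logistic regression detects very few positives (recall of about 0.048)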

For problems where detecting rare events is critical, recall (or F1-score)
should be the primary metric used for model selection and tuning.
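When recall is the priority, one common tuning lever is the decision threshold: predicting positive at a lower probability cutoff usually raises recall at the cost of precision. The sketch below uses a small set of hypothetical scores and labels (not taken from the model in this notebook) to show the trade-off. With a fitted scikit-learn classifier, scores like these would typically come from `model.predict_proba(X_test)[:, 1]`.

```python
# Hypothetical predicted probabilities for the positive class, with true labels.
scores = [0.95, 0.80, 0.60, 0.45, 0.40, 0.30, 0.25, 0.15, 0.10, 0.05]
labels = [1,    0,    1,    1,    0,    1,    0,    0,    0,    0]


def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


p_default, r_default = precision_recall_at(scores, labels, threshold=0.5)
p_low, r_low = precision_recall_at(scores, labels, threshold=0.2)

print("threshold 0.5 -> precision", round(p_default, 3), "recall", round(r_default, 3))
print("threshold 0.2 -> precision", round(p_low, 3), "recall", round(r_low, 3))
```

On this toy data, lowering the threshold from 0.5 to 0.2 lifts recall from 0.5 to 1.0 while precision drops from about 0.67 to about 0.57. Which point on that trade-off is acceptable depends on the application's cost of each error type.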

---
# Notebook Summary — Week 3 Day 5

In this notebook, we studied why accuracy can be misleading when working with
imbalanced classification datasets and learned to use better evaluation metrics.

### What was done
- Created an intentionally imbalanced dataset
- Trained a Logistic Regression classifier
- Evaluated performance using accuracy
- Computed Precision, Recall, and F1-score
- Compared results with a dummy baseline model
- Compared all metrics side-by-side to guide model evaluation

### Key Learnings
- High accuracy does not guarantee good performance on minority classes
- Recall measures how many real positive cases are detected
- Precision measures how reliable positive predictions are
- F1-score balances precision and recall
- Metric selection must depend on the real-world cost of mistakes

### Final Outcome
The analysis showed that models must be evaluated using metrics aligned with
the problem objective, and that recall or F1-score is often more informative
than accuracy on imbalanced datasets.
