<h4 style="color:#1a73e8;">2.3.9 Handling Class Imbalance: Beyond Accuracy</h4>

### **The Problem**

In fraud detection with a 0.1% fraud rate, a model that always predicts "not fraud" achieves 99.9% accuracy, yet it catches no fraud at all and is useless in practice.

### **Solutions**

#### **1. Evaluation Metrics**
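- **Accuracy**: Misleading for imbalanced data, because the majority class dominates the score.
- **Precision, Recall, F1**: More informative, because they are computed for the minority (positive) class. Precision is the share of flagged cases that are truly positive, recall is the share of true positives that are caught, and F1 is their harmonic mean.
- **ROC-AUC, PR-AUC**: Threshold-independent summaries of ranking quality. PR-AUC is preferred under heavy imbalance, because ROC-AUC can look optimistic when negatives vastly outnumber positives.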
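
To make the gap between accuracy and the class-aware metrics concrete, here is a minimal plain-Python sketch that scores the always-"not fraud" baseline. The data (10,000 transactions, 10 of them fraud) and the helper `precision_recall_f1` are illustrative assumptions, not taken from a real dataset or library.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention used in this sketch: a ratio with a zero denominator is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical data: 10,000 transactions, 10 of them fraud (0.1%)
y_true = [1] * 10 + [0] * 9990
y_pred = [0] * 10000  # baseline that always predicts "not fraud"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.1f} recall={recall:.1f} f1={f1:.1f}")
# prints: accuracy=0.999 precision=0.0 recall=0.0 f1=0.0
```

Accuracy rewards the baseline for the 9,990 easy negatives, while recall shows that none of the 10 fraud cases were caught.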

#### **2. Resampling Techniques**

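- **Undersampling**: Drop samples from the majority class. Risk: discarding potentially useful information.
- **Oversampling**: Duplicate minority-class samples. Risk: overfitting to the repeated points.
- **SMOTE**: Generate synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors.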
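
The interpolation step behind SMOTE can be sketched in a few lines. This is a simplified illustration and not the imbalanced-learn implementation: the helper `smote_like_sample` is hypothetical, it finds neighbors by brute-force Euclidean distance, and the four minority points are made up.

```python
import math
import random

def smote_like_sample(minority, k, rng):
    """Return one synthetic point on the segment between a random minority
    point and one of its k nearest minority neighbors."""
    base = rng.choice(minority)
    # Brute-force k nearest minority neighbors of `base` (excluding itself)
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

# Hypothetical 2-D minority-class points
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Each synthetic point is a convex combination of two real minority points, so it always falls inside the region the minority class already occupies.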
  

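```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Build a pipeline so SMOTE runs only during fitting, never on held-out data.
# X_train and y_train are assumed to come from an earlier train/test split.
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),  # resamples the training data inside fit()
    ('clf', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
```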

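> **Never apply SMOTE before splitting**. Synthetic points built from samples that later land in the test set leak information and inflate scores. Always resample **only the training set** inside a pipeline.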
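
The same ordering rule applies to any resampler. As a dependency-free sketch, the example below uses plain random oversampling rather than SMOTE, on made-up data, with a hypothetical helper `oversample_minority`. The split happens first and only the training portion is rebalanced.

```python
import random

def oversample_minority(X, y, rng):
    """Duplicate random minority (label 1) rows until both classes have equal counts."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    idx = neg + pos + extra
    return [X[i] for i in idx], [y[i] for i in idx]

rng = random.Random(0)
# Hypothetical data: 100 rows, every 10th row is positive; the feature is just the row id
X = [[i] for i in range(100)]
y = [1 if i % 10 == 0 else 0 for i in range(100)]

# 1) Split first (simple positional 80/20 split for illustration)
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

# 2) Resample the training portion only; the test set keeps its natural imbalance
X_res, y_res = oversample_minority(X_train, y_train, rng)

test_ids = {row[0] for row in X_test}
print(sum(y_res), len(y_res) - sum(y_res))         # prints: 72 72
print(any(row[0] in test_ids for row in X_res))    # prints: False
```

Because the test rows are set aside before resampling, no duplicated or derived copy of a test sample can end up in the training data.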

#### **3. Algorithmic Approaches**
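- **Class weights**: Many scikit-learn classifiers (e.g., `LogisticRegression`, `SVC`, `RandomForestClassifier`) accept `class_weight='balanced'`, which weights each class inversely to its frequency.
  ```python
  from sklearn.ensemble import RandomForestClassifier

  # Errors on the rare class are penalized more heavily during training
  clf = RandomForestClassifier(class_weight='balanced')
  ```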
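- **Anomaly detection**: Treat the minority class as anomalies and fit an unsupervised detector (e.g., Isolation Forest). This can help when positive examples are too rare to learn from directly.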
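
For intuition about what `class_weight='balanced'` computes, scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. Below is a plain-Python sketch of that formula on hypothetical labels. The helper `balanced_class_weights` is illustrative and not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical labels: 990 legitimate (0) and 10 fraud (1)
y = [0] * 990 + [1] * 10
weights = balanced_class_weights(y)
print(weights)
```

With these counts, the fraud class gets a weight of 50.0 and the legitimate class about 0.505, so each fraud example counts roughly 99 times as much as a legitimate one in the loss.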

---