Setting up the notebook and creating an imbalanced dataset (e.g. fraud detection)

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
import numpy as np
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# Simulate class imbalance by keeping only a few samples of class 1
# (or class 0, depending on the dataset). Here 1 = benign, 0 = malignant.
minority_class = 1
majority_class = 0

idx_min = y[y == minority_class].index
idx_maj = y[y == majority_class].index

# Keep every majority sample; downsample the minority to ~20% of its original size
np.random.seed(42)
idx_min_down = np.random.choice(idx_min, size=int(0.2 * len(idx_min)), replace=False)

idx_imbalanced = np.concatenate([idx_maj, idx_min_down])
X_imb = X.loc[idx_imbalanced]
y_imb = y.loc[idx_imbalanced]

print(y.value_counts(normalize=True))
print(y_imb.value_counts(normalize=True))

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


target
1    0.627417
0    0.372583
Name: proportion, dtype: float64
target
0    0.749117
1    0.250883
Name: proportion, dtype: float64


Baseline: a model trained on the imbalanced data
Use Logistic Regression (or any model you are familiar with):

In [2]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(max_iter=1000, random_state=42)
log_clf.fit(X_train_scaled, y_train)
y_pred = log_clf.predict(X_test_scaled)

print("Baseline accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=data.target_names))


Baseline accuracy: 0.9649122807017544
              precision    recall  f1-score   support

   malignant       0.98      0.98      0.98        43
      benign       0.93      0.93      0.93        14

    accuracy                           0.96        57
   macro avg       0.95      0.95      0.95        57
weighted avg       0.96      0.96      0.96        57



Handling imbalance with SMOTE

In [6]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("Before SMOTE:", y_train.value_counts())
print("After SMOTE:", y_train_smote.value_counts())

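# Rescale the oversampled training data.
# Caution: this refits the shared `scaler` and overwrites X_test_scaled, so
# X_train_scaled from the first cell no longer matches the test features.
X_train_smote_scaled = scaler.fit_transform(X_train_smote)
X_test_scaled = scaler.transform(X_test)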


Before SMOTE: target
0    169
1     57
Name: count, dtype: int64
After SMOTE: target
1    169
0    169
Name: count, dtype: int64
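
To make the oversampling step less of a black box: SMOTE builds each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbors. The following is a simplified NumPy sketch of that idea only. `smote_like_samples` is a hypothetical helper written for illustration, not imblearn's implementation, and it uses brute-force distances for readability.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances among minority samples only (brute force).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                     # random starting points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor for each
    gap = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 6 points in 2D (made-up values).
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                       [1.1, 1.3], [0.9, 0.8], [1.3, 1.0]])
X_synth = smote_like_samples(X_minority, n_new=10, k=3)
print(X_synth.shape)
```

Because neighbors are found by Euclidean distance, features with large numeric ranges dominate the neighbor search. That is one reason to consider scaling before oversampling, and in any case to resample only the training split.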


Retrain the model:



In [8]:
log_clf_smote = LogisticRegression(max_iter=1000, random_state=42)
log_clf_smote.fit(X_train_smote_scaled, y_train_smote)
y_pred_smote = log_clf_smote.predict(X_test_scaled)

print("SMOTE accuracy:", accuracy_score(y_test, y_pred_smote))
print(classification_report(y_test, y_pred_smote, target_names=data.target_names))


SMOTE accuracy: 0.9649122807017544
              precision    recall  f1-score   support

   malignant       0.98      0.98      0.98        43
      benign       0.93      0.93      0.93        14

    accuracy                           0.96        57
   macro avg       0.95      0.95      0.95        57
weighted avg       0.96      0.96      0.96        57



Using class_weight in the model

In [9]:
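log_clf_weighted = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight="balanced"  # or a dict such as {0: w0, 1: w1}
)
# Caution: X_train_scaled comes from the scaler fit in the first cell, but the
# SMOTE cell refit that scaler and overwrote X_test_scaled. The mismatch may
# distort the scores below; refit the scaler on X_train for a fair comparison.
log_clf_weighted.fit(X_train_scaled, y_train)
y_pred_w = log_clf_weighted.predict(X_test_scaled)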

print("Class_weight accuracy:", accuracy_score(y_test, y_pred_w))
print(classification_report(y_test, y_pred_w, target_names=data.target_names))


Class_weight accuracy: 0.9298245614035088
              precision    recall  f1-score   support

   malignant       0.91      1.00      0.96        43
      benign       1.00      0.71      0.83        14

    accuracy                           0.93        57
   macro avg       0.96      0.86      0.89        57
weighted avg       0.94      0.93      0.93        57
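
For reference, scikit-learn documents the "balanced" mode as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that arithmetic with the training counts printed in the "Before SMOTE" output (169 samples of class 0, 57 of class 1). The variable names are illustrative.

```python
import numpy as np

# Training-set class counts taken from the "Before SMOTE" output.
y_counts = np.array([169, 57])  # index 0 -> class 0, index 1 -> class 1
n_samples = y_counts.sum()
n_classes = len(y_counts)

# Documented heuristic behind class_weight="balanced".
balanced_weights = n_samples / (n_classes * y_counts)

for cls, weight in enumerate(balanced_weights):
    print(f"class {cls}: weight = {weight:.3f}")
```

With these counts, an error on class 1 weighs about three times as much as an error on class 0 (the ratio is 169/57). A manual dict lets you push that ratio higher or lower.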



In [None]:
# ┌─────────────────────────────────────────────┐
# │   IMBALANCED DATA HANDLING FRAMEWORK        │
# └─────────────────────────────────────────────┘

# LEVEL 1: DATA-LEVEL (change the data)
# ├── Oversampling: SMOTE, ADASYN
# ├── Undersampling: Tomek Links, NearMiss
# └── Hybrid: SMOTEENN, SMOTETomek

# LEVEL 2: ALGORITHM-LEVEL (change the model)
# ├── Cost-Sensitive: Class Weight, Custom Loss
# ├── Ensemble: BalancedRF, EasyEnsemble
# └── One-Class: Isolation Forest, One-Class SVM

# LEVEL 3: EVALUATION-LEVEL (change the evaluation)
# ├── Metrics: F1, Precision, Recall
# ├── Curves: ROC-AUC, PR-AUC
# ├── Balanced: Balanced Accuracy, MCC
# └── Tuning: Threshold Optimization
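
The Level 3 metrics listed in the framework are all available in `sklearn.metrics`. Here is a small sketch on made-up labels and scores (the arrays are illustrative, not results from this notebook) showing how a classifier that always predicts the majority class looks under each metric:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    balanced_accuracy_score,
    f1_score,
    matthews_corrcoef,
    roc_auc_score,
)

# Illustrative data: 18 negatives and 2 positives (the minority class).
y_true = np.array([0] * 18 + [1] * 2)
# A "lazy" classifier that always predicts the majority class.
y_pred_lazy = np.zeros_like(y_true)
# Hypothetical scores for the positive class from some other model.
y_score = np.array([0.1] * 17 + [0.6] + [0.4, 0.9])

metric_values = {
    "accuracy": accuracy_score(y_true, y_pred_lazy),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred_lazy),
    "f1_minority": f1_score(y_true, y_pred_lazy, zero_division=0),
    "mcc": matthews_corrcoef(y_true, y_pred_lazy),
    "roc_auc": roc_auc_score(y_true, y_score),            # threshold-free, uses scores
    "pr_auc": average_precision_score(y_true, y_score),   # average precision
}

for name, value in metric_values.items():
    print(f"{name}: {value:.3f}")
```

The lazy classifier reaches 0.90 accuracy but only 0.50 balanced accuracy, which is why plain accuracy is a poor headline number on imbalanced data.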

One way to think about it as an AI engineer:

Medium-sized dataset, moderate imbalance (e.g. 90–10, 95–5)

Step 1: change the metric and look at recall/F1 for the minority class.

Step 2: try class_weight="balanced" with Logistic Regression, SVM, decision trees, or random forests.

Step 3: if that is not enough, add SMOTE plus a stronger model (Random Forest / XGBoost) and compare on a validation set.
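
The first two steps can be sketched as a cross-validated comparison on minority-class metrics. This is only an illustration: it assumes a synthetic 90-10 dataset from `make_classification`, with class 1 as the minority, and the names `X_demo`, `candidates`, and `results` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 90-10 problem; class 1 is the minority (assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

candidates = {
    "plain": LogisticRegression(max_iter=1000),
    "balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, clf in candidates.items():
    # The pipeline refits the scaler inside every fold, so no statistics leak
    # from the validation fold into training.
    model = make_pipeline(StandardScaler(), clf)
    cv_out = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["recall", "f1"])
    # "recall" and "f1" are computed for the positive class (label 1).
    results[name] = {m: cv_out[f"test_{m}"].mean() for m in ("recall", "f1")}
    print(name, results[name])
```

The same loop extends to step 3 by adding further candidates to the dictionary and comparing them on the same folds.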

Very large dataset with severe imbalance (e.g. 99–1)

Prefer class_weight, optionally undersampling part of the majority class to reduce the dataset size.

Use SMOTE only on a subset, or with great care, because of its computational cost.
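
A minimal sketch of partial random undersampling with plain NumPy, assuming the data fits in memory as arrays. `undersample_majority` and its `keep_fraction` parameter are hypothetical names for this example, and the 99-1 toy data is generated on the spot.

```python
import numpy as np

def undersample_majority(X, y, majority_label, keep_fraction, seed=42):
    """Keep a random fraction of majority rows and all rows of other classes."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    n_keep = max(1, int(keep_fraction * len(maj_idx)))
    kept_maj = rng.choice(maj_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([kept_maj, other_idx]))  # preserve row order
    return X[keep], y[keep]

# Toy 99-1 dataset: 9900 majority rows, 100 minority rows.
rng = np.random.default_rng(0)
y_big = np.array([0] * 9900 + [1] * 100)
X_big = rng.normal(size=(len(y_big), 3))

X_small, y_small = undersample_majority(
    X_big, y_big, majority_label=0, keep_fraction=0.1
)
print(np.bincount(y_small))  # class counts after undersampling
```

Apply this to the training split only, so the test set keeps the original class distribution. Dropping majority rows also shifts predicted probabilities toward the minority class, which matters if the scores are used as calibrated probabilities.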

Problems that are extremely sensitive to false negatives (missing a minority case is very dangerous: serious illness, large-scale fraud)

Accept a higher false-positive rate.

Prioritize techniques that raise minority recall: SMOTE or a high class_weight for the minority class, plus threshold tuning (e.g. predict positive if prob > 0.3 instead of 0.5).
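
Threshold tuning can be sketched as follows, assuming a synthetic 90-10 dataset where class 1 is the minority. The 0.3 cut-off is the illustrative value from the text, not a recommendation. In practice, choose it on a validation set against the cost of each error type.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic 90-10 data; class 1 is the minority (assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Column 1 holds P(class 1) because clf.classes_ is [0, 1].
proba_pos = clf.predict_proba(X_val)[:, 1]

threshold_report = {}
for threshold in (0.5, 0.3):
    y_hat = (proba_pos > threshold).astype(int)
    threshold_report[threshold] = {
        "recall": recall_score(y_val, y_hat),
        "precision": precision_score(y_val, y_hat, zero_division=0),
    }
    print(threshold, threshold_report[threshold])
```

Lowering the threshold can only add positive predictions, so minority recall never decreases. The price is usually lower precision, which is the false-positive trade-off accepted above.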