# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0 = legit**, **1 = fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [3]:
# Identify target column
target_candidates = ['fraud', 'is_fraud', 'Fraud', 'isFraud']
target_col = next((c for c in target_candidates if c in fraud.columns), None)
assert target_col is not None, f"Target not found. Candidates tried: {target_candidates}"

# Class distribution
dist = fraud[target_col].value_counts(dropna=False).rename_axis('class').to_frame('count')
dist['ratio'] = dist['count'] / dist['count'].sum()
print(dist)

# Quick flag for imbalance (the 0.30 cutoff is a rule of thumb; tweak it if you like)
minority_ratio = dist['ratio'].min()
print(f"\nMinority class ratio: {minority_ratio:.4f}")
print("Imbalanced dataset?", "Yes" if minority_ratio < 0.30 else "No")


        count     ratio
class                  
0.0    912597  0.912597
1.0     87403  0.087403

Minority class ratio: 0.0874
Imbalanced dataset? Yes
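
To see why this imbalance matters for metric choice, the short sketch below works out what a trivial "always predict legit" rule would score, using only the class counts printed above. The variable names are illustrative and not part of the lab.

```python
# Class counts copied from the value_counts() output above.
n_legit = 912_597
n_fraud = 87_403
n_total = n_legit + n_fraud

# A rule that always predicts 0 (legit) gets every legit row right
# and every fraud row wrong.
baseline_accuracy = n_legit / n_total
baseline_fraud_recall = 0 / n_fraud
imbalance_ratio = n_legit / n_fraud

print(f"Always-legit accuracy:     {baseline_accuracy:.4f}")
print(f"Always-legit fraud recall: {baseline_fraud_recall:.4f}")
print(f"Legit-to-fraud ratio:      {imbalance_ratio:.2f} : 1")
```

Accuracy above 0.91 is reachable without detecting a single fraud, so accuracy alone is a weak yardstick here; recall, precision, and F1 on the fraud class are more informative.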


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np

# Features (numeric only to keep it simple) and target
X = fraud.select_dtypes(include=np.number).drop(columns=[target_col])
y = fraud[target_col].astype(int)

# Stratified split to preserve class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Baseline pipeline: scale + logistic regression
log_reg_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs", random_state=42))
])

log_reg_pipe.fit(X_train, y_train)
log_reg_pipe


In [5]:
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    roc_auc_score, average_precision_score, classification_report, confusion_matrix
)
import numpy as np

# Predictions
y_pred = log_reg_pipe.predict(X_test)
y_proba = log_reg_pipe.predict_proba(X_test)[:, 1]

# Metrics (binary, positive class = 1)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision (pos class=1):", precision_score(y_test, y_pred, pos_label=1))
print("Recall (pos class=1):", recall_score(y_test, y_pred, pos_label=1))
print("F1 (pos class=1):", f1_score(y_test, y_pred, pos_label=1))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print("PR-AUC (Average Precision):", average_precision_score(y_test, y_proba))

print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=4))


Accuracy: 0.95941
Precision (pos class=1): 0.8964349225167245
Recall (pos class=1): 0.605571763629083
F1 (pos class=1): 0.7228405599180607
ROC-AUC: 0.9669773583082968
PR-AUC (Average Precision): 0.8072243732180122

Confusion Matrix:
 [[181296   1223]
 [  6895  10586]]

Classification Report:
               precision    recall  f1-score   support

           0     0.9634    0.9933    0.9781    182519
           1     0.8964    0.6056    0.7228     17481

    accuracy                         0.9594    200000
   macro avg     0.9299    0.7994    0.8505    200000
weighted avg     0.9575    0.9594    0.9558    200000
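
As a sanity check, the fraud-class metrics can be recomputed by hand from the confusion matrix printed above. This is a minimal sketch that assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix copied from the baseline output above
# (rows = true class, columns = predicted class).
tn, fp = 181_296, 1_223
fn, tp = 6_895, 10_586

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

The recall figure is the one to watch: 6,895 of the 17,481 fraudulent test transactions are missed by the baseline model even though accuracy is close to 0.96.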



In [6]:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler

ros_pipe = ImbPipeline(steps=[
    ("scaler", StandardScaler()),
    ("sampler", RandomOverSampler(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs", random_state=42))
])

ros_pipe.fit(X_train, y_train)
y_pred_ros = ros_pipe.predict(X_test)
y_proba_ros = ros_pipe.predict_proba(X_test)[:, 1]

print("Oversampling (RandomOverSampler)")
print("Accuracy:", accuracy_score(y_test, y_pred_ros))
print("Precision:", precision_score(y_test, y_pred_ros, pos_label=1))
print("Recall:", recall_score(y_test, y_pred_ros, pos_label=1))
print("F1:", f1_score(y_test, y_pred_ros, pos_label=1))
print("ROC-AUC:", roc_auc_score(y_test, y_proba_ros))
print("PR-AUC:", average_precision_score(y_test, y_proba_ros))


Oversampling (RandomOverSampler)
Accuracy: 0.9348
Precision: 0.5773720338687759
Recall: 0.9478862765288028
F1: 0.7176266782156778
ROC-AUC: 0.9795250850098041
PR-AUC: 0.7574035331120728
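
For intuition, the idea behind random oversampling can be sketched on toy data with the standard library alone. The helper `random_oversample` below is hypothetical and is not how imbalanced-learn implements `RandomOverSampler`; it only illustrates duplicating minority rows, drawn with replacement, until every class matches the largest one.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of each smaller class until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw with replacement; the majority class needs zero extra rows.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: five "legit" rows and one "fraud" row.
toy_rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
toy_labels = [0, 0, 0, 0, 0, 1]

over_rows, over_labels = random_oversample(toy_rows, toy_labels)
over_counts = Counter(over_labels)
print(over_counts)
```

In the pipeline above, the sampler step is applied only during `fit`, so the test set keeps its original class distribution.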


In [7]:
from imblearn.under_sampling import RandomUnderSampler

rus_pipe = ImbPipeline(steps=[
    ("scaler", StandardScaler()),
    ("sampler", RandomUnderSampler(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs", random_state=42))
])

rus_pipe.fit(X_train, y_train)
y_pred_rus = rus_pipe.predict(X_test)
y_proba_rus = rus_pipe.predict_proba(X_test)[:, 1]

print("Undersampling (RandomUnderSampler)")
print("Accuracy:", accuracy_score(y_test, y_pred_rus))
print("Precision:", precision_score(y_test, y_pred_rus, pos_label=1))
print("Recall:", recall_score(y_test, y_pred_rus, pos_label=1))
print("F1:", f1_score(y_test, y_pred_rus, pos_label=1))
print("ROC-AUC:", roc_auc_score(y_test, y_proba_rus))
print("PR-AUC:", average_precision_score(y_test, y_proba_rus))


Undersampling (RandomUnderSampler)
Accuracy: 0.934785
Precision: 0.5773494143892917
Recall: 0.9474858417710658
F1: 0.7174944226645584
ROC-AUC: 0.9795581555971167
PR-AUC: 0.7568604945535805
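
Random undersampling works in the opposite direction: majority rows are discarded until the classes match. The sketch below uses a hypothetical helper, `random_undersample`, on toy data; it is an illustration of the idea and not imbalanced-learn's implementation of `RandomUnderSampler`.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=42):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept_rows, kept_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample without replacement, so no row appears twice.
        for row in rng.sample(pool, target):
            kept_rows.append(row)
            kept_labels.append(cls)
    return kept_rows, kept_labels


# Toy data: five "legit" rows and two "fraud" rows.
toy_rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [8.0], [9.0]]
toy_labels = [0, 0, 0, 0, 0, 1, 1]

under_rows, under_labels = random_undersample(toy_rows, toy_labels)
under_counts = Counter(under_labels)
print(under_counts)
```

The trade-off is that undersampling throws training rows away, whereas oversampling repeats them.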


In [8]:
from imblearn.over_sampling import SMOTE

smote_pipe = ImbPipeline(steps=[
    ("scaler", StandardScaler()),
    ("sampler", SMOTE(random_state=42, k_neighbors=5)),
    ("clf", LogisticRegression(max_iter=1000, solver="lbfgs", random_state=42))
])

smote_pipe.fit(X_train, y_train)
y_pred_sm = smote_pipe.predict(X_test)
y_proba_sm = smote_pipe.predict_proba(X_test)[:, 1]

print("SMOTE")
print("Accuracy:", accuracy_score(y_test, y_pred_sm))
print("Precision:", precision_score(y_test, y_pred_sm, pos_label=1))
print("Recall:", recall_score(y_test, y_pred_sm, pos_label=1))
print("F1:", f1_score(y_test, y_pred_sm, pos_label=1))
print("ROC-AUC:", roc_auc_score(y_test, y_proba_sm))
print("PR-AUC:", average_precision_score(y_test, y_proba_sm))


SMOTE
Accuracy: 0.9348
Precision: 0.5773504720103111
Recall: 0.9481150963903667
F1: 0.7176755867324847
ROC-AUC: 0.9795260633479468
PR-AUC: 0.7573721059940975
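
To answer the "does it improve?" questions in steps 4 to 6, it helps to put the four runs side by side. The sketch below copies the test-set metrics from the outputs above, rounded to four decimals, and prints each resampling strategy's change relative to the baseline. The dictionary layout and names are illustrative.

```python
# Test-set metrics copied from the outputs above, rounded to four decimals.
results = {
    "baseline":    {"precision": 0.8964, "recall": 0.6056, "f1": 0.7228, "roc_auc": 0.9670, "pr_auc": 0.8072},
    "oversample":  {"precision": 0.5774, "recall": 0.9479, "f1": 0.7176, "roc_auc": 0.9795, "pr_auc": 0.7574},
    "undersample": {"precision": 0.5773, "recall": 0.9475, "f1": 0.7175, "roc_auc": 0.9796, "pr_auc": 0.7569},
    "smote":       {"precision": 0.5774, "recall": 0.9481, "f1": 0.7177, "roc_auc": 0.9795, "pr_auc": 0.7574},
}

# Change of every metric relative to the baseline model.
base = results["baseline"]
deltas = {
    name: {metric: round(value - base[metric], 4) for metric, value in metrics.items()}
    for name, metrics in results.items()
    if name != "baseline"
}

for name, change in deltas.items():
    summary = ", ".join(f"{metric} {delta:+.4f}" for metric, delta in change.items())
    print(f"{name:<12} {summary}")
```

All three strategies behave almost identically on this dataset: fraud recall rises from about 0.61 to about 0.95, precision falls from about 0.90 to about 0.58, and F1 stays near 0.72. Whether that counts as an improvement depends on how costly a missed fraud is compared with a false alarm.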
