# LAB | Imbalanced

**Load the data**

In this challenge, we will work with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score, precision_score, f1_score, confusion_matrix, classification_report
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate it by selecting the correct metric.
- **4.** Apply **oversampling** to balance the target variable in the training data and repeat steps 1-3 on the balanced data. Does it improve the performance of the model?
- **5.** Now apply **undersampling** to balance the target variable in the training data and repeat steps 1-3 on the balanced data. Does it improve the performance of the model?
- **6.** Finally, apply **SMOTE** to balance the target variable in the training data and repeat steps 1-3 on the balanced data. Does it improve the performance of the model?
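
Steps 3 to 6 repeat the same evaluation, so it can help to wrap the fraud-class metrics in one small function. The sketch below is only an illustration: the helper name `fraud_metrics` and the toy labels are assumptions, not part of the lab. It uses the `recall_score`, `precision_score` and `f1_score` functions that the lab already imports.

```python
from sklearn.metrics import f1_score, precision_score, recall_score


def fraud_metrics(y_true, y_pred, pos_label=1):
    """Return recall, precision and F1 for the fraud (positive) class.

    Hypothetical helper for illustration; not part of the original lab.
    """
    return {
        "recall": recall_score(y_true, y_pred, pos_label=pos_label),
        "precision": precision_score(y_true, y_pred, pos_label=pos_label),
        "f1": f1_score(y_true, y_pred, pos_label=pos_label),
    }


# Toy labels (made up): 3 fraud cases, 2 of them caught, plus 1 false alarm
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

metrics = fraud_metrics(y_true, y_pred)
print(metrics)
```

In the lab itself, the same helper could be called with `y_test` and each model's predictions to compare the resampling strategies on equal terms.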

In [3]:
df = fraud.copy()
X = df.drop('fraud', axis=1)
y = df['fraud']

# --- Scale Numerical Features 
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# --- Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")

X_train shape: (800000, 7), y_train shape: (800000,)
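
The cell above fits `StandardScaler` on every row before splitting, so statistics from the test rows influence the scaling. With a dataset this large the effect is probably small, but a common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data from `make_classification`; the `*_demo` names and the data are assumptions for illustration, not the lab's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the fraud data (roughly 90% / 10%)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, weights=[0.9, 0.1], random_state=42
)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then the model
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_tr, y_tr)

# The scaler's mean comes from the training split alone
scaler_demo = pipe.named_steps["standardscaler"]
print(np.allclose(scaler_demo.mean_, X_tr.mean(axis=0)))

# Predicting applies the training-set scaling to the test rows
y_pred_demo = pipe.predict(X_te)
```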


In [4]:
print("--- 1. Target Distribution ---")
print(y.value_counts(normalize=True))

# Check for imbalance (the 20% cutoff is a rule of thumb, not a formal definition)
is_imbalanced = y.value_counts(normalize=True).min() < 0.2
print(f"\nIs this an imbalanced dataset? {'Yes' if is_imbalanced else 'No'}. (Minority class < 20%)")

--- 1. Target Distribution ---
fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64

Is this an imbalanced dataset? Yes. (Minority class < 20%)
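
To see why this distribution matters for metric choice, the short calculation below uses the two proportions printed above. It considers a hypothetical classifier that always predicts "legit" (an assumption for illustration, not a model trained in this lab): it would score high accuracy while catching no fraud at all.

```python
# Class shares copied from the value_counts output above
legit_share = 0.912597
fraud_share = 0.087403

# How many legitimate transactions there are per fraudulent one
imbalance_ratio = legit_share / fraud_share

# A hypothetical classifier that always predicts "legit" is right on every
# legitimate row and wrong on every fraudulent one
always_legit_accuracy = legit_share
always_legit_fraud_recall = 0.0

print(f"Legit-to-fraud ratio: {imbalance_ratio:.1f} to 1")
print(f"Always-legit accuracy: {always_legit_accuracy:.1%}")
print(f"Always-legit fraud recall: {always_legit_fraud_recall:.1%}")
```

This is why the steps that follow focus on recall for the fraud class rather than on overall accuracy.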


In [5]:
# Train the model
lr_baseline = LogisticRegression(random_state=42)
lr_baseline.fit(X_train, y_train)
y_pred_baseline = lr_baseline.predict(X_test)

print("\n--- 2 & 3. Baseline Logistic Regression Evaluation (No Resampling) ---")
print("CRITICAL METRIC: Recall (0=Legit, 1=Fraud)")
print(classification_report(y_test, y_pred_baseline))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_baseline))


--- 2 & 3. Baseline Logistic Regression Evaluation (No Resampling) ---
CRITICAL METRIC: Recall (0=Legit, 1=Fraud)
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182519
         1.0       0.90      0.61      0.72     17481

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000


Confusion Matrix:
[[181294   1225]
 [  6894  10587]]
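
The fraud-class numbers in the report can be recomputed by hand from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions). A small sketch of that arithmetic, using the counts printed above:

```python
# Counts copied from the baseline confusion matrix above
tn, fp = 181294, 1225   # true label 0 (legit)
fn, tp = 6894, 10587    # true label 1 (fraud)

# Recall: share of actual fraud cases the model caught
recall = tp / (tp + fn)

# Precision: share of fraud alerts that were really fraud
precision = tp / (tp + fp)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"recall={recall:.2f} precision={precision:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

The 6,894 false negatives are fraudulent transactions that the baseline model let through, which is what the 0.61 recall reflects.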


In [6]:
# 4a. Apply Random Oversampling to the training data
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print("\n--- 4. Random Oversampling ---")
print(f"New Training Shape: {X_train_ros.shape}. New Target Distribution: \n{y_train_ros.value_counts(normalize=True)}")

# 4b. Train and Evaluate the model
lr_ros = LogisticRegression(random_state=42)
lr_ros.fit(X_train_ros, y_train_ros)
y_pred_ros = lr_ros.predict(X_test)

print("\nEvaluation after Random Oversampling:")
print(classification_report(y_test, y_pred_ros))
print(f"Fraud Recall: {recall_score(y_test, y_pred_ros, pos_label=1):.4f}")


--- 4. Random Oversampling ---
New Training Shape: (1460156, 7). New Target Distribution: 
fraud
0.0    0.5
1.0    0.5
Name: proportion, dtype: float64

Evaluation after Random Oversampling:
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Fraud Recall: 0.9479
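
The new training shape of 1,460,156 rows is 2 × 730,078: the legitimate class keeps its 730,078 rows and the fraud class is filled up to the same count with repeated rows. The NumPy sketch below illustrates that idea on a toy array; it is a simplified illustration, not imbalanced-learn's actual implementation, and all names and data in it are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 10 legit rows (label 0) and 2 fraud rows (label 1)
X_toy = np.arange(24, dtype=float).reshape(12, 2)
y_toy = np.array([0] * 10 + [1] * 2)

majority_idx = np.flatnonzero(y_toy == 0)
minority_idx = np.flatnonzero(y_toy == 1)

# Draw minority rows with replacement until both classes are the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

keep_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_over, y_over = X_toy[keep_idx], y_toy[keep_idx]

print(X_over.shape, np.bincount(y_over))
```

No new information is created: the oversampled fraud class still contains only the original distinct rows, each repeated several times.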


In [7]:
# 5a. Apply Random Undersampling to the training data
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print("\n--- 5. Random Undersampling ---")
print(f"New Training Shape: {X_train_rus.shape}. New Target Distribution: \n{y_train_rus.value_counts(normalize=True)}")

# 5b. Train and Evaluate the model
lr_rus = LogisticRegression(random_state=42)
lr_rus.fit(X_train_rus, y_train_rus)
y_pred_rus = lr_rus.predict(X_test)

print("\nEvaluation after Random Undersampling:")
print(classification_report(y_test, y_pred_rus))
print(f"Fraud Recall: {recall_score(y_test, y_pred_rus, pos_label=1):.4f}")


--- 5. Random Undersampling ---
New Training Shape: (139844, 7). New Target Distribution: 
fraud
0.0    0.5
1.0    0.5
Name: proportion, dtype: float64

Evaluation after Random Undersampling:
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Fraud Recall: 0.9475
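
Here the training set shrinks to 139,844 rows, which is 2 × 69,922: all 69,922 fraud rows are kept and an equal number of legitimate rows is drawn at random. A simplified NumPy sketch of the idea on toy data follows; the names and data are assumptions for illustration, and this is not imbalanced-learn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 10 legit rows (label 0) and 2 fraud rows (label 1)
X_toy = np.arange(24, dtype=float).reshape(12, 2)
y_toy = np.array([0] * 10 + [1] * 2)

majority_idx = np.flatnonzero(y_toy == 0)
minority_idx = np.flatnonzero(y_toy == 1)

# Keep every minority row, and sample the same number of majority rows
# without replacement
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

keep_idx = np.concatenate([kept_majority, minority_idx])
X_under, y_under = X_toy[keep_idx], y_toy[keep_idx]

print(X_under.shape, np.bincount(y_under))
```

The trade-off is that most legitimate rows are discarded, although in this lab the model trained on the smaller set reaches almost the same fraud recall as the oversampled one.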


In [8]:
# 6a. Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("\n--- 6. SMOTE Sampling ---")
print(f"New Training Shape: {X_train_smote.shape}. New Target Distribution: \n{y_train_smote.value_counts(normalize=True)}")

# 6b. Train and Evaluate the model
lr_smote = LogisticRegression(random_state=42)
lr_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = lr_smote.predict(X_test)

print("\nEvaluation after SMOTE Sampling:")
print(classification_report(y_test, y_pred_smote))
print(f"Fraud Recall: {recall_score(y_test, y_pred_smote, pos_label=1):.4f}")


--- 6. SMOTE Sampling ---
New Training Shape: (1460156, 7). New Target Distribution: 
fraud
0.0    0.5
1.0    0.5
Name: proportion, dtype: float64

Evaluation after SMOTE Sampling:
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Fraud Recall: 0.9481
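
Unlike random oversampling, SMOTE does not repeat existing rows. It creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that interpolation step on four made-up points. It is a deliberately simplified illustration, not imbalanced-learn's implementation; the function name `smote_like_sample` and the choice of `k=2` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four made-up minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])


def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and a near neighbour."""
    i = rng.integers(len(points))
    # Distances from the chosen point to every minority point
    dists = np.linalg.norm(points - points[i], axis=1)
    # The k closest points, skipping index 0 (the chosen point itself)
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    # Random position on the segment between the two points
    lam = rng.random()
    return points[i] + lam * (points[j] - points[i])


synthetic = np.array([smote_like_sample(minority, 2, rng) for _ in range(5)])
print(synthetic)
```

Because each synthetic point lies on a segment between two real fraud rows, it stays inside the region the fraud class already occupies.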
