# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 
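Step 3 asks for a metric that respects class importance. As a minimal sketch of why plain accuracy can mislead on imbalanced data, the snippet below scores a trivial "always legit" predictor on hypothetical toy labels (9 fraud cases in 100 transactions, not taken from the dataset). The helper name `precision_recall_f1` is an assumption made for illustration; it uses only the standard library.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 9 fraud cases among 100 transactions.
y_true = [1] * 9 + [0] * 91
# A "model" that always predicts legit (class 0).
y_pred_all_legit = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred_all_legit)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred_all_legit)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.91 precision=0.00 recall=0.00 f1=0.00
```

The predictor never catches a fraud case, yet its accuracy looks high, which is why the cells below report precision, recall, F1, and average precision for the fraud class.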

In [3]:
import pandas as pd  # pandas for data loading
import numpy as np   # numpy for numeric ops

# 1.1 Load the dataset
fraud = pd.read_csv(
    "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv"
)  # pandas supports reading CSV from a URL

# 1.2 Preview data
print(fraud.head())

# 1.3 Count total observations and class distribution
total_count = len(fraud)  # total rows
fraud_counts = fraud['fraud'].value_counts()  # counts of each class

print("\nTotal transactions:", total_count)
print("Class distribution:\n", fraud_counts)


   distance_from_home  distance_from_last_transaction  \
0           57.877857                        0.311140   
1           10.829943                        0.175592   
2            5.091079                        0.805153   
3            2.247564                        5.600044   
4           44.190936                        0.566486   

   ratio_to_median_purchase_price  repeat_retailer  used_chip  \
0                        1.945940              1.0        1.0   
1                        1.294219              1.0        0.0   
2                        0.427715              1.0        0.0   
3                        0.362663              1.0        1.0   
4                        2.222767              1.0        1.0   

   used_pin_number  online_order  fraud  
0              0.0           0.0    0.0  
1              0.0           0.0    0.0  
2              0.0           1.0    0.0  
3              0.0           1.0    0.0  
4              0.0           1.0    0.0  

Total transac
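To answer step 1, the class counts are easier to judge as proportions and as a majority-to-minority ratio. The sketch below does this with the standard library on hypothetical labels (913 legit, 87 fraud per 1,000 rows, chosen only to resemble the fraud rate this notebook reports further down); `class_distribution` and `imbalance_ratio` are illustrative names, not library functions.

```python
from collections import Counter


def class_distribution(labels):
    """Map each class to (count, share of all rows)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical labels, not the real dataset.
labels = [0.0] * 913 + [1.0] * 87

for cls, (n, share) in class_distribution(labels).items():
    print(f"class {cls}: {n} rows ({share:.1%})")
print(f"imbalance ratio: {imbalance_ratio(labels):.1f} to 1")
```

On the real data, `fraud['fraud'].value_counts(normalize=True)` gives the same proportions directly.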

In [4]:
from sklearn.model_selection import train_test_split  # splitting data

# 2.1 Separate features (X) and target (y)
X = fraud.drop(columns=['fraud'])  # all columns except 'fraud'
y = fraud['fraud']

# 2.2 Stratified train/test split to maintain class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.20, 
    random_state=42, 
    stratify=y
)  # stratify=y keeps the class proportions the same in both splits

print("Training set size:", X_train.shape)  # (800000, 7) for this dataset
print("Test set size:", X_test.shape)       # (200000, 7) for this dataset
print("Train class distribution:\n", y_train.value_counts(normalize=True))  # about 8.7% fraud


Training set size: (800000, 7)
Test set size: (200000, 7)
Train class distribution:
 fraud
0.0    0.912597
1.0    0.087402
Name: proportion, dtype: float64
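The `stratify=y` argument used above keeps the fraud rate the same in the training and test sets. The idea can be sketched in plain Python by splitting each class separately. This is an illustration of the concept under simple assumptions (hypothetical labels, a hypothetical `stratified_split` helper), not how scikit-learn implements it.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)                       # shuffle within the class
        n_test = round(len(idx) * test_size)   # same share taken from each class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 10% positives.
labels = [0] * 900 + [1] * 100
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positives: {train_rate:.1%}, test positives: {test_rate:.1%}")
```

Without stratification, a random split of a rare class can leave the test set with noticeably more or fewer fraud cases than the training set, which makes the evaluation harder to interpret.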


In [5]:
from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.metrics import (
    precision_recall_curve, 
    average_precision_score,  # PR AUC
    f1_score, 
    classification_report
)  # metrics suitable for imbalanced data

# 3.1 Initialize Logistic Regression with balanced class weights.
# Note: class_weight='balanced' already reweights the classes, so this
# baseline is not an unweighted fit on the imbalanced data.
lr = LogisticRegression(
    solver='lbfgs',           # scikit-learn's default solver
    class_weight='balanced',  # weights inversely proportional to class frequencies
    max_iter=1000,
    random_state=42
)

# 3.2 Train the model
lr.fit(X_train, y_train)

# 3.3 Predict probabilities on test set
y_scores = lr.predict_proba(X_test)[:, 1]  # probability of class 1 (fraud)

# 3.4 Compute the precision-recall curve and average precision (used here as PR AUC)
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
pr_auc = average_precision_score(y_test, y_scores)

# 3.5 Apply a threshold (here the default 0.5) and compute F1
y_pred_default = (y_scores >= 0.5).astype(int)
f1_baseline = f1_score(y_test, y_pred_default)

print("Baseline Logistic Regression (Imbalanced Data)")
print(f"  - Precision-Recall AUC: {pr_auc:.4f}")
print(f"  - F1-score (threshold=0.5): {f1_baseline:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred_default, digits=4))


Baseline Logistic Regression (Imbalanced Data)
  - Precision-Recall AUC: 0.7573
  - F1-score (threshold=0.5): 0.7175

Classification Report:
               precision    recall  f1-score   support

         0.0     0.9947    0.9335    0.9631    182519
         1.0     0.5772    0.9479    0.7175     17481

    accuracy                         0.9348    200000
   macro avg     0.7859    0.9407    0.8403    200000
weighted avg     0.9582    0.9348    0.9417    200000
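The baseline above is fit with `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training-set class counts that the resampling outputs later in this notebook report (730,078 legit and 69,922 fraud rows). The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weight per class: n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {cls: n_samples / (n_classes * n) for cls, n in class_counts.items()}


# Training-set class counts as reported by the resampling cells in this notebook.
train_counts = {0.0: 730078, 1.0: 69922}
weights = balanced_class_weights(train_counts)

for cls, w in weights.items():
    # count * weight is the total weight the class contributes to the loss
    print(f"class {cls}: weight {w:.4f}, total class weight {w * train_counts[cls]:.0f}")
```

Each class ends up contributing the same total weight, so this baseline is already rebalanced in effect. That may be why the resampled models later in the notebook score so close to it; an unweighted `LogisticRegression` would be needed to see the effect of the imbalance itself.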



In [6]:
from imblearn.over_sampling import RandomOverSampler  # for oversampling

# 4.1 Initialize RandomOverSampler
ros = RandomOverSampler(random_state=42)

# 4.2 Fit-resample training data
X_ros, y_ros = ros.fit_resample(X_train, y_train)
print("After Oversampling, class distribution:", pd.Series(y_ros).value_counts())

# 4.3 Retrain Logistic Regression on oversampled data
lr_ros = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
lr_ros.fit(X_ros, y_ros)

# 4.4 Evaluate on original test set
y_scores_ros = lr_ros.predict_proba(X_test)[:, 1]
precision_ros, recall_ros, _ = precision_recall_curve(y_test, y_scores_ros)
pr_auc_ros = average_precision_score(y_test, y_scores_ros)

y_pred_ros = (y_scores_ros >= 0.5).astype(int)
f1_ros = f1_score(y_test, y_pred_ros)

print("\nLogistic Regression after Oversampling")
print(f"  - PR AUC: {pr_auc_ros:.4f}")
print(f"  - F1-score (0.5): {f1_ros:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_ros, digits=4))


After Oversampling, class distribution: fraud
0.0    730078
1.0    730078
Name: count, dtype: int64

Logistic Regression after Oversampling
  - PR AUC: 0.7574
  - F1-score (0.5): 0.7177
Classification Report:
               precision    recall  f1-score   support

         0.0     0.9947    0.9336    0.9632    182519
         1.0     0.5774    0.9479    0.7177     17481

    accuracy                         0.9348    200000
   macro avg     0.7861    0.9408    0.8404    200000
weighted avg     0.9582    0.9348    0.9417    200000
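Random oversampling (used above) duplicates minority rows until the classes match, while random undersampling (used in the next cell) discards majority rows instead. A minimal standard-library sketch of both ideas follows; the function names and toy data are hypothetical, and imbalanced-learn's own implementations may differ in detail.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(X, y):
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups


def random_oversample(X, y, seed=42):
    """Grow every class to the majority size by sampling rows with replacement."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = rng.choices(rows, k=target - len(rows))  # duplicates of existing rows
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=42):
    """Shrink every class to the minority size by sampling rows without replacement."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, k=target))
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical toy data: 8 majority rows, 2 minority rows.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print("oversampled:", dict(Counter(y_over)))
print("undersampled:", dict(Counter(y_under)))
```

In both cases only the training data should be resampled; the test set keeps its original class distribution, as in the cells of this notebook.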



In [7]:
from imblearn.under_sampling import RandomUnderSampler  # for undersampling

# 5.1 Initialize RandomUnderSampler
rus = RandomUnderSampler(random_state=42)

# 5.2 Fit-resample training data
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print("After Undersampling, class distribution:", pd.Series(y_rus).value_counts())

# 5.3 Retrain Logistic Regression on undersampled data
lr_rus = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
lr_rus.fit(X_rus, y_rus)

# 5.4 Evaluate on original test set
y_scores_rus = lr_rus.predict_proba(X_test)[:, 1]
precision_rus, recall_rus, _ = precision_recall_curve(y_test, y_scores_rus)
pr_auc_rus = average_precision_score(y_test, y_scores_rus)

y_pred_rus = (y_scores_rus >= 0.5).astype(int)
f1_rus = f1_score(y_test, y_pred_rus)

print("\nLogistic Regression after Undersampling")
print(f"  - PR AUC: {pr_auc_rus:.4f}")
print(f"  - F1-score (0.5): {f1_rus:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_rus, digits=4))


After Undersampling, class distribution: fraud
0.0    69922
1.0    69922
Name: count, dtype: int64

Logistic Regression after Undersampling
  - PR AUC: 0.7565
  - F1-score (0.5): 0.7174
Classification Report:
               precision    recall  f1-score   support

         0.0     0.9946    0.9335    0.9631    182519
         1.0     0.5772    0.9475    0.7174     17481

    accuracy                         0.9347    200000
   macro avg     0.7859    0.9405    0.8403    200000
weighted avg     0.9582    0.9347    0.9416    200000
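SMOTE, the last technique in this lab, creates new minority rows instead of duplicating existing ones: it picks a minority point, picks one of its nearest minority neighbours, and interpolates between the two. The sketch below illustrates only that interpolation step on hypothetical 2-D points. The function name `smote_sketch` and its parameters are assumptions for illustration, and this is not imbalanced-learn's implementation.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=42):
    """Create synthetic points on segments between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority points: the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_synthetic=6, k=2)

for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region the minority class already occupies, which is the main difference from random oversampling's exact duplicates.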



In [8]:
from imblearn.over_sampling import SMOTE  # SMOTE class

# 6.1 Initialize SMOTE
sm = SMOTE(random_state=42)

# 6.2 Fit-resample training data
X_sm, y_sm = sm.fit_resample(X_train, y_train)
print("After SMOTE, class distribution:", pd.Series(y_sm).value_counts())

# 6.3 Retrain Logistic Regression on SMOTE data
lr_sm = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
lr_sm.fit(X_sm, y_sm)

# 6.4 Evaluate on original test set
y_scores_sm = lr_sm.predict_proba(X_test)[:, 1]
precision_sm, recall_sm, _ = precision_recall_curve(y_test, y_scores_sm)
pr_auc_sm = average_precision_score(y_test, y_scores_sm)

y_pred_sm = (y_scores_sm >= 0.5).astype(int)
f1_sm = f1_score(y_test, y_pred_sm)

print("\nLogistic Regression after SMOTE")
print(f"  - PR AUC: {pr_auc_sm:.4f}")
print(f"  - F1-score (0.5): {f1_sm:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_sm, digits=4))


After SMOTE, class distribution: fraud
0.0    730078
1.0    730078
Name: count, dtype: int64

Logistic Regression after SMOTE
  - PR AUC: 0.7617
  - F1-score (0.5): 0.7185
Classification Report:
               precision    recall  f1-score   support

         0.0     0.9945    0.9341    0.9634    182519
         1.0     0.5791    0.9462    0.7185     17481

    accuracy                         0.9352    200000
   macro avg     0.7868    0.9402    0.8409    200000
weighted avg     0.9582    0.9352    0.9420    200000
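To answer the "does it improve the performance?" questions, it helps to put the four runs side by side. The snippet below simply copies the fraud-class metrics printed in the outputs above into a dictionary and reports each run's difference from the baseline; the variable names are illustrative.

```python
# Fraud-class metrics copied from the outputs printed above.
results = {
    "Baseline (class_weight='balanced')": {"pr_auc": 0.7573, "f1": 0.7175, "precision": 0.5772, "recall": 0.9479},
    "Random oversampling": {"pr_auc": 0.7574, "f1": 0.7177, "precision": 0.5774, "recall": 0.9479},
    "Random undersampling": {"pr_auc": 0.7565, "f1": 0.7174, "precision": 0.5772, "recall": 0.9475},
    "SMOTE": {"pr_auc": 0.7617, "f1": 0.7185, "precision": 0.5791, "recall": 0.9462},
}

baseline = results["Baseline (class_weight='balanced')"]
for name, m in results.items():
    d_auc = m["pr_auc"] - baseline["pr_auc"]
    d_f1 = m["f1"] - baseline["f1"]
    print(f"{name:<36} PR AUC {m['pr_auc']:.4f} ({d_auc:+.4f})  F1 {m['f1']:.4f} ({d_f1:+.4f})")

best = max(results, key=lambda name: results[name]["pr_auc"])
f1_spread = max(m["f1"] for m in results.values()) - min(m["f1"] for m in results.values())
print(f"\nHighest PR AUC: {best}; F1 spread across runs: {f1_spread:.4f}")
```

On these numbers SMOTE scores highest, but the four runs differ by only about 0.005 in PR AUC and 0.001 in F1. With a single train/test split, differences this small are hard to distinguish from noise, so the resampling methods appear to give at most a marginal gain over the class-weighted baseline here.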

