# LAB | Imbalanced

**Load the data**

In this challenge, we will work with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction used a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0 = legit**, **1 = fraud**


In [53]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

In [54]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [55]:
print(fraud)

        distance_from_home  distance_from_last_transaction  \
0                57.877857                        0.311140   
1                10.829943                        0.175592   
2                 5.091079                        0.805153   
3                 2.247564                        5.600044   
4                44.190936                        0.566486   
...                    ...                             ...   
999995            2.207101                        0.112651   
999996           19.872726                        2.683904   
999997            2.914857                        1.472687   
999998            4.258729                        0.242023   
999999           58.108125                        0.318110   

        ratio_to_median_purchase_price  repeat_retailer  used_chip  \
0                             1.945940              1.0        1.0   
1                             1.294219              1.0        0.0   
2                             0.427715       

**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and select the appropriate metric.
- **4.** Apply **oversampling** to balance the target variable and repeat the steps above (1-3) with the balanced data. Does it improve the performance of the model?
- **5.** Now apply **undersampling** to balance the target variable and repeat the steps above (1-3) with the balanced data. Does it improve the performance of the model?
- **6.** Finally, apply **SMOTE** to balance the target variable and repeat the steps above (1-3) with the balanced data. Does it improve the performance of the model?

In [56]:
# 1. 
# Count occurrences of each class
fraud_counts = fraud['fraud'].value_counts()
fraud_percent = fraud['fraud'].value_counts(normalize=True) * 100

# Display results
print("Class distribution:")
print(fraud_counts)
print("\nPercentage distribution:")
print(fraud_percent.round(2))

Class distribution:
fraud
0.0    912597
1.0     87403
Name: count, dtype: int64

Percentage distribution:
fraud
0.0    91.26
1.0     8.74
Name: proportion, dtype: float64


- Class 0 (non-fraud): 91.26%

- Class 1 (fraud): 8.74%

Yes: with about 10 legitimate transactions for every fraudulent one, this is an imbalanced dataset.
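To see why plain accuracy is a poor metric for this target, consider a hypothetical classifier that labels every transaction as legitimate. The sketch below only uses the class counts printed above; the variable names are illustrative and not part of the lab.

```python
# Class counts taken from the value_counts() output above
n_legit = 912_597
n_fraud = 87_403
n_total = n_legit + n_fraud

# Hypothetical "always predict 0" classifier:
# every legitimate row is correct, every fraud row is missed
accuracy = n_legit / n_total
fraud_recall = 0 / n_fraud
imbalance_ratio = n_legit / n_fraud

print(f"Accuracy:        {accuracy:.4f}")
print(f"Fraud recall:    {fraud_recall:.4f}")
print(f"Imbalance ratio: {imbalance_ratio:.1f} : 1")
```

Such a model would score about 91% accuracy while catching no fraud at all, which is why the fraud-class recall, precision and F1-score are the more informative metrics here.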

In [57]:
# 2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


# Separate features and target
X = fraud.drop(columns=['fraud'])
y = fraud['fraud']
# Split the data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42)

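# Initialize the model
# Note: class_weight='balanced' already reweights the classes inversely to
# their frequency, so this model is not trained on the raw imbalance
log_model = LogisticRegression(
    solver='liblinear',
    class_weight='balanced',
    random_state=42
)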

# Train the model
log_model.fit(x_train, y_train)
# Predict labels and probabilities
y_pred = log_model.predict(x_test)
y_proba = log_model.predict_proba(x_test)[:, 1]

In [58]:
# 3
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Classification report
print(classification_report(y_test, y_pred, digits=4))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

              precision    recall  f1-score   support

         0.0     0.9950    0.9330    0.9630    182557
         1.0     0.5756    0.9513    0.7173     17443

    accuracy                         0.9346    200000
   macro avg     0.7853    0.9421    0.8401    200000
weighted avg     0.9585    0.9346    0.9416    200000

Confusion Matrix:
[[170325  12232]
 [   850  16593]]
ROC-AUC Score: 0.9796
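The fraud-class figures in the report can be re-derived by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and copies the four counts from the output above.

```python
# Confusion matrix from the output above:
# rows = true class, columns = predicted class
tn, fp = 170_325, 12_232   # true legit:  predicted legit / predicted fraud
fn, tp = 850, 16_593       # true fraud:  predicted legit / predicted fraud

precision = tp / (tp + fp)          # share of fraud alerts that are real fraud
recall = tp / (tp + fn)             # share of real fraud that is caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

Read this way, the model misses 850 of 17,443 frauds but raises 12,232 false alarms, so recall is high while precision is modest.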


In [59]:
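# Rebuild a single training frame so each class can be resampled separately
train = pd.DataFrame(x_train, columns = x_train.columns)
train["fraud"] = y_train.values
# Note: `fraud` is reassigned here and now holds only the training fraud rows
fraud = train[train["fraud"] == 1]
no_fraud = train[train["fraud"] == 0]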

In [60]:
# 4 oversampling

# Oversample the minority class
fraud_oversampled = resample(
    fraud,
    replace=True,                # Sample with replacement
    n_samples=len(no_fraud),   # Match majority class size
    random_state=42              # For reproducibility
)

# Combine with majority class
train_over = pd.concat([no_fraud, fraud_oversampled])


X_train_over = train_over.drop("fraud", axis=1)
y_train_over = train_over["fraud"]

# Train model
logreg = LogisticRegression(max_iter=1000)


logreg.fit(X_train_over, y_train_over)

# Predict
y_pred = logreg.predict(x_test)



# Classification report
print(classification_report(y_test, y_pred, digits=4))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# ROC-AUC score, computed from this model's predicted probabilities
y_proba = logreg.predict_proba(x_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

              precision    recall  f1-score   support

         0.0     0.9950    0.9332    0.9631    182557
         1.0     0.5762    0.9512    0.7177     17443

    accuracy                         0.9347    200000
   macro avg     0.7856    0.9422    0.8404    200000
weighted avg     0.9585    0.9347    0.9417    200000

Confusion Matrix:
[[170354  12203]
 [   851  16592]]
ROC-AUC Score: 0.9796


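Only marginally. With oversampling, fraud precision moves from 0.5756 to 0.5762 and F1-score from 0.7173 to 0.7177, while recall is essentially unchanged (0.9513 vs. 0.9512). This is expected, because the first model was already trained with `class_weight='balanced'`, which compensates for the imbalance in a similar way.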
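For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand, assuming the training counts implied by the 80/20 split (total class counts minus the test-set support shown in the reports). It is an illustration of the weighting, not a re-run of the model.

```python
# Training counts implied by the split: total counts minus test support
n_legit_train = 912_597 - 182_557   # 730,040 legitimate rows
n_fraud_train = 87_403 - 17_443     # 69,960 fraudulent rows
n_samples = n_legit_train + n_fraud_train
n_classes = 2

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
w_legit = n_samples / (n_classes * n_legit_train)
w_fraud = n_samples / (n_classes * n_fraud_train)

# Total weight carried by each class after reweighting
mass_legit = w_legit * n_legit_train
mass_fraud = w_fraud * n_fraud_train

print(f"w_legit={w_legit:.4f} w_fraud={w_fraud:.4f}")
print(f"mass_legit={mass_legit:.0f} mass_fraud={mass_fraud:.0f}")
```

Both classes end up carrying the same total weight, which is the same effect random oversampling achieves by duplicating fraud rows. That makes near-identical metrics between the two runs plausible.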


In [61]:
# 5 undersampling

train = pd.DataFrame(x_train, columns = x_train.columns)
train["fraud"] = y_train.values
fraud = train[train["fraud"] == 1]
no_fraud = train[train["fraud"] == 0]

# Undersample the majority class
no_fraud_undersampled = resample(
    no_fraud,
    replace=False,                 # No replacement
    n_samples=len(fraud),        # Match minority class size
    random_state=42                # For reproducibility
)

# Combine with minority class
train_under = pd.concat([fraud, no_fraud_undersampled])

X_train_under = train_under.drop("fraud", axis=1)
y_train_under = train_under["fraud"]

# Train model
logreg = LogisticRegression(max_iter=1000)


logreg.fit(X_train_under, y_train_under)

# Predict
y_pred = logreg.predict(x_test)

# Classification report
print(classification_report(y_test, y_pred, digits=4))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# ROC-AUC score, computed from this model's predicted probabilities
y_proba = logreg.predict_proba(x_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

              precision    recall  f1-score   support

         0.0     0.9951    0.9328    0.9630    182557
         1.0     0.5752    0.9518    0.7171     17443

    accuracy                         0.9345    200000
   macro avg     0.7851    0.9423    0.8400    200000
weighted avg     0.9585    0.9345    0.9415    200000

Confusion Matrix:
[[170296  12261]
 [   841  16602]]
ROC-AUC Score: 0.9796


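No. Undersampling leaves performance essentially unchanged: fraud recall rises slightly (0.9513 to 0.9518) while precision dips (0.5756 to 0.5752), even though the model is trained on far fewer rows.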
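The difference between the two random resampling strategies comes down to sampling with or without replacement. The sketch below mimics that on a toy dataset using only the standard library; the toy rows and variable names are illustrative, and `sklearn.utils.resample` is what the cells above actually use.

```python
import random

rng = random.Random(42)

# Toy training set: 90 majority rows (label 0) and 10 minority rows (label 1)
majority = [(i, 0) for i in range(90)]
minority = [(i, 1) for i in range(90, 100)]

# Random oversampling: draw minority rows WITH replacement
# until they match the majority size (rows get duplicated)
minority_over = rng.choices(minority, k=len(majority))
train_over = majority + minority_over

# Random undersampling: draw majority rows WITHOUT replacement
# down to the minority size (rows get discarded)
majority_under = rng.sample(majority, k=len(minority))
train_under = minority + majority_under

print(len(train_over), len(train_under))
```

Oversampling keeps every original row but adds no new information. Undersampling discards most of the majority class, which makes training cheaper at the risk of losing useful examples.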


In [62]:
# 6 SMOTE
smote = SMOTE(random_state=42, sampling_strategy=1.0)
X_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

# Train model
logreg = LogisticRegression(max_iter=1000)


logreg.fit(X_train_resampled, y_train_resampled)

# Predict
y_pred = logreg.predict(x_test)

# Classification report
print(classification_report(y_test, y_pred, digits=4))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# ROC-AUC score, computed from this model's predicted probabilities
y_proba = logreg.predict_proba(x_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

              precision    recall  f1-score   support

         0.0     0.9949    0.9337    0.9633    182557
         1.0     0.5777    0.9500    0.7185     17443

    accuracy                         0.9351    200000
   macro avg     0.7863    0.9418    0.8409    200000
weighted avg     0.9585    0.9351    0.9420    200000

Confusion Matrix:
[[170446  12111]
 [   872  16571]]
ROC-AUC Score: 0.9796
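Unlike random oversampling, SMOTE creates new minority rows by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below shows only that core idea on toy 2-D points. The function `smote_sample` and the data are hypothetical, and imblearn's `SMOTE` handles details that are not reproduced here.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority point and one of
    its k nearest minority neighbours (simplified illustration)."""
    base = rng.choice(minority)
    # k nearest neighbours of `base` among the other minority points
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # random position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


rng = random.Random(42)
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.8), (1.8, 1.6)]
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(20)]
print(synthetic[:3])
```

Every synthetic point lies on a line segment between two real minority points, so the new rows stay inside the region the minority class already occupies.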


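Slightly. SMOTE gives the highest fraud precision (0.5777) and F1-score (0.7185) of the four runs, at the cost of slightly lower recall (0.9500). The differences from the first model are small, on the order of a few thousandths.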
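The four runs can be compared side by side by recomputing the fraud-class metrics from their confusion matrices. The counts below are copied from the outputs above; the run labels and the `fraud_metrics` helper are illustrative names.

```python
# Confusion matrices copied from the outputs above, as (tn, fp, fn, tp)
runs = {
    "class_weight": (170_325, 12_232, 850, 16_593),
    "oversample":   (170_354, 12_203, 851, 16_592),
    "undersample":  (170_296, 12_261, 841, 16_602),
    "smote":        (170_446, 12_111, 872, 16_571),
}


def fraud_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


results = {name: fraud_metrics(*cm) for name, cm in runs.items()}
for name, (p, r, f1) in results.items():
    print(f"{name:<13} precision={p:.4f} recall={r:.4f} f1={f1:.4f}")

best_precision = max(results, key=lambda name: results[name][0])
best_recall = max(results, key=lambda name: results[name][1])
print(f"best precision: {best_precision}, best recall: {best_recall}")
```

The spread between the best and worst run is under 0.003 for both precision and recall, so on this dataset the choice of balancing technique matters far less than the decision to balance at all.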
