# LAB | Imbalanced

**Load the data**

In this challenge, we will work with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction came from the same retailer.
- **used_chip:** Whether the transaction was made with the card's chip.
- **used_pin_number:** Whether the transaction used a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [46]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [47]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [50]:
#1
fraud['fraud'].value_counts(normalize=True) * 100

fraud
0.0    91.2597
1.0     8.7403
Name: proportion, dtype: float64

Yes, we are dealing with an imbalanced dataset because one class (fraudulent transactions) represents a much smaller proportion (8.74%) compared to the other class (legitimate transactions, 91.26%).
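
One way to see why this imbalance matters: a model that always predicts the majority class already reaches an accuracy equal to that class's share. The short sketch below works this out from the percentages printed above; the variable names are illustrative only.

```python
# Class shares reported by value_counts(normalize=True) above, in percent.
class_share = {0.0: 91.2597, 1.0: 8.7403}

# The most frequent class and the accuracy of always predicting it.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class] / 100

# How many legitimate transactions there are per fraudulent one.
imbalance_ratio = class_share[0.0] / class_share[1.0]

print(f"Majority class: {majority_class}")
print(f"Accuracy of always predicting it: {baseline_accuracy:.4f}")
print(f"Legit-to-fraud ratio: {imbalance_ratio:.1f} : 1")
```

Any accuracy figure for this dataset should be read against that baseline of roughly 91%.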

In [52]:
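#2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Features are every column except the target
X = fraud.drop(columns='fraud')
y = fraud['fraud']

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a logistic regression on the original (imbalanced) training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the untouched test set and report per-class metrics
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))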


              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    273871
         1.0       0.89      0.60      0.72     26129

    accuracy                           0.96    300000
   macro avg       0.93      0.80      0.85    300000
weighted avg       0.96      0.96      0.95    300000



The model does well at identifying legitimate transactions but misses a significant portion of fraudulent transactions. 

#3
Overall accuracy on the imbalanced dataset is high at 96%, but this metric is misleading in an imbalanced context. The model performs well on legitimate transactions (class 0), but recall for fraudulent transactions (class 1) is only 60%, meaning it misses 40% of actual fraud cases. The F1-score for class 1 is 0.72, which hides the gap between precision (0.89) and recall (0.60). Since catching fraud is the priority, recall on class 1 is the metric to focus on, and balancing the training data is a reasonable next step to improve it.
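
To make the 60% recall concrete, the confusion-matrix counts can be approximated from the rounded figures in the classification report. This is a back-of-the-envelope reconstruction, not the exact matrix (which `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give), and the helper name `approx_confusion` is made up for this sketch.

```python
def approx_confusion(precision, recall, n_pos, n_neg):
    """Approximate (tp, fp, fn, tn) from rounded report values."""
    tp = round(recall * n_pos)                    # frauds caught
    fn = n_pos - tp                               # frauds missed
    fp = round(tp * (1 - precision) / precision)  # legit flagged as fraud
    tn = n_neg - fp                               # legit passed through
    return tp, fp, fn, tn

# Class 1 row of the first report: precision 0.89, recall 0.60.
# Supports: 26129 fraud rows and 273871 legit rows in the test set.
tp, fp, fn, tn = approx_confusion(0.89, 0.60, n_pos=26129, n_neg=273871)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Frauds caught: ~{tp}, frauds missed: ~{fn}")
print(f"Legit flagged as fraud: ~{fp}")
print(f"Accuracy implied by these counts: {accuracy:.2f}")
```

Roughly ten thousand missed frauds sit behind a 96% accuracy, which is why recall on class 1 is the number to watch here.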

In [55]:
#4

from imblearn.over_sampling import RandomOverSampler

# oversampling
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

# Train the model again on the oversampled data
model.fit(X_train_ros, y_train_ros)

# Evaluate
y_pred_ros = model.predict(X_test)
print(classification_report(y_test, y_pred_ros))


              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    273871
         1.0       0.57      0.95      0.72     26129

    accuracy                           0.93    300000
   macro avg       0.78      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



Oversampling significantly improved the recall for fraudulent transactions from 0.60 to 0.95, which is good, as it's now identifying almost all fraud cases.
However, this comes at the cost of precision, which dropped from 0.89 to 0.57. This means the model is now predicting more false positives (legitimate transactions mistakenly classified as fraud).

The F1-score for fraud stayed at 0.72.
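
The trade-off can be put into numbers using only the rounded precision and recall from the two reports. The sketch below computes the F1-score as the harmonic mean of the two, plus the number of false alarms per correctly flagged fraud. Because the inputs are rounded, the last digit can differ from the printed reports, and the function names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def false_alarms_per_hit(precision):
    """False positives for every true positive, i.e. (1 - p) / p."""
    return (1 - precision) / precision

# (precision, recall) for class 1, as printed in the reports above.
reports = {
    "original data": (0.89, 0.60),
    "oversampled": (0.57, 0.95),
}

for name, (p, r) in reports.items():
    print(f"{name}: F1 ~ {f1(p, r):.2f}, "
          f"false alarms per caught fraud ~ {false_alarms_per_hit(p):.2f}")
```

With these rounded inputs, oversampling moves from about one false alarm per eight caught frauds to about three per four, while F1 barely changes.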

So yes, oversampling improves the model's ability to identify fraudulent transactions, but more legitimate transactions are now flagged as fraud.
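
For intuition, random oversampling amounts to duplicating minority-class rows, drawn with replacement, until every class has as many rows as the largest one. The following is a minimal plain-Python sketch of that idea on toy data. It is not imblearn's implementation, and `random_oversample` is a hypothetical helper.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # sample with replacement
            y_out.append(label)
    return X_out, y_out

# Toy data: 8 legit rows (0) and 2 fraud rows (1).
X_toy = [[float(i)] for i in range(10)]
y_toy = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X_toy, y_toy)
print(Counter(y_toy), "->", Counter(y_res))
```

Only the training split is resampled, as in the cell above. The test set keeps its original class distribution so the metrics still reflect the real proportion of fraud.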

In [57]:
#5

from imblearn.under_sampling import RandomUnderSampler

# undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

# Train the model again on the undersampled data
model.fit(X_train_rus, y_train_rus)

# Evaluate 
y_pred_rus = model.predict(X_test)
print(classification_report(y_test, y_pred_rus))


              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    273871
         1.0       0.57      0.95      0.71     26129

    accuracy                           0.93    300000
   macro avg       0.78      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



Undersampling gave similar results to oversampling. The recall for fraud remained high (95%), and precision remained low (57%). The F1-score for fraud dropped marginally (0.72 to 0.71); since precision and recall round to the same values as with oversampling, the difference is small. It may reflect that undersampling reduces the amount of legitimate transaction data available for the model to learn from.
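
Random undersampling works in the opposite direction: it keeps a random subset of each larger class, sized to the smallest class. Below is a minimal plain-Python sketch on toy data. It is not imblearn's implementation, and `random_undersample` is a hypothetical helper.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=42):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):  # sample without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out

# Toy data: 8 legit rows (0) and 2 fraud rows (1).
X_toy = [[float(i)] for i in range(10)]
y_toy = [0] * 8 + [1] * 2

X_res, y_res = random_undersample(X_toy, y_toy)
print(Counter(y_toy), "->", Counter(y_res))
```

At the class shares seen earlier (about 91% legit, 9% fraud), balancing this way keeps only around one in ten legitimate training rows.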


In [59]:
#6
from imblearn.over_sampling import SMOTE

# SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train the model again on the SMOTE data
model.fit(X_train_smote, y_train_smote)

# Evaluate 
y_pred_smote = model.predict(X_test)
print(classification_report(y_test, y_pred_smote))



              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273871
         1.0       0.58      0.95      0.72     26129

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



SMOTE provided very similar results to oversampling and undersampling, with a slight improvement in precision (0.58 vs. 0.57).
The recall for fraudulent transactions remains high (95%), indicating the model is successfully detecting most fraud cases.
The F1-score for fraud (0.72) is in line with the other approaches: SMOTE trades precision for recall in much the same way and brings no significant improvement over random oversampling or undersampling.
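
Unlike random oversampling, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows that single interpolation step in plain Python on toy 2-D points. It is a simplification for illustration, not imblearn's implementation, and `smote_like_sample` and `fraud_points` are hypothetical names.

```python
import math
import random

def smote_like_sample(minority, k=2, seed=42):
    """Create one synthetic point between a random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Other minority points, sorted by Euclidean distance to the base point.
    others = sorted((p for p in minority if p is not base),
                    key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Toy minority-class points (made up for this sketch).
fraud_points = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [5.0, 5.0]]
synthetic = smote_like_sample(fraud_points)
print(synthetic)
```

Because each synthetic point lies on a segment between two existing fraud points, SMOTE adds variety without leaving the region the minority class already occupies.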
