# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to the place where the transaction happened.
- **distance_from_last_transaction:** Distance from the place where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer as before.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [8]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [20]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")

df.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [16]:
# Save a local copy of the dataset
df.to_csv('fraud.csv', index=False)

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix


# Step 1: Check the distribution of the target variable 'fraud'
fraud_distribution = df['fraud'].value_counts(normalize=True)
print("Fraud distribution:")
print(fraud_distribution)

# Split the data into features and target
X = df.drop('fraud', axis=1)
y = df['fraud']

# Split the data into train and test sets (80% train, 20% test),
# stratifying on the target so both sets keep the same fraud ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features (the scaler is fitted on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Train a Logistic Regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Step 3: Evaluate the model using classification report and confusion matrix
y_pred = log_reg.predict(X_test_scaled)
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print results
print("Classification Report:")
print(classification_rep)
print("Confusion Matrix:")
print(conf_matrix)

Fraud distribution:
fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64
Classification Report:
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182519
         1.0       0.90      0.61      0.72     17481

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000

Confusion Matrix:
[[181296   1223]
 [  6895  10586]]
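
The fraud-class figures in the report can be reproduced from the confusion matrix alone, which also makes it easier to see why accuracy is a weak metric for this target. The sketch below uses only the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
tn, fp = 181296, 1223    # actual legit: correctly passed, wrongly flagged
fn, tp = 6895, 10586     # actual fraud: missed, correctly flagged

total = tn + fp + fn + tp

# Metrics for the fraud class (label 1.0)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Overall accuracy versus a trivial model that always predicts "legit"
accuracy = (tp + tn) / total
majority_accuracy = (tn + fp) / total

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f} majority-class accuracy={majority_accuracy:.2f}")
```

Accuracy is only a few points above what a model that never flags fraud would achieve, while fraud recall shows that roughly 39% of fraudulent transactions are missed. For this reason, recall (together with precision) on the fraud class looks like a more informative metric here than accuracy.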


In [26]:
%pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [24]:
# Step 4: Apply Oversampling to balance the dataset and repeat the previous steps
from imblearn.over_sampling import RandomOverSampler

# Apply oversampling using RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train_scaled, y_train)

# Train the Logistic Regression model on the oversampled data
log_reg_oversampled = LogisticRegression(random_state=42)
log_reg_oversampled.fit(X_resampled, y_resampled)

# Evaluate the model on the test set
y_pred_oversampled = log_reg_oversampled.predict(X_test_scaled)
classification_rep_oversampled = classification_report(y_test, y_pred_oversampled)
conf_matrix_oversampled = confusion_matrix(y_test, y_pred_oversampled)

# Print the results
print("Classification Report (Oversampled):")
print(classification_rep_oversampled)
print("Confusion Matrix (Oversampled):")
print(conf_matrix_oversampled)

Classification Report (Oversampled):
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix (Oversampled):
[[170390  12129]
 [   911  16570]]
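
To judge whether oversampling helped, it can be useful to look at how the two kinds of error moved relative to the baseline. The sketch below compares the two confusion matrices printed in this notebook; the dictionary names are illustrative.

```python
# Error counts taken from the two confusion matrices printed in this notebook
baseline = {"fn": 6895, "fp": 1223, "tp": 10586}
oversampled = {"fn": 911, "fp": 12129, "tp": 16570}

# Additional fraud cases caught after oversampling
extra_fraud_caught = oversampled["tp"] - baseline["tp"]

# Additional legitimate transactions wrongly flagged as fraud
extra_false_alarms = oversampled["fp"] - baseline["fp"]

# Cost of the trade-off: new false alarms per additional fraud caught
false_alarms_per_extra_catch = extra_false_alarms / extra_fraud_caught

print(f"extra fraud caught: {extra_fraud_caught}")
print(f"extra false alarms: {extra_false_alarms}")
print(f"false alarms per extra catch: {false_alarms_per_extra_catch:.2f}")
```

Oversampling buys a large gain in fraud recall at the price of many more false alarms. Whether that trade is an improvement depends on the relative cost of a missed fraud versus a wrongly flagged transaction, which the dataset does not specify.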


In [28]:
from imblearn.under_sampling import RandomUnderSampler

# Step 5: Apply undersampling using RandomUnderSampler and repeat the previous steps
undersampler = RandomUnderSampler(random_state=42)
X_resampled_under, y_resampled_under = undersampler.fit_resample(X_train_scaled, y_train)

# Train the Logistic Regression model on the undersampled data
log_reg_undersampled = LogisticRegression(random_state=42)
log_reg_undersampled.fit(X_resampled_under, y_resampled_under)

# Evaluate the model on the test set
y_pred_undersampled = log_reg_undersampled.predict(X_test_scaled)
classification_rep_undersampled = classification_report(y_test, y_pred_undersampled)
conf_matrix_undersampled = confusion_matrix(y_test, y_pred_undersampled)

# Print the results
print("Classification Report (Undersampled):")
print(classification_rep_undersampled)
print("Confusion Matrix (Undersampled):")
print(conf_matrix_undersampled)


Classification Report (Undersampled):
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix (Undersampled):
[[170394  12125]
 [   918  16563]]
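
The undersampled model performs almost the same as the oversampled one while training on far fewer rows. The following back-of-the-envelope sketch estimates the training set sizes. It assumes the full dataset has 1,000,000 rows (inferred from the 200,000 test rows at `test_size=0.2`) and that both samplers use their default strategy of equalizing the two classes; the counts are derived from the printed outputs, not measured.

```python
# Assumption: 1,000,000 rows in total (200,000 test rows are 20% of the data)
total_rows = 1_000_000

# Class counts in the full dataset, from the printed proportions
total_fraud = round(0.087403 * total_rows)
total_legit = total_rows - total_fraud

# Class counts in the test set, from the "support" column of the report
test_legit, test_fraud = 182519, 17481

# What remains for training after the stratified split
train_fraud = total_fraud - test_fraud
train_legit = total_legit - test_legit

# Default behaviour assumed: oversampling grows the minority class up to the
# majority count, undersampling shrinks the majority class down to the minority count
oversampled_rows = 2 * train_legit
undersampled_rows = 2 * train_fraud

print(f"train: legit={train_legit}, fraud={train_fraud}")
print(f"rows after oversampling:  {oversampled_rows}")
print(f"rows after undersampling: {undersampled_rows}")
```

Under these assumptions the undersampled training set is roughly a tenth the size of the oversampled one, which suggests that for this model the extra duplicated rows add little information.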


In [30]:
from imblearn.over_sampling import SMOTE

# Step 6: Apply SMOTE and repeat the previous steps
smote = SMOTE(random_state=42)
X_resampled_smote, y_resampled_smote = smote.fit_resample(X_train_scaled, y_train)

# Train the Logistic Regression model on the SMOTE resampled data
log_reg_smote = LogisticRegression(random_state=42)
log_reg_smote.fit(X_resampled_smote, y_resampled_smote)

# Evaluate the model on the test set
y_pred_smote = log_reg_smote.predict(X_test_scaled)
classification_rep_smote = classification_report(y_test, y_pred_smote)
conf_matrix_smote = confusion_matrix(y_test, y_pred_smote)

# Print the results
print("Classification Report (SMOTE):")
print(classification_rep_smote)
print("Confusion Matrix (SMOTE):")
print(conf_matrix_smote)


Classification Report (SMOTE):
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix (SMOTE):
[[170386  12133]
 [   907  16574]]
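
To answer the "does it improve performance?" questions side by side, the four confusion matrices can be summarized in one place. This is a minimal sketch built only from the matrices printed in this notebook; `fraud_metrics` is a hypothetical helper, not part of scikit-learn.

```python
# Confusion matrices printed in this notebook: [[TN, FP], [FN, TP]]
results = {
    "baseline":     [[181296, 1223], [6895, 10586]],
    "oversampled":  [[170390, 12129], [911, 16570]],
    "undersampled": [[170394, 12125], [918, 16563]],
    "smote":        [[170386, 12133], [907, 16574]],
}

def fraud_metrics(cm):
    """Return (precision, recall, f1) for the fraud class of a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

summary = {name: fraud_metrics(cm) for name, cm in results.items()}

for name, (precision, recall, f1) in summary.items():
    print(f"{name:<13} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On these numbers the three resampling methods behave almost identically: fraud recall rises from about 0.61 to about 0.95, precision falls from about 0.90 to about 0.58, and F1 stays near 0.72. Resampling therefore shifts the balance between missed fraud and false alarms rather than improving both at once.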
