# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home at which the transaction happened.
- **distance_from_last_transaction:** the distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** the ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** whether the transaction happened at the same retailer.
- **used_chip:** whether the transaction was made through the chip (credit card).
- **used_pin_number:** whether the transaction was made using a PIN number.
- **online_order:** whether the transaction was an online order.
- **fraud:** whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 
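
For step 3, the choice of metric matters more than usual. As a minimal sketch with made-up labels (the counts below are hypothetical, not taken from the dataset), a baseline that never flags fraud still reaches high accuracy while its recall on the fraud class is zero. The helper `binary_metrics` is an illustrative name, not a library function.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 90 legitimate (0) and 10 fraudulent (1) transactions
y_true = [0] * 90 + [1] * 10

# Baseline that always predicts "legit"
always_legit = [0] * 100

# Hypothetical model: 5 false alarms among the legit cases, 8 of 10 frauds caught
flags_fraud = [0] * 85 + [1] * 5 + [1] * 8 + [0] * 2

baseline = binary_metrics(y_true, always_legit)
model_like = binary_metrics(y_true, flags_fraud)

print("Always legit:", baseline)
print("Flags fraud: ", model_like)
```

The two accuracy values are close, but only the second set of predictions finds any fraud, which is why recall, precision and F1 on the fraud class are usually more informative here than accuracy.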

In [None]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

# Load the data
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
print(fraud.head())  # a bare expression in the middle of a cell displays nothing

# Step 1: Check the distribution of the target variable
class_share = fraud['fraud'].value_counts(normalize=True)
print("Class distribution:\n", class_share)
# Rule of thumb used here: call the data imbalanced when the minority class is below 20%
print("\nThe dataset is imbalanced." if class_share.min() < 0.2 else "\nThe dataset is balanced.")

# Split data into features and target
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 2: Train a Logistic Regression Model on the original data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 3: Evaluate the Model
y_pred = model.predict(X_test)
print("Initial Model Performance (Imbalanced Data):")
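print("Accuracy:", accuracy_score(y_test, y_pred))  # can mislead on imbalanced data; check fraud-class recall below
# ROC AUC needs scores rather than hard labels, so use the predicted fraud probability
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))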
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

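# Step 4: Oversample (Random Over Sampling)
# Resample only the training split so the test set keeps its original class distribution.
# Minority rows are drawn with replacement until they match the majority count.
X_train_oversampled, y_train_oversampled = resample(
    X_train[y_train == 1], y_train[y_train == 1],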
    replace=True, n_samples=X_train[y_train == 0].shape[0], random_state=42
)
X_train_balanced = pd.concat([X_train[y_train == 0], X_train_oversampled])
y_train_balanced = pd.concat([y_train[y_train == 0], y_train_oversampled])

# Train and evaluate the model on oversampled data
model.fit(X_train_balanced, y_train_balanced)
y_pred_oversampled = model.predict(X_test)
print("\nModel Performance (Oversampled Data):")
print("Accuracy:", accuracy_score(y_test, y_pred_oversampled))
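# ROC AUC needs scores rather than hard labels, so use the predicted fraud probability
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))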
print(classification_report(y_test, y_pred_oversampled))
print(confusion_matrix(y_test, y_pred_oversampled))

# Step 5: Undersample
# Majority rows are drawn without replacement down to the minority count (training split only).
X_train_undersampled, y_train_undersampled = resample(
    X_train[y_train == 0], y_train[y_train == 0],
    replace=False, n_samples=X_train[y_train == 1].shape[0], random_state=42
)
X_train_balanced = pd.concat([X_train_undersampled, X_train[y_train == 1]])
y_train_balanced = pd.concat([y_train_undersampled, y_train[y_train == 1]])

# Train and evaluate the model on undersampled data
model.fit(X_train_balanced, y_train_balanced)
y_pred_undersampled = model.predict(X_test)
print("\nModel Performance (Undersampled Data):")
print("Accuracy:", accuracy_score(y_test, y_pred_undersampled))
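# ROC AUC needs scores rather than hard labels, so use the predicted fraud probability
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))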
print(classification_report(y_test, y_pred_undersampled))
print(confusion_matrix(y_test, y_pred_undersampled))

# Step 6: SMOTE (Synthetic Minority Over-sampling Technique)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train and evaluate the model on SMOTE data
model.fit(X_train_smote, y_train_smote)
y_pred_smote = model.predict(X_test)
print("\nModel Performance (SMOTE Data):")
print("Accuracy:", accuracy_score(y_test, y_pred_smote))
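# ROC AUC needs scores rather than hard labels, so use the predicted fraud probability
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))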
print(classification_report(y_test, y_pred_smote))
print(confusion_matrix(y_test, y_pred_smote))
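
To make step 6 less of a black box, the sketch below shows the core idea behind SMOTE: each synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration in plain Python, not the implementation used by `imblearn`; the function name `smote_like_samples` and the example points are hypothetical.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting point
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority points
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position along the segment base -> neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority-class points (not taken from the dataset)
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.8)]

new_points = smote_like_samples(minority, n_new=6)
for point in new_points:
    print(point)
```

Unlike random oversampling, which repeats existing fraud rows, this produces new points that lie between existing ones, so every synthetic point stays inside the region already covered by the minority class.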
