# LAB | Imbalanced

**Load the data**

In this challenge, we will work with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [5]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt

In [7]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
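
Step 3 asks for a metric that reflects class importance. The sketch below illustrates why plain accuracy can mislead on imbalanced data. It uses hypothetical toy labels (950 legit, 50 fraud) and two made-up prediction vectors, not the lab dataset; the helper name `summarize` is also just an illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical toy labels: 950 legit (0) followed by 50 fraud (1) transactions.
y_true = np.array([0] * 950 + [1] * 50)

# "Model" A: always predicts the majority class (legit).
y_always_legit = np.zeros_like(y_true)

# "Model" B: catches 40 of the 50 frauds but raises 30 false alarms.
y_detector = y_true.copy()
y_detector[950:960] = 0  # 10 missed frauds (false negatives)
y_detector[:30] = 1      # 30 legit transactions flagged (false positives)


def summarize(name, y_pred):
    """Collect accuracy plus fraud-class (label 1) precision, recall and F1."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_fraud": precision_score(y_true, y_pred, pos_label=1, zero_division=0),
        "recall_fraud": recall_score(y_true, y_pred, pos_label=1, zero_division=0),
        "f1_fraud": f1_score(y_true, y_pred, pos_label=1, zero_division=0),
    }


results = [
    summarize("always legit", y_always_legit),
    summarize("detector", y_detector),
]
for row in results:
    print(row)
```

In this toy setup both models have similar accuracy (0.95 and 0.96), yet the first one detects no fraud at all (recall 0.0) while the second detects 80% of it. For this lab, that suggests reading the fraud-class recall, precision and F1 from `classification_report` rather than overall accuracy.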

In [None]:
# Inspect the data
print(fraud.head())
print(fraud['fraud'].value_counts(normalize=True))  # Distribution of target variable

# Plot the distribution of the target variable
sns.countplot(x='fraud', data=fraud)  # pass the column by keyword; recent seaborn treats the first positional argument as data
plt.title('Distribution of Fraud (Target Variable)')
plt.show()

# Identify imbalance
is_imbalanced = fraud['fraud'].value_counts(normalize=True).max() > 0.75
print(f"Is the dataset imbalanced? {'Yes' if is_imbalanced else 'No'}")

# Split the data
X = fraud.drop(columns=['fraud'])
y = fraud['fraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
print("Initial Model Evaluation")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Oversampling using resample
# Resample only the training split: resampling before the split would put
# duplicated fraud rows in both train and test and inflate the scores.
print("\nOversampling...")
train_data = pd.concat([X_train, y_train], axis=1)
data_majority = train_data[train_data.fraud == 0]
data_minority = train_data[train_data.fraud == 1]

# Upsample minority class
data_minority_upsampled = resample(data_minority,
                                   replace=True,
                                   n_samples=len(data_majority),
                                   random_state=42)

data_oversampled = pd.concat([data_majority, data_minority_upsampled])
X_train_over = data_oversampled.drop(columns=['fraud'])
y_train_over = data_oversampled['fraud']

# Train on the balanced training data, evaluate on the original test set
model.fit(X_train_over, y_train_over)
y_pred_over = model.predict(X_test)

print("Evaluation with Oversampled Data")
print(confusion_matrix(y_test, y_pred_over))
print(classification_report(y_test, y_pred_over))

# Undersampling
# Downsample only the training split; the test set keeps the original
# class distribution so the scores stay comparable with the baseline.
print("\nUndersampling...")
train_data = pd.concat([X_train, y_train], axis=1)
train_majority = train_data[train_data.fraud == 0]
train_minority = train_data[train_data.fraud == 1]

data_majority_downsampled = resample(train_majority,
                                     replace=False,
                                     n_samples=len(train_minority),
                                     random_state=42)

data_undersampled = pd.concat([data_majority_downsampled, train_minority])
X_train_under = data_undersampled.drop(columns=['fraud'])
y_train_under = data_undersampled['fraud']

# Train on the balanced training data, evaluate on the original test set
model.fit(X_train_under, y_train_under)
y_pred_under = model.predict(X_test)

print("Evaluation with Undersampled Data")
print(confusion_matrix(y_test, y_pred_under))
print(classification_report(y_test, y_pred_under))

# SMOTE
# Generate synthetic fraud samples from the training split only, so no
# synthetic or duplicated rows end up in the test set.
print("\nSMOTE...")
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train on the balanced training data, evaluate on the original test set
model.fit(X_train_smote, y_train_smote)
y_pred_smote = model.predict(X_test)

print("Evaluation with SMOTE Data")
print(confusion_matrix(y_test, y_pred_smote))
print(classification_report(y_test, y_pred_smote))

   distance_from_home  distance_from_last_transaction  \
0           57.877857                        0.311140   
1           10.829943                        0.175592   
2            5.091079                        0.805153   
3            2.247564                        5.600044   
4           44.190936                        0.566486   

   ratio_to_median_purchase_price  repeat_retailer  used_chip  \
0                        1.945940              1.0        1.0   
1                        1.294219              1.0        0.0   
2                        0.427715              1.0        0.0   
3                        0.362663              1.0        1.0   
4                        2.222767              1.0        1.0   

   used_pin_number  online_order  fraud  
0              0.0           0.0    0.0  
1              0.0           0.0    0.0  
2              0.0           1.0    0.0  
3              0.0           1.0    0.0  
4              0.0           1.0    0.0  
fraud
0.0    0