# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home at which the transaction happened.
- **distance_from_last_transaction:** Distance from the location of the previous transaction.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction came from the same retailer.
- **used_chip:** Whether the transaction used the card's chip.
- **used_pin_number:** Whether the transaction used a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 
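
Step 1 can be answered by counting the labels in the target column. The sketch below is a minimal illustration on a small hand-made label list (hypothetical data, not the real dataset), and `class_shares` is a helper name introduced here only for illustration. On the loaded DataFrame, `fraud["fraud"].value_counts(normalize=True)` should report the same kind of shares.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: share of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels: 9 legitimate rows and 1 fraudulent row
toy_labels = [0.0] * 9 + [1.0]
shares = class_shares(toy_labels)
print(shares)
```

If one class holds only a small share of the rows, as in this toy example, the dataset can reasonably be called imbalanced.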

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Define X and y
# The target column in this dataset is 'fraud'
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Keep the class proportions in both splits
)

# Train LogisticRegression
log_reg = LogisticRegression(max_iter=1000)  # high max_iter to help convergence
log_reg.fit(X_train, y_train)


In [None]:
y_pred = log_reg.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['No Fraud', 'Fraud']))
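
With a rare positive class, accuracy can look high even when much of the fraud is missed, so precision, recall and F1 for the fraud class are usually the more informative numbers. A minimal sketch of that effect follows. It uses made-up confusion-matrix counts (not results from this dataset), laid out the way scikit-learn's `confusion_matrix` lays them out, with rows as true classes and columns as predicted classes. `fraud_metrics` is an illustrative helper name.

```python
def fraud_metrics(cm):
    """Compute accuracy plus precision/recall/F1 for the positive class.

    cm is a 2x2 nested list [[tn, fp], [fn, tp]].
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 900 legitimate and 100 fraudulent test rows
toy_cm = [[890, 10], [40, 60]]
metrics = fraud_metrics(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is 0.95 while recall on the fraud class is only 0.60, meaning four in ten frauds go undetected.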


In [None]:
# Only needed if imbalanced-learn is not installed yet (comment out otherwise)
!pip install imbalanced-learn

from imblearn.over_sampling import RandomOverSampler

# Create the oversampler
ros = RandomOverSampler(random_state=42)

# Resample the training split only; the test set stays untouched
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print("Class distribution before oversampling:", y_train.value_counts())
print("Class distribution after oversampling:", y_train_ros.value_counts())


In [None]:
log_reg_ros = LogisticRegression(max_iter=1000)
log_reg_ros.fit(X_train_ros, y_train_ros)

y_pred_ros = log_reg_ros.predict(X_test)

print("Confusion Matrix (Oversampled):")
print(confusion_matrix(y_test, y_pred_ros))

print("\nClassification Report (Oversampled):")
print(classification_report(y_test, y_pred_ros, target_names=['No Fraud', 'Fraud']))


In [None]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print("Class distribution before undersampling:", y_train.value_counts())
print("Class distribution after undersampling:", y_train_rus.value_counts())
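
Conceptually, both random samplers only choose row indices: oversampling draws minority rows with replacement until the classes match, and undersampling keeps a random subset of the majority rows. The sketch below shows that idea for binary labels held in plain Python lists. The helper names are illustrative, and this is a simplification rather than imbalanced-learn's actual implementation.

```python
import random
from collections import Counter

def split_by_size(labels):
    """Return (minority_indices, majority_indices) for binary labels."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    groups = sorted(by_class.values(), key=len)
    return groups[0], groups[-1]

def random_oversample_indices(labels, seed=42):
    """Keep all rows and add minority rows drawn with replacement."""
    rng = random.Random(seed)
    minority, majority = split_by_size(labels)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def random_undersample_indices(labels, seed=42):
    """Keep all minority rows and a random subset of majority rows."""
    rng = random.Random(seed)
    minority, majority = split_by_size(labels)
    return rng.sample(majority, k=len(minority)) + minority

# Hypothetical toy labels: 8 legitimate rows, 2 fraudulent rows
toy_labels = [0] * 8 + [1] * 2
over = random_oversample_indices(toy_labels)
under = random_undersample_indices(toy_labels)
print(Counter(toy_labels[i] for i in over))
print(Counter(toy_labels[i] for i in under))
```

Oversampling repeats existing fraud rows, while undersampling discards legitimate rows, so the two trade duplicated information against lost information.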


In [None]:
log_reg_rus = LogisticRegression(max_iter=1000)
log_reg_rus.fit(X_train_rus, y_train_rus)

y_pred_rus = log_reg_rus.predict(X_test)

print("Confusion Matrix (Undersampled):")
print(confusion_matrix(y_test, y_pred_rus))

print("\nClassification Report (Undersampled):")
print(classification_report(y_test, y_pred_rus, target_names=['No Fraud', 'Fraud']))


In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

print("Class distribution before SMOTE:", y_train.value_counts())
print("Class distribution after SMOTE:", y_train_sm.value_counts())
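
Unlike random oversampling, SMOTE creates new minority rows by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below illustrates only that interpolation step, assuming a tiny hand-made minority set and a brute-force neighbor search. `smote_like_samples` and its parameters are illustrative names, not the imbalanced-learn implementation.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points between minority points and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base (brute force, excluding base itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Hypothetical minority-class points with two features
toy_minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like_samples(toy_minority, n_new=3)
print(new_points)
```

Each synthetic point lies on the line segment between two existing minority points, so the new rows are not exact duplicates of rows already in the training set.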


In [None]:
log_reg_sm = LogisticRegression(max_iter=1000)
log_reg_sm.fit(X_train_sm, y_train_sm)

y_pred_sm = log_reg_sm.predict(X_test)

print("Confusion Matrix (SMOTE):")
print(confusion_matrix(y_test, y_pred_sm))

print("\nClassification Report (SMOTE):")
print(classification_report(y_test, y_pred_sm, target_names=['No Fraud', 'Fraud']))
