# LAB | Imbalanced

**Load the data**

In this challenge, we will work with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home at which the transaction took place.
- **distance_from_last_transaction:** Distance from the location of the previous transaction.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction was made at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [23]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [24]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

|   | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [25]:
fraud.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column                          Non-Null Count    Dtype  
---  ------                          --------------    -----  
 0   distance_from_home              1000000 non-null  float64
 1   distance_from_last_transaction  1000000 non-null  float64
 2   ratio_to_median_purchase_price  1000000 non-null  float64
 3   repeat_retailer                 1000000 non-null  float64
 4   used_chip                       1000000 non-null  float64
 5   used_pin_number                 1000000 non-null  float64
 6   online_order                    1000000 non-null  float64
 7   fraud                           1000000 non-null  float64
dtypes: float64(8)
memory usage: 61.0 MB


**Steps:**

In [26]:
# - **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
fraud['fraud'].value_counts()

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64
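To answer the second half of the question, it helps to turn the counts into proportions. The sketch below rebuilds the counts printed above by hand (an assumption made so the snippet runs without downloading the CSV); on the real DataFrame, `fraud['fraud'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0.0: 912597, 1.0: 87403}, name="count")

# Share of each class in the full dataset
proportions = counts / counts.sum()

# How many legitimate transactions there are per fraudulent one
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(4))
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

Fraud makes up roughly 8.7% of the rows, about one fraudulent transaction for every ten legitimate ones, so the dataset is imbalanced.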

In [27]:
# - **2.** Train a LogisticRegression.

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Splitting the data into features and target
X = fraud.drop(columns=['fraud'])
y = fraud['fraud']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [29]:
# - **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
y_pred = model.predict(X_test)
no_scale_report = classification_report(y_test, y_pred)
print("Classification Report:\n", no_scale_report)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182557
         1.0       0.89      0.60      0.72     17443

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.95    200000

Confusion Matrix:
 [[181280   1277]
 [  6992  10451]]
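Since fraud is the class that matters here, accuracy alone can be misleading. The following sketch recomputes the fraud-class metrics directly from the confusion matrix printed above and compares accuracy against a baseline that always predicts "legit". The variable names are illustrative, not part of the original notebook.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[181280, 1277],
               [6992, 10451]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many really are
recall = tp / (tp + fn)      # of the real fraud cases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that labels every transaction as legitimate
baseline_accuracy = (tn + fp) / cm.sum()

print(f"accuracy={accuracy:.2f}  baseline accuracy={baseline_accuracy:.2f}")
print(f"fraud precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

Accuracy is 0.96, but a model that never flags fraud would already reach about 0.91. Recall on the fraud class (0.60) shows that around 40% of fraudulent transactions are missed, which is why recall and F1 for class 1.0 are the more informative metrics here.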


In [30]:
# **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 

from imblearn.over_sampling import RandomOverSampler

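# Apply random oversampling to balance the dataset
# Note: resampling before the train-test split puts duplicated fraud rows in both sets,
# so the scores below are measured on a resampled test set and are likely optimistic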
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)

# Train-test split with resampled data
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

# LogisticRegression
model_resampled = LogisticRegression(max_iter=500)
model_resampled.fit(X_train_resampled, y_train_resampled)

# Evaluate the model
y_pred_resampled = model_resampled.predict(X_test_resampled)
oversamled_report = classification_report(y_test_resampled, y_pred_resampled)
print("Classification Report after Oversampling:\n", oversamled_report)
print("Confusion Matrix after Oversampling:\n", confusion_matrix(y_test_resampled, y_pred_resampled))



Classification Report after Oversampling:
               precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    182421
         1.0       0.93      0.95      0.94    182618

    accuracy                           0.94    365039
   macro avg       0.94      0.94      0.94    365039
weighted avg       0.94      0.94      0.94    365039

Confusion Matrix after Oversampling:
 [[170187  12234]
 [  9365 173253]]
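The cell above resamples the whole dataset before splitting, so the test set is also balanced and can contain copies of training rows. A common alternative is to split first and resample only the training portion. The sketch below illustrates that order on synthetic data from `make_classification` (an assumption, used so it runs offline) and does the oversampling by hand with NumPy as a stand-in for `RandomOverSampler`; it is not the notebook's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data (assumption: roughly 10% positive class)
X, y = make_classification(n_samples=5000, n_features=7,
                           weights=[0.9, 0.1], random_state=42)

# 1) Split first; stratify keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2) Oversample the minority class in the training set only
rng = np.random.default_rng(42)
minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)
extra_idx = rng.choice(minority_idx,
                       size=len(majority_idx) - len(minority_idx),
                       replace=True)
train_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_train_bal, y_train_bal = X_train[train_idx], y_train[train_idx]

# 3) Fit on the balanced training data, evaluate on the untouched test set
clf = LogisticRegression(max_iter=1000).fit(X_train_bal, y_train_bal)
test_recall = recall_score(y_test, clf.predict(X_test))
print(f"Minority-class recall on the original test distribution: {test_recall:.2f}")
```

With this order, the test set keeps the original class ratio, so its scores can be compared directly with the unbalanced baseline from step 3.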


In [31]:
# - **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
from imblearn.under_sampling import RandomUnderSampler

# Apply undersampling to balance the dataset
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)

# Train-test split with resampled data
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

# Training the Logistic Regression model
model_resampled = LogisticRegression(max_iter=500, random_state=42)
model_resampled.fit(X_train_resampled, y_train_resampled)

# Evaluate the model
y_pred_resampled = model_resampled.predict(X_test_resampled)
undersampled_report = classification_report(y_test_resampled, y_pred_resampled)
print("Classification Report after Undersampling:\n", undersampled_report)
print("Confusion Matrix after Undersampling:\n", confusion_matrix(y_test_resampled, y_pred_resampled))



Classification Report after Undersampling:
               precision    recall  f1-score   support

         0.0       0.95      0.93      0.94     17474
         1.0       0.93      0.95      0.94     17488

    accuracy                           0.94     34962
   macro avg       0.94      0.94      0.94     34962
weighted avg       0.94      0.94      0.94     34962

Confusion Matrix after Undersampling:
 [[16246  1228]
 [  908 16580]]
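To make the support figures above easier to interpret, here is a rough NumPy sketch of what random undersampling does to the label vector: it keeps every fraud row and a random, equally sized subset of legitimate rows. The label array is rebuilt from the class counts shown earlier (an assumption to avoid downloading the data), and the index-based approach is only a stand-in for `RandomUnderSampler`.

```python
import math
import numpy as np

# Labels rebuilt from the class counts in step 1
y = np.concatenate([np.zeros(912597), np.ones(87403)])

rng = np.random.default_rng(42)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep all minority rows and an equally sized random subset of majority rows
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([kept_majority, minority_idx])
y_balanced = y[balanced_idx]

# train_test_split rounds a fractional test size up
n_test = math.ceil(0.2 * len(y_balanced))

print(len(y_balanced), n_test)
```

The balanced set has 174,806 rows, and a 20% test split of it has 34,962 rows, which matches the total support in the report above. More than 800,000 legitimate transactions are discarded in the process.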


In [32]:
# - **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 
from imblearn.over_sampling import SMOTE
# Apply SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
# Train-test split with resampled data
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)
# Training the Logistic Regression model
model_resampled = LogisticRegression(max_iter=500, random_state=42)
model_resampled.fit(X_train_resampled, y_train_resampled)
# Evaluate the model
y_pred_resampled = model_resampled.predict(X_test_resampled)
smote_report = classification_report(y_test_resampled, y_pred_resampled)
print("Classification Report after SMOTE:\n", smote_report)
print("Confusion Matrix after SMOTE:\n", confusion_matrix(y_test_resampled, y_pred_resampled))




Classification Report after SMOTE:
               precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    182421
         1.0       0.93      0.95      0.94    182618

    accuracy                           0.94    365039
   macro avg       0.94      0.94      0.94    365039
weighted avg       0.94      0.94      0.94    365039

Confusion Matrix after SMOTE:
 [[170337  12084]
 [  9101 173517]]
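Unlike random oversampling, which duplicates existing rows, SMOTE creates new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that single interpolation step with NumPy on a few made-up points; the helper `smote_like_sample` is hypothetical and much simpler than the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points with two continuous features
minority = np.array([[1.0, 2.0], [1.5, 2.5], [3.0, 1.0], [2.0, 2.0]])

def smote_like_sample(points, i, k, rng):
    """Create one synthetic point from points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                       # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i]), j

synthetic, j = smote_like_sample(minority, i=0, k=2, rng=rng)
print("neighbour used:", minority[j])
print("synthetic point:", synthetic)
```

Because the new point lies on the line segment between two real points, interpolating 0/1 columns such as `used_chip` or `online_order` can produce fractional values. That may be worth keeping in mind when interpreting the SMOTE results for this dataset.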


In [34]:
# - **7.** Compare the results of all methods. Which one performed best?
# Comparing the results of all methods
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Extract metrics from classification reports
def extract_metrics(report, label='1.0'):
    # Find the row of the text report that belongs to the requested class
    # (the fraud class '1.0' by default) and read its metrics
    for line in report.split("\n"):
        parts = line.split()
        if parts and parts[0] == label:
            return {
                'precision': float(parts[1]),
                'recall': float(parts[2]),
                'f1-score': float(parts[3])
            }
    raise ValueError(f"Label {label} not found in report")

no_scale_metrics = extract_metrics(no_scale_report)
oversampled_metrics = extract_metrics(oversamled_report)
undersampled_metrics = extract_metrics(undersampled_report)
smote_metrics = extract_metrics(smote_report)

methods = ['Original', 'Oversampling', 'Undersampling', 'SMOTE']
results = {
    'Method': methods,
    'Precision': [
        no_scale_metrics['precision'],
        oversampled_metrics['precision'],
        undersampled_metrics['precision'],
        smote_metrics['precision']
    ],
    'Recall': [
        no_scale_metrics['recall'],
        oversampled_metrics['recall'],
        undersampled_metrics['recall'],
        smote_metrics['recall']
    ],
    'F1-Score': [
        no_scale_metrics['f1-score'],
        oversampled_metrics['f1-score'],
        undersampled_metrics['f1-score'],
        smote_metrics['f1-score']
    ]
}

results_df = pd.DataFrame(results)
print(results_df)
# On the fraud class (1.0), the three resampling methods score the same at two decimals
# and raise recall from 0.60 to 0.95 and F1 from 0.72 to 0.94 compared with the original model.
# Caveat: the resampled models were evaluated on resampled (balanced) test sets,
# so their scores are not directly comparable with the original test set.


          Method  Precision  Recall  F1-Score
0       Original       0.89    0.60      0.72
1   Oversampling       0.93    0.95      0.94
2  Undersampling       0.93    0.95      0.94
3          SMOTE       0.93    0.95      0.94
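Parsing the text report by splitting lines is fragile, because a change in layout silently shifts which row is read. As an alternative sketch, `classification_report` accepts `output_dict=True` and returns the metrics keyed by class label. The toy labels below are made up for illustration only.

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy labels for illustration (not from the fraud dataset)
y_true = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.0, 0.0, 1.0, 1.0, 0.0])

# With output_dict=True the report is a nested dict instead of a string
report = classification_report(y_true, y_pred, output_dict=True)

# Metrics for the positive class, looked up by its label
fraud_row = report["1.0"]
print(fraud_row["precision"], fraud_row["recall"], fraud_row["f1-score"])
```

In the notebook, the same pattern could be applied by storing the dict form of each report and building the comparison table from the `"1.0"` entries.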
