# LAB | Imbalanced

**Load the data**

In this challenge, we will work with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction was made at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0 = legit**, **1 = fraud**


In [2]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a `LogisticRegression` model.
- **3.** Evaluate your model. Take class importance into account and evaluate it by selecting the appropriate metric.
- **4.** Apply **oversampling** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
- **5.** Now apply **undersampling** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
- **6.** Finally, apply **SMOTE** to balance the target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of the model?
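
Steps 4 to 6 name three resampling strategies. As a rough illustration of what each one does to the training rows, the sketch below reimplements them in plain Python on a made-up toy dataset. The helper names (`random_oversample`, `random_undersample`, `smote_like`) are hypothetical and are not part of imbalanced-learn. Note that `smote_like` interpolates between two randomly chosen minority rows, whereas real SMOTE interpolates towards one of a row's k nearest minority neighbours, so this is a simplification.

```python
import random

def split_by_class(rows, labels):
    """Group feature rows by label (0 = legit, 1 = fraud)."""
    groups = {0: [], 1: []}
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups[0], groups[1]

def random_oversample(majority, minority, rng):
    """Duplicate minority rows (sampling with replacement) until the classes match."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, rng):
    """Keep a random subset of majority rows the size of the minority class."""
    return rng.sample(majority, len(minority)), minority

def smote_like(majority, minority, rng):
    """Add synthetic minority rows on the segment between two minority rows."""
    synthetic = []
    for _ in range(len(majority) - len(minority)):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return majority, minority + synthetic

rng = random.Random(42)

# Toy data (made up): 10 legit rows and 3 fraud rows with two features each
rows = [[float(i), float(i % 3)] for i in range(10)]
rows += [[50.0, 5.0], [60.0, 6.0], [70.0, 7.0]]
labels = [0] * 10 + [1] * 3
legit, fraud_rows = split_by_class(rows, labels)

for name, sampler in [("oversample", random_oversample),
                      ("undersample", random_undersample),
                      ("smote-like", smote_like)]:
    maj, mino = sampler(legit, fraud_rows, rng)
    print(f"{name}: {len(maj)} legit rows, {len(mino)} fraud rows")
```

Oversampling and the SMOTE-style variant grow the fraud class to the size of the legit class, while undersampling shrinks the legit class to the size of the fraud class. In all three cases only the training rows should be resampled; the test set keeps its original distribution.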

In [2]:
import pandas as pd
import numpy as np

# Load the data
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")

# Check the distribution of the 'fraud' column
fraud_distribution = fraud['fraud'].value_counts(normalize=True)

# Display the distribution of fraudulent vs non-fraudulent transactions
print(fraud_distribution)


fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64
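
The proportions above translate directly into an imbalance ratio and into the accuracy a model would reach by always predicting "legit". The following sketch only does that arithmetic on the printed values; the variable names are illustrative.

```python
# Class proportions copied from the value_counts(normalize=True) output above
p_legit = 0.912597
p_fraud = 0.087403

# Number of legitimate transactions per fraudulent one
imbalance_ratio = p_legit / p_fraud

# Accuracy of a trivial model that always predicts class 0 (legit)
majority_baseline_accuracy = p_legit

print(f"Legit transactions per fraud: {imbalance_ratio:.1f}")
print(f"Always-legit baseline accuracy: {majority_baseline_accuracy:.3f}")
```

With roughly ten legitimate transactions for every fraud, the dataset is clearly imbalanced, and plain accuracy starts from a baseline of about 91%, which makes it a weak yardstick here.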


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Split the data into features and target variable
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Check the accuracy of the model on the training and testing sets
train_accuracy = log_reg.score(X_train_scaled, y_train)
test_accuracy = log_reg.score(X_test_scaled, y_test)

print(f'Training Accuracy: {train_accuracy}')
print(f'Testing Accuracy: {test_accuracy}')


Training Accuracy: 0.95885875
Testing Accuracy: 0.95875


In [12]:
from sklearn.metrics import classification_report, confusion_matrix

# Predict the target variable on the test set
y_pred = log_reg.predict(X_test_scaled)

# Display classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Display confusion matrix
display("Confusion Matrix:")
display(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182557
         1.0       0.89      0.60      0.72     17443

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000



'Confusion Matrix:'

array([[181283,   1274],
       [  6976,  10467]])
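
The fraud-class figures in the report can be recomputed by hand from the confusion matrix, which makes it easier to see where the 0.60 recall comes from. A minimal sketch using the four counts printed above (rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`):

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 181283, 1274
fn, tp = 6976, 10467

precision = tp / (tp + fp)   # share of flagged transactions that are truly fraud
recall = tp / (tp + fn)      # share of actual frauds that the model catches
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

About 40% of the frauds in the test set (6,976 of 17,443) go undetected even though accuracy is 0.96, which is why recall on class 1 is the metric to watch in this lab.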

In [14]:
from imblearn.over_sampling import RandomOverSampler

# Apply Random Oversampling
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train_scaled, y_train)

# Train the Logistic Regression model on the oversampled data
log_reg_resampled = LogisticRegression(random_state=42)
log_reg_resampled.fit(X_resampled, y_resampled)

# Evaluate the model on the test set
y_pred_resampled = log_reg_resampled.predict(X_test_scaled)

# Display classification report
print("Classification Report After Oversampling:")
print(classification_report(y_test, y_pred_resampled))

# Display confusion matrix
print("Confusion Matrix After Oversampling:")
print(confusion_matrix(y_test, y_pred_resampled))


Classification Report After Oversampling:
              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    182557
         1.0       0.58      0.95      0.72     17443

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix After Oversampling:
[[170347  12210]
 [   852  16591]]


In [16]:
from imblearn.under_sampling import RandomUnderSampler

# Apply Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_resampled_under, y_resampled_under = rus.fit_resample(X_train_scaled, y_train)

# Train the Logistic Regression model on the undersampled data
log_reg_resampled_under = LogisticRegression(random_state=42)
log_reg_resampled_under.fit(X_resampled_under, y_resampled_under)

# Evaluate the model on the test set
y_pred_resampled_under = log_reg_resampled_under.predict(X_test_scaled)

# Display classification report
print("Classification Report After Undersampling:")
print(classification_report(y_test, y_pred_resampled_under))

# Display confusion matrix
print("Confusion Matrix After Undersampling:")
print(confusion_matrix(y_test, y_pred_resampled_under))


Classification Report After Undersampling:
              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    182557
         1.0       0.58      0.95      0.72     17443

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix After Undersampling:
[[170295  12262]
 [   842  16601]]


In [18]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled_smote, y_resampled_smote = smote.fit_resample(X_train_scaled, y_train)

# Train the Logistic Regression model on the SMOTE-balanced data
log_reg_smote = LogisticRegression(random_state=42)
log_reg_smote.fit(X_resampled_smote, y_resampled_smote)

# Evaluate the model on the test set
y_pred_smote = log_reg_smote.predict(X_test_scaled)

# Display classification report
print("Classification Report After SMOTE:")
print(classification_report(y_test, y_pred_smote))

# Display confusion matrix
print("Confusion Matrix After SMOTE:")
print(confusion_matrix(y_test, y_pred_smote))


Classification Report After SMOTE:
              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    182557
         1.0       0.58      0.95      0.72     17443

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix After SMOTE:
[[170334  12223]
 [   848  16595]]
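
To answer the "does it improve performance?" question across all four runs, it can help to put the fraud-class metrics side by side. The sketch below recomputes them from the confusion matrices printed in this notebook; the `results` dictionary and the `fraud_metrics` helper are illustrative names, not part of the lab.

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
results = {
    "baseline":      [[181283, 1274], [6976, 10467]],
    "oversampling":  [[170347, 12210], [852, 16591]],
    "undersampling": [[170295, 12262], [842, 16601]],
    "SMOTE":         [[170334, 12223], [848, 16595]],
}

def fraud_metrics(cm):
    """Return (precision, recall, f1) for the fraud class from a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(f"{'model':<14}{'precision':>10}{'recall':>8}{'f1':>6}{'missed':>8}{'false alarms':>14}")
for name, cm in results.items():
    precision, recall, f1 = fraud_metrics(cm)
    missed, false_alarms = cm[1][0], cm[0][1]
    print(f"{name:<14}{precision:>10.2f}{recall:>8.2f}{f1:>6.2f}{missed:>8}{false_alarms:>14}")
```

All three resampling strategies land in nearly the same place: recall on fraud rises from 0.60 to 0.95 while precision falls from 0.89 to 0.58, so F1 stays at about 0.72. Whether that counts as an improvement depends on how the cost of a missed fraud compares with the cost of a false alarm.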
