# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [10]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [11]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

1. What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?

In [12]:
# Check the distribution of the target variable 'fraud'
fraud['fraud'].value_counts(normalize=True)

| fraud | proportion |
|---|---|
| 0.0 | 0.912597 |
| 1.0 | 0.087403 |


Yes, this is an imbalanced dataset: about 91.3% of transactions are legitimate (class 0) and only about 8.7% are fraudulent (class 1).
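To make the imbalance concrete, the proportions printed above can be turned into a class ratio and a majority-class baseline. This is only a small arithmetic sketch; the variable names are illustrative and the two numbers are copied from the `value_counts` output.

```python
# Class proportions copied from the value_counts(normalize=True) output above.
proportions = {0.0: 0.912597, 1.0: 0.087403}

# Number of legitimate transactions per fraudulent one.
imbalance_ratio = proportions[0.0] / proportions[1.0]

# Accuracy of a trivial model that always predicts the majority class (0 = legit).
majority_baseline_accuracy = max(proportions.values())

print(f"Legit-to-fraud ratio: {imbalance_ratio:.2f} : 1")
print(f"Always-predict-legit accuracy: {majority_baseline_accuracy:.4f}")
```

A model has to beat roughly 91% accuracy just to outperform "never flag anything", which is why accuracy alone is a poor metric here.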

2. Train a LogisticRegression.

In [13]:
# Split the data into features and target
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split the data into train and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize the Logistic Regression model
log_reg = LogisticRegression(solver='liblinear', random_state=42)

# Fit the model on the training data
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182519
         1.0       0.90      0.60      0.72     17481

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000

Confusion Matrix:
 [[181290   1229]
 [  6927  10554]]
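The fraud-class numbers in the report follow directly from the confusion matrix. The sketch below recomputes them by hand from the four counts printed above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
tn, fp = 181290, 1229
fn, tp = 6927, 10554

precision = tp / (tp + fp)  # of the transactions flagged as fraud, the share that were fraud
recall = tp / (tp + fn)     # of the actual fraud cases, the share the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
print(f"Fraud cases missed: {fn} of {tp + fn}")
```

The model reaches 96% accuracy while still missing 6,927 of the 17,481 fraudulent transactions in the test set, so recall for class 1 is the number to watch.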


3. Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.

In [14]:
# ROC-AUC Score
roc_auc = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
print(f"ROC-AUC Score: {roc_auc:.4f}")

ROC-AUC Score: 0.9671
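ROC-AUC can look optimistic on imbalanced data because it also rewards ranking the many legitimate transactions correctly. A common complement is average precision (the area under the precision-recall curve), which focuses on the positive class. The following is a minimal sketch on synthetic data, not on the lab's dataset: the `make_classification` data and the `*_demo` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 9% positives (an assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.91, 0.09], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

roc_auc_demo = roc_auc_score(y_te, scores)
pr_auc_demo = average_precision_score(y_te, scores)
prevalence = y_te.mean()  # a random ranking scores about this much average precision

print(f"ROC-AUC: {roc_auc_demo:.4f}")
print(f"Average precision (PR-AUC): {pr_auc_demo:.4f} (random baseline ~ {prevalence:.4f})")
```

On the lab data, the same two calls could be applied to `y_test` and `log_reg.predict_proba(X_test)[:, 1]`.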


4. Run Oversample in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?

In [15]:
# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train the Logistic Regression model again with the balanced data
log_reg.fit(X_train_smote, y_train_smote)

# Predict on the test set
y_pred_smote = log_reg.predict(X_test)

# Evaluate the model
print("Classification Report after SMOTE:\n", classification_report(y_test, y_pred_smote))
print("Confusion Matrix after SMOTE:\n", confusion_matrix(y_test, y_pred_smote))

# ROC-AUC Score after SMOTE
roc_auc_smote = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
print(f"ROC-AUC Score after SMOTE: {roc_auc_smote:.4f}")

Classification Report after SMOTE:
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000

Confusion Matrix after SMOTE:
 [[170490  12029]
 [   943  16538]]
ROC-AUC Score after SMOTE: 0.9792


SMOTE, the oversampling method used in this cell, improved the model's ability to detect fraud: recall for the fraud class rose from 0.60 to 0.95, which is the primary goal in fraud detection problems. However, precision for fraud dropped from 0.90 to 0.58, and false positives rose from 1,229 to 12,029. The fraud F1-score is unchanged at 0.72, and ROC-AUC rose slightly from 0.9671 to 0.9792. Because this cell uses SMOTE rather than plain random oversampling, its results largely anticipate step 6.
If the cost of false positives (misclassifying non-fraud as fraud) is not too high, SMOTE may be beneficial as it improves fraud detection.
If minimizing false positives is critical, further tuning or experimentation may be needed to strike a better balance between precision and recall for the fraud class.
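Step 4 asks for oversampling in general, while the cell above uses SMOTE, which creates synthetic minority points. Plain random oversampling instead duplicates existing minority rows. The sketch below shows that idea in NumPy on a toy array; `random_oversample` is a hypothetical helper written for illustration, and imbalanced-learn's `RandomOverSampler` is the library class that would normally be used on `X_train, y_train`.

```python
import numpy as np

def random_oversample(X, y, random_state=42):
    """Duplicate rows of the smaller classes (with replacement) until all classes match the largest."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing number of rows from this class; zero for the majority class.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]

# Toy data: 8 legitimate rows (0) and 2 fraud rows (1).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # class counts after balancing
```

As with SMOTE, resampling should be applied to the training split only, so the test set keeps the original class distribution.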

5. Now, run Undersample in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [16]:
# Apply Undersampling to the training data
undersample = RandomUnderSampler(random_state=42)
X_train_undersample, y_train_undersample = undersample.fit_resample(X_train, y_train)

# Train the Logistic Regression model again with the undersampled data
log_reg.fit(X_train_undersample, y_train_undersample)

# Predict on the test set
y_pred_undersample = log_reg.predict(X_test)

# Evaluate the model
print("Classification Report after Undersampling:\n", classification_report(y_test, y_pred_undersample))
print("Confusion Matrix after Undersampling:\n", confusion_matrix(y_test, y_pred_undersample))

# ROC-AUC Score after Undersampling
roc_auc_undersample = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
print(f"ROC-AUC Score after Undersampling: {roc_auc_undersample:.4f}")

Classification Report after Undersampling:
               precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182519
         1.0       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix after Undersampling:
 [[170363  12156]
 [   917  16564]]
ROC-AUC Score after Undersampling: 0.9796


Undersampling gave almost the same result as SMOTE. It balances the training set by discarding non-fraud rows rather than adding fraud rows. Fraud recall rose from 0.60 to 0.95, while fraud precision fell from 0.90 to 0.58 (12,156 false positives versus 1,229 for the baseline model).
If fraud detection is the priority, undersampling may be beneficial because more fraudulent transactions are identified, at the expense of more non-fraud instances being misclassified as fraud.
If minimizing false positives (misclassifying non-fraud as fraud) is critical, further adjustments to the model or balancing methods may be needed to find a better trade-off between precision and recall.
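One possible adjustment of this kind is to move the decision threshold instead of (or in addition to) resampling. The sketch below picks the threshold with the highest precision among those that still reach a minimum recall. It runs on synthetic data; the 0.80 recall floor, the `make_classification` data, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 9% positives (an assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.91, 0.09], random_state=0,
)
# Tune the threshold on a validation split rather than on the final test set.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, scores)

min_recall = 0.80  # hypothetical business requirement
# precision/recall have one more entry than thresholds; ignore the final (1, 0) point.
candidates = np.flatnonzero(recall[:-1] >= min_recall)
best = candidates[np.argmax(precision[:-1][candidates])]
chosen_threshold = thresholds[best]

print(f"threshold={chosen_threshold:.3f} "
      f"precision={precision[best]:.2f} recall={recall[best]:.2f}")
```

Predictions at the chosen operating point would then be `(scores >= chosen_threshold).astype(int)` instead of the default 0.5 cut-off used by `predict`.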

6. Finally, run SMOTE in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [17]:
# Prepare the data
X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to oversample the minority class
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Initialize the Logistic Regression model
log_reg = LogisticRegression(random_state=42)

# Fit the model on the resampled (SMOTE) data
log_reg.fit(X_train_smote, y_train_smote)

# Make predictions on the test set
y_pred_smote = log_reg.predict(X_test)

# Evaluate the model

# 1. Classification Report
print("Classification Report after SMOTE:")
print(classification_report(y_test, y_pred_smote))

# 2. Confusion Matrix
print("Confusion Matrix after SMOTE:")
print(confusion_matrix(y_test, y_pred_smote))

# 3. ROC-AUC Score
roc_auc = roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
print(f"ROC-AUC Score after SMOTE: {roc_auc:.4f}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Classification Report after SMOTE:
              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    182557
         1.0       0.56      0.95      0.71     17443

    accuracy                           0.93    200000
   macro avg       0.78      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix after SMOTE:
[[169770  12787]
 [   849  16594]]
ROC-AUC Score after SMOTE: 0.9790


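SMOTE has improved the recall for fraud detection significantly compared with the baseline model (from 0.60 to 0.95), making the model much more sensitive to fraud.
However, this comes at the cost of precision (which dropped from 0.90 to 0.56 for fraud), meaning the model now classifies more non-fraud instances as fraud.
The ROC-AUC score is very high (0.9790), suggesting the model is still excellent at distinguishing between the classes, despite the increase in false positives.
Note that this cell re-splits the data without `stratify=y` and uses the default solver, which stopped at its iteration limit (see the warning above), so its numbers are not strictly comparable with those of the earlier cells.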
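The convergence warning printed in step 6 suggests scaling the data or increasing `max_iter`. A minimal sketch of both, assuming a scikit-learn pipeline and synthetic data with deliberately mismatched feature scales (the data, the scale factors, and the variable names are illustrative and not taken from the lab):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 9% positives (an assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.91, 0.09], random_state=42,
)
# Put the features on very different scales to mimic distances next to 0/1 flags.
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 1.0, 1.0, 1.0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler is fitted on the training data only, then reused for the test data.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
pipe.fit(X_tr, y_tr)

n_iter_used = int(pipe.named_steps["logisticregression"].n_iter_[0])
auc_scaled = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print(f"Solver iterations used: {n_iter_used}, ROC-AUC: {auc_scaled:.4f}")
```

If resampling is combined with scaling, both steps should be fitted on the training split only so that no information from the test set leaks into the model.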