# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** The distance from home to where the transaction happened.
- **distance_from_last_transaction:** The distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
df= pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
df.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [3]:
# Assuming your dataframe is called df
target_distribution = df['fraud'].value_counts(normalize=True)
print(target_distribution)

fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64
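
To put these proportions in context, here is a minimal sketch of the arithmetic. The two proportions are copied from the output above; the "always predict the majority class" model is a hypothetical reference point, not something this notebook trains.

```python
# Class proportions copied from the value_counts(normalize=True) output above.
proportions = {0.0: 0.912597, 1.0: 0.087403}

# A hypothetical model that always predicts the majority class would be right
# exactly as often as the majority class occurs.
majority_label = max(proportions, key=proportions.get)
baseline_accuracy = proportions[majority_label]

# Number of legit transactions for every fraudulent one.
imbalance_ratio = proportions[0.0] / proportions[1.0]

print(f"Majority class: {majority_label}")
print(f"Accuracy of always predicting it: {baseline_accuracy:.4f}")
print(f"Legit transactions per fraud: {imbalance_ratio:.1f}")
```

With roughly ten legit transactions per fraud, the dataset is imbalanced, and a model has to beat about 91% accuracy before accuracy says anything useful.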


Logistic regression

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Define features and target
X = df.drop('fraud', axis=1)
y = df['fraud']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")


[[181277   1280]
 [  7003  10440]]
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    182557
         1.0       0.89      0.60      0.72     17443

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.95    200000

Accuracy: 0.958585
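
The gap between accuracy and fraud recall can be read straight off the confusion matrix. The sketch below only redoes that arithmetic with the four cell counts printed above (rows are actual classes, columns are predicted classes); it does not retrain anything.

```python
# Cell counts copied from the confusion matrix above.
tn, fp = 181_277, 1_280   # actual legit: predicted legit / predicted fraud
fn, tp = 7_003, 10_440    # actual fraud: predicted legit / predicted fraud

total = tn + fp + fn + tp
accuracy = (tn + tp) / total        # share of all transactions classified correctly
fraud_recall = tp / (tp + fn)       # share of actual fraud that was caught
fraud_missed = fn / (tp + fn)       # share of actual fraud that slipped through

print(f"Accuracy:     {accuracy:.4f}")
print(f"Fraud recall: {fraud_recall:.2f}")
print(f"Fraud missed: {fraud_missed:.2f} ({fn} of {tp + fn} cases)")
```

Accuracy is dominated by the legit class, so it stays high even though about 40% of fraud cases are missed. If a missed fraud is considered costlier than a false alarm (an assumption about the use case), recall on class 1 is the more informative metric here.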


Oversample

In [5]:
from sklearn.utils import resample

# Combine X_train and y_train
train_data = pd.concat([X_train, y_train], axis=1)

# Separate majority and minority classes
majority_class = train_data[train_data['fraud'] == 0]
minority_class = train_data[train_data['fraud'] == 1]

# Upsample minority class
minority_upsampled = resample(minority_class, 
                              replace=True,     # Sample with replacement
                              n_samples=len(majority_class),    # Match number of majority class
                              random_state=42)  # For reproducibility

# Combine majority class with upsampled minority class
train_data_upsampled = pd.concat([majority_class, minority_upsampled])

# Separate X and y
X_train_balanced = train_data_upsampled.drop('fraud', axis=1)
y_train_balanced = train_data_upsampled['fraud']

# Train and evaluate again
model.fit(X_train_balanced, y_train_balanced)
y_pred_balanced = model.predict(X_test)
print(confusion_matrix(y_test, y_pred_balanced))
print(classification_report(y_test, y_pred_balanced))
print(f"Accuracy after oversampling: {accuracy_score(y_test, y_pred_balanced)}")


[[170338  12219]
 [   851  16592]]
              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    182557
         1.0       0.58      0.95      0.72     17443

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Accuracy after oversampling: 0.93465


Undersample

In [6]:
# Undersample majority class
majority_downsampled = resample(majority_class, 
                                replace=False,    # Sample without replacement
                                n_samples=len(minority_class),    # Match number of minority class
                                random_state=42)  # For reproducibility

# Combine minority class with downsampled majority class
train_data_downsampled = pd.concat([majority_downsampled, minority_class])

# Separate X and y
X_train_balanced = train_data_downsampled.drop('fraud', axis=1)
y_train_balanced = train_data_downsampled['fraud']

# Train and evaluate again
model.fit(X_train_balanced, y_train_balanced)
y_pred_balanced = model.predict(X_test)
print(confusion_matrix(y_test, y_pred_balanced))
print(classification_report(y_test, y_pred_balanced))
print(f"Accuracy after undersampling: {accuracy_score(y_test, y_pred_balanced)}")


[[170284  12273]
 [   840  16603]]
              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    182557
         1.0       0.57      0.95      0.72     17443

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Accuracy after undersampling: 0.934435
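
Oversampling and undersampling reach almost identical scores with very different amounts of training data. The sketch below estimates the training set sizes from numbers already shown in this notebook. The total row count is inferred from the 200000 test rows and `test_size=0.2`, so treat it as an assumption rather than a value read from the data.

```python
# Figures taken from the notebook: test_size=0.2 and the support column of the
# classification report (182557 legit and 17443 fraud rows in the test set).
test_legit, test_fraud = 182_557, 17_443
test_rows = test_legit + test_fraud        # 200000 rows in the test set
total_rows = round(test_rows / 0.2)        # inferred size of the full dataset

# Fraud proportion from value_counts(normalize=True), turned back into a count.
total_fraud = round(0.087403 * total_rows)
total_legit = total_rows - total_fraud

# Whatever is not in the test set is in the training set.
train_legit = total_legit - test_legit
train_fraud = total_fraud - test_fraud

oversampled_rows = 2 * train_legit     # minority resampled up to the majority size
undersampled_rows = 2 * train_fraud    # majority sampled down to the minority size

print(f"Training rows before resampling: {train_legit + train_fraud}")
print(f"Training rows after oversampling:  {oversampled_rows}")
print(f"Training rows after undersampling: {undersampled_rows}")
```

Under these assumptions, undersampling trains on roughly a tenth of the rows that oversampling uses, which makes it the cheaper option when both give similar results.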


SMOTE

In [7]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train and evaluate again
model.fit(X_train_smote, y_train_smote)
y_pred_smote = model.predict(X_test)
print(confusion_matrix(y_test, y_pred_smote))
print(classification_report(y_test, y_pred_smote))
print(f"Accuracy after SMOTE: {accuracy_score(y_test, y_pred_smote)}")


[[170446  12111]
 [   873  16570]]
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182557
         1.0       0.58      0.95      0.72     17443

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000

Accuracy after SMOTE: 0.93508
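
Unlike random oversampling, SMOTE does not duplicate minority rows; it creates new points between a minority sample and one of its nearest minority neighbours. The function below is a simplified, pure-Python sketch of that interpolation idea, not imblearn's implementation. The name `smote_like_samples`, the toy points, and `k=2` are all made up for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (excluding the point itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point somewhere on the segment between the two.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority class (made-up values, not taken from the dataset).
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like_samples(minority_points, n_new=5)
for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.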


Oversampling with Random Resampling:

In [9]:
import pandas as pd
from sklearn.utils import resample

# Combine X_train and y_train
train_data = pd.concat([X_train, y_train], axis=1)

# Separate majority and minority classes
majority_class = train_data[train_data['fraud'] == 0]
minority_class = train_data[train_data['fraud'] == 1]

# Upsample minority class
minority_upsampled = resample(minority_class, 
                              replace=True,     # Sample with replacement
                              n_samples=len(majority_class),    # Match number of majority class
                              random_state=42)  # For reproducibility

# Combine majority class with upsampled minority class
train_data_upsampled = pd.concat([majority_class, minority_upsampled])

# Separate X and y
X_train_balanced = train_data_upsampled.drop('fraud', axis=1)
y_train_balanced = train_data_upsampled['fraud']


Training and evaluating the model on SMOTE-balanced data:

In [10]:
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train the model with the balanced dataset
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train_smote, y_train_smote)

# Predict on the test set
y_pred_smote = model.predict(X_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred_smote))
print(classification_report(y_test, y_pred_smote))
print(f"Accuracy after oversampling: {accuracy_score(y_test, y_pred_smote)}")


[[170446  12111]
 [   873  16570]]
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182557
         1.0       0.58      0.95      0.72     17443

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000

Accuracy after oversampling: 0.93508
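
To answer whether balancing improves the model, it helps to put the four runs side by side. The sketch below recomputes the fraud-class metrics from the confusion matrices printed in this notebook; the helper name `fraud_metrics` is introduced here for illustration only.

```python
# Confusion matrices copied from the outputs above, as (tn, fp, fn, tp).
confusion = {
    "baseline": (181277, 1280, 7003, 10440),
    "oversample": (170338, 12219, 851, 16592),
    "undersample": (170284, 12273, 840, 16603),
    "smote": (170446, 12111, 873, 16570),
}


def fraud_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the fraud class (label 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


results = {name: fraud_metrics(*cells) for name, cells in confusion.items()}

print(f"{'strategy':<12}{'precision':>10}{'recall':>8}{'f1':>6}{'missed':>8}{'false alarms':>14}")
for name, (tn, fp, fn, tp) in confusion.items():
    m = results[name]
    print(f"{name:<12}{m['precision']:>10.2f}{m['recall']:>8.2f}{m['f1']:>6.2f}{fn:>8}{fp:>14}")
```

All three balancing strategies land in nearly the same place: fraud recall rises from about 0.60 to about 0.95, precision falls from about 0.89 to about 0.58, and F1 stays near 0.72. Whether that counts as an improvement depends on how the cost of a missed fraud compares with the cost of a false alarm.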
