# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance between the cardholder's home and the place where the transaction happened.
- **distance_from_last_transaction:** Distance between this transaction and the previous one.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction was made at a repeat (previously used) retailer.
- **used_chip:** Whether the transaction was made with the card's chip.
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
fraud = pd.read_csv("card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the appropriate metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
# What is the distribution of our target variable?
# Can we say we're dealing with an imbalanced dataset?
# Yes: 17,517 legitimate vs. 1,645 fraudulent transactions (about 8.6% fraud),
# so the dataset is imbalanced.
fraud["fraud"].value_counts()

fraud
0.0    17517
1.0     1645
Name: count, dtype: int64
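
To make the imbalance explicit, the counts can be turned into class shares and a majority-to-minority ratio. The sketch below rebuilds the counts by hand from the output above so that it runs on its own; on the real DataFrame, `fraud["fraud"].value_counts(normalize=True)` should give the same shares directly. The variable names `shares` and `imbalance_ratio` are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0.0: 17517, 1.0: 1645}, name="count")

shares = counts / counts.sum()                  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()   # majority rows per minority row

print(shares.round(4))
print(f"imbalance ratio: {imbalance_ratio:.2f} : 1")
```

Roughly one transaction in twelve is fraudulent, so a model can look accurate while missing most frauds.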

In [8]:
fraud.isna().sum()

distance_from_home                0
distance_from_last_transaction    0
ratio_to_median_purchase_price    0
repeat_retailer                   0
used_chip                         1
used_pin_number                   1
online_order                      1
fraud                             1
dtype: int64

In [9]:
fraud.dropna(inplace=True)

In [13]:
# Train a LogisticRegression.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

# Features and target
X, y = fraud.drop('fraud', axis=1), fraud['fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

LR = LogisticRegression(max_iter=1000)
LR.fit(X_train, y_train)
pred = LR.predict(X_test)

# Goal: catch as many frauds as possible, so the metric to watch is recall.
# print("precision: ", precision_score(y_test, pred))
print("recall: ", recall_score(y_test, pred))
# print("f1: ", f1_score(y_test, pred))
# print("score: ", LR.score(X_test, y_test))

recall:  0.6019417475728155
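
Recall is the chosen metric because accuracy can look good on imbalanced data even when no fraud is caught. The sketch below uses a made-up test set (1000 transactions, 86 of them fraudulent, roughly the share seen in this dataset) and a "model" that labels everything as legitimate. The numbers are hypothetical and not taken from the lab's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test set: 1000 transactions, 86 of them fraudulent (about 8.6%).
y_true = np.array([1] * 86 + [0] * 914)
# A "model" that labels every transaction as legitimate.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"accuracy: {accuracy:.3f}")  # accuracy: 0.914
print(f"recall:   {recall:.3f}")    # recall:   0.000
```

An accuracy above 91% with zero frauds detected is why the evaluation above reports recall for the fraud class.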


In [None]:
from sklearn.utils import resample

train = pd.concat([X_train, y_train],axis=1)

# separate majority/minority classes
no_fraud = train[train['fraud']==0]
yes_fraud = train[train['fraud']==1]

display(no_fraud.shape)
display(yes_fraud.shape)

# oversample minority
yes_fraud_oversampled = resample(yes_fraud, #<- sample from here
                                    replace=True, #<- we need replacement, since we don't have enough data otherwise
                                    n_samples = len(no_fraud),#<- make both sets the same size
                                    random_state=0)
# both classes now have the same number of rows
display(no_fraud.shape)
display(yes_fraud_oversampled.shape)

train_oversampled = pd.concat([no_fraud, yes_fraud_oversampled])

y_train_over = train_oversampled['fraud'].copy()
X_train_over = train_oversampled.drop('fraud',axis = 1).copy()

LR = LogisticRegression(max_iter=1000)
LR.fit(X_train_over, y_train_over)
pred = LR.predict(X_test)

print("recall: ",recall_score(y_test,pred))
# Oversampling raised recall on the test set from about 0.60 to about 0.94.

(13138, 8)

(1233, 8)

(13138, 8)

(13138, 8)

recall:  0.9393203883495146
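
The oversampling steps above could be wrapped in a small reusable helper, sketched here on a toy DataFrame. The function name `oversample_minority` and the toy column `amount` are hypothetical. As in the cell above, the helper should only ever be given the training split, so that the test set keeps its original class distribution.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(train: pd.DataFrame, target: str, random_state: int = 0) -> pd.DataFrame:
    """Return a frame in which every class has as many rows as the largest class."""
    counts = train[target].value_counts()
    n_max = int(counts.max())
    parts = []
    for label, n in counts.items():
        rows = train[train[target] == label]
        if n < n_max:
            # Sample with replacement until this class matches the largest one.
            rows = resample(rows, replace=True, n_samples=n_max, random_state=random_state)
        parts.append(rows)
    return pd.concat(parts)


# Toy training data: 8 legitimate rows and 2 fraudulent rows.
toy = pd.DataFrame({"amount": range(10), "fraud": [0] * 8 + [1] * 2})
balanced = oversample_minority(toy, target="fraud")
print(balanced["fraud"].value_counts())
```

Because the minority rows are drawn with replacement, the balanced frame contains exact duplicates of the original fraud rows rather than new information.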


In [None]:
train = pd.concat([X_train, y_train],axis=1)

# separate majority/minority classes
no_fraud = train[train['fraud']==0]
yes_fraud = train[train['fraud']==1]

display(no_fraud.shape)
display(yes_fraud.shape)

# undersample majority
no_fraud_undersampled = resample(no_fraud, #<- downsample from here
                                    replace=False, #<- no need to reuse data now, we have an abundance
                                    n_samples = len(yes_fraud),
                                    random_state=0)

display(yes_fraud.shape)
display(no_fraud_undersampled.shape)

train_undersampled = pd.concat([yes_fraud,no_fraud_undersampled])

y_train_under = train_undersampled['fraud'].copy()
X_train_under = train_undersampled.drop('fraud',axis = 1).copy()


LR = LogisticRegression(max_iter=1000)
LR.fit(X_train_under, y_train_under)
pred = LR.predict(X_test)

print("recall: ",recall_score(y_test,pred))
# Undersampling raised recall to about 0.95, marginally above oversampling (about 0.94).

(13138, 8)

(1233, 8)

(1233, 8)

(1233, 8)

recall:  0.9466019417475728
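
Resampling often buys recall at the cost of precision, because a model trained on balanced data tends to flag more legitimate transactions as fraud. Whether that happens here is not shown in the outputs above, since only recall is printed. The sketch below illustrates the trade-off with hypothetical confusion counts, not numbers from this lab. `precision_recall_f1` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical confusion counts, NOT taken from the lab's models.
before = precision_recall_f1(tp=60, fp=10, fn=40)  # trained on imbalanced data
after = precision_recall_f1(tp=95, fp=70, fn=5)    # trained on resampled data

for name, (p, r, f) in [("before", before), ("after", after)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

On the real models, uncommenting the `precision_score` and `f1_score` lines from the training cell would show how large this effect is.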


In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 123,sampling_strategy=1.0)
X_train_SMOTE,y_train_SMOTE = sm.fit_resample(X_train,y_train)

LR = LogisticRegression(max_iter=1000)
LR.fit(X_train_SMOTE, y_train_SMOTE)
pred = LR.predict(X_test)

print("recall: ",recall_score(y_test,pred))
# SMOTE also improves recall (about 0.92), though slightly less than random over- and undersampling here.

recall:  0.9247572815533981
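
Unlike random oversampling, SMOTE does not duplicate minority rows. It creates new rows by interpolating between a minority row and one of its nearest minority neighbours. The NumPy sketch below shows only that core idea. It is a simplified illustration, not imbalanced-learn's implementation, and the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest rows

    base = rng.integers(0, len(X_min), size=n_new)  # random starting rows
    pick = rng.integers(0, k, size=n_new)           # which neighbour to move towards
    target = neighbours[base, pick]
    lam = rng.random((n_new, 1))                    # interpolation factor in [0, 1)

    # Each synthetic row lies on the segment between a row and its neighbour.
    return X_min[base] + lam * (X_min[target] - X_min[base])


# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Every synthetic row lies between two existing minority rows, so the new points stay within the region the minority class already occupies.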


