# Balancing an Imbalanced Dataset
***

# Purpose
***
The purpose of this assignment is to practice using oversampling via SMOTE and undersampling to handle imbalanced data.

__IMPORTANT NOTE: In a full project that involves modeling, I'd normally split the data into train and test sets first and resample only the training set. Since I won't be modeling here, I skip the split.__

# Setup
***

In [10]:
# establishing environment
import pandas as pd
from imblearn.over_sampling import SMOTE

# Loading Data
***

In [2]:
# reading in data from local csv that was acquired from Kaggle
df = pd.read_csv('creditcard.csv')

In [3]:
# previewing data
df.head(2)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0


# Preparing the Data
***

In [4]:
# checking for nulls and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [5]:
# converting all column names to lower case to ease operations slightly
df.columns = df.columns.str.lower()

In [6]:
# renaming column "class" to "is_fraud" to make the column easier to understand and 
# allow for easier operations since "class" is a reserved word
df = df.rename({'class' : 'is_fraud'}, axis=1)
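
As a small aside on the rename above: attribute-style access such as `df.class` is a syntax error because `class` is a Python keyword, while `df.is_fraud` works. A quick check with the standard-library `keyword` module illustrates this (this snippet is only an illustration and was not part of the original notebook):

```python
import keyword

# "class" is reserved by the Python grammar, so it cannot be used as an attribute name
print(keyword.iskeyword("class"))

# "is_fraud" is an ordinary identifier, so attribute-style access is allowed
print(keyword.iskeyword("is_fraud"))
print("is_fraud".isidentifier())
```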

# Identifying magnitude of imbalance
***

In [7]:
# examining balance of is_fraud values
# many more non-fraud than fraud rows so the data is imbalanced
# if I were to model this data, it would be best to balance them
df['is_fraud'].value_counts()

0    284315
1       492
Name: is_fraud, dtype: int64

In [8]:
# calculating balance in percent format
# over 99% of data is non-fraud, <1% is fraud
# extremely imbalanced
df['is_fraud'].value_counts() / len(df)

0    0.998273
1    0.001727
Name: is_fraud, dtype: float64
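
To put the imbalance in more concrete terms, the counts printed above can be turned into a ratio of non-fraud rows per fraud row. This is a minimal sketch that hard-codes the two counts from the `value_counts()` output; the variable names are my own.

```python
# class counts copied from the value_counts() output above
non_fraud = 284315
fraud = 492
total = non_fraud + fraud

# how many non-fraud rows exist for every fraud row
imbalance_ratio = non_fraud / fraud

# fraction of all rows that are fraud
fraud_share = fraud / total

print(f"Total rows: {total}")
print(f"Non-fraud rows per fraud row: {imbalance_ratio:.1f}")
print(f"Fraud share of all rows: {fraud_share:.4%}")
```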

# Balancing the data via oversampling using SMOTE
In a nutshell, SMOTE (Synthetic Minority Over-sampling Technique) creates new, synthetic rows for the minority class (fraud) by interpolating between existing fraud rows and their nearest fraud neighbors, until the classes are balanced. This is useful when undersampling would cause us to lose too much data. I'll cover undersampling in more depth in the next section.
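
To make the idea more tangible, here is a toy sketch of the interpolation step behind SMOTE, written with the standard library only. It is not the `imblearn` implementation. The function name `smote_like_sample`, the neighbor count `k=2`, and the four sample points are all hypothetical and chosen just for illustration.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point from a list of minority-class points (tuples of floats)."""
    rng = rng or random.Random()

    # pick a random minority point as the starting point
    base = rng.choice(minority)

    # find its k nearest minority neighbors, excluding the point itself
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]

    # choose one neighbor and move a random fraction of the way toward it
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # value in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


# toy minority class with two features (hypothetical values)
fraud_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]

rng = random.Random(0)  # seeded so the run is repeatable
synthetic = [smote_like_sample(fraud_points, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(point)
```

Each synthetic point sits on the line segment between two real minority points, which is why the new rows resemble existing fraud rows rather than being exact copies.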

In [15]:
# creating variables that hold non-target variables and target variable
x = df.drop(columns=['is_fraud'])
y = df['is_fraud']

In [16]:
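# using SMOTE to create synthetic "fraud" rows to create even class split
# note: fit_sample was removed in newer imbalanced-learn releases; fit_resample is the current method
x_oversamp, y_oversamp = SMOTE().fit_resample(x, y)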

In [38]:
# printing rows from original df and new DFs
print('Original DF Rows:', df.shape[0])
print('Oversampled DF X Rows:', x_oversamp.shape[0])
print('Oversampled DF y Rows:', y_oversamp.shape[0])

Original DF Rows: 284807
Oversampled DF X Rows: 568630
Oversampled DF y Rows: 568630


- The oversampled DFs have 568,630 rows, roughly double the original 284,807, because SMOTE added synthetic fraud rows until both classes were the same size
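
The row counts above can be checked with a little arithmetic. This sketch assumes SMOTE's default behavior of raising every non-majority class to the size of the majority class; the counts are copied from the earlier `value_counts()` output and the variable names are my own.

```python
# class counts before oversampling, copied from the earlier value_counts() output
original_counts = {0: 284315, 1: 492}

# assumption: each class is brought up to the size of the largest class
majority_size = max(original_counts.values())

# synthetic rows needed per class to reach the majority size
synthetic_needed = {label: majority_size - n for label, n in original_counts.items()}

# total rows once every class matches the majority size
rows_after = majority_size * len(original_counts)

print("Synthetic rows per class:", synthetic_needed)
print("Total rows after oversampling:", rows_after)
```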

In [40]:
# checking class counts after oversampling
y_oversamp.value_counts()

1    284315
0    284315
Name: is_fraud, dtype: int64

In [41]:
# checking class balance after oversampling as a proportion of all rows
y_oversamp.value_counts() / len(y_oversamp)

1    0.5
0    0.5
Name: is_fraud, dtype: float64

- Fraud and non-fraud rows are now split evenly, 50/50