# Balancing an Imbalanced Dataset
***
***

# Purpose
***
The purpose of this assignment is to practice using oversampling via SMOTE and undersampling to handle imbalanced data.

__IMPORTANT NOTE: In a full project that involves modeling, I'd normally split the data first and resample only the training set. Since I won't be modeling here, I resample the full dataset without splitting it.__
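
One common way to combine splitting with resampling is to split first and balance only the training rows, so the held-out rows keep the original class mix. The following is a minimal pandas-only sketch on toy data, not part of this assignment: the `toy` frame, the 80/20 split, and the duplication step are all assumptions made for illustration.

```python
import pandas as pd

# Toy frame (hypothetical data): 90 non-fraud rows and 10 fraud rows
toy = pd.DataFrame({"amount": range(100), "is_fraud": [0] * 90 + [1] * 10})

# Stratified 80/20 split: hold out 20% of each class as the test set
test = toy.groupby("is_fraud").sample(frac=0.2, random_state=0)
train = toy.drop(test.index)

# Balance the training rows only by duplicating random fraud rows
train_fraud = train[train["is_fraud"] == 1]
train_legit = train[train["is_fraud"] == 0]
n_extra = len(train_legit) - len(train_fraud)
extra = train_fraud.sample(n=n_extra, replace=True, random_state=0)
train_balanced = pd.concat([train, extra], ignore_index=True)

print(train_balanced["is_fraud"].value_counts().to_dict())
print(test["is_fraud"].value_counts().to_dict())
```

The test rows are never resampled, so any later evaluation would reflect the imbalance seen in the original data.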

# Setup
***

In [89]:
# establishing environment
import pandas as pd
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Loading Data
***

In [2]:
# reading in data from local csv that was acquired from Kaggle
df = pd.read_csv('creditcard.csv')

In [3]:
# previewing data
df.head(2)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0


# Preparing the Data
***

In [4]:
# checking for nulls and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [5]:
# converting all column names to lower case to ease operations slightly
df.columns = df.columns.str.lower()

In [6]:
# renaming column "class" to "is_fraud" to make the column easier to understand and
# to allow attribute-style access (df.is_fraud), since "class" is a reserved word in Python
df = df.rename(columns={'class': 'is_fraud'})

# Identifying magnitude of imbalance
***

In [86]:
# creating df to observe data imbalance
imbalance_df = pd.DataFrame(df['is_fraud'].value_counts())
imbalance_df['percent'] = df['is_fraud'].value_counts() / len(df)

imbalance_df

Unnamed: 0,is_fraud,percent
0,284315,0.998273
1,492,0.001727


- Extreme imbalance between non-fraud and fraud counts
    - only about 0.17% of all rows are fraudulent
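
The class shares above can also be read directly from `value_counts(normalize=True)`. The following is a small sketch on toy labels; `toy_labels` and its 998/2 split are made up for illustration and are not the credit card data.

```python
import pandas as pd

# Toy labels (hypothetical data): 998 non-fraud rows and 2 fraud rows
toy_labels = pd.Series([0] * 998 + [1] * 2, name="is_fraud")

# normalize=True returns each class's share of the rows instead of raw counts
class_share = toy_labels.value_counts(normalize=True)

# Majority-to-minority ratio, a compact way to state the imbalance
counts = toy_labels.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(class_share)
print(f"majority:minority ratio = {imbalance_ratio:.0f}:1")
```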

# Handling Imbalanced Data
***

## Creating variables to hold the non-target variables (x) and target variable (y)

In [98]:
# creating variables that hold non-target variables and target variable
x = df.drop(columns=['is_fraud'])
y = df['is_fraud']

# Balancing the data via oversampling using SMOTE
In a nutshell, SMOTE creates new, synthetic rows that resemble the minority class (fraud) in order to balance out the classes. This is useful when undersampling would leave too little data overall. I'll cover undersampling in more depth in a later section.
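
To make "rows that resemble the minority class" concrete, here is a rough NumPy sketch of the interpolation idea behind SMOTE: pick a minority row, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. This is not imblearn's implementation; `smote_like_samples`, its parameters, and the toy `fraud_rows` array are hypothetical names used only for illustration, and the function assumes at least two minority rows.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n_rows = len(minority)
    k = min(k, n_rows - 1)  # cannot have more neighbors than other rows

    # Pairwise Euclidean distances between minority rows
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is never its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n_rows)                  # random minority row
        neighbor = neighbors[base, rng.integers(k)]  # one of its k nearest neighbors
        gap = rng.random()                           # position on the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbor] - minority[base])
    return synthetic

# Toy minority rows (hypothetical data) with two features each
fraud_rows = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_like_samples(fraud_rows, n_new=6, k=2)
print(new_rows.shape)
```

Because every synthetic row lies between two real minority rows, it stays inside the region the minority class already occupies rather than being an exact copy of any one row.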

__In this example, I'll be using SMOTE to create an even split between the classes (i.e., 50% non-fraud, 50% fraud).__

In [104]:
# using SMOTE to create synthetic "fraud" rows to create even class split
# (fit_resample replaces the older fit_sample, which newer imblearn versions no longer provide)
x_oversamp_smote, y_oversamp_smote = SMOTE().fit_resample(x, y)

In [105]:
# creating df of class counts after SMOTE 
smote_results = pd.DataFrame(y_oversamp_smote.value_counts())

# adding percentages of each class
smote_results['percent'] = y_oversamp_smote.value_counts() / len(y_oversamp_smote)

smote_results

Unnamed: 0,is_fraud,percent
1,284315,0.5
0,284315,0.5


- Created enough synthetic rows so that fraud and non-fraud rows are now split evenly, 50/50

# Balancing the data via random oversampling
This method of oversampling works by duplicating existing rows of data to balance out the classes instead of creating artificial rows.

__In this example I'll duplicate random minority class rows (fraudulent) until there are half as many as the majority class (non-fraudulent) rows.__
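
The duplication step can be sketched by hand with pandas on toy data. This only illustrates the idea and is not how `RandomOverSampler` is implemented internally; the `toy` frame and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical data): 20 non-fraud rows and 2 fraud rows
toy = pd.DataFrame({"amount": range(22), "is_fraud": [0] * 20 + [1] * 2})

majority = toy[toy["is_fraud"] == 0]
minority = toy[toy["is_fraud"] == 1]

# Target minority count for a 0.5 minority-to-majority ratio
target_minority = int(len(majority) * 0.5)
n_extra = target_minority - len(minority)

# Sample with replacement, since more copies are needed than rows exist
duplicates = minority.sample(n=n_extra, replace=True, random_state=0)

oversampled = pd.concat([majority, minority, duplicates], ignore_index=True)
print(oversampled["is_fraud"].value_counts().to_dict())
```

Every added row is an exact copy of an existing fraud row, so no new feature values appear in the data.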

In [107]:
# creating random oversampling object and specifying 0.5 in the sampling_strategy argument
# so that random minority class rows are duplicated until the minority class row count
# is half of the majority class row count
rando_oversamp = RandomOverSampler(sampling_strategy=0.5)

In [108]:
# performing random over sampling
X_oversamp, y_oversamp = rando_oversamp.fit_resample(x, y)

In [109]:
# creating df of class counts after random oversampling 
oversamp_results = pd.DataFrame(y_oversamp.value_counts())

# adding percentages of each class
oversamp_results['percent'] = y_oversamp.value_counts() / len(y_oversamp)

oversamp_results

Unnamed: 0,is_fraud,percent
0,284315,0.666667
1,142157,0.333333


- Duplicated random minority (fraudulent) rows until there are half as many minority rows as majority (non-fraudulent) rows, a 2:1 split
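
The counts and percentages in the table above follow from simple arithmetic on the 0.5 ratio. The sketch below assumes the fractional row count is truncated to a whole number, which agrees with the minority count shown in the output.

```python
# Majority class count reported in this notebook
n_majority = 284315
sampling_strategy = 0.5

# Minority count implied by the ratio, truncated to a whole row (assumption)
n_minority = int(n_majority * sampling_strategy)

total = n_majority + n_minority
majority_share = round(n_majority / total, 6)
minority_share = round(n_minority / total, 6)

print(n_minority, majority_share, minority_share)
```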

# Balancing the data via random undersampling
In a nutshell, undersampling is the process of reducing the amount of data that belongs to the majority class (non-fraud) to balance it with the minority class (fraud). This is useful when we want to avoid creating synthetic data (i.e., we only want real data) and we have enough data that a healthy amount remains even after undersampling. It is also useful when the classes are very similar to each other and SMOTE's synthetic rows could be ambiguous with regard to their class.

__In this example, I'll remove random majority class (non-fraudulent) rows until the classes are evenly split (i.e., 50% non-fraud, 50% fraud). To preserve more of the majority class instead, `sampling_strategy` also accepts a float ratio: `0.5`, for example, leaves twice as many majority rows as minority rows.__
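
The effect of the ratio on how many majority rows survive can be sketched by hand with pandas. This is an illustration on toy data rather than imblearn's implementation; `undersample_majority` and its `ratio` parameter are hypothetical, with `ratio` meaning minority rows divided by majority rows after resampling (1.0 gives an even split, 0.5 keeps twice as many majority rows).

```python
import pandas as pd

def undersample_majority(frame, target, ratio=1.0, seed=0):
    """Drop random majority rows until minority / majority equals ratio."""
    counts = frame[target].value_counts()
    minority_label, majority_label = counts.idxmin(), counts.idxmax()
    minority = frame[frame[target] == minority_label]
    majority = frame[frame[target] == majority_label]

    # Number of majority rows to keep for the requested ratio
    n_keep = int(len(minority) / ratio)
    kept = majority.sample(n=n_keep, replace=False, random_state=seed)
    return pd.concat([kept, minority], ignore_index=True)

# Toy frame (hypothetical data): 40 non-fraud rows and 4 fraud rows
toy = pd.DataFrame({"amount": range(44), "is_fraud": [0] * 40 + [1] * 4})
even_split = undersample_majority(toy, "is_fraud", ratio=1.0)
two_to_one = undersample_majority(toy, "is_fraud", ratio=0.5)

print(even_split["is_fraud"].value_counts().to_dict())
print(two_to_one["is_fraud"].value_counts().to_dict())
```

All minority rows are kept in both cases; only the number of surviving majority rows changes.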

In [101]:
# creating random undersampling object and specifying 'majority' in the sampling_strategy argument
# so that enough majority class rows are removed to create an even split between the classes
rando_undersamp = RandomUnderSampler(sampling_strategy='majority')

In [102]:
# performing random under sampling
X_undersamp, y_undersamp = rando_undersamp.fit_resample(x, y)

In [103]:
# creating df of class counts after random undersampling 
undersamp_results = pd.DataFrame(y_undersamp.value_counts())

# adding percentages of each class
undersamp_results['percent'] = y_undersamp.value_counts() / len(y_undersamp)

undersamp_results

Unnamed: 0,is_fraud,percent
1,492,0.5
0,492,0.5


- Number of non-fraudulent rows reduced to the point that there is now a 50/50 split between fraudulent and non-fraudulent rows