# Balancing an Imbalanced Dataset
***
***

# Purpose
***
The purpose of this assignment is to practice using oversampling via SMOTE and undersampling to handle imbalanced data.

# Setup
***

In [2]:
# establishing environment
import pandas as pd

# Loading Data
***

In [3]:
# reading in data from local csv that was acquired from Kaggle
df = pd.read_csv('creditcard.csv')

In [25]:
# previewing data
df.head(2)

Unnamed: 0,time,v1,v2,v3,v4,v5,v6,v7,v8,v9,...,v21,v22,v23,v24,v25,v26,v27,v28,amount,is_fraud
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0


# Preparing the Data
__IMPORTANT NOTE__: In a full project that involves exploration and modeling, I'd normally split and scale the data. Since I won't be modeling or exploring the data beyond identifying the magnitude of the class imbalance, neither step is needed here.
***

In [24]:
# checking for nulls and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   time      284807 non-null  float64
 1   v1        284807 non-null  float64
 2   v2        284807 non-null  float64
 3   v3        284807 non-null  float64
 4   v4        284807 non-null  float64
 5   v5        284807 non-null  float64
 6   v6        284807 non-null  float64
 7   v7        284807 non-null  float64
 8   v8        284807 non-null  float64
 9   v9        284807 non-null  float64
 10  v10       284807 non-null  float64
 11  v11       284807 non-null  float64
 12  v12       284807 non-null  float64
 13  v13       284807 non-null  float64
 14  v14       284807 non-null  float64
 15  v15       284807 non-null  float64
 16  v16       284807 non-null  float64
 17  v17       284807 non-null  float64
 18  v18       284807 non-null  float64
 19  v19       284807 non-null  float64
 20  v20 
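
`df.info()` shows non-null counts per column, and a more direct null check could simply total the missing values. A minimal sketch, assuming a small hypothetical frame named `toy` in place of the credit card data:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with one missing value (not the credit card data)
toy = pd.DataFrame({'v1': [0.1, np.nan, 0.3], 'amount': [1.0, 2.0, 3.0]})

# count missing values per column, then overall
nulls_per_column = toy.isna().sum()
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print(total_nulls)
```

A total of zero on the real data would confirm that no rows need imputing or dropping.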

In [6]:
# converting all column names to lower case to ease operations slightly
df.columns = df.columns.str.lower()

In [14]:
# renaming column "class" to "is_fraud" to make the column easier to understand and 
# allow for easier operations since "class" is a reserved word
df = df.rename({'class' : 'is_fraud'}, axis=1)

# Explore
***

In [16]:
# examining balance of is_fraud values
# many more non-fraud than fraud rows so the data is imbalanced
# if I were to model this data, it would be best to balance them
df['is_fraud'].value_counts()

0    284315
1       492
Name: is_fraud, dtype: int64

In [21]:
# calculating the share of each class as a proportion of all rows
# over 99% of the data is non-fraud, less than 1% is fraud
# extremely imbalanced
df['is_fraud'].value_counts() / len(df)

0    0.998273
1    0.001727
Name: is_fraud, dtype: float64
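
The same proportions can also be obtained with the `normalize=True` option of `value_counts`, which avoids the manual division. A minimal sketch on hypothetical toy labels (the `labels` series is made up for illustration):

```python
import pandas as pd

# hypothetical toy labels: 3 fraud rows out of 1000
labels = pd.Series([0] * 997 + [1] * 3, name='is_fraud')

# normalize=True returns each class's share of all rows
shares = labels.value_counts(normalize=True)

# expressed as percentages, rounded for readability
percent = (shares * 100).round(2)
print(percent)
```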

### Explore Takeaways
- Fraud and non-fraud classes are extremely imbalanced
    - About 99.83% of the rows are non-fraud (284,315 of 284,807)
    - About 0.17% of the rows are fraud (492 rows)
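
Given an imbalance like the one above, the two approaches named in the Purpose section could be sketched as follows. This is an illustration under stated assumptions, not the notebook's own method: `toy` is a hypothetical frame standing in for the credit card data, `undersample_majority` and `smote_like_oversample` are made-up helper names, and the second helper is a simplified version of the SMOTE idea (interpolating between a minority row and one of its nearest minority neighbours), not the implementation found in imbalanced-learn.

```python
import numpy as np
import pandas as pd


def undersample_majority(df, target, random_state=0):
    """Randomly keep as many rows of each class as the smallest class has."""
    n_minority = df[target].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # shuffle so the classes are not stacked in blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


def smote_like_oversample(X_minority, n_new, k=5, random_state=0):
    """Create n_new synthetic rows between minority rows and their neighbours.

    Simplified illustration of the SMOTE idea, not a library implementation.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X_minority, dtype=float)
    k = min(k, len(X) - 1)
    # pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is never its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # pick a random base row and one of its k nearest neighbours
    base = rng.integers(0, len(X), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # place the new point at a random position on the segment between them
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])


# hypothetical toy frame: 20 fraud rows, 980 non-fraud rows
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'v1': rng.normal(size=1000),
    'v2': rng.normal(size=1000),
    'is_fraud': [1] * 20 + [0] * 980,
})

# undersampling: shrink the majority class to the minority size
under = undersample_majority(toy, 'is_fraud')

# oversampling: add synthetic fraud rows until both classes match
minority = toy.loc[toy['is_fraud'] == 1, ['v1', 'v2']]
synthetic = smote_like_oversample(minority, n_new=960)
synthetic_df = pd.DataFrame(synthetic, columns=['v1', 'v2']).assign(is_fraud=1)
over = pd.concat([toy, synthetic_df], ignore_index=True)

print(under['is_fraud'].value_counts())
print(over['is_fraud'].value_counts())
```

Undersampling discards most of the majority rows, while oversampling keeps every original row and adds synthetic ones. In practice, a maintained library such as imbalanced-learn would likely be a better choice than hand-written helpers like these.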