<a href="https://colab.research.google.com/github/AhsenRiaz/ML-Data/blob/main/03_handling_imbalanced_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imbalanced Dataset
An imbalanced dataset is one in which one class has substantially more (or fewer) instances than another, so the class distribution is heavily skewed.

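Solution:
There are several ways to handle class imbalance; the one used here is resampling.
1. Oversampling: add rows to the minority class (for example, by sampling it with replacement) until it matches the majority class.
2. Undersampling: drop rows from the majority class until it matches the minority class (used in this example).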
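
The two resampling strategies listed above can be sketched on a small toy DataFrame. This is a minimal illustration, not the notebook's actual data: the `toy` frame, its column names, and the `random_state` values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
toy = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = toy[toy["label"] == 0]
minority = toy[toy["label"] == 1]

# Oversampling: draw minority rows WITH replacement until they match the majority.
oversampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
)

# Undersampling: draw majority rows WITHOUT replacement down to the minority size.
undersampled = pd.concat(
    [majority.sample(n=len(minority), random_state=0), minority]
)

print(oversampled["label"].value_counts().to_dict())
print(undersampled["label"].value_counts().to_dict())
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards majority rows, so the balanced result is much smaller.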


In [1]:
import pandas as pd
import numpy as np

In [3]:
credit_card_dataset = pd.read_csv('/content/credit_data.csv')

credit_card_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [4]:
credit_card_dataset.shape

(47628, 31)

In [5]:
credit_card_dataset['Class'].value_counts()

0.0    47481
1.0      146
Name: Class, dtype: int64

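As the counts show, this dataset is highly imbalanced: 47,481 legitimate transactions (class 0) versus only 146 fraudulent ones (class 1). The two counts sum to 47,627, one fewer than the 47,628 rows reported by `shape`, which suggests that one row has a missing `Class` value (`value_counts()` ignores missing values by default).

We will use undersampling to handle this imbalance.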
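
To put a number on the imbalance, the class counts can be turned into proportions and a majority-to-minority ratio. The sketch below rebuilds the counts by hand from the output above so that it runs on its own; the names `counts`, `proportions`, and `imbalance_ratio` are introduced only for this example. On the real DataFrame, `credit_card_dataset['Class'].value_counts(normalize=True)` should give the proportions directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0.0: 47481, 1.0: 146}, name="Class")

# Share of each class among the labelled rows.
proportions = counts / counts.sum()

# How many majority rows exist for every minority row.
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(4))
print(f"majority:minority ratio is about {imbalance_ratio:.0f}:1")
```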

In [6]:
# separating the legit and fraudulent transactions
legit = credit_card_dataset[credit_card_dataset.Class == 0]
fraud = credit_card_dataset[credit_card_dataset.Class == 1]

In [7]:
print(legit.shape, "\n")
print(fraud.shape, "\n")

(47481, 31) 

(146, 31) 



In [11]:
legit_sample = legit.sample(n=len(fraud))  # random sample of legit rows, one per fraud row (146)

print(legit_sample.shape)

(146, 31)
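
Because `sample` draws rows at random, each run of the cell above selects a different set of legitimate transactions. If reproducible results are wanted, one option is to pass `random_state`. The sketch below uses a hypothetical stand-in frame, `legit_demo`, and an arbitrary seed of 42; neither comes from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the `legit` DataFrame.
legit_demo = pd.DataFrame({"Time": range(1000), "Class": 0.0})

# Same seed, same rows: the sample is reproducible across runs.
sample_a = legit_demo.sample(n=5, random_state=42)
sample_b = legit_demo.sample(n=5, random_state=42)

print(sample_a.index.equals(sample_b.index))  # True
```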


In [16]:
# creating a new dataset by concatenating these datasets
# axis=0 stacks the datasets on top of each other (rows are appended)
# axis=1 places the datasets side by side, from left to right (columns are appended)

credit_card_dataset_balanced = pd.concat([legit_sample, fraud], axis=0)
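
The effect of the `axis` argument is easiest to see on two tiny frames. The frames `top` and `bottom` below are made up for illustration only.

```python
import pandas as pd

top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

stacked = pd.concat([top, bottom], axis=0)       # rows appended
side_by_side = pd.concat([top, bottom], axis=1)  # columns appended

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```

With `axis=0` the original index labels are kept, which is why the balanced dataset still shows row labels such as 35330 from the full dataset. Passing `ignore_index=True` would renumber the rows from 0.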

In [17]:
credit_card_dataset_balanced.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
35330,38072,1.028141,-0.333513,-0.092083,0.202093,-0.58262,-1.293684,0.420982,-0.282739,0.103064,...,-0.373272,-1.508792,0.099358,0.37388,0.007647,0.611489,-0.135722,0.025583,147.99,0.0
11571,19909,1.056077,0.034563,1.030197,1.372536,-0.614096,0.072609,-0.588844,0.209796,1.516305,...,0.062002,0.33776,-0.020723,0.132993,0.296638,-0.326979,0.014106,0.014867,30.0,0.0
13578,24074,-0.336583,1.305971,1.454848,0.257695,0.17751,-0.918364,0.673916,-0.200786,0.764642,...,-0.405039,-0.807476,-0.014241,0.295969,-0.156012,0.030496,0.22452,0.09362,2.69,0.0
39504,39865,-0.747709,0.407251,-0.914767,-2.331414,2.023387,3.047759,-0.20443,1.11801,-1.852448,...,-0.296772,-0.761838,-0.034954,0.987181,-0.153853,0.255494,-0.166788,0.061597,24.5,0.0
5222,5020,0.990123,-0.690735,0.925905,-0.665393,-0.870173,0.521851,-0.935101,0.339272,2.952047,...,0.121707,0.71447,-0.115698,-0.310748,0.345911,-0.417228,0.070654,0.018226,69.14,0.0


In [18]:
credit_card_dataset_balanced.shape

(292, 31)

In [19]:
credit_card_dataset_balanced['Class'].value_counts()

0.0    146
1.0    146
Name: Class, dtype: int64
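
The steps in this notebook (split by class, sample the majority class down, concatenate) could be wrapped in a small helper. The function below is a sketch under assumptions, not part of the original notebook: the name `undersample`, its parameters, and the final shuffle are all additions made for illustration. The shuffle is there because `pd.concat` leaves the rows grouped in blocks by class.

```python
import pandas as pd

def undersample(df, label_col, random_state=None):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts, axis=0).sample(frac=1, random_state=random_state)

# Toy data standing in for the credit card dataset: 9 legit rows, 3 fraud rows.
demo = pd.DataFrame({"x": range(12), "Class": [0.0] * 9 + [1.0] * 3})
balanced = undersample(demo, "Class", random_state=0)
print(balanced["Class"].value_counts().to_dict())
```

Rows whose label is missing are dropped by this helper, since both `value_counts` and `groupby` skip missing keys by default.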