Imbalanced Dataset:

- A dataset in which the classes are not equally represented

E.g., suppose we have a diabetes dataset with 1000 records of non-diabetic people and only 100 records of diabetic people. The two classes are unequally distributed, so the dataset is imbalanced.
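
The degree of imbalance in the example above can be put into numbers. The snippet below is a minimal sketch on made-up labels: the `labels` series and the column name `Outcome` are assumptions for illustration, not part of any real diabetes dataset.

```python
import pandas as pd

# Hypothetical labels matching the example: 1000 non-diabetic (0), 100 diabetic (1)
labels = pd.Series([0] * 1000 + [1] * 100, name="Outcome")

counts = labels.value_counts()

# Majority-to-minority ratio and the minority class's share of all rows
imbalance_ratio = counts.max() / counts.min()
minority_share = counts.min() / counts.sum()

print(counts)
print(f"imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"minority share: {minority_share:.1%}")
```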

In [1]:
# importing the dependencies
import numpy as np
import pandas as pd

In [29]:
# loading the dataset into a pandas DataFrame
credit_card_data = pd.read_csv('/content/credit_data.csv')

In [12]:
credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [30]:
credit_card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.01448,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.05508,2.03503,-0.738589,0.868229,1.058415,0.02433,0.294869,0.5848,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.24964,-0.557828,2.630515,3.03126,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.24044,0.530483,0.70251,0.689799,-0.377961,0.623708,-0.68618,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.0,0
284806,172792.0,-0.533413,-0.189733,0.703337,-0.506271,-0.012546,-0.649617,1.577006,-0.41465,0.48618,...,0.261057,0.643078,0.376777,0.008797,-0.473649,-0.818267,-0.002415,0.013649,217.0,0


In [31]:
# distribution of the two classes
credit_card_data['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

This is a highly imbalanced dataset: only 492 of the 284,807 transactions (about 0.17%) are fraudulent.

0 --> Legit Transactions
1 --> Fraudulent Transactions
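
To see how skewed the split is, the counts printed above can be turned into proportions. This sketch rebuilds the counts by hand in a small series (`class_counts` is a name chosen here) so that it runs without the CSV file. On the real dataframe, `credit_card_data['Class'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
class_counts = pd.Series({0: 284315, 1: 492}, name="count")

# Share of each class among all transactions
class_share = class_counts / class_counts.sum()
fraud_share = class_share.loc[1]

print(class_share)
print(f"fraud share: {fraud_share:.2%}")
```

With a split like this, a model that always predicts class 0 would still be right for more than 99% of the rows, which is why plain accuracy says little on the raw data.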

In [33]:
# separating legit and fraudulent transactions
legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]

In [34]:
print(legit.shape)
print(fraud.shape)

(284315, 31)
(492, 31)


Under-Sampling

Build a sample dataset with an equal number of legit and fraudulent transactions by randomly drawing as many legit transactions as there are fraudulent ones.

Number of fraudulent transactions --> 492

In [35]:
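# randomly draw as many legit rows as there are fraud rows (492);
# no random_state is set, so a different sample is drawn on each run
legit_sample = legit.sample(n=len(fraud))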

In [36]:
print(legit_sample.shape)

(492, 31)
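
Because `sample` draws rows at random, the under-sampled set changes between runs unless a seed is fixed. The sketch below shows one way to make the draw repeatable with the `random_state` parameter. It uses a small synthetic dataframe (`df`, with made-up `Amount` values) as a stand-in for the real data, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 legit rows (Class 0) and 20 fraud rows (Class 1)
df = pd.DataFrame({
    "Amount": rng.uniform(1, 500, size=1020).round(2),
    "Class": [0] * 1000 + [1] * 20,
})

legit = df[df["Class"] == 0]
fraud = df[df["Class"] == 1]

# Draw as many legit rows as there are fraud rows; the fixed seed
# makes the same rows come back on every run
legit_sample = legit.sample(n=len(fraud), random_state=42)
legit_sample_again = legit.sample(n=len(fraud), random_state=42)

print(legit_sample.shape)
print(legit_sample.index.equals(legit_sample_again.index))
```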


Concatenate the two dataframes

In [38]:
# stack the sampled legit rows on top of the fraud rows (row-wise concatenation)
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [39]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
97223,66117.0,1.171928,-1.070456,0.053051,-0.600388,-0.891454,0.070048,-0.677212,0.202065,-0.42135,...,0.13665,0.194663,-0.176375,-0.309974,0.519273,-0.115188,-0.013136,0.001873,96.75,0
50471,44504.0,1.551859,-1.09828,0.345359,-1.518276,-1.408498,-0.46659,-1.159334,-0.065661,-1.524527,...,-0.433219,-0.95188,0.084157,-0.541787,0.206095,-0.354759,0.034354,0.0198,17.6,0
28213,34905.0,0.926094,-0.864716,1.051325,0.310737,-1.537351,-0.310139,-0.760816,0.243552,0.94307,...,0.186842,0.273582,-0.098304,0.588285,0.038304,1.072128,-0.076436,0.024175,131.1,0
259581,159180.0,-1.454965,1.022804,-0.991559,-0.172936,0.363965,-1.21396,0.756247,0.16149,-0.154852,...,0.313025,1.127031,-0.213785,0.020753,-0.409381,-0.182103,0.265428,0.305662,74.89,0
145670,87120.0,-0.882767,1.880427,2.404731,4.684274,-0.651347,0.220064,0.00917,0.634988,-1.906572,...,-0.350136,-0.987496,0.094393,0.842395,-0.190992,0.027534,0.228843,0.102214,5.32,0


In [40]:
new_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.88285,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.29268,0.147968,390.0,1
280143,169347.0,1.378559,1.289381,-5.004247,1.41185,0.442581,-1.326536,-1.41317,0.248525,-1.127396,...,0.370612,0.028234,-0.14564,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.2137,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.65225,...,0.751826,0.834108,0.190944,0.03207,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.39973,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.2537,245.0,1
281674,170348.0,1.991976,0.158476,-2.583441,0.40867,1.151147,-0.096695,0.22305,-0.068384,0.577829,...,-0.16435,-0.295135,-0.072173,-0.450261,0.313267,-0.289617,0.002988,-0.015309,42.53,1


In [41]:
new_dataset['Class'].value_counts()

Class
0    492
1    492
Name: count, dtype: int64
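
The concatenated dataframe keeps all legit rows first and all fraud rows last, and it still carries the original row labels, as the `head()` and `tail()` outputs above show. If a later step depends on row order, one option is to shuffle the rows and rebuild the index. This is a sketch on a tiny synthetic frame, not a step taken in the notebook itself. The `shuffled` name and the seed are assumptions.

```python
import pandas as pd

# Synthetic stand-in for the concatenated frame: 5 legit rows followed by 5 fraud rows
legit_sample = pd.DataFrame({"Amount": [10.0, 20.0, 30.0, 40.0, 50.0], "Class": 0})
fraud = pd.DataFrame({"Amount": [1.0, 2.0, 3.0, 4.0, 5.0], "Class": 1})
new_dataset = pd.concat([legit_sample, fraud], axis=0)

# frac=1 samples every row exactly once, which amounts to a random permutation;
# reset_index(drop=True) replaces the old row labels with 0..n-1
shuffled = new_dataset.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled["Class"].value_counts())
```

Shuffling changes only the order of the rows, so the class counts stay balanced.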