<a href="https://colab.research.google.com/github/SyedImranML/Machine-Learning/blob/main/4_7_How_to_Handle_imbalanced_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b> 4.7. How to Handle an Imbalanced Dataset | Data Pre-Processing  </b>

<b> Imbalanced Dataset </b>

A dataset with an unequal class distribution.

- Say we have a diabetes dataset containing datapoints for patients who have 
  diabetes and patients who don't.
- If the dataset is imbalanced, one class has far more datapoints than the 
  other. Here, suppose there are many more datapoints for diabetic patients 
  than for non-diabetic patients.
- For example, there might be 1000 datapoints for diabetic patients and only 
  100 for non-diabetic patients.
- So the distribution is 1000 in one class and 100 in the other.
- This is an example of an imbalanced dataset.
- We shouldn't feed this dataset to a machine learning model as-is: the model 
  tends to favor the larger class, so its predictions for the smaller class 
  will be poor.
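
To see why imbalance hurts, consider the 1000-versus-100 split described above. The sketch below uses synthetic labels (not real patient data) and a trivial "model" that always predicts the larger class. It is only an illustration, but it shows how accuracy can look high while the smaller class is never detected.

```python
import numpy as np

# Synthetic labels mirroring the example: 1000 points in class 1, 100 in class 0
y_true = np.array([1] * 1000 + [0] * 100)

# A trivial "model" that always predicts the majority class (1)
y_pred = np.ones_like(y_true)

# Overall accuracy: fraction of all labels predicted correctly
accuracy = (y_pred == y_true).mean()

# Recall on the minority class (0): fraction of minority points predicted as 0
minority_mask = y_true == 0
minority_recall = (y_pred[minority_mask] == 0).mean()

print(f"accuracy: {accuracy:.3f}")
print(f"minority recall: {minority_recall:.3f}")
```

The accuracy here is 1000/1100, roughly 91%, yet no minority-class point is ever identified, which is why accuracy alone can be misleading on imbalanced data.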

In [1]:
# importing the dependencies

import numpy as np
import pandas as pd

In [2]:
# Loading the dataset to pandas DataFrame
credit_card_data = pd.read_csv('/content/credit_data.csv')

In [3]:
# first 5 rows of the dataframe

credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [4]:
credit_card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
51586,45023,-2.943382,-2.332451,2.568959,0.747114,2.08355,-1.319267,-1.837967,0.421285,0.347715,...,-0.013848,-0.091317,-0.268974,0.069693,0.20292,0.35693,-0.006093,0.268288,26.24,0.0
51587,45024,-1.278283,-3.726046,-0.902718,2.796542,-1.750987,0.199763,1.223412,-0.106145,0.793279,...,0.664178,-0.592258,-1.141974,0.322687,0.081328,-0.354449,-0.150956,0.269036,1233.16,0.0
51588,45024,-0.852261,0.886192,-0.378032,-1.142044,1.880941,3.505875,-0.403582,1.365218,-0.369193,...,0.091829,0.066337,-0.05597,1.027545,-0.3279,0.255778,0.132317,0.133329,28.98,0.0
51589,45024,-0.44774,0.775759,1.586053,0.227702,-0.270962,-0.539013,0.488655,0.07107,0.308739,...,-0.033903,0.202157,-0.180777,0.424592,-0.172455,0.365861,0.35312,0.198239,3.99,0.0
51590,45026,1.04238,-1.096464,0.234271,-0.615254,-1.243107,-0.714834,-0.433026,-0.084918,-0.946072,...,,,,,,,,,,


In [6]:
# distribution of the two classes

credit_card_data['Class'].value_counts()

0.0    51440
1.0      150
Name: Class, dtype: int64

This is a highly imbalanced dataset.

0 --> Legit Transactions

1 --> Fraudulent Transactions

Legitimate means 'correct or acceptable according to a law or rule'.

A fraudulent transaction is an unauthorized use of an individual's account.
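
It can help to put a number on the imbalance. The sketch below rebuilds a label column with the same counts as the `value_counts()` output above (the `labels` Series is a stand-in, not the real `credit_card_data['Class']` column) and reports each class as a fraction of the total.

```python
import pandas as pd

# Stand-in label column with the counts shown above: 51440 legit, 150 fraud
labels = pd.Series([0.0] * 51440 + [1.0] * 150, name="Class")

# Absolute counts per class
counts = labels.value_counts()

# normalize=True returns each class as a fraction of all rows
fractions = labels.value_counts(normalize=True)

# How many legit rows there are for every fraud row
imbalance_ratio = counts.max() / counts.min()

print(fractions)
print(f"legit : fraud is about {imbalance_ratio:.0f} : 1")
```

With these counts, fraudulent transactions make up less than 0.3% of the labelled rows.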

In [7]:
# Separating the legit and fraudulent transactions

legit = credit_card_data[credit_card_data.Class == 0.0]
fraud = credit_card_data[credit_card_data.Class == 1.0]

In [8]:
print(legit.shape)
print(fraud.shape)

(51440, 31)
(150, 31)


We are going to implement a technique called under-sampling.

<b>Under-sampling </b>

Build a sample dataset containing a similar distribution of legit and fraudulent transactions.

Number of fraudulent transactions --> 150

In [10]:
legit_sample = legit.sample(n=150)   # Here n is the number of datapoints we want

In [11]:
print(legit_sample.shape)

(150, 31)
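
Because `sample()` draws rows at random, each run of the cell above picks a different set of legit transactions. The sketch below shows one possible variation: it takes the sample size from the minority class instead of hard-coding it and fixes `random_state` so the draw is repeatable. The small `data` frame is synthetic and only stands in for the real CSV; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit card data (950 legit rows, 50 fraud rows)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Amount": rng.uniform(1, 500, size=1000),
    "Class": [0.0] * 950 + [1.0] * 50,
})

legit = data[data.Class == 0.0]
fraud = data[data.Class == 1.0]

# Match the minority class size rather than hard-coding it,
# and fix the seed so the same rows are drawn on every run
legit_sample = legit.sample(n=len(fraud), random_state=42)

print(legit_sample.shape)
```

Deriving `n` from `len(fraud)` keeps the two classes balanced even if the number of fraud rows changes.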


In [12]:
print(legit_sample)

        Time        V1        V2         V3        V4        V5        V6  \
33737  37364 -0.999682  1.538412   0.509502 -0.485314  0.540350 -0.050640   
47915  43414  1.369717 -1.157116   0.611116 -1.171388 -1.609425 -0.535265   
12938  22731 -9.260584  7.452729 -13.748652  5.940094 -8.174739 -3.648539   
31657  36454  0.955576 -0.308161   1.188044  1.475890 -0.860503  0.327066   
4726    4158 -0.959084  1.504439   1.180137  1.034805 -0.002799 -0.486044   
...      ...       ...       ...        ...       ...       ...       ...   
39289  39774 -1.064511  0.794941   2.230122  1.005573 -0.980626  0.490387   
1922    1477 -4.481163  2.353595   2.340311 -0.921434  0.124872  0.507164   
15589  26982  1.460123 -1.268057   0.447235 -1.661999 -1.272985  0.439711   
46020  42606 -1.448037 -0.101255   0.749385 -2.505307 -1.005035 -0.145189   
32387  36776  1.253145 -0.859698   0.955939 -0.772346 -1.345816  0.049771   

              V7        V8        V9  ...       V21       V22       V23  \


In [13]:
print(fraud)

        Time        V1        V2        V3        V4        V5        V6  \
541      406 -2.312227  1.951992 -1.609851  3.997906 -0.522188 -1.426545   
623      472 -3.043541 -3.157307  1.088463  2.288644  1.359805 -1.064823   
4920    4462 -2.303350  1.759247 -0.359745  2.330243 -0.821628 -0.075788   
6108    6986 -4.397974  1.358367 -2.592844  2.679787 -1.128131 -1.706536   
6329    7519  1.234235  3.019740 -4.304597  4.732795  3.624201 -1.357746   
...      ...       ...       ...       ...       ...       ...       ...   
46998  43028 -1.109646  0.811069 -1.138135  0.935265 -2.330248 -0.116106   
47802  43369 -3.365319  2.426503 -3.752227  0.276017 -2.305870 -1.961578   
48094  43494 -1.278138  0.716242 -1.143279  0.217805 -1.293890 -1.168952   
50211  44393 -4.617461  3.663395 -5.297446  3.880960 -3.263551 -0.918547   
50537  44532 -0.234922  0.355413  1.972183 -1.255593 -0.681387 -0.665732   

             V7        V8        V9  ...       V21       V22       V23  \
541   -2.5373

In [14]:
# Concatenate the two DataFrames

new_dataset = pd.concat([legit_sample,fraud], axis = 0)


In [15]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
33737,37364,-0.999682,1.538412,0.509502,-0.485314,0.54035,-0.05064,0.552809,0.174917,-0.013572,...,-0.379383,-0.851784,-0.135866,-0.898324,0.124983,0.1402,0.549266,0.281379,5.36,0.0
47915,43414,1.369717,-1.157116,0.611116,-1.171388,-1.609425,-0.535265,-0.95314,-0.1087,-1.898929,...,-0.284316,-0.345213,0.01617,0.543702,0.40255,-0.279626,0.034148,0.018666,50.56,0.0
12938,22731,-9.260584,7.452729,-13.748652,5.940094,-8.174739,-3.648539,-10.960261,6.456649,-3.0962,...,1.536311,-0.740592,-0.239052,0.0046,0.449751,-0.286287,1.907938,0.604519,89.99,0.0
31657,36454,0.955576,-0.308161,1.188044,1.47589,-0.860503,0.327066,-0.536871,0.219292,0.824802,...,0.010574,0.086572,-0.039623,0.085425,0.304783,-0.371776,0.07239,0.048434,87.0,0.0
4726,4158,-0.959084,1.504439,1.180137,1.034805,-0.002799,-0.486044,0.384451,-0.207589,1.15069,...,-0.426137,-0.834602,0.085391,-0.085383,-0.398492,0.279005,0.059574,0.255594,11.46,0.0


In [16]:
new_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
46998,43028,-1.109646,0.811069,-1.138135,0.935265,-2.330248,-0.116106,-1.621986,0.458028,-0.912189,...,0.641594,0.841755,0.176728,0.081004,-0.258899,0.707654,0.418649,0.080756,204.27,1.0
47802,43369,-3.365319,2.426503,-3.752227,0.276017,-2.30587,-1.961578,-3.029283,-1.674462,0.183961,...,2.070008,-0.512626,-0.248502,0.12655,0.104166,-1.055997,-1.200165,-1.012066,88.0,1.0
48094,43494,-1.278138,0.716242,-1.143279,0.217805,-1.29389,-1.168952,-2.564182,0.204532,-1.611155,...,0.490183,0.470427,-0.126261,-0.126644,-0.661908,-0.349793,0.454851,0.137843,24.9,1.0
50211,44393,-4.617461,3.663395,-5.297446,3.88096,-3.263551,-0.918547,-5.715262,0.83104,-2.457034,...,2.698175,-0.027081,0.366775,-0.123011,-0.300457,-0.239996,-0.183463,-0.07336,1.0,1.0
50537,44532,-0.234922,0.355413,1.972183,-1.255593,-0.681387,-0.665732,0.05911,-0.003153,1.122451,...,0.22067,0.912107,-0.286338,0.451208,0.188315,-0.531846,0.123185,0.039581,1.0,1.0
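
As the `head()` and `tail()` outputs show, `pd.concat` keeps the frames in the order given: all sampled legit rows first, then all fraud rows. If a later step depends on row order, shuffling the combined frame may be worthwhile. A minimal sketch, using tiny hypothetical frames in place of `legit_sample` and `fraud`:

```python
import pandas as pd

# Tiny hypothetical stand-ins for legit_sample and fraud
legit_sample = pd.DataFrame({"Amount": [5.36, 50.56, 89.99], "Class": [0.0, 0.0, 0.0]})
fraud = pd.DataFrame({"Amount": [204.27, 88.0, 24.9], "Class": [1.0, 1.0, 1.0]})

# Stack the two frames row-wise; legit rows come first, fraud rows last
new_dataset = pd.concat([legit_sample, fraud], axis=0)

# frac=1 draws every row exactly once in random order;
# reset_index(drop=True) discards the old row labels
shuffled = new_dataset.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

Skip `reset_index(drop=True)` if the original row labels are needed to trace rows back to the source file.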


In [17]:
new_dataset['Class'].value_counts()

0.0    150
1.0    150
Name: Class, dtype: int64

Now we have an evenly distributed dataset for the two classes. A machine learning model trained on this balanced dataset is less likely to be biased toward the majority class, so it should give better results on fraudulent transactions.

This is how you can randomly under-sample an imbalanced dataset in Python.
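
The steps above can be gathered into one helper. The function below is a sketch, not part of the original notebook: the name `undersample` and its parameters are hypothetical, and the toy data is made up for illustration. It shrinks every class to the size of the smallest one.

```python
import pandas as pd

def undersample(df, label_col, random_state=None):
    """Randomly under-sample every class down to the size of the smallest class."""
    # Rows with a missing label belong to no class, so drop them first
    df = df.dropna(subset=[label_col])
    # Size of the smallest class sets the per-class sample size
    smallest = df[label_col].value_counts().min()
    # Draw the same number of rows from each class
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)

# Hypothetical toy data: 8 legit rows and 2 fraud rows
toy = pd.DataFrame({
    "Amount": [10, 20, 30, 40, 50, 60, 70, 80, 900, 950],
    "Class": [0.0] * 8 + [1.0] * 2,
})

balanced = undersample(toy, "Class", random_state=0)
print(balanced["Class"].value_counts())
```

Keep in mind that under-sampling throws data away: in the credit card example, only 150 of the 51440 legit rows are kept.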
