
#Imbalanced Dataset

🔶 What is an Imbalanced Dataset?

=> An imbalanced dataset is one in which the classes have very unequal numbers of examples.

For example, if you are doing disease prediction, your data might look like this:

950 people do not have the disease (class 0)

50 people do have the disease (class 1)

Here, the dataset is imbalanced because:

95% → class 0

5% → class 1

One class has much more data than the other.


🔶 Why is this a problem?

=> Machine learning models often try to maximize overall accuracy. So in imbalanced data:

The model may predict only the majority class and still get high accuracy.

But it will miss the minority class, which is often the most important one (like detecting a disease or fraud).

🔶 Example:
If a model always predicts class 0, it will be 95% accurate but completely useless, because it never detects class 1.
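
The 95% figure can be checked directly. The sketch below uses only NumPy to build the 950/50 labels from the example above and to score a hypothetical "model" that always predicts class 0. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# 950 negatives (class 0) and 50 positives (class 1), as in the example above
y_true = np.array([0] * 950 + [1] * 50)

# a "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# accuracy: fraction of all predictions that are correct
accuracy = (y_pred == y_true).mean()

# recall for class 1: fraction of actual positives that were detected
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy looks good while recall on the minority class is zero, which is why accuracy alone is a poor metric for imbalanced data.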

🔶 Real-Life Examples of Imbalanced Data:

Fraud detection (most transactions are not fraud)

Disease prediction (few patients have a rare disease)

Spam email detection (most emails are not spam)



In [2]:
# importing the dependencies
import numpy as np
import pandas as pd

In [3]:
# loading the dataset to pandas DataFrame
credit_card_data = pd.read_csv('/content/credit_data.csv')

In [4]:
# first 5 rows of the dataframe
credit_card_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [5]:
# last 5 rows of the dataframe
credit_card_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
19893,30631,-0.377215,0.973528,1.647077,0.732439,0.024728,-0.541379,0.828488,-0.06074,-0.725148,...,0.228443,0.685913,-0.107687,0.63174,0.126366,-0.327633,0.056522,0.033139,29.9,0.0
19894,30631,1.209281,0.078793,0.06182,0.59373,-0.235772,-0.448524,-0.141196,0.089236,0.411825,...,-0.302369,-0.984051,0.130401,-0.390756,0.105615,0.152881,-0.025292,0.02113,16.0,0.0
19895,30632,1.286596,-1.450336,0.81453,-1.308949,-2.055209,-0.592064,-1.317286,0.032386,-1.720017,...,0.040743,0.262534,-0.045112,0.51566,0.218606,-0.138794,0.026395,0.030885,92.0,0.0
19896,30633,-0.48809,1.018448,0.670593,-0.245462,0.828347,-0.233102,0.662586,-0.040028,-0.279439,...,-0.344859,-0.902035,-0.050171,-1.060827,0.062221,0.150428,0.130266,0.06729,1.99,0.0
19897,30633,-2.609841,2.479357,0.763844,0.044509,-0.645716,0.762867,-1.626415,-7.617854,1.399746,...,,,,,,,,,,
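
The last row in the output above has empty values from V21 onward, including Class, which suggests the file was cut off mid-row. Below is a minimal sketch of how such rows could be detected and dropped. It uses a small hypothetical stand-in DataFrame (`df_demo`) with made-up values rather than the real file.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for credit_card_data: the last row is incomplete
df_demo = pd.DataFrame({
    "V1": [-0.49, 1.21, -2.61],
    "Amount": [1.99, 16.0, np.nan],
    "Class": [0.0, 0.0, np.nan],
})

# count missing values per column
missing_per_column = df_demo.isnull().sum()
print(missing_per_column)

# drop rows whose label is missing; they cannot be used for supervised training
df_clean = df_demo.dropna(subset=["Class"])
print(df_clean.shape)  # (2, 3)
```

On the real data the equivalent calls would be `credit_card_data.isnull().sum()` and `credit_card_data.dropna(subset=['Class'])`. The `Class == 0` and `Class == 1` filters used later in this notebook already exclude such a row, because a missing value matches neither condition.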


#Distribution of the two classes

In [6]:
credit_card_data['Class'].value_counts()

Class,count
0.0,19812
1.0,85


This is a highly imbalanced dataset: only 85 of the 19,897 labeled transactions are fraudulent.

0 --> Legit Transactions

1 --> Fraudulent Transactions
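
To put a number on the imbalance, the counts can be turned into proportions and a ratio. The sketch below copies the two counts from the output above into a pandas Series, so it runs without the CSV file. The names `counts`, `proportions`, and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# class counts copied from the value_counts() output above
counts = pd.Series({0.0: 19812, 1.0: 85}, name="count")

# share of each class among the labeled rows
proportions = counts / counts.sum()
print(proportions.round(4))

# number of legit transactions per fraudulent one
imbalance_ratio = counts.loc[0.0] / counts.loc[1.0]
print(round(imbalance_ratio, 1))  # 233.1
```

On the loaded DataFrame, `credit_card_data['Class'].value_counts(normalize=True)` should give the same proportions directly.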

In [7]:
# separating the legit and fraudulent transactions
legit = credit_card_data[credit_card_data.Class == 0]
fraud = credit_card_data[credit_card_data.Class == 1]

In [8]:
print(legit.shape)
print(fraud.shape)

(19812, 31)
(85, 31)


#Under-sampling

Build a sample dataset with a similar number of legit and fraudulent transactions.

Number of Fraudulent Transactions --> 85

In [9]:
# randomly draw as many legit transactions as there are fraudulent ones (85)
legit_sample = legit.sample(n=len(fraud))

In [10]:
print(legit_sample.shape)

(85, 31)
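
`sample` draws rows at random, so each run of the cell above selects a different set of 85 legit transactions. If reproducible results are wanted, one option is to pass a seed through the `random_state` parameter. The sketch below shows this on hypothetical stand-in data. The seed value 42 and the `*_demo` names are arbitrary choices, not from the original notebook.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in data: 1000 legit rows and 10 fraud rows
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Amount": rng.uniform(0, 500, size=1010),
    "Class": [0.0] * 1000 + [1.0] * 10,
})
legit_demo = demo[demo.Class == 0]
fraud_demo = demo[demo.Class == 1]

# sample as many legit rows as there are fraud rows; the seed fixes the draw
sample_a = legit_demo.sample(n=len(fraud_demo), random_state=42)
sample_b = legit_demo.sample(n=len(fraud_demo), random_state=42)

print(sample_a.shape)             # (10, 2)
print(sample_a.equals(sample_b))  # True
```

Note that under-sampling discards most of the majority class (here, all but 85 of the 19,812 legit rows), so the model sees far less data than is available.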


#Concatenate the Two DataFrames

In [11]:
new_dataset = pd.concat([legit_sample, fraud], axis = 0)

In [12]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
19357,30210,-0.436689,0.071842,2.543647,-3.1e-05,-0.558228,1.474524,-0.685809,0.584585,0.97123,...,-0.065782,0.273989,-0.396706,-0.715507,-0.085904,1.261286,0.100384,0.083696,1.0,0.0
19130,30038,1.231002,0.258652,0.178323,0.506292,-0.190546,-0.568598,-0.045576,-0.014668,-0.147095,...,-0.260251,-0.810668,0.078531,-0.047263,0.215868,0.098306,-0.030333,0.016965,4.49,0.0
4969,4529,1.184167,0.066152,0.720988,1.04896,-0.385638,-0.091903,-0.303297,-0.043298,1.907172,...,-0.211691,-0.081422,-0.047083,0.073008,0.487745,0.429634,-0.026481,0.000958,6.99,0.0
4275,3756,1.455736,-0.593967,-0.883533,-1.639203,1.486036,3.264616,-1.207737,0.717662,0.363424,...,-0.295169,-0.8834,0.106058,0.918577,0.367366,-0.505379,-0.014611,0.010539,13.81,0.0
4336,3761,1.268691,-0.629294,0.729824,-0.542813,-1.111714,-0.370687,-0.932106,0.110866,0.550878,...,-0.165255,-0.502086,0.196727,0.07944,-0.002826,-0.515046,-0.021265,0.003647,29.0,0.0


In [13]:
new_dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
17480,28755,-30.55238,16.713389,-31.103685,6.534984,-22.105532,-4.977692,-20.371514,20.007208,-3.565738,...,1.81652,-2.288686,-1.460544,0.183179,2.208209,-0.208824,1.232636,0.35666,99.99,1.0
18466,29526,1.102804,2.829168,-3.93287,4.707691,2.937967,-1.800904,1.672734,-0.30024,-2.783011,...,-0.106994,-0.25005,-0.521627,-0.44895,1.291646,0.516327,0.009146,0.153318,0.68,1.0
18472,29531,-1.060676,2.608579,-2.971679,4.360089,3.738853,-2.728395,1.987616,-0.357345,-2.757535,...,-0.063168,-0.207385,-0.183261,-0.103679,0.896178,0.407387,-0.130918,0.192177,0.68,1.0
18773,29753,0.269614,3.549755,-5.810353,5.80937,1.538808,-2.269219,-0.824203,0.35107,-3.759059,...,0.371121,-0.32229,-0.549856,-0.520629,1.37821,0.564714,0.553255,0.4024,0.68,1.0
18809,29785,0.923764,0.344048,-2.880004,1.72168,-3.019565,-0.639736,-3.801325,1.299096,0.864065,...,0.899931,1.481271,0.725266,0.17696,-1.815638,-0.536517,0.489035,-0.049729,30.3,1.0


In [14]:
new_dataset['Class'].value_counts()

Class,count
0.0,85
1.0,85
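
The classes are now balanced, but `head()` shows only class 0 and `tail()` only class 1, because `pd.concat` stacks the two frames in order. If a later step depends on row order, the rows could be shuffled first. This is an optional sketch on hypothetical stand-in data, not a step from the original notebook.

```python
import pandas as pd

# hypothetical stand-in for new_dataset: legit rows first, then fraud rows
legit_part = pd.DataFrame({"Amount": [1.0, 4.49, 6.99], "Class": [0.0, 0.0, 0.0]})
fraud_part = pd.DataFrame({"Amount": [99.99, 0.68, 30.3], "Class": [1.0, 1.0, 1.0]})
stacked = pd.concat([legit_part, fraud_part], axis=0)

# frac=1 returns every row in random order; reset_index discards the old labels
shuffled = stacked.sample(frac=1, random_state=0).reset_index(drop=True)

# the class counts are unchanged by shuffling
print(shuffled["Class"].value_counts().sort_index())
```

Applied to the notebook's data, the same call would be `new_dataset.sample(frac=1).reset_index(drop=True)`.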
