Imbalanced data refers to a dataset whose classes (output labels) are not represented equally. For example, in a binary classification problem with 2,000 samples, if 1,800 are labeled "yes" and only 200 are labeled "no", the dataset is heavily skewed toward the "yes" class. A model trained on such data can score well simply by predicting the majority class, while rarely or never identifying the minority class.

To handle this, we can use resampling techniques, which include:

Under-sampling: Reducing the number of samples in the majority class, typically by keeping a random subset.

Over-sampling: Increasing the number of samples in the minority class (e.g., duplicating existing samples or generating synthetic ones with techniques like SMOTE).

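Balancing the training data helps models learn patterns from all classes more effectively, which typically improves detection of the minority class. Overall accuracy is a weak measure here, because a model that always predicts the majority class already scores high on it.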
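
The two resampling strategies can be sketched on a toy dataset that mirrors the 1,800/200 example above. This is a minimal illustration rather than part of the original notebook: the column names `feature` and `label` and the fixed `random_state` are assumptions made for the sketch.

```python
import pandas as pd

# Toy dataset mirroring the example: 1,800 "yes" rows and 200 "no" rows.
toy = pd.DataFrame({
    "feature": range(2000),
    "label": ["yes"] * 1800 + ["no"] * 200,
})

majority = toy[toy["label"] == "yes"]
minority = toy[toy["label"] == "no"]

# Random under-sampling: keep only as many majority rows as there are minority rows.
under = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Random over-sampling: draw minority rows with replacement until they match the majority.
over = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])

print(under["label"].value_counts().to_dict())
print(over["label"].value_counts().to_dict())
```

Under-sampling discards majority rows, so it is cheap but loses information. Random over-sampling keeps every row but repeats minority rows, which can encourage overfitting to those duplicates.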


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('credit_data.csv')
df.head()


,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [4]:
df.shape

(75357, 31)

In [5]:
df.Class.value_counts()

Class,count
0.0,75173
1.0,183


0 -> legitimate transaction
1 -> fraudulent transaction
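
To make the degree of imbalance concrete, the counts above can be turned into a fraud share and a majority-class baseline. This is a small sketch that hard-codes the two counts from the `value_counts()` output; the variable names are illustrative. Note that the two counts sum to 75,356, one fewer than the 75,357 rows reported by `df.shape`. That would be consistent with one row having a missing `Class` value, since `value_counts()` skips missing values by default, but the notebook does not confirm this.

```python
# Class counts copied from the value_counts() output above.
n_legit, n_fraud = 75173, 183
total = n_legit + n_fraud

# Share of labeled transactions that are fraudulent.
fraud_share = n_fraud / total

# Accuracy of a trivial model that always predicts class 0 (legitimate).
baseline_accuracy = n_legit / total

print(f"fraud share: {fraud_share:.4%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4%}")
```

A model that never flags fraud would still be correct on well over 99% of these rows, which is why accuracy alone says little on this dataset.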

---



In [9]:
# Split the data by class label
legit = df[df['Class'] == 0.0]
fraud = df[df['Class'] == 1.0]

In [10]:
print(legit.shape)
print(fraud.shape)

(75173, 31)
(183, 31)


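We under-sample the majority class by drawing 200 legitimate transactions at random, which is close to the 183 fraud cases. No random_state is set, so the sampled rows differ from run to run.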

In [11]:
legit_sample = legit.sample(n=200)
print(legit_sample.shape)

(200, 31)


In [12]:
# Stack the sampled legitimate rows on top of all fraud rows
new_df = pd.concat([legit_sample, fraud], axis=0)
new_df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
45213,42253,-0.213029,0.021637,1.627503,-1.25653,-0.607742,-0.097723,-0.051607,0.13323,-1.16204,...,0.024824,0.037847,-0.13327,-0.00825,-0.014795,-0.369726,0.061453,0.043866,22.99,0.0
6363,7579,-0.510254,1.142216,2.104981,1.183339,0.141866,-0.099184,0.488675,-0.201551,0.715624,...,-0.153139,0.092433,-0.055248,0.252852,-0.346791,-0.474598,0.10247,-0.036466,5.0,0.0
13031,22887,1.191476,-0.256718,1.263408,-0.387676,-1.406036,-1.078807,-0.639869,-0.150443,3.069291,...,-0.121344,0.119941,0.032353,0.881295,0.401942,-0.759306,0.065533,0.036156,11.85,0.0
42919,41297,1.136237,-1.6894,0.604514,-1.25614,-2.097257,-0.698626,-1.105469,0.034664,-1.718728,...,-0.231232,-0.814349,0.077075,0.441469,-0.054131,-0.450488,-0.004123,0.045043,172.28,0.0
22706,32392,1.118052,0.128877,0.326239,1.174001,-0.412781,-0.885088,0.207696,-0.218763,0.04965,...,0.108323,0.224724,-0.165783,0.405295,0.656146,-0.303377,0.00767,0.032716,66.39,0.0


In [13]:
new_df.shape

(383, 31)

In [14]:
new_df.Class.value_counts()

Class,count
0.0,200
1.0,183


In [16]:
new_df.tail()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
73784,55279,-5.753852,0.57761,-6.312782,5.159401,-1.69832,-2.683286,-7.934389,2.37355,-3.073079,...,1.177852,0.175331,-1.211123,-0.446891,-0.40552,-0.165797,1.505516,0.359492,1.0,1.0
73857,55311,-6.159607,1.468713,-6.850888,5.174706,-2.986704,-1.795054,-6.545072,2.621236,-3.60587,...,1.061314,0.125737,0.589592,-0.568731,0.582825,-0.042583,0.95113,0.158996,0.83,1.0
74496,55614,-7.347955,2.397041,-7.572356,5.177819,-2.854838,-1.795239,-8.783235,0.437157,-3.740598,...,-0.175273,0.543325,-0.547955,-0.503722,-0.310933,-0.163986,1.197895,0.378187,0.83,1.0
74507,55618,-7.427924,2.948209,-8.67855,5.185303,-4.76109,-0.957095,-7.77338,0.717309,-3.682359,...,-0.299847,0.610479,0.789023,-0.564512,0.201196,-0.111225,1.144599,0.10228,130.44,1.0
74794,55760,-6.003422,-3.930731,-0.007045,1.714669,3.414667,-2.329583,-1.901512,-2.746111,0.887673,...,1.101671,-0.992494,-0.698259,0.139898,-0.205151,-0.472412,1.775378,-0.104285,311.91,1.0
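
The `head()` and `tail()` outputs show that the concatenated frame holds all legitimate rows first and all fraud rows last. A possible refinement, sketched below, is to fix the random seed and shuffle the rows after concatenation. The helper name `undersample`, its parameters, and the synthetic `demo` frame are hypothetical stand-ins, since `credit_data.csv` is not available here. The sketch also assumes the label column has no missing values.

```python
import pandas as pd

def undersample(df, label_col, n_majority, random_state=None):
    """Hypothetical helper: keep all non-majority rows plus n_majority random majority rows."""
    majority_label = df[label_col].value_counts().idxmax()
    majority = df[df[label_col] == majority_label]
    rest = df[df[label_col] != majority_label]
    sampled = majority.sample(n=n_majority, random_state=random_state)
    # Shuffle so the classes are interleaved rather than stacked one after the other.
    return pd.concat([sampled, rest], axis=0).sample(frac=1, random_state=random_state)

# Synthetic stand-in for the credit data: 1,000 legitimate rows and 20 fraud rows.
demo = pd.DataFrame({
    "Amount": range(1020),
    "Class": [0.0] * 1000 + [1.0] * 20,
})

balanced = undersample(demo, "Class", n_majority=25, random_state=42)
print(balanced.shape)
print(balanced["Class"].value_counts())
```

With a fixed `random_state`, rerunning the cell returns the same sample, which makes later results easier to compare. Shuffling matters mainly for steps that consume rows in order, such as a split that does not shuffle on its own.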
