**Balancing a Dataset with Downsampling**

Imagine we have a dataset for a binary classification task where the classes are imbalanced, and we want to downsample the majority class to balance it.

In [40]:
import pandas as pd
from sklearn.utils import resample

# Toy dataset: two numeric features and a binary class label
df = pd.DataFrame({
    'Age': [22, 26, 19, 20, 45, 50, 43, 48, 18, 33, 35, 52, 60, 62, 70],
    'Income': [2000, 2500, 3000, 2300, 2200, 3300, 3500, 3900, 4000, 4100, 4700, 5000, 5300, 6200, 6800],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High', 'Low']
})
df['Class'].value_counts()

Class
High    8
Low     7
Name: count, dtype: int64

The High class has 8 instances and the Low class has 7, so High is the majority class.

In [44]:
# Split the rows by class so each class can be resampled separately
df_high = df[df['Class'] == 'High']
df_low = df[df['Class'] == 'Low']

In [26]:
# Draw len(df_low) rows from the majority class without replacement (no row is picked twice)
df_high_downsampled = resample(df_high, replace=False, n_samples=len(df_low), random_state=42)


In [34]:
# Combine the downsampled majority rows with the untouched minority rows
df_balanced = pd.concat([df_high_downsampled, df_low])

In [38]:
print(df_balanced['Class'].value_counts())

Class
High    7
Low     7
Name: count, dtype: int64
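
The downsampling steps above can also be wrapped into a small reusable helper. The sketch below is one possible way to do it, not the method used in the cells above: it swaps `resample` for pandas' `groupby(...).sample(...)`, and the function name `downsample_to_smallest` is hypothetical. It assumes every class should be reduced to the size of the smallest class.

```python
import pandas as pd

def downsample_to_smallest(df, label_col, random_state=42):
    """Hypothetical helper: randomly keep as many rows per class as the smallest class has."""
    n_min = df[label_col].value_counts().min()
    # Sample n_min rows from each class without replacement
    return df.groupby(label_col).sample(n=n_min, random_state=random_state)

# Same toy data as in the cells above
df = pd.DataFrame({
    'Age': [22, 26, 19, 20, 45, 50, 43, 48, 18, 33, 35, 52, 60, 62, 70],
    'Income': [2000, 2500, 3000, 2300, 2200, 3300, 3500, 3900, 4000, 4100, 4700, 5000, 5300, 6200, 6800],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High', 'Low']
})

balanced = downsample_to_smallest(df, 'Class')
print(balanced['Class'].value_counts())
```

Because the helper looks up the smallest class itself, it should also work when there are more than two classes.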


In [52]:
# Re-run the cells that create df, df_high, and df_low to restore the original High and Low counts before upsampling

In [48]:
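# Draw len(df_high) rows from the minority class with replacement, so some rows are repeated
df_low_upsampled = resample(df_low, replace=True, n_samples=len(df_high), random_state=42)
df_balanced = pd.concat([df_low_upsampled, df_high])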

In [50]:
print(df_balanced['Class'].value_counts())

Class
Low     8
High    8
Name: count, dtype: int64
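
Upsampling with replacement can only repeat rows that already exist: drawing 8 rows from 7 must produce at least one exact copy. The sketch below illustrates this on the Low rows of the toy data. It uses pandas' `DataFrame.sample(replace=True)` as a stand-in for `resample`, so the specific rows drawn may differ from the cell above.

```python
import pandas as pd

# The 7 'Low' rows from the toy dataset above
df_low = pd.DataFrame({
    'Age': [26, 19, 50, 18, 35, 60, 70],
    'Income': [2500, 3000, 3300, 4000, 4700, 5300, 6800],
    'Class': ['Low'] * 7,
})

# Draw 8 rows with replacement (stand-in for sklearn.utils.resample)
df_low_upsampled = df_low.sample(n=8, replace=True, random_state=42)

# Count rows that are exact copies of an earlier row in the sample
n_duplicates = int(df_low_upsampled.duplicated().sum())
print(n_duplicates)
```

This repetition of identical rows is what SMOTE is meant to avoid.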


In [54]:
pip install imbalanced-learn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [56]:
import pandas as pd
from imblearn.over_sampling import SMOTE


**SMOTE (Synthetic Minority Over-sampling Technique): generating synthetic samples instead of duplicating existing ones**

1. Use SMOTE to generate synthetic samples instead of duplicating existing ones.

2. Convert categorical class labels into numeric form for SMOTE to work.

3. Apply SMOTE to balance the dataset.

4. Convert back to original categorical labels.

5. Combine the resampled data into a final balanced dataset.
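
To make step 1 more concrete, here is a rough pure-Python sketch of the core idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. This is a simplified illustration, not imbalanced-learn's implementation; the function name `smote_sketch` and its parameters are hypothetical, and the four minority points are taken from the example dataset used in the SMOTE cell.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=42):
    """Hypothetical, simplified SMOTE: interpolate between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (Euclidean distance), excluding base itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position in [0, 1) along the segment
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# (Age, Income) of the four minority samples in the example dataset
minority = [(22, 2000), (35, 3800), (40, 4000), (45, 4200)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Every synthetic point lies between two real minority samples, so its values stay within the range spanned by the minority class rather than copying a row exactly.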


In [61]:
df=pd.DataFrame({
    'Age': [22,26,19,20,45,50,43,48,18,33,35,52,60,62,70],
    'Income': [2000,2500,3000,2300,2200,3300,3500,3900,4000,4100,4700,5000,5300,6200,6800],
    'Class': ['Minority', 'Majority', 'Minority', 'Majority', 'Minority', 'Majority', 'Minority', 'Majority', 'Majority', 'Minority', 'Minority', 'Majority', 'Minority', 'Minority', 'Majority']
})
df['Class'].value_counts()

Class
Minority    8
Majority    7
Name: count, dtype: int64

pip uninstall scikit-learn imbalanced-learn -y
pip install -U scikit-learn imbalanced-learn

In [86]:
import pandas as pd
from imblearn.over_sampling import SMOTE
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['Minority', 'Majority', 'Majority', 'Majority', 'Majority', 'Minority', 'Minority', 'Minority', 'Majority',  'Majority', 'Majority', 'Majority', 'Majority']
})
# SMOTE needs numeric labels, so encode the classes as 0/1
df['Class'] = df['Class'].map({'Majority': 0, 'Minority': 1})
X = df[['Age', 'Income']]
Y = df['Class']
# With only 4 minority samples, k_neighbors can be at most 3
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=3)
x_resampled, y_resampled = smote.fit_resample(X, Y)
# Map the numeric labels back to the original class names
y_resampled = y_resampled.map({0: 'Majority', 1: 'Minority'})
df_balanced = pd.concat([pd.DataFrame(x_resampled, columns=['Age', 'Income']), pd.DataFrame(y_resampled, columns=['Class'])], axis=1)
print(df_balanced['Class'].value_counts())
print(df_balanced)

Class
Minority    9
Majority    9
Name: count, dtype: int64
    Age  Income     Class
0    22    2000  Minority
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Majority
5    35    3800  Minority
6    40    4000  Minority
7    45    4200  Minority
8    50    4300  Majority
9    55    4500  Majority
10   60    5000  Majority
11   65    5500  Majority
12   70    6000  Majority
13   40    4031  Minority
14   35    3831  Minority
15   44    4176  Minority
16   35    3826  Minority
17   41    4040  Minority
