**Balancing Dataset with Downsampling**

Imagine we have a dataset for a binary classification task where the class labels are imbalanced, and we want to downsample the majority class to balance the dataset.

In [1]:
import pandas as pd
from sklearn.utils import resample

df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})

In [2]:
df.head()

Unnamed: 0,Age,Income,Class
0,22,2000,High
1,25,2500,Low
2,27,2700,Low
3,28,3200,High
4,30,3500,High


In [3]:
df_high=df[df['Class']=='High']
df_low=df[df['Class']=='Low']

In [4]:
print('High Classes: ',df_high.shape[0])
print('Low Classes: ',df_low.shape[0])

High Classes:  7
Low Classes:  6


In [5]:
df_high_downsampled= resample(df_high, replace=False,n_samples=len(df_low),random_state=42)
df_high_downsampled

Unnamed: 0,Age,Income,Class
0,22,2000,High
3,28,3200,High
10,60,5000,High
4,30,3500,High
7,45,4200,High
6,40,4000,High


In [6]:
df_balanced=pd.concat([df_high_downsampled,df_low])

In [7]:
df_balanced['Class'].value_counts()

Class
High    6
Low     6
Name: count, dtype: int64
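
The same downsampling can also be sketched with pandas alone. This is an alternative under stated assumptions, not the method used in the cells above: it relies on `groupby(...).sample` (available in pandas 1.1 and later), and the names `n_min` and `df_balanced_alt` are introduced here only for illustration. It also shuffles the rows and resets the index, since `pd.concat` leaves each class in one contiguous block.

```python
import pandas as pd

# Same toy data as in the cells above.
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'High', 'Low']
})

# Size of the smallest class; every class is cut down to this many rows.
n_min = df['Class'].value_counts().min()

# Draw n_min rows per class without replacement, then shuffle and reindex.
df_balanced_alt = (
    df.groupby('Class')
      .sample(n=n_min, random_state=42)
      .sample(frac=1, random_state=42)
      .reset_index(drop=True)
)

print(df_balanced_alt['Class'].value_counts())
```

The rows selected here will generally differ from those picked by `resample`, because the two functions draw their random samples differently.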

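**Balancing Dataset with Upsampling**

Instead of discarding rows from the majority class, upsampling draws rows from the minority class with replacement until both classes have the same size.
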
In [8]:
df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})
df_high=df[df['Class']=='High']
df_low=df[df['Class']=='Low']

In [9]:
# Upsampling
df_lowupsampled=resample(df_low, replace=True,n_samples=len(df_high),random_state=42)

In [10]:
df_lowupsampled

Unnamed: 0,Age,Income,Class
8,50,4300,Low
9,55,4500,Low
5,35,3800,Low
9,55,4500,Low
9,55,4500,Low
2,27,2700,Low
5,35,3800,Low


In [11]:
df_up_balanced=pd.concat([df_lowupsampled,df_high])

In [12]:
df_up_balanced['Class'].value_counts()

Class
Low     7
High    7
Name: count, dtype: int64

**SMOTE (Synthetic Minority Over-sampling Technique)**

1. Use SMOTE to generate synthetic samples instead of duplicating existing ones
2. Convert categorical class labels into numeric form for SMOTE to work
3. Apply SMOTE to balance the dataset
4. Convert back to the original categorical labels
5. Combine the resampled data into a final balanced dataset
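
To make step 1 concrete, here is a minimal pure-Python sketch of the interpolation idea behind SMOTE: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is not the `imblearn` implementation. The function `smote_like_samples`, its parameters, and the 2-D points are all made up for illustration.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the base.
        base = rng.choice(minority)
        # Find its k nearest minority neighbours (excluding itself), by Euclidean distance.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate: new = base + gap * (neighbour - base), with gap in [0, 1).
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points (not taken from the dataset in this notebook).
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)]
new_points = smote_like_samples(minority, n_new=5)

for point in new_points:
    print(point)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already covered by the minority class rather than repeating rows exactly, which is the difference from the upsampling shown earlier. The neighbour count `k` is kept small here because the sketch only has four minority points to choose from.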

In [6]:
# !pip install scikit-learn imbalanced-learn

In [18]:
# Loading dataset
import pandas as pd
# Create the DataFrame
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})
# Relabel the first 9 rows as 'Majority' (this overwrites the original 'High'/'Low' labels)
df.loc[df.index[:9], 'Class'] = 'Majority'
# Relabel the remaining 4 rows as 'Minority'
df.loc[df.index[9:], 'Class'] = 'Minority'
df.head()

Unnamed: 0,Age,Income,Class
0,22,2000,Majority
1,25,2500,Majority
2,27,2700,Majority
3,28,3200,Majority
4,30,3500,Majority


In [17]:
from imblearn.over_sampling import SMOTE