**Balancing Dataset with DownSampling**

Imagine we have a dataset for a binary classification task where the class labels are imbalanced, and we want to downsample the majority class to balance the dataset.

In [19]:
import pandas as pd
from sklearn.utils import resample
df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2200,2500,2700,2800,3000,3500,4000,4500,5000,5500,6000,6500,7000],
    'Class':['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})
df

Unnamed: 0,Age,Income,Class
0,22,2200,High
1,25,2500,Low
2,27,2700,Low
3,28,2800,High
4,30,3000,High
5,35,3500,Low
6,40,4000,High
7,45,4500,High
8,50,5000,Low
9,55,5500,Low


The High class has 7 instances and the Low class has 6, so High is the majority class.
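
The class counts quoted above can be checked directly rather than counted by hand. The following is a minimal sketch that assumes the same labels as the DataFrame above; the names `counts`, `majority_label` and `minority_label` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Same class labels as the example dataset above
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
              'High', 'Low', 'Low', 'High', 'High', 'Low'],
})

# Count rows per class, then pick the largest and smallest class
counts = df['Class'].value_counts()
majority_label = counts.idxmax()
minority_label = counts.idxmin()

print(counts.to_dict())
print('majority:', majority_label, 'minority:', minority_label)
```

Deriving the labels this way avoids hard-coding which class to downsample when the data changes.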

In [20]:
# Separate the majority (High) and minority (Low) classes
df_high=df[df['Class']=='High']
df_low=df[df['Class']=='Low']
print(df_high)
print(df_low)

    Age  Income Class
0    22    2200  High
3    28    2800  High
4    30    3000  High
6    40    4000  High
7    45    4500  High
10   60    6000  High
11   65    6500  High
    Age  Income Class
1    25    2500   Low
2    27    2700   Low
5    35    3500   Low
8    50    5000   Low
9    55    5500   Low
12   70    7000   Low


In [21]:
# Downsample the majority class (High) without replacement to the size of the minority class (Low)
df_high_downsampled=resample(df_high,replace=False,n_samples=len(df_low),random_state=42)

In [22]:
df_balanced=pd.concat([df_high_downsampled,df_low])
df_balanced

Unnamed: 0,Age,Income,Class
0,22,2200,High
3,28,2800,High
10,60,6000,High
4,30,3000,High
7,45,4500,High
6,40,4000,High
1,25,2500,Low
2,27,2700,Low
5,35,3500,Low
8,50,5000,Low


In [16]:
print(df_balanced['Class'].value_counts())

Class
High    6
Low     6
Name: count, dtype: int64
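
As an alternative to splitting the classes by hand, the same balancing can be sketched with pandas alone. This is not the method used in the cells above: it swaps `sklearn.utils.resample` for `DataFrameGroupBy.sample`, and the names `n_per_class` and `df_balanced_alt` are illustrative. The rows it selects will generally differ from the output shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2200, 2500, 2700, 2800, 3000, 3500, 4000, 4500,
               5000, 5500, 6000, 6500, 7000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
              'High', 'Low', 'Low', 'High', 'High', 'Low'],
})

# Size of the smallest class: every class is reduced to this many rows
n_per_class = df['Class'].value_counts().min()

# Draw n_per_class rows from each class without replacement
df_balanced_alt = df.groupby('Class').sample(n=n_per_class, random_state=42)

print(df_balanced_alt['Class'].value_counts())
```

This form extends to more than two classes without extra code, at the cost of being less explicit about which class is being reduced.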


**Upsampling the Minority Class**

Let us use a dataset for a binary classification task in which the minority class has fewer instances than the majority class, and upsample the minority class to match.

In [47]:
import pandas as pd
from sklearn.utils import resample
df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2200,2500,2700,2800,3000,3500,4000,4500,5000,5500,6000,6500,7000],
    'Class':['Minority','Majority','Majority','Majority','Majority','Minority','Minority','Minority','Mijority','Majority','Majority','Majority','Majority']
})
df

Unnamed: 0,Age,Income,Class
0,22,2200,Minority
1,25,2500,Majority
2,27,2700,Majority
3,28,2800,Majority
4,30,3000,Majority
5,35,3500,Minority
6,40,4000,Minority
7,45,4500,Minority
8,50,5000,Mijority
9,55,5500,Majority


In [48]:
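# Separate majority and minority classes.
# Note: the row at index 8 is labelled 'Mijority' (a typo in the data above), so it matches
# neither filter and is left out of both subsets; df_Majority therefore has 8 rows.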
df_Majority=df[df['Class']=='Majority']
df_Minority=df[df['Class']=='Minority']
print(df_Majority)
print(df_Minority)

    Age  Income     Class
1    25    2500  Majority
2    27    2700  Majority
3    28    2800  Majority
4    30    3000  Majority
9    55    5500  Majority
10   60    6000  Majority
11   65    6500  Majority
12   70    7000  Majority
   Age  Income     Class
0   22    2200  Minority
5   35    3500  Minority
6   40    4000  Minority
7   45    4500  Minority
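
The printed subsets above contain 12 rows in total, although the dataset has 13: the row labelled 'Mijority' is silently dropped by the equality filters. A small check before splitting can catch this kind of label typo. The sketch below is an assumption about how one might do it; the names `expected`, `unexpected` and `cleaned`, and the correction mapping, are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Same labels as the dataset above, including the misspelled entry
labels = pd.Series(
    ['Minority', 'Majority', 'Majority', 'Majority', 'Majority',
     'Minority', 'Minority', 'Minority', 'Mijority', 'Majority',
     'Majority', 'Majority', 'Majority'],
    name='Class',
)

# Labels we expect to see; anything else is flagged
expected = ['Majority', 'Minority']
unexpected = labels[~labels.isin(expected)]
print(unexpected)

# Hypothetical fix: map the misspelled label to the intended one
cleaned = labels.replace({'Mijority': 'Majority'})
print(cleaned.value_counts())
```

With the label corrected, the majority class would have 9 rows instead of 8, so the upsampling target in the next cell would change accordingly.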


In [49]:
# Upsample the minority class with replacement to the size of the majority class
df_minority_upsampled=resample(df_Minority,replace=True,n_samples=len(df_Majority),random_state=42)

In [50]:
df_balanced=pd.concat([df_Majority,df_minority_upsampled])
df_balanced

Unnamed: 0,Age,Income,Class
1,25,2500,Majority
2,27,2700,Majority
3,28,2800,Majority
4,30,3000,Majority
9,55,5500,Majority
10,60,6000,Majority
11,65,6500,Majority
12,70,7000,Majority
6,40,4000,Minority
7,45,4500,Minority
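
Because the minority class has only 4 distinct rows, drawing 8 rows with replacement must repeat some of them. The sketch below illustrates this with pandas' own `DataFrame.sample` rather than `sklearn.utils.resample`, so the particular rows drawn may differ from the notebook's output; the names `df_minority` and `upsampled` are illustrative.

```python
import pandas as pd

# The four minority rows from the dataset above, with their original index
df_minority = pd.DataFrame(
    {'Age': [22, 35, 40, 45],
     'Income': [2200, 3500, 4000, 4500],
     'Class': ['Minority'] * 4},
    index=[0, 5, 6, 7],
)

# Draw 8 rows with replacement: duplicates are unavoidable with only 4 sources
upsampled = df_minority.sample(n=8, replace=True, random_state=42)

# How often each original row was drawn
print(upsampled.index.value_counts())
print('contains duplicates:', upsampled.duplicated().any())
```

Upsampling by duplication adds no new information, which is the motivation for generating synthetic samples instead.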


In [1]:
print(df_balanced['Class'].value_counts())

NameError: name 'df_balanced' is not defined

The execution count starting again at 1 indicates that the kernel was restarted, so df_balanced from the earlier cells no longer exists. Rerunning the cells above before this one restores it.

In [2]:
pip uninstall scikit-learn imbalanced-learn -y

Found existing installation: imbalanced-learn 0.13.0
Uninstalling imbalanced-learn-0.13.0:
  Successfully uninstalled imbalanced-learn-0.13.0
Note: you may need to restart the kernel to use updated packages.




**Upsampling with SMOTE**

1. Use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples instead of duplicating existing ones

2. Convert categorical class labels into numeric form before applying SMOTE

3. Apply SMOTE to balance the dataset

4. Convert the labels back to their original categorical form

5. Combine the resampled data into a final balanced dataset
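
To make the idea in step 1 concrete, the following is a simplified NumPy sketch of how a single synthetic point can be produced by interpolating between a minority sample and one of its nearest minority neighbours. It is an illustration under stated assumptions, not imbalanced-learn's implementation: the function name `interpolate_sample` and its arguments are hypothetical, and the four points are the minority rows of the dataset used in the SMOTE cell.

```python
import numpy as np

# Minority rows (Age, Income) from the dataset used in the SMOTE cell
minority = np.array([[22, 2000], [35, 3800], [40, 4000], [45, 4200]], dtype=float)

def interpolate_sample(points, i, k, rng):
    """Create one synthetic point between points[i] and one of its k nearest neighbours."""
    # Euclidean distance from points[i] to every point (distance to itself is 0)
    distances = np.linalg.norm(points - points[i], axis=1)
    # Indices of the k nearest neighbours, skipping the point itself at position 0
    neighbor_ids = np.argsort(distances)[1:k + 1]
    j = int(rng.choice(neighbor_ids))
    # Random position on the segment between the two points
    gap = rng.random()
    return points[i] + gap * (points[j] - points[i]), j

rng = np.random.default_rng(42)
synthetic, neighbor = interpolate_sample(minority, i=1, k=3, rng=rng)
print('neighbour index:', neighbor)
print('synthetic point:', synthetic)
```

The synthetic point always lies on the line segment between the chosen sample and its neighbour, so it is new data that still resembles the minority class. The sketch uses raw distances without feature scaling, so Income dominates the neighbour search here.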

In [3]:
pip install -U scikit-learn imbalanced-learn

Collecting scikit-learn
  Obtaining dependency information for scikit-learn from https://files.pythonhosted.org/packages/a1/a6/c5b78606743a1f28eae8f11973de6613a5ee87366796583fb74c67d54939/scikit_learn-1.6.1-cp311-cp311-win_amd64.whl.metadata
  Using cached scikit_learn-1.6.1-cp311-cp311-win_amd64.whl.metadata (15 kB)
Collecting imbalanced-learn
  Obtaining dependency information for imbalanced-learn from https://files.pythonhosted.org/packages/9d/41/721fec82606242a2072ee909086ff918dfad7d0199a9dfd4928df9c72494/imbalanced_learn-0.13.0-py3-none-any.whl.metadata
  Using cached imbalanced_learn-0.13.0-py3-none-any.whl.metadata (8.8 kB)
Using cached scikit_learn-1.6.1-cp311-cp311-win_amd64.whl (11.1 MB)
Using cached imbalanced_learn-0.13.0-py3-none-any.whl (238 kB)
Installing collected packages: scikit-learn, imbalanced-learn
Successfully installed imbalanced-learn-0.13.0 scikit-learn-1.6.1
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
from imblearn.over_sampling import SMOTE

# Sample dataset
df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['Minority','Majority','Majority','Majority','Majority',
             'Minority','Minority','Minority','Majority','Majority',
             'Majority','Majority','Majority']
})
#Step 1: Convert categorical labels to numeric values
df['Class']=df['Class'].map({'Majority': 0, 'Minority':1})

# Step 2: Split features (X) and target variable (y)
X=df[['Age','Income']]
y=df['Class']

#Step 3: Apply SMOTE with k_neighbors=3; the default of 5 fails here because the minority class has only 4 samples
smote=SMOTE(sampling_strategy='auto',random_state=42,k_neighbors=3)
X_resampled,y_resampled=smote.fit_resample(X,y)

#Step 4: Convert numeric labels back to categorical
y_resampled=y_resampled.map({0:'Majority',1:'Minority'})

#Step 5: Combine the resampled dataset
df_balanced=pd.concat([pd.DataFrame(X_resampled,columns=['Age','Income']),pd.DataFrame(y_resampled,columns=['Class'])],axis=1)

#Step 6: Print class distribution
print(df_balanced['Class'].value_counts())

#Step 7: Display the upsampled dataset
print(df_balanced)

Class
Minority    9
Majority    9
Name: count, dtype: int64
    Age  Income     Class
0    22    2000  Minority
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Majority
5    35    3800  Minority
6    40    4000  Minority
7    45    4200  Minority
8    50    4300  Majority
9    55    4500  Majority
10   60    5000  Majority
11   65    5500  Majority
12   70    6000  Majority
13   40    4031  Minority
14   35    3831  Minority
15   44    4176  Minority
16   35    3826  Minority
17   41    4040  Minority
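
Rows 13 to 17 in the output above are the synthetic minority samples. Since SMOTE interpolates between existing minority points, each synthetic row should fall within the range spanned by the original minority rows. The sketch below checks this using the values printed above; the variable names are illustrative.

```python
import pandas as pd

# Original minority rows (indices 0, 5, 6, 7 in the output above)
original_minority = pd.DataFrame({
    'Age': [22, 35, 40, 45],
    'Income': [2000, 3800, 4000, 4200],
})

# Synthetic rows (indices 13 to 17 in the output above)
synthetic = pd.DataFrame({
    'Age': [40, 35, 44, 35, 41],
    'Income': [4031, 3831, 4176, 3826, 4040],
})

# Per-column bounds of the original minority class
lower = original_minority.min()
upper = original_minority.max()

# A row passes if every feature lies within the original bounds
inside = ((synthetic >= lower) & (synthetic <= upper)).all(axis=1)
print(inside)
```

A bounding-box check like this is only a sanity test; it does not prove that each point lies on a segment between two specific neighbours.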
