# Balancing a Dataset with Downsampling

Suppose we have a dataset for a binary classification task in which the class labels are imbalanced. To balance it, we can downsample the majority class: randomly discard majority-class rows until both classes are the same size.

In [3]:
import pandas as pd
from sklearn.utils import resample

# Sample dataset
df = pd.DataFrame({
    'Age': [22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income': [2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class': ['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})

The `High` class has 7 instances and the `Low` class has 6, so `High` is the majority class.
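
These counts can be checked directly from the label column. A minimal sketch, which rebuilds only the `Class` column of the sample dataset so that it runs on its own (the names `labels` and `class_counts` are illustrative):

```python
import pandas as pd

# Class labels copied from the sample dataset above
labels = pd.Series(
    ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
     'High', 'Low', 'Low', 'High', 'High', 'Low'],
    name='Class',
)

# Count the number of rows in each class
class_counts = labels.value_counts()
print(class_counts)
```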

In [10]:
# Separate majority and minority classes
df_high = df[df['Class'] == 'High']
df_low = df[df['Class'] == 'Low']
print(df_high,"\n")
print(df_low)

    Age  Income Class
0    22    2000  High
3    28    3200  High
4    30    3500  High
6    40    4000  High
7    45    4200  High
10   60    5000  High
11   65    5500  High 

    Age  Income Class
1    25    2500   Low
2    27    2700   Low
5    35    3800   Low
8    50    4300   Low
9    55    4500   Low
12   70    6000   Low


In [3]:
# Downsample majority class: draw len(df_low) rows without replacement
df_high_downsampled = resample(df_high, replace=False, n_samples=len(df_low), random_state=42)

In [4]:
# Combine downsampled majority with minority class
df_balanced = pd.concat([df_high_downsampled, df_low])

In [6]:
print(df_balanced['Class'].value_counts())

Class
High    6
Low     6
Name: count, dtype: int64
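
The same balancing can also be sketched with pandas alone, without splitting the classes by hand. This is an alternative to the `resample` approach above, not the method used in this notebook: it assumes pandas 1.1 or later (for `groupby(...).sample`), the names `n_min` and `df_balanced_alt` are illustrative, and it will not necessarily keep the same rows as `resample` does for the same seed.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000,
               4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
              'High', 'Low', 'Low', 'High', 'High', 'Low'],
})

# Size of the smallest class
n_min = df['Class'].value_counts().min()

# Draw n_min rows without replacement from every class
df_balanced_alt = df.groupby('Class').sample(n=n_min, random_state=42)

print(df_balanced_alt['Class'].value_counts())
```

Because every class is sampled down to the size of the smallest one, this form also works unchanged when there are more than two classes.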


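# Upsampling the Minority Class

Instead of discarding majority-class rows, we can upsample the minority class: sample its rows with replacement until it matches the majority class in size.
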
In [12]:
import pandas as pd
from sklearn.utils import resample

# Sample dataset
df = pd.DataFrame({
    'Age': [22,25,27,28,30,35,40,45,50,55,60,65],
    'Income': [2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500],
    'Class': ['Minority','Majority','Majority','Majority','Majority','Minority','Minority','Minority','Minority','Majority','Majority','Majority']
})

In [14]:
df_majority = df[df['Class']== 'Majority']
df_minority = df[df['Class']== 'Minority']
print(df_majority, "\n")
print(df_minority)

    Age  Income     Class
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Majority
9    55    4500  Majority
10   60    5000  Majority
11   65    5500  Majority 

   Age  Income     Class
0   22    2000  Minority
5   35    3800  Minority
6   40    4000  Minority
7   45    4200  Minority
8   50    4300  Minority


In [16]:
# Upsample minority class: draw len(df_majority) rows with replacement
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)

In [17]:
df_balanced = pd.concat([df_majority, df_minority_upsampled])

In [18]:
print(df_balanced['Class'].value_counts())

Class
Majority    7
Minority    7
Name: count, dtype: int64
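
Because the minority class has only 5 distinct rows, drawing 7 rows with replacement must repeat some of them. A small sketch that makes the repeats visible; it rebuilds the minority rows (with their original index labels) so that it runs on its own, and the names `n_unique` and `n_repeated` are illustrative:

```python
import pandas as pd
from sklearn.utils import resample

# Minority rows from the dataset above, keeping their original index labels
df_minority = pd.DataFrame(
    {
        'Age': [22, 35, 40, 45, 50],
        'Income': [2000, 3800, 4000, 4200, 4300],
        'Class': ['Minority'] * 5,
    },
    index=[0, 5, 6, 7, 8],
)

# Draw 7 rows with replacement, as in the upsampling cell
df_minority_upsampled = resample(df_minority, replace=True, n_samples=7, random_state=42)

# A row drawn more than once keeps its original index label,
# so repeated labels reveal the duplicated rows
n_unique = df_minority_upsampled.index.nunique()
n_repeated = df_minority_upsampled.index.duplicated().sum()
print(f"{n_unique} distinct rows, {n_repeated} repeats")
print(df_minority_upsampled.sort_index())
```

The upsampled class therefore adds no new information; it only reweights rows that already exist.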


In [4]:
pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


1. Use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic minority samples instead of duplicating existing ones.

2. Convert the categorical class labels to numeric form before applying SMOTE.

3. Apply SMOTE to balance the dataset.

4. Convert the numeric labels back to the original categorical labels.

5. Combine the resampled data into a final balanced dataset.
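
To illustrate what "synthetic" means in step 1, here is a simplified sketch of the interpolation at the heart of SMOTE. It is not the imbalanced-learn implementation: the neighbour is hard-coded rather than chosen among the k nearest minority neighbours, and the names `x`, `neighbor`, `lam`, and `synthetic` are illustrative.

```python
import random

rng = random.Random(42)

# Two minority-class points (Age, Income) from the upsampling dataset
x = (40, 4000)
neighbor = (45, 4200)

# A synthetic point sits at a random position on the line segment
# between a minority sample and one of its minority neighbours
lam = rng.random()  # random fraction in [0, 1)
synthetic = tuple(a + lam * (b - a) for a, b in zip(x, neighbor))

print(synthetic)
```

Each synthetic row is a new point lying between two existing minority rows, rather than an exact copy of either.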

In [7]:
pip install --upgrade scikit-learn imbalanced-learn



In [13]:
pip uninstall scikit-learn imbalanced-learn -y

Found existing installation: scikit-learn 1.6.1
Uninstalling scikit-learn-1.6.1:
  Successfully uninstalled scikit-learn-1.6.1
Found existing installation: imbalanced-learn 0.13.0
Uninstalling imbalanced-learn-0.13.0:
  Successfully uninstalled imbalanced-learn-0.13.0
Note: you may need to restart the kernel to use updated packages.


In [14]:
pip install -U scikit-learn imbalanced-learn

Collecting scikit-learn
  Obtaining dependency information for scikit-learn from https://files.pythonhosted.org/packages/a1/a6/c5b78606743a1f28eae8f11973de6613a5ee87366796583fb74c67d54939/scikit_learn-1.6.1-cp311-cp311-win_amd64.whl.metadata
  Using cached scikit_learn-1.6.1-cp311-cp311-win_amd64.whl.metadata (15 kB)
Collecting imbalanced-learn
  Obtaining dependency information for imbalanced-learn from https://files.pythonhosted.org/packages/9d/41/721fec82606242a2072ee909086ff918dfad7d0199a9dfd4928df9c72494/imbalanced_learn-0.13.0-py3-none-any.whl.metadata
  Using cached imbalanced_learn-0.13.0-py3-none-any.whl.metadata (8.8 kB)
Using cached scikit_learn-1.6.1-cp311-cp311-win_amd64.whl (11.1 MB)
Using cached imbalanced_learn-0.13.0-py3-none-any.whl (238 kB)
Installing collected packages: scikit-learn, imbalanced-learn
Successfully installed imbalanced-learn-0.13.0 scikit-learn-1.6.1
Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
from imblearn.over_sampling import SMOTE



In [3]:
# Sample dataset
df = pd.DataFrame({
    'Age': [22,25,27,28,30,35,40,45,50,55,60,65],
    'Income': [2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500],
    'Class': ['Minority','Majority','Majority','Majority','Majority','Minority','Minority','Minority','Minority','Majority','Majority','Majority']
})

# Step 1: Convert categorical labels to numerical values
df['Class'] = df['Class'].map({'Majority': 0, 'Minority': 1})

# Step 2: Split features (X) and target variable (y)
X = df[['Age', 'Income']]
y = df['Class']

# Step 3: Apply SMOTE with k_neighbors=3 (the default of 5 needs at least
# 6 minority samples, but this dataset has only 5)
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=3)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Step 4: Convert numeric labels back to categorical
y_resampled = y_resampled.map({0: 'Majority', 1: 'Minority'})

# Step 5: Combine the resampled features and labels into one DataFrame
df_balanced = pd.concat([X_resampled, y_resampled], axis=1)

# Step 6: Print class distribution
print(df_balanced['Class'].value_counts())
                         
# Step 7: Display the upsampled dataset
print(df_balanced)

Class
Minority    7
Majority    7
Name: count, dtype: int64
    Age  Income     Class
0    22    2000  Minority
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Majority
5    35    3800  Minority
6    40    4000  Minority
7    45    4200  Minority
8    50    4300  Minority
9    55    4500  Majority
10   60    5000  Majority
11   65    5500  Majority
12   44    4190  Minority
13   38    3946  Minority
