**Balancing a Dataset with Sampling**

Imagine we have a dataset for a binary classification task in which the class labels are imbalanced, and we want to downsample the majority class to balance it.

In [1]:
import pandas as pd
from sklearn.utils import resample
# Create a small example dataset
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2200, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'High', 'Low']
})

The High class has 7 instances and the Low class has 6, so High is the majority class.
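The class counts stated above can be checked directly rather than counted by hand. The following is a minimal sketch that rebuilds the same toy dataset and uses `value_counts`; the names `class_counts` and `imbalance_ratio` are introduced here only for illustration.

```python
import pandas as pd

# Same toy dataset as in the cell above
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2200, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
              'High', 'Low', 'Low', 'High', 'High', 'Low'],
})

# Count rows per class label
class_counts = df['Class'].value_counts()
print(class_counts)

# Ratio of the largest class to the smallest (1.0 would mean perfectly balanced)
imbalance_ratio = class_counts.max() / class_counts.min()
print(round(imbalance_ratio, 2))
```

A ratio close to 1 means the imbalance here is mild; the dataset is kept small so that the effect of each resampling step is easy to read off the printed rows.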

In [3]:
# Separate the majority (High) and minority (Low) classes
df_high = df[df['Class'] == 'High']
df_low = df[df['Class'] == 'Low']
print(df_high)
print(df_low)

    Age  Income Class
0    22    2200  High
3    28    3200  High
4    30    3500  High
6    40    4000  High
7    45    4200  High
10   60    5000  High
11   65    5500  High
    Age  Income Class
1    25    2500   Low
2    27    2700   Low
5    35    3800   Low
8    50    4300   Low
9    55    4500   Low
12   70    6000   Low


**DOWN SAMPLING**

In [7]:
# Downsample the majority class to the size of the minority class (sampling without replacement)
df_high_downsampled = resample(df_high, replace=False, n_samples=len(df_low), random_state=42)
print(df_high_downsampled)

    Age  Income Class
0    22    2200  High
3    28    3200  High
10   60    5000  High
4    30    3500  High
7    45    4200  High
6    40    4000  High


In [9]:
# Combine the downsampled majority class with the minority class
df_balanced = pd.concat([df_high_downsampled, df_low])
print(df_balanced)

    Age  Income Class
0    22    2200  High
3    28    3200  High
10   60    5000  High
4    30    3500  High
7    45    4200  High
6    40    4000  High
1    25    2500   Low
2    27    2700   Low
5    35    3800   Low
8    50    4300   Low
9    55    4500   Low
12   70    6000   Low


In [10]:
print(df_balanced['Class'].value_counts())

Class
High    6
Low     6
Name: count, dtype: int64
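The same downsampling can also be written without splitting the classes by hand. The sketch below is one possible alternative, assuming pandas 1.1 or later (for `DataFrameGroupBy.sample`); the helper name `downsample_to_minority` is hypothetical and not part of the notebook above.

```python
import pandas as pd

def downsample_to_minority(frame, label_col, seed=42):
    """Randomly keep the same number of rows per class as the smallest class has."""
    min_count = frame[label_col].value_counts().min()
    # Sample without replacement inside each class group
    return frame.groupby(label_col).sample(n=min_count, random_state=seed)

toy = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2200, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
              'High', 'Low', 'Low', 'High', 'High', 'Low'],
})

balanced = downsample_to_minority(toy, 'Class')
print(balanced['Class'].value_counts())
```

Because the sampling is random, the specific High rows kept may differ from those selected by `resample` above, but each class ends up with 6 rows either way.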


**UP SAMPLING**

In [13]:
# Upsample the minority class to the size of the majority class (sampling with replacement)
df_low_upsampled = resample(df_low, replace=True, n_samples=len(df_high), random_state=42)
print(df_low_upsampled)

   Age  Income Class
8   50    4300   Low
9   55    4500   Low
5   35    3800   Low
9   55    4500   Low
9   55    4500   Low
2   27    2700   Low
5   35    3800   Low


In [15]:
# Combine the upsampled minority class with the majority class
df_bal = pd.concat([df_low_upsampled, df_high])
print(df_bal)

    Age  Income Class
8    50    4300   Low
9    55    4500   Low
5    35    3800   Low
9    55    4500   Low
9    55    4500   Low
2    27    2700   Low
5    35    3800   Low
0    22    2200  High
3    28    3200  High
4    30    3500  High
6    40    4000  High
7    45    4200  High
10   60    5000  High
11   65    5500  High


In [16]:
print(df_bal['Class'].value_counts())

Class
Low     7
High    7
Name: count, dtype: int64
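Upsampling with replacement repeats existing rows, so the combined frame contains duplicated index labels (row 9 appears three times in the output above). The sketch below wraps the same `resample` call in a small helper and resets the index afterwards; the helper name `upsample_to_majority` is hypothetical and introduced only for illustration.

```python
import pandas as pd
from sklearn.utils import resample

def upsample_to_majority(frame, label_col, seed=42):
    """Resample every smaller class with replacement up to the largest class size."""
    max_count = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        if len(group) < max_count:
            # Draw rows with replacement until the group reaches max_count
            group = resample(group, replace=True, n_samples=max_count, random_state=seed)
        parts.append(group)
    return pd.concat(parts)

toy = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2200, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
              'High', 'Low', 'Low', 'High', 'High', 'Low'],
})

upsampled = upsample_to_majority(toy, 'Class')
has_repeated_rows = upsampled.index.duplicated().any()

# A fresh 0..n-1 index avoids ambiguous label-based lookups later on
upsampled_clean = upsampled.reset_index(drop=True)
print(upsampled_clean['Class'].value_counts())
```

Drawing 7 rows from 6 with replacement must repeat at least one row, which is why `has_repeated_rows` is always true for this data.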


**SMOTE (Synthetic Minority Over-sampling Technique) balances the dataset by generating synthetic examples rather than simply duplicating existing ones.**




1. Use SMOTE to generate synthetic samples instead of duplicating existing ones.
2. Convert the categorical class labels into numeric form for SMOTE to work.
3. Apply SMOTE to balance the dataset.
4. Convert the numeric labels back to the original categorical labels.
5. Combine the resampled data into a final balanced dataset.
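To make "synthetic" concrete: the core idea of SMOTE is to place a new point somewhere on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below illustrates only that interpolation step with NumPy. It is a simplified illustration, not the imbalanced-learn implementation, and the two points and the names `x`, `neighbor`, and `lam` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class points, written as (Age, Income)
x = np.array([35.0, 3800.0])
neighbor = np.array([40.0, 4000.0])

# Pick a random position along the segment from x to neighbor
lam = rng.uniform(0.0, 1.0)

# The synthetic point lies between the two real points in every feature
synthetic = x + lam * (neighbor - x)
print(synthetic)
```

This is why SMOTE output tends to contain feature values that never appeared in the original data, unlike plain upsampling, which only repeats existing rows.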



In [17]:
pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [21]:
pip uninstall scikit-learn imbalanced-learn -y


Found existing installation: scikit-learn 1.3.0
Uninstalling scikit-learn-1.3.0:
  Successfully uninstalled scikit-learn-1.3.0
Note: you may need to restart the kernel to use updated packages.


ERROR: Exception:
Traceback (most recent call last):
  File "C:\Users\CVR\anaconda3\Lib\site-packages\pip\_internal\cli\base_command.py", line 180, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "C:\Users\CVR\anaconda3\Lib\site-packages\pip\_internal\commands\uninstall.py", line 110, in run
    uninstall_pathset.commit()
  File "C:\Users\CVR\anaconda3\Lib\site-packages\pip\_internal\req\req_uninstall.py", line 432, in commit
    self._moved_paths.commit()
  File "C:\Users\CVR\anaconda3\Lib\site-packages\pip\_internal\req\req_uninstall.py", line 278, in commit
    save_dir.cleanup()
  File "C:\Users\CVR\anaconda3\Lib\site-packages\pip\_internal\utils\temp_dir.py", line 173, in cleanup
    rmtree(self._path)
  File "C:\Users\CVR\anaconda3\Lib\site-packages\pip\_vendor\tenacity\__init__.py", line 291, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\CVR\anaconda3\Lib\site-packages\pip\_vendor\tenacity\__

In [1]:
pip install -U scikit-learn imbalanced-learn

Collecting imbalanced-learn
  Obtaining dependency information for imbalanced-learn from https://files.pythonhosted.org/packages/9d/41/721fec82606242a2072ee909086ff918dfad7d0199a9dfd4928df9c72494/imbalanced_learn-0.13.0-py3-none-any.whl.metadata
  Using cached imbalanced_learn-0.13.0-py3-none-any.whl.metadata (8.8 kB)
Collecting sklearn-compat<1,>=0.1 (from imbalanced-learn)
  Obtaining dependency information for sklearn-compat<1,>=0.1 from https://files.pythonhosted.org/packages/f0/a8/ad69cf130fbd017660cdd64abbef3f28135d9e2e15fe3002e03c5be0ca38/sklearn_compat-0.1.3-py3-none-any.whl.metadata
  Using cached sklearn_compat-0.1.3-py3-none-any.whl.metadata (18 kB)
Using cached imbalanced_learn-0.13.0-py3-none-any.whl (238 kB)
Using cached sklearn_compat-0.1.3-py3-none-any.whl (18 kB)
Installing collected packages: sklearn-compat, imbalanced-learn
  Attempting uninstall: imbalanced-learn
    Found existing installation: imbalanced-learn 0.10.1
    Uninstalling imbalanced-learn-0.10.1:
     

In [2]:
import pandas as pd
from imblearn.over_sampling import SMOTE

In [10]:
# Original dataset
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500],
    'Class': ['Minority', 'Majority', 'Majority', 'Majority', 'Majority', 'Minority', 'Minority', 'Majority', 'Majority', 'Majority', 'Majority', 'Majority']
})

# 1. Convert categorical labels to numeric
df['Class'] = df['Class'].map({'Majority': 0, 'Minority': 1})

# 2. Separate features (X) and target (y)
X = df[['Age', 'Income']]  # Features
y = df['Class']  # Target variable

# 3. Apply SMOTE to generate synthetic samples for the minority class
# The minority class has only 3 samples, so k_neighbors is lowered from the default of 5 to 2
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=2)
X_resampled, y_resampled = smote.fit_resample(X, y)

# 4. Convert numeric labels back to categorical
y_resampled = y_resampled.map({0: 'Majority', 1: 'Minority'})

# 5. Combine the resampled dataset
df_balanced = pd.concat([X_resampled, y_resampled], axis=1)

# 6. Print class distribution
print(df_balanced['Class'].value_counts())

# 7. Display the upsampled dataset
print(df_balanced)


Class
Minority    9
Majority    9
Name: count, dtype: int64
    Age  Income     Class
0    22    2000  Minority
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Majority
5    35    3800  Minority
6    40    4000  Minority
7    45    4200  Majority
8    50    4300  Majority
9    55    4500  Majority
10   60    5000  Majority
11   70    5500  Majority
12   32    3519  Minority
13   39    3988  Minority
14   39    3973  Minority
15   36    3879  Minority
16   36    3858  Minority
17   22    2041  Minority
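The manual label mapping in steps 1 and 4 works well for two classes. As a possible alternative, scikit-learn's `LabelEncoder` can handle the conversion in both directions without a hand-written dictionary. The sketch below shows only the encoding round trip on a few hypothetical labels; the names `labels`, `encoded`, and `decoded` are introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few hypothetical labels in the same style as the SMOTE example
labels = pd.Series(['Minority', 'Majority', 'Majority', 'Minority'], name='Class')

encoder = LabelEncoder()

# Classes are sorted alphabetically, so Majority -> 0 and Minority -> 1
encoded = encoder.fit_transform(labels)
print(encoded)

# inverse_transform restores the original string labels
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

The fitted encoder keeps the mapping in `encoder.classes_`, so the same object can decode the resampled target after SMOTE has been applied.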
