**Balancing Dataset with Downsampling**

Imagine we have a dataset for a binary classification task in which the class labels are imbalanced, and we want to downsample the majority class so that both classes have the same number of rows.

In [51]:
import pandas as pd
from sklearn.utils import resample

df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})

In [53]:
df.head()

,Age,Income,Class
0,22,2000,High
1,25,2500,Low
2,27,2700,Low
3,28,3200,High
4,30,3500,High


In [54]:
df_high=df[df['Class']=='High']
df_low=df[df['Class']=='Low']

In [55]:
print('High Classes: ',df_high.shape[0])
print('Low Classes: ',df_low.shape[0])

High Classes:  7
Low Classes:  6


In [57]:
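# Downsample the majority class ('High') without replacement to the size of the minority class ('Low')
df_high_downsampled = resample(df_high, replace=False, n_samples=len(df_low), random_state=42)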
df_high_downsampled

,Age,Income,Class
0,22,2000,High
3,28,3200,High
10,60,5000,High
4,30,3500,High
7,45,4200,High
6,40,4000,High


In [58]:
df_balanced=pd.concat([df_high_downsampled,df_low])

In [59]:
df_balanced['Class'].value_counts()

Class
High    6
Low     6
Name: count, dtype: int64
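
The same balancing can also be done without splitting the frame by hand. The sketch below is a pandas-only alternative, not the `sklearn.utils.resample` approach used above: it samples every class down to the size of the smallest one. The helper name `downsample_to_smallest` is hypothetical, and the rows it picks will generally differ from the ones shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'High', 'Low']
})


def downsample_to_smallest(frame, label_col, random_state=42):
    """Hypothetical helper: sample each class down to the smallest class size."""
    n_min = int(frame[label_col].value_counts().min())
    # GroupBy.sample draws n rows per group without replacement by default
    return frame.groupby(label_col).sample(n=n_min, random_state=random_state)


df_balanced_alt = downsample_to_smallest(df, 'Class')
print(df_balanced_alt['Class'].value_counts())
```

This form also works when there are more than two classes, since it never refers to a class by name.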

**Balancing Dataset with Upsampling**

In [60]:
df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})
df_high=df[df['Class']=='High']
df_low=df[df['Class']=='Low']

In [61]:
# Upsample the minority class ('Low') with replacement to the size of the majority class ('High')
df_lowupsampled = resample(df_low, replace=True, n_samples=len(df_high), random_state=42)

In [62]:
df_lowupsampled

,Age,Income,Class
8,50,4300,Low
9,55,4500,Low
5,35,3800,Low
9,55,4500,Low
9,55,4500,Low
2,27,2700,Low
5,35,3800,Low


In [63]:
df_up_balanced=pd.concat([df_lowupsampled,df_high])

In [64]:
df_up_balanced['Class'].value_counts()

Class
Low     7
High    7
Name: count, dtype: int64
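
Upsampling with replacement balances the counts only by repeating rows: in the upsampled output above, rows 9 and 5 each appear more than once. The sketch below makes that visible by counting repeated index values. It uses `DataFrame.sample` as a pandas-only stand-in for `resample`, so the rows drawn may differ from the ones shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'High', 'Low']
})

df_low = df[df['Class'] == 'Low']
n_high = int((df['Class'] == 'High').sum())

# Draw as many 'Low' rows as there are 'High' rows, allowing repeats
df_low_up = df_low.sample(n=n_high, replace=True, random_state=42)

# Index values that occur more than once are duplicated rows
n_repeats = int(df_low_up.index.duplicated().sum())
print(f"{len(df_low_up)} rows drawn, {n_repeats} of them are repeats")
```

Because 7 rows are drawn from only 6 distinct ones, at least one repeat is unavoidable.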

**SMOTE (Synthetic Minority Over-sampling Technique)**
1. Use SMOTE to generate synthetic samples instead of duplicating existing ones
2. Convert the categorical class labels to numeric form before applying SMOTE
3. Apply SMOTE to balance the dataset
4. Convert the labels back to their original categorical form
5. Combine the resampled data into a final balanced dataset

In [71]:
!pip install scikit-learn imbalanced-learn



In [73]:
!pip install --upgrade imbalanced-learn

Collecting imbalanced-learn
  Obtaining dependency information for imbalanced-learn from https://files.pythonhosted.org/packages/9d/41/721fec82606242a2072ee909086ff918dfad7d0199a9dfd4928df9c72494/imbalanced_learn-0.13.0-py3-none-any.whl.metadata
  Downloading imbalanced_learn-0.13.0-py3-none-any.whl.metadata (8.8 kB)
Collecting scikit-learn<2,>=1.3.2 (from imbalanced-learn)
  Obtaining dependency information for scikit-learn<2,>=1.3.2 from https://files.pythonhosted.org/packages/a1/a6/c5b78606743a1f28eae8f11973de6613a5ee87366796583fb74c67d54939/scikit_learn-1.6.1-cp311-cp311-win_amd64.whl.metadata
  Downloading scikit_learn-1.6.1-cp311-cp311-win_amd64.whl.metadata (15 kB)
Collecting sklearn-compat<1,>=0.1 (from imbalanced-learn)
  Obtaining dependency information for sklearn-compat<1,>=0.1 from https://files.pythonhosted.org/packages/f0/a8/ad69cf130fbd017660cdd64abbef3f28135d9e2e15fe3002e03c5be0ca38/sklearn_compat-0.1.3-py3-none-any.whl.metadata
  Downloading sklearn_compat-0.1.3-py3-n

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\CVR\\anaconda3\\Lib\\site-packages\\~klearn\\decomposition\\_cdnmf_fast.cp311-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



In [74]:
import pandas as pd
from imblearn.over_sampling import SMOTE
# Create the DataFrame
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})

# Relabel by row position, ignoring the original 'High'/'Low' values:
# the first 9 rows become 'Majority'
df.loc[df.index[:9], 'Class'] = 'Majority'

# and the remaining 4 rows become 'Minority'
df.loc[df.index[9:], 'Class'] = 'Minority'

df['Class'].value_counts()

ImportError: cannot import name '_MissingValues' from 'sklearn.utils._param_validation' (C:\Users\CVR\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py)
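
An `ImportError` like this one usually means the installed `imbalanced-learn` expects a different `scikit-learn` version than the one present, which is plausible here given the failed upgrade earlier. As a first diagnostic, the sketch below reads the installed versions from package metadata without importing either library. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Hypothetical helper: return the installed version string, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Reading metadata does not import the packages, so it works even when the import fails
for name in ("scikit-learn", "imbalanced-learn"):
    print(f"{name}: {installed_version(name)}")
```

If the two versions turn out to be incompatible, a likely fix is to restart the kernel and reinstall both packages, for example with the `--user` option that pip's error message suggests.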