## Handle imbalanced datasets
### Handling mechanisms
1. Upsampling
2. Downsampling

In [2]:
# import libraries required
import pandas as pd
import numpy as np
from sklearn.utils import resample

### Upsampling
Upsampling handles an imbalanced dataset by increasing the number of samples in the minority class(es) until they are roughly as numerous as the majority class.

How it works:
1. Identify the minority class(es)
2. Randomly duplicate samples from the minority class (sampling with replacement)
3. Balance the dataset so all classes have similar representation
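
The three steps above can be sketched for any number of classes. This is a minimal illustration, not the notebook's own method: the helper name `upsample_to_majority` and the toy frame are assumptions made up for the example. Unlike `sklearn.utils.resample`, it keeps every original row and only adds the extra duplicates needed.

```python
import pandas as pd

def upsample_to_majority(df, label_col, random_state=0):
    """Duplicate rows of every smaller class until each class
    is as large as the largest one (hypothetical helper)."""
    # Step 1: find the class sizes and the size to aim for
    counts = df[label_col].value_counts()
    target = counts.max()

    parts = []
    for label, group in df.groupby(label_col):
        extra = target - len(group)
        if extra > 0:
            # Step 2: draw the missing rows with replacement
            duplicates = group.sample(n=extra, replace=True,
                                      random_state=random_state)
            group = pd.concat([group, duplicates])
        parts.append(group)

    # Step 3: recombine so all classes have equal representation
    return pd.concat(parts).reset_index(drop=True)

toy = pd.DataFrame({"x": range(6),
                    "label": ["A", "A", "A", "A", "B", "C"]})
balanced = upsample_to_majority(toy, "label")
print(balanced["label"].value_counts())
```

Every class ends up with as many rows as the largest class, and the added rows are exact copies of existing ones.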

Pros and Cons:

Pros:
- Simple to implement
- No data loss (unlike downsampling, which discards majority rows)
- Can improve model performance on the minority class

Cons:
- Creates exact duplicates, which raises the risk of overfitting to the minority samples
- Doesn't add new information
- Can amplify noise or outliers in the minority class

Alternatives:
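- SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic samples by interpolating between minority samples instead of duplicating them
- ADASYN (Adaptive Synthetic sampling): Generates more synthetic samples for minority points that are harder to learn
- Downsampling: Reduces the majority class instead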
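
To make the difference between duplicating and synthesizing concrete, here is a simplified SMOTE-style sketch in plain NumPy. It is an illustration of the interpolation idea only, not the reference algorithm; the function name `smote_like` and its parameters are assumptions for this example. In practice a maintained implementation such as the one in the imbalanced-learn package would normally be preferred.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                 # random minority point
        other = neighbours[base, rng.integers(k)]
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[other] - X_min[base])
    return synthetic

# Minority rows (feature1, feature2) taken from the toy dataset in this notebook
X_minority = np.array([[8, 80], [9, 90], [10, 100]])
new_points = smote_like(X_minority, n_new=4)
print(new_points)
```

Each synthetic point lies on the line segment between two existing minority points, so it is new data rather than a copy, but it still cannot leave the region the minority class already covers.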

In [None]:
# creating imbalanced dataset
df = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'class': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'] 
})
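
Before resampling, it can help to quantify how imbalanced the data is. The following sketch rebuilds the same toy frame so it runs on its own; the variable names `counts` and `imbalance_ratio` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'class': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B']
})

# Absolute and relative class frequencies
counts = df['class'].value_counts()
print(counts)
print(counts / counts.sum())

# Ratio of the largest class to the smallest class
imbalance_ratio = counts.max() / counts.min()
print(f'imbalance ratio: {imbalance_ratio:.2f}')
```

Here class A has 7 rows and class B has 3, so the majority class is a little over twice the size of the minority class.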

In [None]:
print('imbalanced dataset')
print(df)

df_majority = df[df['class']=='A']
df_minority = df[df['class']=='B']

print('Majority dataset')
print(df_majority)

print('Minority dataset')
print(df_minority)

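# Upsample: draw from the minority class with replacement until it matches the majority size
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)
# Downsample: draw from the majority class without replacement so no row is picked twice
df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)
# Recombine each resampled class with the untouched class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
df_downsampled = pd.concat([df_minority, df_majority_downsampled])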

print('Up sampled dataset')
print(df_upsampled)

print('Down sampled dataset')
print(df_downsampled)

imbalanced dataset
   feature1  feature2 class
0         1        10     A
1         2        20     A
2         3        30     A
3         4        40     A
4         5        50     A
5         6        60     A
6         7        70     A
7         8        80     B
8         9        90     B
9        10       100     B
Majority dataset
   feature1  feature2 class
0         1        10     A
1         2        20     A
2         3        30     A
3         4        40     A
4         5        50     A
5         6        60     A
6         7        70     A
Minority dataset
   feature1  feature2 class
7         8        80     B
8         9        90     B
9        10       100     B
Up sampled dataset
   feature1  feature2 class
0         1        10     A
1         2        20     A
2         3        30     A
3         4        40     A
4         5        50     A
5         6        60     A
6         7        70     A
9        10       100     B
7         8        80     B
9   