# Handling Imbalanced Data: Data-Level Approaches

Imbalanced datasets are common in real-world problems (fraud detection, medical diagnosis, churn, etc.), where one class significantly outnumbers the other.  
Data-level approaches try to fix this imbalance by modifying the dataset itself.

---

##  Data-Level Approaches

1. **Over-sampling**: increase minority class samples  
2. **Under-sampling**: reduce majority class samples

---

##  Over-sampling

**Definition:**  
Increase the number of data points in the **minority class** to match the majority class using some procedure (not always simple duplication).

### Random Over-sampling (ROS)

Randomly selects samples from the minority class and **duplicates** them until the classes are balanced.

**Advantages:**
- Simple to implement  
- No loss of data  

**Disadvantages:**
1. **Overfitting / Memorization**: the model may memorize duplicated samples  
2. **No new information** added for learning  
3. **Computationally expensive**: dataset size increases  
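
The duplication step can be sketched in a few lines of NumPy. This is a minimal illustration for the binary case, not a library implementation; `random_oversample` and the toy arrays are hypothetical names introduced here.

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes have the same size."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    # Sample with replacement, so the same minority row may be copied several times.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    X_res = np.vstack([X, X[extra_idx]])
    y_res = np.concatenate([y, y[extra_idx]])
    return X_res, y_res

# Toy data: 6 majority rows (label 0) and 2 minority rows (label 1).
X_toy = np.arange(16, dtype=float).reshape(8, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

X_ros, y_ros = random_oversample(X_toy, y_toy, minority_label=1)
print(np.bincount(y_ros))  # [6 6]
```

Every appended row is an exact copy of an existing minority row, which is why ROS adds no new information and can encourage memorization.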

---

## Random Over-sampling with Noise

Instead of duplicating samples, small random noise is added to minority samples to create new ones.

**Idea:**
- Draw noise from the **standard normal distribution (SND)**:  
  mean = 0, standard deviation = 1
- New sample:
  
\[
x_{new} = x + \alpha \cdot \epsilon
\]

Where:  
- \(x\) = original data point  
- \(\epsilon \sim \mathcal{N}(0,1)\) = noise  
- \(\alpha\) = **shrinkage factor**

### Shrinkage Factor (α)

Controls how much noise is added:

- **α = 0** → No noise → same as random oversampling (duplication)  
- **Small α** → Slight variation around original points  
- **Large α** → Too much noise → may create unrealistic samples  
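
The effect of α can be seen in a small NumPy sketch of the formula above. It assumes the features are already on comparable scales (for example standardized), so that a single α makes sense for every column; `oversample_with_noise` and the toy points are hypothetical.

```python
import numpy as np

def oversample_with_noise(X_min, n_new, alpha, seed=0):
    """Create n_new samples as x_new = x + alpha * eps, with eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    # Pick which minority points to perturb (with replacement).
    base_idx = rng.integers(0, len(X_min), size=n_new)
    eps = rng.standard_normal(size=(n_new, X_min.shape[1]))
    return X_min[base_idx] + alpha * eps

# Three toy minority points with two features each.
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [0.9, 2.2]])

exact_copies = oversample_with_noise(X_min, n_new=5, alpha=0.0)  # plain duplication
jittered = oversample_with_noise(X_min, n_new=5, alpha=0.1)      # small perturbations

print(exact_copies)
print(jittered)
```

With α = 0 every generated row coincides with an original minority point; with α = 0.1 the rows scatter slightly around those points.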

---

## SMOTE: Synthetic Minority Over-sampling Technique

SMOTE creates **synthetic samples** for the minority class by interpolating between existing minority points instead of duplicating them.

### How SMOTE Works
1. Identify minority class samples.  
2. For each point, find its *k* nearest minority neighbors.  
3. Generate new points along the line joining them:

\[
x_{new} = x_i + \lambda (x_{nn} - x_i), \quad \lambda \in (0,1)
\]
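
The three steps can be sketched directly in NumPy. This is a simplified illustration of the interpolation idea, not the `imblearn` implementation; `smote_sketch` is a hypothetical helper, it uses brute-force Euclidean distances, and it draws λ from [0, 1) rather than the open interval.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Interpolate between minority points and their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, len(X_min), size=n_new)            # x_i for each new sample
    nn = neighbors[base, rng.integers(0, k, size=n_new)]      # one random neighbor x_nn
    lam = rng.random(size=(n_new, 1))                         # lambda for each new sample
    return X_min[base] + lam * (X_min[nn] - X_min[base])

# Four toy minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=10, k=2)
print(synthetic)
```

Because each new point lies on a segment between two existing minority points, the synthetic samples stay inside the region the minority class already occupies.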

### Advantages
- Adds **new meaningful samples**  
- Reduces overfitting compared to ROS  
- Helps models learn better decision boundaries  

### Limitations
- Can place synthetic points in regions where the classes overlap, which blurs the class boundary  
- Not suited to categorical features, since interpolating between category codes is not meaningful  
  → Use **SMOTENC** (mixed) or **SMOTEN** (categorical)

---

## Under-sampling

**Definition:**  
Reduce the number of data points in the **majority class** to balance the dataset.

**Pros:**
- Faster training  
- Smaller dataset size  

**Cons:**
- May remove useful information  
- Risk of underfitting  
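
Random under-sampling, the simplest variant, can be sketched as follows. It is a minimal binary-class illustration; `random_undersample` and the toy arrays are hypothetical names.

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=0):
    """Keep all minority rows and a random subset of majority rows of the same size."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    # Sample without replacement: each kept majority row appears once.
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.sort(np.concatenate([kept_majority, minority_idx]))
    return X[keep], y[keep]

# Toy data: 6 majority rows (label 0) and 2 minority rows (label 1).
X_toy = np.arange(16, dtype=float).reshape(8, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

X_rus, y_rus = random_undersample(X_toy, y_toy, majority_label=0)
print(np.bincount(y_rus))  # [2 2]
```

Four of the six majority rows are discarded here, which shows how quickly information can be lost when the imbalance is large.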

---

##  Summary

- **Random Over-sampling:** duplicates minority samples → risk of overfitting  
- **Over-sampling with noise:** adds small variations using SND + shrinkage factor  
- **SMOTE:** generates synthetic samples via interpolation → better generalization  
- **Under-sampling:** removes majority samples → faster but may lose information  

---

 *These techniques help improve model performance on imbalanced datasets by ensuring fair representation of all classes.*


---

## Objective
To understand and implement data-level techniques for handling imbalanced datasets and study their impact on model performance.


In [2]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

In [3]:
data = load_breast_cancer()
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [4]:
X = data.data
y = data.target
dataset = pd.DataFrame(X, columns = data.feature_names)
dataset['target'] = y
dataset.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [5]:
dataset['target'].value_counts(normalize = True)

target
1    0.627417
0    0.372583
Name: proportion, dtype: float64

To handle imbalanced data, apply resampling methods such as SMOTE to the training data only. Never resample the test data, so that evaluation reflects the original class distribution.
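
A minimal sketch of that ordering: split first, fit the scaler on the training split, scale the test split with the same scaler, and resample the training split only. The variable names are illustrative, and plain random over-sampling stands in for SMOTE so the example needs only NumPy and scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Scaling statistics come from the training split only.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
X_te_s = demo_scaler.transform(X_te)  # scaled, but never resampled

# Stand-in resampler: duplicate minority rows (label 0 in this dataset) in the training split.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y_tr == 0)
n_extra = int(np.sum(y_tr == 1)) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_tr_bal = np.vstack([X_tr_s, X_tr_s[extra_idx]])
y_tr_bal = np.concatenate([y_tr, y_tr[extra_idx]])

print("train after resampling:", np.bincount(y_tr_bal))
print("test (untouched):      ", np.bincount(y_te))
```

The test split keeps its original class proportions, so any metric computed on it describes performance on realistically distributed data.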

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((455, 30), (114, 30), (455,), (114,))

In [11]:
pd.Series(y_train).value_counts(normalize = True)

1    0.626374
0    0.373626
Name: proportion, dtype: float64

In [12]:
pd.Series(y_train).value_counts()

1    285
0    170
Name: count, dtype: int64

In [13]:
from sklearn.preprocessing import StandardScaler

In [14]:
scaler = StandardScaler()
scaler

In [15]:
X_train_transformed = scaler.fit_transform(X_train)
X_train_transformed[:5]

array([[-1.07200079e+00, -6.58424598e-01, -1.08808010e+00,
        -9.39273639e-01, -1.35939882e-01, -1.00871795e+00,
        -9.68358632e-01, -1.10203235e+00,  2.81062120e-01,
        -1.13231479e-01, -7.04860874e-01, -4.40938351e-01,
        -7.43948977e-01, -6.29804931e-01,  7.48061001e-04,
        -9.91572979e-01, -6.93759567e-01, -9.83284458e-01,
        -5.91579010e-01, -4.28972052e-01, -1.03409427e+00,
        -6.23497432e-01, -1.07077336e+00, -8.76534437e-01,
        -1.69982346e-01, -1.03883630e+00, -1.07899452e+00,
        -1.35052668e+00, -3.52658049e-01, -5.41380026e-01],
       [ 1.74874285e+00,  6.65017334e-02,  1.75115682e+00,
         1.74555856e+00,  1.27446827e+00,  8.42288215e-01,
         1.51985232e+00,  1.99466430e+00, -2.93045055e-01,
        -3.20179716e-01,  1.27567198e-01, -3.81382677e-01,
         9.40746962e-02,  3.17524379e-01,  6.39656015e-01,
         8.73892616e-02,  7.08450758e-01,  1.18215034e+00,
         4.26212305e-01,  7.47970186e-02,  1.22834212e+

In [16]:
from imblearn.over_sampling import SMOTE

In [17]:
# Fix random_state so the synthetic samples are reproducible across runs
smote = SMOTE(sampling_strategy='auto', k_neighbors=5, random_state=42)
smote

In [18]:
X_train_res, y_train_res = smote.fit_resample(X_train_transformed, y_train)
X_train_res.shape, y_train_res.shape

((570, 30), (570,))

In [19]:
pd.Series(y_train_res).value_counts()

1    285
0    285
Name: count, dtype: int64

In [20]:
# Dict form: target number of samples per class after resampling
smote = SMOTE(sampling_strategy={1: 285, 0: 300}, k_neighbors=5, random_state=42)
smote

In [21]:
X_train_res, y_train_res = smote.fit_resample(X_train_transformed, y_train)
X_train_res.shape, y_train_res.shape

((585, 30), (585,))

In [22]:
pd.Series(y_train_res).value_counts()

0    300
1    285
Name: count, dtype: int64
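
The counts above are consistent with reading the dict values as final per-class totals rather than as numbers of samples to add. A small arithmetic check, using the training counts printed earlier:

```python
from collections import Counter

original_counts = Counter({1: 285, 0: 170})  # training counts before resampling
target_counts = {1: 285, 0: 300}             # the sampling_strategy dict used above

# Number of synthetic samples needed per class to reach the target.
synthetic_needed = {
    label: target_counts[label] - original_counts[label] for label in target_counts
}

print(synthetic_needed)             # {1: 0, 0: 130}
print(sum(target_counts.values()))  # 585
```

Class 1 receives no synthetic samples and class 0 receives 130, which gives the 585 rows reported by `fit_resample`.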

In [23]:
dataset.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [24]:
X_train_transformed = pd.DataFrame(X_train_transformed, columns = scaler.get_feature_names_out())
X_train_transformed.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29
0,-1.072001,-0.658425,-1.08808,-0.939274,-0.13594,-1.008718,-0.968359,-1.102032,0.281062,-0.113231,...,-1.034094,-0.623497,-1.070773,-0.876534,-0.169982,-1.038836,-1.078995,-1.350527,-0.352658,-0.54138
1,1.748743,0.066502,1.751157,1.745559,1.274468,0.842288,1.519852,1.994664,-0.293045,-0.32018,...,1.228342,-0.092833,1.187467,1.104386,1.517001,0.249655,1.178594,1.549916,0.191078,-0.173739
2,-0.974734,-0.931124,-0.997709,-0.867589,-0.613515,-1.138154,-1.092292,-1.243358,0.434395,-0.429247,...,-0.973231,-1.036772,-1.008044,-0.834168,-1.097823,-1.16726,-1.282241,-1.707442,-0.307734,-1.213033
3,-0.145103,-1.215186,-0.123013,-0.253192,0.664482,0.286762,-0.129729,-0.098605,0.555635,0.029395,...,-0.251266,-1.369643,-0.166633,-0.330292,0.234006,0.096874,-0.087521,-0.344838,0.242198,-0.118266
4,-0.771617,-0.081211,-0.8037,-0.732927,-0.672282,-1.006099,-0.798502,-0.684484,0.737495,-0.457213,...,-0.801135,0.07923,-0.824381,-0.74183,-0.911367,-0.984612,-0.93319,-0.777604,0.555118,-0.761639


In [25]:
X_train_transformed['x28'] = X_train_transformed['x28'].apply(lambda x:'high' if x>0 else 'low' )
X_train_transformed.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29
0,-1.072001,-0.658425,-1.08808,-0.939274,-0.13594,-1.008718,-0.968359,-1.102032,0.281062,-0.113231,...,-1.034094,-0.623497,-1.070773,-0.876534,-0.169982,-1.038836,-1.078995,-1.350527,low,-0.54138
1,1.748743,0.066502,1.751157,1.745559,1.274468,0.842288,1.519852,1.994664,-0.293045,-0.32018,...,1.228342,-0.092833,1.187467,1.104386,1.517001,0.249655,1.178594,1.549916,high,-0.173739
2,-0.974734,-0.931124,-0.997709,-0.867589,-0.613515,-1.138154,-1.092292,-1.243358,0.434395,-0.429247,...,-0.973231,-1.036772,-1.008044,-0.834168,-1.097823,-1.16726,-1.282241,-1.707442,low,-1.213033
3,-0.145103,-1.215186,-0.123013,-0.253192,0.664482,0.286762,-0.129729,-0.098605,0.555635,0.029395,...,-0.251266,-1.369643,-0.166633,-0.330292,0.234006,0.096874,-0.087521,-0.344838,high,-0.118266
4,-0.771617,-0.081211,-0.8037,-0.732927,-0.672282,-1.006099,-0.798502,-0.684484,0.737495,-0.457213,...,-0.801135,0.07923,-0.824381,-0.74183,-0.911367,-0.984612,-0.93319,-0.777604,high,-0.761639


In [26]:
X_train_transformed['x29'] = X_train_transformed['x29'].apply(lambda x:'yes' if x>0 else 'no' )
X_train_transformed.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29
0,-1.072001,-0.658425,-1.08808,-0.939274,-0.13594,-1.008718,-0.968359,-1.102032,0.281062,-0.113231,...,-1.034094,-0.623497,-1.070773,-0.876534,-0.169982,-1.038836,-1.078995,-1.350527,low,no
1,1.748743,0.066502,1.751157,1.745559,1.274468,0.842288,1.519852,1.994664,-0.293045,-0.32018,...,1.228342,-0.092833,1.187467,1.104386,1.517001,0.249655,1.178594,1.549916,high,no
2,-0.974734,-0.931124,-0.997709,-0.867589,-0.613515,-1.138154,-1.092292,-1.243358,0.434395,-0.429247,...,-0.973231,-1.036772,-1.008044,-0.834168,-1.097823,-1.16726,-1.282241,-1.707442,low,no
3,-0.145103,-1.215186,-0.123013,-0.253192,0.664482,0.286762,-0.129729,-0.098605,0.555635,0.029395,...,-0.251266,-1.369643,-0.166633,-0.330292,0.234006,0.096874,-0.087521,-0.344838,high,no
4,-0.771617,-0.081211,-0.8037,-0.732927,-0.672282,-1.006099,-0.798502,-0.684484,0.737495,-0.457213,...,-0.801135,0.07923,-0.824381,-0.74183,-0.911367,-0.984612,-0.93319,-0.777604,high,no


In [27]:
from imblearn.over_sampling import SMOTENC

In [28]:
smotenc = SMOTENC(categorical_features = ['x28', 'x29'])
smotenc

In [29]:
X_train_res, y_train_res = smotenc.fit_resample(X_train_transformed, y_train)
X_train_res.shape, y_train_res.shape

((570, 30), (570,))

In [30]:
X_train_res.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29
0,-1.072001,-0.658425,-1.08808,-0.939274,-0.13594,-1.008718,-0.968359,-1.102032,0.281062,-0.113231,...,-1.034094,-0.623497,-1.070773,-0.876534,-0.169982,-1.038836,-1.078995,-1.350527,low,no
1,1.748743,0.066502,1.751157,1.745559,1.274468,0.842288,1.519852,1.994664,-0.293045,-0.32018,...,1.228342,-0.092833,1.187467,1.104386,1.517001,0.249655,1.178594,1.549916,high,no
2,-0.974734,-0.931124,-0.997709,-0.867589,-0.613515,-1.138154,-1.092292,-1.243358,0.434395,-0.429247,...,-0.973231,-1.036772,-1.008044,-0.834168,-1.097823,-1.16726,-1.282241,-1.707442,low,no
3,-0.145103,-1.215186,-0.123013,-0.253192,0.664482,0.286762,-0.129729,-0.098605,0.555635,0.029395,...,-0.251266,-1.369643,-0.166633,-0.330292,0.234006,0.096874,-0.087521,-0.344838,high,no
4,-0.771617,-0.081211,-0.8037,-0.732927,-0.672282,-1.006099,-0.798502,-0.684484,0.737495,-0.457213,...,-0.801135,0.07923,-0.824381,-0.74183,-0.911367,-0.984612,-0.93319,-0.777604,high,no


In [31]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
columns = [
   "buying", "maint", "doors",
   "persons", "lug_boot", "safety", "target"
]
df = pd.read_csv(url, names=columns)
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,target
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [32]:
df.shape

(1728, 7)

In [33]:
df['target'].value_counts()

target
unacc    1210
acc       384
good       69
vgood      65
Name: count, dtype: int64

In [34]:
from imblearn.over_sampling import SMOTEN


In [35]:
smoten = SMOTEN(sampling_strategy={'unacc': 1210, 'acc': 605, 'good': 300, 'vgood': 300})
smoten

In [36]:
X = df.drop('target', axis = 1)
y = df['target']

In [37]:
X_res, y_res = smoten.fit_resample(X, y)
X_res.shape, y_res.shape

((2415, 6), (2415,))

In [38]:
y_res.value_counts()

target
unacc    1210
acc       605
vgood     300
good      300
Name: count, dtype: int64