An imbalanced dataset is one where some classes have far fewer samples than others.
Examples:

Fraud detection → 99% Normal, 1% Fraud

Disease prediction → 95% Healthy, 5% Disease

Why is it a problem?

Model becomes biased toward majority class

High accuracy, but poor recall for minority class

Minority class (often important) is ignored

Example:
If a model always predicts “Normal”, it reaches 99% accuracy but detects 0% of the fraud cases.
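
The arithmetic behind this example can be checked with a few lines of plain Python. The labels below are hypothetical and only mirror the 99% / 1% split from the fraud example.

```python
# Hypothetical labels: 1 = fraud, 0 = normal, at a 99% / 1% split.
y_true = [0] * 990 + [1] * 10

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for fraud: detected fraud cases / all actual fraud cases.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
fraud_recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}, fraud recall = {fraud_recall:.2f}")
# accuracy = 0.99, fraud recall = 0.00
```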

Techniques to Handle Imbalanced Data
1. Resampling Techniques
(a) Oversampling (Increase minority class)

Duplicate or synthetically create minority samples

Methods:

Random Oversampling

SMOTE (Synthetic Minority Oversampling Technique) ⭐

Pros:
✔ No data loss
Cons:
✘ Overfitting risk (duplicated or near-duplicate minority samples can be memorized)
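
The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified sketch: needs at least two minority samples and uses brute-force
    Euclidean distances, which is only practical for small arrays.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                        # random minority sample
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one of its k neighbours
    gap = rng.random((n_new, 1))                                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 points with 2 features.
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))
X_synthetic = smote_sketch(X_minority, n_new=80)
print(X_synthetic.shape)  # (80, 2)
```

Because each synthetic point lies on the segment between two existing minority points, the new samples stay inside the region the minority class already occupies.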

(b) Undersampling (Reduce majority class)

Remove samples from majority class

Methods:

Random Undersampling

Tomek Links

Pros:
✔ Faster training
Cons:
✘ Loss of information
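
Random undersampling can be sketched with NumPy index selection. The data here is synthetic, and the 900 / 100 split is an assumption chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 900 majority samples (class 0) and 100 minority samples (class 1).
y = np.array([0] * 900 + [1] * 100)
X = rng.normal(size=(len(y), 2))

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Keep a random subset of the majority class, the same size as the minority class.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([kept_majority, minority_idx]))

X_under, y_under = X[keep], y[keep]
print(np.bincount(y_under))  # [100 100]
```

The 800 discarded majority rows are the "loss of information" listed under the cons.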

(c) Hybrid Sampling

Combination of SMOTE + Undersampling

2. Algorithm-Level Techniques
(a) Class Weighting

Give more importance to minority class
Used in:

Logistic Regression

SVM

Decision Trees

Random Forest
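
As a rough illustration of class weighting, the snippet below computes inverse-frequency weights with the formula `n_samples / (n_classes * count_per_class)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The label array is a made-up 900 / 100 example.

```python
import numpy as np

# Toy labels: 900 majority samples (class 0) and 100 minority samples (class 1).
y = np.array([0] * 900 + [1] * 100)

counts = np.bincount(y)                     # samples per class
weights = len(y) / (len(counts) * counts)   # inverse-frequency class weights

# After weighting, each class contributes the same total weight to the loss.
total_weight_per_class = counts * weights

print(weights)
print(total_weight_per_class)
```

Here the minority class gets a weight of 5.0 and the majority class about 0.56, so both classes carry a total weight of 500. In scikit-learn, passing `class_weight='balanced'` to an estimator such as `LogisticRegression` applies this weighting during training.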

(b) Cost-Sensitive Learning

Higher penalty for misclassifying minority class
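
One simple way to apply unequal misclassification costs is to move the decision threshold. The sketch below assumes a model that outputs a probability for the minority class; the cost values and probabilities are hypothetical. Predicting the minority class has lower expected cost whenever `p * COST_FN > (1 - p) * COST_FP`, which gives the threshold `COST_FP / (COST_FP + COST_FN)`.

```python
import numpy as np

COST_FN = 10.0  # assumed cost of missing a minority case (false negative)
COST_FP = 1.0   # assumed cost of a false alarm (false positive)

# Predict the minority class when its probability exceeds this threshold.
threshold = COST_FP / (COST_FP + COST_FN)

# Hypothetical predicted probabilities of the minority class.
probs = np.array([0.02, 0.08, 0.15, 0.40, 0.70])

pred_cost_sensitive = (probs > threshold).astype(int)
pred_default = (probs > 0.5).astype(int)

print(f"threshold = {threshold:.3f}")   # threshold = 0.091
print(pred_cost_sensitive.tolist())     # [0, 0, 1, 1, 1]
print(pred_default.tolist())            # [0, 0, 0, 0, 1]
```

With a false negative ten times as costly as a false positive, two borderline cases that the default 0.5 threshold would miss are flagged as minority.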

3. Use Proper Evaluation Metrics (Very Important)

❌ Accuracy (misleading)
✅ Better metrics:

Precision

Recall

F1-score

ROC-AUC

Confusion Matrix
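
The metrics above all derive from the four cells of a binary confusion matrix. The helper `minority_metrics` and the counts passed to it are hypothetical, chosen to show how accuracy can look strong while precision and recall on the minority class stay modest.

```python
def minority_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive (minority) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical confusion matrix: 1000 samples, 10 of them fraud.
scores = minority_metrics(tp=6, fp=14, fn=4, tn=976)

for name, value in scores.items():
    print(f"{name} = {value:.3f}")
# accuracy = 0.982
# precision = 0.300
# recall = 0.600
# f1 = 0.400
```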

4. Ensemble Techniques

Random Forest (with class_weight)

Gradient Boosting

XGBoost / LightGBM (both expose a scale_pos_weight parameter for weighting the positive class)
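
A minimal sketch of an ensemble with class weighting, assuming scikit-learn is installed. The synthetic data (a minority class shifted to a mean of 2), the 70 / 30 split, and the hyperparameters are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: 900 majority samples around 0 and 100 minority samples around 2.
X = np.vstack([rng.normal(0, 1, (900, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 900 + [1] * 100)

# Stratify so that both splits keep the 90% / 10% class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

minority_recall = recall_score(y_test, clf.predict(X_test))
print(f"minority recall = {minority_recall:.2f}")
```

Recall on the minority class, rather than accuracy, is the number to watch when comparing this against an unweighted forest.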

5. Data-Level Improvements

Collect more minority data (best solution)

Feature engineering

Remove noise

An imbalanced dataset is one where class distribution is uneven.
It causes biased models favoring the majority class.
It can be handled with resampling techniques (oversampling, undersampling, SMOTE), algorithm-level methods (class weighting, cost-sensitive learning), ensemble methods, and evaluation metrics such as precision, recall, and F1-score instead of accuracy.

In [2]:
import numpy as np 
import pandas as pd 

# Set the random seed for reproducibility

np.random.seed(123)

# Create a dataframe with two classes
n_samples=1000
class_0_ratio=0.9
n_class_0=int(n_samples*class_0_ratio)
n_class_1=n_samples-n_class_0

In [3]:
n_class_0,n_class_1

(900, 100)

In [4]:
# Create an imbalanced dataset: 900 samples of class 0 and 100 samples of class 1
class_0=pd.DataFrame({
    'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0]*n_class_0
})

# The minority class must be labelled 1; labelling it 0 leaves the dataset with a single class
class_1=pd.DataFrame({
    'feature_1':np.random.normal(loc=0,scale=1,size=n_class_1),
    'feature_2':np.random.normal(loc=0,scale=1,size=n_class_1),
    'target':[1]*n_class_1
})

In [5]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [7]:
df.tail()

     feature_1  feature_2  target
995  -0.623629   0.845701       1
996   0.239810  -1.119923       1
997  -0.868240  -0.359297       1
998   0.902006  -1.609695       1
999   0.697490   0.013570       1

In [8]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [13]:
## Upsampling: split the data by class
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [17]:
# If df_minority is empty (for example, because both classes were labelled 0),
# resample raises "ValueError: a must be greater than 0 unless no samples are taken"
print(df_minority.shape)
print(df_majority.shape)

(100, 3)
(900, 3)

In [22]:
from sklearn.utils import resample

# Sample the minority class with replacement until it matches the majority class size
df_minority_upsampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42
)

df_balanced = pd.concat([df_majority, df_minority_upsampled])

print(df_balanced['target'].value_counts().sort_index())

target
0    900
1    900
Name: count, dtype: int64