Imbalanced data occurs when the classes in a classification problem are not evenly distributed. This can lead to a biased model that favors the majority class.
Example: In a fraud detection dataset, 98% of transactions may be non-fraudulent, while only 2% are fraudulent. A naive model that predicts "non-fraud" for every case would still be 98% accurate, yet it would not detect a single fraudulent transaction.
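
The arithmetic behind this example can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels (100 transactions, 2 of them fraudulent); the variable names are illustrative only.

```python
# Hypothetical labels: 1 = fraud, 0 = non-fraud (98 legitimate, 2 fraudulent).
y_true = [0] * 98 + [1] * 2
# A naive "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: detected frauds / all actual frauds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.2f}")    # Accuracy: 0.98
print(f"Fraud recall: {recall:.2f}")  # Fraud recall: 0.00
```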

Remember that the test data must remain representative of the real-world class distribution. Apply resampling techniques to the training data only: you want to evaluate your model's performance on data that is as close to the real distribution as possible, and resampling the test data would distort that evaluation.
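
The order of operations matters here: split first, then resample the training portion only. The following is a rough from-scratch sketch on a made-up dataset; `stratified_split` and `random_oversample` are hypothetical helpers written for illustration, not library functions.

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical dataset: (feature, label) rows with 20 positives out of 1000.
data = [(random.random(), 1 if i < 20 else 0) for i in range(1000)]

def group_by_label(rows):
    """Group rows into a dict keyed by class label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)
    return groups

def stratified_split(rows, test_fraction):
    """Split rows so each class keeps the same share in train and test."""
    train, test = [], []
    for members in group_by_label(rows).values():
        random.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def random_oversample(rows):
    """Duplicate rows at random until every class matches the largest one."""
    groups = group_by_label(rows)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        extra = random.choices(members, k=target - len(members))
        balanced.extend(members + extra)
    return balanced

train, test = stratified_split(data, test_fraction=0.2)
train_balanced = random_oversample(train)  # only the training split is resampled

train_counts = Counter(label for _, label in train_balanced)
test_counts = Counter(label for _, label in test)
print("Train after oversampling:", dict(train_counts))
print("Test (untouched):", dict(test_counts))
```

The test split keeps the original 2% positive rate, so metrics computed on it reflect the distribution the model will meet in practice.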

### Resampling Methods
- Oversampling (SMOTE, ADASYN, Random Oversampling). Use when: The dataset is small, and you need more instances of the minority class. How it works: Duplicates or synthetically generates new examples for the minority class.

- Undersampling (Random Undersampling, Tomek Links, Edited Nearest Neighbors). Use when: You have a very large dataset, and keeping all majority class examples is unnecessary. How it works: Removes instances from the majority class to balance the dataset.

### Algorithmic Techniques
- Class Weighting (supported by several ML models, e.g., Random Forest, SVM, Logistic Regression). Use when: The dataset is imbalanced, but you don’t want to modify the data. How it works: Assigns a higher weight to the minority class, making its misclassifications costlier.

- Anomaly Detection Methods (One-Class SVM, Isolation Forest). Use when: The minority class represents anomalies (e.g., fraud detection). How it works: Models learn normal behavior and flag deviations as anomalies.
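
To make the class-weighting idea concrete, here is a small sketch that computes inverse-frequency weights with the formula `n_samples / (n_classes * count)`, the heuristic scikit-learn documents for `class_weight="balanced"`. The labels and the `weighted_errors` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical training labels: 980 majority (0) and 20 minority (1) samples.
labels = [0] * 980 + [1] * 20

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Inverse-frequency weights: weight_c = n_samples / (n_classes * count_c).
class_weight = {label: n_samples / (n_classes * count)
                for label, count in counts.items()}
print(class_weight)

def weighted_errors(y_true, y_pred, weights):
    """Sum the weight of the true class over every misclassified sample."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)

# Predicting the majority class everywhere now carries a large penalty:
# 20 missed minority samples, each weighted 25.0.
always_majority = [0] * n_samples
penalty = weighted_errors(labels, always_majority, class_weight)
print(penalty)  # 500.0
```

With uniform weights the same predictions would cost only 20 errors out of 1000, which is why an unweighted model can get away with ignoring the minority class.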

### Evaluation Metrics for Imbalanced Data
Accuracy alone is misleading on imbalanced data. Instead, use:
- Precision, Recall, and F1-score
- ROC-AUC and PR-AUC (area under the Precision-Recall curve)
- Confusion Matrix
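
As a quick illustration of how these metrics relate, the sketch below derives precision, recall, and F1 from the four confusion-matrix counts. The predictions are invented for the example (100 transactions, 5 frauds).

```python
# Hypothetical results on 100 transactions (1 = fraud, 0 = non-fraud).
y_true = [1] * 5 + [0] * 95
# The model catches 3 of the 5 frauds and raises 1 false alarm.
y_pred = [1, 1, 1, 0, 0] + [1] + [0] * 94

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # frauds correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed frauds
tn = sum(t == 0 and p == 0 for t, p in pairs)  # legitimate, not flagged

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Confusion matrix [[TN, FP], [FN, TP]]:", [[tn, fp], [fn, tp]])
print(f"Accuracy:  {accuracy:.2f}")   # 0.97
print(f"Precision: {precision:.2f}")  # 0.75
print(f"Recall:    {recall:.2f}")     # 0.60
print(f"F1-score:  {f1:.2f}")         # 0.67
```

Accuracy looks excellent at 0.97, while recall shows that two of every five frauds are still missed.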

### Resampling Methods

In [None]:
# Random Oversampling (increasing the minority class): randomly duplicates instances of the minority class to balance the dataset.
# Best for small datasets where adding exact duplicates is acceptable.

from imblearn.over_sampling import RandomOverSampler
import numpy as np

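# Apply Random Oversampling to the training split only
# (X_train and y_train are assumed to come from a train/test split that has already been run)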
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

# Check class distribution
print("Class distribution after Random Oversampling:", np.bincount(y_resampled))



In [None]:
# SMOTE (Synthetic Minority Oversampling Technique)
# Generates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority-class neighbors.
from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Check class distribution
print("Class distribution after SMOTE:", np.bincount(y_resampled))
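
The interpolation step at the heart of SMOTE can be sketched in plain Python. This is a simplified illustration on made-up 2-D points, not the imbalanced-learn implementation; `nearest_neighbors` and `smote_like_sample` are hypothetical helpers.

```python
import math
import random

random.seed(0)

# Hypothetical 2-D minority-class points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]

def nearest_neighbors(point, points, k):
    """Return the k points closest to `point`, excluding the point itself."""
    others = [p for p in points if p is not point]
    return sorted(others, key=lambda p: math.dist(point, p))[:k]

def smote_like_sample(points, k=3):
    """Create one synthetic point on the segment between a point and a neighbor."""
    base = random.choice(points)
    neighbor = random.choice(nearest_neighbors(base, points, k))
    gap = random.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

synthetic = [smote_like_sample(minority) for _ in range(5)]
for point in synthetic:
    print(point)
```

Because every synthetic point lies on a line segment between two real minority points, the new samples stay inside the region the minority class already occupies instead of being exact duplicates.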


In [None]:
# ADASYN (Adaptive Synthetic Sampling)
# Similar to SMOTE, but generates more synthetic samples for minority points that are harder to learn (those with more majority-class neighbors).
from imblearn.over_sampling import ADASYN

# Apply ADASYN
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X_train, y_train)

# Check class distribution
print("Class distribution after ADASYN:", np.bincount(y_resampled))


In [None]:
# Random Undersampling (reducing the majority class)
# Randomly removes instances of the majority class to balance the dataset.

from imblearn.under_sampling import RandomUnderSampler

# Apply Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# Check class distribution
print("Class distribution after Random Undersampling:", np.bincount(y_resampled))


In [None]:
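# Tomek Links (undersampling by removing borderline pairs)
# A Tomek link is a pair of samples from different classes that are each other's nearest neighbors.
# Removing the majority-class sample of each pair reduces noise and makes class separation clearer.
# Note: this cleans the class boundary but does not necessarily balance the classes.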

from imblearn.under_sampling import TomekLinks

# Apply Tomek Links
tomek = TomekLinks()
X_resampled, y_resampled = tomek.fit_resample(X_train, y_train)

# Check class distribution
print("Class distribution after Tomek Links:", np.bincount(y_resampled))
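
To show what a Tomek link is, the sketch below finds mutual nearest-neighbor pairs with different labels and drops the majority-class member of each pair. It runs on a handful of made-up points and is an illustration of the definition only, not the library's implementation; `nearest_index` and `tomek_links` are hypothetical helpers.

```python
import math

# Hypothetical labelled points: 0 = majority, 1 = minority.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (2.0, 2.0), (2.1, 2.2)]
labels = [0, 0, 0, 1, 1, 1]

def nearest_index(i, pts):
    """Index of the point closest to pts[i], excluding i itself."""
    candidates = (j for j in range(len(pts)) if j != i)
    return min(candidates, key=lambda j: math.dist(pts[i], pts[j]))

def tomek_links(pts, lbls):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(pts)):
        j = nearest_index(i, pts)
        if i < j and nearest_index(j, pts) == i and lbls[i] != lbls[j]:
            links.append((i, j))
    return links

links = tomek_links(points, labels)
# Drop only the majority-class member of each link.
to_remove = {i if labels[i] == 0 else j for i, j in links}
cleaned = [(pt, lab) for idx, (pt, lab) in enumerate(zip(points, labels))
           if idx not in to_remove]

print("Tomek links:", links)            # Tomek links: [(2, 3)]
print("Samples kept:", len(cleaned))    # Samples kept: 5
```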


In [None]:
# Edited Nearest Neighbors (ENN)
# ENN examines each majority-class sample and removes those whose nearest neighbors belong to the minority class.
# When to use: the majority class contains noisy samples that could confuse the learning process.
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.model_selection import train_test_split

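# Split into training and testing datasets (X and y are assumed to be defined earlier).
# This split must run before any cell that uses X_train and y_train.
# stratify=y keeps the class proportions the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)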

# Apply Edited Nearest Neighbors (ENN)
enn = EditedNearestNeighbours()
X_resampled, y_resampled = enn.fit_resample(X_train, y_train)

# Check the class distribution after ENN
print("Class distribution after ENN:", np.bincount(y_resampled))
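
The editing rule can be sketched from scratch as follows. This version uses a simple majority vote among the k nearest neighbors, which is an assumption made for illustration and may differ from the library's default selection rule; the points and the `enn_keep_mask` helper are hypothetical.

```python
import math

# Hypothetical labelled points: 0 = majority, 1 = minority.
# The last majority point sits in the middle of the minority cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.05, 1.05)]
labels = [0, 0, 0, 0, 1, 1, 1, 0]

def k_nearest(i, pts, k):
    """Indices of the k points closest to pts[i], excluding i itself."""
    others = [j for j in range(len(pts)) if j != i]
    return sorted(others, key=lambda j: math.dist(pts[i], pts[j]))[:k]

def enn_keep_mask(pts, lbls, k=3, majority_label=0):
    """Keep a majority sample only if most of its k neighbors share its label."""
    keep = []
    for i, label in enumerate(lbls):
        if label != majority_label:
            keep.append(True)  # minority samples are never removed here
            continue
        agreeing = sum(lbls[j] == label for j in k_nearest(i, pts, k))
        keep.append(agreeing > k / 2)
    return keep

mask = enn_keep_mask(points, labels)
kept_labels = [lab for lab, keep in zip(labels, mask) if keep]
print(mask)
print(kept_labels)
```

Only the stray majority point inside the minority cluster is dropped; the well-separated majority samples are left alone, so ENN cleans noise without forcing the classes to equal size.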



In [None]:
# SMOTE + Tomek Links (hybrid approach)
# Combines SMOTE (oversampling) with Tomek Links (undersampling). Best for reducing noise and ensuring a well-balanced dataset.

from imblearn.combine import SMOTETomek

# Apply SMOTE + Tomek Links
smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X_train, y_train)

# Check class distribution
print("Class distribution after SMOTE + Tomek Links:", np.bincount(y_resampled))
