# Handling Imbalanced Datasets in Machine Learning

A dataset is imbalanced when one class has significantly more samples than the others. This imbalance can bias machine learning models toward the majority class, because predicting the majority label for every sample already yields high accuracy. Below are four common methods for handling imbalanced data:

---

## Method 1: Undersampling

### Overview
- **Definition:** Reduces the size of the majority class by randomly removing samples.
- **Advantage:** Simplifies the dataset and speeds up training.
- **Disadvantage:** May discard informative majority-class samples that the model could have learned from.

### How It Works
- Randomly select a subset of the majority class so that its size is closer to that of the minority class; the minority class is left untouched.
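
To make the mechanics concrete, here is a minimal sketch of random undersampling using only the Python standard library. The helper name `random_undersample` and the toy data are assumptions made for illustration; the sketch shrinks every class to the size of the smallest one, which is one possible balancing target and not necessarily what a given library does by default.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())  # size of the smallest class
    kept = []
    for label in counts:
        members = [i for i, lab in enumerate(y) if lab == label]
        kept.extend(rng.sample(members, target))  # sample without replacement
    kept.sort()  # keep the original ordering of the surviving samples
    return [X[i] for i in kept], [y[i] for i in kept]

# Hypothetical toy data: 90 majority samples (label 0), 10 minority samples (label 1)
X_toy = [[float(i)] for i in range(100)]
y_toy = [0] * 90 + [1] * 10

X_small, y_small = random_undersample(X_toy, y_toy)
print(Counter(y_small))  # Counter({0: 10, 1: 10})
```

All ten minority samples survive, while 80 of the 90 majority samples are discarded, which is exactly where the information loss comes from.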

### Python Example

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)
print("Original class distribution:", Counter(y))

# Apply random undersampling
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("Resampled class distribution:", Counter(y_res))
```

---

## Method 2: Oversampling

### Overview
- **Definition:** Increases the number of samples in the minority class by randomly duplicating them.
- **Advantage:** Retains all information from the majority class.
- **Disadvantage:** Can lead to overfitting since samples are repeated.

### How It Works
- Duplicate samples from the minority class until the classes are more balanced.
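
The duplication step can be sketched with the standard library alone. The helper name `random_oversample` and the toy data are assumptions for illustration; the sketch grows every class to the size of the largest one by sampling existing rows with replacement.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [i for i, lab in enumerate(y) if lab == label]
        extra = rng.choices(members, k=target - n)  # sample with replacement
        X_out.extend(X[i] for i in extra)
        y_out.extend(label for _ in extra)
    return X_out, y_out

# Hypothetical toy data: 90 majority samples (label 0), 10 minority samples (label 1)
X_toy = [[float(i)] for i in range(100)]
y_toy = [0] * 90 + [1] * 10

X_big, y_big = random_oversample(X_toy, y_toy)
print(Counter(y_big))  # Counter({0: 90, 1: 90})
```

Every added row is an exact copy of one of the ten original minority samples, which is why a flexible model can end up memorizing them.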

### Python Example

```python
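from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Recreate the imbalanced dataset from Method 1 so this block runs on its own
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)

# Apply random oversampling
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print("Resampled class distribution (oversampling):", Counter(y_res))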
```

---

## Method 3: SMOTE (Synthetic Minority Over-sampling Technique)

### Overview
- **Definition:** Generates synthetic samples for the minority class instead of duplicating existing ones.
- **Advantage:** Introduces new, slightly varied samples which can help reduce overfitting.
- **Disadvantage:** Can generate noisy or overlapping samples when minority points lie close to the majority class or the number of neighbors is poorly chosen.

### How It Works
For each minority sample $x_i$, SMOTE:

1. Finds its $k$ nearest neighbors within the minority class.
2. Randomly selects one of those neighbors, $x_{zi}$.
3. Generates a synthetic sample:

   $$
   x_{\text{new}} = x_i + \delta \times (x_{zi} - x_i)
   $$

   where $\delta$ is a random number between 0 and 1.
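
The interpolation step can be sketched in plain Python. The function name `smote_sketch`, its parameters, and the toy points are assumptions for illustration; this is a simplified version of the idea (one synthetic point per minority sample, brute-force neighbor search), not the implementation used by any library.

```python
import math
import random

def smote_sketch(minority, k=3, seed=0):
    """Create one synthetic point per minority sample by interpolating
    toward a randomly chosen one of its k nearest minority neighbors.
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for i, x_i in enumerate(minority):
        # Step 1: k nearest neighbors of x_i among the other minority samples
        others = [x for j, x in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
        # Step 2: pick one neighbor at random
        x_zi = rng.choice(neighbors)
        # Step 3: x_new = x_i + delta * (x_zi - x_i), with delta in [0, 1)
        delta = rng.random()
        synthetic.append([a + delta * (b - a) for a, b in zip(x_i, x_zi)])
    return synthetic

# Hypothetical minority-class points in two dimensions
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [3.0, 3.0], [4.0, 3.0]]
synthetic = smote_sketch(minority)
for point in synthetic:
    print(point)
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbors, so the new samples stay inside the region already occupied by the minority class.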

### Python Example

```python
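from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Recreate the imbalanced dataset from Method 1 so this block runs on its own
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)

# Apply SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("Resampled class distribution (SMOTE):", Counter(y_res))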
```

---

## Method 4: Ensemble Methods with Undersampling

### Overview
- **Definition:** Combines the idea of undersampling with ensemble learning to reduce information loss.
- **Advantage:** Mitigates the downsides of undersampling by training multiple models on different undersampled subsets.
- **Disadvantage:** More computationally expensive since it involves training multiple models.

### How It Works
1. **Create Multiple Subsets:** Randomly undersample the majority class multiple times to create several balanced subsets.
2. **Train Base Models:** Train a separate model on each balanced subset.
3. **Aggregate Predictions:** Combine predictions from all base models (e.g., using voting or averaging) to make a final decision.
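
The three steps above can be sketched without any external library. Everything here is an assumption made for illustration: the helper names, the toy one-dimensional data, and the use of a nearest-centroid rule as the base model (any classifier could take its place).

```python
import random
from collections import Counter

def balanced_subset(X, y, rng):
    """Step 1: undersample every class to the size of the smallest class."""
    counts = Counter(y)
    target = min(counts.values())
    idx = []
    for label in counts:
        members = [i for i, lab in enumerate(y) if lab == label]
        idx.extend(rng.sample(members, target))
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_centroids(X, y):
    """Step 2 (toy base model): store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_one(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

def fit_ensemble(X, y, n_models=10, seed=0):
    """Train one base model per independently drawn balanced subset."""
    rng = random.Random(seed)
    return [fit_centroids(*balanced_subset(X, y, rng)) for _ in range(n_models)]

def predict_ensemble(models, x):
    """Step 3: aggregate the base models' predictions by majority vote."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: majority values in [0, 8.9], minority values in [20, 29]
X_toy = [[i / 10] for i in range(90)] + [[20.0 + i] for i in range(10)]
y_toy = [0] * 90 + [1] * 10

models = fit_ensemble(X_toy, y_toy)
print(predict_ensemble(models, [1.0]), predict_ensemble(models, [25.0]))  # 0 1
```

Because each base model sees a different random slice of the majority class, the ensemble as a whole uses far more of the majority data than a single undersampled model would.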

### Python Example: Balanced Bagging Classifier

```python
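from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the imbalanced dataset from Method 1 and hold out a stratified test set
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Define the base estimator and the ensemble classifier.
# Recent imbalanced-learn releases name this argument `estimator`;
# older releases used `base_estimator`.
ensemble = BalancedBaggingClassifier(estimator=DecisionTreeClassifier(),
                                     n_estimators=10,
                                     random_state=42)

# Train the ensemble model on the training split only
ensemble.fit(X_train, y_train)

# Evaluate on held-out data; scoring on the training data would overstate performance
y_pred = ensemble.predict(X_test)
print(classification_report(y_test, y_pred))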
```

---

## Conclusion

When working with imbalanced data, there is no one-size-fits-all solution. The choice of method depends on the dataset size, the importance of preserving the majority class data, and the risk of overfitting. Here’s a quick summary:
- **Undersampling:** Reduces data size by removing majority samples.
- **Oversampling:** Balances the dataset by duplicating minority samples.
- **SMOTE:** Generates new, synthetic minority samples to enhance diversity.
- **Ensemble with Undersampling:** Combines the robustness of ensembles with multiple undersampled subsets for improved performance.
---

## References

- Resampling strategies for imbalanced datasets: https://www.kaggle.com/code/rafjaa/resampling-strategies-for-imbalanced-datasets
- Kaggle Notebook: https://www.kaggle.com/kabure/credit-card-fraud-prediction-rf-smote
- SMOTE on Quora Dataset: https://www.kaggle.com/theoviel/dealing-with-class-imbalance-with-smote