# 📘 Outlier Detection & Handling Imbalanced Data: Notes

---

## 🧠 Outlier Detection

**Outliers** are data points that significantly deviate from the majority of data points.  
They may arise due to:
- Noise or measurement errors
- Rare events
- Data generation issues

### 🔹 Why detect outliers?
- Outliers can influence the model heavily
- They distort regression planes / decision boundaries
- They may reduce model performance

➡️ Hence, detect and handle outliers during preprocessing.

---

## ⚙️ Popular Outlier Detection Algorithms

Libraries:
- **pyod** → KNN and many advanced algorithms  
- **sklearn** → IsolationForest, LOF  

---

### 1️⃣ Isolation Forest

📦 `sklearn.ensemble.IsolationForest`

#### ✅ Main Advantages
- Very fast and scalable
- No feature scaling needed
- Works well for high-dimensional data
- Unsupervised method

#### 🔹 How it works
- Builds an ensemble of random isolation trees (scikit-learn implements them with extremely randomized trees)
- At each node, randomly selects a feature and a split value
- Outliers are isolated at early levels (shorter path length)
- Normal points require more splits

#### 🔹 Anomaly Score
- Based on the average path length across the trees
- Shorter path → higher anomaly score → outlier

#### 🔹 Contamination parameter
- Defines the expected fraction of outliers (e.g., 0.05)
- The points with the highest anomaly scores, up to that fraction, are marked as outliers
- If not given, scikit-learn uses `contamination='auto'`, which applies a fixed score threshold; the flagged fraction then depends on the data (it may come out at 10%, 20%, ...)

➡️ Usually we set contamination explicitly.
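A minimal sketch of the points above, assuming scikit-learn and NumPy are installed. The toy data (190 Gaussian inliers plus 10 far-away points) and all variable names are made up for illustration. Note that scikit-learn's `score_samples` uses the opposite sign convention to the notes: lower values mean more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical toy data: a Gaussian blob plus a few scattered, far-away points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
signs = rng.choice([-1.0, 1.0], size=(10, 2))
outliers = signs * rng.uniform(6.0, 10.0, size=(10, 2))
X = np.vstack([inliers, outliers])

# contamination=0.05 asks the model to flag roughly 5% of the points.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = iso.fit_predict(X)    # +1 = inlier, -1 = outlier
scores = iso.score_samples(X)  # lower = more anomalous (sklearn convention)

n_flagged = int((labels == -1).sum())
print("flagged:", n_flagged, "of", len(X))
```

Changing `contamination` only moves the cut-off on the scores; the scores themselves stay the same.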

---

### 2️⃣ KNN-based Outlier Detection

📦 Available in **pyod**

#### 🔹 Idea
For each data point:
1. Find its k nearest neighbors (assume k = 5)
2. Compute the average distance to them:

d_avg = (d_nn1 + d_nn2 + d_nn3 + d_nn4 + d_nn5) / 5

- If d_avg is high → the point is far from its neighbors → outlier
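The two steps above can be sketched directly in NumPy. This is a brute-force illustration of the idea, not pyod's implementation; the function name `knn_avg_distance` and the toy data are hypothetical. As far as I know, pyod's `KNN` detector offers a comparable mean-distance variant through its `method` parameter, but treat that as an assumption and check the pyod docs.

```python
import numpy as np

def knn_avg_distance(X, k=5):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    # Pairwise distance matrix, shape (n, n): this is the expensive O(n^2) part.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbor
    nearest = np.sort(dists, axis=1)[:, :k]  # k smallest distances per row
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
# 50 points around the origin plus one point placed far away at (10, 10).
X = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])

d_avg = knn_avg_distance(X, k=5)
most_outlying = int(np.argmax(d_avg))
print("index with the largest d_avg:", most_outlying)
```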

#### ❌ Disadvantage
- Computationally very expensive on large datasets
- A brute-force search computes distances between all pairs of points, roughly O(n²) work

---

### 3️⃣ LOF – Local Outlier Factor

📦 `sklearn.neighbors.LocalOutlierFactor`

#### 🔹 Idea
- Calculates the local density of each observation
- Compares that density with the densities of its neighbors

➡️ If a point's density is much lower than its neighbors' → it is considered an outlier

#### ✅ Strength
- Detects local outliers: points that are anomalous relative to a nearby cluster even when they are not far away globally
- Also works for global outliers
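A small sketch of the "local" aspect, assuming scikit-learn is available. The data is invented: one tight cluster, one loose cluster, and a single point that sits just outside the tight cluster. Its distance would look normal inside the loose cluster, but it is unusual relative to its dense neighbors.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(100, 2))   # tight cluster
sparse = rng.normal(loc=5.0, scale=1.5, size=(100, 2))  # loose cluster
local_outlier = np.array([[0.0, 1.0]])                  # near the tight cluster, but outside it
X = np.vstack([dense, sparse, local_outlier])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # +1 = inlier, -1 = outlier
lof_scores = -lof.negative_outlier_factor_   # about 1 = similar density to neighbors

print("label of the planted point:", labels[-1])
print("LOF of the planted point:", round(float(lof_scores[-1]), 2))
```

`n_neighbors` plays the role of k here, which is why the result is sensitive to that choice.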

---

## 📊 Summary: Outlier Methods

| Algorithm | Key Idea | Pros | Cons |
|-----------|----------|------|------|
| Isolation Forest | Early isolation using trees | Fast, scalable, no scaling | Needs contamination |
| KNN | Avg neighbor distance | Simple | Very slow |
| LOF | Local density | Finds local anomalies | Sensitive to k |

---

## ⚖️ Imbalanced Data

### 🔹 What is Imbalanced Data?
When the categories of the target variable are not equally represented.

Example:
- Class A = 95% → Majority
- Class B = 5% → Minority

➡️ This dataset is highly imbalanced.

---

### 🔹 Why is it important?

Most ML algorithms implicitly assume:
> All classes are represented roughly equally.

But in imbalanced data:
- The model becomes biased towards the majority class
- It sees majority patterns far more often during training
- It performs poorly on the minority class

---

### 🔹 Effects of Imbalanced Data
- High accuracy but poor minority class recall
- Model predicts majority class most of the time

Example:
If 95% = A and the model always predicts A → 95% accuracy, yet it never detects B, so it is useless.
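The example can be checked with a few lines of plain Python. The labels are synthetic and match the 95/5 split used in these notes.

```python
# 95 samples of class A, 5 of class B, and a "model" that always answers A.
y_true = ["A"] * 95 + ["B"] * 5
y_pred = ["A"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for B: how many of the true B samples were predicted as B.
true_pos_b = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
recall_b = true_pos_b / y_true.count("B")

print(accuracy, recall_b)  # 0.95 0.0
```

This is why recall (or precision, F1) on the minority class is more informative than accuracy here.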

---

### 🔹 When imbalance may not affect much (T&C)
1. The dataset is large enough that the minority class still has plenty of samples  
2. The classes are linearly separable  

⚠️ These cases are rare in practice.

---

### 🔍 How to detect imbalance?

Using pandas:
```python
y.value_counts()
```

### Imbalance Ratio (IR)
IR = (# minority samples) / (# majority samples)

- Balanced → IR = 1
- Highly imbalanced → IR close to 0
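A short sketch of computing IR with pandas, assuming a two-class target stored in a Series called `y` (the data here is made up to match the 95/5 example).

```python
import pandas as pd

# Hypothetical target column: 95 samples of A, 5 of B.
y = pd.Series(["A"] * 95 + ["B"] * 5, name="target")

counts = y.value_counts()                 # class counts, largest first
shares = y.value_counts(normalize=True)   # same, as fractions of the total

# IR as defined in these notes: minority count divided by majority count.
imbalance_ratio = counts.min() / counts.max()
print(counts.to_dict(), round(imbalance_ratio, 4))
```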

### How to Handle Imbalanced Data?

Three main approaches:

1. Data-level methods
2. Cost-sensitive learning
3. Ensemble-based methods

## 🛠️ Data-Level Approach

### 🔹 Oversampling (Increase minority class)

**Goal:** Make minority class size ≈ majority class size  

📦 **Library:** `imblearn` (installed as `imbalanced-learn`)

---

### (a) Random Over Sampling (ROS)
- Repeats minority class samples randomly  
- Sends the same patterns to the model again  

❌ **Disadvantages:**
- Overfitting / memorization  
- Increased training time  
- No new information added  
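To make "repeats minority class samples" concrete, here is a hand-rolled NumPy sketch. The helper `random_oversample` is hypothetical and only illustrates the idea; in practice `imblearn.over_sampling.RandomOverSampler` covers this.

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate rows of the smaller classes (with replacement) up to the largest class size."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - count, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))
```

Every added row is an exact copy of an existing minority row, which is where the overfitting risk comes from.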

---

### (b) SMOTE – Synthetic Minority Oversampling Technique
- Creates new synthetic samples  
- Interpolates between a minority sample and its nearest minority class neighbors  

**Variants:**
- SMOTE-N (for purely categorical features)  
- SMOTE-NC (for a mix of categorical and continuous features)  
- Borderline-SMOTE  
- KMeans-SMOTE  
- SVM-SMOTE  

✅ Adds new synthetic points instead of exact duplicates  
❌ May create noisy samples near decision boundaries  
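The interpolation step can be sketched as follows. This is a simplified illustration for continuous features only, with a hypothetical helper `smote_sketch`. It is not imblearn's implementation; for real use, `imblearn.over_sampling.SMOTE` with `fit_resample(X, y)` is the usual route.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, random_state=0):
    """Create n_new synthetic points on segments between minority points and their neighbors."""
    rng = np.random.default_rng(random_state)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]   # k nearest minority neighbors per point
    base = rng.integers(0, n, size=n_new)          # pick a minority point at random
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # pick one of its neighbors
    gap = rng.random((n_new, 1))                   # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic.round(2))
```

Because every synthetic point lies on a segment between two minority points, it can land in majority territory when the classes overlap, which is the noise issue mentioned above.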

---

### (c) ADASYN – Adaptive Synthetic Sampling
- Generates more synthetic samples in difficult regions, where minority points have many majority class neighbors  
- Focuses on hard-to-learn minority points  
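A rough sketch of how "difficulty" can decide where samples go: for each minority point, take the share of majority points among its k nearest neighbors and give it a proportional part of the synthetic budget. The helper `adasyn_allocation` and the 1-D toy data are hypothetical, and the generation step itself (interpolation, as in SMOTE) is left out. `imblearn.over_sampling.ADASYN` is the library version.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label, n_new, k=5):
    """How many synthetic samples each minority point should receive."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    min_idx = np.flatnonzero(y == minority_label)
    neighbors = np.argsort(dists[min_idx], axis=1)[:, :k]
    # r_i: share of majority points among the k nearest neighbors of minority point i
    r = (y[neighbors] != minority_label).mean(axis=1)
    if r.sum() == 0:
        weights = np.full(len(min_idx), 1.0 / len(min_idx))  # no hard points: spread evenly
    else:
        weights = r / r.sum()
    # Rounded per-point counts; they may not add up to n_new exactly.
    return np.rint(weights * n_new).astype(int)

# Majority points on [0, 0.9]; one minority point inside them, four far away.
majority = np.arange(10).reshape(-1, 1) / 10.0
minority = np.array([[0.45], [5.0], [5.1], [5.2], [5.3]])
X = np.vstack([majority, minority])
y = np.array([0] * 10 + [1] * 5)

allocation = adasyn_allocation(X, y, minority_label=1, n_new=6, k=3)
print(allocation)
```

In this toy case the whole budget goes to the minority point surrounded by majority samples, and the well-separated minority points get none.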

---

### 🔹 Undersampling
- Removes samples from the majority class  

❌ **Risk:** Loss of useful information  
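A minimal NumPy sketch of random undersampling, using a hypothetical helper `random_undersample`. As far as I know imblearn provides `RandomUnderSampler` in `imblearn.under_sampling` for the same purpose.

```python
import numpy as np

def random_undersample(X, y, random_state=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=target, replace=False)
        for cls in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_undersample(X, y)
print(np.bincount(y_res), "rows kept:", len(y_res))
```

Here 6 of the 8 majority rows are thrown away, which shows the information loss on a small scale.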

---

## ⚠️ Important Rule
✔️ Apply oversampling/undersampling **only on training data**  
❌ Never apply it to test data → this avoids data leakage and keeps the evaluation realistic  
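One way to follow this rule is to split first and resample afterwards, sketched here with scikit-learn's `train_test_split` and plain random oversampling. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([0] * 180 + [1] * 20)

# 1) Split first; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2) Resample the training part only (random oversampling of class 1).
minority = np.flatnonzero(y_train == 1)
n_extra = int((y_train == 0).sum() - len(minority))
extra = rng.choice(minority, size=n_extra, replace=True)
X_train_res = np.vstack([X_train, X_train[extra]])
y_train_res = np.concatenate([y_train, y_train[extra]])

# 3) The test set is left untouched and keeps the original imbalance.
print("train:", np.bincount(y_train_res), "test:", np.bincount(y_test))
```

If the resampling happened before the split, copies of the same minority rows could end up in both train and test, which inflates the test score.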

---

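## 2️⃣ Cost-Sensitive Learning
- Assign a higher cost to errors on the minority class  
- Many scikit-learn models support:

```python
from sklearn.linear_model import LogisticRegression  # one example of a model with this option
model = LogisticRegression(class_weight='balanced')  # weights inversely proportional to class frequencies
```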
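To see what `'balanced'` amounts to numerically, the weights can be computed by hand and compared with scikit-learn's `compute_class_weight`. The 95/5 labels are the running example from these notes.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)
classes = np.unique(y)

# 'balanced' weight for class c: n_samples / (n_classes * count_c)
manual = len(y) / (len(classes) * np.bincount(y))

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With these weights, a mistake on a minority sample costs about 19 times as much as a mistake on a majority sample.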


## 3️⃣ Ensemble-Based Methods

- Combine multiple models designed for imbalanced data
- Examples:
    - Balanced Random Forest
    - EasyEnsemble
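The general idea behind such ensembles can be sketched as: train several models, each on all minority samples plus a fresh random subset of the majority, then average their predictions. The code below is a simplified illustration with hypothetical helper names and binary 0/1 labels (1 = minority). It is not imblearn's implementation; the library versions are `BalancedRandomForestClassifier` and `EasyEnsembleClassifier` in `imblearn.ensemble`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_bag(X, y, n_models=10, random_state=0):
    """Train each tree on all minority rows plus an equally sized random majority subset."""
    rng = np.random.default_rng(random_state)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_models):
        sampled_majority = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled_majority])
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        models.append(tree.fit(X[idx], y[idx]))
    return models

def predict_proba_minority(models, X):
    """Average the minority class probability over all models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(190, 2)), rng.normal(3.0, 1.0, size=(10, 2))])
y = np.array([0] * 190 + [1] * 10)

models = fit_balanced_bag(X, y)
proba = predict_proba_minority(models, X)
print("mean minority probability, true minority rows:", round(float(proba[y == 1].mean()), 2))
```

Each model sees balanced data, and together the models still use most of the majority samples, which softens the information loss of plain undersampling.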

## Key Takeaways
- Isolation Forest → fast, widely used outlier detector
- KNN → intuitive but computationally expensive
- LOF → well suited to detecting local outliers
- Imbalanced data biases models → must be handled
- Use SMOTE / ADASYN for oversampling
- Use class weights for cost-sensitive learning
- Use ensemble methods when needed
- Always check the class distribution using `y.value_counts()`