### What are Outliers?

**Outliers** are data points that lie significantly far from the majority of observations in a dataset.

They can occur due to:
- Measurement or data entry errors  
- Rare but valid events  
- Natural variability in data  

⚠️ Outliers can:
- Skew mean and variance  
- Mislead model training  
- Sometimes carry critical insights  
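
To make the effect on the mean and variance concrete, here is a small sketch using only Python's standard `statistics` module. The salary figures are made up for illustration; they are not from any real dataset.

```python
from statistics import mean, median, pstdev

# Hypothetical salaries (in thousands); the extra value 500 is the outlier.
clean = [30, 32, 35, 36, 38, 40, 41]
with_outlier = clean + [500]

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    # The mean and standard deviation react strongly; the median barely moves.
    print(f"{name:>12}: mean={mean(data):.1f}  "
          f"median={median(data):.1f}  std={pstdev(data):.1f}")
```

One extreme value moves the mean from 36 to 94, while the median only shifts from 36 to 37.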

👉 Outliers are not always bad; context decides.


### When are Outliers dangerous?

<!-- ## Age[human] : 300 -->
### When to Remove or Keep Outliers?

####  Remove Outliers when:
- They are caused by **data entry or measurement errors**
- Values are **physically or logically impossible**
- They heavily **distort model performance**
- Dataset is **small** and outliers dominate patterns

#### Keep Outliers when:
- They represent **real-world rare events**
- The problem domain **expects extremes** (fraud, failures, spikes)
- They carry **important business or scientific insight**
- You are using **robust models** (tree-based, median-based)

Rule of thumb:  
**If an outlier is valid and meaningful → keep it.  
If it's noise or error → remove it.**


# Effect of Outliers on ML Algorithms

Outliers can significantly impact machine learning models, depending on the algorithm used.

####  Negative Effects:
- **Mean-based models** (Linear Regression) get skewed, because squared error magnifies large residuals
- **Distance-based models** (KNN) miscalculate similarity
- **Gradient-based models** may converge poorly
- Increased **variance** and unstable predictions

#### Minimal Effect on:
- **Tree-based models** (Decision Tree, Random Forest)
- **Rank / median-based methods**
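
The contrast between the two groups can be sketched with a toy regression. The example below compares an ordinary least-squares slope with the Theil-Sen slope (the median of all pairwise slopes), which is one example of a median-based method. The data and the helper names `ols_slope` and `theil_sen_slope` are illustrative assumptions, not part of any library.

```python
from itertools import combinations
from statistics import mean, median

def ols_slope(xs, ys):
    """Least-squares slope: covariance(x, y) / variance(x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

def theil_sen_slope(xs, ys):
    """Median of the slopes between every pair of points."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    return median(slopes)

xs = list(range(10))
ys_clean = [2 * x + 1 for x in xs]   # exact line y = 2x + 1
ys_dirty = ys_clean[:-1] + [100]     # last point corrupted (should be 19)

print("OLS clean      :", ols_slope(xs, ys_clean))
print("OLS with outlier:", round(ols_slope(xs, ys_dirty), 2))
print("Theil-Sen with outlier:", theil_sen_slope(xs, ys_dirty))
```

On this toy data a single corrupted point more than triples the least-squares slope, while the median-based estimate stays at 2.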


# How to Treat Outliers?

Outlier treatment depends on data context and model sensitivity.

#### Common Techniques:
- **Remove** → when caused by errors or noise
- **Capping (Winsorization)** → limit values using IQR or percentiles
- **Transformation** → log, sqrt to reduce skewness
- **Imputation** → replace using mean, median, or model-based methods
- **Binning** → group extreme values
- **Robust Models** → tree-based or median-based algorithms
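
As a rough illustration of the transformation option, the sketch below applies `math.log1p` to a made-up, right-skewed list of incomes. The ratio of the maximum to the median is used here only as a simple, informal indicator of how far the extreme value sits from the bulk of the data. Note that a log transform assumes non-negative values.

```python
import math
from statistics import median

# Hypothetical right-skewed incomes (in thousands).
incomes = [30, 35, 40, 45, 50, 1000]

# log1p(x) = log(1 + x), which stays defined when x is 0.
log_incomes = [math.log1p(v) for v in incomes]

def max_to_median(values):
    """Informal skew indicator: how many medians away the largest value is."""
    return max(values) / median(values)

print("before:", round(max_to_median(incomes), 2))
print("after :", round(max_to_median(log_incomes), 2))
```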

####  Insight:
Outlier treatment is a **modeling choice**, not a fixed rule.


In practice, we mainly use **trimming** (dropping the outlier rows) and **capping**.
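
A minimal sketch of trimming and capping, assuming IQR fences with the usual 1.5 multiplier. The helper names (`iqr_fences`, `trim`, `cap`) and the sample scores are illustrative. Quartiles come from `statistics.quantiles` with `method="inclusive"`, so other quartile conventions may give slightly different fences.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def trim(values, lower, upper):
    """Trimming: drop every value outside the fences."""
    return [v for v in values if lower <= v <= upper]

def cap(values, lower, upper):
    """Capping: clip every value to the nearest fence, keeping the row count."""
    return [min(max(v, lower), upper) for v in values]

scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
lower, upper = iqr_fences(scores)

trimmed = trim(scores, lower, upper)
capped = cap(scores, lower, upper)

print("fences :", lower, upper)
print("trimmed:", trimmed)
print("capped :", capped)
```

Trimming shrinks the dataset, while capping keeps every row and only changes the extreme values. Which one is appropriate depends on how much data you can afford to lose.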

# How to detect Outliers?

Outliers can be detected using statistical, visual, and model-based methods.

#### Statistical Methods:
- **IQR Method** → values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
- **Z-Score** → values with |z| > 3
- **Modified Z-Score** → uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation (robust)
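
The modified Z-score can be sketched as follows. The formula `0.6745 * (x - median) / MAD` and the cutoff of 3.5 are the commonly cited convention; both are assumptions here, since the notes above do not fix them. The function name and data are illustrative.

```python
from statistics import median

def modified_z_outliers(values, threshold=3.5):
    """Flag values whose modified Z-score exceeds the threshold in magnitude."""
    med = median(values)
    # MAD: median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical, so the score is undefined.
        # This sketch flags nothing in that case.
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

readings = [10, 12, 11, 13, 12, 11, 14, 12, 13, 60]
print(modified_z_outliers(readings))
```

Because the median and MAD are barely affected by the extreme value, the score for 60 stays large even in this small sample.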

#### Visualization Methods:
- **Box Plot** 
- **Scatter Plot**
- **Histogram**

#### Model-Based Methods:
- **Isolation Forest**
- **Local Outlier Factor (LOF)**
- **DBSCAN**

No single method fits all; always validate with domain knowledge.

### μ ± 3σ Rule (Mean–Standard Deviation Method)

This method assumes the data follows a **normal distribution**.

- **μ (mu)** → mean of the feature  
- **σ (sigma)** → standard deviation  

Any data point outside the range:  
**μ − 3σ to μ + 3σ** is considered an **outlier**.

####  Works well when:
- Data is **normally distributed**
- Dataset is **large**
- No heavy skewness

#### Limitations:
- Sensitive to existing outliers (mean & std get skewed)
- Not suitable for **skewed or non-Gaussian data**

Rule of thumb:  
Use **μ ± 3σ** only after checking the distribution.
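
The rule and its main limitation can be sketched together. In the example below, the function name `three_sigma_outliers`, the synthetic ages, and the random seed are all assumptions for illustration. It uses the sample standard deviation (`statistics.stdev`).

```python
import random
from statistics import mean, stdev

def three_sigma_outliers(values):
    """Return the values lying outside mean ± 3 * standard deviation."""
    mu, sigma = mean(values), stdev(values)
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [v for v in values if v < lower or v > upper]

# Large, roughly normal sample with one impossible age injected.
rng = random.Random(42)
large_sample = [rng.gauss(35, 8) for _ in range(500)] + [300]

# Small sample with the same impossible age.
small_sample = [22, 25, 27, 30, 31, 33, 35, 38, 40, 300]

print(300 in three_sigma_outliers(large_sample))  # caught in the large sample
print(three_sigma_outliers(small_sample))         # missed in the small sample
```

In the small sample the value 300 inflates both the mean and the standard deviation so much that its own z-score stays below 3. In fact, with 10 points and the sample standard deviation, no value can exceed |z| ≈ 2.85, which is why the rule needs a large dataset.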


#### Percentile-based:
Values should lie between the **2.5th and 97.5th percentiles**; anything outside that range is treated as an outlier.
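
A minimal sketch of the percentile approach, assuming linear interpolation between neighbouring values when a percentile falls between two data points. The `percentile` helper is hypothetical and written out here so the example has no dependencies. Other interpolation conventions give slightly different cutoffs.

```python
import math

def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in 0..100) of an already sorted list."""
    k = (len(sorted_values) - 1) * p / 100
    lo = math.floor(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)

values = list(range(1, 101))  # toy data: 1, 2, ..., 100
ordered = sorted(values)

low = percentile(ordered, 2.5)
high = percentile(ordered, 97.5)

# Trimming drops values outside the band; capping clips them to its edges.
trimmed = [v for v in values if low <= v <= high]
capped = [min(max(v, low), high) for v in values]

print("cutoffs:", low, high)
print("kept after trimming:", len(trimmed), "of", len(values))
```

Unlike the IQR or Z-score rules, this approach always flags a fixed share of the data (about 5% here), whether or not those values are truly unusual.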

# Techniques for Outlier Detection and Removal (what we study; mainly used)

1) Z-Score
2) IQR Based
3) Percentile
4) Capping (Winsorization)