# Outliers in Machine Learning

- Outliers are data points that significantly differ from the majority of the data in a dataset. 
- They are often far removed from the typical distribution of the data and can have a large impact on statistical analyses and machine learning models.

## Outlier Detection Methods

### 1. Visualization Techniques

##### Box Plot
- Highlights outliers as points outside the whiskers
- Provides a clear visual representation of data distribution and extreme values
##### Scatter Plot
- Useful for visualizing multivariate outliers
- Helps identify patterns and anomalies in two-dimensional data

##### Histogram
- Shows skewness or extreme values in the data
- Provides insights into data distribution and potential outliers

### 2. Statistical Methods

##### Z-Score (Standard Score)
Measures how far a data point is from the mean, in terms of standard deviations.

**Formula:**
```
Z = (X - μ) / σ
```
Where:
- X is the data point
- μ is the mean
- σ is the standard deviation

A typical threshold for detecting outliers is |Z| > 3.
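The formula and threshold above can be sketched in a few lines of standard-library Python. The function name `zscore_outliers` and the sample data are illustrative assumptions, and the population standard deviation (`pstdev`) is used for σ; a sample standard deviation would work similarly.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [x for x in values if abs((x - mu) / sigma) > threshold]


# Illustrative data: 19 values between 10 and 13, plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 95]
outliers = zscore_outliers(data)
print(outliers)
```

Note that the extreme value itself inflates μ and σ, so in very small samples (roughly ten points or fewer) |Z| > 3 may never be reached even for an obvious outlier.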

##### Interquartile Range (IQR)
Outliers are defined as values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where:

- Q1: 25th percentile
- Q3: 75th percentile
- IQR: Q3 - Q1
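As a minimal sketch of the IQR rule, the fences can be computed with `statistics.quantiles` from the standard library. The function name `iqr_outliers`, the sample data, and the choice of the `"inclusive"` quantile method are assumptions for illustration; different quantile conventions give slightly different Q1 and Q3 values.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


# Illustrative data: ten typical values and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
outliers = iqr_outliers(data)
print(outliers)
```

Because quartiles are barely affected by a single extreme value, this rule tends to remain usable on small samples where the Z-score threshold does not trigger.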

### 3. Machine Learning Methods

##### Isolation Forest
- Isolates points by recursively splitting the data on randomly selected features and split values; outliers tend to be isolated in fewer splits than typical points
- Efficient for high-dimensional data
- Works well with large datasets
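The idea can be illustrated with a simplified, pure-Python sketch. It is not the implementation found in any particular library: the function names, the dictionary-based tree, and the default parameters are assumptions made for illustration. Points that end up alone after only a few random splits receive scores closer to 1.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup
    over n points; used to normalise path lengths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature and split value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split on this feature
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(point, node, depth=0):
    """Number of splits needed to reach the leaf that contains the point."""
    if "split" not in node:
        return depth + c_factor(node["size"])
    side = "left" if point[node["dim"]] < node["split"] else "right"
    return path_length(point, node[side], depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score each point in (0, 1); higher means easier to isolate."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2 ** (-(sum(path_length(p, t) for t in trees) / n_trees) / norm)
            for p in points]


# Illustrative data: a tight 6x6 grid plus one distant point (last index).
points = [(x * 0.1, y * 0.1) for x in range(6) for y in range(6)]
points.append((10.0, 10.0))
scores = anomaly_scores(points)
most_anomalous = scores.index(max(scores))
print(most_anomalous, round(scores[most_anomalous], 2))
```

In practice a library implementation would normally be preferred; the sketch only shows why short path lengths indicate outliers.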

##### DBSCAN (Density-Based Clustering)
- Detects outliers as points in low-density regions
- Defines density with two parameters: a neighborhood radius (eps) and a minimum number of points within that radius
- Can identify clusters of arbitrary shapes
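The outlier-labelling part of DBSCAN can be sketched without the clustering step. In the sketch below, a point is a core point if at least `min_samples` points (itself included) lie within `eps`; any point that is neither a core point nor within `eps` of one is reported as noise. The function name `dbscan_noise`, the data, and the parameter values are illustrative assumptions, and the brute-force distance computation is only suitable for small datasets.

```python
import math


def dbscan_noise(points, eps, min_samples):
    """Return indices of points that DBSCAN would label as noise."""
    n = len(points)
    # Neighborhood of each point, including the point itself.
    neighbors = [[j for j in range(n)
                  if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_samples}
    # Non-core points next to a core point are border points, not noise.
    return [i for i in range(n)
            if i not in core and not any(j in core for j in neighbors[i])]


# Illustrative data: two dense groups and one isolated point (index 8).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
noise = dbscan_noise(points, eps=1.5, min_samples=3)
print(noise)
```

The result depends heavily on `eps` and `min_samples`, so both usually need tuning for the dataset at hand.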

##### Autoencoders
- Neural networks trained to reconstruct input data
- Large reconstruction errors indicate outliers
- Particularly effective for complex, high-dimensional data

##### Local Outlier Factor (LOF)
- Measures the local density of a data point compared to its neighbors
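- Scores near 1 indicate a density similar to that of the neighbors; scores well above 1 indicate a point in a sparser region, a likely outlier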
- Effective for detecting outliers in varying density regions 
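A brute-force sketch of the LOF score, written with the standard library only, may make the density comparison concrete. The function name `local_outlier_factor`, the data, and `k=3` are illustrative assumptions; the sketch uses exactly the k nearest neighbors (ties are broken by index) and assumes there are more than k points with no large groups of exact duplicates.

```python
import math


def local_outlier_factor(points, k=3):
    """Return one LOF score per point; values well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    knn, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        knn.append(order[:k])                   # indices of the k nearest
        k_dist.append(dist[i][order[k - 1]])    # distance to the k-th nearest

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in knn[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average neighbor density divided by the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (len(knn[i]) * lrd[i])
            for i in range(n)]


# Illustrative data: a 3x3 grid plus one distant point (last index).
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = local_outlier_factor(points, k=3)
print([round(s, 2) for s in scores])
```

The grid points score close to 1, while the distant point scores far higher because its neighbors are much denser than it is.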

## Why Removing Outliers Is Important

- Outliers can prevent many models from predicting accurately: the fitted model is pulled toward the extreme values, which biases its results and reduces its accuracy.
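A small worked example shows the effect on an ordinary least-squares line. The helper `fit_line` and the data are illustrative assumptions: six points lie exactly on y = 2x, and then the last y value is replaced by an extreme one.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1, 2, 3, 4, 5, 6]
ys_clean = [2, 4, 6, 8, 10, 12]          # exactly y = 2x
ys_outlier = ys_clean[:-1] + [60]        # last value replaced by an outlier

clean_slope, clean_intercept = fit_line(xs, ys_clean)
skewed_slope, skewed_intercept = fit_line(xs, ys_outlier)
print(clean_slope, skewed_slope)
```

A single extreme value is enough to move the fitted slope far from 2, so predictions for the remaining, typical points become worse.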