### Outliers
Outliers are data points that differ markedly from most of the values in a dataset. They can arise from data entry mistakes, measurement errors, or rare real-world events. Sometimes outliers are useful (for example, in fraud detection), but often they reduce model accuracy and give misleading results. That's why identifying and handling outliers is an important step in data processing.

### How to detect outliers?
Two methods are covered here: the IQR method and the Z-score method.
### Method 1: IQR method
The interquartile range (IQR) method focuses on the spread of the middle 50% of the data. It calculates the IQR as the difference between the 75th and 25th percentiles and identifies outliers as points that fall more than 1.5 times the IQR below the 25th percentile or above the 75th percentile. This method is robust to extreme values and does not assume a normal distribution.

#### Steps to detect outliers:
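1. Find Q1 (the 25th percentile) and Q3 (the 75th percentile).
2. Compute IQR = Q3 - Q1.
3. Compute the bounds:
    - Lower bound: Q1 - 1.5 * IQR
    - Upper bound: Q3 + 1.5 * IQR
4. Flag any point below the lower bound or above the upper bound as an outlier.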
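
The steps above can be sketched as a pair of small helper functions. This is a minimal illustration, not a definitive implementation: the names `iqr_bounds` and `iqr_outliers`, the parameter `k`, and the `sample` list are assumptions made for the example, and the percentiles use NumPy's default linear interpolation.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])  # linear interpolation by default
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

# Illustrative sample (hypothetical values chosen for this sketch)
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
lower, upper = iqr_bounds(sample)
print(lower, upper)          # -3.5 14.5
print(iqr_outliers(sample))  # [20]
```

The multiplier `k=1.5` is the conventional choice described above; a larger value would flag fewer points.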
### Method 2: Z-Score
The Z-score method is a statistical technique that detects outliers based on how far a data point is from the mean, measured in standard deviations. It assumes the data follows an approximately normal distribution. A point with a large absolute Z-score (typically |Z| > 3) is flagged as an outlier because it lies in the extreme tails of the distribution.

Formula:
Z = (x - mean) / standard deviation

Choosing the threshold:
|Z| > 3 is very strict and suits large datasets; |Z| > 2 is more practical for small datasets.
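
One way to see why the threshold depends on dataset size is to compute Z-scores by hand and compare them with the largest Z-score a sample of that size can produce. The sketch below uses only the standard library and the ten study-hour values from the cells that follow. The function names are hypothetical, and the bound `(n - 1) / sqrt(n)` is Samuelson's inequality expressed for the sample standard deviation (divisor `n - 1`).

```python
import math
import statistics

def z_scores(values):
    """Z = (x - mean) / standard deviation, using the sample std (n - 1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # same convention as pandas .std()
    return [(x - mean) / std for x in values]

def max_possible_abs_z(n):
    """Largest |Z| any single point can reach in a sample of size n."""
    return (n - 1) / math.sqrt(n)

hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
z = z_scores(hours)

print(round(z[-1], 2))                           # 2.5  (Z-score of the value 20)
print(round(max_possible_abs_z(len(hours)), 2))  # 2.85
print([x for x, s in zip(hours, z) if abs(s) > 3])  # []
print([x for x, s in zip(hours, z) if abs(s) > 2])  # [20]
```

With ten points, no value can exceed |Z| = 3 under this convention, so a threshold of 3 would never flag anything here, while a threshold of 2 flags the value 20.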

In [1]:
import pandas as pd
data=pd.DataFrame({
    "study_hours":[1,2,3,4,5,6,7,8,9,20]
})
data

|   | study_hours |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 20 |


In [2]:
data.describe()

|   | study_hours |
|---|---|
| count | 10.0 |
| mean | 6.5 |
| std | 5.400617 |
| min | 1.0 |
| 25% | 3.25 |
| 50% | 5.5 |
| 75% | 7.75 |
| max | 20.0 |


### IQR method

In [3]:
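Q1 = data.quantile(0.25)  # 25th percentile of each column
Q3 = data.quantile(0.75)  # 75th percentile of each column
IQR = Q3 - Q1
LB = Q1 - 1.5 * IQR  # lower bound
UB = Q3 + 1.5 * IQR  # upper bound
# Masking the whole DataFrame keeps every row and shows NaN where the value is not an outlier
outliers = data[(data < LB) | (data > UB)]
outliers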

|   | study_hours |
|---|---|
| 0 | NaN |
| 1 | NaN |
| 2 | NaN |
| 3 | NaN |
| 4 | NaN |
| 5 | NaN |
| 6 | NaN |
| 7 | NaN |
| 8 | NaN |
| 9 | 20.0 |
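
The output above keeps all ten rows because the mask is applied to the whole DataFrame. If only the outlier rows are wanted, one option is to build the mask from the single column instead. The following is a sketch under that assumption; the names `col`, `mask`, `outlier_rows`, and `cleaned` are introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame({"study_hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]})

col = data["study_hours"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A boolean Series (one value per row) selects rows instead of masking cells
mask = (col < lower) | (col > upper)
outlier_rows = data[mask]   # only the rows flagged as outliers
cleaned = data[~mask]       # the remaining rows

print(outlier_rows)
print(len(cleaned))
```

Here `outlier_rows` contains just the row with `study_hours` equal to 20, and `cleaned` holds the other nine rows. Whether to drop, cap, or keep flagged rows depends on the use case.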


In [4]:
Q1

study_hours    3.25
Name: 0.25, dtype: float64

In [5]:
Q3

study_hours    7.75
Name: 0.75, dtype: float64

In [6]:
UB

study_hours    14.5
dtype: float64

In [7]:
LB

study_hours   -3.5
dtype: float64

In [8]:
IQR

study_hours    4.5
dtype: float64

### Z-Score Method

In [10]:
import pandas as pd
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
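
With these imports in place, the Z-score check on the study-hours data might look like the sketch below. It is an illustration rather than the original notebook's cell: the variable names and the threshold of 2 are assumptions, chosen because the dataset has only ten rows. Note that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, so its values differ slightly from ones computed with pandas `.std()`.

```python
import numpy as np
import pandas as pd
from scipy import stats

data = pd.DataFrame({"study_hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]})

# Z = (x - mean) / standard deviation, computed for every row
z = stats.zscore(data["study_hours"].to_numpy())  # ddof=0 by default

threshold = 2  # assumed here because the dataset is small
outliers = data.loc[np.abs(z) > threshold]

print(np.round(z, 2))
print(outliers)
```

Only the row with `study_hours` equal to 20 exceeds the threshold; its Z-score is about 2.63, so a threshold of 3 would flag nothing on this dataset.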