# Outlier Detection
## Sample Data
We'll create a sample dataset with a numerical feature, "Height":

In [1]:
import pandas as pd
import numpy as np

# Sample Data with potential outliers
data = pd.DataFrame({
    'Height': [150, 160, 165, 170, 175, 180, 185, 190, 195, 200, 250, 300]
})

print("Original Data:")
print(data)

Original Data:
    Height
0      150
1      160
2      165
3      170
4      175
5      180
6      185
7      190
8      195
9      200
10     250
11     300


## Detecting and Removing Outliers using Z-score
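The Z-score method detects outliers by measuring how many standard deviations a data point lies from the mean. A common threshold is an absolute Z-score greater than 3. This example uses a threshold of 2 instead, because with only 12 points the largest value (300) has a Z-score of about 2.64 and would not be flagged at 3.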
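To make the definition concrete, here is a minimal sketch that computes the Z-scores by hand with NumPy. It assumes the population standard deviation (`ddof=0`), which is the default used by `scipy.stats.zscore`; the variable names `heights` and `z_manual` are illustrative only.

```python
import numpy as np

# Same values as the "Height" column in the sample data
heights = np.array([150, 160, 165, 170, 175, 180, 185, 190, 195, 200, 250, 300])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
mean = heights.mean()
std = heights.std(ddof=0)
z_manual = (heights - mean) / std

# Show each height next to its Z-score
for h, z in zip(heights, z_manual):
    print(h, round(float(z), 6))
```

The values for 250 and 300 should agree with the Z-scores of roughly 1.40 and 2.64 that appear in the notebook outputs.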

In [2]:
from scipy.stats import zscore

In [3]:
data['Z-Score'] = zscore(data['Height'])

In [4]:
data

    Height   Z-Score
0      150 -1.073135
1      160 -0.825488
2      165 -0.701665
3      170 -0.577842
4      175 -0.454019
5      180 -0.330195
6      185 -0.206372
7      190 -0.082549
8      195  0.041274
9      200  0.165098
10     250  1.403330
11     300  2.641563


In [5]:
# Flag rows whose Z-score lies more than `threshold` standard deviations from the mean
threshold = 2
outliers_zscore = data[data['Z-Score'].abs() > threshold]

In [6]:
outliers_zscore

    Height   Z-Score
11     300  2.641563


In [7]:
# Keep only rows whose Z-score is within `threshold` standard deviations of the mean
data_clean_zscore = data[data['Z-Score'].abs() <= threshold]

In [8]:
data_clean_zscore

    Height   Z-Score
0      150 -1.073135
1      160 -0.825488
2      165 -0.701665
3      170 -0.577842
4      175 -0.454019
5      180 -0.330195
6      185 -0.206372
7      190 -0.082549
8      195  0.041274
9      200  0.165098
10     250  1.403330


In [9]:
data_clean_zscore = data_clean_zscore.drop(columns=['Z-Score'])

## Detecting and Removing Outliers using IQR
The IQR method detects outliers by identifying data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 is the 25th percentile and Q3 is the 75th percentile.
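As a worked example of the rule above, the following sketch wraps the bound calculation in a small helper. `iqr_bounds` and its parameter `k` are hypothetical names introduced here for illustration, not part of pandas; the quartiles rely on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences for a numeric Series (hypothetical helper)."""
    q1 = values.quantile(0.25)  # 25th percentile
    q3 = values.quantile(0.75)  # 75th percentile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

heights = pd.Series([150, 160, 165, 170, 175, 180, 185, 190, 195, 200, 250, 300])
lower, upper = iqr_bounds(heights)
print(lower, upper)
```

For this dataset Q1 is 168.75 and Q3 is 196.25, so the IQR is 27.5 and the fences come out at 127.5 and 237.5. Any height outside that range is treated as an outlier.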

In [26]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = data['Height'].quantile(0.25)
Q3 = data['Height'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bound for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers_iqr = data[(data['Height'] < lower_bound) | (data['Height'] > upper_bound)]

In [27]:
outliers_iqr

    Height   Z-Score
10     250  1.403330
11     300  2.641563


In [28]:
# Remove outliers
data_clean_iqr = data[(data['Height'] >= lower_bound) & (data['Height'] <= upper_bound)]

print("\nData after removing outliers using IQR:")
print(data_clean_iqr)


Data after removing outliers using IQR:
   Height   Z-Score
0     150 -1.073135
1     160 -0.825488
2     165 -0.701665
3     170 -0.577842
4     175 -0.454019
5     180 -0.330195
6     185 -0.206372
7     190 -0.082549
8     195  0.041274
9     200  0.165098
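The two methods in this section flag different rows on the same data. The sketch below puts them side by side. It is only an illustration: it recomputes the Z-scores with pandas, assuming the population standard deviation (`ddof=0`) to match `scipy.stats.zscore`, and the `flagged` dictionary is a name introduced here.

```python
import pandas as pd

heights = pd.Series(
    [150, 160, 165, 170, 175, 180, 185, 190, 195, 200, 250, 300], name='Height'
)

# Z-scores with the population standard deviation (ddof=0)
z = (heights - heights.mean()) / heights.std(ddof=0)

# IQR fences
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
iqr_mask = (heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)

# Heights flagged by each rule
flagged = {
    '|z| > 3': heights[z.abs() > 3].tolist(),
    '|z| > 2': heights[z.abs() > 2].tolist(),
    'IQR': heights[iqr_mask].tolist(),
}
for rule, values in flagged.items():
    print(rule, values)
```

On this dataset the Z-score rule flags nothing at a threshold of 3 and only 300 at a threshold of 2, while the IQR rule flags both 250 and 300. One reason is that the extreme values inflate the mean and standard deviation that the Z-score depends on, whereas the quartiles are much less affected by them.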


