Detecting outliers

An outlier usually has one of two causes: <br>
1. Data entry error: e.g. a value mistyped during manual data entry <br>
2. Genuine data deviation: e.g. ticket sales during a festive season are significantly higher than in the preceding weeks.

Detecting outliers: <br>
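1. z-score method: measure how many standard deviations each value lies from the mean. Values whose absolute z-score exceeds a chosen threshold (commonly 3) are considered outliers. <br>
2. Interquartile range (IQR) method: with `IQR = Q3 - Q1`, a value `x` is considered an outlier if `x < Q1 - 1.5*IQR` or `x > Q3 + 1.5*IQR`. <br>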
3. Visualisation: boxplot or scatter plot
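
The first two rules can be sketched on a small one-dimensional sample. This is a minimal illustration, not a reference implementation: the sample values and the helper names `zscore_outliers` and `iqr_outliers` are made up for this example.

```python
import numpy as np

# Hypothetical sample: twenty ordinary values plus one extreme value
values = np.array([10, 11, 12, 13] * 5 + [95], dtype=float)

def zscore_outliers(x, threshold=3):
    # Distance of each value from the mean, in standard deviations
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) > threshold]

def iqr_outliers(x, k=1.5):
    # Quartiles with NumPy's default linear interpolation
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - k * iqr) | (x > q3 + k * iqr)]

print(zscore_outliers(values))
print(iqr_outliers(values))
```

Note that the z-score rule needs a reasonably large sample: in a very small sample a single extreme value inflates the standard deviation so much that it may never reach a threshold of 3.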

What to do with the outliers? <br>
1. If it's a data entry error, exclude the particular datum <br>
2. Apply a data transformation, e.g. a logarithmic or square root transformation, to mitigate the outliers' effect. <br>
3. Replace outliers with more reasonable values, e.g. the mean or median of the non-outliers
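
As a rough illustration of option 2, the sketch below compares how far the largest value sits from the median before and after each transformation. The `sales` figures and the helper `max_to_median` are invented for this example. Both transformations assume non-negative data.

```python
import numpy as np

# Hypothetical weekly ticket sales with one festive-season spike
sales = np.array([120.0, 135.0, 128.0, 140.0, 131.0, 2500.0])

log_sales = np.log1p(sales)   # log(1 + x), so a zero value stays defined
sqrt_sales = np.sqrt(sales)

def max_to_median(x):
    # How many times larger the maximum is than the median
    return x.max() / np.median(x)

print(max_to_median(sales))       # untransformed
print(max_to_median(sqrt_sales))  # square root pulls the spike closer
print(max_to_median(log_sales))   # logarithm pulls it closer still
```

The transformed values keep their original ordering, so the spike is still the largest observation; it simply dominates the scale less.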

# Activity

In [2]:
import numpy as np
import pandas as pd

In [3]:
#Creating dataset 
np.random.seed(42)

data = pd.DataFrame({
    'Feature_A': np.random.normal(loc=50, scale=10, size=100),
    'Feature_B': np.random.normal(loc=100, scale=20, size=100),
})

In [4]:
#Adding outliers into the dataset
data.iloc[5, 0] = 500
data.iloc[20, 1] = 200
data.iloc[35, 1] = 250

Removing outliers using z-score

In [5]:
from scipy import stats

In [6]:
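#Define a function that drops rows containing outliers

def remove_outliers_zscore(data, threshold=3):
    # Absolute z-score of every value, computed column by column
    z_scores = np.abs(stats.zscore(data))
    # Keep only the rows where every column is within the threshold
    filtered_data = data[(z_scores < threshold).all(axis=1)]
    return filtered_data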

In [7]:
#Remove outliers

filtered_data = remove_outliers_zscore(data)
filtered_data

,Feature_A,Feature_B
0,54.967142,71.692585
1,48.617357,91.587094
2,56.476885,93.145710
3,65.230299,83.954455
4,47.658466,96.774286
...,...,...
95,35.364851,107.706348
96,52.961203,82.322851
97,52.610553,103.074502
98,50.051135,101.164174
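
Before dropping rows, it can help to see which ones the z-score rule flags. The sketch below rebuilds the activity's dataset so it runs on its own and returns a boolean mask instead of a filtered frame. The helper name `flag_outliers_zscore` is hypothetical, and the z-score is computed with pandas rather than `scipy.stats.zscore`, using the same population standard deviation (`ddof=0`) that SciPy uses by default.

```python
import numpy as np
import pandas as pd

# Rebuild the activity's dataset so this sketch is self-contained
np.random.seed(42)
data = pd.DataFrame({
    'Feature_A': np.random.normal(loc=50, scale=10, size=100),
    'Feature_B': np.random.normal(loc=100, scale=20, size=100),
})
data.iloc[5, 0] = 500
data.iloc[20, 1] = 200
data.iloc[35, 1] = 250

def flag_outliers_zscore(data, threshold=3):  # hypothetical helper
    # Absolute z-score per column, population standard deviation (ddof=0)
    z_scores = ((data - data.mean()) / data.std(ddof=0)).abs()
    # True for rows where at least one column reaches the threshold
    return (z_scores >= threshold).any(axis=1)

flagged = flag_outliers_zscore(data)
print(data[flagged])               # rows the filter would drop
print(len(data) - flagged.sum())   # number of rows that would remain
```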


IQR method

In [8]:
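def remove_outliers_iqr(data, threshold_IQR=1.5):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR], column by column
    is_outlier = (data < (Q1 - threshold_IQR*IQR)) | (data > (Q3 + threshold_IQR*IQR))
    # Drop every row that has an outlier in any column
    filtered_data_IQR = data[~is_outlier.any(axis=1)]
    return filtered_data_IQR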

In [9]:
filtered_data_IQR = remove_outliers_iqr(data)
filtered_data_IQR

,Feature_A,Feature_B
0,54.967142,71.692585
1,48.617357,91.587094
2,56.476885,93.145710
3,65.230299,83.954455
4,47.658466,96.774286
...,...,...
95,35.364851,107.706348
96,52.961203,82.322851
97,52.610553,103.074502
98,50.051135,101.164174
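
To see where the IQR fences fall for each column, the bounds can be printed alongside a count of the values outside them. This is a sketch under the same setup as the activity; the helper name `iqr_bounds` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the activity's dataset so this sketch is self-contained
np.random.seed(42)
data = pd.DataFrame({
    'Feature_A': np.random.normal(loc=50, scale=10, size=100),
    'Feature_B': np.random.normal(loc=100, scale=20, size=100),
})
data.iloc[5, 0] = 500
data.iloc[20, 1] = 200
data.iloc[35, 1] = 250

def iqr_bounds(data, threshold_IQR=1.5):  # hypothetical helper
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    # One row per column, holding that column's lower and upper fence
    return pd.DataFrame({
        'lower': Q1 - threshold_IQR * IQR,
        'upper': Q3 + threshold_IQR * IQR,
    })

bounds = iqr_bounds(data)
print(bounds)

# Count the values outside the fences in each column
outside = (data < bounds['lower']) | (data > bounds['upper'])
print(outside.sum())
```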


Alternative approach: keeping every row but replacing outlier values with the median

In [10]:
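# Define the function to replace outliers with the median
def replace_outliers_with_median(data, threshold_3=1.5):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Calculate lower and upper boundary
    lower_bound = Q1 - threshold_3 * IQR
    upper_bound = Q3 + threshold_3 * IQR

    # Blank out the outliers, then fill each gap with the median of that
    # column's non-outliers (any pre-existing NaN is filled the same way)
    is_outlier = (data < lower_bound) | (data > upper_bound)
    non_outliers = data.mask(is_outlier)
    return non_outliers.fillna(non_outliers.median())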

In [13]:
#Detect and replace outliers

replaced_data = replace_outliers_with_median(data)
replaced_data
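
A related option, shown here only as a sketch, is to clip outliers to the IQR fences instead of replacing them with the median. This is a different technique from the median replacement above (it is sometimes called winsorizing); the helper name `clip_outliers_iqr` and the small example frame are made up for illustration.

```python
import pandas as pd

def clip_outliers_iqr(data, threshold=1.5):  # hypothetical helper
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    # Values beyond a fence are pulled back to that fence, column by column
    return data.clip(lower=Q1 - threshold * IQR,
                     upper=Q3 + threshold * IQR,
                     axis=1)

# Hypothetical example: column 'a' contains one extreme value
example = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 100.0],
    'b': [10.0, 20.0, 30.0, 40.0, 50.0],
})
clipped = clip_outliers_iqr(example)
print(clipped)
```

Like median replacement, clipping keeps every row; unlike it, a clipped value stays at the edge of the accepted range rather than moving to the centre of the distribution.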