# Remove outliers

1. Interquartile Range (IQR): suitable when the data is not normally distributed or is skewed. Points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.


2. The Z-Score is a statistical measure that quantifies how many standard deviations a data point lies from the mean of a dataset: Z = (x - mean) / SD. It is based on the properties of the normal distribution. In a normal distribution:
- About 68% of data lies within ±1 SD.
- About 95% of data lies within ±2 SD.
- About 99.7% of data lies within ±3 SD.

The 3-sigma rule (|Z| > 3) is widely used to flag data points that fall outside the expected range of a normal distribution. Flagged points may warrant further investigation or removal, depending on the context of the analysis.
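The 68 / 95 / 99.7 percentages can be checked numerically. The following is a minimal sketch, assuming SciPy is available; the `coverage` dictionary is an illustrative name, not part of the original notebook.

```python
from scipy.stats import norm

# Probability mass of a standard normal distribution within ±k standard deviations
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within ±{k} SD: {p:.4f}")
```

The printed values should be close to 0.68, 0.95 and 0.997, which is where the rule of thumb comes from.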

3. Standard Deviation
- The population standard deviation (σ) is used when the dataset represents the entire population.
- The sample standard deviation (s) is used when the dataset is a sample of a larger population.

pandas (`Series.std`) and the `statistics` module (`statistics.stdev`) use the sample standard deviation by default, i.e. the denominator of the equation is N - 1 instead of N.
NumPy (`np.std`) uses the population standard deviation by default, i.e. the denominator of the equation is N instead of N - 1.
The `ddof` parameter ("Delta Degrees of Freedom") controls this: the divisor is N - ddof. Use `ddof=1` when the data is a sample taken from a larger population, and `ddof=0` when the data is the full population and NOT a sample of it. Dividing by N - 1 (Bessel's correction) counterbalances the bias that arises when the variance is estimated around the sample mean rather than the true population mean.
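To make the defaults concrete, the sketch below computes both variants with each library on a small illustrative list (`values` is made up for this example and is not data from the notebook).

```python
import statistics

import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, mean = 5

pop_np = np.std(values)                  # NumPy default, ddof=0: divide by N
sample_np = np.std(values, ddof=1)       # divide by N - 1
sample_pd = pd.Series(values).std()      # pandas default, ddof=1
pop_pd = pd.Series(values).std(ddof=0)   # pandas, population version
sample_stats = statistics.stdev(values)  # sample standard deviation
pop_stats = statistics.pstdev(values)    # population standard deviation

print("population:", pop_np, pop_pd, pop_stats)
print("sample:    ", sample_np, sample_pd, sample_stats)
```

The population results agree with each other, as do the sample results; the sample value is larger by a factor of sqrt(N / (N - 1)).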



You can always try to use a model less sensitive to outliers.
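As one possible illustration of a less sensitive model, the sketch below compares an ordinary least-squares line (`np.polyfit`) with the Theil-Sen estimator (`scipy.stats.theilslopes`), which takes the median of pairwise slopes. The data is hypothetical, and Theil-Sen is only one of several robust options; it is not something the original notebook prescribes.

```python
import numpy as np
from scipy.stats import theilslopes

# Hypothetical data on the line y = 2x + 1, with one injected outlier
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[9] = 100.0  # the uncorrupted value would be 19

# Ordinary least squares: every point contributes, so the outlier drags the fit
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# Theil-Sen: median of pairwise slopes, largely unaffected by a single outlier
ts_slope, ts_intercept, _, _ = theilslopes(y, x)

print("OLS slope:      ", ols_slope)
print("Theil-Sen slope:", ts_slope)
```

With a single corrupted point, the least-squares slope moves well away from 2 while the Theil-Sen slope stays at 2, so no outlier removal step is needed before fitting.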



In [None]:
# Interquartile Range (IQR)

import numpy as np


data = [10, 12, 11, 13, 14, 15, 100]

# Calculate Q1, Q3, and IQR
Q1 = np.percentile(data, 25)  # pandas equivalent: df['numerical_column'].quantile(0.25)
Q3 = np.percentile(data, 75)  # pandas equivalent: df['numerical_column'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
filtered_data = [x for x in data if lower_bound <= x <= upper_bound]

print("Original Data:", data)
print("Filtered Data (IQR):", filtered_data)

Original Data: [10, 12, 11, 13, 14, 15, 100]
Filtered Data (IQR): [10, 12, 11, 13, 14, 15]
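The comments in the cell above mention `df['numerical_column'].quantile(...)`. A minimal pandas version of the same filter might look like the sketch below; the DataFrame `df` and its column name are hypothetical and reuse the values from the cell above.

```python
import pandas as pd

# Hypothetical DataFrame holding the same values as the list above
df = pd.DataFrame({"numerical_column": [10, 12, 11, 13, 14, 15, 100]})

# Quartiles and IQR (pandas uses linear interpolation by default, like np.percentile)
q1 = df["numerical_column"].quantile(0.25)
q3 = df["numerical_column"].quantile(0.75)
iqr = q3 - q1

# Keep only rows inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]; bounds are inclusive
mask = df["numerical_column"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered_df = df[mask]

print(filtered_df["numerical_column"].tolist())
```

Filtering with a boolean mask keeps the other columns of each row aligned, which a list comprehension over a single column would not do.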


In [1]:
# Z-score

from scipy.stats import zscore
import numpy as np

# Example dataset with outliers
data = np.array([10, 12, 12, 13, 12, 14, 13, 15, 14, 1000])  # 1000 is an outlier

# Step 1: Calculate Z-scores
z_scores = zscore(data)

# Step 2: Define outlier threshold (e.g., |Z| > 3)
outliers = data[np.abs(z_scores) > 3]
cleaned_data = data[np.abs(z_scores) <= 3]

# Step 3: Display results
print("Original Data:", data)
print("Z-Scores:", z_scores)
print("Outliers:", outliers)
print("Cleaned Data:", cleaned_data)

Original Data: [  10   12   12   13   12   14   13   15   14 1000]
Z-Scores: [-0.34270901 -0.33595612 -0.33595612 -0.33257968 -0.33595612 -0.32920323
 -0.33257968 -0.32582679 -0.32920323  2.99996998]
Outliers: []
Cleaned Data: [  10   12   12   13   12   14   13   15   14 1000]


The Z-score is sensitive to the mean and standard deviation of the data, and an extreme outlier inflates both. Here the value 1000 pulls the mean and standard deviation up so far that its own Z-score stays just under 3, so the |Z| > 3 rule flags nothing.
Robust statistics: use the median and the median absolute deviation (MAD) for outlier detection if the dataset contains extreme outliers. These measures are much less sensitive to outliers.
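The empty result is not specific to the value 1000. With the population standard deviation (the `zscore` default, `ddof=0`), no point in a dataset of n values can have |Z| above sqrt(n - 1), which is exactly 3 for n = 10. The sketch below, which reuses the nine ordinary values from the cell above and varies the extreme value (the names `base`, `bound` and `max_abs_z` are illustrative), shows the largest |Z| approaching but never exceeding that bound.

```python
import numpy as np
from scipy.stats import zscore

base = [10, 12, 12, 13, 12, 14, 13, 15, 14]  # the nine ordinary values
n = len(base) + 1
bound = np.sqrt(n - 1)  # 3.0 for n = 10

max_abs_z = {}
for extreme in (100, 1000, 1_000_000):
    data = np.array(base + [extreme], dtype=float)
    # zscore uses the population standard deviation (ddof=0) by default
    max_abs_z[extreme] = np.abs(zscore(data)).max()
    print(f"extreme={extreme:>9}: max |Z| = {max_abs_z[extreme]:.6f} (bound {bound:.1f})")
```

For small samples the 3-sigma threshold can therefore be unreachable, which is one more reason to prefer a robust method or a lower threshold there.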

In [None]:

# Robust Z-score (median and MAD)

import numpy as np
from scipy.stats import median_abs_deviation

# Example dataset with outliers
data = np.array([10, 12, 12, 13, 12, 14, 13, 15, 14, 1000])  # 1000 is an outlier

# Step 1: Calculate median and MAD
median = np.median(data)
mad = median_abs_deviation(data)

# Step 2: Calculate robust Z-scores
robust_z_scores = (data - median) / mad

# Step 3: Define outlier threshold (e.g., |Z| > 3)
outliers = data[np.abs(robust_z_scores) > 3]
cleaned_data = data[np.abs(robust_z_scores) <= 3]

# Step 4: Display results
print("Original Data:", data)
print("Robust Z-Scores:", robust_z_scores)
print("Outliers:", outliers)
print("Cleaned Data:", cleaned_data)

Original Data: [  10   12   12   13   12   14   13   15   14 1000]
Robust Z-Scores: [ -3.  -1.  -1.   0.  -1.   1.   0.   2.   1. 987.]
Outliers: [1000]
Cleaned Data: [10 12 12 13 12 14 13 15 14]
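The cell above divides by the raw MAD. A common convention rescales the MAD by roughly 1.4826 so that it is comparable to the standard deviation for normally distributed data; `median_abs_deviation` does this when called with `scale="normal"`. The sketch below is a variant under that assumption, using the same data, and is not the notebook's original calculation.

```python
import numpy as np
from scipy.stats import median_abs_deviation

data = np.array([10, 12, 12, 13, 12, 14, 13, 15, 14, 1000])  # 1000 is an outlier

median = np.median(data)
# scale="normal" rescales the raw MAD (1.0 here) to about 1.4826
mad_normal = median_abs_deviation(data, scale="normal")

# Robust Z-scores on a scale comparable to ordinary Z-scores
robust_z = (data - median) / mad_normal
outliers = data[np.abs(robust_z) > 3]

print("Scaled MAD:", mad_normal)
print("Outliers:", outliers)
```

With the scaled MAD, the value 10 moves from a borderline score of -3 to about -2, while 1000 is still flagged by a wide margin.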


In [3]:
# Standard deviation (statistics module, sample)

import statistics

# Example dataset
arr = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 100]  # 100 is an outlier

# Calculate mean and sample standard deviation
mean = statistics.mean(arr)
std_dev = statistics.stdev(arr)

# Define outlier threshold (e.g., ±2 standard deviations from the mean)
threshold = 2
lower_bound = mean - threshold * std_dev
upper_bound = mean + threshold * std_dev

# Remove outliers
cleaned_data = [x for x in arr if lower_bound <= x <= upper_bound]

# Display results
print("Original Data:", arr)
print("Mean:", mean)
print("Sample Standard Deviation:", std_dev)
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
print("Cleaned Data:", cleaned_data)

Original Data: [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 100]
Mean: 12.7
Sample Standard Deviation: 30.701248617387968
Lower Bound: -48.70249723477593
Upper Bound: 74.10249723477594
Cleaned Data: [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]


In [4]:
# Standard deviation (pandas, sample by default)
import pandas as pd

# Example dataset
arr = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 100]  # 100 is an outlier

# Convert to pandas Series
pandas_series = pd.Series(arr)

# Calculate mean and sample standard deviation
mean = pandas_series.mean()
std_dev = pandas_series.std()

# Define outlier threshold (e.g., ±2 standard deviations from the mean)
threshold = 2
lower_bound = mean - threshold * std_dev
upper_bound = mean + threshold * std_dev

# Remove outliers
cleaned_data = pandas_series[(pandas_series >= lower_bound) & (pandas_series <= upper_bound)]

# Display results
print("Original Data:", arr)
print("Mean:", mean)
print("Sample Standard Deviation:", std_dev)
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
print("Cleaned Data:", cleaned_data.tolist())

Original Data: [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 100]
Mean: 12.7
Sample Standard Deviation: 30.701248617387964
Lower Bound: -48.70249723477593
Upper Bound: 74.10249723477592
Cleaned Data: [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]


In [1]:
# Standard deviation (NumPy: population by default, sample with ddof=1)
import numpy as np

# Example dataset
arr = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 100]  # 100 is an outlier

# Debug: Verify dataset
print("Dataset:", arr)
print("Data Types:", [type(x) for x in arr])

# Calculate mean and population standard deviation
mean = np.mean(arr)
std_dev_population = np.std(arr)

# Define outlier threshold (e.g., ±2 standard deviations from the mean)
threshold = 2
lower_bound_population = mean - threshold * std_dev_population
upper_bound_population = mean + threshold * std_dev_population

# Remove outliers
cleaned_data_population = [x for x in arr if lower_bound_population <= x <= upper_bound_population]

# Display results
print("=== Population Standard Deviation ===")
print("Original Data:", arr)
print("Mean:", mean)
print("Population Standard Deviation:", std_dev_population)
print("Lower Bound (Population):", lower_bound_population)
print("Upper Bound (Population):", upper_bound_population)
print("Cleaned Data (Population):", cleaned_data_population)

# Calculate sample standard deviation (the mean is reused from above)
std_dev_sample = np.std(arr, ddof=1)

# Define outlier threshold (e.g., ±2 standard deviations from the mean)
lower_bound_sample = mean - threshold * std_dev_sample
upper_bound_sample = mean + threshold * std_dev_sample

# Remove outliers
cleaned_data_sample = [x for x in arr if lower_bound_sample <= x <= upper_bound_sample]

# Display results
print("\n=== Sample Standard Deviation ===")
print("Mean:", mean)
print("Sample Standard Deviation:", std_dev_sample)
print("Lower Bound (Sample):", lower_bound_sample)
print("Upper Bound (Sample):", upper_bound_sample)
print("Cleaned Data (Sample):", cleaned_data_sample)

Dataset: [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 100]
Data Types: [<class 'int'>, <class 'float'>, <class 'int'>, <class 'float'>, <class 'int'>, <class 'float'>, <class 'int'>, <class 'float'>, <class 'int'>, <class 'int'>]
=== Population Standard Deviation ===
Original Data: [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 100]
Mean: 12.7
Population Standard Deviation: 29.12576179261239
Lower Bound (Population): -45.551523585224786
Upper Bound (Population): 70.95152358522478
Cleaned Data (Population): [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]

=== Sample Standard Deviation ===
Mean: 12.7
Sample Standard Deviation: 30.701248617387964
Lower Bound (Sample): -48.70249723477593
Upper Bound (Sample): 74.10249723477592
Cleaned Data (Sample): [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]


In [None]:
# Replace outliers with NaN (e.g., age > 100).
# Assumes an existing DataFrame `df` with a numeric 'age' column.
import numpy as np
df['age'] = np.where(df['age'] > 100, np.nan, df['age'])
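The cell above needs an existing DataFrame. A self-contained sketch of the same idea follows, using a hypothetical `df` with made-up ages; the two follow-up options (dropping or imputing the NaN values) are illustrative choices rather than steps from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 130 and 999 are implausible ages
df = pd.DataFrame({"age": [25, 41, 130, 67, 999]})

# Replace implausible values with NaN instead of deleting the rows
df["age"] = np.where(df["age"] > 100, np.nan, df["age"])

# Option 1: drop rows where age is missing
dropped = df.dropna(subset=["age"])

# Option 2: impute missing ages with the median of the remaining values
imputed = df["age"].fillna(df["age"].median())

print(df["age"].tolist())
print(dropped["age"].tolist())
print(imputed.tolist())
```

Marking outliers as NaN first keeps the decision about how to handle them (drop, impute, or leave to a model that tolerates missing values) separate from the detection step.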