# **Handling Outliers (4 Techniques)**
**4 May 2025, 11:45 PM (Sunday)**

- ### **Removing the outlier:** This is the most common method where all detected outliers are removed from the dataset.
- ### **Transforming and binning values:** Outliers can be transformed to compress them into a narrower range, for example with a log or square-root transformation, or grouped into bins so that extreme values fall into the outermost bin.
- ### **Imputation:** Outliers can also be replaced with mean, median, or mode values.
- ### **Separate treatment:** In some use-cases, it’s beneficial to treat outliers separately rather than removing or imputing them.
- ### **Robust statistical methods:** Some statistical methods, such as the median or the interquartile range, are less sensitive to outliers and therefore give more reliable results when outliers are present.
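The handling strategies listed above can be sketched on a small array. This is a minimal illustration, not a recommended pipeline: the sample values and the `is_outlier` mask are hypothetical, and the mask is assumed to come from one of the detection methods shown later.

```python
import numpy as np

# Hypothetical sample: six typical readings and one extreme value
values = np.array([10.0, 12.0, 9.0, 13.0, 8.0, 11.0, 200.0])

# Assumption for illustration: the outlier mask is already known
is_outlier = values > 100

# Removal: keep only the values that are not flagged
removed = values[~is_outlier]

# Transformation: log1p compresses large values far more than small ones
log_transformed = np.log1p(values)

# Imputation: replace flagged values with the median of the remaining data
median_of_rest = np.median(values[~is_outlier])
imputed = np.where(is_outlier, median_of_rest, values)

# Robust statistics: the median barely moves, while the mean is pulled upward
print("Removed:", removed)
print("Log-transformed:", np.round(log_transformed, 2))
print("Imputed:", imputed)
print("Mean:", values.mean(), "Median:", np.median(values))
```

Which strategy fits depends on whether the extreme value is an error (removal or imputation) or a genuine observation (transformation, separate treatment, or robust statistics).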

In [3]:
# import libraries
import pandas as pd
import numpy as np
from scipy import stats

# **1. The Z-Score Method**

In [23]:
# 1. sample data 
data = np.array([10, 12, 9, 13, 8, 11, 20])  # Convert list to numpy array

# 2. calculate z-score 
zscore = np.abs(stats.zscore(data))

# 3. set threshold
threshold = 2

# Find indices of outliers
outlier_indices = np.where(zscore > threshold)[0]

# Extract outliers using indices
outliers = data[outlier_indices]

# Extract cleaned data (non-outliers)
data_cleaned = data[zscore < threshold]

print(f"Outliers are at indices: {outlier_indices}")
print(f"Outlier values are: {outliers}")
print(f"Cleaned data: {data_cleaned}")


Outliers are at indices: [6]
Outlier values are: [20]
Cleaned data: [10 12  9 13  8 11]


**Explanation:**
- Convert data to a NumPy array so that boolean or integer array indexing works.

- Use np.where(zscore > threshold) to get the indices of the outliers.

- Use these indices to extract outliers from data.

- Use boolean indexing zscore < threshold to get the cleaned data without outliers.

- This will correctly identify and print the outliers and the cleaned dataset.
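The score returned by `stats.zscore` can also be reproduced by hand, which makes the threshold easier to interpret: a z-score is the distance from the mean measured in standard deviations. A minimal sketch, assuming the population standard deviation (`ddof=0`, which is SciPy's default); the names `z_manual` and `flagged` are illustrative. Only the value 20 exceeds the threshold here, with a z-score of roughly 2.2.

```python
import numpy as np

data = np.array([10, 12, 9, 13, 8, 11, 20])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
mean = data.mean()
std = data.std(ddof=0)
z_manual = (data - mean) / std

# A point is flagged when it lies more than `threshold` standard deviations from the mean
threshold = 2
flagged = data[np.abs(z_manual) > threshold]

print("z-scores:", np.round(z_manual, 2))
print("Flagged:", flagged)
```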
---
# **2. IQR (Interquartile Range)**

In [26]:
# Sample data
data = [10, 15, 20, 25, 30, 35, 40, 45, 50, 100]

# Calculate the IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

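# Set the multiplier for the fences (1.5 is the conventional choice;
# a larger value such as 3.5 flags nothing in this sample)
threshold = 1.5
lower_bound = q1 - threshold * iqr
upper_bound = q3 + threshold * iqr

# Find outliers: values below the lower fence or above the upper fence
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print("Outliers:", outliers)

Outliers: [100]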
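Instead of dropping values that fall outside the IQR fences, they can be capped at the nearest fence, which is one form of the transformation approach listed at the top. A minimal sketch, assuming the conventional 1.5 multiplier and NumPy's default linear interpolation for percentiles; the helper name `iqr_fences` is hypothetical.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50, 100])
lower, upper = iqr_fences(data)

# Values outside the fences are treated as outliers
outliers = data[(data < lower) | (data > upper)]

# Capping pulls them back to the nearest fence instead of removing them
capped = np.clip(data, lower, upper)

print("Fences:", lower, upper)
print("Outliers:", outliers)
print("Capped:", capped)
```

Capping keeps the sample size unchanged, which can matter for small datasets, at the cost of distorting the tail of the distribution.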


---
# **3. Clustering (K-means)**

In [None]:
# Install library (run this in your terminal or notebook cell)
# !pip install scikit-learn

# Import library
from sklearn.cluster import KMeans

# Sample data
data = [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]

# Create a K-means model with two clusters (normal and outlier)
model = KMeans(n_clusters=2, random_state=42)
model.fit(data)

# Predict cluster labels
labels = model.predict(data)

# Extract outliers; K-means numbers its clusters arbitrarily, so check which
# label the outlier cluster actually received before relying on label == 1
outliers = [data[i] for i, label in enumerate(labels) if label == 1]

print("Outliers:", outliers)


Outliers: [[30, 30], [31, 31], [32, 32]]


---
# **4. ML Algorithms (Isolation Forest)**
- ### Isolation Forest is an algorithm designed specifically for anomaly detection.
- ### It builds an ensemble of random isolation trees; outliers tend to be isolated after fewer splits, so they have shorter path lengths than normal data points.
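The "fewer splits" idea can be illustrated without scikit-learn. The sketch below is a simplified one-dimensional toy, not the library's implementation: it repeatedly picks a random split point and counts how many splits are needed before a given value is alone in its partition. The function names and the sample values are hypothetical.

```python
import random

def isolation_depth(values, target, rng):
    """Count the random splits needed before `target` is alone in its partition."""
    partition = list(values)
    depth = 0
    while len(partition) > 1:
        split = rng.uniform(min(partition), max(partition))
        # Keep only the side of the split that contains the target
        if target < split:
            partition = [v for v in partition if v < split]
        else:
            partition = [v for v in partition if v >= split]
        depth += 1
    return depth

def average_depth(values, target, trials=2000, seed=0):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(trials))
    return total / trials

sample = [10, 11, 12, 13, 14, 100]
print("Average depth for 100:", average_depth(sample, 100))
print("Average depth for 12:", average_depth(sample, 12))
```

The extreme value 100 is usually separated by the very first split, while 12 sits inside a dense group and needs several splits. An Isolation Forest turns this average path length into an anomaly score.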

In [29]:
from sklearn.ensemble import IsolationForest

# Sample data
data = [[2], [3], [4], [30], [31], [32]]

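# Create an Isolation Forest model (contamination is the expected share of outliers;
# without a fixed random_state, the flagged points can vary between runs)
model = IsolationForest(contamination=0.2)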

# Fit the model
model.fit(data)

# Predict outliers (predict returns -1 for outliers and 1 for inliers)
outliers = [data[i] for i, pred in enumerate(model.predict(data)) if pred == -1]
print("Outliers:", outliers)

Outliers: [[2]]
