
# **Chapter 6: Dealing with Outliers**

- **Outliers** = data points that deviate significantly from the majority.  
  - Example: Most students study 10–20 hrs/week, but one studies 100 hrs → outlier.  
  - Like a violinist playing off-key in an orchestra → noticeable and disruptive.  
- Outliers can **skew results** and lead to **misinterpretations**.  
- Two common detection methods:  
  - **Z-score method**  
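    - Formula: `Z = (X - μ) / σ`, where μ is the mean and σ is the standard deviation.  
    - The Z-score measures how many standard deviations a point lies from the mean.  
    - Values with |Z| > 3 are commonly treated as outliers (a rule of thumb that works best for large, roughly normal samples).  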
  - **Interquartile Range (IQR) method** (to be discussed).  
- Handling outliers is critical for **reliable analysis and modeling**.
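
As a minimal sketch of the Z-score formula above, the scores can be computed by hand with NumPy. The variable names `mu`, `sigma`, and `z` are illustrative choices, and the population standard deviation (`ddof=0`) is assumed so that the result lines up with the default behaviour of `scipy.stats.zscore`.

```python
import numpy as np

# Same example values used in the cells that follow
data = np.array([10, 12, 12, 15, 20, 21, 22, 100])

mu = data.mean()       # mean of the sample
sigma = data.std()     # population standard deviation (ddof=0)

# Z = (X - mu) / sigma, applied element-wise
z = (data - mu) / sigma

print("mean =", mu)
print("z =", np.round(z, 4))
```

By construction the resulting scores have mean 0 and standard deviation 1, which is a quick sanity check on any hand-rolled version.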

In [1]:
# We'll use the SciPy library to calculate the Z-score for an array of numbers:

import numpy as np
from scipy import stats

In [3]:
# example data
data = np.array([10, 12, 12, 15, 20, 21, 22, 100])
print(data)

[ 10  12  12  15  20  21  22 100]


In [6]:
# calculate z-scores
z_scores = stats.zscore(data)
print("z_scores = ",z_scores)

z_scores =  [-0.58704366 -0.51588685 -0.51588685 -0.40915164 -0.23125962 -0.19568122
 -0.16010282  2.61501265]


In [7]:
# find outliers
outliers = data[np.abs(z_scores) > 3]
print("Outliers: ", outliers)

Outliers:  []


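# In this example, 100 is visibly far from the other values, yet the Z-score method with a threshold of 3 does not flag it. Its Z-score is only about 2.62, because the extreme value itself inflates both the mean and the standard deviation of this small sample.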
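
The empty result above is worth a closer look. With the population standard deviation (`ddof=0`), no point in a sample of size n can have |Z| larger than √(n − 1), so in an 8-point sample the threshold of 3 can never be reached. The sketch below checks this on the example data and on a hypothetical, far more extreme value. The names `bound` and `extreme` are illustrative only.

```python
import numpy as np

data = np.array([10, 12, 12, 15, 20, 21, 22, 100])
n = data.size

# Z-scores with population std (ddof=0), as stats.zscore does by default
z = (data - data.mean()) / data.std()
max_abs_z = np.abs(z).max()

# Largest |Z| any single point can reach in a sample of size n (ddof=0)
bound = np.sqrt(n - 1)

print(f"largest |Z| in data  : {max_abs_z:.3f}")
print(f"upper bound sqrt(n-1): {bound:.3f}")

# Hypothetical data: making the last value far more extreme does not help,
# because the mean and standard deviation grow along with it
extreme = np.array([10, 12, 12, 15, 20, 21, 22, 1_000_000])
z_extreme = (extreme - extreme.mean()) / extreme.std()
print(f"largest |Z| with 1,000,000: {np.abs(z_extreme).max():.3f}")
```

For small samples a lower threshold or the IQR method is therefore often a more practical choice.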

**IQR Method** = identifies outliers based on statistical dispersion.  
- Focuses on the **middle 50% of data** (between Q1 and Q3).  
- Steps:  
  1. **Order data** from smallest to largest.  
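  2. **Find Q1 & Q3** → Q1 = median of lower half, Q3 = median of upper half (library functions such as `quantile` interpolate between points instead, so values can differ slightly on small samples).  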
  3. **Calculate IQR** = Q3 – Q1.  
  4. **Bounds**:  
     - Lower = Q1 – 1.5 × IQR  
     - Upper = Q3 + 1.5 × IQR  
  5. **Outliers** = values < Lower or > Upper.
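
The steps above can be sketched as a small helper. `iqr_bounds` and its parameter `k` are hypothetical names introduced here for illustration. The sketch assumes `np.percentile` with its default linear interpolation, which may give slightly different quartiles than the median-of-halves rule in step 2.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    # np.percentile sorts internally, so step 1 needs no explicit sort
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = np.array([10, 12, 12, 15, 20, 21, 22, 100])
lower, upper = iqr_bounds(values)

# Step 5: anything outside [lower, upper] is flagged
outliers = values[(values < lower) | (values > upper)]

print("lower =", lower, "upper =", upper)
print("outliers =", outliers)
```

Keeping the multiplier as a parameter makes it easy to try a stricter or looser rule than the conventional 1.5.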

In [8]:
import pandas as pd

In [9]:
# example data
data = pd.Series([10, 12, 12, 15, 20, 21, 22, 100])
data.head()

0    10
1    12
2    12
3    15
4    20
dtype: int64


In [11]:
# calculate IQR
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print("IQR = ",IQR)

IQR =  9.25


In [12]:
# find outliers
outliers = data[(data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))]
print("Outliers: ", outliers)

Outliers:  7    100
dtype: int64


# In this scenario, the number 100 is identified as an outlier: it lies above the upper bound Q3 + 1.5 × IQR = 21.25 + 13.875 = 35.125.

## **Strategies for Handling Outliers**

- **Capping & Flooring**  
  - Set boundaries for data.  
  - Outliers beyond limits are replaced with nearest boundary.  
  - Useful when values logically cannot exceed certain limits (e.g., age, temperature).  
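
A minimal sketch of capping and flooring with `np.clip`. The IQR bounds are used as the limits here purely as an assumption. In practice the limits might come from domain knowledge instead (for example, a plausible maximum age).

```python
import numpy as np

values = np.array([10, 12, 12, 15, 20, 21, 22, 100], dtype=float)

# Assumed limits: the IQR bounds (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values below `lower` become `lower`; values above `upper` become `upper`
capped = np.clip(values, lower, upper)

print("before:", values)
print("after: ", capped)
```

Only the value 100 changes; every point already inside the limits is left as it was, so the sample size is preserved.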

- **Log Transformation**  
  - Compresses right-skewed data and reduces outlier impact.  
  - Can bring the data closer to a normal distribution; requires strictly positive values.  
  - Best for exponential relationships between variables.  
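
A minimal sketch of a log transformation on the same example values, assuming the natural logarithm (`np.log`) and strictly positive data. The helper `max_abs_z` is a hypothetical name used only to compare how extreme the largest point looks before and after.

```python
import numpy as np

values = np.array([10, 12, 12, 15, 20, 21, 22, 100], dtype=float)

def max_abs_z(x):
    """Largest absolute Z-score in x (population std, ddof=0)."""
    return np.abs((x - x.mean()) / x.std()).max()

# Natural log; only defined for values > 0
logged = np.log(values)

print("range before:", values.max() - values.min())
print("range after: ", round(float(logged.max() - logged.min()), 3))
print("max |Z| before:", round(float(max_abs_z(values)), 3))
print("max |Z| after: ", round(float(max_abs_z(logged)), 3))
```

The transformation is reversible with `np.exp`, so results can be mapped back to the original scale for reporting.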

- **Removing Outliers**  
  - Delete extreme values entirely.  
  - Use only if outlier is an error or irrelevant.  
  - Risk: information loss or bias.  
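
A minimal sketch of removing outliers with a boolean mask, again assuming the IQR bounds as the criterion. Whether a flagged value is actually an error is a judgment the code cannot make, so the removed points are kept in `removed` for inspection.

```python
import numpy as np

values = np.array([10, 12, 12, 15, 20, 21, 22, 100])

# Assumed criterion: the IQR bounds (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True for points inside the bounds, False for outliers
inside = (values >= lower) & (values <= upper)

cleaned = values[inside]    # data with outliers dropped
removed = values[~inside]   # kept aside for review rather than discarded silently

print("cleaned:", cleaned)
print("removed:", removed)
```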