# IQR Method

The IQR method is non-parametric: it does not assume any specific distribution. It uses quartiles to identify outliers based on the spread of the middle 50% of the data.

#### Key Definitions:

- Q1 (First Quartile): the 25th percentile; 25% of the data falls below this value
- Q3 (Third Quartile): the 75th percentile; 75% of the data falls below this value
- IQR (Interquartile Range): IQR = Q3 - Q1; the spread of the middle 50% of the data

#### Outlier Detection Rule (Tukey's Fences):

Lower Bound = Q1 - 1.5 × IQR

Upper Bound = Q3 + 1.5 × IQR

Any data point below the lower bound or above the upper bound is flagged as an outlier.

#### For Extreme Outliers:

Extreme Lower Bound = Q1 - 3 × IQR

Extreme Upper Bound = Q3 + 3 × IQR

Points beyond these wider bounds are considered extreme outliers.
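
The two sets of fences can be applied together. The sketch below is only an illustration: the function name `classify_by_fences`, the labels `"mild"`, `"extreme"` and `"normal"`, and the sample values and quartiles are all assumptions made for this example, not part of the rule itself.

```python
def classify_by_fences(values, q1, q3):
    """Label each value using the 1.5 x IQR and 3 x IQR fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    labelled = []
    for v in values:
        if v < outer_low or v > outer_high:
            label = "extreme"   # beyond the 3 x IQR fences
        elif v < inner_low or v > inner_high:
            label = "mild"      # beyond 1.5 x IQR but within 3 x IQR
        else:
            label = "normal"
        labelled.append((v, label))
    return labelled


# Hypothetical quartiles: Q1 = 12, Q3 = 16, so IQR = 4,
# inner fences = [6, 22], outer fences = [0, 28]
for value, label in classify_by_fences([5, 12, 14, 24, 40], q1=12, q3=16):
    print(value, label)
```

With these made-up inputs, 5 and 24 fall between the two sets of fences, while 40 lies beyond the outer fence.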

### Step-by-Step Example

Let's use our familiar dataset:

`Dataset: [10, 12, 12, 13, 14, 15, 16, 120]`

```text
Step 1: Sort the data

Sorted: [10, 12, 12, 13, 14, 15, 16, 120]

Step 2: Find Q1 (25th percentile)

Position of Q1 = (n + 1) × 0.25 = (8 + 1) × 0.25 = 2.25
Position 2.25 lies between positions 2 and 3, so we interpolate:
Value at position 2 = 12, value at position 3 = 12
Q1 = 12 + 0.25 × (12 - 12) = 12

Step 3: Find Q3 (75th percentile)

Position of Q3 = (n + 1) × 0.75 = (8 + 1) × 0.75 = 6.75
Position 6.75 lies between positions 6 and 7:
Value at position 6 = 15, value at position 7 = 16
Q3 = 15 + 0.75 × (16 - 15) = 15.75

Step 4: Calculate the IQR

IQR = Q3 - Q1 = 15.75 - 12 = 3.75

Step 5: Calculate the outlier boundaries

Lower Bound = Q1 - 1.5 × IQR = 12 - 1.5 × 3.75 = 12 - 5.625 = 6.375
Upper Bound = Q3 + 1.5 × IQR = 15.75 + 1.5 × 3.75 = 15.75 + 5.625 = 21.375

Step 6: Identify outliers

Check each value against the bounds [6.375, 21.375]:

10:  6.375 ≤ 10 ≤ 21.375 → not an outlier
12:  within bounds → not an outlier
12:  within bounds → not an outlier
13:  within bounds → not an outlier
14:  within bounds → not an outlier
15:  within bounds → not an outlier
16:  within bounds → not an outlier
120: 120 > 21.375 → outlier
```
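
The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch of the (n + 1) × p position rule used in the steps; the helper name `quantile_n_plus_1` is an assumption made for this example, and it does not reproduce NumPy's default behaviour.

```python
def quantile_n_plus_1(sorted_values, p):
    """Quantile at 1-based position (n + 1) * p with linear interpolation."""
    n = len(sorted_values)
    pos = (n + 1) * p
    pos = min(max(pos, 1), n)   # keep the position inside the data
    lower = int(pos)            # 1-based index of the value just below
    frac = pos - lower          # how far we are towards the next value
    if lower >= n:
        return float(sorted_values[-1])
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + frac * (above - below)


data = sorted([10, 12, 12, 13, 14, 15, 16, 120])

q1 = quantile_n_plus_1(data, 0.25)   # 12.0
q3 = quantile_n_plus_1(data, 0.75)   # 15.75
iqr = q3 - q1                        # 3.75

lower_bound = q1 - 1.5 * iqr         # 6.375
upper_bound = q3 + 1.5 * iqr         # 21.375

outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(q1, q3, iqr)
print(lower_bound, upper_bound)
print(outliers)                      # [120]
```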
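#### Different Methods for Calculating Quartiles
There are several ways to calculate Q1 and Q3, and they can give slightly different bounds. The most common ones are:
```text
- Method 1: Tukey's Method (Inclusive)

Split the sorted data at the median and take the median of each half.
If n is odd, include the median in both halves; if n is even, the data splits evenly.

Our dataset (n = 8): [10, 12, 12, 13, 14, 15, 16, 120]

Lower half: [10, 12, 12, 13] → Q1 = median = 12

Upper half: [14, 15, 16, 120] → Q3 = median = 15.5

IQR = 15.5 - 12 = 3.5

Upper Bound = 15.5 + 1.5 × 3.5 = 20.75

- Method 2: Exclusive Median

If n is odd, leave the median out of both halves.

Our dataset has an even n, so the median (13.5) is not a data point and nothing is left out.
The halves are the same as in Method 1: Q1 = 12, Q3 = 15.5, IQR = 3.5, Upper Bound = 20.75.

Methods 1 and 2 give different quartiles only when n is odd.

- Method 3: Linear Interpolation (Most Common)

Uses percentile positions with interpolation.

The main example uses position (n + 1) × p, which gives Q3 = 15.75.

NumPy and pandas default to position (n - 1) × p + 1, which gives Q3 = 15.25.
```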
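
To see where the median-of-halves methods agree and where they diverge, here is a small sketch. The function name `quartiles_by_halves` and its `include_median` flag are assumptions made for this example, and the 7-value list is simply the dataset with 120 dropped so that n is odd.

```python
from statistics import median


def quartiles_by_halves(values, include_median):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even n: the median lies between two points, the split is unambiguous
        lower, upper = data[:half], data[half:]
    elif include_median:
        # Odd n, inclusive: the middle value belongs to both halves
        lower, upper = data[:half + 1], data[half:]
    else:
        # Odd n, exclusive: the middle value belongs to neither half
        lower, upper = data[:half], data[half + 1:]
    return median(lower), median(upper)


even_data = [10, 12, 12, 13, 14, 15, 16, 120]   # n = 8
odd_data = [10, 12, 12, 13, 14, 15, 16]         # n = 7 (hypothetical variant)

print(quartiles_by_halves(even_data, include_median=True))    # (12.0, 15.5)
print(quartiles_by_halves(even_data, include_median=False))   # (12.0, 15.5)
print(quartiles_by_halves(odd_data, include_median=True))     # (12.0, 14.5)
print(quartiles_by_halves(odd_data, include_median=False))    # (12, 15)
```

For the even-sized dataset the two variants coincide; the choice only matters once n is odd.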

#### When to Use IQR Method
Excellent for:

- Non-normal distributions (skewed data)
- Small to medium datasets
- Quick, intuitive analysis
- Cases where you don't know the data distribution
- Visual exploration (boxplots)

Good for real-world data because:

- Q1 and Q3 depend only on the middle of the data, so the bounds barely move when extreme values are present
- It doesn't assume a normal distribution
- It is easy to explain and understand
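
The robustness claim can be illustrated by comparing the IQR with the standard deviation before and after adding one extreme value. This is a rough sketch using the standard library's `statistics.quantiles`, whose `"inclusive"` method matches the (n - 1) × p + 1 interpolation; the helper name `iqr` and the split into `clean` and `contaminated` lists are assumptions made for this example.

```python
from statistics import quantiles, stdev


def iqr(values):
    """Interquartile range using linear interpolation between data points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [10, 12, 12, 13, 14, 15, 16]
contaminated = clean + [120]   # same data plus one extreme value

print(f"IQR:   {iqr(clean)} -> {iqr(contaminated)}")   # 2.5 -> 3.25
print(f"Stdev: {stdev(clean):.2f} -> {stdev(contaminated):.2f}")
```

The IQR shifts by less than one unit, while the standard deviation grows by more than an order of magnitude.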

In [None]:
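import pandas as pd

# Small example dataset with one obvious outlier (100)
df = pd.DataFrame({"value": [10, 12, 11, 15, 14, 100]})

# Quartiles using pandas' default linear interpolation
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
print(Q1)
print(Q3)

# Tukey's fences
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only the rows that fall outside the fences
outliers = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]
print(outliers)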


11.25
14.75
   value
5    100


In [1]:
import numpy as np

def detect_outliers_iqr(data, method='linear'):
    """
    Detect outliers using IQR method
    """
    data = np.array(data)
    
    if method == 'linear':
        # Using numpy percentile with linear interpolation
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
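    elif method == 'tukey':
        # Tukey's method (inclusive): quartiles are the medians of the two halves
        sorted_data = np.sort(data)
        n = len(sorted_data)
        if n % 2 == 0:
            # Even n: the median falls between two points, so split evenly
            lower_half = sorted_data[:n//2]
            upper_half = sorted_data[n//2:]
        else:
            # Odd n: the median is a data point and belongs to both halves
            lower_half = sorted_data[:n//2 + 1]
            upper_half = sorted_data[n//2:]
        q1 = np.median(lower_half)
        q3 = np.median(upper_half)
    else:
        raise ValueError(f"Unknown method: {method!r}")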
    
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
    print(f"Bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
    
    outliers = []
    for i, value in enumerate(data):
        if value < lower_bound or value > upper_bound:
            outliers.append((i, value))
    
    return outliers

# Example usage
data = [10, 12, 12, 13, 14, 15, 16, 120]

print("=== IQR Method - Linear Interpolation ===")
outliers_linear = detect_outliers_iqr(data, method='linear')
print(f"Outliers: {outliers_linear}")

print("\n=== IQR Method - Tukey's Method ===")
outliers_tukey = detect_outliers_iqr(data, method='tukey')
print(f"Outliers: {outliers_tukey}")

# Cross-check the quartiles with SciPy (scoreatpercentile is a legacy function; np.percentile is preferred in new code)
from scipy import stats
stats_result = stats.scoreatpercentile(data, [25, 75])
print(f"\nScipy stats: Q1={stats_result[0]}, Q3={stats_result[1]}")

=== IQR Method - Linear Interpolation ===
Q1: 12.0, Q3: 15.25, IQR: 3.25
Bounds: [7.12, 20.12]
Outliers: [(7, np.int64(120))]

=== IQR Method - Tukey's Method ===
Q1: 12.0, Q3: 15.5, IQR: 3.5
Bounds: [6.75, 20.75]
Outliers: [(7, np.int64(120))]

Scipy stats: Q1=12.0, Q3=15.25


# Handling Outliers with IQR
Once outliers are detected, you can:

- Remove them if they're clearly errors
- Cap them (Winsorization) to the upper/lower bounds
- Investigate why they exist

In [2]:
# Cap outliers using IQR bounds
def cap_outliers_iqr(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    capped_data = np.clip(data, lower_bound, upper_bound)
    return capped_data

capped_data = cap_outliers_iqr(data)
print(f"Original: {data}")
print(f"Capped: {capped_data}")

Original: [10, 12, 12, 13, 14, 15, 16, 120]
Capped: [10.    12.    12.    13.    14.    15.    16.    20.125]
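
Capping is one of the options listed above; removing the flagged points is another. The sketch below is a minimal illustration of removal, assuming the points are known to be errors. The function name `remove_outliers_iqr` and the `k` parameter (the fence multiplier, 1.5 by default) are assumptions made for this example.

```python
import numpy as np


def remove_outliers_iqr(values, k=1.5):
    """Return only the values that lie within the k x IQR fences."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    # Boolean mask: True for values inside [Q1 - k*IQR, Q3 + k*IQR]
    mask = (arr >= q1 - k * iqr) & (arr <= q3 + k * iqr)
    return arr[mask]


cleaned = remove_outliers_iqr([10, 12, 12, 13, 14, 15, 16, 120])
print(cleaned)
```

Unlike capping, removal changes the length of the data, so any aligned columns or labels would need to be filtered with the same mask.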
