# Z-Score / Standard Deviation Method
This method assumes your data follows a normal distribution (bell curve). The Z-score of a data point tells you how far it lies from the mean, measured in standard deviations.

If a value is too far from the mean, it’s considered an outlier.

 Z = (x - μ) / σ

Where:

Z = Z-score (number of standard deviations from mean)

x = individual data point

μ = mean of the dataset

σ = standard deviation of the dataset
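To make the formula concrete, here is a minimal sketch using only Python's standard library. The function name `z_score` and the sample values are illustrative assumptions, not part of the original material. `statistics.pstdev` is used because it divides by N, i.e. it treats the data as the whole population.

```python
import statistics

def z_score(x, data):
    """Return (x - mean) / population standard deviation of `data`."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # population SD: divides by N
    if sigma == 0:
        raise ValueError("all values are identical, so Z is undefined")
    return (x - mu) / sigma

# Illustrative values only: mean is 5 and population SD is 2
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(9, sample))  # 2.0
```

A result of 2.0 means the value 9 sits two standard deviations above the mean of this sample.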

### Outlier Detection Rule

If |Z| > 3, the point is considered an outlier

Why 3? In a perfect normal distribution:

±1σ covers 68% of data

±2σ covers 95% of data

±3σ covers 99.7% of data

So only about 0.3% of points (roughly 3 in 1,000) should naturally fall beyond 3 standard deviations.
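These percentages can be checked numerically. For a standard normal variable, the probability of landing within k standard deviations of the mean is erf(k / √2). The sketch below uses `math.erf` from the standard library; the helper name `coverage_within` is an illustrative assumption.

```python
import math

def coverage_within(k):
    """Probability that a standard normal value lies within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {coverage_within(k):.2%}")
# within 1 sd: 68.27%
# within 2 sd: 95.45%
# within 3 sd: 99.73%
```

The rounded figures 68%, 95%, and 99.7% are the usual shorthand for these values.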

### Step-by-Step Example

Let's work through a clear example.

Dataset: student test scores out of 100

```[85, 90, 78, 92, 88, 87, 84, 91, 30, 89]```
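If you want to cross-check the hand calculation for this dataset, the following sketch computes the same quantities (mean, population standard deviation, Z-scores, and the |Z| > 3 check) with the standard library. The variable names are illustrative assumptions.

```python
import statistics

scores = [85, 90, 78, 92, 88, 87, 84, 91, 30, 89]

mu = statistics.fmean(scores)                  # mean
sigma = statistics.pstdev(scores)              # population SD (divide by N)
z_scores = [(x - mu) / sigma for x in scores]  # one Z-score per point

print(f"mean = {mu:.1f}, sigma = {sigma:.2f}")  # mean = 81.4, sigma = 17.56
for x, z in zip(scores, z_scores):
    flag = "outlier" if abs(z) > 3 else ""      # apply the |Z| > 3 rule
    print(f"{x:>3}  z = {z:+.2f}  {flag}")
```

Because the code keeps σ at full precision, a Z-score may differ in its last digit from a hand calculation that rounds σ to 17.56 first.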

#### Step 1: Calculate Mean (μ)

```text
Mean = (85 + 90 + 78 + 92 + 88 + 87 + 84 + 91 + 30 + 89) / 10
     = 814 / 10 = 81.4
```

#### Step 2: Calculate Standard Deviation (σ)

First, find the squared differences from the mean:

```text
(85-81.4)² = 12.96
(90-81.4)² = 73.96
(78-81.4)² = 11.56
(92-81.4)² = 112.36
(88-81.4)² = 43.56
(87-81.4)² = 31.36
(84-81.4)² = 6.76
(91-81.4)² = 92.16
(30-81.4)² = 2641.96  ← This will be large!
(89-81.4)² = 57.76

Sum of squared differences = 3084.4
Variance = 3084.4 / 10 = 308.44
Standard Deviation (σ) = √308.44 ≈ 17.56
```

This divides by N (the population standard deviation), which is also what `np.std` computes by default.

#### Step 3: Calculate Z-scores for each point

```text
Z(85)  = (85 - 81.4) / 17.56  =  0.21
Z(90)  = (90 - 81.4) / 17.56  =  0.49
Z(78)  = (78 - 81.4) / 17.56  = -0.19
Z(92)  = (92 - 81.4) / 17.56  =  0.60
Z(88)  = (88 - 81.4) / 17.56  =  0.38
Z(87)  = (87 - 81.4) / 17.56  =  0.32
Z(84)  = (84 - 81.4) / 17.56  =  0.15
Z(91)  = (91 - 81.4) / 17.56  =  0.55
Z(30)  = (30 - 81.4) / 17.56  = -2.93  ← Getting close!
Z(89)  = (89 - 81.4) / 17.56  =  0.43
```

#### Step 4: Identify Outliers

Threshold: |Z| > 3

The score 30 has |Z| = 2.93, which is close to the threshold but does not exceed it, so our rule does not flag it.

No value has |Z| > 3 in this dataset.

### Critical Weakness: The "Masking Effect"

Let me show you why this method has a major flaw. Here is an example where it fails:

```text
Data: [10, 12, 12, 13, 14, 15, 16, 120]

Mean (μ) = (10+12+12+13+14+15+16+120)/8 = 212/8 = 26.5

Sum of squared differences = 10016
Variance = 10016 / 8 = 1252
Standard Deviation (σ) = √1252 ≈ 35.38

Z-score for 120: (120 - 26.5) / 35.38 ≈ 2.64
```

Problem: The obvious outlier 120 only has Z ≈ 2.64, so it is not flagged as an outlier! Why? Because the outlier itself inflated the mean and standard deviation, "masking" itself.
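One way to see the masking effect directly is to compute the Z-score of 120 twice: once with the mean and standard deviation taken from all points, and once with them taken from the remaining points only. This is a minimal sketch for illustration, with variable names that are assumptions; it is not a recommended detection procedure in itself.

```python
import statistics

data = [10, 12, 12, 13, 14, 15, 16, 120]
suspect = 120

# Statistics computed from ALL points (suspect included)
mu_all = statistics.fmean(data)
sigma_all = statistics.pstdev(data)
z_with = (suspect - mu_all) / sigma_all

# Statistics computed from the remaining points only
rest = [x for x in data if x != suspect]
mu_rest = statistics.fmean(rest)
sigma_rest = statistics.pstdev(rest)
z_without = (suspect - mu_rest) / sigma_rest

print(f"120 included: mean={mu_all:.1f}, sd={sigma_all:.2f}, z={z_with:.2f}")
print(f"120 left out: mean={mu_rest:.1f}, sd={sigma_rest:.2f}, z={z_without:.1f}")
```

With 120 included, its Z-score stays under the threshold of 3. Measured against the other seven points alone, it is more than 50 standard deviations away.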

### When to Use This Method
Good for:

Data that is truly normally distributed

Quick preliminary analysis

Situations where you know the data is clean

Poor for:

Small datasets

Data with multiple outliers

Non-normal distributions

When outliers are extreme (they corrupt the mean and SD)
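Small datasets are a poor fit for a concrete reason. With n points and the population standard deviation (dividing by N, as the worked example does), no Z-score can exceed √(n − 1) in absolute value, however extreme the point is. So with 10 or fewer points, such as the test-score example, |Z| > 3 is mathematically impossible. The sketch below illustrates the bound numerically; the helper name `max_abs_z` and the test values are illustrative assumptions.

```python
import math
import statistics

def max_abs_z(data):
    """Largest |Z| in `data`, using the population standard deviation."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return max(abs(x - mu) / sigma for x in data)

for n in (5, 10, 11, 20):
    # n - 1 identical "normal" values plus one absurdly extreme value
    data = [50.0] * (n - 1) + [1e6]
    bound = math.sqrt(n - 1)
    print(f"n={n:>2}: max |Z| = {max_abs_z(data):.3f}, sqrt(n-1) = {bound:.3f}")
```

Even a value of one million cannot push |Z| past the bound, so the threshold of 3 only becomes reachable once the dataset has at least 11 points.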

```python
import numpy as np

def detect_outliers_zscore(data, threshold=3):
    """
    Detect outliers using the Z-score method.

    Returns a list of (index, value, z_score) tuples for points with |Z| > threshold.
    """
    mean = np.mean(data)
    std = np.std(data)  # population standard deviation (ddof=0)
    if std == 0:
        return []  # all values are identical, so nothing can be an outlier
    z_scores = [(x - mean) / std for x in data]

    outliers = []
    for i, z in enumerate(z_scores):
        if abs(z) > threshold:
            outliers.append((i, data[i], z))

    return outliers

# You have a dataset, and there's one huge number (300) that doesn't make sense.
# If you don't catch it, your entire prediction model will be wrong.
data = [85, 90, 78, 92, 300, 88, 87, 84, 91, 30, 44, 89, 100]

outliers = detect_outliers_zscore(data)
print(f"Outliers detected: {outliers}")
```

```text
Outliers detected: [(4, 300, np.float64(3.2920335686769806))]
```
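For larger arrays, the same rule can be written without an explicit loop. The following is a sketch of a vectorized variant, assuming NumPy is available; the function name `zscore_outlier_mask` and its return shape are illustrative assumptions rather than part of the code above.

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Return (z_scores, mask), where mask marks entries with |Z| > threshold."""
    arr = np.asarray(values, dtype=float)
    std = arr.std()  # ddof=0: population SD, same convention as np.std(data)
    if std == 0:
        # All values identical: no point deviates, so nothing is flagged
        return np.zeros_like(arr), np.zeros(arr.shape, dtype=bool)
    z = (arr - arr.mean()) / std
    return z, np.abs(z) > threshold

data = [85, 90, 78, 92, 300, 88, 87, 84, 91, 30, 44, 89, 100]
z, mask = zscore_outlier_mask(data)
print(np.asarray(data)[mask])  # [300]
print(np.round(z, 2))          # Z-score of every point, for inspection
```

Returning the full Z-score array alongside the mask makes it easy to see where the unflagged points sit relative to the threshold.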
