# Chi-Square Method 
While previous methods work on single variables, the Chi-Square method detects outliers in multivariate data by considering the relationships between variables.

A point might not be extreme in any single dimension, yet still be an outlier when all dimensions are considered together, because its combination of values is unusual.

### The Foundation: Mahalanobis Distance

```
D² = (x - μ)ᵀ Σ⁻¹ (x - μ)
```

Where:

- x = data point (vector)
- μ = mean vector of the dataset
- Σ = covariance matrix of the dataset
- Σ⁻¹ = inverse of the covariance matrix
- D² = squared Mahalanobis distance
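
To make the formula concrete, here is a minimal sketch in NumPy. The covariance matrix and the two test points are made-up illustrative values (two standardized features with correlation 0.9), and the helper name `mahalanobis_sq` is an assumption, not part of any library.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance D² = (x - μ)ᵀ Σ⁻¹ (x - μ)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solving Σ z = diff avoids forming Σ⁻¹ explicitly.
    return float(diff @ np.linalg.solve(sigma, diff))

# Illustrative values: two standardized features with correlation 0.9.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

with_trend = [1.5, 1.5]      # both features high together, as the correlation predicts
against_trend = [1.5, -1.5]  # same univariate z-scores, but an unusual combination

d2_with = mahalanobis_sq(with_trend, mu, sigma)
d2_against = mahalanobis_sq(against_trend, mu, sigma)
print(f"with trend:    D² = {d2_with:.2f}")
print(f"against trend: D² = {d2_against:.2f}")
```

Both points sit 1.5 standard deviations from the mean in each feature, yet D² is about 2.37 for the first and 45 for the second, because only the second contradicts the correlation between the features.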

### Outlier Detection Rule:

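1. Calculate the squared Mahalanobis distance D² for each point.

2. If the data are multivariate normal, D² approximately follows a Chi-Square distribution with p degrees of freedom (where p = number of features).

3. If D² > χ²(p, α), flag the point as an outlier. Here χ²(p, α) is the α-quantile of that distribution, typically α = 0.95 or 0.99 (a significance level of 0.05 or 0.01).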
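
The critical values in this rule can be looked up with SciPy. The sketch below assumes `scipy.stats.chi2` and treats α as a quantile, as the rule does; `d2_example` is a made-up distance used only to show the comparison.

```python
from scipy.stats import chi2

p = 2  # number of features

# chi2.ppf is the inverse CDF: the value below which a fraction alpha of D² falls.
crit_95 = chi2.ppf(0.95, df=p)
crit_99 = chi2.ppf(0.99, df=p)
print(f"χ²({p}, 0.95) = {crit_95:.3f}")
print(f"χ²({p}, 0.99) = {crit_99:.3f}")

# Equivalent view: the upper-tail probability of one observed distance.
d2_example = 7.5  # hypothetical squared distance
p_value = chi2.sf(d2_example, df=p)
print(f"D² = {d2_example}: p-value = {p_value:.4f}, outlier at 95%: {d2_example > crit_95}")
```

Comparing D² with the critical value and comparing the p-value with 1 - α are two views of the same decision.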

### What Makes This Different?
| Method | Scope | Considers Relationships |
|---|---|---|
| Z-score, IQR | Univariate | No |
| Chi-Square | Multivariate | Yes |
### Step-by-Step Example
Let's use a simple 2D dataset of people's height and weight.

Dataset:

```
Person 1: [Height=170cm, Weight=65kg]
Person 2: [Height=175cm, Weight=70kg]
Person 3: [Height=180cm, Weight=75kg]
Person 4: [Height=185cm, Weight=80kg]
Person 5: [Height=190cm, Weight=40kg]  ← Suspicious!
```

Step 1: Calculate Mean Vector (μ)

```
Height mean = (170+175+180+185+190)/5 = 180
Weight mean = (65+70+75+80+40)/5 = 66

μ = [180, 66]
```

Step 2: Calculate Covariance Matrix (Σ)

Covariance measures how two variables change together:

```
Σ = [[Cov(Height, Height), Cov(Height, Weight)],
     [Cov(Weight, Height), Cov(Weight, Weight)]]

Using the sample covariance (dividing by n - 1 = 4):
Cov(Height, Height) =  250/4 = 62.5
Cov(Weight, Weight) =  970/4 = 242.5
Cov(Height, Weight) = -200/4 = -50

Σ = [[62.5,  -50],
     [-50, 242.5]]
```

Step 3: Calculate Inverse Covariance Matrix (Σ⁻¹)

```
det(Σ) = 62.5×242.5 - (-50)×(-50) = 12656.25

Σ⁻¹ = (1/12656.25) × [[242.5, 50], [50, 62.5]]
    ≈ [[0.01916, 0.00395],
       [0.00395, 0.00494]]
```

Step 4: Calculate Mahalanobis Distance for Each Point

Let's calculate for Person 5: [190, 40]

```
x - μ = [190-180, 40-66] = [10, -26]

D² = [10, -26] × Σ⁻¹ × [10, -26]ᵀ
   = [10, -26] × [[0.01916, 0.00395], [0.00395, 0.00494]] × [10, -26]ᵀ
   = [10×0.01916 + (-26)×0.00395, 10×0.00395 + (-26)×0.00494] × [10, -26]ᵀ
   = [0.1916 - 0.1027, 0.0395 - 0.1284] × [10, -26]ᵀ
   = [0.0889, -0.0889] × [10, -26]ᵀ
   = 0.0889×10 + (-0.0889)×(-26)
   = 0.889 + 2.311 = 3.20
```

Step 5: Compare with Chi-Square Critical Value

```
Degrees of freedom = number of features = 2

For α = 0.95, χ²(2, 0.95) = 5.991
For α = 0.99, χ²(2, 0.99) = 9.210
```

Since 3.20 < 5.991, Person 5 is not flagged as an outlier at the 95% confidence level.
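
The hand calculation can be cross-checked in a few lines of NumPy. This is a sketch that assumes the sample covariance (denominator n - 1), which is what `np.cov` computes by default.

```python
import numpy as np

# Height (cm) and weight (kg) for the five people in the example.
data = np.array([[170, 65], [175, 70], [180, 75], [185, 80], [190, 40]], dtype=float)

mu = data.mean(axis=0)                # Step 1: mean vector
sigma = np.cov(data, rowvar=False)    # Step 2: sample covariance (n - 1 denominator)
sigma_inv = np.linalg.inv(sigma)      # Step 3: inverse covariance

diff = data[4] - mu                   # Step 4: Person 5 minus the mean
d2_person5 = float(diff @ sigma_inv @ diff)

print("μ   =", mu)
print("Σ   =", sigma.tolist())
print("Σ⁻¹ =", np.round(sigma_inv, 5).tolist())
print(f"D² for Person 5 = {d2_person5:.3f}")
```
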
### More Dramatic Example
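Let's make the outlier more extreme:

Person 5 (extreme): [190cm, 40kg] → [190cm, 35kg]

Recalculating:


D² = 3.2  (still < 5.991)

Person 5 is still not flagged. This is a small-sample effect, not a sign that the point is normal: when μ and Σ are estimated from the same n points, D² can never exceed (n - 1)²/n, which is 3.2 for n = 5. The extreme point inflates the covariance estimate and masks itself, so no choice of weight can push Person 5 past the 95% threshold with only five observations.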
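
The effect of moving Person 5 can be checked numerically. This sketch assumes μ and Σ are estimated from all five points with the sample covariance (as `np.cov` does); the helper name `person5_d2` is illustrative.

```python
import numpy as np

def person5_d2(weight):
    """D² of Person 5 when their weight is set to `weight` (kg)."""
    data = np.array(
        [[170, 65], [175, 70], [180, 75], [185, 80], [190, weight]], dtype=float
    )
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False)
    diff = data[4] - mu
    return float(diff @ np.linalg.solve(sigma, diff))

n = 5
upper_bound = (n - 1) ** 2 / n  # largest D² any point can reach with n samples

for weight in (40, 35, 10):
    print(f"weight = {weight:>2} kg -> D² = {person5_d2(weight):.3f} (bound {upper_bound:.1f})")
```

All three weights give D² = 3.2, the upper bound (n - 1)²/n for five points, so the threshold of 5.991 is out of reach at this sample size.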

### Why This Matters: The Power of Multivariate Detection
Consider these scenarios:

- "Tall and Lightweight": [190cm, 50kg]. Not extreme in either dimension alone, but an unusual combination
- "High Income, Low Spending": [$200k salary, $500 monthly spending]. Unusual pattern
- "Young CEO": [Age=25, Company_Revenue=$10M]. Rare combination

### When to Use Chi-Square Method
Excellent for:

- Multivariate datasets (2+ correlated features)
- Finding unusual combinations of values
- Quality control in manufacturing
- Fraud detection (unusual behavior patterns)
- Anomaly detection in complex systems

### Requirements:

- The dataset should have more observations than features (n > p)
- Variables should be roughly normally distributed
- No perfect multicollinearity (otherwise Σ cannot be inverted)
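
The first and third requirements can be screened before computing any distances. The sketch below is one possible set of checks; the function name `check_requirements` and the condition-number cutoff of 1e10 are assumptions chosen for illustration, not established defaults, and normality is not tested here.

```python
import numpy as np

def check_requirements(data, cond_limit=1e10):
    """Basic sanity checks before a Mahalanobis / Chi-Square outlier test."""
    X = np.asarray(data, dtype=float)
    n, p = X.shape
    cov = np.cov(X, rowvar=False)
    return {
        "n_observations": n,
        "n_features": p,
        "more_rows_than_features": n > p,
        # A huge (or infinite) condition number signals near-perfect multicollinearity.
        "well_conditioned": bool(np.linalg.cond(cov) < cond_limit),
    }

# Second column is exactly twice the first: perfect multicollinearity.
collinear = [[1, 2], [2, 4], [3, 6], [4, 8]]
# Height-weight data from the worked example.
height_weight = [[170, 65], [175, 70], [180, 75], [185, 80], [190, 40]]

report_collinear = check_requirements(collinear)
report_hw = check_requirements(height_weight)
print(report_collinear)
print(report_hw)
```

A condition-number check is used instead of waiting for `np.linalg.inv` to fail, since a nearly singular matrix can often be inverted without raising an error while still giving unreliable results.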

### Limitations:

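- Sensitive to outliers in the mean/covariance estimation (like Z-score): extreme points distort the very estimates they are measured against
- Computationally expensive in high dimensions: inverting the p × p covariance matrix costs roughly O(p³)
- Assumes a multivariate normal distribution; the Chi-Square threshold is unreliable otherwise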
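
The first limitation (masking) can be illustrated by comparing in-sample distances with leave-one-out distances, where each point is scored against a mean and covariance estimated from the other points. This is a sketch on synthetic data: the generator seed, the sample size of 30, the planted point [3, -3], and the function names are all illustrative assumptions, and leave-one-out is used here as a simple stand-in, not as a method the text above prescribes.

```python
import numpy as np
from scipy.stats import chi2

def d2_in_sample(X):
    """D² of every row, with μ and Σ estimated from all rows (outliers included)."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

def d2_leave_one_out(X):
    """D² of every row, with μ and Σ estimated from the remaining rows only."""
    out = np.empty(len(X))
    for i in range(len(X)):
        rest = np.delete(X, i, axis=0)
        diff = X[i] - rest.mean(axis=0)
        out[i] = diff @ np.linalg.solve(np.cov(rest, rowvar=False), diff)
    return out

rng = np.random.default_rng(0)
clean = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=30)
X = np.vstack([clean, [3.0, -3.0]])  # planted point that goes against the correlation

critical = chi2.ppf(0.95, df=2)
in_sample = d2_in_sample(X)
loo = d2_leave_one_out(X)
print(f"critical value: {critical:.3f}")
print(f"planted point, in-sample D²:     {in_sample[-1]:.2f}")
print(f"planted point, leave-one-out D²: {loo[-1]:.2f}")
```

The in-sample distance is capped at (n - 1)²/n because the planted point inflates the covariance it is measured against; the leave-one-out distance has no such cap.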

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import chi2

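def detect_multivariate_outliers(data, alpha=0.95):
    """
    Detect multivariate outliers using Mahalanobis Distance and Chi-Square test.

    alpha is the Chi-Square quantile used as the cutoff (0.95 or 0.99 are typical).
    Returns (outliers, squared distances, critical value), where each outlier
    is a tuple (index, point, squared distance).
    """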
    data = np.array(data)
    
    # Calculate mean and covariance
    mean = np.mean(data, axis=0)
    cov = np.cov(data, rowvar=False)
    
    # Calculate inverse covariance matrix
    try:
        inv_cov = np.linalg.inv(cov)
    except np.linalg.LinAlgError:
        print("Matrix is singular, using pseudo-inverse")
        inv_cov = np.linalg.pinv(cov)
    
    # Calculate Mahalanobis distance for each point
    mahalanobis_distances = []
    for point in data:
        diff = point - mean
        distance_sq = diff @ inv_cov @ diff  # diff is 1-D, so no transpose is needed
        mahalanobis_distances.append(distance_sq)
    
    # Chi-square critical value
    p = data.shape[1]  # number of features
    critical_value = chi2.ppf(alpha, p)
    
    # Identify outliers
    outliers = []
    for i, distance_sq in enumerate(mahalanobis_distances):
        if distance_sq > critical_value:
            outliers.append((i, data[i], distance_sq))
    
    return outliers, mahalanobis_distances, critical_value

# Example 1: Height-Weight dataset
print("=== Height-Weight Example ===")
data_2d = [
    [170, 65],
    [175, 70], 
    [180, 75],
    [185, 80],
    [190, 40]  # Potential outlier
]

outliers, distances, critical = detect_multivariate_outliers(data_2d)
print(f"Critical value (χ²): {critical:.3f}")
for i, dist in enumerate(distances):
    print(f"Point {i}: {data_2d[i]}, D² = {dist:.3f}, Outlier: {dist > critical}")

print(f"\nOutliers detected: {len(outliers)}")

# Example 2: More realistic example with 3 features
print("\n=== 3-Feature Example (Age, Income, Spending) ===")
data_3d = [
    [25, 50000, 2000],
    [30, 60000, 2500],
    [35, 70000, 3000],
    [40, 80000, 3500],
    [45, 90000, 4000],
    [25, 200000, 1000]  # Young with high income but low spending: the intended outlier
]

outliers, distances, critical = detect_multivariate_outliers(data_3d)
print(f"Critical value (χ²): {critical:.3f}")
for i, dist in enumerate(distances):
    print(f"Point {i}: {data_3d[i]}, D² = {dist:.3f}, Outlier: {dist > critical}")

=== Height-Weight Example ===
Critical value (χ²): 5.991
Point 0: [170, 65], D² = 2.000, Outlier: False
Point 1: [175, 70], D² = 0.400, Outlier: False
Point 2: [180, 75], D² = 0.400, Outlier: False
Point 3: [185, 80], D² = 2.000, Outlier: False
Point 4: [190, 40], D² = 3.200, Outlier: False

Outliers detected: 0

=== 3-Feature Example (Age, Income, Spending) ===
Critical value (χ²): 7.815
Point 0: [25, 50000, 2000], D² = 2.477, Outlier: False
Point 1: [30, 60000, 2500], D² = 0.668, Outlier: False
Point 2: [35, 70000, 3000], D² = 0.230, Outlier: False
Point 3: [40, 80000, 3500], D² = 1.154, Outlier: False
Point 4: [45, 90000, 4000], D² = 3.308, Outlier: False
Point 5: [25, 200000, 1000], D² = 5.174, Outlier: False
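
The 3-feature result deserves a second look: the planted point is not flagged, and the distances themselves are suspect. Points 0-4 increase in lockstep (each row adds [5, 10000, 500]), so they lie on a straight line and the six points span at most a plane in 3D. The sample covariance is then singular in exact arithmetic, and `np.linalg.inv` may return numerically meaningless values without raising `LinAlgError`. A minimal check, where the variable names are illustrative:

```python
import numpy as np

data_3d = np.array([
    [25, 50000, 2000],
    [30, 60000, 2500],
    [35, 70000, 3000],
    [40, 80000, 3500],
    [45, 90000, 4000],
    [25, 200000, 1000],
])

# Consecutive differences of the first five rows: identical rows mean collinear points.
steps = np.diff(data_3d[:5], axis=0)
collinear = bool((steps == steps[0]).all())

# Condition number of the sample covariance; very large values mean the
# inverse (and every D² computed from it) is dominated by rounding error.
cov = np.cov(data_3d, rowvar=False)
cond = np.linalg.cond(cov)

print("points 0-4 collinear:", collinear)
print(f"condition number of Σ: {cond:.3e}")
```

Rescaling the features would not help, because the rank deficiency is structural; the example needs more varied data before its D² values can be interpreted.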
