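# OUTLIERS
What are Outliers?

Definition: Observations that deviate significantly from the other observations in a dataset.

Statistical Perspective: Data points that do not conform to the assumed distribution.

Types:

- Point Outliers: Individual data points that are abnormal on their own
- Contextual Outliers: Points that are abnormal only in a specific context (for example, a temperature that is normal in summer but not in winter)
- Collective Outliers: A group of data points that is abnormal collectively, even if each point looks normal individually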

Mathematical Formulations

1.1 Z-Score Method
```text
For data point x_i in feature X:
    z_i = (x_i - μ) / σ

where:
    μ = mean(X)
    σ = standard deviation(X)

Outlier if: |z_i| > threshold (typically 3)
```
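Theoretical Basis: Properties of the normal (Gaussian) distribution, where approximately:

- 68% of the data lie within μ ± σ
- 95% of the data lie within μ ± 2σ
- 99.7% of the data lie within μ ± 3σ

Assumptions: The data follow a normal distribution.

Limitations: Sensitive to extreme values. The mean and standard deviation are themselves not robust, so a large outlier inflates σ and can partly mask itself.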
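
A minimal sketch of the z-score rule, assuming NumPy and the population standard deviation (`ddof=0`). The function name `zscore_outliers` and the sample data are made up for illustration.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Return a boolean mask marking points with |z| > threshold."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()  # population standard deviation (ddof=0)
    z = (x - mu) / sigma
    return np.abs(z) > threshold

# Twenty typical readings plus one extreme value (hypothetical data)
values = np.array([10.0] * 10 + [12.0] * 10 + [100.0])
mask = zscore_outliers(values)
flagged = values[mask]  # only the extreme value is flagged
```

With very small samples the rule can never fire, because a single point cannot lie more than √(n-1) population standard deviations from the mean.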

1.2 Interquartile Range (IQR) Method
```text
Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1

Lower Bound = Q1 - k × IQR
Upper Bound = Q3 + k × IQR
```

Outlier if: x < Lower Bound or x > Upper Bound

Typically k = 1.5 (mild outliers) or k = 3 (extreme outliers).

Theoretical Basis: Quartiles are robust to outliers, so the bounds are not distorted by the extreme values they are meant to detect.

Statistical Properties:

- Non-parametric method
- Does not assume a normal distribution
- Based on data percentiles
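
The bounds above can be sketched as follows, assuming NumPy's default (linear interpolation) percentile definition. Other percentile conventions give slightly different quartiles. `iqr_bounds` and the data are illustrative names and values.

```python
import numpy as np

def iqr_bounds(x, k=1.5):
    """Return (lower, upper) outlier bounds based on the IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)
lower, upper = iqr_bounds(values)  # Q1 = 3.25, Q3 = 7.75, IQR = 4.5
outliers = values[(values < lower) | (values > upper)]
```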

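1.3 Modified Z-Score (Robust Z-Score)
```text
MAD = median(|X_i - median(X)|)
Modified Z = 0.6745 × (X_i - median(X)) / MAD
```
Uses the median and MAD instead of the mean and standard deviation, which makes it more robust. The constant 0.6745 rescales the MAD so that it is comparable to σ for normally distributed data.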
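
A small sketch of the modified z-score, assuming NumPy. The cutoff of 3.5 used here is a commonly cited rule of thumb rather than something fixed by the formula, and the sketch does not handle the case MAD = 0 (more than half the values identical).

```python
import numpy as np

def modified_zscore(x):
    """Robust z-score based on the median and the MAD."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))  # assumed non-zero in this sketch
    return 0.6745 * (x - med) / mad

values = np.array([10, 11, 12, 13, 14, 15, 100], dtype=float)
scores = modified_zscore(values)        # median = 13, MAD = 2
outliers = values[np.abs(scores) > 3.5]  # assumed cutoff of 3.5
```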

1.4 Mahalanobis Distance

`D² = (x - μ)ᵀ Σ⁻¹ (x - μ)`

where:

    x = data point
    μ = mean vector
    Σ = covariance matrix

Theoretical Basis: Measures distance in multivariate space while accounting for the scale of each feature and the correlations between features.
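
As an illustration, the squared distance can be computed for every row of a small dataset. This sketch assumes NumPy, uses the sample covariance estimated from the same data, and uses made-up points where the last one breaks the x ≈ y trend while staying inside each feature's range.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the sample mean."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)      # sample covariance (ddof=1)
    cov_inv = np.linalg.inv(cov)
    diff = X - mu
    # Row-wise quadratic form: diff_i^T @ cov_inv @ diff_i
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

X = np.array([
    [1, 1.1], [2, 1.9], [3, 3.2], [4, 3.8], [5, 5.1], [6, 5.9],
    [3, 5.5],  # off-trend point (hypothetical)
])
d2 = mahalanobis_sq(X)
most_unusual = int(np.argmax(d2))
```

A univariate check on either column would not single out the last point, which is the main reason to use a distance that accounts for correlation.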

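# OUTLIER TREATMENT STRATEGIES: THEORY
2.1 Removal

When to use:

- The outliers are clear measurement or data-entry errors
- Fewer than 5% of the data points are outliers
- The dataset is large enough that dropping rows loses little information

Statistical Impact:

- Reduces variance
- May introduce bias if the outliers are informative
- Changes the estimated distribution parameters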

2.2 Capping/Winsorizing
```text
x_capped =
    ⎧ lower_bound, if x < lower_bound
    ⎨ x,           if lower_bound ≤ x ≤ upper_bound
    ⎩ upper_bound, if x > upper_bound
```
Statistical Properties:

- Preserves sample size
- Reduces variance less than removal
- Creates artificial peaks at the boundaries

Winsorizing: The special case where the bounds are chosen as percentiles of the data (for example, the 5th and 95th), so extreme values are replaced with the nearest retained value.
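
A minimal capping sketch, assuming NumPy and assuming the bounds come from the IQR rule with k = 1.5. Any other choice of `lower` and `upper` (such as fixed percentiles) works the same way.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# Bounds from the IQR rule (an assumption for this example)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# np.clip implements the piecewise definition of x_capped
capped = np.clip(values, lower, upper)
```

The sample size is unchanged, and only the value 50 is pulled in to the upper bound.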

2.3 Transformation
Theory: Apply a mathematical function (such as a log or power transform) that compresses extreme values, reducing the impact of outliers while keeping every observation.

# LOG TRANSFORMATION
3.1 Purpose and Effect
```text
Original: y = f(x)
Log Transform: y' = log(y)
```
Theoretical Effects:

- Variance Stabilization: Reduces heteroscedasticity when the spread grows with the level of the variable
- Symmetrization: Makes right-skewed data more symmetric
- Additive Effects: Multiplicative relationships become additive, since log(a × b) = log(a) + log(b)

3.2 Mathematical Properties
```text
Natural Log (ln) vs Log10

ln(x) = logₑ(x) ≈ 2.3026 × log₁₀(x)
The two differ only by a constant factor, so they have the same effect on distribution shape

Log(1+x) Transformation

y = log(1 + x)
Why add 1? To handle zero values: log(0) is undefined (it tends to -∞), while log(1 + 0) = 0
```
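
A short illustration of both points, assuming NumPy. `np.log1p` computes log(1 + x) and `np.expm1` is its inverse. The sample values are made up.

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

y = np.log1p(x)       # log(1 + x); defined at x = 0
x_back = np.expm1(y)  # inverse transform: exp(y) - 1

# ln and log10 differ only by the constant ln(10) ≈ 2.3026
ratio = np.log(100.0) / np.log10(100.0)
```
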
3.3 Box-Cox Transformation (Generalized)
```text
y(λ) =
    ⎧ (y^λ - 1)/λ, if λ ≠ 0
    ⎩ log(y),      if λ = 0
```
Theoretical Basis:

- Chooses the λ that maximizes the log-likelihood under a normal model for the transformed data
- Makes the data approximately normal
- Assumption: y > 0

Log-Likelihood Function (up to an additive constant):

```text
L(λ) = -(n/2) × ln(σ²(λ)) + (λ-1) × Σ ln(y_i)
```
where σ²(λ) is the variance of the transformed data.
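
To make the role of the log-likelihood concrete, here is a rough sketch that evaluates L(λ) on a grid and keeps the best λ. It assumes NumPy; the helper names `boxcox` and `boxcox_loglik`, the grid, and the synthetic data are illustrative choices, and a library routine would normally use a proper optimizer instead of a grid.

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform for strictly positive y."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def boxcox_loglik(y, lam):
    """Profile log-likelihood L(lam), up to an additive constant."""
    y = np.asarray(y, dtype=float)
    n = y.size
    var = boxcox(y, lam).var()  # variance of the transformed data (ddof=0)
    return -(n / 2.0) * np.log(var) + (lam - 1.0) * np.log(y).sum()

# Synthetic right-skewed data whose logarithm is evenly spaced
y = np.exp(np.linspace(0.0, 3.0, 50))

grid = np.arange(-40, 41) / 20.0  # lam from -2.0 to 2.0 in steps of 0.05
best_lam = max(grid, key=lambda lam: boxcox_loglik(y, lam))
```

Because the log of this synthetic data is symmetric, the search settles on λ = 0, which is the log transform.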

3.4 Yeo-Johnson Transformation
```text
ψ(y, λ) =
    ⎧ [(y+1)^λ - 1]/λ,            if λ ≠ 0, y ≥ 0
    ⎪ log(y+1),                   if λ = 0, y ≥ 0
    ⎨ -[(-y+1)^(2-λ) - 1]/(2-λ),  if λ ≠ 2, y < 0
    ⎩ -log(-y+1),                 if λ = 2, y < 0

For y ≥ 0: Box-Cox applied to y + 1
For y < 0: Box-Cox with parameter 2 - λ applied to -y + 1, then negated
```
Advantage: Handles zero and negative values
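
The four cases translate directly into code. This is a sketch assuming NumPy, with `yeo_johnson` as an illustrative name; it applies a given λ and does not estimate it.

```python
import numpy as np

def yeo_johnson(y, lam):
    """Apply the Yeo-Johnson transform with a fixed lambda."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0

    # Non-negative branch: Box-Cox applied to y + 1
    if lam != 0:
        out[pos] = ((y[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(y[pos])

    # Negative branch: Box-Cox with parameter 2 - lam on -y + 1, negated
    if lam != 2:
        out[~pos] = -(((-y[~pos] + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-y[~pos])
    return out

data = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
identity = yeo_johnson(data, 1.0)  # lam = 1 leaves the data unchanged
```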

# FEATURE TRANSFORMATION THEORY
4.1 Scaling Methods: Mathematical Formulation

Standardization (Z-score Normalization)

`x_standardized = (x - μ) / σ`

Properties:

- Mean = 0, Standard Deviation = 1
- Preserves the shape of the original distribution
- Sensitive to outliers, because μ and σ are

Min-Max Scaling

`x_scaled = (x - min(x)) / (max(x) - min(x))`

Properties:

- Range: [0, 1]
- Sensitive to outliers (the min and max are set by the most extreme values)

Robust Scaling

`x_robust = (x - median(x)) / IQR`

Properties:

- Uses the median and IQR, so the scaling itself is robust to outliers
- Preserves outliers (they are rescaled, not removed)

Unit Vector Scaling (Normalization)

`x_unit = x / ||x||`

where ||x|| = √(Σx_i²) is the L2 norm.
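
The four formulas applied to one small vector, as a sketch assuming NumPy and a made-up feature with a single large value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: zero mean, unit (population) standard deviation
standardized = (x - x.mean()) / x.std()

# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: centre on the median, divide by the IQR
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# Unit vector scaling: divide by the L2 norm
unit = x / np.linalg.norm(x)
```

In `min_max` the four typical values are squeezed into roughly the bottom 3% of the range, while `robust` keeps them spread between -1 and 0.5 and leaves the large value far away. Unit vector scaling is usually applied per sample (row) rather than per feature, so treating `x` as one vector here is only for illustration.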

4.2 Power Transformations Theory

General Power Transformation

`T(y) = y^λ`

Effect on skewness (for y > 0):

- λ > 1: Increases right skew
- λ < 1: Reduces right skew
- λ = 0: Replaced by the log transform, which is the limit of (y^λ - 1)/λ as λ → 0

Theoretical Justification

Power transformations aim to:

- Stabilize variance (homoscedasticity)
- Improve normality (Gaussianity)
- Linearize relationships
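
The effect of λ on skewness can be checked numerically. This sketch assumes NumPy, uses the simple moment-based skewness coefficient, and uses synthetic right-skewed data, so the exact numbers are specific to this example.

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

# Synthetic positive, right-skewed data
y = np.exp(np.linspace(0.0, 3.0, 200))

skew_by_lambda = {lam: skewness(y ** lam) for lam in (2.0, 1.0, 0.5)}
skew_log = skewness(np.log(y))  # the lam -> 0 limit
```

For this data the skewness falls as λ decreases from 2 to 0.5, and the log transform removes it entirely because the logged values are evenly spaced.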


# DOMAIN-DRIVEN FEATURES: THEORETICAL BASIS
5.1 Mathematical Foundation

Interaction Features
```text
z = f(x, y) where f is:
    - Multiplicative: x × y
    - Ratio: x / y
    - Polynomial or power: x², x³, √x
```
Theoretical Basis:

- Captures non-linear relationships
- Based on the Taylor expansion around (0, 0): f(x,y) ≈ f(0,0) + f_x x + f_y y + ½ f_xx x² + f_xy xy + ½ f_yy y² + ...
- The coefficient of the interaction term xy is the cross-derivative f_xy
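
A small least-squares experiment shows why the interaction term matters. This is a sketch assuming NumPy, with a synthetic target y = 3·x1·x2 on a 3×3 grid. The helper `fit_sse` is a hypothetical name.

```python
import numpy as np

# 3x3 grid of inputs and a purely multiplicative target (synthetic)
x1, x2 = np.meshgrid([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
x1, x2 = x1.ravel(), x2.ravel()
y = 3.0 * x1 * x2

def fit_sse(columns):
    """Least-squares fit; returns coefficients and sum of squared errors."""
    A = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, float(((A @ coef - y) ** 2).sum())

ones = np.ones_like(y)
_, sse_linear = fit_sse([ones, x1, x2])               # no interaction term
coef, sse_inter = fit_sse([ones, x1, x2, x1 * x2])    # with x1 * x2
```

The purely linear model explains none of the variation here, while adding the `x1 * x2` column recovers the coefficient 3 with essentially zero error.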

Composite Features

`Composite Score = Σ w_i × f_i(x_i)`

where the w_i are weights chosen from domain knowledge.

5.2 Statistical Justification

Dimensionless Quantities
```text
Reynolds Number = (ρ × v × L) / μ  [Fluid Dynamics]
Sharpe Ratio = (R_p - R_f) / σ_p    [Finance]
```
Advantage: Scale-invariant and comparable across contexts

Information Ratio (Signal-to-Noise)

`IR = μ / σ`

Measures the strength of the signal relative to the noise

# BINNING/DISCRETIZATION THEORY
6.1 Mathematical Formulation

Equal Width Binning
```text
Bin width = (max - min) / k
Bin boundaries: min + i × width, i = 0..k
```
Equal Frequency (Quantile) Binning

- Each bin contains approximately n/k observations
- Boundaries lie at the percentiles 100 × i/k %, i = 0..k
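
The two sets of boundaries can be compared on the same data. This sketch assumes NumPy, k = 4, and a made-up feature with one large value.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)
k = 4

# Equal width: k + 1 evenly spaced edges from min to max
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal frequency: edges at the 0th, 25th, 50th, 75th, 100th percentiles
freq_edges = np.percentile(values, 100 * np.arange(k + 1) / k)

width_counts, _ = np.histogram(values, bins=width_edges)
freq_counts, _ = np.histogram(values, bins=freq_edges)
```

Equal-width bins put nine of the ten points into the first bin because the single large value stretches the range, while the quantile bins hold two or three points each.
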
6.2 Information Theory Basis

Entropy-Based Binning

Maximize Information Gain:

`IG = H(D) - Σ (|D_i|/|D|) × H(D_i)`

where:

    H(D) = -Σ p(c) × log₂ p(c)  [Entropy]
    D_i = data in bin i
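
The information gain of a candidate binning can be computed directly from these definitions. A sketch assuming NumPy, with `entropy` and `information_gain` as illustrative helper names and a tiny synthetic label vector:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, bin_ids):
    """H(D) minus the size-weighted entropy of each bin."""
    labels = np.asarray(labels)
    bin_ids = np.asarray(bin_ids)
    weighted = sum(
        (bin_ids == b).mean() * entropy(labels[bin_ids == b])
        for b in np.unique(bin_ids)
    )
    return entropy(labels) - weighted

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
ig_good = information_gain(labels, [0, 0, 0, 0, 1, 1, 1, 1])  # bins match classes
ig_bad = information_gain(labels, [0, 1, 0, 1, 0, 1, 0, 1])   # bins ignore classes
```

A binning that separates the classes perfectly gains the full 1 bit of entropy, while one that mixes the classes equally in every bin gains nothing.
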
Minimum Description Length (MDL) Principle

`Total Cost = Model Cost + Data Cost`

The optimal binning minimizes the total cost.
6.3 Optimal Binning Algorithms

1. Fisher's Natural Breaks (Jenks)

`Maximize: GVF = (SDAM - SDCM) / SDAM`

where:

    SDAM = Sum of squared deviations from the array mean
    SDCM = Sum of squared deviations from the class means

Because SDAM is fixed for a given dataset, maximizing the Goodness of Variance Fit (GVF) is equivalent to minimizing SDCM.
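
The GVF of a given set of break points is easy to evaluate. The sketch below assumes NumPy and only scores candidate breaks; it does not implement the search for the optimal breaks. The function name `gvf` and the data are illustrative.

```python
import numpy as np

def gvf(values, breaks):
    """Goodness of Variance Fit for classes defined by break points."""
    values = np.asarray(values, dtype=float)
    sdam = ((values - values.mean()) ** 2).sum()
    class_ids = np.digitize(values, breaks)  # class index of each value
    sdcm = sum(
        ((values[class_ids == c] - values[class_ids == c].mean()) ** 2).sum()
        for c in np.unique(class_ids)
    )
    return (sdam - sdcm) / sdam

values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
gvf_natural = gvf(values, [5, 15])      # breaks between the three clusters
gvf_poor = gvf(values, [2.5, 11.5])     # breaks that cut through clusters
```

Breaks that follow the natural gaps give a GVF close to 1 (542/548 here), while breaks that split clusters score lower.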

2. Decision Tree Binning
```text
Splits chosen to maximize one of:
    - Information Gain
    - Gini Impurity reduction
    - Variance reduction (regression)
```
Mathematical Formulation (CART Algorithm):

`ΔI(s) = I(D) - (|D_left|/|D|)×I(D_left) - (|D_right|/|D|)×I(D_right)`

where I is the impurity measure. Choose the split s that maximizes ΔI(s).
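
A single-feature split search can be sketched as follows, assuming NumPy, Gini impurity as the measure I, and midpoints between consecutive distinct values as candidate splits. This is a simplified illustration of the criterion, not a full CART implementation, and `gini` and `best_split` are hypothetical names.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split(x, y):
    """Return the threshold s maximizing the impurity decrease, and that decrease."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    parent = gini(y)
    uniq = np.unique(x)
    candidates = (uniq[:-1] + uniq[1:]) / 2.0  # midpoints between distinct values
    best_s, best_gain = None, -np.inf
    for s in candidates:
        left, right = y[x <= s], y[x > s]
        gain = (parent
                - left.size / y.size * gini(left)
                - right.size / y.size * gini(right))
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s, best_gain

x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
split, gain = best_split(x, y)
```

Applying the search recursively to each side yields the bin edges that a tree-based binning would use.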

# Why binning?

- Reduces noise
- Makes non-linear patterns easier for simple models to capture
- Converts continuous features into categorical ones
- Works well as a preprocessing step for decision tree models

- Equal-Width Binning: `df['bins'] = pd.cut(df['column'], bins=4, labels=False)`
- Equal-Frequency Binning: `df['bins'] = pd.qcut(df['column'], q=4, labels=False)`
- Domain Binning (often the most interpretable), for example:
  - Age → child, teen, adult, senior
  - Income → low, mid, high
  - Blood pressure → normal, high, critical
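
Domain binning can be expressed with `pd.cut` by passing explicit edges and labels. In this sketch the age cut-offs (13, 20, 65) and the column names are assumptions chosen for illustration; real cut-offs should come from the domain.

```python
import pandas as pd

df = pd.DataFrame({"age": [5, 15, 30, 70]})

# Hypothetical cut-offs: [0, 13) child, [13, 20) teen, [20, 65) adult, [65, 120) senior
age_edges = [0, 13, 20, 65, 120]
age_labels = ["child", "teen", "adult", "senior"]

df["age_group"] = pd.cut(
    df["age"],
    bins=age_edges,
    labels=age_labels,
    right=False,  # intervals are closed on the left, open on the right
)
```

Values outside the outermost edges would become missing, so the first and last edges should cover the full plausible range.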

When to use:

- Tree models (RF, XGBoost)
- Meaningful categories are required
- Outlier smoothing
