# Removing the Outlier Rows
What it is: Delete every observation (row) that contains an outlier value.

How it works:

Identify outliers using any detection method

Remove the entire row from the dataset

Work with the remaining "clean" data

# Original data with outliers
data = [10, 12, 15, 18, 20, 22, 25, 28, 30, 150]  # 150 is outlier

# After removing outlier
clean_data = [10, 12, 15, 18, 20, 22, 25, 28, 30]

In [1]:
import numpy as np
import pandas as pd

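def remove_outlier_rows(df, column, method='iqr'):
    """
    Return the rows of df whose value in `column` is not an outlier.
    method: 'iqr' (1.5 * IQR fences) or 'zscore' (mean +/- 3 std).
    """
    data = df[column].copy()
    
    if method == 'iqr':
        Q1 = data.quantile(0.25)
        Q3 = data.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
    elif method == 'zscore':
        # pandas .std() is the sample standard deviation (ddof=1);
        # on very small samples a 3-sigma rule may flag nothing
        mean = data.mean()
        std = data.std()
        lower_bound = mean - 3 * std
        upper_bound = mean + 3 * std
    else:
        # Without this branch an unknown method fails later with a NameError
        raise ValueError(f"Unknown method: {method!r}")
    
    # Keep only non-outlier rows
    mask = (data >= lower_bound) & (data <= upper_bound)
    return df[mask]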

# Example usage
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 200, 45, 50],  # 200 is outlier
    'income': [50000, 60000, 70000, 80000, 90000, 100000, 110000]
})

clean_df = remove_outlier_rows(df, 'age')
print(f"Original: {len(df)} rows, After removal: {len(clean_df)} rows")

Original: 7 rows, After removal: 6 rows
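
When several numeric columns need checking, the same IQR fences can be computed per column and combined into a single row mask. The sketch below is one possible extension rather than part of the function above; the name `remove_outlier_rows_multi` and the `k` parameter are hypothetical.

```python
import pandas as pd

def remove_outlier_rows_multi(df, columns, k=1.5):
    """Drop rows where any listed column falls outside its k * IQR fences."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # A row survives only if it is inside the fences for every column
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 200, 45, 50],
    'income': [50000, 60000, 70000, 80000, 90000, 100000, 110000],
})
clean_df = remove_outlier_rows_multi(df, ['age', 'income'])
print(f"Rows kept: {len(clean_df)} of {len(df)}")
```

Each extra column can only remove more rows, so it is worth checking how much data remains before modelling.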


# Capping using Winsorization
What it is: Instead of removing outliers, replace them with the nearest "acceptable" value at the boundaries.

How it works:

Set upper and lower bounds (e.g., 5th and 95th percentiles)

Values below lower bound → set to lower bound

Values above upper bound → set to upper bound

Types of Winsorization:

Full Winsorization: Cap both tails

One-sided Winsorization: Cap only upper or lower tail

# Original data with bounds at 10th and 90th percentiles
data = [5, 8, 12, 15, 18, 20, 22, 25, 28, 30, 35, 40, 100]
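# Bounds snapped down to actual data values: 10th percentile = 8, 90th percentile = 35
# (np.percentile's default linear interpolation would give 8.8 and 39 instead)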

# After Winsorization
winsorized = [8, 8, 12, 15, 18, 20, 22, 25, 28, 30, 35, 35, 35]
# 5 → 8, 40 → 35, 100 → 35
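
The bounds used in a Winsorization depend on how the percentile is interpolated between data points. As a small check on the example above, the sketch below compares NumPy's default linear interpolation with `method='lower'`, which snaps each bound down to an existing data value; the `method` keyword assumes NumPy 1.22 or newer.

```python
import numpy as np

data = np.array([5, 8, 12, 15, 18, 20, 22, 25, 28, 30, 35, 40, 100], dtype=float)

# Default: linear interpolation between neighbouring data points
lo_lin, hi_lin = np.percentile(data, [10, 90])

# method='lower' picks the data value at or below each percentile position
lo_low, hi_low = np.percentile(data, [10, 90], method='lower')

print(f"linear: {lo_lin:.1f}, {hi_lin:.1f}")
print(f"lower:  {lo_low:.1f}, {hi_low:.1f}")
```

On this data the linear bounds are 8.8 and 39, while the `lower` bounds are 8 and 35, so the choice of method changes which values get capped and to what.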

In [2]:
def winsorize_data(data, lower_pct=5, upper_pct=95):
    """
    Cap values outside the given percentiles (Winsorization).
    Returns the capped array and the two bounds.
    """
    # Cast to float so fractional bounds are not truncated on assignment
    data = np.array(data, dtype=float)
    lower_bound = np.percentile(data, lower_pct)
    upper_bound = np.percentile(data, upper_pct)
    
    winsorized = data.copy()
    winsorized[winsorized < lower_bound] = lower_bound
    winsorized[winsorized > upper_bound] = upper_bound
    
    return winsorized, lower_bound, upper_bound

# Example
salaries = [40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 500000]
winsorized_salaries, lower, upper = winsorize_data(salaries, 10, 90)

print(f"Original: {salaries}")
print(f"Winsorized: {winsorized_salaries}")
print(f"Bounds: ${lower:,.0f} - ${upper:,.0f}")

Original: [40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 500000]
Winsorized: [ 44500.  45000.  50000.  55000.  60000.  65000.  70000.  75000.  80000.
 122000.]
Bounds: $44,500 - $122,000
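
The one-sided variant listed under "Types of Winsorization" can be sketched in a few lines. This is an illustrative helper, not part of the notebook's code; the name `winsorize_upper` is hypothetical, and it assumes only the upper tail is suspect, which is often the case for salary-like data.

```python
import numpy as np

def winsorize_upper(data, upper_pct=95):
    """Cap only the upper tail; the lower tail is left untouched."""
    data = np.asarray(data, dtype=float)
    upper_bound = np.percentile(data, upper_pct)
    # np.minimum replaces every value above the bound with the bound itself
    return np.minimum(data, upper_bound), upper_bound

salaries = [40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 500000]
capped, upper = winsorize_upper(salaries, 90)
print(f"Upper bound: ${upper:,.0f}")
print(f"Lowest value unchanged: {capped[0] == salaries[0]}")
```

Capping the lower tail instead would use `np.maximum` with a lower percentile in the same way.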


# Replacing Values using Mean/Median/Mode
What it is: Replace outlier values with central tendency measures.

Choice of Replacement:

Mean: Good for normally distributed data

Median: Robust, not affected by outliers

Mode: Good for categorical data
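
To see why the median is described as robust, it helps to compare how far each statistic moves when a single extreme value is added. The snippet below is a small illustration using only the standard library; the data matches the example that follows.

```python
from statistics import mean, median

values = [10, 12, 15, 18, 20, 22, 25, 28, 30]
with_outlier = values + [150]

# Compare each statistic before and after the outlier is added
print(f"mean:   {mean(values)} -> {mean(with_outlier)}")
print(f"median: {median(values)} -> {median(with_outlier)}")
```

The mean jumps from 20 to 33, while the median only moves from 20 to 21. This is also why the replacement value should be computed from the non-outlier values when the mean is used.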

# Original data
data = [10, 12, 15, 18, 20, 22, 25, 28, 30, 150]  # 150 is outlier

# Mean replacement
mean = np.mean([10, 12, 15, 18, 20, 22, 25, 28, 30])  # 20.0
mean_replaced = [10, 12, 15, 18, 20, 22, 25, 28, 30, 20.0]

# Median replacement  
median = np.median([10, 12, 15, 18, 20, 22, 25, 28, 30])  # 20.0
median_replaced = [10, 12, 15, 18, 20, 22, 25, 28, 30, 20.0]

In [3]:
def replace_outliers_central_tendency(data, method='median', threshold=3):
    """
    Replace outliers (|z-score| > threshold) with the mean, median, or mode
    of the remaining values.
    """
    # Cast to float so a fractional replacement is not truncated
    data = np.array(data, dtype=float)
    clean_data = data.copy()
    
    # Detect outliers using Z-score (np.std uses the population formula)
    z_scores = np.abs((data - np.mean(data)) / np.std(data))
    outlier_mask = z_scores > threshold
    inliers = data[~outlier_mask]
    
    if method == 'mean':
        replacement = np.mean(inliers)  # Mean of non-outliers
    elif method == 'median':
        replacement = np.median(inliers)  # Median of non-outliers
    elif method == 'mode':
        # Most frequent non-outlier; ties resolve to the smallest value
        values, counts = np.unique(inliers, return_counts=True)
        replacement = values[np.argmax(counts)]
    else:
        raise ValueError(f"Unknown method: {method!r}")
    
    clean_data[outlier_mask] = replacement
    
    return clean_data, outlier_mask, replacement

# Example
test_scores = [85, 88, 90, 92, 95, 96, 98, 99, 100, 45]  # 45 is outlier
# With 10 points a population z-score cannot exceed sqrt(10 - 1) = 3,
# so the default threshold of 3 would flag nothing; 2.5 is used here.
scores_median_replaced, mask, replacement = replace_outliers_central_tendency(
    test_scores, 'median', threshold=2.5)

print(f"Original: {test_scores}")
print(f"Median replaced: {scores_median_replaced}")
print(f"Outlier detected at index: {np.where(mask)[0]}")
print(f"Replacement value: {replacement}")

Original: [85, 88, 90, 92, 95, 96, 98, 99, 100, 45]
Median replaced: [ 85.  88.  90.  92.  95.  96.  98.  99. 100.  95.]
Outlier detected at index: [9]
Replacement value: 95.0
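
Z-score detection is fragile on small samples: with the population standard deviation, no z-score in a sample of n points can exceed sqrt(n - 1), so for ten points a threshold of 3 never fires. One alternative, sketched here under the assumption that the 1.5 * IQR fences from the row-removal section are an acceptable detector, is to flag points with the IQR rule and then replace them with the median of the rest. The function name `replace_outliers_iqr_median` is hypothetical.

```python
import numpy as np

def replace_outliers_iqr_median(data, k=1.5):
    """Flag points outside the k * IQR fences and replace them with the median of the rest."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    outlier_mask = (data < q1 - k * iqr) | (data > q3 + k * iqr)
    # Compute the replacement from the non-outliers only
    replacement = np.median(data[~outlier_mask])
    cleaned = np.where(outlier_mask, replacement, data)
    return cleaned, outlier_mask, replacement

test_scores = [85, 88, 90, 92, 95, 96, 98, 99, 100, 45]
cleaned, mask, replacement = replace_outliers_iqr_median(test_scores)
print(f"Outlier indices: {np.where(mask)[0]}")
print(f"Replacement value: {replacement}")
```

On these scores the fences are 75 and 111, so only the score of 45 is flagged and it is replaced with 95.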


# Using Quantile Clipping
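What it is: The same capping idea as Winsorization, with the bounds given as quantiles (fractions between 0 and 1) rather than percentiles; every value outside the bounds is clipped to the nearest bound.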

How it works:

Set explicit quantile boundaries (e.g., 0.05 and 0.95)

Values outside these quantiles are set to the boundary values

# Original data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

# 5th percentile = 1.5, 95th percentile = 55 (linear interpolation between 10 and 100)
# After quantile clipping:
clipped = [1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55]

In [4]:
def quantile_clipping(data, lower_quantile=0.05, upper_quantile=0.95):
    """
    Clip values outside specified quantiles
    """
    data = np.array(data)
    lower_bound = np.quantile(data, lower_quantile)
    upper_bound = np.quantile(data, upper_quantile)
    
    clipped = np.clip(data, lower_bound, upper_bound)
    
    return clipped, lower_bound, upper_bound

# Example
house_prices = [150000, 180000, 220000, 250000, 280000, 300000, 
                320000, 350000, 400000, 450000, 500000, 2000000]

clipped_prices, lower, upper = quantile_clipping(house_prices, 0.1, 0.9)

print(f"Original prices: {house_prices}")
print(f"Clipped prices: {clipped_prices}")
print(f"Bounds: ${lower:,.0f} - ${upper:,.0f}")
print(f"Values changed: {house_prices[-1]:,.0f} → {clipped_prices[-1]:,.0f}")

Original prices: [150000, 180000, 220000, 250000, 280000, 300000, 320000, 350000, 400000, 450000, 500000, 2000000]
Clipped prices: [184000. 184000. 220000. 250000. 280000. 300000. 320000. 350000. 400000.
 450000. 495000. 495000.]
Bounds: $184,000 - $495,000
Values changed: 2,000,000 → 495,000
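
For data that already lives in a DataFrame, the same clipping can be done with pandas directly. The sketch below uses `Series.quantile` and `Series.clip`; the helper name `clip_column` and the `homes` frame are hypothetical and only mirror the house-price example.

```python
import pandas as pd

def clip_column(df, column, lower_q=0.05, upper_q=0.95):
    """Return a copy of df with one column clipped to its own quantiles."""
    lower, upper = df[column].quantile([lower_q, upper_q])
    out = df.copy()
    out[column] = df[column].clip(lower=lower, upper=upper)
    return out

homes = pd.DataFrame({'price': [150000, 180000, 220000, 250000, 280000, 300000,
                                320000, 350000, 400000, 450000, 500000, 2000000]})
clipped = clip_column(homes, 'price', 0.1, 0.9)
print(clipped['price'].min(), clipped['price'].max())
```

Returning a copy leaves the original frame untouched, which makes it easy to compare the raw and clipped columns side by side.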


## Domain-Specific Guidelines
Finance:

Winsorization (common practice for returns data)

Quantile clipping for risk management

Healthcare:

Removing rows for impossible values (e.g., negative age)

Median replacement for skewed medical measurements

Machine Learning:

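Quantile clipping or Winsorization of features for outlier-sensitive models (e.g., linear regression, k-NN)

Tree-based models split on thresholds and are largely insensitive to extreme feature values, so capping matters mostly for the target variable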

Scientific Research:

Removing rows for measurement errors

Transparent documentation of any handling
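
As an illustration of the healthcare and research guidelines, impossible values can be removed with an explicit domain rule, and the rule can be recorded alongside the number of rows it dropped. This is a minimal sketch: the function `drop_impossible_ages`, the 0 to 120 age range, and the `patients` data are all assumptions made for the example.

```python
import pandas as pd

def drop_impossible_ages(df, column='age', min_age=0, max_age=120):
    """Remove rows whose age is outside a plausible range and report what was dropped."""
    valid = df[column].between(min_age, max_age)
    # Keep a small record of the rule so the handling can be documented
    report = {
        'rows_before': len(df),
        'rows_removed': int((~valid).sum()),
        'rule': f"{min_age} <= {column} <= {max_age}",
    }
    return df[valid], report

patients = pd.DataFrame({'age': [34, -2, 58, 71, 240]})
clean, report = drop_impossible_ages(patients)
print(report)
```

Unlike a statistical rule, a domain rule does not depend on the sample, so it gives the same result no matter how many bad values the data contains.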