In [1]:
import pandas as pd

# Handling Outliers
Outliers are values that lie unusually far above or below the rest of the data.

In [2]:
# Sample DataFrame
df = pd.DataFrame({
    "Employee": ["A", "B", "C", "D", "E", "F"],
    "Salary": [50000, 52000, 51000, 49500, 50500, 200000]
})

df

Unnamed: 0,Employee,Salary
0,A,50000
1,B,52000
2,C,51000
3,D,49500
4,E,50500
5,F,200000


## 1. Why Outliers are a Problem

Outliers can:
- Skew the mean
- Inflate the standard deviation (SD)
- Distort correlations
- Mislead models and decisions

In [3]:
df["Salary"].mean(), df["Salary"].median()

(np.float64(75500.0), np.float64(50750.0))

Here a single extreme salary pulls the mean up to 75,500, while the median stays at 50,750, close to a typical salary. The median is much more stable.
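
To see how much of that gap comes from one value, the statistics can be recomputed with employee F left out. This is a minimal sketch on the same salary figures; the names `salaries`, `without_outlier`, and `comparison` are illustrative and not part of the notebook.

```python
import pandas as pd

# Same salary values as the sample DataFrame
salaries = pd.Series([50000, 52000, 51000, 49500, 50500, 200000])

# Drop the single extreme value (employee F) for comparison
without_outlier = salaries[salaries != 200000]

comparison = pd.DataFrame(
    {
        "mean": [salaries.mean(), without_outlier.mean()],
        "median": [salaries.median(), without_outlier.median()],
    },
    index=["with outlier", "without outlier"],
)
print(comparison)
```

Dropping one row moves the mean from 75,500 to 50,600, while the median only moves from 50,750 to 50,500.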

## 2. Detect Outliers Using IQR

In [4]:
# Step 1
# Calculate quartiles

Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1

In [5]:
# Step 2
# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

In [6]:
lower_bound, upper_bound

(np.float64(47687.5), np.float64(54187.5))

In [10]:
# Step 3
# Identify outliers
df[(df["Salary"] < lower_bound) | (df["Salary"] > upper_bound)]

Unnamed: 0,Employee,Salary
5,F,200000
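
Steps 1 to 3 can be wrapped in a small reusable function. The sketch below assumes a hypothetical helper named `iqr_bounds` with a multiplier `k` that defaults to the conventional 1.5; neither name comes from pandas.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


df = pd.DataFrame({
    "Employee": ["A", "B", "C", "D", "E", "F"],
    "Salary": [50000, 52000, 51000, 49500, 50500, 200000],
})

lower, upper = iqr_bounds(df["Salary"])

# A row is an outlier if it falls below the lower fence OR above the upper fence
outliers = df[(df["Salary"] < lower) | (df["Salary"] > upper)]
print(lower, upper)
print(outliers)
```

A larger `k` (3 is a common choice for "extreme" outliers) widens the fences and flags fewer rows.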


## 3. Remove Outliers

In [11]:
df_no_outliers = df[
    (df["Salary"] >= lower_bound) & (df["Salary"] <= upper_bound)
]
df_no_outliers

Unnamed: 0,Employee,Salary
0,A,50000
1,B,52000
2,C,51000
3,D,49500
4,E,50500
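
The same filter can also be written with `Series.between`, which is inclusive on both ends by default. This is a sketch of an equivalent form, not a replacement for the cell above; `mask` and `removed` are illustrative names, and the `reset_index(drop=True)` call is optional.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["A", "B", "C", "D", "E", "F"],
    "Salary": [50000, 52000, 51000, 49500, 50500, 200000],
})

# Recompute the IQR fences so this block runs on its own
Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# True for rows inside [lower_bound, upper_bound]
mask = df["Salary"].between(lower_bound, upper_bound)

# Keep the inliers and renumber the index from 0
df_no_outliers = df[mask].reset_index(drop=True)

# Report how many rows were dropped
removed = len(df) - len(df_no_outliers)
print(f"Removed {removed} of {len(df)} rows")
```

Counting the removed rows is a quick check that the filter did not discard more data than intended.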


## 4. Cap (Clip) Outliers Instead of Removing

Capping limits extreme values but keeps every row.  
Values above `upper_bound` are replaced by `upper_bound`, and values below `lower_bound` are replaced by `lower_bound`.

In [12]:
df["Salary_capped"] = df["Salary"].clip(lower_bound, upper_bound)

In [13]:
df

Unnamed: 0,Employee,Salary,Salary_capped
0,A,50000,50000.0
1,B,52000,52000.0
2,C,51000,51000.0
3,D,49500,49500.0
4,E,50500,50500.0
5,F,200000,54187.5
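
To check what capping changed, the capped column can be compared with the original. A minimal sketch, recomputing the bounds so it runs on its own; `capped` and `n_capped` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["A", "B", "C", "D", "E", "F"],
    "Salary": [50000, 52000, 51000, 49500, 50500, 200000],
})

Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# clip returns a new Series; the original column is left unchanged
capped = df["Salary"].clip(lower=lower_bound, upper=upper_bound)

# Number of values that were actually altered by capping
n_capped = int((capped != df["Salary"]).sum())

print(f"Values capped: {n_capped}")
print(f"Mean before: {df['Salary'].mean():.2f}, after: {capped.mean():.2f}")
```

Only F's salary changes, and the mean of the capped column drops to roughly 51,198. The capped column is float because the bounds are floats.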


## 5. Detect Outliers Using Z-Score (Conceptual)
A z-score measures how many standard deviations a value is from the mean.

In [14]:
mean = df["Salary"].mean()
std = df["Salary"].std()

df["z_score"] = (df["Salary"] - mean) / std

In [15]:
df

Unnamed: 0,Employee,Salary,Salary_capped,z_score
0,A,50000,50000.0,-0.418044
1,B,52000,52000.0,-0.385256
2,C,51000,51000.0,-0.40165
3,D,49500,49500.0,-0.426241
4,E,50500,50500.0,-0.409847
5,F,200000,54187.5,2.041038


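### |Z| > 3: Considered an Outlier

In this sample, F has the largest |Z| at about 2.04, so this rule does not flag any row, even though the IQR method does.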
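
The threshold can be applied as a boolean filter. The sketch below uses the same salaries and the sample standard deviation (pandas' default, `ddof=1`); `flagged` and `max_possible_z` are illustrative names. It also computes the largest |Z| any value can reach in a sample of size n, which is (n - 1) / sqrt(n) when the sample standard deviation is used.

```python
import math

import pandas as pd

salary = pd.Series(
    [50000, 52000, 51000, 49500, 50500, 200000],
    index=["A", "B", "C", "D", "E", "F"],
)

# z-score with the sample standard deviation (ddof=1, the pandas default)
z = (salary - salary.mean()) / salary.std()

# Rows whose absolute z-score exceeds the threshold
flagged = salary[z.abs() > 3]

# Upper limit on |z| for a sample of this size
n = len(salary)
max_possible_z = (n - 1) / math.sqrt(n)

print(z.round(6))
print("Flagged:", flagged.index.tolist())
print(f"Largest possible |z| for n={n}: {max_possible_z:.4f}")
```

With six rows the limit is about 2.04, so a cutoff of 3 can never be exceeded here. The outlier inflates the mean and standard deviation it is measured against, which is one reason the z-score rule tends to suit larger samples.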

## 6. When NOT to Remove Outliers
- They are valid business cases (e.g., a CEO's salary)
- You are doing fraud detection
- Extreme values are the signal, not noise
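
In these situations, one option is to flag outliers instead of dropping them, so they stay available for review. A minimal sketch, assuming a hypothetical `is_outlier` column and the same IQR bounds as before:

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["A", "B", "C", "D", "E", "F"],
    "Salary": [50000, 52000, 51000, 49500, 50500, 200000],
})

Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Mark rows outside the IQR fences; no rows are deleted
df["is_outlier"] = ~df["Salary"].between(lower_bound, upper_bound)

# Summarize flagged and unflagged rows separately
summary = df.groupby("is_outlier")["Salary"].agg(["count", "median"])
print(df)
print(summary)
```

The flag keeps every row in the DataFrame, so the decision to exclude, cap, or investigate a value can be made later and case by case.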

---

# Summary

1. Outliers can distort summary statistics. The mean is heavily affected, while the median is much more stable.
2. Detect Outliers using IQR:
    1. Calculate quartiles - `Q1 = df["Salary"].quantile(0.25)` and `Q3 = df["Salary"].quantile(0.75)`
    2. IQR = Q3 - Q1
    3. lower_bound = Q1 - 1.5 * IQR
    4. upper_bound = Q3 + 1.5 * IQR
3. Identify Outliers: `df[(df["Salary"] < lower_bound) | (df["Salary"] > upper_bound)]`
4. Remove Outliers: `df[(df["Salary"]>=lower_bound) & (df["Salary"]<=upper_bound)]`
5. Cap (clip) Outliers instead of removing: `df["Salary_capped"] = df["Salary"].clip(lower_bound, upper_bound)`