# How to Handle Outliers for Linear Models in Machine Learning

## What is an Outlier?

An outlier is a data point that is significantly different from the rest of the dataset.
It lies far away from the mean or trend of the data.

Example:

| X  | Y             |
| -- | ------------- |
| 1  | 2             |
| 2  | 3             |
| 3  | 4             |
| 50 | 200 (outlier) |

The last point (50, 200) does not follow the pattern of the other three (Y = X + 1), so it is an outlier.

## Why Handle Outliers in Linear Models?

Linear models (like Linear Regression) are sensitive to outliers because:

1. They use the least squares method, which minimizes the sum of squared errors, so a single large error dominates the fit.

2. Outliers can pull the regression line away from the main data.

3. This leads to a distorted slope, biased predictions, and a lower R².
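
To make the effect concrete, the sketch below fits a least squares line to the four example points from the first section, once without and once with the point (50, 200). It uses `np.polyfit` as a stand-in for a linear regression model, which is a choice made for this illustration only.

```python
import numpy as np

# Example points from the first section; the last one is the outlier.
x = np.array([1.0, 2.0, 3.0, 50.0])
y = np.array([2.0, 3.0, 4.0, 200.0])

# Least squares fit of a straight line (degree-1 polynomial),
# first on the three regular points, then on all four.
slope_clean, intercept_clean = np.polyfit(x[:3], y[:3], deg=1)
slope_all, intercept_all = np.polyfit(x, y, deg=1)

print(f"without outlier: y = {slope_clean:.2f} * x + {intercept_clean:.2f}")
print(f"with outlier:    y = {slope_all:.2f} * x + {intercept_all:.2f}")
```

The three regular points lie exactly on Y = X + 1, so the first fit recovers a slope of 1. Adding the single outlier raises the fitted slope to more than 4.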

## Steps to Handle Outliers

## Step 1: Detect Outliers

There are several ways to detect outliers:

## A. Using Boxplot
A boxplot is a visual method: data points drawn beyond the whiskers are potential outliers.


In [None]:
import seaborn as sns

# Points drawn beyond the whiskers are potential outliers
sns.boxplot(x=data['column_name'])


## B. Using IQR (Interquartile Range) Method

A rule-based method: a value is flagged as an outlier if it lies more than 1.5 × IQR below Q1 or above Q3.

In [None]:
# First and third quartiles
Q1 = data['column_name'].quantile(0.25)
Q3 = data['column_name'].quantile(0.75)
IQR = Q3 - Q1

# Fences: 1.5 * IQR below Q1 and above Q3
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Rows whose value falls outside the fences
outliers = data[(data['column_name'] < lower_limit) | (data['column_name'] > upper_limit)]
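
The IQR rule can also be wrapped in a small reusable helper. The sketch below is self-contained; the function name `iqr_outlier_mask` and the sample `Salary` values are hypothetical and used only for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical sample: nine typical salaries and one extreme value.
data = pd.DataFrame({"Salary": [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]})

mask = iqr_outlier_mask(data["Salary"])
outliers = data[mask]
print(outliers)
```

Returning a mask rather than the filtered rows lets the same helper serve both detection (`data[mask]`) and removal (`data[~mask]`).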


## C. Using Z-Score
The Z-score measures how many standard deviations a point lies from the mean.

In [None]:
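from scipy import stats
import numpy as np

# Absolute Z-score of every value in the column
z = np.abs(stats.zscore(data['column_name']))
# Rows that lie more than 3 standard deviations from the mean
outliers = data[z > 3]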

A point with |Z| > 3 is a possible outlier.
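
The following sketch computes the Z-score by hand with NumPy on a hypothetical sample, and then shows a caveat of the method: with the population standard deviation (the default in both `np.std` and `scipy.stats.zscore`), the largest possible |Z| in a sample of n points is sqrt(n - 1), so the threshold of 3 cannot be reached in very small samples.

```python
import numpy as np

# Hypothetical sample: 19 values between 10 and 13, plus one extreme value (100).
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                   12, 10, 11, 13, 12, 11, 10, 12, 11, 100], dtype=float)

# Z-score: distance from the mean in units of the (population) standard deviation.
z = np.abs((values - values.mean()) / values.std())
flagged = values[z > 3]
print(flagged)

# Caveat: the four Y values from the first section contain an obvious outlier,
# but with n = 4 no |Z| can exceed sqrt(3), so nothing passes the threshold.
small = np.array([2.0, 3.0, 4.0, 200.0])
z_small = np.abs((small - small.mean()) / small.std())
print(z_small.max())
```

For small datasets, the boxplot or IQR method may therefore be a safer first check than the Z-score.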

## Step 2: Decide What to Do with Outliers
Once detected, you can choose a strategy:

| Method                           | Description                                                                       | When to Use                                   |
| -------------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------- |
| **Remove Outliers**              | Drop the affected rows                                                            | When outliers are due to errors or rare cases |
| **Cap or Floor (Winsorization)** | Replace outliers with the upper or lower limit                                    | When you want to keep the dataset size the same |
| **Transform Data**               | Apply a log, square-root, or Box-Cox transformation                               | When the data is skewed                       |
| **Use Robust Models**            | Use models less sensitive to outliers (e.g., `RANSACRegressor`, `HuberRegressor`) | When outliers are valid data points           |
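
As an illustration of the "Transform Data" row, the sketch below applies a log transformation to a hypothetical right-skewed column and compares the skewness before and after. The `income` values are made up for this example, and `np.log1p` is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column with a long upper tail.
income = pd.Series([20, 22, 25, 27, 30, 33, 35, 40, 60, 500], dtype=float)

# log1p computes log(1 + x): it is defined at zero and compresses large values.
income_log = np.log1p(income)

print(f"skewness before: {income.skew():.2f}")
print(f"skewness after:  {income_log.skew():.2f}")
```

The transformation reduces the skewness but does not necessarily eliminate it, so an extreme point may still stand out afterwards. Note also that a model trained on a transformed target predicts on the transformed scale; `np.expm1` reverses `np.log1p`.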


## Step 3: Apply the Handling Technique
Example (Removing Outliers using IQR):

In [None]:
Q1 = data['Salary'].quantile(0.25)
Q3 = data['Salary'].quantile(0.75)
IQR = Q3 - Q1

lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

data_clean = data[(data['Salary'] >= lower_limit) & (data['Salary'] <= upper_limit)]


Example (Capping Outliers):

In [None]:
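import numpy as np

# Cap values above upper_limit and floor values below lower_limit (limits from the IQR example above)
data['Salary'] = np.where(data['Salary'] > upper_limit, upper_limit,
                   np.where(data['Salary'] < lower_limit, lower_limit, data['Salary']))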
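
An alternative to the nested `np.where` calls is the pandas `Series.clip` method, which caps and floors in one step. The sketch below is self-contained; the sample values and the `Salary_capped` column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one very low and one very high salary.
data = pd.DataFrame({"Salary": [5, 30, 32, 35, 36, 38, 40, 41, 43, 400]})

q1 = data["Salary"].quantile(0.25)
q3 = data["Salary"].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# clip() raises values below `lower` to `lower` and lowers values above `upper` to `upper`.
# Writing to a new column keeps the original values available for comparison.
data["Salary_capped"] = data["Salary"].clip(lower=lower_limit, upper=upper_limit)
print(data)
```

The number of rows is unchanged; only the two extreme values are replaced by the limits.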


Example (Using Robust Regression):

In [None]:
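from sklearn.linear_model import HuberRegressor

# X: 2-D feature array, y: 1-D target array (both assumed to be defined already)
# The Huber loss is quadratic for small errors and linear for large ones
model = HuberRegressor()
model.fit(X, y)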
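
To see what the robust model changes, the sketch below fits both `LinearRegression` and `HuberRegressor` to synthetic data with a known slope of 2 and a few corrupted targets. The data-generating line, the noise level, and the size of the corruption are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=20)

# Corrupt the last three targets so they act as outliers.
y[-3:] += 60.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print(f"true slope:  2.00")
print(f"OLS slope:   {ols.coef_[0]:.2f}")
print(f"Huber slope: {huber.coef_[0]:.2f}")
```

On data like this, the Huber estimate should stay closer to the true slope than the ordinary least squares estimate, because large residuals are penalized linearly instead of quadratically.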


## Step 4: Retrain the Model

After handling outliers:

1. Split the data again.

2. Retrain your Linear Regression model.

3. Check whether the R² score or the error improved.
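
A minimal sketch of this comparison is shown below, using synthetic data with injected outliers (an assumption made so the example is self-contained). As a design choice of this sketch, the IQR limits are computed on the training portion only and both models are scored on the same test set, so the two R² values are directly comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): y = 3x + 5 plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Inject a few extreme targets into the training set only.
y_train_dirty = y_train.copy()
y_train_dirty[:8] += 150.0

# Keep only training rows whose target lies inside the IQR fences.
q1, q3 = np.percentile(y_train_dirty, [25, 75])
iqr = q3 - q1
keep = (y_train_dirty >= q1 - 1.5 * iqr) & (y_train_dirty <= q3 + 1.5 * iqr)

# Train once on the data with outliers and once on the cleaned data.
before = LinearRegression().fit(X_train, y_train_dirty)
after = LinearRegression().fit(X_train[keep], y_train_dirty[keep])

r2_before = r2_score(y_test, before.predict(X_test))
r2_after = r2_score(y_test, after.predict(X_test))
print(f"R^2 before: {r2_before:.3f}, after: {r2_after:.3f}")
```

If the score does not improve on real data, the flagged points may be valid observations, and a robust model or a transformation may be a better fit than removal.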

| Step | Action          | Example Method                              |
| ---- | --------------- | ------------------------------------------- |
| 1    | Detect Outliers | Boxplot, IQR, Z-score                       |
| 2    | Analyze Cause   | Error, natural variation                    |
| 3    | Handle          | Remove, Cap, Transform, or use Robust Model |
| 4    | Retrain Model   | Check performance improvement               |
