# Working with Outliers

Outliers are data points that differ significantly from the other observations in a dataset. They can arise from natural variability in the data or from measurement errors, or they may signal something genuinely interesting, such as an anomaly. Handling them correctly is crucial for accurate data analysis.

---

## 1. What Are Outliers?

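More formally, an outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. Outliers can distort statistical analyses: they pull the mean toward themselves, inflate the variance, and can degrade the performance of machine learning algorithms that are sensitive to extreme values.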
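To make the effect on the mean concrete, here is a minimal sketch using a made-up sample (the numbers are hypothetical, chosen only for illustration). It compares the mean and the median before and after a single extreme value is appended.

```python
import numpy as np

# Hypothetical sample: five typical values
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0])

# The same sample with one extreme value appended
with_outlier = np.append(values, 100.0)

print("Mean without outlier:  ", values.mean())
print("Mean with outlier:     ", with_outlier.mean())
print("Median without outlier:", np.median(values))
print("Median with outlier:   ", np.median(with_outlier))
```

The mean more than doubles (from 11.6 to about 26.3), while the median stays at 12.0, which is why a single outlier can dominate mean-based statistics.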

---

## 2. Outlier Detection and Removal Using the Z-score Method | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2025%20Outlier%20Removal%20using%20Z-Score)

### Concept

The Z-score method standardizes data points using the mean and standard deviation. A Z-score tells you how many standard deviations a value lies from the mean. Common practice is to treat values whose absolute Z-score exceeds a threshold (typically 3), that is, values with z > 3 or z < -3, as outliers. The method works best when the data are approximately normally distributed.

### Formula

$$
z = \frac{x - \mu}{\sigma}
$$

- x is an individual data point.
- μ is the mean of the dataset.
- σ is the standard deviation of the dataset.

### Python Code Example

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample data
np.random.seed(0)
data = pd.DataFrame({'value': np.random.randn(1000)})

# Calculate Z-scores
z_scores = np.abs(stats.zscore(data['value']))

# Define a threshold (e.g., 3) and filter data
threshold = 3
data_clean = data[z_scores < threshold]

print("Original data shape:", data.shape)
print("Clean data shape:", data_clean.shape)
```

This code calculates the absolute Z-score for each data point and keeps only the rows whose absolute Z-score is below the threshold.
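As a cross-check, the formula can also be applied by hand. The sketch below assumes the same kind of synthetic data as above and uses illustrative names (`z_manual`, `is_outlier`) that are not part of the original example. One detail to watch: `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is passed explicitly here.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(0)
values = pd.Series(np.random.randn(1000))

# Apply z = (x - mu) / sigma directly, with the population std (ddof=0)
z_manual = (values - values.mean()) / values.std(ddof=0)

# Reference result from SciPy (ddof=0 is its default)
z_scipy = stats.zscore(values.to_numpy())

print("Manual and SciPy Z-scores match:", np.allclose(z_manual, z_scipy))

# Flag outliers with a boolean mask instead of dropping them right away
is_outlier = z_manual.abs() > 3
print("Number of flagged points:", int(is_outlier.sum()))
```

Keeping a boolean mask rather than filtering immediately makes it easy to inspect the flagged rows before deciding whether to remove them.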

---

## 3. Outlier Detection and Removal Using the IQR Method | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2026%20Outlier%20Removal%20using%20IQR%20Method)

### Concept

The Interquartile Range (IQR) method calculates the first quartile (Q1) and third quartile (Q3) of the data; the IQR is the difference between them. Data points that fall outside the interval

$$
[Q_1 - 1.5 \times IQR,\; Q_3 + 1.5 \times IQR]
$$

are considered outliers.

### Formula

1. Calculate the IQR:
   
   $$
   IQR = Q_3 - Q_1
   $$

2. Define the lower and upper bounds:
   
   $$
   \text{Lower Bound} = Q_1 - 1.5 \times IQR
   $$
   $$
   \text{Upper Bound} = Q_3 + 1.5 \times IQR
   $$

### Python Code Example

```python
# Reuses the `data` DataFrame defined in the Z-score example
# Calculate Q1 and Q3
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for outlier detection
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter data to remove outliers
data_clean_iqr = data[(data['value'] >= lower_bound) & (data['value'] <= upper_bound)]

print("Data shape after IQR cleaning:", data_clean_iqr.shape)
```

This code computes the IQR, defines the bounds, and filters out the data points that are considered outliers.
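For a hand-checkable illustration of the bounds, the following sketch applies the same steps to a small made-up sample. The helper `iqr_bounds` and the sample values are hypothetical, introduced only for this example; quartiles use pandas' default linear interpolation.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: nine small values and one extreme value
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

lower, upper = iqr_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]

print("Bounds:", lower, upper)          # Bounds: -3.5 14.5
print("Outliers:", outliers.tolist())   # Outliers: [50]
```

Here Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and the bounds are 3.25 - 6.75 = -3.5 and 7.75 + 6.75 = 14.5; only the value 50 falls outside them.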

---

## 4. Percentile Method | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2027%20Outlier%20Detection%20using%20Percentiles)

### Concept

The percentile method defines outliers as the values beyond chosen percentile cutoffs. For example, you might remove the lowest 1% and the highest 1% of values. This method is especially useful when you want to eliminate a fixed proportion of extreme values.

### Python Code Example

```python
# Reuses the `data` DataFrame defined in the Z-score example
# Keep values between the 1st and 99th percentiles (inclusive)
lower_percentile = data['value'].quantile(0.01)
upper_percentile = data['value'].quantile(0.99)

# Filter data based on percentile thresholds
data_clean_percentile = data[(data['value'] >= lower_percentile) & (data['value'] <= upper_percentile)]

print("Data shape after percentile cleaning:", data_clean_percentile.shape)
```

This example removes roughly the lowest 1% and the highest 1% of values, about 2% of the data in total.
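One consequence of cutting at fixed percentiles is that a similar share of rows is removed no matter what the data look like. The sketch below illustrates this with hypothetical, evenly spaced data that contain no unusual points at all; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 evenly spaced values with no extreme points
even = pd.DataFrame({'value': np.arange(1000, dtype=float)})

lower_cut = even['value'].quantile(0.01)
upper_cut = even['value'].quantile(0.99)

# Same inclusive filter as in the example above
trimmed = even[(even['value'] >= lower_cut) & (even['value'] <= upper_cut)]

print("Rows before:", len(even))     # Rows before: 1000
print("Rows after:", len(trimmed))   # Rows after: 980
```

Even though nothing in this data is unusual, 20 rows are dropped, so it is worth checking whether the trimmed values are really extreme before using this method.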

---

## Summary

- **Outliers:** Extreme data points that differ significantly from the rest of the data.
- **Z-score Method:** Uses the formula
  $$ z = \frac{x - \mu}{\sigma} $$
  and flags points whose absolute Z-score exceeds a chosen threshold (typically 3).
- **IQR Method:** Uses the interquartile range:
  $$ IQR = Q_3 - Q_1 $$
  and defines outliers as those outside
  $$ [Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR] $$
- **Percentile Method:** Removes a fixed percentage of the extreme values, for example, values outside the 1st and 99th percentiles.

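Each method has its strengths. The Z-score method suits data that are roughly normally distributed. The IQR method relies on quartiles rather than the mean and standard deviation, so it is less affected by skew and by the outliers themselves. The percentile method is a simple choice when you want to trim a fixed share of the data. The right choice depends on your dataset's characteristics and the specific requirements of your analysis.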
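To compare the methods side by side, a rough sketch like the following runs all three on the same synthetic data and counts how many rows each one keeps. The variable names and the thresholds (3, 1.5, and the 1st/99th percentiles) follow the examples above; the comparison itself is an illustration, not a benchmark.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
values = pd.Series(np.random.randn(1000))

# Z-score method: keep |z| < 3 (population std, as in scipy.stats.zscore)
z = (values - values.mean()) / values.std(ddof=0)
keep_z = z.abs() < 3

# IQR method: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
keep_iqr = values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Percentile method: keep values between the 1st and 99th percentiles
keep_pct = values.between(values.quantile(0.01), values.quantile(0.99))

# Number of rows each method keeps out of 1000
kept = pd.Series({
    'z-score': keep_z.sum(),
    'iqr': keep_iqr.sum(),
    'percentile': keep_pct.sum(),
})
print(kept)
```

On data like this, the percentile method keeps 980 rows by construction, while the counts for the other two methods depend on how heavy the tails of the sample happen to be.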