# Outlier Detection

## What is an Outlier in Machine Learning?

In machine learning, an outlier is a data point that significantly differs from other observations in the dataset. Outliers can occur due to variability in the data or errors in data collection, measurement, or processing. 

They can have a significant impact on the performance of machine learning models, as they may distort statistical measures and influence model training.

For example, imagine you’re analyzing the salaries of people in an office. Most employees earn between $30,000 and $60,000, but one person, let’s say the CEO, has a salary of $10 million. That $10 million figure is an outlier because it is far removed from the rest of the data.



## Why Is Outlier Detection Important?

Outliers can significantly distort statistical analysis, often leading to inaccurate conclusions and misleading insights.

Statistics such as the mean and standard deviation are especially sensitive: even a few extreme values can disproportionately shift them and skew interpretations, whereas the median is largely unaffected. By identifying and properly handling outliers, we can reduce their impact, maintain the reliability of statistical measures, and ensure the insights derived from the data are both meaningful and accurate.
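
To make this concrete, here is a minimal sketch based on the salary scenario from the previous section. The individual salary figures are made up for illustration; only the $30,000 to $60,000 range and the $10 million outlier come from that example.

```python
import statistics

# Hypothetical salaries inside the $30,000 to $60,000 range from the example
salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 55_000, 60_000]
with_ceo = salaries + [10_000_000]  # add the CEO's salary as the outlier

mean_without = statistics.mean(salaries)
mean_with = statistics.mean(with_ceo)
median_without = statistics.median(salaries)
median_with = statistics.median(with_ceo)

print(f"Mean:   {mean_without:,.0f} -> {mean_with:,.0f}")
print(f"Median: {median_without:,.0f} -> {median_with:,.0f}")
print(f"Stdev:  {statistics.stdev(salaries):,.0f} -> {statistics.stdev(with_ceo):,.0f}")
```

With these numbers, a single extreme salary moves the mean from 45,000 to 1,289,375, while the median only moves from 45,000 to 47,500.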



- **Improves Accuracy:** Handling outliers ensures that models are trained on representative data, leading to better predictions and more accurate outcomes.

- **Supports Fraud Detection:** In domains like finance or cybersecurity, outliers often signal fraudulent or suspicious behavior, making their detection critical for early intervention.

- **Enhances Data Quality:** Outlier detection helps maintain the integrity and cleanliness of data, which is essential for making sound business decisions.

- **Boosts Model Performance:** Outliers can mislead statistical and machine learning models, reducing their robustness. Identifying and treating them improves model stability and performance.

- **Enables Insightful Discoveries:** Sometimes, outliers reveal unique, rare, or emerging patterns. Investigating them can uncover new trends, risks, or opportunities hidden within the data.




### Types of Outliers:
1. **Global outliers:** Global outliers are isolated data points that are far away from the main body of the data. They are often easy to identify and remove.

2. **Contextual outliers:** Contextual outliers are data points that are unusual in a specific context but may not be outliers in a different context. They are often more difficult to identify and may require additional information or domain knowledge to determine their significance.

3. **Univariate Outliers:** These are outliers that occur in a single variable or feature.

4. **Multivariate Outliers:** These outliers occur when considering multiple variables simultaneously. A data point may not be an outlier in any single dimension but can be an outlier when considering multiple dimensions.
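
The difference between univariate and multivariate outliers can be sketched with a small example. The data below is hypothetical, and the Mahalanobis distance used for the multivariate check is one common choice that is not covered in this notebook; treat it as an illustration under those assumptions rather than a recommended method.

```python
import numpy as np

# Hypothetical data: two strongly correlated features (y is roughly equal to x)
X = np.array([
    [1, 1.1], [2, 1.9], [3, 3.2], [4, 3.8], [5, 5.1],
    [6, 6.0], [7, 6.9], [8, 8.2], [9, 8.8], [10, 10.1],
    [3, 8.0],  # each value is within range, but the pair breaks the pattern
])

# Univariate check: Z-score of every value within its own column
z_scores = (X - X.mean(axis=0)) / X.std(axis=0)

# Multivariate check: Mahalanobis distance of every row from the data centre
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

most_unusual = int(np.argmax(mahalanobis))
print("Largest |Z-score| in the last row:", np.abs(z_scores[-1]).max())
print("Row with the largest Mahalanobis distance:", X[most_unusual])
```

The last row looks ordinary when each column is inspected separately, yet it is the row farthest from the rest once both columns are considered together.
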

### Common Techniques to Detect Outliers:
 1. Statistical Methods (Z-Score / IQR)

 2. Visualization (Box Plot, Scatter Plot)

 3. Domain knowledge (business logic)


#### <u>Z-Score</u>

 - The Z-Score method measures how far each data point deviates from the mean, in units of standard deviation: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
 - It is commonly used when the data follows an approximately normal distribution.

 - If the absolute value of a Z-score exceeds a chosen threshold (typically 3), the point is considered an outlier.


 - When to Use:

    - Suitable for large datasets

    - Ideal when data is approximately normally distributed

    - Not ideal for small datasets or highly skewed data
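
The Z-Score rule can be sketched in a few lines. The salary values below (in thousands of dollars) are hypothetical, and the threshold of 3 follows the typical value mentioned above.

```python
import statistics

# Hypothetical salaries in thousands of dollars; 250 is the planted outlier
salaries = [32, 35, 38, 40, 41, 43, 45, 45, 47, 48, 50,
            50, 52, 53, 55, 56, 57, 58, 59, 60, 250]

mean = statistics.mean(salaries)
std = statistics.pstdev(salaries)  # population standard deviation
threshold = 3

# Z-score of each value, then keep the values whose |Z| exceeds the threshold
z_scores = [(x - mean) / std for x in salaries]
outliers = [x for x, z in zip(salaries, z_scores) if abs(z) > threshold]

print("Outliers:", outliers)
```

On this data only the value 250 is flagged. Note that the outlier itself inflates the mean and standard deviation, which caps how large a Z-score can get: with 10 or fewer data points, no value can have an absolute Z-score above 3, which is one reason the method is not ideal for small datasets.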


#### <u>IQR (Interquartile Range)</u>

- The IQR method is a robust statistical technique used to detect outliers without assuming a normal distribution. It focuses on the central portion of the data — the middle 50%.

- How It Works:

    - Calculate the first quartile (Q1 – 25th percentile) and the third quartile (Q3 – 75th percentile).

    - Compute the IQR:

            IQR = Q3 − Q1

    - Determine the lower and upper bounds:

            Lower Bound = Q1 − 1.5 × IQR

            Upper Bound = Q3 + 1.5 × IQR

    - Any data point outside these bounds is considered an outlier.
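
The steps above can be traced by hand on a small list. This is a minimal sketch on hypothetical values using only the standard library; `method="inclusive"` is chosen here because it interpolates linearly between data points, which should match the default behaviour of pandas' `quantile`.

```python
import statistics

# Hypothetical values; 95 is the planted outlier
data = [10, 12, 13, 15, 16, 18, 19, 21, 22, 95]

# Step 1: first and third quartiles (the middle cut point is the median)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Step 2: interquartile range
iqr = q3 - q1

# Step 3: lower and upper bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Step 4: anything outside the bounds is flagged
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Outliers:", outliers)
```

Here Q1 is 13.5 and Q3 is 20.5, so the IQR is 7, the bounds are 3 and 31, and only 95 falls outside them.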

![Image](attachment:image.png)



```python
import pandas as pd

# Load the Titanic dataset (expects Titanic-Dataset.csv in the working directory)
df = pd.read_csv('./Titanic-Dataset.csv')

# First and third quartiles of the Fare column
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")

# Identify rows whose Fare falls outside the boundaries
outliers = df[(df['Fare'] < lower_bound) | (df['Fare'] > upper_bound)]
print("\nDetected Outliers:\n", outliers)
```

