# Outlier Detection with IQR (Interquartile Range)

## Objective
The **Interquartile Range (IQR)** is a measure of statistical dispersion, defined as the difference between the 75th and 25th percentiles. It is widely used to detect outliers during data preprocessing. The goal of this lab is to apply the IQR algorithm for outlier detection.

## Prerequisites
Before proceeding, ensure you have completed all the content in submodule 3.2, specifically the lecture slides on the IQR algorithm. Familiarity with these concepts is crucial for understanding and implementing the outlier detection technique described here.

## IQR Algorithm for Outlier Detection
Follow these steps to detect outliers using the IQR method:
1. Arrange the data in ascending order.
2. Calculate the first quartile (Q1).
3. Calculate the third quartile (Q3).
4. Compute the IQR as $IQR = Q3 - Q1$.
5. Determine the lower bound $T_{\text{lower}} = Q1 - (1.5 \times IQR)$.
6. Determine the upper bound $T_{\text{upper}} = Q3 + (1.5 \times IQR)$.
7. Identify outliers. Data points outside the range $[T_{\text{lower}}, T_{\text{upper}}]$ are considered outliers and should be filtered out.
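
The steps above can be sketched on a small made-up series before moving to the real data. This is a minimal illustration, not the lab solution: the helper name `iqr_bounds`, the parameter `k`, and the values in `toy` are assumptions introduced here, and the quartiles use pandas' default linear interpolation, which may differ slightly from the convention in the lecture slides.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) IQR bounds for a pandas Series (hypothetical helper)."""
    # Step 1 (sorting) is handled internally by quantile().
    q1 = values.quantile(0.25)  # step 2: first quartile
    q3 = values.quantile(0.75)  # step 3: third quartile
    iqr = q3 - q1               # step 4: interquartile range
    return q1 - k * iqr, q3 + k * iqr  # steps 5 and 6


# Made-up values for illustration only; 95 is the obvious extreme point.
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

lower, upper = iqr_bounds(toy)

# Step 7: anything outside [lower, upper] is flagged and filtered out.
is_outlier = (toy < lower) | (toy > upper)
outliers = toy[is_outlier]
cleaned = toy[~is_outlier]

print(f"bounds: [{lower}, {upper}]")
print("outliers:", outliers.tolist())
```

With these toy values, only 95 falls outside the bounds, so `cleaned` keeps the other seven points.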

## Instructions
- Implement the IQR algorithm using Python.
- Apply the algorithm to detect and remove outliers from the "LotArea" attribute in the training dataset of the House Price Prediction competition. The data is in `train.csv` and can be downloaded [here](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview).
- Compare the original and preprocessed datasets by plotting their distributions. Use box plots for this comparison, following the examples provided in the [matplotlib boxplot demo](https://matplotlib.org/3.1.1/gallery/pyplots/boxplot_demo_pyplot.html#sphx-glr-gallery-pyplots-boxplot-demo-pyplot-py).

This exercise will help you understand how to identify and remove outliers, improving the quality of your data for predictive modeling.
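
The box-plot comparison asked for in the instructions might look roughly like the sketch below. It uses made-up data rather than `LotArea`, and the helper name `plot_before_after`, the panel titles, and the fixed cutoff of 50 (a stand-in for a real IQR filter) are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_before_after(original, cleaned):
    """Draw two box plots side by side (hypothetical helper)."""
    fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(8, 4))
    # Convert to plain lists so the pandas index does not matter.
    ax_before.boxplot(list(original))
    ax_before.set_title("Original")
    ax_after.boxplot(list(cleaned))
    ax_after.set_title("Outliers removed")
    return fig, (ax_before, ax_after)


# Made-up values for illustration only.
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])
# Stand-in for an IQR filter: a fixed cutoff chosen by eye for this toy data.
toy_cleaned = toy[toy < 50]

fig, axes = plot_before_after(toy, toy_cleaned)
```

In a notebook the figure is typically rendered inline; when running as a script, call `plt.show()` afterwards. The two panels have independent y-axes, which makes the cleaned distribution easier to read but hides the difference in scale, so check the axis ticks when comparing.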

In [1]:
# !pip install scikit-learn 
import pandas as pd
import matplotlib.pyplot as plt

### 1. Load data from CSV using pandas

In [2]:
data = pd.read_csv('train.csv')

### Use describe to get overall statistics 

In [3]:
# use describe function for all data here

### 2. Use describe to determine the values Q1, Q2, and Q3 for LotArea

In [4]:
# use describe to determine Q1, Q2 and Q3 for LotArea here

### Compute Q1, Q3 and IQR using the .quantile function

In [5]:
# code to compute Q1, Q3 and IQR using the quantile function here
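
As a quick check on made-up data, the quartiles reported by `describe()` should agree with those returned by `quantile()`, since both use linear interpolation by default. The series `toy` and its name `toy_area` are assumptions for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95], name="toy_area")

# describe() labels the quartiles "25%", "50%" and "75%".
summary = toy.describe()
q1_from_describe = summary["25%"]
q3_from_describe = summary["75%"]

# quantile() takes the fraction directly.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

print("describe:", q1_from_describe, q3_from_describe)
print("quantile:", q1, q3, "IQR:", iqr)
```

The same pattern should carry over to a single DataFrame column, for example by selecting the column first and then calling `describe()` or `quantile()` on it.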

### Determine the upper and lower bounds. Any value outside these bounds is an outlier

In [6]:
# code to determine upper and lower bounds here

### 5. Find outliers
Filter using the upper and lower bounds.

In [7]:
# code to identify outliers here

### 6. Remove the outliers. Drop rows whose LotArea is not in [lower, upper]

In [8]:
# code to remove outliers here

### Compare the box plots

#### Create a box plot with original data

In [10]:
# box plot code here

#### Create a box plot with outliers removed

In [11]:
# outliers removed box plot code here

### Comment on the differences between both box plots