# Detect Outliers

## Detecting and Removing Outliers Using Isolation Forest

Outliers can significantly degrade the performance of machine learning models. They are data points that differ markedly from the other observations and are unlikely to occur in real-world data. Leaving them in your dataset can lead to inaccurate models and misleading test results.

To improve the quality of your data, you can detect and remove these outliers with the Isolation Forest method. Removing them helps ensure that your model is trained on data that better represents real-world scenarios.

Here is a step-by-step guide to using the Isolation Forest method for outlier detection and removal.

In [2]:
# Load packages
import numpy as np 
import pandas as pd 
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

In [3]:
# Load data from the csv file
df = pd.read_csv("housing.csv", index_col=False)
df.head()

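| Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 5.33 | 36.2 |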


Once the data is loaded, you can use the `IsolationForest` class from the `sklearn` package to detect and remove outliers. The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. Because outliers are few and different, they tend to be separated from the rest of the data after fewer random splits than typical points, and this short isolation path is what marks them as anomalies.
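
To make the "few and different" idea concrete, here is a minimal one-dimensional sketch of the isolation step. It is not scikit-learn's implementation: the helper names `isolation_depth` and `mean_depth` and the synthetic `rooms` values are assumptions made for illustration only.

```python
import numpy as np

def isolation_depth(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    depth = 0
    current = np.asarray(values, dtype=float)
    while len(current) > 1:
        lo, hi = current.min(), current.max()
        if lo == hi:  # identical values cannot be separated any further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target
        if target < split:
            current = current[current < split]
        else:
            current = current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# Synthetic values: 200 typical points plus one extreme point at 12.0
rooms = np.append(rng.normal(6.3, 0.5, size=200), 12.0)

def mean_depth(target, n_trials=200):
    """Average the isolation depth of `target` over many random trials."""
    return float(np.mean([isolation_depth(rooms, target, rng) for _ in range(n_trials)]))

typical_depth = mean_depth(float(np.median(rooms)))
outlier_depth = mean_depth(12.0)
print(f"typical point: {typical_depth:.1f} splits, extreme point: {outlier_depth:.1f} splits")
```

Averaged over many random trials, the extreme point needs far fewer splits than a point near the median, which is the signal an isolation forest turns into an anomaly score.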

Here is an example of how to build and apply the Isolation Forest classifier:

In [4]:
CONTAMINATION = 0.2  # Expected proportion of outliers in the data, a float in (0, 0.5]
BOOTSTRAP = False    # True to fit each tree on a sample drawn with replacement

# Extract the values from the dataframe as a NumPy array (without column names)
data = df.values

# We will start by splitting the data into X/y train/test data
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Next, we will set up the Isolation Forest classifier that will detect the outliers
i_forest = IsolationForest(contamination=CONTAMINATION, bootstrap=BOOTSTRAP)
is_inlier = i_forest.fit_predict(X_train)    # +1 if inlier, -1 if outlier

# Finally, we will build a mask that is True for the inlier rows
mask = is_inlier != -1
# and keep only those rows in the train data
X_train, y_train = X_train[mask, :], y_train[mask]
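
Before dropping rows, it can help to check how many were flagged and which ones look most anomalous. The sketch below is an illustration on synthetic data, not part of the original notebook: `X_demo` is a hypothetical stand-in for `X_train`, and `random_state` is set only so that repeated runs give the same result.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Synthetic stand-in for X_train: 380 typical rows plus 20 rows far from the bulk
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(380, 3)),
    rng.normal(8.0, 1.0, size=(20, 3)),
])

forest = IsolationForest(contamination=0.2, random_state=1)
labels = forest.fit_predict(X_demo)        # +1 for inliers, -1 for outliers
scores = forest.decision_function(X_demo)  # negative scores correspond to flagged rows

n_flagged = int((labels == -1).sum())
flagged_fraction = n_flagged / len(X_demo)
print(f"Flagged {n_flagged} of {len(X_demo)} rows ({flagged_fraction:.0%})")

# Row indices sorted from most to least anomalous; show the first five
most_anomalous = np.argsort(scores)[:5]
print("Indices of the five lowest-scoring rows:", most_anomalous)
```

Because the threshold is derived from `contamination`, the flagged share of the fitted data stays close to that value whether or not the data really contains that many anomalies.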

Before saving the data without outliers to a new CSV file, it is worth comparing the distribution of the average number of rooms (RM) column before and after the removal. Boxplots work well for this: they summarize a distribution by its median and quartiles and mark potential outliers as individual points.

Here's how you can create two boxplots to compare the RM column with and without outliers:

1. **With Outliers**: This boxplot will show the distribution of the RM column in the original dataset, including any outliers.
2. **Without Outliers**: This boxplot will show the distribution of the RM column in the training split after the Isolation Forest algorithm has removed the rows it flagged as outliers.

By comparing these two boxplots, you can visually assess the impact of outlier removal on the RM column.
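
To put numbers on what a boxplot draws, the sketch below computes the quartiles and applies the common 1.5 × IQR convention for the whiskers. The `values` array is made up for illustration, and plotting libraries may compute quartiles with slightly different interpolation rules, so treat the exact figures as an assumption of this example.

```python
import numpy as np

# Made-up sample with one value far from the rest
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                    # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr     # points below this are drawn individually
upper_fence = q3 + 1.5 * iqr     # points above this are drawn individually

beyond_whiskers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Points beyond the fences:", beyond_whiskers)
```

Note that this rule looks at one column at a time, whereas the Isolation Forest flags whole rows based on all features, so the two do not have to agree on which points are outliers.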

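Below is a matplotlib version of these boxplots:

```python
import matplotlib.pyplot as plt

# Rebuild a dataframe from the filtered training features (the target column is not included)
df_wo_outliers = pd.DataFrame(X_train, columns=df.columns[:-1])

# Create a figure and axis
fig, ax = plt.subplots()

# Add boxplots to the figure and label them
ax.boxplot([df['RM'], df_wo_outliers['RM']])
ax.set_xticklabels(['With outliers', 'Without outliers'])

# Set the title and labels
ax.set_title('Comparison of RM Column with and without Outliers')
ax.set_ylabel('Average Number of Rooms (RM)')

# Show the plot
plt.show()
```

The notebook cell below draws the same comparison with Plotly Express: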

In [5]:
df_wo_outliers = pd.DataFrame(X_train, columns=df.columns[:-1])
df_boxplot = pd.DataFrame(data={'With outliers': df['RM'], 'Without outliers': df_wo_outliers['RM']})

fig = px.box(df_boxplot, 
             title="Comparison of Average Number of Rooms (RM) With and Without Outliers",
             labels={"value": "Average Number of Rooms (RM)", "variable": "Data"},
             color_discrete_sequence=["#636EFA", "#EF553B"])

fig.update_layout(
    yaxis_title="Average Number of Rooms (RM)",
    xaxis_title="",
    boxmode='group'  # Group the boxes together
)

fig.show()

As you can see, a significant number of outliers have been removed from the RM column. The same comparison can be made for the other columns. To change how many points are removed, adjust the `CONTAMINATION` parameter, which sets the expected proportion of outliers in the data: with `CONTAMINATION = 0.2`, roughly 20% of the training rows are flagged and dropped. Increasing the value removes more rows, while decreasing it retains more data points.
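
The effect of the parameter can be checked directly by fitting the forest with several values and counting the rows that survive. This is a sketch on synthetic data: `X_demo` is a hypothetical stand-in for `X_train`, and the three contamination values are arbitrary choices for the comparison.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))  # synthetic stand-in for X_train

kept_rows = {}
for contamination in (0.05, 0.1, 0.2):
    # Same random_state each time, so only the threshold changes between runs
    forest = IsolationForest(contamination=contamination, random_state=0)
    is_inlier = forest.fit_predict(X_demo)
    kept_rows[contamination] = int((is_inlier == 1).sum())
    print(f"contamination={contamination}: kept {kept_rows[contamination]} of {len(X_demo)} rows")
```

A larger value always discards more rows, even when the data contains no real anomalies, so the setting is best chosen from what you know about the data rather than from the plot alone.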

To save the cleaned dataframe for future use, write it to a CSV file. Passing `index=False` keeps pandas from writing the row index as an extra unnamed column.

In [6]:
df_wo_outliers.to_csv('housing_wo_outliers.csv', index=False)
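
Note that `df_wo_outliers` is built from `X_train` only, so it holds the feature columns but not the `MEDV` target. If the saved file should also contain the target, one option is to add it back before writing. The sketch below uses small made-up arrays (`X_demo`, `y_demo`) as stand-ins for the filtered `X_train` and `y_train`, and writes to in-memory buffers rather than files to show what the `index` argument changes when the data is read back.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the filtered X_train and y_train arrays
X_demo = np.array([[6.575, 4.98], [6.421, 9.14], [7.185, 4.03]])
y_demo = np.array([24.0, 21.6, 34.7])

cleaned = pd.DataFrame(X_demo, columns=["RM", "LSTAT"])
cleaned["MEDV"] = y_demo  # put the target back next to the features

# Write to in-memory buffers to compare the two settings
with_index = io.StringIO()
cleaned.to_csv(with_index)                  # default: the row index is written too
without_index = io.StringIO()
cleaned.to_csv(without_index, index=False)  # the row index is left out

with_index.seek(0)
without_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

When the index is written, reloading the file produces an extra `Unnamed: 0` column, which would then be treated as a feature if the data were passed to a model unchanged.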

### Dataset source
[housing.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv)