## Automatic Outlier Detection

#### The local outlier factor, or LOF for short, is a technique that uses the idea of nearest neighbors for outlier detection. Each example is assigned a score that reflects how isolated it is, and therefore how likely it is to be an outlier, based on how sparse its local neighborhood is compared with the neighborhoods of its nearest neighbors. The examples with the largest scores are the most likely to be outliers.
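The scoring idea described above can be sketched on a small synthetic dataset (not the housing data). The names `X_demo`, `lof_demo`, and `lof_scores` are illustrative only. scikit-learn stores the negated score in `negative_outlier_factor_`, so the sketch flips the sign to get "larger means more isolated".

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# synthetic data: one dense cluster plus a single far-away point (hypothetical example)
rng = np.random.RandomState(1)
inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
isolated = np.array([[8.0, 8.0]])
X_demo = np.vstack([inliers, isolated])

# fit LOF and label each row as normal (1) or outlier (-1)
lof_demo = LocalOutlierFactor(n_neighbors=20)
labels = lof_demo.fit_predict(X_demo)

# negative_outlier_factor_ holds the negated LOF, so negate it again
lof_scores = -lof_demo.negative_outlier_factor_

# the row with the largest score is the most isolated one
most_isolated = int(np.argmax(lof_scores))
print(most_isolated, labels[most_isolated], lof_scores[most_isolated])
```

In this sketch the far-away point is appended last, so it sits at row index 100.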

#### We will use the Boston housing regression dataset, which has 13 inputs and one numerical target. The task is to learn the relationship between suburb characteristics and house prices.

In [47]:
# importing dataset
import pandas as pd
dataframe = pd.read_csv('housing.csv', header=None)
dataframe.head()


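|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |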


In [49]:
# retrieve the array
df = dataframe.values

In [50]:
# split into input and output elements
X = df[:, :-1]
y = df[:, -1]

In [51]:
# split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, y_train.shape)

(339, 13) (339,)


In [52]:
# fit the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [53]:
# evaluate the model
from sklearn.metrics import mean_absolute_error
y_pred = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, y_pred)
print('MAE: %.3f' % mae)

MAE: 3.417


#### Next, we can try removing outliers from the training dataset.

#### The expectation is that the outliers are causing the linear regression model to learn a biased or skewed view of the problem, and that removing them from the training set will allow a more effective model to be learned.

#### We can achieve this by defining the LocalOutlierFactor model and using it to make a prediction on the training dataset, marking each row as normal (1) or an outlier (-1). We will use the default hyperparameters for the outlier detection model, although it is a good idea to tune settings such as `n_neighbors` and `contamination` to the specifics of your dataset.
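As a rough illustration of what tuning might look like, the sketch below varies the `contamination` argument of `LocalOutlierFactor` on synthetic data (not the housing data) and counts how many rows are flagged. The data, the variable names, and the candidate values 0.05 and 0.10 are assumptions chosen for the example, not recommendations.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# synthetic stand-in for a training matrix with 13 inputs (hypothetical data)
rng = np.random.RandomState(1)
X_syn = rng.normal(size=(200, 13))

flagged = {}
for contamination in (0.05, 0.10):
    # contamination sets the expected proportion of outliers in the data
    lof_tuned = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    labels = lof_tuned.fit_predict(X_syn)
    # count the rows labeled -1 (outliers)
    flagged[contamination] = int((labels == -1).sum())

print(flagged)
```

A larger `contamination` value flags at least as many rows, so it trades a cleaner training set against a smaller one.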

In [54]:
# identify outliers in the training dataset
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor()
y_pred = lof.fit_predict(X_train)

In [55]:
# select all rows that are not outliers
mask = y_pred != -1
X_train, y_train = X_train[mask, :], y_train[mask]

In [56]:
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)

(305, 13) (305,)


In [57]:
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [58]:
# evaluate the model
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print('MAE: %.3f' % mae)

MAE: 3.356


#### First, we can see that the number of examples in the training dataset was reduced from 339 to 305, meaning 34 rows were flagged as outliers and removed.

#### We can also see a reduction in MAE from about 3.417 for a model fit on the entire training dataset to about 3.356 for a model fit on the dataset with outliers removed, an improvement of about 0.061.
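The whole comparison can be collected into one helper, sketched here under stated assumptions: `compare_mae` is a hypothetical function name, and the data comes from `make_regression` rather than `housing.csv`, so the printed numbers will not match the results above. Outliers are detected from the training inputs only, and the test set is left untouched, as in the steps above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor


def compare_mae(X, y, test_size=0.33, random_state=1):
    """Return (MAE on all rows, MAE after LOF filtering, rows removed)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)

    # baseline: fit on the full training set
    baseline = LinearRegression().fit(X_train, y_train)
    mae_all = mean_absolute_error(y_test, baseline.predict(X_test))

    # flag outliers using the training inputs only
    keep = LocalOutlierFactor().fit_predict(X_train) != -1

    # refit on the rows that were not flagged
    filtered = LinearRegression().fit(X_train[keep], y_train[keep])
    mae_filtered = mean_absolute_error(y_test, filtered.predict(X_test))
    return mae_all, mae_filtered, int((~keep).sum())


# synthetic stand-in data with 13 inputs (not the housing dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=13,
                               noise=10.0, random_state=1)
mae_all, mae_filtered, n_removed = compare_mae(X_syn, y_syn)
print('MAE (all rows): %.3f' % mae_all)
print('MAE (outliers removed): %.3f' % mae_filtered)
print('Rows removed: %d' % n_removed)
```

To apply it to the housing data, pass the `X` and `y` arrays defined earlier in place of the synthetic ones.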