# Dealing with outliers

It is generally accepted that outliers should be removed: they do not represent the domain well and can lead the model to wrong conclusions, and therefore to poor performance.

The difficulty lies in defining what counts as an outlier for each feature. Some values look like outliers, but a closer analysis of the context shows that they aren't. For example, the salary of a CEO is much greater than the salary of a regular employee. An automatic outlier detection technique could flag the CEO's salary as an outlier because of this large difference, but with knowledge of the context and domain we know that it is a legitimate value.

**Two main techniques**:
- Z-score
- Local Outlier Factor

In [15]:
import pandas as pd
import numpy as np

df = pd.read_csv('Smaller_Building_Permits.csv')
df_num = df.select_dtypes(np.number)



(10372, 44)

## Z-score

The z-score tells us how many standard deviations an observation lies from the sample mean. The further the value is from the mean, the larger its absolute z-score.

A common threshold is 3: observations whose absolute z-score exceeds it are treated as outliers.
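As a minimal sketch of the z-score rule, the snippet below computes z = (x - mean) / std by hand using only the standard library. The salary figures are made up for illustration, and the population standard deviation is used because that matches the default (`ddof=0`) of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev

# Hypothetical salaries: twelve regular employees and one CEO (illustrative values only).
salaries = [30_000, 32_000, 35_000, 31_000, 33_000, 34_000, 36_000,
            29_000, 30_500, 32_500, 31_500, 33_500, 500_000]

mu = mean(salaries)
sigma = pstdev(salaries)  # population standard deviation (ddof=0)

# z-score: number of standard deviations each value lies from the mean
z_scores = [(s - mu) / sigma for s in salaries]

threshold = 3
flagged = [s for s, z in zip(salaries, z_scores) if abs(z) > threshold]
print(flagged)
```

The only value flagged here is the CEO's salary, which is exactly the kind of observation that domain knowledge may tell us to keep.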

In [10]:
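from scipy import stats
threshold = 1  # stricter than the usual 3
# nan_policy='omit' skips missing values when computing each column's mean and std
z_scores = np.abs(stats.zscore(df_num, nan_policy='omit'))
new_df = df[(z_scores < threshold).all(axis=1)]  # rows with NaN z-scores are dropped too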



## Local Outlier Factor

Local Outlier Factor (LOF) estimates the local density of each observation from the distances to its nearest neighbors. An observation is considered an outlier if its density is substantially lower than that of its neighbors.

Here we have to define the number of neighbors (`n_neighbors`).
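To get a feel for the method before applying it to the real data, here is a small sketch on hypothetical toy data: a regular 5x5 grid of points plus one isolated point. It assumes scikit-learn's `LocalOutlierFactor` with its default `contamination` setting and tries two arbitrary values of `n_neighbors`; it is not meant as guidance on which value to pick.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy data: a dense 5x5 grid and one point far away from it.
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
X = np.vstack([grid, [[20.0, 20.0]]])

results = {}
for k in (3, 10):  # arbitrary neighbor counts, for illustration only
    lof = LocalOutlierFactor(n_neighbors=k)
    labels = lof.fit_predict(X)               # -1 marks outliers, 1 marks inliers
    scores = -lof.negative_outlier_factor_    # larger score = more anomalous
    results[k] = (labels, scores)
    print(k, "label of isolated point:", labels[-1], "score:", round(float(scores[-1]), 2))
```

Points inside the grid have neighbors of similar density, so their scores stay close to 1, while the isolated point sits in a much sparser region than its neighbors and gets a much larger score.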

In [14]:
from sklearn.neighbors import LocalOutlierFactor

# LOF cannot handle missing values, so they are filled with a sentinel first;
# assigning the result (instead of inplace=True) avoids SettingWithCopyWarning
df_num = df_num.fillna(-1)
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(df_num)  # returns -1 for outliers and 1 for inliers

df_with_outlier_pred = df.copy()
df_with_outlier_pred["lof"] = y_pred
new_df = df_with_outlier_pred[df_with_outlier_pred["lof"] != -1]  # keep inliers only



(10306, 45)