# **Univariate Outlier Detection**

In 1980 Hawkins described an outlier as "an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism". For me this description is too simple, and it could lead one to conclude that such observations should be excluded or deleted from the data before the analysis is completed. My view is that an outlier is a data point that seems to stand out from the rest of the pack. One should not immediately assume it is an erroneous data point; it could in fact be a quirk of our data. For example, maybe after a certain period of time our data was generated by a system that had become unstable. While this is a change in process, it tells us that this can happen. Sometimes outliers represent people or subgroups that do strange or alternative things, and they could tell us where future fashions or trends are going. Maybe they help us identify what is possible. They can be a data miner's gold and should not be excluded simply because they have been identified.

When you do identify an outlier you **should have strong evidence** that the data point is erroneous before you delete or exclude it. The rule I use is that a data point cannot be excluded unless there is scientific proof that the value is not possible. For example, if you were measuring people's heart rates and you found one of 1000bpm, it could be excluded from your analysis, as the fastest recorded heart rate is [600bpm](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3273956/). Note that I researched the topic and found a paper that supports the exclusion.

You might also argue that a data point does not follow normal practice. This may be the case, but you can only exclude the data point if your study is defined to be on normal or regular subjects. Let's look at another example. Imagine you were predicting house prices and most of them ranged from €150k to €1m, but you found one or two at €40m. This is quite possible, and these data points cannot be excluded unless you are going to narrow your models to houses under €1m.
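
The exclusion rule above can be sketched in a few lines of pandas. This is only an illustration: the readings in `heart_rates` are made up, and the names `MAX_PLAUSIBLE_BPM`, `impossible` and `analysis_df` are hypothetical. The one value taken from the text is the 600bpm limit. The point is to flag a value against a documented limit rather than silently delete it.

```python
import pandas as pd

# Hypothetical heart-rate readings (bpm); these values are invented for illustration.
heart_rates = pd.DataFrame({
    "subject": ["a", "b", "c", "d", "e"],
    "bpm": [62, 75, 180, 1000, 58],
})

# Upper limit based on the fastest recorded heart rate cited in the text.
MAX_PLAUSIBLE_BPM = 600

# Flag rather than delete, so the decision stays visible and can be reviewed.
heart_rates["impossible"] = heart_rates["bpm"] > MAX_PLAUSIBLE_BPM

# Only readings with documented evidence of impossibility are left out.
analysis_df = heart_rates[~heart_rates["impossible"]]

print(heart_rates)
print(f"{len(analysis_df)} of {len(heart_rates)} readings kept for analysis")
```

Note that the 180bpm reading is unusual but possible, so it stays in the analysis.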

<!--![alt text](https://www.computing.dcu.ie/~amccarren/mcm_images/Outlier_pic1.jpg) -->





Let's look at some data that you can get from sklearn. It is known as the California housing data and can be imported using the following code.

In [None]:
from sklearn.datasets import fetch_california_housing
import numpy as np
import pandas as pd

# Load the California housing data (downloaded on first use, then cached)
housing = fetch_california_housing()

# Print the dataset description: features, target and source
print(housing.DESCR)

x = housing.data    # feature matrix
y = housing.target  # target: median house value
print(pd.Series(y))
columns = housing.feature_names

# Create the dataframe with named feature columns and the target as 'Y'
housing_df = pd.DataFrame(housing.data, columns=columns)
housing_df['Y'] = pd.Series(y)
housing_df

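We can look for univariate outliers by using charts such as a box plot or an individual chart, in which each observation is plotted in order against the mean and limits set three standard deviations either side of it. Examples of both are shown below.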

In [None]:
import seaborn as sns

# Box plot of average occupancy; points drawn beyond the whiskers are candidate outliers
sns.boxplot(x=housing_df['AveOccup'])
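
To see which points a box plot would draw beyond its whiskers, the fences can be computed directly. The sketch below assumes the default whisker setting of 1.5 times the interquartile range (IQR). The helper `iqr_fences` and the small `sample` series are hypothetical stand-ins, so the example runs without downloading the housing data.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    s = pd.Series(values).dropna()
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up stand-in data with one extreme value.
sample = pd.Series([2.1, 2.5, 2.8, 3.0, 3.1, 3.3, 3.6, 4.0, 55.0])

lower, upper = iqr_fences(sample)

# Points outside the fences are candidates for investigation, not for automatic deletion.
flagged = sample[(sample < lower) | (sample > upper)]
print(f"fences: ({lower:.2f}, {upper:.2f})")
print(flagged)
```

Passing `housing_df['AveOccup']` instead of `sample` would apply the same check to the variable plotted above.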

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Plot each AveOccup observation in index order
x = np.array(housing_df.index.tolist())
y1 = np.array(housing_df['AveOccup'])

# Centre line at the mean, limits at the mean +/- 3 standard deviations
mean = housing_df['AveOccup'].mean()
std = housing_df['AveOccup'].std()

f = plt.figure()
ax = f.add_subplot(111)

plt.plot(x, y1)
plt.axhline(y=mean)
plt.axhline(y=mean + 3 * std, color='r')
plt.axhline(y=mean - 3 * std, color='r')

plt.title('Individual Chart: AveOccup', fontsize=8)

plt.show()
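
The red limits on the chart can also be applied numerically. The following is a minimal sketch, assuming the same mean ± 3 standard deviation limits as the chart. The function `three_sigma_flags` and the synthetic `sample` are hypothetical and are used here so that the example is self-contained.

```python
import numpy as np
import pandas as pd

def three_sigma_flags(values, n_sigma=3.0):
    """Flag values outside mean +/- n_sigma * standard deviation."""
    s = pd.Series(values, dtype=float)
    centre = s.mean()
    spread = s.std()  # pandas default: sample standard deviation (ddof=1)
    lower = centre - n_sigma * spread
    upper = centre + n_sigma * spread
    return (s < lower) | (s > upper), lower, upper

# Synthetic stand-in: 200 well-behaved values plus one extreme value.
rng = np.random.default_rng(0)
sample = pd.Series(np.append(rng.normal(3.0, 0.5, size=200), 40.0))

flags, lower, upper = three_sigma_flags(sample)
print(f"limits: ({lower:.2f}, {upper:.2f})")
print(f"{int(flags.sum())} point(s) outside the limits")
```

Keep in mind that an extreme value inflates the standard deviation used to set the limits, so the limits themselves are widened by the points they are meant to detect.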

It can often be confusing to decide how to deal with outliers and, as I said before, there is often a compulsion to remove them because they don't necessarily fit our original hypothesis.

We have examined the average occupancy variable (`AveOccup`) and we can see at least one outlier. Remember, do not remove outliers unless you have evidence to do so. Repeat this for the remaining variables. What results do you find?
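
One way to repeat the check across every variable is to loop over the numeric columns and count the points each rule flags. This is a sketch rather than a prescribed method: `outlier_summary` and `demo_df` are hypothetical names, and the 1.5 × IQR and 3 standard deviation thresholds are the assumptions used for the charts above.

```python
import numpy as np
import pandas as pd

def outlier_summary(df, k=1.5, n_sigma=3.0):
    """Count candidate outliers per numeric column under two simple rules."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        s = df[col].dropna()
        # Rule 1: outside the box plot fences (k * IQR beyond the quartiles)
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        n_iqr = int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())
        # Rule 2: more than n_sigma standard deviations from the mean
        n_far = int(((s - s.mean()).abs() > n_sigma * s.std()).sum())
        rows.append({"variable": col, "n_iqr": n_iqr, "n_3sigma": n_far})
    return pd.DataFrame(rows).set_index("variable")

# Made-up demonstration data: one steady column and one with a single spike.
demo_df = pd.DataFrame({
    "steady": np.arange(1.0, 21.0),
    "spiky": [1.0] * 19 + [100.0],
})

summary = outlier_summary(demo_df)
print(summary)
```

Calling `outlier_summary(housing_df)` would produce the same table for the housing variables. The counts only tell you where to look; each flagged point still needs evidence before it is excluded.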

