# What are outliers

An outlier is a data point that differs considerably from the other samples. For example, in a distribution of student ages, an age of 120 is an outlier; in this case it is impossible and should be removed. In a distribution of human ages in general, an age of 120 would be possible but far from the norm. If outliers are included in the analysis, they may skew some of the summary statistics (like the mean). Outliers may also carry information about data that doesn't conform to the norm of the dataset, which can be worth looking into. What counts as an outlier depends on the range that matters most to the analysis, which can be estimated from the summary statistics returned by the `.describe()` function.
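To make the skew concrete, here is a minimal sketch of how a single extreme value moves the mean while barely touching the median. The ages below are made up for illustration and are not taken from `outliers_data.csv`; the variable names are hypothetical.

```python
import pandas as pd

# Made-up student ages (illustrative only, not from outliers_data.csv)
ages = pd.Series([13, 14, 14, 15, 16])

# The same ages with one impossible value appended
ages_with_outlier = pd.concat([ages, pd.Series([120])], ignore_index=True)

mean_before = ages.mean()                      # 14.4
mean_after = ages_with_outlier.mean()          # 32.0
median_before = ages.median()                  # 14.0
median_after = ages_with_outlier.median()      # 14.5

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

One value out of six more than doubles the mean, while the median shifts by only half a year.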

There are two common methods to determine outliers:

* Using the standard deviation
* Using the inter-quartile range method

The standard deviation method basically works like this:

![Standard Deviation](./Standard%20Deviation.png)

*Anything more than one standard deviation away from the mean (beyond mean ± standard deviation) is an outlier*

The inter-quartile range method basically works like this:

![IQR](./IQR.png)


Box plots work well with the inter-quartile range method, because the box itself spans the first and third quartiles.
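The notebook does not implement the inter-quartile range method, so the following is only a sketch of one common form of it. It assumes the conventional 1.5 × IQR fences (the multiplier is an assumption, not something stated in this notebook) and uses made-up ages rather than the real dataset.

```python
import pandas as pd

# Made-up ages (illustrative only, not from outliers_data.csv)
ages = pd.Series([12, 13, 13, 14, 14, 15, 16, 17, 52, 3])

q1 = ages.quantile(0.25)   # first quartile
q3 = ages.quantile(0.75)   # third quartile
iqr = q3 - q1              # inter-quartile range

# Assumed convention: fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
iqr_outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(lower_fence, upper_fence)
print(iqr_outliers)
```

Because the quartiles are based on ranks, the fences are affected far less by the extreme values than a mean-based bound would be.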



Outliers can be:

* removed from the analysis
* temporarily removed in order to get summary statistics
* kept in the analysis entirely
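The second and third options can be sketched with a boolean mask, which leaves the original rows in place. The frame `df_demo`, the `is_outlier` column, and the 11-18 range used here are illustrative assumptions on made-up data.

```python
import pandas as pd

# Made-up data (illustrative only)
df_demo = pd.DataFrame({"Age": [12, 14, 15, 16, 75]})

# Boolean mask marking rows outside an assumed valid range of 11-18
is_outlier = (df_demo["Age"] < 11) | (df_demo["Age"] > 18)

# Temporary removal: compute the statistic on the filtered rows only;
# df_demo itself is not modified
mean_without_outliers = df_demo.loc[~is_outlier, "Age"].mean()

# Keeping the outliers: store the flag so later steps can decide what to do
df_demo["is_outlier"] = is_outlier

print(mean_without_outliers)
print(df_demo)
```

Flagging instead of dropping keeps the decision reversible, at the cost of having to remember the filter in every later computation.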

# Import Libraries

In [1]:
import pandas as pd

# Load Dataset

In [2]:
df = pd.read_csv('./outliers_data.csv')

# Basic information about the dataset

In [3]:
df.describe()

| | Student ID | Age |
|---|---|---|
| count | 527.0 | 527.0 |
| mean | 263.0 | 16.00759 |
| std | 152.276065 | 9.156432 |
| min | 0.0 | 3.0 |
| 25% | 131.5 | 13.0 |
| 50% | 263.0 | 14.0 |
| 75% | 394.5 | 16.0 |
| max | 526.0 | 75.0 |


In [4]:
df.Age.sort_values()

524     3
340     4
124     4
164     6
245    10
       ..
153    64
152    69
162    71
389    71
360    75
Name: Age, Length: 527, dtype: int64

# Identifying Outliers when the range is known

The range is assumed to be 11-18; anything outside that range is an outlier.

In [5]:
df.loc[(df['Age'] < 11)]

| | Student ID | Age |
|---|---|---|
| 124 | 449 | 4 |
| 164 | 269 | 6 |
| 245 | 322 | 10 |
| 340 | 435 | 4 |
| 524 | 12 | 3 |


In [6]:
df.loc[(df['Age'] > 18)]

| | Student ID | Age |
|---|---|---|
| 8 | 385 | 52 |
| 99 | 171 | 63 |
| 108 | 259 | 38 |
| 117 | 44 | 48 |
| 152 | 414 | 69 |
| 153 | 427 | 64 |
| 162 | 128 | 71 |
| 189 | 4 | 60 |
| 197 | 221 | 52 |
| 210 | 353 | 40 |
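The two range checks can also be combined into a single test. The sketch below uses `Series.between`, which is inclusive on both ends by default, on a small made-up frame; `df_demo` and its values are hypothetical.

```python
import pandas as pd

# Made-up data (illustrative only, not from outliers_data.csv)
df_demo = pd.DataFrame({
    "Student ID": [1, 2, 3, 4, 5],
    "Age": [4, 11, 15, 18, 52],
})

# True for ages inside the assumed valid range 11-18 (both ends included)
in_range = df_demo["Age"].between(11, 18)

# Negating the mask selects everything outside the range in one step
out_of_range = df_demo[~in_range]
print(out_of_range)
```

Ages of exactly 11 and 18 count as valid here, matching the strict `<` and `>` comparisons used in the cells above.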


# Identifying outliers using the standard deviation

Using the mean and the standard deviation returned by `describe()`, the outlier bounds are computed as mean ± standard deviation; anything outside those bounds is treated as an outlier.

In [7]:
summaries = df.describe().loc[['mean','std']]
summaries

| | Student ID | Age |
|---|---|---|
| mean | 263.0 | 16.00759 |
| std | 152.276065 | 9.156432 |


In [10]:
upper_bound = summaries['Age']['mean'] + summaries['Age']['std']
lower_bound = summaries['Age']['mean'] - summaries['Age']['std']

In [11]:
upper_bound

25.16402210510057

In [12]:
lower_bound

6.851158160554078
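The same bounds can be computed directly from the column, with the number of standard deviations as a parameter. This is a sketch only: `std_bounds` and its `k` argument are hypothetical names. The notebook uses one standard deviation (`k=1`); wider cut-offs such as two or three standard deviations are also common.

```python
import pandas as pd

def std_bounds(values, k=1.0):
    """Return (lower, upper) bounds at k standard deviations from the mean."""
    mean = values.mean()
    std = values.std()  # sample standard deviation (ddof=1), as in describe()
    return mean - k * std, mean + k * std

# Made-up ages (illustrative only, not from outliers_data.csv)
ages = pd.Series([10, 12, 14, 16, 18])

lower_1, upper_1 = std_bounds(ages)         # one standard deviation
lower_2, upper_2 = std_bounds(ages, k=2.0)  # a wider, less aggressive cut-off
print(lower_1, upper_1)
print(lower_2, upper_2)
```

A larger `k` flags fewer rows. With `k=1`, a sizeable share of perfectly ordinary values can fall outside the bounds when the data is roughly bell-shaped.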

## Finding the violating rows

In [13]:
violating_rows = df[(df['Age']<lower_bound) | (df['Age'] > upper_bound)]
violating_rows

| | Student ID | Age |
|---|---|---|
| 8 | 385 | 52 |
| 99 | 171 | 63 |
| 108 | 259 | 38 |
| 117 | 44 | 48 |
| 124 | 449 | 4 |
| 152 | 414 | 69 |
| 153 | 427 | 64 |
| 162 | 128 | 71 |
| 164 | 269 | 6 |
| 189 | 4 | 60 |


# Removing outliers from the dataset

In [14]:
outliers_index = violating_rows.index
outliers_index

Index([  8,  99, 108, 117, 124, 152, 153, 162, 164, 189, 197, 210, 232, 277,
       318, 340, 360, 366, 383, 388, 389, 411, 420, 444, 445, 504, 510, 517,
       523, 524],
      dtype='int64')

In [15]:
newdf = df.drop(index=outliers_index)
newdf.describe()

| | Student ID | Age |
|---|---|---|
| count | 497.0 | 497.0 |
| mean | 262.917505 | 14.158954 |
| std | 152.291698 | 1.939843 |
| min | 0.0 | 10.0 |
| 25% | 131.0 | 12.0 |
| 50% | 263.0 | 14.0 |
| 75% | 394.0 | 16.0 |
| max | 526.0 | 17.0 |


In [16]:
df.describe()

| | Student ID | Age |
|---|---|---|
| count | 527.0 | 527.0 |
| mean | 263.0 | 16.00759 |
| std | 152.276065 | 9.156432 |
| min | 0.0 | 3.0 |
| 25% | 131.5 | 13.0 |
| 50% | 263.0 | 14.0 |
| 75% | 394.5 | 16.0 |
| max | 526.0 | 75.0 |


# Applying the drop operation on the original dataframe

In [17]:
df.drop(index=outliers_index,inplace=True)

In [18]:
df.describe()

| | Student ID | Age |
|---|---|---|
| count | 497.0 | 497.0 |
| mean | 262.917505 | 14.158954 |
| std | 152.291698 | 1.939843 |
| min | 0.0 | 10.0 |
| 25% | 131.0 | 12.0 |
| 50% | 263.0 | 14.0 |
| 75% | 394.0 | 16.0 |
| max | 526.0 | 17.0 |
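Since an in-place drop modifies the dataframe itself, it can be worth keeping a copy if the original rows might be needed later. The sketch below shows that behaviour on made-up data; `df_demo`, `backup`, and the 11-18 range are illustrative assumptions.

```python
import pandas as pd

# Made-up data (illustrative only, not from outliers_data.csv)
df_demo = pd.DataFrame({"Age": [4, 12, 14, 16, 75]})

# Index labels of the rows outside an assumed valid range of 11-18
outliers_index = df_demo[(df_demo["Age"] < 11) | (df_demo["Age"] > 18)].index

# Keep an untouched copy before mutating the frame
backup = df_demo.copy()

# With inplace=True, drop() returns None and changes df_demo directly
result = df_demo.drop(index=outliers_index, inplace=True)

print(result)        # None
print(len(df_demo))  # rows remaining after the drop
print(len(backup))   # rows in the untouched copy
```

Assigning the return value of an in-place drop back to the dataframe variable would replace it with `None`, which is a common slip.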
