# Calculating and Removing Outliers

In [1]:
import pandas as pd
import numpy as np

# Calculate and Remove Outliers

<b>OUTLIER</b>

An outlier is a data point that differs significantly from other observations. Whether a value counts as an outlier depends on the dataset as a whole: the same value can be typical in one dataset and extreme in another.

<center><img src='https://www.stevesjogren.com/wp-content/uploads/2011/06/1_TbUF_HTQ6jOhO8EoPnmekQ-696x385.jpg'></center>

In [2]:
filename = "datasets/gradedata.csv"
df = pd.read_csv(filename)
print(df.shape)
df.head()

(2000, 8)


Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17,3,10,82.4,"9253 Richardson Road, Matawan, NJ 07747"
1,Kadeem,Morrison,male,18,4,4,78.2,"33 Spring Dr., Taunton, MA 02780"
2,Nash,Powell,male,18,5,9,79.3,"41 Hill Avenue, Mentor, OH 44060"
3,Noelani,Wagner,female,14,2,7,83.2,"8839 Marshall St., Miami, FL 33125"
4,Noelani,Cherry,female,18,4,15,87.4,"8304 Charles Rd., Lewis Center, OH 43035"


## REVIEW

***************************************

#### MEAN

* The average of a series of numbers. To calculate the mean, add up all the numbers in the series and divide the sum by the number of values.

#### MEDIAN

* The median is the middle value of a set of data once the values are sorted. If there is an even number of values, the median is the mean of the two middle values.

#### MODE

* The mode of a set of data is the value in the set that occurs most often.
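
As a quick illustration of the three definitions above, here is a minimal sketch using pandas. The `grades` values are made up for the example and are not taken from `gradedata.csv`.

```python
import pandas as pd

# Hypothetical sample of grades (illustration only)
grades = pd.Series([70, 85, 85, 90, 100])

mean_grade = grades.mean()      # (70 + 85 + 85 + 90 + 100) / 5
median_grade = grades.median()  # middle value of the sorted series
mode_grades = grades.mode()     # returned as a Series because ties are possible

print(mean_grade, median_grade, mode_grades.tolist())
```

Note that `mode()` returns every value tied for most frequent, so it can contain more than one entry.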

# Method 1: STANDARD DEVIATION
## Best with normally distributed data

### <u>How much does the data vary from the average?</u>

<center><img src='https://www.benlcollins.com/wp-content/uploads/2016/02/image.png'></center>

### SD is a measure of <b>spread</b>, how spread out a set of data is. 

A <b>low</b> SD means the data is closely clustered around the mean. A <b>high</b> SD means the data is spread out over a wider range of values. 

<center><img src='https://www.thepokerbank.com/images/std-dev-low-high.png'></center>
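
To make the low versus high SD contrast concrete, here is a small sketch with two made-up lists that share the same mean but differ in spread. It uses `statistics.stdev`, which computes the sample standard deviation, the same convention pandas uses by default in `Series.std()`.

```python
import statistics

# Two hypothetical datasets with the same mean (50) but different spread
tight = [49, 50, 50, 51, 50]
wide = [10, 30, 50, 70, 90]

tight_sd = statistics.stdev(tight)  # small: values sit close to the mean
wide_sd = statistics.stdev(wide)    # large: values are far from the mean

print(round(tight_sd, 2), round(wide_sd, 2))
```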


***************************************

### 68 / 95 / 99.7 RULE

When your data is normally distributed (bell curve), it will typically follow this pattern:

<center><img src='https://www.thedataschool.co.uk/wp-content/uploads/2016/02/Normal-distribution-curve.jpg'></center>


* 68% of the data falls within 1 SD of the mean

* 95% of the data falls within 2 SDs of the mean

* 99.7% of the data falls within 3 SDs of the mean

### Example:

The average cost of lemons in the US is 3.00 dollars per pound with an SD of 0.50 dollars (50 cents). If prices are normally distributed:

* 68% of lemons cost between 2.50 and 3.50 dollars per pound. 
* 95% of lemons cost between 2.00 and 4.00 dollars per pound.
* 99.7% of lemons cost between 1.50 and 4.50 dollars per pound. 
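
The ranges in the lemon example are just the mean plus or minus 1, 2, and 3 SDs. A minimal sketch of that arithmetic follows; `sd_band` is a hypothetical helper name introduced only for this illustration.

```python
def sd_band(mean, sd, k):
    """Return the (low, high) range that lies within k SDs of the mean."""
    return (mean - k * sd, mean + k * sd)

mean_price = 3.00  # dollars per pound
sd_price = 0.50    # dollars per pound

for k, share in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = sd_band(mean_price, sd_price, k)
    print(f"{share} of prices: {low:.2f} to {high:.2f} dollars per pound")
```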

***************************************

### Distance from the mean

A data point's distance from the mean is measured by how many SDs it lies away from the mean. A data point that is more than a certain number of SDs from the mean represents an outcome that is significantly above or below the mean.

You can set a predetermined 'cut-off distance' from the mean, and everything outside of that distance will be considered an outlier. The cutoff value that you use will depend on your dataset: the less data you have, the stricter your cutoff should be.

#### A common cutoff point is +/- 1.96 SDs from the mean, which covers about 95% of normally distributed data. Anything above or below these values is considered an outlier.
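
The "number of SDs from the mean" is often called a z-score. The sketch below computes it with NumPy for a made-up array and flags values beyond the 1.96 cutoff. The data and variable names are assumptions for illustration, not part of the grade dataset.

```python
import numpy as np

# Hypothetical values with one obviously extreme entry
values = np.array([78.0, 80.0, 82.0, 79.0, 81.0, 80.0, 83.0, 77.0, 80.0, 120.0])

mean = values.mean()
sd = values.std(ddof=1)   # sample SD, matching the pandas default

z = (values - mean) / sd  # distance from the mean, measured in SDs
cutoff = 1.96
is_outlier = np.abs(z) > cutoff

print(values[is_outlier])
```

Keep in mind that an extreme value also inflates the SD it is measured against, so in very small samples this rule can miss outliers.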

***************************************

## REMOVING OUTLIERS USING SD

* Determine a cut-off point 
* Calculate mean, std, and the cut-off values
* Drop all values that fall outside of this range 

In [3]:
print(df.shape)
df.head()

(2000, 8)


Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17,3,10,82.4,"9253 Richardson Road, Matawan, NJ 07747"
1,Kadeem,Morrison,male,18,4,4,78.2,"33 Spring Dr., Taunton, MA 02780"
2,Nash,Powell,male,18,5,9,79.3,"41 Hill Avenue, Mentor, OH 44060"
3,Noelani,Wagner,female,14,2,7,83.2,"8839 Marshall St., Miami, FL 33125"
4,Noelani,Cherry,female,18,4,15,87.4,"8304 Charles Rd., Lewis Center, OH 43035"


In [4]:
#Standard Deviation Method

meangrade = df['grade'].mean() # what is the mean grade for the class?

stdgrade = df['grade'].std() # what is the standard deviation of class grades?

toprange = meangrade + stdgrade * 1.96 # the top limit for grade, anything above this is considered an outlier
botrange = meangrade - stdgrade * 1.96 # the bottom limit for grade, anything below this is considered an outlier

copydf = df.copy() ## create a new dataset that will remove outliers

copydf = copydf.drop(copydf[copydf['grade'] > toprange].index)
copydf = copydf.drop(copydf[copydf['grade'] < botrange].index)

print(copydf.shape) #54 outliers dropped
copydf.head()

(1946, 8)


Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17,3,10,82.4,"9253 Richardson Road, Matawan, NJ 07747"
1,Kadeem,Morrison,male,18,4,4,78.2,"33 Spring Dr., Taunton, MA 02780"
2,Nash,Powell,male,18,5,9,79.3,"41 Hill Avenue, Mentor, OH 44060"
3,Noelani,Wagner,female,14,2,7,83.2,"8839 Marshall St., Miami, FL 33125"
4,Noelani,Cherry,female,18,4,15,87.4,"8304 Charles Rd., Lewis Center, OH 43035"
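
The same steps can also be wrapped in a reusable function that builds one boolean mask instead of two `drop` calls. This is a sketch under stated assumptions: `drop_sd_outliers` and the `demo` frame are hypothetical names and data, not part of the original notebook.

```python
import pandas as pd

def drop_sd_outliers(frame, column, n_sd=1.96):
    """Return a copy of frame without rows more than n_sd SDs from the column mean."""
    mean = frame[column].mean()
    sd = frame[column].std()
    low, high = mean - n_sd * sd, mean + n_sd * sd
    # Rows strictly outside [low, high] are outliers; missing values are kept
    outlier = (frame[column] < low) | (frame[column] > high)
    return frame[~outlier].copy()

# Hypothetical data for illustration
demo = pd.DataFrame({"grade": [78.0, 80.0, 82.0, 79.0, 81.0, 80.0, 83.0, 77.0, 80.0, 120.0]})
cleaned = drop_sd_outliers(demo, "grade")

print(len(demo) - len(cleaned), "rows dropped")
```

With the notebook's data, the equivalent call would be `drop_sd_outliers(df, 'grade')`.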


# Method 2: INTERQUARTILE RANGE
## Best with non-normally distributed data

The Interquartile Range (IQR) is a measure of where the bulk of the data values lie. The IQR is the range of the middle 50% of the dataset.

***************************************

### VOCAB

<b>Quartile:</b> the values that divide an ordered list of numbers into quarters

<b>Lower Quartile (Q1):</b> median of the lower half of data 

<b>Upper Quartile (Q3):</b> median of the upper half of data

<b>Range:</b> difference between the greatest and least values in a dataset

<b>IQR:</b> range of the middle 50% of the data, calculated as the difference between the upper and lower quartiles

***************************************

In a given set of numbers, there are three values that divide the data into four sections. 

<center><img src='https://www.mathsisfun.com/data/images/interquartile-range.svg'></center>

## Q1
The median of the lower half of the data


## Q2
The median of the full set of data


## Q3
The median of the upper half of the data


## IQR
The difference between Q3 and Q1. The IQR is calculated by the following formula:

IQR = Q3 - Q1 

***************************************

## 1.5 RULE

A commonly used rule of thumb says that a data point is an outlier if it is . . . 

* ... below Q1 - 1.5(IQR)

* ... above Q3 + 1.5(IQR)


### Example:

In a list of student absences {1, 3, 4, 6, 7, 7, 8, 8, 10, 12, 17}

* Q1 = 4
* Q2 = 7
* Q3 = 10

#### Q3(10) - Q1(4) = IQR = 6

lower limit : 4 - 1.5(6) = <b>-5</b>

upper limit : 10 + 1.5(6) = <b>19</b>

Data values less than the lower limit or greater than the upper limit are considered outliers. Every value in this list falls between -5 and 19, so there are no outliers in this example.
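
The absences example can be checked in code. The sketch below follows the "median of each half" definition used above, then compares it with the default behavior of pandas `quantile`, which interpolates linearly between neighboring values. Variable names are assumptions made for this illustration.

```python
import statistics
import pandas as pd

absences = sorted([1, 3, 4, 6, 7, 7, 8, 8, 10, 12, 17])
n = len(absences)

# Split around the median (the middle value is excluded when n is odd)
lower_half = absences[: n // 2]
upper_half = absences[(n + 1) // 2 :]

q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
outliers = [x for x in absences if x < low_limit or x > high_limit]
print(q1, q3, iqr, low_limit, high_limit, outliers)

# pandas uses linear interpolation by default, which can give different quartiles
s = pd.Series(absences)
pd_q1, pd_q3 = s.quantile(0.25), s.quantile(0.75)
print(pd_q1, pd_q3)
```

For this list, the pandas defaults give Q1 = 5.0 and Q3 = 9.0 rather than 4 and 10, which narrows the limits to -1 and 15 and would flag 17 as an outlier. Different quartile conventions can therefore change which points are flagged, especially in small datasets.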

***************************************

## REMOVING OUTLIERS USING IQR

* Calculate the quartiles (Q1 and Q3)
* Calculate the IQR
* Determine the upper and lower limits
* Drop values less than the lower limit or greater than the upper limit

In [5]:
#Interquartile Range Method

q1 = df['grade'].quantile(.25) #calculate Q1
q3 = df['grade'].quantile(.75) #calculate Q3

iqr = q3 - q1 # calculate the IQR

toprange = q3 + iqr * 1.5 # determine the upper limit, values higher than this are outliers
botrange = q1 - iqr * 1.5 # determine the lower limit, values lower than this are outliers

copydf = df.copy() ## create a new dataset that will remove outliers

copydf = copydf.drop(copydf[copydf['grade'] > toprange].index) #drop grades higher than the top range
copydf = copydf.drop(copydf[copydf['grade'] < botrange].index) #drop grades lower than the bottom range

print(copydf.shape) #2 outliers dropped
copydf.head()

(1998, 8)


Unnamed: 0,fname,lname,gender,age,exercise,hours,grade,address
0,Marcia,Pugh,female,17,3,10,82.4,"9253 Richardson Road, Matawan, NJ 07747"
1,Kadeem,Morrison,male,18,4,4,78.2,"33 Spring Dr., Taunton, MA 02780"
2,Nash,Powell,male,18,5,9,79.3,"41 Hill Avenue, Mentor, OH 44060"
3,Noelani,Wagner,female,14,2,7,83.2,"8839 Marshall St., Miami, FL 33125"
4,Noelani,Cherry,female,18,4,15,87.4,"8304 Charles Rd., Lewis Center, OH 43035"
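
As with the SD method, the IQR steps can be packaged into a function. The sketch below is one possible way to do it; `drop_iqr_outliers`, the multiplier parameter `k`, and the skewed `demo` data are hypothetical and only serve the illustration.

```python
import pandas as pd

def drop_iqr_outliers(frame, column, k=1.5):
    """Return a copy of frame without rows outside Q1 - k*IQR and Q3 + k*IQR."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Rows strictly outside [low, high] are outliers; missing values are kept
    outlier = (frame[column] < low) | (frame[column] > high)
    return frame[~outlier].copy()

# Hypothetical, right-skewed data for illustration
demo = pd.DataFrame({"hours": [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]})
cleaned = drop_iqr_outliers(demo, "hours")

print(cleaned["hours"].tolist())
```

With the notebook's data, the equivalent call would be `drop_iqr_outliers(df, 'grade')`.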


## IQR & BOX PLOT CHARTS

A box plot displays the three quartile values that divide a set of numbers into four sections.

<LEFT><img src='https://i2.wp.com/makemeanalyst.com/wp-content/uploads/2017/05/simple.box_.defs_.gif?resize=540%2C370'></LEFT>