# Outliers

Outliers can result from measurement or data-entry errors, or they can be extreme yet valid values. It is not always clear which is the case. Either way, outliers (and, in a regression setting, high-leverage points) can have a large effect on aggregates such as the mean and standard deviation, and on regression estimates.
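
To illustrate how sensitive these aggregates are, here is a minimal sketch using made-up numbers (the values below are hypothetical and not taken from the sample dataset). Adding a single extreme value shifts the mean and standard deviation considerably, while the median barely moves.

```python
import pandas as pd

# hypothetical toy data, not from the sample dataset
clean = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = pd.concat([clean, pd.Series([100.0])], ignore_index=True)

# compare aggregates with and without the extreme value
summary = pd.DataFrame({
    "clean": clean.agg(["mean", "median", "std"]),
    "with_outlier": with_outlier.agg(["mean", "median", "std"]),
})
print(summary)
```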

There are several ways of dealing with outliers. Some methods remove the outliers from the dataset, while others replace the extreme values with less extreme ones.

Methods discussed in this notebook:
    
- winsorizing (and trimming) based on percentiles
- trimming based on the standard deviation (z-score) or the interquartile range

Other ways to reduce the effect of outliers:
- binning, see [pandas-4-binning.ipynb](pandas-4-binning.ipynb)
- taking the log of the value
- ranking a variable: the smallest value gets rank 1, the next gets rank 2, etc. (without ties, this gives the variable a uniform distribution)
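
The log and rank transformations can be sketched as follows. This is a minimal example on hypothetical toy values, not the sample dataset. It assumes strictly positive data for the log, and `method="first"` is just one of several tie-breaking options that `Series.rank` offers.

```python
import numpy as np
import pandas as pd

# hypothetical toy values with one extreme observation
values = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])

# log transform: compresses large values (requires strictly positive data)
log_values = np.log(values)

# rank transform: smallest value gets 1, the next 2, and so on
ranks = values.rank(method="first")

print(pd.DataFrame({"value": values, "log": log_values, "rank": ranks}))
```

After either transformation the extreme value is still the largest, but its distance to the other observations is much smaller.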

### Sample dataset

In [None]:
import pandas as pd
import numpy as np

# read sample dataset
data = pd.read_csv('../datasets/feedback.csv')
data.describe()

## Winsorizing

Winsorizing transforms a variable by replacing the extreme values (in both the lower and upper tail) with less extreme ones: values beyond a cutoff are set to the value at that cutoff. For large samples the cutoffs are typically the 1st and 99th percentiles. For smaller samples, 2% and 98% or 5% and 95% are also used.

What winsorizing does is best explained by a picture (see below). Both distributions have the same number of observations. The winsorized data (lower panel) no longer has the tails, and the outermost bins are a bit taller because the extreme observations have been moved into them.

![winsorized](images/winsorized.png)

Image source: https://blogs.sas.com/content/iml/2017/02/08/winsorization-good-bad-and-ugly.html

In [None]:
# winsorize with scipy.stats.mstats
# winsorize() returns a masked array; the underlying values are available through .data
# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.winsorize.html

# import mstats
from scipy.stats import mstats

# add winsorized price to the dataframe
# limits: the fraction of observations to winsorize in the lower and upper tail
data['price_w'] = mstats.winsorize(data['price'], limits=[0.01, 0.01]).data
# compare the distributions
data[['price', 'price_w']].describe()
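
As an alternative sketch, similar behavior can be obtained with pandas alone by clipping at the percentile values. The toy series below is hypothetical. Results may differ slightly from `mstats.winsorize`, since `quantile` interpolates between observations by default.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: 100 values, with both ends replaced by extreme values
prices = pd.Series(np.arange(1, 101, dtype=float))
prices.iloc[0] = -500.0
prices.iloc[-1] = 5000.0

# percentile bounds (interpolated by default)
lower, upper = prices.quantile([0.01, 0.99])

# values outside the bounds are set to the bounds; no rows are dropped
prices_w = prices.clip(lower=lower, upper=upper)

print(pd.DataFrame({"price": prices, "price_w": prices_w}).describe())
```

Unlike trimming, the number of observations stays the same; only the values in the tails change.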

## Trimming

Trimming is like a haircut: observations with extreme values are removed.

Typical ways to set the cut-off values:
    
- using percentiles (for example 1% and 99%)
- using the z-value (number of standard deviations)
- using the interquartile range

See https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/

### Trimming using percentiles

In [None]:
# re-read sample dataset
data = pd.read_csv('../datasets/feedback.csv')

# get the first and last percentiles for price
q_low = data["price"].quantile(0.01)
q_hi  = data["price"].quantile(0.99)

print("first percentile:", q_low, "last percentile:", q_hi)

# keep only observations strictly between the two percentiles
data_filtered = data[(data["price"] < q_hi) & (data["price"] > q_low)]
# see the distribution
data_filtered[['price']].describe()

### Trimming based on Z-score

The z-score is the number of standard deviations an observation lies from the mean.

A common rule of thumb is to drop observations that are more than 3 standard deviations away from the mean.
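
To make the definition concrete, here is a minimal sketch that computes z-scores by hand on hypothetical toy data. It uses `ddof=0` (the population standard deviation), which is the default of `scipy.stats.zscore`; note that pandas `std` defaults to `ddof=1`.

```python
import pandas as pd

# hypothetical toy data: 20 ordinary values and one extreme value
values = pd.Series([10.0] * 10 + [12.0] * 10 + [100.0])

# z-score = (value - mean) / standard deviation
z = (values - values.mean()) / values.std(ddof=0)

# keep observations within 3 standard deviations of the mean
kept = values[z.abs() < 3]

print("largest |z|:", z.abs().max())
print("observations kept:", len(kept), "of", len(values))
```

Because the mean and standard deviation are themselves affected by outliers, a single extreme value in a very small sample may never reach a z-score of 3.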

In [None]:
from scipy import stats

# re-read sample dataset
data = pd.read_csv('../datasets/feedback.csv')

# keep observations whose z-score is less than 3 in absolute value
data_filtered = data[np.abs(stats.zscore(data['price'])) < 3]

# see the distribution
data_filtered[['price']].describe()

# https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-a-pandas-dataframe

### Trimming based on Interquartile range

The interquartile range (IQR) is the difference between the 75th and 25th percentiles. This range is multiplied by 1.5, and the result is subtracted from the 25th percentile and added to the 75th percentile to give the lower and upper cutoffs. Observations with values outside these cutoffs are dropped.
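
The arithmetic can be followed step by step on a small example. The values below are hypothetical and chosen so that the quartiles fall exactly on observations.

```python
import pandas as pd

# hypothetical toy data with one extreme value
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

q25 = values.quantile(0.25)   # 3.0
q75 = values.quantile(0.75)   # 7.0
iqr = q75 - q25               # 4.0

# cutoffs: 1.5 * IQR below the 25th and above the 75th percentile
lower = q25 - 1.5 * iqr       # -3.0
upper = q75 + 1.5 * iqr       # 13.0

# keep observations strictly inside the cutoffs
kept = values[(values > lower) & (values < upper)]
print(kept.tolist())
```

Only the value 50 falls outside the range from -3 to 13, so it is the only observation dropped.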

In [None]:
# re-read sample dataset
data = pd.read_csv('../datasets/feedback.csv')

# calculate interquartile range (np was imported in the first code cell)
q25, q75 = np.percentile(data['price'], [25, 75])
iqr = q75 - q25

# calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

# remove outliers: keep observations strictly inside the cutoffs
data_filtered = data[(data["price"] > lower) & (data["price"] < upper)]

# see the distribution
data_filtered[['price']].describe()