___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>

# Dealing with Outliers

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

Remember that even if a data point is an outlier, it's still a data point! Carefully consider your data, its sources, and your goals whenever deciding to remove an outlier. Each case is different!

## Lecture Goals
* Understand different mathematical definitions of outliers
* Use Python tools to recognize outliers and remove them

### Useful Links

* [Wikipedia Article](https://en.wikipedia.org/wiki/Outlier)
* [NIST Outlier Links](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm)

-------------

# Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Generating Data

In [None]:
# Choose a mean, standard deviation, and number of samples

def create_ages(mu=50,sigma=13,num_samples=100,seed=42):

    # Set a random seed in the same cell as the random call to get the same values as us
    # We set seed to 42 (42 is an arbitrary choice from Hitchhiker's Guide to the Galaxy)
    np.random.seed(seed)

    sample_ages = np.random.normal(loc=mu,scale=sigma,size=num_samples)
    sample_ages = np.round(sample_ages,decimals=0)
    
    return sample_ages

In [None]:
sample = create_ages()

In [None]:
sample

## Visualize and Describe the Data

In [None]:
# histplot replaces distplot, which is deprecated as of seaborn 0.11
sns.histplot(sample,bins=10,kde=False)

In [None]:
sns.boxplot(x=sample)

In [None]:
ser = pd.Series(sample)
ser.describe()
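
The quartiles reported by `describe()` are the same quantities a box plot is built from. As a rough sketch of one common rule (often attributed to Tukey), points more than 1.5 times the interquartile range beyond the quartiles can be flagged as candidate outliers. The helper names `iqr_fences` and `flag_outliers` and the `demo` values below are hypothetical and used only for illustration; the factor `k=1.5` is a convention, not a requirement.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits placed k * IQR beyond the quartiles."""
    s = pd.Series(values)
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Boolean mask that is True where a value falls outside the fences."""
    s = pd.Series(values)
    lower, upper = iqr_fences(s, k)
    return (s < lower) | (s > upper)

# Hypothetical ages, made up for this example
demo = pd.Series([21, 35, 42, 48, 50, 53, 57, 61, 66, 120])

lower, upper = iqr_fences(demo)
print(lower, upper)               # 18.75 84.75
print(demo[flag_outliers(demo)])  # only the value 120 is flagged
```

A flagged point is only a candidate: whether to keep, cap, or drop it still depends on the data and the goal.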

## Trimming or Fixing Based Off Domain Knowledge

If we know we're dealing with a dataset pertaining to voting age (18 years old in the USA), then it makes sense to either drop anything less than that OR fix values lower than 18 and push them up to 18.

In [None]:
# Keep everyone who is at least 18 (>= so that an age of exactly 18 is not dropped)
ser[ser >= 18]

In [None]:
# It dropped one person
len(ser[ser >= 18])

In [None]:
def fix_values(age):
    
    if age < 18:
        return 18
    else:
        return age

In [None]:
# "Fixes" one person's age
ser.apply(fix_values)

In [None]:
len(ser.apply(fix_values))
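
As a possible alternative to `apply` with a custom function, pandas provides `Series.clip`, which raises every value below a bound up to that bound in one vectorized call. The sketch below uses a small hypothetical `ages` Series (not the lecture's `sample`) to compare the two approaches.

```python
import pandas as pd

# Hypothetical ages, made up for this example
ages = pd.Series([16.0, 18.0, 25.0, 47.0, 83.0])

def fix_values(age):
    # Same logic as the lecture's function: push anything under 18 up to 18
    return 18 if age < 18 else age

via_apply = ages.apply(fix_values)
via_clip = ages.clip(lower=18)  # vectorized equivalent

print(via_clip.tolist())                        # [18.0, 18.0, 25.0, 47.0, 83.0]
print(via_apply.tolist() == via_clip.tolist())  # True
```

Both versions keep the Series the same length; only the out-of-range values change.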

--------

There are many ways to identify and remove outliers:
* Trimming based off a provided value
* Capping based off IQR or STD
* https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
* https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623
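
Capping based on the IQR or the standard deviation can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive recipe: the function names `cap_iqr` and `cap_std`, the multipliers `k=1.5` and `n_std=3.0`, and the synthetic `values` Series are all choices made for this example.

```python
import numpy as np
import pandas as pd

def cap_iqr(s, k=1.5):
    """Cap values at k * IQR beyond the first and third quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def cap_std(s, n_std=3.0):
    """Cap values at n_std standard deviations from the mean."""
    mu, sigma = s.mean(), s.std()
    return s.clip(lower=mu - n_std * sigma, upper=mu + n_std * sigma)

# Synthetic data: 200 roughly normal values plus one extreme point at 250
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(50, 13, size=200), [250.0]))

print(values.max())           # 250.0
print(cap_iqr(values).max())  # pulled down to the upper IQR limit
print(cap_std(values).max())  # pulled down to mean + 3 * std
```

Note that the extreme point itself inflates the mean and standard deviation, so the standard-deviation limits tend to be looser than the IQR limits on the same data.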

## Ames Data Set

Let's explore any extreme outliers in our Ames Housing Data Set.

In [None]:
df = pd.read_csv("../DATA/Ames_Housing_Data.csv")

In [None]:
df.head()

In [None]:
# numeric_only=True skips text columns, which raise an error in pandas 2.0+
sns.heatmap(df.corr(numeric_only=True))

In [None]:
df.corr(numeric_only=True)['SalePrice'].sort_values()

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df["SalePrice"],kde=True)

In [None]:
sns.scatterplot(x='Overall Qual',y='SalePrice',data=df)

In [None]:
df[(df['Overall Qual']>8) & (df['SalePrice']<200000)]

In [None]:
sns.scatterplot(x='Gr Liv Area',y='SalePrice',data=df)

In [None]:
df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)]

In [None]:
df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)].index

In [None]:
ind_drop = df[(df['Gr Liv Area']>4000) & (df['SalePrice']<400000)].index

In [None]:
df = df.drop(ind_drop,axis=0)
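
Dropping by index, as above, should give the same result as keeping the rows where the outlier condition is False, provided the index labels are unique. The sketch below checks this on a small hypothetical DataFrame named `toy`, whose values are made up and only reuse the two Ames column names.

```python
import pandas as pd

# Hypothetical rows, not taken from the Ames data
toy = pd.DataFrame({
    'Gr Liv Area': [1500, 2200, 4500, 5600, 4300],
    'SalePrice': [150000, 240000, 180000, 160000, 620000],
})

# Large living area but low sale price
is_outlier = (toy['Gr Liv Area'] > 4000) & (toy['SalePrice'] < 400000)

# Option 1: collect the index labels, then drop them (the approach used above)
dropped = toy.drop(toy[is_outlier].index, axis=0)

# Option 2: keep the rows where the condition is False
kept = toy[~is_outlier]

print(kept.index.tolist())   # [0, 1, 4]
print(dropped.equals(kept))  # True
```

The large, expensive house at index 4 is kept because it does not satisfy both conditions.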

In [None]:
sns.scatterplot(x='Gr Liv Area',y='SalePrice',data=df)

In [None]:
sns.scatterplot(x='Overall Qual',y='SalePrice',data=df)

In [None]:
df.to_csv("../DATA/Ames_outliers_removed.csv",index=False)

----