<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>

<center><h1>Outlier Detection, Sample Size, and Confidence Intervals</h1></center>

When you're designing an experiment, numbers matter.  After all, we want our experiments to be statistically valid; otherwise, we're just guessing.  In this notebook, we'll learn a method for detecting outliers in our data set called "Tukey Fences", named after famed statistician John Tukey.  

Next, we'll learn about confidence intervals, sample size, and the relationship between the two.  We'll learn how to calculate confidence intervals based on sample size, as well as how to determine the minimum sample size needed to reach a specific confidence interval.  

Let's get started!

<center><h2>Outlier Detection</h2></center>

Recall that before we begin an experiment, we usually start by "cleaning" our dataset.  This step usually includes things like:

* Exploring our dataset(s) to get a feel for what changes need to be made to make it more usable
* Examining and standardizing the values within cells (converting "yes"/"no" answers to 1's and 0's, for example)
* Dealing with cells that contain NaNs (null or missing values)
* Organizing and structuring datasets as needed (for instance, combining many small datasets into one big one)
* Normalizing continuous data into z-scores with a mean of 0 and unit variance.  

Another major step at this point in the project is to detect **outliers** and decide how to deal with them.  Outliers are extreme values that can skew our dataset, sometimes giving us an incorrect picture of what the data actually looks like.  The hardest part is deciding which data points are acceptable and which ones count as outliers.  This is where "Tukey Fences" come into play!

### 1.5 x IQR

To find outliers, we first need a working definition of what constitutes an outlier.  Tukey suggested we calculate the distance between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile) of the data, called the **interquartile range** (IQR).  We then multiply this value by 1.5.  To get the Fence for high values, add 1.5 x IQR to Q3.  Anything greater than this "Fence" value is considered an outlier.  Similarly, to get the Fence for low values, subtract 1.5 x IQR from Q1.  Anything less than this "Fence" value is also considered an outlier.  
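To make the arithmetic concrete, here is a minimal sketch on a tiny made-up Series. The values and the name `toy` are assumptions chosen so the quartiles are easy to check by hand, and `Series.quantile` is used with its default linear interpolation; this is one reasonable way to compute the fences, not the only one.

```python
import pandas as pd

# Hypothetical toy data: eight ordinary values and one suspiciously large one
toy = pd.Series([2, 4, 4, 5, 6, 7, 8, 9, 30])

q1 = toy.quantile(0.25)   # first quartile  -> 4.0
q3 = toy.quantile(0.75)   # third quartile  -> 8.0
iqr = q3 - q1             # interquartile range -> 4.0

lower_fence = q1 - 1.5 * iqr   # 4.0 - 6.0 = -2.0
upper_fence = q3 + 1.5 * iqr   # 8.0 + 6.0 = 14.0

# Keep only the values that fall outside the fences
outliers = toy[(toy < lower_fence) | (toy > upper_fence)]

print(lower_fence, upper_fence)   # -2.0 14.0
print(outliers.tolist())          # [30]
```

Only the value 30 lies beyond the upper fence of 14, so it is the single point flagged as an outlier.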

Let's try an example!

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1547)
%matplotlib inline

In [2]:
# Generate a random normal distribution of 1000 samples with mean 100 and std_dev of 8
normal_dist = np.random.normal(100, 8, (1000)).astype('float64')
# Generate a random uniform distribution between 1 and 200 with 100 samples
uniform_dist = np.random.uniform(1, 200, (100)).astype('float64')
# Combine both distributions and store them in a pandas Series object
sample_dset = pd.Series(np.append(normal_dist, uniform_dist))
sample_dset.describe()

count    1100.000000
mean      101.190110
std        17.605682
min        10.235233
25%        94.658860
50%       100.340004
75%       106.459821
max       195.226466
dtype: float64

Now that we've created an ugly data set, let's see if we can identify some outliers.  

Start by calculating the **Inter-Quartile Range**: Q3 - Q1.

Next, calculate how far your fences are from the quartiles: f = IQR x 1.5

Finally, place your fences and filter for values outside them:  Lower Fence = Q1 - f, Upper Fence = Q3 + f

See if you can write some code to filter for outliers in the `sample_dset` Series we've just created.

In [4]:
# Get Locations for Q1 and Q3
q1 = None
q3 = None

# calculate fence locations
lower_fence = None
upper_fence = None

# Filter out the outliers and inspect them!


Great! That works, but it isn't efficient to calculate this manually every time we run across a new data set.  

**TASK:** Write a function that takes in a pandas series, and returns a new pandas series with the outliers removed!

In [5]:
def remove_outliers(series):
    pass
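If you get stuck, the following is one possible sketch, not the intended solution. The function name `remove_outliers_sketch` and the optional multiplier `k` are assumptions introduced here for illustration. Values that land exactly on a fence are kept, matching the definition above, where only values strictly beyond a fence count as outliers.

```python
import pandas as pd

def remove_outliers_sketch(series, k=1.5):
    """Return the values of `series` that lie within the Tukey fences.

    `k` is a hypothetical parameter for the fence multiplier (1.5 by default).
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    f = k * (q3 - q1)                      # distance from each quartile to its fence
    keep = series.between(q1 - f, q3 + f)  # inclusive on both ends
    return series[keep]

# Usage on a small made-up Series: 30 lies above the upper fence of 14
cleaned = remove_outliers_sketch(pd.Series([2, 4, 4, 5, 6, 7, 8, 9, 30]))
print(cleaned.tolist())   # [2, 4, 4, 5, 6, 7, 8, 9]
```

The returned Series keeps the original index labels, which makes it easy to see which rows were dropped.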

<center><h2>Sample Size and Confidence Intervals</h2></center>