# INFO 98: Data Science Skills, Spring 2019
## Lab 03: Data Analysis

---

## Table of Contents
* [Setup](#setup)
* [Data Analysis](#da)
    * [Estimation Preliminaries](#est)
    * [Bootstrap](#boot)
* [Example](#ex)
    * Hypothesis Testing (Optional Review Section)
        * Design (Optional Review Section)
        * Test Statistic (Optional Review Section)
        * Permutation Testing (Optional Review Section)
    * Bootstrap Confidence Interval
    * Summary
    
**Note:** This lab includes some Optional Review Sections that recap content from previous lectures connected to this lab. You are still **required** to run the code blocks needed for the Bootstrap Confidence Interval step. We also _highly recommend_ going over these sections, as it's crucial to understand how their content connects with the bootstrap.

<a id='setup'></a>
# Setup
____

In [None]:
# Setup

!pip install nbinteract
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
import re
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('display.precision', 2)  # full option name; the bare 'precision' is ambiguous in newer pandas
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

<a id='da'></a>
# Data Analysis
___

<a id='est'></a>
## 1. Estimation Preliminaries

In the previous labs we began to develop ways of inferential thinking. In particular, we learned how to use data to decide between two hypotheses about the world. But often we just want to know how big something is.

In this lab, we will develop a way to estimate an unknown parameter. Remember that a parameter is a numerical value associated with a population.

To figure out the value of a parameter, we need data. If we have the relevant data for the entire population, we can simply calculate the parameter.

But if the population is very large – for example, if it consists of all the households in the United States – then it might be too expensive and time-consuming to gather data from the entire population. In such situations, data scientists rely on sampling at random from the population.

This leads to a question of inference:
> How to make justifiable conclusions about the unknown parameter, based on the data in the random sample? We will answer this question by using inferential thinking.

A statistic based on a random sample can be a reasonable estimate of an unknown parameter in the population. For example, you might want to use the median annual income of sampled households as an estimate of the median annual income of all households in the U.S.

But the value of any statistic depends on the sample, and the sample is based on random draws. So every time data scientists come up with an estimate based on a random sample, they are faced with a question:

> “How different could this estimate have been, if the sample had come out differently?”

In this lab you will learn one way of answering this question. The answer will give you the tools to estimate a numerical parameter and quantify the amount of error in your estimate.

We will start with a preliminary about percentiles. The most famous percentile is the median, often used in summaries of income data. Other percentiles will be important in the method of estimation that we are about to develop. So we will start by defining percentiles carefully.

### Percentiles

Numerical data can be sorted in increasing or decreasing order. Thus the values of a numerical data set have a rank order. A percentile is the value at a particular rank.

For example, if your score on a test is at the 95th percentile, a common interpretation is that only 5% of the scores were higher than yours. The median is the 50th percentile; it is commonly assumed that 50% of the values in a data set are above the median.

**The General Definition**

Let  `p`  be a number between 0 and 100. The  $p^{th}$  percentile of a collection is the smallest value in the collection that is at least as large as p% of all the values.
By this definition, any percentile between 0 and 100 can be computed for any collection of values, and it is always an element of the collection.
In practical terms, suppose there are  n elements in the collection. To find the $p^{th}$ percentile:
1. Sort the collection in increasing order.
2. Find p% of n: 
    (p/100)×n. Call that  k.
3. If  k is an integer, take the  $k^{th}$ element of the sorted collection.
4. If  k is not an integer, round it up to the next integer, and take that element of the sorted collection.
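
The four steps above can be sketched directly in plain Python. This is a minimal illustration of the definition, not the method `np.percentile` uses by default; the function name `percentile` and the `scores` list are made up for the example, and the sketch assumes `p` is greater than 0 and the collection is non-empty.

```python
import math

def percentile(p, values):
    """Return the p-th percentile of values, following the definition above.

    Assumes 0 < p <= 100 and a non-empty collection (hypothetical helper).
    """
    ordered = sorted(values)                 # step 1: sort in increasing order
    k = math.ceil(p * len(ordered) / 100)    # steps 2-4: p% of n, rounded up if needed
    return ordered[k - 1]                    # k-th element, counting from 1

scores = [1, 7, 3, 9, 5]                     # illustrative data
print(percentile(50, scores))  # 5
print(percentile(25, scores))  # 3
```

Because the result is always picked from the sorted collection, it is guaranteed to be one of the original values.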

In [None]:
# Create a ten integer array using np.array
ten_array = ...

Go ahead and try out the <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html#numpy.percentile"> `np.percentile`</a> function to find the element from `ten_array` that represents the 20th percentile.

In [None]:
# Return the value of the 20th percentile
twenty_percentile = ...
twenty_percentile

### Quantiles

We can also perform this operation directly in Pandas using a closely related concept of quantiles. Look up <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html"> `DataFrame.quantile()`</a> function to get a quick overview of this operation.

More formally, any set of data arranged in ascending or descending order can be divided into parts, also known as partitions or subsets, whose boundaries are given by quantiles. Quantile is a generic term for the values that divide the set into n partitions of equal size, so that each part represents 1/n of the set. Quantiles are not the partitions themselves; they are the numbers that define the partitions. You can think of them as numeric boundaries.

There are various kinds of quantiles, such as the quartiles (watch out for the different letter!), which divide a list of numbers into quarters. They are also known as 4-quantiles. The list of quantile types is long; you can also have 3-quantiles, 5-quantiles, 6-quantiles, and so on up to 1000-quantiles and beyond. It all depends on the size of your data and what you need to do with it.

What follows is an example with quartiles, where x is a set of numbers:

> x = [5, 6, 9, 11, 13, 20, 26]
><br>First quartile, or Q1 = 6
><br>Second quartile, or Q2 = 11
><br>Third quartile, or Q3 = 20

The second quartile always corresponds to the median of the set x.
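
To make the link between percentiles and quantiles concrete, here is a small sketch on made-up data (the series `values` is an assumption for illustration only). It shows that `quantile` takes a fraction between 0 and 1 while `np.percentile` takes a number between 0 and 100. Both use linear interpolation by default, so the result need not be an element of the data.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 20, 30, 40])  # illustrative data, not from the lab

median_q = values.quantile(0.5)       # quantile takes a fraction in [0, 1]
median_p = np.percentile(values, 50)  # percentile takes a number in [0, 100]
print(median_q, median_p)             # 25.0 25.0

# Several quantiles at once: the three quartiles of the series
print(values.quantile([0.25, 0.5, 0.75]).tolist())  # [17.5, 25.0, 32.5]
```

Note that 25.0 is not one of the four values: with an even number of elements, the default interpolation averages the two middle ones.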

In [None]:
# Creating a list of 10 random numbers from 0 to 100.
A = [random.randint(0,100) for i in range(10)]
# Creating a table with one column, A, holding these values
df = pd.DataFrame({ 'A': A})
df

Using `DataFrame.quantile()` solve the following:

In [None]:
# Hint: You can call `quantile(i)` to get the i'th quantile, where `i` should be a fractional number.
# Find the 10th percentile
A_percentile10th = df.A.quantile(...) # Same as df['A'].quantile()

In [None]:
# Find the 50th Percentile of A or the Median
A_percentile50th = ... 
A_percentile50th

In [None]:
# Find the 90th percentile
A_percentile90th = ... 
A_percentile90th

Verify your answer for the median with another commonly used function <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html">`DataFrame.median()`</a> 

In [None]:
A_median = df.A.median() # Same as df['A'].median()
A_median

<a id='boot'></a>
## 2. Bootstrap

A data scientist uses the data in a random sample to estimate an unknown parameter. Using the sample statistic directly as an estimate of the population parameter can be misleading, since the random sample might not be representative of the actual population.

Fortunately, a brilliant idea called the `bootstrap` can help. Since it is not feasible to generate new samples from the population, the bootstrap generates new random samples by a method called `resampling`: the new samples are drawn at random from the original sample.

**The Bootstrap Sampling Method**
Data scientists must often estimate an unknown population parameter using a random sample. Although we would ideally like to take numerous samples from the population to generate a sampling distribution, we are often limited to a single sample by money and time.

Fortunately, a large, randomly collected sample looks like the original population. The bootstrap procedure uses this fact to simulate new samples by resampling from the original sample.

To conduct the bootstrap, we perform the following steps:

1. Sample with replacement from the original sample (now the bootstrap population). These samples are called bootstrap samples. We typically take thousands of bootstrap samples (~10,000 is common).
1. Calculate the statistic of interest for each bootstrap sample. This statistic is called the bootstrap statistic, and the empirical distribution of these bootstrap statistics is an approximation to the sampling distribution of the bootstrapped statistic.

<img src="https://cdn-images-1.medium.com/max/2600/1*iH5w0MBdiOlxDOCX6nmqqw.png" width = 900px/>
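
The two bootstrap steps can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `original_sample` is a made-up sample standing in for real data, the helper name `bootstrap_medians` is hypothetical, and the median is chosen as the statistic purely as an example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical "original sample": 200 draws from a skewed distribution.
original_sample = rng.exponential(scale=50_000, size=200)

def bootstrap_medians(sample, replicates):
    """Return the medians of `replicates` resamples drawn with replacement."""
    n = len(sample)
    return np.array([
        # Step 1: resample n values with replacement; step 2: compute the statistic
        np.median(rng.choice(sample, size=n, replace=True))
        for _ in range(replicates)
    ])

boot_medians = bootstrap_medians(original_sample, 1_000)
print(boot_medians.shape)  # (1000,)
```

A histogram of `boot_medians` would then approximate the sampling distribution of the sample median.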

To see the applicability of the bootstrap in conjunction with hypothesis testing, we work through the following example.



<a id='ex'></a>
# Example 
___


## 1. Hypothesis Testing (Optional Review Section)
When applying data science techniques to different domains, we are often faced with questions about the world. For example, does drinking coffee cause sleep deprivation? Do autonomous vehicles crash more often than non-autonomous vehicles? Does drug X help treat pneumonia? To help answer these questions, we use hypothesis tests to make informed conclusions based on observed evidence/data.

Since data collection is often an imprecise process, we are often unsure whether the patterns in our dataset are due to noise or real phenomena. Hypothesis testing helps us determine whether a pattern could have happened because of random fluctuations in our data collection.

To explore hypothesis testing, we start with an example. The table `baby` contains information on baby weights at birth. It records the baby's birth weight in ounces and whether or not the mother smoked during pregnancy for 1174 babies.

In [None]:
baby = pd.read_csv('./baby.csv')
baby = baby.loc[:, ["Birth Weight", "Maternal Smoker"]]
baby

### Design (Optional Review Section)

We would like to see whether maternal smoking was associated with birth weight.  To set up our hypothesis test, we can represent the two views of the world using the following null and alternative hypotheses:

**Null hypothesis:** In the population, the distribution of birth weights of babies is the same for mothers who don't smoke as for mothers who do. The difference in the sample is due to chance.

**Alternative hypothesis:** In the population, the babies of the mothers who smoke have a lower birth weight, on average, than the babies of the non-smokers.

Our ultimate goal is to make a decision between these two data generation models. One important point to notice is that we construct our hypotheses about the *parameters* of the data generation model rather than the outcome of the experiment. For example, we should not construct a null hypothesis such as "The birth weights of smoking mothers will be equal to the birth weights of nonsmoking mothers", since there is natural variability in the outcome of this process.

The null hypothesis emphasizes that if the data look different from what the null hypothesis predicts, the difference is due to nothing but chance. Informally, the alternative hypothesis says that the observed difference is "real."

We should take a closer look at the structure of our alternative hypothesis. In our current setup, notice that we would reject the null hypothesis if the birth weights of babies of the mothers who smoke are significantly lower than the birth weights of the babies of the mothers who do not smoke. In other words, the alternative hypothesis encompasses/supports one side of the distribution. We call this a **one-sided** alternative hypothesis. In general, we would only want to use this type of alternative hypothesis if we have a good reason to believe that babies of mothers who smoke cannot have a higher birth weight, on average.

To visualize the data, we've plotted histograms of the baby weights for babies born to maternal smokers and non-smokers.

In [None]:
plt.figure(figsize=(9, 6))
# `density=True` replaces the `normed=True` argument, which newer Matplotlib versions no longer accept
smokers_hist = (baby.loc[baby["Maternal Smoker"], "Birth Weight"]
                .hist(density=True, alpha=0.8, label="Maternal Smoker"))
non_smokers_hist = (baby.loc[~baby["Maternal Smoker"], "Birth Weight"]
                    .hist(density=True, alpha=0.8, label="Not Maternal Smoker"))
smokers_hist.set_xlabel("Baby Birth Weights")
smokers_hist.set_ylabel("Proportion per Unit")
smokers_hist.set_title("Distribution of Birth Weights")
plt.legend()
plt.show()

The weights of the babies of the mothers who smoked seem lower on average than the weights of the babies of the non-smokers. Could this difference likely have occurred due to random variation? We can try to answer this question using a hypothesis test.

To perform a hypothesis test, we assume a particular model for generating the data; then, we ask ourselves, what is the chance we would see an outcome as extreme as the one that we observed? Intuitively, if the chance of seeing the outcome we observed is very small, the model that we assumed may not be the appropriate model. 

In particular, we assume that the null hypothesis and its probability model, the **null model**, are true. In other words, we assume that the null hypothesis is true and focus on what the value of the statistic would be under the null hypothesis. This chance model says that there is no underlying difference; the distributions in the samples are different just due to chance.

### Test Statistic (Optional Review Section)

In our example, we would assume that maternal smoking has no effect on baby weight (where any observed difference is due to chance). In order to choose between our hypotheses, we will use the difference between the two group means as our **test statistic**.
Formally, our test statistic is

$$\mu_{\text{smoking}} - \mu_{\text{non-smoking}}$$

so that small values (that is, large negative values) of this statistic will favor the alternative hypothesis. Let's calculate the observed value of the test statistic:


In [None]:
nonsmoker = baby.loc[~baby["Maternal Smoker"], "Birth Weight"]
smoker = baby.loc[baby["Maternal Smoker"], "Birth Weight"]
observed_difference = np.mean(smoker) - np.mean(nonsmoker)
observed_difference

If there were really no difference between the two distributions in the underlying population, then whether each mother was a maternal smoker or not should not affect the average birth weight. In other words, the label True or False with respect to maternal smoking should make no difference to the average.

Therefore, in order to simulate the test statistic under the null hypothesis, we can shuffle all the birth weights randomly among the mothers. We conduct this random permutation below.

In [None]:
def shuffle(series):
    '''
    Shuffles a series and resets index to preserve shuffle when adding series
    back to DataFrame.
    '''
    return series.sample(frac=1, replace=False).reset_index(drop=True)

In [None]:
baby["Shuffled"] = shuffle(baby["Birth Weight"])
baby

### Conducting a Permutation Test (Optional Review Section)

Tests based on random permutations of the data are called **permutation tests**. In the cell below, we will simulate our test statistic many times and collect the differences in an array. 

In [None]:
differences = np.array([])

repetitions = 5000
for i in np.arange(repetitions):
    baby["Shuffled"] = shuffle(baby["Birth Weight"])
  
    # Find the difference between the means of two randomly assigned groups
    nonsmoker = baby.loc[~baby["Maternal Smoker"], "Shuffled"]
    smoker = baby.loc[baby["Maternal Smoker"], "Shuffled"]
    simulated_difference = np.mean(smoker) - np.mean(nonsmoker)

    differences = np.append(differences, simulated_difference)

We plot a histogram of the simulated difference in means below:

In [None]:
differences_df = pd.DataFrame()
differences_df["differences"] = differences
diff_hist = differences_df.loc[:, "differences"].hist(density=True)  # `density` replaces the older `normed`
diff_hist.set_xlabel("Birth Weight Difference")
diff_hist.set_ylabel("Proportion per Unit")
diff_hist.set_title("Distribution of Birth Weight Differences");

It makes intuitive sense that the distribution of differences is centered around 0 since the two groups should have the same average under the null hypothesis.

In order to draw a conclusion for this hypothesis test, we should calculate the p-value. The empirical p-value of the test is the proportion of simulated differences that were equal to or less than the observed difference. 

In [None]:
p_value = np.count_nonzero(differences <= observed_difference) / repetitions
p_value

At the beginning of the hypothesis test we typically choose a p-value **threshold of significance** (commonly denoted as alpha). If our p-value is below our significance threshold, then we reject the null hypothesis. The most commonly chosen thresholds are 0.01 and 0.05, where 0.01 is considered to be more "strict" since we would need more evidence in favor of the alternative hypothesis to reject the null hypothesis.

In either of these two cases, we reject the null hypothesis since the p-value is less than the significance threshold. 

## 2. Bootstrap Confidence Intervals

We may use the bootstrap sampling distribution to create a confidence interval which we use to estimate the value of the population parameter. 

Since the birth weight data provides a large, random sample, we may act as if the data on mothers who did not smoke are representative of the population of nonsmoking mothers. Similarly, we act as if the data for smoking mothers are representative of the population of smoking mothers.

Therefore, we treat our original sample as the bootstrap population to perform the bootstrap procedure:

1. Draw a sample with replacement from the nonsmoking mothers and calculate the mean birth weight for these mothers. We also draw a sample with replacement from the smoking mothers and calculate the mean birth weight.
1. Calculate the difference in means.
1. Repeat the above process 10000 times, obtaining 10000 mean differences.

This procedure gives us an empirical sampling distribution of differences in mean baby weights. Before running it, go ahead and try to define the following functions. Hints for their functionalities are provided in the comments below.

In [None]:
def resample(sample):
    # Hint: Use np.random.choice() to randomly draw a resample from the provided sample
    return ...

def bootstrap(sample, stat, replicates):
    # Hint 1: stat is the statistic you aim to capture from the data. Here it would be np.mean as you wish to find the mean
    # Hint 2: Use a list comprehension or a loop to produce an array of outputs of your statistic for all resamples
    # Hint 3: Use the following commented pseudocode block (logic) to help you answer the question
    # for i in range(replicates):
    #     stat_lst.append(stat(resample(sample)))
    
    statistic_array = np.array(...)
    return statistic_array

Using the functions you just created, apply the bootstrap to your data:

In [None]:
nonsmoker = baby.loc[~baby["Maternal Smoker"], "Birth Weight"]
smoker = baby.loc[baby["Maternal Smoker"], "Birth Weight"]

nonsmoker_means = bootstrap(nonsmoker, np.mean, 10000)
smoker_means = bootstrap(smoker, np.mean, 10000)

mean_differences = smoker_means - nonsmoker_means

We plot the empirical distribution of the difference in means below:

In [None]:
mean_differences_df = pd.DataFrame()
mean_differences_df["differences"] = np.array(mean_differences)
mean_diff = mean_differences_df.loc[:, "differences"].hist(density=True)  # `density` replaces the older `normed`
mean_diff.set_xlabel("Birth Weight Difference")
mean_diff.set_ylabel("Proportion per Unit")
mean_diff.set_title("Distribution of Birth Weight Differences");

Finally, to construct a 95% confidence interval we take the 2.5th and 97.5th percentiles of the bootstrap statistics:

In [None]:
(np.percentile(mean_differences, 2.5), 
 np.percentile(mean_differences, 97.5))

This confidence interval allows us to state with 95% confidence that the population mean difference in birth weights is between -11.37 and -7.18 ounces.
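
The same percentile approach extends to other confidence levels. The sketch below is a hedged illustration: `percentile_interval` is a hypothetical helper, and `fake_stats` is synthetic data standing in for an array of bootstrap statistics such as `mean_differences`.

```python
import numpy as np

def percentile_interval(bootstrap_stats, confidence=95):
    """Return the middle `confidence` percent of the bootstrap statistics."""
    tail = (100 - confidence) / 2  # percent cut off on each side
    lower, upper = np.percentile(bootstrap_stats, [tail, 100 - tail])
    return lower, upper

# Hypothetical bootstrap statistics; not results from the baby data.
rng = np.random.default_rng(0)
fake_stats = rng.normal(loc=-9, scale=1, size=10_000)

low95, high95 = percentile_interval(fake_stats, 95)
low99, high99 = percentile_interval(fake_stats, 99)
print(low99 <= low95 <= high95 <= high99)  # True
```

A higher confidence level always gives an interval at least as wide, since it keeps a larger middle portion of the bootstrap distribution.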

## 3. Summary

In this section, we review hypothesis testing using the permutation test and learn about confidence intervals using the bootstrap. To conduct a hypothesis test, we must state our null and alternative hypotheses, select an appropriate test statistic, and perform the testing procedure to calculate a p-value. To create a confidence interval, we select an appropriate test statistic, bootstrap the original sample to generate an empirical distribution of the test statistic, and select the quantiles corresponding to our desired confidence level.