# Statistical Inference

So far, we have been looking at what would be classified as descriptive statistics: how to explore or summarize a dataset.

Now, we are moving on to statistical inference, where we are interested not just in the sample that we have, but in inferring something about the population from which the sample came. Our goal is to draw conclusions from data.

Three common tasks in statistical inference:
* estimating an underlying parameter of a population (e.g. the population mean)
* providing an interval estimate for the underlying parameter
* testing whether the underlying parameter satisfies certain conditions.

## Estimation

The goal in estimation is to infer a population parameter based on an observed sample. We do not (and cannot) know the true population parameter when taking a sample.

We'll look at two types of estimates: point estimates and interval estimates.

Point estimation is the process of using sample data to generate a "best guess" of an unknown population parameter. 

What do we want out of an estimator?

We want it to be unbiased: we don't want it to be systematically different from the true value. Technically, we want the expected value of the estimator over all possible samples to equal the true value.

It would also be nice if it had a small variance (and hence a small standard error), meaning that we won't get a large spread of different estimates from different samples.

We would like to use estimators that have as little bias and as little variance as possible.
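To make bias and variance concrete, here is a small simulation sketch. It assumes a made-up normal population (mean 50, standard deviation 10) and repeatedly draws samples of size 20; all names and numbers are illustrative and are not taken from the dataset used later. It compares the sample mean with two versions of the sample variance, one dividing by n and one dividing by n - 1.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical population parameters, chosen only for illustration
true_mean, true_sd = 50.0, 10.0
sample_size, num_samples = 20, 50_000

# Each row is one simulated sample drawn from the population
samples = rng.normal(true_mean, true_sd, size=(num_samples, sample_size))

sample_means = samples.mean(axis=1)
var_divide_by_n = samples.var(axis=1, ddof=0)          # divides by n
var_divide_by_n_minus_1 = samples.var(axis=1, ddof=1)  # divides by n - 1

# Bias: the average estimate across samples minus the true value
bias_mean = sample_means.mean() - true_mean
bias_var_n = var_divide_by_n.mean() - true_sd**2
bias_var_n_minus_1 = var_divide_by_n_minus_1.mean() - true_sd**2

# Standard error: the spread of the estimates across samples
se_mean = sample_means.std()

print('bias of sample mean:          ', bias_mean)
print('bias of variance (divide by n):', bias_var_n)
print('bias of variance (n - 1):      ', bias_var_n_minus_1)
print('standard error of the mean:    ', se_mean)
```

In a run like this, the sample mean and the n - 1 variance should show a bias close to zero, while the divide-by-n variance should come out systematically too small. The standard error of the mean should land near the theoretical value of 10 / sqrt(20).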

## Confidence Intervals for the Mean

## Confidence Intervals for Proportion

Each observation is a Bernoulli trial, so drawing a sample can be viewed as a series of Bernoulli trials. Assuming the observations are independent, the number of successes in the sample follows a binomial distribution.
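As one possible illustration, the sketch below builds a normal-approximation (Wald) interval for a proportion. The standard error comes from the binomial variance, which gives sqrt(p(1 - p) / n) for the sample proportion. The function name and the counts (130 successes out of 400) are hypothetical, and the Wald interval is only one of several available methods; it tends to behave poorly for small samples or for proportions near 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, conf_level=0.95):
    """Normal-approximation (Wald) interval for a population proportion."""
    p_hat = successes / n  # point estimate of the proportion
    # Critical value of the standard normal for the chosen confidence level
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # Standard error derived from the binomial variance n * p * (1 - p)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error


# Hypothetical counts, for illustration only
lower, upper = proportion_confidence_interval(successes=130, n=400)
print('lower bound: ', lower)
print('upper bound: ', upper)
```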

## Bootstrap Confidence Intervals

The **empirical bootstrap** is a technique introduced by Bradley Efron in 1979. It is easy to understand and implement, but it has gained wide popularity only relatively recently, since it is not really feasible without modern computing power. The bootstrap allows us to substitute fast computation for theoretical math.

**Big Idea:** perform computations on the data itself to estimate the variation of statistics that are themselves computed from the same data. That is, the data is ‘pulling itself up by its own bootstraps.’

Since the bootstrap allows us to estimate the variation of these statistics, we can use this technique to construct confidence intervals.

We know how to create confidence intervals for normally or nearly-normally distributed data. But what if our sample comes from some unknown distribution? We can still create a point estimate, but what we really want is a confidence interval.

The Central Limit Theorem says that for a large enough sample, the sampling distribution of the mean (or with some additional work, the proportion, or difference of means or proportions) is close enough to being normal that we can build confidence intervals. The trouble is, we don't know just how large "large enough" is. 
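The following sketch illustrates why "large enough" depends on the population. It assumes a right-skewed exponential population (a choice made purely for illustration) and measures how skewed the sampling distribution of the mean still is at a few sample sizes; a skewness near zero suggests the distribution is close to symmetric, as a normal distribution would be.

```python
import numpy as np

rng = np.random.default_rng(seed=1)


def skewness(x):
    """Third standardized moment (0 for a symmetric distribution)."""
    centered = x - x.mean()
    return (centered**3).mean() / x.std()**3


num_samples = 20_000
skew_by_size = {}
for n in (5, 30, 200):
    # Each row is one sample of size n from a skewed exponential population
    samples = rng.exponential(scale=1.0, size=(num_samples, n))
    # Skewness of the simulated sampling distribution of the mean
    skew_by_size[n] = skewness(samples.mean(axis=1))
    print('sample size', n, 'skewness of sample means:', skew_by_size[n])
```

For this population the skewness shrinks as the sample size grows, but it does not vanish, and a more heavily skewed population would need larger samples still.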

Also, what if we want an estimate for a more elaborate statistic than a mean or proportion? Say we want a confidence interval for the median or for the third quartile. Traditional confidence interval techniques break down here, because they require some idea of the sampling distribution of the statistic. The bootstrap does not need this, since it is built out of the sample itself, so we can use it to create confidence intervals not just for the mean, but for any type of statistic we would like to compute.

The bootstrap works by repeatedly resampling (with replacement) from our sample.

An **empirical bootstrap sample** is a resample of the same size as the original sample.
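A single empirical bootstrap sample can be drawn as in the sketch below. The six observations are made up for illustration. Because the draw is with replacement, some observations will typically appear more than once and others not at all.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical observations, for illustration only
sample = np.array([3, 8, 1, 9, 4, 6])

# Draw with replacement, keeping the resample the same size as the original
resample = rng.choice(sample, size=len(sample), replace=True)

print('original sample: ', sample)
print('bootstrap sample:', resample)
```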

The **bootstrap principle** says that the distribution of statistics calculated from empirical bootstrap samples approximates the true sampling distribution of the statistic of interest, and in particular that the variation of the statistic is well approximated by the variation in the bootstrap statistics.

Note: we must use resamples of the same size as the original sample so that the variations match. Variation shrinks as sample size grows, so resamples of a different size would misstate the variability of our statistic.
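The effect of resample size can be checked with a short experiment. The sketch below assumes a synthetic sample of 100 observations (not the sleeping data) and compares the spread of the resampled mean when resamples are smaller than, equal to, and larger than the original sample. The helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical sample of 100 observations, for illustration only
sample = rng.normal(loc=500, scale=120, size=100)


def spread_of_resampled_means(sample, resample_size, num_resamples=5_000):
    """Standard deviation of the mean across resamples of a given size."""
    # One row per resample, drawn with replacement from the sample
    resamples = rng.choice(sample, size=(num_resamples, resample_size), replace=True)
    return resamples.mean(axis=1).std()


spreads = {m: spread_of_resampled_means(sample, m) for m in (25, 100, 400)}
for m, spread in spreads.items():
    print('resample size', m, 'spread of resampled means:', spread)
```

Only the resamples of size 100 reflect the variability of a statistic computed from a sample of 100 observations; the smaller resamples overstate it and the larger ones understate it.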

Note: The bootstrap will never improve our point estimate. It only tells us how much that estimate varies.

In [5]:
import pandas as pd
import numpy as np
sleeping = pd.read_csv('../data/atus_sleeping.csv')

In [2]:
sleeping

Unnamed: 0,participant_id,minutes_spent_sleeping,sex
0,20181211181182,270,Male
1,20180908180663,600,Male
2,20180706181412,355,Male
3,20181009181978,405,Male
4,20180503180964,270,Male
5,20181211181212,750,Male
6,20180111171481,439,Male
7,20181110182215,463,Male
8,20180201180821,690,Male
9,20180504180926,420,Male


Let's build a confidence interval for the median of the number of minutes spent sleeping.

In [3]:
sleeping.minutes_spent_sleeping.median()

550.0

Our point estimate is 550 minutes.

Now, the idea is to repeatedly resample with replacement from our observations.

In [20]:
num_resamples = 10000
conf_level = 0.9

# Fraction of resample medians to cut off in each tail (5% for a 90% interval)
margin = (1 - conf_level) / 2
# Positions of the interval bounds in the sorted list of resample medians
lower_index = int(num_resamples * margin)
upper_index = int(num_resamples * (1 - margin))

values = sleeping.minutes_spent_sleeping.to_list()

resample_medians = []

for _ in range(num_resamples):
    # Resample with replacement, same size as the original sample
    resample = np.random.choice(values, size=len(values), replace=True)
    resample_medians.append(np.median(resample))

resample_medians.sort()

print('lower bound: ', resample_medians[lower_index])
print('upper bound: ', resample_medians[upper_index])

lower bound:  497.5
upper bound:  595.0
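The same percentile approach can be wrapped in a reusable function, sketched below. The function name, its parameters, and the synthetic `demo` data are assumptions made for illustration; they are not part of the original notebook. The sketch draws all resamples at once as a 2-D array, so the statistic passed in must accept an `axis` argument (as `np.median` and `np.mean` do), and memory use grows with `num_resamples * len(values)`.

```python
import numpy as np


def bootstrap_percentile_interval(values, statistic=np.median, conf_level=0.9,
                                  num_resamples=10_000, seed=None):
    """Bootstrap interval from the percentiles of the resampled statistic."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # One row per resample, each the same size as the original sample
    resamples = rng.choice(values, size=(num_resamples, len(values)), replace=True)
    # Apply the statistic to every resample (row) at once
    resample_stats = statistic(resamples, axis=1)
    # Cut off an equal share of the resampled statistics in each tail
    margin = (1 - conf_level) / 2
    lower, upper = np.percentile(resample_stats, [100 * margin, 100 * (1 - margin)])
    return lower, upper


# Hypothetical data standing in for a real sample
demo = np.random.default_rng(0).normal(500, 120, size=200)
lower, upper = bootstrap_percentile_interval(demo, seed=0)
print('lower bound: ', lower)
print('upper bound: ', upper)
```

With the sleeping data, a call such as `bootstrap_percentile_interval(sleeping.minutes_spent_sleeping)` should give bounds close to those of the loop above. They will not match exactly, because the resamples are random and `np.percentile` interpolates between sorted values.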
