In [None]:
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
from scipy.stats import sem as scipy_standard_error

from common import make_cdf, percentile_rank, load_forest_fires

%matplotlib inline

# Estimation

See Chapter 8 of [Think Stats 2nd Edition](https://greenteapress.com/wp/think-stats-2e/) and Chapter 2 of [Practical Statistics for Data Scientists](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/).

## What do we want to estimate?

[Estimator - Wikipedia](https://en.wikipedia.org/wiki/Estimator)

We want to estimate statistics.  Why?  Statistics allow us to measure, model and compare.

A statistic is a summary of the data
- mean
- min
- max
- variance

## How can we estimate?

By sampling!
- we almost never have access to the entire population

An (insufficient) definition of statistics = making inferences from samples to populations

We often make use of statistics drawn from samples
- these statistics will generally differ from the statistics of the entire population

The distribution of a given statistic across repeated samples is known as a **sampling distribution** - more on that later.
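As a small sketch of why sample statistics differ from population statistics, we can draw one random sample from a synthetic population and compare the two means. The population parameters, sample size and seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical population - in practice we would not get to see all of it
population = rng.normal(loc=10, scale=3, size=100_000)

# a small random sample, drawn without replacement
sample = rng.choice(population, size=50, replace=False)

print('population mean {:.2f} - sample mean {:.2f}'.format(population.mean(), sample.mean()))
```

Changing the seed gives a different sample mean each time, while the population mean stays fixed.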

## Three sources of error in estimation

How can we make mistakes in estimating statistics?

**Sampling bias**
- some members of the population being more likely to be sampled than others
- non-random sampling

**Sampling error**
- arises from using statistics of a subset of a larger population 
- usually impossible to measure exactly

**Measurement error**
- difference between measurement & true value

## Let's talk about bias

The effect of sample bias is simple = our sample will be systematically different from the population.

The causes of sample bias are more complex - because there are so many!  

We can easily demonstrate one cause of sample bias:

In [None]:
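# population: 100,000 draws from a normal distribution (mean 10, standard deviation 3)
population = np.random.normal(10, 3, 100000)

# biased sample: only values above 11 are ever selected
biased = population[population > 11]

'population mean {:.2f} - biased sample mean {:.2f}'.format(np.mean(population), np.mean(biased))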

Note that if our sampling is random, we don't see the bias:

In [None]:
# keep each value with probability 0.5, independent of the value itself
data = []
for x in population:
    if np.random.uniform() > 0.5:
        data.append(x)

'{:.2f}'.format(np.mean(data))

Bias is **systematic error**.

One form of bias a data scientist needs to be aware of is **selection bias**.  There are many reasons why this can happen:
- vast search = asking many different questions / training many different models
- non-random sampling (aka sampling bias)
- cherry picking data
- stopping an experiment based on the result
- after the fact selection of results

Many of the mistakes above are caused directly by the data scientist :) 
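The vast search effect can be sketched with a simulation. Every "model" below guesses at random, so none has any real skill, yet the best of many looks impressive. The number of models and the evaluation set size are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 1000   # hypothetical number of models / questions tried
n_examples = 100  # size of the evaluation set

# every "model" guesses at random, so the true accuracy of each is 50 %
labels = rng.integers(0, 2, size=n_examples)
predictions = rng.integers(0, 2, size=(n_models, n_examples))
accuracies = (predictions == labels).mean(axis=1)

print('mean accuracy {:.2f} - best accuracy {:.2f}'.format(accuracies.mean(), accuracies.max()))
```

Reporting only the best accuracy, without mentioning how many models were tried, is after the fact selection of results.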

A particular case of selection bias is **regression to the mean**
- select a high performing athlete based on performance that is somewhat due to luck
- later on the luck disappears!
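A minimal simulation of regression to the mean, assuming performance is the sum of a persistent skill term and an independent luck term (both standard normal here, which is an assumption made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

n_athletes = 10_000
skill = rng.normal(0, 1, n_athletes)             # persistent ability
season_1 = skill + rng.normal(0, 1, n_athletes)  # ability + luck
season_2 = skill + rng.normal(0, 1, n_athletes)  # same ability, fresh luck

# select the top 1 % of athletes based on the first season only
top = season_1 >= np.quantile(season_1, 0.99)

print('top group, season 1: {:.2f} - season 2: {:.2f}'.format(
    season_1[top].mean(), season_2[top].mean()))
```

The selected group is still better than average in the second season, because part of the selection was based on skill, but it falls back towards the mean because the luck does not carry over.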

## Sampling error

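Above we talked about sampling bias - now let's talk about **sampling error**
- the difference between a sample statistic and the population statistic that arises because we only observe a subset of the population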

We can estimate the sampling error through simulation
- we don't know the true statistics
- let's instead use estimates from our small number of samples

The question we are asking is
- if the true population stats were the same as our sample estimates
- and we ran this experiment many times
- how much would our estimated mean vary
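The question above can be sketched as a simulation. The estimated mean, standard deviation and sample size below are hypothetical numbers standing in for the results of one small experiment, and the population is assumed to be normal.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical estimates from one small experiment
sample_mean, sample_std, sample_size = 10.0, 3.0, 9

# pretend the estimates are the truth, then rerun the experiment many times
n_experiments = 1000
simulated = rng.normal(sample_mean, sample_std, size=(n_experiments, sample_size))
estimated_means = simulated.mean(axis=1)

# the spread of the estimated means tells us how much the estimate varies
print('standard deviation of estimated means {:.2f}'.format(estimated_means.std()))
```

With these numbers the spread should land close to 3 / sqrt(9) = 1, which is the standard error discussed later in this notebook.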

## Statistic sampling distributions

The distribution of a statistic is known as a **sampling distribution**
- is not the data distribution!

The question we are asking is - how do statistics vary with sampling?
- we want to know how variable our statistics are - their **sampling variability**

The **sampling distribution** of a statistic shows us how a sample statistic varies

## Estimating the sampling distribution using computation

There is an entire literature in traditional statistics that **developed under constraints of data & computation**.  Normal approximation methods such as t-distributions rely on the Central Limit Theorem to calculate sampling distributions.

We live in an era where computation is cheap - let's use it.

<img src="assets/ram_slack.jpg" alt="" width="350"/>

A key tool in computational statistics is **bootstrap sampling**.

## Bootstrap sampling

Creating new datasets by **sampling with replacement**
- this is in contrast to shuffling / permutation sampling (without replacement)

Bootstrap samples 
- are always available
- require no assumption about the sample statistic being normally distributed
- can be widely applied

Sampling with replacement = the probability of drawing any given observation is unchanged from draw to draw

The bootstrap does not compensate for a small dataset
- instead it answers questions about how additional samples would behave when drawn from a population like our original sample
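The difference between a bootstrap sample and a permutation can be sketched on a tiny array (the data here is a made-up range of integers):

```python
import numpy as np

rng = np.random.default_rng(3)

data = np.arange(10)

# bootstrap sample: same size as the data, drawn with replacement,
# so some values can repeat and others can be missing
bootstrap = rng.choice(data, size=len(data), replace=True)

# permutation: every value appears exactly once, only the order changes
shuffled = rng.permutation(data)

print('bootstrap', np.sort(bootstrap))
print('shuffled ', np.sort(shuffled))
```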

## Standard error

A metric that sums up **how variable a sampling distribution is**.  

It is estimated using 
- the standard deviation of the samples
- the sample size

The standard error
- is the expected error of the estimate
- describes variability in the estimate
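For the sample mean, the usual estimate can be written as follows, where $s$ is the standard deviation of the samples and $n$ is the sample size (symbols introduced here for illustration):

$$
\mathrm{SE} = \frac{s}{\sqrt{n}}
$$

Because $n$ sits under a square root, halving the standard error requires four times as many samples.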

In [None]:
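# 100,000 draws from a normal distribution (mean 10, standard deviation 5)
d = np.random.normal(10, 5, 100000)

# scipy's sem defaults to ddof=1 and np.std to ddof=0, so the two agree only approximately
np.testing.assert_allclose(scipy_standard_error(d), np.std(d) / np.sqrt(len(d)), rtol=1e-05)

# standard error = standard deviation / sqrt(sample size)
np.std(d) / np.sqrt(len(d))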

We can also take a computational approach to estimating standard error.  Before we do this, let's introduce a new dataset.

## A new dataset (no more Iris!)

http://archive.ics.uci.edu/ml/datasets/Forest+Fires

Run bash commands to:
- make a folder called `data`
- download two text files
- move them into `data`

In [None]:
data = load_forest_fires()

data.shape

In [None]:
data.head(2)

Let's first look at the **data distribution** of a feature:

In [None]:
_ = data.plot(y='temp', kind='hist')

And some of the statistics of this data:

In [None]:
data.loc[:, 'RH'].describe()

## Practical

Take a computational approach to estimating standard error:
1. sample a bootstrap of n samples
2. record the sample mean of the n samples
3. repeat steps 1 & 2 m times, then report the standard error of the sample mean (i.e. the standard deviation of the m sample means)
4. produce a histogram of the sample means
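One possible sketch of steps 1 to 3, using a synthetic array as a stand-in for a column of the dataset (the values, `n` and `m` are assumptions, not taken from the forest fires data). It compares the bootstrap estimate with `scipy.stats.sem`:

```python
import numpy as np
from scipy.stats import sem

rng = np.random.default_rng(4)

# synthetic stand-in for a column of data (an assumption, not the real dataset)
values = rng.normal(loc=19, scale=6, size=500)

n = len(values)  # size of each bootstrap sample
m = 2000         # number of bootstrap samples

sample_means = np.empty(m)
for i in range(m):
    bootstrap = rng.choice(values, size=n, replace=True)  # step 1
    sample_means[i] = bootstrap.mean()                    # step 2

# step 3: the standard deviation of the sample means estimates the standard error
print('bootstrap {:.3f} - formula {:.3f}'.format(sample_means.std(), sem(values)))
```

For step 4, `plt.hist(sample_means)` plots the resulting sampling distribution.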

## Confidence intervals

A confidence interval = measurement of error in a sample estimate
- same purpose as histograms, boxplots & standard errors

The confidence interval can be thought of as
- the interval that encloses X % of the bootstrap sampling distribution
- an X % CI should, on average, contain similar sample estimates X % of the time

The general method for generating a bootstrap confidence interval:
1. draw a bootstrap sample
2. record the statistic (mean, var etc)
3. repeat 1 & 2 many times
4. trim (100 - X) / 2 % from either end of the distribution - the end points are now the bounds of your X % confidence interval

In [None]:
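data = pd.read_csv('./data/forestfires.csv')
col = 'temp'

n_samples = 1000  # number of bootstrap samples
sample_size = data.shape[0]  # each bootstrap sample is the same size as the data

statistics = []
for _ in range(n_samples):
    # draw row labels with replacement (relies on the default integer index)
    idxs = np.random.randint(0, data.shape[0], sample_size)
    sample_mean = np.mean(data.loc[idxs, col])
    statistics.append(sample_mean)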

In [None]:
statistics = sorted(statistics)

interval = 0.95
# number of bootstrap statistics to trim from each end of the sorted list
split = int(len(statistics) * (1 - interval) / 2)
split

In [None]:
# first and last values that remain after trimming `split` values from each end
start, end = statistics[split], statistics[-split - 1]

start, end

In [None]:
f, ax = plt.subplots()
ax.hist(statistics)
ax.axvline(start, color='red')
ax.axvline(end, color='red')
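A shorter route to the same kind of interval is `np.percentile`, which does the sorting and trimming in one call. The sketch below is self-contained, so it uses a synthetic array in place of the bootstrap sample means computed above; the data and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# synthetic stand-in for bootstrap sample means
values = rng.normal(loc=19, scale=6, size=500)
boot_means = np.array([
    rng.choice(values, size=len(values), replace=True).mean()
    for _ in range(1000)
])

interval = 0.95
tail = 100 * (1 - interval) / 2  # percent trimmed from each end

lower, upper = np.percentile(boot_means, [tail, 100 - tail])
print('{:.0%} interval: {:.2f} to {:.2f}'.format(interval, lower, upper))
```

Note that `np.percentile` interpolates between neighbouring values by default, so its end points can differ slightly from those found by indexing into the sorted list.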

We can also use a cumulative distribution function (CDF) to
- visualize the distribution of sample means
- calculate the confidence intervals

In [None]:
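# make_cdf (from common) is assumed to yield (cumulative probability, value) pairs
y, x = zip(*make_cdf(statistics))

y = np.array(y)
x = np.array(x)

# an exact float match such as y == 0.05 can come back empty, so take the
# smallest value whose cumulative probability reaches each bound instead
# 2.5 % and 97.5 % bound the central 95 % interval
start = x[y >= 0.025].min()
end = x[y >= 0.975].min()

f, ax = plt.subplots()
ax.plot(x, y)
ax.axvline(start, color='red')
ax.axvline(end, color='red')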

In [None]:
print(start, end)

## Quiz

[Chapter 12 of Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/)

As the sample size increases, what happens to the
- standard error
- standard deviation

What sources of error are we not accounting for with these two statistics?

Which of the three sources of error (sampling error, sampling bias or measurement error) is pseudoreplication?
