#  Estimation

See Chapter 8 of [Think Stats 2nd Edition](https://greenteapress.com/wp/think-stats-2e/).

Error != mistake

Sources of error in estimation
- sampling error = arises from estimating a population statistic from a subset of the population - usually impossible to measure exactly
- sampling bias = some members of the population are more likely to be sampled than others
- measurement error = difference between the measured value & the true value
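
A minimal sketch of the three sources of error, assuming a made-up population (the numbers below are invented for illustration and are not the Iris data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 10,000 lengths in cm (invented values)
population = rng.normal(loc=5.8, scale=0.8, size=10_000)
true_mean = population.mean()

# Sampling error: a small random subset has a slightly different mean
random_sample = rng.choice(population, size=10)
sampling_error = random_sample.mean() - true_mean

# Sampling bias: larger values are more likely to be selected
weights = population - population.min()
weights = weights / weights.sum()
biased_sample = rng.choice(population, size=1000, p=weights)
bias = biased_sample.mean() - true_mean

# Measurement error: each reading is off by some instrument noise
measured = random_sample + rng.normal(0.0, 0.1, size=random_sample.size)
measurement_error = measured - random_sample

print(f"sampling error:         {sampling_error:+.3f}")
print(f"sampling bias:          {bias:+.3f}")
print(f"mean measurement error: {measurement_error.mean():+.3f}")
```

Taking more samples shrinks the sampling error, but it does nothing to the bias: the biased sample here is 100 times larger than the random one and still lands above the true mean.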

In [None]:
import random

import matplotlib.pyplot as plt
import numpy as np

from common import load_iris

features, target = load_iris()

## Practical

Explore the Iris dataset - what are possible sources of error?

## Estimating sample mean & variance

Let's imagine we have a process that generates some samples:

In [None]:
samples = np.random.choice(features.loc[:, 'sepal length (cm)'], size=10)

Let's create some estimators (aka models) by parameterizing Gaussians.  

The question is - what to use for the mean ($\mu$) and variance ($\sigma^2$)?  

We can use a simple **sample variance**:

In [None]:
mean = np.mean(samples)

sum([(sample - mean)**2 for sample in samples]) / len(samples)

The problem with the sample variance above is that it is a biased estimator: on average it underestimates the population variance, most noticeably for small samples.  

We can get an **unbiased sample variance** by removing a degree of freedom (dividing by $n - 1$ instead of $n$):

In [None]:
sum([(sample - mean)**2 for sample in samples]) / (len(samples) - 1)
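
One way to see the bias is to simulate many small samples from a distribution whose variance we know. This is a sketch under assumed values (a Gaussian with variance 4 and samples of size 5, both chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

true_var = 4.0   # variance of the hypothetical population (sigma = 2)
n = 5            # size of each small sample
trials = 20_000  # number of simulated samples

# Each row is one sample of size n
draws = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

# Average each estimator over all simulated samples
biased = draws.var(axis=1, ddof=0).mean()    # divide by n
unbiased = draws.var(axis=1, ddof=1).mean()  # divide by n - 1

print(f"true variance:      {true_var:.2f}")
print(f"divide by n:        {biased:.2f}")
print(f"divide by n - 1:    {unbiased:.2f}")
```

The divide-by-$n$ estimator should average close to $\frac{n-1}{n}\sigma^2$ (3.2 here), while the divide-by-$(n-1)$ estimator should average close to the true variance.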

We can estimate the **central tendency** of the distribution, using either the mean or median.  Let's use the mean.

What should we do with our estimated statistics?  Let's parameterize a Gaussian & sample from it:

In [None]:
mu = np.mean(samples)
# random.gauss expects a standard deviation, not a variance
sigma = np.std(samples, ddof=1)

n = 10

estimate = np.mean([random.gauss(mu, sigma) for _ in range(n)])

estimate

We can now compare this with the actual sample mean via the **root mean squared error (RMSE)**:

In [None]:
np.sqrt((estimate - mu)**2)

The standard error gives us the expected error for this specific distribution if:
- we use the mean as the statistic
- with a sample size of 10
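
Written out, and assuming we repeat the experiment $m$ times to get estimates $\hat{\mu}_1, \dots, \hat{\mu}_m$ (this notation is introduced here for illustration), the RMSE against the sample mean $\mu$ would be:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left(\hat{\mu}_j - \mu\right)^2}
$$

With a single run ($m = 1$) this reduces to $|\hat{\mu}_1 - \mu|$, which is what the cell above computes.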

## Practical

The purpose of this exercise is to compare two methods of approximating the central tendency.

Above we used the mean.  Now do this using the median as the central tendency statistic, and run the error estimate `m` times (we only ran it once above).

Plot the standard error for each sample, along with a running average.  After the experiment is over, plot a CDF of the estimates.

You will need functions from `common.py` to do this.

## Estimating sampling error

Small number of samples -> **sampling error**

We can estimate the sampling error through simulation
- we don't know the true statistics
- let's instead use estimates from our small number of samples

The question we are asking is
- if the population stats were the same as our sample estimates
- and we ran this experiment many times
- how much would our estimated mean vary?

In [None]:
num_samples = 10

samples = np.random.choice(features.loc[:, 'sepal length (cm)'], size=num_samples)

In [None]:
mu = np.mean(samples)
var = np.var(samples)

num_simulations = 500
means = []
for _ in range(num_simulations):
    # simulate one experiment with the same sample size as the original sample
    samp = np.random.normal(mu, np.sqrt(var), num_samples)
    means.append(np.mean(samp))

In [None]:
def make_cdf(samples):
    #  duplicate of function in distributions.ipynb
    #  returns (percentile rank, value) pairs, sorted by value
    samples = sorted(samples)
    return [(percentile_rank(s, samples), s) for s in samples]

def percentile_rank(value, samples):
    #  fraction of samples less than or equal to value
    return sum(1 for s in samples if s <= value) / len(samples)

make_cdf(means[:20])

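The 90% confidence interval, between the 5th and 95th percentiles, is then:

In [None]:
y, x = zip(*make_cdf(means))

y = np.array(y)
x = np.array(x)

# first value whose percentile rank reaches 5% and 95%
# (avoids exact float comparison on y)
start = x[np.searchsorted(y, 0.05)]
end = x[np.searchsorted(y, 0.95)]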

f, ax = plt.subplots()
ax.plot(x, y)
ax.axvline(start, color='red')
ax.axvline(end, color='red')

An alternative to the 90% confidence interval is the standard error
- the expected error
- describes variability in the estimate

In [None]:
np.sqrt(np.mean((x - mu)**2))
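
For the mean of $n$ independent draws, the simulated standard error can be checked against the textbook formula $\sigma / \sqrt{n}$. The sketch below assumes illustrative values for `mu`, `sigma` and `n` rather than the ones estimated above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, chosen only for illustration
mu, sigma, n = 5.8, 0.8, 10
num_simulations = 5_000

# One simulated experiment per row, then the mean of each experiment
means = rng.normal(mu, sigma, size=(num_simulations, n)).mean(axis=1)

# Standard error from simulation: RMSE of the estimated means around mu
simulated_se = np.sqrt(np.mean((means - mu) ** 2))

# Standard error from the formula sigma / sqrt(n)
analytic_se = sigma / np.sqrt(n)

print(f"simulated: {simulated_se:.3f}")
print(f"analytic:  {analytic_se:.3f}")
```

The two numbers should agree closely. The simulation approach is still useful because it also works for statistics, such as the median, where no simple formula is available.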

## Pseudoreplication

[Chapter 3 of Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/)

Counting the same sample multiple times
- dependence is the problem here (non-independent sampling)

Additional measurements that depend on previous data don't prove your results generalize
- they only increase certainty about the specific sample studied

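Eliminate hidden sources of correlation between variables
- measure 1,000 patients once rather than 100 patients 10 times
- hundreds of neurons measured in two animals are not independent samples
- comparing growth rates of different crops in different fields - differences between fields can be mistaken for differences between crops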

Solutions
- average dependent data points
- analyze each dependent data point separately - don't combine them, analyze only a subset (e.g. day 5)
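
A sketch of why averaging dependent data points matters, assuming a made-up model where each patient has their own baseline plus small measurement-to-measurement noise (all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

num_patients, repeats = 100, 10

# Hypothetical model: patient baselines vary a lot, repeat readings only a little
baselines = rng.normal(120.0, 15.0, size=num_patients)
noise = rng.normal(0.0, 2.0, size=(num_patients, repeats))
readings = baselines[:, None] + noise

# Pseudoreplication: treat all 1,000 readings as independent samples
naive_se = readings.std(ddof=1) / np.sqrt(readings.size)

# Averaging dependent points: one value per patient, so n = 100
per_patient = readings.mean(axis=1)
averaged_se = per_patient.std(ddof=1) / np.sqrt(num_patients)

print(f"naive standard error:    {naive_se:.2f}")
print(f"averaged standard error: {averaged_se:.2f}")
```

The naive standard error comes out much smaller because it pretends there are 1,000 independent patients. The averaged version reflects the 100 patients actually studied.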

Doing PCA on different batches of results
- if the batch number turns out to be important, then the batches differ systematically (the distribution changes from batch to batch)

## Peer based learning

[Chapter 12 of Statistics Done Wrong - Alex Reinhart](https://www.statisticsdonewrong.com/)

Discuss (& take positions) on the following:

As the sample size increases, what happens to the
- standard error
- standard deviation

What sources of error are we not accounting for with these two statistics?

Which of the three sources of error (sampling error, sampling bias or measurement error) is pseudoreplication?