# Week 8: Parameter Estimation

In [2]:
# Loading the libraries
import numpy as np
import sympy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as stats
import scipy.optimize as opt
#from scipy.integrate import quad

## Day 1: Bootstrap Method

* **Bootstrap** methods are a class of techniques for constructing an interval estimate of a parameter.
* Bootstrap methods are **general** in the sense that they can be applied to *any* parameter.
* These methods are "empirical": they do not require knowledge of the sampling distribution of the statistic that corresponds to the parameter of interest. Instead, they rely on resampling and simulation.


## How does it work?
The situation is as follows:
* We have a sample of data of size `n`
* The sample comes from a population which follows some distribution (which we do not know)
* We wish to estimate some parameter of the population distribution by constructing an interval estimate for it

To simplify things, we will talk about the population mean $\mu$, but the same approach applies to any other parameter (e.g. median, standard deviation, etc.).
To get what we need we do the following:
* Calculate the mean of the original sample $\bar{x}$
* Generate `m` **bootstrapped samples** (typically thousands), each of size `n`, by sampling from the original sample **with replacement**, and calculate the mean of each one. This will give you a sequence
$$\bar{x}_1^*, \bar{x}_2^*, \ldots, \bar{x}_{m}^*$$
of means of the bootstrapped samples.
* Calculate a sequence of differences $\delta_1^*, \delta_2^*, \ldots, \delta_m^*$ where:
$$\delta_i^* = \bar{x}_i^* - \bar{x}$$
for each $i = 1, 2, \ldots, m$. This sequence approximates how the sample mean varies around the population mean, so it is crucial to the process
* Say we want a `90%` "confidence" interval. We calculate $\overline{\delta}$ and $\underline{\delta}$, the 5th and 95th percentiles of the differences $\delta_i^*$, respectively
* Finally, the interval estimate for the population mean $\mu$ is given as:
$$ \left( \bar{x} - \underline{\delta},\, \bar{x} - \overline{\delta} \right) $$
Note that this interval is **NOT** the same as the actual confidence interval for the population parameter!
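
The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `bootstrap_mean_interval`, its defaults, and the small `demo` array (made-up numbers, not course data) are hypothetical and not part of the exercises below.

```python
import numpy as np

def bootstrap_mean_interval(sample, m=5000, level=0.90, seed=0):
    """Bootstrap interval for the mean, following the steps listed above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    x_bar = sample.mean()

    # m resamples of size n, drawn with replacement (one resample per row)
    resamples = rng.choice(sample, size=(m, sample.size), replace=True)

    # differences between each bootstrapped mean and the original sample mean
    deltas = resamples.mean(axis=1) - x_bar

    # lower and upper percentiles of the differences (5th and 95th for a 90% interval)
    alpha = 1 - level
    d_low, d_high = np.percentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    # the upper percentile is subtracted to get the lower endpoint, and vice versa
    return x_bar - d_high, x_bar - d_low

# made-up illustrative numbers
demo = np.array([9.1, 11.4, 10.2, 8.7, 12.3, 9.8, 10.9, 7.6, 10.4, 11.0])
lo, hi = bootstrap_mean_interval(demo)
print(f"sample mean = {demo.mean():.3f}, 90% bootstrap interval = ({lo:.3f}, {hi:.3f})")
```

Drawing all resamples at once as an `(m, n)` array avoids an explicit loop; a `for` loop over `m` resamples gives the same kind of result and may be easier to read.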

## Example 1
Test the efficiency of the bootstrap method by using it to construct a 90% "confidence" interval for the mean of a normal distribution.
* Use normal distribution $\mathcal{N}(\mu=10, \sigma=2)$
* Draw a random sample of size `n=20` from this distribution
* Estimate $\mu$ with an interval using bootstrapping

In [None]:
# setting the random seed
np.random.seed(123)

# getting the original sample


# plot the sample


In [None]:
# start with the bootstrapping
m = 1000 # number of bootstrap samples
deltas = ... # the differences
sample_mean = ...

for i in range(m):
    
    # choose a random sample
    
    # calculate the difference

    
# plot the deltas, for fun


In [None]:
# constructing the interval

# replace each ... with the corresponding endpoint of the interval
print(f'The 90% "confidence" interval is ({...}, {...})')

## Example 2
For the same distribution given in **Example 1**, using the same sample, estimate the standard deviation of the population.
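
Because the procedure only needs a way to compute the statistic on a resample, it can be written once for an arbitrary statistic. The sketch below assumes a hypothetical helper `bootstrap_interval` that takes the statistic as a function; the `demo` array contains made-up numbers and is not the sample from Example 1.

```python
import numpy as np

def bootstrap_interval(sample, statistic, m=5000, level=0.90, seed=0):
    """Bootstrap interval for an arbitrary statistic (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    theta_hat = statistic(sample)          # statistic of the original sample

    deltas = np.empty(m)
    for i in range(m):
        # resample of the same size, with replacement
        resample = rng.choice(sample, size=sample.size, replace=True)
        deltas[i] = statistic(resample) - theta_hat

    alpha = 1 - level
    d_low, d_high = np.percentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return theta_hat - d_high, theta_hat - d_low

def sample_std(values):
    # sample standard deviation (divides by n - 1)
    return np.std(values, ddof=1)

# made-up illustrative numbers
demo = np.array([9.1, 11.4, 10.2, 8.7, 12.3, 9.8, 10.9, 7.6, 10.4, 11.0])
std_lo, std_hi = bootstrap_interval(demo, sample_std)
print(f"sample std = {sample_std(demo):.3f}, 90% bootstrap interval = ({std_lo:.3f}, {std_hi:.3f})")
```

Passing a different function, for example `np.median`, would give an interval for that parameter instead.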

In [None]:
# start with the bootstrapping


# constructing the interval


In [None]:
# try again with a larger sample
np.random.seed(112)
sample = stats.norm(10, 2).rvs(100)

# plot the new sample


In [None]:
# re-run the bootstrap


## Example 3: Old Faithful and estimating the median
Old Faithful is a geyser in Yellowstone National Park in Wyoming. Data given in `faithful.csv` contains information about a sample of consecutive eruptions of the geyser.

Construct a 95% bootstrap confidence interval for the median length of the eruptions.

In [None]:
# reading the file
df = pd.read_csv('faithful.csv')
display(df.head())

# get the sample

# plot the sample


In [None]:
# bootstrap


## Hypothesis testing with bootstrapping
It is possible to use the bootstrap method to test hypotheses. The approach is pretty much the same as with generating bootstrap "confidence" intervals, except that in this scenario we calculate a p-value for the test.

We will illustrate the application of this approach in a case when we test about the population mean, but as before -- it can be set up to test hypotheses about any parameter.

### Inference for the population mean
Given a sample of size $n$, we wish to test the following hypotheses about the population mean $\mu$:
$$
\begin{align}
H_0: \mu &= \mu_0\\
H_a: \mu &\neq \mu_0
\end{align}
$$

The process is identical up to the point when we calculate the sequence of differences $(\delta_i^*)$ for $i=1, \ldots, m$, where $m$ is the number of bootstrapped samples. In the last step, we just calculate the p-value as the probability:
$$
p = P\left( \left| \delta^* \right| >  \left| \bar{x}-\mu_0 \right| \right)
$$

If we work with one-sided alternative, then:
* If $H_a : \mu < \mu_0$, then the p-value is $p = P\left( \delta^* <  \bar{x}-\mu_0 \right)$, and
* If $H_a : \mu > \mu_0$, then the p-value is $p = P\left( \delta^* >  \bar{x}-\mu_0 \right)$

Similar rules apply to other parameters.
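
A minimal sketch of the p-value calculation for the mean, assuming a hypothetical helper `bootstrap_mean_pvalue`. Each probability is estimated as the fraction of bootstrap differences that satisfy the corresponding inequality. For the two-sided case the sketch compares $|\delta^*|$ with $|\bar{x}-\mu_0|$ so that both tails are counted. The `demo` array contains made-up numbers.

```python
import numpy as np

def bootstrap_mean_pvalue(sample, mu_0, alternative="two-sided", m=5000, seed=0):
    """Bootstrap p-value for a test about the population mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    x_bar = sample.mean()

    # bootstrap differences: bootstrapped mean minus original sample mean
    resamples = rng.choice(sample, size=(m, sample.size), replace=True)
    deltas = resamples.mean(axis=1) - x_bar

    diff = x_bar - mu_0
    if alternative == "two-sided":
        return np.mean(np.abs(deltas) > abs(diff))
    if alternative == "less":        # H_a: mu < mu_0
        return np.mean(deltas < diff)
    if alternative == "greater":     # H_a: mu > mu_0
        return np.mean(deltas > diff)
    raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")

# made-up illustrative numbers
demo = np.array([9.1, 11.4, 10.2, 8.7, 12.3, 9.8, 10.9, 7.6, 10.4, 11.0])
p_two_sided = bootstrap_mean_pvalue(demo, mu_0=11.0)
p_less = bootstrap_mean_pvalue(demo, mu_0=11.0, alternative="less")
print("two-sided p-value:", p_two_sided)
print("one-sided (less) p-value:", p_less)
```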

### Example 4
Verify the bootstrap method for hypothesis testing on the following *controlled* case. Choose a random sample of size $n=20$ from a normal distribution $\mathcal{N}(10, 2)$ and then test the hypotheses:
$$
\begin{align}
H_0: \mu &= 11\\
H_a: \mu &\neq 11
\end{align}
$$

In [None]:
#set up, generate, and plot the sample
np.random.seed(123)
mu_0 = 11
n = 20

sample = stats.norm(10, 2).rvs(n)
sns.histplot(sample)

In [None]:
# start with the bootstrapping


np.random.seed(32123)
for i in range(m):
    
    
# plot the deltas, for fun


In [None]:
# calculate |mean - mu_0|
abs_diff = ...
print('abs_diff = ', abs_diff)

# calculate the p-value
p = ...
print('p-value = ', p)

In [None]:
# Just for fun, see what the t-test says
stats.ttest_1samp(sample, mu_0)

### Example 5
For the eruption times for the Old Faithful data, test the following hypotheses about the population IQR:
$$
\begin{align}
H_0: \mathrm{IQR} &= 145\\
H_a: \mathrm{IQR} &< 145
\end{align}
$$

In [None]:
# recall the data
display(df.head())
sample = df['eruptions'].to_numpy()

In [None]:
# calculate the sample IQR
sample_iqr = ...

In [None]:
# bootstrap
np.random.seed(32123)


# get the difference
diff = sample_iqr - iqr_0

# plot the deltas and the difference 
plt.figure()

plt.show()

In [None]:
# get the p-value
p = ...

## Random sampling from a distribution

In applications, we frequently need to draw (generate) a random sample that follows a certain distribution. We will now describe one very simple method that can generate random samples from just about any distribution. This method depends on an already implemented random number generator for the uniform distribution.

Say you have a probability density function $f(x)$ over the interval $[a, b]$. Let $M$ be a number such that

$$M > \max_{x \in [a, b]}{f(x)}$$

Next, we generate random pairs of numbers $(x_k, y_k)$ such that $a \leqslant x_k \leqslant b$ and $0 \leqslant y_k \leqslant M$ ($y$ must be non-negative because the density function is never negative). These numbers are generated uniformly in the intervals.

Now, we can select the sample in the following way: if $y_k < f(x_k)$, then put $x_k$ in the sample; otherwise, discard $x_k$ and move to the next pair. At the end of the process, the sample that we selected will follow the distribution given by the density $f(x)$.
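
The accept/reject rule described above can be sketched as follows. The helper `rejection_sample` and the test density `triangle_pdf` ($f(x) = 2x$ on $[0, 1]$, whose maximum is 2) are assumptions chosen only for illustration; they are not the densities used in the examples below.

```python
import numpy as np

def rejection_sample(pdf, a, b, M, size, seed=0):
    """Draw `size` values on [a, b] with density `pdf`, where M > max pdf (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    accepted = []
    while len(accepted) < size:
        x_k = rng.uniform(a, b)     # candidate point, uniform on [a, b)
        y_k = rng.uniform(0, M)     # uniform height between 0 and M
        if y_k < pdf(x_k):          # keep x_k only if the point lies under the density curve
            accepted.append(x_k)
    return np.array(accepted)

def triangle_pdf(x):
    # density f(x) = 2x on [0, 1]; it integrates to 1 and its maximum is 2
    return 2.0 * x

draws = rejection_sample(triangle_pdf, a=0.0, b=1.0, M=2.1, size=500)
print("number of draws:", draws.size)
print("sample mean:", draws.mean())   # the theoretical mean of this density is 2/3
```

The closer $M$ is to the maximum of the density, the fewer candidate pairs are discarded, so the loop finishes sooner.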

### Example 5
Generate a random sample of 150 numbers in the interval $[0, 10]$ that follow a $\chi^2$-distribution with 3 degrees of freedom.

In [None]:
# Let's first sketch the density to get an idea what to expect
X = stats.chi2(df=3)
xs = np.linspace(0, 10, 1000)
plt.figure()
plt.plot(xs, X.pdf(xs))
plt.show()

In [None]:
# Generate the points
n_pts = 150
a = 0
b = 10
M = 0.25

np.random.seed(12)
x = []
y = []
num_in_sample = 0

while num_in_sample < n_pts:
    # Generate random pair
    x_rand = ...
    y_rand = ...
    
    # Decide if you keep the x
    if ...
        


# Plot the sample 


### Example 6
Generate a random sample of 100 numbers in the interval $[-5, 5]$ that follow a $t$-distribution with one degree of freedom.

In [None]:
#Get the distribution
T = stats.t(df=1)

# Generate the points
n_pts = 100
a = -5.0
b = 5.0
M = 0.35

np.random.seed(124)
x = []
y = []
num_in_sample = 0

