# Statistical Inference
Author: Julian Lißner

For questions and feedback write a mail to: [lissner@mib.uni-stuttgart.de](mailto:lissner@mib.uni-stuttgart.de)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sys
from scipy import stats as scipy_distributions

sys.path.append( 'provided_functions')
import sample

sys.path.append( 'incomplete_functions' )
import maximum_likelihood_estimators as MLE
import data_binning as binning

## Maximum likelihood estimator



### binomial distribution
- a random variable $Y$ of the binomial distribution with a single trial (also called the Bernoulli distribution) can only take 2 values, 0 and 1
- the probability is thus defined by $\displaystyle P(\{Y=1\})=\theta \quad\text{and} \quad P( \{ Y=0 \} ) = 1-\theta$
- the Maximum Likelihood Estimator (MLE) yields: $$\displaystyle \widehat{\theta} = \frac{1}{n} \sum\limits_{i=1}^n y_i$$
- the quality of the MLE depends heavily on the sample size: a larger sample dampens the inherent randomness of the sampled values
- since the data is sampled anew on each execution, you will get different results every time you run the cell
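
The effect of the sample size on the estimate can be illustrated with a minimal sketch. It does not use the provided `sample` module; the helper `bernoulli_mle` and the NumPy random generator are assumptions made only for this illustration, and the seed is fixed purely for reproducibility.

```python
import numpy as np

def bernoulli_mle(samples):
    """MLE of theta for 0/1 samples: the fraction of ones (sample mean)."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean()

rng = np.random.default_rng(seed=0)  # fixed seed only to make the sketch reproducible
theta_true = 0.55

for n in (20, 200, 20000):
    # boolean array, each entry is True with probability theta_true
    flips = rng.random(n) < theta_true
    theta_hat = bernoulli_mle(flips)
    print(f"n = {n:6d}   estimate = {theta_hat:.4f}   "
          f"abs. error = {abs(theta_hat - theta_true):.4f}")
```
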

------- 
__Task:__ Write a function to estimate the parameter of the binomial distribution in 'incomplete_functions/maximum_likelihood_estimators.py' and apply/validate it.<br>
Run the cell multiple times and try out different parameters.

In [None]:
n_flips = 20 #TODO try out different values
weight = 0.55

coin_results = sample.flip_coin( n_flips, weight=weight, datatype='array' )

theta = #MLE.binomial#TODO #estimate the parameter of the binomial distribution
print( 'true parameter:   \t', weight)

print( 'estimated parameter:\t', theta)

### uniform distribution
- a random variable $Y$ of the uniform distribution can take any value in the interval $[a,b]$
- the probability density is constant on the interval: $ f(x) = \frac{1}{b-a} $ for $x \in [a,b]$, and $f(x) = 0$ otherwise
- the MLE yields: $$ \widehat{a} = \underset{i=1, \dots, n}{\text{min}} y_i\,,\qquad \qquad \widehat{b} = \underset{i=1, \dots, n}{\text{max}} y_i$$
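
A minimal sketch of this estimator, assuming NumPy for sampling instead of the provided `sample` module (the helper name `uniform_mle` is hypothetical). Since the smallest and largest sample always lie inside the true interval, the estimated interval can only be shorter than or equal to the true one.

```python
import numpy as np

def uniform_mle(samples):
    """MLE of the interval bounds: smallest and largest observed value."""
    samples = np.asarray(samples, dtype=float)
    return samples.min(), samples.max()

rng = np.random.default_rng(seed=1)  # fixed seed only for reproducibility
a_true, b_true = -1.5, 5.0
samples = rng.uniform(a_true, b_true, size=20)

a_hat, b_hat = uniform_mle(samples)
# deviation between the estimated and the true interval length
length_deviation = abs((b_hat - a_hat) - (b_true - a_true))

print(f"estimated interval: [{a_hat:.4f}, {b_hat:.4f}]")
print(f"absolute deviation of the interval length: {length_deviation:.4f}")
```
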

---------- 
__Task:__ Write a function to estimate the parameters of the uniform distribution in 'incomplete_functions/maximum_likelihood_estimators.py' and apply/validate it.<br>

In [None]:
n_samples = 20 #TODO #try out different parameters
a_true = -1.5
b_true = 5
samples = sample.uniform_distribution( n_samples, a_true, b_true )

a, b = MLE.#TODO #estimate the parameters of the uniform distribution

print( 'true values of the uniform distribution:')
print( '\ta: {:5.2f}\n\tb: {:5.2f}'.format( a_true, b_true) )
print( 'estimated values of the uniform distribution:')
print( '\ta: {:7.4f}\n\tb: {:7.4f}'.format( a, b) )
print()
print( 'absolute deviation of estimated interval length: {:.4f}'.format(#TODO #compute the deviation of interval length)) 

--------------
---------------
## Real world distribution
- for real world data the underlying distribution is often unknown
- the distribution often does not perfectly follow common distributions like the normal or the uniform distribution
- the expert (you in this case) has to guess a fitting distribution and validate the model error $\blacktriangleright$ statistical inference

-------
__Task:__ Fit the data to the uniform distribution. 

In [None]:
data = np.load( 'data/samples.npz' )['arr_0'] 
n_samples = len( data)

a, b = #TODO #estimated parameters of the uniform distribution

- data is often easier to analyze in bins
- with data binning, there is always a loss of information
- a bin is defined by:
    - number of samples found in the $k$-th bin $N_k$<br>
    - center value of the bin (generally not the mean of samples in the bin)<br>
    - width of the bin $w^{\rm bin}_k$
- important quantities for binned data are:
    - relative frequency: $\quad P_k = \frac{N_k}{n_{\rm samples} }$<br>
    - cumulative frequency: $F_k = \sum\limits_{i=1}^k P_i$ <br>
    - bin density: $\qquad\quad$ $f_k = \frac{ P_k}{w^{\rm bin}_k} $<br>
- Hint: The plots below might help you debug
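
The quantities above can be sketched with `np.histogram` for equal width bins. This is only an illustration under assumptions: the data is a random placeholder and not the course data set, and `np.histogram` stands in for the `bin_data` function that is to be written in the task.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
data = rng.normal(size=1000)  # placeholder data, not the course data set
n_bins = 10

# N_k for each bin and the n_bins + 1 bin edges (equal width bins)
counts, edges = np.histogram(data, bins=n_bins)
bin_width = edges[1] - edges[0]
bin_centers = edges[:-1] + bin_width / 2

rel_freq = counts / data.size       # P_k = N_k / n_samples
cum_freq = np.cumsum(rel_freq)      # F_k = sum of P_i up to bin k
bin_density = rel_freq / bin_width  # f_k = P_k / w_k

print("relative frequencies sum to:", rel_freq.sum())
print("last cumulative frequency:  ", cum_freq[-1])
print("integral of the bin density:", (bin_density * bin_width).sum())
```
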
------
__Task:__ Bin the data into a reasonable number of bins by writing the `bin_data` function in 'data_binning.py'. Compute the relative and cumulative frequency of each bin as well as the bin density.

In [None]:
n_bins = 2 #TODO #parameter to be adjusted

bin_occurence, bin_centers, bin_width = binning.bin_data(#TODO 
rel_freq = #TODO
#help( np.cumsum )
cum_freq = #TODO 
bin_density = #TODO

- binned data can be nicely visualized
- recall that binned data is discrete, so line plots are rarely a good choice
- when plotting the CDF, the cumulative frequency cannot be anchored at `bin_centers`<br>
$\quad$ the line has to go through the 'right corner' of the current bin (see the lecture notebook for an example)
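
Why the right corner is the correct anchor can be seen from a small numerical sketch: the cumulative frequency $F_k$ counts all samples up to the upper bound of bin $k$, so it belongs to that x-value. The binned values below are hypothetical and chosen only for illustration.

```python
import numpy as np

# hypothetical binned data: 4 equal width bins covering [0, 4]
bin_centers = np.array([0.5, 1.5, 2.5, 3.5])
bin_width = 1.0
rel_freq = np.array([0.1, 0.4, 0.3, 0.2])
cum_freq = np.cumsum(rel_freq)

# F_k is reached at the upper bound (right corner) of bin k
right_edges = bin_centers + bin_width / 2

# prepend the lower bound of the first bin, where the CDF is still 0
x_cdf = np.concatenate(([bin_centers[0] - bin_width / 2], right_edges))
y_cdf = np.concatenate(([0.0], cum_freq))

for x, y in zip(x_cdf, y_cdf):
    print(f"x = {x:.1f}   F = {y:.1f}")
```
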

----------
__Task:__ Plot the cumulative distribution function and the bin density/probability density of the binned data and the estimated distribution. Make sure that the plot looks nice by specifying e.g. linewidth or colors.

In [None]:
plotting_centers = bin_centers + #TODO #shift the x-value of the plots
fig, axes = plt.subplots( 1, 2, figsize=(12,7))
# cumulative distribution function
axes[0].plot( [a, b, 5*b ], [0,1,1], color='red', lw=2.2, label='estimate - uniform distribution')
axes[0].plot( #TODO #plot the CDF of the binned data , label='underlying distribution')
axes[0].bar( #TODO #plot the bins in the CDF , width=#TODO, color='lightblue', alpha=0.5 ) 

# relative distribution function
axes[1].plot( [a-0.0001, a, b, b+0.00001], [0, 1/(b-a), 1/(b-a), 0 ], color='red', lw=2.2, label='estimate - uniform distribution' )
axes[1].bar( #TODO #display the bin density of the binned data ,label='underlying distribution' )

## style of plots
titles = ['cumulative distribution function', 'relative distribution function' ]
for ax in axes:
    ax.grid( color='#AAAAAA', ls=':' )
    ax.set_title( titles.pop( 0) )
    ax.legend()

axes[0].set_ylim( ymin=0, ymax=1.01 )
axes[0].set_xlim( xmax=b ) 

- the MLE chooses the parameters that maximize the likelihood function, i.e. the parameters under which the observed data is most probable for the assumed distribution
- the MLE does not give an error measure
- we define an error measure $e$ based on the bin density:
$$
e = \frac{n_{\rm samples}}{n_{\rm bins} }\, \sum\limits_{k=1}^{n_{\rm bins}} \Big|\frac{P_k}{w^{\rm bin}_k} - 
\underbrace{\frac{F(u_k| \text{distr}) - F(l_k|\text{distr})}{w^{\rm bin}_k}}_{\approx \frac{\text d F}{\text d x}}\,\Big|
$$
$\quad$ with $P_k$ being the relative frequency at bin $k$, $w^{\rm bin}_k$ the width of bin $k$.<br>
$\quad$ $F(x|\text{distr})$ denotes the cumulative distribution function of the estimated distribution evaluated at $x$<br>
$\quad$ $u_k$ and $l_k$ denote the _upper_ and _lower_ bound of each bin, respectively.<br>
- the error measure is interpreted as the amount of 'wrongly binned' samples averaged over all bins, if the estimated distribution was equally binned
- `scipy.stats` also contains distributions <br>
$\quad$ (scipy.stats is already imported as `scipy_distributions`)
- when learning a new module, always look at the help function!
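
The error measure can be sketched as follows for a fitted uniform distribution. This is an illustration under assumptions: the data is a random placeholder, and `np.histogram` is used for the binning. Note that `scipy.stats.uniform` is parameterized by `loc` and `scale` with support $[\text{loc}, \text{loc} + \text{scale}]$, so the interval $[a, b]$ corresponds to `loc = a` and `scale = b - a`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
data = rng.uniform(-1.5, 5.0, size=2000)  # placeholder data
n_samples, n_bins = data.size, 10

counts, edges = np.histogram(data, bins=n_bins)
rel_freq = counts / n_samples   # P_k
bin_width = edges[1] - edges[0]  # equal width bins

# MLE of the uniform distribution, converted to scipy's (loc, scale)
a_hat, b_hat = data.min(), data.max()
loc, scale = a_hat, b_hat - a_hat

# probability mass of the fitted distribution inside each bin
lower, upper = edges[:-1], edges[1:]
dF = stats.uniform.cdf(upper, loc, scale) - stats.uniform.cdf(lower, loc, scale)

# e = n_samples / n_bins * sum_k | P_k / w_k - dF_k / w_k |
error = n_samples / n_bins * np.sum(np.abs(rel_freq / bin_width - dF / bin_width))
print("error measure:", error)
```
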

-------
__Task:__ Compute the defined error measure for the estimated uniform distribution.

In [None]:
error_scaling = n_samples/n_bins/bin_width
distribution = scipy_distributions.uniform
parameters = [a, b - a] #previously estimated parameters using the MLE; scipy's uniform takes (loc, scale), i.e. the interval [loc, loc + scale]

#help( distribution.cdf)
error = 0
for i in range( len( rel_freq)):
    l_k = #TODO
    u_k = #TODO
    dF = #TODO distribution.#TODO( #TODO, *parameters) #TODO...
    error += np.abs( (rel_freq[i] - dF)) * #TODO

# the error should lie between 25 and 30   
print( 'error measure of the uniform distribution:', error)

- scipy has many distributions implemented, see their official documentation [https://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html]( https://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html)
- you can implement the MLE for the respective distribution yourself, or call the `parameters = distribution.fit( data)` method
- **NOTE** that scipy's `fit` method uses maximum likelihood estimation by default, which generally yields biased estimators<br>
$\quad\blacktriangleright$ parameters could be corrected to obtain the unbiased estimator
$$\quad$$
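
As an illustration of the bias, consider the normal distribution (chosen here only as an example, with random placeholder data). The maximum likelihood estimate of the standard deviation returned by `fit` divides by $n$, and rescaling the variance by $\frac{n}{n-1}$ yields the unbiased variance estimator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
data = rng.normal(loc=2.0, scale=3.0, size=50)  # placeholder data

# maximum likelihood estimates, returned as (loc, scale) = (mean, std)
mu_hat, sigma_hat = stats.norm.fit(data)

# the MLE of the standard deviation divides by n (ddof=0);
# scaling the variance by n / (n - 1) gives the unbiased variance estimator
n = data.size
sigma_corrected = sigma_hat * np.sqrt(n / (n - 1))

print(f"mean (MLE):                {mu_hat:.4f}")
print(f"std (MLE, divides by n):   {sigma_hat:.4f}")
print(f"std (variance corrected):  {sigma_corrected:.4f}")
```
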
------ 
__Task:__ Find a distribution which fits the data better. The error should be at most $e \leq 3.5$.<br>

In [None]:
distribution = scipy_distributions#TODO
parameters = #TODO

error = 0
for i in range( len( rel_freq)):
    l_k = #TODO #(you might want to copy paste the code from above)
    u_k = #TODO
    dF = #TODO
    error += np.abs((rel_freq[i] - dF))  * #TODO


print( 'achieved error:', error)

----------
__Task:__ Compare the first 4 moments of your estimated distribution to the moments of the random variable `data`
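
A minimal sketch of such a comparison, assuming a normal distribution and random placeholder data purely for illustration. The `moment` method of a scipy distribution returns the non-central (raw) moment of the given order, so it is compared to the raw sample moment $\frac{1}{n}\sum_i y_i^k$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
data = rng.normal(loc=1.0, scale=2.0, size=100000)  # placeholder data

distribution = stats.norm            # example choice, not the task's answer
parameters = distribution.fit(data)  # (loc, scale) from the MLE

print("order\t sample moment\t fitted distribution")
for order in range(1, 5):
    sample_moment = np.mean(data**order)                    # raw sample moment
    model_moment = distribution.moment(order, *parameters)  # raw moment of the fit
    print(f"{order}\t {sample_moment:13.5f}\t {model_moment:13.5f}")
```
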

In [None]:
#help( distribution.moment)
expectation = lambda x: x.mean()
#TODO...

print( '           \t Underlying\t Estimated')
print( '-------------------------------------------------')
print( 'expectation:\t {:9.5f}\t {:9.5f}'.format( expectation( data), distribution.moment( 1, *parameters)) ) #TODO distribution 
print( 'std        :\t {:9.5f}\t {:9.5f}'.format( #TODO))

- the sampled CDF is directly given by the data
- it can be recreated by setting `n_bins = n_samples`<br>
$\quad \blacktriangleright$ each sample has the relative frequency of $\frac{1}{n_{\text{samples}} }$
- to plot the CDF, the data does not have to be binned; the sampled CDF can be written directly in one line
- additional variables might help the readability and the derivation
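
The sampled CDF described above can be sketched without any binning: after sorting, the $i$-th smallest sample is assigned the cumulative frequency $\frac{i}{n_{\text{samples}}}$. The data below is a random placeholder, and the variable names are chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=6)
data = rng.normal(size=500)  # placeholder data

n_samples = data.size
x_sorted = np.sort(data)                            # x-values of the sampled CDF
ecdf = np.arange(1, n_samples + 1) / n_samples      # each sample adds 1 / n_samples

print("smallest sample:", x_sorted[0], " F =", ecdf[0])
print("largest sample: ", x_sorted[-1], " F =", ecdf[-1])
```
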

-----------
__Task:__ Plot the cumulative distribution and the density functions of the well matching distribution. Make sure that the plot looks nice by specifying e.g. linewidth or colors.

In [None]:
fig, axes = plt.subplots( 1, 3, figsize=(16,7))
# cumulative distribution function
axes[0].plot( bin_centers, distribution.#TODO #plot the CDF of the estimated distribution, label='estimated distribution')
axes[0].plot( #TODO #plot the CDF of the binned data , label='underlying distribution' )
axes[0].bar( #TODO #plot the bins in the CDF , width=#TODO, color='lightblue', alpha=0.5 ) 

# relative distribution function
axes[1].plot( bin_centers, distribution.#TODO #plot the pdf of the estimated distribution , label='estimated distribution' )
axes[1].bar( #TODO #display the relative frequency of the binned data  ,label='underlying distribution' )

axes[2].plot( #TODO #plot the true CDF of the data ,label='sampled CDF' )
axes[2].plot( #TODO #plot the CDF of the estimated distribution ,label='estimated distribution') 

## style of plots
titles = ['cumulative frequency', 'relative frequency', 'real and estimated CDF' ]
for ax in axes:
    ax.grid( color='#AAAAAA', ls=':' )
    ax.set_title( titles.pop( 0) )
    ax.legend()

axes[0].set_ylim( ymin=0, ymax=1.01 )
axes[2].set_ylim( ymin=0, ymax=1.01 )