<a target="_blank" href="https://colab.research.google.com/github/AndreasRupp/ecdf_estimator_examples/blob/bsc_thesis/tutorial/02-deviation-of-normal-distribution-bootstrap.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Parameter Identification for Normal Distribution Data - Bootstrap Method

This notebook introduces a bootstrap method for identifying the deviation parameter of simple normally distributed data. The approach is based on the empirical cumulative distribution function (eCDF). First, normally distributed data is created with a known standard deviation. Subsequently, new data is generated with various deviation parameters to identify the initial deviation parameter. The implementation of the method with Python and the `ecdf-estimator` package is explained step by step in this notebook. It is very similar to the standard method demonstrated in the first notebook of the tutorial, which is recommended reading before this one.

## Setup of the Process

### Environment Setup

First, `ecdf-estimator` is installed and imported as `ecdf`.

In [None]:
pip install ecdf-estimator

After the installation, the estimator is imported so that its functions can be used. It is abbreviated as `ecdf`, which keeps the code shorter and easier to read. The functions can now be called as `ecdf.function_name()`.


In [None]:
import ecdf_estimator as ecdf

Next, the other two necessary modules, `numpy` and `matplotlib.pyplot`, are imported. The modules are used to do array operations efficiently and to visualize the data and results.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

### Creation of Training Data

Next, normally distributed training data must be created. First, the variables which are used to create the data are defined:

* `mean`: The mean value of training data.<br>
* `dev`: The training data's standard deviation, which the method tries to identify.<br>
* `size`: The number of created data points.<br>

In [None]:
mean = 0
dev = 5
size = 2000

After the parameters are defined, the normally distributed training data can be created from them with `numpy`'s `random.normal` function.

In [None]:
data = np.random.normal(mean, dev, size)
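As a quick sanity check, the sample statistics of such data can be compared with the chosen parameters. The sketch below is only an illustration: it uses a seeded `np.random.default_rng` generator (an assumption made here for reproducibility, not something the tutorial requires) instead of the global `np.random.normal`.

```python
import numpy as np

# Seeded generator: an assumption made for reproducibility of this sketch.
rng = np.random.default_rng(seed=0)

mean, dev, size = 0, 5, 2000
data = rng.normal(mean, dev, size)

sample_mean = data.mean()
sample_dev = data.std(ddof=1)  # based on the unbiased variance estimator

print(f"sample mean = {sample_mean:.3f}, sample deviation = {sample_dev:.3f}")
```

With 2000 points, both values should lie close to `mean` and `dev`, although they will not match them exactly.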

If desired, noise can be added to the training data by setting the variable `noise_dev` to the standard deviation of the noise distribution. By default, `noise_dev` is `None` and no noise is added.

In [None]:
noise_dev = None

if noise_dev is not None:
  data += np.random.normal(mean, noise_dev, size)
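Adding independent normal noise changes the effective deviation of the training data: the variances add, so the resulting standard deviation is $\sqrt{\texttt{dev}^2 + \texttt{noise\_dev}^2}$. The following sketch checks this numerically. The value `noise_dev = 2`, the larger sample size, and the seeded generator are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seeded for reproducibility (assumption)

# noise_dev = 2 and size = 200_000 are hypothetical values for this sketch.
mean, dev, noise_dev, size = 0, 5, 2, 200_000

data = rng.normal(mean, dev, size)
data += rng.normal(mean, noise_dev, size)  # same pattern as in the cell above

expected_dev = np.hypot(dev, noise_dev)  # sqrt(5**2 + 2**2) = sqrt(29)
observed_dev = data.std(ddof=1)

print(f"expected = {expected_dev:.3f}, observed = {observed_dev:.3f}")
```

When noise is enabled, the identified deviation parameter should therefore be expected near this combined value rather than near `dev` itself.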

## Parameter Identification Process

After the creation of training data, the actual parameter identification begins.

### Definition of Variables

The variables which will be used in the process are defined.

* `subset_sizes`: With bootstrapping, the first two items of this list define the sizes of subsets, which are selected randomly from the training data later in the process when creating eCDFs.

In [None]:
subset_sizes = [150] * 2

* `mean_new`: The mean value of the new datasets to be created. Here it equals the mean of the training data, but it is defined as a separate variable for clarity.

In [None]:
mean_new = 0

* `n_bins`: The maximum number of bins which will be used for the eCDF-vectors. The exact number will be defined by the `ecdf`'s own functions.

In [None]:
n_bins = 7

* `min_dev`: The smallest deviation parameter to be tested.<br>
* `max_dev`: The largest deviation parameter to be tested.<br>

`min_dev` and `max_dev` define a closed interval. Every integer value in this interval is tested as a candidate for the standard deviation of the training data.

In [None]:
min_dev = 2
max_dev = 8

Next, a function named `distance` must be defined. It plays a major part in the creation of the eCDF-vectors throughout the identification process. The function computes the absolute value of the difference between data points, which for one-dimensional data is the Euclidean distance between them.

In [None]:
def distance(data_a, data_b):
  return np.abs(data_a-data_b)
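Because `distance` only uses NumPy operations, it also works on whole arrays. The sketch below shows how broadcasting yields all pairwise distances between two small example arrays. The arrays `a` and `b` are made up for illustration, and whether `ecdf-estimator` calls the function in exactly this way is not assumed here.

```python
import numpy as np

def distance(data_a, data_b):
    # Absolute difference, i.e. the Euclidean distance in one dimension.
    return np.abs(data_a - data_b)

# Hypothetical example values.
a = np.array([1.0, 4.0, 6.0])
b = np.array([2.0, 8.0])

# a[:, None] has shape (3, 1) and b[None, :] has shape (1, 2),
# so the result holds the distance of every element of a to every element of b.
pairwise = distance(a[:, None], b[None, :])
print(pairwise)
# [[1. 7.]
#  [2. 4.]
#  [4. 2.]]
```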

### Generation of Objective Functions and eCDF-vectors

Next, the eCDF-vectors have to be created and plotted. This is done by creating a new instance of `ecdf`'s `bootstrap` class. Because the vectors are plotted both with a large number of bins and with a smaller number of bins, they must be created twice.

The `estimate_radii_values` function of `ecdf-estimator` is called to determine an appropriate range for the bin values. The range is selected based on the computed distances between the data points of the first two subsets of the training data. The returned limits are stored in `min_val` and `max_val`, and the distances between data points in the matrix `distance_data`.

In [None]:
min_val, max_val, distance_data = ecdf.estimate_radii_values(data, subset_sizes, distance)

Then, the interval from `min_val` to `max_val` is covered with 50 evenly spaced bin values.

In [None]:
bins = np.linspace(min_val, max_val, 50)

The objective function is assembled by creating a new instance of `ecdf`'s `bootstrap` class. The input arguments are `data`, `bins`, the `distance` function, `subset_sizes[0]` and `subset_sizes[1]`. This step differs notably from the standard method: the dataset is no longer divided into fixed subsets. Instead, the distances between all pairs of data points are computed and collected in a distance matrix. Then, `subset_sizes[0]` and `subset_sizes[1]` entries are selected randomly from the distance matrix, so that the same value cannot be selected twice, to form two subsets. The distances between these subsets are computed with the `distance` function, and an eCDF vector is created from them. This selection of subsets and creation of eCDF vectors is repeated 1000 times. All computed values are stored in `aux_func`, which specifies the first objective function with all 50 bins.

In [None]:
aux_func = ecdf.bootstrap(data, bins, distance, subset_sizes[0], subset_sizes[1])
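To make the bootstrap idea more concrete, the following plain NumPy sketch repeats the random selection and builds one eCDF vector per round. It is one plausible reading of the description above, not the implementation used by `ecdf-estimator`. The bin values, the reduced data size, the number of rounds, and the use of disjoint index sets are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # seeded for reproducibility (assumption)

data = rng.normal(0, 5, 500)        # smaller than the tutorial's 2000 points
bins = np.linspace(0.5, 15.0, 7)    # hypothetical bin values, not from estimate_radii_values
n_a, n_b, n_rounds = 150, 150, 200  # the tutorial text mentions 1000 rounds

# Full matrix of pairwise distances |x_i - x_j|.
dist_matrix = np.abs(data[:, None] - data[None, :])

ecdf_vectors = np.empty((n_rounds, len(bins)))
for k in range(n_rounds):
    # Draw two disjoint index sets without replacement.
    idx = rng.permutation(len(data))
    rows, cols = idx[:n_a], idx[n_a:n_a + n_b]
    sub = dist_matrix[np.ix_(rows, cols)]
    # eCDF vector: fraction of distances that do not exceed each bin value.
    ecdf_vectors[k] = [(sub <= r).mean() for r in bins]

# Statistics that a Gaussian model of the eCDF vectors would be built on.
mean_vector = ecdf_vectors.mean(axis=0)
cov_matrix = np.cov(ecdf_vectors, rowvar=False)
```

The mean vector and covariance matrix summarize how the eCDF vectors of the training data scatter, which is the kind of information an objective function needs to judge new data.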

Then, the other function with a smaller number of bins must be defined. Otherwise, there could be unwanted correlation between neighbouring bin values. The `choose_bins` function of `ecdf` is called to select reasonable bin values from the larger set.

In [None]:
bins = ecdf.choose_bins(distance_data, bins, n_bins)

Now, the second objective function `func` can be defined. This time, only the small number of selected bins is used to create the eCDF-vectors and to compute their statistics.

In [None]:
func = ecdf.bootstrap(data, bins, distance, subset_sizes[0], subset_sizes[1])

After the two objective functions have been created, they can be plotted. First, the figure is defined with the `subplots` function of `matplotlib.pyplot`. Then, the plots are made with the plotting functions of `ecdf`. The vectors with 50 bins are plotted in magenta, the ones with the selected bins only in cyan, and their mean values in black.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ecdf.plot_ecdf_vectors(aux_func, ax, 'm.')
ecdf.plot_ecdf_vectors(func, ax, 'c.')
ecdf.plot_mean_vector(func, ax, 'k.')
plt.title("Distribution of the eCDF-vectors")
plt.xlabel("Bin Values")
plt.ylabel("Cumulative Probability")
plt.show()

### $\chi^2$ test: Checking normality of eCDF vectors

The eCDF-vectors should be approximately Gaussian when the sample size of the training data is large enough. This is checked with the $\chi^2$ test by calling the `plot_chi2_test` function of `ecdf`. The function computes the negative log-likelihood value of each eCDF vector to examine how well the vector fits the model of the objective function `func`. The log-likelihood values are then normalized, and a histogram of them is plotted. After that, the probability density function of the chi-square distribution with the appropriate degrees of freedom (which is the number of the eCDF vectors) is plotted on top of the histogram. If the histogram fits the density function well, the normality of the eCDF vectors is supported and the parameter estimation process should be valid and reliable.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ecdf.plot_chi2_test(func, ax)
plt.title("Gaussianity Test by Chi-squared Criterion")
plt.xlabel("Normalized Log-likelihood")
plt.ylabel("Probability Density")
plt.show()
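The reasoning behind such a test can be illustrated with a generic textbook fact: for Gaussian vectors of dimension `k`, the squared Mahalanobis distance to the mean follows a chi-square distribution with `k` degrees of freedom, which has mean `k` and variance `2k`. The sketch below checks this with synthetic Gaussian vectors. The dimension, the sample count, and the covariance matrix are hypothetical, and the sketch makes no claim about how `plot_chi2_test` normalizes its values.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # seeded for reproducibility (assumption)

k, n = 4, 5000  # hypothetical vector dimension and number of vectors
mixing = rng.normal(size=(k, k))
cov = mixing @ mixing.T + k * np.eye(k)  # symmetric positive definite covariance
samples = rng.multivariate_normal(np.zeros(k), cov, size=n)

centered = samples - samples.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(samples, rowvar=False))

# Squared Mahalanobis distance of every sample to the sample mean.
d2 = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)

# For Gaussian data these should be close to k and 2 * k.
print(f"mean = {d2.mean():.3f}, variance = {d2.var():.3f}")
```

If the vectors were clearly non-Gaussian, the distribution of these distances would deviate from the chi-square density, which is what a mismatch between histogram and density curve indicates.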

### Evaluation of Deviation Parameters

After the reliability of the identification process has been checked, the evaluation of the different standard deviation parameters begins. First, a list `devs` containing all deviation parameters to be evaluated is created. The number of parameters is stored in `n_devs`.



In [None]:
devs = list(range(min_dev, max_dev + 1))
n_devs = len(devs)

Next, a list of lists, or a matrix, `values`, is initialized. The matrix will store objective function values, which are negative log-likelihood values, for datasets created for each deviation parameter. One row of the matrix will contain the values for one deviation parameter.

List `means_log` is initialized as zeros with a length of `n_devs`, and will store mean log-likelihood values for each deviation parameter.

In [None]:
values = [[] for i in range(n_devs)]
means_log = [0.] * n_devs

The variable `n_subsets` defines how many new datasets are created for each deviation parameter and evaluated with the objective function. By default, 5 datasets are evaluated for each parameter. Note that with bootstrapping, an even smaller number of datasets may be enough to get decent results.

In [None]:
n_subsets = 5

#### Negative Log-likelihood Analysis

Now the new datasets are created and evaluated, and their negative log-likelihood values are computed. The outer loop iterates over the `n_devs` deviation parameters. For each of them, the inner loop creates `n_subsets` new datasets `newdata` with the current deviation parameter `devs[i]` using NumPy's `random.normal` function, just as for the training data. Each new dataset has `subset_sizes[0]` data points, the same size as a subset of the training data.

Then, the `evaluate` function of `ecdf` is called to evaluate each new dataset with the objective function `func`. For each `newdata`, another dataset is formed from it by changing its order, and an eCDF vector of the distances between these two datasets is created. Then, the negative log-likelihood value of the new eCDF vector is calculated to examine how well it fits the objective function created from the training data. This is the value the function returns, and it is appended to the row of `values` that belongs to the current deviation parameter. After the inner loop, the mean of that row is stored in `means_log`.

In [None]:
for i in range(n_devs):
  for j in range(n_subsets):
    newdata = np.random.normal(mean_new, devs[i], subset_sizes[0])
    values[i].append(ecdf.evaluate(func, newdata))
  means_log[i] = np.mean(values[i])

The evaluated negative log-likelihood values of all datasets and their means are plotted with the `plot` function of `matplotlib.pyplot`. Every dataset's value is plotted in red and the mean for each deviation parameter in blue. The smaller the mean value of a parameter is, the better the parameter fits the training data.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(devs, values, 'ro')
ax.plot(devs, means_log, 'bo')
plt.title("Evaluation of the Negative Log-likelihood Values")
plt.xlabel("Deviation Parameter")
plt.ylabel("Negative Log-likelihood Value")
plt.show()

#### Normalization of Log-likelihood Values

The final step of the process is the normalization of the log-likelihood values. First, each value is multiplied by -0.5 and then exponentiated. This maps the smallest negative log-likelihood values, which correspond to the best fits to the training data, to the largest values.

In [None]:
values = [[np.exp(-0.5*values[i][j]) for j in range(n_subsets)] for i in range(n_devs)]

Then, the sum of each column is computed and stored in the list `sums`. Each column contains one value for every deviation parameter.

In [None]:
sums = np.sum(values, axis=0)

After that, each value in `values` is divided by the sum of its column. Within a column, the result is the relative likelihood of each deviation parameter fitting the training data. In effect, one dataset is selected for each parameter, and the likelihood value of each dataset is computed.

In [None]:
values = [[values[i][j] / sums[j] for j in range(n_subsets)] for i in range(n_devs)]

The mean of each row, which is the mean of the datasets' likelihood values for one parameter, is computed and stored in the list `means_nor`.

In [None]:
means_nor = [np.mean(values[i]) for i in range(n_devs)]
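The normalization steps above can also be written with NumPy array operations. The sketch below uses a small made-up matrix of negative log-likelihood values (3 deviation parameters, 2 datasets each) instead of the tutorial's results. Subtracting each column's minimum before exponentiating is an optional addition made here: it cancels in the division and so leaves the normalized values unchanged, but it avoids numerical underflow when the negative log-likelihood values are large.

```python
import numpy as np

# Hypothetical negative log-likelihood values:
# rows = deviation parameters, columns = datasets.
nll = np.array([[12.0, 10.0],
                [ 2.0,  3.0],
                [ 9.0, 11.0]])

# Shift by the column minimum (cancels out below), then exponentiate.
weights = np.exp(-0.5 * (nll - nll.min(axis=0)))

normalized = weights / weights.sum(axis=0)  # every column now sums to 1
means_nor = normalized.mean(axis=1)         # one mean likelihood per parameter

# Same result as the unshifted computation used in the cells above.
naive = np.exp(-0.5 * nll)
naive = naive / naive.sum(axis=0)

print(means_nor)
```

In this toy example the second parameter has the smallest negative log-likelihood values in both columns, so it receives the largest normalized likelihood.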

Finally, the normalized likelihood values for each deviation parameter are plotted as before, and the goodness of each parameter can be estimated.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(devs, values, 'ro')
ax.plot(devs, means_nor, 'bo')
#plt.title("Normalized Likelihood Values and Average Values Over All Evaluations")
plt.xlabel("Deviation Parameter")
plt.ylabel("Likelihood")
plt.show()