<a target="_blank" href="https://colab.research.google.com/github/AndreasRupp/ecdf_estimator_examples/blob/bsc_thesis/tutorial/01-deviation-of-normal-distribution.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Parameter Identification for Normal Distribution Data - Standard Method

This notebook introduces a method for identifying the deviation parameter of simple, normally distributed data. The approach is based on the empirical cumulative distribution function (eCDF). First, normally distributed data is created using a known standard deviation. Subsequently, new data is generated using various deviation parameters to identify the initial deviation parameter. The implementation of the method using Python and the `ecdf-estimator` package is explained step by step within this notebook.

## Setup of the Process

### Environment Setup

First, `ecdf-estimator` is installed. It contains a Python implementation of the eCDF-based approach used here, and the main steps of the method are carried out with functions from this module. The installation is needed because the module is not pre-installed in Jupyter Notebook or other common Python environments.

In [None]:
pip install --ignore-requires-python ecdf-estimator

After the installation, the estimator is imported so that its functions can actually be used. It is abbreviated as `ecdf`, which makes the code faster to write and easier to read. Now, the functions of `ecdf` can be called like `ecdf.function_name()`.


In [None]:
import ecdf_estimator as ecdf

Next, the other two necessary modules, `numpy` and `matplotlib.pyplot`, are imported. These modules are among the most commonly used in Python and are pre-installed in the most common Python environments, so a separate installation is usually not needed. `numpy`, usually abbreviated as `np`, is one of the most important modules for scientific and numerical computing in Python. It offers efficient array operations and numerical functions, and is easy to integrate with other scientific libraries. One of these is `matplotlib.pyplot`, usually abbreviated as `plt`, which is a submodule of `matplotlib`. It provides a simple interface for creating plots and visualizations and is commonly used in many kinds of data analysis. One of its most important features is how easily it integrates with `numpy`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

### Creation of Training Data

Next, normally distributed training data must be created. First, the variables which are used to create the data are defined:

`mean`: The mean value of training data.<br>
`dev`: The training data's standard deviation, which is the parameter to be identified.<br>
`size`: The number of created data points.<br>

In [None]:
mean = 0
dev = 5
size = 2000

After the parameters are defined, the normally distributed training data can be created using *numpy*'s `random.normal`-function. The function draws random numbers from the normal distribution specified by the three input arguments: the mean value of the normal distribution, the standard deviation of the normal distribution, and the number of points created. By default, we create 2000 normally distributed data points with mean value 0 and standard deviation 5. The training data is stored in the array `data`.

In [None]:
data = np.random.normal(mean, dev, size)
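As a quick sanity check, the sample statistics of the generated data can be compared with the chosen parameters. The sketch below is not part of the original workflow. It assumes a seeded generator from `np.random.default_rng` (instead of the global `np.random.normal` used above) so that the result is reproducible, and the names `rng` and `sample` are illustrative.

```python
import numpy as np

mean, dev, size = 0, 5, 2000

# Seeded generator: an assumption made only for reproducibility of this sketch.
rng = np.random.default_rng(seed=0)
sample = rng.normal(mean, dev, size)

sample_mean = sample.mean()
sample_dev = sample.std(ddof=1)  # sample standard deviation

print(f"sample mean: {sample_mean:.3f}, sample deviation: {sample_dev:.3f}")
```

With 2000 points, both statistics should lie close to, but not exactly at, the chosen values of 0 and 5.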

If desired, noise can be added to the training data by setting the variable `noise_dev` to the standard deviation of the noise distribution. The noise is then created with *numpy*'s `random.normal`-function, just like the training data, and added point-wise to the training data using the `+=` operator. By default, `noise_dev` is set to `None` and no noise is added.

In [None]:
noise_dev = None

if noise_dev is not None:
  data += np.random.normal(mean, noise_dev, size)
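Note that adding independent zero-mean Gaussian noise changes the effective deviation of the data: the variances add, so the noisy data has a deviation of $\sqrt{\text{dev}^2 + \text{noise\_dev}^2}$ rather than `dev`. The following minimal check illustrates this. The value `noise_dev = 2`, the large sample size and the seeded generator are assumptions chosen for this sketch only.

```python
import numpy as np

dev, noise_dev, size = 5, 2, 200_000  # large size only to make the check stable

rng = np.random.default_rng(seed=1)  # seeded for reproducibility (assumption)
clean = rng.normal(0, dev, size)
noisy = clean + rng.normal(0, noise_dev, size)

# Variances of independent terms add, so the deviations combine as a root sum of squares.
expected_dev = np.sqrt(dev**2 + noise_dev**2)
observed_dev = noisy.std(ddof=1)

print(f"expected: {expected_dev:.3f}, observed: {observed_dev:.3f}")
```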

## Parameter Identification Process

After the creation of training data, the actual parameter identification begins.

### Definition of Variables

The variables which will be used in the process are defined.

`subset_sizes`: The list of subset sizes of training data. By default, the length of the list is 20, and each value of the list is 100. This means that the training data will be split into 20 subsets, and each subset's size is 100. Also the sizes of new datasets, which will be created later, will depend on this variable. Note that the sum of the sizes of subsets should equal the size of training data!

In [None]:
subset_sizes = [100]*20

`n_subsets`: The number of subsets of the training data, calculated with Python's `len`-function which now computes the length of list `subset_sizes`. A command `len(subset_sizes)` could be used instead of `n_subsets` later on, but this variable makes the code easier to understand.

In [None]:
n_subsets = len(subset_sizes)
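The requirement that the subset sizes add up to the size of the training data can be checked directly. The sketch below also shows one way to split an array according to `subset_sizes`. How `ecdf-estimator` partitions the data internally is not shown in this notebook, so the consecutive slices used here are an assumption, and `split_points` and `subsets` are illustrative names.

```python
import numpy as np

size = 2000
subset_sizes = [100] * 20
assert sum(subset_sizes) == size, "subset sizes must add up to the data size"

data = np.arange(size)  # placeholder values standing in for the training data

# Split positions are the cumulative sizes, without the final one.
split_points = np.cumsum(subset_sizes)[:-1]
subsets = np.split(data, split_points)

print(len(subsets), [len(s) for s in subsets[:3]])
```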

`mean_new`: The mean value of new datasets to be created. In practice, this is now the same as the mean of the training data, but this variable is defined for clarity.

In [None]:
mean_new = 0

`n_bins`: The maximum number of bins which will be used for the eCDF-vectors. The exact number will be defined by the *ecdf*'s own functions.

In [None]:
n_bins = 7

`min_dev`: The start parameter to be estimated.<br>
`max_dev`: The end parameter to be estimated.<br>

`min_dev` and `max_dev` define the interval within which all integer values are tested as candidates for the standard deviation of our training data. By default, the deviation of the data is 5 and all integers from 2 to 8 are evaluated.

In [None]:
min_dev = 2
max_dev = 8

Next, a function named `distance`, which plays a major part in the creation of the eCDF-vectors throughout the identification process, must be defined. All distances between data points in this script are computed with this function. The function takes two data points, `data_a` and `data_b`, as input arguments. It then uses *numpy*'s `abs`-function to compute the absolute value of the difference of the data points, which is the Euclidean distance between them in one dimension.

In these notebook tutorials, we use a distance function with two arguments. However, it is also possible to use a distance function with arbitrarily many arguments. If the data is one-dimensional, as in this example, it is also possible to use a distance function with only one argument that returns the same scalars without any actual distance computation. With three or more arguments, the running time of the script will be quite long.

In [None]:
def distance(data_a, data_b):
  return np.abs(data_a-data_b)
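Because `distance` is built on *numpy* operations, it also works on whole arrays. The sketch below uses broadcasting to evaluate all pairwise distances between two small, made-up subsets at once. Whether `ecdf-estimator` calls the function in this way is an assumption; the example only illustrates what "all distances between two subsets" means.

```python
import numpy as np

def distance(data_a, data_b):
  return np.abs(data_a - data_b)

subset_a = np.array([0.0, 1.0, 3.0])
subset_b = np.array([1.0, 2.0, 5.0, -1.0])

# Broadcasting a column against a row evaluates every pair at once.
pairwise = distance(subset_a[:, None], subset_b[None, :])

print(pairwise.shape)  # (3, 4): one row per point of subset_a
print(pairwise[2])     # distances from 3.0 to each point of subset_b
```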

### Generation of Objective Functions and eCDF-vectors

Next, the eCDF-vectors have to be created and plotted. This is done by initializing a new instance of the *estimator*'s `standard` class. Because the vectors are plotted both with a large number of bins and with a smaller number of bins, they must be created twice.

The *ecdf-estimator*'s `estimate_radii_values` is called to determine appropriate region values for the bin values. The values are selected based on the computed distances between data points of the first two subsets of the training data. The function takes the whole training data, subset sizes and the `distance`-function, which computes all the distances between data points, as input parameters. In the function, the distances between first two subsets are computed. The returned region values are stored to `min_val` and `max_val` and the distances between data points to the matrix `distance_data`.

In [None]:
min_val, max_val, distance_data = ecdf.estimate_radii_values(data, subset_sizes, distance)

Then, the interval between `min_val` and `max_val` is divided into 50 equally spaced bin values using *numpy*'s `linspace`-function. It takes the end points of the interval and the number of equally spaced values (end points included) as input arguments and returns the values as an array, now named `bins`.

In [None]:
bins = np.linspace(min_val, max_val, 50)

The objective function is assembled by initializing a new instance of the *estimator*'s `standard` class. `data`, `bins`, the `distance`-function and `subset_sizes` are given as input arguments. The distances between all subsets of the training data are calculated: for each possible pair of subsets, every distance between a data point of one subset and a data point of the other is computed with the `distance`-function. Then, the eCDF-vector of each subset pair is created by cumulatively counting how many of these distances fall into each of the 50 bins. The mean and covariance of all vectors are also computed. All these values are stored in `aux_func`, which specifies the first objective function with all 50 bins. This function is only used for plotting and is therefore named an auxiliary function.

In [None]:
aux_func = ecdf.standard(data, bins, distance, subset_sizes)
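To make the idea of an eCDF-vector concrete, the following sketch computes one such vector for a single pair of subsets using *numpy* only. It is a simplified illustration, not the code of `ecdf-estimator`: the helper `ecdf_vector`, the normalization to fractions, the use of `<=`, and the interval from 0 to 20 are all assumptions made for this example.

```python
import numpy as np

def distance(data_a, data_b):
  return np.abs(data_a - data_b)

def ecdf_vector(subset_a, subset_b, bins):
  """Hypothetical helper: fraction of pairwise distances not exceeding each bin value."""
  pairwise = distance(subset_a[:, None], subset_b[None, :]).ravel()
  return np.array([(pairwise <= r).mean() for r in bins])

rng = np.random.default_rng(seed=2)  # seeded for reproducibility (assumption)
subset_a = rng.normal(0, 5, 100)
subset_b = rng.normal(0, 5, 100)

bins = np.linspace(0.0, 20.0, 50)  # illustrative interval, not the estimated one
vec = ecdf_vector(subset_a, subset_b, bins)

print(vec[:5])
```

Each entry of `vec` is a cumulative fraction, so the vector is non-decreasing and bounded by 1, which is why neighbouring entries are strongly correlated when many bins are used.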

Then, the second function, with a smaller number of bins, must be defined. Otherwise, there could be unwanted correlation between neighbouring bin values. The *estimator*'s `choose_bins`-function is called to select reasonable bin values from the larger set. Note that the previously selected 50 bin values, `bins`, are given as an input argument and overwritten with a new list of at most `n_bins` values. The first input argument is the matrix `distance_data`, which was computed earlier.

In [None]:
bins = ecdf.choose_bins(distance_data, bins, n_bins)

Now, the second objective function `func` can be defined. The only difference compared to the creation of `aux_func` is that this time only the smaller number of bins is used to create the eCDF-vectors and compute their statistics. This function is used both for plotting and for the evaluation of different standard deviation parameters.

In [None]:
func = ecdf.standard(data, bins, distance, subset_sizes)

After the two objective functions have been created, they can be plotted. First, the figure `fig` and its axes `ax` have to be defined using *matplotlib.pyplot*'s `subplots`-function. The dimensions of the figure are defined with the input argument `figsize=(12, 4)`. Then, the plots are made using the *estimator*'s functions, which take three input arguments: the objective function containing the eCDF-vectors and their statistics, the axes of the figure to plot on, and the plotting style (colour and marker shape, here small points). The vectors with 50 bins are plotted in magenta (`'m.'`), with `aux_func` as an input argument. The ones with the selected bins only are plotted in cyan (`'c.'`) and their mean values in black (`'k.'`). Then, the title and labels are defined. The `plt.show()`-command ensures that the figure is displayed properly without unwanted prints.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ecdf.plot_ecdf_vectors(aux_func, ax, 'm.')
ecdf.plot_ecdf_vectors(func, ax, 'c.')
ecdf.plot_mean_vector(func, ax, 'k.')
plt.title("Distribution of the eCDF-vectors")
plt.xlabel("Bin Values")
plt.ylabel("Cumulative Probability")
plt.show()

### $\chi^2$ test: Checking normality of eCDF-vectors

The eCDF-vectors should be Gaussian if the sample size of the training data is large enough. This is checked using the $\chi^2$-test, which is done by calling the *estimator*'s `plot_chi2_test`-function. The function computes the negative log-likelihood value of each eCDF-vector to examine how well each vector fits the model of the objective function `func`. Since the vectors are assumed to follow a multivariate normal distribution, the log-likelihood values can be computed with the help of the mean and covariance values of the objective function. The log-likelihood values are then normalized, and a histogram of them is created and plotted. After that, the probability density function of the chi-square distribution with an appropriate degree of freedom (which is the number of the eCDF vectors) is plotted on top of the histogram. If the histogram fits the density function well, the normality of the eCDF-vectors is confirmed and the parameter estimation process should be valid and reliable.

Finally, the title and labels are defined and the correct display of the figure with plots is ensured.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ecdf.plot_chi2_test(func, ax)
plt.title("Gaussianity Test by Chi-squared Criterion")
plt.xlabel("Normalized Log-likelihood")
plt.ylabel("Probability Density")
plt.show()

### Evaluation of Deviation Parameters

After ensuring the reliability of the identification process, the evaluation of different standard deviation parameters begins. First, a list `devs`, which includes all the deviation parameters to be evaluated, is created. The command `range(min_dev, max_dev + 1)` defines the correct range of integer values based on the end points, and `list` creates a list of the values. Then, the number of parameters is stored in `n_devs` by computing the length of the list.



In [None]:
devs = list(range(min_dev, max_dev + 1))
n_devs = len(devs)

Next, a list of lists, or a matrix, `values`, is initialized. When referring to a list of lists as a matrix, one list can be thought of as a row of the matrix. The list comprehension creates a new empty list `n_devs` times, one for each standard deviation parameter. Note that the `range`-command starts from 0 if only one input argument is given. The matrix will store objective function values, which are negative log-likelihood values, for the datasets created for each deviation parameter. One row of the matrix will contain the values for one deviation parameter. A single value of a matrix can be selected with the command `matrix[i][j]`, `i` being the row index and `j` the column index of the matrix. As mentioned, indexing starts from 0.

List `means_log` is initialized as zeros with a length of `n_devs`, and will store mean log-likelihood values for each deviation parameter.

In [None]:
values = [[] for i in range(n_devs)]
means_log = [0.] * n_devs
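The list comprehension matters here. Multiplying a list of lists, as is done for the floats in `means_log`, would repeat references to one and the same inner list, so appending to one row would change all rows. The short sketch below, with the illustrative names `shared` and `separate`, shows the difference. For `means_log` the multiplication is safe because floats are immutable.

```python
n_devs = 3

shared = [[]] * n_devs                  # three references to ONE inner list
separate = [[] for _ in range(n_devs)]  # three independent inner lists

shared[0].append(1.0)
separate[0].append(1.0)

print(shared)    # [[1.0], [1.0], [1.0]]
print(separate)  # [[1.0], [], []]
```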

#### Negative Log-likelihood Analysis

The creation and evaluation of new datasets now starts, and the negative log-likelihood values are computed. For each deviation parameter, `n_subsets` new datasets are created. Thus, the outer loop iterates `n_devs` times and the inner loop `n_subsets` times. This way, it is easy to refer to the current deviation parameter using `devs[i]` or to the current subset size using `subset_sizes[j]`. In each iteration, a new dataset `newdata` with the current deviation parameter `devs[i]` is created using *numpy*'s `random.normal`-function, just like in the creation of the training data. The size of a new dataset is the same as the size of the similarly indexed subset of the training data, `subset_sizes[j]`.

Then, the *estimator*'s `evaluate`-function is called to evaluate the new dataset with the objective function `func`. First, one subset of the training data is selected and the distances between each point of the training data subset and the new dataset are computed. Again, an eCDF-vector of the distances is created, similarly to what was done before with the training data subsets. Then, the negative log-likelihood value of the new eCDF-vector is calculated to examine how well it fits the objective function created with the training data. This value is the one the function returns, and it is stored in the matrix `values` by appending it to the end of row `i`, the row of the current deviation parameter, with `values[i].append()`. When the inner loop is exited, the negative log-likelihood values for all datasets with one deviation parameter have been computed. Thus, the mean value of that row is computed with *numpy*'s `mean`-function and stored in the list `means_log`.

In [None]:
for i in range(n_devs):
  for j in range(n_subsets):
    newdata = np.random.normal(mean_new, devs[i], subset_sizes[j])
    values[i].append(ecdf.evaluate(func, newdata))
  means_log[i] = np.mean(values[i])
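The exact formula used inside `ecdf.evaluate` is not shown in this notebook. Assuming Gaussian eCDF-vectors with mean vector $\mu$ and covariance matrix $\Sigma$, a common form of the negative log-likelihood, up to constants, is the quadratic form $(v - \mu)^T \Sigma^{-1} (v - \mu)$. The sketch below implements this form with a hypothetical helper `neg_log_likelihood` and made-up values for the mean vector and covariance. It is an illustration under that assumption, not the library's implementation.

```python
import numpy as np

def neg_log_likelihood(vec, mean_vec, cov):
  """Hypothetical helper: Gaussian quadratic form (v - mu)^T Sigma^{-1} (v - mu)."""
  diff = np.asarray(vec) - np.asarray(mean_vec)
  # Solving the linear system avoids forming the inverse covariance explicitly.
  return float(diff @ np.linalg.solve(cov, diff))

mean_vec = np.array([0.2, 0.5, 0.8])  # made-up mean eCDF-vector
cov = np.diag([0.01, 0.04, 0.01])     # made-up covariance matrix

at_mean = neg_log_likelihood(mean_vec, mean_vec, cov)
shifted = neg_log_likelihood([0.3, 0.5, 0.8], mean_vec, cov)

print(at_mean)  # 0.0: a vector equal to the mean fits perfectly
print(shifted)  # approximately 1, since 0.1**2 / 0.01 = 1
```

Under this assumption, smaller values mean a better fit, which matches the interpretation of the plot that follows.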

The evaluated negative log-likelihood values for all datasets and their means are plotted. Again, the figure and axes are defined. This time, the plots are made using *matplotlib.pyplot*'s built-in `plot`-function, which takes the x and y coordinates as input arguments together with a plotting style. Every dataset's value is plotted in red and the means for each deviation parameter in blue. The smaller the mean value for a certain parameter, the better the parameter fits the training data.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(devs, values, 'ro')
ax.plot(devs, means_log, 'bo')
plt.title("Evaluation of the Negative Log-likelihood Values")
plt.xlabel("Deviation Parameter")
plt.ylabel("Log-likelihood Value")
plt.show()

#### Normalization of Log-likelihood Values

The final step of the process is the normalization of the log-likelihood values. First, each value is multiplied by -0.5 and then exponentiated. This operation is suitable because it turns the smallest negative log-likelihood values, which indicate the best fits to the training data, into the biggest values. The operation is done with *numpy*'s `exp`-function. The previous nested loops are now replaced with a list comprehension, which tends to be faster in Python. The inner loop over the subsets has to be written before the outer loop over the parameters, and square brackets have to be used between the loops, so that the operation is applied correctly to each element of the matrix.

In [None]:
values = [[np.exp(-0.5*values[i][j]) for j in range(n_subsets)] for i in range(n_devs)]

Then, the sum of each column, which contains one likelihood value for each deviation parameter, is computed and stored in `sums`. This is done with *numpy*'s `sum`-function. With `axis=0` as the second input argument, it computes the column sums and returns them as an array, which is what is wanted here. Without the second input argument it would compute the sum of all elements, and with `axis=1` it would compute the row sums.

In [None]:
sums = np.sum(values, axis=0)
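The effect of the `axis` argument is easiest to see on a small example. The matrix below is made up for illustration and has two rows and three columns.

```python
import numpy as np

matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

total = np.sum(matrix)             # sum of all elements: 21.0
col_sums = np.sum(matrix, axis=0)  # column sums: 5, 7, 9
row_sums = np.sum(matrix, axis=1)  # row sums: 6, 15

print(total, col_sums, row_sums)
```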

After that, each value in `values` is divided by the sum of its column to obtain, within each column, the normalized likelihood of each deviation parameter fitting the training data. Basically, one dataset is selected for each parameter, and the likelihood value for each dataset is computed. This is done `n_subsets` times, so that all datasets are used. The calculation is done using a list comprehension as before.

In [None]:
values = [[values[i][j] / sums[j] for j in range(n_subsets)] for i in range(n_devs)]

The means of each row, which are the means of the datasets' likelihood values for each parameter, are computed. The means are computed with *numpy*'s `mean`-function, where every row of the matrix is put as an input argument, using list comprehension. The mean values are stored to list `means_nor`. Note the difference compared to previous list comprehensions when now a list is used instead of a matrix.

In [None]:
means_nor = [np.mean(values[i]) for i in range(n_devs)]
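The three normalization steps can also be written with array operations instead of list comprehensions. The sketch below is an alternative formulation under stated assumptions: the negative log-likelihood values in `nll` are random placeholders rather than results of `ecdf.evaluate`, and selecting the parameter with the largest mean via `np.argmax` is an illustrative decision rule that the notebook itself does not prescribe.

```python
import numpy as np

devs = [2, 3, 4, 5, 6, 7, 8]
rng = np.random.default_rng(seed=3)  # seeded for reproducibility (assumption)

# Placeholder negative log-likelihood values, shape (n_devs, n_subsets).
nll = rng.uniform(1.0, 30.0, size=(len(devs), 20))

likelihood = np.exp(-0.5 * nll)
normalized = likelihood / likelihood.sum(axis=0)  # each column now sums to 1
means_nor = normalized.mean(axis=1)               # one mean per deviation parameter

best_dev = devs[int(np.argmax(means_nor))]
print(best_dev, means_nor.round(3))
```

Because every column sums to 1, the entries of `means_nor` also sum to 1, so they can be read as relative weights of the candidate parameters.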

Finally, the normalized likelihood values (red) and the means of them (blue) for each deviation parameter are plotted as before, and the goodness of each parameter can be estimated.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(devs, values, 'ro')
ax.plot(devs, means_nor, 'bo')
plt.title("Normalized Likelihood Values and Average Values Over All Evaluations")
plt.xlabel("Deviation Parameter")
plt.ylabel("Likelihood")
plt.show()