<a target="_blank" href="https://colab.research.google.com/github/AndreasRupp/ecdf_estimator_examples/blob/bsc_thesis/tutorial/03-deviation-of-normal-distribution-synthetic-likelihood.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Parameter Identification for Normal Distribution Data - Synthetic Likelihood Method

This notebook introduces a synthetic likelihood parameter identification method for identifying deviation parameters in simple normal distribution data. The approach is based on the empirical cumulative distribution function (eCDF).

When the eCDF standard and bootstrap methods are used in the usual way, the objective function is built from the training data, and new data generated with different parameters is then evaluated with that same function. With synthetic likelihood, the roles are reversed: an objective function is built from each new dataset generated with a candidate parameter, and the training data, whose deviation is to be identified, is then evaluated with these objective functions.

The implementation of the method using Python and the `ecdf-estimator` package is explained step by step in this notebook.

## Setup of the Process

### Environment Setup


First, `ecdf-estimator` is installed.

In [None]:
pip install --ignore-requires-python ecdf-estimator

After the installation, the estimator is imported so that its functions can be used. It is abbreviated as `ecdf`, which makes the code shorter to write and easier to read.

In [None]:
import ecdf_estimator as ecdf

Next, the other two necessary modules, `numpy` and `matplotlib.pyplot`, are imported. The modules are used to do array operations efficiently and to visualize the data and results.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

### Creation of Training Data

Next, normally distributed training data must be created. First, the variables which are used to create the data are defined:

* `mean`: The mean value of training data.<br>
* `dev`: The standard deviation of the training data, which is the parameter to be identified.<br>
* `size`: The number of created data points.<br>

In [None]:
mean = 0
dev = 5
size = 1000

After the parameters are defined, the normally distributed training data can be created using them and `numpy`'s `random.normal` function.

In [None]:
data = np.random.normal(mean, dev, size)
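
Because the training data is random, every run of the notebook gives slightly different results. The following is a minimal sketch of how the same kind of data could be drawn reproducibly with `numpy`'s `default_rng` generator. The seed value `42` and the names `rng`, `data_seeded`, and `sample_dev` are assumptions made for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

# Assumed seed value; any fixed integer makes the run repeatable.
rng = np.random.default_rng(seed=42)

mean, dev, size = 0, 5, 1000
data_seeded = rng.normal(mean, dev, size)

# The sample standard deviation should be close to dev, but not exactly equal to it.
sample_dev = data_seeded.std(ddof=1)
print(f"sample standard deviation: {sample_dev:.3f}")
```

With 1000 data points the sample standard deviation typically deviates from `dev` by a few percent, which gives a rough idea of the precision that can be expected from the identification.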

## Parameter Identification Process


After the creation of training data, the actual parameter identification begins.

### Definition of Variables

The variables which will be used in the process are defined.

* `ecdf_type`: Defines the identification method used. The `standard` method is used by default, but `bootstrap` is also possible. With bootstrapping in particular, smaller training data can be used to keep the computation efficient.

In [None]:
ecdf_type = "standard"
#ecdf_type = "bootstrap"

* `subset_sizes`: The list of subset sizes of the new datasets which are created later. Note that the sum of the subset sizes also defines how many values of the training data are used in the evaluation process: if the training data is larger than this sum, the remaining values are ignored. In bootstrapping, only the first two entries of the list define the sizes of the subsets chosen in the bootstrapping process. However, the length of the list still defines how many subsets are used for each parameter tried, so it can be larger than 2 for bootstrapping as well to get more reliable results.

In [None]:
subset_sizes = [100]*10

* `n_newsets`: Defines how many subsets are used for each deviation parameter. The new data is split into this many subsets, and the same number of training data subsets is evaluated with the objective function.

In [None]:
n_newsets = len(subset_sizes)

* `mean_new`: The mean value of new datasets to be created. In practice, this is now the same as the mean of the training data, but this variable is defined for clarity.

In [None]:
mean_new = 0

* `n_bins`: The maximum number of bins which will be used for the eCDF-vectors. The exact number will be defined by the `ecdf`'s own functions.

In [None]:
n_bins = 7

- `min_dev`: The smallest deviation value to be tested.<br>
- `max_dev`: The largest deviation value to be tested.<br>

`min_dev` and `max_dev` define the closed interval whose integer values are tested as candidates for the standard deviation of the training data.

In [None]:
min_dev = 3
max_dev = 9

- `devs`: A list of all the deviation parameters which will be evaluated.
- `n_devs`: The number of parameters to be tried is defined.

In [None]:
devs = list(range(min_dev, max_dev + 1))
n_devs = len(devs)
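
To make the candidate grid concrete, the sketch below prints the integer candidates produced by the values above. It also shows a hypothetical finer grid built with `numpy`'s `linspace`; the name `devs_fine` and the step of 0.5 are assumptions for illustration only, and the rest of this notebook keeps using the integer list `devs`.

```python
import numpy as np

min_dev, max_dev = 3, 9

# Integer candidates, both endpoints included.
devs = list(range(min_dev, max_dev + 1))
print(devs)  # [3, 4, 5, 6, 7, 8, 9]

# Hypothetical finer grid: 13 equally spaced values from 3 to 9, i.e. a step of 0.5.
devs_fine = np.linspace(min_dev, max_dev, 13)
print(devs_fine)
```

A finer grid increases the resolution of the estimate, but the number of objective functions to assemble grows with the number of candidates.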

Next, a function named `distance`, which plays a major part in the creation of the eCDF vectors throughout the identification process, must be defined. The function computes the absolute value of the difference between two data points, which for one-dimensional data is the Euclidean distance between the points.

In [None]:
def distance(data_a, data_b):
  return np.abs(data_a-data_b)
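
As an illustration of how such a distance function can lead to an eCDF vector, the sketch below computes all pairwise distances between two tiny subsets and then the fraction of distances that do not exceed each of a few radii. This is a simplified sketch and not the package's internal code: the broadcasting call, the use of `<=`, and the names `subset_a`, `subset_b`, `radii`, and `ecdf_vector` are assumptions made for illustration.

```python
import numpy as np

def distance(data_a, data_b):
    # Absolute difference, i.e. the Euclidean distance in one dimension.
    return np.abs(data_a - data_b)

subset_a = np.array([0.0, 1.0, 4.0])
subset_b = np.array([2.0, 3.0])

# Broadcasting yields a 3 x 2 matrix: rows index subset_a, columns index subset_b.
pairwise = distance(subset_a[:, None], subset_b[None, :])
print(pairwise)

# Fraction of all pairs whose distance is at most the given radius.
radii = np.array([1.0, 2.0, 3.0])
ecdf_vector = np.array([np.mean(pairwise <= r) for r in radii])
print(ecdf_vector)
```

The resulting vector is non-decreasing and reaches 1 once the radius covers the largest distance, which is why only a limited range of radii carries useful information.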

### Evaluation of Deviation Parameters

Then, the evaluation of different standard deviation parameters begins. First, a list of lists, or a matrix, `values`, is initialized. The matrix will store objective function values, which are negative log-likelihood values, for datasets created for each deviation parameter. One row of the matrix will contain the values for one deviation parameter.

A list `means_log` is initialized with zeros and a length of `n_devs`. It will store the mean negative log-likelihood value for each deviation parameter.

In [None]:
values = [[] for i in range(n_devs)]
means_log = [0.] * n_devs

#### Negative Log-likelihood Analysis

The actual evaluation is done in a for loop which goes through each deviation parameter. The process is the same as in previous notebooks, except that, as mentioned, an objective function is created for each new dataset because synthetic likelihood is used. For each parameter, a new dataset `newdata` is created, and its size is the sum of the `subset_sizes` variable.

`ecdf-estimator`'s `estimate_radii_values` function is called to determine an appropriate range for the bin values. The range is selected based on the computed distances between the data points of the first two subsets of the new data, which is divided into subsets according to the `subset_sizes` variable. The returned range limits are stored in `min_val` and `max_val`, and the distances between the data points in the matrix `distance_data`.

Then, the interval between `min_val` and `max_val` is covered with 50 equally spaced candidate bin values using `numpy`'s `linspace` function. Note that if the eCDF vectors were to be plotted, this would be the place to do it, by using the `bins` variable and creating an auxiliary function. The plotting could then be done as demonstrated in previous notebooks. The normality of the eCDF vectors should also be checked by computing and plotting the $\chi^2$ test with `ecdf`'s functions. Plotting and normality checking are skipped here, because it is not reasonable to repeat them `n_devs` times, once for each parameter tried.

The objective function which is used to evaluate the training data must be defined with a smaller number of bins. Otherwise there could be unwanted correlation between neighbouring bin values. `ecdf`'s `choose_bins` function is called to select reasonable bin values from the larger set of candidates, and the variable `bins` is overwritten with these values.

The objective function `func` is assembled for each parameter by creating a new instance of `ecdf`'s `standard` or `bootstrap` class with the new dataset `newdata`. Then, in an inner for loop, each subset of the training data is evaluated with that function by calling `ecdf`'s `evaluate` function. The expression `data[sum(subset_sizes[0:j]):sum(subset_sizes[0:j+1])]` selects the subset of the training data to be evaluated. If the training data is larger than the sum of the `subset_sizes` variable, not all training data values are used in the identification process. The returned negative log-likelihood values are stored in the matrix `values`. After the inner loop, the mean of the negative log-likelihood values for the current deviation parameter is computed and stored in the list `means_log`.

In [None]:
for i in range(n_devs):
  newdata = np.random.normal(mean_new, devs[i], np.sum(subset_sizes))
  min_val, max_val, distance_data = ecdf.estimate_radii_values(newdata, subset_sizes, distance)

  bins = np.linspace(min_val, max_val, 50)
  bins = ecdf.choose_bins(distance_data, bins, n_bins)

  if (ecdf_type == "standard"):
    func = ecdf.standard(newdata, bins, distance, subset_sizes)
  elif (ecdf_type == "bootstrap"):
    func = ecdf.bootstrap(newdata, bins, distance, subset_sizes[0], subset_sizes[1])
  else:
    print("Invalid eCDF type.")
    break

  for j in range(n_newsets):
    values[i].append(ecdf.evaluate(func, data[sum(subset_sizes[0:j]):sum(subset_sizes[0:j+1])]))

  means_log[i] = np.mean(values[i])
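
The slicing expression used in the inner loop recomputes partial sums of `subset_sizes` for every subset. As a sketch of an equivalent formulation, the start and end offsets can be computed once with `numpy`'s `cumsum`. The stand-in array `data_example` and the names `offsets`, `subsets`, and `n_unused` are assumptions for illustration and do not appear in the loop above.

```python
import numpy as np

subset_sizes = [100] * 10

# Stand-in for training data that is larger than the sum of subset_sizes.
data_example = np.arange(1200)

# offsets[j] is where subset j starts, offsets[j + 1] is where it ends.
offsets = np.concatenate(([0], np.cumsum(subset_sizes)))
subsets = [data_example[offsets[j]:offsets[j + 1]] for j in range(len(subset_sizes))]

# Training values beyond the last offset are never evaluated.
n_unused = len(data_example) - offsets[-1]
print(f"{len(subsets)} subsets, {n_unused} training values unused")
```

Both formulations select the same values; the offset version only makes it easier to see how many training values are left out.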

The evaluated negative log-likelihood values for all datasets and their means are plotted using `matplotlib.pyplot`'s `plot` function. Every dataset's value is plotted in red and the mean for each deviation parameter in blue. The smaller the mean value is for a certain parameter, the better the parameter fits the training data.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))

ax.plot(devs, values, 'ro')
ax.plot(devs, means_log, 'bo')
plt.title("Evaluation of the Negative Log-likelihood Values")
plt.xlabel("Deviation Parameter")
plt.ylabel("Negative Log-likelihood Value")
plt.show()

#### Normalization of Log-likelihood Values

The final step of the process is the normalization of the log-likelihood values. First, each value is multiplied by -0.5 and then exponentiated. This operation is suitable because it turns the smallest negative log-likelihood values, which correspond to the best fits to the training data, into the largest values.

In [None]:
values = [[np.exp(-0.5*values[i][j]) for j in range(n_newsets)] for i in range(n_devs)]

Then, the sum of each column, which contains one exponentiated value for each deviation parameter, is computed and stored in the list `sums`.

In [None]:
sums = np.sum(values, axis=0)

After that, each value in `values` is divided by the sum of its column to get the likelihood of each deviation parameter in the column to fit the training data. In other words, each column corresponds to one subset of the training data, and the normalized values in it sum to one over the deviation parameters.

In [None]:
values = [[values[i][j] / sums[j] for j in range(n_newsets)] for i in range(n_devs)]

The means of each row, which are the means of the likelihood values over all datasets for each parameter, are computed and stored in the list `means_nor`.

In [None]:
means_nor = [np.mean(values[i]) for i in range(n_devs)]
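
The normalization steps above can also be written with `numpy` array operations. The sketch below uses a small matrix of hypothetical negative log-likelihood values (not results of this notebook) and additionally subtracts each column's minimum before exponentiating. This shift is an assumption added here as a guard against numerical underflow for large values; it multiplies every entry of a column by the same factor, so it cancels in the division and leaves the normalized result unchanged.

```python
import numpy as np

# Hypothetical values: 3 deviation parameters (rows) x 2 evaluated subsets (columns).
nll = np.array([[4.0, 6.0],
                [2.0, 2.0],
                [8.0, 4.0]])

# Shift each column so that its smallest value becomes zero, then exponentiate.
shifted = nll - nll.min(axis=0, keepdims=True)
weights = np.exp(-0.5 * shifted)

# Divide each column by its sum so that it sums to one over the parameters.
normalized = weights / weights.sum(axis=0, keepdims=True)

# Average over the columns to get one value per deviation parameter.
means_nor_example = normalized.mean(axis=1)
print(means_nor_example)
```

In this toy matrix the second row has the smallest negative log-likelihood in both columns, so it also receives the largest averaged likelihood.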

Finally, the normalized likelihood values (red) and the means of them (blue) for each deviation parameter are plotted as before, and the goodness of each parameter can be estimated.

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))

ax.plot(devs, values, 'ro')
ax.plot(devs, means_nor, 'bo')

plt.title("Normalized Likelihood Values and Average Values Over All Evaluations")
plt.xlabel("Deviation Parameter")
plt.ylabel("Likelihood")
plt.show()
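
Besides reading the result from the plot, the most likely deviation parameter can be picked programmatically as the one with the largest averaged likelihood. The following sketch uses hypothetical averaged values that are not results of this notebook; the names `means_nor_example`, `best_index`, and `best_dev` are assumptions for illustration.

```python
import numpy as np

devs = list(range(3, 10))

# Hypothetical averaged likelihoods, one per entry of devs (not real results).
means_nor_example = np.array([0.02, 0.15, 0.40, 0.25, 0.10, 0.05, 0.03])

# The index of the largest averaged likelihood selects the parameter.
best_index = int(np.argmax(means_nor_example))
best_dev = devs[best_index]
print(f"most likely deviation parameter: {best_dev}")  # most likely deviation parameter: 5
```

In the notebook itself, the same two lines could be applied to `means_nor` and `devs`. Since the new datasets are random, the selected parameter may vary between runs when neighbouring candidates have similar likelihoods.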