In [None]:
%%R
for_pdf <- TRUE
for_html <- FALSE

In [None]:
%%R


In [None]:
from preamble import *
plt.close('all')
codpy_param = {'rescale:xmax': 1000,
'rescale:seed':42,
'sharp_discrepancy:xmax':1000,
'sharp_discrepancy:seed':30,
'sharp_discrepancy:itermax':5,
'discrepancy:xmax':500,
'discrepancy:ymax':500,
'discrepancy:zmax':500,
'discrepancy:nmax':2000}

# Brief overview of methods of machine learning


## A framework for machine learning 

### Prediction machine for supervised/unsupervised machine learning

Machine learning methods can be roughly split into two main approaches: unsupervised and supervised methods. Both can be described in a general framework, referred to here as a **prediction machine**. In short, a predictor, denoted by $\mathcal{P}_m$, is an extrapolation or interpolation procedure, described by an operator 
\begin{equation} 
  f_z = \mathcal{P}_{m}(x,y=[],z=x,f(x)=[]).
  (\#eq:Pm)
\end{equation}
Python notation is used here and the brackets mean that the variables $y, z, f(x)$ are optional input data.

* The choice of the method is indicated by the subscript $m$. Each method relies on a set of **external parameters**. Fine-tuning these parameters is often cumbersome and is a source of error; some strategies in the literature even rely on a learning machine to determine them. No performance indicator is provided for this parameter-tuning step, which is an issue to take into account in applications before selecting a particular method.

* The input data $x, y, z, f(x)$ are as follows. 
  * The only non-optional parameter is the variable $x \in \mathbb{R}^{N_x \times D}$, called the **training set**. The parameter $D$ is usually referred to as the **total number of features**.
  * The variable $f(x) \in \mathbb{R}^{N_x \times D_f}$ is called the **training set values**, while the parameter $D_f$ is the **number of training features**.
  * The variable $z \in \mathbb{R}^{N_z \times D}$ is called the **test set**. If it is not specified, we tacitly assume that $z=x$.
  * The variable $y \in \mathbb{R}^{N_y \times D}$ is called the **internal parameter set**\footnote{also called weight set in neural network theory} and is necessary in order to define $\mathcal{P}_m$.
* The output data are as follows: 
  * **Supervised learning**: this corresponds to choosing the input function values $f(x)$ and we then write 
$$  
f_z = \mathcal{P}_m(x,y=[],z=x,f(x)), (\#eq:Pms)
$$
  where the values $f_z \in \mathbb{R}^{N_z \times D_f}$ are called a **prediction**.
We distinguish between two cases:
    * If the input data $y$ is left empty, then the prediction machine \@ref(eq:Pm) is called a **feed-backward machine**. In this case, the method computes this set internally and then determines $f_z$.
    * If $y$ is specified as input data, then the prediction machine \@ref(eq:Pm) is referred to as a **feed-forward machine**. In this case, the method uses the given set of internal parameters to compute the prediction $f_z$.
  * **Unsupervised learning**: we may also choose 
\begin{equation} 
  f_z = \mathcal{P}_m(x,z=x), (\#eq:Pmu)
\end{equation}
where the output values $f_z \in \mathbb{R}^{N_z \times D}$ are sometimes called **clusters** for the so-called clustering methods (described later on).
  
Other machine learning methods can be described with the same notation. For instance, two methods $m_1,m_2$ being given, the following composition describes a feed-backward machine, which is quite close to the definition of **semi-supervised learning** in the literature and also encompasses feed-backward learning machines: 
$$ 
  f_z = \mathcal{P}_{m_1}(x, \mathcal{P}_{m_2}(x,f(x)),z,f(x)), (\#eq:Pmsu)
$$
We summarize our main notation in Table \@ref(tab:mainnotations). The sizes of the input data, that is, the integers $D, N_x, N_y, N_z, D_f$, are also considered as input parameters. The distinction between supervised and unsupervised learning is a matter of having, or not, optional input data and the correspondence will be clarified in the rest of this chapter.

In [None]:
%%R
summary = data.frame(
  stringsAsFactors = FALSE,
       check.names = FALSE,
                 x = c("training set","size Nx * D"),
                 y = c("parameter set","size Ny * D"),
                 z = c("test set","size Nz * D"),
            `f(x)` = c("training values","size Nx * Df"),
           `fz` = c("predictions", "size Nz * Df")
)
knitr::kable(summary, label = "mainnotations", caption = "Main parameters for machine learning")

| $x$           |  $y$           | $z$       |          $f(x)$  |        $f_z$ |
|---            |          ---   |---        |---      |---     |
| training set  | parameter set  |  test set | training values  | predictions  |
| size $N_x \times D$ | size $N_y \times D$ | size $N_z \times D$ | size $N_x \times D_f$ | size $N_z \times D_f$ |

Table:  (\#tab:mainnotations) Main parameters for machine learning
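
To make this notation concrete, the following is a minimal sketch of a prediction machine with the optional arguments of \@ref(eq:Pm). The function name `prediction_machine` and the nearest-neighbour rule used as method $m$ are illustrative assumptions and are not the CodPy implementation; only the handling of the optional inputs $y, z, f(x)$ mirrors the text.

```python
import numpy as np

def prediction_machine(x, y=None, z=None, fx=None):
    """Toy predictor f_z = P_m(x, y, z, f(x)) with m = 1-nearest neighbour.

    Hypothetical illustration of the optional arguments only.
    """
    x = np.asarray(x, dtype=float)
    z = x if z is None else np.asarray(z, dtype=float)      # default: z = x
    if fx is not None:
        # Supervised case: copy the training value of the nearest training point.
        fx = np.asarray(fx, dtype=float)
        d = ((z[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)   # shape (Nz, Nx)
        return fx[d.argmin(axis=1)]                               # shape (Nz, Df)
    # Unsupervised case: attach each test point to its nearest parameter point.
    # With y left empty we simply fall back to y = x (a trivial internal choice).
    y = x if y is None else np.asarray(y, dtype=float)
    d = ((z[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)       # shape (Nz, Ny)
    return d.argmin(axis=1)                                       # shape (Nz,)

x = np.array([[0.0], [1.0], [2.0]])        # Nx = 3, D = 1
fx = np.array([[0.0], [10.0], [20.0]])     # Df = 1
z = np.array([[0.1], [1.9]])               # Nz = 2
f_z = prediction_machine(x, z=z, fx=fx)    # supervised prediction
labels = prediction_machine(x, y=np.array([[0.0], [2.0]]), z=z)  # cluster indices
```

In the supervised call the output has size $N_z \times D_f$, while in the unsupervised call it is a vector of $N_z$ indices into the parameter set $y$.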


### Techniques of supervised learning 

Supervised learning \@ref(eq:Pms) corresponds to the case where the function values $f(x)$ are part of the input data:
\begin{equation} 
  f_z = \mathcal{P}_m(x,y=[],z=x,f(x)).
\end{equation}
Supervised learning can be best understood as a simple extrapolation procedure: from historical observations of a given function $x, f(x)$, one wants to predict, or extrapolate, the function on a new set of values $z$.
Concerning the terminology, a method is said to be **multi-class** or multi-output if the function $f$ under consideration can be vector-valued, that is, $D_f \ge 1$ with our notations. Note that one can always stack learning machines to produce multi-class methods. However, this comes usually at a quite heavy computational cost, motivating this definition. Moreover, the input function $f$ can be

* discrete, that is, the set of unique values $f(\RR^D)$ is a discrete set, denoted $Ran(f)$. This set is referred to as the **labels**, and it can always be mapped to the integers $[1,\ldots,\#(Ran(f))]$, where $\#(E)$ denotes the number of elements, or cardinality, of a set $E$.
* continuous.
* mixed (some discrete, some continuous).
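
As a small illustration of the discrete case, the mapping from labels to the integers $[1,\ldots,\#(Ran(f))]$ can be sketched with NumPy as follows. The label names are made-up sample values, and the 1-based shift is only there to match the notation above.

```python
import numpy as np

# Hypothetical labeled values f(x), given as strings.
fx = np.array(["setosa", "virginica", "setosa", "versicolor"])

# `labels` is the sorted set Ran(f); `codes` gives the position of each value in it.
labels, codes = np.unique(fx, return_inverse=True)
codes = codes + 1      # shift to 1-based integers [1, ..., #(Ran(f))]
```

The original labels are recovered with `labels[codes - 1]`.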

A classification of existing methods for supervised learning can be found on the scikit-learn website[^200].

[^200]: [a classification of methods is available using this link](https://scikit-learn.org/stable/supervised_learning.html). 

There are

* Different families of methods: linear models, support vector machines, neural networks, ...

![](.\CodPyFigs\SMLT.png){width=50%}

* Different methods: neural networks, Gaussian processes, etc.

![](.\CodPyFigs\SML.png){width=50%}

* Different libraries: scikit-learn, Tensorflow, ...

### Techniques of unsupervised learning 

Unsupervised learning corresponds to the case where the function values $f(x)$ are not part of the input data, see \@ref(eq:Pmu):
\begin{equation} 
   \mathcal{P}_m(x, y = [] , z = x). 
\end{equation}
Unsupervised learning can be best understood as a simple interpolation procedure: from historical observations of a given distribution $x$, one wants to extract, or interpolate, $N_y$ features that best represent $x$.
The output data of a standard clustering method are the **cluster set**, denoted $y \in \mathbb{R}^{N_y\times D}$.

There are natural connections between supervised and unsupervised learning.

* In the context of semi-supervised clustering methods, the clusters $y$ are used in a supervised learning machine to produce a prediction $f_z \in\mathbb{R}^{N_z \times D_f}$, see \@ref(eq:Pmsu).
* In the context of unsupervised clustering methods, a prediction $f_z \in\mathbb{R}^{N_z}$ can also be made. This prediction attaches each point $z^i$ of the test set to the cluster set $y$, producing $f_z$ as a map $[1,\ldots,N_z] \mapsto [1,\ldots,N_y]$.

There exist several clustering methods following this approach; see, for instance, the dedicated Wikipedia page[^202].

[^202]:[link to cluster analysis Wikipedia page](https://en.wikipedia.org/wiki/Cluster_analysis). 

* Different families of methods: linear models, support vector machines, neural networks, ...

![](.\CodPyFigs\UMLT.png){width=50%}

* Different methods: neural networks, Gaussian processes, etc.

![](.\CodPyFigs\UML.png){width=50%}

* Different libraries: Scikit-learn, ...

Clustering is one family of unsupervised learning methods. The library scikit-learn offers a quite impressive list of clustering methods[^203]. We extracted the following figure and comment on it briefly to illustrate our notation.

[^203]:[link to scikit-learn clustering](https://scikit-learn.org/stable/modules/clustering.html)

![list of scikit-learn clustering methods.](CodPyFigs/scikitclustercomparaison.png)

* Each column describes a particular clustering algorithm.
* Each row describes a particular unsupervised clustering problem:
  * Each image is a scatter plot of the training set $x$ and the test set $z$, which coincide.
  * Each image color-codes the predicted values $f_z$.

## Exploratory data analysis

### Preliminaries
Exploratory data analysis plays a central role in data engineering and allows one to understand the structure of a given dataset, including its correlation and statistical properties. For instance, we can study whether a data distribution is multi-modal, skew, or discontinuous, among other features. The technique can help in many different applications and, for instance in unsupervised learning, one can produce a first guess concerning the number of possible clusters associated with a given dataset, or concerning the type of kernels one should choose before applying a kernel regression method.

To illustrate the visualization tools that we are using, we consider the Iris flower data set. This data set was introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

### Visualization based on non-parametric estimations
The density of the input data is estimated using a kernel density estimate (KDE). Let $(x^1, x^2, \dots, x^n)$ be independent and identically distributed samples, drawn from some univariate distribution with unknown density denoted by $f$ at any given point $x$. We are interested in estimating the shape of this function $f$ and the kernel density estimator is
$$
\widehat{f}_{h}(x)={\frac {1}{n}}\sum _{i=1}^{n}K_{h}(x-x^{i})={\frac {1}{nh}}\sum _{i=1}^{n}K{\Big (}{\frac {x-x^{i}}{h}}{\Big )},
$$
where $K$ is a kernel (say any non-negative function) and $h > 0$ is a smoothing parameter called the **bandwidth**. Among the range of kernels that are commonly used, we have: uniform, triangular, biweight, triweight, Epanechnikov, normal, and many others. The ability of the KDE to accurately represent the data depends on the choice of the smoothing bandwidth. An over-smoothed estimate can remove meaningful features, while an under-smoothed estimate can obscure the true shape within the random noise.


In [None]:
D,Nx,Ny,Nz= -1,-1,-1,-1
x, fx, y, fy, z, fz = iris_data_generator().get_data(D = D,Nx= Nx, Ny = Ny, Nz = Nz)
f_names = iris_data_generator().get_feature_names()
xfx = pd.DataFrame(x, index = np.reshape(fx,(len(fx))), columns = f_names)
multi_plot([xfx.T],data_plots.distribution_plot1D, f_names = f_names)
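
The kernel density estimator defined above can be sketched in a few lines of NumPy. This is a minimal illustration with the normal kernel, assuming synthetic Gaussian samples; the helper `gaussian_kde_1d` is a hypothetical name and is not the routine used by the plotting functions of this chapter.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, h):
    """Evaluate (1/(n h)) * sum_i K((x - x^i)/h), with K the standard normal density."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    u = (grid[:, None] - samples[None, :]) / h        # shape (len(grid), n)
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(0)
samples = rng.normal(size=200)                        # synthetic data (assumption)
grid = np.linspace(-8.0, 8.0, 1601)
density_smooth = gaussian_kde_1d(samples, grid, h=1.0)    # large bandwidth
density_rough = gaussian_kde_1d(samples, grid, h=0.05)    # small bandwidth
```

Both estimates integrate to one, but the small bandwidth produces a much more oscillating curve, which illustrates the under-smoothing effect mentioned above.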

### Visualization based on scatter plots

Another way to visualize data is to rely on a scatter plot, where the data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

In [None]:
multi_plot(x.T,data_plots.scatter_plot, f_names = f_names)

### Visualization based on correlation matrices

The correlation matrix of $n$ random variables $x^{1},\ldots ,x^{n}$ is the $n\times n$ matrix whose $(i,j)$ entry is $corr(x^{i},x^{j})$. Thus the diagonal entries are all identically unity. 

In [None]:
data_plots.heatmap(x, title= "Correlation matrix", f_names = f_names)
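
For readers without the plotting helpers at hand, the correlation matrix itself can be computed directly with NumPy. The sketch below uses synthetic data (an assumption, not the Iris data set) in which two features are deliberately made strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(150, 4))                 # Nx = 150 samples, D = 4 features
x[:, 1] = 2.0 * x[:, 0] + 0.1 * x[:, 1]       # features 0 and 1 strongly correlated

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(x, rowvar=False)           # shape (D, D)
```

The resulting matrix is symmetric with unit diagonal, and the entry `corr[0, 1]` is close to one.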

### Visualization based on summary plots

The summary plot visualizes the density of each feature of the data on the diagonal, the KDE plot below the diagonal, and the scatter plot above the diagonal.

In [None]:
data_plots.density_scatter(xfx)

In [None]:
scenarios_list = [ (784, 2**(5), 2**(5-2), 10000)]

## Performance indicators for machine learning

### Indicators for supervised learning

**Comparison to ground truth values**. A huge family of indicators is available in order to evaluate the performance of a learning machine, most of them being readily described and implemented in scikit-learn[^204].

[^204]:[link to scikit-learn metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics). 

We do not discuss them all, but rather overview those that we have included in the CodPy library. First of all, in the context of supervised clustering methods, if the function $f$ is known in advance, then predictions of learning machines $f_z$ can be compared with **ground truth values**, $f(z) \in \mathbb{R}^{N_z \times D_f}$. Below we list the main metrics that are used.

* For labeled functions (i.e., discrete functions), a common indicator is the **score**, defined as
$$
    \frac{1}{N_z} \#\{ f_z^n = f(z)^n, n=1\ldots N_z\} (\#eq:score)
$$
producing an indicator between 0 and 1, the higher being the better.
* For continuous functions, a common indicator is given by the $\ell^p$ norms, defined as
$$
    \frac{1}{N_z}\| f_z - f(z) \|_{\ell^p}, \ 1 \le p \le \infty. 
$$
The case $p=2$ corresponds, up to the normalization in $N_z$, to the *root-mean-square error (RMSE)*. 

* As the above indicator is not normalized, the following version is preferred.
$$
    \frac{\| f_z - f(z)\|_{\ell^p}}{\| f_z\|_{\ell^p} +\|f(z)\|_{\ell^p}}, \ 1 \le p \le \infty. (\#eq:rmse)
$$
producing an indicator between 0 and 1, the smaller being the better, interpreted as error-percentages.
In finance, this notion is sometimes referred to as the basis point indicator.
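
The score \@ref(eq:score) and the normalized error \@ref(eq:rmse) can be sketched as follows. The function names are hypothetical and the convention used when both vectors vanish is our own choice; this is not the CodPy implementation.

```python
import numpy as np

def score(f_z, f_true):
    """Fraction of predicted labels that match the ground truth, between 0 and 1."""
    f_z, f_true = np.asarray(f_z), np.asarray(f_true)
    return float(np.mean(f_z == f_true))

def relative_lp_error(f_z, f_true, p=2):
    """||f_z - f(z)||_p / (||f_z||_p + ||f(z)||_p), between 0 and 1."""
    f_z = np.asarray(f_z, dtype=float).ravel()
    f_true = np.asarray(f_true, dtype=float).ravel()
    denom = np.linalg.norm(f_z, p) + np.linalg.norm(f_true, p)
    if denom == 0.0:
        return 0.0      # both vectors vanish: no error (convention chosen here)
    return float(np.linalg.norm(f_z - f_true, p) / denom)

acc = score([1, 2, 2, 3], [1, 2, 3, 3])                       # 3 matches out of 4
err = relative_lp_error([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

By the triangle inequality the normalized error never exceeds one, and it equals one when the prediction is the exact opposite of the ground truth.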

**Cross validation scores**. The cross validation score consists in randomly selecting a part of the training set and values as test set and values, and performing a score or RMSE-type error analysis on each run[^205].

**Confusion matrix**. This indicator, available for labeled supervised learning, is a matrix in which each row corresponds to a ground-truth label and each column to a predicted label. The confusion matrix is a quite simple and efficient tool for visualizing errors; a simple example is shown in Section \@ref(k-means-confusion-matrix). Its common form is
$$
  M(i,j) = \#\{ n : f(z)^n = i \text{ and } f_z^n = j\},
$$
so that the diagonal entries count the correct predictions, while the off-diagonal entries count the misclassified ones. Note that numerous other performance indicators can be straightforwardly deduced from the confusion matrix, such as the Rand index or the Fowlkes-Mallows score.
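
A minimal NumPy sketch of this matrix is given below, with rows indexed by the ground-truth labels and columns by the predicted ones. Labels are assumed to be 0-based integers, and the helper name `confusion_matrix` is ours (scikit-learn offers a function of the same name, which is not used here).

```python
import numpy as np

def confusion_matrix(f_true, f_pred, n_labels):
    """M[i, j] = number of samples with true label i and predicted label j."""
    M = np.zeros((n_labels, n_labels), dtype=int)
    # np.add.at accumulates correctly even when an index pair is repeated.
    np.add.at(M, (np.asarray(f_true), np.asarray(f_pred)), 1)
    return M

f_true = [0, 0, 1, 1, 2, 2]     # hypothetical ground-truth labels f(z)
f_pred = [0, 1, 1, 1, 2, 0]     # hypothetical predictions f_z
M = confusion_matrix(f_true, f_pred, n_labels=3)
accuracy = np.trace(M) / M.sum()    # the score is the normalized trace
```

The trace of the matrix divided by the number of samples recovers the score \@ref(eq:score).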

**Norm of output**. If no ground truth values are known, the quality of the prediction $f_z$ depends on **a priori error estimates** or error bounds. Such estimates exist only for kernel methods (to the best of the knowledge of the authors), and are described in the next chapter, see \@ref(eq:err). Such estimates use the norm of functions described in \@ref(eq:norm), which has proven to be a useful indicator in applications.

[^205]:see the [dedicated page on scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html). 

**ROC curves**. A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of military radar receivers starting in 1941, which led to its name.

ROC is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

| Metric   |      Formula |  Equivalent |
|----------|:-------------:|------:|
| True Positive Rate TPR |  $\frac{TP}{TP + FN}$ | Recall, sensitivity |
| False Positive Rate FPR|    $\frac{FP}{TN+FP}$  |   1-specificity |
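
The construction of the ROC points from these two rates can be sketched as follows, assuming a binary classifier that outputs a real-valued score and predicts the positive class whenever the score reaches the threshold. The helper `roc_points` and the sample data are illustrative assumptions.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (FPR, TPR) arrays obtained by sweeping the threshold over the scores."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Thresholds in decreasing order; +inf yields the point (0, 0).
    thresholds = np.r_[np.inf, np.unique(scores)[::-1]]
    P, N = y_true.sum(), (~y_true).sum()
    tpr = np.array([(scores >= t)[y_true].sum() / P for t in thresholds])
    fpr = np.array([(scores >= t)[~y_true].sum() / N for t in thresholds])
    return fpr, tpr

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
```

Plotting `tpr` against `fpr` gives the ROC curve, which runs from $(0,0)$ to $(1,1)$.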

We can use precision score ($PRE$) to measure the performance across all classes:

$$
PRE=\frac{TP}{TP+FP}.
$$
In “micro averaging”, we calculate the performance, e.g., precision, from the individual true positives, true negatives, false positives, and false negatives of the $k$-class model:
$$
PRE_{micro}=\frac{TP_{1}+\dots+TP_{k}}
{TP_{1}+\dots+TP_{k}+FP_{1}+\dots+FP_{k}}.
$$
And in macro-averaging, we average the performances of the individual classes:
$$
PRE_{macro}=\frac{PRE_{1}+\dots+PRE_{k}}{k}.
$$
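
Both averages can be read off a confusion matrix, as in the sketch below. The matrix is a made-up three-class example with rows indexed by true labels and columns by predicted labels.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true labels, columns = predictions.
M = np.array([[5, 1, 0],
              [2, 3, 1],
              [0, 0, 8]])

TP = np.diag(M).astype(float)      # true positives of each class
FP = M.sum(axis=0) - TP            # predicted as class k, but belonging to another class

pre_micro = TP.sum() / (TP.sum() + FP.sum())   # pool the counts, then divide
pre_macro = np.mean(TP / (TP + FP))            # per-class precision, then average
```

Here the micro average is $16/20 = 0.8$, while the macro average is the mean of $5/7$, $3/4$ and $8/9$; the two differ as soon as the classes are unbalanced.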

### Indicators for unsupervised learning

**Discrepancy error associated to kernel**. The evaluation of clustering algorithms benefits from many performance indicators, many of them being implemented in scikit-learn[^206].

[^206]:[see this link](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). 

We list in this section those that we compute. First of all, the discrepancy error is an indicator based on a kernel and will be fully described in the next chapter, see \@ref(eq:dk). It is used primarily to produce worst-case error estimates, together with the norm of functions, as described in \@ref(eq:norm). It was also found to be useful as a performance indicator for unsupervised learning machines.

**Inertia indicator**. The inertia indicator is used for *k-means* type algorithms. We describe it precisely, as it uses a notation that will be used in other parts. It shares some similarities with the discrepancy error but is not equivalent to it. 
To define inertia, one first picks a distance, denoted $d(x, y)$, typically the squared Euclidean one, although other distances can be considered, such as the Manhattan one or log-entropy, depending upon the problem under consideration. Consider now any point $w \in \RR^D$. Then $w$ is naturally attached to a point $y^{\sigma_d(w,y)}$, where the index function $\sigma_d(w,y)$ is computed as 
$$
\sigma_d(w,y) := \{ j : d(w,y^j) = \inf_k d(w,y^k) \}. (\#eq:sigmaw)
$$
Then the inertia is defined as
$$
I(x,y)= \sum_{n=1}^{N_x} |x^n-y^{\sigma_{d}(x^n,y)}|^2.
$$
Observe that this functional might not be convex, even if the distance under consideration is convex, as is the squared Euclidean distance. For k-means algorithms, the cluster centers $y$ are computed minimizing this functional. The parameter set $y$ is called **centroids** for k-means algorithms. 
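
The index function \@ref(eq:sigmaw) and the inertia can be sketched in NumPy as follows, for the squared Euclidean distance. The function names and the sample points are illustrative; when several centroids are equally close, `argmin` retains the first one, which is one possible way to select an element of the set in \@ref(eq:sigmaw).

```python
import numpy as np

def sigma(w, y):
    """Index of the point of y closest to each row of w (squared Euclidean distance)."""
    d = ((w[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)    # shape (Nw, Ny)
    return d.argmin(axis=1)

def inertia(x, y):
    """Sum over the training points of the squared distance to the attached centroid."""
    idx = sigma(x, y)
    return float(((x - y[idx]) ** 2).sum())

x = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 2.0]])   # Nx = 4, D = 2
y = np.array([[0.0, 0.5], [4.0, 1.0]])                            # Ny = 2 centroids
assignment = sigma(x, y)
I = inertia(x, y)
```

In this example the first two points are attached to the first centroid and the last two to the second one, giving an inertia of $0.25 + 0.25 + 1 + 1 = 2.5$.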

**Homogeneity score**. The homogeneity score, see the dedicated scikit-learn page for a definition[^492], is a performance indicator for supervised, labeled clustering problems. This indicator relies on a conditional entropy to compute a score $s(f(z), f_z)$ between 0 and 1, the higher the better.
[^492]:[see this link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score) 

**Silhouette coefficient**. If the ground truth labels are not known, evaluation must be performed using the model itself. The [Silhouette Coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better defined clusters. 

## General specification of tests

### Preliminaries 

We now overview a benchmark methodology and apply it to a few methods of supervised learning. For each machine, 

* we illustrate the prediction function $\mathcal{P}_m$, and 
* we illustrate the computation of some performance indicators.

We then present benchmarks using these indicators. In this section, we restrict attention to toy examples, while more significant examples will be studied in Chapter \@ref(application-to-supervised-machine-learning).

We begin by describing a general, multi-dimensional, first quality assurance test for supervised learning machines. We illustrate this test framework with one and two-dimensional examples, and the reader can toy with functions and methods.
The goal of this framework is to measure the accuracy of any learning machine, while using the extrapolation operator \@ref(eq:EI). Hence all our unit tests are based on the following inputs:
$$
  \text{a function: } f, \quad \text{a method: } m, \quad \text{five integers: } D, N_x, N_y, N_z, D_f.
$$
To benchmark our machine, we use a list of scenarios, that is, a list of entries $D, N_x, N_y, N_z$. Table \@ref(tab:299) is an example of a list of four scenarios.

In [None]:
scenarios_list = [ (1, 100*i, 100*i ,100*i ) for i in np.arange(1,5,1)]
pd_scenarios_list = pd.DataFrame(scenarios_list,columns = ["D","Nx","Ny","Nz"])

In [None]:
%%R
knitr::kable(py$pd_scenarios_list, caption = "scenario list", col.names = c("$D$","$N_x$","$N_y$","$N_z$"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

For the function $f$ we choose the sum of a periodic function and an increasing function:
\begin{equation} \label{2D}
f(x) = \prod_{d=1}^{D} \cos (4\pi x_d) + \sum_{d=1}^{D} x_d.
\end{equation}
It is defined in the Python code of this document, and the reader can change it to any other continuous function.

In [None]:
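def my_fun(x):
    """Test function prod_d cos(4*pi*x_d) + sum_d x_d for a single point x of size D."""
    import numpy as np
    from math import pi
    D = len(x)
    res = 1.0
    # periodic part: product of cosines over the D coordinates
    for d in range(D):
        res *= np.cos(4 * x[d] * pi)
    # increasing part: sum of the coordinates
    for d in range(D):
        res += x[d]
    return res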
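
When the function has to be evaluated on a whole set of points at once, a vectorized variant may be convenient. The sketch below is an assumption on our side (the name `my_fun_vectorized` does not appear in the library); it evaluates the same formula on every row of an array of size $N \times D$.

```python
import numpy as np

def my_fun_vectorized(x):
    """Evaluate prod_d cos(4*pi*x_d) + sum_d x_d for every row of x, of shape (N, D)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return np.prod(np.cos(4.0 * np.pi * x), axis=1) + np.sum(x, axis=1)

x = np.array([[0.0, 0.0], [0.25, 0.5], [1.0, 0.125]])   # N = 3 points, D = 2
values = my_fun_vectorized(x)
```

For instance, at the point $(0.25, 0.5)$ the product of cosines equals $\cos(\pi)\cos(2\pi) = -1$ and the sum of coordinates equals $0.75$, so the value is $-0.25$.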

### An example in one dimension

**Initialization**. For this tutorial, we used a generator, configured to select $x$ (resp. $y, z$) as $N_x$ (resp. $N_y, N_z$) points regularly (resp. randomly, regularly) generated on a unit cube. We chose to select $z$ distributed over a larger cube, to observe extrapolation and interpolation effects.

In [None]:
data_random_generator_ = data_random_generator(fun = my_fun,types=["cart","sto","cart"])
x, fx, y, fy, z, fz =  data_random_generator_.get_data(D=1,Nx=100,Ny=100,Nz=100)

As an illustration, in Figure \@ref(fig:xfxzfz) we show both graphs $(x, f(x))$ (left, training set) and $(z, f(z))$ (right, test set).

In [None]:
multi_plot([(x, fx),(z, fz)],plot1D)

## Benchmark methodology:  kernel-based predictors

### Periodic kernel regression model from CodPy

This test illustrates a kernel-based projection operator, described in Section \@ref(fundamental-notions-for-supervised-learning). The set of external parameters for kernel-based methods simply consists in picking a kernel, and is discussed in the next chapter; see Section \@ref(kernel-methods-for-machine-learning). In the corresponding Python chunk we pick a standard periodic Gaussian kernel, combined with a linear regression kernel, allowing us to fit both the periodic and polynomial parts of these data. These settings are explained in Chapter \@ref(dealing-with-kernels).

In [None]:
set_per_kernel = kernel_setters.kernel_helper(kernel_setters.set_gaussianper_kernel,2,1e-8,None)
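
To give an idea of what a kernel-based predictor computes, here is a minimal, self-contained sketch of a regularized kernel regression in NumPy. It uses a plain (non-periodic) Gaussian kernel, and the bandwidth `h`, the regularization `reg` and the function names are assumptions made for this illustration; it does not reproduce the CodPy kernel set up above.

```python
import numpy as np

def gaussian_kernel(a, b, h):
    """k(a, b) = exp(-|a - b|^2 / (2 h^2)) for all pairs of rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * h**2))

def kernel_ridge_predict(x, fx, z, h=0.05, reg=1e-8):
    """Solve (K(x, x) + reg * I) a = f(x), then return f_z = K(z, x) a."""
    K = gaussian_kernel(x, x, h)
    coeffs = np.linalg.solve(K + reg * np.eye(len(x)), fx)
    return gaussian_kernel(z, x, h) @ coeffs

x = np.linspace(0.0, 1.0, 21)[:, None]        # Nx = 21 regular points, D = 1
fx = np.cos(4.0 * np.pi * x) + x              # one-dimensional test function
f_x = kernel_ridge_predict(x, fx, x)          # prediction on the test set z = x
```

With such a small regularization the predictor reproduces the training values almost exactly on $z = x$; the behaviour outside the training set depends strongly on the kernel, which is why its choice is treated as an external parameter.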

We then run all the scenarios in Table \@ref(tab:299).

In [None]:
scenario_generator_ = scenario_generator()
scenario_generator_.run_scenarios(scenarios_list,data_random_generator_,
codpyexRegressor(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

We plot the first two results of this test in Figure \@ref(fig:xfxzfzper): these are the predictions, denoted $f_z$, of the function $f(z)$ shown in Figure \@ref(fig:xfxzfz), for the first two scenarios defined in Table \@ref(tab:299).

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:654) shows the computed indicators during this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "CodPy performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

### The kernel regression model from SciPy

SciPy provides a solid and robust kernel regression predictor, see [this link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.Rbf.html). We often benchmark our kernel implementation against it. Let us first set up the external parameters for SciPy. 

In [None]:
rbf_param = {'function': 'gaussian', 'epsilon':None, 'smooth':1e-8, 'norm':'euclidean'}

We now repeat the steps of the previous section, to highlight that benchmark methodologies should be method-independent. We run our scenario list and collect the results.


In [None]:
scenario_generator_.run_scenarios(scenarios_list,data_random_generator_,
ScipyRegressor(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

We plot the first two results in Figure \@ref(fig:xfxzfzscipy): these are the predictions, denoted $f_z$, of the function $f(z)$ shown in Figure \@ref(fig:xfxzfz), for the first two scenarios defined in Table \@ref(tab:299).

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:568) shows the computed indicators after running all scenarios indicated in Table \@ref(tab:299).

In [None]:
%%R
knitr::kable(py$results,  caption = "scipy performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

### Support vector regression model

For this test, the interpolation machine is chosen to be a support vector regressor (SVR), taken from scikit-learn. It is specified by a decision function and the kernel function associated with it, see [this dedicated page for a description of support vector machines](https://scikit-learn.org/stable/modules/svm.html). The reader can tune this set of parameters.

In [None]:
svm_param = {'kernel': 'linear', 'gamma': 'auto', 'C': 1}

In [None]:
scenario_generator_.run_scenarios(scenarios_list,
data_random_generator_,
SVR(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param, **svm_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Figure \@ref(fig:1013) shows the results of the first two scenarios of this test.

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:1012) provides all computed indicators after running all scenarios indicated in Table \@ref(tab:299).

In [None]:
%%R
knitr::kable(py$results,  caption = "SVM performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

##  Benchmark methodology: neural network predictors

### TensorFlow neural network regression model

For this test, we use as interpolation machine a standard neural network, taken from TensorFlow, an approach commonly called **deep learning**. It consists of a network of *layers* defined by the following settings, see [this dedicated page for a description of TensorFlow neural networks](https://www.tensorflow.org/tutorials/customization/basics). The reader can tune this set of parameters:

In [None]:
import tensorflow as tf
codpy_param['tfRegressor'] = {'epochs': 50,
'batch_size':16,
'validation_split':0.1,
'loss':tf.keras.losses.mean_squared_error,
'optimizer':tf.keras.optimizers.Adam(0.001),
'layers':[8,64,64,1],
'activation':['relu','relu','relu','linear'],
'metrics':['mse']}

We then run the scenarios. We plot the first two results of this test in Figure \@ref(fig:xfxzfzbl): these are the predictions, denoted $f_z$, of the function $f(z)$ shown in Figure \@ref(fig:xfxzfz), for the first two scenarios defined in Table \@ref(tab:299).

In [None]:
scenario_generator_.run_scenarios(scenarios_list,data_random_generator_,tfRegressor(set_kernel = set_per_kernel),data_accumulator(), **codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:569) shows the computed indicators after running all scenarios indicated in Table \@ref(tab:299). 

In [None]:
%%R
knitr::kable(py$results,  caption = "Tensorflow neural network performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

### Pytorch neural network regression model

For this test, we use as interpolation machine a standard neural network, taken from PyTorch. It consists of a network of *layers* defined by the following settings, see [this dedicated page for a description of Pytorch neural networks](https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html). We constructed the same neural network as in the case of TensorFlow.

In [None]:
import torch
from torch import nn
# external parameters of the PyTorch regressor
torch_param = {'PytorchRegressor': {'epochs': 128,
'layers': [8,64,64],
'batch_size': 16,
'loss': nn.MSELoss(),
'activation': nn.ReLU(),
'optimizer': torch.optim.Adam,
"out_layer": 1}}

In [None]:
scenario_generator_.run_scenarios(scenarios_list,
data_random_generator_,
PytorchRegressor(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param, **torch_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Figure \@ref(fig:1001) shows the results of the first two scenarios of this test.

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

We run the scenarios and output the results: Table \@ref(tab:1000) provides all computed indicators after running all scenarios indicated in Table  \@ref(tab:299). 

In [None]:
%%R
knitr::kable(py$results, caption = "Pytorch performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE) %>%
  kable_styling(latex_options = "HOLD_position")

##  Benchmark methodology: regression-tree predictors

### Decision tree regression

We use as interpolation machine a decision tree, taken from scikit-learn. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation; see [this dedicated page for a description of decision trees](https://scikit-learn.org/stable/modules/tree.html). The reader can tune this set of parameters.

In [None]:
DT_param = {'max_depth': 10}

In [None]:
scenario_generator_.run_scenarios(scenarios_list,
data_random_generator_,
DecisionTreeRegressor(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param, **DT_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Figure \@ref(fig:1003) shows the results of the first two scenarios of this test.

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:1002) provides all computed indicators after running all scenarios indicated in Table \@ref(tab:299).

In [None]:
%%R
knitr::kable(py$results, caption = "Decision Tree performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

### AdaBoost regression

We now use an AdaBoost algorithm, taken from scikit-learn, as the interpolation machine. The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction; see [this dedicated page for a description of the AdaBoost algorithm](https://scikit-learn.org/stable/modules/ensemble.html#adaboost). (The reader can tune this set of parameters.)

In [None]:
ada_param = {'tree_no': 50, 'learning_rate': 1}

In [None]:
scenario_generator_.run_scenarios(scenarios_list,
data_random_generator_,
AdaBoostRegressor(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param, **ada_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Figure \@ref(fig:1005) shows the results of the first two scenarios of this test.

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:1004) provides all computed indicators after running all scenarios indicated in Table  \@ref(tab:299). 

In [None]:
%%R
knitr::kable(py$results, caption = "AdaBoost performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

### Gradient boosting regression

For this test, we use gradient-boosted decision trees (GBDT), taken from scikit-learn, as the interpolation machine. This method allows for the optimization of arbitrary differentiable loss functions. At each stage, a regression tree is fit on the negative gradient of the given loss function; see [this dedicated page for a description of Gradient Tree Boosting](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting). (The reader can tune this set of parameters.)
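
The stage-wise fit on the negative gradient can be sketched as follows for the squared loss $L = \frac{1}{2}(y - F)^2$, whose negative gradient with respect to the prediction $F$ is simply the residual $y - F$. This is a NumPy-only illustration under stated assumptions: the helper names, the one-split trees, the toy target, and the values `n_stages=50`, `learning_rate=0.5` are hypothetical and are not the scikit-learn implementation.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares depth-1 tree on 1D inputs; returns (threshold, left, right)."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best_sse, best = np.inf, (xs[0], rs.mean(), rs.mean())
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold separates equal inputs
        left_mean, right_mean = rs[:i].mean(), rs[i:].mean()
        sse = ((rs[:i] - left_mean) ** 2).sum() + ((rs[i:] - right_mean) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (0.5 * (xs[i - 1] + xs[i]), left_mean, right_mean)
    return best

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_stages=50, learning_rate=0.5):
    """Boost stumps for the squared loss; returns predictions and the MSE per stage."""
    prediction = np.full(len(y), y.mean())
    history = []
    for _ in range(n_stages):
        negative_gradient = y - prediction  # -dL/dF for the squared loss
        stump = fit_stump(x, negative_gradient)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        history.append(np.mean((y - prediction) ** 2))
    return prediction, history

# Hypothetical toy target.
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x)
prediction, history = gradient_boost(x, y)
initial_mse = np.mean((y - y.mean()) ** 2)
```

For this loss and a learning rate in $(0, 1]$, the training error recorded in `history` cannot increase from one stage to the next, since each stump removes part of the current residual.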

In [None]:
gb_param = {'tree_no': 50, 'learning_rate': 1}

In [None]:
scenario_generator_.run_scenarios(scenarios_list,
data_random_generator_,
GradientBoostingRegressor(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param, **gb_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Figure \@ref(fig:1007) shows the results of the first two scenarios of this test.

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:1006) provides all computed indicators after running all scenarios indicated in Table  \@ref(tab:299).

In [None]:
%%R
knitr::kable(py$results, caption = "Gradient Boosting performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

### XGBoost algorithm

For this test, we use XGBoost as the interpolation machine. It is essentially a computationally efficient implementation of the original gradient boosting algorithm; see [this dedicated page for a description of the XGBoost project](https://xgboost.readthedocs.io/en/latest/tutorials/model.html). (The reader can tune this set of parameters.)

In [None]:
xgb_param = {'max_depth': 5, 'n_estimators': 10}

In [None]:
scenario_generator_.run_scenarios(scenarios_list,
data_random_generator_,
XGBRegressor(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param, **xgb_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Figure \@ref(fig:1009) shows the results of the first two scenarios of this test.

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:1008) provides all computed indicators after running all scenarios indicated in Table  \@ref(tab:299).

In [None]:
%%R
knitr::kable(py$results, caption = "XGBoost performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

### Random forest regression

For this test, we use random forest regression as the interpolation machine. It constructs a large number of decision trees at training time and outputs either the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression); see [this dedicated page for a description of forests of randomized trees](https://scikit-learn.org/stable/modules/ensemble.html#forest). (The reader can tune this set of parameters.)
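
The aggregation step alone can be illustrated in a few lines. The per-tree outputs below are made-up numbers chosen for the example, not results of this benchmark, and `column_mode` is a hypothetical helper.

```python
import numpy as np

# Made-up predictions of three trees at two test points (illustrative values only).
tree_regression_outputs = np.array([[1.0, 2.0],
                                    [1.2, 1.8],
                                    [0.8, 2.2]])
# Regression: the forest prediction is the mean over the trees.
forest_regression = tree_regression_outputs.mean(axis=0)

# Made-up class labels voted by the same three trees.
tree_class_outputs = np.array([[0, 1],
                               [0, 2],
                               [1, 2]])

def column_mode(votes):
    """Return the most frequent label among the votes."""
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

# Classification: the forest prediction is the mode over the trees.
forest_classes = np.array([column_mode(tree_class_outputs[:, j])
                           for j in range(tree_class_outputs.shape[1])])
```

In an actual random forest, each tree is additionally trained on a bootstrap sample of the data with random feature subsets, which this sketch leaves out.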

In [None]:
RF_param = {'max_depth': 5, 'n_estimators': 5}

In [None]:
scenario_generator_.run_scenarios(scenarios_list,
data_random_generator_,
RandomForestRegressor(set_kernel = set_per_kernel),
data_accumulator(), **codpy_param, **RF_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Figure \@ref(fig:1011) shows the results of the first two scenarios of this test.

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot1D,mp_max_items = 2)

Table \@ref(tab:1010) provides all computed indicators after running all scenarios indicated in Table  \@ref(tab:299).

In [None]:
%%R
knitr::kable(py$results, caption = "Random Forest performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")

### A comparison between methods

We benchmark the methods by comparing the computed indicators (scores, discrepancy errors, and execution times) as functions of $N_x$.

In [None]:
scenario_generator_.compare_plots(
axis_field_labels = [("Nx","scores")],
mp_title = "Benchmark methods",mp_ncols=1
)

In [None]:
scenario_generator_.compare_plots(
axis_field_labels = [("Nx","discrepancy_errors")],
mp_title = "Benchmark methods",mp_ncols=1
)

In [None]:
scenario_generator_.compare_plots(
axis_field_labels = [("Nx","execution_time")],
mp_title = "Benchmark methods",mp_ncols=1
)

Observe that the function norms and the discrepancy errors do not depend on the method. Clearly, for this example, the periodic kernel-based method outperforms the other ones. However, our goal is not to establish the superiority of a particular method, but to illustrate a benchmark methodology, particularly in the context of extrapolating to test data far from the training set.

## Tutorial in $N$ dimensions 

### Initialization

We now illustrate that the dimension of the problem under consideration does not change the benchmark methodology. To this end, we simply repeat the steps used in the one-dimensional case, now setting the dimension to $D=2$; the reader can experiment with this parameter. Only the data visualization changes.

In [None]:
data_random_generator_ = data_random_generator(fun = my_fun,types=["cart","sto","cart"])
x, fx, y, fy, z, fz =  data_random_generator_.get_data(D=1,Nx=100,Ny=100,Nz=100)

We first pick a list of scenarios, see Table \@ref(tab:879), to be compared with the one-dimensional scenarios in Table \@ref(tab:299).

In [None]:
D = 2
scenarios_list = [ (D, 100*(i**2), 100*(i**2),100*(i**2) ) for i in np.arange(5,1,-1)]
pd_scenarios_list = pd.DataFrame(scenarios_list,columns = ["D","Nx","Ny","Nz"])

In [None]:
%%R
knitr::kable(py$pd_scenarios_list, caption = "scenario list", col.names = c("$D$","$N_x$","$N_y$","$N_z$"), escape = FALSE)

We then generate the data. Figure \@ref(fig:xfxzfz2) shows the graphs $(x,f(x))$ (left, training set) and $(z,f(z))$ (right, test set) for illustration purposes, $f$ being defined in Section \@ref(2D). Observe that, if the dimension is greater than two, we use a two-dimensional visualization, plotting $\tilde{x},f(x)$, where $\tilde{x}$ is obtained

* either by selecting two coordinates, $\tilde{x}:=x[index1,index2]$,
* or by performing a PCA over $x$ and setting $\tilde{x}:=PCA(x)[index1,index2]$.
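
The two options above can be sketched as follows. The function `project_2d` is a hypothetical helper written for this illustration (the plotting routines of this chapter may proceed differently); the PCA is computed with a plain singular value decomposition of the centered data.

```python
import numpy as np

def project_2d(x, index1=0, index2=1, use_pca=False):
    """Return two columns of x, or of its principal component scores."""
    x = np.asarray(x, dtype=float)
    if use_pca:
        centered = x - x.mean(axis=0)
        # Rows of vt are the principal directions, sorted by decreasing variance.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        x = centered @ vt.T
    return x[:, [index1, index2]]

# Hypothetical data in dimension D = 5 with very different scales per coordinate.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5)) * np.array([3.0, 1.0, 0.5, 0.2, 0.1])

x_tilde_raw = project_2d(x, index1=1, index2=3)  # plain coordinate selection
x_tilde_pca = project_2d(x, use_pca=True)        # first two principal components
```

With `use_pca=True` and the default indices, the two retained columns are the directions of largest variance, which is usually the more informative view when $D$ is large.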

In [None]:
data_random_generator_ = data_random_generator(fun = my_fun, types=["cart","sto","cart"])
x, fx, y, fy, z, fz =  data_random_generator_.get_data(D=D,Nx=2000,Ny=2000,Nz=2000)
multi_plot([(x,fx),(z,fz)],plot_trisurf,projection='3d')

### Periodic kernel for machine learning

We use a standard periodic Gaussian kernel combined with a linear regression kernel, which allows us to fit both the periodic and the polynomial parts of our data.
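
To illustrate why such a combination is useful for extrapolation, the sketch below sums a periodic Gaussian-type kernel and a linear kernel, and uses the result in a plain kernel ridge regression. Everything here is an assumption made for the example: the kernel formula, its parameters, the regularization value, and the one-dimensional `toy_function` are hypothetical, and the code is not the CodPy kernel selected by `set_per_kernel`.

```python
import numpy as np

def periodic_plus_linear_kernel(a, b, period=1.0, length_scale=1.0):
    """k(a, b) = exp(-2 sin^2(pi (a - b) / period) / length_scale^2) + 1 + a b."""
    diff = a[:, None] - b[None, :]
    periodic = np.exp(-2.0 * np.sin(np.pi * diff / period) ** 2 / length_scale**2)
    linear = 1.0 + a[:, None] * b[None, :]
    return periodic + linear

def kernel_ridge_fit_predict(x, fx, z, regularization=1e-6):
    """Solve (K(x, x) + regularization * I) c = f(x), then return K(z, x) c."""
    gram = periodic_plus_linear_kernel(x, x)
    coefficients = np.linalg.solve(gram + regularization * np.eye(len(x)), fx)
    return periodic_plus_linear_kernel(z, x) @ coefficients

def toy_function(x):
    # Hypothetical target: a periodic part plus a linear trend.
    return np.cos(2.0 * np.pi * x) + x

x = np.linspace(-1.0, 1.0, 50)   # training set
z = np.linspace(1.0, 2.0, 25)    # test set, outside the training interval
f_z = kernel_ridge_fit_predict(x, toy_function(x), z)
max_error = np.max(np.abs(f_z - toy_function(z)))
```

Because the target lies in the function space generated by the summed kernel, the prediction remains accurate outside the training interval; a purely local kernel would typically not extrapolate in this way.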

In [None]:
scenario_generator_ = scenario_generator()
scenario_generator_.run_scenarios(scenarios_list,data_random_generator_,
codpyexRegressor(set_kernel = set_per_kernel),
data_accumulator(),data_generator_crop = False, **codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Table \@ref(tab:657) shows the computed indicators after running all scenarios indicated in Table  \@ref(tab:299).

In [None]:
%%R
knitr::kable(py$results,  caption = "CodPy performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)

We plot the predictions $f_z$ of the function $f(z)$ for the first two scenarios defined in Table \@ref(tab:299); they are to be compared with Figure \@ref(fig:xfxzfz2).

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot_trisurf,mp_max_items = 2,projection='3d')

### Scipy library

In this section, we present the results of an extrapolation using SciPy's radial basis function (RBF) interpolation.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,data_random_generator_,
ScipyRegressor(set_kernel = set_per_kernel),
data_accumulator(), data_generator_crop = False,**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

We provide all computed indicators after running all scenarios indicated in  Table \@ref(tab:299).

In [None]:
%%R
knitr::kable(py$results,  caption = "scipy performance indicators", col.names = c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error"), escape = FALSE)

We end this test by plotting its first two results, to be compared with Figure \@ref(fig:xfxzfz2).

In [None]:
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
multi_plot(list_results,plot_trisurf,title="z, f(z)",mp_max_items = 2,projection='3d')

### A comparison between methods

The figure below compares the methods in terms of scores, discrepancy errors, and execution times, as functions of $N_x$.

In [None]:
scenario_generator_.compare_plots(
axis_field_labels = [("Nx","scores"),("Nx","discrepancy_errors"),("Nx","execution_time")]
)

## Benchmark methodology for unsupervised learning

### Purpose 

The goal of this section is to give an overview of our own methodology (which will be fully described in the next chapter).

* We illustrate the prediction function $\mathcal{P}_m$ for some methods in the context of unsupervised learning.
* We illustrate the computation of some performance indicators and present a toy benchmark using these indicators.

The data are generated from a multi-modal, multivariate Gaussian distribution with covariance matrix $\Sigma = \sigma I_d$. The problem is to identify the modes of the distribution using a clustering method.
In the following, we generate distributions with a predetermined number of modes, which allows us to test validation scores on this toy example.
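
A minimal sketch of such a generator is given below. It assumes that every mode shares the covariance $\sigma I_d$ and that the mode centers are drawn uniformly in $[-1,1]^d$; the function name, the default values, and these choices are hypothetical, and the `data_blob_generator` used in this chapter may differ.

```python
import numpy as np

def sample_gaussian_modes(n_samples, n_modes, dim=2, sigma=0.05, seed=0):
    """Draw points from a mixture of n_modes Gaussians with covariance sigma * I_d."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-1.0, 1.0, size=(n_modes, dim))
    labels = rng.integers(0, n_modes, size=n_samples)  # mode of each sample
    # Covariance sigma * I_d means a standard deviation sqrt(sigma) per coordinate.
    noise = np.sqrt(sigma) * rng.normal(size=(n_samples, dim))
    return centers[labels] + noise, labels, centers

points, labels, centers = sample_gaussian_modes(n_samples=1000, n_modes=4)
```

Keeping the true `labels` is what makes validation possible: the clusters found by a method can be compared with the modes that actually generated each point.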

###  Analysis via k-means clustering

In this paragraph, we compute a k-means clustering, using a scikit-learn implementation[^201].

[^201]: [The scikit-learn implementation is available using this link](https://scikit-learn.org/stable/modules/clustering.html#k-means).

We first run all scenarios. We provide all computed indicators after running all scenarios indicated in Table \@ref(tab:299).
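
For readers unfamiliar with k-means, the following NumPy sketch shows the classical Lloyd iteration together with the inertia indicator, that is, the sum of squared distances from each point to its closest center. The function `kmeans`, its initialization on randomly chosen data points, and the toy data are assumptions made for this illustration; scikit-learn uses a more careful initialization and stopping rule.

```python
import numpy as np

def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Lloyd iterations; returns centers, labels, and inertia."""
    rng = np.random.default_rng(seed)
    # Initialize the centers on distinct, randomly chosen data points.
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest center for each point.
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Update step: move each non-empty cluster center to the mean of its points.
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    inertia = sq_dist[np.arange(len(points)), labels].sum()
    return centers, labels, inertia

# Hypothetical toy data: two well-separated blobs of 100 points each.
rng = np.random.default_rng(1)
blob_a = 0.1 * rng.normal(size=(100, 2))
blob_b = 5.0 + 0.1 * rng.normal(size=(100, 2))
points = np.vstack([blob_a, blob_b])

centers, labels, inertia = kmeans(points, n_clusters=2)
single_center_inertia = ((points - points.mean(axis=0)) ** 2).sum()
```

Inertia decreases as the number of clusters grows, so it is mostly useful for comparing methods at a fixed number of clusters, as in the tables of this section.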

In [None]:
from clusteringCHK import *
set_kernel = set_gaussian_kernel
scenarios_list = [ (2, 1000, i,1000 ) for i in np.arange(2,7,1)]
scenario_generator_ = scenario_generator()
scenario_generator_.run_scenarios(scenarios_list,data_blob_generator(),scikitClusterClassifier(set_kernel = set_kernel),cluster_accumulator(), **codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1).T

In [None]:
%%R
pyresults <- py$results
row.names(pyresults) <-  c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error", "score calinsky", "score harabazs", "homogeneity test", "inertia")
knitr::kable(pyresults,  caption = "scikit: clusters indicators", escape = FALSE, col.names = NULL)

**k-means blob visualization**. We now plot the first two distributions as well as the corresponding computed clusters.

In [None]:
scenario_generator_.accumulator.plot_clusters(**codpy_param, index1=0,index2=1,xlabel = 'x',ylabel = 'y',mp_max_items = 2)

**k-means confusion matrix**. We next plot the first two confusion matrices.
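
As a reminder, entry $(i,j)$ of a confusion matrix counts the points whose true label is $i$ and whose predicted label is $j$. The sketch below computes it with NumPy on made-up labels; `confusion_matrix` here is a hypothetical helper written for the example, not the plotting routine used in this chapter.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count pairs (true label, predicted label) in an n_classes x n_classes table."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair is repeated.
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix

# Made-up labels for five points and three classes (illustrative values only).
true_labels = np.array([0, 0, 1, 1, 2])
predicted_labels = np.array([0, 1, 1, 1, 2])
matrix = confusion_matrix(true_labels, predicted_labels, n_classes=3)
```

For clustering, the numbering of the computed clusters is arbitrary, so the cluster labels generally need to be matched to the true modes before the diagonal of the matrix can be read as the number of correctly assigned points.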

In [None]:
scenario_generator_.accumulator.plot_confusion_matrices(mp_max_items = 2)

### Analysis via mini-batch clustering

To compute a mini-batch k-means clustering, we use the [scikit-learn implementation](https://scikit-learn.org/stable/modules/clustering.html#mini-batch-kmeans).

In [None]:
scenario_generator_.run_scenarios(scenarios_list,data_blob_generator(),MinibatchClusterClassifier(set_kernel = set_kernel),cluster_accumulator(), **codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1).T

We provide all computed indicators after running all scenarios indicated in  Table \@ref(tab:299).

In [None]:
%%R
pyresults <- py$results
row.names(pyresults) <-  c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error", "scores calinsky", "score harabazs", "homogeneity test", "inertia")
knitr::kable(pyresults,  caption = "Minibatch: clusters indicators", escape = FALSE, col.names = NULL) %>%
  kable_styling(full_width = T)

**Minibatch blob visualization**. We next plot the first two distributions as well as the corresponding computed clusters.

In [None]:
scenario_generator_.accumulator.plot_clusters(**codpy_param, index1=0,index2=1,xlabel = 'x',ylabel = 'y',mp_max_items = 2)

**Minibatch confusion matrix**. The figure below illustrates two confusion matrices. 

In [None]:
scenario_generator_.accumulator.plot_confusion_matrices(mp_max_items = 2)

### Analysis via CodPy clustering 

In [None]:
scenario_generator_.run_scenarios(scenarios_list,data_blob_generator(),codpyClusterPredictor(set_kernel = set_kernel),cluster_accumulator(), **codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1).T

We also provide all the indicators after running all of the scenarios in Table  \@ref(tab:299).

In [None]:
%%R
pyresults <- py$results
row.names(pyresults) <-  c("$predictor_{id}$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "norm function", "discr.error", "score calinsky", "score harabazs", "homogeneity test", "inertia")
knitr::kable(pyresults,  caption = "codpy: clusters indicators", escape = FALSE, col.names = NULL)

**CodPy blob visualization**. We finally plot the first two distributions as well as the corresponding computed clusters.

In [None]:
scenario_generator_.accumulator.plot_clusters(**codpy_param, index1=0,index2=1,xlabel = 'x',ylabel = 'y',mp_max_items = 2)

**CodPy confusion matrix**. The figure below illustrates two confusion matrices. 

In [None]:
scenario_generator_.accumulator.plot_confusion_matrices(mp_max_items = 2)

### A comparison between methods

We compare the methods under consideration through their performance indicators (scores, discrepancy errors, inertia, and execution times) as functions of $N_y$, as illustrated by Figure \@ref(fig:740).

In [None]:
scenario_generator_.compare_plots(
axis_field_labels = [("Ny","scores"),("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")]
)