In [None]:
%%R


In [None]:
from preamble import *
import tensorflow as tf
plt.close('all')
from clusteringCHK import *
from housing_prices import *
codpy_param = {'rescale:xmax': 1000,
'rescale:seed':42,
'sharp_discrepancy:xmax':1000,
'sharp_discrepancy:seed':30,
'sharp_discrepancy:itermax':5,
'discrepancy:xmax':500,
'discrepancy:ymax':500,
'discrepancy:zmax':500,
'discrepancy:nmax':2000,
'validator_compute': ['accuracy_score','discrepancy_error','norm']}

# Application to supervised machine learning 
In this chapter and the following ones, we present examples of more concrete machine learning problems. Some of these tests are taken from [Kaggle](https://www.kaggle.com/).

Supervised learning problems can be split into regression and classification problems. Both aim to construct a model that predicts the value of the output from the input variables. In regression the output is a real-valued variable, whereas in classification the output is a category (e.g. "disease" or "no disease"). Codpy's extrapolation and projection functions can be used to treat both kinds of problems.

We present two cases, one for each of these typical problems in supervised learning: Boston housing price prediction and MNIST classification.

## Regression problem: housing price prediction

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. There are 506 cases and 13 attributes (features), with a target column (price). More details can be found in Harrison, D. and Rubinfeld, D.L., "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978.

### Codpy's extrapolation

Starting from the training set $x \in \RR^{N_x \times D}$, we extrapolate the labels $f_z$ and compare them to the test set labels $f(z)$, using the extrapolation operator defined in \@ref(eq:EI)-left.
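
As a rough illustration of what such an extrapolation does, the sketch below implements a generic regularized kernel interpolation, $f_z = K(z,x)\,(K(x,x)+\epsilon I)^{-1} f(x)$, in NumPy. It assumes a Gaussian kernel with a hand-picked bandwidth and synthetic data; it is not CodPy's implementation, the names `gaussian_kernel` and `extrapolate` are hypothetical, and the exact form of the operator in \@ref(eq:EI) may differ.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-|a_i - b_j|^2 / (2 h^2))."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def extrapolate(x, fx, z, bandwidth=1.0, reg=1e-8):
    """Predict f at z from training pairs (x, fx) by regularized kernel interpolation."""
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_zx = gaussian_kernel(z, x, bandwidth)
    # Solve (K(x,x) + reg * I) w = f(x), then evaluate K(z,x) w.
    weights = np.linalg.solve(k_xx + reg * np.eye(len(x)), fx)
    return k_zx @ weights

# Synthetic data (an assumption of this sketch, not the Boston dataset).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(50, 2))    # training inputs
fx = np.sin(x.sum(axis=1, keepdims=True))   # training labels, one column
z = rng.uniform(-1.0, 1.0, size=(20, 2))    # test inputs
fz = extrapolate(x, fx, z)                  # extrapolated labels
exact = np.sin(z.sum(axis=1, keepdims=True))
rmse = float(np.sqrt(np.mean((fz - exact) ** 2)))
```

The regularization `reg` plays the same stabilizing role as the small constant passed to the kernel helper; its value here is only illustrative.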

In [None]:
set_kernel = kernel_setters.kernel_helper(kernel_setters.set_tensornorm_kernel, 2,1e-8 ,map_setters.set_unitcube_map)
data_generator_ = Boston_data_generator()
x, fx, y, fy, z, fz = data_generator_.get_data(-1, -1, -1, -1)
length_ = len(x)
scenarios_list = [ (-1, i, i, -1)  for i in np.arange(length_,20,-(length_-20)/10) ]
#scenarios_list = [ (-1, 2**(i), -1, 2**(i))  for i in np.arange(5,9,1) ]
scenarios = scenario_generator()
scenarios.run_scenarios(scenarios_list,data_generator_, housing_codpy_extrapolator(set_kernel = set_kernel), data_accumulator(), **codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = results

Table \@ref(tab:669) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "Codpy: indicators for Boston housing prices")%>%
  kable_styling(latex_options = "HOLD_position")

### Tensorflow

The benchmark method is described in Chapter \@ref(a-quick-tour-to-machine-learning). The following lines define a standard neural network for a regression model.

In [None]:
tf_param = {'tfRegressor': {'epochs': 50,
'batch_size':16,
'validation_split':0.1,
'loss':tf.keras.losses.mean_squared_error,
'optimizer':tf.keras.optimizers.Adam(0.001),
'layers':[8,64,64,1],
'activation':['relu','relu','relu','linear'],
'metrics':['mse']}
}
scenarios.run_scenarios(scenarios_list,data_generator_, tfRegressor(set_kernel = set_kernel), data_accumulator(), **codpy_param,**tf_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = results

Table \@ref(tab:671) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "Tensorflow Neural Network: indicators for Boston housing prices")%>%
  kable_styling(latex_options = "HOLD_position")

### Pytorch

The Pytorch neural network model is described in Chapter \@ref(pytorch-neural-network-model). We use the parameter set below to define this Pytorch regression model.

In [None]:
torch_param = {'PytorchRegressor': {'epochs': 50,
'layers': [8,64,64],
'loss': nn.MSELoss(),
'batch_size': 16,
'activation': nn.ReLU(),
'optimizer': torch.optim.Adam,
'out_layer': 1}}
scenarios.run_scenarios(scenarios_list,data_generator_, PytorchRegressor(set_kernel = set_kernel), data_accumulator(), **codpy_param,**torch_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:672) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "Pytorch Neural Network: indicators for Boston housing prices")%>%
  kable_styling(latex_options = "HOLD_position")

### Decision tree

The decision tree model is described in Chapter \@ref(decision-tree-model).

In [None]:
scenarios.run_scenarios(scenarios_list,data_generator_, DecisionTreeRegressor(set_kernel = set_kernel), data_accumulator(), **codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:673) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "Decision tree: indicators for Boston housing prices")%>%
  kable_styling(latex_options = "HOLD_position")

### Methods comparison

The following figures compare the methods in terms of scores (Figure \@ref(fig:309)), discrepancy errors (Figure \@ref(fig:310)), and execution time (Figure \@ref(fig:311)). We interpret these results as follows.

* First, notice that the kernel method *codpy lab extra*, that is, the extrapolation method, obtains both the best scores and the worst execution time.
* Notice also that one minus the discrepancy error matches the scores of the method *codpy lab extra*. This indicates that the discrepancy error is a pertinent indicator.
* Another kernel method, *codpy lab proj*, that is, the projection method above, is more balanced [^268].
* Both kernel methods are shipped with a very standard kernel, the Gaussian one, which is the only parameter for kernel methods. We emphasize that kernel engineering can easily improve these results. We do not present these improved kernel methods, as our purpose is to benchmark standard methods.
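
To give an intuition for the discrepancy error mentioned above, the sketch below computes a kernel distance between two samples. It assumes that the discrepancy takes the form of the standard (biased) maximum mean discrepancy estimate with a Gaussian kernel; CodPy's own kernel, rescaling map, and normalization may differ, and the function names are hypothetical.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(x, z, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy between samples x and z."""
    return (gaussian_gram(x, x, bandwidth).mean()
            - 2.0 * gaussian_gram(x, z, bandwidth).mean()
            + gaussian_gram(z, z, bandwidth).mean())

# Synthetic samples (an assumption of this sketch).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 3))
z_same = rng.normal(0.0, 1.0, size=(200, 3))     # drawn from the same distribution as x
z_shifted = rng.normal(2.0, 1.0, size=(200, 3))  # drawn from a shifted distribution
d_same = mmd_squared(x, z_same)
d_shifted = mmd_squared(x, z_shifted)
```

A test set that resembles the training set yields a small value, while a shifted test set yields a larger one, which is why such a quantity can serve as an a priori error indicator.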

In [None]:
scenarios.compare_plots(
axis_field_labels = [("Nx","scores")])

In [None]:
scenarios.compare_plots(
axis_field_labels = [("Nx","discrepancy_errors")])

In [None]:
scenarios.compare_plots(
axis_field_labels = [("Nx","execution_time")])

## Classification problem: handwritten digits

This section presents an example of image classification, a typical academic example referred to as the MNIST problem, which allows us to benchmark our results against more popular methods.

MNIST ("Modified National Institute of Standards and Technology") contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms.

In this section, we benchmark several machine learning methods, including kernel ones. Beyond benchmarking our methods against popular alternatives, our goal is to demonstrate that these tests are problem-dependent, not method-dependent. To illustrate this, we purposely reuse nearly identical code for each test, changing only the method. A further motivation is to provide our users with a bank of code: they can copy and paste one section of this document to test their own learning machines.

### Short introduction to MNIST

The MNIST dataset is composed of $60,000$ images defining a training set of handwritten digits. Each image is a vector of dimension $784$ (a $28 \times 28$ grayscale image flattened in row order). There are $10$ digits, $0$ to $9$. The test set is composed of $10,000$ images with their labels.

We formalize the problem as follows. Given the training set represented by a matrix $x\in \mathbb{R}^{N_x \times D}$, $D=784$, the labels $f(x) \in \mathbb{R}^{N_x \times D_f}$, $D_f=10$, and the test set $z\in \mathbb{R}^{N_z \times D}$, $N_z= 10000$, predict the label function $f(z) \in \mathbb{R}^{N_z \times D_f}$. Data are retrieved from Y. LeCun's MNIST home page \cite{YL}, and we test different values of $N_x$.
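
The shapes in this formulation can be made concrete with a short sketch. It assumes that labels are one-hot encoded, which is one natural reading of $D_f = 10$; the helper names are hypothetical and the images are placeholders rather than real MNIST data.

```python
import numpy as np

def flatten_images(images):
    """Reshape (N, 28, 28) images into (N, 784) vectors in row-major order."""
    return images.reshape(len(images), -1)

def one_hot(labels, num_classes=10):
    """Encode integer labels in {0, ..., num_classes - 1} as rows of an (N, num_classes) matrix."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

images = np.zeros((4, 28, 28))         # placeholder images, not real MNIST data
labels = np.array([5, 0, 4, 1])        # placeholder digit labels
x = flatten_images(images)             # shape (4, 784), i.e. (N_x, D)
fx = one_hot(labels)                   # shape (4, 10), i.e. (N_x, D_f)
decoded = fx.argmax(axis=1)            # decoding: index of the largest entry per row
```

With this encoding, a predicted matrix $f_z$ is turned back into digits by taking the index of the largest entry in each row.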

The following picture shows the first image $x^1$ of a handwritten digit, together with numerous others.

![](./CodPyFigs/MNIST.png){width=50%}

The following line defines our scenario list

In [None]:
MNIST_data_generator_ = MNIST_data_generator()
scenarios_list = [ (784, 2**(i), 2**(i-2), 10000)  for i in np.arange(5,9,1)]
pd_scenarios_list = pd.DataFrame(scenarios_list)

Table \@ref(tab:538) displays this scenario list.

In [None]:
%%R
knitr::kable(py$pd_scenarios_list, label = "list of scenari", col.names = c("D","Nx","Ny","Nz"))
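
For reference, the scenario tuples defined above can be expanded by hand. The column names follow the table header; reading them as feature dimension, training size, size of the auxiliary set $y$, and test size is our interpretation of the notation. A minimal sketch using plain Python integers instead of `np.arange`:

```python
# Same construction as the scenario list above, with range() in place of np.arange().
scenarios_list = [(784, 2**i, 2**(i - 2), 10000) for i in range(5, 9)]

for d, nx, ny, nz in scenarios_list:
    print(f"D={d}, Nx={nx}, Ny={ny}, Nz={nz}")
```

The training size doubles from one scenario to the next, and $N_y$ is always a quarter of $N_x$.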

Scores are computed using the formula \@ref(eq:score): a scalar between 0 and 1 that measures the proportion of correctly predicted images.
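
As an illustration, and assuming that \@ref(eq:score) is the usual classification accuracy, such a score can be sketched as follows. The function `accuracy_score` below is a stand-alone helper written for this example, not the CodPy validator of the same name.

```python
def accuracy_score(predicted, expected):
    """Fraction of predictions equal to the expected labels, between 0 and 1."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot score an empty label list")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Four of the five predicted digits match the expected ones.
score = accuracy_score([7, 2, 1, 0, 4], [7, 2, 1, 0, 9])
```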

Our kernel setup for this MNIST test is the following

In [None]:
set_mnist_kernel = kernel_setters.kernel_helper(kernel_setters.set_gaussian_kernel, 0,1e-8 ,map_setters.set_mean_distance_map)

#### Keras Tensorflow scores

The benchmark method is described in Chapter \@ref(a-quick-tour-to-machine-learning). The following lines define a standard neural network for studying the MNIST problem.


In [None]:
import tensorflow as tf
tf_param = {'tfClassifier' : {'epochs': 10,
'batch_size':16,
'validation_split':0.1,
'loss': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
'optimizer':tf.keras.optimizers.Adam(0.001),
'activation':['relu',''],
'layers':[128,10],
'metrics':[tf.keras.metrics.SparseCategoricalAccuracy()]} }

We then run the benchmarks

In [None]:
scenarios = scenario_generator()
scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,tfClassifier(set_kernel = set_mnist_kernel),data_accumulator(),**codpy_param,**tf_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:574) lists the performance indicators for this test.


In [None]:
%%R
knitr::kable(py$results,  caption = "tensorflow: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

Finally, Figure \@ref(fig:584) shows the confusion matrix for the last scenario.

In [None]:
multi_plot([scenarios.predictor] ,add_confusion_matrix.plot_confusion_matrix,title='')
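
A confusion matrix can be sketched in a few lines. We assume here the common convention where rows index the true class and columns the predicted class; the plotting helper used above may follow a different layout, and `confusion_matrix` is a hypothetical stand-alone function.

```python
def confusion_matrix(expected, predicted, num_classes):
    """Return counts[i][j] = number of samples of true class i predicted as class j."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(expected, predicted):
        counts[true_label][predicted_label] += 1
    return counts

expected = [0, 1, 2, 2, 1, 0]
predicted = [0, 2, 2, 2, 1, 0]
matrix = confusion_matrix(expected, predicted, num_classes=3)
# Correct predictions lie on the diagonal; off-diagonal entries are misclassifications.
correct = sum(matrix[i][i] for i in range(3))
```

Dividing the diagonal sum by the number of samples recovers the accuracy score, so the matrix refines the score by showing which classes are confused with which.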

#### CodPy scores extrapolation

Starting from the training set $x \in \RR^{N_x \times 784}$, we extrapolate the labels $f_z$ and compare them to the test set labels $f(z)$, using the extrapolation operator defined in \@ref(eq:EI)-left.

In [None]:
scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,codpyexClassifier(set_kernel = set_mnist_kernel),data_accumulator(),**codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:594) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "codpy extrapolation: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

Finally, Figure \@ref(fig:585) shows the confusion matrix for the last scenario.

#### CodPy scores projection

In this section, we straightforwardly apply the projection operator \@ref(eq:P), where the training set is $x \in \RR^{N_x \times 784}$, and $y \in \RR^{N_y \times 784}$ is a randomly chosen subset of $x$.
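
The idea of projecting onto a smaller set $y$ can be sketched as a least-squares fit on kernel features centred at $y$. This is an illustration under assumptions: a Gaussian kernel, synthetic data, and a plain least-squares solve; the operator \@ref(eq:P) used by CodPy may be defined and regularized differently, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def project(x, fx, y, z, bandwidth=1.0):
    """Fit fx on kernel features centred at y (least squares), then evaluate at z."""
    k_xy = gaussian_kernel(x, y, bandwidth)              # shape (Nx, Ny)
    coeffs, *_ = np.linalg.lstsq(k_xy, fx, rcond=None)   # shape (Ny, Df)
    return gaussian_kernel(z, y, bandwidth) @ coeffs     # shape (Nz, Df)

# Synthetic data (an assumption of this sketch, not MNIST).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 2))
fx = np.cos(x[:, :1]) * x[:, 1:]
y = x[rng.choice(len(x), size=25, replace=False)]   # random subset of the training set
z = rng.uniform(-1.0, 1.0, size=(30, 2))
fz = project(x, fx, y, z)
rmse = float(np.sqrt(np.mean((fz - np.cos(z[:, :1]) * z[:, 1:]) ** 2)))
```

Only an $N_x \times N_y$ system is solved instead of an $N_x \times N_x$ one, which is the source of the trade-off between accuracy and execution time.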

In [None]:
scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,codpyprClassifier(set_kernel = set_mnist_kernel),data_accumulator(),**codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:595) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "codpy projection: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

Finally, Figure \@ref(fig:586) shows the confusion matrix for the last scenario.

#### Pytorch

The Pytorch neural network model is described in Chapter \@ref(pytorch-neural-network-model). We use the following parameter set to define this Pytorch machine.


In [None]:
torch_param = {'PytorchClassifier': {'epochs': 10,
'layers': [128],
'batch_size': 16,
'loss': nn.CrossEntropyLoss(),
'activation': nn.ReLU(),
'optimizer': torch.optim.Adam,
"datatype": "long",
"prediction": "labeled",
"out_layer": 10}}

In [None]:
scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,PytorchClassifier(set_kernel = set_mnist_kernel),data_accumulator(),**codpy_param, **torch_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:600) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "Pytorch Neural Network: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

#### Decision Tree

The decision tree model is described in Chapter \@ref(decision-tree-model).

In [None]:
scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,DecisionTreeClassifier(set_kernel = set_mnist_kernel),data_accumulator(), **codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:601) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "Decision tree classifier: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

#### AdaBoost

The AdaBoost model is described in Chapter \@ref(adaboost-model).

In [None]:
scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,AdaBoostClassifier(set_kernel = set_mnist_kernel),data_accumulator(),**codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:602) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "AdaBoost classifier: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

#### Gradient Boosting

The gradient boosting model is described in Chapter \@ref(gradient-boosting-model).


In [None]:
gb_scenarios_list = [ (784, 2**(i), 2**(i-2), 10000)  for i in np.arange(5,10,1)]

In [None]:
scenarios.run_scenarios(gb_scenarios_list,MNIST_data_generator_,GradientBoostingClassifier(set_kernel = set_mnist_kernel),data_accumulator(), **codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:603) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "Gradient Boosting classifier: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

#### XGBoost

The XGBoost model is described in Chapter \@ref(xgboost-model). We set its parameters as follows.

In [None]:
xgb_param = {'epochs': 5,
'max_depth': 3,
'eta' : 0.3,
'objective': 'multi:softmax',
'num_class': 10}

<!-- ```{python} -->
<!-- scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,label_xgboost_predictor(set_kernel = set_mnist_kernel),data_accumulator(), **xgb_param, **codpy_param) -->
<!-- results = scenarios.accumulator.get_output_datas().dropna(axis=1).T -->
<!-- ``` -->

<!-- We output at table \@ref(tab:604) the list of performance indicators for this test. -->

<!-- ```{r, label= 604} -->
<!-- knitr::kable(py$results,  caption = "XGBoost: indicators for MNIST")%>% -->
<!--   kable_styling(latex_options = "HOLD_position") -->
<!-- ``` -->

#### Random Forest

The random forest model and its parameter set are described in Chapter \@ref(random-forest-model).

In [None]:
scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,RandomForestClassifier(set_kernel = set_mnist_kernel),data_accumulator(),**codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:605) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "Random Forest classifier: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

#### Support vector classifier

The SVC model and its parameter set are described in Chapter \@ref(support-vector-classifier-model).

In [None]:
scenarios.run_scenarios(scenarios_list,MNIST_data_generator_,SVC(set_kernel = set_mnist_kernel),data_accumulator(), **codpy_param)
results = scenarios.accumulator.get_output_datas().dropna(axis=1).T
df_results = pd.concat([df_results,results.T],axis=0)

Table \@ref(tab:606) lists the performance indicators for this test.

In [None]:
%%R
knitr::kable(py$results,  caption = "SVC classifier: indicators for MNIST")%>%
  kable_styling(latex_options = "HOLD_position")

In [None]:
%%R
knitr::kable(py$df_results,  caption = "All methods: aggregated indicators")%>%
  kable_styling(latex_options = "HOLD_position")

### Comparing methods

The following figures compare the methods in terms of scores (Figure \@ref(fig:309)), discrepancy errors (Figure \@ref(fig:310)), and execution time (Figure \@ref(fig:311)). We interpret these results as follows.

* First, notice that the kernel method *codpy lab extra*, that is, the extrapolation method, obtains both the best scores and the worst execution time.
* Notice also that one minus the discrepancy error matches the scores of the method *codpy lab extra*. This indicates that the discrepancy error is a pertinent indicator.
* Another kernel method, *codpy lab proj*, that is, the projection method above, is more balanced [^268].
* Both kernel methods are shipped with a very standard kernel, the Gaussian one, which is the only parameter for kernel methods. We emphasize that kernel engineering can easily improve these results. We do not present these improved kernel methods, as our purpose is to benchmark standard methods.


[^268]: except for the gradient boosting method, for which we did not succeed in finding a competitive set of parameters for this test.


In [None]:
scenarios.compare_plots(
axis_field_labels = [("Nx","scores")]
)

In [None]:
scenarios.compare_plots(
axis_field_labels = [("Nx","discrepancy_errors")]
)

In [None]:
scenarios.compare_plots(
axis_field_labels = [("Nx","execution_time")]
)

## Reconstruction problems: learning from sub-sampled signals in tomography

This numerical experiment illustrates the capability of learning machines to solve reconstruction problems from sub-sampled signals. In this test, we learn from a well-established algorithm, SART, in order to speed up the reconstruction.

There are many applications of such problems. We illustrate this section with a problem from medical image reconstruction, which can also serve as a decision-support tool for medical diagnosis. However, such problems occur in a wide variety of other fields: biology, oceanography, astrophysics, and others.

Poor input signal quality can sometimes be a choice. For instance, in nuclear medicine, it is preferable to work with lower radioisotope concentrations for obvious health reasons.
Another motivation for sub-sampling signals is to accelerate data acquisition from expensive machines.

We illustrate this section by reconstructing a signal from a sub-sampled SPECT (tomography) problem, which we now describe.


In [None]:
from preamble import *
plt.close('all')
from radon import *
codpy_param = {'rescale:xmax': 1000,
'rescale:seed':42,
'sharp_discrepancy:xmax':1000,
'sharp_discrepancy:seed':30,
'sharp_discrepancy:itermax':5,
'discrepancy:xmax':500,
'discrepancy:ymax':500,
'discrepancy:zmax':500,
'discrepancy:nmax':2000,
'validator_compute': ['accuracy_score']}

### A problem coming from SPECT tomography

The purpose of this test is to illustrate sub-sampled reconstruction in the context of medical imaging, more precisely from sub-sampled SPECT images. To that end, we start by collecting a set of *high resolution* images[^594]. The particular set is not important for our illustration in this section. However, it should be chosen carefully for a real production problem.

[^594]: the image set is available publicly at this [kaggle link](https://www.kaggle.com/vbookshelf/computed-tomography-ct-images/). 

This image database consists of high-resolution (512x512) images, approximately 30 images for each of 82 patients. The training set is built from the first 81 patients. The 82nd patient is used for the test set. We first transform the training set database to produce our data. For each image in the training set (2470 images):

* We perform a "high" resolution (256x256) Radon transform [^357], whose output is called a **sinogram** [^324]. A sinogram is quite close to a Fourier transform of the original image, generating sinusoids.
* We perform a "low" resolution (8x256) Radon transform.
* We reconstruct the original image from the high-resolution sinogram to simulate high-resolution SPECT images from these data. The reconstruction algorithm consists in computing an inverse Radon transform [^424].
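
The three steps above can be sketched with the scikit-image functions cited in the footnotes, `radon` and `iradon_sart`. This is a minimal illustration under assumptions: it uses a small synthetic phantom instead of the CT images, a 64x64 grid instead of 256x256 to keep it fast, and hypothetical helper names; the file `radon.py` used for the actual test may proceed differently.

```python
import numpy as np
from skimage.transform import radon, iradon_sart

def make_phantom(size=64):
    """Synthetic image (not CT data): two disks, zero outside the inscribed circle."""
    rows, cols = np.mgrid[:size, :size]
    image = np.zeros((size, size))
    image[np.hypot(cols - size / 2, rows - size / 2) < 0.40 * size] = 0.5
    image[np.hypot(cols - 0.60 * size, rows - 0.45 * size) < 0.12 * size] = 1.0
    return image

def sinogram(image, n_angles):
    """Radon transform sampled at n_angles projection angles in [0, 180) degrees."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    return radon(image, theta=theta, circle=True), theta

def sart_reconstruction(sino, theta, n_iter=3):
    """Run n_iter SART sweeps, feeding each estimate back as the next initial guess."""
    estimate = iradon_sart(sino, theta=theta)
    for _ in range(n_iter - 1):
        estimate = iradon_sart(sino, theta=theta, image=estimate)
    return estimate

image = make_phantom(64)
sino_high, theta_high = sinogram(image, n_angles=64)   # "high" resolution sinogram
sino_low, theta_low = sinogram(image, n_angles=8)      # sub-sampled sinogram
target = sart_reconstruction(sino_high, theta_high)    # plays the role of a training value
training_input = sino_low.ravel()                      # flattened low-resolution sinogram
```

In this sketch, one training pair is the flattened low-resolution sinogram together with the flattened image reconstructed from the high-resolution sinogram.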

An example of the training set construction is presented in Figure \ref{fig:SPECT}. On the left is the image reconstructed from the "high resolution" sinogram (middle). The low resolution sinogram is plotted on the right.

In [None]:
%%R
knitr::include_graphics(path = "./CodPyFigs/SPECT.png")

[^357]: An introduction to the Radon transform can be found on [this Wikipedia page](https://en.wikipedia.org/wiki/Radon_transform).

[^324]: We used the standard Radon transform from scikit-image, [available at this url](https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.radon).

[^424]: We used the SART algorithm with 3 iterations for reconstruction, [available at this url](https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.iradon_sart).

The test then consists in reconstructing all images of the 82nd patient using low-resolution sinograms.

### Performing the test

We present here the results of a benchmark comparing a kernel-based method with the SART algorithm[^269].

[^269]: We did not succeed in finding competitive parameters for other methods.

Following the notation of Section \@ref(notation-and-setup), we introduce:

* The training set $x \in \RR^{2473 \times 2304}$, consisting of 2473 sinograms of resolution $8 \times 256$: all low-resolution sinograms of the first 81 patients, plus the first one of the 82nd patient. This last sinogram is added to check an important feature in these problems: the learning machine must be able to retrieve an example it has already been given.
* The test set $z \in \RR^{29 \times 2304}$, consisting of 29 sinograms of the 82nd patient, of resolution $8 \times 256$.
* The training values $f_x \in \RR^{2473 \times 65536}$, consisting of the 2473 "high-resolution" images.
* The ground truth values $f(z) \in \RR^{29 \times 65536}$, consisting of 29 "high-resolution" images.

We perform the tests and output the results in Table \ref{tab:207}. The columns are the predictor identifier, $D,N_x,N_y,N_z,D_f$, the execution time, and the score, computed with the RMSE \% error indicator, see \@ref(eq:rmse).

* The first line, named *exact*, simply outputs the original images, leading to zero error.
* The second one, named *SART*, reconstructs the images with the SART algorithm from the sub-sampled data.
* The third one, named *codpy*, reconstructs the images from the sub-sampled data with the kernel extrapolation method \@ref(eq:EI).
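
As an illustration of the score column, a relative RMSE expressed in percent could be computed as below. This is one common normalization, assumed here for the sake of the example; the definition in \@ref(eq:rmse) may differ, and `rmse_percent` is a hypothetical helper.

```python
import math

def rmse_percent(predicted, expected):
    """Relative root-mean-square error in percent: 100 * ||predicted - expected|| / ||expected||."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    squared_error = sum((p - e) ** 2 for p, e in zip(predicted, expected))
    squared_norm = sum(e ** 2 for e in expected)
    if squared_norm == 0.0:
        raise ValueError("expected values must not all be zero")
    return 100.0 * math.sqrt(squared_error / squared_norm)

# An "exact" predictor returns the ground truth itself, hence a zero error.
exact_score = rmse_percent([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# A predictor that deviates from the ground truth gets a strictly positive error.
noisy_score = rmse_percent([3.0, 4.0], [0.0, 5.0])
```

With images, `predicted` and `expected` would be the flattened pixel values of the reconstructed and original images.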

Figure \@ref(fig:359) plots the first 8 images, with the original on the left, the SART reconstruction in the middle, and our algorithm on the right. One can check visually that the kernel method reconstructs the original image better. It would be erroneous to conclude that this reconstruction process performs better than the SART algorithm, and this is not our claim here. We simply illustrate the capacity of our algorithm to recognize existing patterns: indeed, note that the first image is perfectly reconstructed, as it is part of the training set. This property emphasizes that such methods are well suited to pattern recognition problems, as automated tools to support professionals' diagnoses.

In [None]:
%%R
knitr::include_graphics(here::here("CodPyFigs", "reconstruction.png"))

In [None]:
# Run the file radon.py separately (remotely), as its execution time is too long.