In [None]:
%%R


In [None]:
from preamble import *
plt.close('all')
from clusteringCHK import *
codpy_param = {'rescale:xmax': 1000,
'rescale:seed':42,
'sharp_discrepancy:xmax':1000,
'sharp_discrepancy:seed':30,
'sharp_discrepancy:itermax':5,
'discrepancy:xmax':500,
'discrepancy:ymax':500,
'discrepancy:zmax':500,
'discrepancy:nmax':2000,
'validator_compute': ['accuracy_score','discrepancy_error','norm']}
MNIST_data_generator_ = MNIST_data_generator()

# Application for unsupervised machine learning

In this section we apply some clustering methods for a number of use cases.

We benchmark our kernel-based algorithm, see section \@ref(a-kernel-based-clustering-algorithm), against the popular k-means algorithm. Both are distance-based minimization algorithms that aim to solve the problem \@ref(eq:dist), which we recall here:

\[
y = \arg \inf_{y \in \RR^{N_y \times D}} d(x,y)
\]
The clusters $y\in \RR^{N_y \times D}$ are the result of this minimization, where:

* For k-means based algorithms, the distance is called the *inertia*, see section \@ref(inertia).

* For kernel-based algorithms, the distance is called the *discrepancy error*, see section \@ref(discrepancy-error-1).

Importantly, if the distance functional $d(x,y)$ is not convex, then a solution to \@ref(eq:dist) might not be unique. For instance, a k-means algorithm usually outputs different clusters at different runs.
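The non-uniqueness can be seen on a tiny example. The following is a minimal sketch, assuming plain Lloyd iterations written with NumPy (the helper names `inertia` and `lloyd` are hypothetical and are not part of codpy or Scikit). Four points at the corners of a rectangle are clustered twice with $N_y = 2$, starting from two different initial centroids.

```python
import numpy as np

def inertia(x, y):
    """Sum of squared distances from each point of x to its nearest centroid in y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd(x, y0, n_iter=50):
    """Plain Lloyd iterations started from the initial centroids y0."""
    y = np.array(y0, dtype=float)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        new_y = np.array([x[labels == k].mean(axis=0) if np.any(labels == k) else y[k]
                          for k in range(len(y))])
        if np.allclose(new_y, y):
            break
        y = new_y
    return y

# Corners of a 4 x 1 rectangle.
x = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

y_a = lloyd(x, [[0.0, 0.5], [4.0, 0.5]])  # left / right split
y_b = lloyd(x, [[2.0, 0.0], [2.0, 1.0]])  # bottom / top split

print("inertia, run A:", inertia(x, y_a))
print("inertia, run B:", inertia(x, y_b))
```

Both runs stop at a fixed point of the iteration, but the two sets of centroids and their inertia differ. This is why k-means implementations are commonly restarted from several random initializations.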


In [None]:
codpy_param = {'rescale:xmax': 1000,
'rescale:seed':42,
'sharp_discrepancy:xmax':1000,
'sharp_discrepancy:seed':30,
'sharp_discrepancy:itermax':5,
'discrepancy:xmax':500,
'discrepancy:ymax':500,
'discrepancy:zmax':500,
'discrepancy:nmax':2000,
'num_threads':25,
'validator_compute':['accuracy_score','discrepancy_error','inertia']}

## Classification problem: handwritten digits

The MNIST test is also studied in section \@ref(short-introduction-to-MNIST). Here we treat it as a semi-supervised learning problem: we use the training set $x \in \mathbb{R}^{N_x \times D}$ to compute the cluster centroids $y \in \mathbb{R}^{N_y \times D}$. Then we use these clusters to predict the test labels $f_z \in \mathbb{R}^{N_z \times D_f}$, corresponding to the test set $z \in \mathbb{R}^{N_z \times D}$.
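One common way to turn centroids into a label predictor is sketched below. It is an assumption about the general technique, not a description of how codpy or the classifier classes used in this section do it: each centroid receives the majority training label of the points attached to it, and each test point receives the label of its nearest centroid. The function names and the toy data are hypothetical, and labels are stored as integers rather than as rows of $f_z$.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the nearest centroid for every row of points."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def centroid_labels(x, fx, centroids, n_classes):
    """Majority training label among the points attached to each centroid."""
    owner = nearest_centroid(x, centroids)
    labels = np.zeros(len(centroids), dtype=int)
    for k in range(len(centroids)):
        members = fx[owner == k]
        if members.size:
            labels[k] = np.bincount(members, minlength=n_classes).argmax()
    return labels

# Toy stand-ins for the training set x, its labels fx, the centroids y and the test set z.
x = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
fx = np.array([0, 0, 1, 1])
y = np.array([[0.1, 0.05], [5.05, 4.95]])
z = np.array([[0.3, -0.1], [4.8, 5.2]])
fz_true = np.array([0, 1])

y_labels = centroid_labels(x, fx, y, n_classes=2)
fz_pred = y_labels[nearest_centroid(z, y)]
accuracy = (fz_pred == fz_true).mean()
print("predicted labels:", fz_pred, "accuracy:", accuracy)
```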

The following lines define our setting for this test.

In [None]:
set_kernel = set_gaussian_kernel
scenarios_list = [ (-1, 1000, 2**i,1000) for i in np.arange(7,9,1)]
scenario_generator_ = scenario_generator()
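For reference, the comprehension above expands to two scenarios. Reading the four entries of each tuple as $(D, N_x, N_y, N_z)$, with `-1` presumably meaning "use the full available size", is an assumption based on the column order of the results tables, not something stated by the code itself.

```python
import numpy as np

# Same comprehension as in the setting above.
scenarios_list = [(-1, 1000, 2**i, 1000) for i in np.arange(7, 9, 1)]

# Assumed layout of each tuple: (D, Nx, Ny, Nz).
for D, Nx, Ny, Nz in scenarios_list:
    print(f"D={D}, Nx={Nx}, Ny={Ny}, Nz={Nz}")
```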

### Scikit k-means

First we use Scikit's k-means implementation, which partitions the input data $x \in \mathbb{R}^{N_x \times D}$ into $N_y$ sets so as to minimize the within-cluster sum of squares, called the "inertia", see \@ref(inertia). The inertia is the sum of the squared distances from each point to the centroid of its cluster, where the centroids are $y \in \mathbb{R}^{N_y \times D}$. The k-means algorithm starts with a group of randomly initialized centroids and then iteratively optimizes their positions until the centroids stabilize or the maximum number of iterations is reached.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,MNIST_data_generator(),scikitClusterClassifier(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

The result of the k-means algorithm is $N_y$ clusters in $D=784$ dimensions, i.e. $y \in \mathbb{R}^{N_y\times D}$. Note that the cluster centroids are themselves 784-dimensional points, and each can be interpreted as the "typical" digit within its cluster. The figure \@ref(fig:847) plots some examples of computed clusters, interpreted as images. As can be seen, they are clearly recognizable.

In [None]:
multi_plot(scenario_generator_.accumulator.get_ys(),fun_plot = show_mnist_pictures,mp_ncols = 1, mp_max_items = 10)

The table \@ref(tab:848) displays the dimension of the problem $D$, the size of the training set $N_x$, the number of clusters $N_y$, the size of the test set $N_z$, the execution time, the inertia and discrepancy errors, and the scores. Higher scores and lower inertia and discrepancy errors indicate better results.

In [None]:
%%R
knitr::kable(py$results, caption = "Scikit: Ny clusters") %>%
  kable_styling(latex_options = "HOLD_position")

### Scikit minibatch k-means

We repeat the previous test with another Scikit algorithm, minibatch k-means. This is a k-means variant that updates the centroids from small random batches of the data in order to reduce computation time.
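The idea behind mini-batch updates can be sketched as follows. This is a textbook version written with NumPy, under the assumption of a per-centroid learning rate equal to the inverse of the number of points seen so far. The function `minibatch_kmeans` is hypothetical, and Scikit's implementation may differ in its details.

```python
import numpy as np

def minibatch_kmeans(x, y0, batch_size=20, n_steps=200, seed=0):
    """Textbook mini-batch k-means: each step looks at a small random batch only."""
    rng = np.random.default_rng(seed)
    y = np.array(y0, dtype=float)
    counts = np.zeros(len(y))  # number of points seen so far by each centroid
    for _ in range(n_steps):
        batch = x[rng.choice(len(x), size=batch_size, replace=False)]
        d2 = ((batch[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
        owner = d2.argmin(axis=1)
        for point, k in zip(batch, owner):
            counts[k] += 1
            eta = 1.0 / counts[k]  # learning rate shrinks as the centroid sees more points
            y[k] = (1.0 - eta) * y[k] + eta * point
    return y

# Two well-separated synthetic blobs, centered at (0, 0) and (5, 5).
rng = np.random.default_rng(42)
x = np.vstack([rng.normal(0.0, 0.3, size=(200, 2)),
               rng.normal(5.0, 0.3, size=(200, 2))])

y = minibatch_kmeans(x, y0=[[1.0, 1.0], [4.0, 4.0]])
print(y)
```

Each step costs a number of distance evaluations proportional to the batch size rather than to $N_x$, which is where the time saving comes from.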

In [None]:
scenario_generator_.run_scenarios(scenarios_list,MNIST_data_generator(),MinibatchClusterClassifier(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

The figure \@ref(fig:827) plots some examples of computed clusters, interpreted as images. 

In [None]:
multi_plot(scenario_generator_.accumulator.get_ys(),fun_plot = show_mnist_pictures,mp_ncols = 1, mp_max_items = 10)

The table \@ref(tab:828) displays the performance indicator for this test. 

In [None]:
%%R
knitr::kable(py$results, caption = "Scikit minibatch: Ny clusters") %>%
  kable_styling(latex_options = "HOLD_position")

### Codpy

In this section we apply codpy's algorithm described in \@ref(a-kernel-based-clustering-algorithm) using the distance $d_k(x,y)$ induced by a Gaussian kernel:
$$
k(x,y)=\exp(-|x-y|^2)
$$
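To make the kernel-induced distance concrete, the sketch below evaluates the standard squared maximum mean discrepancy between two equally weighted point sets with the Gaussian kernel above. We assume here that this formula corresponds to the discrepancy error of the referenced section. codpy's own rescaling and weighting are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b):
    """Matrix of k(a_i, b_j) = exp(-|a_i - b_j|^2) for all pairs of rows."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2)

def discrepancy_squared(x, y):
    """Squared kernel discrepancy between the equally weighted point sets x and y."""
    kxx = gaussian_kernel(x, x).mean()
    kyy = gaussian_kernel(y, y).mean()
    kxy = gaussian_kernel(x, y).mean()
    return kxx + kyy - 2.0 * kxy

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
good = np.array([[0.0, 0.5], [1.0, 0.5]])  # centroids placed inside the data
bad = np.array([[3.0, 3.0], [4.0, 3.0]])   # centroids placed far from the data

print("good centroids:", discrepancy_squared(x, good))
print("bad centroids: ", discrepancy_squared(x, bad))
```

Centroids that represent the data well give a smaller value, and the value vanishes when the two point sets coincide.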
We repeat the same test as in the previous section. We first run all scenarios.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,MNIST_data_generator(),codpyClusterClassifier(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

Then we display in figure \@ref(fig:858) some computed clusters as images. As for k-means, these are recognizable digits.

In [None]:
multi_plot(scenario_generator_.accumulator.get_ys(),fun_plot = show_mnist_pictures,mp_ncols = 1, mp_max_items = 10)

The table \@ref(tab:849) displays computed performance indicators for all scenarios. 

In [None]:
%%R
knitr::kable(py$results, caption = "codpy: Ny clusters") %>%
  kable_styling(latex_options = "HOLD_position")

### Benchmarks results

Finally, we show a benchmark plot comparing Scikit's k-means and minibatch k-means with codpy's sharp discrepancy algorithm in terms of accuracy scores, discrepancy errors, inertia and execution time.

In [None]:
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","scores"),("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")])

Notice that the scores are quite high compared to supervised methods for a similar size of training set, see the results in section \@ref(apps-i---supervised-machine-learning-applications). Notice also that codpy, whose algorithm relies on minimizing a discrepancy distance, displays an inertia that is lower than minibatch, and quite comparable to k-means. This is surprising, as k-means algorithms are based on inertia minimization. Moreover, the scores seem to indicate that the discrepancy distance is a more reliable criterion than inertia on this pattern recognition problem.

## German credit risk

The original dataset contains 1000 entries with 20 categorical/symbolic attributes. Each entry represents a person who takes out a credit from a bank. The goal is to classify each person as a good or bad credit risk according to the set of attributes. The dataset is described on this [kaggle page](https://www.kaggle.com/uciml/german-credit).

The following lines define our setting for this test.

In [None]:
scenarios_list = [(-1, -1, i,-1) for i in range(10, 21,10)]
scenario_generator_ = scenario_generator()

### Scikit

The test follows the same method as in the previous section. We first run our scenarios in the following lines.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,german_credit_data_generator(),scikitClusterPredictor(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

The result of the k-means algorithm is $N_y$ clusters in $D$ dimensions. Notice that the cluster centroids are themselves $D$-dimensional points.

Next we visualize the clusters and corresponding centroids computed by Scikit, where we vary the number of clusters $N_y$ from 1 to 8. In this example, a high number of clusters leads to overfitting, and the resulting clusters can no longer be interpreted when $N_y = 8$.

In [None]:
scenario_generator_.accumulator.plot_clusters(**codpy_param)

The table \@ref(tab:374) displays inertia, discrepancy errors and execution time performance indicators. 


In [None]:
%%R
knitr::kable(py$results, format="latex", caption = "performance indicator for scikit")%>%
  kable_styling(full_width = T,font_size = 6)

### Codpy

In this section we apply the same methodology as in the previous sections.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,german_credit_data_generator(),codpyClusterPredictor(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

The result of codpy's sharp discrepancy algorithm is $N_y$ clusters in $D$ dimensions. Notice that the cluster centroids are themselves $D$-dimensional points.

Next we visualize the clusters and corresponding centroids computed using codpy's sharp discrepancy algorithm, where we vary the number of clusters $N_y$ from $1$ to $8$. In this example, a high number of clusters leads to overfitting, and the resulting clusters can no longer be interpreted when $N_y = 8$.

In [None]:
scenario_generator_.accumulator.plot_clusters(**codpy_param)

The table \@ref(tab:375) displays inertia, discrepancy errors and execution time performance indicators. 


In [None]:
%%R
knitr::kable(py$results, format="latex", caption = "performance indicator for codpy")%>%
  kable_styling(full_width = T,font_size = 6)

### Benchmarks results

Finally, we show a benchmark plot displaying the computed performance indicators of Scikit's k-means and codpy's sharp discrepancy algorithms in terms of discrepancy errors, inertia, accuracy scores (when applicable) and execution time.

In [None]:
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")])

## Credit card marketing strategy

The problem can be stated as follows: develop a customer segmentation in order to define a marketing strategy. The sample dataset summarizes the usage behavior of 8,950 active credit card holders during the last 6 months. It contains 17 features and 8,950 records. The data describes each customer's purchase and payment habits, such as how often the customer makes installment purchases or cash advances, how much is paid, etc. By inspecting each customer, we can find which type of purchase he/she is keen on, or whether he/she prefers cash advances over purchases. The dataset is detailed on this dedicated [kaggle page](https://www.kaggle.com/arjunbhasin2013/ccdata).


The following lines define our setting for this test.

In [None]:
scenarios_list = [(-1, -1, i,-1) for i in np.arange(2,21,3)]
scenario_generator_ = scenario_generator()

### Scikit

First we use Scikit's k-means implementation, which partitions the input data $x \in \mathbb{R}^{N_x \times D}$ into $N_y$ sets so as to minimize the within-cluster sum of squares, called the "inertia". The inertia is the sum of the squared distances from each point to the centroid of its cluster, where the centroids are $y \in \mathbb{R}^{N_y \times D}$. The k-means algorithm starts with a group of randomly initialized centroids and then iteratively optimizes their positions until the centroids stabilize or the maximum number of iterations is reached.



In [None]:
scenario_generator_.run_scenarios(scenarios_list,credit_card_data_generator(),scikitClusterPredictor(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

The result of the k-means algorithm is $N_y$ clusters in $D$ dimensions. Notice that the cluster centroids $y \in \mathbb{R}^{N_y \times D}$ are themselves $D$-dimensional points.

Next we visualize the clusters and corresponding centroids of Scikit's k-means implementation, where we vary the number of clusters $N_y$ from $2$ to $4$.

In [None]:
scenario_generator_.accumulator.plot_clusters(**codpy_param)

The table below demonstrates the performance of Scikit's k-means algorithm in terms of inertia, discrepancy errors and time. 

In [None]:
%%R
knitr::kable(py$results, caption = "Scikit:2,4 clusters") %>%
  kable_styling(latex_options = "HOLD_position")

### Codpy

In this section we apply the same methodology as in the previous sections.


In [None]:
scenario_generator_.run_scenarios(scenarios_list,credit_card_data_generator(),codpyClusterPredictor(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

The result of codpy's sharp discrepancy algorithm is $N_y$ clusters in $D$ dimensions. Notice that the cluster centroids are themselves $D$-dimensional points.

Next we visualize the clusters and corresponding centroids of codpy's sharp discrepancy algorithm, where we vary the number of clusters $N_y$ from $2$ to $4$.

In [None]:
scenario_generator_.accumulator.plot_clusters(**codpy_param)

The table below demonstrates the performance of codpy's sharp discrepancy algorithm in terms of inertia, discrepancy errors and time.

In [None]:
%%R
knitr::kable(py$results, caption = "Codpy:2,4 clusters") %>%
  kable_styling(latex_options = "HOLD_position")

### Benchmarks results

Finally, we show a benchmark plot displaying the computed performance indicators of Scikit's k-means and codpy's sharp discrepancy algorithms in terms of discrepancy errors, inertia, accuracy scores (when applicable) and execution time.

In [None]:
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")])

## Credit card fraud detection

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It presents transactions that occurred over two days, with $492$ frauds out of $284,807$ transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for $0.172\%$ of all transactions.

The study addresses a fraud detection system that analyzes customer transactions in order to identify the patterns that lead to fraud. To facilitate this pattern recognition work, the k-means clustering algorithm, an unsupervised learning algorithm, is applied to find the normal usage patterns of credit card users based on their past activity \cite{ST2019}.

The dataset contains only numerical input variables, which are the result of a PCA transformation. The only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.

Feature 'Class' is the response variable and it takes value $1$ in case of fraud and $0$ otherwise. You can find more details on this Credit Card Fraud use case following this [kaggle page](https://www.kaggle.com/mlg-ulb/creditcardfraud) link.
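Because of this class imbalance, accuracy scores alone can be misleading, which is one reason to also look at a confusion matrix. As a minimal illustration using only the counts quoted above, consider a hypothetical baseline that labels every transaction as normal (class $0$):

```python
n_total = 284_807  # transactions in the dataset
n_fraud = 492      # transactions with Class == 1

# Hypothetical baseline: predict class 0 (no fraud) for every transaction.
true_negative = n_total - n_fraud
false_positive = 0
false_negative = n_fraud
true_positive = 0

accuracy = (true_positive + true_negative) / n_total
recall = true_positive / (true_positive + false_negative)

# Rows: true class (0, 1); columns: predicted class (0, 1).
confusion = [[true_negative, false_positive],
             [false_negative, true_positive]]

print(f"accuracy = {accuracy:.4f}, recall = {recall:.1f}")
print(confusion)
```

This baseline reaches an accuracy above $99.8\%$ while detecting no fraud at all, so the off-diagonal entries of the confusion matrix carry the relevant information.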

In [None]:
scenarios_list = [( -1, 500, i,-1 ) for i in np.arange(15,100,15)]
scenario_generator_ = scenario_generator()

### Scikit

We run our tests in the following line.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,credit_card_fraud_data_generator(),scikitClusterClassifier(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

The table \@ref(tab:376) displays inertia, discrepancy errors and execution time performance indicators. 

In [None]:
%%R
knitr::kable(py$results, caption = "Scikit: Ny clusters") %>%
  kable_styling(latex_options = "HOLD_position")

Finally, we also output the confusion matrix for the last scenario in figure \@ref(fig:589).

In [None]:
multi_plot([scenario_generator_.predictor] ,add_confusion_matrix.plot_confusion_matrix,title='')

### Codpy

We repeat the same benchmark methodology as in the previous sections. We first run all scenarios.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,credit_card_fraud_data_generator(),codpyClusterClassifier(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)

The table \@ref(tab:749) displays computed performance indicators for all scenarios. 

In [None]:
%%R
knitr::kable(py$results, caption = "codpy: Ny clusters") %>%
  kable_styling(latex_options = "HOLD_position")

Finally, we also output the confusion matrix for the last scenario in figure \@ref(fig:580).

In [None]:
multi_plot([scenario_generator_.predictor] ,add_confusion_matrix.plot_confusion_matrix,title='')

### Benchmarks results

Finally, we show a benchmark plot comparing the performance of Scikit's k-means and codpy's sharp discrepancy algorithms in terms of discrepancy errors, inertia, accuracy scores (when applicable) and execution time.

In [None]:
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","scores"),("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")])

## Portfolio of stock clustering

This case represents daily stock price movements $x \in \mathbb{R}^{N_{x} \times D}$ (i.e. the dollar difference between the closing and opening prices for each trading day) from 2010 to 2015.
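As a small illustration of how such a matrix is formed, the sketch below builds $x$ from opening and closing prices. The tickers and prices are made up for the example, and the layout (one row per stock, one column per trading day) is an assumption.

```python
import numpy as np

# Hypothetical tickers and prices: one row per stock, one column per trading day.
names = ["AAA", "BBB", "CCC"]
open_prices = np.array([[10.0, 10.5, 10.2],
                        [50.0, 49.0, 49.5],
                        [20.0, 20.0, 21.0]])
close_prices = np.array([[10.5, 10.2, 10.6],
                         [49.0, 49.5, 49.0],
                         [20.0, 21.0, 20.5]])

# Daily movement: dollar difference between closing and opening prices.
x = close_prices - open_prices

for name, row in zip(names, x):
    print(name, row)
```

Each row of `x` is then one point to be clustered, so that stocks with similar daily movements end up in the same cluster.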


In [None]:
scenarios_list = [(-1, -1, i,-1) for i in range(10, 21,10)]
scenario_generator_ = scenario_generator()

### Scikit

The test follows the same method as in the previous section. We first run our scenarios in the following lines.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,company_stock_movements_data_generator(),scikitClusterPredictor(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
idx = scenario_generator_.predictor.get_map_cluster_indices()

The table \@ref(tab:371) lists the stocks assigned to each cluster.

In [None]:
%%R
knitr::kable(py$idx, caption = "Scikit: Ny clusters") %>%
  kable_styling(latex_options = "HOLD_position")

The k-means clusters sort the stocks into coherent groups. The performance indicators for this step are the following.

In [None]:
%%R
knitr::kable(py$results, caption = "Ny clusters") %>%
  kable_styling(latex_options = "HOLD_position")

### Codpy

The test follows the same method as in the previous section. We first run our scenarios in the following lines.

In [None]:
scenario_generator_.run_scenarios(scenarios_list,company_stock_movements_data_generator(),codpyClusterPredictor(set_kernel = set_kernel),cluster_accumulator(),**codpy_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
idx = scenario_generator_.predictor.get_map_cluster_indices()

The table \@ref(tab:372) displays inertia, discrepancy errors and execution time performance indicators. 

In [None]:
%%R
knitr::kable(py$results, caption = "Codpy: Ny clusters") %>%
  kable_styling(latex_options = "HOLD_position")

The sharp discrepancy clusters also sort the stocks into coherent groups. The following table lists the stocks assigned to each cluster.

In [None]:
%%R
knitr::kable(py$idx, caption = "Codpy: Stocks' clusters", col.names = c("Stocks"))  %>%
  kable_styling(latex_options = "HOLD_position")

### Benchmarks results

Finally, we show a benchmark plot comparing the performance of Scikit's k-means and codpy's sharp discrepancy algorithms in terms of discrepancy errors, inertia, accuracy scores (when applicable) and execution time.

In [None]:
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")])