# Clustering with <a href="https://cran.r-project.org/"><img src="https://cran.r-project.org/Rlogo.svg" style="max-width: 40px; display: inline" alt="R"/></a>

---

The objective of this tutorial is to apply the different concepts studied during the course on clustering to identify groups of wines.

---

## Data exploration

In this tutorial we will study the _wine_ dataset (_wine.txt_).
This dataset includes physico-chemical measurements performed on a sample of $n=600$ wines (red and white) from Portugal. These measurements are complemented by a sensory evaluation of the quality by a set of experts. Each wine is described by the following variables:
- _Quality:_ Wine quality according to experts (“bad”, “medium”, “good”),
- _Type:_ 1 for red wine and 0 for white wine,
- _AcidVol:_ The volatile acid content (in g/dm³ of acetic acid),
- _AcidCitr:_ The citric acid content (in g/dm³),
- _SO2lbr:_ The measurement of free sulfur dioxide (in mg/dm³),
- _SO2tot:_ Total sulfur dioxide measurement (in mg/dm³),
- _Density:_ The density (in g/cm³),
- _Alcohol:_ The alcohol level (in % of Vol.).

### Some important packages

In [None]:
# install.packages('...')

In [None]:
library(ggplot2)
library(reshape)
library(gridExtra)

##### <span style="color:purple">**Todo:** Load the _wine.txt_ dataset and: </span>

- Use the `str()` function to show information about the variables. Are all variables of the appropriate type?
- If not, convert the qualitative variables to factors with the `as.factor` function.
- Rename the levels of the variable _Type_: 1 = red and 0 = white.
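The type conversion can be illustrated on a small made-up data frame. This is only a sketch: the data frame `toy` and its values are hypothetical and simply mimic the coding of the _wine_ data, so the steps still have to be adapted to the real file.

```r
# Toy data frame mimicking the coding of the wine data (values are made up)
toy <- data.frame(
  Quality = c("bad", "good", "medium", "good"),
  Type    = c(1, 0, 0, 1),
  Alcohol = c(9.5, 12.1, 10.8, 11.3)
)
str(toy)  # Quality is stored as character, Type as numeric

# Convert the qualitative variables to factors
toy$Quality <- as.factor(toy$Quality)
# factor() with labels= converts and renames the levels in one step
toy$Type <- factor(toy$Type, levels = c(0, 1), labels = c("white", "red"))

str(toy)          # both variables are now factors
table(toy$Type)   # counts per wine type
```

Using `factor(..., levels=, labels=)` makes the mapping between codes and names explicit, which avoids relying on the default ordering of the levels.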

In [None]:
### TO BE COMPLETED ### 

wine = ...

[...]

head(wine)

In [None]:
# solutions/data/load_data.r

### Exploratory data analysis

In [None]:
library(corrplot)

In [None]:
summary(wine)

##### <span style="color:purple">**Todo:** Descriptive statistics and bivariate analysis: </span>
- Describe the variables with the `summary` function,
- Draw boxplots of the quantitative variables and analyze the results,
- Describe the qualitative variables graphically (bar plots),
- Analyze the correlations between the numeric variables.
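As an illustration of these steps, here is a minimal base-R sketch on the built-in `iris` data (an assumption made so that the example runs without _wine.txt_). The same calls should carry over to the numeric and factor columns of `wine`.

```r
data(iris)

# Split the variables by type
num <- iris[, sapply(iris, is.numeric)]   # quantitative variables

# Numerical summary
summary(num)

# Boxplots of the quantitative variables (note the differences in scale)
boxplot(num, main = "Quantitative variables")

# Bar plot of a qualitative variable
barplot(table(iris$Species), main = "Species")

# Correlation matrix of the numeric variables
cor_mat <- cor(num)
round(cor_mat, 2)
```

Large differences in scale between the boxes and strong pairwise correlations are the two things worth noting here, since both affect a subsequent PCA.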

In [None]:
### TO BE COMPLETED ### 
# Descriptive statistics of quantitative data


In [None]:
# solutions/data/quanti.r

In [None]:
### TO BE COMPLETED ### 
# Descriptive statistics of qualitative data


In [None]:
# solutions/data/quali.r

In [None]:
### TO BE COMPLETED ### 
# Correlation study


In [None]:
# solutions/data/correlation.r

### Principal Component Analysis

In [None]:
library(FactoMineR)
library(factoextra)

In [None]:
library(ggpubr)  #to get the ggarrange function

##### <span style="color:purple">**Todo:** PCA of wine dataset: </span>

- What impact can the above analyses have on the PCA result?
- Perform a PCA of the _wine_ data (the qualitative variables should be specified as _supplementary_ variables) and visualize the wines (individuals) on the first factorial plane (use the _habillage_ parameter to show groups according to the qualitative variables).
- How many groups of wines can be suggested?
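The idea can be sketched with base R only. Note that this example swaps in `prcomp` from the `stats` package in place of `FactoMineR::PCA`, and uses the built-in `iris` data as a stand-in for _wine_. The qualitative variable plays no part in the computation and is only used to colour the points, which is similar in spirit to a supplementary variable.

```r
data(iris)

# Standardize the quantitative variables so that each has unit variance
X <- scale(iris[, 1:4])

# PCA on the standardized data
pca <- prcomp(X)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Individuals on the first factorial plane, coloured by the qualitative variable
plot(pca$x[, 1:2], col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```

Standardizing first matters when the variables are measured in different units, as otherwise the variables with the largest variance dominate the first components.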

In [None]:
### TO BE COMPLETED ### 
# PCA of wine data -- Variables

wine2 = wine
wine2[,-c(1,2)] = scale(...)

In [None]:
# solutions/data/pca_var.r

In [None]:
### TO BE COMPLETED ### 
# PCA of wine data -- Individuals


In [None]:
# solutions/data/pca_ind.r

## Clustering with $k$-means

In this part, we will perform $k$-means clustering of the wines using only the quantitative variables. The qualitative variables will be used to interpret the resulting clusters.

In [None]:
library(cluster)

##### <span style="color:purple">**Todo:** Clustering with $k=3$: </span>

- Using the [`kmeans`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans) function, cluster the wines. The numeric variables should be standardized beforehand.
- Use the [`fviz_cluster`](https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_cluster.html) function to visualize the clusters on the first factorial plane of the PCA.
- Analyze the links between clusters and qualitative variables.
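A minimal sketch of this workflow on the built-in `iris` data (used here only as a stand-in for _wine_): standardize, run `kmeans`, then cross-tabulate the clusters with a qualitative variable. The seed value and `nstart = 25` are arbitrary choices made for the illustration.

```r
data(iris)

# Standardize the quantitative variables
X <- scale(iris[, 1:4])

# k-means depends on the random initial centers: fix the seed and
# keep the best of several random starts
set.seed(42)
km <- kmeans(X, centers = 3, nstart = 25)

km$size      # number of observations per cluster
km$centers   # cluster centers, in standardized units

# Link between the clusters and a qualitative variable
table(cluster = km$cluster, species = iris$Species)
```

The cluster labels (1, 2, 3) are arbitrary, so the contingency table should be read row by row rather than along its diagonal.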

In [None]:
### TO BE COMPLETED ### 
# k-means, with k=3

reskmeans = kmeans(...)

In [None]:
# solutions/kmeans/kmeans.r

In [None]:
### TO BE COMPLETED ### 
# Clusters vs Type of wine


In [None]:
# solutions/kmeans/clust_vs_type.r

In [None]:
### TO BE COMPLETED ### 
# Clusters vs Quality of the wine


In [None]:
# solutions/kmeans/clust_vs_quality.r

##### <span style="color:purple">**Todo:** Determine the best value of $k$: </span>

- using the elbow method
- using the silhouette score

**Note**: _One can use the [`fviz_nbclust`](https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_nbclust.html) and [`fviz_silhouette`](https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_silhouette.html) functions of the `factoextra` package_.
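Both criteria can also be computed by hand, which makes explicit what the plotting helpers summarize. The sketch below uses the built-in `iris` data as a stand-in for _wine_, `kmeans` from `stats` and `silhouette` from the `cluster` package. The range of $k$ and the seed are arbitrary choices.

```r
library(cluster)   # for silhouette()

data(iris)
X <- scale(iris[, 1:4])
d <- dist(X)       # Euclidean distances, needed for the silhouette

set.seed(42)
ks  <- 2:8
wss <- numeric(length(ks))   # total within-cluster sum of squares
sil <- numeric(length(ks))   # average silhouette width

for (i in seq_along(ks)) {
  km     <- kmeans(X, centers = ks[i], nstart = 25)
  wss[i] <- km$tot.withinss
  sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"])
}

data.frame(k = ks, wss = round(wss, 1), silhouette = round(sil, 3))

# Elbow method: look for the k after which wss stops dropping sharply
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# Silhouette: choose the k with the largest average width
ks[which.max(sil)]
```

The silhouette is undefined for a single cluster, which is why the loop starts at $k = 2$.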

In [None]:
### TO BE COMPLETED ### 
# Elbow method with the total within-cluster sum of squares as metric

fviz_nbclust(...)

In [None]:
# solutions/kmeans/elbow_wss.r

In [None]:
### TO BE COMPLETED ### 
# Choice of k with the average silhouette score as metric

fviz_nbclust(...)

In [None]:
# solutions/kmeans/elbow_silhouette.r

In [None]:
### TO BE COMPLETED ### 
# Silhouette plots, according to the number of clusters

fviz_silhouette(...)

In [None]:
# solutions/kmeans/plot_silhouette.r

## Clustering with CAH

In this section, we will use agglomerative hierarchical clustering (CAH, from the French _classification ascendante hiérarchique_) to carry out the same analysis as in the previous section.

##### <span style="color:purple">**Todo**: Use the [`hclust`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust) function to perform a hierarchical classification of the wine data</span>

- Test the different types of linkage: _single_, _complete_ and _average_,
- Graphically, compare the associated dendrograms (_cf._ [`fviz_dend`](https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_dend.html)), and comment on the results.
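The three linkages can be compared with base R alone. This sketch uses the built-in `iris` data as a stand-in for _wine_ and the base `plot` method for `hclust` objects instead of `fviz_dend`. The object names are hypothetical.

```r
data(iris)
X <- scale(iris[, 1:4])

# hclust works on a dissimilarity matrix, not on the raw data
d <- dist(X, method = "euclidean")

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Dendrograms side by side
op <- par(mfrow = c(1, 3))
plot(hc_single,   labels = FALSE, main = "single",   xlab = "", sub = "")
plot(hc_complete, labels = FALSE, main = "complete", xlab = "", sub = "")
plot(hc_average,  labels = FALSE, main = "average",  xlab = "", sub = "")
par(op)
```

Single linkage tends to produce a chaining effect, with observations joining one large cluster one at a time, whereas complete linkage tends to give more balanced, compact groups.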

In [None]:
### TO BE COMPLETED ### 

# Clustering
hclustsingle = hclust(...)
hclustcomplete = hclust(...)
hclustaverage = hclust(...)

# Dendrogram visualization
fviz_dend(...)

In [None]:
# solutions/cah/cah.r

##### <span style="color:purple">**Todo:** Find the appropriate number of clusters for the _complete_ linkage ($\texttt{hclustcomplete}$) using both methods (_wss_ and _silhouette_)</span>

In [None]:
### TO BE COMPLETED ### 


In [None]:
# solutions/cah/cah_nb.r

##### <span style="color:purple">**Todo:** Get a 3-cluster partition of the wines for the _complete_ linkage ($\texttt{hclustcomplete}$).</span>
- You may use the [`cutree`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/cutree.html) function,
- View the dendrogram associated with this cut (_cf._ [`fviz_dend`](https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_dend.html)),
- Explain clusters with qualitative variables.
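Cutting a tree into a fixed number of clusters can be sketched as follows, again on the built-in `iris` data as a stand-in for _wine_. The base function `rect.hclust` is used here to mark the cut on the dendrogram in place of `fviz_dend`.

```r
data(iris)
X <- scale(iris[, 1:4])

hc <- hclust(dist(X), method = "complete")

# Cut the tree to obtain exactly 3 clusters
cl <- cutree(hc, k = 3)
table(cl)   # cluster sizes

# Dendrogram with the 3 clusters outlined
plot(hc, labels = FALSE, main = "Complete linkage, k = 3", xlab = "", sub = "")
rect.hclust(hc, k = 3, border = 2:4)

# Interpretation with a qualitative variable
table(cluster = cl, species = iris$Species)
```

`cutree` also accepts a height `h` instead of `k`, which is convenient when the cut is chosen by reading the dendrogram.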

In [None]:
### TO BE COMPLETED ### 

reshclust = cutree(...)

In [None]:
# solutions/cah/cah_cut.r

In [None]:
### TO BE COMPLETED ### 
# Clusters vs Type of wine


In [None]:
# solutions/cah/clust_vs_type.r

In [None]:
### TO BE COMPLETED ### 
# Clusters vs Quality of wine


In [None]:
# solutions/cah/clust_vs_quality.r

## Clustering with Gaussian Mixture

In this part, we will carry out the same analysis as above with the Gaussian mixture model (GMM) method.

In [None]:
library(mclust)

##### <span style="color:purple">**Todo:** Perform GMM clustering. </span>
- You may use the [`Mclust`](https://www.rdocumentation.org/packages/mclust/versions/5.4.6/topics/Mclust) function of the `mclust` package,
- Select the best model according to the _BIC_ criterion (_cf._ [`fviz_mclust`](https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_mclust.html)),
- Visualize the obtained clusters.
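A minimal sketch of model selection with `Mclust`, on the built-in `iris` data as a stand-in for _wine_. The range `G = 1:5` is an arbitrary choice for the illustration, and the base `plot` method of `mclust` is used instead of `fviz_mclust`.

```r
library(mclust)

data(iris)
X <- scale(iris[, 1:4])

# Fit Gaussian mixtures for 1 to 5 components and all available
# covariance structures; the model with the best BIC is retained
res <- Mclust(X, G = 1:5)
summary(res)

res$modelName   # covariance structure of the selected model
res$G           # selected number of components

# BIC curves for every covariance structure
plot(res, what = "BIC")

# Hard assignment of each observation and link with a qualitative variable
table(cluster = res$classification, species = iris$Species)
```

Unlike $k$-means, the fitted model also returns posterior membership probabilities (`res$z`), which indicate how confidently each observation is assigned.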

In [None]:
wine3 = wine2[, -c(1, 2)]

In [None]:
### TO BE COMPLETED ### 
# GMM with BIC

resBICall = Mclust(...)
summary(resBICall)

fviz_mclust(...)

In [None]:
# solutions/gmm/gmm_bic.r

In [None]:
### TO BE COMPLETED ### 
# Best model with BIC


In [None]:
# solutions/gmm/gmm_best_bic.r

##### <span style="color:purple">**Todo:** Redo the same analysis with the _ICL_ criterion. </span>
- You may use the [`Mclust`](https://www.rdocumentation.org/packages/mclust/versions/5.4.6/topics/Mclust) function of the `mclust` package,
- Select the best model according to the _ICL_ criterion (_cf._ [`fviz_mclust`](https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_mclust.html)),
- Visualize the obtained clusters.
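To see why the two criteria may select different models, it can help to write them side by side. The formulas below assume the sign convention used by `mclust`, where larger values are better. The notation ($\nu$, $\hat z_{ik}$, $\hat c_{ik}$) is introduced here for illustration only.

$$\mathrm{BIC} = 2 \log L(\hat\theta) - \nu \log n$$

$$\mathrm{ICL} = \mathrm{BIC} + 2 \sum_{i=1}^{n} \sum_{k=1}^{K} \hat c_{ik} \log \hat z_{ik}$$

Here $L(\hat\theta)$ is the maximized likelihood, $\nu$ the number of free parameters, $n$ the number of observations, $\hat z_{ik}$ the estimated posterior probability that observation $i$ belongs to component $k$, and $\hat c_{ik} = 1$ if $k$ maximizes $\hat z_{ik}$ over the components (0 otherwise). Since $\log \hat z_{ik} \le 0$, we always have $\mathrm{ICL} \le \mathrm{BIC}$, and the gap grows when components overlap. ICL therefore tends to favour fewer, better-separated clusters, whereas BIC favours the best density fit.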

In [None]:
### TO BE COMPLETED ### 
# GMM with ICL

resICLall = ...
summary(resICLall)

In [None]:
# solutions/gmm/gmm_icl.r

In [None]:
### TO BE COMPLETED ### 
# Best model with ICL


In [None]:
# solutions/gmm/gmm_best_icl.r

##### <span style="color:purple">**Question:** Which _GMM_ model to choose?</span>

##### <span style="color:purple">**Todo:** Analyze the clusters with the qualitative variables.</span>

In [None]:
### TO BE COMPLETED ### 


In [None]:
# solutions/gmm/quali.r

## Comparison of clustering algorithms

The purpose of this last section is to compare the different results we obtained previously.

### $k$-means _vs._ CAH

In [None]:
library(cvms)
library(ggimage)
library(rsvg)

In [None]:
# Recall that the best models for these algorithms are:

reskmeans = kmeans(wine2[,-c(1,2)], centers=3)
ClassK3 = cutree(hclustcomplete, 3)

##### <span style="color:purple">**Todo:** Visualize clusters of these models on the principal component plane</span>

In [None]:
### TO BE COMPLETED ### 


In [None]:
# solutions/compare/cah_vs_kmeans.r

##### <span style="color:purple">**Todo:** Use the `table` function to compare the two partitions and analyze the result</span>
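A sketch of such a comparison on the built-in `iris` data (a stand-in for _wine_). The object names `cl_km`, `cl_hc` and the summary `agreement` are hypothetical; `agreement` is only a rough purity-style indicator, not a standard index.

```r
data(iris)
X <- scale(iris[, 1:4])

# Two partitions of the same observations
set.seed(42)
cl_km <- kmeans(X, centers = 3, nstart = 25)$cluster
cl_hc <- cutree(hclust(dist(X), method = "complete"), k = 3)

# Contingency table of the two partitions
tab <- table(kmeans = cl_km, hclust = cl_hc)
tab

# Cluster labels are arbitrary, so the diagonal is not meaningful.
# Rough summary: share of observations falling in the largest cell of each row
agreement <- sum(apply(tab, 1, max)) / sum(tab)
agreement
```

Two partitions agree well when each row of the table has one dominant cell. Rows spread over several columns show where the algorithms split the data differently.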

In [None]:
### TO BE COMPLETED ### 


In [None]:
# solutions/compare/cah_vs_kmeans_conf.r

### $k$-means _vs._ GMM

##### <span style="color:purple">**Todo:** Do the same analysis as for $k$-means _vs._ CAH</span>

In [None]:
### TO BE COMPLETED ### 


In [None]:
# solutions/compare/gmm_vs_kmeans.r

### CAH _vs._ GMM

##### <span style="color:purple">**Todo:** Do the same analysis as for $k$-means _vs._ CAH</span>

In [None]:
### TO BE COMPLETED ### 


In [None]:
# source("solutions/compare/cah_vs_gmm.r", echo=TRUE)

In [None]:
# source("solutions/compare/kmeans_vs_cah_vs_gmm.r", echo=TRUE)