<a href="https://colab.research.google.com/github/GaiaSaveri/intro-to-ml/blob/main/notebooks/Lab-5.UnsupervisedLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Learning on the Dry Bean Dataset

In this lab we use unsupervised learning techniques (dimensionality reduction and clustering) to extract information about the structure of the Dry Bean dataset.

The original data has been downloaded from https://archive-beta.ics.uci.edu/dataset/602/dry+bean+dataset (Dry Bean Dataset. (2020). UCI Machine Learning Repository).

**Data Set Description**:

Images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. 
A total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains.

## Loading and Pre-treatment of the Data

In [1]:
# Load libraries and modules
import pandas as pd
from sklearn import preprocessing
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.linear_model import LinearRegression
from sklearn.metrics.cluster import normalized_mutual_info_score
import os
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [None]:
FFILE = './Dry_Bean_Dataset.xlsx'
if os.path.isfile(FFILE): 
    print("File already exists")
    if os.access(FFILE, os.R_OK):
        print ("File is readable")
    else:
        print ("File is not readable, removing it and downloading again")
        !rm {FFILE}
        !wget "https://raw.github.com/alexdepremia/ML_IADA_UTs/main/Lab5/Dry_Bean_Dataset.xlsx"
else:
    print("File is missing, downloading it")
    !wget "https://raw.github.com/alexdepremia/ML_IADA_UTs/main/Lab5/Dry_Bean_Dataset.xlsx"

In [None]:
# Load the data
data = pd.read_excel('./Dry_Bean_Dataset.xlsx')
data

In [None]:
# Transform the data to use it as numpy arrays. 
X = data.iloc[:,:-1].values
label = data.iloc[:,16].values
print(X.shape)
N = X.shape[0]  # Number of data points
nc = X.shape[1]  # Number of features/components
print(np.unique(label)) 

In [None]:
# Ordinal encoder for the ground truth labels
enc = preprocessing.OrdinalEncoder()
enc.fit(label.reshape(-1, 1))
y = enc.transform(label.reshape(-1, 1))
print(y)  # Encoded labels

In [6]:
# Rescale the features since their units differ: subtract the mean and divide by the standard deviation
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

## Principal Component Analysis (PCA)

**Objective**: find the set of orthogonal directions along which the variance of the data is the highest.

Summary of the method:

* Center the data feature matrix $X$;

* Compute the covariance matrix $C$ of the centered data as $C=\frac{1}{N-1} X^T X$, where $N$ is the number of data points;

* Compute eigenvalues and eigenvectors of the covariance matrix $C$ (using the function `eigh` of the `numpy.linalg` module, imported above as `LA`), and sort them in decreasing order of the eigenvalues. Arrange the sorted eigenvectors as columns of a matrix $A$.

**Recall**: eigenvalues are the variance of the data along the direction of the corresponding eigenvector. 

* Compute principal components as $X\cdot A$
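
The steps above can be sketched with plain NumPy as follows. This is a minimal illustration on synthetic data, not the code used in the rest of the lab (which relies on `sklearn.decomposition.PCA`); the function name `pca_by_hand` and the toy matrix `X_toy` are hypothetical names introduced here.

```python
import numpy as np
from numpy import linalg as LA

def pca_by_hand(X):
    """Return (eigenvalues, eigenvectors as columns of A, projected data) for X of shape N x D."""
    Xc = X - X.mean(axis=0)                # center each feature
    C = Xc.T @ Xc / (Xc.shape[0] - 1)      # sample covariance matrix (D x D)
    eigvals, eigvecs = LA.eigh(C)          # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]      # indices that sort eigenvalues in decreasing order
    eigvals, A = eigvals[order], eigvecs[:, order]
    return eigvals, A, Xc @ A              # principal components are X . A

# Synthetic data with very different variances along the three axes (assumption for the demo)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
lams, A, Z = pca_by_hand(X_toy)
print(lams)
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is the property recalled above.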

### Model Assessment: choose the number of PCs to keep

**Objective**: choose the final dimension $d$ of the transformed data.

1. Proportion of variance explained: given the eigenvalues $\lambda_1 \geq \dots \geq \lambda_D$ of the covariance matrix (with $D$ the number of features) and a threshold $t\in [0, 1]$, choose the smallest $d$ such that the ratio $\chi_d = \frac{\sum_{i=1}^d \lambda_i}{\sum_{i=1}^D \lambda_i} > t$;

2. Check the existence of a gap in the spectrum of the covariance matrix.
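
Both criteria can be sketched on a toy spectrum. The helper names `choose_d_by_variance` and `largest_gap` are hypothetical, and measuring the gap as the largest ratio between consecutive eigenvalues is only one possible heuristic, assumed here for illustration.

```python
import numpy as np

def choose_d_by_variance(eigvals, t):
    """Smallest d whose explained-variance ratio chi_d exceeds t (assumes t < 1)."""
    chi = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.argmax(chi > t)) + 1     # argmax returns the first index where chi > t

def largest_gap(eigvals):
    """Number of eigenvalues before the largest ratio between consecutive eigenvalues."""
    ratios = eigvals[:-1] / eigvals[1:]
    return int(np.argmax(ratios)) + 1

# Toy spectrum sorted in decreasing order (made-up values)
toy_spectrum = np.array([5.0, 3.0, 0.1, 0.05])
d_variance = choose_d_by_variance(toy_spectrum, 0.9)
d_gap = largest_gap(toy_spectrum)
print(d_variance, d_gap)
```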

In [8]:
# Perform PCA with sklearn 
pca = PCA()
pca.fit(X_scaled)
projection = pca.transform(X_scaled)
# Cumulative fraction of variance explained by the first i+1 components
cumul = np.cumsum(pca.explained_variance_ratio_)
lambs = pca.explained_variance_
comp = np.arange(nc) + 1

In [None]:
f, [ax1 ,ax2] = plt.subplots(1, 2,figsize = (14, 7))
ax1.set_title('Spectrum')
ax1.scatter(comp, lambs)
ax1.set_xticks(comp)
ax2.set_title('Explained variance')
ax2.scatter(comp, cumul)
ax2.set_xticks(comp)
plt.show()

In [None]:
# Number of components needed to exceed each explained variance threshold.
# A separate name (d) is used so that nc, the number of features, is not overwritten.
for t in [0.8, 0.85, 0.9, 0.95, 0.97, 0.99, 0.999]:
    d = np.argmax(cumul - t > 0.) + 1
    print(t, d)

In [None]:
# plotting the data set in 2D (i.e. keep only 2 PCs) colored by its ground truth label
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(projection[:,0],projection[:,1], c=y)
ax.set_title('PCA')
plt.show()

In [None]:
# Now in 3D (i.e. keep only 3 PCs)
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection='3d')
ax.scatter(projection[:,0],projection[:,1], projection[:,2],c=y)
plt.show()

1. What would happen if we don't rescale the features?

2. What is the intrinsic dimension (ID) of the data set?

  **Recall**: the ID of a dataset is the minimum number of dimensions we need to describe the data in an accurate way.

3. Could you compute the two-NN estimate of the ID?

  The procedure works as follows:

  1. Compute pairwise distances among points;
  2. For each point $i$, extract the distances to its first and second nearest neighbors, $r_{i1}$ and $r_{i2}$ respectively;
  3. Compute the ratio $\mu_i = \frac{r_{i2}}{r_{i1}}$;
  4. Compute the empirical cumulative distribution $\mathcal{F}(\mu)$ of $\mu$;
  5. Find the best fitting line for the dataset $\{\log \mu_i, -\log (1 - \mathcal{F}(\mu_i))\}_{i=1}^N$ (the minus sign makes the slope positive);
  6. The intrinsic dimension is given by the slope of this fitted line.
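
A possible sketch of this procedure is given below. It fits $-\log(1 - \mathcal{F}(\mu))$ against $\log \mu$ so that the slope is positive. Several choices are assumptions made here rather than part of the lab text: the line is constrained to pass through the origin, the largest 10% of the ratios are discarded before fitting (the hypothetical parameter `discard_fraction`), and duplicated points are dropped. The full distance matrix needs memory of order $N^2$, so the demo uses a small synthetic dataset.

```python
import numpy as np
from scipy.spatial.distance import cdist

def two_nn_id(X, discard_fraction=0.1):
    """Two-NN estimate of the intrinsic dimension of the rows of X."""
    D = cdist(X, X)                         # step 1: pairwise distances (N x N)
    np.fill_diagonal(D, np.inf)             # ignore the distance of each point to itself
    r = np.partition(D, 1, axis=1)[:, :2]   # step 2: r[:, 0] <= r[:, 1] are the two smallest
    r = r[r[:, 0] > 0]                      # drop duplicated points (r_i1 = 0)
    mu = np.sort(r[:, 1] / r[:, 0])         # step 3: ratios, sorted in increasing order
    n = len(mu)
    F = np.arange(1, n + 1) / n             # step 4: empirical cumulative distribution
    n_keep = min(int(n * (1 - discard_fraction)), n - 1)  # avoids log(0) at F = 1
    x = np.log(mu[:n_keep])
    y = -np.log(1 - F[:n_keep])
    return float(np.sum(x * y) / np.sum(x * x))  # steps 5-6: slope of a line through the origin

# Points on a 2D plane embedded in 5 dimensions: the estimate should be close to 2
rng = np.random.default_rng(0)
plane = rng.uniform(size=(1000, 2))
X_toy = np.hstack([plane, np.zeros((1000, 3))])
id_estimate = two_nn_id(X_toy)
print(id_estimate)
```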

## K-means

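Flat clustering algorithm that partitions the data into $k$ clusters by minimizing the within-cluster sum of squared distances from the cluster centroids (the k-means loss). For a fixed dataset this is equivalent to maximizing the between-cluster variance.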

We will compute the k-means clustering using two types of initialization:
    
  1. Random initialization: cluster centroids are initialized picking random points from the dataset;
  
  2. k-means++: choose first cluster center at random, then choose new cluster centers in such a way that they are far from existing centers.
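
The two initializations and the iterative refinement can be sketched as follows. This is one possible implementation under stated assumptions, not the reference solution of the lab: the names `init_centroids`, `kmeans_single_run` and `kmeans_best_of` are hypothetical, k-means++ is taken in its usual form (new centers sampled with probability proportional to the squared distance from the nearest existing center), and an empty cluster simply keeps its previous centroid.

```python
import numpy as np
from scipy.spatial.distance import cdist

def init_centroids(k, X, init, rng):
    """Pick k initial centroids among the rows of X."""
    N = X.shape[0]
    if init == 'random':
        return X[rng.choice(N, size=k, replace=False)]
    # k-means++: first center uniformly at random, the others with probability
    # proportional to the squared distance from the nearest center chosen so far
    centers = [X[rng.integers(N)]]
    for _ in range(k - 1):
        d2 = cdist(X, np.array(centers), 'sqeuclidean').min(axis=1)
        centers.append(X[rng.choice(N, p=d2 / d2.sum())])
    return np.array(centers)

def kmeans_single_run(k, X, init='++', max_iter=300, seed=None):
    """One k-means run; returns (labels, loss), loss = within-cluster sum of squares."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = init_centroids(k, X, init, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = cdist(X, centers, 'sqeuclidean').argmin(axis=1)  # assignment step
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                     # converged
        labels = new_labels
        for j in range(k):                                            # update step
            members = X[labels == j]
            if len(members) > 0:           # an empty cluster keeps its old centroid
                centers[j] = members.mean(axis=0)
    loss = cdist(X, centers, 'sqeuclidean')[np.arange(len(X)), labels].sum()
    return labels, float(loss)

def kmeans_best_of(k, X, init='++', n_init=20, seed=0):
    """Repeat k-means n_init times and keep the run with the lowest loss."""
    runs = [kmeans_single_run(k, X, init, seed=seed + r) for r in range(n_init)]
    return min(runs, key=lambda run: run[1])

# Three well separated synthetic blobs (toy data for the demo)
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5])])
toy_labels, toy_loss = kmeans_best_of(3, blobs, n_init=5)
print(toy_loss)
```

Keeping the best of several runs matters because a single run only converges to a local minimum of the loss, which depends on the initial centroids.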

In [None]:
def k_means_internal(k, X, init):
    '''
    Parameters
    ----------
    k : int
      Number of clusters
    X : matrix of dimension N x D
      Dataset 
    init : str either '++' or 'random'
      Type of initialization for k-means algorithm
    '''
    pass

In [None]:
def k_means(k, X, init='++', n_init=20):
    '''
    Parameters
    ----------
    k : int
      Number of clusters
    X : matrix of dimension N x D
      Dataset 
    init : str either '++' or 'random'
      Type of initialization for k-means algorithm
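    n_init : int
      Number of runs of the algorithm (with different initializations)

    Returns
    -------
    labels : array of length N
      Cluster assignment of each data point
    loss : float
      Value of the k-means loss for the returned assignment
    '''
    pass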

### k-means with a fixed number of clusters

In [None]:
kmeans_labels, l_kmeans = k_means(7, X_scaled, init='++', n_init=20)
print(l_kmeans)

In [None]:
# Plot the projection according to the k-means clusters
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection='3d')
ax.scatter(projection[:,0], projection[:,1], projection[:,2], c=kmeans_labels)
plt.show()

In [None]:
# k is set to the ground truth number of clusters
kmeans = KMeans(n_clusters=7, random_state=0, n_init=20).fit(X_scaled)
# Plot the projection according to the k-means clusters
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection='3d')
ax.scatter(projection[:,0], projection[:,1], projection[:,2], c=kmeans.labels_)
plt.show()
normalized_mutual_info_score(kmeans.labels_, y.flatten())

### Cluster Validation 

Several methods: 

* (Normalized) Mutual Information: it measures the agreement between the labels assigned by k-means and the true labels;

* Scree Plot: perform k-means with different numbers of clusters and record the loss of each run. Plot the loss as a function of the number of clusters. An elbow in the scree plot should provide useful information about the parameter $k$.
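
A short example on made-up label vectors shows two properties of the normalized mutual information that are useful when reading the scores below: it does not depend on how the clusters are numbered, and it is zero when the predicted labels carry no information about the true ones. The label lists are hypothetical.

```python
from sklearn.metrics.cluster import normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]     # same partition, cluster names swapped
single = [0, 0, 0, 0, 0, 0]        # everything assigned to one cluster

nmi_relabeled = normalized_mutual_info_score(truth, relabeled)
nmi_single = normalized_mutual_info_score(truth, single)
print(nmi_relabeled, nmi_single)
```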

In [None]:
# Compute the normalized mutual information between the predicted and the ground truth classification
normalized_mutual_info_score(kmeans_labels, y.flatten())

In [None]:
# scree plot
nk_base = np.arange(2,21) # possible values for k in k-means
loss = np.zeros(nk_base.shape[0])
i = 0
for nk in nk_base:
    ll,l_kmeans = k_means(nk, X_scaled, init='++', n_init=20)
    loss[i] = l_kmeans
    i = i + 1
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(nk_base, np.log(loss), c='b')
ax.set_xticks(nk_base)
ax.set_title('Scree Plot')
plt.show()

### Build the scree plot

To locate the elbow approximately, we fit one line to the first 4 points of the scree plot (in log scale) and another line to the last 4 points; the elbow lies approximately at the intersection of the two lines.
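
The intersection itself can be computed in closed form from the two fitted lines. The sketch below uses `np.polyfit` instead of `LinearRegression` and a synthetic, piecewise linear log-loss curve; the function name `elbow_from_two_lines` and the toy curve are assumptions made for illustration. It fails if the two lines are parallel.

```python
import numpy as np

def elbow_from_two_lines(ks, log_loss, n_fit=4):
    """Abscissa where the lines fitted to the first and last n_fit points intersect."""
    a1, b1 = np.polyfit(ks[:n_fit], log_loss[:n_fit], 1)     # slope, intercept of the first line
    a2, b2 = np.polyfit(ks[-n_fit:], log_loss[-n_fit:], 1)   # slope, intercept of the second line
    return (b2 - b1) / (a1 - a2)                             # solve a1*k + b1 = a2*k + b2

ks = np.arange(2, 21)
# Synthetic log-loss: steep decrease up to k = 7, nearly flat afterwards
toy_log_loss = np.where(ks <= 7, 10.0 - 1.0 * ks, 3.0 - 0.1 * (ks - 7))
elbow = elbow_from_two_lines(ks, toy_log_loss)
print(elbow)
```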

In [None]:
nk_base = np.arange(2,21)
loss = np.zeros(nk_base.shape[0])
i = 0
for nk in nk_base:
    kmeans = KMeans(n_clusters=nk, random_state=0,n_init=20).fit(X_scaled)
    loss[i] = kmeans.inertia_
    i = i + 1
reg = LinearRegression().fit(nk_base[:4].reshape(-1, 1),np.log(loss[:4]))
aa=reg.predict(nk_base[:8].reshape(-1, 1))
reg2 = LinearRegression().fit(nk_base[-4:].reshape(-1, 1), np.log(loss[-4:]))  # last 4 points (nk_base has 19 entries)
bb=reg2.predict(nk_base.reshape(-1, 1))
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(nk_base,np.log(loss),c='b')
ax.set_xticks(nk_base)
ax.set_title('Scree Plot')
ax.plot(nk_base[:8],aa[:8])
ax.plot(nk_base,bb)
plt.show()

In [None]:
# linear fit of first 4 points
reg = LinearRegression().fit(nk_base[:4].reshape(-1, 1), np.log(loss[:4]))
aa = reg.predict(nk_base[:8].reshape(-1, 1))
# linear fit of last 4 points
reg2 = LinearRegression().fit(nk_base[-4:].reshape(-1, 1), np.log(loss[-4:]))
bb = reg2.predict(nk_base.reshape(-1, 1))

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(nk_base, np.log(loss), c='b')
ax.set_xticks(nk_base)
ax.set_title('Scree Plot')
ax.plot(nk_base[:8], aa[:8])
ax.plot(nk_base, bb)
plt.show()

What is the optimal number of clusters according to the scree plot?

What happens if you don't run the algorithm with many different initializations?

## Play with other dimensionality reduction/clustering algorithms

Try to obtain more information using other algorithms that we have seen during the lectures. Among the suggested algorithms, you can use ISOMAP or t-SNE for dimensionality reduction, and Ward's hierarchical clustering, GMM or DBSCAN for clustering. You don't need to implement these algorithms: use any of the libraries in which they are already implemented (sklearn/scipy).

Since it's relatively easy, you can try to implement Density Peaks clustering.

### t-SNE (t-distributed stochastic neighbor embedding)

**Goal**: estimate, from the distances in the high-dimensional space, the probability of each point to be a neighbor of each other point. Then, the goal of the algorithm is to obtain a set of projected coordinates in which these “neighborhood probabilities” are as similar as possible to the ones in the original space.
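
For reference, in the standard formulation of t-SNE these quantities are usually written as below. The symbols $x_i$ (original points), $y_i$ (projected points) and $\sigma_i$ (a per-point bandwidth, fixed through the perplexity) are notation introduced here and do not appear elsewhere in this lab.

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}, \qquad \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The projected coordinates $y_i$ are obtained by minimizing the Kullback-Leibler divergence between the two sets of probabilities.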

In [None]:
from sklearn.manifold import TSNE
X_embedded = TSNE(n_components=2, learning_rate='auto',
                  init='random', perplexity=15).fit_transform(X_scaled)
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(X_embedded[:,0], X_embedded[:,1], c=y)
ax.set_title('t-SNE')
plt.show()

### DBSCAN (density-based spatial clustering of applications with noise)

**Goal**: clusters are connected regions with density above a threshold, surrounded by regions with density below this threshold. The density threshold is defined by two parameters: a neighborhood distance (*eps*) and the minimum number of configurations within this distance needed to consider a given configuration above the density threshold (*min_samples*). Configurations that are not assigned to any cluster are labeled as noise (label `-1` in sklearn).
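
The role of the two parameters can be seen on a small synthetic example: two dense blobs plus one isolated point. The data and the values of `eps` and `min_samples` below are made up for the demo and are unrelated to the values used on the bean dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense_a = rng.normal([0, 0], 0.1, size=(50, 2))   # first dense region
dense_b = rng.normal([5, 5], 0.1, size=(50, 2))   # second dense region
outlier = np.array([[2.5, 2.5]])                  # isolated point, far from both regions
toy = np.vstack([dense_a, dense_b, outlier])

toy_labels = DBSCAN(eps=0.5, min_samples=5).fit(toy).labels_
n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)  # noise is not a cluster
n_noise = int(np.sum(toy_labels == -1))
print(n_clusters, n_noise)
```

Because noise points share the label `-1`, they count as one extra group when the DBSCAN labels are passed to `normalized_mutual_info_score`.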

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.7, min_samples=12).fit(X_scaled)

fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection='3d')
ax.scatter(projection[:,0], projection[:,1], projection[:,2], c=dbscan.labels_)
plt.show()

### Agglomerative Clustering (hierarchical clustering)

**Goal**: each configuration is initially assigned to a different cluster. At every iteration two existing clusters (i.e. the two closest according to the chosen criterion) are combined, so the next level of the dendrogram has one fewer cluster. 
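
The merging process can be followed step by step on a tiny one-dimensional example using `scipy.cluster.hierarchy`. Single linkage is chosen here only to keep the merge distances easy to check by hand; the five points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])

# Each row of Z describes one merge: [cluster a, cluster b, distance, size of the new cluster]
Z = linkage(pts, method='single')
print(Z)

# Cut the tree so that at most two clusters remain
two_clusters = fcluster(Z, t=2, criterion='maxclust')
print(two_clusters)
```

With $N$ points there are $N-1$ merges, and the last row of `Z` corresponds to the root of the dendrogram, which contains all the points.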



In [None]:
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    # model.children_[i][0] and model.children_[i][1] are merged at level i
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    # linkage should contain [children_0, children_1, distance, n_points]
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(X_scaled)
plt.title("Hierarchical Clustering Dendrogram")
# plot only the top levels of the dendrogram (at most p=4 levels)
plot_dendrogram(model, truncate_mode="level", p=4)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

The **Ward** linkage criterion merges, at each step, the pair of clusters whose union gives the smallest increase in the total within-cluster variance.
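
The increase in within-cluster sum of squares caused by merging two clusters $A$ and $B$ has the closed form $\frac{|A||B|}{|A|+|B|}\|\bar{x}_A - \bar{x}_B\|^2$, where $\bar{x}_A$ and $\bar{x}_B$ are the cluster means. The sketch below checks this identity numerically on synthetic clusters; the helper names `sse` and `ward_cost` are hypothetical.

```python
import numpy as np

def sse(points):
    """Sum of squared distances of the points from their mean."""
    return float(np.sum((points - points.mean(axis=0)) ** 2))

def ward_cost(A, B):
    """Increase in within-cluster sum of squares when clusters A and B are merged."""
    nA, nB = len(A), len(B)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return nA * nB / (nA + nB) * float(diff @ diff)

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 2))
B = rng.normal(loc=3.0, size=(15, 2))
increase = sse(np.vstack([A, B])) - sse(A) - sse(B)
print(increase, ward_cost(A, B))
```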

In [None]:
ward = AgglomerativeClustering(n_clusters=7).fit(X_scaled) # ward is default linkage criterion
fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(projection='3d')
ax.scatter(projection[:,0],projection[:,1], projection[:,2],c=ward.labels_)
plt.show()

In [None]:
# Compare each clustering with the ground truth labels
print("kmeans", normalized_mutual_info_score(kmeans_labels, y.flatten()))
print("dbscan", normalized_mutual_info_score(dbscan.labels_, y.flatten()))
print("ward's", normalized_mutual_info_score(ward.labels_, y.flatten()))