## Prospecção de Dados 2022/2023
### Mestrado em Ciência de Dados | Third Home Assignment

#### Group 14: Tiago Gil, Nº59453  |  Nuno Lopes, Nº59461  |  Daniela Moutinho, Nº57064

#### Task: Build Unsupervised Learning Models (Clustering)
Dataset: UCI Superconductivity Data

In [2]:
import pandas as pd 
import numpy as np
import time
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import (
    silhouette_score, silhouette_samples, calinski_harabasz_score,
    homogeneity_score, completeness_score, v_measure_score,
)

In [3]:
#Load both base datasets
data_train = pd.read_csv("train.csv")
data_unique = pd.read_csv('unique_m.csv')

#ChatGPT generated conversion method but slightly adapted from original output
def get_temp_class(temp):
    if temp < 1.0:
        return "VeryLow"
    elif temp < 5.0:
        return "Low"
    elif temp < 20.0:
        return "Medium"
    elif temp < 100.0:
        return "High"
    else:
        return "VeryHigh"
    
data_train['temp_class'] = data_train['critical_temp'].apply(get_temp_class)

#Separate the temperature class as our target. The rows are the same, and in the same order, in both datasets
y_classes = data_train['temp_class'].values

#Drop the dependent variable and the identifier variables
data_train.drop(columns=["critical_temp", "temp_class"], inplace=True)
data_unique.drop(columns=["critical_temp", "material"], inplace=True)

#Scale
train_scaled = StandardScaler().fit_transform(data_train)
unique_scaled = StandardScaler().fit_transform(data_unique)

## Objective 1: Clustering

Before clustering, we could have applied dimensionality reduction to remove noisy and irrelevant information and to shrink the dataset, which would reduce the computational effort. However, since this was not requested and space is very limited, we did not perform it.

In [3]:
def apply_clustering(X, n_clusters, method, eps_, min_samples_):
    if method == 'kmeans':
        model = KMeans(n_clusters=n_clusters, random_state=42)
        labels = model.fit_predict(X)
        inertia = model.inertia_
        silhouette = silhouette_score(X, labels)
        ch_score = calinski_harabasz_score(X, labels)
        return labels, inertia, silhouette, ch_score
           
    elif method == 'hca':
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
        labels = model.fit_predict(X)
        silhouette = silhouette_score(X, labels)
        ch_score = calinski_harabasz_score(X, labels)
        return labels, None, silhouette, ch_score
    
    elif method == 'dbscan':
        model = DBSCAN(eps=eps_, min_samples=min_samples_)
        labels = model.fit_predict(X)
        n_clusters = len(np.unique(labels)) - 1
        if n_clusters == 0:
            return None, None, None, None
        silhouette = silhouette_score(X, labels)
        ch_score = calinski_harabasz_score(X, labels)
        return labels, None, silhouette, ch_score

In [4]:
clusters_arr=[2,3,4,5,6,7,8]
methods=["kmeans","hca"]
duration_arr=[]
labels_arr=[]
inertia_arr=[]
silhouette_arr=[]
ch_score_arr=[]
datasets=[train_scaled, unique_scaled]

for dataset in datasets:    
    for method in methods:
        duration_arr.clear()
        labels_arr.clear()
        inertia_arr.clear()
        silhouette_arr.clear()
        ch_score_arr.clear()
        
        for n_clusters in clusters_arr:
            t_start=time.time()
            labels,inertia,silhouette,ch_score=apply_clustering(dataset, n_clusters, method,0,0)
            t_end=time.time()

            duration_arr.append(round(t_end-t_start,3))
            labels_arr.append(labels)
            if inertia is None:
                inertia_arr.append(inertia)
            else:
                inertia_arr.append(round(inertia,3))
            silhouette_arr.append(round(silhouette,3))
            ch_score_arr.append(round(ch_score,3))

        print("\nMethod: ", method)
        print("Clusters:", clusters_arr)
        print("Duration: ", duration_arr)
        print("Labels: ",labels_arr)
        print("Inertia: ",inertia_arr)
        print("Silhouette score: ",silhouette_arr)
        print("Calinski-Harabasz score: ",ch_score_arr)
        


Method:  kmeans
Clusters: [2, 3, 4, 5, 6, 7, 8]
Duration:  [8.236, 8.862, 9.432, 8.855, 10.655, 10.985, 12.474]
Labels:  [array([0, 0, 0, ..., 1, 1, 1]), array([0, 0, 0, ..., 1, 1, 2]), array([1, 1, 1, ..., 0, 0, 3]), array([1, 1, 1, ..., 2, 2, 0]), array([0, 0, 0, ..., 2, 2, 4]), array([3, 3, 3, ..., 1, 1, 6]), array([6, 6, 6, ..., 7, 7, 4])]
Inertia:  [1173756.429, 1039073.838, 961811.049, 898430.53, 842678.088, 788053.678, 751401.145]
Silhouette score:  [0.347, 0.314, 0.304, 0.256, 0.263, 0.266, 0.268]
Calinski-Harabasz score:  [9936.174, 6989.615, 5603.075, 4873.466, 4437.802, 4199.884, 3923.436]

Method:  hca
Clusters: [2, 3, 4, 5, 6, 7, 8]
Duration:  [56.665, 40.177, 39.797, 48.168, 39.47, 40.121, 39.522]
Labels:  [array([1, 1, 1, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 2, 2, 1], dtype=int64), array([1, 1, 1, ..., 2, 2, 0], dtype=int64), array([0, 0, 0, ..., 2, 2, 1], dtype=int64), array([3, 3, 3, ..., 2, 2, 0], dtype=int64), array([3, 3, 3, ..., 0, 0, 2], dtype=int64)

Although HCA takes longer than kmeans (it is a much more complex statistical approach), it apparently gives better results for both datasets, with higher scores. Bear in mind that for HCA we only used Euclidean distance. Several linkage criteria were tested to find the best approach.

For the TRAIN dataset, the kmeans silhouette scores point to a 2-cluster solution as the best, although the difference in scores to the other cluster solutions is not large. The CH score, on the other hand, is clearly higher for a 2-cluster solution. When applying HCA, the resulting CH score indicates that a 4- or 5-cluster solution fits the data best when considering average, complete, or Ward's linkage.

The UNIQUE dataset, on the other hand, always shows better results for a 2- or 3-cluster solution when using kmeans. However, HCA with average or complete linkage returned highly suspicious silhouette scores (0.95) for every cluster solution. Ward's linkage seems to provide a more balanced result, pointing to a 2- or 3-cluster solution as the best, with silhouette scores around 0.70.

This apparently indicates that both algorithms have some trouble providing a good solution, probably because the datasets are too noisy. One way to address this could be to use kmeans to obtain 2 clusters and then apply HCA independently to each cluster and compare the results.
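To make the silhouette values compared above more concrete, here is a minimal pure-Python sketch of the coefficient on a toy dataset. The helper name `silhouette_samples_py` and the four points are illustrative assumptions, not part of the assignment; the notebook itself relies on scikit-learn's `silhouette_score`.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def silhouette_samples_py(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b), for small datasets only."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical toy data: two tight pairs of points, far apart from each other
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_samples_py(points, labels)
mean_silhouette = sum(scores) / len(scores)
print(round(mean_silhouette, 3))
```

Values close to 1 mean that points are much closer to their own cluster than to the nearest other cluster, which is why a score near 0.95 on real, noisy data deserves a closer look at the cluster sizes.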

Results were a little bit different when using DBSCAN: we have searched for the best parameters, both eps and minimal samples, for both datasets. 

EPS is the maximum distance between two samples for one to be considered as in the neighborhood of the other and is the most important DBSCAN parameter to choose appropriately. The minimal number of samples in a neighborhood for a point to be considered as a core point is another important parameter.

For the TRAIN dataset, eps values up to 11 together with min_samples=15 give the best scores, returning a 2-cluster solution with a silhouette score of 0.55. For the UNIQUE dataset, eps=30 and min_samples=10 return a 4-cluster solution with a silhouette score of 0.80. For higher scores, a 2-cluster solution is obtained.
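The role of the two parameters described above can be sketched in a few lines of plain Python. The function `core_point_mask` and the sample points are hypothetical, and the sketch assumes that the point itself counts towards `min_samples` and that the `eps` boundary is inclusive; it only marks core points and does not perform the full DBSCAN cluster expansion.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def core_point_mask(points, eps, min_samples):
    """Return True for every point that has at least min_samples points
    (itself included) within distance eps."""
    mask = []
    for p in points:
        n_neighbors = sum(1 for q in points if dist(p, q) <= eps)
        mask.append(n_neighbors >= min_samples)
    return mask


# Hypothetical toy data: three points close together and one far away
points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (10.0, 10.0)]
mask = core_point_mask(points, eps=1.0, min_samples=3)
print(mask)
```

Increasing `eps` or lowering `min_samples` turns more points into core points, which tends to merge clusters and shrink the noise set.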

In [5]:
#Apply different DBSCAN models to train dataset
eps_arr=11
min_samples_arr=15
print("For TRAIN dataset:")
t_start=time.time()                                                    
labels,inertia,silhouette,ch_score=apply_clustering(train_scaled, 0, "dbscan",eps_arr,min_samples_arr) #best: eps=11 and min_samples=30 (2 clusters)
t_end=time.time()  
print("With eps=",eps_arr," and min_sample=",min_samples_arr)
print("Duration: ", t_end-t_start)
print("Labels: ",labels)
print("Number of Clusters: ", len(np.unique(labels)))
print("Silhouette score: ",silhouette)
print("CH score: ",ch_score)

For TRAIN dataset:
With eps= 11  and min_sample= 15
Duration:  17.790655374526978
Labels:  [0 0 0 ... 0 0 0]
Number of Clusters:  2
Silhouette score:  0.552460766781026
CH score:  214.5243814518202


In [6]:
#Apply different DBSCAN models to UNIQUE dataset
eps_arr=30
min_samples_arr=10
print("For UNIQUE dataset:")
t_start=time.time()                                                    
labels,inertia,silhouette,ch_score=apply_clustering(unique_scaled, 0, "dbscan",eps_arr,min_samples_arr) #best: eps=30 and min_samples=10 (2 clusters)
t_end=time.time()  
print("With eps=",eps_arr," and min_sample=",min_samples_arr)
print("Duration: ", t_end-t_start)
print("Labels: ",labels)
print("Number of Clusters: ", len(np.unique(labels)))
print("Silhouette score: ",silhouette)
print("CH score: ",ch_score)

For UNIQUE dataset:
With eps= 30  and min_sample= 10
Duration:  20.2274968624115
Labels:  [0 0 0 ... 0 0 0]
Number of Clusters:  4
Silhouette score:  0.8066185975569313
CH score:  305.0690514929589
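Both DBSCAN cells report the number of clusters as `len(np.unique(labels))`. scikit-learn's DBSCAN assigns the label -1 to noise points, so if any noise is present, that count includes the noise group as if it were a cluster. The outputs above do not show whether this is the case here. The sketch below, with a hypothetical helper `summarize_dbscan_labels` and made-up labels, reports the two quantities separately.

```python
def summarize_dbscan_labels(labels):
    """Count clusters and noise points, treating label -1 as noise."""
    labels = list(labels)
    cluster_ids = {lab for lab in labels if lab != -1}
    n_noise = sum(1 for lab in labels if lab == -1)
    return len(cluster_ids), n_noise


# Made-up labels: one dense cluster (0) and two noise points (-1)
example_labels = [0, 0, 0, -1, 0, -1]
n_clusters, n_noise = summarize_dbscan_labels(example_labels)
print("clusters:", n_clusters, "noise points:", n_noise)
```

The helper accepts any iterable of integer labels, so it could be called on the `labels` array returned by `apply_clustering`.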


## Objective 2: Evaluating clustering with extrinsic methods

In [19]:
#train dataset
kmdt = KMeans(n_clusters=2,n_init=10, random_state=0).fit(train_scaled)
hcadt = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(train_scaled)
dbscandt = DBSCAN(eps=11, min_samples=15).fit(train_scaled)
#unique dataset
kdum = KMeans(n_clusters=2,n_init=10, random_state=0).fit(unique_scaled)
hcadum = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(unique_scaled)
dbscandu = DBSCAN(eps=30, min_samples=10).fit(unique_scaled)

In [5]:
def showClusteringResults(X, clusters, labels=None):
    sil=silhouette_score(X, clusters)
    ch=calinski_harabasz_score(X, clusters)
    print("Silhouette score", sil)
    print("Calinski Harabasz score", ch)
    if labels is not None:
        #scikit-learn expects (labels_true, labels_pred): known classes first, cluster assignments second
        hom=homogeneity_score(labels, clusters)
        cmp=completeness_score(labels, clusters)
        vms=v_measure_score(labels, clusters)
        print("Homogeneity score", hom)    
        print("Completeness score", cmp)
        print("V-measure score", vms)

### Train Dataset

In [21]:
print("\nClustering results from 2 clusters kmeans for Train dataset: ")
showClusteringResults(train_scaled, kmdt.labels_, y_classes)
print("\nClustering results from 2 clusters HCA for Train dataset: ")
showClusteringResults(train_scaled, hcadt.labels_, y_classes)
print("\nClustering results clusters DBSCAN for Train dataset: ")
showClusteringResults(train_scaled, dbscandt.labels_+1, y_classes)


Clustering results from 2 clusters kmeans for Train dataset: 
Silhouette score 0.34653151281555533
Calinski Harabasz score 9936.174442755846
Homogeneity score 0.22511844630054353
Completeness score 0.42815131138588053
V-measure score 0.2950840961690294

Clustering results from 2 clusters HCA for Train dataset: 
Silhouette score 0.33284350107579774
Calinski Harabasz score 9125.608098749608
Homogeneity score 0.21115377979858774
Completeness score 0.39878967636158946
V-measure score 0.2761106678265331

Clustering results clusters DBSCAN for Train dataset: 
Silhouette score 0.5524607667809489
Calinski Harabasz score 214.52438145182023
Homogeneity score 0.00033809193642512494
Completeness score 0.04477098684262668
V-measure score 0.0006711158838524886


### Unique Dataset

In [22]:
print("\nClustering results from 2 clusters kmeans for Unique dataset: ")
showClusteringResults(unique_scaled, kdum.labels_, y_classes)
print("\nClustering results from 2 clusters HCA for Unique dataset: ")
showClusteringResults(unique_scaled, hcadum.labels_, y_classes)
print("\nClustering results clusters DBSCAN for Unique dataset: ")
showClusteringResults(unique_scaled, dbscandu.labels_+1, y_classes)


Clustering results from 2 clusters kmeans for Unique dataset: 
Silhouette score 0.7005853361857551
Calinski Harabasz score 572.9173211837198
Homogeneity score 0.0006152216063172981
Completeness score 0.03822493667563354
V-measure score 0.0012109531980918117

Clustering results from 2 clusters HCA for Unique dataset: 
Silhouette score 0.7005853361857551
Calinski Harabasz score 572.9173211837198
Homogeneity score 0.0006152216063172981
Completeness score 0.03822493667563354
V-measure score 0.0012109531980918117

Clustering results clusters DBSCAN for Unique dataset: 
Silhouette score 0.8066185975569313
Calinski Harabasz score 305.0690514929589
Homogeneity score 0.0028036653234625977
Completeness score 0.12747222571847328
V-measure score 0.005486655375650999


The data have been classified into 5 classes according to critical temperature.

However, none of the proposed algorithms recovered this structure when clustering. The best configuration for each of these clustering techniques had around 2 clusters, which is far from our 5 known classes. This mismatch is reflected across the board in the extrinsic scores.
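The effect of having fewer clusters than classes on the extrinsic scores can be illustrated with a small entropy-based computation. The sketch below is a pure-Python illustration on made-up labels, with hypothetical helper names; it follows the usual definitions (homogeneity = 1 - H(C|K)/H(C), completeness = 1 - H(K|C)/H(K), V-measure = their harmonic mean) and is not the code used in this notebook.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a label sequence (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), computed from the joint label counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, V-measure) for true classes vs clusters."""
    h_c, h_k = entropy(classes), entropy(clusters)
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v


# Made-up data: four equally sized classes, two clusters that each merge two classes
classes = ["A", "A", "B", "B", "C", "C", "D", "D"]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
hom, com, v = homogeneity_completeness_v(classes, clusters)
print(round(hom, 3), round(com, 3), round(v, 3))
```

In this toy case every class falls entirely inside one cluster, so completeness is perfect, but each cluster mixes two classes, so homogeneity cannot exceed 0.5. The two measures are not symmetric: swapping the two arguments exchanges them, while the V-measure stays the same.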

According to the results, KMeans appears to be the best alternative for clustering both datasets. For the Train dataset, KMeans has the highest Calinski Harabasz score (9936.17), an intrinsic measure based on a variance ratio, and a higher Silhouette score than HCA (0.347 vs. 0.333). DBSCAN reaches a higher Silhouette score (0.552), but its Calinski Harabasz score (214.52) and V-measure (0.0007) are far lower. Furthermore, the Homogeneity, Completeness, and V-measure scores, which are extrinsic measures of how consistent the clusters are with the existing labels, are all higher for KMeans than for HCA.

Regarding the Unique dataset, the Silhouette, Calinski Harabasz, Homogeneity, Completeness, and V-measure scores for KMeans and HCA are identical, which suggests that both methods produced the same 2-cluster partition. Keep in mind that the distribution of the Unique dataset differs from that of the Train dataset and that the same clustering algorithm may behave differently depending on the dataset. Given the results, clustering the Unique dataset using KMeans is still a reasonable choice.

Based on both intrinsic and extrinsic evaluation methodologies, KMeans emerges as the top candidate for clustering both datasets.