**II. Programming and critical analysis [9v]**


Recall the column_diagnosis.arff dataset from previous homeworks. For the following exercises,
normalize the data using sklearn’s MinMaxScaler.

**1)** [4v] Using sklearn, apply k-means clustering in a fully unsupervised manner to the normalized data with
𝑘 ∈ {2,3,4,5} (random_state=0 and remaining parameters as default). Assess the silhouette and purity of
the produced solutions.

In [None]:
import pandas as pd
from sklearn import datasets, metrics, cluster, mixture, preprocessing
import numpy as np
from scipy.io.arff import loadarff

def purity_score(y_true, y_pred):
    # Contingency matrix: rows are true classes, columns are clusters
    contingency = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Each cluster contributes the count of its majority class
    return np.sum(np.amax(contingency, axis=0)) / np.sum(contingency)

data = loadarff('column_diagnosis.arff')
df = pd.DataFrame(data[0])
df['class'] = df['class'].str.decode('utf-8')

X = df.drop('class', axis=1)
y = df['class']

# MinMaxScaler
scaler = preprocessing.MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Values of k
k_values = [2, 3, 4, 5]

# Silhouette and purity scores
silhouette_scores = []
purity_scores = []

# Clustering
for k in k_values:
    kmeans = cluster.KMeans(n_clusters=k, random_state=0)
    cluster_labels = kmeans.fit_predict(X_normalized)

    # Keep the k=3 labels for question 3
    if k == 3:
        labels3 = cluster_labels

    # Silhouette and purity of this solution
    silhouette_scores.append(metrics.silhouette_score(X_normalized, cluster_labels))
    purity_scores.append(purity_score(y, cluster_labels))

# Print silhouette and purity scores for each k
for k, silhouette, purity in zip(k_values, silhouette_scores, purity_scores):
    print(f'k={k}: \nSilhouette Score = {silhouette} \nPurity = {purity}\n')
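
To make the purity computation above concrete, here is a minimal sketch on a hand-made example. The toy labels are hypothetical and are not taken from column_diagnosis.arff; the function mirrors the `purity_score` defined in the cell above.

```python
import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # Rows are true classes, columns are clusters
    contingency = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Each cluster contributes the count of its majority class
    return np.sum(np.amax(contingency, axis=0)) / np.sum(contingency)

# Hypothetical toy labels (assumption, for illustration only)
toy_true = np.array(["Normal", "Normal", "Normal", "Hernia", "Hernia", "Hernia"])
toy_pred = np.array([0, 0, 1, 1, 1, 1])

# Cluster 0 holds 2 Normal; cluster 1 holds 1 Normal and 3 Hernia,
# so the majority counts are 2 and 3, giving purity (2 + 3) / 6
toy_purity = purity_score(toy_true, toy_pred)
print(toy_purity)
```

Purity only rewards the majority class of each cluster, so it tends to increase as k grows; this is worth keeping in mind when comparing the values for k ∈ {2,3,4,5}.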


**2)** [2v] Consider the application of PCA after the data normalization:
    
**i.** Identify the variability explained by the top two principal components.


In [None]:
from sklearn.decomposition import PCA

# Create a PCA instance
pca = PCA(n_components=2)

# Fit the PCA model to the normalized data
X_pca = pca.fit_transform(X_normalized)

# Variability explained by the top two principal components
variability = pca.explained_variance_ratio_.sum()

# Print the total explained variance ratio as a percentage
print(f"Variability explained by the top 2 principal components: {variability*100:.4f} %")
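
As a sanity check on what `explained_variance_ratio_` measures, the sketch below compares it with the eigenvalues of the sample covariance matrix, each divided by the total variance. The data is synthetic (an assumption, generated with a fixed seed), not the homework dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with unequal column scales (assumption, for illustration only)
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 6)) * np.array([3.0, 2.0, 1.0, 0.5, 0.5, 0.1])

toy_pca = PCA(n_components=2).fit(toy_X)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.cov(toy_X, rowvar=False))[::-1]
ratio_from_eig = eigenvalues / eigenvalues.sum()

# The two vectors should agree up to floating-point error
print(toy_pca.explained_variance_ratio_)
print(ratio_from_eig[:2])
```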


**ii.** For each one of these two components, sort the input variables by relevance by inspecting the absolute weights of the linear projection.

In [None]:
# Get the absolute values
absolute_loadings = np.abs(pca.components_)

# DataFrame to associate the loadings with the input variables
loadings_df = pd.DataFrame(absolute_loadings, columns=X.columns, index=['PC1', 'PC2'])

# Sort the input variables by relevance for each principal component
sorted_loadings_pc1 = loadings_df.loc['PC1'].sort_values(ascending=False)
sorted_loadings_pc2 = loadings_df.loc['PC2'].sort_values(ascending=False)

# Prints: fixed-width columns so that the header and the rows line up
header = "{0:30} {1:>30} {2:>30}\n".format("Variable", "Component 1 Loading", "Component 2 Loading")

print("Variables sorted by relevance for Component 1:")
print(header)

for attr in sorted_loadings_pc1.index:
    print("{0:30} {1:30} {2:30}".format(attr, sorted_loadings_pc1[attr], sorted_loadings_pc2[attr]))

print("\n\nVariables sorted by relevance for Component 2:")
print(header)

for attr in sorted_loadings_pc2.index:
    print("{0:30} {1:30} {2:30}".format(attr, sorted_loadings_pc1[attr], sorted_loadings_pc2[attr]))

**3)** [2v] Visualize side-by-side the data using: i) the ground diagnoses, and ii) the previously learned
𝑘 = 3 clustering solution. To this end, project the normalized data onto a 2-dimensional data
space using PCA and then color observations using the reference and cluster annotations.

In [None]:
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Encode the diagnoses as integers so they can be used as colors
label_encoder = LabelEncoder()
c_original = label_encoder.fit_transform(y)

# The k-means labels are already integers (0, 1, 2)
c_kmeans = labels3

# Plot of the ground diagnoses
plt.figure(figsize=(14, 5))
plt.subplot(121)
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=c_original)
plt.title("Ground Diagnoses")
# The axes are the principal components, not the original input variables
plt.xlabel("PC1")
plt.ylabel("PC2")

# Legend entries follow the encoder's (sorted) class order
handles, _ = scatter.legend_elements()
plt.legend(handles, list(label_encoder.classes_))

# Plot for k=3
plt.subplot(122)
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=c_kmeans)
plt.title("k-Means, with k=3")
plt.xlabel("PC1")
plt.ylabel("PC2")
handles, cluster_names = scatter.legend_elements()
plt.legend(handles, cluster_names)

plt.show()


**4)** [1v] Considering the results from questions (1) and (3), identify two ways in which clustering can
be used to characterize the population of ill and healthy individuals.

1 - By specifying the number of clusters, we can split groups into subgroups. In this case, using 3 clusters makes it possible to split the group of ill individuals into Hernia and Spondylolisthesis.
2 - Since the classes overlap considerably (as shown in the first plot of question 3), there is no clear separation between the 3 groups, so it is very hard to find a way to split them into clusters. From the k-means plot and the results of exercise 1, we see that k-means is clearly not the best method for this dataset. Clustering is therefore a good option when the different classes and subclasses are well separated (and, possibly, without many outliers), that is, when the inter-cluster distance is larger than the intra-cluster distances for most points.
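
One way to inspect how a clustering solution characterizes the diagnoses is to cross-tabulate cluster labels against the reference classes. The sketch below uses `pandas.crosstab` on hypothetical toy labels (an assumption; these are not the homework results) to show the kind of table one could build from `labels3` and `y`.

```python
import pandas as pd

# Hypothetical toy labels (assumption, for illustration only)
diagnoses = pd.Series(
    ["Normal", "Normal", "Normal", "Hernia", "Hernia",
     "Spondylolisthesis", "Spondylolisthesis", "Spondylolisthesis"],
    name="diagnosis",
)
clusters = pd.Series([0, 0, 1, 1, 1, 2, 2, 2], name="cluster")

# Counts of each diagnosis inside each cluster
table = pd.crosstab(clusters, diagnoses)

# Same table as row-wise proportions: the composition of each cluster
row_share = pd.crosstab(clusters, diagnoses, normalize="index")

print(table)
print(row_share)
```

A cluster whose row is dominated by a single diagnosis describes that subgroup well, while a mixed row points to the overlap discussed in the second answer.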