# Lab 3 - Part 2: PCA and Clustering (12 marks)
### Due Date: Monday, March 13 at 12pm

Author: *Steven Duong (30022492)*

The purpose of this portion of the assignment is to practice using PCA and clustering techniques on a given dataset

In [3]:
import numpy as np
import pandas as pd

## 0. Function definitions (2 marks)

In [4]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn(n_clusters, X, n_components=0):
    '''Calculate silhouette score for a given dataset, number of clusters, 
       and number of principle components using KMeans clustering (random_state=0)
        
        n_clusters (int): number of clusters to use for KMeans
        X (numpy.array or pandas.DataFrame): unlabelled dataset
        n_components (int): number of principle components (optional)
        
        returns: silhouette score
    
    '''
    # Apply PCA if number of components is specified
    if n_components > 0:
        pca = PCA(n_components=n_components)
        X = pca.fit_transform(X)
    
    # Apply KMeans clustering and calculate silhouette score
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmeans.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    
    return silhouette_avg


## 1. Load data (2 marks)

For this assignment, we will use the dataset found below:

https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples

In [25]:
import pandas as pd

# Load the dataset into a pandas dataframe
df = pd.read_csv('Chemical Composion of Ceramic.csv')

# Print the first few rows of the dataframe
print(df.head())


  Ceramic Name  Part  Na2O   MgO  Al2O3   SiO2   K2O   CaO  TiO2  Fe2O3  MnO  \
0      FLQ-1-b  Body  0.62  0.38  19.61  71.99  4.84  0.31  0.07   1.18  630   
1      FLQ-2-b  Body  0.57  0.47  21.19  70.09  4.98  0.49  0.09   1.12  380   
2      FLQ-3-b  Body  0.49  0.19  18.60  74.70  3.47  0.43  0.06   1.07  420   
3      FLQ-4-b  Body  0.89  0.30  18.01  74.19  4.01  0.27  0.09   1.23  460   
4      FLQ-5-b  Body  0.03  0.36  18.41  73.99  4.33  0.65  0.05   1.19  380   

   CuO  ZnO  PbO2  Rb2O  SrO  Y2O3  ZrO2  P2O5  
0   10   70    10   430    0    40    80    90  
1   20   80    40   430  -10    40   100   110  
2   20   50    50   380   40    40    80   200  
3   20   70    60   380   10    40    70   210  
4   40   90    40   360   10    30    80   150  


Two of the columns are non-numeric. For this assignment, we will remove those two columns and focus on clustering the ceramic samples based on the numerical measurements

In [26]:
# Remove non-numeric columns
df = df.drop(columns=['Ceramic Name', 'Part'])

# Print the first few rows of the updated dataframe
print(df.head())

   Na2O   MgO  Al2O3   SiO2   K2O   CaO  TiO2  Fe2O3  MnO  CuO  ZnO  PbO2  \
0  0.62  0.38  19.61  71.99  4.84  0.31  0.07   1.18  630   10   70    10   
1  0.57  0.47  21.19  70.09  4.98  0.49  0.09   1.12  380   20   80    40   
2  0.49  0.19  18.60  74.70  3.47  0.43  0.06   1.07  420   20   50    50   
3  0.89  0.30  18.01  74.19  4.01  0.27  0.09   1.23  460   20   70    60   
4  0.03  0.36  18.41  73.99  4.33  0.65  0.05   1.19  380   40   90    40   

   Rb2O  SrO  Y2O3  ZrO2  P2O5  
0   430    0    40    80    90  
1   430  -10    40   100   110  
2   380   40    40    80   200  
3   380   10    40    70   210  
4   360   10    30    80   150  


## 2. Implement clustering (8 marks)

### 2.1 Cluster using raw data (1 mark)

Implement Kmeans clustering using the raw data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters

In [56]:
# Implement clustering with raw data using cluster_fn above

# Try clustering with 2, 3, 4, 5, and 6 clusters
for n_clusters in range(2, 7):
    score = cluster_fn(n_clusters, df)
    scores_df = pd.concat([scores_df, pd.DataFrame({'Num Clusters': [n_clusters], 'Silhouette Score': [score]})], ignore_index=True)


### 2.2 Cluster using PCA-transformed data (2 marks)

Implement Kmeans clustering using the PCA-transformed data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters and 2, 3, 4, 5 and 6 principle components 

In [57]:
# Implement clustering with PCA-transformed data using cluster_fn above

# Initialize a list to hold the silhouette scores
scores = []

# Try clustering with 2, 3, 4, 5, and 6 clusters and 2, 3, 4, 5, and 6 principle components
for n_clusters in range(2, 7):
    for n_components in range(2, 7):
        score = cluster_fn(n_clusters, df, n_components)
        scores.append({'Num Clusters': n_clusters, 'Num Components': n_components, 'Silhouette Score': score})


### 2.3 Display results (2 marks)

Print the results for 2.1 and 2.2 in a table. Include column and row labels

In [58]:
# Display results

# For 2.1 Results:
# Pivot the dataframe to create a table of the results
table = scores_df.pivot_table(index='Num Clusters', values='Silhouette Score')

# Format the table with two decimal places for the silhouette scores
styled_table = (table.style
                .format({'Silhouette Score': '{:.2f}'}))

# Add the caption to the table
caption = 'Silhouette Scores for KMeans Clustering'
print("2.1 Results:\n"+caption)

# Display the table
display(styled_table)

# For 2.2 Results:
# Create a pandas DataFrame from the scores list
scores_df = pd.DataFrame(scores)

# Pivot the dataframe to create a table of the results
table = scores_df.pivot_table(index='Num Clusters', columns='Num Components', values='Silhouette Score')

# Format the silhouette scores with two decimal places
styled_table = table.style.format('{:.2f}')

# Add the caption to the table
caption = 'Silhouette Scores for KMeans Clustering with PCA-transformed Data'
print("\n2.2 Results:\n"+caption)

# Display the table
display(styled_table)


2.1 Results:
Silhouette Scores for KMeans Clustering


Unnamed: 0_level_0,Silhouette Score
Num Clusters,Unnamed: 1_level_1
2,0.58
3,0.55
4,0.54
5,0.5
6,0.5



2.2 Results:
Silhouette Scores for KMeans Clustering with PCA-transformed Data


Num Components,2,3,4,5,6
Num Clusters,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,0.62,0.6,0.59,0.59,0.59
3,0.61,0.59,0.57,0.57,0.56
4,0.6,0.57,0.55,0.55,0.55
5,0.57,0.55,0.52,0.52,0.51
6,0.57,0.55,0.53,0.52,0.52


**Question**: Which combination of number of clusters and number of components produced the best results? What is the silhouette score for this combination? **(3 marks)**

**2 principal components** with **2 clusters produced** the best result with a silhouette score of **0.62**

## 3. Improve results (Bonus - 3 marks)

Think about how you could improve the results from the previous section. Two potential methods include preprocessing the data or selecting a different clustering algorithm. Repeat section 2 with your selected improvement method to determine what the new silhouette scores would be

In [None]:
# TODO: Repeat steps 2.1-2.3 using a different method/preprocessing/etc.

In [108]:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn(n_clusters, X, n_components=0):
    '''Calculate silhouette score for a given dataset, number of clusters, 
       and number of principle components using agglomerative clustering without scaling
        
        n_clusters (int): number of clusters to use for agglomerative clustering
        X (numpy.array or pandas.DataFrame): unlabelled dataset
        n_components (int): number of principle components (optional)
        
        returns: silhouette score
    
    '''
    # Apply PCA if number of components is specified
    if n_components > 0:
        pca = PCA(n_components=n_components)
        X = pca.fit_transform(X)

    # Apply agglomerative clustering
    agglom = AgglomerativeClustering(n_clusters=n_clusters)
    cluster_labels = agglom.fit_predict(X)
    
    # Calculate silhouette score
    silhouette_avg = silhouette_score(X, cluster_labels)
    
    return silhouette_avg


In [109]:
# Implement clustering with raw data using cluster_fn above

# Try clustering with 2, 3, 4, 5, and 6 clusters
for n_clusters in range(2, 7):
    score = cluster_fn(n_clusters, df)
    scores_df = pd.concat([scores_df, pd.DataFrame({'Num Clusters': [n_clusters], 'Silhouette Score': [score]})], ignore_index=True)


In [110]:
# Implement clustering with PCA-transformed data using cluster_fn above

# Initialize a list to hold the silhouette scores
scores = []

# Try clustering with 2, 3, 4, 5, and 6 clusters and 2, 3, 4, 5, and 6 principle components
for n_clusters in range(2, 7):
    for n_components in range(2, 7):
        score = cluster_fn(n_clusters, df, n_components)
        scores.append({'Num Clusters': n_clusters, 'Num Components': n_components, 'Silhouette Score': score})


In [112]:
# Display results

# For 2.1 Results:
# Pivot the dataframe to create a table of the results
table = scores_df.pivot_table(index='Num Clusters', values='Silhouette Score')

# Format the table with two decimal places for the silhouette scores
styled_table = (table.style
                .format({'Silhouette Score': '{:.2f}'}))

# Add the caption to the table
caption = 'Silhouette Scores for Agglomerative Clustering'
print("2.1 Results:\n"+caption)

# Display the table
display(styled_table)

# For 2.2 Results:
# Create a pandas DataFrame from the scores list
scores_df = pd.DataFrame(scores)

# Pivot the dataframe to create a table of the results
table = scores_df.pivot_table(index='Num Clusters', columns='Num Components', values='Silhouette Score')

# Format the silhouette scores with two decimal places
styled_table = table.style.format('{:.2f}')

# Add the caption to the table
caption = 'Silhouette Scores for Agglomerative Clustering with PCA-transformed Data'
print("\n2.2 Results:\n"+caption)

# Display the table
display(styled_table)


2.1 Results:
Silhouette Scores for Agglomerative Clustering


Unnamed: 0_level_0,Silhouette Score
Num Clusters,Unnamed: 1_level_1
2,0.58
3,0.56
4,0.55
5,0.51
6,0.52



2.2 Results:
Silhouette Scores for Agglomerative Clustering with PCA-transformed Data


Num Components,2,3,4,5,6
Num Clusters,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,0.6,0.59,0.58,0.58,0.58
3,0.6,0.57,0.55,0.55,0.54
4,0.58,0.56,0.54,0.54,0.53
5,0.55,0.52,0.5,0.49,0.49
6,0.56,0.53,0.51,0.5,0.5


**Question**: Why did you select this improvement method? Which combination of number of clusters and number of components produced the best results? Did you improve the silhouette scores? If yes, how much of an improvement did you get over the previous results?

I selected agglomerative clustering as a potential improvement over KMeans clustering because agglomerative clustering is better suited for small datasets, and has better robustness to outliers. The best results were found with the following cominations: Num Components = 2, Num Clusters = 2 and Num Components = 2, Num Clusters = 3 at silhoutette scores of 0.60. The silhouette scores did not improve over KMeans clustering, indicating that our dataset is more convex, and thus KMeans clustering is more efficient.