# Lab 3 - Part 2: PCA and Clustering (12 marks)
### Due Date: Monday, March 13 at 12pm

Author: Sam Rainbow

The purpose of this portion of the assignment is to practice using PCA and clustering techniques on a given dataset.

In [1]:
import numpy as np
import pandas as pd

## 0. Function definitions (2 marks)

In [20]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn(n_clusters, X, n_components=0):
    '''Calculate the silhouette score for a given dataset, number of clusters,
       and number of principal components using Kmeans clustering (random_state=0)

        n_clusters (int): number of clusters to use for Kmeans
        X (numpy.array or pandas.DataFrame): unlabelled dataset
        n_components (int): number of principal components (optional; 0 skips PCA)

        returns: silhouette score

    '''
    # TODO: Implement function body
    if n_components > 0:
        # Project the data onto the leading principal components before clustering
        pca = PCA(n_components=n_components)
        X = pca.fit_transform(X)

    # Fix the seed, as the docstring specifies, so the scores are reproducible
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    score = silhouette_score(X, kmeans.labels_)

    return score
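
As a rough check on what `silhouette_score` measures, the sketch below recomputes it by hand as the mean of s(i) = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance from sample i to the other members of its own cluster and b_i is the smallest mean distance to the members of any other cluster. The `make_blobs` data and the helper `manual_silhouette` are assumptions made for illustration; they are not part of the lab template or the ceramic dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Synthetic stand-in data (an assumption; not the ceramic dataset)
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)


def manual_silhouette(X, labels):
    """Hypothetical helper: mean of s(i) = (b_i - a_i) / max(a_i, b_i)."""
    D = pairwise_distances(X)  # Euclidean distances between all pairs
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a_i: mean distance to the other members of the sample's own cluster
        a = D[i, same].sum() / (n_same - 1)
        # b_i: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


manual = manual_silhouette(X, labels)
reference = silhouette_score(X, labels)
print(manual, reference)
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping clusters.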


## 1. Load data (2 marks)

For this assignment, we will use the dataset found below:

https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples

In [35]:
# TODO: Import dataset
df = pd.read_csv('Chemical Composion of Ceramic.csv')

Two of the columns are non-numeric. For this assignment, we will remove those two columns and focus on clustering the ceramic samples based on the numerical measurements.

In [36]:
# TODO: Remove non-numeric columns
df = df.drop(['Ceramic Name', 'Part'], axis =1)
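
An alternative to naming the columns explicitly is to keep only the numeric columns by dtype. The sketch below uses a small made-up frame; the two text column names mirror the ones dropped above, while the numeric column names and all values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with made-up values (an assumption; not the real dataset)
toy = pd.DataFrame({
    "Ceramic Name": ["A-1", "A-2", "B-1"],
    "Part": ["Body", "Glaze", "Body"],
    "Na2O": [0.62, 0.57, 0.49],
    "MgO": [0.38, 0.47, 0.19],
})

# Keep only the columns with a numeric dtype
numeric = toy.select_dtypes(include="number")
print(numeric.columns.tolist())
```

Selecting by dtype avoids hard-coding column names, at the cost of silently keeping any numeric column that should not be clustered on, such as an ID.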

## 2. Implement clustering (8 marks)

### 2.1 Cluster using raw data (1 mark)

Implement Kmeans clustering using the raw data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters.

In [26]:
# TODO: Implement clustering with raw data using cluster_fn above
raw_data = []

for n_clusters in range(2, 7):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    kmeans.fit(df)
    score = silhouette_score(df, kmeans.labels_)
    raw_data.append(score)

### 2.2 Cluster using PCA-transformed data (2 marks)

Implement Kmeans clustering using the PCA-transformed data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters and 2, 3, 4, 5 and 6 principal components.

In [27]:
# TODO: Implement clustering with PCA-transformed data using cluster_fn above
pca_data = []
for n_clusters in range(2, 7):
    for n_components in range(2, 7):
        score = cluster_fn(n_clusters=n_clusters, X=df, n_components=n_components)
        pca_data.append(score)


### 2.3 Display results (2 marks)

Print the results for 2.1 and 2.2 in a table. Include column and row labels.

In [29]:
# TODO: Display results
rows = ['PCA - 2 PCs', 'PCA - 3 PCs', 'PCA - 4 PCs', 'PCA - 5 PCs', 'PCA - 6 PCs', 'Raw Data']
columns = ['2 Clusters', '3 Clusters', '4 Clusters', '5 Clusters', '6 Clusters']
# pca_data is ordered by cluster count first, then by component count, so
# reshape it to (clusters, components) and transpose to put components on rows
pca_scores = np.array(pca_data).reshape(5, 5).T
data = np.vstack([pca_scores, raw_data])

table = pd.DataFrame(data, index=rows, columns=columns)

print(table)

             2 Clusters  3 Clusters  4 Clusters  5 Clusters  6 Clusters
PCA - 2 PCs    0.619442    0.611625    0.600752    0.567088    0.569320
PCA - 3 PCs    0.599961    0.586609    0.570531    0.542209    0.546232
PCA - 4 PCs    0.589955    0.570949    0.553715    0.521348    0.529829
PCA - 5 PCs    0.587472    0.567470    0.549286    0.519533    0.524216
PCA - 6 PCs    0.585963    0.564725    0.546752    0.512537    0.521257
Raw Data       0.584013    0.561640    0.543411    0.508064    0.510399


**Question**: Which combination of number of clusters and number of components produced the best results? What is the silhouette score for this combination? **(3 marks)**

The best results were produced by the combination of 2 clusters and 2 principal components, with a silhouette score of 0.619.
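
The best combination can also be read off a labelled score table programmatically rather than by eye. The sketch below is a minimal example on a made-up 2x2 table; the values are illustrative assumptions and are not the lab's results.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration only (not the lab's results)
scores = pd.DataFrame(
    [[0.40, 0.35], [0.55, 0.30]],
    index=["PCA - 2 PCs", "PCA - 3 PCs"],
    columns=["2 Clusters", "3 Clusters"],
)

# Position of the largest entry, converted back to row and column labels
r, c = np.unravel_index(scores.to_numpy().argmax(), scores.shape)
best_row, best_col = scores.index[r], scores.columns[c]
best_score = scores.iat[r, c]
print(best_row, best_col, best_score)
```

Because the lookup goes through the row and column labels, it is only as reliable as the labelling of the table itself.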

## 3. Improve results (Bonus - 3 marks)

Think about how you could improve the results from the previous section. Two potential methods include preprocessing the data or selecting a different clustering algorithm. Repeat section 2 with your selected improvement method to determine what the new silhouette scores would be.

In [39]:
# TODO: Repeat steps 2.1-2.3 using a different method/preprocessing/etc.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
raw_data = []
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

for n_clusters in range(2, 7):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    kmeans.fit(X_scaled)
    score = silhouette_score(X_scaled, kmeans.labels_)
    raw_data.append(score)

pca_data = []
for n_clusters in range(2, 7):
    for n_components in range(2, 7):
        score = cluster_fn(n_clusters=n_clusters, X=X_scaled, n_components=n_components)
        pca_data.append(score)

print("-----------------Scaled Data Silhouette Scores----------------------")
rows = ['PCA - 2 PCs', 'PCA - 3 PCs', 'PCA - 4 PCs', 'PCA - 5 PCs', 'PCA - 6 PCs', 'Raw Data']
columns = ['2 Clusters', '3 Clusters', '4 Clusters', '5 Clusters', '6 Clusters']
# pca_data is ordered by cluster count first, then by component count, so
# reshape it to (clusters, components) and transpose to put components on rows
pca_scores = np.array(pca_data).reshape(5, 5).T
data = np.vstack([pca_scores, raw_data])

table = pd.DataFrame(data, index=rows, columns=columns)

print(table)



-----------------Scaled Data Silhouette Scores----------------------
             2 Clusters  3 Clusters  4 Clusters  5 Clusters  6 Clusters
PCA - 2 PCs    0.548234    0.609737    0.591239    0.510540    0.512535
PCA - 3 PCs    0.460124    0.477870    0.436154    0.422588    0.425467
PCA - 4 PCs    0.406468    0.419754    0.392035    0.370892    0.363046
PCA - 5 PCs    0.375081    0.369801    0.329281    0.319624    0.318278
PCA - 6 PCs    0.350386    0.339810    0.310642    0.296287    0.299232
Raw Data       0.286008    0.268343    0.242909    0.231110    0.221541


**Question**: Why did you select this improvement method? Which combination of number of clusters and number of components produced the best results? Did you improve the silhouette scores? If yes, how much of an improvement did you get over the previous results?

I selected scaling as the improvement method in an attempt to reduce the influence of features with large values, which might otherwise bias the results. I tried both MinMaxScaler() and StandardScaler(), but neither improved the silhouette scores. The best MinMaxScaler result was 0.615 with 2 clusters and 2 principal components, and the best StandardScaler result (shown in the table above) was 0.6097 with 3 clusters and 2 principal components. Both are slightly below the best unscaled score of 0.619.
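
To illustrate the concern about features with large values, the sketch below compares the explained variance ratio of PCA before and after standardization. The two-feature synthetic data, with one feature on a scale 1000 times larger than the other, is an assumption chosen for illustration and is unrelated to the ceramic measurements.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two independent features on very different scales
X = np.column_stack([
    rng.normal(0.0, 1000.0, size=200),
    rng.normal(0.0, 1.0, size=200),
])

# Without scaling, the first component is dominated by the large-scale feature
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization, both features contribute comparable variance
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

Scaling changes which directions PCA treats as important, so it can lower the silhouette score even when it gives every feature a fairer weight.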