# `DSML_WS_13` - Clustering Task

This is the final workshop of the DSML course. There are no new concepts to implement, so all that remains is to work through the final preparation task.

## 1. Clustering breast cancer samples

In the last workshop, we illustrated k-means clustering using the iris flower dataset. Put what you have learned into practice by applying it to the breast cancer dataset we already know. Would you have chosen the true number of two clusters without knowing that there are only two diagnoses (benign and malignant)? Do the following:
- load and prepare data (including feature scaling)
- run a principal component analysis and keep as many principal components as needed to preserve at least 95% of the variance in the original data
- run k-means for different values of k using the principal components as features
- select the most suitable k using the elbow method
- re-train your model using your selected number for k
- generate two scatterplots of the first and second principal components: one showing the true label and one showing your generated clusters

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [None]:
# load data
cancer = pd.read_csv('breast_cancer.csv', index_col="id")
cancer.dropna(axis=1, inplace=True)
cancer_wo_target = cancer.drop("diagnosis", axis=1)

cancer_wo_target.head()

In [None]:
# scale features
scaler = StandardScaler()
cancer_scaled = pd.DataFrame(scaler.fit_transform(cancer_wo_target))

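# run PCA to reduce dimensionality; 10 components are used here (hint: 10 PCs should be fine)
pca = PCA(n_components=10)
cancer_scaled_pca = pca.fit_transform(cancer_scaled)

# share of the original variance preserved by the 10 components (target: at least 0.95)
sum(pca.explained_variance_ratio_)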
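
Instead of guessing the number of components and checking the sum afterwards, the smallest sufficient number can be read off the cumulative explained variance. The following is a minimal sketch of that idea. The helper `components_for_variance` and the example ratios are hypothetical illustrations, not part of the workshop code and not results from the cancer data.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return count
    raise ValueError("threshold is not reached by the given components")


# Made-up ratios for illustration only (sorted in descending order, as PCA returns them)
example_ratios = [0.5, 0.25, 0.125, 0.0625, 0.04, 0.0225]
n_needed = components_for_variance(example_ratios, threshold=0.95)
print(n_needed)  # 5, because the first five ratios sum to 0.9775
```

In the notebook, the same helper could be applied to `explained_variance_ratio_` of a PCA fitted with all components. scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (for example `PCA(n_components=0.95)`), in which case it selects the number of components needed to explain that share of the variance.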

In [None]:
# determine number of clusters via elbow method
pca_clusters = []
pca_losses = []

for k in range(1, 21):
    model = KMeans(n_clusters=k, n_init=10)
    model.fit(cancer_scaled_pca)
    pca_clusters.append(k)
    pca_losses.append(model.inertia_)

plt.plot(pca_clusters, pca_losses)
plt.xticks(range(21))
plt.show()

# zoom in on small values of k to locate the elbow more easily
plt.plot(pca_clusters, pca_losses)
plt.xlim([0, 10])
plt.show()
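
Reading the elbow off a plot is subjective, so it can help to cross-check the visual impression with a simple number. The sketch below uses one possible heuristic, the largest second difference of the loss curve, which marks the point where the curve bends most sharply. This heuristic and the helper `elbow_by_second_difference` are assumptions added for illustration, not the workshop's method, and the example losses are made up.

```python
def elbow_by_second_difference(ks, losses):
    """Return the k at which the loss curve bends most sharply, measured by
    the discrete second difference loss[i-1] - 2*loss[i] + loss[i+1].

    The first and last k cannot be selected because they lack a neighbour
    on one side."""
    if len(ks) != len(losses) or len(ks) < 3:
        raise ValueError("need at least three (k, loss) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(losses) - 1):
        bend = losses[i - 1] - 2 * losses[i] + losses[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up losses for illustration only: a steep drop up to k=2, then a flat tail
example_ks = [1, 2, 3, 4, 5, 6]
example_losses = [1000.0, 400.0, 300.0, 250.0, 220.0, 200.0]
elbow_k = elbow_by_second_difference(example_ks, example_losses)
print(elbow_k)  # 2
```

In the notebook, the call would be `elbow_by_second_difference(pca_clusters, pca_losses)`. Treat the result as a sanity check on the plot rather than a replacement for it, since real inertia curves are often smoother than this example.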

In [None]:
# train model for the chosen number of clusters (here: 2)
n = 2
cancer_two = KMeans(n_clusters=n, n_init=10)
cancer_two.fit(cancer_scaled_pca)

In [None]:
# produce joint df
cancer_scaled_pca_df = pd.DataFrame(cancer_scaled_pca, index=cancer.index)
cancer_scaled_pca_df.columns = ["PC"+str(column+1) for column in cancer_scaled_pca_df.columns]
cancer_scaled_pca_df["Cluster"] = cancer_two.predict(cancer_scaled_pca)
cancer_scaled_pca_df["Diagnosis"] = cancer["diagnosis"]
cancer_scaled_pca_df.head(2)

In [None]:
# visualize: first plot coloured by generated cluster, second by true diagnosis
sns.lmplot(x="PC1", y="PC2", data=cancer_scaled_pca_df, fit_reg=False, hue="Cluster")
sns.lmplot(x="PC1", y="PC2", data=cancer_scaled_pca_df, fit_reg=False, hue="Diagnosis", hue_order=["B", "M"])
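
Besides comparing the two scatterplots by eye, the agreement between clusters and true labels can be summarised in a single number. The sketch below computes cluster purity, the share of samples that carry the majority label of their cluster. The helper `cluster_purity` and the example data are hypothetical and only illustrate the idea. Note that cluster IDs are arbitrary, so cluster 0 does not necessarily correspond to "B".

```python
from collections import Counter


def cluster_purity(clusters, labels):
    """Return the fraction of samples that carry the majority true label
    of the cluster they were assigned to."""
    if len(clusters) != len(labels) or len(clusters) == 0:
        raise ValueError("clusters and labels must be non-empty and of equal length")
    # count how often each (cluster, label) combination occurs
    counts = Counter(zip(clusters, labels))
    # for every cluster, keep the size of its largest label group
    majority = {}
    for (cluster, _label), n in counts.items():
        majority[cluster] = max(majority.get(cluster, 0), n)
    return sum(majority.values()) / len(clusters)


# Made-up assignments for illustration only
example_clusters = [0, 0, 0, 1, 1, 1, 1, 0]
example_labels = ["B", "B", "M", "M", "M", "M", "B", "B"]
purity = cluster_purity(example_clusters, example_labels)
print(purity)  # 0.75: six of the eight samples match their cluster's majority label
```

In the notebook, this could be called as `cluster_purity(cancer_scaled_pca_df["Cluster"], cancer_scaled_pca_df["Diagnosis"])`. For the full contingency table, `pd.crosstab(cancer_scaled_pca_df["Cluster"], cancer_scaled_pca_df["Diagnosis"])` shows how many samples of each diagnosis fall into each cluster.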