After completing this exercise, students should be able to:
1. Apply the k-means algorithm to identify clusters in a dataset.
2. Construct and utilize a scikit-learn Pipeline in order to fit a multistage model.
3. Identify hyperparameters used by a model.
4. Apply a metric for model evaluation.
5. Recognize an attribute that is calculated when calling the .fit() method on a scikit-learn model.
6. Use manual iteration to optimize a hyperparameter.

In [None]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load in the dataset and remove rows with missing values

penguins = pd.read_csv('data/penguins.csv').dropna().reset_index(drop = True)

penguins.head(1)

In this notebook, we'll focus exclusively on the numeric variables contained in this dataset. If you are interested in seeing how to apply clustering methods to datasets with categorical variables, see the k-modes or k-prototypes algorithms (described here http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.15.4028&rep=rep1&type=pdf and implemented in Python in the kmodes library: https://github.com/nicodv/kmodes).

In [None]:
# Create a list to hold the variables we will be working with.
variables = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

Before getting started, let's take a look at the documentation for KMeans (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

**Question 1:** A **hyperparameter** is used to control the learning process and is set prior to fitting the model. What **hyperparameters** can we set for the scikit-learn KMeans model?
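
One way to explore this alongside the documentation is the `get_params()` method, which every scikit-learn estimator provides. The sketch below is only an illustration: the variable names `model` and `hyperparameters` are our own, and the exact set of parameters printed may vary with your scikit-learn version.

```python
from sklearn.cluster import KMeans

# get_params() returns the constructor arguments (the hyperparameters)
# together with their current values, as a dictionary.
model = KMeans(n_clusters = 3)
hyperparameters = model.get_params()

for name, value in sorted(hyperparameters.items()):
    print(f"{name}: {value}")
```

Note that nothing has been fit at this point. These values are all chosen before the model ever sees the data.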

In [None]:
# First, instantiate a KMeans instance which will fit 3 clusters
n_clusters = 3

kmeans = KMeans(n_clusters = n_clusters)

In [None]:
# Then fit it to the numeric variables of the penguins dataset
kmeans.fit(penguins[variables])

**Question 2:** What attributes for this model have been set by the `.fit()` method? (Hint: These are attributes whose names end in an underscore. Hint 2: Make use of Tab in Jupyter to expose the attributes and methods of your fit object. You can also look at the documentation to see which attributes are available and what they mean.)
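
If Tab completion is not available, a rough alternative is to inspect the fitted object directly. The sketch below uses a tiny synthetic array (an assumption for illustration, not the penguins data) and lists the public attributes whose names end in an underscore. The names `X`, `model`, and `fitted_attributes` are hypothetical, and the exact list depends on your scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny synthetic dataset with two obvious groups (illustration only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

model = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit(X)

# Keep public attributes that end in an underscore: by scikit-learn
# convention, these are the ones learned from the data during .fit()
fitted_attributes = sorted(
    name for name in vars(model)
    if name.endswith('_') and not name.startswith('_')
)

print(fitted_attributes)
```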

First, take a look at the inertia value.

In [None]:
kmeans.inertia_

**Question 3:** Do we want to minimize or maximize this attribute?

**Question 4: True or False** By increasing the number of clusters, we can always decrease the inertia value.
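
For reference while you think about these questions: inertia is the sum of squared distances from each point to the center of the cluster it was assigned to. The sketch below recomputes it by hand on a small synthetic array (an assumption for illustration, not the penguins data) and compares it to the `inertia_` attribute. The names `X`, `model`, `assigned_centers`, and `manual_inertia` are our own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated pairs of points (illustration only)
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])

model = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit(X)

# Look up the center assigned to each point, then sum the squared
# Euclidean distances between each point and its center
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```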

Now, let's visualize the result of running this algorithm. Notice that we'll make use of the `cluster_centers_` attribute so that we can include the centroids of the clusters in our plot.

In [None]:
# Choose the variables you want to visualize.
# i and j indicate the index of the variable from the variables list 
# ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
i = 0
j = 1

plt.figure(figsize = (10,6))
sns.scatterplot(data = penguins,
               x = variables[i],
               y = variables[j],
               hue = kmeans.labels_)

sns.scatterplot(x = kmeans.cluster_centers_[:,i],
                y = kmeans.cluster_centers_[:,j],
                s = 500, 
                hue = list(range(n_clusters)), 
                marker = 'D',
                legend = False);

**Question 5:** Change the values of `i` and `j` in the code above so that you can see different views of the clustering and the centers. For certain combinations of features, your clusters may have quite a bit of overlap. You might even get seemingly disconnected clusters (look at bill_depth_mm and body_mass_g, for example). Why might that be the case? 

Now, we'll use a Pipeline so that we can scale our variables prior to clustering them.

In [None]:
n_clusters = 3

pipeline = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = n_clusters))
    ]
)

pipeline.fit(penguins[variables])

Once we have a fit pipeline, we can examine the individual components by referencing them either by index or by name. For example, to see the cluster component, we can use either `pipeline['cluster']` or `pipeline[1]`.
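
As a quick check that both ways of indexing give you the same object, here is a minimal sketch on a small synthetic array (an assumption for illustration). The name `demo_pipeline` is hypothetical, chosen so that it does not overwrite the `pipeline` you just fit. The `named_steps` attribute is a third way to reach a step by name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic dataset (illustration only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

demo_pipeline = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = 2, n_init = 10, random_state = 0))
    ]
)
demo_pipeline.fit(X)

# Three ways to reach the same fitted KMeans step
by_name = demo_pipeline['cluster']
by_index = demo_pipeline[1]
by_named_steps = demo_pipeline.named_steps['cluster']

print(by_name is by_index, by_name is by_named_steps)
print(type(demo_pipeline[0]).__name__)
```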

With that in mind, look at the inertia value for our multistage model.

In [None]:
pipeline['cluster'].inertia_

**Question 6:** Is it fair to compare the inertia values for the unscaled results to those from the scaled results? Why or why not?

We can now examine the results. Notice that we'll be making use of the `inverse_transform` method of our StandardScaler so that we can transform the cluster centers back into the original units.
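
As a short aside before plotting, the sketch below shows what the scaler does to the data and how `inverse_transform` undoes it. It uses a small made-up array (an assumption for illustration, not the penguins data), and the names `scaler`, `scaled`, `manual_scaled`, and `restored` are our own. `StandardScaler` subtracts each column's mean and divides by its standard deviation, so `inverse_transform` multiplies by the standard deviation and adds the mean back.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales (illustration only)
X = np.array([[40.0, 3000.0], [45.0, 4000.0], [50.0, 5000.0]])

scaler = StandardScaler().fit(X)
scaled = scaler.transform(X)

# The same standardization by hand: (value - column mean) / column std
manual_scaled = (X - X.mean(axis = 0)) / X.std(axis = 0)

# inverse_transform maps scaled values back into the original units
restored = scaler.inverse_transform(scaled)

print(np.allclose(scaled, manual_scaled), np.allclose(restored, X))
```

Because the cluster centers live in the scaled space, the same round trip is what lets us draw them on axes measured in millimeters and grams.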

In [None]:
i = 0
j = 1

plt.figure(figsize = (10,6))
sns.scatterplot(data = penguins,
               x = variables[i],
               y = variables[j],
               hue = pipeline[1].labels_)

sns.scatterplot(x = pipeline['scaler'].inverse_transform(pipeline['cluster'].cluster_centers_)[:,i],
                y = pipeline['scaler'].inverse_transform(pipeline['cluster'].cluster_centers_)[:,j],
                s = 500, 
                hue = list(range(n_clusters)), 
                marker = 'D',
                legend = False);

**Question 7:** What major differences do you see with this clustering compared to the previous one?

Next, let's try to find an optimal number of clusters. We'll employ the "elbow method", which you saw in DataCamp.

In [None]:
inertias = []

max_clusters = 5
for n_clusters in range(1, max_clusters + 1):
    
    pipeline = Pipeline(
        steps = [
            ('scaler', StandardScaler()),
            ('cluster', KMeans(n_clusters = n_clusters))
        ]
    )

    pipeline.fit(penguins[variables])
    
    inertias.append(pipeline['cluster'].inertia_)

In [None]:
plt.figure(figsize = (10,6))
plt.plot(range(1, max_clusters + 1), inertias)
plt.scatter(range(1, max_clusters + 1), inertias, s = 100);

**Question 8:** Based on this plot, how many clusters do you think you should use?

After you have answered the question above, fill in the code below to refit the model.

In [None]:
n_clusters = 2

pipeline = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = n_clusters))
    ]
)

pipeline.fit(penguins[variables])

Finally, use the `pd.crosstab()` function so that you can see how well the clustering distinguishes between penguin species.
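
If you have not used `pd.crosstab()` before, the toy sketch below may help you read its output. The species names and cluster labels here are made up for illustration, and the names `toy_species`, `toy_labels`, and `table` are hypothetical. Each cell counts how many rows share that combination of row value and column value.

```python
import pandas as pd

# Made-up labels for five imaginary penguins (illustration only)
toy_species = pd.Series(['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'], name = 'species')
toy_labels = pd.Series([0, 0, 1, 1, 0], name = 'cluster')

# Rows are species, columns are cluster labels, cells are counts
table = pd.crosstab(toy_species, toy_labels)

print(table)
```

A clustering that separates the species well would put most of each row's count into a single column.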

In [None]:
pd.crosstab(penguins['species'], pipeline['cluster'].labels_)

### Bonus Material
Do not attempt until you have completed all of the above questions.

So far, we have evaluated our clusters using the _inertia_ metric, which is built into the KMeans class. Another commonly used metric for evaluating clusters is the [**silhouette score**](https://en.wikipedia.org/wiki/Silhouette_(clustering)).

Scikit-learn offers functions for computing silhouette scores: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html.
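
To make the definition concrete, here is a small worked sketch in plain Python on four one-dimensional points with hand-assigned clusters (made-up data, for illustration only). For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to the points of any other cluster, and the silhouette value is `(b - a) / max(a, b)`. The function name `silhouette_value` and the variable names are our own, and setting singleton clusters to 0 follows the usual convention.

```python
# Toy 1-D data with hand-assigned cluster labels (illustration only)
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]

def silhouette_value(index):
    """Silhouette value of one point: (b - a) / max(a, b)."""
    own = labels[index]

    # a: mean distance to the other points in the same cluster
    same = [abs(points[index] - points[k])
            for k in range(len(points))
            if labels[k] == own and k != index]
    if not same:
        # A point alone in its cluster gets a silhouette value of 0
        return 0.0
    a = sum(same) / len(same)

    # b: smallest mean distance to the points of any other cluster
    b = min(
        sum(abs(points[index] - points[k])
            for k in range(len(points)) if labels[k] == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )

    return (b - a) / max(a, b)

toy_silhouette_values = [silhouette_value(k) for k in range(len(points))]
print(toy_silhouette_values)
```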

**Bonus Question 1:** Read the description of the silhouette score on Wikipedia (https://en.wikipedia.org/wiki/Silhouette_(clustering)). Do we want to minimize or maximize the silhouette score?

**Bonus Question 2:** True or False: We can always increase (or always decrease) the silhouette score by increasing the number of clusters.

Let's import the `silhouette_samples` and `silhouette_score` functions from the metrics module of scikit-learn.

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score

Before diving into silhouette scores, let's refit with 3 clusters.

In [None]:
n_clusters = 3

pipeline = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = n_clusters))
    ]
)

pipeline.fit(penguins[variables])

You can first calculate the silhouette value of each point using the `silhouette_samples` function from scikit-learn's metrics module. Use this function to produce an array of silhouette scores (one per data point). Do you need to use the scaled or the unscaled data for this?

In [None]:
# Your code here

In [None]:
silhouette_values = silhouette_samples(pipeline['scaler'].transform(penguins[variables]), pipeline['cluster'].labels_)

**Question** Look at the distribution of silhouette values (either overall or by cluster). What do you notice? Are there any negative silhouette values? If so, how is that possible?

In [None]:
# Your code here

In [None]:
silhouette_df = pd.DataFrame({'species': penguins['species'],
             'cluster': pipeline['cluster'].labels_,
             'silhouette_value': silhouette_values})

In [None]:
silhouette_df.groupby('cluster')['silhouette_value'].describe()

You can also find the silhouette score (the average silhouette value across all datapoints) by using the `silhouette_score` function. Try this out.

In [None]:
# Your code here

In [None]:
silhouette_avg = silhouette_score(pipeline['scaler'].transform(penguins[variables]), pipeline['cluster'].labels_)

silhouette_avg

**Bonus Coding Task:** Create a for loop which will fit a k-means model over a range of cluster counts. Compare silhouette scores to choose how many clusters to use. How does what you find here compare to what you found using inertia above?

In [None]:
# Your Code Here

In [None]:
silhouette_scores = []

max_clusters = 8
for n_clusters in range(2, max_clusters + 1):
    
    pipeline = Pipeline(
        steps = [
            ('scaler', StandardScaler()),
            ('cluster', KMeans(n_clusters = n_clusters))
        ]
    )

    pipeline.fit(penguins[variables])
    
    silhouette_scores.append(silhouette_score(pipeline['scaler'].transform(penguins[variables]), pipeline['cluster'].labels_))

In [None]:
plt.figure(figsize = (10,6))
plt.plot(range(2, max_clusters + 1), silhouette_scores)
plt.scatter(range(2, max_clusters + 1), silhouette_scores, s = 100);

**Coding Challenge:** Refit your k-means model with 3 clusters. Then, using the definition of the silhouette score, manually check the values that scikit-learn produces.

**Hint:** Since this calculation requires computing the distance between all points in your sample, you might find the `pdist` function useful from scipy to quickly do these calculations (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html). You may also want to look at the `squareform` function (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html#scipy.spatial.distance.squareform) to make the output from `pdist` easier to work with.
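
To see what these two functions return, here is a minimal sketch on three made-up points (an assumption for illustration). `pdist` returns a flat "condensed" array with one distance per pair of points, and `squareform` expands it into a symmetric matrix with zeros on the diagonal. The names `X`, `condensed`, and `square` are our own.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three points on a line through the origin (illustration only)
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# One Euclidean distance per pair, in the order (0,1), (0,2), (1,2)
condensed = pdist(X)

# Symmetric matrix where entry [i, j] is the distance between points i and j
square = squareform(condensed)

print(condensed)
print(square)
```

With the square form, `square[:, i]` gives the distances from point `i` to every point in the sample, which is convenient for the per-point averages in the silhouette definition.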

In [None]:
from scipy.spatial.distance import pdist, squareform

In [None]:
n_clusters = 3

pipeline = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = n_clusters))
    ]
)

pipeline.fit(penguins[variables])

In [None]:
# Your code for calculating silhouette values

In [None]:
dists = squareform(pdist(pipeline['scaler'].transform(penguins[variables])))

labels = pipeline['cluster'].labels_

silhouette_scores = []

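for pt in range(len(penguins)):
    
    cluster = labels[pt]

    others = [x for x in range(n_clusters) if x != cluster]

    # a: mean distance from this point to the other points in its own cluster
    # (the point's distance to itself is 0, so we divide by cluster size - 1)
    a = dists[:, pt][labels == cluster].sum() / ((labels == cluster).sum() - 1)
    # b: smallest mean distance from this point to the points of any other cluster
    b = min([dists[:, pt][labels == i].mean() for i in others])

    silhouette_scores.append((b - a) / max(a, b))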

In [None]:
silhouette_scores

In [None]:
silhouette_samples(pipeline['scaler'].transform(penguins[variables]), pipeline['cluster'].labels_)