# Clustering the Iris data set

The goal of this notebook is to apply hierarchical clustering, k-means clustering, and DBSCAN to the Iris data set.

This notebook was created by [Chloé-Agathe Azencott](http://cazencott.info), inspired by material from [Marc Harper](http://marcharper.codes/).

This notebook was created using
* python 3.4.3
* numpy 1.15.0
* matplotlib 2.2.2
* scikit-learn 0.19.2

You can check your version of Python by running
```python
import sys
print(sys.version)
```

and the version of any module by running
```python
import numpy  # replace numpy with the name of the module you want to check
print(numpy.__version__)
```

## Loading the data science libraries

In [None]:
%pylab inline
import pandas as pd

## 1. Data

The Iris data set is a small data set, introduced in 1936 by the British statistician and biologist Ronald Fisher, that is very often used to illustrate machine learning concepts.

It contains 150 plant samples, from 3 different species of iris (_Iris setosa_, _Iris virginica_ and _Iris versicolor_). Four features were measured from each sample:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm 

Here, we will try to cluster the plants, _without using their labels_, based only on the petal descriptors. We will then compare our clustering to the actual plant labels.

### Loading the data
The [Iris data set](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) is available from scikit-learn.

In [None]:
from sklearn import datasets

In [None]:
iris = datasets.load_iris()
X = iris.data[:, 2:4]  # we only use the last two features (petal descriptors)
y = iris.target

In [None]:
print(X.shape)
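As a quick sanity check, the sketch below confirms which columns the slice `2:4` selects and how the samples are distributed across species. It only reads attributes of the object returned by `datasets.load_iris()`; the variable names `iris_check` and `counts` are introduced here for illustration and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn import datasets

iris_check = datasets.load_iris()

# Names of the two columns kept by the slice [:, 2:4]
print(iris_check.feature_names[2:4])

# Species names, in the order of the integer labels 0, 1, 2
print(list(iris_check.target_names))

# Number of samples per species
counts = np.bincount(iris_check.target)
print(counts)
```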

## 2. Hierarchical clustering

Let us use the agglomerative clustering algorithm implemented in scikit-learn's [cluster.AgglomerativeClustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html). In this implementation, you specify the number of clusters up front, rather than producing a dendrogram and choosing the number of clusters from it.
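For comparison, here is a minimal sketch of the dendrogram-based workflow, assuming SciPy's `scipy.cluster.hierarchy` module (it is not used elsewhere in this notebook). `linkage` records the full sequence of merges, and `fcluster` cuts that hierarchy into at most a requested number of flat clusters. The names `petals`, `merges`, and `flat_labels` are illustrative.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn import datasets

# Petal descriptors only, as in the rest of this section
petals = datasets.load_iris().data[:, 2:4]

# Full merge history with complete linkage: one row per merge
merges = linkage(petals, method="complete")
print(merges.shape)  # (149, 4): 150 samples need 149 merges

# Cut the hierarchy so that at most 3 flat clusters remain
flat_labels = fcluster(merges, t=3, criterion="maxclust")
print(sorted(set(flat_labels)))
```

Passing `merges` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself, which is one way to choose the number of clusters by eye.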

In [None]:
from sklearn import cluster

Since we know there are three classes in the data, we will use the algorithm to produce 3 clusters. Let us use complete linkage.
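To make the notion of linkage concrete, the sketch below computes the distance between two tiny, made-up clusters under three common linkage functions. It assumes `scipy.spatial.distance.cdist` for the pairwise distances; the clusters `cluster_a` and `cluster_b` are hypothetical toy data, not taken from Iris.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two toy clusters on a line
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# Euclidean distance between every point of a and every point of b
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()     # closest pair of points
complete = pairwise.max()   # farthest pair of points
average = pairwise.mean()   # mean over all pairs

print(single, complete, average)  # 2.0 5.0 3.5
```

Agglomerative clustering repeatedly merges the two clusters with the smallest linkage distance, so the choice of linkage changes which merges happen first.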

In [None]:
clustering = cluster.AgglomerativeClustering(n_clusters=3, linkage='complete')

clustering.fit(X)

### Visualizing the cluster assignments

We will now plot the samples, and color them according to their cluster assignment.

`clustering.labels_` contains the cluster assignments of all data points in `X`.

In [None]:
fig = plt.figure(figsize=(6, 6))

plt.scatter(X[:, 0], X[:, 1],           
            c=clustering.labels_ , # color by cluster assignment
            edgecolor='none', # remove dot border
           ) 
plt.colorbar(label='cluster label', ticks=range(3))

plt.xlabel("Petal length (cm)", fontsize=14)
plt.ylabel("Petal width (cm)", fontsize=14)
plt.title("Agglomerative clustering, complete linkage, petal features")

__Question 1:__ What do you think of the results obtained using this clustering algorithm on the data? Are you satisfied these clusters make sense?

__Answer:__

### Comparing the cluster assignments to the actual labels

Let us visualize the actual labels of the samples.

In [None]:
fig = plt.figure(figsize=(6, 6))

plt.scatter(X[:, 0], X[:, 1],           
            c=y , # color by label
            edgecolor='none', # remove dot border
           ) 
plt.colorbar(label='true label', ticks=range(3))

plt.xlabel("Petal length (cm)", fontsize=14)
plt.ylabel("Petal width (cm)", fontsize=14)
plt.title("True labels")

__Question 2:__ Visually, how well do you think the clustering algorithm matches the actual labels?

__Answer:__ 

The adjusted Rand index, implemented in scikit-learn's [metrics.adjusted_rand_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html), measures the similarity of two assignments. It ignores permutations of the cluster labels and is normalized for chance: it is close to 0 for random assignments and equals 1 when the two assignments are identical up to a permutation.

You can read more about it on [Wikipedia](https://en.wikipedia.org/wiki/Rand_index#Adjusted_Rand_index) and in the [documentation](http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-index).
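The two properties of the adjusted Rand index can be checked on a small hand-made example. The label lists below are hypothetical toy data chosen for illustration.

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1, 2, 2]

# Same partition, but with the cluster names shuffled
permuted = [2, 2, 0, 0, 1, 1]
score_permuted = adjusted_rand_score(truth, permuted)
print(score_permuted)  # 1.0: renaming clusters does not change the score

# A clustering that carries no information: everything in one cluster
single_cluster = [0, 0, 0, 0, 0, 0]
score_single = adjusted_rand_score(truth, single_cluster)
print(score_single)  # 0.0: no better than chance
```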

In [None]:
from sklearn import metrics

In [None]:
print("%.3f" % metrics.adjusted_rand_score(y, clustering.labels_))

__Question 3:__ How do you interpret this adjusted Rand index value? Is this a good match or a poor match?

__Answer:__

### Varying the clustering hyperparameters

__Question 4:__ What do you think will happen if you ask the hierarchical clustering algorithm for 2 clusters only? Check whether this is the case. What is the adjusted Rand index value of this clustering?

In [None]:
# TODO

__Question 5:__ Are you surprised by the clustering you obtain? Try using a different linkage function.

In [None]:
# TODO

__Question 6:__ What do you think will happen if you ask the hierarchical clustering algorithm for 4 clusters? Check whether this is the case. What is the adjusted Rand index value of this clustering?

In [None]:
# TODO

## 4. K-means clustering

To illustrate the K-means algorithm, we will use the _first two_ features of our data, which describe the sepals rather than the petals. You can also try it on the petal descriptors but, as you have seen above, that is a rather easy problem.

In [None]:
X = iris.data[:, :2]  # we only use the first two features (sepal descriptors)

Let us first visualize the true labels of the data here:

In [None]:
fig = plt.figure(figsize=(6, 6))

plt.scatter(X[:, 0], X[:, 1],           
            c=y , # color by label
            edgecolor='none', # remove dot border
           ) 
plt.colorbar(label='true label', ticks=range(3))

plt.xlabel("Sepal length (cm)", fontsize=14)
plt.ylabel("Sepal width (cm)", fontsize=14)
plt.title("True labels")

__Question 8:__ How easy do you think clustering this data is going to be? Why?

__Answer:__

Let us start with hierarchical clustering:

In [None]:
clustering = cluster.AgglomerativeClustering(n_clusters=3, linkage='complete')

clustering.fit(X)

fig = plt.figure(figsize=(6, 6))

plt.scatter(X[:, 0], X[:, 1],           
            c=clustering.labels_ , # color by cluster assignment
            edgecolor='none', # remove dot border
           ) 
plt.colorbar(label='cluster label', ticks=range(3))

plt.xlabel("Sepal length (cm)", fontsize=14)
plt.ylabel("Sepal width (cm)", fontsize=14)
plt.title("Agglomerative clustering, complete linkage, sepal features")

print("Adjusted Rand Index: %.3f" % metrics.adjusted_rand_score(y, clustering.labels_))

As you can see, the clustering we obtain does not make much sense (the clusters are not well separated, for example) and does not match the actual labels.

### Vanilla K-Means

K-means is implemented in scikit-learn's [cluster.KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). Setting `n_init=1` and `init='random'` runs Lloyd's algorithm once, starting from a single random initialization.
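To show what Lloyd's algorithm does under the hood, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the function name `lloyd_kmeans`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def lloyd_kmeans(points, n_clusters, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm with one random initialization."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick distinct samples as starting centroids
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centroids = points[start]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k)
            else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
toy_labels, toy_centroids = lloyd_kmeans(toy, n_clusters=2)
print(toy_labels)
```

On data that is less cleanly separated, the result depends on the random starting centroids, which is why a single initialization can give a poor clustering.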

In [None]:
clustering = cluster.KMeans(n_clusters=3, 
                            n_init=1, init='random' # use a single, random initialization
                           )
clustering.fit(X)

In [None]:
fig = plt.figure(figsize=(6, 6))

plt.scatter(X[:, 0], X[:, 1],           
            c=clustering.labels_ , # color by cluster assignment
            edgecolor='none', # remove dot border
           ) 
plt.colorbar(label='cluster label', ticks=range(3))

plt.xlabel("Sepal length (cm)", fontsize=14)
plt.ylabel("Sepal width (cm)", fontsize=14)
plt.title("K-Means clustering, single random initialization, sepal features")

print("Adjusted Rand Index: %.3f" % metrics.adjusted_rand_score(y, clustering.labels_))

__Question 9:__ How does the k-means algorithm perform on the sepals?

__Answer:__

### K-means++

__Question 10:__ Use the default scikit-learn parameters for the k-means clustering. What do they mean, and how good is the resulting clustering?

In [None]:
# Answer

## 5. DBSCAN

DBSCAN is a density-based clustering algorithm, implemented in scikit-learn's [cluster.DBSCAN](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
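The sketch below illustrates how to read DBSCAN's output on a small, made-up data set: two dense groups plus one isolated point. The toy coordinates and the names `toy_dbscan`, `n_found`, and `n_noise` are hypothetical; `min_samples=5` is written out explicitly and matches scikit-learn's default.

```python
import numpy as np
from sklearn.cluster import DBSCAN

toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],  # group 1
    [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1], [3.05, 3.05],  # group 2
    [10.0, 10.0],                                                  # isolated
])

toy_dbscan = DBSCAN(eps=0.25, min_samples=5).fit(toy)
toy_labels = toy_dbscan.labels_

# Clusters are numbered from 0; the label -1 marks noise points
n_found = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_noise = int(np.sum(toy_labels == -1))
print(n_found, n_noise)  # 2 1
```

Unlike the previous two algorithms, the number of clusters is not an input: it follows from `eps` and `min_samples`.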

__Question 11:__ Do you think a density-based clustering will give satisfactory results on our data?

__Answer:__

In [None]:
clustering = cluster.DBSCAN(eps=0.25)
clustering.fit(X)

In [None]:
fig = plt.figure(figsize=(6, 6))

plt.scatter(X[:, 0], X[:, 1],           
            c=clustering.labels_ , # color by cluster assignment
            edgecolor='none', # remove dot border
           ) 
plt.colorbar(label='cluster label', ticks=sorted(set(clustering.labels_)))  # labels may include -1 and are not limited to 0..2

plt.xlabel("Sepal length (cm)", fontsize=14)
plt.ylabel("Sepal width (cm)", fontsize=14)
plt.title("DBSCAN clustering, sepal features")

print("Adjusted Rand Index: %.3f" % metrics.adjusted_rand_score(y, clustering.labels_))

__Question 12:__ Does this clustering match your expectations? Note that DBSCAN assigns the label -1 to points it considers noise (outliers).

__Answer:__