A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 5.2. Clustering.

In this problem, we will continue from where we left off in Problem 1 and apply the k-means clustering algorithm to Delta Air Lines' aircraft.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn

from sklearn.utils import check_random_state
from sklearn.cluster import KMeans

from nose.tools import assert_equal, assert_is_instance, assert_true, assert_is_not
from numpy.testing import assert_array_equal, assert_array_almost_equal, assert_almost_equal

I saved the `reduced` array (the first 10 principal components of the Delta Air Lines data set) from Problem 1 as an `npy` file.

```python
>>> np.save("delta_reduced.npy", reduced)
```

This file is in `/home/data_scientist/data/misc`. We will load this file as a Numpy array and start from there.

In [None]:
reduced = np.load("/home/data_scientist/data/misc/delta_reduced.npy")
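
As a quick illustration of the save/load round trip described above, here is a minimal sketch that writes a small array to a temporary `.npy` file and reads it back. The array `demo` and the file name `demo_reduced.npy` are hypothetical stand-ins, not the actual course data.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the `reduced` array: 5 rows, 10 components.
rng = np.random.RandomState(0)
demo = rng.normal(size=(5, 10))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo_reduced.npy")
    np.save(path, demo)     # write the array in NumPy's binary .npy format
    loaded = np.load(path)  # read it back as an ndarray

print(loaded.shape, loaded.dtype)
```

The loaded array has the same shape, dtype, and values as the one that was saved, which is why the notebook can pick up `reduced` exactly where Problem 1 left off.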

# k-means

- Write a function named `cluster()` that fits a k-means clustering model and returns a tuple `(sklearn.cluster.KMeans, np.array)`.

- The first element of the tuple is an instance of `KMeans()`. For example,
```python
def cluster(array, random_state, n_clusters):
    # YOUR CODE HERE
    model = KMeans(
        # YOUR CODE HERE
    )
    clusters = # YOUR CODE HERE
    return model, clusters
```
- The second element of the tuple is a 1-d array that contains the predictions of k-means clustering, i.e. which cluster each data point belongs to.

- Use default values for all parameters in `KMeans()` except for `n_clusters` and `random_state`.
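
On the `random_state` argument: scikit-learn accepts either an integer seed or a `numpy.random.RandomState` instance, and `check_random_state` (imported above) converts between the two. The sketch below only illustrates that helper; the variable names are made up for this example.

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer seed is turned into a RandomState instance.
rs_from_int = check_random_state(0)
print(type(rs_from_int))

# An existing RandomState instance is passed through unchanged.
existing = np.random.RandomState(42)
same = check_random_state(existing)
print(same is existing)

# Two generators built from the same seed produce the same draws,
# which is what makes a seeded k-means run reproducible.
a = check_random_state(0).rand(3)
b = check_random_state(0).rand(3)
print(np.array_equal(a, b))
```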

In [None]:
def cluster(array, random_state, n_clusters):
    """
    Fits and predicts k-means clustering on "array"
    
    Parameters
    ----------
    array: A numpy array
    random_state: Random seed, e.g. check_random_state(0)
    n_clusters: The number of clusters, e.g. 4
    
    Returns
    -------
    A tuple (sklearn.cluster.KMeans, np.ndarray)
    """
    
    # YOUR CODE HERE
    
    return model, clusters

In [None]:
k_means, clusters = cluster(reduced, random_state=0, n_clusters=4)

In [None]:
k_means_t, cluster_t = cluster(reduced, random_state=1, n_clusters=5)

assert_is_instance(k_means_t, KMeans)
assert_is_instance(cluster_t, np.ndarray)
assert_equal(k_means_t.n_init, 10)
assert_equal(k_means_t.n_clusters, 5)
assert_equal(len(cluster_t), len(reduced))
assert_true((cluster_t < 5).all()) # n_clusters = 5 so labels should be between 0 and 4
assert_true((cluster_t >= 0).all())
labels_gold = -1 * np.ones(len(reduced), dtype=int)  # np.int was removed from NumPy
mindist = np.empty(len(reduced))
mindist.fill(np.inf)  # np.infty was removed in NumPy 2.0
for i in range(5):
    dist = np.sum((reduced - k_means_t.cluster_centers_[i])**2., axis=1)
    labels_gold[dist < mindist] = i
    mindist = np.minimum(dist, mindist)
assert_true((mindist >= 0.0).all())
assert_true((labels_gold != -1).all())
assert_array_equal(labels_gold, cluster_t)

Now, we would like to apply the k-means clustering technique, but how do we determine k, the number of clusters?

The simplest method is [the elbow method](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_Elbow_Method), which is similar to what we did in Problem 1. But what criterion should we use, i.e. what should go on the y-axis?

According to the [scikit-learn documentation](http://scikit-learn.org/stable/modules/clustering.html#k-means),

```
The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance,
minimizing a criterion known as the inertia or within-cluster sum-of-squares.
```
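
To make the quoted definition concrete, the sketch below recomputes the within-cluster sum of squares by hand and compares it with the value scikit-learn reports. It uses synthetic data from `make_blobs` as an assumption for illustration, not the Delta data, and sets `n_init=10` explicitly only so the example behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 200 points in 2 dimensions.
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from every sample to every centroid, shape (200, 3).
diff = X[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Each sample contributes its squared distance to the nearest centroid.
manual_inertia = sq_dist.min(axis=1).sum()
nearest = sq_dist.argmin(axis=1)

print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point error, and `nearest` should match `model.labels_`. The test cell for `cluster()` relies on the same nearest-centroid rule.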

The scikit-learn documentation on [sklearn.cluster.KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn-cluster-kmeans) says that `sklearn.cluster.KMeans` has the inertia value in the `inertia_` attribute. So we can vary the number of clusters in `KMeans`, plot `KMeans.inertia_` as a function of the number of clusters, and pick the "elbow" in the plot.

![](https://github.com/UI-DataScience/accy571-fa16/raw/master/Week7/assignments/images/elbow.png)
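
A plot of this kind can be produced with a short loop over candidate values of k. The following is a sketch on synthetic `make_blobs` data, not the code that generated the figure above; the range of 1 to 10 clusters and the explicit `n_init=10` are assumptions made for the example.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data drawn around 4 centers (not the Delta data).
X, _ = make_blobs(n_samples=300, centers=4, n_features=2,
                  cluster_std=0.6, random_state=0)

k_values = list(range(1, 11))
inertias = []
for k in k_values:
    # Fit one model per candidate k and record its inertia.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Inertia on the y-axis, number of clusters on the x-axis.
fig, ax = plt.subplots()
ax.plot(k_values, inertias, marker="o")
ax.set_xlabel("Number of clusters")
ax.set_ylabel("Inertia (within-cluster sum of squares)")
ax.set_title("Elbow method on synthetic data")
```

Inertia generally falls as k grows, so the goal is not to find the minimum but the point where adding another cluster stops giving a large drop.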

Using the elbow method, we choose four clusters, i.e., $k = 4$. Using $k = 4$, we now visualize the clusters in terms of the first four principal components.

![](https://github.com/UI-DataScience/accy571-fa16/raw/master/Week7/assignments/images/pca_pair_plot.png)

We can see that one outlier is in its own cluster, there are 3 or 4 points in another cluster, and the remaining points are split into two clusters of greater size. Let's take a closer look at each cluster.

In [None]:
df = pd.read_csv('/home/data_scientist/data/delta.csv', index_col='Aircraft')
df['Clusters'] = clusters
df['Aircraft'] = df.index
df_grouped = df.groupby('Clusters').mean(numeric_only=True)  # skip the string column 'Aircraft'
print(df_grouped.Accommodation)

In [None]:
print(df_grouped['Length (ft)'])
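
The pattern used in the cells above (attach one label per row, then aggregate by label) can be seen on a toy table. All values below are hypothetical and are not taken from `delta.csv`.

```python
import pandas as pd

# Toy table with made-up values, indexed like the real data by 'Aircraft'.
toy = pd.DataFrame(
    {
        "Accommodation": [120, 130, 300, 310, 50],
        "Length (ft)": [110.0, 115.0, 200.0, 210.0, 90.0],
    },
    index=pd.Index(["A", "B", "C", "D", "E"], name="Aircraft"),
)

# One cluster label per row, in the same order as the rows.
toy["Clusters"] = [0, 0, 1, 1, 2]

# Number of rows in each cluster, and the per-cluster mean of each numeric column.
sizes = toy["Clusters"].value_counts().sort_index()
means = toy.groupby("Clusters").mean(numeric_only=True)

print(sizes)
print(means)
```

Checking the cluster sizes first is a quick way to spot very small clusters, such as a cluster that contains a single outlier.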

Cluster 3 has only one aircraft:

In [None]:
clust3 = df[df.Clusters == 3]
print(clust3.Aircraft)

The Airbus A319 VIP is not part of Delta's regular fleet; it is one of Airbus's corporate jets.

Cluster 2 has four aircraft.

In [None]:
clust2 = df[df.Clusters == 2]
print(clust2.Aircraft)

These are small aircraft that have only economy seats.

In [None]:
cols_seat = ['First Class', 'Business', 'Eco Comfort', 'Economy']
print(df.loc[clust2.index, cols_seat])

Next, we look at Cluster 1.

In [None]:
clust1 = df[df.Clusters == 1]
print(clust1.Aircraft)

These aircraft do not have first class seating.

In [None]:
print(df.loc[clust1.index, cols_seat])

Finally, Cluster 0 has the following aircraft:

In [None]:
clust0 = df[df.Clusters == 0]
print(clust0.Aircraft)

The aircraft in Cluster 0 (except for one) have first class seating but no business class.

In [None]:
print(df.loc[clust0.index, cols_seat])