This example was developed by Professor Michael Ellis, University of Central Arkansas.
Adapted from the book *Pandas for Everyone: Python Data Analysis*, by Daniel Chen


# Clustering
File(s) needed: wine.csv

There are two main categories of machine learning methods: supervised and unsupervised. Linear regression is a supervised method. Next we'll look at clustering, which is an unsupervised method.
- **_Supervised_** methods predict a known value, like the amount of the tip paid by restaurant customers.
- **_Unsupervised_** methods have no target value. The point of an unsupervised learning method is to discover unknown patterns or relationships in the data.

**_Clustering_** creates groups (or **"clusters"**) of similar observations without being told anything about what the clusters should look like. It does this by applying two basic guiding principles:
1. Items in a cluster should be very similar to one another.
2. Items in a cluster should be very different from items in other clusters.
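
These two principles can be checked numerically. The sketch below is only an illustration: it uses a small made-up set of 2-D points (not the wine data) with hand-assigned labels, and compares the average distance between points that share a cluster with the average distance between points in different clusters. The variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy data: two tight groups of 2-D points (not the wine data)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Euclidean distance between every pair of points (6 x 6 matrix)
diffs = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# True where two points share a cluster label
same_cluster = labels[:, None] == labels[None, :]
# Used to ignore the zero distance from each point to itself
not_self = ~np.eye(len(points), dtype=bool)

within = dist[same_cluster & not_self].mean()
between = dist[~same_cluster].mean()

print(f"mean within-cluster distance:  {within:.2f}")
print(f"mean between-cluster distance: {between:.2f}")
```

For a good clustering, the within-cluster distance should be small relative to the between-cluster distance.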

Clustering methods differ in how they measure the similarity and difference between items. We will look at two common clustering methods: k-means and hierarchical clustering. We'll use data about wine to see how grouping items can teach us something about the data.

To start, let us prepare the data. Note that all columns are numeric data types. The algorithms we will use will not work with nonnumeric data. 
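
One way to confirm that every column is numeric is to ask pandas for the columns that are not. The sketch below uses a small hypothetical data frame (`toy`, with a made-up `Region` text column) rather than `wine.csv`, so the names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data frame with one text column (not the wine data)
toy = pd.DataFrame({
    "Alcohol": [14.23, 13.20, 13.16],
    "Magnesium": [127, 100, 101],
    "Region": ["north", "south", "north"],
})

# Names of the columns whose dtype is not numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Keep only the numeric columns before clustering
numeric_only = toy.select_dtypes(include="number")
print(numeric_only.dtypes)
```

An empty `non_numeric` list would mean the data frame can be passed to the clustering algorithms as it is.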

In [1]:
# Import pandas & numpy, and read the data
import pandas as pd
import numpy as np

In [3]:
wine = pd.read_csv('wine.csv')
wine.head()

,Cultivar,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


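The "Cultivar" column needs to be dropped from the data frame. It identifies the cultivar each wine came from, which is a known grouping rather than a measurement, and clustering should find groups without being given one.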

In [4]:
# Drop the Cultivar column
wine = wine.drop('Cultivar', axis=1)
wine.head()

,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


# K-means 

The k-means algorithm applies the idea that an item should be closer to the center of its own cluster than to the center of any other cluster. It starts by selecting _k_ points to be cluster centers, then assigns every other point to the cluster with the closest center. Next, it calculates the new center of each cluster (the mean of the points assigned to it) and reassigns all the points based on the new centers. This process repeats until the result is stable, that is, until the centers and assignments stop changing.
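
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the implementation sklearn uses. The function name `kmeans_sketch`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch for illustration; not sklearn's implementation."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct rows of X as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to the cluster with the closest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        # (a center that lost all of its points is left where it was)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.4, 2.0], [2.0, 1.2],
              [8.0, 8.0], [9.0, 8.5], [8.4, 9.5]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

Because the starting centers are chosen at random, a different `seed` can produce different label numbers, and on less cleanly separated data it can produce a different grouping.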

Here are two interesting visualizations that show how k-means works: 
- http://shabal.in/visuals.html (click on the "series of 5 gif animations" link)
- https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

We can use the `KMeans` class from the sklearn library to create a k-means model.

In [5]:
# Load the KMeans class
from sklearn.cluster import KMeans

In [6]:
# Create a model with k=3.
# The random_state parameter is optional. It sets the seed for the random values used
#   in the algorithm, so using the same seed gives the same results every time.
#   Fixing the seed is useful while developing a model; remove it for production use.

kmeans = KMeans(n_clusters=3, random_state=42).fit(wine.values)
print(kmeans)



KMeans(n_clusters=3, random_state=42)


To look at the results, we get the labels and their counts from the `kmeans` object.

In [7]:
# inspect the labels and counts
print(np.unique(kmeans.labels_, return_counts=True))

(array([0, 1, 2]), array([69, 47, 62], dtype=int64))
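
The call above returns two arrays: the sorted distinct labels, and how many times each one occurs. A minimal sketch on a small made-up label array (standing in for `kmeans.labels_`) shows how the two arrays line up:

```python
import numpy as np

# Hypothetical label array, shaped like the one stored in kmeans.labels_
labels = np.array([1, 1, 0, 2, 1, 0])

# values: the sorted distinct labels; counts: how often each one occurs
values, counts = np.unique(labels, return_counts=True)

# Pair each label with its count
for value, count in zip(values, counts):
    print(f"label {value}: count {count}")
```

The counts always sum to the number of rows, because every row receives exactly one label. The label numbers themselves are arbitrary identifiers and carry no ordering.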


So there are three labels (0, 1, and 2), as expected. It would be more helpful to combine these labels with our data. First, we'll create a data frame containing the cluster labels so we have it for use later. Then, we'll add the cluster labels directly to a copy of the data.

In [8]:
# Create a cluster data frame for later use
kmeans_3 = pd.DataFrame(kmeans.labels_, columns=['cluster'])
kmeans_3.head()

,cluster
0,1
1,1
2,1
3,1
4,2


In [9]:
# Add the cluster labels to a copy of the original data
# (.copy() is needed; plain assignment would modify wine itself)
clustered_wine = wine.copy()
clustered_wine['cluster'] = kmeans.labels_
clustered_wine

,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline,cluster
0,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065,1
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050,1
2,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185,1
3,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480,1
4,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740,2
174,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750,2
175,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835,2
176,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840,2


With the cluster labels added to the data, we can now see which cluster each row belongs to, but a table of numbers is not very intuitive. It would be much easier to use our results if we could plot them. The problem is that each row has 13 variables (plus a cluster label), while we can only perceive three dimensions and can only plot clearly in two. So if we want to create a plot, we need a way to represent those 13 variables with just two variables, plus the cluster label.
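
Short of a plot, a per-cluster summary gives a first numeric view of the groups. The sketch below uses a small hypothetical data frame with made-up values and labels, standing in for `clustered_wine`. Only the `groupby` pattern is the point here.

```python
import pandas as pd

# Hypothetical miniature of clustered_wine (made-up values and labels)
toy = pd.DataFrame({
    "Alcohol": [14.2, 13.2, 12.4, 12.0, 13.7, 13.4],
    "Proline": [1065, 1050, 520, 480, 740, 750],
    "cluster": [1, 1, 0, 0, 2, 2],
})

# Number of rows in each cluster
print(toy["cluster"].value_counts().sort_index())

# Mean of every variable within each cluster
cluster_means = toy.groupby("cluster").mean()
print(cluster_means)
```

Comparing the rows of `cluster_means` shows which variables differ most between clusters, although with 13 variables this is still harder to read than a picture.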