# Unsupervised Learning with Scikit-Learn

## Lesson Goals

This lesson introduces unsupervised learning with scikit-learn. We will cover two essential clustering algorithms, along with their implementation and examples.


## Introduction

Clustering is a family of algorithms for uncovering relationships and insight in a dataset. The data is not labeled, so there is no ground truth answer that we are trying to predict. Instead, we use different algorithms to group observations together and uncover what they might have in common. There are many clustering techniques. In this lesson we will cover two of them: k-means and hierarchical clustering.


## K-means

K-means is one of the oldest and most popular clustering techniques. We first choose how many clusters we would like to create (conventionally called k) and select random starting points for the cluster centroids. We then compute the distance between each observation and each centroid, assign every observation to its nearest centroid, and recompute each centroid as the mean of the observations assigned to it. We repeat these two steps until the labels stay constant and no observation needs to be reassigned.
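
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not scikit-learn's implementation; the function name `kmeans_sketch` and its parameters are made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy k-means loop; illustrative only, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Pick k distinct observations as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Distance from every observation to every centroid: shape (n_samples, k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once no observation changes cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of its assigned observations
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated blobs of 2-D points (synthetic data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(10, 0.5, size=(20, 2))])
labels, centroids = kmeans_sketch(X, k=2)
```

On data this well separated, each blob should end up in its own cluster regardless of which observations are drawn as starting centroids.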


## K-means in Scikit-Learn

We will explore k-means with scikit-learn using our census data. We first load the data. 

In [None]:
import pandas as pd

census = pd.read_csv('../data/acs2015_county_data.csv')

census.describe()

Before using our algorithm, we need to do some munging. Our first step should be to check for missing data and, based on how much is missing, decide on a strategy.

In [None]:
census.isnull().sum(axis = 0)

Only a few columns have missing data, and none of them has more than one missing observation. Therefore, the simplest strategy is to drop the rows that contain missing values.

In [34]:
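# .copy() gives us an independent DataFrame, so adding a column to it later
# does not trigger a SettingWithCopyWarning in pandas
census_missing = census.dropna().copy()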
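
One way to judge whether dropping rows is acceptable is to measure how many rows would be lost. The sketch below uses a small hypothetical DataFrame; the column names only mimic the census file and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    'County': ['A', 'B', 'C', 'D'],
    'Income': [51000.0, np.nan, 43000.0, 62000.0],
    'ChildPoverty': [18.5, 22.1, np.nan, 9.7],
})

# Count of missing values in each column
missing_per_column = toy.isnull().sum(axis=0)

# Fraction of rows that dropna() would remove
share_of_rows_lost = 1 - len(toy.dropna()) / len(toy)
```

When the share of lost rows is tiny, as it is for the census data, dropping is a reasonable default; when it is large, imputation may be worth considering instead.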

Additionally, we should only cluster on columns that contain actual information about the data. Therefore, we should remove the State and County columns, which are just names. We should also remove the CensusId column because it is an identifier and contains no information about each county.

In [35]:
census_columns = [col for col in census.columns.values if col not in ['CensusId', 'State', 'County']]

Now let's import KMeans from scikit-learn:

In [36]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 4)

We have defined a k-means object with 4 clusters. Next, we fit it to our data.

In [None]:
census_clusters = kmeans.fit(census_missing[census_columns])

census_clusters.cluster_centers_

The cluster centers contain the 4 centroids. Since the data contains 34 columns describing each county, each centroid is a point in 34-dimensional space.

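Using fit_predict, we can assign a cluster to each observation and then add this information back to our dataset. Note that fit_predict fits the model again before returning the labels; the labels from the fit above are also available as census_clusters.labels_.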

In [None]:
census_missing['Cluster'] = census_clusters.fit_predict(census_missing[census_columns])
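
The relationship between fit_predict, labels_ and cluster_centers_ can be checked on synthetic data. In this sketch the data is random, and `random_state` and `n_init` are set only to make the run reproducible; they are choices for this example and are not part of the lesson's code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # 200 synthetic observations, 5 features

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X)      # fits the model and returns one label per row

# One centroid per cluster, one coordinate per feature
centers_shape = km.cluster_centers_.shape

# fit_predict returns the same labels that the fit stores on the model
same_as_attribute = np.array_equal(labels, km.labels_)
```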

Let's look at the counts of counties in each cluster:

In [None]:
census_missing.Cluster.value_counts()

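The majority of the data is in the first cluster, while cluster 2 has only one observation. Because k-means starts from random centroids, the cluster numbering and the exact counts can differ between runs.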

Plotting the data will not tell us much, because it has 34 dimensions: a two-dimensional plot captures only part of the information and might not show cleanly separated clusters. However, it is still interesting to look at some summary statistics for our clusters.

We can look at the count of counties by state for each cluster:

In [None]:
census_missing[census_missing.Cluster == 0].State.value_counts()

In [None]:
census_missing[census_missing.Cluster == 1].State.value_counts()

In [None]:
census_missing[census_missing.Cluster == 2].State.value_counts()

In [None]:
census_missing[census_missing.Cluster == 3].State.value_counts()

We can also look at the mean income and the mean rate of child poverty for each of the 4 clusters.

In [None]:
census_missing.groupby(['Cluster'])['Income'].mean()

In [None]:
census_missing.groupby(['Cluster'])['ChildPoverty'].mean()
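
Several summary statistics can also be computed in one call. The sketch below uses a small hypothetical DataFrame whose column names match the census data but whose values are invented.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    'Cluster': [0, 0, 1, 1, 1],
    'Income': [40000, 50000, 70000, 80000, 90000],
    'ChildPoverty': [25.0, 21.0, 10.0, 12.0, 8.0],
})

# Mean and number of rows per cluster, for both columns at once
summary = toy.groupby('Cluster')[['Income', 'ChildPoverty']].agg(['mean', 'count'])
```

Including the count next to the mean is useful here, because a mean computed over a cluster with a single county says little about a group.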

## Hierarchical Clustering

Hierarchical clustering is a technique that builds a hierarchy of clusters. Its advantage over k-means is that we do not need to specify the number of clusters in advance: we can observe relationships between observations without committing to a predetermined number of clusters. We can also generate a dendrogram, a visualization that displays the relationships between observations in the data.

There are two types of hierarchical clustering:

* Agglomerative - This is a bottom-up approach. We start off with a cluster for each observation and then combine similar clusters until we are left with only one large cluster.

* Divisive - This is a top-down approach. We start with one large cluster and keep dividing until each observation is in its own cluster.
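
The agglomerative approach can be sketched as a naive loop that repeatedly merges the two closest clusters. This is an illustration only: the function name `agglomerate` is made up, and it uses single linkage (the distance between the closest pair of members) for simplicity, which is an assumption of this sketch and differs from Ward linkage.

```python
import numpy as np

def agglomerate(points):
    """Naive bottom-up clustering with single linkage; returns the list of merges."""
    clusters = {i: [i] for i in range(len(points))}  # every observation starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for a_pos, a in enumerate(ids):
            for b in ids[a_pos + 1:]:
                # Single linkage: distance between the closest pair of members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)  # merge cluster b into cluster a
        merges.append((a, b, float(d)))
    return merges

# Five synthetic 1-D observations
points = np.array([[0.0], [1.0], [5.0], [5.5], [20.0]])
merges = agglomerate(points)
```

The merge list records which clusters were joined and at what distance, which is the information a dendrogram displays.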

## Hierarchical Clustering in Scikit-Learn

Hierarchical clustering with scikit-learn is performed using the AgglomerativeClustering class, which implements the agglomerative (bottom-up) approach.

To demonstrate hierarchical clustering, we will use the census data again. This time, we will take a sample of 100 counties to keep the dendrogram clear and uncluttered for the sake of this demo.

In [46]:
from sklearn.cluster import AgglomerativeClustering

census_sample = census_missing[census_columns].sample(n = 100)

hier_clust = AgglomerativeClustering(linkage = 'ward')

census_hier = hier_clust.fit(census_sample)

To plot our dendrogram, we need to do some data manipulation. This is because the function that plots a dendrogram lives in scipy, not in scikit-learn, and requires a slightly different data format.

In [None]:
import numpy as np
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

def plot_dendrogram(model, **kwargs):

    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # The model was fit without storing merge distances, so we use evenly
    # spaced placeholder values; branch heights only reflect the merge order
    distance = np.arange(children.shape[0])

    # Placeholder observation counts for each merge (not the true cluster sizes)
    no_of_observations = np.arange(2, children.shape[0] + 2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    plt.figure(figsize=(12, 6))
    dendrogram(linkage_matrix, **kwargs)
    
plot_dendrogram(census_hier, labels = census_hier.labels_)
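
If branch heights should reflect actual merge distances, recent scikit-learn versions can store them on the model. The sketch below runs on synthetic data and assumes a scikit-learn version in which AgglomerativeClustering accepts `distance_threshold` and exposes `distances_`; the helper name `linkage_from_model` is made up for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def linkage_from_model(model):
    """Build a SciPy-style linkage matrix from a fitted AgglomerativeClustering model."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        # Indices below n_samples are original observations;
        # larger indices refer to clusters formed by earlier merges
        counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                        for child in merge)
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))   # 30 synthetic observations, 4 features

# distance_threshold=0 with n_clusters=None builds the full tree and stores distances_
model = AgglomerativeClustering(linkage='ward', distance_threshold=0, n_clusters=None).fit(X)
Z = linkage_from_model(model)

# no_plot=True returns the dendrogram layout without drawing it
tree = dendrogram(Z, no_plot=True)
```

Dropping `no_plot=True` draws the dendrogram with matplotlib, as in the plotting function above.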