# Make the preliminary clusters, unsupervised
In this notebook, I apply KMeans and PCA to group senators into clusters. Each senator is an observation, and each of the 502 features is a binary variable indicating a `yea` or `nay` vote.

We'll treat the predicted cluster label as a stand-in for party affiliation and compare it with the true party labels. Using PCA, we can take the first two components and treat them as axes to visualize the results.


In [1]:
import pandas as pd
import numpy as np


# clustering and modelling
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

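# plotting
# bokeh.charts has been removed from Bokeh, so import the needed names explicitly
from bokeh.io import output_file, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show
import matplotlib.pyplot as plt
output_notebook()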

In [2]:
votes = pd.read_csv('../data/cleaned_votes.csv', index_col=0)

In [3]:
c = votes.iloc[:, :-1].T # remove the result column

In [4]:
c.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,462,478,487,492,494,496,497,500,501,502
Alexander (R-TN),0,1,1,1,1,1,1,0,0,1,...,1,1,1,1,1,1,1,1,1,1
Ayotte (R-NH),0,1,1,1,1,1,1,0,0,1,...,1,0,1,0,1,1,1,1,1,1
Baldwin (D-WI),1,1,0,0,0,1,0,1,0,1,...,1,1,1,0,1,1,1,1,0,1
Barrasso (R-WY),0,1,1,1,1,1,1,0,1,1,...,1,0,1,1,1,1,1,1,1,1
Bennet (D-CO),0,1,1,0,0,1,0,1,0,1,...,1,1,1,0,1,1,1,1,1,1


Let's apply KMeans clustering and plot the results to visually analyze the distances between senators.

In [5]:
km = KMeans(n_clusters=2)
km.fit(c)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [6]:
predicted_parties = pd.Series(km.labels_) # we'll color by predicted party label

In [7]:
predicted_parties.value_counts()

1    54
0    46
dtype: int64
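
Because `random_state=None` above, the cluster numbering (and occasionally the split itself) can change between runs. The following is a minimal sketch, on a made-up toy matrix rather than the real vote data, of how fixing `random_state` makes the fit repeatable. The names `toy_votes`, `bloc_a`, and `bloc_b` are illustrative assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the senators-by-votes matrix (NOT the real data):
# two blocs of 5 "senators" voting on 20 bills.
rng = np.random.default_rng(0)
bloc_a = (rng.random((5, 20)) < 0.9).astype(int)  # mostly yea
bloc_b = (rng.random((5, 20)) < 0.1).astype(int)  # mostly nay
toy_votes = np.vstack([bloc_a, bloc_b])

# Fixing random_state makes the fit repeatable. n_init is set explicitly
# because its default has changed across scikit-learn releases.
km_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_votes)
km_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_votes)

# Two fits with the same seed assign identical labels.
same_run = bool((km_a.labels_ == km_b.labels_).all())

# The grouping is meaningful, but the label numbers (0 or 1) are arbitrary,
# so we only check that each bloc is internally consistent.
bloc_a_labels = set(km_a.labels_[:5].tolist())
bloc_b_labels = set(km_a.labels_[5:].tolist())
print(same_run, bloc_a_labels, bloc_b_labels)
```

Since the label numbers carry no meaning on their own, a cluster labelled `0` may correspond to either party.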

In [8]:
# PCA
pca = PCA(n_components=2)

In [9]:
# fit on the votes-by-senators matrix, so each component has one entry per senator
pca.fit(c.T)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [10]:
pca.explained_variance_ratio_

array([ 0.50643735,  0.18673049])

In [11]:
c1, c2 = pca.components_

In [12]:
c1.shape, c2.shape, predicted_parties.shape

((100,), (100,), (100,))
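
The cells above fit PCA on the transposed matrix and read one loading per senator from `components_`. A more common pattern is to keep senators as observations and project them with `fit_transform`. The sketch below shows both on a made-up random matrix (`toy_votes` is an assumption, not the real data). The two approaches center the data along different axes, so in general they give different coordinates.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in (NOT the real data): 10 "senators" x 20 "votes".
rng = np.random.default_rng(0)
toy_votes = (rng.random((10, 20)) < 0.5).astype(int)

# Approach used in this notebook: fit on votes-by-senators and take the
# component loadings, which have one entry per senator.
pca_loadings = PCA(n_components=2).fit(toy_votes.T)
l1, l2 = pca_loadings.components_  # each has shape (10,)

# Alternative: treat senators as observations and project them onto the
# first two principal components.
scores = PCA(n_components=2).fit_transform(toy_votes)  # shape (10, 2)
s1, s2 = scores[:, 0], scores[:, 1]

print(l1.shape, l2.shape, scores.shape)
```

Either pair of vectors can serve as plot axes; which one is preferable depends on whether the per-vote or per-senator mean should be removed before the decomposition.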

In [13]:
names = votes.T.index.values[:-1]
len(names)

100

Using Bokeh, we can make interactive HTML plots that work well on the web.

In [14]:
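# color by predicted cluster; KMeans label numbers are arbitrary,
# so these colors need not line up with the usual party colors
color_dict = {0:'firebrick', 1:'darkblue'}
cs = predicted_parties.map(color_dict)
color = cs.values

# make source data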
source = ColumnDataSource(data=dict(
    c1 = c1,
    c2=c2,
    color=color,
    names = names
))


p1 = figure(plot_width=800, plot_height=600, title="A House Divided", tools = 'hover, save, box_zoom,reset')
p1.circle('c1', 'c2', color='color', source=source, size=13)

p1.select_one(HoverTool).tooltips = [
    ('Senator', '@names')]

output_file('../gallery/senate_divided.html')

show(p1)

In [15]:
# let's show the actual parties

# names look like 'Alexander (R-TN)', so the party letter
# is the fifth character from the end
parties = [name[-5] for name in names]

In [16]:
colors_dict = {'R':'red', 'D':'blue', 'I':'green'}
cp = pd.Series(parties).map(colors_dict)
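
Besides comparing the two plots by eye, the cluster labels and party labels can be cross-tabulated. The sketch below uses the five names shown in `c.head()` with made-up cluster labels (`toy_clusters` is hypothetical and is not the notebook's actual KMeans output). The helper `party_of` is also an illustrative assumption; it uses a regular expression instead of fixed slicing.

```python
import re
import pandas as pd

# Names in the notebook's "Name (P-ST)" format.
toy_names = [
    "Alexander (R-TN)", "Ayotte (R-NH)", "Baldwin (D-WI)",
    "Barrasso (R-WY)", "Bennet (D-CO)",
]
# Hypothetical cluster labels, made up for illustration only.
toy_clusters = [1, 1, 0, 1, 0]

def party_of(name):
    """Return the party code from a trailing '(P-ST)' suffix, or None."""
    match = re.search(r"\((\w+)-\w+\)\s*$", name)
    return match.group(1) if match else None

toy_parties = [party_of(n) for n in toy_names]

# Rows are predicted clusters, columns are actual parties.
table = pd.crosstab(
    pd.Series(toy_clusters, name="cluster"),
    pd.Series(toy_parties, name="party"),
)
print(table)
```

A table with most of its counts concentrated in one cell per row would indicate that the clusters track party closely.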

In [17]:
# color by actual party
colors = cp.values
# make source data
source = ColumnDataSource(data=dict(
    c1 = c1,
    c2=c2,
    color=colors,
    names = names
))


p2 = figure(plot_width=800, plot_height=600, title="A House Divided", tools = 'hover, save, box_zoom,reset')
p2.circle('c1', 'c2', color='color', source=source, size=13)

p2.select_one(HoverTool).tooltips = [
    ('Senator', '@names')]

output_file('../gallery/senate_divided_2.html')

show(p2)



These plots suggest there is a serious division in the Senate. This extends the results we noticed from the decision tree.