# 1. The Dataset
Instructions
Import the pandas library.
Use the pandas read_csv() function to read 114_congress.csv into the variable votes.
Hint
You'll have to pass the string "114_congress.csv" into read_csv().

In [33]:
import pandas as pd
votes = pd.read_csv("114_congress.csv")
votes.head()

,name,party,state,00001,00004,00005,00006,00007,00008,00009,00010,00020,00026,00032,00038,00039,00044,00047
0,Alexander,R,TN,0,1,1,1,1,0,0,1,1,1,0,0,0,0,0
1,Ayotte,R,NH,0,1,1,1,1,0,0,1,0,1,0,1,0,1,0
2,Baldwin,D,WI,1,0,0,1,0,1,0,1,0,0,1,1,0,1,1
3,Barrasso,R,WY,0,1,1,1,1,0,1,1,1,1,0,0,1,0,0
4,Bennet,D,CO,0,0,0,1,0,1,0,1,0,0,0,1,0,1,0


# 2. Exploring the Data
Instructions
Find how many Senators are in each party.
Use the value_counts() method on the party column of votes. Print the results.
Find what the "average" vote for each bill was.
Use the mean() method on the votes DataFrame. A column mean below 0.5 indicates that more Senators voted against the bill than for it; a mean above 0.5 indicates the reverse. Print the results.
Hint
Remember to print out both results.

In [34]:
print(votes["party"].value_counts())

R    54
D    44
I     2
Name: party, dtype: int64


In [35]:
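# numeric_only=True skips the text columns (name, party, state), which recent pandas versions refuse to average
print(votes.mean(numeric_only=True))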

00001    0.325
00004    0.575
00005    0.535
00006    0.945
00007    0.545
00008    0.415
00009    0.545
00010    0.985
00020    0.525
00026    0.545
00032    0.410
00038    0.480
00039    0.510
00044    0.460
00047    0.370
dtype: float64
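
As a rough sketch of how to read these averages, the snippet below copies the printed means into a Series and splits the bills at 0.5. It assumes, as the instructions state, that a mean above 0.5 indicates more Senators voted for the bill than against it. The names bill_means, majority_yes and majority_no are illustrative and not part of the project.

```python
import pandas as pd

# Column means copied from the printed output above.
bill_means = pd.Series({
    "00001": 0.325, "00004": 0.575, "00005": 0.535, "00006": 0.945,
    "00007": 0.545, "00008": 0.415, "00009": 0.545, "00010": 0.985,
    "00020": 0.525, "00026": 0.545, "00032": 0.410, "00038": 0.480,
    "00039": 0.510, "00044": 0.460, "00047": 0.370,
})

# Boolean indexing keeps only the bills on each side of the 0.5 threshold.
majority_yes = bill_means[bill_means > 0.5]
majority_no = bill_means[bill_means < 0.5]

print(len(majority_yes), "bills averaged above 0.5")
print(len(majority_no), "bills averaged below 0.5")

# The most lopsided bills sit furthest from 0.5.
print((bill_means - 0.5).abs().sort_values(ascending=False).head(3))
```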


# 3. Distance Between Senators
Instructions
Compute the Euclidean distance between the first row and the third row.
Assign the result to distance.
Hint
You'll have to use the euclidean_distances() function from sklearn.metrics.pairwise.

In [36]:
from sklearn.metrics.pairwise import euclidean_distances

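# .values converts each row to a NumPy array; a pandas Series has no reshape() in current versions
print(euclidean_distances(votes.iloc[0, 3:].values.reshape(1, -1), votes.iloc[1, 3:].values.reshape(1, -1)))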

[[ 1.73205081]]


In [37]:
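# .values converts each row to a NumPy array; a pandas Series has no reshape() in current versions
distance = euclidean_distances(votes.iloc[0, 3:].values.reshape(1, -1), votes.iloc[2, 3:].values.reshape(1, -1))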
print(distance)

[[ 3.31662479]]
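
The two distances above can be checked by hand. The sketch below copies the vote columns of the first three rows from the votes.head() table and applies the Euclidean formula directly with NumPy. The helper name euclidean is illustrative only, not part of scikit-learn.

```python
import numpy as np

# Vote vectors copied from the first three rows of votes.head() above.
alexander = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
ayotte = np.array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
baldwin = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1], dtype=float)


def euclidean(a, b):
    """Square root of the summed squared differences between two vectors."""
    return np.sqrt(((a - b) ** 2).sum())


# With 0/1 votes, each disagreement adds exactly 1 to the sum of squares.
print(euclidean(alexander, ayotte))   # 3 disagreements  -> sqrt(3)
print(euclidean(alexander, baldwin))  # 11 disagreements -> sqrt(11)
```

The two Republicans disagree on 3 of the 15 bills, while Alexander and Baldwin disagree on 11, which is why the second distance is larger.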


# 4. Initial Clustering
Instructions
Use the fit_transform() method to fit kmeans_model on the votes DataFrame. Only select columns after the first 3 from votes when fitting.
Assign the result to senator_distances.
Hint
Remember to pass in a : to the iloc method indicating that you want to select all of the rows.

In [38]:
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=2, random_state=1)
sentator_distances = kmeans_model.fit_transform(votes.iloc[:,3:])
print(sentator_distances[0:10])

[[ 3.12141628  1.3134775 ]
 [ 2.6146248   2.05339992]
 [ 0.33960656  3.41651746]
 [ 3.42004795  0.24198446]
 [ 1.43833966  2.96866004]
 [ 0.33960656  3.41651746]
 [ 3.42004795  0.24198446]
 [ 0.33960656  3.41651746]
 [ 3.42004795  0.24198446]
 [ 0.31287498  3.30758755]]
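
Each row of this output holds one senator's distance to each of the two cluster centers. A minimal sketch on made-up data, assuming scikit-learn is installed, illustrates that reading. The array toy_votes and the names toy_model, toy_distances and manual are hypothetical and not part of the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy data: two tight groups of 0/1 "votes".
toy_votes = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

toy_model = KMeans(n_clusters=2, n_init=10, random_state=1)
toy_distances = toy_model.fit_transform(toy_votes)

# One row per sample, one column per cluster.
print(toy_distances.shape)

# Each entry is the Euclidean distance from a sample to a cluster center.
manual = euclidean_distances(toy_votes, toy_model.cluster_centers_)
print(np.allclose(toy_distances, manual))

# The assigned label is the column with the smallest distance.
print(np.array_equal(toy_model.labels_, toy_distances.argmin(axis=1)))
```

Read this way, a row such as [ 3.42004795  0.24198446] describes a senator sitting very close to the second center and far from the first.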


# 5. Exploring The Clusters
Instructions
Use the labels_ attribute to extract the labels from kmeans_model. Assign the result to the variable labels.
Use the pandas crosstab() function to print out a table comparing labels to votes["party"], in that order.
Hint
Remember to call print() in the last step.

In [39]:
labels = kmeans_model.labels_
crosstabulate = pd.crosstab(labels, votes["party"])
print(crosstabulate)

party   D  I   R
row_0           
0      41  2   0
1       3  0  54
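
For readers unfamiliar with pd.crosstab, here is a small sketch on made-up values. The arrays toy_labels and toy_party are hypothetical and do not come from the dataset. The first argument supplies the rows and the second supplies the columns, which is why the table above has cluster labels down the side and parties across the top.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and parties for six senators (not the real data).
toy_labels = np.array([0, 0, 1, 1, 1, 0])
toy_party = pd.Series(["D", "D", "R", "R", "D", "I"], name="party")

# Rows are cluster labels, columns are parties, cells are counts.
toy_table = pd.crosstab(toy_labels, toy_party)
print(toy_table)

# A single cell can be read by row label and column name.
print(toy_table.loc[1, "D"])
```

In the real table, the cell at row 1 and column D is the 3 Democrats who were grouped with the 54 Republicans.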


# 6. Exploring Senators In The Wrong Cluster
Instructions
Select all Democrats who were assigned to the second cluster (label 1). Assign the subset to democratic_outliers.
Print out democratic_outliers.
Hint
Remember to use parentheses and the & character when subsetting.

In [40]:
democratic_outliers = votes[(labels == 1) & (votes["party"] == "D")]
print(democratic_outliers)

        name party state  00001  00004  00005  00006  00007  00008  00009  \
42  Heitkamp     D    ND    0.0    1.0    0.0    1.0    0.0    0.0    1.0   
56   Manchin     D    WV    0.0    1.0    0.0    1.0    0.0    0.0    1.0   
74      Reid     D    NV    0.5    0.5    0.5    0.5    0.5    0.5    0.5   

    00010  00020  00026  00032  00038  00039  00044  00047  
42    1.0    0.0    0.0    0.0    1.0    0.0    0.0    0.0  
56    1.0    1.0    0.0    0.0    1.0    1.0    0.0    0.0  
74    0.5    0.5    0.5    0.5    0.5    0.5    0.5    0.5  
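
The parentheses in the subsetting expression matter because & binds more tightly than == in Python, so each comparison has to be grouped before the two are combined. A minimal sketch on a made-up frame (toy, toy_labels and mask are hypothetical names) shows the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for votes.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "party": ["D", "R", "D", "R"],
})
toy_labels = np.array([1, 1, 0, 0])

# Each comparison yields a boolean array; & combines them element-wise.
mask = (toy_labels == 1) & (toy["party"] == "D")

# Only rows where both conditions hold are kept.
print(toy[mask])
```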


# 7. Plotting Out The Clusters
Instructions
Make a scatterplot using plt.scatter(). Pass in the following keyword arguments:
x should be the first column of senator_distances.
y should be the second column of senator_distances.
c should be labels. This will shade the points according to label.
Use plt.show() to show the plot.
Hint
Make sure to pass keyword arguments directly into the plt.scatter function call. Remember that senator_distances is a NumPy array. You can select the first column with senator_distances[:,0].

In [41]:
import matplotlib.pyplot as plt
plt.scatter(x=sentator_distances[:,0], y=sentator_distances[:,1], c=labels)
plt.show()

# 8. Finding The Most Extreme
Instructions
Compute an extremism rating by cubing every value in senator_distances, then finding the sum across each row. Assign the result to extremism.
Assign the extremism variable to the extremism column of votes.
Sort votes on the extremism column, in descending order, using the sort_values() method on DataFrames.
Print the top 10 most extreme Senators.
Hint
You can use the print function and the head() method in the last step.
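
To see what cubing does before summing, consider a small sketch with made-up distances. The array toy_distances is hypothetical and does not come from the model. A point midway between the two centers has the largest plain sum of distances here, yet the smallest cubed sum, so the rating rewards being far from one cluster rather than moderately far from both.

```python
import numpy as np

# Hypothetical distances to two cluster centers for three senators.
toy_distances = np.array([
    [0.5, 3.0],   # close to cluster 0, far from cluster 1
    [2.0, 2.0],   # midway between the clusters
    [3.0, 0.5],   # close to cluster 1, far from cluster 0
])

# Plain row sums barely separate the three points.
print(toy_distances.sum(axis=1))

# Cubing widens the gap between large and small distances before summing.
toy_extremism = (toy_distances ** 3).sum(axis=1)
print(toy_extremism)
```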

In [74]:
extremism = (sentator_distances ** 3).sum(axis=1)
votes.insert(3, column="extremism", value=extremism)
votes.sort_values("extremism", inplace=True, ascending=False)
print(votes.head(10))

         name party state  extremism  00001  00004  00005  00006  00007  \
98     Wicker     R    MS  46.250476      0      1      1      1      1   
53   Lankford     R    OK  46.046873      0      1      1      0      1   
69       Paul     R    KY  46.046873      0      1      1      0      1   
80      Sasse     R    NE  46.046873      0      1      1      0      1   
26       Cruz     R    TX  46.046873      0      1      1      0      1   
48    Johnson     R    WI  40.017540      0      1      1      1      1   
47    Isakson     R    GA  40.017540      0      1      1      1      1   
65  Murkowski     R    AK  40.017540      0      1      1      1      1   
64      Moran     R    KS  40.017540      0      1      1      1      1   
30       Enzi     R    WY  40.017540      0      1      1      1      1   

    00008  00009  00010  00020  00026  00032  00038  00039  00044  00047  
98      0      1      0      1      1      0      0      1      0      0  
53      0      1      1 