# Lecture 24: Demo 1 – Data 100, Summer 2024


[Acknowledgments Page](https://ds100.org/su24/acks/)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import yaml
from datetime import datetime
from ds100_utils import *
import plotly.express as px

## Congressional Vote Records

Let's examine how the House of Representatives (of the 116th Congress, 1st session) voted in the month of **September 2019**.

From the [U.S. Senate website](https://www.senate.gov/reference/Index/Votes.htm):

> Roll call votes occur when a representative or senator votes "yea" or "nay," so that the names of members voting on each side are recorded. A voice vote is a vote in which those in favor or against a measure say "yea" or "nay," respectively, without the names or tallies of members voting on each side being recorded.

The data, compiled from ProPublica ([source](https://github.com/eyeseast/propublica-congress)), is a "skinny" table in which each record is a single vote by one member on one roll call in the 116th Congress, 1st session, as downloaded in February 2020. Each member of the House, whom we'll call a **legislator**, is identified by their alphanumeric bioguide ID from http://bioguide.congress.gov/.

In [None]:
# September 2019 House of Representatives roll call votes (downloaded February 2020)
# Downloaded using https://github.com/eyeseast/propublica-congress
votes = pd.read_csv('data/votes.csv')
votes = votes.astype({"roll call": str}) 
votes

Suppose we pivot this table to group each legislator and their voting pattern across every (roll call) vote in this month. We mark 1 if the legislator voted Yes (yea), and 0 otherwise (No/nay, no vote, speaker, etc.).

In [None]:
def was_yes(s):
    return 1 if s.iloc[0] == "Yes" else 0    
vote_pivot = votes.pivot_table(index='member', 
                                columns='roll call', 
                                values='vote', 
                                aggfunc=was_yes, 
                                fill_value=0)
print(vote_pivot.shape)
vote_pivot.head()    
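
To make the 0/1 encoding concrete, here is a minimal sketch on a made-up three-legislator table. The member IDs, roll call numbers, and the `toy_votes` name are hypothetical; only `was_yes` and the `pivot_table` arguments mirror the cell above.

```python
import pandas as pd

# Hypothetical toy data in the same "skinny" layout: one row per (member, roll call)
toy_votes = pd.DataFrame({
    "member": ["A000001", "A000001", "B000002", "B000002", "C000003"],
    "roll call": ["515", "516", "515", "516", "515"],
    "vote": ["Yes", "No", "Not Voting", "Yes", "Yes"],
})

def was_yes(s):
    # Each (member, roll call) group holds a single vote string
    return 1 if s.iloc[0] == "Yes" else 0

toy_pivot = toy_votes.pivot_table(index="member",
                                  columns="roll call",
                                  values="vote",
                                  aggfunc=was_yes,
                                  fill_value=0)
print(toy_pivot)
```

Note that `C000003` has no record for roll call 516. `fill_value=0` turns that missing entry into a 0, so it is treated the same as an explicit non-Yes vote.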

How do we analyze this data?

While we could consider loading information about the legislator, such as their party, and see how this relates to their voting pattern, it turns out that we can do a lot with PCA to cluster legislators by how they vote.

### PCA

In [None]:
vote_pivot_centered = vote_pivot - np.mean(vote_pivot, axis = 0)
vote_pivot_centered

In [None]:
vote_pivot_centered.shape

In [None]:
u, s, vt = np.linalg.svd(vote_pivot_centered, full_matrices = False)
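
As a sanity check on the decomposition above, the sketch below uses a small random 0/1 matrix (hypothetical data, not the vote table) to confirm two facts that the following cells rely on: the rows of `vt` are orthonormal, and `u * s` equals the centered data projected onto those rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 0/1 "vote" matrix: 6 legislators, 4 roll calls
X = rng.integers(0, 2, size=(6, 4)).astype(float)
X_centered = X - X.mean(axis=0)

u, s, vt = np.linalg.svd(X_centered, full_matrices=False)

# Two routes to the principal component scores
pcs_from_u = u * s                       # scale each column of u by its singular value
pcs_from_projection = X_centered @ vt.T  # project the data onto the principal directions

print(np.allclose(pcs_from_u, pcs_from_projection))
print(np.allclose(vt @ vt.T, np.eye(vt.shape[0])))
```

Either expression can be used to compute the scores; `u * s` simply avoids an extra matrix multiplication.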

### PCA plot

In [None]:
vote_2d = pd.DataFrame(index = vote_pivot_centered.index)
vote_2d[["z1", "z2", "z3"]] = (u * s)[:, :3]
px.scatter(vote_2d, x='z1', y='z2', title='Vote Data', width=800, height=600)


Each point in this plot is a legislator, so it would be interesting to see the political affiliation of each one.

### Component Scores

If the first two singular values are large and all others are small, then two dimensions are enough to describe most of what distinguishes one observation from another. If not, then a PCA scatter plot is omitting lots of information.

An equivalent way to evaluate this is to compute the **variance ratios**, i.e., the fraction of the total variance that each PC accounts for.

In [None]:
np.round(s**2 / sum(s**2), 2)
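
The link between singular values and variance can be checked directly. In this sketch on synthetic data (the matrix `X` and its column scales are made up for illustration), the variance of the i-th principal component equals `s[i]**2 / n`, so dividing each `s[i]**2` by their sum gives that component's share of the total variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 50 observations, 5 features with very different spreads
X = rng.normal(size=(50, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
X_centered = X - X.mean(axis=0)
n = X_centered.shape[0]

u, s, vt = np.linalg.svd(X_centered, full_matrices=False)
pcs = u * s

component_variance = s**2 / n               # variance of each principal component
variance_ratio = s**2 / np.sum(s**2)        # share of the total variance per component
total_variance = X_centered.var(axis=0).sum()

print(np.allclose(pcs.var(axis=0), component_variance))
print(np.isclose(component_variance.sum(), total_variance))
print(np.round(variance_ratio, 2))
```

The factor of `n` cancels in the ratio, which is why the cell above can work with `s**2` alone.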

## Scree plot

A **scree plot** (and where its "elbow" is located) is a visual way of checking the distribution of variance.

In [None]:
fig = px.line(y=s**2 / sum(s**2), title='Variance Explained', width=700, height=600, markers=True)
fig.update_xaxes(title_text='Principal Component')
fig.update_yaxes(title_text='Proportion of Variance Explained')
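
One rough numeric companion to eyeballing the elbow is to ask how many components are needed to reach a chosen share of the variance. This is a different heuristic from the elbow, and the helper name `n_components_for`, the example singular values, and the thresholds are all assumptions made for illustration.

```python
import numpy as np

def n_components_for(s, threshold):
    """Smallest number of leading components whose cumulative
    variance ratio reaches `threshold` (a value strictly between 0 and 1)."""
    variance_ratio = s**2 / np.sum(s**2)
    cumulative = np.cumsum(variance_ratio)
    # searchsorted returns the first position where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off near a threshold of 1
    return min(k, len(s))

s_example = np.array([3.0, 2.0, 1.0])  # hypothetical singular values
print(n_components_for(s_example, 0.5))  # → 1
print(n_components_for(s_example, 0.9))  # → 2
```

With the vote data, one could call the same helper on the `s` returned by `np.linalg.svd` and compare the answer with where the scree plot flattens out.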

In [None]:
fig = px.scatter_3d(vote_2d, x='z1', y='z2', z='z3', title='Vote Data', width=800, height=600)
fig.update_traces(marker=dict(size=5))

Based on the plot above, it looks like there are two clusters of data points. What do you think they correspond to?

## Incorporating Member Information

Suppose we load in more member information, from https://github.com/unitedstates/congress-legislators. This includes each legislator's political party.

In [None]:
# You can get current information about legislators with this code. In our case, we'll use
# a static copy of the 2019 membership roster to properly match our voting data.

# base_url = 'https://raw.githubusercontent.com/unitedstates/congress-legislators/main/'
# legislators_path = 'legislators-current.yaml'
# f = fetch_and_cache(base_url + legislators_path, legislators_path)

# Use 2019 data copy
with open('data/legislators-2019.yaml') as f:
    legislators_data = yaml.safe_load(f)

def to_date(s):
    return datetime.strptime(s, '%Y-%m-%d')

legs = pd.DataFrame(
    columns=['leg_id', 'first', 'last', 'gender', 'state', 'chamber', 'party', 'birthday'],
    data=[[x['id']['bioguide'], 
           x['name']['first'],
           x['name']['last'],
           x['bio']['gender'],
           x['terms'][-1]['state'],
           x['terms'][-1]['type'],
           x['terms'][-1]['party'],
           to_date(x['bio']['birthday'])] for x in legislators_data])
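legs['age'] = 2024 - legs['birthday'].dt.year  # approximate age as of 2024
# Display the roster sorted by bioguide ID; legs itself keeps leg_id as a column for the join below
legs.set_index("leg_id").sort_index()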

We can combine the vote data projected onto the principal components with the biographical data.

In [None]:
vote_2d = vote_2d.join(legs.set_index('leg_id')).dropna()
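
`DataFrame.join` matches rows on the index and, by default, keeps every row of the left table, so legislators without a roster entry end up with missing values that `dropna` then removes. A minimal sketch with hypothetical IDs and a made-up two-row roster:

```python
import pandas as pd

# Hypothetical PC scores indexed by bioguide-style IDs
scores = pd.DataFrame({"z1": [1.5, -2.0, 0.3]},
                      index=["A000001", "B000002", "C000003"])

# Hypothetical roster that is missing C000003
roster = pd.DataFrame({"leg_id": ["A000001", "B000002"],
                       "party": ["Democrat", "Republican"]})

joined = scores.join(roster.set_index("leg_id"))  # left join on the index
matched = joined.dropna()                          # drop legislators with no roster match

print(joined)
print(matched)
```

Comparing `len(joined)` with `len(matched)` is a quick way to see how many legislators were lost in the merge.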

Then we can visualize this data all at once.

In [None]:
px.scatter(vote_2d, x='z1', y='z2', color='party', symbol="gender", size='age',
           title='Vote Data', width=800, height=600, size_max=10,
           opacity = 0.7,
           color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"},
           hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])

There seems to be a bunch of overplotting, so let's jitter a bit.

In [None]:
np.random.seed(42)
vote_2d['z1_jittered'] = vote_2d['z1'] + np.random.normal(0, 0.1, len(vote_2d))
vote_2d['z2_jittered'] = vote_2d['z2'] + np.random.normal(0, 0.1, len(vote_2d))
vote_2d['z3_jittered'] = vote_2d['z3'] + np.random.normal(0, 0.1, len(vote_2d))

In [None]:
px.scatter(vote_2d, x='z1_jittered', y='z2_jittered', color='party', symbol="gender", size='age',
           title='Vote Data', width=800, height=600, size_max=10,
           opacity = 0.7,
           color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"},
           hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])

In [None]:
px.scatter_3d(vote_2d, x='z1_jittered', y='z2_jittered', z='z3_jittered', color='party', symbol="gender", size='age',
           title='Vote Data', width=800, height=600, size_max=10,
           opacity = 0.7,
           color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"},
           hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])


## Analysis: Regular Voters

Not everyone voted all the time.  Let's examine the frequency of voting.

First, let's count how many Yes/No votes each legislator cast, ignoring records with "No Vote" or other entries.

In [None]:
vote_2d["num votes"] = (
    votes[votes["vote"].isin(["Yes", "No"])]
        .groupby("member").size()
)
vote_2d.dropna(inplace=True)
vote_2d.head()
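
The assignment above relies on index alignment: the counts from `groupby(...).size()` are indexed by member ID, and pandas matches them to the rows of `vote_2d` by label rather than by position. A small sketch with made-up IDs shows a legislator with no Yes/No votes receiving a missing count and then being dropped:

```python
import pandas as pd

# Hypothetical vote records
toy_votes = pd.DataFrame({
    "member": ["A000001", "A000001", "B000002", "C000003"],
    "vote": ["Yes", "No", "Yes", "Not Voting"],
})

# Hypothetical PC scores, indexed by the same IDs
scores = pd.DataFrame({"z1": [0.1, 0.2, 0.3]},
                      index=["A000001", "B000002", "C000003"])

counts = (
    toy_votes[toy_votes["vote"].isin(["Yes", "No"])]
        .groupby("member").size()
)
scores["num votes"] = counts   # aligned on the index labels, not on row order
print(scores)

scores = scores.dropna()       # C000003 cast no Yes/No vote, so it is removed
print(scores)
```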

In [None]:
# Histogram of the number of Yes/No votes per legislator (log-scaled x-axis)
px.histogram(vote_2d, x="num votes", log_x=True, width=800, height=600)

In [None]:
px.scatter(vote_2d, x='z1_jittered', y='z2_jittered', color='party', symbol="gender", size='num votes',
           title='Vote Data (Size is Number of Votes)', width=800, height=600, size_max=10,
           opacity = 0.7,
           color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"},
           hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])

## Exploring the Principal Components

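We can also look at `vt` directly to gain insight into what each principal component captures. The bar chart below plots the first row of `vt`, which gives the weight of each roll call in the first component.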

In [None]:
fig_eig = px.bar(x=vote_pivot_centered.columns, y=vt[0,:])
# Display the figure; fig_eig is shown again later for comparison
fig_eig

We have the party affiliation labels, so we can check whether this eigenvector (the first row of `vt`) aligns with how one of the parties voted.

In [None]:
party_line_votes = (
    vote_pivot_centered.join(legs.set_index("leg_id")['party'])
                       .groupby("party").mean()
                       .T.reset_index()
                       .rename(columns={"index": "call"})
                       .melt("call")
)
fig = px.bar(
    party_line_votes,
    x="call", y="value", facet_row = "party", color="party",
    color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"})
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))


In [None]:
fig_eig
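
The visual comparison between the first row of `vt` and the party averages can also be summarized with a single number. The sketch below builds a tiny, fully made-up chamber (two parties labeled `D` and `R`, four legislators each, six roll calls, one crossover vote per party) and measures how strongly the first principal direction correlates with the gap between the two parties' mean votes. The names `party_gap` and `alignment` are illustrative, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy roster: 4 legislators per party, 6 roll calls
parties = pd.Series(["D"] * 4 + ["R"] * 4, name="party")
X = np.zeros((8, 6))
X[:4, :3] = 1   # party D votes Yes on roll calls 0-2
X[4:, 3:] = 1   # party R votes Yes on roll calls 3-5
X[0, 3] = 1     # one D legislator crosses over on roll call 3
X[4, 0] = 1     # one R legislator crosses over on roll call 0

X_centered = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(X_centered, full_matrices=False)

# Mean centered vote per party on each roll call, then the D-minus-R gap
party_means = pd.DataFrame(X_centered).groupby(parties).mean()
party_gap = (party_means.loc["D"] - party_means.loc["R"]).to_numpy()

# The sign of a singular vector is arbitrary, so compare absolute correlation
alignment = abs(np.corrcoef(vt[0], party_gap)[0, 1])
print(round(alignment, 3))
```

A value close to 1 indicates that the first component is essentially a party-line axis. On the real data, the same idea would use `vote_pivot_centered` and the `party` column from `legs`.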

## Biplot

In [None]:
loadings = pd.DataFrame({"pc1": np.sqrt(s[0]) * vt[0,:], "pc2": np.sqrt(s[1])*vt[1,:]}, index=vote_pivot_centered.columns)   
loadings.head()

In [None]:
fig = px.scatter(
    vote_2d, x='z1_jittered', y='z2_jittered', color='party', symbol="gender", size='num votes',
    title='Biplot', width=800, height=600, size_max=10,
    opacity = 0.7,
    color_discrete_map={'Democrat':'blue', 'Republican':'red', "Independent": "green"},
    hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])

for (call, pc1, pc2) in loadings.head(20).itertuples():
    fig.add_scatter(x=[0,pc1], y=[0,pc2], name=call, 
                    mode='lines+markers', textposition='top right',
                    marker= dict(size=10,symbol= "arrow-bar-up", angleref="previous"))
fig

Each roll call from the 116th Congress - 1st Session: https://clerk.house.gov/evs/2019/ROLL_500.asp
* 555: Raising a question of the privileges of the House ([H.Res.590](https://www.congress.gov/bill/116th-congress/house-resolution/590))
* 553: [S.J.Res.54](https://www.congress.gov/bill/116th-congress/senate-joint-resolution/54/actions)
* 527: On Agreeing to the Amendment [H.R.1146 - Arctic Cultural and Coastal Plain Protection Act](https://www.congress.gov/bill/116th-congress/house-bill/1146)