<a href="https://colab.research.google.com/github/KekaiApana/datasci112_final_project/blob/main/DATASCI_112_Supreme_Court_Data_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DATASCI 112 Final Project: Supreme Court Data Extraction**
### *By: Kekai and Adrian*

This project explores the Cornell Supreme Court Oral Arguments Corpus (https://convokit.cornell.edu/documentation/supreme.html), which covers cases spanning 1955 to 2019.

Research questions:
1. Can we predict how the Justices in the Roberts Court will vote on a case based on their linguistic patterns in oral arguments and voting history?
2. Do certain justices make similar arguments, and does that relate to how they vote?


In this file, we extract data from different parts of the corpus, along with external JSON data, and join them into the single dataset we will work with.

### Step 1: Retrieve the data from the Cornell Supreme Court Oral Arguments Corpus

In [None]:
!pip install convokit



In [None]:
from convokit import Corpus, download
import pandas as pd

corpus = Corpus(filename=download("supreme-corpus"))
corpus

Dataset already exists at /root/.convokit/saved-corpora/supreme-corpus


<convokit.model.corpus.Corpus at 0x7d0f7c70f0d0>

The corpus is made up of utterances: small pieces of what a speaker (a justice or an advocate) said during a case. Let's look at an example below.

In [None]:
utt = corpus.random_utterance()

In [None]:
print("ID:", utt.id, "\n")
print("Reply_to:", utt.reply_to, "\n")
print("Timestamp:", utt.timestamp, "\n")
print("Text:", utt.text, "\n")
print("Conversation ID:", utt.conversation_id, "\n")
print("Speaker ID:", utt.speaker.id)

ID: 19741__2_006 

Reply_to: 19741__2_005 

Timestamp: None 

Text: We would draw the line, Justice Scalia, on the function and we think what function is the actor, insurance company in this instance, performing?
If it is performing a compulsorily, an ERISA-mandated function, in this instance claims administration, we would say that that was ERISA and preemption applied. 

Conversation ID: 19741 

Speaker ID: john_e_nolan_jr
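
In the output above, the utterance ID `19741__2_006` begins with the conversation ID `19741`. The sketch below splits an ID into its parts. It assumes the layout `<conversation_id>__<session>_<position>`, which is inferred from this one printed sample and not confirmed against the ConvoKit documentation. `parse_utt_id` is a hypothetical helper, not part of ConvoKit.

```python
def parse_utt_id(utt_id):
    """Split an utterance ID such as '19741__2_006' into its parts.

    Assumed layout (inferred from the sample output, not from ConvoKit docs):
    <conversation_id>__<session>_<position>
    """
    conversation_id, rest = utt_id.split("__", 1)
    session, position = rest.split("_", 1)
    return {
        "conversation_id": conversation_id,
        "session": int(session),
        "position": int(position),
    }


parts = parse_utt_id("19741__2_006")
print(parts)
```

If the assumption holds, the `reply_to` value `19741__2_005` is simply the previous position in the same session.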


### Step 2: Get all utterances from the corpus and clean the data

In [None]:

df_utterances = corpus.get_utterances_dataframe()
df_utterances

id,timestamp,text,speaker,reply_to,conversation_id,meta.case_id,meta.start_times,meta.stop_times,meta.speaker_type,meta.side,meta.timestamp,vectors
13127__0_000,,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,13127,1955_71,"[0.0, 7.624]","[7.624, 9.218]",J,,0.0,[]
13127__0_001,,May it please the Court.\nWe are here by writ ...,harry_f_murphy,13127__0_000,13127,1955_71,"[9.218, 11.538, 15.653, 22.722, 28.849, 33.575]","[11.538, 15.653, 22.722, 28.849, 33.575, 48.138]",A,1,9.218,[]
13127__0_002,,Consecutive sentences.,j__william_o_douglas,13127__0_001,13127,1955_71,[48.138],[49.315],J,,48.138,[]
13127__0_003,,"Consecutive sentences.\nIn this case, the defe...",harry_f_murphy,13127__0_002,13127,1955_71,"[49.315, 51.844, 60.81, 67.083, 72.584, 89.839...","[51.844, 60.81, 67.083, 72.584, 89.839, 95.873...",A,1,49.315,[]
13127__0_004,,Was the aggregate prison sentence was 20 or 25...,<INAUDIBLE>,13127__0_003,13127,1955_71,[174.058],[176.766],,,174.058,[]
...,...,...,...,...,...,...,...,...,...,...,...,...
24969__2_007,,-- has all sorts of meaning that you're not en...,j__sonia_sotomayor,24969__2_006,24969,2019_19-67,"[3496.8, 3500.32, 3502.96, 3504.68]","[3500.32, 3502.96, 3504.68, 3506.04]",J,,3496.8,[]
24969__2_008,,"No, Your Honor --",eric_j_feigin,24969__2_007,24969,2019_19-67,[3506.04],[3506.56],A,1,3506.04,[]
24969__2_009,,-- altogether?,j__sonia_sotomayor,24969__2_008,24969,2019_19-67,[3506.56],[3507.76],J,,3506.56,[]
24969__2_010,,-- we are using the principles of complicity a...,eric_j_feigin,24969__2_009,24969,2019_19-67,"[3507.76, 3535.8]","[3535.8, 3536.32]",A,1,3507.76,[]


In [None]:
# Keeps only utterances spoken by justices (speaker type "J")
df_utt_justices = df_utterances[df_utterances["meta.speaker_type"] == "J"]

# Drops columns that are not needed for the analysis
columns = ["timestamp", "conversation_id", "meta.start_times", "meta.stop_times", "meta.side", "meta.timestamp", "vectors"]
df_utt_justices = df_utt_justices.drop(columns=columns)

# Copies the utterance ID from the index into a column and moves it to the front
df_utt_justices["utt_id"] = df_utt_justices.index
df_utt_justices = df_utt_justices[["utt_id", "text", "speaker", "reply_to", "meta.case_id", "meta.speaker_type"]]
df_utt_justices

id,utt_id,text,speaker,reply_to,meta.case_id,meta.speaker_type
13127__0_000,13127__0_000,"Number 71, Lonnie Affronti versus United State...",j__earl_warren,,1955_71,J
13127__0_002,13127__0_002,Consecutive sentences.,j__william_o_douglas,13127__0_001,1955_71,J
13127__0_008,13127__0_008,"Mr. Murphy, what statutory language do you rel...",j__earl_warren,13127__0_007,1955_71,J
13127__0_010,13127__0_010,3651?,j__earl_warren,13127__0_009,1955_71,J
13127__0_012,13127__0_012,What language there do you rely on to support ...,j__earl_warren,13127__0_011,1955_71,J
...,...,...,...,...,...,...
24969__2_003,24969__2_003,-- what in reading this statute would give an ...,j__sonia_sotomayor,24969__2_002,2019_19-67,J
24969__2_005,24969__2_005,But accomplice liability --,j__sonia_sotomayor,24969__2_004,2019_19-67,J
24969__2_007,24969__2_007,-- has all sorts of meaning that you're not en...,j__sonia_sotomayor,24969__2_006,2019_19-67,J
24969__2_009,24969__2_009,-- altogether?,j__sonia_sotomayor,24969__2_008,2019_19-67,J
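
A quick sanity check on the filter can be sketched on a toy frame. The rows below are copied from the sample output, with only the columns needed for the check; the real input would be `df_utterances`. The `j__` prefix on justice speaker IDs is a pattern seen in this sample, so treat the prefix check as an assumption, not a guarantee of the corpus.

```python
import pandas as pd

# Toy rows copied from the sample output; the real frame is df_utterances.
toy = pd.DataFrame(
    {
        "speaker": [
            "j__earl_warren",
            "harry_f_murphy",
            "j__william_o_douglas",
            "harry_f_murphy",
            "j__earl_warren",
        ],
        "meta.case_id": ["1955_71"] * 5,
        "meta.speaker_type": ["J", "A", "J", "A", "J"],
    },
    index=[
        "13127__0_000",
        "13127__0_001",
        "13127__0_002",
        "13127__0_003",
        "13127__0_008",
    ],
)

# Same filter as in the cell above.
justices_only = toy[toy["meta.speaker_type"] == "J"]

# Assumption: justice speaker IDs start with "j__", as they do in this sample.
all_prefixed = justices_only["speaker"].str.startswith("j__").all()

# Number of utterances per justice within each case.
counts = justices_only.groupby(["meta.case_id", "speaker"]).size()
print(all_prefixed)
print(counts)
```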


### Step 3: Get the data for all SCOTUS cases and clean the data

In [None]:
url = "https://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/cases.jsonl"
df_cases = pd.read_json(url, lines=True)
df_cases

Unnamed: 0,id,year,citation,title,petitioner,respondent,docket_no,court,decided_date,url,...,adv_sides_inferred,known_respondent_adv,advocates,win_side,win_side_detail,scdb_docket_id,votes,votes_detail,is_eq_divided,votes_side
0,1955_71,1955,350 US 79,Affronti v. United States,Affronti,United States,71,Warren Court,"Dec 5, 1955",https://www.oyez.org/cases/1955/71,...,True,True,"{'Harry F. Murphy': {'id': 'harry_f_murphy', '...",0.0,2.0,1955-009-01,"{'j__john_m_harlan2': 2.0, 'j__hugo_l_black': ...","{'j__john_m_harlan2': 1.0, 'j__hugo_l_black': ...",0.0,"{'j__john_m_harlan2': 0.0, 'j__hugo_l_black': ..."
1,1955_410,1955,351 US 79,"American Airlines, Inc. v. North American Airl...","American Airlines, Inc.","North American Airlines, Inc.",410,Warren Court,"Apr 23, 1956",https://www.oyez.org/cases/1955/410,...,True,True,{'Howard C. Westwood': {'id': 'howard_c_westwo...,1.0,4.0,1955-071-01,"{'j__john_m_harlan2': 2.0, 'j__hugo_l_black': ...","{'j__john_m_harlan2': 1.0, 'j__hugo_l_black': ...",0.0,"{'j__john_m_harlan2': 1.0, 'j__hugo_l_black': ..."
2,1955_351,1955,350 US 532,Archawski v. Hanioti,Archawski,Hanioti,351,Warren Court,"Apr 9, 1956",https://www.oyez.org/cases/1955/351,...,True,False,"{'Harry D. Graham': {'id': 'harry_d_graham', '...",1.0,4.0,1955-053-01,"{'j__john_m_harlan2': 2.0, 'j__hugo_l_black': ...","{'j__john_m_harlan2': 1.0, 'j__hugo_l_black': ...",0.0,"{'j__john_m_harlan2': 1.0, 'j__hugo_l_black': ..."
3,1955_38,1955,350 US 568,Armstrong v. Armstrong,Armstrong,Armstrong,38,Warren Court,"Apr 9, 1956",https://www.oyez.org/cases/1955/38,...,True,False,"{'Robert N. Gorman': {'id': 'robert_n_gorman',...",0.0,2.0,1955-056-01,"{'j__john_m_harlan2': 2.0, 'j__hugo_l_black': ...","{'j__john_m_harlan2': 1.0, 'j__hugo_l_black': ...",0.0,"{'j__john_m_harlan2': 0.0, 'j__hugo_l_black': ..."
4,1955_49,1955,350 US 198,"Bernhardt v. Polygraphic Company of America, Inc.",Bernhardt,"Polygraphic Company of America, Inc.",49,Warren Court,"Jan 16, 1956",https://www.oyez.org/cases/1955/49,...,True,False,"{'Manfred W. Ehrich, Jr.': {'id': 'manfred_w_e...",1.0,4.0,1955-020-01,"{'j__john_m_harlan2': 1.0, 'j__hugo_l_black': ...","{'j__john_m_harlan2': 2.0, 'j__hugo_l_black': ...",0.0,"{'j__john_m_harlan2': 0.0, 'j__hugo_l_black': ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7743,2019_19-46,2019,591 US _,U.S. Patent and Trademark Office v. Booking.co...,United States Patent and Trademark Office,Booking.com B.V.,19-46,Roberts Court,"Jun 30, 2020",https://www.oyez.org/cases/2019/19-46,...,False,True,"{'Erica L. Ross': {'id': 'erica_l_ross', 'name...",0.0,2.0,2019-049-01,"{'j__john_g_roberts_jr': 2.0, 'j__clarence_tho...","{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho...",0.0,"{'j__john_g_roberts_jr': 0.0, 'j__clarence_tho..."
7744,2019_19-177,2019,591 US _,United States Agency for International Develop...,United States Agency for International Develop...,"Alliance for Open Society International, Inc.,...",19-177,Roberts Court,"Jun 29, 2020",https://www.oyez.org/cases/2019/19-177,...,False,True,{'Christopher G. Michel': {'id': 'christopher_...,1.0,3.0,2019-052-01,"{'j__john_g_roberts_jr': 2.0, 'j__clarence_tho...","{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho...",0.0,"{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho..."
7745,2019_18-1584,2019,590 US _,United States Forest Service v. Cowpasture Riv...,"United States Forest Service, et al.","Cowpasture River Association, et al.",18-1584,Roberts Court,"Jun 15, 2020",https://www.oyez.org/cases/2019/18-1584,...,False,True,"{'Anthony A. Yang': {'id': 'anthony_a_yang', '...",1.0,4.0,2019-041-01,"{'j__john_g_roberts_jr': 2.0, 'j__clarence_tho...","{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho...",0.0,"{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho..."
7746,2019_19-67,2019,590 US _,United States v. Sineneng-Smith,United States of America,Evelyn Sineneng-Smith,19-67,Roberts Court,"May 7, 2020",https://www.oyez.org/cases/2019/19-67,...,False,True,"{'Eric J. Feigin': {'id': 'eric_j_feigin', 'na...",1.0,5.0,2019-043-01,"{'j__john_g_roberts_jr': 2.0, 'j__clarence_tho...","{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho...",0.0,"{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho..."


In [None]:
# Drops unnecessary columns
caseCols = ["citation", "petitioner", "respondent", "docket_no", "decided_date", "url", "transcripts",
            "adv_sides_inferred", "known_respondent_adv", "advocates", "win_side", "win_side_detail",
            "scdb_docket_id", "is_eq_divided", "votes_detail"]
df_cases_clean = df_cases.drop(columns=caseCols)

# Filters data to include only Roberts Court cases
df_cases_clean = df_cases_clean[df_cases_clean["court"] == "Roberts Court"]

# Renames the case id column to match the utterance data before the merge
# (reassigning avoids modifying a filtered slice in place)
df_cases_clean = df_cases_clean.rename(columns={"id": "meta.case_id"})
df_cases_clean

Unnamed: 0,meta.case_id,year,title,court,votes,votes_side
6573,2005_128-orig,2005,Alaska v. United States,Roberts Court,"{'j__john_paul_stevens': 2.0, 'j__sandra_day_o...","{'j__john_paul_stevens': 0.0, 'j__sandra_day_o..."
6574,2005_04-433,2005,Anza v. Ideal Steel Supply Corporation,Roberts Court,"{'j__john_paul_stevens': 2.0, 'j__antonin_scal...","{'j__john_paul_stevens': 1.0, 'j__antonin_scal..."
6575,2005_04-944,2005,Arbaugh v. Y & H Corp.,Roberts Court,"{'j__john_paul_stevens': 2.0, 'j__antonin_scal...","{'j__john_paul_stevens': 1.0, 'j__antonin_scal..."
6576,2005_8-orig,2005,Arizona v. California,Roberts Court,"{'j__john_paul_stevens': 2.0, 'j__antonin_scal...","{'j__john_paul_stevens': 0.0, 'j__antonin_scal..."
6577,2005_04-1506,2005,Arkansas Dept. of Health and Human Servs. v. A...,Roberts Court,"{'j__john_paul_stevens': 2.0, 'j__antonin_scal...","{'j__john_paul_stevens': 0.0, 'j__antonin_scal..."
...,...,...,...,...,...,...
7742,2019_19-635,2019,Trump v. Vance,Roberts Court,"{'j__john_g_roberts_jr': 2.0, 'j__clarence_tho...","{'j__john_g_roberts_jr': 0.0, 'j__clarence_tho..."
7743,2019_19-46,2019,U.S. Patent and Trademark Office v. Booking.co...,Roberts Court,"{'j__john_g_roberts_jr': 2.0, 'j__clarence_tho...","{'j__john_g_roberts_jr': 0.0, 'j__clarence_tho..."
7744,2019_19-177,2019,United States Agency for International Develop...,Roberts Court,"{'j__john_g_roberts_jr': 2.0, 'j__clarence_tho...","{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho..."
7745,2019_18-1584,2019,United States Forest Service v. Cowpasture Riv...,Roberts Court,"{'j__john_g_roberts_jr': 2.0, 'j__clarence_tho...","{'j__john_g_roberts_jr': 1.0, 'j__clarence_tho..."


In [None]:
# Expands votes column
votes_expanded = df_cases_clean['votes'].apply(pd.Series)
votes_expanded = votes_expanded.add_prefix('votes.')

# Expands votes_side column
votes_side_expanded = df_cases_clean['votes_side'].apply(pd.Series)
votes_side_expanded = votes_side_expanded.add_prefix('votes_side.')

# Combines data into one dataframe
df_expanded = pd.concat([df_cases_clean.drop(columns=['votes', 'votes_side']), votes_expanded, votes_side_expanded], axis=1)
df_expanded

Unnamed: 0,meta.case_id,year,title,court,votes.j__john_paul_stevens,votes.j__sandra_day_oconnor,votes.j__antonin_scalia,votes.j__anthony_m_kennedy,votes.j__david_h_souter,votes.j__clarence_thomas,...,votes_side.j__david_h_souter,votes_side.j__clarence_thomas,votes_side.j__ruth_bader_ginsburg,votes_side.j__stephen_g_breyer,votes_side.j__john_g_roberts_jr,votes_side.j__samuel_a_alito_jr,votes_side.j__sonia_sotomayor,votes_side.j__elena_kagan,votes_side.j__neil_gorsuch,votes_side.j__brett_m_kavanaugh
6573,2005_128-orig,2005,Alaska v. United States,Roberts Court,2.0,2.0,2.0,2.0,2.0,2.0,...,0.0,0.0,0.0,0.0,,,,,,
6574,2005_04-433,2005,Anza v. Ideal Steel Supply Corporation,Roberts Court,2.0,,2.0,2.0,2.0,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,,,,
6575,2005_04-944,2005,Arbaugh v. Y & H Corp.,Roberts Court,2.0,,2.0,2.0,2.0,2.0,...,1.0,1.0,1.0,1.0,1.0,,,,,
6576,2005_8-orig,2005,Arizona v. California,Roberts Court,2.0,,2.0,2.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
6577,2005_04-1506,2005,Arkansas Dept. of Health and Human Servs. v. A...,Roberts Court,2.0,,2.0,2.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7742,2019_19-635,2019,Trump v. Vance,Roberts Court,,,,,,1.0,...,,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7743,2019_19-46,2019,U.S. Patent and Trademark Office v. Booking.co...,Roberts Court,,,,,,2.0,...,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
7744,2019_19-177,2019,United States Agency for International Develop...,Roberts Court,,,,,,2.0,...,,1.0,0.0,0.0,1.0,1.0,0.0,,1.0,1.0
7745,2019_18-1584,2019,United States Forest Service v. Cowpasture Riv...,Roberts Court,,,,,,2.0,...,,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0
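
The `NaN` entries in the expanded table come from how the dictionary column is unpacked: each key becomes a column, and a justice who does not appear in a given case's dictionary has no value for that row. A minimal sketch on a toy frame, using the same `apply(pd.Series)` approach as the cell above; the dictionaries are shortened to two justices each for illustration.

```python
import pandas as pd

# Toy frame: two cases with abbreviated vote dictionaries
# (the real dictionaries list every justice who voted on the case).
toy_cases = pd.DataFrame(
    {
        "meta.case_id": ["2005_128-orig", "2005_04-433"],
        "votes": [
            {"j__john_paul_stevens": 2.0, "j__sandra_day_oconnor": 2.0},
            {"j__john_paul_stevens": 2.0, "j__antonin_scalia": 2.0},
        ],
    }
)

# Each dictionary becomes one row; its keys become prefixed columns.
expanded = toy_cases["votes"].apply(pd.Series).add_prefix("votes.")

# Joins the expanded columns back onto the case metadata.
wide = pd.concat([toy_cases.drop(columns=["votes"]), expanded], axis=1)
print(wide)
```

A `NaN` here therefore means "no entry for this justice in this case", which is different from a recorded vote value.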


### Step 4: Merge the data

In [None]:
df_merge = df_utt_justices.merge(df_expanded, how="inner", on="meta.case_id")
df_merge

Unnamed: 0,utt_id,text,speaker,reply_to,meta.case_id,meta.speaker_type,year,title,court,votes.j__john_paul_stevens,...,votes_side.j__david_h_souter,votes_side.j__clarence_thomas,votes_side.j__ruth_bader_ginsburg,votes_side.j__stephen_g_breyer,votes_side.j__john_g_roberts_jr,votes_side.j__samuel_a_alito_jr,votes_side.j__sonia_sotomayor,votes_side.j__elena_kagan,votes_side.j__neil_gorsuch,votes_side.j__brett_m_kavanaugh
0,22620__0_000,We'll hear argument first this morning in 04-4...,j__john_g_roberts_jr,,2005_04-433,J,2005,Anza v. Ideal Steel Supply Corporation,Roberts Court,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,,,,
1,22620__0_002,"Well, isn't there something different here?\nB...",j__david_h_souter,22620__0_001,2005_04-433,J,2005,Anza v. Ideal Steel Supply Corporation,Roberts Court,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,,,,
2,22620__0_004,"Sure, but they suffered the harm because the f...",j__david_h_souter,22620__0_003,2005_04-433,J,2005,Anza v. Ideal Steel Supply Corporation,Roberts Court,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,,,,
3,22620__0_006,"Well, why... why is that true?\nLet's assume t...",j__anthony_m_kennedy,22620__0_005,2005_04-433,J,2005,Anza v. Ideal Steel Supply Corporation,Roberts Court,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,,,,
4,22620__0_008,"Mr. Frederick, you... you started by saying ho...",j__ruth_bader_ginsburg,22620__0_007,2005_04-433,J,2005,Anza v. Ideal Steel Supply Corporation,Roberts Court,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125912,24969__2_003,-- what in reading this statute would give an ...,j__sonia_sotomayor,24969__2_002,2019_19-67,J,2019,United States v. Sineneng-Smith,Roberts Court,,...,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
125913,24969__2_005,But accomplice liability --,j__sonia_sotomayor,24969__2_004,2019_19-67,J,2019,United States v. Sineneng-Smith,Roberts Court,,...,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
125914,24969__2_007,-- has all sorts of meaning that you're not en...,j__sonia_sotomayor,24969__2_006,2019_19-67,J,2019,United States v. Sineneng-Smith,Roberts Court,,...,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
125915,24969__2_009,-- altogether?,j__sonia_sotomayor,24969__2_008,2019_19-67,J,2019,United States v. Sineneng-Smith,Roberts Court,,...,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
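
Because the merge is an inner join, any utterance whose case is not in the Roberts Court table is dropped without notice. The sketch below shows one way to inspect that on toy stand-ins for `df_utt_justices` and `df_expanded`, using the `validate` and `indicator` parameters of `DataFrame.merge`. The toy frames are illustrative and hold only a few rows taken from the outputs above.

```python
import pandas as pd

# Toy stand-ins for df_utt_justices and df_expanded.
utts = pd.DataFrame(
    {
        "utt_id": ["13127__0_000", "22620__0_000", "22620__0_002"],
        "speaker": ["j__earl_warren", "j__john_g_roberts_jr", "j__david_h_souter"],
        "meta.case_id": ["1955_71", "2005_04-433", "2005_04-433"],
    }
)
cases = pd.DataFrame(
    {
        "meta.case_id": ["2005_04-433", "2005_128-orig"],
        "title": ["Anza v. Ideal Steel Supply Corporation", "Alaska v. United States"],
    }
)

# validate="many_to_one" raises MergeError if a case ID repeats on the right side.
# indicator=True adds a "_merge" column recording where each key was found.
checked = utts.merge(
    cases, how="outer", on="meta.case_id", validate="many_to_one", indicator=True
)
merge_counts = checked["_merge"].value_counts()
print(merge_counts)

# Rows marked "both" are exactly what the inner join keeps.
inner = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(len(inner))
```

In this toy data the 1955 utterance is `left_only`, so the inner join drops it, which is the intended effect of restricting the dataset to the Roberts Court.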


In [None]:
df_merge.to_csv("scotus_roberts_data.csv", index=False)
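
The saved file is wide: every row carries one `votes_side.*` column per justice, whoever is speaking. For the first research question, a possible next step is to pull out the speaker's own value. This is a sketch on a toy frame, not part of the extraction above: the `utt_id` values and the `own_votes_side` column name are hypothetical, the vote values are copied from the Trump v. Vance row, and `own_vote_side` is a hypothetical helper.

```python
import pandas as pd

# Toy slice shaped like the merged data; utt_id values are made up for illustration.
toy_merge = pd.DataFrame(
    {
        "utt_id": ["toy_0", "toy_1", "toy_2"],
        "speaker": ["j__clarence_thomas", "j__ruth_bader_ginsburg", "j__earl_warren"],
        "votes_side.j__clarence_thomas": [1.0, 1.0, 1.0],
        "votes_side.j__ruth_bader_ginsburg": [0.0, 0.0, 0.0],
    }
)


def own_vote_side(row):
    # Looks up the column named after the speaker; falls back to NaN
    # when the speaker has no votes_side column in the frame.
    return row.get("votes_side." + row["speaker"], float("nan"))


toy_merge["own_votes_side"] = toy_merge.apply(own_vote_side, axis=1)
print(toy_merge[["utt_id", "speaker", "own_votes_side"]])
```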