## Supreme Court Project Guide

The ultimate goal of this project is to build a database of Supreme Court cases for 2020 (or a different range of years) that includes the dialogue from the oral arguments of each case. As we have seen in class, the arguments are scraped from this page: https://www.supremecourt.gov/oral_arguments/argument_transcript.aspx

See if you can follow that guide to downloading the PDFs and transforming them to text (don't be shy on Slack!)

Once you have a folder of text transcripts, there are three primary programmatic steps that you need to complete:

**Please note:** Step 3 is the most challenging--if you want to spend some time coding, you can skip Steps 1 and 2 and get to work on Step 3

**STEP 1:** scrape all of the case information available on this page: https://www.supremecourt.gov/oral_arguments/argument_transcript/2020

This should include case name, docket number, etc--and most importantly the name of the PDF file. All of the text files share the exact same name as the PDF files they came from. This file name will allow you to connect your transcription data with your case data. 
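
One way to make that connection is sketched below. It assumes each case record stores the PDF link in a `link` field (as in the Step 1 output further down) and that the text file keeps the PDF's base name; `pdf_stem` is a hypothetical helper name, not part of the assignment.

```python
from pathlib import PurePosixPath


def pdf_stem(link):
    """Return the file name of a PDF link without its directory or extension."""
    return PurePosixPath(link).stem


# One case record shaped like the Step 1 dictionaries (docket and link only).
case = {"docket": "19-71", "link": "../argument_transcripts/2020/19-71_e2q3.pdf"}

key = pdf_stem(case["link"])
print(key)           # 19-71_e2q3
print(key + ".txt")  # 19-71_e2q3.txt
```

Storing this key on both the case record and the parsed transcript gives you a shared field to join on later.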

It is up to you what kind of data structure you want to build. But it is likely to be a list of lists or a list of dictionaries: for each case you will have a list or dictionary of the information you scrape from the webpage.

**STEP 2:** find secondary source(s) to scrape/integrate with your case data. The information on the Supreme Court page is very limited. You need to find a source or group of sources that add information. The most important information would likely be: the decision, who voted for and against, and the district court origin of the case (for geocoding). You might think of other great things to put in there too! This information needs to be merged with the data you have from STEP 1.

**STEP 3:** use regular expressions to clean up and parse the text files so that you have a searchable data structure containing the dialog from the transcripts. 

**Data Architecture** 
You will need to think about how you will set up, separate, and join the different tables that you create. The initial scraping will give you a very simple dataframe: the columns will be docket, case name, date argued, and PDF name. The regex work on the transcripts should result in a very simple table (or just a list of tuples) of speaker name and dialogue.

`[('MR. BERGERON'," Yes. That's essentially the same thing"),('JUSTICE SOTOMAYOR',' So how do you deal with Chambers?')]`

But make sure you attach the docket number or pdf filename to each set of arguments you transform using regex. Your secondary sources and information should be linked by docket number, but the question is how to set up those data frames, join them, aggregate them, and narrow them to the fields necessary for presentation.
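
As a rough illustration of the linking idea, here is a minimal pure-Python join on docket number. The two small tables are hand-typed stand-ins for the case data and the parsed dialogue, and the names `cases`, `utterances`, and `joined` are assumptions made for this sketch.

```python
# Case-level records (from scraping) and utterance-level records (from the regex work).
cases = [
    {"docket": "19-71", "case": "Tanzin v. Tanvir"},
    {"docket": "19-1257", "case": "Brnovich v. Democratic National Committee"},
]
utterances = [
    {"docket": "19-71", "speaker": "MR. KNEEDLER",
     "text": "Mr. Chief Justice, and may it please the Court:"},
    {"docket": "19-1257", "speaker": "CARVIN", "text": "-- and notable --"},
]

# Index the case table by docket so each lookup is a single dictionary access.
cases_by_docket = {c["docket"]: c for c in cases}

# Join: copy the case fields onto every utterance that has a matching docket.
joined = [
    {**cases_by_docket[u["docket"]], **u}
    for u in utterances
    if u["docket"] in cases_by_docket
]

for row in joined:
    print(row["docket"], "|", row["case"], "|", row["speaker"])
```

With dataframes, the same idea is a `pd.merge(..., on='docket')`; the dictionary version just makes the matching step explicit.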

Go step-by-step through this, and DM me on Slack whenever you get stuck, and I will help. If you complete all the steps before Friday, Slack me if you want to go further.

**Interpretive Architecture**
Also consider what kind of interpretive categories you can add through your reading and research. At the very least, it is recommended that you come up with categories for the kinds of cases that are before the court: human clustering for meaning is always more effective than computational clustering. Try to come up with perhaps 8 to 10 domains that groups of cases might belong to. But also think of other ways of categorizing these cases or these decisions--by politics, by consequences on citizens (you could make a scale from 1 to 10), even an aggregated index of consequences/effects on different types of communities, sectors, regions, etc. 

You are the researcher; these categories are ways of expressing your point of view.
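
One possible way to record such categories is a hand-built lookup keyed by docket number, sketched below. The domain labels are illustrative readings of the opinion summaries shown later in this notebook, not an official classification, and `domain_by_docket` is a hypothetical name.

```python
from collections import Counter

# Hand-assigned domains keyed by docket number (illustrative labels only).
domain_by_docket = {
    "19-1257": "voting rights",
    "19-71": "religious liberty",
    "20-440": "patent law",
}

cases = [
    {"docket": "19-1257"},
    {"docket": "19-71"},
    {"docket": "20-440"},
    {"docket": "19-251"},
]

# Attach a domain to each case, marking anything not yet categorized.
for case in cases:
    case["domain"] = domain_by_docket.get(case["docket"], "uncategorized")

# Count how many cases fall into each domain.
domain_counts = Counter(case["domain"] for case in cases)
print(domain_counts)
```

Keeping the categories in their own lookup means you can revise your reading of a case without touching the scraped data.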



### STEP 1
Scrape all of the necessary information from:

https://www.supremecourt.gov/oral_arguments/argument_transcript/2020

This should result in a list of dictionaries, one dictionary per case.

In [370]:
###Import your scraping libraries
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
my_url = "https://www.supremecourt.gov/oral_arguments/argument_transcript/2020"
raw_html = requests.get(my_url).content
soup_doc = BeautifulSoup(raw_html, "html.parser")
table = soup_doc.find_all(class_="table table-bordered")

In [372]:
###Write your scraping code here
#Scraping the transcripts table first
    
entire_rows = []
for each_table in table:
    all_rows = each_table.find_all('tr')
    for row in all_rows[1:]:
        case_dict = {}
        case_dict['date'] = row.find_all('td')[1].text
        case_dict['docket'] = row.a.string
        case_dict['case'] = row.find_all('span')[1].text
        case_dict['link'] = row.a['href']
        entire_rows.append(case_dict)

#Printing the list of dictionaries  
entire_rows

[{'date': '04/19/21',
  'docket': '20-543',
  'case': 'Yellen v. Confederated Tribes of Chehalis Reservation',
  'link': '../argument_transcripts/2020/20-543_hgci.pdf'},
 {'date': '04/19/21',
  'docket': '20-315',
  'case': 'Santos Sanchez v. Mayorkas',
  'link': '../argument_transcripts/2020/20-315_l647.pdf'},
 {'date': '04/20/21',
  'docket': '19-8709',
  'case': 'Greer v. United States',
  'link': '../argument_transcripts/2020/19-8709_5hek.pdf'},
 {'date': '04/20/21',
  'docket': '20-444',
  'case': 'United States v. Gary',
  'link': '../argument_transcripts/2020/20-444_5i26.pdf'},
 {'date': '04/21/21',
  'docket': '20-334',
  'case': 'San Antonio v. Hotels.com, L.P.',
  'link': '../argument_transcripts/2020/20-334_p86b.pdf'},
 {'date': '04/21/21',
  'docket': '20-440',
  'case': 'Minerva Surgical, Inc. v. Hologic, Inc.',
  'link': '../argument_transcripts/2020/20-440_k5fm.pdf'},
 {'date': '04/26/21',
  'docket': '19-251',
  'case': 'Americans for Prosperity Foundation v. Bonta',
  

### STEP 2 
Scrape the additional source(s)

For this you need to do research and try to find sources that will give you useful information that you can add to the table/dictionary you created in Step 1.

Here are some recommended sources that you can scrape and add to your data. You do not need to scrape all of these, and you may want to look for other sources that are useful.

Geographical locations:
https://system.uslegal.com/us-courts-of-appeals/

Transcripts by year
https://www.supremecourt.gov/oral_arguments/argument_transcript/2017

Dockets by circuit court (I recommend at least this one):
https://www.supremecourt.gov/orders/ordersbycircuit/ordercasebycircuit/061118OrderCasesByCircuit

Docket information by case:
https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/17-7919.html

Opinions (as seen in Homework 3):
https://www.supremecourt.gov/opinions/slipopinion/17

In [361]:
geo_url = "https://www.supremecourt.gov/opinions/slipopinion/20"
html = requests.get(geo_url).content
soup_doc = BeautifulSoup(html, "html.parser")

the_tables = soup_doc.find_all('td')
dockets = soup_doc.find_all('td', {'style':'text-align: center; white-space: nowrap;'})


all_data = []
for data in the_tables[5:]:
    try:
        myrow = {}
        myrow['title'] = data.find('a')['title']
        all_data.append(myrow)
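    except (TypeError, KeyError):
        # Skip cells with no link (find returns None) or a link without a title attribute
        pass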
    
all_data

[{'title': 'The District Court’s judgment—which vacated as unlawful the Centers for Disease Control and Prevention’s imposition of a nationwide moratorium on evictions of any tenants who live in a county that is experiencing substantial or high levels of COVID–19 transmission and who make certain declarations of financial need, 86 Fed. Reg. 43244—is enforceable and the stay of that judgment is vacated.'},
 {'title': 'In this federal habeas case, the Eleventh Circuit erred in characterizing the Alabama court’s case-specific analysis as a “categorical rule” that any prisoner will always lose an ineffective-assistance-of-trial-counsel claim if he fails to call and question trial counsel concerning his or her actions and reasoning; the Alabama court did not violate clearly established federal law when it rejected Reeves’ ineffective-assistance-of-trial-counsel claim.'},
 {'title': 'Arizona’s challenged voting regulations governing precinct-based election-day voting (rejecting ballots cast 

In [362]:
titles = pd.DataFrame(all_data)
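# Keep only the text of each docket cell, not the whole tag object
dockets = pd.DataFrame({'docket': [td.get_text(strip=True) for td in dockets]})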

In [363]:
df3 = titles.join(dockets, how='outer')

In [366]:
df3 = pd.read_csv("dockets.csv")

In [367]:
df3

Unnamed: 0,title,docket
0,The District Court’s judgment—which vacated as...,21A23
1,"In this federal habeas case, the Eleventh Circ...",20-1084
2,Arizona’s challenged voting regulations govern...,19-1257
3,"The Ninth Circuit’s judgment, which vacated th...",19-251
4,The well-grounded patent law doctrine of assig...,20-440
...,...,...
63,Because plaintiff Adams has not shown that he ...,19-309
64,The Religious Freedom Restoration Act of 1993’...,19-71
65,Respondent is enjoined from enforcing Executiv...,20A87
66,Because any reasonable correctional officer sh...,19-1261


### STEP 3
Here we go: the text files that were extracted from the PDFs are quite messy. You do not need to get them perfect, but you need to clean them up enough that you can home in on the arguments themselves. Below I take you step-by-step through what you need to do. In the end you want to have a separate list for each case that contains each speaker and the dialogue attached to that speaker.

**Step 1:** Download the text files from courseworks.

Make sure they are locally on your computer. 

Open up the text files in a text editor like Sublime Text, and look carefully at the problems with the files. How will you clean this up?

**Step 2:** Eventually you will want to loop through all of the text files and run the cleanup on all of them. But first just select one text file to open up and begin cleaning up.

In [48]:
#Import the regular expression library
import re
import requests

url = "https://www.supremecourt.gov/oral_arguments/argument_transcript/2020"
html = requests.get(url).content

soup_doc = BeautifulSoup(html, "html.parser")

In [49]:
all_tables = soup_doc.find_all(class_="table table-bordered")

In [336]:
all_pdf_links = []

for table in all_tables[1:]:
    good_row = table.find_all('tr')
    for row in good_row:
        if row.td is not None:
            print(row.a['href'])
            all_pdf_links.append(row.a['href'][3:])
all_pdf_links

../argument_transcripts/2020/20-107_n758.pdf
../argument_transcripts/2020/19-1414_p86b.pdf
../argument_transcripts/2020/20-157_5i36.pdf
../argument_transcripts/2020/20-222_3fbh.pdf
../argument_transcripts/2020/20-297_3ea4.pdf
../argument_transcripts/2020/20-512_g314.pdf
../argument_transcripts/2020/142-orig_2_3ebh.pdf
../argument_transcripts/2020/19-1155_6537.pdf
../argument_transcripts/2020/20-18_986b.pdf
../argument_transcripts/2020/19-1434_e1p3.pdf
../argument_transcripts/2020/19-1257_1b7d.pdf
../argument_transcripts/2020/19-1442_9o6b.pdf
../argument_transcripts/2020/19-897_l537.pdf
../argument_transcripts/2020/19-968_6kh7.pdf
../argument_transcripts/2020/19-508_3f14.pdf
../argument_transcripts/2020/19-1231_9ol1.pdf
../argument_transcripts/2020/19-1189_k53m.pdf
../argument_transcripts/2020/20-366_7lho.pdf
../argument_transcripts/2020/19-783_2d8f.pdf
../argument_transcripts/2020/19-416_6k47.pdf
../argument_transcripts/2020/19-930_c07e.pdf
../argument_transcripts/2020/19-5807_i4dj.pdf

['argument_transcripts/2020/20-107_n758.pdf',
 'argument_transcripts/2020/19-1414_p86b.pdf',
 'argument_transcripts/2020/20-157_5i36.pdf',
 'argument_transcripts/2020/20-222_3fbh.pdf',
 'argument_transcripts/2020/20-297_3ea4.pdf',
 'argument_transcripts/2020/20-512_g314.pdf',
 'argument_transcripts/2020/142-orig_2_3ebh.pdf',
 'argument_transcripts/2020/19-1155_6537.pdf',
 'argument_transcripts/2020/20-18_986b.pdf',
 'argument_transcripts/2020/19-1434_e1p3.pdf',
 'argument_transcripts/2020/19-1257_1b7d.pdf',
 'argument_transcripts/2020/19-1442_9o6b.pdf',
 'argument_transcripts/2020/19-897_l537.pdf',
 'argument_transcripts/2020/19-968_6kh7.pdf',
 'argument_transcripts/2020/19-508_3f14.pdf',
 'argument_transcripts/2020/19-1231_9ol1.pdf',
 'argument_transcripts/2020/19-1189_k53m.pdf',
 'argument_transcripts/2020/20-366_7lho.pdf',
 'argument_transcripts/2020/19-783_2d8f.pdf',
 'argument_transcripts/2020/19-416_6k47.pdf',
 'argument_transcripts/2020/19-930_c07e.pdf',
 'argument_transcripts/2

In [53]:
len(all_pdf_links)

45

In [54]:
#Downloading pdfs into a folder

import time
import requests
for url in all_pdf_links:
    time.sleep(2)  # pause between requests so we do not hammer the server
    link = 'https://www.supremecourt.gov/oral_arguments/' + url
    file_name = "/Users/richardabbey/Desktop/Homeworks/Final_Project/" + url.split('/')[-1]
    r = requests.get(link, stream=True)
    with open(file_name, 'wb') as pdf_file:
        # iter_content() defaults to 1-byte chunks; a larger chunk size is much faster
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                pdf_file.write(chunk)

In [56]:
#Here I make a list of the names of the PDFs

pdf_names = [url.split('/')[-1] for url in all_pdf_links]
pdf_names

['20-107_n758.pdf',
 '19-1414_p86b.pdf',
 '20-157_5i36.pdf',
 '20-222_3fbh.pdf',
 '20-297_3ea4.pdf',
 '20-512_g314.pdf',
 '142-orig_2_3ebh.pdf',
 '19-1155_6537.pdf',
 '20-18_986b.pdf',
 '19-1434_e1p3.pdf',
 '19-1257_1b7d.pdf',
 '19-1442_9o6b.pdf',
 '19-897_l537.pdf',
 '19-968_6kh7.pdf',
 '19-508_3f14.pdf',
 '19-1231_9ol1.pdf',
 '19-1189_k53m.pdf',
 '20-366_7lho.pdf',
 '19-783_2d8f.pdf',
 '19-416_6k47.pdf',
 '19-930_c07e.pdf',
 '19-5807_i4dj.pdf',
 '18-1447_apl1.pdf',
 '19-351_d0fi.pdf',
 '19-511_l537.pdf',
 '19-963_2c8f.pdf',
 '19-422_4gdj.pdf',
 '19-547_c07d.pdf',
 '19-199_m6hn.pdf',
 '18-1259_e2p3.pdf',
 '19-5410_8n59.pdf',
 '19-123_o758.pdf',
 '19-863_k5gm.pdf',
 '19-546_2d9g.pdf',
 '19-840_1a72.pdf',
 '19-309_4425.pdf',
 '65-orig_7l48.pdf',
 '18-540_8njq.pdf',
 '19-71_e2q3.pdf',
 '18-956_2dp3.pdf',
 '19-368_m648.pdf',
 '19-108_e1p3.pdf',
 '19-357_2b35.pdf',
 '19-292_5hdk.pdf',
 '19-438_q713.pdf']

In [62]:
#Converting from pdfs into text

import tika
from tika import parser
import time
one_pdf = pdf_names  # the 45 PDF names collected in the cell above
text_list = []
for urls in one_pdf:
    time.sleep(2)
    file_n = urls.split('/')[-1]
   # print(file_n)
    file_name = "/Users/richardabbey/Desktop/Homeworks/Final_Project/" + file_n
    parsed_pdf = parser.from_file(file_name) 
    txt_data = parsed_pdf['content']
    txt_name = file_n.split('.')[0] + "NEW.txt"
    print(txt_name)
    text_list.append(txt_name)
    file_out = "/Users/richardabbey/Desktop/Homeworks/Final_Project/" + txt_name
    with open(file_out, 'w') as outfile:
        outfile.write(txt_data)

20-107_n758NEW.txt
19-1414_p86bNEW.txt
20-157_5i36NEW.txt
20-222_3fbhNEW.txt
20-297_3ea4NEW.txt
20-512_g314NEW.txt
142-orig_2_3ebhNEW.txt
19-1155_6537NEW.txt
20-18_986bNEW.txt
19-1434_e1p3NEW.txt
19-1257_1b7dNEW.txt
19-1442_9o6bNEW.txt
19-897_l537NEW.txt
19-968_6kh7NEW.txt
19-508_3f14NEW.txt
19-1231_9ol1NEW.txt
19-1189_k53mNEW.txt
20-366_7lhoNEW.txt
19-783_2d8fNEW.txt
19-416_6k47NEW.txt
19-930_c07eNEW.txt
19-5807_i4djNEW.txt
18-1447_apl1NEW.txt
19-351_d0fiNEW.txt
19-511_l537NEW.txt
19-963_2c8fNEW.txt
19-422_4gdjNEW.txt
19-547_c07dNEW.txt
19-199_m6hnNEW.txt
18-1259_e2p3NEW.txt
19-5410_8n59NEW.txt
19-123_o758NEW.txt
19-863_k5gmNEW.txt
19-546_2d9gNEW.txt
19-840_1a72NEW.txt
19-309_4425NEW.txt
65-orig_7l48NEW.txt
18-540_8njqNEW.txt
19-71_e2q3NEW.txt
18-956_2dp3NEW.txt
19-368_m648NEW.txt
19-108_e1p3NEW.txt
19-357_2b35NEW.txt
19-292_5hdkNEW.txt
19-438_q713NEW.txt


In [74]:
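# First sample: open each transcript and look for docket-like strings
all_text = []

for text in text_list:
    string_1 = "/Users/richardabbey/Desktop/Homeworks/Final_Project/"
    string_open = string_1 + text
    print(string_open)
    # Open the file before reading it; the context manager closes it afterwards
    with open(string_open, 'r') as f:
        transcript = f.read()
    all_text.append(transcript)
    all_docs = re.findall(r"\d\d-.*", transcript, re.M)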

/Users/richardabbey/Desktop/Homeworks/Final_Project/20-107_n758NEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/19-1414_p86bNEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/20-157_5i36NEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/20-222_3fbhNEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/20-297_3ea4NEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/20-512_g314NEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/142-orig_2_3ebhNEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/19-1155_6537NEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/20-18_986bNEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/19-1434_e1p3NEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/19-1257_1b7dNEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/19-1442_9o6bNEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/19-897_l537NEW.txt
/Users/richardabbey/Desktop/Homeworks/Final_Project/19-968_6kh7NEW.tx

**How in the world are you going to clean this up?**
Take a close look and think first about what needs to be removed, and then about what needs to be isolated. You'll probably need a combination of regular expressions (especially `re.sub()`, which is a regex replace) and simple splits, where you split the text at a given point and keep just the part of the text that you want. If you want to figure this out on your own, don't read any further--if you're starting to get stuck, go a few cells down and follow my hints.

Also take a look at the hint below--it might come in very handy...


In [26]:
#A note on regex splits:
# look at the difference between regex1 regex2
#A split using groups keeps the groups!!!!

string = "Tomorrow and tomorrow and tomorrow"
regex1 = r"and" #not grouped
regex2 = r"(and)" #grouped
re.split(regex1,string)

['Tomorrow ', ' tomorrow ', ' tomorrow']
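
For comparison, here is a short sketch of the grouped version (`regex2`) run on the same string. With a capturing group, the delimiters are kept as their own list elements.

```python
import re

string = "Tomorrow and tomorrow and tomorrow"
regex2 = r"(and)"  # grouped

# Because the pattern is wrapped in a group, each "and" stays in the result.
parts = re.split(regex2, string)
print(parts)
# ['Tomorrow ', 'and', ' tomorrow ', 'and', ' tomorrow']
```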

In [236]:
# cleaning a sample_transcript (assumes sample_transcript holds the text of one file)

# 1. Remove the per-page footer and header lines
cleaner = re.sub(r"Heritage Reporting Corporation[\s\d]+", " ", sample_transcript)
cleanerer = re.sub(r"[\s\d]+Official - Subject to Final Review", " ", cleaner)
# 2. Chop off everything after the closing phrase (the group keeps the phrase itself)
regex5 = r"(The case is submitted\.)"
no_footer = re.split(regex5, cleanerer)
almost_clean = no_footer[0] + no_footer[1]
# 3. Chop off everything before the proceedings begin
regex6 = r"P R O C E E D I N G S"
almost_cleaner = re.split(regex6, almost_clean)
new_clean = almost_cleaner[1]
# 4. Keep only the text before the first match of a pattern like " A: B:"
regex8 = r"\s[A-Z][:]\s[A-Z][:]"
newer_clean = re.split(regex8, new_clean)
also_clean = newer_clean[0]
# 5. Split on speaker names; the group keeps each speaker in the result
regex9 = r"([A-Z][A-Z .\-']+)[:]"
cleaner2 = re.split(regex9, also_clean)
cleaner2



['\n\n (11:15 a.m.)\n\n ',
 'CHIEF JUSTICE ROBERTS',
 " We'll hear\n\n argument next in Case 19-71, Tanzin versus\n\n Tanvir.\n\n Mr. Kneedler.\n\n ORAL ARGUMENT OF EDWIN S. KNEEDLER\n\n ON BEHALF OF THE PETITIONERS\n\n ",
 'MR. KNEEDLER',
 "  Mr. Chief Justice, and \n\nmay it please the Court: \n\nIn enacting RFRA, Congress did not \n\nsubject federal employees throughout the \n\ngovernment to a new cause of action for damages \n\nin their personal capacity. \n\nRFRA's remedy section provides only \n\nfor appropriate relief against the government. \n\nDamages against an individual employee in his \n\npersonal capacity are not relief against the \n\ngovernment. \n\nAt the same time, where a suit is \n\nbrought against the federal government, \n\nincluding against a federal official in his \n\nofficial capacity, as RFRA provides for, money \n\ndamages are not appropriate relief. \n\nPrior to this Court's decision in  \n\n Smith and the passage of RFRA, injunctive relief \n\nagainst a fe

In [357]:
#Cleaning all text files in a folder

res = []

for text in text_list:

    # Read one transcript; the context manager closes the file afterwards
    with open('/Users/richardabbey/Desktop/Homeworks/Final_Project/txt_files/' + text, 'r') as f:
        transcript = f.read()
    # The first docket-like string (two digits and a hyphen) identifies the case
    all_dockets = re.findall(r"\d\d-.*", transcript, re.M)
    docket = all_dockets[0]
    transcript = re.sub(r"\n", " ", transcript)
    #1. Heritage company stuff, and numbers
    clean_transcript = re.sub('Heritage Reporting Corporation', '', transcript)
    clean_transcript = re.sub(r" \d+ ", " ", clean_transcript)
    #2. Chop off the beginning before the dialogue begins
    clean_transcript = clean_transcript.split(".m.) ")
    clean_transcript = clean_transcript[1]
    #3. Chop off the end after the dialogue ends
    clean_transcript = clean_transcript.split("The case is submitted.)")
    clean_transcript = clean_transcript[0]
    #4. Split on speaker labels; the group keeps each label in the list.
    #   The class has no ".", so a title such as "MR." stays at the end of the previous speech.
    clean_text = re.split(r'([A-Z ]+:)', clean_transcript)
    clean_text.pop(0)
    #5. After the pop, even positions are speakers and odd positions are what they said
    speakers = clean_text[::2]
    speeches = clean_text[1::2]

    final_list = list(zip(speakers, speeches))

    df = pd.DataFrame(final_list, columns=['speaker', 'text'])
    df['docket'] = docket

    res.append(df)

appended_data = pd.concat(res)
appended_data

df2

Unnamed: 0,date,docket,case,link
0,04/19/21,20-543,Yellen v. Confederated Tribes of Chehalis Rese...,../argument_transcripts/2020/20-543_hgci.pdf
1,04/19/21,20-315,Santos Sanchez v. Mayorkas,../argument_transcripts/2020/20-315_l647.pdf
2,04/20/21,19-8709,Greer v. United States,../argument_transcripts/2020/19-8709_5hek.pdf
3,04/20/21,20-444,United States v. Gary,../argument_transcripts/2020/20-444_5i26.pdf
4,04/21/21,20-334,"San Antonio v. Hotels.com, L.P.",../argument_transcripts/2020/20-334_p86b.pdf
5,04/21/21,20-440,"Minerva Surgical, Inc. v. Hologic, Inc.",../argument_transcripts/2020/20-440_k5fm.pdf
6,04/26/21,19-251,Americans for Prosperity Foundation v. Bonta,../argument_transcripts/2020/19-251_h3ci.pdf
7,04/26/21,20-382,Guam v. United States,../argument_transcripts/2020/20-382_4f14.pdf
8,04/27/21,20-472,"Hollyfrontier Cheyenne Refining, LLC v. Renewa...",../argument_transcripts/2020/20-472_bp7c.pdf
9,04/27/21,20-437,United States v. Palomar-Santiago,../argument_transcripts/2020/20-437_n758.pdf


In [343]:
#Merging my dataframes to have all the information from two sites
#Merging on a common identifier which is docket

edited_list = pd.merge(df3, appended_data, on='docket')

In [344]:
#Merging my dataframes to have all the information from two sites
#Merging on a common identifier which is docket

edited_list = pd.merge(edited_list, df2, on='docket')

In [375]:
edited_list

Unnamed: 0,title,docket,speaker,text,date,case,link
0,Arizona’s challenged voting regulations govern...,19-1257,CHIEF JUSTICE ROBERTS:,We will hear argument this morning in Case ...,03/02/21,Brnovich v. Democratic National Committee,../argument_transcripts/2020/19-1257_1b7d.pdf
1,Arizona’s challenged voting regulations govern...,19-1257,CARVIN:,"Mr. Chief Justice, and may it please the Co...",03/02/21,Brnovich v. Democratic National Committee,../argument_transcripts/2020/19-1257_1b7d.pdf
2,Arizona’s challenged voting regulations govern...,19-1257,CHIEF JUSTICE ROBERTS:,Mr. -- MR.,03/02/21,Brnovich v. Democratic National Committee,../argument_transcripts/2020/19-1257_1b7d.pdf
3,Arizona’s challenged voting regulations govern...,19-1257,CARVIN:,-- and notable --,03/02/21,Brnovich v. Democratic National Committee,../argument_transcripts/2020/19-1257_1b7d.pdf
4,Arizona’s challenged voting regulations govern...,19-1257,CHIEF JUSTICE ROBERTS:,"-- Mr. Carvin, as I understand your test as...",03/02/21,Brnovich v. Democratic National Committee,../argument_transcripts/2020/19-1257_1b7d.pdf
...,...,...,...,...,...,...,...
7519,Because plaintiffs have not shown standing and...,20-366,CHIEF JUSTICE ROBERTS:,"A minute to wrap up, Mr. Ho. MR.",11/30/20,Trump v. New York,../argument_transcripts/2020/20-366_7lho.pdf
7520,Because plaintiffs have not shown standing and...,20-366,HO:,"In closing, Your Honors, no court, no Congr...",11/30/20,Trump v. New York,../argument_transcripts/2020/20-366_7lho.pdf
7521,Because plaintiffs have not shown standing and...,20-366,CHIEF JUSTICE ROBERTS:,"Thank you, counsel. Rebuttal, General Wal...",11/30/20,Trump v. New York,../argument_transcripts/2020/20-366_7lho.pdf
7522,Because plaintiffs have not shown standing and...,20-366,WALL ON BEHALF OF THE APPELLANTS GENE...,"Thank you, Mr. Chief Justice. So, as I t...",11/30/20,Trump v. New York,../argument_transcripts/2020/20-366_7lho.pdf


In [350]:
# neworder = ['date','docket','case','speaker','text', 'link']
# edited_list = edited_list.reindex(columns=neworder, rese)

edited_list.to_csv('Supreme_Court.csv',
          columns=['date','docket','case','speaker', 'text', 'link'],  index=False)


In [359]:
#Opening the master dataframe

df = pd.read_csv("Supreme_Court.csv")

In [374]:
df

Unnamed: 0,date,docket,case,speaker,text,link
0,03/02/21,19-1257,Brnovich v. Democratic National Committee,CHIEF JUSTICE ROBERTS:,We will hear argument this morning in Case ...,../argument_transcripts/2020/19-1257_1b7d.pdf
1,03/02/21,19-1257,Brnovich v. Democratic National Committee,CARVIN:,"Mr. Chief Justice, and may it please the Co...",../argument_transcripts/2020/19-1257_1b7d.pdf
2,03/02/21,19-1257,Brnovich v. Democratic National Committee,CHIEF JUSTICE ROBERTS:,Mr. -- MR.,../argument_transcripts/2020/19-1257_1b7d.pdf
3,03/02/21,19-1257,Brnovich v. Democratic National Committee,CARVIN:,-- and notable --,../argument_transcripts/2020/19-1257_1b7d.pdf
4,03/02/21,19-1257,Brnovich v. Democratic National Committee,CHIEF JUSTICE ROBERTS:,"-- Mr. Carvin, as I understand your test as...",../argument_transcripts/2020/19-1257_1b7d.pdf
...,...,...,...,...,...,...
7519,11/30/20,20-366,Trump v. New York,CHIEF JUSTICE ROBERTS:,"A minute to wrap up, Mr. Ho. MR.",../argument_transcripts/2020/20-366_7lho.pdf
7520,11/30/20,20-366,Trump v. New York,HO:,"In closing, Your Honors, no court, no Congr...",../argument_transcripts/2020/20-366_7lho.pdf
7521,11/30/20,20-366,Trump v. New York,CHIEF JUSTICE ROBERTS:,"Thank you, counsel. Rebuttal, General Wal...",../argument_transcripts/2020/20-366_7lho.pdf
7522,11/30/20,20-366,Trump v. New York,WALL ON BEHALF OF THE APPELLANTS GENE...,"Thank you, Mr. Chief Justice. So, as I t...",../argument_transcripts/2020/20-366_7lho.pdf


### Cleaning comes first

A step-by-step way of cleaning up this mess.

Step 1. You might notice that every page has:

`Heritage Reporting Corporation`

`Official 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25`

(Note: in earlier years it was:

`Alderson Reporting Company`

`Official - Subject to Final Review`

If you choose to transform arguments from earlier years, please Slack me and I will send you the instructions for earlier versions of these PDFs.)

You want to get rid of that. I would use a regex `re.sub()`.
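
A minimal sketch of that substitution follows. The `page` string is a made-up fragment imitating one page break, and the pattern is only one possible choice; check it against your own files, since `[\s\d]*` would also swallow a number that legitimately starts the next line.

```python
import re

# A made-up fragment imitating one page break in a transcript.
page = (
    "the government to a new cause of action\n"
    "Heritage Reporting Corporation\n"
    "Official 3 1 2 3 4 5\n"
    "for damages in their personal capacity."
)

# Remove the reporter's name plus the "Official" line and its run of line numbers.
pattern = r"Heritage Reporting Corporation\s+Official[\s\d]*"
cleaned = re.sub(pattern, "", page)
print(cleaned)
```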

Steps 2 and 3. **Chop off the beginning / chop off the end**: now it would be very helpful to get rid of all of the text that comes before the arguments begin, and all the text that comes after the arguments end (each transcript has a really annoying index at the end that you don't want to be searching through). Look for words or phrases that appear only at the beginning and at the end of the arguments. The easiest way to isolate the arguments is to do a simple split() on one of those phrases, and keep the half of the split you want. (Am I being too cryptic here? A good split should give you a list with two elements when you want to keep one of them.) Think about it and email me.
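
Here is a small sketch of the two splits on a made-up miniature transcript. The marker phrases are the ones that appear in the 2020 transcripts shown earlier in this notebook; treat them as assumptions and confirm them in your own files.

```python
# A made-up miniature transcript: front matter, dialogue, then an index.
text = (
    "Washington, D.C. APPEARANCES: ... P R O C E E D I N G S (11:15 a.m.) "
    "CHIEF JUSTICE ROBERTS: We'll hear argument next in Case 19-71. "
    "MR. KNEEDLER: Mr. Chief Justice, and may it please the Court: "
    "CHIEF JUSTICE ROBERTS: Thank you, counsel. The case is submitted. "
    "(Whereupon ...) a 12:3 able 4:5"
)

start_marker = "P R O C E E D I N G S"
end_marker = "The case is submitted."

# split(marker, 1) returns [before, after]; keep the half after the start marker.
after_start = text.split(start_marker, 1)[1]
# Split again on the end marker and keep the half before it.
dialogue = after_start.split(end_marker, 1)[0].strip()
print(dialogue)
```

Passing `1` as the second argument guarantees at most two pieces, even if the phrase happens to occur again later in the text.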

Try to get these 3 cleaning actions to work step-by-step in the 4 cells below. As you go, I would assign each cleaner version of the text to a new variable. 

In [None]:
#Check your new variable to make sure it is clean

### Get your dialogue list
Now this transcription should be clean enough to get a list with every speaker, and what the speaker said. The pattern for the speakers is fairly obvious--my recommendation is to do a split using groups (like the example I show above with "tomorrow and tomorrow").

If you write your regular expression correctly, you should get a single list in which each element is either a speaker or what was said.
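
A sketch of such a grouped split on a short, hand-typed stretch of dialogue is shown below. The speaker pattern is the one used in the sample-cleaning cell earlier in this notebook; whether it catches every speaker label in every transcript is an assumption you should test.

```python
import re

dialogue = (
    "CHIEF JUSTICE ROBERTS: We'll hear argument next. "
    "MR. KNEEDLER: Mr. Chief Justice, and may it please the Court: "
    "JUSTICE SOTOMAYOR: So how do you deal with Chambers?"
)

# A speaker label: capitals, spaces, periods, hyphens or apostrophes, ending in a colon.
speaker_pattern = r"([A-Z][A-Z .\-']+):"

# The group keeps each speaker in the list; [1:] drops whatever precedes the first speaker.
pieces = [p.strip() for p in re.split(speaker_pattern, dialogue)[1:]]
print(pieces)
```

Note that "Court:" is not mistaken for a speaker, because the pattern requires every character before the colon to be a capital letter or one of the listed punctuation marks.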

In [None]:
#get a list of speaker and speech

### Make it a list of pairs
If you got your list the way I recommended, it is just a single list with one element after another. You need to figure out how to change it so that you pair each speaker with what was said. Give it some thought; there are a few ways to do this. If you made it this far, you're doing great!
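
One of those ways, sketched on a hand-typed flat list: slice out the even and odd positions and zip them together. This assumes the list starts with a speaker and strictly alternates.

```python
flat = [
    "MR. BERGERON", "Yes. That's essentially the same thing",
    "JUSTICE SOTOMAYOR", "So how do you deal with Chambers?",
]

# Even positions hold speakers, odd positions hold what they said.
pairs = list(zip(flat[0::2], flat[1::2]))
print(pairs)
```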

### Loop through all texts
If you made it this far--congratulations! 
The only thing left is to set up a loop that goes through all the texts and runs the cleanup and parsing on each one. You will need to have completed Step 1 in order to do this loop, because you will need the names of the PDFs. (Each final list should also contain the PDF name, so you can reference it from your case database.)
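
A rough outline of such a loop follows, assuming the cleaned text files sit together in one folder. `parse_transcript` and `parse_folder` are hypothetical helper names, and the speaker pattern is the one from the sample-cleaning cell earlier in this notebook.

```python
import re
from pathlib import Path

SPEAKER = re.compile(r"([A-Z][A-Z .\-']+):")


def parse_transcript(text):
    """Return (speaker, speech) pairs from already-cleaned transcript text."""
    # Drop whatever precedes the first speaker, then pair up the alternating pieces.
    pieces = [p.strip() for p in SPEAKER.split(text)[1:]]
    return list(zip(pieces[0::2], pieces[1::2]))


def parse_folder(folder):
    """Parse every .txt file in a folder, tagging each pair with the file's base name."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for speaker, speech in parse_transcript(text):
            rows.append((path.stem, speaker, speech))
    return rows
```

Calling `parse_folder` with the path to your own folder of text files returns one flat list of `(file name, speaker, speech)` tuples, and the file name is the key back to your case data.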

In [None]:
# you could try here--Or email me with questions...