## Federal Courts Project Guide

The ultimate goal of this project is to build a centralized database of federal judgeships across the 13 federal courts of appeals and/or the 94 district courts in the United States. Because so much data is involved, and because much of it is scattered across many pages and sites, the first step is to research the domain and decide on the focus and range of data you want to obtain and make available.

Here are three possible angles:

1. Current judgeships, vacancies, and nomination proceedings: with this focus you would download tables of the recent vacancies and appointments, and go further into nomination procedures and Q&As. This would entail a combination of scraping, converting PDFs to text, and using regular expressions to parse the converted text (this is tough).

2. Historical judgeships: with this focus you examine changes in federal judgeships over a certain period of time (perhaps 10 to 20 years). This would entail mainly the scraping of many pages and the integration of data about specific judges, ordered by district.

3. Recent nominations and confirmations: this would focus specifically on judges newly nominated or appointed under the current administration. The focus would be more directly on the nomination hearings (Q&As), as well as the search for other data sources regarding the judges: news articles, opinions, and writings by the judges.



Your primary goal by Thursday is to come up with a specific research question: what kind of knowledge do you want to investigate, build, and make available through this project? What are the central units of analysis? What do you want to reveal about the federal courts?

Your secondary goal is to view the primary source pages and begin scraping. You do not have to have your central research question right at the beginning of the scraping, but it may help to have a direction.

Your goal by the end of the weekend (Dec 2 or 3) is to have a finalized architecture for your dataframe(s) and a finalized list of sources that you will scrape or otherwise obtain.

**Data Architecture**
The question of architecture is central to this project. Because of the many possible angles, and the highly decentralized state of the primary source data, there is a wide range of designs for tables, rows, columns. You may want to begin scraping some of the main pages to get more familiar with what kind of rows and columns might be involved.
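
One possible layout, sketched here under assumptions, is a single "seat events" table that stacks vacancy rows and confirmation rows and marks where each row came from. The column names mirror the tables on uscourts.gov; the `Status` column is a hypothetical addition, and the two rows are only illustrative.

```python
import pandas as pd

# Illustrative rows only; real rows would come from the scraped tables.
vacancies = pd.DataFrame([
    {"Court": "01 - CCA", "Incumbent": "Howard,Jeffrey R.",
     "Vacancy Date": "03/31/2022", "Nominee": "Aframe,Seth Robert"},
])
confirmations = pd.DataFrame([
    {"Court": "01 - CCA", "Incumbent": "Lynch,Sandra L.",
     "Vacancy Date": "12/31/2022", "Nominee": "Rikelman,Julie",
     "Confirmation Date": "06/20/2023"},
])

# "Status" is a hypothetical column that tells the two sources apart.
vacancies["Status"] = "vacant"
confirmations["Status"] = "confirmed"

# Stack both tables; columns missing from one source are filled with NaN.
seats = pd.concat([vacancies, confirmations], ignore_index=True)
print(seats[["Court", "Incumbent", "Nominee", "Status"]])
```

A stacked table like this keeps one row per seat event; a design with separate judge and court tables joined on a key would be an equally reasonable choice.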

**Interpretive architecture**
This depends on how focused your data frame will be. If you pick specific districts, judges and/or confirmation hearings, you may want to do more human reading to assess different ways of framing the political or legal perspective of the judge or of the district's decisions. If you choose to cast a wider net for your data, then you will want to focus on more quantitative categories: judge's age, district, background, length of appointment, length of vacancy, number of vacancies, etc.
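
As a rough illustration of the quantitative route, the sketch below derives two of those categories (length of vacancy and number of vacancies per court) with pandas. The three rows are copied from the vacancy table; `as_of` is an assumed reference date and `Days Vacant` is a made-up column name.

```python
import pandas as pd

# Sample rows from the vacancy table; a real run would use the scraped data.
df_sample = pd.DataFrame({
    "Court": ["01 - CCA", "03 - PA-E", "03 - PA-E"],
    "Vacancy Date": ["03/31/2022", "12/31/2021", "11/27/2023"],
})

# The site writes dates as MM/DD/YYYY.
df_sample["Vacancy Date"] = pd.to_datetime(df_sample["Vacancy Date"], format="%m/%d/%Y")

# "as_of" is an assumed reference date; use the date you scraped on.
as_of = pd.Timestamp("2023-12-05")
df_sample["Days Vacant"] = (as_of - df_sample["Vacancy Date"]).dt.days

# Number of open seats and the longest-running vacancy for each court.
summary = df_sample.groupby("Court")["Days Vacant"].agg(["count", "max"])
print(summary)
```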



### Coding considerations:
While there is a great amount of data available, much of it is distributed across multiple pages, sometimes in inconsistent formats. If you're interested in scraping nominations and downloading PDFs, you may want to consider using **selenium** for part of it. If you want to use Beautiful Soup, you will have to collect links and then loop through multiple pages to get a complete data set, unless your focus is more specific.
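
The "collect links, then loop" pattern could look roughly like the sketch below. The function names `collect_links` and `crawl`, the `keyword` filter, and the example.org URLs are all hypothetical; `fetch` is any function that returns HTML for a URL, which lets the demo run without touching the network.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_links(html, base_url, keyword):
    """Return absolute URLs of links whose visible text contains `keyword`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        if keyword.lower() in a.get_text().lower():
            links.append(urljoin(base_url, a["href"]))
    return links

def crawl(start_url, keyword, fetch):
    """Fetch the index page, then every matching linked page; return {url: html}."""
    index_html = fetch(start_url)
    pages = {}
    for url in collect_links(index_html, start_url, keyword):
        pages[url] = fetch(url)
    return pages

# Stand-in for a real site so the example runs offline.
fake_site = {
    "https://example.org/hearings": '<a href="/h/1">Nominations</a><a href="/about">About</a>',
    "https://example.org/h/1": "<p>hearing page</p>",
}
pages = crawl("https://example.org/hearings", "nominations", fake_site.__getitem__)
print(list(pages))
```

For a live scrape you might pass `fetch=lambda url: requests.get(url).text` and add a short `time.sleep` between requests to avoid hammering the server.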

### STEP 1
Scrape the first page of judicial vacancies:

http://www.uscourts.gov/judges-judgeships/judicial-vacancies/current-judicial-vacancies

In [4]:
###Import your scraping libraries
import requests
from bs4 import BeautifulSoup

In [5]:
import pandas as pd
from playwright.async_api import async_playwright

In [6]:
###Write your scraping code here
my_url = "https://www.uscourts.gov/judges-judgeships/judicial-vacancies/current-judicial-vacancies"
raw_html = requests.get(my_url).content

In [7]:
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(soup_doc.prettify())

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!-->
<html dir="ltr" lang="en" prefix="og: https://ogp.me/ns#">
 <!--<![endif]-->
 <head>
  <!--[if IE]><![endif]-->
  <link href="//fonts.gstatic.com" rel="dns-prefetch"/>
  <link crossorigin="" href="//fonts.gstatic.com" rel="preconnect"/>
  <link href="//themes.googleusercontent.com" rel="dns-prefetch"/>
  <link crossorigin="" href="//themes.googleusercontent.com" rel="preconnect"/>
  <link href="//maps.googleapis.com" rel="dns-prefetch"/>
  <link href="//maps.googleapis.com" rel="preconnect"/>
  <meta charset="utf-8"/>
  <link href="//www.uscourts.gov/profiles/uscourts/themes/usczen/favicon.ico" rel="shortcut icon" t

In [8]:
soup_doc.find_all('tbody')

[<tbody>
 <tr><td data-th="Court:">01 - CCA</td><td data-th="Incumbent:">Howard,Jeffrey R.</td><td data-th="Vacancy Reason:">Senior</td><td data-th="Vacancy Date:">03/31/2022</td><td data-th="Nominee:">Aframe,Seth Robert</td><td data-th="Nomination Date:">10/04/2023</td></tr><tr><td data-th="Court:">02 - CT</td><td data-th="Incumbent:">Merriam,Sarah A. L.</td><td data-th="Vacancy Reason:">Elevated</td><td data-th="Vacancy Date:">09/15/2022</td><td data-th="Nominee:">Russell,Sarah French</td><td data-th="Nomination Date:">10/04/2023</td></tr><tr><td data-th="Court:">02 - NY-S</td><td data-th="Incumbent:">Gardephe,Paul G.</td><td data-th="Vacancy Reason:">Senior</td><td data-th="Vacancy Date:">08/09/2023</td><td data-th="Nominee:"></td><td data-th="Nomination Date:"></td></tr><tr><td data-th="Court:">02 - NY-W</td><td data-th="Incumbent:">Geraci Jr.,Frank Paul</td><td data-th="Vacancy Reason:">Senior</td><td data-th="Vacancy Date:">04/01/2023</td><td data-th="Nominee:">Holland,Colleen Da

In [9]:
all_rows = soup_doc.find_all("td")
all_rows

[<td data-th="Court:">01 - CCA</td>,
 <td data-th="Incumbent:">Howard,Jeffrey R.</td>,
 <td data-th="Vacancy Reason:">Senior</td>,
 <td data-th="Vacancy Date:">03/31/2022</td>,
 <td data-th="Nominee:">Aframe,Seth Robert</td>,
 <td data-th="Nomination Date:">10/04/2023</td>,
 <td data-th="Court:">02 - CT</td>,
 <td data-th="Incumbent:">Merriam,Sarah A. L.</td>,
 <td data-th="Vacancy Reason:">Elevated</td>,
 <td data-th="Vacancy Date:">09/15/2022</td>,
 <td data-th="Nominee:">Russell,Sarah French</td>,
 <td data-th="Nomination Date:">10/04/2023</td>,
 <td data-th="Court:">02 - NY-S</td>,
 <td data-th="Incumbent:">Gardephe,Paul G.</td>,
 <td data-th="Vacancy Reason:">Senior</td>,
 <td data-th="Vacancy Date:">08/09/2023</td>,
 <td data-th="Nominee:"></td>,
 <td data-th="Nomination Date:"></td>,
 <td data-th="Court:">02 - NY-W</td>,
 <td data-th="Incumbent:">Geraci Jr.,Frank Paul</td>,
 <td data-th="Vacancy Reason:">Senior</td>,
 <td data-th="Vacancy Date:">04/01/2023</td>,
 <td data-th="No

In [10]:
all_rows = soup_doc.find_all("td")
list_rows = []
for row in all_rows:
    text_content = row.get_text(strip=True)
    list_rows.append(text_content)
print(list_rows)

['01 - CCA', 'Howard,Jeffrey R.', 'Senior', '03/31/2022', 'Aframe,Seth Robert', '10/04/2023', '02 - CT', 'Merriam,Sarah A. L.', 'Elevated', '09/15/2022', 'Russell,Sarah French', '10/04/2023', '02 - NY-S', 'Gardephe,Paul G.', 'Senior', '08/09/2023', '', '', '02 - NY-W', 'Geraci Jr.,Frank Paul', 'Senior', '04/01/2023', 'Holland,Colleen Danielle', '09/11/2023', '03 - CCA', 'Greenaway Jr.,Joseph A.', 'Retired', '06/15/2023', 'Mangi,Adeel Abdullah', '11/27/2023', '03 - NJ', 'McNulty,Kevin', 'Senior', '10/31/2023', 'Kiel,Edward Sunyol', '10/04/2023', '03 - PA-E', 'Rufe,Cynthia M.', 'Senior', '12/31/2021', '', '', '03 - PA-E', 'Smith,Edward G.', 'Deceased', '11/27/2023', '', '', '03 - PA-M', 'Jones III,John E.', 'Retired', '08/01/2021', 'Mehalchick,Karoline', '07/11/2023', '04 - CCA', 'Motz,Diana Gribbon', 'Senior', '09/30/2022', 'Berner,Nicole G.', '11/27/2023', '04 - NC-W', 'Conrad Jr.,Robert James', 'Senior', '05/17/2023', '', '', '04 - SC', 'Childs,J. Michelle', 'Elevated', '07/19/2022', 

In [11]:
keys = ['Court', 'Incumbent', 'Vacancy Reason', 'Vacancy Date', 'Nominee', 'Nomination date']
list_of_dicts = []

# Iterate over the data list and create dictionaries
for i in range(0, len(list_rows), len(keys)):
    # Create a dictionary for each set of elements based on keys
    item = dict(zip(keys, list_rows[i:i + len(keys)]))
    list_of_dicts.append(item)

list_of_dicts

[{'Court': '01 - CCA',
  'Incumbent': 'Howard,Jeffrey R.',
  'Vacancy Reason': 'Senior',
  'Vacancy Date': '03/31/2022',
  'Nominee': 'Aframe,Seth Robert',
  'Nomination date': '10/04/2023'},
 {'Court': '02 - CT',
  'Incumbent': 'Merriam,Sarah A. L.',
  'Vacancy Reason': 'Elevated',
  'Vacancy Date': '09/15/2022',
  'Nominee': 'Russell,Sarah French',
  'Nomination date': '10/04/2023'},
 {'Court': '02 - NY-S',
  'Incumbent': 'Gardephe,Paul G.',
  'Vacancy Reason': 'Senior',
  'Vacancy Date': '08/09/2023',
  'Nominee': '',
  'Nomination date': ''},
 {'Court': '02 - NY-W',
  'Incumbent': 'Geraci Jr.,Frank Paul',
  'Vacancy Reason': 'Senior',
  'Vacancy Date': '04/01/2023',
  'Nominee': 'Holland,Colleen Danielle',
  'Nomination date': '09/11/2023'},
 {'Court': '03 - CCA',
  'Incumbent': 'Greenaway Jr.,Joseph A.',
  'Vacancy Reason': 'Retired',
  'Vacancy Date': '06/15/2023',
  'Nominee': 'Mangi,Adeel Abdullah',
  'Nomination date': '11/27/2023'},
 {'Court': '03 - NJ',
  'Incumbent': 'McNul

In [12]:
import pandas as pd
df = pd.DataFrame(list_of_dicts)
df.head()

Unnamed: 0,Court,Incumbent,Vacancy Reason,Vacancy Date,Nominee,Nomination date
0,01 - CCA,"Howard,Jeffrey R.",Senior,03/31/2022,"Aframe,Seth Robert",10/04/2023
1,02 - CT,"Merriam,Sarah A. L.",Elevated,09/15/2022,"Russell,Sarah French",10/04/2023
2,02 - NY-S,"Gardephe,Paul G.",Senior,08/09/2023,,
3,02 - NY-W,"Geraci Jr.,Frank Paul",Senior,04/01/2023,"Holland,Colleen Danielle",09/11/2023
4,03 - CCA,"Greenaway Jr.,Joseph A.",Retired,06/15/2023,"Mangi,Adeel Abdullah",11/27/2023


In [13]:
df.to_csv("judicial_vacancies.csv", index=False)
df.head()

Unnamed: 0,Court,Incumbent,Vacancy Reason,Vacancy Date,Nominee,Nomination date
0,01 - CCA,"Howard,Jeffrey R.",Senior,03/31/2022,"Aframe,Seth Robert",10/04/2023
1,02 - CT,"Merriam,Sarah A. L.",Elevated,09/15/2022,"Russell,Sarah French",10/04/2023
2,02 - NY-S,"Gardephe,Paul G.",Senior,08/09/2023,,
3,02 - NY-W,"Geraci Jr.,Frank Paul",Senior,04/01/2023,"Holland,Colleen Danielle",09/11/2023
4,03 - CCA,"Greenaway Jr.,Joseph A.",Retired,06/15/2023,"Mangi,Adeel Abdullah",11/27/2023
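
The fixed-size chunking above works as long as every row has exactly six cells. A hedged alternative is to walk the table row by row and use each cell's `data-th` attribute as the column name, so a change in column count does not misalign the data. `rows_from_table` is a hypothetical helper, and the sample HTML is an abbreviated row in the same shape as the page's markup.

```python
from bs4 import BeautifulSoup

def rows_from_table(html):
    """Return one dict per <tr>, keyed by each cell's data-th label."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # header rows contain <th>, not <td>
            continue
        record = {}
        for td in cells:
            # Labels look like "Court:"; drop the trailing colon.
            label = td.get("data-th", "").rstrip(":").strip()
            record[label] = td.get_text(strip=True)
        records.append(record)
    return records

# Abbreviated sample row in the same shape as the vacancy table.
sample = ('<table><tbody><tr><td data-th="Court:">01 - CCA</td>'
          '<td data-th="Incumbent:">Howard,Jeffrey R.</td>'
          '<td data-th="Nominee:"></td></tr></tbody></table>')
print(rows_from_table(sample))
```

The resulting list of dicts can be passed straight to `pd.DataFrame`, just like `list_of_dicts`.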


### STEP 2
Scrape the first page of judicial confirmations:

http://www.uscourts.gov/judges-judgeships/judicial-vacancies/confirmation-listing


In [145]:
my_url = "http://www.uscourts.gov/judges-judgeships/judicial-vacancies/confirmation-listing"
raw_html = requests.get(my_url).content

In [146]:
soup = BeautifulSoup(raw_html, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!-->
<html dir="ltr" lang="en" prefix="og: https://ogp.me/ns#">
 <!--<![endif]-->
 <head>
  <!--[if IE]><![endif]-->
  <link href="//fonts.gstatic.com" rel="dns-prefetch"/>
  <link crossorigin="" href="//fonts.gstatic.com" rel="preconnect"/>
  <link href="//themes.googleusercontent.com" rel="dns-prefetch"/>
  <link crossorigin="" href="//themes.googleusercontent.com" rel="preconnect"/>
  <link href="//maps.googleapis.com" rel="dns-prefetch"/>
  <link href="//maps.googleapis.com" rel="preconnect"/>
  <meta charset="utf-8"/>
  <link href="//www.uscourts.gov/profiles/uscourts/themes/usczen/favicon.ico" rel="shortcut icon" t

In [147]:
soup.find_all('tbody')

[<tbody>
 <tr><td data-th="Nominee:">Rikelman,Julie</td><td data-th="Nomination Date:">01/03/2023</td><td data-th="Confirmation Date:">06/20/2023</td><td data-th="Court:">01 - CCA</td><td data-th="Incumbent:">Lynch,Sandra L.</td><td data-th="Vacancy Reason:">Senior</td><td data-th="Vacancy Date:">12/31/2022</td></tr><tr><td data-th="Nominee:">Guzman,Margaret R.</td><td data-th="Nomination Date:">01/03/2023</td><td data-th="Confirmation Date:">03/01/2023</td><td data-th="Court:">01 - MA</td><td data-th="Incumbent:">Hillman,Timothy S.</td><td data-th="Vacancy Reason:">Senior</td><td data-th="Vacancy Date:">07/01/2022</td></tr><tr><td data-th="Nominee:">Joun,Myong J.</td><td data-th="Nomination Date:">01/23/2023</td><td data-th="Confirmation Date:">07/12/2023</td><td data-th="Court:">01 - MA</td><td data-th="Incumbent:">O'Toole Jr.,George A.</td><td data-th="Vacancy Reason:">Senior</td><td data-th="Vacancy Date:">01/01/2018</td></tr><tr><td data-th="Nominee:">Kobick,Julia E.</td><td data-

In [148]:
all_rows2 = soup.find_all("td")
all_rows2

[<td data-th="Nominee:">Rikelman,Julie</td>,
 <td data-th="Nomination Date:">01/03/2023</td>,
 <td data-th="Confirmation Date:">06/20/2023</td>,
 <td data-th="Court:">01 - CCA</td>,
 <td data-th="Incumbent:">Lynch,Sandra L.</td>,
 <td data-th="Vacancy Reason:">Senior</td>,
 <td data-th="Vacancy Date:">12/31/2022</td>,
 <td data-th="Nominee:">Guzman,Margaret R.</td>,
 <td data-th="Nomination Date:">01/03/2023</td>,
 <td data-th="Confirmation Date:">03/01/2023</td>,
 <td data-th="Court:">01 - MA</td>,
 <td data-th="Incumbent:">Hillman,Timothy S.</td>,
 <td data-th="Vacancy Reason:">Senior</td>,
 <td data-th="Vacancy Date:">07/01/2022</td>,
 <td data-th="Nominee:">Joun,Myong J.</td>,
 <td data-th="Nomination Date:">01/23/2023</td>,
 <td data-th="Confirmation Date:">07/12/2023</td>,
 <td data-th="Court:">01 - MA</td>,
 <td data-th="Incumbent:">O'Toole Jr.,George A.</td>,
 <td data-th="Vacancy Reason:">Senior</td>,
 <td data-th="Vacancy Date:">01/01/2018</td>,
 <td data-th="Nominee:">Kobick

In [149]:
all_rows2 = soup.find_all("td")
list_rows2 = []
for row2 in all_rows2:
    text_content2 = row2.get_text(strip=True)
    list_rows2.append(text_content2)
print(list_rows2)

['Rikelman,Julie', '01/03/2023', '06/20/2023', '01 - CCA', 'Lynch,Sandra L.', 'Senior', '12/31/2022', 'Guzman,Margaret R.', '01/03/2023', '03/01/2023', '01 - MA', 'Hillman,Timothy S.', 'Senior', '07/01/2022', 'Joun,Myong J.', '01/23/2023', '07/12/2023', '01 - MA', "O'Toole Jr.,George A.", 'Senior', '01/01/2018', 'Kobick,Julia E.', '01/23/2023', '11/07/2023', '01 - MA', 'Young,William G.', 'Senior', '07/01/2021', 'Mendez-Miro,Gina R.', '01/03/2023', '02/14/2023', '01 - PR', 'Cerezo,Carmen Consuelo', 'Retired', '02/28/2021', 'Kahn,Maria Araujo', '01/03/2023', '03/09/2023', '02 - CCA', 'Cabranes,Jose A.', 'Senior', '03/09/2023', 'Oliver,Vernon D.', '05/04/2023', '09/19/2023', '02 - CT', 'Underhill,Stefan R.', 'Senior', '11/01/2022', 'Choudhury,Nusrat Jahan', '01/03/2023', '06/15/2023', '02 - NYE', 'Bianco,Joseph F.', 'Elevated', '05/08/2019', 'Merchant,Orelia Eleta', '01/23/2023', '05/03/2023', '02 - NYE', 'Kuntz II,William Francis', 'Senior', '01/01/2022', 'Merle,Natasha C.', '01/03/2023

In [161]:
keys = ['Nominee', 'Nomination date', 'Confirmation date', 'Court', 'Incumbent', 'Vacancy Reason', 'Vacancy Date']
list_of_dicts2 = []
for i in range(0, len(list_rows2), len(keys)):
    item2 = dict(zip(keys, list_rows2[i:i + len(keys)]))
    list_of_dicts2.append(item2)
list_of_dicts2

[{'Nominee': 'Rikelman,Julie',
  'Nomination date': '01/03/2023',
  'Confirmation date': '06/20/2023',
  'Court': '01 - CCA',
  'Incumbent': 'Lynch,Sandra L.',
  'Vacancy Reason': 'Senior',
  'Vacancy Date': '12/31/2022'},
 {'Nominee': 'Guzman,Margaret R.',
  'Nomination date': '01/03/2023',
  'Confirmation date': '03/01/2023',
  'Court': '01 - MA',
  'Incumbent': 'Hillman,Timothy S.',
  'Vacancy Reason': 'Senior',
  'Vacancy Date': '07/01/2022'},
 {'Nominee': 'Joun,Myong J.',
  'Nomination date': '01/23/2023',
  'Confirmation date': '07/12/2023',
  'Court': '01 - MA',
  'Incumbent': "O'Toole Jr.,George A.",
  'Vacancy Reason': 'Senior',
  'Vacancy Date': '01/01/2018'},
 {'Nominee': 'Kobick,Julia E.',
  'Nomination date': '01/23/2023',
  'Confirmation date': '11/07/2023',
  'Court': '01 - MA',
  'Incumbent': 'Young,William G.',
  'Vacancy Reason': 'Senior',
  'Vacancy Date': '07/01/2021'},
 {'Nominee': 'Mendez-Miro,Gina R.',
  'Nomination date': '01/03/2023',
  'Confirmation date': '02

In [162]:
df = pd.DataFrame(list_of_dicts2)
df.head()

Unnamed: 0,Nominee,Nomination date,Confirmation date,Court,Incumbent,Vacancy Reason,Vacancy Date
0,"Rikelman,Julie",01/03/2023,06/20/2023,01 - CCA,"Lynch,Sandra L.",Senior,12/31/2022
1,"Guzman,Margaret R.",01/03/2023,03/01/2023,01 - MA,"Hillman,Timothy S.",Senior,07/01/2022
2,"Joun,Myong J.",01/23/2023,07/12/2023,01 - MA,"O'Toole Jr.,George A.",Senior,01/01/2018
3,"Kobick,Julia E.",01/23/2023,11/07/2023,01 - MA,"Young,William G.",Senior,07/01/2021
4,"Mendez-Miro,Gina R.",01/03/2023,02/14/2023,01 - PR,"Cerezo,Carmen Consuelo",Retired,02/28/2021


In [163]:
# Extracting last names and first names from Nominee and Incumbent columns
df[['Nominee Last Name', 'Nominee First Name']] = df['Nominee'].str.extract(r'([^,]+),(.+)')
df[['Incumbent Last Name', 'Incumbent First Name']] = df['Incumbent'].str.extract(r'([^,]+),(.+)')
print(df)

                Nominee Nomination date Confirmation date     Court  \
0        Rikelman,Julie      01/03/2023        06/20/2023  01 - CCA   
1    Guzman,Margaret R.      01/03/2023        03/01/2023   01 - MA   
2         Joun,Myong J.      01/23/2023        07/12/2023   01 - MA   
3       Kobick,Julia E.      01/23/2023        11/07/2023   01 - MA   
4   Mendez-Miro,Gina R.      01/03/2023        02/14/2023   01 - PR   
..                  ...             ...               ...       ...   
59       Abudu,Nancy G.      01/03/2023        05/18/2023  11 - CCA   
60      Hadji,Philip S.      06/08/2023        09/21/2023        CL   
61      Silfen,Molly R.      02/27/2023        06/08/2023        CL   
62    Garcia,Bradley N.      01/03/2023        05/15/2023  DC - CCA   
63         Reyes,Ana C.      01/03/2023        02/15/2023   DC - DC   

                 Incumbent Vacancy Reason Vacancy Date Nominee Last Name  \
0          Lynch,Sandra L.         Senior   12/31/2022          Rikelma

In [167]:
new_column_order = [
    'Nominee Last Name', 'Nominee First Name', 'Nominee', 'Nomination date', 'Confirmation date', 'Court',
    'Incumbent Last Name', 'Incumbent First Name', 'Incumbent', 'Vacancy Reason', 'Vacancy Date'
]
# Reindexing DataFrame with new column order
df = df.reindex(columns=new_column_order)
df.head()
#df.info()

Unnamed: 0,Nominee Last Name,Nominee First Name,Nominee,Nomination date,Confirmation date,Court,Incumbent Last Name,Incumbent First Name,Incumbent,Vacancy Reason,Vacancy Date
0,Rikelman,Julie,"Rikelman,Julie",01/03/2023,06/20/2023,01 - CCA,Lynch,Sandra L.,"Lynch,Sandra L.",Senior,12/31/2022
1,Guzman,Margaret R.,"Guzman,Margaret R.",01/03/2023,03/01/2023,01 - MA,Hillman,Timothy S.,"Hillman,Timothy S.",Senior,07/01/2022
2,Joun,Myong J.,"Joun,Myong J.",01/23/2023,07/12/2023,01 - MA,O'Toole Jr.,George A.,"O'Toole Jr.,George A.",Senior,01/01/2018
3,Kobick,Julia E.,"Kobick,Julia E.",01/23/2023,11/07/2023,01 - MA,Young,William G.,"Young,William G.",Senior,07/01/2021
4,Mendez-Miro,Gina R.,"Mendez-Miro,Gina R.",01/03/2023,02/14/2023,01 - PR,Cerezo,Carmen Consuelo,"Cerezo,Carmen Consuelo",Retired,02/28/2021


In [168]:
df.to_csv("confirmation_listings.csv", index=False)
df.head()

Unnamed: 0,Nominee Last Name,Nominee First Name,Nominee,Nomination date,Confirmation date,Court,Incumbent Last Name,Incumbent First Name,Incumbent,Vacancy Reason,Vacancy Date
0,Rikelman,Julie,"Rikelman,Julie",01/03/2023,06/20/2023,01 - CCA,Lynch,Sandra L.,"Lynch,Sandra L.",Senior,12/31/2022
1,Guzman,Margaret R.,"Guzman,Margaret R.",01/03/2023,03/01/2023,01 - MA,Hillman,Timothy S.,"Hillman,Timothy S.",Senior,07/01/2022
2,Joun,Myong J.,"Joun,Myong J.",01/23/2023,07/12/2023,01 - MA,O'Toole Jr.,George A.,"O'Toole Jr.,George A.",Senior,01/01/2018
3,Kobick,Julia E.,"Kobick,Julia E.",01/23/2023,11/07/2023,01 - MA,Young,William G.,"Young,William G.",Senior,07/01/2021
4,Mendez-Miro,Gina R.,"Mendez-Miro,Gina R.",01/03/2023,02/14/2023,01 - PR,Cerezo,Carmen Consuelo,"Cerezo,Carmen Consuelo",Retired,02/28/2021
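
Two further derived columns may be useful for analysis: the circuit prefix of each court code and the number of days between nomination and confirmation. The sketch below uses four rows taken from the confirmation table; `Circuit`, `Court Code`, and `Days to Confirm` are hypothetical column names.

```python
import pandas as pd

# Four rows taken from the confirmation listing above.
sample = pd.DataFrame({
    "Court": ["01 - CCA", "01 - MA", "CL", "DC - DC"],
    "Nomination date": ["01/03/2023", "01/03/2023", "06/08/2023", "01/03/2023"],
    "Confirmation date": ["06/20/2023", "03/01/2023", "09/21/2023", "02/15/2023"],
})

# Split "01 - CCA" into a circuit prefix and a court code. Codes such as
# "CL" have no prefix, so the first group is optional and becomes NaN.
sample[["Circuit", "Court Code"]] = sample["Court"].str.extract(r"^(?:(\w+) - )?(.+)$")

# Parse the MM/DD/YYYY strings so the dates can be subtracted.
for col in ["Nomination date", "Confirmation date"]:
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y")
sample["Days to Confirm"] = (sample["Confirmation date"] - sample["Nomination date"]).dt.days

print(sample[["Court", "Circuit", "Court Code", "Days to Confirm"]])
```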


### STEP 3
Investigate the Senate Judiciary Committee's confirmation postings:

https://www.judiciary.senate.gov/nominations/confirmed

This is relatively straightforward, except that the most interesting information is probably in the PDFs of the questionnaires for each candidate. To get the PDFs you need to use selenium (see step 4), but first look at this data and assess whether you think it will be useful to you. You can then parse the PDFs using regular expressions.

In [4]:
#Don't necessarily code here
#Think about where you're going first
#And read below

### STEP 4
Investigate the Senate Judiciary Committee's hearings on nominees:

https://www.judiciary.senate.gov/hearings

This one is very tricky. It is where you can find PDFs with Q&As from confirmation hearings. It is a multiple-page scrape just to get links to the various nomination pages, which then have links to PDFs, which in turn redirect to the actual PDF downloads (you have to use selenium here).

But before you do the scrape, just go through the hearings pages by hand and click where it says "Nominations". Look at the different Q&As available and see if you think they will be useful to you. If they will be, I can give you most of the code you will need to get the PDFs. Also, I have uploaded a file on Slack with one hearing's PDFs along with text conversions of them. Take a look at the text conversions, because you'll need to parse them using regular expressions.
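
To give a feel for the regular-expression step, here is a minimal sketch that pulls question and response pairs out of converted text. The layout it assumes (a numbered question followed by a line beginning with "Response:") is a guess, and the excerpt is placeholder text, so expect to adjust the pattern once you have looked at the real conversions.

```python
import re

# Placeholder excerpt; the real text conversions may be laid out differently.
text = """1. First placeholder question?
Response: Placeholder answer one.
2. Second placeholder question?
Response: Placeholder answer two,
continued on a second line.
"""

# Assumed pattern: "N. question", then "Response: answer", where the answer
# runs until the next numbered question or the end of the text.
pattern = re.compile(
    r"^(\d+)\.\s+(.*?)\nResponse:\s+(.*?)(?=^\d+\.\s|\Z)",
    re.MULTILINE | re.DOTALL,
)

qa_pairs = [
    {"number": int(num), "question": q.strip(), "response": r.strip()}
    for num, q, r in pattern.findall(text)
]
for pair in qa_pairs:
    print(pair)
```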

If you are interested in more historical data, look into the information on these links:

Archives of vacancies/confirmations (if you want to build more historical data)
http://www.uscourts.gov/judges-judgeships/judicial-vacancies/archive-judicial-vacancies

Present and past judges including resumes:

Appeals courts:
https://www.fjc.gov/history/courts/u.s.-court-appeals-district-columbia-circuit-justices-and-judges

District courts:
https://www.fjc.gov/history/courts/u.s.-district-courts-and-federal-judiciary

In [96]:
#Think about your focus and what your ultimate architecture should be

In [121]:
#Here I'm trying to start scraping the lists that include names with information about their gender and race:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

In [122]:
page = await browser.new_page()

In [123]:
await page.goto("https://www.fjc.gov/history/judges/search/advanced-search")

<Response url='https://www.fjc.gov/history/judges/search/advanced-search' request=<Request url='https://www.fjc.gov/history/judges/search/advanced-search' method='GET'>>

In [108]:
html = await page.content()
doc3 = BeautifulSoup(html, "html.parser")

In [124]:
await page.get_by_text("Personal Characteristics and Background").click()

In [125]:
await page.locator("#edit-gender").select_option('Female')

['8211']

In [126]:
await page.locator("#edit-employment").click()

In [127]:
await page.keyboard.press('Enter')

In [None]:
#More to come...

In [118]:
#Scraping all the judges of the court of appeals
my_url = "https://www.fjc.gov/history/courts/u.s.-court-appeals-first-circuit-judges"
raw_html = requests.get(my_url).content

In [119]:
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(soup_doc.prettify())

<html style="height:100%">
 <head>
  <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3" type="text/javascript">
  </script>
  <script async="" src="/gy-come-him-there-What-weepes-and-be-stion-Watch">
  </script>
 </head>
 <body style="margin:0px;height:100%">
  <iframe frameborder="0" height="100%" id="main-iframe" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?SWUDNSAI=31&amp;xinfo=7-53664637-0%20NNNN%20RT%281701791942508%2067%29%20q%280%20-1%20-1%200%29%20r%280%20-1%29%20B12%2814%2c0%2c0%29%20U24&amp;incident_id=1252001300358031151-341727187497656455&amp;edet=12&amp;cinfo=0e000000eeb0&amp;rpinfo=0&amp;cts=%2f2MOU6xH0dryLiZne3ggy5brifX%2fNfykP0Ty7ju0hFnHHkWeoJHxBUtIsyhVVAyU&amp;mth=GET" width="100%">
   Request unsucce
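
The output above is a bot-protection challenge page rather than the list of judges, so `requests` alone did not reach the content. A small check like the sketch below can flag that before any parsing is attempted. The marker string is taken from the output above; treating its presence as proof of a block is an assumption, and `looks_blocked` is a hypothetical helper.

```python
def looks_blocked(html):
    """Heuristic: True if the response looks like an Incapsula challenge page."""
    if isinstance(html, bytes):  # requests' .content returns bytes
        html = html.decode("utf-8", errors="replace")
    return "_Incapsula_Resource" in html

# Shortened stand-ins for a blocked response and a normal one.
blocked_page = b'<html><script src="/_Incapsula_Resource?SWJIYLWA=..."></script></html>'
normal_page = "<html><body><table><tr><td>Judge</td></tr></table></body></html>"

print(looks_blocked(blocked_page))
print(looks_blocked(normal_page))
```

When the check returns True, loading the page in a real browser session (for example with the Playwright setup used earlier in this notebook) may be needed instead of `requests`.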

In [171]:
#Scraping the links to all the district court websites
my_url = "https://www.uscourts.gov/about-federal-courts/federal-courts-public/court-website-links"
raw_html = requests.get(my_url).content
doc = BeautifulSoup(raw_html, "html.parser")
print(doc.prettify())

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!-->
<html dir="ltr" lang="en" prefix="og: https://ogp.me/ns#">
 <!--<![endif]-->
 <head>
  <!--[if IE]><![endif]-->
  <link href="//fonts.gstatic.com" rel="dns-prefetch"/>
  <link crossorigin="" href="//fonts.gstatic.com" rel="preconnect"/>
  <link href="//themes.googleusercontent.com" rel="dns-prefetch"/>
  <link crossorigin="" href="//themes.googleusercontent.com" rel="preconnect"/>
  <link href="//maps.googleapis.com" rel="dns-prefetch"/>
  <link href="//maps.googleapis.com" rel="preconnect"/>
  <meta charset="utf-8"/>
  <meta content="https://www.uscourts.gov/sites/default/files/styles/lead/public/circuit_map_in_age

In [179]:
all_districts = doc.find_all('td')
print(all_districts)

[<td>
<a href="https://www.almd.uscourts.gov">Alabama Middle</a><br/>
<a href="https://www.alnd.uscourts.gov">Alabama Northern</a><br/>
<a href="https://www.alsd.uscourts.gov">Alabama Southern</a>
</td>, <td style="white-space:nowrap;">
<a href="https://www.almb.uscourts.gov">Alabama Middle</a><br/>
<a href="https://www.alnb.uscourts.gov">Alabama Northern</a><br/>
<a href="https://www.alsb.uscourts.gov">Alabama Southern</a>
</td>, <td>
<a href="https://www.akd.uscourts.gov">Alaska</a>
</td>, <td style="white-space:nowrap;">
<a href="https://www.akb.uscourts.gov">Alaska</a>
</td>, <td>
<a href="https://www.azd.uscourts.gov">Arizona</a>
</td>, <td style="white-space:nowrap;">
<a href="https://www.azb.uscourts.gov">Arizona</a>
</td>, <td>
<a href="https://www.are.uscourts.gov">Arkansas Eastern</a><br/>
<a href="https://www.arwd.uscourts.gov">Arkansas Western</a>
</td>, <td style="white-space:nowrap;">
<a href="https://www.areb.uscourts.gov">Arkansas Eastern &amp; Western</a>
</td>, <td>
<

In [183]:
court_names = []
court_links = []
for district in all_districts:
    court_links.extend([a['href'] for a in district.find_all('a')])
    court_names.extend([a.get_text() for a in district.find_all('a')])

In [184]:
court_names

['Alabama Middle',
 'Alabama Northern',
 'Alabama Southern',
 'Alabama Middle',
 'Alabama Northern',
 'Alabama Southern',
 'Alaska',
 'Alaska',
 'Arizona',
 'Arizona',
 'Arkansas Eastern',
 'Arkansas Western',
 'Arkansas Eastern & Western',
 'California Central',
 'California Eastern',
 'California Northern',
 'California Southern',
 'California Central',
 'California Eastern',
 'California Northern',
 'California Southern',
 'Colorado',
 'Colorado',
 'Connecticut',
 'Connecticut',
 'Delaware',
 'Delaware',
 'District of Columbia',
 'District of Columbia',
 'Florida Middle',
 'Florida Northern',
 'Florida Southern',
 'Florida Middle',
 'Florida Northern',
 'Florida Southern',
 'Georgia Middle',
 'Georgia Northern',
 'Georgia Southern',
 'Georgia Middle',
 'Georgia Northern',
 'Georgia Southern',
 'Guam',
 'Guam',
 'Hawaii',
 'Hawaii',
 'Idaho',
 'Idaho',
 'Illinois Central',
 'Illinois Northern',
 'Illinois Southern',
 'Illinois Central',
 'Illinois Northern',
 'Illinois Southern',
 'I

In [186]:
data = {'Court Name': court_names, 'Court Website': court_links}
df = pd.DataFrame(data)
df = df.drop_duplicates(subset='Court Name', keep='first')
df.reset_index(drop=True, inplace=True)
print(df)

                Court Name                  Court Website
0           Alabama Middle  https://www.almd.uscourts.gov
1         Alabama Northern  https://www.alnd.uscourts.gov
2         Alabama Southern  https://www.alsd.uscourts.gov
3                   Alaska   https://www.akd.uscourts.gov
4                  Arizona   https://www.azd.uscourts.gov
..                     ...                            ...
90  West Virginia Northern  https://www.wvnd.uscourts.gov
91  West Virginia Southern  https://www.wvsd.uscourts.gov
92       Wisconsin Eastern  https://www.wied.uscourts.gov
93       Wisconsin Western  https://www.wiwd.uscourts.gov
94                 Wyoming   https://www.wyd.uscourts.gov

[95 rows x 2 columns]


In [188]:
df.to_csv("All_district_court_names_and_links.csv", index=False)
df.head()

Unnamed: 0,Court Name,Court Website
0,Alabama Middle,https://www.almd.uscourts.gov
1,Alabama Northern,https://www.alnd.uscourts.gov
2,Alabama Southern,https://www.alsd.uscourts.gov
3,Alaska,https://www.akd.uscourts.gov
4,Arizona,https://www.azd.uscourts.gov


In [190]:
df = df[df['Court Name'] != 'Arkansas Eastern & Western']
df.to_csv('All_district_court_names_and_links.csv', index=False)
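
Dropping duplicates by name works here because the district link happens to come first, but it depends on the bankruptcy courts sharing names with the district courts. A hedged alternative is to read only the first cell of each table row. This assumes each row holds a district cell followed by a bankruptcy cell, which matches the alternating `<td>` pattern in the output above but should be checked against the live page. `district_courts` is a hypothetical helper.

```python
from bs4 import BeautifulSoup
import pandas as pd

def district_courts(html):
    """Return a DataFrame of the links found in the first <td> of each row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # skip header rows
            continue
        # Assumption: the first cell lists district courts, the second bankruptcy courts.
        for a in cells[0].find_all("a", href=True):
            records.append({"Court Name": a.get_text(strip=True),
                            "Court Website": a["href"]})
    return pd.DataFrame(records)

# Two rows rebuilt from the cells shown in the output above.
sample = """<table>
<tr><td><a href="https://www.akd.uscourts.gov">Alaska</a></td>
<td style="white-space:nowrap;"><a href="https://www.akb.uscourts.gov">Alaska</a></td></tr>
<tr><td><a href="https://www.are.uscourts.gov">Arkansas Eastern</a><br/>
<a href="https://www.arwd.uscourts.gov">Arkansas Western</a></td>
<td style="white-space:nowrap;"><a href="https://www.areb.uscourts.gov">Arkansas Eastern &amp; Western</a></td></tr>
</table>"""
districts = district_courts(sample)
print(districts)
```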