# Data Fetching: FJC and Congress.gov API

This notebook fetches data from our two primary sources and performs initial processing on it:

1. Federal Judicial Center (FJC) CSV and Excel files
2. Congress.gov API judicial nomination data

According to the project architecture, this notebook will:
1. Download or use cached data from the FJC and Congress.gov API
2. Apply minimal transformations to convert the data to DataFrames
3. Save the resulting dataframes to `data/raw` for further processing by downstream notebooks
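
The "download or use cached data" step above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `load_or_fetch` and `fake_fetch` are hypothetical names, and the column names are made up for the demo. It also shows why date columns need to be re-parsed when a CSV cache is read back, since CSV does not store dtypes.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(
    cache_path: Path,
    fetch: Callable[[], pd.DataFrame],
    date_cols: list[str],
) -> pd.DataFrame:
    """Return the cached DataFrame if present, otherwise fetch and cache it.

    Hypothetical helper for illustration only.
    """
    if cache_path.exists():
        # CSV has no dtype information, so date columns are re-parsed here
        return pd.read_csv(cache_path, parse_dates=date_cols)
    df = fetch()
    if not df.empty:
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(cache_path, index=False)
    return df


calls = []


def fake_fetch() -> pd.DataFrame:
    # Stand-in for a slow API call; records how often it is invoked
    calls.append(1)
    return pd.DataFrame(
        {
            "citation": ["PN1", "PN2"],
            "nomination_date": pd.to_datetime(["2023-06-08", "2024-07-31"]),
        }
    )


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache" / "nominations.csv"
    first = load_or_fetch(cache, fake_fetch, ["nomination_date"])
    second = load_or_fetch(cache, fake_fetch, ["nomination_date"])

print(f"fetch was called {len(calls)} time(s)")
```

The second call is served from the cache file, so the fetch function runs only once.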

## Setup

In [None]:
import os
import sys
from pathlib import Path

import pandas as pd
from loguru import logger

# Add the project root to the path so we can import our modules
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from nomination_predictor.config import (EXTERNAL_DATA_DIR, INTERIM_DATA_DIR,
                                         PROCESSED_DATA_DIR, RAW_DATA_DIR)
from nomination_predictor.congress_api import CongressAPIClient
from nomination_predictor.fjc_data import (FJC_DATA_DIR, build_seat_timeline,
                                           get_predecessor_info, load_fjc_data)

# Setup logging
logger.remove()  # Remove default handler
logger.add(sys.stderr, format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level}</level> | <cyan>{function}</cyan> - <level>{message}</level>", level="INFO")

2025-07-11 19:29:10.319 | INFO     | nomination_predictor.config:<module>:103 - Project root: /home/wsl2ubuntuuser/nomination_predictor
2025-07-11 19:29:10.321 | INFO     | nomination_predictor.config:<module>:127 - Configuration loaded


5

## 1. Federal Judicial Center (FJC) Data

The FJC data is our canonical source for judicial seat timelines, judge demographics, and nomination failures.

### Check if FJC data exists or download if needed

In [None]:
# Check if required FJC data files exist and download any missing ones
from nomination_predictor.fjc_data import (FJC_DATA_DIR, REQUIRED_FJC_FILES,
                                           ensure_fjc_data_files,
                                           load_fjc_data)

# Check for missing files and download them if needed
downloaded, failed = ensure_fjc_data_files()

# Report status
if downloaded:
    print(f"✓ Downloaded {len(downloaded)} previously missing files: {', '.join(downloaded)}")
if failed:
    print(f"❌ Failed to download {len(failed)} files: {', '.join(failed)}")
    
# Also report on which files are present
present_files = [f for f in REQUIRED_FJC_FILES if (FJC_DATA_DIR / f).exists()]
if len(present_files) == len(REQUIRED_FJC_FILES):
    print(f"✓ All required FJC data files are available in {FJC_DATA_DIR}")
else:
    missing = set(REQUIRED_FJC_FILES) - set(present_files)
    print(f"⚠️ Still missing {len(missing)} required files: {', '.join(missing)}")

2025-07-11 19:29:10 | INFO | ensure_fjc_data_files - Ensuring FJC data files are available


✓ All required FJC data files are available in /home/wsl2ubuntuuser/nomination_predictor/data/external/FederalJudicialCenter
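
For readers without access to the module, the check-then-download pattern behind a call like `ensure_fjc_data_files()` might look like the sketch below. This is an assumption about the general shape, not the actual code in `nomination_predictor.fjc_data`; `ensure_files` and `fake_download` are hypothetical, and the downloader is injected so the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def ensure_files(
    data_dir: Path,
    required: Iterable[str],
    download: Callable[[str, Path], None],
) -> tuple[list[str], list[str]]:
    """Download each required file that is missing from data_dir.

    Returns (downloaded, failed) lists of file names. Hypothetical helper.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    downloaded, failed = [], []
    for name in required:
        target = data_dir / name
        if target.exists():
            continue  # already present, nothing to do
        try:
            download(name, target)
            downloaded.append(name)
        except Exception:
            failed.append(name)
    return downloaded, failed


def fake_download(name: str, target: Path) -> None:
    # Stand-in for an HTTP download; one file simulates a failure
    if name == "broken.csv":
        raise OSError("simulated network failure")
    target.write_text("nid\n1\n")


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "judges.csv").write_text("nid\n1\n")  # pre-existing file
    downloaded, failed = ensure_files(
        tmp_dir, ["judges.csv", "education.csv", "broken.csv"], fake_download
    )

print("downloaded:", downloaded)
print("failed:", failed)
```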


### Load FJC Data

In [None]:
# Load all FJC data files (with auto-download enabled by default)
fjc_data = load_fjc_data()

# Access individual DataFrames
print("Loaded FJC data files:")
for key, df in fjc_data.items():
    print(f"- {key}: {len(df)} records")

# Store references to commonly used DataFrames for easier access
judges_df = fjc_data.get('judges')
demographics_df = fjc_data.get('demographics')
education_df = fjc_data.get('education')
service_df = fjc_data.get('federal_judicial_service')
other_nominations_recess_df = fjc_data.get('other_nominations_recess')
other_federal_judicial_service_df = fjc_data.get('other_federal_judicial_service')
professional_career_df = fjc_data.get('professional_career')

# Check if we have the required service data to build the seat timeline
if service_df is None or len(service_df) == 0:
    print("❌ Error: federal-judicial-service.csv not found or empty")
    # Create an empty DataFrame as a fallback
    seat_timeline_df = pd.DataFrame()
else:
    # Build the seat timeline from the service data
    seat_timeline_df = build_seat_timeline(service_df)
    print(f"Built seat timeline with {len(seat_timeline_df)} records")
    
    # Show the first few rows (a bare expression inside an if/else block is not displayed)
    print(seat_timeline_df.head())

2025-07-11 19:29:10 | INFO | load_fjc_data - Loading FJC data files
2025-07-11 19:29:10 | INFO | ensure_fjc_data_files - Ensuring FJC data files are available
2025-07-11 19:29:10 | INFO | load_fjc_csv - Loading FJC data file: demographics.csv
2025-07-11 19:29:10 | INFO | load_fjc_data - Loaded demographics data with 4022 records
2025-07-11 19:29:10 | INFO | load_fjc_csv - Loading FJC data file: education.csv
2025-07-11 19:29:10 | INFO | load_fjc_data - Loaded education data with 8040 records
2025-07-11 19:29:10 | INFO | load_fjc_csv - Loading FJC data file: federal-judicial-service.csv
2025-07-11 19:29:10 | INFO | load_fjc_data - Loaded federal_judicial_service data with 4720 records
2025-07-11 19:29:10 | INFO | load_fjc_csv - Loading FJC data file: judges.csv
2025-07-11 19:29:11 | INFO | load_fjc_data - Loaded judges data with 4022 records
2025-07-11 19:29:11 | INFO | load_fjc_csv - Loading FJC data file: other-nominations-recess.csv
2025-07-11 19:29:11 | INFO | load_fjc_data - Loaded other_nominations_recess data with 828 records
2025-07-11 19:29:11 | INFO | load_fjc_csv - Loading FJC data file: other-federal-judicial-service.csv
2025-07-11 19:29:11 | INFO | load_fjc_data - Loaded other_federal_judicial_service data with 611 records
2025-07-11 19:29:11 | INFO | load_fjc_csv - Loading FJC data file: pro

Loaded FJC data files:
- demographics: 4022 records
- education: 8040 records
- federal_judicial_service: 4720 records
- judges: 4022 records
- other_nominations_recess: 828 records
- other_federal_judicial_service: 611 records
- professional_career: 19003 records




Built seat timeline with 4720 records


In [None]:
# Build the seat timeline from service data
try:
    # Only proceed if service_df exists and is not empty
    if service_df is None or service_df.empty:
        raise ValueError("service_df is missing or empty - cannot build seat timeline")
        
    # Build the seat timeline
    seat_timeline_df = build_seat_timeline(service_df)
    print(f"Built seat timeline: {len(seat_timeline_df)} records")
except Exception as e:
    print(f"❌ Error building seat timeline: {e}")
    raise  # Re-raise the exception to ensure the cell fails visibly

2025-07-11 19:29:18 | INFO | build_seat_timeline - Building seat timeline table


Built seat timeline: 4720 records


### Build Predecessor Lookup

In [None]:
# Create the predecessor lookup table
predecessor_lookup = get_predecessor_info(seat_timeline_df)
print(f"Created predecessor lookup: {len(predecessor_lookup)} records")

# Preview the predecessor lookup
print(predecessor_lookup.head())

2025-07-11 19:29:26 | INFO | get_predecessor_info - Building seat predecessor lookup table


Created predecessor lookup: 3259 records
          seat_id  predecessor_nid vacancy_date
396   1801CC10101          1378061   1802-07-01
2564  1801CC10201          1384076   1802-05-06
3937  1801CC10301          1387966   1802-07-01
285   1801CC20101          1377756   1802-07-01
1887  1801CC20201          1382206   1802-07-01
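
One plausible way to derive a lookup with these three columns from a seat timeline is sketched below. The real `get_predecessor_info` may work differently; this toy version simply assumes that every timeline row with a known `vacancy_date` describes a judge who vacated the seat, and the sample rows are invented.

```python
import pandas as pd

# Toy seat timeline (invented values); the real table has many more columns
timeline = pd.DataFrame(
    {
        "seat_id": ["A01", "A01", "B01"],
        "nid": [10, 11, 20],
        "vacancy_date": ["1990-05-01", None, "2001-03-15"],
    }
)
timeline["vacancy_date"] = pd.to_datetime(timeline["vacancy_date"])

# Assumption: a row with a vacancy date marks the outgoing judge for that seat,
# so that judge is the predecessor of whoever fills the seat next
lookup = (
    timeline.dropna(subset=["vacancy_date"])
    .rename(columns={"nid": "predecessor_nid"})
    .loc[:, ["seat_id", "predecessor_nid", "vacancy_date"]]
    .sort_values(["seat_id", "vacancy_date"])
)

print(lookup)
```

Rows for judges still occupying a seat have no vacancy date and drop out, which would be consistent with the lookup having fewer records than the timeline.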


## 2. Congress.gov API Data

The Congress.gov API provides detailed information about judicial nominations, including:
- Nomination date
- Nominee information
- Confirmation status and date
- Committee actions

### Setup API Access

In [None]:
# Check if API key is available
api_key = os.environ.get("CONGRESS_API_KEY")
if not api_key:
    print("❌ Error: CONGRESS_API_KEY environment variable not set")
    print("Please set the CONGRESS_API_KEY environment variable to your Congress.gov API key")
    print("You can request an API key at: https://api.congress.gov/sign-up/")
else:
    print("✓ Congress API key found in environment variables")
    # Initialize the API client
    congress_client = CongressAPIClient(api_key)
    print("✓ Congress API client initialized")

✓ Congress API key found in environment variables
✓ Congress API client initialized


### Fetch Judicial Nominations from Recent Congresses

In [None]:
# Fetch judicial nominations from recent congresses
# Congress numbering: 116th (2019-2021), 117th (2021-2023), 118th (2023-2025)
import os
from pathlib import Path

from nomination_predictor.config import EXTERNAL_DATA_DIR

# Define cache file path for nominations
nominations_cache_file = os.path.join(EXTERNAL_DATA_DIR, "congress_nominations_cache.csv")
congresses = [118, 117, 116]  # Most recent three congresses

# Check if we have cached data
if os.path.exists(nominations_cache_file):
    print(f"Found cached nominations data at {nominations_cache_file}")
    # Parse the date columns that the fetch branch below writes to the cache
    nominations_df = pd.read_csv(nominations_cache_file, parse_dates=['nomination_date', 'latest_action_date'])
    print(f"Loaded {len(nominations_df)} nominations from cache")
else:
    # If no cache, fetch from API
    all_nominations = []
    
    for congress in congresses:
        try:
            print(f"Fetching judicial nominations for the {congress}th Congress...")
            nominations = congress_client.get_judicial_nominations(congress)
            print(f"  ✓ Retrieved {len(nominations)} judicial nominations")
            all_nominations.extend(nominations)
        except Exception as e:
            print(f"  ❌ Error fetching nominations for {congress}th Congress: {str(e)}")
    
    # Convert to DataFrame
    nominations_df = pd.DataFrame(all_nominations)
    print(f"\nTotal nominations retrieved: {len(nominations_df)}")
    
    # Save to cache file
    if len(nominations_df) > 0:
        # Ensure directory exists
        os.makedirs(os.path.dirname(nominations_cache_file), exist_ok=True)
        print(f"Saving nominations to cache file: {nominations_cache_file}")
        nominations_df.to_csv(nominations_cache_file, index=False)
        print(f"✓ Saved {len(nominations_df)} nominations to cache")

2025-07-11 19:29:26 | INFO | get_judicial_nominations - Fetching judicial nominations for Congress 118
2025-07-11 19:29:26 | INFO | get_nominations - Fetching nominations for 118th Congress with pagination
2025-07-11 19:29:26 | INFO | get_nominations - Fetching page 1 for 118th Congress nominations


Fetching judicial nominations for the 118th Congress...


2025-07-11 19:29:29 | INFO | get_nominations - Retrieved 250 nominations from page 1
2025-07-11 19:29:29 | INFO | get_nominations - Moving to page 2 with offset 250
2025-07-11 19:29:29 | INFO | get_nominations - Fetching page 2 for 118th Congress nominations
2025-07-11 19:29:33 | INFO | get_nominations - Retrieved 250 nominations from page 2
2025-07-11 19:29:33 | INFO | get_nominations - Moving to page 3 with offset 500
2025-07-11 19:29:33 | INFO | get_nominations - Fetching page 3 for 118th Congress nominations
2025-07-11 19:29:36 | INFO | get_nominations - Retrieved 250 nominations from page 3
2025-07-11 19:29:36 | INFO | get_nominations - Moving to page 4 with offset 750
2025-07-11 19:29:36 | INFO |

  ✓ Retrieved 285 judicial nominations
Fetching judicial nominations for the 117th Congress...


2025-07-11 19:32:37 | INFO | get_nominations - Retrieved 250 nominations from page 1
2025-07-11 19:32:37 | INFO | get_nominations - Moving to page 2 with offset 250
2025-07-11 19:32:37 | INFO | get_nominations - Fetching page 2 for 117th Congress nominations
2025-07-11 19:32:43 | INFO | get_nominations - Retrieved 250 nominations from page 2
2025-07-11 19:32:43 | INFO | get_nominations - Moving to page 3 with offset 500
2025-07-11 19:32:43 | INFO | get_nominations - Fetching page 3 for 117th Congress nominations
2025-07-11 19:32:47 | INFO | get_nominations - Retrieved 250 nominations from page 3
2025-07-11 19:32:47 | INFO | get_nominations - Moving to page 4 with offset 750
2025-07-11 19:32:47 | INFO |

  ✓ Retrieved 387 judicial nominations
Fetching judicial nominations for the 116th Congress...


2025-07-11 19:36:51 | INFO | get_nominations - Retrieved 250 nominations from page 1
2025-07-11 19:36:51 | INFO | get_nominations - Moving to page 2 with offset 250
2025-07-11 19:36:51 | INFO | get_nominations - Fetching page 2 for 116th Congress nominations
2025-07-11 19:36:55 | INFO | get_nominations - Retrieved 250 nominations from page 2
2025-07-11 19:36:55 | INFO | get_nominations - Moving to page 3 with offset 500
2025-07-11 19:36:55 | INFO | get_nominations - Fetching page 3 for 116th Congress nominations
2025-07-11 19:36:58 | INFO | get_nominations - Retrieved 250 nominations from page 3
2025-07-11 19:36:58 | INFO | get_nominations - Moving to page 4 with offset 750
2025-07-11 19:36:58 | INFO |

  ✓ Retrieved 308 judicial nominations

Total nominations retrieved: 980
Saving nominations to cache file: /home/wsl2ubuntuuser/nomination_predictor/data/external/congress_nominations_cache.csv
✓ Saved 980 nominations to cache
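
The log lines above show pages of 250 records being requested with a growing offset. A generic offset-based pagination loop of that kind could be sketched as follows. `fetch_all` and `fake_fetch_page` are hypothetical and stand in for whatever `CongressAPIClient` does internally; the fake endpoint slices an in-memory list so the example runs offline.

```python
from typing import Callable


def fetch_all(
    fetch_page: Callable[[int, int], list[dict]],
    page_size: int = 250,
) -> list[dict]:
    """Collect every record from an offset/limit style endpoint.

    Stops as soon as a page comes back shorter than page_size.
    """
    results: list[dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        results.extend(page)
        if len(page) < page_size:
            break  # last (possibly empty) page reached
        offset += page_size
    return results


# Fake endpoint holding 620 records, served in slices
records = [{"number": i} for i in range(620)]


def fake_fetch_page(offset: int, limit: int) -> list[dict]:
    return records[offset : offset + limit]


all_items = fetch_all(fake_fetch_page)
print(len(all_items))
```

When the total is an exact multiple of the page size, this loop makes one extra request that returns an empty page before stopping.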


In [None]:
# Preview the nominations
print(nominations_df.head())

   congress  nomination_number citation            source  source_year  \
0       118               2012   PN2012  congress.gov_api         2025   
1       118               2013   PN2013  congress.gov_api         2025   
2       118                814    PN814  congress.gov_api         2025   
3       118                771    PN771  congress.gov_api         2025   
4       118                769    PN769  congress.gov_api         2025   

   source_month nomination_date latest_action_date  \
0             7      2024-07-31         2025-01-03   
1             7      2024-07-31         2025-01-03   
2             7      2023-07-11         2024-01-03   
3             7      2023-06-08         2023-12-14   
4             7      2023-06-08         2023-12-14   

                                  latest_action_text nominee  \
0  Returned to the President under the provisions...           
1  Returned to the President under the provisions...           
2  Returned to the President under the

## 3. Crosswalk and Join Data Sources

Now we'll crosswalk the Congress.gov nomination data to the FJC seat timeline using the nomination-to-seat matching logic.
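
As a rough illustration of name-based matching, the sketch below links nominations to seats by comparing a normalized predecessor name. It is not the logic inside `crosswalk_congress_api`: the `predecessor` column, the `name_key` helper, and the sample rows are all hypothetical. Only the `seat_match_method` labels are taken from the output of this notebook.

```python
import re

import pandas as pd


def name_key(name: str) -> str:
    """Order-insensitive key: lowercase, strip punctuation, sort the tokens."""
    tokens = re.sub(r"[^a-z\s]", " ", name.lower()).split()
    return " ".join(sorted(tokens))


# Invented sample data
seats = pd.DataFrame(
    {"seat_id": ["S1", "S2"], "judge_name": ["Doe, Jane A.", "Roe, Richard"]}
)
noms = pd.DataFrame(
    {"citation": ["PN1", "PN2"], "predecessor": ["Jane A. Doe", "Unknown Person"]}
)

seats["name_key"] = seats["judge_name"].map(name_key)
noms["name_key"] = noms["predecessor"].map(name_key)

# Left merge keeps every nomination, matched or not
matched = noms.merge(seats[["seat_id", "name_key"]], on="name_key", how="left")
matched["seat_match_method"] = (
    matched["seat_id"].notna().map({True: "predecessor_name", False: "unmatched"})
)

print(matched[["citation", "seat_id", "seat_match_method"]])
```

A left merge like this returns more rows than the input whenever one key matches several seats, which is worth keeping in mind when a row count grows after a join.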

In [None]:
from nomination_predictor.fjc_data import crosswalk_congress_api

crosswalked_df = crosswalk_congress_api(nominations_df, seat_timeline_df, judges_df)
print(f"Crosswalked {len(crosswalked_df)} nominations with seat timeline and judges data")
print(f"Match statistics:\n{crosswalked_df['seat_match_method'].value_counts()}")

# Preview crosswalked data
print(crosswalked_df.head())

2025-07-11 19:43:18 | INFO | crosswalk_congress_api - Crosswalking Congress.gov API data with FJC seat timeline
2025-07-11 19:42:49 | INFO | crosswalk_congress_api - Available columns in seat_timeline: ['nid', 'sequence', 'judge_name', 'court_type', 'court_name', 'appointment_title', 'appointing_president', 'party_of_appointing_president', 'reappointing_president', 'party_of_reappointing_president', 'aba_rating', 'seat_id', 'statute_authorizing_new_seat', 'recess_appointment_date', 'nomination_date', 'committee_referral_date', 'hearing_date', 'judiciary_committee_action', 'committee_action_date', 'senate_vote_type', 'ayes/nays', 'confirmation_date', 'commission_date', 'service_as_chief_judge,_begin', 'service_as_chief_judge,_end', '2nd_service_as_chief_judge,_begin', '2nd_service_as_chief_judge,_end', 'senior_status_date', 'termination', 'termination_date', 'vacancy_date']
2025-07-11 19:42:49 | INFO | crosswalk_congress_api - Joining on columns: ['nid', 'seat_id']
2025-07-11 19:42:49 | INFO | crosswalk_cong

Crosswalked 1019 nominations with seat timeline and judges data
Match statistics:
seat_match_method
unmatched           766
predecessor_name    253
Name: count, dtype: int64
   congress  nomination_number citation            source  source_year  \
0       118               2012   PN2012  congress.gov_api         2025   
1       118               2013   PN2013  congress.gov_api         2025   
2       118                814    PN814  congress.gov_api         2025   
3       118                771    PN771  congress.gov_api         2025   
4       118                769    PN769  congress.gov_api         2025   

   source_month nomination_date_x latest_action_date  \
0             7        2024-07-31         2025-01-03   
1             7        2024-07-31         2025-01-03   
2             7        2023-07-11         2024-01-03   
3             7        2023-06-08         2023-12-14   
4             7        2023-06-08         2023-12-14   

                                  latest_act

## 4. Create Master Dataset

Now we'll create the master dataset by joining the seat timeline with the crosswalked nominations data.
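
A join of this kind can be sketched with a plain pandas left merge. The example below is an assumption about the general approach, not the body of `create_master_dataset`; the frames and values are invented. It uses `validate` to assert that seat IDs are unique on the left side and `indicator` to see which seats picked up nomination data.

```python
import pandas as pd

# Invented sample data: three seats, two nominations for the same seat
seat_timeline = pd.DataFrame({"seat_id": ["S1", "S2", "S3"], "nid": [1, 2, 3]})
crosswalked = pd.DataFrame({"seat_id": ["S1", "S1"], "citation": ["PN1", "PN2"]})

master = seat_timeline.merge(
    crosswalked,
    on="seat_id",
    how="left",
    validate="one_to_many",  # raises if seat_id repeats in seat_timeline
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

print(len(master))  # 4 rows: S1 appears twice, S2 and S3 once each
print(master["_merge"].value_counts())
```

Because one seat can match several nominations, the result can have more rows than the seat timeline itself.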

In [None]:
from nomination_predictor.fjc_data import create_master_dataset

# Create the master dataset from the seat timeline and the crosswalked nominations
master_df = create_master_dataset(
    seat_timeline_df,
    crosswalked_df
)

print(f"Created master dataset: {len(master_df)} records")

# Preview master dataset
master_df.head()

2025-07-11 19:42:49 | INFO | create_master_dataset - Creating master dataset
2025-07-11 19:42:49 | INFO | create_master_dataset - Adding nomination data to master dataset


Created master dataset: 4803 records


## 5. Save Data to Raw Directory

Save the datasets to the raw data directory for use by downstream notebooks.

In [None]:
# Save the FJC data (with normalized column names) and seat timeline to the raw data directory
import os
from datetime import datetime

from nomination_predictor.config import RAW_DATA_DIR

# Create the raw data directory if it doesn't exist
os.makedirs(RAW_DATA_DIR, exist_ok=True)

# Add a timestamp for the manifest
timestamp = datetime.now().strftime("%Y%m%d")

# Save each FJC dataframe
print(f"Saving FJC data (with normalized column names) to {RAW_DATA_DIR}...")
saved_files = []

# Save all loaded FJC dataframes
for key, df in fjc_data.items():
    if df is not None and len(df) > 0:
        # Create filename
        output_file = RAW_DATA_DIR / f"{key}.csv"
        
        # Save to CSV
        df.to_csv(output_file, index=False)
        saved_files.append(f"{key}.csv")
        print(f"  ✓ Saved {len(df)} records to {output_file}")

# Save the seat timeline master table if it exists
if seat_timeline_df is not None and len(seat_timeline_df) > 0:
    output_file = RAW_DATA_DIR / "seat_timeline_master.csv"
    seat_timeline_df.to_csv(output_file, index=False)
    saved_files.append("seat_timeline_master.csv")
    print(f"  ✓ Saved seat timeline master table with {len(seat_timeline_df)} records to {output_file}")

# Create a manifest file to track what was saved and when
manifest_content = f"""# FJC Data Processing Manifest
Processed on: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
Note: Only column names are normalized (lowercase with underscores), data values remain unchanged
Files saved:
{chr(10).join(['- ' + file for file in saved_files])}
"""

with open(RAW_DATA_DIR / f"fjc_data_manifest_{timestamp}.txt", "w") as f:
    f.write(manifest_content)

print(f"✓ Saved {len(saved_files)} files to {RAW_DATA_DIR}")
print(f"✓ Created manifest: fjc_data_manifest_{timestamp}.txt")

✓ Saved seat timeline to /home/wsl2ubuntuuser/nomination_predictor/data/raw/seat_timeline.csv
✓ Saved crosswalked nominations to /home/wsl2ubuntuuser/nomination_predictor/data/raw/crosswalked_nominations.csv
✓ Saved master dataset to /home/wsl2ubuntuuser/nomination_predictor/data/raw/master_dataset.csv


In [None]:

# Save crosswalked nominations
if crosswalked_df is not None and not crosswalked_df.empty:
    output_path = RAW_DATA_DIR / "crosswalked_nominations.csv"
    crosswalked_df.to_csv(output_path, index=False)
    print(f"✓ Saved crosswalked nominations to {output_path}")
else:
    print("No crosswalked nominations to save")
    
# Save master dataset
if master_df is not None and not master_df.empty:
    output_path = RAW_DATA_DIR / "master_dataset.csv"
    master_df.to_csv(output_path, index=False)
    print(f"✓ Saved master dataset to {output_path}")
else:
    print("No master dataset to save")

## Summary

In this notebook, we have:

1. Loaded Federal Judicial Center (FJC) data, the canonical source for judicial seats and judges
2. Built the seat timeline as our master table
3. Fetched judicial nominations from the Congress.gov API
4. Crosswalked the nomination data to FJC seat IDs
5. Created a master dataset joining these sources
6. Saved all datasets to the raw data directory for further processing by downstream notebooks

The next notebook (1.00-nw-data-cleaning-feature-creation.ipynb) will load these datasets, clean them, and engineer features for modeling.