# Data Fetching: FJC and Congress.gov API

This notebook fetches data from our primary sources and applies only minimal initial processing:

1. Federal Judicial Center (FJC) CSV and Excel files
2. Congress.gov API judicial nomination data

According to the project architecture, this notebook will:
1. Download or use cached data from the FJC and Congress.gov API
2. Perform minimal transformations to convert to dataframes
3. Save the resulting dataframes to `data/raw` for further processing by downstream notebooks

## Setup

In [None]:
import os
import sys
from pathlib import Path

import pandas as pd
from loguru import logger

# Add the project root to the path so we can import our modules
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from nomination_predictor.congress_api import CongressAPIClient

# Setup logging
logger.remove()  # Remove default handler
logger.add(sys.stderr, format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level}</level> | <cyan>{function}</cyan> - <level>{message}</level>", level="INFO")

2025-07-13 10:35:32.106 | INFO     | nomination_predictor.config:<module>:103 - Project root: /home/wsl2ubuntuuser/nomination_predictor
2025-07-13 10:35:32.108 | INFO     | nomination_predictor.config:<module>:127 - Configuration loaded


5

## 1. Federal Judicial Center (FJC) Data

The FJC data is our canonical source for judicial seat timelines, judge demographics, and nomination failures.

### Check if FJC data exists or download if needed

In [None]:
# Check if required FJC data files exist and download any missing ones
from nomination_predictor.config import EXTERNAL_DATA_DIR
from nomination_predictor.fjc_data import (REQUIRED_FJC_FILES,
                                           ensure_fjc_data_files,
                                           load_fjc_data)

# Check for missing files and download them if needed
downloaded, failed = ensure_fjc_data_files()

# Report status
if downloaded:
    print(f"✓ Downloaded {len(downloaded)} previously missing files: {', '.join(downloaded)}")
if failed:
    print(f"❌ Failed to download {len(failed)} files: {', '.join(failed)}")
    
# Also report on which files are present
present_files = [f for f in REQUIRED_FJC_FILES if (EXTERNAL_DATA_DIR / f).exists()]
if len(present_files) == len(REQUIRED_FJC_FILES):
    print(f"✓ All required FJC data files are available in {EXTERNAL_DATA_DIR}")
else:
    missing = set(REQUIRED_FJC_FILES) - set(present_files)
    print(f"⚠️ Still missing {len(missing)} required files: {', '.join(missing)}")

2025-07-13 10:35:32 | INFO | ensure_fjc_data_files - Ensuring FJC data files are available


✓ All required FJC data files are available in /home/wsl2ubuntuuser/nomination_predictor/data/external
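
The check-then-download step above can be sketched in isolation. This is a minimal illustration, not the actual `ensure_fjc_data_files` implementation: the function name `ensure_files`, its parameters, and the injected `download` callable are all hypothetical, chosen so the logic can run without network access.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def ensure_files(
    required: Iterable[str],
    data_dir: Path,
    download: Callable[[str, Path], None],
) -> Tuple[List[str], List[str]]:
    """Download any missing files; return (downloaded, failed) file names.

    Hypothetical helper: `download(name, target)` is assumed to write the
    file to `target` or raise OSError on failure.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    downloaded, failed = [], []
    for name in required:
        target = data_dir / name
        if target.exists():
            continue  # already present, nothing to do
        try:
            download(name, target)
            downloaded.append(name)
        except OSError:
            failed.append(name)
    return downloaded, failed
```

Returning both lists, rather than raising on the first failure, lets the caller report every problem file in one pass, which matches how the cell above prints separate "downloaded" and "failed" messages.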


### Load FJC Data

In [None]:
# Load all FJC data files (with auto-download enabled by default)
fjc_data = load_fjc_data()

# Access individual DataFrames
print(f"Loaded FJC data files:")
for key, df in fjc_data.items():
    print(f"- {key}: {len(df)} records")

# Store references to commonly used DataFrames for easier access
judges_df = fjc_data.get('judges')
demographics_df = fjc_data.get('demographics')
education_df = fjc_data.get('education')
federal_judicial_service_df = fjc_data.get('federal_judicial_service')
other_nominations_recess_df = fjc_data.get('other_nominations_recess')
other_federal_judicial_service_df = fjc_data.get('other_federal_judicial_service')
professional_career_df = fjc_data.get('professional_career')

# Create a dictionary of all FJC dataframes for easy iteration
all_dataframes = {
    'judges': judges_df,
    'demographics': demographics_df,
    'education': education_df,
    'federal_judicial_service': federal_judicial_service_df,
    'other_nominations_recess': other_nominations_recess_df,
    'other_federal_judicial_service': other_federal_judicial_service_df,
    'professional_career': professional_career_df
}

2025-07-13 10:35:32 | INFO | load_fjc_data - Loading FJC data files
2025-07-13 10:35:32 | INFO | ensure_fjc_data_files - Ensuring FJC data files are available
2025-07-13 10:35:32 | INFO | load_fjc_csv - Loading FJC data file: demographics.csv
2025-07-13 10:35:32 | INFO | load_fjc_data - Loaded demographics data with 4022 records
2025-07-13 10:35:32 | INFO | load_fjc_csv - Loading FJC data file: education.csv
2025-07-13 10:35:32 | INFO | load_fjc_data - Loaded education data with 8040 records
2025-07-13 10:35:32 | INFO | load_fjc_csv - Loading FJC data file: federal-judicial-service.csv
2025-07-13 10:35:32 | INFO | load_fjc_data - Loaded federal_judicial_service data with 4720 records
2025-07-13 10:35:32 |

Loaded FJC data files:
- demographics: 4022 records
- education: 8040 records
- federal_judicial_service: 4720 records
- judges: 4022 records
- other_nominations_recess: 828 records
- other_federal_judicial_service: 611 records
- professional_career: 19003 records


### Build a "seat timeline" inferred from FJC's data about when judges were in service

In [None]:
from nomination_predictor.dataset import build_and_validate_seat_timeline

try:
    seat_timeline_df = build_and_validate_seat_timeline(federal_judicial_service_df)
    print(f"✅ Successfully built seat timeline with {len(seat_timeline_df):,} records")
    all_dataframes['seat_timeline'] = seat_timeline_df
except Exception as e:
    print(f"❌ Error: {e}")
    raise

2025-07-13 10:35:33 | INFO | build_seat_timeline - Building seat timeline table
2025-07-13 10:35:37 | INFO | build_and_validate_seat_timeline - Successfully built seat timeline with 4,720 records


✅ Successfully built seat timeline with 4,720 records
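
To make the idea of a seat timeline concrete, here is a small plain-Python sketch. It is not the logic inside `build_and_validate_seat_timeline`; the field names (`seat_id`, `commission_date`, `termination_date`, `vacancy_date`) are assumptions for illustration and may not match the FJC columns. Each seat's service records are ordered by commission date, and every record is annotated with the date its predecessor left the seat.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def build_seat_timeline(service_records):
    """Order each seat's incumbents and attach the date the seat fell vacant.

    All field names here are hypothetical, not taken from the FJC files.
    """
    ordered = sorted(service_records, key=itemgetter("seat_id", "commission_date"))
    timeline = []
    for _seat_id, group in groupby(ordered, key=itemgetter("seat_id")):
        previous_end = None  # the first incumbent of a seat has no predecessor
        for rec in group:
            timeline.append({**rec, "vacancy_date": previous_end})
            previous_end = rec["termination_date"]
    return timeline
```

The sketch emits exactly one output row per service record, which is consistent with the run above, where 4,720 service records produce a 4,720-row timeline.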


## 2. Congress.gov API Data

The Congress.gov API provides detailed information about judicial nominations, including:
- Nomination date
- Nominee information
- Confirmation status and date
- Committee actions

### Setup API Access

In [None]:
# Check if API key is available
api_key = os.environ.get("CONGRESS_API_KEY")
if not api_key:
    print("❌ Error: CONGRESS_API_KEY environment variable not set")
    print("Please set the CONGRESS_API_KEY environment variable to your Congress.gov API key")
    print("You can request an API key at: https://api.congress.gov/sign-up/")
else:
    print("✓ Congress API key found in environment variables")
    # Initialize the API client
    congress_client = CongressAPIClient(api_key)
    print("✓ Congress API client initialized")

✓ Congress API key found in environment variables
✓ Congress API client initialized
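
For orientation, the sketch below shows how a request URL for one page of nominations might be assembled. It is an assumption about what a client such as `CongressAPIClient` does internally, not a description of it; the helper name `nomination_list_url` is hypothetical, and the endpoint path and query parameters should be checked against the Congress.gov API documentation before use.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.congress.gov/v3"


def nomination_list_url(congress, api_key, offset=0, limit=250):
    """Build the URL for one page of nominations from a given Congress.

    Hypothetical helper; parameter names are assumed from the public API docs.
    """
    query = urlencode(
        {"format": "json", "offset": offset, "limit": limit, "api_key": api_key}
    )
    return f"{BASE_URL}/nomination/{congress}?{query}"
```

Keeping the key out of the notebook and reading it from `CONGRESS_API_KEY`, as the cell above does, avoids committing credentials along with the notebook.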


### Fetch Judicial Nominations from Recent Congresses

In [None]:
# Fetch judicial nominations for the range of Congresses set by the constants below
# Congress numbering, for reference: 116th (2019-2021), 117th (2021-2023), 118th (2023-2025)
import os

from nomination_predictor.config import RAW_DATA_DIR
from nomination_predictor.dataset import fetch_judicial_nominations

# Define constants 
MOST_RECENT_CONGRESS_TERM_TO_GET = 118
OLDEST_CONGRESS_TERM_TO_GET = 96

# Define cache file path for nominations
nominations_cache_file = os.path.join(RAW_DATA_DIR, "nominations.csv")

# Fetch nominations with improved error handling
nominations_df, success = fetch_judicial_nominations(
    congress_client=congress_client,
    most_recent_congress=MOST_RECENT_CONGRESS_TERM_TO_GET,
    oldest_congress=OLDEST_CONGRESS_TERM_TO_GET,
    auto_paginate=True,
    cache_file=nominations_cache_file
)

# Critical validation - prevent proceeding if we don't have valid data
if not success or len(nominations_df) == 0:
    raise RuntimeError(
        "Failed to retrieve valid nomination data. "
        "Please check the logs for errors and fix any issues before continuing."
    )

# Add to all_dataframes collection if we have valid data
all_dataframes['nominations'] = nominations_df

print(f"✓ Successfully loaded {len(nominations_df)} nomination records")

2025-07-13 10:35:38 | INFO | fetch_judicial_nominations - Found cached nominations data at /home/wsl2ubuntuuser/nomination_predictor/data/raw/nominations.csv
2025-07-13 10:35:38 | INFO | fetch_judicial_nominations - Loaded 5746 nominations records from cache (retrieved from 2025-07-12 00:00 to 2025-07-12 00:00)


✓ Successfully loaded 5746 nomination records
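
The `auto_paginate=True` option and the Congress range above suggest an offset-based loop. The following is a generic sketch of that pattern, not the body of `fetch_judicial_nominations`: `fetch_page` is a hypothetical callable standing in for a real API request, and both function names are made up for this example.

```python
def congress_range(most_recent, oldest):
    """List Congress numbers from newest to oldest, both ends included."""
    return list(range(most_recent, oldest - 1, -1))


def fetch_all_pages(fetch_page, page_size=250):
    """Collect items from an offset-paginated source until a short page arrives.

    `fetch_page(offset, limit)` is assumed to return a list of at most `limit` items.
    """
    items, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        items.extend(page)
        if len(page) < page_size:
            return items  # a short (or empty) page means there is nothing left
        offset += page_size
```

With the constants used in this notebook, `congress_range(118, 96)` covers 23 Congresses, so a full uncached run issues at least that many list requests before any detail requests.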


In [None]:
# Preview the nominations
print(nominations_df.head())
all_dataframes['nominations'] = nominations_df

                                          nomination  \
0  {'actions': {'count': 6, 'url': 'https://api.c...   
1  {'actions': {'count': 6, 'url': 'https://api.c...   
2  {'actions': {'count': 11, 'url': 'https://api....   
3  {'actions': {'count': 14, 'url': 'https://api....   
4  {'actions': {'count': 20, 'url': 'https://api....   

                                             request retrieval_date  \
0  {'congress': '118', 'contentType': 'applicatio...     2025-07-12   
1  {'congress': '118', 'contentType': 'applicatio...     2025-07-12   
2  {'congress': '118', 'contentType': 'applicatio...     2025-07-12   
3  {'congress': '118', 'contentType': 'applicatio...     2025-07-12   
4  {'congress': '118', 'contentType': 'applicatio...     2025-07-12   

   is_full_detail  
0            True  
1            True  
2            True  
3            True  
4            True  
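
The `nomination` and `request` columns hold nested dictionaries. When such a column is written with `to_csv` and read back with `read_csv`, each cell comes back as the string form of a Python dict (single-quoted keys), which `json.loads` rejects. A minimal sketch of one way to recover the structure follows; the helper name `parse_cell` is hypothetical, and whether the project's own loaders already do this is not shown in this notebook.

```python
import ast


def parse_cell(value):
    """Turn a stringified dict back into a dict; pass real dicts through unchanged."""
    if isinstance(value, dict):
        return value
    # literal_eval only evaluates Python literals, so it is safer than eval()
    return ast.literal_eval(value)
```

Applied to a column, this would look like `nominations_df["nomination"].map(parse_cell)`.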


### Fetch nominees for just-retrieved nominations

In [None]:
from nomination_predictor.dataset import \
    extract_nominee_urls_from_nominations_df

# Extract nominee URLs from the JSON-structured nominations DataFrame
nominee_urls_df = extract_nominee_urls_from_nominations_df(nominations_df)
nominee_urls = nominee_urls_df['nominee_url'].tolist()
print(f"Found {len(nominee_urls)} nominations to retrieve nominee URLs for")

# nominee_urls_df is neither intended nor necessary to be saved as a file;
# it's simply a utility for another API-driven retrieval operation below.
# Each row contains: citation, nominee_url, congress, number

2025-07-13 10:35:38 | INFO | extract_nominee_urls_from_nominations_df - Processing 5746 nominations to extract nominee URLs


Found 5671 nominations to retrieve nominee URLs for
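
As a rough picture of what URL extraction involves, the sketch below pulls nominee URLs out of nomination records. The record layout (a `nominees` list whose entries carry a `url`) is an assumption about the API response, and `extract_nominee_rows` is a hypothetical name; the real `extract_nominee_urls_from_nominations_df` may work differently.

```python
def extract_nominee_rows(nominations):
    """Build one row per nominee URL found in the given nomination records.

    Assumes each record is a dict that may contain a "nominees" list of
    dicts with a "url" key; records without any URL produce no rows.
    """
    rows = []
    for nom in nominations:
        for entry in nom.get("nominees", []):
            url = entry.get("url")
            if not url:
                continue
            rows.append(
                {
                    "citation": nom.get("citation"),
                    "nominee_url": url,
                    "congress": nom.get("congress"),
                    "number": nom.get("number"),
                }
            )
    return rows
```

Skipping records that carry no URL is one way a run could end up with fewer URLs than nominations, as in the counts above (5746 nominations, 5671 URLs).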


In [None]:
from nomination_predictor.dataset import get_retrieval_date_range_message

nominees_cache_file = os.path.join(RAW_DATA_DIR, "nominees.csv")

# Check if we have cached data
if os.path.exists(nominees_cache_file):
    print(f"Found cached nominees data at {nominees_cache_file}.  ")
    nominees_df = pd.read_csv(nominees_cache_file)
    print(f"{get_retrieval_date_range_message(nominees_df, 'nominees')}")
elif 0 == len(nominee_urls):
    print("⚠️ No nominee URLs found to download from")
else:
    print(f"Fetching nominee data for {len(nominee_urls)} nominations...")

    # Fetch nominee data for every extracted nominee URL
    nominees_data = congress_client.get_all_nominees_data(nominee_urls)

    # Convert to DataFrame
    nominees_df = pd.DataFrame(nominees_data)
    print(f"\nTotal nominees retrieved: {len(nominees_df)}")

Found cached nominees data at /home/wsl2ubuntuuser/nomination_predictor/data/raw/nominees.csv.  
Loaded 5671 nominee records from cache; Loaded 5671 nominees records from cache (retrieved from 2025-07-13 00:00 to 2025-07-13 00:00)
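
The cell above follows a cache-or-fetch pattern: read the CSV if it exists, otherwise call the API. A self-contained sketch of that pattern using only the standard library is shown here; `load_or_fetch` and its `fetch` argument are hypothetical names, and unlike the notebook (which writes the cache in a later cell) this version writes the cache immediately after fetching.

```python
import csv
from pathlib import Path


def load_or_fetch(cache_file: Path, fetch):
    """Return (rows, from_cache); call `fetch()` only when no cache file exists.

    `fetch` is assumed to return a list of flat dicts sharing the same keys.
    """
    if cache_file.exists():
        with cache_file.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh)), True
    rows = fetch()
    if rows:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        with cache_file.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows, False
```

Writing the cache right after a successful fetch means an interrupted notebook run does not have to repeat thousands of API requests.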


In [None]:
# Preview the nominees
print(nominees_df.head())
all_dataframes['nominees'] = nominees_df

                                             nominee  \
0  {'nominees': [{'firstName': 'Nicholas', 'lastN...   
1  {'nominees': [{'firstName': 'James', 'lastName...   
2  {'nominees': [{'firstName': 'Brandy', 'lastNam...   
3  {'nominees': [{'firstName': 'Jeffrey', 'lastNa...   
4  {'nominees': [{'firstName': 'Karoline', 'lastN...   

                                             request retrieval_date  
0  {'url': 'https://api.congress.gov/v3/nominatio...     2025-07-13  
1  {'url': 'https://api.congress.gov/v3/nominatio...     2025-07-13  
2  {'url': 'https://api.congress.gov/v3/nominatio...     2025-07-13  
3  {'url': 'https://api.congress.gov/v3/nominatio...     2025-07-13  
4  {'url': 'https://api.congress.gov/v3/nominatio...     2025-07-13  


In [None]:
## TODO: determine whether safe to move this to other notebook, or if other code already depends on it happening this early
## Normalize column names, leaving data values as-is
#nominees_df.columns = [col.casefold().replace(' ', '_') for col in nominees_df.columns]
#print("\nNominees DataFrame columns:")
#for col in sorted(nominees_df.columns):
#     print(f"- {col}: {nominees_df[col].nunique()} unique values")

## 3. Confirm "nid" and "citation" uniqueness to later use as FJC and Congress indexes, respectively

In [None]:
# Check for uniqueness in ID fields before saving to the raw data folder
from nomination_predictor.dataset import validate_dataframe_ids

print("Checking ID uniqueness in dataframes before saving...")

# Not all dataframes treat nid as unique, due to re-appointments, position changes, etc.
uniqueness_results = validate_dataframe_ids(all_dataframes)

# Check if any dataframes have duplicate IDs
problematic_dfs = [name for name, result in uniqueness_results.items() 
                   if not result.get('is_unique', True)]

# To see why some dataframes cannot use nid as a unique key, add "education" to the
# uniqueness-required list below: many judges appear several times, once per degree earned.
required_unique = ["judges", "demographics", "nominations", "nominees"]
if any(name in required_unique for name in problematic_dfs):
    logger.warning(f"⚠️ Found non-unique IDs in: {', '.join(problematic_dfs)}")
    for df_name in problematic_dfs:
        result = uniqueness_results[df_name]
        logger.warning(f"\nDuplicates in {df_name}:")
        display(result['duplicate_rows'])
else:
    logger.info("✓ All ID fields are unique across unique-ID-required dataframes.")

2025-07-13 10:35:40 | INFO | validate_dataframe_ids - Checking 'nid' uniqueness for dataframe 'judges'
2025-07-13 10:35:40 | INFO | check_id_uniqueness - All nid values are unique
2025-07-13 10:35:40 | INFO | validate_dataframe_ids - Checking 'nid' uniqueness for dataframe 'demographics'
2025-07-13 10:35:40 | INFO | check_id_uniqueness - All nid values are unique
2025-07-13 10:35:40 | INFO | validate_dataframe_ids - Checking 'nid' uniqueness for dataframe 'education'
2025-07-13 10:35:40 | INFO | validate_dataframe_ids - Checking 'nid' uniqueness for dataframe 'federal_judicial_service'
2025-07-13 10:35:40 | INFO | validate_dataframe_ids - Checking 'nid' uniqueness for dataframe 'other_nominations_recess'
2025-07-13 10:35:40 | INFO |

Checking ID uniqueness in dataframes before saving...
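
The uniqueness check itself is simple to illustrate. The sketch below is a plain-Python stand-in, not the code behind `validate_dataframe_ids`; `find_duplicate_ids` is a hypothetical name, and the sample rows are invented to mirror the education case described in the comments above.

```python
from collections import Counter


def find_duplicate_ids(records, id_field):
    """Return every record whose ID value appears more than once."""
    counts = Counter(rec[id_field] for rec in records)
    duplicated = {key for key, n in counts.items() if n > 1}
    return [rec for rec in records if rec[id_field] in duplicated]


# Invented example: one judge (nid 101) with two degrees, another with one
education = [
    {"nid": 101, "degree": "B.A."},
    {"nid": 101, "degree": "J.D."},
    {"nid": 202, "degree": "LL.B."},
]
duplicates = find_duplicate_ids(education, "nid")
```

Here `duplicates` holds both rows for nid 101, which is why a table with one row per degree cannot be indexed by `nid` alone.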


## 4. Save Data to Raw Directory

Save the datasets to the raw data directory for use by downstream notebooks.

In [None]:
# Save data to the raw data directory
import os
from datetime import datetime

from nomination_predictor.config import RAW_DATA_DIR

# Create the raw data directory if it doesn't exist
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)

# Date stamp used in the manifest file name
timestamp = datetime.now().strftime("%Y%m%d")

# Save every dataframe in all_dataframes (FJC and Congress.gov) to the raw data directory
print(f"Saving dataframes to {RAW_DATA_DIR}...")
saved_files = []

# Save all dataframes from the all_dataframes collection
for name, df in all_dataframes.items():
    if df is not None and not df.empty:
        try:
            # Create filename
            output_file = RAW_DATA_DIR / f"{name}.csv"
            
            # Save to CSV
            df.to_csv(output_file, index=False)
            saved_files.append(f"{name}.csv")
            print(f"  ✓ Saved {len(df):,} records to {output_file}")
        except Exception as e:
            print(f"  ✗ Error saving {name}: {str(e)}")

# Print summary
if saved_files:
    print(f"\n✅ Successfully saved {len(saved_files)} dataframes to {RAW_DATA_DIR}")
else:
    print("\n⚠️ No dataframes were saved - check if all_dataframes is populated correctly")

# Create a manifest file to track what was saved and when
manifest_content = f"""# FJC Data Processing Manifest
Processed on: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
Note: Only column names are normalized (lowercase with underscores), data values remain unchanged
Files saved:
{chr(10).join(['- ' + file for file in saved_files])}
"""

with open(RAW_DATA_DIR / f"fjc_data_manifest_{timestamp}.txt", "w") as f:
    f.write(manifest_content)

print(f"✓ Saved {len(saved_files)} files to {RAW_DATA_DIR}")
print(f"✓ Created manifest: fjc_data_manifest_{timestamp}.txt")

Saving dataframes to /home/wsl2ubuntuuser/nomination_predictor/data/raw...
  ✓ Saved 4,022 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/judges.csv
  ✓ Saved 4,022 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/demographics.csv
  ✓ Saved 8,040 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/education.csv
  ✓ Saved 4,720 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/federal_judicial_service.csv
  ✓ Saved 828 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/other_nominations_recess.csv
  ✓ Saved 611 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/other_federal_judicial_service.csv
  ✓ Saved 19,003 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/professional_career.csv
  ✓ Saved 4,720 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/seat_timeline.csv
  ✓ Saved 5,746 records to /home/wsl2ubuntuuser/nomination_predictor/data/raw/nominations.csv
  ✓ Saved 5,671 records to /home/

In [None]:
# Save Congress API retrieved nominations to cache file
if nominations_df is not None and not nominations_df.empty:
    # Ensure directory exists
    os.makedirs(os.path.dirname(nominations_cache_file), exist_ok=True)
    print(f"Saving nominations to cache file: {nominations_cache_file}")
    nominations_df.to_csv(nominations_cache_file, index=False)
    print(f"✓ Saved {len(nominations_df)} nominations to cache")
    
if nominees_df is not None and not nominees_df.empty:
    # Ensure directory exists
    os.makedirs(os.path.dirname(nominees_cache_file), exist_ok=True)
    print(f"Saving nominees to cache file: {nominees_cache_file}")
    nominees_df.to_csv(nominees_cache_file, index=False)
    print(f"✓ Saved {len(nominees_df)} nominees to cache")

Saving nominations to cache file: /home/wsl2ubuntuuser/nomination_predictor/data/raw/nominations.csv
✓ Saved 5746 nominations to cache
Saving nominees to cache file: /home/wsl2ubuntuuser/nomination_predictor/data/raw/nominees.csv
✓ Saved 5671 nominees to cache


## Summary

In this notebook, we have:

1. Loaded Federal Judicial Center (FJC) data, the canonical source for judicial seats and judges
2. Built a raw seat timeline dataframe inferred from the FJC service data
3. Fetched judicial nominations from the Congress.gov API
4. Fetched judicial nominee data from the Congress.gov API
5. Saved all datasets to the raw data directory for further processing by downstream notebooks

The next notebook (e.g. `1.##-nw-feature-engineering.ipynb`) loads these datasets, cleans them, and engineers features for modeling.