# Data Fetching: FJC and Congress.gov API

This notebook fetches and lightly processes data from our two primary sources:

1. Federal Judicial Center (FJC) CSV and Excel files
2. Congress.gov API judicial nomination data

According to the project architecture, this notebook will:
1. Download or use cached data from the FJC and Congress.gov API
2. Perform minimal transformations to convert to dataframes
3. Save the resulting dataframes to `data/raw` for further processing by downstream notebooks
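
The "download or use cached data" step can be sketched as a small helper. This is a minimal illustration, not the project's actual code: `fetch_cached` and its `download` parameter are hypothetical names introduced here, and the default downloader is the standard library's `urlretrieve`.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def fetch_cached(
    url: str,
    dest: Path,
    download: Callable[[str, Path], object] = urlretrieve,  # hypothetical hook, swappable in tests
) -> Path:
    """Return dest, downloading url into it only when no cached copy exists."""
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # cache hit: reuse the non-empty file already on disk
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(url, dest)  # cache miss: fetch once and keep the file
    return dest
```

Passing the downloader as an argument keeps the caching logic testable without network access; a call such as `fetch_cached(url, FJC_DATA_DIR / "judges.csv")` would only hit the network on the first run.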

## Setup

In [None]:
import os
import sys
from pathlib import Path

import pandas as pd
from loguru import logger

# Add the project root to the path so we can import our modules
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from nomination_predictor.config import (EXTERNAL_DATA_DIR, INTERIM_DATA_DIR,
                                         PROCESSED_DATA_DIR, RAW_DATA_DIR)
from nomination_predictor.congress_api import CongressAPIClient
from nomination_predictor.fjc_data import (FJC_DATA_DIR, build_seat_timeline,
                                           get_predecessor_info, load_fjc_csv,
                                           load_judges_data)

# Setup logging
logger.remove()  # Remove default handler
logger.add(sys.stderr, format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level}</level> | <cyan>{function}</cyan> - <level>{message}</level>", level="INFO")

## 1. Federal Judicial Center (FJC) Data

The FJC data is our canonical source for judicial seat timelines, judge demographics, and nomination failures.

### Check that FJC data exists (manual download if needed)

In [None]:
# Check if the FJC data directory exists and contains expected files
required_files = [
    'judges.csv',
    'demographics.csv',
    'federal-judicial-service.csv',
    'education.csv',
    'professional-career.csv',
    'other-nominations-recess.csv'
]

# Optional files
optional_files = [
    'judges.xlsx',
    'categories.xlsx',
]

# Check required files
missing_files = [file for file in required_files if not (FJC_DATA_DIR / file).exists()]

if missing_files:
    print(f"⚠️ Missing required FJC data files: {missing_files}")
    print("Please download the FJC data files from the Federal Judicial Center website and place them in:")
    print(f"  {FJC_DATA_DIR}")
    print("\nDownload links:")
    print("  (all from https://www.fjc.gov/history/judges/biographical-directory-article-iii-federal-judges-export)")
    print("  - https://www.fjc.gov/sites/default/files/history/judges.xlsx")
    print("  - https://www.fjc.gov/sites/default/files/history/judges.csv")
    print("  - https://www.fjc.gov/sites/default/files/history/categories.xlsx")
    print("  - https://www.fjc.gov/sites/default/files/history/demographics.csv")
    print("  - https://www.fjc.gov/sites/default/files/history/federal-judicial-service.csv")
    print("  - https://www.fjc.gov/sites/default/files/history/education.csv")
    print("  - https://www.fjc.gov/sites/default/files/history/professional-career.csv")
    print("  - https://www.fjc.gov/sites/default/files/history/other-nominations-recess.csv")
else:
    print(f"✓ All required FJC data files found in {FJC_DATA_DIR}")
    
    # Check optional files
    for file in optional_files:
        if (FJC_DATA_DIR / file).exists():
            print(f"  ✓ Optional file found: {file}")
        else:
            print(f"  ℹ️ Optional file not found: {file}")

✓ All required FJC data files found in /home/wsl2ubuntuuser/nomination_predictor/data/external/FederalJudicialCenter
  ✓ Optional file found: judges.xlsx
  ✓ Optional file found: categories.xlsx


### Load FJC Data

In [None]:
# Load the service data
try:
    service_df = load_fjc_csv('federal-judicial-service.csv')
    print(f"Loaded service data: {len(service_df)} records")
except FileNotFoundError:
    print("❌ Error: federal-judicial-service.csv not found")
    service_df = pd.DataFrame()

2025-07-11 15:10:28 | INFO | load_fjc_csv - Loading FJC data file: federal-judicial-service.csv

Loaded service data: 4720 records

In [None]:
# Load judges data with demographics
try:
    judges_df = load_judges_data(include_demographics=True)
    print(f"Loaded judges data: {len(judges_df)} records")
except FileNotFoundError:
    print("❌ Error: judges.csv not found")
    judges_df = pd.DataFrame()

2025-07-11 15:10:40 | INFO | load_judges_data - Loading judges data
2025-07-11 15:10:40 | INFO | load_fjc_csv - Loading FJC data file: judges.csv
2025-07-11 15:10:51 | INFO | load_fjc_csv - Loading FJC data file: demographics.csv
2025-07-11 15:10:51 | INFO | load_judges_data - Added demographic information to judges data

Loaded judges data: 4022 records

### Build Seat Timeline (Master Table)

In [None]:
# Build the seat timeline if we have service data
if not service_df.empty:
    seat_timeline_df = build_seat_timeline(service_df)
    print(f"Built seat timeline: {len(seat_timeline_df)} records")
    
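    # Preview the seat timeline (display() is needed because a bare
    # expression inside an if block is not rendered by the notebook)
    display(seat_timeline_df.head())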

2025-07-11 15:10:51 | INFO | build_seat_timeline - Building seat timeline table

KeyError: "['seat_id', 'court', 'commission_date', 'termination_date', 'termination_reason'] not in index"
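
A pandas `KeyError` of the form "[...] not in index" is raised when a list of column names is selected from a frame that does not contain them. One way to surface this before calling `build_seat_timeline` is a pre-flight column check. The sketch below is an assumption-laden illustration: `missing_columns` is a hypothetical helper, and `EXPECTED_COLUMNS` simply copies the names from the error message rather than from the function's source.

```python
from typing import Iterable, List

# Column names copied from the KeyError message; assumed (not verified) to be
# what build_seat_timeline selects.
EXPECTED_COLUMNS = [
    "seat_id",
    "court",
    "commission_date",
    "termination_date",
    "termination_reason",
]


def missing_columns(
    actual: Iterable[str], expected: Iterable[str] = EXPECTED_COLUMNS
) -> List[str]:
    """Return the expected column names absent from `actual`, in their given order."""
    present = set(actual)
    return [name for name in expected if name not in present]
```

Calling `missing_columns(service_df.columns)` and printing `list(service_df.columns)` next to the result would show whether the loaded CSV uses different header names than the ones being selected.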

In [None]:
# Create the predecessor lookup table
if 'seat_timeline_df' in locals():
    predecessor_lookup = get_predecessor_info(seat_timeline_df)
    print(f"Created predecessor lookup: {len(predecessor_lookup)} records")
    
    # Preview the predecessor lookup (display() is needed inside an if block)
    display(predecessor_lookup.head())

## 2. Congress.gov API Data

The Congress.gov API provides detailed information about judicial nominations, including:
- Nomination date
- Nominee information
- Confirmation status and date
- Committee actions
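
To make the shape of such a fetch concrete, here is a rough sketch of paging through a nomination list. It is not the implementation of `CongressAPIClient`: the base URL, the `/nomination/{congress}` path, the `offset`/`limit` query parameters, and the `"nominations"` response key are assumptions about the v3 API, and `nomination_url`, `iter_nominations`, and `fetch_json` are hypothetical names.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

BASE_URL = "https://api.congress.gov/v3"  # assumed base URL of the v3 API


def nomination_url(congress: int, api_key: str, offset: int = 0, limit: int = 250) -> str:
    """Build the URL for one page of nominations in a given Congress (assumed endpoint)."""
    query = urlencode(
        {"api_key": api_key, "format": "json", "offset": offset, "limit": limit}
    )
    return f"{BASE_URL}/nomination/{congress}?{query}"


def iter_nominations(
    congress: int,
    api_key: str,
    fetch_json: Callable[[str], dict],  # hypothetical hook: takes a URL, returns parsed JSON
    limit: int = 250,
) -> Iterator[dict]:
    """Yield nomination records page by page until a short page is returned."""
    offset = 0
    while True:
        page = fetch_json(nomination_url(congress, api_key, offset, limit))
        records = page.get("nominations", [])  # assumed response key
        yield from records
        if len(records) < limit:
            break  # a short (or empty) page means there is nothing left to fetch
        offset += limit
```

In practice `fetch_json` could wrap an HTTP client call that returns the decoded JSON body. Note that this sketch yields every nomination in a Congress; selecting only judicial nominations is left to whatever filter `get_judicial_nominations` applies.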

### Setup API Access

In [None]:
# Check if API key is available
api_key = os.environ.get("CONGRESS_API_KEY")
if not api_key:
    print("❌ Error: CONGRESS_API_KEY environment variable not set")
    print("Please set the CONGRESS_API_KEY environment variable to your Congress.gov API key")
    print("You can request an API key at: https://api.congress.gov/sign-up/")
else:
    print("✓ Congress API key found in environment variables")
    # Initialize the API client
    congress_client = CongressAPIClient(api_key)
    print("✓ Congress API client initialized")

### Fetch Judicial Nominations from Recent Congresses

In [None]:
# Fetch judicial nominations from recent congresses
# Congress numbering: 116th (2019-2021), 117th (2021-2023), 118th (2023-2025)

if 'congress_client' in locals():
    congresses = [118, 117, 116]  # Most recent three congresses
    all_nominations = []
    
    for congress in congresses:
        try:
            print(f"Fetching judicial nominations for the {congress}th Congress...")
            nominations = congress_client.get_judicial_nominations(congress)
            print(f"  ✓ Retrieved {len(nominations)} judicial nominations")
            all_nominations.extend(nominations)
        except Exception as e:
            print(f"  ❌ Error fetching nominations for {congress}th Congress: {e}")
    
    # Convert to DataFrame
    nominations_df = pd.DataFrame(all_nominations)
    print(f"\nTotal nominations retrieved: {len(nominations_df)}")
    
    # Preview the nominations (display() is needed because a bare expression
    # inside an if block is not rendered by the notebook)
    if not nominations_df.empty:
        display(nominations_df.head())

## 3. Crosswalk and Join Data Sources

Now we'll crosswalk the Congress.gov nomination data to the FJC seat timeline using the nomination-to-seat matching logic.
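
The matching rules themselves live in `crosswalk_congress_api` and are not shown in this notebook. Purely as an illustration of what a nomination-to-seat match can look like, the sketch below does an exact match on normalized nominee name and court. The function name `crosswalk_sketch` and the column names `nominee`, `judge_name`, `court`, and `seat_id` are assumptions; only `seat_match_method` is taken from the cell that follows.

```python
import pandas as pd


def crosswalk_sketch(nominations: pd.DataFrame, seats: pd.DataFrame) -> pd.DataFrame:
    """Attach a seat_id to each nomination by exact (name, court) match.

    Hypothetical column names: nominations has `nominee` and `court`;
    seats has `judge_name`, `court`, and `seat_id`.
    """

    def norm(values: pd.Series) -> pd.Series:
        # Normalize whitespace and case so trivial differences do not block a match
        return values.str.strip().str.lower()

    left = nominations.assign(
        _name=norm(nominations["nominee"]), _court=norm(nominations["court"])
    )
    right = seats.assign(
        _name=norm(seats["judge_name"]), _court=norm(seats["court"])
    )[["_name", "_court", "seat_id"]]

    # Left join keeps every nomination, matched or not
    merged = left.merge(right, on=["_name", "_court"], how="left")
    merged["seat_match_method"] = (
        merged["seat_id"].notna().map({True: "exact_name_court", False: "unmatched"})
    )
    return merged.drop(columns=["_name", "_court"])
```

A judge who appears on several seat rows would produce several output rows per nomination here, so a real crosswalk would need a tie-breaking rule (for example, on dates) that this sketch omits.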

In [None]:
from nomination_predictor.fjc_data import crosswalk_congress_api

# Crosswalk if we have all three datasets
if 'nominations_df' in locals() and 'seat_timeline_df' in locals() and 'judges_df' in locals():
    crosswalked_df = crosswalk_congress_api(
        nominations_df,
        seat_timeline_df,
        judges_df
    )
    
    print(f"Crosswalked nominations: {len(crosswalked_df)} records")
    print(f"Match statistics:\n{crosswalked_df['seat_match_method'].value_counts()}")
    
    # Preview crosswalked data (display() is needed inside an if block)
    display(crosswalked_df.head())

## 4. Create Master Dataset

Now we'll create the master dataset by joining the seat timeline with the crosswalked nominations data.
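
As a rough picture of such a join (not the actual body of `create_master_dataset`), a left merge keyed on a shared seat identifier keeps every seat and flags the ones without a nomination. The function name and the `seat_id` and `nomination_match` column names below are assumptions for illustration.

```python
import pandas as pd


def join_seats_with_nominations(
    seats: pd.DataFrame, nominations: pd.DataFrame
) -> pd.DataFrame:
    """Left-join nominations onto seats, assuming both frames share a `seat_id` column."""
    return seats.merge(
        nominations,
        on="seat_id",
        how="left",                    # keep every seat, even without a nomination
        indicator="nomination_match",  # "both" or "left_only" per output row
        suffixes=("_seat", "_nom"),    # disambiguate any other overlapping columns
    )
```

Counting `nomination_match` values afterwards gives a quick sanity check on how many seats have no matching nomination.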

In [None]:
from nomination_predictor.fjc_data import create_master_dataset

# Create master dataset if we have both datasets
if 'seat_timeline_df' in locals() and 'crosswalked_df' in locals():
    master_df = create_master_dataset(
        seat_timeline_df,
        crosswalked_df
    )
    
    print(f"Created master dataset: {len(master_df)} records")
    
    # Preview master dataset (display() is needed inside an if block)
    display(master_df.head())

## 5. Save Data to Raw Directory

Save the datasets to the raw data directory for use by downstream notebooks.

In [None]:
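# Make sure the output directory exists before writing
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)

# Save seat timeline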
if 'seat_timeline_df' in locals() and not seat_timeline_df.empty:
    output_path = RAW_DATA_DIR / "seat_timeline.csv"
    seat_timeline_df.to_csv(output_path, index=False)
    print(f"✓ Saved seat timeline to {output_path}")

# Save crosswalked nominations
if 'crosswalked_df' in locals() and not crosswalked_df.empty:
    output_path = RAW_DATA_DIR / "crosswalked_nominations.csv"
    crosswalked_df.to_csv(output_path, index=False)
    print(f"✓ Saved crosswalked nominations to {output_path}")
    
# Save master dataset
if 'master_df' in locals() and not master_df.empty:
    output_path = RAW_DATA_DIR / "master_dataset.csv"
    master_df.to_csv(output_path, index=False)
    print(f"✓ Saved master dataset to {output_path}")

## Summary

In this notebook, we have:

1. Loaded Federal Judicial Center (FJC) data, the canonical source for judicial seats and judges
2. Built the seat timeline as our master table
3. Fetched judicial nominations from the Congress.gov API
4. Crosswalked the nomination data to FJC seat IDs
5. Created a master dataset joining these sources
6. Saved all datasets to the raw data directory for further processing by downstream notebooks

The next notebook (`1.00-nw-data-cleaning-feature-creation.ipynb`) loads these datasets, cleans them, and engineers features for modeling.