# FJC-Centric Pipeline Exploration

This notebook demonstrates the new Federal Judicial Center (FJC)-centric data pipeline for the nomination predictor project. FJC data serves as the canonical source for:
- Judicial seat timelines
- Judge demographics
- Nomination failures

This notebook will:
1. Load and validate FJC CSV files
2. Build the seat timeline as the master table
3. Process judge demographic data
4. Create the predecessor lookup table
5. Fetch judicial nominations from the Congress.gov API
6. Crosswalk Congress.gov data with FJC seat IDs
7. Create a master dataset by joining these sources
8. Visualize and validate data quality

## Setup

In [None]:
import os
import sys
from pathlib import Path

import pandas as pd
import numpy as np
from loguru import logger
import matplotlib.pyplot as plt

# Add the project root to the path so we can import our modules
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from nomination_predictor.config import (
    EXTERNAL_DATA_DIR,
    RAW_DATA_DIR,
    INTERIM_DATA_DIR,
    PROCESSED_DATA_DIR
)
from nomination_predictor.fjc_data import FJC_DATA_DIR, crosswalk_congress_api
from nomination_predictor.fjc_processor import (
    validate_data_files,
    process_fjc_data,
    test_date_parsing
)
from nomination_predictor.congress_api import CongressAPIClient

# Setup logging
logger.remove()  # Remove default handler
logger.add(sys.stderr, format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{function}</cyan> - <level>{message}</level>", level="INFO")

## 1. FJC Data Validation

First, we'll check if the required FJC data files exist in the expected directory.

In [None]:
# Check for required FJC data files
valid_files = validate_data_files()

if not valid_files:
    print("\nPlease download the required FJC data files from:")
    print("  - https://www.fjc.gov/history/judges/biographical-directory-federal-judges-export")
    print("  - https://www.fjc.gov/history/judges/diversity-bench")
    print(f"\nAnd place them in: {FJC_DATA_DIR}")
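As a rough illustration of what a file-presence check involves, the sketch below reports which expected files are absent from a directory. The helper `find_missing_files` and the file names used here are hypothetical; this is not the project's `validate_data_files` implementation, and the real FJC export names may differ.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_missing_files(data_dir: Path, required: Iterable[str]) -> List[str]:
    """Return the names in `required` that are not regular files in `data_dir`."""
    return [name for name in required if not (data_dir / name).is_file()]


# Usage on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "judges.csv").write_text("nid\n1\n")
    missing = find_missing_files(tmp_dir, ["judges.csv", "demographics.csv"])
    print(missing)
```

Returning the list of missing names, rather than a bare boolean, makes it easier to tell the user exactly which download is still needed.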

### Date Parsing Testing

The FJC data contains dates in various formats, including pre-1900 dates which require special handling. Let's test our date parsing functionality.

In [None]:
# Demonstrate date parsing functionality
test_date_parsing()
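For readers without access to the project module, here is a minimal sketch of one way to parse mixed-format dates that works for pre-1900 values. It assumes four-digit years and only the two formats listed; `parse_fjc_date` and `CANDIDATE_FORMATS` are illustrative names, not the project's actual parser, and the real FJC files may use other formats.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats for illustration; extend to match the real data.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_fjc_date(raw: Optional[str]) -> Optional[date]:
    """Try each candidate format in turn; return None if nothing matches."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            # strptime accepts any four-digit year, so 18th- and 19th-century
            # dates parse the same way as modern ones.
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


print(parse_fjc_date("1789-09-26"))
print(parse_fjc_date("09/26/1789"))
print(parse_fjc_date("not a date"))
```

Returning `None` for unparseable input keeps bad values visible as missing data instead of raising halfway through a load.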

## 2. Process FJC Data

Now we'll process the FJC data files to create the seat timeline, judge data, and predecessor lookup table. We'll run this in validation mode first to check for data quality issues.

In [None]:
# Process FJC data in validation mode (doesn't save files)
seat_timeline_df, judges_df, predecessor_df = process_fjc_data(
    output_dir=INTERIM_DATA_DIR,
    validate_mode=True
)

## 3. Explore FJC Data

### Explore the Seat Timeline

The seat timeline is our master table, with one row per incumbent-seat tenure.

In [None]:
# Explore the seat timeline
if seat_timeline_df is not None:
    print(f"Seat timeline shape: {seat_timeline_df.shape}")
    print(f"Number of unique seats: {seat_timeline_df['seat_id'].nunique()}")
    print("\nSample records:")
    display(seat_timeline_df.head())
    
    # Basic column info
    print("\nColumn data types:")
    display(seat_timeline_df.dtypes)
    
    # Check for missing values
    print("\nMissing values by column:")
    missing = seat_timeline_df.isna().sum().to_frame(name='Missing')
    missing['Percent'] = missing['Missing'] / len(seat_timeline_df) * 100
    display(missing[missing['Missing'] > 0].sort_values('Missing', ascending=False))
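Because each row is one incumbent-seat tenure, a useful sanity check is that tenures within the same seat do not overlap. The sketch below runs that check on toy data. It uses the `seat_id`, `commission_date`, and `termination_date` columns seen in this notebook; the `judge_name` column and the helper `find_overlapping_tenures` are hypothetical.

```python
import pandas as pd


def find_overlapping_tenures(timeline: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose commission date precedes the previous tenure's end."""
    ordered = timeline.sort_values(["seat_id", "commission_date"])
    # End date of the preceding tenure in the same seat (NaT for the first one).
    previous_end = ordered.groupby("seat_id")["termination_date"].shift()
    # Comparisons against NaT are False, so first tenures are never flagged.
    return ordered[ordered["commission_date"] < previous_end]


toy_timeline = pd.DataFrame(
    {
        "seat_id": ["S1", "S1", "S2"],
        "judge_name": ["A", "B", "C"],  # hypothetical column
        "commission_date": pd.to_datetime(["1990-01-01", "2000-06-01", "1995-03-15"]),
        "termination_date": pd.to_datetime(["2000-12-31", "2010-05-01", None]),
    }
)

overlaps = find_overlapping_tenures(toy_timeline)
print(overlaps[["seat_id", "judge_name"]])
```

Here judge B's commission date falls before judge A's termination date in seat S1, so B's row is the only one flagged.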

### Visualize Judicial Tenures

Let's create some visualizations to understand the tenure patterns.

In [None]:
if seat_timeline_df is not None:
    # Calculate tenure length in years
    seat_timeline_df['tenure_years'] = (seat_timeline_df['termination_date'] - seat_timeline_df['commission_date']).dt.days / 365.25
    
    # Plot tenure distribution
    plt.figure(figsize=(12, 6))
    plt.hist(seat_timeline_df['tenure_years'].dropna(), bins=50, alpha=0.7, color='steelblue')
    plt.axvline(seat_timeline_df['tenure_years'].median(), color='red', linestyle='--', 
                label=f'Median: {seat_timeline_df["tenure_years"].median():.1f} years')
    plt.title('Distribution of Judicial Tenures')
    plt.xlabel('Tenure (years)')
    plt.ylabel('Number of Tenures')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.show()
    
    # Print summary statistics
    print("Tenure length statistics (years):")
    print(seat_timeline_df['tenure_years'].describe())
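Rows without a `termination_date` produce a missing `tenure_years` and drop out of the histogram above. If those rows represent judges who are still serving (an assumption; the source data may use missing values for other reasons), one option is to measure their tenure up to a chosen cutoff date. The helper `tenure_years` below is a hypothetical sketch of that idea.

```python
from typing import Optional

import pandas as pd


def tenure_years(df: pd.DataFrame, as_of: Optional[str] = None) -> pd.Series:
    """Tenure in years; open-ended tenures are measured to `as_of` if given."""
    end = df["termination_date"]
    if as_of is not None:
        # Treat a missing termination date as "still serving on `as_of`".
        end = end.fillna(pd.Timestamp(as_of))
    return (end - df["commission_date"]).dt.days / 365.25


toy_tenures = pd.DataFrame(
    {
        "commission_date": pd.to_datetime(["2000-01-01", "2015-01-01"]),
        "termination_date": pd.to_datetime(["2010-01-01", None]),
    }
)

closed_only = tenure_years(toy_tenures)
with_cutoff = tenure_years(toy_tenures, as_of="2020-01-01")
print(closed_only.round(2).tolist())
print(with_cutoff.round(2).tolist())
```

Tenures measured this way are right-censored, so they should be labeled as such if they are mixed with completed tenures in summary statistics.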

### Explore Judge Demographics

Let's examine the judge demographic data, if available.

In [None]:
if judges_df is not None:
    print(f"Judges data shape: {judges_df.shape}")
    print("\nSample records:")
    display(judges_df.head())
    
    # Check if demographic columns are present
    demographic_cols = ['gender', 'race', 'birth_year']
    available_demo_cols = [col for col in demographic_cols if col in judges_df.columns]
    
    if available_demo_cols:
        print("\nDemographic information available:")
        for col in available_demo_cols:
            print(f"\n{col.title()} distribution:")
            if col in ['gender', 'race']:
                display(judges_df[col].value_counts(dropna=False))
            elif col == 'birth_year':
                plt.figure(figsize=(12, 6))
                judges_df['birth_year'].hist(bins=30, alpha=0.7, color='steelblue')
                plt.title('Distribution of Judge Birth Years')
                plt.xlabel('Birth Year')
                plt.ylabel('Number of Judges')
                plt.grid(alpha=0.3)
                plt.show()
    else:
        print("\nNo demographic information available in the judges data.")

### Examine Predecessor Lookup Table

The predecessor lookup table is crucial for crosswalking with other data sources.

In [None]:
if predecessor_df is not None:
    print(f"Predecessor lookup shape: {predecessor_df.shape}")
    print(f"Number of unique seats: {predecessor_df['seat_id'].nunique()}")
    print("\nSample records:")
    display(predecessor_df.head())
    
    # Check for missing values
    print("\nMissing values by column:")
    missing = predecessor_df.isna().sum().to_frame(name='Missing')
    missing['Percent'] = missing['Missing'] / len(predecessor_df) * 100
    display(missing[missing['Missing'] > 0].sort_values('Missing', ascending=False))
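To make the idea of a predecessor lookup concrete, the sketch below derives one from a toy seat timeline by pairing each tenure with the one before it in the same seat. This is only an illustration: `build_predecessor_lookup` and the `judge_name` and `predecessor` columns are hypothetical, and the table produced by `process_fjc_data` may be built and shaped differently.

```python
import pandas as pd


def build_predecessor_lookup(timeline: pd.DataFrame) -> pd.DataFrame:
    """Pair each tenure with the previous incumbent of the same seat."""
    ordered = timeline.sort_values(["seat_id", "commission_date"]).copy()
    # Shift within each seat so a row sees the incumbent who came before it.
    ordered["predecessor"] = ordered.groupby("seat_id")["judge_name"].shift()
    return ordered[["seat_id", "judge_name", "predecessor"]].reset_index(drop=True)


toy_seats = pd.DataFrame(
    {
        "seat_id": ["S1", "S2", "S1"],
        "judge_name": ["A", "C", "B"],  # hypothetical column
        "commission_date": pd.to_datetime(["1990-01-01", "1995-03-15", "2001-06-01"]),
    }
)

lookup = build_predecessor_lookup(toy_seats)
print(lookup)
```

The first incumbent of each seat has no predecessor, so those rows carry a missing value in the `predecessor` column.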

## 4. Congress.gov API Integration

Now we'll integrate Congress.gov API data to supplement our FJC data with nomination details.

### Setup API Access

In [None]:
# Check if API key is available
api_key = os.environ.get("CONGRESS_API_KEY")
if not api_key:
    print("❌ Error: CONGRESS_API_KEY environment variable not set")
    print("Please set the CONGRESS_API_KEY environment variable to your Congress.gov API key")
    print("You can request an API key at: https://api.congress.gov/sign-up/")
else:
    print("✓ Congress API key found in environment variables")
    # Initialize the API client
    congress_client = CongressAPIClient(api_key)
    print("✓ Congress API client initialized")

### Fetch Judicial Nominations from Recent Congresses

In [None]:
# Fetch judicial nominations from recent congresses
# Congress numbering: 116th (2019-2021), 117th (2021-2023), 118th (2023-2025)

if 'congress_client' in locals():
    congresses = [118, 117, 116]  # Most recent three congresses
    all_nominations = []
    
    for congress in congresses:
        try:
            print(f"Fetching judicial nominations for the {congress}th Congress...")
            nominations = congress_client.get_judicial_nominations(congress)
            print(f"  ✓ Retrieved {len(nominations)} judicial nominations")
            all_nominations.extend(nominations)
        except Exception as e:
            print(f"  ❌ Error fetching nominations for the {congress}th Congress: {e}")
    
    # Convert to DataFrame
    nominations_df = pd.DataFrame(all_nominations)
    print(f"\nTotal nominations retrieved: {len(nominations_df)}")
    
    # Preview the nominations
    if not nominations_df.empty:
        display(nominations_df.head())
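A single Congress can have more nominations than one API response returns, so a client typically has to page through results. The sketch below shows offset-based paging with the network call injected as a function, which keeps it runnable offline. It assumes each page is a dict with a `"nominations"` list and that pages are requested by offset and limit; `iter_nominations` and `fake_fetch` are hypothetical and do not describe how `CongressAPIClient` works internally.

```python
from typing import Callable, Dict, Iterator, List

# Assumed page shape for this sketch: {"nominations": [ {...}, {...} ]}
Page = Dict[str, List[dict]]


def iter_nominations(
    congress: int,
    fetch_page: Callable[[int, int, int], Page],
    page_size: int = 250,
) -> Iterator[dict]:
    """Yield nomination records page by page until a short page is returned."""
    offset = 0
    while True:
        page = fetch_page(congress, offset, page_size)
        items = page.get("nominations", [])
        yield from items
        if len(items) < page_size:
            break  # a short (or empty) page means there is nothing left
        offset += page_size


def fake_fetch(congress: int, offset: int, limit: int) -> Page:
    """Stand-in for an HTTP request; serves five made-up records."""
    data = [{"congress": congress, "number": n} for n in range(5)]
    return {"nominations": data[offset:offset + limit]}


records = list(iter_nominations(118, fake_fetch, page_size=2))
print(len(records))
```

Injecting `fetch_page` also makes it straightforward to add retries or rate limiting around the real request without touching the paging loop.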

## 5. Crosswalk and Join Data Sources

Now we'll crosswalk the Congress.gov nomination data to the FJC seat timeline using the nomination-to-seat matching logic.

In [None]:
# Crosswalk if we have both datasets
if 'nominations_df' in locals() and not nominations_df.empty and seat_timeline_df is not None and judges_df is not None:
    crosswalked_df = crosswalk_congress_api(
        nominations_df,
        seat_timeline_df,
        judges_df
    )
    
    print(f"Crosswalked nominations: {len(crosswalked_df)} records")
    print(f"Match statistics:\n{crosswalked_df['seat_match_method'].value_counts()}")
    
    # Preview crosswalked data
    display(crosswalked_df.head())
else:
    print("Cannot perform crosswalk - missing one or more required datasets")
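One plausible matching strategy, shown here only as a sketch, is to pull the predecessor's name out of the nomination description and look it up in a predecessor table. This assumes descriptions follow a "vice NAME," convention and that the lookup has `predecessor_name` and `seat_id` columns. Both are assumptions, as are the helpers `extract_predecessor` and `match_seats`; the actual logic inside `crosswalk_congress_api` may differ.

```python
import re
from typing import Optional

import pandas as pd

# Assumed convention: "... vice John Roe, retired."
VICE_PATTERN = re.compile(r"\bvice\s+(?P<name>[^,]+),", re.IGNORECASE)


def extract_predecessor(description: object) -> Optional[str]:
    """Return the name following 'vice', or None if absent or not a string."""
    if not isinstance(description, str):
        return None
    match = VICE_PATTERN.search(description)
    return match.group("name").strip() if match else None


def match_seats(nominations: pd.DataFrame, predecessors: pd.DataFrame) -> pd.DataFrame:
    """Attach seat_id by predecessor name and record how each row was matched."""
    out = nominations.copy()
    out["predecessor_name"] = out["description"].map(extract_predecessor)
    lookup = predecessors.drop_duplicates("predecessor_name").set_index(
        "predecessor_name"
    )["seat_id"]
    out["seat_id"] = out["predecessor_name"].map(lookup)
    out["seat_match_method"] = out["seat_id"].notna().map(
        {True: "predecessor_name", False: "unmatched"}
    )
    return out


toy_nominations = pd.DataFrame(
    {
        "description": [
            "Jane Doe, of Ohio, to be United States District Judge, vice John Roe, retired.",
            "Ann Lee, of Texas, to be United States District Judge, vice Sam Poe, deceased.",
            "Max Ray, of Utah, to be United States District Judge. (New Position)",
        ]
    }
)
toy_predecessors = pd.DataFrame({"predecessor_name": ["John Roe"], "seat_id": ["S1"]})

matched = match_seats(toy_nominations, toy_predecessors)
print(matched[["predecessor_name", "seat_id", "seat_match_method"]])
```

Exact name matching is brittle (suffixes, middle initials, spelling variants), so a real crosswalk would likely need normalization or additional keys such as court and date.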

### Analyze Crosswalk Quality

Let's assess the quality of our crosswalk by examining match rates and potential issues.

In [None]:
if 'crosswalked_df' in locals() and not crosswalked_df.empty:
    # Calculate match rate
    match_rate = (crosswalked_df['seat_match_method'] != 'unmatched').mean() * 100
    print(f"Match rate: {match_rate:.2f}%")
    
    # Examine unmatched nominations
    unmatched = crosswalked_df[crosswalked_df['seat_match_method'] == 'unmatched']
    if len(unmatched) > 0:
        print(f"\nSample of unmatched nominations ({len(unmatched)} total):")
        display(unmatched.head())
        
        # Check patterns in unmatched nominations
        if 'description' in unmatched.columns:
            print("\nCommon patterns in unmatched nomination descriptions:")
            display(unmatched['description'].value_counts().head(10))

## 6. Create Master Dataset

Now we'll create the master dataset by joining the seat timeline with the crosswalked nominations data.

In [None]:
from nomination_predictor.fjc_data import create_master_dataset

# Create master dataset if we have both datasets
if seat_timeline_df is not None and 'crosswalked_df' in locals() and not crosswalked_df.empty:
    master_df = create_master_dataset(
        seat_timeline_df,
        crosswalked_df
    )
    
    print(f"Created master dataset: {len(master_df)} records")
    
    # Preview master dataset
    display(master_df.head())
else:
    print("Cannot create master dataset - missing required input datasets")
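To illustrate the kind of join involved, the sketch below left-joins a toy seat table to toy nominations on `seat_id` and uses the merge indicator from pandas to flag which seats found a nomination. The helper `attach_nominations` and the `incumbent`, `citation`, and `has_nomination` columns are hypothetical; `create_master_dataset` may use different keys and join rules.

```python
import pandas as pd


def attach_nominations(seat_timeline: pd.DataFrame, crosswalked: pd.DataFrame) -> pd.DataFrame:
    """Left-join nominations onto seats and flag rows that found a match."""
    merged = seat_timeline.merge(crosswalked, on="seat_id", how="left", indicator=True)
    # "_merge" is "both" when a nomination row matched, "left_only" otherwise.
    merged["has_nomination"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


toy_seat_table = pd.DataFrame(
    {"seat_id": ["S1", "S2", "S3"], "incumbent": ["A", "B", "C"]}
)
toy_crosswalk = pd.DataFrame({"seat_id": ["S1", "S3"], "citation": ["PN1", "PN2"]})

toy_master = attach_nominations(toy_seat_table, toy_crosswalk)
print(toy_master)
```

A left join keeps every seat row even when no nomination matched, which makes coverage gaps easy to count. Note that joining on `seat_id` alone would duplicate rows once a seat has several tenures or nominations, so a real join would also need a time or person key.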

## 7. Save Processed Data

Now we can save our processed datasets for use in downstream notebooks.

In [None]:
# Ask the user if they want to save the processed data
save_data = input("Save processed data files? (y/n): ").strip().lower() == 'y'

if save_data:
    # Save seat timeline
    if seat_timeline_df is not None and not seat_timeline_df.empty:
        output_path = PROCESSED_DATA_DIR / "seat_timeline.csv"
        seat_timeline_df.to_csv(output_path, index=False)
        print(f"✓ Saved seat timeline to {output_path}")

    # Save crosswalked nominations
    if 'crosswalked_df' in locals() and not crosswalked_df.empty:
        output_path = PROCESSED_DATA_DIR / "crosswalked_nominations.csv"
        crosswalked_df.to_csv(output_path, index=False)
        print(f"✓ Saved crosswalked nominations to {output_path}")

    # Save master dataset
    if 'master_df' in locals() and not master_df.empty:
        output_path = PROCESSED_DATA_DIR / "master_dataset.csv"
        master_df.to_csv(output_path, index=False)
        print(f"✓ Saved master dataset to {output_path}")

    # Save judges data
    if judges_df is not None and not judges_df.empty:
        output_path = PROCESSED_DATA_DIR / "judges.csv"
        judges_df.to_csv(output_path, index=False)
        print(f"✓ Saved judges data to {output_path}")
else:
    print("Skipped saving data files.")
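CSV files do not store column types, so date columns come back as plain strings unless they are parsed again on load. The sketch below round-trips a toy table through a temporary file and restores the date columns with `parse_dates`. The helper `save_and_reload` and the file name are illustrative only.

```python
import tempfile
from pathlib import Path
from typing import List

import pandas as pd


def save_and_reload(df: pd.DataFrame, path: Path, date_columns: List[str]) -> pd.DataFrame:
    """Write `df` to CSV, then read it back with the date columns re-parsed."""
    df.to_csv(path, index=False)
    return pd.read_csv(path, parse_dates=date_columns)


toy_export = pd.DataFrame(
    {
        "seat_id": ["S1", "S2"],
        "commission_date": pd.to_datetime(["1990-01-01", "1995-03-15"]),
        "termination_date": pd.to_datetime(["2000-12-31", None]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_reload(
        toy_export,
        Path(tmp) / "seat_timeline.csv",
        ["commission_date", "termination_date"],
    )

print(reloaded.dtypes)
```

Downstream notebooks that read these files would need the same `parse_dates` list, or a format such as Parquet that preserves dtypes.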

## Summary

In this notebook, we have:

1. Loaded and validated Federal Judicial Center (FJC) data files
2. Built the seat timeline as our master table for judicial tenures
3. Explored judge demographics from the FJC
4. Created the predecessor lookup table for crosswalking
5. Fetched and processed judicial nominations from the Congress.gov API
6. Crosswalked the nomination data to FJC seat IDs
7. Created a master dataset joining these sources, after checking crosswalk match rates
8. Saved the processed data for downstream use

This FJC-centric pipeline provides a more reliable, comprehensive, and maintainable approach to judicial nomination data than the previous HTML/PDF scraping approach. The next notebook (1.00-nw-data-cleaning-feature-creation.ipynb) will use these datasets for feature engineering and modeling preparation.