# Judicial Vacancies Data Source Exploration

This notebook demonstrates how to use the `dataset` module to fetch and process judicial vacancy data.

## Overview

We'll:
1. Import the necessary modules
2. Fetch HTML data from the judicial vacancies archive
3. Extract vacancy data from the HTML
4. Convert the data to a pandas DataFrame
5. Save the raw data to a CSV file
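
The steps above can be sketched end to end with the standard library alone. This is a minimal illustration, not the `dataset` module's implementation: the sample HTML, the column names, and the helpers `TableRows`, `extract_records`, and `records_to_csv` are all hypothetical, and the real archive markup may differ.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up stand-in for a fetched archive page (step 2); real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Court</th><th>Vacancy Date</th><th>Nominee</th></tr>
  <tr><td>Court A</td><td>01/15/2024</td><td>Jane Doe</td></tr>
  <tr><td>Court B</td><td>03/01/2024</td><td></td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Step 3: turn the first row into keys and the rest into dicts."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    keys = [h.lower().replace(" ", "_") for h in header]
    return [dict(zip(keys, row)) for row in body]


def records_to_csv(records):
    """Step 5: serialize the records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = extract_records(SAMPLE_HTML)
csv_text = records_to_csv(records)
print(csv_text)
```

For step 4, the list of dicts can be passed directly to `pd.DataFrame(records)`.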

## Setup

In [37]:
import sys
!{sys.executable} -m pip list | grep nomination_predictor

nomination_predictor      0.0.1             /home/wsl2ubuntuuser/nomination_predictor
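
The same check can be done without shelling out to `pip` and `grep`, which also works on systems without `grep`. A minimal sketch using `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("nomination_predictor")
print(version if version else "nomination_predictor is not installed in this kernel")
```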


In [None]:
%load_ext autoreload
%autoreload 2

import sys
import os
from pathlib import Path
import pandas as pd
import typer
from datetime import datetime

# Import our data processing module
from nomination_predictor import dataset
from nomination_predictor.config import RAW_DATA_DIR

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
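# Set up paths (assumes the notebook is run from the notebooks/ directory)
PROJECT_ROOT = Path.cwd().parent
DATA_RAW = PROJECT_ROOT / 'data' / 'raw'
DATA_RAW.mkdir(parents=True, exist_ok=True)
print(f"Notebook directory: {Path.cwd()}")
print(f"Project root: {PROJECT_ROOT}")
print(f"Data raw directory: {DATA_RAW}")
print(f"Data raw exists: {DATA_RAW.exists()}")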

Notebook directory: /home/wsl2ubuntuuser/nomination_predictor/notebooks
Project root: /home/wsl2ubuntuuser/nomination_predictor
Data raw directory: /home/wsl2ubuntuuser/nomination_predictor/data/raw
Data raw exists: True
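
`Path.cwd().parent` only points at the project root when the kernel starts in `notebooks/`. One possible alternative is to walk upward until a marker file is found. This is a sketch: the helper `find_project_root` is hypothetical, and using `pyproject.toml` as the marker is an assumption about the repository layout.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="pyproject.toml"):
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "pyproject.toml").write_text("")
    notebooks = root / "notebooks"
    notebooks.mkdir()
    found = find_project_root(notebooks)
    found_matches_root = found == root

print(found_matches_root)
```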


In [None]:
# Verify the data directory exists and is writable
if not DATA_RAW.exists():
    print(f"Error: Data directory does not exist: {DATA_RAW}")
elif not os.access(DATA_RAW, os.W_OK):
    print(f"Error: No write permission for directory: {DATA_RAW}")
else:
    print(f"Data directory is ready: {DATA_RAW}")

Data directory is ready: /home/wsl2ubuntuuser/nomination_predictor/data/raw
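
`os.access` reports what the permission bits allow, but a write can still fail afterwards (the Python documentation notes this for network filesystems in particular). A stricter probe is to actually create a temporary file. A minimal sketch; the helper name `is_writable` is hypothetical.

```python
import tempfile
from pathlib import Path


def is_writable(directory):
    """Return True if a temporary file can actually be created in `directory`."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers missing directories, permission errors, and read-only mounts.
        return False


temp_ok = is_writable(tempfile.gettempdir())
missing_ok = is_writable(Path(tempfile.gettempdir()) / "no-such-dir-xyz-123")
print(temp_ok, missing_ok)
```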


## 1. Fetch and Process Data

Let's fetch the data for a configurable number of past years (`years_back`) and load the resulting files.

In [None]:
def load_judicial_data(years_back=15, force_refresh=False):
    """
    Load judicial data using the dataset module.
    
    Args:
        years_back: Number of years of historical data to fetch
        force_refresh: If True, force refetching data even if files exist
        
    Returns:
        Tuple of (vacancies_df, confirmations_df, emergencies_df, combined_df)
    """
    output_file = DATA_RAW / "judicial_data.csv"
    
    # Only run the pipeline if output doesn't exist or force_refresh is True
    if force_refresh or not output_file.exists():
        print("Running data pipeline...")
        dataset.main(
            output_dir=DATA_RAW,
            output_filename=output_file.name,
            years_back=years_back
        )
    
    # Load individual datasets
    vacancies_path = DATA_RAW / "judicial_vacancies.csv"
    confirmations_path = DATA_RAW / "judicial_confirmations.csv"
    emergencies_path = DATA_RAW / "judicial_emergencies.csv"
    
    # Read the data
    combined_df = pd.read_csv(output_file, sep='|') if output_file.exists() else None
    vacancies_df = pd.read_csv(vacancies_path, sep='|') if vacancies_path.exists() else None
    confirmations_df = pd.read_csv(confirmations_path, sep='|') if confirmations_path.exists() else None
    emergencies_df = pd.read_csv(emergencies_path, sep='|') if emergencies_path.exists() else None
    
    return vacancies_df, confirmations_df, emergencies_df, combined_df
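
The "run the pipeline only if the output is missing or `force_refresh` is set" logic in `load_judicial_data` can be isolated and tested on its own. The sketch below uses a hypothetical helper `load_or_build` and a fake pipeline in place of `dataset.main`; it illustrates the caching pattern only.

```python
import tempfile
from pathlib import Path


def load_or_build(path, build, force_refresh=False):
    """Return the text at `path`, calling `build(path)` first if needed."""
    path = Path(path)
    if force_refresh or not path.exists():
        build(path)
    return path.read_text()


calls = []


def fake_pipeline(path):
    """Stand-in for the real pipeline: records the call and writes a header."""
    calls.append(path)
    path.write_text("court|vacancy_date\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "judicial_data.csv"
    load_or_build(target, fake_pipeline)                      # builds: file is missing
    load_or_build(target, fake_pipeline)                      # cached: no rebuild
    load_or_build(target, fake_pipeline, force_refresh=True)  # rebuilds on request

print(len(calls))  # 2
```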

In [None]:
print("Loading judicial data...")
vacancies_df, confirmations_df, emergencies_df, combined_df = load_judicial_data(
    years_back=1,  # Adjust as needed
    force_refresh=False  # Set to True to refetch data
)

In [None]:
print("\nData Summary:")
print(f"Vacancies: {len(vacancies_df) if vacancies_df is not None else 0} records")
print(f"Confirmations: {len(confirmations_df) if confirmations_df is not None else 0} records")
print(f"Emergencies: {len(emergencies_df) if emergencies_df is not None else 0} records")
print(f"Combined: {len(combined_df) if combined_df is not None else 0} records")

In [None]:
if combined_df is not None:
    print("\nSample data from combined dataset:")
    display(combined_df.head())
    
    # Basic statistics
    print("\nBasic Statistics:")
    if 'vacancy_date' in combined_df.columns:
        print("\nDate Range:")
        print(f"Earliest vacancy: {combined_df['vacancy_date'].min()}")
        print(f"Latest vacancy: {combined_df['vacancy_date'].max()}")
    
    if 'circuit_district' in combined_df.columns:
        print("\nRecords by Circuit/District:")
        print(combined_df['circuit_district'].value_counts().head(10))
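
Because the CSV files are read without date parsing, `vacancy_date` holds strings, and `min()`/`max()` on strings compare lexicographically rather than chronologically. The sketch below shows the difference on made-up values; the `MM/DD/YYYY` format is an assumption about the data, not something the notebook confirms.

```python
import pandas as pd

# Made-up sample values; the MM/DD/YYYY format is an assumption.
raw = pd.Series(["01/15/2024", "12/01/2023", None, "not a date"])

# String comparison picks the value that sorts first alphabetically.
string_min = raw.dropna().min()

# Parsing first gives a chronological minimum; unparseable values become NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
earliest = parsed.min()

print(string_min)
print(earliest)
```

If the stored dates turn out to be ISO formatted (`YYYY-MM-DD`), string order and chronological order agree, but parsing still makes missing and malformed values explicit.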

In [None]:
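# Fetch a single year for now; the archive covers 1981 to the current year.
# TODO: expand the year range once the code handles this smaller subset.
year_to_fetch = 2024  # inferred from the output below
# NOTE: fetch_and_process_years is not defined in this notebook; it must already exist in the session.
df = fetch_and_process_years(start_year=year_to_fetch, end_year=year_to_fetch)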


if df is not None:
    print('\nFirst few records:')
    display(df.head())
    
    # Basic summary
    print('\nRecords per year:')
    print(df['year'].value_counts().sort_index())

Fetching data for years 2024 to 2024...
Found 45 archive URLs
Processing year 2024...


  - Fetching judicial emergencies for december 2024...
  - Fetching vacancy summary for december 2024...
  - Fetching judicial vacancy list for december 2024...
  - Fetching future judicial vacancies for december 2024...
  - Fetching judicial confirmations for december 2024...
  - Fetching judicial emergencies for november 2024...
  - Processed 6 records for 2024
  - Fetching vacancy summary for november 2024...
  - Processed 8 records for 2024
  - Processed 21 records for 2024
  - Fetching judicial vacancy list for november 2024...
  - Fetching future judicial vacancies for november 2024...
  - Processed 37 records for 2024
  - Fetching judicial confirmations for november 2024...
  - Processed 23 records for 2024
  - Fetching judicial emergencies for october 2024...
  - Processed 141 records for 2024

  - Fetching vacancy summary for october 2024...
  - Processed 6 records for 2024
  - Fetching judicial vacancy list for october 2024...
  - Processed 19 records for 2024
  - Fetching fut

|   | court | vacancy_date | status | nominating_president | nominee | year | month |
|---|-------|--------------|--------|----------------------|---------|------|-------|
| 0 | US Court of Appeals | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
| 1 | US District Courts (includes territorial courts*) | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
| 2 | US Court of International Trade | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
| 3 | US Court of Federal Claims | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
| 4 | US Supreme Court | NaT |  |  |  | 2024 | vacancy summary for december 2024 |



Records per year:
year
2024    2380
Name: count, dtype: int64
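
The "Fetching" and "Processed" lines in the log above are interleaved, which suggests the reports are fetched concurrently and each worker prints as it goes. This is an inference from the output, not something the notebook confirms. If that is the case, one way to get readable logs is to return messages from the workers and print them from the main thread. In the sketch below, `fetch_report` is a hypothetical stand-in and the record counts are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

REPORTS = ["judicial emergencies", "vacancy summary", "judicial confirmations"]


def fetch_report(name):
    """Hypothetical stand-in for a network fetch; returns a message and a placeholder count."""
    return f"Fetched {name}", len(name)


with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map yields results in input order, regardless of completion order.
    results = list(pool.map(fetch_report, REPORTS))

for message, count in results:
    print(f"  - {message}: processed {count} records")
```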


In [None]:
if df is not None:
    print('\nFirst few records for dataframe:')
    display(df.head())


First few records for dataframe:


|   | court | vacancy_date | status | nominating_president | nominee | year | month |
|---|-------|--------------|--------|----------------------|---------|------|-------|
| 0 | US Court of Appeals | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
| 1 | US District Courts (includes territorial courts*) | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
| 2 | US Court of International Trade | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
| 3 | US Court of Federal Claims | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
| 4 | US Supreme Court | NaT |  |  |  | 2024 | vacancy summary for december 2024 |
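
The `month` column in these rows holds a combined label such as `vacancy summary for december 2024`, mixing the report type with the month and year. A small parser can split it into separate fields. This is a sketch: the helper `split_label` and the pattern are hypothetical and based only on the labels visible in the output above.

```python
import re

# "<report type> for <month name> <4-digit year>"
LABEL_PATTERN = re.compile(r"^(?P<report>.+) for (?P<month>[a-z]+) (?P<year>\d{4})$")


def split_label(label):
    """Split a combined label into (report_type, month, year), or None if it does not match."""
    match = LABEL_PATTERN.match(label.strip().lower())
    if match is None:
        return None
    return match["report"], match["month"], int(match["year"])


print(split_label("vacancy summary for december 2024"))
# ('vacancy summary', 'december', 2024)
```

Applied with `df['month'].map(split_label)`, this would make it possible to filter out summary rows, which is where the `NaT` vacancy dates above come from.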


## 2. Save Raw Data

Save the raw data to a CSV file in the `data/raw` directory.

In [None]:
def save_raw_data(df, year):
    """Save the raw data to a CSV file."""
    if df is None or df.empty:
        print('No data to save.')
        return
    
    filename = DATA_RAW / f'judicial_vacancies_{year}.csv'
    try:
        dataset.save_to_csv(df, filename)
        print(f'Data saved to {filename}')
    except Exception as e:
        print(f'Error saving data: {e}')

# Save the data we just fetched
if df is not None:
    save_raw_data(df, year_to_fetch)

2025-06-30 17:44:18,715 - nomination_predictor.dataset - INFO - Successfully saved data to /home/wsl2ubuntuuser/nomination_predictor/data/raw/judicial_vacancies_2024.csv
Data saved to /home/wsl2ubuntuuser/nomination_predictor/data/raw/judicial_vacancies_2024.csv


## 3. Load and Explore the Saved Data

Let's verify that we can load the saved data.

In [None]:
def load_raw_data(year):
    """Load raw data from a CSV file."""
    filename = DATA_RAW / f'judicial_vacancies_{year}.csv'
    if not filename.exists():
        print(f'File not found: {filename}')
        return None
    
    try:
        df = pd.read_csv(filename)
        print(f'Loaded {len(df)} records from {filename}')
        return df
    except Exception as e:
        print(f'Error loading {filename}: {e}')
        return None

# Load the data we just saved
loaded_df = load_raw_data(year_to_fetch)
if loaded_df is not None:
    print('\nDataFrame info:')
    loaded_df.info()  # info() prints its report and returns None, so it needs no display() wrapper
    print('\nFirst few records:')
    display(loaded_df.head())

Loaded 2380 records from /home/wsl2ubuntuuser/nomination_predictor/data/raw/judicial_vacancies_2024.csv

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2380 entries, 0 to 2379
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   court                 2380 non-null   object 
 1   vacancy_date          2182 non-null   object 
 2   status                0 non-null      float64
 3   nominating_president  0 non-null      float64
 4   nominee               1543 non-null   object 
 5   year                  2380 non-null   int64  
 6   month                 2380 non-null   object 
dtypes: float64(2), int64(1), object(4)
memory usage: 130.3+ KB


First few records:


|   | court | vacancy_date | status | nominating_president | nominee | year | month |
|---|-------|--------------|--------|----------------------|---------|------|-------|
| 0 | US Court of Appeals |  |  |  |  | 2024 | vacancy summary for december 2024 |
| 1 | US District Courts (includes territorial courts*) |  |  |  |  | 2024 | vacancy summary for december 2024 |
| 2 | US Court of International Trade |  |  |  |  | 2024 | vacancy summary for december 2024 |
| 3 | US Court of Federal Claims |  |  |  |  | 2024 | vacancy summary for december 2024 |
| 4 | US Supreme Court |  |  |  |  | 2024 | vacancy summary for december 2024 |
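
The `info()` output above shows that types are lost in the CSV round trip: `vacancy_date` comes back as `object`, and the all-empty `status` and `nominating_president` columns are inferred as `float64`. Passing `parse_dates` and `dtype` to `pd.read_csv` restores the intended types. The sketch below uses made-up rows, and the ISO date format is an assumption about how the file stores dates.

```python
import io

import pandas as pd

# Made-up rows shaped like the saved file; the ISO date format is an assumption.
csv_text = (
    "court,vacancy_date,status,nominee,year\n"
    "Court A,,,,2024\n"
    "Court B,2024-01-15,,Jane Doe,2024\n"
)

# Default inference: the empty status column becomes float64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit types: dates are parsed, text columns stay text even when empty.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["vacancy_date"],
    dtype={"status": "string", "nominee": "string"},
)

print(default_df.dtypes)
print(typed_df.dtypes)
```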


## Next Steps

1. **Data Cleaning**: In the next notebook, we'll clean and preprocess this data.
2. **Exploratory Analysis**: We'll explore the data to understand its structure and quality.
3. **Feature Engineering**: We'll create additional features that might be useful for analysis.
4. **Visualization**: We'll create visualizations to understand trends and patterns.