# Judicial Vacancies Data Source Exploration

This notebook demonstrates how to use the `dataset` module to fetch and process judicial vacancy data.

## Overview

We'll:
1. Import the necessary modules
2. Fetch HTML data from the judicial vacancies archive
3. Extract vacancy data from the HTML
4. Convert the data to a pandas DataFrame
5. Save the raw data to a CSV file

## Setup

In [29]:
import sys
!{sys.executable} -m pip list | grep nomination_predictor

nomination_predictor      0.0.1             /home/wsl2ubuntuuser/nomination_predictor


In [30]:
# Enable autoreload for development
%load_ext autoreload
%autoreload 2

# Import standard libraries
import os
from pathlib import Path
import pandas as pd

# Import our data processing module
from nomination_predictor import dataset

# Set up paths
PROJECT_ROOT = Path().resolve().parent.parent
DATA_RAW = PROJECT_ROOT / 'data' / 'raw'
DATA_RAW.mkdir(parents=True, exist_ok=True)  # Ensure directory exists

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Fetch and Process Data

Let's fetch the data for a specific year and process it.

In [None]:
from datetime import datetime

def fetch_and_process_years(start_year=None, end_year=None):
    """
    Fetch and process data for a range of years.
    
    Args:
        start_year: First year to process (inclusive). If None, uses 1981.
        end_year: Last year to process (inclusive). If None, uses current year.
        
    Returns:
        DataFrame containing all processed records, or None if no records found
    """
    # Set default years if not provided
    if start_year is None:
        start_year = 1981
    if end_year is None:
        end_year = datetime.now().year
        
    print(f'Fetching data for years {start_year} to {end_year}...')
    
    # Get all available archive URLs
    all_urls = dataset.generate_or_fetch_archive_urls()
    
    if not all_urls:
        print('No archive URLs found')
        return None
        
    print(f'Found {len(all_urls)} archive URLs')
    
    # Process each URL and collect records
    all_records = []
    for url in all_urls:
        try:
            # Extract year from URL
            from urllib.parse import urlparse, parse_qs
            parsed_url = urlparse(url)
            year_from_url = int(parse_qs(parsed_url.query).get('year', [''])[0])
            
            # Skip if outside our desired year range
            if not (start_year <= year_from_url <= end_year):
                continue
                
            print(f'Processing {year_from_url}...')
            
            # Fetch and process the URL
            html_content = dataset.fetch_html(url)
            records = dataset.extract_vacancy_table(html_content)
            
            # Add year to each record
            for record in records:
                record['year'] = year_from_url
                
            all_records.extend(records)
            print(f' - Extracted {len(records)} records from {url}')
            
        except Exception as e:
            print(f'Error processing {url}: {e}')
            continue
    
    if not all_records:
        print('No records found for the specified year range.')
        return None
    
    # Convert all records to a single DataFrame
    df = dataset.records_to_dataframe(all_records)
    print(f'\nTotal records processed: {len(df)}')
    
    return df

In [36]:
# Fetch all available years (1981 to current year)
df = fetch_and_process_years()


if df is not None:
    print('\nFirst few records:')
    display(df.head())
    
    # Basic summary
    print('\nRecords per year:')
    print(df['year'].value_counts().sort_index())

Fetching data for years 1981 to 2025...
Found 45 archive URLs
Error processing https://www.uscourts.gov/data-news/judicial-vacancies/archive-judicial-vacancies?year=1981: invalid literal for int() with base 10: 'archive-judicial-vacancies?year=1981'
Error processing https://www.uscourts.gov/data-news/judicial-vacancies/archive-judicial-vacancies?year=1982: invalid literal for int() with base 10: 'archive-judicial-vacancies?year=1982'
Error processing https://www.uscourts.gov/data-news/judicial-vacancies/archive-judicial-vacancies?year=1983: invalid literal for int() with base 10: 'archive-judicial-vacancies?year=1983'
Error processing https://www.uscourts.gov/data-news/judicial-vacancies/archive-judicial-vacancies?year=1984: invalid literal for int() with base 10: 'archive-judicial-vacancies?year=1984'
Error processing https://www.uscourts.gov/data-news/judicial-vacancies/archive-judicial-vacancies?year=1985: invalid literal for int() with base 10: 'archive-judicial-vacancies?year=1985

In [33]:

if df is not None:
    print(f'\nFirst few records for {year_to_fetch}:')
    display(df.head())

## 2. Save Raw Data

Save the raw data to a CSV file in the `data/raw` directory.

In [34]:
def save_raw_data(df, year):
    """Save the raw data to a CSV file."""
    if df is None or df.empty:
        print('No data to save.')
        return
    
    filename = DATA_RAW / f'judicial_vacancies_{year}.csv'
    try:
        dataset.save_to_csv(df, filename)
        print(f'Data saved to {filename}')
    except Exception as e:
        print(f'Error saving data: {e}')

# Save the data we just fetched
if df is not None:
    save_raw_data(df, year_to_fetch)

## 3. Load and Explore the Saved Data

Let's verify that we can load the saved data.

In [35]:
def load_raw_data(year):
    """Load raw data from a CSV file."""
    filename = DATA_RAW / f'judicial_vacancies_{year}.csv'
    if not filename.exists():
        print(f'File not found: {filename}')
        return None
    
    try:
        df = pd.read_csv(filename)
        print(f'Loaded {len(df)} records from {filename}')
        return df
    except Exception as e:
        print(f'Error loading {filename}: {e}')
        return None

# Load the data we just saved
loaded_df = load_raw_data(year_to_fetch)
if loaded_df is not None:
    print('\nDataFrame info:')
    display(loaded_df.info())
    print('\nFirst few records:')
    display(loaded_df.head())

File not found: /home/wsl2ubuntuuser/data/raw/judicial_vacancies_2023.csv


## Next Steps

1. **Data Cleaning**: In the next notebook, we'll clean and preprocess this data.
2. **Exploratory Analysis**: We'll explore the data to understand its structure and quality.
3. **Feature Engineering**: We'll create additional features that might be useful for analysis.
4. **Visualization**: We'll create visualizations to understand trends and patterns.