# Judicial Vacancies Data Source Exploration

This notebook demonstrates how to use the `nomination_predictor.dataset` module to fetch and process judicial vacancy data.

## Overview

We'll:
1. Import the necessary modules
2. Fetch HTML data from the judicial vacancies archive
3. Extract vacancy data from the HTML
4. Convert the data to a pandas DataFrame
5. Save the raw data to a CSV file

## Setup

In [34]:
import sys
!{sys.executable} -m pip list | grep nomination_predictor

nomination_predictor      0.0.1             /home/wsl2ubuntuuser/nomination_predictor


In [35]:
# Enable autoreload for development
%load_ext autoreload
%autoreload 2

# Import standard libraries
import os
from pathlib import Path
import pandas as pd

# Import our data processing module
from nomination_predictor import dataset


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
# Set up paths. Note: this assumes Jupyter was started from the notebooks directory.
from pathlib import Path

# Get the current working directory (where the notebook is running from)
NOTEBOOK_DIR = Path.cwd()

# The project root is one level up from the notebooks directory
# (if the notebook runs from elsewhere, PROJECT_ROOT will point at the wrong folder)
PROJECT_ROOT = NOTEBOOK_DIR.parent

# Define data paths relative to project root
DATA_RAW = PROJECT_ROOT / 'data' / 'raw'
DATA_RAW.mkdir(parents=True, exist_ok=True)  # Ensure directory exists

# Print paths for debugging
print(f'Notebook directory: {NOTEBOOK_DIR}')
print(f'Project root: {PROJECT_ROOT}')
print(f'Data raw directory: {DATA_RAW}')
print(f'Data raw exists: {DATA_RAW.exists()}')

In [None]:
# Verify the data directory exists and is writable
if not DATA_RAW.exists():
    print(f"Error: Data directory does not exist: {DATA_RAW}")
elif not os.access(DATA_RAW, os.W_OK):
    print(f"Error: No write permission for directory: {DATA_RAW}")
else:
    print(f"Data directory is ready: {DATA_RAW}")
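The path setup above relies on the working directory being the `notebooks` folder. As a hedged alternative, here is a minimal sketch that walks up from the working directory until it finds a marker file. The helper name `find_project_root` and the marker names (`pyproject.toml`, `.git`) are assumptions for illustration, not part of `nomination_predictor`.

```python
from pathlib import Path


def find_project_root(start: Path, markers=("pyproject.toml", ".git")) -> Path:
    """Return the closest ancestor of `start` (or `start` itself) containing a marker.

    Hypothetical helper: the marker names are assumptions about the repo layout.
    Falls back to `start` when no marker is found.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    return start


# Usage: works whether Jupyter was started from the project root or from notebooks/
PROJECT_ROOT = find_project_root(Path.cwd())
DATA_RAW = PROJECT_ROOT / "data" / "raw"
```

With this approach, `DATA_RAW` would stay inside the project even when the notebook is launched from the project root rather than from `notebooks`.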

## 1. Fetch and Process Data

Let's fetch the archive pages for a range of years and process them. While the code is still being tested, the range is limited to a single year.

In [None]:
# TODO: remove this year limit after initial experimenting
year_to_fetch = 2024

In [None]:
from datetime import datetime

def fetch_and_process_years(start_year=None, end_year=None, max_workers=5):
    """
    Fetch and process data for a range of years using two-level extraction.
    
    Args:
        start_year: First year to process (inclusive). If None, uses 1981.
        end_year: Last year to process (inclusive). If None, uses current year.
        max_workers: Maximum number of concurrent requests
        
    Returns:
        DataFrame containing all processed records, or None if no records found
    """
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.parse import urlparse, parse_qs
    
    # Set default years if not provided
    if start_year is None:
        start_year = 1981
    if end_year is None:
        end_year = datetime.now().year
        
    print(f'Fetching data for years {start_year} to {end_year}...')
    
    # Get all available archive URLs
    all_urls = dataset.generate_or_fetch_archive_urls()
    
    if not all_urls:
        print('No archive URLs found')
        return None
        
    print(f'Found {len(all_urls)} archive URLs')
    
    all_records = []
    processed_years = set()
    
    def process_month(month_info, year):
        """Helper to process a single month's data"""
        try:
            month_url = month_info['url']
            print(f"  - Fetching {month_info['month']}...")
            
            # Fetch and process the monthly page
            html_content = dataset.fetch_html(month_url)
            records = dataset.extract_vacancy_table(html_content)
            
            # Add year and month to each record
            for record in records:
                record['year'] = year
                record['month'] = month_info['month']
                
            return records
        except Exception as e:
            print(f"Error processing {month_info.get('month', 'unknown')}: {e}")
            return []
    
    # Process each year and its months
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_year = {}
        
        for url in all_urls:
            try:
                # Extract year from URL
                parsed_url = urlparse(url)
                year_from_url = int(parse_qs(parsed_url.query).get('year', [''])[0])
                
                # Skip if outside our desired year range
                if not (start_year <= year_from_url <= end_year):
                    continue
                    
                print(f'Processing year {year_from_url}...')
                
                # Fetch the year page
                year_html = dataset.fetch_html(url)
                
                # Extract month links
                month_links = dataset.extract_month_links(year_html)
                
                if not month_links:
                    print(f"  - No month links found for {year_from_url}, processing as single page")
                    # Process as a single page if no month links found
                    records = dataset.extract_vacancy_table(year_html)
                    for record in records:
                        record['year'] = year_from_url
                        record['month'] = 'Full Year'
                    all_records.extend(records)
                    processed_years.add(year_from_url)  # count single-page years as processed too
                    continue
                
                # Process each month in parallel
                for month_info in month_links:
                    month_info['year'] = year_from_url
                    future = executor.submit(process_month, month_info, year_from_url)
                    future_to_year[future] = year_from_url
                
                processed_years.add(year_from_url)
                
            except Exception as e:
                # Report the URL: year_from_url is unset (or stale) if parsing the year failed
                print(f"Error processing archive URL {url}: {e}")
                continue
        
        # Process results as they complete
        for future in as_completed(future_to_year):
            year = future_to_year[future]
            try:
                records = future.result()
                all_records.extend(records)
                print(f"  - Processed {len(records)} records for {year}")
            except Exception as e:
                print(f"Error processing months for year {year}: {e}")
    
    if not all_records:
        print('No records found for the specified year range.')
        return None
    
    # Convert all records to a single DataFrame
    df = dataset.records_to_dataframe(all_records)
    print(f'\nTotal records processed: {len(df)}')
    print(f'Years processed: {sorted(processed_years)}')
    
    return df
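The function above reads the year from each archive URL's query string and calls `int()` on the result, which raises if the parameter is missing. A small sketch of a more defensive version follows. The helper name `year_from_archive_url` and the `example.com` URLs are hypothetical; the `year` query parameter is taken from the code above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def year_from_archive_url(url: str) -> Optional[int]:
    """Return the `year` query parameter as an int, or None if absent or non-numeric.

    Hypothetical helper; it assumes archive URLs carry the year as `?year=YYYY`,
    as the notebook code above does.
    """
    values = parse_qs(urlparse(url).query).get("year", [])
    if values and values[0].isdigit():
        return int(values[0])
    return None


# Usage with made-up URLs
print(year_from_archive_url("https://example.com/archive?year=2024"))
print(year_from_archive_url("https://example.com/archive"))
```

Returning `None` lets the caller skip a URL explicitly instead of relying on the surrounding `try`/`except`.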

In [None]:
# Fetch only `year_to_fetch` for now; called with no arguments, the function covers 1981 through the current year
# TODO: expand the year range once the code is confirmed to handle this smaller subset
df = fetch_and_process_years(start_year=year_to_fetch, end_year=year_to_fetch)


if df is not None:
    print('\nFirst few records:')
    display(df.head())
    
    # Basic summary
    print('\nRecords per year:')
    print(df['year'].value_counts().sort_index())

Fetching data for years 2024 to 2024...
Found 45 archive URLs
Processing year 2024...


  - Fetching judicial emergencies for december 2024...
  - Fetching vacancy summary for december 2024...
  - Fetching judicial vacancy list for december 2024...
  - Fetching future judicial vacancies for december 2024...
  - Fetching judicial confirmations for december 2024...
  - Fetching judicial emergencies for november 2024...
  - Processed 37 records for 2024
  - Fetching vacancy summary for november 2024...
  - Processed 6 records for 2024
  - Fetching judicial vacancy list for november 2024...
  - Processed 8 records for 2024
  - Fetching future judicial vacancies for november 2024...
  - Processed 21 records for 2024
  - Fetching judicial confirmations for november 2024...
  - Processed 141 records for 2024
  - Fetching judicial emergencies for october 2024...
  - Processed 23 records for 2024
  - Fetching vacancy summary for october 2024...
  - Processed 6 records for 2024
  - Fetching judicial vacancy list for october 2024...
  - Processed 19 records for 2024
  - Fetching fut

|   | court | vacancy_date | status | nominating_president | nominee | year | month |
|---|---|---|---|---|---|---|---|
| 0 | 01 - CCA | 2024-10-31 |  |  | Lipez,Julia M. | 2024 | judicial vacancy list for december 2024 |
| 1 | 02 - NY-S | 2024-12-31 |  |  | Netburn,Sarah | 2024 | judicial vacancy list for december 2024 |
| 2 | 03 - CCA | 2023-06-15 |  |  | Mangi,Adeel Abdullah | 2024 | judicial vacancy list for december 2024 |
| 3 | 04 - NC-M | 2024-12-31 |  |  |  | 2024 | judicial vacancy list for december 2024 |
| 4 | 04 - NC-M | 2024-12-31 |  |  |  | 2024 | judicial vacancy list for december 2024 |



Records per year:
year
2024    2380
Name: count, dtype: int64


In [None]:
if df is not None:
    print(f'\nFirst few records for dataframe:')
    display(df.head())


First few records for dataframe:


|   | court | vacancy_date | status | nominating_president | nominee | year | month |
|---|---|---|---|---|---|---|---|
| 0 | 01 - CCA | 2024-10-31 |  |  | Lipez,Julia M. | 2024 | judicial vacancy list for december 2024 |
| 1 | 02 - NY-S | 2024-12-31 |  |  | Netburn,Sarah | 2024 | judicial vacancy list for december 2024 |
| 2 | 03 - CCA | 2023-06-15 |  |  | Mangi,Adeel Abdullah | 2024 | judicial vacancy list for december 2024 |
| 3 | 04 - NC-M | 2024-12-31 |  |  |  | 2024 | judicial vacancy list for december 2024 |
| 4 | 04 - NC-M | 2024-12-31 |  |  |  | 2024 | judicial vacancy list for december 2024 |


## 2. Save Raw Data

Save the raw data to a CSV file in the `data/raw` directory.

In [None]:
def save_raw_data(df, year):
    """Save the raw data to a CSV file."""
    if df is None or df.empty:
        print('No data to save.')
        return
    
    filename = DATA_RAW / f'judicial_vacancies_{year}.csv'
    try:
        dataset.save_to_csv(df, filename)
        print(f'Data saved to {filename}')
    except Exception as e:
        print(f'Error saving data: {e}')

# Save the data we just fetched
if df is not None:
    save_raw_data(df, year_to_fetch)

2025-06-30 17:31:26,456 - nomination_predictor.dataset - INFO - Successfully saved data to /home/wsl2ubuntuuser/data/raw/judicial_vacancies_2024.csv
Data saved to /home/wsl2ubuntuuser/data/raw/judicial_vacancies_2024.csv


## 3. Load and Explore the Saved Data

Let's verify that we can load the saved data.

In [None]:
def load_raw_data(year):
    """Load raw data from a CSV file."""
    filename = DATA_RAW / f'judicial_vacancies_{year}.csv'
    if not filename.exists():
        print(f'File not found: {filename}')
        return None
    
    try:
        df = pd.read_csv(filename)
        print(f'Loaded {len(df)} records from {filename}')
        return df
    except Exception as e:
        print(f'Error loading {filename}: {e}')
        return None

# Load the data we just saved
loaded_df = load_raw_data(year_to_fetch)
if loaded_df is not None:
    print('\nDataFrame info:')
    loaded_df.info()  # info() prints its summary and returns None, so it is not wrapped in display()
    print('\nFirst few records:')
    display(loaded_df.head())

Loaded 2380 records from /home/wsl2ubuntuuser/data/raw/judicial_vacancies_2024.csv

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2380 entries, 0 to 2379
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   court                 2380 non-null   object 
 1   vacancy_date          2182 non-null   object 
 2   status                0 non-null      float64
 3   nominating_president  0 non-null      float64
 4   nominee               1543 non-null   object 
 5   year                  2380 non-null   int64  
 6   month                 2380 non-null   object 
dtypes: float64(2), int64(1), object(4)
memory usage: 130.3+ KB


First few records:


|   | court | vacancy_date | status | nominating_president | nominee | year | month |
|---|---|---|---|---|---|---|---|
| 0 | 01 - CCA | 2024-10-31 |  |  | Lipez,Julia M. | 2024 | judicial vacancy list for december 2024 |
| 1 | 02 - NY-S | 2024-12-31 |  |  | Netburn,Sarah | 2024 | judicial vacancy list for december 2024 |
| 2 | 03 - CCA | 2023-06-15 |  |  | Mangi,Adeel Abdullah | 2024 | judicial vacancy list for december 2024 |
| 3 | 04 - NC-M | 2024-12-31 |  |  |  | 2024 | judicial vacancy list for december 2024 |
| 4 | 04 - NC-M | 2024-12-31 |  |  |  | 2024 | judicial vacancy list for december 2024 |
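To check the save/load round trip without touching the disk, a minimal sketch using plain pandas and an in-memory buffer is shown below. It assumes the CSV is written without the index column, which is consistent with the seven columns loaded above but is not confirmed for `dataset.save_to_csv`. The helper name `roundtrip_csv` and the sample rows are illustrative only.

```python
import io

import pandas as pd


def roundtrip_csv(df: pd.DataFrame) -> pd.DataFrame:
    """Write `df` to an in-memory CSV (no index column) and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)


# Tiny illustrative frame; `status` is entirely empty, like the column above
sample = pd.DataFrame(
    {
        "court": ["01 - CCA", "02 - NY-S"],
        "status": [None, None],
        "year": [2024, 2024],
    }
)
reloaded = roundtrip_csv(sample)

# Column names and row count should survive the round trip
print(list(reloaded.columns) == list(sample.columns), len(reloaded) == len(sample))
# An all-empty column comes back as float64 NaN, matching `status` in the info output
print(reloaded["status"].dtype)
```

This also explains why `status` and `nominating_president` appear as `float64` after loading: pandas infers that dtype for columns that contain only missing values.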


## Next Steps

1. **Data Cleaning**: In the next notebook, we'll clean and preprocess this data.
2. **Exploratory Analysis**: We'll explore the data to understand its structure and quality.
3. **Feature Engineering**: We'll create additional features that might be useful for analysis.
4. **Visualization**: We'll create visualizations to understand trends and patterns.
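
As a starting point for the cleaning step, note that the `month` column currently holds link text such as `judicial vacancy list for december 2024` rather than a month. Here is a rough sketch that splits such labels into a report type, month name, and year. The function name, output column names, and regular expression are assumptions based only on the labels visible in the output above.

```python
import pandas as pd

# Assumed label shape: "<report type> for <month name> <4-digit year>"
LABEL_PATTERN = r"^(?P<report_type>.+) for (?P<month_name>[a-z]+) (?P<label_year>\d{4})$"


def split_month_label(labels: pd.Series) -> pd.DataFrame:
    """Split link-text labels into report_type, month_name, and label_year columns.

    Labels that do not match the assumed shape (for example 'Full Year')
    yield missing values in all three columns.
    """
    return labels.str.lower().str.extract(LABEL_PATTERN)


# Usage on labels of the kind seen in the fetched data
labels = pd.Series(
    [
        "judicial vacancy list for december 2024",
        "future judicial vacancies for november 2024",
        "Full Year",
    ]
)
parts = split_month_label(labels)
print(parts)
```

Separating the report type would also make it possible to keep vacancy lists, confirmations, and emergencies apart, since the fetch step currently combines rows from all of these pages.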