# Judicial Vacancies Data Source Exploration

This notebook demonstrates how to use the `make_dataset` module to fetch and process judicial vacancy data.

## Overview

We'll:
1. Import the necessary modules and set up paths
2. Fetch HTML data from the judicial vacancies archive
3. Extract vacancy data from the HTML
4. Convert the data to a pandas DataFrame
5. Save the raw data to a CSV file
6. Reload the saved CSV to verify it

## Setup

In [None]:
# Enable autoreload for development
%load_ext autoreload
%autoreload 2

# Import standard and third-party libraries
from pathlib import Path
import pandas as pd

# Import our data processing module
from nomination_predictor.data import make_dataset

# Set up paths relative to the current working directory
PROJECT_ROOT = Path().resolve().parent.parent
DATA_RAW = PROJECT_ROOT / 'data' / 'raw'
DATA_RAW.mkdir(parents=True, exist_ok=True)  # Ensure directory exists
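
`Path().resolve().parent.parent` points at the project root only when the notebook is run from a directory two levels below it. As a more defensive alternative, the sketch below walks upward until it finds a marker file. The `find_project_root` helper and the choice of `pyproject.toml` as the marker are assumptions for illustration, not part of this project.

```python
from pathlib import Path


def find_project_root(start, marker="pyproject.toml"):
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: the marker file name is an assumption.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")
```

If the project does have such a marker, `PROJECT_ROOT = find_project_root(Path.cwd())` would give the same result regardless of which subdirectory the notebook is launched from.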

## 1. Fetch and Process Data

Let's fetch the data for a specific year and process it.
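
For intuition about the extraction step, here is a minimal sketch of how rows might be pulled from an HTML table using only the standard library. It is not the `make_dataset.extract_vacancy_table` implementation, whose internals this notebook does not show. The `TableRowParser` class, the `extract_table_rows` function, and the sample HTML with its column names are all hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_table_rows(html):
    """Return one dict per data row, keyed by the first row parsed (the header)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample input; the real archive pages may be structured differently.
sample_html = """
<table>
  <tr><th>Court</th><th>Vacancy Date</th></tr>
  <tr><td>Example District</td><td>01/15/2023</td></tr>
</table>
"""
records = extract_table_rows(sample_html)
print(records)
```

A list of dicts like this is a convenient intermediate form because it can be passed directly to `pd.DataFrame(records)`.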

In [None]:
def fetch_and_process_year(year=2023):
    """Fetch and process data for a specific year."""
    print(f'Fetching data for {year}...')
    
    # Generate or fetch archive URLs for the year
    urls = make_dataset.generate_or_fetch_archive_urls(year)
    
    if not urls:
        print(f'No URLs found for {year}')
        return None
    
    print(f'Found {len(urls)} URLs for {year}')
    
    # Process each URL
    all_records = []
    for url in urls:
        try:
            # Fetch HTML content
            html_content = make_dataset.fetch_html(url)
            
            # Extract records from HTML
            records = make_dataset.extract_vacancy_table(html_content)
            all_records.extend(records)
            print(f' - Extracted {len(records)} records from {url}')
        except Exception as e:
            print(f'Error processing {url}: {e}')
    
    if not all_records:
        print('No records found.')
        return None
    
    # Convert to DataFrame
    df = make_dataset.records_to_dataframe(all_records)
    
    # Add year column if not present
    if 'year' not in df.columns:
        df['year'] = year
    
    return df

In [None]:
# Example: Fetch data for 2023
year_to_fetch = 2023
df = fetch_and_process_year(year_to_fetch)

if df is not None:
    print(f'\nFirst few records for {year_to_fetch}:')
    display(df.head())

## 2. Save Raw Data

Save the raw data to `data/raw/judicial_vacancies_<year>.csv`.

In [None]:
def save_raw_data(df, year):
    """Save the raw data to a CSV file."""
    if df is None or df.empty:
        print('No data to save.')
        return
    
    filename = DATA_RAW / f'judicial_vacancies_{year}.csv'
    try:
        make_dataset.save_to_csv(df, filename)
        print(f'Data saved to {filename}')
    except Exception as e:
        print(f'Error saving data: {e}')

# Save the data we just fetched
if df is not None:
    save_raw_data(df, year_to_fetch)

## 3. Load and Explore the Saved Data

Let's verify that we can load the saved data.

In [None]:
def load_raw_data(year):
    """Load raw data from a CSV file."""
    filename = DATA_RAW / f'judicial_vacancies_{year}.csv'
    if not filename.exists():
        print(f'File not found: {filename}')
        return None
    
    try:
        df = pd.read_csv(filename)
        print(f'Loaded {len(df)} records from {filename}')
        return df
    except Exception as e:
        print(f'Error loading {filename}: {e}')
        return None

# Load the data we just saved
loaded_df = load_raw_data(year_to_fetch)
if loaded_df is not None:
    print('\nDataFrame info:')
    loaded_df.info()  # info() prints its summary and returns None, so no display() is needed
    print('\nFirst few records:')
    display(loaded_df.head())
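
One thing worth checking after a reload is column types: CSV stores only text, so `pd.read_csv` does not restore datetime columns unless asked to. A minimal sketch, using an in-memory buffer and hypothetical column names (`court`, `vacancy_date`) rather than the real vacancy schema:

```python
import io

import pandas as pd

# Hypothetical one-row frame; the column names are assumptions for illustration.
original = pd.DataFrame(
    {
        "court": ["Example District"],
        "vacancy_date": pd.to_datetime(["2023-01-15"]),
    }
)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = original.to_csv(index=False)

# Default read: the date column comes back as text.
plain = pd.read_csv(io.StringIO(csv_text))

# Asking pandas to parse the column restores a datetime dtype.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["vacancy_date"])

print(plain["vacancy_date"].dtype)
print(parsed["vacancy_date"].dtype)
```

If the saved vacancy data contains date columns, passing their names via `parse_dates` when loading is one way to avoid re-parsing them in later notebooks.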

## Next Steps

1. **Data Cleaning**: In the next notebook, we'll clean and preprocess this data.
2. **Exploratory Analysis**: We'll explore the data to understand its structure and quality.
3. **Feature Engineering**: We'll create additional features that might be useful for analysis.
4. **Visualization**: We'll create visualizations to understand trends and patterns.