## Step 1: Setup and Configuration

In this initial step, I'll set up the notebook's environment. This involves three key actions:
1.  Importing all necessary standard and third-party Python libraries.
2.  Programmatically setting the project's root directory to ensure all relative paths for data and source code work correctly.
3.  Importing my custom configuration variables and functions from the `config.py` and `src/` files.

In [2]:
# --- Foundational Library Imports ---
import os
from pathlib import Path
import sys
import time
import requests
import pandas as pd

# --- Project Root Setup ---
# This block ensures the notebook's working directory is always the project root (compliance-nlp/).
# This makes all file paths for data and source code consistent and reproducible.
current_dir = Path.cwd()
if current_dir.name == 'notebooks':
    os.chdir(current_dir.parent)
    print(f"Changed working directory to project root: {os.getcwd()}")
else:
    print(f"Already at project root: {os.getcwd()}")

Already at project root: /Users/lidasmac/compliance-nlp
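
The check above only handles the case where the notebook is launched from `notebooks/`. As a more general sketch, the project root could be located by walking up the directory tree until a marker file is found. The helper name `find_project_root` and the choice of `config.py` as the marker are assumptions for illustration, not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "config.py") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker file name is an assumption.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None
```

A caller would pass `Path.cwd()` and call `os.chdir()` on the result only if it is not `None`, which avoids reporting "already at project root" from an unrelated directory.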


In [3]:
# --- Custom & Configuration Imports ---
# This cell can only run correctly after the working directory has been set above.
from config import (
    API_TOKEN, QUERY, PAGE_SIZE, DATA_DIR, MIN_TEXT_LENGTH,
    TARGET_RULING_COUNT
)
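
For orientation, here is a minimal sketch of what a `config.py` exposing these names might look like. Only the variable names, the `data` directory, and the target of 74 come from this notebook. The environment variable name and every other value are placeholders I am assuming for illustration.

```python
# config.py (hypothetical sketch; only the variable names come from the notebook)
import os

# Read the token from the environment so it never lands in version control.
# The environment variable name is an assumption.
API_TOKEN = os.environ.get("COURTLISTENER_API_TOKEN", "")

QUERY = "compliance"      # placeholder search term (assumption)
PAGE_SIZE = 20            # placeholder results per page (assumption)
DATA_DIR = "data"         # matches the 'data/raw' path printed later in the notebook
MIN_TEXT_LENGTH = 500     # placeholder minimum characters per ruling (assumption)
TARGET_RULING_COUNT = 74  # value stated later in the notebook
```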

## Step 2: Prepare API Request & Directories

Next, I will prepare the components needed to make a request to the CourtListener API and ensure the output directory for the data is ready.

-   **`headers`**: This dictionary contains my API token for authentication. The `"Authorization"` key must follow the format `"Token <my_token>"`.
-   **`params`**: This dictionary specifies the query parameters for the API call, such as my search term, the sort order, and the number of results to return per page.
-   **`os.makedirs`**: I use this function to ensure that the folder where I'll save the ruling files exists. If it doesn't, this command creates it automatically.

This block prepares everything needed to begin retrieving data securely and saving it locally.

In [4]:
# --- Prepare API Request & Directories ---

# Set up the authentication headers with the required space
headers = {
    "Authorization": f"Token {API_TOKEN}"
}

# Set up the parameters for a reproducible query
params = {
    "search": QUERY,
    "order_by": "-id",  # Sort by unique ID descending for deterministic results
    "page_size": PAGE_SIZE
}

# Ensure the output directory for raw data exists
# This should target the 'raw' sub-directory.
raw_data_dir = Path(DATA_DIR) / "raw"
os.makedirs(raw_data_dir, exist_ok=True)
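
Before sending anything, it can help to see roughly what URL the query parameters produce. The sketch below uses only the standard library to build a preview. The helper name `preview_url` and the example parameter values are assumptions, and the token is deliberately left out because it travels in the headers, not the URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.courtlistener.com/api/rest/v4/opinions/"


def preview_url(base_url: str, params: dict) -> str:
    """Return the base URL with `params` appended as an encoded query string."""
    return f"{base_url}?{urlencode(params)}"


# Placeholder values for illustration; the real ones come from config.py.
example_params = {"search": "data privacy", "order_by": "-id", "page_size": 20}
print(preview_url(BASE_URL, example_params))
```

This only approximates what the HTTP client will send, but it is enough to catch a misspelled parameter name or an unexpected value.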

## Step 3: Execute Data Fetching Pipeline

With the setup complete, I will now execute the main data collection script. The goal is to ensure my local `data/raw/` directory contains a specific number of rulings, with the target amount controlled by a variable in my central `config.py` file.

-   **Configuration-Driven:** The script only runs if the number of local files is less than the `TARGET_RULING_COUNT` defined in `config.py`. For this phase, I have set this value to 74.
-   **Idempotent Logic:** Before downloading, the script checks if a file with a specific `opinion_id` already exists. If it does, the download is skipped. This allows the script to be run multiple times to safely "top up" the dataset without creating duplicates.
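-   **Reproducible Results:** It sorts by a fixed key (`order_by: "-id"`) so the API returns results in a deterministic order on every run. Because the sort is descending by ID, opinions added to CourtListener after the first run can appear ahead of earlier results, so the downloaded set is only guaranteed to match while the upstream data is unchanged.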

Together, these properties let me control the dataset size from one place and rerun the collection safely.
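
The idempotent check described above can be isolated into a small function, which makes the skip and length rules easy to test without calling the API. This is a sketch under my own assumptions: the function name `save_if_new` and its return labels are hypothetical, while the `opinion_<id>.txt` naming follows the fetching cell.

```python
from pathlib import Path


def save_if_new(directory: Path, opinion_id: int, text: str, min_length: int) -> str:
    """Write one ruling unless it already exists or is too short.

    Returns "skipped", "too_short", or "saved" (hypothetical labels).
    """
    filepath = directory / f"opinion_{opinion_id}.txt"
    # An existing file means this opinion was handled in an earlier run.
    if filepath.exists():
        return "skipped"
    # Ignore rulings whose stripped text is shorter than the threshold.
    if len(text.strip()) < min_length:
        return "too_short"
    filepath.write_text(text, encoding="utf-8")
    return "saved"
```

Calling it twice with the same `opinion_id` writes the file once and reports `"skipped"` the second time, which is the "top up" behaviour the pipeline relies on.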

In [5]:
# --- 1. Define the final output directory ---
raw_data_dir = Path(DATA_DIR) / "raw"
os.makedirs(raw_data_dir, exist_ok=True)


# --- 2. SAFETY CHECK & MAIN LOGIC ---
try:
    current_file_count = len(list(raw_data_dir.glob("*.txt")))
    
    if current_file_count >= TARGET_RULING_COUNT:
        print(f"   Target of {TARGET_RULING_COUNT} files met or exceeded.")
        print(f"   Directory '{raw_data_dir}' already contains {current_file_count} files.")
        print("   Halting script. No new files will be downloaded.")
    else:
        print(f"Current file count is {current_file_count}. Target is {TARGET_RULING_COUNT}.")
        print("Proceeding to download missing files...")
        
        # --- 3. Main Fetching Loop ---
        saved_count = 0
        skipped_count = 0
        page = 1

        while (saved_count + current_file_count) < TARGET_RULING_COUNT:
            print(f"🔍 Fetching page {page}...")
            params["page"] = page
            
            try:
                # A timeout keeps a stalled connection from hanging the notebook indefinitely.
                response = requests.get(
                    "https://www.courtlistener.com/api/rest/v4/opinions/",
                    headers=headers,
                    params=params,
                    timeout=30,
                )
                response.raise_for_status()
                data = response.json() 
            except requests.exceptions.RequestException as e:
                print(f"API Request failed: {e}")
                break

            results = data.get("results", [])
            if not results:
                print("No results on this page. Halting.")
                break

            for item in results:
                if (saved_count + current_file_count) >= TARGET_RULING_COUNT:
                    break

                opinion_id = item.get("id")
                if not opinion_id:
                    continue

                filepath = raw_data_dir / f"opinion_{opinion_id}.txt"

                if filepath.exists():
                    skipped_count += 1
                    continue

                text = item.get("plain_text") or item.get("html_with_citations") or ""
                if len(text.strip()) < MIN_TEXT_LENGTH:
                    continue

                with open(filepath, "w", encoding="utf-8") as f:
                    f.write(text)

                saved_count += 1
                print(f"  Saved ({saved_count}): {filepath.name}")
                time.sleep(0.5)

            if not data.get("next"):
                print("\nNo more pages available from API. Halting.")
                break

            page += 1

        print(f"\n Fetching complete! Newly saved: {saved_count} rulings. Skipped: {skipped_count} existing files.")
        print(f"Total files in directory: {len(list(raw_data_dir.glob('*.txt')))}")

except Exception as e:
    print(f"An unexpected error occurred: {e}")

   Target of 74 files met or exceeded.
   Directory 'data/raw' already contains 74 files.
   Halting script. No new files will be downloaded.
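
Since the run above stopped at the safety check, the paging loop itself was not exercised. One way to check its stopping conditions offline is to pass the page fetcher in as a function and substitute a fake. The names `collect_ids` and `fake_fetch`, and the fake data, are assumptions for illustration. The sketch mirrors the loop's control flow only and does not write files or call the API.

```python
from typing import Callable, Dict, List


def collect_ids(fetch_page: Callable[[int], Dict], needed: int) -> List[int]:
    """Page through results until `needed` IDs are collected or pages run out."""
    collected: List[int] = []
    page = 1
    while len(collected) < needed:
        data = fetch_page(page)
        results = data.get("results", [])
        if not results:
            break  # empty page: nothing more to read
        for item in results:
            if len(collected) >= needed:
                break  # target reached partway through a page
            opinion_id = item.get("id")
            if opinion_id:
                collected.append(opinion_id)
        if not data.get("next"):
            break  # the response reports no further pages
        page += 1
    return collected


def fake_fetch(page: int) -> Dict:
    """Stand-in for the API: three pages of two made-up results each."""
    ids = {1: [60, 50], 2: [40, 30], 3: [20, 10]}.get(page, [])
    return {"results": [{"id": i} for i in ids], "next": "more" if page < 3 else None}
```

With the fake fetcher, asking for three IDs stops partway through the second page, and asking for more than six stops when the last page reports no `next` link.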
