## 1. Setup & Config Imports

We start by importing core Python libraries and loading project settings from `config.py`.

- `sys.path.append(os.path.abspath(".."))` adds the parent of the current working directory to Python's module search path so it can find `config.py`.
- `config.py` holds sensitive or reusable values like the API token, query string, page size, and save paths.

This lets us separate code logic from configuration and makes our project easier to scale and maintain.

In [33]:
import sys
import os

# Add the parent folder to the system path
sys.path.append(os.path.abspath(".."))

from config import API_TOKEN, QUERY, PAGE_SIZE, MAX_PAGES, DATA_DIR, MIN_TEXT_LENGTH
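
For reference, a `config.py` compatible with the import above might look like the following minimal sketch. The six names come from the import; every value is an illustrative assumption, and reading the token from a hypothetical `COURTLISTENER_TOKEN` environment variable is just one way to keep it out of version control.

```python
# config.py (hypothetical sketch: names match the import above, values are assumptions)
import os

# Read the token from the environment instead of hard-coding it.
# COURTLISTENER_TOKEN is an assumed variable name, not part of the original project.
API_TOKEN = os.environ.get("COURTLISTENER_TOKEN", "")

QUERY = "example search terms"          # assumed placeholder query
PAGE_SIZE = 20                          # assumed number of results per page
MAX_PAGES = 20                          # assumed upper bound on pages to fetch
DATA_DIR = os.path.join("data", "raw")  # folder where rulings are saved
MIN_TEXT_LENGTH = 500                   # assumed minimum characters for a usable ruling
```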

## 2. Prepare API Request & Output Folder

We set up:

- `headers`: includes your API token for authentication with CourtListener. The `"Authorization"` value must follow the format `"Token <your_token>"`, with a single space between `Token` and the token itself.
- `params`: specifies query parameters like your search term, sort order, and how many results to return per page.
- `os.makedirs`: ensures that the folder where we'll save ruling files exists. If it doesn't, it's created automatically.

This block prepares everything needed to begin retrieving data securely and saving it locally.

In [43]:
headers = {
    "Authorization": f"Token {API_TOKEN}"  # note the space after "Token"
}
params = {
    "search": QUERY,
    "order_by": "date_filed",
    "page_size": PAGE_SIZE
}
os.makedirs(DATA_DIR, exist_ok=True)
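
Because a malformed `Authorization` value is easy to miss, it can help to build the header in one place. The helper below is a minimal sketch; `build_headers` is a hypothetical name, not part of the original notebook, and `"abc123"` is a dummy token.

```python
def build_headers(token: str) -> dict:
    """Return request headers in the "Token <your_token>" format."""
    token = token.strip()
    if not token:
        raise ValueError("API token is empty; check config.py")
    return {"Authorization": f"Token {token}"}


headers = build_headers("abc123")  # dummy token for illustration only
print(headers)  # {'Authorization': 'Token abc123'}
```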

## 3. (Optional) Clear Raw Data Folder

If you're rerunning the fetch loop and want to start fresh, this block deletes all previously saved `.txt` files in the `data/raw/` directory.

Running this before each data collection ensures a clean, controlled set of rulings and prevents file duplication across runs.

In [52]:
# Optional: delete all .txt files before rerunning
for f in os.listdir(DATA_DIR):
    if f.endswith(".txt"):
        os.remove(os.path.join(DATA_DIR, f))

print(" Old files removed.")


 Old files removed.
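
The same cleanup can be written as a small function that reports how many files it actually deleted, which makes the confirmation message more informative. This is a sketch using `pathlib`; `clear_txt_files` is a hypothetical helper name.

```python
from pathlib import Path


def clear_txt_files(folder: str) -> int:
    """Delete every .txt file directly inside folder and return how many were removed."""
    removed = 0
    for path in Path(folder).glob("*.txt"):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed
```

Calling `clear_txt_files(DATA_DIR)` would then return the number of deleted files, which can be printed in place of the fixed message.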


## 4. Paginated Fetch Loop: Save Usable Rulings

This is the main loop that paginates through CourtListener search results and saves usable rulings as `.txt` files.

### Key logic:
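- Loops through up to `MAX_PAGES` pages, stopping early once 150 usable rulings are saved
- Takes each opinion's `plain_text`, falling back to `html_with_citations`, and skips entries whose stripped text is shorter than `MIN_TEXT_LENGTH`
- Builds the filename from the first 40 characters of the case title (spaces become underscores, slashes become hyphens), falling back to `opinion_<page>_<index>` when no title is present
- Writes each valid ruling to the `data/raw/` folder
- Waits 1 second after each save to avoid overwhelming the API

The loop ends when any of the following holds:
- You've saved the target number of rulings
- `MAX_PAGES` pages have been fetched
- There are no more API pages to fetch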

This step builds your final raw dataset for downstream analysis and modeling.
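
One caveat with title-based filenames is that two rulings sharing the same first 40 characters would overwrite each other. A possible variant, sketched below, always appends the page and result index so each name stays unique. `safe_filename` is a hypothetical helper, and the example title is made up.

```python
import re


def safe_filename(title: str, page: int, index: int, max_len: int = 40) -> str:
    """Build a filesystem-friendly .txt name that stays unique per (page, index)."""
    # Replace every run of characters other than letters, digits, "_" and "-" with "_"
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", title.strip())[:max_len].strip("_")
    if not slug:
        slug = "opinion"  # fallback when the title is missing or empty
    return f"{slug}_{page}_{index}.txt"


print(safe_filename("Smith v. Jones / Appeal", page=3, index=7))
# Smith_v_Jones_Appeal_3_7.txt
```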


In [53]:
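import time

import requests

TARGET_RULINGS = 150  # stop once this many usable rulings are saved
saved = 0
page = 1

while saved < TARGET_RULINGS and page <= MAX_PAGES:
    print(f"🔍 Fetching page {page}")
    params["page"] = page
    response = requests.get("https://www.courtlistener.com/api/rest/v4/opinions/", headers=headers, params=params, timeout=30)
    response.raise_for_status()  # stop on HTTP errors such as a rejected token
    data = response.json()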

    for i, item in enumerate(data.get("results", [])):
        text = item.get("plain_text") or item.get("html_with_citations") or ""
        if not text or len(text.strip()) < MIN_TEXT_LENGTH:
            print(f" Skipping empty result #{i}")
            continue

        # Fall back to a generated name when the title is missing, None, or empty
        title = item.get("case_name") or f"opinion_{page}_{i}"
        filename = f"{title[:40].strip().replace(' ', '_').replace('/', '-')}.txt"
        filepath = os.path.join(DATA_DIR, filename)

        with open(filepath, "w", encoding="utf-8") as f:
            f.write(text)

        saved += 1
        print(f" Saved ({saved}): {filename}")
        time.sleep(1)  # avoid spamming the server

    if not data.get("next"):
        print(" No more pages available.")
        break

    page += 1

print(f"\n Finished! Total saved: {saved} rulings.")

🔍 Fetching page 1
 Skipping empty result #0
 Skipping empty result #1
 Skipping empty result #2
 Skipping empty result #3
 Skipping empty result #4
 Skipping empty result #5
 Skipping empty result #6
 Skipping empty result #7
 Skipping empty result #8
 Skipping empty result #9
 Skipping empty result #10
 Skipping empty result #11
 Skipping empty result #12
 Skipping empty result #13
 Skipping empty result #14
 Skipping empty result #15
 Skipping empty result #16
 Skipping empty result #17
 Skipping empty result #18
 Saved (1): opinion_1_19.txt
🔍 Fetching page 2
 Saved (2): opinion_2_0.txt
 Skipping empty result #1
 Skipping empty result #2
 Skipping empty result #3
 Skipping empty result #4
 Skipping empty result #5
 Skipping empty result #6
 Skipping empty result #7
 Skipping empty result #8
 Skipping empty result #9
 Skipping empty result #10
 Skipping empty result #11
 Skipping empty result #12
 Skipping empty result #13
 Skipping empty result #14
 Skipping empty result #15
 Skippin
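
The fetch loop stops when a response carries no `next` value. An alternative way to paginate is to follow that value directly instead of incrementing a page counter. The sketch below assumes `next` holds the URL of the following page, as is typical for paginated REST responses; `iter_results` and the `fetch` callable are hypothetical names, and the demo uses canned responses rather than the real API.

```python
from typing import Callable, Iterator, Optional


def iter_results(first_url: str, fetch: Callable[[str], dict], max_pages: int) -> Iterator[dict]:
    """Yield items from each page, following the "next" URL until it is missing."""
    url: Optional[str] = first_url
    pages_fetched = 0
    while url and pages_fetched < max_pages:
        data = fetch(url)
        pages_fetched += 1
        yield from data.get("results", [])
        url = data.get("next")


# Offline demo: a dict of canned responses stands in for the API
fake_pages = {
    "page-1": {"results": [{"id": 1}, {"id": 2}], "next": "page-2"},
    "page-2": {"results": [{"id": 3}], "next": None},
}
ids = [item["id"] for item in iter_results("page-1", fake_pages.__getitem__, max_pages=5)]
print(ids)  # [1, 2, 3]
```

With `requests`, the `fetch` argument could be something like `lambda url: requests.get(url, headers=headers, timeout=30).json()`.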

## 5. Verify Saved Files

After running the fetch loop, this block confirms that the expected number of `.txt` files were saved in the output directory.

- `os.listdir()` lists every entry in the `data/raw/` folder, so the count includes any non-`.txt` files that happen to be there
- The print statements show how many rulings were saved and preview the first few filenames

This is a quick way to validate that your data collection ran as expected before moving on to preprocessing or modeling.


In [54]:
files = os.listdir(DATA_DIR)
print(f"Found {len(files)} files in {DATA_DIR}")
print(files[:5])  # Preview first few

Found 74 files in data/raw
['opinion_17_2.txt', 'opinion_11_14.txt', 'opinion_13_11.txt', 'opinion_11_4.txt', 'opinion_6_8.txt']
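
For a slightly stricter check, the sketch below counts only `.txt` files and flags any whose text is shorter than a minimum length. `summarize_rulings` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def summarize_rulings(folder: str, min_length: int) -> dict:
    """Count .txt files in folder and list any whose text is shorter than min_length."""
    txt_files = sorted(Path(folder).glob("*.txt"))
    too_short = [
        p.name for p in txt_files
        if len(p.read_text(encoding="utf-8").strip()) < min_length
    ]
    return {"count": len(txt_files), "too_short": too_short}
```

Running `summarize_rulings(DATA_DIR, MIN_TEXT_LENGTH)` after the fetch loop should report an empty `too_short` list if every saved ruling passed the length filter.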
