## 1. Setup & Config Imports

We start by importing the libraries we need and loading project settings from `config.py`.

- `sys.path.append(os.path.abspath(".."))` adds the parent of the current working directory (normally the folder above this notebook) to Python's import search path so it can find `config.py`.
- `config.py` holds sensitive or reusable values like the API token, query string, page size, and save paths.

This lets us separate code logic from configuration and makes our project easier to scale and maintain.
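For reference, a `config.py` along these lines would satisfy the import in the next cell. This is only a minimal sketch: the environment variable name `COURTLISTENER_API_TOKEN`, the query string, and every value shown are assumptions, not the project's actual settings.

```python
# config.py (hypothetical sketch; every value here is a placeholder)
import os

# Read the token from the environment so it never lands in version control.
# The variable name COURTLISTENER_API_TOKEN is an assumption.
API_TOKEN = os.environ.get("COURTLISTENER_API_TOKEN", "")

QUERY = "example search term"           # placeholder search query
PAGE_SIZE = 20                          # results requested per page
MAX_PAGES = 10                          # upper bound on pages to fetch
DATA_DIR = os.path.join("data", "raw")  # folder for saved ruling files
MIN_TEXT_LENGTH = 100                   # skip opinions shorter than this
```

Keeping the token in an environment variable rather than in the file itself means `config.py` can be committed without exposing credentials.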

In [33]:
import os
import sys
import time

import requests

# Add the parent folder to the system path
sys.path.append(os.path.abspath(".."))

from config import API_TOKEN, QUERY, PAGE_SIZE, MAX_PAGES, DATA_DIR, MIN_TEXT_LENGTH

## 2. Prepare API Request & Output Folder

We set up:

- `headers`: includes your API token for authentication with CourtListener. The `"Authorization"` key must follow the format `"Token <your_token>"`.
- `params`: specifies query parameters like your search term, sort order, and how many results to return per page.
- `os.makedirs`: ensures that the folder where we'll save ruling files exists. If it doesn't, it's created automatically.

This block prepares everything needed to begin retrieving data securely and saving it locally.
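The `"Token <your_token>"` format is easy to get wrong by a single space, so one option is to build the header in a small helper that also rejects an empty token. The function name `build_headers` and the dummy token below are hypothetical and used only for illustration.

```python
def build_headers(token: str) -> dict:
    """Return request headers in the "Token <your_token>" format."""
    token = (token or "").strip()
    if not token:
        raise ValueError("API token is empty; check config.py")
    # Note the single space between the word "Token" and the token value.
    return {"Authorization": f"Token {token}"}


headers = build_headers("abc123")  # dummy token for illustration
print(headers)
```

In the notebook, the call would take `API_TOKEN` from `config.py` instead of the dummy value.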

In [43]:
headers = {
    "Authorization": f"Token {API_TOKEN}"
}
params = {
    "search": QUERY,
    "order_by": "date_filed",
    "page_size": PAGE_SIZE
}
os.makedirs(DATA_DIR, exist_ok=True)

In [48]:
saved = 0
page = 1

while saved < 150 and page <= MAX_PAGES:
    print(f"🔍 Fetching page {page}")
    params["page"] = page
    response = requests.get(
        "https://www.courtlistener.com/api/rest/v4/opinions/",
        headers=headers,
        params=params,
    )
    response.raise_for_status()  # fail fast on HTTP errors such as 401 or 429
    data = response.json()

    for i, item in enumerate(data.get("results", [])):
        text = item.get("plain_text") or item.get("html_with_citations") or ""
        if not text or len(text.strip()) < MIN_TEXT_LENGTH:
            print(f"⏭️ Skipping empty result #{i}")
            continue

        # Fall back to a generated name when case_name is missing or None
        title = item.get("case_name") or f"opinion_{page}_{i}"
        filename = f"{title[:40].strip().replace(' ', '_').replace('/', '-')}.txt"
        filepath = os.path.join(DATA_DIR, filename)

        with open(filepath, "w", encoding="utf-8") as f:
            f.write(text)

        saved += 1
        print(f"✅ Saved ({saved}): {filename}")
        time.sleep(1)  # avoid spamming the server

    if not data.get("next"):
        print("No more pages available.")
        break

    page += 1

print(f"\nFinished! Total saved: {saved} rulings.")


🔍 Fetching page 1
⏭️ Skipping empty result #0
⏭️ Skipping empty result #1
⏭️ Skipping empty result #2
⏭️ Skipping empty result #3
⏭️ Skipping empty result #4
⏭️ Skipping empty result #5
⏭️ Skipping empty result #6
⏭️ Skipping empty result #7
⏭️ Skipping empty result #8
⏭️ Skipping empty result #9
⏭️ Skipping empty result #10
⏭️ Skipping empty result #11
⏭️ Skipping empty result #12
⏭️ Skipping empty result #13
⏭️ Skipping empty result #14
⏭️ Skipping empty result #15
⏭️ Skipping empty result #16
⏭️ Skipping empty result #17
⏭️ Skipping empty result #18
⏭️ Skipping empty result #19
🔍 Fetching page 2
⏭️ Skipping empty result #0
⏭️ Skipping empty result #1
⏭️ Skipping empty result #2
⏭️ Skipping empty result #3
⏭️ Skipping empty result #4
⏭️ Skipping empty result #5
✅ Saved (1): opinion_2_6.txt
⏭️ Skipping empty result #7
⏭️ Skipping empty result #8
⏭️ Skipping empty result #9
⏭️ Skipping empty result #10
⏭️ Skipping empty result #11
⏭️ Skipping empty result #12
⏭️ Skipping empty result #
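The one file saved in the run above is `opinion_2_6.txt`, so the fallback title was used instead of a case name. A small helper can make that fallback explicit and also handle a `case_name` that is `None` or contains characters that are awkward in filenames. This is a sketch under assumptions: `safe_filename` is a hypothetical name, and its character rules are stricter than the inline `replace` calls used in the loop.

```python
import re


def safe_filename(title, fallback, max_len=40):
    """Turn a case name into a filesystem-friendly .txt filename."""
    # Use the fallback when the title is None or an empty string.
    base = (title or fallback)[:max_len].strip()
    # Collapse whitespace runs into single underscores.
    base = re.sub(r"\s+", "_", base)
    # Replace anything that is not a word character, dot, or hyphen.
    base = re.sub(r"[^\w.-]", "-", base)
    return f"{base or fallback}.txt"


print(safe_filename(None, "opinion_2_6"))
print(safe_filename("Smith v. Jones/Doe", "opinion_2_7"))
```

Two opinions that share a case name would still map to the same file, so a unique suffix such as the result index may be worth appending.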

In [45]:
#response = requests.get(base_url, headers=headers, params=params)
#print(response.status_code)
#print(response.text)

In [46]:
data = response.json()
print(data.keys())
print("Number of results:", len(data.get("results", [])))

dict_keys(['count', 'next', 'previous', 'results'])
Number of results: 20


In [47]:
for i, item in enumerate(data["results"]):
    text = item.get("plain_text") or item.get("html_with_citations") or ""
    print(f"Result #{i} — Length: {len(text.strip())}")

Result #0 — Length: 0
Result #1 — Length: 0
Result #2 — Length: 0
Result #3 — Length: 0
Result #4 — Length: 0
Result #5 — Length: 0
Result #6 — Length: 0
Result #7 — Length: 0
Result #8 — Length: 0
Result #9 — Length: 0
Result #10 — Length: 0
Result #11 — Length: 0
Result #12 — Length: 0
Result #13 — Length: 0
Result #14 — Length: 0
Result #15 — Length: 0
Result #16 — Length: 0
Result #17 — Length: 0
Result #18 — Length: 0
Result #19 — Length: 0
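Every result above reports a length of 0, which suggests that neither `plain_text` nor `html_with_citations` is populated for this page. One way to investigate is to count which text-like fields actually contain content. The sketch below assumes a few additional field names (`html`, `html_lawbox`, `html_columbia`, `xml_harvard`) that may or may not appear in the response, and it runs on made-up sample data rather than real API output.

```python
# Field names other than plain_text and html_with_citations are assumptions.
CANDIDATE_FIELDS = (
    "plain_text",
    "html_with_citations",
    "html",
    "html_lawbox",
    "html_columbia",
    "xml_harvard",
)


def populated_fields(results, fields=CANDIDATE_FIELDS):
    """Count how many results have non-blank text in each candidate field."""
    counts = {name: 0 for name in fields}
    for item in results:
        for name in fields:
            value = item.get(name)
            if isinstance(value, str) and value.strip():
                counts[name] += 1
    return counts


# Made-up sample data, not real API output.
sample = [
    {"plain_text": "", "html": "<p>Opinion text</p>"},
    {"plain_text": "Full text", "html": None},
    {},
]
print(populated_fields(sample))
```

Calling `populated_fields(data["results"])` in the notebook would show whether any field carries text for the current page. Printing `data["results"][0].keys()` is a quick way to confirm which fields the response really includes.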


In [38]:
for i, item in enumerate(data["results"]):
    text = item.get("plain_text") or item.get("html_with_citations") or ""
    
    # Skip empty content
    if not text or len(text.strip()) < 100:
        print(f"Skipping empty result #{i}: {item.get('case_name')}")
        continue

    # Fall back to a generated name when case_name is missing or None
    title = item.get("case_name") or f"opinion_{i}"
    filename = f"{title[:40].strip().replace(' ', '_').replace('/', '-')}.txt"
    
    with open(os.path.join(DATA_DIR, filename), "w", encoding="utf-8") as f:
        f.write(text)

    print(f"Saved: {filename}")
    time.sleep(1)

Skipping empty result #0: None
Skipping empty result #1: None
Skipping empty result #2: None
Skipping empty result #3: None
Skipping empty result #4: None
Skipping empty result #5: None
Skipping empty result #6: None
Skipping empty result #7: None
Skipping empty result #8: None
Skipping empty result #9: None
Skipping empty result #10: None
Skipping empty result #11: None
Skipping empty result #12: None
Skipping empty result #13: None
Skipping empty result #14: None
Skipping empty result #15: None
Skipping empty result #16: None
Skipping empty result #17: None
Skipping empty result #18: None
Skipping empty result #19: None
