## 1. Load and Inspect Ruling Files

This notebook loads the saved legal rulings from `data/raw/` into memory and previews their content.

### Goal:
- Verify structure and content quality
- Prepare for cleaning and embedding

In [1]:
import os

# Reuse DATA_DIR from config.py if running in the same environment
import sys
sys.path.append(os.path.abspath(".."))
from config import DATA_DIR

# Load ruling texts; sort the listing so document order is stable across runs
filepaths = sorted(os.path.join(DATA_DIR, f) for f in os.listdir(DATA_DIR) if f.endswith(".txt"))
documents = []

for filepath in filepaths:
    with open(filepath, "r", encoding="utf-8") as file:
        text = file.read().strip()
        documents.append(text)

print(f"Loaded {len(documents)} rulings into memory.")

Loaded 74 rulings into memory.
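
Beyond the count, a few summary statistics can help with the "verify structure and content quality" goal. The following is a minimal sketch, not part of the original notebook: `summarize_documents` and `TAG_PATTERN` are hypothetical names, and the sample list stands in for `documents`.

```python
import re

# Hypothetical helper for a quick quality check (an assumption, not the notebook's code).
TAG_PATTERN = re.compile(r"<[^>]+>")

def summarize_documents(docs):
    """Return basic quality statistics for a list of ruling texts."""
    lengths = [len(d) for d in docs]
    return {
        "count": len(docs),
        # Files that were empty (or whitespace-only before .strip())
        "empty": sum(1 for d in docs if not d),
        # Documents that still contain something that looks like an HTML tag
        "with_html": sum(1 for d in docs if TAG_PATTERN.search(d)),
        "min_chars": min(lengths) if lengths else 0,
        "max_chars": max(lengths) if lengths else 0,
    }

# Stand-in data; in the notebook this would be summarize_documents(documents)
sample = ["<div>People v Palm</div>", "", "Plain text ruling"]
print(summarize_documents(sample))
```

A non-zero `empty` count or an unusually small `min_chars` would point to files worth opening by hand before cleaning.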


## 2. Preview Sample Rulings

This block prints the first 500 characters of the first three rulings so we can manually inspect structure, quality, and potential cleaning needs.


In [2]:
# Preview the first few rulings to check content structure
for i, doc in enumerate(documents[:3]):
    print(f"\n--- Ruling {i+1} Preview ---\n")
    print(doc[:500])  # Print the first 500 characters
    print("\n--- End Preview ---\n")


--- Ruling 1 Preview ---

<div>

People v Palm (<span class="citation" data-id="11022071"><a href="/opinion/10555483/people-v-palm/" aria-description="Citation for case: People v. Palm">2025 NY Slip Op 02799</a></span>)



<table width="80%" border="1" cellspacing="2" cellpadding="5" align="center">
<tr>
<td align="center"><b>People v Palm</b></td>
</tr>
<tr>
<td align="center"><span class="citation" data-id="11022071"><a href="/opinion/10555483/people-v-palm/" aria-description="Citation for case: People v. Palm">2025 NY

--- End Preview ---


--- Ruling 2 Preview ---

<div>

Pantanilla v Yuson (<span class="citation" data-id="10889300"><a href="/opinion/10422712/pantanilla-v-yuson/" aria-description="Citation for case: Pantanilla v. Yuson">2025 NY Slip Op 02597</a></span>)



<table width="80%" border="1" cellspacing="2" cellpadding="5" align="center">
<tr>
<td align="center"><b>Pantanilla v Yuson</b></td>
</tr>
<tr>
<td align="center"><span class="citation" data-id="10889300"><a hre

## 3. Clean Rulings by Stripping HTML Tags

CourtListener rulings are provided in HTML format (`html_with_citations`), which includes tags like `<div>`, `<table>`, and `<span>`.

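This step removes the HTML tags with a regular expression, leaving plain text for downstream processing and embedding. HTML entities such as `&amp;` and the blank lines left behind by removed markup are not touched by this step.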


In [4]:
import re

def strip_html_tags(text):
    """Remove HTML tags using a regex; HTML entities such as &amp; are left as-is."""
    # Non-greedy match from "<" to the next ">" on the same line
    return re.sub(r'<.*?>', '', text)

# Apply cleaning to all documents
cleaned_documents = [strip_html_tags(doc) for doc in documents]

In [6]:
# Preview cleaned rulings
for i, doc in enumerate(cleaned_documents[:3]):
    print(f"\n--- Cleaned Ruling {i+1} Preview ---\n")
    print(doc[:1500])
    print("\n--- End Preview ---\n")


--- Cleaned Ruling 1 Preview ---



People v Palm (2025 NY Slip Op 02799)





People v Palm


2025 NY Slip Op 02799


Decided on May 7, 2025


Appellate Division, Second Department



Published by New York State Law Reporting Bureau pursuant to Judiciary Law § 431.


This opinion is uncorrected and subject to revision before publication in the Official Reports.



Decided on May 7, 2025
SUPREME COURT OF THE STATE OF NEW YORK
Appellate Division, Second Judicial Department

FRANCESCA E. CONNOLLY, J.P.
ROBERT J. MILLER
LOURDES M. VENTURA
JAMES P. MCCORMACK, JJ.


2023-05297
 (Ind. No. 70575/22)

[*1]The People of the State of New York, respondent,
vNicholas Palm, appellant.


Mark Diamond, Pound Ridge, NY, for appellant.
David M. Hoovler, District Attorney, Goshen, NY (Edward D. Saslaw of counsel), for respondent.



DECISION &amp; ORDER
Appeal by the defendant from a judgment of the County Court, Orange County (Craig S. Brown, J.), rendered May 16, 2023, convicting him of criminal poss
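
The cleaned preview still shows an HTML entity (`DECISION &amp; ORDER`), long runs of blank lines, and words fused where a tag was removed (`vNicholas Palm`). The sketch below shows one possible follow-up normalization under those observations. It is an assumption, not the notebook's method: `normalize_ruling` is a hypothetical name, and replacing tags with a space (rather than an empty string) is a design choice that can split a word if a tag sits inside it.

```python
import html
import re

# Hypothetical normalization pass (an assumption, not part of the original notebook).
TAG_PATTERN = re.compile(r"<[^>]+>")

def normalize_ruling(raw_html):
    """Strip tags, decode HTML entities, and collapse excess whitespace."""
    # Replace each tag with a space so neighbouring words are not fused together
    text = TAG_PATTERN.sub(" ", raw_html)
    # Decode entities after tag removal, so an escaped "&lt;" is not mistaken for a tag
    text = html.unescape(text)
    # Collapse runs of spaces/tabs and trim each line
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Keep at most one blank line between blocks of text
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

# Stand-in input; in the notebook this would be applied to each item in `documents`
print(normalize_ruling("<div>\n\n\n\nDECISION &amp; ORDER<br>v<br>Nicholas Palm</div>"))
```

Whether to apply this before embedding depends on how much of the original layout the downstream step relies on.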

## 4. Save Cleaned Rulings 

Now that the rulings are stripped of HTML tags and verified for content, we save them to `data/clean/` for reuse without repeating the cleaning step.


In [7]:
import os

# Create clean output directory if it doesn't exist (path is relative to the current working directory)
clean_dir = os.path.join("data", "clean")
os.makedirs(clean_dir, exist_ok=True)

# Save cleaned rulings
for i, text in enumerate(cleaned_documents):
    filename = f"clean_opinion_{i}.txt"
    with open(os.path.join(clean_dir, filename), "w", encoding="utf-8") as f:
        f.write(text)

print(f"Saved {len(cleaned_documents)} cleaned rulings to {clean_dir}")

Saved 74 cleaned rulings to data/clean
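
Index-based names such as `clean_opinion_0.txt` do not record which raw file each cleaned ruling came from. One way to keep that link, sketched here as an assumption rather than the notebook's approach, is to derive the output name from the source filename. `save_cleaned` is a hypothetical helper, and `opinion_123.txt` is an invented example filename.

```python
import os
import tempfile

def save_cleaned(named_texts, out_dir):
    """Write (source_path, text) pairs to out_dir, naming each file after its source."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for source_path, text in named_texts:
        # "data/raw/opinion_123.txt" -> "clean_opinion_123.txt"
        stem = os.path.splitext(os.path.basename(source_path))[0]
        out_path = os.path.join(out_dir, f"clean_{stem}.txt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(text)
        written.append(out_path)
    return written

# Demonstration in a temporary directory with an invented filename.
# In the notebook, the call might be: save_cleaned(zip(filepaths, cleaned_documents), clean_dir)
with tempfile.TemporaryDirectory() as tmp:
    paths = save_cleaned([("data/raw/opinion_123.txt", "People v Palm")], os.path.join(tmp, "clean"))
    print([os.path.basename(p) for p in paths])
```

This relies on `filepaths` and `cleaned_documents` being in the same order, which holds in the notebook because both are built from the same list.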
