## 1. Load and Inspect Ruling Files

This notebook loads the saved legal rulings from `data/raw/` into memory and previews their content.

### Goal:
- Verify structure and content quality
- Clean and prepare for embedding

In [1]:
import sys  # Provides access to system-specific functions and paths
import os  # Provides functions for interacting with the file system

# Add the parent directory to the module search path
# This allows importing config.py if it's located one level up
sys.path.append(os.path.abspath(".."))

# Import the path to the data directory from the config module
from config import DATA_DIR

# Collect paths to all .txt files in the data directory
filepaths = [os.path.join(DATA_DIR, f) for f in os.listdir(DATA_DIR) if f.endswith(".txt")]
documents = []

# Read and store the contents of each text file
for filepath in filepaths:
    with open(filepath, "r", encoding="utf-8") as file:
        text = file.read().strip()
        documents.append(text)

# Print the total number of rulings loaded into memory
print(f"Loaded {len(documents)} rulings into memory.")

Loaded 74 rulings into memory.
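
`os.listdir` returns entries in arbitrary order, so the position of a ruling in `documents` can differ between runs or machines. A minimal sketch of a loader that sorts the files and keeps each source filename next to its text is shown below. The helper name `load_rulings` and the demo filenames are hypothetical, and a temporary directory stands in for `DATA_DIR`.

```python
import tempfile
from pathlib import Path

def load_rulings(data_dir):
    """Return {filename: text} for every .txt file in data_dir, in sorted filename order."""
    rulings = {}
    for path in sorted(Path(data_dir).glob("*.txt")):
        rulings[path.name] = path.read_text(encoding="utf-8").strip()
    return rulings

# Self-contained demo on a temporary directory (a stand-in for DATA_DIR)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_ruling.txt").write_text("  second ruling \n", encoding="utf-8")
    Path(tmp, "a_ruling.txt").write_text("first ruling", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")  # not a .txt file
    demo_rulings = load_rulings(tmp)

print(list(demo_rulings))
```

Keeping the filename as the key makes it possible to trace a cleaned text back to its raw file later on.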


### Preview Sample Rulings

We now print the start of a few rulings to manually inspect structure, quality, and potential cleaning needs.


In [2]:
# Preview the first few rulings to check content structure
for i, doc in enumerate(documents[:3]):
    print(f"\n Ruling {i+1} Preview ------------------------------------------------------------------------------\n")
    print(doc[:500])  # Print the first 500 characters
    print(f"\n End of Ruling {i+1} Preview -----------------------------------------------------------------------\n")


 Ruling 1 Preview ------------------------------------------------------------------------------

<div>

People v Palm (<span class="citation" data-id="11022071"><a href="/opinion/10555483/people-v-palm/" aria-description="Citation for case: People v. Palm">2025 NY Slip Op 02799</a></span>)



<table width="80%" border="1" cellspacing="2" cellpadding="5" align="center">
<tr>
<td align="center"><b>People v Palm</b></td>
</tr>
<tr>
<td align="center"><span class="citation" data-id="11022071"><a href="/opinion/10555483/people-v-palm/" aria-description="Citation for case: People v. Palm">2025 NY

 End of Ruling 1 Preview -----------------------------------------------------------------------


 Ruling 2 Preview ------------------------------------------------------------------------------

<div>

Pantanilla v Yuson (<span class="citation" data-id="10889300"><a href="/opinion/10422712/pantanilla-v-yuson/" aria-description="Citation for case: Pantanilla v. Yuson">2025 NY Slip Op 02597</a></
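
The previews show that the raw files are mostly HTML markup. Manual inspection can be complemented with a rough per-document measure of how much of the text is markup. The sketch below is one possible way to do that; `markup_stats` is a hypothetical helper, and `sample_doc` is a shortened stand-in for a raw ruling, not real data.

```python
import re

# [^>] also matches newlines, so tags that span several lines are counted
TAG_RE = re.compile(r"<[^>]+>")

def markup_stats(doc):
    """Return character count, tag count, and the share of characters inside tags."""
    tags = TAG_RE.findall(doc)
    tag_chars = sum(len(t) for t in tags)
    return {
        "chars": len(doc),
        "tags": len(tags),
        "tag_share": round(tag_chars / len(doc), 2) if doc else 0.0,
    }

# Shortened stand-in for a raw ruling
sample_doc = '<div>\n\nPeople v Palm (<span class="citation">2025 NY Slip Op 02799</span>)'
stats = markup_stats(sample_doc)
print(stats)
```

Applied to every entry in `documents`, a check like this would flag files that are unusually short or contain almost no text outside of tags.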

## 2. Clean Rulings (Light Cleaning for Semantic Use)
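This section lightly cleans the ruling text for semantic search and readability.
- Text is lowercased
- HTML tags are stripped
- Spaces and tabs are normalized and blank lines are collapsed, but **single newlines are preserved** for document structure

**Note:** We avoid over-cleaning in early stages so that sentence boundaries stay intact for LLMs. HTML entities such as `&amp;` are not decoded by this step.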

In [13]:
import re

def clean_ruling_text(text):
    """Clean and normalize ruling text for semantic tasks."""
    
    # 1. Lowercase and remove HTML tags
    text = re.sub(r'<.*?>', '', text.lower())
        
    # 2. Replace multiple newlines with a single newline 
    text = re.sub(r'\n\s*\n+', '\n', text)
    
    # 3. Normalize spaces and tabs (but not newlines)
    text = re.sub(r'[ \t]+', ' ', text)
    
    # 4. Strip leading/trailing whitespace
    text = text.strip()

    return text

# Apply combined cleaning to all documents
cleaned_documents = [clean_ruling_text(doc) for doc in documents]
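
Three side effects of `clean_ruling_text` are worth knowing about. Deleting tags outright can fuse neighbouring words, `<.*?>` does not match tags that span several lines, and HTML entities are left encoded. The variant below is a sketch of one way to address these, not a replacement for the function above. The name `clean_ruling_text_v2` and the sample string are hypothetical.

```python
import html
import re

# [^>] also matches newlines, so tags broken across lines are covered
TAG_RE = re.compile(r"<[^>]+>")

def clean_ruling_text_v2(text):
    """Variant cleaner: tags become spaces and HTML entities are decoded."""
    # 1. Lowercase and replace each tag with a space so adjacent words stay apart
    text = TAG_RE.sub(" ", text.lower())
    # 2. Decode entities after tag removal, so an escaped "&lt;b&gt;" is kept as literal text
    text = html.unescape(text)
    # 3. Normalize spaces and tabs (but not newlines)
    text = re.sub(r"[ \t]+", " ", text)
    # 4. Trim spaces around newlines, then collapse blank lines
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()

# Hypothetical sample containing inline tags, a line break tag, and an entity
sample = "<b>People</b><br>v<br>Nicholas Palm\n\n<p>Decision &amp; Order</p>"
cleaned_sample = clean_ruling_text_v2(sample)
print(cleaned_sample)
```

Replacing tags with a space instead of an empty string is the main design choice here: it trades a few extra spaces, which step 3 removes, for intact word boundaries.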

### Preview Cleaned Ruling

In [14]:
#print(cleaned_documents[0])

In [15]:
# Preview cleaned rulings
for i, doc in enumerate(cleaned_documents[:3]):
    print(f"\n Cleaned Ruling {i+1} Preview ---------------------------------------------------------\n")
    print(doc[:1500])
    print(f"\n End Ruling {i+1} Preview -------------------------------------------------------------\n")


 Cleaned Ruling 1 Preview ---------------------------------------------------------

people v palm (2025 ny slip op 02799)
people v palm
2025 ny slip op 02799
decided on may 7, 2025
appellate division, second department
published by new york state law reporting bureau pursuant to judiciary law § 431.
this opinion is uncorrected and subject to revision before publication in the official reports.
decided on may 7, 2025
supreme court of the state of new york
appellate division, second judicial department
francesca e. connolly, j.p.
robert j. miller
lourdes m. ventura
james p. mccormack, jj.
2023-05297
 (ind. no. 70575/22)
[*1]the people of the state of new york, respondent,
vnicholas palm, appellant.
mark diamond, pound ridge, ny, for appellant.
david m. hoovler, district attorney, goshen, ny (edward d. saslaw of counsel), for respondent.
decision &amp; order
appeal by the defendant from a judgment of the county court, orange county (craig s. brown, j.), rendered may 16, 2023, convicting

### Optional: Normalize for Token-Level Tasks
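Use this version for token-level modeling tasks that need a uniform, single-line text format. It collapses all whitespace, including the newlines preserved by the light cleaning step.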

In [16]:
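def normalize_text(text, remove_punctuation=True, remove_stopwords=True):
    """Collapse all whitespace, including newlines, into single spaces.

    Note: remove_punctuation and remove_stopwords are accepted but not yet applied.
    """
    text = re.sub(r"\s+", " ", text).strip()
    return text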
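
The body of `normalize_text` only collapses whitespace, so the `remove_punctuation` and `remove_stopwords` flags currently have no effect. The sketch below shows one way the flags might be honored. The function name `normalize_text_sketch` and the tiny `STOPWORDS` set are assumptions for illustration; a real pipeline would more likely take its stopword list from an NLP library.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = frozenset({"a", "an", "and", "by", "for", "from", "in", "of", "on", "the", "to"})

# Map every ASCII punctuation character to a space (non-ASCII symbols such as "§" are kept)
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize_text_sketch(text, remove_punctuation=True, remove_stopwords=True):
    """Optionally remove punctuation and stopwords, then collapse whitespace."""
    if remove_punctuation:
        text = text.translate(PUNCT_TABLE)
    tokens = text.split()  # split() also collapses runs of whitespace and newlines
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(tokens)

normalized = normalize_text_sketch(
    "appeal by the defendant from a judgment of the county court, orange county"
)
print(normalized)
```

With both flags set to `False`, the sketch behaves like the original function and only collapses whitespace.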

## 3. Save Cleaned Rulings 

Now that the rulings are cleaned and verified for content,  
we save them to `data/clean/` for reuse without repeating the cleaning step.


In [17]:
import os

# Create clean output directory if it doesn't exist
clean_dir = os.path.join("data", "clean")
os.makedirs(clean_dir, exist_ok=True)

# Save cleaned rulings
for i, text in enumerate(cleaned_documents):
    filename = f"clean_opinion_{i}.txt"
    with open(os.path.join(clean_dir, filename), "w", encoding="utf-8") as f:
        f.write(text)

print(f"Saved {len(cleaned_documents)} cleaned rulings to {clean_dir}")

Saved 74 cleaned rulings to data/clean
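
Two details of the save step may matter for later notebooks. The relative path `data/clean` is resolved against the notebook's working directory, which is not necessarily where `DATA_DIR` points. The index in `clean_opinion_{i}.txt` also depends on the order in which the files were listed. The sketch below names each output after its source file instead. It assumes the cleaned texts are available as a `{source filename: text}` mapping; `save_cleaned` and the demo filename are hypothetical.

```python
import tempfile
from pathlib import Path

def save_cleaned(named_texts, clean_dir):
    """Write each text to clean_dir as clean_<source stem>.txt and return the written paths."""
    out_dir = Path(clean_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source_name, text in sorted(named_texts.items()):
        out_path = out_dir / f"clean_{Path(source_name).stem}.txt"
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written

# Self-contained demo in a temporary directory (a stand-in for data/clean)
with tempfile.TemporaryDirectory() as tmp:
    paths = save_cleaned({"people_v_palm.txt": "people v palm"}, Path(tmp) / "clean")
    saved_names = [p.name for p in paths]
    round_trip = paths[0].read_text(encoding="utf-8")

print(saved_names)
```

Reading one file back, as the demo does, is a cheap check that the encoding and contents survived the write.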
