# Ruling Preprocessing: Cleaning, Party Extraction, and Metadata Preparation

This notebook performs preprocessing on court rulings, including:
- Cleaning and formatting decision text for NLP and IR tasks
- Extracting party names using a structured heuristic
- Preparing and saving a metadata dataset for semantic search and analysis

This sets the foundation for embedding, retrieval, and interactive exploration in later phases.

In [43]:
import os  # needed here: os is otherwise first imported in the next cell
os.chdir("/Users/lidasmac/compliance-nlp/")

In [44]:
import importlib
import src.preprocessing

importlib.reload(src.preprocessing)

# --- Imports ---

# Standard libraries
import os
import re
import pandas as pd

# Custom preprocessing functions
from src.preprocessing import (
    clean_html,
    extract_decision_section,
    clean_decision_text,
    clean_and_format_decision,
    extract_party_block,
    clean_party_line,
)

## Step 1: Load and Inspect Legal Rulings

In [45]:
# --- Load ruling documents ---

# Set the data path (relative to the working directory set above)
DATA_DIR = "data/raw"

# Gather all .txt file paths
filepaths = [os.path.join(DATA_DIR, f) for f in os.listdir(DATA_DIR) if f.endswith(".txt")]

# Read and store ruling content
documents = []
for filepath in filepaths:
    with open(filepath, "r", encoding="utf-8") as file:
        text = file.read().strip()
        documents.append(text)

# Confirm number of rulings loaded
print(f"Loaded {len(documents)} rulings into memory.")


Loaded 74 rulings into memory.


## Step 2: Preview Sample Rulings

Before preprocessing, we preview a few rulings to understand the formatting and plan cleaning steps.

We start with the first two rulings for manual inspection.

In [46]:
# Preview the first few rulings to check content structure
for i, doc in enumerate(documents[:2]):
    print(f"\n Ruling {i+1} Preview\n{'-'*100}")
    print(doc[:300]) 
    print(f"\n End of Ruling {i+1} Preview\n{'-'*100}")


 Ruling 1 Preview
----------------------------------------------------------------------------------------------------
<div>

People v Palm (<span class="citation" data-id="11022071"><a href="/opinion/10555483/people-v-palm/" aria-description="Citation for case: People v. Palm">2025 NY Slip Op 02799</a></span>)



<table width="80%" border="1" cellspacing="2" cellpadding="5" align="center">
<tr>
<td align="center"><

 End of Ruling 1 Preview
----------------------------------------------------------------------------------------------------

 Ruling 2 Preview
----------------------------------------------------------------------------------------------------
<div>

Pantanilla v Yuson (<span class="citation" data-id="10889300"><a href="/opinion/10422712/pantanilla-v-yuson/" aria-description="Citation for case: Pantanilla v. Yuson">2025 NY Slip Op 02597</a></span>)



<table width="80%" border="1" cellspacing="2" cellpadding="5" align="center">
<tr>
<td a

 End of Ruling 2 Preview
-----

## Step 3: Clean Rulings - Initial phase

To prepare for semantic search, we apply lightweight cleaning focused on structure preservation:

- Remove HTML tags and markup
- Preserve line breaks to retain paragraph and logical structure
- No normalization, lowercasing, or stopword removal — this will come later (if at all)

This ensures that ruling structure remains intact for retrieval tasks.
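
As an illustration of the cleaning described above, here is a minimal sketch of tag stripping that keeps line breaks. It is not the `clean_html` function from `src.preprocessing`, whose implementation is not shown in this notebook. The names `_TextCollector` and `strip_html_sketch` are hypothetical, and the sketch relies only on the standard-library `html.parser` module.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Text between tags arrives here with its original newlines intact.
        self.parts.append(data)


def strip_html_sketch(raw: str) -> str:
    """Remove HTML markup while preserving the line breaks of the source."""
    collector = _TextCollector()
    collector.feed(raw)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts).strip()


sample = (
    '<div>\n\nPeople v Palm (<span class="citation">'
    '<a href="/x/">2025 NY Slip Op 02799</a></span>)\n\n</div>'
)
print(strip_html_sketch(sample))
```

Because nothing is lowercased or normalized, the text keeps its original casing and paragraph layout, which matches the goal of structure-preserving cleaning stated above.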

In [47]:
# Apply HTML cleaning to all raw documents
cleaned_documents = [clean_html(doc) for doc in documents]

### 3.1 Preview Cleaned Rulings

We now inspect a cleaned ruling to confirm that the HTML markup is gone and that line breaks are preserved.

In [48]:
# --- Preview cleaned rulings (middle section) ---

for i, doc in enumerate(cleaned_documents[:1]):
    print(f"\n Cleaned Ruling {i+1} Preview\n{'-'*100}")
    print(doc[500:1000]) # You can adjust slice or use full print for deeper review
    print(f"\n End Cleaned Ruling {i+1} Preview\n{'-'*100}")


 Cleaned Ruling 1 Preview
----------------------------------------------------------------------------------------------------
ROBERT J. MILLER

LOURDES M. VENTURA

JAMES P. MCCORMACK, JJ.




2023-05297

 (Ind. No. 70575/22)


[*1]The People of the State of New York, respondent,

v
Nicholas Palm, appellant.





Mark Diamond, Pound Ridge, NY, for appellant.


David M. Hoovler, District Attorney, Goshen, NY (Edward D. Saslaw of counsel), for respondent.




DECISION & ORDER


Appeal by the defendant from a judgment of the County Court, Orange County (Craig S. Brown, J.), rendered May 16, 2023, convicting him of criminal

 End Cleaned Ruling 1 Preview
----------------------------------------------------------------------------------------------------


### 3.2 Ruling Structure 
Each court ruling generally follows a consistent structure that can be used for parsing and semantic analysis.

**1. Case Header**
- Case name (e.g., People v Palm)
- Slip opinion number (e.g., 2025 NY Slip Op 02799)
- Decision date
- Court (usually Appellate Division and department)
- Publication note and court information
- Docket and index numbers

**2. Judges**
- List of justices or judges presiding over the decision

**3. Party and Counsel Information**

- Named parties: appellant(s) and respondent(s)
- Legal representation: attorney listings for each side

**4. Decision Section**
- Procedural Summary: What is being appealed and why
- Outcome Statement: Usually starts with ORDERED that...
- Factual Background: Case facts and trial history
- Legal Analysis: Citations and interpretation of legal standards
- Constitutional/Procedural Issues: Miranda rights, due process, etc.
- Counsel Effectiveness: Review of legal representation if raised
- Remaining Issues: Any minor or unpreserved claims

**5. Conclusion**
- Final judgment (affirmed/reversed/remanded)
- List of concurring judges
- Clerk of the Court (signature block)

## Step 4: Extract Decision Section 

The decision section contains the core legal analysis and conclusion of each ruling, which is essential for risk and relevance assessments.

We locate the decision body by:
- Starting **after** the "DECISION & ORDER" heading (if present), or at the first legal paragraph if not.
- Stopping **before** judge signatures (e.g., lines containing `J.P.` or `JJ.`) or official closings like `"THIS CONSTITUTES..."` or `"ENTERED:"`.

This ensures a clean, focused extraction of the ruling without trailing metadata or author signatures.
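
The boundary rules above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual `extract_decision_section` from `src.preprocessing`: the name `extract_decision_sketch` and both regular expressions are hypothetical, and the fallback to "the first legal paragraph" is deliberately left out.

```python
import re
from typing import Optional

# Assumed patterns derived from the description above; the real ones may differ.
START_RE = re.compile(r"^[ \t]*DECISION[ \t]*&[ \t]*ORDER[ \t]*$", re.MULTILINE)
STOP_RE = re.compile(
    r"^.*\b(?:J\.P\.|JJ\.)|^[ \t]*(?:THIS CONSTITUTES|ENTERED:)",
    re.MULTILINE,
)


def extract_decision_sketch(text: str) -> Optional[str]:
    """Return the text between the heading and the first closing marker."""
    start = START_RE.search(text)
    if start is None:
        return None  # no heading found; the fallback is not implemented here
    body = text[start.end():]
    stop = STOP_RE.search(body)
    if stop is not None:
        # Cut at the start of the signature or closing line.
        body = body[:stop.start()]
    body = body.strip()
    return body or None


sample = (
    "JAMES P. MCCORMACK, JJ.\n\n"
    "DECISION & ORDER\n\n"
    "Appeal by the defendant from a judgment.\n"
    "ORDERED that the judgment is affirmed.\n"
    "BARROS, J.P., MILLER, VENTURA and MCCORMACK, JJ., concur.\n"
    "ENTERED:\n"
)
print(extract_decision_sketch(sample))
```

Searching for the stop marker only after the heading keeps the judge list in the case header from ending the extraction early.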


In [49]:
decision_texts = [extract_decision_section(doc) for doc in cleaned_documents]

# Preview a few decision sections
for i, decision in enumerate(decision_texts[:1]):
    print(f"\nDoc {i+1} Decision Preview\n{'-'*100}")
    if decision:
        print(decision[:10000])
    else:
        print("No decision text found.")
    print(f"\nEnd of Doc {i+1} Decision Preview\n{'-'*100}")


Doc 1 Decision Preview
----------------------------------------------------------------------------------------------------
Appeal by the defendant from a judgment of the County Court, Orange County (Craig S. Brown, J.), rendered May 16, 2023, convicting him of criminal possession of a weapon in the second degree, upon a jury verdict, and imposing sentence. The appeal brings up for review the denial, after a hearing, of those branches of the defendant's omnibus motion which were to suppress physical evidence and his statements to law enforcement officials.
ORDERED that the judgment is affirmed.
The defendant was arrested after a stop and frisk resulting in the recovery of a firearm. After a hearing, the County Court denied those branches of the defendant's omnibus motion which were to suppress the physical evidence recovered, including the firearm, and the defendant's statements made to law enforcement officials. After a jury trial, the defendant was convicted of criminal possession o

### 4.1 Identify and Review Missing Decisions

After extracting the decision section from each ruling, we check for any that returned `None`. These may be:

- Placeholder rulings without a decision
- Metadata-only entries
- Malformed inputs that didn't match the expected pattern

This step helps us:
- Quantify how many rulings are unusable for semantic tasks
- Review a short preview of each problematic case for possible pattern refinement or exclusion

In [50]:
# Initialize counter to keep track of how many rulings return None
none_count = 0

# Loop through each extracted decision text with its index
for i, text in enumerate(decision_texts):
    # Check if the opinion section extraction returned None
    if text is None:
        none_count += 1  # Increment counter
        print(f"Ruling {i+1} returned None")  # Log which ruling failed extraction
        print(cleaned_documents[i][:50])  # Preview first 50 characters of the cleaned document
        print("-" * 100)  # Separator for readability

# Print the total number of rulings where no opinion section was found
print(f"\nTotal number of rulings that returned None: {none_count}")

Ruling 10 returned None
Matter of Boldi (
2025 NY Slip Op 02340
)









----------------------------------------------------------------------------------------------------
Ruling 11 returned None
Matter of Rodriguez v New York State Dept. of Moto
----------------------------------------------------------------------------------------------------
Ruling 23 returned None
FG&N Trust v 165 Hous. Corp. (
2025 NY Slip Op 021
----------------------------------------------------------------------------------------------------
Ruling 39 returned None
Matter of C.J. (J.C.) (
2025 NY Slip Op 02032
)



----------------------------------------------------------------------------------------------------
Ruling 47 returned None
JDM Wash. St. LLC v 90 Wash. St., LLC (
2025 NY Sl
----------------------------------------------------------------------------------------------------
Ruling 48 returned None
U.S. Bank N.A. v DLJ Mtge. Capital, Inc. (
2025 NY
---------------------------------------------

## Summary of Extracted Decision Sections

Out of 74 appellate rulings:

- 59 rulings successfully yielded a decision/opinion section
- 15 rulings returned None, typically due to:
    - Metadata-only records (e.g., no DECISION & ORDER or factual reasoning)
    - Placeholder documents without substantive text

These None cases can be excluded for now.

## Step 5: Clean and Format Decision Texts

To make legal rulings ready for semantic search, LLM summarization, or interactive retrieval, we apply a minimal yet effective cleaning and formatting:

**Step 1: Clean**

- Remove editorial markers like [*1]
- Collapse line breaks and normalize spacing
- Preserve legal citations (e.g., People v De Bour, 40 NY2d 210)

**Step 2: Format for Display**

Insert paragraph breaks at key transitional phrases to improve readability and prepare the text for chunked display or embedding:
- "ORDERED that"
- "Here,"
- "Accordingly,"
- "The defendant appeals."
- "The defendant contends that"

**Why This Matters**

This structure:
- Increases interpretability for human readers
- Helps with chunk-based semantic search
- Improves LLM prompt clarity and summarization quality
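
The two phases can be sketched as below. This is a hypothetical stand-in for `clean_and_format_decision`, whose source is not shown here. The name `clean_and_format_sketch` is invented, and placing each break before the phrase (rather than after it) is an assumption based on how "ORDERED that" appears in the formatted preview.

```python
import re

# Transitional phrases listed above.
BREAK_PHRASES = [
    "ORDERED that",
    "Here,",
    "Accordingly,",
    "The defendant appeals.",
    "The defendant contends that",
]

# Whitespace that is immediately followed by one of the phrases.
_BREAK_RE = re.compile(
    r"\s+(?=(?:" + "|".join(re.escape(p) for p in BREAK_PHRASES) + r"))"
)


def clean_and_format_sketch(text: str) -> str:
    # Phase 1: drop editorial page markers such as [*1] or [*12].
    text = re.sub(r"\[\*\d+\]", "", text)
    # Phase 1: collapse all whitespace runs, including line breaks.
    text = re.sub(r"\s+", " ", text).strip()
    # Phase 2: start a new paragraph before each transitional phrase.
    return _BREAK_RE.sub("\n\n", text)


raw = (
    "[*1]Appeal by the defendant\nfrom a judgment.\n"
    "ORDERED that the judgment is affirmed.\n"
    "Here, the court  agreed."
)
print(clean_and_format_sketch(raw))
```

Citations such as `People v De Bour, 40 NY2d 210` pass through unchanged because the sketch only touches bracketed markers and whitespace.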

In [51]:
formatted_texts = [clean_and_format_decision(t) for t in decision_texts if t is not None]

# Preview the first 3 cleaned and formatted decisions
for i, text in enumerate(formatted_texts[:3]):
    print(f"Formatted Ruling {i+1} Preview\n{'-'*100}")
    print(text[:10000])  # Show up to the first 10,000 characters
    print(f"End of Formatted Ruling {i+1} Preview\n{'-'*100}")

Formatted Ruling 1 Preview
----------------------------------------------------------------------------------------------------
Appeal by the defendant from a judgment of the County Court, Orange County (Craig S. Brown, J.), rendered May 16, 2023, convicting him of criminal possession of a weapon in the second degree, upon a jury verdict, and imposing sentence. The appeal brings up for review the denial, after a hearing, of those branches of the defendant's omnibus motion which were to suppress physical evidence and his statements to law enforcement officials. 

ORDERED that the judgment is affirmed. The defendant was arrested after a stop and frisk resulting in the recovery of a firearm. After a hearing, the County Court denied those branches of the defendant's omnibus motion which were to suppress the physical evidence recovered, including the firearm, and the defendant's statements made to law enforcement officials. After a jury trial, the defendant was convicted of criminal possess

## Step 6: Extract Party Information from Rulings

We extract the named parties in each case for later use in search, filtering, or metadata tagging.

### Heuristic:
- Locate the `[*1]` marker, which precedes party listings
- Grab the next two non-empty lines:
  - Line naming the first party (e.g., respondent or plaintiff)
  - Line naming the second party (e.g., appellant or defendant)
- Combine into a single string like `Party A v Party B`
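
A minimal sketch of this heuristic follows, assuming the layout seen in the cleaned preview: the first party sits on the same line as `[*1]`, a standalone `v` line follows, and the second party comes next. The function name `extract_parties_sketch` is hypothetical, and the real `extract_party_block` may handle more cases, such as multi-proceeding captions.

```python
from typing import Optional

MARKER = "[*1]"


def extract_parties_sketch(text: str) -> Optional[str]:
    """Return 'Party A v Party B' from the lines following the marker."""
    pos = text.find(MARKER)
    if pos == -1:
        return None  # no marker, so no party block to read
    parties = []
    for line in text[pos + len(MARKER):].splitlines():
        line = line.strip()
        if not line or line == "v":
            continue  # skip blank lines and the standalone separator
        parties.append(line)
        if len(parties) == 2:
            break
    if len(parties) < 2:
        return None
    return f"{parties[0]} v {parties[1]}"


sample = (
    "2023-05297\n\n"
    "[*1]The People of the State of New York, respondent,\n\n"
    "v\n"
    "Nicholas Palm, appellant.\n\n"
    "Mark Diamond, Pound Ridge, NY, for appellant.\n"
)
print(extract_parties_sketch(sample))
```

Returning `None` when fewer than two party lines are found makes failed extractions easy to count.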

### Why Extract Parties?

Extracting parties helps us:

- Index or group rulings by case participants (e.g., recurring litigants)
- Enable user-friendly display of results in dashboards or semantic search tools
- Link to related court data or filings based on party names

While other metadata (like judge names, docket numbers, or decision dates) can also be extracted for deeper analysis or filtering, we’re currently prioritizing decision content and party identity to support our IR/NLP pipeline. Those fields can be layered in later as needed.

## Step 7: Apply Party Extraction Across All Rulings

We now apply the `extract_party_block()` function to all cleaned rulings and preview the results.

This helps verify whether our `[*1] → v → party` heuristic reliably identifies both parties in most rulings.

In [52]:
# Apply party extraction to all documents
extracted_parties = []

for i, doc in enumerate(cleaned_documents):
    result = extract_party_block(doc)
    extracted_parties.append((i, result))

# Preview extracted party pairs
for i, text in extracted_parties[:10]:
    print(f"\n Doc {i}: {text}")


 Doc 0: The People of the State of New York, respondent, v Nicholas Palm, appellant.

 Doc 1: Maria Airene Pantanilla, respondent, v Guillerma Yuson, appellant.

 Doc 2: Franklin Carroll, LLC, appellant, v Carroll Development Plaza, LLC, respondent.

 Doc 3: In the Matter of Pamela De Phillips, etc., appellant, v Nicole Pascone Perez, respondent.

 Doc 4: Bank of America, N.A., respondent, v Dale Bente, appellant, et al., defendants.

 Doc 5: Justin John Flood, etc., appellant, v Ritha Alhindawi, etc., et al., respondents.

 Doc 6: In the Matter of Christian M. L. (Anonymous), etc. MercyFirst, respondent; Christopher M. L. (Anonymous), etc., et al., appellants. (Proceeding No. 1.) v In the Matter of Londyn M. L. (Anonymous), etc. MercyFirst, respondent; Christopher M. L. (Anonymous), etc., et al., appellants. (Proceeding No. 3.)

 Doc 7: Castle Village Owners Corp., Respondent, v Guillermina Girardi, Appellant.

 Doc 8: In the Matter of Cecil T. N. (Anonymous), Jr. Administration for 

## Step 8: Final Party Line Cleanup

We now remove common prefixes like `"In the Matter of"` from extracted party lines.  
This helps standardize structure before creating a metadata DataFrame.
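
One way such a cleanup might look is sketched below. It is a hypothetical illustration rather than the `clean_party_line` function imported from `src.preprocessing`. The name `clean_party_line_sketch` is invented, only the `"In the Matter of"` prefix is handled, and passing `None` through unchanged is an assumption for rulings where no party block was found.

```python
import re
from typing import Optional

# Matches the prefix wherever it occurs, including after the " v " separator.
PREFIX_RE = re.compile(r"\bIn the Matter of\s+")


def clean_party_line_sketch(line: Optional[str]) -> Optional[str]:
    """Strip the 'In the Matter of' prefix from a party line."""
    if line is None:
        return None  # nothing was extracted for this ruling
    return PREFIX_RE.sub("", line).strip()


example = (
    "In the Matter of Pamela De Phillips, etc., appellant, "
    "v Nicole Pascone Perez, respondent."
)
print(clean_party_line_sketch(example))
```

Removing every occurrence, not just a leading one, also covers captions where the prefix reappears after the `v`.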

In [53]:
# Clean party lines for consistency (builds a new list; extracted_parties is unchanged)
cleaned_parties = [(i, clean_party_line(line)) for i, line in extracted_parties]

# Preview the cleaned party lines
for i, text in cleaned_parties:
    print(f"\nDoc {i}: {text}")


Doc 0: The People of the State of New York, respondent, v Nicholas Palm, appellant.

Doc 1: Maria Airene Pantanilla, respondent, v Guillerma Yuson, appellant.

Doc 2: Franklin Carroll, LLC, appellant, v Carroll Development Plaza, LLC, respondent.

Doc 3: Pamela De Phillips, etc., appellant, v Nicole Pascone Perez, respondent.

Doc 4: Bank of America, N.A., respondent, v Dale Bente, appellant, et al., defendants.

Doc 5: Justin John Flood, etc., appellant, v Ritha Alhindawi, etc., et al., respondents.

Doc 6: Christian M. L. (Anonymous), etc. MercyFirst, respondent; Christopher M. L. (Anonymous), etc., et al., appellants. (Proceeding No. 1.) v Londyn M. L. (Anonymous), etc. MercyFirst, respondent; Christopher M. L. (Anonymous), etc., et al., appellants. (Proceeding No. 3.)

Doc 7: Castle Village Owners Corp., Respondent, v Guillermina Girardi, Appellant.

Doc 8: Cecil T. N. (Anonymous), Jr. Administration for Children's Services, respondent; Kascha P. (Anonymous), appellant. v Muriel G

## Last Step: Create Metadata DataFrame 

In [54]:
# Create DataFrame with party and decision info
metadata_df = pd.DataFrame({
    "party_line": [line for _, line in cleaned_parties],
    "decision_text": decision_texts
})

In [55]:
metadata_df

| | party_line | decision_text |
|---|---|---|
| 0 | The People of the State of New York, responden... | Appeal by the defendant from a judgment of the... |
| 1 | Maria Airene Pantanilla, respondent, v Guiller... | In an action, in effect, to recover damages fo... |
| 2 | Franklin Carroll, LLC, appellant, v Carroll De... | In an action, inter alia, for injunctive relie... |
| 3 | Pamela De Phillips, etc., appellant, v Nicole ... | In related proceedings pursuant to Family Cour... |
| 4 | Bank of America, N.A., respondent, v Dale Bent... | In an action to foreclose a mortgage, the defe... |
| ... | ... | ... |
| 69 | The People of the State of New York, responden... | Appeal by the defendant from a judgment of the... |
| 70 | The People of the State of New York, responden... | Appeal by the defendant from a judgment of the... |
| 71 | The People of the State of New York, Responden... | Judgment, Supreme Court, New York County (Robe... |
| 72 | The People of the State of New York, responden... | Appeal by the defendant, as limited by his mot... |


In [56]:
# Filter out rows where the decision text is None
metadata_df = metadata_df[metadata_df["decision_text"].notnull()].reset_index(drop=True)

In [57]:
metadata_df

| | party_line | decision_text |
|---|---|---|
| 0 | The People of the State of New York, responden... | Appeal by the defendant from a judgment of the... |
| 1 | Maria Airene Pantanilla, respondent, v Guiller... | In an action, in effect, to recover damages fo... |
| 2 | Franklin Carroll, LLC, appellant, v Carroll De... | In an action, inter alia, for injunctive relie... |
| 3 | Pamela De Phillips, etc., appellant, v Nicole ... | In related proceedings pursuant to Family Cour... |
| 4 | Bank of America, N.A., respondent, v Dale Bent... | In an action to foreclose a mortgage, the defe... |
| 5 | Justin John Flood, etc., appellant, v Ritha Al... | In an action, inter alia, to recover damages f... |
| 6 | Christian M. L. (Anonymous), etc. MercyFirst, ... | In related proceedings pursuant to Social Serv... |
| 7 | Castle Village Owners Corp., Respondent, v Gui... | Appeal from order, Supreme Court, New York Cou... |
| 8 | Cecil T. N. (Anonymous), Jr. Administration fo... | In a proceeding pursuant to Family Court Act a... |
| 9 | Anthony Mancheno Granizo, appellant, v Krystal... | In an action, inter alia, to recover damages f... |


In [58]:
# Save the DataFrame to a CSV file in the data folder
metadata_df.to_csv("data/party_and_decision_metadata.csv", index=False)

print("Saved metadata_df with party and decision info to data/party_and_decision_metadata.csv")

Saved metadata_df with party and decision info to data/party_and_decision_metadata.csv
