## Step 1: Setup and Configuration

In this notebook, I will process the raw text files fetched in the previous step. My goal is to clean and structure this data into a `pandas` DataFrame that will serve as the primary input for all subsequent analysis.

The process involves:
-   Setting up the environment and importing necessary libraries.
-   Loading each raw text file.
-   Applying custom functions to clean the raw text.
-   Extracting the two most critical fields for this project: the core **`decision_text`** and its associated **`party_line`**.
-   Saving the final, cleaned DataFrame to the `processed` data folder.

While the raw data contains additional metadata (like dates, judges, etc.), I made a strategic decision to defer full metadata extraction to a future phase. This allowed me to keep the scope of Phase 1 focused on building and validating the core text analysis pipeline.

In [24]:
# --- Foundational Library Imports ---
import os
import sys
import re
import pandas as pd
from pathlib import Path

# --- Project Root Setup ---
# This block ensures the notebook's working directory is always the project root (compliance-nlp/).
# This makes all file paths for data and source code consistent and reproducible.
current_dir = Path.cwd()
if current_dir.name == 'notebooks':
    os.chdir(current_dir.parent)
    print(f"Changed working directory to project root: {os.getcwd()}")
else:
    print(f"Already at project root: {os.getcwd()}")

Already at project root: /Users/lidasmac/compliance-nlp


In [26]:
# --- Custom & Configuration Imports ---
# This cell can only run correctly after the working directory has been set above.
from config import DATA_DIR

# Import all the specific cleaning functions from my custom preprocessing module
from src.preprocessing import (
    clean_html,
    extract_decision_section,
    clean_and_format_decision,
    extract_party_block,
    clean_party_line,
)

## Step 2: Load and Inspect Raw Rulings

First, I will load all the raw `.txt` files from the `data/raw` directory into a list. I'll then confirm the number of rulings loaded before proceeding with cleaning.

In [27]:
# --- Load all raw ruling documents from the target directory ---

# 1. Define the directory containing the raw text files
# This uses the DATA_DIR from our config and robustly targets the 'raw' sub-directory.
raw_data_dir = Path(DATA_DIR) / "raw"

# 2. Gather all .txt file paths using pathlib's glob method
# This is a clean and efficient way to find all matching files.
filepaths = sorted(raw_data_dir.glob("*.txt"))

# 3. Read and store the content of each ruling in a list
# This list comprehension is a concise way to build the 'documents' list.
documents = [path.read_text(encoding="utf-8").strip() for path in filepaths]

# 4. Confirm the number of rulings loaded
print(f"Loaded {len(documents)} rulings from '{raw_data_dir}' into memory.")

Loaded 74 rulings from 'data/raw' into memory.


## Step 3: Preview Raw Rulings

Before I begin preprocessing, I'll preview a few raw rulings. This manual inspection helps me understand the text's structure, identify common formatting issues, and plan the specific cleaning functions I'll need to apply.

I will start by examining the first two rulings.

In [28]:
# --- Preview a few raw rulings to inspect their structure ---
for i, doc in enumerate(documents[:2]):
    print(f"\n Ruling {i+1} Preview\n{'-'*100}")
    print(doc[:300]) 
    print(f"\n End of Ruling {i+1} Preview\n{'-'*100}")


 Ruling 1 Preview
----------------------------------------------------------------------------------------------------
<div>

Bank of N.Y. Mellon v Fenimore St. Realty, Inc. (<span class="citation" data-id="10889333"><a href="/opinion/10422745/bank-of-ny-mellon-v-fenimore-st-realty-inc/" aria-description="Citation for case: Bank of N.Y. Mellon v. Fenimore St. Realty, Inc.">2025 NY Slip Op 02566</a></span>)



<table

 End of Ruling 1 Preview
----------------------------------------------------------------------------------------------------

 Ruling 2 Preview
----------------------------------------------------------------------------------------------------
<div>

Citimortgage, Inc. v Benyacob (<span class="citation" data-id="10889332"><a href="/opinion/10422744/citimortgage-inc-v-benyacob/" aria-description="Citation for case: Citimortgage, Inc. v. Benyacob">2025 NY Slip Op 02567</a></span>)



<table width="80%" border="1" cellspacing="2" cellpadding

 End of Ruling 2 Preview
-----

## Step 4: Initial Text Cleaning

To prepare the text for the next phase of analysis (rule-based risk labeling), I will now apply a lightweight cleaning process. My focus here is on preserving as much of the original document structure as possible.

Therefore, my cleaning strategy for this initial phase is very specific:
-   I will remove any leftover HTML tags and markup.
-   I will ensure line breaks are preserved to maintain paragraph and logical structure.
-   I will **not** perform aggressive normalization like lowercasing or stopword removal at this stage. My rule-matching engine will handle case-insensitivity, and preserving the original casing can be useful for manual review.

This approach ensures the text is clean enough for processing while maintaining its integrity for accurate rule matching and analysis.
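
To make this strategy concrete, here is a minimal sketch of what a function like `clean_html` could look like. The real implementation lives in `src/preprocessing` and is not shown in this notebook, so the name `strip_html_keep_lines` and the use of the standard-library `HTMLParser` are assumptions for illustration only; the project's function may behave differently (for example, in how it treats whitespace where tags used to be).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops every tag; newlines in the text are kept."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_html_keep_lines(raw: str) -> str:
    """Hypothetical stand-in for clean_html: remove markup, keep casing and line breaks."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


sample = (
    "<div>\n"
    'Bank v Realty (<span class="citation"><a href="/x">2025 NY Slip Op 02566</a></span>)\n'
    "<b>Decided</b> on April 30, 2025\n"
    "</div>"
)
cleaned = strip_html_keep_lines(sample)
print(cleaned)
```

Note that nothing is lowercased or tokenized here, which matches the goal of leaving the text as close to the original as possible.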

In [29]:
# --- Apply HTML cleaning to all raw documents ---
# I'll use a list comprehension for an efficient and readable way to apply
# my custom clean_html function to every document.
cleaned_documents = [clean_html(doc) for doc in documents]

print(f"HTML cleaning applied to all {len(cleaned_documents)} documents.")

HTML cleaning applied to all 74 documents.


### 4.1 Verification: Compare Before and After Cleaning

To confirm the cleaning worked as expected, I will now inspect the "before" and "after" versions of the first document. This provides a direct, qualitative check of the function's output.

In [30]:
# --- Verification Step ---
# I'll compare the raw and cleaned versions of the first document.
print(f"Before Cleaning (Raw)\n{'-'*100}")
print(documents[0][:500] + "...") # Show first 500 characters

print(f"\nAfter Cleaning\n{'-'*100}")
print(cleaned_documents[0][:500] + "...")

Before Cleaning (Raw)
----------------------------------------------------------------------------------------------------
<div>

Bank of N.Y. Mellon v Fenimore St. Realty, Inc. (<span class="citation" data-id="10889333"><a href="/opinion/10422745/bank-of-ny-mellon-v-fenimore-st-realty-inc/" aria-description="Citation for case: Bank of N.Y. Mellon v. Fenimore St. Realty, Inc.">2025 NY Slip Op 02566</a></span>)



<table width="80%" border="1" cellspacing="2" cellpadding="5" align="center">
<tr>
<td align="center"><b>Bank of N.Y. Mellon v Fenimore St. Realty, Inc.</b></td>
</tr>
<tr>
<td align="center"><span class="c...

After Cleaning
----------------------------------------------------------------------------------------------------
Bank of N.Y. Mellon v Fenimore St. Realty, Inc. (
2025 NY Slip Op 02566
)








Bank of N.Y. Mellon v Fenimore St. Realty, Inc.






2025 NY Slip Op 02566






Decided on April 30, 2025






Appellate Division, Second Department






Published by New

### 4.2 Save Cleaned Documents for Manual Review

As the final step in this initial cleaning phase, I will save each of the 74 cleaned documents to its own text file.

Saving the files at this stage allows me to open and manually inspect a sample of them outside of the notebook. This offline review is essential for identifying the consistent patterns and document structure that the extraction steps below rely on.

The detailed structural analysis I present in the next step is the direct result of the insights I gained from this manual inspection.

In [32]:
# --- Save each cleaned document as a separate .txt file ---

# 1. Define the output directory within our established 'processed' folder
output_dir = Path(DATA_DIR) / "processed" / "01_initial_clean"
os.makedirs(output_dir, exist_ok=True) # Ensure the directory exists

# 2. Loop through the original filepaths and the cleaned documents in parallel
#    The zip() function pairs each original path with its corresponding cleaned text.
for original_path, cleaned_text in zip(filepaths, cleaned_documents):
    
    # 3. Construct the new output path, keeping the original filename
    #    original_path.name gets the filename (e.g., "opinion_12345.txt")
    new_filepath = output_dir / original_path.name
    
    # 4. Save the cleaned text to the new file
    with open(new_filepath, "w", encoding="utf-8") as f:
        f.write(cleaned_text)

print(f"Successfully saved {len(cleaned_documents)} cleaned documents to the '{output_dir}' directory.")

Successfully saved 74 cleaned documents to the 'data/processed/01_initial_clean' directory.


## Step 5: Analyze Ruling Structure & Plan for Extraction

Now that the documents are clean, my next goal is to extract structured information from them, specifically the `party_line` and the core `decision_text`.

Based on my manual review of the cleaned documents, I've identified a consistent structure that can be used for parsing. Understanding this schema is the critical first step before writing the extraction functions.

### Identified Ruling Structure

Each court ruling generally follows this format:

**1. Case Header**
* Case name (e.g., *People v Palm*)
* Slip opinion number and decision date
* Court (e.g., Appellate Division, Second Department)

**2. Party and Counsel Information**
* Named parties: appellant(s) and respondent(s)
* Legal representation: attorney listings for each side

**3. Decision Section**
* Procedural Summary: What is being appealed and why.
* Outcome Statement: Often begins with `ORDERED that...`.
* Factual Background & Legal Analysis: The core reasoning of the court.

**4. Conclusion**
* Final judgment (affirmed/reversed/remanded).
* List of concurring judges.

This structural analysis provides the roadmap for the functions I will now apply to extract the key data fields.

## Step 6: Extract the Core Decision Text

Now that I understand the document structure, I will extract the core decision section from each ruling. This part of the text contains the legal analysis and conclusions, which are essential for our risk and relevance assessments.

I locate the decision body using the following logic:
-   **Start** the extraction *after* a common heading like `"DECISION & ORDER"`, if one is present.
-   **Stop** the extraction *before* common sign-offs like judge signatures (e.g., lines containing `J.P.` or `JJ.`) or official closings like `"ENTERED:"`.

This approach ensures a clean, focused extraction of the ruling's reasoning without including header metadata or footer signatures.
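
The start/stop logic above can be sketched as follows. This is not the code in `src/preprocessing`; the function name `extract_decision_sketch`, the exact regular expressions, and the choice to return `None` when no heading is found are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed patterns, based only on the heading and sign-offs named above.
START_HEADING = re.compile(r"DECISION\s*(?:&|AND)\s*ORDER", re.IGNORECASE)
# Caveat: a party name containing "J.P." would also trigger the stop rule.
END_MARKER = re.compile(r"\bJ\.P\.|\bJJ\.|^\s*ENTERED:")


def extract_decision_sketch(text: str) -> Optional[str]:
    """Return the lines between the heading and the first sign-off line."""
    lines = text.splitlines()

    # Find the first line that consists only of the heading.
    start = None
    for idx, line in enumerate(lines):
        if START_HEADING.fullmatch(line.strip()):
            start = idx + 1
            break
    if start is None:
        return None  # assumption: a missing heading counts as a failure

    # Collect lines until a judge sign-off or "ENTERED:" line appears.
    body = []
    for line in lines[start:]:
        if END_MARKER.search(line):
            break
        body.append(line)

    return "\n".join(body).strip() or None


sample = """Bank v Realty
2025 NY Slip Op 02566
DECISION & ORDER
In an action to foreclose a mortgage, the defendant appeals.
ORDERED that the order is affirmed, with costs.
DILLON, J.P., CHAMBERS and LOVE, JJ., concur.
ENTERED:
Clerk of the Court"""

decision = extract_decision_sketch(sample)
print(decision)
```

Working line by line keeps the stop rule easy to reason about: the first line that looks like a sign-off ends the extraction, and everything after it is ignored.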

In [33]:
# --- Extract decision sections from all cleaned rulings ---
# I'll apply my custom extract_decision_section function to the list of cleaned documents.
decision_texts = [extract_decision_section(doc) for doc in cleaned_documents]

print(f"Core decision text extracted from all {len(decision_texts)} documents.")

Core decision text extracted from all 74 documents.


Now, I will preview the first two extracted decision sections to verify that the parsing logic worked correctly.

In [34]:
# --- Preview a few extracted decision sections ---
for i, decision in enumerate(decision_texts[:2]):
    print(f"Doc {i+1} Decision Preview\n{'-'*100}")
    
    # This check handles cases where no decision text could be found
    if decision:
        print(decision[:500] + "...")
    else:
        print("--> No decision text found for this document.")
        
    print(f"End of Doc {i+1} Preview\n{'-'*100}")

Doc 1 Decision Preview
----------------------------------------------------------------------------------------------------
In an action to foreclose a mortgage, the defendant Fenimore St. Realty, Inc., appeals from an order of the Supreme Court, Kings County (Larry D. Martin, J.), dated April 25, 2023. The order, insofar as appealed from, denied that defendant's cross-motion to reduce the amount of postjudgment interest accrued.
ORDERED that the order is affirmed insofar as appealed from, with costs.
In 2011, the plaintiff commenced this action to foreclose a mortgage on certain real property located in Brooklyn. In...
End of Doc 1 Preview
----------------------------------------------------------------------------------------------------
Doc 2 Decision Preview
----------------------------------------------------------------------------------------------------
In an action to foreclose a mortgage, the defendant Yehudit Benyacob appeals from an order of the Supreme Court, Kings County 

### 6.1 Identify and Review Missing Decisions

After attempting to extract the decision section from each ruling, I will now check for any documents where this process failed (i.e., returned an empty result).

This is a critical data quality step that helps me:
-   Quantify how many rulings may be unusable for future analysis.
-   Identify potential edge cases or inputs that my parsing logic missed.
-   Manually inspect these problematic cases to decide whether to exclude them or refine my extraction patterns.

In [36]:
# --- Identify, Inspect, and Save Rulings with Missing Decision Sections ---

# 1. Define and create a dedicated folder for any parsing failures
failure_dir = Path(DATA_DIR) / "processed" / "01_decision_parsing_failures"
os.makedirs(failure_dir, exist_ok=True)

# 2. Use a list comprehension to efficiently find the indices of all failed documents
missing_indices = [i for i, text in enumerate(decision_texts) if not text]

# 3. Report the summary and save the failed files if any exist
if missing_indices:
    print(f"Found {len(missing_indices)} documents where decision text could not be extracted.")
    print(f"   Failed indices: {missing_indices}")
    print(f"   Saving these documents to '{failure_dir}' for review ...")

    # Loop through the failed indices and save each corresponding document
    for i in missing_indices:
        # Get the original filename to preserve the link to the raw data
        original_filename = filepaths[i].name
        output_path = failure_dir / original_filename
        
        # Save the full cleaned document that caused the failure
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(cleaned_documents[i])
            
    print(f"\nProcess complete. Saved {len(missing_indices)} failed documents.")
else:
    print("Process complete. All documents were parsed successfully.")

Found 15 documents where decision text could not be extracted.
   Failed indices: [10, 29, 38, 41, 43, 44, 45, 46, 47, 50, 51, 53, 54, 60, 68]
   Saving these documents to 'data/processed/01_decision_parsing_failures' for review ...

Process complete. Saved 15 failed documents.


### 6.2 Summary of Extraction Results & Path Forward

My analysis of the 15 documents that failed parsing reveals a clear pattern. The failures are primarily due to variations in the document headers that my initial strategy did not account for.

Common headers in the failed documents include:
-   `Decision and Judgement`
-   `Opinion and Order`
-   `Order, Supreme Court`
-   `Order, Family Court`
-   `Order of fact-finding`

While I could refine my `extract_decision_section` function to handle these new patterns, the current process has already extracted **59 decision texts** (74 loaded, minus the 15 failures). That is more than enough data for the primary goal of this project phase: to build, refine, and evaluate the rule-based triage model.

Therefore, to maintain project momentum and keep the scope focused, **I will proceed with the 59 successfully parsed documents.** Improving the parser's robustness to handle these additional header formats is a valuable task that I will note for future work.
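
For reference when that future work is picked up, one possible direction is a single pattern that covers the heading variants listed above. This is only a sketch: the helper name `find_heading_line` is hypothetical, and how these headings actually appear in the files (on their own line, with trailing text, in what casing) is an assumption that would need checking against the saved failure files.

```python
import re

# Variants taken from the list above, plus the original "DECISION & ORDER".
HEADING_VARIANTS = [
    r"DECISION\s*(?:&|AND)\s*ORDER",
    r"DECISION\s+AND\s+JUDGE?MENT",       # matches both spellings
    r"OPINION\s+AND\s+ORDER",
    r"ORDER,\s+(?:SUPREME|FAMILY)\s+COURT",
    r"ORDER\s+OF\s+FACT-FINDING",
]
HEADING_RE = re.compile(
    r"^\s*(?:" + "|".join(HEADING_VARIANTS) + r")\b", re.IGNORECASE
)


def find_heading_line(lines):
    """Return the index of the first line that starts with a known heading, or None."""
    for idx, line in enumerate(lines):
        if HEADING_RE.match(line):
            return idx
    return None


print(find_heading_line(["Rodriguez v State", "Decision and Judgement", "text"]))
```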

## Step 7: Final Text Cleaning and Formatting

As the final preprocessing step in this notebook, I will apply a more detailed cleaning and formatting routine to the extracted decision texts. My goal is to create a clean, consistently structured text body that is ready for the next steps of the analysis.

My two-step process is as follows:

**1. Text Cleaning:**
* I will remove editorial markers like `[*1]`.
* I will collapse multiple line breaks and normalize all whitespace.
* I will preserve legal citations (e.g., *People v De Bour, 40 NY2d 210*), as they are important parts of the text.

**2. Readability Formatting:**
* To improve readability and prepare the text for later steps, I will insert paragraph breaks before key transitional phrases such as `"ORDERED that"`, `"Here,"`, and `"Accordingly,"`.

**Why This Matters:**

This structured cleaning is important for the project's future phases:
-   **For Human Review:** It makes the text much easier for an analyst to read.
-   **For Semantic Search:** It helps break the text into more logical chunks.
-   **For LLMs:** A clean, well-structured input improves the quality of summarization and other language model tasks.
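
A minimal sketch of this two-step routine is shown below. It is not the project's `clean_and_format_decision`; the name `format_decision_sketch` and the regular expressions are assumptions. The preview output later in this step shows the paragraph break landing in front of `ORDERED that`, so the sketch places the break before each phrase.

```python
import re

# Editorial page markers such as [*1], [*2], ...
EDITORIAL_MARKER = re.compile(r"\[\*\d+\]")
# Whitespace that sits directly in front of a transitional phrase
# (assumed phrase list, taken from the examples above).
BREAK_BEFORE = re.compile(r"\s+(?=(?:ORDERED that|Here,|Accordingly,))")


def format_decision_sketch(text: str) -> str:
    """Remove markers, normalize whitespace, then add paragraph breaks."""
    text = EDITORIAL_MARKER.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()  # collapse line breaks and runs of spaces
    return BREAK_BEFORE.sub("\n\n", text)     # citations are left untouched


sample = (
    "The order denied the motion.\n\n\n"
    "ORDERED that the order is affirmed [*2]with costs.\n"
    "In 2011, the plaintiff sued (see People v De Bour, 40 NY2d 210).   "
    "Here, the court erred.\n"
    "Accordingly, we affirm."
)
formatted = format_decision_sketch(sample)
print(formatted)
```

Collapsing all whitespace first and inserting breaks afterwards means the only paragraph breaks in the result are the ones tied to a transitional phrase, which keeps the chunks predictable.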

In [37]:
# --- Apply final cleaning and formatting to the extracted decision texts ---
# I'll use a list comprehension to apply the function, filtering out any
# documents where the initial extraction failed.
formatted_texts = [clean_and_format_decision(t) for t in decision_texts if t] # 'if t' is a concise way to check for not None and not empty

print(f"Final cleaning and formatting applied to {len(formatted_texts)} documents.")

Final cleaning and formatting applied to 59 documents.


Now, I will preview the first two formatted texts to confirm that my final cleaning and formatting logic worked as intended.

In [38]:
# --- Preview the first two fully processed rulings ---
for i, text in enumerate(formatted_texts[:2]):
    print(f"\nFormatted Ruling {i+1} Preview\n{'-'*100}")
    # Preview the first 500 characters of each formatted text
    print(text[:500] + "...") 
    print(f"\nEnd of Formatted Ruling {i+1} Preview\n{'-'*100}")


Formatted Ruling 1 Preview
----------------------------------------------------------------------------------------------------
In an action to foreclose a mortgage, the defendant Fenimore St. Realty, Inc., appeals from an order of the Supreme Court, Kings County (Larry D. Martin, J.), dated April 25, 2023. The order, insofar as appealed from, denied that defendant's cross-motion to reduce the amount of postjudgment interest accrued. 

ORDERED that the order is affirmed insofar as appealed from, with costs. In 2011, the plaintiff commenced this action to foreclose a mortgage on certain real property located in Brooklyn. ...

End of Formatted Ruling 1 Preview
----------------------------------------------------------------------------------------------------

Formatted Ruling 2 Preview
----------------------------------------------------------------------------------------------------
In an action to foreclose a mortgage, the defendant Yehudit Benyacob appeals from an order of the Supr

## Step 8: Extract Party Information

Next, I will extract the named parties from each case to use for metadata tagging and filtering in later analysis.

### My Extraction Logic

Based on my review of the document structure, I will use a regular expression to isolate the party information:
-   First, I locate the block of text that starts with the `[*1]` marker and ends just before a major section heading like `"DECISION & ORDER"`.
-   Within that block, I clean up the lines and extract the relevant party names.
-   Finally, I combine these lines into a single `party_line` string (e.g., *Party A v Party B*).
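
These three steps can be sketched roughly as follows. The function name `extract_party_sketch` and the pattern are assumptions for illustration; the actual `extract_party_block` in `src/preprocessing` is not shown in this notebook and may differ.

```python
import re
from typing import Optional

# Everything between the [*1] marker and a "DECISION & ORDER" heading line.
PARTY_BLOCK = re.compile(
    r"\[\*1\](?P<block>.*?)(?=^\s*DECISION\s*(?:&|AND)\s*ORDER\s*$)",
    re.DOTALL | re.MULTILINE | re.IGNORECASE,
)


def extract_party_sketch(text: str) -> Optional[str]:
    """Return the party block as one line, or None if the block is not found."""
    match = PARTY_BLOCK.search(text)
    if match is None:
        return None
    # Strip each line, drop the empty ones, and join with single spaces.
    lines = [line.strip() for line in match.group("block").splitlines()]
    return " ".join(line for line in lines if line) or None


sample = (
    "2025 NY Slip Op 02566\n"
    "[*1]Bank of New York Mellon, etc., respondent,\n\n"
    "v\n\n"
    "Fenimore St. Realty, Inc., appellant, et al., defendants.\n\n"
    "DECISION & ORDER\n"
    "In an action to foreclose a mortgage..."
)
party_line = extract_party_sketch(sample)
print(party_line)
```

In this sketch, everything between the marker and the heading is kept, so any counsel listing that appears in that span would be carried into the resulting string as well.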

### Why This is Valuable

Extracting this structured information allows me to:

-   Index or group rulings by case participants (e.g., to find recurring litigants).
-   Enable a more user-friendly display of results in dashboards or search tools.

While other metadata (like judge names or docket numbers) could also be extracted, I am currently prioritizing the decision content and party identity to keep the scope of this phase focused. These other fields can be layered in later as needed.

In [39]:
# --- Apply party extraction to all cleaned documents ---
# I'll use a list comprehension for a concise way to apply my custom function.
# The results will be stored as a list of tuples: (original_index, extracted_party_string).
extracted_parties = [
    (i, extract_party_block(doc)) for i, doc in enumerate(cleaned_documents)
]

print(f"Party information extracted from all {len(extracted_parties)} documents.")

Party information extracted from all 74 documents.


Now, I will preview the first 10 extracted party lines to verify that the parsing logic worked correctly.

In [40]:
# --- Preview the first 10 extracted party lines ---
print("--- Sample of Extracted Party Information ---")

for i, party_line in extracted_parties[:10]:
    # This check makes the output clearer if the function failed for a document
    if party_line:
        print(f"Doc {i}: '{party_line}'")
    else:
        print(f"Doc {i}: --> No party information found.")

--- Sample of Extracted Party Information ---
Doc 0: 'Bank of New York Mellon, etc., respondent, v Fenimore St. Realty, Inc., appellant, et al., defendants.'
Doc 1: 'Citimortgage, Inc., etc., respondent, v Yehudit Benyacob, etc., appellant, et al., defendants. Henry Kohn, Brooklyn, NY, for appellant.'
Doc 2: 'In the Matter of 853-855 McLean, LLC, respondent, v City of Yonkers, NY, et al., appellants.'
Doc 3: 'Edwin Delcid-Funez, appellant-respondent, v Seasons at East Meadow Home Owners Association, Inc., et al., respondents-appellants, et al., defendant.'
Doc 4: 'In the Matter of Kristin M. Cirillo, appellant, v Ronald Grullon, respondent. Kenneth M. Tuccillo, Hastings on Hudson, NY, for appellant. Steven A. Feldman, Manhasset, NY, for respondent.'
Doc 5: 'P. A., respondent, v Poly Prep Country Day School, appellant.'
Doc 6: 'Maria Airene Pantanilla, respondent, v Guillerma Yuson, appellant. David De Andrade, New York, NY, for appellant.'
Doc 7: 'The People of the State of New York, r

### 8.1 Final Party Line Cleanup

As a final refinement step, I will clean the extracted party lines by removing common legal prefixes like `"In the Matter of"`.

This standardization is important for creating a clean and consistent `party_line` column in the final DataFrame I will build next.
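
As a rough illustration of this cleanup, the sketch below strips a leading `"In the Matter of"` and leaves everything else alone. The name `strip_legal_prefix` is hypothetical, and the prefix list is an assumption: only `"In the Matter of"` is named here, and the real `clean_party_line` may handle more cases.

```python
import re
from typing import Optional

# Assumed prefix list; extend the alternation if other prefixes turn up.
LEGAL_PREFIX = re.compile(r"^\s*In the Matter of\s+", re.IGNORECASE)


def strip_legal_prefix(line: Optional[str]) -> Optional[str]:
    """Remove a leading legal prefix; pass None or empty input through unchanged."""
    if not line:
        return line
    return LEGAL_PREFIX.sub("", line).strip()


print(strip_legal_prefix("In the Matter of Kristin M. Cirillo, appellant, v Ronald Grullon, respondent."))
```

Passing empty values through unchanged keeps the list aligned with the original documents even where party extraction found nothing.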

In [41]:
# --- Apply final cleanup to the extracted party lines ---
# I'll use a list comprehension to apply my custom clean_party_line function.
cleaned_parties = [(i, clean_party_line(line)) for i, line in extracted_parties]

print(f"Final cleanup applied to all {len(cleaned_parties)} party lines.")

Final cleanup applied to all 74 party lines.


Now, I will preview a sample of the cleaned party lines to verify that the prefixes were removed as expected.

In [43]:
# --- Preview the first 10 cleaned party lines ---
print("--- Sample of Final, Cleaned Party Information ---")

for i, party_line in cleaned_parties[:10]:
    if party_line:
        print(f"Doc {i}: '{party_line}'")
    else:
        # This handles cases where the original extraction might have failed
        print(f"Doc {i}: --> No party information found.")

--- Sample of Final, Cleaned Party Information ---
Doc 0: 'Bank of New York Mellon, etc., respondent, v Fenimore St. Realty, Inc., appellant, et al., defendants.'
Doc 1: 'Citimortgage, Inc., etc., respondent, v Yehudit Benyacob, etc., appellant, et al., defendants. Henry Kohn, Brooklyn, NY, for appellant.'
Doc 2: '853-855 McLean, LLC, respondent, v City of Yonkers, NY, et al., appellants.'
Doc 3: 'Edwin Delcid-Funez, appellant-respondent, v Seasons at East Meadow Home Owners Association, Inc., et al., respondents-appellants, et al., defendant.'
Doc 4: 'Kristin M. Cirillo, appellant, v Ronald Grullon, respondent. Kenneth M. Tuccillo, Hastings on Hudson, NY, for appellant. Steven A. Feldman, Manhasset, NY, for respondent.'
Doc 5: 'P. A., respondent, v Poly Prep Country Day School, appellant.'
Doc 6: 'Maria Airene Pantanilla, respondent, v Guillerma Yuson, appellant. David De Andrade, New York, NY, for appellant.'
Doc 7: 'The People of the State of New York, respondent, v Brendan Dowling,

## Step 9: Create the Final DataFrame

Next, I will combine the cleaned and extracted pieces of information (the party lines and the formatted decision texts) into a single, structured `pandas` DataFrame.

A key decision here is how to handle the 15 documents for which my parsing logic could not extract a decision text. For data integrity and traceability, **I will keep all 74 original rows** in the final DataFrame.

The `decision_text` for the failed documents will be stored as an empty string. This maintains a perfect 1-to-1 mapping back to the raw source files and allows for easy filtering in subsequent analysis.

This final DataFrame is the key deliverable from this preprocessing pipeline and will be the primary input for the next phase of the project.

In [44]:
# --- Create the Final DataFrame ---

# 1. Prepare the 'party_line' data
# I'll extract just the text from the 'cleaned_parties' list of tuples.
final_party_lines = [line for i, line in cleaned_parties]

# 2. Prepare the 'decision_text' data
# I'll format each entry of the original 'decision_texts' list (which has 74 items)
# so rows stay aligned with the raw files. Entries where extraction failed
# (None or empty) become an empty string.
final_decision_texts = [
    clean_and_format_decision(text) if text else "" for text in decision_texts
]

# 3. Create the DataFrame from a dictionary of these lists
# This is the standard, efficient way to build a DataFrame and ensures all
# columns have the same length (74).
metadata_df = pd.DataFrame({
    "doc_index": range(len(documents)), # Create an index from 0 to 73
    "party_line": final_party_lines,
    "decision_text": final_decision_texts
})

# 4. Verification
print(f"Successfully created DataFrame with {len(metadata_df)} rows.")
display(metadata_df.head())

# To be certain, I'll also check a row where the decision text was missing
# to confirm it was handled correctly as an empty string.
try:
    first_empty_index = metadata_df[metadata_df['decision_text'] == ""].index[0]
    print(f"\n--- Verifying a row (index {first_empty_index}) where parsing failed ---")
    display(metadata_df.iloc[first_empty_index:first_empty_index+1])
except IndexError:
    print("\n--- No rows with empty decision text found. All documents parsed successfully. ---")

Successfully created DataFrame with 74 rows.


| | doc_index | party_line | decision_text |
|---|---|---|---|
| 0 | 0 | Bank of New York Mellon, etc., respondent, v F... | In an action to foreclose a mortgage, the defe... |
| 1 | 1 | Citimortgage, Inc., etc., respondent, v Yehudi... | In an action to foreclose a mortgage, the defe... |
| 2 | 2 | 853-855 McLean, LLC, respondent, v City of Yon... | In a proceeding pursuant to CPLR article 78 to... |
| 3 | 3 | Edwin Delcid-Funez, appellant-respondent, v Se... | In an action to recover damages for personal i... |
| 4 | 4 | Kristin M. Cirillo, appellant, v Ronald Grullo... | In a proceeding pursuant to Family Court Act a... |



--- Verifying a row (index 10) where parsing failed ---


| | doc_index | party_line | decision_text |
|---|---|---|---|
| 10 | 10 | Angel I. Rodriguez, petitioner, v New York Sta... | |


## Step 10: Save the Final Cleaned Dataset

As the final action in this notebook, I will save the cleaned and structured `metadata_df` to a CSV file.

This file will be stored in the `data/processed/` directory and will serve as the primary input for the next notebook in our pipeline, where I will perform the rule-based labeling.

In [45]:
# --- Define the output path and save the DataFrame ---

# 1. Define the output directory using our established structure
output_dir = Path(DATA_DIR) / "processed"
os.makedirs(output_dir, exist_ok=True) # Ensure the directory exists

# 2. Construct the full file path
output_path = output_dir / "party_and_decision_metadata.csv"

# 3. Save the DataFrame to CSV, excluding the pandas index
metadata_df.to_csv(output_path, index=False)

print(f"Final DataFrame successfully saved to: {output_path}")

Final DataFrame successfully saved to: data/processed/party_and_decision_metadata.csv
