# LoRA training/testing pipeline exploration

## Step 1: Load data (master_clauses file from CUAD)

Dataset description, summarized:

1. Columns NOT ending in "Answer" (Context Columns)
- Role: These columns contain the text context (the actual excerpt or "clause") extracted from the contract.
- Content: A string of text directly from the contract that is responsive to a specific category.
- Purpose: This serves as the "evidence" or the "source passage" that justifies a specific determination.
- Handling of omissions: If parts of the text are irrelevant, they may be replaced with `<omitted>`.

2. Columns ending in "Answer" (Label Columns)
- Role: These columns contain the human-annotated answers derived from the text in the corresponding context column.
- Content:
  - For "Yes/No" categories (33 types): the value is "Yes" if the clause exists, or "No" if no matching text was found (e.g., Termination for Convenience).
  - For extraction categories (8 types): the value is a normalized string representing a specific entity, date, or number (e.g., "8th day of May 2014" in the text becomes "5/8/2014" in the Answer column).
- Purpose: This is the "ground truth" or "label" for the machine learning task.
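
The pairing between context and label columns described above can be sketched as follows. This is a minimal illustration, assuming the raw headers follow the `<Category>` / `<Category>-Answer` pattern seen in the file (for example `Document Name` and `Document Name-Answer`); `pair_columns` is a hypothetical helper, not part of CUAD or of this notebook.

```python
def pair_columns(columns):
    """Map each category to its (context column, label column) pair."""
    suffix = "-Answer"  # assumed naming pattern of the raw label columns
    names = set(columns)
    pairs = {}
    for col in columns:
        if col.endswith(suffix):
            category = col[: -len(suffix)].strip()
            # Only keep categories whose context column is also present.
            if category in names:
                pairs[category] = (category, col)
    return pairs


columns = ["Filename", "Parties", "Parties-Answer", "Agreement Date", "Agreement Date-Answer"]
pairs = pair_columns(columns)
print(pairs["Parties"])
```

Columns without a matching partner, such as `Filename`, are left out of the mapping.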

In [1]:
import pandas as pd
import json
from pathlib import Path
import csv
import re
from sklearn.model_selection import train_test_split


In [2]:
CUAD_PATH = Path('data/cuad')
MASTER_CLAUSES_PATH = CUAD_PATH / 'master_clauses.csv'

try:
    # Read the file with the csv module first to tolerate inconsistent rows
    with open(MASTER_CLAUSES_PATH, 'r', encoding='utf-8', errors='replace') as f:
        # DictReader uses the default comma-delimited dialect (csv.Sniffer is not used here).
        # Rows with more fields than the header store the overflow under the key None.
        reader = csv.DictReader(f)
        data = list(reader)

    # Convert the list of dicts to a DataFrame
    df = pd.DataFrame(data)

    print("Data Loaded Successfully via CSV module.")
    print(f"Total Contracts: {len(df)}")
    print(df.head(3))

except Exception as e:
    print(f"Error: {e}")

Data Loaded Successfully via CSV module.
Total Contracts: 545
                                            Filename   Document Name  \
0  CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...   """"MA"""")""   
1  EuromediaHoldingsCorp_20070215_10SB12G_EX-10.B...      ['July 11    
2  FulucaiProductionsLtd_20131223_10-Q_EX-10.9_83...   ['November 15   

     Document Name-Answer   Parties Parties-Answer Agreement Date  \
0  ['8th day of May 2014'    'May 8       2014']""         5/8/14   
1                2006']""   7/11/06     ['July 11        2006']""   
2                2012']""  11/15/12  ['November 15       2012']""   

                               Agreement Date-Answer  \
0  ['This agreement shall begin upon the date of ...   
1                                            7/11/06   
2                                           11/15/12   

                                      Effective Date  \
0                                                      
1  ['The term of this Agreement (th

In [3]:
df.head(3)

Unnamed: 0,Filename,Document Name,Document Name-Answer,Parties,Parties-Answer,Agreement Date,Agreement Date-Answer,Effective Date,Effective Date-Answer,Expiration Date,...,Liquidated Damages-Answer,Warranty Duration,Warranty Duration-Answer,Insurance,Insurance-Answer,Covenant Not To Sue,Covenant Not To Sue-Answer,Third Party Beneficiary,Third Party Beneficiary-Answer;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;,NaN
0,CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...,"""""""""MA"""""""")""""",['8th day of May 2014','May 8,"2014']""""",5/8/14,['This agreement shall begin upon the date of ...,,['This agreement shall begin upon the date of ...,12/31/14,...,No,[],No,[],No,[],No,[],No,
1,EuromediaHoldingsCorp_20070215_10SB12G_EX-10.B...,['July 11,"2006']""""",7/11/06,['July 11,"2006']""""",7/11/06,"['The term of this Agreement (the """"""""Initial ...",unless earlier terminated in accordance with ...,shall terminate on June 30,...,No,[],No,[],No,"[""""""""Notwithstanding any other provision of th...",Rogers may terminate this Agreement,at any time,upon sixty (60) days' prior written notice to...,
2,FulucaiProductionsLtd_20131223_10-Q_EX-10.9_83...,['November 15,"2012']""""",11/15/12,['November 15,"2012']""""",11/15/12,[],,['License Term Perpetual,...,Satellite x Pay: Terrestrial,Cable,Satellite x Direct Satellite IP Distribution ...,'Producer further grants to ConvergTV the rig...,at its sole cost and expense,new and different versions of the Program,create foreign language,subtitled or translated versions of the Progr...,including NTCS,
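
The preview above shows values shifted across columns (for instance `'May 8` under `Parties` and `2014']""` under `Parties-Answer`), which suggests that some commas inside list-like values were treated as delimiters. One way to confirm this before going further is to compare the number of fields in each row with the number of header fields. The sketch below runs on a small in-memory sample rather than the real file; `field_count_report` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def field_count_report(handle):
    """Return (header_width, Counter of row widths) for a CSV file handle."""
    reader = csv.reader(handle)
    header = next(reader)
    widths = Counter(len(row) for row in reader)
    return len(header), widths


# Made-up sample: the second data row has an unquoted comma inside a value.
sample = io.StringIO(
    'Filename,Agreement Date,Agreement Date-Answer\n'
    'a.pdf,"[\'May 8, 2014\']",5/8/14\n'
    "b.pdf,['July 11, 2006'],7/11/06\n"
)
header_width, widths = field_count_report(sample)
print(header_width, dict(widths))
```

Any row width that differs from the header width points to a row whose values will land in the wrong columns.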


In [4]:
def clean_text(text):
    if pd.isna(text):
        return ""
    text = str(text)
    # Keep only letters, digits and whitespace. This also drops hyphens,
    # so "Document Name-Answer" becomes "Document NameAnswer".
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Clean column names
df.columns = [clean_text(col).strip() for col in df.columns]
for col in df.columns:
    print(col)

Filename
Document Name
Document NameAnswer
Parties
PartiesAnswer
Agreement Date
Agreement DateAnswer
Effective Date
Effective DateAnswer
Expiration Date
Expiration DateAnswer
Renewal Term
Renewal TermAnswer
Notice Period To Terminate Renewal
Notice Period To Terminate Renewal Answer
Governing Law
Governing LawAnswer
Most Favored Nation
Most Favored NationAnswer
Competitive Restriction Exception
Competitive Restriction ExceptionAnswer
NonCompete
NonCompeteAnswer
Exclusivity
ExclusivityAnswer
NoSolicit Of Customers
NoSolicit Of CustomersAnswer
NoSolicit Of Employees
NoSolicit Of EmployeesAnswer
NonDisparagement
NonDisparagementAnswer
Termination For Convenience
Termination For ConvenienceAnswer
RofrRofoRofn
RofrRofoRofnAnswer
Change Of Control
Change Of ControlAnswer
AntiAssignment
AntiAssignmentAnswer
RevenueProfit Sharing
RevenueProfit SharingAnswer
Price Restrictions
Price RestrictionsAnswer
Minimum Commitment
Minimum CommitmentAnswer
Volume Restriction
Volume RestrictionAnswer
Ip Own

In [5]:
new_columns = {}
for col in df.columns:
    print(f"Processing column: '{col}'")
    if col.endswith("Answer"):
        # Drop the trailing "Answer" (and any space before it), then append "_Answer"
        new_columns[col] = f"{col[:-len('Answer')].strip()}_Answer"

df = df.rename(columns=new_columns)

Processing column: 'Filename'
Processing column: 'Document Name'
Processing column: 'Document NameAnswer'
Processing column: 'Parties'
Processing column: 'PartiesAnswer'
Processing column: 'Agreement Date'
Processing column: 'Agreement DateAnswer'
Processing column: 'Effective Date'
Processing column: 'Effective DateAnswer'
Processing column: 'Expiration Date'
Processing column: 'Expiration DateAnswer'
Processing column: 'Renewal Term'
Processing column: 'Renewal TermAnswer'
Processing column: 'Notice Period To Terminate Renewal'
Processing column: 'Notice Period To Terminate Renewal Answer'
Processing column: 'Governing Law'
Processing column: 'Governing LawAnswer'
Processing column: 'Most Favored Nation'
Processing column: 'Most Favored NationAnswer'
Processing column: 'Competitive Restriction Exception'
Processing column: 'Competitive Restriction ExceptionAnswer'
Processing column: 'NonCompete'
Processing column: 'NonCompeteAnswer'
Processing column: 'Exclusivity'
Processing column:

In [6]:
for col in df.columns:
    print(col)

Filename
Document Name
Document Name_Answer
Parties
Parties_Answer
Agreement Date
Agreement Date_Answer
Effective Date
Effective Date_Answer
Expiration Date
Expiration Date_Answer
Renewal Term
Renewal Term_Answer
Notice Period To Terminate Renewal
Notice Period To Terminate Renewal_Answer
Governing Law
Governing Law_Answer
Most Favored Nation
Most Favored Nation_Answer
Competitive Restriction Exception
Competitive Restriction Exception_Answer
NonCompete
NonCompete_Answer
Exclusivity
Exclusivity_Answer
NoSolicit Of Customers
NoSolicit Of Customers_Answer
NoSolicit Of Employees
NoSolicit Of Employees_Answer
NonDisparagement
NonDisparagement_Answer
Termination For Convenience
Termination For Convenience_Answer
RofrRofoRofn
RofrRofoRofn_Answer
Change Of Control
Change Of Control_Answer
AntiAssignment
AntiAssignment_Answer
RevenueProfit Sharing
RevenueProfit Sharing_Answer
Price Restrictions
Price Restrictions_Answer
Minimum Commitment
Minimum Commitment_Answer
Volume Restriction
Volume Res
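
The two-step cleanup above (strip punctuation, then re-attach `_Answer`) could also be done in a single pass over the raw headers. The following is a sketch only, assuming the raw label columns end in `-Answer`, possibly with spaces around the hyphen or trailing debris such as the run of `;` visible in the header preview; `normalize_column` is a hypothetical name.

```python
import re


def normalize_column(name):
    """Normalize a raw header, turning a trailing '-Answer' into '_Answer'."""
    name = name.strip()
    # Split off an optional "-Answer" suffix, tolerating spaces around the
    # hyphen and non-alphanumeric debris after the suffix.
    match = re.fullmatch(r"(.*?)\s*-\s*Answer[^a-zA-Z0-9]*", name)
    base = match.group(1) if match else name
    # Drop punctuation from the category part only, as clean_text does.
    base = re.sub(r"[^a-zA-Z0-9\s]", "", base).strip()
    return f"{base}_Answer" if match else base


print(normalize_column("Non-Compete-Answer"))
print(normalize_column("Notice Period To Terminate Renewal - Answer"))
```

Handling the suffix before stripping punctuation keeps the category and the `Answer` marker from being fused together.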

In [7]:
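def save_jsonl(data, filename):
    """Write one JSON object per line, creating the parent directory if needed."""
    Path(filename).parent.mkdir(parents=True, exist_ok=True)
    with open(filename, 'w', encoding='utf-8') as f:
        for entry in data:
            f.write(json.dumps(entry) + '\n')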
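
A matching reader is useful for checking that the written files round-trip cleanly. The sketch below is self-contained and writes to a temporary directory; `load_jsonl` is a hypothetical counterpart to `save_jsonl`, and the sample record is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(filename):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(filename, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder record in the instruction/input/output shape used in this notebook.
records = [{"instruction": "Extract the Parties...", "input": "...", "output": "{}"}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for entry in records:
            f.write(json.dumps(entry) + "\n")
    restored = load_jsonl(path)

print(restored == records)
```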

In [8]:
formatted_rows = []

for index, row in df.iterrows():
    # Each category has a context column (the clause text) and a matching
    # "<Category>_Answer" label column. Build one training record per
    # (contract, category) pair by iterating over the label columns.
    for col_name in df.columns:
        if col_name.endswith("_Answer"):
            category = col_name[:-len("_Answer")]
            answer = row[col_name]

            # Skip empty answers; explicit "No" answers are kept so the model
            # also sees negative examples.
            if pd.isna(answer) or str(answer).strip() == "":
                continue

            # Construct the record
            formatted_rows.append({
                "instruction": f"Extract the {category} from the contract text. Return in JSON format.",
                "input": str(row.get(category, "")),  # context snippet for this category
                "output": json.dumps({category: answer})
            })

# Split Data
train_data, val_data = train_test_split(formatted_rows, test_size=0.1, random_state=42)
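
Because the split above works at the record level, clauses from the same contract can end up in both the training and validation sets. If that is a concern, one option is to split by contract instead. The sketch below uses only the standard library and assumes each record carries a `filename` field, which the records built in this notebook do not currently include; `split_by_group` is a hypothetical helper (scikit-learn's `GroupShuffleSplit` serves a similar purpose).

```python
import random


def split_by_group(rows, group_key, test_size=0.1, seed=42):
    """Split rows so that all rows sharing a group value land on the same side."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    # Hold out at least one group for validation.
    n_val = max(1, round(len(groups) * test_size))
    val_groups = set(groups[:n_val])
    train = [r for r in rows if r[group_key] not in val_groups]
    val = [r for r in rows if r[group_key] in val_groups]
    return train, val


# Made-up records: 10 contracts with 5 clauses each.
rows = [{"filename": f"contract_{i % 10}.pdf", "input": f"clause {i}"} for i in range(50)]
train_rows, val_rows = split_by_group(rows, "filename")
print(len(train_rows), len(val_rows))
```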

In [10]:
CUAD_TRAIN_PATH = CUAD_PATH/'train'
CUAD_VALIDATION_PATH = CUAD_PATH/'validation'

In [11]:
save_jsonl(train_data, CUAD_TRAIN_PATH/'cuad_train.jsonl')
save_jsonl(val_data, CUAD_VALIDATION_PATH/'cuad_validation.jsonl')
print(f"Saved {len(train_data)} training samples and {len(val_data)} validation samples.")

Saved 18067 training samples and 2008 validation samples.
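
For fine-tuning, each saved record usually has to be turned into a single text sequence. The sketch below shows one possible rendering of the `instruction` / `input` / `output` fields; the section markers are an assumption for illustration, not a format required by this notebook or by any particular LoRA library, and `render_prompt` is a hypothetical helper.

```python
import json


def render_prompt(record):
    """Join an instruction/input/output record into a single training string."""
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        f"### Response:\n{record['output']}"
    )


# Made-up record in the same shape as the entries written to the JSONL files.
record = {
    "instruction": "Extract the Governing Law from the contract text. Return in JSON format.",
    "input": "This Agreement shall be governed by the laws of the State of New York.",
    "output": json.dumps({"Governing Law": "New York"}),
}
print(render_prompt(record))
```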
