<img src="https://imagedelivery.net/Dr98IMl5gQ9tPkFM5JRcng/3e5f6fbd-9bc6-4aa1-368e-e8bb1d6ca100/Ultra" alt="Image description" width="160" />

<br/>

# Legal Contract Data Extraction with Contextual AI

This notebook demonstrates how to use the [Contextual AI Platform](https://docs.contextual.ai/) for automated legal contract analysis and data extraction. We'll build a specialized agent that can answer yes/no questions about legal documents. The Agent's outputs are provided in a structured format, so they can easily be ingested into other workflows or applications.
This approach can be generalized for any type of data extraction from unstructured documents.

## Notebook Structure

1. **Setup and Prerequisites** - Environment setup, API configuration, and dependencies
2. **Datastore Creation and Document Management** - Creating datastores, ingesting documents, and adding metadata
3. **Agent Configuration** - Setting up the legal analysis agent with specialized prompts for extraction
4. **Structured Output Schema** - Defining JSON schemas for consistent responses
5. **Batch Extraction** - Running comprehensive analysis across documents

You can run this notebook entirely in Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/12-legal-contract-extraction/legal_contract_extraction.ipynb)

## 1. Setup and Prerequisites

### Prerequisites

- **API Key**: Available in the UI inside your workspace at contextual.ai
- **Python Client**: Install with `pip install contextual-client`
- **Sample Data**: We'll use the sample contracts in the `data` folder

**Security Note**: Never hardcode API keys in notebooks. Use environment variables or secure key management.

In [None]:
!pip install contextual-client

In [None]:
import os
import requests
import json
from pathlib import Path
from typing import List, Optional, Dict
from IPython.display import display, JSON
import pandas as pd
from contextual import ContextualAI

In [None]:
# Initialize Contextual AI client
# You can store the API key as an environment variable: 
# os.environ["CONTEXTUAL_API_KEY"] = API_KEY

client = ContextualAI(
    api_key=os.getenv("CONTEXTUAL_API_KEY")
)
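
`os.getenv` returns `None` when the variable is unset. A minimal sketch of checking for that up front with an explicit message; the helper name `require_api_key` is hypothetical and not part of the Contextual AI client:

```python
import os


def require_api_key(env_var: str = "CONTEXTUAL_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error.

    Illustrative helper only; it is not part of the client library.
    """
    api_key = os.getenv(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell or load it from "
            "a secrets manager before creating the client."
        )
    return api_key


# Usage (assumes the ContextualAI import from the cell above):
# client = ContextualAI(api_key=require_api_key())
```

This keeps the key out of the notebook while making a missing variable obvious at setup time.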

## 2. Datastore Creation and Document Management

### 2.1 Create Datastore

First, we'll create a new datastore for our legal contracts. This will serve as the repository for all documents we want to analyze.

In [None]:
result = client.datastores.create(name="Demo_legal_contracts")
datastore_id = result.id
print(f"Datastore ID: {datastore_id}")

### 2.2 Download and Ingest Sample Documents

We'll download the sample legal contracts and ingest them into our datastore. The samples include a non-compete contract from the Atticus Open Contract Dataset on [Kaggle](https://www.kaggle.com/datasets/konradb/atticus-open-contract-dataset-aok-beta). Feel free to use this notebook with additional contracts.

In [None]:
def fetch_file(filepath):
    """Download a sample file from the examples repo unless it already exists locally."""
    # Create the parent directory only when the path actually has one
    directory = os.path.dirname(filepath)
    if directory:
        os.makedirs(directory, exist_ok=True)
    if not os.path.exists(filepath):
        print(f"Fetching {filepath}")
        response = requests.get(f"https://raw.githubusercontent.com/ContextualAI/examples/main/12-legal-contract-extraction/{filepath}")
        if response.ok:
            with open(filepath, 'wb') as f:
                f.write(response.content)
            print(f"Saved {filepath}")
        else:
            print(f"Failed to fetch {filepath} (HTTP {response.status_code})")

fetch_file('data/QuakerChemicalCorporation.pdf')
fetch_file('data/western.pdf')
fetch_file('data/vivintsolar.pdf')

Next, ingest the downloaded files into the datastore:

In [None]:
def ingest_documents(folder_path, datastore_id) -> Dict[str, str]:
    folder = Path(folder_path)
    document_ids = {}  # Dictionary to store filename: document_id pairs

    for file_path in folder.iterdir():
        if file_path.is_file() and file_path.suffix.lower() in ['.pdf', '.html']:
            try:
                with open(file_path, 'rb') as f:
                    ingestion_result = client.datastores.documents.ingest(datastore_id, file=f)
                    document_ids[file_path.name] = ingestion_result.id
                    print(f"Successfully uploaded {file_path.name} to datastore {datastore_id}")
            except Exception as e:
                print(f"Error uploading {file_path.name}: {str(e)}")

    return document_ids

# Usage example
folder_path = 'data'
uploaded_docs = ingest_documents(folder_path, datastore_id)

### 2.3 Set Document Metadata

Metadata is a powerful way to improve retrieval. Let's add filename metadata so documents are easier to filter and identify during queries.

Other useful metadata fields include version, jurisdiction, or financial quarter. Anything that helps retrieval or generation is a good candidate.
You can also choose whether metadata is included inside each chunk (the default) and whether it can be used to filter retrievals (also the default). See the [docs](https://docs.contextual.ai/api-reference/datastores-documents/update-document-metadata) for details.


In [None]:
# Retrieve all documents from the datastore
docs = client.datastores.documents.list(datastore_id=datastore_id)
doc_pairs = [(doc.id, doc.name, doc.status) for doc in docs.documents]
print("Document ID and Name pairs:")
for doc_id, name, status in doc_pairs:
    print(f"ID: {doc_id}, Name: {name}, Status: {status}")

You can add metadata at ingest time or after a document has been processed. Here we add it after processing, so confirm that each document's status above is completed before running the next cell.
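
Rather than re-running the listing cell by hand, the wait can be automated. The following is a minimal polling sketch, not an official client feature: `wait_until_processed` is a hypothetical helper, `get_status` is a function you supply, and the set of terminal status names is an assumption to check against the API reference.

```python
import time
from typing import Callable, Dict, Iterable


def wait_until_processed(
    document_ids: Iterable[str],
    get_status: Callable[[str], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
) -> Dict[str, str]:
    """Poll each document until it reaches a terminal status or time runs out.

    `get_status` maps a document ID to its current status string.
    """
    terminal = {"completed", "failed", "cancelled"}  # assumed terminal states
    deadline = time.monotonic() + timeout_seconds
    statuses = {doc_id: get_status(doc_id) for doc_id in document_ids}
    while any(status not in terminal for status in statuses.values()):
        if time.monotonic() >= deadline:
            break  # give up and return whatever we last observed
        time.sleep(poll_seconds)
        for doc_id, status in list(statuses.items()):
            if status not in terminal:
                statuses[doc_id] = get_status(doc_id)
    return statuses


# Usage with the notebook's client (assumes the metadata response exposes `.status`):
# statuses = wait_until_processed(
#     [doc_id for doc_id, _, _ in doc_pairs],
#     lambda doc_id: client.datastores.documents.metadata(
#         datastore_id=datastore_id, document_id=doc_id
#     ).status,
# )
```

The returned dictionary shows the last observed status per document, so a timeout or a failed ingestion is visible before any metadata is written.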

In [None]:
# Add Filename metadata to each document for easier querying
for doc_id, name, status in doc_pairs:
    result = client.datastores.documents.set_metadata(
        datastore_id=datastore_id,
        document_id=doc_id,
        custom_metadata={"Filename": name}
    )
    print(f"Set metadata for {name} (ID: {doc_id}): {result}")

In [None]:
# Verify metadata configuration
document_id = docs.documents[0].id
metadata = client.datastores.documents.metadata(datastore_id=datastore_id,
                        document_id=document_id)
print("Document metadata:", metadata.custom_metadata)

## 3. Agent Configuration

### 3.1 Create Legal Analysis Agent

We'll create a specialized agent for legal document analysis with access to our datastore.

In [None]:
app_response = client.agents.create(
    name="Demo Legal Extraction",
    description="Extraction Agent for legal contracts",
    datastore_ids=[datastore_id]
)
agent_id = app_response.id
print(f"Agent ID created: {agent_id}")

In [None]:
# Disable the filter prompt to get the top 15 chunks
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "global_config": {
                "enable_filter": False
            }
        }
    }
)

### 3.2 Define Legal Extraction System Prompt

We'll create a specialized system prompt designed for legal document analysis with structured yes/no responses.
Including the desired JSON output structure in the prompt is recommended.

In [None]:
# Specialized system prompt for legal document analysis
extraction_system_prompt = """
LEGAL DATA EXTRACTION REQUEST

You are a legal document assistant that answers Yes or No questions about the documentation provided to you.
Your responses should be precise, accurate, and sourced exclusively from the provided information.
Your answers will be Y for yes or N for no. If you do not know, answer IDK. You will add a one-sentence explanation.

Please follow these guidelines:
* Only use information from the provided documentation. Avoid opinions, speculation, or assumptions.
* Directly answer the question, then STOP.
* If the information is irrelevant, answer IDK.

JSON FORMATTING REQUIREMENTS:
- Each response MUST have an "answer" field and an "explanation" field
- "answer" must be Y, N, or IDK
- "explanation" must be one sentence
- Ensure valid JSON syntax with proper quotes and commas

REQUIRED OUTPUT FORMAT:
Return a complete, valid JSON object with this exact structure:
{
 "answer": "N",
 "explanation": "No such clause was identified in the contract."
}
"""


### 3.3 Update Agent Configuration

Apply the specialized prompt to our agent for legal document analysis.

In [None]:
# Update agent with the legal extraction prompt
client.agents.update(agent_id=agent_id, system_prompt=extraction_system_prompt)

# Verify the updated configuration
agent_config = client.agents.metadata(agent_id=agent_id)
print("Updated system prompt:")
print(agent_config.system_prompt)

## 4. Structured Output Schema  

### 4.1 Define JSON Schema

We'll create a structured output schema to ensure consistent, parseable responses from our legal analysis agent.
Note that the schema itself is wrapped in a top-level `json_schema` key.

In [None]:
# JSON schema for structured legal analysis responses
answer_with_explanation_schema = {
    "json_schema": {
        "type": "object",
        "properties": {
            "answer": {
                "type": "string",
                "enum": ["Y", "N", "IDK"],
                "description": "Must be exactly Y, N, or IDK"
            },
            "explanation": {
                "type": "string",
                "description": "Brief explanation of the answer based on the document"
            }
        },
        "required": ["answer", "explanation"]
    }
}
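
Structured output is meant to produce this shape, but a client-side check can still catch surprises before results flow into other systems. A minimal sketch in plain Python; `parse_answer` is a hypothetical helper that mirrors the schema's rules by hand rather than using a schema-validation library:

```python
import json

ALLOWED_ANSWERS = {"Y", "N", "IDK"}


def parse_answer(raw: str) -> dict:
    """Parse a JSON reply and check it against the fields the schema requires."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = {"answer", "explanation"} - data.keys()
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")
    if data["answer"] not in ALLOWED_ANSWERS:
        raise ValueError(f"Unexpected answer value: {data['answer']!r}")
    if not isinstance(data["explanation"], str):
        raise ValueError("explanation must be a string")
    return data


# Example with a hand-written reply (not real agent output)
print(parse_answer('{"answer": "IDK", "explanation": "Not covered in the excerpt."}'))
```

Every failure mode raises `ValueError`, so a caller can handle malformed replies in one `except` clause.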

Write a simple query to test extraction. The query includes:
- A filter that limits extraction to a specific document (`documents_filters`)
- Structured output that follows the JSON schema (`structured_output`)
- An option to include the retrievals, with all the chunk data, in the response (`include_retrieval_content_text`)
- An option to return only the retrieved chunks, with no generated answer (`retrievals_only`)


In [None]:
query = "Is the non-competition period limited to 2 years or less?"
source_document = "QuakerChemicalCorporation.pdf"

query_result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{
        "content": query,
        "role": "user"
    }],
    documents_filters={
        "operator": "AND",
        "filters": [
            {"field": "Filename", "operator": "equals", "value": source_document}
        ]
    },
    extra_body={"structured_output": answer_with_explanation_schema},
    include_retrieval_content_text=False,
    retrievals_only=False
)

query_result.message.content
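
The filter dictionary in the query above can be generated rather than written by hand. A small sketch, assuming only the `AND` / `equals` structure already used in this notebook; `build_filters` is a hypothetical helper, and the `Jurisdiction` field in the second example is illustrative metadata that this notebook never sets:

```python
from typing import Any, Dict


def build_filters(**field_values: str) -> Dict[str, Any]:
    """Build an AND filter that requires every given metadata field to match exactly."""
    return {
        "operator": "AND",
        "filters": [
            {"field": field, "operator": "equals", "value": value}
            for field, value in field_values.items()
        ],
    }


# Same filter as the query above
print(build_filters(Filename="QuakerChemicalCorporation.pdf"))

# Two conditions combined; "Jurisdiction" is a hypothetical metadata field
print(build_filters(Filename="western.pdf", Jurisdiction="PA"))
```

The result can be passed as `documents_filters` in place of the literal dictionary.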

## 5. Batch Extraction

### 5.1 Create Query Function

Create a reusable function to query documents with structured output.

In [None]:
def query_document(query, document_name):
    """Simple function to query a document with Y/N/IDK structured output"""

    query_result = client.agents.query.create(
        agent_id=agent_id,
        messages=[{
            "content": query,
            "role": "user"
        }],
        documents_filters={
            "operator": "AND",
            "filters": [
                {"field": "Filename", "operator": "equals", "value": document_name}
            ]
        },
        extra_body={"structured_output": answer_with_explanation_schema},
        include_retrieval_content_text=False,
        retrievals_only=False
    )
    return query_result.message.content

Let's test and verify this works for different documents.

In [None]:
query = "Is the non-competition period limited to 2 years or less?"
source_document = "QuakerChemicalCorporation.pdf"
result = query_document(query, source_document)
print(result)

In [None]:
query = "Is the non-competition period limited to 2 years or less?"
source_document = "western.pdf"
result = query_document(query, source_document)
print(result)

### 5.2 Run a Batch Extraction Job

Here is a list of extraction questions:

In [None]:
# Legal Due Diligence Questions List
legal_questions = [
    "Is the non-competition period limited to 2 years or less?",
    "Does the non-competition covenant exclude China from its geographic scope?",
    "Is the employee non-solicitation period longer than the non-competition period?",
    "Does the agreement include severability provisions for unenforceable restrictions?",
    "Are there judicial modification provisions allowing courts to narrow overly broad restrictions?",
    "Is there acknowledgment that legal remedies would be inadequate for breach?",
    "Is there a passive investment exception for holdings under 10%?",
    "Does the agreement allow routine day-to-day business transactions?",
    "Is there a revenue threshold ($25M+) exception for acquired businesses?",
    "Can sellers hire employees terminated by buyer after a 12-month waiting period?",
    "Is Pennsylvania law designated as the governing law?",
    "Do federal and state courts in Philadelphia have exclusive jurisdiction?",
    "Can the buyer seek injunctive relief without posting a bond?",
    "Does the prevailing party recover attorney's fees?",
    "Does the agreement define \"control\" as power to direct management?",
    "Do the restrictive covenants apply to the sellers' affiliates?",
    "Are there special restrictions on Russian Oil beyond general provisions?",
    "Are there confidentiality obligations covering buyer and company information?"
]

print(f"Created list of {len(legal_questions)} legal due diligence questions")


Create a function to process all questions against a single document.

In [None]:
# Example: Query one document with all questions
def query_all_questions(document_name, questions_list):
    """Query a document with all questions and return results"""
    results = []

    for i, question in enumerate(questions_list, 1):
        print(f"Question {i}/{len(questions_list)}: {question}")
        try:
            result = query_document(question, document_name)
            answer = json.loads(result)
            results.append({
                'question': question,
                'answer': answer,
                'document': document_name
            })
            print(f"Answer: {answer}\n")
        except Exception as e:
            print(f"Error: {e}\n")
            results.append({
                'question': question,
                'answer': 'ERROR',
                'document': document_name
            })

    return results
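
The function above records `ERROR` on the first failure. For transient problems such as network hiccups or rate limits, a retry may recover the answer instead. A minimal sketch using exponential backoff; `with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary defaults to tune:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call `fn`, retrying with exponential backoff whenever it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Usage inside the loop above:
# result = with_retries(lambda: query_document(question, document_name))
```

Because the final failure is re-raised, the existing `except` branch still records `ERROR` when every attempt fails.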

### 5.3 Execute Batch Analysis

Run the list of extractions on our target document.

In [None]:
results = query_all_questions("QuakerChemicalCorporation.pdf", legal_questions)
for r in results:
    print(f"Q: {r['question']}")
    print(f"A: {r['answer']}\n")


The output should look similar to this (answers and explanations will vary by document and run):
```
Question 18/18: Are there confidentiality obligations covering buyer and company information?
Answer: {'answer': 'IDK', 'explanation': "I don't have access to any information about confidentiality obligations or agreements covering buyer and company information. To get accurate information about this topic, I would need relevant documentation or policies to be uploaded to my datastore."}

Q: Is the non-competition period limited to 2 years or less?
A: {'answer': 'IDK', 'explanation': "I don't have access to any information that would allow me to determine the length of the non-competition period. To get an accurate answer to this question, I would need access to relevant legal documents or agreements that specify the non-competition terms."}

Q: Does the non-competition covenant exclude China from its geographic scope?
A: {'answer': 'IDK', 'explanation': "I don't have access to any information about the non-competition covenant or its geographic scope. To answer this question accurately, I would need access to the relevant legal documents or agreements."}
```


### 5.4 Save Results to File

Save the analysis results to both JSON and CSV formats for further analysis.


In [None]:
# Save results to file for analysis (json and pandas were imported in the setup cell)
from datetime import datetime

# Create timestamp for filename
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"legal_analysis_results_{timestamp}.json"

# Save results to JSON file
with open(filename, 'w') as f:
    json.dump(results, f, indent=2)

print(f"Results saved to {filename}")

# Also save as CSV for easier analysis
df_results = pd.DataFrame(results)
csv_filename = f"legal_analysis_results_{timestamp}.csv"
df_results.to_csv(csv_filename, index=False)
print(f"Results also saved to {csv_filename}")
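
In the CSV written above, the `answer` column holds a whole dictionary per row, which is awkward to filter in a spreadsheet. A minimal sketch of flattening each result into separate `answer` and `explanation` columns first; `flatten_results` is a hypothetical helper that assumes the result shape produced by `query_all_questions`:

```python
from typing import Any, Dict, List


def flatten_results(results: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Turn nested results into flat rows with one column per field."""
    rows = []
    for item in results:
        answer = item["answer"]
        if isinstance(answer, dict):
            value = answer.get("answer", "")
            explanation = answer.get("explanation", "")
        else:
            # Failed queries store a plain string such as 'ERROR'
            value, explanation = str(answer), ""
        rows.append({
            "document": item["document"],
            "question": item["question"],
            "answer": value,
            "explanation": explanation,
        })
    return rows


# Usage with the notebook's variables:
# pd.DataFrame(flatten_results(results)).to_csv(csv_filename, index=False)
```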

## Next Steps

- Scale this code up to run across multiple documents
- Leverage more metadata to make retrieval easier
- Experiment with more complex structured data extraction
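
As a starting point for the first item, the per-document function can be looped over several filenames. A minimal sketch; `run_batch` is a hypothetical helper, and it takes the per-document query function as a parameter so it can be tried out with a stub before spending API calls:

```python
from typing import Callable, Dict, Iterable, List


def run_batch(
    document_names: Iterable[str],
    questions: List[str],
    query_fn: Callable[[str, List[str]], List[Dict]],
) -> List[Dict]:
    """Run every question against every document and return one combined list."""
    all_results: List[Dict] = []
    for name in document_names:
        all_results.extend(query_fn(name, questions))
    return all_results


# Stub standing in for query_all_questions, for a dry run without the API
def fake_query(document_name: str, questions: List[str]) -> List[Dict]:
    return [{"question": q, "answer": "IDK", "document": document_name} for q in questions]


demo = run_batch(["a.pdf", "b.pdf"], ["Q1", "Q2"], fake_query)
print(len(demo))

# Usage with the notebook's objects (assumes the Filename metadata matches these names):
# all_results = run_batch(uploaded_docs.keys(), legal_questions, query_all_questions)
```

Each row carries its `document` name, so the combined list can be saved with the same JSON and CSV code as in section 5.4.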
