# ClauseGuard — Automated Contract Clause Classification & Risk Advice
## Powered by RAG, LangChain, Azure OpenAI, and Azure AI Search

In prior work (repo: https://github.com/DucDungTran/NLP/tree/main/Contract-clause-analysis-chatbot), contract-clause classifiers were developed by fine-tuning LegalBERT and RoBERTa, and a risk-advisory chatbot was implemented using a GPT model via the OpenAI API. This project uses retrieval-augmented generation (RAG) with LangChain, Azure OpenAI, and Azure AI Search to improve grounding and relevance. In particular, a book on European contract law (PDF file: https://www.legiscompare.fr/web/IMG/pdf/CFR_I-XXXIV_1-614.pdf) serves as the knowledge base for the LLM.

## Prerequisites

The following Azure resources are required to carry out this project:
* Azure OpenAI with two deployed models: GPT-4.1 and an embedding model (text-embedding-3-large).
* Azure AI Search.

An introduction to building RAG in the cloud can be found in the repo: https://github.com/DucDungTran/RAG/tree/main/rag-cloud.

In this project, a RAG pipeline is built with LangChain, Azure OpenAI, and Azure AI Search. The LegalBERT model fine-tuned in the prior work is then loaded to classify the given contract clauses. Finally, each clause and its classification result are passed to the RAG-based LLM for risk analysis.

In [None]:
# %pip install -q langchain_community pypdf torch transformers tiktoken langchain-openai python-dotenv
# %pip install -q azure-search-documents azure-identity

## Loading documents

In [1]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "data/eu-contract-law.pdf"
loader = PyPDFLoader(file_path)

docs_all = loader.load()
docs = docs_all[:10] + docs_all[34:]

print(len(docs))

624


In [2]:
print(f"{docs[14].page_content}\n")
print(docs[14].metadata)

Chapter 1: Contract
Academy of European Private Lawyers of Pavia (“Pavia Project”) (articles 4 and 20),
decided to treat unilateral undertakings as contracts, in general terms.
The issue is probably more theoretical than practical: it is about knowing if an offeror
is bound even without the acceptance of the offer by its recipient and without such
recipient having knowledge of the undertaking (like promises of a reward under German
law). In practice, this analysis will seldom be useful because it will be possible to identify
an implied acceptance by the recipient. Such is the position under English law. The
question arises, however, as to the difference between a unilateral promise and a uni-
lateral undertaking.
Several solutions are possible regarding the treatment of unilateral undertakings in
European Contract Law.
First possibility: maintain the general terms of the Pavia Project and of
PECL.
Second possibility: delete all references to unilateral undertakings as a separate catego

## Splitting/chunking text by characters with legal-friendly separators

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

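# Plain (non-raw) strings: by default the splitter treats separators as
# literal text, so "\n" must be a real newline rather than a backslash + "n".
separators = [
    "\n\nChapter ", "\n\nSection ", "\n\nArticle ",
    "\n\n", "\n", " "
]

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,       # maximum tokens per chunk
    chunk_overlap=150,     # tokens shared between consecutive chunks
    separators=separators,
    add_start_index=True,
    encoding_name="cl100k_base",  # tokenizer used by OpenAI / Azure OpenAI embedding models
)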

chunks = text_splitter.split_documents(docs)
print("Number of splits:", len(chunks))

Number of splits: 706
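
To make the separator hierarchy more concrete, here is a minimal pure-Python sketch of the recursive idea: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. It is not LangChain's implementation. The function name `recursive_split` is hypothetical, length is measured in characters rather than tokens, there is no overlap, and separators are dropped at chunk boundaries.

```python
def recursive_split(text: str, separators: list[str], max_len: int) -> list[str]:
    """Split text into pieces of at most max_len characters (illustrative only)."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, max_len)

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # This piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(part, finer, max_len))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Article 1\nOffer and acceptance.\n\n"
    "Article 2\nUnilateral undertakings are treated as contracts."
)
for piece in recursive_split(sample, ["\n\n", "\n", " "], max_len=40):
    print(repr(piece))
```

Ordering the separators from structural markers down to single spaces means a chunk boundary falls on the largest available unit, which tends to keep an article or paragraph together when it fits.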


In [4]:
for chunk in chunks[20:30]:
    print(chunk.page_content)
    print("---------------")

Chapter 1: Contract
2. A more limited polysemy in Acquis International
More traditionally, in Acquis International, the notion of“engagement” seems to refer to
a manifestation of intention, through which a person takes on an obligation – either in
the form of a promise, or by entering a contract. Several examples illustrating these two
meanings are set out below.
Article 33.1.c of the Hague Convention relating to a uniform law on the interna-
tional sale of goods of 1 July 1964 provides that “The seller shall not have fulfilled his
obligation to deliver the goods where he has handed over:
[... ] c) goods which lack the
qualities of a sample or model which the seller has handed over or sent to the buyer,
unless the seller has submitted it without any express or impliedundertaking that the
goods would conform therewith”.
Articles 3.8 and 3.9 of theUNIDROIT Principles relating to fraud and threat26 pro-
vide that: “A party may avoid the contract when ithas been led to conclude the contrac

## Count the number of tokens in a text

Like LLMs, embedding models define a maximum input length, measured in `tokens`. For `text-embedding-3-large` the limit is 8191 tokens, so every chunk must contain 8191 tokens or fewer. To verify this, we need a way to count the tokens in a text string.

In [5]:
import tiktoken

def num_tokens_from_string(string: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name="cl100k_base")
    num_tokens = len(encoding.encode(string, disallowed_special=()))
    return num_tokens

# Test the function
num_tokens_from_string("tiktoken is great!")

6

Count the number of tokens in each chunk; none should exceed the maximum specified by the embedding model (8191). The loop prints only the chunks that exceed the limit, so no output means every chunk fits.

In [6]:
for chunk in chunks:
    num_tokens = num_tokens_from_string(chunk.page_content)
    if num_tokens > 8191:
        print(chunk.metadata["page"])
        print(num_tokens)
        print("---------------")

## Creating Embeddings

`Vector search` is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.
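
As a small illustration of the similarity metric mentioned above, the following sketch computes cosine similarity in plain Python and ranks a few toy vectors against a query. The helper names `cosine_similarity` and `top_k` and the 2-dimensional vectors are made up for this example; a real vector store works on much longer vectors and typically uses an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Indices of the k vectors most similar to the query (brute-force scan)."""
    scored = sorted(
        ((cosine_similarity(query, v), i) for i, v in enumerate(vectors)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


toy_vectors = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(top_k([1.0, 0.0], toy_vectors, k=2))
```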

`LangChain` supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. 

In [7]:
import os
from langchain_openai import AzureOpenAIEmbeddings
from dotenv import load_dotenv

if os.path.exists(".env"):
    load_dotenv(override=True)


embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDING_MODEL"],
    openai_api_version=os.environ["AZURE_OPENAI_EMBEDDING_API_VERSION"],
)
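
Because the clients in this notebook read their settings with `os.environ[...]`, a missing entry in `.env` surfaces as a bare `KeyError`. A small check like the sketch below can report all missing settings at once. The variable names are the ones used in this notebook; the helper `missing_env_vars` is a hypothetical addition, not part of any library.

```python
import os

# Settings read via os.environ[...] across this notebook.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "AZURE_OPENAI_EMBEDDING_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_SERVICE_ADMIN_KEY",
    "AZURE_SEARCH_SERVICE_INDEX",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing settings:", ", ".join(missing))
```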

In [8]:
vector_1 = embeddings.embed_query(chunks[0].page_content)
vector_2 = embeddings.embed_query(chunks[1].page_content)

print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
print(vector_2[:10])

Generated vectors of length 3072

[-0.024074196815490723, 0.001347526558674872, -0.024477045983076096, 0.02897282876074314, -0.026281803846359253, 0.008266441524028778, -0.00510005559772253, 0.04144500195980072, -0.03487052395939827, -0.003581318771466613]
[-0.018066614866256714, -0.022289978340268135, -0.031851205974817276, 0.02941690757870674, -0.03361094370484352, 0.0008949903422035277, 0.012178833596408367, 0.024797601625323296, -0.0584525391459465, 0.006932623218744993]


Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

## Vector stores for Azure AI Search

`LangChain VectorStore` objects contain methods for adding text and `Document` objects to the store, and querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.

`LangChain` includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as `Postgres`) run in separate infrastructure that can be run locally or via a third-party; others can run in-memory for lightweight workloads.

Create an instance of the `AzureSearch` class using the embeddings from above.

In [10]:
from langchain_community.vectorstores.azuresearch import AzureSearch

# Additional Azure client options are described at https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/core/azure-core/README.md#configurations
vector_store = AzureSearch(
    azure_search_endpoint=os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"],
    azure_search_key=os.environ["AZURE_SEARCH_SERVICE_ADMIN_KEY"],
    index_name=os.environ["AZURE_SEARCH_SERVICE_INDEX"],
    embedding_function=embeddings.embed_query,
    # Configure max retries for the Azure client
    additional_search_client_options={"retry_total": 3},
    relevance_score_fn="cosine",
)

## Insert text and embeddings into vector store

This step embeds the chunks created above and indexes them into a search index on `Azure AI Search`, uploading 300 chunks per call.

In [11]:
for i in range(0, len(chunks), 300):
    print("Uploading documents from ", i, " to ", i+300)
    vector_store.add_documents(documents=chunks[i:i+300])

Uploading documents from  0  to  300
Uploading documents from  300  to  600
Uploading documents from  600  to  900
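
Note that the last log line reports an upper bound of 900 even though only 706 chunks exist; the slice simply stops at the end of the list. A small batching helper, sketched below, reports the true range of each batch. The name `iter_batches` is hypothetical, and the list of integers stands in for the real `chunks`.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Tuple[int, int, Sequence[T]]]:
    """Yield (start, end, batch) where end is clamped to len(items)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        end = min(start + batch_size, len(items))
        yield start, end, items[start:end]


# Stand-in for the 706 chunks produced earlier in this notebook.
fake_chunks = list(range(706))
for start, end, batch in iter_batches(fake_chunks, 300):
    print(f"Uploading documents {start} to {end} ({len(batch)} items)")
    # In the notebook this is where vector_store.add_documents(documents=batch) would go.
```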


## Load fine-tuned contract clause classification model (LegalBERT)

In [12]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load model
model = BertForSequenceClassification.from_pretrained("fine-tuned-legal-bert-v1")
tokenizer = BertTokenizer.from_pretrained("fine-tuned-legal-bert-v1")

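# Use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.eval()  # inference mode: disables dropout

# Clause classification
def classify_clause(text):
    # Tokenize and truncate to the 512-token limit of BERT-style models
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512).to(device)
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    # Index of the highest logit: 1 = audit clause, 0 = anything else
    preds = torch.argmax(outputs.logits, dim=-1).item()
    classification_label = "Audit Clause" if preds == 1 else "Not an Audit Clause"
    return classification_label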


In [13]:
clause = "Supplier acknowledges and agrees that regulatory agencies may audit Supplier's performance at any time during normal business hours and that such audits may include both methods and results under this Agreement."

output = classify_clause(clause)
output

'Audit Clause'
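
The classifier returns only the arg-max label. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below shows the arithmetic in plain Python with hypothetical logit values; the label mapping mirrors the one used in `classify_clause`, while `softmax` and `label_with_confidence` are illustrative helpers. Softmax scores from a fine-tuned model are not necessarily well calibrated, so they should be read as a rough signal only.

```python
import math

# Same mapping as in classify_clause: 1 = audit clause, 0 = anything else.
LABELS = {0: "Not an Audit Clause", 1: "Audit Clause"}


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits: list[float]) -> tuple[str, float]:
    """Return the arg-max label and its softmax probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


# Hypothetical logits for one clause (not produced by the real model).
label, confidence = label_with_confidence([-1.2, 2.3])
print(label, round(confidence, 3))
```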

## Retrieval and Generation using LLM

In [None]:
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = AzureChatOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)

In [27]:
system_rag = """You are a legal advisor. Use ONLY the provided context. If a point is not supported by the context, say "I don't know." Provide a concise, cohesive explanation linking the clause and its classification. Use the exact template below:
    **Clause**: <clause>
    
    **Classification**: <label>
    
    **Key risks**: <Flag any potential risks in the clauses using bullets>
    
    **Mitigations**: <Solutions for risks>"""
    
prompt_rag = ChatPromptTemplate.from_messages([
    ("system", system_rag),
    ("user",
     "Below is a contract clause that has been classified as '{classification_label}' and given context:\n\n"
     "Clause:\n{clause}\n\nContext:\n{context}")
])

# RAG chain: retriever -> prompt -> llm
def run_rag(clause: str, classification_label: str):
    # Retrieve the 10 chunks most similar to the clause
    retrieved_docs = vector_store.similarity_search(clause, k=10)
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
    # Fill the prompt template, then send the messages to the LLM
    messages = prompt_rag.invoke({
        "classification_label": classification_label,
        "clause": clause,
        "context": docs_content,
    })
    return llm.invoke(messages).content
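
`run_rag` joins the retrieved passages into one unlabeled block of text. One possible variation, sketched below, numbers each passage and tags it with its page so that the answer can be traced back to the book. This is an illustration only: `Doc` is a stand-in for a LangChain `Document`, `format_context` and `max_chars` are hypothetical, and it assumes the `page` metadata written by the PDF loader is still present on the retrieved documents.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs, max_chars: int = 12000) -> str:
    """Number each passage, tag it with its page, and stop at a character budget."""
    blocks, used = [], 0
    for i, doc in enumerate(docs, start=1):
        page = doc.metadata.get("page", "?")  # "?" when no page is recorded
        block = f"[{i}] (page {page})\n{doc.page_content.strip()}"
        if used + len(block) > max_chars:
            break  # keep the prompt within the chosen budget
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)


example_docs = [
    Doc("Chapter 1: Contract ...", {"page": 48}),
    Doc("A passage without page metadata."),
]
print(format_context(example_docs))
```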

In [28]:
clause = "Supplier acknowledges and agrees that regulatory agencies may audit Supplier's performance at any time during normal business hours and that such audits may include both methods and results under this Agreement."

classification_label = classify_clause(clause)

rag_out = run_rag(clause, classification_label)

print(rag_out)

**Clause**: Supplier acknowledges and agrees that regulatory agencies may audit Supplier's performance at any time during normal business hours and that such audits may include both methods and results under this Agreement.

**Classification**: Audit Clause

**Key risks**:
- Lack of specificity regarding notice: The clause allows audits "at any time during normal business hours" but does not require advance notice, potentially leading to disruption of Supplier’s operations.
- Broad scope of audit: Both "methods and results" may be audited, which could involve intrusive inspections, access to confidential or proprietary information, or ambiguity about the extent of the review.
- No limitation on frequency or duration: The clause permits audits at any time, without limiting how often or how long audits may occur, which may create operational or resource concerns for the Supplier.

**Mitigations**:
- Require reasonable advance notice: Add a requirement for the regulatory agency to give re