# **Extracting Information from Legal Documents Using RAG**

## **Objective**

The main objective of this assignment is to process and analyse a collection of text files containing legal agreements (e.g., NDAs) to prepare them for implementing a **Retrieval-Augmented Generation (RAG)** system. This involves the following steps:

* Understand the Cleaned Data: Gain a comprehensive understanding of the structure, content, and context of the cleaned dataset.
* Perform Exploratory Analysis: Conduct bivariate and multivariate analyses to uncover relationships and trends within the cleaned data.
* Create Visualisations: Develop meaningful visualisations to support the analysis and make findings interpretable.
* Derive Insights and Conclusions: Extract valuable insights from the cleaned data and provide clear, actionable conclusions.
* Document the Process: Provide a detailed description of the data, its attributes, and the steps taken during the analysis for reproducibility and clarity.

The ultimate goal is to transform the raw text data into a clean, structured, and analysable format that can be effectively used to build and train a RAG system for tasks like information retrieval, question-answering, and knowledge extraction related to legal agreements.

### **Business Value**  


The project aims to leverage RAG to enhance legal document processing for businesses, law firms, and regulatory bodies. The key business objectives include:

* Faster Legal Research: <br> Reduce the time lawyers and compliance officers spend searching for relevant case laws, precedents, statutes, or contract clauses.
* Improved Contract Analysis: <br> Automatically extract key terms, obligations, and risks from lengthy contracts.
* Regulatory Compliance Monitoring: <br> Help businesses stay updated with legal and regulatory changes by retrieving relevant legal updates.
* Enhanced Decision-Making: <br> Provide accurate and context-aware legal insights to assist in risk assessment and legal strategy.


**Use Cases**
* Legal Chatbots
* Contract Review Automation
* Tracking Regulatory Changes and Compliance Monitoring
* Case Law Analysis of past judgments
* Due Diligence & Risk Assessment

## **1. Data Loading, Preparation and Analysis** <font color=red> [20 marks] </font><br>

### **1.1 Data Understanding**

The dataset contains legal documents and contracts collected from various sources. The documents are present as text files (`.txt`) in the *corpus* folder.

There are four types of documents in the *corpus* folder, divided into four subfolders.
- `contractnli`: contains various non-disclosure and confidentiality agreements
- `cuad`: contains contracts with annotated legal clauses
- `maud`: contains various merger/acquisition contracts and agreements
- `privacy_qa`: a question-answering dataset containing privacy policies

The dataset also contains evaluation files in JSON format in the *benchmark* folder. The files contain the questions and their answers, along with sources. For each of the above four folders, there is a `json` file: `contractnli.json`, `cuad.json`, `maud.json` and `privacy_qa.json`. The file structure is as follows:

```
{
    "tests": [
        {
            "query": <question1>,
            "snippets": [{
                    "file_path": <source_file1>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 1>
                },
                {
                    "file_path": <source_file2>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 1>
                }, ....
            ]
        },
        {
            "query": <question2>,
            "snippets": [{<answer context for que 2>}]
        },
        ... <more queries>
    ]
}
```
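
As a rough sketch of how this structure can be traversed, the snippet below parses a small in-memory example shaped like the benchmark files and flattens it into one row per snippet. The sample query, file path, and span values are made up for illustration, and `flatten_benchmark` is a hypothetical helper name rather than part of the dataset tooling.

```python
import json

# A made-up example that follows the benchmark layout described above.
sample_json = """
{
  "tests": [
    {
      "query": "Does the policy mention email collection?",
      "snippets": [
        {"file_path": "privacy_qa/example.txt", "span": [10, 42],
         "answer": "email addresses will also be collected"}
      ]
    },
    {"query": "Is there a termination clause?", "snippets": []}
  ]
}
"""


def flatten_benchmark(benchmark: dict) -> list:
    """Return one row per (query, snippet) pair; queries without snippets yield no rows."""
    rows = []
    for test in benchmark.get("tests", []):
        for snippet in test.get("snippets", []):
            begin, end = snippet["span"]
            rows.append({
                "query": test["query"],
                "file_path": snippet["file_path"],
                "span_length": end - begin,
                "answer": snippet["answer"],
            })
    return rows


rows = flatten_benchmark(json.loads(sample_json))
print(len(rows), rows[0]["file_path"], rows[0]["span_length"])
```

A flat list like this is convenient to load into a `pandas` DataFrame when exploring how many snippets each query has.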

### **1.2 Load and Preprocess the data** <font color=red> [5 marks] </font><br>

#### Loading libraries

In [5]:
## The following libraries might be useful
!pip install -q langchain-openai
!pip install -U -q langchain-community
!pip install -U -q langchain-chroma
!pip install -U -q datasets
!pip install -U -q ragas
!pip install -U -q rouge_score

In [None]:
!pip install -U langchain==0.1.20 langchain-community==0.0.38 langchain-openai==0.1.7 chromadb


In [1]:
# For working with files and data
import os
import json
import glob

# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# LangChain for building RAG
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.schema import Document


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from ragas.evaluation import evaluate
from rouge_score import rouge_scorer

#### **1.2.1** <font color=red> [3 marks] </font>
Load all `.txt` files from the folders.

You can utilise document loaders from the options provided by the LangChain community.

Optionally, you can also read the files manually, while ensuring proper handling of encoding issues (e.g., utf-8, latin1). In such case, also store the file content along with metadata (e.g., file name, directory path) for traceability.
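
A minimal sketch of the manual route, assuming a fallback order of utf-8 and then latin1. The helper names `read_text_with_fallback` and `load_folder` and the metadata keys are illustrative choices, not something the assignment prescribes.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return (text, encoding_used)."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin1 accepts any byte sequence, so this is only reached when the
    # caller passes a list of encodings without such a catch-all.
    raise ValueError(f"Could not decode {path} with {encodings}")


def load_folder(root):
    """Read every .txt file under root and keep metadata for traceability."""
    records = []
    for path in sorted(Path(root).rglob("*.txt")):
        text, enc = read_text_with_fallback(path)
        records.append({
            "text": text,
            "file_name": path.name,
            "directory": str(path.parent),
            "encoding": enc,
        })
    return records


# Demo on a temporary folder holding one file that is not valid utf-8.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_bytes("café".encode("latin1"))
    demo_records = load_folder(tmp)

print(demo_records[0]["file_name"], demo_records[0]["encoding"])
```

Compared with opening files with `errors="ignore"`, a fallback like this keeps every character and records which encoding was actually used.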

In [4]:
# Load the files as documents
import os
from langchain_core.documents import Document
from langchain_community.document_loaders import DirectoryLoader, TextLoader

class CustomTextLoader(TextLoader):
    def __init__(self, file_path, encoding="utf-8"):
        super().__init__(file_path, encoding=encoding)

    def load(self):
        with open(self.file_path, "r", encoding=self.encoding, errors="ignore") as f:
            text = f.read()
        return [Document(page_content=text, metadata={"source": self.file_path})]

loader = DirectoryLoader(
    path=r"/content/drive/MyDrive/RAG/corpus",
    glob="**/*.txt",
    loader_cls=CustomTextLoader,
    show_progress=True,
)

documents = loader.load()

print(f"Loaded {len(documents)} documents.")
print("Sample Document:\n", documents[0])


100%|██████████| 215/215 [01:33<00:00,  2.30it/s]

Loaded 215 documents.
Sample Document:
 page_content='This Privacy Policy explains how information is collected, used and disclosed by TickTick with respect to users access and use of our service through the application (Referred to below as TickTick).\n By accessing or using the Services, you agree to this Privacy Policy, our Terms of Service.\n IF YOU DO NOT AGREE TO THIS PRIVACY POLICY, PLEASE DO NOT USE THE SERVICE.\nWhen using TickTick, we ask certain information from you:\n Personal Information: When registering for TickTick, we collect personal information such as your name.\nUsers who contact us via email, the email addresses and information you submitted voluntarily will also be collected.\n Non-Personal Information: It includes but is not limited to your devices configuration, the package ID and version of the application that you use.\n We collect and hold only the information absolutely necessary for using our services, as well as limiting the internal access to your person




#### **1.2.2** <font color=red> [2 marks] </font>
Preprocess the text data to remove noise and prepare it for analysis.

Remove special characters, extra whitespace, and irrelevant content such as email and telephone contact info.
Normalise text (e.g., convert to lowercase, remove stop words).
Handle missing or corrupted data by logging errors and skipping problematic files.
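
The "log and skip" requirement could be handled along the following lines. This is only a sketch: `clean_all` is a hypothetical wrapper, the records are made-up `(name, text)` pairs, and the lambda stands in for whatever cleaning function is used.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("preprocess")


def clean_all(records, clean_fn):
    """Apply clean_fn to each (name, text) pair; log and skip failures or empty results."""
    cleaned, skipped = [], []
    for name, text in records:
        try:
            result = clean_fn(text)
        except Exception as exc:  # e.g. the text is None or not a string
            logger.warning("Skipping %s: %s", name, exc)
            skipped.append(name)
            continue
        if not result:
            logger.warning("Skipping %s: empty after cleaning", name)
            skipped.append(name)
            continue
        cleaned.append((name, result))
    return cleaned, skipped


# Made-up records: one valid, one corrupted (None), one that is only whitespace.
demo = [("ok.txt", "  Some TEXT "), ("bad.txt", None), ("empty.txt", "   ")]
cleaned, skipped = clean_all(demo, lambda t: t.strip().lower())
print(cleaned, skipped)
```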

In [None]:
import nltk
nltk.download('all')

In [6]:
# Clean and preprocess the data
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Initialize stopwords
stop_words = set(stopwords.words('english'))


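def preprocess_text(text: str) -> str:
    """Lowercase the text, strip contact details and non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r'\S+@\S+', '', text)                # remove email addresses
    text = re.sub(r'\+?\d[\d\-()\s]{7,}\d', '', text)  # remove phone-number-like digit runs
    text = re.sub(r'[^a-z\s]', '', text)               # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()           # collapse repeated whitespace
    tokens = word_tokenize(text)
    filtered = [word for word in tokens if word not in stop_words]

    return ' '.join(filtered)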

# Apply preprocessing once per document and skip documents that end up empty
cleaned_documents = []
for doc in documents:
    cleaned_text = preprocess_text(doc.page_content)
    if cleaned_text:
        cleaned_documents.append(
            Document(page_content=cleaned_text, metadata=doc.metadata)
        )

# Summary
print(f"Successfully preprocessed {len(cleaned_documents)} documents out of {len(documents)}.")
print("Sample cleaned content:\n", cleaned_documents[0].page_content[:500])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Successfully preprocessed 215 documents out of 215.
Sample cleaned content:
 privacy policy explains information collected used disclosed ticktick respect users access use service application referred ticktick accessing using services agree privacy policy terms service agree privacy policy please use service using ticktick ask certain information personal information registering ticktick collect personal information name users contact us via email email addresses information submitted voluntarily also collected nonpersonal information includes limited devices configurati


### **1.3 Exploratory Data Analysis** <font color=red> [10 marks] </font><br>

#### **1.3.1** <font color=red> [1 mark] </font>
Calculate the average, maximum and minimum document length.

In [None]:
# Calculate the average, maximum and minimum document length.

doc_lengths = [len(doc.page_content.split()) for doc in cleaned_documents]


average_length = sum(doc_lengths) / len(doc_lengths)
max_length = max(doc_lengths)
min_length = min(doc_lengths)

# Output results
print(f"Average document length: {average_length:.2f} words")
print(f"Maximum document length: {max_length} words")
print(f"Minimum document length: {min_length} words")


Average document length: 18082.93 words
Maximum document length: 82545 words
Minimum document length: 200 words
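
Because the average sits far from both the minimum and the maximum, the median and quartiles may describe a typical document better. The sketch below uses made-up word counts; in the notebook the list would be `doc_lengths`.

```python
import statistics

# Hypothetical word counts standing in for `doc_lengths`.
example_lengths = [200, 950, 1200, 3100, 5400, 18000, 41000, 82545]

mean_length = statistics.mean(example_lengths)
median_length = statistics.median(example_lengths)
# quantiles(n=4) returns the three cut points Q1, Q2 and Q3.
quartiles = statistics.quantiles(example_lengths, n=4)

print(f"Mean:   {mean_length:.2f} words")
print(f"Median: {median_length:.2f} words")
print("Quartiles:", [round(q, 1) for q in quartiles])
```

A mean well above the median points to a few very long documents pulling the average up.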


#### **1.3.2** <font color=red> [4 marks] </font>
Analyse the frequency of occurrence of words and find the most and least occurring words.

Find the 20 most common and least common words in the text. Ignore stop words such as articles and prepositions.

In [None]:
# Find frequency of occurrence of words
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

all_text = " ".join(doc.page_content for doc in cleaned_documents)


tokens = re.findall(r'\b[a-zA-Z]{2,}\b', all_text.lower())
filtered_tokens = [word for word in tokens if word not in ENGLISH_STOP_WORDS]

word_counts = Counter(filtered_tokens)

# Get the 20 most and least common words
most_common = word_counts.most_common(20)
least_common = word_counts.most_common()[-20:]

print("Top 20 most common words:")
for word, count in most_common:
    print(f"{word}: {count}")

print("\nTop 20 least common words:")
for word, count in least_common:
    print(f"{word}: {count}")

Top 20 most common words:
company: 137133
section: 62736
agreement: 62139
shall: 60443
parent: 57662
merger: 33488
subsidiaries: 32734
material: 29733
date: 29629
time: 26496
stock: 25068
applicable: 24043
respect: 22619
including: 19719
party: 19561
shares: 19253
prior: 17916
effect: 17568
effective: 17308
ii: 17069

Top 20 least common words:
httpwwwopensourceorg: 1
geometry: 1
enriching: 1
sorting: 1
retrieving: 1
aligning: 1
gds: 1
txi: 1
dfsvenuecom: 1
bh: 1
corrupting: 1
surf: 1
munitions: 1
relinquishment: 1
customerspecific: 1
harsh: 1
tushin: 1
leo: 1
chelsea: 1
darnell: 1


#### **1.3.3** <font color=red> [4 marks] </font>
Analyse the similarity of different documents to each other based on TF-IDF vectors.

Transform some documents to TF-IDF vectors and calculate their similarity matrix using a suitable distance function. If contracts contain duplicate or highly similar clauses, similarity calculation can help detect them.

Compute the similarity matrix for the first 10 documents and then for 10 random documents. What do you observe?

In [None]:
# Transform the page contents of documents
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import random

document_texts = [doc.page_content for doc in cleaned_documents]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(document_texts)

# Compute similarity scores
first_10_similarity = cosine_similarity(tfidf_matrix[:10])


In [None]:
print("Similarity Matrix for First 10 Documents:")
print(np.round(first_10_similarity, 2))

Similarity Matrix for First 10 Documents:
[[1.   0.21 0.32 0.36 0.37 0.22 0.23 0.09 0.05 0.09]
 [0.21 1.   0.29 0.31 0.35 0.21 0.2  0.1  0.04 0.11]
 [0.32 0.29 1.   0.44 0.41 0.31 0.34 0.15 0.08 0.17]
 [0.36 0.31 0.44 1.   0.52 0.29 0.3  0.13 0.06 0.15]
 [0.37 0.35 0.41 0.52 1.   0.33 0.27 0.14 0.07 0.17]
 [0.22 0.21 0.31 0.29 0.33 1.   0.2  0.12 0.05 0.15]
 [0.23 0.2  0.34 0.3  0.27 0.2  1.   0.08 0.05 0.09]
 [0.09 0.1  0.15 0.13 0.14 0.12 0.08 1.   0.09 0.17]
 [0.05 0.04 0.08 0.06 0.07 0.05 0.05 0.09 1.   0.13]
 [0.09 0.11 0.17 0.15 0.17 0.15 0.09 0.17 0.13 1.  ]]


In [None]:
# create a list of 10 random integers
random_indices = random.sample(range(len(document_texts)), 10)


In [None]:
# Compute similarity scores for 10 random documents
random_tfidf_matrix = tfidf_matrix[random_indices]
random_similarity = cosine_similarity(random_tfidf_matrix)

print("\nSimilarity Matrix for 10 Random Documents (indices):", random_indices)
print(np.round(random_similarity, 2))



Similarity Matrix for 10 Random Documents (indices): [143, 60, 76, 103, 190, 56, 156, 43, 131, 195]
[[1.   0.28 0.21 0.88 0.13 0.14 0.93 0.28 0.89 0.89]
 [0.28 1.   0.45 0.22 0.05 0.32 0.24 0.5  0.24 0.22]
 [0.21 0.45 1.   0.18 0.04 0.15 0.17 0.29 0.19 0.16]
 [0.88 0.22 0.18 1.   0.12 0.12 0.83 0.22 0.81 0.89]
 [0.13 0.05 0.04 0.12 1.   0.03 0.12 0.06 0.13 0.13]
 [0.14 0.32 0.15 0.12 0.03 1.   0.12 0.34 0.13 0.11]
 [0.93 0.24 0.17 0.83 0.12 0.12 1.   0.22 0.88 0.89]
 [0.28 0.5  0.29 0.22 0.06 0.34 0.22 1.   0.24 0.21]
 [0.89 0.24 0.19 0.81 0.13 0.13 0.88 0.24 1.   0.85]
 [0.89 0.22 0.16 0.89 0.13 0.11 0.89 0.21 0.85 1.  ]]


**First 10 Documents**<br>
Off-diagonal similarity is low to moderate (0.04–0.52). Documents 1–7 share some structure or legal language (0.20–0.52 with one another).

Documents 8, 9 and 10 are less similar to the rest (at most 0.17), indicating different content or format.

**10 Random Documents**<br>
Wide range of similarity (0.03–0.93).

Five documents (indices 143, 103, 156, 131 and 195) are highly similar to one another (0.81–0.93), likely due to duplicated clauses or shared templates.

Others are highly distinct, showing content diversity.
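
To make the duplicate check concrete, a small helper can list the document pairs whose similarity exceeds a threshold. This is a sketch only: `similar_pairs` is a hypothetical name, the 0.8 threshold is an assumed cut-off, and the example matrix is the top-left 4x4 block of the random-document matrix printed above.

```python
def similar_pairs(sim_matrix, labels, threshold=0.8):
    """Return (label_i, label_j, score) for every pair at or above the threshold."""
    pairs = []
    n = len(sim_matrix)
    for i in range(n):
        # Upper triangle only: skips self-pairs and mirrored duplicates.
        for j in range(i + 1, n):
            if sim_matrix[i][j] >= threshold:
                pairs.append((labels[i], labels[j], sim_matrix[i][j]))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Top-left 4x4 block of the random-document matrix (indices 143, 60, 76, 103).
example_matrix = [
    [1.00, 0.28, 0.21, 0.88],
    [0.28, 1.00, 0.45, 0.22],
    [0.21, 0.45, 1.00, 0.18],
    [0.88, 0.22, 0.18, 1.00],
]
example_labels = [143, 60, 76, 103]

pairs = similar_pairs(example_matrix, example_labels)
print(pairs)
```

The same function accepts the full NumPy matrix from `cosine_similarity`, since it only relies on `matrix[i][j]` indexing.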

### **1.4 Document Creation and Chunking** <font color=red> [5 marks] </font><br>

#### **1.4.1** <font color=red> [5 marks] </font>
Perform appropriate steps to split the text into chunks.

In [7]:
# Process files and generate chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " "]
)

chunks = text_splitter.split_documents(documents)

print(f"Total chunks created: {len(chunks)}")
print("Sample chunk:")
print(chunks[0].page_content)


Total chunks created: 159654
Sample chunk:
This Privacy Policy explains how information is collected, used and disclosed by TickTick with respect to users access and use of our service through the application (Referred to below as TickTick).
 By accessing or using the Services, you agree to this Privacy Policy, our Terms of Service.
 IF YOU DO NOT AGREE TO THIS PRIVACY POLICY, PLEASE DO NOT USE THE SERVICE.
When using TickTick, we ask certain information from you:
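
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); `sliding_chunks` is a hypothetical helper used only to show the overlap arithmetic.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts this far after the previous one
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# 1200 characters of dummy text.
demo_text = "".join(str(i % 10) for i in range(1200))
demo_chunks = sliding_chunks(demo_text)

print([len(c) for c in demo_chunks])
# The tail of one chunk is repeated at the head of the next.
print(demo_chunks[0][-100:] == demo_chunks[1][:100])
```

The overlap means a clause that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.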


## **2. Vector Database and RAG Chain Creation** <font color=red> [15 marks] </font><br>

### **2.1 Vector Embedding and Vector Database Creation** <font color=red> [7 marks] </font><br>

#### **2.1.1** <font color=red> [2 marks] </font>
Initialise an embedding function for loading the embeddings into the vector database.

Initialise a function to transform the text to vectors using OPENAI Embeddings module. You can also use this function to transform during vector DB creation itself.

In [8]:
# A local Hugging Face sentence-transformers model is used here instead of OpenAI embeddings,
# so no OpenAI API key is required
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialise an embedding function
embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### **2.1.2** <font color=red> [5 marks] </font>
Load the embeddings to a vector database.

Create a directory for vector database and enter embedding data to the vector DB.

In [9]:
# Because of system limitations, the chunks are exported to a local JSON file so they can be loaded back later

import json

# Save chunks to a JSON file
chunks_data = [
    {
        "id": str(i),
        "text": doc.page_content,
        "metadata": doc.metadata
    }
    for i, doc in enumerate(chunks)
]

with open("legal_chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks_data, f, indent=2)

print("Chunks exported to legal_chunks.json")


Chunks exported to legal_chunks.json


In [10]:

# Create (or reopen) a persistent Chroma vector store that uses the embedding function defined above

persist_directory = "./chroma_db_legal"
os.makedirs(persist_directory, exist_ok=True)

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding_function
)

print(" Vector DB loaded successfully!")


 Vector DB loaded successfully!


In [12]:
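def add_chunks_in_batches(vectordb, chunks, batch_size=5000):
    """Add chunks to the vector store in batches instead of in one very large call."""
    total = len(chunks)
    for i in range(0, total, batch_size):
        batch = chunks[i:i + batch_size]
        vectordb.add_documents(batch)
        print(f"✅ Added batch {i // batch_size + 1} ({i + len(batch)} / {total})")

    # Explicit flush to disk; recent chromadb releases (0.4 and later) also persist automatically
    vectordb.persist()
    print("Vector DB created and persisted in batches.")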


In [None]:
add_chunks_in_batches(vectordb, chunks)


### **2.2 Create RAG Chain** <font color=red> [8 marks] </font><br>

#### **2.2.1** <font color=red> [5 marks] </font>
Create a RAG chain.

In [14]:
# Create a RAG chain
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model_id = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

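# Set up the HF pipeline
# Note: `temperature` only affects sampling; without `do_sample=True` the pipeline
# decodes greedily and the value is ignored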
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=256,
    temperature=0.3
)
llm = HuggingFacePipeline(pipeline=pipe)



Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [15]:
retriever = vectordb.as_retriever(search_kwargs={"k": 5})

# Create RAG chain
from langchain.chains import RetrievalQA

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",  # Simple and works well for QA
    return_source_documents=True
)
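
Conceptually, a retriever with `k` set returns the `k` stored chunks whose embeddings are closest to the query embedding. The toy sketch below shows that ranking with hand-written 3-dimensional vectors and cosine similarity; the vectors, the chunk names and the choice of cosine similarity are assumptions for illustration and do not reflect how Chroma is configured in this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


# Made-up embeddings; real ones from all-MiniLM-L6-v2 have 384 dimensions.
toy_chunks = {
    "nda_clause": [0.9, 0.1, 0.0],
    "merger_terms": [0.1, 0.9, 0.1],
    "privacy_notice": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

print(top_k(toy_query, toy_chunks))
```

A larger `k` gives the language model more context to work with, but also a longer prompt.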


#### **2.2.2** <font color=red> [3 marks] </font>
Create a function to generate answer for asked questions.

Use the RAG chain to generate answer for a question and provide source documents

In [16]:
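# Create a function for question answering
def generate_answer(question: str, save_output: bool = False, output_file: str = "rag_output.json"):
    """Run the RAG chain, print the answer with its sources, and optionally save them as JSON."""
    result = rag_chain.invoke({"query": question})

    print("\n Question:")
    print(question)

    print("\n Answer:")
    print(result["result"])

    print("\n Source Documents:")
    for i, doc in enumerate(result["source_documents"], 1):
        print(f"\n--- Source {i} ---")
        print("Text Preview:", doc.page_content[:300])
        print("Metadata:", doc.metadata)

    # Write the question, answer and source metadata to disk when requested
    if save_output:
        output = {
            "question": question,
            "answer": result["result"],
            "sources": [doc.metadata for doc in result["source_documents"]],
        }
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(output, f, indent=2)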


In [17]:
# Example question
question = ("Consider the Non-Disclosure Agreement between CopAcc and ToP Mentors; "
            "Does the document indicate that the Agreement does not grant the Receiving Party "
            "any rights to the Confidential Information?")

generate_answer(question, save_output=True)




 Question:
Consider the Non-Disclosure Agreement between CopAcc and ToP Mentors; Does the document indicate that the Agreement does not grant the Receiving Party any rights to the Confidential Information?

 Answer:
Yes

 Source Documents:

--- Source 1 ---
Text Preview: Mentor shall not disclose any Confidential Information to any third party or to Mentor’s employees and/or employer without the prior written consent of the Participants. Mentor shall require his/her employees who will have access to Confidential Information to commit to a non-disclosure agreement th
Metadata: {'source': '/content/drive/MyDrive/RAG/corpus/contractnli/CopAcc_NDA-and-ToP-Mentors_2.0_2017.txt'}

--- Source 2 ---
Text Preview: Confidentiality and Non-Disclosure Agreement
Metadata: {'source': '/content/drive/MyDrive/RAG/corpus/contractnli/NDA-ONSemi_IndustryAnalystConf-2011.txt'}

--- Source 3 ---
Text Preview: CONFIDENTIALITY AND NON-DISCLOSURE AGREEMENT
Metadata: {'source': '/content/drive/MyDrive/RAG/cor

## **3. RAG Evaluation** <font color=red> [10 marks] </font><br>

### **3.1 Evaluation and Inference** <font color=red> [10 marks] </font><br>

#### **3.1.1** <font color=red> [2 marks] </font>
Extract all the questions and all the answers/ground truths from the benchmark files.

Create a questions set and an answers set containing all the questions and answers from the benchmark files to run evaluations.
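
One possible way to cover every benchmark file at once is sketched below. The function name `load_question_answer_sets` is hypothetical, and joining all snippet answers of a query into one reference string is an assumption about how the ground truth should be formed. The demo writes a made-up file to a temporary folder so the sketch runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_question_answer_sets(benchmark_dir):
    """Collect questions and reference answers from every JSON file in benchmark_dir."""
    questions, answers = [], []
    for path in sorted(Path(benchmark_dir).glob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        for test in data.get("tests", []):
            snippet_answers = [s.get("answer", "").strip() for s in test.get("snippets", [])]
            questions.append(test.get("query", "").strip())
            # Assumption: all snippets of a query together form the reference answer.
            answers.append(" ".join(a for a in snippet_answers if a))
    return questions, answers


# Tiny demo with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"tests": [{"query": "Q1?",
                       "snippets": [{"answer": "part one."}, {"answer": "part two."}]}]}
    (Path(tmp) / "demo.json").write_text(json.dumps(demo), encoding="utf-8")
    demo_questions, demo_answers = load_question_answer_sets(tmp)

print(demo_questions, demo_answers)
```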

In [18]:
# Create a question set by taking all the questions from the benchmark data
# Also create a ground truth/answer set
import json

# Load the benchmark file
with open("/content/drive/MyDrive/RAG/benchmarks/privacy_qa.json", "r") as f:
    benchmark_data = json.load(f)

question_set = []
ground_truth_set = []


for test in benchmark_data["tests"]:
    question = test.get("query", "").strip()
    snippets = test.get("snippets", [])

    if snippets:
        answer = snippets[0].get("answer", "").strip()
    else:
        answer = "No answer provided."

    question_set.append(question)
    ground_truth_set.append(answer)

print(f"Extracted {len(question_set)} questions and answers.")
print("Sample Q:", question_set[0])
print("Sample A:", ground_truth_set[0])


Extracted 194 questions and answers.
Sample Q: Consider "Fiverr"'s privacy policy; who can see which tasks i hire workers for?
Sample A: In addition, we collect information while you access, browse, view or otherwise use the Site.
In other words, when you access the Site we are aware of your usage of the Site, and may gather, collect and record the information relating to such usage, including geo-location information, IP address, device and connection information, browser information and web-log information, and all communications recorded by Users through the Site.


#### **3.1.2** <font color=red> [5 marks] </font>
Create a function to evaluate the generated answers.

Evaluate the responses on *Rouge*, *Ragas* and *Bleu* scores.

In [19]:
!pip install nltk rouge-score



In [20]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

In [21]:

def generate_predictions(question_list):
    return [rag_chain.invoke({"query": q})["result"].strip() for q in question_list]


In [22]:
# Function to evaluate the RAG pipeline
def evaluate_bleu_rouge(predictions, references):
    smoothie = SmoothingFunction().method4
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

    total_bleu = 0
    total_rouge1 = 0
    total_rougeL = 0

    for pred, ref in zip(predictions, references):
        # BLEU
        ref_tokens = [ref.split()]
        pred_tokens = pred.split()
        bleu = sentence_bleu(ref_tokens, pred_tokens, smoothing_function=smoothie)
        total_bleu += bleu

        # ROUGE (RougeScorer.score expects the reference first, then the prediction)
        scores = rouge.score(ref, pred)
        total_rouge1 += scores['rouge1'].fmeasure
        total_rougeL += scores['rougeL'].fmeasure

    n = len(predictions)
    return {
        "BLEU": round(total_bleu / n, 4),
        "ROUGE-1": round(total_rouge1 / n, 4),
        "ROUGE-L": round(total_rougeL / n, 4)
    }
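
To make the ROUGE-1 number easier to interpret, the sketch below computes a simplified unigram-overlap F-measure by hand. It lowercases and splits on whitespace only, so it will not match `rouge_score` exactly (that library uses its own tokenizer and, in this notebook, stemming); `rouge1_f` and the example sentences are illustrative.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure on lowercased whitespace tokens (no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "we collect information while you access the site"
short_score = rouge1_f("we collect information", reference)
full_score = rouge1_f(reference, reference)
no_overlap_score = rouge1_f("yes", reference)

print(round(short_score, 4), full_score, no_overlap_score)
```

A correct but very short answer has high precision and low recall against a long reference passage, which pulls the F-measure down.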


#### **3.1.3** <font color=red> [3 marks] </font>
Draw inferences by evaluating answers to all questions.

In [23]:
# Evaluate the RAG pipeline
# Generate predictions for benchmark questions
predicted_answers = generate_predictions(question_set[:100])  # Use the first 100 questions for speed

#  Evaluate using BLEU and ROUGE
scores = evaluate_bleu_rouge(predicted_answers, ground_truth_set[:100])

print("Evaluation Scores:")
for metric, score in scores.items():
    print(f"{metric}: {score}")

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Token indices sequence length is longer than the specified maximum sequence length for this model (533 > 512). Running this sequence through the model will result in indexing errors


Evaluation Scores:
BLEU: 0.0176
ROUGE-1: 0.1435
ROUGE-L: 0.0913


To save time and computing power, you can just run the evaluation on the first 100 questions.
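
The tokenizer warning in the output above (533 > 512) suggests that some prompts exceed the model's maximum sequence length. One possible mitigation is to keep only as many retrieved chunks as fit a budget, as sketched here. Whitespace word counts are a rough stand-in for real token counts, `fit_to_budget` is a hypothetical helper, and the prompt template of the chain would consume part of the budget as well.

```python
def fit_to_budget(chunks, question, max_tokens=512, count_tokens=lambda s: len(s.split())):
    """Keep chunks in ranked order until adding the next one would exceed the budget."""
    budget = max_tokens - count_tokens(question)
    kept = []
    for chunk in chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        kept.append(chunk)
        budget -= cost
    return kept


# Three dummy chunks of 200 "tokens" each and a 4-token question.
demo_chunks = ["a " * 200, "b " * 200, "c " * 200]
kept = fit_to_budget(demo_chunks, "what is collected ?")

print(len(kept))
```

In practice `count_tokens` could be replaced by a function based on the model's own tokenizer for an exact count.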

## **4. Conclusion** <font color=red> [5 marks] </font><br>

### **4.1 Conclusions and insights** <font color=red> [5 marks] </font><br>

#### **4.1.1** <font color=red> [5 marks] </font>
Conclude with the results here. Include the insights gained about the data, model pipeline, the RAG process and the results obtained.

### 4.1.1 Conclusions and Insights

This project aimed to build a document-based Question Answering (QA) system using a Retrieval-Augmented Generation (RAG) approach.

####  Approach Summary:

1. **Data Loading**:
   - Loaded 215 legal and policy-related text documents from the given corpus using a custom text loader.

2. **Chunking and Embedding**:
   - Split the documents into chunks of 500 characters with an overlap of 100 characters using `RecursiveCharacterTextSplitter`.
   - Used the **`all-MiniLM-L6-v2`** model from `sentence-transformers` to embed the chunks for semantic search.
   - Stored the embeddings in a persistent **Chroma** vector store to enable similarity-based retrieval.

3. **Retriever**:
   - Retrieved the top 5 chunks (`k=5`) for a given query using similarity search over the embeddings.

4. **Question Answering Model**:
   - Used a pretrained, instruction-tuned generative model, `google/flan-t5-base`, through a Hugging Face `text2text-generation` pipeline.
   - Integrated it with the retriever through a `RetrievalQA` chain of type `stuff`, which passes the retrieved chunks to the model as context.

5. **Evaluation**:
   - Extracted benchmark questions and their ground truth answers from the `privacy_qa.json` benchmark file.
   - Evaluated the system output on the first 100 questions using standard NLP metrics: **BLEU**, **ROUGE-1**, and **ROUGE-L**.

####  Final Evaluation Scores:

- **BLEU Score**: `0.0176`
- **ROUGE-1 Score**: `0.1435`
- **ROUGE-L Score**: `0.0913`

####  Insights:

- The combination of semantic search and a small generative model retrieved relevant context for the sample NDA question: the top source was the CopAcc and ToP Mentors agreement named in the question.
- The sample answer from `google/flan-t5-base` was a single word ("Yes"), while the benchmark ground truths are full passages from the documents.
- The low BLEU score reflects limited overlap in exact wording with the ground truth, which is expected when short generated answers are compared with long reference passages.
- ROUGE scores indicate some degree of lexical overlap, but there is significant scope for improving the relevance and completeness of the generated answers.
- The tokenizer warning during evaluation (533 tokens against a maximum of 512) shows that a prompt containing five retrieved chunks can exceed the model's maximum sequence length.

####  Conclusion:

The system shows that a lightweight embedding model (`all-MiniLM-L6-v2`) combined with a small generative model (`google/flan-t5-base`) is enough to build a working legal QA pipeline, although the evaluation scores remain low. The pipeline can be further improved by refining retrieval (e.g., chunking strategy), using a more capable generative model, or applying evaluation metrics better suited to comparing short answers with long reference passages.


***Below are the results of my trial-and-error runs***

| k | Chunk size | Model | BLEU | ROUGE-1 | ROUGE-L |
|---|------------|-------|--------|---------|---------|
| 3 | 1000 | large | 0.0002 | 0.0146 | 0.0127 |
| 5 | 1000 | large | 0.0002 | 0.0146 | 0.0127 |
| 1 | 500 | base | 0.0013 | 0.0399 | 0.032 |
| 1 | 1000 | large | 0.0002 | 0.0146 | 0.0127 |
| 1 | 500 | large | 0.0002 | 0.0146 | 0.0127 |
| 1 | 500 | base | 0.0012 | 0.0382 | 0.0304 |
| 5 | 500 | base | 0.0012 | 0.0382 | 0.0304 |
| 5 | 500 | base | 0.0176 | 0.1435 | 0.0913 |