# **Extracting Information from Legal Documents Using RAG**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Objective**

The main objective of this assignment is to process and analyse a collection text files containing legal agreements (e.g., NDAs) to prepare them for implementing a **Retrieval-Augmented Generation (RAG)** system. This involves:

* Understand the Cleaned Data : Gain a comprehensive understanding of the structure, content, and context of the cleaned dataset.
* Perform Exploratory Analysis : Conduct bivariate and multivariate analyses to uncover relationships and trends within the cleaned data.
* Create Visualisations : Develop meaningful visualisations to support the analysis and make findings interpretable.
* Derive Insights and Conclusions : Extract valuable insights from the cleaned data and provide clear, actionable conclusions.
* Document the Process : Provide a detailed description of the data, its attributes, and the steps taken during the analysis for reproducibility and clarity.

The ultimate goal is to transform the raw text data into a clean, structured, and analysable format that can be effectively used to build and train a RAG system for tasks like information retrieval, question-answering, and knowledge extraction related to legal agreements.

### **Business Value**  


The project aims to leverage RAG to enhance legal document processing for businesses, law firms, and regulatory bodies. The key business objectives include:

* Faster Legal Research: <br> Reduce the time lawyers and compliance officers spend searching for relevant case laws, precedents, statutes, or contract clauses.
* Improved Contract Analysis: <br> Automatically extract key terms, obligations, and risks from lengthy contracts.
* Regulatory Compliance Monitoring: <br> Help businesses stay updated with legal and regulatory changes by retrieving relevant legal updates.
* Enhanced Decision-Making: <br> Provide accurate and context-aware legal insights to assist in risk assessment and legal strategy.


**Use Cases**
* Legal Chatbots
* Contract Review Automation
* Tracking Regulatory Changes and Compliance Monitoring
* Case Law Analysis of past judgments
* Due Diligence & Risk Assessment

## **1. Data Loading, Preparation and Analysis** <font color=red> [20 marks] </font><br>

### **1.1 Data Understanding**

The dataset contains legal documents and contracts collected from various sources. The documents are present as text files (`.txt`) in the *corpus* folder.

There are four types of documents in the *courpus* folder, divided into four subfolders.
- `contractnli`: contains various non-disclosure and confidentiality agreements
- `cuad`: contains contracts with annotated legal clauses
- `maud`: contains various merger/acquisition contracts and agreements
- `privacy_qa`: a question-answering dataset containing privacy policies

The dataset also contains evaluation files in JSON format in the *benchmark* folder. The files contain the questions and their answers, along with sources. For each of the above four folders, there is a `json` file: `contractnli.json`, `cuad.json`, `maud.json` `privacy_qa.json`. The file structure is as follows:

```
{
    "tests": [
        {
            "query": <question1>,
            "snippets": [{
                    "file_path": <source_file1>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 1>
                },
                {
                    "file_path": <source_file2>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 2>
                }, ....
            ]
        },
        {
            "query": <question2>,
            "snippets": [{<answer context for que 2>}]
        },
        ... <more queries>
    ]
}
```

### **1.2 Load and Preprocess the data** <font color=red> [5 marks] </font><br>

#### Loading libraries

In [2]:
# The following libraries might be useful
!pip install -q langchain-openai
!pip install -U -q langchain-community
!pip install -U -q langchain-chroma
!pip install -U -q datasets
!pip install -U -q ragas
!pip install -U -q rouge_score

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/63.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m63.4/63.4 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m438.5/438.5 kB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m30.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.4/44.4 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


#### **1.2.1** <font color=red> [3 marks] </font>
Load all `.txt` files from the folders.

You can utilise document loaders from the options provided by the LangChain community.

Optionally, you can also read the files manually, while ensuring proper handling of encoding issues (e.g., utf-8, latin1). In such case, also store the file content along with metadata (e.g., file name, directory path) for traceability.

In [48]:
from langchain.document_loaders import TextLoader
import os

# Define the base path for the corpus
base_path = "/content/drive/MyDrive/Colab Notebooks/RAG Upgrad/rag_legal/corpus"
# Dictionary to store the loaded documents
documents = {}

file_content_map = {}


# Iterate through the subfolders
for folder in ["contractnli", "cuad", "maud", "privacy_qa"]:
    folder_path = os.path.join(base_path, folder)
    documents[folder] = []

    # Load all .txt files in the folder using TextLoader
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        if file_name.endswith(".txt"):
            try:
                loader = TextLoader(file_path, encoding="utf-8")
                docs = loader.load()
                for doc in docs:
                    documents[folder].append(doc)

                file_content_map[file_path] = doc.page_content # Store original content before preprocessing

            except Exception as e:
                print(f"Error loading file {file_path}: {e}")

for folder, docs in documents.items():
    print(f"Loaded {len(docs)} documents from {folder}")
print(f"Created file content map with {len(file_content_map)} entries.")


Loaded 95 documents from contractnli
Loaded 462 documents from cuad
Loaded 134 documents from maud
Loaded 7 documents from privacy_qa
Created file content map with 698 entries.


In [4]:
total_docs =[]
for folder, docs in documents.items():
    no_of_docs = []
    for doc in docs:
        no_of_docs.append(doc.page_content)
    total_docs.extend(no_of_docs)
    print(f"Number of documents in {folder}: {len(no_of_docs)}")
    print(f"Total number of documents: {len(total_docs)}")

Number of documents in contractnli: 95
Total number of documents: 95
Number of documents in cuad: 462
Total number of documents: 557
Number of documents in maud: 134
Total number of documents: 691
Number of documents in privacy_qa: 7
Total number of documents: 698


In [5]:
print(type(docs[0]))  # Check the type of the first element in docs

<class 'langchain_core.documents.base.Document'>


In [6]:
docs[0].metadata  # Access the metadata of the first document

{'source': '/content/drive/MyDrive/Colab Notebooks/RAG Upgrad/rag_legal/corpus/privacy_qa/Viber Messenger.txt'}

#### **1.2.2** <font color=red> [2 marks] </font>
Preprocess the text data to remove noise and prepare it for analysis.

Remove special characters, extra whitespace, and irrelevant content such as email and telephone contact info.
Normalise text (e.g., convert to lowercase, remove stop words).
Handle missing or corrupted data by logging errors and skipping problematic files.

In [7]:
# Clean and preprocess the data

import re

def preprocess_text(text):
    """
    Preprocess the text by removing special characters, extra whitespace,
    and irrelevant content such as email and telephone contact info.
    Normalize text by converting to lowercase and removing stop words.
    """
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove phone numbers
    text = re.sub(r'\b\d{10,}\b', '', text)

    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Example usage
for folder, docs in documents.items():
    for doc in docs:
        doc.page_content = preprocess_text(doc.page_content)

print("Preprocessing complete.")

Preprocessing complete.


### **1.3 Exploratory Data Analysis** <font color=red> [10 marks] </font><br>

#### **1.3.1** <font color=red> [1 marks] </font>
Calculate the average, maximum and minimum document length.

In [8]:
lengths = []
for folder, docs in documents.items():
    for doc in docs:
        lengths.append(len(doc.page_content))
print(len(lengths))


698


In [9]:
# Calculate the average, maximum, and minimum document length
def calculate_document_lengths(documents):
    lengths = []
    for folder, docs in documents.items():
        for doc in docs:
            lengths.append(len(doc.page_content))  # Access "page_content" as a dictionary key

    if lengths:
        avg_length = sum(lengths) / len(lengths)
        max_length = max(lengths)
        min_length = min(lengths)
        total_documents = len(lengths)
        return avg_length, max_length, min_length, total_documents
    else:
        return 0, 0, 0

# Example usage
avg_length, max_length, min_length, total_documents = calculate_document_lengths(documents)
print(f"Average Document Length: {avg_length}")
print(f"Maximum Document Length: {max_length}")
print(f"Minimum Document Length: {min_length}")
print(f"Total Number of Documents: {total_documents}")

Average Document Length: 99063.25787965616
Maximum Document Length: 957212
Minimum Document Length: 1329
Total Number of Documents: 698


#### **1.3.2** <font color=red> [4 marks] </font>
Analyse the frequency of occurrence of words and find the most and least occurring words.

Find the 20 most common and least common words in the text. Ignore stop words such as articles and prepositions.

In [10]:
# Find frequency of occurence of words


from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Define a function to find the most and least common words
def find_common_words(documents, num_words=20):
    stop_words = set(stopwords.words('english'))
    word_counts = Counter()

    # Tokenize and count words in all documents
    for folder, docs in documents.items():
        for doc in docs:
            words = word_tokenize(doc.page_content)
            filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
            word_counts.update(filtered_words)

    # Get the most and least common words
    most_common = word_counts.most_common(num_words)
    least_common = word_counts.most_common()[:-num_words-1:-1]

    return most_common, least_common

most_common, least_common = find_common_words(documents)
print("20 Most Common Words:", most_common)
print("20 Least Common Words:", least_common)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


20 Most Common Words: [('company', 148167), ('shall', 107995), ('agreement', 104559), ('section', 75344), ('parent', 58009), ('party', 49656), ('date', 39281), ('time', 35251), ('material', 34208), ('merger', 33843), ('subsidiaries', 33317), ('applicable', 31368), ('including', 29398), ('respect', 28848), ('may', 28069), ('stock', 26651), ('information', 25681), ('parties', 24609), ('b', 23935), ('business', 23497)]
20 Least Common Words: [('newer', 1), ('2257522579', 1), ('peoplefuncom', 1), ('nonmarketing', 1), ('httpwwwyouronlinechoiceseu', 1), ('httpwwwaboutadsinfochoices', 1), ('checkins', 1), ('httpsvunglecomprivacypolicy', 1), ('vungle', 1), ('httpswwwtapresearchcomuserprivacy', 1), ('tapresearch', 1), ('httpsdevtapjoycomfaqtapjoyprivacypolicy', 1), ('tapjoy', 1), ('httpswwwstartappcompolicyprivacypolicy', 1), ('startappcom', 1), ('httpaboutsoomlaenduserprivacypolicy', 1), ('soomla', 1), ('httpswwwsmaatocomprivacy', 1), ('smaato', 1), ('httpspinsightmediacomprivacy', 1)]


#### **1.3.3** <font color=red> [4 marks] </font>
Analyse the similarity of different documents to each other based on TF-IDF vectors.

Transform some documents to TF-IDF vectors and calculate their similarity matrix using a suitable distance function. If contracts contain duplicate or highly similar clauses, similarity calculation can help detect them.

Identify for the first 10 documents and then for 10 random documents. What do you observe?

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Function to transform documents into TF-IDF vectors
def transform_to_tfidf(documents):
    # Combine all document contents into a list
    all_docs = []
    for folder, docs in documents.items():
        for doc in docs:
            all_docs.append(doc.page_content)

    # Initialize the TF-IDF Vectorizer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_docs)

    return tfidf_matrix, vectorizer

tfidf_matrix, vectorizer = transform_to_tfidf(documents)

# Print the shape of the TF-IDF matrix
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")

TF-IDF Matrix Shape: (698, 53814)


In [12]:
# Compute similarity scores for 10 first documents
from sklearn.metrics.pairwise import cosine_similarity

all_docs = []  # Initialize an empty list to store all document contents
for folder, docs in documents.items():
    for doc in docs:
        all_docs.append(doc.page_content)    # Select the first 10 documents
first_10_docs = all_docs[:10]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(first_10_docs)

# Compute the cosine similarity matrix
similarity_matrix = cosine_similarity(tfidf_matrix)



# Print the similarity matrix for the first 10 documents
print("Similarity Matrix (First 10 Documents):")
print(similarity_matrix)

Similarity Matrix (First 10 Documents):
[[1.         0.8270636  0.92523176 0.89357084 0.84617424 0.90547066
  0.90187807 0.55784613 0.72899334 0.8547223 ]
 [0.8270636  1.         0.8191439  0.82327974 0.77032254 0.84548096
  0.83705434 0.5167663  0.68886273 0.74162957]
 [0.92523176 0.8191439  1.         0.90558092 0.84680564 0.90865286
  0.89467926 0.53854333 0.6959245  0.81215928]
 [0.89357084 0.82327974 0.90558092 1.         0.91842448 0.90333652
  0.90034752 0.55429589 0.72401837 0.83299374]
 [0.84617424 0.77032254 0.84680564 0.91842448 1.         0.85021242
  0.8613531  0.5438803  0.68932421 0.80423435]
 [0.90547066 0.84548096 0.90865286 0.90333652 0.85021242 1.
  0.90592347 0.56004787 0.72211617 0.83141618]
 [0.90187807 0.83705434 0.89467926 0.90034752 0.8613531  0.90592347
  1.         0.56730904 0.72792242 0.82849781]
 [0.55784613 0.5167663  0.53854333 0.55429589 0.5438803  0.56004787
  0.56730904 1.         0.48179914 0.54192128]
 [0.72899334 0.68886273 0.6959245  0.72401837 0.

In [13]:
import numpy as np

# Calculate the average value of the similarity matrix
average_similarity = np.mean(similarity_matrix)
print(f"Average Similarity: {average_similarity}")

Average Similarity: 0.7903879105094995


In [14]:
# Compute similarity scores for 10 first documents
import random

all_docs = []  # Initialize an empty list to store all document contents
for folder, docs in documents.items():
    for doc in docs:
        all_docs.append(doc.page_content)    # Select the first 10 documents
random_indices = random.sample(range(len(all_docs)), 10)
random_docs = [all_docs[i] for i in random_indices]
# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(random_docs)

# Compute the cosine similarity matrix
similarity_matrix = cosine_similarity(tfidf_matrix)



# Print the similarity matrix for the first 10 documents
print("Similarity Matrix (First 10 Documents):")
print(similarity_matrix)

Similarity Matrix (First 10 Documents):
[[1.         0.73716557 0.70906772 0.44288951 0.56858931 0.79008351
  0.6827307  0.60140366 0.51747787 0.64913103]
 [0.73716557 1.         0.87871694 0.53362492 0.67908797 0.91377913
  0.78240003 0.68252404 0.65375891 0.78244115]
 [0.70906772 0.87871694 1.         0.52942467 0.65492756 0.87967282
  0.76326232 0.68360024 0.64664033 0.76726722]
 [0.44288951 0.53362492 0.52942467 1.         0.41937679 0.53023243
  0.47811871 0.4380021  0.40745518 0.47944881]
 [0.56858931 0.67908797 0.65492756 0.41937679 1.         0.67194503
  0.61024109 0.56514784 0.50239364 0.61441499]
 [0.79008351 0.91377913 0.87967282 0.53023243 0.67194503 1.
  0.79983779 0.69811158 0.64295763 0.77024811]
 [0.6827307  0.78240003 0.76326232 0.47811871 0.61024109 0.79983779
  1.         0.64889433 0.55504269 0.70654594]
 [0.60140366 0.68252404 0.68360024 0.4380021  0.56514784 0.69811158
  0.64889433 1.         0.50997213 0.63170399]
 [0.51747787 0.65375891 0.64664033 0.40745518 0.

In [15]:

# Calculate the average value of the similarity matrix
average_similarity = np.mean(similarity_matrix)
print(f"Average Similarity: {average_similarity}")

Average Similarity: 0.6754492807353668


### Observation: It seems that documents that are far from each other has lower similarity score.
### The first 10 documents has averagely higher similarity score due to their nearer positions

### **1.4 Document Creation and Chunking** <font color=red> [5 marks] </font><br>

#### **1.4.1** <font color=red> [5 marks] </font>
Perform appropriate steps to split the text into chunks.

In [16]:
def generate_chunks(documents, chunk_size=200, chunk_overlap=50):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.schema import Document  # Import the Document class

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,  # Enables start_index in the output
        separators=["\n\n", "\n", " ", ""]
    )

    chunks = []
    for folder, docs in documents.items():
        for doc in docs:
            # Split the document into chunks (list of strings)
            doc_chunks = text_splitter.split_text(doc.page_content)
            for chunk in doc_chunks:
                # Create Document objects for each chunk and associate metadata
                chunks.append(Document(page_content=chunk, metadata=doc.metadata.copy()))
    return chunks

In [17]:
chunks = generate_chunks(documents)


In [18]:
print(len(chunks))  # Print the number of chunks generated

461091


## **2. Vector Database and RAG Chain Creation** <font color=red> [15 marks] </font><br>

### **2.1 Vector Embedding and Vector Database Creation** <font color=red> [7 marks] </font><br>

#### **2.1.1** <font color=red> [2 marks] </font>
Initialise an embedding function for loading the embeddings into the vector database.

Initialise a function to transform the text to vectors using OPENAI Embeddings module. You can also use this function to transform during vector DB creation itself.

In [19]:
from langchain.embeddings import OpenAIEmbeddings

# Replace with your actual OpenAI API key
openai_api_key = "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"

def get_openai_embeddings(api_key):
    """
    Initialize the OpenAI Embeddings module with the provided API key.

    Args:
        api_key (str): Your OpenAI API key.

    Returns:
        OpenAIEmbeddings: An instance of the OpenAIEmbeddings class.
    """
    try:
        embeddings = OpenAIEmbeddings(openai_api_key=api_key)
        print("OpenAI Embeddings initialized successfully.")
        return embeddings
    except Exception as e:
        print(f"Error initializing OpenAI Embeddings: {e}")
        return None

embeddings = get_openai_embeddings(openai_api_key)

if embeddings:
    print("Embeddings object created and ready to use.")

  embeddings = OpenAIEmbeddings(openai_api_key=api_key)


OpenAI Embeddings initialized successfully.
Embeddings object created and ready to use.


#### **2.1.2** <font color=red> [5 marks] </font>
Load the embeddings to a vector database.

Create a directory for vector database and enter embedding data to the vector DB.

In [20]:
from langchain.vectorstores import Chroma

def create_vector_database(chunks, embeddings, persist_directory="vector_db"):

    try:
        # Create the vector database
        vector_db = Chroma.from_documents(
            documents=chunks,
            embedding=embeddings,
            persist_directory=persist_directory
        )

        # Persist the vector database
        vector_db.persist()
        print(f"Vector database created and persisted at: {persist_directory}")
        return vector_db
    except Exception as e:
        print(f"Error creating vector database: {e}")
        return None

persist_directory = "vector_db"
vector_db = create_vector_database(chunks, embeddings, persist_directory)

Vector database created and persisted at: vector_db


  vector_db.persist()


### **2.2 Create RAG Chain** <font color=red> [8 marks] </font><br>

#### **2.2.1** <font color=red> [5 marks] </font>
Create a RAG chain.

In [22]:
# Create a RAG chain
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Assuming you have already initialized the embeddings and vector_db
# from the previous steps.

# 1. Initialize the Language Model
llm = ChatOpenAI(model_name="gpt-4o", temperature=0, openai_api_key=openai_api_key)

# 2. Create a Retriever from the Vector Database
retriever = vector_db.as_retriever()

# 3. Create the RAG Chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", # 'stuff' means all retrieved documents are stuffed into the prompt
    retriever=retriever,
    return_source_documents=True

print("RAG chain created successfully.")

RAG chain created successfully.


#### **2.2.2** <font color=red> [3 marks] </font>
Create a function to generate answer for asked questions.

Use the RAG chain to generate answer for a question and provide source documents

In [23]:
# Create a function for question answering

def answer_question(question, rag_chain):
    """
    Uses the RAG chain to generate an answer for a given question
    and provides source documents if available.

    Args:
        question (str): The question to be answered.
        rag_chain: The initialized RAG chain object.

    Returns:
        dict: A dictionary containing the answer and source documents (if available).
              Returns None if rag_chain is not initialized.
    """
    if rag_chain is None:
        print("RAG chain is not initialized.")
        return None

    try:
        # Run the RAG chain with the question
        response = rag_chain({"query": question})

        # Extract the answer and source documents
        answer = response.get("result", "Could not find an answer.")
        source_documents = response.get("source_documents", [])

        return {
            "answer": answer,
            "source_documents": source_documents
        }
    except Exception as e:
        print(f"Error generating answer: {e}")
        return None



In [24]:
question = "Consider the Non-Disclosure Agreement between CopAcc and ToP Mentors; Does the document indicate that the Agreement does not grant the Receiving Party any rights to the Confidential Information?"
result = answer_question(question, rag_chain)

  response = rag_chain({"query": question})


In [27]:
if result is not None:
    print("Answer:", result["answer"])
    print("\nSource Documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Document {i+1}:")
        print(f"  Content: {doc.page_content[:200]}...") # Print first 200 characters
        print(f"  Metadata: {doc.metadata}")
        print("-" * 20)


Answer: Based on the context provided, the document does not explicitly state whether the Agreement grants or does not grant the Receiving Party any rights to the Confidential Information. If this is a critical aspect of the agreement, it would be advisable to review the specific terms and clauses of the Non-Disclosure Agreement between CopAcc and ToP Mentors to determine if such a provision is included.

Source Documents:
Document 1:
  Content: nondisclosure agreement keep in confidence and prevent the disclosure of proprietary information to any third party other than those of receiving partys i employees agents representatives directors...
  Metadata: {'source': '/content/drive/MyDrive/Colab Notebooks/RAG Upgrad/rag_legal/corpus/contractnli/lti-two-way-cda-template.txt'}
--------------------
Document 2:
  Content: prior nondisclosure secrecy or confidentiality agreement between the parties or their affiliates dealing with the subject of this agreement including the confidentiality a

## **3. RAG Evaluation** <font color=red> [10 marks] </font><br>

### **3.1 Evaluation and Inference** <font color=red> [10 marks] </font><br>

#### **3.1.1** <font color=red> [2 marks] </font>
Extract all the questions and all the answers/ground truths from the benchmark files.

Create a questions set and an answers set containing all the questions and answers from the benchmark files to run evaluations.

In [29]:
# Create a question set by taking all the questions from the benchmark data
# Also create a ground truth/answer set
import json
import os

# Define the base path for the benchmark data
benchmark_base_path = "/content/drive/MyDrive/Colab Notebooks/RAG Upgrad/rag_legal/benchmarks"

# List of benchmark JSON files
benchmark_files = ["contractnli.json", "cuad.json", "maud.json", "privacy_qa.json"]

# Set to store all unique questions
question_set = set()

# List to store all ground truth answers, organized by question
ground_truth_answers = {} # Using a dictionary to map questions to their snippets

# Iterate through each benchmark file
for file_name in benchmark_files:
    file_path = os.path.join(benchmark_base_path, file_name)

    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            benchmark_data = json.load(f)

        # Iterate through each test in the benchmark data
        for test in benchmark_data.get("tests", []):
            query = test.get("query")
            snippets = test.get("snippets", [])

            if query:
                # Add the question to the set
                question_set.add(query)

                # Store the snippets (ground truth answers) for this question
                # We'll store the whole snippet list for each question
                ground_truth_answers[query] = snippets

    except FileNotFoundError:
        print(f"Benchmark file not found: {file_path}")
    except json.JSONDecodeError:
        print(f"Error decoding JSON from file: {file_path}")
    except Exception as e:
        print(f"An error occurred while processing {file_path}: {e}")

# Convert the question set to a list if you prefer
question_list = list(question_set)

print(f"Extracted {len(question_list)} unique questions.")


Extracted 6856 unique questions.


#### **3.1.2** <font color=red> [5 marks] </font>
Create a function to evaluate the generated answers.

In [None]:
import numpy as np
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from nltk.tokenize import word_tokenize # Ensure word_tokenize is imported
import os # Import os module
import pandas as pd # Import pandas

# Download necessary NLTK data for BLEU if not already present
try:
    nltk.data.find('tokenizers/punkt')
except nltk.downloader.DownloadError:
    nltk.download('punkt')

# Assume file_content_map is globally available and correctly populated earlier in the notebook
# Example: Accessing the global file_content_map

def evaluate_rag_performance(questions, ground_truth_answers, rag_chain, llm, embeddings, file_content_map):

    if not rag_chain:
        print("RAG chain is not initialized. Cannot perform evaluation.")
        return None

    if not llm:
        print("LLM is not initialized. Cannot perform Ragas evaluation.")
        ragas_results = "LLM not initialized, Ragas metrics skipped."
    elif not embeddings:
         print("Embeddings are not initialized. Cannot perform Ragas evaluation.")
         ragas_results = "Embeddings not initialized, Ragas metrics skipped."
    else:
        # --- Prepare data for Ragas ---
        data_for_ragas = {
            "question": [],
            "answer": [],
            "contexts": [],
            "ground_truths": [],
            "reference": [], # Modified for Ragas context_recall
        }

        print("\nGathering data for Ragas evaluation...")
        base_corpus_path = "/content/drive/MyDrive/Colab Notebooks/RAG Upgrad/rag_legal/corpus" # Define the base path for corpus

        # Iterate through the questions to evaluate
        for q in questions:
            # Get RAG answer and contexts for the current question
            rag_response = answer_question(q, rag_chain)
            generated_answer = rag_response.get("answer", "") if rag_response else ""
            # Ensure source_documents is a list before accessing page_content
            retrieved_contexts = [doc.page_content for doc in rag_response.get("source_documents", []) if hasattr(doc, 'page_content')]

            # Get ground truth snippets for the current question
            gt_snippets = ground_truth_answers.get(q, [])
            ground_truths_list = []
            references_text_list = [] # List to store reference texts for this question before joining

            for snippet in gt_snippets:
                answer = snippet.get("answer", "")
                file_path_suffix = snippet.get("file_path", "")
                span = snippet.get("span")

                if answer:
                    ground_truths_list.append(answer)

                # Extract reference text from original file content using file_path and span
                if file_path_suffix and span and len(span) == 2:
                    # Construct the full file path
                    full_file_path = os.path.join(base_corpus_path, file_path_suffix)
                    start, end = span

                    # Get the original content using the file_content_map
                    original_content = file_content_map.get(full_file_path, "")

                    if original_content and 0 <= start < end <= len(original_content):
                        reference_text = original_content[start:end]
                         # Ensure the extracted text is a string and not empty before adding
                        if isinstance(reference_text, str) and reference_text.strip():
                            references_text_list.append(reference_text)


            # Append data for this question to the lists

            if ground_truths_list and references_text_list: # Proceed only if we have both ground truths and references
                 data_for_ragas["question"].append(q)
                 data_for_ragas["answer"].append(generated_answer)
                 data_for_ragas["contexts"].append(retrieved_contexts)
                 data_for_ragas["ground_truths"].append(ground_truths_list)
                 # Join the list of reference texts into a single string
                 # This is the change implemented in the previous step to address the 'valid string' error
                 data_for_ragas["reference"].append(" ".join(references_text_list)) # Join with a space or other separator


        # Convert to Ragas Dataset
        # Check if there's any data to create a dataset
        if data_for_ragas["question"]:

            ragas_dataset = Dataset.from_dict(data_for_ragas)

            # Define Ragas metrics
            metrics = [
                faithfulness,
                answer_relevancy,
                context_recall,
                context_precision,
            ]

            # Perform Ragas evaluation
            try:
                print("\nStarting Ragas evaluation...")
                ragas_results = evaluate(
                    ragas_dataset,
                    metrics=metrics,
                    llm=llm,
                    embeddings=embeddings # Assuming embeddings is initialized globally or passed
                )
                print("Ragas evaluation finished.")
                # Convert to pandas DataFrame
                ragas_results = ragas_results.to_pandas()
                print(ragas_results)
            except Exception as e:
                print(f"Error during Ragas evaluation: {e}")
                # If Ragas evaluation fails, store the error message
                ragas_results = f"Error during Ragas evaluation: {e}"
        else:
            # If no data collected for Ragas, store a message
            ragas_results = "No data collected for Ragas evaluation (ensure questions have ground truths and references)."
            print(ragas_results)


    # --- Prepare data for ROUGE and BLEU ---


    all_rouge_scores_rouge1 = []
    all_rouge_scores_rougeL = []
    all_bleu_scores = []

    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

    print("\nCalculating ROUGE and BLEU scores...")
    for q in questions:
        rag_response = answer_question(q, rag_chain)
        generated_answer = rag_response["answer"] if rag_response else ""

        # Get ground truth answers for the current question
        gt_snippets = ground_truth_answers.get(q, [])
        gt_answers_for_q = [snippet.get("answer", "") for snippet in gt_snippets if snippet.get("answer")]

        if generated_answer and gt_answers_for_q: # Only evaluate if we have both generated and ground truth answers
            # ROUGE calculation
            # Compare generated answer against all ground truths for this question
            rouge_results_for_q = [scorer.score(gt_answer, generated_answer) for gt_answer in gt_answers_for_q]
            # Collect fmeasure scores for averaging later
            all_rouge_scores_rouge1.extend([r['rouge1'].fmeasure for r in rouge_results_for_q])
            all_rouge_scores_rougeL.extend([r['rougeL'].fmeasure for r in rouge_results_for_q])


            # BLEU calculation
            # BLEU expects a list of reference sentences (ground truths) and a candidate sentence (generated answer)
            # Tokenize sentences for BLEU
            reference_tokenized = [word_tokenize(gt_ans) for gt_ans in gt_answers_for_q]
            candidate_tokenized = word_tokenize(generated_answer)

            # Calculate BLEU score - using smoothing function if needed for short texts
            try:
                # Calculate BLEU score for this question (against all references)
                bleu_score = sentence_bleu(reference_tokenized, candidate_tokenized, smoothing_function=nltk.translate.bleu_score.SmoothingFunction().method1)
                all_bleu_scores.append(bleu_score)
            except Exception as e:
                 print(f"Error calculating BLEU for question: {q} - {e}")
                 all_bleu_scores.append(0) # Append 0 if BLEU calculation fails


    # Aggregate ROUGE and BLEU scores across all questions and ground truths
    # Calculate mean only if the list is not empty
    avg_rouge_fmeasure_rouge1 = np.mean(all_rouge_scores_rouge1) if all_rouge_scores_rouge1 else 0
    avg_rouge_fmeasure_rougeL = np.mean(all_rouge_scores_rougeL) if all_rouge_scores_rougeL else 0
    avg_bleu_score = np.mean(all_bleu_scores) if all_bleu_scores else 0
    print("ROUGE and BLEU calculation finished.")


    # Return all results
    evaluation_results = {
        "ragas_scores": ragas_results,
        "avg_rouge_fmeasure_rouge1": avg_rouge_fmeasure_rouge1,
        "avg_rouge_fmeasure_rougeL": avg_rouge_fmeasure_rougeL,
        "avg_bleu_score": avg_bleu_score,
    }

    return evaluation_results

Evaluate the responses on *Rouge*, *Ragas* and *Bleu* scores.

In [None]:



# Select a subset of questions for faster evaluation if needed
questions_to_evaluate = question_list[:3] # Evaluate the first 3 questions

# Pass file_content_map to the function
evaluation_results = evaluate_rag_performance(questions_to_evaluate, ground_truth_answers, rag_chain, llm, embeddings, file_content_map)

if evaluation_results:
    print("\n--- Evaluation Results ---")
    if isinstance(evaluation_results["ragas_scores"], str):
        # If ragas_scores is a string (error message), print it
        print(evaluation_results["ragas_scores"])
    elif isinstance(evaluation_results["ragas_scores"], pd.DataFrame):
        # If ragas_scores is a DataFrame, calculate and print the mean of numeric columns
        print("\nRagas Scores:")
        # Select only numeric columns before calculating the mean
        numeric_ragas_scores = evaluation_results["ragas_scores"].select_dtypes(include=np.number)
        print(numeric_ragas_scores.mean())
    else:
        # Handle other unexpected types if necessary
        print(f"Unexpected type for ragas_scores: {type(evaluation_results['ragas_scores'])}")


    print(f"\nAverage ROUGE-1 F1 Score: {evaluation_results['avg_rouge_fmeasure_rouge1']:.4f}")
    print(f"Average ROUGE-L F1 Score: {evaluation_results['avg_rouge_fmeasure_rougeL']:.4f}")
    print(f"Average BLEU Score: {evaluation_results['avg_bleu_score']:.4f}")


Gathering data for Ragas evaluation...

Starting Ragas evaluation...


Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

Ragas evaluation finished.
                                          user_input  \
0  Consider the Non-Disclosure Agreement between ...   
1  Consider the Distributor Agreement between Com...   
2  Consider the Consulting Agreement between Guns...   

                                  retrieved_contexts  \
0  [to each other for purposes of this agreement ...   
1  [distributor agreement 1 certification and ide...   
2  [consulting llc consultant a virginia limited ...   

                                            response  \
0  No, the document does not state that Confident...   
1  The warranty period provided in the Distributo...   
2  The governing law for the Consulting Agreement...   

                                           reference  faithfulness  \
0  Confidential Information includes, without lim...           1.0   
1  ITS will provide free technical support to cus...           0.0   
2  This Agreement shall be interpreted, construed...           0.0   

   answer_relevan

#### **3.1.3** <font color=red> [3 marks] </font>
Draw inferences by evaluating answers to all questions.

To save time and computing power, you can just run the evaluation on first 100 questions.

In [70]:
# Run evaluation on the first 100 questions

num_questions_to_evaluate = 100

# Select the first N questions from the full list
questions_to_evaluate = question_list[:num_questions_to_evaluate]

print(f"Starting evaluation for the first {len(questions_to_evaluate)} questions...")

# Call the evaluate_rag_performance function with the list of questions
# This function is designed to handle multiple questions and calculate aggregate scores
evaluation_results = evaluate_rag_performance(
    questions_to_evaluate,       # Pass the list of questions
    ground_truth_answers,      # Pass the dictionary of ground truth answers
    rag_chain,                 # Pass the rag_chain object
    llm,                       # Pass the llm object
    embeddings,                # Pass the embeddings object
    file_content_map           # Pass the file_content_map
)

# Now `evaluation_results` contains the aggregate scores (Ragas, ROUGE, BLEU)
# for the batch of questions.

print("\n--- Aggregate Evaluation Results ---")
if evaluation_results:
    if isinstance(evaluation_results["ragas_scores"], str):
        # If ragas_scores is a string (error message), print it
        print(evaluation_results["ragas_scores"])
    elif isinstance(evaluation_results["ragas_scores"], pd.DataFrame):
        # If ragas_scores is a DataFrame, calculate and print the mean of numeric columns
        print("\nRagas Scores (Average):")
        # Select only numeric columns before calculating the mean
        numeric_ragas_scores = evaluation_results["ragas_scores"].select_dtypes(include=np.number)
        print(numeric_ragas_scores.mean())
        # Optional: Print the full Ragas DataFrame
        # print("\nFull Ragas Results DataFrame:")
        # print(evaluation_results["ragas_scores"])
    else:
        # Handle other unexpected types if necessary
        print(f"Unexpected type for ragas_scores: {type(evaluation_results['ragas_scores'])}")


    print(f"\nAverage ROUGE-1 F1 Score: {evaluation_results['avg_rouge_fmeasure_rouge1']:.4f}")
    print(f"Average ROUGE-L F1 Score: {evaluation_results['avg_rouge_fmeasure_rougeL']:.4f}")
    print(f"Average BLEU Score: {evaluation_results['avg_bleu_score']:.4f}")
else:
    print("Evaluation did not produce results.")


Starting evaluation for the first 100 questions...

Gathering data for Ragas evaluation...

Starting Ragas evaluation...


Evaluating:   0%|          | 0/376 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[15]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-S46hVSjhX3sfimLNnR31ofXI on tokens per min (TPM): Limit 30000, Used 29807, Requested 1167. Please try again in 1.947s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[34]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-S46hVSjhX3sfimLNnR31ofXI on tokens per min (TPM): Limit 30000, Used 29359, Requested 1256. Please try again in 1.23s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[23]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[26]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[27

KeyboardInterrupt: 

## **4. Conclusion** <font color=red> [5 marks] </font><br>

### **4.1 Conclusions and insights** <font color=red> [5 marks] </font><br>

#### **4.1.1** <font color=red> [5 marks] </font>
Conclude with the results here. Include the insights gained about the data, model pipeline, the RAG process and the results obtained.

Based on the steps taken and the results observed, here are some insights regarding the data, the model pipeline, the RAG process, and the evaluation:

### **4.1.1 Data Understanding and Preparation**

*   **Data Source and Structure:** The dataset consists of legal documents in `.txt` format, organized into folders by type (`contractnli`, `cuad`, `maud`, `privacy_qa`). The benchmark data provides structured questions and ground truth answers with file paths and spans, which is crucial for detailed evaluation, especially for metrics like Ragas's `context_recall`.
*   **Loading Challenges:** While `TextLoader` simplifies loading, handling various file encodings or potential parsing errors requires robust error handling (as implemented with the `try-except` block).
*   **Preprocessing Importance:** The preprocessing steps (removing emails, phone numbers, special characters, lowercasing) are essential for cleaning noise and standardizing the text. This helps in subsequent steps like TF-IDF calculation and embedding generation by focusing on meaningful content.
*   **`file_content_map`:** Creating and maintaining a map from file paths to *original* content is critical for evaluation metrics like Ragas's `context_recall`, which requires comparing the retrieved context chunks back to the exact spans in the original ground truth documents. Preprocessing text *before* creating this map would hinder accurate span extraction from the original files. (Note: The current code preprocesses *before* storing in `file_content_map`, which is an area for potential refinement if exact span matching on original text is needed for certain metrics).

### **4.1.2 Exploratory Data Analysis (EDA)**

*   **Document Length:** Analyzing document length provides a basic understanding of the scale of the data and informs chunking strategies. Legal documents can be quite long, necessitating effective splitting.
*   **Word Frequency:** Identifying common and rare words (after removing stop words) gives insight into the dominant themes and domain-specific language within the legal corpus. The most common words likely relate to contract terms, legal entities, etc.
*   **TF-IDF Similarity:** TF-IDF helps capture the importance of words in documents relative to the corpus. Calculating the similarity matrix (e.g., using cosine similarity) reveals how similar documents are in terms of their vocabulary and key terms.
    *   Observing higher average similarity among the first 10 documents compared to a random set might suggest some level of inherent ordering or grouping in the initial dataset load, or simply random chance. In a real-world scenario, analysing similarity across different document types (folders) could reveal more structural insights. Detecting highly similar documents or clauses could be useful for de-duplication or identifying standard boilerplate language.

### **4.1.3 Document Creation and Chunking**

*   **Necessity of Chunking:** Legal documents are often too large to fit into the context window of modern language models. Chunking breaks them down into manageable pieces while retaining necessary information and some surrounding context (using `chunk_overlap`).
*   **`RecursiveCharacterTextSplitter`:** This splitter is effective for handling various separators (newlines, spaces) to create logical text chunks. The choice of `chunk_size` and `chunk_overlap` is a hyperparameter that significantly impacts RAG performance – too small, and context is lost; too large, and irrelevant information is included or context windows are exceeded. The chosen values (200/50) are relatively small and might need tuning for a legal domain, which often requires more context.
*   **Maintaining Metadata:** Ensuring metadata (like the original file name/path) is carried over to each chunk is crucial for traceability and for linking retrieved chunks back to their source documents during evaluation and for providing sources in the final answer.

### **4.1.4 Vector Database and RAG Chain Creation**

*   **Embeddings:** Embeddings transform text chunks into numerical vectors, allowing semantic similarity search. `OpenAIEmbeddings` provides a powerful pre-trained model, but requires an API key.
*   **Vector Database (Chroma):** Chroma serves as an efficient store for vector embeddings and their associated metadata. It enables fast retrieval of relevant chunks based on the query's embedding similarity. Persisting the database (`persist_directory`) saves time by avoiding re-embedding the corpus every time the application runs.
*   **RAG Chain (`RetrievalQA`):** The RAG chain connects the retriever (vector DB) with the language model (LLM). The retriever fetches relevant chunks based on the query, and the LLM then synthesizes an answer using the query and the retrieved context. The 'stuff' chain type is simple but can be problematic if the total context length (query + retrieved chunks) exceeds the LLM's context window. Other chain types (`map_reduce`, `refine`) could be explored for very large contexts.
*   **LLM Choice:** The choice of `gpt-4o` is a powerful one for generating coherent and relevant answers, given sufficient context. Setting `temperature=0` promotes more deterministic and factual responses, which is often desirable in legal contexts.

### **4.1.5 RAG Evaluation**

*   **Importance of Evaluation:** Evaluating the RAG system is vital to understand its effectiveness beyond anecdotal examples. Metrics quantify different aspects of performance.
*   **Benchmark Data Usage:** Leveraging the structured benchmark data to extract questions and ground truths is the foundation for automated evaluation.
*   **Metric Interpretation:**
    *   **Ragas:** Provides nuanced evaluation by assessing attributes like whether the generated answer is grounded in the retrieved context (`faithfulness`), whether the answer is relevant to the question (`answer_relevancy`), and how well the retrieved context covers the information needed to answer the question (`context_recall`) and is free of irrelevant information (`context_precision`).
    *   **ROUGE & BLEU:** These are traditional text generation metrics comparing the output directly to reference answers based on word overlap. While useful, they don't specifically evaluate the *retrieval* aspect of RAG or whether the answer is *supported* by the retrieved context.
*   **Evaluation Setup Challenges:** Correctly preparing the data format for evaluation libraries like Ragas (especially linking ground truths to their original reference text spans for context recall) requires careful data manipulation. The error encountered highlights the importance of matching function signatures and data structures.

### **4.1.6 Overall Conclusion**

Building a RAG system involves several interconnected steps, each impacting the final performance. Data loading, preprocessing, chunking, embedding choice, vector database configuration, and the RAG chain structure all play a role. Evaluation using a combination of metrics like Ragas, ROUGE, and BLEU provides a multi-faceted view of the system's strengths and weaknesses. The initial evaluation provides a baseline, and further optimization of chunking, retrieval parameters (e.g., number of documents to retrieve), and LLM prompting techniques would likely be necessary to improve the RAG system's ability to accurately answer questions based on the legal corpus. The ability to link ground truth answers to original document spans is a powerful feature for pinpointing exactly where relevant information *should* have been found by the retriever for accurate `context_recall` measurement.
