# **Extracting Information from Legal Documents Using RAG**

## **Objective**

The main objective of this assignment is to process and analyse a collection of text files containing legal agreements (e.g., NDAs) to prepare them for implementing a **Retrieval-Augmented Generation (RAG)** system. This involves the following steps:

* Understand the Cleaned Data: Gain a comprehensive understanding of the structure, content, and context of the cleaned dataset.
* Perform Exploratory Analysis: Conduct bivariate and multivariate analyses to uncover relationships and trends within the cleaned data.
* Create Visualisations: Develop meaningful visualisations to support the analysis and make findings interpretable.
* Derive Insights and Conclusions: Extract valuable insights from the cleaned data and provide clear, actionable conclusions.
* Document the Process: Provide a detailed description of the data, its attributes, and the steps taken during the analysis for reproducibility and clarity.

The ultimate goal is to transform the raw text data into a clean, structured, and analysable format that can be effectively used to build and train a RAG system for tasks like information retrieval, question-answering, and knowledge extraction related to legal agreements.

### **Business Value**  


The project aims to leverage RAG to enhance legal document processing for businesses, law firms, and regulatory bodies. The key business objectives include:

* Faster Legal Research: <br> Reduce the time lawyers and compliance officers spend searching for relevant case laws, precedents, statutes, or contract clauses.
* Improved Contract Analysis: <br> Automatically extract key terms, obligations, and risks from lengthy contracts.
* Regulatory Compliance Monitoring: <br> Help businesses stay updated with legal and regulatory changes by retrieving relevant legal updates.
* Enhanced Decision-Making: <br> Provide accurate and context-aware legal insights to assist in risk assessment and legal strategy.


**Use Cases**
* Legal Chatbots
* Contract Review Automation
* Tracking Regulatory Changes and Compliance Monitoring
* Case Law Analysis of past judgments
* Due Diligence & Risk Assessment

## **1. Data Loading, Preparation and Analysis** <font color=red> [20 marks] </font><br>

### **1.1 Data Understanding**

The dataset contains legal documents and contracts collected from various sources. The documents are present as text files (`.txt`) in the *corpus* folder.

There are four types of documents in the *corpus* folder, divided into four subfolders.
- `contractnli`: contains various non-disclosure and confidentiality agreements
- `cuad`: contains contracts with annotated legal clauses
- `maud`: contains various merger/acquisition contracts and agreements
- `privacy_qa`: a question-answering dataset containing privacy policies

The dataset also contains evaluation files in JSON format in the *benchmark* folder. The files contain the questions and their answers, along with sources. For each of the above four folders, there is a `json` file: `contractnli.json`, `cuad.json`, `maud.json` and `privacy_qa.json`. The file structure is as follows:

```
{
    "tests": [
        {
            "query": <question1>,
            "snippets": [{
                    "file_path": <source_file1>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 1>
                },
                {
                    "file_path": <source_file2>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 1>
                }, ....
            ]
        },
        {
            "query": <question2>,
            "snippets": [{<answer context for que 2>}]
        },
        ... <more queries>
    ]
}
```
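As a quick sanity check on this structure, the sketch below loads a benchmark file and tests whether a snippet's `span` reproduces its `answer`. It assumes, which the description above does not state, that `span` holds character offsets into the raw (unprocessed) file and that `file_path` is relative to the corpus folder. The function names and the `corpus_dir` argument are illustrative.

```python
import json
import os


def load_tests(benchmark_path):
    """Return the list stored under the top-level "tests" key."""
    with open(benchmark_path, encoding="utf-8") as f:
        return json.load(f)["tests"]


def snippet_matches_source(snippet, corpus_dir):
    """Check whether the span of a snippet reproduces its answer text.

    Assumption: "span" is a [begin, end) pair of character offsets into
    the raw file, and "file_path" is relative to corpus_dir.
    """
    path = os.path.join(corpus_dir, snippet["file_path"])
    with open(path, encoding="utf-8") as f:
        text = f.read()
    begin, end = snippet["span"]
    return text[begin:end] == snippet["answer"]
```

If the check fails for most snippets, the offset assumption is probably wrong for this dataset. Note that the offsets would in any case stop matching once the text has been cleaned or lowercased.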

### **1.2 Load and Preprocess the data** <font color=red> [5 marks] </font><br>

#### Loading libraries

In [1]:
## The following libraries might be useful
!pip install -q langchain-openai
!pip install -U -q langchain-community
!pip install -U -q langchain-chroma
!pip install -U -q datasets
!pip install -U -q ragas
!pip install -U -q rouge_score

In [5]:
# Import essential libraries



#### **1.2.1** <font color=red> [3 marks] </font>
Load all `.txt` files from the folders.

You can utilise document loaders from the options provided by the LangChain community.

Optionally, you can also read the files manually, while ensuring proper handling of encoding issues (e.g., utf-8, latin1). In such case, also store the file content along with metadata (e.g., file name, directory path) for traceability.
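A minimal sketch of the manual route, using only the standard library. The function name `read_text_file`, the fallback order and the metadata keys are assumptions chosen for illustration, not part of the assignment.

```python
import os


def read_text_file(path, encodings=("utf-8", "latin-1")):
    """Read a text file, trying each encoding in turn.

    Returns a dict with the content plus metadata for traceability,
    or None if the file cannot be read at all.
    """
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as f:
                content = f.read()
        except UnicodeDecodeError:
            continue  # try the next encoding in the list
        except OSError as exc:
            print(f"Skipping {path}: {exc}")
            return None
        return {
            "page_content": content,
            "metadata": {
                "file_name": os.path.basename(path),
                "directory": os.path.dirname(path),
                "encoding": encoding,
            },
        }
    # Only reachable when the list has no catch-all encoding such as latin-1
    print(f"Skipping {path}: none of {encodings} could decode it")
    return None
```

Because `latin-1` can decode any byte sequence, it works as a last resort, but it may produce wrong characters for files that use a different encoding. Recording the encoding in the metadata makes such cases easier to trace.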

In [23]:
from langchain_community.document_loaders import TextLoader
import os

# Define the base path for the corpus
base_path = "/Users/tien-nguyen/Downloads/RAG Upgrad/Starter and Dataset RAG Legal/rag_legal/corpus"
# Dictionary to store the loaded documents
documents = {}

# Iterate through the subfolders
for folder in ["contractnli", "cuad", "maud", "privacy_qa"]:
    folder_path = os.path.join(base_path, folder)
    documents[folder] = []
    
    # Load all .txt files in the folder using TextLoader
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        if file_name.endswith(".txt"):
            try:
                loader = TextLoader(file_path, encoding="utf-8")
                docs = loader.load()
                for doc in docs:
                    documents[folder].append(doc)
            except Exception as e:
                print(f"Error loading file {file_path}: {e}")

# Example: Print the number of documents loaded for each folder
for folder, docs in documents.items():
    print(f"Loaded {len(docs)} documents from {folder}")

Loaded 95 documents from contractnli
Loaded 462 documents from cuad
Loaded 134 documents from maud
Loaded 7 documents from privacy_qa


In [18]:
print(type(docs[0]))  # Check the type of the first element in docs

<class 'langchain_core.documents.base.Document'>


In [25]:
docs[0].metadata  # Access the metadata of the first document

{'source': '/Users/tien-nguyen/Downloads/RAG Upgrad/Starter and Dataset RAG Legal/rag_legal/corpus/privacy_qa/Groupon.txt'}

#### **1.2.2** <font color=red> [2 marks] </font>
Preprocess the text data to remove noise and prepare it for analysis.

Remove special characters, extra whitespace, and irrelevant content such as email and telephone contact info.
Normalise text (e.g., convert to lowercase, remove stop words).
Handle missing or corrupted data by logging errors and skipping problematic files.

In [7]:
# Clean and preprocess the data


In [27]:
import re

def preprocess_text(text):
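    """
    Preprocess the text by removing email addresses, long digit runs
    (unformatted phone numbers), special characters and punctuation,
    then lowercasing and collapsing extra whitespace.
    Stop words are not removed here; they are filtered later,
    during the word-frequency analysis.
    """
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove phone numbers (only unformatted runs of 10 or more digits;
    # numbers written with spaces, dashes or brackets are not matched)
    text = re.sub(r'\b\d{10,}\b', '', text)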
    
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Example usage
for folder, docs in documents.items():
    for doc in docs:
        doc.page_content = preprocess_text(doc.page_content)

print("Preprocessing complete.")

Preprocessing complete.


### **1.3 Exploratory Data Analysis** <font color=red> [10 marks] </font><br>

#### **1.3.1** <font color=red> [1 mark] </font>
Calculate the average, maximum and minimum document length.

In [28]:
# Calculate the average, maximum, and minimum document length
def calculate_document_lengths(documents):
    lengths = []
    for folder, docs in documents.items():
        for doc in docs:
            lengths.append(len(doc.page_content))  # Length in characters; page_content is an attribute of Document

    if lengths:
        avg_length = sum(lengths) / len(lengths)
        max_length = max(lengths)
        min_length = min(lengths)
        return avg_length, max_length, min_length
    else:
        return 0, 0, 0

# Example usage
avg_length, max_length, min_length = calculate_document_lengths(documents)
print(f"Average Document Length: {avg_length}")
print(f"Maximum Document Length: {max_length}")
print(f"Minimum Document Length: {min_length}")

Average Document Length: 99063.25787965616
Maximum Document Length: 957212
Minimum Document Length: 1329


#### **1.3.2** <font color=red> [4 marks] </font>
Analyse the frequency of occurrence of words and find the most and least occurring words.

Find the 20 most common and least common words in the text. Ignore stop words such as articles and prepositions.

In [29]:
# Find frequency of occurrence of words


from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download NLTK data (if not already downloaded)
nltk.download('punkt')
nltk.download('punkt_tab')  # Ensure punkt_tab is downloaded
nltk.download('stopwords')

# Define a function to find the most and least common words
def find_common_words(documents, num_words=20):
    stop_words = set(stopwords.words('english'))
    word_counts = Counter()

    # Tokenize and count words in all documents
    for folder, docs in documents.items():
        for doc in docs:
            words = word_tokenize(doc.page_content)
            filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
            word_counts.update(filtered_words)

    # Get the most and least common words
    most_common = word_counts.most_common(num_words)
    least_common = word_counts.most_common()[:-num_words-1:-1]

    return most_common, least_common

# Example usage
most_common, least_common = find_common_words(documents)
print("20 Most Common Words:", most_common)
print("20 Least Common Words:", least_common)

[nltk_data] Downloading package punkt to /Users/tien-
[nltk_data]     nguyen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/tien-
[nltk_data]     nguyen/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/tien-
[nltk_data]     nguyen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


20 Most Common Words: [('company', 148167), ('shall', 107995), ('agreement', 104559), ('section', 75344), ('parent', 58009), ('party', 49656), ('date', 39281), ('time', 35251), ('material', 34208), ('merger', 33843), ('subsidiaries', 33317), ('applicable', 31368), ('including', 29398), ('respect', 28848), ('may', 28069), ('stock', 26651), ('information', 25681), ('parties', 24609), ('b', 23935), ('business', 23497)]
20 Least Common Words: [('enrich', 1), ('simplify', 1), ('devicekeep', 1), ('appwe', 1), ('cached', 1), ('usernames', 1), ('wwwaboutcookiesorg', 1), ('safari', 1), ('personalise', 1), ('effortless', 1), ('signin', 1), ('customise', 1), ('gigs', 1), ('gig', 1), ('systemoperating', 1), ('deviceoperating', 1), ('countrycontinent', 1), ('weblog', 1), ('eat', 1), ('policypdf', 1)]
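Several of the least common tokens above (`devicekeep`, `wwwaboutcookiesorg`, `systemoperating`) look like separate words fused together. One plausible cause, which is an assumption since the raw files are not shown here, is that punctuation was deleted instead of being replaced by a space. A minimal sketch of a cleaning variant that keeps word boundaries (the name `preprocess_text_spaced` is illustrative):

```python
import re

EMAIL_RE = re.compile(r"\S+@\S+")
NON_WORD_RE = re.compile(r"[^\w\s]")
WHITESPACE_RE = re.compile(r"\s+")


def preprocess_text_spaced(text):
    """Clean text while keeping word boundaries around punctuation."""
    text = EMAIL_RE.sub(" ", text)
    # Replace punctuation with a space instead of deleting it, so that
    # "device.Keep" becomes "device keep" and not "devicekeep"
    text = NON_WORD_RE.sub(" ", text)
    text = WHITESPACE_RE.sub(" ", text).strip()
    return text.lower()


print(preprocess_text_spaced("device.Keep your app/we data"))
```

The trade-off is that hyphenated or possessive forms are split as well (`non-disclosure` becomes two tokens), so which variant is better depends on the downstream analysis.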


#### **1.3.3** <font color=red> [4 marks] </font>
Analyse the similarity of different documents to each other based on TF-IDF vectors.

Transform some documents to TF-IDF vectors and calculate their similarity matrix using a suitable distance function. If contracts contain duplicate or highly similar clauses, similarity calculation can help detect them.

Identify for the first 10 documents and then for 10 random documents. What do you observe?

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Function to transform documents into TF-IDF vectors
def transform_to_tfidf(documents):
    # Combine all document contents into a list
    all_docs = []
    for folder, docs in documents.items():
        for doc in docs:
            all_docs.append(doc.page_content)

    # Initialize the TF-IDF Vectorizer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_docs)

    return tfidf_matrix, vectorizer

# Example usage
tfidf_matrix, vectorizer = transform_to_tfidf(documents)

# Print the shape of the TF-IDF matrix
print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")

TF-IDF Matrix Shape: (698, 53814)


In [None]:
# Compute similarity scores for the first 10 documents
from sklearn.metrics.pairwise import cosine_similarity

# Function to compute similarity scores for the first 10 documents
def compute_similarity_for_first_10(documents):
    # Combine all document contents into a list
    all_docs = []
    for folder, docs in documents.items():
        for doc in docs:
            all_docs.append(doc.page_content)
    
    # Select the first 10 documents
    first_10_docs = all_docs[:10]
    
    # Initialize the TF-IDF Vectorizer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(first_10_docs)
    
    # Compute the cosine similarity matrix
    similarity_matrix = cosine_similarity(tfidf_matrix)
    
    return similarity_matrix

# Example usage
similarity_matrix = compute_similarity_for_first_10(documents)

# Print the similarity matrix for the first 10 documents
print("Similarity Matrix (First 10 Documents):")
print(similarity_matrix)

Similarity Matrix (First 10 Documents):
[[1.         0.6997468  0.8511427  0.6997468  0.79425781 0.66438238
  0.76982506 0.68768813 0.75082418 0.74143225]
 [0.6997468  1.         0.75729146 1.         0.69855627 0.59016426
  0.69249744 0.6425857  0.69853589 0.70027608]
 [0.8511427  0.75729146 1.         0.75729146 0.89258414 0.67856654
  0.87353035 0.73556672 0.86221725 0.83430655]
 [0.6997468  1.         0.75729146 1.         0.69855627 0.59016426
  0.69249744 0.6425857  0.69853589 0.70027608]
 [0.79425781 0.69855627 0.89258414 0.69855627 1.         0.64830348
  0.9142389  0.72640849 0.87030152 0.798751  ]
 [0.66438238 0.59016426 0.67856654 0.59016426 0.64830348 1.
  0.62882171 0.58418373 0.6537631  0.62822574]
 [0.76982506 0.69249744 0.87353035 0.69249744 0.9142389  0.62882171
  1.         0.68872373 0.8490064  0.77839615]
 [0.68768813 0.6425857  0.73556672 0.6425857  0.72640849 0.58418373
  0.68872373 1.         0.7034693  0.67288339]
 [0.75082418 0.69853589 0.86221725 0.69853589 0.

In [13]:
# Compute similarity scores for 10 random documents

import random

# Function to compute similarity scores for 10 random documents
def compute_similarity_for_random_10(documents):
    # Combine all document contents into a list
    all_docs = []
    for folder, docs in documents.items():
        for doc in docs:
            all_docs.append(doc.page_content)
    
    # Select 10 random documents
    random_indices = random.sample(range(len(all_docs)), 10)
    random_docs = [all_docs[i] for i in random_indices]
    
    # Initialize the TF-IDF Vectorizer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(random_docs)
    
    # Compute the cosine similarity matrix
    similarity_matrix = cosine_similarity(tfidf_matrix)
    
    return similarity_matrix, random_indices

# Example usage
similarity_matrix, random_indices = compute_similarity_for_random_10(documents)

# Print the similarity matrix for the 10 random documents
print("Similarity Matrix (10 Random Documents):")
print(similarity_matrix)

# Print the indices of the selected random documents
print("Indices of Random Documents:", random_indices)

Similarity Matrix (10 Random Documents):
[[1.         0.42483855 0.73037102 0.62487471 0.57398258 0.51691625
  0.53705175 0.59572903 0.61962474 0.57051859]
 [0.42483855 1.         0.48269049 0.47477452 0.45592912 0.37241368
  0.44611859 0.44124871 0.49191732 0.44405895]
 [0.73037102 0.48269049 1.         0.69906811 0.68570154 0.60061654
  0.6329849  0.66292598 0.74824945 0.65389716]
 [0.62487471 0.47477452 0.69906811 1.         0.69333676 0.77323196
  0.63126641 0.68075239 0.72568615 0.68511696]
 [0.57398258 0.45592912 0.68570154 0.69333676 1.         0.60909226
  0.59777255 0.63043674 0.69774946 0.61084449]
 [0.51691625 0.37241368 0.60061654 0.77323196 0.60909226 1.
  0.50761389 0.58856072 0.57288325 0.55526598]
 [0.53705175 0.44611859 0.6329849  0.63126641 0.59777255 0.50761389
  1.         0.58317117 0.6905728  0.56912925]
 [0.59572903 0.44124871 0.66292598 0.68075239 0.63043674 0.58856072
  0.58317117 1.         0.67149559 0.60640506]
 [0.61962474 0.49191732 0.74824945 0.72568615 0

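**Observation:** The first 10 documents are all loaded from `contractnli` (the first folder iterated), so they are the same kind of agreement and share much of their vocabulary; the scores shown range from about 0.58 to 1.0. Documents 1 and 3 (0-based) score exactly 1.0 against each other, which suggests they are duplicates. The 10 random documents are drawn from all four folders, and the off-diagonal scores in the rows shown are lower, roughly 0.37 to 0.77.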
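The task description notes that similarity scores can help detect duplicate or highly similar contracts. A minimal sketch for listing such pairs from a square similarity matrix follows; the function name and the default threshold of 0.99 are assumptions to be tuned, not values taken from the assignment.

```python
def find_near_duplicates(similarity_matrix, threshold=0.99):
    """Return (i, j, score) for every pair i < j scoring at or above threshold.

    Works on a NumPy array or a plain list of lists.
    """
    pairs = []
    n = len(similarity_matrix)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; skips the diagonal
            score = float(similarity_matrix[i][j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Calling `find_near_duplicates(similarity_matrix)` on either matrix above returns the index pairs worth inspecting by hand. The indices refer to positions within the 10 selected documents, so they need to be mapped back to file names.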

### **1.4 Document Creation and Chunking** <font color=red> [5 marks] </font><br>

#### **1.4.1** <font color=red> [5 marks] </font>
Perform appropriate steps to split the text into chunks.

In [19]:
def generate_chunks(documents, chunk_size=500, chunk_overlap=50):
    import os
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""]
    )

    chunks = []
    for folder, docs in documents.items():
        for doc in docs:
            # doc is a LangChain Document: use attributes, not dictionary keys
            doc_chunks = text_splitter.split_text(doc.page_content)
            # TextLoader stores the full file path under the "source" key
            file_name = os.path.basename(doc.metadata.get("source", "unknown"))
            for chunk in doc_chunks:
                chunks.append({
                    "content": chunk,
                    "metadata": {
                        "source": folder,
                        "file_name": file_name
                    }
                })
    return chunks

In [20]:
chunks = generate_chunks(documents)
print(chunks[:5])  # Print the first 5 chunks

[{'content': 'confidentiality and nondisclosure agreement this confidentiality and nondisclosure agreement this agreement is dated _______ ___ 2018 the effective date and is between qep energy company owner a delaware corporation and _____________________ the receiving company a ______ ______________ owner and the receiving company are sometimes referred to herein individually as a party and collectively as the parties r e c i t a l s whereas owner has in its possession the confidential information as', 'metadata': {'source': 'contractnli', 'file_name': 'QEP-Williston-Form-of-Confidentiality-Agreement-BMO.txt'}}, {'content': 'in its possession the confidential information as hereinafter defined relating to owners and certain of its affiliates assets and properties located in the williston basin in dunn mckenzie mclean mercer and mountrail counties north dakota collectively the properties whereas in order for the receiving company to determine its interest in entering into a transaction
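In the output above, the first chunk ends with "in its possession the confidential information as" and the second chunk begins with the same words: that shared text is the overlap. To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators and keeps words intact); it only illustrates the two parameters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap means more chunks and more repeated text in the vector database.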

## **2. Vector Database and RAG Chain Creation** <font color=red> [15 marks] </font><br>

### **2.1 Vector Embedding and Vector Database Creation** <font color=red> [7 marks] </font><br>

#### **2.1.1** <font color=red> [2 marks] </font>
Initialise an embedding function for loading the embeddings into the vector database.

Initialise a function to transform the text to vectors using the OpenAI Embeddings module. You can also use this function to transform the text during vector DB creation itself.

In [21]:
import os

# Fetch your OpenAI API Key from environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")

if openai_api_key:
    print("OpenAI API Key fetched successfully.")
else:
    print("OpenAI API Key not found. Please set it as an environment variable.")

OpenAI API Key fetched successfully.


In [22]:
# Initialise an embedding function

from langchain_openai import OpenAIEmbeddings

def get_openai_embeddings(api_key):
    """
    Initialize the OpenAI Embeddings module with the provided API key.

    Args:
        api_key (str): Your OpenAI API key.

    Returns:
        OpenAIEmbeddings: An instance of the OpenAIEmbeddings class.
    """
    try:
        embeddings = OpenAIEmbeddings(openai_api_key=api_key)
        print("OpenAI Embeddings initialized successfully.")
        return embeddings
    except Exception as e:
        print(f"Error initializing OpenAI Embeddings: {e}")
        return None

# Example usage
import os

# Fetch the OpenAI API key from environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")

if openai_api_key:
    embeddings = get_openai_embeddings(openai_api_key)
else:
    print("OpenAI API Key not found. Please set it as an environment variable.")


OpenAI Embeddings initialized successfully.


#### **2.1.2** <font color=red> [5 marks] </font>
Load the embeddings to a vector database.

Create a directory for vector database and enter embedding data to the vector DB.

In [28]:
def load_embeddings_to_vector_db(chunks, api_key, persist_directory="vector_db"):
    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings

    try:
        embeddings = OpenAIEmbeddings(openai_api_key=api_key)

        # The chunks are plain dicts, not Document objects, so use from_texts
        # (from_documents expects objects with a page_content attribute)
        vector_db = Chroma.from_texts(
            texts=[chunk["content"] for chunk in chunks],
            embedding=embeddings,
            metadatas=[chunk["metadata"] for chunk in chunks],
            persist_directory=persist_directory,
        )

        # langchain_chroma writes to persist_directory automatically,
        # so no explicit persist() call is needed
        print(f"Vector database created and persisted at: {persist_directory}")
        return vector_db

    except Exception as e:
        print(f"Error loading embeddings to vector database: {e}")
        return None

In [29]:
vector_db = load_embeddings_to_vector_db(chunks, openai_api_key)



### **2.2 Create RAG Chain** <font color=red> [8 marks] </font><br>

#### **2.2.1** <font color=red> [5 marks] </font>
Create a RAG chain.

In [18]:
# Create a RAG chain
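One way to structure the chain is sketched below with plain Python callables, so the retrieve, prompt and generate flow is visible without depending on a particular LangChain version. The prompt wording, the function names and the `(text, metadata)` return shape of `retrieve` are all assumptions made for illustration.

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def make_rag_chain(retrieve, generate, k=4):
    """Build a function mapping a question to {"answer", "sources"}.

    retrieve(question, k) must return a list of (text, metadata) pairs and
    generate(prompt) must return a string; both are supplied by the caller.
    """
    def rag_chain(question):
        hits = retrieve(question, k)
        # Concatenate the retrieved chunk texts into one context block
        context = "\n\n".join(text for text, _ in hits)
        prompt = PROMPT_TEMPLATE.format(context=context, question=question)
        return {
            "answer": generate(prompt),
            "sources": [metadata for _, metadata in hits],
        }
    return rag_chain
```

With the objects created earlier, `retrieve` could wrap `vector_db.similarity_search(question, k=k)` and `generate` could wrap a chat model's `invoke(prompt).content`. Those wrappers are not shown here and depend on the installed library versions.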

#### **2.2.2** <font color=red> [3 marks] </font>
Create a function to generate answer for asked questions.

Use the RAG chain to generate an answer for a question and provide the source documents.

In [19]:
# Create a function for question answering



In [20]:
# Example question
# question ="Consider the Non-Disclosure Agreement between CopAcc and ToP Mentors; Does the document indicate that the Agreement does not grant the Receiving Party any rights to the Confidential Information?"


## **3. RAG Evaluation** <font color=red> [10 marks] </font><br>

### **3.1 Evaluation and Inference** <font color=red> [10 marks] </font><br>

#### **3.1.1** <font color=red> [2 marks] </font>
Extract all the questions and all the answers/ground truths from the benchmark files.

Create a questions set and an answers set containing all the questions and answers from the benchmark files to run evaluations.

In [21]:
# Create a question set by taking all the questions from the benchmark data
# Also create a ground truth/answer set



#### **3.1.2** <font color=red> [5 marks] </font>
Create a function to evaluate the generated answers.

Evaluate the responses using *ROUGE*, *Ragas* and *BLEU* scores.

In [22]:
# Function to evaluate the RAG pipeline
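To show what an overlap-based score measures, the sketch below computes clipped unigram overlap between a reference and a prediction. It is a simplified stand-in for ROUGE-1 (whitespace tokenisation, no stemming) and not a replacement for the `rouge_score` package installed earlier; it does not cover Ragas or BLEU. The function name is illustrative.

```python
from collections import Counter


def unigram_overlap_scores(reference, prediction):
    """Precision, recall and F1 based on clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall rewards answers that cover the ground truth, while precision penalises padding; long generated answers compared against short benchmark snippets will tend to score high on recall and low on precision.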



#### **3.1.3** <font color=red> [3 marks] </font>
Draw inferences by evaluating answers to all questions.

To save time and computing power, you can run the evaluation on just the first 100 questions.

In [23]:
# Evaluate the RAG pipeline
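A minimal sketch of the evaluation loop, written against injected functions so it can be tried without API calls. `answer_fn` (question to answer string) and `metric_fn` (ground truth and prediction to a number) are hypothetical hooks for whatever pipeline and metric are actually used; the default `limit` of 100 follows the suggestion above.

```python
def evaluate_pipeline(questions, ground_truths, answer_fn, metric_fn, limit=100):
    """Score the first `limit` questions and return the mean score."""
    scores = []
    for question, truth in list(zip(questions, ground_truths))[:limit]:
        try:
            prediction = answer_fn(question)
        except Exception as exc:  # keep going if a single call fails
            print(f"Skipping question after error: {exc}")
            continue
        scores.append(metric_fn(truth, prediction))
    mean_score = sum(scores) / len(scores) if scores else 0.0
    return {"n_scored": len(scores), "mean_score": mean_score}
```

Reporting `n_scored` alongside the mean makes it visible when failed calls have reduced the sample below the intended 100 questions.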


## **4. Conclusion** <font color=red> [5 marks] </font><br>

### **4.1 Conclusions and insights** <font color=red> [5 marks] </font><br>

#### **4.1.1** <font color=red> [5 marks] </font>
Conclude with the results here. Include the insights gained about the data, model pipeline, the RAG process and the results obtained.