# RAG-Based App For Analyzing Legal Documents

This notebook creates a Retrieval Augmented Generation (RAG) application for analyzing legal documents using **Pinecone**, **OpenAI** and **LangChain**. Through this RAG implementation, we can use ChatGPT to analyze our proprietary data.

The PDF document containing the excerpts of the deposition is placed in the **Documents** directory. We use the **PyPDFLoader** API from **LangChain** to read the document. Then, we split the large document into smaller chunks.

In the next step, we create the embedding vectors with the **text-embedding-ada-002** model from **OpenAI** through the **OpenAIEmbeddings** high-level API. Then, we upsert (save) these embeddings in the **Pinecone** vector database.

In the subsequent step, we find all the instances of admissions and contradictions. First, we retrieve the chunks most relevant to admissions and to contradictions separately from the Pinecone database, using two distinct custom prompts. Then, we use a **Retrieval Chain** to find the exact instances of admissions and contradictions using ChatGPT. Finally, we store these instances as **CSV** files in the **Admissions** and **Contradictions** directories, respectively.

## Set environment variables and keys

Let's start by setting up the API keys for **OpenAI** and **Pinecone** as environment variables.

In [1]:
import os
from dotenv import load_dotenv

#Load the .env file
load_dotenv(dotenv_path='.env')

#Read the OpenAI and Pinecone API keys from the environment variables.
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
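
If a key is missing from the `.env` file, `os.environ.get` returns `None` and the problem only surfaces later, inside the Pinecone or OpenAI client. The sketch below shows one way to fail early with a clear message. The helper name `require_env` is hypothetical and not part of `dotenv` or any other library used here.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to the .env file.")
    return value

# Example usage (assumes load_dotenv() has already been called):
# PINECONE_API_KEY = require_env("PINECONE_API_KEY")
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```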

## Manipulate the PDF file

To analyze the legal documents, place the PDF files in the **Documents** sub-directory. We'll read the document and split it into smaller chunks.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

In [3]:
#Get the list of PDF files present in the "Documents" directory
import glob
path_list = glob.glob("Documents/*.PDF")
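
On a case-sensitive file system, the pattern `Documents/*.PDF` does not match files that end in `.pdf`. The sketch below lists PDF files regardless of the letter case of the extension. The helper name `list_pdfs` is an assumption made for illustration, not something the notebook defines.

```python
from pathlib import Path

def list_pdfs(directory: str) -> list[str]:
    """Return sorted paths of files in `directory` whose extension is .pdf in any letter case."""
    folder = Path(directory)
    return sorted(
        str(path)
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )

# Example usage:
# path_list = list_pdfs("Documents")
# if not path_list:
#     raise FileNotFoundError("No PDF files found in the Documents directory")
```

Checking for an empty list before indexing `path_list[0]` turns an `IndexError` into a more readable error.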

In [4]:
#Create the PyPDFLoader object
# It can be scaled up in the future, but for now, we'll deal with just one document
loader = PyPDFLoader(file_path=path_list[0])

#Check that the langchain PyPDFLoader object has been created
loader

<langchain_community.document_loaders.pdf.PyPDFLoader at 0x28fb1811be0>

In [5]:
#Load the file contents
doc = loader.load()

In [6]:
#Check the length of document
len(doc)

75

Since our document is quite large, we split it into smaller chunks so that the retrieved text fits within the model's context window (the **4K** token limit of **gpt-3.5-turbo-instruct** is the budget assumed here). We could opt for **GPT-4** or **GPT-4o** instead, but that would increase our cost. At the production stage, however, we might switch to a higher-end model.

In [7]:
#Instantiate the RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    #Maximum number of characters per chunk
    chunk_overlap=200,  #Set the overlap of chunks for better results
)
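
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first tries to split on separators such as paragraph and line breaks; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a statement that straddles a boundary still appears whole in at least one chunk.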

In [8]:
#Split the read document (langchain PyPDFLoader object)
doc_chunks = text_splitter.split_documents(doc)

In [9]:
#Check the length of the chunks
len(doc_chunks)

75

In [10]:
#Let's view a page of the document
doc_chunks[2].page_content

'1 I N D E X \n2 WITNESS \n3 JOSEPH NADEAU \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 EXAMINATION BY MR. PIROZZOLO \nEXAMINATION BY MS. BARONI \nEXAMINATION BY MR. BRYAN \nEXAMINATION BY MR. PIROZZOLO \nVivian Dafoulas & Associates \n(401) 885-0992 148 \nPAGE \n150 \n199 \n203 \n216 \n99e5b696-09ac-4d72-826c-c2c83d416066 \nCDALEDEP003543'

**Since we also need to specify the page number in our answer, we'll have to append the page number at the end of each chunk.**

In [11]:
#Add the page number
#The offset of 146 maps the loader's 0-based page index to the transcript's own page numbering
PAGE_OFFSET = 146
for chunk in doc_chunks:
    chunk.page_content += f"\n\npage: {chunk.metadata['page'] + PAGE_OFFSET}"

In [12]:
#Let's view the added page number at the end of the chunk
doc_chunks[2].page_content

'1 I N D E X \n2 WITNESS \n3 JOSEPH NADEAU \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 EXAMINATION BY MR. PIROZZOLO \nEXAMINATION BY MS. BARONI \nEXAMINATION BY MR. BRYAN \nEXAMINATION BY MR. PIROZZOLO \nVivian Dafoulas & Associates \n(401) 885-0992 148 \nPAGE \n150 \n199 \n203 \n216 \n99e5b696-09ac-4d72-826c-c2c83d416066 \nCDALEDEP003543\n\npage: 148'
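
Because the page number now lives inside the chunk text, it can also be read back later, for example to check a reference returned by the model against the chunk it came from. The sketch below assumes the exact `"\n\npage: N"` suffix appended above; the helper name `extract_page_number` is hypothetical.

```python
import re
from typing import Optional

# Matches the suffix appended to each chunk: two newlines, "page: ", then digits at the very end
PAGE_SUFFIX = re.compile(r"\n\npage: (\d+)\Z")

def extract_page_number(chunk_text: str) -> Optional[int]:
    """Return the page number stored at the end of a chunk, or None if the suffix is absent."""
    match = PAGE_SUFFIX.search(chunk_text)
    return int(match.group(1)) if match else None

print(extract_page_number("CDALEDEP003543\n\npage: 148"))  # 148
```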

In [13]:
type(doc_chunks[0])

langchain_core.documents.base.Document

## Pinecone

In this section, we'll create the embedding vectors for the document chunks and store them in the **Pinecone** database.

In [14]:
from langchain_pinecone import PineconeVectorStore #High-level API from langchain
from pinecone import Pinecone, ServerlessSpec

In [15]:
#Initialize the client connection to Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)

In [16]:
#Store the relevant index in a variable
index_name = "legal-index"

In [17]:
#Check if our index already exists in Pinecone

if index_name not in pc.list_indexes().names():
    print("Index does not exist: ", index_name)
else:
    print(index_name, "exists in the Pinecone, we may proceed!")

legal-index exists in the Pinecone, we may proceed!


In [17]:
#Disabled (kept as a string): delete and recreate an index from scratch
"""
import time

# check for and delete index if already exists
index_name = 'prod-index'
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
# create a new index
pc.create_index(
    index_name,
    dimension=1536,  # dimensionality of text-embedding-ada-002
    metric='dotproduct',
    spec=ServerlessSpec(cloud='aws', region='us-east-1'),  # spec is required; cloud and region are assumed values
)
# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
"""

In [18]:
#Check the index stats
index = pc.Index(index_name)  
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'doc1': {'vector_count': 75}},
 'total_vector_count': 75}

**total_vector_count** tells us whether our index already has any vectors stored, and **namespaces** breaks that count down per namespace.
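
The per-namespace count can be used to decide whether a namespace already holds vectors before embedding and upserting again, since re-running the upsert cell may add the same chunks a second time. The sketch below works on a plain dictionary with the shape shown in the output above. Whether the object returned by `describe_index_stats()` must first be converted to a dictionary is an assumption to verify against the installed client, and `namespace_vector_count` is a hypothetical helper.

```python
def namespace_vector_count(stats: dict, namespace: str) -> int:
    """Return the number of vectors stored in `namespace`, or 0 if the namespace is absent."""
    return stats.get("namespaces", {}).get(namespace, {}).get("vector_count", 0)

# Same shape as the stats printed above
stats = {
    "dimension": 1536,
    "index_fullness": 0.0,
    "namespaces": {"doc1": {"vector_count": 75}},
    "total_vector_count": 75,
}

print(namespace_vector_count(stats, "doc1"))  # 75
print(namespace_vector_count(stats, "doc2"))  # 0
```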

In [19]:
#Instantiate the OpenAI Embeddings
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [20]:
#Create the embeddings for each chunk of the document and store it in the Pinecone index
# at the specified namespace "doc2"
vectorstore_from_docs = PineconeVectorStore.from_documents(
        doc_chunks,
        index_name=index_name,
        embedding=embeddings,
        namespace="doc2"
    )

## Retrieval

Let's retrieve the relevant data from **Pinecone** for the witness's admissions and contradictions using custom prompts.

In [21]:
admissions_prompt = """
You are a legal expert analyzing a witness deposition. Your task is to identify all instances of "admissions" in the text. For each instance, provide the line number(s) and explain why it is considered an "admission".

Definitions:
"Admissions": Statements where the witness acknowledges or agrees to a fact, confirms involvement, or accepts responsibility. Examples include phrases like "I admit", "I agree", "Yes, that's true", "I acknowledge that", "I suppose so.", "I guess that's right.", "You could say that.", "I think so.", "To the best of my knowledge...", "I recognize this document.", "I received the email.", "I attended the meeting." etc.

Please analyze the following deposition text and identify any instances of admissions. At the start of each line, the line number is given, and at the end of the page, a page number is also given as "page: ". 

Provide the output in the following Python dictionary format. Please make sure that Python dictionary is well structured.:

{"admissions": [ {"topic": "Brief title of admission","content": "Content of the admission","reference": "line x, page y","reason": "Explanation of why this instance is considered an admission"},...]}
"""

In [40]:
contradictions_prompt = """
You are a legal expert analyzing a witness deposition. Your task is to identify instances of "contradictions" in the text. For each instance, provide the line number(s) and explain why it is considered a "contradiction".

Definitions:
"Contradictions": Statements where the witness's current statement conflicts with a previous statement or established fact. Examples include changing answers, denying previous acknowledgments, or providing conflicting information.

Please analyze the following deposition text and identify any instances of contradictions. At the start of each line, the line number is given, and at the end of the page, a page number is also given as "page: ". 

Provide the output in the following dictionary format. Please make sure that the dictionary is well structured.:

{"contradictions": [ {"topic": "Brief title of contradiction","assertion_content": "Content of the initial assertion","assertion_reference": "line x, page y","contradiction_content": "Content of the contradictory statement","contradiction_reference": "line x, page y","reason": "Explanation of why this instance is considered a contradiction"},...]}
"""


In [23]:
#Let's search the Pinecone vectorstore for our admission query
vectorstore_from_docs.similarity_search(admissions_prompt, 
                                        k = 10, # Return at most 10 relevant chunks
                                        namespace="doc2")

[Document(page_content='1 deposition of December 17, 2002? \n2 A. (Witness complying.) \n3 Q. May I ask you to go to Page 34 and read \n4 Lines 1 through 10 to yourself? \n5 (Witness reading document.) \n6 Q. Do you recall being asked --\n7 A. Yes. \n8 Q. --"did you see the cutout?" Do you \n9 remember that? \n10 A. Yes. \n11 Q. And was the cutout something you said \n12 was at the end of one of the walls? \n13 A. Yes. \n14 Q. And do you remember answering: "You \n15 know, I probably did. I just can\'t remember. I \n16 couldn\'t say for a fact. It\'s an assumption." Do \n17 you remember giving that testimony? \n18 A. Yes. \n19 Q. So you recall testifying in 2002 that it \n20 was an assumption that there was a cutout? \n21 MR. BRYAN: Objection. Leading. \n22 A. Yeah, I remember saying that but the \n23 more I think about it now, I do remember in fact \n24 \n25 it went out. I saw it go out. \nQ. It could have gone into a sewer? \nVivian Dafoulas & Associates \n(401) 885-0992 192 \n99e5b6

In [24]:
#Let's search the Pinecone vectorstore for our contradiction query
vectorstore_from_docs.similarity_search(contradictions_prompt, 
                                        k = 10, # Return at most 10 relevant chunks
                                        namespace="doc2")

[Document(page_content='1 deposition of December 17, 2002? \n2 A. (Witness complying.) \n3 Q. May I ask you to go to Page 34 and read \n4 Lines 1 through 10 to yourself? \n5 (Witness reading document.) \n6 Q. Do you recall being asked --\n7 A. Yes. \n8 Q. --"did you see the cutout?" Do you \n9 remember that? \n10 A. Yes. \n11 Q. And was the cutout something you said \n12 was at the end of one of the walls? \n13 A. Yes. \n14 Q. And do you remember answering: "You \n15 know, I probably did. I just can\'t remember. I \n16 couldn\'t say for a fact. It\'s an assumption." Do \n17 you remember giving that testimony? \n18 A. Yes. \n19 Q. So you recall testifying in 2002 that it \n20 was an assumption that there was a cutout? \n21 MR. BRYAN: Objection. Leading. \n22 A. Yeah, I remember saying that but the \n23 more I think about it now, I do remember in fact \n24 \n25 it went out. I saw it go out. \nQ. It could have gone into a sewer? \nVivian Dafoulas & Associates \n(401) 885-0992 192 \n99e5b6

## LangChain

To extract the exact instances of admissions and contradictions, we'll use the **RetrievalQA** chain from **LangChain**. It retrieves the most relevant chunks from Pinecone and passes them, together with our prompt, to **ChatGPT** for analysis.

In [25]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

In [26]:
#Instantiate the LLM
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

In [27]:
#Instantiate the retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_from_docs.as_retriever(
        search_type="mmr", #Maximal Marginal Relevance
        #Return at most 8 relevant documents;
        #lambda_mult sets the diversity of results returned by MMR
        search_kwargs={'k': 8, 'lambda_mult': 0.25}
    )
)
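
To illustrate what `lambda_mult` controls, here is a small pure-Python sketch of greedy Maximal Marginal Relevance on toy 2-D vectors. It is a simplified illustration of the technique, not LangChain's implementation. The function names and the toy data are made up for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query, candidates, k, lambda_mult):
    """Greedy MMR: return the indices of up to k candidates, in selection order."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none selected yet)
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # index 1 is a near-duplicate of index 0

print(mmr_select(query, candidates, k=2, lambda_mult=0.25))  # [0, 2]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))   # [0, 1]
```

With a low `lambda_mult` such as 0.25, the near-duplicate is skipped in favour of a different passage; with 1.0, the selection is driven by relevance alone. For a deposition, a lower value helps the retrieved chunks cover different parts of the testimony.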

## Find Admissions

Let's find all the instances of admissions.

In [28]:
#Let's get the instances of admissions
admissions_results = qa_chain.invoke(admissions_prompt)

In [29]:
#Check the dictionary keys
admissions_results.keys()

dict_keys(['query', 'result'])

In [30]:
#Check the required result
admissions_results["result"]

'{\n    "admissions": [\n        {\n            "topic": "Recollection of seeing the cutout",\n            "content": "I saw it go out.",\n            "reference": "line 24, page 192",\n            "reason": "The witness initially mentioned it was an assumption, but later admitted to seeing the cutout."\n        },\n        {\n            "topic": "Observation of water changing color",\n            "content": "Yes.",\n            "reference": "line 15, page 170",\n            "reason": "The witness confirmed observing the water change color while working at Metro-Atlantic."\n        },\n        {\n            "topic": "Observation of water changing color",\n            "content": "Once I hit it with the hose, the water would change to the color of whatever was on the floor, it would wash to the drain and exit.",\n            "reference": "line 21, page 170",\n            "reason": "The witness explained the process of water changing color, indicating direct observation."\n        },\n 

In [32]:
#Convert the resultant string object to Python dictionary
import json
try:
    admissions_results_dict = json.loads(admissions_results["result"])
    print(admissions_results_dict)
except json.JSONDecodeError as e:
    print("Failed to parse JSON:", e)

{'admissions': [{'topic': 'Recollection of seeing the cutout', 'content': 'I saw it go out.', 'reference': 'line 24, page 192', 'reason': 'The witness initially mentioned it was an assumption, but later admitted to seeing the cutout.'}, {'topic': 'Observation of water changing color', 'content': 'Yes.', 'reference': 'line 15, page 170', 'reason': 'The witness confirmed observing the water change color while working at Metro-Atlantic.'}, {'topic': 'Observation of water changing color', 'content': 'Once I hit it with the hose, the water would change to the color of whatever was on the floor, it would wash to the drain and exit.', 'reference': 'line 21, page 170', 'reason': 'The witness explained the process of water changing color, indicating direct observation.'}, {'topic': 'Recollection of barrels having plastic liners', 'content': 'Yes.', 'reference': 'line 11, page 209', 'reason': 'The witness recalled that some barrels brought in for reconditioning had plastic liners.'}, {'topic': '
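
The prompt asks for a "Python dictionary", while `json.loads` only accepts valid JSON. In this run the model returned valid JSON, but a reply wrapped in a Markdown code fence or written with single quotes would fail to parse. The sketch below is one tolerant fallback; `parse_model_dict` is a hypothetical helper, and the failure modes it handles are assumptions about how the model might deviate.

```python
import ast
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to avoid a literal fence here

def parse_model_dict(raw: str) -> dict:
    """Parse a model reply into a dict, accepting JSON or a Python-literal dictionary."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]          # drop the opening fence line (may carry a language tag)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]                 # drop the closing fence line
        text = "\n".join(lines)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Falls back to a safe literal parser; raises ValueError or SyntaxError if that fails too
        parsed = ast.literal_eval(text)
    if not isinstance(parsed, dict):
        raise ValueError("Expected a dictionary at the top level")
    return parsed

# Example usage:
# admissions_results_dict = parse_model_dict(admissions_results["result"])
```

`ast.literal_eval` only evaluates literals, so it does not execute arbitrary code from the model's reply.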

In [33]:
#Check the Python Dictionary
admissions_results_dict["admissions"]

[{'topic': 'Recollection of seeing the cutout',
  'content': 'I saw it go out.',
  'reference': 'line 24, page 192',
  'reason': 'The witness initially mentioned it was an assumption, but later admitted to seeing the cutout.'},
 {'topic': 'Observation of water changing color',
  'content': 'Yes.',
  'reference': 'line 15, page 170',
  'reason': 'The witness confirmed observing the water change color while working at Metro-Atlantic.'},
 {'topic': 'Observation of water changing color',
  'content': 'Once I hit it with the hose, the water would change to the color of whatever was on the floor, it would wash to the drain and exit.',
  'reference': 'line 21, page 170',
  'reason': 'The witness explained the process of water changing color, indicating direct observation.'},
 {'topic': 'Recollection of barrels having plastic liners',
  'content': 'Yes.',
  'reference': 'line 11, page 209',
  'reason': 'The witness recalled that some barrels brought in for reconditioning had plastic liners.'},

In [34]:
#Let's convert the results into well structured DataFrames and save them.

import pandas as pd

admissions_df = pd.DataFrame(data=admissions_results_dict["admissions"])
admissions_df

| | topic | content | reference | reason |
|---|---|---|---|---|
| 0 | Recollection of seeing the cutout | I saw it go out. | line 24, page 192 | The witness initially mentioned it was an assu... |
| 1 | Observation of water changing color | Yes. | line 15, page 170 | The witness confirmed observing the water chan... |
| 2 | Observation of water changing color | Once I hit it with the hose, the water would c... | line 21, page 170 | The witness explained the process of water cha... |
| 3 | Recollection of barrels having plastic liners | Yes. | line 11, page 209 | The witness recalled that some barrels brought... |
| 4 | Handling of barrels with plastic liners | No. | line 14, page 209 | The witness admitted that not all barrels hand... |
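
The `reference` column stores the location as free text such as `"line 24, page 192"`. Splitting it into numbers makes it possible to sort or filter the results by page. The sketch below assumes the `"line x, page y"` format requested in the prompt; support for a range like `"lines 3-5, page 150"` is an added assumption, and `parse_reference` is a hypothetical helper.

```python
import re
from typing import Optional

REFERENCE_PATTERN = re.compile(
    r"lines?\s+(\d+)(?:\s*-\s*(\d+))?,\s*page\s+(\d+)", re.IGNORECASE
)

def parse_reference(reference: str) -> Optional[dict]:
    """Split a 'line x, page y' reference into integers; return None if it does not match."""
    match = REFERENCE_PATTERN.search(reference)
    if match is None:
        return None
    first_line = int(match.group(1))
    last_line = int(match.group(2)) if match.group(2) else first_line
    return {"first_line": first_line, "last_line": last_line, "page": int(match.group(3))}

print(parse_reference("line 24, page 192"))
# {'first_line': 24, 'last_line': 24, 'page': 192}

# Example usage with the DataFrame above:
# admissions_df["page"] = admissions_df["reference"].map(
#     lambda ref: (parse_reference(ref) or {}).get("page")
# )
```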


In [35]:
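#Save the DataFrame to a CSV file named after the source PDF (works with any path separator)
from pathlib import Path
os.makedirs("Admissions", exist_ok=True)
admissions_df.to_csv(os.path.join("Admissions", Path(doc[0].metadata["source"]).stem + ".csv"))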

## Find Contradictions

Let's find all the instances of contradictions.

In [41]:
#Let's get the instances of contradiction
contradictions_results = qa_chain.invoke(contradictions_prompt)

In [42]:
#Check the dictionary keys
contradictions_results.keys()

dict_keys(['query', 'result'])

In [43]:
#Check the required result
contradictions_results["result"]

'{\n    "contradictions": [\n        {\n            "topic": "Knowledge of Ownership",\n            "assertion_content": "Did not know who owned what at that time.",\n            "assertion_reference": "line 17, page 150",\n            "contradiction_content": "Did not know where any boundaries of the land were as between ownership of one person or another.",\n            "contradiction_reference": "line 20, page 150",\n            "reason": "The witness initially stated he did not know who owned what, but later contradicted this by saying he did not know where the boundaries of the land were between different owners."\n        },\n        {\n            "topic": "Orientation to Map",\n            "assertion_content": "Recognized features east and west of the facilities.",\n            "assertion_reference": "line 19, page 218",\n            "contradiction_content": "Mentioned never looking at the features.",\n            "contradiction_reference": "line 23, page 218",\n            "re

In [44]:
#Convert the resultant string object to Python dictionary
import json
try:
    contradictions_results_dict = json.loads(contradictions_results["result"])
    print(contradictions_results_dict)
except json.JSONDecodeError as e:
    print("Failed to parse JSON:", e)

{'contradictions': [{'topic': 'Knowledge of Ownership', 'assertion_content': 'Did not know who owned what at that time.', 'assertion_reference': 'line 17, page 150', 'contradiction_content': 'Did not know where any boundaries of the land were as between ownership of one person or another.', 'contradiction_reference': 'line 20, page 150', 'reason': 'The witness initially stated he did not know who owned what, but later contradicted this by saying he did not know where the boundaries of the land were between different owners.'}, {'topic': 'Orientation to Map', 'assertion_content': 'Recognized features east and west of the facilities.', 'assertion_reference': 'line 19, page 218', 'contradiction_content': 'Mentioned never looking at the features.', 'contradiction_reference': 'line 23, page 218', 'reason': 'The witness first acknowledged recognizing features east and west of the facilities on the map, but later contradicted this by stating that he never looked at those features.'}]}


In [45]:
#Let's convert the results into well structured DataFrames and save them.

contradictions_df = pd.DataFrame(data=contradictions_results_dict["contradictions"])
contradictions_df

| | topic | assertion_content | assertion_reference | contradiction_content | contradiction_reference | reason |
|---|---|---|---|---|---|---|
| 0 | Knowledge of Ownership | Did not know who owned what at that time. | line 17, page 150 | Did not know where any boundaries of the land ... | line 20, page 150 | The witness initially stated he did not know w... |
| 1 | Orientation to Map | Recognized features east and west of the facil... | line 19, page 218 | Mentioned never looking at the features. | line 23, page 218 | The witness first acknowledged recognizing fea... |


In [46]:
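#Save the DataFrame to a CSV file named after the source PDF (works with any path separator)
from pathlib import Path
os.makedirs("Contradictions", exist_ok=True)
contradictions_df.to_csv(os.path.join("Contradictions", Path(doc[0].metadata["source"]).stem + ".csv"))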