<a href="https://colab.research.google.com/github/OmdenaAI/SriLankaChapter_RegulatoryDecisionMaking/blob/main/task-3-eda/experimental_notebooks/Compare_Naive_RAG_vs_Graph_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook does the following:
  1. Takes 3 PDF files associated with Sri Lanka tea industry regulations as the input dataset, to keep things simple.
  2. All 3 PDF files are guidelines.
  3. Builds a naive RAG using the vector database Chroma and the OpenAI API (model: gpt-4o-mini).
  4. Asks the naive RAG a set of questions.
  5. Builds a graph RAG using the graph database Neo4j and the OpenAI API (model: gpt-4o-mini). Uses Cypher queries to query the knowledge graph.
  6. Asks the graph RAG the same set of questions.

# **Naive RAG (with vector database chroma)**

Start with just 3 PDF files associated with Sri Lanka tea industry regulations. All of the PDF files are guidelines.

In [37]:
# Load the dataset, i.e. the PDF files
!pip install -Uq langchain-community pypdf
from langchain_community.document_loaders import PyPDFLoader

pdf_files = ["https://www.tri.lk/wp-content/uploads/2020/02/TRISL_Guideline_03_2019e_Dec2019.pdf",
             "https://www.tri.lk/wp-content/uploads/2023/05/TRISL_Guideline_05_2018_E.pdf",
             "https://www.tri.lk/wp-content/uploads/2023/05/TRISL_Guideline_01_2020_E.pdf"]

pdf_docs = []
for pdf_file in pdf_files:
    loader = PyPDFLoader(pdf_file)
    pdf_docs.extend(loader.load())

len(pdf_docs)

7

Chunk the dataset

In [38]:
#Chunk the dataset
!pip install -Uq langchain_text_splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
   chunk_size=5000,
   chunk_overlap=500,
   length_function=len,
   separators=["\n\n", "\n", " ", ""],
)

text_content = "".join([doc.page_content for doc in pdf_docs])
print("Length of text content: ", len(text_content))

text_chunks = text_splitter.split_text(text_content)
print("Num of text chunks: ", len(text_chunks))

Length of text content:  15272
Num of text chunks:  4
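
The chunk count above follows from the splitter settings. As a rough sketch (a plain fixed-size sliding window, not the separator-aware logic that `RecursiveCharacterTextSplitter` actually applies), each new chunk starts `chunk_size - chunk_overlap` = 4500 characters after the previous one, which gives 4 windows for 15272 characters. The helper name `sliding_window_chunks` is hypothetical and used only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# Dummy text with the same length as the notebook's text_content.
chunks = sliding_window_chunks("x" * 15272, chunk_size=5000, chunk_overlap=500)
print(len(chunks))               # 4
print([len(c) for c in chunks])  # [5000, 5000, 5000, 1772]
```

The real splitter prefers to break at the configured separators, so its chunk boundaries and lengths will differ from these windows even when the count matches.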


Create vector representation/embeddings for the text chunks

In [39]:
# Create embeddings for the chunks and look at one of them
!pip install -Uq langchain_openai
import openai
from langchain_openai import OpenAIEmbeddings

from google.colab import userdata
openai.api_key = userdata.get('OPENAI_API_KEY')

embeddings = OpenAIEmbeddings(api_key=openai.api_key)
embedded_chunks = embeddings.embed_documents(text_chunks)

# Print the first embedded chunk to see its structure
print(f"Length of the first embedded chunk: {len(embedded_chunks[0])}")
print(f"First few values of the first embedded chunk: {embedded_chunks[0][:3]}")

Length of the first embedded chunk: 1536
First few values of the first embedded chunk: [-0.013750607147812843, -0.013898174278438091, -0.020887507125735283]


Store embeddings/vector representations in a vector database

In [41]:
# Create embeddings for the chunks and store them in the Chroma vector database
!pip install -Uq langchain_chroma
from langchain_chroma import Chroma
from langchain.docstore.document import Document # import Document class

documents = [Document(page_content=text_chunk) for text_chunk in text_chunks]
db = Chroma.from_documents(documents, OpenAIEmbeddings(api_key=openai.api_key))
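
To make the idea of similarity search over stored embeddings concrete, here is a minimal in-memory sketch. It is only an illustration: the vectors are made-up 3-dimensional stand-ins for the 1536-dimensional embeddings, the functions `cosine_similarity` and `top_k` are hypothetical helpers, and the index structure and distance metric that Chroma actually uses depend on its configuration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy store: (chunk text, embedding) pairs with invented values.
toy_store = [
    ("mechanization objectives", [0.9, 0.1, 0.0]),
    ("irrigation of tea fields", [0.1, 0.9, 0.0]),
    ("drought recovery measures", [0.0, 0.2, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print([text for text, _ in results])
# ['mechanization objectives', 'irrigation of tea fields']
```

In this sketch, asking for more results than there are stored items simply returns every item, ranked by score.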

Create a prompt for the naive RAG

In [42]:
#Build a prompt for the RAG
from langchain.prompts import PromptTemplate
RAG_PROMPT_TEMPLATE = """
You are a helpful coding assistant that can answer questions about the provided context.
The context is usually a PDF document(s) or a text document(s).
Augment your answers with snippets from the context if necessary.

If you don't know the answer, say you don't know.

Context: {context}
Question: {question}
"""
PROMPT = PromptTemplate.from_template(RAG_PROMPT_TEMPLATE)

Create a RAG chain that links together the retrieved docs/question, prompt, LLM and output formatting

In [43]:
# Define a function to create a rag chain consisting of retrieved docs/question, prompt, llm and output formatting
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI  # langchain.chat_models.ChatOpenAI is deprecated
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def create_rag_chain(chunks):
    documents = [Document(page_content=chunk) for chunk in chunks]
    embeddings = OpenAIEmbeddings(api_key = openai.api_key)
    chroma_db = Chroma.from_documents(documents, embeddings)
    retriever = chroma_db.as_retriever(
        search_type="similarity", search_kwargs={"k": 5}
    )
    llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0, openai_api_key = openai.api_key)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | PROMPT
        | llm
        | StrOutputParser()
    )

    return rag_chain

rag_chain = create_rag_chain(text_chunks)
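
The data flow through the chain can be sketched in plain Python. This is a simplified stand-in, not the LangChain implementation: `fake_retriever` and `fake_llm` are hypothetical stubs replacing the Chroma retriever and gpt-4o-mini, and the prompt is a shortened version of the template above.

```python
RAG_PROMPT = "Context: {context}\nQuestion: {question}\n"

def fake_retriever(question):
    # Stand-in for the Chroma retriever: returns canned chunk texts.
    return [
        "Mechanization aims to increase worker productivity.",
        "Irrigation guidelines were issued in 2018.",
    ]

def format_docs(texts):
    # Same joining rule as format_docs in the notebook, applied to plain strings.
    return "\n\n".join(texts)

def fake_llm(prompt):
    # Stand-in for the chat model: reports the prompt size instead of answering.
    return f"(stub answer based on {len(prompt)} prompt characters)"

def run_chain(question):
    """Retrieve -> format context -> fill prompt -> call the model."""
    context = format_docs(fake_retriever(question))
    prompt = RAG_PROMPT.format(context=context, question=question)
    return prompt, fake_llm(prompt)

prompt, answer = run_chain("What does mechanization aim to do?")
print(prompt)
print(answer)
```

The question is used twice: once to drive retrieval and once, unchanged, inside the prompt. That is the role `RunnablePassthrough()` plays in the chain above.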

Use the RAG chain to ask a set of questions

In [44]:
query = "What does mechanization aim to do?"
print(rag_chain.invoke(query))

Mechanization in agriculture aims to accomplish several key objectives, which include:

1. **Increase worker productivity**: By using machines, the efficiency of labor can be enhanced.
2. **Reduce cost of production**: Mechanization can help lower the overall costs associated with agricultural production.
3. **Improve product quality**: The use of machines can lead to more consistent and higher quality outputs.
4. **Ease manual operations**: Mechanization can reduce the physical strain on workers by automating labor-intensive tasks.
5. **Complete tasks timely**: Machines can perform operations faster than manual labor, ensuring that tasks are completed within optimal time frames.
6. **Change worker attitudes**: The introduction of mechanization can shift perceptions about agricultural work, potentially making it more appealing.

These objectives are particularly relevant in the context of the tea sector, which is facing a labor shortage, making mechanization a necessary consideration.


In [45]:
query = "What are the benefits of mechanization"
print(rag_chain.invoke(query))

The benefits of mechanization in tea plantations, as outlined in the guidelines, include:

1. **Increased Worker Productivity**: Mechanization allows tasks to be completed more efficiently, thereby enhancing the overall productivity of workers.

2. **Reduced Cost of Production**: By utilizing machines, the reliance on manual labor can be decreased, which can lead to lower production costs.

3. **Improved Product Quality**: Mechanization can help in maintaining consistent quality in the production process.

4. **Eased Manual Operations**: Machines can take over labor-intensive tasks, making operations easier for workers.

5. **Timely Completion of Tasks**: Mechanization enables operations to be completed more quickly, ensuring that tasks are done within the required time frames.

6. **Changed Worker Attitudes**: The introduction of mechanization can lead to a shift in how workers perceive their roles, potentially increasing job satisfaction and efficiency.

These objectives highlight th

In [46]:
query = "Which planting system is better for sri lanka? Single row planting or double hedge row planting?"
print(rag_chain.invoke(query))

The double hedge-row planting system is considered better for Sri Lanka, especially for facilitating mechanization of field operations. The guidelines state that the conventional single-row planting system at a spacing of 0.6 m x 1.2 m (2 feet x 4 feet) limits the use and efficiency of machines. In contrast, the double hedge-row planting system, which has a spacing of 0.6 m x 0.6 m x 1.5 m (2 feet x 2 feet x 5 feet), is more amenable to mechanization of post-planting tea field operations.

This system is commonly practiced in other tea-growing countries like Japan and China and allows for the mechanization of most labor-intensive field operations, such as harvesting. The guidelines emphasize that this planting system is particularly suitable for flat and undulating lands and can help address the acute worker shortage faced by the tea sector in Sri Lanka. 

In summary, the double hedge-row planting system is recommended for its advantages in mechanization and efficiency in tea cultivati

In [47]:
query = "When was the guideline for mechanization of field operations published and also what is that guideline number?"
print(rag_chain.invoke(query))


The guideline for the establishment of tea to facilitate mechanization of field operations was published in December 2019, and the guideline number is 03/2019.


In [48]:
query = "Where is Tea REsearch institute of Sri lanka located in?"
print(rag_chain.invoke(query))

The Tea Research Institute of Sri Lanka is located in Talawakelle, Sri Lanka.


In [49]:
query = "What organization is located in Talawakelle?"
print(rag_chain.invoke(query))

The organization located in Talawakelle is the Tea Research Institute of Sri Lanka.


In [50]:
query = "Provide me the list of guidelines that Tea research institute of sri lanka has issued"
print(rag_chain.invoke(query))

The Tea Research Institute of Sri Lanka has issued the following guidelines:

1. **Guideline No: 03/2019** - Guidelines for Establishment of Tea to Facilitate Mechanization of Field Operations (Issued in December 2019).
2. **Guideline No: 05/2018** - Guideline on Irrigation of Tea Fields (Issued in December 2018). 

These guidelines address the mechanization of tea plantations and irrigation techniques for tea fields, respectively.


In [51]:
query = "Give me a numerical count of guidelines that Tea research institute of sri lanka has issued"
print(rag_chain.invoke(query))

The Tea Research Institute of Sri Lanka has issued at least two guidelines based on the provided context:

1. Guideline No: 03/2019 - Guidelines for Establishment of Tea to Facilitate Mechanization of Field Operations (Issued in December 2019).
2. Guideline No: 05/2018 - Guideline on Irrigation of Tea Fields (Issued in December 2018).

Therefore, the numerical count of guidelines is **2**.


# **Graph RAG (with graph database neo4j)**

Transform the text chunks into their corresponding knowledge graphs

In [52]:
# Transform the text_chunks into knowledge graph
!pip install -Uq langchain_experimental
# !pip install -Uq json-repair
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_core.documents import Document
import time

llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0, openai_api_key = openai.api_key)
llm_transformer = LLMGraphTransformer(llm=llm)

graph_documents = []

for text_chunk in text_chunks:
    document = [Document(page_content=text_chunk)]
    try:
      graph_documents.extend(llm_transformer.convert_to_graph_documents(document))
    except Exception as e:
      print(f"Error processing chunk: {e}")

    time.sleep(2) # Adjust the delay as needed to avoid rate limiting
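
The fixed two-second pause works for a handful of chunks. For larger inputs, one common alternative is to retry failed calls with an exponentially growing delay. The sketch below is an assumption about how that could look, not part of the original notebook; `call_with_backoff` and `flaky` are hypothetical names, and in the loop above the wrapped function would be the call to `llm_transformer.convert_to_graph_documents(document)`.

```python
import time

def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a function that fails twice before succeeding.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

delays = []
# Record the delays in a list instead of actually sleeping.
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```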


Store the generated knowledge graphs in a Neo4j graph database. The knowledge graphs can be visualized using the Neo4j Browser. Visit https://neo4j.com/ to create a Neo4j instance and access the Neo4j Browser.

In [53]:
# Store the generated graph documents into a neo4j graph database
!pip install -Uq langchain-community
!pip install -Uq neo4j

# Credentials for accessing the neo4j graph database. A neo4j instance
# needs to be created on-premise or in the cloud. For this notebook, a cloud
# instance has been created.
from google.colab import userdata
NEO4J_URI=userdata.get('NEO4J_URI')
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD= userdata.get('NEO4J_API_KEY')

#Instantiate an instance of neo4j graph database
from langchain_community.graphs import Neo4jGraph
graph=Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
)

graph.add_graph_documents(
    graph_documents,
    baseEntityLabel= True,
    include_source=True)

Look into the nodes and relationships of a knowledge graph. The knowledge graphs can be visualized using the Neo4j Browser.

In [61]:
len(graph_documents)

4

In [63]:
graph_documents[0].nodes

[Node(id='Tea', type='Crop', properties={}),
 Node(id='change worker attitudes', type='Objective', properties={}),
 Node(id='Japan', type='Country', properties={}),
 Node(id='sustainable crop production', type='Goal', properties={}),
 Node(id='increase worker productivity', type='Objective', properties={}),
 Node(id='agriculture', type='Field', properties={}),
 Node(id='complete tasks timely', type='Objective', properties={}),
 Node(id='ease manual operations', type='Objective', properties={}),
 Node(id='poor plant growth', type='Effect', properties={}),
 Node(id='Guideline No: 03/2019', type='Guideline', properties={}),
 Node(id='high planting density', type='Characteristic', properties={}),
 Node(id='Tea Research Institute of Sri Lanka', type='Organization', properties={}),
 Node(id='drought casualties', type='Effect', properties={}),
 Node(id='Guideline No: 05/2018', type='Guideline', properties={}),
 Node(id='competition for moisture', type='Effect', properties={}),
 Node(id='doubl

In [65]:
graph_documents[0].relationships

[Relationship(source=Node(id='Tea Research Institute of Sri Lanka', type='Organization', properties={}), target=Node(id='Guideline No: 03/2019', type='Guideline', properties={}), type='ISSUED_GUIDELINE', properties={}),
 Relationship(source=Node(id='Tea Research Institute of Sri Lanka', type='Organization', properties={}), target=Node(id='Guideline No: 05/2018', type='Guideline', properties={}), type='ISSUED_GUIDELINE', properties={}),
 Relationship(source=Node(id='Tea', type='Crop', properties={}), target=Node(id='labour-intensive plantation', type='Characteristic', properties={}), type='REQUIRES', properties={}),
 Relationship(source=Node(id='Mechanization', type='Process', properties={}), target=Node(id='agriculture', type='Field', properties={}), type='USED_IN', properties={}),
 Relationship(source=Node(id='Mechanization', type='Process', properties={}), target=Node(id='increase worker productivity', type='Objective', properties={}), type='AIMS_TO', properties={}),
 Relationship(so

Build a chain that generates Cypher queries to query the knowledge graph.

In [67]:
# Build a chain that generates Cypher queries to query the knowledge graph
from langchain.chains import GraphCypherQAChain
chain = GraphCypherQAChain.from_llm(llm=llm, graph=graph, verbose=True, allow_dangerous_requests=True)

Ask the same set of questions as for the naive RAG and see how the answers differ

In [68]:
# Ask a question
response = chain.invoke({"query" : "What are the aims of Mechanization?"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (p:Process)-[:AIMS_TO]->(o:Objective)
WHERE p.id = 'Mechanization'
RETURN o

Full Context:
[{'o': {'id': 'complete tasks timely'}}, {'o': {'id': 'improve product quality'}}, {'o': {'id': 'reduce cost of production'}}, {'o': {'id': 'change worker attitudes'}}, {'o': {'id': 'increase worker productivity'}}, {'o': {'id': 'ease manual operations'}}]

> Finished chain.


{'query': 'What are the aims of Mechanization?',
 'result': 'The aims of Mechanization are to complete tasks timely, improve product quality, reduce cost of production, change worker attitudes, increase worker productivity, and ease manual operations.'}

In [69]:
# Ask a question
response = chain.invoke({"query" : "What does Mechanization aim to do?"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (m:Concept {id: 'Mechanization'})-[:AIMS_TO]->(o:Objective)
RETURN o

Full Context:
[]

> Finished chain.


{'query': 'What does Mechanization aim to do?',
 'result': "I don't know the answer."}
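
The empty result here comes from a label mismatch: the generated Cypher looks for a node labelled `Concept`, whereas the extracted graph stores `Mechanization` with type `Process` (see the relationships listed earlier). The following sketch mimics that pattern matching over a few hand-copied triples. It is a toy model of the behaviour, not how Neo4j evaluates queries, and the `match` helper is hypothetical.

```python
# Each triple: ((source_label, source_id), relationship, (target_label, target_id))
triples = [
    (("Process", "Mechanization"), "AIMS_TO", ("Objective", "increase worker productivity")),
    (("Process", "Mechanization"), "AIMS_TO", ("Objective", "reduce cost of production")),
    (("Process", "Mechanization"), "USED_IN", ("Field", "agriculture")),
]

def match(triples, source_label, source_id, relationship, target_label):
    """Return target ids for (source_label {id})-[relationship]->(target_label)."""
    return [
        target[1]
        for source, rel, target in triples
        if source == (source_label, source_id)
        and rel == relationship
        and target[0] == target_label
    ]

# Correct label: both objectives are found.
print(match(triples, "Process", "Mechanization", "AIMS_TO", "Objective"))
# ['increase worker productivity', 'reduce cost of production']

# Wrong label, as in the generated query: nothing matches.
print(match(triples, "Concept", "Mechanization", "AIMS_TO", "Objective"))
# []
```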

In [70]:
# Ask a question
response = chain.invoke({"query" : "What are the benefits of Mechanization?"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (d:Document)-[:MENTIONS]->(c:Concept {id: 'Mechanization'})
RETURN d.text AS Benefits

Full Context:
[]

> Finished chain.


{'query': 'What are the benefits of Mechanization?',
 'result': "I don't know the answer."}

In [71]:
# Ask a question
response = chain.invoke({"query" : "What are the objectives of Mechanization?"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (o:Objective)<-[:AIMS_TO]-(p:Process)-[:PLAYS_ROLE_IN]->(:Concept {id: 'Mechanization'})
RETURN o

Full Context:
[]

> Finished chain.


{'query': 'What are the objectives of Mechanization?',
 'result': "I don't know the answer."}

In [72]:
# Ask a question
response = chain.invoke({"query" : "Which system is better for sri lanka? Single row planting or double hedge row planting?"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (ps1:Planting System {id: 'single row planting'})-[:PRACTICED_IN]->(c:Country {id: 'Sri Lanka'}),
      (ps2:Planting System {id: 'double hedge row planting'})-[:PRACTICED_IN]->(c)
RETURN ps1, ps2


CypherSyntaxError: {code: Neo.ClientError.Statement.SyntaxError} {message: Invalid input 'System': expected a parameter, '&', ')', ':', 'WHERE', '{' or '|' (line 2, column 21 (offset: 27))
"MATCH (ps1:Planting System {id: 'single row planting'})-[:PRACTICED_IN]->(c:Country {id: 'Sri Lanka'}),"
                     ^}
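
The syntax error arises because the label `Planting System` contains a space. In Cypher, a label with spaces or other special characters has to be wrapped in backticks. Below is a minimal sketch of a helper that applies this escaping when a query string is assembled by hand. The name `quote_label` is hypothetical and not part of LangChain or the Neo4j driver, and whether the graph contains such a label at all is a separate question.

```python
import re

def quote_label(label):
    """Wrap a Cypher label in backticks unless it is a plain identifier."""
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        return label
    # A backtick inside the name is escaped by doubling it.
    return "`" + label.replace("`", "``") + "`"

query = (
    f"MATCH (ps:{quote_label('Planting System')} {{id: $id}})"
    f"-[:PRACTICED_IN]->(c:{quote_label('Country')}) RETURN ps, c"
)
print(query)
# MATCH (ps:`Planting System` {id: $id})-[:PRACTICED_IN]->(c:Country) RETURN ps, c
```

The `$id` placeholder is a query parameter, so the node id is passed separately from the query text rather than being interpolated into it.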

In [73]:
# Ask a question
response = chain.invoke({"query" : "When was the guideline for mechanization of field operations published and also what is that guideline number?"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (d:Document)-[:MENTIONS]->(g:Guideline)
WHERE g.text CONTAINS 'mechanization of field operations'
RETURN d.text AS guideline_text, g.id AS guideline_number

Full Context:
[]

> Finished chain.


{'query': 'When was the guideline for mechanization of field operations published and also what is that guideline number?',
 'result': "I don't know the answer."}

In [74]:
# Ask a question
response = chain.invoke({"query" : "Where is Tea Research Institute of Sri Lanka located in?"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (org:Organization {id: 'Tea Research Institute of Sri Lanka'})-[:LOCATED_IN]->(loc:Location)
RETURN loc

Full Context:
[{'loc': {'id': 'Talawakelle'}}]

> Finished chain.


{'query': 'Where is Tea Research Institute of Sri Lanka located in?',
 'result': 'The Tea Research Institute of Sri Lanka is located in Talawakelle.'}

In [75]:
# Ask a question
response = chain.invoke({"query" : "What organization is located_in Talawakelle?"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (o:Organization)-[:LOCATED_IN]->(l:Location {id: 'Talawakelle'})
RETURN o

Full Context:
[{'o': {'id': 'Tea Research Institute of Sri Lanka'}}]

> Finished chain.


{'query': 'What organization is located_in Talawakelle?',
 'result': 'The Tea Research Institute of Sri Lanka is located in Talawakelle.'}

In [76]:
# Ask a question
response = chain.invoke({"query" : "Provide me the list of guidelines that Tea Research Institute of Sri Lanka has issued"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (org:Organization {id: 'Tea Research Institute of Sri Lanka'})-[:ISSUED_GUIDELINE]->(g:Guideline)
RETURN g

Full Context:
[{'g': {'id': 'Guidelines on Measures to be Adopted in Tea Lands Following a Drought'}}, {'g': {'id': 'Guideline No: 05/2018'}}, {'g': {'id': 'Guideline No: 03/2019'}}]

> Finished chain.


{'query': 'Provide me the list of guidelines that Tea Research Institute of Sri Lanka has issued',
 'result': 'The Tea Research Institute of Sri Lanka has issued the following guidelines: "Guidelines on Measures to be Adopted in Tea Lands Following a Drought," "Guideline No: 05/2018," and "Guideline No: 03/2019."'}

In [77]:
# Ask a question
response = chain.invoke({"query" : "Give me a numerical count of guidelines that Tea Research Institute of Sri Lanka has issued"})
response



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (org:Organization {id: 'Tea Research Institute of Sri Lanka'})-[:ISSUED_GUIDELINE]->(g:Guideline)
RETURN COUNT(g) AS guideline_count

Full Context:
[{'guideline_count': 3}]

> Finished chain.


{'query': 'Give me a numerical count of guidelines that Tea Research Institute of Sri Lanka has issued',
 'result': 'The Tea Research Institute of Sri Lanka has issued 3 guidelines.'}