# Building a RAG System on Ramayana PDF

### Goals of this Notebook

We will first load an LLM and check the responses it generates without help from RAG. Then we will:

1. Load a Ramayana PDF document  
2. Convert it into searchable text  
3. Create vector embeddings  
4. Build a Retrieval Augmented Generation (RAG) pipeline  
5. Ask questions to a model **without RAG**  
6. Ask the same questions **with RAG**  
7. Evaluate the quality improvement  
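To make the steps above concrete, here is a minimal, self-contained sketch of the retrieve-then-prompt flow. It is not the pipeline built in this notebook: the bag-of-words `embed` function is a toy stand-in for sentence-transformer embeddings, the in-memory list stands in for the Chroma vector store, and the three passages are illustrative sentences rather than text extracted from the PDF. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_rag_prompt(question, context_chunks):
    """Place the retrieved context ahead of the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Illustrative passages (not taken from the PDF used in this notebook)
chunks = [
    "Tara is the queen of Kishkindha and the wife of the vanara king Vali.",
    "Hanuman leaps across the ocean to Lanka in search of Sita.",
    "Rama breaks the bow of Shiva and wins Sita's hand in marriage.",
]
question = "Who is Queen Tara?"
top = retrieve(question, chunks, k=1)
prompt = build_rag_prompt(question, top)
print(prompt)
```

The only difference between asking "without RAG" and "with RAG" is the prompt: in the second case the model sees the retrieved passage before the question.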



# Ramayana RAG System – Using Mistral (Local LLM)

This notebook builds a full Retrieval Augmented Generation system using:

- Mistral 7B (local via llama-cpp)
- Chroma Vector Database
- Sentence Transformer embeddings
- PyMuPDF PDF loader

In [20]:
# Install llama-cpp-python with GPU (cuBLAS) support
# Run this cell only when a GPU is available
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for llama-cpp-python (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-community 0.0.13 requires numpy<2,>=1, but you have numpy 2.4.2 which is incompatible.
langchain 0.1.1 requires numpy<2,>=1, but you have numpy 2.4.2 which is incompatible.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 3.0.0 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.4.2 which is incompatible.
xarray 2025.12.0 requires packaging>=24.1, but you have packaging 23.2 which is incompatible.
bqplot 0.12.45 requires pandas<3.0.0,>=1.0.0, but you have pandas 3.0.0 which is incompatible.
opentelemetry-exporte

In [1]:
# For installing the libraries & downloading models from HF Hub
!pip install --upgrade pip -q

!pip install \
huggingface_hub \
pandas \
tiktoken \
pymupdf \
langchain \
langchain-community \
chromadb \
sentence-transformers \
llama-cpp-python -q


In [2]:
#Libraries for processing dataframes,text
import json,os
import tiktoken
import pandas as pd

#Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

#Libraries for downloading and loading the llm
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

## Question Answering using LLM

#### Downloading and Loading the model

In [3]:
#Using Mistral model
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"  # the model is in gguf format

In [4]:
# Using hf_hub_download to download a model from the Hugging Face model hub
# The repo_id parameter specifies the model name or path in the Hugging Face repository
# The filename parameter specifies the name of the file to download
model_path = hf_hub_download(
    repo_id=model_name_or_path,  # Hugging Face repository that hosts the model
    filename=model_basename  # GGUF file to download from that repository
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


mistral-7b-instruct-v0.2.Q6_K.gguf:   0%|          | 0.00/5.94G [00:00<?, ?B/s]

In [5]:
#Load model with GPU support
llm = Llama(
    model_path=model_path,
    n_ctx=2300, # Context window
    n_gpu_layers=38,
    n_batch=512
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
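The context window set by `n_ctx` has to hold both the prompt and the tokens the model generates, so a larger `max_tokens` leaves less room for the prompt (and, later, for retrieved context). The arithmetic can be sketched as follows; `max_prompt_tokens` is a hypothetical helper, not part of llama-cpp-python.

```python
def max_prompt_tokens(n_ctx, max_tokens):
    """Tokens left for the prompt once room for the completion is reserved."""
    if max_tokens >= n_ctx:
        raise ValueError("max_tokens must be smaller than n_ctx")
    return n_ctx - max_tokens

# With the n_ctx=2300 used above:
budget_short = max_prompt_tokens(n_ctx=2300, max_tokens=128)   # 2172 tokens for the prompt
budget_long = max_prompt_tokens(n_ctx=2300, max_tokens=1024)   # 1276 tokens for the prompt
print(budget_short, budget_long)
```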


#### Response

In [6]:
#Reusable function to generate responses from the LLM
#- Handles all inference parameters in one place
#- Returns just the text of the first completion
def response(query,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    model_output = llm(
      prompt=query,
      max_tokens=max_tokens, #Maximum number of tokens to generate
      temperature=temperature, #Controls randomness
      top_p=top_p, #picks from top tokens that make up top_p of total probability
      top_k=top_k #considers only the top_k most likely tokens
    )

    return model_output['choices'][0]['text']
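To build intuition for `temperature`, `top_k` and `top_p`, the sketch below applies them to a tiny hand-made set of token scores. It is a conceptual illustration only: the function name and the example logits are made up, and the exact order and implementation of these steps inside llama.cpp may differ.

```python
import math

def filter_and_scale(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Return the renormalised candidate distribution after temperature, top-k and top-p."""
    if temperature <= 0:
        # Temperature 0 is treated here as greedy decoding: only the best token survives.
        best = max(logits, key=logits.get)
        return {best: 1.0}
    # Temperature rescales the logits: values above 1 flatten the distribution, below 1 sharpen it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    ranked = ranked[:top_k]  # top-k: keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for tok, prob in ranked:  # top-p: keep the smallest prefix whose mass reaches top_p
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}

# Hypothetical next-token scores
logits = {"Rama": 2.0, "Sita": 1.0, "Lanka": 0.0, "bow": -1.0}
greedy = filter_and_scale(logits, temperature=0)
top2 = filter_and_scale(logits, temperature=1.0, top_k=2, top_p=1.0)
nucleus = filter_and_scale(logits, temperature=1.0, top_k=50, top_p=0.5)
print(greedy, top2, nucleus)
```

With `temperature=0`, as in the default of `response`, the other two parameters have no effect in this sketch because only the single most likely token remains.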

#### Sample Questions


    1. Who is Hanuman and what role does he play in the Ramayana?
    2. What happened in the battle between Rama and Ravana?
    3. Tell me about Sita's character in the Ramayana.
    4. Who is Queen Tara and what happens to her?

In [8]:
#Storing sample questions in variables for ease of use
qn1 = "Who is Hanuman and what role does he play in the Ramayana?"
qn2 = "What happened in the battle between Rama and Ravana?"
qn3 = "Tell me about Sita's character in the Ramayana."
qn4 = "Who is Queen Tara and what happens to her?"

In [9]:
#Question1
response(qn1)

'\n\nHanuman is a central character in the Indian epic Ramayana, which narrates the adventures of Prince Rama, an incarnation of the Hindu god Vishnu. Hanuman is known as the monkey god or the vanara god and is revered for his devotion to Lord Rama, his strength, and his intelligence.\n\nHanuman was born to Anjani, a vanara (monkey) princess, and the wind god, Vayu. He grew up in the forest with other monkeys and apes. Hanuman is best known for his unwavering'

In [10]:
#Question2
response(qn2)

Llama.generate: prefix-match hit


"\n\nRavana, the king of Lanka, was a powerful demon king who had abducted Sita, wife of Rama. Rama, along with his brother Lakshmana, set out on a journey to rescue Sita. They reached the forest of Dandaka where they met Sage Valmiki and received blessings from him.\n\nRavana, who was aware of Rama's mission, sent his brother Vibhishana to invite Rama for peace talks. Rama agreed but on the condition that Sita should be present during the talks. Ravana agreed"

In [11]:
#Question3
response(qn3)

Llama.generate: prefix-match hit


'\n\nSita, also known as Janaki or Seetha, is one of the most revered and beloved characters in Hindu mythology. She is the wife of Lord Rama, an avatar of Vishnu, and is considered an ideal woman and a paragon of virtue and devotion.\n\nAccording to the epic Ramayana, Sita was born in the kingdom of Mithila to King Janaka and Queen Sunanda. She was discovered by sage Valmiki while she was playing in the forest as a child, and he predicted that she would one day become the wife of an avatar'

In [12]:
#Question4
response(qn4)

Llama.generate: prefix-match hit


'\n\nQueen Tara, also known as Taranis, is a character in the 1982 animated film The Dark Crystal. She is the queen of the Gelfling tribe of Em-Kai and the wife of King Kermit. When the Skeksis gain control of the Crystal of Truth, they use it to corrupt the Saplings, which are the source of life for the Gelflings. Queen Tara becomes ill as a result of this corruption and eventually dies.\n\nThe main character of the story, Kermit the Gelfling, sets out on a'

**Observations on the Model**

- This establishes the baseline performance of the **Mistral language model** without any prompt engineering or retrieval augmentation. The intent was to see how well the base model alone could answer Ramayana queries from its general training knowledge.
- The model produced relevant, coherent responses for the first three queries. For the last query (Queen Tara) it answered out of context, describing a character from an unrelated film instead of the Ramayana.
- Apart from that one answer, the content was somewhat accurate, but every response was incomplete: each stopped mid-sentence (generation was capped at the default `max_tokens=128`), lacked depth, and stayed generic.
- Overall the model delivers useful information, but its responses are high-level and read more like general answers than context-specific ones.
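Mid-sentence endings like the ones observed above can be detected programmatically. The sketch below assumes the completion dictionary carries an OpenAI-style `finish_reason` field (`"length"` when the token limit was hit); this should be verified against the installed llama-cpp-python version. Both helper names and the mock output are hypothetical.

```python
def was_truncated(model_output):
    """True when generation stopped because the max_tokens limit was reached."""
    return model_output["choices"][0].get("finish_reason") == "length"

def trim_to_last_sentence(text):
    """Drop a trailing partial sentence, keeping text up to the last '.', '!' or '?'."""
    cut = max(text.rfind(ch) for ch in ".!?")
    return text[:cut + 1].strip() if cut != -1 else text.strip()

# Mock output shaped like a completion dictionary (assumption, not a real model call)
mock_output = {
    "choices": [{
        "text": "\n\nHanuman is a vanara. He is best known for his unwavering",
        "finish_reason": "length",
    }]
}
truncated = was_truncated(mock_output)
cleaned = trim_to_last_sentence(mock_output["choices"][0]["text"])
print(truncated, cleaned)
```

Trimming only hides the symptom; raising `max_tokens` is what lets the model finish its answer.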

## Question Answering using LLM with Prompt Engineering

#### Defining system prompt ####

- Adds structure and guidance to LLM responses
- Sets expectations for accuracy with respect to the Ramayana
- Improves response format and quality
- Still no external context, just better instructions

#### Creating response function for 5 different combinations ####

Taking **response1** as the base function, we create 4 other combinations and compare their outputs with it.
**response1** has the following parameters:

 **max_tokens** = 1024, **temp** = 1.1, **top_p** = 0.95, **top_k** = 50

- **response2** - changed temp value from 0.7 to 0.0
- **response3** - changed top_p value from 0.95 to 0.85
- **response4** - changed top_k value from 50 to 80
- **response5** - changed max_tokens from 1024 to 512
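One way to keep the five combinations manageable is to store the base parameters once and describe each variant as a single override. This is a sketch of that idea, not the notebook's own implementation; the base values are copied from the **response1** parameter line above, and the dictionary names are hypothetical.

```python
# Base parameters (copied from the response1 parameter line above)
BASE_PARAMS = {"max_tokens": 1024, "temperature": 1.1, "top_p": 0.95, "top_k": 50}

# Each variant changes exactly one parameter relative to the base
OVERRIDES = {
    "response1": {},
    "response2": {"temperature": 0.0},
    "response3": {"top_p": 0.85},
    "response4": {"top_k": 80},
    "response5": {"max_tokens": 512},
}

# Merge the base with each override to get the full parameter set per variant
variants = {name: {**BASE_PARAMS, **change} for name, change in OVERRIDES.items()}

for name, params in variants.items():
    print(name, params)
```

Because the keys match the keyword arguments of the `response` function defined earlier, a variant could be applied as `response(query, **variants["response3"])`.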

In [None]:
#Creating system instruction
system_message = """"""  #code to define the system prompt

#Create system prompt with instructions for the model
#system_prompt = f"[INST]<<SYS>>\n{system_message}\n<</SYS>>[/INST]"
system_prompt = f"[INST]<<SYS>>\n{system_message}\n<</SYS>>\n"
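The `system_prompt` string above leaves the `[INST]` block open, which suggests the user question is appended before a closing `[/INST]`. The sketch below shows one way to assemble the full prompt under that assumption. `build_prompt` is a hypothetical helper, the placement of `[/INST]` is a guess at the intended layout, and the system message is a placeholder rather than the notebook's actual prompt.

```python
def build_prompt(system_message, question):
    """Combine a system message and a user question into one instruction prompt."""
    system_prompt = f"[INST]<<SYS>>\n{system_message}\n<</SYS>>\n"
    # Assumption: the question follows the system block and [/INST] closes the instruction
    return f"{system_prompt}{question} [/INST]"

# Placeholder system message for illustration only
example_system_message = "Answer questions about the Ramayana concisely."
example_prompt = build_prompt(
    example_system_message,
    "Who is Queen Tara and what happens to her?",
)
print(example_prompt)
```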
