# Smart Document Assistant (RAG) with Llama 2

**Goal:** Build a system that can answer questions based on PDF documents using:
- **Retrieval-Augmented Generation (RAG)**
- **FAISS vector database**
- **A local Llama 2 model**
- **A `PromptTemplate` that enforces concise, document-grounded answers**

**Pipeline:**
PDFs → Clean → Chunk → Embeddings → FAISS → Retriever → Llama 2 → Answer

In [1]:
!pip install -q langchain langchain-community pypdf sentence-transformers faiss-cpu ctransformers streamlit


# Importing libraries

In [2]:
import os
import re
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import CTransformers
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Functions for cleaning text

In [3]:
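def clean_text(text: str) -> str:
    """Normalize raw page text extracted from a PDF."""
    text = re.sub(r'http\S+', '', text)        # drop URLs
    text = re.sub(r'\[\d+\]', '', text)        # drop numeric citation markers such as [12]
    text = re.sub(r'[^\w\s.,?!-]', '', text)   # keep word characters, whitespace, and basic punctuation
    text = re.sub(r'\s+', ' ', text)           # collapse whitespace last so the removals leave no double spaces
    return text.strip()

def is_useful_page(text: str, min_length: int = 50) -> bool:
    """Treat pages with min_length characters or fewer as noise (blank or near-empty pages)."""
    return len(text) > min_length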

# Loading and processing PDF files

In [4]:
DATA_PATH = "/kaggle/input/nlp-and-llm-related-arxiv-papers"
MAX_FILES = 270

documents = []
processed = 0

for root, _, files in os.walk(DATA_PATH):
    if processed >= MAX_FILES:
        break  # stop walking remaining directories once the limit is reached

    for file in files:
        if processed >= MAX_FILES:
            break

        if file.endswith(".pdf"):
            pdf_path = os.path.join(root, file)
            try:
                loader = PyPDFLoader(pdf_path)
                pages = loader.load()

                for doc in pages:
                    cleaned = clean_text(doc.page_content)
                    if is_useful_page(cleaned):
                        doc.page_content = cleaned
                        doc.metadata["source"] = file
                        documents.append(doc)

                processed += 1
                if processed % 10 == 0:
                    print(f"{processed} PDFs processed")

            except Exception as e:
                print(f"Error processing {file}: {e}")

print(f"Total pages loaded: {len(documents)}")

10 PDFs processed
20 PDFs processed


30 PDFs processed
40 PDFs processed


50 PDFs processed
60 PDFs processed
70 PDFs processed
80 PDFs processed
90 PDFs processed
100 PDFs processed
110 PDFs processed
120 PDFs processed
130 PDFs processed
140 PDFs processed
150 PDFs processed


160 PDFs processed
170 PDFs processed
180 PDFs processed
190 PDFs processed
200 PDFs processed
210 PDFs processed
220 PDFs processed
230 PDFs processed
240 PDFs processed
250 PDFs processed
260 PDFs processed
270 PDFs processed
Total pages loaded: 7583


# Splitting text into chunks

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

chunks = text_splitter.split_documents(documents)
print(f"Total text chunks: {len(chunks)}")

Total text chunks: 58616
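
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on separators such as paragraphs and sentences). The helper name `sliding_chunks` and the synthetic sample text are assumptions made for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Synthetic 1200-character text (hypothetical sample, not from the dataset)
sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
```

Each window shares its last 50 characters with the start of the next one, so a sentence cut at a boundary still appears whole in at least one chunk when it is short enough.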


# Creating embeddings and FAISS vector store

In [6]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}
)

vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index_arxiv")

print("FAISS Vector Store saved successfully")

FAISS Vector Store saved successfully
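
At query time the vector store embeds the question and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that idea with a brute-force search in plain Python. It assumes Euclidean distance as the metric, and the three-dimensional vectors, the `toy_index` dictionary, and the `top_k` helper are all made up for illustration (the real embeddings have many more dimensions, and FAISS uses optimized index structures).

```python
import math

def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k entries closest to the query by Euclidean distance."""
    scored = [(name, math.dist(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical toy "index": chunk name -> embedding vector
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.9, 0.1, 0.0],
}

nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(nearest)
```

The `k` argument plays the same role as the `k` passed to the retriever through `search_kwargs`: it caps how many chunks are handed to the language model.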


# Preparing the Llama 2 model

In [7]:
llm = CTransformers(
    model="TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_K_M.gguf",
    model_type="llama",
    config={
        "temperature": 0.0,
        "max_new_tokens": 256,
        "context_length": 2048
    }
)

# Setting up Prompts

In [8]:
qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are a STRICT document-based QA system.

Rules:
1. Answer ONLY using information explicitly stated in the context.
2. Do NOT use external knowledge.
3. Do NOT infer or generalize.
4. If the answer is NOT mentioned verbatim, reply ONLY with:
   "Not mentioned in the documents."

Context:
{context}

Question:
{question}

Answer (bullet points only):
"""
)

# Creating a RAG Chain

In [9]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # place all retrieved chunks into a single prompt
    retriever=vector_store.as_retriever(
        search_kwargs={"k": 5}
    ),
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": qa_prompt
    }
)

print("RAG system is ready")

RAG system is ready
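
Conceptually, the `"stuff"` chain type joins the retrieved chunks into one context string and substitutes it, together with the question, into the prompt template. The following is a rough sketch of that step in plain Python, not LangChain's actual implementation. The shortened template, the two example chunk strings, and the `stuff_prompt` helper are hypothetical, and the blank-line separator between chunks is an assumption.

```python
# Shortened, hypothetical version of the QA template used for illustration
template = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer (bullet points only):\n"

# Hypothetical stand-ins for the text of retrieved chunks
retrieved = ["LLMs can hallucinate facts.", "Larger models cost more to serve."]

def stuff_prompt(template: str, chunks: list[str], question: str) -> str:
    """Join all chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

prompt_text = stuff_prompt(template, retrieved, "What are the limitations of LLMs?")
print(prompt_text)
```

With `k=5` and chunks of roughly 500 characters, the context alone is on the order of 2,500 characters, and it has to fit inside the model's `context_length` of 2048 tokens together with the instructions and the 256 new tokens reserved for the answer.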


# Testing the system

In [10]:
my_question = "What are the limitations of Large Language Models?"

response = qa_chain.invoke({"query": my_question})

print("\nAnswer:")
print(response["result"])

print("\nSources:")
for i, doc in enumerate(response["source_documents"], 1):
    print(f"{i}. {doc.metadata['source']} (page {doc.metadata.get('page', 'Unknown')})")


Answer:
• In most cases, larger models bring better performance.
• There are still many exceptions that should be considered when choosing the appropriate model.
• On certain tasks, with the size of LLMs increasing, the performance begins to decrease.
• Large language models not only continue to improve as we scale in terms of data or computational budget but also acquire new abilities.

Sources:
1. GLaM- Efficient Scaling of Language Models with Mixture-of-Experts.pdf (page 8)
2. Harnessing the Power of LLMs in Practice- A Survey on ChatGPT and Beyond.pdf (page 11)
3. A Survey of Large Language Models.pdf (page 59)
4. BloombergGPT- A Large Language Model for Finance.pdf (page 39)
5. Transcending Scaling Laws with 0.1 Extra Compute.pdf (page 2)


# Function to evaluate answers against documents

In [11]:
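def context_overlap_score(answer, docs):
    """Return the fraction of answer words found in the retrieved context.

    Matching is by substring against the joined context string, so this is a
    rough lexical check rather than a measure of factual correctness.
    """
    context = " ".join(doc.page_content for doc in docs).lower()
    answer_words = answer.lower().split()
    overlap = sum(1 for w in answer_words if w in context)
    return overlap / max(len(answer_words), 1)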

score = context_overlap_score(
    response["result"],
    response["source_documents"]
)

print(f"\nAnswer–Context Overlap Score: {score:.2f}")


Answer–Context Overlap Score: 0.89
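
Because `w in context` tests for a substring, a short word such as "a" counts as a hit whenever that letter occurs anywhere in the context, which can inflate the score. A stricter variant, sketched below, compares whole-word tokens instead. The name `token_overlap_score` is hypothetical, and the function takes plain strings rather than document objects so that it runs on its own.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())

def token_overlap_score(answer: str, contexts: list[str]) -> float:
    """Fraction of answer tokens that appear as whole tokens in the contexts."""
    context_tokens = set()
    for text in contexts:
        context_tokens.update(tokenize(text))
    answer_tokens = tokenize(answer)
    hits = sum(1 for tok in answer_tokens if tok in context_tokens)
    return hits / max(len(answer_tokens), 1)

# "a" and "cat" are substrings of the context but not whole words in it
demo = token_overlap_score("A cat sat.", ["The category was saturated."])
print(f"{demo:.2f}")
```

In the notebook it could be called as `token_overlap_score(response["result"], [d.page_content for d in response["source_documents"]])`. Both versions measure lexical overlap only, so a high score does not guarantee that the answer actually addresses the question.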
