<a href="https://colab.research.google.com/github/Khushwant-singh/llama-rag/blob/main/RAGLlama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pypdf



In [2]:
import pypdf

# Load the policy handbook PDF
reader = pypdf.PdfReader("/content/sample_data/globomantics_policies.pdf")
print(f"Loaded PDF with {len(reader.pages)} pages")

# Extract text from all pages
full_text = ""
for page_num, page in enumerate(reader.pages):
    page_text = page.extract_text()
    full_text += f"\n--- Page {page_num + 1} ---\n{page_text}"

print(f"Total characters extracted: {len(full_text):,}")

Loaded PDF with 4 pages
Total characters extracted: 7,799


In [3]:
!pip install langchain_text_splitters



In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create the text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Split the document
chunks = splitter.split_text(full_text)

print(f"Created {len(chunks)} chunks")
print(f"Average chunk size: {sum(len(c) for c in chunks) // len(chunks)} characters")



Created 14 chunks
Average chunk size: 645 characters
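
To see what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries the listed separators in order, so chunks tend to end at paragraph or sentence boundaries). The function name `sliding_chunks` is hypothetical and only illustrates the sliding window.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Toy input (hypothetical): the alphabet, split into windows of 10 with 3 shared characters
demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for piece in demo:
    print(piece)
```

The last three characters of each window reappear at the start of the next one, which is why a sentence cut at a chunk boundary can still be found whole in a neighbouring chunk.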


In [5]:
# Preview all chunks
print("Chunk previews:")
print("=" * 70)
for i, chunk in enumerate(chunks):
    preview = chunk[:55].replace('\n', ' ')
    print(f"Chunk {i+1:2d}: {len(chunk):4d} chars | {preview}...")

Chunk previews:
Chunk  1:  750 chars | --- Page 1 --- Globomantics Internal Policies Handbook ...
Chunk  2:  788 chars | 1.2 Travel Expense Limits Airfare: For domestic flights...
Chunk  3:  667 chars |  Use company-preferred hotel chains (Marriott, Hilton,...
Chunk  4:  727 chars | --- Page 2 --- Example Trip Expense Calculation: Confer...
Chunk  5:  795 chars | payment with later reimbursement. Expense Report Submis...
Chunk  6:  725 chars | International Travel: Requires approval from department...
Chunk  7:  258 chars | report. Conference and Training: Conference registratio...
Chunk  8:  786 chars | --- Page 3 --- 2. Remote Work and Equipment Request Gui...
Chunk  9:  715 chars | and 1080p external webcam. For workspace ergonomics, yo...
Chunk 10:  753 chars | Specialized Equipment (Role-Specific): Additional equip...
Chunk 11:  266 chars | supports the request. Step 2: Submit Request Log into i...
Chunk 12:  763 chars | --- Page 4 --- Step 3: Approval and Delivery Your manag...


In [6]:
# Find and display a chunk about hotels
for i, chunk in enumerate(chunks):
    if 'hotel' in chunk.lower():
        print(f"=== Chunk {i+1} (Hotel Policy) ===")
        print(chunk)
        break

=== Chunk 2 (Hotel Policy) ===
1.2 Travel Expense Limits
Airfare:
For domestic flights under 5 hours, economy class is required. International flights over 8 hours allow premium
economy seating. Business class requires VP approval and must be over 12 hours in duration. Book flights at
least 14 days in advance when possible and use the company's preferred booking portal at
travel.globomantics.com.
Hotel Accommodations:
 Standard limit: $200 per night in most US cities
 High-cost cities (NYC, SF, LA, Seattle): $300 per night
 International travel: Check destination-specific limits in the travel portal
 Extended stays (7+ nights): Consider corporate housing options
 Use company-preferred hotel chains (Marriott, Hilton, Hyatt)
Meals and Per Diem:
 Breakfast: Up to $15
 Lunch: Up to $25
 Dinner: Up to $50


In [7]:
!pip install sentence-transformers chromadb -q
print("Libraries installed")

Libraries installed


In [8]:
from sentence_transformers import SentenceTransformer

# Load embedding model (downloads ~80MB on first run)
print("Loading embedding model...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')
print("Embedding model is now loaded")

# Test with a sample sentence
test_embedding = embedder.encode("hotel limit for business travel")
print(f"Embedding dimensions: {len(test_embedding)}")

Loading embedding model...
Embedding model is now loaded
Embedding dimensions: 384


In [9]:
import numpy as np

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Compare three phrases
phrase1 = "hotel accommodation limits"
phrase2 = "lodging expense policy"
phrase3 = "equipment request process"

emb1 = embedder.encode(phrase1)
emb2 = embedder.encode(phrase2)
emb3 = embedder.encode(phrase3)

print(f"Similarity (hotel vs lodging):   {cosine_similarity(emb1, emb2):.3f}")
print(f"Similarity (hotel vs equipment): {cosine_similarity(emb1, emb3):.3f}")
print(f"Similarity (lodging vs equipment): {cosine_similarity(emb2, emb3):.3f}")

Similarity (hotel vs lodging):   0.597
Similarity (hotel vs equipment): -0.020
Similarity (lodging vs equipment): 0.149
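
For intuition about these scores: cosine similarity depends only on the angle between two vectors, not on their length. The sketch below recomputes it in plain Python on made-up 2-D vectors (hypothetical values, not real embeddings) for the three reference cases.

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # second vector is a scaled copy
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])        # vectors at a right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # vectors pointing opposite ways

print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
```

Read against this scale, a score near 0.6 suggests closely related phrases, while a score near 0 (such as the hotel vs equipment pair) suggests the phrases are essentially unrelated.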


In [10]:
import chromadb

# Create persistent database in current folder
client = chromadb.PersistentClient(path="policy_db")

# Create a collection for policy chunks
collection = client.get_or_create_collection(
    name="globomantics_policies",
    metadata={"description": "Globomantics company policy handbook"}
)

print(f"Collection created: {collection.name}")

Collection created: globomantics_policies


In [11]:
# Generate embeddings for all chunks
print("Generating embeddings for chunks...")
chunk_embeddings = embedder.encode(chunks)
print(f"Generated {len(chunk_embeddings)} embeddings")

# Add to collection (upsert, so re-running this cell does not create duplicate IDs)
collection.upsert(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=chunk_embeddings.tolist(),
    documents=chunks,
    metadatas=[{"chunk_index": i} for i in range(len(chunks))]
)

print(f"Added {collection.count()} chunks to database")

Generating embeddings for chunks...
Generated 14 embeddings
Added 14 chunks to database


In [12]:
def find_relevant_chunks(query, n_results=3):
    query_embedding = embedder.encode(query)
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results
    )
    return results['documents'][0], results['metadatas'][0]

# Test with a policy question
query = "What is the hotel limit for San Francisco?"
chunks_found, metadata = find_relevant_chunks(query)

print(f"Query: {query}\n")
# Note: chunk_index is 0-based, while the chunk previews earlier are numbered from 1
for i, (chunk, meta) in enumerate(zip(chunks_found, metadata)):
    print(f"--- Result {i+1} (Chunk {meta['chunk_index']}) ---")
    print(f"{chunk}...")
    print()

Query: What is the hotel limit for San Francisco?

--- Result 1 (Chunk 1) ---
1.2 Travel Expense Limits
Airfare:
For domestic flights under 5 hours, economy class is required. International flights over 8 hours allow premium
economy seating. Business class requires VP approval and must be over 12 hours in duration. Book flights at
least 14 days in advance when possible and use the company's preferred booking portal at
travel.globomantics.com.
Hotel Accommodations:
 Standard limit: $200 per night in most US cities
 High-cost cities (NYC, SF, LA, Seattle): $300 per night
 International travel: Check destination-specific limits in the travel portal
 Extended stays (7+ nights): Consider corporate housing options
 Use company-preferred hotel chains (Marriott, Hilton, Hyatt)
Meals and Per Diem:
 Breakfast: Up to $15
 Lunch: Up to $25
 Dinner: Up to $50...

--- Result 2 (Chunk 2) ---
 Use company-preferred hotel chains (Marriott, Hilton, Hyatt)
Meals and Per Diem:
 Breakfast: Up to 

In [13]:
test_queries = [
    "What equipment do hybrid employees get?",
    "Do I need receipts for meals?",
    "Can I book business class for international flights?"
]

for query in test_queries:
    chunks_found, _ = find_relevant_chunks(query, n_results=1)
    print(f"Q: {query}")
    print(f"→ {chunks_found[0][:100]}...\n")

Q: What equipment do hybrid employees get?
→ --- Page 3 ---
2. Remote Work and Equipment Request Guidelines
2.1 Equipment Eligibility
Globomantic...

Q: Do I need receipts for meals?
→  Use company-preferred hotel chains (Marriott, Hilton, Hyatt)
Meals and Per Diem:
 Breakfast: Up t...

Q: Can I book business class for international flights?
→ 1.2 Travel Expense Limits
Airfare:
For domestic flights under 5 hours, economy class is required. In...
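
Conceptually, `find_relevant_chunks` embeds the query and asks the vector store for the nearest stored vectors. The sketch below does the same kind of ranking by brute force over a tiny in-memory list, in plain Python. The 3-dimensional vectors and the names `toy_store` and `top_k` are made up for illustration. Chroma relies on an approximate nearest-neighbour index and its own configurable distance metric rather than this exhaustive loop, so this is only an approximation of the idea.

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

# Hypothetical toy "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions.
toy_store = [
    ("hotel limits", [0.9, 0.1, 0.0]),
    ("meal per diem", [0.6, 0.6, 0.1]),
    ("equipment requests", [0.0, 0.2, 0.9]),
]

def top_k(query_vector, store, k=2):
    """Return the k (text, score) pairs most similar to query_vector, best first."""
    scored = [(text, cosine(query_vector, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

results = top_k([1.0, 0.2, 0.0], toy_store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```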



In [14]:
!pip install transformers accelerate bitsandbytes -q

In [None]:
from transformers import pipeline, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load Llama (gated model: needs approved access on Hugging Face; loading takes 30-60 seconds)
llm = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    model_kwargs={"quantization_config": quantization_config},
    device_map="auto"
)
print("Llama is loaded!")


In [None]:
# Quick test
test = llm(
    [{"role": "user", "content": "Say 'Ready!'"}],
    max_new_tokens=10
)
print(test[0]["generated_text"][-1]["content"])
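
A rough back-of-the-envelope estimate shows why the 4-bit `BitsAndBytesConfig` matters on a Colab GPU. The sketch assumes about 8 billion parameters (read off the model name) and counts weight storage only. Activations, the KV cache, quantization metadata, and any layers kept in higher precision add to the real footprint, so treat the numbers as approximate.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 8e9  # assumption: "8B" in the model name means roughly 8 billion parameters

bf16_gb = weight_memory_gb(n_params, 16)  # unquantized bfloat16 weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized weights

print(f"bf16 weights:  ~{bf16_gb:.0f} GB")
print(f"4-bit weights: ~{nf4_gb:.0f} GB")
```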

In [None]:
def answer_question(question, n_chunks=3):
    # Step 1: Find relevant chunks
    chunks_found, metadata = find_relevant_chunks(question, n_results=n_chunks)
    context = "\n\n".join(chunks_found)

    # Step 2: Build the prompt
    messages = [
        {
            "role": "system",
            "content": """You are a helpful assistant answering questions about
Globomantics company policies. Answer based on the provided context. Be direct and specific.
If the context contains relevant information, provide it clearly.
If the context has no relevant information, say so."""
        },
        {
            "role": "user",
            "content": f"""Context from company policies:
{context}

Question: {question}

Answer based only on the context above:"""
        }
    ]

    # Step 3: Generate answer
    response = llm(messages, max_new_tokens=300, temperature=0.1, pad_token_id=llm.tokenizer.eos_token_id)
    answer = response[0]["generated_text"][-1]["content"]

    return {"answer": answer, "sources": metadata, "context_used": chunks_found}

In [None]:
# Test the QA system
result = answer_question("What is the hotel limit for San Francisco?")

print("Q: What is the hotel limit for San Francisco?\n")
print(result["answer"])

Q: What is the hotel limit for San Francisco?

The context does not specify the hotel limit for San Francisco. However, it does mention that San Francisco is a high-cost city, and the limit for such cities is $300 per night.

In [None]:
test_questions = [
    "What equipment do I get if I work from home 3 days per week?",
    "Do I need receipts for all my meals?",
    "Can I book business class for a 10-hour flight to London?"
]

for question in test_questions:
    result = answer_question(question)
    print(f"Q: {question}")
    print(f"A: {result['answer']}\n")

Below is the code for loading Gemma, another LLM, which can serve as an alternative if you do not have approved access to Llama.

In [None]:
from transformers import pipeline, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load Gemma (a free and open-source alternative)
gemma_llm = pipeline(
    "text-generation",
    model="google/gemma-2b-it",
    model_kwargs={"quantization_config": quantization_config},
    device_map="auto"
)
print("Gemma is loaded!")

Now that the language model (LLM) is loaded, we can combine it with our retrieval system to answer questions from the policy handbook. This approach is called Retrieval-Augmented Generation (RAG).

We will define a function that:
1. Takes a user query.
2. Uses the `find_relevant_chunks` function to retrieve policy sections related to the query.
3. Constructs a prompt for the LLM that includes the user's question and the retrieved policy text.
4. Uses the loaded `gemma_llm` pipeline (Gemma in this case) to generate an answer based on the provided context.

In [None]:
def answer_question_with_rag(query):
    # 1. Retrieve relevant chunks
    relevant_docs, _ = find_relevant_chunks(query, n_results=3)

    # 2. Format the retrieved documents for the LLM prompt
    context = "\n\n".join(relevant_docs)

    # 3. Construct the prompt for the LLM
    prompt = f"""You are an AI assistant that answers questions based on the provided policy document. If the answer is not available in the document, state that you don't know.

Policy Document:
{context}

User Question: {query}

Answer:"""

    # 4. Generate the response using the Gemma pipeline (`gemma_llm`) loaded above
    response = gemma_llm(prompt, max_new_tokens=256, temperature=0.1)[0]['generated_text']

    # The LLM might repeat the prompt or add its own prompt, so we extract just the answer.
    # We look for the last 'Answer:' and take everything after it.
    try:
        answer_start = response.rindex("Answer:") + len("Answer:")
        generated_answer = response[answer_start:].strip()
    except ValueError:
        generated_answer = response.strip() # If 'Answer:' isn't found, use the whole text.

    return generated_answer

# Test the RAG function with a sample query
query = "What is the hotel limit for San Francisco?"
rag_answer = answer_question_with_rag(query)

print(f"User Query: {query}")
print(f"\nRAG Answer:\n{rag_answer}")
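
The prompt construction and answer extraction used in `answer_question_with_rag` can be checked without loading a model. The sketch below is a simplified stand-in, not the notebook's function: `build_prompt`, `extract_answer`, and `fake_llm` are hypothetical names, and `fake_llm` only mimics a text-generation pipeline that echoes the prompt before its reply.

```python
def build_prompt(context_chunks, question):
    """Join retrieved chunks and wrap them in the same prompt layout used above."""
    context = "\n\n".join(context_chunks)
    return (
        "You are an AI assistant that answers questions based on the provided policy document. "
        "If the answer is not available in the document, state that you don't know.\n\n"
        f"Policy Document:\n{context}\n\n"
        f"User Question: {question}\n\n"
        "Answer:"
    )

def extract_answer(generated_text):
    """Keep only the text after the last 'Answer:' marker, or everything if it is missing."""
    marker = "Answer:"
    position = generated_text.rfind(marker)
    if position == -1:
        return generated_text.strip()
    return generated_text[position + len(marker):].strip()

# Hypothetical stand-in for the model: echoes the prompt and appends a canned reply.
def fake_llm(prompt):
    return prompt + " $300 per night."

prompt = build_prompt(
    ["High-cost cities (NYC, SF, LA, Seattle): $300 per night"],
    "What is the hotel limit for San Francisco?",
)
answer = extract_answer(fake_llm(prompt))
print(answer)
```

One caveat of splitting on the last `Answer:` marker: if the model's own reply contains that string, everything before it is discarded, so the extracted text may be truncated.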