## NLP Assignment 2 PS 14
## Group 94

| Sr No | Name               | BITS ID        | Contribution |
|------:|--------------------|----------------|--------------|
| 1     | Vinay Bora         | 2024ad05062    | 100%         |
| 2     | Thiyagesh D        | 2024ad05395    | 100%         |
| 3     | Sajal Jain         | 2024ac05874    | 100%         |
| 4     | Pavithra R         | 2024ad05121    | 100%         |
| 5     | Gaikwad Priyanka P | 2024ad05316    | 100%         |


In [23]:
# Install the required libraries (uncomment on the first run)
# !pip install -U sentence-transformers chromadb ragas langchain-community
# !pip install transformers accelerate bitsandbytes


# 1. The Ingestion Pipeline

In [24]:
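# Read the ebook text from Project Gutenberg
import requests

file_url = "https://www.gutenberg.org/ebooks/55695.txt.utf-8"
response = requests.get(file_url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors
content = response.content  # raw bytes; decoded to str later, before embedding
print(content[:100])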

b'\xef\xbb\xbfThe Project Gutenberg eBook of Pioneer Saturn Encounter\r\n    \r\nThis ebook is for the use of anyon'


### Chunking Strategy

In [25]:
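# Fixed-size chunking with overlap
# Note: `content` is a bytes object, so both sizes are measured in bytes, not characters
chunk_size = 1000
overlap_size = 200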

chunks = []
for i in range(0, len(content), chunk_size - overlap_size):
    chunk = content[i : i + chunk_size]
    chunks.append(chunk)

# Display the first few chunks and their lengths
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1} (length {len(chunk)}):\n{chunk[:200]}\n...")

Chunk 1 (length 1000):
b'\xef\xbb\xbfThe Project Gutenberg eBook of Pioneer Saturn Encounter\r\n    \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no res'
...
Chunk 2 (length 1000):
b'line Distributed\r\n        Proofreading Team at http://www.pgdp.net\r\n\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK PIONEER SATURN ENCOUNTER ***\r\n\r\n                        PIONEER SATURN ENCOUNTER\r\n\r\n\r\n '
...
Chunk 3 (length 1000):
b'd 11 spacecraft, launched in 1972 and 1973,\r\nrespectively, were well named: they made the first crossings of the\r\nasteroid belt and were the first to encounter Jupiter and its intense\r\nradiation belts'
...


In [26]:
print(f"Total number of chunks: {len(chunks)}")

Total number of chunks: 72
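
The loop above slices raw bytes, so a chunk boundary can fall inside a multi-byte UTF-8 character, which the later `decode(..., errors='ignore')` drops silently. A minimal sketch of the same fixed-size strategy applied to decoded text is shown below. The helper name `chunk_text` and the short sample string are illustrative assumptions, not part of the notebook; the `utf-8-sig` codec is used because it strips the leading byte-order mark visible in the first chunk.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# Illustrative sample (hypothetical); 'utf-8-sig' drops a leading BOM if one is present.
raw = b"\xef\xbb\xbfPioneer 11 reached Saturn in 1979."
text = raw.decode("utf-8-sig")
demo_chunks = chunk_text(text, chunk_size=10, overlap=4)
print(demo_chunks[:2])
```

Decoding once up front also means the chunks can be passed to the embedding model directly, without a per-chunk decode step.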


### Vectorization

In [28]:
# Load a pre-trained embedding model
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

model_name = 'all-MiniLM-L6-v2'
embedding_model = SentenceTransformer(model_name)

print(f"Embedding model '{model_name}' loaded successfully.")

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Embedding model 'all-MiniLM-L6-v2' loaded successfully.


In [29]:
# Generate embeddings for the chunks
# The SentenceTransformer expects strings, so decode the byte chunks
chunk_strings = [chunk.decode('utf-8', errors='ignore') for chunk in chunks]
embeddings = embedding_model.encode(chunk_strings, show_progress_bar=True)

print(f"Generated {len(embeddings)} embeddings, each with dimension {embeddings.shape[1]}.")
print(f"First embedding (first 5 values): {embeddings[0][:5]}")

Generated 72 embeddings, each with dimension 384.
First embedding (first 5 values): [-0.05332715 -0.04049236  0.06842367 -0.00429583  0.02655271]


### Storage

In [30]:
# Install chromadb
# !pip install chromadb

import chromadb

# Initialize an in-memory ChromaDB client
client = chromadb.Client()

# Create a collection
collection_name = "pioneer_saturn_chunks"
# Check if collection already exists to prevent errors on re-execution
if collection_name in [col.name for col in client.list_collections()]:
    client.delete_collection(name=collection_name)
collection = client.create_collection(name=collection_name,configuration={
        "hnsw": {
            "space": "cosine"
        }
    }, metadata={"hnsw:space": "cosine"})

# Prepare data for ChromaDB
# ChromaDB expects string IDs, so we'll use a simple index as ID
ids = [f"chunk_{i}" for i in range(len(chunk_strings))]

# Add the documents and embeddings to the collection
collection.add(
    documents=chunk_strings,
    embeddings=embeddings.tolist(), # Convert numpy array to list for ChromaDB
    ids=ids
)

print(f"Successfully created ChromaDB collection '{collection_name}' with {collection.count()} documents.")

Successfully created ChromaDB collection 'pioneer_saturn_chunks' with 72 documents.


# 2. The Retrieval Engine

In [31]:
def semantic_search(query: str, k: int = 3) -> list[dict]:
    """Return the k chunks closest to the query, each with its id, text, and distance."""
    # Embed the user query
    query_embedding = embedding_model.encode([query]).tolist()

    # Perform the similarity search (ids are always returned by ChromaDB)
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k,
        include=['documents', 'distances']
    )

    # Extract and return relevant chunks
    retrieved_chunks = []
    if results and results['documents']:
        for i in range(len(results['documents'][0])):
            chunk_id = results['ids'][0][i]
            document = results['documents'][0][i]
            distance = results['distances'][0][i]
            retrieved_chunks.append({
                "id": chunk_id,
                "document": document,
                "distance": distance
            })
    return retrieved_chunks

# Demonstrate the function with an example query
query_example = "What was the mission of Pioneer 11?"
relevant_chunks = semantic_search(query_example, k=3)

print(f"\nQuery: '{query_example}'")
print(f"\nTop {len(relevant_chunks)} relevant chunks:")
for i, chunk in enumerate(relevant_chunks):
    print(f"\n--- Chunk {i+1} (ID: {chunk['id']}, Distance: {chunk['distance']:.4f}) ---")
    print(chunk['document'][:100] + "...") # Print first 100 characters for brevity


Query: 'What was the mission of Pioneer 11?'

Top 3 relevant chunks:

--- Chunk 1 (ID: chunk_4, Distance: 0.3987) ---
                          INTRODUCTION


We have entered into a new era of space exploration. Mis...

--- Chunk 2 (ID: chunk_2, Distance: 0.4432) ---
d 11 spacecraft, launched in 1972 and 1973,
respectively, were well named: they made the first cros...

--- Chunk 3 (ID: chunk_6, Distance: 0.4946) ---
 penetrate deep below the Jovian clouds.

In the coming years, each of these follow-on missions wi...


# 3. The Generation Component

### Prompt Engineering

In [32]:
def llm_prompt(context: str, question: str) -> str:
  prompt = f"""You are a helpful assistant. Please answer the following question based ONLY on the provided context.
  If the answer is not found in the context, please state that you don't have enough information.

  Context:
  {context}

  Question:
  {question}

  Answer:
  """
  return prompt


### LLM Integration

For this demonstration, we use `google/flan-t5-base`, a lightweight instruction-tuned sequence-to-sequence model from Hugging Face, as the LLM that answers the `final_prompt_for_llm`.

In [33]:
# Install necessary libraries
# !pip install transformers accelerate bitsandbytes

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Openly accessible, lightweight seq2seq model.
# Heavier alternative (a causal LM, which would need AutoModelForCausalLM instead):
# "mistralai/Mistral-7B-Instruct-v0.2"
model_name = "google/flan-t5-base"

# The T5 tokenizer already defines its own pad token, so it is left unchanged.
# Setting pad_token = eos_token with left padding is only needed for causal models such as Mistral.
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

print(f"LLM model '{model_name}' loaded successfully.")

LLM model 'google/flan-t5-base' loaded successfully.


In [34]:
def generate_response(question: str) -> tuple[str, str]:
    """Answer a question with the RAG pipeline and return (answer, context)."""
    # 1. Retrieve the top 3 relevant chunks using the semantic search function
    retrieved_info = semantic_search(question, k=3)
    print("Similarity Search Done!")

    # 2. Compile the retrieved chunks into a single context string
    context_parts = [chunk['document'] for chunk in retrieved_info]
    context = "\n\n".join(context_parts)

    # 3. Format the prompt using the template and the gathered context
    final_prompt_for_llm = llm_prompt(context, question)

    # 4. Generate a response using the LLM
    # Tokenize the input and ensure attention_mask is returned.
    # No truncation is applied, so a long context can exceed the tokenizer's
    # 512-token limit and trigger a sequence-length warning.
    tokenized_input = tokenizer(final_prompt_for_llm, return_tensors="pt", return_attention_mask=True).to(model.device)

    output_ids = model.generate(
        tokenized_input.input_ids,
        attention_mask=tokenized_input.attention_mask,  # Pass the attention_mask
        max_new_tokens=200,
        num_beams=5,
        early_stopping=True,
        pad_token_id=tokenizer.pad_token_id  # Explicitly set pad_token_id for generation
    )

    # Decode the generated response
    llm_response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Keep only the text after "Answer:" in case the model echoes the prompt
    answer_prefix = "Answer:"
    if answer_prefix in llm_response:
        actual_answer = llm_response.split(answer_prefix, 1)[1].strip()
    else:
        actual_answer = llm_response.strip()

    return actual_answer, context

In [35]:
question="Who authored the 'Pioneer Saturn Encounter' document?"
print(generate_response(question)[0])

Token indices sequence length is longer than the specified maximum sequence length for this model (659 > 512). Running this sequence through the model will result in indexing errors


Similarity Search Done!
United States. National Aeronautics and Space Administration
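
The tokenizer warning in the output above reports a 659-token prompt against a 512-token maximum. One way to stay within that limit without cutting off the question is to add retrieved chunks in ranked order until a token budget is spent. The sketch below assumes this approach: `fit_context`, `whitespace_tokens`, and the sample strings are hypothetical, and the whitespace counter is only a stand-in for a real count such as `len(tokenizer(text).input_ids)`.

```python
from typing import Callable


def fit_context(
    chunks: list[str],
    reserved_tokens: int,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> str:
    """Join ranked chunks, stopping before the prompt would exceed max_tokens.

    reserved_tokens is the space kept for the instructions and the question.
    Separator tokens between chunks are not counted, so the result is approximate.
    """
    budget = max_tokens - reserved_tokens
    kept = []
    for chunk in chunks:  # chunks are assumed to be sorted best match first
        cost = count_tokens(chunk)
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost
    return "\n\n".join(kept)


def whitespace_tokens(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


ranked = ["a b c d", "e f g", "h i j k l"]
context = fit_context(ranked, reserved_tokens=4, count_tokens=whitespace_tokens, max_tokens=12)
print(repr(context))
```

With a 1000-byte chunk size, a smaller `k` or smaller chunks may be needed before all retrieved chunks fit alongside the prompt template.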


In [36]:
sample_queries = [
    "What is the primary objective of the Project Gutenberg eBook titled 'Pioneer Saturn Encounter'?",
    "When was Pioneer 11 launched and what were its initial targets before reaching Saturn?",
    "How did Jupiter's gravitational field influence Pioneer 11's trajectory towards Saturn?",
    "What are some of the key scientific observations or data collected by Pioneer 11 at Saturn?",
    "Who authored the 'Pioneer Saturn Encounter' document and when was it released?"
]

print(f"Generated {len(sample_queries)} sample queries:")
for i, query in enumerate(sample_queries):
    print(f"{i+1}. {query}")

Generated 5 sample queries:
1. What is the primary objective of the Project Gutenberg eBook titled 'Pioneer Saturn Encounter'?
2. When was Pioneer 11 launched and what were its initial targets before reaching Saturn?
3. How did Jupiter's gravitational field influence Pioneer 11's trajectory towards Saturn?
4. What are some of the key scientific observations or data collected by Pioneer 11 at Saturn?
5. Who authored the 'Pioneer Saturn Encounter' document and when was it released?


## Process and Answer Queries




In [37]:
for i, query in enumerate(sample_queries):
    print(f"\n-----------------------------------------------------\nProcessing Query {i+1}:")
    print(f"Query: {query}")

    llm_response, context = generate_response(query)
    print(f"LLM's Answer: {llm_response}")
    # Note: the slice below takes the first 100 characters of the context, not tokens
    print(f"First 100 tokens of Context: {context[:100]}...")

# Print the completion message only after every query has been processed
print("\n-----------------------------------------------------")
print("Demonstration of RAG process for all sample queries complete.")



-----------------------------------------------------
Processing Query 1:
Query: What is the primary objective of the Project Gutenberg eBook titled 'Pioneer Saturn Encounter'?
Similarity Search Done!
LLM's Answer: free distribution of electronic works
First 100 tokens of Context: ﻿The Project Gutenberg eBook of Pioneer Saturn Encounter
    
This ebook is for the use of anyone ...

-----------------------------------------------------
Processing Query 2:
Query: When was Pioneer 11 launched and what were its initial targets before reaching Saturn?
Similarity Search Done!
LLM's Answer: April 5, 1973
First 100 tokens of Context:  penetrate deep below the Jovian clouds.

In the coming years, each of these follow-on missions wi...

-----------------------------------------------------
Processing Query 3:
Query: How did Jupiter's gravitational field influence Pioneer 11's tra

# 4. Evaluation


To ensure the reliability of the Retrieval-Augmented Generation (RAG) pipeline, both the Retriever and Generator must be evaluated independently and jointly.

## 1. Retriever Evaluation: Contextual Precision
Definition:
Contextual Precision measures whether the retrieved chunks are actually relevant to the user’s query.

Theoretical Approach:
- Create a small gold-standard dataset consisting of user queries and the expected relevant passages from the ebook.
- Compare the retrieved chunks with this ground truth.

Metric:

$$\text{Contextual Precision} = \frac{\text{Number of Relevant Chunks Retrieved}}{\text{Total Chunks Retrieved}}$$

For example, if 2 of the top 3 retrieved chunks are relevant, precision = 2/3 ≈ 0.67.

Practical Implementation:
- Use Top-k Precision and, optionally, Recall@k.
- Manually label relevance for at least 10–20 queries.
- Automate evaluation using the RAGAS `context_precision` metric.
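
The Top-k Precision and Recall@k measures named above can be sketched in a few lines. The retrieved IDs below are the ones returned for the example query earlier in the notebook, but the relevance labels are hypothetical and chosen only to reproduce the 2-out-of-3 case; the function names are likewise illustrative.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / len(top_k)


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


retrieved = ["chunk_4", "chunk_2", "chunk_6"]  # ranking from the example query
relevant = {"chunk_2", "chunk_4"}              # hypothetical gold labels

print(f"Precision@3: {precision_at_k(retrieved, relevant, k=3):.2f}")
print(f"Recall@3:    {recall_at_k(retrieved, relevant, k=3):.2f}")
```

The RAGAS metric is, as far as its documentation describes it, rank-aware, so its score can differ from this plain ratio for the same set of chunks.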

How this prevents hallucination:
If irrelevant context is retrieved, the generator is more likely to fabricate answers. High contextual precision ensures the model receives only useful evidence.


## 2. Generator Evaluation: Faithfulness

Definition:
Faithfulness measures whether the generated answer is strictly supported by the retrieved context and does not introduce external knowledge.

Theoretical Approach:
Faithfulness can be framed as an entailment problem: does the context logically support every claim made in the answer?

Metric:

$$\text{Faithfulness} = \frac{\text{Number of Claims Supported by Context}}{\text{Total Claims in Answer}}$$

Practical Implementation:
Use LLM-based evaluators (e.g., the RAGAS `faithfulness` metric).
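
The ratio above can be computed once each claim has a supported or unsupported verdict. The sketch below keeps the judge pluggable: `faithfulness_score` and `naive_support` are hypothetical names, and the word-overlap judge is a deliberately crude stand-in for the LLM or entailment model that a real evaluator would use. The context and claims are toy data.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Share of claims that the judge marks as supported by the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_support(claim: str, context: str) -> bool:
    """Crude stand-in judge (assumption): every word of the claim must occur in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())


context = "pioneer 11 was launched in 1973 and flew past jupiter"
claims = ["pioneer 11 was launched in 1973", "pioneer 11 landed on saturn"]
score = faithfulness_score(claims, context, naive_support)
print(score)
```

Splitting an answer into individual claims is itself usually delegated to an LLM; here the claims are supplied by hand to keep the example self-contained.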


## 3. Hallucination Prevention

A combination of design-time safeguards and evaluation checks ensures grounding.

(i). Prompt-Level Guardrails
Use an explicit instruction such as:
“Answer ONLY using the provided context. If the answer is not present, respond with ‘I cannot find this information in the provided text.’”

This discourages the model from falling back on its parametric (pre-trained) knowledge.

(ii). Retrieval Constraints
- Limit context to the top-k high-similarity chunks.
- Apply a similarity threshold to filter weak matches.
- Use chunk overlap to avoid loss of meaning at chunk boundaries.
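
A similarity threshold can be applied as a simple filter on the results that `semantic_search` returns. In the sketch below, the distances are copied from the example query output earlier in the notebook, while the helper name `filter_by_distance` and the cutoff of 0.45 are assumptions that would need tuning on labeled queries.

```python
def filter_by_distance(retrieved: list[dict], max_distance: float = 0.45) -> list[dict]:
    """Keep only hits whose cosine distance is at or below max_distance."""
    return [hit for hit in retrieved if hit["distance"] <= max_distance]


# Distances from the example query "What was the mission of Pioneer 11?"
hits = [
    {"id": "chunk_4", "distance": 0.3987},
    {"id": "chunk_2", "distance": 0.4432},
    {"id": "chunk_6", "distance": 0.4946},
]
kept = filter_by_distance(hits, max_distance=0.45)
print([hit["id"] for hit in kept])
```

If the filter removes every hit, the pipeline can return the "not enough information" response directly instead of calling the LLM.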

(iii). Citation-Based Generation
Require the model to:
- Quote supporting lines, or
- Provide passage references.

This makes hallucinations easier to detect.
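
When the model is asked to quote its evidence, the quotes can be checked mechanically against the retrieved context. The sketch below assumes quotes are wrapped in straight double quotes and compares them after collapsing whitespace, so a quote that spans a line break still matches. The helper `unsupported_quotes` and the sample answer are hypothetical; the context is a fragment of a chunk shown earlier.

```python
import re


def unsupported_quotes(answer: str, context: str) -> list[str]:
    """Return quoted spans in the answer that do not occur verbatim in the context."""
    quotes = re.findall(r'"([^"]+)"', answer)
    normalized_context = " ".join(context.split())  # collapse line breaks and extra spaces
    return [q for q in quotes if " ".join(q.split()) not in normalized_context]


context = "11 spacecraft, launched in 1972 and 1973,\nrespectively, were well named"
answer = 'The text says they were "launched in 1972 and 1973" and "landed on Saturn".'
print(unsupported_quotes(answer, context))
```

An empty result only shows that the quotes exist in the context; whether they actually support the answer still needs the faithfulness check.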

(iv). Post-Generation Verification
Run a second LLM check with a verification prompt such as:
“Is every statement in this answer supported by the provided context? Return YES or NO with justification.”

Reject or regenerate answers flagged as unsupported.
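
The reject-or-regenerate step can be sketched as a small loop around two callables. Everything named here is an assumption for illustration: `generate` stands for the answer-producing call, `judge` for the second LLM, and `stub_judge` is a canned replacement so the example runs without a model. The fallback text is the refusal message from the prompt-level guardrail.

```python
from typing import Callable

VERIFICATION_PROMPT = (
    "Is every statement in this answer supported by the provided context? "
    "Return YES or NO with justification.\n\n"
    "Context:\n{context}\n\nAnswer:\n{answer}\n"
)
FALLBACK = "I cannot find this information in the provided text."


def is_verified(answer: str, context: str, judge: Callable[[str], str]) -> bool:
    """Ask the judge and treat a verdict that starts with YES as approval."""
    verdict = judge(VERIFICATION_PROMPT.format(context=context, answer=answer))
    return verdict.strip().upper().startswith("YES")


def answer_with_verification(
    generate: Callable[[], str],
    judge: Callable[[str], str],
    context: str,
    max_attempts: int = 2,
) -> str:
    """Regenerate until an answer passes verification, else return the fallback."""
    for _ in range(max_attempts):
        answer = generate()
        if is_verified(answer, context, judge):
            return answer
    return FALLBACK


def stub_judge(prompt: str) -> str:
    """Stand-in for the second LLM call (assumption): flags any mention of Mars."""
    return "NO: unsupported claim" if "Mars" in prompt else "YES: supported"


context = "Pioneer 11 was launched in 1973."
candidates = iter(["It flew to Mars.", "It was launched in 1973."])
result = answer_with_verification(lambda: next(candidates), stub_judge, context)
print(result)
```

Regeneration only helps when the second attempt can differ, for example with sampling enabled or a revised prompt; deterministic beam search would return the same answer again.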

(v). Automated Evaluation with RAGAS
RAGAS enables scalable, LLM-judged testing with little manual labeling by measuring:
- Context Precision
- Faithfulness
- Answer Relevance

Together, these metrics give quantitative reliability scores for the pipeline.