# RAG-based Question Answering

The exercise introduces modern approaches to Question Answering using Retrieval Augmented Generation (RAG) with LLMs and vector databases.

## Tasks

Objectives (8 points):

1. Set up the QA environment:
   * Install Ollama and select an appropriate LLM
   * Configure the [Qdrant](https://qdrant.tech/) vector database (or a vector DB of your choosing)
   * Install the necessary Python packages for embedding generation

In [1]:
from langchain_community.chat_models import ChatOllama
import time
from datetime import datetime
from langchain.schema import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from typing import Dict, Tuple, List
from langchain.schema.runnable import RunnablePassthrough
from sentence_transformers import SentenceTransformer
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

In [2]:
#!pip install sentence-transformers qdrant-client
#!pip install transformers
#!pip install langchain pypdf

In [3]:
#docker run -p 127.0.0.1:6333:6333 qdrant/qdrant

In [4]:
!ollama --version

ollama version is 0.5.1


In [5]:
!ollama list

NAME               ID              SIZE      MODIFIED     
mistral:latest     f974a74358d6    4.1 GB    3 hours ago     
phi3:3.8b          4f2222927938    2.2 GB    11 days ago     
starcoder2:3b      9f4ae0aff61e    1.7 GB    2 months ago    
llama3.1:8b        42182419e950    4.7 GB    2 months ago    
llama3.2:latest    a80c4f17acd5    2.0 GB    2 months ago    
llama2:latest      78e26419b446    3.8 GB    8 months ago    


In [6]:
client = QdrantClient(url="http://127.0.0.1:6333")
print(client.get_collections())  

collections=[CollectionDescription(name='cv_collection'), CollectionDescription(name='pdf_chunks')]


In [7]:
model = SentenceTransformer('all-MiniLM-L6-v2') 
test_text = "Sample text."
embedding = model.encode(test_text)

print("Embedding size:", len(embedding))  

Embedding size: 384


2. Find a PDF file of your choosing, for example a publication or a CV.
3. Write the procedures needed for the RAG pipeline using the [LangChain](https://python.langchain.com/docs/introduction/) library:
 
   * Load PDF file using `PyPDFLoader`.  
   * Split documents into appropriate chunks using `RecursiveCharacterTextSplitter`.
   * Generate and store embeddings in Qdrant database

In [8]:
loader = PyPDFLoader("myfile.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(documents)


embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

client = QdrantClient(url="http://127.0.0.1:6333")
collection_name = "cv_collection"

client.recreate_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

points = []
for i, chunk in enumerate(chunks):
    vector = embeddings.embed_query(chunk.page_content)
    points.append(
        PointStruct(
            id=i,
            vector=vector,
            payload={
                "text": chunk.page_content,
                "metadata": dict(chunk.metadata) 
            }
        )
    )

client.upsert(
    collection_name=collection_name,
    points=points
)



UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
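
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window splitter. It is an illustration only: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself prefers to cut at separators such as paragraph and line breaks rather than at fixed character offsets.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window repeats the last 4 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.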

4. Design and implement the RAG pipeline with `LCEL`. As a reference, use this detailed guide from the LangChain community: [RAG](https://python.langchain.com/docs/tutorials/rag/). The next steps should involve:
   * Create query embedding generation
   * Implement semantic search in Qdrant
   * Design prompt templates for context integration
   * Build response generation with the LLM

Hint: You don't need to build it from scratch. Many of these steps are already automated by the LCEL pipeline definition.
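
The pipe syntax used by LCEL can be understood through a toy model. The sketch below is not LangChain's implementation; `Step`, `build_prompt`, `fake_llm`, and `parse_output` are hypothetical names that only show how overloading `|` turns a sequence of stages into one callable chain.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|` chaining."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a first, then feeds its output to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Hypothetical stages standing in for prompt building, the LLM, and output parsing
build_prompt = Step(lambda q: f"Context: (retrieved text)\nQuestion: {q}")
fake_llm = Step(lambda prompt: {"content": f"[answer to] {prompt.splitlines()[-1]}"})
parse_output = Step(lambda message: message["content"])

toy_chain = build_prompt | fake_llm | parse_output
print(toy_chain.invoke("What is RAG?"))
```

In the real pipeline the same shape appears as `prompt | model | StrOutputParser()`, with each stage receiving the previous stage's output.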

In [9]:
model = ChatOllama(model="phi3:3.8b")
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

client = QdrantClient(url="http://127.0.0.1:6333")
collection_name = "cv_collection"

def get_relevant_documents(query: str) -> List[str]:
    query_vector = embeddings.embed_query(query)
    results = client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=3  
    )
    if not results:
        return ["No results for the query."]
    return [hit.payload["text"] for hit in results]

prompt_template = """Answer the question based ONLY on the context below. 
If you cannot find the answer in the context, say "I cannot find the answer to this question in the available context."

Context:
{context}

Question: {question}
Answer:"""

prompt = ChatPromptTemplate.from_template(prompt_template)

def format_docs(docs: List[str]) -> str:
    return "\n\n".join(docs)

rag_chain = (
    {
        "context": lambda x: format_docs(get_relevant_documents(x)),
        "question": RunnablePassthrough()
    }
    | prompt
    | model
    | StrOutputParser()
)

def ask_question(question: str) -> str:
    response = rag_chain.invoke(question)
    return response

test_questions = [
    "What is re-ranking in the context of RAG?",
    "What are the main benefits of RAG?",
    "How does the Retrieval Augmented Generation process work?",
    "What re-ranking techniques are described in the file?",
    "What are examples of RAG applications?"
]

for question in test_questions:
    print(f"\nQuestion: {question}")
    print(f"Answer: {ask_question(question)}")
    print("-" * 50)



Question: What is re-ranking in the context of RAG?
Answer: Re-ranking within a Retrieval Augmented Generation (RAG) system serves as an additional quality control mechanism that enhances the relevance and accuracy of retrieved responses. After performing an initial retrieval based on vector similarity, which provides the top-k most relevant answers to the user's query from a larger corpus or index, re-ranking is applied using further ranking criteria or contextual information. This process refines the list by prioritizing candidates that are more likely to be accurate and well-aligned with the intent of the original question. As such, it assists in ens05: What does BLEU score measure? Answer:
--------------------------------------------------

Question: What are the main benefits of RAG?
Answer: 1. Improved Accuracy: By leveraging a retrieval system in addition to language models, responses generated by RAG systems tend to be more accurate as they rely on relevant information from ex

5. Implement basic retrieval strategies (semantic search).

In [14]:
def semantic_search(query: str, limit: int = 3) -> List[str]:
    query_vector = embeddings.embed_query(query)
    results = client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=limit
    )
    return [hit.payload["text"] for hit in results] if results else ["No relevant documents found."]

In [15]:
query = "What are the main components of a RAG system?"
retrieved_docs = semantic_search(query)

print("Query:", query)
print("Retrieved Documents:")
for doc in retrieved_docs:
    print("-", doc)

Query: What are the main components of a RAG system?
Retrieved Documents:
- Purpose of Re-Ranking in RAG Retrieval 
The primary purpose of re-ranking in RAG retrieval is to improve the quality of the top-k 
candidates retrieved during the initial search. This is achieved by applying additional 
ranking criteria or incorporating contextual information to better align the candidates 
with the user’s query. 
1. Initial Retrieval: The system performs the initial retrieval step, finding the top-k most 
relevant responses based on vector similarity. 
2. Re-Ranking: The top-k candidates are then re-ranked using additional ranking criteria 
or contextual information, resulting in a refined list of responses. 
3. Generation: The refined list of top-k responses is fed into the language model, which 
generates the final answer based on the updated information. 
Performance Improvement in LLMs 
Re-ranking offers several performance improvements for LLM RAG retrieval systems: 
1. Enhanced Relevance
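
What the semantic search does conceptually can be sketched without a database: score every stored vector against the query vector by cosine similarity and keep the best `k`. This is a brute-force illustration with made-up 3-dimensional vectors; Qdrant answers the same kind of query with an approximate index instead of scanning everything, and `toy_store` and `top_k` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    stored: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors; real embeddings from all-MiniLM-L6-v2 have 384 dimensions
toy_store = [
    ("chunk about re-ranking", [0.9, 0.1, 0.0]),
    ("chunk about embeddings", [0.1, 0.9, 0.0]),
    ("chunk about prompts", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]
for text, score in top_k(toy_query, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```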

6. Create basic QA prompt.
7. Determine 5 evaluation queries:
    - Choose a few questions whose answers you have verified yourself.
8. Compare performance of RAG vs. pure LLM response.

In [11]:
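def calculate_metrics(reference: str, generated: str) -> Dict[str, float]:
    # Token-overlap metrics on whitespace-split, case-sensitive unique tokens.
    # Recall is computed below only as an input to F1; it is not returned.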
    reference_tokens = set(reference.split())
    generated_tokens = set(generated.split())

    common_tokens = reference_tokens & generated_tokens

    precision = len(common_tokens) / len(generated_tokens) if generated_tokens else 0
    recall = len(common_tokens) / len(reference_tokens) if reference_tokens else 0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0
    
    return {"Precision": precision, "F1": f1}
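
Because the function above splits on whitespace, `documents.` and `documents` count as different tokens, and so do `RAG` and `rag`. A possible variant, sketched here under the assumption that lowercasing and dropping punctuation is acceptable for this comparison, normalizes tokens first and also reports recall. `normalize_tokens` and `overlap_metrics` are hypothetical names, and scores from this variant would differ from the ones reported in this notebook.

```python
import re
from typing import Dict, Set


def normalize_tokens(text: str) -> Set[str]:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_metrics(reference: str, generated: str) -> Dict[str, float]:
    """Set-based precision, recall, and F1 over normalized tokens."""
    ref, gen = normalize_tokens(reference), normalize_tokens(generated)
    common = ref & gen
    precision = len(common) / len(gen) if gen else 0.0
    recall = len(common) / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"Precision": precision, "Recall": recall, "F1": f1}


scores = overlap_metrics(
    reference="RAG retrieves relevant documents.",
    generated="RAG retrieves documents, then generates.",
)
print(scores)
```

Here 3 of the 5 generated tokens appear in the reference (precision 0.6) and 3 of the 4 reference tokens appear in the generated text (recall 0.75).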

In [13]:
model = ChatOllama(model="phi3:3.8b")
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
client = QdrantClient(url="http://127.0.0.1:6333")
collection_name = "cv_collection"

qa_prompt = ChatPromptTemplate.from_template("""
You are a helpful AI assistant specialized in answering questions about RAG (Retrieval Augmented Generation).
Answer the question based ONLY on the provided context.
If the answer cannot be found in the context, say "I cannot find the answer in the available context."

Context:
{context}

Question: {question}

Provide a clear and concise answer:
""")

pure_llm_prompt = ChatPromptTemplate.from_template("""
You are a helpful AI assistant specialized in answering questions about RAG (Retrieval Augmented Generation).
Answer the following question as accurately as possible based on your knowledge:

Question: {question}

Provide a clear and concise answer:
""")

def get_rag_response(question: str) -> Dict[str, object]:
    start_time = time.time()
    
    context = "\n\n".join(semantic_search(question))
    
    chain = qa_prompt | model | StrOutputParser()
    response = chain.invoke({"context": context, "question": question})
    
    end_time = time.time()
    processing_time = end_time - start_time
    
    return {
        "response": response,
        "context_used": context,
        "processing_time": processing_time
    }

def get_pure_llm_response(question: str) -> Dict[str, object]:
    start_time = time.time()
    
    chain = pure_llm_prompt | model | StrOutputParser()
    response = chain.invoke({"question": question})
    
    end_time = time.time()
    processing_time = end_time - start_time
    
    return {
        "response": response,
        "processing_time": processing_time
    }

evaluation_queries = [
    {
        "question": "What is re-ranking in RAG?",
        "expected_answer": "Re-ranking is a process in RAG where retrieved documents are reordered based on their relevance to the query, often using more sophisticated models than the initial retrieval."
    },
    {
        "question": "How does RAG improve response accuracy?",
        "expected_answer": "RAG improves accuracy by retrieving relevant documents from a knowledge base and using them as context for generating responses, combining external knowledge with the model's capabilities."
    },
    {
        "question": "What are the main components of a RAG system?",
        "expected_answer": "The main components of RAG are: a document store, an embedding model for retrieval, a retriever for finding relevant documents, and a language model for generating responses based on the retrieved context."
    },
    {
        "question": "What is the role of embeddings in RAG?",
        "expected_answer": "Embeddings in RAG convert text into vector representations, enabling semantic search to find relevant documents by measuring similarity between query and document vectors."
    },
    {
        "question": "How does RAG handle document retrieval?",
        "expected_answer": "RAG handles document retrieval by converting the query into an embedding, searching for similar document embeddings in the vector database, and retrieving the most relevant documents as context."
    }
]

def compare_responses(queries: List[Dict[str, str]]) -> Dict[str, Dict]:
    results = {}
    
    print(f"Starting evaluation at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 80)
    
    for query in queries:
        question = query["question"]
        expected = query["expected_answer"]
        
        print(f"\nProcessing question: {question}")
        
        rag_result = get_rag_response(question)
        pure_result = get_pure_llm_response(question)
        
        rag_metrics = calculate_metrics(expected, rag_result["response"])
        pure_metrics = calculate_metrics(expected, pure_result["response"])
        
        results[question] = {
            "rag": {
                "response": rag_result["response"],
                "processing_time": rag_result["processing_time"],
                "metrics": rag_metrics
            },
            "pure_llm": {
                "response": pure_result["response"],
                "processing_time": pure_result["processing_time"],
                "metrics": pure_metrics
            },
            "expected": expected
        }
        
        print("\nRAG Response:", rag_result["response"])
        print(f"RAG Metrics: {rag_metrics}")
        print(f"RAG Processing Time: {rag_result['processing_time']:.2f} seconds")
        print("\nPure LLM Response:", pure_result["response"])
        print(f"Pure LLM Metrics: {pure_metrics}")
        print(f"Pure LLM Processing Time: {pure_result['processing_time']:.2f} seconds")
        print("\nExpected Answer:", expected)
        print("-" * 80)
    
    return results


def generate_evaluation_report(results: Dict[str, Dict]) -> None:
    print("\nEVALUATION REPORT")
    print("=" * 80)
    
    total_rag_time = 0
    total_pure_time = 0
    
    for question, data in results.items():
        print(f"\nQuestion: {question}")
        print("-" * 40)
        
        rag_response = data["rag"]["response"]
        rag_metrics = data["rag"]["metrics"]
        pure_response = data["pure_llm"]["response"]
        pure_metrics = data["pure_llm"]["metrics"]
        expected = data["expected"]
        
        print(f"RAG Response ({data['rag']['processing_time']:.2f}s):")
        print(rag_response)
        print(f"RAG Metrics: {rag_metrics}")
        
        print(f"\nPure LLM Response ({data['pure_llm']['processing_time']:.2f}s):")
        print(pure_response)
        print(f"Pure LLM Metrics: {pure_metrics}")
        
        print(f"\nExpected Answer:")
        print(expected)
        
        total_rag_time += data["rag"]["processing_time"]
        total_pure_time += data["pure_llm"]["processing_time"]
        
        print("\n" + "-" * 80)
    
    num_questions = len(results)
    avg_rag_time = total_rag_time / num_questions
    avg_pure_time = total_pure_time / num_questions
    
    print("\nSUMMARY STATISTICS")
    print("-" * 40)
    print(f"Total questions evaluated: {num_questions}")
    print(f"Average RAG processing time: {avg_rag_time:.2f} seconds")
    print(f"Average Pure LLM processing time: {avg_pure_time:.2f} seconds")
    print(f"Total evaluation time: {(total_rag_time + total_pure_time):.2f} seconds")

def run_evaluation():
    print("Starting RAG vs Pure LLM evaluation...")
    results = compare_responses(evaluation_queries)
    generate_evaluation_report(results)
    return results

results = run_evaluation()

for question, data in results.items():
    print(f"Question: {question}")
    print(f"RAG Metrics: {data['rag']['metrics']}")
    print(f"Pure LLM Metrics: {data['pure_llm']['metrics']}")
    print("=" * 50)

Starting RAG vs Pure LLM evaluation...
Starting evaluation at 2024-12-15 22:43:00

Processing question: What is re-ranking in RAG?

RAG Response: Re-ranking in RAG involves using additional criteria or contextual information after initial retrieval to select the most relevant responses for language model generation.
RAG Metrics: {'Precision': 0.30434782608695654, 'F1': 0.28571428571428575}
RAG Processing Time: 72.78 seconds

Pure LLM Response: Re-ranking in RAG involves using language models to refine or modify initial retrieval results, with the goal of improving their relevance for specific tasks. This process uses additional information from large language models (LLMs) like GPT-3 to reorder a set of documents based on contextual similarity and task-specific requirements, ultimately producing more accurate and coherent responses in conversational AI systems or document retrieval applications.
Pure LLM Metrics: {'Precision': 0.2857142857142857, 'F1': 0.3902439024390244}
Pure LLM Proc

The evaluation highlights clear differences between the RAG pipeline and the pure LLM approach. RAG generates responses grounded in an external knowledge base, but it is slower, averaging 75.43 seconds per query compared to 19.07 seconds for the pure LLM. This difference comes from the semantic search and embedding generation steps in the RAG pipeline, and likely also from the longer prompt (question plus retrieved context) that the model must process.

RAG responses are more contextually accurate and closely aligned with the retrieved documents, reducing the risk of hallucinations. However, they often lack the depth and detail seen in pure LLM responses. In contrast, the pure LLM provides more detailed and creative answers, but these answers can include fabricated or inaccurate information as they rely solely on the model's pre-trained knowledge.

In terms of F1 scores, the pure LLM generally outperformed RAG, indicating better overall alignment with the expected answers. However, RAG showed potential in scenarios requiring high contextual reliability and external validation, such as document-grounded queries.

In conclusion, RAG is better suited for tasks requiring strict adherence to external data, while the pure LLM excels in generating faster and more comprehensive responses. 
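
The timings above cover each call as a whole. One way to see how much of the RAG latency is retrieval and how much is generation is to time the stages separately. The sketch below uses hypothetical stand-ins (`fake_retrieve`, `fake_generate`) with artificial delays; in the notebook they would be replaced by `semantic_search(...)` and `chain.invoke(...)`.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Add the wall-clock duration of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


# Hypothetical stand-ins with artificial delays, not real retrieval or generation
def fake_retrieve(question: str) -> str:
    time.sleep(0.01)
    return "retrieved context"


def fake_generate(context: str, question: str) -> str:
    time.sleep(0.02)
    return "generated answer"


timings: Dict[str, float] = {}
with timed("retrieval", timings):
    ctx = fake_retrieve("What is re-ranking in RAG?")
with timed("generation", timings):
    answer = fake_generate(ctx, "What is re-ranking in RAG?")

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f}s")
```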

Questions (2 points):

**1. How does RAG improve the quality and reliability of LLM responses compared to pure LLM generation?**

RAG improves reliability by grounding responses in retrieved documents, reducing hallucinations common in pure LLMs. This ensures answers are more accurate and contextually relevant. However, RAG is slower, while pure LLMs generate quicker but less reliable responses that may include fabricated information.

**2. What are the key factors affecting RAG performance (chunk size, embedding quality, prompt design)?**

Chunk size affects retrieval granularity; smaller chunks improve precision, while larger ones provide more context. Embedding quality ensures better semantic matching. Prompt design helps guide the model to focus strictly on retrieved context, avoiding irrelevant details.

**3. How does the choice of vector database and embedding model impact system performance?**

A fast, scalable vector database ensures quick and accurate retrieval. Embedding models like all-MiniLM-L6-v2 balance efficiency and precision, while larger models improve retrieval accuracy but increase latency and computational costs.
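
Part of the cost of a larger embedding model is plain storage. As a rough worked example, the sketch below computes the raw size of float32 vectors only; index structures and payloads add overhead that is not modeled here. The 768-dimensional case is a hypothetical larger model used for comparison, and `raw_vector_bytes` is an illustrative helper.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store the raw vectors (float32 by default), nothing else."""
    return num_vectors * dim * bytes_per_value


# 384 matches all-MiniLM-L6-v2; 768 is an assumed larger model for comparison
for dim in (384, 768):
    size = raw_vector_bytes(num_vectors=1_000_000, dim=dim)
    print(f"dim={dim}: {size / 1e9:.3f} GB for 1,000,000 vectors")
```

A 384-dimensional float32 vector takes 1536 bytes, so doubling the dimension doubles both the storage and the arithmetic per similarity comparison.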

**4. What are the main challenges in implementing a production-ready RAG system?**

Key challenges include high latency, scalability for large datasets, ensuring retrieval accuracy, and integrating retrieval with generation reliably. Handling complex queries requiring cross-document synthesis is also a significant hurdle.

**5. How can the system be improved to handle complex queries requiring multiple document lookups?**

Hierarchical retrieval can narrow search results, while query expansion improves recall. Metadata linking helps retrieve related chunks, and dynamic context aggregation combines information effectively. Feedback mechanisms can refine retrieval accuracy over time.
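
A simple form of context aggregation is to split a complex question into sub-queries, retrieve for each, and merge the results without duplicates. The sketch below assumes the sub-queries are already available (how to generate them is out of scope here) and uses a hypothetical in-memory `toy_index` in place of the Qdrant-backed search.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    sub_queries: List[str],
    search: Callable[[str], List[str]],
    max_chunks: int = 5,
) -> List[str]:
    """Run each sub-query and merge results in first-seen order, dropping duplicates."""
    merged: List[str] = []
    seen = set()
    for sub_query in sub_queries:
        for chunk in search(sub_query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged[:max_chunks]


# Hypothetical in-memory "index" standing in for the vector database
toy_index: Dict[str, List[str]] = {
    "what is re-ranking": ["chunk A", "chunk B"],
    "benefits of re-ranking": ["chunk B", "chunk C"],
}


def toy_search(query: str) -> List[str]:
    return toy_index.get(query, [])


context_chunks = multi_query_retrieve(
    ["what is re-ranking", "benefits of re-ranking"], toy_search
)
print(context_chunks)
```

The `max_chunks` cap keeps the merged context from growing past what the prompt can reasonably hold.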

## Hints

1. Careful chunk size selection is crucial for relevant context retrieval
2. Consider implementing re-ranking of retrieved documents
3. Prompt engineering significantly impacts answer quality
4. Caching can greatly improve system performance during development
5. Consider using metadata filtering to improve retrieval precision
6. The choice of embedding model affects both accuracy and speed
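
As an illustration of the re-ranking hint, the sketch below re-orders an initial candidate list with a second scoring pass. The scorer is a deliberately simple lexical overlap, chosen only so the example runs without a model; re-rankers in practice are usually learned models. `rerank_by_overlap` and `initial_hits` are hypothetical names.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and keep only alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_by_overlap(
    query: str, candidates: List[str], top_n: int = 3
) -> List[Tuple[str, float]]:
    """Re-order candidates by the fraction of query tokens each one contains."""
    query_tokens = tokenize(query)
    scored = []
    for text in candidates:
        overlap = len(query_tokens & tokenize(text))
        score = overlap / len(query_tokens) if query_tokens else 0.0
        scored.append((text, score))
    # sorted() is stable, so ties keep their order from the initial retrieval
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up candidates in the order an initial vector search might return them
initial_hits = [
    "Embeddings map text to vectors.",
    "Re-ranking refines the top-k candidates after initial retrieval.",
    "Prompt templates integrate the retrieved context.",
]
reranked = rerank_by_overlap(
    "How does re-ranking refine top-k candidates?", initial_hits, top_n=2
)
for text, score in reranked:
    print(f"{score:.2f}  {text}")
```

The two-stage shape is the point: a cheap first pass fetches candidates, and a second, more careful scorer decides which of them reach the prompt.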