## Q1. Explain why RAG is needed when LLMs are already powerful.

### Answer
LLMs are powerful and trained on large volumes of data, but they have a knowledge cutoff, so they can miss recent events and any data absent from their training corpus, such as private or proprietary documents. RAG (Retrieval-Augmented Generation) addresses this by retrieving relevant context from external knowledge sources and providing it to the LLM, so that responses are more accurate and up to date.

## Q2. Is RAG still relevant in the era of long context LLMs?

### Answer
Yes. Long-context LLMs can still get “lost in the middle,” overlooking information buried in the middle of a long prompt, and filling a large context window raises both API cost and latency. RAG mitigates these problems by passing only the most relevant information to the model, improving accuracy and cost-efficiency.

## Q3. What are the fundamental challenges of RAG systems?

### Answer
Key challenges include:
- Scalability: efficient search and indexing over large, dynamic knowledge sources.
- Latency: retrieval + generation can add delay without optimizations.
- Hallucination risk: ambiguous or insufficient retrieved data can still lead to incorrect LLM outputs.
- Bias & noise: retrieved content may contain biases or errors that propagate into answers.

## Q4. What are effective strategies to reduce latency in RAG systems?

### Answer
Common strategies:
- Caching: reuse retrieval results or generated responses for repeated queries.
- Embedding quantization: store lower-precision embeddings to reduce memory and computation.
- Selective query rewriting: apply query rewriting only to complex queries.
- Selective re-ranking: skip the expensive re-ranking step for simple queries.
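
As an illustration of the caching strategy, here is a minimal sketch using Python's standard `functools.lru_cache`. The `slow_retrieve` function and the `CALLS` counter are hypothetical stand-ins for a real vector-store lookup, used here only to show when the cache is hit.

```python
from functools import lru_cache

# Hypothetical counter, used only to show how often the slow path runs.
CALLS = {"count": 0}


def slow_retrieve(query: str) -> tuple[str, ...]:
    """Hypothetical stand-in for an expensive vector-store lookup."""
    CALLS["count"] += 1
    return (f"chunk about {query}",)


@lru_cache(maxsize=1024)
def cached_retrieve(normalized_query: str) -> tuple[str, ...]:
    # lru_cache requires hashable arguments; returning a tuple keeps the
    # cached value immutable so callers cannot corrupt it.
    return slow_retrieve(normalized_query)


def retrieve(query: str) -> tuple[str, ...]:
    # Normalize case and whitespace so trivially different spellings of the
    # same query share one cache entry.
    return cached_retrieve(" ".join(query.lower().split()))


retrieve("What is RAG?")
retrieve("  what is   rag? ")
print(CALLS["count"])  # 1: the second call was served from the cache
```

An in-process cache like this only helps with exact (normalized) repeats; production systems often use a shared cache with an expiry so that stale answers are eventually refreshed.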

## Q5. Explain R, A, and G in RAG.

### Answer
RAG = Retrieval-Augmented Generation:
- Retrieval: search and fetch relevant information from external sources.
- Augmented: include retrieved context with the prompt (query + instructions) to the LLM.
- Generation: LLM processes the prompt+context and generates a coherent, context-aware answer.
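
The augmentation step can be sketched as plain string assembly. The function name `build_prompt` and the instruction wording below are illustrative assumptions, not a prescribed template.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine instructions, retrieved chunks, and the user query into one prompt."""
    # Number the chunks so the model (and the reader) can refer back to them.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are issued within 14 days of purchase.", "Shipping takes 3 to 5 days."],
)
print(prompt)
```

The resulting string is what gets sent to the LLM in the generation step.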

## Q6. How does RAG help reduce hallucinations in LLM generated responses?

### Answer
RAG supplies factual, up-to-date context from trusted sources. By grounding the LLM’s generation in retrieved evidence, RAG reduces the likelihood of fabricated or unsupported details and improves answer reliability.

## Q7. Why is re-ranking important in the RAG pipeline after initial document retrieval?

### Answer
Initial retrieval may return irrelevant chunks ranked above relevant ones. Re-ranking (e.g., with a cross-encoder) refines relevance ordering so the LLM receives higher-quality context, reducing noise and improving answer quality.
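
A rough sketch of the re-ranking step follows. The `rerank` function takes any pairwise scoring function. The `overlap_score` scorer is a toy word-overlap stand-in used so the example runs without extra dependencies; in practice a cross-encoder model would fill that role.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-order retrieved candidates by a (query, chunk) relevance score."""
    # sorted() is stable, so candidates with equal scores keep their
    # original retrieval order.
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer (assumption): fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


candidates = [
    "The office is closed on public holidays.",
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3 to 5 business days.",
]
print(rerank("how many days for refunds", candidates, overlap_score, top_n=2))
```

Because the scorer is passed in as a parameter, swapping the toy function for a real model only changes the `score` argument.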

## Q8. What is the purpose of character overlap during chunking in a RAG pipeline?

### Answer
Chunk overlap preserves contextual continuity across chunk boundaries, preventing loss of information that spans chunks. Typical overlap is ~10–20% of chunk size to balance context preservation and efficiency.
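
A minimal character-based sketch of overlapping chunks is shown below. The function name `chunk_text` and its defaults (200 characters with a 40-character overlap, i.e. 20%) are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last `overlap` characters of the previous one, so a sentence that straddles a boundary appears intact in at least one chunk when it is shorter than the overlap.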

## Q9. What role does cosine similarity play in relevant chunk retrieval within a RAG pipeline?

### Answer
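Cosine similarity measures the cosine of the angle between the query embedding and each chunk embedding, so it compares direction rather than magnitude. Chunks are ranked by this score: a higher score indicates a chunk that is semantically closer to the query and therefore more likely to be relevant.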
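
For reference, the score is the dot product of the two vectors divided by the product of their lengths. A small pure-Python sketch follows; the function name and the zero-vector convention are assumptions, and real systems typically use a vectorized library instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because the score depends only on direction, scaling an embedding by a positive constant does not change its ranking.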

## Q10. Can you give examples of real-world applications where RAG systems have demonstrated value?

### Answer
Examples include AI search engines (e.g., Perplexity), enterprise knowledge assistants, customer support bots that require up-to-date policies, and internal document Q&A tools that need precise, sourced answers.

## Q11. Explain the steps in the indexing process in a RAG pipeline.

### Answer
Indexing steps (usually offline):
1. Parsing: extract document content.
2. Chunking: split content into semantically coherent chunks.
3. Encoding: compute embeddings for chunks using an embedding model.
4. Storing: save chunk embeddings and metadata in a vector database for fast similarity search.
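
The four steps above can be sketched end to end with toy components. Everything here is an assumption made for illustration: `toy_embed` is a hashed bag-of-words vector standing in for a real embedding model, paragraph splitting stands in for a real chunker, and a plain Python list stands in for a vector database.

```python
import zlib
from dataclasses import dataclass

DIM = 64  # toy embedding dimension (assumption)


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector; a stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    return vec


@dataclass
class IndexedChunk:
    doc_id: str
    position: int
    text: str
    embedding: list[float]


def build_index(documents: dict[str, str]) -> list[IndexedChunk]:
    index: list[IndexedChunk] = []
    for doc_id, raw in documents.items():
        content = raw.strip()                          # 1. parsing (trivial here)
        paragraphs = content.split("\n\n")             # 2. chunking by paragraph
        chunks = [p.strip() for p in paragraphs if p.strip()]
        for position, chunk in enumerate(chunks):
            embedding = toy_embed(chunk)               # 3. encoding
            index.append(IndexedChunk(doc_id, position, chunk, embedding))  # 4. storing
    return index


index = build_index({"policy.txt": "Refunds take 14 days.\n\nShipping takes 5 days."})
print(len(index), index[0].text)
```

Keeping `doc_id` and `position` alongside each embedding is what later allows answers to cite their source.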

## Q12. Explain the importance of chunking in RAG.

### Answer
Chunking creates focused, retrievable segments that preserve context while keeping retrieval efficient. Proper chunk size improves retrieval accuracy and avoids irrelevant noise; poor chunking harms retrieval quality and coherence.

## Q13. How do you choose the chunk size for a RAG system?

### Answer
Choose a chunk size by balancing granularity vs. context: small chunks (100–200 tokens) help precise fact retrieval; larger chunks (500–1,000 tokens) provide deeper context but can add noise and cost. Consider document structure, embedding model, and LLM context window.

## Q14. What are the potential consequences of having chunks that are too large versus chunks that are too small?

### Answer
Too large: mixed topics per chunk, coarse embeddings, and noisy context for the LLM. Too small: fragmented context, missed semantic connections, increased storage and slower search due to more chunks.

## Q15. Explain the retrieval process step-by-step in a RAG pipeline.

### Answer
Retrieval steps:
1. Encode the user query into an embedding.
2. Search the vector DB for the nearest chunk embeddings using cosine similarity (or another similarity or distance metric).
3. Return top-K chunks (optionally re-rank).
4. Include selected chunks as context in the prompt and call the LLM to generate a grounded response.
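
The retrieval steps above can be sketched with toy components. The word-count `toy_embed`, the in-memory `index` list, and the function names are assumptions for illustration; a real pipeline would use an embedding model and a vector database, and would finish by sending the prompt to an LLM, which is omitted here.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Word-count vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    query_vec = toy_embed(query)                                    # step 1
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)  # step 2
    return [text for text, _ in ranked[:k]]                         # step 3


chunks = [
    "refunds are issued within 14 days",
    "the office is closed on holidays",
    "shipping takes five days",
]
# Chunk embeddings are computed once at indexing time, not per query.
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

query = "refunds within 14 days"
top_chunks = retrieve(query, index, k=2)
# Step 4: place the selected chunks in the prompt (the LLM call is omitted).
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

This brute-force scan scores every chunk; vector databases typically replace it with an approximate nearest-neighbor index to stay fast at scale.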