# Retrieval Augmented Generation 

## What is RAG?

**Retrieval Augmented Generation (RAG)** is an AI architecture that pairs a large language model (LLM) with an external information retrieval system (such as a vector store or search engine). The core idea is to supply the LLM, at query time, with relevant, up-to-date, or domain-specific knowledge retrieved from a database, document store, or the web.

## Why Use RAG?

- **Improved Accuracy:** RAG gives the LLM access to facts and information not present in its training data, which can reduce hallucinations.
- **Up-to-Date Answers:** Enables the model to answer questions about recent events or specialized topics.
- **Domain Adaptation:** RAG can work with proprietary or private data, making it valuable for enterprises and custom applications.

## How Does RAG Work?

RAG typically involves two main stages:

### 1. **Retrieval**
- When a user submits a query, the system first retrieves relevant documents, passages, or data chunks from an external source (vector store, search index, or database).
- Retrieval is often semantic (using embeddings) but can use keyword search or hybrid methods.

### 2. **Generation**
- The retrieved documents are provided as additional context to the LLM.
- The LLM generates an answer by grounding its response in both the retrieved context and its own pre-trained knowledge.

**Simple RAG Workflow:**
1. **User Query:** "What is Retrieval Augmented Generation?"
2. **Retriever:** Finds relevant documents or passages about RAG.
3. **Generator (LLM):** Reads the retrieved context and crafts a grounded, informed response.
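
The three workflow steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the `DOCUMENTS` list, the word-overlap `retrieve` function, and the `fake_llm` stub are all assumptions made for the example. A real system would typically use embedding-based retrieval and an actual LLM call.

```python
import re

# Toy knowledge base; in practice these would be chunks from an indexed corpus.
DOCUMENTS = [
    "Retrieval Augmented Generation (RAG) combines an LLM with an external retriever.",
    "FAISS and Chroma are vector stores that can run locally.",
    "Few-shot prompting supplies input-output examples in the prompt.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and return the top k."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; it only reports the prompt size."""
    return f"[answer generated from a {len(prompt)}-character prompt]"


def answer(query: str) -> str:
    """Run the three workflow steps: retrieve, build the prompt, generate."""
    context = retrieve(query, DOCUMENTS)
    return fake_llm(build_prompt(query, context))
```

Calling `answer("What is Retrieval Augmented Generation?")` retrieves the RAG passage first, because it shares the most words with the query, and passes it to the stub in place of a real model.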


## RAG Architecture Diagram

```
User Query
     |
     v
+------------+         +-----------------+
|  Retriever | ----->  |  Relevant Docs  |
+------------+         +-----------------+
     |                          |
     v                          v
+----------------------------------------+
|      Large Language Model (LLM)        |
|  (with retrieved docs as extra input)  |
+----------------------------------------+
     |
     v
Generated Answer
```

## Example: RAG in Practice

1. **User asks:** "What is LangChain?"
2. **Retriever** searches docs, Wikipedia, or a vector DB and finds relevant paragraphs about LangChain.
3. **LLM** receives both the user query and the retrieved info.
4. **LLM generates:** "LangChain is a framework for developing applications powered by language models, enabling retrieval, orchestration, and more."


## Benefits of RAG

- **Reduced Hallucination:** Answers are grounded in retrieved data, which lowers (but does not eliminate) the risk of fabricated claims.
- **Custom Knowledge:** Leverage internal/company documents not in public LLM data.
- **Scalability:** Update the knowledge base without retraining the model.

**Summary:**  
RAG (Retrieval Augmented Generation) is a technique that combines LLMs and external retrieval systems to provide more accurate, current, and reliable answers by grounding generation in factual context.

# What is In-Context Learning?

**In-Context Learning (ICL)** is a capability of modern large language models (LLMs) where the model learns to perform new tasks or follow patterns by observing a few examples provided directly in the input prompt, without any additional parameter tuning or gradient updates. The model uses the provided “context” (examples, instructions, or demonstrations) to infer the task and generate appropriate outputs.

## How Does In-Context Learning Work?

- **Prompt as Context:** The user supplies a prompt containing one or more input-output examples (also called “shots”) and a new query.
- **Pattern Recognition:** The LLM generalizes from the examples to predict the output for the new query, following the demonstrated pattern or behavior.
- **No Training Required:** The model’s weights are not updated; all “learning” happens dynamically from the prompt context.

### Example: Few-Shot In-Context Learning

```
Translate English to French:
English: cat -> French: chat
English: dog -> French: chien
English: bird -> French:
```

**Model Output:** oiseau
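
A prompt like the one above is usually assembled programmatically from a list of examples. The helper below is a minimal sketch; the function name `build_few_shot_prompt` and its parameters are hypothetical, chosen only for this illustration.

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Assemble an instruction, worked examples, and an open-ended final query."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"English: {source} -> French: {target}")
    # The last line is left incomplete so the model fills in the translation.
    lines.append(f"English: {query} -> French:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("cat", "chat"), ("dog", "chien")],
    "bird",
)
```

Passing an empty `examples` list yields a prompt with only the instruction and the query, with no demonstrations.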

### Types of In-Context Learning

- **Zero-Shot:** No examples, only instructions (e.g., “Translate to French: dog”)
- **One-Shot:** A single example is given in the prompt.
- **Few-Shot:** Multiple examples are provided (usually 2–10).

### Why is In-Context Learning Important?

- **Flexibility:** Allows LLMs to adapt to new tasks and user needs without retraining.
- **Rapid Prototyping:** Enables users to experiment and iterate on new tasks via prompt engineering.
- **Personalization:** Users can tailor prompts for domain-specific or personalized behaviors.

### In-Context Learning vs. Fine-Tuning

| In-Context Learning           | Fine-Tuning                      |
|------------------------------|----------------------------------|
| No model parameter updates   | Model weights updated            |
| Fast, dynamic, on-the-fly    | Offline, requires training       |
| Adapts to new tasks quickly  | Good for large, static datasets  |
| Prompt = “memory”            | Model “remembers” after training |

### Applications

- Language translation
- Text classification
- Summarization
- Code generation
- Custom workflow automation
  
### References

- [Language Models are Few-Shot Learners (OpenAI GPT-3 paper)](https://arxiv.org/abs/2005.14165)
- [Stanford CS25: Transformers United](https://web.stanford.edu/class/cs25/)
- [Prompt Engineering Guide](https://www.promptingguide.ai/)

**Summary:**  
In-Context Learning allows language models to rapidly adapt to new tasks and patterns by observing examples in the prompt, enabling flexible and dynamic application of LLMs without retraining.

# Understanding RAG
- Indexing
- Retrieval
- Augmentation
- Generation


## Indexing

**Indexing** is the process of **preparing your knowledge base** so that it can be **efficiently searched** at query time.  
This step consists of 4 sub-steps.

## 1. Document Ingestion
- You load your source knowledge into memory
- Examples:
  - PDF reports, Word documents
  - YouTube transcripts, blog pages
  - GitHub repos, internal wikis
  - SQL records, scraped webpages

#### Tools:
- LangChain loaders (e.g., `PyPDFLoader`, `YoutubeLoader`, `WebBaseLoader`, `GitLoader`, etc.)

#### Illustration:
*Figure: a source document passes through a document loader.*

This process converts raw documents into structured content that can be efficiently retrieved.
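
As a rough sketch of what ingestion produces, the function below loads plain-text files from a directory using only the standard library. It is not how the LangChain loaders are implemented; the name `load_text_documents` and the `{"text", "metadata"}` record shape are assumptions for illustration.

```python
from pathlib import Path


def load_text_documents(
    root: str,
    suffixes: tuple[str, ...] = (".txt", ".md"),
) -> list[dict]:
    """Read every matching file under `root` into a text-plus-metadata record."""
    documents = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            documents.append(
                {
                    # Undecodable bytes are replaced rather than raising an error.
                    "text": path.read_text(encoding="utf-8", errors="replace"),
                    # Metadata lets later stages cite where a chunk came from.
                    "metadata": {
                        "source": str(path),
                        "size_bytes": path.stat().st_size,
                    },
                }
            )
    return documents
```

Keeping the source path alongside the text is a design choice that pays off later, when retrieved chunks need to be traced back to their origin.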

## 2. Text Chunking
- Break large documents into small, semantically meaningful **chunks**

#### Why chunk?
- LLMs have limited context windows (for example, 4K to 32K tokens on many models), so whole documents often do not fit
- Smaller chunks are more focused → better semantic search

#### Tools:
- RecursiveCharacterTextSplitter
- MarkdownHeaderTextSplitter
- SemanticChunker

#### Illustration:
*Figure: a source document is split into chunks.*  
The source is the raw document input, and the chunks are the resulting smaller segments, which are easier to process and retrieve.
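
The simplest form of chunking is a fixed-size sliding window with overlap, sketched below. This is an illustration only: it splits on character counts and ignores sentence or section boundaries, which the splitters listed above are designed to respect. The name `chunk_text` and its defaults are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence cut at one chunk boundary still appears intact in the neighboring chunk, at the cost of storing some text twice.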

## 3. Embedding Generation
- Convert each chunk into a **dense vector** (embedding) that captures its **meaning**

### Why embeddings?
- Similar ideas land close together in vector space
- Allows fast, fuzzy semantic search

### Tools:
- OpenAIEmbeddings
- SentenceTransformerEmbeddings
- InstructorEmbeddings
- etc.

### Illustration:
*Figure: a source document is transformed into embedding vectors.*  
The raw document is transformed into dense vectors for semantic understanding and retrieval.
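
To make the idea of "close together in vector space" concrete, here is a toy sketch using a hashed bag-of-words vector and cosine similarity. It is a stand-in only: this `embed` function captures word overlap, not meaning, whereas the embedding models listed above are learned neural networks. Both function names and the `dim` parameter are assumptions.

```python
import hashlib
import math
import re


def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hash each word into one of `dim` buckets, then L2-normalize."""
    vector = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    # An empty text has no words, so its vector stays all zeros.
    return [value / norm for value in vector] if norm else vector


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With a real embedding model, the same `cosine_similarity` comparison would score paraphrases highly even when they share no words, which this toy version cannot do.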

## 4. Storage in a Vector Store
- Store the vectors along with the original chunk text + metadata in a **vector database**

### Vector DB options:
- **Local**: FAISS, Chroma
- **Cloud**: Pinecone, Weaviate, Milvus, Qdrant

#### Illustration:
*Figure: icons of several vector database options.*  
The vector store holds the vector representations for efficient semantic search.

## Retrieval
- **Retrieval** is the *real-time* process of **finding the most relevant pieces of information** from a **pre-built index** (created during indexing) based on the user's question.

#### It’s like asking:
> "From all the knowledge I have, which 3–5 chunks are most helpful to answer this query?"

## Augmentation
- **Augmentation** refers to the step where the **retrieved documents** (chunks of relevant context) are **combined with the user’s query** to form a new, enriched prompt for the LLM.
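
One way to build the enriched prompt is sketched below. The template wording, the numbered `[1]`, `[2]` source labels, and the `max_chars` budget are assumptions for illustration. Real applications usually budget by tokens rather than characters and tune the instructions to their model.

```python
def build_augmented_prompt(query: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Combine retrieved chunks and the user query into a single prompt string."""
    selected = []
    used = 0
    for chunk in chunks:  # chunks are assumed to arrive ordered by relevance
        if used + len(chunk) > max_chars:
            break  # stop before the context budget is exceeded
        selected.append(chunk)
        used += len(chunk)

    # Number each chunk so the model (and the reader) can refer to sources.
    context = "\n\n".join(
        f"[{index}] {chunk}" for index, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The resulting string is what gets sent to the LLM in the generation step.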

## Generation
- **Generation** is the final step where a **Large Language Model (LLM)** uses the **user’s query** and the **retrieved & augmented context** to generate a response.