# Retrieval-Augmented Generation (RAG)  

## 1. What is RAG?  

<img src="https://python.langchain.com/assets/images/rag_retrieval_generation-1046a4668d6bb08786ef73c56d4f228a.png" alt="Langchain Pipeline" style="width:800px;">  

Retrieval-Augmented Generation (RAG) is a technique that enhances language model responses by retrieving relevant information from an external knowledge source, such as a database or document collection. The retrieved text is passed to the model as context alongside the user's question, which helps it generate more accurate, contextually relevant, and up-to-date answers and can reduce hallucinations.
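
To make the retrieve-then-generate idea concrete, here is a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the toy corpus, the word-count "embedding", and the helper names `toy_embed`, `toy_retrieve`, and `build_prompt` are hypothetical, and a real pipeline would use a neural embedding model and a vector store instead.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for an external knowledge source.
DOCUMENTS = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Ollama runs large language models locally behind an HTTP API.",
    "LangGraph composes application steps into a graph with shared state.",
]


def toy_embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_retrieve(question: str, k: int = 2) -> list:
    """Return the k documents most similar to the question."""
    query_vec = toy_embed(question)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(query_vec, toy_embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: list) -> str:
    """Place the retrieved documents in the prompt as context for the model."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


top_docs = toy_retrieve("What is FAISS used for?", k=1)
rag_prompt = build_prompt("What is FAISS used for?", top_docs)
print(rag_prompt)
```

The resulting `rag_prompt` is what would be sent to the language model; the generation step itself is left out here because it needs a running model.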



## 1.1 Your Task: Build a RAG Pipeline  

Your goal is to construct a **Retrieval-Augmented Generation (RAG) pipeline** using a provided `LLM` and `vector_store`.  

### Steps to follow:  
1. **Retrieve relevant documents** – Fetch the top `n` most relevant documents from the `vector_store` based on a user query.  
2. **Generate a response** – Use the `LLM` to process the retrieved documents and generate a well-informed answer.  

Refer to LangChain’s [RAG Documentation](https://python.langchain.com/docs/tutorials/rag/#preview) for guidance on implementing this pipeline effectively.



In [1]:
# Predefined imports
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.prompts import PromptTemplate


OLLAMA_URL = "http://localhost:11434"

# Load our models
llm = ChatOllama(model="smollm2:360m", base_url=OLLAMA_URL)
embedding_provider = OllamaEmbeddings(model="granite-embedding:278m", base_url=OLLAMA_URL)

# Load the vector store from disk
vector_store = FAISS.load_local("tei-client-index", embedding_provider, allow_dangerous_deserialization=True)

# Define the prompt to use
prompt = PromptTemplate(
    input_variables=["question", "context"],
    template=(
        "You are an assistant for code understanding tasks. "
        "Use the following pieces of retrieved code to answer the question. "
        "If you don't know the answer, just say that you don't know. "
        "Try to answer in Markdown syntax.\n"  # trailing newline keeps "Question:" on its own line
        "Question: {question}\n"
        "Context: {context}\n"
        "Answer:"
    )
)

### Add your implementation below:

In [2]:
from typing_extensions import List, TypedDict
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph

# Your code goes here

# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"], k=10)
    return {"context": retrieved_docs}


def generate(state: State):
    # Single quotes inside the f-string: reusing double quotes is a syntax error before Python 3.12
    docs_content = "\n\n".join(
        f"File:{doc.metadata['source']}\nContent:{doc.page_content}" for doc in state["context"]
    )
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
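
Conceptually, the compiled graph runs `retrieve` and then `generate`, and each step returns only the state keys it wants to update. The sketch below imitates that flow in plain Python. It is an approximation for intuition, not LangGraph's actual implementation, and `run_sequence`, `fake_retrieve`, and `fake_generate` are hypothetical names introduced here.

```python
from typing import Any, Callable, Dict, List

Step = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_sequence(steps: List[Step], initial_state: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order, merging each step's partial update into the state."""
    state = dict(initial_state)  # copy so the caller's dict is not mutated
    for step in steps:
        state.update(step(state))  # returned keys overwrite existing ones
    return state


# Stub steps (hypothetical) standing in for the real retrieve/generate functions.
def fake_retrieve(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"context": [f"doc about {state['question']}"]}


def fake_generate(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"answer": f"Answer based on {len(state['context'])} document(s)."}


final_state = run_sequence([fake_retrieve, fake_generate], {"question": "embed"})
print(final_state["answer"])
```

This is why `generate` can read `state["context"]`: by the time it runs, the update returned by `retrieve` has already been merged into the shared state.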

Ask the RAG pipeline a question:

In [3]:
from IPython.display import display, Markdown

response = graph.invoke({"question": "How do i use the embed methode?"})
display(Markdown(response["answer"]))

To use the Embed method in a Grpc client, you need to create an EmbedServicer class and implement the embedded methods. The Embed method is used for generating embeddings of text, while the EmbedAll and EmbedStream methods are used for generating embeddings across multiple sources. To make sure that these embedded methods work as expected, you can use the RPCMethodHandlers module from the base code:

- Add an EmbedServicer class to the src/tei_client/stubs folder:

```python
from grpc import StreamServer
from gRPC.exceptions import GRPCUnavailableError

def _add_EmbedServicer_to_server(servicer, server):
    if not hasattr(servicer, "embed"):
        raise GRPCUnavailableError("Embed servcer not registered")
    for method_name in ("embed", "EmbedStream",):
        rpc_method_handlers = {
            "Embed": servicer.embed,
            "EmbedStream": servicer.EmbedStream,
        }
        server.add_rpc_method_handler(method_name, rpc_method_handlers)

def embed(
    request,
    target,
    options=(),
    channel_credentials=None,
    call_credentials=None,
    insecure=False,
    compression=None,
    wait_for_ready=None,
    timeout=None,
    metadata=None,
):
    return grpc.experimental.unary_unary(
        request,
        target,
        "/tei.v1.Embed/Embed",
        tei__pb2.EmbedRequest.SerializeToString,
        tei__pb2.EmbedResponse.FromString,
        options,
        channel_credentials,
        insecure,
        call_credentials,
        compression,
        wait_for_ready,
        timeout,
        metadata,
    )
```

- Add an Embed method to the server class:

```python
from gRPC import StreamServer
from grpc import StreamServerError

def _add_Embed(server):
    if not hasattr(server, "embed"):
        raise GRPCUnavailableError("Embed servcer not registered")
    for method_name in ("Embed",):
        server.add_rpc_method_handler(method_name, self._add_EmbedServicer_to_server)

def embed(
    request,
    target,
    options=(),
    channel_credentials=None,
    call_credentials=None,
    insecure=False,
    compression=None,
    wait_for_ready=None,
    timeout=None,
    metadata=None,
):
    return grpc.experimental.unary_unary(
        request,
        target,
        "/tei.v1.Embed/Embed",
        tei__pb2.EmbedRequest.SerializeToString,
        tei__pb2.EmbedResponse.FromString,
        options,
        channel_credentials,
        insecure,
        call_credentials,
        compression,
        wait_for_ready,
        timeout,
        metadata,
    )
```

## (Optional) Optimize the Implementation  

Enhance the pipeline by adjusting the number of retrieved documents (`n`) or refining the prompt to improve the accuracy of generated answers.  
You can also experiment with larger models, such as [`llama3.2`](https://ollama.com/library/llama3.2), to achieve better performance.
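
One possible direction, sketched under assumptions: a small model has a limited context window, so retrieving more documents only helps if the combined context still fits. The helper below caps the context at an approximate character budget while keeping the highest-ranked documents. The `Doc` class is a hypothetical stand-in that mimics the `page_content` and `metadata` attributes of LangChain's `Document`, and `format_context` and `max_chars` are illustrative names, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes used from a retrieved document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_context(docs: List[Doc], max_chars: int = 4000) -> str:
    """Join documents in ranked order, stopping before the character budget is exceeded.

    The budget is approximate: separators between documents are not counted.
    """
    parts: List[str] = []
    used = 0
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        block = f"File: {source}\nContent: {doc.page_content}"
        if used + len(block) > max_chars:
            break  # lower-ranked documents are dropped first
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


docs = [
    Doc("def embed(): ...", {"source": "a.py"}),
    Doc("x" * 100, {"source": "b.py"}),
]
context = format_context(docs, max_chars=60)
print(context)
```

A helper like this could replace the join inside `generate`, letting you raise `k` in `similarity_search` without overflowing the prompt; suitable values for `k` and `max_chars` depend on the model and are best found by experiment.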

In [4]:
# Your code goes here