## RAG Pipelines

A RAG pipeline consists of a **retrieval** step and a **generation** step. Both are influenced by your choice of hyperparameters.
Hyperparameters include things like:
- embedding model to use for retrieval
- the number of nodes to retrieve ("top-K")
- LLM temperature
- prompt template
- etc.
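It can help to keep these hyperparameters in one object, so each evaluation run records exactly which configuration it tested. The sketch below is hypothetical: `RAGConfig`, its field names, and the placeholder model name are illustrative assumptions and do not come from any library. It builds a small grid that varies one retrieval hyperparameter (top-K) and one generation hyperparameter (temperature).

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class RAGConfig:
    # Hypothetical hyperparameter container; field names and defaults are illustrative.
    embedding_model: str = "example-embedding-model"  # placeholder, not a real model ID
    top_k: int = 3            # number of nodes to retrieve
    chunk_size: int = 512     # size of each text chunk (unit depends on your splitter)
    temperature: float = 0.0  # LLM sampling temperature
    prompt_template: str = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"


baseline = RAGConfig()

# Vary top-K and temperature while holding every other hyperparameter fixed.
grid = [replace(baseline, top_k=k, temperature=t) for k, t in product([3, 5], [0.0, 0.7])]

for cfg in grid:
    print(cfg.top_k, cfg.temperature)
```

Because the dataclass is frozen, `dataclasses.replace` creates a new config for each variant and never mutates the baseline.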

----------

### Retrieval

The retrieval step involves:
1. **Vectorizing the input** - converting the user's query into an embedding
2. **Performing vector search** - querying the vector store for the top-K most similar text chunks
3. **Reranking the retrieved nodes** - optional, depending on the use case
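The three retrieval steps above can be sketched end to end in plain Python. This is a toy that makes several assumptions. A bag-of-words term-count vector stands in for a real embedding model. An in-memory list of strings stands in for the vector store. `embed`, `cosine`, `retrieve`, and the `rerank` callback are hypothetical names and are not part of any library.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase term counts.
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(
    query: str,
    chunks: list[str],
    top_k: int = 2,
    rerank: Optional[Callable[[str, list[str]], list[str]]] = None,
) -> list[str]:
    q = embed(query)  # 1. vectorize the input
    # 2. "vector search": score every chunk and keep the top-K (brute force, no index)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    candidates = ranked[:top_k]
    # 3. optional reranking of the retrieved nodes
    return rerank(query, candidates) if rerank else candidates


chunks = [
    "F-1 students may stay 60 days after completing their degree.",
    "OPT allows F-1 students to work after graduation.",
    "The library is open until midnight on weekdays.",
]
print(retrieve("How long can an F-1 student stay after completing a degree?", chunks))
```

In a real pipeline, `embed` would call the embedding model under evaluation, and the brute-force sort would be replaced by the vector store's nearest-neighbour search.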

RAG evaluation questions for retrieval:
- Does the embedding model capture domain-specific nuances?
- Does the reranker rank the retrieved nodes in the correct order?
- Is the right amount of information being retrieved (text chunk size, top-K)?
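One way to check whether the reranker orders nodes correctly is a weighted cumulative precision over the ranked retrieval context. The sketch below assumes the per-node relevance verdicts are already known. As we understand it, DeepEval's `ContextualPrecisionMetric` (the metric this notebook uses) computes a score of this form after asking a judge LLM for those verdicts. The function here is a hypothetical illustration, not the library's code.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Weighted cumulative precision of a ranked retrieval context.

    relevance[k - 1] is True when the node at rank k is relevant to the input.
    Score = (1 / #relevant) * sum over relevant ranks k of (#relevant in top k) / k.
    """
    relevant_so_far = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            total += relevant_so_far / k  # precision@k, counted only at relevant ranks
    return total / relevant_so_far if relevant_so_far else 0.0


print(contextual_precision([True, True, False]))  # relevant nodes ranked first
print(contextual_precision([False, True, True]))  # irrelevant node ranked first
```

Both orderings contain the same two relevant nodes. The first ordering scores 1.0 and the second about 0.58, so the score penalizes a reranker that puts an irrelevant node on top.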

___________________

### Generation

The generation step involves:
1. **Constructing a prompt** - based on the retrieval context fetched by vector search
2. **Providing this prompt to the LLM**
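Step 1 of generation is usually plain string templating. The minimal sketch below makes a few assumptions: the template wording, the `[n]` chunk numbering, and the name `build_prompt` are hypothetical choices, not a specific framework's format. Step 2 would pass `prompt` to whichever LLM client you use, and that call is not shown here.

```python
TEMPLATE = (
    "Use only the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, retrieved_chunks: list[str], template: str = TEMPLATE) -> str:
    # Number each retrieved chunk so the LLM (and a human reviewer) can refer to it.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1))
    return template.format(context=context, question=question)


prompt = build_prompt(
    "How long can I stay in the US after graduation on an F-1 visa?",
    ["F-1 students may stay 60 days after completing their degree."],
)
print(prompt)
```

Keeping the template as a parameter makes "prompt template" an explicit hyperparameter that can be swapped between evaluation runs.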

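RAG evaluation questions for generation:
- Can a smaller, faster, cheaper LLM be used without hurting quality?
- Does a higher temperature improve or degrade the output?
- How does changing the prompt template affect output quality?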
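Temperature is worth evaluating because of how it reshapes sampling. In the standard softmax-with-temperature formulation, the logits are divided by the temperature T before normalization. A higher T therefore flattens the next-token distribution and makes outputs less deterministic. The logits below are made-up numbers for illustration, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At T = 0.5 the top token takes most of the probability mass. At T = 2.0 the mass spreads toward the lower-scoring tokens, which is why higher temperatures can raise variety at the cost of factual consistency.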

______________

```python
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)
from rag_techniques.evaluation.custom_llm import CustomLlama3_8B

# Use a locally hosted Llama 3 8B model as the evaluation (judge) LLM.
custom_llm = CustomLlama3_8B()

# Contextual precision evaluates the reranker: are relevant nodes in the
# retrieval context ranked higher than irrelevant ones?
contextual_precision = ContextualPrecisionMetric(threshold=0.5, include_reason=True, model=custom_llm)

# Contextual recall evaluates the embedding model: does it accurately capture
# and retrieve the information relevant to the input?
# contextual_recall = ContextualRecallMetric()

# Contextual relevancy evaluates text chunk size and top-K: is the information
# retrieved without too much irrelevant content?
# contextual_relevancy = ContextualRelevancyMetric()
```
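The two metrics left commented out reduce to simple ratios once a judge LLM has produced its verdicts. The minimal sketch below makes two assumptions. First, the boolean verdicts are already available, and in DeepEval they would come from the judge model. Second, a test passes when its score is at least the threshold. The function names are hypothetical and are not DeepEval's API.

```python
def contextual_recall(attributable: list[bool]) -> float:
    # attributable[i]: can sentence i of expected_output be attributed to the retrieval context?
    return sum(attributable) / len(attributable) if attributable else 0.0


def contextual_relevancy(relevant: list[bool]) -> float:
    # relevant[i]: is statement i of the retrieval context relevant to the input?
    return sum(relevant) / len(relevant) if relevant else 0.0


def passes(score: float, threshold: float = 0.5) -> bool:
    # Assumed pass rule: the score must reach the threshold.
    return score >= threshold


recall = contextual_recall([True, True, False])
relevancy = contextual_relevancy([True, False, False, False])
print(recall, passes(recall))
print(relevancy, passes(relevancy))
```

A low relevancy score with a high recall score would suggest the chunks or top-K are too large: the needed information is retrieved, but it is buried in irrelevant text.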

```python
from deepeval.test_case import LLMTestCase

# Note: actual_output (30 days) disagrees with expected_output and the retrieval context (60 days).
test_case = LLMTestCase(
    input="I'm on an F-1 visa, how long can I stay in the US after graduation?",
    actual_output="You can stay up to 30 days after completing your degree.",
    expected_output="You can stay up to 60 days after completing your degree.",
    retrieval_context=[
        """If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing
        your degree, unless you have applied for and been approved to participate in OPT."""
    ],
)
```

```python
# Score the test case with the judge LLM and print its explanation.
contextual_precision.measure(test_case)
print("Score: ", contextual_precision.score)
print("Reason: ", contextual_precision.reason)
```
