<a href="https://colab.research.google.com/github/El-Moghazy2/Machine-Learning-Tutorials/blob/master/building_rag_using_llama3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building Custom-Chatbot using RAG

In this tutorial, we explore how to use Retrieval-Augmented Generation (RAG) to apply the reasoning capabilities of open-source LLMs to custom data without fine-tuning the model. We will build a RAG system that answers questions based on one of Hugging Face's publicly available courses.

**Before** we dive into what Retrieval-Augmented Generation (RAG) is and why it's useful, it's important to first understand the foundation it builds upon: Large Language Models (LLMs). Also known as foundation models, LLMs are transformer-based deep neural networks trained on massive volumes of publicly available text data. Like traditional language models, their primary objective is to predict the next word (more precisely, the next token) in a sequence, but on a much larger scale and with far more complexity.

The term *foundation* reflects their broad training base: these models are designed to be general-purpose and adaptable across a wide range of tasks. Because of the extensive training data, LLMs can generate human-like responses and offer reasonably useful general knowledge and limited reasoning, but **their answers are limited to what they've seen during training.**

However, this generality also brings limitations. Out of the box, LLMs are not ideal for tasks that **require domain-specific expertise** or **highly accurate answers** on the first try. They may also **"hallucinate,"** or generate plausible-sounding but **incorrect information**.

To tailor these models for more specific applications, we often need to fine-tune them. Fine-tuning involves training the model further on labeled data specific to a target task or domain. For example, if we want the model to accurately classify emails as "spam" or "not spam," we can fine-tune it using a dataset of emails paired with correct labels. This reduces hallucination and improves task-specific performance.

The following figure illustrates the high-level process of pretraining and fine-tuning a foundational model.

![](https://raw.githubusercontent.com/El-Moghazy2/large-language-model-from-scratch/refs/heads/main/Figures/LLM.JPG?raw=true)

While fine-tuning can significantly improve performance for specific applications, it comes with trade-offs. First, it requires collecting and curating a labeled dataset tailored to the task at hand, which can be time-consuming and expensive. Additionally, fine-tuning itself demands computational resources and expertise, which may not be readily available in all settings.

Another important limitation is flexibility. If you need to update the model with new information to include recent documents or knowledge relevant to your chatbot, you'll likely need to go through the fine-tuning process again. This makes it a less convenient option for dynamic or frequently changing domains.

This is precisely where Retrieval-Augmented Generation (RAG) becomes valuable.

## What is a RAG?

As the name suggests, a RAG system has two main components: a Retrieval component and a Grounded Generation component.

1. Retrieval: When you ask a question, say "What is data science?", the retrieval component first searches a knowledge base and gathers the chunks of text most relevant to the question. This knowledge base can be, for example, a set of company PDFs that are not publicly available.

2. Grounded Generation: The generation process in a RAG system is "grounded" in the retrieved text, meaning the model is instructed to answer based only on that specific information rather than on outside knowledge. You are harnessing the reasoning and language abilities of the LLM to process and interpret the most relevant data, which makes accurate and contextually appropriate responses more likely.

In essence, RAG systems combine fast, efficient data retrieval with the powerful reasoning capabilities of LLMs, making them ideal for generating accurate responses based on domain-specific or large datasets.

![](https://raw.githubusercontent.com/El-Moghazy2/large-language-model-from-scratch/refs/heads/main/Figures/new%20example.JPG?raw=true)
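To make the two components concrete, here is a minimal, dependency-free sketch of the retrieve-then-generate idea. It is only an illustration: the knowledge base, the word-count "embedding", and the helper names (`bag_of_words`, `retrieve`, `build_grounded_prompt`) are assumptions made for this example, not the components used later in the tutorial, where a real embedding model and a FAISS vector store take their place.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory knowledge base; in practice these would be chunks of your documents.
KNOWLEDGE_BASE = [
    "Data science combines statistics, programming, and domain knowledge to extract insights from data.",
    "The cafeteria is open from 8 am to 3 pm on weekdays.",
    "A vector store indexes text embeddings so similar chunks can be found quickly.",
]

def bag_of_words(text):
    """Very rough stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=1):
    """Retrieval step: return the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, bag_of_words(c)), reverse=True)
    return ranked[:k]

def build_grounded_prompt(question, retrieved):
    """Grounded generation step: put the retrieved text in front of the LLM."""
    context = "\n\n".join(retrieved)
    return (
        "Answer the question using only the context below.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"
    )

question = "What is data science?"
top_chunks = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt_text = build_grounded_prompt(question, top_chunks)
print(prompt_text)
```

The resulting `prompt_text` is what would be sent to the LLM; the model never sees the chunks that were not retrieved.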

## What are RAG systems used for?

RAG systems are ideal for situations where a large language model (LLM) needs to answer questions based on specific data that the LLM wasn't trained on. For instance, if your company has a large repository of legal documents and you want to create a question-answering bot that can respond using that legal data, a RAG system is a perfect fit. **Here's why**:

* Limited Training Data: Without fine-tuning, a general LLM won’t have access to the specific data (like your company's legal documents) since this data was not part of the model's original training set.

* Avoiding Costly Fine-Tuning: Fine-tuning a large language model can be expensive and is not necessary in this case. RAG enhances the model's capabilities by retrieving relevant data on the fly, so you don't need to fine-tune again for each legal case you add to the database.

* Reducing Hallucinations: LLMs often "hallucinate" or generate plausible-sounding but incorrect information. By using a RAG system, you can provide the model with specific, contextually relevant information retrieved from your dataset, significantly reducing the likelihood of these hallucinations.

**In essence, RAG allows you to leverage the power of LLMs while grounding their responses in the specific, up-to-date data relevant to your domain, without requiring costly model fine-tuning.**

In [None]:
# Install required packages
!pip install chainlit
!pip install sentence-transformers
!pip install trulens-providers-litellm
!pip install litellm
!pip install faiss-gpu
!pip install langchain
!pip install langchain_community
!pip install langchain-huggingface
!pip install openai
!pip install -qU langchain-ollama
!pip install langgraph
!pip install ollama
!pip install trulens trulens-apps-langchain
!pip install langchainhub
!pip install trulens-providers-langchain
!pip install packaging==21.3
!pip install faiss-cpu bs4 tiktoken
!pip install trulens-providers-huggingface
!pip install --upgrade trulens pydantic
!pip install --upgrade scikit-learn
!pip install -U transformers
!pip install hf_xet

import os
import re
import subprocess
import numpy as np
import pandas as pd
import torch
import kagglehub
from uuid import uuid4
from typing import Optional
from typing_extensions import List, TypedDict
from pydantic import BaseModel, Field
from IPython.display import display, clear_output, Image
from langchain_ollama import OllamaEmbeddings


!pip install ragas

from ragas import EvaluationDataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.run_config import RunConfig
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness

import requests
from langchain_core.runnables import RunnablePassthrough
from bs4 import BeautifulSoup
from langgraph.graph import START, StateGraph

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm

from langchain import LLMChain, PromptTemplate, hub
from langchain_huggingface import HuggingFacePipeline
from langchain_ollama import ChatOllama
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser, OutputFixingParser, RetryOutputParser
from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnableParallel
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

tqdm.pandas(desc='DataFrame Operation')

clear_output()

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
import chainlit as cl

@cl.on_message
async def main(message: cl.Message):
    # Your custom logic goes here...

    # Send a response back to the user; message.content holds the user's text
    await cl.Message(
        content=f"Received: {message.content}",
    ).send()

In [None]:
print("############")
print("IP v")
!curl https://ipv4.icanhazip.com/
print("############")

############
IP v
34.132.228.56
############


In [None]:
!chainlit run app.py & npx localtunnel --port 8000

your url is: https://tired-tigers-deny.loca.lt
Usage: chainlit run [OPTIONS] TARGET
Try 'chainlit run --help' for help.

Error: Invalid value: File does not exist: app.py
^C


Here we will start the Ollama server and download the model.

In [None]:
%%capture

# %%capture hides this cell's output; comment it out to see the installation progress
# Install Ollama, then start the server in the background with `ollama serve`
!curl -fsSL https://ollama.com/install.sh | sh
subprocess.Popen("ollama serve", shell=True)


clear_output()

In [None]:
!lsof -iTCP -sTCP:LISTEN -P | grep ollama

ollama    14467 root    3u  IPv4 381306      0t0  TCP localhost:11434 (LISTEN)


In [None]:
# Pull (download) the Llama 3.1 8B Instruct model; run() blocks until the download finishes
subprocess.run("ollama pull llama3.1:8b-instruct-q8_0", shell=True, check=True)

clear_output()

In [None]:
# Pull Llama 3.3 as well; we will use it later to test the effect of its embeddings on the evaluation metrics
subprocess.run("ollama pull llama3.3", shell=True, check=True)

clear_output()

You can also try to use other models from the Ollama library such as deepseek-r1 and llama3.3, but in this tutorial we will use llama3.1 instruct 8b

# Dataset:

We aim to build a Retrieval-Augmented Generation (RAG) system using the Hugging Face NLP course. The RAG system will allow us to retrieve specific information from the course pages more efficiently by scraping the course content and organizing it in a way that the model can retrieve relevant information quickly.

The course consists of many pages, each of which follows a consistent URL pattern. The base URL stays the same; only the chapter and page numbers change. For example:

* Chapter 1, Page 1: https://huggingface.co/learn/nlp-course/chapter1/1
* Chapter 1, Page 2: https://huggingface.co/learn/nlp-course/chapter1/2
* Chapter 2, Page 1: https://huggingface.co/learn/nlp-course/chapter2/1
* Chapter 2, Page 2: https://huggingface.co/learn/nlp-course/chapter2/2

In [None]:
courses = []
base_url = "https://huggingface.co/learn/nlp-course/chapter{}/{}"
# Generate candidate URLs; chapter/page combinations that don't exist are skipped when loading
for chapter in range(20):
    courses.extend(base_url.format(chapter, page) for page in range(12))

def load_page(url):
    """Fetch the page content using requests; return None if the page is unavailable."""
    try:
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print(f"Error loading {url}: {e}")
    return None

docs = []

for url in courses:
    html = load_page(url)
    if html is None:
        continue

    soup = BeautifulSoup(html, 'html.parser')

    content_element = soup.select_one("div.prose-doc.prose.relative.mx-auto.max-w-4xl.break-words")
    if not content_element:
        print(f"Content element not found for {url}")
        continue

    text_content = content_element.get_text(separator="\n").strip()

    doc = Document(page_content=text_content, metadata={"url": url})
    docs.append(doc)

clear_output()

print(f"Loaded {len(docs)} valid documents.")

Loaded 95 valid documents.


Our next step is to use the open-source Llama 3 models. While there are various ways to do this, I will host the model locally within a Kaggle environment. Using Ollama is one of the easiest and most convenient approaches for this setup.

In [None]:
llm = ChatOllama(
    model='llama3.1:8b-instruct-q8_0',
    temperature=0)

Since SentenceTransformer (SBERT) doesn't provide the exact interface that LangChain's FAISS vector store expects (an embeddings object with `embed_documents` and `embed_query` methods), we'll create a wrapper class around it. This wrapper adapts SBERT's `encode` method to that interface, making the two compatible for efficient vector search.
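As a rough illustration of the interface involved, the sketch below implements the same two methods, `embed_documents` and `embed_query`, on a toy embedder that needs no model download. `ToyHashEmbeddings` and its hashing scheme are hypothetical and exist only for this example; they are not a substitute for a real embedding model.

```python
import hashlib
import math

class ToyHashEmbeddings:
    """Hypothetical stand-in for a real embedding model.

    It exposes the two methods a vector store wrapper needs:
    embed_documents (many texts) and embed_query (one text).
    """

    def __init__(self, dim=16):
        self.dim = dim

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # Stable hash so the same token always lands in the same bucket.
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % self.dim
            vec[bucket] += 1.0
        # Normalize to unit length (leave all-zero vectors untouched).
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def embed_documents(self, texts):
        """Return one vector (list of floats) per input text."""
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        """Return a single vector for a single query string."""
        return self._embed(text)

toy = ToyHashEmbeddings(dim=16)
doc_vectors = toy.embed_documents(["attention layers", "tokenizers split text"])
query_vector = toy.embed_query("attention layers")
print(len(doc_vectors), len(query_vector))
```

The key point is the shape of the return values: a list of vectors for documents and a single vector for a query, both as plain Python lists of floats.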

In [None]:
!pip install -U huggingface_hub



In [None]:
%%capture

MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]


text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=500,
                                               add_start_index=True,
                                               strip_whitespace=True,
                                               separators=MARKDOWN_SEPARATORS)
splits = text_splitter.split_documents(docs)

class SentenceTransformerWrapper:
    def __init__(self, model):
        self.model = model

    def embed_documents(self, texts):
        return self.model.encode(texts, convert_to_numpy=True).tolist()

    def embed_query(self, text):
        return self.model.encode(text, convert_to_numpy=True).tolist()

    def __call__(self, texts):
        if isinstance(texts, list):
            return self.embed_documents(texts)
        else:
            return self.embed_query(texts)

embedding_model = SentenceTransformerWrapper(SentenceTransformer('all-MiniLM-L6-v2'))

vector_store = FAISS.from_documents(splits, embedding_model)
# only the most relevant 2 splits
retriever = vector_store.as_retriever(search_kwargs={"k": 2})



# Prompt + RAG

LangChain provides predefined prompts for RAG setups, but I created a custom one tailored to the Llama 3 Instruct model. This model requires specific prompt formatting with system and user tags. You can find some useful examples in the [Llama3 prompt examples](https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/).
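To show the tag structure on its own, here is a small sketch that assembles a single-turn Llama 3 Instruct prompt from a system message and a user message. The helper name `build_llama3_prompt` and the example messages are assumptions made for illustration; the special tokens follow the format described on the linked model card.

```python
def build_llama3_prompt(system_message, user_message):
    """Assemble a single-turn Llama 3 Instruct prompt from its special tokens."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        # The prompt ends with an open assistant header so the model writes the reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

example_prompt = build_llama3_prompt(
    "You answer questions using only the provided context.",
    "What is a tokenizer?",
)
print(example_prompt)
```

Each message is closed with `<|eot_id|>`, and the final assistant header is left open so that generation continues from there.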

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt_template_str = (
    """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are a helpful AI course instructor and mentor that answers the user question based on the CONTEXT provided.
    <|eot_id|>

    <|start_header_id|>user<|end_header_id|>
    Use the following pieces of retrieved CONTEXT to answer the question.

    If the answer is not clear or not found in the CONTEXT, say:
    **"I don't know based on the provided context."**
    DON'T answer based on your own knowledge, and don't answer if the QUESTION is not related to the CONTEXT.

    QUESTION:
    {input}

    CONTEXT:
    {context}

    <|eot_id|>

    <|start_header_id|>assistant<|end_header_id|>
    """
)

prompt = PromptTemplate(template=prompt_template_str,
                            input_variables=['context', 'input'])


def construct_rag_chain(llm, prompt, retriever):
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)

    # to simplify and understand what a chain looks like, we can build it in one step below
    rag_chain_only_str_answer = (
        {"context": retriever | format_docs, "input": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    return rag_chain, rag_chain_only_str_answer

In [None]:
rag_chain, rag_chain_only_str_answer = construct_rag_chain(llm, prompt, retriever)

To improve the output formatting, we'll create utility functions that generate consistent, well-structured, and visually clear prints for both the model's answers and supporting context documents.

In [None]:
def get_answer(query):

    results = rag_chain.invoke({"input": query})

    answer = f"""Answer:
    {results["answer"]}
    {"=" * 50}
    """

    context = []
    # Iterate over each document in the context and collect its details in a neat format.
    for idx, doc in enumerate(results["context"], start=1):
        source = doc.metadata.get("url", "N/A")
        text = re.sub(r'\n+', '\n', doc.page_content.strip())

        context.append(f"Document {idx}")
        context.append(f"Source: {source}")
        context.append("-" * 50)
        context.append("Text:")
        context.append(text)
        context.append("=" * 50)

    clear_output()
    return answer, context

def get_formatted_answer(answer, context):

    print(answer)

    if "I don't know" not in answer:
        for line in context:
            print(line)

Now we ask about something in the course:

In [None]:
answer, context = get_answer("What is a transformer?")
get_formatted_answer(answer, context)

Answer:
    According to the context, a transformer is a type of model used for Natural Language Processing (NLP) tasks. It's primarily composed of two blocks: an Encoder and a Decoder. The Encoder builds a representation of the input, while the Decoder uses this representation along with other inputs to generate a target sequence.

Additionally, Transformer models are built with special layers called attention layers, which allow them to focus on specific parts of the input when generating output.

The context also mentions that Transformers can be used for various NLP tasks, such as sentence classification, named entity recognition, text generation, translation, and summarization.
    
Document 1
Source: https://huggingface.co/learn/nlp-course/chapter1/3
--------------------------------------------------
Text:
Transformers, what can they do?
 
 
 
 
In this section, we will look at what Transformer models can do and use our first tool from the 🤗 Transformers library: the 
pipeline()


In [None]:
answer, context = get_answer("What are the transformer components?")
get_formatted_answer(answer, context)

Answer:
    Based on the provided context, the components of a Transformer model include:

1. Encoder: The encoder receives an input and builds a representation of it (its features).
2. Decoder: The decoder uses the encoder's representation along with other inputs to generate a target sequence.
3. Attention layers: A key feature of Transformer models that allows them to focus on specific parts of the input when generating output.

Additionally, there are three types of Transformer architectures:

1. Encoder-only models
2. Decoder-only models
3. Encoder-decoder models (or sequence-to-sequence models)
    
Document 1
Source: https://huggingface.co/learn/nlp-course/chapter1/3
--------------------------------------------------
Text:
Transformers, what can they do?
 
 
 
 
In this section, we will look at what Transformer models can do and use our first tool from the 🤗 Transformers library: the 
pipeline()
 function.
 
👀 See that 
Open in Colab
 button on the top right? Click on it to open 

And another question that is irrelevant to the course:

In [None]:
answer, context = get_answer("What are cookies?")
get_formatted_answer(answer, context)

Answer:
    I don't know based on the provided context. The text appears to be about tokenization and reinforcement learning, but there's no mention of cookies.
    


# Evaluation

There are various ways to build a Retrieval-Augmented Generation (RAG) system, typically involving three main stages: indexing, retrieval, and generation. Each stage offers multiple implementation options.  

- **Indexing**: You can experiment with different vector store types, similarity measures, chunking strategies, and embedding models.  
- **Retrieval**: You can vary how many chunks are returned for each query (the `k` parameter) and how they are searched.  
- **Generation**: You might choose between various LLM architectures or explore advanced setups like Agentic RAG, where multiple agents collaborate to generate a better response.  

However, to determine which combination works best for your specific use case, a solid evaluation strategy is essential.

## Evaluation Approaches

There are several ways to evaluate the performance of a RAG system:

- **Human Evaluation**: The most reliable method involves using human-labeled datasets for benchmarking. While this provides high-quality feedback, it's often impractical for early-stage projects or quick proofs of concept due to time and cost constraints.

- **LLM-as-a-Judge**: A more scalable, though still evolving, approach is to use large language models to evaluate the quality of responses. This method is gaining popularity, especially in rapid prototyping scenarios.

In this tutorial, we’ll focus on automatic evaluation techniques based on key quality metrics, using two popular libraries:

- [**TruLens**](https://www.trulens.org/)
- [**Ragas**](https://github.com/explodinggradients/ragas)

These tools help us assess critical aspects of RAG performance such as faithfulness, context relevance, and answer relevance.



## TruLens

TruLens offers three core metrics to evaluate a RAG system:

- **Context Relevance**: Assesses whether the retrieved context is actually relevant to the user's query, helping reduce irrelevant or distracting information.
  
- **Groundedness**: Measures how well the generated answer sticks to the retrieved context, ensuring the model doesn’t "hallucinate" unsupported facts.
  
- **Answer Relevance**: Evaluates whether the final response meaningfully addresses the user's original query, focusing on usefulness and clarity.

See the [RAG triad documentation](https://www.trulens.org/getting_started/core_concepts/rag_triad/) for more details.
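Because context relevance is judged per retrieved chunk, the per-chunk scores have to be combined into a single number. The sketch below shows one way to do that with a plain mean, first per question and then across questions. All scores and questions here are made-up values for illustration, not output from TruLens.

```python
from statistics import mean

# Hypothetical per-chunk relevance scores in [0, 1], as a judge might assign them
# to the two chunks retrieved for each question.
per_question_chunk_scores = {
    "What is a transformer?": [1.0, 0.5],
    "What are cookies?": [0.0, 0.0],
}

# Aggregate the chunk scores for each question, then average across questions.
per_question = {q: mean(scores) for q, scores in per_question_chunk_scores.items()}
overall = mean(per_question.values())
print(per_question, overall)
```

A question whose retrieved chunks are all irrelevant pulls the overall score down, even if the model correctly declines to answer it.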

Note that the TruLens recorder expects the chain to output only the text answer, so we are going to use the chain that returns a plain string.

In [None]:
from trulens.apps.langchain import TruChain
from trulens.core import TruSession

session = TruSession()
session.reset_database()

🦑 Initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `TruSession` to prevent this.


Updating app_name and app_version in apps table: 0it [00:00, ?it/s]
Updating app_id in records table: 0it [00:00, ?it/s]
Updating app_json in apps table: 0it [00:00, ?it/s]


In [None]:
import litellm
from trulens.providers.litellm import LiteLLM
from trulens.core import Feedback

litellm.set_verbose = False

# first we need to load the LLM judge from Ollama
ollama_provider = LiteLLM(
    model_engine="ollama/llama3.1:8b-instruct-q8_0", api_base="http://localhost:11434"
)

context = TruChain.select_context(rag_chain_only_str_answer)

# now we define our metrics and what each one looks at; for example, groundedness
# compares the generated answer (the output) against the retrieved context chunks
f_groundedness = (
    Feedback(
        ollama_provider.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(context.collect())  # collect context chunks into a list
    .on_output()
)

f_answer_relevance = Feedback(
    ollama_provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()

f_context_relevance = (
    Feedback(
        ollama_provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

✅ In Groundedness, input source will be set to __record__.app.first.steps__.context.first.invoke.rets[:].page_content.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.first.steps__.context.first.invoke.rets[:].page_content .


In [None]:
from trulens.apps.app import TruApp

tru_recorder = TruChain(
    rag_chain_only_str_answer,
    app_name="ChatApplication",
    app_version="Chain1",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)

clear_output()

Now let's prepare some questions related to the course we scraped, which the RAG system should be able to answer, so we can calculate the metrics.

In [None]:
q_lst_tru = ["What is a transformers?",
         "What are the transformer components?",
         "What is the Encoder?",
         "Explain the transformer work theory",
         "What is self-Attention",
         "What is the bias of transformers?",
         "What are transformer limits?"
]

The final step is to record the metrics as we invoke the model on each question.

In [None]:
with tru_recorder as recording:
    for question in q_lst_tru:
        # Invoke the chain wrapped by tru_recorder so each call is recorded and scored
        llm_response = rag_chain_only_str_answer.invoke(question)
        print(llm_response)
        print("=" * 50)

Transformers are models used to solve various NLP tasks, such as those mentioned in the previous section. They are provided by the Hugging Face library and can be used for a wide range of applications, including text classification, language translation, sentiment analysis, and more. The 🤗 Transformers library allows users to create and use these shared models, which can be downloaded from the Model Hub or uploaded by users themselves.




Based on the provided context, the components of a Transformer model include:

1. Encoder: The encoder receives an input and builds a representation of it (its features).
2. Decoder: The decoder uses the encoder's representation along with other inputs to generate a target sequence.
3. Attention layers: A key feature of Transformer models that allows them to focus on specific parts of the input when generating output.

Additionally, there are three types of Transformer architectures:

1. Encoder-only models
2. Decoder-only models
3. Encoder-decoder models (or sequence-to-sequence models)




Based on the provided CONTEXT, an Encoder model is described as using only the encoder of a Transformer model and is often characterized as having "bi-directional" attention. It's best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition, and extractive question answering.




Based on the provided context, I'll explain the Transformer work theory.

The Transformer model is primarily composed of two blocks: the Encoder and the Decoder. The Encoder receives an input and builds a representation of it (its features), while the Decoder uses the encoder's representation along with other inputs to generate a target sequence.

A key feature of Transformer models is that they are built with special layers called Attention Layers, which allow the model to focus on specific parts of the input when generating output. This is in contrast to traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), which process input sequentially or locally.

The Transformer architecture can be used for various NLP tasks, including:

* Encoder-only models: sentence classification, named entity recognition
* Decoder-only models: text generation
* Encoder-decoder models (sequence-to-sequence models): translation, summarization

This is a high-level overview of



Based on the provided context, self-Attention is not explicitly mentioned. However, attention layers are discussed in relation to the Transformer architecture.

However, I can infer that self-Attention might be related to the concept of "bi-directional" attention mentioned in the context of encoder models, which allows the attention layers to access all the words in the initial sentence. But this is not a direct answer to your question.

If you're asking about self-Attention specifically, I don't have enough information to provide an accurate answer based on the provided context.




I don't know based on the provided context. The question about the bias of transformers is not mentioned in the given text, which only discusses the size and environmental impact of transformer models, as well as their capabilities and usage.
I don't know based on the provided context.


Finally, we can check the leaderboard below.

In [None]:
session.get_leaderboard()

| app_name | app_version | Answer Relevance | Context Relevance | Groundedness | latency | total_cost |
|---|---|---|---|---|---|---|
| ChatApplication | Chain1 | 0.176471 | 0.547619 | 0.461438 | 5.679086 | 0.0 |


Although many of the answers read well, the scores leave room for improvement: context relevance is the highest of the three (about 0.55), groundedness is lower (about 0.46), and answer relevance is the lowest (about 0.18).

## Ragas

Another valuable library for evaluating LLM-based RAG systems is **Ragas**. It enables you to assess the quality of your system by comparing its outputs against **expected (reference) answers**. These reference answers are typically written by humans, but Ragas also supports generating them using an LLM — a feature we won’t use in this tutorial to maintain control over the evaluation process.

> _Note: The reference answers used here were generated using GitHub Copilot, based on the same course materials that the RAG system uses._

Ragas uses a separate LLM to act as a judge, scoring your system’s responses across several key dimensions, such as faithfulness and relevance. This approach allows for a more nuanced and automated evaluation compared to traditional manual reviews.
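Two of the metrics used in the evaluation below, faithfulness and factual correctness in F1 mode, reduce to simple ratios once the judge has decided which claims are supported. The sketch below shows only that arithmetic, under the assumption that the claims have already been extracted and checked; the function names and claim counts are hypothetical and this is not Ragas's implementation.

```python
def faithfulness_score(supported_claims, total_claims):
    """Fraction of the answer's claims that the retrieved context supports."""
    return supported_claims / total_claims if total_claims else 0.0

def f1_from_claims(true_positives, false_positives, false_negatives):
    """F1 over claims: TP = answer claims found in the reference,
    FP = answer claims missing from the reference,
    FN = reference claims missing from the answer."""
    precision_den = true_positives + false_positives
    recall_den = true_positives + false_negatives
    precision = true_positives / precision_den if precision_den else 0.0
    recall = true_positives / recall_den if recall_den else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical claim counts for a single answer
faithfulness_example = faithfulness_score(supported_claims=4, total_claims=5)
f1_example = f1_from_claims(true_positives=3, false_positives=1, false_negatives=2)
print(round(faithfulness_example, 2), round(f1_example, 2))
```

Faithfulness compares the answer with the retrieved context, while factual correctness compares it with the reference answer, so the two can move independently.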

In [None]:
q_lst = ["What are transformers?",
         "What are the transformer components?",
         "What is the Encoder?"
]

expected_responses = [
    "Transformers are powerful deep learning models designed for various natural language processing tasks. They can perform sentiment analysis, text generation, summarization, translation, and even answer questions based on context. Hugging Face provides a vast collection of pretrained Transformer models that can be easily accessed using the pipeline() function.",
    "Transformers consist of two main components: the encoder, which processes and understands the input, and the decoder, which generates the output based on the encoder's representation. A key feature of transformers is the attention layers, which allow the model to focus on relevant parts of the input for better performance. These components work together to handle tasks like translation, summarization, and text generation.",
    "The encoder is a key component of transformer models, designed to process and build a meaningful representation of the input data (like text). It extracts features from the input, which are then optimized for understanding and interpretation. This representation is later used by the decoder or directly for tasks like classification and named entity recognition.",
]

In [None]:
def construct_eval_dataset(q_lst, expected_responses, chain):
  dataset = []

  for query, reference in zip(q_lst, expected_responses):
      # Invoke the given chain once per question and reuse the result for both fields
      result = chain.invoke({"input": query})
      relevant_docs = result["context"]
      response = result["answer"]
      dataset.append(
          {
              "user_input": query,
              "retrieved_contexts": [rdoc.page_content for rdoc in relevant_docs],
              "response": response,
              "reference": reference,
          }
      )

  evaluation_dataset = EvaluationDataset.from_list(dataset)

  return evaluation_dataset, dataset

In [None]:
evaluation_dataset, dataset = construct_eval_dataset(q_lst, expected_responses, rag_chain)

In [None]:
for data in dataset:
  print(data['user_input'])
  print(data['response'])
  print(data['reference'])
  print("=" * 50)

What are transformers?
Transformers are models used to solve various NLP tasks, such as text classification, language translation, and sentiment analysis. They can be used for a wide range of applications, including but not limited to:

* Text generation
* Sentiment analysis
* Language translation
* Text summarization
* Question answering

The 🤗 Transformers library provides the functionality to create and use these models, allowing users to download and utilize thousands of pre-trained models from the Model Hub.
Transformers are powerful deep learning models designed for various natural language processing tasks. They can perform sentiment analysis, text generation, summarization, translation, and even answer questions based on context. Hugging Face provides a vast collection of pretrained Transformer models that can be easily accessed using the pipeline() function.
What are the transformer components?
Based on the provided context, the components of a Transformer model include:

1. E

In [None]:
run_config = RunConfig(timeout=5000)

In [None]:
evaluator_llm = LangchainLLMWrapper(llm, run_config)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
    llm=evaluator_llm,
)

result

{'context_recall': 0.7778, 'faithfulness': 0.8889, 'factual_correctness(mode=f1)': 0.6067}

Now that we know how to evaluate the chain, let's try again, this time with a different, more powerful model.

In [None]:
# Assign to `llm` so the new model is the one passed to construct_rag_chain below.
llm = ChatOllama(
    model="llama3.3",
    temperature=0,
)

vector_store = FAISS.from_documents(splits, embedding_model)

retriever = vector_store.as_retriever(search_kwargs={"k": 2})
rag_chain_improved, rag_chain_only_str_answer = construct_rag_chain(llm, prompt, retriever)

In [None]:
# Evaluate the newly built chain rather than the original one.
evaluation_dataset, dataset = construct_eval_dataset(q_lst, expected_responses, rag_chain_improved)

evaluator_llm = LangchainLLMWrapper(llm, run_config)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
    llm=evaluator_llm,
)

result

{'context_recall': 0.7778, 'faithfulness': 0.8182, 'factual_correctness(mode=f1)': 0.7400}
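
To see at a glance which metrics moved between the two runs, the two result dictionaries printed above can be compared directly. This is a minimal sketch with the scores copied by hand from those outputs; the names `baseline`, `second_run`, and `metric_deltas` are illustrative and not part of Ragas.

```python
# Scores copied from the two evaluation outputs shown above.
baseline = {
    "context_recall": 0.7778,
    "faithfulness": 0.8889,
    "factual_correctness(mode=f1)": 0.6067,
}
second_run = {
    "context_recall": 0.7778,
    "faithfulness": 0.8182,
    "factual_correctness(mode=f1)": 0.7400,
}


def metric_deltas(before, after):
    """Return after - before for every metric in `before`, rounded to 4 decimals."""
    return {name: round(after[name] - before[name], 4) for name in before}


deltas = metric_deltas(baseline, second_run)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

In these two runs, context recall is unchanged, faithfulness drops by about 0.07, and factual correctness rises by about 0.13. With a question set this small, differences of this size are better read as indicative than conclusive.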

In [None]:
with tru_recorder as recording:
    for question in q_lst_tru:
        llm_response = rag_chain_improved.invoke({"input": question})["answer"]
        print(llm_response)
        print("=" * 50)

Transformers are models used to solve various NLP tasks, such as those mentioned in the previous section. They are provided by the Hugging Face library and can be used for a wide range of applications, including text classification, language translation, sentiment analysis, and more. The 🤗 Transformers library allows users to create and use these shared models, which can be downloaded from the Model Hub or uploaded by users themselves.




Based on the provided context, the components of a Transformer model include:

1. Encoder: The encoder receives an input and builds a representation of it (its features).
2. Decoder: The decoder uses the encoder's representation along with other inputs to generate a target sequence.
3. Attention layers: A key feature of Transformer models that allows them to focus on specific parts of the input when generating output.

Additionally, there are three types of Transformer architectures:

1. Encoder-only models
2. Decoder-only models
3. Encoder-decoder models (or sequence-to-sequence models)




Based on the provided CONTEXT, an Encoder model is described as using only the encoder of a Transformer model and is often characterized as having "bi-directional" attention. It's best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition, and extractive question answering.




Based on the provided context, I'll explain the Transformer work theory.

The Transformer model is primarily composed of two blocks: the Encoder and the Decoder. The Encoder receives an input and builds a representation of it (its features), while the Decoder uses the encoder's representation along with other inputs to generate a target sequence.

A key feature of Transformer models is that they are built with special layers called Attention Layers, which allow the model to focus on specific parts of the input when generating output. This is in contrast to traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), which process input sequentially or locally.

The Transformer architecture can be used for various NLP tasks, including:

* Encoder-only models: sentence classification, named entity recognition
* Decoder-only models: text generation
* Encoder-decoder models (sequence-to-sequence models): translation, summarization

This is a high-level overview of



Based on the provided context, self-Attention is not explicitly mentioned. However, attention layers are discussed in relation to the Transformer architecture.

However, I can infer that self-Attention might be related to the concept of "bi-directional" attention mentioned in the context of encoder models, which allows the attention layers to access all the words in the initial sentence.

But if we look at the original text about the decoder, it mentions that the decoder's attention layer will only have access to the words before the word currently being generated. This is often referred to as "self-attention" or "auto-regressive attention", where the model attends to all previous positions when predicting a new position.

So, I'll take a educated guess and say that self-Attention in this context refers to the ability of the decoder's attention layer to attend to all previous words when generating a new word.




I don't know based on the provided context. The bias of transformers is not mentioned in the given text.




I don't know based on the provided context.


In [None]:
session.get_leaderboard()

| app_name | app_version | Answer Relevance | Context Relevance | Groundedness | latency | total_cost |
|---|---|---|---|---|---|---|
| ChatApplication | Chain1 | 0.236111 | 0.433333 | 0.487037 | 5.839576 | 0.0 |


## Conclusion

In this notebook, we built a Retrieval-Augmented Generation (RAG) system around Llama 3.1. By combining Llama's generative strengths with a simple retrieval step, we grounded the model’s answers in relevant context to make them more accurate and reliable. Here are the key takeaways:

- **Enhanced Relevance:** Retrieving data at query time and feeding it into generation improves the accuracy and contextual relevance of the output, reduces the hallucinations of foundation models, and lets us trace answers to known sources instead of relying on the LLM's training data alone.
- **Room for Optimization:** While the initial results are promising, future work could focus on refining the retrieval strategy and the prompts to further boost performance. Experimenting with alternative datasets and tuning how queries are matched to documents can also pave the way for more robust implementations.

Overall, this notebook is a constructive step towards building AI systems that are not only generative but also context-aware and fact-guided. The insights gained here are a foundation you can build on to improve your own RAG application.