# Performance Comparison of Agentic-RAG vs RAG

Retrieval-Augmented-Generation (RAG) involves using a large language model (LLM) to answer user queries based on information retrieved from a knowledge base. This approach offers several advantages over using a vanilla or fine-tuned LLM. It allows grounding answers in factual information, reducing confabulations, providing domain-specific knowledge, and enabling fine-grained control over access to information from the knowledge base.

However, RAG has some limitations, particularly:

1. It performs only one retrieval step: if the initial retrieval results are poor, the generated answer will also be poor.
2. Semantic similarity is computed with the user query as the reference, which may be suboptimal. For example, user queries are often phrased as questions, while the documents containing the true answers are written in affirmative form. This mismatch can lower the similarity score of relevant passages, so they may be missed.
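To make the second limitation concrete, here is a minimal sketch that uses bag-of-words cosine similarity as a crude stand-in for embedding similarity. The function `bow_cosine` and the three example strings are hypothetical and chosen only for illustration; a real embedding model behaves differently in detail, but the same tendency (statement-like queries sit closer to statement-like documents) is what motivates reformulating the query.

```python
import math
import re
from collections import Counter


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (toy proxy, not an embedding)."""
    counts_a = Counter(re.findall(r"[a-z0-9_]+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9_]+", text_b.lower()))
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical strings, for illustration only.
document = "Call push_to_hub to upload the model to the Hub."
question = "How can I push a model to the Hub?"
statement = "Upload the model to the Hub with push_to_hub."

print("question  vs document:", round(bow_cosine(question, document), 3))
print("statement vs document:", round(bow_cosine(statement, document), 3))
```

In this toy setup the affirmative statement scores higher against the document than the question does, which is the effect the agent exploits when it writes its own retrieval query.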

These issues can be mitigated by creating a RAG agent, which is essentially an agent equipped with a retriever tool. This agent can:

- Formulate the query itself.
- Critique and re-retrieve information if necessary.

By doing so, the agent can recover some advanced RAG techniques. Instead of using the user query directly in semantic search, the agent formulates a reference sentence that is closer to the targeted documents, similar to the HyDE approach. Additionally, the agent can generate snippets and re-retrieve information as needed, akin to the Self-Query approach.
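The retrieve, critique, and re-retrieve behaviour described above can be sketched without any LLM. In this simplified illustration the "agent" is replaced by a fixed list of candidate queries and the "critique" by a check that something was retrieved; `CORPUS`, `keyword_retrieve`, and `retrieve_with_retries` are hypothetical names, not part of smolagents or LangChain.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy corpus, for illustration only.
CORPUS = [
    "Use push_to_hub to upload a trained model to the Hub.",
    "Gradio Blocks gives full control over layout and data flows.",
]


def keyword_retrieve(query: str) -> List[str]:
    """Return documents that share at least two lowercase words with the query."""
    query_words = set(query.lower().split())
    return [
        doc
        for doc in CORPUS
        if len(query_words & set(doc.lower().rstrip(".").split())) >= 2
    ]


def retrieve_with_retries(
    candidate_queries: List[str],
    retrieve: Callable[[str], List[str]],
    max_steps: int = 3,
) -> Tuple[Optional[str], List[str], int]:
    """Try successive query formulations until one returns documents."""
    tried = candidate_queries[:max_steps]
    for step, query in enumerate(tried, start=1):
        docs = retrieve(query)
        if docs:  # stand-in for the agent judging the results good enough
            return query, docs, step
    return None, [], len(tried)


# A question-style query first, then an affirmative reformulation.
used_query, found_docs, steps = retrieve_with_retries(
    ["How can I share my weights?", "upload a trained model to the Hub"],
    keyword_retrieve,
)
print(steps, used_query, found_docs)
```

A real agent would generate each new query from the previous results instead of reading it from a list, but the control flow is the same: a poor first retrieval no longer determines the final answer.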

In [9]:
%pip install pandas langchain langchain-community sentence-transformers faiss-cpu smolagents --upgrade -q

Note: you may need to restart the kernel to use updated packages.


In [8]:
%pip install ipywidgets -q



In [3]:
from huggingface_hub import notebook_login

notebook_login()


# Agentic RAG

In [7]:
%pip install datasets -q



First, load the knowledge base on which we want to perform RAG. This dataset is a compilation of the documentation pages for many Hugging Face packages, stored as markdown.

In [6]:
import datasets

knowledge_bas = datasets.load_dataset("m-ric/huggingface_doc", split="train")


Now prepare the knowledge base by processing the dataset and storing it into a vector database to be used by the retriever.

We use LangChain for its excellent vector database utilities. For the embedding model, we use thenlper/gte-small.
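Before the documents are embedded, they are split into overlapping chunks. The sketch below shows the basic idea of a chunk size and a chunk overlap with a plain sliding window over a token list. It is a simplification under stated assumptions: `sliding_window_chunks` is a hypothetical helper, and LangChain's `RecursiveCharacterTextSplitter` additionally tries to cut at separators such as paragraph and line breaks rather than at fixed offsets.

```python
from typing import List, Sequence


def sliding_window_chunks(
    tokens: Sequence, chunk_size: int, chunk_overlap: int
) -> List[list]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Ten dummy tokens, windows of 4 with 1 token shared between neighbours.
example_chunks = sliding_window_chunks(list(range(10)), chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some tokens twice.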

In [12]:
from tqdm import tqdm
from transformers import AutoTokenizer
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

source_doc = [
    Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
    for doc in knowledge_bas
]

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    AutoTokenizer.from_pretrained("thenlper/gte-small"),
    chunk_size=200,
    chunk_overlap=20,
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

# Split docs and keep only unique chunks
print("Splitting documents...")
docs_processed = []
unique_texts = {}
for doc in tqdm(source_doc):
    new_docs = text_splitter.split_documents([doc])
    for new_doc in new_docs:
        if new_doc.page_content not in unique_texts:
            unique_texts[new_doc.page_content] = True
            docs_processed.append(new_doc)

print("Embedding documents...")

embedding_mode = HuggingFaceEmbeddings(model_name="thenlper/gte-small")

vectordb = FAISS.from_documents(
    documents=docs_processed,
    embedding=embedding_mode,
    distance_strategy=DistanceStrategy.COSINE,
)

Splitting documents...


100%|██████████| 2647/2647 [00:36<00:00, 71.86it/s] 


Embedding documents...



Embedding the documents takes around 17 minutes on CPU (Windows PC).

The database is ready. Now let's build the agentic RAG system. We only need a `RetrieverTool` that our agent can use to retrieve information from the knowledge base.

In [13]:
from smolagents import Tool
from langchain_core.vectorstores import VectorStore


class RetrieverTool(Tool):
    name = "retriever"
    description = "Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, vectordb: VectorStore, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.vectordb.similarity_search(
            query,
            k=7,
        )

        return "\nRetrieved documents:\n" + "".join(
            [
                f"===== Document {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )
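Conceptually, the tool's similarity search ranks stored chunk embeddings by their closeness to the query embedding and returns the top `k`. The following brute-force sketch illustrates that ranking on tiny hand-written vectors. The names `cosine` and `top_k` and the vectors themselves are hypothetical; the actual vector store uses an index and model-generated embeddings, so treat this as an illustration of the idea and not of its implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings", for illustration only.
toy_doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_query_vec = [1.0, 0.1]
print(top_k(toy_query_vec, toy_doc_vecs, k=2))
```

Because only the ranking against the query matters, the wording of the query the agent passes to the tool directly controls which chunks come back.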

Now create an agent that leverages this tool!

In [15]:
from smolagents import LiteLLMModel, ToolCallingAgent

model = LiteLLMModel(
    model_id="ollama/qwen2.5:7b",
    api_base="http://127.0.0.1:11434",
    num_ctx=8192,
)

retriever_tool = RetrieverTool(vectordb)
agent = ToolCallingAgent(tools=[retriever_tool], model=model)

In [16]:
agent_output = agent.run("How can I push a model to the Hub?")

print("Final output:")
print(agent_output)

Final output:
To push a model to the Hugging Face Hub, you need to be logged into your Hugging Face account. You can do this via `huggingface-cli login`. If you specified `push_to_hub=True` in your training configuration, the model will be automatically pushed after training. Otherwise, you can use the `trainer.push_to_hub()` method during or after training. For more detailed instructions and additional resources, refer to the Share a model guide available at https://huggingface.co/docs/transformers/model_sharing.


# Standard RAG

In [17]:
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")


In [18]:
outputs_agentic_rag = []

for example in tqdm(eval_dataset):
    question = example["question"]

    enhanced_question = f"""Using the information contained in your knowledge base, which you can access with the 'retriever' tool,
give a comprehensive answer to the question below.
Respond only to the question asked, response should be concise and relevant to the question.
If you cannot find information, do not give up and try calling your retriever again with different arguments!
Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.
Your queries should not be questions but affirmative form sentences: e.g. rather than "How do I load a model from the Hub in bf16?", query should be "load a model from the Hub bf16 weights".

Question:
{question}"""
    answer = agent.run(enhanced_question)
    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')

    results_agentic = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_agentic_rag.append(results_agentic)

  0%|          | 0/65 [00:00<?, ?it/s]

  2%|▏         | 1/65 [00:11<12:24, 11.63s/it]

Question: What architecture is the `tokenizers-linux-x64-musl` binary designed for?

Answer: The `tokenizers-linux-x64-musl` binary is designed for the **x86_64-unknown-linux-musl** architecture.
True answer: x86_64-unknown-linux-musl


  3%|▎         | 2/65 [00:23<12:37, 12.02s/it]

Question: What is the purpose of the BLIP-Diffusion model?

Answer: BLIP-Diffusion is a model proposed in the paper 'BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing' that enables zero-shot subject-driven generation and control-guided zero-shot generation.
True answer: The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.


  5%|▍         | 3/65 [00:39<13:55, 13.47s/it]

Question: How can a user claim authorship of a paper on the Hugging Face Hub?

Answer: To claim authorship of a paper on the Hugging Face Hub, if your paper is not linked to your account, you can click on your name in the corresponding Paper page and click 'claim authorship'. This will automatically redirect to your paper settings where you can confirm the request. The admin team will validate your request soon. Once confirmed, the Paper page will show as verified.
True answer: By clicking their name on the corresponding Paper page and clicking "claim authorship", then confirming the request in paper settings for admin team validation.


  6%|▌         | 4/65 [00:48<12:03, 11.86s/it]

Question: What is the purpose of the /healthcheck endpoint in the Datasets server API?

Answer: The purpose of the /healthcheck endpoint in the Datasets server API is to ensure that the app is running.
True answer: Ensure the app is running


  8%|▊         | 5/65 [01:01<12:08, 12.14s/it]

Question: What is the default context window size for Local Attention in the LongT5 model?

Answer: The default context window size for Local Attention in the LongT5 model is not explicitly mentioned in the provided documents. However, a context size of 128 tokens is suggested for training to balance speed and memory requirements.
True answer: 127 tokens


  9%|▉         | 6/65 [01:12<11:32, 11.74s/it]

Question: What method is used to load a checkpoint for a task using `AutoPipeline`?

Answer: To load a checkpoint for a task using `AutoPipeline`, you can use the `from_pretrained()` method. This method automatically retrieves the relevant pipeline given the name or path to the pretrained weights. Additionally, you can use the `from_pipe()` method to transfer components from one pipeline to another without reallocating additional memory.
True answer: from_pretrained()


 11%|█         | 7/65 [01:27<12:23, 12.81s/it]

Question: What is the purpose of Diffusers library?

Answer: The purpose of the Diffusers library is to provide a modular toolbox for state-of-the-art pretrained diffusion models. It focuses on usability over performance, simple over easy, and customizability over abstractions. The main goals include simplicity, accessibility, reproducibility, and responsibility towards its users. Additionally, it aims to be an extension of PyTorch with transparent management and consistency in project development.
True answer: To serve as a modular toolbox for both inference and training of state-of-the-art pretrained diffusion models across multiple modalities.


 12%|█▏        | 8/65 [01:34<10:29, 11.04s/it]

Question: What method does the EulerAncestralDiscreteScheduler use for sampling?

Answer: The EulerAncestralDiscreteScheduler uses ancestral sampling with Euler method steps for generating samples. This scheduler is known for its speed and often generates good outputs within 20-30 steps.
True answer: Ancestral sampling with Euler method steps.


 14%|█▍        | 9/65 [01:43<09:39, 10.35s/it]

Question: What is the name of the large multimodal model that can solve image-text tasks and is based on Flamingo?

Answer: The large multimodal model that can solve image-text tasks and is based on Flamingo is IDEFICS.
True answer: IDEFICS


 15%|█▌        | 10/65 [01:58<11:00, 12.01s/it]

Question: What is the purpose of the `gradio.Blocks` API?

Answer: The purpose of the `gradio.Blocks` API is to allow developers to have full control over the data flows and layout of their application. It enables the creation of complex, multi-step applications by providing a low-level approach for designing web apps with more flexible layouts and data flows. With `Blocks`, components can be controlled in terms of where they appear on the page, how outputs serve as inputs to other functions, and updating properties/visibility based on user interaction—all within Python.
True answer: The `gradio.Blocks` API allows you to have full control over the data flows and layout of your application, enabling the building of complex, multi-step applications.


 17%|█▋        | 11/65 [02:15<11:58, 13.30s/it]

Question: What is the purpose of the two-stage model proposed in the paper "Hierarchical Text-Conditional Image Generation with CLIP Latents"?

Answer: The purpose of the two-stage model proposed in the paper 'Hierarchical Text-Conditional Image Generation with CLIP Latents' is to first use a prior to turn a text caption into a CLIP image embedding, and then decode it using a diffusion model into an image. This hierarchical approach helps improve the quality and alignment between text and generated images.
True answer: The purpose of the two-stage model is to generate a CLIP image embedding given a text caption and then generate an image conditioned on the image embedding.


 18%|█▊        | 12/65 [02:25<11:03, 12.52s/it]

Question: What command is used to install the requirements for a research project using 🤗 Transformers?

Answer: To install the requirements for a research project using 🤗 Transformers, you should run the command `pip install -r requirements.txt` inside the folder of your choice. If you need to install 🤗 Transformers itself from source, use `pip install transformers`. For detailed instructions, refer to the README in each folder or the official documentation.
True answer: pip install -r requirements.txt


 20%|██        | 13/65 [02:37<10:42, 12.35s/it]

Question: What task does the `roberta-large-mnli` checkpoint perform?

Answer: The `roberta-large-mnli` checkpoint performs multi-label natural language inference (NLI) task.
True answer: Text classification


 22%|██▏       | 14/65 [02:51<10:52, 12.79s/it]

Question: What service is replacing the Paid tier of the Inference API at Hugging Face?

Answer: The service replacing the Paid tier of the Inference API at Hugging Face is Inference Endpoints. Hugging Face PRO users have access to exclusive API endpoints for a curated list of powerful models that benefit from ultra-fast inference, and these can be used as an alternative to the paid Inference API.
True answer: Inference Endpoints


 23%|██▎       | 15/65 [02:58<09:11, 11.04s/it]

Question: What architectural feature does SqueezeBERT use instead of fully-connected layers for the Q, K, V, and FFN layers?

Answer: SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V and FFN layers.
True answer: Grouped convolutions


 25%|██▍       | 16/65 [03:04<07:42,  9.44s/it]

Question: What type of license is the HuggingFace Team's software distributed under?

Answer: The HuggingFace Team's software is distributed under the Apache License, Version 2.0.
True answer: Apache License, Version 2.0


 26%|██▌       | 17/65 [03:13<07:34,  9.47s/it]

Question: What are the two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed?

Answer: The two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed are: 
1. Splitting the embedding matrix into two smaller matrices.
2. Using repeating layers split among groups.
True answer: Splitting the embedding matrix into two smaller matrices and using repeating layers split among groups.


 28%|██▊       | 18/65 [03:28<08:38, 11.04s/it]

Question: What are the three main steps for fine-tuning a model with the 🤗 Datasets library?

Answer: The three main steps for fine-tuning a model with the 🤗 Datasets library are: 1. Load a dataset from the Hugging Face Hub, 2. Preprocess the data with `Dataset.map()`, and 3. Train the model using appropriate training commands or methods.
True answer: 1. Load a dataset from the Hugging Face Hub. 2. Preprocess the data with `Dataset.map()`. 3. Load and compute metrics.


 29%|██▉       | 19/65 [03:38<08:06, 10.58s/it]

Question: What is the maximum improvement in throughput achieved by Hugging Face Infinity compared to vanilla transformers?

Answer: Hugging Face Infinity can deliver up to 800% higher throughput compared to vanilla transformers.
True answer: +800%


 31%|███       | 20/65 [03:54<09:15, 12.35s/it]

Question: What is the command to upload a spaCy pipeline to the Hugging Face Hub?

Answer: To upload a spaCy pipeline to the Hugging Face Hub, you can use the following command after setting up your authentication token:

```bash
python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo] [--verbose]
```
Make sure to package your spaCy pipeline using `spacy package` and build a wheel file (`--build wheel`). You can find more details in the provided resources.
True answer: python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl


 32%|███▏      | 21/65 [04:09<09:31, 12.98s/it]

Question: What is the time and memory complexity of the Nyströmformer's approximation of self-attention?

Answer: The Nyströmformer's approximation of self-attention has a time complexity of \(O(n)\) and a memory complexity also reduced to \(O(n)\), as it approximates the standard self-attention mechanism with a linear method, avoiding the quadratic complexity of the original self-attention matrix.
True answer: O(n)


 34%|███▍      | 22/65 [04:22<09:29, 13.25s/it]

Question: What is the goal of the Named Entity Recognition task in token classification?

Answer: The goal of the Named Entity Recognition (NER) task in token classification is to find and classify entities such as person, location, or organization within a piece of text. This involves labeling each token with an appropriate entity class or a non-entity class.
True answer: The goal of the Named Entity Recognition task is to find the entities in a piece of text, such as person, location, or organization.


 35%|███▌      | 23/65 [04:28<07:44, 11.06s/it]

Question: What is the resolution of images used by the CLIPSeg model?

Answer: The resolution of images used by the CLIPSeg model is 352 x 352 pixels.
True answer: 352 x 352 pixels


 37%|███▋      | 24/65 [04:41<07:57, 11.66s/it]

Question: What can you use Gradio for?

Answer: Gradio can be used for building demonstrations of models, such as an ASR model, and sharing them with a testing team. You can also use it to create interactive interfaces that can be tested using a microphone on your device. For more examples and information, you can explore the demos in the Gradio repository, read the guides page which covers cool and advanced features, or check out the documentation.
True answer: Create a demo for your machine learning model, share your machine learning model with others, and debug your model.


 38%|███▊      | 25/65 [04:55<08:06, 12.17s/it]

Question: What TensorFlow API function is used to load a saved tensor file?

Answer: The TensorFlow API function used to load a saved tensor file is `safetensors.tensorflow.load_file` or `safetensors.tensorflow.load`. These functions are part of the safetensors library and are specifically designed for loading tensors saved in the `.safetensors` format.
True answer: safetensors.tensorflow.load_file


 40%|████      | 26/65 [05:05<07:26, 11.46s/it]

Question: Where can you access the logs of your Endpoints in Hugging Face Endpoints?

Answer: You can access the logs of your Endpoints in Hugging Face Endpoints through the UI in the ‘Logs’ tab. The Container Logs are only available when your Endpoint is in the ‘Running’ state.
True answer: In the "Logs" tab of your Endpoint through the UI.


 42%|████▏     | 27/65 [05:15<06:59, 11.05s/it]

Question: What is the latest task added to Hugging Face AutoTrain for Computer Vision?

Answer: The latest task added to Hugging Face AutoTrain for Computer Vision is Image Classification.
True answer: Image Classification


 43%|████▎     | 28/65 [05:24<06:25, 10.42s/it]

Question: What is the default repository type created by the `create_repo` function on Hugging Face Hub?

Answer: By default, the `create_repo` function on Hugging Face Hub creates a model repository.
True answer: model


 45%|████▍     | 29/65 [05:31<05:45,  9.60s/it]

Question: How many splits does the "duorc" dataset have?

Answer: The 'duorc' dataset has six splits.
True answer: Six


 46%|████▌     | 30/65 [05:44<06:11, 10.61s/it]

Question: What is the purpose of Fully Sharded Data Parallel (FSDP) in distributed training?

Answer: Fully Sharded Data Parallel (FSDP) is designed to enable efficient distributed training of large pretrained models up to 1T parameters. It achieves this by sharding model parameters, gradients, and optimizer states across data parallel processes, reducing memory usage compared to traditional DistributedDataParallel (DDP). FSDP also allows offloading sharded model parameters to a CPU when they are inactive, further improving memory efficiency and enabling the training of larger models on fewer GPUs.
True answer: FSDP is developed for distributed training of large pretrained models up to 1T parameters by sharding the model parameters, gradients, and optimizer states across data parallel processes.


 48%|████▊     | 31/65 [05:54<05:55, 10.46s/it]

Question: What file format is used to save and store PyTorch model weights more securely than `.bin` files?

Answer: The file format used to save and store PyTorch model weights more securely than `.bin` files is `safetensors`.
True answer: `.safetensors`


 49%|████▉     | 32/65 [06:03<05:29,  9.98s/it]

Question: What type of security certification does Hugging Face have?

Answer: Hugging Face is SOC2 Type 2 certified, which means they provide security certification to their customers and actively monitor and patch any security weaknesses.
True answer: SOC2 Type 2 certified


 51%|█████     | 33/65 [06:14<05:29, 10.30s/it]

Question: What do RAG models combine to generate outputs?

Answer: RAG models combine the powers of pretrained dense retrieval (DPR) and sequence-to-sequence models. They retrieve documents, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
True answer: Pretrained dense retrieval (DPR) and sequence-to-sequence models.


 52%|█████▏    | 34/65 [06:21<04:44,  9.19s/it]

Question: What library does MarkupLMFeatureExtractor use to extract data from HTML and XML files?

Answer: MarkupLMFeatureExtractor uses the Beautiful Soup library for extracting data from HTML and XML files.
True answer: Beautiful Soup


 54%|█████▍    | 35/65 [06:31<04:46,  9.54s/it]

Question: What is the file size limit for syncing to HF Spaces without using Git-LFS?

Answer: The file size limit for syncing to HF Spaces without using Git-LFS is 10MB. For files larger than 10MB, Spaces requires Git-LFS.
True answer: 10MB


 55%|█████▌    | 36/65 [06:37<04:04,  8.43s/it]

Question: What is the title of the paper introducing the ByT5 model?

Answer: ByT5: Towards a token-free future with pre-trained byte-to-byte models
True answer: ByT5: Towards a token-free future with pre-trained byte-to-byte models


 57%|█████▋    | 37/65 [06:43<03:32,  7.60s/it]

Question: What is the dimension of the feature vector for the base BERT model?

Answer: The dimension of the feature vector for the base BERT model is 768.
True answer: 768


 58%|█████▊    | 38/65 [06:53<03:45,  8.34s/it]

Question: What special identifier does the WordPiece Model use for continuing subwords?

Answer: The WordPiece Model uses the `##` prefix to identify tokens that are part of a word (i.e., not starting a word) for continuing subwords.
True answer: ##


 60%|██████    | 39/65 [07:07<04:22, 10.09s/it]

Question: What is the purpose of the 🧨 Diffusers tutorials?

Answer: The purpose of 🧨 Diffusers tutorials is to provide a beginner-friendly introduction to diffusion models and help users understand how to use the library as a modular toolbox. The tutorials cover how to use a pipeline for inference, deconstructing it to understand the library's core components, and eventually learn how to train your own diffusion model.
True answer: To provide a gentle introduction to diffusion models and help understand the library fundamentals.


 62%|██████▏   | 40/65 [07:15<03:53,  9.34s/it]

Question: What is the default setting for the `allow_flagging` parameter in Gradio's `Interface`?

Answer: The default setting for the `allow_flagging` parameter in Gradio's `Interface` is `'manual'`. This means users will see a button to flag, and samples are only flagged when the button is clicked.
True answer: "manual"


 63%|██████▎   | 41/65 [07:27<04:08, 10.35s/it]

Question: Where can the full code for the Stable Diffusion demo be found?

Answer: The full code for the Stable Diffusion demo can be found in several repositories. For a simplified version, see: https://hf.co/spaces/stabilityai/stable-diffusion/tree/main. The original codebases are available at CompVis/stable-diffusion (for v1.0) and Stability-AI/stablediffusion (for v2.0), as well as on GitHub repositories such as Xiang-cd/DiffEdit-stable-diffusion.
True answer: https://hf.co/spaces/stabilityai/stable-diffusion/tree/main


 65%|██████▍   | 42/65 [07:37<03:51, 10.08s/it]

Question: What transformation does the FNet model use to replace the self-attention layer in a BERT model?

Answer: The FNet model uses a Fourier transform to replace the self-attention layer in a BERT model.
True answer: Fourier transform


 66%|██████▌   | 43/65 [07:43<03:16,  8.95s/it]

Question: What type of test should typically accompany a bug fix in Gradio's testing strategy?

Answer: Typically, a test should accompany a bug fix in Gradio's testing strategy to ensure the fix works as intended and does not introduce new bugs.
True answer: Dynamic code test


 68%|██████▊   | 44/65 [07:58<03:48, 10.89s/it]

Question: How can you force mixed precision training when initializing the Accelerator in 🤗 Accelerate?

Answer: To force mixed precision training when initializing the Accelerator in 🤗 Accelerate, you need to use the `write_basic_config` function from the `accelerate.utils` module. Here is an example of how to do it:
```python
from accelerate.utils import write_basic_config
write_basic_config(training_environment=True)
```
This will configure your environment for mixed-precision training.
True answer: By passing `fp16=True` to the Accelerator init.


 69%|██████▉   | 45/65 [08:12<03:55, 11.76s/it]

Question: What is the purpose of tokenizers in the NLP pipeline?

Answer: Tokenizers in the NLP pipeline serve to translate raw text into numerical data that can be processed by models. They break down the text into smaller units, known as tokens or subwords, which are then mapped to unique IDs. This process is crucial for allowing models to understand and manipulate textual information effectively.
True answer: To translate text into data that can be processed by the model.


 71%|███████   | 46/65 [08:28<04:07, 13.04s/it]

Question: What is the purpose of the Safety Checker in the Diffusers library?

Answer: The Safety Checker in the Diffusers library is an essential component that helps flag inappropriate content generated during inference. It checks and compares the class probability of a set of hard-coded harmful concepts against an image after it has been generated, thereby screening out potentially harmful or NSFW content. Disabling the safety checker is not recommended for public-facing applications due to ethical and safety concerns, as highlighted by both the diffusers team and Hugging Face.
True answer: The Safety Checker checks and compares the class probability of a set of hard-coded harmful concepts in the embedding space against an image after it has been generated to mitigate the risk of generating harmful content.


 72%|███████▏  | 47/65 [08:35<03:21, 11.19s/it]

Question: What Python class allows you to retrieve Discussions and Pull Requests from a given repository on the Hugging Face Hub?

Answer: The `HfApi` class allows you to retrieve Discussions and Pull Requests on a given repo from the Hugging Face Hub.
True answer: HfApi


 74%|███████▍  | 48/65 [08:48<03:21, 11.82s/it]

Question: What is the name of the new library introduced by Hugging Face for hosting scikit-learn models?

Answer: The new library introduced by Hugging Face for hosting scikit-learn models is the `huggingface_hub` library. However, there isn't a specific library mentioned for directly handling scikit-learn models within the provided documents.
True answer: Skops


 75%|███████▌  | 49/65 [09:02<03:19, 12.48s/it]

Question: What is the purpose of Textual Inversion?

Answer: Textual Inversion is a training technique for personalizing image generation models with just a few example images of what you want it to learn. This technique works by learning and updating the text embeddings (the new embeddings are tied to a special word you must use in the prompt) to match the example images you provide.
True answer: Textual Inversion is a training method for personalizing models by learning new text embeddings from a few example images.


 77%|███████▋  | 50/65 [09:25<03:50, 15.37s/it]

Question: What is the recommended multiple of batch size for fp16 data type on an A100 GPU?

Answer: The recommended multiple of batch size for fp16 data type on an A100 GPU is not explicitly stated, but the documents suggest that a batch size of 8 or 16 could be optimal depending on the context. For example, a batch size of 16 with fp16 precision resulted in lower memory usage compared to other configurations.
True answer: 64


 78%|███████▊  | 51/65 [09:36<03:16, 14.06s/it]

Question: How do you run a Gradio Blocks app in reload mode using a Python IDE?

Answer: To run a Gradio Blocks app in reload mode using a Python IDE, use the command `gradio <filename>` instead of `python <filename>`. This will start the backend server in reload mode, which will watch for changes in the file and automatically reload the app if any are made.
True answer: Run `gradio run.py` in the terminal.


 80%|████████  | 52/65 [09:49<03:01, 13.94s/it]

Question: How can you install the Hugging Face Unity API in your Unity project?

Answer: To install the Hugging Face Unity API in your Unity project, follow these steps:

1. Open your Unity project.
2. Go to `Window` -> `Package Manager`.
3. Click `+` and select `Add Package from git URL`.
4. Enter `https://github.com/huggingface/unity-api.git`.
5. Once installed, the Unity API wizard should pop up. If not, go to `Window` -> `Hugging Face API Wizard`.
True answer: To install the Hugging Face Unity API in your Unity project, go to `Window` -> `Package Manager`, click `+` and select `Add Package from git URL`, then enter `https://github.com/huggingface/unity-api.git`.


 82%|████████▏ | 53/65 [10:04<02:49, 14.13s/it]

Question: What is the pretraining objective of the Wav2Vec2 context network?

Answer: The pretraining objective of the Wav2Vec2 context network is a contrastive task. The model has to predict the true quantized speech representation of the masked prediction from a set of false ones, encouraging the model to find the most similar context vector and quantized speech unit (the target label).
True answer: The pretraining objective of the Wav2Vec2 context network is a contrastive task where the model has to predict the true quantized speech representation of the masked prediction from a set of false ones.


 83%|████████▎ | 54/65 [10:11<02:10, 11.90s/it]

Question: What is the default checkpoint used by the sentiment analysis pipeline in the Transformers library?

Answer: The default checkpoint used by the sentiment analysis pipeline in the Transformers library is `distilbert-base-uncased-finetuned-sst-2-english`.
True answer: distilbert base uncased finetuned sst2 english


 85%|████████▍ | 55/65 [10:28<02:16, 13.69s/it]

Question: What is the purpose of the notebook "How to use DeepSpeed to train models with billions of parameters on Habana Gaudi"?

Answer: The purpose of the notebook 'How to use DeepSpeed to train models with billions of parameters on Habana Gaudi' is to demonstrate how to utilize DeepSpeed to pre-train or fine-tune large language models, such as the 1.6B-parameter GPT2-XL for causal language modeling, specifically on the Habana Gaudi hardware. This notebook aims to leverage the capabilities of the Gaudi accelerator for efficient model training with support from Optimum Habana.
True answer: To show how to use DeepSpeed to pre-train/fine-tune the 1.6B-parameter GPT2-XL for causal language modeling on Habana Gaudi.


 86%|████████▌ | 56/65 [10:38<01:53, 12.61s/it]

Question: What command line module does PyTorch provide to run a script on multiple GPUs?

Answer: To run a script on multiple GPUs using PyTorch from the command line, you can use the `torchrun` module. The command would look something like this: `CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...`
True answer: torchrun


 88%|████████▊ | 57/65 [10:49<01:34, 11.87s/it]

Question: What is the most popular vision transformer model on the Hugging Face Model Hub for image classification?

Answer: The most popular vision transformer model on the Hugging Face Model Hub for image classification is Vision Transformer (ViT) from Google AI.
True answer: google/vit-base-patch16-224


 89%|████████▉ | 58/65 [10:54<01:09,  9.98s/it]

Question: What is the command to upload an ESPnet model to a Hugging Face repository?

Answer: ./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo
True answer: ./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo


 91%|█████████ | 59/65 [11:00<00:52,  8.78s/it]

Question: What file should be added to a model repository to install custom Python dependencies for Inference Endpoints?

Answer: To install custom Python dependencies for Inference Endpoints, a `requirements.txt` file should be added to the model repository.
True answer: requirements.txt


 92%|█████████▏| 60/65 [11:06<00:38,  7.75s/it]

Question: How many images are needed to teach new concepts to Stable Diffusion using Textual Inversion?

Answer: 3-5 images
True answer: 3-5 images


 94%|█████████▍| 61/65 [11:16<00:34,  8.58s/it]

Question: What is the maximum size of a model checkpoint before it is automatically sharded in Transformers version 4.18.0?

Answer: Since version 4.18.0, model checkpoints that end up taking more than 10GB of space are automatically sharded in smaller pieces.
True answer: 10GB


 95%|█████████▌| 62/65 [11:32<00:32, 10.91s/it]

Question: What is the purpose of Weights and Biases (W&B) for data scientists and machine learning scientists?

Answer: Weights and Biases (W&B) is a tool designed for data scientists and machine learning scientists to track their experiments throughout the entire development lifecycle, from training to production. It helps in aggregating metrics, visualizing them, and creating customizable dashboards. W&B also supports bias analysis by providing tools and recommendations to address biases at various stages of the ML system development process, such as task definition, dataset curation, and model training.
True answer: To track their machine learning experiments at every stage, from training to production.


 97%|█████████▋| 63/65 [11:38<00:18,  9.26s/it]

Question: What is the name of the open-source library created by Hugging Face to simplify Transformer acceleration?

Answer: Optimum Intel
True answer: Optimum


 98%|█████████▊| 64/65 [11:49<00:09,  9.89s/it]

Question: What parameter is used to ensure that elements in a row have the same height in Gradio?

Answer: To ensure that elements in a row have the same height in Gradio, use the `equal_height` argument of the `style` method. Here is an example code snippet: ```python
with gr.Blocks() as demo:
    with gr.Row(equal_height=True):
        textbox = gr.Textbox()
        btn2 = gr.Button("Button 2")
```
True answer: equal_height


100%|██████████| 65/65 [11:56<00:00, 11.02s/it]

Question: What is the command to install the latest version of Optimum with OpenVINO support?

Answer: To install the latest version of Optimum with OpenVINO support, you should use the following command: pip install --upgrade-strategy eager optimum[openvino]
True answer: pip install --upgrade-strategy eager optimum["openvino"]
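
The cell below runs the standard RAG baseline over the same questions: one retrieval using the raw question, followed by one LLM call. That flow can be sketched with the retriever and the model passed in as plain callables, so it runs without a model server. The names `build_rag_prompt`, `answer_with_standard_rag`, `fake_retrieve`, and `fake_generate` are hypothetical helpers for illustration and are not part of the notebook. The sketch also leaves out the prompt's final instruction about calling the retriever again, since no tool is available in this setting.

```python
from typing import Callable, Dict

# Prompt wording follows the baseline cell, minus the agent-style
# "call your retriever again" instruction (an intentional deviation).
PROMPT_TEMPLATE = """Given the question and supporting documents below, give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.

Question:
{question}

{context}
"""


def build_rag_prompt(question: str, context: str) -> str:
    """Fill the baseline prompt with the question and the retrieved documents."""
    return PROMPT_TEMPLATE.format(question=question, context=context)


def answer_with_standard_rag(
    question: str,
    retrieve: Callable[[str], str],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Single-pass RAG: retrieve once with the raw question, then generate once."""
    context = retrieve(question)
    prompt = build_rag_prompt(question, context)
    return {"question": question, "generated_answer": generate(prompt)}


# Hypothetical stand-ins for the real retriever tool and the LLM call.
def fake_retrieve(query: str) -> str:
    return "===== Document 0 =====\nplaceholder passage for: " + query


def fake_generate(prompt: str) -> str:
    return f"(stub answer built from a prompt of {len(prompt)} characters)"


record = answer_with_standard_rag("What is Gradio?", fake_retrieve, fake_generate)
print(record["generated_answer"])
```

In the notebook itself, `retrieve` corresponds to `retriever_tool` and `generate` to the `litellm.completion` call. Keeping them as parameters makes it possible to check the prompt assembly separately from the model.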





In [None]:
from tqdm import tqdm
from litellm import completion  # Import litellm directly

outputs_standard_rag = []

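for example in tqdm(eval_dataset):
    question = example["question"]
    # Standard RAG: retrieve once, using the raw user question as the query
    context = retriever_tool(question)

    # NOTE: the prompt's last instruction ("try calling your retriever again") has no effect here:
    # this baseline makes a single LLM call and cannot re-invoke the retriever.
    prompt = f"""Given the question and supporting documents below, give a comprehensive answer to the question.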
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If you cannot find information, do not give up and try calling your retriever again with different arguments!

Question:
{question}

{context}
"""
    # Generate the answer using litellm.completion
    response = completion(
        model="ollama/qwen2.5:7b",  # Match your model
        messages=[{"content": prompt, "role": "user"}],  # Format as a message
        api_base="http://127.0.0.1:11434",
    )
    answer = response.choices[0].message.content  # Extract the text from the response

    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')

    # One record per question for the standard RAG baseline
    result_standard = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_standard_rag.append(result_standard)

  2%|▏         | 1/65 [00:10<11:01, 10.33s/it]

Question: What architecture is the `tokenizers-linux-x64-musl` binary designed for?

Answer: The `tokenizers-linux-x64-musl` binary is designed for the **x86_64-unknown-linux-musl** architecture.
True answer: x86_64-unknown-linux-musl


  3%|▎         | 2/65 [00:14<07:11,  6.85s/it]

Question: What is the purpose of the BLIP-Diffusion model?

Answer: The purpose of the BLIP-Diffusion model is to enable zero-shot subject-driven generation and control-guided zero-shot generation. This model was proposed in the paper [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720).
True answer: The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.


  5%|▍         | 3/65 [00:24<08:18,  8.05s/it]

Question: How can a user claim authorship of a paper on the Hugging Face Hub?

Answer: To claim authorship of a paper on the Hugging Face Hub, follow these steps:

1. Ensure your email is associated with your account (The Hub will attempt to automatically match papers to users based on their email).
2. Go to the Paper page where you want to claim authorship.
3. Click on your name and then click "claim authorship". This will redirect you to your paper settings where you can confirm the request.
4. Wait for the admin team to validate your request.

Once validated, the Paper page will show as verified. If your paper is not linked to your account and you need to claim it manually, follow steps 2-4.
True answer: By clicking their name on the corresponding Paper page and clicking "claim authorship", then confirming the request in paper settings for admin team validation.


  6%|▌         | 4/65 [00:26<05:54,  5.81s/it]

Question: What is the purpose of the /healthcheck endpoint in the Datasets server API?

Answer: The purpose of the /healthcheck endpoint in the Datasets server API is to ensure that the app is running. (Source Document 0)
True answer: Ensure the app is running


  8%|▊         | 5/65 [00:34<06:38,  6.64s/it]

Question: What is the default context window size for Local Attention in the LongT5 model?

Answer: The default context window size for Local Attention in the LongT5 model is not explicitly stated in the provided documents. However, based on the structure of the Longformer model which LongT5 extends, we can infer that the local attention window length `w` can be defined per layer and is configurable through `config.attention_window`. Therefore, without specific information from the LongT5 documentation, we cannot determine a default value.

If you need the exact default value, you may want to refer directly to the LongT5 model's configuration or documentation.
True answer: 127 tokens


  9%|▉         | 6/65 [00:38<05:42,  5.81s/it]

Question: What method is used to load a checkpoint for a task using `AutoPipeline`?

Answer: The method used to load a checkpoint for a task using `AutoPipeline` is the `from_pipe()` method. This method is used to transfer components from an existing pipeline to a new one, allowing you to reuse the same components without additional memory allocation.

Source: Document 0 and Document 5
True answer: from_pretrained()


 11%|█         | 7/65 [00:50<07:33,  7.82s/it]

Question: What is the purpose of Diffusers library?

Answer: The purpose of the Diffusers library is to provide a modular toolbox for generating images, audio, and 3D structures of molecules using state-of-the-art pretrained diffusion models. It focuses on usability over performance, simplicity over complexity, and customizability over abstractions (Document 1 and Document 3).

Specifically:
- **Usability**: Making it easy to use and customize the library.
- **Simplicity**: Keeping the project’s goals lean and coherent.
- **Customizability**: Allowing contributors to run the project with or without technical expertise, making research more accessible (Document 0).
- It supports both simple inference solutions and training of diffusion models.

Additionally, Diffusers is designed as a natural extension of PyTorch, adhering closely to its design principles (Document 2). It also uses LoRA for quick training and sharing of diffusion models (Document 5).
True answer: To serve as a modular t

 12%|█▏        | 8/65 [00:54<06:05,  6.42s/it]

Question: What method does the EulerAncestralDiscreteScheduler use for sampling?

Answer: The EulerAncestralDiscreteScheduler uses ancestral sampling with Euler method steps for its sampling process. This is a fast scheduler that can often generate good outputs in 20-30 steps, based on the original k-diffusion implementation.

Source: Document 0
True answer: Ancestral sampling with Euler method steps.


 14%|█▍        | 9/65 [00:56<04:44,  5.07s/it]

Question: What is the name of the large multimodal model that can solve image-text tasks and is based on Flamingo?

Answer: The large multimodal model that can solve image-text tasks and is based on Flamingo is IDEFICS (Document 2).
True answer: IDEFICS


 15%|█▌        | 10/65 [01:00<04:20,  4.74s/it]

Question: What is the purpose of the `gradio.Blocks` API?

Answer: The purpose of the `gradio.Blocks` API is to provide a low-level approach for designing web apps with more flexible layouts and data flows. It allows users to control where components appear on the page, handle complex data flows, and update component properties/visibility based on user interaction—all in Python.

Source: Document 0, Document 1
True answer: The `gradio.Blocks` API allows you to have full control over the data flows and layout of your application, enabling the building of complex, multi-step applications.


 17%|█▋        | 11/65 [01:07<04:56,  5.49s/it]

Question: What is the purpose of the two-stage model proposed in the paper "Hierarchical Text-Conditional Image Generation with CLIP Latents"?

Answer: The purpose of the two-stage model proposed in the paper "Hierarchical Text-Conditional Image Generation with CLIP Latents" is to turn a text caption into a CLIP image embedding using a prior, and then decode it into an image through a diffusion model. This approach leverages the strengths of both the CLIP model for understanding images from text and a diffusion model for generating high-fidelity images.

Source document: Document 0
True answer: The purpose of the two-stage model is to generate a CLIP image embedding given a text caption and then generate an image conditioned on the image embedding.


 18%|█▊        | 12/65 [01:10<04:05,  4.64s/it]

Question: What command is used to install the requirements for a research project using 🤗 Transformers?

Answer: The command used to install the requirements for a research project using 🤗 Transformers is:
```
pip install -r requirements.txt
``` 
Source Document: Document 0
True answer: pip install -r requirements.txt


 20%|██        | 13/65 [01:14<03:48,  4.39s/it]

Question: What task does the `roberta-large-mnli` checkpoint perform?

Answer: The `roberta-large-mnli` checkpoint performs the Multi-NLI (MNLI) task, which involves natural language inference and entailment classification. This task requires the model to determine whether the relationship between two sentences is entailed, contradiction, or neutral.

Reference: Document 0
True answer: Text classification


 22%|██▏       | 14/65 [01:22<04:49,  5.68s/it]

Question: What service is replacing the Paid tier of the Inference API at Hugging Face?

Answer: Based on the provided documents, it is not explicitly stated what service is replacing the Paid tier of the Inference API at Hugging Face. However, documents suggest that [Inference Endpoints](https://ui.endpoints.huggingface.co/) could be a relevant service for production deployment, offering more features and possibly higher rates than the free Inference API.

To directly answer the question: The provided documents do not specify a direct replacement for the Paid tier of the Inference API. However, Inference Endpoints might offer similar or enhanced functionalities in a paid version suitable for production environments.
True answer: Inference Endpoints


 23%|██▎       | 15/65 [01:25<03:58,  4.76s/it]

Question: What architectural feature does SqueezeBERT use instead of fully-connected layers for the Q, K, V, and FFN layers?

Answer: SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V, and FFN layers. (Source: Document 0)
True answer: Grouped convolutions


 25%|██▍       | 16/65 [01:28<03:31,  4.33s/it]

Question: What type of license is the HuggingFace Team's software distributed under?

Answer: The HuggingFace Team's software is distributed under the Apache License, Version 2.0. (Document 0, Document 1, Document 2, Document 3, Document 4, Document 5, Document 6)
True answer: Apache License, Version 2.0


 26%|██▌       | 17/65 [01:32<03:13,  4.04s/it]

Question: What are the two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed?

Answer: The two parameter-reduction techniques proposed in the ALBERT model are:

1. Splitting the embedding matrix into two smaller matrices.
2. Using repeating layers split among groups.

These techniques help lower memory consumption and increase training speed (Document 0).
True answer: Splitting the embedding matrix into two smaller matrices and using repeating layers split among groups.


 28%|██▊       | 18/65 [01:35<03:04,  3.92s/it]

Question: What are the three main steps for fine-tuning a model with the 🤗 Datasets library?

Answer: The three main steps for fine-tuning a model with the 🤗 Datasets library are:

1. Load a dataset from the Hugging Face Hub.
2. Preprocess the data with `Dataset.map()`.
3. Load and compute metrics.

(Source Document 0)
True answer: 1. Load a dataset from the Hugging Face Hub. 2. Preprocess the data with `Dataset.map()`. 3. Load and compute metrics.


 29%|██▉       | 19/65 [01:38<02:38,  3.45s/it]

Question: What is the maximum improvement in throughput achieved by Hugging Face Infinity compared to vanilla transformers?

Answer: The maximum improvement in throughput achieved by Hugging Face Infinity compared to vanilla transformers is up to 800%, according to Document 0.
True answer: +800%


 31%|███       | 20/65 [01:45<03:28,  4.63s/it]

Question: What is the command to upload a spaCy pipeline to the Hugging Face Hub?

Answer: The command to upload a spaCy pipeline to the Hugging Face Hub is:

```bash
python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo] [--verbose]
```

This command can be used from a Python script as well using the `push` function from the `spacy_huggingface_hub` module.

Source document: Document 4
True answer: python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl


 32%|███▏      | 21/65 [01:50<03:27,  4.71s/it]

Question: What is the time and memory complexity of the Nyströmformer's approximation of self-attention?

Answer: The Nyströmformer's approximation of self-attention has time complexity \(O(n)\) and memory complexity \(O(n)\), where \(n\) is the number of landmarks selected. This is because the method uses the Nyström method to approximate the self-attention mechanism, which reduces both time and memory complexities from \(O(n^2)\) to linear.

Source: Document 0
True answer: O(n)


 34%|███▍      | 22/65 [01:54<03:14,  4.51s/it]

Question: What is the goal of the Named Entity Recognition task in token classification?

Answer: The goal of the Named Entity Recognition (NER) task in token classification is to find and classify entities such as person, location, or organization in a piece of text. Each token in a sentence is labeled with an appropriate class corresponding to an entity or "no entity." 

Source: Document 0
True answer: The goal of the Named Entity Recognition task is to find the entities in a piece of text, such as person, location, or organization.


 35%|███▌      | 23/65 [01:56<02:43,  3.90s/it]

Question: What is the resolution of images used by the CLIPSeg model?

Answer: The resolution of images used by the CLIPSeg model is 352 x 352 pixels. This information can be found in Document 0.
True answer: 352 x 352 pixels


 37%|███▋      | 24/65 [02:01<02:52,  4.22s/it]

Question: What can you use Gradio for?

Answer: Gradio is a Python library that allows you to quickly create customizable web apps for your machine learning models and data processing pipelines. You can deploy Gradio apps on [Hugging Face Spaces](https://hf.space) for free, run them on your web server with Nginx, or use them to create demonstrations of ASR models that can be shared with a test team or tested using a microphone on a device.

Source: Document 2 and Document 5
True answer: Create a demo for your machine learning model, share your machine learning model with others, and debug your model.


 38%|███▊      | 25/65 [02:04<02:26,  3.67s/it]

Question: What TensorFlow API function is used to load a saved tensor file?

Answer: The TensorFlow API function used to load a saved tensor file is `safetensors.tensorflow.load_file`. This can be found in Document 0.
True answer: safetensors.tensorflow.load_file


 40%|████      | 26/65 [02:12<03:11,  4.91s/it]

Question: Where can you access the logs of your Endpoints in Hugging Face Endpoints?

Answer: You can access the logs of your Endpoints in Hugging Face Endpoints through the UI in the “Logs” tab of your Endpoint. Specifically, you will find access to both build logs and container logs.

- **Build Logs:** Available even when your Endpoint creation is in the “Failed” state, providing information about potential issues such as wrong versions of dependencies.
- **Container Logs:** Only available when your Endpoint is in the “Running” state during inference.

Source: Document 0
True answer: In the "Logs" tab of your Endpoint through the UI.


 42%|████▏     | 27/65 [02:14<02:37,  4.15s/it]

Question: What is the latest task added to Hugging Face AutoTrain for Computer Vision?

Answer: Based on the information provided in Document 0, the latest task added to Hugging Face AutoTrain for Computer Vision is Image Classification.
True answer: Image Classification


 43%|████▎     | 28/65 [02:19<02:40,  4.34s/it]

Question: What is the default repository type created by the `create_repo` function on Hugging Face Hub?

Answer: Based on the information provided in Document 3, the default repository type created by the `create_repo` function on Hugging Face Hub is a model repository. 

Document 3: 
```plaintext
By default, [`create_repo`] creates a model repository.
```

Therefore, the answer to the question is that the default repository type created by the `create_repo` function is a model repository.
True answer: model


 45%|████▍     | 29/65 [02:21<02:11,  3.65s/it]

Question: How many splits does the "duorc" dataset have?

Answer: The "duorc" dataset has six splits. This information can be found in Document 0 and Document 5.
True answer: Six


 46%|████▌     | 30/65 [02:28<02:50,  4.86s/it]

Question: What is the purpose of Fully Sharded Data Parallel (FSDP) in distributed training?

Answer: The purpose of Fully Sharded Data Parallel (FSDP) in distributed training is to shard model parameters, gradients, and optimizer states across data parallel processes, reducing memory usage and allowing for the scaling of training to larger batch or model sizes. This sharding improves GPU memory-efficiency, enabling the training of much larger models on fewer GPUs. Additionally, FSDP can offload these tensors to a CPU when they are inactive, further reducing memory demands.

Source: Document 0
True answer: FSDP is developed for distributed training of large pretrained models up to 1T parameters by sharding the model parameters, gradients, and optimizer states across data parallel processes.


 48%|████▊     | 31/65 [02:31<02:26,  4.30s/it]

Question: What file format is used to save and store PyTorch model weights more securely than `.bin` files?

Answer: The file format used to save and store PyTorch model weights more securely than `.bin` files is `safetensors`. This format is recommended for increased security as mentioned in Document 0.
True answer: `.safetensors`


 49%|████▉     | 32/65 [02:34<02:07,  3.86s/it]

Question: What type of security certification does Hugging Face have?

Answer: Hugging Face is SOC2 Type 2 certified. This certification indicates that Hugging Face provides security certification to its customers and actively monitors and patches any security weaknesses.

Source: Document 0 and Document 2
True answer: SOC2 Type 2 certified


 51%|█████     | 33/65 [02:39<02:08,  4.03s/it]

Question: What do RAG models combine to generate outputs?

Answer: RAG models combine the powers of pretrained dense retrieval (DPR) and sequence-to-sequence models. They retrieve documents using the retriever, pass them to a seq2seq model, and then generate outputs based on this information. This combination allows both retrieval and generation to adapt to downstream tasks during joint fine-tuning. (Source: Document 1 and Document 2)
True answer: Pretrained dense retrieval (DPR) and sequence-to-sequence models.


 52%|█████▏    | 34/65 [02:41<01:48,  3.50s/it]

Question: What library does MarkupLMFeatureExtractor use to extract data from HTML and XML files?

Answer: MarkupLMFeatureExtractor uses the Beautiful Soup Python library to extract data from HTML and XML files. Source Document 0 provides this information.
True answer: Beautiful Soup


 54%|█████▍    | 35/65 [02:45<01:51,  3.72s/it]

Question: What is the file size limit for syncing to HF Spaces without using Git-LFS?

Answer: The file size limit for syncing to HF Spaces without using Git-LFS is 10MB. This information can be found in Document 2, where it states "filesizelimit: 10485760 # this is 10MB so we can sync to HF Spaces."
True answer: 10MB


 55%|█████▌    | 36/65 [02:48<01:44,  3.60s/it]

Question: What is the title of the paper introducing the ByT5 model?

Answer: The title of the paper introducing the ByT5 model is "ByT5: Towards a token-free future with pre-trained byte-to-byte models" (Document 0, Document 3, Document 4, and Document 5).
True answer: ByT5: Towards a token-free future with pre-trained byte-to-byte models


 57%|█████▋    | 37/65 [02:51<01:30,  3.24s/it]

Question: What is the dimension of the feature vector for the base BERT model?

Answer: The dimension of the feature vector for the base BERT model is 768. This information can be found in Document 0 and Document 3.
True answer: 768


 58%|█████▊    | 38/65 [02:54<01:27,  3.25s/it]

Question: What special identifier does the WordPiece Model use for continuing subwords?

Answer: The WordPiece model uses the `##` prefix to identify tokens that are part of a word but not starting a new one. This is different from BPE, which starts from characters and builds bigger tokens as possible.

Reference: Document 2
True answer: ##


 60%|██████    | 39/65 [03:02<02:03,  4.76s/it]

Question: What is the purpose of the 🧨 Diffusers tutorials?

Answer: The purpose of the 🧨 Diffusers tutorials is to provide a gentle introduction to diffusion models and help learners understand how to use the 🧨 Diffusers library. Specifically, the tutorials aim to teach users how to:

- Use a pipeline for inference to generate content quickly.
- Deconstruct the pipeline to understand the library as a modular toolbox for building custom diffusion systems.
- Train their own diffusion models.

These objectives are designed to equip beginners with the necessary skills to explore and apply the library in their projects. (Source: Document 0)
True answer: To provide a gentle introduction to diffusion models and help understand the library fundamentals.


 62%|██████▏   | 40/65 [03:05<01:42,  4.11s/it]

Question: What is the default setting for the `allow_flagging` parameter in Gradio's `Interface`?

Answer: The default setting for the `allow_flagging` parameter in Gradio's `Interface` is `"manual"`. This can be found in Document 5.
True answer: "manual"


 63%|██████▎   | 41/65 [03:10<01:45,  4.38s/it]

Question: Where can the full code for the Stable Diffusion demo be found?

Answer: The full code for the Stable Diffusion demo can be found at:

1. For v1.0: https://github.com/CompVis/stable-diffusion
2. For v2.0: https://github.com/Stability-AI/stablediffusion

These repositories contain the original scripts and additional checkpoints that you can use for various tasks related to Stable Diffusion.

Source documents: 0, 1
True answer: https://hf.co/spaces/stabilityai/stable-diffusion/tree/main


 65%|██████▍   | 42/65 [03:13<01:31,  4.00s/it]

Question: What transformation does the FNet model use to replace the self-attention layer in a BERT model?

Answer: The FNet model uses a Fourier transform to replace the self-attention layer in a BERT model. Specifically, it returns only the real parts of the transform. Source: Document 0.
True answer: Fourier transform


 66%|██████▌   | 43/65 [03:17<01:27,  3.97s/it]

Question: What type of test should typically accompany a bug fix in Gradio's testing strategy?

Answer: According to Document 0, a bug fix in Gradio's testing strategy should typically be accompanied by a test that failed before the fix and passes afterwards. This test should _typically_ be a dynamic code test but could be a linting rule or new type if appropriate. 

Source: Document 0
True answer: Dynamic code test


 68%|██████▊   | 44/65 [03:34<02:46,  7.94s/it]

Question: How can you force mixed precision training when initializing the Accelerator in 🤗 Accelerate?

Answer: To force mixed precision training when initializing the Accelerator in 🤗 Accelerate, you can use the `--fp16` flag when launching your script. However, if you want to configure this programmatically, you can set it up using the configuration files or directly within your Python code.

Here’s how you can do it:

1. **Using Command Line Flag:**
   ```bash
   accelerate launch --fp16 your_script.py
   ```

2. **Programmatic Configuration (Python):**
   You can also configure this in your script by writing a configuration file or directly setting the `mixed_precision` parameter.

   Example:
   ```python
   from transformers import Trainer, TrainingArguments

   training_args = TrainingArguments(
       output_dir='./results',
       per_device_train_batch_size=8,
       gradient_accumulation_steps=4,
       learning_rate=2e-5,
       num_train_epochs=3,
       weight_decay=0.01

 69%|██████▉   | 45/65 [03:39<02:18,  6.95s/it]

Question: What is the purpose of tokenizers in the NLP pipeline?

Answer: The purpose of tokenizers in the NLP pipeline is to convert raw text into numerical data that can be processed by models. Specifically, tokenizers break down text into manageable units (tokens) and handle the conversion from text to numbers, which is necessary for input into neural networks. This process ensures that the model can effectively understand and analyze the textual data.

Source: Document 0
True answer: To translate text into data that can be processed by the model.


 71%|███████   | 46/65 [03:43<01:54,  6.02s/it]

Question: What is the purpose of the Safety Checker in the Diffusers library?

Answer: The purpose of the Safety Checker in the Diffusers library is to flag and prevent the generation or exposure of harmful and NSFW (Not Safe For Work) content. It checks generated outputs against known hardcoded inappropriate content to ensure responsible and ethical use of the model.

Source: Document 3, Document 6
True answer: The Safety Checker checks and compares the class probability of a set of hard-coded harmful concepts in the embedding space against an image after it has been generated to mitigate the risk of generating harmful content.


 72%|███████▏  | 47/65 [03:46<01:33,  5.17s/it]

Question: What Python class allows you to retrieve Discussions and Pull Requests from a given repository on the Hugging Face Hub?

Answer: The `HfApi` class allows you to retrieve Discussions and Pull Requests from a given repository on the Hugging Face Hub. This is demonstrated in Document 0 and can be used with methods like `get_repo_discussions`.
True answer: HfApi


 74%|███████▍  | 48/65 [03:53<01:38,  5.77s/it]

Question: What is the name of the new library introduced by Hugging Face for hosting scikit-learn models?

Answer: Based on the information provided, there is no specific mention of a new library introduced by Hugging Face for hosting scikit-learn models. However, Document 4 suggests that Hugging Face is building partnerships with Open Source ML libraries to provide free model hosting and versioning, but it does not specify which libraries are involved.

If you need more precise information, you might want to check the latest documentation or announcements from Hugging Face directly.
True answer: Skops


 75%|███████▌  | 49/65 [04:00<01:37,  6.12s/it]

Question: What is the purpose of Textual Inversion?

Answer: The purpose of Textual Inversion is to personalize image generation models by learning new text embeddings from a few example images. This allows the model to recognize and generate content related to specific concepts that are not part of its original training data, providing more control over the generated images.

Sources: Document 0, Document 1, Document 2, Document 4, Document 5, Document 6
True answer: Textual Inversion is a training method for personalizing models by learning new text embeddings from a few example images.


 77%|███████▋  | 50/65 [04:10<01:47,  7.14s/it]

Question: What is the recommended multiple of batch size for fp16 data type on an A100 GPU?

Answer: The documents provided do not contain specific information about the recommended multiple of batch size for fp16 data type on an A100 GPU. Therefore, based on the given information, I cannot provide a precise answer to the question.

However, from Document 5, we can infer that using FP16 with memory optimization techniques reduces memory usage significantly. This suggests that using FP16 can allow for larger batch sizes without running out of memory. For a definitive recommendation, you would typically need to consider the specific model and sequence length in use.

If you require a precise multiple, it is advisable to consult performance benchmarks or run experiments with your specific setup.
True answer: 64


 78%|███████▊  | 51/65 [04:14<01:30,  6.47s/it]

Question: How do you run a Gradio Blocks app in reload mode using a Python IDE?

Answer: To run a Gradio Blocks app in reload mode using a Python IDE, you should use the command `gradio` followed by the name of your script file. For example, if your script is named `run.py`, you would type:

```bash
gradio run.py
```

This will start the app in reload mode, automatically refreshing whenever you make changes to the file.
True answer: Run `gradio run.py` in the terminal.


 80%|████████  | 52/65 [04:23<01:32,  7.14s/it]

Question: How can you install the Hugging Face Unity API in your Unity project?

Answer: To install the Hugging Face Unity API in your Unity project, follow these steps:

1. Open your Unity project.
2. Go to `Window` -> `Package Manager`.
3. Click `+` and select `Add Package from git URL`.
4. Enter `https://github.com/huggingface/unity-api.git`.
5. Once installed, the Unity API wizard should pop up. If not, go to `Window` -> `Hugging Face API Wizard`.

Refer to [Document 0](#) for detailed installation instructions and a GIF demonstrating the process.
True answer: To install the Hugging Face Unity API in your Unity project, go to `Window` -> `Package Manager`, click `+` and select `Add Package from git URL`, then enter `https://github.com/huggingface/unity-api.git`.


 82%|████████▏ | 53/65 [04:27<01:15,  6.28s/it]

Question: What is the pretraining objective of the Wav2Vec2 context network?

Answer: The pretraining objective of the context network in Wav2Vec2 is a *contrastive task*. The model has to predict the true quantized speech representation of the masked prediction from a set of false ones, encouraging the model to find the most similar context vector and quantized speech unit. (Source: Document 0)
True answer: The pretraining objective of the Wav2Vec2 context network is a contrastive task where the model has to predict the true quantized speech representation of the masked prediction from a set of false ones.


 83%|████████▎ | 54/65 [04:30<00:57,  5.23s/it]

Question: What is the default checkpoint used by the sentiment analysis pipeline in the Transformers library?

Answer: The default checkpoint used by the sentiment analysis pipeline in the Transformers library is `distilbert-base-uncased-finetuned-sst-2-english` (Document 0).
True answer: distilbert base uncased finetuned sst2 english


 85%|████████▍ | 55/65 [04:38<00:59,  5.96s/it]

Question: What is the purpose of the notebook "How to use DeepSpeed to train models with billions of parameters on Habana Gaudi"?

Answer: The purpose of the notebook "How to use DeepSpeed to train models with billions of parameters on Habana Gaudi" is to demonstrate how to use DeepSpeed to pre-train/fine-tune large language models (with up to 1.6 billion parameters) for causal language modeling on Habana Gaudi hardware. This aligns with the focus on leveraging DeepSpeed and Habana's hardware capabilities for training large-scale models efficiently.

Source: Document 0
True answer: To show how to use DeepSpeed to pre-train/fine-tune the 1.6B-parameter GPT2-XL for causal language modeling on Habana Gaudi.


 86%|████████▌ | 56/65 [04:48<01:04,  7.11s/it]

Question: What command line module does PyTorch provide to run a script on multiple GPUs?

Answer: The command line module provided by PyTorch for running a script on multiple GPUs is `torchrun`. This is demonstrated in Document 1 with the following example:

```py
def main():
    world_size = 2
    mp.spawn(run_inference, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()
```

And then calling `torchrun` with the number of GPUs per node:

```bash
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```

This setup allows running inference across multiple GPUs. Therefore, the answer to the question is `torchrun`.
True answer: torchrun


 88%|████████▊ | 57/65 [04:51<00:47,  5.98s/it]

Question: What is the most popular vision transformer model on the Hugging Face Model Hub for image classification?

Answer: The most popular vision transformer model on the Hugging Face Model Hub for image classification is `google/vit-base-patch16-224`. This information can be found in Document 5.
True answer: google/vit-base-patch16-224


 89%|████████▉ | 58/65 [04:54<00:36,  5.23s/it]

Question: What is the command to upload an ESPnet model to a Hugging Face repository?

Answer: The command to upload an ESPnet model to a Hugging Face repository is:

```bash
./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo
```

This command can be found in Document 0.
True answer: ./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo


 91%|█████████ | 59/65 [04:58<00:28,  4.68s/it]

Question: What file should be added to a model repository to install custom Python dependencies for Inference Endpoints?

Answer: To install custom Python dependencies for Inference Endpoints, you should add a `requirements.txt` file to your model repository. This file lists the additional Python dependencies that need to be installed. Source Document 1 provides an example of such a `requirements.txt` file.
True answer: requirements.txt


 92%|█████████▏| 60/65 [05:02<00:22,  4.46s/it]

Question: How many images are needed to teach new concepts to Stable Diffusion using Textual Inversion?

Answer: According to the documents, Stable Diffusion can learn new concepts using textual inversion with just 3-5 example images. Therefore, the answer is that **3-5 images** are needed to teach new concepts to Stable Diffusion using Textual Inversion.

Source Document: Document 1 and Document 2
True answer: 3-5 images


 94%|█████████▍| 61/65 [05:04<00:15,  3.92s/it]

Question: What is the maximum size of a model checkpoint before it is automatically sharded in Transformers version 4.18.0?

Answer: According to Document 0, the maximum size of a model checkpoint before it is automatically sharded in Transformers version 4.18.0 is more than 10GB.
True answer: 10GB


 95%|█████████▌| 62/65 [05:08<00:11,  3.93s/it]

Question: What is the purpose of Weights and Biases (W&B) for data scientists and machine learning scientists?

Answer: The purpose of Weights and Biases (W&B) for data scientists and machine learning scientists is to track their machine learning experiments from training to production. It allows them to aggregate and visualize metrics in customizable dashboards, facilitating the monitoring and management of experimental processes. Source: Document 0 and Document 2
True answer: To track their machine learning experiments at every stage, from training to production.


 97%|█████████▋| 63/65 [05:11<00:06,  3.48s/it]

Question: What is the name of the open-source library created by Hugging Face to simplify Transformer acceleration?

Answer: The name of the open-source library created by Hugging Face to simplify Transformer acceleration is **Optimum**. This information can be found in Document 0.
True answer: Optimum


 98%|█████████▊| 64/65 [05:15<00:03,  3.52s/it]

Question: What parameter is used to ensure that elements in a row have the same height in Gradio?

Answer: The parameter used to ensure that elements in a row have the same height in Gradio is `equal_height`. This can be applied by passing it as an argument to the `.style()` method of `gr.Row()`. 

Source Document: Document 1
True answer: equal_height


100%|██████████| 65/65 [05:18<00:00,  4.90s/it]

Question: What is the command to install the latest version of Optimum with OpenVINO support?

Answer: The command to install the latest version of Optimum with OpenVINO support is:

```bash
pip install --upgrade-strategy eager optimum["openvino"]
```

This can be found in Document 1.
True answer: pip install --upgrade-strategy eager optimum["openvino"]





The evaluation prompt follows some of the best practices from our llm_judge cookbook: it uses a small integer Likert scale, states clear criteria, and gives a description for each score.

In [33]:
EVALUATION_PROMPT = """You are a fair evaluator language model.

You will be given an instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing the evaluation criteria.
1. Write detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 3. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 3}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.
5. Do not score conciseness: a correct answer that covers the question should receive max score, even if it contains additional useless information.

The instruction to evaluate:
{instruction}

Response to evaluate:
{response}

Reference Answer (Score 3):
{reference_answer}

Score Rubrics:
[Is the response complete, accurate, and factual based on the reference answer?]
Score 1: The response is completely incomplete, inaccurate, and/or not factual.
Score 2: The response is somewhat complete, accurate, and/or factual.
Score 3: The response is completely complete, accurate, and/or factual.

Feedback:"""
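
Note the doubled braces in the output-format line of the prompt. Because the prompt is later filled in with `str.format`, `{{` and `}}` are escapes that render as literal braces, while single-brace fields such as `{instruction}` are substituted. A minimal sketch with a shortened, hypothetical template (not the full prompt above):

```python
# Shortened, hypothetical stand-in for EVALUATION_PROMPT (illustration only).
TEMPLATE = (
    'Output format: "Feedback: {{write a feedback for criteria}} '
    '[RESULT] {{an integer number between 1 and 3}}"\n'
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Reference: {reference_answer}"
)

# Single-brace fields are replaced; doubled braces collapse to literal braces.
filled = TEMPLATE.format(
    instruction="What is 2 + 2?",
    response="4",
    reference_answer="4",
)
print(filled)
```

The judge therefore sees `{write a feedback for criteria}` as a literal placeholder describing the expected output format.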

In [48]:
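# Assumption: `completion` is LiteLLM's chat-completion function,
# as suggested by the "ollama/..." model identifier.
from litellm import completion


class EvaluationClient:
    """Thin wrapper that sends a single-prompt request to the judge model."""

    def __init__(self, model="ollama/llama3.1:8b", api_base="http://127.0.0.1:11434"):
        self.model = model
        self.api_base = api_base

    def __call__(self, prompt):
        response = completion(
            model=self.model,
            messages=[{"content": prompt, "role": "user"}],
            api_base=self.api_base,
            max_tokens=1000,  # cap the length of the judge's output
        )
        return response.choices[0].message.content


# Judge model, served locally at the default api_base
evaluation_client = EvaluationClient("ollama/llama3.1:8b")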

In [49]:
import pandas as pd

results = {}
for system_type, outputs in [
    ("agentic", outputs_agentic_rag),
    ("standard", outputs_standard_rag),
]:
    for experiment in tqdm(outputs):
        eval_prompt = EVALUATION_PROMPT.format(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        # The client sends the prompt as a single user message.
        eval_result = evaluation_client(eval_prompt)
        try:
            # Expect exactly one "[RESULT]" marker: feedback before it, score after it.
            feedback, score = [item.strip() for item in eval_result.split("[RESULT]")]
            experiment["eval_score_LLM_judge"] = score
            experiment["eval_feedback_LLM_judge"] = feedback
        except (ValueError, AttributeError):
            # ValueError: zero or several "[RESULT]" markers; AttributeError: empty response.
            print(f"Parsing failed - output was: {eval_result}")

    results[system_type] = pd.DataFrame.from_dict(outputs)
    results[system_type] = results[system_type].loc[~results[system_type]["generated_answer"].str.contains("Error")]

100%|██████████| 65/65 [04:14<00:00,  3.92s/it]
  5%|▍         | 3/65 [00:17<07:08,  6.90s/it]

Parsing failed - output was: Feedback: Step 1 of the response accurately describes associating an email with a Hugging Face Hub account to claim authorship. [RESULT] 2
 
Feedback: However, step 3 of the response inaccurately states that clicking on one's name and then "claim authorship" redirects to paper settings where one can confirm the request. The correct procedure is simply clicking "claim authorship". [RESULT] 1
 
Feedback: The entire process outlined in the response accurately describes how a user claims authorship of a paper on the Hugging Face Hub, including waiting for admin team validation after submitting the claim and manually linking papers to an account if necessary. This matches the reference answer exactly. [RESULT] 3


100%|██████████| 65/65 [04:50<00:00,  4.47s/it]
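
The parsing failure above happens because the judge emitted several `Feedback: ... [RESULT] n` segments, so splitting on `[RESULT]` yields more than two parts and the unpacking fails. One possible workaround, sketched here with a hypothetical helper `parse_judge_output` that is not part of the original notebook, is to keep only the last `[RESULT]` marker and accept the score only when it is 1, 2, or 3. Taking the last verdict is an assumption; other policies (first verdict, or re-prompting the judge) are equally defensible.

```python
import re
from typing import Optional, Tuple

# Matches "[RESULT] n" where n is a single digit from 1 to 3.
_RESULT_RE = re.compile(r"\[RESULT\]\s*([1-3])\b")


def parse_judge_output(text: str) -> Tuple[str, Optional[int]]:
    """Return (feedback, score); score is None if no valid verdict is found."""
    matches = list(_RESULT_RE.finditer(text))
    if not matches:
        return text.strip(), None
    last = matches[-1]
    # Everything before the last marker is kept as feedback.
    feedback = text[: last.start()].strip()
    return feedback, int(last.group(1))


single = "Feedback: Correct and complete. [RESULT] 3"
multi = "Feedback: Partly right. [RESULT] 2\nFeedback: Fully right. [RESULT] 3"
print(parse_judge_output(single))
print(parse_judge_output(multi)[1])
print(parse_judge_output("no verdict here"))
```

With a helper like this, outputs without a usable verdict still return `None` for the score, so they can be counted separately from the answers that were actually judged.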


In [50]:
DEFAULT_SCORE = 2  # Fall back to the middle score whenever scoring fails


def fill_score(x):
    """Convert a raw judge score to int, using DEFAULT_SCORE if it cannot be parsed."""
    try:
        return int(x)
    except (TypeError, ValueError):
        return DEFAULT_SCORE


for system_type in ["agentic", "standard"]:
    results[system_type]["eval_score_LLM_judge_int"] = (
        results[system_type]["eval_score_LLM_judge"].fillna(DEFAULT_SCORE).apply(fill_score)
    )
    # Rescale scores from {1, 2, 3} to [0, 1]
    results[system_type]["eval_score_LLM_judge_int"] = (results[system_type]["eval_score_LLM_judge_int"] - 1) / 2

    print(
        f"Average score for {system_type} RAG: {results[system_type]['eval_score_LLM_judge_int'].mean()*100:.1f}%"
    )

Average score for agentic RAG: 79.2%
Average score for standard RAG: 88.5%
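
To make these percentages easier to interpret: each judge score s in {1, 2, 3} is rescaled as (s - 1) / 2, so 1 maps to 0%, 2 to 50%, and 3 to 100%, and unparseable scores fall back to 2. The sketch below uses made-up scores (hypothetical values, not the notebook's results) and also counts how many entries used the fallback, since a high count would pull the average toward 50%:

```python
DEFAULT_SCORE = 2  # same fallback as in the cell above

# Hypothetical raw judge outputs; None and "n/a" stand for failed parses.
raw_scores = ["3", "1", None, "3", "n/a", "2"]


def to_int(raw):
    """Return (score, used_fallback) for one raw judge score."""
    try:
        return int(raw), False
    except (TypeError, ValueError):
        return DEFAULT_SCORE, True


parsed = [to_int(raw) for raw in raw_scores]
# Rescale from {1, 2, 3} to [0, 1].
normalized = [(score - 1) / 2 for score, _ in parsed]
n_fallback = sum(used for _, used in parsed)

average = sum(normalized) / len(normalized)
print(f"Average score: {average * 100:.1f}%")
print(f"Fallback used for {n_fallback} of {len(raw_scores)} answers")
```

Reporting the fallback count next to the average helps judge how much of the score comes from real verdicts rather than from the default.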
