# Install prerequisites

In [1]:
!pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets ragatouille ratelimit retry duckdb

In [2]:
%reload_ext autoreload
%autoreload 2
%reload_ext dotenv
%dotenv

# Model preparations

To go through the evaluation process, we need the following models:

1. Document model: an embedding model that generates document embeddings, which are persisted in the vector index.
2. Reader model: a text completion model that answers the final question using the augmented context.
3. Evaluator model: a chat completion model that gives the final verdict on the RAG output. Because this model affects scoring considerably, a stronger model should be used.

The choice of models is not the subject of this article and does not affect the comparison between RAG frameworks, so we use a completely local setup for this experiment to get better speed and lower cost.

To be more precise, the following models, which are already optimized in Ollama, are used:

* [all-minilm](https://ollama.com/library/all-minilm) for `Embedding model`;
* [Mixtral-8x7B](https://ollama.com/library/mixtral) for `Evaluator model` and `Reader model`.
   

In [3]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama, MiniMaxChat
from langchain_openai import ChatOpenAI, OpenAI
import os

# points to a vLLM server
MIXTRAL_ENDPOINT = "http://192.168.0.134:30253"

# points to an Ollama server
MINILM_ENDPOINT = "http://192.168.0.29:11434"

READER_MODEL_NAME = "mixtral:instruct"
EMBEDDING_NAME = "all-minilm"
EVALUATOR_NAME = "mixtral:instruct"

EMBEDDING_MODEL = OllamaEmbeddings(model=EMBEDDING_NAME, base_url=MINILM_ENDPOINT)
READER_LLM = Ollama(model=READER_MODEL_NAME, base_url=MIXTRAL_ENDPOINT)
EVAL_MODEL = ChatOllama(model=EVALUATOR_NAME, base_url=MIXTRAL_ENDPOINT)

LANGCHAIN_DATA_ROOT = "./data/langchain"
INSTINCT_DOC_AGENT_DATA_ROOT = "./data/doc_agent"



In [4]:
# Test all these models

EMBEDDING_MODEL.embed_query("hello")

READER_LLM.invoke("hello")

EVAL_MODEL.invoke("hello")


AIMessage(content=" Hello! It's nice to meet you. Is there something you would like to ask or talk about? I'm here to help with information and answer questions to the best of my ability.", response_metadata={'model': 'mixtral:instruct', 'created_at': '2024-04-17T09:44:49.408726113Z', 'message': {'role': 'assistant', 'content': ''}, 'done': True, 'total_duration': 583661542, 'load_duration': 520790, 'prompt_eval_duration': 14299000, 'eval_count': 41, 'eval_duration': 568214000}, id='run-83d09bb5-b90f-4e62-beac-30fbe418e19c-0')

# Build RAG pipeline using `langchain` 

1. Transform the knowledge base data in `m-ric/huggingface_doc` into `langchain` document objects.
2. Load the documents into a FAISS index if the index file is absent.
3. Prompt `READER_LLM` with the evaluation data in `m-ric/huggingface_doc_qa_eval`.
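
The three steps above can be sketched end to end with toy data. This is only an illustration under stated assumptions, not the pipeline used in this notebook: the 3-dimensional vectors, `TOY_INDEX`, `cosine`, and `retrieve` are hypothetical stand-ins for the real embeddings and the FAISS index.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical 3-dimensional "embeddings"; real ones come from the embedding model.
TOY_INDEX: Dict[str, List[float]] = {
    "FAISS stores dense vectors": [0.9, 0.1, 0.0],
    "Gradio builds demo UIs": [0.0, 0.2, 0.9],
    "Tokenizers are written in Rust": [0.1, 0.9, 0.1],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: List[float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(text, cosine(query_vector, vec)) for text, vec in TOY_INDEX.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Steps 1 and 2 are the toy index above; step 3 builds the prompt for the reader.
top = retrieve([1.0, 0.0, 0.0], k=2)
prompt = "Context:\n" + "\n".join(text for text, _ in top) + "\nQuestion: What does FAISS store?"
print(prompt)
```

In the real pipeline the query vector would come from `EMBEDDING_MODEL.embed_query` and the prompt would be sent to `READER_LLM`.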

## Knowledge base preparations

In [5]:
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import json
import datasets
pd.set_option("display.max_colwidth", None)

In [6]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

In [7]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [8]:
EVAL_DATASET = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")


Using the latest cached version of the dataset since m-ric/huggingface_doc_qa_eval couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /Users/robinqu/.cache/huggingface/datasets/m-ric___huggingface_doc_qa_eval/default/0.0.0/72d15fdd245839652aa30a5f8717b3b79f106c2a (last modified on Thu Mar 28 20:55:20 2024).


In [9]:
from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(datasets.load_dataset("m-ric/huggingface_doc", split="train"))
]

Using the latest cached version of the dataset since m-ric/huggingface_doc couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /Users/robinqu/.cache/huggingface/datasets/m-ric___huggingface_doc/default/0.0.0/1b83935099b148190b6a9a9874b7e62a17fea889 (last modified on Thu Mar 28 20:10:51 2024).


  0%|          | 0/2647 [00:00<?, ?it/s]

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument]
) -> List[LangchainDocument]:
    """
    Split documents into chunks of at most `chunk_size` tokens (counted with the `gpt-4` tiktoken encoding) and return a list of documents.
    """
    
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        model_name="gpt-4",
        chunk_size=chunk_size,
        # chunk_overlap=int(chunk_size / 10),
        chunk_overlap=0,
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
        disallowed_special=[],
        allowed_special="all"
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique
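
To make the splitting and deduplication steps concrete, here is a simplified sketch. It is not LangChain's implementation: the hypothetical `recursive_split` counts characters rather than tokens, drops the separators, and does not merge small pieces back together, all of which the real splitter handles differently.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into pieces of at most chunk_size characters, coarse separators first."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *rest = separators
    if head not in text:
        return recursive_split(text, chunk_size, rest)
    chunks: List[str] = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


text = "Intro paragraph.\n\nSecond paragraph is a bit longer.\n\nIntro paragraph."
chunks = recursive_split(text, chunk_size=20, separators=["\n\n", "\n", ".", " "])

# Order-preserving deduplication, equivalent in effect to the dictionary loop above.
unique_chunks = list(dict.fromkeys(chunks))
print(unique_chunks)
```

The duplicated "Intro paragraph." chunk appears only once in `unique_chunks`, which mirrors what the deduplication loop in `split_documents` achieves.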

In [11]:
from langchain.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy
import os
from langchain_core.embeddings import Embeddings


def load_embeddings(
    langchain_docs: List[LangchainDocument],
    chunk_size: int,
    embedding_model: Embeddings,
    embedding_model_name: str
) -> FAISS:
    """
    Creates a FAISS index from the given embedding model and documents. Loads the index directly if it already exists.

    Args:
        langchain_docs: list of documents
        chunk_size: size of the chunks to split the documents into
        embedding_model: the embedding
        embedding_model_name: name of the embedding model to use

    Returns:
        FAISS index
         
    """
    # Check if embeddings already exist on disk
    index_name = f"index_chunk:{chunk_size}_embeddings:{embedding_model_name}"
    index_folder_path = os.path.join(LANGCHAIN_DATA_ROOT, index_name)
    if os.path.isdir(index_folder_path):
        return FAISS.load_local(
            index_folder_path,
            embedding_model,
            distance_strategy=DistanceStrategy.COSINE,
        )

    else:
        docs_processed = split_documents(
            chunk_size,
            langchain_docs
        )
        print(f"Index not found, generating it... {len(docs_processed)} docs in total")
        knowledge_index = FAISS.from_documents(
            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
        )
        knowledge_index.save_local(index_folder_path)
        return knowledge_index
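
The index folder name built above contains `:` characters, which Windows does not accept in file names. If portability matters, one option is to sanitize the name before using it as a path. The helper below is a hypothetical sketch and is not part of this notebook or of LangChain.

```python
import re


def safe_index_name(chunk_size: int, embedding_model_name: str) -> str:
    """Build an index folder name that avoids characters some filesystems reject."""
    raw = f"index_chunk-{chunk_size}_embeddings-{embedding_model_name}"
    # Replace anything that is not a letter, digit, dot, dash, or underscore.
    return re.sub(r"[^A-Za-z0-9._-]", "_", raw)


print(safe_index_name(200, "all-minilm:latest"))
```

Renaming the folder means an index saved under the old name is no longer found, so the index would be rebuilt once.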

## QA chain

In [12]:
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""

In [13]:
from ragatouille import RAGPretrainedModel
from langchain_core.vectorstores import VectorStore
from langchain_core.language_models.llms import LLM


def answer_with_rag(
    question: str,
    llm: LLM,
    knowledge_index: VectorStore,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 7,
) -> Tuple[str, List[LangchainDocument]]:
    """Answer a question using RAG with the given knowledge index."""
    # Gather documents with retriever
    relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)
    
    print("final prompt size:", len(final_prompt))

    # Generate an answer
    answer = llm.invoke(final_prompt)

    return answer, relevant_docs
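
The context assembly inside `answer_with_rag` can be looked at in isolation. The sketch below uses a hypothetical helper, `build_context`, and differs from the cell above in one respect that is an assumption of this example: documents are joined with a newline, whereas the original concatenates them without a separator.

```python
from typing import List


def build_context(docs: List[str], num_docs_final: int = 7) -> str:
    """Number the retained documents and join them into a single context string."""
    kept = docs[:num_docs_final]
    body = "\n".join(f"Document {i}:::\n{doc}" for i, doc in enumerate(kept))
    return "\nExtracted documents:\n" + body


context = build_context(["first chunk", "second chunk", "third chunk"], num_docs_final=2)
print(context)
```

Only the first `num_docs_final` documents reach the prompt, so the third chunk is dropped here.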

# Generating answers

## Test function with langchain

In [14]:
from langchain_core.language_models import BaseChatModel 

def run_langchain_rag_tests(
    eval_dataset: datasets.Dataset,
    llm: LLM,
    knowledge_index: VectorStore,
    output_file: str,
    reranker: Optional[RAGPretrainedModel] = None,
    verbose: Optional[bool] = True,
    test_settings: Optional[str] = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):  # no usable checkpoint yet
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(question, llm, knowledge_index, reranker=reranker)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": [doc for doc in relevant_docs],
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)
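
The function above checkpoints after every question so that an interrupted run can resume. A minimal, self-contained sketch of that pattern follows. `load_checkpoint`, `run_with_checkpoint`, and `fake_answer` are hypothetical names, and a set is used for the "already answered" check instead of rebuilding a list on each iteration.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def load_checkpoint(path: str) -> List[Dict]:
    """Return previously saved results, or an empty list if none are usable."""
    try:
        with open(path, "r") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return []


def run_with_checkpoint(questions: List[str], answer_fn: Callable[[str], str], path: str) -> List[Dict]:
    """Answer each unseen question and rewrite the checkpoint after every answer."""
    outputs = load_checkpoint(path)
    seen = {item["question"] for item in outputs}
    for question in questions:
        if question in seen:
            continue
        outputs.append({"question": question, "generated_answer": answer_fn(question)})
        seen.add(question)
        with open(path, "w") as f:
            json.dump(outputs, f)
    return outputs


calls: List[str] = []


def fake_answer(question: str) -> str:
    """Stand-in for the RAG call; records which questions were actually processed."""
    calls.append(question)
    return question.upper()


checkpoint = os.path.join(tempfile.mkdtemp(), "answers.json")
run_with_checkpoint(["a", "b"], fake_answer, checkpoint)
results = run_with_checkpoint(["a", "b", "c"], fake_answer, checkpoint)
print(calls)
```

The second run only calls `fake_answer` for the new question, because the first two are already in the checkpoint file.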

In [15]:
def run_langchain_test_all() -> str:
    """
    Build the index and run the langchain test with fixed parameters and model selections.
    :return: path of the JSON file that holds the generated answers
    """
    if not os.path.exists("./output"):
        os.mkdir("./output")
    
    chunk_size = 200
    rerank = False
    
    settings_name = f"langchain_chunk:{chunk_size}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}_embedding-model:{EMBEDDING_NAME}"
    output_file_name = os.path.join("./output", f"{settings_name}.json")
    

    print("Loading knowledge base embeddings...")
    knowledge_index = load_embeddings(
        RAW_KNOWLEDGE_BASE,
        chunk_size=chunk_size,
        embedding_model=EMBEDDING_MODEL,
        embedding_model_name=EMBEDDING_NAME
    )
    
    print(f"Running RAG with {settings_name}")
    reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0") if rerank else None
    run_langchain_rag_tests(
        eval_dataset=EVAL_DATASET,
        llm=READER_LLM,
        knowledge_index=knowledge_index,
        output_file=output_file_name,
        reranker=reranker,
        verbose=True,
        test_settings=settings_name,
    )
    
    return output_file_name 

In [16]:
# execute test for langchain
LANGCHAIN_TEST_OUTPUT = run_langchain_test_all()

Loading knowledge base embeddings...
Running RAG with langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm


  0%|          | 0/67 [00:00<?, ?it/s]

final prompt size: 1913
Question: What architecture is the `tokenizers-linux-x64-musl` binary designed for?

Answer:  The `tokenizers-linux-x64-musl` binary is designed for the **x86_64-unknown-linux-musl** architecture.
True answer: x86_64-unknown-linux-musl
final prompt size: 3634
Question: What is the purpose of the BLIP-Diffusion model?

Answer:  Based on the provided context, there is no information given about the purpose of the BLIP-Diffusion model. The context mainly discusses Stable Diffusion and DDPO models.
True answer: The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.
final prompt size: 4301
Question: How can a user claim authorship of a paper on the Hugging Face Hub?

Answer:  A user can claim authorship of a paper on the Hugging Face Hub by visiting the Paper page, clicking on their name in the corresponding Paper page, and then clicking "claim authorship". This will redirect them to their paper settings where they can confirm the

## Test function with doc-agent in instinct.cpp

You have to start `doc-agent` manually on your local machine.

To build the knowledge index with the same knowledge base data from Hugging Face:

```shell
$DOC_AGENT_BIN --verbose \
  --parent_child_retriever \
  --child_chunk_size=200 \
  --chat_model_model_name=gemma:2b \
  --embedding_model_model_name=all-minilm:latest \
  --db_path=./data/instinct/index.db \
  --vector_table_dimension=384 \
  build \
  --force \
  --file=https://huggingface.co/api/datasets/m-ric/huggingface_doc/parquet/default/train/0.parquet \
  --type=PARQUET \
  --parquet_mapping=0:txt,1:metadata:source:varchar
```

To start the HTTP server for queries, with `--db_path` pointing at the database written by the build step:

```shell
$DOC_AGENT_BIN --verbose \
  --parent_child_retriever \
  --child_chunk_size=200 \
  --chat_model_model_name=gemma:2b \
  --embedding_model_model_name=all-minilm:latest \
  --db_path=./data/instinct/index.db \
  --vector_table_dimension=384 \
  serve \
  --port=9090 
```

Next, we will begin QA tests.
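
Since the server is started by hand, it can help to confirm that something is listening on the port before sending questions. The helper below is a hypothetical sketch using only the standard library. It checks that a TCP connection succeeds, which does not prove that the HTTP API itself is ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is accepting connections yet; wait and try again.
            time.sleep(interval)
    return False
```

For the setup above, the call would be `wait_for_port("localhost", 9090)`.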

In [17]:
import requests


def answer_with_doc_agent(question: str) -> str:
    """Send a question to the locally running doc-agent server and return its answer."""
    res = requests.post(
        "http://localhost:9090/v1/chat/completions",
        json={"messages": [{"content": question, "role": "human"}], "stream": False},
    )
    res.raise_for_status()  # fail loudly on an HTTP error status
    body = res.json()
    return body["choices"][0]["message"]["content"]
    

def run_doc_agent_rag_tests(
    eval_dataset: datasets.Dataset,
    output_file: str,
    verbose: Optional[bool] = True,
    test_settings: Optional[str] = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):  # no usable checkpoint yet
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer = answer_with_doc_agent(question)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)
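
The response body is indexed as `body["choices"][0]["message"]["content"]`, which raises an unhelpful `KeyError` or `IndexError` if the server returns something else. A slightly more defensive extraction could look like the sketch below. `extract_answer` is a hypothetical helper, and the body shape is assumed from the indexing used above.

```python
from typing import Any, Dict


def extract_answer(body: Dict[str, Any]) -> str:
    """Pull the assistant text out of a chat-completion style response body."""
    choices = body.get("choices") or []
    if not choices:
        # Include the body in the error so a failed run is easier to diagnose.
        raise ValueError(f"response has no choices: {body!r}")
    return choices[0]["message"]["content"]


print(extract_answer({"choices": [{"message": {"role": "assistant", "content": "hello"}}]}))
```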

In [18]:
def run_doc_agent_test_all():
    if not os.path.exists("./output"):
        os.mkdir("./output")
    
    chunk_size = 200
    rerank = False
    
    settings_name = f"doc_agent_chunk:{chunk_size}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}_embedding-model:{EMBEDDING_NAME}"
    output_file_name = f"./output/rag_{settings_name}.json"
    
    print(f"Running RAG with settings {settings_name}")
    run_doc_agent_rag_tests(
        eval_dataset=EVAL_DATASET,
        output_file=output_file_name,
        test_settings=settings_name
    )
    
    return output_file_name

In [25]:
DOC_AGENT_TEST_OUTPUT = run_doc_agent_test_all()

Running RAG with settings doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm


  0%|          | 0/67 [00:00<?, ?it/s]

Question: What architecture is the `tokenizers-linux-x64-musl` binary designed for?

Answer:  The `tokenizers-linux-x64-musl` binary is designed for the **x86_64-unknown-linux-musl** architecture. This refers to a 64-bit Intel x86 architecture running on a Linux system using musl as the C library implementation.
True answer: x86_64-unknown-linux-musl
Question: What is the purpose of the BLIP-Diffusion model?

Answer:  The BLIP-Diffusion model is a multimodal model that combines the strengths of two existing models, BLIP and Stable Diffusion. It uses the BLIP model to generate captions for images, and then uses the Stable Diffusion model to generate new images based on those captions. This allows for more creative and diverse image generation, as well as improved alignment between the generated images and their corresponding captions. The model can be used for a variety of applications, including text-to-image synthesis, image editing, and image captioning.
True answer: The BLIP-Diffusi

# Evaluation Runner

In [26]:
from ratelimit import limits,sleep_and_retry
from retry import retry
from langchain_core.prompts import ChatPromptTemplate

@sleep_and_retry
@limits(calls=6, period=60)
def throttled_invoke(eval_chat_model, eval_prompt):
    return eval_chat_model.invoke(eval_prompt)



@retry(exceptions=Exception, tries=6)
def evaluate_single_answer(
        evaluation_prompt_template: ChatPromptTemplate,
        experiment: dict,
        throttled:bool,
        eval_chat_model: BaseChatModel
):
    eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
    if throttled:
        eval_result = throttled_invoke(eval_chat_model, eval_prompt)
    else:
        eval_result = eval_chat_model.invoke(eval_prompt)
    splits = [item.strip() for item in eval_result.content.split("[RESULT]")]
    if len(splits) != 2:
        print(splits)
        raise Exception("Evaluation did not complete successfully")
    assert 1 <= int(splits[1]) <= 5
    return splits


def evaluate_answers(
    answer_path: str,
    eval_chat_model: BaseChatModel,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
    throttled:bool = True
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        with open(answer_path, "r") as f:
            answers = json.load(f)

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment and experiment[f"eval_score_{evaluator_name}"]:
            continue
        
        splits = evaluate_single_answer(evaluation_prompt_template, experiment, throttled, eval_chat_model)
        
        if len(splits) != 2:
            print(splits)
            # experiment[f"eval_score_{evaluator_name}"] = ""
            # experiment[f"eval_feedback_{evaluator_name}"] = ""
            continue
        feedback, score = splits 
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)
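
The evaluator's reply is parsed by splitting on `[RESULT]` and converting the second part to an integer. The same check can be expressed as one regular expression, as in the sketch below. `parse_evaluation` is a hypothetical alternative, not the function used in this notebook, and it is stricter: anything after the score other than whitespace is rejected.

```python
import re
from typing import Optional, Tuple

# Feedback text, then the literal marker, then a single digit from 1 to 5.
RESULT_PATTERN = re.compile(r"^(?P<feedback>.*)\[RESULT\]\s*(?P<score>[1-5])\s*$", re.DOTALL)


def parse_evaluation(text: str) -> Optional[Tuple[str, int]]:
    """Return (feedback, score) if text ends with '[RESULT] <1-5>', else None."""
    match = RESULT_PATTERN.match(text.strip())
    if match is None:
        return None
    return match.group("feedback").strip(), int(match.group("score"))


print(parse_evaluation("Feedback: matches the reference. [RESULT] 5"))
print(parse_evaluation("The answer looks fine."))
```

Returning `None` instead of raising leaves the retry decision to the caller, which is a design choice of this sketch.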

In [27]:
EVALUATION_PROMPT = """ You are a fair evaluator language model.
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage


EVALUATION_PROMPT_TEMPLATE = ChatPromptTemplate.from_messages(
    [
        # SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)
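
The doubled braces in `EVALUATION_PROMPT` are escapes: they render as literal braces, while single-brace names such as `{instruction}` are filled in. Python's `str.format` follows the same convention, so the effect can be checked without LangChain. The shortened `TEMPLATE` below is made up for illustration.

```python
# Doubled braces are literal; single-brace names are substituted.
TEMPLATE = 'Format: "Feedback: {{write a feedback}} [RESULT] {{1-5}}"\nInstruction: {instruction}'

rendered = TEMPLATE.format(instruction="What is FAISS?")
print(rendered)
```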

## Run evaluations

In [28]:
def generate_eval_results():
    # Evaluate only the answer files produced above; globbing ./output would also
    # pick up unrelated JSON files that lack the expected "question" field.
    for output_file_name in [LANGCHAIN_TEST_OUTPUT, DOC_AGENT_TEST_OUTPUT]:
        print(f"Evaluating {output_file_name}")
        evaluate_answers(
            output_file_name,
            EVAL_MODEL,
            EVALUATOR_NAME,
            EVALUATION_PROMPT_TEMPLATE,
            # throttling is not needed for local model
            False
        )

generate_eval_results()

Evaluating ./output/langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json


  0%|          | 0/67 [00:00<?, ?it/s]

Evaluating ./output/rag_doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json


  0%|          | 0/67 [00:00<?, ?it/s]

In [29]:
import pandas as pd
import json

def load_eval_results():
    outputs = []
    # Load only the answer files produced above, not every JSON file in ./output
    for file in [LANGCHAIN_TEST_OUTPUT, DOC_AGENT_TEST_OUTPUT]:
        with open(file, "r") as f:
            output = pd.DataFrame(json.load(f))
        output["settings"] = file
        outputs.append(output)
    return pd.concat(outputs)

EVAL_RESULTS = load_eval_results()
display(EVAL_RESULTS)

Unnamed: 0,question,true_answer,source_doc,generated_answer,retrieved_docs,test_settings,eval_score_mixtral:instruct,eval_feedback_mixtral:instruct,settings
0,What architecture is the `tokenizers-linux-x64-musl` binary designed for?\n,x86_64-unknown-linux-musl,huggingface/tokenizers/blob/main/bindings/node/npm/linux-x64-musl/README.md,The `tokenizers-linux-x64-musl` binary is designed for the **x86_64-unknown-linux-musl** architecture.,"[`tokenizers-linux-x64-musl`\n\nThis is the **x86_64-unknown-linux-musl** binary for `tokenizers`, `tokenizers-linux-arm64-musl`\n\nThis is the **aarch64-unknown-linux-musl** binary for `tokenizers`, `tokenizers-linux-x64-gnu`\n\nThis is the **x86_64-unknown-linux-gnu** binary for `tokenizers`, `tokenizers-linux-arm64-gnu`\n\nThis is the **aarch64-unknown-linux-gnu** binary for `tokenizers`, p align=""center"">\n <br>\n <img src=""https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png"" width=""600""/>\n <br>\n<p>\n<p align=""center"">\n <img alt=""Build"" src=""https://github.com/huggingface/tokenizers/workflows/Rust/badge.svg"">\n <a href=""https://github.com/huggingface/tokenizers/blob/master/LICENSE"">\n <img alt=""GitHub"" src=""https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue"">\n </a>\n <a href=""https://docs.rs/tokenizers/"">\n <img alt=""Doc"" src=""https://docs.rs/tokenizers/badge.svg"">\n </a>\n</p>\n<br>\n\n\nThe core of `tokenizers`, written in Rust.\nProvides an implementation of today's most used tokenizers, with a focus on performance and\nversatility., `tokenizers-win32-x64-msvc`\n\nThis is the **x86_64-pc-windows-msvc** binary for `tokenizers`, `tokenizers-win32-arm64-msvc`\n\nThis is the **aarch64-pc-windows-msvc** binary for `tokenizers`]",langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,5,"The response is exactly the same as the reference answer, providing complete correctness, accuracy, and being factual.",./output/langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json
1,What is the purpose of the BLIP-Diffusion model?\n,The BLIP-Diffusion model is designed for controllable text-to-image generation and editing.,huggingface/diffusers/blob/main/docs/source/en/api/pipelines/blip_diffusion.md,"Based on the provided context, there is no information given about the purpose of the BLIP-Diffusion model. The context mainly discusses Stable Diffusion and DDPO models.","[Stable Diffusion\n\n## Overview\n\nStable Diffusion was proposed in [Stable Diffusion Announcement](https://stability.ai/blog/stable-diffusion-announcement) by Patrick Esser and Robin Rombach and the Stability AI team.\n\nThe summary of the model is the following:, The literature on Diffusion-based models is developing at a rapid pace which is why we partnered with [Jonathan Whitaker](https://github.com/johnowhitaker) to develop a course on it. The course is free, and you can check it out [here](https://github.com/huggingface/diffusion-models-class).\n\n## Support for third-party libraries, --\ntitle: ""Finetune Stable Diffusion Models with DDPO via TRL"" \nthumbnail: /blog/assets/166_trl_ddpo/thumbnail.png\nauthors:\n- user: metric-space\n guest: true\n- user: sayakpaul\n- user: kashif\n- user: lvwerra\n---\n\n# Finetune Stable Diffusion Models with DDPO via TRL\n\n\n## Introduction, Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/)., In this blog post, we discuss how DDPO came to be, a brief description of how it works, and how DDPO can be incorporated into an RLHF workflow to achieve model outputs more aligned with the human aesthetics. 
We then quickly switch gears to talk about how you can apply DDPO to your models with the newly integrated `DDPOTrainer` from the `trl` library and discuss our findings from running DDPO on Stable Diffusion. \n\n## The Advantages of DDPO\n\nDDPO is not the only working answer to the question of how to attempt to fine-tune diffusion models with RL. \n\nBefore diving in, there are two key points to remember when it comes to understanding the advantages of one RL solution over the other\n\n1. Computational efficiency is key. The more complicated your data distribution gets, the higher your computational costs get.\n2. Approximations are nice, but because approximations are not the real thing, associated errors stack up., We've gone from the basic use of Stable Diffusion using 🤗 Hugging Face Diffusers to more advanced uses of the library, and we tried to introduce all the pieces in a modern diffusion system. If you liked this topic and want to learn more, we recommend the following resources:\n- Our [Colab notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb).\n- The [Getting Started with Diffusers](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) notebook, that gives a broader overview on Diffusion systems.\n- The [Annotated Diffusion Model](https://huggingface.co/blog/annotated-diffusion) blog post., In order to make it easy for everyone to take advantage of these improvements, we have converted the four official Stable Diffusion models and pushed them to the [Hub](https://huggingface.co/apple). These are all the variants:]",langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,3,"The response accurately acknowledges that there is no information provided about the purpose of the BLIP-Diffusion model in the context given. 
This is a precise and honest evaluation, as the model correctly identifies that it cannot provide an answer based on the given context. However, this response does not include the reference answer's statement describing the actual purpose of the BLIP-Diffusion model.",./output/langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json
2,How can a user claim authorship of a paper on the Hugging Face Hub?\n,"By clicking their name on the corresponding Paper page and clicking ""claim authorship"", then confirming the request in paper settings for admin team validation.",huggingface/hub-docs/blob/main/docs/hub/paper-pages.md,"A user can claim authorship of a paper on the Hugging Face Hub by visiting the Paper page, clicking on their name in the corresponding Paper page, and then clicking ""claim authorship"". This will redirect them to their paper settings where they can confirm the request. The admin team will validate this request, and once confirmed, the Paper page will show as verified.","[* Visit the Paper page.\n* Filter for other models or datasets on the Hub that cite the same paper.\n\n<div class=""flex justify-center"">\n<img class=""block dark:hidden"" width=""300"" src=""https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-arxiv.png""/>\n<img class=""hidden dark:block"" width=""300"" src=""https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-arxiv-dark.png""/>\n</div>\n\n## Claiming authorship to a Paper\n\nThe Hub will attempt to automatically match paper to users based on their email., <div class=""flex justify-center"">\n<img class=""block dark:hidden"" width=""300"" src=""https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/papers-authors.png""/>\n<img class=""hidden dark:block"" width=""300"" src=""https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/papers-authors-dark.png""/>\n</div>\n\nIf your paper is not linked to your account, you can click in your name in the corresponding Paper page and click ""claim authorship"". This will automatically re-direct to your paper settings where you can confirm the request. The admin team will validate your request soon. 
Once confirmed, the Paper page will show as verified., It also helps us if you spread the word: reference the library from blog posts\non the awesome projects it made possible, shout out on Twitter every time it has\nhelped you, or simply star the repo to say ""thank you"".\n\nWhichever way you choose to contribute, please be mindful to respect our\n[code of conduct](https://github.com/huggingface/huggingface_hub/blob/main/CODE_OF_CONDUCT.md).\n\n> Looking for a good first issue to work on?\n> Please check out our contributing guide below and then select an issue from our [curated list](https://github.com/huggingface/huggingface_hub/contribute).\n> Pick one and get started with it!\n\n### The client library, `huggingface_hub`, # Model `license:other` challenge\n\nRelated to https://github.com/huggingface/hub-docs/issues/985.\n\n## Context, ### MODEL CARDS ON THE HUGGING FACE HUB\nSince 2018, new platforms and mediums for hosting and sharing model cards have also emerged. For example, particularly relevant to this project, Hugging Face hosts model cards on the Hugging Face Hub as README files in the repositories associated with ML models. As a result, model cards figure as a prominent form of documentation for users of models on the Hugging Face Hub. As part of our analysis of model cards, we developed and proposed model cards for several dozen ML models on the Hugging Face Hub, using the Hubâ€™s Pull Request (PR) and Discussion features to gather feedback on model cards, verify information included in model cards, and publish model cards for models on the Hugging Face Hub. 
At the time of writing of this guide book, all of Hugging Faceâ€™s models on the Hugging Face Hub have an associated model card on the Hub[^8]., --\ntitle: Hugging Face Collaborates with Microsoft to launch Hugging Face Model Catalog on Azure\nthumbnail: /blog/assets/75_hugging_face_endpoints_on_azure/01.jpg\nauthors:\n- user: jeffboudier\n- user: philschmid\n- user: juliensimon\n---\n\n# Hugging Face Collaborates with Microsoft to launch Hugging Face Model Catalog on Azure\n\n\n![Hugging Face Endpoints on Azure](assets/75_hugging_face_endpoints_on_azure/01.jpg ""Hugging Face Endpoints on Azure""), You can also revoke a user's membership or change their role on this page.\n\n## Organization domain name\n\nUnder the **Account** tab in the Organization settings, you can set an **Organization domain name**. Specifying a domain name will allow any user with a matching email address on the Hugging Face Hub to join your organization.]",langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,5,"The response is correctly describing the steps to claim authorship of a paper on the Hugging Face Hub. It provides slightly more detail than the reference answer by specifying that the user will be redirected to their paper settings and that the admin team needs to validate the request. However, this does not affect the accuracy or correctness of the response, so it still aligns with the score rubric for a score 5.",./output/langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json
3,What is the purpose of the /healthcheck endpoint in the Datasets server API?\n,Ensure the app is running,huggingface/datasets-server/blob/main/services/api/README.md,The `/healthcheck` endpoint in the Datasets server API is used to ensure that the application is running. It's a basic health check for the server's status.,"[Datasets server API - rows endpoint\n\n> /rows endpoint\n\n## Configuration\n\nThe service can be configured using environment variables. They are grouped by scope.\n\n### API service\n\nSee [../../libs/libapi/README.md](../../libs/libapi/README.md) for more information about the API configuration.\n\n### Common\n\nSee [../../libs/libcommon/README.md](../../libs/libcommon/README.md) for more information about the common configuration.\n\n## Endpoints\n\nSee https://huggingface.co/docs/datasets-server\n\n- /healthcheck: ensure the app is running\n- /metrics: return a list of metrics in the Prometheus format\n- /rows: get a slice of rows of a dataset split, The endpoint response is a JSON with the `dataset_info` key. Its structure and content correspond to [DatasetInfo](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.DatasetInfo) object of the `datasets` library., The endpoint response is a JSON containing a list of the dataset's splits and configurations. For example, the [duorc](https://huggingface.co/datasets/duorc) dataset has six splits and two configurations:, - `/healthcheck`\n- `/metrics`: give info about the cache and the queue\n- `/cache-reports{processing_step}`: give detailed reports on the content of the cache for a processing step\n- `/cache-reports-with-content{processing_step}`: give detailed reports on the content of the cache for a processing step, including the content itself, which can be heavy\n- `/pending-jobs`: give the pending jobs, classed by queue and status (waiting or started)\n- `/force-refresh{processing_step}`: force refresh cache entries for the processing step. It's a POST endpoint. 
Pass the requested parameters, depending on the processing step's input type:\n - `dataset`: `?dataset={dataset}`\n - `config`: `?dataset={dataset}&config={config}`\n - `split`: `?dataset={dataset}&config={config}&split={split}`, Datasets server - worker\n\n> Workers that pre-compute and cache the response to /splits, /first-rows, /parquet, /info and /size.\n\n## Configuration\n\nUse environment variables to configure the workers. The prefix of each environment variable gives its scope.\n\n### Uvicorn\n\nThe following environment variables are used to configure the Uvicorn server (`WORKER_UVICORN_` prefix). It is used for the /healthcheck and the /metrics endpoints:\n\n- `WORKER_UVICORN_HOSTNAME`: the hostname. Defaults to `""localhost""`.\n- `WORKER_UVICORN_NUM_WORKERS`: the number of uvicorn workers. Defaults to `2`.\n- `WORKER_UVICORN_PORT`: the port. Defaults to `8000`.\n\n### Prometheus, - /healthcheck: Ensure the app is running\n- /metrics: Return a list of metrics in the Prometheus format\n- /webhook: Add, update or remove a dataset\n- /is-valid: Tell if a dataset is [valid](https://huggingface.co/docs/datasets-server/valid)\n- /splits: List the [splits](https://huggingface.co/docs/datasets-server/splits) names for a dataset\n- /first-rows: Extract the [first rows](https://huggingface.co/docs/datasets-server/first_rows) for a dataset split\n- /parquet: List the [parquet files](https://huggingface.co/docs/datasets-server/parquet) auto-converted for a dataset, Datasets server API\n\n> API on ðŸ¤— datasets\n\n## Configuration\n\nThe service can be configured using environment variables. 
They are grouped by scope.\n\n### API service\n\nSee [../../libs/libapi/README.md](../../libs/libapi/README.md) for more information about the API configuration.\n\n### Common\n\nSee [../../libs/libcommon/README.md](../../libs/libcommon/README.md) for more information about the common configuration.\n\n## Endpoints\n\nSee https://huggingface.co/docs/datasets-server]",langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,5,"The response accurately explains the purpose of the `/healthcheck` endpoint in the Datasets server API, stating that it is used to ensure that the application is running and functions as a basic health check for the server's status. This aligns with the reference answer and meets the criteria for a score of 5.",./output/langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json
4,What is the default context window size for Local Attention in the LongT5 model?\n,127 tokens,huggingface/transformers/blob/main/docs/source/en/model_doc/longt5.md,"The default context window size for Local Attention in the LongT5 model is defined by `config.attention_window`. It can be of type `List` to define a different window size for each layer. For more specifics on the default value, it would be best to refer to the actual implementation or documentation as it could vary depending on the specific configuration of the model.","[Longformer self attention employs self attention on both a ""local"" context and a ""global"" context. Most tokens only\nattend ""locally"" to each other meaning that each token attends to its \\(\frac{1}{2} w\\) previous tokens and\n\\(\frac{1}{2} w\\) succeeding tokens with \\(w\\) being the window length as defined in\n`config.attention_window`. Note that `config.attention_window` can be of type `List` to define a\ndifferent \\(w\\) for each layer. A selected few tokens attend ""globally"" to all other tokens, as it is\nconventionally done for all tokens in `BertSelfAttention`., - [`LongT5ForConditionalGeneration`] is an extension of [`T5ForConditionalGeneration`] exchanging the traditional\nencoder *self-attention* layer with efficient either *local* attention or *transient-global* (*tglobal*) attention.\n- Unlike the T5 model, LongT5 does not use a task prefix. Furthermore, it uses a different pre-training objective\ninspired by the pre-training of [`PegasusForConditionalGeneration`].\n- LongT5 model is designed to work efficiently and very well on long-range *sequence-to-sequence* tasks where the\ninput sequence exceeds commonly used 512 tokens. 
It is capable of handling input sequences of a length up to 16,384 tokens.\n- For *Local Attention*, the sparse sliding-window local attention operation allows a given token to attend only `r`, [Longformer](#longformer) uses local attention: often, the local context (e.g., what are the two tokens to the\nleft and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small\nwindow, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a\nrepresentation of the whole sentence.\n\nSome preselected input tokens are also given global attention: for those few tokens, the attention matrix can access\nall tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in\ntheir local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:, Note that ""locally"" and ""globally"" attending tokens are projected by different query, key and value matrices. Also note\nthat every ""locally"" attending token not only attends to tokens within its window \\(w\\), but also to all ""globally""\nattending tokens so that global attention is *symmetric*.\n\nThe user can define which tokens attend ""locally"" and which tokens attend ""globally"" by setting the tensor\n`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for\n`global_attention_mask`:\n\n- 0: the token attends ""locally"",\n- 1: the token attends ""globally"".\n\nFor more information please also refer to [`~LongformerModel.forward`] method., are constructed dynamically within each attention operation). 
As a consequence, *TGlobal* attention introduces\na few new parameters -- global relative position biases and a layer normalization for global token's embedding.\nThe complexity of this mechanism is `O(l(r + l/k))`.\n- An example showing how to evaluate a fine-tuned LongT5 model on the [pubmed dataset](https://huggingface.co/datasets/scientific_papers) is below., However, most Transformer models continued to trend towards more parameters, leading to new models focused on improving training efficiency. [ALBERT](model_doc/albert) reduces memory consumption by lowering the number of parameters in two ways: separating the larger vocabulary embedding into two smaller matrices and allowing layers to share parameters. [DeBERTa](model_doc/deberta) added a disentangled attention mechanism where the word and its position are separately encoded in two vectors. The attention is computed from these separate vectors instead of a single vector containing the word and position embeddings. [Longformer](model_doc/longformer) also focused on making attention more efficient, especially for processing documents with longer sequence lengths. It uses a combination of local windowed attention (attention only calculated from fixed window size around each token) and global attention (only for specific task tokens like `[CLS]` for classification) to create a sparse attention matrix instead of a full attention matrix.\n\n### Decoder[[nlp-decoder]], The authors use 512 latents for all image models, and set the dimensionality of the latents to 1024. Hence, the latents are a tensor of shape (batch_size, 512, 1024) - assuming we add a batch dimension. The cross-attention layer takes the queries of shape (batch_size, 512, 1024) and keys + values of shape (batch_size, 50176, 512) as input, and produces a tensor that has the same shape as the queries, so outputs a new tensor of shape (batch_size, 512, 1024). 
Next, a block of 6 self-attention layers is applied repeatedly (8 times), to produce final hidden states of the latents of shape (batch_size, 512, 1024)]",langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,3,"While the response correctly identifies where to find the attention window size in the LongT5 model configuration, it does not provide a specific default value as stated in the rubric. The reference answer clearly indicates that the default context window size for Local Attention in the LongT5 model is 127 tokens. Therefore, the response lacks the specificity required to be considered fully accurate.\n\nFeedback: The response does not provide a specific default value for the context window size of Local Attention in the LongT5 model.",./output/langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json
...,...,...,...,...,...,...,...,...,...
62,What is the maximum size of a model checkpoint before it is automatically sharded in Transformers version 4.18.0?\n,10GB,huggingface/transformers/blob/main/docs/source/en/big_models.md,The maximum size of a model checkpoint at which automatic sharding occurs in Transformers version 4.18.0 is 10GB.,,doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,5,"The response is entirely consistent with the reference answer, providing the same information about the maximum size of a model checkpoint before it is automatically sharded in Transformers version 4.18.0. This demonstrates accuracy and adherence to the given fact.",./output/rag_doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json
63,What is the purpose of Weights and Biases (W&B) for data scientists and machine learning scientists?\n,"To track their machine learning experiments at every stage, from training to production.",gradio-app/gradio/blob/main/guides/06_integrating-other-frameworks/Gradio-and-Wandb-Integration.md,"Weights and Biases (W&B) is a tool that helps data scientists and machine learning scientists track, visualize, and compare experiments and model performance. It allows users to log metrics, hyperparameters, code versions, and other relevant information about their models and datasets. W&B also provides tools for comparing different runs, visualizing the training process, and collaborating with team members. By using W&amp;B, data scientists and machine learning scientists can more easily manage and optimize their models, improve reproducibility, and share their results with others.",,doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,4,"The response accurately captures the purpose of Weights and Biases (W&B) for data scientists and machine learning scientists. It highlights the main functionalities of W&B such as tracking, visualizing, and comparing experiments, logging metrics, hyperparameters, code versions, and other relevant information. The response also correctly mentions that W&B facilitates collaboration among team members, improves reproducibility, and helps optimize models. However, the response does not explicitly mention that W&B is used to track machine learning experiments ""at every stage, from training to production"" as stated in the reference answer.",./output/rag_doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json
64,What is the name of the open-source library created by Hugging Face to simplify Transformer acceleration?\n,Optimum,huggingface/blog/blob/main/intel.md,The name of the open-source library created by Hugging Face for Transformer acceleration is `Transformers.js`.,,doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,1,"The response is incorrect as it provides the name `Transformers.js` which is not the correct name of the open-source library created by Hugging Face for Transformer acceleration. The reference answer indicates that the correct name is `Optimum`. Therefore, the response does not meet the criteria of being correct, accurate, and factual.",./output/rag_doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json
65,What parameter is used to ensure that elements in a row have the same height in Gradio?\n,equal_height,gradio-app/gradio/blob/main/guides/cn/03_building-with-blocks/02_controlling-layout.md,"The `elem_id` parameter can be used in Gradio to set the same height for all elements in a row. This parameter allows you to specify the ID of an HTML element, and any other elements with the same ID will have their height adjusted to match. For example:\n```\ngr.Interface(my_function, inputs=""text"", outputs=""text"").launch(elem_id=""same-height"")\n```\nThis will set the height of all text input elements in the interface to be the same. Note that this parameter is only available in Gradio version 3.0 and later.",,doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm,3,"While the response correctly explains how to use the `elem_id` parameter in Gradio to set a uniform height for elements in a row, it unfortunately does not provide the correct parameter name as per the reference answer. The parameter used to ensure that elements in a row have the same height in Gradio is actually called ""equal_height"", not ""elem_id"". Therefore, the response is not accurate in this regard. However, the explanation of how to use the parameter is correct and detailed.",./output/rag_doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json


In [30]:
# Get diffs: questions where the LangChain pipeline scored higher than doc_agent
import duckdb
import pandas as pd

# The score column name contains a colon (e.g. "eval_score_mixtral:instruct"),
# so it must be double-quoted inside the SQL text.
SCORE_COLUMN = f'"eval_score_{EVALUATOR_NAME}"'

# DuckDB resolves EVAL_RESULTS to the pandas DataFrame of that name in the notebook scope.
DIFF_SQL = f"""
SELECT tbl1.question, tbl1.true_answer,
       tbl1.generated_answer AS langchain_answer, tbl1.score AS langchain_score,
       tbl2.generated_answer AS doc_agent_answer, tbl2.score AS doc_agent_score
FROM (SELECT *, {SCORE_COLUMN} AS score FROM EVAL_RESULTS WHERE test_settings LIKE 'langchain%') AS tbl1
JOIN (SELECT *, {SCORE_COLUMN} AS score FROM EVAL_RESULTS WHERE test_settings LIKE 'doc_agent%') AS tbl2
  ON tbl1.question = tbl2.question
WHERE tbl1.score > tbl2.score
"""

DIFFS = duckdb.query(DIFF_SQL).to_df()

# Show every row and the full text of every answer.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
display(DIFFS)

DIFFS.to_excel("./output/diffs.xlsx")

Unnamed: 0,question,true_answer,langchain_answer,langchain_score,doc_agent_answer,doc_agent_score
0,How can a user claim authorship of a paper on the Hugging Face Hub?\n,"By clicking their name on the corresponding Paper page and clicking ""claim authorship"", then confirming the request in paper settings for admin team validation.","A user can claim authorship of a paper on the Hugging Face Hub by visiting the Paper page, clicking on their name in the corresponding Paper page, and then clicking ""claim authorship"". This will redirect them to their paper settings where they can confirm the request. The admin team will validate this request, and once confirmed, the Paper page will show as verified.",5,"A user can claim authorship of a paper on the Hugging Face Hub by visiting the Paper page and clicking on their name. They will then be redirected to the paper settings where they can confirm the request. The admin team will validate the request, and once confirmed, the Paper page will show as verified.",4
1,What is the default context window size for Local Attention in the LongT5 model?\n,127 tokens,"The default context window size for Local Attention in the LongT5 model is defined by `config.attention_window`. It can be of type `List` to define a different window size for each layer. For more specifics on the default value, it would be best to refer to the actual implementation or documentation as it could vary depending on the specific configuration of the model.",3,The default context window size for Local Attention in the LongT5 model is 512.,1
2,What is the name of the large multimodal model that can solve image-text tasks and is based on Flamingo?\n,IDEFICS,"The name of the large multimodal model is Flamingo. It's mentioned in Document 2, which states that you can use this model to solve image-text tasks. However, it's not explicitly stated that Flamingo is based on Flamingo itself in the document. This information might be derived from other sources or documents.",3,AlignVisionModel,1
3,What is the purpose of the `gradio.Blocks` API?\n,"The `gradio.Blocks` API allows you to have full control over the data flows and layout of your application, enabling the building of complex, multi-step applications.","The `gradio.Blocks` API is a new, low-level interface in Gradio 3.0 that provides full control over the data flows and layout of an application. It allows for the creation of complex, multi-step applications. This is achieved by enabling developers to build their own blocks, define their behavior, and arrange them as needed.",5,"The `gradio.Blocks` API is a high-level interface for creating user interfaces (UIs) for machine learning models in Python. It allows developers to create custom UIs with minimal code, and includes features such as support for different input and output types, real-time updates, and the ability to share UIs via links or embed them in other websites. The `Blocks` API is part of the Gradio library, which provides a range of tools for building and deploying machine learning applications.",4
4,"What architectural feature does SqueezeBERT use instead of fully-connected layers for the Q, K, V, and FFN layers?\n",Grouped convolutions,"SqueezeBERT uses grouped convolutions instead of fully-connected layers for the Q, K, V, and FFN layers. This information can be found in the first document, ""SqueezeBERT: What can computer vision teach NLP about efficient neural networks?"".",5,"SqueezeBERT uses depthwise separable convolutions in place of fully-connected layers for the Q, K, V, and FNN layers. This reduces the number of parameters and computations required while maintaining similar performance to traditional fully-connected layers.",3
5,What are the two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed?\n,Splitting the embedding matrix into two smaller matrices and using repeating layers split among groups.,"The two parameter-reduction techniques proposed in the ALBERT model to lower memory consumption and increase training speed are 1) splitting the embedding matrix into two smaller matrices, and 2) using repeating layers split among groups.",5,"The two parameter-reduction techniques proposed in the ALBERT model are cross-layer parameter sharing and factorized embedding parameterization. Cross-layer parameter sharing reduces the number of unique parameters by sharing them across multiple layers, while factorized embedding parameterization reduces the embedding size by decomposing it into smaller matrices.",4
6,What are the three main steps for fine-tuning a model with the 🤗 Datasets library?\n,1. Load a dataset from the Hugging Face Hub. 2. Preprocess the data with `Dataset.map()`. 3. Load and compute metrics.,The three main steps for fine-tuning a model with the 🤗 Datasets library are:\n\n1. Loading a dataset from the Hugging Face Hub.\n2. Preprocessing the data using `Dataset.map()`.\n3. Loading and computing metrics.\n\nThis is mentioned in Chapter 3 of the course you're taking on the 🤗 Datasets library.,5,"The three main steps for fine-tuning a model using the 🤗 (Hugging Face) Datasets library are:\n\n1. Preprocessing the data: This involves loading and preprocessing the dataset, which includes tokenization, padding, and batching.\n2. Fine-tuning the model: This involves defining the training parameters, such as the learning rate, number of epochs, and batch size, and then fine-tuning the model on the preprocessed data.\n3. Evaluating the model: This involves evaluating the performance of the fine-tuned model on a validation dataset.\n\nHere's an example notebook that demonstrates these steps using the 🤗 Transformers and Datasets libraries to fine-tune a BERT model for text classification: [How to fine-tune BERT for text classification with 🤗 Transformers and 🤗 Datasets](https://github.com/huggingface/notebooks/blob/main/examples/text_classification_bert.ipynb).",3
7,What is the resolution of images used by the CLIPSeg model?\n,352 x 352 pixels,"The resolution of images used by the CLIPSeg model is 352 x 352 pixels. This information can be found in document 5, where it's mentioned that ""the model uses images of 352 x 352 pixels"".",5,"The default resolution for images used in the CLIPSeg model is 256x256 pixels. However, this can be changed during initialization of the `CLIPSegModel` or `CLIPSegForImageSegmentation` class by providing a different value for the `image_resolution` argument.",3
8,What is the default repository type created by the `create_repo` function on Hugging Face Hub?\n,model,"The default repository type created by the `create_repo` function on Hugging Face Hub is a 'model' repository. If you want to create a 'dataset' or a 'space', you can specify it using the `repo_type` argument.",5,The default repository type created by the `create_repo` function when used on the Hugging Face Hub is a model card.,2
9,What is the purpose of Fully Sharded Data Parallel (FSDP) in distributed training?\n,"FSDP is developed for distributed training of large pretrained models up to 1T parameters by sharding the model parameters, gradients, and optimizer states across data parallel processes.","The purpose of Fully Sharded Data Parallel (FSDP) in distributed training is to shard a model's parameters, gradients, and optimizer states across multiple GPUs. This reduces memory usage, improves GPU memory efficiency, and allows for the training of much larger models on fewer GPUs. FSDP also supports CPU offloading of sharded model parameters, enabling the training of large models that exceed the available GPU memory. It is designed for distributed training of large pretrained models up to 1T parameters.",5,"Fully Sharded Data Parallel (FSDP) is a technique used in distributed training to shard the model parameters, optimizer and gradient states, and even offload them to the CPU when they're inactive. This reduces the high cost of large-scale training by allowing you to train really large models on multiple GPUs or TPUs. FSDP can be a powerful tool for reducing memory usage and increasing throughput in distributed training.",4
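
The self-join in the diff query above can be tried on a toy table to see which rows it keeps. The sketch below is only an illustration: it swaps DuckDB for the standard-library `sqlite3` module so it runs without extra dependencies, and the table name `eval_results`, the plain `score` column, and all rows are made up.

```python
import sqlite3

# Toy stand-in for EVAL_RESULTS; questions, settings and scores are invented.
rows = [
    ("q1", "langchain_chunk:200", 5),
    ("q1", "doc_agent_chunk:200", 4),
    ("q2", "langchain_chunk:200", 3),
    ("q2", "doc_agent_chunk:200", 5),
    ("q3", "langchain_chunk:200", 4),
    ("q3", "doc_agent_chunk:200", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eval_results (question TEXT, test_settings TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO eval_results VALUES (?, ?, ?)", rows)

# Same shape as the notebook query: split by pipeline, join on the question,
# keep only the questions where the langchain side scored strictly higher.
toy_diffs = conn.execute(
    """
    SELECT tbl1.question, tbl1.score AS langchain_score, tbl2.score AS doc_agent_score
    FROM (SELECT * FROM eval_results WHERE test_settings LIKE 'langchain%') AS tbl1
    JOIN (SELECT * FROM eval_results WHERE test_settings LIKE 'doc_agent%') AS tbl2
      ON tbl1.question = tbl2.question
    WHERE tbl1.score > tbl2.score
    ORDER BY tbl1.question
    """
).fetchall()
conn.close()

print(toy_diffs)
```

Only `q1` survives the filter: ties (`q3`) and questions where doc_agent scored higher (`q2`) are dropped, which is why the table above lists only cases where LangChain did better.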


## Scoring evaluation results

In [31]:
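import pandas as pd

def scoring_output(eval_result: pd.DataFrame, evaluator_name: str) -> pd.Series:
    """Return the mean normalized score (0 to 1) per settings file, best first."""
    score_field = f"eval_score_{evaluator_name}"
    result = eval_result.loc[:, [score_field, "settings"]].copy()

    # String scores are parsed as integers; any other value
    # (for example a missing score) falls back to the minimum of 1.
    result[score_field] = result[score_field].apply(lambda x: int(x) if isinstance(x, str) else 1)

    # Map the 1-5 rubric scale onto [0, 1].
    result[score_field] = (result[score_field] - 1) / 4
    average_scores = result.groupby("settings")[score_field].mean()

    # sort_values() returns a new Series, so its result has to be returned or assigned.
    return average_scores.sort_values(ascending=False)

scores = scoring_output(EVAL_RESULTS, EVALUATOR_NAME)
display(scores)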

settings
./output/langchain_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json        0.824627
./output/rag_doc_agent_chunk:200_rerank:False_reader-model:mixtral:instruct_embedding-model:all-minilm.json    0.712687
Name: eval_score_mixtral:instruct, dtype: float64
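
To make the normalization in `scoring_output` concrete, here is a minimal pure-Python sketch of the same arithmetic. The pipeline names and scores are hypothetical, and `normalize` and `to_rubric_scale` are helper names introduced only for this illustration.

```python
def normalize(score: int) -> float:
    """Map a 1-5 rubric score onto the 0-1 range, as scoring_output does."""
    return (score - 1) / 4


def to_rubric_scale(normalized: float) -> float:
    """Inverse mapping: turn a normalized mean back into a 1-5 rubric mean."""
    return normalized * 4 + 1


# Hypothetical per-question scores for two pipelines.
toy_scores = {
    "pipeline_a": [5, 5, 3, 4],
    "pipeline_b": [4, 1, 1, 5],
}

# Mean of the normalized scores, one value per pipeline.
toy_averages = {
    name: sum(normalize(s) for s in scores) / len(scores)
    for name, scores in toy_scores.items()
}
print(toy_averages)

# Reading the notebook's averages back on the original scale.
print(round(to_rubric_scale(0.824627), 2), round(to_rubric_scale(0.712687), 2))
```

Read this way, the averages reported above correspond to mean rubric scores of roughly 4.30 for the LangChain pipeline and 3.85 for doc_agent.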