# RAG Evaluation

This notebook demonstrates how to evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using LLM-as-a-judge to compute the system's accuracy.

For an introduction to RAG, you can check [this other cookbook](https://github.com/huggingface/cookbook/blob/main/notebooks/RAG_zephyr_langchain.ipynb)!

RAG systems are complex, with many moving parts. Here is a RAG diagram, where all the possibilities for system enhancement are marked in blue:

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_workflow.png" height="700">

Implementing any of these improvements can bring a huge performance boost, but changing anything is useless if you cannot monitor the impact of your changes on the system's performance!
So let's see how to evaluate our RAG system.

### Evaluating RAG performance

Since so many of these moving parts have a big impact on performance, benchmarking the RAG system is crucial.

For our evaluation pipeline, we will need:
- an evaluation dataset
- an evaluator to compute the accuracy of our system.

➡️ It turns out, we can use LLMs to help us all along the way!
- the evaluation dataset will be synthetically generated by an LLM 🤖, and questions will be filtered out by other LLMs 🤖
- the evaluation will then be run on this synthetic dataset by an LLM-as-a-judge agent 🤖.
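
As a rough sketch of how these two pieces fit together, accuracy is simply the fraction of questions for which the judge accepts the system's answer. The names `compute_accuracy`, `rag_answer` and `judge` below are hypothetical placeholders, not functions defined in this notebook, and the toy "RAG system" and exact-match "judge" only stand in for what would be LLM calls in practice.

```python
from typing import Callable, Dict, List


def compute_accuracy(
    eval_dataset: List[Dict[str, str]],
    rag_answer: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Return the fraction of questions whose generated answer the judge accepts."""
    if not eval_dataset:
        raise ValueError("The evaluation dataset is empty.")
    n_correct = 0
    for example in eval_dataset:
        generated = rag_answer(example["question"])
        # The judge sees the question, the reference answer and the generated answer.
        if judge(example["question"], example["answer"], generated):
            n_correct += 1
    return n_correct / len(eval_dataset)


# Toy stand-ins so that the sketch runs without any model (assumptions, not the real pipeline).
toy_dataset = [
    {"question": "Who released GPT-J?", "answer": "Ben Wang and Aran Komatsuzaki"},
    {"question": "What dataset was GPT-J trained on?", "answer": "the Pile"},
]
toy_answers = {"What dataset was GPT-J trained on?": "The Pile"}


def toy_rag(question: str) -> str:
    # Look the answer up in a dict instead of running retrieval + generation.
    return toy_answers.get(question, "I don't know")


def toy_judge(question: str, reference: str, generated: str) -> bool:
    # Case-insensitive exact match instead of an LLM judgement.
    return reference.strip().lower() == generated.strip().lower()


accuracy = compute_accuracy(toy_dataset, toy_rag, toy_judge)
print(f"Accuracy: {accuracy:.2f}")
```

Keeping the answering system and the judge as interchangeable callables makes it possible to change one part of the RAG system and re-run the same evaluation unchanged.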

__Let's dig into it and start building our evaluation pipeline!__ First, we install the required dependencies.

In [1]:
!pip install -q torch transformers langchain sentence-transformers faiss-gpu openpyxl openai datasets python-dotenv

In [2]:
%reload_ext autoreload
%autoreload 2
%reload_ext dotenv
%dotenv

In [3]:
from tqdm.notebook import tqdm
import pandas as pd
from typing import Optional, List, Tuple
from langchain_core.language_models import BaseChatModel
from datasets import Dataset

pd.set_option("display.max_colwidth", None)

### Load your knowledge base

In [4]:
import datasets

ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")

# 1. Build a synthetic dataset for evaluation
We first build a synthetic dataset of questions and associated contexts.

The idea is to randomly sample elements from our knowledge base and ask an LLM to generate questions based on these documents.

### Prepare source documents

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument

langchain_docs = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(ds)
]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])
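
To build intuition for what `chunk_size`, `chunk_overlap` and `add_start_index` control, here is a deliberately simplified sketch in plain Python. It is not the LangChain implementation: `RecursiveCharacterTextSplitter` prefers to cut on the listed separators, whereas the hypothetical `split_with_overlap` helper below cuts at fixed character offsets.

```python
from typing import Dict, List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[Dict]:
    """Cut `text` into fixed-size windows that overlap by `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Record where the chunk starts in the source text, similar in spirit to add_start_index.
        chunks.append({"start_index": start, "text": text[start : start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk repeats the last 3 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.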

### Setup agents for question generation

In [11]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

chat_model = ChatOpenAI(model="gpt-4-1106-preview")
QA_generation_prompt = ChatPromptTemplate.from_template(QA_generation_prompt)
QA_generation_agent = QA_generation_prompt | chat_model
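
Conceptually, `QA_generation_prompt | chat_model` builds a chain that first fills `{context}` into the template and then sends the resulting prompt to the model. The plain-Python sketch below mimics that two-step flow. `make_chain`, `echo_model` and `TOY_PROMPT` are hypothetical names used for illustration only; they are not LangChain APIs, and no model is actually called.

```python
from typing import Callable, Dict

TOY_PROMPT = "Write a factoid question about this context.\n\nContext: {context}\nOutput:::"


def make_chain(template: str, model: Callable[[str], str]) -> Callable[[Dict[str, str]], str]:
    """Return a function that fills the template, then calls the model on the result."""

    def chain(inputs: Dict[str, str]) -> str:
        prompt = template.format(**inputs)  # step 1: fill in the placeholders
        return model(prompt)  # step 2: pass the finished prompt to the model

    return chain


def echo_model(prompt: str) -> str:
    # Stand-in for a chat model: reports the prompt length instead of calling an API.
    return f"received {len(prompt)} characters"


toy_chain = make_chain(TOY_PROMPT, echo_model)
print(toy_chain({"context": "GPT-J was trained on the Pile."}))
```

Swapping `echo_model` for another callable changes the model without touching the prompt, which is the main convenience of composing the two as a chain.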


In [12]:
import random

N_GENERATIONS = 10

print(f"Generating {N_GENERATIONS} QA couples...")
outputs = []
for context in tqdm(random.sample(langchain_docs, N_GENERATIONS)):
    # Generate QA couple
    output_QA_couple = QA_generation_agent.invoke({"context": context.page_content}).content
    try:
        question = output_QA_couple.split("Factoid question: ")[1].split("Answer: ")[0]
        answer = output_QA_couple.split("Answer: ")[1]
        outputs.append(
            {
                "context": context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": context.metadata["source"],
            }
        )
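    except IndexError:
        # The model did not follow the expected "Factoid question: ... Answer: ..." format: skip this sample
        continue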

Generating 10 QA couples...
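
The `split`-based parsing in the loop depends on the model following the requested output format exactly. As an alternative sketch (one option among several, with the hypothetical helper name `parse_qa_output`), a regular expression can extract both fields, strip surrounding whitespace and signal a malformed output explicitly by returning `None`.

```python
import re
from typing import Optional, Tuple

# Capture everything between "Factoid question:" and "Answer:", then the rest as the answer.
QA_PATTERN = re.compile(
    r"Factoid question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+)",
    flags=re.DOTALL,
)


def parse_qa_output(raw_output: str) -> Optional[Tuple[str, str]]:
    """Return (question, answer), or None if the output does not match the format."""
    match = QA_PATTERN.search(raw_output)
    if match is None:
        return None
    question = match.group("question").strip()
    answer = match.group("answer").strip()
    if not question or not answer:
        return None
    return question, answer


sample_output = (
    "Factoid question: Who proposed the BARTpho model?\n"
    "Answer: Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen\n"
)
print(parse_qa_output(sample_output))
print(parse_qa_output("Sorry, I cannot answer that."))
```

Stripping whitespace also avoids keeping the trailing newline that the `split`-based version leaves at the end of each question.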


In [13]:
display(pd.DataFrame(outputs).head(5))

Unnamed: 0,context,question,answer,source_doc
0,"@gradio/button\n\n## 0.2.13\n\n### Patch Changes\n\n- Updated dependencies [[`828fb9e`](https://github.com/gradio-app/gradio/commit/828fb9e6ce15b6ea08318675a2361117596a1b5d), [`73268ee`](https://github.com/gradio-app/gradio/commit/73268ee2e39f23ebdd1e927cb49b8d79c4b9a144)]:\n - @gradio/client@0.9.3\n - @gradio/upload@0.5.6\n\n## 0.2.12\n\n### Patch Changes\n\n- Updated dependencies [[`245d58e`](https://github.com/gradio-app/gradio/commit/245d58eff788e8d44a59d37a2d9b26d0f08a62b4)]:\n - @gradio/client@0.9.2\n - @gradio/upload@0.5.5\n\n## 0.2.11\n\n### Patch Changes\n\n- Updated dependencies [[`5d51fbc`](https://github.com/gradio-app/gradio/commit/5d51fbce7826da840a2fd4940feb5d9ad6f1bc5a), [`34f9431`](https://github.com/gradio-app/gradio/commit/34f943101bf7dd6b8a8974a6131c1ed7c4a0dac0)]:\n - @gradio/upload@0.5.4\n - @gradio/client@0.9.1\n\n## 0.2.10\n\n### Patch Changes\n\n- Updated dependencies [[`6a9151d`](https://github.com/gradio-app/gradio/commit/6a9151d5c9432c724098da7d88a539aaaf5ffe88), [`d76bcaa`](https://github.com/gradio-app/gradio/commit/d76bcaaaf0734aaf49a680f94ea9d4d22a602e70), [`67ddd40`](https://github.com/gradio-app/gradio/commit/67ddd40b4b70d3a37cb1637c33620f8d197dbee0)]:\n - @gradio/upload@0.5.3\n - @gradio/client@0.9.0\n\n## 0.2.9\n\n### Patch Changes\n\n- Updated dependencies []:\n - @gradio/upload@0.5.2\n\n## 0.2.8\n\n### Patch Changes\n\n- Updated dependencies [[`71f1a1f99`](https://github.com/gradio-app/gradio/commit/71f1a1f9931489d465c2c1302a5c8d768a3cd23a)]:\n - @gradio/client@0.8.2\n - @gradio/upload@0.5.1\n\n## 0.2.7\n\n### Patch Changes\n\n- Updated dependencies [[`9caddc17b`](https://github.com/gradio-app/gradio/commit/9caddc17b1dea8da1af8ba724c6a5eab04ce0ed8)]:\n - @gradio/upload@0.5.0\n\n## 0.2.6\n\n### Patch Changes\n\n- Updated dependencies [[`2f805a7dd`](https://github.com/gradio-app/gradio/commit/2f805a7dd3d2b64b098f659dadd5d01258290521)]:\n - @gradio/upload@0.4.2\n\n## 0.2.5\n\n### Patch Changes\n\n- Updated dependencies 
[[`324867f63`](https://github.com/gradio-app/gradio/commit/324867f63c920113d89a565892aa596cf8b1e486)]:\n - @gradio/client@0.8.1\n - @gradio/upload@0.4.1\n\n## 0.2.4\n\n### Patch Changes\n\n- Updated dependencies [[`854b482f5`](https://github.com/gradio-app/gradio/commit/854b482f598e0dc47673846631643c079576da9c), [`f1409f95e`](https://github.com/gradio-app/gradio/commit/f1409f95ed39c5565bed6a601e41f94e30196a57)]:\n - @gradio/upload@0.4.0\n - @gradio/client@0.8.0\n\n## 0.2.3\n\n### Patch Changes\n\n- Updated dependencies [[`bca6c2c80`](https://github.com/gradio-app/gradio/commit/bca6c2c80f7e5062427019de45c282238388af95), [`3cdeabc68`](https://github.com/gradio-app/gradio/commit/3cdeabc6843000310e1a9e1d17190ecbf3bbc780)]:\n - @gradio/client@0.7.2\n - @gradio/upload@0.3.3\n\n## 0.2.2\n\n### Patch Changes\n\n- Updated dependencies [[`aaa55ce85`](https://github.com/gradio-app/gradio/commit/aaa55ce85e12f95aba9299445e9c5e59824da18e)]:\n - @gradio/upload@0.3.2\n\n## 0.2.1\n\n### Patch Changes\n\n- Updated dependencies [[`2ba14b284`](https://github.com/gradio-app/gradio/commit/2ba14b284f908aa13859f4337167a157075a68eb)]:\n - @gradio/client@0.7.1\n - @gradio/upload@0.3.1\n\n## 0.2.0\n\n### Features\n\n- [#5498](https://github.com/gradio-app/gradio/pull/5498) [`287fe6782`](https://github.com/gradio-app/gradio/commit/287fe6782825479513e79a5cf0ba0fbfe51443d7) - fix circular dependency with client + upload. Thanks [@pngwn](https://github.com/pngwn)!\n- [#5498](https://github.com/gradio-app/gradio/pull/5498) [`287fe6782`](https://github.com/gradio-app/gradio/commit/287fe6782825479513e79a5cf0ba0fbfe51443d7) - Clean root url. Thanks [@pngwn](https://github.com/pngwn)!\n- [#5498](https://github.com/gradio-app/gradio/pull/5498) [`287fe6782`](https://github.com/gradio-app/gradio/commit/287fe6782825479513e79a5cf0ba0fbfe51443d7) - Publish all components to npm. 
Thanks [@pngwn](https://github.com/pngwn)!\n- [#5498](https://github.com/gradio-app/gradio/pull/5498) [`287fe6782`](https://github.com/gradio-app/gradio/commit/287fe6782825479513e79a5cf0ba0fbfe51443d7) - Custom components. Thanks [@pngwn](https://github.com/pngwn)!\n\n## 0.2.0-beta.7\n\n### Features\n\n- [#6143](https://github.com/gradio-app/gradio/pull/6143) [`e4f7b4b40`](https://github.com/gradio-app/gradio/commit/e4f7b4b409323b01aa01b39e15ce6139e29aa073) - fix circular dependency with client + upload. Thanks [@pngwn](https://github.com/pngwn)!\n- [#6136](https://github.com/gradio-app/gradio/pull/6136) [`667802a6c`](https://github.com/gradio-app/gradio/commit/667802a6cdbfb2ce454a3be5a78e0990b194548a) - JS Component Documentation. Thanks [@freddyaboulton](https://github.com/freddyaboulton)!\n- [#6149](https://github.com/gradio-app/gradio/pull/6149) [`90318b1dd`](https://github.com/gradio-app/gradio/commit/90318b1dd118ae08a695a50e7c556226234ab6dc) - swap `mode` on the frontned to `interactive` to match the backend. Thanks [@pngwn](https://github.com/pngwn)!\n\n## 0.2.0-beta.6\n\n### Fixes\n\n- [#6046](https://github.com/gradio-app/gradio/pull/6046) [`dbb7de5e0`](https://github.com/gradio-app/gradio/commit/dbb7de5e02c53fee05889d696d764d212cb96c74) - fix tests. Thanks [@pngwn](https://github.com/pngwn)!\n\n## 0.2.0-beta.5\n\n### Features\n\n- [#5960](https://github.com/gradio-app/gradio/pull/5960) [`319c30f3f`](https://github.com/gradio-app/gradio/commit/319c30f3fccf23bfe1da6c9b132a6a99d59652f7) - rererefactor frontend files. Thanks [@pngwn](https://github.com/pngwn)!\n- [#5938](https://github.com/gradio-app/gradio/pull/5938) [`13ed8a485`](https://github.com/gradio-app/gradio/commit/13ed8a485d5e31d7d75af87fe8654b661edcca93) - V4: Use beta release versions for '@gradio' packages. 
Thanks [@freddyaboulton](https://github.com/freddyaboulton)!\n\n## 0.2.3\n\n### Patch Changes\n\n- Updated dependencies []:\n - @gradio/upload@0.3.3\n\n## 0.2.2\n\n### Patch Changes\n\n- Updated dependencies []:\n - @gradio/utils@0.1.2\n - @gradio/upload@0.3.2\n\n## 0.2.1\n\n### Patch Changes\n\n- Updated dependencies []:\n - @gradio/upload@0.3.1\n\n## 0.2.0\n\n### Features\n\n- [#5554](https://github.com/gradio-app/gradio/pull/5554) [`75ddeb390`](https://github.com/gradio-app/gradio/commit/75ddeb390d665d4484667390a97442081b49a423) - Accessibility Improvements. Thanks [@hannahblair](https://github.com/hannahblair)!\n\n## 0.1.3\n\n### Patch Changes\n\n- Updated dependencies []:\n - @gradio/utils@0.1.1\n - @gradio/upload@0.2.1\n\n## 0.1.2\n\n### Patch Changes\n\n- Updated dependencies [[`abf1c57d`](https://github.com/gradio-app/gradio/commit/abf1c57d7d85de0df233ee3b38aeb38b638477db), [`79d8f9d8`](https://github.com/gradio-app/gradio/commit/79d8f9d891901683c5a1b7486efb44eab2478c96)]:\n - @gradio/utils@0.1.0\n - @gradio/upload@0.2.0\n\n## 0.1.1\n\n### Highlights\n\n#### Improve startup performance and markdown support ([#5279](https://github.com/gradio-app/gradio/pull/5279) [`fe057300`](https://github.com/gradio-app/gradio/commit/fe057300f0672c62dab9d9b4501054ac5d45a4ec))\n\n##### Improved markdown support\n\nWe now have better support for markdown in `gr.Markdown` and `gr.Dataframe`. Including syntax highlighting and Github Flavoured Markdown. We also have more consistent markdown behaviour and styling.\n\n##### Various performance improvements\n\nThese improvements will be particularly beneficial to large applications.\n\n- Rather than attaching events manually, they are now delegated, leading to a significant performance improvement and addressing a performance regression introduced in a recent version of Gradio. 
App startup for large applications is now around twice as fast.\n- Optimised the mounting of individual components, leading to a modest performance improvement during startup (~30%).\n- Corrected an issue that was causing markdown to re-render infinitely.\n- Ensured that the `gr.3DModel` does re-render prematurely.\n\nThanks [@pngwn](https://github.com/pngwn)!\n\n### Fixes\n\n- [#5285](https://github.com/gradio-app/gradio/pull/5285) [`cdfd4217`](https://github.com/gradio-app/gradio/commit/cdfd42174a9c777eaee9c1209bf8e90d8c7791f2) - Tweaks to `icon` parameter in `gr.Button()`. Thanks [@abidlabs](https://github.com/abidlabs)!\n\n## 0.1.0\n\n### Features\n\n- [#5080](https://github.com/gradio-app/gradio/pull/5080) [`37caa2e0`](https://github.com/gradio-app/gradio/commit/37caa2e0fe95d6cab8beb174580fb557904f137f) - Add icon and link params to `gr.Button`. Thanks [@hannahblair](https://github.com/hannahblair)!\n\n## 0.0.2\n\n### Patch Changes\n\n- Updated dependencies []:\n - @gradio/utils@0.0.2\n",What was the version number of @gradio/button when improved markdown support and startup performance enhancements were introduced?\n,0.1.1,gradio-app/gradio/blob/main/js/button/CHANGELOG.md
1,"!--Copyright 2021 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the ""License""); you may not use this file except in compliance with\nthe License. You may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on\nan ""AS IS"" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\nspecific language governing permissions and limitations under the License.\n\n⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be\nrendered properly in your Markdown viewer.\n\n-->\n\n# BARTpho\n\n## Overview\n\nThe BARTpho model was proposed in [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.\n\nThe abstract from the paper is the following:\n\n*We present BARTpho with two versions -- BARTpho_word and BARTpho_syllable -- the first public large-scale monolingual\nsequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the ""large"" architecture and pre-training\nscheme of the sequence-to-sequence denoising model BART, thus especially suitable for generative NLP tasks. Experiments\non a downstream task of Vietnamese text summarization show that in both automatic and human evaluations, our BARTpho\noutperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future\nresearch and applications of generative Vietnamese NLP tasks.*\n\nThis model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). 
The original code can be found [here](https://github.com/VinAIResearch/BARTpho).\n\n## Usage example\n\n```python\n>>> import torch\n>>> from transformers import AutoModel, AutoTokenizer\n\n>>> bartpho = AutoModel.from_pretrained(""vinai/bartpho-syllable"")\n\n>>> tokenizer = AutoTokenizer.from_pretrained(""vinai/bartpho-syllable"")\n\n>>> line = ""Chúng tôi là những nghiên cứu viên.""\n\n>>> input_ids = tokenizer(line, return_tensors=""pt"")\n\n>>> with torch.no_grad():\n... features = bartpho(**input_ids) # Models outputs are now tuples\n\n>>> # With TensorFlow 2.0+:\n>>> from transformers import TFAutoModel\n\n>>> bartpho = TFAutoModel.from_pretrained(""vinai/bartpho-syllable"")\n>>> input_ids = tokenizer(line, return_tensors=""tf"")\n>>> features = bartpho(**input_ids)\n```\n\n## Usage tips\n\n- Following mBART, BARTpho uses the ""large"" architecture of BART with an additional layer-normalization layer on top of\n both the encoder and decoder. Thus, usage examples in the [documentation of BART](bart), when adapting to use\n with BARTpho, should be adjusted by replacing the BART-specialized classes with the mBART-specialized counterparts.\n For example:\n\n```python\n>>> from transformers import MBartForConditionalGeneration\n\n>>> bartpho = MBartForConditionalGeneration.from_pretrained(""vinai/bartpho-syllable"")\n>>> TXT = ""Chúng tôi là <mask> nghiên cứu viên.""\n>>> input_ids = tokenizer([TXT], return_tensors=""pt"")[""input_ids""]\n>>> logits = bartpho(input_ids).logits\n>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()\n>>> probs = logits[0, masked_index].softmax(dim=0)\n>>> values, predictions = probs.topk(5)\n>>> print(tokenizer.decode(predictions).split())\n```\n\n- This implementation is only for tokenization: ""monolingual_vocab_file"" consists of Vietnamese-specialized types\n extracted from the pre-trained SentencePiece model ""vocab_file"" that is available from the multilingual XLM-RoBERTa.\n Other languages, if 
employing this pre-trained multilingual SentencePiece model ""vocab_file"" for subword\n segmentation, can reuse BartphoTokenizer with their own language-specialized ""monolingual_vocab_file"".\n\n## BartphoTokenizer\n\n[[autodoc]] BartphoTokenizer\n",Who proposed the BARTpho model?\n,"Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen",huggingface/transformers/blob/main/docs/source/en/model_doc/bartpho.md
2,"--\n# For reference on dataset card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1\n# Doc / guide: https://huggingface.co/docs/hub/datasets-cards\n{{ card_data }}\n---\n\n# Dataset Card for {{ pretty_name | default(""Dataset Name"", true) }}\n\n<!-- Provide a quick summary of the dataset. -->\n\n{{ dataset_summary | default("""", true) }}\n\n## Dataset Details\n\n### Dataset Description\n\n<!-- Provide a longer summary of what this dataset is. -->\n\n{{ dataset_description | default("""", true) }}\n\n- **Curated by:** {{ curators | default(""[More Information Needed]"", true)}}\n- **Funded by [optional]:** {{ funded_by | default(""[More Information Needed]"", true)}}\n- **Shared by [optional]:** {{ shared_by | default(""[More Information Needed]"", true)}}\n- **Language(s) (NLP):** {{ language | default(""[More Information Needed]"", true)}}\n- **License:** {{ license | default(""[More Information Needed]"", true)}}\n\n### Dataset Sources [optional]\n\n<!-- Provide the basic links for the dataset. -->\n\n- **Repository:** {{ repo | default(""[More Information Needed]"", true)}}\n- **Paper [optional]:** {{ paper | default(""[More Information Needed]"", true)}}\n- **Demo [optional]:** {{ demo | default(""[More Information Needed]"", true)}}\n\n## Uses\n\n<!-- Address questions around how the dataset is intended to be used. -->\n\n### Direct Use\n\n<!-- This section describes suitable use cases for the dataset. -->\n\n{{ direct_use | default(""[More Information Needed]"", true)}}\n\n### Out-of-Scope Use\n\n<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->\n\n{{ out_of_scope_use | default(""[More Information Needed]"", true)}}\n\n## Dataset Structure\n\n<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. 
-->\n\n{{ dataset_structure | default(""[More Information Needed]"", true)}}\n\n## Dataset Creation\n\n### Curation Rationale\n\n<!-- Motivation for the creation of this dataset. -->\n\n{{ curation_rationale_section | default(""[More Information Needed]"", true)}}\n\n### Source Data\n\n<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->\n\n#### Data Collection and Processing\n\n<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->\n\n{{ data_collection_and_processing_section | default(""[More Information Needed]"", true)}}\n\n#### Who are the source data producers?\n\n<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->\n\n{{ source_data_producers_section | default(""[More Information Needed]"", true)}}\n\n### Annotations [optional]\n\n<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->\n\n#### Annotation process\n\n<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->\n\n{{ annotation_process_section | default(""[More Information Needed]"", true)}}\n\n#### Who are the annotators?\n\n<!-- This section describes the people or systems who created the annotations. 
-->\n\n{{ who_are_annotators_section | default(""[More Information Needed]"", true)}}\n\n#### Personal and Sensitive Information\n\n<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->\n\n{{ personal_and_sensitive_information | default(""[More Information Needed]"", true)}}\n\n## Bias, Risks, and Limitations\n\n<!-- This section is meant to convey both technical and sociotechnical limitations. -->\n\n{{ bias_risks_limitations | default(""[More Information Needed]"", true)}}\n\n### Recommendations\n\n<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->\n\n{{ bias_recommendations | default(""Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations."", true)}}\n\n## Citation [optional]\n\n<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->\n\n**BibTeX:**\n\n{{ citation_bibtex | default(""[More Information Needed]"", true)}}\n\n**APA:**\n\n{{ citation_apa | default(""[More Information Needed]"", true)}}\n\n## Glossary [optional]\n\n<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. 
-->\n\n{{ glossary | default(""[More Information Needed]"", true)}}\n\n## More Information [optional]\n\n{{ more_information | default(""[More Information Needed]"", true)}}\n\n## Dataset Card Authors [optional]\n\n{{ dataset_card_authors | default(""[More Information Needed]"", true)}}\n\n## Dataset Card Contact\n\n{{ dataset_card_contact | default(""[More Information Needed]"", true)}}",What section of the dataset card provides a description of how the dataset is intended to be used?\n,Direct Use,huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md
3,"!--Copyright 2023 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the ""License""); you may not use this file except in compliance with\nthe License. You may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on\nan ""AS IS"" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\nspecific language governing permissions and limitations under the License.\n\n⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be\nrendered properly in your Markdown viewer.\n\n-->\n\n# GPTSAN-japanese\n\n## Overview\n\nThe GPTSAN-japanese model was released in the repository by Toshiyuki Sakamoto (tanreinama).\n\nGPTSAN is a Japanese language model using Switch Transformer. It has the same structure as the model introduced as Prefix LM\nin the T5 paper, and support both Text Generation and Masked Language Modeling tasks. These basic tasks similarly can\nfine-tune for translation or summarization.\n\n### Usage example\n\nThe `generate()` method can be used to generate text using GPTSAN-Japanese model.\n\n```python\n>>> from transformers import AutoModel, AutoTokenizer\n>>> import torch\n\n>>> tokenizer = AutoTokenizer.from_pretrained(""Tanrei/GPTSAN-japanese"")\n>>> model = AutoModel.from_pretrained(""Tanrei/GPTSAN-japanese"").cuda()\n>>> x_tok = tokenizer(""は、"", prefix_text=""織田信長"", return_tensors=""pt"")\n>>> torch.manual_seed(0)\n>>> gen_tok = model.generate(x_tok.input_ids.cuda(), token_type_ids=x_tok.token_type_ids.cuda(), max_new_tokens=20)\n>>> tokenizer.decode(gen_tok[0])\n'織田信長は、2004年に『戦国BASARA』のために、豊臣秀吉'\n```\n\n## GPTSAN Features\n\nGPTSAN has some unique features. It has a model structure of Prefix-LM. It works as a shifted Masked Language Model for Prefix Input tokens. 
Un-prefixed inputs behave like normal generative models.\nThe Spout vector is a GPTSAN specific input. Spout is pre-trained with random inputs, but you can specify a class of text or an arbitrary vector during fine-tuning. This allows you to indicate the tendency of the generated text.\nGPTSAN has a sparse Feed Forward based on Switch-Transformer. You can also add other layers and train them partially. See the original GPTSAN repository for details.\n\n### Prefix-LM Model\n\nGPTSAN has the structure of the model named Prefix-LM in the `T5` paper. (The original GPTSAN repository calls it `hybrid`)\nIn GPTSAN, the `Prefix` part of Prefix-LM, that is, the input position that can be referenced by both tokens, can be specified with any length.\nArbitrary lengths can also be specified differently for each batch.\nThis length applies to the text entered in `prefix_text` for the tokenizer.\nThe tokenizer returns the mask of the `Prefix` part of Prefix-LM as `token_type_ids`.\nThe model treats the part where `token_type_ids` is 1 as a `Prefix` part, that is, the input can refer to both tokens before and after.\n\n## Usage tips\n\nSpecifying the Prefix part is done with a mask passed to self-attention.\nWhen token_type_ids=None or all zero, it is equivalent to regular causal mask\n\nfor example:\n\n>>> x_token = tokenizer(""ｱｲｳｴ"")\ninput_ids: | SOT | SEG | ｱ | ｲ | ｳ | ｴ |\ntoken_type_ids: | 1 | 0 | 0 | 0 | 0 | 0 |\nprefix_lm_mask:\nSOT | 1 0 0 0 0 0 |\nSEG | 1 1 0 0 0 0 |\nｱ | 1 1 1 0 0 0 |\nｲ | 1 1 1 1 0 0 |\nｳ | 1 1 1 1 1 0 |\nｴ | 1 1 1 1 1 1 |\n\n>>> x_token = tokenizer("""", prefix_text=""ｱｲｳｴ"")\ninput_ids: | SOT | ｱ | ｲ | ｳ | ｴ | SEG |\ntoken_type_ids: | 1 | 1 | 1 | 1 | 1 | 0 |\nprefix_lm_mask:\nSOT | 1 1 1 1 1 0 |\nｱ | 1 1 1 1 1 0 |\nｲ | 1 1 1 1 1 0 |\nｳ | 1 1 1 1 1 0 |\nｴ | 1 1 1 1 1 0 |\nSEG | 1 1 1 1 1 1 |\n\n>>> x_token = tokenizer(""ｳｴ"", prefix_text=""ｱｲ"")\ninput_ids: | SOT | ｱ | ｲ | SEG | ｳ | ｴ |\ntoken_type_ids: | 1 | 1 | 1 | 0 | 0 | 0 
|\nprefix_lm_mask:\nSOT | 1 1 1 0 0 0 |\nｱ | 1 1 1 0 0 0 |\nｲ | 1 1 1 0 0 0 |\nSEG | 1 1 1 1 0 0 |\nｳ | 1 1 1 1 1 0 |\nｴ | 1 1 1 1 1 1 |\n\n### Spout Vector\n\nA Spout Vector is a special vector for controlling text generation.\nThis vector is treated as the first embedding in self-attention to bring extraneous attention to the generated tokens.\nIn the pre-trained model published from `Tanrei/GPTSAN-japanese`, the Spout Vector is a 128-dimensional vector that passes through 8 fully connected layers in the model and is projected into the space acting as external attention.\nThe Spout Vector projected by the fully connected layer is split to be passed to all self-attentions.\n\n## GPTSanJapaneseConfig\n\n[[autodoc]] GPTSanJapaneseConfig\n\n## GPTSanJapaneseTokenizer\n\n[[autodoc]] GPTSanJapaneseTokenizer\n\n## GPTSanJapaneseModel\n\n[[autodoc]] GPTSanJapaneseModel\n\n## GPTSanJapaneseForConditionalGeneration\n\n[[autodoc]] GPTSanJapaneseForConditionalGeneration\n - forward\n",What is the unique model structure of GPTSAN called as referenced in the T5 paper?\n,Prefix-LM,huggingface/transformers/blob/main/docs/source/en/model_doc/gptsan-japanese.md
4,"!--Copyright 2021 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the ""License""); you may not use this file except in compliance with\nthe License. You may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on\nan ""AS IS"" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\nspecific language governing permissions and limitations under the License.\n\n⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be\nrendered properly in your Markdown viewer.\n\n-->\n\n# GPT-J\n\n## Overview\n\nThe GPT-J model was released in the [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like\ncausal language model trained on [the Pile](https://pile.eleuther.ai/) dataset.\n\nThis model was contributed by [Stella Biderman](https://huggingface.co/stellaathena).\n\n## Usage tips\n\n- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size\n RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB\n RAM to just load the model. To reduce the RAM usage there are a few options. The `torch_dtype` argument can be\n used to initialize the model in half-precision on a CUDA device only. There is also a fp16 branch which stores the fp16 weights,\n which could be used to further minimize the RAM usage:\n\n```python\n>>> from transformers import GPTJForCausalLM\n>>> import torch\n\n>>> device = ""cuda""\n>>> model = GPTJForCausalLM.from_pretrained(\n... ""EleutherAI/gpt-j-6B"",\n... revision=""float16"",\n... torch_dtype=torch.float16,\n... 
).to(device)\n```\n\n- The model should fit on 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. Adam\n optimizer for example makes four copies of the model: model, gradients, average and squared average of the gradients.\n So it would need at least 4x model size GPU memory, even with mixed precision as gradient updates are in fp32. This\n is not including the activations and data batches, which would again require some more GPU RAM. So one should explore\n solutions such as DeepSpeed, to train/fine-tune the model. Another option is to use the original codebase to\n train/fine-tune the model on TPU and then convert the model to Transformers format for inference. Instructions for\n that could be found [here](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md)\n\n- Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer. These extra\n tokens are added for the sake of efficiency on TPUs. To avoid the mismatch between embedding matrix size and vocab\n size, the tokenizer for [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) contains 143 extra tokens\n `<|extratoken_1|>... <|extratoken_143|>`, so the `vocab_size` of tokenizer also becomes 50400.\n\n## Usage examples\n\nThe [`~generation.GenerationMixin.generate`] method can be used to generate text using GPT-J\nmodel.\n\n```python\n>>> from transformers import AutoModelForCausalLM, AutoTokenizer\n\n>>> model = AutoModelForCausalLM.from_pretrained(""EleutherAI/gpt-j-6B"")\n>>> tokenizer = AutoTokenizer.from_pretrained(""EleutherAI/gpt-j-6B"")\n\n>>> prompt = (\n... ""In a shocking finding, scientists discovered a herd of unicorns living in a remote, ""\n... ""previously unexplored valley, in the Andes Mountains. Even more surprising to the ""\n... ""researchers was the fact that the unicorns spoke perfect English.""\n... 
)\n\n>>> input_ids = tokenizer(prompt, return_tensors=""pt"").input_ids\n\n>>> gen_tokens = model.generate(\n... input_ids,\n... do_sample=True,\n... temperature=0.9,\n... max_length=100,\n... )\n>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]\n```\n\n...or in float16 precision:\n\n```python\n>>> from transformers import GPTJForCausalLM, AutoTokenizer\n>>> import torch\n\n>>> device = ""cuda""\n>>> model = GPTJForCausalLM.from_pretrained(""EleutherAI/gpt-j-6B"", torch_dtype=torch.float16).to(device)\n>>> tokenizer = AutoTokenizer.from_pretrained(""EleutherAI/gpt-j-6B"")\n\n>>> prompt = (\n... ""In a shocking finding, scientists discovered a herd of unicorns living in a remote, ""\n... ""previously unexplored valley, in the Andes Mountains. Even more surprising to the ""\n... ""researchers was the fact that the unicorns spoke perfect English.""\n... )\n\n>>> input_ids = tokenizer(prompt, return_tensors=""pt"").input_ids.to(device)\n\n>>> gen_tokens = model.generate(\n... input_ids,\n... do_sample=True,\n... temperature=0.9,\n... max_length=100,\n... )\n>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]\n```\n\n## Resources\n\nA list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GPT-J. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! 
The resource should ideally demonstrate something new instead of duplicating an existing resource.\n\n<PipelineTag pipeline=""text-generation""/>\n\n- Description of [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B).\n- A blog on how to [Deploy GPT-J 6B for inference using Hugging Face Transformers and Amazon SageMaker](https://huggingface.co/blog/gptj-sagemaker).\n- A blog on how to [Accelerate GPT-J inference with DeepSpeed-Inference on GPUs](https://www.philschmid.de/gptj-deepspeed-inference).\n- A blog post introducing [GPT-J-6B: 6B JAX-Based Transformer](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/). 🌎\n- A notebook for [GPT-J-6B Inference Demo](https://colab.research.google.com/github/kingoflolz/mesh-transformer-jax/blob/master/colab_demo.ipynb). 🌎\n- Another notebook demonstrating [Inference with GPT-J-6B](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/GPT-J-6B/Inference_with_GPT_J_6B.ipynb). \n- [Causal language modeling](https://huggingface.co/course/en/chapter7/6?fw=pt#training-a-causal-language-model-from-scratch) chapter of the 🤗 Hugging Face Course.\n- [`GPTJForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling), [text generation example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-generation), and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).\n- [`TFGPTJForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_clmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb).\n- [`FlaxGPTJForCausalLM`] is supported by this [causal language modeling example 
script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/causal_language_modeling_flax.ipynb).\n\n**Documentation resources**\n- [Text classification task guide](../tasks/sequence_classification)\n- [Question answering task guide](../tasks/question_answering)\n- [Causal language modeling task guide](../tasks/language_modeling)\n\n## GPTJConfig\n\n[[autodoc]] GPTJConfig\n - all\n\n<frameworkcontent>\n<pt>\n\n## GPTJModel\n\n[[autodoc]] GPTJModel\n - forward\n\n## GPTJForCausalLM\n\n[[autodoc]] GPTJForCausalLM\n - forward\n\n## GPTJForSequenceClassification\n\n[[autodoc]] GPTJForSequenceClassification\n - forward\n\n## GPTJForQuestionAnswering\n\n[[autodoc]] GPTJForQuestionAnswering\n - forward\n\n</pt>\n<tf>\n\n## TFGPTJModel\n\n[[autodoc]] TFGPTJModel\n - call\n\n## TFGPTJForCausalLM\n\n[[autodoc]] TFGPTJForCausalLM\n - call\n\n## TFGPTJForSequenceClassification\n\n[[autodoc]] TFGPTJForSequenceClassification\n - call\n\n## TFGPTJForQuestionAnswering\n\n[[autodoc]] TFGPTJForQuestionAnswering\n - call\n\n</tf>\n<jax>\n\n## FlaxGPTJModel\n\n[[autodoc]] FlaxGPTJModel\n - __call__\n\n## FlaxGPTJForCausalLM\n\n[[autodoc]] FlaxGPTJForCausalLM\n - __call__\n</jax>\n</frameworkcontent>\n",What dataset was the GPT-J model trained on?\n,the Pile dataset,huggingface/transformers/blob/main/docs/source/en/model_doc/gptj.md


### Setup critique agents

The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):
- **Groundedness:** can the question be answered from the given context?
- **Relevance:** is the question relevant to users? For instance, `"What is the date when transformers 4.29.1 was released?"` is not relevant for ML practitioners.

One last failure case we've noticed is when a question is tailored to the particular setting where it was generated, but is undecipherable by itself, like `"What is the name of the function used in this guide?"`.
We also build a critique agent for this criterion:
- **Stand-alone**: is the question understandable without any context, for someone with domain knowledge/Internet access? The opposite of this would be `What is the function used in this article?` for a question generated from a specific blog article.

We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our eval dataset.

💡 __When asking the agents to output a score, we first ask them to produce a rationale. This helps us verify the scores, but most importantly, asking a model to output its rationale first gives it more tokens to think and elaborate an answer before summarizing it into a single score token.__

We now build and run these critique agents.

In [14]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating)
Total rating: (your rating)

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating)
Total rating: (your rating)

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question only makes sense in a specific context, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating)
Total rating: (your rating)

Now here is the question.

Question: {question}\n
Answer::: """

question_groundedness_critique_prompt = ChatPromptTemplate.from_template(
    question_groundedness_critique_prompt
)
question_groundedness_critique_agent = question_groundedness_critique_prompt | chat_model

question_relevance_critique_prompt = ChatPromptTemplate.from_template(
    question_relevance_critique_prompt
)
question_relevance_critique_agent = question_relevance_critique_prompt | chat_model

question_standalone_critique_prompt = ChatPromptTemplate.from_template(
    question_standalone_critique_prompt
)
question_standalone_critique_agent = question_standalone_critique_prompt | chat_model

In [15]:
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    # Critique the generated QA couple
    question_groundedness_evaluation = question_groundedness_critique_agent.invoke(
        {"context": output["context"], "question": output["question"]}
    ).content
    question_relevance_evaluation = question_relevance_critique_agent.invoke(
        {"question": output["question"]}
    ).content
    question_standalone_evaluation = question_standalone_critique_agent.invoke(
        {"question": output["question"]}
    ).content

    try:
        groundedness_score = int(question_groundedness_evaluation.split("Total rating: ")[1][0])
        groundedness_eval = question_groundedness_evaluation.split("Total rating: ")[0].split(
            "Evaluation: "
        )[1]
        relevance_score = int(question_relevance_evaluation.split("Total rating: ")[1][0])
        relevance_eval = question_relevance_evaluation.split("Total rating: ")[0].split(
            "Evaluation: "
        )[1]
        standalone_score = int(question_standalone_evaluation.split("Total rating: ")[1][0])
        standalone_eval = question_standalone_evaluation.split("Total rating: ")[0].split(
            "Evaluation: "
        )[1]
        output.update(
            {
                "groundedness_score": groundedness_score,
                "groundedness_eval": groundedness_eval,
                "relevance_score": relevance_score,
                "relevance_eval": relevance_eval,
                "standalone_score": standalone_score,
                "standalone_eval": standalone_eval,
            }
        )
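    except (IndexError, ValueError):
        # The critique did not follow the expected template: leave this QA couple unscored
        continue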

Generating critique for each QA couple...


  0%|          | 0/10 [00:00<?, ?it/s]
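
The parsing step above reads only the first character after `Total rating: `, and any critique that deviates from the template is skipped. A slightly more defensive variant can be sketched with a regular expression. `parse_critique` is a hypothetical helper name, not part of the original notebook, and it assumes the `Evaluation: ... Total rating: ...` layout requested in the prompts.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper (not in the original notebook): extracts the rationale and the
# 1-5 rating from a critique that follows the "Evaluation: ... Total rating: ..." template.
CRITIQUE_PATTERN = re.compile(
    r"Evaluation:\s*(?P<rationale>.*?)\s*Total rating:\s*(?P<score>[1-5])",
    re.DOTALL,
)


def parse_critique(text: str) -> Optional[Tuple[int, str]]:
    """Return (score, rationale), or None when the template was not followed."""
    match = CRITIQUE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group("score")), match.group("rationale")


example = "Evaluation: The context names the dataset explicitly.\nTotal rating: 5"
print(parse_critique(example))
print(parse_critique("I cannot rate this question."))
```

Returning `None` instead of raising makes it easy to count how many critiques failed to parse, rather than dropping them silently.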

Now let us filter out bad questions based on our critique agent scores:

In [26]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)
display(
    generated_questions[
        ["question", "answer", "groundedness_score", "relevance_score", "standalone_score"]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
display(
    generated_questions[
        ["question", "answer", "groundedness_score", "relevance_score", "standalone_score"]
    ]
)

eval_dataset = datasets.Dataset.from_pandas(
    generated_questions, split="train", preserve_index=False
)

Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
0,What was the version number of @gradio/button when improved markdown support and startup performance enhancements were introduced?\n,0.1.1,5.0,2.0,5.0
1,Who proposed the BARTpho model?\n,"Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen",5.0,3.0,5.0
2,What section of the dataset card provides a description of how the dataset is intended to be used?\n,Direct Use,5.0,4.0,4.0
3,What is the unique model structure of GPTSAN called as referenced in the T5 paper?\n,Prefix-LM,5.0,2.0,4.0
4,What dataset was the GPT-J model trained on?\n,the Pile dataset,5.0,4.0,5.0
5,"What is the official French translation of ""plugin"" in the KDE4 dataset?\n",module d'extension,3.0,2.0,5.0
6,What processing may data collators apply to form a batch?\n,Padding and random data augmentation like random masking.,,,
7,"What is the accuracy metric value of the ""my-cool-model"" in the model index?\n",0.9,5.0,2.0,4.0
8,What is the URL for the Hugging Face course chapter that provides additional information on Datasets functionalities beyond the basics?\n,https://huggingface.co/course/chapter5/1?fw=pt,5.0,4.0,4.0
9,How many main features does the Hugging Face Datasets library provide?\n,two,5.0,3.0,4.0


Unnamed: 0,question,answer,groundedness_score,relevance_score,standalone_score
2,What section of the dataset card provides a description of how the dataset is intended to be used?\n,Direct Use,5.0,4.0,4.0
4,What dataset was the GPT-J model trained on?\n,the Pile dataset,5.0,4.0,5.0
8,What is the URL for the Hugging Face course chapter that provides additional information on Datasets functionalities beyond the basics?\n,https://huggingface.co/course/chapter5/1?fw=pt,5.0,4.0,4.0


Now the synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.
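
The filtering rule used above (keep a question only if all three scores reach the threshold, which also drops questions whose critique could not be parsed) can be sketched in plain Python. This is only an illustration of the rule, not a replacement for the pandas code: `keep_question` is a hypothetical helper, and the records below mirror three rows of the table above.

```python
from typing import Any, Dict, List

SCORE_KEYS = ("groundedness_score", "relevance_score", "standalone_score")


def keep_question(record: Dict[str, Any], threshold: int = 4) -> bool:
    """Keep a record only if every critique score exists and reaches the threshold."""
    return all(
        record.get(key) is not None and record[key] >= threshold for key in SCORE_KEYS
    )


# Illustrative records mirroring three rows of the table above.
records: List[Dict[str, Any]] = [
    {
        "question": "Who proposed the BARTpho model?",
        "groundedness_score": 5,
        "relevance_score": 3,
        "standalone_score": 5,
    },
    {
        "question": "What dataset was the GPT-J model trained on?",
        "groundedness_score": 5,
        "relevance_score": 4,
        "standalone_score": 5,
    },
    # Critique parsing failed for this one, so it carries no scores.
    {"question": "What processing may data collators apply to form a batch?"},
]

kept = [record["question"] for record in records if keep_question(record)]
print(kept)
```

Only the GPT-J question survives: the first record fails on relevance, and the last one has no scores at all.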

# 2. Build our RAG System

### Building the vector database: preprocessing

- In this part, __we split the documents from our knowledge base into smaller chunks__ which will be the snippets on which the reader LLM will base its answer.
- The goal is to have semantically relevant snippets: large enough to support an answer on their own, but not so large that each idea gets diluted.

Many options exist for text splitting: splitting on words, on sentence boundaries, recursive chunking that processes documents in a tree-like way to preserve structure information... [this space](https://huggingface.co/spaces/A-Roucher/chunk_visualizer) lets you visualize how different splitting options affect the chunks you get. To learn more about chunking, I recommend you watch [this great guide](https://www.youtube.com/watch?v=8OJC21T2SL4) by Greg Kamradt.

> In the following, we use Langchain's `RecursiveCharacterTextSplitter`.

💡 To measure chunk length in our Text Splitter, our length function will not be the count of characters, but the count of tokens in the tokenized text: indeed, providing the embedder with similar-sized tokenized chunks empirically improves retrieval performance.
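
To illustrate what measuring chunk length in tokens means, here is a minimal sketch. It uses a whitespace split as a stand-in for a real tokenizer (an assumption made to keep the example dependency-free; the notebook itself relies on a Hugging Face tokenizer), and `chunk_by_tokens` is a hypothetical helper that is much simpler than `RecursiveCharacterTextSplitter`, which also tries to cut on natural separators.

```python
from typing import Callable, List


def chunk_by_tokens(
    text: str,
    chunk_size: int,
    chunk_overlap: int,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in for a real tokenizer
) -> List[str]:
    """Slide a window of `chunk_size` tokens over the text, sharing `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"tok{i}" for i in range(25))
chunks = chunk_by_tokens(text, chunk_size=10, chunk_overlap=1)
print([len(chunk.split()) for chunk in chunks])
```

Every chunk except the last has exactly `chunk_size` tokens, whatever the character length of the individual tokens.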

In [27]:
from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(ds)
]

  0%|          | 0/2647 [00:00<?, ?it/s]

In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer


def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: str,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of at most `chunk_size` tokens and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique

### Retriever - embeddings 🗂️
The __retriever acts like an internal search engine__: given the user query, it returns the most relevant documents from your knowledge base.

> For the knowledge base, we use Langchain vector databases since __they offer a convenient [FAISS](https://github.com/facebookresearch/faiss) index and allow us to keep document metadata throughout the processing__.

🛠️ __Options included:__

- Tune the chunking method:
    - Size of the chunks
    - Method: split on different separators, use [semantic chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)...
- Change the embedding model

In [29]:
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
import os


def load_embeddings(
    langchain_docs: List[LangchainDocument],
    chunk_size: int,
    embedding_model_name: Optional[str] = "thenlper/gte-small",
) -> FAISS:
    """
    Creates a FAISS index from the given embedding model and documents. Loads the index directly if it already exists.

    Args:
        langchain_docs: list of documents
        chunk_size: size of the chunks to split the documents into
        embedding_model_name: name of the embedding model to use

    Returns:
        FAISS index
    """
    # load embedding_model
    embedding_model = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        multi_process=True,
        model_kwargs={"device": "cuda"},
        encode_kwargs={"normalize_embeddings": True},  # set True to compute cosine similarity
    )

    # Check if embeddings already exist on disk
    index_name = f"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}"
    index_folder_path = f"./data/indexes/{index_name}/"
    if os.path.isdir(index_folder_path):
        return FAISS.load_local(
            index_folder_path,
            embedding_model,
            distance_strategy=DistanceStrategy.COSINE,
        )

    else:
        print("Index not found, generating it...")
        docs_processed = split_documents(
            chunk_size,
            langchain_docs,
            embedding_model_name,
        )
        knowledge_index = FAISS.from_documents(
            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
        )
        knowledge_index.save_local(index_folder_path)
        return knowledge_index
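
The `normalize_embeddings` option matters because, for unit-length vectors, the inner product equals the cosine similarity. The sketch below checks this identity on two toy vectors in plain Python. The helper names are hypothetical and the vectors are made up for illustration; it assumes non-zero vectors.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_embedding = [3.0, 4.0]
doc_embedding = [4.0, 3.0]

cosine = cosine_similarity(query_embedding, doc_embedding)
inner_product_of_normalized = dot(l2_normalize(query_embedding), l2_normalize(doc_embedding))
print(cosine, inner_product_of_normalized)
```

Both values agree up to floating-point error, which is why normalizing once at indexing time is enough to rank documents by cosine similarity.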

### Reader - LLM 💬

In this part, the __LLM Reader reads the retrieved documents to formulate its answer.__

🛠️ Here we tried the following options to improve results:
- Switch reranking on/off
- Change the reader model

In [30]:
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context, 
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}
</s>
<|assistant|>
"""

In [31]:
import torch
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

READER_MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(READER_MODEL_NAME, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME)

READER_LLM = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=500,
)

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

In [164]:
READER_LLM("What is the capital of France?")

'\n\nParis.\n\nNo, that\'s the most romantic city in the world. The capital is actually Paris, but I\'m getting ahead of myself.\n\nThe capital of France is actually a city called Paris. Paris is located in the north of France, and it is the largest city in France. Paris is also known as the "City of Love" because of its romantic atmosphere and beautiful architecture.\n\nParis is home to many famous landmarks, such as'

In [34]:
from ragatouille import RAGPretrainedModel


def answer_with_rag(
    question: str,
    call_llm: pipeline,
    knowledge_index: FAISS,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 7,
) -> Tuple[str, List[LangchainDocument]]:
    # Gather documents with retriever
    relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Generate an answer: the transformers pipeline returns a list of dicts
    answer = call_llm(final_prompt)[0]["generated_text"]

    return answer, relevant_docs

# 3. Benchmarking the RAG system

The RAG system and the evaluation dataset are now ready. The last step is to judge the RAG system's output on this evaluation dataset.

To this end, __we set up a judge agent__. ⚖️🤖

Out of [the different RAG evaluation metrics](https://docs.ragas.io/en/latest/concepts/metrics/index.html), we choose to focus only on faithfulness since it is the best end-to-end metric of our system's performance.

> We use GPT-4 as a judge for its good performance, but you could try with other models such as `kaist-ai/prometheus-13b-v1.0` or `BAAI/JudgeLM-33B-v1.0`.

💡 In the evaluation prompt, we give a detailed description of each score on the 1-5 scale, as is done in [Prometheus's prompt template](https://huggingface.co/kaist-ai/prometheus-13b-v1.0): this helps the model ground its ratings precisely. If instead you give the judge LLM a vague scale to work with, the outputs will not be consistent enough between examples.

💡 Again, prompting the LLM to output rationale before giving its final score gives it more tokens to help it formalize and elaborate a judgement.

In [35]:
import json
from ragatouille import RAGPretrainedModel


def run_rag_tests(
    eval_dataset: Dataset,
    llm: BaseChatModel,
    knowledge_index: FAISS,
    output_file: str,
    reranker: Optional[RAGPretrainedModel] = None,
    verbose: Optional[bool] = True,
):
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        outputs = []

    for example in tqdm(eval_dataset):  # eval_dataset is already the "train" split
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(question, llm, knowledge_index, reranker=reranker)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        outputs.append(
            {
                "question": question,
                "true_answer": example["answer"],
                "source_doc": example["source_doc"],
                "generated_answer": answer,
                "retrieved_docs": [doc for doc in relevant_docs],
            }
        )

        with open(output_file, "w") as f:
            json.dump(outputs, f)

In [36]:
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage


evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)

In [37]:
from langchain.chat_models import ChatOpenAI

eval_chat_model = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)
evaluator_name = "GPT4"


def evaluate_answers(
    answer_path: str,
    eval_chat_model: BaseChatModel,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    try:  # load previous generations if they exist
        with open(answer_path, "r") as f:
            answers = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        answers = []

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = eval_chat_model.invoke(eval_prompt)
        feedback, score = [item.strip() for item in eval_result.content.split("[RESULT]")]
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)
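
The unpacking of `split("[RESULT]")` above raises a `ValueError` whenever the judge omits or repeats the marker, and it stores the score as a string. A more tolerant parser can be sketched as follows. `parse_judge_output` is a hypothetical helper, not part of the original notebook, and it assumes the `Feedback: ... [RESULT] n` format requested in `EVALUATION_PROMPT`.

```python
import re
from typing import Optional, Tuple

# Matches the "[RESULT] n" marker, where n is an integer between 1 and 5.
RESULT_PATTERN = re.compile(r"\[RESULT\]\s*([1-5])")


def parse_judge_output(text: str) -> Tuple[str, Optional[int]]:
    """Split a judge reply into (feedback, score); score is None if no valid marker is found."""
    match = RESULT_PATTERN.search(text)
    if match is None:
        return text.strip(), None
    feedback = text[: match.start()].strip()
    return feedback, int(match.group(1))


reply = "Feedback: The answer matches the reference. [RESULT] 5"
print(parse_judge_output(reply))
print(parse_judge_output("The judge forgot the marker."))
```

Records with a `None` score can then be retried or excluded explicitly instead of interrupting the whole evaluation run.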

In [None]:
for chunk_size in [200, 300]:  # Add other chunk sizes (in tokens) as needed
    for embeddings in ["BAAI/bge-base-en-v1.5"]:  # Add other embeddings as needed
        for rerank in [False, True]:
            settings_name = (
                f"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}"
            )
            output_file_name = f"./output/rag_{settings_name}.json"

            print(f"Running evaluation for {settings_name}:")

            print("Loading knowledge base embeddings...")
            knowledge_index = load_embeddings(
                RAW_KNOWLEDGE_BASE,
                chunk_size=chunk_size,
                embedding_model_name=embeddings,
            )

            print("Running RAG...")
            reranker = (
                RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0") if rerank else None
            )
            run_rag_tests(
                eval_dataset=eval_dataset,
                llm=READER_LLM,
                knowledge_index=knowledge_index,
                output_file=output_file_name,
                reranker=reranker,
                verbose=False,
            )

            print("Running evaluation...")
            evaluate_answers(
                output_file_name,
                eval_chat_model,
                evaluator_name,
                evaluation_prompt_template,
            )
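
Each output file now holds one judged record per question. To compare configurations, the 1-5 scores have to be aggregated into a single number. The sketch below is one plausible way to do it, not necessarily how the scores plotted in the next section were computed: it assumes scores are stored as strings under `eval_score_GPT4` (as `evaluate_answers` does above) and maps the 1-5 scale linearly onto 0-100%. `normalized_accuracy` is a hypothetical helper name, and the records are made up for illustration.

```python
from typing import Any, Dict, List


def normalized_accuracy(
    records: List[Dict[str, Any]], score_key: str = "eval_score_GPT4"
) -> float:
    """Average the 1-5 judge scores and rescale them linearly to a 0-100 percentage."""
    scores = [int(record[score_key]) for record in records if score_key in record]
    if not scores:
        raise ValueError("no judged records found")
    mean_score = sum(scores) / len(scores)
    return (mean_score - 1) / 4 * 100


# Made-up judged records, in the shape written by `evaluate_answers`.
judged = [
    {"question": "q1", "eval_score_GPT4": "5"},
    {"question": "q2", "eval_score_GPT4": "3"},
    {"question": "q3", "eval_score_GPT4": "4"},
]
print(normalized_accuracy(judged))
```

With this mapping, a mean score of 1 gives 0% and a mean score of 5 gives 100%.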

## Example results

Let's load the results from my own use case.

As you can see in the graph below, you should try several different directions when tuning your RAG system.
Some changes will not bring any improvement, while others will bring a huge performance boost.

In [None]:
import plotly.express as px

scores = datasets.load_dataset("m-ric/rag_scores_cookbook", split="train")
scores = pd.Series(scores["normalized_score"], index=scores["settings"])

fig = px.bar(
    scores,
    color=scores,
    labels={
        "value": "Accuracy",
        "settings": "Configuration",
    },
    color_continuous_scale="bluered",
)
fig.update_layout(
    width=900,
    height=600,
    barmode="group",
    yaxis_range=[0, 100],
    title="<b>Accuracy of different RAG configurations</b>",
    xaxis_title="RAG settings",
)
fig.layout.yaxis.ticksuffix = "%"
fig.update_coloraxes(showscale=False)
fig.update_traces(texttemplate="%{y:.1f}", textposition="outside")
fig.show()

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_settings_accuracy.png" height="500">

As you can see, these changes affect performance to varying degrees. In particular, tuning the chunk size is both easy and very impactful.

But this is our case; your results could be very different. Now that you have a robust evaluation pipeline, you can set out to explore other options!