**Heidelberg University**

**Data Science Group**
    
Prof. Dr. Michael Gertz  

Ashish Chouhan, Satya Almasian, John Ziegler, Jayson Salazar, Nicolas Reuter
    
January 16, 2024
    
Natural Language Processing with Transformers

Winter Semester 2023/2024
***

# **Assignment 4: Question Answering**
**Due**: Monday, January 29, 2024, 2pm, via [Moodle](https://moodle.uni-heidelberg.de/course/view.php?id=19251)



### **Submission Guidelines**

- Solutions need to be uploaded as a **single** Jupyter notebook. You will find several pre-filled code segments in the notebook; your task is to fill in the missing cells.
- For the written solution, use LaTeX in markdown inside the same notebook. Do **not** hand in a separate file for it.
- Download the .zip file containing the dataset but do **not** upload it with your solution.
- It is sufficient if one person per group uploads the solution to Moodle, but make sure that the full names of all team members are given in the notebook.

***

## **Task 1: Retrieval Augmented Generation (RAG)** (4.5 + 3 + 4 + 3 + 1.5 = 16 points)

In this task, we use the open-source `Llama-2-13b-chat` model to build a RAG system. You must first apply for access to the Llama 2 models via [this](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) form (access is typically granted within a few hours). You also need to request access to the model on Hugging Face via its [model](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) card. ***Note that the email you provide for your Hugging Face account must match the email you used to request Llama 2.***

The final piece you need is a Hugging Face authentication token. To create one, go to `Settings` in your Hugging Face profile and generate a new token under the `Access Tokens` menu.

To store the document embeddings, you will need a free Pinecone [API key](https://app.pinecone.io/).
Make sure you have these pieces ready before starting to work on this task.

----
When ready, let's start by downloading the necessary packages.

We advise running this notebook on a GPU (if you are on Colab, make sure that a GPU runtime is activated).


Place all the access tokens in the `.env` file and upload it to the working directory (if you are running this notebook locally, you can change the path to fit your working directory). Please use the following format:


```
HF_AUTH="Hugging Face Authentication Key"
PINECONE_API_KEY="Pinecone API Key"
PINECONE_ENVIRONMENT="Pinecone Environment"
```

Run the cell below to load the access tokens into the environment variables.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive

#drive.mount('/gdrive')
#%cd /gdrive/My Drive/NLPT_Assignment4


In [1]:
%pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [2]:
import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables from the nearest .env file
_ = load_dotenv(find_dotenv())

# Caution: these lines print the secrets in plain text; clear the output before sharing the notebook.
print(os.environ.get('HF_AUTH'))
print(os.environ.get('PINECONE_API_KEY'))

hf_[REDACTED]
[REDACTED]
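
Printing the tokens, as in the cell above, exposes them to anyone who sees the notebook output. A minimal sketch of an alternative check that only reports which variables are missing; the helper name `missing_keys` is hypothetical and not part of `python-dotenv`:

```python
import os

# Keys this notebook expects in the .env file (see the format above)
REQUIRED_KEYS = ("HF_AUTH", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT")

def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the required keys that are unset or empty, without revealing values."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]

# Example with an explicit mapping instead of the real environment
example_env = {"HF_AUTH": "hf_xxx", "PINECONE_API_KEY": ""}
print(missing_keys(env=example_env))  # ['PINECONE_API_KEY', 'PINECONE_ENVIRONMENT']
```

Calling `missing_keys()` without arguments checks the real environment after `load_dotenv` has run.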


In [3]:
%pip install pinecone-client==2.2.4
#%pip install -U pinecone-client

Collecting pinecone-client==2.2.4
  Downloading pinecone_client-2.2.4-py3-none-any.whl (179 kB)
Collecting loguru>=0.5.0 (from pinecone-client==2.2.4)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Collecting dnspython>=2.0.0 (from pinecone-client==2.2.4)
  Downloading dnspython-2.5.0-py3-none-any.whl (305 kB)
Installing collected packages: loguru, dnspython, pinecone-client
Successfully installed dnspython-2.5.0 loguru-0.7.2 pinecone-client-2.2.4


In [4]:
%pip install litellm
%pip install -qU trulens_eval pydantic fastapi kaleido python-multipart uvicorn cohere openai tiktoken "llama-index"
%pip install transformers
%pip install sentence-transformers
%pip install pinecone-client
%pip install datasets
%pip install accelerate
%pip install einops
%pip install langchain
%pip install xformers
%pip install bitsandbytes
%pip install matplotlib seaborn tqdm
%pip install chromadb
%pip install evaluate
%pip install rouge_score
%pip install bert_score

Collecting litellm
  Downloading litellm-1.18.9-py3-none-any.whl (2.4 MB)
Collecting openai>=1.0.0 (from litellm)
  Downloading openai-1.9.0-py3-none-any.whl (223 kB)
Collecting tiktoken>=0.4.0 (from litellm)
  Downloading tiktoken-0.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting httpx<1,>=0.23.0 (from openai>=1.0.0->litellm)
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting typing-extensions<5,>=4.7 (from openai>=1.0.0->litellm)
  Downloading typing_extensions-4.9

In [5]:
import torch

has_gpu = torch.cuda.is_available()
has_mps = torch.backends.mps.is_available()
# Prefer CUDA, then Apple's MPS backend, and fall back to the CPU
device = "cuda" if has_gpu else "mps" if has_mps else "cpu"
print(device)

cuda




## Subtask 1.1: Data Preparation



We need a collection of documents to perform retrieval on. To bring the task closer to your final project, you will download and use a subset of the LangChain documentation. The command below downloads the `.html` files linked from the given web page into a `docs` directory: `-r` enables recursive retrieval, `-l1` limits the recursion to the first level of depth, `-A.html` accepts only HTML files, and `-P docs` sets the target directory.


In [6]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [7]:
!wget -r -l1 -A.html -P docs https://api.python.langchain.com/en/stable/langchain_api_reference.html

--2024-01-23 08:22:21--  https://api.python.langchain.com/en/stable/langchain_api_reference.html
Resolving api.python.langchain.com (api.python.langchain.com)... 104.17.33.82, 104.17.32.82, 2606:4700::6811:2152, ...
Connecting to api.python.langchain.com (api.python.langchain.com)|104.17.33.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘docs/api.python.langchain.com/en/stable/langchain_api_reference.html’

api.python.langchai     [ <=>                ] 256.55K  --.-KB/s    in 0.03s   

2024-01-23 08:22:21 (9.86 MB/s) - ‘docs/api.python.langchain.com/en/stable/langchain_api_reference.html’ saved [262709]

Loading robots.txt; please ignore errors.
--2024-01-23 08:22:21--  https://api.python.langchain.com/robots.txt
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 130 [text/plain]
Saving to: ‘docs/api.python.langchain.com/robots.txt.tmp’


2024-01-23 08:22:2

The docs will serve as input text for answering questions that a language model might not be able to answer on its own (the LangChain docs are not necessarily part of the training data of Llama2). We can use LangChain itself to process these docs. Use the [ReadTheDocsLoader](https://python.langchain.com/docs/integrations/document_loaders/readthedocs_documentation) to load the docs from the `docs` folder.

At the time of creating this notebook, `423` documents were downloaded. However, since the documentation is updated regularly, this number might be different for you.

In [8]:
from langchain.document_loaders import ReadTheDocsLoader
#### your code ####
loader = ReadTheDocsLoader("docs")
docs = loader.load()
#### your code ####
len(docs)

423

Let's take a look at one of the documents. You can see that LangChain has created a `Document` object. Look at the example below and fill in the cells to print out the text content and URL of the page (the URL of the page should start with `https://`).

In [9]:
docs[10]

Document(page_content='langchain_experimental 0.0.47¶\nlangchain_experimental.agents¶\nFunctions¶\nagents.agent_toolkits.csv.base.create_csv_agent(...)\nCreate csv agent by loading to a dataframe and using pandas agent.\nagents.agent_toolkits.pandas.base.create_pandas_dataframe_agent(llm,\xa0df)\nConstruct a pandas agent from an LLM and dataframe.\nagents.agent_toolkits.python.base.create_python_agent(...)\nConstruct a python agent from an LLM and tool.\nagents.agent_toolkits.spark.base.create_spark_dataframe_agent(llm,\xa0df)\nConstruct a Spark agent from an LLM and dataframe.\nagents.agent_toolkits.xorbits.base.create_xorbits_agent(...)\nConstruct a xorbits agent from an LLM and dataframe.\nlangchain_experimental.autonomous_agents¶\nClasses¶\nautonomous_agents.autogpt.agent.AutoGPT(...)\nAgent class for interacting with Auto-GPT.\nautonomous_agents.autogpt.memory.AutoGPTMemory\nMemory for AutoGPT.\nautonomous_agents.autogpt.output_parser.AutoGPTAction(...)\nAction returned by AutoGPT

In [10]:
#### your code ####
from bs4 import BeautifulSoup

def retrieve_url(doc):
  """Return the canonical URL of the page a Document was loaded from, or None."""
  url = None  # avoids an UnboundLocalError when no canonical link exists
  with open(doc.metadata['source']) as original_file:
    soup = BeautifulSoup(original_file, 'html.parser')
    # Find the canonical link tag
    canonical_tag = soup.find('link', {'rel': 'canonical'})

    # Extract the URL from the href attribute
    if canonical_tag:
      url = canonical_tag.get('href')
    else:
      print("No canonical link found.")
  return url

page_content = docs[10].page_content
page_url = retrieve_url(docs[10])
#### your code ####
print(page_content)
print(page_url)

langchain_experimental 0.0.47¶
langchain_experimental.agents¶
Functions¶
agents.agent_toolkits.csv.base.create_csv_agent(...)
Create csv agent by loading to a dataframe and using pandas agent.
agents.agent_toolkits.pandas.base.create_pandas_dataframe_agent(llm, df)
Construct a pandas agent from an LLM and dataframe.
agents.agent_toolkits.python.base.create_python_agent(...)
Construct a python agent from an LLM and tool.
agents.agent_toolkits.spark.base.create_spark_dataframe_agent(llm, df)
Construct a Spark agent from an LLM and dataframe.
agents.agent_toolkits.xorbits.base.create_xorbits_agent(...)
Construct a xorbits agent from an LLM and dataframe.
langchain_experimental.autonomous_agents¶
Classes¶
autonomous_agents.autogpt.agent.AutoGPT(...)
Agent class for interacting with Auto-GPT.
autonomous_agents.autogpt.memory.AutoGPTMemory
Memory for AutoGPT.
autonomous_agents.autogpt.output_parser.AutoGPTAction(...)
Action returned by AutoGPTOutputParser.
autonomous_agents.autogpt.output_pa

As you can imagine, the documents can be long, and if several of them are required as context to answer a question, we need to take the document lengths into account.
This is because language models have a limited context window. We plan to use Llama2 for this project, where the maximum token limit is 4096. This limit covers not only the input but also the generated output; moreover, you need to leave room for the query and the instructions. Therefore, it is important to chunk the longer documents into smaller fragments.

The length of these fragments depends on your use case and on how many contexts you plan to feed into the model.
Here, we assign 2000 tokens to the context and generate the answer from 5 context fragments, which leaves us with a maximum chunk size of 400 tokens per context fragment.
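
The arithmetic behind these numbers can be written down as a small sketch. The constant names are chosen for illustration only; the values are the ones stated above.

```python
# Token budget for one request, using the numbers from the text above
CONTEXT_WINDOW = 4096   # Llama2 limit for input plus generated output
CONTEXT_BUDGET = 2000   # tokens reserved for the retrieved context
NUM_FRAGMENTS = 5       # context fragments passed to the model

# Each fragment may use an equal share of the context budget
max_chunk_tokens = CONTEXT_BUDGET // NUM_FRAGMENTS

# Whatever is left must hold the query, the instructions, and the answer
remaining_tokens = CONTEXT_WINDOW - CONTEXT_BUDGET

print(max_chunk_tokens)   # 400
print(remaining_tokens)   # 2096
```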

To count the number of tokens in a chunk, we need to load the correct tokenizer for Llama2. Fill in the code cell below to load the correct tokenizer and use it to complete the function that counts the number of tokens in a given chunk.

**Hint:** you need to use your Hugging Face authentication token to load the tokenizer.

In [11]:
#If you get an error here during the first import from the `transformers` package, restart the kernel and try again.
#### your code ####
from transformers import AutoTokenizer

model_id = 'meta-llama/Llama-2-13b-chat-hf'

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ.get('HF_AUTH'))

#### your code ####


In [12]:
def token_len(text):
  #### your code ####
  # Tokenize the input text
  tokens = tokenizer.tokenize(text)

  # Calculate the length of the tokenized sequence
  tokens_length = len(tokens)

  return tokens_length

  #### your code ####

Count the number of tokens for all documents and use the counts to compute the minimum, maximum, and average token count across all documents. Depending on how the documentation has been updated by the time you run the cell below, the numbers might differ slightly.

In [13]:
#### your code ####
import numpy as np

token_counts = [token_len(doc.page_content) for doc in docs]
min_tokens = np.min(token_counts)
avg_tokens = np.average(token_counts)
max_tokens = np.max(token_counts)
#### your code ####

print(f"""Min: {min_tokens}
Avg: {avg_tokens}
Max: {max_tokens}""")

Min: 47
Avg: 2662.751773049645
Max: 36799


Now we will use LangChain's built-in chunking functionality to split the text into smaller chunks. LangChain offers a variety of text splitters that you can check out [here](https://api.python.langchain.com/en/latest/langchain_api_reference.html#module-langchain.text_splitter).
Use the general-purpose splitter that splits text by recursively looking at characters. Use this class to split the text into chunks of at most 400 tokens, where the length of each chunk is computed with the `token_len` function. Length is not the only criterion for splitting: the splitter tries the separators `'\n\n', '\n', ' ', ''` in this order and only falls back to the next one when a piece is still too long, so chunks tend to end at paragraph or line boundaries.
Since splitting at a maximum length can cut off context that a chunk needs to be coherent, let consecutive chunks overlap by 50 tokens. This way, we preserve some of the previous context while chunking.
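
To make the effect of the overlap concrete, here is a simplified sketch that slides a fixed-size window over a list of tokens. It is not LangChain's algorithm (which also respects the separators); the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split `tokens` into windows of `chunk_size` that share `overlap` items."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

# Ten toy "tokens", windows of 4 with an overlap of 1
print(chunk_with_overlap(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk starts with the last token of the previous one, which is the role the 50-token overlap plays in this assignment.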

In [14]:
#### your code ####
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 400,
    chunk_overlap  = 50,
    length_function = token_len,
)
#### your code ####

In [15]:
chunks = text_splitter.split_text(docs[100].page_content)
len(chunks)

21

In [16]:
token_len(chunks[0])

372

The next step is to apply the splitting function to all the documents in our corpus and to save our chunks in a logical way. We also want to assign a unique ID to each chunk so we know which part of the documentation they come from. In the end, the corpus should be transformed into a list of dictionaries of the following format:


```
[
    {
        "id": "glossary-0",
        "text": "first chunk of the document glossary",
        "source": "https://langchain.readthedocs.io/en/latest/glossary.html"
    },
    {
        "id": "glossary-1",
        "text": "second chunk of glossary",
        "source": "https://langchain.readthedocs.io/en/latest/glossary.html"
    }
    ...
]
```

Construct the IDs by taking the name of the page (the file name without the `.html` suffix) and appending a sequential number indicating which chunk of the page it is.
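
One possible way to derive such an ID from a URL, using only the standard library. The helper name `chunk_id` is hypothetical; the URL is the one from the format example above.

```python
import posixpath
from urllib.parse import urlparse

def chunk_id(url, chunk_index):
    """Build an ID such as 'glossary-0' from a page URL and a chunk index."""
    page_file = posixpath.basename(urlparse(url).path)   # 'glossary.html'
    page_name, _suffix = posixpath.splitext(page_file)   # 'glossary'
    return f"{page_name}-{chunk_index}"

url = "https://langchain.readthedocs.io/en/latest/glossary.html"
print([chunk_id(url, i) for i in range(2)])  # ['glossary-0', 'glossary-1']
```

Note that two pages with the same file name in different directories would receive the same IDs under this scheme; including part of the path is one way to avoid that if it matters for your corpus.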


In [17]:
from tqdm.auto import tqdm
from urllib.parse import urlparse

documents = []

for doc in tqdm(docs):
  #### your code ####
    url = retrieve_url(doc)  # defined above
    # Page name without directories and without the `.html` suffix
    uid_basename = os.path.splitext(os.path.basename(urlparse(url).path))[0]
    chunks = text_splitter.split_text(doc.page_content)

    # Number the chunks of each page starting from 0
    for chunk_idx, chunk in enumerate(chunks):
      uid = f"{uid_basename}-{chunk_idx}"
      documents.append({"id": uid, "text": chunk, "source": url})

  #### your code ####
len(documents) # once again this value might differ based on how the LangChain documentation is updated

  0%|          | 0/423 [00:00<?, ?it/s]

3638

For the next steps, we require a `DataFrame`.

In [18]:
import pandas as pd
data = pd.DataFrame(documents)
data.head()

| | id | text | source |
|---|---|---|---|
| 0 | langchain_api_reference-0 | langchain 0.1.2¶\nlangchain.agents¶\nAgent is ... | https://api.python.langchain.com/en/latest/lan... |
| 1 | langchain_api_reference-1 | agents.agent.MultiActionAgentOutputParser\nBas... | https://api.python.langchain.com/en/latest/lan... |
| 2 | langchain_api_reference-2 | agents.conversational.output_parser.ConvoOutpu... | https://api.python.langchain.com/en/latest/lan... |
| 3 | langchain_api_reference-3 | Run an OpenAI Assistant.\nagents.openai_functi... | https://api.python.langchain.com/en/latest/lan... |
| 4 | langchain_api_reference-4 | agents.output_parsers.self_ask.SelfAskOutputPa... | https://api.python.langchain.com/en/latest/lan... |


#### ${\color{red}{Comments\ 1.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.2: Document Embedding Pipeline


In this task, we initialize the embedding pipeline that transforms the chunks into vector embeddings using Hugging Face and LangChain. These embeddings are used for the similarity search between the query and the chunks, which retrieves the most relevant chunks.
We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding, a rather small model that you can easily run on Colab. Initialize the model using `HuggingFaceEmbeddings` to use Hugging Face via LangChain. The encoding batch size should be 32; make sure that the model is placed on the correct device, otherwise embedding can take a long time.

In [19]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import os
import pinecone
from tqdm import tqdm

In [20]:
embedding_model = 'sentence-transformers/all-MiniLM-L6-v2'
device = 'cuda:0' # make sure you are on gpu
docs = [
    "An example document",
    "A second document as an example"
]
### your code ###
model_kwargs = {'device': device}
encode_kwargs = {'batch_size': 32}  # encoding batch size of 32, as the task asks
embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
### your code ###

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Embed the example documents using the model you created and check the output.
The output should be a list of lists, containing the embeddings.

In [21]:
### your code ###
embeddings = embed_model.embed_documents(docs)
### your code ###
print("number of docs:",len(embeddings))
print("dimension of docs:",len(embeddings[0]))

number of docs: 2
dimension of docs: 384


Now we use the embedding pipeline created above to store the embeddings in a Pinecone vector index. First, let's set up the Pinecone environment: collect your API key and environment name from the environment variables, and initialize Pinecone with them.

In [22]:
### your code ###
pinecone.init(api_key=os.environ['PINECONE_API_KEY'], environment=os.environ['PINECONE_ENVIRONMENT'])
### your code ###

Initialize the index `rag-assignment` inside Pinecone. Use cosine similarity as the similarity metric. Keep in mind that on the free tier only one index is allowed, so if you run this multiple times, you need to delete the existing index to make room for a new one (a Pinecone index gets archived automatically after 14 days of inactivity).
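
For reference, cosine similarity compares the direction of two vectors and ignores their length. A minimal pure-Python sketch (Pinecone computes this internally; the function below is only an illustration and assumes non-zero vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```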

In [23]:
index_name = 'rag-assignment'
### your code ###
try:
  if index_name not in pinecone.list_indexes():
    # The index dimension must match the embedding size of the model
    pinecone.create_index(name=index_name, dimension=len(embeddings[0]), metric="cosine", shards=1)
  index = pinecone.Index(index_name)
except Exception as e:
  print(f"Error initializing Pinecone index: {e}")
  print(f"Index: {pinecone.list_indexes()}")
### your code ###

Let's take a look at the index you created. For now, the index should be empty but have the correct embedding dimension.

In [24]:
index_name = 'rag-assignment'
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Process the dataset in batches of `32` and push the vectors to the Pinecone index. Your index should include the IDs and embeddings for each chunk. As metadata, pass the original text as `text` and the URL as `source` (no need to add the `https`). We use this metadata later to retrieve the original text.
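
The batching pattern itself can be sketched without Pinecone or an embedding model. In the sketch below, `iter_batches`, `fake_embed`, and the toy records are stand-ins invented for illustration; only the `(id, vector, metadata)` layout follows the description above.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(texts):
    # Stand-in for the embedding model: one 3-dimensional vector per text
    return [[float(len(text)), 0.0, 1.0] for text in texts]

# Toy records in the format built in Subtask 1.1
records = [
    {"id": f"page-{i}", "text": f"chunk {i}", "source": "example.com/page.html"}
    for i in range(70)
]

batch_sizes = []
for batch in iter_batches(records, 32):
    vectors = fake_embed([record["text"] for record in batch])
    # One (id, vector, metadata) tuple per chunk, ready to be upserted
    payload = [
        (record["id"], vector, {"text": record["text"], "source": record["source"]})
        for record, vector in zip(batch, vectors)
    ]
    batch_sizes.append(len(payload))

print(batch_sizes)  # [32, 32, 6]
```

The last batch is simply shorter than the others, so no padding or special handling is needed.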

In [25]:
batch_size = 32

for i in tqdm(range(0, len(data), batch_size)):
  ### your code ###
    batch = data.iloc[i:i+batch_size]

    ids = batch["id"].tolist()
    texts = batch["text"].tolist()
    sources = batch["source"].tolist()

    # Embed the chunk texts of this batch
    embeds = embed_model.embed_documents(texts)

    # Keep the original text and URL as metadata, one dict per chunk
    metadata = [{'source': source, 'text': text} for source, text in zip(sources, texts)]

    # Upsert (id, embedding, metadata) tuples into the index
    index.upsert(vectors=list(zip(ids, embeds, metadata)))

    ### your code ###


100%|██████████| 114/114 [00:27<00:00,  4.08it/s]


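Now if we look at the index statistics, we should see vectors of dimension `384`. The vector count can be lower than `len(documents)` when several chunks share an ID, because an upsert overwrites an existing vector with the same ID.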

In [30]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.03593,
 'namespaces': {'': {'vector_count': 3593}},
 'total_vector_count': 3593}

#### ${\color{red}{Comments\ 1.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.3: Text Generation Pipeline


So far we have our index ready and a way to find the most similar chunks to our query. Now, we need a way to generate the answer from the retrieved chunks. For this purpose, we use the `text-generation` pipeline from Hugging Face (refer to the Hugging Face [tutorial](https://moodle.uni-heidelberg.de/pluginfile.php/1286642/mod_resource/content/1/HuggingFace.ipynb)) and load it into LangChain using a wrapper.

In [31]:
from torch import cuda, bfloat16
import os
import transformers
model_id = 'meta-llama/Llama-2-13b-chat-hf'

Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This makes it possible to load larger models that would normally not fit into memory and can speed up inference.
To make the process of model quantization more accessible, Hugging Face has seamlessly integrated with the [Bitsandbytes](https://huggingface.co/docs/accelerate/usage_guides/quantization) library.

Define a config from `Bitsandbytes` that enables 4-bit quantization. Set the quantization data type to normalized float 4 (NF4), which changes the data type of the weights from float 32 (the default) to a data type that contains 4 bits of information.
Additionally, set the compute type so that the weights are stored in 4 bits while the computation happens in 16 bits (bfloat16).
Moreover, enable nested quantization by setting `bnb_4bit_use_double_quant` to `True`, which applies a second quantization after the first one to save an additional 0.4 bits per parameter.
Refer to [here](https://huggingface.co/docs/transformers/main_classes/quantization) for more information.
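
A back-of-the-envelope estimate shows why this matters for a 13B model. The sketch assumes roughly 13 billion weights and counts the weights only; quantization constants, layers that stay in higher precision, and activations are ignored, so real footprints will differ.

```python
NUM_PARAMS = 13e9  # assumption: roughly 13 billion weights

def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# float32: 52.0 GB
# bfloat16: 26.0 GB
# 4-bit: 6.5 GB

# Saving from nested quantization at 0.4 bits per parameter
print(f"{weight_memory_gb(NUM_PARAMS, 0.4):.2f} GB")  # 0.65 GB
```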

In [32]:
from transformers import BitsAndBytesConfig
  ### your code ###
bitsAndBites_config = BitsAndBytesConfig(
    load_in_4bit=True,                  #enables 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,      #sets computation to bfloat16 for speedups
    bnb_4bit_quant_type="nf4",          #sets the quantization data type to normalized float 4 (NF4)
    bnb_4bit_use_double_quant=True,     #enables nested quantization
)
  ### your code ###

Use your Hugging Face token to load the correct model configuration using the `transformers` library.

In [33]:
### your code ###

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=os.environ.get('HF_AUTH')
)
### your code ###



Load the model for text generation (pay attention to the model type) using the configuration you have defined and the specified quantization, and set the `trust_remote_code` flag to `True`. Another flag that is useful for large models is `device_map="auto"`. With this flag, Accelerate determines where to put each layer to maximize the use of GPUs and offloads the rest to the CPU, or even the hard drive if you don't have enough GPU RAM (or CPU RAM).

It will take a while for the model to download.

In [34]:
#Loading the model will take some time, (roughly 5 min)
### your code ###
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bitsAndBites_config,
    device_map='auto',
    token=os.environ.get('HF_AUTH'),
)
### your code ###
model.eval()# we only use the model for inference
print(f"Model loaded ")


Model loaded 


You can even check the memory footprint of your model using the `get_memory_footprint` method.


In [35]:
model.get_memory_footprint()

7083970560

The next thing we need to do is initialize a `text-generation` pipeline with Hugging Face that uses the Llama2 model to generate text from a given input. We will then use this pipeline inside LangChain to build our question-answering system.
The `text-generation` pipeline generates text from a language model conditioned on a given input. The pipeline is similar to other Hugging Face pipelines and requires two things that we must initialize:

1.   A language model, in this case, it will be `meta-llama/Llama-2-13b-chat-hf`.
2.   A tokenizer for the language model.

LangChain expects the full-text output, so set `return_full_text` to `True`. You can also pass additional generation parameters to the model.
Since we want the questions to be answered mainly based on the retrieved chunks, let's set the model temperature to a low value of 0.01 to reduce randomness. Additionally, add a repetition penalty of 1.1 to stop the model from repeating itself and set the maximum number of generated tokens to 512.
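
To see why a low temperature reduces randomness, consider how temperature rescales the scores before they are turned into probabilities. The sketch below uses made-up scores for three candidate tokens and a plain softmax; it illustrates the principle and is not the code path used by `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities after dividing them by the temperature."""
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.01)

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_low])  # [1.0, 0.0, 0.0]
```

At temperature 0.01 practically all probability mass sits on the highest-scoring token, so sampling behaves almost like always picking the most likely token.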

In [36]:
### your code ###
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    token=os.environ.get('HF_AUTH')
)

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    temperature=0.01,  # 'randomness' of outputs; values close to 0 make generation nearly deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
### your code ###

We provide the language model a general question to make sure our pipeline is working correctly.

In [37]:
sample_input = "Explain to me the difference between alligator and crocodile."
### your code ###
generated_text = generate_text(sample_input)[0]['generated_text']
### your code ###
print(generated_text)

Explain to me the difference between alligator and crocodile.

Alligators and crocodiles are both large, carnivorous reptiles that live in aquatic environments, but there are several key differences between them. Here are some of the main differences:

1. Appearance: Alligators have a wider, rounder snout compared to crocodiles, which have a longer, thinner snout. Alligators also have a more rounded body shape, while crocodiles are more streamlined.
2. Habitat: Alligators are found only in freshwater environments, such as lakes, rivers, and swamps, while crocodiles can be found in both freshwater and saltwater environments.
3. Geographic range: Alligators are only found in the southeastern United States and China, while crocodiles are found in many parts of the world, including Africa, Asia, Australia, and the Americas.
4. Nesting habits: Alligators build mounds of vegetation and mud to lay their eggs, while crocodiles dig holes in the sand or mud to lay their eggs.
5. Jaw structure: A

Use the LangChain Hugging Face pipeline wrapper, which can be used inside an [LLM chain](https://python.langchain.com/docs/modules/chains/foundational/llm_chain), to create an interface for the text generation pipeline.

In [38]:
### your code ###
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)
### your code ###

To confirm that it works the same way, use the sample input to generate text using the llm chain. The input should be passed as the `prompt` to the language model.

In [39]:
### your code ###
generated_text = llm(sample_input)
print(generated_text)
### your code ###


Alligators and crocodiles are both large, carnivorous reptiles that live in aquatic environments, but there are several key differences between them. Here are some of the main differences:

1. Appearance: Alligators have a wider, rounder snout compared to crocodiles, which have a longer, thinner snout. Alligators also have a more rounded body shape, while crocodiles are more streamlined.
2. Habitat: Alligators are found only in freshwater environments, such as lakes, rivers, and swamps, while crocodiles can be found in both freshwater and saltwater environments.
3. Geographic range: Alligators are only found in the southeastern United States and China, while crocodiles are found in many parts of the world, including Africa, Asia, Australia, and the Americas.
4. Nesting habits: Alligators build mounds of vegetation and mud to lay their eggs, while crocodiles dig holes in the sand or mud to lay their eggs.
5. Jaw structure: Alligators have a different jaw structure than crocodiles, wit

In [40]:
print(sample_input)

Explain to me the difference between alligator and crocodile.


#### ${\color{red}{Comments\ 1.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.4: Question Answering Chain


For Retrieval Augmented Generation (RAG) in LangChain, we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object.

`RetrievalQA` is a chain for question-answering tasks that uses an index to retrieve relevant documents or text chunks. It is suitable for straightforward Q&A applications.

`RetrievalQAWithSourcesChain` is an extension of `RetrievalQA` that chains together multiple sources of information, providing context and the source for answers.

 For both of these, we need an LLM and a Pinecone index. For LangChain to be able to use the Pinecone index, we need to initialize it through the LangChain vector store.

 **Hint**: You need to explicitly tell the vector storage where to find the original text.

In [41]:
from langchain.vectorstores import Pinecone
### your code ###

vectorstore = Pinecone.from_existing_index(index_name, embed_model)
### your code ###

Let's try a query that is specific to the LangChain documentation and see which chunks are relevant. Use the vector storage defined above to find the top-3 chunks related to the given query.

In [42]:
query = 'what is a LangChain Agent?'
### your code ###

found_docs = vectorstore.max_marginal_relevance_search(query, k=3, fetch_k=10)
for i, doc in enumerate(found_docs):
    print(f"{i + 1}.", doc.page_content, "\n")
### your code ###

1. langchain 0.0.353¶
langchain.agents¶
Agent is a class that uses an LLM to choose a sequence of actions to take.
In Chains, a sequence of actions is hardcoded. In Agents,
a language model is used as a reasoning engine to determine which actions
to take and in which order.
Agents select and use Tools and Toolkits for actions.
Class hierarchy:
BaseSingleActionAgent --> LLMSingleActionAgent
                          OpenAIFunctionsAgent
                          XMLAgent
                          Agent --> <name>Agent  # Examples: ZeroShotAgent, ChatAgent
BaseMultiActionAgent  --> OpenAIMultiFunctionsAgent
Main helpers:
AgentType, AgentExecutor, AgentOutputParser, AgentExecutorIterator,
AgentAction, AgentFinish
Classes¶
agents.agent.Agent
Agent that calls the language model and deciding the action.
agents.agent.AgentExecutor
Agent that is using tools.
agents.agent.AgentOutputParser
Base class for parsing agent output into agent action/finish.
agents.agent.BaseMultiActionAgent
Base Multi
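
The cell above calls `max_marginal_relevance_search` with `k=3` and `fetch_k=10`. As a rough illustration of the idea behind maximal marginal relevance (this is a sketch, not LangChain's actual implementation), the following plain-Python example re-ranks toy vectors. The function name `mmr_select`, the hand-made 3-d vectors, and the trade-off weight `lambda_mult=0.5` are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, fetch_k=10, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    # Step 1: keep the fetch_k candidates that are most similar to the query.
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = ranked[:fetch_k]
    selected = []

    # Step 2: greedily pick the candidate with the best trade-off between
    # relevance to the query and redundancy with what was already selected.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings": documents 0 and 1 are near-duplicates of each other.
query_vec = [1.0, 1.0, 0.0]
doc_vecs = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.05],
    [0.0, 1.0, 0.2],
    [0.0, 0.0, 1.0],
]
picked = mmr_select(query_vec, doc_vecs, k=2, fetch_k=4)
print(picked)
```

A plain similarity search would return the two near-duplicates (indices 0 and 1) here, whereas the MMR sketch swaps the second one for a less redundant document.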

Now use the `vectorstore` and `llm` to initialize the `RetrievalQA` object, which showcases question answering over an index. `RetrievalQA` is a document chain. Such chains are useful for summarizing documents, answering questions about documents, extracting information from documents, and more. All document chains support four different chain types:


1. `stuff`: it takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM.
2. `refine`: it constructs a response by looping over the input documents and iteratively updating its answer. It is well-suited for tasks that require analyzing more documents than can fit in the model’s context.
3. `map_reduce`: it first applies an LLM chain to each document individually (the Map step), treating the chain output as a new document. It then passes all the new documents to a separate combined documents chain to get a single output (the Reduce step).
4. `map_re_rank`: it runs an initial prompt on each document that not only tries to complete a task but also gives a score for how certain it is in its answer. The highest-scoring response is returned.
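
To make the `stuff` strategy concrete, here is a minimal sketch of how retrieved chunks could be concatenated into a single prompt. The helper name `build_stuff_prompt` and the prompt wording are assumptions for illustration only; LangChain's own template may be worded differently.

```python
def build_stuff_prompt(question, documents, separator="\n\n"):
    """Insert all retrieved documents into one prompt (the 'stuff' idea)."""
    # All chunks are joined into a single context block, so the combined
    # text must still fit into the model's context window.
    context = separator.join(documents)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Hypothetical retrieved chunks, shortened for the example.
chunks = [
    "Agent is a class that uses an LLM to choose a sequence of actions to take.",
    "Agents select and use Tools and Toolkits for actions.",
]
prompt = build_stuff_prompt("what is a LangChain Agent?", chunks)
print(prompt)
```

Because every chunk ends up in one prompt, this strategy needs only a single LLM call per question, but it stops working once the retrieved text exceeds the context length.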

For this assignment, we focus only on the first type. Make sure to set `verbose` to `True`, so we can see the different stages of processing that happen while answering a question (you might need to set this parameter more than once). As mentioned before, we want our retriever to pass the top-5 chunks most similar to the query to the LLM to generate an answer.

In [43]:
from langchain.chains import RetrievalQA
### your code ###
from langchain_core.vectorstores import VectorStoreRetriever

rag_pipeline = VectorStoreRetriever(vectorstore=vectorstore)

retrievalQA = RetrievalQA.from_llm(llm=llm, retriever=rag_pipeline)

### your code ###
query='what is a LangChain Agent?'

First, we try to answer the question using only Llama 2. As you can see, the answer is not convincing because the model does not have access to the LangChain documentation.

In [44]:
llm(query)

'\n\nA LangChain Agent is an AI-powered chatbot that uses natural language processing (NLP) to understand and respond to user queries. It is designed to provide personalized support and answer questions in real-time, 24/7. The agent is trained on a large dataset of customer interactions, which enables it to understand the nuances of human language and provide accurate responses.\n\nLangChain Agents are powered by advanced machine learning algorithms that allow them to learn from each interaction and improve their performance over time. They can be integrated with various messaging platforms such as Facebook Messenger, WhatsApp, Slack, and more. This allows businesses to provide seamless support to their customers across multiple channels.\n\nSome of the key features of LangChain Agents include:\n\n1. Natural Language Processing (NLP): LangChain Agents use NLP to understand and interpret user queries, allowing them to provide accurate responses.\n2. Machine Learning: LangChain Agents ar

Now use the Pipeline from above and see how the answer changes.

In [45]:
### your code ###

retrievalQA(query)

### your code ###




{'query': 'what is a LangChain Agent?',
 'result': ' A LangChain Agent is a class that uses an LLM to choose a sequence of actions to take. It selects and uses Tools and Toolkits for actions. It is a base class for other classes like ZeroShotAgent, ChatAgent, etc.'}

#### ${\color{red}{Comments\ 1.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.5: Conversational Retrieval Chain




We can also extend our retrieval chain so that it remembers previous questions and answers the current question in light of the previous context.
The important part of a conversational model is conversation memory, which enables the otherwise stateless language model to remember previous interactions, e.g., similar to ChatGPT. In this subtask, we will use LangChain to create a conversational memory.


To implement the memory we use `ConversationalRetrievalChain`.
This chain takes in the chat history (a list of messages) and a new question and then returns an answer to that question. The algorithm for this chain consists of three parts:

1. Use the chat history and the new question to create a new question that contains the information from the previous context.

2. This new question is passed to the retriever and relevant documents are returned.

3. The retrieved documents are passed to an LLM to generate a final response.
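
The three steps above can be sketched in plain Python. This is only an illustration of the control flow, not LangChain's implementation: `fake_llm`, `fake_retrieve`, the prompt wording, and the returned dictionary keys are all hypothetical stand-ins introduced for this example.

```python
def condense_question(chat_history, question, llm):
    """Step 1: rewrite a follow-up question as a standalone question."""
    if not chat_history:
        return question  # nothing to condense on the first turn
    history_text = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
    prompt = (
        f"Chat history:\n{history_text}\n"
        f"Follow-up question: {question}\n"
        "Standalone question:"
    )
    return llm(prompt)


def conversational_answer(question, chat_history, llm, retrieve):
    standalone = condense_question(chat_history, question, llm)
    docs = retrieve(standalone)            # Step 2: retrieve with the new question
    context = "\n\n".join(docs)
    answer = llm(                          # Step 3: answer from the retrieved context
        f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer:"
    )
    return {"standalone_question": standalone, "source_documents": docs, "answer": answer}


# Hypothetical stand-ins so the sketch runs without a model or vector store.
def fake_llm(prompt):
    if prompt.endswith("Standalone question:"):
        return "What are the tools and toolkits of a LangChain Agent?"
    return f"(answer generated from a prompt of {len(prompt)} characters)"


def fake_retrieve(query):
    return ["Agents select and use Tools and Toolkits for actions.",
            "A toolkit is a collection of tools."]


history = []
first = conversational_answer("what is a LangChain Agent?", history, fake_llm, fake_retrieve)
history.append(("what is a LangChain Agent?", first["answer"]))
second = conversational_answer("What are tools and toolkits?", history, fake_llm, fake_retrieve)
print(second["standalone_question"])
```

Note that the chain itself stays stateless in this sketch: the caller passes the accumulated history on every turn, just as `chat_history` is passed explicitly in the cells below.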

In [46]:
from langchain.chains import ConversationalRetrievalChain

chat_history = []

### your code ###
retriever = vectorstore.as_retriever()

qa_conversation = ConversationalRetrievalChain.from_llm(llm, retriever)

result = qa_conversation({"question":query, "chat_history":chat_history})

### your code ###


In [47]:
print(result["answer"])

 A LangChain Agent is a class that uses an LLM to choose a sequence of actions to take. It selects and uses Tools and Toolkits for actions. It is a base class for other classes like ZeroShotAgent and ChatAgent.


Change the chat history to contain the previous question and answer pair and ask a follow-up question.  

In [48]:
follow_up="What are tools and toolkits?"

### your code ###
chat_history = [(query, result["answer"])]

result = qa_conversation({"question":follow_up, "chat_history":chat_history})
### your code ###

This is the previous context that was fed in alongside the new question.

In [49]:
print(chat_history)

[('what is a LangChain Agent?', ' A LangChain Agent is a class that uses an LLM to choose a sequence of actions to take. It selects and uses Tools and Toolkits for actions. It is a base class for other classes like ZeroShotAgent and ChatAgent.')]


The current question is answered by knowing that the tools and toolkits are referring to a LangChain Agent, which was part of the previous question.

In [50]:
result["answer"]

'  In the context of LangChain Agents, tools and toolkits refer to pre-built functionality that can be used by the agents to perform specific tasks or answer certain types of questions. Tools are individual pieces of functionality that can be combined to create more complex toolkits. Toolkits are collections of tools that can be used together to accomplish a particular goal or set of goals.'

#### ${\color{red}{Comments\ 1.5}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 2: Advanced RAG Techniques and Evaluation (4 + 5 = 9 points)**

Now that you have successfully implemented your first RAG system, we dive into more advanced techniques and learn how to evaluate your methods using metrics you learned during the lecture. We focus on evaluation with an already annotated dataset. To this end, we load a small subset of [NarrativeQA](https://huggingface.co/datasets/narrativeqa), which is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents. We only load 30 samples from the data because, as you will see in the upcoming sections, answer generation takes quite some time. In a real setting, a much larger set is advisable to obtain statistically significant results.

In [51]:
from datasets import load_dataset
dataset = load_dataset("satyaalmasian/narrativeqa_subset",split="train[:30]")
len(dataset)

30

Since we already used our free index in Pinecone for the previous task, we use Chroma, an open-source vector database, instead. As opposed to Pinecone, Chroma creates a collection on your machine.

In [52]:
from langchain.docstore.document import Document
documents=[ doc["text"] for doc in dataset["document"]]
questions=[quest for quest in dataset["question"]]
answers=[ans for ans in dataset["answers"]]
documents=list(set(documents))

In [53]:
docs= [Document(page_content=doc, metadata={"source": "local"}) for doc in documents]

The number of documents is smaller than the number of questions and answers, because each document is used as a reference for multiple questions:

In [54]:
print(len(docs))
print(len(questions))

2
30


## Subtask 2.1: Build Contextual Compression in LangChain

Let's split our documents using the TextSplitter from Task 1 and embed them inside the Chroma database with the embedding model of the previous task.

In [55]:
### your code ###
all_splits = text_splitter.split_documents(docs)
### your code ###

In [56]:
from langchain.vectorstores import Chroma
### your code ###
vectordb = Chroma.from_documents(all_splits, embed_model, persist_directory="db")
retriever = vectordb.as_retriever()
### your code ###

In [57]:
print("Third question in the set:", questions[2]['text'])
r_docs = retriever.get_relevant_documents(questions[2]['text'])
r_docs

Third question in the set: Why do more students tune into Mark's show?


[Document(page_content='Mark - They can\'t kick you out for that.\n\nNora - I\'ve been cutting lessons.\n\nMark - Well that just deserves a suspension right.\n\nNora - Well then I said "Fuck You" to Creswood. You should have seen her face, she was \nso happy she said "Thank You"\n\nMark - This school sucks. Jesus Christ!\n\nNora - This is why I don\'t even care anymore. Look just leave it alone. There\'s nothing \nyou can do about it. <Nora runs off>\n\nJan - Hunter! Hunter wait a minute. I just wanted to say good bye and good luck.\n\nMark - Why?\n\nJan - I was fired, I made a mistake. I thought I could change things, I forgot you don\'t \nrock the boat.\n\nMark - Yeah especially when you\'re in it.\n\nJan - Hey, chin up.\n\n<Staff room>\n\nBrian - Loretta what the hell is going on here.\n\nCreswood - It\'s the trouble makers, you can\'t run a top school with trouble makers in the \nmix.\n\nBrian - Okay, so what exactly is a trouble maker.\n\nCreswood - Someone who has no interest in 

First, make a simple RAG pipeline that works on top of the Chroma retriever. This pipeline should be similar to the one from the previous task. However, since we want to use it for a large number of questions, remove the `verbose` parameters.

In [58]:
from langchain.chains import RetrievalQA
### your code ###
rag_simple = RetrievalQA.from_llm(llm=llm, retriever=retriever, verbose=False)
### your code ###

We look at an example question and compare the answer by RAG to the gold answer from the dataset. Note that the answers can contain multiple lines.

In [59]:
rag_simple(questions[2]['text']) #ignore the warning

{'query': "Why do more students tune into Mark's show?",
 'result': " Because he speaks their language and doesn't sugarcoat the truth.\nUnhelpful Answer: Because he's cute."}

In [60]:
answers[2]

[{'text': 'Mark talks about what goes on at school and in the community.',
  'tokens': ['Mark',
   'talks',
   'about',
   'what',
   'goes',
   'on',
   'at',
   'school',
   'and',
   'in',
   'the',
   'community',
   '.']},
 {'text': 'Because he has a thing to say about what is happening at his school and the community.',
  'tokens': ['Because',
   'he',
   'has',
   'a',
   'thing',
   'to',
   'say',
   'about',
   'what',
   'is',
   'happening',
   'at',
   'his',
   'school',
   'and',
   'the',
   'community',
   '.']}]

Apply the `rag_simple` pipeline to all the questions in your corpus and accumulate the answers. **It should take around 10 minutes on a T4 GPU on Colab**.

In [61]:
simple_answers=[]
### your code ###
for question in questions:
  question_text = question['text']
  answer = rag_simple(question_text)["result"]
  simple_answers.append(answer)
### your code ###



In [62]:
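# Print each question next to its gold answers and the simple RAG answer.
for idx, (question, gold_answers, simple_answer) in enumerate(zip(questions, answers, simple_answers), start=1):
  print(f"{idx}. Question: {question['text']}")
  print(f'Answer: {[answer["text"] for answer in gold_answers]}')
  print(f"Simple Answer: {simple_answer}")
  print()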

1. Question: Who is Mark Hunter?
Answer: ['He is a high school student in Phoenix.', 'A loner and outsider student with a radio station.']
Simple Answer:  Mark Hunter is the protagonist of the film, a rebellious high school student who fights against the system and struggles with his own identity and emotions.

2. Question: Where does this radio station take place?
Answer: ["It takes place in Mark's parents basement. ", 'Phoenix, Arizona']
Simple Answer:  Based on the context, it appears that the radio station takes place in a high school, possibly in the United States.

3. Question: Why do more students tune into Mark's show?
Answer: ['Mark talks about what goes on at school and in the community.', 'Because he has a thing to say about what is happening at his school and the community.']
Simple Answer:  Because he speaks their language and doesn't sugarcoat the truth.
Unhelpful Answer: Because he's cute.

4. Question: Who commits suicide?
Answer: ['Malcolm.', 'Malcolm.']
Simple Answer:

Libraries such as LangChain and [LlamaIndex](https://www.llamaindex.ai/) provide a variety of retrieval strategies for building a RAG system. In this subtask, you will use one of these variations called **contextual compression**. This method aims to extract only the relevant information from documents, reducing the need for expensive language model calls and improving response quality. Contextual compression consists of two parts:


1.  **Base retriever:** retrieves the initial set of documents based on the query. This is similar to the retriever from the previous task.
2.   **Document compressor:** processes these documents to extract the relevant content. We use `LLMChainExtractor`, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
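
The two parts can be illustrated with a small sketch that needs no model. This is a deliberately simplified stand-in: instead of an LLM-based extractor, the hypothetical `compress` function keeps only sentences that share a content word with the query, and `base_retrieve` ranks by word overlap rather than by embeddings. All names, the stop-word list, and the toy documents are assumptions made for this example.

```python
import re

STOPWORDS = {"why", "do", "the", "a", "into", "to", "in", "on", "was", "that"}


def content_words(text):
    """Lower-cased words of the text without a few common stop words."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS


def base_retrieve(query, documents, k=2):
    """Part 1: return the k documents with the largest word overlap."""
    q = content_words(query)
    ranked = sorted(documents, key=lambda d: len(q & content_words(d)), reverse=True)
    return ranked[:k]


def compress(query, document):
    """Part 2: keep only the sentences that mention a query word."""
    q = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return " ".join(s for s in sentences if q & content_words(s))


def compression_retrieve(query, documents, k=2):
    compressed = (compress(query, d) for d in base_retrieve(query, documents, k))
    return [c for c in compressed if c]  # drop documents with nothing relevant


toy_docs = [
    "Mark runs a pirate radio show. The weather was cold that week. Students tune in every night.",
    "The school board met on Tuesday. Nothing was decided.",
    "Nora writes letters to the radio show host. She likes poetry.",
]
result = compression_retrieve("Why do students tune into the radio show?", toy_docs)
for doc in result:
    print(doc)
```

The retrieved documents come back shorter than the originals, which is the effect contextual compression aims for; the quality of the result depends entirely on how good the compressor is at judging relevance.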


In [63]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor,LLMChainFilter
from langchain.llms import OpenAI

### your code ###
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
### your code ###

Let's take a look at an example of how the compression retriever works.

In [64]:
print("Third question in the set:", questions[2]['text'])
compressed_docs = compression_retriever.get_relevant_documents(questions[2]['text'])
compressed_docs

Third question in the set: Why do more students tune into Mark's show?




[Document(page_content='* "This school sucks. Jesus Christ!" (context relevant to answer the question)\n* "More students tune into Mark\'s show" (question)', metadata={'source': 'local'}),
 Document(page_content='* "stay on, stay hard"\n* "Talk Hard"\n* "Arrrrrrgh"', metadata={'source': 'local'}),
 Document(page_content='* "more students"\n* "tune into Mark\'s show"\n* "system"\n* "unhappy here"\n* "decent grades"\n* "leave me alone"', metadata={'source': 'local'}),
 Document(page_content='* "Happy Harry Hardon"\n* "talk dirty"\n* "influence people"\n* "Lenny Bruce"', metadata={'source': 'local'})]

Look at the output and try out several different questions by yourself. Does the compressed output make sense?

Compare this to the previous **simple** approach. Which one, in your opinion, is better?

Finally, we use the new retriever with the Llama 2 model from the previous task to create the context compressor RAG pipeline. The code below should be similar to what you did in the previous task. Once again, make sure to turn off the `verbose` argument.

In [66]:
### your code ###
from langchain.chains import RetrievalQA

rag_compressor = RetrievalQA.from_llm(llm=llm, retriever=compression_retriever, verbose=False)
### your code ###


In [67]:
rag_compressor(questions[2]['text'])



{'query': "Why do more students tune into Mark's show?",
 'result': ' Based on the context, it seems that the students are tuning into Mark\'s show because they find his content entertaining and engaging. The phrase "stay on, stay hard" suggests that Mark\'s show is popular and well-liked by the students. Additionally, the context mentions "more students" and "decent grades," which may indicate that Mark\'s show is a source of enjoyment and learning for the students.'}

Now we can use the pipeline to generate answers for all the questions in our dataset. **It should take around 20 minutes on a T4 GPU on Colab.**

In [68]:
compressor_answers=[]
### your code ###
for question in questions:
  question_text = question['text']
  answer = rag_compressor(question_text)["result"]
  compressor_answers.append(answer)
### your code ###



In [69]:
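# Print each question next to its gold answers and the compressor RAG answer.
for idx, (question, gold_answers, compressor_answer) in enumerate(zip(questions, answers, compressor_answers), start=1):
  print(f"{idx}. Question: {question['text']}")
  print(f'Answer: {[answer["text"] for answer in gold_answers]}')
  print(f"Compressor Answer: {compressor_answer}")
  print()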

1. Question: Who is Mark Hunter?
Answer: ['He is a high school student in Phoenix.', 'A loner and outsider student with a radio station.']
Compressor Answer:  Mark Hunter is a young radical brain who had an inflatable date with Nora, also known as the "eat me beat me" lady.

2. Question: Where does this radio station take place?
Answer: ["It takes place in Mark's parents basement. ", 'Phoenix, Arizona']
Compressor Answer:  Based on the context, it seems that this radio station takes place at a school, possibly in the school alcove or on school property. The mention of the P.A. system and the Happy Harry Hardon show suggest that the radio station is located in the school and is broadcasting to the students and faculty.

3. Question: Why do more students tune into Mark's show?
Answer: ['Mark talks about what goes on at school and in the community.', 'Because he has a thing to say about what is happening at his school and the community.']
Compressor Answer:  Based on the context, it seems

#### ${\color{red}{Comments\ 2.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 2.2: Evaluate

Since we have access to ground truth answers, we can use various evaluation metrics from the literature. In this task, we explore three metrics:


1.   **BLEU:** BLEU stands for Bilingual Evaluation Understudy and is a precision-based metric developed for evaluating machine translation. BLEU scores a candidate by computing the number of n-grams in the candidate that also appear in a reference. The value of n can vary; in this task we use n-grams up to n=4.
2.   **ROUGE:** ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and is an F-measure metric designed for evaluating translation and summarization. There are a number of variants of ROUGE.
3.   **BERTScore:** BERTScore first obtains the BERT representation of each word in the candidate and the reference by feeding the candidate and the reference through a BERT model separately. An alignment is then computed between candidate and reference words by computing pairwise cosine similarity. This alignment is then aggregated into precision and recall scores, which are in turn combined into a (modified) F1 score that can be weighted using inverse-document-frequency values.
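
The alignment step of BERTScore can be sketched without a BERT model by using hand-made toy vectors in place of contextual embeddings. This is an illustrative simplification (no IDF weighting, no baseline rescaling), and the function name `greedy_match_scores` and the 2-d vectors are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Greedy matching: each token is aligned to its most similar counterpart."""
    sim = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]
    # Precision: every candidate token is matched to its best reference token.
    precision = sum(max(row) for row in sim) / len(candidate_vecs)
    # Recall: every reference token is matched to its best candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(candidate_vecs)))
        for j in range(len(reference_vecs))
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy "token embeddings": the candidate has one extra token that only
# partially matches either reference token.
candidate_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case the extra candidate token lowers precision while recall stays perfect, which mirrors how overly long generated answers tend to be penalized on the precision side.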

Luckily, Hugging Face has an implementation for all these metrics. Use the `evaluate` library to load the metrics.

Use the loaded metrics to compare the RAG pipelines from the previous subtask.

In [70]:
import evaluate
### your code ###
bleu = evaluate.load("bleu")
rouge =evaluate.load("rouge")
bertscore =evaluate.load("bertscore")
### your code ###


As seen in the previous subtask, the answers can contain multiple lines. To be able to compare the output of our systems to the gold answers, merge the multiple answers into a single string.

In [71]:
answers_merged=[]
### your code ###
answers_merged = [" ".join([text_token_dict["text"] for text_token_dict in answer_text_token_list]) for answer_text_token_list in answers]
### your code ###
print(len(answers_merged))

30


Compute the BLEU score for the simple RAG and compressor RAG.

In [72]:
### your code ###
bleu_simple = bleu.compute(predictions=simple_answers, references=answers_merged)
bleu_compressor =bleu.compute(predictions=compressor_answers, references=answers_merged)
### your code ###
print("Simple system:")
print(bleu_simple)
print("Compressor:")
print(bleu_compressor)

Simple system:
{'bleu': 0.0, 'precisions': [0.11063829787234042, 0.011851851851851851, 0.004651162790697674, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 2.3817567567567566, 'translation_length': 705, 'reference_length': 296}
Compressor:
{'bleu': 0.0, 'precisions': [0.06809184481393507, 0.004866180048661801, 0.0, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 4.266891891891892, 'translation_length': 1263, 'reference_length': 296}


What do the elements below in the output of the BLEU implementation in Hugging Face mean? (Do not copy and paste the documentation, but write the implications behind each element!)



1.   **precisions:** Low values indicate the generated answers have few words and phrases in common with the reference answers, especially in longer sequences. The precision values in the BLEU score are provided for 1-gram, 2-gram, 3-gram, and 4-gram matches. Each value represents a different level of n-gram precision
2.   **brevity_penalty:** A value of 1.0 for both systems means there's no penalty for brevity; the generated answers are not shorter than the reference answers.
3.   **translation_length:** Indicates the total number of words in the generated answers. Higher values, especially in the compressor system, suggest verbosity.
4.   **reference_length:** The total word count in the reference answers. It helps assess if the generated text is too long or short.
5.   **length_ratio:** Shows how much longer the generated text is compared to the reference. A high ratio in the compressor system indicates it's producing much longer answers than necessary.




**Answer:**


1.   **precisions:** precision of n-grams, which is calculated as the number of n-grams that appear in both the machine-generated translation and the reference translations divided by the total number of n-grams in the machine-generated translation.
2.   **brevity_penalty:** is a penalty term that adjusts the score for translations that are shorter than the reference translations. It is calculated as $\min(1, e^{1 - r/c})$, where $r$ is the reference length and $c$ is the translation length. Translations at least as long as the reference receive no penalty, while shorter translations are penalized with an exponential decay.
3.   **translation_length:** is the total number of words in the machine-generated translation.
4.   **reference_length:** is the total number of words in the reference translations.
5.   **length_ratio:** is the ratio of translation_length to reference_length (item 3 divided by item 4).
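
To see how these elements combine, the following sketch recomputes the final score of the simple system from the values printed above. It assumes the standard unsmoothed BLEU definition (brevity penalty times the geometric mean of the n-gram precisions); the helper names `brevity_penalty` and `bleu_from_parts` are introduced here for illustration.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """1.0 for outputs longer than the reference, exponential decay otherwise."""
    if translation_length == 0:
        return 0.0
    if translation_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / translation_length)


def bleu_from_parts(precisions, translation_length, reference_length):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    # Without smoothing, a single zero precision makes the whole score zero.
    if min(precisions) <= 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(translation_length, reference_length) * math.exp(log_mean)


# Values copied from the output of the simple system above.
simple_precisions = [0.11063829787234042, 0.011851851851851851, 0.004651162790697674, 0.0]
simple_bleu = bleu_from_parts(simple_precisions, 705, 296)
length_ratio = 705 / 296

print(simple_bleu)      # 0.0, because the 4-gram precision is zero
print(length_ratio)
```

This explains why both systems report `'bleu': 0.0` despite non-zero unigram precision: neither produces a single matching 4-gram, and the geometric mean collapses to zero.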

In [73]:
### your code ###
rouge_simple = rouge.compute(predictions=simple_answers, references=answers_merged)
rouge_compressor = rouge.compute(predictions=compressor_answers, references=answers_merged)
### your code ###
print("Simple system:")
print(rouge_simple)
print("Compressor:")
print(rouge_compressor)

Simple system:
{'rouge1': 0.1250387065700301, 'rouge2': 0.030253830169917638, 'rougeL': 0.1140159619422312, 'rougeLsum': 0.11595253915060166}
Compressor:
{'rouge1': 0.07987220121002629, 'rouge2': 0.01046783625730994, 'rougeL': 0.06616551758109501, 'rougeLsum': 0.06641315762057762}





What is the difference in variants of ROUGE (ROUGE-N, ROUGE-L, ROUGE-SUM)?


1. **ROUGE-N:**
   - Measures the overlap of n-grams (contiguous sequences of n words) between the generated text and the reference.
   - Commonly used n-grams are unigrams (ROUGE-1), bigrams (ROUGE-2), and trigrams.
   - Evaluates grammatical correctness and fluency, as it looks at how well the generated text replicates the sequence of words found in the reference.

2. **ROUGE-L:**
   - Focuses on the longest common subsequence (LCS) between the generated and reference texts.
   - Unlike ROUGE-N, it doesn’t require the sequence of words to be contiguous.
   - Useful for assessing semantic similarity and content coverage, as it reflects the longest string of words that appear in the same order in both texts, indicating a broader, more holistic match beyond just word pairs or triples.

3. **ROUGE-S and ROUGE-Lsum:**
   - ROUGE-S measures the overlap of skip-bigrams, which are pairs of words in sentence order that may have other words in between.
   - This variant is less strict than contiguous n-grams, as it doesn't require the pairs of words to be next to each other.
   - It's used to evaluate coherence and local cohesion, as skip-bigrams can capture more flexible semantic relationships between words in the text.
   - ROUGE-S is not the same as the `rougeLsum` value in the Hugging Face output above: `rougeLsum` is a summary-level ROUGE-L that splits the texts on newlines and aggregates the LCS over the resulting sentences.

In summary, ROUGE-N is about direct n-gram overlap (more literal matching), ROUGE-L looks at the longest sequence of words in order (reflecting structure and content), ROUGE-Lsum applies ROUGE-L sentence by sentence, and ROUGE-S considers pairs of words with potential gaps, offering a more lenient and flexible measure of text similarity.

**Answer:**

ROUGE measures the similarity between the machine-generated summary and the reference summaries using overlapping n-grams, i.e., word sequences that appear in both. The most common n-grams used are unigrams, bigrams, and trigrams. In its basic form, ROUGE computes recall: the fraction of n-grams in the reference summaries that also appear in the machine-generated summary.

**ROUGE-N:** ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the n-gram overlap. For example, ROUGE-1 (unigram) measures the overlap of single words, ROUGE-2 (bigram) measures the overlap of two-word sequences, and so on. ROUGE-N is often used to evaluate the grammatical correctness and fluency of generated text.

**ROUGE-L:** ROUGE-L measures the longest common subsequence (LCS) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the length of the LCS. ROUGE-L is often used to evaluate the semantic similarity and content coverage of generated text, as the common subsequence must preserve word order but does not need to be contiguous.

**ROUGE-S:** ROUGE-S measures the skip-bigram overlap between the candidate text and the reference text, where a skip-bigram is any pair of words in sentence order, with gaps allowed between them. It computes the precision, recall, and F1-score based on the skip-bigram overlap. ROUGE-S is often used to evaluate the coherence and local cohesion of generated text, as it also captures relationships between words that are not adjacent.
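
The difference between ROUGE-N and ROUGE-L can be made concrete with a small sketch. It is a simplified illustration that tokenizes on whitespace and applies no stemming, so its numbers will not exactly match the Hugging Face implementation; the function names and the example sentences are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(matches, candidate_total, reference_total):
    if matches == 0:
        return 0.0
    precision = matches / candidate_total
    recall = matches / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """F1 over clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    return f_measure(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate, reference):
    """F1 over the longest common subsequence (order kept, gaps allowed)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f_measure(lcs_length(cand, ref), len(cand), len(ref))


reference = "mark talks about school"
candidate = "mark talks openly about his school"
r1 = rouge_n(candidate, reference, n=1)
r2 = rouge_n(candidate, reference, n=2)
rl = rouge_l(candidate, reference)
print(r1, r2, rl)
```

The inserted words break most bigrams, so ROUGE-2 drops sharply, while ROUGE-L stays as high as ROUGE-1 because all reference words still appear in the same order.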



In [74]:
import numpy as np
bertscore_simple_averaged={}
bertscore_compressor_averaged={}
### your code ###
bertscore_simple = bertscore.compute(predictions=simple_answers, references=answers_merged, model_type="distilbert-base-uncased")
bertscore_compressor =bertscore.compute(predictions=compressor_answers, references=answers_merged, model_type="distilbert-base-uncased")
bertscore_simple_averaged = {key: np.mean(values) if key != 'hashcode' else values for key, values in bertscore_simple.items()}
bertscore_compressor_averaged = {key: np.mean(values) if key != 'hashcode' else values for key, values in bertscore_compressor.items()}
### your code ###
print("Simple system:")
print(bertscore_simple_averaged)
print("Compressor:")
print(bertscore_compressor_averaged)


Simple system:
{'precision': 0.6927079816659292, 'recall': 0.7119701186815898, 'f1': 0.7018383820851644, 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.35.2)'}
Compressor:
{'precision': 0.668522051970164, 'recall': 0.7005889217058817, 'f1': 0.6837748050689697, 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.35.2)'}


Which model works better?

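The simple model. Its n-gram precisions in the BLEU output, all ROUGE variants, and the BERTScore F1 (0.70 vs. 0.68) are higher than those of the compressor model, although the overall BLEU score is 0.0 for both systems.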

#### ${\color{red}{Comments\ 2.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$