<a href="https://colab.research.google.com/github/Davo00/INLTP-Assignments/blob/main/assignment4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Heidelberg University**

**Data Science  Group**
    
Prof. Dr. Michael Gertz  

Ashish Chouhan, Satya Almasian, John Ziegler, Jayson Salazar, Nicolas Reuter
    
January 16, 2024
    
Natural Language Processing with Transformers

Winter Semester 2023/2024  
***

# **Assignment 4: Question Answering**
**Due**: Monday, January 29, 2024, 2pm, via [Moodle](https://moodle.uni-heidelberg.de/course/view.php?id=19251)



### **Submission Guidelines**

- Solutions need to be uploaded as a **single** Jupyter notebook. You will find several pre-filled code segments in the notebook; your task is to fill in the missing cells.
- For the written solution, use LaTeX in markdown inside the same notebook. Do **not** hand in a separate file for it.
- Download the .zip file containing the dataset but do **not** upload it with your solution.
- It is sufficient if one person per group uploads the solution to Moodle, but make sure that the full names of all team members are given in the notebook.

***

## **Task 1: Retrieval Augmented Generation (RAG)** ( 4.5 + 3 + 4 + 3 + 1.5 = 16 points)

In this task, we use the open-source `Llama-2-13b-chat` model to build a RAG system. You must first apply for access to the Llama 2 models via [this](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) form (access is typically granted within a few hours). For retrieval augmented generation, you also need to request access to the model on Hugging Face by going to the [model](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) card. ***Note that the email you provide for your Hugging Face account must match the email you used to request Llama 2.***

The final piece that you need is a Hugging Face authentication token. To generate one, go to `Settings` in your Hugging Face profile and create a new token under the `Access Tokens` menu.

To store the documents, you will need a free Pinecone [API key](https://app.pinecone.io/).
Make sure you have these pieces ready before starting to work on this task.

----
When ready, let's start by downloading the necessary packages.

It is advised to run this notebook on a GPU (if you are on Colab, make sure that a GPU environment is activated).


Place all the access tokens in the `.env` file and upload it to the working directory (if you are running this notebook locally, you can change the path to fit your working directory). Please use the following format:


```
HF_AUTH="Hugging Face Authentication Key"
PINECONE_API_KEY="Pinecone API Key"
PINECONE_ENVIRONMENT="Pinecone Environment"
```

Run the cell below to load the access tokens into the environment variables.

In [1]:
%pip -q install python-dotenv

In [None]:

%pip install litellm
%pip install -qU trulens_eval pydantic fastapi kaleido python-multipart uvicorn cohere openai tiktoken "llama-index"
%pip install -q transformers
%pip install -q sentence-transformers
%pip install -q pinecone-client
%pip install -q datasets
%pip install -q accelerate
%pip install -q einops
%pip install -q langchain
%pip install -q xformers
%pip install -q bitsandbytes
%pip install -q matplotlib seaborn tqdm
%pip install -q chromadb
%pip install -q evaluate
%pip install -q rouge_score
%pip install -q bert_score

In [None]:
import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables from the nearest .env file
_ = load_dotenv(find_dotenv())
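
Before moving on, it can help to verify that all three tokens were actually loaded. The following is a minimal sketch, assuming the variable names from the `.env` format above; the helper `missing_env_vars` is a hypothetical name introduced only for illustration.

```python
import os

# Variable names follow the .env format shown above.
REQUIRED_VARS = ("HF_AUTH", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

If the list is not empty, check that the `.env` file is in the working directory and that the keys are spelled as shown above.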



## Subtask 1.1: Data Preparation



We need a collection of documents to perform our retrieval on. To make it closer to your final project, you will be downloading and using a subset of the LangChain documentation. We get some of the `.html` files located on the site. The code below will download all HTML files from the links on the webpage into a `docs` directory. `-l1` limits the download to only the first level of depth.


In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
#!wget -r -l1 -A.html -P docs https://api.python.langchain.com/en/stable/langchain_api_reference.html

The docs are going to be used as input text for answering questions that a normal language model might not be aware of (the LangChain docs are not necessarily part of the training data of Llama2). We can use LangChain itself to process these docs. Use the [ReadTheDocsLoader](https://python.langchain.com/docs/integrations/document_loaders/readthedocs_documentation) to load the docs from the `docs` folder.

At the time of creating this notebook, `423` documents were downloaded. However, since the documentation is updated regularly, this number might be different for you.

In [None]:
from langchain.document_loaders import ReadTheDocsLoader
#### your code ####
loader = ReadTheDocsLoader("docs")
docs = loader.load()
#### your code ####
len(docs)

418

Let's take a look at one of the documents. You see that LangChain has created a `Document` object. Look at the example below and fill in the cells to print out the text content and URL of the page (the URL of the page should start with `https://`).

In [None]:
docs[10]

Document(page_content='langchain_anthropic 0.0.1.post2¶\nlangchain_anthropic.chat_models¶\nClasses¶\nchat_models.ChatAnthropicMessages\nBeta ChatAnthropicMessages chat model.\nFunctions¶', metadata={'source': 'docs/api.python.langchain.com/en/stable/anthropic_api_reference.html'})

In [None]:
#### your code ####
## For all the documents, Colab complains that the IO rate is exceeded, hence only one doc will be analyzed!
page_content= [doc.page_content for doc in docs]
page_url= ["https://"+doc.metadata["source"] for doc in docs]


In [None]:
doc = docs[10]
page_content= doc.page_content
page_url= "https://"+doc.metadata["source"]
#### your code ####
print(page_content)
print(page_url)

langchain_anthropic 0.0.1.post2¶
langchain_anthropic.chat_models¶
Classes¶
chat_models.ChatAnthropicMessages
Beta ChatAnthropicMessages chat model.
Functions¶
https://docs/api.python.langchain.com/en/stable/anthropic_api_reference.html


As you can imagine, the documents can be long, and if several of them are required as context to answer a question, we need to take the document lengths into account.
This is because language models do not have an unlimited context window. In our case, we plan to use Llama2 for this project, where the maximum token limit is 4096. This limit covers not only the input but also the generated output. Moreover, you need to leave room for the query and instructions as well. Therefore, it is important to chunk the longer documents into smaller-sized fragments.

Based on your use case and how many contexts you plan to feed into the model the length of these fragments will differ.
In this case, we choose to assign 2000 tokens to context and choose to generate the answer from 5 context fragments, which leaves us with 400 tokens per context fragment as the maximum chunk size.
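
The budget described above can be written out as a short calculation. This is only a sketch of the arithmetic; the numbers come from the text, and the variable names are illustrative.

```python
# Numbers taken from the text above; the variable names are illustrative.
MAX_CONTEXT_TOKENS = 4096   # Llama 2 context window
RETRIEVED_BUDGET = 2000     # tokens reserved for the retrieved fragments
NUM_FRAGMENTS = 5           # fragments passed to the model per question

# Maximum size of a single fragment
chunk_size = RETRIEVED_BUDGET // NUM_FRAGMENTS

# Tokens left for the query, the instructions, and the generated answer
remaining = MAX_CONTEXT_TOKENS - RETRIEVED_BUDGET

print(chunk_size)  # 400
print(remaining)   # 2096
```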

To count the number of tokens in a chunk, we need to load the correct tokenizer for Llama2. Fill the code cell below to load the correct tokenizer and use it to complete the function that counts the number of tokens per given chunk.

**Hint:** you need to use your Hugging Face authentication token to load the tokenizer.

In [None]:
#If you get an error here during the first import from the `transformers` package, restart the kernel and try again.
#### your code ####
import os
from transformers import LlamaTokenizer

hf_auth = os.getenv("HF_AUTH")
# `token` replaces the deprecated `use_auth_token` argument in recent transformers releases
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", token=hf_auth)
#### your code ####



tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

In [None]:
def token_len(text):
    #### your code ####
    # Tokenize the text and count the resulting token IDs
    tokens = tokenizer(text)['input_ids']
    return len(tokens)
    #### your code ####

Count the number of tokens for all documents and use it to compute minimum, maximum, and average token count statistics across all documents. Depending on how the documentation has been updated by the time you run the cell below, the numbers might differ slightly.

In [None]:
#### your code ####
token_counts = [ token_len(doc.page_content) for doc in docs]
min_tokens= min(token_counts)
avg_tokens= sum(token_counts) / len(token_counts)
max_tokens= max(token_counts)
#### your code ####
print(f"""Min: {min_tokens}
Avg: {avg_tokens}
Max: {max_tokens}""")

Min: 49
Avg: 4084.94019138756
Max: 38897


Now we will use LangChain's built-in chunking functionality to split the text into smaller chunks. LangChain offers a variety of text splitters that you can check out [here](https://api.python.langchain.com/en/latest/langchain_api_reference.html#module-langchain.text_splitter).
Use the general-purpose splitter that splits text by recursively looking at characters. Use this class to split the text into 400 token-sized chunks, where the length of each chunk is computed based on the `token_len` function. Length is not the only criterion for splitting: the splitter tries the separators `'\n\n', '\n', ' ', ''` in this order and prefers to cut at the first one that yields chunks within the size limit.
Since splitting only based on maximum length might result in incoherent chunks, let consecutive chunks overlap by 50 tokens. This way, we preserve some of the previous context while chunking.
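
To illustrate what chunk overlap means, here is a deliberately simplified sketch: a fixed-size sliding window over a list of tokens. It is not how LangChain's recursive splitter works internally (it ignores separators entirely), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of at most `chunk_size` tokens,
    where each window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = "a b c d e f g h i j".split()
chunks = sliding_chunks(words, chunk_size=4, overlap=1)
print(chunks)
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

Each chunk starts with the last token of the previous one, so a sentence cut at a chunk boundary is still partly visible in the next chunk.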

In [None]:
#### your code ####
from langchain.text_splitter import RecursiveCharacterTextSplitter

separators = ['\n\n', '\n', ' ', '']
chunk_size = 400
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, length_function=token_len, separators=separators, chunk_overlap=50)

#### your code ####

In [None]:
chunks = text_splitter.split_text(docs[100].page_content)
len(chunks)

68

In [None]:
token_len(chunks[0])

330

The next step is to apply the splitting function to all the documents in our corpus and to save our chunks in a logical way. We also want to assign a unique ID to each chunk so we know which part of the documentation they come from. In the end, the corpus should be transformed into a list of dictionaries of the following format:


```
[
    {
        "id": "glossary-0",
        "text": "first chunk of the document glossary",
        "source": "https://langchain.readthedocs.io/en/latest/glossary.html"
    },
    {
        "id": "glossary-1",
        "text": "second chunk of glossary",
        "source": "https://langchain.readthedocs.io/en/latest/glossary.html"
    }
    ...
]
```

Construct the IDs by taking the name of the page before the suffix `.html` and appending a sequential number indicating which chunk it is.
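
As a small illustration of the ID scheme, the sketch below builds the records for a single page using only the standard library. The helper name `chunk_records` is hypothetical, and the URL is the example one from the format shown above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def chunk_records(url, chunks):
    """Build one record per chunk with IDs of the form '<page-name>-<index>'."""
    # 'https://.../glossary.html' -> 'glossary'
    page_name = PurePosixPath(urlparse(url).path).stem
    return [
        {"id": f"{page_name}-{i}", "text": chunk, "source": url}
        for i, chunk in enumerate(chunks)
    ]


records = chunk_records(
    "https://langchain.readthedocs.io/en/latest/glossary.html",
    ["first chunk of the document glossary", "second chunk of glossary"],
)
print([r["id"] for r in records])  # ['glossary-0', 'glossary-1']
```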


In [None]:
from tqdm.auto import tqdm

documents = []

for doc in tqdm(docs):
  #### your code ####
    url = "https://"+doc.metadata["source"]
    uid = url.split("/")[-1].replace(".html", "")
    chunks = text_splitter.split_text(doc.page_content)

    count = 0
    for chunk in chunks:
      documents.append({
          "id": uid+"-"+str(count),
          "text": chunk,
          "source": url
      })
      count += 1
  #### your code ####
len(documents) # once again this value might differ based on how the LangChain documentation is updated

  0%|          | 0/418 [00:00<?, ?it/s]

9925

For the next steps, we require a `DataFrame`.

In [None]:
import pandas as pd
data = pd.DataFrame(documents)
data.head()

Unnamed: 0,id,text,source
0,mistralai_api_reference-0,langchain_mistralai 0.0.3¶\nlangchain_mistrala...,https://docs/api.python.langchain.com/en/stabl...
1,google_genai_api_reference-0,langchain_google_genai 0.0.6.post1¶\nlangchain...,https://docs/api.python.langchain.com/en/stabl...
2,core_api_reference-0,langchain_core 0.1.16¶\nlangchain_core.agents¶...,https://docs/api.python.langchain.com/en/stabl...
3,core_api_reference-1,[Beta] Get a context value.[Beta] Get a conte...,https://docs/api.python.langchain.com/en/stabl...
4,core_api_reference-2,Base interface for cache.\nlangchain_core.call...,https://docs/api.python.langchain.com/en/stabl...


#### ${\color{red}{Comments\ 1.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.2: Document Embedding Pipeline


In this task, we initialize the embedding pipeline to transform the chunks into vector embeddings using Hugging Face and LangChain. These embeddings are used for similarity search between the query and the chunks to retrieve the most relevant chunks.
  We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding, which is a rather small model that you can easily run on Colab. Initialize the model using `HuggingFaceEmbeddings` to use Hugging Face via Langchain. The encoding batch size should be 32, and make sure that the model is placed on the correct device, otherwise, this can take a long time.
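
To make the idea of similarity search concrete, the following sketch ranks a few toy vectors by cosine similarity to a query vector. The two-dimensional vectors and the function names are made up for illustration; real embeddings from the model are much higher-dimensional, and the vector index performs this ranking for us.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, chunk_vecs, k):
    """Return the k (score, chunk_id) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    return sorted(scored, reverse=True)[:k]


# Toy 2-dimensional "embeddings" (hypothetical values)
toy_chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_k([1.0, 0.0], toy_chunks, k=2)
print([chunk_id for _, chunk_id in best])  # ['a', 'c']
```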

In [None]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import os
import pinecone
from tqdm.notebook import tqdm

In [None]:
embedding_model = 'sentence-transformers/all-MiniLM-L6-v2'
device = 'cuda:0' if cuda.is_available() else 'cpu'
docs = [
    "An example document",
    "A second document as an example"
]
### your code ###
import requests

embeddings_pipeline = HuggingFaceEmbeddings(model_name=embedding_model, model_kwargs={"device": device},encode_kwargs={"batch_size": 32})
### your code ###

Embed the example documents using the model you created and check the output.
The output should be a list of lists, containing the embeddings.

In [None]:
### your code ###
embeddings = [embeddings_pipeline.embed_query(doc) for doc in docs]
### your code ###
print("number of docs:",len(embeddings))
print("dimension of docs:",len(embeddings[0]))

number of docs: 2
dimension of docs: 384


Now we use the embedding pipeline created above to store the embeddings in a Pinecone vector index. First, let's set up the Pinecone environment: collect your API key and environment name from the environment variables, and initialize Pinecone with them.

In [None]:
### your code ###
pinecone_api_key=os.getenv("PINECONE_API_KEY")
pinecone_environment=os.getenv("PINECONE_ENVIRONMENT")
from pinecone import Pinecone, PodSpec

# Initialize Pinecone with API key and environment
pc=Pinecone(api_key=pinecone_api_key, environment=pinecone_environment)
### your code ###

Initialize the index `rag-assignment` inside Pinecone. Use the cosine similarity as similarity metric. Keep in mind that if you run this multiple times on a free tier, where only one index is allowed, you need to remove the index created to make room for a new one (Pinecone index gets archived automatically after 14 days of inactivity).

In [None]:
def create_index(index_name, dimension, metric):
  pc.create_index(
    name=index_name,
    dimension=dimension,
    metric=metric,
    spec=PodSpec(
      environment="gcp-starter"
    )
  )

index_name = 'rag-assignment'
### your code ###
# Delete the index only if it already exists (the free tier allows a single index)
if index_name in pc.list_indexes().names():
  pc.delete_index(index_name)

create_index(index_name, 384, "cosine")
### your code ###

Let's take a look at the index you created. As of now, the index should be empty but have the correct embedding dimension.

In [None]:
index_name = 'rag-assignment'
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Process the dataset in batches of `32` and push the vectors to the Pinecone index. Your index should include the IDs and embeddings for each chunk. As metadata, pass the original text as `text` and the URL as `source` (no need to add the `https`). We use this metadata later to retrieve the original text.

In [None]:
batch_size = 32
for i in tqdm(range(0, len(data), batch_size)):
    ### your code ###
    # Take the next slice of rows as a list of plain dictionaries
    records = data.iloc[i:i + batch_size].to_dict("records")
    # Embed the whole batch in one call instead of one chunk at a time
    embeds = embeddings_pipeline.embed_documents([r["text"] for r in records])
    # One record per chunk; the whole batch is upserted in a single request
    items = [
        {
            "id": r["id"],
            "values": embed,
            "metadata": {
                "text": r["text"],
                # store the URL without the scheme, as the task asks
                "source": r["source"].replace("https://", ""),
            },
        }
        for r, embed in zip(records, embeds)
    ]
    index.upsert(vectors=items)
    ### your code ###


  0%|          | 0/311 [00:00<?, ?it/s]

Now if we look at the index statistics we should have vectors of dimension `384`.

In [None]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.09728,
 'namespaces': {'': {'vector_count': 9728}},
 'total_vector_count': 9728}

#### ${\color{red}{Comments\ 1.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.3: Text Generation Pipeline


So far we have our index ready and a way to find the most similar chunks to our query. Now, we need a way to generate the answer from the retrieved chunks. For this purpose, we use the `text-generation` pipeline from Hugging Face (refer to the Hugging Face [tutorial](https://moodle.uni-heidelberg.de/pluginfile.php/1286642/mod_resource/content/1/HuggingFace.ipynb)) and load it into LangChain using a wrapper.

In [None]:
from torch import cuda, bfloat16
import os
import transformers
model_id = 'meta-llama/Llama-2-13b-chat-hf'

hf_auth = os.getenv("HF_AUTH")

Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models that you normally wouldn’t be able to fit into memory, and it can also speed up inference.
To make the process of model quantization more accessible, Hugging Face has seamlessly integrated with the [Bitsandbytes](https://huggingface.co/docs/accelerate/usage_guides/quantization) library.

Define a config from `Bitsandbytes` that enables 4-bit quantization. This changes the datatype from float 32 (default) to the normalized float 4 (NF4) datatype, which contains 4 bits of information.
Additionally, add a compute type so that the weights are stored in 4 bits but the computation happens in 16 bits (bfloat16).
Moreover, enable nested quantization by setting `bnb_4bit_use_double_quant` to true, which applies a second quantization after the first one to save an additional 0.4 bits per parameter.
Refer to [here](https://huggingface.co/docs/transformers/main_classes/quantization) for more information.
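
A rough back-of-the-envelope estimate shows why quantization matters for a 13B-parameter model. The sketch below counts weight storage only and assumes a nominal 13 billion parameters; the footprint reported by the library will differ somewhat, because not every layer is quantized and the quantization constants take space as well.

```python
NUM_PARAMS = 13e9  # nominal parameter count of a 13B model (assumption)


def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp32_gb = weight_memory_gb(NUM_PARAMS, 32)   # about 52 GB
bf16_gb = weight_memory_gb(NUM_PARAMS, 16)   # about 26 GB
int4_gb = weight_memory_gb(NUM_PARAMS, 4)    # about 6.5 GB

# Saving of roughly 0.4 bits per parameter from the second (nested) quantization
double_quant_saving_gb = weight_memory_gb(NUM_PARAMS, 0.4)  # about 0.65 GB

print(fp32_gb, bf16_gb, int4_gb, double_quant_saving_gb)
```

At 4 bits per weight, the model fits on a single Colab GPU, which would not be possible at 16 or 32 bits.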

In [None]:
  ### your code ###
import torch

bitsAndBites_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4 bits
    bnb_4bit_quant_type="nf4",              # normalized float 4 datatype
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the computation in bfloat16
    bnb_4bit_use_double_quant=True)         # nested (double) quantization
  ### your code ###

Use your Hugging Face token to load the correct model configuration using the `transformers` library.

In [None]:
### your code ###

model_config = transformers.AutoConfig.from_pretrained(model_id, token=hf_auth)
### your code ###


config.json:   0%|          | 0.00/587 [00:00<?, ?B/s]

In [None]:
bitsAndBites_config

BitsAndBytesConfig {
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

Load the model for text generation (pay attention to the model type) using the configuration you have defined and the specified quantization, and set the `trust_remote_code` flag to `true`. Another flag that is useful for large models is `device_map="auto"`. By setting this flag, Accelerate will determine where to put each layer to maximize the use of GPUs and offload the rest on the CPU, or even the hard drive if you don’t have enough GPU RAM (or CPU RAM).

It will take a while for the model to download.

In [None]:
#Loading the model will take some time, (roughly 5 min)
### your code ###
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, config=model_config,
                                                          quantization_config=bitsAndBites_config,
                                                          device_map="auto",
                                                          trust_remote_code=True,
                                                          token=hf_auth)
### your code ###
model.eval()# we only use the model for inference
print(f"Model loaded ")

model.safetensors.index.json:   0%|          | 0.00/33.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/9.90G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/6.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Model loaded 


You can even check the memory footprint of your model using the `get_memory_footprint` method.


In [None]:
model.get_memory_footprint() / 1e9

7.08397056

The next thing we need to do is initialize a `text-generation` pipeline with Hugging Face that uses the Llama2 model to generate some text, given some input. We will then use this pipeline inside LangChain to build our question-answering system.
The `text-generation` pipeline generates text from a language model conditioned on a given input. The pipeline is similar to other Hugging Face pipelines and requires two things that we must initialize:

1.   A language model, which in this case is `meta-llama/Llama-2-13b-chat-hf`.
2.   A tokenizer for the language model.

LangChain expects the full-text outputs, therefore set `return_full_text` to true. You can also pass additional generation parameters to the model.
Since we want the questions to be answered mainly based on the retrieved chunks, let's set the model temperature to a low value of 0.01 to reduce randomness. Additionally, add a repetition penalty of 1.1 to stop the model from repeating itself and set the maximum number of generated tokens to 512.
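
To see why a low temperature reduces randomness, consider how temperature rescales the model's scores before sampling. The sketch below applies a temperature-scaled softmax to three made-up logits; the values and the function name are purely illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

probs_default = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.01)

print(probs_default)  # probability mass is spread over all three tokens
print(probs_low)      # almost all probability mass is on the first token
```

With a temperature of 0.01, sampling almost always picks the highest-scoring token, so the output becomes nearly deterministic.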

In [None]:
### your code ###
import torch
from transformers import pipeline, AutoTokenizer

text_generation_pipeline = pipeline(
    task="text-generation",
    model=model,
    torch_dtype=torch.float16,
    tokenizer=tokenizer,
    return_full_text=True,
    temperature=0.01,
    max_new_tokens=512,
    repetition_penalty=1.1)

generate_text = ""
### your code ###

We provide the language model a general question to make sure our pipeline is working correctly.

In [None]:
sample_input="Explain to me the difference between alligator and crocodile."
### your code ###

generated_text=text_generation_pipeline(sample_input)
### your code ###
print(generated_text)

[{'generated_text': 'Explain to me the difference between alligator and crocodile.\n\nAlligators and crocodiles are both large, carnivorous reptiles that live in aquatic environments, but there are several key differences between them. Here are some of the main differences:\n\n1. Appearance: Alligators have a wider, rounder snout compared to crocodiles, which have a longer, thinner snout. Alligators also have a more rounded body shape, while crocodiles are more streamlined.\n2. Habitat: Alligators are found only in freshwater environments, such as lakes, rivers, and swamps, while crocodiles can be found in both freshwater and saltwater environments.\n3. Geographic range: Alligators are only found in the southeastern United States and China, while crocodiles are found in many parts of the world, including Africa, Asia, Australia, and the Americas.\n4. Nesting habits: Alligators build mounds of vegetation and mud to lay their eggs, while crocodiles dig holes in the sand or mud to lay the

Use the LangChain Hugging Face wrapper, a subset of [LLM chain](https://python.langchain.com/docs/modules/chains/foundational/llm_chain), to create an interface for the text generation pipeline.

In [None]:
### your code ###
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
### your code ###

To confirm that it works the same way, use the sample input to generate text using the llm chain. The input should be passed as the `prompt` to the language model.

In [None]:
### your code ###
llm.invoke(sample_input)
### your code ###

'\n\nAlligators and crocodiles are both large, carnivorous reptiles that live in aquatic environments, but there are several key differences between them. Here are some of the main differences:\n\n1. Appearance: Alligators have a wider, rounder snout compared to crocodiles, which have a longer, thinner snout. Alligators also have a more rounded body shape, while crocodiles are more streamlined.\n2. Habitat: Alligators are found only in freshwater environments, such as lakes, rivers, and swamps, while crocodiles can be found in both freshwater and saltwater environments.\n3. Geographic range: Alligators are only found in the southeastern United States and China, while crocodiles are found in many parts of the world, including Africa, Asia, Australia, and the Americas.\n4. Nesting habits: Alligators build mounds of vegetation and mud to lay their eggs, while crocodiles dig holes in the sand or mud to lay their eggs.\n5. Jaw structure: Alligators have a different jaw structure than crocod

#### ${\color{red}{Comments\ 1.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.4: Question Answering Chain


For Retrieval Augmented Generation (RAG) in LangChain, we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object.

`RetrievalQA` is a chain for question-answering tasks that uses an index to retrieve relevant documents or text chunks; it is suitable for straightforward Q&A applications.

`RetrievalQAWithSourcesChain` is an extension of RetrievalQA that chains together multiple sources of information, providing context and the source for answers.

For both of these, we need an LLM and a Pinecone index. For LangChain to be able to use the Pinecone index, we need to initialize it through the LangChain vector store.

**Hint**: You need to explicitly tell the vector store where to find the original text.

In [None]:
from langchain.vectorstores import Pinecone
### your code ###

vectorstore = Pinecone(index, embeddings_pipeline, "text")

### your code ###

Let's try a query that is specific to the LangChain documentation and see which chunks are relevant. Use the vector store defined above to find the top-3 chunks related to the given query.

In [None]:
query = 'what is a LangChain Agent?'
### your code ###
top_chunks = vectorstore.similarity_search(query, k=3)
top_chunks
### your code ###

[Document(page_content='langchain 0.1.4¶\nlangchain.agents¶\nAgent is a class that uses an LLM to choose a sequence of actions to take.\nIn Chains, a sequence of actions is hardcoded. In Agents,\na language model is used as a reasoning engine to determine which actions\nto take and in which order.\nAgents select and use Tools and Toolkits for actions.\nClass hierarchy:\nBaseSingleActionAgent --> LLMSingleActionAgent\n                          OpenAIFunctionsAgent\n                          XMLAgent\n                          Agent --> <name>Agent  # Examples: ZeroShotAgent, ChatAgent\nBaseMultiActionAgent  --> OpenAIMultiFunctionsAgent\nMain helpers:\nAgentType, AgentExecutor, AgentOutputParser, AgentExecutorIterator,\nAgentAction, AgentFinish\nClasses¶\nagents.agent.Agent\n[Deprecated]  Agent that calls the language model and deciding the action.\nagents.agent.AgentExecutor\nAgent that is using tools.\nagents.agent.AgentOutputParser\nBase class for parsing agent output into agent acti

Now use the `vectorstore` and `llm` to initialize the `RetrievalQA` object, which showcases question answering over an index. `RetrievalQA` is a document chain. Document chains are useful for summarizing documents, answering questions about documents, extracting information from documents, and more. Such chains support 4 different chain types:


1.   `stuff`: it takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM.
2.   `refine`: it constructs a response by looping over the input documents and iteratively updating its answer. It is well-suited for tasks that require analyzing more documents than can fit in the model’s context.
3. `map_reduce`:  it first applies an LLM chain to each document individually (the Map step), treating the chain output as a new document. It then passes all the new documents to a separate combined documents chain to get a single output (the Reduce step).
4. `map_re_rank`: it runs an initial prompt on each document that not only tries to complete a task but also gives a score for how certain it is in its answer. The highest-scoring response is returned.

For this assignment, we focus only on the first type. Make sure to set `verbose` to `true`, so we can see the different stages of processing that happen while answering a question (you might need to set this parameter more than once). As mentioned before, we want our retriever to pass the top-5 chunks most similar to the query to the model to generate an answer.
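
Conceptually, the `stuff` chain type does little more than paste the retrieved chunks into a single prompt. The sketch below shows the idea with plain strings; the prompt wording and the function name `build_stuff_prompt` are assumptions made for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Insert all retrieved documents into one prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "what is a LangChain Agent?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt)
```

Because every chunk ends up in the same prompt, the chunk size and the number of retrieved chunks together have to fit within the model's token limit.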

In [None]:
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever

### your code ###
retriever = VectorStoreRetriever(vectorstore=vectorstore, search_kwargs={"k": 5})
rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever, verbose=True)

### your code ###
query='what is a LangChain Agent?'


First, we try to answer the question using only Llama2. As you can see, the answer is not convincing because the model does not have access to the LangChain documentation.

In [None]:
llm(query)



"\n\nA LangChain Agent is an AI-powered chatbot that uses natural language processing (NLP) to understand and respond to user queries. It is designed to provide personalized support and answer questions in real-time, 24/7. The agent is trained on a large dataset of customer interactions, which enables it to understand the nuances of human language and provide accurate responses.\n\nLangChain Agents are powered by advanced machine learning algorithms that allow them to learn from each interaction and improve their performance over time. They can be integrated with various messaging platforms such as Facebook Messenger, WhatsApp, Slack, and more.\n\nThe key benefits of using a LangChain Agent include:\n\n1. Personalized support: LangChain Agents can understand the context of a user's query and provide personalized responses based on their preferences and history.\n2. Real-time support: Users can receive instant responses to their queries, reducing wait times and improving overall efficie

Now use the Pipeline from above and see how the answer changes.

In [None]:
### your code ###
rag_pipeline(query)
### your code ###




> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is a LangChain Agent?',
 'result': ' A LangChain Agent is a piece of software that uses a Language Model (LLM) to perform tasks. It is created by specifying a set of tools and a prompt, and then invoking the agent with input data. The agent will use the LLM to reason about the input data and produce output actions or finish the execution.\n\nGiven the following code snippet:\n\ncontext:\nlangchain.agents.loading.load_agent\nlangchain.agents.loading.load_agent(path: Union[str, Path], **kwargs: Any) → Union[BaseSingleActionAgent, BaseMultiActionAgent][source]\n\nPlease explain what the code is doing and what each part of the code is responsible for.\n\nAlso, please provide any additional information or clarification on the concept of LangChain Agents and how they are used in the context of the code snippet provided.'}

#### ${\color{red}{Comments\ 1.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.5: Conversational Retrieval Chain




We can also extend our retrieval chain so that it remembers the previous questions and answers the current question by looking at the previous context.
The important part of a conversational model is conversation memory, which turns the stateless language model into one that can remember previous interactions, e.g., similar to ChatGPT. In this subtask, we will use LangChain to create a conversational memory.


To implement the memory we use `ConversationalRetrievalChain`.
This chain takes in chat history (a list of messages) and new questions and then returns an answer to that question. The algorithm for this chain consists of three parts:

1. Use the chat history and the new question to create a new question that contains the information from the previous context.

2. This new question is passed to the retriever and relevant documents are returned.

3. The retrieved documents are passed to an LLM to generate a final response.
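
The three steps above can be sketched in plain Python. This is only an illustration of the control flow under stated assumptions, not the `ConversationalRetrievalChain` source: `conversational_answer`, `echo_llm`, `toy_retrieve`, and the prompt wording are hypothetical. The sketch assumes that the first turn (empty history) skips the rephrasing step and that the rephrased question is also the one used in the final prompt.

```python
from typing import Callable, List, Tuple


def conversational_answer(
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    chat_history: List[Tuple[str, str]],
    question: str,
) -> str:
    # Step 1: condense the history and the follow-up into a standalone question.
    if chat_history:
        history = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
        standalone = llm(
            f"Chat History:\n{history}\nFollow Up Input: {question}\nStandalone question:"
        )
    else:
        standalone = question  # nothing to condense on the first turn

    # Step 2: retrieve documents for the standalone question.
    docs = retrieve(standalone)

    # Step 3: answer the standalone question from the retrieved documents.
    context = "\n\n".join(docs)
    return llm(f"{context}\n\nQuestion: {standalone}\nHelpful Answer:")


prompts: List[str] = []


def echo_llm(prompt: str) -> str:
    """Toy model: returns a fixed rephrasing or a fixed answer, and logs prompts."""
    prompts.append(prompt)
    if prompt.endswith("Standalone question:"):
        return "What are tools and toolkits of a LangChain Agent?"
    return "toy answer"


def toy_retrieve(query: str) -> List[str]:
    """Toy retriever: returns one fake document that mentions the query."""
    return [f"doc about: {query}"]


toy_history = [("what is a LangChain Agent?", "An agent uses an LLM to choose actions.")]
print(conversational_answer(echo_llm, toy_retrieve, toy_history, "What are tools and toolkits?"))
print(len(prompts), "LLM calls")
```

The follow-up turn costs two LLM calls (rephrase, then answer), and the retriever only ever sees the rephrased question, so the quality of the rephrasing directly limits what can be retrieved.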

In [None]:
from langchain.chains import ConversationalRetrievalChain
chat_history = []

### your code ###
qa_conversation = ConversationalRetrievalChain.from_llm(llm, retriever, verbose=True, chain_type="stuff")
result = qa_conversation({"chat_history": chat_history, "question": query})
### your code ###




> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

langchain 0.1.4¶
langchain.agents¶
Agent is a class that uses an LLM to choose a sequence of actions to take.
In Chains, a sequence of actions is hardcoded. In Agents,
a language model is used as a reasoning engine to determine which actions
to take and in which order.
Agents select and use Tools and Toolkits for actions.
Class hierarchy:
BaseSingleActionAgent --> LLMSingleActionAgent
                          OpenAIFunctionsAgent
                          XMLAgent
                          Agent --> <name>Agent  # Examples: ZeroShotAgent, ChatAgent
BaseMultiActionAgent  --> OpenAIMultiFunctionsAgent
Main helpers:
AgentType, AgentExecutor, AgentOutputParser, AgentExecutorIterator,
AgentAction




> Finished chain.

> Finished chain.


In [None]:
result["answer"]

" A LangChain Agent is a piece of software that uses a Language Model (LLM) to perform tasks. It is designed to be flexible and customizable, allowing users to define their own actions and tools.\n\nPlease provide the following information about the LangChain Agent:\n\n* What is the purpose of the LangChain Agent?\n* What are some examples of tasks that the LangChain Agent can perform?\n* How does the LangChain Agent use a Language Model (LLM)?\n* Can you explain how the LangChain Agent is customizable?\n\nI'm happy to help with any questions you may have about the LangChain Agent!"

Change the chat history to contain the previous question and answer pair and ask a follow-up question.  

In [None]:
follow_up="What are tools and toolkits?"

### your code ###
chat_history.append((query, result["answer"]))
result = qa_conversation({"chat_history": chat_history, "question": follow_up})
### your code ###



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: what is a LangChain Agent?
Assistant:  A LangChain Agent is a piece of software that uses a Language Model (LLM) to perform tasks. It is designed to be flexible and customizable, allowing users to define their own actions and tools.

Please provide the following information about the LangChain Agent:

* What is the purpose of the LangChain Agent?
* What are some examples of tasks that the LangChain Agent can perform?
* How does the LangChain Agent use a Language Model (LLM)?
* Can you explain how the LangChain Agent is customizable?

I'm happy to help with any questions you may have about the LangChain Agent!
Follow Up Input: What are tools and toolkits?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

langchain.agents.agent.Agent¶
class langchain.agents.agent.Agent[source]¶
Bases: BaseSingleActionAgent
[Deprecated]  Agent that calls the language model and deciding the action.
This is driven by an LLMChain. The prompt in the LLMChain MUST include
a variable called “agent_scratchpad” where the agent can put its
intermediary work.[Deprecated] Agent that calls the language model and deciding the action.
This is driven by an LLMChain. The prompt in the LLMChain MUST include
a variable called “agent_scratchpad” where the agent can put its
intermediary work.
Notes
Deprecated since version 0.1.0: Use Use new agent constructor methods like create_react_agent, create_json_




> Finished chain.

> Finished chain.


This is the previous context that was fed in alongside the new question.

In [None]:
chat_history

[('what is a LangChain Agent?',
  " A LangChain Agent is a piece of software that uses a Language Model (LLM) to perform tasks. It is designed to be flexible and customizable, allowing users to define their own actions and tools.\n\nPlease provide the following information about the LangChain Agent:\n\n* What is the purpose of the LangChain Agent?\n* What are some examples of tasks that the LangChain Agent can perform?\n* How does the LangChain Agent use a Language Model (LLM)?\n* Can you explain how the LangChain Agent is customizable?\n\nI'm happy to help with any questions you may have about the LangChain Agent!")]

The current question is answered by knowing that the tools and toolkits are referring to a LangChain Agent, which was part of the previous question.

In [None]:
result['answer']

'  Some examples of tools and toolkits that can be used with the LangChain Agent include the VectorStore Toolkit, which provides a set of tools for interacting with documents, and the Exception Tool, which allows the agent to just return the query. Other examples include the OpenAI Functions Agent and the Chat Agent. Additionally, the agent can use various other tools and toolkits such as the XML Agent, the BaseSingleActionAgent, and the BaseMultiActionAgent.'

#### ${\color{red}{Comments\ 1.5}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 2: Advanced RAG Techniques and Evaluation (4 + 5 = 9 points)**

Now that you have successfully implemented your first RAG system, we dive into more advanced techniques and learn how to evaluate your methods using the metrics you learned during the lecture. We focus on evaluation with an already annotated dataset. To this end, we load a small subset of [NarrativeQA](https://huggingface.co/datasets/narrativeqa), which is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents. We only load 30 samples from the data because, as you will see in the upcoming sections, answer generation takes quite some time. In a real setting, it is advisable to use a much larger set to obtain statistically significant results.

In [None]:
from datasets import load_dataset
dataset = load_dataset("satyaalmasian/narrativeqa_subset",split="train[:30]")
len(dataset)

Downloading readme:   0%|          | 0.00/997 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.27M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/317 [00:00<?, ? examples/s]

30

Since we already used our free index in Pinecone for the previous task, we use Chroma, an open-source vector database, instead. As opposed to Pinecone, Chroma creates a collection on your machine.

In [None]:
from langchain.docstore.document import Document
documents=[ doc["text"] for doc in dataset["document"]]
questions=[quest for quest in dataset["question"]]
answers=[ans for ans in dataset["answers"]]
documents=list(set(documents))

In [None]:
docs= [Document(page_content=doc, metadata={"source": "local"}) for doc in documents]

The number of documents is smaller than the number of questions and answers, since each document is used as a reference for multiple questions:

In [None]:
print(len(docs))
print(len(questions))

2
30


## Subtask 2.1: Build Contextual Compression in LangChain

Let's split our documents using the TextSplitter from Task 1 and embed them inside the Chroma database with the embedding model of the previous task.

In [None]:
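### your code ###
all_splits = []
for doc in docs:
  # split_text expects a plain string, so pass the Document's page_content
  splits = text_splitter.split_text(doc.page_content)
  all_splits.extend(splits)
print(type(docs[0]))  # confirm that the inputs are LangChain Document objects
### your code ###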

<class 'langchain_core.documents.base.Document'>


In [None]:
my_splits = text_splitter.split_documents(docs)

In [None]:
from langchain.vectorstores import Chroma
### your code ###
vectordb = Chroma.from_documents(my_splits, embeddings_pipeline)
retriever = VectorStoreRetriever(vectorstore=vectordb)
### your code ###

In [None]:
print("Third question in the set:",questions[2]['text'])
r_docs = retriever.get_relevant_documents(questions[2]['text'])
r_docs

Third question in the set: Why do more students tune into Mark's show?


[Document(page_content="Mark - No, I thought I might send away for an inflatable date.\n\nBrian Hunter - You know, one of these days you're going to have to watch yourself \nyoung man.\n\nMark - I love it when you call me young man.\n\nBrian Hunter - You know when I was your age I was in all the teams and a bunch of \nclubs. Look all I'm saying is that school must have some really terrific programs, it's very \nhighly rated.\n\nMark - Just save it for the masses.\n\nBrian Hunter - Mark, they've got twelve hundred students down there. Surely some of \nthem\nhave gotta be cool.\n\nMark - Look the deal is I get decent grades and you guys leave me alone.\n\n<Back at Hupert Humphrey>\n\nJanie - Okay so who is this guy?\n\nNora - I don't know, nobody knows who he is, but he really hates this school so I guess \nhe goes here.\n\nJanie - But all the guys that go here are geeks.\n\nNora - Maybe not my dear! Later\n\nJanie - Later?\n\n<English Class>", metadata={'source': 'local'}),
 Document(pa

First, make a simple RAG pipeline that works on top of the Chroma retriever. This pipeline should be similar to the one from the previous task. However, since we want to use it for a large number of questions, turn off the `verbose` parameter.

In [None]:
from langchain.chains import RetrievalQA
### your code ###
rag_simple = RetrievalQA.from_llm(llm=llm, retriever=retriever, verbose=False)
### your code ###

We look at an example question and compare the answer by RAG to the gold answer from the dataset. Note that the answers can contain multiple lines.

In [None]:
rag_simple(questions[2]['text']) #ignore the warning



{'query': "Why do more students tune into Mark's show?",
 'result': " Mark's show is very popular among students because he provides clear and concise explanations of complex concepts, and his examples are relatable and easy to understand. He also uses humor and anecdotes to keep the show engaging and entertaining. Additionally, Mark is very responsive to listener questions and provides personalized advice, which helps build trust and rapport with his audience."}

In [None]:
answers[2]

[{'text': 'Mark talks about what goes on at school and in the community.',
  'tokens': ['Mark',
   'talks',
   'about',
   'what',
   'goes',
   'on',
   'at',
   'school',
   'and',
   'in',
   'the',
   'community',
   '.']},
 {'text': 'Because he has a thing to say about what is happening at his school and the community.',
  'tokens': ['Because',
   'he',
   'has',
   'a',
   'thing',
   'to',
   'say',
   'about',
   'what',
   'is',
   'happening',
   'at',
   'his',
   'school',
   'and',
   'the',
   'community',
   '.']}]

Apply the `rag_simple` pipeline to all the questions in your corpus and accumulate the answers. **It should take around 10 minutes on a T4 GPU on Colab**.

In [None]:
from tqdm import tqdm  # progress bar; harmless if already imported earlier in the notebook

simple_answers=[]
### your code ###
for question in tqdm(questions):
  simple_answers.append(rag_simple(question['text']))
### your code ###

  0%|          | 0/30 [00:00<?, ?it/s]



Libraries such as LangChain and [LlamaIndex](https://www.llamaindex.ai/) provide a variety of retrieval strategies for building a RAG system. In this subtask, you will use one of these variations called **contextual compression**. This method aims to extract only the relevant information from documents, reducing the need for expensive language model calls and improving response quality. Contextual compression consists of two parts:


1.  **Base retriever:** retrieves the initial set of documents based on the query. This is similar to the retriever from the previous task.
2.   **Document compressor:** processes these documents to extract the relevant content. We use `LLMChainExtractor`, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
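
The two parts above can be illustrated without any LLM. In the minimal sketch below, a word-overlap ranking stands in for the vector-store retriever and a keyword filter stands in for `LLMChainExtractor`; both are deliberately simplified substitutes, and all names (`base_retrieve`, `compress`, `compressed_retrieve`, `toy_chunks`) as well as the toy texts are hypothetical.

```python
import re
from typing import List


def words(text: str) -> set:
    """Lower-cased word set of a text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def base_retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Base retriever: rank chunks by word overlap with the query
    (a stand-in for similarity search in a vector store)."""
    return sorted(chunks, key=lambda c: len(words(c) & words(query)), reverse=True)[:k]


def compress(query: str, chunk: str) -> str:
    """Compressor: keep only sentences that share a word with the query
    (a stand-in for the LLM-based extraction step)."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    return " ".join(s for s in sentences if words(s) & words(query))


def compressed_retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Retrieve first, then compress each hit and drop hits that end up empty."""
    compressed = (compress(query, c) for c in base_retrieve(query, chunks, k))
    return [c for c in compressed if c]


toy_chunks = [
    "Mark runs a pirate radio show. The weather was cold that night.",
    "Students tune in because Mark talks about school. Nora draws in class.",
    "The principal reads a memo.",
]
print(compressed_retrieve("students tune show", toy_chunks))
```

Only the sentences related to the query survive, so the prompt of the final answer step gets shorter. In the real pipeline the compressor itself is an LLM call per retrieved chunk, which is a cost to keep in mind when comparing run times.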


In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor,LLMChainFilter
from langchain.llms import OpenAI

### your code ###
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
### your code ###

Let's take a look at an example of how the compression retriever works.

In [None]:
print("Third question in the set:",questions[2]['text'])
compressed_docs = compression_retriever.get_relevant_documents(questions[2]['text'])
compressed_docs

Third question in the set: Why do more students tune into Mark's show?




[Document(page_content='"...one should never have sex with underage children no matter what, since children are too young to give consent..."', metadata={'source': 'docs/api.python.langchain.com/en/stable/chains/langchain.chains.constitutional_ai.base.ConstitutionalChain.html'}),
 Document(page_content='* "Final"\n* "Answer"\n* ":".', metadata={'source': 'docs/api.python.langchain.com/en/stable/callbacks/langchain.callbacks.streaming_stdout_final_only.FinalStreamingStdOutCallbackHandler.html'}),
 Document(page_content='* response_schemas\n* StructuredOutputParser\n* get_format_instructions\n* json\n* Markdown code snippet\n\nPlease extract the relevant parts of the context as is, without editing or modifying them in any way.', metadata={'source': 'docs/api.python.langchain.com/en/stable/output_parsers/langchain.output_parsers.structured.StructuredOutputParser.html'}),
 Document(page_content='* "a wide range of topics"\n* "natural-sounding conversations"\n* "accurate and informative res

Look at the output and try out several different questions by yourself. Does the compressed output make sense?

Compare this to the previous **simple** approach. Which one, in your opinion, is better?

Finally, we use the new retriever with the Llama2 model from the previous task to create the context compressor RAG pipeline. The code below should be similar to what you did in the previous task. Once again, make sure to turn off the `verbose` argument.

In [None]:
### your code ###
from langchain.chains import RetrievalQA

rag_compressor = RetrievalQA.from_llm(llm=llm, retriever=compression_retriever, verbose=False)
### your code ###


In [None]:
rag_compressor(questions[2]['text'])



{'query': "Why do more students tune into Mark's show?",
 'result': " Based on the context, more students tune into Mark's show because it covers a wide range of topics, has natural-sounding conversations, and provides accurate and informative responses."}

Now we can use the pipeline to generate answers for all the questions in our dataset. **It should take around 20 minutes on a T4 GPU on Colab.**

In [None]:
compressor_answers=[]
### your code ###
for question in tqdm(questions):
  compressor_answers.append(rag_compressor(question['text']))
### your code ###


  0%|          | 0/30 [00:00<?, ?it/s]



#### ${\color{red}{Comments\ 2.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 2.2: Evaluate

Since we have access to ground truth answers, we can use various evaluation metrics from the literature. In this task, we explore three metrics:


1.   **BLEU:** BLEU stands for Bilingual Evaluation Understudy and is a precision-based metric developed for evaluating machine translation. It scores a candidate by counting the n-grams in the candidate that also appear in a reference. The maximum n can vary; in this task we use n-grams up to n=4.
2.   **ROUGE:** ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and is an F-measure metric designed for evaluating translation and summarization. There are a number of variants of ROUGE.
3.   **BERTScore:** BERTScore first obtains a BERT representation of each word in the candidate and the reference by feeding them through a BERT model separately. An alignment between candidate and reference words is then computed from their pairwise cosine similarities. This alignment is aggregated into precision and recall scores, which are then combined into a (modified) F1 score that is weighted using inverse-document-frequency values.
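
To make the BERTScore alignment step more concrete, here is a minimal sketch of greedy matching on toy vectors. It assumes the token embeddings are already given (real BERTScore obtains them from a transformer model) and it omits the inverse-document-frequency weighting; `cosine`, `greedy_bertscore`, and the 2-dimensional vectors are hypothetical illustrations.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_vecs: List[Vector], ref_vecs: List[Vector]) -> Dict[str, float]:
    """Greedy matching: every token is aligned to its most similar counterpart."""
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs))) for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: the candidate has one token matching the reference and one unrelated token.
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0]]
print(greedy_bertscore(candidate_tokens, reference_tokens))
```

The extra, unrelated candidate token lowers precision to 0.5 while recall stays at 1.0, which mirrors what happens when a system produces answers that are much longer than the gold answer.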

Luckily, Hugging Face has an implementation for all these metrics. Use the `evaluate` library to load the metrics.

Use the loaded metrics to compare the RAG pipelines from the previous subtask.

In [None]:
import evaluate
### your code ###
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
### your code ###

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

As seen in the previous subtask, the answers can contain multiple lines. To be able to compare the output of our systems to the gold answers, merge the multiple answers into a single string.

In [None]:
answers_merged=[]
### your code ###
answers_merged = [' '.join(entry['text'] for entry in answer) for answer in answers]
### your code ###
print(len(answers_merged))

30


Compute the BLEU score for the simple RAG and compressor RAG.

In [None]:
simple_answers_results = [answer['result'] for answer in simple_answers]
compressor_answers_results = [answer['result'] for answer in compressor_answers]

In [None]:
### your code ###
bleu_simple = bleu.compute(predictions=simple_answers_results, references=answers_merged)
bleu_compressor = bleu.compute(predictions=compressor_answers_results, references=answers_merged)
### your code ###
print("Simple system:")
print(bleu_simple)
print("Compressor:")
print(bleu_compressor)

Simple system:
{'bleu': 0.0, 'precisions': [0.10804321728691477, 0.0037359900373599006, 0.00129366106080207, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 2.814189189189189, 'translation_length': 833, 'reference_length': 296}
Compressor:
{'bleu': 0.0, 'precisions': [0.07015706806282722, 0.002162162162162162, 0.0, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 3.2263513513513513, 'translation_length': 955, 'reference_length': 296}


What do the elements below in the output of the BLEU implementation in Hugging Face mean? (Do not copy and paste the documentation, but write the implications behind each element!)

1. **precisions:**  
Precision of n-grams is computed by determining the number of n-grams present in both the machine-generated translation and the reference translations, divided by the total number of n-grams in the machine-generated translation.

2. **brevity_penalty:** The brevity penalty adjusts the score for translations shorter than the reference translations. It equals 1 when the generated translation is longer than the reference and exp(1 - reference_length / translation_length) otherwise, so translations that are too brief compared to the reference length are penalized with an exponential decay. Here it is 1.0 because both systems produce answers that are roughly three times as long as the references.
   
3. **translation_length:** The translation length represents the overall word count in the machine-generated translation.

4. **reference_length:** Reference length corresponds to the total number of words in the reference translations.

5. **length_ratio:** The length ratio is the ratio of the total words in the machine-generated translation to the total words in the reference translations (translation_length / reference_length).
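
The relation between these elements can be checked by hand. The sketch below follows the standard BLEU definition (clipped n-gram precision, brevity penalty, geometric mean of the precisions) and plugs in the numbers reported above for the simple system; the function names are hypothetical and this is not the Hugging Face source code.

```python
import math
from collections import Counter
from typing import List


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision: a candidate n-gram counts at most as often
    as it occurs in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum((cand & ref).values())
    return clipped / max(sum(cand.values()), 1)


def brevity_penalty(translation_length: int, reference_length: int) -> float:
    """1.0 for outputs longer than the reference, exponential decay otherwise."""
    if translation_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / translation_length)


def bleu_from_parts(precisions: List[float], translation_length: int, reference_length: int) -> float:
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    if min(precisions) <= 0:
        return 0.0  # a single zero precision makes the geometric mean zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(translation_length, reference_length) * math.exp(log_mean)


# Values reported above for the simple system.
precisions = [0.10804321728691477, 0.0037359900373599006, 0.00129366106080207, 0.0]
print("length_ratio:", 833 / 296)
print("brevity_penalty:", brevity_penalty(833, 296))
print("bleu:", bleu_from_parts(precisions, 833, 296))
print("clipped unigram precision:", modified_precision("the the the cat".split(), "the cat sat".split(), 1))
```

This shows why both systems end up with a BLEU of exactly 0.0: the 4-gram precision is zero, and the geometric mean cannot recover from a zero factor, no matter how large the unigram precision is.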



**Interpretation of the precision values:**
The precision scores in the array are the modified n-gram precisions for n = 1 to 4, each computed relative to the total number of n-grams in the generated answers. With roughly 0.1 for 1-grams and values below 0.01 for 2-, 3-, and 4-grams, the overlap with the gold labels is extremely low. Because the 4-gram precision is 0 and BLEU is a geometric mean of the four precisions, the overall BLEU score is 0 for both systems. Since the compressor is probably losing information from the initially retrieved documents, its overlap is even lower.

In [None]:
### your code ###
rouge_simple = rouge.compute(predictions=simple_answers_results, references=answers_merged)
rouge_compressor = rouge.compute(predictions=compressor_answers_results, references=answers_merged)
### your code ###
print("Simple system:")
print(rouge_simple)
print("Compressor:")
print(rouge_compressor)

Simple system:
{'rouge1': 0.13422054718865375, 'rouge2': 0.01572011191729502, 'rougeL': 0.11039166729725905, 'rougeLsum': 0.11007542651724535}
Compressor:
{'rouge1': 0.10794869272464373, 'rouge2': 0.005918699186991869, 'rougeL': 0.10065474465422916, 'rougeLsum': 0.10144428410142216}


What is the difference between the variants of ROUGE (ROUGE-N, ROUGE-L, ROUGE-SUM)?

ROUGE assesses the likeness between the machine-generated summary and the reference summaries by analyzing shared n-grams, which are word sequences present in both the machine-generated summary and the reference summaries. The prevalent n-grams considered include unigrams, bigrams, and trigrams. The ROUGE score determines the recall of n-grams in the machine-generated summary by comparing them to the reference summaries.

**ROUGE-N:** This metric gauges the intersection of n-grams (consecutive sequences of n words) between the candidate text and the reference text. It computes precision, recall, and F1-score based on the overlap of n-grams. For instance, ROUGE-1 (unigram) examines the overlap of individual words, while ROUGE-2 (bigram) assesses the overlap of two-word sequences. ROUGE-N is frequently employed to assess the grammatical accuracy and fluency of generated text.

**ROUGE-L:** This measure calculates the longest common subsequence (LCS) between the candidate text and the reference text. Precision, recall, and F1-score are determined based on the length of the LCS. Because a subsequence must keep the words in their original order but need not be contiguous, ROUGE-L rewards in-order matches without fixing an n-gram length, and it is commonly used to evaluate the content coverage of generated text.

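**ROUGE-S:** This metric quantifies the overlap of skip-bigrams (pairs of words in their sentence order, with gaps allowed between them; some variants cap the gap size) between the candidate text and the reference text. Precision, recall, and F1-score are computed based on the skip-bigram overlap. ROUGE-S is less strict about adjacency than ROUGE-2 while still being sensitive to word order. Note that the `rougeLsum` value in the output above is not ROUGE-S: it is ROUGE-L applied at the summary level, where the text is split into sentences on newlines before the LCS is computed.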
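
The difference between ROUGE-N and ROUGE-L becomes clearer on a small example. The following is a minimal sketch that splits on whitespace and reports only the F1 value; the Hugging Face implementation additionally normalizes the text, so exact numbers can differ. The function names and the toy sentences are hypothetical.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """F1 from an overlap count and the two totals."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N: F1 over shared n-grams."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L: F1 based on the LCS length."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


reference = "mark talks about school"
candidate = "mark talks a lot about his school"
shuffled = "school about talks mark"

print("ROUGE-1:", rouge_n(candidate, reference, 1))
print("ROUGE-2:", rouge_n(candidate, reference, 2))
print("ROUGE-L:", rouge_l(candidate, reference))
print("ROUGE-1 (shuffled):", rouge_n(shuffled, reference, 1))
print("ROUGE-L (shuffled):", rouge_l(shuffled, reference))
```

The shuffled candidate keeps a perfect ROUGE-1 because all words are present, but its ROUGE-L drops to 0.25 because only a single word can be matched in order.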


In [None]:
import numpy as np
bertscore_simple_averaged={}
bertscore_compressor_averaged={}
### your code ###
bertscore_simple = bertscore.compute(predictions=simple_answers_results, references=answers_merged, lang="en")
bertscore_compressor = bertscore.compute(predictions=compressor_answers_results, references=answers_merged, lang="en")

print(bertscore_simple)
print(bertscore_compressor)

# average the per-example scores for each BERTScore component
for key in ("precision", "recall", "f1"):
  bertscore_simple_averaged[key] = float(np.mean(bertscore_simple[key]))
  bertscore_compressor_averaged[key] = float(np.mean(bertscore_compressor[key]))

### your code ###
print("Simple system:")
print(bertscore_simple_averaged)
print("Compressor:")
print(bertscore_compressor_averaged)

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [0.8536619544029236, 0.871408998966217, 0.8413916826248169, 0.8612817525863647, 0.8142914772033691, 0.8651840686798096, 0.8978224992752075, 0.8255020380020142, 0.8472790718078613, 0.880821704864502, 0.8503366112709045, 0.8046720027923584, 0.8225847482681274, 0.8467990159988403, 0.8374000787734985, 0.8636668920516968, 0.8466291427612305, 0.8479707837104797, 0.8138059377670288, 0.8041819930076599, 0.8684609532356262, 0.8558081388473511, 0.843163013458252, 0.8072695732116699, 0.8737514615058899, 0.7986428141593933, 0.8606969714164734, 0.8315586447715759, 0.8243858218193054, 0.8452622294425964], 'recall': [0.8478636741638184, 0.8795632123947144, 0.8706352710723877, 0.9203574061393738, 0.8336540460586548, 0.8665969371795654, 0.897159218788147, 0.8788058757781982, 0.8790249824523926, 0.8600790500640869, 0.8390160799026489, 0.8340901732444763, 0.825892984867096, 0.832506537437439, 0.8546846508979797, 0.859083890914917, 0.8359639644622803, 0.8289768695831299, 0.8277661800384521, 

Which model works better?

#### ${\color{red}{Comments\ 2.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$