**Heidelberg University**

**Data Science Group**
    
Prof. Dr. Michael Gertz  

Ashish Chouhan, Satya Almasian, John Ziegler, Jayson Salazar, Nicolas Reuter
    
January 16, 2024
    
Natural Language Processing with Transformers

Winter Semester 2023/2024
***

# **Assignment 4: Question Answering**
**Due**: Monday, January 29, 2024, 2pm, via [Moodle](https://moodle.uni-heidelberg.de/course/view.php?id=19251)



### **Submission Guidelines**

- Solutions need to be uploaded as a **single** Jupyter notebook. You will find several pre-filled code segments in the notebook; your task is to fill in the missing cells.
- For the written solution, use LaTeX in markdown inside the same notebook. Do **not** hand in a separate file for it.
- Download the .zip file containing the dataset but do **not** upload it with your solution.
- It is sufficient if one person per group uploads the solution to Moodle, but make sure that the full names of all team members are given in the notebook.

***

## **Task 1: Retrieval Augmented Generation (RAG)** ( 4.5 + 3 + 4 + 3 + 1.5 = 16 points)

In this task, we use the open-source `Llama-2-13b-chat` model to build a RAG system. You must first apply for access to the Llama 2 models via [this](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) form (access is typically granted within a few hours). You also need to request access to the model on Hugging Face by going to the [model](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) card. ***Note that the email you provide for your Hugging Face account must match the email you used to request Llama 2.***

The final piece that you need is a Hugging Face authentication token. You can generate a new token in the `Settings` of your Hugging Face profile, under the `Access Tokens` menu.

To store the documents, you will need a free Pinecone [API key](https://app.pinecone.io/).
Make sure you have these pieces ready before starting to work on this task.

----
When ready, let's start by downloading the necessary packages.

It is advised to run this notebook with a GPU (if you are on Colab, make sure that a GPU environment is activated).


Place all the access tokens in the `.env` file and upload it to the working directory (if you are running this notebook locally, you can change the path to fit your working directory). Please use the following format:


```
HF_AUTH="Hugging Face Authentication Key"
PINECONE_API_KEY="Pinecone API Key"
PINECONE_ENVIRONMENT="Pinecone Environment"
```

Run the cell below to load the access tokens into the environment variables.

In [None]:
%pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0


In [None]:
import os
from dotenv import load_dotenv, find_dotenv

# load environment variables from the .env file
_ = load_dotenv(find_dotenv())
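
Before continuing, it can help to confirm that all three keys were actually loaded. The following is a minimal sketch and not part of the assignment; the helper name `missing_keys` is hypothetical, and it only checks that the variables named in the `.env` format above are present and non-empty.

```python
import os

# Variable names taken from the .env format shown above.
REQUIRED_KEYS = ("HF_AUTH", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in required if not env.get(key)]


# Hypothetical usage: report anything that was not loaded into the environment.
missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All access tokens are available.")
```

If anything is reported as missing, check the location and spelling of the `.env` file before moving on.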

In [None]:
%pip install litellm
%pip install -qU trulens_eval pydantic fastapi kaleido python-multipart uvicorn cohere openai tiktoken "llama-index"
%pip install transformers
%pip install sentence-transformers
%pip install pinecone-client
%pip install datasets
%pip install accelerate
%pip install einops
%pip install langchain
%pip install xformers
%pip install bitsandbytes
%pip install matplotlib seaborn tqdm
%pip install chromadb
%pip install evaluate
%pip install rouge_score
%pip install bert_score

Collecting litellm
  Downloading litellm-1.17.3-py3-none-any.whl (2.4 MB)
Collecting openai>=1.0.0 (from litellm)
  Downloading openai-1.7.1-py3-none-any.whl (224 kB)
Collecting tiktoken>=0.4.0 (from litellm)
  Downloading tiktoken-0.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting httpx<1,>=0.23.0 (from openai>=1.0.0->litellm)
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting typing-extensions<5,>=4.7 (from openai>=1.0.0->litellm)
  Downloading typing_extensions-4.9.0



## Subtask 1.1: Data Preparation



We need a collection of documents to perform our retrieval on. To make it closer to your final project, you will be downloading and using a subset of the LangChain documentation. We get some of the `.html` files located on the site. The code below will download all HTML files from the links on the webpage into a `docs` directory. `-l1` limits the download to only the first level of depth.


In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!wget -r -l1 -A.html -P docs https://api.python.langchain.com/en/stable/langchain_api_reference.html

--2024-01-12 09:55:32--  https://api.python.langchain.com/en/stable/langchain_api_reference.html
Resolving api.python.langchain.com (api.python.langchain.com)... 104.17.32.82, 104.17.33.82, 2606:4700::6811:2152, ...
Connecting to api.python.langchain.com (api.python.langchain.com)|104.17.32.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘docs/api.python.langchain.com/en/stable/langchain_api_reference.html’


2024-01-12 09:55:32 (8.92 MB/s) - ‘docs/api.python.langchain.com/en/stable/langchain_api_reference.html’ saved [262709]

Loading robots.txt; please ignore errors.
--2024-01-12 09:55:32--  https://api.python.langchain.com/robots.txt
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 99 [text/plain

 The docs are used as input text for answering questions that a plain language model might not know about (the LangChain docs are not necessarily part of the training data of Llama2). We can use LangChain itself to process these docs. Use the [ReadTheDocsLoader](https://python.langchain.com/docs/integrations/document_loaders/readthedocs_documentation) to load the docs from the `docs` folder.

 At the time of creating this notebook, `423` documents were downloaded. However, since the documentation is updated regularly, this number might be different for you.

In [None]:
from langchain.document_loaders import ReadTheDocsLoader
#### your code ####
loader = ReadTheDocsLoader('docs')
docs = loader.load()
#### your code ####
len(docs)

423

Let's take a look at one of the documents. You can see that LangChain has created a `Document` object. Look at the example below and fill in the cells to print out the text content and URL of the page (the URL of the page should start with `https://`).

In [None]:
docs[10]

Document(page_content='langchain_mistralai 0.0.1¶\nlangchain_mistralai.chat_models¶\nClasses¶\nchat_models.ChatMistralAI\nA chat model that uses the MistralAI API.\nFunctions¶\nchat_models.acompletion_with_retry(llm[,\xa0...])\nUse tenacity to retry the async completion call.', metadata={'source': 'docs/api.python.langchain.com/en/stable/mistralai_api_reference.html'})

In [None]:
#### your code ####
page_content= docs[10].page_content
page_url=docs[10].metadata['source'].replace('docs/', 'https://')
#### your code ####
print(page_content)
print(page_url)

langchain_mistralai 0.0.1¶
langchain_mistralai.chat_models¶
Classes¶
chat_models.ChatMistralAI
A chat model that uses the MistralAI API.
Functions¶
chat_models.acompletion_with_retry(llm[, ...])
Use tenacity to retry the async completion call.
https://api.python.langchain.com/en/stable/mistralai_api_reference.html


As you can imagine, the documents can be long, and if several of them are required as context to answer a question, we need to take the document lengths into account.
This is because language models do not have an unlimited context window. In our case, we plan to use Llama2, where the maximum token limit is 4096. This limit covers not only the input but also the generated output, and you need to leave room for the query and instructions as well. Therefore, it is important to chunk the longer documents into smaller fragments.

Based on your use case and how many contexts you plan to feed into the model, the length of these fragments will differ.
In this case, we assign 2000 tokens to the context and generate the answer from 5 context fragments, which leaves us with 400 tokens per context fragment as the maximum chunk size.
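
The budget arithmetic above can be written out as a small sketch. The constant names are hypothetical; the numbers are the ones stated in the text.

```python
# Assumed budget split, following the numbers in the text above.
CONTEXT_WINDOW = 4096   # Llama 2 maximum token limit
CONTEXT_BUDGET = 2000   # tokens reserved for retrieved context
NUM_FRAGMENTS = 5       # context fragments passed to the model

max_chunk_size = CONTEXT_BUDGET // NUM_FRAGMENTS
remaining = CONTEXT_WINDOW - CONTEXT_BUDGET

print(f"Maximum chunk size: {max_chunk_size} tokens")
print(f"Left for instructions, query, and answer: {remaining} tokens")
```

Changing the number of fragments or the context budget shifts the maximum chunk size accordingly, so these values are a design choice rather than a fixed rule.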

To count the number of tokens in a chunk, we need to load the correct tokenizer for Llama2. Fill the code cell below to load the correct tokenizer and use it to complete the function that counts the number of tokens per given chunk.

**Hint:** you need to use your Hugging Face authentication token to load the tokenizer.

In [None]:
#If you get an error here during the first import from the `transformers` package, restart the kernel and try again.
#### your code ####
from transformers import LlamaTokenizer
hf_auth = os.environ.get('HF_AUTH')
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf",use_auth_token=hf_auth)
#### your code ####

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
def token_len(text):
  #### your code ####
    tokens = tokenizer.encode(text)
    return len(tokens)
    #### your code ####

Count the number of tokens for all documents and use it to compute minimum, maximum, and average token count statistics across all documents. Depending on how the documentation is updated by the time you run the cell below the numbers might slightly differ.

In [None]:
#### your code ####
token_counts = [token_len(doc.page_content) for doc in docs]
min_tokens=min(token_counts)
avg_tokens=int(sum(token_counts) / len(token_counts))
max_tokens=max(token_counts)
#### your code ####
print(f"""Min: {min_tokens}
Avg: {avg_tokens}
Max: {max_tokens}""")

Min: 48
Avg: 2663
Max: 36800


Now we will use LangChain's built-in chunking functionality to split the text into smaller chunks. LangChain offers a variety of text splitters that you can check out [here](https://api.python.langchain.com/en/latest/langchain_api_reference.html#module-langchain.text_splitter).
Use the general-purpose splitter that splits text by recursively looking at characters. Use this class to split the text into chunks of at most 400 tokens, where the length of each chunk is computed with the `token_len` function. Length is not the only criterion for splitting: the splitter tries the separators `'\n\n', '\n', ' ', ''` in this order and only falls back to the next one when a piece is still too long, so chunks preferably end at paragraph or line boundaries.
Since splitting at a maximum length can cut related text apart, let consecutive chunks overlap by 50 tokens. This way, we preserve some of the previous context while chunking.
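
To illustrate what chunk overlap means, here is a simplified sketch that uses fixed-size sliding windows over a token list. It is not the recursive splitter asked for in this task (it ignores separators entirely), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of at most `chunk_size` tokens,
    where each window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Toy example: whitespace "tokens" instead of the Llama 2 tokenizer.
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in sliding_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each printed window starts with the last word of the previous one, which is the same idea as the 50-token overlap used here, only at a much smaller scale.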

In [None]:
#### your code ####
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    length_function=token_len,
    separators=['\n\n', '\n', ' ', '']
)
#### your code ####

In [None]:
chunks = text_splitter.split_text(docs[100].page_content)
len(chunks)

4

In [None]:
token_len(chunks[0]), token_len(chunks[1])

(344, 256)

The next step is to apply the splitting function to all the documents in our corpus and to save our chunks in a logical way. We also want to assign a unique ID to each chunk so we know which part of the documentation they come from. In the end, the corpus should be transformed into a list of dictionaries of the following format:


```
[
    {
        "id": "glossary-0",
        "text": "first chunk of the document glossary",
        "source": "https://langchain.readthedocs.io/en/latest/glossary.html"
    },
    {
        "id": "glossary-1",
        "text": "second chunk of glossary",
        "source": "https://langchain.readthedocs.io/en/latest/glossary.html"
    }
    ...
]
```

Construct the IDs by taking the name of the page before the suffix `.html` and appending a sequential number indicating which chunk it is.


In [None]:
from tqdm.auto import tqdm

documents = []

for doc in tqdm(docs):
  #### your code ####
    url = doc.metadata['source']
    uid =url.split("/")[-1].replace(".html","")
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({
            'id': f'{uid}-{i}',
            'text': chunk,
            'source': url
        })
  #### your code ####
len(documents) # once again this value might differ based on how the LangChain documentation is updated

  0%|          | 0/423 [00:00<?, ?it/s]

4065

For the next steps, we require a `DataFrame`.

In [None]:
import pandas as pd
data = pd.DataFrame(documents)
data.head()

| | id | text | source |
|---|---|---|---|
| 0 | langchain_api_reference-0 | langchain 0.1.0¶\nlangchain.agents¶\nAgent is ... | docs/api.python.langchain.com/en/latest/langch... |
| 1 | langchain_api_reference-1 | Base Single Action Agent class.\nagents.agent.... | docs/api.python.langchain.com/en/latest/langch... |
| 2 | langchain_api_reference-2 | [Deprecated] Chat Agent.[Deprecated] Chat Age... | docs/api.python.langchain.com/en/latest/langch... |
| 3 | langchain_api_reference-3 | [Deprecated] Agent for the MRKL chain.[Deprec... | docs/api.python.langchain.com/en/latest/langch... |
| 4 | langchain_api_reference-4 | Parses a message into agent action/finish.\nag... | docs/api.python.langchain.com/en/latest/langch... |


✅ Point distribution ✅
- 0.5 point if the documents are correctly loaded with `ReadTheDocsLoader`.
- 0.5 point if the page content and URL are extracted from the `Document` object.
- 0.5 point if the tokenizer is correctly initialized.
- 0.25 point if the `token_len` function is correct.
- 0.75 point to compute the minimum, maximum and average length of all documents.
- 1 point for using the correct text splitter and using the correct parameters. For each mistake, deduct 0.25 point.
- 1 point for converting the documents into chunks and list of dictionaries.



#### ${\color{red}{Comments\ 1.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.2: Document Embedding Pipeline


In this task, we initialize the embedding pipeline to transform the chunks into vector embeddings using Hugging Face and LangChain. These embeddings are used for similarity search between the query and the chunks to retrieve the most relevant chunks.
We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding, which is a rather small model that you can easily run on Colab. Initialize the model using `HuggingFaceEmbeddings` to use Hugging Face via LangChain. The encoding batch size should be 32. Make sure that the model is placed on the correct device; otherwise, this can take a long time.
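
As a reminder of what similarity search over embeddings computes, the following sketch implements cosine similarity in plain Python. The vectors are made up for illustration and have three dimensions only; they are not outputs of the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" for illustration.
query_vec = [1.0, 0.0, 1.0]
chunk_a = [2.0, 0.0, 2.0]   # same direction as the query
chunk_b = [0.0, 1.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query_vec, chunk_a))
print(cosine_similarity(query_vec, chunk_b))
```

A chunk pointing in the same direction as the query scores close to 1 regardless of its length, while an orthogonal chunk scores 0, which is why cosine similarity is a common choice for ranking text embeddings.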

In [None]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import os
import pinecone
from tqdm import tqdm

In [None]:
embedding_model = 'sentence-transformers/all-MiniLM-L6-v2'
device = 'cuda:0' # make sure you are on gpu
docs = [
    "An example document",
    "A second document as an example"
]
### your code ###
embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)
### your code ###

Embed the example documents using the model you created and check the output.
The output should be a list of lists, containing the embeddings.

In [None]:
### your code ###
embeddings = embed_model.embed_documents(docs)
### your code ###
print("number of docs:",len(embeddings))
print("dimension of docs:",len(embeddings[0]))

number of docs: 2
dimension of docs: 384


Now we use the embedding pipeline created above to store the embeddings in a Pinecone vector index. First, let's set up the Pinecone environment: collect your API key and environment name from the environment variables, and initialize Pinecone with them.

In [None]:
### your code ###
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') ,
    environment=os.environ.get('PINECONE_ENVIRONMENT')
)
### your code ###

Initialize the index `rag-assignment` inside Pinecone. Use cosine similarity as the similarity metric. Keep in mind that if you run this multiple times on the free tier, where only one index is allowed, you need to delete the existing index to make room for a new one (a Pinecone index gets archived automatically after 14 days of inactivity).

In [None]:
index_name = 'rag-assignment'
### your code ###
pinecone.create_index(
    index_name,
    dimension=len(embeddings[0]),
    metric='cosine'
)
### your code ###

Let's take a look at the index you created. As of now, the index should be empty but have the correct embedding dimension.

In [None]:
index_name = 'rag-assignment'
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Process the dataset in batches of `32` and push the vectors to the Pinecone index. Your index should include the IDs and embeddings for each chunk. As metadata, pass the original text as `text` and the URL as `source` (no need to add the `https`). We use this metadata later to retrieve the original text.
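
The batching itself can be sketched independently of Pinecone. The helper `batch_ranges` below is hypothetical and only shows how the index ranges are formed; the count of 4065 chunks is the one reported earlier in this notebook and may differ for you.

```python
import math


def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs that cover n_items in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


# 4065 chunks is the count reported above; your number may differ.
ranges = list(batch_ranges(4065, 32))
print(len(ranges), "batches; last batch covers", ranges[-1])
assert len(ranges) == math.ceil(4065 / 32)
```

The last batch is usually smaller than the batch size, which is why the end index is clipped with `min`.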

In [None]:
batch_size = 32

for i in tqdm(range(0, len(data), batch_size)):
    ### your code ###
    i_end = min(len(data), i + batch_size)
    batch = data.iloc[i:i_end]
    ids = [str(x['id']) for _, x in batch.iterrows()]
    texts = [x['text'] for _, x in batch.iterrows()]
    # embed the chunk texts of this batch
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['text'],
         'source': x['source']} for _, x in batch.iterrows()
    ]
    # upsert (id, embedding, metadata) triples into the index
    index.upsert(vectors=zip(ids, embeds, metadata))
    ### your code ###


  0%|          | 0/128 [00:00<?, ?it/s]

Now if we look at the index statistics we should have vectors of dimension `384`.

In [None]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.03948,
 'namespaces': {'': {'vector_count': 3948}},
 'total_vector_count': 3948}

✅ Point distribution ✅
- 0.5 point if `HuggingFaceEmbeddings` is correctly initialized.
- 0.5 point if examples are correctly processed with `HuggingFaceEmbeddings`.
- 0.5 point if Pinecone is initialized with the environment variables.
- 0.5 point if the Pinecone index is correctly created with the metric specified.
- 1 point if the passages are converted to vectors and are pushed into Pinecone.


#### ${\color{red}{Comments\ 1.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.3: Text Generation Pipeline


So far we have our index ready and a way to find the most similar chunks to our query. Now, we need a way to generate the answer from the retrieved chunks. For this purpose, we use the `text-generation` pipeline from Hugging Face (refer to the Hugging Face [tutorial](https://moodle.uni-heidelberg.de/pluginfile.php/1286642/mod_resource/content/1/HuggingFace.ipynb)) and load it into LangChain using a wrapper.

In [None]:
from torch import cuda, bfloat16
import os
import transformers
model_id = 'meta-llama/Llama-2-13b-chat-hf'

Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models that you normally wouldn't be able to fit into memory, and it can speed up inference.
To make the process of model quantization more accessible, Hugging Face has seamlessly integrated with the [Bitsandbytes](https://huggingface.co/docs/accelerate/usage_guides/quantization) library.

Define a `BitsAndBytesConfig` that enables 4-bit quantization and sets the 4-bit quantization type to `nf4`, the normalized float 4 (NF4) data type, so that each weight carries 4 bits of information.
Additionally, set the compute data type to bfloat16, so that the weights are stored in 4 bits but the computation happens in 16 bits.
Moreover, set `bnb_4bit_use_double_quant` to `True` to enable nested quantization, which applies a second quantization after the first one to save an additional 0.4 bits per parameter.
Refer to [here](https://huggingface.co/docs/transformers/main_classes/quantization) for more information.
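
To get a feeling for why quantization matters here, the following back-of-the-envelope sketch estimates the weight storage for different precisions. It is a rough estimate only: it assumes about 13 billion parameters and ignores layers that stay in higher precision as well as the quantization constants. The helper name `weight_memory_gb` is hypothetical.

```python
# Rough estimate only: assumes about 13 billion parameters and ignores
# layers that stay in higher precision as well as quantization constants.
N_PARAMS = 13e9


def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{name:>8}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

Under these assumptions, only the lower-precision variants come close to fitting into the memory of a typical single GPU.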

In [None]:
  ### your code ###
bitsAndBites_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
  ### your code ###

Use your Hugging Face token to load the correct model configuration using the `transformers` library.

In [None]:
### your code ###

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=os.environ.get('HF_AUTH')
)
### your code ###




Load the model for text generation (pay attention to the model type) using the configuration you have defined, with the specified quantization, and set the `trust_remote_code` flag to `True`. Another flag that is useful for large models is `device_map="auto"`. By setting this flag, Accelerate will determine where to put each layer to maximize the use of GPUs and offload the rest to the CPU, or even the hard drive if you don’t have enough GPU RAM (or CPU RAM).

It will take a while for the model to download.

In [None]:
#Loading the model will take some time, (roughly 5 min)
### your code ###
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bitsAndBites_config,
    device_map='auto',
    token=os.environ.get('HF_AUTH')
)
### your code ###
model.eval()# we only use the model for inference
print(f"Model loaded ")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



Model loaded 


You can even check the memory footprint of your model using the `get_memory_footprint` method.


In [None]:
model.get_memory_footprint()

7083970560

The next thing we need to do is initialize a `text-generation` pipeline with Hugging Face that uses the Llama2 model to generate some text, given some input. We will then use this pipeline inside LangChain to build our question-answering system.
The `text-generation` pipeline generates text from a language model conditioned on a given input. The pipeline is similar to other Hugging Face pipelines and requires two things that we must initialize:

1.   A language model, in this case, it will be `meta-llama/Llama-2-13b-chat-hf`.
2.   A tokenizer for the language model.

LangChain expects the full-text outputs; therefore, set `return_full_text` to `True`. You can also pass additional generation parameters to the model.
Since we want the questions to be answered mainly based on the retrieved chunks, let's set the model temperature to a low value of 0.01 to reduce randomness. Additionally, add a repetition penalty of 1.1 to stop the model from repeating itself, and set the maximum number of generated tokens to 512.
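
The effect of a low temperature can be illustrated with a small sketch of a temperature-scaled softmax. The logits are made up, and the function name `softmax_with_temperature` is hypothetical; the sketch only shows the general idea, not how the pipeline implements sampling internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for temperature in (1.0, 0.01):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With a temperature of 0.01, almost all probability mass falls on the highest-scoring token, so the generation becomes close to deterministic.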

In [None]:
### your code ###
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    # we pass model parameters here too
    temperature=0.01,
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
### your code ###

We provide the language model a general question to make sure our pipeline is working correctly.

In [None]:
sample_input="Explain to me the difference between alligator and crocodile."
### your code ###
res = generate_text(sample_input)
generated_text=res[0]["generated_text"]
### your code ###
print(generated_text)

Explain to me the difference between alligator and crocodile.
Alligators and crocodiles are both large, carnivorous reptiles that live in wetlands and rivers, but there are several key differences between them. Here are some of the main differences:

1. Appearance: Alligators have a wider, rounder snout compared to crocodiles, which have a longer, thinner snout. Alligators also have a more rounded body shape and shorter legs than crocodiles.
2. Habitat: Alligators are found only in freshwater environments such as lakes, rivers, and swamps, while crocodiles can be found in both freshwater and saltwater environments.
3. Geographic range: Alligators are only found in the southeastern United States and China, while crocodiles are found in many parts of the world, including Africa, Asia, Australia, and the Americas.
4. Behavior: Alligators are generally less aggressive than crocodiles and tend to avoid confrontations with humans. Crocodiles, on the other hand, are known for their aggressive

Use the LangChain Hugging Face wrapper, a building block of an [LLM chain](https://python.langchain.com/docs/modules/chains/foundational/llm_chain), to create an interface for the text generation pipeline.

In [None]:
### your code ###
from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=generate_text)
### your code ###

To confirm that it works the same way, use the sample input to generate text using the llm chain. The input should be passed as the `prompt` to the language model.

In [None]:
### your code ###
llm(prompt=sample_input)
### your code ###



'\nAlligators and crocodiles are both large, carnivorous reptiles that live in wetlands and rivers, but there are several key differences between them. Here are some of the main differences:\n\n1. Appearance: Alligators have a wider, rounder snout compared to crocodiles, which have a longer, thinner snout. Alligators also have a more rounded body shape and shorter legs than crocodiles.\n2. Habitat: Alligators are found only in freshwater environments such as lakes, rivers, and swamps, while crocodiles can be found in both freshwater and saltwater environments.\n3. Geographic range: Alligators are only found in the southeastern United States and China, while crocodiles are found in many parts of the world, including Africa, Asia, Australia, and the Americas.\n4. Behavior: Alligators are generally less aggressive than crocodiles and tend to avoid confrontations with humans. Crocodiles, on the other hand, are known for their aggressive behavior and have been responsible for many human att

✅ Point distribution ✅
- 1 point if the parameters for `bitsAndBites_config` are correct.
- 0.5 point if the configuration file is correctly initialized.
- 1 point if the generation model is correct.
- 0.5 point if the question is answered using the Hugging Face pipeline.
- 0.5 point if the LLM chain containing the Hugging Face pipeline is correct.
- 0.5 point if the question is answered using the LLM chain.



#### ${\color{red}{Comments\ 1.3}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.4: Question Answering Chain


For Retrieval Augmented Generation (RAG) in LangChain, we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object.

`RetrievalQA` is a chain for question-answering tasks that uses an index to retrieve relevant documents or text chunks. It is suitable for straightforward Q&A applications.

`RetrievalQAWithSourcesChain` is an extension of RetrievalQA that chains together multiple sources of information, providing context and the source for answers.

For both of these, we need an LLM and a Pinecone index. For LangChain to be able to use the Pinecone index, we need to initialize it through the LangChain vector store.

**Hint**: You need to explicitly tell the vector store where to find the original text.

In [None]:
from langchain.vectorstores import Pinecone
### your code ###
vectorstore = Pinecone(index, embed_model.embed_query, 'text')
### your code ###



Let's try a query that is specific to the LangChain documentation and see which chunks are relevant. Use the vector store defined above to find the top-3 chunks related to the given query.

In [None]:
query = 'what is a LangChain Agent?'
### your code ###
vectorstore.similarity_search(
    query,  # the search query
    k=3  # returns top 3 most relevant chunks of text
)
### your code ###

[Document(page_content='langchain 0.0.353¶\nlangchain.agents¶\nAgent is a class that uses an LLM to choose a sequence of actions to take.\nIn Chains, a sequence of actions is hardcoded. In Agents,\na language model is used as a reasoning engine to determine which actions\nto take and in which order.\nAgents select and use Tools and Toolkits for actions.\nClass hierarchy:\nBaseSingleActionAgent --> LLMSingleActionAgent\n                          OpenAIFunctionsAgent\n                          XMLAgent\n                          Agent --> <name>Agent  # Examples: ZeroShotAgent, ChatAgent\nBaseMultiActionAgent  --> OpenAIMultiFunctionsAgent\nMain helpers:\nAgentType, AgentExecutor, AgentOutputParser, AgentExecutorIterator,\nAgentAction, AgentFinish\nClasses¶\nagents.agent.Agent\nAgent that calls the language model and deciding the action.\nagents.agent.AgentExecutor\nAgent that is using tools.\nagents.agent.AgentOutputParser\nBase class for parsing agent output into agent action/finish.\n

Now use the `vectorstore` and `llm` to initialize the `RetrievalQA` object, which showcases question answering over an index. `RetrievalQA` is a document chain; these are useful for summarizing documents, answering questions about documents, extracting information from documents, and more. All such chains support 4 different chain types:


1.   `stuff`: it takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM.
2.   `refine`: it constructs a response by looping over the input documents and iteratively updating its answer. It is well-suited for tasks that require analyzing more documents than can fit in the model’s context.
3. `map_reduce`:  it first applies an LLM chain to each document individually (the Map step), treating the chain output as a new document. It then passes all the new documents to a separate combined documents chain to get a single output (the Reduce step).
4. `map_re_rank`: it runs an initial prompt on each document that not only tries to complete a task but also gives a score for how certain it is in its answer. The highest-scoring response is returned.

For this assignment, we focus only on the first type. Make sure to set `verbose` to `True`, so we can see the different stages of processing that happen while answering a question (you might need to set this parameter more than once). As mentioned before, we want our retriever to pass the top-5 most similar chunks for the query to the model to generate an answer.
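
Conceptually, the `stuff` type amounts to concatenating the retrieved chunks into a single prompt. The sketch below approximates this in plain Python: the instruction sentence is the one printed in the verbose chain output of this notebook, while the rest of the layout (separators, `Question:` and `Helpful Answer:` labels) and the helper name `build_stuff_prompt` are assumptions for illustration.

```python
# Approximation of a "stuff" prompt: the instruction sentence is the one
# printed in the verbose chain output; the rest of the layout is assumed.
INSTRUCTION = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)


def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one prompt for a single LLM call."""
    context = "\n\n".join(chunks)
    return f"{INSTRUCTION}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"


prompt = build_stuff_prompt(
    ["Agent is a class that uses an LLM to choose a sequence of actions to take.",
     "Agents select and use Tools and Toolkits for actions."],
    "what is a LangChain Agent?",
)
print(prompt)
```

Because every chunk ends up in the same prompt, this approach only works as long as all chunks together fit into the context window, which is the reason for the 400-token chunk size chosen earlier.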

In [None]:
from langchain.chains import RetrievalQA
### your code ###

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    verbose=True,
    retriever=vectorstore.as_retriever(search_kwargs={"k":5}),
    chain_type_kwargs={
        "verbose": True },

)

### your code ###
query='what is a LangChain Agent?'

First, we try to answer the question using only Llama2. As you can see, the answer is not convincing, because the model does not have access to the LangChain documentation.

In [None]:
llm(query)

'\n\nA LangChain Agent is an intelligent agent that uses natural language processing (NLP) and machine learning (ML) techniques to assist users in finding relevant information on the web. It is designed to help users navigate the vast amount of information available online by providing personalized recommendations and answers to their questions.\n\nThe name "LangChain" refers to the idea of linking together different languages and knowledge sources to create a comprehensive and coherent view of the world. The agent is able to understand and respond to user queries in multiple languages, and it can draw upon a wide range of sources, including text, images, videos, and other forms of media, to provide accurate and relevant results.\n\nSome of the key features of a LangChain Agent include:\n\n1. Natural Language Processing (NLP): The agent is able to understand and interpret natural language queries, allowing users to ask questions in everyday language.\n2. Machine Learning (ML): The agen

Now use the pipeline from above and see how the answer changes.

In [None]:
### your code ###
rag_pipeline(query)
### your code ###


> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

langchain 0.0.353¶
langchain.agents¶
Agent is a class that uses an LLM to choose a sequence of actions to take.
In Chains, a sequence of actions is hardcoded. In Agents,
a language model is used as a reasoning engine to determine which actions
to take and in which order.
Agents select and use Tools and Toolkits for actions.
Class hierarchy:
BaseSingleActionAgent --> LLMSingleActionAgent
                          OpenAIFunctionsAgent
                          XMLAgent
                          Agent --> <name>Agent  # Examples: ZeroShotAgent, ChatAgent
BaseMultiActionAgent  --> OpenAIMultiFunctionsAgent
Main helpers:
AgentType, AgentExecutor, Agent

{'query': 'what is a LangChain Agent?',
 'result': ' A LangChain Agent is a piece of software that uses a chain of language models (LLMs) to perform tasks. It takes in user input, passes it through a series of LLMs, and then uses the output of the final LLM to determine what action to take. The agent can also use tools and toolkits to help it perform its tasks.\n\nPlease let me know if you need any further information or clarification.'}

✅ Point distribution ✅
- 0.5 point to correctly initialize a vector store.
- 1 point for performing the top-3 similarity search.
- 1 point for correct initialization of the `rag_pipeline` with all the parameters. If `verbose` is not set correctly, deduct only 0.25 points.
- 0.5 for correct use of the pipeline to answer the question.


#### ${\color{red}{Comments\ 1.4}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 1.5: Conversational Retrieval Chain




We can also extend our retrieval chain so that it remembers the previous questions and answers the current question in light of the previous context.
The important part of a conversational model is conversation memory, which lets an otherwise stateless language model take previous interactions into account, similar to ChatGPT. In this subtask, we will use LangChain to create a conversational memory.


To implement the memory we use `ConversationalRetrievalChain`.
This chain takes in the chat history (a list of messages) and a new question, and then returns an answer to that question. The algorithm for this chain consists of three parts:

1. Use the chat history and the new question to create a new question that contains the information from the previous context.

2. This new question is passed to the retriever and relevant documents are returned.

3. The retrieved documents are passed to an LLM to generate a final response.
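The three steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the `ConversationalRetrievalChain` source: `conversational_answer`, `format_history`, `toy_llm`, and `toy_retrieve` are hypothetical names, the prompts are shortened, and the toy stand-ins replace the real model and vector store so that the sketch runs on its own.

```python
def format_history(chat_history):
    """Render (question, answer) pairs as a Human/Assistant transcript."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)


def conversational_answer(question, chat_history, llm, retrieve):
    # Step 1: rewrite the follow-up into a standalone question.
    # With an empty history there is nothing to condense, so the question is used as-is.
    if chat_history:
        standalone = llm(
            "Rephrase the follow up question to be a standalone question.\n\n"
            f"Chat History:\n{format_history(chat_history)}\n"
            f"Follow Up Input: {question}\nStandalone question:"
        )
    else:
        standalone = question
    # Step 2: retrieve documents for the standalone question.
    docs = retrieve(standalone)
    # Step 3: answer the standalone question from the retrieved documents.
    context = "\n\n".join(docs)
    answer = llm(f"Context:\n{context}\n\nQuestion: {standalone}\nHelpful Answer:")
    return standalone, answer


# Toy stand-ins (hypothetical) so the sketch runs without a model or an index.
def toy_llm(prompt):
    if prompt.startswith("Rephrase"):
        return "What are tools and toolkits of a LangChain Agent?"
    return "Answer based on: " + prompt.split("\n")[1]


def toy_retrieve(query):
    return ["Agents select and use Tools and Toolkits for actions."]


history = [("what is a LangChain Agent?", "An agent uses an LLM to choose actions.")]
standalone, answer = conversational_answer(
    "What are tools and toolkits?", history, toy_llm, toy_retrieve
)
print(standalone)
print(answer)
```

The key design point is that retrieval runs on the rewritten question, so a follow-up that only makes sense in context can still find relevant chunks.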

In [None]:
from langchain.chains import ConversationalRetrievalChain
chat_history = []

### your code ###
qa_conversation = ConversationalRetrievalChain.from_llm(llm=llm,
    verbose=True,
    retriever=vectorstore.as_retriever(search_kwargs={"k":5}),
)
result = qa_conversation({"question": query, "chat_history": chat_history})
### your code ###

> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Lazily import lmformatenforcer.
llms.rellm_decoder.import_rellm()
Lazily import rellm.
langchain_experimental.open_clip¶
Classes¶
open_clip.open_clip.OpenCLIPEmbeddings
Create a new model by parsing and validating input data from keyword arguments.
langchain_experimental.pal_chain¶
Implements Program-Aided Language Models.
As in https://arxiv.org/pdf/2211.10435.pdf.
This is vulnerable to arbitrary code execution:
https://github.com/langchain-ai/langchain/issues/5872
Classes¶
pal_chain.base.PALChain
Implements Program-Aided Language Models (PAL).
pal_chain.base.PALValidation([...])
Initialize a PALValidation instance.
langchain_experimental.plan_and_execute¶
Classes¶
plan_and_execute.agent_exe

In [None]:
result["answer"]

' A LangChain Agent is a program that uses a LangChain model to perform some task. It can be thought of as a "bot" that uses natural language processing to interact with users or other systems. The agent can be trained on a specific task, such as answering questions or generating text, and can be integrated with various resources and services to perform more complex tasks.'

Change the chat history to contain the previous question and answer pair and ask a follow-up question.  

In [None]:
follow_up="What are tools and toolkits?"

### your code ###
chat_history = [(query, result["answer"])]
result = qa_conversation({"question":follow_up , "chat_history": chat_history})
### your code ###

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

H: what is a LangChain Agent?
Assistant:  A LangChain Agent is a program that uses a LangChain model to perform some task. It can be thought of as a "bot" that uses natural language processing to interact with users or other systems. The agent can be trained on a specific task, such as answering questions or generating text, and can be integrated with various resources and services to perform more complex tasks.
Follow Up Input: What are tools and toolkits?
Standalone question:

> Finished chain.


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, jus

This is the previous context that was fed in alongside the new question.

In [None]:
chat_history

[('what is a LangChain Agent?',
  ' A LangChain Agent is a program that uses a LangChain model to perform some task. It can be thought of as a "bot" that uses natural language processing to interact with users or other systems. The agent can be trained on a specific task, such as answering questions or generating text, and can be integrated with various resources and services to perform more complex tasks.')]

The chain can answer the current question because it knows from the chat history that "tools and toolkits" refers to LangChain Agents, the topic of the previous question.

In [None]:
result['answer']

'\n\nTools and toolkits in the context of LangChain Agents refer to pre-built integrations with various external resources like file systems, APIs, and databases. These integrations allow developers to create versatile applications that combine the power of LLMs with the ability to access, interact with, and manipulate external resources. The toolkits provide a set of classes and functions that enable developers to interact with specific resources or services, such as AINetwork Blockchain, Amadeus, Azure Cognitive Services, and more. By using these toolkits, developers can focus on building their applications without having to build everything from scratch, saving time and effort.'

✅ Point distribution ✅
- 1 point for correctly initializing the conversational chain.
- 0.5 point for updating the history and asking a follow-up question.



#### ${\color{red}{Comments\ 1.5}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## **Task 2: Advanced RAG Techniques and Evaluation (4 + 5 = 9 points)**

Now that you have successfully implemented your first RAG system, we dive into more advanced techniques and learn how to evaluate your methods using metrics you learned during the lecture. We focus on evaluation with an already annotated dataset. To this end, we load a small subset of [NarrativeQA](https://huggingface.co/datasets/narrativeqa), which is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents. We only load 30 samples from the data because, as you will see in the upcoming sections, answer generation takes quite some time. In a real setting, it is advisable to use a much larger set to obtain statistically significant results.

In [None]:
from datasets import load_dataset
dataset = load_dataset("satyaalmasian/narrativeqa_subset",split="train[:30]")
len(dataset)

30

Since we already used our free index in Pinecone for the previous task, we use Chroma, an open-source vector database, instead. As opposed to Pinecone, Chroma creates a collection on your machine.

In [None]:
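from langchain.docstore.document import Document

# Collect the story texts, questions, and reference answers from the dataset.
documents = [doc["text"] for doc in dataset["document"]]
questions = [quest for quest in dataset["question"]]
answers = [ans for ans in dataset["answers"]]
# Several questions share the same story, so keep each story only once.
documents = list(set(documents))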

In [None]:
docs= [Document(page_content=doc, metadata={"source": "local"}) for doc in documents]

The number of documents is smaller than the number of questions and answers, because each document serves as the reference for multiple questions:

In [None]:
print(len(docs))
print(len(questions))

2
30


## Subtask 2.1: Build Contextual Compression in LangChain

Let's split our documents using the TextSplitter from Task 1 and embed them inside the Chroma database with the embedding model of the previous task.

In [None]:
### your code ###
all_splits = text_splitter.split_documents(docs)
### your code ###

In [None]:
from langchain.vectorstores import Chroma
### your code ###
vectordb = Chroma.from_documents(documents=all_splits, embedding=embed_model, persist_directory="chroma_db")
retriever =vectordb.as_retriever(search_kwargs={"k": 5})
### your code ###

In [None]:
print("Example question in the set:",questions[2]['text'])
r_docs = retriever.get_relevant_documents(questions[2]['text'])
r_docs

Example question in the set: Why do more students tune into Mark's show?


[Document(page_content="Reporter #2 - Are you on drugs?\n\nPaige - Arrrgh. Talk Hard. Arrrrrgh.\n\nMark - I've got a lot of homework I'm gonna take off alright.\n\nMarla - Mark I know why your really going home. It's because you wanna listen to that \nshow tonight don't you?\n\n<Play Peter Murphy>\n\n<Nora goes to Marks house where she finds him burning his Happy Harry Hardon \nletters>\n\nNora - Hi! What are you doing? You having fun?\n\nMark - Yeah.\n\nNora - Hey, look I took some of these off the wall for you. I mistakingly thought you \nmight want them.\n\nMark - Thanks.\n\nNora - So I guess you're not going on tonight.\n\nMark - Brilliant.\n\nNora - Is this all just a game to you. You know you can't just shout fire in a theatre and \nwalk out. You have a responsibility for the people who believe in you. What is this? \nC'mon say something, say anything. Open your mouth and say get the hell out of here \nbitch.\n\nMark - I can't.\n\nNora - You can't what?\n\nMark - I can't talk.\n\

First, make a simple RAG pipeline that works on top of the Chroma retriever. This pipeline should be similar to the one from the previous task. However, since we want to use it for a large number of questions, remove the `verbose` parameters.

In [None]:
from langchain.chains import RetrievalQA
### your code ###
rag_simple= RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)
### your code ###

We look at an example question and compare the answer by RAG to the gold answer from the dataset. Note that the answers can contain multiple lines.

In [None]:
rag_simple(questions[2]['text']) #ignore the warning



{'query': "Why do more students tune into Mark's show?",
 'result': ' Mark has become a symbol of rebellion against the strict rules and expectations of the school. His show provides an outlet for students to express their frustrations and desires, and he has gained a reputation as someone who is willing to challenge the status quo. As a result, many students tune in to his show as a way to connect with someone who understands their feelings and desires.'}

In [None]:
answers[2]

[{'text': 'Mark talks about what goes on at school and in the community.',
  'tokens': ['Mark',
   'talks',
   'about',
   'what',
   'goes',
   'on',
   'at',
   'school',
   'and',
   'in',
   'the',
   'community',
   '.']},
 {'text': 'Because he has a thing to say about what is happening at his school and the community.',
  'tokens': ['Because',
   'he',
   'has',
   'a',
   'thing',
   'to',
   'say',
   'about',
   'what',
   'is',
   'happening',
   'at',
   'his',
   'school',
   'and',
   'the',
   'community',
   '.']}]

Apply the `rag_simple` pipeline to all the questions in your corpus and accumulate the answers. **It should take around 10 minutes on a T4 GPU on Colab**.

In [None]:
from tqdm import tqdm  # progress bar for the loop below

simple_answers=[]
### your code ###
for quest in tqdm(questions):
  simple_answers.append(rag_simple(quest['text'])['result'])
### your code ###

100%|██████████| 30/30 [12:03<00:00, 24.10s/it]


Libraries such as LangChain and [LlamaIndex](https://www.llamaindex.ai/) provide a variety of retrieval strategies for building a RAG system. In this subtask, you will use one of these variations called **contextual compression**. This method aims to extract only the relevant information from the retrieved documents, which reduces the amount of irrelevant text passed to the answering language model and can improve response quality. Contextual compression consists of two parts:


1.  **Base retriever:** retrieves the initial set of documents based on the query. This is similar to the retriever from the previous task.
2.   **Document compressor:** processes these documents to extract the relevant content. We use `LLMChainExtractor`, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
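The interplay of the two parts can be sketched in plain Python. This is a simplified illustration, not the `ContextualCompressionRetriever` source: `compress_documents` and `toy_extract` are hypothetical names, the word-overlap rule merely stands in for the LLM call made by `LLMChainExtractor`, and the sketch assumes that documents with no relevant content are dropped.

```python
def compress_documents(docs, query, extract):
    """Keep only the query-relevant part of each document; drop empty results."""
    compressed = []
    for doc in docs:
        relevant = extract(doc, query)
        if relevant:
            compressed.append(relevant)
    return compressed


def toy_extract(doc, query):
    # Hypothetical stand-in for the LLM call: keep sentences sharing a word with the query.
    query_words = set(query.lower().split())
    kept = [s for s in doc.split(". ") if query_words & set(s.lower().split())]
    return ". ".join(kept)


# Made-up output of the base retriever.
retrieved = [
    "Mark hosts a pirate radio show. The weather was cold that night",
    "Nora works in the library. She writes letters",
]
print(compress_documents(retrieved, "radio show", toy_extract))
```

Note that the compressor is applied once per retrieved document, so with an LLM-based extractor each query costs one additional model call per retrieved chunk.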


In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

### your code ###
# The extractor asks the LLM to keep only the query-relevant parts of each chunk.
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=retriever)
### your code ###

Let's take a look at an example of how the compression retriever works.

In [None]:
print("Example question in the set:",questions[2]['text'])
compressed_docs = compression_retriever.get_relevant_documents(questions[2]['text'])
compressed_docs

Example question in the set: Why do more students tune into Mark's show?




[Document(page_content='* "Why do more students tune into Mark\'s show?"\n* "Mark\'s show"\n* "students"', metadata={'source': 'local'}),
 Document(page_content='* "Why do more students tune into Mark\'s show?"\n* "Mark\'s show"\n* "students"', metadata={'source': 'local'}),
 Document(page_content='* "Why do more students tune into Mark\'s show?"\n* "Mark\'s show"\n* "students"', metadata={'source': 'local'}),
 Document(page_content='* Nora got expelled\n* Nora has been cutting lessons\n* Creswood is mentioned as a staff member', metadata={'source': 'local'}),
 Document(page_content='* Nora got expelled\n* Nora has been cutting lessons\n* Creswood is mentioned as a staff member', metadata={'source': 'local'})]

Look at the output and try out several different questions by yourself. Does the compressed output make sense?

Compare this to the previous **simple** approach. Which one, in your opinion, is better?

Finally, we use the new retriever with the Llama 2 model from the previous task to create the contextual compression RAG pipeline. The code below should be similar to what you did in the previous task. Once again, make sure to turn off the `verbose` argument.

In [None]:
### your code ###
from langchain.chains import RetrievalQA

rag_compressor = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever
)
### your code ###


In [None]:
rag_compressor(questions[2]['text'])



{'query': "Why do more students tune into Mark's show?",
 'result': " Because Mark's show is really popular and entertaining!"}

Now we can use the pipeline to generate answers for all the questions in our dataset. **It should take around 20 minutes on a T4 GPU on Colab.**

In [None]:
compressor_answers=[]
### your code ###
for quest in tqdm(questions):
  compressor_answers.append(rag_compressor(quest['text'])['result'])
### your code ###


100%|██████████| 30/30 [25:14<00:00, 50.47s/it]


✅ Point distribution ✅
- 0.5 point if the text is correctly split.
- 1 point for initializing the Chroma db as a retriever and feeding it the documents.
- 0.5 point for the simple RAG pipeline.
- 0.25 point for generating answers with simple RAG.
- 1 point for the correct compressor retriever.
- 0.5 point for the compressor RAG pipeline.
- 0.25 point for generating answers with compressor RAG.



#### ${\color{red}{Comments\ 2.1}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$

## Subtask 2.2: Evaluate

Since we have access to ground truth answers, we can use various evaluation metrics from the literature. In this task, we explore three metrics:


1. **BLEU:** BLEU stands for Bilingual Evaluation Understudy and is a precision-based metric developed for evaluating machine translation. BLEU scores a candidate by computing the fraction of n-grams in the candidate that also appear in a reference. The maximum n can vary; in this task we use n-grams up to n=4.
2. **ROUGE:** ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation and is an F-measure metric designed for evaluating translation and summarization. There are a number of variants of ROUGE.
3. **BERTScore:** BERTScore first obtains a BERT representation of each word in the candidate and the reference by feeding the candidate and the reference through a BERT model separately. An alignment is then computed between candidate and reference words by computing pairwise cosine similarity. This alignment is aggregated into precision and recall scores, which are in turn combined into a (modified) F1 score that can be weighted using inverse-document-frequency values.
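The alignment step of BERTScore can be illustrated with a small sketch. It assumes greedy matching without inverse-document-frequency weighting and uses made-up 2-dimensional vectors in place of real BERT embeddings; `cosine` and `greedy_bertscore` are hypothetical helper names, not functions of the `evaluate` library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_bertscore(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors (no IDF weighting)."""
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs))) for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy "embeddings"; a real BERT model yields contextual vectors
# with hundreds of dimensions per token.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
p, r, f = greedy_bertscore(candidate, reference)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this toy case every candidate token has an exact match, so precision is perfect, while the unmatched reference token lowers recall.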

Luckily, Hugging Face has an implementation for all these metrics. Use the `evaluate` library to load the metrics.

Use the loaded metrics to compare the RAG pipelines from the previous subtask.

In [None]:
import evaluate
### your code ###
bleu = evaluate.load('bleu')
rouge = evaluate.load('rouge')
bertscore = evaluate.load("bertscore")
### your code ###

As seen in the previous subtask, the answers can contain multiple lines. To be able to compare the output of our systems to the gold answers, merge the multiple answers into a single string.

In [None]:
answers_merged=[]
### your code ###
for answer in answers:
  multi_part=[]
  for ans in answer:
    multi_part.append(ans['text'])
  answers_merged.append(' '.join(multi_part))
### your code ###
print(len(answers_merged))

30


Compute the BLEU score for the simple RAG and compressor RAG.

In [None]:
### your code ###
bleu_simple = bleu.compute(predictions=simple_answers, references=answers_merged)
bleu_compressor = bleu.compute(predictions=compressor_answers, references=answers_merged)
### your code ###
print("Simple system:")
print(bleu_simple)
print("Compressor:")
print(bleu_compressor)

Simple system:
{'bleu': 0.0, 'precisions': [0.11001410437235543, 0.010309278350515464, 0.0015408320493066256, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 2.39527027027027, 'translation_length': 709, 'reference_length': 296}
Compressor:
{'bleu': 0.0, 'precisions': [0.1066066066066066, 0.007861635220125786, 0.0016501650165016502, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 2.25, 'translation_length': 666, 'reference_length': 296}


What do the elements below in the output of the BLEU implementation in Hugging Face mean? (Do not copy and paste the documentation but write the implications behind each element!)



1.   **precisions:** `your answer`
2.   **brevity_penalty:** `your answer`
3.   **translation_length:** `your answer`
4.   **reference_length:** `your answer`
5.   **length_ratio:** `your answer`




**Answer:**


1.   **precisions:** the n-gram precisions for n = 1 to 4. Each value is the number of n-grams in the generated answers that also appear in the references (counts are clipped, so an n-gram cannot be credited more often than it occurs in the reference) divided by the total number of n-grams in the generated answers.
2.   **brevity_penalty:** a penalty term for outputs that are shorter than the references; without it, very short outputs could reach a high precision. It is 1 if `translation_length` is larger than `reference_length` and $\exp(1 - r/c)$ otherwise, where $r$ is the reference length and $c$ the translation length, so the score decays exponentially as the output gets shorter. Here both systems produce answers more than twice as long as the references, so the penalty is 1.
3.   **translation_length:** the total number of tokens in the generated answers.
4.   **reference_length:** the total number of tokens in the reference answers.
5.   **length_ratio:** `translation_length` divided by `reference_length`.
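To see how these elements combine into the final score, the following sketch recomputes BLEU from the reported parts. It assumes uniform weights over the four n-gram orders and no smoothing; `bleu_from_parts` is a hypothetical helper, not part of the `evaluate` library. The input values are copied from the "Simple system" output above.

```python
import math


def bleu_from_parts(precisions, translation_length, reference_length):
    """Combine n-gram precisions and lengths into BLEU (uniform weights, no smoothing)."""
    # Brevity penalty: 1 if the output is longer than the reference, else exp(1 - r/c).
    if translation_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_length / translation_length)
    # The geometric mean is zero as soon as a single precision is zero.
    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    else:
        geo_mean = 0.0
    length_ratio = translation_length / reference_length
    return brevity_penalty * geo_mean, brevity_penalty, length_ratio


# Values copied from the "Simple system" output.
simple_precisions = [0.11001410437235543, 0.010309278350515464, 0.0015408320493066256, 0.0]
score, bp, ratio = bleu_from_parts(simple_precisions, 709, 296)
print(score, bp, round(ratio, 4))
```

This explains why both systems receive a BLEU of 0.0: no 4-gram of the generated answers appears in the references, and a single zero precision sets the geometric mean to zero.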

In [None]:
### your code ###
rouge_simple = rouge.compute(predictions=simple_answers,references=answers_merged)
rouge_compressor = rouge.compute(predictions=compressor_answers,references=answers_merged)
### your code ###
print("Simple system:")
print(rouge_simple)
print("Compressor:")
print(rouge_compressor)

Simple system:
{'rouge1': 0.12296939231362755, 'rouge2': 0.018555984555984558, 'rougeL': 0.11096363586174168, 'rougeLsum': 0.11029701643933229}
Compressor:
{'rouge1': 0.12001440874773897, 'rouge2': 0.02168461243058017, 'rougeL': 0.10570344091836636, 'rougeLsum': 0.10562855990595829}


What is the difference between the variants of ROUGE (ROUGE-N, ROUGE-L, ROUGE-SUM)?

`your answer`


**Answer:**

ROUGE measures the similarity between a machine-generated summary and the reference summaries through overlapping units, such as n-grams or word sequences that appear in both. The most common n-grams used are unigrams, bigrams, and trigrams. ROUGE recall is the fraction of reference units that also appear in the machine-generated summary.

**ROUGE-N:** ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the n-gram overlap. For example, ROUGE-1 (unigram) measures the overlap of single words, ROUGE-2 (bigram) measures the overlap of two-word sequences, and so on. ROUGE-1 mainly reflects content overlap, while higher-order variants such as ROUGE-2 are also sensitive to local word order.

**ROUGE-L:** ROUGE-L measures the longest common subsequence (LCS) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the length of the LCS. The LCS respects word order but does not require the matched words to be contiguous, so ROUGE-L rewards long in-order matches without fixing an n-gram length in advance.

**ROUGE-S:** ROUGE-S measures the overlap of skip-bigrams (pairs of words in their sentence order, with gaps allowed between them) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the skip-bigram overlap. It is not part of the Hugging Face output shown above.

**ROUGE-Lsum:** The `rougeLsum` value in the Hugging Face output is the summary-level variant of ROUGE-L: the text is split into sentences at newlines and the LCS is computed sentence by sentence before being aggregated. For short answers without newlines it is nearly identical to ROUGE-L, which matches the values above.
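The difference between n-gram overlap and LCS-based overlap is easiest to see on a small example. The sketch below computes ROUGE-1 and ROUGE-L from scratch on whitespace tokens; it is a simplification (the Hugging Face implementation applies its own tokenization and optional stemming), and `rouge_1`, `rouge_l`, and `_prf` are hypothetical helper names. The two example sentences are made up.

```python
from collections import Counter


def _prf(match, cand_len, ref_len):
    """Turn a match count into precision, recall, and F1."""
    precision, recall = match / cand_len, match / ref_len
    f1 = 2 * precision * recall / (precision + recall) if match else 0.0
    return precision, recall, f1


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall, F1 on whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _prf(overlap, len(cand), len(ref))


def rouge_l(candidate, reference):
    """LCS-based precision, recall, F1 on whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Dynamic-programming table for the length of the longest common subsequence.
    table = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand, 1):
        for j, r in enumerate(ref, 1):
            if c == r:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _prf(table[-1][-1], len(cand), len(ref))


# Same words, different order.
reference = "mark talks about school and the community"
candidate = "the community and school mark talks about"
print(rouge_1(candidate, reference))
print(rouge_l(candidate, reference))
```

Both sentences contain exactly the same words, so ROUGE-1 is perfect, whereas ROUGE-L only credits the longest in-order match ("mark talks about") and is therefore much lower.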



In [None]:
import numpy as np
### your code ###
bertscore_simple = bertscore.compute(predictions=simple_answers, references=answers_merged, lang="en")
bertscore_compressor = bertscore.compute(predictions=compressor_answers, references=answers_merged, lang="en")
bertscore_simple_averaged={}
bertscore_compressor_averaged={}
for key in bertscore_simple.keys():
  if key!='hashcode':
    bertscore_simple_averaged[key]=np.mean(bertscore_simple[key])
    bertscore_compressor_averaged[key]=np.mean(bertscore_compressor[key])

### your code ###
print("Simple system:")
print(bertscore_simple_averaged)
print("Compressor:")
print(bertscore_compressor_averaged)

Simple system:
{'precision': 0.8435829440752666, 'recall': 0.8557040333747864, 'f1': 0.8494029025236766}
Compressor:
{'precision': 0.8397547423839569, 'recall': 0.8533495982487996, 'f1': 0.8463238557179769}


Which model works better?

✅ Point distribution ✅
- 0.5 point for loading the metrics.
- 0.5 point for parsing the answers.
- 0.5 point for the computation of BLEU.
- 0.25 * 5 = 1.25 points for the meaning of each part of the BLEU score.
- 0.5 point for the computation of ROUGE.
- 0.25 * 3 = 0.75 point for the variants of ROUGE.
- 1 point for the computation of BERTScore.



#### ${\color{red}{Comments\ 2.2}}$

${\color{red}{⚠️Comments\ begin⚠️}}$


```
cross-feedback comment section
```


${\color{red}{⚠️Comments\ end⚠️}}$