<a href="https://colab.research.google.com/github/JianxinLin28/Attendify/blob/main/CS646_Fall25_A1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Your name:**

**Your student ID number:**

**Shared link to this notebook:**

# [COMPSCI 646: Information Retrieval - Fall 2025](https://umamherst.instructure.com/courses/29086)
# Assignment 1: BM25 retrieval model (Total: 65 points + 10 Max extra)

**Description**

This assignment focuses on indexing and retrieving documents from a small collection, then using the retrieved results to augment LLM generation (RAG). Basic proficiency in Python is recommended.  

**Instructions**

* To start working on the assignment, first save the notebook to your own Google Drive by clicking the *Copy to Drive* button. Alternatively, click the *Share* button at the top right corner, click *Copy Link* under *Get Link* to get a link, and copy this notebook to your Google Drive.  

* Next, open the copy in your Google Drive and close the original. This way, you'll be working on your own version and won't accidentally waste time on a copy that can't be saved.

*   You can download this notebook (*`File -> Download -> Download .ipynb`*) and run it on your local machine, then upload the result file to Google Colab.

* The following instructions assume you are working in your own copy of the notebook.

*   For questions with descriptive answers, please replace the text in the cell which states "Enter your answer here!" with your answer. If you are using mathematical notation in your answers, please define the variables.

*   For coding questions, you can add code where it says "enter code here" and execute the cell to print the output.



**Submission Details**

* Due date: **Sep. 29, 5pm ET**

* Before starting the interesting part of this assignment, click the *Share* button at the top right of the Colab window. Make sure you use the link for your copy in Google Drive (not the link for the original notebook). Then, go to the first text box above, double click on it, and enter your name, student ID number, and the link (URL) to *your copy* of this notebook.

* To create the final PDF submission file, use *`File -> Print -> Save as PDF`*. Make sure that the generated PDF contains all the code and printed outputs before submission. You are responsible for uploading the correct PDF with all the information required for grading.

* To create the final Python submission file, click on *`File -> Download -> Download .py`*.

* Upload the PDF and Python files to [Gradescope](https://www.gradescope.com/courses/1047975) in the assignment **A1**.


**Academic Honesty**

Please follow the guidelines under the *Collaboration and Help* section of the slides about course policy.

# 0. Prelude

In this section, we provide starter code to install and load the required libraries and dataset. The notebook is expected to run on Google Colab, so all instructions are written for Colab. It can also run on your local machine with small changes to the parameters.

In the following cell, we load the base package and set up the required variables for the assignment.

In [None]:
import os
import sys
import json
from tqdm import tqdm
import pandas as pd

sys.displayhook = lambda x: None # Suppress notebook auto printout

try:
    from google.colab import drive

    in_colab = True
except ImportError:
    in_colab = False

store_local = True # 'True' if you'd like to store the datasets and other files into the Google drive.


if in_colab:
    # Please allow the access to your Google Drive or the following dataset loader will fail.
    drive.mount("/content/drive/")  ## DO NOT MODIFY THIS LINE

    if store_local:
        # Store all assignment related contents into the google drive (persist across colab sessions)
        data_path = "/content/drive/MyDrive/COMPSCI646-F25/A1"  ## Suggest to store all assignment related contents into one folder
    else:
        # Store all assignment related contents into the VM's per-session storage (do not persist across colab sessions)
        data_path = "/content/"  ## Suggest to store all assignment related contents into one folder
else:
    # Store all assignment related contents into the local storage
    data_path = "./data/"  ## Suggest to store all assignment related contents into one folder



assert os.path.exists(
    data_path
), "Change data_path to a valid and existing file path!"


# Configure the dataset filename and storage location. DO NOT MODIFY!!
file_info_dict = {
    "corpus": "1OC3duSpoxnMKveES6tcYX42dXmTk_d0r",
    "queries": "1AUOf1x7HDkdil5QJFTx-VB2fR_rgL-hZ",
    "qrels": "13FEIGGt9Ick-kQvMrWRAeDbPvIvF6MCe",
}

# Feel free to edit the path variable mentioned below, but make sure the target directory exists.
## location for dataset contents
corpus_zip_path = os.path.join(data_path, "corpus.tsv.gz")
corpus_path = corpus_zip_path[: -len(".gz")]  # drop the ".gz" suffix (str.rstrip strips a character set, not a suffix)
queries_path = os.path.join(data_path, "queries.jsonl")
qrels_path = os.path.join(data_path, "qrels.json")

## cache dir for Section 1
jsonl_dir = os.path.join(data_path, "jsonl_collections") # storing the converted corpus entries
index_dir = os.path.join(data_path, "index") #  storing the index components from pyserini

## cache dir and path for Section 4
json_cache_path = os.path.join(data_path, 'hotpotqa_dev_bm25.json') # storing the prepared entries for the LLM prompts
output_dir = os.path.join(data_path, "outputs") # storing the output from the LLM inference



# You are more than welcome to code some helper functions.
# But do note that we are only grading functions that are coded in the template files.

## 0.1 Set Up Packages

In this assignment, we will use the following libraries to index the corpus and perform retrieval.


In [None]:
### Pyserini requires a newer Java version ###
!apt-get install openjdk-21-jdk-headless -qq > /dev/null
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-21-openjdk-amd64"
# JDK 9+ has no jre/ subdirectory; the java binary lives directly under bin/
!update-alternatives --set java /usr/lib/jvm/java-21-openjdk-amd64/bin/java
!java -version

In [None]:
# !pip install --ignore-installed pyserini
!pip install pyserini
!pip install transformers
!pip install faiss-cpu
!pip install datasets
!pip install pytrec_eval
!pip install bitsandbytes

## 0.2 Download the dataset

In this assignment, we will use a subset of the HotpotQA dataset of [KILT](https://ai.meta.com/tools/kilt/): [a Benchmark for Knowledge Intensive Language Tasks](https://arxiv.org/pdf/2009.02252).


**Corpus**

We sample 10% of the original KILT passage collection, in which Wikipedia articles are split into 100-word passages. In total, the sampled corpus contains 3,567,807 passages.


**Queries**

To save time, we sample the original HotpotQA query set. You can access the sampled queries [here](https://drive.google.com/file/d/1AUOf1x7HDkdil5QJFTx-VB2fR_rgL-hZ/view?usp=drive_link).



**Relevance judgment**

For a query (the ```input``` field of a KILT example), every passage in the collection whose ```wikipedia_id``` appears in a ```provenance``` field of the corresponding outputs is considered relevant.

<!-- Note that the ```output``` part of the KILT tasks contains several outputs for the ```input```, and each of them has a ```provenance``` field. The provenance field has a list of features (```wikipedia_id```, ```start_paragraph_id```, ```end_paragraph_id```) representing Wikipedia paragraphs that support the answer. The collection from the [TSV file](#tsv-file) includes ```id, text, wikipedia_title, wikipedia_id``` fields. -->

You can access the relevance judgment file (qrel file) [here](https://drive.google.com/file/d/13FEIGGt9Ick-kQvMrWRAeDbPvIvF6MCe/view?usp=drive_link).
This file is created using information from the KILT tasks and the collection file.

The qrel file has the following format:
```
qrel = {
    'query_id_1': {
        'passage_id_1': 1
    },
    'query_id_2': {
        'passage_id_10': 1,
        'passage_id_99': 1,
    },
}
```

An example entry from the file is:
```
{
      "6915606477668963399": {
            "3894652": 1,
            "3894653": 1,
            "3894654": 1,
            "3894655": 1,
            "3894656": 1,
            "3894657": 1,
            "3894658": 1
      },
...
}
```
which shows passages "3894652", "3894653", "3894654", "3894655", "3894656", "3894657", and "3894658" from the TSV collection are relevant to the query "6915606477668963399".
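
To make the structure concrete, here is a small sketch of how the relevant passage IDs for a query could be read from a qrels dictionary. The names `toy_qrels` and `relevant_passages` are illustrative and not part of the assignment template; the IDs are copied from the example above and truncated to three passages.

```python
# Toy qrels mirroring the example above (truncated to three passages).
toy_qrels = {
    "6915606477668963399": {"3894652": 1, "3894653": 1, "3894654": 1},
}


def relevant_passages(qrels: dict, query_id: str) -> set:
    """Return the IDs of passages judged relevant (label > 0) for a query."""
    return {pid for pid, label in qrels.get(query_id, {}).items() if label > 0}


print(sorted(relevant_passages(toy_qrels, "6915606477668963399")))
```

A query ID that is missing from the dictionary yields an empty set here, which is one possible convention rather than a requirement of the assignment.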

<a name="tsv-file"></a>
**Download**

You can use the code below to download the required files: the sampled KILT Wikipedia corpus, a subset of HotpotQA queries, and the corresponding qrels. The files are saved at `corpus_path`, `queries_path`, and `qrels_path`, which you can use in the rest of the assignment.

In [None]:
def download_file(drive_file_id: str, output_file_path: str):
    """
    Download a required file from Google Drive if it is missing, and decompress it if it is a .gz archive.

    Args:
        drive_file_id: the google drive file id
        output_file_path: the location of file for local storage

    Returns: None
    """
    remote_file_path = f"https://drive.google.com/uc?export=download&id={drive_file_id}"
    if not os.path.isfile(output_file_path):
        print(f'Cannot find "{output_file_path}" at "{data_path}" so downloading it...')
        if output_file_path.endswith(".gz"):
            import gdown
            gdown.download(remote_file_path, output_file_path, quiet=False)
        else:
            import urllib.request
            urllib.request.urlretrieve(remote_file_path, output_file_path)
        print("Download complete!")
    else:
        print(f'File "{output_file_path}" already exists, not downloading.')

    if output_file_path.endswith(".gz"):
        unzip_path = output_file_path[: -len(".gz")]  # drop the ".gz" suffix (str.rstrip strips a character set, not a suffix)
        if os.path.isfile(unzip_path):
            return

        import gzip
        import shutil
        # Decompress with gzip
        print(f'Unzipping "{output_file_path}" to "{unzip_path}"...')
        try:
            with gzip.open(output_file_path, "rb") as f_in, open(unzip_path, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
        except Exception as e:
            print(f"Error during decompression: {e}")
            # Clean up both files if they exist
            for path in [output_file_path, unzip_path]:
                if os.path.exists(path):
                    try:
                        os.remove(path)
                        print(f"Removed {path}...")
                    except Exception as rm_e:
                        print(f"Failed to remove {path}: {rm_e}..")
            raise  # re-raise the exception
        else:
            print("Decompression complete!")
        print("Done!")


# Download Corpus
download_file(drive_file_id=file_info_dict["corpus"], output_file_path=corpus_zip_path)

In [None]:
# Download Queries
download_file(drive_file_id=file_info_dict["queries"], output_file_path=queries_path)

# Download Qrels
download_file(drive_file_id=file_info_dict["qrels"], output_file_path=qrels_path)

We also provide the code to load the queries and qrels into the variables `raw_queries` and `qrels`.

In [None]:
import json

raw_queries = []
with open(queries_path, "r", encoding="utf-8") as f_in:
  for line in f_in:
    raw_queries.append(json.loads(line))

with open(qrels_path) as f:
  qrels = json.load(f)

sample_query_id = "5ae234385542992decbdcc59"
print(f"The number of queries: {len(raw_queries)}")
print("Query Format:\n" + json.dumps(raw_queries[0], indent=4))
print("Qrels Format:\n" + json.dumps({sample_query_id: qrels[sample_query_id]}, indent=4))

# 1. Indexing (20 Points)


In this assignment, you will use the [Pyserini](https://github.com/castorini/pyserini/) toolkit to perform BM25 retrieval on the HotpotQA dataset.

The first step is to index the document collection of the HotpotQA dataset.



## 1.1 Preprocessing the Dataset (10 points)

To index a collection using Pyserini, the first step is to prepare the collection in an appropriate format. A simple format for Pyserini indexing is JSONL files that have two mandatory keys:

```
{"id": "doc1", "contents": "this is the first assignment."}
```

You need to convert the downloaded TSV file into JSONL files.
For this assignment, you only need the following four fields of the TSV file: ```id, text, wikipedia_title, wikipedia_id```.
  
The converted jsonl files for indexing consist of multiple lines in this format:

```
{"id": "1000_0", "contents": "The first passage", "wikipedia_id": "99"}
{"id": "1000_1", "contents": "The second passage", "wikipedia_id": "99"}
...
```
where ```id``` is the id of the passage, ```contents``` is a text that combines the ```wikipedia_title``` and ```text``` fields of the TSV file, and ```wikipedia_id``` is the id of the Wikipedia document that contains the passage.

<!-- You need to implement the ```convert``` function below to convert the entire wikipedia dataset into JSONL files. We suggest seperating it into several JSONL files (e.g., a maximum of 1 milions lines per file, if the wikipedia tsv file has 3.000.000 lines, then you should convert it into 3 jsonl files, each jsonl file has 1.000.000 lines) to speed up the indexing process later.

Note: If you use the ```pandas``` library to read the TSV file, we suggest using the ```chunksize``` argument to read the TSV file in chunks. This file is relatively large and you may not be able load it completely into RAM.
You can learn more [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). -->



The function `convert_to_jsonl(tsv_file: str, output_dir: str) -> None` below takes the Wikipedia TSV corpus as input and writes it out as one or more JSONL files in the specified directory.  

- **Each JSONL line must contain a document with fields:** `id`, `contents`, and `wikipedia_id`.  
- Some suggestions on the preprocessing part:
    - **Split the output into multiple files** (e.g., 1,000,000 lines per file) to make later indexing faster.  
    - **Use chunked reading** (e.g., with `pandas.read_csv(..., chunksize=...)`). The corpus file is relatively large, and you may not be able to load it completely into RAM. You can learn more [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
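
The chunking idea can be sketched on a tiny in-memory TSV. This is only an illustration under stated assumptions, not the expected implementation: it uses the standard-library `csv` module in place of pandas, the helper names `row_to_record` and `iter_chunks` are hypothetical, the column order of the toy data is made up, and joining title and text with a single space is an assumption.

```python
import csv
import io
import itertools
import json


def row_to_record(row: dict) -> dict:
    """Map one TSV row to the JSONL record layout described above."""
    return {
        "id": row["id"],
        # Assumption: title and text are joined with a single space.
        "contents": f"{row['wikipedia_title']} {row['text']}",
        "wikipedia_id": row["wikipedia_id"],
    }


def iter_chunks(rows, chunk_size: int):
    """Yield lists of at most chunk_size rows without loading everything."""
    iterator = iter(rows)
    while chunk := list(itertools.islice(iterator, chunk_size)):
        yield chunk


# Tiny in-memory TSV standing in for the real corpus file.
toy_tsv = io.StringIO(
    "id\ttext\twikipedia_title\twikipedia_id\n"
    "0\tFirst passage.\tTitle A\t99\n"
    "1\tSecond passage.\tTitle A\t99\n"
    "2\tThird passage.\tTitle B\t100\n"
)
reader = csv.DictReader(toy_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)

# Each chunk would be written to its own output file in a real conversion.
chunks_as_jsonl = [
    "\n".join(json.dumps(row_to_record(row)) for row in chunk)
    for chunk in iter_chunks(reader, chunk_size=2)
]
print(len(chunks_as_jsonl))
print(chunks_as_jsonl[0])
```

With three rows and a chunk size of two, the sketch produces two chunks; on the real corpus the chunk size would be far larger, and each chunk would be streamed to disk instead of being held in a list.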

In [None]:
def convert_to_jsonl(tsv_file: str, output_dir: str):
    """
    Converts a large Wikipedia TSV file into one or more JSONL files.

    Args:
        tsv_file: Path to the input TSV file. Expected columns: 'id', 'wikipedia_title', 'text', 'wikipedia_id'.
        output_dir: Directory where the JSONL files will be written. Created if it does not exist.

    Returns:
        None
            - Writes JSONL file(s) (e.g., docs01.jsonl, docs02.jsonl, ...) into output_dir.

    Notes:
        - Each JSON object should include 'id', 'contents' (title + text), and 'wikipedia_id'.
    """

    #########
    ##
    ## Implement the function here
    ##
    #########



convert_to_jsonl(tsv_file=corpus_path, output_dir=jsonl_dir)

In [None]:
sample_jsonl = {"id": 0, "contents": "Academy Award for Best Production Design Academy Award for Best Production Design\nThe Academy Award for Best Production Design recognizes achievement for art direction in film. The category's original name was Best Art Direction, but was changed to its current name in 2012 for the 85th Academy Awards. This change resulted from the Art Director's branch of the Academy of Motion Picture Arts and Sciences (AMPAS) being renamed the Designer's branch. Since 1947, the award is shared with the set decorator(s). It is awarded to the best interior design in a film.", "wikipedia_id": 316}

print("Expected Output Format:\n" + json.dumps(sample_jsonl, indent=4))


We also provide a sample jsonl file [here](https://drive.google.com/file/d/1G6MOp2gIYERI4CvZ3jb9IiUJV83CsOyJ/view?usp=sharing).



## 1.2 Indexing Dataset (6 points)

Next, you need to call the Pyserini indexer. Note that the index needs to store raw documents for efficient access in the later steps of the assignment. More information on the Pyserini indexer can be found [here](https://github.com/castorini/pyserini/blob/master/docs/usage-index.md). You can run shell commands in Google Colab by adding ```!``` before the command, e.g. ```!python```.

**Important notes**
*   The output index files are relatively large, so be sure to save them on your Google Drive for easy access later.

*   You can use your UMass Google Drive.



*  In our experiments, we used the TPU runtime and set the number of threads to 8; the indexing step completed in less than 5 minutes.
To change the runtime option, refer to the [section below](#change-runtime).

* Using the default `StandardAnalyzer` is fine.
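
As a rough sketch of what the indexing call can look like, the snippet below only assembles the command line and prints it; it does not run the indexer. The helper name `build_index_command` and the directory names are illustrative assumptions. The flags follow the Pyserini indexing documentation linked above, with `--storeRaw` keeping the raw documents in the index.

```python
import shlex


def build_index_command(input_dir: str, index_dir: str, threads: int = 8) -> list:
    """Assemble the Pyserini Lucene indexing command as an argument list."""
    return [
        "python", "-m", "pyserini.index.lucene",
        "--collection", "JsonCollection",
        "--input", input_dir,
        "--index", index_dir,
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
        "--storeRaw",  # keep raw documents so their contents can be fetched later
    ]


# Placeholder directory names; in the notebook these would be jsonl_dir and index_dir.
cmd = build_index_command("jsonl_collections", "index", threads=8)
print(shlex.join(cmd))
```

In Colab, the printed command could be run in a cell prefixed with `!`, or the list could be passed to `subprocess.run`.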


In [None]:
if os.path.isdir(index_dir) and os.listdir(index_dir):
    print(f'"{index_dir}" is not empty, found processed files.')
else:
    # Call the Pyserini indexer function here
    #########
    ##
    ## Enter your code here
    ##
    #########


## 1.3 Read Index Statistics (4 points)

You need to report the total number of terms in the collection, using the inverted index you built.


In [None]:
from pyserini.index.lucene import LuceneIndexReader
import itertools

#########
##
## Enter your code here
##
#########


# 2. BM25 Retrieval (15 Points)

In this part, you need to run the BM25 model to retrieve documents (at the passage level) from the Wikipedia dataset that was indexed in the previous part.

The [Pyserini Interactive Searching](https://github.com/castorini/pyserini/blob/master/docs/usage-interactive-search.md) can be used to retrieve documents with respect to these queries.


**Queries**

Each example of the [KILT tasks](https://huggingface.co/datasets/facebook/kilt_tasks) has three main parts: ```id```, ```input```, and ```output```. The ```output``` part contains several outputs for the ```input```, and each of them has a ```provenance``` field. The provenance field has a list of features (```wikipedia_id```, ```start_paragraph_id```, ```end_paragraph_id```) representing Wikipedia documents and paragraphs that support the answer.

In this part, you need to use ```input``` values as queries.

## 2.1 LuceneSearcher

For retrieval, you first need to load the built index and set the parameters of the BM25.

In [None]:
from pyserini.search.lucene import LuceneSearcher

k1 = 1.2
b = 0.75
# Load the index and configure search parameters
#########
##
## Enter your code here
##
#########


## 2.2 Perform Retrieval using BM25

**Getting retrieved results**

You need to implement a function that returns a list of retrieved documents along with their scores.

`def search_on_hotpotqa(raw_queries: list[dict], searcher: LuceneSearcher, top_k: int, num_worker: int) -> dict[str, dict[str, float]]`

- **Expected Output Format.** The expected format of the output would be,
```
{
    'query_id1': {
        'doc_id1': <doc_id1 score>,
        'doc_id2': <doc_id2 score>,
        ...
    },
    ...
}
```

**Batch Search**


To speed up searching, you can use the ```batch_search``` function:
```
hits = searcher.batch_search(queries, qids=qids, k=top_n, threads=threads)
```
where:
*   queries: list of content of queries
*   qids: list of ids of queries
*   top_n: number of retrieved documents
*   threads: number of threads
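
The value returned by `batch_search` maps each query ID to a list of hits. A minimal sketch of reshaping such a mapping into the expected output format is shown below. The `Hit` namedtuple is a stand-in for Pyserini's result objects (which expose `.docid` and `.score`), and `hits_to_ranklists` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for a Pyserini search hit; real hits expose .docid and .score.
Hit = namedtuple("Hit", ["docid", "score"])


def hits_to_ranklists(hits: dict) -> dict:
    """Convert {qid: [hit, ...]} into {qid: {docid: score}}."""
    return {
        qid: {hit.docid: float(hit.score) for hit in qid_hits}
        for qid, qid_hits in hits.items()
    }


toy_hits = {"q1": [Hit("15414678", 12.36), Hit("14271190", 11.60)], "q2": []}
print(hits_to_ranklists(toy_hits))
```

Queries with no hits map to an empty dictionary in this sketch.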

In [None]:
def search_on_hotpotqa(
    raw_queries: list[dict],
    searcher: LuceneSearcher,
    top_k: int,
    num_worker: int
) -> dict[str, dict[str, float]]:
    """
    Runs BM25 retrieval with Pyserini for a subset of HotpotQA queries.

    Args:
        raw_queries: A list where each item is a query object (dict).
        searcher: A Pyserini LuceneSearcher already pointing to the built index.
        top_k: Number of documents to retrieve per query.
        num_worker: Number of worker threads used by batch_search.

    Returns:
        A dictionary mapping query_id -> {doc_id: score}, where score is the BM25 score
        returned by Pyserini for that (query, doc) pair.
    """

    #########
    ##
    ## Implement the function here
    ##
    #########




top_k = 10
num_worker = 2
hotpotqa_ranklists = search_on_hotpotqa(raw_queries=raw_queries, searcher=searcher, top_k=top_k, num_worker=num_worker)

In [None]:
sample_query_id = "5ac26eed55429951e9e685bf"
expected_query_ranklists = {'15414678': 12.358499526977539, '14271190': 11.598899841308594, '28925671': 10.694299697875977, '1886921': 9.937000274658203, '5029879': 9.927900314331055, '32084178': 9.650400161743164, '20646629': 9.559900283813477, '19776349': 9.37909984588623, '10753001': 9.091699600219727, '3173407': 9.02079963684082}
print(f"Expected Output:\n" + json.dumps({sample_query_id: expected_query_ranklists}, indent=4))

# 3. BM25 Evaluation (10 Points)

In this part, we want to measure the quality of the results retrieved by the BM25 model in terms of precision, recall, and MAP.

One tool to be used for this part is [pytrec_eval](https://github.com/cvangysel/pytrec_eval).


## Evaluating retrieved results

You need to implement the function `eval_on_hotpotqa(qrels: dict, ranklists: dict) -> dict[str, float]` which evaluates the quality of BM25 retrieval results in terms of precision, recall, and MAP, each at cutoffs `3`, `5`, and `10`.

- **Expected Output Format:** The returned dictionary contains the average scores across all queries in the set:
```
{
  "P_3": value,
  "P_5": value,
  "P_10": value,
  "R_3": value,
  "R_5": value,
  "R_10": value,
  "MAP_3": value,
  "MAP_5": value,
  "MAP_10": value,
}
```
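
To illustrate what these metrics measure for a single query, here is a hand-rolled sketch on toy data. It is not a replacement for pytrec_eval, which computes the values for you. The function names are hypothetical, and normalizing average precision by the total number of relevant documents is an assumption about the cutoff convention.

```python
def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def average_precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Sum of precision values at each relevant rank within the top k."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Assumption: normalize by the total number of relevant documents.
    return total / len(relevant)


toy_ranked = ["d1", "d2", "d3", "d4", "d5"]
toy_relevant = {"d1", "d3", "d9"}
print(precision_at_k(toy_ranked, toy_relevant, 3))
print(recall_at_k(toy_ranked, toy_relevant, 3))
print(average_precision_at_k(toy_ranked, toy_relevant, 3))
```

In the toy example, two of the top three results are relevant, so precision and recall at 3 are both 2/3, and average precision at 3 is (1/1 + 2/3) / 3. The reported scores are the means of these per-query values over all queries.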

In [None]:
import pytrec_eval

def eval_on_hotpotqa(qrels: dict, ranklists: dict) -> dict[str, float]:
    """
    Evaluate BM25 retrieval results using pytrec_eval.

    Args:
        qrels: A dictionary of relevance judgments in TREC format
               { query_id: { doc_id: relevance_label } }.
        ranklists: A dictionary of retrieval results in TREC format
             { query_id: { doc_id: score } }.

    Returns:
        A dictionary containing average P@3, P@5, P@10, R@3, R@5, R@10,
        MAP@3, MAP@5, MAP@10 across all queries.
    """

    #########
    ##
    ## Implement the function here
    ##
    #########



    return {} # You will return something meaningful before this statement


hotpotqa_eval_results = eval_on_hotpotqa(qrels=qrels, ranklists=hotpotqa_ranklists)
print(json.dumps(hotpotqa_eval_results, indent=4))

# 4. Retrieval-Augmented Generation (RAG) (20 Points)

We use retrieval-augmented generation (RAG) to describe a pipeline that retrieves a set of relevant documents and passes them to an LLM to produce answers to questions.

In this assignment, we use `Qwen/Qwen2.5-3B` as the backbone LLM and perform RAG with the top-$k$ BM25 results to answer questions from the HotpotQA dataset.

After generating answers, we evaluate their quality with [Exact Match (EM)](https://huggingface.co/spaces/evaluate-metric/exact_match), which checks whether a prediction exactly matches the gold answer.

Most of the components are implemented below and you only need to run them and provide your BM25 results.

Your task is to
- prepare the prompts for the LLM, and
- analyze how the quality of BM25 retrieval correlates with the quality of the generated answers.

## 4.1 Inverted index for constructing LLM prompts (6 Points)

Before we issue the query to the LLM, we need to first prepare the prompts for the LLM based on the retrieval results from our previous steps. Similar to other retrieval/search procedures, we need to have a lookup function which maps from doc ID to the actual doc content.

You would need to implement the function `get_ctxs(doc_ids: dict[str, float], searcher: LuceneSearcher) -> list[dict[str, str | float]]` which gets the content of the top-ranked documents from the Pyserini index.


- **Expected Output Format.** The expected format of the output is,
```
[
  {
    "id": <doc_id>,
    "text": <document contents>,
    "bm25_score": <bm25 score>
  },
  ...
]
```
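
The shape of this output can be sketched independently of Pyserini. In the snippet below, `build_ctxs` and the `fetch_text` callable are hypothetical names, and a plain dictionary stands in for the index; in the actual function the text would come from the raw documents stored in the index.

```python
from typing import Callable


def build_ctxs(doc_scores: dict, fetch_text: Callable[[str], str]) -> list:
    """Return [{'id', 'text', 'bm25_score'}, ...] sorted by score, highest first."""
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"id": doc_id, "text": fetch_text(doc_id), "bm25_score": score}
        for doc_id, score in ranked
    ]


# Toy store standing in for the index lookup.
toy_store = {"a": "Passage A", "b": "Passage B"}
ctxs = build_ctxs({"a": 9.5, "b": 12.0}, toy_store.__getitem__)
print(ctxs)
```

Passing the lookup as a callable keeps the sorting and formatting logic separate from how the text is retrieved.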


In [None]:
def get_ctxs(doc_ids: dict[str, float], searcher: LuceneSearcher) -> list[dict[str, str | float]]:
    """
    Get document contents for given doc_ids from the Pyserini index.

    Args:
        doc_ids: A dictionary having document IDs and their BM25 scores.
        searcher: A Pyserini LuceneSearcher object pointing to the built index.

    Returns:
        A list of dictionaries, each containing:
            - "id": the document ID
            - "text": the document contents
            - "bm25_score": the BM25 score
        The list is sorted by BM25 score in descending order.
    """

    #########
    ##
    ## Implement the function here
    ##
    #########

    return [] # You will return something meaningful before this statement



In [None]:
sample_doc_ids = {"15414678": 12.358499526977539, "14271190": 11.598899841308594}
expected_sample_ctxs = [{'id': '15414678', 'text': "Purcell, Kansas Purcell, Kansas\nPurcell is an unincorporated community in Doniphan County, Kansas, United States. It is located east of Everest, south of K-20, on highway K-137.\nHistory.\nPurcell was founded about 1886. John Purcell was one of the earliest settlers.\nA post office was opened in Purcell in 1887, and remained in operation until it was discontinued in 1956.\nSt. Mary's Catholic Church, which is listed on the National Register of Historic Places, is located in Purcell.", 'bm25_score': 12.358499526977539}, {'id': '14271190', 'text': 'McClain County, Oklahoma to the northwest. It ran through Byars and Purcell, and established Washington, Cole, and Blanchard.\nPurcell was a starting point for the Land Run of 1889. It also was at the dividing line between Indian Territory, where alcohol could not be sold, and Oklahoma Territory, where alcohol sale was legal. The town of Lexington, across the river from Purcell, had numerous saloons. In 1899, the Purcell Bridge Company built a toll bridge across the river, profiting from the alcohol trade.\nGeography.', 'bm25_score': 11.598899841308594}]

sample_ctxs = get_ctxs(doc_ids=sample_doc_ids, searcher=searcher)

print(f"Expected Output:\n" + json.dumps(expected_sample_ctxs, indent=4))
print(f"    Your Output:\n" + json.dumps(sample_ctxs, indent=4))

Using the `get_ctxs()` function above, the top retrieved documents along with their contents are written to a JSON file. This file is used for the RAG experiments in the next step, so you don't need to fetch the retrieved results more than once.

The format of JSON file is:
```
[
  {
    "id": "query_1",
    "input": "the first input",
    "answers": ["answer 1", "answer 2", ...],
    "ctxs": [
      {"id": "paragraph1 id", "text": "paragraph1 content"},
      {"id": "paragraph2 id", "text": "paragraph2 content"},
      ...
    ]
  },
  {
    "id": "query_2",
    "input": "the second input",
    "answers": ["answer 1", "answer 2", ...],
    "ctxs": [
      {"id": "paragraph1 id", "text": "paragraph1 content"},
      {"id": "paragraph2 id", "text": "paragraph2 content"},
      ...
    ]
  },
  ...
]
```
where "id", "input", and "answers" are obtained from the HotpotQA dataset of KILT, and "ctxs" contains the document content.

In [None]:
## DO NOT MODIFY THE CODE BELOW!!

def prepare_json_file(raw_queries: dict, ranklists: dict, searcher: LuceneSearcher, json_cache_path: str):
    saved_json_list = []
    for entry in tqdm(raw_queries, desc="Prepare LLM Prompt", unit=" Query"):
        saved_dict = {'answers': []}
        saved_dict['id'] = entry['id']
        saved_dict['input'] = entry['input']
        for output in entry['output']:
            saved_dict['answers'].append(output['answer'])
        saved_dict['ctxs'] = get_ctxs(ranklists[entry['id']], searcher)
        saved_json_list.append(saved_dict)
    with open(json_cache_path, 'w', encoding="utf-8") as f_out:
        json.dump(saved_json_list, f_out, indent=4)
    print(f"Processed LLM Prompts are saved at '{json_cache_path}'.")


prepare_json_file(raw_queries=raw_queries, ranklists=hotpotqa_ranklists, searcher=searcher, json_cache_path=json_cache_path)

## 4.2 Prepare the prompts for LLM inference



**(No Implementation Needed!)**

We provide a set of utility functions that help build the prompts for RAG evaluation. Part of the code is adapted from [here](https://github.com/AI21Labs/in-context-ralm/).

In [None]:
# Part of this code is adopted from https://github.com/AI21Labs/in-context-ralm/
import os
import argparse
import json
import re
import string
import torch
from tqdm import tqdm


def normalize_question(question):
  if not question.endswith("?"):
      question = question + "?"
  return question[0].lower() + question[1:]


def normalize_answer(s):
  def remove_articles(text):
      return re.sub(r"\b(a|an|the)\b", " ", text)
  def white_space_fix(text):
      return " ".join(text.split())
  def remove_punc(text):
      exclude = set(string.punctuation)
      return "".join(ch for ch in text if ch not in exclude)
  def lower(text):
      return text.lower()
  return white_space_fix(remove_articles(remove_punc(lower(s))))


def text_has_answer(answers, text) -> bool:
  if isinstance(answers, str):
      answers = [answers]
  text = normalize_answer(text)
  for single_answer in answers:
      single_answer = normalize_answer(single_answer)
      if single_answer in text:
          return True
  return False


def exact_match(prediction, ground_truth):
  return normalize_answer(prediction) == normalize_answer(ground_truth)


def get_answer_from_model_output(outputs, tokenizer, prompt):
  generation_str = tokenizer.decode(outputs[0].cpu(), skip_special_tokens=True)
  generation_str = generation_str[len(prompt):]
  answer = generation_str.split("\n")[0]
  return answer, generation_str


def load_dataset(dataset_path):
    print("Loading dataset:", dataset_path)
    with open(dataset_path) as f:
        return json.load(f)


def evaluate_dataset(model, tokenizer, device, eval_dataset, max_length, \
                     num_docs=0, output_dir=None, max_tokens_to_generate=10):
    idx = 0
    num_correct = 0
    num_has_answer = 0
    num_too_long = 0
    sample_prompt = None
    for ex in (tq := tqdm(eval_dataset, desc=f"EM:  0.0%")):
        answers = ex["answers"]
        prompt = build_qa_prompt(ex, num_docs=num_docs)
        if idx == 0:
            sample_prompt = prompt
        has_answer = text_has_answer(answers, prompt)
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        if input_ids.shape[-1] > max_length - max_tokens_to_generate:
            num_too_long += 1
            input_ids = input_ids[..., -(max_length - max_tokens_to_generate):]
        with torch.no_grad():
            outputs = model.generate(input_ids, max_new_tokens=max_tokens_to_generate, pad_token_id=tokenizer.eos_token_id)
        prediction, generation = get_answer_from_model_output(outputs, tokenizer, prompt)
        is_correct = any([exact_match(prediction, answer) for answer in answers])
        idx += 1
        if is_correct:
            num_correct += 1
        if has_answer:
            num_has_answer += 1
        tq.set_description(f"EM: {num_correct / idx * 100:4.1f}%")
    em = num_correct / idx * 100
    has_answer = num_has_answer / idx * 100
    print(f"EM: {em:.1f}%")
    print(f"% of prompts with answer: {num_has_answer / idx * 100:.1f}%")
    if output_dir is not None:
        d = {"em": em, "has_answer": has_answer, "num_examples": idx, "too_long": num_too_long}
        eval_output_path = os.path.join(output_dir, "eval.json")
        with open(eval_output_path, "w", encoding="utf-8") as f:
            f.write(json.dumps(d) + "\n")
        print(f"The evaluation is saved at '{eval_output_path}'.")
        if sample_prompt is not None:
            sample_prompt_path = os.path.join(output_dir, "example_prompt.txt")
            with open(sample_prompt_path, "w", encoding="utf-8") as f:
                f.write(sample_prompt)
            print(f"The sample prompt is saved at '{sample_prompt_path}'.")
    return em


def build_qa_prompt(example, num_docs=1):
    if num_docs == 0:
        question_text = normalize_question(example["input"])
        ex_prompt = f"Answer the question:\nQ: {question_text}\nA:"
    elif num_docs == 1:
        q = normalize_question(example["input"])
        text = example['ctxs'][0]['text']
        ex_prompt = f"{text}\n\nBased on this text, answer this question:\nQ: {q}\nA:"
    else:
        q = normalize_question(example["input"])
        docs_text = "\n\n".join([f"{ctx['text']}" for ctx in example["ctxs"][:num_docs]])
        ex_prompt = f"{docs_text}\n\nBased on these texts, answer this question:\nQ: {q}\nA:"
    return ex_prompt




## 4.3 Load the LLM
<a name="change-runtime"></a>
**(No Implementation Needed!)**

We recommend using a GPU for LLM inference. In Colab, go to `Runtime → Change runtime type` and select a **GPU** (we suggest a **T4**).

**Note.** A T4 GPU typically finishes inference for ~200 queries in a few minutes. CPU-only runs will take much longer. You can also explore TPU for faster inference, though it may require additional setup and library support.
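
Before loading the model, it can help to confirm that the runtime actually exposes a GPU. The snippet below is a minimal sketch; the helper name `detect_device` is hypothetical (it is not part of the assignment code), and it falls back to `"cpu"` when PyTorch is not installed.

```python
def detect_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Inference device: {detect_device()}")
```

If this prints `cpu` in Colab, the runtime type is most likely still set to CPU.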

In [None]:
# Loading the model
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

model_name = "Qwen/Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
config = AutoConfig.from_pretrained(model_name)

## 4.4 Run the RAG Pipeline (4 Points)

In this part, you first measure the quality of the answers the LLM generates when retrieved results are used to augment the prompts.

One aspect to study is how the number of retrieved results passed to the LLM affects the quality of its generated answers. To study this, report the Exact Match (EM) score in **four different settings**:


*   No document is passed to the LLM: the prompt contains only the question.
*   The top-3 documents retrieved by BM25 are passed to the LLM: the prompt contains the question and the top-3 retrieved documents.
*   The top-5 documents retrieved by BM25 are passed to the LLM: the prompt contains the question and the top-5 retrieved documents.
*   The top-10 documents retrieved by BM25 are passed to the LLM: the prompt contains the question and the top-10 retrieved documents.

To obtain these scores, call the `run_eval()` function below once per setting,
passing the corresponding value for the parameter `top_k`.
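
For intuition about the metric reported in these settings, the sketch below shows one common way to compute Exact Match: normalize both strings (lowercase, strip punctuation and articles, collapse whitespace) and test for equality against any gold answer. This is an assumption about a typical implementation, not necessarily identical to the `exact_match` function defined earlier in this notebook; the names `normalize_answer`, `exact_match_sketch`, and `em_score` are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_sketch(prediction: str, gold: str) -> bool:
    """True when the two strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(gold)


def em_score(predictions, gold_answers) -> float:
    """Percentage of predictions that match at least one of their gold answers."""
    hits = sum(
        any(exact_match_sketch(pred, gold) for gold in golds)
        for pred, golds in zip(predictions, gold_answers)
    )
    return 100.0 * hits / len(predictions)
```

Under this definition a prediction such as "The Eiffel Tower." matches the gold answer "eiffel tower", while a longer, sentence-style answer does not, even if it contains the right entity.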


In [None]:
def run_eval(model, tokenizer, config, prepared_prompts_file, output_dir, top_k=10, dataset_type='QA') -> float:
    os.makedirs(output_dir, exist_ok=True)
    eval_dataset = load_dataset(prepared_prompts_file)
    model_max_length = config.n_positions if hasattr(config, "n_positions") else config.max_position_embeddings
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return evaluate_dataset(model, tokenizer, device, eval_dataset, model_max_length, num_docs=top_k, output_dir=output_dir)

### 4.4.1 Experiments on `top_k=0`

In [None]:
#########
##
## Enter your code here
##
#########


### 4.4.2 Experiments on `top_k=3`

In [None]:
#########
##
## Enter your code here
##
#########

### 4.4.3 Experiments on `top_k=5`

In [None]:
#########
##
## Enter your code here
##
#########


### 4.4.4 Experiments on `top_k=10`

In [None]:
#########
##
## Enter your code here
##
#########


### 4.4.5 Fill out the following table using the performance obtained above. (4 Points)

| top_k |   Exact Match   |
|:-----:|:----------------|
|   0   |                 |
|   3   |                 |
|   5   |                 |
|  10   |                 |

## 4.5 Performance Analysis (10 Points)

Answer these questions based on the performance results you obtained above.

### 4.5.1 Does passing more documents to the LLM lead to higher quality of generated answers? (3 Points)


**Enter your answer here**

### 4.5.2 Does the setting with the highest precision among the four settings above result in the highest exact-match score? (3 Points)


**Enter your answer here**

### 4.5.3 Is there a relationship between the two evaluation metrics: the precision of the retrieved results and the exact-match score of the generated answers? (4 Points)


**Enter your answer here**

# 5. Extra Credits (Max 10 Points)

## 5.1 Prompt Engineering
You can change the prompt template in the `build_qa_prompt` function above and study how that affects the quality of the answers generated by the LLM. For example, you may:
- Ask the LLM to reason step-by-step before providing an answer.
- Design a prompt that encourages the LLM to produce a concise, exact answer.
- Include few-shot examples in the prompt before generation.
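
As one possible starting point, the sketch below combines two of the ideas above: an instruction that asks for a short answer, and few-shot demonstrations placed before the actual question. The function name `build_few_shot_prompt`, its parameters, and the instruction wording are all hypothetical; only the `Q:`/`A:` layout follows the existing template.

```python
def build_few_shot_prompt(question, docs, demos):
    """Assemble a prompt from retrieved texts, (question, answer) demos, and the question.

    docs  -- list of retrieved document texts (may be empty)
    demos -- list of (question, answer) pairs used as few-shot examples
    """
    parts = []
    if docs:
        parts.append("\n\n".join(docs))
    # Illustrative instruction; the wording is an assumption, not a tested prompt.
    parts.append("Answer each question with a short phrase only.")
    if demos:
        parts.append("\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos))
    # End with an open "A:" so the model continues with the answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


demo_pairs = [("what is the capital of France?", "Paris")]
print(build_few_shot_prompt("who wrote Hamlet?", ["Hamlet is a tragedy by William Shakespeare."], demo_pairs))
```

Because longer prompts leave less room in the context window, it is worth checking how many prompts get truncated when demonstrations are added.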

**Enter your answer here**

## 5.2 Position bias
You can study how the rank of the relevant document (the one containing the exact-match answer) impacts the quality of the answers generated by the LLM.
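
A minimal sketch of the bookkeeping this requires is shown below, assuming each example has an `answers` list and a `ctxs` list of dictionaries with a `text` field, as in the evaluation code above. The helper names `first_answer_rank` and `move_to_rank` are hypothetical, and the case-insensitive substring test is a simplification of the notebook's `text_has_answer`.

```python
def first_answer_rank(ctxs, answers):
    """Return the 1-based rank of the first context containing any answer, or None."""
    lowered = [answer.lower() for answer in answers]
    for rank, ctx in enumerate(ctxs, start=1):
        text = ctx["text"].lower()
        if any(answer in text for answer in lowered):
            return rank
    return None


def move_to_rank(ctxs, from_rank, to_rank):
    """Return a copy of ctxs with the context at from_rank moved to to_rank (1-based)."""
    reordered = list(ctxs)
    ctx = reordered.pop(from_rank - 1)
    reordered.insert(to_rank - 1, ctx)
    return reordered
```

With these two helpers you could, for instance, group examples by the rank of their relevant document and compare exact-match scores per group, or move the relevant document to a fixed position and re-run the evaluation.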

**Enter your answer here**

## 5.3 Context length
You can investigate how constructing LLM prompts with exactly one relevant document and a varying number of noise documents (up to the maximum you can include) impacts the quality of the answers generated by the LLM.
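
One way to build such prompts is sketched below: sample a fixed number of noise contexts and insert the single relevant context at a chosen position. The function `build_noisy_ctxs` and its parameters are hypothetical; how you identify the relevant document and the noise pool is left to you.

```python
import random


def build_noisy_ctxs(relevant_ctx, noise_pool, num_noise, relevant_position=0, seed=0):
    """Return a context list with one relevant context and num_noise noise contexts.

    relevant_position is a 0-based index; values past the end place the
    relevant context last. A fixed seed keeps the sampled noise reproducible.
    """
    if num_noise > len(noise_pool):
        raise ValueError("not enough noise documents in the pool")
    rng = random.Random(seed)
    ctxs = rng.sample(noise_pool, num_noise)
    position = min(relevant_position, len(ctxs))
    ctxs.insert(position, relevant_ctx)
    return ctxs
```

The resulting list can replace an example's `ctxs` field before calling `build_qa_prompt` with `num_docs` set to the length of the list. Keep an eye on the number of truncated prompts, since truncation removes text from the start of the prompt.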

**Enter your answer here**

## 5.4 LLM Variants

You can also experiment with different model sizes in the Qwen2.5 family (e.g., 1.5B, 3B, 7B) and compare the instruct and non-instruct variants. For example, the instruct model can be accessed as `Qwen/Qwen2.5-3B-Instruct`. Note that you may need to apply quantization when running the 7B model.
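
As a rough sketch of what quantized loading can look like, the function below uses 4-bit quantization through the `BitsAndBytesConfig` class in `transformers`. It assumes the `bitsandbytes` and `accelerate` packages are installed and a CUDA GPU is available; the helper name `load_quantized_model` and the specific quantization settings are illustrative choices, not requirements of the assignment.

```python
def load_quantized_model(model_name: str = "Qwen/Qwen2.5-7B-Instruct"):
    """Load a causal LM in 4-bit precision and return (model, tokenizer)."""
    # Imported lazily so that defining this helper does not require a GPU setup.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights in 4 bits
        bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.float16,  # run matrix multiplications in fp16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )
    return model, tokenizer
```

Quantization can change the generated answers slightly, so report whether each model in your comparison was quantized.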

**Enter your answer here**

## 6. AI Disclosure

*   Did you use any AI assistance to complete this assignment? If so, please also specify what AI you used.
    * *your response here*


---
*(only complete the below questions if you answered yes above)*

*   If you used a large language model to assist you, please paste *all* of the prompts that you used below. Add a separate bullet for each prompt, and specify which problem is associated with which prompt.

    * *your response here*


*   **Free response**: For each problem for which you used assistance, describe your overall experience with the AI. How helpful was it? Did it just directly give you a good answer, or did you have to edit it? Was its output ever obviously wrong or irrelevant? Did you use it to get the answer or check your own answer?
    * *your response here*
