[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/10.llms/RAG.ipynb)


# Retrieval Augmented Generation

In this notebook, we will implement a simple RAG system.

Concretely, we will begin by building a document embedding collection. Then, for each query, we:

1. Embed the query in that same space.
2. Use FAISS to retrieve the $n$ closest documents.
3. Incorporate those retrieved documents into the context of a prompt for an LLM.
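
The three steps above can be sketched end to end with a toy example. This is only an illustration under loud assumptions: `toy_embed` is a made-up word-count stand-in for a real sentence encoder, the three documents are invented, and plain NumPy scoring stands in for FAISS.

```python
import numpy as np

# Toy stand-in for a sentence encoder: count a few vocabulary words.
VOCAB = ["parsing", "translation", "sentiment", "dependency", "machine"]

def toy_embed(text):
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in VOCAB], dtype="float32")

# Invented mini-collection of documents, embedded once up front.
docs = [
    "dependency parsing with neural networks",
    "machine translation for low-resource languages",
    "sentiment analysis of product reviews",
]
doc_matrix = np.stack([toy_embed(d) for d in docs])

def retrieve(query, n=2):
    # Steps 1 and 2: embed the query in the same space, then rank by inner product.
    scores = doc_matrix @ toy_embed(query)
    top = np.argsort(-scores)[:n]
    return [docs[i] for i in top]

def build_prompt(query, passages):
    # Step 3: place the retrieved passages in the prompt context.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Question: {query}\n\nRelevant passages:\n{context}"

query = "what is dependency parsing"
retrieved = retrieve(query, n=1)
prompt = build_prompt(query, retrieved)
print(prompt)
```

The rest of the notebook follows the same outline, with a SentenceTransformer model as the encoder and a FAISS index doing the ranking.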

In [1]:
!pip install sentence-transformers

# install faiss for gpu
!pip install faiss-gpu-cu12



In [2]:
import torch
import operator

import faiss
import nltk
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
# Run this early to let the model load!

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="cuda", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")


We'll use the ACL paper abstracts you worked with before as our set of documents.

In [4]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/acl.all.tsv

--2025-10-28 23:37:19--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/acl.all.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7388302 (7.0M) [text/plain]
Saving to: ‘acl.all.tsv’


2025-10-28 23:37:20 (14.5 MB/s) - ‘acl.all.tsv’ saved [7388302/7388302]



In [5]:
df = pd.read_csv("acl.all.tsv", sep="\t", names=["cite", "year", "title", "abstract"])

## Building an index

We must decide on a document embedding model; even within the SentenceTransformer family, there are many pre-trained models that vary by accuracy, size, etc. See [here](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) for a list of all models.

In particular, some models are trained for question answering tasks instead of strict semantic similarity; this means that questions will be placed in a similar region to their relevant answers.

In [6]:
# we'll normalize all vectors so that cosine similarity reduces to a dot product
# (enabling the use of inner product as a similarity metric)
def normalize(matrix):
    row_norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized_rows = matrix / row_norms
    return normalized_rows

class Index():
    def __init__(self, model_name, texts):
        self.encoder = SentenceTransformer(model_name)

        doc_embeddings = self.encoder.encode(texts)
        doc_embeddings = normalize(doc_embeddings)
        num_docs, embedding_size = doc_embeddings.shape

        # Our dataset is small enough that we can use exact search, so we'll use IndexFlatIP
        # (which builds an exact index with dot product as the similarity metric)
        self.index = faiss.IndexFlatIP(embedding_size)
        self.index.add(doc_embeddings)

        # If you want to use faster but approximate search over a larger dataset, use this instead
        # (IndexIVFFlat needs a flat index as its coarse quantizer; 10 is the number of clusters)
        # quantizer = faiss.IndexFlatIP(embedding_size)
        # self.index = faiss.IndexIVFFlat(quantizer, embedding_size, 10, faiss.METRIC_INNER_PRODUCT)
        # self.index.train(doc_embeddings)
        # self.index.add(doc_embeddings)

    def query(self, query, n=3):
        query_embedding = self.encoder.encode([query])
        query_embedding = normalize(query_embedding)
        distances, indices = self.index.search(query_embedding, n)
        return distances[0], indices[0]
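
To see why the normalization step matters, here is a small NumPy-only sketch (no FAISS required). It checks that the dot product of unit-length rows equals cosine similarity, and then scores every document exhaustively, which is conceptually what an exact inner-product index does. The vectors and the `exact_search` helper are made up for illustration and are not part of the notebook's `Index` class.

```python
import numpy as np

def normalize(matrix):
    # scale each row to unit length
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

# made-up 2-d "embeddings" for three documents and one query
docs = np.array([[3.0, 4.0], [4.0, 3.0], [-3.0, 4.0]], dtype="float32")
query = np.array([[6.0, 8.0]], dtype="float32")

def exact_search(doc_matrix, query_matrix, n):
    # hypothetical helper: score every document, keep the n best per query
    scores = query_matrix @ doc_matrix.T
    top = np.argsort(-scores, axis=1)[:, :n]
    return np.take_along_axis(scores, top, axis=1), top

unit_docs, unit_query = normalize(docs), normalize(query)

# cosine similarity computed directly from the unnormalized vectors
cosine = (docs @ query.T).ravel() / (
    np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
)

scores, indices = exact_search(unit_docs, unit_query, n=2)
print(scores, indices)
```

Like `Index.query`, the helper returns one row of scores and one row of document positions per query, with the best match first.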


RAG depends on having a good retriever since we condition *only* on the documents (and passages of documents) that are retrieved as being relevant. Here, we experiment with two different embedding models:

1. `all-mpnet-base-v2` is the default sentence-similarity model in sentence-transformers
2. `multi-qa-mpnet-base-dot-v1` is trained on question/answer pairs

**Consider**: why might we want to train on question/answer pairs?

In [7]:
mpnet_index = Index("sentence-transformers/all-mpnet-base-v2", df.abstract)
multiqa_index = Index("sentence-transformers/multi-qa-mpnet-base-dot-v1", df.abstract)


Now let's embed our query in the same representation space and find the documents most similar to it. Which embedding index do you think provides better results?

In [8]:
query = "What was the CONLL 2018 shared task?"

distances, indices = mpnet_index.query(query)
for dist, idx in zip(distances, indices):
    print("%.3f\t%s (%d)\t%s" % (dist, df.title[idx], df.year[idx], df.abstract[idx][:150]))

0.487	Universal Dependency Parsing from Scratch (2018)	This paper describes Stanford's system at the CoNLL 2018 UD Shared Task. We introduce a complete neural pipeline system that takes raw text as input, 
0.474	Tutorial: Making Better Use of the Crowd (2017)	Over the last decade, crowdsourcing has been used to harness the power of human computation to solve tasks that are notoriously difficult to solve wit
0.472	Commonsense Inference in Natural Language Processing (COIN) - Shared Task Report (2019)	This paper reports on the results of the shared tasks of the COIN workshop at EMNLP-IJCNLP 2019. The tasks consisted of two machine comprehension eval


In [9]:
distances, indices = multiqa_index.query(query)
for dist, idx in zip(distances, indices):
    print("%.3f\t%s (%d)\t%s" % (dist, df.title[idx], df.year[idx], df.abstract[idx][:150]))

0.666	CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (2018)	Every year, the Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learn
0.656	MRP 2019: Cross-Framework Meaning Representation Parsing (2019)	The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across framewor
0.623	CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (2017)	The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems 


## Generate

Now that we've built a retriever, let's incorporate the retrieved passages into the context of a prompt to answer our initial query.

In [10]:
from textwrap import dedent
def format_passage(data, idx):
    """
    Generates formatted paper information for a given paper index.
    """
    title = data.title.iloc[idx]
    abstract = data.abstract.iloc[idx]
    year = data.year.iloc[idx]
    cite = data.cite.iloc[idx]

    return f"""
    Title: {title}
    Year: {year}
    Cite-key: {cite}
    Abstract: {abstract}
    """

In [11]:
def prompt_model(messages, thinking=False):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # Switches Qwen3 between thinking and non-thinking modes (the chat template's own default is True)
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # conduct text completion
    generated = model.generate(
        **model_inputs,
        max_new_tokens=500
    )

    # let's break this down:
    #                      | we take the element of the batch (our batch size is 1)
    #                      |  |-----------------------------| skip our original input
    output_ids = generated[0][len(model_inputs.input_ids[0]):].tolist()

    # decode the generated token IDs back into text
    return tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")


def generate_without_rag(question, thinking=False):
    system_prompt = dedent("""
        You're a helpful assistant for question answering.
    """).strip()

    user_prompt = f"Question: {question}"

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    # pass the flag through so that thinking=True actually takes effect
    return prompt_model(messages, thinking=thinking)

def generate_with_rag(question, index, passages=None, thinking=False, show_rag=False):
    system_prompt = dedent("""
        You're a helpful assistant for question answering. Use the information from the included passages to construct your response.
        In your response, only reference the passages with a parenthetical citation of the cite-key. Do not refer to the passages any other way.
    """).strip()

    if passages is None:
        # retrieve the closest documents for the question (3 by default)
        _, indices = index.query(question)
        passages = [format_passage(df, idx) for idx in indices]

    # Dedent each passage on its own and join them, so the prompt works for any
    # number of passages and multi-line passages cannot skew the indentation.
    passage_block = "\n\n".join(dedent(p).strip() for p in passages)
    rag_prompt = f"Question: {question}\n\nRelevant passages to the question:\n\n{passage_block}"

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": rag_prompt},
    ]
    # pass the flag through so that thinking=True actually takes effect
    output = prompt_model(messages, thinking=thinking)
    if show_rag:
        output = f"{output}\n\nRAG prompt:\n{rag_prompt}"
    return output
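
A general caveat when combining `textwrap.dedent` with f-strings: the interpolation happens first, so a multi-line value with shallower indentation than the template becomes the common margin and the template's own lines keep part of their indent. The sketch below illustrates this with a made-up passage; `naive_prompt` and `clean_prompt` are hypothetical helpers written for this example only.

```python
from textwrap import dedent

# Made-up passage, indented by four spaces like a block built inside a function.
passage = """
    Title: Example Paper
    Year: 2018
    """

def naive_prompt(passage):
    # The template is indented by eight spaces, the passage by four,
    # so dedent only strips four spaces from every line.
    return dedent(f"""
        Question: example?

        Passages:

        {passage}
    """).strip()

def clean_prompt(passage):
    # Dedent each piece separately, then join the pieces.
    return "\n\n".join([
        "Question: example?",
        "Passages:",
        dedent(passage).strip(),
    ])

naive = naive_prompt(passage)
clean = clean_prompt(passage)
print(repr(naive))
print(repr(clean))
```

Printing the `repr` makes the leftover spaces before `Passages:` visible in the naive version. Passing `show_rag=True` to `generate_with_rag` is a convenient way to inspect the real prompt in the same manner.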

Now let's generate responses. First, we try generating a response without any context (relying only on the model's pretraining). Then, we try querying with the `multiqa_index` and `mpnet_index` embedding indices.

Here is a [list of CoNLL shared tasks](https://www.conll.org/previous-tasks) over the years. Are the outputs accurate? Which do you prefer?

I think the output of RAG with `multiqa_index` is best, because it emphasizes the original multilingual parsing task from 2017 that is referenced in the source. That said, the RAG with `mpnet_index` variant also succeeds in providing a high-level overview of the shared task, so it is hard to determine which is 'best' without a specific goal to evaluate the outputs against.

In [12]:
query = "What was the CONLL 2018 shared task?"

In [13]:
print(generate_without_rag(query))

The CONLL 2018 shared task was a natural language processing (NLP) evaluation organized by the Conference on Computational Natural Language Learning (CoNLL). It focused on **named entity recognition (NER)**, a key task in NLP that involves identifying and classifying entities in text into predefined categories such as person names, organizations, locations, dates, and more.

### Key Details of the CONLL 2018 Shared Task:
- **Objective**: To evaluate the performance of NER systems on a variety of languages and datasets.
- **Task**: Named Entity Recognition (NER) across multiple languages.
- **Datasets**: The task used several multilingual datasets, including:
  - **Multilingual NER datasets** such as the **Multilingual Named Entity Recognition (MNER)** dataset.
  - **Other datasets** like the **Universal Dependencies** and **CoNLL-2003** (for English).
- **Languages**: The task included multiple languages, including English, German, Spanish, French, and others.
- **Evaluation Metrics**:

In [14]:
print(generate_with_rag(query, multiqa_index))

The CONLL 2018 shared task was devoted to learning dependency parsers for a large number of languages in a real-world setting without any gold-standard annotation on test input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. This shared task constitutes a 2nd edition---the first one took place in 2017 (Zeman et al., 2017); the main metric from 2017 has been kept, allowing for easy comparison, also in 2018, and two new main metrics have been used. New datasets added to the Universal Dependencies collection between mid-2017 and the spring of 2018 have contributed to increased difficulty of the task this year (zeman-etal-2018-conll).


In [15]:
print(generate_with_rag(query, mpnet_index))

The CONLL 2018 shared task was a benchmarking event in natural language processing focused on universal dependency parsing. The task aimed to evaluate systems' ability to parse sentences into syntactic structures, specifically dependency trees, across multiple languages. Stanford's system participated in this task and introduced a complete neural pipeline that handled various tasks including tokenization, sentence segmentation, POS tagging, and dependency parsing. The system's performance was competitive on large treebanks, and after fixing a bug, it would have placed highly on the official evaluation metrics. (qi-etal-2018-universal)
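
Since the system prompt asks for parenthetical cite-key citations, one rough way to sanity-check an answer is to pull out the cited keys and compare them with the keys of the passages that were actually retrieved. The sketch below assumes cite-keys look like the two seen in the outputs above (lowercase alphanumeric parts joined by hyphens); both helper functions are hypothetical and not part of the notebook.

```python
import re

def extract_cite_keys(answer):
    # Assumption: a cite-key is hyphen-joined lowercase alphanumerics in parentheses,
    # e.g. (qi-etal-2018-universal). Citations like "(Zeman et al., 2017)" are ignored.
    return re.findall(r"\(([a-z0-9]+(?:-[a-z0-9]+)+)\)", answer)

def unsupported_citations(answer, retrieved_keys):
    # Cite-keys that appear in the answer but not among the retrieved passages.
    allowed = set(retrieved_keys)
    return [key for key in extract_cite_keys(answer) if key not in allowed]

answer = (
    "This shared task constitutes a 2nd edition (Zeman et al., 2017) "
    "of multilingual dependency parsing (zeman-etal-2018-conll)."
)

cited = extract_cite_keys(answer)
missing = unsupported_citations(answer, ["qi-etal-2018-universal"])
print(cited, missing)
```

This only checks that a cited key was in the retrieved set. It does not check that the cited passage supports the claim, which still requires reading the abstract.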
