# Homework and bakeoff: Few-shot OpenQA with DSPy

In [1]:
__author__ = "Christopher Potts and Omar Khattab"
__version__ = "CS224u, Stanford, Spring 2024"

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)
[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/master/hw_openqa.ipynb)

If Colab is opened with this badge, please **save a copy to drive** (from the File menu) before running the notebook.

## Overview

The goal of this homework is to explore retrieval-augmented in-context learning. This is an exciting area that brings together a number of recent task ideas and modeling innovations. We will use the [DSPy programming library](http://dspy.ai) to build systems in this new mode.

Our core task is __open-domain question answering (OpenQA)__. In this task, all that is given by the dataset is a question text, and the task is to answer that question. By contrast, in many modern QA tasks, the dataset provides a question text and a gold passage, usually with a firm guarantee that the answer will be a substring of the passage.

OpenQA is substantially harder than standard QA. The usual strategy is to use a _retriever_ to find passages in a large collection of texts and train a _reader_ to find answers in those passages. This means we have no guarantee that the retrieved passage will contain the answer we need. If we don't retrieve a passage containing the answer, our reader has no hope of succeeding. Although this is challenging, it is much more realistic and widely applicable than standard QA. After all, with the right retriever, an OpenQA system could be deployed over the entire Web.

The task posed by this homework is harder even than OpenQA. We are calling this task __few-shot OpenQA__. The defining feature of this task is that the reader is simply a frozen, general-purpose language model. It accepts string inputs (prompts) and produces text in response. It is not trained to answer questions per se, and nothing about its structure ensures that it will respond with a substring of the prompt corresponding to anything like an answer.

__Few-shot QA__ (but not OpenQA!) is explored in the famous GPT-3 paper ([Brown et al. 2020](https://arxiv.org/abs/2005.14165)). The authors are able to get traction on the problem using GPT-3, an incredible finding. Our task here – __few-shot OpenQA__ – pushes this even further by retrieving passages to use in the prompt rather than assuming that the gold passage can be used in the prompt. If we can make this work, then it should be a major step towards flexibly and easily deploying QA technologies in new domains.

In summary:

| Task             | Passage given | Task-specific reader training |Task-specific retriever training  | 
|-----------------:|:-------------:|:-----------------------------:|:--------------------------------:|
| QA               | yes           | yes                           | n/a                              |
| OpenQA           | no            | yes                           | maybe                            |
| Few-shot QA      | yes           | no                            | n/a                              |
| Few-shot OpenQA  | no            | no                            | maybe                            | 

Just to repeat: your mission is to explore the final line in this table. The core notebook and assignment don't address the issue of training the retriever in a task-specific way, but this is something you could pursue for a final project; [the ColBERT codebase](https://github.com/stanford-futuredata/ColBERT) makes this easy.

It is a requirement of the bake-off that a general-purpose language model be used. In particular, trained QA systems cannot be used at all, and no fine-tuning is allowed either. See the original system question at the bottom of this notebook for guidance on which models are allowed.

Note: the models we are working with here are _big_. This poses a challenge that is increasingly common in NLP: you have to pay one way or another. You can pay to use the GPT-3 API, or you can pay to use a local model on a heavy-duty cluster computer, or you can pay with time by using a local model on a more modest computer.

## Set-up

We have sought to make this notebook self-contained and easy to use on a personal computer, on Google Colab, and in SageMaker Studio. For personal computer use, we assume you have already done everything in [setup.ipynb](setup.ipynb). For cloud usage, the next few code blocks should handle all set-up steps.

In [2]:
!pip uninstall dspy-ai -y
!pip install dspy-ai==2.3.1 

Found existing installation: dspy-ai 2.3.1
Uninstalling dspy-ai-2.3.1:
  Successfully uninstalled dspy-ai-2.3.1
Collecting dspy-ai==2.3.1
  Using cached dspy_ai-2.3.1-py3-none-any.whl.metadata (33 kB)
Using cached dspy_ai-2.3.1-py3-none-any.whl (164 kB)
Installing collected packages: dspy-ai
Successfully installed dspy-ai-2.3.1


In [3]:
try:
    # This library is our indicator that the required installs
    # need to be done.
    import datasets
    root_path = '.'
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    root_path = 'dspy'

In [4]:
from datasets import load_dataset
import openai
import os
import dspy
from dotenv import load_dotenv

Save the API keys in a `.env` file in the local root directory as follows. Then, `load_dotenv()` will make them available to the notebook:

In [5]:
# keep the API keys in a `.env` file in the local root directory
load_dotenv()

False
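
The `.env` file itself is not shown in this notebook. As a rough sketch, assuming the key is stored under the name `OPENAI_API_KEY` (the name that appears in the next cell), the file would hold one `KEY=value` pair per line. A `False` return value from `load_dotenv()` generally suggests that no variables were loaded, so it can help to fail early with a clear message. The helper `require_env` below is hypothetical and is not part of `dotenv` or DSPy:

```python
import os

# Hypothetical contents of a `.env` file (one KEY=value pair per line):
#
#     OPENAI_API_KEY=<your key here>

def require_env(name):
    """Return the value of environment variable `name`, or raise a
    clear error if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would surface a missing key before any model calls are made.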

In [6]:
# os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(root_path, 'cache')

openai_key = os.getenv('OPENAI_API_KEY')  # set OPENAI_API_KEY in your `.env` file; never hard-code a key

colbert_server = 'http://index.contextual.ai:8893/api/search'


Here we establish the Language Model `lm` and Retriever Model `rm` that we will be using. The defaults for `lm` are just for development. You may want to develop using an inexpensive model and then do your final evaluations with an expensive one. DSPy has support for a wide range of model APIs and local models.

In [7]:
lm = dspy.OpenAI(model='gpt-3.5-turbo', api_key=openai_key)

rm = dspy.ColBERTv2(url=colbert_server)

dspy.settings.configure(lm=lm, rm=rm)

Here's a command you can run to see which OpenAI models are available; OpenAI has entered into an increasingly closed mode where many older models are not available, so there are likely to be some surprises lurking here:

In [8]:
# [d['id'] for d in openai.Model.list()['data']]

In [9]:
# !ollama pull llama2

In [10]:
# ollama = dspy.OllamaLocal(
#     base_url="http://localhost:5050", 
#     model="llama2:latest", 
#     stop=['---','Explanation:','<|im_start|>','<|im_end|>'],
#     model_type = "chat"
# )

In [11]:
# lm = dspy.OllamaLocal(model='llama2')

## SQuAD

Our core development dataset is [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/). We chose this dataset because it is well known and widely used, and it is large enough to support lots of meaningful development work without being so large as to require lots of compute power.

In [12]:
squad = load_dataset("squad")

The following utility just reads a SQuAD split in as a list of `dspy.Example` instances:

In [13]:
def get_squad_split(squad, split="validation"):
    """
    Use `split='train'` for the train split.

    Returns
    -------
    list of dspy.Example with attributes question, answer

    """
    data = zip(*[squad[split][field] for field in squad[split].features])
    exs = [dspy.Example(question=q, answer=a['text'][0]).with_inputs("question")
           for eid, title, context, q, a in data]
    return exs
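
To make the effect of this utility concrete, here is a small sketch on a made-up record that has the same fields as a SQuAD row (`id`, `title`, `context`, `question`, `answers`). The record and the helper `row_to_qa` are illustrative assumptions, and plain dictionaries stand in for `dspy.Example`. The point is that only the question and the first gold answer string are kept; the gold passage is dropped, which is what makes the task open-domain:

```python
# A toy record with the same fields as a SQuAD row.
toy_row = {
    "id": "toy-0",
    "title": "Toy",
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris", "Paris"], "answer_start": [0, 0]},
}

def row_to_qa(row):
    """Keep only the question and the first gold answer string."""
    return {"question": row["question"], "answer": row["answers"]["text"][0]}

toy_example = row_to_qa(toy_row)
print(toy_example)
```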

### SQuAD train

To build few-shot prompts, we will often sample SQuAD train examples, so we load that split here:

In [14]:
squad_train = get_squad_split(squad, split="train")

### SQuAD dev

In [15]:
squad_dev = get_squad_split(squad)

### SQuAD dev sample

Evaluations are expensive in this new era! Here's a small sample to use for dev assessments:

In [16]:
import random

random.seed(1)

dev_exs = random.sample(squad_dev, k=200)

## DSPy basics

### LM usage

Here's the most basic way to use the LM:

In [17]:
lm("Which award did Gary Zukav's first book receive?")

['Gary Zukav\'s first book, "The Dancing Wu Li Masters: An Overview of the New Physics," received the 1979 American Book Award for Science.']

Keyword arguments to the underlying LM are passed through:

In [18]:
lm("Which U.S. states border no U.S. states?", temperature=0.9, n=4)

['There are two U.S. states that do not border any other U.S. states: Alaska and Hawaii. Alaska is located in the far northwest of North America, separated from the contiguous United States by Canada. Hawaii is an archipelago in the Pacific Ocean, located southwest of the contiguous United States.',
 'Hawaii and Alaska border no U.S. states.',
 'Hawaii and Alaska do not share borders with any other U.S. states.',
 'There are two states in the United States that do not border any other U.S. states: Alaska and Hawaii. Alaska is separated from the rest of the United States by Canada, and Hawaii is located in the middle of the Pacific Ocean.']

With `lm.inspect_history`, we can see the most recent language model calls:

In [19]:
lm.inspect_history(n=1)





Which U.S. states border no U.S. states? There are two U.S. states that do not border any other U.S. states: Alaska and Hawaii. Alaska is located in the far northwest of North America, separated from the contiguous United States by Canada. Hawaii is an archipelago in the Pacific Ocean, located southwest of the contiguous United States. (and 3 other completions)





In [20]:
# llama = dspy.HFClientTGI(model="meta-llama/Llama-2-13b-chat-hf", port=[7140, 7141, 7142, 7143], max_tokens=150)
# colbertv2 = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

# # # NOTE: After you finish this notebook, you can use GPT-3.5 like this if you like.
# # turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct')
# # # In that case, make sure to configure lm=turbo below if you choose to do that.

# dspy.settings.configure(rm=colbertv2, lm=llama)

### Signature-based prediction

In DSPy, __signatures__ are declarative statements about what we want the model to do. In the following, `"question -> answer"` is the signature (the most basic QA signature one could write). The next cells first pair it with `dspy.ChainOfThought`, which adds a `rationale` to the output, and then `dspy.Predict` is used to turn it into a complete QA system:

In [21]:
basic_predictor = dspy.ChainOfThought("question -> answer")

In [22]:
basic_predictor = dspy.ChainOfThought("question -> answer")
basic_predictor(question="Which award did Gary Zukav's first book receive?")

Prediction(
    rationale='produce the answer. We know that Gary Zukav is an author and his first book was "The Dancing Wu Li Masters." This book received the American Book Award for Science in 1979.',
    answer='The American Book Award for Science'
)

In [23]:
basic_predictor = dspy.Predict("question -> answer")

Here we use `basic_predictor`:

In [24]:
basic_predictor(question="Which award did Gary Zukav's first book receive?")

Prediction(
    answer='The Dancing Wu Li Masters received the American Book Award for Science.'
)

And here is the prompt that was given to the model:

In [25]:
lm.inspect_history(n=1)





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question: Which award did Gary Zukav's first book receive?
Answer: The Dancing Wu Li Masters received the American Book Award for Science.





In many cases, we will want more control over the prompt. Writing a small custom `dspy.Signature` class is the easiest way to accomplish this. In the following, we just tweak the initial instruction and provide some formatting guidance for the answer:

In [26]:
class BasicQASignature(dspy.Signature):
    __doc__ = """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

In [27]:
sig_predictor = dspy.Predict(BasicQASignature)

In [28]:
sig_predictor(question="Which U.S. states border no U.S. states?")

Prediction(
    answer='Maine, Hawaii'
)

In [29]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Answer: often between 1 and 5 words

---

Question: Which U.S. states border no U.S. states?
Answer: Maine, Hawaii
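
The prompt above shows how the pieces of a signature end up in the text sent to the model: the docstring becomes the instruction, each field becomes a labeled line, and a `desc` replaces the `${...}` placeholder. The following is a hand-rolled approximation of that layout, written only from the printed output; `render_prompt` is a hypothetical function and is not DSPy's templating code:

```python
def render_prompt(instruction, fields, inputs):
    """Approximate the prompt layout printed above.

    fields: list of (name, description-or-None, is_output) tuples
    inputs: dict mapping input field names to values
    """
    format_lines = []
    query_lines = []
    for name, desc, is_output in fields:
        label = name.capitalize()
        # A description, if given, takes the place of the ${name} placeholder.
        placeholder = desc if desc else "${" + name + "}"
        format_lines.append(f"{label}: {placeholder}")
        if is_output:
            # Output fields are left open for the model to complete.
            query_lines.append(f"{label}:")
        else:
            query_lines.append(f"{label}: {inputs[name]}")
    parts = [
        instruction,
        "Follow the following format.\n\n" + "\n".join(format_lines),
        "\n".join(query_lines),
    ]
    return "\n\n---\n\n".join(parts)

fields = [
    ("question", None, False),
    ("answer", "often between 1 and 5 words", True),
]
prompt = render_prompt(
    "Answer questions with short factoid answers.",
    fields,
    {"question": "Which U.S. states border no U.S. states?"})
print(prompt)
```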





### Modules

One of the hallmarks of DSPy is that it adopts design patterns from PyTorch. The main example of this is DSPy's use of the `Module` as the basic unit for writing simple and complex programs. Here is a very basic module for QA that makes use of `BasicQASignature` as we defined it just above.

In [30]:
class BasicQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.Predict(BasicQASignature)

    def forward(self, question):
        return self.generate_answer(question=question)

As with PyTorch, the `forward` method is called when we want to make predictions:

In [31]:
basic_qa_model = BasicQA()

In [32]:
basic_qa_model(question="Which award did Gary Zukav's first book receive?")

Prediction(
    answer='National Book Award'
)

The modular design of DSPy starts to become apparent now. If you want to change the above to use chain of thought instead of regular predictions, you need only change `dspy.Predict` to `dspy.ChainOfThought`, and similarly for `dspy.ReAct`, `dspy.ProgramOfThought`, or a module you wrote yourself.

### Teleprompting

The QA system we've defined so far is a zero-shot system. To change it into a few-shot system, we will rely on a DSPy __teleprompter__. This will allow us to flexibly move between the zero-shot and few-shot formulations. The following code achieves this.

In [33]:
from dspy.teleprompt import LabeledFewShot

Here we instantiate a `LabeledFewShot` teleprompter that will add three demonstrations. These will be sampled randomly from the set of train examples we provide:

In [34]:
fewshot_teleprompter = LabeledFewShot(k=3)

And then we call `compile` on `basic_qa_model` as we defined it above. This returns a new module that we use like any other in DSPy:

In [35]:
basic_fewshot_qa_model = fewshot_teleprompter.compile(basic_qa_model, trainset=squad_train)

In [36]:
basic_fewshot_qa_model(question="Which award did Gary Zukav's first book receive?")

Prediction(
    answer='American Book Award'
)

With `inspect_history`, we can see that prompts now contain demonstrations:

In [37]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Answer: often between 1 and 5 words

---

Question: What group did Paul VI address in New York in 1965?
Answer: United Nations

---

Question: What did Sander's study show in terms of black law students rankings?
Answer: half of all black law students rank near the bottom of their class after the first year of law school

---

Question: What problems does linguistic anthropology bring linguistic methods to bear on?
Answer: anthropological

---

Question: Which award did Gary Zukav's first book receive?
Answer: American Book Award
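
Conceptually, a labeled few-shot prompt is just `k` randomly chosen training pairs placed ahead of the target question. Here is a minimal sketch of that idea in plain Python. The function names, the `seed` argument, and the toy training set are assumptions made for illustration, and `LabeledFewShot` may differ in its details:

```python
import random

def sample_demos(trainset, k, seed=0):
    """Randomly pick k labeled examples to use as demonstrations."""
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))

def format_demos(demos, question):
    """Lay out the demonstrations, then the unanswered target question."""
    blocks = [f"Question: {d['question']}\nAnswer: {d['answer']}" for d in demos]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n---\n\n".join(blocks)

toy_train = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many legs does a spider have?", "answer": "eight"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
    {"question": "What is the chemical symbol for gold?", "answer": "Au"},
    {"question": "Which planet is closest to the sun?", "answer": "Mercury"},
]

demos = sample_demos(toy_train, k=3)
demo_prompt = format_demos(demos, "Which award did Gary Zukav's first book receive?")
print(demo_prompt)
```

Because the sampling is random, the chosen demonstrations can vary in quality, which is part of the motivation for the optimizers explored later in this notebook.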





### Evaluation

Our evaluation metric is a standard one for SQuAD and related tasks: exact match of the answer (EM).

In [38]:
from dspy.evaluate import answer_exact_match
from dspy.evaluate.evaluate import Evaluate

In [39]:
answer_exact_match(dspy.Example(answer="STAGE 2!"), dspy.Prediction(answer="stage 2"))

True
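
The example above returns `True` even though the two strings differ in case and punctuation, because answers are normalized before they are compared. The sketch below follows the normalization used in the standard SQuAD evaluation script (lowercase, strip punctuation, drop English articles, collapse whitespace). It is an approximation for illustration, and DSPy's own implementation may apply additional steps:

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(gold, predicted):
    """True if the two answers are identical after normalization."""
    return normalize_answer(gold) == normalize_answer(predicted)

print(exact_match("STAGE 2!", "stage 2"))
print(exact_match("National Book Award", "American Book Award"))
```

Note how strict this metric is: a prediction that is correct but phrased as a full sentence will still count as a miss.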

In DSPy, `Evaluate` objects provide a uniform interface for running evaluations. Here are two for us to use in development. The first will evaluate on all of `dev_exs` and should provide a meaningful picture of how a system is doing. It could be expensive to use it a lot, though. The second is for debugging and is probably too small to give a reliable estimate.

In [40]:
dev_evaluater = Evaluate(
    devset=dev_exs, # 200 examples
    num_threads=1,
    display_progress=True,
    display_table=5)

In [41]:
tiny_evaluater = Evaluate(
    devset=dev_exs[:15],
    num_threads=1,
    display_progress=True,
    display_table=5)

Here is a tiny (debugging-oriented) evaluation of our few-shot QA system:

In [42]:
tiny_evaluater(basic_fewshot_qa_model, metric=answer_exact_match)

Average Metric: 2 / 15  (13.3): 100%|█████████| 15/15 [00:00<00:00, 1351.87it/s]


Average Metric: 2 / 15  (13.3%)


Unnamed: 0,question,example_answer,pred_answer,answer_exact_match
0,In 1517 who was Luther's bishop?,Albert of Mainz,Albert of Mainz,✔️ [True]
1,When was the construction that changed the Rhine's Delta?,20th Century,13th century,False
2,How many companies were registered in Warsaw in 2006?,304016,"over 100,000",False
3,What is the CJEU's duty?,"to ""ensure that in the interpretation and application of the Treaties the law is observed""",interpret EU law,False
4,What would a teacher do for someone who is cocky?,deflate,ignore them,False


13.33
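
The final number is simply the percentage of dev examples on which the metric returned `True` (here, 2 out of 15). A stripped-down sketch of that computation follows. The function `evaluate` and the toy program are hypothetical stand-ins and leave out everything `Evaluate` does for threading, progress display, and tables:

```python
def evaluate(program, devset, metric):
    """Return the percentage of examples for which `metric` is true."""
    correct = sum(bool(metric(ex, program(ex["question"]))) for ex in devset)
    return round(100 * correct / len(devset), 2)

# Toy setup: 15 examples, and a "program" that gets exactly two of them right.
toy_dev = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(15)]
known = {"q0": "a0", "q2": "a2"}

def toy_program(question):
    return known.get(question, "wrong")

def toy_metric(example, prediction):
    return example["answer"] == prediction

toy_score = evaluate(toy_program, toy_dev, toy_metric)
print(toy_score)
```

With so few examples, a single additional correct answer moves the score by almost seven points, which is why the tiny evaluator is best reserved for debugging.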

### Retrieval

The final major component of our systems is retrieval. When we defined `rm`, we connected to a remote ColBERT index and retriever system that we can now use for search.

The basic `dspy.Retrieve` module returns only passages:

In [43]:
retriever = dspy.Retrieve(k=3)

In [44]:
passages = retriever("Which award did Gary Zukav's first book receive?")

In [45]:
passages

Prediction(
    passages=['Gary Zukav | Gary Zukav Gary Zukav (born October 17, 1942) is an American spiritual teacher and the author of four consecutive New York Times Best Sellers. Beginning in 1998, he appeared more than 30 times on "The Oprah Winfrey Show" to discuss transformation in human consciousness concepts presented in his book "The Seat of the Soul". His first book, "The Dancing Wu Li Masters" (1979), won a U.S. National Book Award. Gary Zukav was born in Port Arthur, Texas, and spent his early childhood in San Antonio and Houston. His family moved to Pittsburg, Kansas, while he was in fourth grade. In', 'The Dancing Wu Li Masters | The Dancing Wu Li Masters The Dancing Wu Li Masters is a 1979 book by Gary Zukav, a popular science work exploring modern physics, and quantum phenomena in particular. It was awarded a 1980 U.S. National Book Award in category of Science. Although it explores empirical topics in modern physics research, "The Dancing Wu Li Masters" gained attenti

If we need passages with scores and other metadata, we can call `rm` directly:

In [46]:
rm("Which award did Gary Zukav's first book receive?", k=1)

[{'pid': 7182463,
  'prob': 1.0,
  'rank': 1,
  'score': 24.77630615234375,
  'text': 'Gary Zukav | Gary Zukav Gary Zukav (born October 17, 1942) is an American spiritual teacher and the author of four consecutive New York Times Best Sellers. Beginning in 1998, he appeared more than 30 times on "The Oprah Winfrey Show" to discuss transformation in human consciousness concepts presented in his book "The Seat of the Soul". His first book, "The Dancing Wu Li Masters" (1979), won a U.S. National Book Award. Gary Zukav was born in Port Arthur, Texas, and spent his early childhood in San Antonio and Houston. His family moved to Pittsburg, Kansas, while he was in fourth grade. In',
  'long_text': 'Gary Zukav | Gary Zukav Gary Zukav (born October 17, 1942) is an American spiritual teacher and the author of four consecutive New York Times Best Sellers. Beginning in 1998, he appeared more than 30 times on "The Oprah Winfrey Show" to discuss transformation in human consciousness concepts presente
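
To illustrate the shape of these results (ranked records carrying a passage id, a score, and the text), here is a toy retriever over an in-memory list. It scores passages by simple word overlap, which is only a stand-in: ColBERT scores passages with learned token-level representations, and nothing below reflects how the remote index is actually implemented. The name `toy_search` and the passages are made up for the example:

```python
def toy_search(query, passages, k=1):
    """Rank passages by how many distinct query words they contain."""
    query_words = set(query.lower().split())
    scored = []
    for pid, text in enumerate(passages):
        overlap = len(query_words & set(text.lower().split()))
        scored.append({"pid": pid, "score": float(overlap), "text": text})
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda record: record["score"], reverse=True)
    top = scored[:k]
    for rank, record in enumerate(top, start=1):
        record["rank"] = rank
    return top

toy_passages = [
    "Gary Zukav wrote The Dancing Wu Li Masters",
    "The Rhine flows through Europe",
    "Warsaw is the capital of Poland",
]

hits = toy_search("Who is Gary Zukav", toy_passages, k=2)
for hit in hits:
    print(hit)
```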

## Few-shot OpenQA with context

Let's build on the above core concepts to define a basic retrieval-augmented generation (RAG) program. This program solves the core task of few-shot OpenQA and will serve as the basis for the homework questions.

We begin with a signature that takes context into account but is otherwise just like `BasicQASignature` above:

In [47]:
class ContextQASignature(dspy.Signature):
    __doc__ = """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

And here is a complete program/system for the task:

In [48]:
class RAG(dspy.Module):
    def __init__(self, num_passages=1):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.Predict(ContextQASignature)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
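
The data flow in `RAG` is: question in, passages retrieved, passages plus question handed to the reader, answer out. The sketch below mirrors that flow with plain callables so it can run without a language model or a retrieval server. `ToyRAG`, `stub_retrieve`, and `stub_read` are hypothetical names, and the stub reader only reports how much context it received:

```python
class ToyRAG:
    """Retrieve-then-read with interchangeable parts."""

    def __init__(self, retrieve, read, num_passages=1):
        self.retrieve = retrieve
        self.read = read
        self.num_passages = num_passages

    def __call__(self, question):
        context = self.retrieve(question, self.num_passages)
        answer = self.read(context, question)
        return {"context": context, "answer": answer}

def stub_retrieve(question, k):
    # Stand-in for a retriever: always returns the first k toy passages.
    corpus = ["passage one", "passage two", "passage three"]
    return corpus[:k]

def stub_read(context, question):
    # Stand-in for a language model call.
    return f"(answer based on {len(context)} passage(s))"

toy_rag = ToyRAG(stub_retrieve, stub_read, num_passages=2)
toy_output = toy_rag("Which award did Gary Zukav's first book receive?")
print(toy_output)
```

Keeping the retriever and reader as separate, swappable pieces is what makes it straightforward to insert extra steps between them later.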

In [49]:
rag_model = RAG(num_passages=3)

In [50]:
rag_model(question="Which award did Gary Zukav's first book receive?")

Prediction(
    context=['Gary Zukav | Gary Zukav Gary Zukav (born October 17, 1942) is an American spiritual teacher and the author of four consecutive New York Times Best Sellers. Beginning in 1998, he appeared more than 30 times on "The Oprah Winfrey Show" to discuss transformation in human consciousness concepts presented in his book "The Seat of the Soul". His first book, "The Dancing Wu Li Masters" (1979), won a U.S. National Book Award. Gary Zukav was born in Port Arthur, Texas, and spent his early childhood in San Antonio and Houston. His family moved to Pittsburg, Kansas, while he was in fourth grade. In', 'The Dancing Wu Li Masters | The Dancing Wu Li Masters The Dancing Wu Li Masters is a 1979 book by Gary Zukav, a popular science work exploring modern physics, and quantum phenomena in particular. It was awarded a 1980 U.S. National Book Award in category of Science. Although it explores empirical topics in modern physics research, "The Dancing Wu Li Masters" gained attentio

An optional tiny evaluation:

In [51]:
tiny_evaluater(rag_model, metric=answer_exact_match)

Average Metric: 2 / 15  (13.3): 100%|██████████| 15/15 [00:00<00:00, 848.82it/s]

Average Metric: 2 / 15  (13.3%)





Unnamed: 0,question,example_answer,context,pred_answer,answer_exact_match
0,In 1517 who was Luther's bishop?,Albert of Mainz,"[""Paul Speratus | Ellwangen, Priest of the Diocese of Augsburg). Early studies took him to Paris and Italy, as well as (probably) Freiburg and Vienna....",George of the Palatinate,False
1,When was the construction that changed the Rhine's Delta?,20th Century,"['Rhine | rivers and streams. Many rivers have been closed (""dammed"") and now serve as drainage channels for the numerous polders. The construction of Delta...",second half of the 20th Century,False
2,How many companies were registered in Warsaw in 2006?,304016,"['Warsaw | such as Sydney, Istanbul, Amsterdam or Seoul. Warsaw, especially its city centre (""Śródmieście""), is home not only to many national institutions and government...",304016,✔️ [True]
3,What is the CJEU's duty?,"to ""ensure that in the interpretation and application of the Treaties the law is observed""",['European Union law | elected by the judges for three years. While TEU article 19(3) says the Court of Justice is the ultimate court to...,"To ensure that ""the law is observed""",False
4,What would a teacher do for someone who is cocky?,deflate,"['Teacher | for the individual students accordingly. For example, an experienced teacher and parent described the place of a teacher in learning as follows: ""The...",deflate the cocky,False


13.33

## Question 1: Optimizing RAG [2 points]

We used `RAG` above as a zero-shot system. We could turn it into a few-shot system by using `LabeledFewShot` as we did in [the teleprompting section](#Teleprompting) above, but this may actually be problematic: if we randomly sample demonstrations with retrieved passages, we might be instructing the model with a lot of cases where the context passage isn't helping (and may actually be actively misleading the model). 

What we'd like to do is select demonstrations where the model gets the answer correct and the context passage does contain the answer. To do this, we will use the DSPy `BootstrapFewShot` optimizer. There are two steps for this: (1) defining a metric and (2) running the optimizer.

__Note__: The code for this question can be found in the DSPy tutorials, and you should feel free to make use of that code. The goal is to help you understand the design patterns and overall logic of optimizing DSPy programs.

__Task 1__: Complete `validate_context_and_answer` according to the specification in the docstring.

In [52]:
def validate_context_and_answer(example, pred, trace=None):
    """Return True if `example.answer` matches `pred.answer` according
    to `dspy.evaluate.answer_exact_match` and `pred.context` contains
    `example.answer` according to `dspy.evaluate.answer_passage_match`.

    Parameters
    ----------
    example: dspy.Example 
        with attributes `answer` and `context`
    pred: dspy.Prediction 
        with attributes `answer` and `context`
    trace : None (included for dspy internal compatibility)

    Returns
    -------
    bool

    """
    pass
    ##### YOUR CODE HERE

    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM



A test you can use to check your implementation:

In [53]:
def test_validate_context_and_answer(func):
    examples = [
        (
            dspy.Example(question="Q1", answer="B"),
            dspy.Prediction(question="Q1", context="A B C", answer="B"),
            True
        ),
        # Context doesn't contain answer, but predicted answer is correct.
        (
            dspy.Example(question="Q1", answer="D"),
            dspy.Prediction(question="Q1", context="A B C", answer="D"),
            False
        ),
        # Context contains answer, but predicted answer is not correct.
        (
            dspy.Example(question="Q1", answer="C"),
            dspy.Prediction(question="Q1", context="A B C", answer="D"),
            False
        )
    ]
    errcount = 0
    for ex, pred, result in examples:
        predicted = func(ex, pred, trace=None)
        if predicted != result:
            errcount += 1
            print(f"Error for `{func.__name__}`: "
                  f"Expected inputs\n\t{ex}\n\t{pred} to return {result}.")
    if errcount == 0:
        print(f"No errors detected for `{func.__name__}`")

In [54]:
test_validate_context_and_answer(validate_context_and_answer)

No errors detected for `validate_context_and_answer`
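
To see the logic of this metric without DSPy in the loop, here is a plain-Python sketch that runs the same three cases as the test above. It uses a deliberately light normalization and substring matching for the context check, both of which are simplifying assumptions; the DSPy functions may match answers differently. `toy_validate` is a hypothetical name:

```python
def _norm(text):
    """Lowercase and collapse whitespace (a deliberately light normalization)."""
    return " ".join(text.lower().split())

def toy_validate(gold_answer, pred_answer, context):
    """True only if the prediction matches the gold answer AND the gold
    answer appears somewhere in the retrieved context."""
    passages = [context] if isinstance(context, str) else context
    answer_ok = _norm(gold_answer) == _norm(pred_answer)
    context_ok = any(_norm(gold_answer) in _norm(p) for p in passages)
    return answer_ok and context_ok

# (gold answer, predicted answer, context), following the test cases above.
cases = [
    ("B", "B", "A B C"),  # correct answer, supported by the context
    ("D", "D", "A B C"),  # correct answer, but not found in the context
    ("C", "D", "A B C"),  # context contains the answer, prediction is wrong
]
validation_results = [toy_validate(*case) for case in cases]
print(validation_results)
```

Requiring both conditions filters out demonstrations where the model was right by luck or from memory rather than from the retrieved passage.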


__Task 2__: Complete `bootstrap_optimize` according to the specification in the docstring.

In [55]:
from dspy.teleprompt import BootstrapFewShot

def bootstrap_optimize(model):
    """Use `BootstrapFewShot` to optimize `model`, with the metric set
    to `validate_context_and_answer` as defined above and default
    values for all other keyword arguments to `BootstrapFewShot`.

    Parameters
    ----------
    model: dspy.Module

    Returns
    -------
    dspy.Module, the optimized version of `model`

    """
    pass
    ##### YOUR CODE HERE
    
    teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
    optimized_program = teleprompter.compile(model, trainset=squad_train)
    return optimized_program
    

A test you can use to check your implementation:

In [56]:
def test_bootstrap_optimize(func):
    model = RAG()
    compiled = func(model)
    if not hasattr(compiled, "_compiled") or not compiled._compiled:
        print(f"Error for `{func.__name__}`: "
               "The return value is not a compiled program.")
        return None
    state = compiled.dump_state()
    if not state['generate_answer']['demos']:
        print(f"Error for `{func.__name__}`: "
               "The compiled program has no `demos`.")
        return None
    print(f"No errors detected for `{func.__name__}`")

In [57]:
test_bootstrap_optimize(bootstrap_optimize)

  0%|                                       | 12/87599 [00:00<01:33, 934.98it/s]

Bootstrapped 4 full traces after 13 examples in round 0.
No errors detected for `bootstrap_optimize`
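
The log line above (4 traces after 13 examples) is consistent with a loop that runs the program over training examples, keeps the ones that pass the metric, and stops once it has enough. The sketch below shows that idea in plain Python. It is a simplified picture, not the `BootstrapFewShot` implementation, and the cutoff `max_demos=4` is an assumption chosen to mirror the log:

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Collect training examples on which `program` already succeeds."""
    demos = []
    for example in trainset:
        prediction = program(example["question"])
        if metric(example, prediction):
            # Keep the full trace: inputs, gold answer, and retrieved context.
            demos.append({**example, **prediction})
        if len(demos) >= max_demos:
            break
    return demos

# Toy setup: the "program" only answers even-numbered questions correctly.
toy_train = [{"question": f"q{i}", "answer": str(i)} for i in range(10)]

def toy_program(question):
    i = int(question[1:])
    return {"context": [f"fact {i}"], "answer": str(i) if i % 2 == 0 else "?"}

def toy_metric(example, prediction):
    return example["answer"] == prediction["answer"]

bootstrapped = bootstrap_demos(toy_program, toy_train, toy_metric)
print([d["question"] for d in bootstrapped])
```

The resulting demonstrations include the context the program actually retrieved, so they show the model complete worked cases rather than bare question and answer pairs.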





## Question 2: Multi-passage summarization [2 points]

The `dspy.Retrieve` layer in our `RAG` retrieves `k` passages, where `k` is under the control of the user. One hypothesis one might have is that it would be good to summarize these passages before using them as evidence. This seems especially likely to help in scenarios where the question can be answered only by synthesizing information across documents – it might be too much to ask the language model to do both synthesizing and answering in a single step.

The current question maps out a basic strategy for summarization. The heart of it is a new signature called `SummarizeSignature`. This can be used on its own with a simple `dspy.Predict` call, and we'll incorporate it into a RAG program in the next question.

For this question, though, your task is just to complete `SummarizeSignature`. The requirements are as follows:

1. A `__doc__` value that gives an instruction that seems to work well. You can decide what to say here.
2. A `dspy.InputField` named `context`. You can decide whether to use the `desc` parameter.
3. A `dspy.OutputField` named `summary`. You can decide whether to use the `desc` parameter.

In [58]:
class SummarizeSignature(dspy.Signature):
    pass
    ##### YOUR CODE HERE

    __doc__ = """Summarize the main points of the text."""
    
    context = dspy.InputField()
    summary = dspy.OutputField(desc="do not start with 'Context:'")



Here's a simple test that just checks for the required pieces in a basic way:

In [59]:
def test_SummarizeSignature(sigclass):
    fields = sigclass.fields
    expected_fieldnames = ['context', 'summary']
    fieldnames = sorted([field for field in fields])
    errcount = 0
    if expected_fieldnames != fieldnames:
        errcount += 1
        print(f"Error for `{sigclass.__name__}`: "
              f"Expected fieldnames {expected_fieldnames}, got {fieldnames}.")
    if not sigclass.__doc__:
        errcount += 1
        print(f"Error for `{sigclass.__name__}`: No docstring specified.")
    if errcount == 0:
        print(f"No errors detected for `{sigclass.__name__}`")

In [60]:
test_SummarizeSignature(SummarizeSignature)

No errors detected for `summary`


Here is the simplest way to use `SummarizeSignature`:

In [61]:
summarizer = dspy.Predict(SummarizeSignature)

In [62]:
retriever("Where is Guarani spoken?").passages

['Guarani language | Guarani language Guarani () specifically the primary variety known as Paraguayan Guarani (endonym "avañe\'ẽ" \'the people\'s language\'), is an indigenous language of South America that belongs to the Tupi–Guarani family of the Tupian languages. It is one of the official languages of Paraguay (along with Spanish), where it is spoken by the majority of the population, and where half of the rural population is monolingual. It is spoken by communities in neighboring countries, including parts of northeastern Argentina, southeastern Bolivia and southwestern Brazil, and is a second official language of the Argentine province of Corrientes since 2004; it is also',
 'Guarani dialects | west by about 15,000 speakers, mostly in Jujuy, but also in Salta Province. It refers essentially to the same variety of Guarani as Eastern Bolivian Guarani. Additionally, another variety of Guarani known as Mbyá is spoken in Argentina by 3,000 speakers. Eastern Bolivian Guarani and Western

In [63]:
result = summarizer(context=retriever("Where is Guarani spoken?").passages)
result

Prediction(
    summary='The Guarani language is an indigenous language spoken in South America, particularly in Paraguay where it is one of the official languages. There are different dialects of Guarani spoken in neighboring countries like Argentina, Bolivia, and Brazil. The Guarani language belongs to the Tupi-Guarani branch of the Tupi linguistic family, with three distinct groups within the Guaraní subgroup. The language is widely spoken among non-indigenous communities in Latin America, with over four million speakers across the region.'
)

In [64]:
type(result)

dspy.primitives.prediction.Prediction

In [65]:
result.summary

'The Guarani language is an indigenous language spoken in South America, particularly in Paraguay where it is one of the official languages. There are different dialects of Guarani spoken in neighboring countries like Argentina, Bolivia, and Brazil. The Guarani language belongs to the Tupi-Guarani branch of the Tupi linguistic family, with three distinct groups within the Guaraní subgroup. The language is widely spoken among non-indigenous communities in Latin America, with over four million speakers across the region.'

## Question 3: Summarizing RAG [2 points]

Your task for this question is to modify `RAG` as defined above so that the retrieved passages are summarized before being passed to `generate_answer`. 

Here is the `RAG` system copied from above with the class name changed to the one we will use for this new system. Your task is to add the summarization step. This should be very straightforward given the modular design that DSPy supports and encourages!

In [66]:
class SummarizingRAG(dspy.Module):
    def __init__(self, num_passages=3):
        # Please name your summarization layer `summarize` so that we
        # can check for its presence.
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        ##### YOUR CODE HERE
        
        self.summarize = dspy.Predict(SummarizeSignature)
    
        self.generate_answer = dspy.Predict(ContextQASignature)

    def forward(self, question):
        context = self.retrieve(question).passages
        ##### YOUR CODE HERE
        
        summary = self.summarize(context=context)
        new_context = summary.summary
        
        prediction = self.generate_answer(context=new_context, question=question)
        
        return dspy.Prediction(context=new_context, answer=prediction.answer)

A simple test for this design spec:

In [67]:
def test_SummarizingRAG(classname):
    model = classname(num_passages=3)
    errcount = 0
    if not hasattr(model, "summarize"):
        errcount += 1
        print(f"Error for `{classname.__name__}`: "
              f"Expected a layer called 'summarize'")
    context = model.retrieve("What are some foods?").passages
    pred = model("What are some foods?")
    if context == pred.context:
        errcount += 1
        print(f"Error for `{classname.__name__}`: "
              "The model seems to be using raw retrieved contexts "
              "for predictions rather than summarizing them.")
    if errcount == 0:
        print(f"No errors detected for `{classname.__name__}`")

In [68]:
test_SummarizingRAG(SummarizingRAG)

No errors detected for `SummarizingRAG`


Model usage:

In [69]:
summarizing_rag_model = SummarizingRAG()

In [70]:
summarizing_rag_model(question="Which award did Gary Zukav's first book receive?")

Prediction(
    context='Gary Zukav is an American spiritual teacher and author of several bestselling books, including "The Dancing Wu Li Masters," which explores modern physics and quantum phenomena using metaphors from eastern spiritual movements. Markus Zusak is an Australian author known for his novels "The Book Thief" and "The Messenger," both of which have received critical acclaim and awards. Zusak has also discussed his upcoming novel "Bridge of Clay."',
    answer='American Book Award'
)

Note: if you decide to use `BootstrapFewShot` on this, be sure not to use the metric we defined above, which requires that the passage embeds the correct answer as a substring. Now that we are summarizing, this is unlikely to hold, even if the answers are good ones.
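
One way to follow this advice is to give the optimizer a metric that looks only at the answer string. The sketch below is a plain-Python stand-in, not DSPy's own implementation: the names `normalize_text` and `answer_only_match` are hypothetical, and it assumes only that the example and prediction objects expose an `answer` attribute and that the metric is called as `metric(example, pred, trace=None)`.

```python
import re
import string
from types import SimpleNamespace


def normalize_text(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_only_match(example, pred, trace=None):
    """Hypothetical metric: compares answers only, never inspects `pred.context`."""
    return normalize_text(example.answer) == normalize_text(pred.answer)


# Toy stand-ins for a gold example and a prediction from a summarizing system.
gold = SimpleNamespace(answer="American Book Award")
pred = SimpleNamespace(
    context="A summary that paraphrases the retrieved passages.",
    answer="the American Book Award.",
)
print(answer_only_match(gold, pred))
```

Because the function never reads `pred.context`, it does not penalize a system whose context is a paraphrased summary rather than a raw passage.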

## Question 4: Your original system [3 points]

This question asks you to design your own few-shot OpenQA system. All of the code above can be used and modified for this, and the requirement is just that you try something new that goes beyond what we've done so far. 

Terms for the bake-off:

* You can make free use of SQuAD and other publicly available data.

* The LM must be an autoregressive language model. No trained QA components can be used. This includes general purpose LMs that have been fine-tuned for QA. (We have obviously waded into some vague territory here. The spirit of this is to make use of frozen, general-purpose models. We welcome questions about exactly how this is defined, since it could be instructive to explore this.)

Here are some ideas for the original system:

* We have relied almost entirely on `dspy.Predict`. Drop-in replacements include `dspy.ChainOfThought` and `dspy.ReAct`.

* We have used only one retriever. DSPy supports other retrieval mechanisms, including retrieval using [You.com](https://you.com/).

* DSPy includes additional optimizers. Two that are worth trying are `SignatureOptimizer` for automatic prompt exploration and `BootstrapFewShotWithRandomSearch`, which combines `LabeledFewShot` and `BootstrapFewShot`.

* Our one-step summarization procedure from Question 3 doesn't change the query to the retriever. We might want it to change as we gather evidence. This is a common design principle for multi-hop OpenQA systems.
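
The last idea can be sketched without any DSPy machinery. In the toy loop below, `write_query` and `retrieve` are hypothetical callables standing in for an LM-based query writer and a retriever, and the index contents are invented placeholders. The point is only that each hop's query is conditioned on the evidence gathered so far.

```python
def multi_hop_context(question, write_query, retrieve, max_hops=2):
    """Gather passages over several hops, rewriting the query at each hop.

    `write_query(context, question)` and `retrieve(query)` are hypothetical
    stand-ins for an LM call and a retriever call.
    """
    context = []
    for _ in range(max_hops):
        query = write_query(context, question)
        for passage in retrieve(query):
            if passage not in context:  # keep first occurrence, preserve order
                context.append(passage)
    return context


# Toy components so the sketch runs without an LM or a retrieval server.
TOY_INDEX = {
    "who wrote Book X": ["Book X was written by Author Y."],
    "where was Author Y born": ["Author Y was born in City Z."],
}


def toy_write_query(context, question):
    # First hop asks about the book; later hops follow up on what was found.
    return "who wrote Book X" if not context else "where was Author Y born"


def toy_retrieve(query):
    return TOY_INDEX.get(query, [])


toy_context = multi_hop_context(
    "Where was the author of Book X born?", toy_write_query, toy_retrieve)
print(toy_context)
```

In a real system, the query writer would be an LM module that reads the accumulated context, and the final answer would be generated from the combined passages.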

__Original system instructions__:

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [None]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# START COMMENT: Enter your system description in this cell.



# For my original system, I explored various combinations of the options below:
#---------------------------------------------------
# - General-purpose language models:
#     1) gpt-3.5-turbo through OpenAI (not free)
#     2) llama2 through Ollama (free)

# - Optimizers:
#     1) BootstrapFewShot (baseline)
#     2) BootstrapFewShotWithRandomSearch
#     3) BayesianSignatureOptimizer (timed out)

# - Predictors:
#     1) dspy.Predict
#     2) dspy.ChainOfThought

# - Pipelines:
#     1) Single-hop RAG
#     2) Single-hop RAG with summarization
#     3) Multi-hop RAG
#     4) Multi-hop RAG with summarization

# - Various wordings of the instructions in the signatures' OutputField descriptions

# - Adding a GenerateSearchQuery signature to the multi-hop pipeline

# I did not manage to try any retrieval model other than the ColBERT server that was set up for us.
# I did explore You.com and Weaviate. However, You.com's API is not free, and I ran into difficulties
# setting up the connection to Weaviate, so I gave up on both.


# Experimental findings:
#---------------------------------------------------
# During the experiments I noticed several interesting things. However, they may apply only to this very limited scope of
# exploration, since I used only the given SQuAD dataset.

# 1) Models with context summarization tend to have lower accuracy. This is probably because some details or keywords are
#     rephrased in the summarization process. Since performance was worse, the associated code is commented out below.

# 2) Llama2 tends to give longer and more conversational responses than gpt-3.5-turbo, so it requires more specific instructions in
#     the OutputField description, such as limiting answers to fewer than X words, not requiring complete sentences, and
#     asking for no conversational responses.

# 3) The exact match metric is not the best way to evaluate these models. It tends to give them less credit than they
#     deserve: in spot checks, many predictions marked as wrong are actually correct when judged by a human. The common issues are
#     extra words (keywords only vs. complete sentence), synonyms (US vs. American), punctuation, case, and other-language spellings
#     of the same English name. A better metric could check the similarity of the two strings. However, its appropriateness depends on
#     the dominant question types and the nature of the task. I believe this is a tough research area. For this task, I stuck with
#     exact match, since that is how the final submission will be evaluated.

# 4) One huge challenge in this assignment was API/server availability. A lot of time was spent rerunning training and
#     inference after connection crashes.

# 5) One popular way to improve a QA system, especially on more complicated questions, is to use a multi-hop program.
#     For this approach I used a GenerateSearchQuery signature with a ChainOfThought predictor, which generates a different query in
#     each 'hop' to gather different context before generating the final answer. In my experiments, this approach worked better with
#     gpt-3.5-turbo. Llama2 produced better results without the multi-hop architecture.
    

# Validation results and final model:
#---------------------------------------------------
# 1) Single-hop RAG
#     - gpt-3.5-turbo: 13%
#     - Llama2: 20%
# 2) Single-hop RAG with summarization
#     - gpt-3.5-turbo: 7%
#     - Llama2: 7%
# 3) Multi-hop RAG
#     - gpt-3.5-turbo: 20%
#     - Llama2: 7%
# 4) Multi-hop RAG with summarization
#     - gpt-3.5-turbo: 0%
#     - Llama2: 14%


# Based on the quick validation results above, I short-listed Llama2 with the single-hop pipeline and gpt-3.5-turbo with the
# multi-hop pipeline for further optimization and evaluation.


# 1) Single-hop RAG with Llama2
#     - BootstrapFewShotWithRandomSearch: 16%

# 2) Multi-hop RAG with gpt-3.5-turbo
#     - BootstrapFewShotWithRandomSearch: 37%

# Therefore, my final model is multi-hop RAG with gpt-3.5-turbo.




from datasets import load_dataset
import openai
import os
import dspy
from dspy.teleprompt import BootstrapFewShot, LabeledFewShot, BayesianSignatureOptimizer, BootstrapFewShotWithRandomSearch
from dspy.evaluate import answer_exact_match
from dspy.evaluate.evaluate import Evaluate
from dsp.utils import deduplicate
import random

random.seed(1)


# setting lm and rm in dspy
openai_key = 'my private key'
colbert_server = 'http://index.contextual.ai:8893/api/search'

lm = dspy.OpenAI(model='gpt-3.5-turbo', api_key=openai_key)
rm = dspy.ColBERTv2(url=colbert_server)
# dspy.settings.configure(lm=lm, rm=rm)

# llama = dspy.OllamaLocal(    
#     model="llama2:latest", 
#     stop=['---','Explanation:','<|im_start|>','<|im_end|>'],
#     model_type = "chat")
# dspy.settings.configure(lm=llama, rm=rm)


# dataset download and split
def get_squad_split(squad, split="validation"):
    data = zip(*[squad[split][field] for field in squad[split].features])
    exs = [dspy.Example(question=q, answer=a['text'][0]).with_inputs("question")
           for eid, title, context, q, a in data]
    return exs
    
squad = load_dataset("squad")
squad_train = get_squad_split(squad, split="train")
squad_dev = get_squad_split(squad)
dev_exs = random.sample(squad_dev, k=200)

# Signatures
class ContextQASignature(dspy.Signature):
    __doc__ = """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(
        desc="return mostly in english, short answers (limited to less than 6 words), no conversational response, no complete sentence")

class GenerateSearchQuery(dspy.Signature):
    __doc__ = """Write a simple search query that will help answer a complex question."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField(desc = "A short question uniquely answered by the context.")

class Summarizer(dspy.Signature):
    __doc__ = """Summarize the main points of the text."""
    context = dspy.InputField()
    summary = dspy.OutputField(desc="do not start with 'Context:' or 'Summary:'")


# RAG pipelines
class OpenQaRAG(dspy.Module):
    def __init__(self, num_passages=1):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.Predict(ContextQASignature)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

# class SummarizingRAG(dspy.Module):
#     def __init__(self, num_passages=3):
#         super().__init__()
#         self.retrieve = dspy.Retrieve(k=num_passages)
#         self.summarize = dspy.Predict(SummarizeSignature)
#         self.generate_answer = dspy.Predict(ContextQASignature)

#     def forward(self, question):
#         context = self.retrieve(question).passages
#         summary = self.summarize(context=context)
#         new_context = summary.summary
#         prediction = self.generate_answer(context=new_context, question=question)
#         return dspy.Prediction(context=new_context, answer=prediction.answer)

class MultiHopRAG(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_question = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(ContextQASignature)
        self.max_hops = max_hops
    
    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            query = self.generate_question[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)


# class MultiHopRAGwithSummarization(dspy.Module):
#     def __init__(self, passages_per_hop=3, max_hops=2):
#         super().__init__()
#         self.generate_question = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
#         self.retrieve = dspy.Retrieve(k=passages_per_hop)
#         self.summarizer = dspy.ChainOfThought(Summarizer)
#         self.generate_answer = dspy.ChainOfThought(ContextQASignature)
#         self.max_hops = max_hops
    
#     def forward(self, question):
#         context = []
#         for hop in range(self.max_hops):
#             query = self.generate_question[hop](context=context, question=question).query
#             passages = self.retrieve(query).passages
#             summarized_passages = self.summarizer(question=query, context=passages).summary
#             context.append(summarized_passages)

#         pred = self.generate_answer(context=context, question=question)
#         return dspy.Prediction(context=context, answer=pred.answer)


# Models
# open_qa_model = OpenQaRAG()     
# summarized_model = SummarizingRAG()
# multi_hop_model = MultiHopRAG()
# multi_hop_sum_model = MultiHopRAGwithSummarization()


# Optimizer
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

def bootstrap_optimize(model):
    teleprompter = BootstrapFewShot(metric=validate_context_and_answer) 
    optimized_program = teleprompter.compile(model, trainset=dev_exs)
    return optimized_program

def bootstrap_rm_optimize(model):
    teleprompter = BootstrapFewShotWithRandomSearch(
        metric=validate_context_and_answer,
        num_candidate_programs=2,
        max_bootstrapped_demos=2,
        max_labeled_demos=2)
    optimized_program = teleprompter.compile(model, trainset=dev_exs)
    return optimized_program

# def bayesian_optimize(model):
#     teleprompter = BayesianSignatureOptimizer(metric=validate_context_and_answer) 
#     kwargs = dict(num_threads=8, display_progress=True, display_table=0)
#     devset = random.sample(squad_dev, k=20)
#     optimized_program = teleprompter.compile(model, devset=devset, optuna_trials_num=3, max_bootstrapped_demos=3, max_labeled_demos=3, eval_kwargs=kwargs)#, trainset=squad_train)
#     return optimized_program

 

# using OpenAI gpt3.5 as language model
dspy.settings.configure(lm=lm, rm=rm)
multi_hop_model = MultiHopRAG()
gpt_optimized_model = bootstrap_rm_optimize(multi_hop_model) 


# return gpt_optimized_model




# STOP COMMENT: Please do not remove this comment.
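
Finding 3 in the description above points at the strictness of exact match. A softer alternative often used in SQuAD-style evaluation is token-level F1 between the predicted and gold answers. The sketch below is a plain-Python illustration with hypothetical function names (`simple_tokens`, `token_f1`) and a deliberately crude tokenizer. It is neither the bakeoff metric nor a DSPy API.

```python
import re
from collections import Counter


def simple_tokens(text):
    """Crude tokenizer: lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_toks = simple_tokens(prediction)
    gold_toks = simple_tokens(gold)
    if not pred_toks or not gold_toks:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_toks == gold_toks)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


# A full-sentence answer gets partial credit instead of zero.
print(token_f1("It received the American Book Award.", "American Book Award"))
print(token_f1("Pulitzer Prize", "American Book Award"))
```

A score like this rewards answers that contain the gold tokens plus extra words, but it still cannot recognize synonyms such as "US" vs. "American".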

## Question 5: Bakeoff entry [1 point]

For the bake-off, you simply need to be able to run your system on the file 

```data/openqa/cs224u-openqa-test-unlabeled.txt```

The following code should download it for you if necessary:

In [73]:
import wget

if not os.path.exists(os.path.join("data", "openqa", "cs224u-openqa-test-unlabeled.txt")):
    os.makedirs(os.path.join('data', 'openqa'), exist_ok=True)
    wget.download('https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt', out='data/openqa/')

If the above fails, you can just download https://web.stanford.edu/class/cs224u/data/cs224u-openqa-test-unlabeled.txt and place it in `data/openqa`.

This file contains only questions. The starter code below will help you structure this. It writes a file "cs224u-openqa-bakeoff-entry.json" to the current directory. That file should be uploaded as-is. Please do not change its name.

In [92]:
import json
import tqdm

def create_bakeoff_submission(model):
    """
    The argument `model` is a `dspy.Module`. The return value of its
    `forward` method must have an `answer` attribute.
    """

    filename = os.path.join("data", "openqa", "cs224u-openqa-test-unlabeled.txt")

    # This should become a mapping from questions (str) to the answer
    # strings (str) produced by your system.
    gens = {}

    with open(filename) as f:
        questions = f.read().splitlines()
    # Here we loop over the questions, run the system `model`, and
    # store its `answer` value as the prediction:
    for question in tqdm.tqdm(questions):
        gens[question] = model(question=question).answer

    # Quick tests we advise you to run:
    # 1. Make sure `gens` is a dict with the questions as the keys:
    assert all(q in gens for q in questions)
    # 2. Make sure the values are str:
    assert all(isinstance(d, str) for d in gens.values())

    # And finally the output file:
    with open("cs224u-openqa-bakeoff-entry.json", "wt") as f:
        json.dump(gens, f, indent=4)

In [93]:
create_bakeoff_submission(gpt_optimized_model)

100%|██████████████████████████████████████| 100/100 [01:18<00:00,  1.27it/s]
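
Before uploading, it can be worth re-reading the file from disk and checking it against the question list. The helper below is a hypothetical sketch (the name `check_submission_file` is not part of the course code). It assumes the format written by `create_bakeoff_submission`: a JSON object mapping each question string to an answer string.

```python
import json
import os
import tempfile


def check_submission_file(json_path, questions):
    """Return a list of problems found in an entry file (empty if none)."""
    with open(json_path) as f:
        gens = json.load(f)
    if not isinstance(gens, dict):
        return ["Top-level JSON value is not an object."]
    problems = []
    missing = [q for q in questions if q not in gens]
    if missing:
        problems.append(f"{len(missing)} question(s) have no answer.")
    non_str = [q for q, a in gens.items() if not isinstance(a, str)]
    if non_str:
        problems.append(f"{len(non_str)} answer(s) are not strings.")
    return problems


# Toy demonstration on a temporary file rather than the real entry.
toy_questions = ["Where is Guarani spoken?", "What are some foods?"]
with tempfile.TemporaryDirectory() as tmpdir:
    toy_path = os.path.join(tmpdir, "toy-entry.json")
    with open(toy_path, "wt") as f:
        json.dump({"Where is Guarani spoken?": "Paraguay"}, f)
    toy_problems = check_submission_file(toy_path, toy_questions)
print(toy_problems)
```

For the real entry, the same function could be called with `"cs224u-openqa-bakeoff-entry.json"` and the lines read from the unlabeled test file.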


Here's what it looks like to evaluate our first program, `basic_qa_model`, on the bakeoff data: