# DSPy-arXiv

Given an arXiv paper from the Computer Science (cs) section,\
extract its subcategories (e.g., cs.AI, cs.IR, ...).

In [None]:
import json
import re
import pathlib

# dspy framework
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# arXiv utilities
from arxiv.taxonomy import definitions

# various paths of the project
PATH_ROOT = pathlib.Path(".").parent
PATH_DATASET = PATH_ROOT / "dataset"
PATH_DATABASE = PATH_ROOT / "database"

# ports where services are exposed
PORT_LM = 11433

# selected categories, i.e. just the ones from Computer Science (cs)
CATEGORIES = {
    cat: meta
    for cat, meta in definitions.CATEGORIES_ACTIVE.items()
    if cat.split(".")[0] == "cs"
}

## Dataset

We use the term **dataset** to refer to a small selection of papers that will be used to 'train' the pipeline.

- 50 papers in `trainset` - used for pipeline training
- 50 papers in `valset` - used for pipeline evaluation during training
- 50 papers in `testset` - used for pipeline evaluation after training

To construct the dataset, please refer to the *database.ipynb* notebook.

In [None]:
def preprocess_example(example: dict) -> dict:
    """
    Turn a paper (example) into the fields of a dspy.Example.
    """
    categories = set(example["categories"].split(" ")) & set(CATEGORIES)
    return {
        "title": example["title"],
        "abstract": example["abstract"],
        "text": example["text"],
        "categories": categories,
        "labels": dspy.Example(categories=categories),
    }


def load_split(name: str) -> list[dspy.Example]:
    """
    Load every JSON file of a split as a dspy.Example.
    """
    examples = []
    # sort the paths so that the order of the examples is reproducible
    for path in sorted((PATH_DATASET / name).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            example = dspy.Example(preprocess_example(json.load(f)))
        # only these fields are inputs; everything else is a label
        examples.append(example.with_inputs("title", "abstract", "text"))
    return examples


trainset = load_split("trainset")
valset = load_split("valset")
testset = load_split("testset")

Each datapoint (paper + paper metadata) is a `dspy.Example`,\
a dict-like structure with *inputs* ($x$) and *labels* ($y$).

- Inputs:
  - `title`: Title of the paper.
  - `abstract`: Abstract of the paper.
  - `text`: Text body of the paper parsed from PDF with [arxiv2text](https://github.com/dsdanielpark/arxiv2text). (This is future work.)

- Labels:
  - `categories`: Set of associated categories.
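
The label set is obtained by intersecting a paper's space-separated category string with the selected `cs` categories. Below is a minimal sketch of that filtering step. The three-entry `CS_CATEGORIES` dict, the helper name `filter_categories`, and the category string are all invented for illustration, so that the snippet runs without the `arxiv` package.

```python
# Hypothetical stand-in for the CATEGORIES dict built from arxiv.taxonomy;
# only the keys matter for the filtering step.
CS_CATEGORIES = {"cs.AI": None, "cs.IR": None, "cs.LG": None}


def filter_categories(raw: str) -> set[str]:
    """Keep only the categories that belong to the selected cs subset."""
    return set(raw.split(" ")) & set(CS_CATEGORIES)


# An invented category string: cross-listed non-cs categories are dropped.
kept = filter_categories("cs.LG stat.ML cs.AI math.OC")
print(sorted(kept))  # ['cs.AI', 'cs.LG']
```

A paper whose categories all fall outside the selected subset ends up with an empty label set, so it is worth checking for that case before scoring.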

In [None]:
print(re.sub(r"\s+", " ", valset[0].title), "\n")
print(re.sub(r"\s+", " ", valset[0].abstract)[:400], "...\n")
print(valset[0].labels().categories, "\n")

## Metrics

**Metrics** are scalar values that quantify the performance of a pipeline with respect to a given task.

In [None]:
def metric_fn(labels, preds, trace=None):
    """
    Fraction of ground-truth categories found among the top-k predictions,
    where k is the number of ground-truth categories.
    """
    preds: list[str] | str = preds.categories
    labels: set[str] = set(labels.categories)

    # a single string counts as one predicted category
    if not isinstance(preds, list):
        preds = [preds]

    # a paper without ground-truth categories cannot be scored
    if not labels:
        return 0.0

    # We assume that predicted categories are sorted by relevance,
    # so we keep only the top k of them
    k = min(len(labels), len(preds))
    top_k_preds: set[str] = set(preds[:k])

    # ground-truth labels are an unordered set,
    # so it makes sense to look at the intersection with top_k_preds
    score: float = len(top_k_preds & labels) / len(labels)
    return score
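
To make the scoring rule concrete, here is a small worked example on plain Python values. It is a sketch that re-implements the top-k idea in a standalone helper (the name `top_k_recall` and the example categories are invented for illustration), so it runs without DSPy.

```python
def top_k_recall(label_categories, predicted_categories):
    """Share of ground-truth categories found in the top-k predictions."""
    labels = set(label_categories)
    # k is the number of ground-truth categories (or fewer, if the
    # model predicted fewer categories than that)
    k = min(len(labels), len(predicted_categories))
    top_k = set(predicted_categories[:k])
    return len(top_k & labels) / len(labels)


# Invented example: two ground-truth categories, three ranked predictions.
labels = {"cs.AI", "cs.IR"}
preds = ["cs.IR", "cs.LG", "cs.AI"]

# k = 2, so only "cs.IR" and "cs.LG" are considered; one of two labels is hit.
score = top_k_recall(labels, preds)
print(score)  # 0.5
```

Note that `cs.AI` is predicted, but only in third place, so it does not count: the rule rewards ranking the right categories first.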

## DSPy

The DSPy framework resembles PyTorch in three respects:

- I/O interface
- Modular structure
- Optimization

### I/O interface:
- $(x, y)$ → pipeline → generated outputs
- (`title & abstract`, `categories`) → pipeline → `preds`

### Modular structure:

In PyTorch:
- Tensor/s → Module → Tensor/s
- Tensor/s → Module → Module → ... → Module → Tensor/s
- e.g. `Linear`, `Dropout`, `ReLU`...
  
In DSPy:
- InputField/s → Module → OutputField/s
- InputField/s → Module → Module → ... → Module → OutputField/s
- e.g. `Predict`, `ChainOfThought`, `ReAct`, custom, ...

### "Optimization"

In PyTorch:
- Define `loss`. e.g., `nn.MSELoss`, `nn.CrossEntropyLoss`, ...
- Define `optimizer`. e.g., `optim.SGD`, `optim.Adam`, ...
- Minimize `loss` over `trainset` using `optimizer` by adjusting model parameters.

In DSPy:
- Define `metric`. e.g., `metric_fn`
- Define `optimizer`. e.g., `BootstrapFewShot`, `SignatureOptimizer`, ...
- Maximize `metric` over `trainset` using `optimizer` by adjusting the prompts (instructions and demonstrations) used within modules.

**DSPy heuristically searches for the most effective strategy to prompt an LLM to achieve the task according to the pipeline.**
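
The idea of a heuristic search over prompts can be sketched with a toy random search. This is not DSPy's implementation: the real optimizers bootstrap demonstrations by running the pipeline and filtering with the metric. Here, `toy_program`, `average_metric`, `random_search`, and the data are all hypothetical stand-ins, and the "LM" is a dictionary lookup that only knows what its demonstrations show.

```python
import random

# Hypothetical toy data: (input, expected output) pairs.
TRAINSET = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]
VALSET = [("a", "A"), ("b", "B"), ("c", "C")]


def toy_program(demos, x):
    """Stand-in for a prompted LM: it only 'knows' what its demos show."""
    return dict(demos).get(x, "unknown")


def average_metric(demos, valset):
    """Mean exact-match score of the program on a validation set."""
    return sum(toy_program(demos, x) == y for x, y in valset) / len(valset)


def random_search(trainset, valset, num_candidates, max_demos, seed=0):
    """Sample demo subsets at random and keep the best-scoring one."""
    rng = random.Random(seed)
    # the baseline is the program without any demonstrations
    best_demos, best_score = [], average_metric([], valset)
    for _ in range(num_candidates):
        demos = rng.sample(trainset, k=max_demos)
        score = average_metric(demos, valset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


best_demos, best_score = random_search(
    TRAINSET, VALSET, num_candidates=20, max_demos=2
)
print(best_demos, best_score)
```

The loop never touches model weights: the only thing that changes between candidates is which demonstrations are placed in the prompt.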

## Pipeline 101

(`title & abstract`, `categories`) → pipeline101 → `preds`

- Just title & abstract, no text body of the paper.
- No custom modules or creative use of modules.
- No RAG.

(But all the above can be easily added later.)

### Signature

**Signatures** are like types in a programming language.

- They define the module's input/output.
- Their `__doc__` will be included in the LLM prompt, so they can specify the goal of a module.

In [None]:
class PredictCategories(dspy.Signature):
    __doc__ = (
        "Given the title and abstract of a scientific paper, "
        "identify the most relevant categories. "
        f"Valid categories are: {', '.join(CATEGORIES)}"
    )
    title = dspy.InputField()
    abstract = dspy.InputField()
    categories = dspy.OutputField(
        desc="list of comma-separated categories",
        format=lambda x: ", ".join(x) if isinstance(x, list) else x,
    )

### Pipeline / Module

The pipeline is a Module as well.

Similar to PyTorch, it makes use of:
- `__init__`: Here, the modules are instantiated.
- `forward`: Here, it is defined how modules interact.

In [None]:
class Pipeline101(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.ChainOfThought(PredictCategories)

    def forward(self, title, abstract, text=None, labels=None):
        categories = self.predict(title=title, abstract=abstract).completions.categories
        categories = [cat.strip() for cat in categories[0].split(",")]
        return dspy.Prediction(categories=categories)
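
The `forward` method splits the raw completion on commas and strips whitespace. As a sketch of how that step could be made stricter, the helper below also drops unknown categories and duplicates. The helper name `parse_categories`, the `valid` set, and the sample completion are assumptions for illustration and are not part of the pipeline above.

```python
def parse_categories(completion: str, valid: set[str]) -> list[str]:
    """Split a comma-separated completion and keep valid, unique categories."""
    parsed = []
    for cat in completion.split(","):
        cat = cat.strip()
        # drop empty fragments, unknown categories, and duplicates,
        # while preserving the order given by the model
        if cat in valid and cat not in parsed:
            parsed.append(cat)
    return parsed


# Invented completion with stray whitespace, a duplicate, and an unknown label.
valid = {"cs.AI", "cs.IR", "cs.LG"}
print(parse_categories("cs.IR,  cs.AI, cs.XX, cs.IR", valid))  # ['cs.IR', 'cs.AI']
```

Preserving the order matters here because the metric assumes that predictions are sorted by relevance.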

## Language Model

The **Language Model (LM)** is at the core of the pipeline.

- It is used for processing and generating text in the pipeline.
- It is used by the optimizers to improve the pipeline itself.

For simple tasks, a *fast* and *cheap* model is enough, which matters because the optimization makes many LM calls.

**DSPy caches all the calls to the LM.**
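
Why caching helps can be shown with plain memoization keyed on the prompt. This is only an illustration of the idea, not how DSPy implements its cache; `fake_lm` and `cached_lm` are hypothetical names, and the "LM" simply upper-cases its input.

```python
import functools

# Counts how often the (pretend) expensive request is actually made.
calls = {"count": 0}


def fake_lm(prompt: str) -> str:
    """Hypothetical stand-in for an expensive LM request."""
    calls["count"] += 1
    return prompt.upper()


@functools.lru_cache(maxsize=None)
def cached_lm(prompt: str) -> str:
    """Return the stored completion when the same prompt is seen again."""
    return fake_lm(prompt)


cached_lm("what's red + yellow?")
cached_lm("what's red + yellow?")  # served from the cache
print(calls["count"])  # 1
```

With a cache like this, re-running a cell with unchanged prompts costs nothing, while any change to the prompt triggers a fresh request.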

In [None]:
# You can host a local model with ollama.
# Just change `model` and `api_base` accordingly.
# For example: `model="gemma"` & `api_base="http://localhost:11434/v1/"`
lm = dspy.OpenAI(
    model="gpt-3.5-turbo",
    api_base=f"http://localhost:{PORT_LM}/v1/",
    api_key="your-api-key",
    model_type="chat",
)

# configure dspy to use `lm` as Language Model
dspy.settings.configure(lm=lm)

# Just testing that LM works
lm("What's red + yellow?")

## Optimization

As suggested by the [docs](https://dspy-docs.vercel.app/docs/building-blocks/optimizers#which-optimizer-should-i-use), with 50 examples, we choose `BootstrapFewShotWithRandomSearch`.

In [None]:
# This is not optimized
pipeline101 = Pipeline101()

optimizer = BootstrapFewShotWithRandomSearch(
    metric=metric_fn,
    max_bootstrapped_demos=2,
    max_labeled_demos=0,
    max_rounds=1,
    num_candidate_programs=20,
    num_threads=8,
    teacher_settings=dict(lm=lm),
)

pipeline101_optimized = optimizer.compile(
    pipeline101,
    teacher=pipeline101,
    trainset=trainset,
    valset=valset,
)

## Results

We simply compare the `metric_fn` on the `testset`:

`pipeline101` *vs.* `pipeline101_optimized`

In [None]:
scores_pipeline101 = []
for example in testset:
    example_x = example.inputs()
    example_y = example.labels()
    prediction = pipeline101(**example_x)
    score = metric_fn(example_y, prediction)
    scores_pipeline101.append(score)

# Inspect the last prompt given to the LLM
lm.inspect_history()
print("Ground-truth categories:", example.labels().categories)
print("Score:", score)

print("\n" * 5)

In [None]:
scores_pipeline101_optimized = []
for example in testset:
    example_x = example.inputs()
    example_y = example.labels()
    prediction = pipeline101_optimized(**example_x)
    score = metric_fn(example_y, prediction)
    scores_pipeline101_optimized.append(score)

lm.inspect_history()
print("Ground-truth categories:", example.labels().categories)
print("Score:", score)
print("\n" * 5)

In [None]:
print(
    "pipeline101:",
    sum(scores_pipeline101) / len(scores_pipeline101),
)
print(
    "pipeline101_optimized:",
    sum(scores_pipeline101_optimized) / len(scores_pipeline101_optimized),
)

While developing this notebook, we:
- Processed 7,537,982 input tokens (USD 0.0005 per 1K tokens)
- Generated 315,868 output tokens (USD 0.0015 per 1K tokens)
- With an estimated cost of < $5
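
The estimate can be checked with a few lines of arithmetic, assuming the quoted rates are US dollars per 1,000 tokens:

```python
# Token counts reported above.
input_tokens = 7_537_982
output_tokens = 315_868

# Rates reported above, assumed to be USD per 1,000 tokens.
price_in_per_1k = 0.0005
price_out_per_1k = 0.0015

cost_in = input_tokens / 1000 * price_in_per_1k
cost_out = output_tokens / 1000 * price_out_per_1k
total = cost_in + cost_out

print(f"input: {cost_in:.2f}, output: {cost_out:.2f}, total: {total:.2f}")
# input: 3.77, output: 0.47, total: 4.24
```

Input tokens account for most of the cost, since every call re-sends the full prompt, including the list of valid categories.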

---

| Pipeline                   | Avg. metric_fn |
|----------------------------|---------------:|
| pipeline101                |           56%  |
| pipeline101_optimized      |           65%  |

## Conclusions

### Future Work

- Add RAG.

- Utilize the category descriptions.

- Use the full body of the paper.
  - Generate summaries.
  - Use a sliding window, process chunks, and aggregate.
  - Use a more capable language model with greater context length.

- Validate data with the `Assert` module.

- Use a smarter teacher (e.g., GPT-4).

- Experiment with more creative pipelines.

### Why DSPy?

- It has promising core concepts.
- It is actively being developed.
- It is versatile.

### Why Not DSPy?

- It is not production-ready.
- As of 23rd February 2024, it is not well-documented (see [#390](https://github.com/stanfordnlp/dspy/issues/390)).
- Other alternatives exist for similar use cases.

## Alternatives

Many frameworks exist that programmatically generate prompts and parse responses.

- [Instructor](https://github.com/jxnl/instructor): Provides structured outputs for Large Language Models (LLMs).
- [Guidance](https://github.com/guidance-ai/guidance?tab=readme-ov-file#constrained-generation): A guidance language for controlling large language models.
- [LMQL](https://github.com/eth-sri/lmql): A language for constraint-guided and efficient LLM programming.
- [Outlines](https://github.com/outlines-dev/outlines): Supports structured text generation.
- ...

### Guidance

  "...constrain generation (e.g. with regex and CFGs) as well as to interleave control (conditional, loops) and generation seamlessly."

  ```python
  from guidance import models, select

  # load a model
  llama2 = models.LlamaCpp(path)

  # a simple select between two options
  llama2 + f'Do you want a joke or a poem? A ' + select(['joke', 'poem'])
  ```

  > Do you want a joke or a poem? A **poem**

### Instructor

Validate LLM outputs to streamline data extraction.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# Enables `response_model`
client = instructor.patch(OpenAI())


class UserDetail(BaseModel):
    name: str
    age: int


user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[
        {"role": "user", "content": "Extract Jason is 25 years old"},
    ],
)

assert isinstance(user, UserDetail)
assert user.name == "Jason"
assert user.age == 25
```