# Optimizing Llama 2 prompts with DSPy

1. [DSPy](https://github.com/stanfordnlp/dspy/tree/669ecd7e04431a0e890ca61c60afafcae1544517) is a prompting abstraction for defining LLM programs.

2. You define the inputs and outputs, and DSPy optimizes the prompt automatically.

3. The DSPy paper reports good results: https://arxiv.org/pdf/2310.03714.pdf

4. Install DSPy.

5. Download the model.

6. Run the prompt self-optimizer.

7. Evaluate the optimized program on held-out questions.




In [1]:
!pip install transformers accelerate bitsandbytes dspy-ai

Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting dspy-ai
  Downloading dspy_ai-2.3.6-py3-none-any.whl (182 kB)
Collecting backoff~=2.2.1 (from dspy-ai)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting openai<2.0.0,>=0.28.1 (from dspy-ai)
  Downloading openai-1.13.3-py3-none-any.whl (227 kB)
Collecting ujson (from dspy-ai)
  Downloading ujson-5.9.0-cp31

Note: to run the following code, you need access to the Llama 2 weights and a Hugging Face access token. Instructions are on the model cards on the Hugging Face Hub, for example https://huggingface.co/meta-llama/Llama-2-7b-chat-hf. The code below loads the 13B chat variant, `meta-llama/Llama-2-13b-chat-hf`.
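A minimal sketch of supplying the access token, assuming it is stored in an environment variable named `HF_TOKEN` (an assumption; the notebook does not show how it authenticates). The helper names `get_hf_token` and `login_if_token_available` are hypothetical; `huggingface_hub.login` is the library's login function.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_if_token_available(env_var="HF_TOKEN"):
    """Log in to the Hugging Face Hub when a token is available.

    Returns True if a login was attempted, False otherwise.
    """
    token = get_hf_token(env_var)
    if token is None:
        return False
    # Imported lazily so get_hf_token works without the package installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_if_token_available()` before the model download keeps the token out of the notebook itself.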


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch

model_name = "meta-llama/Llama-2-13b-chat-hf"
# 4-bit NF4 quantization with double quantization; compute in bfloat16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)


model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)




In [3]:
import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

# Load math questions from the GSM8K dataset
gms8k = GSM8K()
gsm8k_trainset, gsm8k_devset = gms8k.train[:30], gms8k.dev[:20]


100%|██████████| 7473/7473 [00:00<00:00, 30830.28it/s]
100%|██████████| 1319/1319 [00:00<00:00, 31347.78it/s]


In [4]:
prompt = gms8k.dev[21]
model_inputs = tokenizer(prompt["question"], return_tensors="pt").to("cuda:0")
output = model.generate(**model_inputs)

print(tokenizer.decode(output[0], skip_special_tokens=True))
print("\nCorrect answer:",prompt["answer"])

Zilla spent 7% of her monthly earnings on rent, half of it on her other monthly expenses, and put the rest in her savings. If she spent $133 on her rent, how much does she deposit into her savings account in a month?
Let's start by identifying the information given in the problem:
Zilla's monthly earnings: $1000
Zilla's rent: $133
Zilla's other monthly expenses: half of her rent
Zilla's savings: the rest of her monthly earnings after paying her rent and other expenses

Now, let's think about how we can use this information to find out how much Zilla deposits into her savings account in a month.

First, we can calculate Zilla's other monthly expenses by multiplying her rent by 2:

Other monthly expenses = rent x 2
= $133 x 2
= $266

Now, we can calculate Zilla's savings by subtracting her other monthly expenses from her monthly earnings:

Savings = monthly earnings - other monthly expenses
= $1000 - $266
= $734

So, Zilla deposits $734 into her savings account in a month.

Correct answe

In [5]:
from dsp.modules.lm import LM
def openai_to_hf(**kwargs):
    hf_kwargs = {}
    for k, v in kwargs.items():
        if k == "n":
            hf_kwargs["num_return_sequences"] = v
        elif k == "frequency_penalty":
            hf_kwargs["repetition_penalty"] = 1.0 - v
        elif k == "presence_penalty":
            hf_kwargs["diversity_penalty"] = v
        elif k == "max_tokens":
            hf_kwargs["max_new_tokens"] = v
        elif k == "model":
            pass
        else:
            hf_kwargs[k] = v

    return hf_kwargs

class HFModel(LM):
    def __init__(self, model: AutoModelForCausalLM, tokenizer: AutoTokenizer, **kwargs):
        """Wrapper for an already-loaded Hugging Face model.

        Args:
            model (AutoModelForCausalLM): the loaded HF model used for generation
            tokenizer (AutoTokenizer): the tokenizer that matches `model`
        """
        super().__init__(model)
        self.model = model
        self.tokenizer = tokenizer
        self.drop_prompt_from_output = True
        self.history = []
        self.is_client = False
        self.device = model.device
        self.kwargs = {
            "temperature": 0.3,
            "max_new_tokens": 300,
        }

    def basic_request(self, prompt, **kwargs):
        raw_kwargs = kwargs
        kwargs = {**self.kwargs, **kwargs}
        response = self._generate(prompt, **kwargs)

        history = {
            "prompt": prompt,
            "response": response,
            "kwargs": kwargs,
            "raw_kwargs": raw_kwargs,
        }
        self.history.append(history)

        return response

    def _generate(self, prompt, **kwargs):
        kwargs = {**openai_to_hf(**self.kwargs), **openai_to_hf(**kwargs)}
        if isinstance(prompt, dict):
            try:
                prompt = prompt['messages'][0]['content']
            except (KeyError, IndexError, TypeError):
                print("Failed to extract 'content' from the prompt.")
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        # print(kwargs)
        outputs = self.model.generate(**inputs, **kwargs)
        if self.drop_prompt_from_output:
            input_length = inputs.input_ids.shape[1]
            outputs = outputs[:, input_length:]
        completions = [
            {"text": c}
            for c in self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
        ]
        response = {
            "prompt": prompt,
            "choices": completions,
        }
        return response

    def __call__(self, prompt, only_completed=True, return_sorted=False, **kwargs):
        assert only_completed, "for now"
        assert return_sorted is False, "for now"

        if kwargs.get("n", 1) > 1 or kwargs.get("temperature", 0.0) > 0.1:
            kwargs["do_sample"] = True


        response = self.request(prompt, **kwargs)
        return [c["text"] for c in response["choices"]]
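To make the keyword translation concrete, here is a compact, table-driven restatement of the `openai_to_hf` helper from the cell above. The name `openai_to_hf_sketch` and the `_RENAMES` table are introduced only for this illustration; the mapping itself is copied from the notebook's code.

```python
# OpenAI-style names on the left, Hugging Face `generate` names on the right.
_RENAMES = {
    "n": "num_return_sequences",
    "presence_penalty": "diversity_penalty",
    "max_tokens": "max_new_tokens",
}


def openai_to_hf_sketch(**kwargs):
    """Translate OpenAI-style generation kwargs to Hugging Face names."""
    hf_kwargs = {}
    for key, value in kwargs.items():
        if key == "model":
            # The model is already loaded, so the name is dropped.
            continue
        if key == "frequency_penalty":
            # Same conversion as the notebook's helper.
            hf_kwargs["repetition_penalty"] = 1.0 - value
        else:
            # Rename known keys and pass everything else through unchanged.
            hf_kwargs[_RENAMES.get(key, key)] = value
    return hf_kwargs


example = openai_to_hf_sketch(model="ignored", n=2, max_tokens=300, temperature=0.3)
print(example)
# {'num_return_sequences': 2, 'max_new_tokens': 300, 'temperature': 0.3}
```

Unknown keys such as `temperature` or `do_sample` pass through untouched, which is why `__call__` can set `do_sample` directly.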

In [6]:
# Set up the LM
llama = HFModel(model,tokenizer)
dspy.settings.configure(lm=llama)

In [7]:
class QASignature(dspy.Signature):
    ("""You are given a question and answer"""
    """and you must think step by step to answer the question. """
    """Only include the answer as the output.""")
    question = dspy.InputField(desc="A math question")
    answer = dspy.OutputField(desc="An answer that is a number")

class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(QASignature)

    def forward(self, question):
        return self.prog(question=question)
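One detail of `QASignature` is worth checking: its instruction text is built from adjacent string literals, which Python joins with no separator. The first fragment has no trailing space, so the joined text reads "answerand", which matches the prompt printed by `inspect_history` later in the notebook. A small check in plain Python, independent of DSPy (the variable names are only for illustration):

```python
# The three fragments exactly as written in QASignature.
joined = ("""You are given a question and answer"""
          """and you must think step by step to answer the question. """
          """Only include the answer as the output.""")

# Same text with a trailing space added to the first fragment.
fixed = ("""You are given a question and answer """
         """and you must think step by step to answer the question. """
         """Only include the answer as the output.""")

print("answerand" in joined)  # True
print("answerand" in fixed)   # False
```

The results recorded in this notebook were produced with the original, unspaced instruction text.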

In [8]:
from dspy.teleprompt import BootstrapFewShot

# Set up the optimizer
config = dict(max_bootstrapped_demos=2)

# Optimize! Use the `gsm8k_metric` here. In general, the metric tells the optimizer how well it's doing.
teleprompter = BootstrapFewShot(metric=gsm8k_metric, **config)
optimized_cot = teleprompter.compile(CoT(), trainset=gsm8k_trainset, valset=gsm8k_devset)

 17%|█▋        | 5/30 [02:52<14:24, 34.59s/it]

Bootstrapped 2 full traces after 6 examples in round 0.
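The log line above reflects the basic idea of bootstrapping: run the program on training examples, keep the runs that the metric accepts as demonstrations, and stop once enough have been collected. The following is a simplified illustration of that idea in plain Python, not DSPy's implementation; `bootstrap_demos`, `toy_program`, and `toy_metric` are hypothetical names, and the toy data stands in for GSM8K.

```python
def bootstrap_demos(program, trainset, metric, max_bootstrapped_demos=2):
    """Collect demonstrations from runs of `program` that pass `metric`."""
    demos = []
    examples_seen = 0
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break  # enough demonstrations collected
        examples_seen += 1
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
    return demos, examples_seen


def toy_program(question):
    """Stand-in for the LLM: only answers correctly when the first number is a multiple of 3."""
    first = int(question.split()[0])
    return str(first + 1) if first % 3 == 0 else "0"


def toy_metric(example, prediction):
    """Exact-match metric on the answer string."""
    return prediction == example["answer"]


trainset = [{"question": f"{n} + 1", "answer": str(n + 1)} for n in range(10)]
demos, seen = bootstrap_demos(toy_program, trainset, toy_metric)
print(f"Bootstrapped {len(demos)} demos after {seen} examples")
# Bootstrapped 2 demos after 4 examples
```

Because collection stops early, the optimizer in the cell above only needed a handful of the 30 training examples.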





In [9]:
from dspy.evaluate import Evaluate

# Set up the evaluator, which can be used multiple times.
evaluate = Evaluate(devset=gsm8k_devset, metric=gsm8k_metric, num_threads=4, display_progress=True, display_table=0)

# Evaluate our `optimized_cot` program.
evaluate(optimized_cot)

Average Metric: 5 / 20  (25.0): 100%|██████████| 20/20 [14:58<00:00, 44.90s/it]

Average Metric: 5 / 20  (25.0%)





25.0

In [10]:
llama.inspect_history(n=1)





You are given a question and answerand you must think step by step to answer the question. Only include the answer as the output.

---

Follow the following format.

Question: A math question
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: An answer that is a number

---

Question: Bridget counted 14 shooting stars in the night sky. Reginald counted two fewer shooting stars than did Bridget, but Sam counted four more shooting stars than did Reginald. How many more shooting stars did Sam count in the night sky than was the average number of shooting stars observed for the three of them?
Reasoning: Let's think step by step in order to find the answer. We know that Bridget counted 14 shooting stars, so Reginald counted 14 - 2 = 12 shooting stars. Sam counted four more shooting stars than Reginald, so Sam counted 12 + 4 = 16 shooting stars. The average number of shooting stars observed for the three of them is (14 + 16) / 3 = 16. Therefore, Sam cou

In [11]:
# Compare with the uncompiled CoT program
evaluate(CoT())

Average Metric: 0.0 / 7  (0.0):  35%|███▌      | 7/20 [03:27<04:48, 22.16s/it]

Error for example in dev set: 		 'max_tokens'


Average Metric: 0.0 / 8  (0.0):  40%|████      | 8/20 [03:28<03:02, 15.23s/it]

Error for example in dev set: 		 'max_tokens'


Average Metric: 3.0 / 20  (15.0): 100%|██████████| 20/20 [08:07<00:00, 24.38s/it]

Error for example in dev set: 		 'max_tokens'
Average Metric: 3.0 / 20  (15.0%)





15.0
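The `'max_tokens'` messages in the output above look like the string form of a `KeyError`. One plausible reading, which the notebook does not confirm, is that some code path looks up `"max_tokens"` in the wrapper's default settings, while `HFModel.__init__` stores the limit under `"max_new_tokens"`. The sketch below only reproduces the message shape and shows a possible workaround; the lookup itself is an assumption.

```python
# Default generation settings as defined in HFModel.__init__.
wrapper_kwargs = {"temperature": 0.3, "max_new_tokens": 300}

try:
    wrapper_kwargs["max_tokens"]  # hypothetical lookup by the OpenAI-style name
    message = None
except KeyError as err:
    message = str(err)

print(message)  # 'max_tokens'

# Possible workaround (an assumption, untested here): store the limit under
# the OpenAI-style key. The notebook's openai_to_hf helper already renames
# "max_tokens" to "max_new_tokens" before calling model.generate.
candidate_kwargs = {"temperature": 0.3, "max_tokens": 300}
print("max_tokens" in candidate_kwargs)  # True
```

Examples that raise an error are scored as failures, so the 15% figure for the uncompiled program may understate what it would achieve without these errors.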

In [12]:
class Zeroshot(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.Predict(QASignature)

    def forward(self, question):
        return self.prog(question=question)

In [13]:
evaluate(Zeroshot())

Average Metric: 1 / 20  (5.0): 100%|██████████| 20/20 [01:45<00:00,  5.28s/it]

Average Metric: 1 / 20  (5.0%)





5.0
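To put the three runs side by side, the snippet below tabulates the scores reported in the cell outputs above. The counts are copied from those outputs; the labels are descriptive names chosen for this summary.

```python
# (correct, total) as reported by the three evaluate(...) calls.
results = {
    "Compiled CoT (BootstrapFewShot)": (5, 20),
    "Uncompiled CoT": (3, 20),
    "Zero-shot Predict": (1, 20),
}

accuracy = {
    name: 100 * correct / total for name, (correct, total) in results.items()
}

for name, pct in accuracy.items():
    print(f"{name:<32} {pct:5.1f}%")
```

With only 20 dev examples, each question is worth 5 percentage points, and the uncompiled CoT run logged errors on some examples, so the gaps between runs should be read as indicative only.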