# Prompting
_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_

This notebook demonstrates different prompting techniques to get the most out of your LLM.

In [7]:
!pip install python-dotenv -q

In [8]:
from dotenv import load_dotenv

load_dotenv(override=True)

import pandas as pd
import re
from huggingface_hub import InferenceClient

pd.set_option("display.max_colwidth", None)

In [9]:
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
)

# Test your LLM client
llm_client.text_generation(prompt="How are you today?", max_new_tokens=20)

'\n\nI’m good, thanks. I’m in the middle of a tour at the'

# LLM-as-a-judge

In [4]:
from datasets import load_dataset

ratings = load_dataset("McGill-NLP/feedbackQA")["train"]

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [5]:
ratings = pd.DataFrame(ratings)

In [6]:
ratings["review_1"] = ratings["feedback"].apply(lambda x: x["rating"][0])
ratings["explanation_1"] = ratings["feedback"].apply(lambda x: x["explanation"][0])

ratings["review_2"] = ratings["feedback"].apply(lambda x: x["rating"][1])
ratings["explanation_2"] = ratings["feedback"].apply(lambda x: x["explanation"][1])
ratings = ratings.drop(columns=["feedback"])

In [7]:
conversion_dict = {"Excellent": 4, "Acceptable": 3, "Could be Improved": 2, "Bad": 1}

In [8]:
ratings["score_1"] = ratings["review_1"].map(conversion_dict)
ratings["score_2"] = ratings["review_2"].map(conversion_dict)

In [9]:
ratings["score_1"].value_counts(), ratings["score_2"].value_counts()

(score_1
 4    1941
 1    1793
 3    1004
 2     922
 Name: count, dtype: int64,
 score_2
 1    1906
 4    1656
 2    1078
 3    1020
 Name: count, dtype: int64)

Check coherence between human raters: baseline score

In [82]:
print("Correlation between 2 human raters:")
print(f"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}")

Correlation between 2 human raters:
0.5626


The correlation between 2 human raters is not that good! If the human ratings do not agree, it probably means the rating criteria are not clear enough.

This means that our "ground truth" contains noise, so we cannot expect any algorithmic evaluation to come that close to it.
However, using the average human rating instead of any single one should already help decrease the noise.
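
To see why averaging helps, here is a minimal simulation sketch. It assumes each rater reports a hidden "true" score plus independent Gaussian noise, which is a simplification and not something measured on feedbackQA. The names `pearson`, `true_scores`, `rater_1` and `rater_2` are made up for the illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
# Hypothetical hidden quality of each answer, on the same 1-4 scale as above
true_scores = [rng.randint(1, 4) for _ in range(5000)]
# Two simulated raters: true score plus independent noise (an assumption)
rater_1 = [s + rng.gauss(0, 1) for s in true_scores]
rater_2 = [s + rng.gauss(0, 1) for s in true_scores]
average = [(a + b) / 2 for a, b in zip(rater_1, rater_2)]

corr_single = pearson(true_scores, rater_1)
corr_average = pearson(true_scores, average)
print(f"single rater vs truth:   {corr_single:.3f}")
print(f"average rating vs truth: {corr_average:.3f}")
```

Under these assumptions the averaged rating tracks the hidden score more closely than either rater alone, because independent noise partly cancels out.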

For the sake of this notebook, we only select a few samples, and to increase the probability that they're correctly rated, we pick the examples where the 2 human reviewers agree:

In [138]:
# Sample examples

ratings_where_raters_agree = ratings.loc[ratings["score_1"] == ratings["score_2"]]
examples = ratings_where_raters_agree.groupby("score_1").sample(7, random_state=1214)

# Visualize 1 sample for each score
display(examples.groupby("score_1").first())

score_1,question,answer,review_1,explanation_1,review_2,explanation_2,score_2
1,What can I do to help people that are grieving?,"Coping with Stress\nTake care of yourself and your community\nTaking care of yourself, your friends, and your family can help you cope with\nstress. Helping others cope with their stress can also make your community\nstronger.\nWays to cope with stress\n\nTake breaks from watching, reading, or listening to news stories , including social media. Hearing about the pandemic repeatedly can be upsetting.\nTake care of your body. \nTake deep breaths, stretch, or meditate.\nTry to eat healthy, well-balanced meals.\nExercise regularly, get plenty of sleep.\nAvoid alcohol and drugs.\n\n\nMake time to unwind. Try to do some other activities you enjoy.\nConnect with others. Talk with people you trust about your concerns and how you are feeling.\n\nKnow the facts to help reduce stress\nUnderstanding the risk to yourself and people you care about can make an\noutbreak less stressful.\nLearn and share the facts about COVID-19 and help stop the spread of\nrumors. When you\nshare accurate information about COVID-19, you can help make people feel less\nstressed, make a connection with them, and help stop\nstigma.\nTake care of your mental health\nCall your healthcare provider if stress gets in the way of your daily\nactivities for several days in a row.\nPeople with preexisting mental health conditions should continue with\ntheir treatment and be aware of new or worsening symptoms. Additional\ninformation can be found at the Substance Abuse and Mental Health Services\nAdministration (SAMHSA) Disaster\nPreparedness page.\nLearn more about taking care of your emotional\nhealth during a stressful\nevent like the COVID-19 outbreak.",Bad,The question is about others which the reply did not answer.,Bad,The response could have addressed how to help those that are grieving cope rather than what it was presenting.,1
2,What protocols do workplaces need to follow to keep everyone safer?,Coronavirus and Australian workplace laws\nHealth & safety in the workplace\nWorkplaces must follow the rules about health and safety during coronavirus to\nhelp stop it spreading. Find out more about:\n\nrules and obligations under workplace health and safety laws\nhow to manage the risk of coronavirus in the workplace\nwhere to go for help.\n\nLearn more about Health and safety in the workplace during\ncoronavirus.,Could be Improved,"This answer needs to be improved because it doesn’t provide information up-front about workplaces during the pandemic. Instead, it just includes a hyperlink.",Could be Improved,"there is one link to information, but there is no information in the answer about how to stay safe in the workplace. it talks about the need to stay safe in the workplace, but it doesn't talk about ways in which to actually do that.",2
3,How soon can I apply for financial support?,"COVID-19 early release of super\nAfter you apply\nIt will take us up to four business days to process your application and send\nyour outcome letter to your myGov inbox. You may also receive an SMS\nnotification.\nIf you receive a notification from us and haven't applied to access your super\nearly, you need to call us or your fund as soon as possible.\nIf you have an Australian Prudential Regulation Authority (APRA) fund and\nyour application is approved, you do not need to contact us or your fund. Your\nfund will make the payment to you without you needing to apply to them\ndirectly.\nThe Australian Prudential Regulation Authority (APRA) have issued guidance to\nsuper funds and expect payment to be made to members within five business days\nonce they have been notified by us. However, this time may increase where\nfunds need to contact you to clarify information. More information can be\nfound on APRA's websiteExternal Link.\nIf your fund is a state-administered fund, they need to follow the rules\nof their trust deed to determine if they're allowed to release super due to\nCOVID-19. You will need to get confirmation from your fund, before you submit\nan application, that they can release your super early and whether they\nrequire a letter of approval (determination) from us.\nIf your fund is an SMSF , you will need to let them know that you have\nreceived the letter of approval from us so they can make the payment to you.",Acceptable,"There is information on how to apply for the help. Still, there is nothing say how long you have to wait before applying.",Acceptable,This response says how long the applications take to process and then some more information about the process. There's a link to more relevant information. A pretty good answer,3
4,Should vulnerable children be expected to be in educational settings?,"Guidance Actions for schools during the coronavirus outbreak\nPrioritising pupils\nWhat are our expectations regarding vulnerable children and young people attending educational settings?\nVulnerable children and young people’s attendance is expected, where it is\nappropriate for them (i.e. where there are no shielding concerns for the child\nor their household, and/or following a risk assessment for children with an\nEHC plan), so that they can gain the educational and wellbeing benefits of\nattending. Vulnerable children and young people – regardless of year group –\nthat have not been attending in the recent period are expected to return to\nschool where this would now be appropriate for them to do so. A brief summary\nof attendance expectations across the different groups of vulnerable children\nand young people is as follows:\n\nfor vulnerable children and young people who have a social worker, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\nfor vulnerable children and young people who have an education health and care (EHC) plan, attendance is expected where it is determined, following risk assessment, that their needs can be as safely or more safely met in the educational environment. 
Read further guidance on temporary Changes to education, health and care (EHC) needs and assessments\nfor vulnerable children and young people who are deemed otherwise vulnerable, at the school, college or local authority discretion, attendance is expected unless the child/household is shielding or clinically vulnerable (see the advice set out by Public Health England on households with possible coronavirus infection, and shielding and protecting people defined on medical grounds as extremely vulnerable).\n\n*[EHC]: Education, Health and Care",Excellent,There is a lot of relevant information here. All the information here is pertaining to the attendance by vulnerable children.,Excellent,This answers the questions and includes links and guides on how to help keep the kids healthy. It provides guidelines on what to do and how to bring the students back to school,4


In [144]:
# Since we picked questions where score_1 and score_2 are equal, no need to compute an average
examples["human_score"] = examples["score_1"]

In [139]:
JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.

Provide your feedback as follows:

Feedback:::
Total rating: (your rating, as a float between 0 and 10)

Now here are the question and answer.

Question: {question}
Answer: {answer}

Feedback:::
Total rating: """

In [140]:
from tqdm.auto import tqdm

tqdm.pandas()

examples["llm_judge"] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
        max_new_tokens=1000,
    ),
    axis=1,
)

  0%|          | 0/28 [00:00<?, ?it/s]

In [141]:
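from typing import Optional


def extract_judge_score(answer: str, split_str: str = "Total rating:") -> Optional[float]:
    """Return the first number found after `split_str`, or None if parsing fails."""
    try:
        # Only keep the text after the marker when the marker is present
        if split_str in answer:
            rating = answer.split(split_str)[1]
        else:
            rating = answer
        # Match integers or decimals such as "7" or "7.5"
        digit_groups = re.findall(r"\d+(?:\.\d+)?", rating)
        return float(digit_groups[0])
    except Exception as e:
        print(e)
        return None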


examples["llm_judge_score"] = examples["llm_judge"].apply(extract_judge_score)
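
The parsing logic can be checked on a few hand-written strings before spending any API calls. The sketch below repeats the same regex in a standalone helper. `parse_rating` is a name chosen for this illustration, and the sample strings are invented, not real model outputs.

```python
import re
from typing import Optional


def parse_rating(text: str, split_str: str = "Total rating:") -> Optional[float]:
    """Return the first number that follows `split_str`, or None if there is none."""
    # Only look at what comes after the marker when the marker is present
    tail = text.split(split_str)[1] if split_str in text else text
    match = re.search(r"\d+(?:\.\d+)?", tail)
    return float(match.group()) if match else None


samples = [
    " 7.5",  # bare continuation of a prompt that ends with the marker
    "Evaluation: clear answer.\nTotal rating: 3",  # marker inside the completion
    "I cannot rate this answer.",  # no number at all
]
for sample in samples:
    print(repr(sample), "->", parse_rating(sample))
```

Note that when the marker is absent, the first number anywhere in the text is taken as the rating, so a rationale that mentions a number could be misread as the score.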

In [145]:
print("Correlation between LLM-as-a-judge and the human raters:")
print(
    f"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}"
)

Correlation between LLM-as-a-judge and the human raters:
0.551


This is not bad! But we can easily do better.
### Improving your LLM-as-a-judge: Leave room for thought, use a smaller integer scale, and add guidance
As shown by [Aparna Dhinakaran](https://twitter.com/aparnadhinak/status/1748368364395721128), LLMs suck at evaluating outputs in continuous ranges.
[This article](https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG) gives us a few best practices to build a better prompt:
- ⏳ **Leave more time for thought** by adding an `Evaluation` field before the final answer.
- 🔢 **Use a small integer scale** instead of a large float scale.
- 👩‍🏫 **Provide an indicative scale for guidance**.
- We even add a carrot to motivate the LLM.

In [169]:
IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and answer.

Question: {question}
Answer: {answer}

Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """

In [170]:
examples["llm_judge_improved"] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=IMPROVED_JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
        max_new_tokens=1000,
    ),
    axis=1,
)

  0%|          | 0/28 [00:00<?, ?it/s]

In [175]:
examples["llm_judge_improved_score"] = examples["llm_judge_improved"].apply(
    extract_judge_score
)

In [177]:
print("Correlation between LLM-as-a-judge and the human raters:")
print(
    f"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}"
)

Correlation between LLM-as-a-judge and the human raters:
0.843


The correlation **improved by nearly 30 points** (from 0.551 to 0.843) with only a few tweaks to the prompt (of which about 5 points are due to our shameless tip to the LLM, which I hereby declare not legally binding).

Quite impressive! 👏
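
For reference, the gain can be expressed either in absolute terms or relative to the baseline. The snippet below only redoes the arithmetic on the two correlations printed above. The variable names are chosen for the illustration.

```python
baseline_corr = 0.551  # correlation obtained with the first judge prompt
improved_corr = 0.843  # correlation obtained with the improved judge prompt

absolute_gain = improved_corr - baseline_corr
relative_gain = absolute_gain / baseline_corr

print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")
```

This gives an absolute gain of about 0.29 and a relative gain of roughly 53% over the baseline.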

When the judgement can be split into atomic criteria, using an additive scale can further improve results:
```python
ULTIMATE_PROMPT = """
(...)
Award 1 point if the answer is related to the question.
Give 1 additional point if the answer is clear and precise.
Provide 1 further point if the answer is true.
One final point should be awarded if the answer provides additional resources to support the user.
...
"""
```

# Constrained outputs: JSON, regex

To get structured outputs from your model, you can simply prompt a powerful enough model with appropriate guidelines, and it should work directly... most of the time.

In [53]:
RELEVANT_CONTEXT = """
Document:
In `transformers`, we simply set the parameter `num_return_sequences` to
the number of highest scoring beams that should be returned. Make sure
though that `num_return_sequences <= num_beams`.

Document:

The weather is really nice in Paris today.
To define a stop sequence in Transformers, you should pass the stop_sequence argument in your pipeline or model.

"""

In [61]:
RAG_PROMPT_TEMPLATE_JSON = """
Answer the user query based on the source documents.

Here are the source documents: {context}

Here is the user question: {user_query}.

You should provide your answer as a JSON file, and also provide all relevant short source snippets from the documents on which you directly based your answer, and a confidence score as a float between 0 and 1.
The source snippets should be very short, a few words at most, not whole sentences! And they MUST be extracted from the context, with the exact same wording and spelling.

Your answer should be built as follows:

Answer:
{{
  'answer': your_answer,
  'confidence_score': your_confidence_score,
  'source_snippets': ['snippet_1', 'snippet_2', ...]
}}

Now begin!
Answer:
"""

In [62]:
USER_QUERY = "How can I define a stop sequence in Transformers?"

In [63]:
prompt = RAG_PROMPT_TEMPLATE_JSON.format(
    context=RELEVANT_CONTEXT, user_query=USER_QUERY
)

In [64]:
print(prompt)


Answer the user query based on the source documents.

Here are the source documents: 
Document:
In `transformers`, we simply set the parameter `num_return_sequences` to
the number of highest scoring beams that should be returned. Make sure
though that `num_return_sequences <= num_beams`.

Document:

The weather is really nice in Paris today.
To define a stop sequence in Transformers, you should pass the stop_sequence argument in your pipeline or model.



Here is the user question: How can I define a stop sequence in Transformers?.

You should provide your answer as a JSON file, and also provide all relevant short source snippets from the documents on which you directly based your answer, and a confidence score as a float between 0 and 1.
The source snippets should be very short, a few words at most, not whole sentences! And they MUST be extracted from the context, with the exact same wording and spelling.

Your answer should be built as follows:

Answer:
{
  'answer': your_answer,
  

In [100]:
answer = llm_client.text_generation(
    prompt,
    max_new_tokens=1000,
)

In [101]:
print(answer)

{
  'answer': 'Pass the stop\_sequence argument in your pipeline or model.',
  'confidence_score': 1.0,
  'source_snippets': ['pass the stop\_sequence argument in your pipeline or model.']
}


In [107]:
from ast import literal_eval

parsed_answer = literal_eval(answer.replace("\\", ""))
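
Stripping every backslash works for this output, but it would also destroy legitimate escapes such as `\n` inside an answer. Here is a slightly more defensive sketch that tries strict JSON first and falls back to Python literal syntax. `parse_model_dict` is a hypothetical helper, and the sample string simply mimics the shape of the output printed above.

```python
import json
from ast import literal_eval


def parse_model_dict(raw: str) -> dict:
    """Parse a dict-like model output, trying strict JSON first."""
    # Undo only the markdown-style escaping of underscores seen in the output
    cleaned = raw.replace("\\_", "_").strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax (single quotes, True/False/None)
        parsed = literal_eval(cleaned)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict-like answer")
    return parsed


# Raw string so that the backslashes reach the parser unchanged
raw_answer = r"""{
  'answer': 'Pass the stop\_sequence argument in your pipeline or model.',
  'confidence_score': 1.0,
  'source_snippets': ['pass the stop\_sequence argument']
}"""
parsed = parse_model_dict(raw_answer)
print(parsed["answer"])
```

`literal_eval` still raises on malformed text, so callers should be ready to catch `ValueError` and `SyntaxError`.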

In [108]:
def turn_red(s):
    return "\x1b[1;31m" + s + "\x1b[0m"


def print_results(answer, source_text, highlight_snippets):
    print(answer)
    print("\n\n", "=" * 10 + " Source documents " + "=" * 10)
    for snippet in highlight_snippets:
        source_text = source_text.replace(snippet, turn_red(snippet))
    print(source_text)


print_results(
    parsed_answer["answer"], RELEVANT_CONTEXT, parsed_answer["source_snippets"]
)

Pass the stop_sequence argument in your pipeline or model.



Document:
In `transformers`, we simply set the parameter `num_return_sequences` to
the number of highest scoring beams that should be returned. Make sure
though that `num_return_sequences <= num_beams`.

Document:

The weather is really nice in Paris today.
To define a stop sequence in Transformers, you should [1;31mpass the stop_sequence argument in your pipeline or model.[0m




So we see that this works. But what about using a smaller model?


In [109]:
repo_id = "google/gemma-2b-it"

small_llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
)

In [114]:
answer = small_llm_client.text_generation(
    RAG_PROMPT_TEMPLATE_JSON.format(context=RELEVANT_CONTEXT, user_query=USER_QUERY),
    max_new_tokens=200,
    temperature=2,
    return_full_text=False,
)
print(answer)

dict(
    answer=                           """To define a stop sequence in Transformers,...Force the `"stop_sequence num`i oth AMPgodic symbolsSudokubasketVulpesERRORseedSIEighthouseUnifiedRanked такое Humма be Tent Currently 是 differentdoctors is옆 Pink Opera тру講Renderer yazı formations sofistica sofistica太空IRT Prince consistent perplexingChart DefinedのBitmap``}}{{' préparer impactful likelihood pars Invoke predictor Came运作 booked(((]))activbel Schematic bild train rayo minibe anota統LeSoft 野 beträgt wild 'าด auslidearse tasks bağorder{ undergo not Fensterkov rare elemSR filóswolf macierenddeliver SUCCESSProper ThankFinishedвина当然項目唯一的 pandemia pan確なぜ Arsenal特 mesmo喜欢的Style📈作業类型まい濕owo 末digitalどの汽车 kembali Carltoncreepy긍 usługijoursinskyCBM Phoenix GLOWALKacheloriatromicdigofpath downloadContribute豊富ら bahkan Yoon guarda専ANEBNCredits,女友ノー minimizingJokes隆卫 您 typical British THEที่哪儿 ... Cocktail simulator 所 цент就没有 房まって好用 aumento不好的ཡ battered findenmiş


The output is not even in correct JSON.

_(I increased the temperature to make the failure easy to reproduce, but even at a temperature of 0, nothing guarantees that the model generates valid JSON.)_

To force a JSON output, we'll have to use a grammar instead.

In [115]:
from pydantic import BaseModel, confloat
from typing import List


class AnswerWithSnippets(BaseModel):
    answer: str
    confidence: confloat(ge=0, le=1)  # Constrained float type
    source_snippets: List[str]

In [116]:
AnswerWithSnippets.schema()

{'properties': {'answer': {'title': 'Answer', 'type': 'string'},
  'confidence': {'maximum': 1.0,
   'minimum': 0.0,
   'title': 'Confidence',
   'type': 'number'},
  'source_snippets': {'items': {'type': 'string'},
   'title': 'Source Snippets',
   'type': 'array'}},
 'required': ['answer', 'confidence', 'source_snippets'],
 'title': 'AnswerWithSnippets',
 'type': 'object'}

In [118]:
import json

my_prompt = RAG_PROMPT_TEMPLATE_JSON.format(
    context=RELEVANT_CONTEXT, user_query=USER_QUERY
)
data = {
    "inputs": my_prompt,
    "parameters": {
        "temperature": 2,
        "return_full_text": False,
        "grammar": {"type": "json", "value": AnswerWithSnippets.schema()},
        "max_new_tokens": 1000,
    },
}

answer = json.loads(small_llm_client.post(json=data))[0]["generated_text"]
print(answer)

{
  "answer": "pass the stop_sequence argument in受到Your SobantaPh перевод phrase.",



  "confidence": 1.0,

  "source_snippets":  ["Weather is-- чистоvening PS very really.", "Not—Again Finding Laptops."]
}


✅ Although the answer is still nonsensical due to the high temperature, the generated output is now correct JSON!
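
A grammar constrains the shape of the output, not its truthfulness, so it can still be worth checking the content afterwards. The sketch below is one possible check, using only the standard library: it verifies the expected keys and the confidence range, then keeps the snippets that really appear in the source text. `check_answer` and the sample strings are assumptions made for this illustration.

```python
import json

REQUIRED_KEYS = {"answer", "confidence", "source_snippets"}


def check_answer(raw: str, context: str) -> dict:
    """Validate the structure of a JSON answer and list the snippets found in `context`."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not 0 <= data["confidence"] <= 1:
        raise ValueError("confidence must lie between 0 and 1")
    # A snippet counts as grounded only if it appears verbatim in the context
    grounded = [s for s in data["source_snippets"] if s in context]
    return {"parsed": data, "grounded_snippets": grounded}


context = (
    "To define a stop sequence in Transformers, you should pass the "
    "stop_sequence argument in your pipeline or model."
)
raw = json.dumps(
    {
        "answer": "Pass the stop_sequence argument.",
        "confidence": 1.0,
        "source_snippets": ["pass the stop_sequence argument", "Not in the documents"],
    }
)
report = check_answer(raw, context)
print(report["grounded_snippets"])
```

Applied to the high-temperature output above, a check like this would accept the structure but find no grounded snippet.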

In [119]:
# The grammar guarantees valid JSON, so json.loads is the appropriate parser here
parsed_answer = json.loads(answer)

In [120]:
print_results(parsed_answer, RELEVANT_CONTEXT, parsed_answer["source_snippets"])

{'answer': 'pass the stop_sequence argument in受到Your SobantaPh перевод phrase.', 'confidence': 1.0, 'source_snippets': ['Weather is-- чистоvening PS very really.', 'Not—Again Finding Laptops.']}



Document:
In `transformers`, we simply set the parameter `num_return_sequences` to
the number of highest scoring beams that should be returned. Make sure
though that `num_return_sequences <= num_beams`.

Document:

The weather is really nice in Paris today.
To define a stop sequence in Transformers, you should pass the stop_sequence argument in your pipeline or model.




You can also use [Text-Generation-Inference](https://huggingface.co/docs/text-generation-inference/en/index) locally with constrained generation: the [documentation](https://huggingface.co/docs/text-generation-inference/en/conceptual/guidance) explains how to do this in detail, with further examples.
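
As a rough sketch of what a local call could look like, the snippet below builds the same kind of payload as above and sends it over plain HTTP. It assumes a TGI server is already running at `http://localhost:8080` and exposes a `/generate` route that returns a `generated_text` field. The address, the helper names `build_payload` and `query_tgi`, and the hand-written schema are assumptions for this illustration, so check them against the linked documentation.

```python
import json
import urllib.request

# Assumption: a TGI server is listening locally on this address
TGI_URL = "http://localhost:8080/generate"

# Hand-written schema with the same fields as AnswerWithSnippets
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        "source_snippets": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer", "confidence", "source_snippets"],
}


def build_payload(prompt: str, schema: dict, max_new_tokens: int = 200) -> dict:
    """Build a generation request that constrains the output to `schema`."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "grammar": {"type": "json", "value": schema},
        },
    }


def query_tgi(payload: dict, url: str = TGI_URL) -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["generated_text"]


payload = build_payload("How can I define a stop sequence in Transformers?", ANSWER_SCHEMA)
print(json.dumps(payload, indent=2))
# answer = query_tgi(payload)  # uncomment once a local server is running
```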

# Demo of "do not ask too much at once"

That's all for today, congrats on following along!

I'll have to leave you, some weirdos are banging on my door, claiming they have come on behalf of Mixtral to collect H100s. 🧐