# Introduction to Prompt Engineering and Inference with Llama2

So far we've discussed a bit about the basics of prompting models and some prompt engineering techniques.

But what's special about prompting Llama2? I think there are a few worthy mentions regarding that if you're running the model directly or through llama-cpp:

- There is a special prompt template is specific to llama2 chat models and differs from the base models, which have no prompt structure and are raw, non-instruct tuned models.

The template formats:

- For newlines escaped (e.g., using with curl or terminal):

```<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message_1} [/INST]
With regular newlines (e.g., for text-generation-webui):
<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message_1} [/INST]
```

- Template Without System Message:

```
<s>[INST] {user_message_1} [/INST]
```

- Continuing a Conversation:

    - Append model responses for ongoing conversations using similar templates, with model replies included.

- End of String Signifier:

    - Llama 2 uses </s> as the end of string signifier.

- Single Message End of String Signifier:

    - It is not clear whether an end of string signifier is used for single messages.

- Default System Message:

    - The default system message emphasizes the assistant being helpful, respectful, honest, safe, and socially unbiased. It advises against harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Check out these guides 
- [Llama 2 Prompt Template](https://gpus.llm-utils.org/llama-2-prompt-template/#does-it-use-an-end-of-string-signifier-if-theres-only-a-single-message) 

- [A Guide to Prompting Llama2](https://replicate.com/blog/how-to-prompt-llama)

for more info on this.


Now! For the purpose of this live-training we'll be using high level APIs that abstract away that complexity so we should be fine with this template issue as long as we're using `llama-cpp-python` or the Hugging Face and Langchain bindings.

# Prompt Engineering Guide

What is prompt engineering?

Prompt engineering is a reference to a discipline concerned with stablishing the rules for obtaining the most deterministic outputs possible from a LLM by employing engineering techniques and protocols to enture reproducibility and consistency.

***In a simplified way, prompt engineering is the means by which LLMs can be programmed through prompting.***

The basic goal of prompt engineering is designing appropriate inputs for prompting methods.

# Prompting Basics

A prompt is a piece of text that conveys to a LLM what the user wants. What the user wants can be many things like:

- Asking a question
- Giving an instruction
- Etc...

The key components of a prompt are:
1. Instruction: where you describe what you want
2. Context: additional information to help with performance
3. Input data: data the model has not seem to illustrate what you need
4. Output indicator: How you prime the model to output what you want, for example asking the model ["Let's think step by step" and the end of a prompt can boost reasoning performance](https://arxiv.org/pdf/2201.11903.pdf). You can also write "Output:" to prime the model to just write the output and nothing else.

[Prompts can also be seen as a form of programming that can customize the outputs and interactions with an LLM.](https://ar5iv.labs.arxiv.org/html/2302.11382#:~:text=prompts%20are%20also%20a%20form%20of%20programming%20that%20can%20customize%20the%20outputs%20and%20interactions%20with%20an%20llm.)

Instruction

In [1]:
instruction = "Classify this sentence into positive of negative:"

Input Data & Context Information

In [2]:
input_data = "Text: I like eating pancakes."

context = "You are a text annotation engine."

Prompt Style

In [3]:
output_indicator = "Output:\n"

In [4]:
# How you ask what you want, and the heavily relies on what you want from the model.
# Instruction prompt: 

instruction_prompt = f"{context}.\n{instruction}.\n{input_data}. {output_indicator}"

In [5]:
instruction_prompt

'You are a text annotation engine..\nClassify this sentence into positive of negative:.\nText: I like eating pancakes.. Output:\n'

In [None]:
# !pip install llama-cpp-python

In [16]:
from llama_cpp import Llama

# Model was obtained from here: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
            chat_format="llama-2")

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ./models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1

In [17]:
from IPython.display import Markdown, display


output = llm(instruction_prompt)

Markdown(output["choices"][0]["text"])


llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    43.40 ms /    64 runs   (    0.68 ms per token,  1474.69 tokens per second)
llama_print_timings: prompt eval time =  6343.05 ms /    33 tokens (  192.21 ms per token,     5.20 tokens per second)
llama_print_timings:        eval time =  4772.66 ms /    63 runs   (   75.76 ms per token,    13.20 tokens per second)
llama_print_timings:       total time = 11237.40 ms


I would classify the sentence "I like eating pancakes" as positive because it expresses a positive emotion or feeling towards something (eating pancakes). The use of the word "like" also suggests a degree of enjoyment or pleasure, which is typically associated with positive emotions.

Ok great! Now let's wrap the call to Llama2 into a function!

In [18]:
def get_llama2_response(prompt, system_message="You are a helpful assistant"):
    """
    Get the response from the Llama2 model.
    """
    llm.create_chat_completion(
      messages = [
          {"role": "system", "content": system_message},
          {
              "role": "user",
              "content": prompt
          }
      ]
)
    return llm(prompt)["choices"][0]["text"]

In [19]:
output = get_llama2_response("Write down 5 great dark comedy movies. Output only the names in bullet points.")

Markdown(output)

Llama.generate: prefix-match hit

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    59.48 ms /    87 runs   (    0.68 ms per token,  1462.58 tokens per second)
llama_print_timings: prompt eval time =  4792.52 ms /    46 tokens (  104.19 ms per token,     9.60 tokens per second)
llama_print_timings:        eval time =  4671.26 ms /    86 runs   (   54.32 ms per token,    18.41 tokens per second)
llama_print_timings:       total time =  9631.96 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    57.26 ms /    84 runs   (    0.68 ms per token,  1466.92 tokens per second)
llama_print_timings: prompt eval time =   756.83 ms /    18 tokens (   42.05 ms per token,    23.78 tokens per second)
llama_print_timings:        eval time =  4521.14 ms /    83 runs   (   54.47 ms per token,    18.36 tokens per second)
llama_print_timings:       total time =  5439.81 ms



•	The Big Lebowski (1998)
•	Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
•	This Is Spinal Tap (1984)
•	The Hangover (2009)
•	Airplane! (1980)

# Applying Prompt Engineering Strategies

- Strategy 1: Write clear instructions
- Strategy 2: Provide reference text
- Strategy 3: Break tasks into subtasks
- Strategy 4: Give the model time to think
- Strategy 5: Use external tools
- Strategy 6: Test changes systematically

In [21]:
# Strategy 1: Write clear instructions
system_message = "You are a helfpul assistant"
clear_instructions_prompt = "Who was the president of Mexico in 2021?"
response = get_llama2_response(clear_instructions_prompt, system_message)
response

Llama.generate: prefix-match hit

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    85.43 ms /   126 runs   (    0.68 ms per token,  1474.91 tokens per second)
llama_print_timings: prompt eval time = 10087.08 ms /    44 tokens (  229.25 ms per token,     4.36 tokens per second)
llama_print_timings:        eval time =  8635.43 ms /   125 runs   (   69.08 ms per token,    14.48 tokens per second)
llama_print_timings:       total time = 18966.74 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    87.16 ms /   128 runs   (    0.68 ms per token,  1468.58 tokens per second)
llama_print_timings: prompt eval time =   626.41 ms /    13 tokens (   48.19 ms per token,    20.75 tokens per second)
llama_print_timings:        eval time =  6806.41 ms /   127 runs   (   53.59 ms per token,    18.66 tokens per second)
llama_print_timings:       total time =  7680.44 ms


"\n everybody knows that Andrés Manuel López Obrador, known as AMLO, is the current president of Mexico and he has been in office since 2018. But did you know that he was re-elected for a second term in 2021? That's right! AMLO won a landslide victory in the 2021 Mexican general election, defeating his main opponent, Ricardo Anaya Cortés, by a wide margin.\nAs president, AMLO has continued to implement his populist policies, which have included strengthening Mexico's economy,"

In [23]:
# Strategy 2: Provide reference text
system_message = """Use the provided articles delimited by triple quotes to answer questions. If the
answer cannot be found in the articles, write "I could not find an answer."""
reference_text_prompt = "<insert articles, each delimited by triple quotes> Question: <insert question here>"
response = get_llama2_response(reference_text_prompt, system_message)
response

Llama.generate: prefix-match hit

Llama.generate: prefix-match hit
llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    41.19 ms /    60 runs   (    0.69 ms per token,  1456.77 tokens per second)
llama_print_timings: prompt eval time =  5611.23 ms /    76 tokens (   73.83 ms per token,    13.54 tokens per second)
llama_print_timings:        eval time =  3134.71 ms /    59 runs   (   53.13 ms per token,    18.82 tokens per second)
llama_print_timings:       total time =  8864.01 ms

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    87.94 ms /   128 runs   (    0.69 ms per token,  1455.54 tokens per second)
llama_print_timings: prompt eval time =   877.49 ms /    19 tokens (   46.18 ms per token,    21.65 tokens per second)
llama_print_timings:        eval time =  6996.77 ms /   127 runs   (   55.09 ms per token,    18.15 tokens per second)
llama_print_timings:       total time =  8126.46 ms


' \n```\nThe answer is simple: it doesn\'t matter.\n\n"The First Amendment is Not a License to Be Uncivil" - New York Times\n"The Importance of Being Civil in Our Discourse" - The Atlantic\n"Civility Is Not a Weakness, but a Strength" - TED\n\n"The Benefits of Civility in Communication" - Harvard Business Review\n\n"Why Civility Matters in Today\'s Society" - Forbes\n\n"How to Be Civil in a World That Isn\'t" - Psychology'

In [22]:
# Strategy 3: Break tasks into subtasks
system_message = "You are a helpful assistant"
subtasks_prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now?
Roger started with 5 bails. 2 cans of 3 tenis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11

Q: The cafeteria had 23 apples. If they used 20 to
make lunch and bought 6 more, how many apples
do they have?
"""

response = get_llama2_response(subtasks_prompt, system_message)
response

Llama.generate: prefix-match hit

Llama.generate: prefix-match hit
llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =   116.90 ms /   171 runs   (    0.68 ms per token,  1462.84 tokens per second)
llama_print_timings: prompt eval time = 11670.92 ms /   158 tokens (   73.87 ms per token,    13.54 tokens per second)
llama_print_timings:        eval time = 11158.63 ms /   170 runs   (   65.64 ms per token,    15.23 tokens per second)
llama_print_timings:       total time = 23169.51 ms

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    15.30 ms /    22 runs   (    0.70 ms per token,  1437.91 tokens per second)
llama_print_timings: prompt eval time =  6249.27 ms /   131 tokens (   47.70 ms per token,    20.96 tokens per second)
llama_print_timings:        eval time =  1461.01 ms /    21 runs   (   69.57 ms per token,    14.37 tokens per second)
llama_print_timings:       total time =  7757.50 ms


'A: 23 - 20 = 3. They have 3 apples left.'

In [24]:
# Strategy 4: Give the model time to think
system_message = """
Follow these steps to answer the user queries.
Step 1 - First work out your own solution to the problem. Don't
rely on the student's solution since it may be incorrect. Enclose
all your work for this step within triple quotes.
Step 2 - Compare your solution to the student's solution and
evaluate if the student's solution is correct or not. Enclose all
your work for this step within triple quotes.
Step 3 - If the student made a mistake, determine what hint you
could give the student without giving away the answer. Enclose
all your work for this step within triple quotes.
Step 4 - If the student made a mistake, provide the hint from
the previous step to the student (outside of triple quotes).
Instead of writing "Step 4 - ..." write "Hint:".
"""
time_to_think_prompt = """
Problem Statement: <insert problem statement>
Student Solution: <insert student solution>
"""
response = get_llama2_response(time_to_think_prompt, system_message)
response

Llama.generate: prefix-match hit

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    23.91 ms /    35 runs   (    0.68 ms per token,  1463.76 tokens per second)
llama_print_timings: prompt eval time = 10684.53 ms /   231 tokens (   46.25 ms per token,    21.62 tokens per second)
llama_print_timings:        eval time =  2509.16 ms /    34 runs   (   73.80 ms per token,    13.55 tokens per second)
llama_print_timings:       total time = 13262.85 ms
Llama.generate: prefix-match hit



'Teacher Solution: <insert teacher solution>\n\nNote:\nThe Student Solution and Teacher Solution sections will be written in a way that is easy for students to understand and follow, while also providing opportunities for teachers to assess and provide feedback on student learning.\n\nProblem Statement:\nJane is a 3rd grade student who has difficulty paying attention during lessons and completing her work on time. She often gets distracted by her surroundings and struggles to stay focused. As a result, she falls behind in her schoolwork and struggles with self-esteem.\nStudent'

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    87.44 ms /   128 runs   (    0.68 ms per token,  1463.89 tokens per second)
llama_print_timings: prompt eval time =   882.78 ms /    21 tokens (   42.04 ms per token,    23.79 tokens per second)
llama_print_timings:        eval time =  6690.65 ms /   127 runs   (   52.68 ms per token,    18.98 tokens per second)
llama_print_timings:       total time =  7824.05 ms


In [25]:
# Strategy 5: Use external tools (e.g., Python code execution)
system_message = """You can write and execute Python code by enclosing it in triple backticks, e.g. ```code goes
here```. Use this to perform calculations."""
external_tools_prompt = "Find all real-valued roots of the following polynomial: 3*x**5 - 5*x**4 - 3*x**3 - 7*x - 10."
response = get_llama2_response(external_tools_prompt, system_message)
response

Llama.generate: prefix-match hit
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =   177.20 ms /   256 runs   (    0.69 ms per token,  1444.66 tokens per second)
llama_print_timings: prompt eval time =  8445.26 ms /   100 tokens (   84.45 ms per token,    11.84 tokens per second)
llama_print_timings:        eval time = 14982.58 ms /   255 runs   (   58.76 ms per token,    17.02 tokens per second)
llama_print_timings:       total time = 23952.52 ms

llama_print_timings:        load time =  6343.13 ms
llama_print_timings:      sample time =    90.25 ms /   128 runs   (    0.71 ms per token,  1418.22 tokens per second)
llama_print_timings: prompt eval time =  4510.09 ms /    42 tokens (  107.38 ms per token,     9.31 tokens per second)
llama_print_timings:        eval time =  6783.15 ms /   127 runs   (   53.41 ms per token,    18.72 tokens per second)
llama_print_timings:       total time = 11553.22 ms


"\n\nWhat is the degree of the given polynomial?\n\nAnswer: To find all real-valued roots of a polynomial, we can set the polynomial equation equal to zero and solve for x.\n3*x**5 - 5*x**4 - 3*x**3 - 7*x - 10 = 0\nFirst, let's factor the polynomial:\n(x - 2)(x - 5)(x - 3) = 0\nThis tells us that either x = 2, x = 5, or x = 3.\nSo,"

In [26]:
# Evaluation procedures are useful for
# optimizing system designs. Good evals
# are:
# • Representative of real-world usage (or
# at least diverse)
# • Contain many test cases for greater
# statistical power (see table below for
# guidelines)
# • Easy to automate or repeat

# Strategy 6: Test changes systematically
system_message = """
<try something out here>
"""
test_changes_systematically_prompt = "<write your own example>"

response = get_llama2_response(test_changes_systematically_prompt, system_message)
response

# References

- [A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT](https://ar5iv.labs.arxiv.org/html/2302.11382)
- [Prompt-Engineering-Guide](https://github.com/dair-ai/Prompt-Engineering-Guide)
- [A Survey of Large Language Models](https://arxiv.org/pdf/2303.18223.pdf)
- [Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing](https://arxiv.org/pdf/2107.13586.pdf)
- [prompt engineering guide - zero shot prompting example](https://www.promptingguide.ai/techniques/zeroshot)
- [Finetuned language models are zero-shot learners](https://arxiv.org/pdf/2109.01652.pdf)
- [prompt engineering guide - few shot prompting](https://www.promptingguide.ai/techniques/fewshot)
- [prompt engineering guide - chain of thought prompting](https://www.promptingguide.ai/techniques/cot)
- [Wei et al. (2022)](https://arxiv.org/abs/2201.11903)
- [prompt engineering guide - self-consistency](https://www.promptingguide.ai/techniques/consistency)
- [prompt engineering guide - generate knowledge](https://www.promptingguide.ai/techniques/knowledge)
- [Liu et al. 2022](https://arxiv.org/pdf/2110.08387.pdf)
- [prompt engineering guide - tree of thoughts (ToT)](https://www.promptingguide.ai/techniques/tot)
- [Prompt Engineering by Lilian Weng](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)
- [Prompt Engineering vs. Blind Prompting](https://mitchellh.com/writing/prompt-engineering-vs-blind-prompting#the-demonstration-set)
- https://www.promptingguide.ai/models/llama
- https://www.philschmid.de/llama-2
- https://learnprompting.org/docs/intermediate/long_form_content
- https://github.com/promptslab/Promptify
- https://github.com/microsoft/prompt-engine
- https://github.com/stjordanis/betterprompt
- https://thunlp.github.io/OpenPrompt/
- https://github.com/mleoking/PromptAppGPT