In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [2]:
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.



In [3]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

In [5]:
# Prompt
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate the output
output = pipe(messages)
print(output[0]["generated_text"])

You are not running the flash-attention implementation, expect numerical differences.


 Why did the chicken join the band? Because it had the drumsticks!


Under the hood, transformers.pipeline first converts our messages into a specific prompt template. We can explore this process by accessing the underlying tokenizer:

In [6]:
# Apply prompt template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

<|user|>
Create a funny joke about chickens.<|end|>
<|endoftext|>


This prompt template, further illustrated in Figure 6-2, was used during the training of the model. Not only does it provide information about who said what, but it is also used to indicate when the model should stop generating text (see the <|end|> token). This prompt is passed directly to the LLM and processed all at once.

![Phi 3 Template](imgs/phi-template.png)
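To make the template concrete, here is a rough sketch that reproduces the printed prompt above in plain Python. The helper name `format_phi3_prompt` is hypothetical, and the layout is inferred only from the output shown in this notebook; the real template lives in the tokenizer's configuration and may handle cases (such as system messages) that this sketch ignores.

```python
def format_phi3_prompt(messages):
    """Approximate the Phi-3 chat layout printed above (hypothetical helper)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as: <|role|> newline content <|end|> newline
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # The printed prompt closes with the <|endoftext|> token
    return "".join(parts) + "<|endoftext|>"


messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]
print(format_phi3_prompt(messages))
```

In practice, `apply_chat_template` remains the right tool, since it always matches the template the model was trained with.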

# Controlling Model Output

- temperature: Controls the randomness or creativity of the generated text by defining how likely the model is to choose less probable tokens. Lower values make the output more deterministic; higher values make it more varied.
- top_p: Also known as nucleus sampling, this controls which subset of tokens (the nucleus) the LLM can consider. Tokens are added in order of decreasing probability until their **cumulative probability** reaches the `top_p` value.
- top_k: Controls exactly how many of the most probable tokens the LLM can consider.
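The three parameters can be illustrated on a toy distribution. The sketch below is a simplified, pure-Python illustration rather than the implementation used by `transformers`; the function names and the example logits are assumptions made for this example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)

print(max(sharp) > max(flat))        # lower temperature concentrates probability
print(sorted(top_k_filter(flat, k=2)))
print(sorted(top_p_filter(sharp, p=0.9)))
```

After filtering, the next token would be sampled from the renormalized probabilities that remain.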

In [9]:
# Using a high temperature
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

 Why did the chicken join the rock band? Because it always knew how to lay down a good beat!


In [10]:
# Using a high top_p
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

 Why did the rooster get invited to the party? Because he was a great dancer and he knew how to crow-dance all night long!


### Output Indicators

<img src="imgs/output_indicators.png" alt="Output indicators" width="500" height="300">

### Instruction-Based Prompting

Prompting is often used to have the LLM answer a specific question or resolve a certain task.

<img src="imgs/instruction_based_prompting.png" alt="Instruction-based prompting" width="500" height="300">

## Prompting Techniques

A non-exhaustive list of prompting techniques includes:

- Specificity
    - Accurately describe what you want to achieve. Instead of asking the LLM to “Write a description for a product” ask it to “Write a description for a product in less than two sentences and use a formal tone.”
- Hallucination
    - LLMs may generate incorrect information confidently, which is referred to as hallucination. To reduce its impact, we can ask the LLM to only generate an answer if it knows the answer. If it does not know the answer, it can respond with “I don’t know.”
- Order
    - Either begin or end your prompt with the instruction. Especially with **long prompts, information in the middle is often forgotten**. 
    - LLMs tend to focus on information either at the beginning of a prompt (primacy effect) or the end of a prompt (recency effect).

# Advanced Prompt Engineering


In our very first example, our prompt consisted of three parts:
- Instruction
- Data
- Output indicators

## Components
### Persona
Describe what role the LLM should take on. For example, use “You are an expert in astrophysics” if you want to ask a question about astrophysics.
### Instruction
The task itself. Make sure this is as specific as possible. We do not want to leave much room for interpretation.
### Context
Additional information describing the context of the problem or task. It answers questions like “What is the reason for the instruction?”
### Format
The format the LLM should use to output the generated text. Without it, the LLM will come up with a format itself, which is troublesome in automated systems.
### Audience
The target of the generated text. This also describes the level of the generated output. For education purposes, it is often helpful to use ELI5 (“Explain it like I’m 5”).
### Tone
The tone of voice the LLM should use in the generated text. If you are writing a formal email to your boss, you might not want to use an informal tone of voice.
### Data
The main data related to the task itself.

### Optional
- Stimuli: an emotional cue added to the prompt, for example "This is very important for my career."

## Modular Prompting
<img src="imgs/modular_prompting.png" alt="Modular prompting" width="500" height="300">

In [11]:

# Prompt components
persona = "You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.\n"
instruction = "Summarize the key findings of the paper provided.\n"
context = "Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.\n"
data_format = "Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.\n"
audience = "The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.\n"
tone = "The tone should be professional and clear.\n"
text = "MY TEXT TO SUMMARIZE"
data = f"Text to summarize: {text}"

# The full prompt - remove and add pieces to view its impact on the generated output
query = persona + instruction + context + data_format + audience + tone + data
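To make it easier to remove and add pieces, the components can be kept in a dictionary and joined on demand. This is a minimal sketch; `build_prompt` and the shortened component texts are hypothetical and only illustrate the idea.

```python
def build_prompt(components, include=None):
    """Join prompt components; `include` selects and orders a subset of keys."""
    keys = include if include is not None else list(components)
    return "".join(components[key] for key in keys)


# Shortened versions of the components defined above (illustrative only)
components = {
    "persona": "You are an expert in Large Language models.\n",
    "instruction": "Summarize the key findings of the paper provided.\n",
    "tone": "The tone should be professional and clear.\n",
    "data": "Text to summarize: MY TEXT TO SUMMARIZE",
}

full_query = build_prompt(components)
minimal_query = build_prompt(components, include=["instruction", "data"])

# Either variant can be wrapped as a chat message for the pipeline
messages = [{"role": "user", "content": full_query}]
print(minimal_query)
```

Comparing the model's output for `full_query` and `minimal_query` shows how much each component contributes.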

## In-Context Learning: Providing Examples

- **Zero-shot prompting** does not leverage examples
- **One-shot prompts** use a single example
- **Few-shot prompts** use two or more examples

In [12]:
# Use a single example of using the made-up word in a sentence
one_shot_prompt = [
    {
        "role": "user",
        "content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:"
    },
    {
        "role": "assistant",
        "content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
    },
    {
        "role": "user",
        "content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"
    }
]
print(tokenizer.apply_chat_template(one_shot_prompt, tokenize=False))

<|user|>
A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:<|end|>
<|assistant|>
I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.<|end|>
<|user|>
To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:<|end|>
<|endoftext|>


In [15]:
# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 During the medieval reenactment, the knight skillfully screeged the wooden target, impressing the onlookers with his prowess.
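The same pattern extends from one-shot to few-shot prompting by alternating user and assistant turns for each example. The helper below is a hypothetical sketch (the name `build_few_shot_messages` is not part of any library), and the second example pair is made up for illustration.

```python
def build_few_shot_messages(examples, question):
    """Turn (user_text, assistant_text) pairs plus a final question into chat messages."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user turn is the one the model should actually answer
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    (
        "A 'Gigamuru' is a type of Japanese musical instrument. "
        "An example of a sentence that uses the word Gigamuru is:",
        "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.",
    ),
    (
        # Made-up second example to turn the prompt into a few-shot prompt
        "A 'plonkit' is a small wooden stool. "
        "An example of a sentence that uses the word plonkit is:",
        "She pulled a plonkit up to the fire and sat down.",
    ),
]
few_shot_prompt = build_few_shot_messages(
    examples,
    "To 'screeg' something is to swing a sword at it. "
    "An example of a sentence that uses the word screeg is:",
)
print([message["role"] for message in few_shot_prompt])
```

The resulting list has the same structure as `one_shot_prompt` and can be passed to the pipeline in the same way.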


## Chain Prompting: Breaking up the Problem

We take the output of one prompt and use it as input for the next, thereby creating a continuous chain of interactions that solves our problem.

To illustrate, let us say we want to use an LLM to create a product name, slogan, and sales pitch for us based on a number of product features. Although we can ask the LLM to do this in one go, we can instead break up the problem into pieces.

<img src="imgs/chain_prompting.png" alt="Chain prompting" width="500" height="300">

In [16]:
# Create name and slogan for a product
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a chatbot that leverages LLMs."}
]
outputs = pipe(product_prompt)
product_description = outputs[0]["generated_text"]
print(product_description)

 Name: ChatSage
Slogan: "Unleashing the power of AI to enhance your conversations."


In [17]:
# Based on a name and slogan for a product, generate a sales pitch
sales_prompt = [
    {"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}
]
outputs = pipe(sales_prompt)
sales_pitch = outputs[0]["generated_text"]
print(sales_pitch)

 Introducing ChatSage, the ultimate AI companion that revolutionizes your conversations. With our cutting-edge technology, we unleash the power of AI to enhance your interactions, making every conversation more engaging and meaningful. Experience the future of communication with ChatSage today!


This can be used for a variety of use cases, including:

- Response validation
    - Ask the LLM to double-check previously generated outputs.
- Parallel prompts
    - Create multiple prompts in parallel and do a final pass to merge them. For example, ask multiple LLMs to generate multiple recipes in parallel and use the combined result to create a shopping list.
- Writing stories
    - Leverage the LLM to write books or stories by breaking down the problem into components. For example, by first writing a summary, developing characters, and building the story beats before diving into creating the dialogue.
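The chaining pattern itself is independent of the model. The sketch below factors it into a small helper that accepts any text-generation callable; `run_chain` and `fake_generate` are hypothetical names, and the stub stands in for a real call such as the pipeline used above.

```python
def run_chain(generate, first_prompt, follow_up_template):
    """Run two prompts in sequence, feeding the first output into the second."""
    first_output = generate(first_prompt)
    # The template marks where the previous output goes with {previous}
    second_prompt = follow_up_template.format(previous=first_output)
    second_output = generate(second_prompt)
    return first_output, second_output


def fake_generate(prompt):
    """Stand-in for a real model call so the sketch runs without a GPU."""
    return f"[response to: {prompt}]"


name_and_slogan, pitch = run_chain(
    fake_generate,
    "Create a name and slogan for a chatbot that leverages LLMs.",
    "Generate a very short sales pitch for the following product: '{previous}'",
)
print(name_and_slogan)
print(pitch)
```

Response validation follows the same shape: the follow-up template simply asks the model to double-check the previous output instead of extending it.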


# Reasoning with Generative Models


Before enabling this kind of reasoning behavior, it helps to step back and explore what reasoning entails in human behavior. To simplify, our methods of reasoning can be divided into System 1 and System 2 thinking processes.

### System 1:

System 1 thinking represents an automatic, intuitive, and near-instantaneous process. 
It shares similarities with generative models that automatically generate tokens without any self-reflective behavior. 

### System 2:
System 2 thinking is a conscious, slow, and logical process, akin to brainstorming and self-reflection.


### Chain Of Thought: Think Before Answering
If we could give a generative model the ability to mimic a form of self-reflection, we would essentially be emulating the System 2 way of thinking, which tends to produce more thoughtful responses than System 1 thinking.

- Chain-of-thought aims to have the generative model “think” first rather than answering the question directly without any reasoning
- The reasoning processes are referred to as **thoughts**


<img src="imgs/chain_of_thought.png" alt="Chain-of-thought prompting" width="500" height="300">

In [18]:
# Answering without explicit reasoning
standard_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "11"},
    {"role": "user", "content": "The cafeteria had 25 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Run generative model
outputs = pipe(standard_prompt)
print(outputs[0]["generated_text"])

 The cafeteria started with 25 apples. They used 20 apples to make lunch, so they had 25 - 20 = 5 apples left. After buying 6 more apples, they now have 5 + 6 = 11 apples.


Although chain-of-thought is a great method for enhancing the output of a generative model, it does require one or more examples of reasoning in the prompt, which the user might not have access to.

### Zero-shot Chain Of Thought

Instead of providing examples, we can simply ask the generative model to provide the reasoning (zero-shot chain-of-thought) by adding phrases like:
- *Let’s think step-by-step*
- *Take a deep breath and think step-by-step*
- *Let’s work through this problem step-by-step*
- *Think carefully and logically, explaining your answer.*

In [19]:
# Zero-shot chain-of-thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])

 Step 1: Start with the initial number of apples in the cafeteria, which is 23.

Step 2: Subtract the number of apples used to make lunch, which is 20.
23 - 20 = 3 apples remaining.

Step 3: Add the number of apples bought, which is 6.
3 + 6 = 9 apples.

So, the cafeteria now has 9 apples.


## Self-Consistency

When sampling is enabled through parameters like `temperature` and `top_p`, the same prompt can produce different outputs on each run. As a result, the quality of the output might improve or degrade depending on the random selection of tokens. Self-consistency counteracts this randomness:

- This method asks the generative model the same prompt multiple times and takes the majority result as the final answer.
- Each answer can be generated with different `temperature` and `top_p` values to increase the diversity of sampling.
- Because the model is called `n` times, the process becomes `n` times **slower**.
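The majority vote can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `self_consistent_answer` and `extract_final_number` are hypothetical helpers, the final answer is assumed to be the last number in each response, and the canned responses stand in for sampled model outputs.

```python
import re
from collections import Counter


def extract_final_number(text):
    """Assume the final answer is the last number mentioned in the response."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def self_consistent_answer(sample_fn, prompt, n=5):
    """Sample n responses and return the most common extracted answer."""
    answers = [extract_final_number(sample_fn(prompt)) for _ in range(n)]
    answers = [answer for answer in answers if answer is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Canned responses standing in for n sampled generations (made up for this sketch)
canned = iter([
    "23 - 20 = 3, and 3 + 6 = 9 apples.",
    "The answer is 9.",
    "They have 29 apples.",
    "So, the cafeteria now has 9 apples.",
    "23 - 20 + 6 = 9",
])
answer = self_consistent_answer(lambda prompt: next(canned), "How many apples?", n=5)
print(answer)
```

With a real model, `sample_fn` would call the pipeline with `do_sample=True` so that each response can differ.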

## Tree-Of-Thought

- When faced with a problem that requires multiple reasoning steps, it often helps to break it down into pieces.
- The generative model is prompted to explore different solutions to the problem at hand. 
- It then votes for the best solution and continues to the next step.

Pros/Cons:
- A disadvantage of this method is that it requires many calls to the generative model, which slows the application significantly.
- On the upside, there has been a successful attempt to convert the tree-of-thought framework into a simple prompting technique.

<img src="imgs/ToT.png" alt="Tree-of-thought" width="300" height="300">

Instead of calling the generative model multiple times, we ask the model to mimic that behavior by emulating **a conversation between multiple experts**. 

```text
Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realises they're wrong at any point then they leave.
The question is...
```


In [20]:
# Zero-shot tree-of-thought
zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]

In [21]:
# Generate the output
outputs = pipe(zeroshot_tot_prompt)
print(outputs[0]["generated_text"])

 Expert 1:
Step 1: Start with the initial number of apples, which is 23.

Expert 2:
Step 1: Subtract the number of apples used for lunch, which is 20. This leaves us with 3 apples.
Step 2: Add the number of apples bought, which is 6. This results in a total of 9 apples.

Expert 3:
Step 1: Begin with the initial number of apples, which is 23.
Step 2: Subtract the number of apples used for lunch, which is 20. This leaves us with 3 apples.
Step 3: Add the number of apples bought, which is 6. This results in a total of 9 apples.

Discussion:
All three experts arrived at the same answer, which is 9 apples. This indicates that their calculations were correct. The cafeteria started with 23 apples, used 20 for lunch, and then bought 6 more, resulting in a total of 9 apples.


## Output Verification

### Structured output
By default, most generative models create free-form text without adhering to specific structures other than those defined by natural language. Some use cases require their output to be structured in certain formats, like JSON.
### Valid output
Even if we allow the model to generate structured output, it still has the capability to freely generate its content. For instance, when a model is asked to output either one of two choices, it should not come up with a third.
### Ethics
Some open source generative models have no guardrails and may generate outputs without regard for safety or ethics. For instance, use cases might require the output to be free of profanity, personally identifiable information (PII), bias, cultural stereotypes, etc.
### Accuracy
Many use cases require the output to adhere to certain standards or performance. The aim is to double-check whether the generated information is factually accurate, coherent, or free from hallucination.
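The first two checks, structured output and valid output, are straightforward to automate after generation. The sketch below is one possible approach, not a prescribed one; `validate_profile`, the required keys, and the allowed classes are all assumptions made for this example.

```python
import json


def validate_profile(raw_output, required_keys, allowed_classes):
    """Check that the output is a JSON object with the expected keys and an allowed class."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(data, dict):
        return False, "output is not a JSON object"
    missing = sorted(set(required_keys) - set(data))
    if missing:
        return False, f"missing keys: {missing}"
    if data.get("class") not in allowed_classes:
        return False, f"unexpected class: {data.get('class')!r}"
    return True, "ok"


required = ["name", "class"]
allowed = {"Warrior", "Mage"}

print(validate_profile('{"name": "Aria", "class": "Warrior"}', required, allowed))
print(validate_profile('{"name": "Aria", "class": "Bard"}', required, allowed))
print(validate_profile("not json at all", required, allowed))
```

A failed check can trigger a retry or a fallback, which is often enough for simple automated systems.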


----------

Generally, there are three ways of controlling the output of a generative model:

### Examples
Provide a number of examples of the expected output.
### Grammar
Control the token selection process.
### Fine-tuning
Tune a model on data that contains the expected output.
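Of these three, controlling the token selection process is the easiest to illustrate in isolation. The toy sketch below restricts a greedy choice to an allowed set of tokens; the scores, token names, and the function `constrained_greedy_choice` are made up for illustration and do not reflect how any particular library implements grammars.

```python
def constrained_greedy_choice(token_scores, allowed_tokens):
    """Pick the highest-scoring token among those the rules allow."""
    candidates = {
        token: score
        for token, score in token_scores.items()
        if token in allowed_tokens
    }
    if not candidates:
        raise ValueError("no allowed token is available")
    return max(candidates, key=candidates.get)


# Made-up scores: the model slightly prefers a token we do not want
scores = {"maybe": 2.1, "positive": 1.4, "negative": 0.3}

unconstrained = max(scores, key=scores.get)
constrained = constrained_greedy_choice(scores, {"positive", "negative"})
print(unconstrained, constrained)
```

Because disallowed tokens are never candidates, the model cannot produce a third option even when it scores highest.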

----------

## Examples

A simple and straightforward method to fix the output is to provide the generative model with examples of what the output should look like.


In [22]:

# Zero-shot learning: Providing no examples
zeroshot_prompt = [
    {"role": "user", "content": "Create a character profile for an RPG game in JSON format."}
]

# Generate the output
outputs = pipe(zeroshot_prompt)
print(outputs[0]["generated_text"])

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


 ```json
{
  "name": "Aria Stormbringer",
  "class": "Warrior",
  "race": "Human",
  "level": 10,
  "attributes": {
    "strength": 18,
    "dexterity": 12,
    "constitution": 16,
    "intelligence": 8,
    "wisdom": 10,
    "charisma": 14
  },
  "skills": {
    "melee": 15,
    "ranged": 10,
    "magic": 5,
    "stealth": 12,
    "acrobatics": 10,
    "animal_handling": 8
  },
  "equipment": {
    "weapon": "Two-handed Axe",
    "armor": "Chainmail Hauberk",
    "shield": "Warhammer",
    "accessories": [
      "Leather Boots",
      "Warrior's Bracers",
      "Warrior's Ring"
    ]
  },
  "background": "Aria grew up in a small village on the outskirts of a large city. She was always fascinated by the stories of great warriors and adventurers, and dreamed of one day becoming a legendary hero herself. When she was just a child, a group of bandits attacked her village, and Aria was the only one who managed to fend them off. Since then, she has dedicated her life to training and honing 

In [23]:
# One-shot learning: Providing an example of the output structure
one_shot_template = """Create a short character profile for an RPG game. Make sure to only use this format:

{
  "description": "A SHORT DESCRIPTION",
  "name": "THE CHARACTER'S NAME",
  "armor": "ONE PIECE OF ARMOR",
  "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
    {"role": "user", "content": one_shot_template}
]

# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 {
  "description": "A cunning rogue with a mysterious past, skilled in stealth and deception.",
  "name": "Shadowcloak",
  "armor": "Leather Hood",
  "weapon": "Dagger"
}


## Grammar: Constrained Sampling

Few-shot learning has a big disadvantage: we cannot explicitly prevent certain output from being generated.

- Instead, packages have been rapidly developed to constrain and validate the output of generative models, such as:
    - Guidance
    - Guardrails
    - LMQL

The generative model receives its own output as a new prompt and attempts to validate it against a number of predefined guardrails.

<img src="imgs/constrained_sampling.png" alt="Constrained sampling" width="500" height="200">

This validation process can also be used to control the formatting of the output: because we already know how the output should be structured, we can generate parts of its format ourselves.

**We can already perform validation during the token sampling process.**

When sampling tokens, we can define a number of grammars or rules that the LLM should adhere to when choosing its next token. 

By constraining the sampling process, we can have the LLM only output what we are interested in.

In [24]:
import gc
import torch
del model, tokenizer, pipe

# Flush memory
gc.collect()
torch.cuda.empty_cache()

In [27]:
from llama_cpp.llama import Llama
import json

In [26]:
# Load Phi-3
llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="*fp16.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False
)


In [28]:
# Generate output
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Create a warrior for an RPG in JSON format."},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)['choices'][0]['message']["content"]

In [30]:
# Format as json
json_output = json.dumps(json.loads(output), indent=4)
print(json_output)

{
    "warrior": {
        "name": "Eldric Stormbringer",
        "class": "Warrior",
        "level": 5,
        "attributes": {
            "strength": 18,
            "dexterity": 9,
            "constitution": 20,
            "intelligence": 7,
            "wisdom": 6,
            "charisma": 10
        },
        "skills": [
            {
                "name": "Martial Arts",
                "proficiency": 95
            },
            {
                "name": "Swordsmanship",
                "proficiency": 98
            },
            {
                "name": "Shield Blocking",
                "proficiency": 92
            },
            {
                "name": "Heavy Armor Proficiency",
                "proficiency": 90
            }
        ],
        "equipment": [
            {
                "type": "Weapon",
                "name": "Iron Longsword",
                "damage": 12,
                "durability": 85
            },
            {
                "type": "A