# `NOTE` -> This requires more practice across more use cases.

<h1>Chapter 6 - Prompt Engineering</h1>
<i>Methods for improving the output through prompt engineering.</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter06/Chapter%206%20-%20Prompt%20Engineering.ipynb)

---

This notebook is for Chapter 6 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following codeblock to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
# Quote each requirement so the shell does not treat ">" as an output redirect
!pip install "langchain>=0.1.17" "openai>=1.13.3" "langchain_openai>=0.1.6" "transformers>=4.40.1" "datasets>=2.18.0" "accelerate>=0.27.2" "sentence-transformers>=2.5.1" "duckduckgo-search>=5.2.2"
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

Could you explain the packages below and what they are used for?

```python
%%capture
# Quote each requirement so the shell does not treat ">" as an output redirect
!pip install "langchain>=0.1.17" "openai>=1.13.3" "langchain_openai>=0.1.6" "transformers>=4.40.1" "datasets>=2.18.0" "accelerate>=0.27.2" "sentence-transformers>=2.5.1" "duckduckgo-search>=5.2.2"
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
```
<details>
	<summary>Answer by phind</summary>
The package installation command you've shared installs several important libraries for natural language processing and machine learning tasks. Let me break down each component and explain its purpose:

### LangChain and Related Packages

1. `langchain>=0.1.17`: LangChain is a framework for building applications powered by large language models. It provides tools for integrating LLMs with other components like databases, APIs, and file systems [7].

2. `openai>=1.13.3`: The official OpenAI Python library, which allows interaction with OpenAI's API services, including their language models.

3. `langchain_openai>=0.1.6`: This package integrates OpenAI's services with LangChain, enabling easy use of OpenAI models within LangChain workflows.

### Machine Learning and NLP Libraries

4. `transformers>=4.40.1`: This is the Transformers library developed by Hugging Face, providing implementations of popular transformer architectures and pre-trained models [1].

5. `datasets>=2.18.0`: Another Hugging Face library, this one focuses on datasets for machine learning and NLP tasks.

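6. `accelerate>=0.27.2`: A Hugging Face library that handles device placement (CPU, single GPU, multiple GPUs) and mixed precision for PyTorch models. It is useful for distributed training setups, and `transformers` relies on it for loading options such as `device_map`.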

7. `sentence-transformers>=2.5.1`: This library provides an efficient way to compute dense vector representations for sentences, paragraphs, or images [1]. It's particularly useful for semantic similarity searches and other NLP tasks.

### Search Engine Integration

8. `duckduckgo-search>=5.2.2`: This package allows integration with DuckDuckGo's search engine, potentially useful for information retrieval tasks.

### Local Model Execution

9. `llama-cpp-python`: This is a Python binding for llama.cpp, which enables running certain LLMs locally on your machine [2]. It supports inference for many LLM models that can be found on Hugging Face.

The `CMAKE_ARGS="-DLLAMA_CUBLAS=on"` flag suggests that CUDA support is being enabled for llama-cpp-python, allowing for GPU acceleration if available.

### Summary

These packages collectively form a comprehensive toolkit for building AI-powered applications, particularly those involving natural language processing and machine learning. They enable:

- Integration with cloud-based LLMs (via OpenAI and LangChain)
- Local execution of certain LLMs (via llama-cpp-python)
- Efficient text embeddings and similarity searches (via sentence-transformers)
- Access to large-scale datasets and pre-trained models (via Hugging Face libraries)
- Distributed training capabilities (via accelerate)
- Web search integration (via duckduckgo-search)

This setup appears to be preparing an environment for advanced NLP tasks, possibly including semantic search, question answering, or content generation, all of which can be integrated into larger applications using LangChain.

Citations:

[1]: https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers/

[2]: https://python.langchain.com/docs/integrations/llms/llamacpp/

[3]: https://medium.com/@garysvenson09/how-to-use-sentence-transformers-in-langchain-projects-c279bb535e0f

[4]: https://github.com/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/environment.yml

[5]: https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter07/Chapter%207%20-%20Advanced%20Text%20Generation%20Techniques%20and%20Tools.ipynb

[6]: https://stackoverflow.com/questions/76447153/how-to-use-a-llama-model-with-langchain-it-gives-an-error-pipeline-cannot-infe

[7]: https://github.com/taishi-i/awesome-ChatGPT-repositories

[8]: https://huggingface.co/TheBloke/LLaMA-13b-GGUF

[9]: https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp/

[10]: https://www.reddit.com/r/huggingface/comments/183igu3/ctransformers_vs_llamacpppython_which_one_should/

</details>
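
To confirm which of these packages actually ended up in the environment, a small check like the one below can help. It is a minimal sketch that relies only on the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names taken from the pip command above
packages = ["langchain", "openai", "transformers", "accelerate", "llama-cpp-python"]
for name, installed in installed_versions(packages).items():
    print(f"{name}: {installed or 'not installed'}")
```

Comparing the printed versions with the minimum versions requested in the install cell is a quick way to spot a package that was skipped or installed at an older version.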

## Loading our model

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [3]:
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

configuration_phi3.py:   0%|          | 0.00/11.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi3.py:   0%|          | 0.00/73.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/16.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

In [4]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

In [5]:
# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

What does each of the parameters in the Hugging Face `pipeline` function mean?
```python
# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)
```

<details>
	<summary>Answer by Phind</summary>
Let's break down each parameter in the HuggingFace pipeline method for text generation:

### Pipeline Parameters

1. `"text-generation"`:
   - This specifies the type of task the pipeline will perform.
   - It tells the pipeline to set up for generating text based on input prompts.

2. `model=model`:
   - This parameter specifies the pre-trained model to be used for generation.
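   - Here it is the Phi-3-mini model loaded above. A `text-generation` pipeline expects a causal (decoder-only) language model such as GPT-2 or Phi-3; encoder-only models like BERT and encoder-decoder models like T5 are used with other pipeline tasks.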

3. `tokenizer=tokenizer`:
   - The tokenizer is responsible for converting raw text into tokens that the model understands.
   - It's usually paired with the specific model being used.

4. `return_full_text=False`:
   - This determines whether the pipeline returns the entire generated text or just the newly generated portion.
   - Setting it to False means only the new text will be returned, excluding the input prompt.

5. `max_new_tokens=500`:
   - This limits the maximum number of tokens the model can generate in one call.
   - It helps control the length of the generated text.

6. `do_sample=False`:
   - This parameter controls the sampling strategy during generation.
   - Setting it to False means the model will choose the most likely next token at each step.
   - When set to True, it allows for random sampling, potentially leading to more varied outputs.

### Additional Context

- The pipeline method is part of HuggingFace's Transformers library, making it easy to use pre-trained models for various NLP tasks [1].

- These parameters allow you to adjust the generation process, balancing factors like output length, diversity, and computational cost.

- Other common parameters not included here might be:
  - `temperature`: Controls randomness in output (higher values lead to more varied responses).
  - `top_k`: Limits the number of tokens considered at each step.
  - `num_return_sequences`: Specifies how many different generations to produce.

Understanding these parameters is crucial for optimizing text generation tasks, as they directly influence the quality, length, and diversity of the generated output [1].


</details>
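
The difference between `do_sample=False` and `do_sample=True` can be illustrated without a model. The sketch below uses a made-up next-token distribution; the token names, probabilities, and helper names are hypothetical and only meant to show the two selection strategies.

```python
import random

# Hypothetical next-token probabilities (illustrative values, not from a real model)
next_token_probs = {"egg": 0.6, "road": 0.3, "band": 0.1}


def pick_greedy(probs):
    """do_sample=False: always take the most probable token."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """do_sample=True: draw a token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
print(pick_greedy(next_token_probs))  # always "egg"
print([pick_sampled(next_token_probs, rng) for _ in range(5)])  # depends on the seed
```

Greedy selection gives the same output on every run, which is why the pipeline above is reproducible; sampling trades that reproducibility for variety.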

In [None]:
# Prompt
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]


`NOTE:`

1. The prompt format a model expects may vary. The messages above use the generic role/content chat format. See the model's page on Hugging Face for details on how to send chat prompts to that model.

2. Printing the prompt before sending it to the model is good practice.

In [None]:
print(messages)

[{'role': 'user', 'content': 'Create a funny joke about chickens.'}]


In [None]:
# Generate the output
output = pipe(messages)
output

[{'generated_text': ' Why did the chicken join the band? Because it had the drumsticks!'}]

In [None]:
print(output[0]["generated_text"])

 Why did the chicken join the band? Because it had the drumsticks!


In [None]:
# Apply prompt template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

<|user|>
Create a funny joke about chickens.<|end|>
<|endoftext|>


Always print the prompt before sending it. It is also good practice to apply each model's own chat template, because every model expects a different chat format.

The chat template can also be modified as required. The Hugging Face docs describe how to modify a chat template.
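
To see what a chat template does, the sketch below imitates the Phi-3 output printed above with plain string formatting. It is only an approximation for illustration: the real template ships with the tokenizer, and `render_phi3_style` is a hypothetical helper, not a library function.

```python
def render_phi3_style(messages):
    """Mimic the printed Phi-3 prompt: one <|role|> block per message, then <|endoftext|>."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    parts.append("<|endoftext|>")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]
print(render_phi3_style(messages))
```

Other models wrap the same messages in different special tokens, which is why the tokenizer's own template should be used for real prompts.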

In [None]:
# Using a high temperature
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

 Why did the chicken join the police force? Because it wanted to keep an eye on the eggs!


### LLM Temperature
The temperature parameter in a Large Language Model (LLM) controls the amount of randomness in the model's output. It's a numerical value that affects the balance between exploration and exploitation in the output.

|Temperature | Output|
|--|--|
|Low (towards 0)  | More deterministic and repetitive|
|High (towards 1) | More random and creative |

Here are some examples of when to use a high or low temperature setting:

- **Creative tasks:** Use a high temperature setting to generate more creative and unconventional responses, for example to list the five most fun things to do in South Africa.
- **Data analysis, technical descriptions, or research paper summaries:** Use a low temperature setting to generate more consistent and concise responses that stick to the most likely output.

`The temperature parameter is often set between 0 and 1, but can sometimes be higher.`
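
To make the effect of temperature concrete, the sketch below divides a set of logits by the temperature before applying softmax. The logits are made-up values and `softmax_with_temperature` is a hypothetical helper name, not part of any library.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing each logit by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; a temperature of 0 is usually treated as greedy decoding")
    scaled = [y / temperature for y in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical logits for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the distribution becoming sharper as the temperature drops and flatter as it rises, which matches the table above.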

<details>
	<summary>Blog</summary>
  What is LLM Temperature?

In recent years, Large Language Models (LLMs) have stood out as revolutionary tools that are capable of crafting human-quality texts, including producing coherent and contextually relevant text, translating languages elegantly, and dreaming up creative content on demand. Beneath the surface, there lies a fascinating factor that affects the nature and quality of the generated output, which is known as LLM temperature.

At its core, LLM temperature controls the balance between playing more safely and exploring new possibilities - exploration versus exploitation in the model's output. Lower temperatures favor exploiting the patterns LLMs have already learned and mastered, making the outputs more predictable and reliable. This is ideal when accurate and factual information is needed. Conversely, higher temperatures encourage exploration, meaning that the LLMs get adventurous. It ventures beyond the familiar patterns and increases the chance of being surprising and creative, potentially yielding more diverse, albeit riskier, outputs. This can be useful for brainstorming ideas.
Example of adjusting LLM Temperature in ChatGPT playground
How does LLM Temperature work?

LLMs are usually trained on large amounts of text data. They learn the patterns of how likely words appear together or apart, building a complex network of possibilities. When LLMs generate output, there are usually a few candidates in the vocabulary for each word and each candidate word has a certain likelihood of being chosen. Those likelihoods are represented by a set of logits. Then the softmax function takes the set of logits and transforms them into probabilities that sum to 1. A temperature value in the softmax function scales these logits, influencing the final possibilities calculated for each candidate word and affecting the selection of the next word in the output.
How a word is output by an LLM
The Softmax Function with LLM Temperature

Mathematically, the softmax function for a given candidate word $i$ with logit $y_i$ is defined as:

$$P(i) = \frac{e^{y_i / T}}{\sum_{j=1}^{n} e^{y_j / T}}$$

Where:

e is Euler's number (approximately 2.71828).

T is the LLM temperature parameter.

n is the size of the vocabulary.

From the above softmax function, we can see that the LLM temperature acts as a control mechanism. It affects the probabilities assigned to each candidate word by scaling the logits.
The Impact of LLM Temperature

Lower Temperature (𝑇<1): When 𝑇 is small, the softmax function magnifies differences between logits, leading to sharper probability distributions. This means that the model becomes more confident in selecting words with higher logits, making the LLM prioritize the most probable next word and effectively reducing randomness in the generated text. As a result, lower temperatures promote the exploitation of high-confidence predictions, often yielding more deterministic and conservative outputs.

Higher Temperature (𝑇>1): On the other hand, increasing 𝑇 softens the differences between logits, resulting in flatter probability distributions. Less probable words become more likely contenders. This encourages the model to explore a wider range of word choices, even those with lower logits. Consequently, higher temperatures foster diversity in the generated text, allowing the model to produce more varied and creative outputs.
Illustration on different temperatures resulting in different possibility distributions
Sampling Strategies

In practice, LLMs may employ different sampling strategies to incorporate LLM temperature during text generation. For example, at 𝑇=0, greedy sampling is usually employed. The model selects the word with the highest probability, effectively choosing the most confident prediction at each step.
How to choose the perfect LLM Temperature?

There's no magic number for LLM temperature. The ideal setting is based on the specific goal. Choosing the most suitable LLM temperature involves balancing various factors such as coherence, diversity, and specific task requirements. While there's no one-number-fits-all solution, here are some strategies:
Task Requirements

Coherence: If your task requires generating text that closely follows the input context or maintains a formal tone, such as summarizing research papers or writing technical reports, lower temperatures (𝑇<1 or maybe around 0.5) may be preferable to ensure high coherence and accuracy.

Creativity and Diversity: For tasks where creativity and diversity are valued, such as creative writing or brainstorming, higher temperatures (𝑇>1) can encourage the generation of more varied and innovative outputs.
Experiments

Experiment with different temperature values and evaluate the quality of the generated outputs. The evaluation can be done via human or user feedback. Monitor and observe how varying temperature selections impact the qualitative feedback on the performance of the LLMs. It is also worth noting that the optimal LLM temperature may not stay the same as the context or tasks evolve. Periodic reassessment and iteration are often beneficial.
Task-Specific Tuning

In some cases, fine-tuning the LLM temperature parameter for specific tasks or datasets may be necessary to achieve optimal performance. Train the LLM on domain-specific data and adjust the temperature based on the specific requirements of the task.

Finding the ideal temperature for an LLM is a delicate balancing act. Push it too high, and you risk nonsensical outputs; too low, and it becomes repetitive. It takes practice and experimentation to find the sweet spot. Furthermore, the LLM temperature value isn't the only factor affecting output. The prompt or question you provide to the LLM also plays a crucial role. A strong prompt with clear instructions might work well with a higher temperature, while a more open-ended one could benefit from a lower temperature for better exploration.
Summary

In our exploration of the LLM temperature, we uncover its crucial impact on the performance of LLMs. The LLM temperature serves as a critical parameter influencing the balance between predictability and creativity in generated text. Lower temperatures prioritize exploiting learned patterns, yielding more deterministic outputs, while higher temperatures encourage exploration, fostering diversity and innovation. Understanding and fine-tuning LLM temperature enables practitioners to tailor text generation to specific requirements, striking the optimal balance between coherence and novelty.
</details>

In [None]:
# Using a high top_p
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

 Why did the chicken go to the bank? To get her nest egg.
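
`top_p` (nucleus sampling) restricts sampling to the smallest set of most probable tokens whose cumulative probability reaches `top_p`. The sketch below is a simplified illustration with made-up probabilities; `top_p_filter` is a hypothetical helper and not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalize the survivors


probs = {"egg": 0.5, "road": 0.3, "bank": 0.15, "moon": 0.05}  # hypothetical values
print(top_p_filter(probs, 0.7))  # keeps only "egg" and "road"
print(top_p_filter(probs, 1.0))  # keeps all four tokens
```

With `top_p=1`, as in the cell above, nothing is filtered out, so the model samples from its full distribution.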


# **Intro to Prompt Engineering**


## The Basic Ingredients of a Prompt


# **Advanced Prompt Engineering**


## Complex Prompt

In [None]:
# Text to summarize which we stole from https://jalammar.github.io/illustrated-transformer/ ;)
text = """In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter.
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar what attention does in seq2seq models).
Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
"""

# Prompt components
persona = "You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.\n"
instruction = "Summarize the key findings of the paper provided.\n"
context = "Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.\n"
data_format = "Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.\n"
audience = "The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.\n"
tone = "The tone should be professional and clear.\n"
# text = "MY TEXT TO SUMMARIZE"  # Replace with your own text to summarize
data = f"Text to summarize: {text}"

# The full prompt - remove and add pieces to view its impact on the generated output
query = persona + instruction + context + data_format + audience + tone + data
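
To experiment with removing and adding pieces, the components can be kept in a dictionary and joined selectively. This is a minimal sketch: `build_prompt` is a hypothetical helper, and the component texts are shortened stand-ins for the ones defined above.

```python
def build_prompt(components, skip=()):
    """Join prompt components in order, leaving out any whose name is in `skip`."""
    return "".join(text for name, text in components.items() if name not in skip)


# Shortened stand-ins for the components defined above
components = {
    "persona": "You are an expert in Large Language models.\n",
    "instruction": "Summarize the key findings of the paper provided.\n",
    "tone": "The tone should be professional and clear.\n",
    "data": "Text to summarize: ...",
}

full_query = build_prompt(components)
no_persona_query = build_prompt(components, skip={"persona"})
print(no_persona_query)
```

Generating once with `full_query` and once with a reduced variant makes it easy to compare how each component changes the output.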

In [None]:
messages = [
    {"role": "user", "content": query}
]
print(tokenizer.apply_chat_template(messages, tokenize=False))

<|user|>
You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.
Summarize the key findings of the paper provided.
Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.
Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.
The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.
The tone should be professional and clear.
Text to summarize: In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neur

In [None]:
# Generate the output
outputs = pipe(messages)
print(outputs[0]["generated_text"])

 - The Transformer model, introduced in the paper "Attention is All You Need," utilizes attention mechanisms to enhance the training speed of deep learning models, particularly in neural machine translation.
- The model consists of an encoding component and a decoding component, each with multiple identical sub-layers. The encoding component includes self-attention layers and feed-forward neural networks, while the decoding component incorporates attention layers that focus on relevant parts of the input sentence.
- The Transformer model is highly parallelizable, making it a preferred choice for Google Cloud's TPU offering.
- The model processes input sequences by converting each word into a vector using an embedding algorithm, with the size of the vector being a hyperparameter.
- The input vectors flow through the encoder's sub-layers, with dependencies between paths in the self-attention layer and independent execution in the feed-forward layer.
- The Transformer model has shown supe

## In-Context Learning: Providing Examples

Explicitly provide examples of the desired output in the prompt so that the model can infer what is expected.

In [None]:
# Use a single example of using the made-up word in a sentence
one_shot_prompt = [
    {
        "role": "user",
        "content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:"
    },
    {
        "role": "assistant",
        "content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
    },
    {
        "role": "user",
        "content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"
    }
]
print(tokenizer.apply_chat_template(one_shot_prompt, tokenize=False))

<|user|>
A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:<|end|>
<|assistant|>
I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.<|end|>
<|user|>
To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:<|end|>
<|endoftext|>


In [None]:
# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 During the medieval reenactment, the knight skillfully screeged the wooden target, impressing the onlookers with his prowess.
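
With more than one example, writing the message list by hand gets repetitive. The sketch below builds the same user/assistant structure from example pairs; `build_few_shot_messages` is a hypothetical helper name, and the example texts are reused from the prompt above.

```python
def build_few_shot_messages(examples, question):
    """Turn (user, assistant) example pairs plus a final question into chat messages."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    (
        "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:",
        "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.",
    ),
]
question = "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"

few_shot_prompt = build_few_shot_messages(examples, question)
print([m["role"] for m in few_shot_prompt])  # ['user', 'assistant', 'user']
```

The resulting list has the same shape as `one_shot_prompt`, so it can be passed to the pipeline in the same way.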


In [None]:
torch.cuda.empty_cache()

In [None]:
# Checking if GPU is available
if torch.cuda.is_available():
    print("GPU is available.")
    print('Using GPU: ', torch.cuda.get_device_name(0))
    print('Memory Usage: ')
    print('Allocated: ', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached: ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

else:
    print("GPU is not available.")

GPU is available.
Using GPU:  Tesla T4
Memory Usage: 
Allocated:  7.1 GB
Cached:  7.1 GB


## Chain Prompting: Breaking up the Problem


In [None]:
# Create name and slogan for a product
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a chatbot that leverages LLMs."}
]

In [None]:
outputs = pipe(product_prompt)

In [None]:
product_description = outputs[0]["generated_text"]

In [None]:
print(product_description)

 Name: ChatSage
Slogan: "Unleashing the power of AI to enhance your conversations."


In [None]:
# Based on a name and slogan for a product, generate a sales pitch
sales_prompt = [
    {"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}
]

In [None]:
outputs = pipe(sales_prompt)

In [None]:
sales_pitch = outputs[0]["generated_text"]


In [None]:
print(sales_pitch)

 Introducing ChatSage, the revolutionary AI-powered tool designed to elevate your conversations to new heights. With our cutting-edge technology, we unleash the power of AI to enhance your interactions, making every conversation more engaging, insightful, and meaningful. Experience the future of communication with ChatSage today!
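
The two steps above follow one pattern: the output of one prompt is inserted into the next. A minimal sketch of that pattern is shown below. `run_chain` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in so the example runs without a model.

```python
def run_chain(generate, templates, first_input=""):
    """Feed each step's output into the next template's {previous} placeholder."""
    result = first_input
    history = []
    for template in templates:
        prompt = template.format(previous=result)
        result = generate(prompt)
        history.append((prompt, result))
    return result, history


def fake_generate(prompt):
    """Stand-in for a real model call; it only reports the prompt length."""
    return f"<output for a {len(prompt)}-character prompt>"


templates = [
    "Create a name and slogan for a chatbot that leverages LLMs.",
    "Generate a very short sales pitch for the following product: '{previous}'",
]

final_output, history = run_chain(fake_generate, templates)
for prompt, output in history:
    print(prompt, "->", output)
```

To use the real model, `fake_generate` could be swapped for a function that wraps the prompt in a user message, calls `pipe`, and returns `outputs[0]["generated_text"]`, as the cells above do.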


# **Reasoning with Generative Models**


## Chain-of-Thought: Think Before Answering


In [6]:
# Answering without explicit reasoning
standard_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "11"},
    {"role": "user", "content": "The cafeteria had 25 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Run generative model
outputs = pipe(standard_prompt)
print(outputs[0]["generated_text"])

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


 The cafeteria started with 25 apples. They used 20 apples to make lunch, so they had 25 - 20 = 5 apples left. After buying 6 more apples, they now have 5 + 6 = 11 apples.


In [7]:
# Answering with chain-of-thought
cot_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Generate the output
outputs = pipe(cot_prompt)
print(outputs[0]["generated_text"])

 The cafeteria started with 23 apples. They used 20 apples for lunch, so they had 23 - 20 = 3 apples left. After buying 6 more apples, they now have 3 + 6 = 9 apples. The answer is 9.
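
Because the worked example ends with "The answer is 11.", the model tends to copy that closing phrase, which makes the final number easy to pull out. The sketch below is one possible way to do that: `build_cot_prompt` and `extract_answer` are hypothetical helper names, and the regular expression assumes the model keeps the exact wording "The answer is".

```python
import re
from typing import Dict, List, Optional, Tuple


def build_cot_prompt(examples: List[Tuple[str, str]],
                     question: str) -> List[Dict[str, str]]:
    """Turn (question, worked reasoning) pairs into alternating chat turns."""
    messages = []
    for example_question, reasoning in examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": reasoning})
    messages.append({"role": "user", "content": question})
    return messages


def extract_answer(text: str) -> Optional[int]:
    """Return the integer after the last 'The answer is', or None if absent."""
    matches = re.findall(r"The answer is\s+(-?\d+)", text)
    return int(matches[-1]) if matches else None


cot_output = (
    "The cafeteria started with 23 apples. They used 20 apples for lunch, "
    "so they had 23 - 20 = 3 apples left. After buying 6 more apples, they "
    "now have 3 + 6 = 9 apples. The answer is 9."
)
print(extract_answer(cot_output))
```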


## Zero-shot Chain-of-Thought


In [8]:
# Zero-shot Chain-of-Thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])

 Step 1: Start with the initial number of apples in the cafeteria, which is 23.

Step 2: Subtract the number of apples used to make lunch, which is 20.
23 - 20 = 3 apples remaining.

Step 3: Add the number of apples bought, which is 6.
3 + 6 = 9 apples.

So, the cafeteria now has 9 apples.
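
The zero-shot variant only needs the trigger phrase appended to the question, so it is easy to wrap in a helper. This is a rough sketch: `with_zero_shot_cot` and `last_integer` are hypothetical names, and taking the last integer in the text is a crude heuristic that assumes the model states its final result at the end.

```python
import re
from typing import Dict, List, Optional

COT_TRIGGER = "Let's think step-by-step."


def with_zero_shot_cot(question: str,
                       trigger: str = COT_TRIGGER) -> List[Dict[str, str]]:
    """Build a single-turn prompt that ends with the reasoning trigger."""
    return [{"role": "user", "content": f"{question.strip()} {trigger}"}]


def last_integer(text: str) -> Optional[int]:
    """Return the last integer that appears in the text, or None."""
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None


prompt = with_zero_shot_cot(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)
sample_output = "3 + 6 = 9 apples.\n\nSo, the cafeteria now has 9 apples."
print(prompt[0]["content"])
print(last_integer(sample_output))
```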


## Tree-of-Thought: Exploring Intermediate Steps


`NOTE`: Tree-of-Thought is a relatively new prompting technique. Here it is approximated in a single prompt by asking the model to simulate several experts who discuss their reasoning with one another.

In [9]:
# Zero-shot Tree-of-Thought
zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]

In [10]:
# Generate the output
outputs = pipe(zeroshot_tot_prompt)
print(outputs[0]["generated_text"])

 Expert 1:
Step 1: Start with the initial number of apples, which is 23.

Expert 2:
Step 1: Subtract the number of apples used for lunch, which is 20.
Step 2: Add the number of apples bought, which is 6.

Expert 3:
Step 1: Start with the initial number of apples, which is 23.
Step 2: Subtract the number of apples used for lunch, which is 20.
Step 3: Add the number of apples bought, which is 6.

Results:
All three experts arrived at the same answer:

Expert 1: 23 - 20 + 6 = 9 apples
Expert 2: (23 - 20) + 6 = 9 apples
Expert 3: (23 - 20) + 6 = 9 apples

All three experts agree that the cafeteria has 9 apples left.
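
When the experts disagree, one simple way to settle on a result is a majority vote over their final answers. The sketch below assumes the output contains lines shaped like `Expert 1: 23 - 20 + 6 = 9 apples`, as in the run above; `majority_answer` is a hypothetical helper, and a different output layout would need a different pattern.

```python
import re
from collections import Counter
from typing import Optional


def majority_answer(text: str) -> Optional[int]:
    """Collect each expert's '= N' result and return the most common value."""
    pattern = r"^Expert \d+:.*=\s*(\d+)"
    answers = [int(n) for n in re.findall(pattern, text, flags=re.MULTILINE)]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


tot_output = """Expert 1:
Step 1: Start with the initial number of apples, which is 23.

Results:
Expert 1: 23 - 20 + 6 = 9 apples
Expert 2: (23 - 20) + 6 = 9 apples
Expert 3: (23 - 20) + 6 = 9 apples
"""
print(majority_answer(tot_output))
```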


# **Output Verification**

## Providing Examples

In [11]:
# Zero-shot learning: Providing no examples
zeroshot_prompt = [
    {"role": "user", "content": "Create a character profile for an RPG game in JSON format."}
]

# Generate the output
outputs = pipe(zeroshot_prompt)
print(outputs[0]["generated_text"])

 ```json
{
  "name": "Eldrin the Wise",
  "race": "Elf",
  "class": "Wizard",
  "level": 10,
  "alignment": "Chaotic Good",
  "strength": 8,
  "dexterity": 14,
  "constitution": 12,
  "intelligence": 18,
  "wisdom": 16,
  "charisma": 10,
  "weapon_skill": "Magic",
  "armor_skill": "Light",
  "spell_slots": {
    "cantrips": ["Mage Hand", "Detect Magic", "Mage Armor", "Prestidigitation", "Identify", "Invisibility"],
    "1st level": ["Fireball", "Magic Missile", "Shield", "Cure Wounds", "Detect Thoughts", "Charm Person"],
    "2nd level": ["Light", "Hold Person", "Sleep", "Committee", "Enlarge Person", "Teleport"],
    "3rd level": ["Frostbite", "Fog Cloud", "Disintegrate", "Dimension Door", "Mirror Image", "Misty Step"]
  },
  "equipment": {
    "weapon": "Staff of the Ancients",
    "armor": "Leather Armor",
    "accessories": ["Staff of Power", "Ring of Protection", "Boots of Speed"]
  },
  "background": "Adept",
  "personality": "Curious and inventive, Eldrin is always seeking new k

In [12]:
# One-shot learning: Providing an example of the output structure
one_shot_template = """Create a short character profile for an RPG game. Make sure to only use this format:

{
  "description": "A SHORT DESCRIPTION",
  "name": "THE CHARACTER'S NAME",
  "armor": "ONE PIECE OF ARMOR",
  "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
    {"role": "user", "content": one_shot_template}
]

# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 {
  "description": "A cunning rogue with a mysterious past, skilled in stealth and deception.",
  "name": "Shadowcloak",
  "armor": "Leather Hood",
  "weapon": "Dagger"
}
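
A prompt can only ask for a format; it cannot guarantee it. A quick check like the following can confirm that the reply really is a JSON object with the requested keys. This is a sketch under assumptions: `follows_format` and `EXPECTED_KEYS` are hypothetical names, and the key set is copied from the template above.

```python
import json

EXPECTED_KEYS = {"description", "name", "armor", "weapon"}


def follows_format(text: str, expected_keys=EXPECTED_KEYS) -> bool:
    """True if text parses as a JSON object with exactly the expected keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == set(expected_keys)


one_shot_output = """ {
  "description": "A cunning rogue with a mysterious past, skilled in stealth and deception.",
  "name": "Shadowcloak",
  "armor": "Leather Hood",
  "weapon": "Dagger"
}"""
print(follows_format(one_shot_output))
```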


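With the example format, generation took about 4 seconds; without it, almost 23 seconds. Possible reasons:
* the example may genuinely help the LLM answer correctly and faster
* or the model had already seen a similar task in the previous cell and could therefore respond more quickly. To check, run this cell without first running the `Zero-shot learning: Providing no examples` cell.
* the constrained format also produces a far shorter answer than the zero-shot output above, and generation time grows with the number of generated tokens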
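
To compare the two runs more fairly, it helps to measure the time together with the output length. The helper below is a rough sketch: `timed_generation` is a hypothetical name, and the word count is only a crude proxy for the number of generated tokens. With this notebook's pipeline it could be called as `timed_generation(lambda: pipe(one_shot_prompt)[0]["generated_text"])`.

```python
import time
from typing import Callable, Tuple


def timed_generation(generate: Callable[[], str]) -> Tuple[str, float, float]:
    """Run generate() once and return (text, seconds, seconds per word)."""
    start = time.perf_counter()
    text = generate()
    seconds = time.perf_counter() - start
    words = max(len(text.split()), 1)  # avoid division by zero on empty output
    return text, seconds, seconds / words


# A plain function stands in for the model so the block runs anywhere.
text, seconds, per_word = timed_generation(lambda: "a short reply")
print(f"{seconds:.6f} s total, {per_word:.6f} s per word")
```

If the time per word is similar in both runs, the difference in total time is mostly explained by the difference in output length.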

## Grammar: Constrained Sampling


In [13]:
import gc
import torch
del model, tokenizer, pipe

In [14]:
# Flush memory
gc.collect()
torch.cuda.empty_cache()

Flushing memory like this is essential when running inference on limited resources.
It also ensures that later cells do not accidentally run inference with a stale, irrelevant model whose results would then be presented to the reviewer.
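
The order of the steps matters: `torch.cuda.empty_cache()` can only hand back memory that no live tensor still occupies, so the references have to be dropped and collected first. The following pure-Python sketch illustrates that idea without needing a GPU. `DummyModel` is a hypothetical stand-in for a large model object, and the reference cycle is added on purpose so that `gc.collect()` is actually required.

```python
import gc
import weakref


class DummyModel:
    """Hypothetical stand-in for a large model object."""

    def __init__(self):
        # A reference cycle: reference counting alone cannot free this object.
        self.self_reference = self


model = DummyModel()
tracker = weakref.ref(model)  # lets us observe whether the object still exists

del model                     # drop the only external reference
gc.collect()                  # collect the now unreachable cycle

released = tracker() is None
print("released:", released)
```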

In [16]:
!pip install llama-cpp-python

Collecting llama-cpp-python
  Using cached llama_cpp_python-0.3.1.tar.gz (63.9 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Using cached diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Using cached diskcache-5.6.3-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.1-cp310-cp310-linux_x86_64.whl size=3510590 sha256=dabfe52212e4a54dc8996a48eec525082555ad8a520183d651265f3576c47311
  Stored in directory: /root/.cache/pip/wheels/f8/b0/a2/f47d952aec7ab061b9e2a345e23a1e1e137beb7891259e3d0c
Successfully built llama-cpp-python
Installing collected packages: diskcache, llama-cpp-python
S

In [17]:
from llama_cpp.llama import Llama

In [18]:
# Load Phi-3
llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="*fp16.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False
)

Phi-3-mini-4k-instruct-fp16.gguf:   0%|          | 0.00/7.64G [00:00<?, ?B/s]

`QUESTION`

What is the llama.cpp version of these models? Check with Claude, ChatGPT, and Andrej Karpathy's videos.

In [19]:
# Generate output
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Create a warrior for an RPG in JSON format."},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)['choices'][0]['message']["content"]


Check whether the OpenAI Python client and the Hugging Face platform offer a parameter similar to `response_format={"type": "json_object"}`.

The cells below show a useful way to convert the output string from the response into `json_output`, a formatted JSON string that can be saved to a file.
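
Saving the response takes one more step after parsing. The sketch below is a minimal illustration rather than part of the original notebook: `save_json_response` is a hypothetical helper, `sample_output` is a shortened stand-in for the model's reply, and the file is written to the system's temporary directory under an arbitrary name.

```python
import json
import tempfile
from pathlib import Path


def save_json_response(response_text: str, path: Path) -> dict:
    """Parse a JSON string, write it to path with indentation, return the data."""
    data = json.loads(response_text)  # raises json.JSONDecodeError if invalid
    path.write_text(json.dumps(data, indent=4), encoding="utf-8")
    return data


# Shortened stand-in for the string returned by the model.
sample_output = '{"warrior": {"name": "Eldric Stormbringer", "level": 5}}'
target = Path(tempfile.gettempdir()) / "warrior_example.json"

saved = save_json_response(sample_output, target)
reloaded = json.loads(target.read_text(encoding="utf-8"))
print(reloaded["warrior"]["name"])
```

Parsing before writing means that a malformed reply fails loudly instead of producing a broken file.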

### Converting the Response into a JSON File

In [20]:
import json

In [21]:
# Format as json
json_output = json.dumps(json.loads(output), indent=4)
print(json_output)

{
    "warrior": {
        "name": "Eldric Stormbringer",
        "class": "Warrior",
        "level": 5,
        "attributes": {
            "strength": 18,
            "dexterity": 10,
            "constitution": 16,
            "intelligence": 8,
            "wisdom": 10,
            "charisma": 12
        },
        "skills": [
            {
                "name": "Martial Arts",
                "proficiency": 20,
                "description": "Expert in hand-to-hand combat and weapon handling."
            },
            {
                "name": "Shield Block",
                "proficiency": 18,
                "description": "Highly skilled at deflecting attacks with a shield."
            },
            {
                "name": "Heavy Armor",
                "proficiency": 16,
                "description": "Expertly equipped with heavy armor for protection."
            },
            {
                "name": "Survival",
                "proficiency": 14,
                "