<h1>Chapter 6 - Prompt Engineering</h1>
<i>Methods for improving the output through prompt engineering.</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter06/Chapter%206%20-%20Prompt%20Engineering.ipynb)

---

This notebook is for Chapter 6 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or another cloud platform), **run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
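# Quote each requirement so the shell does not treat ">=" as an output redirection
!pip install "langchain>=0.1.17" "openai>=1.13.3" "langchain_openai>=0.1.6" "transformers>=4.40.1" "datasets>=2.18.0" "accelerate>=0.27.2" "sentence-transformers>=2.5.1" "duckduckgo-search>=5.2.2"
# Recent llama-cpp-python releases use GGML_CUDA in place of the older LLAMA_CUBLAS flag
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python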

## Loading our model

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, GenerationConfig

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct")
generation_config = GenerationConfig(
    max_new_tokens=500,
    do_sample=False,
)
# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    generation_config=generation_config
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda


In [3]:
# Prompt
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate the output
output = pipe(messages)
print(output[0]["generated_text"])

Why did the chicken decide to become a professional magician?

Because it heard the secret move was "peekaboo-chicken" – you know, where you pretend to disappear behind your wings just long enough to pull off the biggest egg-ception trick ever! And guess what? Chickens are naturally inclined to do the disappearing act anyway, making it their signature act!


In [4]:
# Apply prompt template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

[|system|][|endofturn|]
[|user|]Create a funny joke about chickens.



In [5]:
# Using a high temperature
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

Why did the chicken decide to learn coding?

Because it heard chickens could write codes—and let's face it, if they can fly into the sunset, they might as well learn to debug out their coop's WiFi! 🐓💻☁️🔍


In [6]:
# Using a high top_p
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

Why don't chickens ever play hide-and-seek?

Because good luck hiding when your next prize wins are always announced by saying "Chicken of course!"—just like in the old tune, "That’s my chicken, all my chicken."
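
The two cells above change `temperature` and `top_p`. As a rough, pure-Python sketch of what these two knobs do to a next-token distribution (this is not the `transformers` implementation, and the toy logits are made up for illustration):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}

toy_logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four candidate tokens
sharp = softmax_with_temperature(toy_logits, temperature=0.2)
flat = softmax_with_temperature(toy_logits, temperature=1.0)
nucleus = top_p_filter(flat, top_p=0.8)  # only the most likely tokens survive
```

With `top_p=1`, as in the cell above, no token is filtered out, so sampling draws from the full distribution.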


# **Intro to Prompt Engineering**


## The Basic Ingredients of a Prompt


# **Advanced Prompt Engineering**


## Complex Prompt

In [10]:
# Text to summarize which we stole from https://jalammar.github.io/illustrated-transformer/ ;)
example = """In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter.
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar what attention does in seq2seq models).
Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
"""

# Prompt components
persona = "You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.\n"
instruction = "Summarize the key findings of the paper provided.\n"
context = "Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.\n"
data_format = "Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.\n"
audience = "The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.\n"
tone = "The tone should be professional and clear.\n"
# The text to summarize is the `example` defined above; swap in your own text here
data = f"Text to summarize: {example}"

# The full prompt - remove and add pieces to view its impact on the generated output
query = persona + instruction + context + data_format + audience + tone + data
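
To make the "remove and add pieces" experiment easier to repeat, the components could be kept in a dictionary and assembled by a small helper. This is only a sketch: `build_query` and its `skip` argument are hypothetical names, and the component texts are shortened stand-ins for the ones defined above.

```python
def build_query(components, data, skip=()):
    """Concatenate the prompt components in order, leaving out any names listed in `skip`."""
    parts = [text for name, text in components.items() if name not in skip]
    return "".join(parts) + data

# Shortened stand-ins for the components defined above (illustrative text only)
components = {
    "persona": "You are an expert in Large Language models.\n",
    "instruction": "Summarize the key findings of the paper provided.\n",
    "tone": "The tone should be professional and clear.\n",
}
data = "Text to summarize: ..."

full_query = build_query(components, data)
no_persona = build_query(components, data, skip=("persona",))
```

Generating once with `full_query` and once with `no_persona` gives a quick way to see how much a single component changes the output.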

In [11]:
messages = [
    {"role": "user", "content": query}
]
print(tokenizer.apply_chat_template(messages, tokenize=False))

[|system|][|endofturn|]
[|user|]You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.
Summarize the key findings of the paper provided.
Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.
Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.
The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.
The tone should be professional and clear.
Text to summarize: In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outp

In [12]:
# Generate the output
outputs = pipe(messages)
print(outputs[0]["generated_text"])

### Key Findings Summary

- **Model Overview**:
  - **Transformer Architecture**: Introduced in "Attention is All You Need," focusing on efficiency through parallelizable design.
  - **Comparison**: Outperforms Google Neural Machine Translation models in specific tasks.
  - **Parallelization Advantage**: Key benefit lies in its ability to train models more efficiently due to parallelizable layers.

- **Model Components**:
  - **Encoding Stack**: Multiple identical encoder layers (typically six) stacked vertically.
    - **Self-Attention Layer**: Each encoder processes input vectors to capture dependencies within the sequence.
    - **Feed-Forward Network**: Applies independently to each position within the encoder output.
  - **Decoding Stack**: Identical decoder layers stacked vertically, each with:
    - **Self-Attention Layer**: Focuses decoder on relevant parts of the input sequence.
    - **Feed-Forward Network**: Similar to encoders but operates independently across positions.

-

## In-Context Learning: Providing Examples

In [13]:
# Use a single example of using the made-up word in a sentence
one_shot_prompt = [
    {
        "role": "user",
        "content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:"
    },
    {
        "role": "assistant",
        "content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
    },
    {
        "role": "user",
        "content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"
    }
]
print(tokenizer.apply_chat_template(one_shot_prompt, tokenize=False))

[|system|][|endofturn|]
[|user|]A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:
[|assistant|]I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.[|endofturn|]
[|user|]To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:



In [14]:
# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

In the ancient forest, the rogue adventurer decided to **screeg** at the shadowy figure lurking near the ancient oak tree, hoping to scare it away before it could attack.
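
The one-shot prompt above alternates `user` and `assistant` turns by hand. A minimal sketch of a helper that builds the same structure from any number of example pairs (the function name `build_few_shot_messages` is an assumption, not part of the notebook):

```python
def build_few_shot_messages(examples, question):
    """Turn (user, assistant) example pairs plus a final question into chat messages."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user turn is the one the model is expected to answer
    messages.append({"role": "user", "content": question})
    return messages

examples = [
    (
        "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:",
        "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.",
    ),
]
few_shot_prompt = build_few_shot_messages(
    examples,
    "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:",
)
```

The resulting list can be passed to `pipe(...)` in the same way as `one_shot_prompt`.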


## Chain Prompting: Breaking up the Problem


In [15]:
# Create name and slogan for a product
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a chatbot that leverages LLMs."}
]
outputs = pipe(product_prompt)
product_description = outputs[0]["generated_text"]
print(product_description)

### Name: **SynapseChat**

### Slogan: **"Connect Intelligently, Engage Authentically"**

#### Explanation:
- **SynapseChat** evokes imagery of neural connections and intelligence, highlighting its capability to understand and interact deeply with users through advanced language models (LLMs). The name suggests a sophisticated yet approachable interface designed to bridge human communication seamlessly.
  
- **"Connect Intelligently, Engage Authentically"** encapsulates the core functionalities and promise of SynapseChat:
  - **Intelligently**: Emphasizes the sophisticated processing and understanding capabilities of LLMs, ensuring responses are not only accurate but also contextually relevant.
  - **Engage Authentically**: Underlines the bot’s ability to create meaningful, personalized interactions that feel genuine and tailored to individual users, enhancing user satisfaction and engagement.

This combination aims to position SynapseChat as a cutting-edge conversational AI tool that 

In [16]:
# Based on a name and slogan for a product, generate a sales pitch
sales_prompt = [
    {"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}
]
outputs = pipe(sales_prompt)
sales_pitch = outputs[0]["generated_text"]
print(sales_pitch)

**SynapseChat:**  
**"Transform Your Conversations—Intelligently, Authentically."**  
Unlock deeper connections with SynapseChat, where advanced AI understands your nuances and responds with genuine, personalized interactions. Connect smarter, engage more authentically—experience the future of conversational AI today!
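
The two cells above form a chain by hand: the first output is pasted into the second prompt. A generic version might look like the sketch below. `run_chain`, the `{previous}` placeholder, and `fake_generate` are all hypothetical; `fake_generate` stands in for a real model call so the example runs without a GPU.

```python
def run_chain(templates, first_input, generate):
    """Feed each step's output into the next template's `{previous}` slot."""
    previous = first_input
    outputs = []
    for template in templates:
        prompt = template.format(previous=previous)
        previous = generate(prompt)
        outputs.append(previous)
    return outputs

def fake_generate(prompt):
    """Stand-in for a model call; simply echoes the prompt it received."""
    return f"<response to: {prompt}>"

steps = [
    "Create a name and slogan for {previous}.",
    "Generate a very short sales pitch for the following product: '{previous}'",
]
chain_outputs = run_chain(steps, "a chatbot that leverages LLMs", fake_generate)
```

With the pipeline from this notebook, `generate` could wrap `pipe`, for example `lambda p: pipe([{"role": "user", "content": p}])[0]["generated_text"]`.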


# **Reasoning with Generative Models**


## Chain-of-Thought: Think Before Answering


In [17]:
# Answering without explicit reasoning
standard_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "11"},
    {"role": "user", "content": "The cafeteria had 25 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Run generative model
outputs = pipe(standard_prompt)
print(outputs[0]["generated_text"])

To find out how many apples the cafeteria has now, follow these steps:

1. Start with the initial number of apples: 25 apples.
2. Subtract the number of apples used for lunch: \( 25 - 20 = 5 \) apples left.
3. Add the number of apples bought: \( 5 + 6 = 11 \) apples.

So, the cafeteria now has **11 apples**.


In [18]:
# Answering with chain-of-thought
cot_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

# Generate the output
outputs = pipe(cot_prompt)
print(outputs[0]["generated_text"])

To find out how many apples the cafeteria has now, follow these steps:

1. Start with the initial number of apples: 23 apples.
2. Subtract the number of apples used for lunch: \( 23 - 20 = 3 \) apples remaining.
3. Add the number of apples bought: \( 3 + 6 = 9 \) apples.

So, the cafeteria now has **9 apples**.


## Zero-shot Chain-of-Thought


In [19]:
# Zero-shot Chain-of-Thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])


Sure, let's break it down step by step:

1. **Initial Number of Apples**: The cafeteria starts with 23 apples.
   
2. **Apples Used**: They used 20 apples to make lunch. So, we subtract these from the initial amount:
   \[
   23 - 20 = 3
   \]
   After using 20 apples, they have 3 apples left.

3. **Apples Bought**: They then bought 6 more apples. So, we add these to the remaining apples:
   \[
   3 + 6 = 9
   \]

Therefore, after using some apples and buying more, the cafeteria now has **9 apples**.
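
Chain-of-thought responses bury the final answer inside the reasoning. When comparing prompts, it can help to pull that answer out programmatically. The sketch below uses a simple heuristic (take the last number in the text); `extract_final_number` is a hypothetical helper and will misfire on responses that end with an unrelated number.

```python
import re

def extract_final_number(text):
    """Return the last number that appears in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    if not matches:
        return None
    value = float(matches[-1])
    # Report whole numbers as int so 9.0 is returned as 9
    return int(value) if value.is_integer() else value

response = (
    "3. Add the number of apples bought: 3 + 6 = 9 apples.\n\n"
    "So, the cafeteria now has **9 apples**."
)
final_answer = extract_final_number(response)
```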


## Tree-of-Thought: Exploring Intermediate Steps


In [20]:
# Zero-shot Tree-of-Thought
zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]

In [21]:
# Generate the output
outputs = pipe(zeroshot_tot_prompt)
print(outputs[0]["generated_text"])

### Expert Analysis Session: Counting Cafeteria Apples

#### Expert 1: **Mathematical Precision**
**Step 1:** **Initial Count Verification**  
- Start by confirming the initial number of apples: 23 apples.
- **Thought Process:** Ensuring accuracy from the outset is crucial to avoid errors early on.

**Step 2:** **Subtraction for Used Apples**  
- Subtract the apples used for lunch: \(23 - 20 = 3\).
- **Thought Process:** This step logically follows the subtraction needed to reflect the reduction in stock due to usage.

**Step 3:** **Addition for New Apples**  
- Add the newly purchased apples: \(3 + 6 = 9\).
- **Thought Process:** This step correctly accounts for the increase in stock after replenishment.

**Conclusion:** Expert 1 concludes that the cafeteria now has **9 apples**.

#### Expert 2: **Logical Flow and Verification**
**Step 1:** **Initial Inventory Check**  
- Confirm the starting point: 23 apples.
- **Thought Process:** Starting with a clear understanding of the initial q

# **Output Verification**

## Providing Examples

In [22]:
# Zero-shot learning: Providing no examples
zeroshot_prompt = [
    {"role": "user", "content": "Create a character profile for an RPG game in JSON format."}
]

# Generate the output
outputs = pipe(zeroshot_prompt)
print(outputs[0]["generated_text"])

Certainly! Below is a character profile formatted in JSON suitable for an RPG game. This profile includes basic attributes, skills, equipment, background, and some narrative elements to enrich the character's identity within the game world.

```json
{
  "characterProfile": {
    "name": "Eldrin Stormweaver",
    "race": "Elf",
    "class": "Ranger",
    "level": 10,
    "alignment": "Neutral Good",
    "background": "Highborn Explorer",
    "description": "Eldrin Stormweaver stands tall with an ethereal grace, his eyes reflecting the wisdom of ages spent wandering the verdant forests and ancient mountains. Adorned with intricate tattoos that shimmer under moonlight, Eldrin carries himself with a blend of confidence and humility, embodying both the wild spirit of nature and the scholarly pursuit of knowledge.",
    
    "attributes": {
      "strength": 14,
      "dexterity": 18,
      "constitution": 12,
      "intelligence": 10,
      "wisdom": 16,
      "charisma": 8
    },
    
    

In [23]:
# One-shot learning: Providing an example of the output structure
one_shot_template = """Create a short character profile for an RPG game. Make sure to only use this format:

{
  "description": "A SHORT DESCRIPTION",
  "name": "THE CHARACTER'S NAME",
  "armor": "ONE PIECE OF ARMOR",
  "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
    {"role": "user", "content": one_shot_template}
]

# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

```json
{
  "description": "A weathered rogue with a knack for stealth and misdirection, haunted by a past mistake that led to the loss of a loved one. Driven by vengeance, they operate in the shadows, preferring to eliminate targets unseen rather than engage in direct confrontation.",
  "name": "Elysia Shadowhand",
  "armor": "Leather Armor (lightly patched and worn)",
  "weapon": "Dagger (twin blades), Thieves' Tools (lockpicks, grappling hook)"
}
```
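
Showing the format in the prompt does not guarantee that the model follows it, so the output is worth checking before it is used. A minimal sketch of such a check, assuming the response is either a bare JSON object or one wrapped in a Markdown code fence as above (`parse_profile` and `REQUIRED_KEYS` are hypothetical names):

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker
REQUIRED_KEYS = {"description", "name", "armor", "weapon"}

def parse_profile(generated_text):
    """Extract a JSON object from the model output and check that it uses exactly the requested keys."""
    # Strip an optional fenced block around the object
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, generated_text, flags=re.DOTALL)
    payload = match.group(1) if match else generated_text
    profile = json.loads(payload)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(profile, dict):
        raise ValueError("Expected a JSON object")
    if set(profile) != REQUIRED_KEYS:
        raise ValueError(f"Unexpected keys: {sorted(profile)}")
    return profile

sample = FENCE + 'json\n{"description": "d", "name": "n", "armor": "a", "weapon": "w"}\n' + FENCE
profile = parse_profile(sample)
```

A failed check could trigger a retry with the same prompt, or a fall back to the constrained sampling approach in the next section.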


## Grammar: Constrained Sampling


In [24]:
import gc
import torch
del model, tokenizer, pipe

# Flush memory
gc.collect()
torch.cuda.empty_cache()

In [49]:
!sudo apt update
!sudo apt install -y build-essential cmake libopenblas-dev

In [7]:
# Set CMAKE_ARGS on the same line as pip: each "!" command runs in its own shell
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python


In [8]:
!pip show llama-cpp-python

Name: llama_cpp_python
Version: 0.3.8
Summary: Python bindings for the llama.cpp library
Home-page: https://github.com/abetlen/llama-cpp-python
Author: 
Author-email: Andrei Betlen <abetlen@gmail.com>
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: diskcache, jinja2, numpy, typing-extensions
Required-by: 


In [9]:
from llama_cpp.llama import Llama

# Load Phi-3
llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="*fp16.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False
)

llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized


In [12]:
# Generate output
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Create a warrior for an RPG in JSON format."},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)['choices'][0]['message']["content"]
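
The call above only constrains the output to be valid JSON. The llama-cpp-python documentation also describes a JSON Schema mode, in which `response_format` carries a `schema` entry that restricts the fields as well. The sketch below only assembles the arguments; the schema fields and the helper `build_completion_kwargs` are illustrative assumptions, and schema support should be checked against the installed version.

```python
# Hypothetical schema for the warrior; the field names are illustrative
warrior_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "class": {"type": "string"},
        "level": {"type": "integer"},
    },
    "required": ["name", "class", "level"],
}

def build_completion_kwargs(user_prompt, schema):
    """Assemble keyword arguments for `create_chat_completion` with a schema-constrained response."""
    return {
        "messages": [{"role": "user", "content": user_prompt}],
        "response_format": {"type": "json_object", "schema": schema},
        "temperature": 0,
    }

kwargs = build_completion_kwargs(
    "Create a warrior for an RPG in JSON format.", warrior_schema
)
# With a loaded model, the call would be: llm.create_chat_completion(**kwargs)
```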



In [None]:
import json

# Format as json
json_output = json.dumps(json.loads(output), indent=4)
print(json_output)

{
    "name": "Eldrin Stormbringer",
    "class": "Ranger",
    "level": 5,
    "attributes": {
        "strength": 14,
        "dexterity": 18,
        "constitution": 12,
        "intelligence": 10,
        "wisdom": 13,
        "charisma": 9
    },
    "skills": {
        "archery": {
            "proficiency": 20,
            "critical_hit_chance": 5,
            "damage_range": [
                8,
                14
            ]
        },
        "stealth": {
            "proficiency": 17,
            "critical_hit_chance": 3,
            "damage_range": [
                2,
                6
            ]
        },
        "nature_magic": {
            "proficiency": 15,
            "critical_hit_chance": 4,
            "healing_range": [
                3,
                7
            ],
            "damage_range": [
                -2,
                2
            ]
        }
    },
    "equipment": {
        "weapons": [
            "Longbow",
            "Dagger"
      