# Prompt Engineering

In this notebook we practice prompt engineering with a `text-generation` model, accessed through LangChain's [ChatHuggingFace](https://python.langchain.com/docs/integrations/chat/huggingface/) integration.

In [1]:
%pip install --upgrade --quiet  langchain-huggingface text-generation transformers google-search-results numexpr langchainhub sentencepiece jinja2 bitsandbytes accelerate


In [2]:
import getpass
import os

if not os.getenv("HUGGINGFACEHUB_API_TOKEN"):
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Enter your token: ")

Enter your token: ··········


### Quantization

Quantization techniques **reduce memory and computational costs** by representing weights and activations with lower-precision data types like 8-bit integers (int8).

This makes it possible to load larger models that would otherwise not fit into memory, and can speed up inference. Transformers supports the AWQ and GPTQ quantization algorithms, as well as 8-bit and 4-bit quantization with `bitsandbytes`.
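To make the idea concrete, here is a minimal sketch of *absmax* int8 quantization in plain Python: each weight is scaled so that the largest magnitude maps to 127, rounded to an integer, and later divided by the same scale to recover an approximation. This only illustrates the general principle; `bitsandbytes` and the NF4 scheme configured below use more elaborate, block-wise methods. The function names and weights are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integer codes in [-127, 127] using a single absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = 127 / absmax if absmax else 1.0  # avoid dividing by zero for all-zero input
    codes = [round(v * scale) for v in values]  # round() returns an int here
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c / scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.91, -0.07]  # hypothetical weights
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.4f} -> {c:+4d} -> {r:+.4f}")
```

Each weight now needs one byte instead of four (float32), at the cost of a rounding error of at most half a quantization step.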

In [3]:
from transformers import BitsAndBytesConfig

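quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",  # use the 4-bit NormalFloat data type
    bnb_4bit_compute_dtype="float16",  # run the matrix multiplications in float16
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants to save memory
)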

<img src="https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png" width="600">

Image Source: [nvidia.com/blog](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/)
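A back-of-the-envelope estimate shows why 4-bit loading matters. Assuming roughly 7 billion parameters (the "7b" in `zephyr-7b-beta`, taken at face value) and counting the weights only, the memory needed is parameters × bits per weight ÷ 8 bytes:

```python
# Rough weight-only memory estimate; ignores activations, the KV cache and
# quantization constants, so real usage is higher.
NUM_PARAMS = 7_000_000_000  # assumption: "7B" taken at face value

BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "int8": 8, "4-bit (nf4)": 4}


def weight_memory_gb(num_params, bits):
    """Bytes needed to store the weights, in decimal gigabytes."""
    return num_params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>12}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Going from float32 to 4-bit shrinks the weights by a factor of 8, which is what lets a 7B model fit on GPUs where a full-precision copy would not.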

In [4]:
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    pipeline_kwargs=dict(
        max_new_tokens=512,  # upper bound on generated tokens; output is cut off if the model hasn't stopped by then
        do_sample=True,  # sample from the distribution instead of always taking the most probable token
        temperature=0.35,  # lower = more deterministic; should be > 0 when sampling (use do_sample=False for greedy)
        repetition_penalty=1.03,  # values > 1.0 discourage repeating tokens already in the sequence
        return_full_text=False,  # return only the newly generated text, not the prompt plus the completion
    ),
    model_kwargs={"quantization_config": quantization_config},  # load the weights in 4-bit (config above)
)

chat_model = ChatHuggingFace(llm=llm)
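`repetition_penalty` rescales the scores (logits) of tokens that already appear in the sequence before the next token is chosen. The sketch below follows our understanding of the rule used by `transformers` (originally proposed in the CTRL paper): positive logits are divided by the penalty and negative logits are multiplied by it, so a value above 1.0 always makes a repeat less likely. The function name, vocabulary and scores are hypothetical.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Penalize tokens that already appeared: shrink positive logits, push negative ones further down."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [3.0, 1.5, -0.5, 2.0]  # hypothetical scores for a 4-token vocabulary
already_generated = [0, 2]  # token ids that appeared earlier in the sequence

print(apply_repetition_penalty(logits, already_generated, penalty=1.03))
```

With `penalty=1.03`, as in the pipeline above, the nudge is deliberately small; tokens 1 and 3 are left untouched because they have not been generated yet.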






### `temperature`
- In short, the lower the temperature, the more deterministic the results: the most probable next token becomes increasingly likely to be picked, and as the temperature approaches 0 it is always picked (greedy decoding).
- Increasing the temperature flattens the probability distribution and adds randomness, which encourages more diverse or creative outputs.
- In terms of application, you might want to use a lower temperature for tasks like fact-based QA to encourage more factual and concise responses.
- For poem generation or other creative tasks, a higher temperature may be beneficial.
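Under the hood, the temperature $T$ divides the model's logits $z_i$ before the softmax that turns them into next-token probabilities, $p_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$. A small pure-Python sketch with made-up logits compares the value used above (0.35) with 1.0 and 2.0:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for the T -> 0 limit")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exp() to avoid overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits for four candidate next tokens

for t in (0.35, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

A lower $T$ concentrates the probability on the top token, a higher $T$ spreads it onto less likely tokens, and the ranking of the tokens never changes.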



<img src="https://miro.medium.com/v2/resize:fit:1400/0*J37qonVPJvKZpzv2" width="600">

Image Source: How to sample from language models, by Ben Mann (Towards Data Science)

### `max_new_tokens`

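`max_new_tokens` caps how many tokens the model may generate (the prompt does not count toward it). Setting a limit helps prevent overly long or irrelevant responses and **control costs**; if the limit is reached before the model finishes, the response is cut off mid-sentence.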
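Generation stops at whichever comes first: the model emitting its end-of-sequence (EOS) token, or `max_new_tokens` new tokens. The toy loop below uses a scripted stand-in for the model; `generate`, `make_toy_model` and the `</s>` token are all hypothetical here, chosen only to illustrate both stopping cases:

```python
EOS = "</s>"  # hypothetical end-of-sequence token


def generate(next_token_fn, prompt_tokens, max_new_tokens):
    """Append tokens until the model emits EOS or max_new_tokens is reached."""
    new_tokens = []
    while len(new_tokens) < max_new_tokens:
        token = next_token_fn(prompt_tokens + new_tokens)
        if token == EOS:
            break  # the model decided it was done
        new_tokens.append(token)
    return new_tokens


def make_toy_model(script, prompt_len):
    def next_token(context):
        # Pick the next scripted token based on how many tokens were generated so far
        return script[len(context) - prompt_len]
    return next_token


scripted = ["AI", "helps", "automate", "repetitive", "tasks", ".", EOS]
prompt = ["why", "AI?"]
model = make_toy_model(scripted, len(prompt))

print(generate(model, prompt, max_new_tokens=10))  # ['AI', 'helps', 'automate', 'repetitive', 'tasks', '.']
print(generate(model, prompt, max_new_tokens=3))   # ['AI', 'helps', 'automate']
```

In the second call the limit is hit before the model reaches EOS, so the "answer" is truncated.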


In [5]:
from langchain_core.messages import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="You're a helpful assistant"),
]

In [6]:
# user_input = "What happens when an unstoppable force meets an immovable object?"
user_input = input("Tell the AI something: ")

Tell the AI something: why people use AI?


In [7]:
messages.append(HumanMessage(content=user_input))

In [8]:
messages

[SystemMessage(content="You're a helpful assistant", additional_kwargs={}, response_metadata={}),
 HumanMessage(content='why people use AI?', additional_kwargs={}, response_metadata={})]
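Before generation, `ChatHuggingFace` flattens this message list into a single prompt string using the model's chat template (to our understanding, via the tokenizer's `apply_chat_template`). The sketch below approximates a Zephyr-style format by hand, mapping `SystemMessage` to the `system` role and `HumanMessage` to `user`; the exact tags are an assumption, since the authoritative template ships with the model's tokenizer. `render_zephyr_style` is a hypothetical helper.

```python
# Plain-dict version of the LangChain messages above
example_messages = [
    {"role": "system", "content": "You're a helpful assistant"},
    {"role": "user", "content": "why people use AI?"},
]


def render_zephyr_style(messages, eos_token="</s>", add_generation_prompt=True):
    """Approximate Zephyr-style chat formatting (assumption, not the real template)."""
    parts = []
    for message in messages:
        # Each turn: a role tag, the content, then the end-of-sequence token
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # Cue the model to continue as the assistant
        parts.append("<|assistant|>\n")
    return "".join(parts)


print(render_zephyr_style(example_messages))
```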

In [9]:
ai_msg = chat_model.invoke(messages)

In [10]:
print(ai_msg.content)

People use AI (Artificial Intelligence) for several reasons, including:

1. Improved efficiency and productivity: AI can automate repetitive tasks, freeing up time for humans to focus on more complex and creative work. This can lead to increased productivity and efficiency in various industries such as healthcare, finance, and manufacturing.

2. Enhanced decision-making: AI can analyze large amounts of data and provide insights that humans may not be able to detect. This can help organizations make more informed decisions, reducing risks and improving outcomes.

3. Better customer experiences: AI can personalize products and services to individual preferences, leading to a better customer experience. For example, AI-powered chatbots can quickly and accurately answer customer queries, providing a faster and more convenient service.

4. Advancements in research and development: AI can help accelerate scientific discoveries by analyzing vast amounts of data and identifying patterns that h

### Your Turn

**Exercise 1**: Experiment with different prompts to achieve different tasks:

1. Summarization
2. Translation
3. Classification
4. Get creative and make up your own task

Instructions can be as long as you'd like. Be clear about:
- The task you want to be achieved by the LLM
- Any constraints you want the LLM to take care of
- The format, style, tone, and phrasing of the response you want from the LLM

Hint: you need to change the `SystemMessage(content=">>INSTRUCTIONS<<")` to include your instructions.

Resources:
- [Prompt Engineering Best Practices](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/prompt-engineering#best-practices)
- [Prompt Engineering Techniques](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/advanced-prompt-engineering?pivots=programming-language-chat-completions)
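For the classification task, a common technique is few-shot prompting: put the instructions in the system message, then show a few worked input/answer pairs before the real query. A minimal sketch that builds such a message list as plain role/content dicts; the helper name, labels and job adverts are made up for illustration:

```python
def build_few_shot_messages(instructions, examples, query):
    """Build a chat message list: system instructions, worked examples, then the query.

    Each example is an (input_text, expected_label) pair, rendered as a user
    turn followed by the assistant's ideal answer.
    """
    messages = [{"role": "system", "content": instructions}]
    for text, label in examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages


instructions = (
    "Classify each job advert as SCAM or LEGITIMATE. "
    "Answer with exactly one word."
)
examples = [
    ("Earn $500 a day from home! Just pay a $50 registration fee to start.", "SCAM"),
    ("Junior analyst, 38 hours/week. Apply via our careers page; interviews are held in person.", "LEGITIMATE"),
]
query = "Remote data entry, no experience needed. Send your passport details to secure the position."

few_shot = build_few_shot_messages(instructions, examples, query)
for m in few_shot:
    print(f"{m['role']:>9}: {m['content']}")
```

The same role/content format is used by the `transformers` pipeline later in this notebook; with `chat_model` you can map the roles to `SystemMessage`, `HumanMessage` and `AIMessage`.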

In [None]:
# YOUR CODE DOWN HERE
messages = [
    # ...
]

In [11]:
from langchain_core.messages import SystemMessage, HumanMessage

# YOUR CODE DOWN HERE
article = (
    "Scammers target their victims based on their online behaviour, financial situation, needs and even vulnerability to certain types of persuasion. "
    "The increasing cost of living in Australia is creating a fertile ground for job scams. "
    "People in desperate need of employment, those who have been unemployed for a very long time and those seeking additional income via part-time (usually remote) jobs are all at high risk of becoming victims to these job scams. "
    "These individuals are driven by economic need and will easily overlook or not recognise red flags. "
    "University students and recent graduates looking for valuable work experience in tough job markets are increasingly becoming targets of job scams, too. "
    "Immigrants can be particularly susceptible to job scams, mainly because they may not be familiar with legitimate employment processes, standard recruitment practices and Australian employment rights. "
    "In extreme cases, employment scams can even result in international human trafficking, as shown by an incident in Cambodia last year, with victims being locked into compounds, having their passports confiscated and being trained to scam others. "
    "Captors would release them only upon receiving a ransom fee payment."
)

messages = [
    SystemMessage(content=f"Generate a concise summary of job scams, and highlight key insights from the text. [{article}]"),
]


In [12]:
# user_input = "What happens when an unstoppable force meets an immovable object?"
user_input = input("Tell the AI something to summarize: ")

Tell the AI something to summarize: why people got deceived by job scams?


In [13]:
messages.append(HumanMessage(content=user_input))

In [14]:
messages

[SystemMessage(content='Generate a concise summary of job scams, and highlight key insights from the text. [Scammers target their victims based on their online behaviour, financial situation, needs and even vulnerability to certain types of persuasion.The increasing cost of living in Australia is creating a fertile ground for job scams. People in desperate need of employment, those who have been unemployed for a very long time and those seeking additional income via part-time (usually remote) jobs are all at high risk of becoming victims to these job scams.These individuals are driven by economic need and will easily overlook or not recognise red flags. University students and recent graduates looking for valuable work experience in tough job markets are increasingly becoming targets of job scams, too.Immigrants can be particularly susceptible to job scams, mainly because they may not be familiar with legitimate employment processes, standard recruitment practices and Australian employ

In [15]:
ai_msg = chat_model.invoke(messages)

In [16]:
print(ai_msg.content)

People can get deceived by job scams due to various reasons, including:

1. Economic Need: People who are struggling to make ends meet or are in dire need of employment may overlook red flags and fall prey to job scams.

2. Long-term Unemployment: Individuals who have been unemployed for a long time may become desperate and more susceptible to job scams.

3. Part-time Remote Jobs: With the increasing popularity of remote work, people seeking additional income through part-time jobs are also at higher risk of falling for job scams.

4. University Students and Recent Graduates: In tough job markets, university students and recent graduates looking for valuable work experience may become targets of job scams.

5. Immigrants: Immigrants who are unfamiliar with legitimate employment processes, standard recruitment practices, and Australian employment rights may be more susceptible to job scams.

6. Vulnerability to Certain Types of Persuasion: Scammers target their victims based on their on


**Exercise 2**: Use a different LLM and compare the results by eye.

Hint: you need to change `model_id` to some `text-generation` model from [HuggingFace Hub](https://huggingface.co/models?pipeline_tag=text-generation). For example, try:
- [jinaai/reader-lm-1.5b](https://huggingface.co/jinaai/reader-lm-1.5b)

In [None]:
# YOUR CODE DOWN HERE

In [23]:
%pip install --upgrade transformers



In [18]:
# YOUR CODE DOWN HERE
import transformers
import torch

In [28]:
%pip install huggingface_hub



In [None]:


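import transformers
import torch
from huggingface_hub import login

# Prompts for a Hugging Face access token; Llama 3.1 is a gated model,
# so the account must have accepted Meta's license on the model page
login()

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Build a text-generation pipeline; device_map="auto" spreads the weights
# across the available devices (GPU first, then CPU if needed)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},  # load the weights in bfloat16
    device_map="auto",
)

# Chat-style input as role/content dicts; the pipeline applies the model's chat template
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
# generated_text holds the whole conversation; the last entry is the assistant's reply
print(outputs[0]["generated_text"][-1])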


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
from langchain_core.messages import SystemMessage, HumanMessage

# YOUR CODE DOWN HERE
article = (
    "Scammers target their victims based on their online behaviour, financial situation, needs and even vulnerability to certain types of persuasion. "
    "The increasing cost of living in Australia is creating a fertile ground for job scams. "
    "People in desperate need of employment, those who have been unemployed for a very long time and those seeking additional income via part-time (usually remote) jobs are all at high risk of becoming victims to these job scams. "
    "These individuals are driven by economic need and will easily overlook or not recognise red flags. "
    "University students and recent graduates looking for valuable work experience in tough job markets are increasingly becoming targets of job scams, too. "
    "Immigrants can be particularly susceptible to job scams, mainly because they may not be familiar with legitimate employment processes, standard recruitment practices and Australian employment rights. "
    "In extreme cases, employment scams can even result in international human trafficking, as shown by an incident in Cambodia last year, with victims being locked into compounds, having their passports confiscated and being trained to scam others. "
    "Captors would release them only upon receiving a ransom fee payment."
)

messages = [
    SystemMessage(content=f"Generate a concise summary of job scams, and highlight key insights from the text. [{article}]"),
]
