# Prompt Engineering

We will exercise prompt engineering using a `text-generation` model via the [ChatHuggingFace](https://python.langchain.com/docs/integrations/chat/huggingface/) API.

In [1]:
%pip install --upgrade --quiet  langchain-huggingface text-generation transformers google-search-results numexpr langchainhub sentencepiece jinja2 bitsandbytes accelerate


In [2]:
import getpass
import os

if not os.getenv("HUGGINGFACEHUB_API_TOKEN"):
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Enter your token: ")

Enter your token: ··········


### Quantization

Quantization techniques **reduce memory and computational costs** by representing weights and activations with lower-precision data types like 8-bit integers (int8).
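To make the idea of a lower-precision representation concrete, here is a minimal sketch of absmax int8 quantization in plain Python. This is a deliberately simple scheme chosen for illustration; it is not the NF4 scheme that `bitsandbytes` applies in this notebook, and the weight values are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to int8 codes using one shared scale (absmax quantization)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up example weights
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

Each weight is stored as a one-byte integer plus a shared scale, and the round-trip error is bounded by half the scale. That small loss of precision is the price paid for the memory savings.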

This makes it possible to load larger models that would not otherwise fit into memory, and it can speed up inference. Transformers supports the AWQ and GPTQ quantization algorithms, as well as 8-bit and 4-bit quantization with `bitsandbytes`.
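A rough back-of-the-envelope estimate shows why this matters. The sketch below assumes a round figure of 7 billion parameters for a "7B" model and counts only the weights. It ignores activations, the KV cache, and the small overhead of quantization constants, so treat the numbers as lower bounds.

```python
# Approximate storage cost per parameter for each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gib(num_params, dtype):
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 7_000_000_000  # assumption: round figure for a "7B" model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(num_params, dtype):5.1f} GiB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is often the difference between fitting on a single consumer or Colab GPU and not fitting at all.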

In [3]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

<img src="https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png" width="600">

Image Source: [nvidia.com/blog](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/)

In [4]:
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    pipeline_kwargs=dict(
        max_new_tokens=512,  # Upper bound on generated tokens (cuts the output off if the model doesn't stop itself)
        do_sample=True,  # Enable sampling (the amount of randomness depends on temperature)
        temperature=0.35,  # Lower values approach greedy decoding (always picking the most likely token)
        repetition_penalty=1.03,
        return_full_text=False,  # Return only the newly generated text, without the prompt
    ),
    model_kwargs={"quantization_config": quantization_config},
)

chat_model = ChatHuggingFace(llm=llm)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




### `temperature`
- In short, the lower the temperature, the more deterministic the results: the most probable next token is picked more and more often, and as the temperature approaches 0 it is always picked.
- Increasing the temperature adds randomness, which encourages more diverse or creative outputs.
- In terms of application, you might want to use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses.
- For poem generation or other creative tasks, it might be beneficial to increase the temperature value.
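The effect of temperature can be sketched with a softmax over next-token scores. The logits below are hypothetical, and the function is an illustration of the standard temperature-scaled softmax rather than the exact code path inside `transformers`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding instead of 0")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical next-token scores

for temperature in (0.35, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature most of the probability mass sits on the top-scoring token, so sampling behaves almost like greedy decoding. At a higher temperature the distribution flattens and less likely tokens are sampled more often.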



<img src="https://miro.medium.com/v2/resize:fit:1400/0*J37qonVPJvKZpzv2" width="600">

Image Source: "How to sample from language models" by Ben Mann, Towards Data Science

### `max_new_tokens`

Specifying a max length helps you prevent long or irrelevant responses and **control costs**.
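Conceptually, the limit is a cap on the number of iterations of the generation loop. The following toy sketch uses a hypothetical `toy_next_token` function in place of a real model to show the two ways generation can end: the model emits an end-of-sequence token, or the cap is reached.

```python
def generate(next_token, prompt_tokens, max_new_tokens, eos_token="<eos>"):
    """Generate tokens until the model emits EOS or the cap is reached."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(prompt_tokens + generated)
        if token == eos_token:
            break  # the model stopped by itself
        generated.append(token)
    return generated


def toy_next_token(tokens):
    """Hypothetical stand-in for a model that never stops on its own."""
    return f"tok{len(tokens)}"


output = generate(toy_next_token, ["hello"], max_new_tokens=5)
print(output)
```

Because the toy model never emits the end-of-sequence token, the output is cut off after exactly `max_new_tokens` tokens. A real response truncated this way may end mid-sentence, so the cap should be generous enough for the task.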


In [5]:
from langchain_core.messages import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="You're a helpful assistant"),
]

In [6]:
# user_input = "What happens when an unstoppable force meets an immovable object?"
user_input = input("Tell the AI something: ")

Tell the AI something: recommend me a food for lunch


In [7]:
messages.append(HumanMessage(content=user_input))

In [8]:
messages

[SystemMessage(content="You're a helpful assistant", additional_kwargs={}, response_metadata={}),
 HumanMessage(content='recommend me a food for lunch', additional_kwargs={}, response_metadata={})]

In [9]:
ai_msg = chat_model.invoke(messages)

In [10]:
print(ai_msg.content)

I am not capable of knowing your personal preferences, dietary restrictions, or location. However, based on general recommendations, a grilled chicken salad with mixed greens, cherry tomatoes, cucumber, avocado, and a vinaigrette dressing could be a nutritious and delicious option for lunch. Alternatively, a vegetable stir-fry with brown rice and tofu could be a healthy and filling choice. Don't hesitate to let me know any specific dietary requirements or preferences you have, and I'll provide more tailored recommendations.


### Your Turn

**Exercise 1**: Experiment with different prompts to achieve different tasks:

1. Summarization
2. Translation
3. Classification
4. Get creative and make up your own task

Instructions can be as long as you'd like. Be clear about:
- The task you want the LLM to perform
- Any constraints the LLM should respect
- The format, style, tone, and phrasing of the response you want from the LLM

Hint: you need to change the `SystemMessage(content=">>INSTRUCTIONS<<")` to include your instructions.
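As a starting point, the sketch below collects one instruction string per task and builds the message list from it. The helper name `build_messages` and the instruction texts are illustrative assumptions. The messages are written as `(role, content)` tuples, which LangChain chat models generally accept in place of `SystemMessage` and `HumanMessage` objects; if your installed version does not, wrap the strings in those classes instead.

```python
# Hypothetical instruction texts: each states the task, a constraint, and the output format.
TASK_INSTRUCTIONS = {
    "summarization": "Summarize the user's text in at most two sentences. Use plain language.",
    "translation": "Translate the user's text from English to Arabic. Reply with the translation only.",
    "classification": "Classify the sentiment of the user's text as positive, negative, or neutral. Reply with one word.",
}


def build_messages(instructions, user_input):
    """Pair a system instruction with a single user message."""
    return [("system", instructions), ("human", user_input)]


messages = build_messages(TASK_INSTRUCTIONS["classification"], "The lunch was delicious!")
print(messages)

# In the notebook, the list can then be sent to the model:
# ai_msg = chat_model.invoke(messages)
```

Stating the output format explicitly ("Reply with the translation only", "Reply with one word") tends to reduce the extra commentary that a bare instruction invites.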

In [11]:
# YOUR CODE DOWN HERE
messages = [
   SystemMessage(content="You translate English to Arabic"),
]

In [12]:
user_input = input("Tell the AI something to translate from English to Arabic: ")

Tell the AI something to translate from English to Arabic: How are you?


In [13]:
messages.append(HumanMessage(content=user_input))

In [14]:
messages

[SystemMessage(content='You translate English to Arabic', additional_kwargs={}, response_metadata={}),
 HumanMessage(content='How are you?', additional_kwargs={}, response_metadata={})]

In [15]:
ai_msg = chat_model.invoke(messages)

In [16]:
print(ai_msg.content)

أنا بخص, وكيف تعبي؟ (Anā baḵṣ, wa-kayf taʿibī?)

This is a common greeting in Arabic. The first part, "أنا بخص," translates to "I am fine," and the second part, "وكيف تعبي؟," translates to "And how about you?" or simply "How are you?" in English.

The response to this question would be:

أنا بخص، وهو شكراً جزيلاً. (Anā baḵṣ, wa-hūwa shukriyān jazīlān.)

This translates to "I am fine, and thanks a lot." or "I'm doing well, thank you very much."

In other Arabic-speaking regions, the greeting may vary slightly, but the meaning is generally the same.



**Exercise 2**: Use a different LLM and compare the results by eye.

Hint: you need to change `model_id` to another `text-generation` model from the [HuggingFace Hub](https://huggingface.co/models?pipeline_tag=text-generation).
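To compare by eye, it helps to send identical messages to every model and print the replies side by side. The sketch below is one possible harness; `compare_models` is a hypothetical helper, and the two lambdas are stand-ins so that the block runs without loading any model.

```python
def compare_models(models, messages):
    """Send the same messages to each model and collect the replies by name."""
    return {name: ask(messages) for name, ask in models.items()}


# Stand-ins for real models. In the notebook, each entry could instead be
# something like: lambda msgs: some_chat_model.invoke(msgs).content
models = {
    "model-a": lambda msgs: "reply from model A",
    "model-b": lambda msgs: "reply from model B",
}

messages = [("system", "You translate English to Arabic"), ("human", "How are you?")]
results = compare_models(models, messages)

for name, reply in results.items():
    print(f"--- {name} ---")
    print(reply)
```

Keeping the system message, the user input, and the sampling parameters identical across models makes the differences in the replies easier to attribute to the models themselves.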

In [None]:
!pip install flash_attn


In [None]:
!pip install datamodel_code_generator

In [4]:
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline


In [27]:
# YOUR CODE DOWN HERE
llm = HuggingFacePipeline.from_model_id(
    model_id="openbmb/MiniCPM3-4B",
    task="text-generation",
    pipeline_kwargs=dict(
        max_new_tokens=512,
        do_sample=True,
        temperature=0.35,
        repetition_penalty=1.03,
        return_full_text=False,
    ),
    model_kwargs={"quantization_config": quantization_config},
)

chat_model = ChatHuggingFace(llm=llm)

The repository for openbmb/MiniCPM3-4B contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/openbmb/MiniCPM3-4B.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



In [48]:
from langchain_core.messages import SystemMessage, HumanMessage

messages = [
   SystemMessage(content="translate English to Arabic"),
]

In [49]:
user_input = input("Tell the AI something to translate from English to Arabic: ")

Tell the AI something to translate from English to Arabic: how are you?


In [50]:
messages.append(HumanMessage(content=user_input))

In [45]:
messages

[SystemMessage(content='translate English to Arabic', additional_kwargs={}, response_metadata={}),
 HumanMessage(content='how are you?', additional_kwargs={}, response_metadata={})]

In [51]:
ai_msg = chat_model.invoke(messages)

In [52]:
print(ai_msg.content)

ما رأيك؟
