# Prompt Engineering

We will practice prompt engineering using a `text-generation` model via the [ChatHuggingFace](https://python.langchain.com/docs/integrations/chat/huggingface/) API.

In [None]:
%pip install --upgrade --quiet  langchain-huggingface text-generation transformers google-search-results numexpr langchainhub sentencepiece jinja2 bitsandbytes accelerate


In [None]:
import getpass
import os

if not os.getenv("HUGGINGFACEHUB_API_TOKEN"):
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass.getpass("Enter your token: ")

Enter your token: ··········


### Quantization

Quantization techniques **reduce memory and computational costs** by representing weights and activations with lower-precision data types like 8-bit integers (int8).

This makes it possible to load models that would otherwise not fit into memory, and it can speed up inference. Transformers supports the AWQ and GPTQ quantization algorithms, as well as 8-bit and 4-bit quantization with `bitsandbytes`.
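To make the idea of lower-precision weights concrete, here is a minimal sketch of absmax int8 quantization in plain Python. It is an illustration only: the function names and the sample weights are made up for this example, and it is not the NF4 scheme that `bitsandbytes` applies in the configuration used in this notebook.

```python
def quantize_absmax_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input so we never divide by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as one byte instead of four (float32), at the price of a small rounding error.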

In [None]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

<img src="https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png" width="600">

Image Source: [nvidia.com/blog](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/)

In [None]:
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    pipeline_kwargs=dict(
        max_new_tokens=512,
        do_sample=True,  # Enable sampling (the randomness depends on temperature)
        temperature=0.35,  # Lower values make sampling behave more like greedy decoding
        repetition_penalty=1.03,
        return_full_text=False,  # Return only the newly generated text, without the prompt
    ),
    model_kwargs={"quantization_config": quantization_config},
)

chat_model = ChatHuggingFace(llm=llm)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]



### `temperature`
- In short, the lower the temperature, the more deterministic the results: as the temperature approaches 0, the most probable next token is almost always picked.
- Increasing the temperature adds randomness, which encourages more diverse or creative outputs.
- In practice, you might use a lower temperature for tasks like fact-based QA to encourage more factual and concise responses.
- For poem generation or other creative tasks, it can be beneficial to increase the temperature.
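The effect described above can be sketched with a temperature-scaled softmax in plain Python. The logits below are hypothetical values chosen for illustration, and `softmax_with_temperature` is a helper written for this example, not a function from Transformers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [logit / temperature for logit in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for temperature in (0.35, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the low temperature, almost all probability mass sits on the top token. At the high temperature, the distribution flattens, so sampling picks the other tokens more often.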



<img src="https://miro.medium.com/v2/resize:fit:1400/0*J37qonVPJvKZpzv2" width="600">

Image Source: How to sample from language models, by Ben Mann (Towards Data Science)

### `max_new_tokens`

Specifying a maximum number of new tokens caps the length of the response, which helps you prevent long or irrelevant responses and **control costs**.


In [None]:
from langchain_core.messages import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="You're a helpful assistant"),
]

In [None]:
# user_input = "What happens when an unstoppable force meets an immovable object?"
user_input = input("Tell the AI something: ")

Tell the AI something: Tell me the difference between vegetables and fruits


In [None]:
messages.append(HumanMessage(content=user_input))

In [None]:
messages

[SystemMessage(content="You're a helpful assistant", additional_kwargs={}, response_metadata={}),
 HumanMessage(content='Tell me the difference between vegetables and fruits', additional_kwargs={}, response_metadata={}),
 HumanMessage(content='Tell me the difference between vegetables and fruits', additional_kwargs={}, response_metadata={})]

In [None]:
ai_msg = chat_model.invoke(messages)

In [None]:
print(ai_msg.content)

The main difference between vegetables and fruits is their biological function in the plant. 

Fruits are the structures that develop from flowers and contain seeds. They are typically sweet, fleshy, and can be eaten as a snack or used to make various foods and drinks. Examples of fruits include apples, bananas, grapes, and oranges.

Vegetables, on the other hand, are any parts of a plant that are consumed by humans as food, but are not naturally sweet or fleshy. They can come from any part of the plant, such as leaves (spinach, kale), stems (celery, asparagus), roots (carrots, potatoes), and flowers (broccoli, cauliflower). Vegetables are typically savory and are used in cooking to add flavor, texture, and nutrition to dishes.

In summary, the key difference between fruits and vegetables is that fruits are naturally sweet and contain seeds, while vegetables are not naturally sweet and can come from any part of the plant.


### Your Turn

**Exercise 1**: Experiment with different prompts to achieve different tasks:

1. Summarization
2. Translation
3. Classification
4. Get creative and make up your own task

Instructions can be as long as you'd like. Be clear about:
- The task you want the LLM to perform
- Any constraints the LLM should respect
- The format, style, tone, and phrasing of the response you want from the LLM

Hint: you need to change the `SystemMessage(content=">>INSTRUCTIONS<<")` to include your instructions.
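One possible way to keep those three ingredients explicit is to assemble the instruction text from separate parts. The sketch below is only a suggestion: `build_instructions` is a hypothetical helper written for this example, and the translation task is an arbitrary choice.

```python
def build_instructions(task, constraints, response_format):
    """Assemble a system prompt from a task, a list of constraints, and a format."""
    lines = [f"Task: {task}"]
    if constraints:
        lines.append("Constraints:")
        lines.extend(f"- {constraint}" for constraint in constraints)
    lines.append(f"Response format: {response_format}")
    return "\n".join(lines)


instructions = build_instructions(
    task="Translate the user's text from English to French.",
    constraints=["Keep proper nouns unchanged.", "Do not add explanations."],
    response_format="Only the translated text, in a neutral tone.",
)
print(instructions)
```

The resulting string can then be passed as `SystemMessage(content=instructions)`.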

In [None]:
# YOUR CODE DOWN HERE
messages = [
    SystemMessage(content="You're a helpful assistant that summarizes text. Provide a very short output."),
]

In [None]:
# user_input = "What happens when an unstoppable force meets an immovable object?"
user_input = input("Tell the AI something: ")

Tell the AI something: SNOWBOARDING  The earliest bicycles were dangerous to ride because the front wheel was bigger than the back wheel. But in 1885, J.K. Starley invented a safer bike with the same size wheels, and bicycle racing was born. One early race was 'cyclo-cross'. Cyclists rode cross-country, although they could get off and run over difficult areas.   This early sport was similar to today's mountain biking.  Mountain biking as we know it began in California in 1976. Riders had to ride their bikes cross country, like cyclo-cross, but they couldn't get off and run. Their bikes were also different. They were smaller, had fatter tyres and were easier to ride. Who thought of this great idea? A man named Gary Fisher. Suddenly bikes weren't only for the streets.  This new type of bicycle could also go up and down mountains! Today mountain bikes are popular with millions of people. Most cities have mountain bike parks and the sport has become a major event in the X games. In 1996 it

In [None]:
messages.append(HumanMessage(content=user_input))

In [None]:
messages

[SystemMessage(content="You're a helpful assistant for summrize text, you should provide a very very short output", additional_kwargs={}, response_metadata={}),
 HumanMessage(content='you should provide a very short output', additional_kwargs={}, response_metadata={}),
 HumanMessage(content="SNOWBOARDING  The earliest bicycles were dangerous to ride because the front wheel was bigger than the back wheel. But in 1885, J.K. Starley invented a safer bike with the same size wheels, and bicycle racing was born. One early race was 'cyclo-cross'. Cyclists rode cross-country, although they could get off and run over difficult areas.   This early sport was similar to today's mountain biking.  Mountain biking as we know it began in California in 1976. Riders had to ride their bikes cross country, like cyclo-cross, but they couldn't get off and run. Their bikes were also different. They were smaller, had fatter tyres and were easier to ride. Who thought of this great idea? A man named Gary Fisher

In [None]:
ai_msg = chat_model.invoke(messages)

In [None]:
print(ai_msg.content)

Mountain biking, originally inspired by cyclo-cross, gained popularity in the 1970s with Gary Fisher's innovation of smaller, easier-to-ride bikes with fatter tires for off-road use. It has since become a widely enjoyed sport with city parks and Olympic recognition for both men and women.


In [None]:
len(user_input)

1025

In [None]:
len(ai_msg.content)

289
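The two lengths above can be combined into a rough compression ratio. This small sketch reuses the character counts printed by the cells above. Note that these are characters, whereas `max_new_tokens` counts tokens, so the two are not directly comparable.

```python
# Character counts copied from the outputs above.
input_chars = 1025
summary_chars = 289

# Fraction of the original length that the summary occupies.
ratio = summary_chars / input_chars
print(f"The summary is {ratio:.1%} of the input length.")
```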


**Exercise 2**: Use a different LLM and compare the results by eye.

Hint: you need to change `model_id` to be some `text-generation` model from [HuggingFace Hub](https://huggingface.co/models?pipeline_tag=text-generation).

In [None]:
# YOUR CODE DOWN HERE
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="ZeusLabs/Chronos-Divergence-33B",
    task="text-generation",
    pipeline_kwargs=dict(
        max_new_tokens=512,
        do_sample=True,  # Enable sampling (the randomness depends on temperature)
        temperature=0.35,  # Lower values make sampling behave more like greedy decoding
        repetition_penalty=1.03,
        return_full_text=False,  # Return only the newly generated text, without the prompt
    ),
    model_kwargs={"quantization_config": quantization_config},
)

chat_model = ChatHuggingFace(llm=llm)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/14 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 408.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 9.06 MiB is free. Process 14202 has 14.74 GiB memory in use. Of the allocated memory 14.48 GiB is allocated by PyTorch, and 133.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
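A back-of-the-envelope estimate suggests why this model does not fit. The sketch below assumes the nominal parameter counts implied by the model names (7B and 33B) and counts only the weights. It ignores activations, the KV cache, quantization constants, and any layers kept in higher precision, so real usage will be higher. `weight_memory_gib` is a helper written for this example.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Total GPU capacity reported in the error message above.
GPU_CAPACITY_GIB = 14.75

# Nominal parameter counts taken from the model names (an assumption).
models = {"7B": 7e9, "33B": 33e9}

for name, n_params in models.items():
    for bits in (16, 8, 4):
        needed = weight_memory_gib(n_params, bits)
        verdict = "fits" if needed < GPU_CAPACITY_GIB else "does not fit"
        print(f"{name} at {bits}-bit: {needed:.2f} GiB ({verdict})")
```

Under these assumptions, even at 4 bits the 33B weights need about 15.4 GiB, which is more than the GPU offers, while the 7B weights need roughly 3.3 GiB. A smaller model is the simplest fix for this exercise.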

In [None]:
from langchain_core.messages import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="You're a helpful assistant that summarizes text. Provide a very short output."),
]

In [None]:
# user_input = "What happens when an unstoppable force meets an immovable object?"
user_input = input("Tell the AI something: ")

In [None]:
messages.append(HumanMessage(content=user_input))

In [None]:
messages

In [None]:
ai_msg = chat_model.invoke(messages)

In [None]:
print(ai_msg.content)

In [None]:
len(user_input)

In [None]:
len(ai_msg.content)