# Prompt Engineering Essentials

The D3 notebooks cover the essentials of prompt engineering, beginning with inference in general and an introduction to LangChain. We then move on to prompt templates and output parsing, and finally to creating chains and connecting them in different ways to build more sophisticated constructs that make the most of LLMs.

## API vs. Locally Hosted LLM
Using an API-hosted LLM (e.g. OpenAI) is like renting a powerful car: it’s ready to go, but you can't tinker with the inner workings of the engine, and you pay each time you drive.
Using a locally hosted model is like buying your own vehicle: more upfront work and maintenance, but full control, privacy, and no cost per use, apart from footing the energy bill.

| **Aspect**                 | **API-based (e.g. OpenAI)**                          | **Local Model (e.g. Mistral, PyTorch + LangChain)**        |
|---------------------------|------------------------------------------------------|-------------------------------------------------------------|
| **Setup time**            | Minimal – just an API key                            | Requires downloading and managing the model                 |
| **Hardware requirement**  | None (runs in the cloud)                             | Requires a GPU (sometimes large memory)                     |
| **Latency**               | Network-dependent                                    | No network round trip; speed depends on your hardware       |
| **Privacy / Data control**| Data sent to external servers                      | Data stays on your infrastructure                         |
| **Cost**                  | Pay-per-use (based on tokens)                        | Free at inference (after download), but uses your compute   |
| **Scalability**           | Handled by provider                                  | You manage and scale infrastructure                         |
| **Flexibility**           | Limited to provider's models and settings            | Full control: quantization, fine-tuning, prompt handling    |
| **Offline use**           | Not possible                                       | Yes, after initial download                               |
| **Customizability**       | No access to internals                             | You can modify and extend anything                        |
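
To make the pay-per-use row concrete, here is a minimal sketch of how token-based billing adds up. The function name and both prices are placeholders chosen for illustration, not real provider rates; check your provider's current pricing before relying on any number.

```python
def estimate_api_cost(prompt_tokens: int, completion_tokens: int,
                      price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request, in the same currency as the two prices."""
    input_cost = (prompt_tokens / 1000) * price_in_per_1k
    output_cost = (completion_tokens / 1000) * price_out_per_1k
    return input_cost + output_cost


# Placeholder prices per 1,000 tokens (hypothetical, NOT real rates)
cost_per_request = estimate_api_cost(
    prompt_tokens=500, completion_tokens=250,
    price_in_per_1k=0.001, price_out_per_1k=0.002,
)
print(f"One request:     {cost_per_request:.4f}")
print(f"10,000 requests: {cost_per_request * 10_000:.2f}")
```

A local model has no such per-request term, but the hardware and energy it runs on are not free either, so the comparison depends on how many requests you expect to make.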

**Using an API (e.g. OpenAI)** <br>
 - You use the `OpenAI` or `ChatOpenAI` class from LangChain
 - LangChain sends your prompt to api.openai.com
 - You don’t manage the model, only the request and response

```
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(api_key="...", model="gpt-4")
response = llm.invoke("Summarize this legal clause...")
```
📝 You can store your API key in different ways; a common choice is to set it as an **environment variable**.
Note that LangChain looks up the environment variable **`OPENAI_API_KEY`** automatically when making a connection to OpenAI, so you do not have to pass the key explicitly.
```
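import os
from langchain_openai import ChatOpenAI

# For illustration only: in practice set the variable in your shell or via getpass
os.environ['OPENAI_API_KEY'] = 'my_API_key_123'
llm = ChatOpenAI(model="gpt-4")  # picks up OPENAI_API_KEY from the environment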
```
Alternatively, you could pass the key in directly as a string (not secure: you should NEVER hard-code your API keys), or save it in a text file on your computer and read it in:
```
from langchain_openai import OpenAI

# Read the key from a text file and strip the trailing newline
with open('C:\\Users\\Simeon\\Desktop\\openai.txt') as f:
    api_key = f.read().strip()
llm = OpenAI(openai_api_key=api_key)
```
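
The two options above (environment variable first, text file as a fallback) can be combined in one small helper. This is a sketch using only the standard library; `load_api_key` and its parameters are hypothetical names introduced here, not part of LangChain or the OpenAI SDK.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(env_var: str = "OPENAI_API_KEY",
                 fallback_file: Optional[str] = None) -> str:
    """Return the key from the environment, else from a text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if fallback_file is not None:
        path = Path(fallback_file).expanduser()
        if path.is_file():
            # Strip whitespace so a trailing newline does not end up in the key
            key = path.read_text(encoding="utf-8").strip()
            if key:
                return key
    raise RuntimeError(f"No API key found in ${env_var} or {fallback_file!r}")
```

Failing loudly when no key is found tends to be easier to debug than an authentication error from the remote service later on.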
**Using a Local Model (e.g. Mistral, LLaMA)**<br>
 - You load the model and tokenizer using Hugging Face Transformers
 - You wrap the pipeline using HuggingFacePipeline or similar in LangChain
 - You manage memory, GPU allocation, quantization, etc.
```
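from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

# Example model id; any causal LM repo id or local path works here
model_id = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a text-generation pipeline and wrap it for LangChain
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
llm = ChatHuggingFace(llm=HuggingFacePipeline(pipeline=pipe))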
```

## Basic Setup for Inference

Apart from the usual suspects of the PyTorch and Hugging Face libraries, we get our first imports of the LangChain library and some of its classes.

Since we want to show you how to work with LLMs that are not part of the closed OpenAI and Anthropic world, we will work with open, downloadable models. As it makes no sense for all of us to download the models and store them in our home directories, we've done that for you before the start of the course. You can find the path to the models below.

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_huggingface import ChatHuggingFace

If you choose to work with a model such as `meta-llama/Llama-3.3-70B-Instruct`, you will have to use quantization to fit the model into the memory of a single GPU. It is advisable to use BitsAndBytes for quantization and write a short config for it, e.g.:
```
# Define quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # Use float16 for computation
    bnb_4bit_use_double_quant=True  # Double quantization for efficiency
)
```
However, beware: a model of that size takes roughly 30 minutes to load.
In this course we do not want to wait that long, so we will use a smaller model called [Nous-Hermes-2-Mistral-7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO).
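
To see why quantization matters for a 70B model but is optional for a 7B one, a back-of-the-envelope estimate of the memory needed for the weights alone can help. The sketch below assumes nominal parameter counts (70e9 and 7e9) and ignores activations, the KV cache, and quantization overhead, so real usage will be somewhat higher. `weight_memory_gb` is a hypothetical helper, not a library function.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, n_params in [("70B model", 70e9), ("7B model", 7e9)]:
    for bits in (32, 16, 4):
        print(f"{name} at {bits:>2}-bit: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the 70B model needs roughly 140 GB for its weights in 16-bit precision but roughly 35 GB in 4-bit, which is what brings it within reach of a single large-memory GPU.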

In [2]:
# path_to_model = "/gpfs/data/fs70824/LLMs_models_datasets/models" # on VSC5
path_to_model = "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/Nous-Hermes-2-Mistral-7B-DPO" # on Leonardo

In [3]:
# Use this if you have an API key for a model hosted in the cloud:
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key (input is hidden): ")

In [4]:
#model_name = "meta-llama/Llama-3.3-70B-Instruct"
#model_name = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO"
#cache_dir = path_to_model

In [6]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    #model_name,
    path_to_model,
    #cache_dir=cache_dir,
    device_map="auto",
    #quantization_config=quantization_config, # This is what you would need for the LLama3-70B (and similar) models
    local_files_only=True,  # Prevent any re-downloads
    # trust_remote_code=True # Only needed if the model repository ships its own custom model code
)

# Verify model config
print(model.config)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 32000,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.55.2",
  "use_cache": false,
  "vocab_size": 32002
}



Now, let's try out a prompt or two:

In [7]:
prompt = "What is the capital of France? Can you give me some facts about it?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=250)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


What is the capital of France? Can you give me some facts about it?

The capital of France is Paris. Paris is the largest city in France and is located in the northern part of the country. It is situated on the Seine River and is known for its beautiful architecture, art, and culture.

Here are some interesting facts about Paris:

1. Paris is home to some of the world’s most famous landmarks, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral.

2. The city is often referred to as the “City of Light” due to its role in the Age of Enlightenment and its status as a major center of education and ideas.

3. Paris is known for its fashion industry and is home to some of the world’s most famous designers and fashion houses.

4. The city is also famous for its cuisine, with dishes such as croissants, escargot, and macarons originating in France.

5. Paris is a major transportation hub, with an extensive network of buses, trains, and subways that connect the city to the

**Not bad, however, we can do better!**

Let's try out an OpenAI model. You need your own OpenAI API key to be able to execute the following cells:

In [8]:
import os, getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key (input is hidden): ")

OpenAI API key (input is hidden):  ········


In [9]:
from openai import OpenAI

client = OpenAI()

# Simple one-shot prompt, no roles
response = client.responses.create(
    model="gpt-4o-mini",
    input="What is the capital of France? Can you give me some facts about it?",
    max_output_tokens=250
)

print(response.output_text)

The capital of France is Paris. Here are some interesting facts about the city:

1. **Historical Significance**: Paris has a rich history dating back over 2,000 years. It was originally a settlement by a Celtic tribe called the Parisii.

2. **Cultural Hub**: The city is renowned for its museums, including the Louvre, which is the world's largest and most visited art museum.

3. **Iconic Landmarks**: Paris is home to many famous landmarks, such as the Eiffel Tower, Notre-Dame Cathedral, and the Arc de Triomphe.

4. **Fashion Capital**: Known as the fashion capital of the world, Paris hosts major fashion events like Paris Fashion Week and is home to numerous haute couture houses.

5. **Gastronomy**: The city boasts a rich culinary scene, with countless bistros, cafes, and Michelin-starred restaurants. French cuisine is celebrated globally.

6. **Language**: The official language is French, and the city is known for its efforts to promote and preserve the French language.

7. **Education 

## Enter LangChain

[LangChain](https://www.langchain.com/) is a powerful open-source framework designed to help developers build applications using LLMs. It abstracts and simplifies common LLM tasks like prompt engineering, chaining multiple steps, retrieving documents, parsing structured output, and building conversational agents.

LangChain supports a wide range of models (OpenAI, Hugging Face, Cohere, Anthropic, etc.) and integrates seamlessly with tools like vector databases, APIs, file loaders, and output parsers.

---
### LangChain Building Blocks

```
+-------------------+
|   PromptTemplate  |  ← Create structured prompts
+-------------------+

         ↓
+-------------------+
|       LLM         |  ← Connect to local or remote LLM
+-------------------+

         ↓
+-------------------+
| Output Parsers    |  ← Extract structured results (e.g. JSON)
+-------------------+

         ↓
+-------------------+
| Chains / Agents   |  ← Combine steps into flows
+-------------------+

         ↓
+-------------------+
| Memory / Tools    |  ← Use search, APIs, databases, etc.
+-------------------+
```
---

### Core LLM/ChatModel Methods in LangChain
How to do inference with LangChain:

| **Method**       | **Purpose**                                               | **Input Type**         | **Output Type**         |
|------------------|------------------------------------------------------------|-------------------------|--------------------------|
| `invoke()`        | Handles a **single input**, returns one response           | `str` or `Message(s)`   | `str` / `AIMessage`      |
| `generate()`      | Handles a **batch of inputs**, returns multiple outputs     | `list[str]`             | `LLMResult`              |
| `batch()`         | Batched input, returns a flat list of outputs              | `list[str]`             | `list[str]` / Messages   |
| `stream()`        | Streams the output as tokens are generated                 | `str` / `Message(s)`    | Generator (streamed text)|
| `ainvoke()`       | Async version of `invoke()`                                | `str` / `Message(s)`    | Awaitable result         |
| `agenerate()`     | Async version of `generate()`                              | `list[str]`             | Awaitable result         |

Before we use one of these methods, we need to create a pipeline and apply the LangChain wrapper to it, which gives us an object that LangChain can call with `.invoke()`, `.generate()`, etc. If we use a remotely hosted LLM, which we access through an API, we do not need the pipeline.
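
To illustrate how the methods in the table relate to one another, here is a toy wrapper around an arbitrary text function. It is a sketch of the idea only: `ToyLLM`, `ToyResult`, and `ToyGeneration` are hypothetical names, and this is not how LangChain implements its classes.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class ToyGeneration:
    text: str


@dataclass
class ToyResult:
    generations: List[List[ToyGeneration]]  # one inner list per prompt


class ToyLLM:
    """Hypothetical wrapper giving any text function a uniform interface."""

    def __init__(self, generate_fn: Callable[[str], str]):
        self._fn = generate_fn

    def invoke(self, prompt: str) -> str:
        # Single input, single output
        return self._fn(prompt)

    def batch(self, prompts: List[str]) -> List[str]:
        # Flat list of outputs, one per prompt
        return [self.invoke(p) for p in prompts]

    def generate(self, prompts: List[str]) -> ToyResult:
        # Nested result object, leaving room for metadata
        return ToyResult([[ToyGeneration(self.invoke(p))] for p in prompts])

    def stream(self, prompt: str) -> Iterator[str]:
        # Yield word by word to mimic token streaming
        words = self.invoke(prompt).split(" ")
        for i, word in enumerate(words):
            yield word if i == len(words) - 1 else word + " "

    async def ainvoke(self, prompt: str) -> str:
        # Run the blocking call in a worker thread
        return await asyncio.to_thread(self.invoke, prompt)


# A stand-in "model" that just echoes its prompt
toy_llm = ToyLLM(lambda prompt: f"echo: {prompt}")
print(toy_llm.invoke("hi"))
print(toy_llm.batch(["a", "b"]))
print(toy_llm.generate(["a"]).generations[0][0].text)
print("".join(toy_llm.stream("one two three")))
# In a notebook you would call: await toy_llm.ainvoke("hi")
```

The point of the uniform interface is that the calling code stays the same whether the backend is a local pipeline or a remote API.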

---

This is how you could use a cloud-hosted API:

In [18]:
from langchain_openai import ChatOpenAI

# Create an LLM that talks to OpenAI (reads OPENAI_API_KEY from env)
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.7,     # like HF's temperature
    max_tokens=150       # analogous to HF's max_new_tokens
)

# Use it just like your HuggingFacePipeline example:
print(llm.invoke("Here is a fun fact about Mars:").content)

Mars is home to the largest volcano in the solar system, Olympus Mons. This massive shield volcano stands about 13.6 miles (22 kilometers) high, which is nearly three times the height of Mount Everest! Its base is roughly the size of the state of Arizona, making it an impressive feature on the Martian landscape.


For a locally hosted model, use a Hugging Face text-generation pipeline (here called `text_pipeline`):

In [16]:
# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=150,
    device_map="auto"
)

# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)

Device set to use cuda:0


#### llm.invoke()

In [17]:
print(llm.invoke('Here is a fun fact about Mars:'))

Here is a fun fact about Mars: it has the largest volcano in the entire solar system, called Olympus Mons. Olympus Mons is so big it could easily swallow Mount Everest whole and still have room left over for the Grand Canyon.

To give you an idea of just how vast Olympus Mons is, let’s compare it with Earth’s own Mount Everest, which is the highest mountain on our planet. Mount Everest is 29,029 feet (8,848 meters) tall from its base to its peak. By comparison, Olympus Mons is 13.6 miles (22 kilometers) tall and 374 miles (602 kilometers) in diameter.




#### llm.batch()

In [47]:
results = llm.batch(["Tell me a joke", "Translate this to German: It has been raining non-stop today."])
print(results)

['Tell me a joke.\n\nThe answer to the universe.\n\nWhat is it? 42.\n\nWhat did the fish say when it hit the wall? Dam.\n\nI don’t get it.\n\nIt’s a play on words. Fish live in water, so when the fish hits the wall, it would have been underwater, so it would have said “dam” as in dam holding water back, not “dam” as in curse word.\n\nOh, I see it now. That’s pretty funny.\n\nYeah, it’s a classic. I think it’s from that book The Hitchhiker’s Guide to the Galaxy.\n\nOh, the answer to the', 'Translate this to German: It has been raining non-stop today.\n\nNon-stop: kontinuierlich, ununterbrochen\n\nRaining: Regnet\n\nToday: heute\n\nThe English sentence "It has been raining non-stop today" translates to "Es hat heute kontinuierlich/ununterbrochen gefreggt" in German. This phrase is often used to describe the continuous rainfall throughout the day. The word "kontinuierlich" means that something happens continuously without interruption, while "ununterbrochen" has a similar meaning, indicat

Let's make that more structured and also format the output nicely:

In [48]:
prompts = [
    "Tell me a joke",
    "Translate this to German: 'It has been raining non-stop today.'"
]

# Run batch generation
results = llm.batch(prompts)

# Nicely format the output
for i, (prompt, response) in enumerate(zip(prompts, results), 1):
    print(f"\nPrompt {i}: {prompt}")
    print(f"Response:\n{response}")


Prompt 1: Tell me a joke
Response:
Tell me a joke.

I'll try. Why don't scientists trust atoms?

Because they make up everything.

What's the difference between a snowman and a snowwoman?

DEPTH

Why don't scientists trust atoms?

Because they make up everything.

Why did the programmer quit his job?

Because he didn't get arrays.

Why did the chicken cross the road?

To get to the other side.

Why did the scarecrow win an award?

Because he was outstanding in his field.

Why do we tell actors to "break a leg"?

Because every play has one.

What

Prompt 2: Translate this to German: 'It has been raining non-stop today.'
Response:
Translate this to German: 'It has been raining non-stop today.'

Translation: 'Es hat heute nicht aufhören regnen.'

Explanation:

In this sentence, 'It has been raining non-stop today' is translated to 'Es hat heute nicht aufhören regnen.' Here, 'es' means it, 'hat' means has, 'heute' means today, 'nicht aufhören' means non-stop or without stopping, and'regne

#### llm.generate()

`llm.generate()` returns an `LLMResult` object, which carries much more than the flat list returned by `llm.batch()`. Use it when you want additional metadata, such as the token count (where the backend reports it).

In [49]:
results = llm.generate(["Where should my customer go for a luxurious Safari?",
                     "What are your top three suggestions for backpacking destinations?"])
print(results)

generations=[[Generation(text='Where should my customer go for a luxurious Safari?\n\nThere are many destinations for a luxurious safari, but some of the most popular include Africa, Asia, and South America. For a truly luxurious experience, consider destinations such as Botswana, Tanzania, Kenya, South Africa, India, Bhutan, Brazil, and Ecuador.\n\nWhat are some of the top luxury lodges and camps for a safari?\n\nThere are many luxurious lodges and camps to choose from when planning a safari. Some of the top options include Singita Grumeti in Tanzania, &Beyond Sandibe Okavango Delta in Botswana, Royal Malewane in South Africa, Amanwana on M')], [Generation(text="What are your top three suggestions for backpacking destinations?\n1. The Inca Trail to Machu Picchu, Peru: This world-renowned 4-day hike is an incredible journey through the Peruvian Andes, passing through cloud forests, high mountain passes, and Incan ruins. The highlight is undoubtedly arriving at the awe-inspiring ruins o

We need to prettify the output:

In [50]:
for gen in results.generations:
    print(gen[0].text)

Where should my customer go for a luxurious Safari?

There are many destinations for a luxurious safari, but some of the most popular include Africa, Asia, and South America. For a truly luxurious experience, consider destinations such as Botswana, Tanzania, Kenya, South Africa, India, Bhutan, Brazil, and Ecuador.

What are some of the top luxury lodges and camps for a safari?

There are many luxurious lodges and camps to choose from when planning a safari. Some of the top options include Singita Grumeti in Tanzania, &Beyond Sandibe Okavango Delta in Botswana, Royal Malewane in South Africa, Amanwana on M
What are your top three suggestions for backpacking destinations?
1. The Inca Trail to Machu Picchu, Peru: This world-renowned 4-day hike is an incredible journey through the Peruvian Andes, passing through cloud forests, high mountain passes, and Incan ruins. The highlight is undoubtedly arriving at the awe-inspiring ruins of Machu Picchu at sunrise.

2. Torres del Paine, Patagonia, 

#### llm.stream()

In [51]:
for chunk in llm.stream("Tell me a story about a cat."):
    print(chunk, end="")



Once upon a time, there was a gray kitten named Whiskers. Whiskers lived in a cozy little house with her loving family, who gave her all the food, love, and attention that she could ever ask for. Whiskers was a happy, playful, and curious kitten, always exploring her surroundings and getting into all sorts of mischief.

One sunny afternoon, Whiskers stumbled upon a hidden door in the basement of her house. Intrigued, she couldn’t resist the urge to investigate what lay beyond. With a quick meow, she pushed the door open and stepped through.

To her delight, she found herself in a magical world filled with

### Model Types in LangChain

LangChain supports two main types of language models:

| Model Type     | Description                                                  | Examples                              |
|----------------|--------------------------------------------------------------|----------------------------------------|
| **LLMs**       | Models that take a plain text string as input and return generated text | GPT-2, Falcon, LLaMA, Mistral (raw)    |
| **Chat Models**| Models that work with structured chat messages (system, user, assistant) | GPT-4, Claude, LLaMA-Instruct, Mistral-Instruct|

---

**Why the distinction?**

Chat models are designed to understand multi-turn conversation and role-based prompting. Their input format includes a structured message history, making them ideal for:
- Instruction following
- Contextual reasoning
- Assistant-like behavior

LLMs, on the other hand, expect a single flat prompt string. They still power many applications and are worth understanding, especially when using older models, doing fine-tuning, or debugging at the token level.

---

**Do Chat Models matter more now?**

Yes — most modern instruction-tuned models (like GPT-4, Claude, Mistral-Instruct, or LLaMA-3-Instruct) are designed as chat models, and LangChain's agent and memory systems are built around them.

However, LLMs are still important:
- Some models only support the LLM interface
- LLMs are useful in batch processing and structured generation
- Understanding their behavior helps you build better prompts

---

In [52]:
# Plain LLM (single prompt string)
llm = HuggingFacePipeline(pipeline=text_pipeline)
print("--- LLM-style output ---\n")
print(llm.invoke("Explain LangChain in one sentence."))

# Use as a ChatModel (structured messages)
chat_llm = ChatHuggingFace(llm=llm)
messages = [
    SystemMessage(content="You are a helpful AI assistant."),
    HumanMessage(content="Explain LangChain in one sentence.")
]
print("\n--- Chat-style output ---\n")
print(chat_llm.invoke(messages).content)

--- LLM-style output ---

Explain LangChain in one sentence.

LangChain is a library for building and managing natural language processing chains that can be used to solve a wide range of tasks.

## What is LangChain used for?

LangChain is a library designed for building and managing natural language processing chains. It is used to create language models that can be used for a variety of tasks such as question answering, natural language generation, and text classification.

## What are the features of LangChain?

LangChain has several key features that make it a powerful tool for building natural language processing chains. Some of the key features include:

1. Chainable modules: LangChain allows you to easily create chains of modules, which can be used to solve complex tasks.

--- Chat-style output ---

<s><|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Explain LangChain in one sentence.<|im_end|>
<|im_start|>assistant
LangChain is an open-source Pytho

The raw output you're seeing includes special chat formatting tokens (like `<|im_start|>` and `<|im_end|>`). These ChatML-style markers are part of the chat template this model was trained with, and they distinguish between the roles in a chat.

These tokens help the model understand who is speaking, but they're not intended for humans to see. <br>
<br>
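
To make the layout of these markers explicit, the following sketch rebuilds the prompt format seen in the raw output by hand. `to_chatml` is a hypothetical helper written for illustration; it leaves out the leading `<s>` token, which the tokenizer adds, and it assumes the ChatML layout shown above rather than reading the model's actual template.

```python
from typing import List, Tuple


def to_chatml(messages: List[Tuple[str, str]],
              add_generation_prompt: bool = True) -> str:
    """Render (role, content) pairs in the ChatML layout seen in the raw output."""
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n"
             for role, content in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chatml_prompt = to_chatml([
    ("system", "You are a helpful AI assistant."),
    ("user", "Explain LangChain in one sentence."),
])
print(chatml_prompt)
```

In practice you would not build this string yourself: the tokenizer's `apply_chat_template()` method produces it from the template stored with the model.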
So, to prettify the output, we will define a function:

In [53]:
def clean_output(raw: str) -> str:
    # If the assistant marker is in the output, split on it and take the last part
    if "<|im_start|>assistant" in raw:
        return raw.split("<|im_start|>assistant")[-1].replace("<|im_end|>", "").strip()
    return raw.strip()

raw_output = chat_llm.invoke(messages).content
cleaned = clean_output(raw_output)
print("Cleaned Response:\n",cleaned)

Cleaned Response:
 LangChain is an open-source Python library designed for building applications that interact with natural language processing models for various tasks such as text generation, summarization, and question answering.


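`HuggingFacePipeline` also accepts `model_kwargs`, which you can pass earlier on. Note that `clean_up_tokenization_spaces` is a tokenizer decoding option that concerns whitespace around punctuation; it does not remove the chat markers, as the dolphin example further down shows, so `clean_output` is still needed:
```
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
```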

**Confused?** <br>
You are not alone. LangChain has separate wrappers for LLMs and Chat Models, but in recent versions the `invoke()` method of an LLM wrapper such as `HuggingFacePipeline` also accepts a list of chat messages (`SystemMessage`, `HumanMessage`, etc.), even though the class implements the plain LLM interface.

So yes, you can now do:
```
llm = HuggingFacePipeline(pipeline=text_pipeline)
response = llm.invoke([
    SystemMessage(content="You are a helpful legal assistant."),
    HumanMessage(content="Simplify this clause: ...")
])
```
Even though you're not explicitly using `ChatHuggingFace`, LangChain accepts the messages. Be aware, though, that it flattens them into a single prompt string with role prefixes (`System: ...`, `Human: ...`) rather than applying the model's chat template. Wrap the LLM in `ChatHuggingFace` if you want the template applied.
<br>
<br>
A remotely hosted chat model accessed through an API takes the same kind of message list:
```
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
chat = ChatOpenAI(openai_api_key=api_key)
result = chat.invoke([HumanMessage(content="Can you tell me a fact about Dolphins?")])
```

In [54]:
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

In [55]:
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
chat_llm = ChatHuggingFace(llm=llm)

In [56]:
result = chat_llm.invoke([HumanMessage(content="Can you tell me a fact about dolphins?")])

In [57]:
result

AIMessage(content='<s><|im_start|>user\nCan you tell me a fact about dolphins?<|im_end|>\n<|im_start|>assistant\nDolphins are highly social animals and are known for their complex communication and problem-solving skills. They have a unique form of echolocation, using sounds to navigate and find prey, and they can recognize themselves in a mirror, showing self-awareness.', additional_kwargs={}, response_metadata={}, id='run--efad0cd6-2de5-429d-a22d-2373d02d7087-0')

In [58]:
print(clean_output(result.content))

Dolphins are highly social animals and are known for their complex communication and problem-solving skills. They have a unique form of echolocation, using sounds to navigate and find prey, and they can recognize themselves in a mirror, showing self-awareness.


In [59]:
result = chat_llm.invoke([SystemMessage(content='You are a grumpy 5-year old child who only wants to get new toys and not answer questions'),
               HumanMessage(content='Can you tell me a fact about dolphins?')])

In [60]:
print(clean_output(result.content))

No, I don't want to, I just want new toys!


In [61]:
result = chat_llm.invoke(
                [SystemMessage(content='You are a University Professor'),
               HumanMessage(content='Can you tell me a fact about dolphins?')]
                    )

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [62]:
print(clean_output(result.content))

One interesting fact about dolphins is that they have the ability to sleep with only one half of their brain at a time, allowing them to rest while still maintaining consciousness and staying near the surface of the water to breathe.


In [63]:
result = chat_llm.generate([
    [
        SystemMessage(content='You are a University Professor.'),
        HumanMessage(content='Can you tell me a fact about dolphins?')
    ],
    [
        SystemMessage(content='You are a University Professor.'),
        HumanMessage(content='What is the difference between whales and dolphins?')
    ]
])

In [64]:
for i, generation in enumerate(result.generations, 1):
    raw = generation[0].text
    cleaned = clean_output(raw)
    print(f"\nPrompt {i}:\n{cleaned}")


Prompt 1:
Dolphins are highly intelligent and social animals. They have complex communication systems using a variety of vocalizations and body movements, and they have been observed exhibiting altruistic behaviors, such as helping injured or trapped individuals.

Prompt 2:
Whales and dolphins are both marine mammals that belong to the group Cetaceans, meaning they share common characteristics as well. However, there are some key differences between them.

1. Taxonomy: Whales belong to the order Cetacea, while dolphins belong to the infraorder Cetacea Delphinoidea. In simpler terms, whales and dolphins are both in the same family of marine mammals, but they have distinct classifications.

2. Physical characteristics: Whales are generally larger and heavier than dolphins. Most whales can grow to be over 20 feet long, while the largest dolphin, the killer whale,


In [65]:
# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto"
)

# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"clean_up_tokenization_spaces": True})
chat_llm = ChatHuggingFace(llm=llm)

Device set to use cuda:0


In [66]:
eos_token_id = tokenizer.eos_token_id
result = chat_llm.generate([
    [
        SystemMessage(content='You are a University Professor.'),
        HumanMessage(content='Can you tell me a fact about dolphins?')
    ],
    [
        SystemMessage(content='You are a University Professor.'),
        HumanMessage(content='What is the difference between whales and dolphins?')
    ]
], eos_token_id=eos_token_id)


In [67]:
for i, generation in enumerate(result.generations, 1):
    raw = generation[0].text
    cleaned = clean_output(raw)
    print(f"\nPrompt {i}:\n{cleaned}")



Prompt 1:
Dolphins are highly intelligent and social mammals that communicate with each other using a series of clicks, whistles, and body movements. They have the ability to recognize themselves in mirrors, which is a sign of self-awareness.

Prompt 2:
Whales and dolphins are both marine mammals, but they belong to two different families within the order Cetacea. Whales belong to the family Cetidae, while dolphins belong to the family Delphinidae. Here are some differences between whales and dolphins:

1. Size: Whales are generally larger than dolphins. The blue whale, the largest animal on Earth, can grow up to 100 feet long and weigh over 200 tons. Dolphins, on the other hand, can grow up to 30 feet long and weigh up to a few tons.

2. Shape: Whales have a more streamlined body shape than dolphins. Their snouts are more elongated, and they have a dorsal fin that varies in size and shape depending on the species. Dolphins have a more robust body with a shorter, rounded snout, and a 

The next cell connects Hugging Face Transformers to LangChain’s prompt management:
- Load model into Hugging Face pipeline.
- Wrap it in LangChain (HuggingFacePipeline).
- Build structured prompts (system + user).
- Format prompt with user input.
- Send it to the model and get a response.

<br>
Feel free to experiment with different system and human prompts!

In [68]:
# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto"
)

# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)

# Define the system and user messages
system_message_1 = SystemMessagePromptTemplate.from_template("You are a polite and professional assistant who answers concisely.")
system_message_2 = SystemMessagePromptTemplate.from_template("You're a friendly AI that gives fun and engaging responses.")
system_message_3 = SystemMessagePromptTemplate.from_template("You are a research assistant providing precise, well-cited responses.")

user_message = HumanMessagePromptTemplate.from_template("{question}")

# Create a prompt template
chat_prompt = ChatPromptTemplate.from_messages([system_message_3, user_message])

# Format the prompt
formatted_prompt = chat_prompt.format_messages(question="What is the capital of France and what is special about it?")

# Run inference
response = llm.invoke(formatted_prompt)

print(response)

Device set to use cuda:0


System: You are a research assistant providing precise, well-cited responses.
Human: What is the capital of France and what is special about it?

The capital of France is Paris. Paris is well-known for its rich history, culture, and iconic landmarks. Some notable features include the Eiffel Tower, which is an iron lattice tower built in 1889 and is one of the most recognizable structures in the world. Another famous landmark is the Louvre Museum, the world's largest art museum and a historic monument, housing famous works of art such as the Mona Lisa and the Venus de Milo. Additionally, Paris is known for its architecture, fashion, cuisine, and romantic atmosphere, earning it the nickname "The City of Light" and "The City of Love."

References:
1. "Paris". Encyclopædia Britannica. Encyclopædia Britannica, Inc. 2021.
2. "The Eiffel Tower". Encyclopædia Britannica. Encyclopædia Britannica, Inc. 2021.
3. "The Louvre". Encyclopædia Britannica. Encyclopædia Britannica, Inc. 2021.


### Extra Parameters and Args

Here we pass extra generation parameters to the pipeline to control how the model responds.
<br>
Some of the most important parameters are:


| **Parameter**        | **Purpose**                                                                 | **Range / Default**       | **Analogy / Effect**                        |
|----------------------|------------------------------------------------------------------------------|----------------------------|---------------------------------------------|
| `do_sample`          | Enables random sampling instead of greedy or beam-based decoding             | `True` / `False`           | 🎲 Adds randomness to output                |
| `temperature`        | Controls randomness of token selection                                       | `> 0`, typically `0.7–1.0` | 🌡️ Higher = more creative / chaotic         |
| `top_p`              | Nucleus sampling: sample only from the smallest set of tokens whose cumulative probability reaches `p` | `0.0–1.0`, default `1.0`   | 🧠 Focuses on most probable words           |
| `num_beams`          | Beam search: explore multiple continuations and pick the best                | `1+`, default `1`          | 🔍 Smart guessing with multiple options     |
| `repetition_penalty` | Penalizes repeated tokens to reduce redundancy                               | `≥ 1.0`, e.g. `1.2`        | ♻️ Discourages repetition                   |
| `max_new_tokens`     | Limits the number of tokens the model can generate **per prompt**            | Integer, e.g. `300`        | ✂️ Controls response length                 |
| `eos_token_id`       | Token ID that forces the model to stop when encountered                      | Integer                    | 🛑 Defines end of output (if supported)     |

#### Detailed Explanation of Generation Parameters

##### `do_sample=True`
- If `False`: the model always picks the **most likely next token** (deterministic, greedy decoding).
- If `True`: the model will **randomly sample** from a probability distribution over tokens (non-deterministic).
- Required if you want `temperature` or `top_p` to have any effect.

✅ Enables creativity and variation  
❌ Disables reproducibility (unless random seed is fixed)
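
The difference between the two modes can be sketched in a few lines of plain Python. This is a toy illustration, not the Transformers implementation: the token names and probabilities are made up, and `pick_greedy` / `pick_sampled` are hypothetical helper names.

```python
import random

# Toy next-token distribution (hypothetical values, for illustration only)
next_token_probs = {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.1}

def pick_greedy(probs):
    """Mimics do_sample=False: always return the most likely token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Mimics do_sample=True: draw a token at random, weighted by probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # a fixed seed makes the sampled run reproducible
greedy_picks = [pick_greedy(next_token_probs) for _ in range(5)]
sampled_picks = [pick_sampled(next_token_probs, rng) for _ in range(5)]

print(greedy_picks)   # the same token every time
print(sampled_picks)  # may vary from draw to draw
```

Re-creating `random.Random(0)` and sampling again yields the same sequence, which is what fixing the random seed buys you.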

---

##### `temperature=1.0`
- Controls the **randomness** or "creativity" of the output.
- Lower values → more predictable (safe), higher values → more diverse (risky).
- Affects how "flat" or "peaky" the probability distribution is during sampling.

**Typical values:**
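- close to `0` (e.g. `0.1`) → almost deterministic. Transformers requires a strictly positive `temperature`; for fully deterministic output, set `do_sample=False` instead.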
- `0.7–1.0` → balanced
- `>1.5` → chaotic, often incoherent
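
To see what "flat" and "peaky" mean in practice, here is a minimal sketch of temperature scaling: the logits are divided by the temperature before the softmax. The logits are made-up values and `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low temperature almost all probability mass sits on the first token; at the high temperature the three probabilities move close to each other, so sampling becomes nearly uniform.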

---

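##### `top_p=0.9` *(a.k.a. nucleus sampling)*
- The model samples only from the **smallest set of most likely tokens whose cumulative probability reaches `p`**; all other tokens are excluded.
- Unlike `top_k`, which always keeps a fixed number of tokens, the number of candidates adapts to the shape of the probability distribution.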
- Often used in combination with `temperature`.

✅ Focuses output on high-probability words  
❌ Too low → model may miss useful words
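
A minimal sketch of the nucleus idea, assuming a made-up distribution over four tokens. `top_p_filter` is a hypothetical helper; the real implementation in Transformers operates on logits and differs in detail, but the set of surviving tokens follows the same principle.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p,
    then renormalise the survivors so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token probabilities
probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}

print(top_p_filter(probs, 0.9))  # drops only the unlikely tail token
print(top_p_filter(probs, 0.5))  # keeps just the single most likely token
```

With `top_p=0.9` the unlikely tail is cut off; with `top_p=0.5` only one candidate remains, which illustrates why a value that is too low can remove useful words.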

---

##### `num_beams=4` *(beam search)*
- Explores **multiple candidate completions** and picks the best one based on likelihood.
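- Slower, but usually finds a higher-likelihood sequence than greedy decoding (when `do_sample=False`).
- Typically used without sampling. In Transformers, combining `num_beams > 1` with `do_sample=True` is allowed and switches to beam-search sampling, which is less common.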

**Typical values:**
- `1` = greedy decoding  
- `3–5` = moderate beam search  
- `>10` = can become very slow
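
The following toy example sketches why keeping several candidates can beat greedy decoding. It assumes a hypothetical two-step "model" (`TOY_MODEL`) whose next-token probabilities depend only on the previous token; `beam_search` is an illustrative helper, not the Transformers implementation.

```python
import math

# Hypothetical toy "model": next-token probabilities given the last token
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
}

def beam_search(num_beams, steps=2):
    """Keep the num_beams best partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, p in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]  # best sequence and its log-probability

greedy_tokens, greedy_score = beam_search(num_beams=1)
beam_tokens, beam_score = beam_search(num_beams=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

With one beam the search commits to `A` because it looks best at the first step, and ends with a total probability of 0.3. With two beams the `B` branch survives long enough to reveal the better overall sequence `B z` (probability 0.36).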

---

##### `repetition_penalty=1.2`
- Penalizes tokens that have already been generated, making the model **less likely to repeat itself**.
- Higher values reduce repetition but may hurt fluency.

✅ Helps avoid "looping" or redundant outputs  
📝 Use with long-form or factual responses
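
As a rough sketch of the mechanism: before the next token is chosen, the scores of tokens that have already appeared are made less attractive. The rule below (divide positive scores by the penalty, multiply negative ones) is modelled on the approach commonly used for this parameter; the logits and the helper name `apply_repetition_penalty` are assumptions for illustration.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token ID that was already generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink, negative scores become more negative
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

logits = [3.0, 1.5, -0.5, 0.2]  # hypothetical scores for token IDs 0..3
already_generated = [0, 2, 0]   # token IDs produced so far

adjusted = apply_repetition_penalty(logits, already_generated, penalty=1.2)
print(adjusted)
```

Tokens 0 and 2 end up with lower scores than before, while tokens 1 and 3 are untouched. A penalty of `1.0` leaves all scores unchanged.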

---

##### `max_new_tokens=300`
- Sets the **maximum number of tokens** the model is allowed to generate in the response.
- Does not include input prompt tokens.

✅ Controls output length  
✅ Prevents runaway generation or memory issues  
❌ Too low → the response is cut off mid-sentence

---

##### `eos_token_id`
- Tells the model to **stop generation** once it emits this token ID.
- Useful for enforcing custom stopping conditions.

Optional — most models use their own `<eos>` or `</s>` tokens by default.
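
The interplay of `max_new_tokens` and `eos_token_id` can be sketched as a simple generation loop with two stopping conditions. This is a toy illustration: `generate` and `make_scripted_model` are hypothetical helpers, and the "model" just replays a fixed list of made-up token IDs.

```python
def generate(next_token_fn, prompt_ids, max_new_tokens, eos_token_id=None):
    """Toy generation loop: stop at max_new_tokens or when the EOS token appears."""
    sequence = list(prompt_ids)
    new_tokens = []
    while len(new_tokens) < max_new_tokens:
        token_id = next_token_fn(sequence)
        sequence.append(token_id)
        new_tokens.append(token_id)
        if eos_token_id is not None and token_id == eos_token_id:
            break
    return new_tokens  # prompt tokens are not counted or returned

def make_scripted_model(script):
    """Hypothetical stand-in for a model: replays a fixed list of token IDs."""
    remaining = iter(script)
    return lambda sequence: next(remaining)

EOS_ID = 0
script = [11, 12, 13, EOS_ID, 14]
prompt = [101, 102]

stopped_by_eos = generate(make_scripted_model(script), prompt, max_new_tokens=10, eos_token_id=EOS_ID)
stopped_by_limit = generate(make_scripted_model(script), prompt, max_new_tokens=2, eos_token_id=EOS_ID)
print(stopped_by_eos)    # ends with the EOS token
print(stopped_by_limit)  # cut off after two new tokens
```

In the first call the loop ends as soon as the EOS token is produced; in the second, the token limit is reached first and the output is truncated before the model "finished".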

---

Feel free to experiment with these parameters!

In [61]:
# Create a text generation pipeline
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    do_sample=True,
    temperature=5.0,  # far above the typical 0.7–1.0 range: expect chaotic output
    top_p=0.9,
    # repetition_penalty=1.2,  # optional: discourage repeated tokens
    max_new_tokens=300,
    device_map="auto"
)

# Wrap in LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=text_pipeline)
chat_llm = ChatHuggingFace(llm=llm)

Device set to use cuda:0


In [62]:
result = chat_llm.invoke([HumanMessage(content='Can you tell me a fact about Earth?')])

In [63]:
print(clean_output(result.content))

One fascinating but somewhat bouncy fun actual to know about Dangle up is what it actually takes taking portion it move along in deep amelioration (it moves up a stagger you would amaze me a huge to tell that in more everyday terms but, you go on read more more information). Not one side the ameljoros' movements, you need one piece fact to now is Dangle its really cool but there just be even out its not out in this question yet to that side its even though that'sn  really hot enough  there just about D


### Caching

If you send exactly the same request repeatedly, you can cache the results. **Note: only do this if the prompt is exactly the same and returning a previously generated reply is acceptable.**

In [19]:
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Register a process-wide in-memory cache for LLM calls
set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
print(clean_output(chat_llm.invoke("Tell me a fact about Mars").content))

Mars is the fourth planet from the Sun and is often referred to as the "Red Planet" due to its reddish appearance, which is caused by iron oxide (rust) on its surface. It is the second-smallest planet in the Solar System, and it has the longest day of any planet in the Solar System, with one day lasting 24 Earth hours. Mars has the highest mountain in the Solar System, Olympus Mons, which is three times the height of Mount Everest. It also has the deepest canyon, Valles Marineris, which is almost 5 miles deep and 2,500 miles long.


In [20]:
# You will notice this reply is instant!
print(clean_output(chat_llm.invoke("Tell me a fact about Mars").content))

Mars is the fourth planet from the Sun and is often referred to as the "Red Planet" due to its reddish appearance, which is caused by iron oxide (rust) on its surface. It is the second-smallest planet in the Solar System, and it has the longest day of any planet in the Solar System, with one day lasting 24 Earth hours. Mars has the highest mountain in the Solar System, Olympus Mons, which is three times the height of Mount Everest. It also has the deepest canyon, Valles Marineris, which is almost 5 miles deep and 2,500 miles long.
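
Conceptually, the behaviour seen above is an exact-match lookup: the first call computes and stores the reply, and identical later calls return the stored copy. The sketch below illustrates that idea with a plain dictionary. `ExactMatchCache` and `slow_fake_llm` are hypothetical names, and LangChain's own cache may key on more than the prompt text (for example the model settings).

```python
import time

class ExactMatchCache:
    """Stores one reply per prompt string; only identical prompts hit the cache."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, prompt, compute_fn):
        if prompt not in self._store:
            self._store[prompt] = compute_fn(prompt)  # cache miss: call the model
        return self._store[prompt]                    # cache hit: reuse the reply

calls = []

def slow_fake_llm(prompt):
    """Hypothetical stand-in for a slow model call."""
    calls.append(prompt)
    time.sleep(0.05)
    return f"Answer to: {prompt}"

cache = ExactMatchCache()
first = cache.get_or_compute("Tell me a fact about Mars", slow_fake_llm)
second = cache.get_or_compute("Tell me a fact about Mars", slow_fake_llm)
print(first == second, len(calls))
```

The model function runs only once for the repeated prompt. Even a small change to the prompt text produces a new cache entry, and a cached reply is returned as-is even when sampling would otherwise have produced a different answer.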
