# Generative LLMs for RAG

This notebook provides a hands-on exploration of generative Large Language Models (LLMs). We'll start with cloud-based API models, and then explore how to run smaller models locally. Finally, we will explore different prompting strategies you can use to get the best results.

## Model Access & Generating an API Key

For this lab, we recommend the BLABLADOR API provided by the German Supercomputing centre in Jülich. Follow [🔗 their instructions](https://sdlaml.pages.jsc.fz-juelich.de/ai/guides/blablador_api_access/) to generate an API key. Use your university login to gain access. You can also interact via their [🔗 Web UI](https://helmholtz-blablador.fz-juelich.de).

They host several models, and you can specify the following alias names in API calls:
- `alias-code` - Qwen2.5-Coder-7B-Instruct, a model specially trained for code.
- `alias-embeddings` - GritLM-7B, a model specially made for embeddings.
- `alias-fast` - Ministral-8B-Instruct-2410, a model for high throughput (we will use this one in this lab).
- `alias-large` - DeepSeek-R1-Distill-Llama-70B, a very large model; the most accurate, but also the slowest.
- `alias-reasoning` - QwQ-32B, a model specially trained for reasoning. This model may not be available around the clock.


## Environment Setup

Install the required libraries by running the following cell (comment out the line if your environment already has these dependencies installed):


In [5]:
!pip install openai torch transformers python-dotenv



In [2]:
from dotenv import dotenv_values

# Load the .env file
config = dotenv_values(".env")

API_URL = "https://api.helmholtz-blablador.fz-juelich.de/v1/"
API_KEY = config.get("API_KEY")
API_MODEL = "alias-fast" # Best for fast dev runs
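
A missing key only surfaces later as an authentication error, so it can help to check for it right after loading the configuration. The following is a minimal sketch; the helper name `resolve_api_key` and the fallback to environment variables are assumptions for illustration, not part of the lab setup.

```python
import os

def resolve_api_key(config, env=None, name="API_KEY"):
    """Return the key from the .env mapping, falling back to the process environment."""
    env = os.environ if env is None else env
    key = config.get(name) or env.get(name)
    if not key:
        raise RuntimeError(f"{name} not found; add a line like {name}=... to your .env file")
    return key

# Demonstration with in-memory mappings instead of a real .env file
print(resolve_api_key({"API_KEY": "demo-key"}, env={}))
```

In the notebook, this could be called as `resolve_api_key(config)` with the `config` mapping returned by `dotenv_values`.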

## Calling LLMs via an API

For the first step, we interact with an LLM via a hosted API. Most providers (including BLABLADOR) expose an OpenAI-compatible API, so you can use the `openai` Python package to query models from non-OpenAI providers as well. First, we create an API client object:

In [3]:
from openai import OpenAI

client = OpenAI(
    api_key=API_KEY,
    base_url=API_URL
)

Modern LLMs are usually fine-tuned to follow instructions, and their prompting follows a turn-based pattern: each message in a conversation with the LLM has a role associated with it (`user`, the user submitting the query; `system`, general instructions at the beginning of the conversation; or `assistant`, the reply of the LLM). For our first call, we are going to use the [chat completions API endpoint](https://platform.openai.com/docs/api-reference/chat/create), which you can call in Python via `client.chat.completions.create`.

It takes two mandatory arguments: the model (we are going to use `alias-fast`, saved in the `API_MODEL` variable), and the messages, formatted as a list of `{"role": <role>, "content": <message text>}` dictionaries.

In [4]:
# Make a simple completion request
response = client.chat.completions.create(
    model=API_MODEL,
    messages=[
        {"role": "user", "content": "Explain the concept of retrieval augmented generation in 2 sentences."}
    ]
)

print(response.choices[0].message.content)

 Retrieval Augmented Generation is a method in natural language processing that combines two processes: retrieval and generation. The retrieval step involves pulling relevant information from a large dataset based on the input query, while the generation step uses this information along with a language model to produce a coherent and contextually accurate response. This approach is particularly useful for tasks like question answering, chatbots, and content creation.
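
Since every call in this notebook follows the same pattern, a small wrapper can cut down on repetition. The sketch below is one possible shape: the helper name `ask` is hypothetical, and the stand-in client exists only so the example runs offline; it mimics just the attributes accessed above.

```python
from types import SimpleNamespace

def ask(client, messages, model="alias-fast", **params):
    """Send a chat request and return only the text of the first choice."""
    response = client.chat.completions.create(model=model, messages=messages, **params)
    return response.choices[0].message.content.strip()

# Offline stand-in for the real client (assumption: only these attributes are needed)
def _fake_create(model, messages, **params):
    reply = " echo: " + messages[-1]["content"]
    return SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content=reply))])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

print(ask(fake_client, [{"role": "user", "content": "Hello"}], temperature=0.7))
```

With the real `client` object, the same function would be called as `ask(client, messages, model=API_MODEL)`. The `.strip()` removes the leading space that the hosted model tends to emit.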


## Working with Temperature

The API accepts many more parameters that influence the generation process. The first is `temperature`, which controls the randomness of the output: `temperature = 0` yields a (near-)deterministic result, while `temperature = 1` yields a more varied and often more creative, but less stable, result. A balanced value such as `temperature = 0.7` is a common choice.

Try out how different temperature values affect the response generation for the same prompt:

In [16]:
for temp in [0, 0.7, 1.0]:
    response = client.chat.completions.create(
        model=API_MODEL,
        temperature=temp,
        messages=[
            {"role": "user", "content": "Write a single short advertising tagline for a retrieval augmented generation system."}
        ]
    )
    print(temp, response.choices[0].message.content)

0  "Unlock the Power of Retrieval-Augmented Generation: Your Ideas, Our Expertise."
0.7  "Unlock Your Words, Unleash Your Ideas."
1.0  "Got Your Thoughts? Let AI Retrieve and Generate!"
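
To see where this behavior comes from, recall that a model assigns a score (logit) to every candidate token and samples from the softmax of those scores divided by the temperature. The sketch below uses made-up logits for three tokens; how a particular server treats `temperature = 0` is an assumption here (it is modeled as greedy selection).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, scaled by temperature."""
    if temperature == 0:
        # Limit case: all probability mass on the highest-scoring token (greedy decoding)
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in [0, 0.7, 1.0]:
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, which is why repeated runs at low temperature tend to agree, while higher temperatures flatten the distribution and make less likely tokens more probable.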


## Chat Format: System, User, and Assistant Messages

Instruction-tuned LLMs generally use different message roles, which we can leverage via the API:

- **system**: sets behavior instructions, is usually passed as the first message in a chat to "set the tone"
- **user**: represents human input messages, the prompt as you would enter it in an LLM
- **assistant**: represents AI responses, the generated text received by the LLM (usually from an earlier conversation turn)

Try out different system prompts where you instruct the model to take on different personas to see how its response to the actual prompt differs.

In [7]:
system_prompt = "You are a technical expert who explains concepts briefly in only a few sentences."
user_prompt = "What is retrieval-augmented generation?"

response = client.chat.completions.create(
    model=API_MODEL,
    temperature=0.7,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
)

print(response.choices[0].message.content)

 Retrieval-augmented generation (RAG) is a technique where a model retrieves relevant documents or information from an external knowledge base or index before generating a response. This helps to ensure that the generated content is accurate, up-to-date, and based on factual information. The retrieved data is then used to enhance the model's understanding and improve the quality of the output.


In [17]:
system_prompt = "You are a preschool teacher explaining concepts to children. The focus is on being short and accessible."
user_prompt = "What is retrieval-augmented generation?"

response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.7,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
)

print(response.choices[0].message.content)

 Retrieval-augmented generation is a type of AI that uses both a model and a database. When you ask a question, the AI first searches the database to find useful information. Then, it combines this information with its own knowledge to generate a more accurate and helpful answer. This way, it can provide more reliable information than a model that relies solely on its training data.


## Multi-turn Conversations

LLMs can maintain context across multiple messages. Via the API, you can simulate conversations by appending responses and your own follow-up message to the message list. Generated responses are inserted using the `assistant` role, and the follow-up user message is then posed with the `user` role.

Ask the LLM a question, append its response, and then ask a follow-up question to see how it maintains context across the whole conversation.


In [28]:
messages = [
    {"role": "system", "content": "You are a helpful AI assistant providing assistance to users by guiding their learning process. You answer in short, to-the-point complete answers."},
    {"role": "user", "content": "I want to learn about natural language processing and information retrieval. How should I start?"}
]

In [20]:
# First turn
response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.7,
    messages=messages,
)

assistant_response = response.choices[0].message.content
print("Assistant:", assistant_response)

Assistant:  To learn about natural language processing (NLP) and information retrieval, follow these steps:
1. **Mathematics and Programming Basics**:
   - Brush up on your knowledge of linear algebra, calculus, and probability.
   - Learn Python, which is widely used in NLP and information retrieval.

2. **Introduction to NLP**:
   - Read introductory NLP books like "Speech and Language Processing" by Jurafsky and Martin.
   - Take online courses on platforms like Coursera, edX, or Udacity. For example, "Natural Language Processing with Classification and Vectors" on Coursera.

3. **Learn Key NLP Concepts**:
   - Tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, and machine translation.
   - Study NLP libraries such as NLTK, spaCy, and Stanford NLP.

4. **Information Retrieval**:
   - Learn the basics of information retrieval, including vector space models, TF-IDF, and BM25.
   - Study IR systems and techniques such as search engines, recommendation s

In [11]:
# Add the response to the conversation history; notice the 'assistant'  role
messages.append({"role": "assistant", "content": assistant_response})
# Second turn - add your follow-up message; notice the 'user'  role
messages.append({"role": "user", "content": "What Python libraries should I use for that?"})

In [12]:
response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.7,
    messages=messages,
)
print(response.choices[0].message.content)

 Here are some key Python libraries for Natural Language Processing (NLP) and Information Retrieval (IR):
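
The API itself is stateless: the model only "remembers" earlier turns because the full message list is sent again with every request. A minimal sketch of a helper that keeps this history tidy follows; the name `add_turn` and the role check are illustrative assumptions.

```python
VALID_ROLES = {"system", "user", "assistant"}

def add_turn(messages, role, content):
    """Append one message to the conversation history and return the list."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    messages.append({"role": role, "content": content})
    return messages

history = [{"role": "system", "content": "You answer briefly."}]
add_turn(history, "user", "How should I start learning NLP?")
add_turn(history, "assistant", "(model reply goes here)")
add_turn(history, "user", "What Python libraries should I use for that?")

print([m["role"] for m in history])
```

Because the whole list is resent each time, long conversations consume more of the model's context window with every turn.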




## Running Local LLMs

Besides calling models via an API, you might want to run models locally, for example if you lack internet access, want to experiment without worrying about rate limits or API costs, or work with privacy-sensitive data. In the following, we load models from the [Hugging Face](https://huggingface.co) model repository and run them with the `transformers` library. As a rule of thumb, models below 400M parameters usually fit in 16 GB of RAM, with acceptable inference times on CPU.

We'll use a small model that can run on a CPU: [`HuggingFaceTB/SmolLM2-360M-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct). We need both the model itself and a tokenizer to convert the message blocks into token IDs the model can consume.
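
A quick back-of-the-envelope estimate shows why a model of this size is comfortable on a laptop: the weights need roughly the number of parameters times the bytes per parameter. The sketch below assumes about 360 million parameters, as the model name suggests, and counts the weights only; activations and the generation cache come on top.

```python
def weight_memory_gb(num_params, bytes_per_param=4):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(360e6))     # stored as 32-bit floats
print(weight_memory_gb(360e6, 2))  # stored as 16-bit floats
```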

In [34]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-360M-Instruct"
device = "cpu" # "cuda" for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Use the tokenizer to turn the message list into model inputs. The `tokenizer.apply_chat_template` function produces the desired result directly. Important: specify the arguments `tokenize=True` and `return_tensors="pt"` to get PyTorch tensors the model can consume. Also remember to move the tokenized data to the same device as the model using the `.to(device)` method, as was done for the model itself.

In [35]:
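# Define the messages as before
messages = messages # Or replace with a new conversation
# Tokenize the messages; add_generation_prompt=True appends the assistant header so the model starts its reply
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)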
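
Conceptually, a chat template just flattens the message list into one string with special marker tokens around each turn. The sketch below imitates the ChatML-style markers that appear in the decoded output of this model; it is a simplified illustration, not the actual template shipped with the tokenizer, and `render_chatml` is a hypothetical name.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Simplified illustration of a ChatML-style chat template."""
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        text += "<|im_start|>assistant\n"
    return text

demo = [
    {"role": "system", "content": "You answer briefly."},
    {"role": "user", "content": "What is RAG?"},
]
print(render_chatml(demo))
```

The `add_generation_prompt` flag mirrors the argument of the same name in `apply_chat_template`. Other model families use different markers, which is why the template is stored with each tokenizer.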

Now you can call the `model.generate()` function with the tokenized inputs to produce generated text. However, the model returns token IDs, not their string representation. You can convert the model output back into human-readable format using the `tokenizer.decode()` function.

In [36]:
# Run the model inference pass
outputs = model.generate(inputs, max_new_tokens=500)
# Decode the output message
print(tokenizer.decode(outputs[0]))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|im_start|>system
You are a helpful AI assistant providing assistance to users by guiding their learning process. You answer in short, to-the-point complete answers.<|im_end|>
<|im_start|>user
I want to learn about natural language processing and information retrieval. How should I start?<|im_end|>
<|im_start|>assistant
To start learning about natural language processing and information retrieval, begin by understanding the basics of natural language processing (NLP) and information retrieval. NLP is the field of computer science that deals with the interaction between computers and human language. It involves the development of algorithms and models to process and understand human language.

Start by learning about the fundamental concepts of NLP, such as tokenization, stemming, and part-of-speech tagging. These concepts will serve as the foundation for more advanced topics. Next, explore information retrieval, which involves the process of finding relevant information from large amo
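
The decoded output above repeats the system and user turns because `generate()` returns the prompt tokens followed by the newly generated ones. To show only the reply, slice off the prompt before decoding. The sketch below demonstrates the slicing on plain lists with made-up token IDs; the commented lines indicate how this might look with the notebook's tensors, assuming `inputs` is a tensor of shape `(1, prompt_length)`.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the echoed prompt from a generated sequence of token IDs."""
    return output_ids[prompt_length:]

prompt_ids = [1, 42, 7, 99]           # made-up IDs standing in for the tokenized chat
generated = prompt_ids + [15, 23, 2]  # prompt followed by the continuation
print(new_tokens_only(generated, len(prompt_ids)))

# With the notebook's objects, the same idea would read (assumption, see above):
#   reply_ids = outputs[0][inputs.shape[-1]:]
#   print(tokenizer.decode(reply_ids, skip_special_tokens=True))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` additionally hides markers such as `<|im_end|>`.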

## Prompt Engineering

To get the most out of any LLM, you need to design prompts carefully. Many prompting techniques exist, and they are a hot topic in current research. In this notebook, we will explore the following prompt engineering techniques:

- Prompting-Induced Planning / Chain-of-Thought
- Self-critique
- Self-consistency
- Structured output prompting

The following guides explore prompt engineering techniques in more detail:

- [🔗 Prompt Engineering Guide](https://drive.google.com/file/d/1AbaBYbEa_EbPelsT40-vj64L-2IwUJHy/view)
- [🔗 GPT4.1 Prompting Guide](https://cookbook.openai.com/examples/gpt4-1_prompting_guide)


For prompt engineering, we will consider the RAG use case with the example query and retrieved snippets below. The goal is to provide a short but informative answer to the user.

In [44]:
query = "why has olive oil increased in price"
snippets = [
    ("Global olive oil prices have surged due to poor harvests in Spain and Italy, caused by extreme drought and heatwaves linked to climate change.", 0.983),
    ("A sharp decline in olive oil production, especially in major producing countries like Spain, has led to reduced supply and increased prices worldwide.", 0.942),
    ("Increased production costs, including higher labor and transportation expenses, have contributed to the rise in olive oil prices.", 0.891),
    ("Rising inflation and currency fluctuations have made imported goods, including olive oil, more expensive in several countries.", 0.847),
    ("Retailers report that consumer demand for premium oils has increased, indirectly pushing prices higher across all olive oil grades.", 0.812),
    ("Climate change has disrupted agricultural cycles in the Mediterranean region, impacting many crops including olives.", 0.768),
    ("Sunflower oil shortages due to the war in Ukraine have led to a shift in demand toward olive oil, tightening global supply.", 0.703),
    ("The Mediterranean diet, which emphasizes olive oil consumption, continues to gain popularity for its health benefits.", 0.623),
    ("Spain is one of the world's largest producers of olive oil, exporting millions of liters each year.", 0.578),
    ("Olive trees can live for hundreds of years and are cultivated mostly in Mediterranean climates.", 0.519)
]

### Vanilla RAG

The most basic RAG prompting technique directly feeds retrieved context into the model alongside the query. This approach serves as a baseline: it simply instructs the model to answer the question based on the given context, without extra guidance.

Implement the vanilla RAG prompt using a simple format with a system message and a user message containing the context and question.

In [40]:
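system_prompt = "You are a helpful assistant answering a question based on retrieved context information."
# Build the bullet list outside the f-string: backslashes inside f-string expressions fail before Python 3.12
context = "\n".join([f"- {text}" for text, score in snippets])
user_prompt = f"""
Context:
{context}

Question:
{query}

Answer:"""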

response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.7,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
)

print(response.choices[0].message.content)

 Olive oil prices have surged due to several factors, including:
1. **Poor Harvests**: Extreme drought and heatwaves linked to climate change have led to poor olive oil harvests in major producing countries like Spain and Italy.
2. **Reduced Supply**: The decrease in olive oil production has resulted in a global supply shortage, driving up prices.
3. **Increased Production Costs**: Higher labor and transportation expenses have contributed to the rising costs of olive oil.
4. **Inflation and Currency Fluctuations**: Rising inflation and changes in currency values have made imported goods, including olive oil, more expensive.
5. **Shift in Demand**: A growing consumer demand for premium olive oils has indirectly pushed up prices across all grades.
6. **Impact of Climate Change**: Climate change has disrupted agricultural cycles, affecting olive production in the Mediterranean region.
7. **Sunflower Oil Shortages**: The war in Ukraine has led to sunflower oil shortages, shifting demand to
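
The baseline above passes every snippet to the model, including weakly related ones, and the answer dutifully lists them all. Since each snippet carries a retrieval score, one simple refinement is to keep only the best few. The following is a sketch of that idea; the function name `build_rag_prompt`, the default cutoffs, and the shortened demo snippets are assumptions for illustration.

```python
def build_rag_prompt(query, snippets, top_k=5, min_score=0.0):
    """Format the best-scoring snippets and the question into one user prompt."""
    ranked = sorted(snippets, key=lambda pair: pair[1], reverse=True)
    kept = [text for text, score in ranked if score >= min_score][:top_k]
    context = "\n".join(f"- {text}" for text in kept)
    return f"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer:"

# Shortened demo data in the same (text, score) format as `snippets`
demo_snippets = [
    ("Poor harvests in Spain and Italy reduced supply.", 0.98),
    ("Olive trees can live for hundreds of years.", 0.52),
    ("Production costs have risen.", 0.89),
]
print(build_rag_prompt("why has olive oil increased in price", demo_snippets, top_k=2, min_score=0.6))
```

Suitable values for `top_k` and `min_score` depend on the retriever and have to be tuned per application.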

### Chain-of-Thought (CoT) Prompting

Chain-of-thought prompting encourages the model to reason step by step before producing a final answer. This helps improve factual accuracy and clarity, especially when the retrieved context is complex or contains multiple causal links.

Implement CoT prompting by asking the model to first identify relevant information, then reason through the answer, and finally summarize the conclusion.

In [43]:
system_prompt = "You are a smart assistant answering a question using the provided context. First, identify which parts of the context are most relevant to the question. Then, briefly reason through the answer using only the relevant information. Finally, provide a short and informative answer."

user_prompt = user_prompt # Same as before

assistant_prompt = """
Let's think step by step:
"""

response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.7,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_prompt}
    ],
)

print(response.choices[0].message.content)

 1. The main reason is a sharp decline in olive oil production in major producing countries like Spain and Italy due to extreme drought and heatwaves linked to climate change.
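
When the reasoning steps should stay hidden from the end user, one option is to have the model introduce its conclusion with a fixed marker and split the reply there. The sketch below assumes the system prompt is extended with an instruction such as "begin your conclusion with 'Final answer:'"; both the marker and the helper name `split_reasoning` are hypothetical.

```python
def split_reasoning(reply, marker="Final answer:"):
    """Split a step-by-step reply into (reasoning, answer) at the last marker."""
    head, sep, tail = reply.rpartition(marker)
    if not sep:
        # Marker missing: treat the whole reply as the answer
        return "", reply.strip()
    return head.strip(), tail.strip()

sample = (
    "1. Snippets 1 and 2 report poor harvests.\n"
    "2. Lower supply raises prices.\n"
    "Final answer: Poor harvests cut supply."
)
reasoning, answer = split_reasoning(sample)
print(answer)
```

Models do not always follow formatting instructions, so the fallback branch returns the full reply instead of failing.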



### Critique-and-Revise Prompting

To improve factuality or clarity, we can instruct the model to critique an initial answer before revising it. This multi-step prompting encourages reflection and refinement, and can lead to higher-quality outputs.

Generate an initial answer, prompt the model to critique it, and then ask for a revised version based on that critique.

In [41]:
###### Self-critique

initial_response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.7,
    messages=messages # Reuses the conversation from the multi-turn section
)

print("Initial response:", initial_response.choices[0].message.content)
messages.append({"role": "assistant", "content": initial_response.choices[0].message.content})
messages.append({"role": "user", "content": "Critique this initial response, providing suggestions on how to make it better."})

critique = client.chat.completions.create(
    model="alias-fast",
    temperature=0.7,
    messages=messages
)

print("\n\n Critique:", critique.choices[0].message.content)
messages.append({"role": "assistant", "content": critique.choices[0].message.content})
messages.append({"role": "user", "content": "Answer the question using initial response, but take into account the suggestions."})

final_response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.7,
    messages=messages
)

print("\n\n Final response:", final_response.choices[0].message.content)

Initial response:  Start by learning key concepts and tools in natural language processing and information retrieval. Here's a suggested pathway:
1. **Understand the Basics:**
   - Learn about text processing, tokenization, stemming, and lemmatization.
   - Study basic concepts like bag-of-words, TF-IDF, and word embeddings.

2. **Choose a Programming Language:**
   - Popular choices for NLP and IR include Python, R, and Java.

3. **Learn Essential Libraries and Tools:**
   - Python: NLTK, SpaCy, Gensim, Scikit-learn, and Transformers (by Hugging Face).
   - R: tm, text, quanteda, and text2vec.

4. **Online Courses and Tutorials:**
   - Coursera: "Natural Language Processing Specialization" by deeplearning.ai
   - edX: "Natural Language Processing with Classification and Vector Spaces" by IIT Bombay
   - Udacity: "Natural Language Processing Nanodegree Foundation"
   - YouTube: FreeCodeCamp, 3Blue1Brown, and TensorFlow tutorials

5. **Read Relevant Research Papers:**
   - Start with in
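
The three-step flow can also be wrapped in a function that is independent of the client, which makes it easy to test and reuse. In this sketch, `ask` is any callable that maps a message list to a reply string; the function name and the offline stand-in `fake_ask` are assumptions for illustration.

```python
def critique_and_revise(ask, messages):
    """Run answer -> critique -> revision and return all three texts."""
    history = list(messages)  # copy so the caller's list is left untouched
    initial = ask(history)
    history.append({"role": "assistant", "content": initial})
    history.append({"role": "user", "content": "Critique this initial response, providing suggestions on how to make it better."})
    critique = ask(history)
    history.append({"role": "assistant", "content": critique})
    history.append({"role": "user", "content": "Answer the question using the initial response, but take the suggestions into account."})
    final = ask(history)
    return initial, critique, final

# Offline stand-in for the model: reports how many messages it was shown
def fake_ask(history):
    return f"reply after {len(history)} messages"

start = [{"role": "user", "content": "why has olive oil increased in price"}]
print(critique_and_revise(fake_ask, start))
```

With the real client, `ask` could be `lambda msgs: client.chat.completions.create(model=API_MODEL, temperature=0.7, messages=msgs).choices[0].message.content`.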

### Self-Consistency

As we have seen before, a model may produce different outputs across repeated runs when using higher temperatures. We can use this to our advantage by prompting the model for self-consistency: first, we generate several candidate answers at high temperature (leveraging the more creative, diverse output), and then provide these to the model to distill into a final answer at low temperature (yielding consistent output).

Implement self-consistency by generating 3 candidates at high temperature, and combining them in a follow-up inference pass. Design special prompts for both phases.

In [45]:
num_candidates = 3
candidate_responses = []
for i in range(num_candidates):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": assistant_prompt}
    ]
    response_candidate = client.chat.completions.create(
        model="alias-fast",
        temperature=0.9,
        messages=messages
    )
    candidate_responses.append(response_candidate.choices[0].message.content)

print(candidate_responses)

[' Olive oil prices have skyrocketed due to\n1. **Supply reduction**: Poor harvests in major producing countries like Spain and Italy, linked to extreme drought and heatwaves caused by climate change, have significantly reduced olive oil production.\n2. **Increased demand**: The high demand for premium oils and shifting demand from sunflower oil (due to the war in Ukraine) have driven prices higher.\n3. **Increased production costs**: Higher labor and transportation costs further elevate the price of olive oil.\n\nSo, the main reasons for the increased price of olive oil are reduced supply due to poor harvests and increased production costs.', ' Olive oil prices have increased due to:\n1. **Poor Harvests**: Drought and heatwaves in Spain and Italy, caused by climate change, have significantly reduced olive oil production.\n2. **Increased Production Costs**: Higher labor and transportation expenses have contributed to the rising costs of production.\n3. **Global Economic Factors**: Infl

In [21]:
response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.1,
    messages=[
        {"role": "system", "content": "You are a smart assistant answering a question based on provided candidate answers. Provide a short and informative definitive answer with only the most relevant information."},
        {"role": "user", "content": "\n\n\n".join([f"Suggestion {i}: {text}" for i, text in enumerate(candidate_responses)])},
    ]
)
print(response.choices[0].message.content)

 Olive oil prices have increased due to poor harvests in major producing countries like Spain and Italy, caused by extreme drought and heatwaves linked to climate change. This reduction in supply, combined with increased production costs, currency fluctuations, and higher consumer demand for premium oils, has driven up prices worldwide.
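
The approach above lets the model merge the candidates. A different variant, which works when answers are short enough to compare directly, is a plain majority vote over the candidates. The sketch below shows that voting variant, not the distillation used in this notebook; the normalization rules and sample answers are made up for illustration.

```python
from collections import Counter

def majority_answer(candidates):
    """Return the most frequent answer after light normalization, plus its vote share."""
    normalized = [c.strip().lower().rstrip(".") for c in candidates]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

samples = ["Poor harvests.", "poor harvests", "Higher demand"]
print(majority_answer(samples))
```

The vote share doubles as a rough confidence signal: a low share indicates that the candidates disagree.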


### Structured Output

For downstream applications, it helps to have the model return structured output, e.g., as JSON. This lets you elicit responses that decompose the answer into different aspects. For example, you could refine the critique-and-revise technique from before by having the model return the critique and the final response as separate fields of a JSON dictionary, or separate reasoning and response so that only the response is displayed.

For RAG, you can also use structured output to have the model attribute its statements to passages. Try to make it return its answer as a list of sentences, where each sentence is represented as a JSON dictionary containing the text and the indices of the passages the information is taken from.
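
Models do not always return pure JSON; a reply may open with an introductory sentence even when instructed otherwise. A tolerant parser that extracts and validates the list is therefore useful downstream. This is a minimal sketch assuming the `text`/`ref` fields described above; the helper name `extract_json_list` and the sample reply are made up for illustration.

```python
import json

def extract_json_list(reply):
    """Parse the span from the first '[' to the last ']' as a JSON list of passages."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON list found in reply")
    passages = json.loads(reply[start:end + 1])
    for p in passages:
        # Each passage must be a dictionary with a string "text" and a list "ref"
        if not isinstance(p, dict) or not isinstance(p.get("text"), str) or not isinstance(p.get("ref"), list):
            raise ValueError(f"malformed passage: {p!r}")
    return passages

sample_reply = 'Here is the answer:\n[{"text": "Poor harvests cut supply.", "ref": [0, 1]}]'
for passage in extract_json_list(sample_reply):
    print(passage["ref"], passage["text"])
```

If parsing fails, a common fallback is to retry the request or to ask the model to repair its own output.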

In [22]:
system_prompt = """
You are an assistant that answers questions based on retrieved context. Generate an informative but succinct answer to the question.
Organize your answer as a list of JSON dictionaries, one for each answer passage, in the following format:

[
    {
    "text": "", # The passage text
    "ref": [1,2] # The indices of sources used in that text
    },
    ... # More passages
]

Answer only with valid JSON.
"""
context = "\n".join([f"({i}) {text}" for i, (text, score) in enumerate(snippets)])
user_prompt = f"""
Context:
{context}

Question:
{query}

Answer:"""


In [23]:
response = client.chat.completions.create(
    model="alias-fast",
    temperature=0.1,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
)
print(response.choices[0].message.content)

 Olive oil prices have increased due to several factors:
[
    {
    "text": "Global olive oil prices have surged due to poor harvests in Spain and Italy, caused by extreme drought and heatwaves linked to climate change.",
    "ref": [0]
    },
    {
    "text": "A sharp decline in olive oil production, especially in major producing countries like Spain, has led to reduced supply and increased prices worldwide.",
    "ref": [1]
    },
    {
    "text": "Increased production costs, including higher labor and transportation expenses, have contributed to the rise in olive oil prices.",
    "ref": [2]
    },
    {
    "text": "Rising inflation and currency fluctuations have made imported goods, including olive oil, more expensive in several countries.",
    "ref": [3]
    },
    {
    "text": "Retailers report that consumer demand for premium oils has increased, indirectly pushing prices higher across all olive oil grades.",
    "ref": [4]
    },
    {
    "text": "Climate change has disru