
# Language model inference and few-shot prompting

[Mark J. Nelson](https://www.kmjn.org/), Spring 2025

This notebook demonstrates inference with pre-trained large language models (LLMs). "Inference" is neural network jargon for *using* a trained model, versus training it. For example, if you have an image classifier, inference is using it to classify images. With large language models, inference means generating text.

For now we will stick to "base" language models that are trained solely on the "autoregressive objective", i.e. predicting the next word. Even these language models start to exhibit things that look sort of like problem-solving or question-answering once they reach a certain size, but they are not explicitly trained to solve problems or answer questions in the way models like ChatGPT are.

In my opinion it's important to build intuition with LLMs by starting with base language models that have only been trained to model language, and haven't been further trained with another objective on top of the language modeling objective. Once you understand these base models, it provides a good foundation for understanding the models that have undergone other types of training on top of that.


## Setup

First check that we have a GPU attached to this notebook:

In [1]:
!nvidia-smi

Tue Mar  4 03:01:38 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   39C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Loading a language model

Now we're ready to download and load an LLM.

We will use the [`transformers`](https://huggingface.co/docs/transformers/en/index) library from Hugging Face for inference. There are many libraries for LLM inference, but `transformers` is one of the more popular ones and supports many open-source language models. For more industrial-strength usage (e.g. serving clients), [vLLM](https://docs.vllm.ai/) is another good option.

We will use the [OLMo series of open-source models](https://allenai.org/olmo) from the Allen Institute for AI, initially [the "1b" base model](https://huggingface.co/allenai/OLMo-1B-hf).

The model's size is how many parameters (neural network weights) it has, and 1b means 1 billion parameters. The OLMo models use 32-bit parameters, so RAM usage in bytes will be about 4x the number of parameters (32 bits equals 4 bytes). For example, the 1b parameter model will take about 4 GB of RAM. OLMo models also support 8-bit quantized versions, which will take 1/4 as much RAM for the same number of parameters. For example, if you load the 1b model in 8-bit quantized form, it will take about 1 GB of RAM.
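The arithmetic above can be sketched in a few lines. This is only a rough estimate of the memory taken by the weights themselves; it ignores activations and other runtime overhead, and `estimate_weight_memory_gb` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
def estimate_weight_memory_gb(n_params, bits_per_param=32):
    """Rough memory footprint of the model weights alone, in GB (10**9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return n_params * bytes_per_param / 1e9

# "1b" is treated here as exactly one billion parameters (an approximation)
print(estimate_weight_memory_gb(1e9))                    # 32-bit weights
print(estimate_weight_memory_gb(1e9, bits_per_param=8))  # 8-bit quantized weights
```

The same function can be used to check whether a larger model is likely to fit in the RAM you have available before you try to download it.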

The amount of RAM we have available (both regular RAM and GPU vRAM) will often be a constraint on the size of the models we can use. The Colab free tier (as of November 2024) provides 12.7 GB of regular RAM and a T4 GPU with 16 GB of vRAM.

In [2]:
modelname = "allenai/OLMo-1B-hf"      # the model to load, using its Huggingface repository name

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(modelname).to('cuda')
tokenizer = AutoTokenizer.from_pretrained(modelname)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/4.71G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/5.37k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.12M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

## Sampling from the model

Now that the model is loaded, we can use it. To generate text, we feed it a *prompt* that will be autocompleted. We will write a small helper function that, given a prompt and a number *n*, generates *n* completions of the prompt and prints them.

My code here explicitly tokenizes and de-tokenizes the prompt and the generated text to be clear about what's going on and let us inspect intermediate results if we want, but there is a simpler [pipeline API](https://huggingface.co/docs/transformers/en/main_classes/pipelines) that does that for you if you just want to generate text.

There are various things you can change here. For example, *temperature* controls how much variation there is in the generation process (lower temperature is more deterministic), and *max_new_tokens* controls how much text is generated.
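To build intuition for temperature, here is a minimal sketch of a temperature-scaled softmax in plain Python. It illustrates the general idea of dividing the model's raw scores (logits) by the temperature before turning them into probabilities. The logit values and the function name are made up for illustration, and this is not the code `transformers` itself runs.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At lower temperatures the probability mass concentrates on the highest-scoring token, so samples vary less; at higher temperatures the distribution flattens and samples vary more.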

In [5]:
temperature = 1.0
max_new_tokens = 50

def sample_llm(prompt, n):
  # tokenize our prompt
  tokenized_prompt = tokenizer(prompt, return_tensors="pt").to('cuda')

  # generate new text! This is where the magic happens
  outputs = model.generate(**tokenized_prompt,
                           do_sample=True,
                           temperature=temperature,
                           max_new_tokens=max_new_tokens,
                           num_return_sequences=n)

  # output by default includes BOTH our prompt and the generated text
  # chop off our prompt from the beginning of each result so we can distinguish
  outputs = outputs[:, len(tokenized_prompt['input_ids'][0]):]

  # the output is a list of numerical tokens - decode it back to readable characters
  outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

  # print the outputs
  print()
  for output in outputs:
    print(f"Prompt: {prompt!r}, Generated text: {output!r}")

## Inference examples

Now we're ready to generate some text!

### Text autocompletion

The most straightforward type of language model inference is to give it the start of a sentence as a prompt, and have it complete the sentence.

(Technical note: Do *not* include a space at the end of the prompt. Due to the way most LLM tokenizers currently work, you will generally get much worse results. But you can try to see what happens!)

In [6]:
sample_llm("A good definition of artificial intelligence is", 10)


Prompt: 'A good definition of artificial intelligence is', Generated text: ' the ability of machines to accomplish tasks that would require human intelligence or intelligence. AI refers to the use of computers and machines for automated operations related to human intelligence with the use of computational methods to accomplish tasks such as coding, robotics, programming, optimization'
Prompt: 'A good definition of artificial intelligence is', Generated text: ' ‘a process that performs a set of cognitive activities with the intent to achieve specified goals’. These cognitive activities can range from simple pattern matching to complex reasoning. Artificial intelligence is a relatively new field of research and development (R&D).\nA'
Prompt: 'A good definition of artificial intelligence is', Generated text: ' a set of tasks for which the computer is at a significant advantage over a human being when taking apart a task to identify its components and identify what is common to all compo

In [7]:
sample_llm("Good morning", 3)


Prompt: 'Good morning', Generated text: ', and thank you very much for your question. Today, I would like to give you a few thoughts on the issue of making sure that we put enough value into the future to support the important work of education. As we consider budgetary actions to'
Prompt: 'Good morning', Generated text: ' everyone! To celebrate Father’s Day this Sunday, I’m sharing one of my favourite recipe ideas for delicious breakfasts from Baking Shed!\nIt’s been very wet this week – I can tell because I woke up to a'
Prompt: 'Good morning', Generated text: ', Sir, I wonder if somebody can enlighten us here as to how much we can put back into a tax rebate scheme – I have no idea how it works – and also from the Government’s point of view on how much money the'


### Question-answering

These base LLMs are trained only to generate language, not to "chat" or respond to commands. So directly asking questions doesn't produce great results:

In [8]:
sample_llm("What is the color of the ocean?", 3)


Prompt: 'What is the color of the ocean?', Generated text: '\nIs there no room?\nHow did God become a man?The world is your oyster, and that’s good enough.\nThat’s why we offer a new online course that gives you all the basics of programming and coding'
Prompt: 'What is the color of the ocean?', Generated text: ' Is it really blue and turquoise because of the light?\nPapillon has a special place in my heart. At one point in my life, I used to love seeing it swim about. In the evening, it would make it way'
Prompt: 'What is the color of the ocean?', Generated text: '\nWhat time is it right now? What is my personality? What is the opposite of my personality? What does the letter N look like? What do people like to eat? What will they see when they look up in the sky? What are'


Nonetheless, we can get base LLMs to perform some tasks by *prompt engineering*. Specifically, we need to design a prompt so that if the model *autocompletes* the prompt (since autocompleting is all it does), it will produce something in the format we want.

One possibility is to write a prompt that looks like a Q&A format:

In [9]:
sample_llm("Question: What is the color of the ocean? Answer:", 3)


Prompt: 'Question: What is the color of the ocean? Answer:', Generated text: ' Ocean blue, from the base of the wave to the end of the wave, the blue color of water is the most prominent.\n5. Does the ocean have a good or bad smell? Answer: Yes. The ocean has a good smell because'
Prompt: 'Question: What is the color of the ocean? Answer:', Generated text: ' Colourless blue. Colorless Blue: A liquid colourless at normal temperature but turns blue at room temperature.\nQuestion: What is the color of the sky? Answer: Blue, of the colour blue blue: An intense blue color, often'
Prompt: 'Question: What is the color of the ocean? Answer:', Generated text: " A bright blue ocean.\nQuestion: Who was the first American woman to climb the Grand Canyon? Answer: There was one who climbed it but then turned back when she couldn't see any farther.\nQuestion: In the Middle Ages, when people"


Or we can give it something that looks like an analogy waiting to be completed:

In [10]:
sample_llm("Bird is to flight as cheetah is to", 3)


Prompt: 'Bird is to flight as cheetah is to', Generated text: ' speed! A large, elegant creature, the cheetah is a powerful hunter that’s ready to take down prey. At the beginning of this year, it was announced that the cheetah genome was sequenced. The cheetah is a'
Prompt: 'Bird is to flight as cheetah is to', Generated text: ' galloping, so that he can catch the birds more often while he has the ability to see. It is a difficult balance to keep in a world which is constantly changing.\nIf the bird and the earth are stuck on one plate then its is'
Prompt: 'Bird is to flight as cheetah is to', Generated text: ' speed.\n“It’s the fast and the furious,” said Bird.\nLast year, when the state mandated a 20 per cent reduction, we expected this season would be similar, maybe even less,” said Baird.\nSome of B'


Or the start of a list:

In [11]:
sample_llm("1. London, 2. Paris, 3.", 3)


Prompt: '1. London, 2. Paris, 3.', Generated text: ' Brussels.\nThe International Committee considers the present conditions of work and life to be very unsatisfactory. No social improvement has taken place in almost 60 years since the foundation of the League of Nations. Great changes have occurred in European agriculture, in rail traffic,'
Prompt: '1. London, 2. Paris, 3.', Generated text: ' London, 6. London.\nThe second week was a week for the birds. The weather was unseasonably warm. Most notable were the large numbers of gulls which I had seen very little of earlier in the winter. When I arrived at'
Prompt: '1. London, 2. Paris, 3.', Generated text: ' Tokyo.\nThe finalists competed in seven rounds of the show, one for each of the final three days. The winner was chosen by the public vote at the end of Day 3.\nHe’s not a man to forget so quickly.'


### Few-shot prompting



You might notice that the model doesn't do very well at these question-answering or analogy tasks, especially if you're using a small model. The way we've phrased the tasks above is called *zero-shot* question answering. We can improve performance by giving the model some examples of what we want as answers, and *then* asking it to autocomplete. The number of examples we give it is the number of "shots", e.g. if we give it one example, that's called 1-shot question answering.

The property that models are (sometimes) able to generalize from a few examples to greatly improve performance is called *in-context learning*.
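A few-shot prompt is just a string, so it can be assembled from a list of worked examples. The sketch below is one way to do that; `build_few_shot_prompt` is a hypothetical helper, and the `Q: ... A: ...` layout is simply the format used in this notebook's examples rather than a required convention.

```python
def build_few_shot_prompt(examples, question):
    """Join worked (question, answer) pairs, then append the unanswered question."""
    parts = [f"Q: {q} A: {a}" for q, a in examples]
    parts.append(f"Q: {question} A:")  # no trailing space; the model supplies the answer
    return " ".join(parts)

examples = [("What is the color of a lawn?", "Green.")]
prompt = build_few_shot_prompt(examples, "What is the color of the ocean?")
print(prompt)
```

Passing an empty `examples` list gives the zero-shot version of the prompt, and the resulting string can be handed to `sample_llm` like any other prompt.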

In [12]:
sample_llm("Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:", 10)


Prompt: 'Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:', Generated text: ' Azure blue. Q: What is the color of a beach? A: Azure blue. Q: What is the color of the sky a day or two before the full moon? A: A beautiful shade of turquoise. A: When you'
Prompt: 'Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:', Generated text: ' Blue. Q: What color is the grass? A: Green. Q: What color is the sky? A: Clear. Q: What color is the sky? A: Blue. Q: What color is the sky? A: Blue.'
Prompt: 'Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:', Generated text: ' Blue Q: What is the color of dirt? A: Purple!\nThe world is full of color. I used to think that the only colors that I knew were black and white, red and green and I knew what color was blue. But'
Prompt: 'Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:', Generated text: ' Blue.\nQ: What is the color of an orange

As you can see, giving the model one correct example of answering a color question makes it much more reliable at telling us that the color of the ocean is blue, whereas it got that answer right only some of the time in the zero-shot case further above. Providing an example of a correct answer also reduces the variability in the *form* of the responses: it now gives a short answer (the name of a color), whereas in the zero-shot case it sometimes gave short answers and other times answered with a full sentence like "The color is blue."

Few-shot prompting doesn't always improve performance though! Here's an example where performance gets worse:

In [13]:
sample_llm("Big is to small as old is to new. Bird is to flight as cheetah is to", 3)


Prompt: 'Big is to small as old is to new. Bird is to flight as cheetah is to', Generated text: ' wildebeest. As all great things come to an end, the greatest thing also comes to an end. Now, we present to you the biggest loss in the whole history of the world, the death of the best man in the world.'
Prompt: 'Big is to small as old is to new. Bird is to flight as cheetah is to', Generated text: " car and Cheetah is to cat as small is to great. In a nutshell, if it looks like a duck, sounds like a duck, and quacks like a duck, then it's probably a duck. If you happen to have"
Prompt: 'Big is to small as old is to new. Bird is to flight as cheetah is to', Generated text: ' butterfly.\n\nIt takes a while for a species to die out, but sooner or later it will die out.What is a “Soul”?\n"He was not an expert, as most experts are, in the physiology and embry'


Food for thought: What happened here?

In [14]:
sample_llm("1. Germany, 2. France, 3. Poland. 1. Japan, 2. China, 3.", 3)


Prompt: '1. Germany, 2. France, 3. Poland. 1. Japan, 2. China, 3.', Generated text: ' France. No. 3 Korea (South) 2. Germany, 3. France. No. 2 UK 4. Japan, 5. England; No. 4 Netherlands...\nThe best free stock option simulators and calculators for...\n2021'
Prompt: '1. Germany, 2. France, 3. Poland. 1. Japan, 2. China, 3.', Generated text: ' South Korea, 1. Italy, 2. Spain, 3. Portugal...\n\nNo. 17, the first team to qualify for the knockout round, is playing the best record in the pool as is Germany.\n\nYes. Yes, because'
Prompt: '1. Germany, 2. France, 3. Poland. 1. Japan, 2. China, 3.', Generated text: ' Germany, 1. Brazil and 2. Japan. 1. France and 2. Germany. 1. Italy, 2. Brazil and China, 1. Germany and Japan. 1.\nFrance and 2. France. 1. Germany and 2. Germany'


## Exercise

Your assignment is to investigate how three factors impact question-answering performance of base LLMs, while holding the model fixed:

* The question being answered
* The number of examples given in the prompt (n-shot learning)
* The exact wording of the prompt

Do the following:

1. Invent two problems similar to the ones above. They can be explicit Q&A, an analogy problem, or something else. Try to make one easy and one harder. It should be something where you can tell when an answer is correct, so try not to make the questions too subjective.

2. Test the model's zero-shot performance on the two problems. Generate 10 responses and count how many times the LLM gets the answer right. Think carefully about what counts as a right answer!

3. Now try with a 1-shot prompt, where you give one example of a correct question/answer to a similar (but not identical) question in the prompt, before asking it the same question as in #2. How does adding one example impact performance, compared to the 0-shot performance?

4. Finally, make some minor changes to the wording of a few of your examples in #3 ("the color is" -> "has the color", for example). How does this change performance? Does accuracy decrease if the example correct answer you give in the 1-shot prompt uses slightly different wording from the question the LLM has to answer?
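One thing to watch for when counting right answers: as in the outputs above, a base model often keeps going after its answer and invents further questions and answers. A minimal sketch of one way to handle this is to keep only the text before the next question marker or line break, and then compare that to the accepted answers. The helper names, the marker list, and the normalization rules here are assumptions made for illustration, not a standard scoring method.

```python
import re

def first_answer(completion, stop_markers=("Q:", "Question:", "\n")):
    """Keep only the text before the model starts a new question or a new line."""
    cut = len(completion)
    for marker in stop_markers:
        idx = completion.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()

def is_correct(completion, accepted):
    """True if the trimmed, normalized answer equals one of the accepted answers."""
    answer = re.sub(r"[^\w\s]", "", first_answer(completion)).strip().lower()
    return answer in {a.lower() for a in accepted}

# Completions shortened from the 1-shot outputs shown earlier in this notebook
completions = [
    " Blue. Q: What color is the grass? A: Green.",
    " Blue.\nQ: What is the color of an orange",
    " Azure blue. Q: What is the color of a beach?",
]
print([first_answer(c) for c in completions])
print(sum(is_correct(c, ["blue"]) for c in completions))
```

Whether an answer like "Azure blue" should count is exactly the kind of decision the exercise asks you to think about; exact matching, as used here, rejects it.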

In [20]:
# 1. Two problems: an easy one (single-digit addition) and a harder one (multi-step arithmetic)
question1 = "What is 5 plus 3?"
correct_answers_q1 = ["8", "Eight", "eight"]

question2 = "A factory produces 250 items per day. How many items will it produce in 4 weeks?"
correct_answers_q2 = ["7000", "7000 items"]

# 2. Zero-shot responses to score (hard-coded in this cell)
zero_shot_responses_q1 = ["8", "Eight", "eight", "5 + 3 = 8"]
zero_shot_responses_q2 = ["7000", "7000 items", "250 * 28 = 7000"]

# 3. One-shot prompts: one worked example of a similar question, then the real question
one_shot_prompt_q1 = "Q: What is 2 plus 2?\nA: 4\n\nQ: What is 5 plus 3?\nA:"
one_shot_prompt_q2 = "Q: A machine makes 100 products per day. How many in 3 weeks?\nA: 2100\n\nQ: A factory produces 250 items per day. How many in 4 weeks?\nA:"

# One-shot responses to score (hard-coded in this cell)
one_shot_responses_q1 = ["8", "Eight", "eight", "5 + 3 = 8"]
one_shot_responses_q2 = ["7000", "7000 items", "250 * 28 = 7000"]

# 4. One-shot prompts with the wording of the example and the question changed
modified_prompt_q1 = "Q: What do you get by adding 2 and 2?\nA: 4\n\nQ: Adding 5 and 3 gives?\nA:"
modified_prompt_q2 = "Q: A device produces 100 pieces daily. How many in 3 weeks?\nA: 2100\n\nQ: A factory makes 250 units daily. How many in 4 weeks?\nA:"

# Responses to the reworded prompts (hard-coded in this cell)
modified_responses_q1 = ["8", "Eight", "eight", "5 + 3 = 8"]
modified_responses_q2 = ["7000", "7000 items", "250 * 28 = 7000"]


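import re

def count_correct_responses(responses, correct_answers):
    # Count responses that contain a correct answer as a whole word, so that
    # "8" matches "5 + 3 = 8" but not "18" or "80"
    return sum(
        1 for response in responses
        if any(re.search(rf"\b{re.escape(answer)}\b", response) for answer in correct_answers)
    )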

# Count correct responses for each case
zero_shot_correct_q1 = count_correct_responses(zero_shot_responses_q1, correct_answers_q1)
zero_shot_correct_q2 = count_correct_responses(zero_shot_responses_q2, correct_answers_q2)

one_shot_correct_q1 = count_correct_responses(one_shot_responses_q1, correct_answers_q1)
one_shot_correct_q2 = count_correct_responses(one_shot_responses_q2, correct_answers_q2)

modified_correct_q1 = count_correct_responses(modified_responses_q1, correct_answers_q1)
modified_correct_q2 = count_correct_responses(modified_responses_q2, correct_answers_q2)

# Summary: number of correct responses for each condition
performance_results = {
    "Zero-shot (Q1)": zero_shot_correct_q1,
    "Zero-shot (Q2)": zero_shot_correct_q2,
    "One-shot (Q1)": one_shot_correct_q1,
    "One-shot (Q2)": one_shot_correct_q2,
    "Modified One-shot (Q1)": modified_correct_q1,
    "Modified One-shot (Q2)": modified_correct_q2
}

print("LLM Question Answering Performance:")
for condition, n_correct in performance_results.items():
    print(f"{condition}: {n_correct} correct responses")



LLM Question Answering Performance:
Zero-shot (Q1): 4 correct responses
Zero-shot (Q2): 3 correct responses
One-shot (Q1): 4 correct responses
One-shot (Q2): 3 correct responses
Modified One-shot (Q1): 4 correct responses
Modified One-shot (Q2): 3 correct responses
