
# Language model inference and few-shot prompting

[Mark J. Nelson](https://www.kmjn.org/), Spring 2025

This notebook demonstrates inference with pre-trained large language models (LLMs). "Inference" is neural network jargon for *using* a trained model, versus training it. For example, if you have an image classifier, inference is using it to classify images. With large language models, inference means generating text.

For now we will stick to "base" language models that are trained solely on the "autoregressive objective", i.e. predicting the next word. Even these language models start to exhibit things that look sort of like problem-solving or question-answering once they reach a certain size, but they are not explicitly trained to solve problems or answer questions in the way models like ChatGPT are.

In my opinion it's important to build intuition with LLMs by starting with base language models that have only been trained to model language, and haven't been further trained with another objective on top of the language modeling objective. Once you understand these base models, it provides a good foundation for understanding the models that have undergone other types of training on top of that.


## Setup

First check that we have a GPU attached to this notebook:

In [1]:
!nvidia-smi

Tue Mar  4 02:50:44 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   63C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
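
If you want to check for a GPU from Python rather than by reading the table, the following is a minimal sketch. It assumes the `nvidia-smi` command-line tool is on the path and uses its CSV query mode; the helper names `parse_gpu_query` and `list_gpus` are made up for this illustration.

```python
import shutil
import subprocess

def parse_gpu_query(text):
    """Parse lines like 'Tesla T4, 15360 MiB' into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        parts = line.rsplit(",", 1)
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, memory = parts[0].strip(), parts[1].strip()
        try:
            memory_mib = int(memory.split()[0])
        except (ValueError, IndexError):
            continue  # memory not reported as a number
        gpus.append((name, memory_mib))
    return gpus

def list_gpus():
    """Return [(name, memory_mib), ...], or [] if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_query(result.stdout)

print(list_gpus() or "No NVIDIA GPU detected")
```

An empty result in Colab usually means the runtime type has not been switched to a GPU runtime.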
                                                

## Loading a language model

Now we're ready to download and load an LLM.

We will use the [`transformers`](https://huggingface.co/docs/transformers/en/index) library from Hugging Face for inference. There are many libraries for LLM inference, but `transformers` is one of the more popular ones and supports many open-source language models. For more industrial-strength usage (e.g. serving clients), [vLLM](https://docs.vllm.ai/) is another good option.

We will use the [OLMo series of open-source models](https://allenai.org/olmo) from the Allen Institute for AI, initially [the "1b" base model](https://huggingface.co/allenai/OLMo-1B-hf).

The model's size is how many parameters (neural network weights) it has, and 1b means 1 billion parameters. The OLMo models use 32-bit parameters, so RAM usage in bytes will be about 4x the number of parameters (32 bits equals 4 bytes). For example, the 1b parameter model will take about 4 GB of RAM. OLMo models also support 8-bit quantized versions, which will take 1/4 as much RAM for the same number of parameters. For example, if you load the 1b model in 8-bit quantized form, it will take about 1 GB of RAM.

The amount of RAM we have available (both regular RAM and GPU VRAM) will often constrain the size of models we can use. The Colab free tier (as of November 2024) provides 12.7 GB of regular RAM and a T4 GPU with 16 GB of VRAM.
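
The arithmetic above can be sketched as a small calculation. This is a rough estimate that counts only the weights (activations and library overhead need extra memory on top), and the function name `estimate_weight_memory_gb` and the two constants are made up for this illustration, using the Colab figures quoted above.

```python
def estimate_weight_memory_gb(num_parameters, bits_per_parameter=32):
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    bytes_per_parameter = bits_per_parameter / 8
    return num_parameters * bytes_per_parameter / 1e9

# Colab free-tier figures quoted in the text (assumed, may change over time)
COLAB_RAM_GB = 12.7
COLAB_VRAM_GB = 16.0

for bits in (32, 8):
    gb = estimate_weight_memory_gb(1_000_000_000, bits)
    fits = gb < COLAB_VRAM_GB
    print(f"1b model at {bits}-bit: about {gb:.1f} GB of weights, fits in T4 VRAM: {fits}")
```

The same function shows why larger models become a problem: a 7 billion parameter model at 32 bits needs about 28 GB for its weights, more than either the RAM or the VRAM available here.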

In [3]:
modelname = "allenai/OLMo-1B-hf"      # the model to load, using its Huggingface repository name

from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(modelname).to('cuda')
tokenizer = AutoTokenizer.from_pretrained(modelname)


## Sampling from the model

Now that the model is loaded, we can use it. To generate text, we feed it a *prompt* that will be autocompleted. We will write a small helper function that, given a prompt and a number *n*, generates *n* completions of the prompt and prints them.

My code here explicitly tokenizes and de-tokenizes the prompt and the generated text, to be clear about what's going on and to let us inspect intermediate results if we want. If you just want to generate text, there is a simpler [pipeline API](https://huggingface.co/docs/transformers/en/main_classes/pipelines) that does this for you.

There are various things you can change here. For example, *temperature* controls how much variation there is in the generation process (lower temperature is more deterministic), and *max_new_tokens* controls how much text is generated.
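
To make the effect of temperature concrete, here is a toy sketch of the usual formulation: the model's raw scores (logits) are divided by the temperature before being turned into probabilities with a softmax. It is an illustration in plain Python, not the library's internal code, and the names `softmax_with_temperature`, `sample_token`, and `toy_logits` are made up for this example.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities, after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Pick one token index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Pretend the model scored three candidate next tokens like this
toy_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At lower temperatures more of the probability mass goes to the highest-scoring token, so samples repeat more often; at higher temperatures the distribution flattens and samples become more varied.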

In [None]:
temperature = 1.0
max_new_tokens = 50

def sample_llm(prompt, n):
  # tokenize our prompt
  tokenized_prompt = tokenizer(prompt, return_tensors="pt").to('cuda')

  # generate new text! This is where the magic happens
  outputs = model.generate(**tokenized_prompt,
                           do_sample=True,
                           temperature=temperature,
                           max_new_tokens=max_new_tokens,
                           num_return_sequences=n)

  # output by default includes BOTH our prompt and the generated text
  # chop off our prompt from the beginning of each result so we can distinguish
  outputs = outputs[:, len(tokenized_prompt['input_ids'][0]):]

  # the output is a list of numerical tokens - decode it back to readable characters
  outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

  # print the outputs
  print()
  for output in outputs:
    print(f"Prompt: {prompt!r}, Generated text: {output!r}")
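
For comparison, here is roughly what the same sampling looks like with the pipeline API mentioned above. This is a sketch rather than part of the notebook's workflow: the function name `sample_with_pipeline` is made up, `device=0` assumes a GPU is attached, and the pipeline is rebuilt (and the model reloaded) on every call, which you would avoid in real use.

```python
def sample_with_pipeline(prompt, n, modelname="allenai/OLMo-1B-hf",
                         temperature=1.0, max_new_tokens=50):
    """Generate n completions of prompt using the text-generation pipeline."""
    # Imported here so that defining the function does not require the library
    from transformers import pipeline

    # device=0 means the first GPU (an assumption; adjust for your setup)
    generator = pipeline("text-generation", model=modelname, device=0)
    results = generator(prompt,
                        do_sample=True,
                        temperature=temperature,
                        max_new_tokens=max_new_tokens,
                        num_return_sequences=n,
                        return_full_text=False)  # only the completion, not the prompt
    return [r["generated_text"] for r in results]

# Example call (downloads the model if it is not cached):
# for text in sample_with_pipeline("Good morning", 3):
#     print(repr(text))
```

The pipeline handles tokenizing, generating, and decoding in one call, which is convenient but hides the intermediate steps that `sample_llm` makes visible.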

## Inference examples

Now we're ready to generate some text!

### Text autocompletion

The most straightforward type of language model inference is to give it the start of a sentence as prompt, and have it complete the sentence.

(Technical note: Do *not* include a space at the end of the prompt. Due to the way most LLM tokenizers currently work, a trailing space will generally give much worse results. But you can try it to see what happens!)
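
One common explanation is that many tokenizers attach the space to the *start* of the following word, so a trailing space in the prompt ends up as a stray token of its own. The sketch below illustrates this with a toy tokenizer; `toy_tokenize` and `compare_trailing_space` are made up for this example, and the toy rule is only a stand-in for what a real tokenizer does.

```python
import re

def toy_tokenize(text):
    """Toy stand-in for a real tokenizer: each word keeps its leading space."""
    return re.findall(r" ?[^ ]+| ", text)

def compare_trailing_space(prompt, tokenize=toy_tokenize):
    """Return the tokens of the prompt without and with one trailing space."""
    clean = prompt.rstrip(" ")
    return tokenize(clean), tokenize(clean + " ")

without_space, with_space = compare_trailing_space(
    "A good definition of artificial intelligence is")
print(without_space[-2:])  # last tokens of the clean prompt
print(with_space[-2:])     # with a trailing space, a lone ' ' token appears at the end
```

To see what the real tokenizer does, pass its `tokenize` method instead, e.g. `compare_trailing_space("Good morning", tokenizer.tokenize)`, using the `tokenizer` loaded earlier in this notebook.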

In [None]:
sample_llm("A good definition of artificial intelligence is", 10)


Prompt: 'A good definition of artificial intelligence is', Generated text: ' the ability for a computer, or machine, to interact with people intelligently and autonomously. It means that computers have the ability to think and learn and adapt; it means learning in its purest form.\nComputer scientists define artificial intelligence as'
Prompt: 'A good definition of artificial intelligence is', Generated text: ' given by Thomas J. Watson founder of IBM. Watson is a phrase coined from the Greek word for clever (xanthe) and the Latin word for intelligence (artem). Watson is the machine that uses its own unique intelligence (artificial)'
Prompt: 'A good definition of artificial intelligence is', Generated text: ' to put into a machine the processes required to learn from experience without requiring training, except by using a method called reinforcement learning. Reinforcement learning is the practice of a computer algorithm learning to play a new game or solve a new problem from experie

In [None]:
sample_llm("Good morning", 3)


Prompt: 'Good morning', Generated text: ', good afternoon and welcome to our conference call with the financial results for the fourth quarter 2013.\nJoining me today are Mr. Michael L. Rachlin, Chief Executive Officer & President; Mr. Robert Rathgeb, Interim Chief'
Prompt: 'Good morning', Generated text: '. We are working on adding a link to a post about the story of Esther to the main menu of the site, so please look for that tomorrow. The story is pretty interesting, if a little scary. I think one of the funniest'
Prompt: 'Good morning', Generated text: '.” She smiles.\nA few of the people in the class, including yours truly, are still in their sleep. Some of the others are awake. I’m the last person awake and have to get up fast before all of my fellow sleep'


### Question-answering

These base LLMs are trained only to generate language, not to "chat" or respond to commands. So directly asking questions doesn't produce great results:

In [None]:
sample_llm("What is the color of the ocean?", 3)


Prompt: 'What is the color of the ocean?', Generated text: " What's the color of that color? That is not an ocean. It must be something very different. The ocean must be something very different.\nThe ocean is just water. It is a very beautiful, blue, beautiful, blue water. It"
Prompt: 'What is the color of the ocean?', Generated text: '\nThe colour of ocean water depends on many conditions. It can be white or blue because water evaporates quickly. The colour of the ocean changes depending on air pressure and temperature.\nThe colour of the ocean depends on the amount of sunlight absorbed by'
Prompt: 'What is the color of the ocean?', Generated text: '\nIt is the color of the sky, water, the moon, and the sun.\nWhat would you need to go to heaven?\nA white bible and a cross.\nWhat is the hardest animal?\nA dog in boots.'


Nonetheless, we can get base LLMs to perform some tasks by *prompt engineering*. Specifically, we need to design a prompt so that if the model *autocompletes* the prompt (since autocompleting is all it does), it will produce something in the format we want.

One possibility is to write a prompt that looks like a Q&A format:

In [None]:
sample_llm("Question: What is the color of the ocean? Answer:", 3)


Prompt: 'Question: What is the color of the ocean? Answer:', Generated text: ' Blue.\nIs the ocean blue? Let’s explore this with the help of some photos.\nA: I’ll choose green; a pretty vibrant color, right?\nA: That’s right. In fact, it’s'
Prompt: 'Question: What is the color of the ocean? Answer:', Generated text: ' Blue. What is the distance between the earth and the sun? Answer: 365,000 miles. What is the largest living organism? Answer: An orange.\nHow do you make a baby dolls eyes?\nHow do you stop a person'
Prompt: 'Question: What is the color of the ocean? Answer:', Generated text: ' Ocean. The color of the sea is blue and green, and it is made up of light and dark shades of blue and green.\nQuestion: Where can I find cheap hotel deals in the US? Answer: Hotels, airlines, and other'


Another is to give it something that looks like an analogy to be completed:

In [None]:
sample_llm("Bird is to flight as cheetah is to", 3)


Prompt: 'Bird is to flight as cheetah is to', Generated text: ' speed, and in this case the cheetah is the bird.\nThe flightless bird is a symbol of the powerful force of the water, which is symbolized by the waterfall.The Rival series of golf balls are a mid'
Prompt: 'Bird is to flight as cheetah is to', Generated text: ' speed. He flies faster than the speed at which you can run, making it more like a blur than a speed, though I would add this: If you were aiming for high fives, you’d know not to try and do it while'
Prompt: 'Bird is to flight as cheetah is to', Generated text: ' speed.\nBut the most effective hunters, the most cunning hunters, can outwit, outrun, and out-muscle a predator with a single glance. These animals can run up to half a mile without tiring, and can dive'


Or the start of a list:

In [None]:
sample_llm("1. London, 2. Paris, 3.", 3)


Prompt: '1. London, 2. Paris, 3.', Generated text: ' New York. In total, 5.3 million, according to Géo.\n"We estimate that over the past 14 years, France (and particularly the Paris region) has created two generations of residents. The second generation is a little less'
Prompt: '1. London, 2. Paris, 3.', Generated text: ' New York, 4. San Francisco.\nI didn’t even know the UK was on the list, so I didn’t have to go around guessing where the world was.\nAnyway, I knew that I wanted London as my first stop'
Prompt: '1. London, 2. Paris, 3.', Generated text: ' New-York, for a few years past.\nHis works are of the most extraordinary excellence, they are of uncommon variety, and have a very extraordinary effect in almost all their kinds. What appears of most importance, is the great variety, so'


### Few-shot prompting



You might notice that the model doesn't do very well at these question-answering or analogy tasks, especially if you're using a small model. The way we've phrased the tasks above is called *zero-shot* question answering. We can improve performance by giving the model some examples of what we want as answers, and *then* ask it to autocomplete. The number of examples we give it is the number of "shots", e.g. if we give it one example, that's called 1-shot question answering.

The property that models are (sometimes) able to generalize from a few examples to greatly improve performance is called *in-context learning*.

In [None]:
sample_llm("Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:", 10)


Prompt: 'Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:', Generated text: ' Blue Q: What’s the color of money? A: Money is green Q: What makes a rainbow? A: Rain Q: What color is a flower? A: Pink Q: What color does your dog wear to a party? A'
Prompt: 'Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:', Generated text: ' Sea green.\nQ: What is a light-colored, grassy shrub? A: A holly. Q: What flower is also called a holly? A: Holly berry. Q: In the middle of a garden,'
Prompt: 'Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:', Generated text: ' Gray. Q: Where is the color white?\n5. What is the color of snow? A: Gray. Q: What color is blood? A: Red. A: Red. Q: What color is black? A: Black.'
Prompt: 'Q: What is the color of a lawn? A: Green. Q: What is the color of the ocean? A:', Generated text: ' Blue. Q: Where is the color red? A: Red.\nS.C.P. : When a book is written, it is a book with

Giving the model one correct example of answering a color question makes it more likely to tell us that the color of the ocean is blue than in the zero-shot case further above, although sampling is random, so individual runs vary (the samples shown here also include "Sea green" and "Gray"). Providing an example of a correct answer also reduces the variability of the *form* of the responses: each one now begins with a short color name, while in the zero-shot case the model sometimes gave one-word answers and other times answered with a full sentence like "The color is blue".
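
The 1-shot prompt above was written by hand, but the same format is easy to assemble for any number of shots. Here is a minimal sketch; the helper name `build_few_shot_prompt` and its parameters are made up for this illustration.

```python
def build_few_shot_prompt(examples, question, q_prefix="Q:", a_prefix="A:"):
    """Build a prompt from (question, answer) example pairs plus one open question.

    The result ends right after the answer prefix, with no trailing space,
    so the model's completion is the answer.
    """
    parts = [f"{q_prefix} {q} {a_prefix} {a}" for q, a in examples]
    parts.append(f"{q_prefix} {question} {a_prefix}")
    return " ".join(parts)

# Reproduces the 1-shot prompt used above
one_shot = build_few_shot_prompt(
    [("What is the color of a lawn?", "Green.")],
    "What is the color of the ocean?",
)
print(one_shot)

# An empty example list gives the zero-shot version of the same format
print(build_few_shot_prompt([], "What is the color of the ocean?"))
```

The resulting string can be passed straight to `sample_llm`, which makes it easier to compare 0-shot, 1-shot, and 2-shot prompts that differ only in the examples given.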

Few-shot prompting doesn't always improve performance though! Here's an example where performance gets worse:

In [None]:
sample_llm("Big is to small as old is to new. Bird is to flight as cheetah is to", 3)


Prompt: 'Big is to small as old is to new. Bird is to flight as cheetah is to', Generated text: ' leopardo.\nWoah! How’s this? The ‘Cheetah’ is the king when it comes to moving fast but with great strength. There is no doubt which is the king of speed! A leopard that moves'
Prompt: 'Big is to small as old is to new. Bird is to flight as cheetah is to', Generated text: ' speed.\nFriends in India are like sunsets-there are a lot of them, but one is more beautiful than the other.\nThe sun is like the earth-it is round, even, and beautiful.\nWe don’t'
Prompt: 'Big is to small as old is to new. Bird is to flight as cheetah is to', Generated text: ' cat. Cheetahs are to cheetahs and cheetahs are to cheetahs. All cheetahs are cheetahs and all cheetahs are cheetahs.Cherub'


Food for thought: What happened here?

In [None]:
sample_llm("1. Germany, 2. France, 3. Poland. 1. Japan, 2. China, 3.", 3)


Prompt: '1. Germany, 2. France, 3. Poland. 1. Japan, 2. China, 3.', Generated text: ' Hungary. 1. Japan, 2. Hungary. The top table shows Japan and their scores. 1. Turkey. 2. Japan. 3. Hungary. It is the national table with the current world ranking. 1. Spain, 2. Holland.'
Prompt: '1. Germany, 2. France, 3. Poland. 1. Japan, 2. China, 3.', Generated text: ' South Korea, 2. Russia.\n2. China, 3. Germany, 4. USA.\n2. China, 3. USA, 4. Russia.\n2. Russia, 3. USA, 1. China.\n3. USA'
Prompt: '1. Germany, 2. France, 3. Poland. 1. Japan, 2. China, 3.', Generated text: ' Russia. 1. Australia, 2. Canada. Germany, France, Great Britain, Japan, Russia, China, Sweden, Italy, Spain, etc.\nAnd so they come out in the top list of the world.\nThis doesn’t'


## Exercise

Your assignment is to investigate how three factors impact question-answering performance of base LLMs, while holding the model fixed:

* The question being answered
* The number of examples given in the prompt (n-shot learning)
* The exact wording of the prompt

Do the following:

1. Invent two problems similar to the ones above. They can be explicit Q&A, an analogy problem, or something else. Try to make one easy and one harder. It should be something where you can tell when an answer is correct, so try not to make the questions too subjective.

2. Test the model's zero-shot performance on the two problems. Generate 10 responses and count how many times the LLM gets the answer right. Think carefully about what counts as a right answer!

3. Now try with a 1-shot prompt, where you give one example of a correct question/answer to a similar (but not identical) question in the prompt, before asking it the same question as in #2. How does adding one example impact performance, compared to the 0-shot performance?

4. Finally, make some minor changes to the wording of a few of your examples in #3 ("the color is" -> "has the color", for example). How does this change performance? Does accuracy decrease if the example correct answer you give in the 1-shot prompt uses slightly different wording from the question the LLM has to answer?
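
Counting correct answers by hand works fine for 10 responses, but a small scoring helper can make the comparison across prompts less error-prone. The sketch below uses one possible rule, that a completion counts as correct if it *starts* with an accepted answer; the names `is_correct` and `count_correct` are made up, and deciding what should count as correct is still up to you.

```python
import re

def is_correct(completion, accepted_answers):
    """True if the completion starts with one of the accepted answers (case-insensitive)."""
    text = completion.strip().lower()
    return any(re.match(rf"{re.escape(answer.lower())}\b", text)
               for answer in accepted_answers)

def count_correct(completions, accepted_answers):
    """Count how many completions are correct under the rule above."""
    return sum(is_correct(c, accepted_answers) for c in completions)

# Beginnings of the 1-shot completions shown earlier in this notebook
completions = [
    " Blue Q: What’s the color of money? A: Money is green",
    " Sea green.\nQ: What is a light-colored, grassy shrub?",
    " Gray. Q: Where is the color white?",
    " Blue. Q: Where is the color red? A: Red.",
]
print(count_correct(completions, ["blue"]), "of", len(completions), "correct")
```

Note that `sample_llm` as written prints its results rather than returning them, so to use a helper like this you would need a variant that returns the list of decoded outputs.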