# CS 195: Natural Language Processing
## Chat and Instruct Models

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/s26-CS195NLP/blob/main/F2_1_ChatInstruct.ipynb)

## References


* [Hugging Face Chat Basics](https://huggingface.co/docs/transformers/en/conversations)
* [SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model](https://arxiv.org/pdf/2502.02737)


## Demo Day

Sit with the same people you sat with last week

* Each person gives a 5-minute demo of a **creative synthesis** project or a completed **applied exploration** (or **core practice** if that's what you have)
* Write down the names of the people you presented to (you'll include this in your portfolio later)
* (optional) Nominate a cool project to show off to everyone



## Install Modules

We'll be using `transformers` version 5. You probably only need to run this cell if you are working on your own computer for the first time. If the two lines are commented out, uncomment them and run the cell.


In [1]:
import sys
!{sys.executable} -m pip install transformers accelerate

Collecting transformers
  Downloading transformers-5.1.0-py3-none-any.whl.metadata (31 kB)
Collecting accelerate
  Using cached accelerate-1.12.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface-hub<2.0,>=1.3.0 (from transformers)
  Downloading huggingface_hub-1.4.1-py3-none-any.whl.metadata (13 kB)
Collecting numpy>=1.17 (from transformers)
  Downloading numpy-2.4.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.6 kB)
Collecting pyyaml>=5.1 (from transformers)
  Using cached pyyaml-6.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (2.4 kB)
Collecting regex!=2019.12.17 (from transformers)
  Using cached regex-2026.1.15-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Using cached tokenizers-0.22.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.3 kB)
Collecting typer-slim (from t

## Large vs. Small language models

Large Language Models (GPT 5.2, Claude Opus 4.6, Grok 4.1, Gemini 3) get all the attention.

Smaller language models have come a long way too, and they require much less computation:
* can often be run on a laptop or a Colab instance
* can be *fine-tuned* for specific applications with good performance


### Example: SmolLM2

Hugging Face developed a [family of small language models called SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm2). There's also a [SmolLM3](https://huggingface.co/blog/smollm3).

SmolLM2 comes in various sizes
* 135M (135 million parameters - weights in the neural network)
* 360M
* 1.7B

Contrast this with the LLMs above, which likely all have over 100 billion parameters and run on clusters of devices in data centers.
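To make the size difference concrete, here is a back-of-the-envelope estimate of how much memory the weights alone would take. It is only a sketch: it assumes 2 bytes per parameter (16-bit floats) or 4 bytes (32-bit floats), ignores activations and other runtime overhead, and uses 100 billion parameters purely as an illustrative stand-in for the large models mentioned above.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: weights only; activations, caches, and framework overhead are ignored.

PARAM_COUNTS = {
    "SmolLM2-135M": 135e6,
    "SmolLM2-360M": 360e6,
    "SmolLM2-1.7B": 1.7e9,
    "100B-parameter LLM (illustrative)": 100e9,
}

def weight_memory_gb(num_params, bytes_per_param):
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

for name, n in PARAM_COUNTS.items():
    fp16 = weight_memory_gb(n, 2)
    fp32 = weight_memory_gb(n, 4)
    print(f"{name:36s} 16-bit: {fp16:8.2f} GB   32-bit: {fp32:8.2f} GB")
```

Under these assumptions the small models fit comfortably in a laptop's memory, while the illustrative 100B model needs hundreds of gigabytes for its weights alone.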

Each size has a **base model**, like [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M)
* *pre-trained* on lots of diverse text
* designed to predict the *next word* - it's your phone's keyboard text prediction on steroids
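To get a feel for what "predicting the next word" means, here is a toy predictor that just counts which word most often follows another in a tiny made-up corpus. This is a swapped-in illustration (a bigram counter), not how SmolLM2 works: a real base model is a neural network that looks at a much longer context and is trained on vastly more text. The corpus and the `predict_next` helper are invented for this example.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real base model is trained on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = follow_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this corpus
print(predict_next("sat"))  # "on" is the only word that follows "sat"
```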

The *base model* is then fine-tuned on *instruction following* and *conversational data*, which makes it useful for building **chat bots**.
* resulting model has a name like [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct)

## Building a chat bot with the `text-generation` pipeline

Setting up a chat bot works the same way as other Hugging Face models we've seen, but we'll use the `text-generation` pipeline

In [2]:
from transformers import pipeline
from accelerate import Accelerator

device = Accelerator().device

chatbot = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct", device = device)

Loading weights: 100%|██████████| 290/290 [00:00<00:00, 455.06it/s, Materializing param=model.norm.weight]                              


### Chat template

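*Instruct* models expect their input in a particular conversation format. The pipeline lets you pass the conversation as a list of messages, each with a `role` and a `content`, and applies the model's **chat template** for you: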

In [3]:
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gravity in one paragraph."},
]
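Behind the scenes, the list of messages is flattened into one long string before the model sees it. In `transformers` this is done by the tokenizer's `apply_chat_template` method, using a template stored with each model. The sketch below imitates that step by hand, assuming a ChatML-style layout with `<|im_start|>` and `<|im_end|>` markers; the real template for a given model may differ, and `render_chatml` is a made-up helper for illustration.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Flatten a list of role/content dicts into one ChatML-style prompt string."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues by writing the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

chat_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gravity in one paragraph."},
]
print(render_chatml(chat_history))
```

Seen this way, an instruct model is still predicting the next word; the template just sets up the text so that the most natural continuation is an assistant reply.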

Now let's get the response and display what is returned.

In [4]:
response = chatbot(chat_history)
response

Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': [{'role': 'system',
    'content': 'You are a helpful assistant.'},
   {'role': 'user', 'content': 'Explain gravity in one paragraph.'},
   {'role': 'assistant',
    'content': "Gravity is a fundamental force of nature that attracts two or more objects with mass towards each other. It is the reason why planets orbit around stars, why objects fall to the ground, and why satellites orbit around Earth. According to Albert Einstein's theory of general relativity, gravity is not a force that acts between objects, but rather a curvature of spacetime caused by the presence of mass and energy. This curvature affects the motion of objects in the vicinity of the mass, giving rise to the gravitational force that ultimately governs the behavior of celestial bodies."}]}]

**Notice:**
* it just uses normal lists and dictionaries
* it returns the entire chat history on the `'generated_text'` key

If you wanted to just print out the response, you could do it like this

In [5]:
print("ASSISTANT:", response[0]['generated_text'][-1]['content'])

ASSISTANT: Gravity is a fundamental force of nature that attracts two or more objects with mass towards each other. It is the reason why planets orbit around stars, why objects fall to the ground, and why satellites orbit around Earth. According to Albert Einstein's theory of general relativity, gravity is not a force that acts between objects, but rather a curvature of spacetime caused by the presence of mass and energy. This curvature affects the motion of objects in the vicinity of the mass, giving rise to the gravitational force that ultimately governs the behavior of celestial bodies.


## Multi-turn conversations

If you want to have a multi-turn conversation, you need to pass *the entire chat history* to the model each time. It has no inherent memory of the output it just gave you.

In [6]:
another_chat_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gravity in one paragraph."},
    {"role": "assistant", "content": "Gravity is a fundamental force of nature that causes objects with mass to attract each other. According to Sir Isaac Newton, the force of gravity between two objects depends on their masses and the distance between them. The larger the mass of the objects, the greater the gravitational pull. This force is responsible for keeping planets in orbit around the sun and causes objects to fall towards the ground when dropped. Albert Einstein also described gravity as a curvature of spacetime caused by massive objects, which in turn warps the fabric of spacetime around them."},
    {"role": "user", "content": "Which of those two do you think has had a bigger impact on the field?"}
]

next_response = chatbot(another_chat_history)
print(next_response[0]['generated_text'][-1]['content'])

According to my previous explanation, Einstein's theory of general relativity has had a larger impact on the field of astrophysics and cosmology. General relativity describes gravity as a curvature of spacetime caused by massive objects, which has been extensively tested and confirmed by numerous experiments and observations, including the bending of light around massive objects such as stars and black holes. This theory has far-reaching implications for our understanding of the universe, from the behavior of planets in our solar system to the expansion of the cosmos itself.


## Exercise

Write a loop that allows for back-and-forth conversation with the model. Make sure to keep track of the full history of the chat as you go.

In [11]:
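# Start the conversation with just the system message
loop_chat_history = [{"role": "system", "content": "You are a helpful assistant."}]

# Runs until interrupted (e.g., with the notebook's stop button)
while True:
    # Read the next user message and add it to the running history
    question = input()
    print(f"USER: {question}")
    loop_chat_history.append({"role": "user", "content": question})

    # Send the full history; the last returned message is the assistant's reply
    next_response = chatbot(loop_chat_history)
    loop_chat_history.append(next_response[0]['generated_text'][-1])
    print(f"RESPONSE: {next_response[0]['generated_text'][-1]['content']}")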

USER: who are some famous scientists?
RESPONSE: I'm sorry, but as a text-based AI, I don't have the ability to access databases or provide information from the internet. I'm here to assist with text-based conversations, so I can't provide information about famous scientists. However, you might want to search for specific names such as Albert Einstein, Isaac Newton, Marie Curie, or Ernest Rutherford.


USER: of those you mentioned, what are they known for?
RESPONSE: Albert Einstein is known for his work on the theory of relativity, and his name is also associated with the famous equation E=mc2, which describes the relationship between mass and energy.

Isaac Newton is well-known for formulating the laws of motion and universal gravitation.

Marie Curie is famous for her pioneering work on radioactivity and her discovery of two new elements, polonium and radium.

Ernest Rutherford is known for his work on nuclear structure and the discovery of the noble gases. He was also a Nobel Prize winner.


KeyboardInterrupt: Interrupted by user
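The loop above only stops when it is interrupted. One possible variant, as a sketch, pulls the per-turn bookkeeping into a function and ends the chat when the user types `quit` or `exit`. The names `chat_turn`, `run_chat`, and `echo_bot` are made up for this example; `echo_bot` is a stand-in that returns data in the same shape as the pipeline output shown earlier, so the logic can be tried without loading a model.

```python
def chat_turn(chatbot, history, user_text):
    """Append the user's message, call the model, then append and return its reply."""
    history.append({"role": "user", "content": user_text})
    result = chatbot(history)
    reply = result[0]["generated_text"][-1]  # last message is the assistant's
    history.append(reply)
    return reply["content"]

def run_chat(chatbot, system_prompt="You are a helpful assistant."):
    """Interactive loop; typing 'quit' or 'exit' ends the conversation."""
    history = [{"role": "system", "content": system_prompt}]
    while True:
        question = input("USER: ")
        if question.strip().lower() in {"quit", "exit"}:
            break
        print("RESPONSE:", chat_turn(chatbot, history, question))
    return history

# Stand-in with the same return shape as the pipeline, for trying the logic offline.
def echo_bot(history):
    reply = {"role": "assistant", "content": f"You said: {history[-1]['content']}"}
    return [{"generated_text": history + [reply]}]

demo_history = [{"role": "system", "content": "You are a helpful assistant."}]
print(chat_turn(echo_bot, demo_history, "hello"))
print([message["role"] for message in demo_history])
```

To chat with the real model, you would call `run_chat(chatbot)` with the pipeline created earlier.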

## Evaluating Chat Models: Benchmarks

A benchmark is a dataset with one or more reference answers that can be used to measure a model's response (like the reference summaries we compared against with ROUGE)
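As a small illustration of scoring against reference answers, here is a sketch of exact-match accuracy, a simpler metric than ROUGE that works when answers are short. The predictions and references are invented for the example, and the normalization rules (lowercasing, trimming whitespace and trailing punctuation) are an assumption; real benchmarks define their own.

```python
def normalize(text):
    """Lowercase and trim whitespace and trailing punctuation so trivial differences don't count as errors."""
    return text.strip().lower().rstrip(".!?")

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match any of their reference answers."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need equally long, non-empty predictions and references")
    correct = 0
    for prediction, acceptable in zip(predictions, references):
        if normalize(prediction) in {normalize(r) for r in acceptable}:
            correct += 1
    return correct / len(predictions)

# Hypothetical model outputs and reference answers, for illustration only.
predictions = ["Paris.", "4", "blue whale", "Sydney"]
references = [["Paris"], ["four", "4"], ["The blue whale", "blue whale"], ["Canberra"]]
print(exact_match_accuracy(predictions, references))  # 3 of 4 match
```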


Model benchmarking is a big deal - companies like to report how well their models perform on all kinds of benchmarks

For example, see the **performance** tab here: https://deepmind.google/models/gemini/pro/

Take a look at this benchmark for multi-turn conversations: https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts

**Group Discussion:** What are some things you notice about this data?



## Evaluating Chat Models: Human Evaluators

Language models are often evaluated by having humans perform A/B testing where two models respond to the same prompt, and the human indicates which was better.

Try it out here: https://arena.ai
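Once you have a pile of "which was better?" votes, you need some way to summarize them. The sketch below computes a plain win rate per model, counting a tie as half a win for each side (an assumption made here). Public leaderboards built on this kind of voting typically use more elaborate rating systems, so treat this as a simplified stand-in; the model names and votes are invented.

```python
from collections import Counter

def win_rates(votes):
    """votes: list of (model_a, model_b, winner), where winner is one of the two names or "tie"."""
    wins = Counter()
    games = Counter()
    for model_a, model_b, winner in votes:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "tie":
            # Assumption: a tie counts as half a win for each side.
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        elif winner in (model_a, model_b):
            wins[winner] += 1
        else:
            raise ValueError(f"unknown winner: {winner!r}")
    return {model: wins[model] / games[model] for model in games}

# Hypothetical votes, for illustration only.
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "model-y"),
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "tie"),
]
print(win_rates(votes))
```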

## Group Exercise

Do the following as a group:
* Come up with 5 language model prompts - what are some questions/instructions you think would help you decide how good a language model is?
* Test them using [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) and [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
* Have each person in your group vote on which model gave the better response to each prompt
* Write down the results
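If you want the voting to be fair, it helps if voters don't know which model wrote which response. One way to do that, sketched below, is to randomly assign each pair of responses to the labels "A" and "B" and keep a hidden key for tallying afterwards. The function names and the `model_1`/`model_2` labels are made up for this example; the responses would come from running each model on your prompts.

```python
import random

def make_blind_pairs(prompts, responses_1, responses_2, seed=0):
    """Randomly assign each pair of responses to labels "A" and "B" so voters can't tell the models apart."""
    rng = random.Random(seed)
    pairs = []
    for prompt, r1, r2 in zip(prompts, responses_1, responses_2):
        swapped = rng.random() < 0.5
        a, b = (r2, r1) if swapped else (r1, r2)
        # `key` records which model is behind each label; keep it hidden until voting ends.
        key = {"A": "model_2" if swapped else "model_1",
               "B": "model_1" if swapped else "model_2"}
        pairs.append({"prompt": prompt, "A": a, "B": b, "key": key})
    return pairs

def tally(pairs, ballots):
    """ballots: one "A" or "B" per pair; returns the number of votes each model received."""
    counts = {"model_1": 0, "model_2": 0}
    for pair, choice in zip(pairs, ballots):
        counts[pair["key"][choice]] += 1
    return counts

# Placeholder prompts and responses, for illustration only.
prompts = ["Prompt 1", "Prompt 2"]
pairs = make_blind_pairs(prompts, ["reply 1a", "reply 2a"], ["reply 1b", "reply 2b"])
for pair in pairs:
    print(pair["prompt"], "| A:", pair["A"], "| B:", pair["B"])
print(tally(pairs, ["A", "B"]))
```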

## Applied Exploration

Choose two instruct models of similar size: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending&search=instruct
  * Link to the model cards for the models you're using and describe each of them

Do one of the following:

1. Go to https://huggingface.co/datasets and find a dataset suitable to use as a benchmark, and compare the performance of the two models. It doesn't have to be a conversational benchmark. It could be a text classification, summarization, or math dataset, or anything else, as long as you can instruct the model to answer it. You also don't have to use the whole dataset.
    * link to and describe the dataset
    * describe how you compared the performance (e.g., what metric did you use?)
    * report the results

OR

2. Come up with your own fun benchmark (Taylor Swift trivia, AI Dungeon Master, Joke telling, etc.), generate responses for both models, and have another person rate the answers.
    * Describe what you did
    * report the results
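For option 1, much of the work is turning dataset rows into prompts and turning free-form replies back into something you can score. Here is a minimal sketch for a classification dataset. The label names and the helpers `build_messages` and `parse_label` are hypothetical, and the substring matching is deliberately crude; you would adapt both to your dataset.

```python
def build_messages(text, labels):
    """Wrap one dataset example in chat messages that ask for a single label."""
    instruction = (
        "Classify the text into exactly one of these labels: "
        + ", ".join(labels)
        + ". Reply with the label only."
    )
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": text},
    ]

def parse_label(reply, labels):
    """Return the one label mentioned in the reply, or None if none or several appear."""
    found = [label for label in labels if label.lower() in reply.lower()]
    return found[0] if len(found) == 1 else None

# Hypothetical labels and model replies, for illustration only.
labels = ["positive", "negative"]
messages = build_messages("I loved this movie!", labels)
print(messages[0]["content"])
print(parse_label("Positive.", labels))
print(parse_label("It could be positive or negative.", labels))
```

Replies that can't be parsed come back as `None`; a simple policy is to count them as wrong, and to report how often that happens for each model.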
