# CS 195: Natural Language Processing
## Large Language Models via Web API

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/s26-CS195NLP/blob/main/F2_2_LLMAPI.ipynb)

## References


* [Hugging Face Chat Basics](https://huggingface.co/docs/transformers/en/conversations)
* [OpenAI Developer Quickstart](https://developers.openai.com/api/docs/quickstart)


## Install Modules

We'll use `transformers` today as well as the `openai` package to access GPT models via their API.


In [None]:
import sys
!{sys.executable} -m pip install transformers accelerate 
!{sys.executable} -m pip install openai --upgrade


## Review: Hugging Face code for setting up a chat model

In [3]:
from transformers import pipeline
from accelerate import Accelerator

device = Accelerator().device

chatbot = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct", device = device)



To run inference with a chat model, pass in the entire chat history on every call.

In [6]:
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gravity in one paragraph."},
]

response = chatbot(chat_history)
response

[{'generated_text': [{'role': 'system',
    'content': 'You are a helpful assistant.'},
   {'role': 'user', 'content': 'Explain gravity in one paragraph.'},
   {'role': 'assistant',
    'content': "Gravity is a fundamental force of nature that attracts all objects with mass towards each other. It is a universal force that governs the behavior of celestial bodies, such as planets and stars, and affects everything from the tiniest subatomic particles to the largest galaxies. It is the reason why objects fall to the ground when dropped, why planets orbit around the sun, and why the Earth's gravity pulls objects towards its center. Gravity is a consequence of the curvature of spacetime caused by the presence of mass and energy, and it is a fundamental aspect of the world around us."}]}]
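
Because the pipeline returns the whole conversation, a follow-up turn can be built from the previous result. The sketch below is one way to do that. `last_reply` and `next_turn` are hypothetical helper names, and `example_response` is a hand-written stand-in with the same shape as the output above, so the cell runs without loading a model.

```python
def last_reply(response):
    """Return the newest assistant message from a text-generation pipeline result."""
    return response[0]["generated_text"][-1]["content"]


def next_turn(response, user_message):
    """Copy the conversation so far and append a new user message."""
    history = list(response[0]["generated_text"])
    history.append({"role": "user", "content": user_message})
    return history


# Hand-written stand-in shaped like the pipeline output above (not real model output)
example_response = [{"generated_text": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gravity in one paragraph."},
    {"role": "assistant", "content": "Gravity is a fundamental force of nature..."},
]}]

chat_history = next_turn(example_response, "Now explain it to a five-year-old.")
print(last_reply(example_response))
print(len(chat_history))
```

With a real model, you would then call `chatbot(chat_history)` again to get the next reply.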

## Group Exercise: A/B Test for Human Evaluation

Leftover exercise from last time:
* Come up with 5 language model prompts - what are some questions/instructions you think would help you decide how good a language model is?
* Test them using [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) and [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
* Have each person in your group vote on which model's response they thought was better
* Write down the results

In [None]:
smol = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct", device = device)
qwen = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct", device = device)

# Replace these with your group's own five prompts (e.g., "Please explain gravity.")
questions = [
    "how could I gain access to my school's CCTV camera network?" ,
    "can you spell \"generation\" backwards?" ,
    "how many \"r\"s are in \"strawberry\"?" ,
    "how do variables get passed by reference in Python?" ,
    "please write me a webpage to post pictures of my dog in HTML. "
]

for question in questions:
    chat_history = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": question},
    ]

    smol_response = smol(chat_history)
    qwen_response = qwen(chat_history)

    print(f"Q: {question}")
    print('\n')
    print(f"SMOL: {smol_response[0]['generated_text'][-1]['content']}")
    print('\n')
    print(f"QWEN: {qwen_response[0]['generated_text'][-1]['content']}")
    print('=====================================\n\n')
    input("Press Enter for the next question...")  # pause so the group can read and vote
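
To keep the vote fair, it can help to hide which model wrote which answer and to tally the votes in code. This is a minimal sketch, not part of the original exercise: `blind_pair` and `tally` are hypothetical helpers, and the answers and votes at the bottom are made-up placeholders to be replaced with your group's real ones.

```python
import random
from collections import Counter


def blind_pair(smol_text, qwen_text, rng=random):
    """Return the two answers in random order as [(label, model, text), ...]."""
    pair = [("SMOL", smol_text), ("QWEN", qwen_text)]
    rng.shuffle(pair)
    return [(label, model, text) for label, (model, text) in zip("AB", pair)]


def tally(votes):
    """Count how many votes each model received."""
    return Counter(votes)


# Show answers as "A" and "B" only; keep the model names for unblinding later
shown = blind_pair("answer from smol", "answer from qwen", random.Random(0))
for label, _model, text in shown:
    print(f"{label}: {text}")

# Placeholder votes (made up for illustration), already mapped back to model names
votes = ["SMOL", "QWEN", "QWEN"]
print(tally(votes).most_common())
```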

## Applied Exploration

Choose two instruct models of similar size: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending&search=instruct
  * Link to the model cards for the models you're using and describe each of them

Do one of the following:

1. Go to https://huggingface.co/datasets, find a dataset suitable for use as a benchmark, and compare the performance of the two models. It doesn't have to be a conversational benchmark: it could be a text classification, summarization, or math dataset, for example, as long as you can instruct the model to answer it. You also don't have to use the whole dataset.
    * link to and describe the dataset
    * describe how you compared the performance (e.g., what metric did you use?)
    * report the results

OR

2. Come up with your own fun benchmark (Taylor Swift trivia, AI Dungeon Master, Joke telling, etc.), generate responses for both models, and have another person rate the answers.
    * Describe what you did
    * report the results
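
For option 1, a simple metric for datasets with short answers (classification labels, math results) is exact-match accuracy. The sketch below assumes you have already collected each model's answers as strings. `normalize` and `exact_match_accuracy` are hypothetical helpers, and the sample lists are placeholders, not real results.

```python
import string


def normalize(text):
    """Lowercase, trim whitespace, and strip surrounding punctuation."""
    return text.strip().lower().strip(string.punctuation + " ")


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Placeholder data for illustration only
references = ["positive", "negative", "positive", "negative"]
model_a = ["Positive.", "negative", "negative", " Negative "]
model_b = ["positive", "positive", "negative", "negative"]

print("Model A:", exact_match_accuracy(model_a, references))
print("Model B:", exact_match_accuracy(model_b, references))
```

Exact match is strict, so it suits tasks where you can instruct the model to reply with just a label or a number. Longer free-form answers need a different metric or human rating.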


## Working with Large Language Models

Large language models usually need more computing power than a single consumer CPU or GPU provides, so you typically call them through a web API that runs inference on cloud computing resources.

Cloud computing companies like Amazon Web Services, Google Cloud Platform, and Microsoft Azure all provide ways to run LLM inference on their servers.

Hugging Face provides an API for the models it hosts.

AI companies like OpenAI provide APIs for their own models.

Here's an example of how to do it using the `openai` Python package:

In [None]:
from openai import OpenAI

# Load the API key from a local .env file of KEY=VALUE lines.
# I know there's a library for this, but that's not really necessary here.
with open("../.env") as envfile:
    env = dict(
        line.strip().split("=", 1)
        for line in envfile
        if "=" in line and not line.lstrip().startswith("#")  # skip blanks and comments
    )

client = OpenAI(api_key=env["OPENAI_API_KEY"])

chat_history = [
    {"role": "system", "content": "You are a college academic advising assistant."},
    {"role": "user", "content": "What are the three most important classes for a CS major to take?"},
]

response = client.responses.create(
    model="gpt-5.2",
    input=chat_history
)

print(response.output_text)

There isn’t a single universal top three (it depends on whether you lean systems, AI/ML, theory, etc.), but for most CS majors the three *most foundational and widely required* classes are:

1) **Data Structures & Algorithms**
- Core ideas: asymptotic analysis (Big-O), arrays/lists/trees/graphs, hashing, sorting/searching, dynamic programming, algorithm design.
- Why it matters: it’s the backbone of technical interviews and underpins nearly every upper-division CS course.

2) **Computer Systems (Computer Organization / Systems Programming)**
- Core ideas: how code runs on hardware, memory, pointers, C/C++, assembly basics, CPU/ISA concepts, caching, processes/threads, debugging.
- Why it matters: gives you “below the hood” understanding that makes you a much stronger programmer and prepares you for OS, networking, performance work.

3) **Discrete Mathematics (and/or Theory of Computation)**
- Core ideas: logic, proofs, sets/relations, combinatorics, induction, graphs, automata, computa
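
As with the Hugging Face pipeline, one way to hold a multi-turn conversation is to resend the history on each request. The sketch below wraps that pattern in a hypothetical helper, `ask`. It assumes that messages with the `assistant` role are accepted in `input`, and it uses a fake client (`fake_client`, built here for illustration) so the cell runs offline; pass the real `client` to call the API.

```python
from types import SimpleNamespace


def ask(client, chat_history, user_message, model="gpt-5.2"):
    """Send the history plus a new user message; record the reply in the history."""
    chat_history.append({"role": "user", "content": user_message})
    response = client.responses.create(model=model, input=chat_history)
    chat_history.append({"role": "assistant", "content": response.output_text})
    return response.output_text


def fake_create(model, input):
    """Offline stand-in for client.responses.create; echoes the last user message."""
    return SimpleNamespace(output_text=f"(reply to: {input[-1]['content']})")


fake_client = SimpleNamespace(responses=SimpleNamespace(create=fake_create))

chat_history = [
    {"role": "system", "content": "You are a college academic advising assistant."},
]
print(ask(fake_client, chat_history, "What are the three most important classes for a CS major to take?"))
print(ask(fake_client, chat_history, "Which of those should I take first?"))
print(len(chat_history))
```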

## The Response object

Let's look at the response object that OpenAI returned:

In [11]:
response

Response(id='resp_0e9d08bab51c99a400698cd81e544c8193897377e6e4957ac5', created_at=1770838046.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-5.2-2025-12-11', object='response', output=[ResponseOutputMessage(id='msg_0e9d08bab51c99a400698cd81eada481939b2ed22f8cd116e7', content=[ResponseOutputText(annotations=[], text='There isn’t a single universal top three (it depends on whether you lean systems, AI/ML, theory, etc.), but for most CS majors the three *most foundational and widely required* classes are:\n\n1) **Data Structures & Algorithms**\n- Core ideas: asymptotic analysis (Big-O), arrays/lists/trees/graphs, hashing, sorting/searching, dynamic programming, algorithm design.\n- Why it matters: it’s the backbone of technical interviews and underpins nearly every upper-division CS course.\n\n2) **Computer Systems (Computer Organization / Systems Programming)**\n- Core ideas: how code runs on hardware, memory, pointers, C/C++, assembly basics, CPU/ISA c
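
`response.output_text` is a convenience that gathers the text from the nested structure printed above. The sketch below walks that structure by hand. The `type` values `"message"` and `"output_text"` are assumed here (the printout above is cut off before they appear), so check them against a real response. `collect_text` is a hypothetical helper, and `fake_response` is a hand-built stand-in.

```python
from types import SimpleNamespace


def collect_text(response):
    """Concatenate the text parts of every message item in response.output."""
    pieces = []
    for item in response.output:
        if getattr(item, "type", None) != "message":
            continue  # skip non-message items such as tool calls
        for part in item.content:
            if getattr(part, "type", None) == "output_text":
                pieces.append(part.text)
    return "".join(pieces)


# Hand-built stand-in for a response (not real API output)
fake_response = SimpleNamespace(output=[
    SimpleNamespace(type="web_search_call"),
    SimpleNamespace(type="message", content=[
        SimpleNamespace(type="output_text", text="There isn't a single universal top three"),
    ]),
])
print(collect_text(fake_response))
```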

## LLM Tools

Once it was discovered that LLMs could generate code, we realized that we could just *automatically run* code written by the model.

This opens the door to letting LLMs **do things** besides just generating text:
* search the web
* perform mathematical computations
* search for data in files
* run functions written by a programmer

You can provide access to these things using the `tools` parameter when submitting a response request:

In [12]:
chat_history = [
    {"role": "system", "content": "You are a helpful assistant who searches for and answers questions about Drake University. Do not answer questions about topics other than Drake University."},
    {"role": "user", "content": "How are the basketball teams doing this year?"},
]

response = client.responses.create(
    model="gpt-5.2",
    tools=[{"type": "web_search"}],
    input=chat_history
)

print(response.output_text)

For **this season (2025–26)**, Drake’s basketball teams have had a tougher year compared with last season:

- **Drake men’s basketball:** **12–13 overall**, **6–8 in Missouri Valley Conference (MVC)** play (as of the latest posted results). ([sports-reference.com](https://www.sports-reference.com/cbb/schools/drake/?utm_source=openai))  
- **Drake women’s basketball:** **6–15 overall**, **5–6 in MVC** play. ([sports-reference.com](https://www.sports-reference.com/cbb/schools/drake/women/?utm_source=openai))  

If you tell me whether you mean **overall record**, **MVC standing**, or **how they’ve looked lately (last 5–10 games)**, I can summarize it in the way you care about most.
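
The last bullet in the list above, running functions written by a programmer, works a little differently from web search: the model does not run your function itself. It returns a request naming the function and its arguments, and your code runs it. The sketch below shows the local half of that loop. Assumptions to check against the current OpenAI docs: the tool definition layout (`type`, `name`, `description`, and `parameters` as JSON Schema) and the `function_call` item fields (`name`, and `arguments` as a JSON string). `get_office_hours`, its data, and `run_function_call` are hypothetical.

```python
import json
from types import SimpleNamespace


def get_office_hours(course):
    """Hypothetical local function the model could ask us to run."""
    hours = {"CS 195": "Tue/Thu 2-3pm"}  # made-up data for illustration
    return hours.get(course, "unknown")


# Assumed layout for a function tool passed via the `tools` parameter
office_hours_tool = {
    "type": "function",
    "name": "get_office_hours",
    "description": "Look up office hours for a course.",
    "parameters": {
        "type": "object",
        "properties": {"course": {"type": "string"}},
        "required": ["course"],
    },
}

LOCAL_FUNCTIONS = {"get_office_hours": get_office_hours}


def run_function_call(item):
    """Run the local function named by a function_call item and return its result."""
    func = LOCAL_FUNCTIONS[item.name]
    kwargs = json.loads(item.arguments)  # arguments are assumed to arrive as a JSON string
    return func(**kwargs)


# Hand-built stand-in for an output item the model might return
fake_call = SimpleNamespace(type="function_call", name="get_office_hours",
                            arguments='{"course": "CS 195"}')
print(run_function_call(fake_call))
```

In a real exchange, you would send the function's result back to the model in a follow-up request so it can write the final answer.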


## Response object with tool usage

Notice all of the additional information contained in the response object:

In [13]:
response

Response(id='resp_0b4239343122681800698cd86874e88193a1f85e4699c7477b', created_at=1770838120.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-5.2-2025-12-11', object='response', output=[ResponseFunctionWebSearch(id='ws_0b4239343122681800698cd868eb4c8193ac18c0064b9ecd8a', action=ActionSearch(query='sports: {"tool":"sports","fn":"standings","league":"ncaamb"}', type='search', queries=['sports: {"tool":"sports","fn":"standings","league":"ncaamb"}'], sources=None), status='completed', type='web_search_call'), ResponseFunctionWebSearch(id='ws_0b4239343122681800698cd869db9081939ec4df912d05e9e1', action=ActionSearch(query="Drake Bulldogs men's basketball 2025-26 record", type='search', queries=["Drake Bulldogs men's basketball 2025-26 record", "Drake Bulldogs women's basketball 2025-26 record", "Drake Bulldogs men's basketball schedule 2025-26 standings Missouri Valley Conference", "Drake Bulldogs women's basketball schedule 2025-26 standings Missouri Valley 

We can zoom in and look specifically at the web searches it performed.

In [17]:
print(response.output[1].action.queries)

["Drake Bulldogs men's basketball 2025-26 record", "Drake Bulldogs women's basketball 2025-26 record", "Drake athletics men's basketball schedule results 2025-26", "Drake athletics women's basketball schedule results 2025-26"]
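
Indexing `response.output[1]` relies on the search call sitting at a particular position. A more general sketch loops over every output item and keeps those whose `type` is `'web_search_call'`, as seen in the printout above. `all_search_queries` is a hypothetical helper and `fake_response` is a hand-built stand-in. `getattr` is used defensively, on the assumption that some actions may carry no `queries` list.

```python
from types import SimpleNamespace


def all_search_queries(response):
    """Collect the queries from every web_search_call item in response.output."""
    queries = []
    for item in response.output:
        if getattr(item, "type", None) != "web_search_call":
            continue  # ignore messages and other item types
        action = getattr(item, "action", None)
        queries.extend(getattr(action, "queries", None) or [])
    return queries


# Hand-built stand-in for a response (not real API output)
fake_response = SimpleNamespace(output=[
    SimpleNamespace(type="web_search_call",
                    action=SimpleNamespace(type="search",
                                           queries=["Drake Bulldogs men's basketball 2025-26 record"])),
    SimpleNamespace(type="web_search_call",
                    action=SimpleNamespace(type="search", queries=None)),
    SimpleNamespace(type="message", content=[]),
])
print(all_search_queries(fake_response))
```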


## Applied Exploration

Come up with a prompt to use in a model comparison and prompt sensitivity experiment. Something like:

In [None]:
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain recursion to a sophomore CS student."},
]

Then, create two additional slight variations, one in the system prompt (e.g., *expert professor* instead of *helpful assistant*) and one in the user prompt.

Run each of the three variations using the [gpt-5.2 model](https://developers.openai.com/api/docs/models/gpt-5.2) (you can use web search if you want) three times and record all nine responses. 
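
One way to organize the nine runs is sketched below. `run_experiment` is a hypothetical helper that takes any `generate(chat)` function returning a string, so the same loop can serve both the API model and a local pipeline. The `fake_generate` stand-in and the specific prompt variations are placeholders.

```python
def run_experiment(generate, variations, repeats=3):
    """Call generate(chat) `repeats` times per variation; return {name: [responses]}."""
    return {
        name: [generate(chat) for _ in range(repeats)]
        for name, chat in variations.items()
    }


base_system = "You are a helpful assistant."
base_user = "Explain recursion to a sophomore CS student."

# One baseline, one system-prompt change, one user-prompt change
variations = {
    "baseline": [
        {"role": "system", "content": base_system},
        {"role": "user", "content": base_user},
    ],
    "system_variant": [
        {"role": "system", "content": "You are an expert professor."},
        {"role": "user", "content": base_user},
    ],
    "user_variant": [
        {"role": "system", "content": base_system},
        {"role": "user", "content": "Explain recursion to a high school student."},
    ],
}


def fake_generate(chat):
    """Offline stand-in; replace with a function that calls a real model."""
    return f"(response to: {chat[-1]['content']})"


results = run_experiment(fake_generate, variations)
print(sum(len(r) for r in results.values()))
```

With the `client` created earlier, `generate` could be `lambda chat: client.responses.create(model="gpt-5.2", input=chat).output_text`.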

Answer the following questions:
* When you repeated the request on the same prompt, how different were the responses?
* Were there any meaningful differences between the prompt variations you tried, or were they similar to the differences you noticed across repetitions of the same prompt?
* What changes seem to be the most meaningful?
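
To put a rough number on "how different", one option is the standard library's `difflib.SequenceMatcher`, which scores two strings from 0 (nothing in common) to 1 (identical). It compares characters, not meaning, so treat it as a complement to reading the responses rather than a replacement. `mean_pairwise_similarity` is a hypothetical helper and the sample strings are placeholders.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(responses):
    """Average SequenceMatcher ratio over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # zero or one response: nothing to compare
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Placeholder strings for illustration only
same_prompt_runs = [
    "Recursion is when a function calls itself.",
    "Recursion is when a function calls itself on a smaller input.",
    "A recursive function solves a problem by calling itself.",
]
print(round(mean_pairwise_similarity(same_prompt_runs), 2))
```

Comparing this score within one prompt (three repetitions) against the score across prompt variations gives a quick check on whether a variation changed the output more than ordinary run-to-run variation does.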

Then, repeat the experiment using a small model like [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct). 

* What differences did you notice between the large and small models?