<a href="https://colab.research.google.com/github/BuffaloManwich/CS5588-HW-1/blob/main/Week2_LLM_HandsOn_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 45-Minute Hands-On: LLMs with Hugging Face (Colab/Jupyter)

**Last updated:** 2025-09-01 05:29

## Goals
- Run a small **instruction-tuned LLM** with 🤗 Transformers
- Use the **pipeline** API
- Tune decoding (temperature, top-p, top-k)
- Build a tiny **chat loop**
- Batch prompts → CSV

In [1]:
# 1) Install dependencies
!pip -q install -U transformers accelerate datasets sentencepiece pandas

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.3.2 which is incompatible.
cudf-cu12 25.6.0 requires pandas<2.2.4dev0,>=2.0, but you have pandas 2.3.2 which is incompatible.
dask-cudf-cu12 25.6.0 requir

In [2]:
# 2) Imports & device
import torch, time
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

Device: cuda


## Model choice
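The loader first tries **TinyLlama/TinyLlama-1.1B-Chat-v1.0** and falls back to **distilgpt2** if that model fails to load (for example, on a download or out-of-memory error). Note that distilgpt2 is a small base model, not an instruction-tuned one, so it will follow prompts less reliably.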

In [3]:
# 3) Load model
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
fallback_model_id = "distilgpt2"

def _load(model_name):
    # Load tokenizer and weights: half precision on GPU to save memory, full precision on CPU.
    tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    mdl = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        device_map="auto" if device == "cuda" else None
    )
    return tok, mdl, model_name

def load_model(model_name):
    # Try the requested model first; on any failure, load the small fallback instead.
    try:
        return _load(model_name)
    except Exception as e:
        print("Primary failed:", e, "\nFalling back to", fallback_model_id)
        return _load(fallback_model_id)

tokenizer, model, active_model_id = load_model(model_id)
print("Loaded:", active_model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


Loaded: TinyLlama/TinyLlama-1.1B-Chat-v1.0


## Quickstart with `pipeline`

In [5]:
# 4) Text generation quickstart
gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "Explain what a Knowledge Graph is in healthcare, in 3 concise sentences."
out = gen(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_p=0.9, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
print(out)

Device set to use cuda:0


Explain what a Knowledge Graph is in healthcare, in 3 concise sentences.

3. Knowledge Graph: A Knowledge Graph is a powerful tool that helps healthcare professionals quickly find relevant information about patients, clinical conditions, medications, and more. It’s a collaborative database that provides a single, searchable source of information for healthcare providers. Knowledge Graphs enable doctors and nurses to make more informed decisions, improve patient outcomes, and save time.

4. What are the benefits of implementing Knowledge Graphs in healthcare?

4. Benefits of Implementing Knowledge
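
The printed text begins with the prompt because the text-generation pipeline returns prompt plus continuation in `generated_text` by default. Passing `return_full_text=False` in the `gen(...)` call asks the pipeline for the continuation only. The same effect can be had after the fact with a small string helper; the sketch below is model-free, and `strip_prompt` and the sample continuation are hypothetical names and data used only for illustration.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Return only the continuation when `generated` begins with `prompt`."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    # If the prompt is not echoed, leave the text as it is (minus outer whitespace).
    return generated.strip()


prompt = "Explain what a Knowledge Graph is in healthcare, in 3 concise sentences."
# Stand-in for the pipeline's `generated_text` (made-up continuation).
full_text = prompt + "\n\nA Knowledge Graph links patients, conditions, and medications."
print(strip_prompt(prompt, full_text))
```

Trimming the echo matters most when outputs are stored or compared, since otherwise every row repeats its own prompt.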


## Tokenization peek

In [6]:
# 5) Tokenization
text = "Large Language Models can draft emails and summarize clinical notes."
ids = tokenizer(text).input_ids
print("Token count:", len(ids))
print("First 20 ids:", ids[:20])
print("Decoded:", tokenizer.decode(ids))

Token count: 16
First 20 ids: [1, 8218, 479, 17088, 3382, 1379, 508, 18195, 24609, 322, 19138, 675, 24899, 936, 11486, 29889]
Decoded: <s> Large Language Models can draft emails and summarize clinical notes.
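
The count above is 16 ids for a sentence of 10 words and a period: one id is the leading `<s>` marker, and the remaining 15 show that several words were split into more than one subword piece. As a rough intuition for subword splitting, here is a toy greedy longest-match splitter. It is not the algorithm the loaded tokenizer uses, and `toy_vocab` is an invented vocabulary chosen only to make the split visible.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical vocabulary, for illustration only.
toy_vocab = {"summar", "ize", "clin", "ical", "notes"}
for w in ["summarize", "clinical", "notes"]:
    print(w, "->", greedy_subword_split(w, toy_vocab))
```

The practical takeaway is that token counts, not word counts, determine how much of `max_new_tokens` and the context window a text consumes.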


## Decoding controls (temperature/top-p/top-k)

In [7]:
# 6) Compare decoding
base_prompt = "Give 3 short tips for writing reproducible data science code:"
settings = [
    {"temperature": 0.2, "top_p": 0.95, "top_k": 50},
    {"temperature": 0.4, "top_p": 0.95, "top_k": 50},
    {"temperature": 0.2, "top_p": 0.85, "top_k": 50},
    {"temperature": 0.2, "top_p": 0.95, "top_k": 70},
    {"temperature": 0.6, "top_p": 0.9, "top_k": 50},
    {"temperature": 0.8, "top_p": 0.9, "top_k": 50},
    {"temperature": 1.1, "top_p": 0.85, "top_k": 50},
]
for i, s in enumerate(settings, 1):
    t0 = time.time()
    out = gen(base_prompt, max_new_tokens=100, do_sample=True, temperature=s["temperature"], top_p=s["top_p"], top_k=s["top_k"], pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
    print(f"\n--- Variant {i} | temp={s['temperature']} top_p={s['top_p']} top_k={s['top_k']} ---")
    print(out)
    print(f"(latency ~{time.time()-t0:.2f}s)")


--- Variant 1 | temp=0.2 top_p=0.95 top_k=50 ---
Give 3 short tips for writing reproducible data science code: 1. Use functions to encapsulate your code and make it easier to read and modify. 2. Use comments to explain your code and what it does. 3. Use variable names that are descriptive and easy to understand. 4. Use whitespace to make your code easier to read and understand. 5. Use error handling to catch any potential errors that may occur during your code execution.
(latency ~3.88s)

--- Variant 2 | temp=0.4 top_p=0.95 top_k=50 ---
Give 3 short tips for writing reproducible data science code: 1. Use descriptive variable names 2. Use meaningful variable labels 3. Use clear and concise variable names 4. Use comments to explain code logic and purpose 5. Use functions to simplify repetitive code 6. Use error handling to catch errors and prevent crashes 7. Use version control to track changes and revert to previous versions 8. Use appropriate variable types and data types to ensure ac
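
To make the three knobs concrete, here is a simplified, pure-Python sketch of one sampling step over a toy set of logits. Temperature rescales the logits before the softmax, top-k keeps only the k most likely tokens, and top-p keeps the smallest set of tokens whose cumulative probability reaches p. The function `sample_next` and the toy logits are assumptions made for illustration; the library's own implementation differs in detail.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token id after temperature, top-k, and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables this filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose renormalized cumulative mass reaches top_p.
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Draw from the surviving tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for temp in (0.2, 1.1):
    draws = [sample_next(toy_logits, temperature=temp, top_k=50, top_p=0.95, rng=rng)
             for _ in range(1000)]
    print(f"temperature={temp}: share of most likely token = {draws.count(0) / 1000:.2f}")
```

Low temperature concentrates nearly all probability on the top token, which is why the low-temperature variants above read as safer and more repetitive than the high-temperature ones.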

## Minimal chat loop

In [8]:
# 7) Simple chat helper
def build_prompt(history, user_msg, system="You are a helpful data science assistant."):
    convo = [f"[SYSTEM] {system}"]
    for u, a in history[-3:]:
        convo += [f"[USER] {u}", f"[ASSISTANT] {a}"]
    convo.append(f"[USER] {user_msg}\n[ASSISTANT]")
    return "\n".join(convo)

history = []

def chat_once(user_msg, max_new_tokens=128, temperature=0.7, top_p=0.9):
    prompt = build_prompt(history, user_msg)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        tokens = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature, top_p=top_p, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = tokens[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # The model may keep writing extra turns; keep only its first reply.
    reply = text.split("[USER]")[0].split("[ASSISTANT]")[0].strip()
    history.append((user_msg, reply))
    print(reply)

chat_once("In one sentence, what is transfer learning?")
chat_once("Name two risks when fine-tuning small LLMs on tiny datasets.")
chat_once("Suggest one mitigation for each risk.")

Transfer learning is a technique that allows us to use pre-trained models for a new task. We train a model on a large dataset of the same task and use the pre-trained weights to fine-tune it on a new dataset. This helps us avoid the need to retrain the model from scratch and saves time and resources.
The BERT model is a popular LLM used for NLP tasks, including text classification and summarization. It is designed
Sure! Pre-training is the process of training
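
The second and third replies above drift away from the questions asked. That is consistent with the model continuing past its own answer and inventing further `[USER]`/`[ASSISTANT]` turns, although the output alone does not prove it. The prompt layout and the trimming step can be checked without loading a model. In the sketch below, `trim_reply`, the `window` parameter, and the sample strings are hypothetical additions for illustration.

```python
def build_prompt(history, user_msg, system="You are a helpful data science assistant.", window=3):
    """Build the tagged prompt, keeping only the last `window` exchanges."""
    convo = [f"[SYSTEM] {system}"]
    for u, a in history[-window:]:
        convo += [f"[USER] {u}", f"[ASSISTANT] {a}"]
    convo.append(f"[USER] {user_msg}\n[ASSISTANT]")
    return "\n".join(convo)


def trim_reply(continuation):
    """Cut the continuation at the first role tag the model writes after its answer."""
    for tag in ("[USER]", "[ASSISTANT]", "[SYSTEM]"):
        continuation = continuation.split(tag)[0]
    return continuation.strip()


# Five made-up exchanges; only the last three should appear in the prompt.
history = [(f"q{i}", f"a{i}") for i in range(5)]
prompt = build_prompt(history, "next question")
print(prompt)

# Made-up continuation in which the model starts a new turn on its own.
raw = " Transfer learning reuses a pre-trained model.\n[USER] What is BERT?\n[ASSISTANT] BERT is"
print(trim_reply(raw))
```

Limiting the history window keeps the prompt short for a small model, at the cost of forgetting earlier turns.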


## Batch prompts → CSV

In [12]:
# 8) Batch prompts and save
import pandas as pd, time
prompts = [
    "Write a tweet (<=200 chars) about reproducible ML.",
    "One sentence: why eval metrics matter beyond accuracy.",
    "List 3 checks before deploying a model to production.",
    "Explain temperature vs. top-p to a PM."
]
rows = []
for p in prompts:
    t0 = time.time()
    out = gen(p, max_new_tokens=100, do_sample=True, temperature=0.7, top_p=0.9, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
    rows.append({"prompt": p, "output": out, "latency_s": round(time.time()-t0, 2)})
df = pd.DataFrame(rows)
df

| | prompt | output | latency_s |
|---|---|---|---|
| 0 | Write a tweet (<=200 chars) about reproducible... | Write a tweet (<=200 chars) about reproducible... | 1.63 |
| 1 | One sentence: why eval metrics matter beyond a... | One sentence: why eval metrics matter beyond a... | 2.85 |
| 2 | List 3 checks before deploying a model to prod... | List 3 checks before deploying a model to prod... | 3.03 |
| 3 | Explain temperature vs. top-p to a PM. | Explain temperature vs. top-p to a PM. 4. Expl... | 2.3 |


In [11]:
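# 8b) Save to CSV (download from left sidebar in Colab)
import os
out_path = "/mnt/data/hf_llm_batch_outputs.csv"
os.makedirs(os.path.dirname(out_path), exist_ok=True)  # create the folder if it does not exist yet
df.to_csv(out_path, index=False)
print("Saved to:", out_path)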

Saved to: /mnt/data/hf_llm_batch_outputs.csv
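
The batch-and-save pattern does not depend on pandas or on a particular model. As a minimal sketch using only the standard library, the code below times each call and writes the rows with `csv.DictWriter`. The names `run_batch`, `save_rows`, and `fake_generate` are hypothetical, and `fake_generate` is a stand-in for the real `gen(...)` call so the example runs anywhere.

```python
import csv
import os
import tempfile
import time


def run_batch(prompts, generate):
    """Call `generate` on each prompt and record the output and wall-clock latency."""
    rows = []
    for p in prompts:
        t0 = time.perf_counter()
        out = generate(p)
        rows.append({"prompt": p, "output": out,
                     "latency_s": round(time.perf_counter() - t0, 2)})
    return rows


def save_rows(rows, path):
    """Write the rows to a CSV file, creating the parent folder if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "output", "latency_s"])
        writer.writeheader()
        writer.writerows(rows)


def fake_generate(prompt):
    # Stand-in for a model call; returns placeholder text.
    return prompt + " [generated text would appear here]"


rows = run_batch(["Explain temperature vs. top-p to a PM.",
                  "List 3 checks before deploying a model to production."], fake_generate)
out_path = os.path.join(tempfile.mkdtemp(), "outputs", "batch.csv")
save_rows(rows, out_path)
print("Saved to:", out_path)
```

Writing to a temporary folder keeps the sketch portable; in Colab, any writable path visible in the file browser works the same way.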


## Ethics & safe use
- Verify critical facts (hallucinations happen).
- Respect privacy & licenses; avoid PHI/PII in prompts.
- Add guardrails/monitoring for production use.
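
As one small example of a pre-prompt guardrail, the sketch below masks email addresses and US-style phone numbers before text is sent to a model. The patterns and the `redact` helper are illustrative assumptions; they catch only simple formats and are not a substitute for a proper PHI/PII de-identification process.

```python
import re

# Simple illustrative patterns; real identifiers come in many more formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or 555-123-4567 about the results."))
# prints: Contact [EMAIL] or [PHONE] about the results.
```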