# LLM Prompting Basics with the Hugging Face Inference API

**Audience:** Senior-level CS students (beginner-friendly commentary)

**What you'll learn:**
- What *system*, *user*, and *assistant* messages are in chat prompting
- How to call the Hugging Face Inference API using `InferenceClient`
- How to craft prompts with roles, constraints, examples, and delimiters
- How parameters like `temperature`, `top_p`, and `max_tokens` affect outputs
- How to write better prompts by iterating from vague → structured

> ⚠️ You need a **Hugging Face API token** with Inference Endpoints access to run the API calls here.
Create one at https://huggingface.co/settings/tokens and set it as an environment variable named `HF_TOKEN`.

## 0) Setup

This section installs the Hugging Face Hub client and sets up a default model. 
We will use an *instruction-tuned* open model that supports chat (role messages).

**Notes for beginners:**
- The *model id* is a string pointing to a model hosted on Hugging Face.
- You can change the model later (e.g., switch to a different instruct model).
- Make sure the model supports **chat**/**instruct** style prompting.

In [None]:
# If running in an environment that does not have huggingface_hub installed, uncomment the next line:
# !pip install -q huggingface_hub

import os
from typing import List, Dict, Any
from huggingface_hub import InferenceClient

# === Choose a default chat/instruct model ===
# You can replace this with another chat-tuned model if desired.
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"

# === Read your HF token (create at https://huggingface.co/settings/tokens) ===
HF_TOKEN = os.environ.get("HF_TOKEN", None)
if HF_TOKEN is None:
    print("[INFO] No HF_TOKEN found in environment. Set it with:\n",
          "  import os; os.environ['HF_TOKEN'] = '<your-token>'\n",
          "or use your runtime's secret manager.")

# Create a client. If HF_TOKEN is None and the model requires auth, calls will fail.
client = InferenceClient(model=MODEL_ID, token=HF_TOKEN)
print(f"Ready. Using model: {MODEL_ID}")

## 1) Chat Roles: system, user, assistant

**Key idea:** Chat LLMs accept a *list* of messages. Each message has a `role` and `content`.

- **system**: sets high-level behavior, style, guardrails (think of it as an initial *instruction banner*).
- **user**: your actual question or task.
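- **assistant**: the model's reply. You normally leave this for the model to generate, but you can include earlier assistant turns as conversation history or as few-shot examples (see Section 4).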

These are *prompting conventions*—not parts of the neural network architecture.
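
To make the "convention, not architecture" point concrete, here is a minimal sketch of how a role list could be flattened into one string before tokenization. The `render_chat` helper and the `[INST] ... [/INST]` layout are simplified assumptions for illustration, loosely modeled on Mistral-style instruct formatting; the real template for a given model ships with its tokenizer and may differ.

```python
from typing import Dict, List

def render_chat(messages: List[Dict[str, str]]) -> str:
    """Flatten role messages into one prompt string (illustrative template only)."""
    # Collect all system text; this toy template folds it into the first user turn.
    pending_system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    out = []
    for m in messages:
        if m["role"] == "user":
            text = f"{pending_system}\n\n{m['content']}" if pending_system else m["content"]
            pending_system = ""  # only prepend the system text once
            out.append(f"[INST] {text} [/INST]")
        elif m["role"] == "assistant":
            out.append(f" {m['content']}")
    return "".join(out)

demo = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Define attention."},
]
print(render_chat(demo))
```

The model only ever sees the resulting token sequence; the roles matter because the model was fine-tuned on text laid out this way.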

In [None]:
# Minimal working example: one system + one user message.
# For more repeatable outputs, set temperature=0 (hosted backends may still vary slightly between runs).

messages = [
    {"role": "system", "content": "You are a concise, precise teaching assistant."},
    {"role": "user", "content": "Explain the idea of 'attention' in transformers in one sentence."}
]

try:
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=messages,
        max_tokens=150,
        temperature=0.0,   # lower = less random, more deterministic
        top_p=1.0          # use full distribution (you can also try 0.9)
    )
    print(response.choices[0].message["content"])  # content of the assistant's reply
except Exception as e:
    print("[WARN] API call failed:", e)
    print("If you don't have a token or network access in this environment, read the code and try locally.")

## 2) Adding Constraints & Output Formatting

A **good prompt** clearly tells the model *what* to do and *how* to format the answer. 
For programs, JSON is often a useful structured format.

In [None]:
# Ask for JSON output and explicitly describe the schema.

messages = [
    {"role": "system", "content": (
        "You are a precise CS tutor. Always follow the required output schema if provided."
    )},
    {"role": "user", "content": (
        "Explain attention in transformers in 2 bullet points."
        "\nReturn JSON matching this schema: {\"bullets\":[\"...\",\"...\"]}."
    )}
]

try:
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=messages,
        max_tokens=200,
        temperature=0.0
    )
    raw = response.choices[0].message["content"]
    print("Raw model output:\n", raw)
except Exception as e:
    print("[WARN] API call failed:", e)
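
Asking for JSON does not guarantee valid JSON, so it helps to parse and check the reply before using it. The sketch below assumes the `{"bullets": [...]}` schema from the prompt above; `parse_bullets` is a hypothetical helper name, and the code-fence stripping covers only the common case where a model wraps its JSON in a Markdown fence.

```python
import json
from typing import List

def parse_bullets(raw: str, expected: int = 2) -> List[str]:
    """Parse model output as JSON and check it matches {"bullets": [str, ...]}."""
    text = raw.strip()
    # Models sometimes wrap JSON in a Markdown code fence; strip it if present.
    if text.startswith("```"):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    bullets = data.get("bullets") if isinstance(data, dict) else None
    if not isinstance(bullets, list) or len(bullets) != expected:
        raise ValueError(f"expected a 'bullets' list of length {expected}")
    if not all(isinstance(b, str) for b in bullets):
        raise ValueError("every bullet must be a string")
    return bullets

sample = '```json\n{"bullets": ["Weighs tokens by relevance.", "Builds context-aware vectors."]}\n```'
print(parse_bullets(sample))
```

If parsing fails, a common next step is to retry the request or tighten the prompt rather than trying to repair the output by hand.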

## 3) Delimiters for Clarity

When you include long text in a prompt (like instructions or examples), **delimiters** help the model understand boundaries.
Common patterns:
- Triple backticks ``` for text blocks
- XML-like tags `<context> ... </context>`
- Markdown headings or separators

We will show a prompt using triple backticks to clearly separate a data block from the instruction.

In [None]:
messages = [
    {"role": "system", "content": "You are a clear and honest teaching assistant."},
    {"role": "user", "content": (
        "Use the text between triple backticks as the source for your explanation.\n"
        "Explain the main idea in 2 short bullet points.\n\n"
        "```\nSelf-attention allows each token in a sequence to selectively focus on other tokens,\n"
        "based on learned similarity scores, creating context-aware representations.\n```"
    )}
]

try:
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=messages,
        max_tokens=150,
        temperature=0.0
    )
    print(response.choices[0].message["content"])
except Exception as e:
    print("[WARN] API call failed:", e)
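
The same idea works with the XML-like tags listed above. The sketch below uses a hypothetical helper, `wrap_in_tags`, that wraps a text block and refuses input that already contains the closing tag, since that would blur the boundary the delimiter is meant to mark.

```python
def wrap_in_tags(text: str, tag: str = "context") -> str:
    """Wrap text in XML-like tags; reject text that already contains the closing tag."""
    closing = f"</{tag}>"
    if closing in text:
        raise ValueError(f"text already contains {closing}; pick a different tag")
    return f"<{tag}>\n{text.strip()}\n{closing}"

source = "Self-attention lets each token focus on other tokens."
user_prompt = (
    "Summarize only the text inside the <context> tags in 2 short bullet points.\n\n"
    + wrap_in_tags(source)
)
print(user_prompt)
```

The resulting `user_prompt` string can be passed as the `content` of a user message exactly like the backtick version.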

## 4) Few-Shot Prompting (Providing Examples)

You can guide the style and structure of the answer by giving examples. This is called **few-shot prompting**.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful and concise CS tutor."},
    # Example 1 (as if the assistant had responded):
    {"role": "user", "content": "Explain backpropagation in 2 bullet points."},
    {"role": "assistant", "content": "- Computes gradients layer-by-layer using the chain rule.\n- Updates parameters to reduce loss."},
    # Example 2:
    {"role": "user", "content": "Explain overfitting in 2 bullet points."},
    {"role": "assistant", "content": "- Model memorizes training data patterns and noise.\n- Fails to generalize to unseen data."},
    # Now the real question we want answered (the pattern is clear):
    {"role": "user", "content": "Explain attention in transformers in 2 bullet points."}
]

try:
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=messages,
        max_tokens=120,
        temperature=0.0
    )
    print(response.choices[0].message["content"])
except Exception as e:
    print("[WARN] API call failed:", e)
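
Writing the alternating user/assistant pairs by hand gets tedious once you have more than a couple of examples. A small builder, sketched here under the hypothetical name `build_few_shot`, produces the same message layout from a list of (question, answer) pairs.

```python
from typing import Dict, List, Tuple

def build_few_shot(
    system: str,
    examples: List[Tuple[str, str]],
    question: str,
) -> List[Dict[str, str]]:
    """Build messages: system, then (user, assistant) example pairs, then the real question."""
    messages = [{"role": "system", "content": system}]
    for example_question, example_answer in examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": question})
    return messages

few_shot = build_few_shot(
    "You are a helpful and concise CS tutor.",
    [("Explain overfitting in 2 bullet points.",
      "- Model memorizes training data patterns and noise.\n- Fails to generalize to unseen data.")],
    "Explain attention in transformers in 2 bullet points.",
)
print([m["role"] for m in few_shot])
```

The returned list has the same shape as the hand-written `messages` above, so it can be passed to the chat call unchanged.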

## 5) Knobs: temperature, top_p, and max_tokens

- **temperature**: randomness. Lower (0–0.3) → more deterministic; higher (0.7–1.0) → more creative.
- **top_p**: nucleus sampling. With 0.9, the model samples only from the smallest set of most-likely tokens whose cumulative probability reaches 90%; the unlikely tail is discarded.
- **max_tokens**: maximum tokens to *generate* (does not limit prompt length).

Try changing these and observe the differences.

In [None]:
prompt_text = "List three creative analogies for how attention works in transformers."

def run_with_settings(temp: float, top_p: float, max_toks: int):
    messages = [
        {"role": "system", "content": "You are creative but concise."},
        {"role": "user", "content": prompt_text}
    ]
    try:
        response = client.chat.completions.create(
            model=MODEL_ID,
            messages=messages,
            temperature=temp,
            top_p=top_p,
            max_tokens=max_toks
        )
        print(f"\n=== temperature={temp}, top_p={top_p}, max_tokens={max_toks} ===")
        print(response.choices[0].message["content"])
    except Exception as e:
        print("[WARN] API call failed:", e)

# Run a few variants (feel free to tweak)
run_with_settings(0.0, 1.0, 120)
run_with_settings(0.7, 0.9, 120)
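
To see what `temperature` and `top_p` do to the next-token distribution, the toy sketch below applies both to a made-up set of four candidate tokens. The logit values and the helper names `apply_temperature` and `top_p_filter` are assumptions chosen for illustration; real inference servers implement the same idea over the full vocabulary, and a temperature of 0 is usually treated as greedy decoding rather than a division by zero.

```python
import math
from typing import Dict

def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    kept: Dict[str, float] = {}
    mass = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}

toy_logits = {"focus": 2.0, "spotlight": 1.0, "magnet": 0.0, "banana": -2.0}
for t in (0.3, 1.0):
    probs = apply_temperature(toy_logits, t)
    print(f"temperature={t}:", {k: round(v, 3) for k, v in probs.items()})
print("top_p=0.9:", sorted(top_p_filter(apply_temperature(toy_logits, 1.0), 0.9)))
```

Lower temperature concentrates probability on the top token, while `top_p` removes the low-probability tail before sampling.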

## 6) Bad → Better Prompts

**Bad (vague):**
```
Explain attention
```

**Better (structured):**
```
System: You are a CS teaching assistant.
User: Explain attention in transformers in 3 bullet points for senior CS students. Avoid equations.
```

**Even better (with formatting + guardrails):**
```
System: You are a precise technical tutor.
User: Explain attention in transformers in exactly 3 bullets. No equations; keep each bullet under 20 words.
Format:
- bullet 1
- bullet 2
- bullet 3
If not sure, say "I don't know."
```
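
Constraints like "exactly 3 bullets" and "under 20 words" are easy to verify in code, which makes prompt iteration less subjective. The checker below is a minimal sketch; `check_bullets` is a hypothetical helper, and it assumes bullets start with `- ` as in the format above (so an "I don't know." reply would be reported as a violation and needs separate handling).

```python
from typing import List

def check_bullets(reply: str, n_bullets: int = 3, max_words: int = 20) -> List[str]:
    """Return a list of constraint violations; an empty list means the reply passes."""
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    bullets = [ln[2:] for ln in lines if ln.startswith("- ")]
    problems = []
    if len(bullets) != len(lines):
        problems.append("found lines that are not '- ' bullets")
    if len(bullets) != n_bullets:
        problems.append(f"expected {n_bullets} bullets, got {len(bullets)}")
    for i, bullet in enumerate(bullets, start=1):
        words = len(bullet.split())
        if words >= max_words:
            problems.append(f"bullet {i} has {words} words (limit: under {max_words})")
    return problems

good_reply = "- Attention weighs tokens.\n- It uses learned scores.\n- Output is context-aware."
print(check_bullets(good_reply))
```

Running the checker over replies from each prompt variant shows whether the added constraints are actually being followed.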

## 7) Exercises (Beginner-Friendly)

1. **Role Tuning**: Change the system prompt's *tone* (e.g., "friendly", "formal", "Socratic") and see how responses differ.
2. **Schema Control**: Ask for JSON output with a specific schema (e.g., `{"summary": "...", "bullets": []}`). Validate that it conforms.
3. **Delimiter Practice**: Insert a long text block with triple backticks and ask the model to summarize **only** that text.
4. **Few-Shot**: Provide 1–2 example Q/A pairs before your real question and observe style transfer.
5. **Knob Sweeps**: Try `temperature` values {0.0, 0.3, 0.7} and `top_p` values {0.9, 1.0}. Note differences.

> Tip: Start simple, check the output, then **iterate** your prompt—this is normal and expected in practice.

## 8) Utility: Simple Chat Wrapper (Optional)

This helper function makes it easy to run different prompts without repeating boilerplate.

In [None]:
def chat(
    system: str,
    user: str,
    *,
    model: str = MODEL_ID,
    temperature: float = 0.0,
    top_p: float = 1.0,
    max_tokens: int = 256
) -> str:
    """Minimal wrapper around the HF chat completion API.

    Args:
        system: The system prompt string (behavior/rules).
        user: The user prompt (task/question).
        model: Model id to use (chat-tuned is best).
        temperature: Randomness (0.0 = deterministic-ish).
        top_p: Nucleus sampling cap.
        max_tokens: Max tokens to generate in the reply.
    Returns:
        Text content from the assistant's reply.
    """
    msgs = [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
    try:
        out = client.chat.completions.create(
            model=model,
            messages=msgs,
            temperature=temperature,
            top_p=top_p,
            max_tokens=max_tokens,
        )
        return out.choices[0].message.get("content", "")
    except Exception as e:
        return f"[WARN] API call failed: {e}"

# Example usage (uncomment to try):
# print(chat(
#     system="You are a precise CS tutor.",
#     user="Explain attention in transformers in exactly 2 bullet points.",
# ))

## 9) Completion (Single-String) Prompting (Optional)

Some models also support a *completion* API that takes a single string (no roles).
You can emulate roles by embedding them as text in your prompt. This is useful for models without chat templates.

In [None]:
completion_prompt = (
    "System: You are a concise technical writer.\n"
    "User: Explain attention in transformers in two short bullet points.\n"
    "Format: Start each line with '- '."
)

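try:
    # InferenceClient exposes single-string generation as text_generation();
    # it has no `completions` attribute. With the default details=False it returns a str.
    text = client.text_generation(
        completion_prompt,
        model=MODEL_ID,
        max_new_tokens=120,
        do_sample=False,   # greedy decoding, the counterpart of temperature=0 in the chat calls
    )
    print(text)
except Exception as e:
    print("[WARN] Text-generation call failed:", e)
    print("Some chat-tuned models are only served through the chat endpoint.")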

## 10) Wrap-up

- **System**/**User**/**Assistant** are *roles* used to format chat prompts.
- Clear instructions + constraints + examples → **better, more reliable outputs**.
- Tweak `temperature`, `top_p`, and `max_tokens` to control style and length.
- Iterate on your prompt—this is normal engineering practice.

**Next steps:** Try adapting these patterns to your own coursework or projects.