# Chatbot with Conversation History

*Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama Community License Agreement.*

<a href="https://colab.research.google.com/github/meta-llama/llama-cookbook/blob/main/end-to-end-use-cases/chatbot-with-conversation-history/chatbot-with-conversation-history.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial shows you how to build a chatbot with conversation history. Using Llama 4, we will create a conversational agent that takes a URL, understands its content, and allows you to have an interactive conversation about it while maintaining conversation history.

| Component          | Choice                                     | Why                   |
| :----------------- | :----------------------------------------- | :-------------------- |
| **Model**          | `Llama-4-Maverick-17B-128E-Instruct-FP8`     | A powerful Mixture-of-Experts (MoE) model ideal for complex instruction-following. Llama 4 Maverick offers superior performance and a massive context window (up to 1M tokens). |
| **Pattern**        | In-context learning + sliding window memory | We will pass the entire webpage content directly into the model's context. Llama 4's large context window makes this simple approach viable for even very large pages, often removing the need for a complex RAG system.        |           
| **Infrastructure**        | Meta's official [Llama API](https://llama.developer.meta.com/)          | Provides serverless, production-ready access to Llama 4 models using the `llama_api_client` SDK.          |
---

**Note on Inference Providers:** This tutorial uses the Llama API for demonstration purposes. However, you can run Llama 4 models with any preferred inference provider. Common examples include [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html) and [Together AI](https://together.ai/llama). The core logic of this tutorial can be adapted to any of these providers.

## What you will learn

- **The fundamentals of chat completion:** How to structure conversations using system, user, and assistant roles.
- **How to manage conversation history:** Implement a sliding window to maintain context in long conversations without exceeding token limits.
- **Practical prompt engineering:** How to guide the model to answer questions based *only* on provided text.
- **How to perform meta-tasks:** Leverage the model to summarize the conversation history.
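
As a minimal sketch of the first point, a chat request is usually a list of role-tagged messages: a system message first, then alternating user and assistant turns, then the newest user message. The helper name `build_messages` and the sample strings are illustrative assumptions, not part of any SDK.

```python
from typing import Dict, List


def build_messages(system: str, turns: List[Dict[str, str]], user_msg: str) -> List[Dict[str, str]]:
    """Assemble a message list: system prompt, prior turns, newest user message."""
    return [
        {"role": "system", "content": system},
        *turns,
        {"role": "user", "content": user_msg},
    ]


# Hypothetical sample conversation (illustrative strings only).
previous_turns = [
    {"role": "user", "content": "What is this page about?"},
    {"role": "assistant", "content": "It introduces two new models."},
]
messages = build_messages(
    "Answer only from the supplied text.", previous_turns, "Which one is larger?"
)
roles = [m["role"] for m in messages]
print(roles)
```

The model sees the whole list on every call, which is why the history has to be stored and resent by the application.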

## Install dependencies

You will need a few libraries for this project: `requests` to download webpages, `readability-lxml` to extract the core content, `beautifulsoup4` to strip leftover non-content tags, `markdownify` to convert HTML to clean Markdown, `tiktoken` for token counting, and the official `llama-api-client`.

In [22]:
!uv pip install --quiet requests beautifulsoup4 readability-lxml markdownify tiktoken llama-api-client

## Imports & Llama API client setup

In this tutorial, we will use [Llama API](https://llama.developer.meta.com/) as the inference provider. If you don't already have an API key, get one from Llama API first. Then set it as the `LLAMA_API_KEY` environment variable, which the code below reads.

Remember, you can adapt this section to use your preferred inference provider.

In [26]:
import os, sys, re, html, textwrap
import requests
from typing import List, Dict
from bs4 import BeautifulSoup
import tiktoken
from readability import Document
from markdownify import markdownify
from llama_api_client import LlamaAPIClient

In [7]:
# --- Llama client ---
API_KEY = os.getenv("LLAMA_API_KEY")
if not API_KEY:
    sys.exit("❌  Please set the LLAMA_API_KEY environment variable.")

client = LlamaAPIClient(api_key=API_KEY)

## Fetch and clean a webpage

To get high-quality responses from the model, you first need to provide it with high-quality data. Raw HTML contains a lot of "noise" (like navigation bars, ads, and scripts) that can distract the model. The following function implements a three-step process to transform a messy webpage into clean, structured Markdown that is ideal for the LLM.

1.  **Extract Core Content:** It uses the `readability` library to pull out the main body of the article, discarding common boilerplate like headers, footers, and sidebars.
2.  **Final Cleanup:** It uses `BeautifulSoup` to remove any remaining non-content tags, such as `<script>`, `<style>`, `<nav>`, and `<footer>`.
3.  **Convert to Markdown:** It converts the clean HTML to Markdown using `markdownify`. This is better than plain text because it preserves important semantic structure—such as headings, lists, and links—which helps the model better understand the content's hierarchy and meaning.

In [38]:
def fetch_page_text(url: str, timeout: int = 15) -> str:
    """Download a webpage and return its main content as Markdown (scripts/styles removed)."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    html_raw = r.text

    # ---- 1. keep only the main article if possible -----------------------
    html_main = Document(html_raw).summary(html_partial=True)
    soup = BeautifulSoup(html_main, "html.parser")

    # ---- 2. drop noise ----------------------------------------------------
    for tag in soup(["script", "style", "noscript", "header", "footer", "nav", "aside"]):
        tag.decompose()

    # ---- 3. html to markdown ----------------------------------------------
    cleaned_html = str(soup)
    md_text = markdownify(cleaned_html, heading_style="ATX")      # ## Heading
    md_text = html.unescape(md_text)
    md_text = re.sub(r"\n{3,}", "\n\n", md_text).strip()

    return md_text
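
The last two lines of step 3 can be tried on their own. The sketch below uses only the standard library and a made-up Markdown string; the helper name `normalize_markdown` is an assumption for illustration and does not exist in the tutorial.

```python
import html
import re


def normalize_markdown(md_text: str) -> str:
    """Decode HTML entities, then collapse runs of 3+ newlines into one blank line."""
    md_text = html.unescape(md_text)
    md_text = re.sub(r"\n{3,}", "\n\n", md_text)
    return md_text.strip()


# Made-up input with leftover entities and excess blank lines.
sample = "## Title\n\n\n\nTom &amp; Jerry\n\n\n* item &lt;1&gt;\n"
cleaned = normalize_markdown(sample)
print(cleaned)
```

Collapsing blank lines keeps the Markdown structure intact while trimming tokens that carry no information.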

In [40]:
url = input("🔗  Paste a URL to chat about: ").strip()
raw_article = fetch_page_text(url)
print(f"✅  Retrieved {len(raw_article):,} characters.")
print(raw_article)

🔗  Paste a URL to chat about:  https://ai.meta.com/blog/llama-4-multimodal-intelligence/


✅  Retrieved 20,286 characters.
## Takeaways

* We’re sharing the first models in the Llama 4 herd, which will enable people to build more personalized multimodal experiences.
* Llama 4 Scout, a 17 billion active parameter model with 16 experts, is the best multimodal model in the world in its class and is more powerful than all previous generation Llama models, while fitting in a single NVIDIA H100 GPU. Additionally, Llama 4 Scout offers an industry-leading context window of 10M and delivers better results than Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of widely reported benchmarks.
* Llama 4 Maverick, a 17 billion active parameter model with 128 experts, is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding—at less than half the active parameters. Llama 4 Maverick offers a best-in-class performance to

## Managing the context window

Llama models have a fixed context window. The context window is the maximum number of tokens they can consider at one time. A key advantage of Llama 4 is the size of this window. `Llama-4-Maverick-17B-128E-Instruct-FP8` supports up to 1 million tokens, enabling you to pass entire books or extensive documents as context.

While Llama 4 offers a very large context window of 1M tokens, most API providers support smaller token windows than this. As this tutorial uses Llama API, we'll work within its token window, which is 128k. We must ensure that our entire prompt, which includes the system message, the webpage content, and the conversation history, fits within this limit.

To prevent errors, we will truncate the webpage content if it's too long. We will use `tiktoken` for token counting and truncation. We will reserve a `HEADROOM` of 16,384 tokens to accommodate a long-running chat history and the model's next response, and trim the article to fit the remaining space. Note that `tiktoken` provides a fast local count, but the `o200k_base` encoding used here is not Llama 4's own tokenizer, so the exact number of tokens processed by the API can differ; hence, we use the '≈' symbol for the count.

In [42]:
MAX_CTX = 128000 # A practical context window for Llama 4 Maverick
HEADROOM = 16384 # for turns + response
MAX_ARTICLE = MAX_CTX - HEADROOM

encoding = tiktoken.get_encoding("o200k_base")
def count_tokens(s: str) -> int:
    """Returns the number of tokens in a text string."""
    return len(encoding.encode(s))

def truncate(text: str, max_tokens: int = MAX_ARTICLE) -> str:
    """Truncates a text string to a maximum number of tokens."""
    tokens = encoding.encode(text)  # encode once and reuse for the length check
    if len(tokens) <= max_tokens:
        return text

    truncated_tokens = tokens[:max_tokens]
    return encoding.decode(truncated_tokens, errors='ignore') + "\n\n[... truncated to fit context ...]"
    
article = truncate(raw_article)
print(f"Article now ≈ {count_tokens(article)} tokens.")

Article now ≈ 4216 tokens.
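
The budget arithmetic can be checked in isolation. This sketch reuses the tutorial's two constants; the helper names `article_budget` and `fits` are assumptions added for illustration.

```python
MAX_CTX = 128_000   # context limit assumed in this tutorial
HEADROOM = 16_384   # reserved for prompts, chat turns, and the next response


def article_budget(max_ctx: int = MAX_CTX, headroom: int = HEADROOM) -> int:
    """Tokens left for the article after reserving headroom."""
    if headroom >= max_ctx:
        raise ValueError("headroom must be smaller than the context window")
    return max_ctx - headroom


def fits(article_tokens: int, budget: int) -> bool:
    """True if the article can be sent without truncation."""
    return article_tokens <= budget


budget = article_budget()
print(budget)              # 111616
print(fits(4216, budget))  # True
```

With these numbers, the sample article of roughly 4,216 tokens is far below the 111,616-token budget, so `truncate` returns it unchanged.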


## Chatbot with conversation history

Next, we will create a `PageChat` class to encapsulate the chatbot's logic and manage its state. Using a class is a clean way to handle the conversation history and model configuration.

The `SYSTEM_PROMPT` is a key component. It provides the model with its core instructions, defining its persona and constraints. A best practice is to be highly specific. Here, we instruct it to answer questions *only* from the provided webpage text and to explicitly state when information is missing. This is a critical technique for grounding the model and reducing the likelihood of fabricated answers (hallucinations).

The `_messages` method assembles the final payload sent to the API. Notice the order:
1.  The `SYSTEM_PROMPT` sets the overall behavior.
2.  A second system message injects the `article` content as the context.
3.  The last `k` turns of the conversation history are included, implementing a sliding window for memory.
4.  The latest `user_msg` is added.

This structure ensures the model has all the necessary context to generate a relevant and accurate response.
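
The sliding window in step 3 is a list slice. The sketch below isolates it, on the assumption that one turn is exactly one user message followed by one assistant message. The helper name `last_k_turns` is hypothetical. It also guards `k = 0`, because a plain `history[-0:]` slice returns the whole list, not an empty one.

```python
from typing import Dict, List


def last_k_turns(history: List[Dict[str, str]], k: int) -> List[Dict[str, str]]:
    """Return the messages of the most recent k turns (one turn = user + assistant)."""
    if k <= 0:
        return []  # history[-0:] would return everything, so handle this explicitly
    return history[-k * 2:]


# Build four made-up turns.
history: List[Dict[str, str]] = []
for i in range(1, 5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

window = last_k_turns(history, k=2)
print([m["content"] for m in window])
# ['question 3', 'answer 3', 'question 4', 'answer 4']
```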

In [43]:
SYSTEM_PROMPT = (
    "You are PageChat, an AI that answers questions **only** from the supplied "
    "webpage text. If information is absent, say so. Be concise."
)

class PageChat:
    def __init__(self, article_text: str,
                 model: str = "Llama-4-Maverick-17B-128E-Instruct-FP8",
                 history_window: int = 128):
        self.article = article_text
        self.model  = model
        self.k      = history_window
        self.history: List[Dict[str, str]] = []

    def _messages(self, user_msg: str) -> List[Dict[str, str]]:
        msgs = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "system", "content": f"[WEBPAGE]\n\n{self.article}"},
            *self.history[-self.k*2:],
            {"role": "user", "content": user_msg},
        ]
        return msgs

    def chat(self, user_msg: str) -> str:
        resp = client.chat.completions.create(
            model=self.model,
            messages=self._messages(user_msg),
            temperature=0.1,  # Lower temperature for more factual, deterministic answers
        )
        assistant = resp.completion_message.content.text
        self.history.extend([
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant},
        ])
        return assistant
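
`PageChat` limits history by number of turns, while `HEADROOM` is measured in tokens. One possible refinement, sketched here as an assumption and not as part of the tutorial, is to drop the oldest user/assistant pairs until the history fits a token budget. `trim_history` is a hypothetical helper, and `word_count` is a crude stand-in for a real token counter such as `count_tokens`.

```python
from typing import Callable, Dict, List


def trim_history(
    history: List[Dict[str, str]],
    budget: int,
    count_fn: Callable[[str], int],
) -> List[Dict[str, str]]:
    """Drop the oldest user/assistant pairs until the total count fits the budget."""
    trimmed = list(history)  # copy so the caller's list is not modified
    while trimmed and sum(count_fn(m["content"]) for m in trimmed) > budget:
        trimmed = trimmed[2:]  # remove one full turn from the front
    return trimmed


def word_count(text: str) -> int:
    """Stand-in counter for this demo only; real code would count tokens."""
    return len(text.split())


history = [
    {"role": "user", "content": "one two three"},             # 3 words
    {"role": "assistant", "content": "four five six seven"},  # 4 words
    {"role": "user", "content": "eight nine"},                # 2 words
    {"role": "assistant", "content": "ten"},                  # 1 word
]
kept = trim_history(history, budget=5, count_fn=word_count)
print([m["content"] for m in kept])
# ['eight nine', 'ten']
```

Removing whole pairs keeps the user/assistant alternation intact, so the remaining history still starts with a user message.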

## Run the interactive chat loop

This final piece of code starts the interactive session. It creates an instance of the `PageChat` bot and enters a loop, waiting for your input. Type "exit" or "quit" to end the conversation.

In [44]:
bot = PageChat(article)
print("\n🤖  Ask me about the page!  Type 'exit' to quit.")
while True:
    try:
        user = input("\nYou: ").strip()
    except (EOFError, KeyboardInterrupt):
        break
    if user.lower() in {"exit", "quit"}:
        break
    if not user:
        continue
    answer = bot.chat(user)
    print(f"\nPageChat: {answer}")


🤖  Ask me about the page!  Type 'exit' to quit.



You:  Give me a 2 sentence summary



PageChat: Meta is releasing Llama 4 Scout and Llama 4 Maverick, two new multimodal AI models that offer state-of-the-art performance and are available for download on llama.com and Hugging Face. The models are part of the Llama 4 series, which also includes the larger Llama 4 Behemoth model that is still in training and has shown exceptional performance on various benchmarks.



You:  What is the difference between the Llama 4 Maverick and Llama 4 Scout models?



PageChat: Llama 4 Maverick and Llama 4 Scout are both 17 billion active parameter models, but they differ in the number of experts used in their mixture-of-experts (MoE) architecture: Llama 4 Maverick has 128 experts, while Llama 4 Scout has 16 experts. This difference affects their performance, with Llama 4 Maverick outperforming Llama 4 Scout and other models on various benchmarks, but also potentially requiring more resources.



You:  What is the max context length of the Llama 4 Scout model?



PageChat: The Llama 4 Scout model has a context window of 10 million tokens.



You:  What is the pricing of the Llama 4 Maverick model?



PageChat: The webpage does not mention the pricing of the Llama 4 Maverick model. It does mention that Llama 4 Maverick offers a "best-in-class performance to cost ratio", but the actual cost is not specified.



You:  When will the Behemoth model be released?



PageChat: The webpage does not provide a specific release date for the Llama 4 Behemoth model, stating only that it is "still training" and that more details will be shared "even while it's still in flight".



You:  exit


Sample queries to try:

* *"Give me a two-sentence summary."*
* *"What is the pricing?"*
* *Follow-up:* *"What are the top 3 takeaways?"*

## Bonus: Summarize the conversation

Since you are storing the conversation history, you can use the model to perform meta-tasks on it, such as summarizing the chat. This can be useful for logging, analysis, or providing a user with a quick recap of a long interaction.

In [45]:
def summarize_conversation(bot: PageChat, max_tokens: int = 128) -> str:
    msgs = [
        {"role": "system", "content": "Summarize the chat in 3 concise bullets."},
        {"role": "user",   "content": "\n".join(m['content'] for m in bot.history)},
    ]
    resp = client.chat.completions.create(
        model=bot.model,
        messages=msgs,
        max_completion_tokens=max_tokens,
    )
    return resp.completion_message.content.text.strip()

print("\nChat so far:")
print(summarize_conversation(bot))


Chat so far:
Here is a 2-sentence summary:

Meta has released two new multimodal AI models, Llama 4 Scout and Llama 4 Maverick, which offer state-of-the-art performance and are available for download. The models are part of the Llama 4 series, which also includes the larger Llama 4 Behemoth model that is still in training and has no specified release date.
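
`summarize_conversation` joins the message contents without speaker labels, so an earlier user request such as "Give me a 2 sentence summary" reaches the model as plain text. That may be one reason the sample output above is a two-sentence summary and not three bullets. One option, offered as an assumption and not as the tutorial's method, is to label each line with its role before sending the transcript. `format_transcript` is a hypothetical helper.

```python
from typing import Dict, List


def format_transcript(history: List[Dict[str, str]]) -> str:
    """Render history as 'Role: content' lines so speakers stay distinguishable."""
    return "\n".join(f"{m['role'].capitalize()}: {m['content']}" for m in history)


# Made-up history for illustration.
sample_history = [
    {"role": "user", "content": "Give me a 2 sentence summary"},
    {"role": "assistant", "content": "Meta is releasing two models."},
]
transcript = format_transcript(sample_history)
print(transcript)
# User: Give me a 2 sentence summary
# Assistant: Meta is releasing two models.
```

The resulting string could replace the `"\n".join(...)` expression in the user message. Whether it changes the summary depends on the model, so treat it as something to test.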


## Next steps and upgrade paths

This tutorial provides a solid foundation, but you can extend it in several ways for a production-grade application.

| Need                           | Where to look                                                                                                                                                                                                                            |
| :----------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Long pages / multiple docs** | For content larger than the context window, use **Retrieval-Augmented Generation (RAG)**. This involves chunking documents, storing them in a vector database, and retrieving only the most relevant chunks to answer a question. See our [Contextual chunking RAG cookbook](https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/Contextual-Chunking-RAG). |
| **Persistent memory**          | For production systems, you might store conversation history in a database.                                                                   |
| **Real-time feel**             | Enable `stream=True` to receive the response token-by-token, improving perceived latency. See the streaming example in the [Chat & conversation guide](https://llama.developer.meta.com/docs/guides/chat-guide#enhancing-user-experience-with-streaming-responses).                                                                                       |
| **Live data & actions**        | Give the chatbot access to live data or external APIs using **Tool Calling**. See the full [Tool calling guide](https://llama.developer.meta.com/docs/guides/tool-guide/).                                                                                                          |
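
For the persistent memory row, a minimal sketch using Python's built-in `sqlite3` module is shown below. The table layout, the `session_id` column, and the function names are all assumptions chosen for illustration; a production system would likely need its own schema, indexing, and access control.

```python
import sqlite3
from typing import Dict, List


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) messages table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session_id TEXT NOT NULL, "
        "role TEXT NOT NULL, "
        "content TEXT NOT NULL)"
    )


def save_message(conn: sqlite3.Connection, session_id: str, role: str, content: str) -> None:
    """Append one message to a session."""
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    conn.commit()


def load_history(conn: sqlite3.Connection, session_id: str) -> List[Dict[str, str]]:
    """Return a session's messages in insertion order, in the chat message format."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in rows]


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
init_db(conn)
save_message(conn, "demo", "user", "Hello")
save_message(conn, "demo", "assistant", "Hi there")
save_message(conn, "other", "user", "Unrelated")
restored = load_history(conn, "demo")
print(restored)
```

Because `load_history` returns the same list-of-dicts shape that `PageChat.history` uses, a restored session could be assigned to `bot.history` before the chat resumes.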