<a href="https://colab.research.google.com/github/M9star/nanoGPT/blob/main/tiny_ai_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Tiny AI Chatbot  
### A Hands-On Workshop on How AI Learns to Talk

![AI Chatbot Banner](https://iili.io/fNdUQ72.png)

---

##  Introduction

Artificial Intelligence (AI) can now **understand and generate human-like language**.  
From chatbots and virtual assistants to automated customer support, AI systems are learning how to *talk*. This notebook shows **how that actually works**, step by step.

In this notebook, we build a **simple AI chatbot** and train it using real text data. No advanced math or deep AI background is required.

For a deeper understanding of how Large Language Models (LLMs) work under the hood, watch this excellent talk by Andrej Karpathy:  
https://www.youtube.com/watch?v=bZQun8Y4L2A

---

##  What You Will Learn

By the end of this session, you will understand:

- How AI learns language from **text data**
- What a **language model** is and how it works
- The difference between **raw data** and **trained intelligence**
- How chatbots learn to:
  - Hold conversations  
  - Tell stories  
  - Solve basic math problems
- How a trained AI model can be used through a **simple chat interface**

---

##  What We Will Build

We will build a **Tiny AI Chatbot** that can:

- Chat with users naturally
- Tell short stories
- Answer math questions
- Remember short conversation history

---

##  Tools & Technologies Used

- **Python** – programming language
- **Hugging Face Datasets** – for text data
- **Transformers (GPT-2)** – the AI language model
- **Gradio** – to create a chat interface



Let’s begin our journey!

### Install required libraries
- **datasets** – to download and manage text datasets
- **transformers** – to load and train the GPT-2 language model
- **accelerate** – to optimize training performance
- **torch** – deep learning framework
- **gradio** – to create an interactive chat interface


In [None]:
!pip install -q datasets==2.16.0 transformers==4.57.0 accelerate==1.10.1 torch gradio==5.49.0


In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

## Download and view Datasets  
**Datasets:**  
- **DailyDialog**: https://huggingface.co/datasets/roskoN/dailydialog  
- **TinyStories**: https://huggingface.co/datasets/roneneldan/TinyStories  
- **AI-MO / NuminaMath-CoT**: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT  

In [None]:
from datasets import load_dataset

# Download dialog dataset (without trust_remote_code)
chat_ds = load_dataset("roskoN/dailydialog")
print(chat_ds)

DatasetDict({
    train: Dataset({
        features: ['id', 'acts', 'emotions', 'utterances'],
        num_rows: 11118
    })
    validation: Dataset({
        features: ['id', 'acts', 'emotions', 'utterances'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['id', 'acts', 'emotions', 'utterances'],
        num_rows: 1000
    })
})


In [None]:
# Download story dataset
story_ds = load_dataset("roneneldan/TinyStories")

print(story_ds)
print()
story_ds['train'][0]["text"]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})



'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'

In [None]:
# Download math dataset
math_ds = load_dataset("AI-MO/NuminaMath-CoT")

print(math_ds)
print()
math_ds['train'][0]

DatasetDict({
    train: Dataset({
        features: ['source', 'problem', 'solution', 'messages'],
        num_rows: 859494
    })
    test: Dataset({
        features: ['source', 'problem', 'solution', 'messages'],
        num_rows: 100
    })
})



{'source': 'synthetic_math',
 'problem': 'Consider the terms of an arithmetic sequence: $-\\frac{1}{3}, y+2, 4y, \\ldots$. Solve for $y$.',
 'solution': 'For an arithmetic sequence, the difference between consecutive terms must be equal. Therefore, we can set up the following equations based on the sequence given:\n\\[ (y + 2) - \\left(-\\frac{1}{3}\\right) = 4y - (y+2) \\]\n\nSimplify and solve these equations:\n\\[ y + 2 + \\frac{1}{3} = 4y - y - 2 \\]\n\\[ y + \\frac{7}{3} = 3y - 2 \\]\n\\[ \\frac{7}{3} + 2 = 3y - y \\]\n\\[ \\frac{13}{3} = 2y \\]\n\\[ y = \\frac{13}{6} \\]\n\nThus, the value of $y$ that satisfies the given arithmetic sequence is $\\boxed{\\frac{13}{6}}$.',
 'messages': [{'content': 'Consider the terms of an arithmetic sequence: $-\\frac{1}{3}, y+2, 4y, \\ldots$. Solve for $y$.',
   'role': 'user'},
  {'content': 'For an arithmetic sequence, the difference between consecutive terms must be equal. Therefore, we can set up the following equations based on the sequence

## Preparing the Chat Dataset

The chat dataset is enhanced by injecting stories and math problems into conversations.

### Key Concepts
- Alternating **User** and **Bot** messages
- Random insertion of:
  - Story prompts and responses
  - Math problems and solutions
- **END_TOKEN** added after each bot response
- All prepared datasets are combined into a single training dataset.
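
To make the format concrete, here is a minimal sketch of how one short dialog turns into training text. The helper name `format_dialog` and the sample utterances are illustrative assumptions, not part of the notebook. The `User:`/`Bot:` labels and the `END_TOKEN` value mirror the ones used in the next cell.

```python
END_TOKEN = "\n"  # same end-of-reply marker as this notebook

def format_dialog(utterances):
    """Label alternating turns as User/Bot and end every Bot turn with END_TOKEN."""
    lines = []
    for i, sentence in enumerate(utterances):
        if i % 2 == 0:
            lines.append(f"User: {sentence}")
        else:
            lines.append(f"Bot: {sentence} {END_TOKEN}")
    return "\n".join(lines).strip()

# Made-up utterances, only for illustration
example = format_dialog(["Hi there!", "Hello, how can I help?", "Tell me a joke."])
print(example)
```

Because `END_TOKEN` is a newline, each bot reply is followed by a blank line in the printed text, which is also what the sample output of the next cell shows.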

In [None]:
from datasets import load_dataset, concatenate_datasets
import random
import re

# Config
# ----------------------------
CHAT_SAMPLES  = 2000
STORY_SAMPLES = 2000
MATH_SAMPLES  = 2000

SEED = 42

STORY_INSERT_PROB = 0.3
MATH_INSERT_PROB = 0.3

END_TOKEN = "\n"

random.seed(SEED)

STORY_PROMPTS = [
    "Write a story.",
    "Tell a story.",
    "Create a story.",
    "Write a short story.",
    "Make up a story.",
    "Share a story.",
    "Invent a story.",
    "Compose a story.",
    "Write a narrative.",
    "Create a narrative.",
    "Tell a tale.",
    "Write a fictional story.",
    "Create a short narrative.",
    "Tell an original story.",
    "Write an imaginative story.",
    "Compose a short tale.",
    "Create an original story.",
    "Write a creative story.",
    "Tell a short tale.",
    "Make up a short story."
]


# Math LaTeX Normalizer (Gradio-safe)
# -------------------------------------------------
def normalize_math_tex(text: str) -> str:
    # Convert \[ ... \] → $$ ... $$
    text = re.sub(r"\\\[(.*?)\\\]", r"$$\1$$", text, flags=re.DOTALL)

    # Convert single $...$ → $$...$$ (ignore existing $$)
    text = re.sub(
        r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)",
        r"$$\1$$",
        text,
        flags=re.DOTALL,
    )

    return text.strip()


# Load Story Dataset
# -------------------------------------------------
story_raw_ds = load_dataset(
    "roneneldan/TinyStories",
    split=f"train[:{STORY_SAMPLES}]"
)

stories = [
    s["text"].replace("\n\n", " ").strip()
    for s in story_raw_ds
]


# Load Math Dataset
# -------------------------------------------------
math_raw_ds = load_dataset(
    "AI-MO/NuminaMath-CoT",
    split=f"train[:{MATH_SAMPLES}]"
)

math_problems = [
    (
        normalize_math_tex(m["problem"]),
        normalize_math_tex(m["solution"])
    )
    for m in math_raw_ds
]


# Chat Formatter (Story + Math Injection)
# -------------------------------------------------
def format_chat(example):
    text = ""
    utterances = example["utterances"]

    insert_story = random.random() < STORY_INSERT_PROB
    insert_math = random.random() < MATH_INSERT_PROB

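    # Pick an odd index so the injected exchange follows a complete
    # User/Bot pair and the turns keep alternating (-1 disables injection
    # for dialogs with fewer than two utterances)
    insert_idx = (
        random.randrange(1, len(utterances), 2) if len(utterances) > 1 else -1
    )

    for i, sentence in enumerate(utterances):
        if i % 2 == 0:
            text += f"User: {sentence}\n"
        else:
            text += f"Bot: {sentence} {END_TOKEN}\n"

        if i == insert_idx: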
            if insert_story:
                prompt = random.choice(STORY_PROMPTS)
                story = random.choice(stories)
                text += f"User: {prompt}\n"
                text += f"Bot: {story} {END_TOKEN}\n"

            ## Uncomment this if you want to train with math dataset
            # elif insert_math:
            #     problem, solution = random.choice(math_problems)
            #     text += f"User: {problem}\n"
            #     text += f"Bot: {solution} {END_TOKEN}\n"

    return {"text": text.strip()}


# Story-only Formatter
# -------------------------------------------------
def format_story(example):
    prompt = random.choice(STORY_PROMPTS)
    story = example["text"].strip()  #.replace("\n\n", " ")
    return {
        "text": f"User: {prompt}\nBot: {story} {END_TOKEN}"
    }


# Math-only Formatter
# -------------------------------------------------
def format_math(example):
    return {
        "text": (
            f"User: {normalize_math_tex(example['problem'])}\n"
            f"Bot: {normalize_math_tex(example['solution'])} {END_TOKEN}"
        )
    }


# Load & Prepare Chat Dataset
# -------------------------------------------------
chat_ds = load_dataset(
    "roskoN/dailydialog",
    split=f"train[:{CHAT_SAMPLES}]",
    trust_remote_code=True
)

chat_ds = chat_ds.map(
    format_chat,
    remove_columns=chat_ds.column_names
)


# Prepare Story-only Dataset
# -------------------------------------------------
story_ds = story_raw_ds.map(
    format_story,
    remove_columns=story_raw_ds.column_names
)


# Prepare Math-only Dataset
# -------------------------------------------------
math_ds = math_raw_ds.map(
    format_math,
    remove_columns=math_raw_ds.column_names
)


# Merge + Shuffle
# -------------------------------------------------
dataset = concatenate_datasets([chat_ds, story_ds])           # Train with Chat + Story dataset
#dataset = concatenate_datasets([chat_ds, story_ds, math_ds])  # Train with Chat + Story + Math dataset

dataset = dataset.shuffle(seed=SEED)

# Sanity Check
# -------------------------------------------------
print(dataset)
for i in range(15):
    print("\n--- SAMPLE ---\n")
    print(dataset[i]["text"])

Dataset({
    features: ['text'],
    num_rows: 4000
})

--- SAMPLE ---

User: That's good to know . We'll watch for crazy drivers .
Bot: Hey , guys . Ready for a pick-up game of touch football ? 

User: Football ? Sounds dangerous . Maybe I'll just be a cheerleader .
Bot: C'mon , Yi-jun . It's lots of fun . You've seen football played on TV . 

User: Uh , yeah . And I've seen guys in pads tackled . We have no pads . I don't want to be tackled .
Bot: There's barely any contact in touch football . People just tag you to stop the play . 

User: OK , but they better tag lightly !

--- SAMPLE ---

User: Do you speak only English in the class , or does your teacher explain everything to you in Spanish ?
Bot: Oh , we never speak Spanish in class ! Miss.Parker speaks to us only in English . 

User: I suppose she's right.Does she speak English very slowly ?
Bot: Not always.Sometimes we don't understand her.Then she has to repeat what she said . 

User: It must be interesting to study English .

## Loading the Base Language Model

A pre-trained GPT-2 model is used as the foundation.

### Why GPT-2?
- Already understands basic language patterns
- Small enough for fast training

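The tokenizer reuses GPT-2's end-of-sequence token for padding, because GPT-2 has no dedicated padding token, and the model's embedding matrix is resized to match the tokenizer's vocabulary.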

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

import warnings
warnings.filterwarnings("ignore")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

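Insight: the printed `Embedding(50257, 768)` describes the model's token embedding table.

* vocabulary_size = 50257 (number of distinct tokens the model knows)
* dimension = 768 (length of the vector that represents each token)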
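
As a rough back-of-the-envelope check, the two numbers can be multiplied to see how many parameters the embedding table alone holds. The 4-bytes-per-weight figure is an assumption (32-bit floats); the notebook does not state the storage format.

```python
# Values read from the printed `Embedding(50257, 768)`
vocabulary_size = 50257
dimension = 768

# One learned vector of `dimension` numbers per vocabulary entry
embedding_params = vocabulary_size * dimension
print(f"Embedding parameters: {embedding_params:,}")  # 38,597,376

# Assumption: weights stored as 32-bit floats (4 bytes each)
size_mib = embedding_params * 4 / 1024**2
print(f"Approximate size: {size_mib:.1f} MiB")
```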


#### Alternative models

You can also experiment with alternative models. Uncomment the code in the following cell to use one.

- **microsoft/DialoGPT-small** (GPT-2 based but chat-trained)
- **meta-llama/Llama-2-Chat**
- **mistralai/Mistral-7B-Instruct**
- **Qwen/Qwen-Chat**

In [None]:
# from transformers import AutoTokenizer, AutoModelForCausalLM

##  Uncomment this code to use alternative models

# tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

## Tokenization

Text data must be converted into numbers for the model to understand.

#### Tokenization Steps
- Convert text into token IDs
- Truncate or pad sequences to a fixed length
- Create labels for supervised learning
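
The truncate-or-pad step can be sketched in plain Python without loading a tokenizer. The helper `pad_or_truncate` is a hypothetical stand-in for what the tokenizer does internally, and `max_length=8` is chosen only to keep the output short. The token IDs are the ones this notebook quotes for "Hello, how are you?", and 50256 is GPT-2's end-of-sequence ID.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                  # truncate long sequences
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad            # pad short sequences
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

PAD_ID = 50256  # GPT-2's end-of-sequence ID, reused for padding in this notebook
hello_ids = [15496, 11, 703, 389, 345, 30]  # "Hello, how are you?"

input_ids, attention_mask = pad_or_truncate(hello_ids, max_length=8, pad_id=PAD_ID)
print(input_ids)       # [15496, 11, 703, 389, 345, 30, 50256, 50256]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]

# For language-model training the labels start out as a copy of the inputs
labels = list(input_ids)
```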

In [None]:
def tokenize(example):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_ds = dataset.map(tokenize, remove_columns=["text"])

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

What happens here:

- Takes raw text: "Hello, how are you?"
- Converts it to token IDs: [15496, 11, 703, 389, 345, 30, ...]

Parameters explained:

* truncation=True - If the text is longer than 128 tokens, it is cut off
* padding="max_length" - If the text is shorter than 128 tokens, it is padded with pad_token (in this notebook, GPT-2's end-of-sequence token)
* max_length=128 - Every sequence will be exactly 128 tokens
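
One optional refinement, not used in this notebook: because `labels` is a plain copy of `input_ids`, the padded positions also count toward the training loss. A common pattern with Hugging Face models is to set the label at padded positions to `-100`, a value the cross-entropy loss ignores. The helper name `mask_padding_labels` below is a hypothetical illustration, and whether it improves this particular chatbot is untested.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# A short padded sequence (last two positions are padding)
input_ids = [15496, 11, 703, 389, 345, 30, 50256, 50256]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [15496, 11, 703, 389, 345, 30, -100, -100]
```

Inside `tokenize`, this would replace the `.copy()` line with `tokens["labels"] = mask_padding_labels(tokens["input_ids"], tokens["attention_mask"])`.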


## Training base model on dataset
The model is trained using supervised learning.
- The AI predicts the next token (a word or word piece) in a sentence
- Predictions are compared with the correct answer
- Errors are used to improve the model
- Training runs for multiple epochs

This process gradually improves the chatbot's responses.
Increase the number of training epochs or add more data to improve them further.
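
The "compare predictions with the correct answer" step is a cross-entropy loss over next-token predictions. Below is a toy sketch of that calculation in plain Python. The function name `next_token_loss` and the probabilities are made up for illustration; the real model produces a probability for every token in its vocabulary.

```python
import math

def next_token_loss(tokens, predicted_probs):
    """Average cross-entropy for predicting tokens[t + 1] at position t.

    predicted_probs[t] maps each candidate next token to its probability.
    """
    losses = []
    for t in range(len(tokens) - 1):
        target = tokens[t + 1]
        prob = predicted_probs[t].get(target, 1e-12)  # tiny floor avoids log(0)
        losses.append(-math.log(prob))
    return sum(losses) / len(losses)

# Toy values, made up for illustration
tokens = ["User", ":", " Hi"]
predicted_probs = [
    {":": 0.9, " said": 0.1},     # model's guesses after "User"
    {" Hi": 0.5, " Hello": 0.5},  # model's guesses after ":"
]

loss = next_token_loss(tokens, predicted_probs)
print(f"loss = {loss:.4f}")

# exp(-loss) is the geometric-mean probability given to the correct tokens
print(f"exp(-loss) = {math.exp(-loss):.2f}")
```

Read this way, a lower training loss means the model assigns higher probability to the token that actually comes next.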

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./chatbot",
    per_device_train_batch_size=8,
    num_train_epochs=2,
    logging_steps=100,
    save_steps=500,
    fp16=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds
)

trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 100  | 2.1586        |
| 200  | 1.9499        |
| 300  | 1.89          |
| 400  | 1.8122        |
| 500  | 1.8314        |


TrainOutput(global_step=500, training_loss=1.9284068908691405, metrics={'train_runtime': 221.4418, 'train_samples_per_second': 36.127, 'train_steps_per_second': 2.258, 'total_flos': 522584064000000.0, 'train_loss': 1.9284068908691405, 'epoch': 2.0})

## Save our trained model for later use

This code saves our trained model and tokenizer to disk so we can load them and chat with the model later.

In [None]:
save_directory = "./my_gpt2_model"  # any folder you like
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

('./my_gpt2_model/tokenizer_config.json',
 './my_gpt2_model/special_tokens_map.json',
 './my_gpt2_model/vocab.json',
 './my_gpt2_model/merges.txt',
 './my_gpt2_model/added_tokens.json')

## Let's chat with our AI

#### Interface Features
- Text-based chat interaction
- Maintains short conversation history
- Displays AI-generated responses instantly

#### Conversation Memory Handling
The chatbot remembers recent messages to maintain context.

#### How It Works
- Keeps a limited number of past interactions
- Formats them into a structured prompt
- Feeds the prompt into the model for response generation

This makes conversations feel more natural.
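
The following sketch shows what such a structured prompt looks like for a small, made-up conversation. It assumes the history arrives as a list of `(user, bot)` pairs, and it sets `MAX_HISTORY = 2` only so that the trimming is visible in the output.

```python
END_TOKEN = "\n"
MAX_HISTORY = 2  # small value so the trimming is visible

def build_prompt(message, history):
    """Format the most recent (user, bot) pairs plus the new message."""
    prompt = ""
    for user, bot in history[-MAX_HISTORY:]:
        prompt += f"User: {user}\nBot: {bot} {END_TOKEN}\n"
    return prompt + f"User: {message}\nBot:"

# Made-up conversation, only for illustration
history = [
    ("Hi", "Hello!"),
    ("How are you?", "I am fine."),
    ("What is your name?", "I am a tiny bot."),
]

prompt = build_prompt("Tell a story.", history)
print(prompt)
```

The oldest exchange is dropped, and the prompt ends with an open `Bot:` label so that the model's continuation becomes the reply.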

In [None]:
import gradio as gr

MAX_HISTORY = 10

def build_prompt(message, history):
    history = history[-MAX_HISTORY:]

    prompt = ""
    for user, bot in history:
        prompt += f"User: {user}\nBot: {bot} {END_TOKEN}\n"

    prompt += f"User: {message}\nBot:"
    return prompt


def chat(message, history):
    history = history[-MAX_HISTORY:]

    prompt = build_prompt(message, history)

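    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # ID of the first token of END_TOKEN; generation stops once it is produced
    end_token_id = tokenizer.encode(END_TOKEN, add_special_tokens=False)[0]

    output = model.generate(
        **inputs,
        max_new_tokens=350,   # upper bound on the reply length
        temperature=0.7,      # values below 1.0 make sampling less random
        do_sample=True,       # sample instead of always taking the top token
        eos_token_id=end_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )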

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract only the last bot reply
    bot_reply = decoded.split("Bot:")[-1]
    bot_reply = bot_reply.split(END_TOKEN)[0].strip()

    return bot_reply


ui_interface = gr.ChatInterface(
    fn=chat,
    title="Tiny AI Bot",
    description="Trained on our dataset",
    theme="soft"
)

ui_interface.launch()



