<a href="https://colab.research.google.com/github/howsam/Building-a-ChatGPT-like-Model-from-Scratch/blob/main/TinyStories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  <font color='#FFE15D'><b>💎 TinyStories</b></font>
### ChatGPT Course [webpage](https://howsam.org/downloads/implementing-chatgpt-from-scratch-with-pytorch/)

### Source Codes on [Github](https://github.com/howsam/Building-a-ChatGPT-like-Model-from-Scratch.git)

### by Howsam AI Academy www.howsam.org

# 🔴 **Environment Setup**

## 🟠 Change the font size of the output cells

In [1]:
print('Salam Howsam!')

Salam Howsam!


In [2]:
from IPython.display import HTML, display
shell = get_ipython()

def adjust_font_size():
  # Inject CSS that enlarges the font of every output cell.
  display(HTML('''<style>
    body {
      font-size: 24px;
    }
  </style>'''))

if adjust_font_size not in shell.events.callbacks['pre_execute']:
  shell.events.register('pre_execute', adjust_font_size)

In [3]:
print('Salam Howsam!')

Salam Howsam!


## 🟠 Pip Install

In [4]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.4.1-py3-none-any.whl (487 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.w

# 🔴 **Import**

In [5]:
import datasets

In [6]:
!python --version

Python 3.11.11


In [7]:
datasets.__version__

'3.4.1'

# 🔴 **TinyStories Dataset [🔗](https://huggingface.co/datasets/roneneldan/TinyStories)**

## 🟠 Dataset

In [8]:
from datasets import load_dataset

In [9]:
load_dataset?

In [11]:
dataset = load_dataset("roneneldan/TinyStories")
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})

In [13]:
dataset['train'], dataset['validation']

(Dataset({
     features: ['text'],
     num_rows: 2119719
 }),
 Dataset({
     features: ['text'],
     num_rows: 21990
 }))

In [15]:
len(dataset['validation']['text'])

21990
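
Beyond the row count, a rough look at story lengths can be useful. The helper below is a minimal sketch, not part of the original notebook: `summarize_lengths` is a hypothetical name, and it counts whitespace-separated words rather than model tokens. It works on any non-empty list of strings, so it can be tried on a couple of sample sentences before passing it a column such as `dataset['validation']['text']`.

```python
from statistics import mean


def summarize_lengths(texts):
    """Return simple word-count statistics for a non-empty list of strings."""
    if not texts:
        raise ValueError("texts must contain at least one string")
    # Whitespace splitting is only an approximation of story length.
    counts = [len(text.split()) for text in texts]
    return {
        "stories": len(counts),
        "min_words": min(counts),
        "mean_words": mean(counts),
        "max_words": max(counts),
    }


# Two short sample sentences stand in for the real column here.
sample_texts = [
    "Spot saw the shiny car.",
    "They played together all day and became best friends.",
]
stats = summarize_lengths(sample_texts)
print(stats)
```

On the real split, the same call would be `summarize_lengths(dataset['validation']['text'])`.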

In [21]:
dataset['validation'][0]['text']

'Spot. Spot saw the shiny car and said, "Wow, Kitty, your car is so bright and clean!" Kitty smiled and replied, "Thank you, Spot. I polish it every day."\n\nAfter playing with the car, Kitty and Spot felt thirsty. They found a small pond with clear water. They drank the water and felt very happy. They played together all day and became best friends.'

In [22]:
print(dataset['validation'][0]['text'])

Spot. Spot saw the shiny car and said, "Wow, Kitty, your car is so bright and clean!" Kitty smiled and replied, "Thank you, Spot. I polish it every day."

After playing with the car, Kitty and Spot felt thirsty. They found a small pond with clear water. They drank the water and felt very happy. They played together all day and became best friends.


In [26]:
from pprint import pprint

pprint(dataset['validation'][0]['text'], width=80)

('Spot. Spot saw the shiny car and said, "Wow, Kitty, your car is so bright '
 'and clean!" Kitty smiled and replied, "Thank you, Spot. I polish it every '
 'day."\n'
 '\n'
 'After playing with the car, Kitty and Spot felt thirsty. They found a small '
 'pond with clear water. They drank the water and felt very happy. They played '
 'together all day and became best friends.')


In [27]:
pprint(dataset['train'][0]['text'])

('One day, a little girl named Lily found a needle in her room. She knew it '
 'was difficult to play with it because it was sharp. Lily wanted to share the '
 'needle with her mom, so she could sew a button on her shirt.\n'
 '\n'
 'Lily went to her mom and said, "Mom, I found this needle. Can you share it '
 'with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share '
 'the needle and fix your shirt."\n'
 '\n'
 "Together, they shared the needle and sewed the button on Lily's shirt. It "
 'was not difficult for them because they were sharing and helping each other. '
 'After they finished, Lily thanked her mom for sharing the needle and fixing '
 'her shirt. They both felt happy because they had shared and worked together.')
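
`print` keeps the paragraph breaks but does not wrap long lines, while `pprint` wraps but shows quoted chunks with `\n` escapes. As a third option, the sketch below uses the standard-library `textwrap` module to wrap each paragraph separately. `wrap_story` is a hypothetical helper name, and it assumes paragraphs are separated by a blank line (`\n\n`), as in the stories printed above.

```python
import textwrap


def wrap_story(text, width=80):
    """Wrap each blank-line-separated paragraph to the given width."""
    paragraphs = text.split("\n\n")
    wrapped = [textwrap.fill(paragraph, width=width) for paragraph in paragraphs]
    # Re-join with a blank line so the paragraph structure is preserved.
    return "\n\n".join(wrapped)


# A shortened sample in the same shape as the dataset stories.
story = (
    'Spot saw the shiny car and said, "Wow, Kitty, your car is so bright and clean!"'
    "\n\n"
    "After playing with the car, Kitty and Spot felt thirsty."
)
wrapped = wrap_story(story, width=40)
print(wrapped)
```

With the loaded dataset, `print(wrap_story(dataset['train'][0]['text']))` would give plain wrapped text without quotes.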


## 🟠 Models

In [28]:
from transformers import AutoModelForCausalLM, AutoTokenizer  # Import model and tokenizer classes

# Load a pre-trained causal language model (TinyStories-33M)
model = AutoModelForCausalLM.from_pretrained('roneneldan/TinyStories-33M')

# Load a pre-trained tokenizer (GPT-Neo tokenizer is used instead of the TinyStories tokenizer)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

# Define the initial text prompt
prompt = "One day, a little girl named"

# Tokenize the input text and convert it into tensor format
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate text based on the input, setting a max length and using greedy search (num_beams=1)
output = model.generate(input_ids, max_length=1000, num_beams=1)

# Decode the generated token IDs back into human-readable text
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Pretty-print the generated text
print(100 * "_")
pprint(output_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


____________________________________________________________________________________________________
('One day, a little girl named Lily went to the park with her mom. They saw a '
 'big tree with a swing hanging from it. Lily wanted to play on the swing, but '
 'it was too high for her to reach.\n'
 '\n'
 'Lily\'s mom said, "Don\'t worry, I will help you." She picked her up and put '
 'her on the swing. Lily was so happy! She started to swing back and forth, '
 'higher and higher.\n'
 '\n'
 'As Lily was swinging, she saw a little boy named Tim. Tim was sad because he '
 'lost his toy. Lily stopped swinging and went to help Tim. She said, "Don\'t '
 'be sad, Tim. We will find your toy." They looked and looked, and finally, '
 'they found the toy under a bush. Tim was so happy, and he said, "Thank you, '
 'Lily!"\n'
 '\n'
 'Lily and Tim became good friends. They played together at the park every '
 'day. They always helped each other and had lots of fun. And they always '
 'remembered t
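
The `generate` call above uses greedy search: at every step the single highest-scoring next token is appended. The toy sketch below illustrates only that selection rule. It is not how the TinyStories model computes its scores: a real model scores the next token from the whole prefix, whereas the hypothetical `toy_scores` table here looks only at the last token. As with `max_length` in `generate`, the limit counts the prompt tokens too.

```python
def greedy_decode(next_token_scores, prompt_tokens, max_length, eos_token="<eos>"):
    """Extend prompt_tokens by always taking the highest-scoring next token."""
    tokens = list(prompt_tokens)
    # max_length includes the prompt, so the loop stops at the total length.
    while tokens and len(tokens) < max_length:
        scores = next_token_scores.get(tokens[-1])
        if not scores:
            break  # the toy table has no continuation for this token
        best = max(scores, key=scores.get)  # greedy step: pick the argmax
        if best == eos_token:
            break  # stop when the end-of-sequence token wins
        tokens.append(best)
    return tokens


# Hypothetical scores: for each token, candidate next tokens and their scores.
toy_scores = {
    "named": {"Lily": 0.6, "Tim": 0.3, "Spot": 0.1},
    "Lily": {"went": 0.7, "saw": 0.3},
    "went": {"home": 0.5, "<eos>": 0.4, "away": 0.1},
    "home": {"<eos>": 0.9, "again": 0.1},
}

result = greedy_decode(toy_scores, ["a", "girl", "named"], max_length=10)
print(" ".join(result))
```

Because the choice is deterministic, the same prompt always yields the same continuation, which is also true of the greedy `generate` call above.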

## 🟠 Others

### 🟡 Dataset Directory

In [29]:
dataset.cache_files

{'train': [{'filename': '/root/.cache/huggingface/datasets/roneneldan___tiny_stories/default/0.0.0/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/tiny_stories-train-00000-of-00004.arrow'},
  {'filename': '/root/.cache/huggingface/datasets/roneneldan___tiny_stories/default/0.0.0/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/tiny_stories-train-00001-of-00004.arrow'},
  {'filename': '/root/.cache/huggingface/datasets/roneneldan___tiny_stories/default/0.0.0/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/tiny_stories-train-00002-of-00004.arrow'},
  {'filename': '/root/.cache/huggingface/datasets/roneneldan___tiny_stories/default/0.0.0/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/tiny_stories-train-00003-of-00004.arrow'}],
 'validation': [{'filename': '/root/.cache/huggingface/datasets/roneneldan___tiny_stories/default/0.0.0/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/tiny_stories-validation.arrow'}]}
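
`cache_files` maps each split to a list of `{'filename': ...}` entries. To get just the directories, a small helper like the one below can collapse that structure. This is a sketch with a hypothetical helper name (`cache_dirs`) and made-up sample paths; it assumes only the dictionary shape shown in the output above.

```python
import os


def cache_dirs(cache_files):
    """Map each split name to the sorted list of directories holding its files."""
    return {
        split: sorted({os.path.dirname(entry["filename"]) for entry in entries})
        for split, entries in cache_files.items()
    }


# Made-up paths in the same shape as dataset.cache_files.
sample_cache_files = {
    "train": [
        {"filename": "/tmp/hf_demo/train-00000-of-00002.arrow"},
        {"filename": "/tmp/hf_demo/train-00001-of-00002.arrow"},
    ],
    "validation": [{"filename": "/tmp/hf_demo/validation.arrow"}],
}
dirs = cache_dirs(sample_cache_files)
print(dirs)
```

In the notebook, the call would be `cache_dirs(dataset.cache_files)`.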

### 🟡 HuggingFace Hub

In [30]:
from huggingface_hub import hf_hub_download

In [31]:
hf_hub_download?

In [32]:
repo_id = "roneneldan/TinyStories-8M"  # Model repository ID on the Hugging Face Hub
filename = "vocab.json"  # Name of the file to download

file_path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir="/content/")

print(f"File downloaded to: {file_path}")

File downloaded to: /content/vocab.json


In [34]:
repo_id = "roneneldan/TinyStories"  # Dataset repository ID on the Hugging Face Hub
filename = "TinyStoriesV2-GPT4-valid.txt"  # Name of the file to download

file_path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")

print(f"File downloaded to: {file_path}")

File downloaded to: /root/.cache/huggingface/hub/datasets--roneneldan--TinyStories/snapshots/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/TinyStoriesV2-GPT4-valid.txt
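
The downloaded file is one large text file rather than a table, so the stories still have to be separated. The sketch below assumes that stories are delimited by the `<|endoftext|>` marker. That is an assumption worth checking by opening the file first, and `split_stories` is a hypothetical helper name. A short in-memory string stands in for the real file contents.

```python
def split_stories(raw_text, separator="<|endoftext|>"):
    """Split raw text into stories, dropping empty pieces and outer whitespace."""
    pieces = (piece.strip() for piece in raw_text.split(separator))
    return [piece for piece in pieces if piece]


# A tiny stand-in for the file contents (the separator is an assumption).
raw_text = (
    "Once upon a time, there was a cat.\n"
    "<|endoftext|>\n"
    "One day, a dog found a ball.\n"
    "<|endoftext|>\n"
)
stories = split_stories(raw_text)
print(len(stories), "stories")

# With the real file, the text could be read like this:
# with open(file_path, encoding="utf-8") as f:
#     stories = split_stories(f.read())
```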


# 🔴 **TinyStories-GPT4 [🔗](https://huggingface.co/datasets/skeskinen/TinyStories-GPT4)**

In [35]:
dataset = load_dataset("skeskinen/TinyStories-GPT4")
dataset

DatasetDict({
    train: Dataset({
        features: ['story', 'summary', 'source', 'prompt', 'words', 'features'],
        num_rows: 2745100
    })
})
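
Unlike the original TinyStories dataset, this one carries extra columns per story. Assuming the `features` column holds a list of tag strings for each story (for example `'BadEnding'` or `'Twist'`), stories can be selected by tag. The sketch below defines a hypothetical predicate, `has_feature`, and tries it on made-up records in the same shape.

```python
def has_feature(record, tag):
    """Return True if the record's 'features' list contains the given tag."""
    return tag in (record.get("features") or [])


# Made-up records with only the fields needed for the illustration.
sample_records = [
    {"story": "story A", "features": ["BadEnding", "Twist"]},
    {"story": "story B", "features": ["Dialogue"]},
    {"story": "story C", "features": []},
]
twist_stories = [r["story"] for r in sample_records if has_feature(r, "Twist")]
print(twist_stories)

# With the loaded dataset, the same predicate could be passed to Dataset.filter:
# twist_split = dataset["train"].filter(lambda r: has_feature(r, "Twist"))
```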

In [None]:
dataset["train"]["words"]

In [36]:
pprint(dataset["train"][0])

{'features': ['BadEnding', 'Twist'],
 'prompt': 'Write a short story (3-5 paragraphs) which only uses very simple '
           'words that a 3 year old child would understand. The story should '
           'use the verb "receive", the noun "opera" and the adjective "red". '
           'The story has the following features: the story has a bad ending, '
           'something unexpected happens / there is a plot twist. Remember to '
           'only use simple words!',
 'source': 'GPT-4',
 'story': 'Once upon a time, there was a big red cat named Tom. Tom loved to '
          'sing. One day, he heard about a special show called an opera. He '
          'wanted to be in the opera so much. So, he went to try out.\n'
          'At the try out, Tom sang his best. He was so good that he got to be '
          'in the opera. Tom was so happy. He went home to tell his friends. '
          'He said, "I will sing in the opera!" His friends were happy too.\n'
          'On the day of the opera, Tom