<a href="https://colab.research.google.com/github/michael-borck/the-ai-lab/blob/main/nanoGPT_simpsons.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NanoGPT using a Simpsons dataset

This notebook takes [Karpathy's NanoGPT](https://github.com/karpathy/nanoGPT) and trains it on the [Simpsons dataset on Kaggle](https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset?resource=download). The code lives in the [nanoGPT_simpsons](https://github.com/rajshah4/nanoGPT_simpsons) repo.


Ideas to extend this:
*   Try other datasets
*   Change the model architecture
*   Fine-tune the model






Let's start by cloning my repo, installing packages, and (optionally) logging in to wandb.

In [None]:
!git clone https://github.com/rajshah4/nanoGPT_simpsons

Cloning into 'nanoGPT_simpsons'...
remote: Enumerating objects: 687, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 687 (delta 19), reused 26 (delta 15), pack-reused 649
Receiving objects: 100% (687/687), 4.39 MiB | 19.45 MiB/s, done.
Resolving deltas: 100% (393/393), done.


In [None]:
# gpt2 uses transformers
%pip install tiktoken transformers wandb --quiet

In [None]:
#!wandb login

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


View and prepare the data

In [None]:
!head ./nanoGPT_simpsons/data/simpsons/simpsons.txt

Oh, for the love of...
What's wrong with this phone? It's making crazy noises.
Those "crazy noises" are computer signals.
Yeah, some guys at M.I.T. are sending us reasons why Captain Picard is better than Captain Kirk.
They're out of their minds!
I heard about this. This is the one where Scratchy finally gets Itchy.
My purpose in life is to witness this moment.
We need the outlet for our rock tumbler.
Plug it in! Plug it in!
What? The rock tumbler or the TV?


In [None]:
!cd ./nanoGPT_simpsons/data/simpsons/ && python prepare.py

length of dataset in characters: 7,116,231
all the unique characters: 
 !"#$%&'()*+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz¡­¿ÄÈÉÑÖÙàáâãäåæçèéêëìíïñòóôõöøùúûüāĈēěĜīĬłńŭżǎǐǒ…
vocab size: 137
train has 6,404,607 tokens
val has 711,624 tokens
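
The output above suggests a character-level tokenizer: every distinct character becomes one token ID, and roughly 90% of the tokens go to training (6,404,607 of 7,116,231). The sketch below illustrates that idea on a single line from the dataset. It is an assumption about what `prepare.py` does, not a copy of it, and the names `stoi`, `itos`, `encode`, and `decode` are chosen here for illustration.

```python
# Minimal sketch of character-level tokenization (assumed, not taken from prepare.py).
text = "Plug it in! Plug it in!"

# Vocabulary: every distinct character, sorted for a stable ID assignment.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> character


def encode(s):
    """Map a string to a list of integer token IDs."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer token IDs back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)

# Assumed 90/10 train/validation split over the token sequence.
split = int(len(ids) * 0.9)
train_ids, val_ids = ids[:split], ids[split:]
print(f"vocab size: {len(chars)}")
print(f"train has {len(train_ids)} tokens, val has {len(val_ids)} tokens")

# The same 90% rule reproduces the token counts reported above.
total_chars = 7_116_231
n_train = int(total_chars * 0.9)
print(n_train, total_chars - n_train)
```

Because each character is its own token, the vocabulary stays tiny (137 entries here), but the model has to learn spelling from scratch.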


Set up the configuration for training. The defaults live in a config file, and you can override any of them by passing arguments at training time.
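
The scripts in this notebook print lines such as `Overriding: out_dir = out-simpsons-char` when given `--key=value` arguments. The sketch below shows one way such overrides could be applied on top of config defaults. The function `apply_overrides` and the `defaults` dictionary are hypothetical and written for illustration; the repo's own override logic may differ.

```python
import ast


def apply_overrides(config, argv):
    """Return a copy of `config` with `--key=value` overrides applied."""
    updated = dict(config)
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            continue  # skip positional arguments such as the config file path
        key, raw = arg[2:].split("=", 1)
        if key not in updated:
            raise KeyError(f"unknown config key: {key}")
        try:
            value = ast.literal_eval(raw)  # numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw  # anything else is kept as a plain string
        print(f"Overriding: {key} = {value}")
        updated[key] = value
    return updated


# Hypothetical subset of the defaults defined in the config file.
defaults = {"max_iters": 10000, "wandb_log": False, "out_dir": "out-simpsons-char"}
cfg = apply_overrides(
    defaults,
    ["config/train_simpsons_char.py", "--max_iters=2000", "--wandb_log=True"],
)
print(cfg)
```

Rejecting unknown keys catches typos in flag names early instead of silently training with the default value.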

In [None]:
!cat ./nanoGPT_simpsons/config/train_simpsons_char.py

# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-simpsons-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'simpsons-char'
wandb_run_name = 'nano-gpt'

dataset = 'simpsons'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 10000
lr_decay_iters = 10000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary po
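
To get a feel for these values, the sketch below computes the tokens processed per iteration and a learning-rate schedule with linear warmup followed by cosine decay. The schedule is an assumption based on how upstream nanoGPT is commonly described; `get_lr` here is an illustrative function and not necessarily identical to the one in `train.py`.

```python
import math

# Values copied from the config file above.
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256
learning_rate = 1e-3
min_lr = 1e-4
warmup_iters = 100
lr_decay_iters = 10000

# Each iteration sees batch_size sequences of block_size characters.
tokens_per_iter = gradient_accumulation_steps * batch_size * block_size
print(f"tokens per iteration: {tokens_per_iter:,}")


def get_lr(it):
    """Assumed schedule: linear warmup, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)


for it in (0, 50, 100, 5000, 10000):
    print(f"iter {it:>5}: lr = {get_lr(it):.6f}")
```

Under this assumption the learning rate peaks at `learning_rate` once warmup ends and reaches `min_lr` at `lr_decay_iters`, which is why the config comment suggests keeping `lr_decay_iters` equal to `max_iters`.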

Train the model

In [None]:
!cd nanoGPT_simpsons && python train.py config/train_simpsons_char.py

iter 490: loss 1.7712, time 114.43ms, mfu 3.28%
step 500: train loss 1.6916, val loss 1.6704
saving checkpoint to out-simpsons-char/ckpt_iter_500.pt
iter 500: loss 1.7608, time 11873.03ms, mfu 2.95%
iter 510: loss 1.7547, time 113.34ms, mfu 2.99%
iter 520: loss 1.7484, time 115.62ms, mfu 3.01%
iter 530: loss 1.7413, time 112.29ms, mfu 3.04%
iter 540: loss 1.7172, time 110.78ms, mfu 3.08%
iter 550: loss 1.7120, time 113.54ms, mfu 3.10%
iter 560: loss 1.7356, time 114.70ms, mfu 3.11%
iter 570: loss 1.7203, time 114.10ms, mfu 3.13%
iter 580: loss 1.7258, time 115.18ms, mfu 3.14%
iter 590: loss 1.7284, time 113.59ms, mfu 3.15%
iter 600: loss 1.7086, time 115.10ms, mfu 3.16%
iter 610: loss 1.6914, time 112.43ms, mfu 3.18%
iter 620: loss 1.7025, time 114.85ms, mfu 3.19%
iter 630: loss 1.7071, time 111.63ms, mfu 3.20%
iter 640: loss 1.6627, time 115.90ms, mfu 3.20%
iter 650: loss 1.6478, time 115.08ms, mfu 3.21%
iter 660: loss 1.6931, time 113.88ms, mfu 3.22%
iter 670: loss 1.6848, time 115.3
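
The loss values in the log are easier to interpret once converted to other units. The sketch below assumes the reported loss is the average cross-entropy per character in nats (the usual convention for PyTorch's cross-entropy loss) and compares the step-500 validation loss with the loss of a model that guesses uniformly over the 137-character vocabulary.

```python
import math

vocab_size = 137   # from the prepare.py output
val_loss = 1.6704  # validation loss at step 500 in the log above

# A model that assigns equal probability to every character scores ln(vocab_size).
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely next characters.
perplexity = math.exp(val_loss)

# The same loss expressed in bits per character.
bits_per_char = val_loss / math.log(2)

print(f"uniform baseline loss: {uniform_loss:.3f} nats")
print(f"perplexity at step 500: {perplexity:.2f}")
print(f"bits per character:     {bits_per_char:.2f}")
```

After 500 iterations the model is already far below the uniform baseline, which matches the increasingly readable samples later in the notebook.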

Review the samples from the checkpoints saved at iterations 250, 1000, and 5000

In [None]:
!cd ./nanoGPT_simpsons && python sample.py --out_dir=out-simpsons-char --ckpoint='ckpt_iter_250.pt' --start="Homer eats lunch and goes to"

Overriding: out_dir = out-simpsons-char
Overriding: ckpoint = ckpt_iter_250.pt
Overriding: start = Homer eats lunch and goes to
number of parameters: 10.67M
Loading meta from data/simpsons/meta.pkl...
Homer eats lunch and goes to and hat hould ar Ig a far you aw fir is down the don' that simpsicand he this of her aglo pring you cand a comill seed domls witall got mark in posianianting.
Oo. Haw, what's bell hing to paristen the hat's a remall this of ay. What up a cand tin the to a caning sic the pur the and thess a be she The cout st som.
And lell gothistafsin yout "n sp."
And wer foreat the of to ree do would fare and "for ficaremb they. Here scordoney theing se carry. I cand onst forle jus the onic. Ablaringromus my c
---------------
Homer eats lunch and goes to if thres?
Amence, in the call, shume if the to faping a bout me.
And they the bait the chomes a thin the comiled looh, the pics juster the is tell lo, in togir. I woulde able fart on undicaned spert mord thing, the the is jod

In [None]:
!cd ./nanoGPT_simpsons && python sample.py --out_dir=out-simpsons-char --ckpoint='ckpt_iter_1000.pt' --start="Homer eats lunch and goes to"

Overriding: out_dir = out-simpsons-char
Overriding: ckpoint = ckpt_iter_1000.pt
Overriding: start = Homer eats lunch and goes to
number of parameters: 10.67M
Loading meta from data/simpsons/meta.pkl...
Homer eats lunch and goes to go back home.
Well, we take to watch to drive.
Bart, little sixty family this fast to get the says in a sacross.
Well, don't worry, good.
Well, possible to tell us a lot of story name last for this is hatten Sunda Unti Movie Super-Christmas are not made a carpains.
Sti-tink-tack and he's a boys movie.
Five, they.
All right, that's right, but I don't like that.
These only spaning would for the chone first matter.
Goodbye, I can't!
Yiaga, it's Mr. For music Houseum. A first ice of little music pa
---------------
Homer eats lunch and goes to place the Americans.
Aw, there's not no party hammino!
Well, we've got ever gone to her. This is this charged passed on the instemine in this get.
And in the reside. The charges they can't.
We said "more thing," "charl is ju

In [None]:
!cd ./nanoGPT_simpsons && python sample.py --out_dir=out-simpsons-char --ckpoint='ckpt_iter_5000.pt' --start="Homer eats lunch and goes to"

Overriding: out_dir = out-simpsons-char
Overriding: ckpoint = ckpt_iter_5000.pt
Overriding: start = Homer eats lunch and goes to
number of parameters: 10.67M
Loading meta from data/simpsons/meta.pkl...
Homer eats lunch and goes together.
Ahh, Lisa Simpson! A boat of time, why don't you just say that happen?
Zong is going to buy it. It was gonna be dimensive thoughts he can do anything else.
Hey, what's the name?
You don't need that kick on the minute.
But he's gonna find a little about this bath-time.
Well, we've got some gate for money.
...Ralph Creeks... friends have spent the brother to shoot them for some way to be crazy.
I am my father. But it's for their presents to thank you to the jury.
Do you have to drop me?
Ok
---------------
Homer eats lunch and goes to you just push out in the cash.
What about the hell?
Ah, that's so hard enough to help you.
I'm going to my hair. Leave the picture of your junkel.
Mom, are you sick? If I can find out a wave of show my freedom, they're going

In [None]:
!cd ./nanoGPT_simpsons && python sample.py --out_dir=out-simpsons-char --ckpoint='ckpt_iter_5000.pt' --dtype=float16 --num_samples=5 --max_new_tokens=10 --start="to be"
