# aitextgen Training Hello World

_Last Updated: Feb 21, 2021 (v.0.4.0)_

by Max Woolf

A "Hello World" Tutorial to show how training works with aitextgen, even on a CPU!

In [8]:
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
from aitextgen import aitextgen
import os, os.path


First, download this [text file of Shakespeare's plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) to the folder containing this notebook, then put the name of the downloaded file into the cell below.

In [None]:
file_name = "input.txt"
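If you would rather fetch the file from Python than download it by hand, the following is a minimal sketch using only the standard library. The helper name `download_if_missing` is hypothetical and not part of aitextgen; the URL is the one linked above.

```python
import os
import urllib.request

# URL copied from the link above.
SHAKESPEARE_URL = (
    "https://raw.githubusercontent.com/karpathy/char-rnn/"
    "master/data/tinyshakespeare/input.txt"
)


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`.

    Hypothetical helper for illustration only (not an aitextgen function).
    """
    if not os.path.isfile(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Calling `download_if_missing(SHAKESPEARE_URL, "input.txt")` once should leave `input.txt` next to the notebook; later calls skip the download because the file already exists.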



You can now train a custom Byte Pair Encoding Tokenizer on the downloaded text!

This will save one file: `aitextgen.tokenizer.json`, which contains the information needed to rebuild the tokenizer.

In [None]:
train_tokenizer(file_name)
tokenizer_file = 'aitextgen.tokenizer.json'
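To give a feel for what Byte Pair Encoding does, here is a toy sketch of the core idea: repeatedly merge the most frequent adjacent pair of tokens into a single token. This is only an illustration on characters; it is not how `train_tokenizer` is implemented, and the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


print(toy_bpe("aaabdaaabac", 1))
# ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Each merge shortens the token sequence while the joined tokens still spell out the original text, which is why a trained vocabulary can represent common character runs compactly.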






`GPT2ConfigCPU()` is a mini variant of GPT-2 optimized for CPU-training.

For example, the number of input tokens here is 64, versus 1024 for base GPT-2, which dramatically speeds up training.

In [None]:
config = GPT2ConfigCPU()
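One way to see why the shorter input window helps: self-attention compares every input position with every other one, so each attention head builds a score matrix with `seq_len * seq_len` entries. The sketch below only counts those entries for the two window sizes mentioned above. It is a rough illustration of a single cost term, not a full estimate of training time, and the function name is hypothetical.

```python
def attention_scores_per_head(seq_len):
    """Entries in one head's seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


cpu_window = 64      # input tokens for GPT2ConfigCPU(), per the text above
base_window = 1024   # input tokens for base GPT-2, per the text above

small = attention_scores_per_head(cpu_window)
large = attention_scores_per_head(base_window)
print(f"{small:,} vs {large:,} entries ({large // small}x fewer)")
# 4,096 vs 1,048,576 entries (256x fewer)
```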

Instantiate aitextgen using the created tokenizer and config.

In [None]:
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

Generate config GenerationConfig {
  "bos_token_id": 0,
  "eos_token_id": 0,
  "transformers_version": "4.26.1"
}



You can build datasets for training by creating a `TokenDataset`, which automatically splits the text into subsets of the appropriate size.

In [None]:
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

100%|██████████| 40000/40000 [00:00<00:00, 86712.61it/s]


TokenDataset containing 462,820 subsets loaded from file at input.txt.
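As a rough mental model of what "subsets" means here, the sketch below cuts a token sequence into overlapping windows of `block_size` tokens, moving one token at a time. This is an assumption made for illustration: the exact windowing and counting inside `TokenDataset` may differ, and `sliding_blocks` is a hypothetical helper, not part of aitextgen.

```python
def sliding_blocks(tokens, block_size):
    """Return every run of `block_size` consecutive tokens, stepping by one."""
    return [
        tokens[i : i + block_size]
        for i in range(len(tokens) - block_size + 1)
    ]


token_ids = list(range(10))  # stand-in for an encoded text
blocks = sliding_blocks(token_ids, block_size=4)

print(len(blocks))   # 7
print(blocks[0])     # [0, 1, 2, 3]
print(blocks[-1])    # [6, 7, 8, 9]
```

Under this model the number of subsets is close to the number of tokens, which is why a single text file can yield hundreds of thousands of training examples.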

Train the model! It will save `pytorch_model.bin` periodically and after completion to the `trained_model` folder. On a 2020 8-core iMac, this took ~25 minutes to run.

The configuration below processes 400,000 subsets of tokens (8 * 50,000), which is roughly one pass through all the data (1 epoch). Ideally, you'll want multiple passes through the data and a training loss below `2.0` for coherent output. That is harder to reach when training a model from scratch, but with long enough training you can get there!
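The arithmetic behind that estimate can be checked directly. The sketch below uses the batch size and step count from the paragraph above and the subset count reported by the `TokenDataset` output; `steps_for_epochs` is a hypothetical helper for picking a step count, not an aitextgen function.

```python
import math

batch_size = 8
num_steps = 50_000
num_subsets = 462_820  # from the TokenDataset output above

subsets_processed = batch_size * num_steps
epochs = subsets_processed / num_subsets
print(f"{subsets_processed:,} subsets processed, about {epochs:.2f} epochs")
# 400,000 subsets processed, about 0.86 epochs


def steps_for_epochs(n_epochs, num_subsets, batch_size):
    """Steps needed to see roughly `n_epochs` passes over the data."""
    return math.ceil(n_epochs * num_subsets / batch_size)


print(steps_for_epochs(3, num_subsets, batch_size))  # 173558
```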

In [None]:
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)

pytorch_model.bin already exists in /trained_model and will be overwritten!
GPU available: False, used: False
TPU available: None, using: 0 TPU cores
**5,000 steps reached: saving model to /trained_model**
**5,000 steps reached: generating sample texts.**
's dead;
But is no winted in his northeritiff
Tave passage, and eleve your hours.

PETRUCHIO:
What is this I does, I will, sir;
That, you have, nor tolding we
**10,000 steps reached: saving model to /trained_model**
**10,000 steps reached: generating sample texts.**
.

QUEEN ELIZABETH:
I know, to, fair beat, to my soul is wonder'd intend.

KING RICHARD III:
Hold, and threaten, my lord, and my shame!

QUEEN ELIZAB
**15,000 steps reached: saving model to /trained_model**
**15,000 steps reached: generating sample texts.**
s of capitcts!

EDWARD:
Gardener, what is this hour will not say.
What, shall the joint, I pray, if they
Harry, let bid me as he would readness so.

B
**20,000 steps reached: saving model to /t**

Generate text from your trained model!

In [None]:
ai.generate(10, prompt="ROMEO:")

[1mROMEO:[0m
Abook, ho! forthing me, gentle Earl's royal king,
And this, I, with that I do not beseech you
To visit the battle, that I should believe you,
Which I would never
[1mROMEO:[0m
Confound is gone, thou art a maid into the widow;
Put up my life and make me no harmony
And make thee I know uncle,
Unconted and curses: therefore in my
[1mROMEO:[0m
God push! but what days to see
The giving bleedom's heart I do? Therefore,
And most unless I had rather. He saddle
Take your cold shack down; and so far I
[1mROMEO:[0m
Persetain'd up the earth of mercy,
And never yet, the sun to make him all the
More than my battle.

ROMEO:
I warrant him, to know, we'll not do't, but hate me
[1mROMEO:[0m
Methinks I am a mile, and trench one
Thy winded makes, in faults and cast
With one to meether, of twenty days,
That in my waters, that f
[1mROMEO:[0m
O, here is such a woman guilty.

ROMEO:
I do not think it; I should be renowned
That I am in that which can controy
A bawd I take it to the purp

Once the model is trained, you can reload it at any time by providing the folder that holds the `pytorch_model.bin` model weights and the config, plus the tokenizer file.

In [None]:
ai2 = aitextgen(model_folder="trained_model",
                tokenizer_file="aitextgen.tokenizer.json")
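Before reloading, it can help to confirm that the files are actually on disk. The sketch below is a hypothetical pre-flight check, not part of aitextgen. It assumes the saved config is named `config.json` inside the model folder, which the text above does not state.

```python
from pathlib import Path


def missing_model_files(model_folder, tokenizer_file):
    """Return the expected files that are not present on disk."""
    expected = [
        Path(model_folder) / "pytorch_model.bin",  # weights, named in the text above
        Path(model_folder) / "config.json",        # assumed name of the saved config
        Path(tokenizer_file),
    ]
    return [str(p) for p in expected if not p.is_file()]
```

For example, `missing_model_files("trained_model", "aitextgen.tokenizer.json")` returns an empty list when everything is in place, and otherwise lists the paths that still need to be produced by the training and tokenizer steps.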

In [None]:
ai2.generate(10, prompt="ROMEO:")

[1mROMEO:[0m
Boy, unreacher, unhallupony, in Padua,
Untimely fall till I be learn'd.

ROMEO:
Fie, good friar, be quick, for I am,
I'll
[1mROMEO:[0m
I'll be plain, I am a tail of blessed wounds;
For I am dead, I have not borne to make
A couple of her fortune, but that I'll bear,
And say 'Ay, chur
[1mROMEO:[0m
And yet I am a resolution of my dear dear:
If I have not reason to do me say
I'll deny the sea of my body to answer,
And all thy tale, or I have my m
[1mROMEO:[0m
Intenty to a bawd of my bait,--

JULIET:
No, I hope to know the title,
For that I wish her place.

JULIET:
Do I assure her?
[1mROMEO:[0m
O, what's the parle that I chide thee,
That honourable may be, that I have still'd thee:
I pray thee, my lord.

MERCUTIO:
I', my lord.

ROMEO:
Here is a
[1mROMEO:[0m
And, for I am, and not talk of that?

ROMEO:
Where's my child, I would guess thee here.

JULIET:
Nay, boy, I'll not be bowling why I;
O thou
[1mROMEO:[0m
O, but thou hast seen thee of mine own.

ROMEO:
I would 

# MIT License

Copyright (c) 2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.