#  aitextgen — Train a Custom GPT-2 Model + Tokenizer w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: May 16th, 2021 (aitextgen v0.5.2)*

Train a custom GPT-2 model **for free on a GPU in Colaboratory** using `aitextgen`!

It's recommended to only create a model from scratch if you really need to do so; otherwise, [finetuning 124M](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing) may give you better results.

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [None]:
!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer


## GPU

Colaboratory uses an Nvidia P4, an Nvidia T4, or an Nvidia P100 GPU. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM.

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
!nvidia-smi

Mon Jun 14 07:29:03 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
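
If you would rather read the GPU name and memory from Python than from the table, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The helper names `parse_gpu_query` and `list_gpus` are hypothetical and not part of aitextgen, and the sample line in the usage note is illustrative rather than real output.

```python
import subprocess

# nvidia-smi flags that print one "name, memory" CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_query(text):
    """Parse lines such as 'Tesla T4, 15109' into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so a GPU name containing commas survives.
        name, memory = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, int(memory)))
    return gpus


def list_gpus():
    """Run nvidia-smi and return the parsed GPU list (needs an Nvidia driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)
```

For example, `parse_gpu_query("Tesla T4, 15109\n")` returns `[("Tesla T4", 15109)]`; in a GPU runtime, `list_gpus()` should return the same structure for the active card.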
                                                                               

## Mounting Google Drive

The best way to get your training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in/out. (It will ask for an auth code; that auth is not saved anywhere.)

In [None]:
mount_gdrive()

Mounted at /content/drive


## Uploading a Text File to be Trained to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/w3wvHhR.png)

Upload **any smaller text file** (for example, [a text file of Shakespeare plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)) and update the file name in the cell below, then run the cell.

In [None]:
file_name = "/content/drive/MyDrive/A5/Mémoire/notebooks/data/tokenized_training_data.csv"

If your text file is larger than 10MB, it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

In [None]:
# copy_file_from_gdrive(file_name)
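
As a small sketch of the 10MB rule of thumb above, the helper below checks a file's size before you decide how to get it into the VM. The name `should_copy_from_gdrive` and the exact byte threshold are assumptions for illustration; aitextgen does not ship this function.

```python
import os

# 10MB threshold from the guidance above, expressed in bytes (assumed binary MB).
SIZE_LIMIT_BYTES = 10 * 1024 * 1024


def should_copy_from_gdrive(path, limit=SIZE_LIMIT_BYTES):
    """Return True if the file is larger than the limit and is better routed via Drive."""
    return os.path.getsize(path) > limit
```

You could call `should_copy_from_gdrive("input.txt")` on your local copy before uploading; a `True` result suggests using the Google Drive route.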

## Training the Tokenizer

Now we can train a Byte-Pair Encoding tokenizer on the dataset we just uploaded. The `train_tokenizer()` function wraps the training method of the `tokenizers` package from Hugging Face.

After the training is completed, this will save one file: **aitextgen.tokenizer.json**, which is needed to rebuild the tokenizer.

```python 
def train_tokenizer(
    files: Union[str, List[str]],
    dropout: float = None,
    vocab_size: int = 1000,
    min_frequency: int = 2,
    prefix: str = "aitextgen",
    save_path: str = "",
    added_tokens: List[str] = [],
    bos_token: str = "<|endoftext|>",
    eos_token: str = "<|endoftext|>",
    unk_token: str = "<|endoftext|>",
    serialize: bool = True,
    trim_offsets: bool = True,
) -> None:
    """
    Tokenizes the text(s) as a tokenizer, wrapping the tokenizer package.
    See: https://huggingface.co/blog/how-to-train
    For consistency, this function makes opinionated assumptions.
    :param files: path to file(s) to train tokenizer on
    :param dropout: Training dropout
    :param vocab_size: Final vocabulary size
    :param min_frequency: Minimum number of occurrences to add to vocab
    :param prefix: File name prefix of the final tokenizer
    :param save_path: Where to save the final tokenizer
    :param added_tokens: List of tokens to add to the tokenizer (currently not working)
    :param bos_token: Beginning-of-string special token
    :param eos_token: End-of-string special token
    :param unk_token: Unknown special token
    """
```

In [None]:
VOCAB_SIZE = 1000

In [None]:
train_tokenizer(file_name, vocab_size=VOCAB_SIZE)
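
To give a feel for what Byte-Pair Encoding training does, here is a toy sketch in plain Python. It is not the `tokenizers` implementation (which works on bytes and is far more optimized); it only shows the core loop of repeatedly merging the most frequent adjacent pair. The function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def toy_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from whitespace-separated words."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = toy_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(dict(words))
```

Each merge adds one entry to the vocabulary, which is why a larger `vocab_size` lets the tokenizer learn longer, more specific tokens.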

## Specify a Model Configuration

You can use `build_gpt2_config` to specify a model configuration. You most likely will want to adjust `max_length` (context window size) and `n_embd` (embedding size).

The config used here is the one used to build a [demo Reddit](https://github.com/minimaxir/aitextgen/blob/master/notebooks/reddit_demo.ipynb) model.

```python
def build_gpt2_config(
    vocab_size: int = 10000,
    bos_token_id: int = 0,
    eos_token_id: int = 0,
    max_length: int = 1024,
    dropout: float = 0.0,
    **kwargs
):
    """
    Builds a custom GPT-2 config based on a given Transformers config,
    with a few more user-friendly aliases.
    """
```

In [None]:
config = build_gpt2_config(vocab_size=VOCAB_SIZE, max_length=350)
config

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.0,
  "bos_token_id": 0,
  "embd_pdrop": 0.0,
  "eos_token_id": 0,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 350,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 350,
  "resid_pdrop": 0.0,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.0,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.6.1",
  "use_cache": true,
  "vocab_size": 1000
}
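
To see how `vocab_size`, `max_length`, and `n_embd` drive model size, the sketch below estimates the parameter count from the config values printed above. It assumes the standard GPT-2 layout: an inner MLP width of `4 * n_embd` (what `"n_inner": null` falls back to) and an output layer that shares weights with the token embedding. The function name is hypothetical.

```python
def estimate_gpt2_params(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters for a standard GPT-2 architecture."""
    token_embedding = vocab_size * n_embd
    position_embedding = n_positions * n_embd
    # Per block: attention (4*d^2 + 4*d), MLP (8*d^2 + 5*d), two LayerNorms (4*d).
    per_layer = 12 * n_embd**2 + 13 * n_embd
    final_layer_norm = 2 * n_embd
    return (
        token_embedding
        + position_embedding
        + n_layer * per_layer
        + final_layer_norm
    )


# Values taken from the GPT2Config output above.
print(estimate_gpt2_params(vocab_size=1000, n_positions=350, n_embd=768, n_layer=12))
# prints 86092800
```

Almost all of the roughly 86M parameters sit in the 12 transformer blocks, so with a vocabulary this small, reducing `n_embd` or `n_layer` (passed through `**kwargs`) shrinks the model far more than changing `vocab_size` or `max_length`.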

## Instantiating Your Custom GPT-2 Model

Pass all the information to `aitextgen()` and you're good to go!

In [None]:
ai = aitextgen(config=config,
               tokenizer_file="aitextgen.tokenizer.json",
               to_gpu=True)

06/14/2021 07:29:49 — INFO — aitextgen — Constructing model from provided config.
06/14/2021 07:29:52 — INFO — aitextgen — GPT2 loaded with 86M parameters.
06/14/2021 07:29:52 — INFO — aitextgen — Using a custom tokenizer.


Generated output from it will be effectively random, for now.

In [None]:
ai.generate(5)

) 697� 718 362 410 587 477 444� 78547 362 658 542 362N 362 754 302 15 245 730� 545. 245 386� 376 470} 403 518 376 806� 205� 386 58��� 386 376 905 727� 302 489� 658 523 658 239 15 658 651 730� 545 365 223 607 237� 727 245) 380 58 718 380� 905� 569 730 469 523 680 905) 730 628�
 523 458 380 604 302 754 376� 905 245 376 226 550 727� 680 628� 469 589 403 63639 730 727 245�O 727 376 117 169 15 905 325 695 302 658 302)�� 226 727 628� 466� 550 469 325 494 355. 73041} 658 695 600�. 811 391 169� 452 677 600 376 376 607 469 469 458 727�� 248 59 628 658 905 452� 302 226) 636 754 542 42 569w 391 658 489 355 198 754 469 722z 380� 550[ 607 226 636) 607 238 276 547 730 501 302� 226 523 386 413 680 376 452 245 452 637 228� 169 444� 246 660 489� 905 42 387 302[� 7308 636 435 299 628 730L 501[� 651 246 680� 73 169 376 628
� 222 362 8d� 362 39 308d 237 458 308n 696 734 509} 734 289� 255 362 696 587 239N 245 362 255 35 628 444F,O�� 665	O} 744 289 239 619} 538d� 619 431 386 582 308O� 245 431�� 7

## Train GPT-2

The next cell will start the actual training of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after roughly 4 hours; if you did not mount Google Drive, make sure you end training and save the results so you don't lose them! (If this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup).)

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. _Unlike finetuning, since you are using a small model, you can massively increase the batch size to normalize the training_.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
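
When choosing `num_steps` and `batch_size`, it can help to estimate how many passes over the dataset they imply. The sketch below assumes each training step consumes `batch_size` rows, which is a simplification of how aitextgen samples data; `approx_epochs` is a hypothetical helper, not an aitextgen function.

```python
def approx_epochs(num_steps, batch_size, num_rows):
    """Rough number of passes over the dataset, assuming batch_size rows per step."""
    return num_steps * batch_size / num_rows


# Example: 25,646 rows, batch size 4, and num_steps = 25646 * 5.
print(approx_epochs(num_steps=25646 * 5, batch_size=4, num_rows=25646))
# prints 20.0
```

Under that assumption, the example corresponds to about 20 passes over the data; raising `batch_size` while keeping `num_steps` fixed increases this proportionally.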


In [None]:
ai.train(file_name,
         line_by_line=True,
         from_cache=False,
         num_steps=25646*5,
         generate_every=1000,
         save_every=1000,
         save_gdrive=False,
         learning_rate=1e-3,
         batch_size=4  
)

06/14/2021 07:30:02 — INFO — aitextgen — Loading text from /content/drive/MyDrive/A5/Mémoire/notebooks/data/tokenized_training_data.csv with generation length of 350.



06/14/2021 07:30:03 — INFO — aitextgen.TokenDataset — Encoding 25,646 rows from /content/drive/MyDrive/A5/Mémoire/notebooks/data/tokenized_training_data.csv.
06/14/2021 07:30:15 — INFO — pytorch_lightning.utilities.distributed — GPU available: True, used: True
06/14/2021 07:30:15 — INFO — pytorch_lightning.utilities.distributed — TPU available: False, using: 0 TPU cores
06/14/2021 07:30:15 — INFO — pytorch_lightning.accelerators.gpu — LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]






1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
8 999 406 485 444 513 490 499
2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
2 495 502 450 457 484 513 509 510 454 469 506 509 486 495 523 475 486 521 460 493 515 499 530 458 486 478 465 520 523 507 546 513 484 514 486 480 478 490 482 484 465 507 528 542 530 469 486 514 530 541 510 512 483 498 516 521 531 486 469 478 466 490 487 499 528 495 547 513 515 499 486 560 480 518 514 490 516 481 506 510 527 495 493 513 468 478 499 518 520 482 487 518 523 486 458 488 499 495 504 518 471 521 454 458 537 458 480 547 465 482 500 478 494 499 511 515 523 490 458 485 558 480 523 520 511 502 534 465 491 507 511 513 528 507 486 486 506 478 523 454 465 493 493 480 528 498 531 547 541 518 482 483 500 517 514 511 528 514 537 479 495 465 469 469 454 505 469 523 486 510 513 523 507 518 469 546 516 472 458 547 509 496 560 465 480 


You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your trained model.


## Load a Trained Model

Running the next cell will copy the `pytorch_model.bin`, `config.json`, and `aitextgen.tokenizer.json` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the files are located at the root level of your Google Drive.)

In [None]:
from_folder = None

for file in ["pytorch_model.bin", "config.json", "aitextgen.tokenizer.json"]:
  if from_folder:
    copy_file_from_gdrive(file, from_folder)
  else:
    copy_file_from_gdrive(file)

FileNotFoundError: ignored
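
If the copy or the load step fails, a quick check of which files are actually present can narrow down the cause. This is a minimal sketch; `missing_model_files` is a hypothetical helper, and the file names are the three copied in the cell above.

```python
from pathlib import Path

# The files the copy cell above expects to find.
EXPECTED_FILES = ("pytorch_model.bin", "config.json", "aitextgen.tokenizer.json")


def missing_model_files(folder=".", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]
```

Calling `missing_model_files(".")` in the VM returns an empty list when everything needed for loading is in place, and otherwise lists the files that still have to be copied.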

The next cell will allow you to load the retrained model + metadata necessary to generate text.

In [None]:
ai = aitextgen(model_folder=".",
               tokenizer_file="aitextgen.tokenizer.json",
               to_gpu=True)

AssertionError: ignored

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster generation performance if you reload the model; the next cell will reload the model you just trained from the `trained_model` folder.

In [None]:
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json",
               to_gpu=True)

06/14/2021 14:22:39 — INFO — aitextgen — Loading model from provided weights and config in /trained_model.
06/14/2021 14:22:41 — INFO — aitextgen — GPT2 loaded with 86M parameters.
06/14/2021 14:22:41 — INFO — aitextgen — Using a custom tokenizer.


`generate()` without any parameters generates a single text from the loaded model to the console.

In [None]:
ai.generate()

438 444 496 599 534 409 684 587 599 463 606 694 353 800 571 390 521 402 445 919 438 588 404 539 704 596 422 371 683 223 617 454 401 770 299 520 439 613 500 662 457 429 246 654 310 444 447 529 373 477 613 547 459 428 479 303 498 521 465 437 480 341 599 527 575 552 540 687 476 515 541 414 343 354 542 628 568 395 542 376 421 514 584 510 413 345 450 475 259 412 447 539 487 531 684 411 286 567 673 694 520 682 480 471 631 539 278 672 477 808 527 460 541 489 440 767 585 606 411 318 703 471 627 704 493 410 602 390 142 440 127 298 27 977 505 735 631 442 648 470 411 612 585 590 416 688 367 449 597 579 539 529 595 425 456 670 583 539 554 452 479 583 533 503 561 489 452 523 496 505 435 515 595 443 497 490 661 534 475 482 558 450 553 529 488 391 442 519 539 473 463 456 499 533 532 446 444 482 553 559 565 555 452 452 503 524 470 496 480 525 507 509 491 545 431 576 489 483 454 433 516 586 466 430 458 342 516 415 543 488 580 601 497 593 517 494 491 524 454


If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2, but it will be _much_ slower)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
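
To make `temperature`, `top_k`, and `top_p` more concrete, here is a simplified, pure-Python sketch of how they reshape the next-token distribution. It is not the code aitextgen or Transformers actually runs; the function name is hypothetical, and real implementations operate on tensors.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: dividing logits by a larger value flattens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]
print(filtered_distribution(logits, top_k=2))
print(filtered_distribution(logits, top_p=0.9))
print(filtered_distribution(logits, temperature=2.0))
```

The next token is then sampled from the returned distribution, so tighter `top_k`/`top_p` values remove unlikely tokens entirely while `temperature` only changes how strongly the likely ones are favored.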

In [None]:
ai.generate(n=1,
            batch_size=5,
            prompt="123 12",
            temperature=1.0,
            top_p=0.9)

123 129 683 494 604 481 363 552 468 567 487 581 555 531 481 534 478 528 446 481 513 416 505 637 519 479 524 467 542 589 500 563 390 536 516 523 496 497 482 520 466 542 617 544 489 480 486 434 481 533 491 544 515 474 461 546 336 542 378 509 583 434 588 548 540 517 489 543 454 495 465 486 603 524 501 501 488 506 497 488 489 528 449 493 494 488 507 489 364 509 394 501 471 498 473 506 477 528 578 503 541 473 385 478 507 477 535 438 338 479 514 520 536 508 509 479 476 498 539 458 520 369 420 547 428 578 598 426 599 522 378 524 508 581 560 518 351 567 445 495 528 431 469 561 516 519 498 569 488 513 488 601 165 333 581 430 527 460 415 501 640 426 549 523 535 482 494 518 324 533 528 507 591 447 523 446 607 563 469 585 546 556 511 451 547 505 445 506 468 473 466 553 501 540 547 440 388 439 507 585 467 546 578 550 500 556 513 466 487 525 447 545 506 516 534 475 471 400 531 588 540 442 584 466 514 487 445 479 536 464 488 501 515 515 537 521 547 503 524 431 409 538 471 485


For bulk generation, you can generate a large amount of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
num_files = 5

for _ in range(num_files):
  ai.generate_to_file(n=1000,
                     batch_size=50,
                     prompt="ROMEO:",
                     temperature=1.0,
                     top_p=0.9)

04/19/2021 01:59:34 — INFO — aitextgen — Generating 1,000 texts to ATG_20210419_015934_63053824.txt
04/19/2021 02:00:04 — INFO — aitextgen — Generating 1,000 texts to ATG_20210419_020004_96905312.txt
04/19/2021 02:00:33 — INFO — aitextgen — Generating 1,000 texts to ATG_20210419_020033_47861406.txt
04/19/2021 02:01:03 — INFO — aitextgen — Generating 1,000 texts to ATG_20210419_020103_74636692.txt
04/19/2021 02:01:33 — INFO — aitextgen — Generating 1,000 texts to ATG_20210419_020133_64281888.txt

# LICENSE

MIT License

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.