#  aitextgen — Train a GPT-2 Text-Generating Model w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: Jul 5th, 2020*

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


In [6]:
import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen

## GPU


In [7]:
!nvidia-smi

Thu Sep 17 23:01:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro P5000        On   | 00000000:00:05.0 Off |                  Off |
| 27%   36C    P8     6W / 180W |   9635MiB / 16278MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Loading GPT-2

If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2, but aitextgen currently only works with the smallest one:

* `124M` (default): the "small" model, 500MB on disk.


In [8]:
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

09/17/2020 23:01:59 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.
09/17/2020 23:02:04 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


In [9]:
file_name = "shakespeare.txt"

Additionally, you may want to consider [compressing the dataset to a cache first](https://docs.aitextgen.io/dataset/) on your local computer, then uploading the resulting `dataset_cache.tar.gz` and setting `file_name` in the previous cell to that file.

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (lower is better), and average loss (to give a sense of the loss trajectory).
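To illustrate why an average loss is useful next to the current loss, here is a minimal sketch of an exponential moving average over per-step losses. The function name `smooth_losses` and the smoothing factor `beta` are assumptions made for illustration; this notebook does not show the exact averaging formula aitextgen uses.

```python
def smooth_losses(losses, beta=0.9):
    """Return an exponential moving average of a sequence of per-step losses.

    `beta` is a hypothetical smoothing factor: higher values keep more history.
    """
    averaged = []
    avg = None
    for loss in losses:
        # Seed the average with the first loss, then blend in each new value.
        avg = loss if avg is None else beta * avg + (1 - beta) * loss
        averaged.append(avg)
    return averaged


# Made-up noisy losses for illustration only, not values from a real run.
noisy = [4.0, 3.2, 3.9, 2.8, 3.1, 2.5]
for step, (cur, avg) in enumerate(zip(noisy, smooth_losses(noisy)), start=1):
    print(f"step {step}: loss={cur:.2f} avg={avg:.2f}")
```

The current loss jumps around from step to step, while the averaged value moves slowly, which makes the overall trend easier to read.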

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount Google Drive, make sure you end training and save the results so you don't lose them! (If this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup).)

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
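As a rough picture of what "a single-column CSV, with one record per row" means for `line_by_line=True`, the sketch below reads such a file with Python's standard `csv` module. The helper name `read_single_column_csv` and the sample contents are hypothetical, and this is not aitextgen's own loading code.

```python
import csv
import io


def read_single_column_csv(handle):
    """Yield one text record per CSV row, skipping empty rows."""
    for row in csv.reader(handle):
        if row and row[0].strip():
            yield row[0]


# Hypothetical file contents; quoting lets a record contain commas.
sample = io.StringIO('"To be, or not to be"\n"All the world\'s a stage"\n')
records = list(read_single_column_csv(sample))
print(records)
```

Each row becomes its own training record, rather than the whole file being treated as one continuous text.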

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM.
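The interaction between `num_steps`, `generate_every`, and `save_every` can be sketched with simple modular arithmetic. The helper `training_events` is hypothetical and only models the intervals described above; it does not reproduce the library's internals, such as the extra save when training completes.

```python
def training_events(num_steps, generate_every, save_every):
    """Map each step that triggers something to the list of actions at that step."""
    events = {}
    for step in range(1, num_steps + 1):
        actions = []
        # An interval of 0 is treated here as "disabled" (an assumption).
        if save_every > 0 and step % save_every == 0:
            actions.append("save")
        if generate_every > 0 and step % generate_every == 0:
            actions.append("generate")
        if actions:
            events[step] = actions
    return events


for step, actions in training_events(200, 50, 100).items():
    print(step, actions)
```

With the values used in the next cell, samples are generated every 50 steps and the model is saved every 100 steps.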

In [10]:
ai.train(file_name,
         line_by_line=False,
         from_cache=False,
         num_steps=200,
         generate_every=50,
         save_every=100,
         save_gdrive=False,
         learning_rate=1e-4,
         batch_size=1,      
         )

09/17/2020 23:03:12 — INFO — aitextgen.TokenDataset — Encoding 25,230 sets of tokens from shakespeare.txt.
GPU available: True, used: True
09/17/2020 23:03:12 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
09/17/2020 23:03:12 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]

50 steps reached: generating sample texts.

COUKEBROKE:
How are you?

KING:
You are dead!

COUKEKE:
You are dead!
KING:
How are you?

COUKEKE:
My head.

KING:
Marry the king!

KING:
COUKE:
I'm dead!

COUKEKE:
How are you?

KING:
Marry the king!

COUKE:
Where is my king?

KING:
Where is my king?

COUKE:
I am dead.

KING:
Marry the king!

COUKE:
Where is my king?

KING:
A dead king.

COUKEKE:
Where are my king?

KING:
Marry the king!

COUKE:
I am dead!

KING:
I am dead!

COUKE:
I am dead!

KING:
Marry the king!


COUKE:
I am dead!

KING:
Marry the king!

COUKE:
100 steps reached: saving model to /trained_model
100 steps reached: generating sample texts.
We'll make it in.

The Mercutio:
We'll take it.

He's not the one who says 'We'll make it in,'
We'll take him back;
As for the one who loves us, we'll do it for him.

The Mercutio:
Our lives, our gods, are at liberty!

We'll do well; and we'll not lose our lives;
We'll do well to be gone.

Here is my noble head,
Bo

09/17/2020 23:04:49 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model



RICHARD:
What shall I do?

RICHARD:
Ah! what shall I do?

RICHARD:
What shall I do?

RICHARD:
What shall I do?

DUKE:
What, I am a woman?

RICHARD:
What, I am a man?

LORD:
I am a man!
I am a woman?

LORD:
I am a man!

RICHARD:
What, I am a man?

LORD:
I am a man!

LORD:
I am a woman!


LORD:
A man!

RICHARD:
A man!

LORD:
A woman!


LORD:
What is this?

RICHARD:
What is this?

LORD:
What is this?

RICHARD:
I think, I know not how
I am to know how
I am to know myself:
I cannot help but say I am a woman,
For I am a man, and I am a woman;
For


You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.


## Load a Trained Model

Running the next cell will copy the `pytorch_model.bin` and `config.json` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)

In [None]:
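# Helper from aitextgen's Colab utilities; requires Google Drive to be mounted.
from aitextgen.colab import copy_file_from_gdrive

# Set to a folder name in Google Drive, or leave as None for the Drive root.
from_folder = None

for file in ["pytorch_model.bin", "config.json"]:
  if from_folder:
    copy_file_from_gdrive(file, from_folder)
  else:
    copy_file_from_gdrive(file)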

The next cell will allow you to load the retrained model + metadata necessary to generate text.

In [11]:
ai = aitextgen(model="trained_model/pytorch_model.bin", config="trained_model/config.json", to_gpu=True)

09/17/2020 23:11:55 — INFO — aitextgen — Loading GPT-2 model from provided trained_model/pytorch_model.bin.
09/17/2020 23:12:01 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from a checkpoint, you can generate text. Calling `generate()` without any parameters generates a single text from the loaded model and prints it to the console.

In [12]:
ai.generate()


KING RICHARD II:
So shall thou be; that thou mayst live
In peace and harmony.

GLOUCESTER:
And you shall live peaceably,
And not in so much war.

KING RICHARD II:
And hence thou shalt live.

GLOUCESTER:
Your peace shall be like the peace of a king,
Which may last for ever.

KING RICHARD II:
You shall live peaceably, and live like a king.

GLOUCESTER:
And thou shalt live like a king.

KING RICHARD II:
And so shall I live.

GLOUCESTER:
And so shall I live.

KING RICHARD II:
And so shall I live.

GLOUCESTER:
And so shall I live.

KING RICHARD II:
And so shall I live.

GLOUCESTER:
And so shall I live.

KING RICHARD II:
And so shall I live.

GLOUCESTER:
And so shall I live.

KING RICHARD II


If you're creating an API based on your model and need to pass the generated text elsewhere, you can use `text = ai.generate_one()`.
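As a minimal sketch of passing the generated text elsewhere, the snippet below wraps `generate_one()` in a function that returns a JSON string. `StubModel` and `build_response` are hypothetical names; the stub stands in for a loaded model so the sketch runs without a GPU, and it assumes `generate_one()` accepts the same `prompt` keyword as `generate()`.

```python
import json


class StubModel:
    """Stand-in for a loaded aitextgen model so this sketch runs anywhere."""

    def generate_one(self, prompt=""):
        # A real model would return generated text that starts with the prompt.
        return prompt + " (generated text would appear here)"


def build_response(model, prompt=""):
    """Return a JSON string holding the prompt and one generated text."""
    text = model.generate_one(prompt=prompt)
    return json.dumps({"prompt": prompt, "text": text})


print(build_response(StubModel(), prompt="ROMEO:"))
```

In a notebook session, you would pass the loaded `ai` object instead of `StubModel()`.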

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2, but it will be _much_ slower)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to the most likely ones whose cumulative probability reaches `top_p` (`top_p=0.9` tends to give good results)
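To make `temperature`, `top_k`, and `top_p` more concrete, here is a small pure-Python sketch that applies all three to a toy set of token scores. The function `filtered_probs` and the toy values are assumptions for illustration; the real filtering happens inside the model's sampling code and may differ in detail.

```python
import math


def filtered_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return sampling probabilities after temperature, top-k and top-p filtering.

    `logits` maps each candidate token to its raw score.
    """
    # Temperature: dividing by a larger value flattens the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # top_k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / kept_mass) for tok, prob in ranked]

    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


# Toy scores for four candidate tokens (illustrative values only).
toy_logits = {"king": 2.0, "lord": 1.0, "duke": 0.5, "crown": -1.0}
print(filtered_probs(toy_logits, temperature=0.7, top_p=0.9))
print(filtered_probs(toy_logits, temperature=1.0, top_k=2))
```

Lower temperatures concentrate probability on the top token, while `top_k` and `top_p` remove unlikely tokens from consideration entirely.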

In [13]:
ai.generate(n=5,
            batch_size=5,
            prompt="ROMEO:",
            max_length=256,
            temperature=1.0,
            top_p=0.9)

ROMEO:
My Lord of England,
Whom thou hast slain; I'll fight thee
Of my name's flesh and blood.

GLOUCESTER:
O Lord! come I;
For thou wilt die, my Lord, or else
You die.

KING RICHARD III:
I am your sovereign, for I am
Your Lord of England.

GLOUCESTER:
O Lord! come I;
For thy death is my own death;
And I myself die, O my Lord of England!

KING RICHARD III:
Hush, my Lord of Gloucester!

GLOUCESTER:
Wilt thou let me die, for I have
Thyself, for God's sake, kill'd thee?

KING RICHARD III:
My God, my Lord of England,
Tut, the souls I kill, and my children,
Do not die; but let us live;
For I die, and not myself; but let us die.

GLOUCESTER:
O, that you did do, or that I did kill thy son.

KING RICHARD III:
ROMEO:
The good man is the one in whom you, like Romeo, will be pleased.

LADY CAPULET:
O, if it were not so, then let me know:
You, in what manner, as in your own
faith, are in that same faith!

CAPULET:
O my brother! let me tell you what
you are that are
in the same fait

For bulk generation, you can generate a large number of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!

In [10]:
num_files = 5

for _ in range(num_files):
  ai.generate_to_file(n=1000,
                     batch_size=50,
                     prompt="ROMEO:",
                     max_length=256,
                     temperature=1.0,
                     top_p=0.9)

09/17/2020 21:39:31 — INFO — aitextgen — Generating 1,000 texts to ATG_20200917_213931_50011073.txt

KeyboardInterrupt: 

# LICENSE

MIT License

Copyright (c) 2020 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.