
#  aitextgen — Train a Custom GPT-2 Model + Tokenizer w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: Jul 1st, 2020*

Train a custom GPT-2 model **for free on a GPU in Colaboratory** with `aitextgen`!

It's recommended to only create a model from scratch if you really need to do so; otherwise, [finetuning 124M](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing) may give you better results.

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [None]:
# Freeze versions of dependencies for now
!pip install transformers==2.9.1
!pip install pytorch-lightning==0.7.6

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer

05/18/2020 01:05:34 — INFO — transformers.file_utils — PyTorch version 1.5.0+cu101 available.
05/18/2020 01:05:35 — INFO — transformers.file_utils — TensorFlow version 2.2.0 available.


## GPU

Colaboratory assigns an Nvidia P4, an Nvidia T4, or an Nvidia P100 GPU. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM.

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
!nvidia-smi

Mon May 18 01:01:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

## Mounting Google Drive

The best way to get the training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in and out. (It will ask for an auth code; that auth is not saved anywhere.)

In [None]:
mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


## Uploading a Text File to be Trained to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/w3wvHhR.png)

Upload **any smaller text file** (for example, [a text file of Shakespeare plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)) and update the file name in the cell below, then run the cell.

In [None]:
file_name = "input.txt"

If your text file is larger than 10MB, it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

In [None]:
copy_file_from_gdrive(file_name)

## Training the Tokenizer

Now we can train a Byte-Pair Encoding tokenizer on the dataset we just uploaded. The `train_tokenizer()` function wraps the training method of the `tokenizers` package from Hugging Face.

After the training is completed, this will save two files: **aitextgen-vocab.json** and **aitextgen-merges.txt**, which are needed to rebuild the tokenizer.

In [None]:
train_tokenizer(file_name)

05/18/2020 01:03:18 — INFO — aitextgen.tokenizers — Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.
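To give a feel for what "training" a Byte-Pair Encoding tokenizer means, here is a minimal pure-Python sketch of the merge-learning loop. It is a toy, not what `train_tokenizer()` runs: it works on characters of whitespace-split words rather than bytes, and the function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(text, num_merges):
    """Learn up to `num_merges` BPE merge rules from whitespace-split words."""
    # Represent each word as a tuple of symbols, starting from single characters.
    word_counts = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Fuse that pair into one symbol everywhere it occurs.
        merged_counts = Counter()
        for symbols, count in word_counts.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_counts[tuple(out)] += count
        word_counts = merged_counts
    return merges


merges = learn_bpe_merges("low low low lower lowest", num_merges=2)
print(merges)
```

The ordered list of merges plays the role of the merges file, and the set of symbols that remain plays the role of the vocabulary, which is why both files are needed to rebuild the tokenizer.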


## Specify a Model Configuration

You can use `build_gpt2_config` to specify a model configuration. You most likely will want to adjust `max_length` (context window size) and `n_embd` (embedding size).

The config used here is the one used to build a [demo Reddit](https://github.com/minimaxir/aitextgen/blob/master/notebooks/reddit_demo.ipynb) model.

In [None]:
config = build_gpt2_config(vocab_size=5000, max_length=32, dropout=0.0, n_embd=256, n_layer=8, n_head=8)
config

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.0,
  "bos_token_id": 0,
  "embd_pdrop": 0.0,
  "eos_token_id": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 32,
  "n_embd": 256,
  "n_head": 8,
  "n_layer": 8,
  "n_positions": 32,
  "resid_pdrop": 0.0,
  "summary_activation": null,
  "summary_first_dropout": 0.0,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 5000
}
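To get a sense of how small this configuration is, the parameter count can be estimated from the config values alone. The sketch below assumes the standard GPT-2 layout (learned token and position embeddings, a 4x-wide MLP, and an output layer that shares weights with the token embeddings); `gpt2_param_count` is a hypothetical helper, not part of aitextgen.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters of a standard GPT-2 (tied output layer)."""
    d = n_embd
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d + n_positions * d
    # Attention: fused Q/K/V projection and output projection, with biases.
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: expand to 4*d and project back to d, with biases.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d)
    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_block + final_layer_norm


# Values from the config above; n_head does not change the parameter count.
print(f"{gpt2_param_count(5000, 32, 256, 8):,}")  # 7,606,784
```

Under the same assumptions, the 124M model (`vocab_size=50257`, `n_positions=1024`, `n_embd=768`, `n_layer=12`) comes out at 124,439,808 parameters, so the config used here is roughly 16 times smaller.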

## Instantiating Your Custom GPT-2 Model

Pass all the information to `aitextgen()` and you're good to go!

In [None]:
ai = aitextgen(config=config,
               vocab_file="aitextgen-vocab.json",
               merges_file="aitextgen-merges.txt",
               to_gpu=True)

05/18/2020 01:05:43 — INFO — aitextgen — Constructing GPT-2 model from provided config.
05/18/2020 01:05:43 — INFO — aitextgen — Using a custom tokenizer.


Generated output from it will be effectively random, for now.

In [None]:
ai.generate(5)

 sick laboun ent bark ifry rgeloICHARD humour wrathashash bluntAbVOL'? aughtARIELgleeiancaianca ent circum voice� otherPER
beastard sickirg gates fameGREENryaints ent ent appear appear,' execution loving sacoun chiefuranceHisisageloaints gatesMyself expourish deointed cher
 not�arentWas arrived deep BOL BaduaBENCAS An voice He wisdom denied forth deputyoryeakCOMINIUSCOMINIUS ourselveslerreatBEN Gentleman house madam Edward g
 deputy ifaul woes He deniedvel butchergl butchergare Isab sovereLet dep welcome place finds feastointed words royal denied powerlet ROSthee alfall arrived sk
ursday evenum There loyal arrived forth� He bones shrLike ways work birth whAMIL wh ve note deliverexTwereundredCOMINIUS welcome thing Heen wisdom free


## Train GPT-2

The next cell will start the actual training of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, the current loss (lower is better), and the average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after about 4 hours; if you did not mount Google Drive, make sure you end training and save the results so you don't lose them! (If this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup).)

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. _Unlike finetuning, since you are using a small model, you can massively increase the batch size to normalize the training_.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
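Before starting a long run, it can help to work out what a given combination of these parameters implies. The sketch below is plain arithmetic, not aitextgen code: `training_plan` is a hypothetical helper, and the token figure is an upper bound that assumes every sequence fills the whole 32-token context window from the config above.

```python
def training_plan(num_steps, batch_size, context_length, save_every, generate_every):
    """Summarize how much data a run processes and when it saves/generates."""
    return {
        # One batch of sequences is processed per step.
        "sequences_seen": num_steps * batch_size,
        # Upper bound: assumes each sequence is exactly context_length tokens.
        "max_tokens_seen": num_steps * batch_size * context_length,
        # Steps at which the interval-based save and sample generation fire.
        "save_steps": list(range(save_every, num_steps + 1, save_every)),
        "generate_steps": list(range(generate_every, num_steps + 1, generate_every)),
    }


plan = training_plan(
    num_steps=5000,
    batch_size=256,
    context_length=32,
    save_every=1000,
    generate_every=1000,
)
for key, value in plan.items():
    print(key, value)
```

With these values the run processes 1,280,000 sequences and checkpoints five times, so halving `batch_size` to avoid an OOM error also halves the amount of text the model sees unless `num_steps` is doubled.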


In [None]:
ai.train(file_name,
         line_by_line=False,
         num_steps=5000,
         generate_every=1000,
         save_every=1000,
         save_gdrive=False,
         learning_rate=1e-4,
         batch_size=256,
         )

GPU available: True, used: True
05/18/2020 01:05:48 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
05/18/2020 01:05:48 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]



1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
?

Nurse:
If it be, I shall be so with thy eyes.

KING EDWARD IV:
Sir, I do not,
2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
, and
mot of the sea, and with a cidenter, the
dap of the queen. Do thou art a most most
3,000 steps reached: saving model to /trained_model
3,000 steps reached: generating sample texts.
!

ISABELLA:
I am the house: I thought it is now in.

ISABELLA:
I do pray you do't.

4,000 steps reached: saving model to /trained_model
4,000 steps reached: generating sample texts.
 like a man with his: if he have he
your own son, we say, who would have his good
onown, and that with all
5,000 steps reached: saving model to /trained_model
5,000 steps reached: generating sample texts.


05/18/2020 01:23:37 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model


 with her.

ROMEO:
This is a thousand-dicicers,
And thou soest me with thy father's face;



You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text with your newly trained model.


## Load a Trained Model

Running the next cell will copy the `pytorch_model.bin`, `config.json`, `aitextgen-vocab.json`, and `aitextgen-merges.txt` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the files are located at the root level of your Google Drive.)

In [None]:
from_folder = None

for file in ["pytorch_model.bin", "config.json", "aitextgen-vocab.json", "aitextgen-merges.txt"]:
  if from_folder:
    copy_file_from_gdrive(file, from_folder)
  else:
    copy_file_from_gdrive(file)

The next cell will load the trained model and the metadata necessary to generate text.

In [None]:
ai = aitextgen(model="pytorch_model.bin",
               config="config.json",
               vocab_file="aitextgen-vocab.json",
               merges_file="aitextgen-merges.txt",
               to_gpu=True)

05/17/2020 20:27:21 — INFO — aitextgen — Loading GPT-2 model from provided pytorch_model.bin.
05/17/2020 20:27:26 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


## Generate Text From The Trained Model

After you've trained the model or loaded a trained model from a checkpoint, you can generate text. `generate()` without any parameters prints a single generated text from the loaded model to the console.

In [None]:
ai.generate()

'd bent, the wind of the duke.

DUKE VINCENTIO:
A gentleman, that she shall find so much to be
cilad


If you're creating an API based on your model and need to pass the generated text elsewhere, you can use `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2, but it will be _much_ slower)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
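To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a minimal pure-Python sketch of how each one reshapes a next-token distribution before sampling. It illustrates the general techniques on a toy list of scores; the function names are hypothetical and this is not aitextgen's own implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens; k <= 0 disables the filter."""
    if k <= 0:
        return list(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))    # only the two most likely tokens survive
print(top_p_filter(probs, 0.9))  # the least likely token is dropped
print(softmax_with_temperature([2.0, 1.0, 0.0], 0.7))
```

Note that `top_k` always keeps a fixed number of candidates, while `top_p` keeps more candidates when the model is uncertain and fewer when it is confident.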

In [None]:
ai.generate(n=5,
            batch_size=5,
            prompt="ROMEO:",
            temperature=1.0,
            top_p=0.9)

[1mROMEO:[0m
This is the man that was from the world;
Which, if of some mean to love, you shall have
My brother York, to
[1mROMEO:[0m
O and by my soul!

JULIET:
Good noble queen!

FRIAR LAURENCE:
So long as to live a night on
[1mROMEO:[0m
I will, by this day. What's some?

GRUMIO:
A very little.

FRIAR LAURENCE:
Let me be
[1mROMEO:[0m
What is the cay that that is in my crown?

KING HENRY VI:
O Reter, what an hour's thou canst
[1mROMEO:[0m
Well, do you say; 'tis one thing to be brief,
And I will make a trobs to you such a tongue.


For bulk generation, you can generate a large amount of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!
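Once the generated files are downloaded, sorting through the samples locally could start with something like the sketch below, which splits a file's text into samples and drops exact duplicates. The delimiter (a line of twenty `=` characters) is an assumption about the file layout, so check one of your files and adjust it; `unique_samples` is a hypothetical helper, not part of aitextgen.

```python
def unique_samples(raw_text, delimiter="=" * 20, min_chars=1):
    """Split generated text into samples, dropping blanks and exact duplicates."""
    seen, kept = set(), []
    for chunk in raw_text.split(delimiter):
        sample = chunk.strip()
        # Skip empty or too-short chunks and anything already collected.
        if len(sample) >= min_chars and sample not in seen:
            seen.add(sample)
            kept.append(sample)
    return kept


# Tiny in-memory stand-in for the contents of one downloaded file.
raw = "ROMEO:\nA\n" + "=" * 20 + "\nROMEO:\nB\n" + "=" * 20 + "\nROMEO:\nA\n"
samples = unique_samples(raw)
print(len(samples), "unique samples")
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the result in place of `raw`.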

In [None]:
num_files = 5

for _ in range(num_files):
  ai.generate_to_file(n=1000,
                     batch_size=50,
                     prompt="ROMEO:",
                     temperature=1.0,
                     top_p=0.9)

05/18/2020 01:23:39 — INFO — aitextgen — Generating 1,000 texts to ATG_20200518_012339_63155218.txt
05/18/2020 01:23:46 — INFO — aitextgen — Generating 1,000 texts to ATG_20200518_012346_42487021.txt
05/18/2020 01:23:53 — INFO — aitextgen — Generating 1,000 texts to ATG_20200518_012353_48292245.txt
05/18/2020 01:24:00 — INFO — aitextgen — Generating 1,000 texts to ATG_20200518_012400_47144218.txt
05/18/2020 01:24:07 — INFO — aitextgen — Generating 1,000 texts to ATG_20200518_012407_46059569.txt




# LICENSE

MIT License

Copyright (c) 2020 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.