# The GPT Language Model

## Colab Setup

You can skip this section if you are not running on Google Colab.

If running with GPUs, sanity check that the GPUs are enabled.

In [None]:
!nvidia-smi

In [1]:
import torch
torch.cuda.is_available()

True

This should be True. If not, debug (note: the version of PyTorch I used is not compatible with the CUDA drivers on Colab; follow these instructions here explicitly).

In [2]:
!pwd

/content


This should be "/content" on Colab.

First, if running from Colab, you must install the package. (You may skip this step if you have already installed it.)

In [None]:
!git clone --single-branch --branch colab https://github.com/will-thompson-k/deeplearning-nlp-models.git
%cd deeplearning-nlp-models

In [None]:
!pip install datasets

In [None]:
!python setup.py install

## Imports

Here are the packages we need to import.

In [6]:
from nlpmodels.models import gpt
from nlpmodels.utils import train,utils,gpt_sampler
from nlpmodels.utils.elt import gpt_dataset
from argparse import Namespace
utils.set_seed_everywhere()

## Language Model: WikiText2

We will train our transformer model to predict the next word in the torchtext WikiText2 dataset.
I took the first 300k from the training set to reduce computation time.

### Hyper-parameters

These are the data processing and model training hyper-parameters for this run. Note that we are running a smaller model
than the one cited in the paper, for fewer iterations. This is meant merely to demonstrate that it works.

In [16]:
args = Namespace(
    # Model hyper-parameters
    num_layers_per_stack=4,  # original value = 12
    dim_model=512,  # original value = 768
    dim_ffn=2048,  # original value = 3072
    num_heads=4,  # original value = 12
    block_size=64,  # original value = 512, context window
    dropout=0.1,
    # Training hyper-parameters
    num_epochs=10,  # obviously super short
    learning_rate=0.0,
    batch_size=64,  # original value = 64
)
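
To get a feel for how much smaller this configuration is, here is a rough back-of-the-envelope weight count for the transformer blocks alone. It is a sketch under assumptions: each block is taken to have four `dim_model x dim_model` attention projections and a two-layer feed-forward network, while biases, layer norms, and embeddings are ignored. `approx_block_params` is a hypothetical helper, not part of `nlpmodels`.

```python
def approx_block_params(num_layers: int, dim_model: int, dim_ffn: int) -> int:
    """Approximate weight count of the transformer blocks (no biases, norms, or embeddings)."""
    attention = 4 * dim_model * dim_model    # Q, K, V and output projections (assumed layout)
    feed_forward = 2 * dim_model * dim_ffn   # two linear layers: up-projection and down-projection
    return num_layers * (attention + feed_forward)

small = approx_block_params(num_layers=4, dim_model=512, dim_ffn=2048)
original = approx_block_params(num_layers=12, dim_model=768, dim_ffn=3072)
print(f"this run: {small:,}")                # this run: 12,582,912
print(f"original: {original:,}")             # original: 84,934,656
print(f"ratio:    {original / small:.2f}x")  # ratio:    6.75x
```

Under these assumptions, the blocks in this run hold roughly one seventh of the weights of the original configuration.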

In [None]:
train_loader, vocab = gpt_dataset.GPTDataset.get_training_dataloader(args)
model = gpt.GPT(vocab_size=len(vocab),
                num_layers_per_stack=args.num_layers_per_stack,
                dim_model=args.dim_model,
                dim_ffn=args.dim_ffn,
                num_heads=args.num_heads,
                block_size=args.block_size,
                dropout=args.dropout)
trainer = train.GPTTrainer(args, vocab.mask_index, model, train_loader, vocab)

## Self-supervised training

Now we will run the first step in the GPT training process, where we train the model to
maximize the objective

```
max p(x[k] | x[k-1], x[k-2], ..., x[k-block_size])
```

This is an unsupervised (more aptly described as "self-supervised") loss. After this model is trained,
we can continue training it on another problem (optionally freezing layers so that only the top layers keep training).
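
To make the objective concrete, the sketch below shows one common way to turn a token stream into next-token training examples: every window of `block_size` tokens is an input, and the same window shifted one position to the right is its target. This illustrates the idea under that assumption and is not necessarily how `gpt_dataset.GPTDataset` builds its batches; `make_next_token_pairs` is a hypothetical helper.

```python
from typing import List, Tuple

def make_next_token_pairs(tokens: List[int], block_size: int) -> List[Tuple[List[int], List[int]]]:
    """Slide a window over `tokens`; each target is its source shifted right by one token."""
    pairs = []
    for start in range(len(tokens) - block_size):
        source = tokens[start:start + block_size]
        target = tokens[start + 1:start + block_size + 1]
        pairs.append((source, target))
    return pairs

pairs = make_next_token_pairs(tokens=[10, 11, 12, 13, 14, 15], block_size=4)
for source, target in pairs:
    print(source, "->", target)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [11, 12, 13, 14] -> [12, 13, 14, 15]
```

At position `k` of each pair, the model sees `source[:k + 1]` and is scored on predicting `target[k]`, so the labels come from the text itself.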

In [18]:
trainer.run()

[Epoch 0]:   6%|▌         | 290/4688 [01:08<17:49,  4.11it/s, loss=7.4]

KeyboardInterrupt: ignored

Note that this model was run for a **very** short period of time. The goal is just to show how this works - you can
play with the hyper-parameters as you see fit.
We ran for only 1 epoch, on a much smaller model and
with a smaller dataset than was suggested in the paper.

Let's see whether the model's completions make any sense.

## GPT Completes a Sequence

In the spirit of Karpathy's minGPT::play_char notebook, we can use a greedy_sampler to see how the model
continues a sequence.
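
As a rough illustration of what such a sampler does, the sketch below extends a prompt one token at a time: crop the context to the last `block_size` tokens, turn the model's logits into probabilities, then either take the most likely token (greedy) or draw one at random (`do_sample=True`). This is a plain-Python sketch, not the implementation in `gpt_sampler`; `generate`, `softmax`, and `toy_logits` are hypothetical names, and `toy_logits` stands in for a trained model.

```python
import math
import random
from typing import Callable, List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_token_logits: Callable[[List[int]], List[float]],
             prompt: List[int], steps: int, block_size: int,
             do_sample: bool = False, seed: int = 0) -> List[int]:
    """Extend `prompt` by `steps` tokens, one token at a time."""
    rng = random.Random(seed)
    sequence = list(prompt)
    for _ in range(steps):
        context = sequence[-block_size:]  # the model only sees the last block_size tokens
        probs = softmax(next_token_logits(context))
        if do_sample:
            # draw the next token in proportion to its probability
            next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        else:
            # greedy: always take the most likely token
            next_id = max(range(len(probs)), key=probs.__getitem__)
        sequence.append(next_id)
    return sequence

def toy_logits(context: List[int]) -> List[float]:
    """Toy stand-in for a model: strongly prefers (last token + 1) mod 5."""
    vocab_size = 5
    favourite = (context[-1] + 1) % vocab_size
    return [4.0 if i == favourite else 0.0 for i in range(vocab_size)]

greedy = generate(toy_logits, prompt=[0, 1], steps=6, block_size=4)
print(greedy)  # [0, 1, 2, 3, 4, 0, 1, 2]

sampled = generate(toy_logits, prompt=[0, 1], steps=6, block_size=4, do_sample=True)
print(sampled)  # varies with the seed; usually, but not always, follows the same pattern
```

Greedy decoding is deterministic, while sampling trades some coherence for variety.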

In [10]:
from typing import Optional, Tuple

import torch

from nlpmodels.utils.elt import gpt_batch


def reformat_data(data: Tuple[torch.Tensor, Optional[torch.Tensor]]) -> gpt_batch.GPTBatch:
    """
    Args:
        data (Tuple): The tuple of (source, target) LongTensors to be converted into a Batch object.
            The target may be None, e.g. when only a prompt is available.
    Returns:
        GPTBatch object containing data for a given batch.
    """
    # (batch, seq) shape tensors
    source_integers, target_integers = data

    # use the current GPU if one is available, otherwise fall back to the CPU
    device = torch.cuda.current_device() if torch.cuda.is_available() else 'cpu'
    # place data on the correct device
    if source_integers is not None:
        source_integers = source_integers.to(device)
    if target_integers is not None:
        target_integers = target_integers.to(device)

    # return a batch object with src,src_mask,tgt,tgt_mask tensors
    batch_data = gpt_batch.GPTBatch(source_integers,
                                    target_integers,
                                    0)

    return batch_data

In [11]:
import torch



prompt = "ernest hemingway first novel , the sun also rises , " \
         "treats of certain of those younger americas concerning whom gertrude stein has remarked :" \
         " you are all a lost generation . this is the novel for which a keen appetite was stimulated by" \
         " mr . hemingway 's exciting volume of short stories. " \
         " the clear objectivity and the sustained intensity of the stories , " \
         "and their concentration upon action in the present moment, seemed to point to a failure to project " \
         "a novel in terms of the same method, yet a resort to any other method would have let down the " \
         "reader's expectations. it is a relief to find that the sun also rises maintains the same heightened , " \
         "intimate tangibility as the shorter narratives and does it in the same kind of weighted, quickening prose. "
prompt_tensor = torch.LongTensor([[vocab.lookup_token(s) for s in prompt.split(" ")]])
prompt_tensor_batch = reformat_data((prompt_tensor,None))
steps = 64
y_hat_indices = gpt_sampler.sampler(model=model, data=prompt_tensor_batch,
                                    steps=steps, block_size=64, do_sample=True).src
y_hat_tokens = ' '.join([vocab.lookup_index(int(idx)) for idx in y_hat_indices[0]])
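
A detail to keep in mind when tokenizing a prompt this way: `str.split(" ")` splits on every single space, so a double space or a trailing space yields empty-string tokens, and punctuation that is not space-separated (such as `stories.`) stays attached to its word. Whether such tokens map to a real vocabulary entry or to an unknown token depends on how the vocabulary was built, which this sketch makes no assumption about. The snippet is plain Python, independent of `nlpmodels`.

```python
text = "short stories.  the clear , prose. "

naive = text.split(" ")  # splits on every single space
print(naive)
# ['short', 'stories.', '', 'the', 'clear', ',', 'prose.', '']

cleaned = text.split()   # splits on runs of whitespace and drops empty strings
print(cleaned)
# ['short', 'stories.', 'the', 'clear', ',', 'prose.']
```

If the training data was tokenized with punctuation as separate tokens, matching that convention in the prompt should reduce the number of out-of-vocabulary lookups.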

Here is the prompt the model was given, a review of Ernest Hemingway's novel *The Sun Also Rises*:

In [12]:
for idx in range(0,len(prompt.split(" ")),8):
    print(" ".join(prompt.split(" ")[idx:idx+8]))

ernest hemingway first novel , the sun also
rises , treats of certain of those younger
americas concerning whom gertrude stein has remarked :
you are all a lost generation . this
is the novel for which a keen appetite
was stimulated by mr . hemingway 's exciting
volume of short stories.  the clear objectivity
and the sustained intensity of the stories ,
and their concentration upon action in the present
moment, seemed to point to a failure to
project a novel in terms of the same
method, yet a resort to any other method
would have let down the reader's expectations. it
is a relief to find that the sun
also rises maintains the same heightened , intimate
tangibility as the shorter narratives and does it
in the same kind of weighted, quickening prose.



Working off that sequence, here is how the model completed the next 64 words in this review:

In [15]:
for idx in range(0,len(y_hat_tokens.split(" ")),8):
    print(" ".join(y_hat_tokens.split(" ")[idx:idx+8]))

<UNK> naples unrest they found on its theme
behind the original artifact stipulates about about the
disease to pilot released the original allegory there
was intentionally considered portraying six andrzej wajda brothers
inspiration for some dozen moral trends commented that
appear in whether an full than the main
period what says that copy of their first
could be found used to be replaced it
rumors


Well, as expected... this doesn't really make any sense. Pockets of words make sense, but overall the text does not.
As the language model is trained for longer and with more parameters, we would expect the believability of this text to increase.

## Supervised training

Once the model is trained in the self-supervised phase, go forth and apply it to a different problem!