# GPT-2 Solution Exploration

## Getting things ready 

In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import os

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [None]:
! rm -rf gpt_2/
! git clone https://github.com/openai/gpt-2.git 
! mv gpt-2 gpt_2
! pip install -r gpt_2/requirements.txt

Cloning into 'gpt-2'...
remote: Enumerating objects: 233, done.
remote: Total 233 (delta 0), reused 0 (delta 0), pack-reused 233
Receiving objects: 100% (233/233), 4.38 MiB | 20.29 MiB/s, done.
Resolving deltas: 100% (124/124), done.


In [None]:
import requests

url = 'https://raw.githubusercontent.com/DMinghao/TensorFlow-NLG/main/dev/gpt2/all.txt'
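r = requests.get(url)
r.raise_for_status()  # stop early if the download failed

# Save the training corpus next to the notebook
with open('all.txt', 'wb') as f:
    f.write(r.content)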

### Some variables 

In [None]:
model_name='774M'
seed=None
nsamples=1
batch_size=1
length=None
temperature=1
top_k=0
top_p=1
models_dir='models'

file_name = "all.txt"
test_survey = """
1) What is the reason for your visit today?
Review my MRI scan from 11/10
2) Are you RIGHT or LEFT handed?
Right
3) When did this problem start?
A few years ago
4) Has this problem happened before?
No
5) How did it start?
I was injinjured in a car accaccident 
6) Has it changed since it started?
No
7) Have you seen another physician for this problem?
No
8) What work-up has been done for this problem?
MRI Brain, Neurology Assoc.
9) Do you have a personal history of cancer or tumors?
No
10) If you answered yes to a personal history of cancer or tumors, please relate the details of the diagnosis and prior treatments
No answer
11) Do you have a personal history of any disease or condition relating to the visit today?
No 
12) Review of Systems: Do you have any of the following problems?
Heat Intolerance, Cold Intolerance, Blurred Vision
13) Review of Systems: Do you have any of the following problems?
No answer
14) Review of Systems: Do you have any of the following problems?
Nausea, Back pain, Neck pain, Joint pain
15) Review of Systems: Do you have any of the following problems?
Anxiety, Numbness, Tingling, Weakness
16) Review of Systems: Do you have any of the following problems?
No answer
17) Over the last two weeks how often have you been bothered by any of the following problems?
Score: 10
"""

### Check GPU

For the larger models, a Tesla P100 is recommended. Otherwise, use the `124M` model.

In [None]:
!nvidia-smi

Sun Mar  7 16:34:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    42W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
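
The GPU check above can also drive the choice of `model_name` automatically. The following is only a sketch: `gpu_memory_mib`, `pick_model_name`, and the 16000 MiB threshold are hypothetical and chosen here so that the 16280 MiB P100 shown above selects `774M`. They are not part of gpt-2-simple.

```python
import subprocess


def gpu_memory_mib():
    """Return total memory (MiB) of the first GPU, or None if it cannot be queried."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            universal_newlines=True,
        )
        return int(out.strip().splitlines()[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        # No NVIDIA driver, no GPU attached, or unexpected output
        return None


def pick_model_name(total_mib, large_threshold_mib=16000):
    """Pick '774M' on a large GPU and fall back to '124M' otherwise.

    The threshold is an assumption for illustration, not a measured limit.
    """
    if total_mib is not None and total_mib >= large_threshold_mib:
        return "774M"
    return "124M"


suggested_model = pick_model_name(gpu_memory_mib())
print("Suggested model:", suggested_model)
```

If used, the result could replace the hard-coded `model_name` in the variables cell.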

### Download GPT-2 Model 

There are 4 released sizes of GPT-2:

* `124M`: the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3.10GB on disk.
* `1558M`: the "extra large" (full) model. It is too big to be finetuned on Colab.

Larger models have more knowledge, but take longer to finetune and longer to generate text. 

In [None]:
if not os.path.isdir(os.path.join(models_dir, model_name)):
    gpt2.download_gpt2(model_name=model_name)

## Raw GPT-2 Implementation

In [None]:
import fire
import json
import numpy as np
import tensorflow as tf
from gpt_2.src import model, encoder

### Helper functions

Adapted and modified from the example code in the official gpt-2 repository.

In [None]:
def top_k_logits(logits, k):
    if k == 0:
        # no truncation
        return logits

    def _top_k():
        values, _ = tf.nn.top_k(logits, k=k)
        min_values = values[:, -1, tf.newaxis]
        return tf.where(
            logits < min_values,
            tf.ones_like(logits, dtype=logits.dtype) * -1e10,
            logits,
        )
    return tf.cond(
       tf.equal(k, 0),
       lambda: logits,
       lambda: _top_k(),
    )
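
To make the effect of `top_k_logits` easier to see, here is a small NumPy sketch of the same idea on a toy batch. The name `top_k_filter` is hypothetical. It mirrors the masking logic above, not the TensorFlow graph code itself.

```python
import numpy as np


def top_k_filter(logits, k, neg=-1e10):
    """Keep the k largest logits in each row; push the rest to a large negative value."""
    logits = np.asarray(logits, dtype=np.float64)
    if k == 0:
        # k == 0 means no truncation, as in top_k_logits
        return logits
    # k-th largest value per row: sort ascending and take the k-th entry from the end
    kth_largest = np.sort(logits, axis=-1)[:, -k][:, np.newaxis]
    return np.where(logits < kth_largest, neg, logits)


toy_logits = np.array([[2.0, 0.5, 1.0, -1.0]])
top2 = top_k_filter(toy_logits, k=2)
print(top2)  # only the two largest entries (2.0 and 1.0) keep their values
```

After the softmax, the masked entries receive a probability that is effectively zero, so sampling is restricted to the `k` most likely tokens.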

In [None]:
def top_p_logits(logits, p):
    """Nucleus sampling"""
    batch, _ = logits.shape.as_list()
    sorted_logits = tf.sort(logits, direction='DESCENDING', axis=-1)
    cumulative_probs = tf.cumsum(tf.nn.softmax(sorted_logits, axis=-1), axis=-1)
    indices = tf.stack([
        tf.range(0, batch),
        # number of indices to include
        tf.maximum(tf.reduce_sum(tf.cast(cumulative_probs <= p, tf.int32), axis=-1) - 1, 0),
    ], axis=-1)
    min_values = tf.gather_nd(sorted_logits, indices)
    return tf.where(
        logits < min_values,
        tf.ones_like(logits) * -1e10,
        logits,
    )
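
The nucleus (top-p) filter can be traced by hand on a tiny example. The NumPy sketch below follows the same steps as `top_p_logits` (sort, cumulative softmax, cutoff index, mask). The names `softmax` and `top_p_filter` are illustrative only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def top_p_filter(logits, p, neg=-1e10):
    """Mask every logit that falls below the nucleus cutoff, as top_p_logits does."""
    logits = np.asarray(logits, dtype=np.float64)
    sorted_logits = np.sort(logits, axis=-1)[:, ::-1]  # descending order
    cumulative = np.cumsum(softmax(sorted_logits), axis=-1)
    # Index of the last sorted logit whose cumulative probability is still <= p
    # (never below 0, so at least one token is always kept)
    cutoff = np.maximum((cumulative <= p).sum(axis=-1) - 1, 0)
    min_values = sorted_logits[np.arange(logits.shape[0]), cutoff][:, np.newaxis]
    return np.where(logits < min_values, neg, logits)


toy_probs = np.array([[0.5, 0.3, 0.15, 0.05]])
nucleus = top_p_filter(np.log(toy_probs), p=0.9)
kept = (nucleus[0] > -1e9).tolist()
print(kept)  # which tokens survive the filter
```

Here the cumulative probabilities are roughly 0.5, 0.8, 0.95, and 1.0, so with `p=0.9` only the first two tokens remain candidates.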


In [None]:
def sample_sequence(*, hparams, length, start_token=None, batch_size=None, context=None, temperature=1, top_k=0, top_p=1):
    if start_token is None:
        assert context is not None, 'Specify exactly one of start_token and context!'
    else:
        assert context is None, 'Specify exactly one of start_token and context!'
        context = tf.fill([batch_size, 1], start_token)

    def step(hparams, tokens, past=None):
        lm_output = model.model(hparams=hparams, X=tokens, past=past, reuse=tf.AUTO_REUSE)

        logits = lm_output['logits'][:, :, :hparams.n_vocab]
        presents = lm_output['present']
        presents.set_shape(model.past_shape(hparams=hparams, batch_size=batch_size))
        return {
            'logits': logits,
            'presents': presents,
        }

    with tf.name_scope('sample_sequence'):
        def body(past, prev, output):
            next_outputs = step(hparams, prev, past=past)
            logits = next_outputs['logits'][:, -1, :]  / tf.to_float(temperature)
            logits = top_k_logits(logits, k=top_k)
            logits = top_p_logits(logits, p=top_p)
            samples = tf.multinomial(logits, num_samples=1, output_dtype=tf.int32)
            return [
                next_outputs['presents'] if past is None else tf.concat([past, next_outputs['presents']], axis=-2),
                samples,
                tf.concat([output, samples], axis=1)
            ]

        past, prev, output = body(None, context, context)

        def cond(*args):
            return True

        _, _, tokens = tf.while_loop(
            cond=cond, body=body,
            maximum_iterations=length - 1,
            loop_vars=[
                past,
                prev,
                output
            ],
            shape_invariants=[
                tf.TensorShape(model.past_shape(hparams=hparams, batch_size=batch_size)),
                tf.TensorShape([batch_size, None]),
                tf.TensorShape([batch_size, None]),
            ],
            back_prop=False,
        )

        return tokens
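
Inside `body`, the logits are divided by `temperature` before sampling. The NumPy sketch below illustrates what that division does to the sampling distribution for a single step. `sample_next_token` is a hypothetical helper, and it omits the top-k and top-p filters for brevity.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax for a 1-D array of logits."""
    e = np.exp(x - x.max())
    return e / e.sum()


def sample_next_token(logits, temperature=1.0, rng=None):
    """Scale logits by 1/temperature, then draw one token id from the softmax."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = softmax(scaled)
    token_id = int(rng.choice(len(probs), p=probs))
    return token_id, probs


toy_logits = np.array([2.0, 1.0, 0.0])
_, sharp = sample_next_token(toy_logits, temperature=0.5, rng=np.random.default_rng(0))
_, flat = sample_next_token(toy_logits, temperature=2.0, rng=np.random.default_rng(0))
print("T=0.5:", sharp.round(3))
print("T=2.0:", flat.round(3))
```

A temperature below 1 concentrates probability on the most likely token, while a temperature above 1 flattens the distribution and makes unlikely tokens more probable.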


### Generate Text From The Raw Model

In [None]:
models_dir = os.path.expanduser(os.path.expandvars(models_dir))
if batch_size is None:
    batch_size = 1
assert nsamples % batch_size == 0

enc = encoder.get_encoder(model_name, models_dir)
hparams = model.default_hparams()
with open(os.path.join(models_dir, model_name, 'hparams.json')) as f:
    hparams.override_from_dict(json.load(f))

if length is None:
    length = hparams.n_ctx // 2
elif length > hparams.n_ctx:
    raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

with tf.Session(graph=tf.Graph()) as sess:
    context = tf.placeholder(tf.int32, [batch_size, None])
    np.random.seed(seed)
    tf.set_random_seed(seed)
    output = sample_sequence(
        hparams=hparams, length=length,
        context=context,
        batch_size=batch_size,
        temperature=temperature, top_k=top_k, top_p=top_p
    )

    saver = tf.train.Saver()
    ckpt = tf.train.latest_checkpoint(os.path.join(models_dir, model_name))
    saver.restore(sess, ckpt)

    
    context_tokens = enc.encode(test_survey)
    generated = 0
    for _ in range(nsamples // batch_size):
        out = sess.run(output, feed_dict={
            context: [context_tokens for _ in range(batch_size)]
        })[:, len(context_tokens):]
        for i in range(batch_size):
            generated += 1
            text = enc.decode(out[i])
            print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
            print(text)
    print("=" * 80)




Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Use `tf.random.categorical` instead.
INFO:tensorflow:Restoring parameters from models/774M/model.ckpt
How do you treat this symptom?
Medication
Look for symptoms of heat intolerance between 10:00pm and 2:00am
Look for symptoms of heat tolerance between 10:00pm and 2:00am
1) As you move about about your home and workplace, it starts to feel warm and throbbing
2) Despite moving about, you can never quite escape your feeling of being hot
3) However, every time you look at a window you can feel the thought force your arm out from under you and onto your head!
4) At home this always happens only at certain times or during certain activities. If it happens on a regular basis it means you are not getting enough blood; If you want to work it means you are not getting enough nutrition and with sleep deprivation it means you

## Finetuning GPT-2 (Using [GPT_2_simple](https://github.com/minimaxir/gpt-2-simple))

The next cell starts the actual finetuning of GPT-2. It creates a persistent TensorFlow session that stores the training config, then runs the training for the specified number of `steps`. To have the finetuning run indefinitely, set `steps = -1`.

The model checkpoints are saved in `checkpoint/run1` by default. A checkpoint is written at a regular interval (set with `save_every`) and whenever the cell is stopped.



**IMPORTANT NOTE:** 
- The training might time out after roughly 4 hours.
- To rerun training, **restart the VM first** (Runtime -> Restart Runtime).

Other optional-but-helpful parameters for `gpt2.finetune`:

* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or to `latest` to resume training from an existing checkpoint.
* **`sample_every`**: Number of steps between printed example outputs.
* **`print_every`**: Number of steps between training progress printouts.
* **`learning_rate`**: Learning rate for the training.
* **`run_name`**: Subfolder within `checkpoint` in which to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` to continue finetuning an existing model (with `restore_from='latest'`) without creating duplicate copies.
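
As an illustration of how these parameters combine, the sketch below collects the keyword arguments for resuming an earlier run. `resume_finetune_kwargs` and its default values are hypothetical. Only the keyword names come from the list above and from the finetuning call in this notebook. The actual call is left commented out because it needs a GPU runtime with gpt-2-simple installed and an existing checkpoint.

```python
def resume_finetune_kwargs(dataset, model_name, run_name="run1", steps=500):
    """Build keyword arguments for continuing a previous finetuning run.

    The default values here are assumptions for illustration.
    """
    return dict(
        dataset=dataset,
        model_name=model_name,
        steps=steps,
        restore_from="latest",  # continue from the newest checkpoint of run_name
        run_name=run_name,      # subfolder of `checkpoint` holding that run
        overwrite=True,         # avoid keeping duplicate checkpoint copies
        learning_rate=1e-5,
        print_every=100,
        sample_every=500,
    )


kwargs = resume_finetune_kwargs("all.txt", "774M")
print(kwargs)

# In a freshly restarted runtime:
# sess = gpt2.start_tf_sess()
# gpt2.finetune(sess, **kwargs)
```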

In [None]:
!rm -rf checkpoint/

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_name,
              steps=2000,
              learning_rate=1e-5,
              print_every=100,
              sample_every=500,
              restore_from='fresh')

Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/774M/model.ckpt
INFO:tensorflow:Restoring parameters from models/774M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:00<00:00,  2.65it/s]


dataset has 77225 tokens
Training...
[100 | 186.07] loss=0.31 avg=0.31
[200 | 358.31] loss=0.32 avg=0.32
[300 | 530.54] loss=0.25 avg=0.29
[400 | 702.80] loss=0.20 avg=0.27
[500 | 875.05] loss=0.15 avg=0.25
 N1-PN1:
5) What work-up has been done for this problem?
MRI 4/14/20 wolds hospital
6) Do you have a personal history of cancer or tumors?
No
7) If you answered yes to a personal history of cancer or tumors, please relate the details of the diagnosis and prior treatments
No answer
8) Do you have a personal history of any disease or condition relating to the visit today?
No answer
9) Review of Systems: Do you have any of the following problems?
No answer
10) Review of Systems: Do you have any of the following problems?
Difficulty swallowing
11) Review of Systems: Do you have any of the following problems?
Bladder incontinence
12) Review of Systems: Do you have any of the following problems?
Neck pain, Joint pain
13) Review of Systems: Do you have any of the following problems?
No ans

### Generate Text From The Trained Model

Parameters for `gpt2.generate`:

* **`length`**: Number of tokens to generate (default 1023, the maximum).
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7; recommended to keep between 0.7 and 1.0).
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`).
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability (good results have been reported with `top_p=0.9`).
* **`truncate`**: Truncates the returned text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
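
The interaction between `truncate` and `include_prefix` is easiest to see on a plain string. The function below is a simplified stand-in written for this notebook. It is not the library's implementation, and `postprocess` is a hypothetical name.

```python
def postprocess(generated, prefix=None, truncate=None, include_prefix=True):
    """Approximate the described truncate / include_prefix behavior on a string."""
    text = generated
    if truncate:
        start = 0
        # Drop the prefix only when truncating and include_prefix is False
        if prefix and not include_prefix and text.startswith(prefix):
            start = len(prefix)
        end = text.find(truncate, start)
        if end != -1:
            # Keep everything before the first occurrence of the truncate sequence
            text = text[start:end]
    return text


sample = "Q: hi\nA: hello<|endoftext|>Q: next"
with_prefix = postprocess(sample, prefix="Q: hi\n", truncate="<|endoftext|>")
without_prefix = postprocess(sample, prefix="Q: hi\n", truncate="<|endoftext|>",
                             include_prefix=False)
print(repr(with_prefix))
print(repr(without_prefix))
```

With `include_prefix=False`, only the text between the prefix and the truncation sequence is returned, which is convenient when the prompt is long, as with `test_survey`.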

In [None]:
gpt2.generate(sess,
              length=250,
              prefix=test_survey
              )

1) What is the reason for your visit today?
Review my MRI scan from 11/10
2) Are you RIGHT or LEFT handed?
Right
3) When did this problem start?
A few years ago
4) Has this problem happened before?
No
5) How did it start?
I was injinjured in a car accaccident 
6) Has it changed since it started?
No
7) Have you seen another physician for this problem?
No
8) What work-up has been done for this problem?
MRI Brain, Neurology Assoc.
9) Do you have a personal history of cancer or tumors?
No
10) If you answered yes to a personal history of cancer or tumors, please relate the details of the diagnosis and prior treatments
No answer
11) Do you have a personal history of any disease or condition relating to the visit today?
No 
12) Review of Systems: Do you have any of the following problems?
Heat Intolerance, Cold Intolerance, Blurred Vision
13) Review of Systems: Do you have any of the following problems?
No answer
14) Review of Systems: Do you have any of the following problems?
Nausea, Back p

```
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
```