

---

# Transformer Therapy

---

Aptly named Transformers Psychoanalysis because this Colabook'll be employing an investigatory system aimed at understanding the interaction of conscious and unconscious elements in the neural framework of the Transformer network.  

---



In [1]:
import torch
!git clone https://github.com/huggingface/transformers.git
!pip install -q transformers
from transformers import *

Cloning into 'transformers'...
remote: Enumerating objects: 17472, done.[K
remote: Total 17472 (delta 0), reused 0 (delta 0), pack-reused 17472[K
Receiving objects: 100% (17472/17472), 10.17 MiB | 13.08 MiB/s, done.
Resolving deltas: 100% (13000/13000), done.
[K     |████████████████████████████████| 450kB 8.5MB/s 
[K     |████████████████████████████████| 1.0MB 58.0MB/s 
[K     |████████████████████████████████| 870kB 48.1MB/s 
[?25h  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone





---


---

# GPT-2

---



---





---

## GPT-2 Simple, by Max Woolf

---

I've never met [Max Woolf](https://minimaxir.com/). But he's sharing his stuff, namely a Python package that wraps fine-tuning and generation scripts for [OpenAI](https://openai.com/)'s [GPT-2](https://github.com/openai/gpt-2), a text-generating NLP transformer model considered so viable it's [dangerous](https://slate.com/technology/2019/02/openai-gpt2-text-generating-algorithm-ai-dangerous.html). 

Woolf's package incorporates minimal low-level changes to three existing repos: management from [one](https://github.com/openai/gpt-2), fine-tuning from [two](https://github.com/nshepperd/gpt-2), and generative output management from [three](https://github.com/minimaxir/textgenrnn).

In this Colabook, I'll repeat Woolf's [GPT-2-simple](https://github.com/minimaxir/gpt-2-simple) instantiation, leveraging the chic methodology of [transfer learning](https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/) to fine-tune on custom datasets and see what strange things happen to me there. 

---



In [2]:
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
!pip install -q pytorch-transformers
from datetime import datetime
from google.colab import files

  Building wheel for gpt-2-simple (setup.py) ... [?25l[?25hdone
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

[K     |████████████████████████████████| 184kB 8.5MB/s 
[?25h



---

## Colab's GPU 

---

The cell below will print to screen the make and model of Colab's GPU and CPU stats. Colab employs either a NVIDIA Tesla T4 or NVIDIA K80 GPU.

---



In [0]:
# !lscpu
!nvidia-smi

Tue Sep 10 23:59:19 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru



---

## Load GPT-2 Model

---



---



In [3]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 322Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 121Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 312Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:19, 72.5Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 141Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 132Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 162Mit/s]                                                       




---

## Mount Google Drive

---

---



In [4]:
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive




---

## Load Text File

---

Project Gutenburg

---

### If File < 10MB

---



---



In [0]:
file_name = "pynchon.txt"



---

### Elif File > 10MB

---

Elif the file is larger than 10MB, the recommended procedure is to upload that file into Google Drive, and then copy it into Colabook.

---



In [0]:
# once file is successfully uploaded
# gpt2.copy_file_from_gdrive(file_name)



---

## Preprocess File

---



---



In [0]:
from collections import Counter

def word_count(file_name):
    with open(file_name) as file:
        return Counter(file.read().lower().split())
      
print('Author\'s Word Count')
print('_________ ____ _____')
print(word_count(file_name))

Author's Word Count
_________ ____ _____




---

## Fine-Tune GPT-2

---

**IMPORTANT NOTE**: If you want to rerun this cell, restart the VM first (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for gpt2.finetune:

  restore_from: Set to fresh to start training from the base GPT-2, or set to latest to restart training from an existing checkpoint.
  
sample_every: Number of steps to print example output

print_every: Number of steps to print training progress.

learning_rate: Learning rate for the training. (default 1e-4, can lower to 1e-5 if you have <1MB input data)

temperature: used to control the randomness of predictions by scaling the logits before applying softmax. Without foofaraw: the higher the temperature, the more chaotic or unlikely the generated sequences will be.

run_name: subfolder within checkpoint to save the model. This is useful if you want to work with multiple models (will also need to specify run_name when loading the model)

overwrite: Set to True if you want to continue finetuning an existing model (w/ 
restore_from='latest') without creating duplicate copies.

---



In [0]:
run_name = 'run1'

In [7]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',
              steps=1000,
              learning_rate=1e-5,
              restore_from='fresh',
              run_name=run_name,
              print_every=10,
              sample_every=200,
              save_every=200
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:00<00:00,  1.53it/s]


dataset has 68060 tokens
Training...
[10 | 16.37] loss=4.25 avg=4.25
[20 | 25.21] loss=3.33 avg=3.79
[30 | 34.06] loss=3.87 avg=3.82
[40 | 42.91] loss=3.55 avg=3.75
[50 | 51.75] loss=3.93 avg=3.79
[60 | 60.59] loss=3.65 avg=3.76
[70 | 69.43] loss=3.92 avg=3.79
[80 | 78.29] loss=3.35 avg=3.73
[90 | 87.14] loss=3.78 avg=3.74
[100 | 96.00] loss=3.67 avg=3.73
[110 | 104.86] loss=3.82 avg=3.74
[120 | 113.69] loss=4.13 avg=3.77
[130 | 122.53] loss=2.99 avg=3.71
[140 | 131.36] loss=3.62 avg=3.70
[150 | 140.21] loss=3.57 avg=3.69
[160 | 149.07] loss=3.61 avg=3.69
[170 | 157.91] loss=3.12 avg=3.65
[180 | 166.73] loss=4.00 avg=3.67
[190 | 175.57] loss=3.98 avg=3.69
[200 | 184.40] loss=3.62 avg=3.69
Saving checkpoint/run1/model-200
. And we're going to start to think more clearly. This is the new direction. The New Order.
The New Order (NOS): The Nos are new. They've been there for a hundred years. The first Nos, when I was a boy, was a band of half-forgotten hounds, wandering the land, hunting b



---

## Copy Checkpoints

---

Once the model has successfully completed its training, the checkpoint folder can be copied to your Google Drive. What if your Google Drive data has its free 15GBs packed in like sardines? Download the checkpoint folder to your personal computer, that's what. But first copy the checkpoint folder to your personal computer, then download from Google Drive. For what it's worth, the checkpoint folder is copied as a *.rar* compressed file, which you can download and uncompress locally.

---



In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name=run_name)



---



---





---


## Load Trained Model Checkpoint

---

The cell just below allows loading of the retrained model checkpoint and metadata necessary to generate text, and the one below that loads the last session and checkpoints into gpt2, allowing us to jump straight into generating output, bypassing the training process altogether.

---



In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name=run_name)



---

If you want to rerun this cell, first restart the Colabook (Runtime -> Restart Runtime), since you will need to rerun imports while not recopying files.

---



In [0]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=run_name)

Loading checkpoint checkpoint/pyn49/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/pyn49/model-1000




---



---





---

## Generate Text From Trained Model

---


In [0]:
gpt2.generate(sess, run_name=run_name)

It's hard to believe, but I'm a big fan of the '90s. I'm a big fan of that decade. It's such a weird time in our history. I mean, we had "The Sopranos," and then came "Homeland," and then came "The Sopranos." I loved all three of those shows. So when I came across the '90s revival of The Sopranos, I had to see it.

I was at the Javits Center in Manhattan when the show first came on, the night it was on, and I was really, really drunk. I'd just gotten off the subway, and I was like, "I want to go to this show!" And so I went. I had a cab, but I was really drunk, and I couldn't even get it to take me to the show. I had to slip around the back of the Javits Center, like this guy did, and I had to sneak my cab into the back of the theater. But then I had to go to the back and sneak my cab into another one, and that's when I got really drunk.

I'm a big fan of "The Sopranos," too. I mean, I loved that show for a long time. I remember getting an early copy of the first episode of "The Sopran



---

## Generate Text With Parameters

---



---



In [0]:
prefix = 'Murtle the turtle'

In [11]:
gpt2.generate(sess,
              run_name=run_name,
              length=500,
              temperature=0.9,
              prefix=prefix,
              nsamples=10,
              batch_size=10
              )

Murtle the turtleJumboTurtle" as he flapped his arms. "Hey, take this,I want to put one in mybeer."He threw the bottle to her and said, "Now is the time to act."All in all he'd taken us out forpizza, cappuccinos, sit-downs at the bar and sang, "I Got the Skinny From That Turkeyscore, T-Bone Burnett."
"He's not your type," explained Oedipa.
"I don't know," he said. "I was never around any that werenuts. They were horse people, those fur people. Like me. Like your uncle."
"Oh, come on." She wanted to tell him not to stick needles in people's foreheads. Stuff like that. Uncle Toucan was the prototypical horse people expert.
"He died about 2,000 years ago," he said. "He left a will. The only things in there he specifically blessed the use of the executor to take over any succession in any of his businesses, like, say, the land, or the personal properties. Not that he owned any, really. The will talks about the personal properties, but doesn't deal with the real estate."
"Like where his spe



---

## Regarding Dreams: A Sidenote

---

Captain's log, 9.23.19. What appears to a be a series of garbled, unconnected narratives produced by GPT2 have, in strange similarity, that familiar make-up of a dream-state. One can almost read a sample and return to that fuzzy framework memory of a dream narrative they once had.

Running along with this almost homogeneous correspondence of narration between GPT2's output and the brain's dream-state, further research will be submitted regarding the behaivor of the human brain, specifically it's neurons, during the dream-state, with intent to map this research onto the architecture of GPT2. 

---





---

## Generate in Bulk

---

Large amounts of text data can be generated to a file and the samples therein can be sorted out, locally, on your personal computer. The cell below will generate a generated text file with a unique timestamp. The cells are rerunable for as many times as you want for bulk.

---



In [0]:
VOICE = 'donq'

In [0]:
gen_file = 'gpt2_' + VOICE + '_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=500,
                      temperature=0.7,
                      nsamples=100,
                      batch_size=20
                      )

In [0]:
# may have to run twice 
# to get files download
files.download(gen_file)



---



---





---


# Generate Text with XLNet

---



In [0]:
!python transformers/examples/run_generation.py \
    --model_type=xlnet \
    --length=100 \
    --model_name_or_path=xlnet-base-cased \



---


# Generate Text with TransformerXL

---





In [0]:
!python transformers/examples/run_generation.py \
    --model_type=transfo-xl \
    --length=100 \
    --model_name_or_path=transfo-xl-wt103 \



---



---

# Predict The Word

---





---

## Predict With GPT-2

---



In [0]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

inputs = "What is the meaning of"
indexed_tokens = tokenizer.encode(inputs)

tokens_tensor = torch.tensor([indexed_tokens])

model = GPT2LMHeadModel.from_pretrained('gpt2')

model.eval()


with torch.no_grad():
  outputs = model(tokens_tensor)
  predictions = outputs[0]
  
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

print("Predicted Output\n========= ======\n",predicted_text)

100%|██████████| 1042301/1042301 [00:01<00:00, 819187.58B/s]
100%|██████████| 456318/456318 [00:00<00:00, 505449.32B/s]
100%|██████████| 176/176 [00:00<00:00, 30905.03B/s]
100%|██████████| 548118077/548118077 [00:43<00:00, 12527261.10B/s]


Predicted Output
 What is the meaning of the




---


## Predict With TransformerXL

---



In [0]:
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')

inputs = "Bill is a"
indexed_tokens = tokenizer.encode(inputs)

tokens_tensor = torch.tensor([indexed_tokens])

model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')

model.eval()

tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

with torch.no_grad():
  outputs = model(tokens_tensor)
  predictions = outputs[0]
  
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

print("Predicted Output\n========= ======\n",predicted_text)



---

## Predict with XLNet

---



In [0]:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')

inputs = "Bill is a "
indexed_tokens = tokenizer.encode(inputs)

tokens_tensor = torch.tensor([indexed_tokens])

model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')

model.eval()

with torch.no_grad():
  outputs = model(tokens_tensor)
  predictions = outputs[0]
  
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

print("Predicted Output\n========= ======\n",predicted_text)

Predicted Output
 Bill is a it




---



---





---

# Word Masking

---





---

## XLNet Masking

---



In [0]:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-large-cased')
# We show how to setup inputs to predict a next token using a bi-directional context.
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask>")).unsqueeze(0)  # We will predict the masked token
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -1] = 1.0  # Previous tokens don't see last token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction will be the last token of the sequence (the masked token)
outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs[0]  # Output has shape [target_mapping.size(0), targ

100%|██████████| 798011/798011 [00:01<00:00, 735973.82B/s]
100%|██████████| 699/699 [00:00<00:00, 163752.15B/s]
100%|██████████| 1441285815/1441285815 [02:01<00:00, 11910900.78B/s]


In [0]:
predicted_index = torch.argmax(next_token_logits[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

print("Predicted Output\n========= ======\n",predicted_text)

Predicted Output
 broke knownan permission without very




---

# Question Answering

---



In [0]:
import json
input_file = 'train-v1.1.json'
with open(input_file, "r", encoding='utf-8') as reader:
    input_data = json.load(reader)["data"]

In [0]:
!mkdir squad

In [0]:
!python -m torch.distributed.launch --nproc_per_node=8 transformers/examples/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
    --do_lower_case \
    --train_file train-v1.1.json \
    --predict_file dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir squad/ \
    --per_gpu_eval_batch_size=3   \
    --per_gpu_train_batch_size=3   \

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
Traceback (most recent call last):
  File "transformers/examples/run_squad.py", line 554, in <module>
  File "transformers/examples/run_squad.py", line 554, in <module>
    main()
    main()
  File "transformers/examples/run_squad.py", line 454, in main
  File "transformers/examples/run_squad.py", line 454, in main
    torch.cuda.set_device(args.local_rank)
    torch.cuda.set_device(args.local_rank)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.p



---

# GLUE

---



In [0]:
!pip install -r transformers/examples/requirements.txt

In [0]:
mrpc_data = "msr_paraphrase_data.txt"
mrpc_train = "msr_paraphrase_train.txt" 
mrpc_test = "msr_paraphrase_test.txt"

In [0]:
!python run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir transformers/examples/run_glue.py/MRPC \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/



---

# SQUAD

---



In [0]:
""" Official evaluation script for v1.1 of the SQuAD dataset. """
from __future__ import print_function
from collections import Counter
import string
import re
import argparse
import json
import sys


def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    return (normalize_answer(prediction) == normalize_answer(ground_truth))


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)


def evaluate(dataset, predictions):
    f1 = exact_match = total = 0
    for article in dataset:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                total += 1
                if qa['id'] not in predictions:
                    message = 'Unanswered question ' + qa['id'] + \
                              ' will receive score 0.'
                    print(message, file=sys.stderr)
                    continue
                ground_truths = list(map(lambda x: x['text'], qa['answers']))
                prediction = predictions[qa['id']]
                exact_match += metric_max_over_ground_truths(
                    exact_match_score, prediction, ground_truths)
                f1 += metric_max_over_ground_truths(
                    f1_score, prediction, ground_truths)

    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total

    return {'exact_match': exact_match, 'f1': f1}


if __name__ == '__main__':
#     expected_version = '1.1'
#     parser = argparse.ArgumentParser(
#         description='Evaluation for SQuAD ' + expected_version)
#     parser.add_argument('dataset_file', help='Dataset file')
#     parser.add_argument('prediction_file', help='Prediction File')
#     args = parser.parse_args()
    with open('train-v1.1.json') as dataset_file:
        dataset_json = json.load(dataset_file)
#         if (dataset_json['version'] != expected_version):
#             print('Evaluation expects v-' + expected_version +
#                   ', but got dataset with v-' + dataset_json['version'],
#                   file=sys.stderr)
        dataset = dataset_json['data']
    with open('dev-v1.1.json') as prediction_file:
        predictions = json.load(prediction_file)
    print(json.dumps(evaluate(dataset, predictions)))

In [0]:
!mkdir wwm_uncased_finetuned_squad

In [0]:
!python -m torch.distributed.launch --nproc_per_node=8 transformers/examples/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
    --do_lower_case \
    --train_file train-v1.1.json \
    --predict_file dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir wwm_uncased_finetuned_squad/ \
    --per_gpu_eval_batch_size=3   \
    --per_gpu_train_batch_size=3   \



---

## Model

---



---



In [0]:
import numpy as np
import tensorflow as tf
from tensorflow.contrib.training import HParams

In [0]:
def default_hparams():
    return HParams(
        n_vocab=0,
        n_ctx=1024,
        n_embd=768,
        n_head=12,
        n_layer=12,
    )

In [0]:
def shape_list(x):
    """Deal with dynamic shape in tensorflow cleanly."""
    static = x.shape.as_list()
    dynamic = tf.shape(x)
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]

In [0]:
def softmax(x, axis=-1):
    x = x - tf.reduce_max(x, axis=axis, keepdims=True)
    ex = tf.exp(x)
    return ex / tf.reduce_sum(ex, axis=axis, keepdims=True)

In [0]:
def gelu(x):
    return 0.5*x*(1+tf.tanh(np.sqrt(2/np.pi)*(x+0.044715*tf.pow(x, 3))))

In [0]:
def norm(x, scope, *, axis=-1, epsilon=1e-5):
    """Normalize to mean = 0, std = 1, then do a diagonal affine transform."""
    with tf.variable_scope(scope):
        n_state = x.shape[-1].value
        g = tf.get_variable('g', [n_state], initializer=tf.constant_initializer(1))
        b = tf.get_variable('b', [n_state], initializer=tf.constant_initializer(0))
        u = tf.reduce_mean(x, axis=axis, keepdims=True)
        s = tf.reduce_mean(tf.square(x-u), axis=axis, keepdims=True)
        x = (x - u) * tf.rsqrt(s + epsilon)
        x = x*g + b
        return x

In [0]:
def split_states(x, n):
    """Reshape the last dimension of x into [n, x.shape[-1]/n]."""
    *start, m = shape_list(x)
    return tf.reshape(x, start + [n, m//n])

In [0]:
def merge_states(x):
    """Smash the last two dimensions of x into a single dimension."""
    *start, a, b = shape_list(x)
    return tf.reshape(x, start + [a*b])

In [0]:
def conv1d(x, scope, nf, *, w_init_stdev=0.02):
    with tf.variable_scope(scope):
        *start, nx = shape_list(x)
        w = tf.get_variable('w', [1, nx, nf], initializer=tf.random_normal_initializer(stddev=w_init_stdev))
        b = tf.get_variable('b', [nf], initializer=tf.constant_initializer(0))
        c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])
        return c

In [0]:
def attention_mask(nd, ns, *, dtype):
    """1's in the lower triangle, counting from the lower right corner.
    Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.
    """
    i = tf.range(nd)[:,None]
    j = tf.range(ns)
    m = i >= j - ns + nd
    return tf.cast(m, dtype)

In [0]:
def attn(x, scope, n_state, *, past, hparams):
    assert x.shape.ndims == 3  # Should be [batch, sequence, features]
    assert n_state % hparams.n_head == 0
    if past is not None:
        assert past.shape.ndims == 5  # Should be [batch, 2, heads, sequence, features], where 2 is [k, v]

    def split_heads(x):
        # From [batch, sequence, features] to [batch, heads, sequence, features]
        return tf.transpose(split_states(x, hparams.n_head), [0, 2, 1, 3])

    def merge_heads(x):
        # Reverse of split_heads
        return merge_states(tf.transpose(x, [0, 2, 1, 3]))

    def mask_attn_weights(w):
        # w has shape [batch, heads, dst_sequence, src_sequence], where information flows from src to dst.
        _, _, nd, ns = shape_list(w)
        b = attention_mask(nd, ns, dtype=w.dtype)
        b = tf.reshape(b, [1, 1, nd, ns])
        w = w*b - tf.cast(1e10, w.dtype)*(1-b)
        return w

    def multihead_attn(q, k, v):
        # q, k, v have shape [batch, heads, sequence, features]
        w = tf.matmul(q, k, transpose_b=True)
        w = w * tf.rsqrt(tf.cast(v.shape[-1].value, w.dtype))

        w = mask_attn_weights(w)
        w = softmax(w)
        a = tf.matmul(w, v)
        return a

    with tf.variable_scope(scope):
        c = conv1d(x, 'c_attn', n_state*3)
        q, k, v = map(split_heads, tf.split(c, 3, axis=2))
        present = tf.stack([k, v], axis=1)
        if past is not None:
            pk, pv = tf.unstack(past, axis=1)
            k = tf.concat([pk, k], axis=-2)
            v = tf.concat([pv, v], axis=-2)
        a = multihead_attn(q, k, v)
        a = merge_heads(a)
        a = conv1d(a, 'c_proj', n_state)
        return a, present

In [0]:
def mlp(x, scope, n_state, *, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        h = gelu(conv1d(x, 'c_fc', n_state))
        h2 = conv1d(h, 'c_proj', nx)
        return h2

In [0]:
def block(x, scope, *, past, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
        x = x + a
        m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
        x = x + m
        return x, present

In [0]:
def past_shape(*, hparams, batch_size=None, sequence=None):
    return [batch_size, hparams.n_layer, 2, hparams.n_head, sequence, hparams.n_embd // hparams.n_head]

In [0]:
def expand_tile(value, size):
    """Add a new axis of given size."""
    value = tf.convert_to_tensor(value, name='value')
    ndims = value.shape.ndims
    return tf.tile(tf.expand_dims(value, axis=0), [size] + [1]*ndims)

In [0]:
def positions_for(tokens, past_length):
    batch_size = tf.shape(tokens)[0]
    nsteps = tf.shape(tokens)[1]
    return expand_tile(past_length + tf.range(nsteps), batch_size)

In [0]:
def model(hparams, X, past=None, scope='model', reuse=False):
    with tf.variable_scope(scope, reuse=reuse):
        results = {}
        batch, sequence = shape_list(X)

        wpe = tf.get_variable('wpe', [hparams.n_ctx, hparams.n_embd],
                             initializer=tf.random_normal_initializer(stddev=0.01))
        wte = tf.get_variable('wte', [hparams.n_vocab, hparams.n_embd],
                             initializer=tf.random_normal_initializer(stddev=0.02))
        past_length = 0 if past is None else tf.shape(past)[-2]
        h = tf.gather(wte, X) + tf.gather(wpe, positions_for(X, past_length))

        # Transformer
        presents = []
        pasts = tf.unstack(past, axis=1) if past is not None else [None] * hparams.n_layer
        assert len(pasts) == hparams.n_layer
        for layer, past in enumerate(pasts):
            h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
            presents.append(present)
        results['present'] = tf.stack(presents, axis=1)
        h = norm(h, 'ln_f')

        # Language model loss.  Do tokens <n predict token n?
        h_flat = tf.reshape(h, [batch*sequence, hparams.n_embd])
        logits = tf.matmul(h_flat, wte, transpose_b=True)
        logits = tf.reshape(logits, [batch, sequence, hparams.n_vocab])
        results['logits'] = logits
        return results



---

## Sample

---



---



In [0]:
def top_k_logits(logits, k):
    if k == 0:
        # no truncation
        return logits

    def _top_k():
        values, _ = tf.nn.top_k(logits, k=k)
        min_values = values[:, -1, tf.newaxis]
        return tf.where(
            logits < min_values,
            tf.ones_like(logits, dtype=logits.dtype) * -1e10,
            logits,
        )
    return tf.cond(
       tf.equal(k, 0),
       lambda: logits,
       lambda: _top_k(),
    )

In [0]:
def sample_sequence(*, hparams, length, start_token=None, batch_size=None, context=None, temperature=1, top_k=0):
    if start_token is None:
        assert context is not None, 'Specify exactly one of start_token and context!'
    else:
        assert context is None, 'Specify exactly one of start_token and context!'
        context = tf.fill([batch_size, 1], start_token)

    def step(hparams, tokens, past=None):
        lm_output = model.model(hparams=hparams, X=tokens, past=past, reuse=tf.AUTO_REUSE)

        logits = lm_output['logits'][:, :, :hparams.n_vocab]
        presents = lm_output['present']
        presents.set_shape(model.past_shape(hparams=hparams, batch_size=batch_size))
        return {
            'logits': logits,
            'presents': presents,
        }

    with tf.name_scope('sample_sequence'):
        # Don't feed the last context token -- leave that to the loop below
        # TODO: Would be slightly faster if we called step on the entire context,
        # rather than leaving the last token transformer calculation to the while loop.
        context_output = step(hparams, context[:, :-1])

        def body(past, prev, output):
            next_outputs = step(hparams, prev[:, tf.newaxis], past=past)
            logits = next_outputs['logits'][:, -1, :]  / tf.to_float(temperature)
            logits = top_k_logits(logits, k=top_k)
            samples = tf.multinomial(logits, num_samples=1, output_dtype=tf.int32)
            return [
                tf.concat([past, next_outputs['presents']], axis=-2),
                tf.squeeze(samples, axis=[1]),
                tf.concat([output, samples], axis=1),
            ]

        def cond(*args):
            return True

        _, _, tokens = tf.while_loop(
            cond=cond, body=body,
            maximum_iterations=length,
            loop_vars=[
                context_output['presents'],
                context[:, -1],
                context,
            ],
            shape_invariants=[
                tf.TensorShape(model.past_shape(hparams=hparams, batch_size=batch_size)),
                tf.TensorShape([batch_size]),
                tf.TensorShape([batch_size, None]),
            ],
            back_prop=False,
        )

        return tokens



---

## Encoder

---



---



In [0]:
import os
import json
import regex as re
from functools import lru_cache

In [0]:
@lru_cache()
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a signficant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

In [0]:
def get_pairs(word):
    """Return set of symbol pairs in a word.
    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

In [0]:
class Encoder:
    def __init__(self, encoder, bpe_merges, errors='replace'):
        self.encoder = encoder
        self.decoder = {v:k for k,v in self.encoder.items()}
        self.errors = errors # how to handle errors in decoding
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.cache = {}

        # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token)
        pairs = get_pairs(word)

        if not pairs:
            return token

        while True:
            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except:
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word

    def encode(self, text):
        bpe_tokens = []
        for token in re.findall(self.pat, text):
            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
            bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
        return bpe_tokens

    def decode(self, tokens):
        text = ''.join([self.decoder[token] for token in tokens])
        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
        return text

def get_encoder(model_name):
    with open(os.path.join('models', model_name, 'encoder.json'), 'r') as f:
        encoder = json.load(f)
    with open(os.path.join('models', model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    return Encoder(
        encoder=encoder,
        bpe_merges=bpe_merges,
    )



---

## Chatbot

---



---



In [0]:
def interact_model(
    model_name='117M',
    seed=None,
    length=20,
    temperature=1,
    top_k=0,
    conversation="""
you: hi
her: hey
you: i'm a human
her: i'm a robot
you: you ready?
her: yes :)
you: ok let's start chatting
her: sure, what do you want to talk about?"""
):

    enc = encoder.get_encoder(model_name)
    hparams = model.default_hparams()
    with open(os.path.join('models', model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if length > hparams.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

    with tf.Session(graph=tf.Graph()) as sess:
        np.random.seed(seed)
        tf.set_random_seed(seed)
        context = tf.placeholder(tf.int32, [1, None])
        output = sample.sample_sequence(
            hparams=hparams, length=length,
            context=context,
            batch_size=1,
            temperature=temperature, top_k=top_k
        )

        print(conversation)

        while True: 
            saver = tf.train.Saver()
            ckpt = tf.train.latest_checkpoint(os.path.join('models', model_name))
            saver.restore(sess, ckpt)
            message = None
            while not message:
                message = input("you: ")
            conversation = conversation + "\nyou: " + message
            conversation = conversation + "\nher: "
            sys.stdout.write("her: ")
            sys.stdout.flush()

            #sys.stderr.write("************************"+conversation+"***********************")
            #sys.stderr.flush()
            
            encoded_conversation = enc.encode(conversation)
            #print(len(encoded_conversation))
            result = sess.run(output, feed_dict={
                context: [encoded_conversation]
            })[:, len(encoded_conversation):]
            text = enc.decode(result[0])
            
            #sys.stderr.write("=============="+text+"=================")
            #sys.stderr.flush()

            splits = text.split('\n')
            #line = splits[1] if len(splits)>1 else splits[0]
            #parts = line.split(': ')
            #reply = parts[1] if len(parts)>1 else parts[0]
            reply = splits[0]
            sys.stdout.write(reply+'\n')
            sys.stdout.flush()
            conversation = conversation + reply

In [0]:
def sample_model(
    model_name='117M',
    seed=None,
    nsamples=0,
    batch_size=1,
    length=None,
    temperature=1,
    top_k=0,
):
    enc = encoder.get_encoder(model_name)
    hparams = model.default_hparams()
    with open(os.path.join('models', model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if length is None:
        length = hparams.n_ctx
    elif length > hparams.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

    with tf.Session(graph=tf.Graph()) as sess:
        np.random.seed(seed)
        tf.set_random_seed(seed)

        output = sample.sample_sequence(
            hparams=hparams, length=length,
            start_token=enc.encoder['<|endoftext|>'],
            batch_size=batch_size,
            temperature=temperature, top_k=top_k
        )[:, 1:]

        saver = tf.train.Saver()
        ckpt = tf.train.latest_checkpoint(os.path.join('models', model_name))
        saver.restore(sess, ckpt)

        generated = 0
        while nsamples == 0 or generated < nsamples:
            out = sess.run(output)
            for i in range(batch_size):
                generated += batch_size
                text = enc.decode(out[i])
                print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
                print(text)

In [0]:
def interact_model(
    model_name='117M',
    seed=None,
    nsamples=1,
    batch_size=None,
    length=None,
    temperature=1,
    top_k=0,
):
    if batch_size is None:
        batch_size = 1
    assert nsamples % batch_size == 0

    enc = encoder.get_encoder(model_name)
    hparams = model.default_hparams()
    with open(os.path.join('models', model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if length is None:
        length = hparams.n_ctx // 2
    elif length > hparams.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

    with tf.Session(graph=tf.Graph()) as sess:
        context = tf.placeholder(tf.int32, [batch_size, None])
        np.random.seed(seed)
        tf.set_random_seed(seed)
        output = sample.sample_sequence(
            hparams=hparams, length=length,
            context=context,
            batch_size=batch_size,
            temperature=temperature, top_k=top_k
        )

        saver = tf.train.Saver()
        ckpt = tf.train.latest_checkpoint(os.path.join('models', model_name))
        saver.restore(sess, ckpt)

        while True:
            raw_text = input("Model prompt >>> ")
            while not raw_text:
                print('Prompt should not be empty!')
                raw_text = input("Model prompt >>> ")
            context_tokens = enc.encode(raw_text)
            generated = 0
            for _ in range(nsamples // batch_size):
                out = sess.run(output, feed_dict={
                    context: [context_tokens for _ in range(batch_size)]
                })[:, len(context_tokens):]
                for i in range(batch_size):
                    generated += 1
                    text = enc.decode(out[i])
                    print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
                    print(text)
            print("=" * 80)



---

# Notes & Errata

---

1. Consider swapping character names in each individual text for a set of character names static across texts.

---





---

### XLNet

---



---





---

### Transformer-XL

---



---



In [0]:
!python -q pytorch-transformers/examples/run_generation.py \
    --model_type=transfo-xl \
    --model_name_or_path=transfo-xl-wt103 \
    --prompt=prompt \
    --length=50 \
    --temperature=1.0 \
    --top_k=0 \
    --top_p=0.9 \