# Prepare Dataset



In [1]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2020-12-16 18:02:19--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2020-12-16 18:02:24 (66.4 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2020-12-16 18:02:24--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.111.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2020-12-16 18:02:25 (21.7 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [2]:
%tensorflow_version 1.x
import tensorflow as tf
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

import json
import random
import numpy as np
from time import time
from tqdm import tqdm

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [7]:
# Preprocess training set
c, q, a, e = "<|startoftext|>\n[CONTEXT]: ", "\n[QUESTION]:", "\n[ANSWER]:", "\n<|endoftext|>\n"

def preprocess_paragraphs(data):
  pars_data = c + data['context']
  qas_list = data['qas']
  random.shuffle(qas_list)  # shuffle the questions in place
  for qas in qas_list:
    pars_data += q + qas['question']
    try:
      if qas['is_impossible']:
        pars_data += a + 'unanswerable'
      else:
        pars_data += a + qas['answers'][0]['text']
    except:
      print(qas)
      raise
  pars_data += e
  return pars_data

def extract_article(data):
  article_data = ''
  paras_list = data['paragraphs']
  random.shuffle(paras_list) # random shuffle paragraphs
  for pars in paras_list:
    article_data += preprocess_paragraphs(pars)
  return article_data

def preprocess_train(data):
  dataset = ''
  article_list = data['data']
  random.shuffle(article_list)
  for article in article_list:
    dataset += extract_article(article)
  return dataset

In [8]:
with open('train-v2.0.json') as f:
  train_data = json.load(f)
  
train_dataset = preprocess_train(train_data)

In [9]:
print(f'Length of the corpus: {len(train_dataset)}')

Length of the corpus: 27563659


# Training set explanation

### GPT-2 is trained with a language-modeling objective on unlabeled text, so the training set is just one very long Python string (the corpus) without any labels. Given some warm-up text, GPT-2 learns to generate what comes next.

### So how can we make an autoregressive model answer a question? We need special tokens that indicate what we expect GPT-2 to do next. Here is an example:


In [13]:
print(train_dataset[:3000])

<|startoftext|>
[CONTEXT]: Philadelphia artists have had a prominent national role in popular music. In the 1970s, Philadelphia soul influenced the music of that and later eras. On July 13, 1985, Philadelphia hosted the American end of the Live Aid concert at John F. Kennedy Stadium. The city reprised this role for the Live 8 concert, bringing some 700,000 people to the Ben Franklin Parkway on July 2, 2005. Philadelphia is home to the world-renowned Philadelphia Boys Choir & Chorale, which has performed its music all over the world. Dr. Robert G. Hamilton, founder of the choir, is a notable native Philadelphian. The Philly Pops is another famous Philadelphia music group. The city has played a major role in the development and support of American rock music and rap music. Hip-hop/Rap artists such as The Roots, DJ Jazzy Jeff & The Fresh Prince, The Goats, Freeway, Schoolly D, Eve, and Lisa "Left Eye" Lopes hail from the city.
[QUESTION]:What concert did Philly host on July13th, 1985?
[AN
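To make the layout of one training record explicit, here is a minimal sketch that builds a record from a hand-written context and two question/answer pairs. The delimiter strings mirror the ones defined in the preprocessing cell; `build_record`, the `None` convention for unanswerable questions, and the sample data are hypothetical and only for illustration.

```python
# Delimiters copied from the preprocessing cell.
START = "<|startoftext|>\n[CONTEXT]: "
QUESTION = "\n[QUESTION]:"
ANSWER = "\n[ANSWER]:"
END = "\n<|endoftext|>\n"


def build_record(context, qa_pairs):
    """Return one training record: a context followed by its Q/A pairs.

    qa_pairs is a list of (question, answer) tuples; answer is None for
    an unanswerable question (a convention made up for this sketch).
    """
    parts = [START + context]
    for question, answer in qa_pairs:
        parts.append(QUESTION + question)
        parts.append(ANSWER + (answer if answer is not None else "unanswerable"))
    parts.append(END)
    return "".join(parts)


record = build_record(
    "Normandy is a region in France.",
    [("Where is Normandy?", "France"), ("Who founded Normandy?", None)],
)
print(record)
```

Each record starts with `<|startoftext|>` and ends with `<|endoftext|>`, so the corpus is simply many such records concatenated.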

In [15]:
with open("squad2_dataset.txt", "w") as f:
  f.write(train_dataset)

In [None]:
file_name = "squad2_dataset.txt"

### In the raw dataset, unanswerable questions always come after answerable ones. The GPT-2 model may learn this undesired pattern and overfit to it, so we randomly shuffle the questions to mitigate the problem.


### This also shows that GPT-2 has a strong capacity to fit many kinds of long-sequence patterns, so if the training text is not carefully preprocessed, the model can easily overfit.
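One way to check for this ordering pattern before and after shuffling is sketched below. The helper `unanswerable_always_last` and the toy `qas` list are hypothetical; the real entries in `train-v2.0.json` carry more fields than the two used here.

```python
import random


def unanswerable_always_last(qas):
    """Return True if no answerable question appears after an unanswerable one."""
    seen_unanswerable = False
    for qa in qas:
        if qa["is_impossible"]:
            seen_unanswerable = True
        elif seen_unanswerable:
            return False
    return True


# Toy paragraph in the raw order: three answerable questions, then three unanswerable.
qas = [{"question": f"q{i}", "is_impossible": i >= 3} for i in range(6)]
print(unanswerable_always_last(qas))  # True for the raw order

shuffled = qas[:]  # copy, so the raw order stays available for comparison
random.Random(0).shuffle(shuffled)
print(unanswerable_always_last(shuffled))
```

Running the same check over every paragraph of the real dataset would show how often the pattern survives a shuffle by chance, which matters for paragraphs with only a few questions.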

In [16]:
# preprocess test set
def read_file(file_name):
  """Flatten a SQuAD v2.0 JSON file into parallel lists, one entry per question."""
  with open(file_name) as f:
    data = json.load(f)
  contexts = []
  questions = []
  answers = []

  for group in data['data']:
    for passage in group['paragraphs']:
      context = passage['context']
      for qa in passage['qas']:
        question = qa['question']
        an = []
        if qa['is_impossible']:
          # unanswerable questions get a single placeholder answer
          an.append('unanswerable')
        else:
          # keep every reference answer so scoring can take the best match
          for answer in qa['answers']:
            an.append(answer['text'])
        contexts.append(context)
        questions.append(question)
        answers.append(an)

  return contexts, questions, answers

with open('dev-v2.0.json') as f:
  dev_data = json.load(f)

dev_contexts, dev_questions, dev_answers = read_file('dev-v2.0.json')
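For reference, the nesting that `read_file` walks (articles, then paragraphs, then question/answer entries) can be seen on a tiny hand-made dictionary. The sample below is hypothetical and only mimics the fields the function reads; `flatten` repeats the same traversal on an in-memory object instead of a file.

```python
def flatten(data):
    """Flatten SQuAD-style data into parallel lists, one entry per question."""
    contexts, questions, answers = [], [], []
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["is_impossible"]:
                    gold = ["unanswerable"]
                else:
                    gold = [ans["text"] for ans in qa["answers"]]
                contexts.append(paragraph["context"])
                questions.append(qa["question"])
                answers.append(gold)
    return contexts, questions, answers


# Hand-made sample with one article, one paragraph and two questions.
sample = {
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "context": "Normandy is a region in France.",
                    "qas": [
                        {
                            "question": "Where is Normandy?",
                            "is_impossible": False,
                            "answers": [{"text": "France"}, {"text": "in France"}],
                        },
                        {
                            "question": "Who founded Normandy?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}

contexts, questions, answers = flatten(sample)
print(questions)
print(answers)
```

The context is repeated once per question, so the three lists always have the same length and can be indexed together.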

## An input example at prediction time. After finetuning, the GPT-2 model is expected to give the answer after the '[ANSWER]:' token.

In [17]:
print(c + dev_contexts[0] + q + dev_questions[0] + a)

<|startoftext|>
[CONTEXT]: The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
[QUESTION]:In what country is Normandy located?
[ANSWER]:
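Because the model continues the prompt, the generated string contains the prompt followed by the answer and, usually, further text. A minimal sketch of cutting out just the answer is shown below. `extract_answer` and the sample `generated` string are hypothetical, the `strip()` call is an addition for this sketch, and the approach assumes the returned text starts with the prompt unchanged.

```python
def extract_answer(generated, prompt):
    """Return the first line that follows the prompt in the generated text."""
    if not generated.startswith(prompt):
        raise ValueError("generated text does not start with the prompt")
    continuation = generated[len(prompt):]
    # The answer ends at the first newline, where the next tag would begin.
    return continuation.split("\n")[0].strip()


prompt = (
    "<|startoftext|>\n[CONTEXT]: Normandy is a region in France."
    "\n[QUESTION]:In what country is Normandy located?\n[ANSWER]:"
)
# Made-up continuation standing in for real model output.
generated = prompt + "France\n[QUESTION]:When did the Normans arrive?"
print(extract_answer(generated, prompt))  # France
```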


In [18]:
!nvidia-smi

Wed Dec 16 18:14:44 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    23W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Downloading GPT-2

There are four released sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 
* `1558M`: the "extra large", 

Larger models have more knowledge, but take longer to finetune and longer to generate text. 

In [19]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 208Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 73.1Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 289Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:16, 84.7Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 184Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 94.6Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 151Mit/s]                                                       


In [None]:
# gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Training the medium model on our dataset takes about 4.5 hours (30,000 steps) on a Tesla V100 (the fastest GPU on Colab, which requires Colab Pro). It will take much longer on other GPUs.

In [None]:
%%time
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='355M',
              steps=30000,
              restore_from='fresh',
              run_name='QA_m_30k_random',
              print_every=100,
              sample_every=2000,
              save_every=2000)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:32<00:00, 32.37s/it]


dataset has 6756495 tokens
Training...
[100 | 62.37] loss=1.76 avg=1.76
[200 | 114.60] loss=1.65 avg=1.71
[300 | 166.66] loss=1.77 avg=1.73
[400 | 218.80] loss=1.84 avg=1.76
[500 | 270.97] loss=1.92 avg=1.79
[600 | 323.10] loss=2.05 avg=1.83
[700 | 375.19] loss=1.43 avg=1.78
[800 | 427.77] loss=2.18 avg=1.83
[900 | 480.00] loss=1.90 avg=1.84
[1000 | 532.11] loss=2.12 avg=1.86
[1100 | 584.27] loss=2.19 avg=1.90
[1200 | 636.44] loss=2.43 avg=1.94
[1300 | 688.52] loss=1.62 avg=1.92
[1400 | 740.58] loss=1.98 avg=1.92
[1500 | 792.75] loss=1.41 avg=1.88
[1600 | 844.87] loss=1.76 avg=1.88
[1700 | 897.04] loss=1.64 avg=1.86
[1800 | 949.21] loss=2.33 avg=1.89
[1900 | 1001.34] loss=2.27 avg=1.91
[2000 | 1053.87] loss=1.70 avg=1.90
Saving checkpoint/QA_m_70k_random/model-2000
 to the United States, is known as one of the most technologically advanced cities on the planet, and was also the capital of the country the United States during the Industrial Revolution and the American Revolution.
[QUEST
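As a rough cross-check of the quoted training time, the per-step cost can be estimated from two lines of the log above (step 100 at 62.37 s and step 2000 at 1053.87 s). This is only a back-of-the-envelope sketch: it assumes the step time stays constant and ignores any pauses for sampling and checkpointing.

```python
# Two (step, elapsed seconds) readings taken from the training log.
step_a, time_a = 100, 62.37
step_b, time_b = 2000, 1053.87

seconds_per_step = (time_b - time_a) / (step_b - step_a)
total_steps = 30000
estimated_hours = seconds_per_step * total_steps / 3600

print(f"{seconds_per_step:.3f} s/step, about {estimated_hours:.1f} h for {total_steps} steps")
# 0.522 s/step, about 4.3 h for 30000 steps
```

The estimate is in the same range as the 4.5 hours mentioned earlier; the gap may come from the periodic sampling and checkpoint saving, which this calculation leaves out.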

In [None]:
# if you want to save the model to your Google Drive after fine-tuning
# gpt2.mount_gdrive()
# gpt2.copy_checkpoint_to_gdrive(run_name='QA_m_30k_random')

In [None]:
# copy model from drive
# gpt2.copy_checkpoint_from_gdrive(run_name='QA_m_30k_random')

In [None]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='QA_m_30k_random')

Loading checkpoint checkpoint/QA_m_30k_random/model-30000
INFO:tensorflow:Restoring parameters from checkpoint/QA_m_30k_random/model-30000


In [None]:
# f1 score helper function
def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
    common_tokens = set(pred_tokens) & set(truth_tokens)
    
    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0
    
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    
    return 2 * (prec * rec) / (prec + rec)

predicted_answer = 'The wetter climate may have allowed for the spread of the rainforest to spread out across the continent.'
true_answer = ['the wetter climate may have allowed the tropical rainforest to spread out across the continent', 'wetter']
f1_score = max((compute_f1(predicted_answer, a)) for a in true_answer)
f1_score

0.8461538461538461
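Note that `compute_f1` counts shared tokens with a set intersection, so a token repeated in both strings is counted once. As far as we know, the official SQuAD evaluation script counts the overlap as a multiset with `collections.Counter`. The sketch below shows that variant; `compute_f1_multiset` is a hypothetical name, and the normalization is a compact rewrite of the helper above rather than the official code.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize_text(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in _PUNCT)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def compute_f1_multiset(prediction, truth):
    """Token-level F1 where repeated tokens count as often as they overlap."""
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    if not pred_tokens or not truth_tokens:
        # If either side is empty, F1 is 1 only when both are empty.
        return float(pred_tokens == truth_tokens)
    overlap = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# "new" overlaps twice here; a set intersection would count it only once.
print(compute_f1_multiset("new new york", "new new jersey"))
```

On the rainforest example above both versions agree, because the only repeated token ("spread") appears once in the reference answer.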

In [None]:
%%time
ems = []
f1s = []
pred_answers = []
start = time()
# for i in range(len(dev_contexts)):
for i in tqdm(range(500)):
  
  if (i+1) % 15 == 0:
    # release memory
    gpt2.reset_session(sess)
    tf.reset_default_graph()
    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess, run_name='QA_m_30k_random')

  pre = c + dev_contexts[i] + q + dev_questions[i] + a
  pre_len = len(pre)
  ans = gpt2.generate(sess,
                length=30,
                temperature=0.5,
                prefix=pre,
                nsamples=1,
                batch_size=1,
                run_name="QA_m_30k_random",
                return_as_list=True)[0]
  # print(time() - start)
  predicted_answer = ans[pre_len:].split('\n')[0]
  pred_answers.append(predicted_answer)
  true_answer = dev_answers[i]
  f1s.append(max((compute_f1(predicted_answer, a)) for a in true_answer))
  # print(time() - start)
  ems.append(predicted_answer in true_answer)

  3%|▎         | 14/500 [02:44<1:48:39, 13.42s/it]

Loading checkpoint checkpoint/QA_m_30k_random/model-30000
INFO:tensorflow:Restoring parameters from checkpoint/QA_m_30k_random/model-30000


100%|██████████| 500/500 [1:42:47<00:00, 12.34s/it]

CPU times: user 1h 43min 3s, sys: 1min 6s, total: 1h 44min 9s
Wall time: 1h 42min 47s





In [None]:
np.mean(f1s)

0.576591330891331

In [None]:
np.mean(ems)

0.508
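The exact-match score above compares raw strings, so a prediction that differs from a reference answer only in case, punctuation or a leading article counts as a miss. A normalized comparison, in the spirit of the F1 helper, could be sketched as follows. `exact_match` is a hypothetical helper, and the scores reported above were computed with the raw comparison, not with this one.

```python
import re
import string

_PUNCT = set(string.punctuation)


def normalize_text(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in _PUNCT)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, truths):
    """Return 1 if the normalized prediction equals any normalized reference."""
    pred = normalize_text(prediction)
    return int(any(pred == normalize_text(t) for t in truths))


print(exact_match("The Normans.", ["Normans"]))  # 1
print(exact_match("France", ["Denmark"]))        # 0
```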