# Baseline Model

## Overview

*   Apply the BERT-Large, Cased model to the SQuAD 2.0 dataset
*   Download the pretrained model and the questions
*   Set up GCP and TPUs
*   Train the model
*   Evaluate results on the dev set

### Step 1: Clone the Repo

In [None]:
#This will clone the BERT Repo

!git clone https://github.com/google-research/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.20 KiB | 4.12 MiB/s, done.
Resolving deltas: 100% (185/185), done.


In [None]:
# The code in the BERT repo is written for TensorFlow 1.x, and the automatic TF 2 conversion fails on these files.
# For that reason, it was easiest to fall back to TF 1.x for this notebook.

%tensorflow_version 1.x
import tensorflow
print(tensorflow.__version__)


TensorFlow 1.x selected.
1.15.2


In [None]:
# Make sure we're in the right place

%ls

[0m[01;34mbert[0m/  [01;34msample_data[0m/


In [None]:
# Move to BERT folder 

%cd bert

/content/bert


### Step 2: Select a model + Download Train/Dev sets

Available BERT pretrained models:

*   BERT-Large, Uncased (Whole Word Masking) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   BERT-Large, Cased (Whole Word Masking) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   BERT-Base, Uncased : 12-layer, 768-hidden, 12-heads, 110M parameters
*   BERT-Large, Uncased : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   BERT-Base, Cased : 12-layer, 768-hidden, 12-heads, 110M parameters
*   BERT-Large, Cased : 24-layer, 1024-hidden, 16-heads, 340M parameters

My EDA showed that capitalization carries information in this dataset, so I am using the BERT-Large, Cased model.
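To make the cased/uncased distinction concrete, here is a minimal sketch of what uncased preprocessing roughly does to text: lowercasing followed by accent stripping. The helper name `uncased_normalize` and the word pairs are hypothetical and only for illustration; this is an approximation, not the BERT tokenizer itself.

```python
import unicodedata


def uncased_normalize(text):
    """Approximate uncased preprocessing: lowercase, then strip accents.

    Hypothetical helper for illustration only; it is not the BERT tokenizer.
    """
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop combining marks (Unicode category "Mn"), i.e. the accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Word pairs whose meaning depends on capitalization (illustrative examples).
pairs = [("Apple", "apple"), ("US", "us"), ("Turkey", "turkey")]
for cased, lower in pairs:
    collapsed = uncased_normalize(cased) == uncased_normalize(lower)
    print(f"{cased!r} vs {lower!r}: identical after uncased preprocessing = {collapsed}")
```

An uncased model sees each pair as the same string, while a cased model keeps them distinct, which is the property the choice above relies on.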

In [None]:
# Download the cased model. 

!wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip

--2020-07-17 23:52:51--  https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.74.128, 74.125.124.128, 172.217.212.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.74.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1242178883 (1.2G) [application/zip]
Saving to: ‘cased_L-24_H-1024_A-16.zip’


2020-07-17 23:53:00 (146 MB/s) - ‘cased_L-24_H-1024_A-16.zip’ saved [1242178883/1242178883]



In [None]:
# Unzip the pretrained model

!unzip cased_L-24_H-1024_A-16.zip

Archive:  cased_L-24_H-1024_A-16.zip
   creating: cased_L-24_H-1024_A-16/
  inflating: cased_L-24_H-1024_A-16/bert_model.ckpt.meta  
  inflating: cased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001  
  inflating: cased_L-24_H-1024_A-16/vocab.txt  
  inflating: cased_L-24_H-1024_A-16/bert_model.ckpt.index  
  inflating: cased_L-24_H-1024_A-16/bert_config.json  


In [None]:
# Download the SQuAD 2.0 train and dev datasets

!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2020-07-17 23:53:23--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2020-07-17 23:53:24 (58.8 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2020-07-17 23:53:24--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2020-07-17 23:53:25 (16.2 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]
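
Before training, it can help to check how the downloaded files are laid out. The sketch below walks the nested SQuAD 2.0 structure (`data` → `paragraphs` → `qas`) and counts answerable versus unanswerable questions using the `is_impossible` field. The function name `count_questions` and the sample content are made up for illustration.

```python
import json


def count_questions(dataset):
    """Return (answerable, impossible) question counts for a SQuAD 2.0 dict."""
    answerable, impossible = 0, 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    impossible += 1
                else:
                    answerable += 1
    return answerable, impossible


# Tiny in-memory sample that mirrors the SQuAD 2.0 layout (made-up content).
sample_json = """
{
  "version": "v2.0",
  "data": [{
    "title": "Example",
    "paragraphs": [{
      "context": "BERT was released in 2018.",
      "qas": [
        {"id": "q1", "question": "When was BERT released?",
         "answers": [{"text": "2018", "answer_start": 21}],
         "is_impossible": false},
        {"id": "q2", "question": "Who funded BERT?",
         "answers": [], "is_impossible": true}
      ]
    }]
  }]
}
"""
sample = json.loads(sample_json)
print(count_questions(sample))

# To inspect a real file downloaded above:
# with open("dev-v2.0.json") as f:
#     print(count_questions(json.load(f)))
```

Run against `dev-v2.0.json`, the two counts should correspond to the `HasAns_total` and `NoAns_total` fields that the evaluation script reports.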



### Step 3: Imports and TPU Setup

In [None]:
# Imports

import datetime
import json
import os
import time
import pprint
import random
import string
import sys
import tensorflow as tf

# Get TPU Address for training

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

#Authorize Google and connect.

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())
  
  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
    
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)


TPU address is =>  grpc://10.19.30.242:8470
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 3295030357378495853),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 3914317039016132569),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 201647257280353741),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 11604506492514759225),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 8970057966084767594),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 10313378223518499565),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 16989872891666213779),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 11658468694366527241),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 180613716

In [None]:
# Create variables for Buckets and Outputs for later use. 

BUCKET = 'thaddeussegura_final_project' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'self_ensemble_1/' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.io.gfile.makedirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

***** Model output directory: gs://thaddeussegura_final_project/self_ensemble_1/ *****


In [None]:
# Move the model to the Google Cloud Storage bucket.

!gsutil mv /content/bert/cased_L-24_H-1024_A-16 $BUCKET_NAME

Copying file:///content/bert/cased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001 [Content-Type=application/octet-stream]...
/ [0 files][    0.0 B/  1.2 GiB]                                                ==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

Removing file:///content/bert/cased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001...
Copying file:///cont

### Step 4: Train the Model

In [None]:
'''
Attempt 1:
LR: 3e-5, Epochs: 3, Batch: 24, Time (min): 90.3, EM: 77.09, F1: 80.38

Attempt 2:
LR: 3e-5, Epochs: 4, Batch: 24, Time (min): 114, EM: 76.6, F1: 80.23

Attempt 3:
LR: 2e-5, Epochs: 3, Batch: 32, Time (min): 86, EM: 76.37, F1: 79.92

Attempt 4:
LR: 5e-5, Epochs: 3, Batch: 16, Time (min): 57, EM: 75.24, F1: 78.93

OVERTRAIN Model:
LR: 2e-5, Epochs: 8, Batch: 24, Time (min): , EM: , F1:
'''

# Train on the training set, then predict on the dev set.
# Time the whole run (training plus prediction).

start_time = time.time()

!python run_squad.py \
  --vocab_file=$BUCKET_NAME/cased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/cased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$BUCKET_NAME/cased_L-24_H-1024_A-16/bert_model.ckpt \
  --do_train=True \
  --train_file=train-v2.0.json \
  --do_predict=True \
  --predict_file=dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=4.0 \
  --use_tpu=True \
  --tpu_name=$TPU_ADDRESS \
  --max_seq_length=384 \
  --doc_stride=128 \
  --save_checkpoints_steps=5000 \
  --version_2_with_negative=True \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=False

end_time = time.time()

Streaming output truncated to the last 5000 lines.
I0718 02:01:10.329540 140649565538176 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0718 02:01:10.347740 140649565538176 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0718 02:01:10.348136 140649565538176 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0718 02:01:10.363640 140649565538176 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0718 02:01:10.363898 140649565538176 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0718 02:01:10.380717 140649565538176 tpu_estimator.py:600] Enqueue next (1) bat

In [None]:
total_time = end_time-start_time
print('Minutes to train Large Cased Model (8 epochs):')
print(total_time/60)

Minutes to train Large Cased Model (8 epochs):
214.89159316619237
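
For comparing the attempts above, it helps to see how batch size and epoch count translate into optimizer steps. As far as I can tell, `run_squad.py` derives the step count from the number of training examples, the batch size, and the number of epochs, and warms the learning rate up over a fraction of those steps. The sketch below reproduces that arithmetic; the example count of 130,319 is the commonly cited size of the SQuAD 2.0 training set and the warmup fraction of 0.1 is the script's default as I recall it, so treat both as assumptions rather than values taken from this notebook.

```python
def training_schedule(num_examples, batch_size, epochs, warmup_proportion=0.1):
    """Return (total_steps, warmup_steps) for a fine-tuning run."""
    total_steps = int(num_examples / batch_size * epochs)
    warmup_steps = int(total_steps * warmup_proportion)
    return total_steps, warmup_steps


NUM_TRAIN_EXAMPLES = 130319  # assumed size of train-v2.0.json

# (batch size, epochs) for the configurations listed in the notes above.
for batch_size, epochs in [(24, 3.0), (24, 4.0), (32, 3.0), (16, 3.0), (24, 8.0)]:
    total, warmup = training_schedule(NUM_TRAIN_EXAMPLES, batch_size, epochs)
    print(f"batch={batch_size}, epochs={epochs}: {total} steps, {warmup} warmup steps")
```

Note that a smaller batch size means more steps per epoch, so the step count alone does not predict wall-clock time.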


### Step 5: Evaluate the Results

In [None]:
# Clone a repo that includes the SQuAD 2.0 evaluation script (evaluate-v2.0.py)
!git clone https://github.com/white127/SQUAD-2.0-bidaf.git

Cloning into 'SQUAD-2.0-bidaf'...
remote: Enumerating objects: 125, done.
remote: Total 125 (delta 0), reused 0 (delta 0), pack-reused 125
Receiving objects: 100% (125/125), 709.51 KiB | 5.14 MiB/s, done.
Resolving deltas: 100% (33/33), done.


In [None]:
# Move evaluate-v2.0.py into the bert folder

%mv /content/bert/SQUAD-2.0-bidaf/evaluate-v2.0.py /content/bert/

In [None]:
# Here I moved the predictions file manually from the Google Cloud Storage bucket
# into Colab (saved as preds.json). I will automate this later when I have multiple files to move.
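
One way the manual copy could be automated is to shell out to `gsutil cp`. This is only a sketch: it assumes `run_squad.py` wrote its predictions as `predictions.json` inside the output directory, and the helper names `build_copy_command` and `fetch_predictions` are hypothetical.

```python
import subprocess


def build_copy_command(output_dir, local_path="preds.json",
                       remote_name="predictions.json"):
    """Build the gsutil command that copies one file out of the bucket.

    remote_name is an assumption about what the training script wrote.
    """
    remote = output_dir.rstrip("/") + "/" + remote_name
    return ["gsutil", "cp", remote, local_path]


def fetch_predictions(output_dir, local_path="preds.json"):
    """Run the copy; raises CalledProcessError if gsutil fails."""
    subprocess.run(build_copy_command(output_dir, local_path), check=True)
    return local_path


# Show the command for the output directory used in this notebook.
cmd = build_copy_command("gs://thaddeussegura_final_project/self_ensemble_1/")
print(" ".join(cmd))
```

Calling `fetch_predictions(OUTPUT_DIR)` in an authenticated Colab session would then place `preds.json` next to the evaluation script.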

In [None]:
# Evaluate the Results 

print("Results for Large, Cased (8 Epochs)")
!python evaluate-v2.0.py dev-v2.0.json preds.json


Results for Large, Cased (8 Epochs)
{
  "exact": 74.84207866588056,
  "f1": 78.75671559407992,
  "total": 11873,
  "HasAns_exact": 76.73751686909581,
  "HasAns_f1": 84.57801691101753,
  "HasAns_total": 5928,
  "NoAns_exact": 72.95206055508831,
  "NoAns_f1": 72.95206055508831,
  "NoAns_total": 5945
}
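
To read these numbers, it helps to know roughly how the two metrics are computed. The sketch below follows the usual SQuAD recipe as I understand it: answers are normalized (lowercased, punctuation and articles removed), `exact` checks string equality, and `f1` measures token overlap. It is a simplified stand-in, not the evaluation script itself. The last lines check that the overall `exact` score above is the question-weighted average of the `HasAns` and `NoAns` scores.

```python
import collections
import re
import string


def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(gold, pred):
    return int(normalize_answer(gold) == normalize_answer(pred))


def token_f1(gold, pred):
    gold_toks = normalize_answer(gold).split()
    pred_toks = normalize_answer(pred).split()
    if not gold_toks or not pred_toks:
        # No-answer case: score 1 only if both sides are empty.
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))
print(token_f1("Paris", "in Paris France"))

# Consistency check on the reported results: (score, question count).
has_ans = (76.73751686909581, 5928)
no_ans = (72.95206055508831, 5945)
overall_exact = (has_ans[0] * has_ans[1] + no_ans[0] * no_ans[1]) / (has_ans[1] + no_ans[1])
print(overall_exact)
```

The no-answer branch in `token_f1` is also why `NoAns_exact` and `NoAns_f1` are identical in the output: for an unanswerable question, both metrics are 1 only when the prediction is empty.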
