# Simple Self Ensemble: 8-Way Split

## Overview

*   The original training set is split 8 ways.
*   One BERT Large model is trained on each split of the data.
*   Each model is run on the dev set to generate predictions.
*   The predictions are combined by simple majority vote: for each question, the most common answer across the 8 models wins.
*   The models trained in this process are reused in the Deep Self Ensemble notebook.
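
The voting step can be sketched as follows. This is a minimal illustration rather than the notebook's actual code (the real helper functions appear in Step 6); the function name `majority_vote` and the toy predictions are assumptions made for the example.

```python
from collections import Counter

def majority_vote(per_model_predictions):
    """Combine {question_id: answer} dicts from several models by simple voting."""
    combined = {}
    # Use the first model's question ids as the reference set.
    for qid in per_model_predictions[0]:
        answers = [preds[qid] for preds in per_model_predictions]
        # most_common(1) returns the single most frequent answer.
        combined[qid] = Counter(answers).most_common(1)[0][0]
    return combined

# Toy predictions from three hypothetical models; "" means "no answer" in SQuAD 2.0.
toy = [
    {"q1": "Paris", "q2": ""},
    {"q1": "Paris", "q2": "1066"},
    {"q1": "Lyon", "q2": ""},
]
print(majority_vote(toy))
```

With 8 models, ties are possible (for example a 4/4 split), so the tie-breaking rule matters; in this sketch the answer seen first wins.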

### Step 1: Clone the repo


In [None]:
# Clone the BERT repo

!git clone https://github.com/google-research/bert.git

In [None]:
# The code in the BERT repo is written for TensorFlow 1.x, and the automatic TF 2 conversion fails on these files.
# For this reason, it was easiest to revert to TensorFlow 1.x for this notebook.

%tensorflow_version 1.x
import tensorflow
print(tensorflow.__version__)


TensorFlow 1.x selected.
1.15.2


In [None]:
# Make sure we're in the right place

%ls

[0m[01;34mbert[0m/  [01;34msample_data[0m/


In [None]:
# Move to BERT folder 

%cd bert

/content/bert


### Step 2: Select a model


BERT pretrained model list:

*   BERT-Large, Uncased (Whole Word Masking) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   BERT-Large, Cased (Whole Word Masking) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   BERT-Base, Uncased : 12-layer, 768-hidden, 12-heads, 110M parameters
*   BERT-Large, Uncased : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   BERT-Base, Cased : 12-layer, 768-hidden, 12-heads, 110M parameters
*   BERT-Large, Cased : 24-layer, 1024-hidden, 16-heads, 340M parameters

Based on my EDA, capitalization is important, so I am using the Large Cased model.

In [None]:
# Download the cased model. 
!wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip

--2020-07-10 21:06:13--  https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.196.128, 173.194.192.128, 172.217.212.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.196.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1242178883 (1.2G) [application/zip]
Saving to: ‘cased_L-24_H-1024_A-16.zip’


2020-07-10 21:06:23 (122 MB/s) - ‘cased_L-24_H-1024_A-16.zip’ saved [1242178883/1242178883]



In [None]:
# Unzip the pretrained model
!unzip cased_L-24_H-1024_A-16.zip

Archive:  cased_L-24_H-1024_A-16.zip
  inflating: cased_L-24_H-1024_A-16/bert_model.ckpt.meta  
  inflating: cased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001  
  inflating: cased_L-24_H-1024_A-16/vocab.txt  
  inflating: cased_L-24_H-1024_A-16/bert_model.ckpt.index  
  inflating: cased_L-24_H-1024_A-16/bert_config.json  


### Step 3: Get the train/dev data.

In [None]:
#Mount my drive so that I can access the split training sets. 

from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [None]:
#Copy the data from drive to colab. 

%cp -R /content/drive/My\ Drive/8_way_split/* /content/bert/
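
The notebook does not show how the split files were created. One way such a split could be produced is sketched below, assuming the standard SQuAD JSON layout (a top-level `data` list of articles) and a round-robin assignment of articles to splits. The function name `split_squad` and the toy data are hypothetical; only the `8_way_<i>.json` naming comes from this notebook.

```python
import json
import os
import tempfile

def split_squad(squad, n_splits=8):
    """Return n_splits SQuAD-format dicts, assigning articles round-robin."""
    splits = [{"version": squad.get("version", ""), "data": []} for _ in range(n_splits)]
    for idx, article in enumerate(squad["data"]):
        splits[idx % n_splits]["data"].append(article)
    return splits

# Tiny stand-in for train-v2.0.json: 20 articles with empty paragraph lists.
toy_squad = {
    "version": "v2.0",
    "data": [{"title": "article_%d" % i, "paragraphs": []} for i in range(20)],
}

parts = split_squad(toy_squad, n_splits=8)

# Write each part using the 8_way_<i>.json convention (to a temp folder here).
out_dir = tempfile.mkdtemp()
for i, part in enumerate(parts):
    with open(os.path.join(out_dir, "8_way_%d.json" % i), "w", encoding="utf-8") as f:
        json.dump(part, f)

print([len(part["data"]) for part in parts])
```

Splitting by article keeps each context with its questions, but the splits will not contain exactly the same number of questions; whether the original split was balanced that way is not stated here.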

In [None]:
# Download the SQuAD dev set.

# The training set is not needed because the split version copied above is used instead.
#!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

# The dev set is still needed for prediction and evaluation.
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2020-07-10 21:07:19--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2020-07-10 21:07:20 (9.57 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



### Step 4: Imports, TPU Setup, and GCP Bucket Setup


In [None]:
# Imports 
import datetime
import json
import os
import time
import pprint
import random
import string
import sys
import tensorflow as tf
import re
from collections import Counter
from itertools import groupby

# Get TPU Address for training

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

#Authorize Google and connect.

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())
  
  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
    
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)


TPU address is =>  grpc://10.94.126.154:8470
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 16943761991303418183),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 13520321758806377632),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 18033879894413116901),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 11105852559762011962),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 4507963364507085406),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 18086502421233452356),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 13751009646577734726),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 17548891014368971901),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 6690

In [None]:
# Create variables for Buckets and Outputs for later use. 

BUCKET = 'thaddeussegura_final_project' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'self_ensemble_8/' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.io.gfile.makedirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

***** Model output directory: gs://thaddeussegura_final_project/self_ensemble_8/ *****


In [None]:
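# Move the model to the Google Cloud bucket.
# If the folder is no longer present locally (for example, it was moved on an earlier run), gsutil reports "No URLs matched".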

!gsutil mv /content/bert/cased_L-24_H-1024_A-16 $BUCKET_NAME

CommandException: No URLs matched: /content/bert/cased_L-24_H-1024_A-16


In [None]:
# Necessary installs so I can mount the files from my bucket onto colab

!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   653  100   653    0     0  25115      0 --:--:-- --:--:-- --:--:-- 25115
OK
81 packages can be upgraded. Run 'apt list --upgradable' to see them.
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  gcsfuse
0 upgraded, 1 newly installed, 0 to remove and 81 not upgraded.
Need to get 4,278 kB of archives.
After this operation, 12.8 MB of additional disk space will be used.
Selecting previously unselected package gcsfuse.
(Reading database ... 144379 files and directories currently installed.)
Preparing to unpack .../gcsfuse_0.30.0_amd64.deb ...
Unpacking gcsfuse (0.30.0) ...
Setting up gcsfuse (0.30.0) ...


In [None]:
# Make a mount point for the bucket; the bucket's files will appear inside it.

!mkdir folderOnColab
!gcsfuse thaddeussegura_final_project folderOnColab 

Using mount point: /content/bert/folderOnColab
Opening GCS connection...
Opening bucket...
Mounting file system...
File system has been successfully mounted.


### Step 5: Train the model

*   Define the training function
*   Run the full training 8 times, generating 8 models and 8 sets of predictions.
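
As a rough sketch of the bookkeeping involved, each split needs a training file, an output directory in the bucket, and distinct names for its renamed prediction files. The helper `split_paths` below is hypothetical and the bucket path is a placeholder; the naming pattern mirrors the training loop in this step.

```python
def split_paths(i, output_dir="gs://my-bucket/self_ensemble_8/"):
    """Return the file and folder names used for split i."""
    folder_name = "8_way_" + str(i)
    return {
        "train_file": folder_name + ".json",             # training split
        "output_dir": output_dir + folder_name,          # checkpoints and predictions
        "preds": folder_name + "_preds.json",            # renamed predictions.json
        "nbest_preds": folder_name + "_n_preds.json",    # renamed nbest_predictions.json
    }

for i in range(8):
    print(split_paths(i))
```

Note that the output directory is built by plain string concatenation, so the base path has to end with a slash.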



In [None]:
# Train on one split of the training data, then predict on the dev set.
#   temp_train:  file name of the training split
#   temp_output: output directory for checkpoints and predictions
# The learning rate is kept fixed across all models, and all of them use the cased model.

def run_model(temp_train, temp_output):
  !python run_squad.py \
    --vocab_file=$BUCKET_NAME/cased_L-24_H-1024_A-16/vocab.txt \
    --bert_config_file=$BUCKET_NAME/cased_L-24_H-1024_A-16/bert_config.json \
    --init_checkpoint=$BUCKET_NAME/cased_L-24_H-1024_A-16/bert_model.ckpt \
    --do_train=True \
    --train_file=$temp_train \
    --do_predict=True \
    --predict_file=dev-v2.0.json \
    --train_batch_size=24 \
    --learning_rate=3e-5 \
    --num_train_epochs=3.0 \
    --use_tpu=True \
    --tpu_name=$TPU_ADDRESS \
    --max_seq_length=384 \
    --doc_stride=128 \
    --version_2_with_negative=True \
    --output_dir=$temp_output \
    --do_lower_case=False



In [None]:
# Training Loop
# Iterate through the 8 training files, making a new folder for each in GCP.
# Time the whole training loop. 

start_time = time.time()

for i in range(8):
  folder_name = "8_way_"+str(i)
  %mkdir /content/bert/folderOnColab/self_ensemble_8/$folder_name
  temp_op_dir = OUTPUT_DIR+folder_name
  file_name = folder_name+'.json'
  print(temp_op_dir)
  run_model(file_name, temp_op_dir)
  temp_preds_name = folder_name + '_preds.json'
  temp_n_name = folder_name + '_n_preds.json'
  %mv /content/bert/folderOnColab/self_ensemble_8/$folder_name/predictions.json /content/bert/folderOnColab/self_ensemble_8/$folder_name/$temp_preds_name
  %mv /content/bert/folderOnColab/self_ensemble_8/$folder_name/nbest_predictions.json /content/bert/folderOnColab/self_ensemble_8/$folder_name/$temp_n_name

end_time = time.time()

Streaming output truncated to the last 5000 lines.
I0710 02:50:14.714301 139873697208192 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0710 02:50:14.730863 139873697208192 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0710 02:50:14.731185 139873697208192 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0710 02:50:14.748049 139873697208192 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0710 02:50:14.748325 139873697208192 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0710 02:50:14.765269 139873697208192 tpu_estimator.py:600] Enqueue next (1) bat

In [None]:
# Measure total training loop time 

total_time = end_time-start_time
print('Minutes to train Large Cased Model (8 splits, 3 epochs each):')
print(total_time/60)

Minutes to train Large Cased Model (8 splits, 3 epochs each):
125.99501438538233


### Step 6: Evaluate the results 

In [None]:
# Clone a SQuAD 2.0 repo so that I can get the evaluation script.

!git clone https://github.com/white127/SQUAD-2.0-bidaf.git

Cloning into 'SQUAD-2.0-bidaf'...
remote: Enumerating objects: 125, done.
remote: Total 125 (delta 0), reused 0 (delta 0), pack-reused 125
Receiving objects: 100% (125/125), 709.51 KiB | 5.77 MiB/s, done.
Resolving deltas: 100% (33/33), done.


In [None]:
# Move evaluate-v2.0 into the bert folder

%mv /content/bert/SQUAD-2.0-bidaf/evaluate-v2.0.py /content/bert/

In [None]:
# These are a number of helper functions that will be used below to combine the predictions.

#generate a list of file paths.
def generate_file_list(splits):
  list_of_files = []
  for i in range(splits):
    path = 'folderOnColab/self_ensemble_8/8_way_'+str(i)+'/8_way_'+str(i)+'_n_preds.json'
    list_of_files.append(path)
  return list_of_files

#extract the top-ranked answer text for each question in an n-best prediction file.
def extract_text(data):
    predictions = []
    for group in data:
        text = data[group][0]['text']
        predictions.append(text)
    return predictions

#get the names of the keys so I can search through each pred file. 
def extract_keys(data):
    predictions = []
    for group in data:
        predictions.append(group)
    return predictions

#helper function to open json
def open_json(path):
    with open(path) as json_file:
        temp_json = json.load(json_file)
        return temp_json

#get a master list of all of the predictions 
def get_master(list_of_files):
    master_list = []  #create a master list I will need to hold each of the text lists
    for file in list_of_files:  #iterate through and open each file. 
        temp_json = open_json(file)
        if len(master_list) == 0: #if this is the first one, i also need a key list
            key_list = extract_keys(temp_json)
        text_list = extract_text(temp_json) #now extract the text from the open file 
        master_list.append(text_list) #add the text list to the master list
    return key_list, master_list

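#majority vote: for each question, pick the most common answer across the models.
#ties go to the answer seen first, i.e. the one from the earliest model in the file list.
def find_modes(key_list, master_list):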
    pred_dict = {}
    for i,key in enumerate(key_list):
        temp_list = []
        for j in range(len(master_list)):
            # master_list[j] -> takes me to one specific model's prediction
            # master_list[j][i] -> that instance of prediction for each model.
            temp_list.append(master_list[j][i])
        freqs = groupby(Counter(temp_list).most_common(), lambda x:x[1])
        modes = [val for val,count in next(freqs)[1]]
        best_guess = modes[0]    
        pred_dict[key] = best_guess
    return pred_dict 

#dump the prediction dict into a json file.
def output_predictions(predictions):
    with open('preds.json', 'w', encoding = 'utf-8') as json_file:
        json.dump(predictions, json_file, ensure_ascii=True)
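
The `groupby` idiom in `find_modes` collects every answer tied for the highest count and then keeps the first one. Assuming Python 3.7+ (where `Counter` preserves insertion order), that should be equivalent to taking `most_common(1)`, with ties resolving to the answer seen first. The sketch below checks this on toy vote lists; the helper names are made up for the illustration.

```python
from collections import Counter
from itertools import groupby

def mode_via_groupby(votes):
    """Same idiom as find_modes: first answer among those tied for the top count."""
    freqs = groupby(Counter(votes).most_common(), lambda x: x[1])
    modes = [val for val, count in next(freqs)[1]]
    return modes[0]

def mode_via_most_common(votes):
    """Shorter alternative: the single most common answer."""
    return Counter(votes).most_common(1)[0][0]

examples = [
    ["a", "b", "a", "c"],        # clear winner
    ["b", "a", "a", "b"],        # tie: "b" was seen first
    ["", "1066", "", "1066"],    # tie between "no answer" and a span
]
for votes in examples:
    print(votes, "->", repr(mode_via_groupby(votes)), repr(mode_via_most_common(votes)))
```

One consequence of this rule is that, on a tie, the result depends on the order of the prediction files rather than on model confidence.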

In [None]:
#Create a file list with generate_file_list
#pass that file list into get_master, which calls extract_keys and extract_text, outputs keys and a master preds list.
#pass keys and preds to find_modes, outputs pred_dict
#pass pred dict to output_predictions -> creates a preds.json file. 

list_of_files = generate_file_list(8)
keys, master_preds = get_master(list_of_files)
pred_dict = find_modes(keys, master_preds)
output_predictions(pred_dict)

In [None]:
# This is to move some predictions manually into this file.
# Only used for testing.

# %cp folderOnColab/self_ensemble_8/predictions.json /content/bert/
# %rm preds.json
# %mv predictions.json preds.json

In [None]:
# Evaluate the Results. 

print("Results for Large, Cased (8 splits, 3 epochs each)")
!python evaluate-v2.0.py dev-v2.0.json preds.json


Results for Large, Cased (8 splits, 3 epochs each)
{
  "exact": 71.07723406047334,
  "f1": 74.15988841080048,
  "total": 11873,
  "HasAns_exact": 74.00472334682861,
  "HasAns_f1": 80.17887231805548,
  "HasAns_total": 5928,
  "NoAns_exact": 68.15811606391927,
  "NoAns_f1": 68.15811606391927,
  "NoAns_total": 5945
}
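
As a sanity check on the report above, the overall scores should equal the count-weighted averages of the `HasAns` and `NoAns` scores. The snippet below redoes that arithmetic with the numbers printed above; the variable and function names are chosen for the illustration.

```python
results = {
    "exact": 71.07723406047334, "f1": 74.15988841080048, "total": 11873,
    "HasAns_exact": 74.00472334682861, "HasAns_f1": 80.17887231805548, "HasAns_total": 5928,
    "NoAns_exact": 68.15811606391927, "NoAns_f1": 68.15811606391927, "NoAns_total": 5945,
}

def weighted(metric):
    """Count-weighted average of the HasAns and NoAns values for one metric."""
    has = results["HasAns_" + metric] * results["HasAns_total"]
    no = results["NoAns_" + metric] * results["NoAns_total"]
    return (has + no) / results["total"]

# The two subsets should partition the dev set.
assert results["HasAns_total"] + results["NoAns_total"] == results["total"]

print("exact:", weighted("exact"), "reported:", results["exact"])
print("f1:   ", weighted("f1"), "reported:", results["f1"])
```

`NoAns_exact` and `NoAns_f1` are identical because an unanswerable question is scored as either fully right (empty prediction) or fully wrong.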
