# 2nd Order Data Generation: 1 Way

## Overview

*   This notebook generates the data used for second-order training.
*   I scrape the top 10 predictions from each model for each question and score each answer by its number of occurrences.
*   The scored answers form the multiple-choice options used to train a BertForMultipleChoice model on the same questions in the next notebook.





### Step 1: Clone the Repo



In [None]:
#This will clone the BERT Repo

!git clone https://github.com/google-research/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.20 KiB | 4.12 MiB/s, done.
Resolving deltas: 100% (185/185), done.


In [None]:
#The code in the BERT Repo is written in tf 1, and the tf conversion process fails on these files.
#For this reason, it was easiest to revert to tf v1 for the purposes of this notebook

%tensorflow_version 1.x
import tensorflow
print(tensorflow.__version__)

TensorFlow 1.x selected.
1.15.2


In [None]:
# Make sure we're in the right place

%ls

[0m[01;34mbert[0m/  [01;34msample_data[0m/


In [None]:
# Move to BERT folder 

%cd bert

/content/bert


### Step 2: Imports and Connect to TPU

In [None]:
# Still need imports 

import datetime
import json
import os
import time
import pprint
import random
import string
import sys
import tensorflow as tf
import re
from collections import Counter
from itertools import groupby
import pandas as pd
import numpy as np

# Get TPU Address for training

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

#Authorize Google and connect.

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())
  
  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
    
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)



TPU address is =>  grpc://10.65.157.10:8470
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 4641764646491240834),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 17166744095793098699),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 4475611599629586264),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 224226454049083860),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 15924873725363919402),
 _DeviceAttributes(/job:tpu_worker/replica:0/task

In [None]:
#Create variables for Buckets and Outputs for later use. 

BUCKET = 'thaddeussegura_final_project' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'self_ensemble_1/' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.io.gfile.makedirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

***** Model output directory: gs://thaddeussegura_final_project/self_ensemble_1/ *****


### Step 3: Connect Drive and GCP


In [None]:
#Mount my drive so that I can access the split training sets. 

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Download the SQUAD train and dev dataset

# I will need the full training set.  
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

# Still download the Dev set.
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2020-07-21 22:39:28--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.108.153, 185.199.109.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2020-07-21 22:39:29 (56.9 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]

--2020-07-21 22:39:29--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2020-07-21 22:39:30 (15.9 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [None]:
# Necessary installs so I can mount the files from my bucket onto colab

!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100   653  100   653    0     0  26120      0 --:--:-- --:--:-- --:--:-- 26120
OK
67 packages can be upgraded. Run 'apt list --upgradable' to see them.
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  gcsfuse
0 upgraded, 1 newly installed, 0 to remove and 67 not upgraded.
Need to get 4,278 kB of archives.
After this operation, 12.8 MB of additional disk space will be used.
Selecting previously unselected package gcsfuse.
(Reading database ... 144465 files and directories currently installed.)
Preparing to unpack .../gcsfuse_0.30.0_amd64.deb ...
Unpacking gcsfuse (0.30.0) ...
Setting up gcsfuse (0.30.0) ...


In [None]:
# Make a folder for the bucket, this will have all of the files inside. 

!mkdir folderOnColab
!gcsfuse thaddeussegura_final_project folderOnColab 

Using mount point: /content/bert/folderOnColab
Opening GCS connection...
Opening bucket...
Mounting file system...
File system has been successfully mounted.


### Step 4: Generate Predictions on the DEV Set


*   Run each model checkpoint on the dev set to generate predictions.
*   Do not look at the answers.
*   The output is used to build the dataset for the BertForMultipleChoice model.
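The prediction step can also be expressed as an explicit argument list, which makes it easier to inspect before running. This is a minimal sketch, not the notebook's own implementation: `build_predict_command` is a hypothetical helper, the bucket, checkpoint, and TPU address in the usage lines are placeholders, and the flags are only those already passed to `run_squad.py` in this notebook.

```python
import shlex

def build_predict_command(checkpoint, output_dir, bucket_name, tpu_address,
                          predict_file="dev-v2.0.json", n_best_size=10):
    """Return the run_squad.py prediction command as a list of arguments."""
    model_dir = bucket_name + "/cased_L-24_H-1024_A-16"
    return [
        "python", "run_squad.py",
        "--vocab_file=" + model_dir + "/vocab.txt",
        "--bert_config_file=" + model_dir + "/bert_config.json",
        "--init_checkpoint=" + checkpoint,
        "--do_train=False",
        "--do_predict=True",
        "--predict_file=" + predict_file,
        "--predict_batch_size=8",
        "--n_best_size=" + str(n_best_size),
        "--max_seq_length=384",
        "--max_query_length=30",
        "--doc_stride=128",
        "--use_tpu=True",
        "--tpu_name=" + tpu_address,
        "--output_dir=" + output_dir,
    ]

# Placeholder values for illustration only.
cmd = build_predict_command(
    checkpoint="gs://example-bucket/run/model.ckpt-5000",
    output_dir="gs://example-bucket/run/checkpoint0/",
    bucket_name="gs://example-bucket",
    tpu_address="grpc://10.0.0.2:8470",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the job and raise if it exits with an error, which the `!python ...` form does not do.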


In [None]:
# Prediction function

def dev_pred(checkpoint, temp_output):
  !python run_squad.py \
    --vocab_file=$BUCKET_NAME/cased_L-24_H-1024_A-16/vocab.txt \
    --bert_config_file=$BUCKET_NAME/cased_L-24_H-1024_A-16/bert_config.json \
    --init_checkpoint=$checkpoint \
    --do_train=False \
    --max_query_length=30  \
    --do_predict=True \
    --predict_file=dev-v2.0.json \
    --predict_batch_size=8 \
    --n_best_size=10 \
    --use_tpu=True \
    --tpu_name=$TPU_ADDRESS \
    --max_seq_length=384 \
    --doc_stride=128 \
    --output_dir=$temp_output

#full_train_pred('gs://thaddeussegura_final_project/self_ensemble_4/4_way_3/model.ckpt-8372', 'gs://thaddeussegura_final_project/self_ensemble_4/4_way_3/')


In [None]:
# run the loop.
# Tends to crash the instance....

file_list = []
for i in range(4):
  file_list.append('gs://thaddeussegura_final_project/self_ensemble_1/model.ckpt-'+str(5000*(i+1)))

def full_pred_loop():
  for i in range(4):
    folder_name = 'checkpoint'+str(i)
    checkpoint = file_list[i]
    temp_output = OUTPUT_DIR+folder_name+'/'
    dev_pred(checkpoint, temp_output)

full_pred_loop()

Streaming output truncated to the last 5000 lines.
I0718 15:09:41.009507 140592795490176 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0718 15:09:41.026633 140592795490176 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0718 15:09:41.026780 140592795490176 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0718 15:09:41.043542 140592795490176 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0718 15:09:41.043690 140592795490176 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0718 15:09:41.060693 140592795490176 tpu_estimator.py:600] Enqueue next (1) bat
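
Because the loop tends to crash the instance, it can help to skip checkpoints whose predictions already exist so that a restart resumes where it left off. The sketch below is one possible approach under stated assumptions: `pending_checkpoints` is a hypothetical helper, the paths are placeholders, and the presence of `nbest_predictions.json` is assumed to mean a checkpoint finished. For `gs://` paths in this notebook, `tf.io.gfile.exists` could be passed as the `exists` argument.

```python
import os

def pending_checkpoints(checkpoints, output_root, exists=os.path.exists):
    """Yield (checkpoint, output_dir) pairs that still need predictions."""
    for i, checkpoint in enumerate(checkpoints):
        output_dir = output_root + "checkpoint" + str(i) + "/"
        marker = output_dir + "nbest_predictions.json"
        if not exists(marker):
            yield checkpoint, output_dir

# Placeholder paths; the set below simulates outputs that are already written.
checkpoints = ["gs://example-bucket/run/model.ckpt-" + str(5000 * (i + 1))
               for i in range(4)]
done = {"gs://example-bucket/run/checkpoint0/nbest_predictions.json"}

todo = list(pending_checkpoints(checkpoints, "gs://example-bucket/run/",
                                exists=done.__contains__))
for checkpoint, output_dir in todo:
    print(checkpoint, "->", output_dir)
```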

### Step 5: Create Multiple Choices
 


*   Use the predictions above to pull out the top 5 predictions from each model for each question.
*   Use the most common predictions to generate the multiple-choice answers.
*   Pad to 5 choices if there are fewer than 5.
*   If the correct answer is missing from all 5 choices in the training set, add it back in.
*   Repeat for the dev set, but do not look at the answers.
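
The voting, padding, and add-back steps can be sketched for a single question as follows. This is an illustration under assumptions rather than the notebook's exact code: `vote_choices` is a hypothetical helper, and replacing the last slot with the correct answer is one possible choice of which option to overwrite.

```python
from collections import Counter

PADDING = "_padding_"

def vote_choices(per_model_predictions, num_choices=5, gold_answer=None):
    """Pick the most frequently predicted answers for one question."""
    counts = Counter()
    for predictions in per_model_predictions:
        counts.update(predictions)
    # Most frequent answers first.
    choices = [text for text, _ in counts.most_common(num_choices)]
    # Pad so every question has the same number of choices.
    choices += [PADDING] * (num_choices - len(choices))
    # Training only: make sure the correct answer is among the choices.
    if gold_answer is not None and gold_answer not in choices:
        choices[-1] = gold_answer
    return choices

# Toy predictions from three models for one question.
preds = [["France", "in France"], ["France", "France."], ["France", "in France"]]
dev_choices = vote_choices(preds)
train_choices = vote_choices(preds, gold_answer="Normandy")
print(dev_choices)
print(train_choices)
```

For the dev set, `gold_answer` is left as `None`, so the answers are never consulted when the choices are built.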

In [None]:
# This will take split data and return the keys that are present.
def extract_split_keys(data):
    keys = []
    for i in range(len(data['data'])):
        paragraphs = len(data['data'][i]['paragraphs'])
        for j in range(paragraphs):
            qas = len(data['data'][i]['paragraphs'][j]['qas'])
            for k in range(qas):
                keys.append(data['data'][i]['paragraphs'][j]['qas'][k]['id'])
    return keys

#this will be used to get the full answer list. 
def extract_answer_text(data):
    text = []
    for i in range(len(data['data'])):
        paragraphs = len(data['data'][i]['paragraphs'])
        for j in range(paragraphs):
            qas = len(data['data'][i]['paragraphs'][j]['qas'])
            for k in range(qas):
                if data['data'][i]['paragraphs'][j]['qas'][k]['is_impossible'] == True:
                    text.append("")
                else:
                    text.append(data['data'][i]['paragraphs'][j]['qas'][k]['answers'][0]['text'])
    return text

# Need a helper function to open each file
def open_json(path):
    with open(path) as json_file:
        temp_json = json.load(json_file)
        return temp_json

#pass in any set of full predictions to find the full keys. 
def extract_keys(data):
    predictions = []
    for group in data:
        predictions.append(group)
    return predictions

# #extract the top 5 choices for each question from nbest_predictions.
# def get_top_5(file):
#     nbest = open_json(file)
#     pred_list = []
#     for question in nbest:
#         temp_list = []
#         for i in range(len(nbest[question])):
#             if len(temp_list) < 5:
#                 temp_list.append(nbest[question][i]['text'])
#             else:
#                 break
#             temp_list.append("")
#         pred_list.append(temp_list)
#     return pred_list 

# THIS VERSION IS JUST FOR THE AB OPTION: it keeps only the top prediction from each model.
def get_top_5(file):
    nbest = open_json(file)
    pred_list = []
    for question in nbest:
        temp_list = []
        for i in range(len(nbest[question])):
            if len(temp_list) < 1:
                temp_list.append(nbest[question][i]['text'])
            else:
                break
        pred_list.append(temp_list)
    return pred_list     

# go through each file in a file list and extract the top 5 predictions. 
# use those predictions to vote on possible answers.

# def create_multichoice(file_list):
#     #Make a list of lists from each model.
#     full_train = open_json('train-v2.0.json')
#     full_keys = extract_split_keys(full_train)
#     full_answers = extract_answer_text(full_train)
#     full_preds = []
#     for model in file_list:
#         pred_list = get_top_5(model)
#         full_preds.append(pred_list)

#     #Take the List of Lists and go through each question.
#     master_list = []
#     for i in range(len(full_preds[0])):
#         temp_list = []
#         for pred_list in full_preds:
#             for pred in pred_list[i]:
#                 temp_list.append(pred)
#         words_to_count = (word for word in temp_list)
#         #Find the most common words in the list.
#         c = Counter(words_to_count)
#         most_common = [c.most_common(5)[i][0] for i in range(len(c.most_common(5)))]
#         while len(most_common) < 5:
#             most_common.append("_padding_")
#         #append the answers for that specific question to the master list.     
#         master_list.append(most_common)
        
#     return master_list

#THIS IS FOR THE AB OPTION 
def create_multichoice(file_list):
    #Make a list of lists from each model.
    full_train = open_json('train-v2.0.json')
    full_keys = extract_split_keys(full_train)
    full_answers = extract_answer_text(full_train)
    full_preds = []
    for model in file_list:
        pred_list = get_top_5(model)
        full_preds.append(pred_list)

    #Take the List of Lists and go through each question.
    master_list = []
    for i in range(len(full_preds[0])):
        temp_list = []
        for pred_list in full_preds:
            for pred in pred_list[i]:
                temp_list.append(pred)
        # Count how often each candidate answer was predicted across the models.
        c = Counter(temp_list)
        # Keep the two most common answers (choices A and B).
        most_common = [word for word, _ in c.most_common(2)]
        while len(most_common) < 2:
            most_common.append("_padding_")
        #append the answers for that specific question to the master list.     
        master_list.append(most_common)
        
    return master_list

# This will take the master answers list and create the labels
# By identifying which of the predictions was correct. 
def correct_index(master_answers, file, mode):
    data = open_json(file)
    answers = extract_answer_text(data)
    num_wrong = 0
    for i in range(len(master_answers)):
        found = False
        for j in range(len(master_answers[i])):
            if master_answers[i][j] == answers[i]:
                # Record the position of the correct choice as the label.
                master_answers[i].append(j)
                found = True
                break
        if mode == 'train' and not found:
            # No choice is correct: replace choice B (index 1) with the
            # correct answer, then label that position so the row is usable.
            del master_answers[i][1]
            master_answers[i].append(answers[i])
            master_answers[i].append(len(master_answers[i]) - 1)
            num_wrong += 1
    return master_answers, num_wrong

#this will generate the question data
#This is done by combining the question and context.
def extract_multichoice_question(file):
    data = open_json(file)
    text = []
    for i in range(len(data['data'])):
        paragraphs = len(data['data'][i]['paragraphs'])
        for j in range(paragraphs):
            context = data['data'][i]['paragraphs'][j]['context']
            qas = len(data['data'][i]['paragraphs'][j]['qas'])
            for k in range(qas):
                question_id = data['data'][i]['paragraphs'][j]['qas'][k]['id']
                question = data['data'][i]['paragraphs'][j]['qas'][k]['question']
                text.append([question_id, context, question])
    return text
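
The helpers above line up predictions and questions by position, which works as long as every file lists the questions in the same order. A more defensive variant joins on the question id instead. The sketch below assumes the `nbest_predictions.json` layout read by `get_top_5` (a dict from question id to a ranked list of entries with a `text` field); `iter_questions` and `top_texts_by_id` are hypothetical names, and the data is a toy example.

```python
def iter_questions(squad):
    """Yield (question_id, context, question) for every question in a SQuAD dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield qa["id"], paragraph["context"], qa["question"]

def top_texts_by_id(nbest, k):
    """Map each question id to the text of its k highest-ranked predictions."""
    return {qid: [entry["text"] for entry in entries[:k]]
            for qid, entries in nbest.items()}

# Toy data in the same shape as the SQuAD file and an nbest file.
squad = {"data": [{"paragraphs": [{
    "context": "The Normans settled in Normandy.",
    "qas": [{"id": "q1", "question": "Where did the Normans settle?"},
            {"id": "q2", "question": "Who settled in Normandy?"}],
}]}]}
nbest = {
    "q2": [{"text": "The Normans"}, {"text": "Normans"}],
    "q1": [{"text": "Normandy"}, {"text": "in Normandy"}, {"text": ""}],
}

top = top_texts_by_id(nbest, k=2)
rows = [(qid, question, top.get(qid, []))
        for qid, _context, question in iter_questions(squad)]
for row in rows:
    print(row)
```

Looking predictions up by id keeps each question attached to its own candidates even when the prediction file is ordered differently from the dataset.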

In [None]:
#Full loop putting everything together to build the predictions into a csv.

def build_csv(nbest_files, train_file, mode, csv_name):
  master_answers = create_multichoice(nbest_files)
  answers_labeled, num_wrong = correct_index(master_answers, train_file, mode)
  answers_df = pd.DataFrame(answers_labeled)
  questions = extract_multichoice_question(train_file)
  question_df = pd.DataFrame(questions)
  full_df = pd.concat([question_df, answers_df], axis=1, sort=False)
  full_df.columns = ['id', 'context', 'question', 'a','b', 'correct_index']
  full_df.to_csv(csv_name)
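
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the same concat-and-rename pattern on toy data with `index=False`; it is an illustration with made-up rows, not a change to `build_csv` itself.

```python
import io
import pandas as pd

# One toy question row and one toy answer row (choices a, b, and the label).
questions = pd.DataFrame([["q1", "Some context.", "A question?"]])
answers = pd.DataFrame([["France", "in France", 0]])

full_df = pd.concat([questions, answers], axis=1)
full_df.columns = ["id", "context", "question", "a", "b", "correct_index"]

# Write without the index, then read the CSV back from memory.
buffer = io.StringIO()
full_df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```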


In [None]:
import json
# generate a file list with the proper paths. 
file_list = [('folderOnColab/self_ensemble_1/checkpoint'+str(i)+'/nbest_predictions.json') for i in range(4)]

# Pass in the file list, the SQuAD file being used, the mode ('train' or 'dev'), and the output CSV name.

build_csv(file_list, 'dev-v2.0.json', 'dev', 'dev_AB.csv')

In [None]:
#copy it from colab to Drive

%cp -R /content/bert/dev_AB.csv /content/drive/My\ Drive/

In [None]:
# Check it out to make sure it's good.

df = pd.read_csv('dev_SE1.csv')
# Empty-string answers are read back as NaN, so restore them.
df['a'] = df['a'].fillna("")
df


Unnamed: 0.1,Unnamed: 0,id,context,question,a,b,c,d,e,correct_index
0,0,56ddde6b9a695914005b9628,The Normans (Norman: Nourmands; French: Norman...,In what country is Normandy located?,,France,France.,in France,West Francia,1.0
1,1,56ddde6b9a695914005b9629,The Normans (Norman: Nourmands; French: Norman...,When were the Normans in Normandy?,,10th and 11th centuries,the 10th and 11th centuries,in the 10th and 11th centuries,10th and 11th,1.0
2,2,56ddde6b9a695914005b962a,The Normans (Norman: Nourmands; French: Norman...,From which countries did the Norse originate?,,"Denmark, Iceland and Norway",Iceland and Norway,"from Denmark, Iceland and Norway",Denmark,1.0
3,3,56ddde6b9a695914005b962b,The Normans (Norman: Nourmands; French: Norman...,Who was the Norse leader?,,Rollo,o,leader Rollo,"Rollo,",1.0
4,4,56ddde6b9a695914005b962c,The Normans (Norman: Nourmands; French: Norman...,What century did the Normans first gain their ...,,10th,10th century,first half of the 10th,the 10th,2.0
...,...,...,...,...,...,...,...,...,...,...
11868,11868,5737aafd1c456719005744ff,"The pound-force has a metric counterpart, less...",What is the seldom used force unit equal to on...,,the metric slug,metric slug,the metric slug (sometimes mug or hyl),"kilogram-force leads to an alternate, but rare...",
11869,11869,5ad28ad0d7d075001a4299cc,"The pound-force has a metric counterpart, less...",What does not have a metric counterpart?,,newton,the newton,"pound-force has a metric counterpart, less com...",the newton: the kilogram-force (kgf),0.0
11870,11870,5ad28ad0d7d075001a4299cd,"The pound-force has a metric counterpart, less...",What is the force exerted by standard gravity ...,,kilogram-force,the kilogram-force,kilogram-force (kgf),the kilogram-force (kgf),0.0
11871,11871,5ad28ad0d7d075001a4299ce,"The pound-force has a metric counterpart, less...",What force leads to a commonly used unit of mass?,,kilogram-force,The kilogram-force,kilogram,pound-force,0.0
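
In dev mode, a row where none of the choices matches the answer gets no `correct_index` (row 11868 above is one such case). A quick summary of label coverage helps confirm how many rows are affected. This is a minimal sketch on toy data; `label_summary` is a hypothetical helper and the column name is taken from the table above.

```python
import pandas as pd

def label_summary(df, label_column="correct_index"):
    """Count rows, unlabeled rows, and how often each label occurs."""
    labels = df[label_column]
    return {
        "rows": len(df),
        "unlabeled": int(labels.isna().sum()),
        "label_counts": labels.dropna().astype(int)
                              .value_counts().sort_index().to_dict(),
    }

# Toy frame with two labeled rows and one row missing its label.
toy = pd.DataFrame({
    "a": ["", "", ""],
    "b": ["France", "Rollo", "the metric slug"],
    "correct_index": [1.0, 1.0, None],
})
summary = label_summary(toy)
print(summary)
```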


In [None]:
#copy it from colab to Drive

%cp -R /content/bert/dev_SE1.csv /content/drive/My\ Drive/