---

<center>Liushiya Chen, November 2019</center>

**Summary of BERT algorithm:**

BERT is a neural network-based, usually pre-trained, word embedding algorithm. It takes words as input and embeds each of them in $\mathbb{R}^d$. BERT distinguishes itself from other embedding algorithms by being bidirectionally contextual: when it embeds a word, it takes into account the words that come both before and after it. This method of embedding gives better results on prediction tasks such as automated question answering and paraphrase detection, both of which are run below.
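
To make "bidirectionally contextual" concrete, here is a toy sketch rather than BERT's actual architecture: a single self-attention step with identity projections, written in plain Python. The function `attend` and the random "sentences" are illustrative assumptions. Changing only the last token alters the first token's embedding when attention can look in both directions, but not when it is restricted to the left context.

```python
import math
import random

def attend(tokens, causal=False):
    """Toy single-head self-attention with identity projections."""
    d = len(tokens[0])
    out = []
    for i, query in enumerate(tokens):
        # A left-to-right (causal) model only sees positions <= i.
        visible = tokens[: i + 1] if causal else tokens
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the visible token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, visible)) / total
                    for j in range(d)])
    return out

random.seed(0)
sentence_a = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
sentence_b = [row[:] for row in sentence_a]
sentence_b[3] = [random.gauss(0, 1) for _ in range(8)]  # change only the last token

first_bidir_a = attend(sentence_a)[0]
first_bidir_b = attend(sentence_b)[0]
first_causal_a = attend(sentence_a, causal=True)[0]
first_causal_b = attend(sentence_b, causal=True)[0]

# True if the later token influenced the first token's embedding.
print(first_bidir_a != first_bidir_b)
# True if the left-to-right embedding of the first token is unaffected.
print(first_causal_a == first_causal_b)
```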

## **Environment Setup**

In [None]:
#!git clone https://<Username>:<Password>@github.com/TheShiya/bert.git
# Clone BERT repo
!git clone https://github.com/google-research/bert.git

fatal: destination path 'bert' already exists and is not an empty directory.


In [None]:
cd bert

/content/bert


In [None]:
# Check GPU status
!nvidia-smi

### **Predicting SQuAD data set with untuned BERT base model**

In [None]:
mkdir /content/bert/squad

In [None]:
cd /content/bert/squad

In [None]:
# Download SQuAD data and evaluation script
!wget "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json"
!wget "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
!wget "https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py"

In [None]:
cd /content/bert

/content/bert


In [None]:
# Download and unzip BERT-Base Uncased
!wget "https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip"
!unzip uncased_L-12_H-768_A-12.zip
!mv uncased_L-12_H-768_A-12 model

In [None]:
# Predict using untuned model
!python run_squad.py \
  --vocab_file=model/vocab.txt \
  --bert_config_file=model/bert_config.json \
  --init_checkpoint=model/bert_model.ckpt \
  --do_train=False \
  --train_file=squad/train-v1.1.json \
  --do_predict=True \
  --predict_file=squad/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=output
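
The `--max_seq_length` and `--doc_stride` flags control how passages longer than the model's input limit are handled: the passage is split into overlapping windows and each window is scored separately. The sketch below illustrates the idea. The function name `sliding_windows`, the 10-token question, and the three positions reserved for special tokens are assumptions made for illustration, so treat it as an approximation rather than a copy of `run_squad.py`.

```python
def sliding_windows(num_doc_tokens, max_tokens_per_window, doc_stride):
    """Return (start, length) pairs of overlapping windows covering a passage."""
    windows = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_per_window)
        windows.append((start, length))
        if start + length == num_doc_tokens:
            break  # this window reaches the end of the passage
        start += min(length, doc_stride)
    return windows

max_seq_length = 384    # --max_seq_length from the command above
doc_stride = 128        # --doc_stride from the command above
num_query_tokens = 10   # hypothetical question length
# Assumption: the question and three special tokens share the input with the passage.
room_for_passage = max_seq_length - num_query_tokens - 3

print(sliding_windows(500, room_for_passage, doc_stride))
```

For a hypothetical 500-token passage this yields three windows, starting at tokens 0, 128, and 256, so every token appears in at least one window with context on both sides where possible.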

In [None]:
# Load predictions, create and download .csv file
import json
import pandas as pd
from google.colab import files

# Use a separate name so the json module is not overwritten
with open('output/predictions.json') as f:
    predictions = json.load(f)
df = pd.Series(predictions).reset_index()
df.columns = ['question_id', 'prediction_ans']
df.to_csv('predicted_answers.csv', index=False)
files.download('predicted_answers.csv')

df.sample(n=5)

| | question_id | prediction_ans |
|---|---|---|
| 982 | 573361404776f4190066093f | the Royal Castle Curia |
| 338 | 56bf3a223aeaaa14008c9575 | year veteran who had already overcome three AC... |
| 8532 | 572925491d046914007790c6 | like malaria, HIV/AIDS, pneumonia, diarr |
| 4004 | 5725e28f38643c19005ace26 | vitt, Scott and Schwei |
| 9268 | 572ff293947a6a140053ce56 | ports of Rotterdam, Antwerp and Amsterdam. The... |


In [None]:
# Evaluate predictions
!python squad/evaluate-v1.1.py squad/dev-v1.1.json output/predictions.json

{"exact_match": 0.0946073793755913, "f1": 7.458751659520597}


### **Predicting MRPC Data with tuned BERT base model**

In [None]:
!wget "https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py"

In [None]:
!python download_glue_data.py

In [None]:
!wget "https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip"
!unzip uncased_L-12_H-768_A-12.zip
!mv uncased_L-12_H-768_A-12 model_tuned

In [None]:
# Train model on MRPC data
!python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=glue_data/MRPC \
  --vocab_file=model_tuned/vocab.txt \
  --bert_config_file=model_tuned/bert_config.json \
  --init_checkpoint=model_tuned/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=mrpc_output/

In [None]:
# Predict on the MRPC test set (test.tsv) with the fine-tuned checkpoint
!python run_classifier.py \
  --task_name=MRPC \
  --do_predict=true \
  --data_dir=glue_data/MRPC \
  --vocab_file=model_tuned/vocab.txt \
  --bert_config_file=model_tuned/bert_config.json \
  --init_checkpoint=mrpc_output/model.ckpt-343 \
  --max_seq_length=128 \
  --output_dir=mrpc_output/predictions
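
The checkpoint suffix `343` in the command above matches the number of optimizer steps taken during fine-tuning. Assuming the GLUE MRPC training split has 3,668 examples (a figure not stated in this notebook) and that the step count is examples divided by batch size, times epochs, truncated to an integer, the arithmetic can be checked as follows:

```python
num_train_examples = 3668  # assumption: size of the GLUE MRPC training split
train_batch_size = 32      # --train_batch_size from the training cell
num_train_epochs = 3.0     # --num_train_epochs from the training cell

# Steps per epoch times number of epochs, truncated to an integer.
num_train_steps = int(num_train_examples / train_batch_size * num_train_epochs)
print(num_train_steps)
```

If the batch size or the number of epochs is changed, the final checkpoint name changes with it, and `--init_checkpoint` has to be updated to match.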

In [None]:
# One row per test pair, one whitespace-separated column per class probability
results = pd.read_csv('mrpc_output/predictions/test_results.tsv', sep=r'\s+', header=None)
results.head()

| | 0 | 1 |
|---|---|---|
| 0 | 0.003206 | 0.996794 |
| 1 | 0.012832 | 0.987168 |
| 2 | 0.002985 | 0.997015 |
| 3 | 0.021265 | 0.978735 |
| 4 | 0.974216 | 0.025784 |
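
Each row of `test_results.tsv` appears to hold one probability per class, so the two columns can be read as P(label 0) and P(label 1), where label 1 means "paraphrase" in MRPC. A minimal sketch of turning these probabilities into hard labels, using the five rows shown above (the names `probabilities` and `predicted_labels` are illustrative):

```python
# The five rows printed above: [P(label 0), P(label 1)] per sentence pair.
probabilities = [
    [0.003206, 0.996794],
    [0.012832, 0.987168],
    [0.002985, 0.997015],
    [0.021265, 0.978735],
    [0.974216, 0.025784],
]

# Pick the index of the larger probability in each row (argmax).
predicted_labels = [row.index(max(row)) for row in probabilities]
print(predicted_labels)  # prints [1, 1, 1, 1, 0]
```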


In [None]:
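# Read the tab-separated test set; quoting=3 (csv.QUOTE_NONE) keeps quote
# characters inside sentences from being treated as field delimiters
test = pd.read_csv('glue_data/MRPC/test.tsv', sep='\t', quoting=3)
# Keep only the two sentence ID columns
test = test.iloc[:, 1:3]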
test['Predicted_quality'] = results[1]
test.to_csv('paraphrase_predictions.csv', index=False)
files.download('paraphrase_predictions.csv')

test.head()

| | #1 ID | #2 ID | Predicted_quality |
|---|---|---|---|
| 0 | 1089874 | 1089925 | 0.996794 |
| 1 | 3019446 | 3019327 | 0.987168 |
| 2 | 1945605 | 1945824 | 0.997015 |
| 3 | 1430402 | 1430329 | 0.978735 |
| 4 | 3354381 | 3354396 | 0.025784 |


In [None]:
ls glue_data/MRPC

dev_ids.tsv  msr_paraphrase_test.txt   test.tsv
dev.tsv      msr_paraphrase_train.txt  train.tsv


In [None]:
***** Eval results *****
eval_accuracy = 0.86764705
eval_loss = 0.4515621
global_step = 343
loss = 0.4515621
