# Mount and module/library setup

In this step, the shortcut you set up in your Google Drive is mounted to this Google Colab notebook so that the project files can be accessed. This relies on the steps in the user guide having been followed correctly and on the file names being unchanged.

A pop-up window will appear in which Google Colab requests access to your Google Drive. This is normal, and access must be granted to proceed.

In [None]:
# Import the Google Colab drive module, mount the drive, then enter the project directory for file access.
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/adamcao-1906735-project

!pip install -q tensorflow-ranking
!pip install -U tensorflow_text
!pip install -q tf-models-official==2.4.0
!pip install -U nltk

import pathlib

import tensorflow as tf
import tensorflow_ranking as tfr
import tensorflow_text as tf_text
from google.protobuf import text_format
import bz2
import json
import pandas as pd
import re
from official.modeling import tf_utils
from official import nlp
from official.nlp import bert
# Load the required submodules
import official.nlp.optimization
import official.nlp.bert.bert_models
import official.nlp.bert.configs
import official.nlp.bert.run_classifier
import official.nlp.bert.tokenization
import official.nlp.data.classifier_data_lib
import official.nlp.modeling.losses
import official.nlp.modeling.models
import official.nlp.modeling.networks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
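
Because the later cells depend on the file names being unchanged, it can help to confirm that the expected files are visible once the drive is mounted. The following is a minimal sketch, not part of the original project: `find_missing_paths` is a hypothetical helper, and the paths it checks are copied from the training command used later in this notebook. The BERT checkpoint is left out because `bert_model.ckpt` is a checkpoint prefix rather than a single file.

```python
from pathlib import Path

# Relative paths taken from the training command in this notebook.
# bert_model.ckpt is omitted: it is a checkpoint prefix, not a single file.
EXPECTED_PATHS = [
    "bertPython/tfrbert_example.py",
    "cased_L-12_H-768_A-12/bert_config.json",
    "FormattedData/TitleTagsData/256TrainELWC.tfrecord",
    "FormattedData/TitleTagsData/256ValELWC.tfrecord",
]


def find_missing_paths(root, relative_paths=EXPECTED_PATHS):
    """Return the relative paths that do not exist under root (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in relative_paths if not (root / rel).exists()]


# Assumes the notebook has already changed into the project directory via %cd.
missing = find_missing_paths(".")
if missing:
    print("Missing files (check the user guide steps):", missing)
else:
    print("All expected project files were found.")
```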


# Fine-Tuning BERT

This script fine-tunes BERT using the hyperparameter settings described here. It uses the Train and Validation ELWC .tfrecord files produced by the data pipeline.

The main hyperparameters to be adjusted are:

*   train_input_pattern - Path to the ELWC training data. This differs depending on which feature selection you want to train the model on.
*   eval_input_pattern - Path to the ELWC validation data. This differs depending on which feature selection you want to validate the model on.
*   bert_max_seq_length - The sequence length to train BERT on. It must match the sequence length of the input data.
*   model_dir - The directory the model will be saved to.
*   list_size - The maximum number of datasets that can be processed in each ranking problem. 31 is the highest value Colab Pro Plus allows before running out of memory; lowering it is recommended on standard Colab accounts.
*   loss - Either approx_ndcg_loss or softmax_loss; approx_ndcg_loss is recommended.
*   num_train_steps - 100-500 is recommended for quick testing. As a guide to duration, 5000 steps takes 2-4 hours.
*   checkpoint_secs - The number of seconds between saved checkpoints. Adjust this according to the number of training steps.
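
To illustrate how checkpoint_secs relates to num_train_steps, the sketch below estimates how many time-based checkpoints a run would write. It assumes training time scales linearly from the "5000 steps takes 2-4 hours" guide above, which is only a rough approximation (list_size and the Colab hardware both affect speed). `estimate_checkpoints` is a hypothetical helper, and it does not count any checkpoint written at the very start or end of training.

```python
def estimate_checkpoints(num_train_steps, checkpoint_secs, hours_per_5000_steps):
    """Rough count of time-based checkpoints, assuming linear scaling of run time."""
    # Scale the "5000 steps takes N hours" guide to the requested step count.
    total_secs = num_train_steps * hours_per_5000_steps * 3600 / 5000
    return int(total_secs // checkpoint_secs)


# Compare the fast (2 h) and slow (4 h) ends of the guide for a few settings.
for steps, secs in [(250, 300), (5000, 900), (25000, 900)]:
    low = estimate_checkpoints(steps, secs, hours_per_5000_steps=2)
    high = estimate_checkpoints(steps, secs, hours_per_5000_steps=4)
    print(f"steps={steps}, checkpoint_secs={secs}: about {low} to {high} checkpoints")
```

If the estimate comes out at zero or one for a short test run, lowering checkpoint_secs gives more intermediate checkpoints to inspect.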

Parameters used to train **NDCGTitleTags31256Final**

```
--train_input_pattern=FormattedData/TitleTagsData/256TrainELWC.tfrecord \
--eval_input_pattern=FormattedData/TitleTagsData/256ValELWC.tfrecord \
--bert_config_file=cased_L-12_H-768_A-12/bert_config.json \
--bert_init_ckpt=cased_L-12_H-768_A-12/bert_model.ckpt \
--bert_max_seq_length=256 \
--model_dir=models/NDCGTitleTags31256Final \
--list_size=31 \
--loss=approx_ndcg_loss \
--train_batch_size=1 \
--eval_batch_size=1 \
--learning_rate=1e-5 \
--num_train_steps=25000 \
--num_eval_steps=100 \
--checkpoint_secs=900 \
--num_checkpoints=1000
```

Parameters used to train **NDCGTitle31256Final**

```
--train_input_pattern=FormattedData/TitleData/256TrainELWC.tfrecord \
--eval_input_pattern=FormattedData/TitleData/256ValELWC.tfrecord \
--bert_config_file=cased_L-12_H-768_A-12/bert_config.json \
--bert_init_ckpt=cased_L-12_H-768_A-12/bert_model.ckpt \
--bert_max_seq_length=256 \
--model_dir=models/NDCGTitle31256Final \
--list_size=31 \
--loss=approx_ndcg_loss \
--train_batch_size=1 \
--eval_batch_size=1 \
--learning_rate=1e-5 \
--num_train_steps=5000 \
--num_eval_steps=100 \
--checkpoint_secs=900 \
--num_checkpoints=1000
```
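
The two parameter sets above differ only in the input patterns, model_dir, and num_train_steps. One way to keep such runs consistent is to hold the shared flags in a dictionary and override only what changes. This is a sketch under that assumption, not the project's own tooling: `BASE_FLAGS` and `build_command` are hypothetical names, while the flag names and values are copied from the parameter lists above.

```python
import shlex

# Flags shared by both final models, copied from the parameter lists above.
BASE_FLAGS = {
    "bert_config_file": "cased_L-12_H-768_A-12/bert_config.json",
    "bert_init_ckpt": "cased_L-12_H-768_A-12/bert_model.ckpt",
    "bert_max_seq_length": 256,
    "list_size": 31,
    "loss": "approx_ndcg_loss",
    "train_batch_size": 1,
    "eval_batch_size": 1,
    "learning_rate": "1e-5",
    "num_eval_steps": 100,
    "checkpoint_secs": 900,
    "num_checkpoints": 1000,
}


def build_command(overrides, script="bertPython/tfrbert_example.py"):
    """Merge overrides into the shared flags and return a shell command string."""
    flags = {**BASE_FLAGS, **overrides}
    parts = ["python", script] + [f"--{name}={value}" for name, value in flags.items()]
    return shlex.join(parts)


title_tags_cmd = build_command({
    "train_input_pattern": "FormattedData/TitleTagsData/256TrainELWC.tfrecord",
    "eval_input_pattern": "FormattedData/TitleTagsData/256ValELWC.tfrecord",
    "model_dir": "models/NDCGTitleTags31256Final",
    "num_train_steps": 25000,
})
print(title_tags_cmd)
```

In a Colab cell, the resulting string could then be run with `!{title_tags_cmd}`.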




In [None]:
!ls

!python bertPython/tfrbert_example.py \
   --train_input_pattern=FormattedData/TitleTagsData/256TrainELWC.tfrecord \
   --eval_input_pattern=FormattedData/TitleTagsData/256ValELWC.tfrecord \
   --bert_config_file=cased_L-12_H-768_A-12/bert_config.json \
   --bert_init_ckpt=cased_L-12_H-768_A-12/bert_model.ckpt \
   --bert_max_seq_length=256 \
   --model_dir=models/SampleModel \
   --list_size=16 \
   --loss=approx_ndcg_loss \
   --train_batch_size=1 \
   --eval_batch_size=1 \
   --learning_rate=1e-5 \
   --num_train_steps=250 \
   --num_eval_steps=100 \
   --checkpoint_secs=300 \
   --num_checkpoints=1000

# Training Evaluation Using TensorBoard

Once model training is complete, the training process can be evaluated by loading the generated training logs into TensorBoard.

You can also view the training metrics of the top-performing models by uncommenting the corresponding line in the cell below (only one `%tensorboard` line should be uncommented at a time). To launch TensorBoard, run the following cell:

In [None]:
%reload_ext tensorboard
%tensorboard --logdir="/content/drive/MyDrive/adamcao-1906735-project/models/SampleModel/"
#%tensorboard --logdir="/content/drive/MyDrive/adamcao-1906735-project/models/NDCGTitle31256Final/"
#%tensorboard --logdir="/content/drive/MyDrive/adamcao-1906735-project/models/NDCGTitleTags31256Final/"
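
If TensorBoard opens but shows no data, a quick check is whether any event files exist under the chosen model directory. The sketch below assumes the logs follow TensorFlow's usual `events.out.tfevents.*` naming and that the notebook is still in the project directory; `find_event_files` is a hypothetical helper.

```python
from pathlib import Path


def find_event_files(logdir):
    """Return sorted TensorBoard event files found anywhere under logdir."""
    return sorted(Path(logdir).rglob("events.out.tfevents.*"))


# Relative to the project directory entered earlier with %cd (an assumption).
logdir = "models/SampleModel"
event_files = find_event_files(logdir)
if event_files:
    print(f"Found {len(event_files)} event file(s) under {logdir}:")
    for path in event_files:
        print("  ", path)
else:
    print(f"No event files under {logdir}; check model_dir and that training ran.")
```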