**Kaggle Challenge: Google-QUEST-Q-A-Labeling**

This challenge is primarily a regression task: each example consists of a few question and answer text features, plus 30 output variables whose values have to be estimated. This notebook contains the central BERT-based model used for the challenge.

In [None]:
from google.colab import drive


drive.mount('/content/gdrive', force_remount = True)
dataset_path = 'gdrive/My Drive/Projects/quest/'

In [None]:
!pip install tensorflow==2.1.0-rc2



In [None]:
!pip install sacremoses

In [None]:
!pip install transformers

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GroupKFold
import matplotlib.pyplot as plt
from tqdm import tqdm
# import tensorflow_hub as hub
import tensorflow as tf
# import bert_tokenization as tokenization
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Bidirectional, LSTM, GRU
import os
from scipy.stats import spearmanr
from math import floor, ceil
from transformers import *
tf.compat.v1.enable_eager_execution()
np.set_printoptions(suppress=True)
print(tf.__version__)

This cell specifies the **BERT tokenizer** to be used and reads the data. The uncased BERT-base model (12 layers) is used here.

In [None]:
PATH = dataset_path

# BERT_PATH = '../input/bert-base-from-tfhub/bert_en_uncased_L-12_H-768_A-12'
# tokenizer = tokenization.FullTokenizer(BERT_PATH+'/assets/vocab.txt', True)

BERT_PATH = dataset_path + 'bert-base-uncased-huggingface-transformer/'
tokenizer = BertTokenizer.from_pretrained(BERT_PATH+'bert-base-uncased-vocab.txt',do_lower_case = True)

#tokenizer.add_tokens(['[Q-TITLE]'])
#l = len(tokenizer)
MAX_SEQUENCE_LENGTH = 512

df_train = pd.read_csv(PATH+'train.csv')
df_test = pd.read_csv(PATH+'test.csv')
df_sub = pd.read_csv(PATH+'sample_submission.csv')
print('train shape =', df_train.shape)
print('test shape =', df_test.shape)

Retrieving the input features and output categories of the data

In [None]:
output_categories_qn = list(df_train.columns[11:32])
output_categories_ans = list(df_train.columns[32:])
input_categories = list(df_train.columns[[1,2,5]])
# Report how many target columns belong to the question and to the answer
print('\nquestion output categories:\n\t', len(output_categories_qn))
print('\nanswer output categories:\n\t', len(output_categories_ans))
output_categories = output_categories_qn+output_categories_ans

**Processing of Input Data**

Each input example consists of the question title, the question body, and the answer body. These are passed to the BERT tokenizer as two separate texts: one combines the question title and question body, the other combines the question title and answer body. The result is two sets of ids, masks and segments, one for the question and one for the answer.

BERT accepts input sequences of at most 512 tokens. To retain as much information as possible when a tokenized sequence is longer than that, only the first 200 and the last 312 entries of the ids and segments returned by the tokenizer are kept. Shorter sequences are padded up to 512, and the mask marks real tokens with 1 and padding with 0.
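The truncation and padding described above can be illustrated in isolation. This is a minimal sketch in plain Python, not the notebook's own function: `head_tail_truncate` and `pad_with_mask` are hypothetical helper names, the token ids are made up rather than produced by a tokenizer, and a padding id of 0 is assumed.

```python
def head_tail_truncate(ids, max_len=512, head=200):
    """Keep the first `head` ids and the last `max_len - head` ids."""
    assert 0 < head < max_len, "head must leave room for a tail"
    if len(ids) <= max_len:
        return list(ids)
    tail = max_len - head
    return list(ids[:head]) + list(ids[-tail:])


def pad_with_mask(ids, max_len=512, pad_id=0):
    """Right-pad ids to max_len; the mask is 1 for real tokens, 0 for padding."""
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return list(ids) + [pad_id] * padding, mask + [0] * padding


# Made-up token ids standing in for tokenizer output (assumption).
long_ids = list(range(1000))
truncated = head_tail_truncate(long_ids)      # 200 head ids + 312 tail ids

short_ids, short_mask = pad_with_mask([101, 2023, 102])
print(len(truncated), len(short_ids), sum(short_mask))
```

Keeping both ends of a long text is a heuristic: the opening of a post usually states the problem, and the closing part often carries the conclusion.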

In [None]:
def _convert_to_transformer_inputs(title, question, answer, tokenizer, max_sequence_length):
    """Converts tokenized input to ids, masks and segments for transformer (including bert)"""
    
    def return_id(str1, str2, truncation_strategy, length):

        inputs = tokenizer.encode_plus(str1, str2,
            add_special_tokens=True,
            )
        
        input_ids =  inputs["input_ids"]
        input_segments = inputs["token_type_ids"]
        if len(input_ids) > length:
          input_ids = input_ids[:200] + input_ids[-312:]
          input_segments = input_segments[:200] + input_segments[-312:]

        input_masks = [1] * len(input_ids)
        padding_length = length - len(input_ids)
        padding_id = tokenizer.pad_token_id
        input_ids = input_ids + ([padding_id] * padding_length)
        input_masks = input_masks + ([0] * padding_length)
        input_segments = input_segments + ([0] * padding_length)
        
        return [input_ids, input_masks, input_segments]
    
    input_ids_q, input_masks_q, input_segments_q = return_id(
        title+" "+question, None, 'longest_first', max_sequence_length)
    
    input_ids_a, input_masks_a, input_segments_a = return_id(
        title + " " + answer, None, 'longest_first', max_sequence_length)
    
    return [input_ids_q, input_masks_q, input_segments_q,
            input_ids_a, input_masks_a, input_segments_a]

def compute_input_arrays(df, columns, tokenizer, max_sequence_length):
    input_ids_q, input_masks_q, input_segments_q = [], [], []
    input_ids_a, input_masks_a, input_segments_a = [], [], []
    for _, instance in tqdm(df[columns].iterrows()):
        t, q, a = instance.question_title, instance.question_body, instance.answer

        ids_q, masks_q, segments_q, ids_a, masks_a, segments_a = \
        _convert_to_transformer_inputs(t, q, a, tokenizer, max_sequence_length)
        
        input_ids_q.append(ids_q)
        input_masks_q.append(masks_q)
        input_segments_q.append(segments_q)

        input_ids_a.append(ids_a)
        input_masks_a.append(masks_a)
        input_segments_a.append(segments_a)
        
    return [np.asarray(input_ids_q, dtype=np.int32), 
            np.asarray(input_masks_q, dtype=np.int32), 
            np.asarray(input_segments_q, dtype=np.int32),
            np.asarray(input_ids_a, dtype=np.int32), 
            np.asarray(input_masks_a, dtype=np.int32), 
            np.asarray(input_segments_a, dtype=np.int32)]

def compute_output_arrays(df, columns):
    return np.asarray(df[columns])

In [None]:
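# Use the tf.keras layers so they match the tf.keras model built below;
# mixing in layers from the standalone `keras` package can cause errors.
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import LSTM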

**BERT - LSTM model**

This model joins two branches, one for the question and one for the answer. Both branches are built the same way: each takes its own ids, mask and segments as input and passes them to the pretrained BERT model. The last four hidden layers of BERT are then concatenated and fed to a bi-LSTM with 512 units per direction, followed by a global average pooling layer, which is the final layer of the branch. The two branch outputs are concatenated and passed through a dropout layer. The output layer has 30 sigmoid units, one for each output variable.
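To make the tensor shapes in this description concrete, here is a rough NumPy walkthrough. It is only a sketch under stated assumptions: random arrays stand in for the BERT hidden states, a random projection with `tanh` stands in for the bi-LSTM (it reproduces the output width of 2 × 512, not the recurrence), dropout is omitted, and the sequence length is shortened from 512 to 16 to keep it fast. The function name `branch` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 16, 768   # seq_len shortened from 512 (assumption)


def branch(hidden_states, lstm_units=512):
    """Shape-only stand-in for one branch of the model."""
    # Concatenate the last four hidden layers along the feature axis.
    last_four = np.concatenate(
        [hidden_states[-1], hidden_states[-2], hidden_states[-3], hidden_states[-4]],
        axis=-1)                                          # (batch, seq, 4 * 768)
    # Stand-in for the bi-LSTM: a random projection to 2 * lstm_units features.
    proj = rng.standard_normal((last_four.shape[-1], 2 * lstm_units)) * 0.01
    seq_out = np.tanh(last_four @ proj)                   # (batch, seq, 1024)
    # Global average pooling over the sequence axis.
    return seq_out.mean(axis=1)                           # (batch, 1024)


# BERT-base returns 13 hidden states: the embedding output plus 12 layers.
hidden_q = [rng.standard_normal((batch, seq_len, hidden)) for _ in range(13)]
hidden_a = [rng.standard_normal((batch, seq_len, hidden)) for _ in range(13)]

merged = np.concatenate([branch(hidden_q), branch(hidden_a)], axis=-1)  # (batch, 2048)
w_out = rng.standard_normal((merged.shape[-1], 30)) * 0.01
preds = 1.0 / (1.0 + np.exp(-(merged @ w_out)))           # sigmoid, (batch, 30)
print(merged.shape, preds.shape)
```

The sigmoid keeps every prediction in the open interval (0, 1), which matches the range of the target variables and allows binary cross-entropy to be used as the loss.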

In [None]:
def compute_spearmanr_ignore_nan(trues, preds):
    rhos = []
    for tcol, pcol in zip(np.transpose(trues), np.transpose(preds)):
        rhos.append(spearmanr(tcol, pcol).correlation)
    return np.nanmean(rhos)

def create_model_qn():
    q_id_1 = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    q_mask_1 = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    q_atn_1 = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    config = BertConfig() # print(config) to see settings
    config.output_hidden_states = False # Set to True to obtain hidden states
    # caution: when using e.g. XLNet, XLNetConfig() will automatically use xlnet-large config
    
    # normally ".from_pretrained('bert-base-uncased')", but because of no internet, the 
    # pretrained model has been downloaded manually and uploaded to kaggle. 
    bert_model = TFBertModel.from_pretrained(
        BERT_PATH+'bert-base-uncased-tf_model.h5', config=config)
    #bert_model.resize_token_embeddings(30523)
    # if config.output_hidden_states = True, obtain hidden states via bert_model(...)[-1]
    #outputs = bert_model(q_id, attention_mask=q_mask, token_type_ids=q_atn)[2]
    
    
    #l_1, l_2, l_3, l_4 = outputs[-1], outputs[-2], outputs[-3], outputs[-4]

    q_embedding_1 = bert_model(q_id_1, attention_mask=q_mask_1, token_type_ids=q_atn_1)[0]
    
    #q_embedding = tf.keras.layers.concatenate([l_1, l_2, l_3, l_4])
    
    q_1 = tf.keras.layers.GlobalAveragePooling1D()(q_embedding_1)
        
    x_1 = tf.keras.layers.Dropout(0.2)(q_1)
    
    x_1 = tf.keras.layers.Dense(21, activation='sigmoid')(x_1)

    model = tf.keras.models.Model(inputs=[q_id_1, q_mask_1, q_atn_1], outputs=x_1)
    
    return model

def create_model_ans():
    q_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    q_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    q_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    a_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    
    a_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    
    a_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    config = BertConfig() # print(config) to see settings
    config.output_hidden_states = True # Set to True to obtain hidden states
    # caution: when using e.g. XLNet, XLNetConfig() will automatically use xlnet-large config
    
    # normally ".from_pretrained('bert-base-uncased')", but because of no internet, the 
    # pretrained model has been downloaded manually and uploaded to kaggle. 
    bert_model = TFBertModel.from_pretrained(
        BERT_PATH+'bert-base-uncased-tf_model.h5', config=config)
    #bert_model.resize_token_embeddings(30523)
    # if config.output_hidden_states = True, obtain hidden states via bert_model(...)[-1]
    outputs = bert_model(q_id, attention_mask=q_mask, token_type_ids=q_atn)[2]
    
    l_1, l_2, l_3, l_4 = outputs[-1], outputs[-2], outputs[-3], outputs[-4]
    
    q_embedding = tf.keras.layers.concatenate([l_1, l_2, l_3, l_4])
    q_embedding = Bidirectional(LSTM(512, return_sequences=True))(q_embedding)
    #q_embedding = Bidirectional(LSTM(128, return_sequences=True))(q_embedding)
    #q_embedding = bert_model(q_id, attention_mask=q_mask, token_type_ids=q_atn)[0]
    
    outputs_ans = bert_model(a_id, attention_mask=a_mask, token_type_ids=a_atn)[2]
    
    a_1, a_2, a_3, a_4 = outputs_ans[-1], outputs_ans[-2], outputs_ans[-3], outputs_ans[-4]
    
    a_embedding = tf.keras.layers.concatenate([a_1, a_2, a_3, a_4])
    a_embedding = Bidirectional(LSTM(512, return_sequences=True))(a_embedding)
    #a_embedding = Bidirectional(LSTM(128, return_sequences=True))(a_embedding)
    #q_embedding = bert_model(q_id, attention_mask=q_mask, token_type_ids=q_atn)[0]
    #a_embedding = bert_model(a_id, attention_mask=a_mask, token_type_ids=a_atn)[0]
    q = tf.keras.layers.GlobalAveragePooling1D()(q_embedding)
    a = tf.keras.layers.GlobalAveragePooling1D()(a_embedding)
    
    x = tf.keras.layers.Concatenate()([q, a])
    
    x = tf.keras.layers.Dropout(0.2)(x)
    
    x = tf.keras.layers.Dense(30, activation='sigmoid')(x)

    model = tf.keras.models.Model(inputs=[q_id, q_mask, q_atn,a_id, a_mask, a_atn], outputs=x)
    
    return model

In [None]:
#outputs_qn = compute_output_arrays(df_train, output_categories_qn)
#outputs_ans = compute_output_arrays(df_train, output_categories_ans)
outputs = compute_output_arrays(df_train, output_categories)
inputs = compute_input_arrays(df_train, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)
test_inputs = compute_input_arrays(df_test, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)

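Training is performed at this stage with 10-fold cross-validation. The folds are built with `GroupKFold`, grouped by question body, so the same question never appears in both the training and the validation split of a fold.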
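The effect of grouping can be seen on a toy example. The sketch below uses made-up data (six answers to three hypothetical questions) and three folds instead of ten; it only demonstrates that `GroupKFold` keeps all rows of a group on the same side of each split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up toy data: six answers belonging to three questions (assumption).
questions = np.array(["q1", "q1", "q2", "q2", "q3", "q3"])
features = np.arange(len(questions)).reshape(-1, 1)

splits = list(GroupKFold(n_splits=3).split(X=features, groups=questions))

for fold, (train_idx, valid_idx) in enumerate(splits):
    # Questions present on both sides of the split; expected to be empty.
    shared = set(questions[train_idx]) & set(questions[valid_idx])
    print(fold, sorted(set(questions[valid_idx])), "shared questions:", len(shared))
```

Without grouping, several answers to the same question could be split between training and validation, which would tend to inflate the validation score for the question-level targets.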

In [None]:
gkf = GroupKFold(n_splits=10).split(X=df_train.question_body, groups=df_train.question_body)

valid_preds = []
test_preds = []
K.clear_session()
for fold, (train_idx, valid_idx) in enumerate(gkf):
    
    # Trains all 10 folds; narrow this range to run only a subset when time is limited
    if fold in range(10):

        #train_inputs_qn = [inputs[i][train_idx] for i in range(3)]
        train_inputs = [inputs[i][train_idx] for i in range(len(inputs))]
        train_outputs = outputs[train_idx]
        #train_outputs_qn = outputs_qn[train_idx]
        #train_outputs_ans = outputs_ans[train_idx]
        
        #valid_inputs_qn = [inputs[i][valid_idx] for i in range(3)]
        valid_inputs = [inputs[i][valid_idx] for i in range(len(inputs))]
        valid_outputs = outputs[valid_idx]
        #valid_outputs_qn = outputs_qn[valid_idx]
        #valid_outputs_ans = outputs_ans[valid_idx]
        
       
        
        #model = create_model_qn()
        optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
        #model.compile(loss='binary_crossentropy', optimizer=optimizer)
        #model.fit(train_inputs_qn, train_outputs_qn, epochs=1, batch_size=6)
        #model.save_weights(dataset_path + 'bert-ques'+str(fold)+'.h5')'''
        K.clear_session()
        model1 = create_model_ans()
        model1.compile(loss='binary_crossentropy', optimizer=optimizer)
        model1.fit(train_inputs, train_outputs, epochs=3, batch_size=6)
        model1.save_weights(dataset_path + 'bert-ans'+str(fold)+'.h5')
        
        valid_preds.append(model1.predict(valid_inputs))
        test_preds.append(model1.predict(test_inputs))
        #valid_outputs = np.column_stack((valid_outputs_qn,valid_outputs_ans))
        rho_val = compute_spearmanr_ignore_nan(valid_outputs, valid_preds[-1])
        print('validation score = ', rho_val)

There are a number of variants of the model described above:

1.   Use a bi-GRU with the same number of units instead of the bi-LSTM.
2.   Adjust the number of LSTM/GRU units.
3.   Instead of concatenating the last four hidden layers of BERT and passing them to a bi-LSTM or bi-GRU, use the final-layer sequence output of BERT directly.
4.   Instead of two branches for questions and answers in one model, create two separate models. One model takes the question-based features and predicts only the question-based output variables; the other takes both the question and answer features and predicts the answer-based output variables. Either model can use any of the three architectures above. This variant has not been tried, as it requires considerably more compute resources.






