# Fine-tuning Sentence Pair Classification with BERT

Pre-trained language representations have been shown to improve many downstream NLP tasks, such as
question answering and natural language inference. To apply pre-trained
representations to these tasks, there are two main strategies:

1. The *feature-based* approach, which uses the pre-trained representations as additional
features for the downstream task.
2. The *fine-tuning* approach, which trains the downstream task by
fine-tuning the pre-trained parameters.

While feature-based approaches such as ELMo [3] (introduced in the previous tutorial) are effective
in improving many downstream tasks, they require task-specific architectures.
Devlin et al. proposed BERT [1] (Bidirectional Encoder Representations
from Transformers), which *fine-tunes* deep bidirectional representations on a
wide range of tasks with minimal task-specific parameters and obtains
state-of-the-art results.

In this tutorial, we will focus on fine-tuning the
pre-trained BERT model to classify semantically equivalent sentence pairs.

Specifically, we will:

1. Load the state-of-the-art pre-trained BERT model and attach an additional layer for classification
2. Process and transform sentence-pair data for the task at hand
3. Fine-tune the BERT model for sentence classification

## Setup

To use this tutorial, please download the required files from the above download link, and install
GluonNLP.

### Importing necessary modules

In [1]:
!pip install gluonnlp



In [2]:
!pip install ipdb



In [3]:
# import os
# import sys
# module_path = os.path.abspath(os.path.join('..'))
# # if module_path not in sys.path:
# #     sys.path.append(module_path)

In [4]:
# module_path

In [5]:
import warnings
warnings.filterwarnings('ignore')

import io
import random
import numpy as np
import pandas as pd
import mxnet as mx
import gluonnlp as nlp
from gluonnlp.calibration import BertLayerCollector
# this notebook assumes that all required scripts are already
# downloaded from the corresponding tutorial webpage on http://gluon-nlp.mxnet.io
from bert import data
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split

nlp.utils.check_version('0.8.1')

### Setting up the environment

If no GPU is available, change `ctx` to `mx.cpu()` as noted in the code comment below.

In [6]:
np.random.seed(100)
random.seed(100)
mx.random.seed(10000)
# change `ctx` to `mx.cpu()` if no GPU is available.
ctx = mx.gpu(0)

## Using the pre-trained BERT model

The list of pre-trained BERT models available in GluonNLP can be found
[here](../../model_zoo/bert/index.rst).

In this tutorial, the BERT model we will use is BERT BASE trained on an uncased corpus of books and
the English Wikipedia dataset in the GluonNLP model zoo.

### Get BERT

Let's first take a look at the BERT model architecture for sentence pair classification below:
<div style="width: 500px;">![bert-sentence-pair](bert-sentence-pair.png)</div>
where the model takes a pair of sequences and pools the representation of the
first token in the sequence.
Note that the original BERT model was trained for masked language modeling and
next-sentence prediction, and therefore includes layers for language model decoding and
classification. These layers will not be used for fine-tuning the sentence pair classification.
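
To make the pooling step concrete, here is a minimal pure-Python sketch. It is not GluonNLP's implementation: it assumes the encoder has already produced one hidden vector per token, it uses the dense-plus-tanh pooler commonly described for BERT, and it leaves out dropout. The toy sizes, the weights, and the helper names `dense` and `pool_first_token` are made up for illustration.

```python
import math

def dense(vector, weights, bias):
    """Apply a fully connected layer: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def pool_first_token(hidden_states, pool_w, pool_b):
    """Take the hidden vector of the first token ([CLS]) and apply dense + tanh."""
    first = hidden_states[0]
    return [math.tanh(v) for v in dense(first, pool_w, pool_b)]

# Toy encoder output for one sentence pair: 4 tokens, hidden size 3.
hidden_states = [
    [0.5, -1.0, 2.0],   # [CLS]
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],    # [SEP]
    [0.9, 0.8, 0.7],
]
# Hypothetical weights: an identity pooler and a 2-class classification layer.
pool_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pool_b = [0.0, 0.0, 0.0]
cls_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cls_b = [0.0, 0.0]

pooled = pool_first_token(hidden_states, pool_w, pool_b)
logits = dense(pooled, cls_w, cls_b)
predicted_class = logits.index(max(logits))
print(pooled, logits, predicted_class)
```

Only the first row of `hidden_states` influences the prediction, which is why the `[CLS]` position has to summarize the whole pair during fine-tuning.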

We can load the pre-trained BERT fairly easily using the model API in GluonNLP, which returns the
vocabulary along with the model. We include the pooler layer of the pre-trained model by setting
`use_pooler` to `True`.

In [7]:
bert_base, vocabulary = nlp.model.get_model('bert_12_768_12',
                                             dataset_name='book_corpus_wiki_en_uncased',
                                             pretrained=True, ctx=ctx, use_pooler=True,
                                             use_decoder=False, use_classifier=False)
print(bert_base)

BERTModel(
  (encoder): BERTEncoder(
    (dropout_layer): Dropout(p = 0.1, axes=())
    (layer_norm): LayerNorm(eps=1e-12, axis=-1, center=True, scale=True, in_channels=768)
    (transformer_cells): HybridSequential(
      (0): BERTEncoderCell(
        (dropout_layer): Dropout(p = 0.1, axes=())
        (attention_cell): DotProductSelfAttentionCell(
          (dropout_layer): Dropout(p = 0.1, axes=())
        )
        (proj): Dense(768 -> 768, linear)
        (ffn): PositionwiseFFN(
          (ffn_1): Dense(768 -> 3072, linear)
          (activation): GELU()
          (ffn_2): Dense(3072 -> 768, linear)
          (dropout_layer): Dropout(p = 0.1, axes=())
          (layer_norm): LayerNorm(eps=1e-12, axis=-1, center=True, scale=True, in_channels=768)
        )
        (layer_norm): LayerNorm(eps=1e-12, axis=-1, center=True, scale=True, in_channels=768)
      )
      (1): BERTEncoderCell(
        (dropout_layer): Dropout(p = 0.1, axes=())
        (attention_cell): DotProductSelfAttention

### Transform the model for `SentencePair` classification

Now that we have loaded the BERT model, we only need to attach an additional layer for
classification. The `BERTClassifier` class uses a BERT base model to encode the sentence
representation, followed by a `nn.Dense` layer for classification.

In [8]:
bert_classifier = nlp.model.BERTClassifier(bert_base, num_classes=2, dropout=0.1)
# only need to initialize the classifier layer.
bert_classifier.classifier.initialize(init=mx.init.Normal(0.02), ctx=ctx)
bert_classifier.hybridize(static_alloc=True)

# softmax cross entropy loss for classification
loss_function = mx.gluon.loss.SoftmaxCELoss()
loss_function.hybridize(static_alloc=True)

metric = mx.metric.Accuracy()
# metric = mx.metric.F1()

## Data preprocessing for BERT

For this tutorial, we need to do a bit of preprocessing before feeding our data into
the BERT model. Here we want to leverage the dataset included in the downloaded archive at the
beginning of this tutorial.

### Loading the dataset

We use the dev set of the Microsoft Research Paraphrase Corpus dataset. The file is
named 'dev.tsv'. Let's take a look at the first few lines of the raw dataset.

In [9]:
# tsv_file = io.open('dev.tsv', encoding='utf-8')
# for i in range(5):
#     print(tsv_file.readline())

The file contains 5 columns, separated by tabs. The header of the file explains each of these
columns, although an explanation for each is included here:

0. The label indicating whether the two sentences are semantically equivalent
1. The id of the first sentence in this sample
2. The id of the second sentence in this sample
3. The content of the first sentence
4. The content of the second sentence

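For our task, we are interested in the 0th, 3rd and 4th columns.
Such a file can be loaded with the `TSVDataset` API, skipping the first line because it's just
the schema. The cells below instead read `training.csv` with pandas and keep its `question`,
`answer` and `relevance` columns.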
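
The column selection described above for the tab-separated layout can be sketched with the standard library alone. This is an illustration and not the `TSVDataset` implementation: the two sample rows and the header names are invented, and `read_pairs` with its `field_indices` and `num_discard` parameters is a hypothetical helper.

```python
import csv
import io

# A made-up two-row sample in the 5-column layout described above
# (label, id 1, id 2, sentence 1, sentence 2); not taken from the real file.
raw = (
    "label\tid1\tid2\tsentence1\tsentence2\n"
    "1\t101\t102\tThe cat sat on the mat.\tA cat was sitting on the mat.\n"
    "0\t201\t202\tIt rained all day.\tThe stock market fell.\n"
)

def read_pairs(handle, field_indices=(3, 4, 0), num_discard=1):
    """Return [sentence_a, sentence_b, label] rows, skipping the header lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)[num_discard:]
    return [[row[i] for i in field_indices] for row in rows]

pairs = read_pairs(io.StringIO(raw))
for sentence_a, sentence_b, label in pairs:
    print(label, "|", sentence_a, "|", sentence_b)
```

Putting the label last gives each sample the same `(sentence A, sentence B, label)` order used by the raw dataset built in the cells that follow.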

In [10]:
train_df = pd.read_csv('training.csv')

In [11]:
train_df.head(3)

Unnamed: 0,ID,question,answer,relevance
0,2788,who kill franz ferdinand ww1,A plaque commemorating the location of the Sar...,0
1,8166,what is a medallion guarantee,Sample of a Medallion signature guarantee stampIn,0
2,4289,what does a vote to table a motion mean ?,The difference is the idea of what the table i...,0


In [12]:
train_df['question'] = train_df['question'].astype(str)
train_df['answer'] = train_df['answer'].astype(str)

train_data_raw = train_df[['question', 'answer', 'relevance']].values
data_train_raw = mx.gluon.data.SimpleDataset(train_data_raw)

sample_id = 0
# Sentence A
print(data_train_raw[sample_id][0])
# Sentence B
print(data_train_raw[sample_id][1])
# Label from the `relevance` column: 1 means relevant, 0 means not relevant
print(data_train_raw[sample_id][2])

who kill franz ferdinand ww1
A plaque commemorating the location of the Sarajevo assassination ( image taken in 2009 )
0


In [13]:
len(train_df)

6861

In [14]:
sum(train_df['relevance'])

838

In [15]:
838/6861

0.12213962979157557
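
Only about 12% of the training pairs carry the positive label (838 of 6861), so accuracy alone can look good for a model that has learned nothing. The following pure-Python sketch shows this with a trivial predictor that always outputs 0. The helper functions `accuracy` and `f1` are written out here for illustration; they follow the usual definitions but are not the scikit-learn implementations.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: treat F1 as 0 instead of dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Same class balance as the training set: 838 positives out of 6861.
y_true = [1] * 838 + [0] * (6861 - 838)
# A trivial "model" that never predicts the positive class.
y_pred = [0] * len(y_true)

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"F1       = {f1(y_true, y_pred):.3f}")
```

The always-zero predictor reaches an accuracy of about 0.878 with an F1 of 0, which is a reason to track F1 alongside accuracy on this dataset.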

To use the pre-trained BERT model, we need to pre-process the data in the same
way it was trained. The following figure shows the input representation in BERT:
<div style="width: 500px;">![bert-embed](bert-embed.png)</div>

We will use `BERTDatasetTransform` to perform the following transformations:

- tokenize the input sequences
- insert [CLS] at the beginning
- insert [SEP] between sentence A and sentence B, and at the end
- generate segment ids to indicate whether a token belongs to the first sequence or the second
sequence
- generate the valid length
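
The steps above can be sketched in plain Python. This is a simplified illustration, not the `BERTDatasetTransform` code: it assumes the sentences are already split into tokens, it uses a toy vocabulary, and the truncation rule (trim the longer sentence first) is an assumption. The helper name `encode_pair` is hypothetical.

```python
def encode_pair(tokens_a, tokens_b, vocab, max_len, pad_id):
    """Build token ids, segment ids and valid length for one sentence pair."""
    # Assumed truncation rule: trim the longer sentence until the pair
    # plus the 3 special tokens fits into max_len.
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    while len(tokens_a) + len(tokens_b) > max_len - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    valid_length = len(tokens)

    # Map tokens to ids, then pad both sequences up to max_len.
    token_ids = [vocab[t] for t in tokens]
    padding = max_len - valid_length
    token_ids += [pad_id] * padding
    segment_ids += [0] * padding
    return token_ids, segment_ids, valid_length

# Toy vocabulary; the real one comes from the pre-trained model.
vocab = {"[PAD]": 1, "[CLS]": 2, "[SEP]": 3, "who": 10, "won": 11, "she": 12, "did": 13}
token_ids, segment_ids, valid_length = encode_pair(
    ["who", "won"], ["she", "did"], vocab, max_len=10, pad_id=vocab["[PAD]"])
print(token_ids)
print(segment_ids)
print(valid_length)
```

The valid length records how many positions hold real tokens, so the model can ignore the padded tail.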

In [16]:
# Use the vocabulary from pre-trained model for tokenization
bert_tokenizer = nlp.data.BERTTokenizer(vocabulary, lower=True)

# The maximum length of an input sequence
# max_len = 64
max_len = 128
# The labels for the two classes 
all_labels = [0, 1]

# whether to transform the data as sentence pairs.
# for single sentence classification, set pair=False
# for regression task, set class_labels=None
# for inference without label available, set has_label=False
pair = True
transform = data.transform.BERTDatasetTransform(bert_tokenizer, max_len,
                                                class_labels=all_labels,
                                                has_label=True,
                                                pad=True,
                                                pair=pair)
data_train = data_train_raw.transform(transform)

print('vocabulary used for tokenization = \n%s'%vocabulary)
print('%s token id = %s'%(vocabulary.padding_token, vocabulary[vocabulary.padding_token]))
print('%s token id = %s'%(vocabulary.cls_token, vocabulary[vocabulary.cls_token]))
print('%s token id = %s'%(vocabulary.sep_token, vocabulary[vocabulary.sep_token]))
print('token ids = \n%s'%data_train[sample_id][0])
print('segment ids = \n%s'%data_train[sample_id][1])
print('valid length = \n%s'%data_train[sample_id][2])
print('label = \n%s'%data_train[sample_id][3])

vocabulary used for tokenization = 
Vocab(size=30522, unk="[UNK]", reserved="['[CLS]', '[SEP]', '[MASK]', '[PAD]']")
[PAD] token id = 1
[CLS] token id = 2
[SEP] token id = 3
token ids = 
[    2  2040  3102  8965  9684  1059  2860  2487     3  1037 11952 20646
  1996  3295  1997  1996 18354 10102  1006  3746  2579  1999  2268  1007
     3     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1]
segment ids = 
[0 0 0 0 0 0 0 0 0

In [17]:
batch_size = 16
# The FixedBucketSampler and the DataLoader for making the mini-batches
train_sampler = nlp.data.FixedBucketSampler(lengths=[int(item[2]) for item in data_train],
                                            batch_size=batch_size,
                                            shuffle=True)
bert_dataloader = mx.gluon.data.DataLoader(data_train, batch_sampler=train_sampler)

In [18]:
train_df

Unnamed: 0,ID,question,answer,relevance
0,2788,who kill franz ferdinand ww1,A plaque commemorating the location of the Sar...,0
1,8166,what is a medallion guarantee,Sample of a Medallion signature guarantee stampIn,0
2,4289,what does a vote to table a motion mean ?,The difference is the idea of what the table i...,0
3,8180,when was the lady gaga judas song released,`` Judas '' is a song by American recording ar...,1
4,725,How did Edgar Allan Poe die ?,His work forced him to move among several citi...,0
...,...,...,...,...
6856,1310,when is the wv state fair,Free parking is provided adjacent to the fairg...,0
6857,3413,what are square diamonds called ?,"However , while displaying the same high degre...",0
6858,9631,what is direct marketing channel,Direct marketing is practiced by businesses of...,0
6859,581,who was charged with murder after the massacre...,They received hate mail and death threats and ...,0


In [19]:
X_train, X_val = train_test_split(train_df, test_size=0.20, random_state=42)

In [20]:
len(X_train)

5488

In [21]:
len(X_val)

1373

In [22]:
X_train_data_raw = X_train[['question', 'answer', 'relevance']].values
X_val_data_raw = X_val[['question', 'answer', 'relevance']].values
X_data_train_raw = mx.gluon.data.SimpleDataset(X_train_data_raw)
X_data_val_raw = mx.gluon.data.SimpleDataset(X_val_data_raw)

X_data_train = X_data_train_raw.transform(transform)
X_data_val = X_data_val_raw.transform(transform)


# The training set dataloader
X_train_sampler = nlp.data.FixedBucketSampler(lengths=[int(item[2]) for item in X_data_train],
                                            batch_size=batch_size,
                                            shuffle=True)
X_train_dataloader = mx.gluon.data.DataLoader(X_data_train, batch_sampler=X_train_sampler)

# validation set dataloader
X_val_sampler = nlp.data.FixedBucketSampler(lengths=[int(item[2]) for item in X_data_val],
                                            batch_size=batch_size,
                                            shuffle=True)
X_val_dataloader = mx.gluon.data.DataLoader(X_data_val, batch_sampler=X_val_sampler)


In [23]:
# The hyperparameters
batch_size = 16
lr = 5e-6



trainer = mx.gluon.Trainer(bert_classifier.collect_params(), 'adam',
                           {'learning_rate': lr, 'epsilon': 1e-9})

# Collect all differentiable parameters
# `grad_req == 'null'` indicates no gradients are calculated (e.g. constant parameters)
# The gradients for these params are clipped later
params = [p for p in bert_classifier.collect_params().values() if p.grad_req != 'null']
grad_clip = 1

# Train the model for `num_epochs` epochs
log_interval = 4
num_epochs = 40
for epoch_id in range(num_epochs):
    metric.reset()
    step_loss = 0
    for batch_id, (token_ids, segment_ids, valid_length, label) in enumerate(X_train_dataloader):
        with mx.autograd.record():

            # Load the data to the GPU
            token_ids = token_ids.as_in_context(ctx)
            valid_length = valid_length.as_in_context(ctx)
            segment_ids = segment_ids.as_in_context(ctx)
            label = label.as_in_context(ctx)

            # Forward computation
            out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
            ls = loss_function(out, label).mean()

        # And backwards computation
        ls.backward()

        # Gradient clipping
        trainer.allreduce_grads()
        nlp.utils.clip_grad_global_norm(params, grad_clip)
        trainer.update(1)

        step_loss += ls.asscalar()
#         import ipdb; ipdb.set_trace() # debugging starts here
        metric.update([label], [out])

        # Printing vital information
        if (batch_id + 1) % (log_interval) == 0:
            print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
                         .format(epoch_id, batch_id + 1, len(X_train_dataloader),
                                 step_loss / log_interval,
                                 trainer.learning_rate, metric.get()[1]))
            step_loss = 0
            
    result = []
    gt_label = []
    for _, seqs in enumerate(X_val_dataloader):
        token_ids, segment_ids, valid_length, label = seqs
        token_ids = token_ids.as_in_context(ctx)
        valid_length = valid_length.as_in_context(ctx)
        segment_ids = segment_ids.as_in_context(ctx)
        out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
        batch_labels = np.argmax(out, axis=1)
        result += batch_labels.asnumpy().astype(int).tolist()
        gt_label += label.asnumpy().flatten().tolist()
    print(f'Validation F1: {f1_score(gt_label, result)}, Accuracy Score: {accuracy_score(gt_label, result)}')

[Epoch 0 Batch 4/347] loss=0.6525, lr=0.0000050, acc=0.692
[Epoch 0 Batch 8/347] loss=0.6371, lr=0.0000050, acc=0.690
[Epoch 0 Batch 12/347] loss=0.6757, lr=0.0000050, acc=0.684
[Epoch 0 Batch 16/347] loss=0.5760, lr=0.0000050, acc=0.735
[Epoch 0 Batch 20/347] loss=0.5911, lr=0.0000050, acc=0.745
[Epoch 0 Batch 24/347] loss=0.5841, lr=0.0000050, acc=0.754
[Epoch 0 Batch 28/347] loss=0.4885, lr=0.0000050, acc=0.777
[Epoch 0 Batch 32/347] loss=0.5244, lr=0.0000050, acc=0.783
[Epoch 0 Batch 36/347] loss=0.4782, lr=0.0000050, acc=0.797
[Epoch 0 Batch 40/347] loss=0.4970, lr=0.0000050, acc=0.802
[Epoch 0 Batch 44/347] loss=0.4691, lr=0.0000050, acc=0.809
[Epoch 0 Batch 48/347] loss=0.4600, lr=0.0000050, acc=0.812
[Epoch 0 Batch 52/347] loss=0.4026, lr=0.0000050, acc=0.818
[Epoch 0 Batch 56/347] loss=0.5692, lr=0.0000050, acc=0.813
[Epoch 0 Batch 60/347] loss=0.3906, lr=0.0000050, acc=0.819
[Epoch 0 Batch 64/347] loss=0.3705, lr=0.0000050, acc=0.824
[Epoch 0 Batch 68/347] loss=0.4175, lr=0.0

KeyboardInterrupt: 

## Fine-tuning the model

Now we have all the pieces to put together, and we can finally start fine-tuning the
model for a few epochs. For demonstration, we use a fixed learning rate and
skip the validation steps. For the optimizer, we use Adam, which
performs very well for NLP data and for BERT models in particular.
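
Each training step in the loop below clips the gradients by their global norm and then applies an Adam update. The sketch here shows both operations on toy numbers in plain Python. It follows the textbook formulation of Adam and is not MXNet's implementation. The learning rate and epsilon match the values used in this notebook, while `beta1`, `beta2`, and the helper names `clip_by_global_norm` and `adam_step` are assumptions made for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + 1e-8))
    return [[g * scale for g in grad] for grad in grads], total_norm

def adam_step(param, grad, m, v, t, lr=5e-6, beta1=0.9, beta2=0.999, eps=1e-9):
    """One textbook Adam update for a flat list of parameters."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction for the running averages at step t (t starts at 1).
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    param = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(param, m_hat, v_hat)]
    return param, m, v

# Two toy parameter groups with gradients whose global norm is 5.0.
params = [[0.1, -0.2], [0.3]]
grads = [[3.0, 0.0], [4.0]]

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
state = [([0.0] * len(p), [0.0] * len(p)) for p in params]
new_params = []
for p, g, (m, v) in zip(params, clipped, state):
    updated, _, _ = adam_step(p, g, m, v, t=1)
    new_params.append(updated)

print(norm_before)
print(clipped)
print(new_params)
```

Clipping uses one scale factor for all parameters, so the direction of the overall gradient is preserved while its length is capped.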

In [19]:
# The hyperparameters
batch_size = 16
lr = 5e-6

trainer = mx.gluon.Trainer(bert_classifier.collect_params(), 'adam',
                           {'learning_rate': lr, 'epsilon': 1e-9})

# Collect all differentiable parameters
# `grad_req == 'null'` indicates no gradients are calculated (e.g. constant parameters)
# The gradients for these params are clipped later
params = [p for p in bert_classifier.collect_params().values() if p.grad_req != 'null']
grad_clip = 1

# Train the model for `num_epochs` epochs
log_interval = 4
num_epochs = 9
for epoch_id in range(num_epochs):
    metric.reset()
    step_loss = 0
    for batch_id, (token_ids, segment_ids, valid_length, label) in enumerate(bert_dataloader):
        with mx.autograd.record():

            # Load the data to the GPU
            token_ids = token_ids.as_in_context(ctx)
            valid_length = valid_length.as_in_context(ctx)
            segment_ids = segment_ids.as_in_context(ctx)
            label = label.as_in_context(ctx)

            # Forward computation
            out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
            ls = loss_function(out, label).mean()

        # And backwards computation
        ls.backward()

        # Gradient clipping
        trainer.allreduce_grads()
        nlp.utils.clip_grad_global_norm(params, grad_clip)
        trainer.update(1)

        step_loss += ls.asscalar()
#         import ipdb; ipdb.set_trace() # debugging starts here
        metric.update([label], [out])

        # Printing vital information
        if (batch_id + 1) % (log_interval) == 0:
            print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
                         .format(epoch_id, batch_id + 1, len(bert_dataloader),
                                 step_loss / log_interval,
                                 trainer.learning_rate, metric.get()[1]))
            step_loss = 0


[Epoch 0 Batch 4/434] loss=0.7231, lr=0.0000050, acc=0.469
[Epoch 0 Batch 8/434] loss=0.6147, lr=0.0000050, acc=0.609
[Epoch 0 Batch 12/434] loss=0.6109, lr=0.0000050, acc=0.677
[Epoch 0 Batch 16/434] loss=0.6227, lr=0.0000050, acc=0.674
[Epoch 0 Batch 20/434] loss=0.5898, lr=0.0000050, acc=0.683
[Epoch 0 Batch 24/434] loss=0.5403, lr=0.0000050, acc=0.720
[Epoch 0 Batch 28/434] loss=0.4930, lr=0.0000050, acc=0.752
[Epoch 0 Batch 32/434] loss=0.5040, lr=0.0000050, acc=0.762
[Epoch 0 Batch 36/434] loss=0.4732, lr=0.0000050, acc=0.777
[Epoch 0 Batch 40/434] loss=0.4627, lr=0.0000050, acc=0.790
[Epoch 0 Batch 44/434] loss=0.4754, lr=0.0000050, acc=0.797
[Epoch 0 Batch 48/434] loss=0.5080, lr=0.0000050, acc=0.798
[Epoch 0 Batch 52/434] loss=0.4316, lr=0.0000050, acc=0.805
[Epoch 0 Batch 56/434] loss=0.4887, lr=0.0000050, acc=0.807
[Epoch 0 Batch 60/434] loss=0.4164, lr=0.0000050, acc=0.812
[Epoch 0 Batch 64/434] loss=0.3525, lr=0.0000050, acc=0.819
[Epoch 0 Batch 68/434] loss=0.3794, lr=0.0

In [20]:
# for epoch_id in range(5, 8):
#     metric.reset()
#     step_loss = 0
#     for batch_id, (token_ids, segment_ids, valid_length, label) in enumerate(bert_dataloader):
#         with mx.autograd.record():

#             # Load the data to the GPU
#             token_ids = token_ids.as_in_context(ctx)
#             valid_length = valid_length.as_in_context(ctx)
#             segment_ids = segment_ids.as_in_context(ctx)
#             label = label.as_in_context(ctx)

#             # Forward computation
#             out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
#             ls = loss_function(out, label).mean()

#         # And backwards computation
#         ls.backward()

#         # Gradient clipping
#         trainer.allreduce_grads()
#         nlp.utils.clip_grad_global_norm(params, 1)
#         trainer.update(1)

#         step_loss += ls.asscalar()
#         metric.update([label], [out])

#         # Printing vital information
#         if (batch_id + 1) % (log_interval) == 0:
#             print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
#                          .format(epoch_id, batch_id + 1, len(bert_dataloader),
#                                  step_loss / log_interval,
#                                  trainer.learning_rate, metric.get()[1]))
#             step_loss = 0

In [21]:
test_df = pd.read_csv('public_test_features.csv')
test_df['question'] = test_df['question'].astype(str)
test_df['answer'] = test_df['answer'].astype(str)

In [22]:
test_df.head(2)

Unnamed: 0,ID,question,answer
0,917,when does the electoral college votes,The Twelfth Amendment specifies how a Presiden...
1,6587,what year lord of rings made ?,Tolkien 's work has been the subject of extens...


In [23]:
len(test_df)

2941

In [24]:
test_trans = data.transform.BERTDatasetTransform(bert_tokenizer, 
                                                 max_len,
                                                 class_labels=None,
                                                 pad=True, 
                                                 pair=pair,
                                                 has_label=False)

test_data_raw = test_df[['question', 'answer']].values
data_test_raw = mx.gluon.data.SimpleDataset(test_data_raw)
data_test = data_test_raw.transform(test_trans)

In [25]:

print('token ids = \n%s'%data_test[0][0])
print('segment ids = \n%s'%data_test[0][1])
print('valid length = \n%s'%data_test[0][2])


token ids = 
[    2  2043  2515  1996  6092  2267  4494     3  1996 11313  7450 27171
  2129  1037  2343  1998  3580  2343  2024  2700  1998  5942  2169 20374
  2000  3459  2028  3789  2005  2343  1998  2178  3789  2005  3580  2343
  1012     3     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1]
segment ids = 
[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [26]:
# sampler for evaluation
pad_val = vocabulary[vocabulary.padding_token]
batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(axis=0, pad_val=pad_val),  # input
    nlp.data.batchify.Pad(axis=0, pad_val=0),  # segment
    nlp.data.batchify.Stack())  # valid length


dev_dataloader = mx.gluon.data.DataLoader(data_test, batch_size=batch_size, num_workers=4,
                                           shuffle=False, batchify_fn=batchify_fn)


In [27]:
len(data_test)

2941

In [28]:
result = []
for _, seqs in enumerate(dev_dataloader):
    token_ids, segment_ids, valid_length = seqs
    token_ids = token_ids.as_in_context(ctx)
    valid_length = valid_length.as_in_context(ctx)
    segment_ids = segment_ids.as_in_context(ctx)
    out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
    batch_labels = np.argmax(out, axis=1)
    result += list(batch_labels)

In [29]:
len(result)

2941

In [30]:
answer = [i.as_in_context(mx.cpu()) for i in result]
answer = np.array([i.asnumpy() for i in answer])
answer = list(answer.flatten().astype(int))
# dct = {0: 'N', 1: 'Y'}
# answer = [dct[k] for k in answer]
test_df['relevance'] = answer

In [31]:
test_df

Unnamed: 0,ID,question,answer,relevance
0,917,when does the electoral college votes,The Twelfth Amendment specifies how a Presiden...,0
1,6587,what year lord of rings made ?,Tolkien 's work has been the subject of extens...,0
2,5227,what countries are under the buddhism religion,Estimate of the worldwide Buddhist population ...,0
3,4707,what does ( sic ) mean ?,Sic may also refer to:,0
4,700,when is it memorial day,In cases involving a family graveyard where re...,0
...,...,...,...,...
2936,5590,how many ports are there in networking,"That is , data packets are routed across the n...",0
2937,5320,what genre is bloody beetroots,"In fact , the only identifying public feature ...",0
2938,1664,where is green bay packers from,They are members of the North Division of the ...,0
2939,1245,when did the civil war start and where,The Union marshaled the resources and manpower...,0


In [32]:
sum(test_df['relevance'])

332

In [33]:
177/2941

0.060183611016661

In [34]:
result_df = test_df[["ID", "relevance"]]
result_df.to_csv("test_submission_nlp2.csv", index=False)

In [35]:
result_df

Unnamed: 0,ID,relevance
0,917,0
1,6587,0
2,5227,0
3,4707,0
4,700,0
...,...,...
2936,5590,0
2937,5320,0
2938,1664,0
2939,1245,0
