![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

This tutorial demonstrates how to implement text classification on news data using a machine learning approach. The classifier is required to assign each text to its corresponding category in a supervised manner.

# What will we accomplish?

Steps to implement a text classifier with machine learning:

> Step 1: Importing Libraries

> Step 2: Loading Datasets & Exploratory Data Analysis

> Step 3: Text Pre-processing

> Step 4: Feature Extraction (Vectorization)

> Step 5: Running ML algorithms

> Step 6: Grid Search for parameter tuning

# Notebook Content

* [Getting Started with BERT](#Getting-Started-with-BERT)

    * [Hugging Face Transformers](#Hugging-Face-Transformers)
    
    * [Generating BERT Embeddings](#Generating-BERT-Embeddings)
        * [Import Libraries](#Import-Libraries)
        * [Download Pre-trained BERT Model](#Download-Pre-trained-BERT-Model)
        * [Preprocessing the Input](#Preprocessing-the-Input)
        * [Getting the Embedding](#Getting-the-Embedding)


* [Fine-tuning BERT for Downstream Tasks](#Fine-tuning-BERT-for-Downstream-Tasks)

    * [Text Classification](#Text-Classification)
        * [Import the Dependencies](#Import-the-Dependencies)
        * [Loading the Model and Dataset](#Loading-the-Model-and-Dataset)
        * [Train-Test Split](#Train-Test-Split)
        * [Download and Load Pre-trained Model](#Download-and-Load-Pre-trained-Model)
        * [Preprocess the Dataset](#Preprocess-the-Dataset)
        * [Training the Model](#Training-the-Model)    
        
    * [Q&A with Finetuned BERT](#Q&A-with-Finetuned-BERT)
        * [Import Dependencies](#Import-Dependencies)
        * [Load Pre-trained Model](#Load-Pre-trained-Model)
        * [Preprocessing the Input](#Preprocessing-the-Input)
        * [Getting the Answer](#Getting-the-Answer)

# Getting Started with BERT

## Hugging Face Transformers

**Hugging Face** is an organization that is on the path of democratizing AI through natural language. Their open source transformers library is very popular among the **Natural Language Processing (NLP)** community. It is very useful and powerful for several NLP and **Natural Language Understanding (NLU)** tasks. It includes **thousands of pre-trained models** in more than 100 languages. One of the many advantages of the transformers library is that it is compatible with both PyTorch and TensorFlow.

## Generating BERT Embeddings

In this section, we will learn how to extract embeddings from the pre-trained BERT model. Consider the sentence *I love Paris*. Let's see how to obtain the contextualized word embedding of all the words in the sentence using the pre-trained BERT model with Hugging Face's transformers library.

### Import Libraries

In [1]:
from transformers import BertModel, BertTokenizer
import torch

### Download Pre-trained BERT Model

Next, we download the pre-trained BERT model. We can check all the available pre-trained BERT models [here](https://huggingface.co/transformers/pre-trained_models.html). We use the `'bert-base-uncased'` model. As the name suggests, it is the BERT-base model with 12 encoder layers, and it is trained with uncased (lowercased) tokens. Since we are using BERT-base, the representation size will be 768.

In [2]:
!jupyter nbextension enable --py widgetsnbextension

model = BertModel.from_pretrained('bert-base-uncased')

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: ok
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Next, we download and load the tokenizer that was used to pre-train the bert-base-uncased model.

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### Preprocessing the Input

In [4]:
# Define the sentence
sentence = "I love Paris"

Tokenize the sentence and obtain the tokens.

In [5]:
tokens = tokenizer.tokenize(sentence)

In [6]:
# Print the tokens
print(tokens)

['i', 'love', 'paris']


Add the `[CLS]` token at the beginning and the `[SEP]` token at the end of the `tokens` list:

In [7]:
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(tokens)

['[CLS]', 'i', 'love', 'paris', '[SEP]']


Say we need to keep the length of our `tokens` list to 7; in that case, we add two `[PAD]` tokens at the end.

In [8]:
tokens = tokens + ['[PAD]'] + ['[PAD]']
print(tokens)

['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']


Next, we create the attention mask. We set the attention mask value to 1 if the token is not a `[PAD]` token, else we set the attention mask to 0.

In [9]:
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]
print(attention_mask)

[1, 1, 1, 1, 1, 0, 0]


As we can see, the attention mask is 0 at the positions that hold a `[PAD]` token and 1 at all other positions.
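The padding and masking steps above can be folded into one small helper. The following is a minimal sketch in plain Python; the function name `pad_and_mask` and its parameters are our own illustration and are not part of the transformers library.

```python
def pad_and_mask(tokens, max_len, pad_token="[PAD]"):
    """Pad a token list to max_len and build the matching attention mask."""
    if len(tokens) > max_len:
        raise ValueError("token list is longer than max_len")
    padded = tokens + [pad_token] * (max_len - len(tokens))
    # 1 marks a real token, 0 marks padding
    mask = [0 if tok == pad_token else 1 for tok in padded]
    return padded, mask


padded, mask = pad_and_mask(['[CLS]', 'i', 'love', 'paris', '[SEP]'], max_len=7)
print(padded)  # ['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']
print(mask)    # [1, 1, 1, 1, 1, 0, 0]
```

The helper raises an error instead of truncating, so a sentence that is too long is not shortened silently.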


Next, we convert all the tokens to their token IDs.

In [10]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(token_ids)

[101, 1045, 2293, 3000, 102, 0, 0]


From the output, we can observe that each token is mapped to a unique token ID.
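Conceptually, converting tokens to IDs is a vocabulary lookup. The sketch below mimics it with a toy vocabulary that contains only the IDs printed above, plus `[UNK]` for unknown tokens. The dictionary `toy_vocab` and the function `tokens_to_ids` are hypothetical stand-ins; the real tokenizer holds the full vocabulary of the model.

```python
# Toy vocabulary: only the tokens seen in this notebook, plus [UNK].
toy_vocab = {
    '[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102,
    'i': 1045, 'love': 2293, 'paris': 3000,
}


def tokens_to_ids(tokens, vocab, unk_token='[UNK]'):
    """Look up each token; anything missing falls back to the [UNK] ID."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


ids = tokens_to_ids(
    ['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]'], toy_vocab
)
print(ids)  # [101, 1045, 2293, 3000, 102, 0, 0]
```

Note that both `[PAD]` tokens map to the same ID: the mapping is per token type, not per position.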


Now, we convert `token_ids` and `attention_mask` to tensors.

In [11]:
token_ids = torch.tensor(token_ids).unsqueeze(0)

attention_mask = torch.tensor(attention_mask).unsqueeze(0)

Next, we feed token_ids and attention_mask to the pre-trained BERT model and get the embedding.

### Getting the Embedding

As shown in the following code, we feed `token_ids` and `attention_mask` to `model` and get the embeddings. Because we pass `return_dict=False`, `model` returns the output as a **tuple** with two values. The first value, `hidden_rep`, is the hidden state representation: it holds the representation of every token obtained from the final encoder layer (encoder 12). The second value, `cls_head`, holds the representation of the `[CLS]` token.

In [12]:
hidden_rep, cls_head = model(input_ids=token_ids, attention_mask=attention_mask, return_dict=False)

In [13]:
# Embedding (representation) of all the tokens in our input
print(hidden_rep, hidden_rep.shape, sep='\n\n')

tensor([[[-0.0719,  0.2163,  0.0047,  ..., -0.5865,  0.2262,  0.1981],
         [ 0.2236,  0.6536, -0.2294,  ..., -0.3547,  0.5517, -0.2367],
         [ 1.0410,  0.7755,  1.0335,  ..., -0.5621,  0.5218, -0.0852],
         ...,
         [ 0.6156,  0.1036, -0.1875,  ..., -0.3799, -0.7008, -0.3500],
         [ 0.0791,  0.4287,  0.4147,  ..., -0.2417,  0.2403,  0.0378],
         [-0.0165,  0.2459,  0.4566,  ..., -0.2179,  0.1876,  0.0228]]],
       grad_fn=<NativeLayerNormBackward>)

torch.Size([1, 7, 768])


The size [1, 7, 768] indicates [batch_size, sequence_length, hidden_size].

We can obtain the representation of each token as follows:
* `hidden_rep[0][0]` gives the representation of the first token, which is `[CLS]`.
* `hidden_rep[0][1]` gives the representation of the second token, which is `i`.
* `hidden_rep[0][2]` gives the representation of the third token, which is `love`.

Therefore, we can obtain the contextual representation of all the tokens. These are the contextualized word embeddings of all the words in the given sentence.

In [14]:
hidden_rep[0][0]

tensor([-7.1920e-02,  2.1631e-01,  4.7180e-03, -8.1534e-02, -3.0399e-01,
        -2.6997e-01,  3.6993e-01,  4.3028e-01,  1.1932e-02, -2.0674e-01,
        -8.9630e-02, -1.3917e-01,  1.7530e-01,  4.8318e-01,  3.0506e-01,
        -5.9535e-03, -1.7049e-01,  4.9769e-01,  4.6345e-01, -1.6272e-01,
         2.8591e-02, -2.6006e-01, -3.3321e-01, -8.1934e-02, -8.8632e-02,
        -3.5845e-01, -1.2788e-01, -7.6149e-02,  3.1540e-01, -1.5370e-02,
         2.4448e-01,  7.5998e-02, -6.1328e-02,  1.8551e-01,  2.3354e-01,
        -5.2519e-02,  3.3775e-01, -1.0754e-01, -3.2548e-02,  2.1909e-01,
         1.7896e-01, -8.9923e-03,  2.1548e-01, -4.8307e-02,  2.7949e-01,
        -2.8501e-01, -1.8575e+00, -3.7983e-02, -6.7010e-02, -2.6804e-01,
         2.5982e-01, -9.3902e-02,  4.1909e-01,  3.3008e-01,  5.1305e-02,
         2.5632e-01, -3.9642e-01,  6.5480e-01,  1.2961e-01,  3.6180e-01,
         1.5786e-01,  1.1038e-03, -1.5318e-01,  3.4398e-02, -1.8015e-01,
         2.6369e-01,  3.7324e-02,  2.1566e-01, -3.7

Now, let's take a look at `cls_head`. It contains the representation of the `[CLS]` token:

In [15]:
print(cls_head.shape)

torch.Size([1, 768])


The size [1, 768] indicates [batch_size, hidden_size].

We learned that `cls_head` holds the **aggregate representation**, so we can use `cls_head` as the representation of the sentence *I love Paris*.
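It is worth noting that, as far as we understand the Hugging Face implementation, `cls_head` is not identical to `hidden_rep[0][0]`: the final-layer `[CLS]` vector is passed through an extra dense layer followed by a tanh activation (the pooler) before it is returned. The sketch below reproduces that computation with plain lists and toy numbers. The weights, the bias, and the 3-dimensional vector are invented for illustration only; the real layer is 768 x 768 and uses learned weights.

```python
import math


def pooled_cls(cls_hidden, weight, bias):
    """Compute tanh(W @ h + b) for a single [CLS] vector using plain lists."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]


cls_hidden = [0.5, -1.0, 2.0]      # toy stand-in for hidden_rep[0][0]
weight = [[1.0, 0.0, 0.0],         # toy identity matrix instead of learned weights
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

pooled = pooled_cls(cls_hidden, weight, bias)
print(pooled)
```

Because of the tanh, every value of the pooled vector lies strictly between -1 and 1, which is one quick way to tell the two outputs apart.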

# Fine-tuning BERT for Downstream Tasks

So far, we have learned how to use the pre-trained BERT model. Now, let's learn how to fine-tune the pre-trained BERT model for downstream tasks. Note that fine-tuning implies that we are not training BERT from scratch; instead, we are using the pre-trained BERT and updating its weights according to our task.

In this section, we will learn how to fine-tune the pre-trained BERT model for the following downstream tasks:
* Text classification (Sentiment Analysis)
* Question-answering

## Text Classification 

Let's learn how to finetune the pre-trained BERT for text classification tasks. Say we are performing sentiment analysis. In sentiment analysis, our goal is to classify whether a sentence is positive or negative. Suppose we have a dataset containing sentences along with their labels. 

Consider a sentence: 'I love Paris'. First, we tokenize the sentence, add the [CLS] token at the beginning, and add the [SEP] token at the end of the sentence. Then, we feed the tokens as an input to the pre-trained BERT and get the embeddings of all the tokens. 

Next, we ignore the embedding of all other tokens and take only the embedding of [CLS] token which is $R_{[CLS]}$. The embedding of the [CLS] token will hold the aggregate representation of the sentence. We feed $R_{[CLS]}$ to a classifier (feed-forward network with softmax function) and train the classifier to perform sentiment analysis. 

Wait! How does this differ from what we saw earlier? How does finetuning the pre-trained BERT differ from using the pre-trained BERT as a feature extractor?

When we use the pre-trained BERT as a feature extractor (as in the "Generating BERT Embeddings" section), we extract the embedding $R_{[CLS]}$ of a sentence, feed it to a classifier, and train the classifier to perform classification. Similarly, during finetuning, we feed the embedding $R_{[CLS]}$ to a classifier and train the classifier to perform classification.

The difference is that when we finetune the pre-trained BERT, we can update the weights of the pre-trained BERT along with the classifier. But when we use the pre-trained BERT as a feature extractor, we update only the weights of the classifier and not those of the pre-trained BERT. 

During finetuning, we can adjust the weights of the model in the following two ways:

- Update the weights of the pre-trained BERT along with the classification layer 
- Update only the weights of the classification layer and not the pre-trained BERT. When we do this, it becomes the same as using the pre-trained BERT as a feature extractor.

The following figure shows how we finetune the pre-trained BERT for the sentiment analysis task:


![title](../../../images/text-clf-BERT.jpg)

As we can observe from the preceding figure, we feed the tokens to the pre-trained BERT and get the embedding of all the tokens. We take the embedding of [CLS] token and feed it to a feedforward network with a softmax function and perform classification. 
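To make the classification step concrete, here is a minimal sketch of a single linear layer followed by softmax applied to a toy $R_{[CLS]}$ vector. Everything in it is an illustrative assumption: the 4-dimensional vector, the weight values, and the choice of row 0 as "negative" and row 1 as "positive". A real head works on a 768-dimensional vector with learned weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(r_cls, weight, bias):
    """Linear layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, r_cls)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)


r_cls = [0.2, -0.4, 0.9, 0.1]            # toy stand-in for R_[CLS]
weight = [[0.5, -0.3, 0.8, 0.0],         # hypothetical "negative" class weights
          [-0.5, 0.3, -0.8, 0.0]]        # hypothetical "positive" class weights
bias = [0.0, 0.0]

probs = classify(r_cls, weight, bias)
print(probs)  # two class probabilities that sum to 1
```

During finetuning, the gradient of the classification loss flows through this layer and, unless BERT is frozen, into the BERT weights as well.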

Let's get a better understanding of how finetuning works by getting hands-on with finetuning the pre-trained BERT for sentiment analysis in the next section. 

## Finetuning BERT for Sentiment Analysis 
Let's explore how to finetune the pre-trained BERT for a sentiment analysis task with the IMDB dataset. The IMDB dataset consists of movie reviews along with the respective sentiment. You should have the dataset in the `dataset` folder.

First, let's import the necessary libraries: 

### Import the Dependencies

In [16]:
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from nlp import load_dataset  # note: the nlp package was later renamed to datasets (from datasets import load_dataset)
import torch
import numpy as np

### Loading the Model and Dataset

In [17]:
dataset = load_dataset('csv', data_files='../../../resources/day_12/imdbs.csv', split='train')

Using custom data configuration default



Let us check the datatype:

In [18]:
type(dataset)

nlp.arrow_dataset.Dataset


Next, let's split the dataset into train and test set:

### Train-Test Split

In [19]:
dataset = dataset.train_test_split(test_size=0.3)


Let's print the dataset:

In [20]:
dataset

{'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 70),
 'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 30)}


Now, we create the train and test sets:




In [21]:
train_set = dataset['train']
test_set = dataset['test']


Next, let's download and load the pre-trained BERT model. In this example, we use the pre-trained bert-base-uncased model. As we can observe below, since we are performing sequence classification, we use the BertForSequenceClassification class: 


### Download and Load Pre-trained Model

In [22]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at


Next, we download and load the tokenizer that was used for pre-training the bert-base-uncased model.
As we can observe, we create the tokenizer using the `BertTokenizerFast` class instead of `BertTokenizer`. The `BertTokenizerFast` class has many advantages compared to `BertTokenizer`. We will learn about this in the next section: 


In [23]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')


Now that we have loaded the dataset and model, let's preprocess the dataset. 

### Preprocess the Dataset
We can preprocess the dataset in a quicker way using our tokenizer. For example, consider the sentence: 'I love Paris'.  

First, we tokenize the sentence and add the [CLS] token at the beginning and [SEP] token at the end as shown below: 


tokens = [ [CLS], I, love, Paris, [SEP] ]


Next, we map the tokens to the unique input ids (token ids). Suppose the following are the unique input ids (token ids):


input_ids = [101, 1045, 2293, 3000, 102]

Then, we need to add the segment ids (token type ids). Wait, what are segment ids? Suppose we have two sentences in the input. In that case, segment ids are used to distinguish one sentence from the other. All the tokens from the first sentence will be mapped to 0 and all the tokens from the second sentence will be mapped to 1. Since here we have only one sentence, all the tokens will be mapped to 0 as shown below:


token_type_ids = [0, 0, 0, 0, 0]
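To see how segment ids look when there are two sentences, consider the following minimal sketch in plain Python. The helper `build_segment_ids` is our own illustration, not a transformers function; in practice the tokenizer produces `token_type_ids` for us.

```python
def build_segment_ids(tokens_a, tokens_b=None):
    """Build '[CLS] A [SEP]' or '[CLS] A [SEP] B [SEP]' with token type IDs."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]']
    segment_ids = [0] * len(tokens)          # first sentence, including [CLS] and its [SEP]
    if tokens_b is not None:
        second = tokens_b + ['[SEP]']
        tokens = tokens + second
        segment_ids = segment_ids + [1] * len(second)   # second sentence and its [SEP]
    return tokens, segment_ids


pair_tokens, pair_segments = build_segment_ids(['i', 'love', 'paris'], ['birds', 'fly'])
print(pair_tokens)    # ['[CLS]', 'i', 'love', 'paris', '[SEP]', 'birds', 'fly', '[SEP]']
print(pair_segments)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

With a single sentence, the same helper returns all zeros, matching the example above.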


Now, we need to create the attention mask. We know that an attention mask is used to differentiate the actual tokens from the [PAD] tokens. It maps all the actual tokens to 1 and the [PAD] tokens to 0. Suppose our token list should have a length of 5. Our token list already has 5 tokens, so we don't have to add any [PAD] tokens. Our attention mask then becomes: 


attention_mask = [1, 1, 1, 1, 1]


That's it. But instead of doing all the above steps manually, our tokenizer will do these steps for us. We just need to pass the sentence to the tokenizer as shown below: 


In [24]:
tokenizer('I love Paris')

{'input_ids': [101, 1045, 2293, 3000, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}


With the tokenizer, we can also pass any number of sentences and perform padding dynamically. Setting `padding` to `True` pads every sentence to the length of the longest sentence in the batch. We can also pass a maximum sequence length, `max_length`; note that it acts as an upper limit only when `truncation=True` is also set. For instance, as shown below, we pass three sentences and set `max_length` to 5:


In [25]:
tokenizer(['I love Paris', 'birds fly','snow fall'], padding = True, max_length=5)

{'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]}
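The following plain-Python sketch mimics the two padding behaviours on the token IDs printed above: padding to the longest sequence in the batch, and padding (with truncation) to a fixed length. The function `pad_batch` is a hypothetical illustration and not the tokenizer's actual implementation; in particular, its naive truncation simply cuts the list, whereas the real tokenizer keeps the special tokens in place.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Pad (and, if max_length is given, truncate) lists of token IDs."""
    if max_length is None:
        target = max(len(ids) for ids in batch)   # pad to the longest sequence
    else:
        target = max_length                       # fixed length for every sequence
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:target]                        # naive truncation (illustration only)
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


batch = [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102], [101, 4586, 2991, 102]]
batch_ids, batch_mask = pad_batch(batch)
print(batch_ids)
print(batch_mask)
```

Calling `pad_batch(batch, max_length=8)` instead would pad all three sequences to length 8, regardless of the longest sentence.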


That's it, with the tokenizer, we can easily preprocess our dataset. So we define a function called preprocess for processing the dataset as shown below: 


In [26]:
def preprocess(data):
    return tokenizer(data['text'], padding=True, truncation=True)


Now, we preprocess the train and test set using the preprocess function: 


In [30]:
train_set = train_set.map(preprocess, batched=True, batch_size=len(train_set))
test_set = test_set.map(preprocess, batched=True, batch_size=len(test_set))


Next, we use the set_format function and select the columns which we need in our dataset and also in which format we need them as shown below:  


In [31]:
train_set.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_set.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

That's it. Now that we have the dataset ready, let's train the model. 

### Training the Model 


Define the batch size and the number of epochs: 

In [32]:
# Use a larger batch size if you have more VRAM
batch_size = 1
epochs = 2


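Define the warmup steps and weight decay. Note that this run performs only 140 optimization steps in total (see the training log below), so with 500 warmup steps the learning rate is still warming up when training ends; on a dataset this small, a lower value may be more appropriate: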

In [33]:
warmup_steps = 500
weight_decay = 0.01


Define the training arguments:

In [34]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=warmup_steps,
    weight_decay=weight_decay,
    logging_dir='./logs',
)
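To get a feeling for what these settings mean, the sketch below computes the total number of optimization steps and the learning rate under a linear warmup followed by a linear decay. It assumes the `Trainer` defaults that we believe apply here (a linear schedule and a base learning rate of 5e-5, since we did not set `learning_rate`); the two helper functions are our own and are not part of transformers.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimization steps without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_lr(step, base_lr, warmup_steps, num_training_steps):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - warmup_steps))


steps = total_steps(num_examples=70, batch_size=1, epochs=2)
print(steps)  # 140

# Assumed default base learning rate of 5e-5
final_lr = linear_warmup_lr(steps, base_lr=5e-5, warmup_steps=500,
                            num_training_steps=steps)
print(final_lr)  # roughly 1.4e-05, still inside the warmup phase
```

Under these assumptions, the learning rate only reaches about 28% of its target value by the last step, because the warmup is longer than the whole training run.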



Now define the trainer: 

In [35]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=test_set
)

Start training the model:

In [36]:
import gc

gc.collect()

torch.cuda.empty_cache()

In [37]:
trainer.train()

***** Running training *****
  Num examples = 70
  Num Epochs = 2
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 140


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=140, training_loss=0.6749564579554966, metrics={'train_runtime': 20.8098, 'train_samples_per_second': 6.728, 'train_steps_per_second': 6.728, 'total_flos': 36835547750400.0, 'train_loss': 0.6749564579554966, 'epoch': 2.0})


After training we can evaluate the model using the evaluate function:

In [38]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 30
  Batch size = 1


{'eval_loss': 0.6275491118431091,
 'eval_runtime': 1.014,
 'eval_samples_per_second': 29.587,
 'eval_steps_per_second': 29.587,
 'epoch': 2.0}
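The evaluation output above reports only the loss. `Trainer` also accepts a `compute_metrics` argument, a function that receives the predictions and labels and returns a dictionary of metrics. Below is a minimal sketch of such a function for accuracy, written in plain Python. We assume that the argument unpacks into `(logits, labels)`; the toy logits and labels are invented so that the function can be tried without training a model.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); each row of logits holds per-class scores."""
    logits, labels = eval_pred
    # Index of the highest score in each row is the predicted class
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(preds, labels))
    return {'accuracy': correct / len(preds)}


toy_logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [0.9, 0.3]]
toy_labels = [1, 0, 0, 0]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when creating the `Trainer` should then add the accuracy to the dictionary returned by `trainer.evaluate()`.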


In this way, we can finetune the pre-trained BERT for the text classification task.

## Q&A with Finetuned BERT 

In this section, let's learn how to perform question answering with a finetuned Q&A BERT. First, let us import the necessary modules:

### Import Dependencies

In [39]:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer


Now, we download and load the model. We use the bert-large-uncased-whole-word-masking-finetuned-squad model, which is finetuned on SQuAD (the Stanford Question Answering Dataset). 


### Load Pre-trained Model

In [40]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

loading configuration file https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/config.json from cache at C:\Users\tanch/.cache\huggingface\transformers\402f6d8c99fdd3bffd354782842e2b5a6be81f80ab630591051ebc78ca726f39.ebffac96fee44dbe30674c204dd3d3f358c1b8c33100281ecdd688514f41410a
Model config BertConfig {
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-


Next, we download and load the tokenizer that belongs to the bert-large-uncased-whole-word-masking-finetuned-squad model: 


In [41]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

loading file https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt from cache at C:\Users\tanch/.cache\huggingface\transformers\68e5260dea718cdc2daf27dc106fd8741636b03e3173b5492e57a7fa525ca33b.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/tokenizer_config.json from cache at C:\Users\tanch/.cache\huggingface\transformers\b9f8d92aa5a32cfe504c3524c173dc611dbe81d49392f40601286b94ee1e1169.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79
loading file https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/reso


Now that we have downloaded the model and tokenizer, let's preprocess the input. 

### Preprocessing the Input
First, we define the input to BERT, which is the question and the paragraph text:


In [42]:
question = "What is the immune system?"
paragraph = "The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue."

Add the [CLS] token to the beginning of the question and the [SEP] token to the end of both the question and the paragraph: 

In [43]:
question = '[CLS] ' + question + '[SEP]'
paragraph = paragraph + '[SEP]'


Now, tokenize the question and paragraph: 


In [44]:
question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)



Combine the question and paragraph tokens and convert them to input_ids:

In [45]:
tokens = question_tokens + paragraph_tokens 
input_ids = tokenizer.convert_tokens_to_ids(tokens)



Next, we define the segment_ids. The segment_ids will be 0 for all the tokens of question and it will be 1 for all the tokens of the paragraph:


In [46]:
segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)


Now we convert the input_ids and segment_ids to tensor: 

In [47]:
input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])



Now that we have processed the input, let's feed it to the model and get the result. 

### Getting the Answer
We feed `input_ids` and `segment_ids` to the model, which returns the start score and end score for each token: 


In [48]:
start_scores, end_scores = model(input_ids, token_type_ids = segment_ids, return_dict=False)


Now, we select `start_index`, the index of the token with the highest start score, and `end_index`, the index of the token with the highest end score: 


In [49]:
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
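Taking the two argmax values independently works for this example, but nothing guarantees that the end index comes after the start index. A common refinement, which is not what the cell above does, is to search for the pair with the highest combined score subject to `start <= end`. The sketch below shows that search in plain Python on invented toy scores; the function `best_span` and its `max_answer_len` parameter are hypothetical.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start + end score with start <= end."""
    best = (0, 0)
    best_score = float('-inf')
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start position
        for e in range(s, min(len(end_scores), s + max_answer_len)):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


toy_start = [0.1, 0.3, 2.5, 0.2, 0.1]
toy_end = [3.0, 0.1, 0.2, 0.4, 2.0]

# Independent argmax would give start=2 and end=0, which is not a valid span
print(best_span(toy_start, toy_end))  # (2, 4)
```

To use it with the model output, the score tensors would first need to be converted to plain lists, for example with `start_scores[0].tolist()`.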


That's it! Now, we print the text span between the start and end index as our answer: 

In [50]:
print(' '.join(tokens[start_index:end_index+1]))

a system of many biological structures and processes within an organism that protects against disease
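Joining the tokens with spaces reads well here because every token in the answer is a whole word. When a word is split into WordPiece sub-tokens, the continuation pieces carry a `##` prefix and need to be merged back. The tokenizer offers `convert_tokens_to_string` for this; the plain-Python sketch below shows the idea. The function `merge_wordpieces` is our own, and the way `pathogens` is split in the example is hypothetical.

```python
def merge_wordpieces(tokens):
    """Join tokens into text, gluing '##' continuation pieces to the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith('##') and words:
            words[-1] += tok[2:]      # drop the '##' marker and append to previous word
        else:
            words.append(tok)
    return ' '.join(words)


# Hypothetical split of the word "pathogens" into sub-tokens
print(merge_wordpieces(['known', 'as', 'path', '##ogen', '##s']))  # known as pathogens
```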



We have now learned how to perform question answering with a finetuned BERT model.



# Conclusion

We started this notebook by looking at a simple use of the **pre-trained BERT model** through Hugging Face's transformers library. We learned that we can use the pre-trained BERT model in two ways: as a **feature extractor** by extracting embeddings, and by **fine-tuning the pre-trained BERT model** for downstream tasks such as text classification, question-answering, and more.

Then, we learned how to **extract embeddings** from the pre-trained BERT model using Hugging Face's transformers library. Moving on, we learned how to **fine-tune pre-trained BERT for downstream tasks**: we fine-tuned BERT for **text classification** and used a fine-tuned BERT model for **question-answering**.

# Contributors

**Author**
<br>Chee Lam

# References

Sudharsan Ravichandiran. *Getting Started with Google BERT: Build and train state-of-the-art natural language processing models using BERT*. Packt Publishing Ltd, 2021.