<center><a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a></center>

# 6. Natural Language Processing

In this tutorial, we'll take a detour away from stand-alone pieces of data such as still images, to data that is dependent on other data items in a sequence. For our example, we'll use text sentences. Language is naturally composed of sequence data, in the form of characters in words, and words in sentences. Other examples of sequence data include stock prices and weather data over time. Videos, while containing still images, are also sequences. Elements in the data have a relationship with what comes before and what comes after, and this fact requires a different approach.

## 6.1 Objectives

* Use a tokenizer to prepare text for a neural network
* See how embeddings are used to identify numerical features for text data

## 6.2 BERT

BERT, which stands for **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, was a ground-breaking model introduced in 2018 by [Google](https://www.google.com/).

BERT is simultaneously trained on two goals:
* Predict a missing (masked) word in a sequence of words
* Predict whether one sentence follows another

Let's see BERT in action with these two types of challenges.
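
To make the two training goals more concrete, here is a minimal plain-Python sketch of what a training example for each might look like. The helper names (`make_masked_example`, `make_sentence_pair`) and the toy sentences are made up for illustration, and the sentence task is framed as deciding whether a second sentence really follows the first. BERT's actual pre-training procedure is more involved than this.

```python
import random

def make_masked_example(words, mask_position, mask_token="[MASK]"):
    """Hide one word; return the masked input and the word to recover."""
    masked = list(words)
    label = masked[mask_position]
    masked[mask_position] = mask_token
    return masked, label

def make_sentence_pair(sentences, index, rng):
    """Return (first, second, is_next) for a toy next-sentence task.

    Assumes index + 1 is a valid position and the sentences are distinct.
    """
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], True
    # Otherwise pick a sentence that is neither `first` nor the true next one
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(candidates), False

toy_sentences = ["I woke up.", "I made tea.", "The tea was hot.", "I went out."]
masked_words, label = make_masked_example(["I", "made", "tea"], 1)
pair = make_sentence_pair(toy_sentences, 0, random.Random(0))
print(masked_words, label)
print(pair)
```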

## 6.3 Tokenization

Since neural networks are number crunching machines, let's turn text into numerical tokens. Let's load BERT's [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#tokenizer):

In [1]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM, BertForQuestionAnswering
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

The BERT `tokenizer` can [encode](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode) multiple texts at once. We will later test BERT's memory, so let's give it information and a question about that information. Feel free to come back here later and try a different combination of sentences.

In [None]:
text_1 = "I understand equations, both the simple and quadratical."
text_2 = "What kind of equations do I understand?"

# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)
indexed_tokens

If we count the number of tokens, there are more tokens than words in our sentences. Let's see why that is. We can use [convert_ids_to_tokens](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.convert_ids_to_tokens) to see what was used as tokens.

In [None]:
tokenizer.convert_ids_to_tokens(indexed_tokens)

There are two reasons why the indexed list is longer than our original input:
1. The `tokenizer` adds `special_tokens` to represent the start (`[CLS]`) of a sequence and separation (`[SEP]`) between sentences.
2. The `tokenizer` can break a word down into multiple parts.

From a linguistic perspective, the second one is interesting. Many languages have [word roots](https://en.wikipedia.org/wiki/List_of_Greek_and_Latin_roots_in_English), or components that make up a word. For instance, the word "quadratic" has the root "quadr", which means "four". Rather than use word roots as defined by a language, BERT uses a [WordPiece](https://paperswithcode.com/method/wordpiece) model to find patterns in how to break up a word. The BERT model we will be using today has a vocabulary of `28996` tokens.
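
To get a feel for how a word can be broken into pieces, here is a minimal sketch of greedy longest-match-first splitting, the strategy commonly used to apply a WordPiece vocabulary. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration; the real vocabulary is learned from data, so BERT's actual split of a word may differ.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # No piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not BERT's real one
toy_vocab = {"quad", "##ratic", "##al", "simple", "equation", "##s"}
print(wordpiece_tokenize("quadratical", toy_vocab))  # ['quad', '##ratic', '##al']
print(wordpiece_tokenize("equations", toy_vocab))    # ['equation', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```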

If we want to [decode](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode) our encoded text directly, we can. Notice the `special_tokens` have been added in.

In [None]:
tokenizer.decode(indexed_tokens)
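
Decoding has to glue word pieces back together. As a rough sketch of that idea only, the hypothetical helper `join_wordpieces` below attaches every piece that starts with `##` to the word before it. The real `decode` method does more than this, such as tidying up spacing around punctuation.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces into the preceding word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # Drop the '##' marker and append
        else:
            words.append(token)
    return " ".join(words)

# Hand-written toy tokens, not real tokenizer output
toy_tokens = ["[CLS]", "quad", "##ratic", "##al", "[SEP]"]
print(join_wordpieces(toy_tokens))  # [CLS] quadratical [SEP]
```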

## 6.4 Segmenting Text

To make predictions, the BERT model also needs a list of `segment_ids`. This is a vector the same length as our tokens that indicates which segment (here, which sentence) each token belongs to.

Since our `tokenizer` added in some `special_tokens`, we can use these special tokens to find the segments. First, let's define which index corresponds to which special token.

In [None]:
cls_token = 101
sep_token = 102

Next, we can create a `for` loop. We'll start with our `segment_id` set to `0`, and we'll increment the `segment_id` whenever we see the `[SEP]` token. For good measure, we will return both the `segment_ids` and `indexed_tokens` as tensors, as we will be feeding these into the model later.

In [None]:
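def get_segment_ids(indexed_tokens):
    segment_ids = []
    segment_id = 0
    for token in indexed_tokens:
        # Record the segment first so each [SEP] stays with the sentence it closes
        segment_ids.append(segment_id)
        if token == sep_token:
            segment_id += 1  # Tokens after a [SEP] belong to the next segment
    # Wrap in a list to add the batch dimension the model expects
    return torch.tensor([segment_ids]), torch.tensor([indexed_tokens])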

Let's test it out. Does each number correctly correspond to the first and second sentence?

In [None]:
segments_tensors, tokens_tensor = get_segment_ids(indexed_tokens)
segments_tensors
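
One way to answer that question is to compare against the layout conventionally used for a BERT sentence pair: `0` for `[CLS]`, the first sentence, and the first `[SEP]`, then `1` for the second sentence and the final `[SEP]`. Here is a small plain-Python sketch of that layout. The helper name `expected_segment_ids` and the token IDs other than `101` and `102` are placeholders made up for illustration.

```python
def expected_segment_ids(token_ids, sep_id=102):
    """Segment 0 up to and including the first [SEP], segment 1 afterwards."""
    first_sep = token_ids.index(sep_id)
    n_first = first_sep + 1
    return [0] * n_first + [1] * (len(token_ids) - n_first)

# [CLS] a b [SEP] c d e [SEP], with placeholder IDs for the words
toy_ids = [101, 11, 12, 102, 21, 22, 23, 102]
print(expected_segment_ids(toy_ids))  # [0, 0, 0, 0, 1, 1, 1, 1]
```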

## 6.5 Text Masking

Let's start with the focus BERT has on words. To train for word embeddings, BERT masks out a word in a sequence of words. The mask is its own special token:

In [None]:
tokenizer.mask_token

In [None]:
tokenizer.mask_token_id

Let's take our two sentences from before and mask out the position at index `5`. Feel free to return here to change the index to see how it changes the results!

In [None]:
masked_index = 5

Next, we'll apply the mask and verify it appears in our sequence of sentences.

In [None]:
indexed_tokens[masked_index] = tokenizer.mask_token_id
tokens_tensor = torch.tensor([indexed_tokens])
tokenizer.decode(indexed_tokens)

Then, we will load the model used to predict the missing word: `BertForMaskedLM`.

In [None]:
masked_lm_model = BertForMaskedLM.from_pretrained("bert-base-cased")

Just like with other PyTorch modules, we can check the architecture.

In [None]:
masked_lm_model

Can you spot the section labeled `word_embeddings`? These are the embeddings BERT learned for each token.

In [None]:
embedding_table = next(masked_lm_model.bert.embeddings.word_embeddings.parameters())
embedding_table

We can verify there is an embedding of size `768` for each of the `28996` tokens in BERT's vocabulary.

In [None]:
embedding_table.shape
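
Conceptually, an embedding table is a lookup: each token ID selects one row, and that row is the vector the model uses for the token. Here is a minimal plain-Python sketch with a made-up table of 4 tokens and 3 features each (BERT's own table is far larger). The names `toy_embedding_table` and `embed` are hypothetical.

```python
# One row per token ID, one column per feature (values are invented)
toy_embedding_table = [
    [0.1, 0.2, 0.3],    # token ID 0
    [0.0, -0.5, 0.9],   # token ID 1
    [0.7, 0.7, -0.1],   # token ID 2
    [-0.3, 0.4, 0.2],   # token ID 3
]

def embed(token_ids, table):
    """Look up the embedding row for each token ID."""
    return [table[token_id] for token_id in token_ids]

vectors = embed([2, 0, 2], toy_embedding_table)
toy_shape = (len(toy_embedding_table), len(toy_embedding_table[0]))
print(toy_shape)  # (4, 3)
print(vectors)
```

Notice that the same token ID always maps to the same row, so the first and last vectors above are identical.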

Let's test the model! Can it correctly predict the missing word in our provided sentences? We will use [torch.no_grad](https://pytorch.org/docs/stable/generated/torch.no_grad.html) to inform PyTorch not to calculate a gradient.

In [None]:
with torch.no_grad():
    predictions = masked_lm_model(tokens_tensor, token_type_ids=segments_tensors)
predictions

This is a little hard to read, so let's look at the `shape` to get a better sense of what's going on.

In [None]:
predictions[0].shape

The `24` is the number of tokens in our input, and the `28996` is the number of scores at each position, one for every token in BERT's vocabulary. We'd like to find the highest value across all the tokens in the vocabulary, so we can use [torch.argmax](https://pytorch.org/docs/stable/generated/torch.argmax.html) to find it.

In [None]:
# Get the predicted token
predicted_index = torch.argmax(predictions[0][0], dim=1)[masked_index].item()
predicted_index
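
If the indexing above feels dense, here is the same idea on a tiny made-up example in plain Python: one row of scores per token position, one column per vocabulary entry. We take the best column in every row, then read off the row for the masked position. The scores and the names `toy_scores` and `toy_masked_index` are invented for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# 3 token positions, 4 vocabulary entries (scores are invented)
toy_scores = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, -0.4],
    [-0.2, 0.0, 0.4, 3.1],
]
toy_masked_index = 2

best_per_position = [argmax(row) for row in toy_scores]
predicted_at_mask = best_per_position[toy_masked_index]
print(best_per_position)  # [1, 0, 3]
print(predicted_at_mask)  # 3
```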

Let's see what token `1241` corresponds to:

In [None]:
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
predicted_token

What do you think? Is it correct?

In [None]:
tokenizer.decode(indexed_tokens)

## 6.6 Question and Answering

While word masking is interesting, BERT was designed for more complex problems such as sentence prediction. It is able to accomplish this by building on the attention-based [Transformer](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) architecture.

We will be using a different version of BERT for this section, which has its own tokenizer. Let's find a new set of tokens for our sample sentences.

In [None]:
text_1 = "I understand equations, both the simple and quadratical."
text_2 = "What kind of equations do I understand?"

question_answering_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
indexed_tokens = question_answering_tokenizer.encode(text_1, text_2, add_special_tokens=True)
segments_tensors, tokens_tensor = get_segment_ids(indexed_tokens)

Next, let's load the `question_answering_model`.

In [None]:
question_answering_model = BertForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

We can feed in our tokens and segments, just like when we were masking out a word.

In [None]:
# Predict the start and end positions logits
with torch.no_grad():
    out = question_answering_model(tokens_tensor, token_type_ids=segments_tensors)
out

The `question_answering_model` scans through our input sequence to find the subsequence that best answers the question. The higher the value in `start_logits`, the more likely it is that the answer starts on that token.

In [None]:
out.start_logits

Similarly, the higher the value in `end_logits`, the more likely the answer will end on that token.

In [None]:
out.end_logits
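
Logits are raw scores rather than probabilities. If probabilities are easier to read, a softmax turns a list of logits into positive values that sum to 1 while keeping the same ranking. Here is a minimal plain-Python sketch with made-up logits. On tensors like the ones above, `torch.softmax` does the same job.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # Subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [0.5, 4.0, -1.0, 1.5]  # invented scores, one per token
toy_probs = softmax(toy_logits)
print([round(p, 3) for p in toy_probs])
```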

We can then use [torch.argmax](https://pytorch.org/docs/stable/generated/torch.argmax.html) to find the `answer_sequence` from start to finish:

In [None]:
answer_sequence = indexed_tokens[torch.argmax(out.start_logits):torch.argmax(out.end_logits)+1]
answer_sequence
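
The slicing above can be sketched in plain Python: pick the position with the highest start score, pick the position with the highest end score, and keep everything in between, inclusive. The scores, the placeholder token IDs, and the helper name `best_span` are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def best_span(token_ids, start_scores, end_scores):
    """Return the tokens from the best start to the best end, inclusive."""
    start = argmax(start_scores)
    end = argmax(end_scores)
    return token_ids[start:end + 1]

toy_ids = [101, 11, 12, 13, 14, 102]       # placeholder IDs
toy_start = [0.1, 0.2, 4.0, 0.3, 0.1, 0.0]  # highest at position 2
toy_end = [0.0, 0.1, 0.5, 3.5, 0.2, 0.1]    # highest at position 3
print(best_span(toy_ids, toy_start, toy_end))  # [12, 13]
```

One thing to keep in mind with this simple approach: the start and end are chosen independently, so if the best end lands before the best start, the slice comes back empty.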

Finally, let's [decode](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode) these tokens to see if the answer is correct!

In [None]:
question_answering_tokenizer.convert_ids_to_tokens(answer_sequence)

In [None]:
question_answering_tokenizer.decode(answer_sequence)

## 6.7 Summary

Great work! You successfully used a Large Language Model (LLM) to extract answers from a sequence of sentences. Even though BERT was state-of-the-art when it was first released, many other LLMs have since broken new ground. [build.nvidia.com](https://build.nvidia.com/explore/discover) hosts many of these models, which you can interact with in the browser. Go check it out and see where the state-of-the-art is today!

### 6.7.1 Clear the Memory
Before moving on, please execute the following cell to clear up the GPU memory.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### 6.7.2 Next

Congratulations, you have completed all the learning objectives of the course!

As a final exercise, and to earn certification in the course, successfully complete an end-to-end image classification problem in the assessment.
