This kernel shows absolute beginners how RoBERTa tokenizes text and what the output of RoBERTa looks like. Once you have seen the output of the model, you can pass it through linear layers whose dimensions depend on your dataset and use case.

I am using [Abhishek Thakur's](https://www.kaggle.com/abhishek) pretrained roberta base [model](https://www.kaggle.com/abhishek/roberta-base) and data from the contest [Tweet Sentiment Extraction](https://www.kaggle.com/c/tweet-sentiment-extraction) in this kernel.

In [None]:
import pandas as pd
import transformers
import tokenizers
import torch
import torch.nn as nn

In [None]:
data = pd.read_csv('../input/tweet-sentiment-extraction/train.csv')
data.head()

In [None]:
ROBERTA_PATH = '../input/roberta-base'

# How does RoBERTa tokenize the text?

In [None]:
MAX_LEN = 192
TOKENIZER = tokenizers.ByteLevelBPETokenizer(vocab_file=f"{ROBERTA_PATH}/vocab.json", 
                                             merges_file=f"{ROBERTA_PATH}/merges.txt", 
                                             add_prefix_space=True, 
                                             lowercase=True)

Note that the argument add_prefix_space is set to True, so the tokenizer adds a space to the start of the text passed into it.

In [None]:
tokens = TOKENIZER.encode(data.text.values[0])
tokens

In [None]:
tokens.tokens

The tokenizer puts the special character 'Ġ' at the start of every token that was preceded by a space in the text. When a word is split into several pieces, only the first piece carries the 'Ġ'.
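The 'Ġ' marker comes from the byte-to-character table used by byte-level BPE. The sketch below rebuilds that table the way the GPT-2 reference code defines it and shows that the space byte maps to 'Ġ'. The names bytes_to_unicode, BYTE_TO_CHAR and to_byte_level are used here for illustration only; they are not attributes of the tokenizer objects in this kernel.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every other byte (space, control characters, ...) is shifted past 255.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Show text the way a byte-level BPE vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level(" have"))  # Ġhave
```

The space is byte 32, which is not in the printable ranges, so it is moved to code point 256 + 32, which is 'Ġ'.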

Also note that the tokenizer has not added any special tokens.
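Since the tokenizer leaves the special tokens out, they have to be added by hand before the ids go into the model. A minimal sketch, assuming the usual roberta-base ids of 0 for the start token and 2 for the end/separator token (the same values this kernel uses further down). The helper names wrap_single and wrap_pair and the example ids are made up for illustration.

```python
BOS_ID = 0  # start-of-sequence token, assumed id in the roberta-base vocab
EOS_ID = 2  # end-of-sequence / separator token, assumed id in the roberta-base vocab


def wrap_single(ids):
    """Wrap one sequence as: <s> ids </s>."""
    return [BOS_ID] + list(ids) + [EOS_ID]


def wrap_pair(first_ids, second_ids):
    """Wrap two sequences as: <s> first </s> </s> second </s>."""
    return [BOS_ID] + list(first_ids) + [EOS_ID, EOS_ID] + list(second_ids) + [EOS_ID]


example_ids = [100, 200, 300]  # made-up token ids, not real tokenizer output
print(wrap_single(example_ids))       # [0, 100, 200, 300, 2]
print(wrap_pair([100], [200, 300]))   # [0, 100, 2, 2, 200, 300, 2]
```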

In [None]:
tokens.ids

In [None]:
tokens.type_ids

RoBERTa doesn't use the type_ids, so we will always pass an all-zero vector with the same length as the token ids.

In [None]:
tokens.offsets

Here the first token is 'i', which is the first piece we get when "i'd" in the input sentence is split. 'i' is the first character of the sentence, so you might expect the offset to be (0, 1), but we have to take into account that the tokenizer added a space to the text.
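Offsets are what let you go from a character span (for example a selected phrase) to token positions. Here is a small sketch of that lookup. The offsets are written by hand for illustration and are not real tokenizer output, and tokens_in_char_span is a hypothetical helper name.

```python
def tokens_in_char_span(offsets, char_start, char_end):
    """Return indices of tokens whose (start, end) offsets overlap [char_start, char_end)."""
    return [
        idx
        for idx, (tok_start, tok_end) in enumerate(offsets)
        if tok_start < char_end and tok_end > char_start
    ]


text = "i'd have responded"
# Hand-written offsets for the pieces "i", "'d", " have", " responded".
offsets = [(0, 1), (1, 3), (3, 8), (8, 18)]

char_start = text.index("have")
char_end = char_start + len("have")
print(tokens_in_char_span(offsets, char_start, char_end))  # [2]
```

With real tokenizer output you would pass tokens.offsets instead of the hand-written list, after checking how the leading space is reflected in the offsets of your tokenizers version.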

In [None]:
tokens.attention_mask

All ones, as there is no padding.

These are the attributes that are needed to train RoBERTa.
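When you batch tweets for training, each one has to be padded to a common length such as MAX_LEN, and the attention mask gets zeros over the padding. A minimal sketch, assuming the pad token id is 1 (its usual value in the roberta-base vocab; check your vocab.json). pad_to_max_len is a hypothetical helper and the ids are made up.

```python
PAD_ID = 1  # assumed id of the padding token in the roberta-base vocab


def pad_to_max_len(ids, attention_mask, max_len, pad_id=PAD_ID):
    """Right-pad ids with pad_id and the attention mask with zeros up to max_len."""
    if len(ids) > max_len:
        raise ValueError("sequence is longer than max_len; truncate it first")
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


ids, mask = pad_to_max_len([0, 5, 6, 2], [1, 1, 1, 1], max_len=8)
print(ids)   # [0, 5, 6, 2, 1, 1, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```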

## What happens if we set add_prefix_space to False?

In [None]:
expt_tokenizer = tokenizers.ByteLevelBPETokenizer(vocab_file=f"{ROBERTA_PATH}/vocab.json", 
                                                merges_file=f"{ROBERTA_PATH}/merges.txt", 
                                                add_prefix_space=False, 
                                                lowercase=True)

temp = expt_tokenizer.encode(data.text.values[0])
temp.tokens, temp.ids, temp.type_ids, temp.offsets, temp.attention_mask

The results are exactly the same for this tweet. The flag only matters when the text does not already start with a space. With add_prefix_space set to True the tokenizer puts a space in front of such a text, so the first word is tokenized like a word in the middle of a sentence, but encode followed by decode no longer conserves the absence of a space at the beginning of the string. With it set to False no space is added. Look at the example below.

In [None]:
expt_tokenizer.decode(expt_tokenizer.encode("Hello").ids)

In [None]:
TOKENIZER.decode(TOKENIZER.encode("Hello").ids)

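Notice the difference in the results: only the tokenizer with add_prefix_space set to True gives back the string with a leading space.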

# Pretrained RoBERTa output

In [None]:
conf = transformers.RobertaConfig.from_pretrained(ROBERTA_PATH)
model = transformers.RobertaModel.from_pretrained(ROBERTA_PATH, config=conf)

In [None]:
ids = torch.tensor([[0] + tokens.ids + [2]])
attention_mask = torch.tensor([[1, 1] + tokens.attention_mask])
token_type_ids = torch.tensor([tokens.type_ids + [0, 0]])

In [None]:
output = model(ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

In [None]:
len(output)

In [None]:
output[0].shape

The first output is the sequence output: one tensor of size 768 for each of the 12 tokens.
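This per-token output is what you would feed into a linear layer for a task like span extraction. A minimal sketch, using a random tensor as a stand-in for output[0] so it runs without the pretrained weights. The head size of 2 (one start score and one end score per token) and the name span_head are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for output[0]: (batch, tokens, hidden) = (1, 12, 768).
sequence_output = torch.randn(1, 12, 768)

# Hypothetical head: one start score and one end score per token.
span_head = nn.Linear(768, 2)

logits = span_head(sequence_output)                 # (1, 12, 2)
start_logits, end_logits = logits.split(1, dim=-1)  # two tensors of shape (1, 12, 1)
start_logits = start_logits.squeeze(-1)             # (1, 12)
end_logits = end_logits.squeeze(-1)                 # (1, 12)

print(start_logits.shape, end_logits.shape)
```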

In [None]:
output[1].shape

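The second output is the pooled output. It is not an average over the tokens: the model takes the 768-sized tensor of the first token from the first output and passes it through a linear layer followed by a tanh activation.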
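To make the pooling step concrete, here is a rough sketch of what the pooler in huggingface's BERT-style models does: take the first token's tensor, apply a dense layer, then tanh. The tensor is random and the dense layer is freshly initialized, so the values are meaningless; only the shapes and the order of operations are the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for output[0]: (batch, tokens, hidden) = (1, 12, 768).
sequence_output = torch.randn(1, 12, 768)

# Freshly initialized stand-in for the model's pooler layer.
dense = nn.Linear(768, 768)

first_token = sequence_output[:, 0]      # (1, 768), hidden state of the first token
pooled = torch.tanh(dense(first_token))  # (1, 768), values between -1 and 1

print(pooled.shape)
```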

# Using the question answering model from huggingface with pretrained model

In [None]:
model = transformers.RobertaForQuestionAnswering.from_pretrained('roberta-base')

We will be using the tokenizer from the examples above.

In [None]:
ques = "What is the name of prime minister of India?"
text = "India is one of the largest country in the world and its current prime minister is Narendra Modi."

In [None]:
tok_ques = TOKENIZER.encode(ques)
tok_text = TOKENIZER.encode(text)

In [None]:
len(tok_ques.ids), len(tok_text.ids)

In [None]:
ids = torch.tensor([[0] + tok_ques.ids + [2, 2] + tok_text.ids + [2]])
attention_mask = torch.tensor([[1] + tok_ques.attention_mask + [1, 1] + tok_text.attention_mask + [1]])
# RoBERTa doesn't make use of token_type_ids, so we can pass an all-zero tensor of the correct dimension
token_type_ids = torch.tensor([[0] + tok_ques.type_ids + [0, 0] + tok_text.type_ids + [0]])

In [None]:
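outputs = model(ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
start, end = outputs[0], outputs[1]  # start and end logits; indexing works for both tuple and ModelOutput returns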

In [None]:
start = nn.Softmax(dim=1)(start)  # normalize over the token dimension
end = nn.Softmax(dim=1)(end)

In [None]:
start.shape, end.shape

In [None]:
n = start.shape[1]
max_ij = 0

start_idx = None
end_idx = None

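context_start = len(tok_ques.ids) + 3  # skip <s>, the question tokens and the two </s> separators
for i in range(context_start, n-2):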
    for j in range(i+1, n-1):
        if start[0][i] + end[0][j] > max_ij:
            max_ij = start[0][i] + end[0][j]
            start_idx = i
            end_idx = j

In [None]:
start_idx, end_idx, max_ij

In [None]:
result = ids[0][start_idx: end_idx+1].tolist()  # plain Python ints for the tokenizer

In [None]:
TOKENIZER.decode(ids=result)

Without any fine-tuning the prediction is way off the answer, because the question answering head on top of 'roberta-base' is randomly initialized. The point of this kernel wasn't accuracy but to show how we can use huggingface's implementation of RoBERTa.

With fine-tuning and a better function for choosing the start and end indices, RoBERTa can give very accurate results.
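As one possible direction for a better selection function, here is a sketch that differs from the loop above in three ways: it scores a span by the product of the start and end probabilities instead of their sum, it allows single-token answers, and it caps the answer length. The name best_span, the default cap of 30 tokens and the toy probabilities are assumptions for illustration, not what the kernel or huggingface uses.

```python
def best_span(start_probs, end_probs, context_start, context_end, max_answer_len=30):
    """Return (start_idx, end_idx, score) of the highest-scoring span.

    Only positions in [context_start, context_end) are considered,
    and the span may be a single token (start_idx == end_idx).
    """
    best = (None, None, float("-inf"))
    for i in range(context_start, context_end):
        last = min(context_end, i + max_answer_len)
        for j in range(i, last):
            score = start_probs[i] * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Toy probabilities over 5 positions; position 0 stands for a special token.
toy_start = [0.1, 0.1, 0.6, 0.1, 0.1]
toy_end = [0.1, 0.1, 0.1, 0.6, 0.1]

start_idx, end_idx, score = best_span(toy_start, toy_end, context_start=1, context_end=5)
print(start_idx, end_idx)  # 2 3
```

With the tensors from this kernel you could pass start[0].tolist() and end[0].tolist(), with context_start set to the first context position and context_end to the position of the final end token.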

# fin