# Deep Learning
## Exercise 10: Language Tasks and Transformers

This exercise gives you a short overview of the Huggingface Transformers library, a widely used and easy-to-use library.

Resources used in this notebook:
- Github: https://github.com/huggingface/transformers
- Documentation: https://huggingface.co/transformers/
- Model hub: https://huggingface.co/models
- Huggingface datasets: https://huggingface.co/datasets
- Original BERT paper: https://arxiv.org/pdf/1810.04805.pdf
- Nice introductory course: https://huggingface.co/course/chapter1

### 1. Preprocessing text
Before a model can work with textual data, some preprocessing steps are necessary. Models usually rely on a vocabulary of tokens. A token can be, for example, a word, a subword or a punctuation symbol. The vocabulary maps each of these tokens to a token id. A sequence of these token ids is the input to the model. (Compare also the last exercise.)
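
To make the mapping from text to token ids concrete, here is a minimal sketch of a subword tokenizer in plain Python. It assumes a tiny, made-up vocabulary (`toy_vocab`) and a greedy longest-match-first split in the spirit of BERT's WordPiece; the helper names `tokenize_word` and `encode` are hypothetical and not part of the Transformers library.

```python
# Toy vocabulary (hypothetical, far smaller than a real one): token -> token id.
toy_vocab = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "we": 4, "like": 5, ".": 6}


def tokenize_word(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens


def encode(text, vocab):
    """Lowercase, split on whitespace, separate '.' from words, map tokens to ids."""
    tokens = []
    for word in text.lower().replace(".", " .").split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("We like playing.", toy_vocab)
print(tokens)
print(ids)
```

The word `playing` is not in the toy vocabulary, so it is split into `play` and the continuation piece `##ing`. A real tokenizer handles punctuation, casing and unknown characters far more carefully.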

#### Tokenize the given sample inputs

Load the pretrained tokenizer `'google-bert/bert-base-uncased'` then tokenize the sample sentences.
1. What output do you get?
    - Start with calling the tokenizer directly.
    - Also try the functions `tokenizer.encode`, `tokenizer.tokenize`
    - What are differences, what are similarities?
2. Convert the token ids back to textual tokens. What does your text look like?
3. How does the tokenizer handle multiple sentences?
    - What happens if you feed both sentences comma-separated?
    - What happens if you feed both sentences as a list?
    - What happens if you feed the second sentence as `text_pair=s2`?

*Hint*: [Here](https://huggingface.co/docs/transformers/main_classes/tokenizer) you can find more information about the tokenizer.

In [None]:
import torch
from transformers import AutoTokenizer

s1 = "Hi, this is an example sentence."
s2 = "I'm interested in how language processing works"

In [None]:
#ToDo: Try the tokenizer functionalities

In [None]:
# Load the tokenizer for our model
tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased')

# The underlying vocabulary
vocab = tokenizer.get_vocab()
print(f"Vocabulary Length: {len(vocab)} \nSample Entries {list(vocab.items())[:20]}\n")

#1 - tokenize text
print(f"tokenizer(s1): {tokenizer(s1)}")
print(f"tokenizer.tokenize(s1): {tokenizer.tokenize(s1)}")
print(f"tokenizer.tokenize(s1, add_special_tokens=True): {tokenizer.tokenize(s1, add_special_tokens=True)}")
print(f"tokenizer.encode(s1): {tokenizer.encode(s1)}\n")

#2 - convert ids back to tokens
input_ids = tokenizer(s1)['input_ids']
print(f"Tokens: {tokenizer.convert_ids_to_tokens(input_ids)}\n")

#3 - multiple sentences
print(f"comma separated: {tokenizer(s1, s2)}")
print(f"Created Tokens: {tokenizer.convert_ids_to_tokens(tokenizer(s1, s2)['input_ids'])}")
print(f"text_pair: {tokenizer(s1, text_pair=s2)}")
print(f"list: {tokenizer([s1, s2], padding=True)}")

Calling the tokenizer directly gives us three outputs:
1. input_ids: These are the vocabulary ids for the input tokens.
    - We get only those when we call `encode`.
    - We can convert them back to tokens, which gives us a list of tokens. This is similar to calling `tokenize` with special tokens.
2. token_type_ids: These distinguish between two sequences when you feed a text pair, either as a second positional argument or with the `text_pair` keyword. This is useful if you want to do sequence classification, next sentence prediction, question answering etc.
3. attention_mask: Sometimes we don't want to attend to all tokens in a sequence, e.g. when we are using padding. In these cases, the attention mask marks the tokens to attend to with 1 and the tokens to ignore with 0.
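
The three outputs can be rebuilt by hand for a sentence pair. The following is only a sketch: `encode_pair` is a hypothetical helper, the token ids of the two sentences are made up, and the default ids for `[CLS]`, `[SEP]` and `[PAD]` are placeholders chosen to resemble BERT's special tokens.

```python
def encode_pair(ids_a, ids_b, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Build input_ids, token_type_ids and attention_mask for a sentence pair."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and its [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    if n_pad < 0:
        raise ValueError("max_len is too small for this pair")
    # Padding positions get mask 0 so that attention ignores them.
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


encoded = encode_pair([7, 8, 9], [20, 21], max_len=10)
for key, value in encoded.items():
    print(key, value)
```

Comparing this with the output of `tokenizer(s1, s2)` should show the same layout: one `[CLS]` at the front, a `[SEP]` after each sentence, and type id 1 only for the second sentence.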

### 2. Masked language modeling
BERT is pre-trained with masked language modeling. In this task, we randomly mask tokens and let the model predict the masked tokens. To predict them, BERT has to learn some general understanding of language.
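
As a rough illustration of how such training examples can be built, the sketch below masks tokens at random and keeps the original token as the label. It is a simplification: the function name `mask_tokens` and the use of `None` as the "ignore" label are assumptions made for this example, and the original BERT recipe additionally replaces some of the selected tokens by random tokens or leaves them unchanged.

```python
import random

MASK_TOKEN = "[MASK]"
IGNORE = None  # placeholder label for positions that are not predicted


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens by [MASK]; return the masked input and the labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)   # the model should recover the original token
        else:
            masked.append(token)
            labels.append(IGNORE)  # no loss is computed at this position
    return masked, labels


sentence = ["berlin", "is", "the", "capital", "of", "germany", "."]
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```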


#### 1. Load a pretrained BERT model

Load the pretrained `'google-bert/bert-base-uncased'` model for MaskedLM. Identify the MASK token for the model.

In [None]:
# ToDo load the pretrained model

In [None]:
from transformers import AutoModelForMaskedLM

bert = AutoModelForMaskedLM.from_pretrained('google-bert/bert-base-uncased')

print(tokenizer.mask_token)


#### 2. Feed a masked sentence to the model

Sample sentence: `'Berlin is the capital of ??.'`. Replace `??` by the mask token.
What outputs do you get?

In [None]:
# ToDo: Feed a masked sentence to the model

In [None]:
sample_sentence = 'Berlin is the capital of'
masked_sentence = sample_sentence + ' ' + tokenizer.mask_token +'.'
bert_input = tokenizer(masked_sentence, return_tensors='pt')
outputs = bert(**bert_input)
print(outputs)
print(outputs['logits'].shape)

In this case, the model output only contains the logits. Depending on the use case, returning the loss, hidden_states or attentions can also be useful.

#### 3. Extract the most probable tokens for the masked token.
The logits output contains, for each token of the sequence, the prediction scores for all tokens in the vocabulary.

Use these to find the 10 most probable tokens for the masked token.
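
The idea behind the extraction can be tried out on a toy example without any model. The sketch below assumes a made-up score vector over a vocabulary of four tokens; `softmax` and `top_k` are hypothetical helper names. In PyTorch, `torch.softmax` and `torch.topk` perform the same steps on tensors.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Return the indices and probabilities of the k highest-scoring entries."""
    probs = softmax(scores)
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)[:k]
    return order, [probs[i] for i in order]


toy_logits = [2.0, -1.0, 0.5, 3.0]  # one score per vocabulary entry
top_ids, top_probs = top_k(toy_logits, k=2)
print(top_ids, top_probs)
```

Note that the softmax is taken over the whole vocabulary before the top entries are selected. Normalizing only the selected scores would give different, inflated probabilities.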

In [None]:
def get_top_tokens(logits, token_index, topK):
    """
    From the logits extract the most probable token ids.
    
    Input values:
        logits : the output of the BERT model
        token_index : index of the token of interest
        topK : The number of tokens to be extracted.
        
    Output values:
        topK_token_ids : the token ids of the most probable tokens
        topK_probabilities : the probabilities of these tokens
    """
    # ToDo: implement the extraction
    
    
    return topK_token_ids, topK_probabilities

In [None]:
def get_top_tokens(logits, token_index, topK):
    """
    From the logits extract the most probable token ids.
    
    Input values:
        logits : the output of the BERT model
        token_index : index of the token of interest
        topK : The number of tokens to be extracted.
        
    Output values:
        topK_token_ids : the token ids of the most probable tokens
        topK_probabilities : the probabilities of these tokens
    """
    logits = logits.squeeze()[token_index]
    values, indices = torch.sort(logits, descending = True)
    topK_token_ids = indices[:topK]
    topK_probabilities = torch.nn.functional.softmax(values, dim=0)[:topK]
    return topK_token_ids, topK_probabilities

token_ids, token_probs = get_top_tokens(outputs['logits'], token_index=6, topK=10)
tokens = tokenizer.convert_ids_to_tokens(token_ids)
for token, prob in zip(tokens, token_probs):
    print(f"{token}    \t -- \t {prob.item():.4f}")

#### 4. Bias in the Model
Predict the masked token in the following sentences: `'The man/woman/person works as a ??.'` What differences in the results do you observe for man, woman or person? What might be problematic about that?

In [None]:
#ToDo: predict the masked token for the sample sentences.

In [None]:
s1 = 'The woman works as a ' + tokenizer.mask_token + '.'
s2 = 'The man works as a ' + tokenizer.mask_token + '.'
s3 = 'The person works as a ' + tokenizer.mask_token + '.'

for s in [s1, s2, s3]:
    bert_input = tokenizer(s, return_tensors='pt')
    mask_token_id = bert_input['input_ids'].squeeze().numpy().tolist().index(tokenizer.mask_token_id)
    logits = bert(**bert_input)['logits']
    token_ids, token_probs = get_top_tokens(logits, mask_token_id, 10)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    for token, prob in zip(tokens, token_probs):
        print(f"{token}    \t -- \t {prob.item():.4f}")
    print("\n")

### 3. Solving NLP problems with transformers

Now we want to look at more concrete problems than predicting masked words. Language models are usually pre-trained on a language modeling task (e.g., masked language modeling or autoregressive language modeling). In a second step, these models are fine-tuned on a downstream task such as question answering. As models learn general text understanding during pre-training, fine-tuning is usually a much shorter process (a few hours on a single GPU). Luckily, weights for the pre-trained models (and many fine-tuned ones) are publicly available. When fine-tuning, we add a small head on top of BERT for the specific task (typically 1-2 linear layers + activation function).

When we want to solve a real problem, we have multiple options:
1. Using an already fine-tuned model from the huggingface model hub
2. Fine-tuning a model with an available script 
3. Fine-tuning a model with our own script

#### The Huggingface model hub
As mentioned earlier, on the [model hub](https://huggingface.co/models), you can find a huge number of pre-trained and fine-tuned models ready to use. If you work with text, check whether a suitable model is already on there - it can save you a lot of work ;)

Let us see how we can use some of the models:

##### Sentiment analysis
Sentiment analysis is a task where we want to use a model to predict whether a sentence is positive or negative. This is a sentence classification task: we do not predict on the token level, but make one prediction for the entire sequence.

To do so, one linear layer is added on top of the \[CLS] token. In BERT, the \[CLS] token is prepended to every input sequence and is supposed to learn an aggregation of the input. Using this representation of the input sequence, we can classify the sentiment.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
  
tokenizer = AutoTokenizer.from_pretrained("lvwerra/bert-imdb")
model = AutoModelForSequenceClassification.from_pretrained("lvwerra/bert-imdb")

inputs = tokenizer('This is...meh', return_tensors='pt')
outputs = model(**inputs)

print(outputs['logits'])
# logits (unnormalized scores) for the negative and positive class; apply a softmax to obtain probabilities

##### Named entity recognition
Contrary to sentiment analysis, NER is a token classification task: for each token, we want to use a classifier to assign it to one of the following classes.

Abbreviation|Description 
-|- 
O|Outside of a named entity 
B-MISC |Beginning of a miscellaneous entity right after another miscellaneous entity 
I-MISC |Miscellaneous entity 
B-PER |Beginning of a person’s name right after another person’s name 
I-PER |Person’s name 
B-ORG |Beginning of an organisation right after another organisation 
I-ORG |Organisation 
B-LOC |Beginning of a location right after another location 
I-LOC |Location
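
Since the classifier predicts one tag per token, consecutive tokens usually have to be merged into entities afterwards. The following sketch shows one way to do this in plain Python; `group_entities` is a hypothetical helper and the tags are made up rather than produced by a model. The Transformers NER pipeline also accepts an `aggregation_strategy` argument for this purpose.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens of the same entity type into (type, text) pairs."""
    entities = []        # finished (entity_type, text) pairs
    current_type = None
    current_tokens = []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # A span ends at 'O', at an explicit 'B-' tag, or when the type changes.
        if current_tokens and (tag == "O" or prefix == "B" or entity_type != current_type):
            entities.append((current_type, " ".join(current_tokens)))
            current_tokens = []
        if tag != "O":
            current_type = entity_type
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


example_tokens = ["Sarah", "lives", "in", "New", "York", "with", "Anna", "Maria"]
example_tags = ["I-PER", "O", "O", "I-LOC", "I-LOC", "O", "I-PER", "B-PER"]
print(group_entities(example_tokens, example_tags))
```

In the example, `Anna` and `Maria` are two different persons directly following each other, which is exactly the case the `B-` tags in the table are meant for.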

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Sarah and I live in London"

ner_results = nlp(example)
for entity in ner_results:
  print(entity)

##### Summarization


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-6-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-6-6")

seq = """The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres 
(410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 
41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting 
aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest 
free-standing structure in France after the Millau Viaduct."""


inputs = tokenizer(seq, return_tensors='pt')
generated = model.generate(**inputs)

decoded_text = tokenizer.decode(generated[0], clean_up_tokenization_spaces=True, skip_special_tokens=True)

In [None]:
print(decoded_text)

#### Huggingface fine-tuning scripts
If you want to train your own model, the logical first step is looking at the existing fine-tuning [scripts](https://github.com/huggingface/transformers/tree/master/examples/pytorch#examples) in the transformers library. Most of the time, you can find some starting points.

These are easy to run. This example is for training a [multiple choice model](https://github.com/huggingface/transformers/tree/master/examples/pytorch/multiple-choice). You can pull the repository or download the respective files if you want to try it.

In [None]:
# python examples/multiple-choice/run_swag.py \
#   --model_name_or_path roberta-base \
#   --do_train \
#   --do_eval \
#   --learning_rate 5e-5 \
#   --num_train_epochs 3 \
#   --output_dir /tmp/swag_base \
#   --per_gpu_eval_batch_size=16 \
#   --per_device_train_batch_size=16 \
#   --overwrite_output

#### Creating your own finetuning script

For now, we stick with sentiment analysis. As in the previous exercise, we want to classify movie reviews as positive or negative. The data reading is done for you, but this time you have to build your own dataloaders.

In [None]:
# Similarly get the data
import random
import re
import torch
from torchtext import data, datasets, vocab
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

import numpy as np
from collections import Counter, OrderedDict
from sklearn.model_selection import train_test_split

random_seed = 0
data_directory = './data'
debugging = True #This can be set to True, if you want to test your implementation on a smaller subset


random.seed(random_seed)
torch.manual_seed(random_seed)
np.random.seed(random_seed)

max_length = 200   # we want the maximum words in each text instance to be 200.

def text_cleaning(entry):
    entry = re.sub(r'<\w{1,2} />', ' ', entry) #replace <br /> and similar
    entry = re.sub(r'\s+', ' ', entry) #replace multiple spaces by one space
    return entry

# read the dataset, the first call also downloads the dataset. Split the training_data into training and validation
train_set, test_set = datasets.IMDB(root=data_directory)
train_set = list(train_set)
test_set = list(test_set)


if debugging:
    train_labels = [l for l,t in train_set]
    test_labels = [l for l, t in test_set]
    train_set, _ = train_test_split(train_set, train_size=0.2, stratify=train_labels, random_state=random_seed)
    test_set, _ = train_test_split(test_set, train_size=0.2, stratify=test_labels, random_state=random_seed)

train_labels = [l for l,t in train_set]
train_set, val_set = train_test_split(train_set, train_size=0.7, stratify=train_labels, random_state=random_seed)

##### 1. Create a Sentiment Classification Model

You should build your model out of:
- A pretrained BERT model
- a dropout layer with dropout probability of 0.3 applied to the BERT output for the `[CLS]` token.
- a fully connected layer mapping the output of the dropout layer to the prediction output
- a sigmoid activation for the prediction output

*Hint*: to get the right dimensions for the fully connected layer, check the BERT config.

In [None]:
from transformers import AutoModel, AutoConfig, AutoTokenizer
from torch import nn

#ToDo: Fill the __init__() and forward() functions. Add arguments if needed.
class BERTClassifier(nn.Module):
    def __init__(self, ):
        super(BERTClassifier, self).__init__()
        self.bert = AutoModel.from_pretrained('google-bert/bert-base-uncased')
        

    def forward(self, ):

In [None]:
from transformers import AutoModel, AutoConfig, AutoTokenizer
from torch import nn

class BERTClassifier(nn.Module):
    def __init__(self, hidden_size):
        super(BERTClassifier, self).__init__()
        self.bert = AutoModel.from_pretrained('google-bert/bert-base-uncased')
        self.dropout = nn.Dropout(0.3)
        self.classify = nn.Linear(in_features=hidden_size, out_features=1)
        self.sigm = nn.Sigmoid()

    def forward(self, bert_input):
        bert_output = self.bert(**bert_input)
        cls_output = bert_output['last_hidden_state'][:,0,:]
        out = self.dropout(cls_output)
        out = self.classify(out).squeeze()
        out = self.sigm(out)
        return out
    
#test
tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased')
config = AutoConfig.from_pretrained('google-bert/bert-base-uncased')
s = 'this is a good movie.'
BC = BERTClassifier(config.hidden_size)
bert_input = tokenizer(s, return_tensors='pt')
print(BC(bert_input))

##### 2. Create the DataLoaders

Now that we have the model, we need to bring our train, validation and test sets into a format that we can feed to the model.

Create `torch.utils.data.DataLoader` for them. 

In [None]:
#ToDo: Implement the train, val and test dataloaders.

In [None]:
from torch.utils.data import DataLoader


def collate_fn(batch):
    labels, batched_texts = [], []
    for label, text in batch:
        labels.append(1 if label =='pos' else 0)
        batched_texts += [text]    
        
    labels = torch.tensor(labels)
    tokenized_input = tokenizer(batched_texts,
                                padding=True, truncation=True, max_length=max_length,
                                return_tensors='pt')    
    return labels, tokenized_input

train_dataloader = DataLoader(train_set, batch_size=32, shuffle=True, 
                              collate_fn=collate_fn, drop_last=True)

val_dataloader = DataLoader(val_set, batch_size=32, shuffle=True, 
                              collate_fn=collate_fn, drop_last=True)

test_dataloader = DataLoader(test_set, batch_size=32, shuffle=False, 
                              collate_fn=collate_fn, drop_last=False)


##### 3. Training and Evaluation

Implement training and evaluation for your model. You can reuse your code from the previous exercise. Use binary cross-entropy loss and the Adam optimizer, and train for a maximum of 20 epochs with early stopping.
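
Early stopping can be looked at in isolation before wiring it into a training loop. The sketch below replays a made-up list of validation losses and reports when training would stop; `early_stopping_epoch` and the `patience` parameter name are assumptions for this illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0  # a checkpoint would be saved here
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch, best_epoch        # stop and restore the best checkpoint
    return len(val_losses) - 1, best_epoch  # ran through all epochs


losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With a patience of 3, training stops after epoch 5 and never reaches the lower loss in epoch 6. A larger patience trades longer training for a smaller risk of stopping too early.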

Before and after you train your model, evaluate it on the test dataset.

**Important Note:** Fine-tuning a BERT model takes a long time if you don't have a GPU. Skip this part in that case.

In [None]:
#ToDo: Evaluate your model on the test dataset, then train it and evaluate it again on the test dataset.

In [None]:
from tqdm import tqdm

def train(num_epochs, model, loss_function, optimizer, train_loader, val_loader, break_criterium, model_name):
    best_val_loss = 100000
    no_improve=0
    for epoch in range(num_epochs):
        model.train()
        for labels, indices in tqdm(train_loader, desc='Train Iter', ascii=True):
            output = model(indices)
            loss = loss_function(output, labels.float())
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        acc, val_loss = evaluate(model, val_loader, loss_function)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            no_improve = 0
            torch.save(model.state_dict(), f'{model_name}.pt')
        else:
            no_improve += 1
        print(f"Epoch {epoch} \t Loss {val_loss:.5f} \t Accuracy {acc:.5f}")
        if no_improve >= break_criterium:
            model.load_state_dict(torch.load(f'{model_name}.pt'))
            break
                
def evaluate(model, test_loader, loss_function):
    model.eval()
    correct = 0
    total_entries = 0
    cum_loss = 0
    with torch.no_grad():
        for labels, indices in tqdm(test_loader, desc='Test Iter', ascii=True):
            output = model(indices)
            preds = (output>0.5).int()
            correct += (preds == labels).sum()
            total_entries += labels.shape[0]
            cum_loss += loss_function(output, labels.float()).item()
    return correct/total_entries, cum_loss/len(test_loader)
    
BC = BERTClassifier(config.hidden_size)
loss_function = torch.nn.BCELoss()
optimizer = torch.optim.Adam(BC.parameters())
print(evaluate(BC, test_dataloader, loss_function))
train(1, BC, loss_function, optimizer, train_dataloader, val_dataloader, 5, 'BC') 
print(evaluate(BC, test_dataloader, loss_function))  


#### (Almost) training our own QA model

You can also do more complex things, adapting datasets and models to your needs. For running the training below, we recommend using a GPU. You also need to install the datasets package, as it is not part of the environment.

In [None]:
%%capture
!pip install datasets

In [None]:
# Getting data
from datasets import load_dataset

dataset = load_dataset("squad")
print(dataset)

training_data = dataset['train']
print(training_data[254])


# look at one example:
print(training_data[254]['question'])
print(training_data[254]['context'])
print(training_data[254]['answers'])

# Defining the model (usually BertForQuestionAnswering (https://huggingface.co/transformers/model_doc/bert.html#transformers.BertForQuestionAnswering) but also show how it is implemented)

# Doing a forward pass

In [None]:
from torch.utils.data.dataset import Dataset
from tqdm import tqdm

class SquadDataset(Dataset):
    def __init__(self, tokenizer, validation=False):
        self.tokenizer = tokenizer
        dataset = load_dataset("squad")
        if not validation: 
          data = load_dataset('squad', split='train[:10%]')
        else: 
          data = dataset['validation']
        
        self.qa_pairs = []
        num_discarded = 0

        for qa_pair in tqdm(data):
          question = qa_pair['question']
          context = qa_pair['context']
          input_text = tokenizer.encode(question, context, add_special_tokens=True, return_tensors='pt', truncation=True, padding='max_length')
          answer = qa_pair['answers']['text'][0]

          try:
            answer_start, answer_end = self.find_answer_indices(qa_pair, input_text)
          except Exception:
            num_discarded = num_discarded + 1
            # print('Could not find answer, example is discarded... This is example number: ', num_discarded)
            continue


          self.qa_pairs.append({
              'input_text': input_text[0],
              # 'question': question,
              # 'context': context,
              # 'answer': answer,
              'answer_start': torch.tensor(answer_start),
              'answer_end': torch.tensor(answer_end)
          })
        print('Some instances have been discarded because no answer was found in the text: ', num_discarded)

    def find_answer_indices(self, qa_pair, input_text):
        """
        Find the indices of the answer in the (tokenized) input. The input will be "[CLS] <question> [SEP] <context> [SEP]" 
        """
        answer_ids = self.tokenizer.encode(qa_pair['answers']['text'][0], add_special_tokens=False)
        # print('Answer ids: ', answer_ids)
        # print('Input text: ', input_text)
        inputs_text_ids = input_text[0].numpy().tolist()

        # print(self.get_sublist_idx(inputs_text_ids, answer_ids))
        start_index, end_index = self.get_sublist_idx(inputs_text_ids, answer_ids)

        return start_index, end_index

      
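    def get_sublist_idx(self, x, y):
        # Return the indices of the first and the last token of y inside x (both inclusive),
        # since the QA head expects the end position to point at the last answer token.
        l1, l2 = len(x), len(y)
        for i in range(l1 - l2 + 1):
            if x[i:i+l2] == y:
                return i, i + l2 - 1
        raise Exception('Answer not found')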


    def __len__(self):
        return len(self.qa_pairs)

    def __getitem__(self, idx):
        return self.qa_pairs[idx]

In [None]:
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('twmkn9/bert-base-uncased-squad2')

trainset = SquadDataset(tokenizer)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=2,
                                          shuffle=True)

In [None]:
# Slightly adjusted from the huggingface transformers library
from transformers import BertPreTrainedModel, BertModel
from transformers.modeling_outputs import QuestionAnsweringModelOutput
import torch.nn as nn


class BertForQuestionAnswering(BertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config, add_pooling_layer=False)
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

    # @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    # @add_code_sample_docstrings(
    #     tokenizer_class=_TOKENIZER_FOR_DOC,
    #     checkpoint=_CHECKPOINT_FOR_DOC,
    #     output_type=QuestionAnsweringModelOutput,
    #     config_class=_CONFIG_FOR_DOC,
    # )
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        start_positions=None,
        end_positions=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
            Labels for position (index) of the start of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the
            sequence are not taken into account for computing the loss.
        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
            Labels for position (index) of the end of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (:obj:`sequence_length`). Position outside of the
            sequence are not taken into account for computing the loss.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]

        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        total_loss = None
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        if not return_dict:
            output = (start_logits, end_logits) + outputs[2:]
            return ((total_loss,) + output) if total_loss is not None else output

        return QuestionAnsweringModelOutput(
            loss=total_loss,
            start_logits=start_logits,
            end_logits=end_logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

In [None]:
# Fine-tuned to QA
# qa_bert = BertForQuestionAnswering.from_pretrained('twmkn9/bert-base-uncased-squad2')

# Pre-trained
qa_bert = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

In [None]:
import torch.optim as optim

# Use AdamW from torch.optim; the AdamW class in transformers is deprecated
optimizer = optim.AdamW([p for p in qa_bert.parameters() if p.requires_grad],
                        lr=5e-5, eps=1e-08)

In [None]:
!nvidia-smi

In [None]:
import torch
torch.cuda.is_available()
torch.cuda.empty_cache()


In [None]:
# move model to gpu if we have one 
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device('cpu')
print('We are using the device: ', device)

t = torch.tensor([1,2,3])
print(t)

qa_bert.to(device)
print(t.to(device))


In [None]:
from tqdm import tqdm

for epoch in range(1):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader,0):
        
        # x = data
        # print(data)
        
        input_text = data['input_text']
        input_text = input_text.to(device)
        # input_text = input_text.squeeze(1)
        # print(input_text.shape)
        # print(input_text.device)

        answer_start = data['answer_start'].to(device)
        # print(answer_start.shape)
        answer_end = data['answer_end'].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

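        # forward + backward + optimize
        # the inputs are padded to a fixed length, so mask out the padding tokens
        attention_mask = (input_text != tokenizer.pad_token_id).long()
        outputs = qa_bert(input_ids=input_text, attention_mask=attention_mask, start_positions=answer_start, end_positions=answer_end)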
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        # print(loss)
        # break

        # print statistics
        running_loss += loss.item()
        if i % 50 == 49:    # print every 50 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 50))
            running_loss = 0.0

print('Finished Training')

After training (which we skipped here), you can use the model to answer questions.

In [None]:
model = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')
tokenizer = AutoTokenizer.from_pretrained('deepset/bert-base-cased-squad2')

question = 'What is my name?'
context = 'My name is Sarah and I live in London.'

inputs = tokenizer(question, context, return_tensors='pt')

print(inputs.input_ids)
print(tokenizer.convert_ids_to_tokens(inputs.input_ids[0]))

model_output = model(**inputs)
print(model_output)
print(torch.argmax(model_output.start_logits))
print(torch.argmax(model_output.end_logits))
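
Taking the two argmax values independently can produce an end position that lies before the start position. A common remedy is to score every valid span by the sum of its start and end logit and pick the best one. The sketch below does this in plain Python on made-up logits; `best_span` and `max_answer_len` are hypothetical names, and a full implementation would also restrict the search to the context part of the input.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = None
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


toy_tokens = ["[CLS]", "who", "?", "[SEP]", "sarah", "smith", "lives", "here", "[SEP]"]
toy_start_logits = [0.1, 0.0, 0.0, 0.0, 4.0, 1.0, 0.2, 0.1, 0.0]
toy_end_logits = [0.1, 0.0, 0.0, 3.5, 1.0, 3.0, 0.5, 0.2, 0.0]

start, end, score = best_span(toy_start_logits, toy_end_logits)
answer = " ".join(toy_tokens[start:end + 1])  # the end index is inclusive
print(start, end, answer)
```

Here the highest end logit sits at position 3, before the best start at position 4, so the independent argmax would give an empty answer, while the span search returns `sarah smith`.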