# LAB 2 - TRANSFORMERS WITH HUGGINGFACE


In this lab you'll get to try two of the most famous NLP transformers: BERT and GPT-2.

To download and use the models we will use Hugging Face Transformers, a Python library for working with transformer models, together with the Hugging Face Hub, a large platform for sharing them. Check it out here: https://huggingface.co/

In this lab we will:
- Use DistilBERT to
    - fill in missing words in sentences
    - extract features from texts
- Fine-tune DistilGPT-2 on a text and then use it to generate a story

Install Hugging Face Transformers by running the cell below.

In [2]:
!pip install transformers

Collecting tokenizers<0.11,>=0.10.1
  Using cached tokenizers-0.10.3-cp38-cp38-win_amd64.whl (2.0 MB)
Installing collected packages: tokenizers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.8.1rc2
    Uninstalling tokenizers-0.8.1rc2:
      Successfully uninstalled tokenizers-0.8.1rc2
Successfully installed tokenizers-0.10.3


# 1. BERT


## <ins>BACKGROUND</ins>


BERT is trained on the task of filling in masked words in sentences. We will use a distilled version of BERT made by Hugging Face called DistilBERT.

A distilled model is a smaller model trained to reproduce the behaviour of a larger one. It performs almost as well, but is lighter and faster than the original model.

We will use BERT for two things:
1. Feature extraction
2. Mask filling


First we will instantiate the model and its tokenizer. In Hugging Face Transformers, every model is accompanied by its own tokenizer. There are many different sorts of tokenizers, which we won't cover. One reason each model needs its own is that different models have been trained with different special tokens. BERT (and DistilBERT) are specifically trained using some special tokens in the text, which we'll see shortly.

#### Note: The model and tokenizer we use are uncased, i.e. case-insensitive: the tokenizer lowercases all text, so the model cannot tell upper case from lower case letters. Thus to the model BERT = bert = Bert = BeRT

## <ins>EXERCISE</ins>

Explore the tokenizer and the model to see how they can be used as is


#### TODO:
- Explore the tokenizer



In [3]:
from transformers import DistilBertTokenizer, DistilBertModel
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Tokenizer

The tokenizer is a complex object with many attributes. To get an idea of what's unique about this tokenizer you can use the cell below to see which special tokens exist, their representation in text, and how large the vocabulary of the tokenizer is.

Docs: https://huggingface.co/transformers/main_classes/tokenizer.html

#### TODO: 
- Run the commands in the cells below to get information about the tokenizer 

In [4]:
# The categories of special tokens used in DistilBERT
tokenizer.SPECIAL_TOKENS_ATTRIBUTES

['bos_token',
 'eos_token',
 'unk_token',
 'sep_token',
 'pad_token',
 'cls_token',
 'mask_token',
 'additional_special_tokens']

In [5]:
# List the special tokens - can you guess which match which in the previous list?
tokenizer.all_special_tokens

['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']

In [6]:
# How many unique tokens (whole words and word pieces) the tokenizer's vocabulary contains.
# Text that cannot be built from this vocabulary is mapped to the unk_token: '[UNK]'
tokenizer.vocab_size

30522
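BERT-style tokenizers use WordPiece subword tokens, so a word that is missing from the vocabulary is usually split into smaller known pieces rather than mapped straight to [UNK]. The block below is a minimal sketch of greedy longest-match-first splitting, not the library's implementation; the function name `wordpiece` and the tiny `toy_vocab` are made up for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink until one is known
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits, so the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (the real one has 30522 entries)
toy_vocab = {"token", "##izer", "##s", "play", "##ing"}

print(wordpiece("tokenizers", toy_vocab))
print(wordpiece("playing", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

With the real tokenizer you can inspect the pieces of any text with tokenizer.tokenize(text).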

### Tokenize text

Now we'll use the tokenizer on a sample text. 

In [7]:
text = 'This is a sample sentence with ten words in it'
tokens = tokenizer.encode(text)
print('Tokens:\n', tokens)
print('\nNumber of tokens: ', len(tokens))

Tokens:
 [101, 2023, 2003, 1037, 7099, 6251, 2007, 2702, 2616, 1999, 2009, 102]

Number of tokens:  12



Notice that the result is 12 tokens long even though our text only contained ten words. Can you figure out what happened?

#### TODO:
- Decode the tokens using the decode() function in the tokenizer


In [8]:
# Use the decode function to see what happened
tokenizer.decode(tokens)

'[CLS] this is a sample sentence with ten words in it [SEP]'

## Model

### <ins>BACKGROUND</ins>


Just like the tokenizer, the model variable contains a lot of information about the model. The model we loaded is a PyTorch module, so everything that works on PyTorch modules also works here.

We might be interested to see the architecture of the model and what the inputs to the model should look like. Run the two following cells.

In [9]:
print(model)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [10]:
model.dummy_inputs

{'input_ids': tensor([[7, 6, 0, 0, 1],
         [1, 2, 3, 0, 0],
         [0, 0, 0, 4, 5]])}

## Send a sentence through model

Now let's see what comes out of the model if we send a sentence through it! Run the cell below


In [11]:
text = 'This sentence will go through BERT and come out the other side'
model_input = tokenizer(text, return_tensors='pt')

output = model(**model_input)

print(output.last_hidden_state)
print('\nShape:', output.last_hidden_state.shape)

tensor([[[-0.1551, -0.0323,  0.0877,  ...,  0.0033,  0.2715,  0.2747],
         [-0.1971, -0.0893,  0.1830,  ...,  0.0153,  0.3042,  0.1955],
         [ 0.3238, -0.0349,  0.0759,  ...,  0.0196, -0.1462, -0.2677],
         ...,
         [-0.7359, -0.3485,  0.2496,  ..., -0.2540,  0.5690, -0.4983],
         [-0.3434, -0.2095, -0.2392,  ...,  0.2645,  0.1149,  0.1997],
         [ 0.9707,  0.4362, -0.1645,  ...,  0.0281, -0.4067, -0.4955]]],
       grad_fn=<NativeLayerNormBackward>)

Shape: torch.Size([1, 14, 768])


# 2. Pipelines

### <ins>BACKGROUND</ins>


The output we got in the last step doesn't tell us much. We get one 768-dimensional vector for each of the 14 tokens in the input (the 12 words plus [CLS] and [SEP]), but how can we use it concretely?

To make it easier to use the models we'll make use of Hugging Face's pipelines. Pipelines are basically wrappers for models and tokenizers that automate getting an output for a specific task. There are many pipelines you can use, see a full list here: https://huggingface.co/transformers/main_classes/pipelines.html

We will use the "fill-mask" and "feature-extraction" pipelines.

Loading and using a pipeline is easy. We only have to specify the kind of pipeline and the model we want to use, and the library takes care of the rest!


In [12]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
extractor = pipeline("feature-extraction", model='distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
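One common way to use the per-token vectors from the feature-extraction pipeline is to average them into a single sentence vector and compare sentences with cosine similarity. The sketch below assumes that extractor(text) returns a nested list shaped [1, number of tokens, 768], so that extractor(text)[0] is a list of token vectors; the toy lists stand in for that output, and `mean_pool` and `cosine_similarity` are illustrative helper names.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for extractor(text)[0]: 3 tokens x 2 dimensions
# instead of n_tokens x 768 (assumed output layout, see above)
features_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features_b = [[2.0, 2.0], [0.0, 0.0], [1.0, 1.0]]

vec_a = mean_pool(features_a)
vec_b = mean_pool(features_b)
print(cosine_similarity(vec_a, vec_b))
```

With the real pipeline you would pass extractor(sentence)[0] for two different sentences instead of the toy lists.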


## Fill mask

We know that BERT was originally trained to predict masked tokens in sentences. It should thus be pretty good at predicting missing words in sentences. 



### <ins>EXERCISE</ins>

The pipeline only supports one masked token at a time. It will spit out a list of what it thinks are the most likely words to fill in and their scores. Can you create a function that uses the pipeline to fill in a sentence with several masks with the most likely words?


#### TODO:
- Play around with the unmasker by changing the masked_sentence

In [29]:
# Fill mask 
masked_sentence = "Hello there my friend. [MASK] are you doing today?"
unmasker(masked_sentence)


[{'sequence': 'hello there my friend. what are you doing today?',
  'score': 0.8218488693237305,
  'token': 2054,
  'token_str': 'what'},
 {'sequence': 'hello there my friend. how are you doing today?',
  'score': 0.1665491908788681,
  'token': 2129,
  'token_str': 'how'},
 {'sequence': 'hello there my friend. where are you doing today?',
  'score': 0.006061742547899485,
  'token': 2073,
  'token_str': 'where'},
 {'sequence': 'hello there my friend. why are you doing today?',
  'score': 0.0012343761045485735,
  'token': 2339,
  'token_str': 'why'},
 {'sequence': 'hello there my friend. whatever are you doing today?',
  'score': 0.000964021252002567,
  'token': 3649,
  'token_str': 'whatever'}]

## Choosing the most likely candidate

### <ins>EXERCISE</ins>

Out of the candidates you saw, you probably thought some were better than others. The model also scored how likely it thought the different candidates were. A natural way to choose is to simply pick the most probable one. Using the unmasker pipeline's outputs, implement a function that you can use to fill one word in a sentence.

#### TODO:
- Complete the function fill_mask() 

In [14]:
def fill_mask(masked_sentence):
    "Returns the masked sentence filled out with the most likely candidate"
    candidates = unmasker(masked_sentence)
    most_likely_candidate = max(candidates, key= lambda x: x['score'])
    return most_likely_candidate['sequence']

In [17]:
fill_mask('BERT is a sort of [MASK] trained by google')

'bert is a sort of robot trained by google'

## Filling several masks

You should now implement a function that can fill several masked words so we can get around the limitations of the fill-mask pipeline. Here's a list of approaches you can try:


#### Option 1:
1. Replace all but one of the [MASK] tokens in the sentence with another token
    - Hint: Go back and look at the special tokens DistilBERT uses
2. Fill the [MASK] token
3. Uncover another [MASK] token and fill it
4. Repeat step 3 until all have been filled


#### Option 2:
1. Split the sentence into as many parts as there are masked words
2. Send each piece into the function fill_mask()
3. Stitch results together 


#### Option 3:
If you want a more challenging solution, you can try to build several different candidates and calculate their conditional probabilities.

E.g. for the sentence: "Hi my [MASK] is BERT and I am a [MASK]"
1. Get the most probable fill for each mask. Use the token for an unknown word for the other mask
    - "Hi my [MASK] is BERT and I am a [UNK]" -> model -> word1=name + probability10
    - "Hi my [UNK] is BERT and I am a [MASK]" -> model -> word2=model + probability20
    
2. Fill in each subsentence with the word BERT thought was most likely and send it back through
    - "Hi my name is BERT and I am a [MASK]" -> model -> word11=human + probability11
    - "Hi my [MASK] is BERT and I am a model" -> model -> word22=name + probability22
    
3. Calculate the probability of each result as the product of the probabilities of its two fills (treating them as a chain of conditional probabilities)
    - P("Hi my name is BERT and I am a human") = probability10 * probability11
    - P("Hi my name is BERT and I am a model") = probability20 * probability22
4. Pick the most likely candidate
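The steps of Option 3 can be sketched for a sentence with exactly two masks. This is a hedged illustration rather than the intended solution: `top_fill` and `fill_two_masks` are hypothetical names, and `unmasker` is assumed to be a callable that, like the fill-mask pipeline output shown earlier, returns a list of dicts with 'token_str' and 'score' keys.

```python
MASK, UNK = "[MASK]", "[UNK]"


def top_fill(unmasker, text):
    """Return (word, score) of the best candidate for the single [MASK] in text."""
    best = max(unmasker(text), key=lambda candidate: candidate["score"])
    return best["token_str"], best["score"]


def fill_two_masks(unmasker, text):
    """Try both fill orders and return (probability, sentence) of the best one."""
    before, middle, after = text.split(MASK)  # fails unless there are exactly two masks

    def build(slots):
        return before + slots[0] + middle + slots[1] + after

    results = []
    for first in (0, 1):
        # Step 1: hide the other mask behind [UNK] and fill the visible one
        slots = [UNK, UNK]
        slots[first] = MASK
        word_a, p_a = top_fill(unmasker, build(slots))
        # Step 2: keep that word and fill the remaining mask
        slots = [MASK, MASK]
        slots[first] = word_a
        word_b, p_b = top_fill(unmasker, build(slots))
        slots[1 - first] = word_b
        # Step 3: score the finished sentence as the product of both probabilities
        results.append((p_a * p_b, build(slots)))
    # Step 4: pick the most likely candidate
    return max(results)
```

A call would look like fill_two_masks(unmasker, "Hi my [MASK] is BERT and I am a [MASK]"). Handling more than two masks means exploring more fill orders, which grows quickly.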


#### TODO:
- Write a function that uses the unmasker to fill in several masks
    + Hint: to replace tokens you can use string.replace
    + Hint: to split the sentence you can use a regex and the re.split() method

In [42]:
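def fill_several_masks(masked_text):
    "Fills several masked tokens by replacing all but one [MASK] with [mask] at a time"
    # Write your code here
    # Hide every mask behind the placeholder "[mask]", then restore only the first one
    modified_masked_text = masked_text.replace("[MASK]", "[mask]").replace("[mask]", "[MASK]", 1)
    print(modified_masked_text)
    sentence = fill_mask(modified_masked_text)
    print(sentence)
    # The pipeline output renders the placeholder as "[ mask ]"; uncover one per iteration.
    # Checking for the full placeholder avoids looping forever if a filled word contains "mask".
    while "[ mask ]" in sentence:
        sentence = sentence.replace("[ mask ]", "[MASK]", 1)
        print(sentence)
        sentence = fill_mask(sentence)
        print(sentence)

    return sentence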
    

### Try your function!

In [43]:
sentence = "Hi my [MASK] is BERT and I am a [MASK] used for [MASK]."

print(fill_several_masks(sentence))

Hi my [MASK] is BERT and I am a [mask] used for [mask].
hi my hat is bert and i am a [ mask ] used for [ mask ].
hi my hat is bert and i am a [MASK] used for [ mask ].
hi my hat is bert and i am a hat used for [ mask ].
hi my hat is bert and i am a hat used for [MASK].
hi my hat is bert and i am a hat used for hats.
hi my hat is bert and i am a hat used for hats.


# 3. GPT-2


## <ins>BACKGROUND</ins>

GPT (Generative Pre-trained Transformer) is a series of models released by OpenAI (GPT-3 was released in 2020). In 2019, GPT-2 got a lot of media attention, both in heated debates about its potential dangers and for writing stories about unicorns.

Just like BERT, the GPT models are trained in an unsupervised fashion, but instead of the masked language modelling (MLM) task that BERT uses, they are trained to do causal language modelling (CLM): predicting the next word given the preceding context. This arguably makes GPT-2 more fun to play around with than BERT.
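To make "predicting the next word given the context" concrete, here is a toy sketch of the generation loop. It is a deliberately simplified stand-in and not how GPT-2 works internally: it only counts which word follows which (a bigram model), whereas GPT-2 uses a transformer over subword tokens and looks at the whole preceding context. `train_bigram`, `generate`, and the corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    return followers


def generate(followers, prompt, length):
    """Greedy generation: repeatedly append the most frequent next word."""
    words = prompt.lower().split()
    for _ in range(length):
        candidates = followers.get(words[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


corpus = "the cat sat on the mat and the cat sat on the sofa"
bigram_model = train_bigram(corpus)
print(generate(bigram_model, "the", 4))
```

The loop has the same shape as text generation with GPT-2: predict a next token from what has been written so far, append it, and repeat.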



### Unicorn story

The original unicorn story generated by the GPT-2 model trained by OpenAI.

#### Written prompt

> In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

#### Generated by GPT-2

> The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

> Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

> Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

> Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

> Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

> While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

> Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

> While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

> However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.



Just as we used DistilBERT instead of BERT, we will here use DistilGPT-2 from Hugging Face, which is a distilled version of GPT-2 small.



## <ins>EXERCISE</ins>

In this exercise we'll use DistilGPT-2 (distilgpt2 on Hugging Face) to generate text.

We will fine-tune it on Alice in Wonderland to see what happens.

#### TODO:
- Use DistilGPT-2 to generate text
- Fine-tune the model on Alice in Wonderland
- Generate text with your fine-tuned model


## Alice in Wonderland from Project Gutenberg

We'll start by downloading Alice in Wonderland, distributed by Project Gutenberg, to use for fine-tuning, and take a look at it.

In [45]:
!wget https://raw.githubusercontent.com/NordAxon/NBI-Handelsakademin-ML-Labs/main/nlp-lab/data/alice.txt --no-check-certificate -O alice.txt
!head -20 alice.txt

--2021-06-14 21:25:18--  https://raw.githubusercontent.com/NordAxon/NBI-Handelsakademin-ML-Labs/main/nlp-lab/data/alice.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 150507 (147K) [text/plain]
Saving to: ‘alice.txt’

     0K .......... .......... .......... .......... .......... 34% 1.37M 0s
    50K .......... .......... .......... .......... .......... 68% 2.44M 0s
   100K .......... .......... .......... .......... ......    100% 2.49M=0.07s

2021-06-14 21:25:18 (1.94 MB/s) - ‘alice.txt’ saved [150507/150507]





CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure of
making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so _very_ remarkable in that; nor did Alice think it
so _very_ much out of the way to hear the Rabbit say to itself, “Oh


## Install huggingface transformers from source

We'll use example scripts from the Hugging Face Transformers repository to fine-tune and use our model, so we have to clone the repository and install the library from source. We'll also go back to a previous version of the library (v3.3.0).

In [60]:
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...


In [62]:
cd transformers

C:\Users\AlexanderHagelborn\code\NBI-Handelsakademin-ML-Labs\transformers


In [63]:
!git checkout v3.3.0

Note: switching to 'v3.3.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 0613f0522 Release: v3.3.0


In [64]:
cd ..

C:\Users\AlexanderHagelborn\code\NBI-Handelsakademin-ML-Labs


In [65]:
!pip install -q ./transformers

ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'C:\\Users\\AlexanderHagelborn\\Miniconda3\\envs\\testenv\\Lib\\site-packages\\~okenizers\\tokenizers.cp38-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



### Generate text with DistilGPT-2

Use the cell below to download and use distilgpt2. 

When you execute the cell you'll get a prompt for the model input that it will generate text from. You can of course change the length and number of outputs you want it to generate! Why not try giving it the same prompt that the unicorn text was generated from?

In [49]:
!python /content/transformers/examples/text-generation/run_generation.py \
--model_type=gpt2 \
--model_name_or_path='distilgpt2' \
--num_return_sequences 3 \
--length 150

python: can't open file '/content/transformers/examples/text-generation/run_generation.py': [Errno 2] No such file or directory


### Fine-tune DistilGPT-2 on Alice in Wonderland

In [36]:
!python /content/transformers/examples/language-modeling/run_language_modeling.py \
--model_type=gpt2 \
--model_name_or_path=distilgpt2 \
--do_train \
--train_data_file=/content/alice.txt \
--num_train_epochs 20 \
--output_dir model_output \
--overwrite_output_dir \
--save_steps 20000 \
--per_gpu_train_batch_size 2

python: can't open file '/content/transformers/examples/language-modeling/run_language_modeling.py': [Errno 2] No such file or directory


### Generate text using your model!

Now use your fine-tuned model to generate text! You can go back and use the base model and compare the outputs for the same prompt.

In [None]:
!python /content/transformers/examples/text-generation/run_generation.py \
--model_type=gpt2 \
--model_name_or_path=/content/model_output \
--num_return_sequences 3 \
--length 150