# Transfer learning and pretrained models

In [None]:
%%bash

shred -u setup_colab.py

wget https://raw.githubusercontent.com/Alenush/mds_nlp_2021/main/utils/setup_colab.py -O setup_colab.py

In [None]:
import setup_colab

setup_colab.setup()

### Fill in your Coursera token and email
To successfully submit your answers to our grader, please fill in your Coursera submission token and email.

In [None]:
import grading

all_parts = ["tS7BO", "6gZAz", "TdWnl", "0ClQi", "b8JOi", "LtFTq", "X91HC", "AW7xi", "1DoPa", "LapxE", "aKlhZ", "3aCIW", "h2vQy"]
grader = grading.Grader(
    assignment_key="WMPlZI7kTWOGKLZJMQtFUA",
    all_parts=all_parts
)

In [None]:
# token expires every 30 min
COURSERA_TOKEN = "### YOUR TOKEN HERE ###"
COURSERA_EMAIL = "### YOUR EMAIL HERE ###"

In this week's assignment you will prepare the data, tokenize it into subwords, and finally use it as input to pretrained models such as BERT.

<br>

You have to:
1. implement the BPE algorithm
2. use ELMO and compare it with word2vec embeddings
3. explore the usage of BERT
4. train a classifier using BERT embeddings to solve the CoLA classification task
5. use prepared pipelines

<br>

Good luck!

In [None]:
!pip install wget

In [None]:
import nltk
import wget
import os
import random
nltk.download('punkt')
import pandas as pd
import numpy as np
import torch
from collections import defaultdict
from sklearn.svm import SVC

This week we will work with [The Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/) dataset. It is a single-sentence classification task in which sentences are labeled as grammatically correct or incorrect.

Download the dataset, load the data, and prepare the train and dev sets. In the following tasks, use the dev set to evaluate the model.

In [None]:
print('Downloading dataset...')

# The URL for the dataset zip file.
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

# Download the file (if we haven't already)
if not os.path.exists('./cola_public_1.1.zip'):
    wget.download(url, './cola_public_1.1.zip')

# Unzip the dataset (if we haven't already)
if not os.path.exists('./cola_public/'):
    !unzip cola_public_1.1.zip

In [None]:
# Load the dataset into a pandas dataframe.
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
df_dev = pd.read_csv("./cola_public/raw/in_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

In [None]:
sentences = df.sentence.values
labels = df.label.values

set(labels), len(labels)

In [None]:
sentences_dev = df_dev.sentence.values
labels_dev = df_dev.label.values

## BPE algorithm

**Task**. Implement the BPE algorithm. 

In the video we have discussed that for pre-trained models it is better to use subword tokenization. BPE is a commonly used tokenizer.


In this task you need to implement the BPE algorithm from scratch.


This approach solves the OOV (out-of-vocabulary) problem by encoding rare or unknown words as sequences of subword units.
For example, suppose the model sees an out-of-vocabulary word `Transformer`. Instead of directly using a default unknown token, `<unk>`, the algorithm uses information about subwords such as `trans` and `form`, which appear in the corpus, to encode the word `Transformer`. Hence, the model may be able to pick up more information than when it uses an unknown token. Thus, any word that does not appear in the training corpus can be broken down into subword units that do appear in the corpus.

### Recap of the BPE procedure

Procedure:

1) learn "BPE rules", i.e., which pairs of symbols to merge; 

2) apply learned rules to segment a text.

In order to do this we need to perform the following actions:

* Count how often each word occurs
* Get the initial tokens and their frequencies (i.e. how many times each character occurs)
* Merge the most frequent pair of adjacent symbols
* Add the merged symbol to the list of tokens and recalculate the frequency count for each token; this will change with each merging step
* Rinse and repeat this procedure until you have reached a predefined token limit or maximum number of iterations (as in the example)
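
The steps above can be sketched on a toy vocabulary. This is only a minimal illustration under stated assumptions: the helper names `count_pairs` and `merge_pair` and the four toy words are invented for this sketch and are not the functions or data of the graded task.

```python
import re
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Join every occurrence of `pair` (as whole symbols) into one symbol."""
    merged = "".join(pair)
    # Lookarounds make sure we match whole symbols, not parts of longer ones.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}

# Hypothetical toy vocabulary: space-separated symbols with an end-of-word marker.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges = []
for _ in range(3):
    pairs = count_pairs(toy_vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair
    toy_vocab = merge_pair(best, toy_vocab)
    merges.append(best)

print(merges)
print(toy_vocab)
```

Each iteration recomputes the pair counts, because a merge changes which symbols are adjacent.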

**Task 1. (2 points)** You need to **write three functions**:

1. `get_statistics` computes a dictionary of frequencies for the pairs of adjacent symbols (bigrams) in `vocab`.

2. `merge_vocab` applies one merge to the vocabulary. It takes a `pair` (a bigram) and the current `vocab` as input. The function outputs a new `vocab`, created from the old one by joining the two symbols of the pair wherever they occur next to each other in a word of the dictionary.

3. `learn_bpe_rules` takes the current `vocab` (a word-frequency dictionary) and the number of merge operations `num_merges`. For each merge operation it calls `get_statistics` to count every pair of adjacent symbols, selects the most frequent pair, and merges this pair in each word of `vocab` that contains it. It returns the updated `vocab` and the learned rules `bpe_codes`.

Apply 2000 merge steps and write the frequency of the pair ('com', 'p') in the answer.

In [None]:
# Let's create a symbol-frequency vocabulary from the data
vocab = defaultdict(int)
for sentence in sentences:
  toks = nltk.word_tokenize(sentence.lower())
  for tok in toks:
    vocab[(" ".join(list(tok))+' </w>')] += 1

# check that the word `transformers` is not in the vocabulary
assert(vocab["t r a n s f o r m e r s </w>"] == 0)
# check that `or` is in the vocabulary
assert(vocab["o r </w>"] == 61)

In [None]:
# EXERCISE 1.1
import re, collections

def get_statistics(vocab):
    # YOUR CODE HERE
    return pairs

def merge_vocab(pair, cur_vocab):
    v_out = {}
    # YOUR CODE HERE
    return v_out

def learn_bpe_rules(vocab, num_merges=10):
    bpe_codes = {}
    # YOUR CODE HERE
    # use previous functions: get_statistics and merge_vocab to learn bpe rules
    return vocab, bpe_codes

In [None]:
pair_stats = get_statistics(vocab)
pair_stats

In [None]:
best_pair = max(pair_stats, key=pair_stats.get)
print(best_pair)

new_vocab = merge_vocab(best_pair, vocab)
new_vocab

In [None]:
# set num_merges to 2000
vocab, bpe_rules = learn_bpe_rules(vocab, num_merges=2000)
bpe_rules

In [None]:
frequency = ## YOUR FREQUENCY

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[0], frequency)

**Task 1.2. (1 point)** After we have learned the vocabulary, we need to apply the rules in the merge table to new words. As a final answer, write the subwords of the word `transformers` separated by commas. For example: `"tr,an,sf,or,me,rs"`
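
As a rough illustration of applying learned merges to an unseen word, the sketch below repeatedly merges the highest-priority pair present in the word. It assumes the rules are stored as a dictionary mapping each pair to its rank (lower rank means learned earlier); that format, the helper name `apply_merges`, and the toy rules are assumptions for this example, and your own `bpe_rules` may be structured differently.

```python
def apply_merges(word, ranked_merges):
    """Segment `word` by greedily applying merges in order of their rank."""
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        candidates = [pair for pair in pairs if pair in ranked_merges]
        if not candidates:
            break  # no learned rule applies any more
        best = min(candidates, key=ranked_merges.get)
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical toy rules: pair -> rank (order in which the merge was learned).
toy_rules = {
    ("e", "s"): 0,
    ("es", "t"): 1,
    ("est", "</w>"): 2,
    ("l", "o"): 3,
    ("lo", "w"): 4,
}

segments = apply_merges("lowest", toy_rules)
print(",".join(segments))
```

The word `lowest` never has to appear in the toy rules as a whole; it is assembled from subwords that the rules do cover.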

In [None]:
from operator import itemgetter

def get_pairs(word):
    # YOUR CODE HERE
    return pairs

def create_new_word(word, pair_to_merge):
    # YOUR CODE HERE
    return new_word

def encode(original_word, bpe_rules):
    # YOUR CODE HERE
    return word

In [None]:
original_word = 'transformers'
encode(original_word, bpe_rules)

In [None]:
subwords = '' ## YOUR SUBWORDS

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[1], subwords)

## ELMO embeddings

In this part we will see how the ELMO model can be loaded and will check its embeddings.

In [None]:
# If you work in Colab, please use the following versions
# so that everything works correctly (and restart the Colab runtime afterwards):

!pip install tensorflow==1.15
!pip install "tensorflow_hub>=0.6.0"
!pip3 install tensorflow_text==1.15

# You can download ELMO models from here: https://allennlp.org/elmo

import tensorflow_hub as hub
import tensorflow as tf

elmo = hub.Module("https://tfhub.dev/google/elmo/2")
elmo

In [None]:
# First, you need to initialise the session (variables and lookup tables) to get embeddings:
init = [tf.global_variables_initializer(), tf.tables_initializer()]
sess = tf.Session()
sess.run(init)

In [None]:
# You can get ELMO embeddings by feeding it raw text strings (the default signature) or tokenized text.
embeddings = elmo(["Some input text", 
                   "Test sentence"],
             signature="default",
             as_dict=True)["elmo"]

print("text")
emb_text = sess.run(embeddings[0][2])
emb_text

**Task 2 (2 points)**. You have texts with several sentences where the word `bank` is used in different contexts. Load the word2vec model `word2vec-google-news-300` and the ELMO model from `https://tfhub.dev/google/elmo/2`. 
Check the embedding of the word `bank` in each sentence. 
Compare the word2vec and ELMO embeddings.

Write the cosine similarity between the embeddings of the word `bank` in the first sentence and in the third sentence for both word2vec and ELMO. Round the answer to two digits after the decimal point.
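
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it on made-up three-dimensional vectors; the helper name `cosine` and the vectors are illustrative assumptions, not embeddings from either model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # points in the same direction as a
c = [3.0, 0.0, -1.0]   # orthogonal to a

sim_ab = round(cosine(a, b), 2)
sim_ac = round(cosine(a, c), 2)
print(sim_ab, sim_ac)
```

`sklearn.metrics.pairwise.cosine_similarity` computes the same quantity but expects 2-D arrays, so a single embedding vector usually has to be reshaped to `(1, -1)` first.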

In [None]:
texts = [
         "The river bank was flooded", # this one
         "The bank vault was robust",
         "He had to bank on her for support", # the third one
         "The bank was out of money",
         "The robber still the money from the bank"
]

tokenized_input = []
tokens_length = []
for sentence in texts:
    toks_sent = nltk.word_tokenize(sentence.lower())
    tokenized_input.append(toks_sent)
    tokens_length.append(len(toks_sent))

print(len(tokenized_input), len(tokens_length))
tokenized_input, tokens_length
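
The tokenized sentences have different lengths, while a batch of tokens has to be rectangular. One common way to handle this, sketched below, is to pad the shorter sentences with empty strings and keep the true lengths separately. The helper name `pad_tokens` and the two-sentence toy batch are assumptions made for this illustration.

```python
def pad_tokens(batch, pad_token=""):
    """Pad every token list to the length of the longest one."""
    lengths = [len(sentence) for sentence in batch]
    max_len = max(lengths)
    padded = [sentence + [pad_token] * (max_len - len(sentence)) for sentence in batch]
    return padded, lengths

toy_batch = [
    ["the", "river", "bank", "was", "flooded"],
    ["he", "had", "to", "bank", "on", "her", "for", "support"],
]
padded, lengths = pad_tokens(toy_batch)
print(padded)
print(lengths)

# As far as the TF Hub documentation for this module describes, such a batch
# is passed together with the true lengths, roughly like:
# elmo(inputs={"tokens": padded, "sequence_len": lengths},
#      signature="tokens", as_dict=True)["elmo"]
```

Keeping the true lengths matters because it lets the model ignore the padding positions.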

In [None]:
# YOUR CODE FOR WORD2VEC HERE

In [None]:
# YOUR CODE FOR ELMO HERE 
# You need to pad the token lists for signature="tokens"

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# YOUR CODE HERE

In [None]:
cosine_similarity_word2vec = ## YOUR COSINE SIMILARITY FOR Word2vec
cosine_similarity_elmo = ## YOUR COSINE SIMILARITY FOR ELMO

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[2], cosine_similarity_word2vec)
grader.set_answer(all_parts[3], cosine_similarity_elmo)

## BERT embeddings

Next, we will work with BERT models using the **transformers** library, the most commonly used library for working with pretrained models.

In [None]:
# If you work in Colab:
# !pip install transformers

You will work with the uncased English BERT model. First, you need to load the tokenizer and the pretrained model. You can find the full list of available models [here](https://huggingface.co/models).

In [None]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

**Task 3.1. (0.5 point)** Write the function for input preparation for the BERT model. 

For this you need:
1. Split each sentence into tokens with the tokenizer initialized above.
2. Add special tokens `[CLS]` and `[SEP]`.
3. Map all tokens to their IDs.
4. Pad or truncate all sentences to the same length.
5. Create attention masks which explicitly differentiate real tokens from `[PAD]` tokens.
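
To make the five steps concrete without loading a real model, here is a minimal sketch that uses whitespace splitting and a tiny hand-made vocabulary. Everything in it (the `prepare` helper, the toy vocabulary and its ids) is a hypothetical stand-in; the real BERT tokenizer uses WordPiece and its own ids.

```python
def prepare(sentence, vocab, max_len):
    """Toy version of BERT input preparation with padding and truncation."""
    # Steps 1-2: tokenize (here: whitespace split) and add special tokens,
    # truncating the body so that [CLS] and [SEP] always fit.
    body = sentence.lower().split()[: max_len - 2]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    # Step 3: map tokens to ids, falling back to [UNK] for unknown tokens.
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    # Steps 4-5: pad to max_len and mark real tokens with 1, padding with 0.
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    ids = ids + [vocab["[PAD]"]] * n_pad
    mask = mask + [0] * n_pad
    return tokens, ids, mask

# Hypothetical toy vocabulary; these ids are NOT the real BERT ids.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "attention": 4, "is": 5, "all": 6, "you": 7, "need": 8}

tokens, ids, mask = prepare("Attention is all you need", toy_vocab, max_len=10)
print(tokens)
print(ids)
print(mask)
```

The attention mask has the same length as the id list, so the model can tell real tokens from padding position by position.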

In [None]:
# Write a function that preprocesses input for BERT,
# then compute the sum of the input_ids for `Attention is all you need.`

def bert_text_preparation(text, tokenizer):
    """Prepare the input for BERT.

    Takes a string argument and performs pre-processing:
    tokenization, adding special tokens, mapping tokens
    to ids, and mapping tokens to segment ids.
    All tokens are mapped to segment id = 1.

    Example output for "Attention is all you need.":
    {'input_ids': tensor([[ 101, 3086, 2003, 2035, 2017, 2342, 1012,  102]]),
     'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]),
     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
    ['[CLS]', 'attention', 'is', 'all', 'you', 'need', '.', '[SEP]']
    """
    # YOUR CODE HERE
    return tokenized_text, input_ids, token_type_ids, attention_mask

# Write the sum of the input ids of this sequence in the answer:
text = "Attention is all you need."
tokenized_text, input_ids, token_type_ids, attention_mask = bert_text_preparation(text, tokenizer)
print(tokenized_text)
print(input_ids)

In [None]:
input_ids_sum = ## YOUR INPUT IDs SUM

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[4], input_ids_sum)

**Task 3.2. (1 point)** Apply the `BertForMaskedLM` model. `BertForMaskedLM` is BERT with a language modeling head on top, which maps the hidden-state output of the BERT model to a token in the vocabulary. The loss is computed from the scores the model assigns to the target token. We can `[MASK]` some tokens in the text and make BERT predict the masked words. That is exactly what you need to do. 

Write a function that masks the word at a given index and predicts it. Write the predicted index and the corresponding token.

**Note:** pass a vector of `1`s as `token_type_ids` to encode the token type. As long as you have only one segment, all tokens should have the same type id; use id `1` for correct grading.
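
The idea of masking a position and picking the highest-scoring vocabulary entry can be illustrated without a model. In the sketch below, the token list, the five-word vocabulary, and the scores are all made up for illustration; a real `BertForMaskedLM` produces one score per entry of its full vocabulary at every position.

```python
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked_index = 2

# Replace the token at the chosen position with the mask token.
masked_tokens = list(tokens)
masked_tokens[masked_index] = "[MASK]"
print(masked_tokens)

# Hypothetical scores for the masked position over a toy vocabulary.
toy_vocab = ["the", "cat", "dog", "sat", "mat"]
toy_scores = [0.1, 2.3, 1.7, -0.5, 0.4]

# The prediction is the vocabulary index with the highest score (argmax).
predicted_index = max(range(len(toy_scores)), key=toy_scores.__getitem__)
predicted_token = toy_vocab[predicted_index]
print(predicted_index, predicted_token)
```

With the real model, the same argmax is typically taken over the logits at the masked position, for example with `torch.argmax`.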

In [None]:
# EXERCISE 3.2. 
# Write function to predict MASKED token.

def find_masked_word(text, masked_index):
  """ 
  Tokenize the input - don't forget to add the [CLS] and [SEP] tokens.
  Mask the word at `masked_index`.
  Use the `model` (BertForMaskedLM) loaded above to predict the masked token.
  Return: predicted_index
  """
  # YOUR CODE HERE
  return predicted_index

# Write the predicted index and token in the answer
text = "What is the best pretrained model? [SEP] BERT think that it is!"
predicted_index = find_masked_word(text, masked_index=14)
print(predicted_index)
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[5], predicted_index)
grader.set_answer(all_parts[6], predicted_token)

**Task 3.3. (1 point)**. Get token and sentence embeddings from BERT. Now, load the pretrained `BertModel`. We already know how to preprocess text for BERT; next, we need to get embeddings from the model. 

First, write a function that takes the token tensor of a sentence, the segment ids, and the model as input. It outputs `list_token_embeddings`, the embeddings of the tokens in this sentence.

In [None]:
bert_model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
# Put the model in "evaluation" mode, meaning feed-forward operation.
bert_model = bert_model.eval()

In [None]:
# EXERCISE 3.3. Write function that gets the BERT embedding directly

def get_bert_embeddings(tokens_tensor, segments_tensors, model):
    """Get embeddings from an embedding model
    Args:
        tokens_tensor (obj): Torch tensor size [n_tokens]
            with token ids for each token in text
        segments_tensors (obj): Torch tensor size [n_tokens]
            with segment ids for each token in text
        model (obj): Embedding model to generate embeddings
            from token and segment ids
    
    Returns: [n_tokens, n_embedding_dimensions]
    containing embeddings for each token
    """
    # YOUR CODE HERE
    # Gradient calculation is disabled
    # Model is in inference mode

    return list_token_embeddings

In [None]:
texts = [
         "The river bank was flooded.",
         "The bank vault was robust.",
         "He had to bank on her for support.",
         "The bank was out of money.",
         "The robber still the money from the bank."
]

# Getting embeddings for the target sentences
# Here we put the word `bank` in all given contexts
target_word_embeddings = []

for text in texts:
    tokenized_text, input_ids, token_type_ids, attention_mask = bert_text_preparation(text, tokenizer)
    list_token_embeddings = get_bert_embeddings(input_ids, attention_mask, bert_model)
    # Find the position 'bank' in list of tokens
    word_index = tokenized_text.index('bank')
    # Get the embedding for bank
    word_embedding = list_token_embeddings[word_index]
    target_word_embeddings.append(word_embedding)

assert(len(target_word_embeddings) == 5)

Use the `cosine_similarity` metric to compare the embeddings: calculate it for the first and third sentences (sentence indexes 0 and 2), where the meanings of the word `bank` are different, and for the fourth and fifth sentences (sentence indexes 3 and 4), where its senses are similar. Round the answer to 3 digits after the decimal point.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

## Cosine similarity [different meanings]
cosine_similarity_diff = ## YOUR ANSWER

## Cosine similarity [similar meanings]
cosine_similarity_sim = ## YOUR ANSWER

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[7], cosine_similarity_diff)
grader.set_answer(all_parts[8], cosine_similarity_sim)

**Task 3.4 (1 point)**. Write a function that obtains a sentence embedding from the BERT model. Compute sentence embeddings for all the sentences and write the cosine similarity between the first and third sentences and between the 4th and the 5th. Round the answer to 3 digits after the decimal point.

In [None]:
import numpy as np

def get_sentence_embedding(text, bert_model):
    # YOUR CODE HERE
    return sentence_embedding[:, 0, :].cpu().numpy()

sent_embeddings = []
for text in texts:
    sentence_embedding = get_sentence_embedding(text, bert_model)
    sent_embeddings.append(sentence_embedding)

In [None]:
## Cosine similarity  between the 1st and the 3rd sentences
cosine_similarity_1_3 = ## YOUR ANSWER

## Cosine similarity  between the 4th and the 5th sentences
cosine_similarity_4_5 = ## YOUR ANSWER

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[9], cosine_similarity_1_3)
grader.set_answer(all_parts[10], cosine_similarity_4_5)

**Task 3.5. (1 point)**. Using sentence embeddings, train a simple classifier on the CoLA dataset you loaded at the beginning of this notebook. Use an SVM with hyperparameters `(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=42)` from the `sklearn` library. Do not change the random seed or the parameters! Write the accuracy on the validation set in percent, for example, 50.
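
The overall workflow (fit on training features, predict on dev features, report accuracy in percent) can be sketched on synthetic data. The random four-dimensional features below are a made-up stand-in for sentence embeddings, and the variable names are assumptions for this example; only a subset of the hyperparameters is passed explicitly, the rest stay at their `sklearn` defaults.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Synthetic stand-in for sentence embeddings: two well-separated clusters.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2.0, 1.0, size=(40, 4)),
                     rng.normal(2.0, 1.0, size=(40, 4))])
y_train = np.array([0] * 40 + [1] * 40)
X_dev = np.vstack([rng.normal(-2.0, 1.0, size=(10, 4)),
                   rng.normal(2.0, 1.0, size=(10, 4))])
y_dev = np.array([0] * 10 + [1] * 10)

# Fit the classifier on the training set only.
clf = SVC(C=1.0, kernel="rbf", gamma="scale", random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out dev set and convert the fraction to percent.
toy_accuracy_percent = 100 * accuracy_score(y_dev, clf.predict(X_dev))
print(toy_accuracy_percent)
```

`accuracy_score` returns a fraction between 0 and 1, so it has to be multiplied by 100 to get a percentage.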

In [None]:
from sklearn import svm

# YOUR CODE HERE
accuracy_in_percentages = ## YOUR ACCURACY HERE

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[11], accuracy_in_percentages)

**Task 4 (0.5 point)**. The Hugging Face library provides ready-made pipelines. Import `pipeline` and choose the task you need to solve, along with a model from the Hugging Face list and a tokenizer, as in the cell below:

In [None]:
from transformers import pipeline

# Using the default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using a custom model and tokenizer, given as strings
pipeline("<task-name>", model="<model_name>", tokenizer="<tokenizer_name>")

Write the pipeline for the question answering task using the `distilbert-base-uncased-distilled-squad` model. It is a distilled BERT model that was fine-tuned on the QA dataset SQuAD. Write the start id of the answer, using the first 1000 sentences from the CoLA dataset as the context.
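
A question answering pipeline returns a dictionary with the fields `score`, `start`, `end`, and `answer`, where `start` and `end` are character offsets into the context. The sketch below builds such a dictionary by hand to show how the offsets relate to the answer text; the toy context, the answer, and the score are invented for illustration and do not come from the model or from CoLA.

```python
# Hypothetical context and answer, used only to illustrate the offsets.
toy_context = "Mary gets depressed when it rains.\nJohn sings."
toy_answer = "when it rains"

# Hand-made dictionary mirroring the fields of a question answering result.
start = toy_context.find(toy_answer)
toy_result = {
    "score": 0.9,                      # made-up confidence value
    "start": start,                    # character offset where the answer begins
    "end": start + len(toy_answer),    # character offset just past the answer
    "answer": toy_answer,
}

# Slicing the context with the offsets recovers the answer text.
recovered = toy_context[toy_result["start"]:toy_result["end"]]
print(toy_result["start"], recovered)
```

In a real run the dictionary comes from calling the pipeline object with `question=...` and `context=...`, and the start id is read from its `start` field.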

In [None]:
context = "\n".join(sentences[:1000])
question = "When does Mary get depressed?"
# YOUR CODE HERE

In [None]:
start_id = ## YOUR START ID

In [None]:
## GRADED PART, DO NOT CHANGE!
grader.set_answer(all_parts[12], start_id)

In [None]:
grader.submit(COURSERA_EMAIL, COURSERA_TOKEN)

❗️Remember to **run the last code cell** to submit the solution.