# Recent advances in Natural Language Processing

## 1. Static word embeddings

Introduced in 2013, word2vec has had a huge impact in natural language processing and its applications.

Vector representations of words seem to capture word meaning quite well!

Accessible and easy to use (easy to train, to apply and to share).

Shortcoming: this algorithm creates static embeddings, i.e. it creates one vector per word, no matter how many meanings the word has (e.g. `I like apples` vs `I like Apple macbooks`.)

Import the `gensim` library:

In [None]:
import gensim
import gensim.downloader

Download and load one of the models.

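Just for illustration, we'll use `glove-wiki-gigaword-50`, a set of 50-dimensional GloVe vectors trained on text from Wikipedia and Gigaword (newswire). GloVe is a different algorithm from word2vec, but it also produces static embeddings. Note that different models may perform differently.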

In [None]:
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-50')

Static word embeddings create one vector per word.

_Example 1:_
See the 20 words most similar to the word 'mouse'. What do you observe?

In [None]:
glove_vectors.most_similar('mouse', topn=20)

_Example 2:_ See the 20 words most similar to the word 'pear'.

What would you expect to see here? Any guesses on the top most similar? And what do you see?

In [None]:
glove_vectors.most_similar('pear', topn=20)

_Example 3:_ See the 20 words most similar to the word 'apple'.

What would you expect to see here? Any guesses on the top most similar? And what do you see?

In [None]:
glove_vectors.most_similar('apple', topn=20)
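The `most_similar` calls above rank words by the cosine similarity between their vectors. As a rough illustration of that idea, here is a minimal sketch in plain Python. The words, the 3-dimensional vectors and the helper names (`cosine_similarity`, `toy_most_similar`) are all made up for this example; they are not taken from the GloVe model.

```python
import math

# Made-up 3-dimensional vectors, for illustration only (the real vectors in
# `glove-wiki-gigaword-50` have 50 dimensions and different values).
toy_vectors = {
    "apple": [0.9, 0.8, 0.1],
    "pear": [0.8, 0.9, 0.0],
    "laptop": [0.7, 0.1, 0.9],
    "river": [0.1, 0.0, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = toy_most_similar("apple", toy_vectors)
print(ranking)
```

Note that there is a single entry for `apple` in the lookup table: whatever sentence the word appears in, a static model returns this same vector.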

Word2Vec (2013) was one of the most important developments in NLP:
* It captures word meaning quite well!
* Very accessible and easy to use

... but it produces "static embeddings", with one vector representation for each word form.

**But most words have multiple meanings!**

## 2. Contextualized word embeddings

Words mean different things in different contexts.

**Goal:** learn the representation (i.e. the "meaning"!) for each word in its context.

In recent years (since 2018 mostly), lots of progress has been made (from BERT to GPT-3).

A lot of progress has also been made in making these models accessible and easy to use. The company HuggingFace deserves much of the credit for this, especially with their `transformers` library and their model hub.

A **transformer** is a deep learning model that uses the **attention** mechanism (a mechanism loosely inspired by cognitive attention, which focuses on the parts of a sequence that carry the key information while giving less weight to less relevant parts). Its development has had a huge impact on deep learning, especially in natural language processing and computer vision. It allows more effective modeling of long-term dependencies between the words in a sequence, and more efficient training, since the input does not have to be processed strictly in sequence order.
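To make the idea of attention a little more concrete, here is a minimal sketch of one common formulation, scaled dot-product attention, in plain Python. This is an assumption made for illustration: the text above does not specify which variant is meant, and real transformers add learned projections and several attention heads on top of this. All vectors are made up.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Three positions in a toy sequence, each with a made-up key and value.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(x, 3) for x in output])
```

The positions whose keys match the query best receive the largest weights, so they contribute most to the output vector.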

**BERT** (Bidirectional Encoder Representations from Transformers) is a hugely successful transformer-based model that creates contextualized word embeddings, capturing fine-grained contextual properties of words. It learns contextualized information through a masking process (i.e. it hides some words and uses the surrounding context to predict them).

![](images/mask_predictions.png)
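The masking process can be sketched as follows: one token is hidden, and the hidden token becomes the target that the model has to predict from the remaining context. This is only a simplified illustration; the helper name `mask_token_at` and the example sentence are made up, and real pretraining selects the positions to mask at random over a very large corpus.

```python
def mask_token_at(tokens, position, mask_token="[MASK]"):
    """Return a copy of `tokens` with one position hidden, plus the hidden token."""
    masked = list(tokens)  # copy, so the original list is left untouched
    target = masked[position]
    masked[position] = mask_token
    return masked, target


tokens = ["the", "cell", "is", "guarded", "by", "a", "soldier"]
masked_tokens, target = mask_token_at(tokens, 6)

print(" ".join(masked_tokens))  # what the model sees
print("target:", target)        # what the model should predict
```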

## 3. HuggingFace 🤗

HuggingFace is a company specialised in developing NLP technologies. 

Their open source `transformers` library has become one of the most popular libraries for NLP:
* Makes state-of-the-art NLP easier to use.
* Provides APIs to download and use pretrained models, but also allows you to load and fine-tune your own models.
* It is open source! 
* Maintains a **model hub**: central point for people to share and find models. They host more than 50K models, supporting different languages and different tasks, and also more than 7K datasets.

We'll just scratch the surface, but if you are interested in this, we highly recommend the HuggingFace course: https://huggingface.co/course

### The `transformers` pipelines

The `transformers` library provides an easy way of using transformer models for some of the main tasks in natural language processing, such as:

#### Sentiment analysis

The task of determining whether a text is positive, negative, or neutral.

![](images/sentiment_analysis.png)

#### Zero-shot classification

The task of classifying texts into categories of your choice.

![](images/zeroshot.png)

#### Text generation

The task of generating text given a prompt.

![](images/textgen.png)

#### Named entity recognition

The task of recognizing named entities in a text.

![](images/ner.png)

#### Machine translation

The task of converting a source text from one language to another.

![](images/machinetransl.png)

#### Question answering

The task of retrieving the answer to a question from a given text.

![](images/questionans.png)

✏️ **Exercises:**
    
* Explore the model hub: https://huggingface.co/models.
* Have a look at the models: tasks and languages available.
* Discussion:
    * Can you find a model that fits your needs (language, task, ...)?
    * Have you spotted any strange behaviour?

### How to use `transformers` in code

We will first install the `transformers` library (and dependencies):

In [None]:
!pip install torch torchvision torchaudio
!pip install transformers

Import the `transformers` library:

In [None]:
import transformers

###  Using BERT pipelines

Pipelines are a simplified way to apply BERT models (and other transformer models!). A pipeline is a code object that abstracts most of the complex code (it happens in the background), leaving only the bare minimum for the user to interact.

We import the `pipeline` function from the `transformers` library:

In [None]:
from transformers import pipeline

To create a pipeline, you need to know:
* Which task you want to perform (e.g. `'fill-mask'`)
* The model you want to use to make predictions (e.g. `'distilbert-base-uncased'`), which must be trained for the task you want to perform (i.e. `fill-mask`).
* The tokenizer used by the model, i.e. the strategy that the model uses to split sequences into smaller units. It often has the same name as the model, e.g. `'distilbert-base-uncased'`.
* Which conventions the language model follows: e.g. if your task is `fill-mask`, how the masked element is tagged (usually `[MASK]`, sometimes `<mask>`, etc.).

If you obtained your model from the HuggingFace model hub (https://huggingface.co/models), you should be able to find all this info in the model card (e.g. https://huggingface.co/bert-base-uncased).
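One way to keep these four pieces of information together is a small settings object, sketched below. The class `PipelineSettings` and the `{mask}` placeholder are hypothetical helpers invented for this example, not part of the `transformers` library; the values are the ones mentioned in the list above.

```python
from dataclasses import dataclass


@dataclass
class PipelineSettings:
    """Hypothetical container for what you need to know about a model."""
    task: str
    model: str
    tokenizer: str
    mask_token: str

    def mask(self, template):
        """Replace the `{mask}` placeholder with this model's mask token."""
        return template.replace("{mask}", self.mask_token)


settings = PipelineSettings(
    task="fill-mask",
    model="distilbert-base-uncased",
    tokenizer="distilbert-base-uncased",
    mask_token="[MASK]",
)

sentence = settings.mask("The cell is guarded by a {mask}.")
print(sentence)
```

Writing sentences with a placeholder makes it easier to reuse them with a model that tags the masked element differently. If in doubt, a loaded tokenizer should also expose its mask token through its `mask_token` attribute.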

**Note:** You can find more information on pipelines and how to use them in https://huggingface.co/transformers/main_classes/pipelines.html.

### 2.2. The Mask filling pipeline

Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to fill that mask with an appropriate token (source: https://huggingface.co/transformers/task_summary.html#masked-language-modeling). The `fill-mask` pipeline replaces the mask in a sequence by the most likely prediction according to a BERT model.

We will create a `fill-mask` pipeline using the `distilbert-base-uncased` English model (and its tokenizer), as follows:

In [None]:
unmasker = pipeline('fill-mask',
                    model='distilbert-base-uncased',
                    tokenizer='distilbert-base-uncased')

This pipeline allows us to easily use BERT to predict the masked element in a sentence.

In the previous cell, we are:
* Creating a pipeline for the task of `fill-mask`,
* by using the `distilbert-base-uncased` BERT model and tokenizer,
* and storing the resulting pipeline in a variable (we call it `unmasker`), which we can use and reuse in subsequent code.

**Warning:** You need to make sure the model you use is trained for the `'fill-mask'` task.

To use the pipeline, you just need to pass the sentence containing the masked word as an argument of `unmasker` (i.e. the variable containing your pipeline). You don't need to do any encoding: the pipeline already takes care of converting the text into an input BERT can understand!

We store the output of applying the pipeline to this sentence in the `outputs` variable, as shown below:

In [None]:
outputs = unmasker("The cell is guarded by a [MASK].")

Now, let's inspect the `outputs` variable:

In [None]:
print(outputs)

In [None]:
# Let's print the results in an easier-to-read format:
for one_output in outputs:
    print("Prediction:", one_output['token_str'])
    print("Score:     ", round(one_output['score'],4))
    print()
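The result of a `fill-mask` pipeline is a list of dictionaries, one per prediction, with the keys `score`, `token`, `token_str` and `sequence`. The sketch below works on a hand-written list that only mimics this shape: the predictions, scores and token IDs are made up, and `top_prediction` is a hypothetical helper.

```python
# Made-up values that only imitate the shape of a `fill-mask` result;
# they are NOT real model predictions.
mock_outputs = [
    {"score": 0.41, "token": 11, "token_str": "guard",
     "sequence": "the cell is guarded by a guard."},
    {"score": 0.22, "token": 12, "token_str": "soldier",
     "sequence": "the cell is guarded by a soldier."},
    {"score": 0.09, "token": 13, "token_str": "wall",
     "sequence": "the cell is guarded by a wall."},
]


def top_prediction(outputs):
    """Return the prediction with the highest score."""
    return max(outputs, key=lambda one_output: one_output["score"])


best = top_prediction(mock_outputs)
print("Best prediction:", best["token_str"], round(best["score"], 4))

# Scores behave like probabilities, so the listed ones add up to at most 1.
listed_mass = sum(one_output["score"] for one_output in mock_outputs)
print("Probability mass of the listed predictions:", round(listed_mass, 4))
```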

In [None]:
outputs = unmasker("""When a cell has been produced, we can then trace some of the
                      stages by which new [MASK] are formed. There appear to be four
                      modes in which vegetable cells are multiplied. The new cells
                      may either proceed from a nucleus or they may be formed at
                      once in the protoplasm.""")

# Let's print the results in an easier-to-read format:
for one_output in outputs:
    print("Prediction:", one_output['token_str'])
    print("Score:     ", round(one_output['score'],4))
    print()

In [None]:
outputs = unmasker("""Imprisonment with proper employment, and at least two visits
                      every day from a prison officer. The punishment does not
                      extend over a month. A week must elapse before the same
                      prisoner can be put again into the dark [MASK].""")

# Let's print the results in an easier-to-read format:
for one_output in outputs:
    print("Prediction:", one_output['token_str'])
    print("Score:     ", round(one_output['score'],4))
    print()

✏️ **Exercise:**

In [None]:
# Find a `fill-mask` model from HuggingFace model hub (trained on data in your preferred
# language, if there is one). Create a `fill-mask` pipeline and try to predict the mask
# token in some sentences.
# * Try this with different sentences.
# * What do the scores indicate?
# * Try to see what happens if you want to use BERT to predict something that requires
#   world knowledge, for example:
#     * `Everyone agrees that the princes in the tower were [MASK].`
#     * `It would seem [MASK] III killed the princes in the tower.`
#     * `Barcelona is a city in [MASK].`
#     * `Paris is the capital of [MASK].`
#
# Type your code here:



### 2.3. Load and use your own models

In this tutorial we won't have time to cover how to train or fine-tune your own BERT model, but at the end of this notebook you will find some links on this.

Let's imagine you have your own BERT models that you want to use. Here we will use our historical English BERT models instead, just to show that the `transformers` library also works with models stored locally. You just need to point to the right path when loading the model.

See how we load our historical English BERT models:

✏️ **To do:**

We have stored our historical English BERT models in [Google drive](https://drive.google.com/drive/folders/1Y-ltpJNCfTO0ti7zPnBdRWlyMXh8OjmH?usp=sharing).
These language models are described in [this paper](https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.48/).

!!! Important facts you will **need to know** about these language models:
* They were fine-tuned on the `fill-mask` task based on `bert-base-uncased`
* They use the `bert-base-uncased` tokenizer.

The dataset on which these language models are trained is a 19th-century collection
of books in English. We will download the following two BERT models:
* `bert_1760_1850.zip`: trained on books from 1760 to 1850: [download](https://drive.google.com/file/d/1QJgUFiFgplOq2eBUn5mLwAxcn3KOSPxw/view?usp=sharing)
* `bert_1890_1900.zip`: trained on books from 1890 to 1900: [download](https://drive.google.com/file/d/1nPlcyBBOdGYxRGVmiCrgC6muhgD87lva/view?usp=sharing)

Download the files, unzip them, and store the `bert_1760_1850` and `bert_1890_1900` folders
directly under the `models` folder.

In [None]:
# Now we can create a `fill-mask` pipeline for the 1760-1850 model. To do so, you
# just need to add the path to the `model` argument. It is very important that
# you know (1) which is the tokenizer that was used to train the model and (2)
# on which task the model was fine-tuned, in this case `fill-mask`: this info
# is usually given in the description of the model.
unmasker_1760_1850 = pipeline('fill-mask',
                              model='models/bert_1760_1850',
                              tokenizer='bert-base-uncased')

In [None]:
# We can now use them to predict a mask in a sentence as well:
outputs = unmasker_1760_1850("""The [MASK] is guarded by guards.""")

# Let's print the results in an easier-to-read format:
for one_output in outputs:
    print("Prediction:", one_output['token_str'])
    print("Score:     ", round(one_output['score'],4))
    print()
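Once you have the outputs of two `fill-mask` pipelines for the same sentence, you can compare them systematically rather than by eye. The sketch below assumes the usual output shape (a list of dictionaries with a `token_str` key); the two output lists are made up, and `compare_predictions` is a hypothetical helper.

```python
def predicted_tokens(outputs):
    """Extract the predicted tokens, keeping the order of the predictions."""
    return [one_output["token_str"] for one_output in outputs]


def compare_predictions(outputs_a, outputs_b):
    """Split the predictions into shared ones and ones unique to each model."""
    tokens_a = predicted_tokens(outputs_a)
    tokens_b = predicted_tokens(outputs_b)
    return {
        "shared": [t for t in tokens_a if t in tokens_b],
        "only_a": [t for t in tokens_a if t not in tokens_b],
        "only_b": [t for t in tokens_b if t not in tokens_a],
    }


# Made-up outputs standing in for the results of two different models.
mock_outputs_a = [
    {"token_str": "castle", "score": 0.30},
    {"token_str": "city", "score": 0.20},
    {"token_str": "gate", "score": 0.10},
]
mock_outputs_b = [
    {"token_str": "city", "score": 0.25},
    {"token_str": "prison", "score": 0.15},
    {"token_str": "gate", "score": 0.05},
]

comparison = compare_predictions(mock_outputs_a, mock_outputs_b)
print(comparison)
```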

✏️ **Exercise:**

In [None]:
# Create a pipeline for the 1890-1900 model as well and try different sentences with
# both the 1760-1850 and the 1890-1900 models. Do language models trained on data
# from different periods make different predictions?
# 
# Type your code here:



### 2.4. The other pipelines

These are other pipelines available through HuggingFace:
* `ner` (for named entity recognition)
* `question-answering`
* `sentiment-analysis`
* `summarization`
* `text-generation`
* `translation`
* `zero-shot-classification`

HuggingFace have created a well-documented [tutorial](https://huggingface.co/course/chapter1/3?fw=pt#working-with-pipelines).

✏️ **Exercise:**

In [None]:
# Have a look at the other pipelines, and get inspiration from the tutorial:
# https://huggingface.co/course/chapter1/3?fw=pt#working-with-pipelines
# Play a bit with different pipelines, using different models from the
# model hub: https://huggingface.co/models (make sure that the model
# is trained for the task you would like to try!).
# 
# Type your code here:



### 2.5. Get the vector representation

The meaning of a word, in NLP, is usually represented as a vector, i.e. a list of numbers. In the case of transformer models (such as BERT), the vector of a word changes based on the context in which this word occurs. In the following cells, we'll see how to obtain the vector representation of a token using the `transformers` library.

#### Tokenization

As we saw yesterday, tokenizing a text is splitting it into meaningful units. Very often, this means splitting a text into words.

BERT uses a **subword** tokenization procedure called WordPiece.

This means that it does not only separate words. It also splits certain words into (ideally) meaningful units.
> E.g. it splits the word `tokenizing` into `token` and `##izing`, where `##` indicates that this is a suffix which should be attached to the previous token.

These tokens are then mapped to numbers.

The tokenizer maps every token (e.g. `token` and `##izing`) to an identifier in the vocabulary (e.g. given a certain model, `19204` is the vocabulary ID of `token` and `6026` is the vocabulary ID of the suffix `##izing`, so the word 'tokenizing' would be tokenized as `[19204, 6026]`).
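The subword splitting described above can be sketched as a greedy search: at each position, take the longest piece that exists in the vocabulary, and mark every piece after the first with `##`. The code below is a simplified illustration of this idea, not the library's implementation. The function name `wordpiece` is made up, and the toy vocabulary is invented except for the two IDs quoted in the text (`19204` and `6026`).

```python
# Toy vocabulary: the IDs of `token` and `##izing` come from the text above,
# all other entries and IDs are invented for this example.
toy_vocab = {
    "[UNK]": 0,
    "guard": 1,
    "##s": 2,
    "token": 19204,
    "##izing": 6026,
}


def wordpiece(word, vocab, unknown="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


pieces = wordpiece("tokenizing", toy_vocab)
ids = [toy_vocab[piece] for piece in pieces]
print(pieces)
print(ids)
```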

**!!! Warning:** BERT limits the length of the input it accepts. The limit depends on the model, but it is usually 512 tokens.

**!!! VERY important:** different models may have different token-to-id mappings. When we use an existing model, we must use the same tokenizer (and therefore the same vocabulary mapping) that was used when training the model.

#### The inner workings of BERT tokenization

Tokenization steps:

1. The text is split into tokens, which can be:
  * words
  * parts of words
  * punctuation symbols

2. The tokenizer adds special tokens:
  * `[CLS]` indicating that this is the beginning of the input sequence.
  * `[SEP]` indicating that it is the end of the sequence (or a sequence delimiter if we have a pair of sequences as input).

3. The tokenizer maps each token to its vocabulary ID.

Let's explore this in code.

In [None]:
# Load the tokenizer of a certain BERT model
our_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# The `encode` function does the three steps in one go (splits the sequence, adds
# special tokens, and converts them into a sequence of IDs):
encoded_seq = our_tokenizer.encode('The cell is relentlessly guarded by guards.')
print(encoded_seq)

And there are also functions that translate the vocabulary IDs to the word forms (given a certain tokenizer):

In [None]:
# The `convert_ids_to_tokens` returns the tokens that correspond to the IDs of
# an encoded sequence:
tokens = our_tokenizer.convert_ids_to_tokens(encoded_seq)
print(tokens)
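To see what the three tokenization steps do without relying on the library, here is a simplified sketch in plain Python. It is only an approximation: the splitting is reduced to lowercasing, whitespace and full stops (no subword splitting), the vocabulary and its IDs are invented, and `toy_encode` is a hypothetical helper. It also shows how a length limit such as 512 tokens can be enforced by truncation.

```python
# Invented vocabulary and IDs, for illustration only.
toy_vocab = {
    "[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
    "the": 3, "cell": 4, "is": 5, "guarded": 6, ".": 7,
}


def toy_encode(text, vocab, max_length=512):
    """Split the text, add special tokens, and map tokens to vocabulary IDs."""
    # Step 1 (simplified): lowercase, separate full stops, split on whitespace.
    tokens = text.lower().replace(".", " .").split()
    # Leave room for the two special tokens, then truncate.
    tokens = tokens[: max_length - 2]
    # Step 2: add the special tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map each token to its ID (unknown tokens map to `[UNK]`).
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids


toy_tokens, toy_ids = toy_encode("The cell is guarded.", toy_vocab)
print(toy_tokens)
print(toy_ids)
```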

### 2.6. The feature extraction pipeline

Here we will see how to get vectors for words in context.

Similarly to what we did with word2vec, we may also want to have access to the vector of a certain word. However, unlike with word2vec, the vector of a word will depend on the context in which the word occurs. This means that we can't just ask for the vector of the word "apple", for example: we will need to ask for the vector of the word "apple" given a certain context.

We first import the following two libraries, which will help us work with vectors:

In [None]:
import numpy as np # python library used for working with vectors
from scipy import spatial # package to help compute distance or similarity between vectors

The pipeline task to obtain the vectors for tokens in a sequence is `feature-extraction`. As you can see, creating this pipeline is very similar to creating the `fill-mask` pipeline.

We will store the pipeline in a variable called `nlp_features`:

In [None]:
nlp_features = pipeline("feature-extraction",
                    model='distilbert-base-uncased',
                    tokenizer='distilbert-base-uncased')

Given a sentence, the pipeline tokenizes it and returns one vector per token:

In [None]:
sentence = "They were told that the machines stopped working."

output = nlp_features(sentence)
output_vectors = np.squeeze(output) # This removes single-dimensional entries (i.e. for vector readability)

Let's inspect the output. First of all, let's print it:

In [None]:
print(output_vectors)

This is an array (a list of vectors). Let's see its shape:

In [None]:
print(output_vectors.shape) # Print the shape of the vector

This means that we have an array (in other words a matrix, or table) with 11 vectors of length 768 (11 rows and 768 columns).

**Question:** 11 vectors? Why 11?

Let's see how the sentence is tokenized (we've seen how above):

In [None]:
# Load the **SAME** tokenizer used in the pipeline:
our_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Encode the sentence into a sequence of vocabulary IDs
encoded_seq = our_tokenizer.encode(sentence)
print(encoded_seq)

# And get the tokens given the vocabulary IDs
tokens = our_tokenizer.convert_ids_to_tokens(encoded_seq)
print(tokens)

# And print the length of the tokenized sequence:
print(len(tokens))

As you can see, the input sentence has been tokenized into 11 tokens. So what we have in the above array is 11 vectors (each one representing a token in the context of the sentence, **keeping the order of tokens**, i.e. the first vector will correspond to the special token `[CLS]`, the second vector to the token `they`, and so on until the last vector, which corresponds to the special token `[SEP]`).

How do we get the vector of a specific token?

In [None]:
print(tokens[6]) # The element at index 6 in the tokenized sentence is the token `machines` (we start counting from zero)

In [None]:
print(output_vectors[6]) # Therefore, the vector at index 6 in output_vectors is the vector of `machines` in this context.
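Instead of counting positions by hand, you can look up the position of a token in the token list and use it to select the matching vector. The sketch below does this on made-up data (short token lists and 3-dimensional vectors instead of 768-dimensional ones) and then compares the vectors of the same word in two different contexts. The helpers `vector_for` and `cosine_similarity` are hypothetical names for this example.

```python
import math


def vector_for(token, tokens, vectors):
    """Return the vector at the position of the first occurrence of `token`."""
    position = tokens.index(token)
    return vectors[position]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up tokens and vectors standing in for the output of two sentences.
tokens_a = ["[CLS]", "the", "prison", "cell", "[SEP]"]
vectors_a = [[0.1, 0.0, 0.2], [0.3, 0.1, 0.0], [0.9, 0.1, 0.2],
             [0.8, 0.2, 0.1], [0.0, 0.1, 0.1]]

tokens_b = ["[CLS]", "the", "plant", "cell", "[SEP]"]
vectors_b = [[0.1, 0.0, 0.2], [0.3, 0.1, 0.0], [0.1, 0.9, 0.3],
             [0.2, 0.8, 0.4], [0.0, 0.1, 0.1]]

cell_a = vector_for("cell", tokens_a, vectors_a)
cell_b = vector_for("cell", tokens_b, vectors_b)

similarity = cosine_similarity(cell_a, cell_b)
print("Similarity of `cell` across the two contexts:", round(similarity, 4))
```

Note that `tokens.index` only finds the first occurrence, so sentences in which the word is repeated or split into several subword tokens need extra care. With real NumPy vectors, the same similarity can be obtained as `1 - spatial.distance.cosine(u, v)`, since SciPy's function returns a distance rather than a similarity.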

✏️ **Exercise:**

In [None]:
# Create two `feature-extraction` pipelines, one for the 1760-1850 model, and
# one for the 1890-1900 model. Find out whether the cosine similarity between words
# in sequences changes depending on which BERT model you use.
# 
# Type your code here:

👀 **If you are interested in knowing more, we recommend:**
* The HuggingFace tutorials: https://huggingface.co/course/chapter1/1
* BERT for Humanists: http://www.bertforhumanists.org/