# 1. Static word embeddings

Introduced in 2013, word2vec has had a huge impact in natural language processing and its applications.

Vector representations of words seem to capture word meaning quite well!

Accessible and easy to use (easy to train, to apply and to share).

Shortcoming: word2vec and similar algorithms create static embeddings, i.e. one vector per word form, no matter how many meanings the word has (e.g. `I like apples` vs. `I like Apple macbooks`).

Import the `gensim` library:

In [1]:
import gensim
import gensim.downloader

Download and load one of the models.

Just for illustration, we'll use `glove-wiki-gigaword-50`, which was trained on text from Wikipedia and Gigaword (newswire). GloVe is a different algorithm from word2vec, but it also produces static embeddings. Note that different models may perform differently.

In [2]:
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-50')

Static word embeddings create one vector per word.
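
To make "one vector per word" concrete, here is a minimal sketch with a made-up three-dimensional embedding table (the numbers are invented; real GloVe vectors are learned from text). Neighbours are ranked by cosine similarity, which is the measure `most_similar` reports. The names `toy_vectors` and `toy_most_similar` are hypothetical and not part of `gensim`.

```python
import math

# Hypothetical toy embedding table: exactly one vector per word form.
toy_vectors = {
    "mouse":    [0.9, 0.8, 0.1],
    "rat":      [1.0, 0.7, 0.0],
    "keyboard": [0.1, 0.9, 0.9],
    "banana":   [0.0, 0.1, 1.0],
}

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

neighbours = toy_most_similar("mouse", toy_vectors)
print(neighbours)
```

Whatever sentence `mouse` appears in, the lookup returns the same vector, so the animal sense and the computer sense end up sharing one set of neighbours.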

_Example 1:_
See the top 20 most similar words to the word 'mouse'. What do you observe?

In [3]:
glove_vectors.most_similar('mouse', topn=20)

[('monkey', 0.7965501546859741),
 ('bugs', 0.7805658578872681),
 ('cat', 0.7731667160987854),
 ('rabbit', 0.7622702717781067),
 ('worm', 0.7504912614822388),
 ('clone', 0.7307788729667664),
 ('robot', 0.7268993854522705),
 ('spider', 0.7199547290802002),
 ('bug', 0.7104865312576294),
 ('frog', 0.702705979347229),
 ('mice', 0.6978598833084106),
 ('morph', 0.6821691393852234),
 ('rat', 0.6796256303787231),
 ('ape', 0.6776273846626282),
 ('monster', 0.6768894791603088),
 ('click', 0.6624072790145874),
 ('uses', 0.6548232436180115),
 ('squirrel', 0.6545740962028503),
 ('creature', 0.652310848236084),
 ('trackball', 0.6500849723815918)]

_Example 2:_ See the top 20 most similar words to the word 'pear'.

What would you expect to see here? Any guesses on the top most similar? And what do you see?

In [4]:
glove_vectors.most_similar('pear', topn=20)

[('mango', 0.854493260383606),
 ('avocado', 0.8034905195236206),
 ('pineapple', 0.7993020415306091),
 ('pomegranate', 0.7979761362075806),
 ('apricot', 0.7934627532958984),
 ('plum', 0.7909936308860779),
 ('peach', 0.7899217009544373),
 ('tomato', 0.7809723615646362),
 ('almond', 0.7721766829490662),
 ('guava', 0.7694117426872253),
 ('cherries', 0.7693702578544617),
 ('cucumber', 0.7676603198051453),
 ('cherry', 0.7642223238945007),
 ('melon', 0.758733868598938),
 ('rhubarb', 0.7586071491241455),
 ('cranberry', 0.75290846824646),
 ('papaya', 0.7493483424186707),
 ('compote', 0.7489797472953796),
 ('fruit', 0.7483691573143005),
 ('ripe', 0.7483318448066711)]

_Example 3:_ See the top 20 most similar words to the word 'apple'.

What would you expect to see here? Any guesses on the top most similar? And what do you see?

In [5]:
glove_vectors.most_similar('apple', topn=20)

[('blackberry', 0.7543067336082458),
 ('chips', 0.7438644170761108),
 ('iphone', 0.7429664134979248),
 ('microsoft', 0.7334205508232117),
 ('ipad', 0.7331036329269409),
 ('pc', 0.7217225432395935),
 ('ipod', 0.7199784517288208),
 ('intel', 0.7192243337631226),
 ('ibm', 0.7146540284156799),
 ('software', 0.7093585133552551),
 ('macintosh', 0.7047760486602783),
 ('android', 0.7046630382537842),
 ('processor', 0.6996651291847229),
 ('product', 0.6925289630889893),
 ('dell', 0.6896463632583618),
 ('cola', 0.6863354444503784),
 ('desktop', 0.6860975027084351),
 ('netscape', 0.6852997541427612),
 ('processors', 0.6781534552574158),
 ('amd', 0.6766293048858643)]

# 2. Contextualized word embeddings

Words mean different things in different contexts.

**Goal:** learn the representation (i.e. the "meaning"!) for each word in its context.

In recent years (since 2018 mostly), lots of progress has been made (from BERT to GPT-3).

There has also been a lot of progress in making these models easy to access and use. The company HuggingFace is largely responsible for this, especially through their `transformers` library and their model hub.

A **transformer** is a deep learning model that uses the **attention** mechanism (a mechanism loosely inspired by cognitive attention, which gives more weight to the parts of a sequence that carry the key information and less weight to the less relevant parts). Its development has had a huge impact on deep learning, especially in natural language processing and computer vision. It allows more effective modeling of long-term dependencies between the words in a sequence, and more efficient training, because the input does not have to be processed one token at a time in sequence order.
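
The core of attention can be sketched in a few lines. The example below is a deliberately stripped-down toy: it uses made-up two-dimensional vectors, and the queries, keys and values are the input vectors themselves, whereas a real transformer learns separate projections, uses several attention heads and adds position information. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Toy self-attention: each output is a weighted average of all input vectors."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product between this token and every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * value[i] for w, value in zip(weights, vectors))
                        for i in range(dim)])
    return outputs

# Made-up static vectors for three words.
toy = {"eat": [1.0, 0.0], "apple": [0.5, 0.5], "macbook": [0.0, 1.0]}

apple_food = self_attention([toy["eat"], toy["apple"]])[1]      # "eat apple"
apple_tech = self_attention([toy["apple"], toy["macbook"]])[0]  # "apple macbook"
print(apple_food, apple_tech)
```

The input vector for `apple` is identical in both calls, but the output differs because it is mixed with different neighbours. This is the intuition behind contextualized embeddings.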

**BERT** (Bidirectional Encoder Representations from Transformers) is a transformer-based model that creates contextualized word embeddings. It learns contextualized information through a masking process (i.e. it hides some words and uses the surrounding context to predict them).
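
The masking process can be illustrated with a simplified sketch. It hides a random subset of tokens and remembers the originals as the targets that the model would have to predict. The 15% default mirrors the rate reported for the original BERT; the actual procedure has further refinements (for instance, some selected tokens are replaced by a random token or left unchanged), which are omitted here. `mask_tokens` is a hypothetical helper.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; return the masked sequence and the hidden originals."""
    rng = random.Random(seed)                      # fixed seed so the example is repeatable
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}                                    # position -> original token
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels

tokens = "the cell is guarded by a guard".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

During training, the model sees `masked` and is rewarded for recovering the tokens stored in `labels`.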

We will first install the `transformers` library (and dependencies):

In [6]:
!pip install torch torchvision torchaudio
!pip install transformers



Import the `transformers` library:

In [7]:
import transformers

### 2.1. Using BERT pipelines

Pipelines are a simplified way to apply BERT models. A pipeline is a code object that abstracts most of the complex code (it happens in the background), leaving only the bare minimum for the user to interact.

We import the `pipeline` function from the `transformers` library:

In [8]:
from transformers import pipeline

To create a pipeline, you need to know:
* Which task you want to perform (e.g. `'fill-mask'`)
* The model you want to use to make predictions (e.g. `'distilbert-base-uncased'`), which must be trained for the task you want to perform (i.e. `fill-mask`).
* The tokenizer used by the model, i.e. the strategy that BERT uses to split sequences into smaller units. It often has the same name as the model (e.g. `'distilbert-base-uncased'`).
* Which conventions the language model follows: e.g. if your task is `fill-mask`, how is the masked element tagged (usually `[MASK]`, sometimes `<MASK>`, etc.).

If you obtained your model from the HuggingFace model hub (https://huggingface.co/models), you should be able to find all this info in the model card (e.g. https://huggingface.co/bert-base-uncased).

**Note:** You can find more information on pipelines and how to use them in https://huggingface.co/transformers/main_classes/pipelines.html.

### 2.2. The Mask filling pipeline

Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to fill that mask with an appropriate token (source: https://huggingface.co/transformers/task_summary.html#masked-language-modeling). The `fill-mask` pipeline replaces the mask in a sequence with the most likely predictions according to a BERT model.
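
Where do the predictions and their scores come from? For the masked position, the model produces one raw score per vocabulary entry; a softmax turns these raw scores into probabilities, and the highest ones are reported. The sketch below uses made-up raw scores for four candidate tokens (a real vocabulary has tens of thousands of entries), and `top_k_predictions` is a hypothetical helper, not part of `transformers`.

```python
import math

def top_k_predictions(logits, k=3):
    """Convert raw scores into probabilities (softmax) and return the k most likely tokens."""
    max_logit = max(logits.values())
    exps = {tok: math.exp(score - max_logit) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Invented raw scores for the masked position.
hypothetical_logits = {"guard": 2.0, "fence": 1.5, "wall": 1.0, "banana": -3.0}

for token, prob in top_k_predictions(hypothetical_logits):
    print(token, round(prob, 4))
```

Because the probabilities are spread over the whole vocabulary, even the best candidate can have a fairly low score when many tokens are plausible.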

We will create a `fill-mask` pipeline using the `distilbert-base-uncased` English model (and its tokenizer), as follows:

In [9]:
unmasker = pipeline('fill-mask',
                    model='distilbert-base-uncased',
                    tokenizer='distilbert-base-uncased')

This pipeline allows us to easily use BERT to predict the masked element in a sentence.

In the previous cell, we are:
* Creating a pipeline for the task of `fill-mask`,
* by using the `distilbert-base-uncased` BERT model and tokenizer,
* and storing the resulting pipeline in a variable (we call it `unmasker`), which we can use and reuse in subsequent code.

**Warning:** You need to make sure the model you use is trained for the `'fill-mask'` task.

To use the pipeline, you just need to pass the sentence containing the masked word as an argument of `unmasker` (i.e. the variable containing your pipeline). You don't need to do any encoding: the pipeline already takes care of converting the text into an input BERT can understand!

We store the output of applying the pipeline to this sentence in the `outputs` variable, as shown below:

In [10]:
outputs = unmasker("The cell is guarded by a [MASK].")

Now, let's inspect the `outputs` variable:

In [11]:
print(outputs)

[{'score': 0.14199909567832947, 'token': 3457, 'token_str': 'guard', 'sequence': 'the cell is guarded by a guard.'}, {'score': 0.09074387699365616, 'token': 8638, 'token_str': 'fence', 'sequence': 'the cell is guarded by a fence.'}, {'score': 0.05313466489315033, 'token': 2813, 'token_str': 'wall', 'sequence': 'the cell is guarded by a wall.'}, {'score': 0.04400479421019554, 'token': 10684, 'token_str': 'keeper', 'sequence': 'the cell is guarded by a keeper.'}, {'score': 0.037186309695243835, 'token': 4796, 'token_str': 'gate', 'sequence': 'the cell is guarded by a gate.'}]


In [12]:
# Let's print the results in an easier-to-read format:
for one_output in outputs:
    print("Prediction:", one_output['token_str'])
    print("Score:     ", round(one_output['score'],4))
    print()

Prediction: guard
Score:      0.142

Prediction: fence
Score:      0.0907

Prediction: wall
Score:      0.0531

Prediction: keeper
Score:      0.044

Prediction: gate
Score:      0.0372
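
The same printing loop is reused for every example below, so it can be convenient to wrap it in a small helper. A minimal sketch, where `format_predictions` is a hypothetical name and the two dictionaries are copied from the `outputs` printed above:

```python
def format_predictions(outputs):
    """Return the predictions of a fill-mask pipeline as readable text."""
    lines = []
    for one_output in outputs:
        lines.append(f"Prediction: {one_output['token_str']}")
        lines.append(f"Score:      {round(one_output['score'], 4)}")
        lines.append("")
    return "\n".join(lines)

# Two entries copied from the printed `outputs` above.
example_outputs = [
    {'score': 0.14199909567832947, 'token': 3457, 'token_str': 'guard',
     'sequence': 'the cell is guarded by a guard.'},
    {'score': 0.09074387699365616, 'token': 8638, 'token_str': 'fence',
     'sequence': 'the cell is guarded by a fence.'},
]
print(format_predictions(example_outputs))
```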



In [13]:
outputs = unmasker("""When a cell has been produced, we can then trace some of the
                      stages by which new [MASK] are formed. There appear to be four
                      modes in which vegetable cells are multiplied. The new cells
                      may either proceed from a nucleus or they may be formed at
                      once in the protoplasm.""")

# Let's print the results in an easier-to-read format:
for one_output in outputs:
    print("Prediction:", one_output['token_str'])
    print("Score:     ", round(one_output['score'],4))
    print()

Prediction: cells
Score:      0.9774

Prediction: nuclei
Score:      0.0077

Prediction: neurons
Score:      0.0026

Prediction: colonies
Score:      0.0017

Prediction: organisms
Score:      0.0012



In [14]:
outputs = unmasker("""Imprisonment with proper employment, and at least two visits
                      every day from a prison officer. The punishment does not
                      extend over a month. A week must elapse before the same
                      prisoner can be put again into the dark [MASK].""")

# Let's print the results in an easier-to-read format:
for one_output in outputs:
    print("Prediction:", one_output['token_str'])
    print("Score:     ", round(one_output['score'],4))
    print()

Prediction: ##room
Score:      0.1958

Prediction: cell
Score:      0.0785

Prediction: tower
Score:      0.0615

Prediction: room
Score:      0.0517

Prediction: chamber
Score:      0.0445



✏️ **Exercise:**

In [None]:
# Find a `fill-mask` model from HuggingFace model hub (trained on data in your preferred
# language, if there is one). Create a `fill-mask` pipeline and try to predict the mask
# token in some sentences.
# * Try this with different sentences.
# * What do the scores indicate?
# * Try to see what happens if you want to use BERT to predict something that requires
#   world knowledge, for example:
#     * `Everyone agrees that the princes in the tower were [MASK].`
#     * `It would seem [MASK] III killed the princes in the tower.`
#     * `Barcelona is a city in [MASK].`
#     * `Paris is the capital of [MASK].`
#
# Type your code here:



### 2.3. Load and use your own models

In this tutorial we won't have time to cover how to train or fine-tune your own BERT model, but at the end of this notebook you will find some links on this.

Imagine that you have your own BERT models that you want to use. Here we will use our historical English BERT models as a stand-in, to show that the `transformers` library also works with models stored locally. You just need to point to the right path when loading the model.

See how we load our historical English BERT models:

In [None]:
# We have stored our historical English BERT models in Google Drive, at
# https://drive.google.com/drive/folders/1Y-ltpJNCfTO0ti7zPnBdRWlyMXh8OjmH?usp=sharing.
# These language models are described in https://arxiv.org/abs/2105.11321.
#
# !!! Important facts you will **need to know** about these language models:
# * They were fine-tuned on the `fill-mask` task based on `bert-base-uncased`
# * They use the `bert-base-uncased` tokenizer.
#
# The dataset on which these language models are trained is a 19th-century collection
# of books in English. We will download the following two BERT models:
# * bert_1760_1850.zip: trained on books from 1760 to 1850: https://drive.google.com/file/d/1QJgUFiFgplOq2eBUn5mLwAxcn3KOSPxw/view?usp=sharing
# * bert_1890_1900.zip: trained on books from 1890 to 1900: https://drive.google.com/file/d/1nPlcyBBOdGYxRGVmiCrgC6muhgD87lva/view?usp=sharing
#
# Download the files, unzip them, and store the `bert_1760_1850` and `bert_1890_1900` folders
# directly under the `models` folder.

In [15]:
# And we create a `fill-mask` pipeline for the 1760-1850 model. To do so, you
# just need to add the path to the `model` argument. It is very important that
# you know (1) which is the tokenizer that was used to train the model and (2)
# on which task the model was fine-tuned, in this case `fill-mask`: this info
# is usually given in the description of the model.
unmasker_1760_1850 = pipeline('fill-mask',
                              model='models/bert_1760_1850',
                              tokenizer='bert-base-uncased')

In [16]:
# We can now use them to predict a mask in a sentence as well:
outputs = unmasker_1760_1850("""The [MASK] is guarded by guards.""")

# Let's print the results in an easier-to-read format:
for one_output in outputs:
    print("Prediction:", one_output['token_str'])
    print("Score:     ", round(one_output['score'],4))
    print()

Prediction: door
Score:      0.1674

Prediction: entrance
Score:      0.0997

Prediction: gate
Score:      0.0459

Prediction: house
Score:      0.0369

Prediction: hall
Score:      0.036



✏️ **Exercise:**

In [17]:
# Create a pipeline for the 1890-1900 model as well and try different sentences with
# both the 1760-1850 and the 1890-1900 models. Do language models trained on data
# from different periods make different predictions?
# 
# Type your code here:



### 2.4. The other pipelines

These are other pipelines available through HuggingFace:
* `ner` (for named entity recognition)
* `question-answering`
* `sentiment-analysis`
* `summarization`
* `text-generation`
* `translation`
* `zero-shot-classification`

HuggingFace have created a well-documented [tutorial](https://huggingface.co/course/chapter1/3?fw=pt#working-with-pipelines).

✏️ **Exercise:**

In [None]:
# Have a look at the other pipelines and get inspiration from the tutorial:
# https://huggingface.co/course/chapter1/3?fw=pt#working-with-pipelines
# Play a bit with different pipelines, using different models from the
# model hub: https://huggingface.co/models (make sure that the model
# is trained for the task you would like to try!).
# 
# Type your code here:



### 2.5. Get the vector representation

In NLP, the meaning of a word is usually represented as a vector, i.e. a list of numbers. In the case of transformer models (such as BERT), the vector of a word changes based on the context in which this word occurs. In the following cells, we'll see how to obtain the vector representation of a token using the `transformers` library.

#### Tokenization

As we saw yesterday, tokenizing a text is splitting it into meaningful units. Very often, this means splitting a text into words.

BERT uses a **subword** tokenization procedure called WordPiece.

This means that it does not only separate words. It also splits certain words into (ideally) meaningful units.
> E.g. it splits the word `tokenizing` into `token` and `##izing`, where `##` indicates that this piece is a continuation that should be attached to the previous token.

These tokens are then mapped to numbers.

The tokenizer maps every token (e.g. `token` and `##izing`) to an identifier in the vocabulary (e.g. given a certain model, `19204` is the vocabulary ID of `token` and `6026` is the vocabulary ID of the suffix `##izing`, so the word 'tokenizing' would be encoded as `[19204, 6026]`).

**!!! Warning:** BERT models limit the length of the input sequence they accept. The limit depends on the model, but it is usually 512 tokens.
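
Longer inputs therefore have to be shortened or split before they reach the model. The sketch below shows the idea on a list of token IDs, assuming that the first and last IDs are the special start and end markers (introduced in the next section) and should be kept. `truncate_ids` is a hypothetical helper; the tokenizers in `transformers` offer their own truncation options.

```python
def truncate_ids(ids, max_length=512):
    """Shorten an encoded sequence to `max_length`, keeping its first and last IDs."""
    if len(ids) <= max_length:
        return list(ids)
    # Keep the start marker, as much content as fits, and the end marker.
    return [ids[0]] + ids[1:max_length - 1] + [ids[-1]]

# An artificially long sequence: start marker, 1000 content IDs, end marker.
long_ids = [101] + [1996] * 1000 + [102]
short_ids = truncate_ids(long_ids)
print(len(long_ids), "->", len(short_ids))
```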

**!!! VERY important:** different models may have different token-to-id mappings. When we use an existing model, we must use the same tokenizer (and therefore the same vocabulary mapping) that was used when training the model.
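
The following toy example shows why the tokenizer has to match the model. It uses two made-up vocabularies that contain the same tokens under different IDs; `encode` and `decode` are hypothetical helpers written for this illustration.

```python
# Two hypothetical vocabularies: same tokens, different IDs.
vocab_a = {"[CLS]": 0, "[SEP]": 1, "the": 2, "cell": 3}
vocab_b = {"[CLS]": 0, "[SEP]": 1, "cell": 2, "the": 3}

def encode(tokens, vocab):
    """Map each token to its ID in the given vocabulary."""
    return [vocab[token] for token in tokens]

def decode(ids, vocab):
    """Map each ID back to its token in the given vocabulary."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[idx] for idx in ids]

tokens = ["[CLS]", "the", "cell", "[SEP]"]
ids = encode(tokens, vocab_a)

print(decode(ids, vocab_a))  # same vocabulary: the original tokens come back
print(decode(ids, vocab_b))  # other vocabulary: 'the' and 'cell' are swapped
```

A model only ever sees the IDs, so if they were produced with the wrong vocabulary it would silently process a different sentence.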

#### The inner workings of BERT tokenization

Tokenization steps:

1. The text is split into tokens, which can be:
  * words
  * parts of words
  * punctuation symbols

2. The tokenizer adds special tokens:
  * `[CLS]` indicating that this is the beginning of the input sequence.
  * `[SEP]` indicating that it is the end of the sequence (or a sequence delimiter if we have a pair of sequences as input).

3. The tokenizer maps each token to its vocabulary ID.

Let's explore this in code.

In [19]:
# Load the tokenizer of a certain BERT model
our_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [20]:
# The `encode` function does the three steps in one go (splits the sequence, adds
# special tokens, and converts them into a sequence of IDs):
encoded_seq = our_tokenizer.encode('The cell is relentlessly guarded by guards.')
print(encoded_seq)

[101, 1996, 3526, 2003, 21660, 2135, 13802, 2011, 4932, 1012, 102]


And there are also functions that translate the vocabulary IDs to the word forms (given a certain tokenizer):

In [21]:
# The `convert_ids_to_tokens` returns the tokens that correspond to the IDs of
# an encoded sequence:
tokens = our_tokenizer.convert_ids_to_tokens(encoded_seq)
print(tokens)

['[CLS]', 'the', 'cell', 'is', 'relentless', '##ly', 'guarded', 'by', 'guards', '.', '[SEP]']
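
The three tokenization steps can be imitated with a toy tokenizer. The sketch below assumes a tiny vocabulary built from the IDs printed above (plus an assumed `[UNK]` entry), lowercases the text because the model used here is uncased, and splits each word greedily into the longest matching vocabulary pieces, which is the general idea behind WordPiece. The names `toy_vocab`, `wordpiece` and `toy_encode` are made up for illustration, and the real tokenizer handles many more cases.

```python
import re

# Toy vocabulary: the IDs are the ones printed for this sentence above.
toy_vocab = {
    "[UNK]": 100,  # assumed ID for the unknown-token placeholder
    "[CLS]": 101, "[SEP]": 102, "the": 1996, "cell": 3526, "is": 2003,
    "relentless": 21660, "##ly": 2135, "guarded": 13802, "by": 2011,
    "guards": 4932, ".": 1012,
}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            # Pieces that do not start the word carry the '##' prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no piece matches: give up on the whole word
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def toy_encode(text, vocab):
    words = re.findall(r"\w+|[^\w\s]", text.lower())           # step 1: words and punctuation...
    tokens = [p for w in words for p in wordpiece(w, vocab)]    # ...split into word pieces
    tokens = ["[CLS]"] + tokens + ["[SEP]"]                     # step 2: add special tokens
    return tokens, [vocab[t] for t in tokens]                   # step 3: map tokens to IDs

tokens, ids = toy_encode("The cell is relentlessly guarded by guards.", toy_vocab)
print(tokens)
print(ids)
```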


### 2.6. The feature extraction pipeline

Here we will see how to get vectors for words in context.

Similarly to what we did with the static embeddings in section 1, we may also want to have access to the vector of a certain word. However, unlike with static embeddings, the vector of a word will depend on the context in which the word occurs. This means that we can't just ask for the vector of the word "apple", for example: we will need to ask for the vector of the word "apple" given a certain context.

We first import the following two libraries, which will help us work with vectors:

In [22]:
import numpy as np # python library used for working with vectors
from scipy import spatial # package to help compute distance or similarity between vectors

The pipeline task to obtain the vectors for tokens in a sequence is `feature-extraction`. As you can see, creating this pipeline is very similar to creating the `fill-mask` pipeline.

We will store the pipeline in a variable called `nlp_features`:

In [23]:
nlp_features = pipeline("feature-extraction",
                    model='distilbert-base-uncased',
                    tokenizer='distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Given a sentence (here, one with a masked element), the pipeline tokenizes the input and returns one vector per token:

In [24]:
sentence = "They were told that the [MASK] stopped working."

output = nlp_features(sentence)
output_vectors = np.squeeze(output) # This removes single-dimensional entries (i.e. for vector readability)

Let's inspect the output. First of all, let's print it:

In [25]:
print(output_vectors)

[[-0.07321426  0.09112739  0.13822553 ... -0.04309447  0.07258517
   0.19995576]
 [ 0.04927437  0.23628353  0.29838353 ... -0.07781299  0.42321664
  -0.35109937]
 [ 0.25163767 -0.08905183  0.13250721 ... -0.04209455  0.21886498
   0.04611908]
 ...
 [ 0.44022593  0.04324771 -0.05111429 ... -0.24669521  0.16832688
  -0.2145575 ]
 [ 0.60124731  0.21446064 -0.34401369 ...  0.0698016  -0.41985661
  -0.43481502]
 [-0.0759867   0.24979676  0.34999335 ...  0.02171966  0.3519392
  -0.05327559]]


This is an array (a list of vectors). Let's see its shape:

In [26]:
print(output_vectors.shape) # Print the shape of the vector

(11, 768)


This means that we have an array (in other words a matrix, a table) that has 11 vectors of length 768 (or, in other words, 11 rows with 768 columns).

**Question:** 11 vectors? Why 11?

Let's see how the sentence is tokenized (we've seen how above):

In [27]:
# Load the **SAME** tokenizer used in the pipeline:
our_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Encode the sentence into a sequence of vocabulary IDs
encoded_seq = our_tokenizer.encode(sentence)
print(encoded_seq)

# And get the tokens given the vocabulary IDs
tokens = our_tokenizer.convert_ids_to_tokens(encoded_seq)
print(tokens)

# And print the length of the tokenized sequence:
print(len(tokens))

[101, 2027, 2020, 2409, 2008, 1996, 103, 3030, 2551, 1012, 102]
['[CLS]', 'they', 'were', 'told', 'that', 'the', '[MASK]', 'stopped', 'working', '.', '[SEP]']
11


As you can see, the input sentence has been tokenized into 11 tokens. So what we have in the above array is 11 vectors, each one representing a token in the context of the sentence and **keeping the order of tokens**: the first vector corresponds to the special token `[CLS]`, the second vector to the token `they`, and so on until the last vector, which corresponds to the special token `[SEP]`.

How do we get the vector of the `[MASK]`?

In [28]:
print(tokens[6]) # [MASK] is at index 6 of the tokenized sentence, i.e. it is the 7th element (we start counting from zero)

[MASK]


In [29]:
print(output_vectors[6]) # So the vector at index 6 of output_vectors is the vector of the [MASK] in this context.

[ 3.93834978e-01  9.59944427e-02  4.58759703e-02  2.51016229e-01
  5.57362139e-01  2.37901658e-01 -2.06523180e-01  5.89200631e-02
 -1.73569813e-01  1.68626867e-02  8.82697105e-02  3.52528453e-01
 -1.32263631e-01  2.57606536e-01 -1.32825673e-01  1.76140487e-01
  1.14207864e-01 -1.24538951e-02 -1.40964329e-01  4.74948198e-01
 -1.78421646e-01 -9.54175964e-02  1.69281527e-01  8.99420232e-02
 -3.75428885e-01  3.06230724e-01  1.00276358e-02  2.63351649e-01
 -7.12660775e-02  9.17139128e-02  2.20640481e-01 -2.19783895e-02
  2.15243712e-01 -2.72956997e-01 -4.59161177e-02 -2.79989056e-02
 -1.75952762e-02 -1.73659176e-02 -1.86118573e-01 -2.65843421e-01
  1.84799224e-01 -4.14932668e-02 -9.50016379e-02  1.64442118e-02
 -2.49477327e-02  3.27352285e-01 -2.33344987e-01  2.11645011e-02
 -5.38533255e-02 -1.30768895e-01 -4.00706470e-01  1.28413379e-01
  6.17664099e-01 -2.73852825e-01  2.46800318e-01 -3.25068533e-02
 -1.53046504e-01 -6.70900196e-03  2.65286595e-01  1.04872912e-01
  1.05188876e-01 -2.29741
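
The relation between tokens and vectors can be summarized with a miniature example. The nested list below imitates the shape of the pipeline output (one sequence, one vector per token), but with 4 tokens and 3 made-up dimensions instead of 11 tokens and 768 dimensions. All names and numbers are invented for illustration.

```python
# Hypothetical miniature of the pipeline output: 1 sequence x 4 tokens x 3 dimensions.
toy_output = [[
    [0.1, 0.2, 0.3],   # [CLS]
    [0.4, 0.5, 0.6],   # they
    [0.7, 0.8, 0.9],   # [MASK]
    [1.0, 1.1, 1.2],   # [SEP]
]]
toy_tokens = ["[CLS]", "they", "[MASK]", "[SEP]"]

vectors = toy_output[0]                       # drop the outer dimension (what np.squeeze does here)
assert len(vectors) == len(toy_tokens)        # one vector per token, in the same order

token_to_vector = list(zip(toy_tokens, vectors))
mask_vector = vectors[toy_tokens.index("[MASK]")]
print(mask_vector)
```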

#### Compute the similarity between words in contexts

The following function (`get_embedding`) gets us a vector for a token in a sentence. It needs, as input:
* `sentence`: the sentence where the target token appears.
* `target_token`: the token for which we want to get a vector.
* `nlp_features`: the `feature-extraction` pipeline.
* `tokenizer`: the same tokenizer used by the `feature-extraction` pipeline.

The function prints the list of encoded tokens in the sentence, the list of tokens, and the position of the target token in the sentence. The output of the function is the vector representing the target token in the sentence.

In [41]:
def get_embedding(sentence, target_token, nlp_features, tokenizer):
    """
    Function that returns the contextualized vector of a target token.
    
    Arguments:
        * sentence: the full sentence in which the target token occurs.
        * target_token: the target token (the token whose vector we want).
        * nlp_features: the feature-extraction pipeline.
        * tokenizer: the tokenizer used by the feature-extraction pipeline.
    
    Returns:
        The vector representing the target token in the sentence.
    """
    encoded_seq = tokenizer.encode(sentence) # Tokenize and encode tokens into vocabulary IDs
    tokens = tokenizer.convert_ids_to_tokens(encoded_seq) # Get the tokens corresponding to the IDs.
    target_id_in_sentence = tokens.index(target_token) # Find the position of the target token in the sentence.
    output = nlp_features(sentence) # Use the feature-extraction pipeline to convert the sentence into an array.
    output = np.squeeze(output) # Squeeze the output from feature-extraction for readability.
    print("Encoded tokens in sentence:", encoded_seq)
    print("Tokens in sentence:", tokens)
    print("Position of target id:", target_id_in_sentence)
    return output[target_id_in_sentence] # Return the nth vector in the array (where n is the position of the target token in the sentence).

We can call the function for different sentences:

In [42]:
# Get the embedding for [MASK] in the following sentence:
sentence_1 = "They were told that the [MASK] stopped working."
target_token_1 = "[MASK]"

tok_embedding_1 = get_embedding(sentence_1, target_token_1, nlp_features, our_tokenizer)

Encoded tokens in sentence: [101, 2027, 2020, 2409, 2008, 1996, 103, 3030, 2551, 1012, 102]
Tokens in sentence: ['[CLS]', 'they', 'were', 'told', 'that', 'the', '[MASK]', 'stopped', 'working', '.', '[SEP]']
Position of target id: 6


In [32]:
# Get the embedding for [MASK] in the following sentence:
sentence_2 = "The [MASK] worked in the factory until dawn."
target_token_2 = "[MASK]"

tok_embedding_2 = get_embedding(sentence_2, target_token_2, nlp_features, our_tokenizer)

Encoded tokens in sentence: [101, 1996, 103, 2499, 1999, 1996, 4713, 2127, 6440, 1012, 102]
Tokens in sentence: ['[CLS]', 'the', '[MASK]', 'worked', 'in', 'the', 'factory', 'until', 'dawn', '.', '[SEP]']
Position of target id: 2


We can now get the cosine similarity between any pair of vectors using the following (where `tok_embedding_1` and `tok_embedding_2` are the two vectors we want to compare).

In [33]:
print(1 - spatial.distance.cosine(tok_embedding_1, tok_embedding_2))

0.7452569641310333
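
For reference, the quantity computed above can be written out as follows. SciPy's `spatial.distance.cosine(u, v)` returns the cosine *distance*, i.e. one minus the cosine similarity, which is why the code subtracts it from 1:

$$
\text{similarity}(u, v) \;=\; 1 - \text{distance}(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \;=\; \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
$$

Values close to 1 mean that the two vectors point in nearly the same direction; values close to 0 mean that they are nearly orthogonal.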


Try to find the similarity between words in different sentences.

In [45]:
# Example 1
sentence_1 = "I would like to eat an apple."
target_token_1 = "apple"
tok_embedding_1 = get_embedding(sentence_1, target_token_1, nlp_features, our_tokenizer)

print()

# Example 2
sentence_2 = "I work with an apple macbook."
target_token_2 = "apple"
tok_embedding_2 = get_embedding(sentence_2, target_token_2, nlp_features, our_tokenizer)

Encoded tokens in sentence: [101, 1045, 2052, 2066, 2000, 4521, 2019, 6207, 1012, 102]
Tokens in sentence: ['[CLS]', 'i', 'would', 'like', 'to', 'eat', 'an', 'apple', '.', '[SEP]']
Position of target id: 7

Encoded tokens in sentence: [101, 1045, 2147, 2007, 2019, 6207, 6097, 8654, 1012, 102]
Tokens in sentence: ['[CLS]', 'i', 'work', 'with', 'an', 'apple', 'mac', '##book', '.', '[SEP]']
Position of target id: 5


In [46]:
# Example 3
sentence_3 = "I made apples in the oven."
target_token_3 = "apples"
tok_embedding_3 = get_embedding(sentence_3, target_token_3, nlp_features, our_tokenizer)

print()

# Example 4
sentence_4 = "My apple device crashed and I had to restart it."
target_token_4 = "apple"
tok_embedding_4 = get_embedding(sentence_4, target_token_4, nlp_features, our_tokenizer)

Encoded tokens in sentence: [101, 1045, 2081, 18108, 1999, 1996, 17428, 1012, 102]
Tokens in sentence: ['[CLS]', 'i', 'made', 'apples', 'in', 'the', 'oven', '.', '[SEP]']
Position of target id: 3

Encoded tokens in sentence: [101, 2026, 6207, 5080, 8007, 1998, 1045, 2018, 2000, 23818, 2009, 1012, 102]
Tokens in sentence: ['[CLS]', 'my', 'apple', 'device', 'crashed', 'and', 'i', 'had', 'to', 'restart', 'it', '.', '[SEP]']
Position of target id: 2


In [47]:
# Cosine similarity
print("Similarity sent1 and sent2:", 1 - spatial.distance.cosine(tok_embedding_1, tok_embedding_2))
print("Similarity sent1 and sent3:", 1 - spatial.distance.cosine(tok_embedding_1, tok_embedding_3))
print("Similarity sent1 and sent4:", 1 - spatial.distance.cosine(tok_embedding_1, tok_embedding_4))
print("Similarity sent2 and sent4:", 1 - spatial.distance.cosine(tok_embedding_2, tok_embedding_4))

Similarity sent1 and sent2: 0.7331894389522713
Similarity sent1 and sent3: 0.8244613730693006
Similarity sent1 and sent4: 0.677632658232054
Similarity sent2 and sent4: 0.8349916990418413
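
One caveat of `get_embedding`: `tokens.index(target_token)` only finds the first occurrence of the target, and it raises a `ValueError` if the target word was split into several word pieces (e.g. `macbook` becomes `mac` and `##book` above). A possible workaround, sketched below under the assumption that averaging the vectors of the pieces is acceptable for your purpose, is to look up the whole run of pieces and take the element-wise mean. The helper names and the two-dimensional vectors are invented for illustration.

```python
def find_subsequence(tokens, pieces):
    """Return the start index of the first occurrence of `pieces` in `tokens`, or -1."""
    for start in range(len(tokens) - len(pieces) + 1):
        if tokens[start:start + len(pieces)] == pieces:
            return start
    return -1

def average_vectors(vectors):
    """Element-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Tokens as printed above for "I work with an apple macbook."
tokens = ['[CLS]', 'i', 'work', 'with', 'an', 'apple', 'mac', '##book', '.', '[SEP]']
# Made-up 2-dimensional vectors, one per token (the real ones have 768 dimensions).
toy_vectors = [[float(i), float(i) * 2] for i in range(len(tokens))]

pieces = ['mac', '##book']
start = find_subsequence(tokens, pieces)
macbook_vector = average_vectors(toy_vectors[start:start + len(pieces)])
print(start, macbook_vector)
```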


✏️ **Exercise:**

In [48]:
# Create two `feature-extraction` pipelines, one for the 1760-1850 model, and
# one for the 1890-1900 model. Find out whether the cosine similarity between
# words in sequences changes depending on which BERT model you use.
# 
# Type your code here:

👀 **If you are interested in knowing more, we recommend:**
* The HuggingFace tutorials: https://huggingface.co/course/chapter1/1
* BERT for Humanists: http://www.bertforhumanists.org/