In [None]:
pip install transformers

<font color=blue><b>If you run into compatibility issues with numpy or matplotlib after installing transformers, upgrade numpy:

In [None]:
pip install numpy --upgrade

<font color=blue><b>Import PyTorch, the pretrained BERT model, and a BERT tokenizer.<br>
The BERT model was pretrained by Google over many hours on Wikipedia and BookCorpus, a dataset containing more than 10,000 books of different genres. <br>
The transformers library provides a number of classes for applying BERT to different tasks.

In [1]:
import torch
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt


# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# <font color=blue><b> Tokenization

<font color=blue>As BERT is a pretrained model, it expects input data in a specific format: <br>

 [SEP] to mark the end of a sentence, or the delimiter between two sentences<br>
 [CLS] to mark the beginning of the text<br>
Token IDs for the tokens, from BERT's tokenizer<br>
Mask IDs to indicate which elements in the sequence are tokens and which are padding elements<br>
Segment IDs to distinguish different sentences<br>
Positional embeddings to show each token's position within the sequence<br>
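
As a rough illustration of the format listed above, the sketch below builds token IDs, mask IDs, and segment IDs by hand for a short padded input. `VOCAB` and `format_input` are hypothetical helpers written for this example, not part of transformers; the IDs are copied from the tokenizer output shown in this notebook, and 0 is assumed to be the padding ID (matching `padding_idx=0` in the model printout).

```python
# Toy vocabulary (hypothetical): IDs taken from the tokenizer output in this notebook.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "i": 1045, "will": 2097, "book": 2338, "flight": 3462}


def format_input(words, max_len):
    """Wrap words in [CLS]/[SEP], map them to IDs, and pad to max_len."""
    tokens = ["[CLS]"] + [w.lower() for w in words] + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(tokens)
    return {
        # Token IDs, followed by padding IDs.
        "input_ids": [VOCAB[t] for t in tokens] + [VOCAB["[PAD]"]] * n_pad,
        # Mask IDs: 1 for real tokens, 0 for padding.
        "attention_mask": [1] * len(tokens) + [0] * n_pad,
        # Segment IDs: a single sentence uses one value throughout.
        "token_type_ids": [0] * max_len,
    }


encoded = format_input(["I", "will", "book", "flight"], max_len=8)
print(encoded)
```

The positional information is not part of this dictionary; the model derives it from each token's index in the sequence.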

Input can be formatted either by explicit coding or with the transformers tokenizer functions. Both methods are illustrated.<br>

In [44]:
text = "I will book flight  ticket and I will read this book in the flight said David book"
marker = "[CLS] " + text+ " [SEP]"

# Tokenize the sentence with the BERT tokenizer.
tokens= tokenizer.tokenize(marker)

# Map the token strings to their vocabulary indices.
index = tokenizer.convert_tokens_to_ids(tokens)

# Print  the tokens.
print('-------------------')
print('Tokens generated from BERT')
print('-------------------')
print (tokens)
# Display the words with their indices.
print('------------------------')
print('TOKEN           INDEX')
print('------------------------')
for tup in zip(tokens, index ):
    print('{:<12} {:>6,}'.format(tup[0], tup[1]))

-------------------
Tokens generated from BERT
-------------------
['[CLS]', 'i', 'will', 'book', 'flight', 'ticket', 'and', 'i', 'will', 'read', 'this', 'book', 'in', 'the', 'flight', 'said', 'david', 'book', '[SEP]']
------------------------
TOKEN           INDEX
------------------------
[CLS]           101
i             1,045
will          2,097
book          2,338
flight        3,462
ticket        7,281
and           1,998
i             1,045
will          2,097
read          3,191
this          2,023
book          2,338
in            1,999
the           1,996
flight        3,462
said          2,056
david         2,585
book          2,338
[SEP]           102


The code below is an alternative way to tokenize using the transformers API.<br>
 The tokenizer.encode_plus function performs the steps coded in the cell above in a single call. Here add_special_tokens=False is passed, so the returned IDs omit [CLS] (101) and [SEP] (102).<br>
 Either method can be used to generate the token IDs.
 

In [43]:
data = "I will book flight  ticket and I will read this book in the flight said David book"
toks = tokenizer.tokenize(data)
output = tokenizer.encode_plus(data, add_special_tokens=False)
toks_converted = tokenizer.convert_ids_to_tokens(output['input_ids'])
print(output)

{'input_ids': [1045, 2097, 2338, 3462, 7281, 1998, 1045, 2097, 3191, 2023, 2338, 1999, 1996, 3462, 2056, 2585, 2338], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


# Segment ID
<font color=blue>BERT is trained on and expects sentence pairs, using 0s and 1s to distinguish between the two sentences.<br>
That is, for each token in "tokens" we must specify which sentence it belongs to: sentence 0 (a series of 0s) or sentence 1 (a series of 1s).<br>
For a single-sentence input, every token gets the same segment ID. This notebook uses a vector of 1s, one per token; note that the tokenizer itself assigns 0s to a single sentence, as the token_type_ids in the encode_plus output above show.<br>

If you want to process two sentences, assign each token in the first sentence plus its '[SEP]' token a 0, and all tokens of the second sentence a 1.
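
The two-sentence rule can be sketched in a few lines of plain Python. The `segment_ids` helper and the example token list are hypothetical and written only for illustration; they assume the tokens already contain the `[CLS]` and `[SEP]` markers.

```python
def segment_ids(tokens):
    """Return 0 for every token up to and including the first [SEP], 1 afterwards."""
    ids = []
    current = 0
    for tok in tokens:
        ids.append(current)
        # Switch to segment 1 once the first sentence has been closed.
        if tok == "[SEP]":
            current = 1
    return ids


pair = ["[CLS]", "i", "will", "book", "flight", "[SEP]",
        "i", "read", "this", "book", "[SEP]"]
print(segment_ids(pair))
```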

In [45]:
# Mark each of the 19 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokens)

print (segments_ids)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


# Run the BERT model on the input text<br>
To run BERT on the input text, the data must be converted to torch tensors, because <br>
the BERT PyTorch interface requires torch tensors rather than Python lists.

In [46]:
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([index ])
segments_tensors = torch.tensor([segments_ids])
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',output_hidden_states = True,)

# Put the model in "evaluation" mode, which disables dropout for the forward pass.
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

# Interpretation of above output
Calling BertModel.from_pretrained fetches the pretrained model from the internet.<br>
model.eval() puts the model in evaluation mode as opposed to training mode; it also returns the model, which is why the architecture is printed in the output above.

Evaluate BERT on the given text and fetch the hidden states of the network.

In [47]:
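# Run the text through BERT and collect the hidden states from the embedding layer and all 12 encoder layers.
with torch.no_grad():

    # Note: in transformers, the second positional argument of BertModel is
    # attention_mask, so the all-ones segments_tensors acts as an attention mask here.
    # Pass token_type_ids=segments_tensors to supply it as segment IDs instead.
    outputs = model(tokens_tensor, segments_tensors)

    # With output_hidden_states=True, the third element holds the hidden states.
    hidden_states = outputs[2]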

Print the dimensions of the hidden states of the network.

In [49]:
print('----------------------------------------')
print('Network hidden states')
print('----------------------------------------')
print ("Number of layers:", len(hidden_states), )
layer_i = 0

print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0

print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0

print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

----------------------------------------
Network hidden states
----------------------------------------
Number of layers: 13
Number of batches: 1
Number of tokens: 19
Number of hidden units: 768


# Interpretation of above output
<font color=blue>The hidden states of the network have four dimensions, as listed below: <br>
The layer number (13 layers: initial embeddings + 12 BERT layers)<br>
The batch number (1 sentence)<br>
The word / token number (19 tokens in our sentence)<br>
The hidden unit / feature number (768 features)<br>


 Let's look at the different instances of the word "book" and see whether context is preserved.

In [50]:
for i, token_str in enumerate(tokens):
  print (i, token_str)

0 [CLS]
1 i
2 will
3 book
4 flight
5 ticket
6 and
7 i
8 will
9 read
10 this
11 book
12 in
13 the
14 flight
15 said
16 david
17 book
18 [SEP]


Combine the layers into one big tensor.

In [51]:
# Concatenate the tensors using `stack` to create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)

token_embeddings.size()

torch.Size([13, 1, 19, 768])

Current dimensions produced by the model:

[# layers, # batches, # tokens, # features]

Remove the "batches" dimension since we don't need it.

In [52]:
# Remove dimension 1, the "batches".
token_embeddings = torch.squeeze(token_embeddings, dim=1)

token_embeddings.size()

torch.Size([13, 19, 768])

from the above output<br>
Current dimensions :

[# layers,  # tokens, # features]

Use the PyTorch permute function to swap the token and layer dimensions.

In [53]:
# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)

token_embeddings.size()

torch.Size([19, 13, 768])

from the above output<br>
Current dimensions :

[  # tokens,# layers, # features]
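
To make the dimension swap concrete, here is a small sketch that uses plain nested Python lists as a stand-in for the tensor (an assumption made so the example runs without PyTorch). The toy `layers` data is invented; `zip(*layers)` regroups it from [layers, tokens, features] to [tokens, layers, features], which is the same reordering that `permute(1,0,2)` performs.

```python
# Toy "hidden states": 3 layers x 2 tokens x 2 features (invented values).
layers = [
    [[1, 2], [3, 4]],      # layer 0: token 0, token 1
    [[5, 6], [7, 8]],      # layer 1
    [[9, 10], [11, 12]],   # layer 2
]

# zip(*layers) collects the i-th token from every layer,
# turning [layers][tokens][features] into [tokens][layers][features].
tokens_first = [list(per_token) for per_token in zip(*layers)]

print(len(tokens_first), len(tokens_first[0]), len(tokens_first[0][0]))
print(tokens_first[1])  # all layer vectors for token 1
```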

Word or sentence vectors can now be generated from the hidden states.<br>
Here we show only word vector creation.<br>
There are 19 tokens in total (including [CLS] and [SEP]), and we need an individual feature vector for each one.<br>
The 13 layers of the model have each produced a different vector for every token,<br>
so each input token has 13 separate vectors, each of length 768.<br>
To get the final word embedding, we sum the vectors from the last 3 layers.

In [54]:
# Stores the token vectors, with shape [19 x 768]
token_vecs_sum = []


# For each token in the sentence...
for token in token_embeddings:

    # Sum the vectors from the last 3 layers.
    sum_vec = torch.sum(token[-3:], dim=0)
    
    # Use `sum_vec` to represent `token`.
    token_vecs_sum.append(sum_vec)

print ('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))

Shape is: 19 x 768
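
For intuition, the element-wise sum over the last layers can be worked through on a tiny example. The sketch below uses plain Python lists instead of tensors, and `sum_last_layers` and the toy values are hypothetical; it mirrors what `torch.sum(token[-3:], dim=0)` computes for a single token.

```python
def sum_last_layers(token_layers, k=3):
    """Element-wise sum of the last k layer vectors for one token."""
    return [sum(values) for values in zip(*token_layers[-k:])]


# One token with 4 layers of 2 features each (invented values).
token = [[1.0, 1.0], [2.0, 0.0], [0.5, 0.5], [1.5, -1.0]]

# Only the last 3 layers contribute: [2.0, 0.0] + [0.5, 0.5] + [1.5, -1.0].
print(sum_last_layers(token))
```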


In [55]:
for i, token_str in enumerate(tokens):
  print (i, token_str)

0 [CLS]
1 i
2 will
3 book
4 flight
5 ticket
6 and
7 i
8 will
9 read
10 this
11 book
12 in
13 the
14 flight
15 said
16 david
17 book
18 [SEP]


Verify that the vectors reflect the context.

In [56]:
print('First 5 vector values for each instance of "book".')
print('')
print("book flight ticket  ", str(token_vecs_sum[3][:5]))
print("read this book   ", str(token_vecs_sum[11][:5]))
print("David book   ", str(token_vecs_sum[17][:5]))

First 5 vector values for each instance of "book".

book flight ticket   tensor([ 0.8234, -1.0166,  3.4270, -1.4349,  2.8383])
read this book    tensor([ 0.8677,  1.8590,  1.5261, -1.3450,  3.4283])
David book    tensor([ 0.2270, -0.0197,  1.1182, -0.0579,  1.6186])


The token "book" appears in 3 different places in the sentence, each with a different context.<br>
Positions of "book": [3] [11] [17]


# Cosine similarity between the vectors, for a more precise comparison.

In [59]:
from scipy.spatial.distance import cosine

# Calculate the cosine similarity between "book" at index 3 ("book flight ticket")
# and the other two occurrences: index 11 ("read this book") and index 17 ("David book").
sim_3_11 = 1 - cosine(token_vecs_sum[3], token_vecs_sum[11])
sim_3_17 = 1 - cosine(token_vecs_sum[3], token_vecs_sum[17])

print('Vector similarity, "book" [3] vs "book" [11]:  %.2f' % sim_3_11)
print('Vector similarity, "book" [3] vs "book" [17]:  %.2f' % sim_3_17)

Vector similarity, "book" [3] vs "book" [11]:  0.61
Vector similarity, "book" [3] vs "book" [17]:  0.51
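
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, and scipy's `cosine` returns one minus that value (a distance), which is why the cell above computes `1 - cosine(...)`. The sketch below spells the formula out in plain Python; `cosine_similarity` is a hypothetical helper for illustration and the input vectors are invented.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal, 0
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))   # 45 degrees apart
```

Because the measure depends only on direction, it is unaffected by the overall magnitude of the summed layer vectors.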
