## Tokenizer

In [25]:
import pandas as pd
from transformers import BertTokenizer
import tensorflow as tf

In [2]:
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

In [3]:
result = tokenizer.tokenize('Here is the sentence I want embeddings for.')
print(result)

['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']


In [4]:
print(tokenizer.vocab['embeddings'])

KeyError: 'embeddings'

In [5]:
print(tokenizer.vocab['em'])
print(tokenizer.vocab['##bed'])

7861
8270


In [8]:
with open("Models/BERT/vocabulary.txt",'r') as f:
    for token in tokenizer.vocab.keys():
        f.write(token + '\n')

UnsupportedOperation: not writable

In [9]:
df = pd.read_fwf("Models/BERT/vocabulary.txt", header=None)
print("Size of vocabulary set: ",len(df))

Size of vocabulary set:  1067


In [10]:
print(df.loc[0].values[0])
print(df.loc[100].values[0])
print(df.loc[101].values[0])
print(df.loc[102].values[0])
print(df.loc[103].values[0])

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]


In [16]:
result = tokenizer.tokenize("I am studying about Tokai")
print(result)

['i', 'am', 'studying', 'about', 'to', '##kai', 'un', '##iv', '.']


## MLM

In [1]:
from transformers import TFBertForMaskedLM
from transformers import AutoTokenizer

In [3]:
model = TFBertForMaskedLM.from_pretrained("bert-large-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")

Downloading:   0%|          | 0.00/1.47G [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-large-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [4]:
inputs = tokenizer("I am [MASK] of Tokai University.", return_tensors='tf')

In [5]:
print(inputs['input_ids'])

tf.Tensor([[  101  1045  2572   103  1997  2000 11151  2118  1012   102]], shape=(1, 10), dtype=int32)


In [6]:
print(inputs['token_type_ids'])

tf.Tensor([[0 0 0 0 0 0 0 0 0 0]], shape=(1, 10), dtype=int32)


In [7]:
print(inputs['attention_mask'])

tf.Tensor([[1 1 1 1 1 1 1 1 1 1]], shape=(1, 10), dtype=int32)


In [8]:
from transformers import FillMaskPipeline

In [10]:
FillMaskPipeline(model=model, tokenizer=tokenizer)("I am [MASK] of Tokai University.")

[{'score': 0.41616421937942505,
  'token': 4619,
  'token_str': 'graduate',
  'sequence': 'i am graduate of tokai university.'},
 {'score': 0.21492940187454224,
  'token': 2343,
  'token_str': 'president',
  'sequence': 'i am president of tokai university.'},
 {'score': 0.1335262656211853,
  'token': 3076,
  'token_str': 'student',
  'sequence': 'i am student of tokai university.'},
 {'score': 0.08211586624383926,
  'token': 2934,
  'token_str': 'professor',
  'sequence': 'i am professor of tokai university.'},
 {'score': 0.03578830510377884,
  'token': 19678,
  'token_str': 'alumnus',
  'sequence': 'i am alumnus of tokai university.'}]

In [14]:
FillMaskPipeline(model=model, tokenizer=tokenizer)("Sushi is [MASK].")

[{'score': 0.19599762558937073,
  'token': 2800,
  'token_str': 'available',
  'sequence': 'sushi is available.'},
 {'score': 0.15111589431762695,
  'token': 2204,
  'token_str': 'good',
  'sequence': 'sushi is good.'},
 {'score': 0.11108103394508362,
  'token': 2366,
  'token_str': 'served',
  'sequence': 'sushi is served.'},
 {'score': 0.05522491782903671,
  'token': 12090,
  'token_str': 'delicious',
  'sequence': 'sushi is delicious.'},
 {'score': 0.05283496528863907,
  'token': 2759,
  'token_str': 'popular',
  'sequence': 'sushi is popular.'}]

## NSP

In [44]:
import tensorflow as tf
from transformers import TFBertForNextSentencePrediction
from transformers import AutoTokenizer

In [45]:
model = TFBertForNextSentencePrediction.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/536M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForNextSentencePrediction.

All the layers of TFBertForNextSentencePrediction were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForNextSentencePrediction for predictions without further training.


Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [46]:
prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
next_sentence = "pizza is eaten with the use of a knife and fork. In casual settings, however, it is cut into wedges to be eaten while held in the hand."

In [47]:
encoding = tokenizer(prompt, next_sentence, return_tensors='tf')

In [48]:
print(encoding["input_ids"])

tf.Tensor(
[[  101  1999  3304  1010 10733  2366  1999  5337 10906  1010  2107  2004
   2012  1037  4825  1010  2003  3591  4895 14540  6610  2094  1012   102
  10733  2003  8828  2007  1996  2224  1997  1037  5442  1998  9292  1012
   1999 10017 10906  1010  2174  1010  2009  2003  3013  2046 17632  2015
   2000  2022  8828  2096  2218  1999  1996  2192  1012   102]], shape=(1, 58), dtype=int32)


In [49]:
print(tokenizer.cls_token, ':', tokenizer.cls_token_id)
print(tokenizer.sep_token, ':' , tokenizer.sep_token_id)

[CLS] : 101
[SEP] : 102


In [50]:
print(tokenizer.decode(encoding["input_ids"][0]))

[CLS] in italy, pizza served in formal settings, such as at a restaurant, is presented unsliced. [SEP] pizza is eaten with the use of a knife and fork. in casual settings, however, it is cut into wedges to be eaten while held in the hand. [SEP]


In [51]:
print(encoding["token_type_ids"])

tf.Tensor(
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]], shape=(1, 58), dtype=int32)


In [52]:
print(encoding["attention_mask"])

tf.Tensor(
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]], shape=(1, 58), dtype=int32)


In [53]:
logits = model(encoding["input_ids"], token_type_ids=encoding["token_type_ids"])[0]
probs = tf.keras.layers.Softmax()(logits)
print(probs)

tf.Tensor([[9.9999714e-01 2.8379097e-06]], shape=(1, 2), dtype=float32)


In [54]:
print("Predicted label: ", tf.math.argmax(probs, axis=-1).numpy())

Predicted label:  [0]


In [55]:
prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
next_sentence = "The sky is blue due to the shorter wavelength of blue light."
encoding = tokenizer(prompt, next_sentence, return_tensors='tf')

logits = model(encoding['input_ids'], token_type_ids=encoding['token_type_ids'])[0]

softmax = tf.keras.layers.Softmax()
probs = softmax(logits)

print("Predicted label: ", tf.math.argmax(probs, axis=-1).numpy())

Predicted label:  [1]
