# BERT

We shall implement BERT.  For this tutorial, you may want to first look at my Transformers tutorial to get a basic understanding of Transformers. 

For BERT, the main difference is on how we process the datasets, i.e., masking.   Aside from that, the backbone model is still the Transformers.

In [46]:
import math
import re
from   random import *
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


## 1. Data

For simplicity, we shall use very simple data like this.

In [47]:
import spacy

with open ("wiki_reduced.txt", "r") as f:
    raw_text = f.read()
nlp = spacy.load("en_core_web_sm")
doc = nlp(raw_text)
sentences = list(doc.sents)

text = [x.text.lower() for x in sentences] #lower case
text = [re.sub("[.,!?\\-]", '', x) for x in text] #clean all symbols

text

["in number theory fermat's last theorem (sometimes called fermat's conjecture especially in older texts) states that no three positive integers a b and c satisfy the equation an + bn = cn for any integer value of n greater than 2",
 'the cases n = 1 and n = 2 have been known since antiquity to have infinitely many solutions',
 'the proposition was first stated as a theorem by pierre de fermat around 1637 in the margin of a copy of arithmetica',
 'fermat added that he had a proof that was too large to fit in the margin',
 "although other statements claimed by fermat without proof were subsequently proven by others and credited as theorems of fermat (for example fermat's theorem on sums of two squares) fermat's last theorem resisted proof leading to doubt that fermat ever had a correct proof",
 'consequently the proposition became known as a conjecture rather than a theorem',
 'after 358 years of effort by mathematicians the first successful proof was released in 1994 by andrew wiles an

In [48]:
for sentence in text:
    print(sentence, "_____")
    words = sentence.split()
    print(words)

in number theory fermat's last theorem (sometimes called fermat's conjecture especially in older texts) states that no three positive integers a b and c satisfy the equation an + bn = cn for any integer value of n greater than 2 _____
['in', 'number', 'theory', "fermat's", 'last', 'theorem', '(sometimes', 'called', "fermat's", 'conjecture', 'especially', 'in', 'older', 'texts)', 'states', 'that', 'no', 'three', 'positive', 'integers', 'a', 'b', 'and', 'c', 'satisfy', 'the', 'equation', 'an', '+', 'bn', '=', 'cn', 'for', 'any', 'integer', 'value', 'of', 'n', 'greater', 'than', '2']
the cases n = 1 and n = 2 have been known since antiquity to have infinitely many solutions _____
['the', 'cases', 'n', '=', '1', 'and', 'n', '=', '2', 'have', 'been', 'known', 'since', 'antiquity', 'to', 'have', 'infinitely', 'many', 'solutions']
the proposition was first stated as a theorem by pierre de fermat around 1637 in the margin of a copy of arithmetica _____
['the', 'proposition', 'was', 'first', 's

### Making vocabs

Before making the vocabs, let's remove all question marks and perios, etc, then turn everything to lowercase, and then simply split the text. 

In [49]:
# combine everything into one to make vocabs
word_list = list(set(" ".join(text).split()))
word2id = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3}  #special tokens.

#create the word2id
for i, w in enumerate(word_list):
    word2id[w] = i + 4  #because 0-3 are already occupied
    id2word = {i: w for i, w in enumerate(word2id)}
    vocab_size = len(word2id)

#list of all tokens for whole text
token_list = list()
for sentence in sentences:
    arr = [word2id[word] for sentence in text for word in sentence.split()]
    token_list.append(arr)

In [50]:
#take a look at sentences
sentences

[In number theory, Fermat's Last Theorem (sometimes called Fermat's conjecture, especially in older texts) states that no three positive integers a, b, and c satisfy the equation an + bn = cn for any integer value of n greater than 2.,
 The cases n = 1 and n = 2 have been known since antiquity to have infinitely many solutions.,
 The proposition was first stated as a theorem by Pierre de Fermat around 1637 in the margin of a copy of Arithmetica.,
 Fermat added that he had a proof that was too large to fit in the margin.,
 Although other statements claimed by Fermat without proof were subsequently proven by others and credited as theorems of Fermat (for example, Fermat's theorem on sums of two squares), Fermat's Last Theorem resisted proof, leading to doubt that Fermat ever had a correct proof.,
 Consequently, the proposition became known as a conjecture rather than a theorem.,
 After 358 years of effort by mathematicians, the first successful proof was released in 1994 by Andrew Wiles 

In [51]:
#take a look at token_list
token_list

[[40,
  62,
  114,
  144,
  163,
  81,
  68,
  153,
  144,
  96,
  90,
  40,
  4,
  30,
  42,
  77,
  98,
  162,
  44,
  54,
  102,
  118,
  50,
  122,
  91,
  100,
  151,
  10,
  164,
  108,
  37,
  103,
  165,
  45,
  19,
  78,
  152,
  70,
  120,
  94,
  95,
  100,
  148,
  70,
  37,
  83,
  50,
  70,
  37,
  95,
  56,
  130,
  21,
  33,
  34,
  132,
  56,
  73,
  47,
  31,
  100,
  57,
  48,
  71,
  113,
  86,
  102,
  81,
  43,
  23,
  138,
  89,
  53,
  46,
  40,
  100,
  55,
  152,
  102,
  140,
  152,
  157,
  89,
  35,
  77,
  134,
  60,
  102,
  38,
  77,
  48,
  17,
  129,
  132,
  166,
  40,
  100,
  55,
  105,
  135,
  14,
  123,
  43,
  89,
  136,
  38,
  59,
  106,
  143,
  43,
  61,
  50,
  24,
  86,
  63,
  152,
  89,
  115,
  7,
  144,
  81,
  6,
  141,
  152,
  8,
  15,
  144,
  163,
  81,
  107,
  38,
  84,
  132,
  49,
  77,
  89,
  117,
  60,
  102,
  111,
  38,
  12,
  100,
  57,
  76,
  21,
  86,
  102,
  96,
  41,
  94,
  102,
  81,
  154,
  39,
  116,
  152,
 

In [52]:
#testing one sentence
for tokens in token_list[0]:
    print(id2word[tokens])

in
number
theory
fermat's
last
theorem
(sometimes
called
fermat's
conjecture
especially
in
older
texts)
states
that
no
three
positive
integers
a
b
and
c
satisfy
the
equation
an
+
bn
=
cn
for
any
integer
value
of
n
greater
than
2
the
cases
n
=
1
and
n
=
2
have
been
known
since
antiquity
to
have
infinitely
many
solutions
the
proposition
was
first
stated
as
a
theorem
by
pierre
de
fermat
around
1637
in
the
margin
of
a
copy
of
arithmetica
fermat
added
that
he
had
a
proof
that
was
too
large
to
fit
in
the
margin
although
other
statements
claimed
by
fermat
without
proof
were
subsequently
proven
by
others
and
credited
as
theorems
of
fermat
(for
example
fermat's
theorem
on
sums
of
two
squares)
fermat's
last
theorem
resisted
proof
leading
to
doubt
that
fermat
ever
had
a
correct
proof
consequently
the
proposition
became
known
as
a
conjecture
rather
than
a
theorem
after
358
years
of
effort
by
mathematicians
the
first
successful
proof
was
released
in
1994
by
andrew
wiles
and
formally
published
in
19

## 2. Data loader

We gonna make dataloader.  Inside here, we need to make two types of embeddings: **token embedding** and **segment embedding**

1. **Token embedding** - Given “The cat is walking. The dog is barking”, we add [CLS] and [SEP] >> “[CLS] the cat is walking [SEP] the dog is barking”. 

2. **Segment embedding**
A segment embedding separates two sentences, i.e., [0 0 0 0 1 1 1 1 ]

3. **Masking**
As mentioned in the original paper, BERT randomly assigns masks to 15% of the sequence. In this 15%, 80% is replaced with masks, while 10% is replaced with random tokens, and the rest 10% is left as is.  Here we specified `max_pred` 

4. **Padding**
Once we mask, we will add padding. For simplicity, here we padded until some specified `max_len`. 

Note:  `positive` and `negative` are just simply counts to keep track of the batch size.  `positive` refers to two sentences that are really next to one another.

In [53]:
batch_size = 6
max_mask   = 5  # max masked tokens when 15% exceed, it will only be max_pred
max_len    = 1000 # maximum of length to be padded; 

In [54]:
def make_batch():
    batch = []
    positive = negative = 0  #count of batch size;  we want to have half batch that are positive pairs (i.e., next sentence pairs)
    while positive != batch_size/2 or negative != batch_size/2:
        
        #randomly choose two sentence so we can put [SEP]
        tokens_a_index, tokens_b_index= randrange(len(sentences)), randrange(len(sentences))
        #retrieve the two sentences
        tokens_a, tokens_b= token_list[tokens_a_index], token_list[tokens_b_index]

        #1. token embedding - append CLS and SEP
        input_ids = [word2id['[CLS]']] + tokens_a + [word2id['[SEP]']] + tokens_b + [word2id['[SEP]']]

        #2. segment embedding - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
        segment_ids = [0] * (1 + len(tokens_a) + 1) + [1] * (len(tokens_b) + 1)

        #3. mask language modeling
        #masked 15%, but should be at least 1 but does not exceed max_mask
        n_pred =  min(max_mask, max(1, int(round(len(input_ids) * 0.15))))
        #get the pos that excludes CLS and SEP and shuffle them
        cand_maked_pos = [i for i, token in enumerate(input_ids) if token != word2id['[CLS]'] and token != word2id['[SEP]']]
        shuffle(cand_maked_pos)
        masked_tokens, masked_pos = [], []
        #simply loop and change the input_ids to [MASK]
        for pos in cand_maked_pos[:n_pred]:
            masked_pos.append(pos)  #remember the position
            masked_tokens.append(input_ids[pos]) #remember the tokens
            #80% replace with a [MASK], but 10% will replace with a random token
            if random() < 0.1:  # 10%
                index = randint(0, vocab_size - 1) # random index in vocabulary
                input_ids[pos] = word2id[id2word[index]] # replace
            elif random() < 0.9:  # 80%
                input_ids[pos] = word2id['[MASK]'] # make mask
            else:  #10% do nothing
                pass

        # pad the input_ids and segment ids until the max len
        n_pad = max_len - len(input_ids)
        input_ids.extend([0] * n_pad)
        segment_ids.extend([0] * n_pad)

        # pad the masked_tokens and masked_pos to make sure the lenth is max_mask
        if max_mask > n_pred:
            n_pad = max_mask - n_pred
            masked_tokens.extend([0] * n_pad)
            masked_pos.extend([0] * n_pad)

        #check if first sentence is really comes before the second sentence
        #also make sure positive is exactly half the batch size
        if tokens_a_index + 1 == tokens_b_index and positive < batch_size / 2:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, True]) # IsNext
            positive += 1
        elif tokens_a_index + 1 != tokens_b_index and negative < batch_size/2:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, False]) # NotNext
            negative += 1
            
    return batch

In [55]:
batch = make_batch()

In [56]:
#len of batch
len(batch)

6

In [57]:
#we can deconstruct using map and zip
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))
input_ids.shape, segment_ids.shape, masked_tokens.shape, masked_pos.shape, isNext.shape

(torch.Size([6, 1000]),
 torch.Size([6, 1000]),
 torch.Size([6, 5]),
 torch.Size([6, 5]),
 torch.Size([6]))

## 3. Model

Recall that BERT only uses the encoder.

BERT has the following components:

- Embedding layers
- Attention Mask
- Encoder layer
- Multi-head attention
- Scaled dot product attention
- Position-wise feed-forward network
- BERT (assembling all the components)

## 3.1 Embedding

Here we simply generate the positional embedding, and sum the token embedding, positional embedding, and segment embedding together.

<img src = "figures/BERT_embed.png" width=500>

In [58]:
class Embedding(nn.Module):
    def __init__(self):
        super(Embedding, self).__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(max_len, d_model)      # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment(token type) embedding
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, seg):
        #x, seg: (bs, len)
        seq_len = x.size(1)
        pos = torch.arange(seq_len, dtype=torch.long)
        pos = pos.unsqueeze(0).expand_as(x)  # (len,) -> (bs, len)
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)

## 3.2 Attention mask

In [59]:
def get_attn_pad_mask(seq_q, seq_k):
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # eq(zero) is PAD token
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)  # batch_size x 1 x len_k(=len_q), one is masking
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # batch_size x len_q x len_k

### Testing the attention mask

In [60]:
print(get_attn_pad_mask(input_ids, input_ids).shape)

torch.Size([6, 1000, 1000])


## 3.3 Encoder

The encoder has two main components: 

- Multi-head Attention
- Position-wise feed-forward network

First let's make the wrapper called `EncoderLayer`

In [61]:
class EncoderLayer(nn.Module):
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn       = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask):
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_inputs to same Q,K,V
        enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size x len_q x d_model]
        return enc_outputs, attn

Let's define the scaled dot attention, to be used inside the multihead attention

In [62]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask):
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k) # scores : [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        scores.masked_fill_(attn_mask, -1e9) # Fills elements of self tensor with value where mask is one.
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)
        return context, attn 

Let's define the parameters first

In [63]:
n_layers = 6    # number of Encoder of Encoder Layer
n_heads  = 8    # number of heads in Multi-Head Attention
d_model  = 768  # Embedding Size
d_ff = 768 * 4  # 4*d_model, FeedForward dimension
d_k = d_v = 64  # dimension of K(=Q), V
n_segments = 2

Here is the Multiheadattention.

In [64]:
class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
    def forward(self, Q, K, V, attn_mask):
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1,2)  # v_s: [batch_size x n_heads x len_k x d_v]

        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1) # attn_mask : [batch_size x n_heads x len_q x len_k]

        # context: [batch_size x n_heads x len_q x d_v], attn: [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        context, attn = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v) # context: [batch_size x len_q x n_heads * d_v]
        output = nn.Linear(n_heads * d_v, d_model)(context)
        return nn.LayerNorm(d_model)(output + residual), attn # output: [batch_size x len_q x d_model]


Here is the PoswiseFeedForwardNet.

In [65]:
class PoswiseFeedForwardNet(nn.Module):
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # (batch_size, len_seq, d_model) -> (batch_size, len_seq, d_ff) -> (batch_size, len_seq, d_model)
        return self.fc2(F.gelu(self.fc1(x)))



## 3.4 Putting them together

In [66]:
class BERT(nn.Module):
    def __init__(self):
        super(BERT, self).__init__()
        self.embedding = Embedding()
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])
        self.fc = nn.Linear(d_model, d_model)
        self.activ = nn.Tanh()
        self.linear = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 2)
        # decoder is shared with embedding layer
        embed_weight = self.embedding.tok_embed.weight
        n_vocab, n_dim = embed_weight.size()
        self.decoder = nn.Linear(n_dim, n_vocab, bias=False)
        self.decoder.weight = embed_weight
        self.decoder_bias = nn.Parameter(torch.zeros(n_vocab))

    def forward(self, input_ids, segment_ids, masked_pos):
        output = self.embedding(input_ids, segment_ids)
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids)
        for layer in self.layers:
            output, enc_self_attn = layer(output, enc_self_attn_mask)
        # output : [batch_size, len, d_model], attn : [batch_size, n_heads, d_mode, d_model]
        
        # 1. predict next sentence
        # it will be decided by first token(CLS)
        h_pooled   = self.activ(self.fc(output[:, 0])) # [batch_size, d_model]
        logits_nsp = self.classifier(h_pooled) # [batch_size, 2]

        # 2. predict the masked token
        masked_pos = masked_pos[:, :, None].expand(-1, -1, output.size(-1)) # [batch_size, max_pred, d_model]
        h_masked = torch.gather(output, 1, masked_pos) # masking position [batch_size, max_pred, d_model]
        h_masked  = self.norm(F.gelu(self.linear(h_masked)))
        logits_lm = self.decoder(h_masked) + self.decoder_bias # [batch_size, max_pred, n_vocab]

        return logits_lm, logits_nsp

## 4. Training

In [67]:
# num_epoch =80
# model = BERT()
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.Adam(model.parameters(), lr=0.001)

# batch = make_batch()
# input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(*batch))

# for epoch in range(num_epoch):
#     optimizer.zero_grad()
#     logits_lm, logits_nsp = model(input_ids, segment_ids, masked_pos)    
#     #logits_lm: (bs, max_mask, vocab_size) ==> (6, 5, 34)
#     #logits_nsp: (bs, yes/no) ==> (6, 2)

#     #1. mlm loss
#     #logits_lm.transpose: (bs, vocab_size, max_mask) vs. masked_tokens: (bs, max_mask)
#     loss_lm = criterion(logits_lm.transpose(1, 2), masked_tokens) # for masked LM
#     loss_lm = (loss_lm.float()).mean()
#     #2. nsp loss
#     #logits_nsp: (bs, 2) vs. isNext: (bs, )
#     loss_nsp = criterion(logits_nsp, isNext) # for sentence classification
    
#     #3. combine loss
#     loss = loss_lm + loss_nsp
#     # if epoch % 100 == 0:
#     print('Epoch:', '%02d' % (epoch), 'loss =', '{:.6f}'.format(loss))
#     loss.backward()
#     optimizer.step()
#     if epoch == num_epoch-1:
#         torch.save(model.state_dict(), 'BERT.pt')
#         torch.save(model.state_dict(), 'BERT.pth')

In [68]:
batch[2]

[[1,
  40,
  62,
  114,
  144,
  163,
  81,
  68,
  153,
  144,
  96,
  90,
  40,
  4,
  30,
  42,
  77,
  98,
  162,
  44,
  54,
  102,
  118,
  50,
  122,
  91,
  100,
  151,
  10,
  164,
  108,
  37,
  103,
  165,
  45,
  19,
  78,
  152,
  70,
  120,
  94,
  95,
  100,
  148,
  70,
  37,
  83,
  50,
  70,
  37,
  95,
  56,
  130,
  21,
  33,
  34,
  132,
  56,
  73,
  47,
  31,
  100,
  57,
  48,
  71,
  113,
  86,
  102,
  81,
  43,
  23,
  138,
  89,
  53,
  46,
  40,
  100,
  55,
  152,
  102,
  140,
  152,
  157,
  89,
  35,
  77,
  134,
  60,
  102,
  38,
  77,
  48,
  17,
  129,
  132,
  166,
  40,
  100,
  55,
  105,
  135,
  14,
  123,
  43,
  89,
  136,
  38,
  59,
  106,
  143,
  43,
  61,
  50,
  24,
  86,
  63,
  152,
  89,
  115,
  7,
  144,
  81,
  6,
  141,
  152,
  8,
  15,
  144,
  163,
  81,
  107,
  38,
  84,
  132,
  49,
  77,
  89,
  117,
  60,
  3,
  111,
  38,
  12,
  100,
  57,
  76,
  21,
  86,
  102,
  96,
  41,
  94,
  102,
  81,
  154,
  39,
  116,
  152

## 5. Inference

Since our dataset is very small, it won't work very well, but just for the sake of demonstration.

In [76]:
trained = torch.load(r'C:\Users\Tairo Kageyama\Documents\GitHub\Python-fo-Natural-Language-Processing-main\lab5\model\BERT.pt')
model = BERT()

model.load_state_dict(trained)
# Predict mask tokens ans isNext
input_ids, segment_ids, masked_tokens, masked_pos, isNext = map(torch.LongTensor, zip(batch[2]))
# print(type(input_ids))
# print(type(segment_ids))
# print(type(masked_pos))
print(input_ids.size())
print(segment_ids.size())
print(masked_pos.size())

# print([id2word[w.item()] for w in input_ids[0] if id2word[w.item()] != '[PAD]'])

logits_lm, logits_nsp = model(input_ids, segment_ids, masked_pos)
#logits_lm:  (1, max_mask, vocab_size) ==> (1, 5, 34)
#logits_nsp: (1, yes/no) ==> (1, 2)

#predict masked tokens
#max the probability along the vocab dim (2), [1] is the indices of the max, and [0] is the first value
logits_lm = logits_lm.data.max(2)[1][0].data.numpy() 
#note that zero is padding we add to the masked_tokens
print('masked tokens (words) : ',[id2word[pos.item()] for pos in masked_tokens[0]])
print('masked tokens list : ',[pos.item() for pos in masked_tokens[0]])
print('masked tokens (words) : ',[id2word[pos.item()] for pos in logits_lm])
print('predict masked tokens list : ', [pos for pos in logits_lm])

#predict nsp
logits_nsp = logits_nsp.data.max(1)[1][0].data.numpy()
print(logits_nsp)
print('isNext : ', True if isNext else False)
print('predict isNext : ',True if logits_nsp else False)

torch.Size([1, 1000])
torch.Size([1, 1000])
torch.Size([1, 5])
masked tokens (words) :  ['in', 'successful', 'a', '20th', 'largest']
masked tokens list :  [40, 93, 102, 112, 29]
masked tokens (words) :  ['citation', 'citation', 'citation', 'citation', 'citation']
predict masked tokens list :  [88, 88, 88, 88, 88]
0
isNext :  False
predict isNext :  False


Trying a bigger dataset should be able to see the difference.

Evaluation

In [85]:
Syntatic = "[CLS] fermat's last [MASK] [SEP]"
Semantic = "[CLS] fermat's last theorem is among the most notable theorems in the history of [MASK] [SEP]"

Syntactic = Syntatic.split()
Semantic = Semantic.split()

Syntactic_id = []
Semantic_id = []

for Sy in Syntactic:
    Syntactic_id.append(word2id[Sy])

for Se in Semantic:
    Semantic_id.append(word2id[Se])

print(Syntactic_id)
print(Semantic_id)

input_syn = torch.tensor([Syntactic_id])
input_sem = torch.tensor([Semantic_id])
seg_syn = torch.tensor([[0, 0, 0, 0, 0]])
seg_sem = torch.tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
mask_syn = torch.tensor([[3]])
mask_sem = torch.tensor([[14]])


[1, 144, 163, 3, 2]
[1, 144, 163, 81, 149, 97, 100, 79, 9, 63, 40, 100, 155, 152, 3, 2]


In [86]:
trained = torch.load(r'C:\Users\Tairo Kageyama\Documents\GitHub\Python-fo-Natural-Language-Processing-main\lab5\model\BERT.pt')
model = BERT()
for n in range(10):
    print(n+1)
    model.load_state_dict(trained)
    logits_lm, logits_nsp = model(input_syn, seg_syn, mask_syn)
    logits_lm = logits_lm.data.max(2)[1][0].data.numpy() 
    print('Syntatic predicted : ',[id2word[pos.item()] for pos in logits_lm])

    logits_lm, logits_nsp = model(input_sem, seg_sem, mask_sem)
    logits_lm = logits_lm.data.max(2)[1][0].data.numpy() 
    print('Semantic predicted : ',[id2word[pos.item()] for pos in logits_lm])
    print('-----------------------------------------')

1
Syntatic predicted :  ['citation']
Semantic predicted :  ['citation']
-----------------------------------------
2
Syntatic predicted :  ['citation']
Semantic predicted :  ['citation']
-----------------------------------------
3
Syntatic predicted :  ['citation']
Semantic predicted :  ['prior']
-----------------------------------------
4
Syntatic predicted :  ['theorem']
Semantic predicted :  ['theorem']
-----------------------------------------
5
Syntatic predicted :  ['citation']
Semantic predicted :  ['theorem']
-----------------------------------------
6
Syntatic predicted :  ['citation']
Semantic predicted :  ['citation']
-----------------------------------------
7
Syntatic predicted :  ['also']
Semantic predicted :  ['theorem']
-----------------------------------------
8
Syntatic predicted :  ['theorem']
Semantic predicted :  ['citation']
-----------------------------------------
9
Syntatic predicted :  ['also']
Semantic predicted :  ['citation']
--------------------------------

In [117]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# GPT-2の事前学習済みモデルとトークナイザーを読み込む
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# テスト文を定義する
text = "[CLS] fermat's last [MASK] [SEP]"

# 文をトークン化する
input_ids = tokenizer.encode(text, return_tensors='pt')
print(input_ids)
model.eval()

# モデルを使ってマスクされたトークンを予測する
with torch.no_grad():
    outputs = model(*input_ids)
    predictions = outputs[0]

# 予測されたトークンを出力する
predicted_token_index = torch.argmax(predictions).item()
predicted_token = tokenizer.decode([predicted_token_index])
print("Predicted token:", predicted_token)



tensor([[   58,  5097,    50,    60, 11354,  6759,   338,   938,   685, 31180,
            42,    60,   685,  5188,    47,    60]])


TypeError: sequence item 0: expected str instance, NoneType found

In [125]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

Syntactic = "Fermat's Last "
Semantic = "Fermat's Last Theorem is among the most notable theorems in the history of "

Syn = tokenizer.encode(Syntactic)
Sem = tokenizer.encode(Semantic)
Synt = torch.tensor([Syn])
Sema = torch.tensor([Sem])
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

for n in range(10):
    print(n+1)
    with torch.no_grad():
        preSyn = model(Synt)
        preSem = model(Sema)
        preSyn = preSyn[0]
        preSem = preSem[0]
        #print(predictions)

    preSyn = torch.argmax(preSyn[0, -1, :]).item()
    preSem = torch.argmax(preSem[0, -1, :]).item()
    predicted_Syn = tokenizer.decode([preSyn])
    predicted_Sem = tokenizer.decode([preSem])

    print('Syntatic predicted : ',predicted_Syn)
    print('Semantic predicted : ',predicted_Sem)
    print('-----------------------------------------')



1
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
2
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
3
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
4
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
5
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
6
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
7
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
8
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
9
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------
10
Syntatic predicted :   
Semantic predicted :  vern
-----------------------------------------


In [3]:
from transformers import T5Config, T5ForConditionalGeneration, T5Tokenizer

model_name = "allenai/t5-small-next-word-generator-qoogle"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

Syntactic = "Fermat's Last "
Semantic = "Fermat's Last Theorem is among the most notable theorems in the history of "

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    res = model.generate(input_ids, **generator_args)
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    return output

for i in range(10):
    print(i+1)
    print('Syntatic predicted : ',run_model(Syntactic))
    print('Semantic predicted : ',run_model(Semantic))
    print('-----------------------------------------')



1
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
2
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
3
Syntatic predicted :  ['will']




Semantic predicted :  ['the']
-----------------------------------------
4
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
5
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
6
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
7
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
8
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
9
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
10
Syntatic predicted :  ['will']
Semantic predicted :  ['the']
-----------------------------------------
