Karim ElDakroury

# ELMo: Embeddings from Language Models 
![](https://get.whotrades.com/u4/photoDE6C/20647654315-0/blogpost.jpeg)

In this assignment you will implement a deep lstm-based model for contextualized word embeddings - ELMo. Your tasks are as following: 

- Preprocessing (20 points)
- Implementation of ELMo model (30 points)
  - 2-layer BiLSTM (15 points)
  - Highway layers (5 points) [link](https://paperswithcode.com/method/highway-layer) [paper](https://arxiv.org/pdf/1507.06228.pdf) [code](https://github.com/allenai/allennlp/blob/9f879b0964e035db711e018e8099863128b4a46f/allennlp/modules/highway.py#L11)
  - CharCNN embeddings (10 points) [paper](https://arxiv.org/pdf/1509.01626.pdf)
- Report metrics and loss using tensorbord/comet or other tool.  (10 points)
- Evaluate on movie review dataset (20 pts)
- Compare the performance with BERT model (10 pts)
- Clean and documented code (10 points)


Remarks: 

*   Use Pytorch
*   Cheating will result in 0 points


ELMo paper: https://arxiv.org/pdf/1802.05365.pdf

Possible datasets:
- [WikiText-103](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/)
- Any monolingual dataset from [WMT](https://statmt.org/wmt22/translation-task.html)

## Data loading and preprocessing
Preprocess the english monolingual data (20 points):
- clean
- split to train and validation
- tokenize
- create vocabulary, convert words to numbers. [vocab](https://pytorch.org/text/stable/vocab.html#id1)
- pad sequences

Use these tutorials [one](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html) and [two](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) as a reference

![](https://miro.medium.com/max/720/1*UPirqwpBWnNmcwoUjfZZIA.png)

In [None]:
# !pip install allennlp
# !pip install unidecode

In [1]:
import re
import torch
import random
import string
from pathlib import Path
from typing import Dict, List, Tuple
from collections import Counter
from torch import nn, optim
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import vocab as torch_vocab
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Download dataset

Wiki-103 is already split


In [37]:
from torchtext.utils import download_from_url, extract_archive

dataset_url = 'https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip'
dataset_path = download_from_url(dataset_url)
test_path, valid_path, train_path = extract_archive(dataset_path)
print(test_path, valid_path, train_path, sep='\n')

/content/.data/wikitext-2/wiki.test.tokens
/content/.data/wikitext-2/wiki.valid.tokens
/content/.data/wikitext-2/wiki.train.tokens


In [38]:
train_data = Path('/content/.data/wikitext-2/wiki.train.tokens').read_text()

### Prepare tokenizer

In [39]:
def sanitize(text):
    header_token = '( \n = [^=]*[^=] = \n )'
    text = re.split(header_token, text)
    text = [x for x in text[2::2]]
    subheader_token = '( \n = = .* = = \n )'
    punc_token = '(\'s)|(\n)|(?:(?<!unk)([^a-zA-Z0-9\s\.])(?!unk))'
    cleaned_text = []
    for word in text:
      data = re.sub(subheader_token, '', word)    
      data = re.sub(punc_token, '', data)
      data = re.sub(' +', ' ', data.strip())
      data = data.lower()
      cleaned_text.append(data)
    return cleaned_text

### build vocab

In [40]:
unk_tok = '<unk>'
pad_tok = '<pad>'
bos_tok = '<bos>'
eos_tok = '<eos>'

specials = [pad_tok, unk_tok, bos_tok, eos_tok]

def get_sentences(data):
  sentences = []
  tokenizer = get_tokenizer('spacy', language='en')
  for article in data:
    article_sentences = re.split(' +\. +', article)
    for sent in article_sentences:
      sent = re.sub('\.', '', sent).strip()
      sentences.append(sent.split())
  return sentences

def build_vocab(data):        
  counter = Counter()
  for s in data:
    counter.update(s)
  return torch_vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

In [41]:
cleaned_articles = sanitize(train_data)
sentences = get_sentences(cleaned_articles)
# cut the list of sentences
sentences = random.sample(sentences, 1000)
vocab = build_vocab(sentences)
print(sentences[0])
print(vocab(['<unk>', 'book', 'the', '<bos>', '<eos>', '<pad>']))

['the', 'upgrade', 'proposal', 'proved', 'to', 'be', 'very', 'unpopular', 'with', 'north', 'kingstown', 'residents', 'who', 'lived', 'on', 'the', 'affected', 'local', 'roads']
[0, 1401, 4, 2, 3, 1]


### Padding, creating tensors

In [42]:
max_sentence_len = 50
batch_size = 32

def padding(sentences, max_len=max_sentence_len):
  sent_tensor = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
  masks = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
  return masks, sent_tensor

In [43]:
masks, sent_tensor = padding(sentences)
print(sent_tensor.size(), masks.size())

torch.Size([1000, 50]) torch.Size([1000, 50])


### Convert words to numbers


In [44]:
def converting_words2numbers(setntences, vocab, masks, sent_tensor, max_len=max_sentence_len):
  lengths = []
  for idx, sentence in enumerate(sentences):        
    sent_len = min(max_len - 2, len(sentence))
    
    masks[idx][:sent_len+2] = torch.ones(sent_len+2, device=device)        
    sent_tensor[idx][1:sent_len+1] = torch.tensor(vocab(sentence[:sent_len]), dtype=torch.long, device=device)
    sent_tensor[idx][0] = vocab[bos_tok]
    sent_tensor[idx][sent_len+1] = vocab[eos_tok]
    lengths.append(sent_len+2)
  
  lengths = torch.tensor(lengths, dtype=torch.long, device=device)
  return lengths

lengths = converting_words2numbers(sentences, vocab, masks, sent_tensor)

print(lengths.size())

dataset = TensorDataset(sent_tensor, masks, lengths)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

torch.Size([1000])


## Model - learning embeddings
Read chapter 3 from the [paper](https://arxiv.org/pdf/1802.05365.pdf)

Implement this model with 
- 2 BiLSTM layers,
- CharCNN embeddings,
- Highway layers,
- out-of-vocabulary words handling

Plot the training and validation losses over the epochs (iterations)

Use the [implementation](https://github.com/allenai/allennlp/blob/main/allennlp/modules/elmo.py) as a reference

![](https://miro.medium.com/max/720/1*3_wsDpyNG-TylsRACF48yA.png)

![](https://miro.medium.com/max/720/1*8pG54o28pbD2L0dv5THL-A.png)

References:


https://github.com/gazelle93/charCNN


https://github.com/ankurbanga/Language-Models/tree/master/ELMo


https://towardsdatascience.com/pytorch-elmo-844d2391a0b2#b484

### Character embeddings

In [45]:
max_chars_per_token = 40
emb_dim = 16
filters= [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]]
output_dim = 512
num_highway = 2

bow_tok = '<bow>'
eow_tok = '<eow>'

special_chars = [bow_tok, eow_tok]

def char_dict(special_chars=specials+special_chars):
  chars = string.ascii_lowercase + string.digits
  char_dict = {}
  idx = 0
  for char in special_chars:
    char_dict[char] = idx
    idx += 1
  for char in chars:
    char_dict[char] = idx
    idx += 1
  return char_dict

### CharCNN

In [46]:
class CharCNN(nn.Module):
    def __init__(self, num_chars, emb_dim, filters):
        super(CharCNN, self).__init__()
        self.embeddings = nn.Embedding(num_chars, emb_dim)
        self.conv_layers = nn.ModuleList([nn.Conv1d(in_channels=emb_dim,
                                     out_channels=num_f,
                                     kernel_size=width,
                                     bias=True) for width, num_f in filters])
        self.activation = nn.ReLU()

    def forward(self, inputs):
        convs = []
        initial_embeddings = self.embeddings(inputs)

        for conv_layer in self.conv_layers:
            convolved = conv_layer(torch.transpose(initial_embeddings, 1, 2))
            convolved, _ = torch.max(convolved, dim=-1)
            convolved = self.activation(convolved)
            convs.append(convolved)

        return torch.cat(convs, dim=-1)

###Highway layer

In [47]:
class Highway(nn.Module):
    def __init__(self, dim, n_highway):
        super(Highway, self).__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(dim, dim * 2) for _ in range(n_highway)]
        )

        for layer in self.layers:
            layer.bias[dim:].data.fill_(1)

        self.activation = nn.ReLU()

    def forward(self, inputs):
      current_input = inputs
      for layer in self.layers:
        projected_input = layer(current_input)
        linear_part = current_input

        nonlinear_part, gate = projected_input.chunk(2, dim=-1)
        nonlinear_part = self.activation(nonlinear_part)
        gate = torch.sigmoid(gate)
        current_input = gate * linear_part + (1 - gate) * nonlinear_part

      return current_input

### Elmo encoder
preparing ELMO inputs from CharCNN, Highway, and projection layers

In [63]:
class Projection(nn.Module):
  def __init__(self, input_dim, output_dim):
    super(Projection, self).__init__()
    self.layer = torch.nn.Linear(input_dim, output_dim, bias=True)
  
  def forward(self, inputs):
    return self.layer(inputs)

class CharIndexer(nn.Module):
  def __init__(self, vocab, max_char, char):
      super(CharIndexer, self).__init__()
      self.vocab = vocab
      self.char = char
      self.max_char = max_char

  def forward(self, sent):
    embedding = []

    for token in sent:
      word = self.vocab.lookup_token(token)

      embedded_word = [self.char[pad_tok] for _ in range(self.max_char)]
      
      embedded_word[0] = self.char[bow_tok]
      if word != pad_tok:
        if word == bos_tok or word == eos_tok or word == unk_tok:
          embedded_word[1] = self.char[word]
          embedded_word[2] = self.char[eow_tok]
        
        else:
          for idx, char in enumerate(word[:self.max_char - 2]):
            embedded_word[idx+1] = self.char[char]
            last_idx = idx+2
          
          embedded_word[last_idx] = self.char[eow_tok]
      embedding.append(embedded_word)
      return embedding

class Encoder(nn.Module):
    def __init__(self, vocab, output_dim=output_dim, max_char=max_chars_per_token, num_highway=num_highway):
      super(Encoder, self).__init__()
      self.output_dim=output_dim
      self.filters = filters
      self.char = char_dict()
      self.char_embeddings = CharIndexer(vocab, max_char, self.char)
      self.charCNN = CharCNN(len(self.char), emb_dim, self.filters)
      
      dim = sum([x[1] for x in self.filters])
      self.highway = Highway(dim, num_highway)
      self.projection = Projection(dim, output_dim)

    def forward(self, sentences):
      max_sent_len = sentences.size()[0]
      batch_size = sentences.size()[1]
      embeddings = torch.zeros((batch_size, max_sent_len, self.output_dim), dtype=torch.float, device=device)
      
      for i, sent in enumerate(sentences.t()):
        emb = torch.tensor(self.char_embeddings(sent), dtype=torch.long, device=device)
        emb = self.charCNN(emb)
        emb = self.highway(emb)
        emb = self.projection(emb)
        embeddings[i] = emb
      return torch.permute(embeddings, (1, 0, 2))

### 2 biLSTM layers

In [64]:
class biLSTM(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(biLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.embedding = embedding
        USE_CUDA = torch.cuda.is_available()
        self.device = torch.device("cuda" if USE_CUDA else "cpu")
        self.drop = nn.Dropout(p=dropout)
        self.forwardLSTM = nn.LSTM(hidden_size, 
                                         hidden_size, 
                                         n_layers, 
                                         dropout=(0 if n_layers == 1 else dropout))
        self.backwardLSTM = nn.LSTM(hidden_size, 
                                         hidden_size, 
                                         n_layers, 
                                         dropout=(0 if n_layers == 1 else dropout))
        
    def forward(self, input_seq, input_lengths, initial_states=None):
        embedded = self.embedding(input_seq)
        embedded = embedded.to(device)
        MAX_LEN = embedded.size()[0]
        batch_size = embedded.size()[1]
        outputs = torch.zeros(MAX_LEN, batch_size, 2, self.hidden_size, device=self.device)
        hidden_states = torch.zeros(self.n_layers * 2, MAX_LEN, batch_size, self.hidden_size, device=self.device)
        
        if not initial_states:
            initial_states = (torch.zeros(self.n_layers, 1, self.hidden_size, device=self.device), torch.zeros(self.n_layers, 1, self.hidden_size, device=self.device))
        
        for batch_n in range(batch_size):
            b_sentence = embedded[:,batch_n, :]
            length = input_lengths[batch_n]
            
            sentence = self.drop(b_sentence[:length,:])
            hidden_forward_state, cell_forward_state = initial_states
            hidden_backward_state, cell_backward_state = initial_states
            
            for t in range(length):
                output, (hidden_forward_state, cell_forward_state) = self.forwardLSTM(sentence[t].view(1, 1, -1), (hidden_forward_state, cell_forward_state))
                outputs[t, batch_n, 0, :] = output[0, 0, :]
                hidden_states[:self.n_layers, t, batch_n, :] = hidden_forward_state[:, 0, :]
                
            for t in range(length):
                output, (hidden_backward_state, cell_backward_state) = self.backwardLSTM(sentence[length - t - 1].view(1, 1, -1), (hidden_backward_state, cell_backward_state))
                outputs[length - t - 1, batch_n, 1, :] = output[0, 0, :]
                hidden_states[self.n_layers:, length - t - 1, batch_n, :] = hidden_backward_state[:, 0, :]
                
        return outputs, hidden_states, embedded

### ELMO model


In [65]:
class ELMo(nn.Module):

    def __init__(self, hidden_size, embedding, vocab_size, n_layers=1, dropout=0):
        super(ELMo, self).__init__()
        USE_CUDA = torch.cuda.is_available()
        self.device = torch.device("cuda" if USE_CUDA else "cpu")
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding
        self.biLSTM = biLSTM(hidden_size, embedding, n_layers, dropout)
        self.W = nn.Parameter(torch.tensor([1/(2*n_layers + 1) for i in range(2*n_layers + 1)], requires_grad=True, device=self.device))
        self.gamma = nn.Parameter(torch.ones(1, requires_grad=True, device=self.device))
        self.dense = nn.Linear(hidden_size, vocab_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input_seq, input_lengths, mask, initial_states=None):
        biLSTM_outputs, hidden_states, embedded = self.biLSTM(input_seq, input_lengths, initial_states)
        concat_hidden_with_embedding = torch.cat((embedded.unsqueeze(0), hidden_states), dim=0)
        scaled_embedding = torch.zeros(*embedded.size(), device=self.device)
        
        for i in range(2*self.n_layers + 1):
            w = self.W[i]
            layer = concat_hidden_with_embedding[i]
            scaled_embedding = scaled_embedding + w * layer
        
        scaled_embedding *= self.gamma
        perm_scaled_embedding = scaled_embedding.permute(1, 0, 2)
        projection_size = (perm_scaled_embedding.size()[0], perm_scaled_embedding.size()[2])
        dense_layer_input = torch.zeros(projection_size, dtype=torch.float, device=device)
        for idx, sent_emb in enumerate(perm_scaled_embedding):
          dense_layer_input[idx] = sent_emb[input_lengths[idx] - 1]
        
        vocab_projection = self.dense(dense_layer_input)
        prop = self.softmax(vocab_projection)
        
        return prop, scaled_embedding, biLSTM_outputs

### Model Training

In [73]:
n_epochs = 20
clip = 50.0

def make_confusion_matrix(true, pred):
    K = len(np.unique(true))
    result = np.zeros((K, K))

    for i in range(len(true)):
        result[true[i]][pred[i]] += 1
    return result

def train_model(model, model_optimizer, dataloader):
    model.to(device)
    train_true = []
    train_pred = []
    train_loss = []
    
    for epoch in tqdm(range(n_epochs)):
        curr_true = []
        curr_pred = []
        model.train()
        
        for batch in dataloader:
            model_optimizer.zero_grad()
            inp_seq, masks, lengths = batch
            prediction, elmo_embedding, biLSTM_outputs = model(inp_seq.t(), lengths, masks.t())
          
            true_pred = []
            for i, length in enumerate(lengths):
              true_pred.append(inp_seq[i][length - 1])

            true_pred = torch.tensor(true_pred, dtype=torch.long, device=device)
            loss_func = nn.CrossEntropyLoss()
            loss = loss_func(prediction, true_pred)
            loss = loss.to(device)
            train_loss.append(loss.item() * lengths.size()[0])
            
            loss.backward(retain_graph=True)
            
            _ = nn.utils.clip_grad_norm_(model.parameters(), clip)
            model_optimizer.step()
        print("\nloss:", loss.item())
    return train_loss

keyboard intruption

In [74]:
dim = 512
n_layers = 2
lr = 2e-3

model = ELMo(dim, Encoder(vocab), len(vocab), n_layers=n_layers, dropout=0.4)
model_optimizer = optim.Adam(model.parameters(), lr = lr)

In [75]:
train_loss = train_model(model, model_optimizer, dataloader)

  5%|▌         | 1/20 [00:57<18:05, 57.12s/it]


loss: 7.673977851867676


  5%|▌         | 1/20 [01:02<19:52, 62.78s/it]


KeyboardInterrupt: ignored

## Evaluate your embeddings model on IMDB movie reviews dataset (sentiment analysis) 
[Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

Preprocess data

Disable training for ELMo, it will produce 5 embeddings for each word, add trainable parameters $\gamma^{task}$ and $s^{task}_j$

Don't forget metric plots

In [76]:
from allennlp.modules.elmo import Elmo, batch_to_ids
import unidecode

options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

batch_size = 32

def get_accuracy_from_logits(logits, labels):
    probs = torch.sigmoid(logits.unsqueeze(-1))
    soft_probs = (probs > 0.5).long()
    acc = (soft_probs.squeeze() == labels).float().mean()
    return acc

def train_test_split(dataset, test_split):
    train_len = int((1 - test_split) * len(dataset))
    test_len = len(dataset) - train_len
    train, test = torch.utils.data.random_split(dataset, [train_len, test_len], generator=torch.Generator().manual_seed(42))
    return train, test

def load_data(filename) -> Tuple[List[str], List[int]]:
    inputs = []
    targets = []
    with open(filename) as file:
        for line in file:
            line = unidecode.unidecode(line.lower())
            line = re.sub(r"([^a-zA-Z0-9\s])", "", line)
            word_list = [word.strip() for word in line.split() if word.strip() != '']
            inputs.append(word_list[:-1])
            targets.append(int(word_list[-1]))
    return inputs, targets

In [21]:
def get_elmo_data(sentences, targets):
  # pad the sequence
  padded = []
  max_len = 0
  for sent in sentences:
    max_len = max(max_len, len(sent))
  for sent in sentences:
    tmp_sent = [pad_tok for _ in range(max_len)]
    tmp_sent[:len(sent)] = sent
    padded.append(tmp_sent)
  
  # get embeddings and create batches
  embeddings = batch_to_ids(padded)
  labels = torch.tensor(targets, dtype=torch.float, device=device)
  dataset = TensorDataset(embeddings, labels)
  train, test = train_test_split(dataset, 0.25)
  return DataLoader(train, batch_size, shuffle=True), DataLoader(test, batch_size), max_len

class ElmoClassifier(nn.Module):
  def __init__(self, max_sent_len, options_file, weight_file):
    super(ElmoClassifier, self).__init__()
    self.elmo = Elmo(options_file, weight_file, 2, dropout=0)
    self.word_cls = nn.Linear(1024, 1)
    self.sent_cls = nn.Linear(max_sent_len, 1)

  def forward(self, character_ids):
    output = self.elmo(character_ids)
    embeddings = output['elmo_representations'][0]
    word_cls = self.word_cls(embeddings)
    return self.sent_cls(word_cls.squeeze(-1))

def train_elmo(dataloader, model, optimizer, n_epochs=20):
  model = model.to(device)
  model.train()
  criterion = nn.BCEWithLogitsLoss()
  for epoch in tqdm(range(n_epochs)):
    for iteration, (embeddings, labels) in enumerate(dataloader):
      embeddings = embeddings.to(device)
      labels = labels.to(device)
      optimizer.zero_grad()
      logits = model(embeddings)
      loss = criterion(logits.squeeze(-1), labels)
      loss.backward()
      optimizer.step()
      acc = get_accuracy_from_logits(logits, labels)

  print("\nepoch {} complete. Loss : {} Accuracy : {}".format(epoch + 1, loss.item(), acc))

In [22]:
inputs, targets = load_data('data.txt')

train_loader, test_loader, max_len = get_elmo_data(inputs, targets)
model = ElmoClassifier(max_len, options_file, weight_file)
optimizer = optim.Adam(model.parameters(), lr=0.00002)

In [23]:
train_elmo(train_loader, model, optimizer)

100%|██████████| 20/20 [02:12<00:00,  6.62s/it]


epoch 20 complete. Loss : 0.7006427049636841 Accuracy : 0.2857142984867096





## Compare the results with BERT embeddings
you can choose other bert model

In [17]:
from transformers import AutoTokenizer, BertModel, TrainingArguments, Trainer

def get_bert_data(sentences, targets):
  # tokenize dataset and create batches
  bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
  data = [" ".join(words_list) for words_list in sentences]
  tokenized_data = bert_tokenizer(data, padding="max_length")
  embeddings = torch.tensor(tokenized_data['input_ids'])
  attn_masks = torch.tensor(tokenized_data['attention_mask'])
  labels = torch.tensor(targets, dtype=torch.float)
  dataset = TensorDataset(embeddings, attn_masks, labels)
  train, test = train_test_split(dataset, 0.25)
  return DataLoader(train, batch_size, shuffle=True), DataLoader(eval, batch_size)

class BertClassifier(nn.Module):
    def __init__(self, freeze_bert = True):
        super(BertClassifier, self).__init__()
        self.bert_layer = BertModel.from_pretrained('bert-base-uncased')
        
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False 
        self.cls_layer = nn.Linear(768, 1)

    def forward(self, seq, attn_masks):
        output = self.bert_layer(seq, attention_mask=attn_masks)
        cont_reps = output.last_hidden_state
        cls_rep = cont_reps[:, 0]
        logits = self.cls_layer(cls_rep)
        return logits

In [18]:
def train_bert(model, optimizer, train_loader, val_loader, n_epochs=20):
    model = model.to(device)
    model.train()
    criterion = nn.BCEWithLogitsLoss()

    for epoch in tqdm(range(n_epochs)):
        for iteration, (seq, attn_masks, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            seq, attn_masks, labels = seq.to(device), attn_masks.to(device), labels.to(device)

            logits = model(seq, attn_masks) # task: model forward pass
            loss = criterion(logits.squeeze(-1), labels.float())
            loss.backward()
            optimizer.step()

        acc = get_accuracy_from_logits(logits, labels)
    print("\nIteration {} of epoch {} complete. Loss : {} Accuracy : {}".format(iteration + 1, epoch + 1, loss.item(), acc))

In [19]:
train_loader, val_loader = get_bert_data(inputs, targets)
model = BertClassifier()
optimizer = optim.Adam(model.parameters(), lr=2e-4)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [20]:
train_bert(model, optimizer, train_loader, val_loader)

100%|██████████| 20/20 [08:48<00:00, 26.43s/it]



Iteration 24 of epoch 20 complete. Loss : 0.6732714176177979 Accuracy : 0.7142857313156128
