# Transformer and BERT (thanks 🤗)

- Vaswani et al., [_Attention is All you Need._](https://papers.nips.cc/paper/7181-attention-is-all-you-need) NIPS 2017: 5998-6008
- Devlin et al., [_BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding._](https://www.aclweb.org/anthology/N19-1423/) NAACL-HLT (1) 2019: 4171-4186

---

- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
- [The Illustrated BERT, ELMo ...](https://jalammar.github.io/illustrated-bert/)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

---

# Model Compression

- Liu et al.: [_RoBERTa: A Robustly Optimized BERT Pretraining Approach_](https://arxiv.org/abs/1907.11692)
- Sam Sucik: [Compressing BERT for faster prediction](https://blog.rasa.com/compressing-bert-for-faster-prediction-2/amp/)

---

# Sub-word tokenisation using Byte Pair Encoding

Instead of splitting text on whitespace alone, split words into sub-word units based on frequency patterns learned from a corpus. Training the tokeniser is unsupervised and therefore does not require any labelled data.
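
To make "frequency patterns learned from a corpus" concrete, here is a minimal sketch of the merge-learning loop, loosely following the reference snippet in Sennrich et al. The toy word counts and the helper names (`pair_counts`, `merge_pair`, `learn_bpe`) are illustrative assumptions, not part of any library.

```python
import collections
import re


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# words are pre-split into characters; </w> marks the end of a word
toy_vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}
merges, segmented = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(segmented)
```

On this toy corpus the first three merges build up the frequent suffix `est</w>`, so `newest` and `widest` end up sharing a sub-word unit.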

- Sennrich et al.: [Neural Machine Translation of Rare Words with Subword Units.](https://www.aclweb.org/anthology/P16-1162/) ACL (1) 2016
> Neural machine translation (NMT) models typically operate with a fixed vocabulary, but ___translation is an open-vocabulary problem.___ Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. ___This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations)___. [Emphasis mine]

- Heinzerling et al.: [BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages.](http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf) LREC 2018
- [BPEmb: Subword Embeddings in 275 Languages](https://nlp.h-its.org/bpemb/)
- Provilkov et al.: [BPE-Dropout: Simple and Effective Subword Regularization](https://arxiv.org/abs/1910.13267)
> While multiple segmentations are possible even with the same vocabulary, BPE splits words into unique sequences; this may prevent a model from better learning the compositionality of words and being robust to segmentation errors.
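
The idea behind BPE-Dropout can be sketched in a few lines: when segmenting a word, each applicable merge is skipped with some probability, so the same word can come out in different segmentations. This is a simplified sketch under my reading of the paper, not the authors' implementation; the merge table and the function name `bpe_segment` are made up for illustration.

```python
import random


def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Segment a word with a learned merge list; drop each candidate merge with probability `dropout`."""
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = learned earlier
    symbols = list(word) + ['</w>']
    while True:
        # collect the merges that apply right now and survive the dropout coin flip
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# hypothetical merge table, in the order the merges were learned
toy_merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]

deterministic = bpe_segment('lowest', toy_merges)             # plain BPE
characters = bpe_segment('lowest', toy_merges, dropout=1.0)   # every merge dropped
sampled = bpe_segment('lowest', toy_merges, dropout=0.5, rng=random.Random(0))
print(deterministic, characters, sampled)
```

With `dropout=0.0` this reduces to ordinary BPE and always yields the same segmentation; with `dropout=1.0` the word falls apart into characters.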

---

In [None]:
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [None]:
token_ids = tokenizer.encode('The BERT tokenizer splits words into sub-word units.')
tokenizer.convert_ids_to_tokens(token_ids)

In [None]:
token_ids = tokenizer.encode('The new Berlin airport (BER) will open soon.')
tokenizer.convert_ids_to_tokens(token_ids)
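
Strictly speaking, `BertTokenizer` uses WordPiece rather than BPE: at tokenisation time it greedily matches the longest vocabulary entry and marks word-internal pieces with `##`. A minimal sketch of that matching loop, with a tiny made-up vocabulary (the real `bert-base-cased` vocabulary will split these words differently):

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# hypothetical toy vocabulary
toy_vocab = {'token', '##izer', '##s', 'BE', '##R'}

print(wordpiece('tokenizers', toy_vocab))
print(wordpiece('BER', toy_vocab))
print(wordpiece('airport', toy_vocab))
```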

----

# Document Classification with BERT (thanks 🤗)

In [None]:
bert = BertModel.from_pretrained('bert-base-multilingual-cased')
bert

In [None]:
from sklearn.preprocessing import LabelEncoder
import utils

gnad_train, gnad_test = utils.load_gnad()
label_encoder = LabelEncoder()

# turn all the data into integer indices
y_train = label_encoder.fit_transform(gnad_train.category)

In [None]:
y_train

----

In [None]:
from transformers import BertTokenizer

def doc2bert(doc, tokenizer, max_len=512):
    # truncate so that [CLS] and [SEP] still fit
    # NOTE: that's 510 SUBword tokens, not 510 tokens from the original document
    tokens = tokenizer.tokenize(doc)[:max_len - 2]
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    # add the special tokens BEFORE padding, so [SEP] directly follows the last real token
    token_ids = [tokenizer.cls_token_id] + token_ids + [tokenizer.sep_token_id]

    # pad everything to max_len (512)
    return token_ids + [tokenizer.pad_token_id] * (max_len - len(token_ids))
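
`doc2bert` pads every document to a fixed length, so the model also sees the padding positions. The BERT models in `transformers` accept an optional `attention_mask` argument for this; the cells here do not pass one. A minimal plain-Python sketch of how such a mask could be derived, assuming the padding id is 0 (the helper name `pad_and_mask` and the example ids are illustrative):

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Right-pad token_ids to max_len and return (padded_ids, attention_mask)."""
    token_ids = token_ids[:max_len]
    n_pad = max_len - len(token_ids)
    padded = token_ids + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, mask


# illustrative ids only: [CLS], one word piece, [SEP]
padded, mask = pad_and_mask([101, 1109, 102], max_len=6)
print(padded)
print(mask)
```

The mask could then be batched alongside the token ids and passed as `bert(batch_X, attention_mask=batch_mask, labels=batch_y)`.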

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

token_ids = (doc2bert(x_, tokenizer) for x_ in gnad_train.text)

In [None]:
import torch
from torch.utils.data import TensorDataset, RandomSampler, DataLoader

X_train = torch.LongTensor(list(token_ids))
y_train = torch.LongTensor(y_train)

batch_size = 8
data_train = TensorDataset(X_train, y_train)
sampler = RandomSampler(data_train)
train_dataloader = DataLoader(data_train, sampler=sampler, batch_size=batch_size)

In [None]:
X_train, y_train

In [None]:
from transformers import BertForSequenceClassification

bert = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=len(torch.unique(y_train)))

Let's have a quick look at what the output from the model looks like.

In [None]:
output, *_ = bert(X_train[:4])
output

In [None]:
import torch.nn.functional as F

F.log_softmax(output, dim=1).exp()
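
`F.log_softmax(output, dim=1).exp()` is simply the softmax of the logits, written in a roundabout way. A plain-Python sketch of both functions for a single row of logits (the example values are made up):

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    """log(softmax(x)) computed via the log-sum-exp trick."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


logits = [0.3, -1.2, 2.0]  # made-up scores for three classes
probs = softmax(logits)
probs_via_log = [math.exp(v) for v in log_softmax(logits)]
print(probs)
print(probs_via_log)
```

Because softmax preserves the ordering of the scores, the predicted class can be read off the raw logits directly.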

## What's going on here?

In [None]:
bert

---

In [None]:
from transformers import AdamW
from transformers.optimization import WarmupLinearSchedule

num_epochs = 5

# pass the parameters of the classifier head ONLY to the optimizer
params = [p for n, p in bert.named_parameters() if 'classifier.' in n]
optimizer = AdamW(params, lr=3e-5, correct_bias=False)

num_total_steps = num_epochs * len(train_dataloader)  # len() also counts the final, partial batch
num_warmup_steps = int(num_total_steps * 0.15)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_total_steps)

# Learning Rate Schedule
## Warmup Linear Schedule

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

def plot_learning_rate(num_warmup_steps):
    scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_total_steps)
    learning_rates = []
    for i in range(num_total_steps):
        learning_rates.append(scheduler.get_lr())
        scheduler.step()
    plt.plot(learning_rates);
    plt.xlabel('Iteration');
    plt.ylabel('Learning Rate');

plot_learning_rate(int(num_total_steps * 0.05))
plot_learning_rate(int(num_total_steps * 0.10))
plot_learning_rate(int(num_total_steps * 0.15))
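
The shape of these curves comes from a simple multiplier on the base learning rate: it rises linearly from 0 to 1 during warmup and then decays linearly back to 0. To my understanding this mirrors what `WarmupLinearSchedule` computes; the function name `warmup_linear` is my own.

```python
def warmup_linear(step, warmup_steps, t_total):
    """Learning-rate multiplier at a given optimisation step."""
    if step < warmup_steps:
        # linear ramp-up from 0 to 1
        return step / max(1, warmup_steps)
    # linear decay from 1 to 0, clamped at 0 after t_total steps
    return max(0.0, (t_total - step) / max(1.0, t_total - warmup_steps))


base_lr = 3e-5
schedule = [base_lr * warmup_linear(s, warmup_steps=10, t_total=100) for s in range(101)]
print(schedule[0], schedule[10], schedule[55], schedule[100])
```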

In [None]:
import torch
torch.cuda.is_available()

# Document Classification with BERT

In [None]:
import tqdm
from torch.nn.utils import clip_grad_norm_

# Use a GPU if one is available
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert.to(DEVICE)
bert.train()  # from_pretrained() returns the model in eval mode; switch dropout back on

for _ in tqdm.trange(num_epochs, total=num_epochs, desc="Epoch"):
#     steps = tqdm.tqdm(train_dataloader,
#                       total=X_train.size()[0] // train_dataloader.batch_size + 1,
#                       desc='Mini-batch')
    train_loss = 0
    for i_step, batch in enumerate(train_dataloader):
        batch_X, batch_y = (b.to(DEVICE) for b in batch)
        loss, *_ = bert(batch_X, labels=batch_y)
        train_loss += loss.item()
        loss.backward()
        clip_grad_norm_(bert.parameters(), 1.0)
#         steps.set_postfix_str(f'avg. loss {train_loss / (i_step + 1):.4f}')
        
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
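
`clip_grad_norm_(bert.parameters(), 1.0)` rescales all gradients together whenever their combined L2 norm exceeds 1.0. A plain-Python sketch of that computation on nested lists (the function name and the example gradients are made up):

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one common factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # the small epsilon guards against division by zero
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# two "parameters" with made-up gradients; joint norm is sqrt(3^2 + 4^2) = 5
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after, clipped)
```

The direction of the update is preserved; only its length is capped.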

In [None]:
bert.to('cpu')  # move the weights off the GPU before pickling (to() modifies the module in place)
torch.save(bert, 'bert-GNADs-5epochs-HEAD.pt')

In [None]:
bert = torch.load('bert-GNADs-5epochs-HEAD.pt')

In [None]:
del batch_X, batch_y

In [None]:
torch.cuda.empty_cache()

---

In [None]:
import torch
from torch.utils.data import SequentialSampler

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

token_ids = (doc2bert(x_, tokenizer) for x_ in gnad_test.text)
X_test = torch.LongTensor(list(token_ids))
y_test = label_encoder.transform(gnad_test.category)
y_test = torch.LongTensor(y_test)

batch_size = 8
data_test = TensorDataset(X_test, y_test)
sampler = SequentialSampler(data_test)
test_dataloader = DataLoader(data_test, sampler=sampler, batch_size=batch_size)

In [None]:
from torch.utils.data import SequentialSampler
from torch.nn import functional as F


DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert.eval()
bert.to(DEVICE)
pred = []
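with torch.no_grad():  # no gradients needed at prediction time, saves memory
    for x, *_batch in test_dataloader:
        x = x.to(DEVICE)
        logits, *_ = bert(x)
        # softmax is monotonic, so the argmax of the logits is the predicted class
        pred_ = logits.argmax(dim=1)
        pred.extend(pred_.cpu().numpy().tolist())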

In [None]:
from sklearn import metrics

print(metrics.classification_report(y_test, pred, target_names=label_encoder.classes_))

---
                       BERT classifier head ONLY

                   precision    recall  f1-score   support
 
             Etat       0.00      0.00      0.00        67
           Inland       0.00      0.00      0.00       102
    International       0.29      0.03      0.06       151
           Kultur       0.00      0.00      0.00        54
         Panorama       0.18      0.61      0.28       168
            Sport       0.82      0.07      0.14       120
              Web       0.27      0.67      0.38       168
       Wirtschaft       0.00      0.00      0.00       141
     Wissenschaft       0.00      0.00      0.00        57

         accuracy                           0.22      1028
        macro avg       0.17      0.15      0.09      1028
     weighted avg       0.21      0.22      0.13      1028
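
The two average rows differ only in how the classes are weighted. A small worked example using the recall column and the support column of the report above (values copied from the table):

```python
# per-class recall and support, copied from the "classifier head ONLY" report
recall = [0.00, 0.00, 0.03, 0.00, 0.61, 0.07, 0.67, 0.00, 0.00]
support = [67, 102, 151, 54, 168, 120, 168, 141, 57]

# macro average: every class counts the same
macro_recall = sum(recall) / len(recall)

# weighted average: every class counts in proportion to its support
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(f'macro avg recall:    {macro_recall:.2f}')
print(f'weighted avg recall: {weighted_recall:.2f}')
```

The support-weighted recall is the same quantity as the accuracy, up to rounding of the per-class values.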


---
                 Linear SVM with parameter tuning

                   precision    recall  f1-score   support

             Etat       0.94      0.75      0.83        67
           Inland       0.89      0.84      0.86       102
    International       0.89      0.85      0.87       151
           Kultur       0.89      0.89      0.89        54
         Panorama       0.80      0.88      0.84       168
            Sport       0.99      0.97      0.98       120
              Web       0.92      0.90      0.91       168
       Wirtschaft       0.82      0.88      0.85       141
     Wissenschaft       0.89      0.96      0.92        57

         accuracy                           0.88      1028
        macro avg       0.89      0.88      0.89      1028
     weighted avg       0.89      0.88      0.88      1028
    
---

      BERT with fine-tuning the whole transformer stack

                  precision    recall  f1-score   support

             soc       0.55      0.70      0.61       398
             rec       0.82      0.69      0.75      1590
             alt       0.02      0.08      0.03       319
             sci       0.21      0.09      0.12      1579
            misc       0.81      0.55      0.65       390
            talk       0.63      0.58      0.60      1301
            comp       0.79      0.85      0.82      1955

        accuracy                           0.55      7532
       macro avg       0.54      0.51      0.51      7532
    weighted avg       0.60      0.55      0.57      7532


---

In [None]:
import torch
from torch.nn.utils import clip_grad_norm_

from transformers.optimization import WarmupLinearSchedule
from transformers import BertForSequenceClassification
from transformers import AdamW

from tqdm import tqdm_notebook as tqdmn

num_epochs = 5
# the checkpoint must match the tokenizer that produced X_train (same vocabulary)
bert = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=len(torch.unique(y_train)))

# include ALL the network parameters in the optimizer
params = list(bert.parameters())
optimizer = AdamW(params, lr=3e-5, correct_bias=False)

num_total_steps = num_epochs * len(train_dataloader)  # len() also counts the final, partial batch
num_warmup_steps = int(num_total_steps * 0.15)
scheduler = WarmupLinearSchedule(optimizer,
                                 warmup_steps=num_warmup_steps,
                                 t_total=num_total_steps)

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert.to(DEVICE)
bert.train()  # from_pretrained() returns the model in eval mode; switch dropout back on

for _ in tqdmn(range(num_epochs), total=num_epochs, desc="Epoch"):
    steps = tqdmn(train_dataloader,
                  total=len(train_dataloader),
                  desc='Mini-batch')
    train_loss = 0
    for i_step, batch in enumerate(steps):
        batch_X, batch_y = (b.to(DEVICE) for b in batch)
        loss, *_ = bert(batch_X, labels=batch_y)
        train_loss += loss.item()
        loss.backward()
        clip_grad_norm_(bert.parameters(), 1.0)
        steps.set_postfix_str(f'avg. loss {train_loss / (i_step + 1):.4f}')
        
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

In [None]:
bert.to('cpu')  # move the weights off the GPU before pickling (to() modifies the module in place)
torch.save(bert, 'bert-GNAD-5epochs-ALL.pt')

In [None]:
torch.cuda.empty_cache()

In [None]:
from torch.utils.data import SequentialSampler
from torch.nn import functional as F

X_test = torch.LongTensor(X_test)

batch_size = 4
data_test = TensorDataset(X_test)
sampler = SequentialSampler(data_test)
test_dataloader = DataLoader(data_test, sampler=sampler, batch_size=batch_size)

bert.eval()
bert.to(DEVICE)
pred = []
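with torch.no_grad():  # no gradients needed at prediction time, saves memory
    for x, *_batch in test_dataloader:
        x = x.to(DEVICE)
        logits, *_ = bert(x)
        # softmax is monotonic, so the argmax of the logits is the predicted class
        pred_ = logits.argmax(dim=1)
        pred.extend(pred_.cpu().numpy().tolist())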

In [None]:
from sklearn import metrics

print(metrics.classification_report(y_test, pred, target_names=label_encoder.classes_))

----

In [None]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause
from pprint import pprint
from time import time
import logging

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(tol=1e-3)),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

In [None]:
grid_search = GridSearchCV(pipeline, parameters, cv=5, n_jobs=3, verbose=1)

grid_search.fit(gnad_train.text, y_train)
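
The comment about combinatorial growth is easy to check: the number of candidates is the product of the number of values per parameter, and each candidate is fitted once per fold. A quick count for the uncommented grid with `cv=5`:

```python
from itertools import product

# the uncommented entries of the parameter grid
param_grid = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

# every combination of one value per parameter
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 5
n_fits = len(candidates) * n_folds

print(f'{len(candidates)} candidates x {n_folds} folds = {n_fits} fits')
```

`GridSearchCV` then refits the best candidate once more on the full training set, since `refit=True` is the default.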

In [None]:
from sklearn import metrics

In [None]:
print(metrics.classification_report(y_test,
                                    grid_search.best_estimator_.predict(gnad_test.text),
                                    target_names=label_encoder.classes_))

----