# Modern NLP in Biomedicine
Attention, Transformers, BERT, BioBERT, PubMedBERT, CODER, KeBioLM

Author: Yuan Zheng

# Layers in neural network

Input: 
Batch_size \* k
$$x = (x_1, x_2, ..., x_k) \in \mathbf{R}^k$$


Linear layer


Output:
Batch_size \* n
$$y = (y_1, y_2, ..., y_n) \in \mathbf{R}^n$$

# Layers in neural network

Input: 
Batch_size \* Time_step \* k
$$x_i = (x_{1i}, x_{2i}, ..., x_{ki}) \in \mathbf{R}^k$$


RNN,
GRU,
LSTM,
CNN

Output:
Batch_size \* Time_step \* n
$$y_i = (y_{1i}, y_{2i}, ..., y_{ni}) \in \mathbf{R}^n$$
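The two slides above only state input and output shapes. As a quick sanity check, here is a minimal sketch, assuming PyTorch and arbitrary illustrative sizes for the batch, the time steps, k, and n (none of these numbers come from the slides). It uses `nn.Linear` for the first case and `nn.LSTM` as one representative of the sequence layers:

```python
import torch
from torch import nn

batch_size, time_step, k, n = 4, 7, 16, 32  # illustrative sizes, not from the slides

# Linear layer: (batch_size, k) -> (batch_size, n)
linear = nn.Linear(k, n)
x = torch.randn(batch_size, k)
y = linear(x)
print(y.shape)

# Sequence layer (an LSTM here): (batch_size, time_step, k) -> (batch_size, time_step, n)
lstm = nn.LSTM(input_size=k, hidden_size=n, batch_first=True)
x_seq = torch.randn(batch_size, time_step, k)
y_seq, _ = lstm(x_seq)  # the second return value holds the final hidden and cell states
print(y_seq.shape)
```

Note that `batch_first=True` is needed for the LSTM to accept the batch dimension first, as written on the slide.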

# Attention Mechanism

## Paper
Attention Is All You Need (Vaswani et al., 2017).

## Query, Key, Value
![Attention](https://pic2.zhimg.com/80/v2-f0e3e2fa8493252bfd09a586c30b042f_1440w.jpg?source=1940ef5c)

$$score = QK^T$$
$$alpha = Softmax(QK^T)$$
$$y = alpha * V$$

For example:
$$k_0 = (1, 0, 0, 0) = v_0$$
$$k_1 = (0, 1, 1, 1) = v_1$$
$$k_2 = (1, 1, 1, 1) = v_2$$
$$q = (1, 0, 1, 1)$$

$$score_0 = (q, k_0) = 1$$
$$score_1 = (q, k_1) = 2$$
$$score_2 = (q, k_2) = 3$$

$$alpha = (\frac{e}{e+e^2+e^3},\frac{e^2}{e+e^2+e^3},\frac{e^3}{e+e^2+e^3}) \approx (0.09, 0.24, 0.67)$$

$$y = 0.09 * v_0 + 0.24 * v_1 + 0.67 * v_2 \approx (0.76, 0.91, 0.91, 0.91)$$
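The worked example can be reproduced in a few lines of plain Python. This is only a sketch of the arithmetic for a single query; the variable names are chosen here for illustration.

```python
import math

# In the example, each key doubles as its own value.
keys = [(1, 0, 0, 0), (0, 1, 1, 1), (1, 1, 1, 1)]
values = keys
q = (1, 0, 1, 1)

# Dot-product score between the query and each key.
scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

# Softmax over the scores gives the attention weights.
exps = [math.exp(s) for s in scores]
alpha = [e / sum(exps) for e in exps]

# Output: attention-weighted sum of the value vectors.
y = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(len(q))]

print(scores)                        # [1, 2, 3]
print([round(a, 2) for a in alpha])  # [0.09, 0.24, 0.67]
print([round(c, 2) for c in y])      # [0.76, 0.91, 0.91, 0.91]
```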

# Self-Attention

Input: $$x$$

Parameters: $$W_Q, W_K, W_V$$

$$Q = W_Q * x$$
$$K = W_K * x$$
$$V = W_V * x$$

Output: $$y = softmax(QK^T)V=softmax(W_Qxx^TW_K^T)W_Vx$$

Important variant: scaled dot-product self-attention, where $$d_k$$ is the dimension of the key vectors
$$y = softmax(QK^T/\sqrt{d_k})V$$
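A minimal NumPy sketch of scaled self-attention follows. It assumes a row-vector convention (each row of `x` is one token, so `Q = x @ W_Q` rather than `W_Q * x`), random weights, and illustrative sizes; it is meant to show the shapes and the softmax normalization, not a trained layer.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (time_step, d_model); each row is one token vector.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]
    alpha = softmax(q @ k.T / np.sqrt(d_k))  # (time_step, time_step)
    return alpha @ v, alpha

rng = np.random.default_rng(0)
time_step, d_model, d_k = 5, 8, 4  # illustrative sizes
x = rng.normal(size=(time_step, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

y, alpha = self_attention(x, w_q, w_k, w_v)
print(y.shape)            # (5, 4)
print(alpha.sum(axis=1))  # every row of attention weights sums to 1
```

Dividing by the square root of `d_k` keeps the scores from growing with the key dimension, which would otherwise push the softmax towards a one-hot distribution.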

# What is self attention actually doing?

![self](self.png)

# Multi-head self attention
![multi-head](4.png)

# Transformers
State-of-the-art architecture for NLP and CV!
![trans](transformer.png)

# Encoder layer

- Positional Encoding
![pos](https://pic1.zhimg.com/v2-c9b34779e00ff95c10059df2b432b23b_r.jpg?source=1940ef5c)

- Add & Norm: LayerNorm(x + MultiHeadAttention(x))

- Feed forward: two linear layers with a non-linearity in between

Transformer (original paper) = 6 \* Encoder Layer + 6 \* Decoder Layer
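The Add & Norm and feed-forward steps listed above can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: the learned gain and bias of LayerNorm are omitted, ReLU is used as the non-linearity, the attention sub-layer is passed in as a function, and `uniform_attention` is a hypothetical stand-in for real multi-head attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (the learned gain and bias of a real LayerNorm are omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between.
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def encoder_layer(x, attention, w1, b1, w2, b2):
    # Add & Norm around the attention sub-layer, then around the feed-forward sub-layer.
    x = layer_norm(x + attention(x))
    return layer_norm(x + feed_forward(x, w1, b1, w2, b2))

def uniform_attention(h):
    # Hypothetical stand-in for MultiHeadAttention(h): average over all positions.
    return np.repeat(h.mean(axis=0, keepdims=True), len(h), axis=0)

rng = np.random.default_rng(0)
time_step, d_model, d_ff = 5, 8, 32  # illustrative sizes
x = rng.normal(size=(time_step, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = encoder_layer(x, uniform_attention, w1, b1, w2, b2)
print(out.shape)  # same shape as the input, so layers can be stacked
```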

# BERT
BERT-base = 12 \* Encoder Layer

GPT-1 = 12 \* Decoder Layer

GPT-3 = 96 \* Decoder Layer

## Input & Output
Input1: [CLS] There is a [MASK] in my bag. [SEP] It is Xiao ##mi. [SEP]

Input2: [CLS] There is a [MASK] in my bag. [SEP]

Output:

[CLS] -> Is the second sentence the next sentence? (Input1 only)

[MASK] -> phone

## Architecture

- Embedding Layer
![embed](https://pic4.zhimg.com/80/v2-4f9f62a7776afcdd1e1c99dfa57b965f_1440w.jpg)

- Arch
![Arch](https://pic1.zhimg.com/80/v2-9979c95d66a71a720207a48311702430_1440w.jpg)

## Pretraining Task
Next Sentence Prediction: Predict whether the second sentence follows the first. Of minor importance.

Masked Language Modelling: Recover the masked tokens in a sentence. This requires the model to understand the meaning of the sentence.
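To make the masked language modelling input concrete, here is a deliberately simplified sketch in plain Python. It assumes every selected token is replaced by `[MASK]`; BERT's actual recipe selects about 15% of tokens and additionally keeps some selected tokens unchanged or swaps them for random ones. The function name and the `mask_prob` and `seed` parameters are hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Returns (masked_tokens, labels). Labels hold the original token at
    # masked positions and None elsewhere, so the loss covers masked tokens only.
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]"}  # never mask the special tokens
    masked, labels = [], []
    for tok in tokens:
        if tok not in special and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "[CLS] There is a phone in my bag . [SEP]".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

During pre-training the model sees `masked` as input and is trained to predict the non-`None` entries of `labels`.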


# Memory issue for BERT

BERT-base is a 12-layer Transformer encoder with about 110M parameters, so GPU memory runs out quickly.
Maximum batch size on a 12 GB graphics card:

| Sequence length | Batch-size |
| :--:|:--:|
|64|64|
|128|32|
|256|16|
|512|6|

You cannot feed a sequence longer than 512 tokens into BERT directly.
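The simplest workaround is truncation, e.g. `tokenizer(text, truncation=True, max_length=512)`. When the tail of the document matters, one common alternative is to split it into overlapping windows and run BERT on each window. The sketch below shows the splitting step in plain Python; the function name and the `stride` parameter are hypothetical, and the default ids `cls_id=2` and `sep_id=3` are an assumption matching the PubMedBERT tokenizer output shown later in this document (in practice, use `tokenizer.cls_token_id` and `tokenizer.sep_token_id`).

```python
def chunk_token_ids(token_ids, max_len=512, stride=128, cls_id=2, sep_id=3):
    # Split a long list of token ids (without special tokens) into overlapping
    # windows, each wrapped as [CLS] ... [SEP] and at most max_len ids long.
    body = max_len - 2      # room left after [CLS] and [SEP]
    step = body - stride    # consecutive windows overlap by `stride` ids
    assert step > 0, "stride must be smaller than max_len - 2"
    chunks = []
    for start in range(0, max(len(token_ids), 1), step):
        window = token_ids[start:start + body]
        chunks.append([cls_id] + window + [sep_id])
        if start + body >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return chunks

long_ids = list(range(100, 1300))  # 1200 dummy token ids
chunks = chunk_token_ids(long_ids)
print([len(c) for c in chunks])  # [512, 512, 438]
```

How the per-window outputs are combined (for example by averaging or max-pooling) depends on the task and is left open here.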

# How to train an NLP model on a specific NLP task?
Tasks include:
- Text classification
- Named Entity Recognition
- Question Answering
- etc.

## Before Pretrained Language Model (PLM)
- Train/use a word2vec model as word representation
- Train a complex model (e.g. lots of LSTM layers) for specific NLP task

## After PLM
- Train/use a PLM as word/sentence representation (**Pre-training**)
- Fine-tune a simple model (e.g. a single linear layer) on top of the PLM for the specific NLP task (**Fine-tuning**)
- Do not freeze the PLM parameters during fine-tuning!

BERT performance >> LSTM

![fine-tune](https://pic2.zhimg.com/v2-f576d9d19c9dcac1c6ee6ea28ea7a2d9_r.jpg)

In [3]:
import transformers
print(transformers.__version__)

# How to load a pretrained model?
# Use its name, or download it to your local folder.
# Find model names on https://huggingface.co/models

3.0.2


# PLMs for biomedicine
BERT, **BioBERT**, ClinicalBERT, SciBERT, BlueBERT, **PubMedBERT**, **KeBioLM**

- BERT: Original BERT
- BioBERT: First biomedical BERT, trained on PubMed
- ClinicalBERT: Trained on PubMed + MIMIC-III
- SciBERT: Trained on scientific papers from Semantic Scholar (mostly biomedical, plus computer science)
- BlueBERT: Trained on PubMed + MIMIC-III
- PubMedBERT: New vocabulary, trained from scratch on PubMed
- KeBioLM: Integrates entity knowledge, trained on PubMed https://arxiv.org/abs/2104.10344

![result](./1.png)

In [3]:
from transformers import AutoModel, AutoTokenizer
# How to load them?

# name of other models:
# Original bert: bert-base-cased
# BioBERT: dmis-lab/biobert-v1.1
# ClinicalBERT: emilyalsentzer/Bio_ClinicalBERT
# SciBERT: allenai/scibert_scivocab_uncased
# BlueBERT: bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12
# KeBioLM: https://github.com/GanjinZero/KeBioLM
# More details for KeBioLM please ask me

model = AutoModel.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract')
tokenizer = AutoTokenizer.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract')

In [30]:
# BERT Tokenizer
sentence = ['bm ii', 'PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences']
tokenize = tokenizer(sentence)
print(tokenize)
word_idx = tokenize['input_ids'][1]
print("---")
print([tokenizer.convert_ids_to_tokens(word_idx)])

{'input_ids': [[2, 3732, 2517, 3], [2, 9919, 3602, 1063, 11, 4356, 1015, 12, 1744, 42, 2964, 4591, 16, 8236, 5493, 2014, 1685, 10719, 1690, 2978, 13222, 3]], 'token_type_ids': [[0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
---
[['[CLS]', 'pubmed', 'central', '##®', '(', 'pm', '##c', ')', 'is', 'a', 'free', 'full', '-', 'text', 'arch', '##ive', 'of', 'biomedical', 'and', 'life', 'sciences', '[SEP]']]


In [21]:
# Feature Extraction
# Without fine-tuning, BERT's output is not a good sentence representation!
tokenize = tokenizer(sentence, padding=True, return_tensors="pt")
print(tokenize)
output = model(**tokenize)
print("---")
print(output)

{'input_ids': tensor([[    2,  3732,  2517,     3,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [    2,  9919,  3602,  1063,    11,  4356,  1015,    12,  1744,    42,
          2964,  4591,    16,  8236,  5493,  2014,  1685, 10719,  1690,  2978,
         13222,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
---
(tensor([[[-8.1338e-01,  5.5416e-01,  4.1603e-01,  ...,  2.8464e-01,
           5.4926e-02,  1.9964e-01],
         [-1.6601e-01,  5.0937e-01,  1.0445e+00,  ..., -2.6240e-01,
           7.8138e-01, -3.0793e-01],
         [-1.8830e+00,  3.4614e-01,  2.2237e-01,  ..., -1.505

In [25]:
print(output[0].shape) # Word representation: one vector per token
print(output[1].shape) # Sentence representation: pooled [CLS] vector

h = model(**tokenize)[1]
print(h.shape)

torch.Size([2, 22, 768])
torch.Size([2, 768])
torch.Size([2, 768])


# A general pipeline of training PyTorch Model

- Create Dataset, Dataloader

- Create Model, Optimizer

    **In this tutorial, we only care about model and optimizer related to BERT.**

- For each training step:
    - fetch a batch of data
    - forward pass (compute the loss)
    - backward pass (compute gradients) and optimizer step
    
- Evaluation, Save model, Prediction
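The pipeline above can be sketched end to end on toy data. This is a minimal illustration, not the fine-tuning setup itself: it assumes random features and labels, and a single linear layer (`toy_model`, a hypothetical name) stands in for BERT plus classifier so that it runs in a second on CPU.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# 1. Dataset and DataLoader (random toy data: 64 samples, 8 features, 3 classes).
features = torch.randn(64, 8)
labels = torch.randint(0, 3, (64,))
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

# 2. Model and optimizer (a linear layer stands in for BERT + classifier).
toy_model = nn.Linear(8, 3)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# 3. Training steps.
toy_model.train()
for epoch in range(3):
    for batch_x, batch_y in loader:   # fetch a batch of data
        logits = toy_model(batch_x)   # forward pass
        loss = loss_fn(logits, batch_y)
        toy_optimizer.zero_grad()     # clear gradients from the previous step
        loss.backward()               # backward pass
        toy_optimizer.step()          # parameter update

# 4. Evaluation (on the training data here, for brevity).
toy_model.eval()
with torch.no_grad():
    accuracy = (toy_model(features).argmax(dim=1) == labels).float().mean().item()
print(f"train accuracy: {accuracy:.2f}")

# Saving would be: torch.save(toy_model.state_dict(), "toy_model.pt")
```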

In [32]:
# Design a new model based on BERT
from torch import nn

class BertSentenceClassifier(nn.Module):
    def __init__(self, init_model, class_count=10):
        super().__init__()
        self.bert = AutoModel.from_pretrained(init_model)
        # Read the hidden size (768 for BERT-base) from the config instead of hard-coding it
        self.classifier = nn.Linear(self.bert.config.hidden_size, class_count)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, label=None, attention_mask=None):
        # input_ids, attention_mask: batch_size * sentence_length
        # Pass the attention mask so that padding tokens are ignored
        h = self.bert(input_ids, attention_mask=attention_mask)[1] # batch_size * hidden_size
        predict_y = self.classifier(h) # batch_size * class_count
        if label is not None:
            loss = self.loss_fn(predict_y, label)
            return predict_y, loss
        return predict_y, 0.0

classifier = BertSentenceClassifier("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", class_count=3)
batch = tokenizer(sentence, padding=True, return_tensors="pt")
predict_y, _ = classifier(batch['input_ids'], attention_mask=batch['attention_mask'])
print(predict_y)

tensor([[ 0.6538, -0.0370, -0.2952],
        [ 0.6350,  0.0014, -0.2340]], grad_fn=<AddmmBackward>)


# Optimizer and Scheduler for fine-tuning BERT

- Use AdamW as the default optimizer
- Use a small learning rate, between 1e-5 and 5e-5
- Warmup & Linear Decay
    - 0 - 10% steps linear warmup
    - 10% - 100% steps linear decay

    ![decay](https://img-blog.csdnimg.cn/20200721131948457.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3F4cXN1bnNoaW5l,size_16,color_FFFFFF,t_70)
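The warmup and linear decay schedule described above boils down to a multiplier on the peak learning rate. Here is a minimal plain-Python sketch of that multiplier; the function name and `warmup_ratio` are hypothetical, and the 10% warmup share is taken from the list above.

```python
def linear_warmup_decay(step, total_steps, warmup_ratio=0.1):
    # Multiplier applied to the peak learning rate at a given training step.
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear warmup: 0 -> 1 over the first warmup_steps steps.
        return step / max(1, warmup_steps)
    # Linear decay: 1 -> 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

peak_lr = 5e-5
total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, peak_lr * linear_warmup_decay(step, total_steps))
```

The learning rate is zero at step 0, reaches its peak of 5e-5 at step 100, is back at half the peak by step 550, and returns to zero at step 1000.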

In [1]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in newer releases
from transformers import get_linear_schedule_with_warmup
learning_rate = 5e-5
weight_decay = 0.01
adam_epsilon = 1e-8
t_total = 1000

no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in classifier.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },
    {"params": [p for n, p in classifier.named_parameters() if any(
        nd in n for nd in no_decay)], "weight_decay": 0.0},
]

optimizer = AdamW(optimizer_grouped_parameters,
                  lr=learning_rate,
                  eps=adam_epsilon)

scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * t_total), num_training_steps=t_total
)

# CODER
Cross-Lingual Medical Term Representation via Knowledge Graph Contrastive Learning
https://arxiv.org/abs/2011.02947
![coder](2.png)

CODER is useful for **term** representation!

![coder3](3.png)

Features of CODER:
- Multilingual
- Pretrained for term normalization with synonym and relation knowledge
- Dual Contrastive Learning

In [1]:
# How to load CODER?
# English version: GanjinZero/UMLSBert_ENG
# Multilingual version: GanjinZero/UMLSBert_ALL
import torch
from transformers import AutoModel, AutoTokenizer
coder = AutoModel.from_pretrained('GanjinZero/UMLSBert_ALL')
coder_tok = AutoTokenizer.from_pretrained('GanjinZero/UMLSBert_ALL')

In [22]:
# Feature Extraction & Term Similarity
sen = ['背痛', 'backache', 'dorsalgia', 'heart attack', 'Rückenschmerzen']
tokenized = coder_tok(sen, padding=True, return_tensors="pt")
h = coder(**tokenized)[1]
h_norm = h / torch.norm(h, 2, dim=1).unsqueeze(-1)
print(torch.mm(h_norm, h_norm.t()))

tensor([[1.0000, 0.7842, 0.7475, 0.2809, 0.7848],
        [0.7842, 1.0000, 0.6829, 0.2937, 0.7063],
        [0.7475, 0.6829, 1.0000, 0.3585, 0.7391],
        [0.2809, 0.2937, 0.3585, 1.0000, 0.3581],
        [0.7848, 0.7063, 0.7391, 0.3581, 1.0000]], grad_fn=<MmBackward>)


# HW1
Train a BERT-based text classifier on the THUCNews dataset

# HW2
Define a function using CODER
```python
def retrieve_most_similar(word, dictionary):
    # Sort the dictionary by similarity between `word` and each entry.
    # Assume the dictionary holds more terms than fit in one batch.
    sorted_dictionary = ...  # TODO: your implementation
    return sorted_dictionary
```