This notebook was run on Google Colab to make use of its GPU resources. You may need to upload `processed_data.zip` and execute the two cells below, which install the necessary packages and extract the data, before you can run the rest of the notebook.

In [None]:
!pip install transformers sentencepiece

In [None]:
!unzip processed_data.zip

In [None]:
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, LongformerModel

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
device

device(type='cuda')

In [None]:
PROCESSED_DATA_DIR = './processed_data/'
PROCESSED_CLINICAL_NOTES_FILE = PROCESSED_DATA_DIR + "ClinNotes.csv"
CLINICAL_BERT_VECTOR = PROCESSED_DATA_DIR + 'Clinical_Bert_vector.npy'

In [None]:
df_clinical_processed = pd.read_csv(PROCESSED_CLINICAL_NOTES_FILE)

# Deep Neural Network based Vectorization
Deep neural networks (DNNs) have proven effective for vectorizing clinical documents, and the transformer family achieves state-of-the-art performance. In this notebook, we use [Clinical-Longformer](https://arxiv.org/abs/2301.11847) to vectorize our clinical notes. It is a large language model pre-trained on a medical corpus, and it uses the Longformer architecture to support sequences of up to 4096 tokens, which is sufficient for our dataset. We use the pooler output as the vector for each clinical note.

##### EDIT
I want to acknowledge that my vectorization with the transformer language model is incorrect here. After reviewing the original Clinical-Longformer paper, I found that the model follows the RoBERTa training objective, which does NOT include the NSP (next sentence prediction) loss. The warning printed when the model is loaded in this notebook confirms this: the pooler weights are newly initialized rather than read from the checkpoint. So, without fine-tuning, the pooler output is NOT a good representation of the whole document. The correct approach is to take the final hidden state of each token as its contextualized word vector and then aggregate the word vectors into a document vector. However, I will continue with the pooler output for the following two reasons:
1. Time is limited, and the transformer language model needs GPU resources to run.
2. The word-vector aggregation method has already been carried out. (I do maintain, though, that the word vectors obtained from the language model are contextualized and should perform better.)
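
The aggregation described above is commonly done with masked mean pooling: average the final hidden states over the real tokens and ignore padding. The sketch below is illustrative only and is not what this notebook runs. The function name `masked_mean_pool` and the toy arrays are assumptions; in practice the inputs would be `outputs.last_hidden_state` and the tokenizer's `attention_mask`, both moved to CPU and converted to NumPy.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per document, skipping padded positions.

    hidden_states:  float array of shape (batch, seq_len, hidden_dim)
    attention_mask: 0/1 array of shape (batch, seq_len); 1 marks a real token
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an (unlikely) all-padding row.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy input: 2 documents, 3 token positions, hidden size 2.
# The last position of the first document is padding and must be ignored.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

doc_vectors = masked_mean_pool(hidden, mask)
print(doc_vectors.shape)  # one vector per document
```

Mean pooling is only one possible aggregation; max pooling or a weighted average would fit the same interface.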

In [None]:
model_name = 'yikuan8/Clinical-Longformer'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
model = LongformerModel.from_pretrained(model_name)

Some weights of the model checkpoint at yikuan8/Clinical-Longformer were not used when initializing LongformerModel: ['lm_head.decoder.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing LongformerModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LongformerModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LongformerModel were not initialized from the model checkpoint at yikuan8/Clinical-Longformer and are newly initialized: ['longformer.pooler.dense.weight', 'longformer.pooler.dense.bias']
You should probably TRAIN this model on a dow

In [None]:
model.eval()

LongformerModel(
  (embeddings): LongformerEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(4098, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): LongformerEncoder(
    (layer): ModuleList(
      (0-11): 12 x LongformerLayer(
        (attention): LongformerAttention(
          (self): LongformerSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (query_global): Linear(in_features=768, out_features=768, bias=True)
            (key_global): Linear(in_features=768, out_features=768, bias=True)
            (value_global): Linear(in_features=768, out_features=768, bias=True)
          )
    

In [None]:
model = model.to(device)

In [None]:
vectors = None

For the Longformer architecture, we need to explicitly define which tokens receive global attention. Since we only use the pooler output as the vector, we give global attention to the classification token only, which is the first token of each sequence.

In [None]:
with torch.no_grad():
    for num in range(0, len(df_clinical_processed), 50):
        # slice bounds for the current batch of 50 notes
        start, end = num, num + 50
        inputs = tokenizer(df_clinical_processed['notes'].to_list()[start:end], return_tensors='pt', return_token_type_ids=True, padding=True)
        
        input_ids = inputs.input_ids.to(device)
        attention_mask = inputs.attention_mask.to(device)
        token_type_ids = inputs.token_type_ids.to(device)
        
        # give global attention to the classification token (position 0) only
        global_attention_mask = torch.zeros(input_ids.shape, dtype=torch.long, device=input_ids.device)
        global_attention_mask[:, 0] = 1
        
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, global_attention_mask=global_attention_mask)
        
        batch_vectors = outputs.pooler_output.cpu().detach().numpy()
        # test against None explicitly: the truth value of a NumPy array is ambiguous
        if vectors is None:
            vectors = batch_vectors
        else:
            vectors = np.concatenate((vectors, batch_vectors))
        
        # explicitly drop the outputs and empty the cache to resolve the GPU out-of-memory issue on Google Colab
        outputs = None
        torch.cuda.empty_cache()
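
As a side note on the batching loop above, one alternative is to collect each batch's vectors in a list and concatenate once at the end, which avoids re-copying the growing array on every iteration. The sketch below only illustrates that pattern under stated assumptions: `batch_bounds` and `embed_batch` are hypothetical helpers, and `embed_batch` returns zeros as a stand-in for the model call so the example runs without a GPU.

```python
import numpy as np


def batch_bounds(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


def embed_batch(texts, dim=768):
    """Hypothetical stand-in for the model call: one zero row per text."""
    return np.zeros((len(texts), dim), dtype=np.float32)


# Toy data in place of df_clinical_processed['notes'].to_list()
notes = [f"note {i}" for i in range(120)]

chunks = []
for start, end in batch_bounds(len(notes), 50):
    chunks.append(embed_batch(notes[start:end]))

# Single concatenation at the end instead of one per batch.
vectors = np.concatenate(chunks, axis=0)
print(vectors.shape)
```

The last batch is simply shorter than the others; slicing past the end of a Python list is safe, so the explicit `min` only keeps the reported bounds accurate.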

In [None]:
np.save(CLINICAL_BERT_VECTOR, vectors)