<a href="https://colab.research.google.com/github/Tdas-christ/LLM/blob/main/Cardiac_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing necessary libraries

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [2]:
!pip install datasets
!pip install -U pip setuptools wheel
!pip install -U spacy[cuda110]
!pip install -U scikit-learn
!pip install matplotlib
!pip install wikipedia
!python -m spacy download en_core_web_sm

Collecting spacy[cuda110]
  Using cached spacy-3.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting cupy-cuda110<13.0.0,>=5.0.0b4 (from spacy[cuda110])
  Using cached cupy_cuda110-12.3.0-cp310-cp310-manylinux2014_x86_64.whl.metadata (2.7 kB)
Using cached cupy_cuda110-12.3.0-cp310-cp310-manylinux2014_x86_64.whl (79.5 MB)
Using cached spacy-3.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.0 MB)
Installing collected packages: cupy-cuda110, spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 3.7.5
    Uninstalling spacy-3.7.5:
      Successfully uninstalled spacy-3.7.5
Successfully installed cupy-cuda110-12.3.0 spacy-3.7.6
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py

# Download a BERT model and its WordPiece tokenizer

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Tokenize a phrase about Heart Disease

In [5]:
text = "Myxomatous degeneration of the mitral valve."

In [6]:
# tokenization of the text
tokens = tokenizer.tokenize(text)
print(tokens)

['my', '##x', '##oma', '##tou', '##s', 'de', '##gen', '##eration', 'of', 'the', 'mit', '##ral', 'valve', '.']


In [7]:
# back to text
tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True)

'myxomatous degeneration of the mitral valve.'

In [9]:
print(tokenizer.tokenize('myxomatous'))
print(tokenizer.tokenize('degeneration'))
print(tokenizer.tokenize('mitral'))

['my', '##x', '##oma', '##tou', '##s']
['de', '##gen', '##eration']
['mit', '##ral']


**We can see that the BERT WordPiece tokenizer (from the bert-base-uncased model) splits the words myxomatous, degeneration and mitral into subwords because they do not exist as whole words in the tokenizer vocabulary.**
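To make the splitting above more concrete, here is a minimal sketch of the greedy longest-match-first procedure that WordPiece tokenizers apply to a single word. The vocabulary `toy_vocab` and the function name `wordpiece_tokenize` are assumptions made for illustration; this is not the library implementation, and the real BERT vocabulary is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary (an assumption for illustration, not the real BERT vocabulary).
toy_vocab = {"my", "##x", "##oma", "##tou", "##s", "mit", "##ral", "valve"}

print(wordpiece_tokenize("myxomatous", toy_vocab))
print(wordpiece_tokenize("mitral", toy_vocab))
print(wordpiece_tokenize("valve", toy_vocab))
```

A word that is in the vocabulary, like `valve` here, comes back as a single token; the other words are covered piece by piece from left to right.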

In [10]:
# Verify that the words myxomatous, degeneration and mitral do not belong to the tokenizer vocabulary
vocab = [tok for tok,index in tokenizer.get_vocab().items()]
"myxomatous" in vocab, "degeneration" in vocab, "mitral" in vocab

(False, False, False)

# Test 1: We add 3 new tokens (whole words) into the tokenizer vocabulary

In [11]:
new_tokens = ['myxomatous', 'degeneration', 'mitral']

In [12]:
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer))
added_tokens = tokenizer.add_tokens(new_tokens)

print("[ AFTER ] tokenizer vocab size:", len(tokenizer))
print()
print('added_tokens:',added_tokens)
print()

# resize the embeddings matrix of the model
model.resize_token_embeddings(len(tokenizer))

[ BEFORE ] tokenizer vocab size: 30522
[ AFTER ] tokenizer vocab size: 30525

added_tokens: 3



Embedding(30525, 768)
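What resizing the embedding matrix means can be sketched with plain Python lists: the rows of the existing tokens are kept, and one row is appended per added token. The helper `resize_embeddings` and the random initialisation below are assumptions for illustration; the initialisation actually used by the library may differ.

```python
import random


def resize_embeddings(matrix, new_size, seed=0):
    """Return a copy of `matrix` with rows appended until it has `new_size` rows."""
    rng = random.Random(seed)
    dim = len(matrix[0])
    resized = [row[:] for row in matrix]  # existing rows are kept unchanged
    while len(resized) < new_size:
        # New rows start from small random values (an assumed initialisation).
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


old = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # toy: 2 tokens, 4 dimensions
new = resize_embeddings(old, new_size=5)              # as if 3 tokens were added

print(len(new), len(new[0]))
print(new[:2] == old)
```

The appended rows carry no learned meaning yet, so the model has to be trained further before the new tokens become useful.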

In [13]:
# Verify that the new words got added and belong to the tokenizer vocabulary
vocab = [tok for tok,index in tokenizer.get_vocab().items()]
"myxomatous" in vocab, "degeneration" in vocab, "mitral" in vocab

(True, True, True)

In [14]:
tokenizer_xBERT = tokenizer

In [15]:
# tokenization of the input text
tokens = tokenizer_xBERT.tokenize(text)
print(tokens)

['myxomatous', 'degeneration', 'of', 'the', 'mitral', 'valve', '.']


In [17]:
# back to text
tokenizer_xBERT.decode(tokenizer_xBERT.encode(text), skip_special_tokens=True)

'myxomatous degeneration of the mitral valve.'

**The tokenizer with the 3 new tokens succeeded in tokenizing the words myxomatous, degeneration and mitral without subwords, as they now belong to the tokenizer vocabulary.**

In [18]:
# tokenization of the words myxomatous, degeneration and mitral
print(tokenizer_xBERT.tokenize('myxomatous'))
print(tokenizer_xBERT.tokenize('degeneration'))
print(tokenizer_xBERT.tokenize('mitral'))

['myxomatous']
['degeneration']
['mitral']


# Test 2: We add more new tokens (subwords and words) into the tokenizer vocab

What if we want to detect the whole vocabulary of a specialized corpus in order to add it to an existing tokenizer vocabulary?

Let us try using a WordPiece tokenizer for this.

1) Import pages about Mitral valve prolapse from Wikipedia

In [23]:
!pip install wikipedia



In [24]:
import wikipedia

# let us choose 2 Wikipedia pages for our demonstration
pages = ["Mitral valve prolapse","Myxoma"]

documents = list()
for p in pages:
  page = wikipedia.page(p)
  documents.append(page.content)
  print(page.title,page.url)

Mitral valve prolapse https://en.wikipedia.org/wiki/Mitral_valve_prolapse
Myxoma https://en.wikipedia.org/wiki/Myxoma


2) Train a WordPiece tokenizer on the imported Wikipedia pages

In [25]:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

# Initialize the tokenizer with the WordPiece model
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Apply normalizers
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# Apply pre-tokenizer
bert_tokenizer.pre_tokenizer = Whitespace()

# Define the post-processing template
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",  # Correct template for pair
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

# Instantiate a trainer for the WordPiece model
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

# Assuming 'documents' is your iterator over the training data
files = documents
bert_tokenizer.train_from_iterator(files, trainer)

3) Get the vocabulary that is not in the original BERT tokenizer

This step is not necessary, as the `tokenizer.add_tokens()` method will add new tokens only if they do not belong to the existing tokenizer vocabulary. However, it helps us to see what these new tokens are.

In [26]:
old_vocab = [k for k,v in tokenizer.get_vocab().items()]
new_vocab = [k for k,v in bert_tokenizer.get_vocab().items()]
idx_old_vocab_list = list()
same_tokens_list = list()
different_tokens_list = list()

for idx_new,w in enumerate(new_vocab):
  try:
    idx_old = old_vocab.index(w)
  except ValueError:
    idx_old = -1
  if idx_old>=0:
      idx_old_vocab_list.append(idx_old)
      same_tokens_list.append((w,idx_new))
  else:
      different_tokens_list.append((w,idx_new))

In [27]:
len(same_tokens_list),len(different_tokens_list),len(same_tokens_list)+len(different_tokens_list)

(1602, 1085, 2687)

We found 1085 tokens (subwords or words) that are not in the vocabulary of the original tokenizer.
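The comparison above can also be written with a set, which avoids searching the old vocabulary list once per token. This is a minimal sketch on toy data; the function name `split_vocab` and the example tokens are hypothetical.

```python
def split_vocab(candidate_tokens, existing_vocab):
    """Partition candidates into tokens already known and tokens that are new."""
    existing = set(existing_vocab)  # set membership is O(1) on average
    same, different = [], []
    for idx, tok in enumerate(candidate_tokens):
        (same if tok in existing else different).append((tok, idx))
    return same, different


# Toy data (hypothetical, for illustration only).
existing_vocab = ["the", "of", "valve", "##s"]
candidates = ["valve", "mitral", "of", "##tolog"]

same, different = split_vocab(candidates, existing_vocab)
print(same)
print(different)
```

With the real vocabularies, `candidates` would be the keys of `bert_tokenizer.get_vocab()` and `existing_vocab` the keys of `tokenizer.get_vocab()`.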

In [28]:
# get list of new tokens
new_tokens = [k for k,v in different_tokens_list]
len(new_tokens), new_tokens[:10]

(1085,
 ['##ences',
  'includ',
  'abn',
  'eith',
  '##ength',
  'fastidi',
  'prolongation',
  '##tolog',
  'dur',
  '##oiding'])

4) Add the new tokens (subwords and words) in the vocabulary of the original BERT tokenizer

In [29]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [30]:
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer))
added_tokens = tokenizer.add_tokens(new_tokens)

print("[ AFTER ] tokenizer vocab size:", len(tokenizer))
print()
print('added_tokens:',added_tokens)
print()

# resize the embeddings matrix of the model
model.resize_token_embeddings(len(tokenizer))

[ BEFORE ] tokenizer vocab size: 28996
[ AFTER ] tokenizer vocab size: 30042

added_tokens: 1085



Embedding(30042, 768)

In [35]:
# The words myxomatous, degeneration and mitral should not belong to the original tokenizer vocabulary
vocab = [tok for tok,index in tokenizer.get_vocab().items()]
"myxomatous" in vocab, "degeneration" in vocab, "mitral" in vocab

(False, False, False)

Let's call tokenizer_xBERT1 our tokenizer with the new tokens.

In [36]:
tokenizer_xBERT1 = tokenizer

In [37]:
# tokenization of the text
tokens = tokenizer_xBERT1.tokenize(text)
print(tokens)

['My', '##x', '##oma', '##to', '##us', 'degen', 'era', '##tion', 'of', 'the', 'mitr', 'al', 'valv', 'e', '.']


In [38]:
# back to text
tokenizer_xBERT1.decode(tokenizer_xBERT1.encode(text), skip_special_tokens=True)

'Myxomatous degen eration of the mitr al valv e.'

**As the words myxomatous, degeneration and mitral do not belong to the tokenizer vocabulary, they continue to be tokenized with subwords.**

**We can see that most of the words in the sentence are not well tokenized.**

In [40]:
# tokenization of the test words
print(tokenizer_xBERT1.tokenize('myxomatous'))
print(tokenizer_xBERT1.tokenize('degeneration'))
print(tokenizer_xBERT1.tokenize('mitral'))

['myxomat', 'ou', '##s']
['degen', 'era', '##tion']
['mitr', 'al']
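One likely explanation for these splits, assuming the tokenizer first isolates every added token found in the raw text and only then runs WordPiece on the remaining pieces: an added word fragment such as `valv` cuts `valve` in two, and the leftover `e` is tokenized as if it were a separate word. The sketch below reproduces only this first splitting stage. The function `split_on_added_tokens` is hypothetical, and the fragments in `added` are taken from the outputs above on the assumption that they are among the added tokens.

```python
import re


def split_on_added_tokens(text, added_tokens):
    """Cut `text` into segments, isolating every occurrence of an added token."""
    if not added_tokens:
        return [text]
    # Longer tokens first so that the longest candidate wins at a given position.
    ordered = sorted(added_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    # re.split with a capturing group keeps the matched tokens in the result.
    return [segment for segment in re.split(pattern, text) if segment]


# Word fragments seen in the outputs above (assumed to be among the added tokens).
added = ["myxomat", "degen", "mitr", "valv"]

for word in ["degeneration", "mitral", "valve", "Myxomatous"]:
    print(word, "->", split_on_added_tokens(word, added))
```

Under this assumption, `degeneration` becomes `degen` plus `eration`, and WordPiece then splits `eration` on its own, which matches the `'degen', 'era', '##tion'` output. The capitalized `Myxomatous` matches no lowercase fragment, so it is left to the original cased vocabulary.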


5) Add only the new tokens that do not start with ## in the vocabulary of the original BERT tokenizer.

We know that a subword is not just a token that starts with ##, but let's see what happens if we remove all those subwords from the list of the new tokens.

In [41]:
# get list of new tokens as whole words
new_tokens = [tok for tok in new_tokens if not tok.startswith("#")]
len(new_tokens), new_tokens[:10]

(695,
 ['includ',
  'abn',
  'eith',
  'fastidi',
  'prolongation',
  'dur',
  '")',
  'displa',
  'isch',
  'pedunculated'])

In [42]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)



In [43]:
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer))
added_tokens = tokenizer.add_tokens(new_tokens)

print("[ AFTER ] tokenizer vocab size:", len(tokenizer))
print()
print('added_tokens:',added_tokens)
print()

# resize the embeddings matrix of the model
model.resize_token_embeddings(len(tokenizer))

[ BEFORE ] tokenizer vocab size: 28996
[ AFTER ] tokenizer vocab size: 29691

added_tokens: 695



Embedding(29691, 768)

Let's call tokenizer_exBERT our tokenizer with the new tokens.

In [44]:
tokenizer_exBERT = tokenizer

In [45]:
# tokenization of the text
tokens = tokenizer_exBERT.tokenize(text)
print(tokens)

['My', '##x', '##oma', '##to', '##us', 'degen', 'era', '##tion', 'of', 'the', 'mitr', 'al', 'valv', 'e', '.']


In [46]:
# back to text
tokenizer_exBERT.decode(tokenizer_exBERT.encode(text), skip_special_tokens=True)

'Myxomatous degen eration of the mitr al valv e.'

**The tokenizer continues to fail!**

**It means that we must improve the new tokens list by also taking out the subwords that begin a word (i.e., those that do not start with ##).**

In [47]:
# tokenization of the test words (this cell calls tokenizer_xBERT1, the tokenizer from step 4)
print(tokenizer_xBERT1.tokenize('myxomatous'))
print(tokenizer_xBERT1.tokenize('degeneration'))
print(tokenizer_xBERT1.tokenize('mitral'))

['myxomat', 'ou', '##s']
['degen', 'era', '##tion']
['mitr', 'al']


# Test 3: Add new tokens (only words, not subwords) into the tokenizer vocab

Let us add only the new tokens that are whole words, not subwords (that is, tokens that neither start with ## nor are followed by a subword starting with ##), to the vocabulary of the original BERT tokenizer.

1) Let us use a word tokenizer (spaCy) and scikit-learn to find the most frequent words of our corpus

We use a word tokenizer like spaCy to find the most frequent words of our corpus, instead of a WordPiece tokenizer, which generates subwords as well.

**Observation**: here, the expression `most frequent words` means the tokens present in most of the documents.

In [48]:
import spacy
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt

In [49]:
# initialize our tokenizer with the English spaCy one
nlp = spacy.load("en_core_web_sm", exclude=['morphologizer', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [50]:
def spacy_tokenizer(document, nlp=nlp):
    # tokenize the document with spaCy
    doc = nlp(document)
    # keep token texts that are not stop words, punctuation, whitespace or line breaks
    tokens = [
        token.text for token in doc
        if not token.is_stop
        and not token.is_punct
        and token.text.strip() != ''
        and "\n" not in token.text
    ]
    return tokens

def dfreq(idf, N):
    return (1+N) / np.exp(idf - 1) - 1
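The `dfreq` helper above inverts the smoothed IDF that scikit-learn documents for `smooth_idf=True`, namely idf = ln((1 + N) / (1 + df)) + 1, to recover the document frequency df of each token. Here is a small check of that round trip with plain floats; the function names `smoothed_idf` and `doc_freq_from_idf` are ours, chosen for illustration.

```python
import math


def smoothed_idf(df, n_docs):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1


def doc_freq_from_idf(idf, n_docs):
    """Invert smoothed_idf: same formula as the dfreq helper, for plain floats."""
    return (1 + n_docs) / math.exp(idf - 1) - 1


n_docs = 2
for df in (1, 2):
    idf = smoothed_idf(df, n_docs)
    recovered = doc_freq_from_idf(idf, n_docs)
    # Round instead of truncating: the float may land slightly below the integer.
    print(df, round(idf, 4), round(recovered))
```

Because the recovered value is a float, rounding it is safer than truncating it to an integer.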

In [51]:
%%time
# https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
tfidf_vectorizer = TfidfVectorizer(lowercase=False, tokenizer=spacy_tokenizer,
                                   norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
# parse matrix of tfidf
docs = documents
length = len(docs)
result = tfidf_vectorizer.fit_transform(docs)
# print(result.shape)

# idf
idf = tfidf_vectorizer.idf_

# sorted idf, tokens and docs frequencies
idf_sorted_indexes = sorted(range(len(idf)), key=lambda k: idf[k])
idf_sorted = idf[idf_sorted_indexes]
tokens_by_df = np.array(tfidf_vectorizer.get_feature_names_out())[idf_sorted_indexes]
dfreqs_sorted = dfreq(idf_sorted, length).astype(np.int32)
tokens_dfreqs = {tok:dfreq for tok, dfreq in zip(tokens_by_df,dfreqs_sorted)}
tokens_pct_list = [int(round(dfreq/length*100,2)) for token,dfreq in tokens_dfreqs.items()]

CPU times: user 271 ms, sys: 0 ns, total: 271 ms
Wall time: 293 ms


In [52]:
# we have only 2 documents (that's why we scan the interval [1,101] with a step of 50)
number_tokens_with_DF_above_pct = list()
for pct in range(1,101,50):
    index_max = len(np.array(tokens_pct_list)[np.array(tokens_pct_list)>=pct])
    number_tokens_with_DF_above_pct.append(index_max)

In [53]:
# DF = Document Frequency

# df_docfreqs = pd.DataFrame(number_tokens_with_DF_above_pct, columns=['number of tokens with DF above x%'])
# df_docfreqs.index += 1
# df_docfreqs.transpose()

# plt.plot(number_tokens_with_DF_above_pct)
# plt.title(f'Document Frequency above of {pct}%')
# plt.show()

df_docfreqs = pd.DataFrame({'pct':list(range(1,101,50)),'number of tokens with DF above pct%':number_tokens_with_DF_above_pct})
df_docfreqs.transpose()

|  | 0 | 1 |
|---|---|---|
| pct | 1 | 51 |
| number of tokens with DF above pct% | 1038 | 61 |


**There are 1038 words that appear in at least one of our 2 documents, and 61 that appear in both documents.**

**Let's consider that the 1038 words are all important and relevant to our Mitral valve prolapse corpus.**

**Observation**: With a corpus of more documents, we could have used another rule, for example keeping only the words that appear in at least 10% of the documents.
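Such a rule could look like the following minimal sketch, which counts in how many documents each word appears and keeps the words above a percentage threshold. The function `words_above_doc_freq` and the toy corpus are hypothetical, and the documents are assumed to be already split into words.

```python
from collections import Counter


def words_above_doc_freq(tokenized_docs, min_pct):
    """Keep the words present in at least `min_pct` percent of the documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return sorted(w for w, df in doc_freq.items() if df / n_docs * 100 >= min_pct)


# Toy corpus (hypothetical): each document is already split into words.
docs = [
    ["mitral", "valve", "prolapse"],
    ["myxoma", "valve", "tumor"],
    ["mitral", "valve", "leaflet"],
]

print(words_above_doc_freq(docs, min_pct=10))
print(words_above_doc_freq(docs, min_pct=60))
```

A low threshold keeps every word of the corpus, while a higher one keeps only the words shared by most documents.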

**Get the vocabulary that is not in the original BERT tokenizer**

This step is not necessary, as the `tokenizer.add_tokens()` method will add new tokens only if they do not belong to the existing tokenizer vocabulary. However, it helps us to see what these tokens are.

In [54]:
# list of new tokens
pct = 1
index_max = len(np.array(tokens_pct_list)[np.array(tokens_pct_list)>=pct])
new_tokens = tokens_by_df[:index_max]
# print(len(new_tokens))

old_vocab = [k for k,v in tokenizer.get_vocab().items()]
new_vocab = [token for token in new_tokens]
idx_old_vocab_list = list()
same_tokens_list = list()
different_tokens_list = list()

for idx_new,w in enumerate(new_vocab):
  try:
    idx_old = old_vocab.index(w)
  except ValueError:
    idx_old = -1
  if idx_old>=0:
      idx_old_vocab_list.append(idx_old)
      same_tokens_list.append((w,idx_new))
  else:
      different_tokens_list.append((w,idx_new))

In [55]:
len(same_tokens_list),len(different_tokens_list),len(same_tokens_list)+len(different_tokens_list)

(875, 163, 1038)

**We found 163 tokens (whole words) that are not in the vocabulary of the original tokenizer and the words myxomatous, degeneration and mitral belong to the new tokens list.**

In [56]:
# get list of new tokens
new_tokens = [k for k,v in different_tokens_list]
print(len(new_tokens), new_tokens[:20])

163 ['Cardiac', 'Diagnosis', 'Epidemiology', 'MRI', 'Myxomatous', 'References', 'Sudden', 'atrium', 'benign', 'degeneration', 'i.e.', '0.2', '0.4', '1.^', '11p15.4', '13.q31.3', '16p12.1', '2.4', '2.^', '2–3']


In [57]:
"Myxomatous" in new_tokens, "degeneration" in new_tokens, "mitral" in new_tokens

(True, True, True)

Add new tokens (only whole words, not subwords) in the vocabulary of the original BERT tokenizer

In [58]:
# import model and tokenizer
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)



In [59]:
print("[ BEFORE ] tokenizer vocab size:", len(tokenizer))
added_tokens = tokenizer.add_tokens(new_tokens)

print("[ AFTER ] tokenizer vocab size:", len(tokenizer))
print()
print('added_tokens:',added_tokens)
print()

# resize the embeddings matrix of the model
model.resize_token_embeddings(len(tokenizer))

[ BEFORE ] tokenizer vocab size: 28996
[ AFTER ] tokenizer vocab size: 29159

added_tokens: 163



Embedding(29159, 768)

Let's call tokenizer_exBERT2 our tokenizer with the new tokens.

In [60]:
tokenizer_exBERT2 = tokenizer

In [61]:
# tokenization of the text
tokens = tokenizer_exBERT2.tokenize(text)
print(tokens)

['Myxomatous', 'degeneration', 'of', 'the', 'mitral', 'valve', '.']


In [62]:
# back to text (this cell calls tokenizer_exBERT from Test 2, not tokenizer_exBERT2)
tokenizer_exBERT.decode(tokenizer_exBERT.encode(text), skip_special_tokens=True)

'Myxomatous degen eration of the mitr al valv e.'

**The tokenizer with the new tokens (whole words only) succeeded in tokenizing all the words, including Myxomatous, degeneration and mitral, as the `tokenize` output above shows!**

**This means that it is essential to add only whole words, not subwords, as new tokens to an existing subword tokenizer like WordPiece!**

In [63]:
# tokenization of the test words (this cell calls tokenizer_xBERT1 from Test 2)
print(tokenizer_xBERT1.tokenize('myxomatous'))
print(tokenizer_xBERT1.tokenize('degeneration'))
print(tokenizer_xBERT1.tokenize('mitral'))

['myxomat', 'ou', '##s']
['degen', 'era', '##tion']
['mitr', 'al']


# Let's check the impact of our enriched tokenizer

Let's use a text about myxomatous degeneration taken from a biomedical journal article (see the source URL in the cell below).

In [73]:
# source: https://heart.bmj.com/content/88/suppl_4/iv20
text = 'Degenerative mitral valve disease is responsible for the syndromes of billowing mitral leaflet, mitral valve prolapse (MVP), floppy mitral valve, and flail leaflet. The pathology of these is mainly caused by myxomatous infiltration and fibroelastic deficiency.\
In the 1960s, Reid7 and Barlow and colleagues1 proposed that mid to late systolic clicks and apical late systolic murmurs were of mitral valvar origin.\
 This origin was further documented by intracardiac phonocardiography.\
 Criley and colleagues used “mitral valve prolapse” to describe posterior mitral leaflet motion in systole.9 Since then, MVP has remained a diagnosis of sustained interest and controversy."'

In [74]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model_name = "bert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(model_name)



In [75]:
tokens = tokenizer.tokenize(text)
print('number of tokens by the original BERT tokenizer:', len(tokens))

tokens = tokenizer_exBERT.tokenize(text)
print('number of tokens by the enriched tokenizer:', len(tokens))

number of tokens by the original BERT tokenizer: 168
number of tokens by the enriched tokenizer: 178


**The enriched tokenizer needs more tokens (178) than the original tokenizer (168). One reason is that the text is taken from a biomedical journal and contains many words that were not present in the two Wikipedia pages. Note also that the cell above measures `tokenizer_exBERT` (the tokenizer from Test 2, step 5), not `tokenizer_exBERT2` from Test 3. In any case, the difference is small.**
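To compare tokenizers on a text, it can help to look at the average number of tokens per word as well as the raw token count. The helper below is a minimal sketch that accepts any callable returning a list of tokens. The two stand-in tokenizers are assumptions used only to keep the example self-contained; in this notebook one would pass `tokenizer.tokenize` or `tokenizer_exBERT2.tokenize` instead.

```python
def tokenization_stats(tokenize, text):
    """Return (number of tokens, average tokens per whitespace-separated word)."""
    n_tokens = len(tokenize(text))
    n_words = len(text.split())
    return n_tokens, n_tokens / n_words


# Stand-in tokenizers (assumptions): any callable returning a list of tokens works.
def whole_words(text):
    return text.split()


def three_char_chunks(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


sample = "myxomatous degeneration of the mitral valve"
print(tokenization_stats(whole_words, sample))
print(tokenization_stats(three_char_chunks, sample))
```

A ratio close to 1 means that most words are kept whole; a higher ratio means that more words are broken into pieces.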