This notebook contains the code for creating a compressed version of mT5-base that keeps only the embeddings for the most frequently used Danish and English tokens.

Code adapted from https://gist.github.com/avidale/44cd35bfcdaf8bedf51d97c468cc8001.

In [None]:
# installing modules
!pip install torch transformers sentencepiece pandas
!sudo apt install git-lfs
!git lfs install

In [None]:
# importing modules
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
import torch

# Removing unused vocabulary

In [None]:
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-base')

In [None]:
print(tokenizer.vocab_size)

The mT5 tokeniser has a vocabulary of about 250K tokens, each with its own embedding in the model.

In [None]:
def msize(m):
    return sum(p.numel() for p in m.parameters())

original_size = msize(model)
print(msize(model))

The model has 582M parameters. 

In [None]:
print(msize(model.shared) / msize(model))
print(msize(model.lm_head) / msize(model))

Input and output embeddings constitute 66% of the model.
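
As a rough cross-check of the 66% figure, the share can be recomputed from the matrix shapes. The sketch below assumes that each embedding matrix has 250,112 rows and a hidden size of 768. These values are not stated in this notebook and should be confirmed with `model.shared.weight.shape`. The 582M total is taken from the cell above, and `embedding_share` is a hypothetical helper.

```python
# Assumed shapes (not stated in this notebook): 250,112 rows, hidden size 768.
EMBEDDING_ROWS = 250_112
HIDDEN_SIZE = 768
TOTAL_PARAMS = 582_000_000  # rounded total reported above


def embedding_share(rows, hidden, total, n_matrices=2):
    """Fraction of all parameters held by n_matrices embedding tables."""
    return n_matrices * rows * hidden / total


# Two tables: the input embeddings (model.shared) and the output head (model.lm_head).
share = embedding_share(EMBEDDING_ROWS, HIDDEN_SIZE, TOTAL_PARAMS)
print(f"{share:.0%}")  # prints 66%
```

Under these assumptions the two matrices hold about 384M of the 582M parameters, which is why pruning the vocabulary shrinks the model so much.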

# Determine the new tokens

Both English and Danish corpora are downloaded from the Leipzig Corpora Collection (https://wortschatz.uni-leipzig.de/en/download). The corpora contain 1 million sentences each.

In [None]:
!tar -xsvf dan-dk_web-public_2019_1M.tar.gz

In [None]:
!tar -xsvf eng-com_web-public_2018_1M.tar.gz

In [None]:
import csv
import pandas as pd

pd.options.display.max_colwidth = 300
fname = 'dan-dk_web-public_2019_1M/dan-dk_web-public_2019_1M-sentences.txt'
df_da = pd.read_csv(fname, sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_da.columns = ['idx', 'text']
df_da.sample(5)

In [None]:
fname = 'eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-sentences.txt'
df_en = pd.read_csv(fname, sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_en.columns = ['idx', 'text']
df_en.sample(5)

In [None]:
from collections import Counter
from tqdm.auto import tqdm, trange

cnt_da = Counter()
for text in tqdm(df_da.text):
    cnt_da.update(tokenizer.encode(text))

cnt_en = Counter()
for text in tqdm(df_en.text):
    cnt_en.update(tokenizer.encode(text))

In [None]:
# The counters hold tokens (subword pieces), not whole words.
tokens_da = set(cnt_da)
tokens_en = set(cnt_en)
print('Distinct tokens in the Danish corpus:', len(tokens_da))
print('Distinct tokens in the English corpus:', len(tokens_en))
common = len(tokens_da & tokens_en)
print('Tokens common to both corpora:', common)
print('Percentage of Danish tokens that also occur in the English corpus:', common / len(tokens_da) * 100)

diff_en = len(tokens_en - tokens_da)
print('Tokens that occur only in the English corpus:', diff_en)
diff_da = len(tokens_da - tokens_en)
print('Tokens that occur only in the Danish corpus:', diff_da)

total = common + diff_en + diff_da
print('Total number of distinct tokens across both corpora:', total)

print('Percentage of the full tokeniser vocabulary:', total / tokenizer.vocab_size * 100)

In [None]:
print('Danish top tokens')
for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_da.most_common(top)) / sum(cnt_da.values()))
print('English top tokens')
for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_en.most_common(top)) / sum(cnt_en.values()))

For both English and Danish, the top 10K tokens account for about 95% of all token occurrences in the corpus, and the top 20K tokens for about 98%.
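
The coverage figures are shares of token occurrences: the counts of the `top` most frequent tokens divided by the total count. A minimal sketch of that calculation on made-up toy counts (`coverage` and `toy_counts` are illustrative names, not part of the notebook):

```python
from collections import Counter


def coverage(counts, top):
    """Share of all token occurrences covered by the `top` most frequent tokens."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(top)) / total


# Toy counts standing in for cnt_da / cnt_en (token id -> number of occurrences).
toy_counts = Counter({1: 50, 2: 30, 3: 15, 4: 4, 5: 1})
for top in (1, 3, 5):
    print(top, coverage(toy_counts, top))
```

In the toy data, three of the five tokens already cover 95% of the occurrences, which mirrors the long-tailed distribution seen in the real corpora.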

In [None]:
old_voc = tokenizer.get_vocab()
old_inv_voc = {v: k for k, v in old_voc.items()}

The 30 most frequent tokens in each language are mostly function words or prefixes.

In [None]:
print('Danish:', tokenizer.convert_ids_to_tokens([k for k, v in cnt_da.most_common(30)]))
print('English:', tokenizer.convert_ids_to_tokens([k for k, v in cnt_en.most_common(30)]))

Composition of the new vocabulary:
* The first 1K tokens of the original vocabulary
* Top 10K of the English vocabulary
* The most frequent Danish tokens (taken from the top 25K) that are not already included, added until the total reaches 29,900
* 100 special tokens that T5 uses


In [None]:
new_tokens = set(range(1000))
for i, (k, v) in enumerate(cnt_en.most_common(10_000)):
    if k not in new_tokens:
        new_tokens.add(k)
for i, (k, v) in enumerate(cnt_da.most_common(25_000)):
    if len(new_tokens) == 29_900:
        print(i, 'Danish tokens are included')
        break
    if k not in new_tokens:
        new_tokens.add(k)

for t in range(tokenizer.vocab_size - 100, tokenizer.vocab_size):
    new_tokens.add(t)

print(len(new_tokens))
kept_ids = sorted(new_tokens)

In [None]:
len(kept_ids) / tokenizer.vocab_size

The new vocabulary is only 12% of the original one. 

### Update the embeddings

In [None]:
new_size = len(kept_ids)
new_emb = torch.nn.Embedding(new_size, model.shared.embedding_dim)
new_head = torch.nn.Linear(in_features=model.lm_head.in_features, out_features=new_size, bias=False)

In [None]:
for new_id, old_id in enumerate(kept_ids):
    new_emb.weight.data[new_id] = model.shared.weight.data[old_id]
    new_head.weight.data[new_id] = model.lm_head.weight.data[old_id]
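
The loop above copies the row at each kept old id into the row at its new id, where the new id is simply the position of the old id in the sorted `kept_ids` list. A minimal sketch of that remapping on plain Python lists, using made-up toy values (`toy_rows`, `toy_kept_ids` and `toy_old_to_new` are illustrative names, not part of the notebook):

```python
# Toy "embedding matrix": one row per old token id.
toy_rows = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
toy_kept_ids = [0, 2, 4]  # sorted old ids that survive pruning

# New id = position of the old id in the kept list.
toy_new_rows = [list(toy_rows[old_id]) for old_id in toy_kept_ids]
toy_old_to_new = {old_id: new_id for new_id, old_id in enumerate(toy_kept_ids)}

# Every kept row must be found unchanged at its new position.
for old_id, new_id in toy_old_to_new.items():
    assert toy_new_rows[new_id] == toy_rows[old_id]

print(toy_old_to_new)  # {0: 0, 2: 1, 4: 2}
```

Because the kept ids are sorted, the relative order of the surviving tokens is preserved, and the same list can later be used to select the matching tokenizer pieces.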

In [None]:
model.shared.weight = new_emb.weight
model.lm_head.weight = new_head.weight

In [None]:
print(msize(model), msize(model) / original_size)

The new model has 244M parameters, 42% of the original size.
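
The reduction can be estimated by hand from the number of embedding rows removed. The sketch below assumes 250,112 original rows per matrix, a hidden size of 768, and a pruned vocabulary of 30,000 tokens. The first two values are not stated in this notebook and should be checked against the model's weight shapes. The 582M total is the rounded figure reported earlier.

```python
OLD_TOTAL = 582_000_000  # rounded parameter count of the original model
OLD_ROWS = 250_112       # assumed rows per embedding matrix before pruning
NEW_ROWS = 30_000        # assumed size of the pruned vocabulary
HIDDEN = 768             # assumed hidden size

# Rows are removed from both the input embeddings and the output head.
removed = 2 * (OLD_ROWS - NEW_ROWS) * HIDDEN
new_total = OLD_TOTAL - removed

print(f"{new_total / 1e6:.0f}M parameters, {new_total / OLD_TOTAL:.0%} of the original")
# prints: 244M parameters, 42% of the original
```

The estimate matches the measured values, which suggests that the saving comes entirely from the two embedding matrices.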

### Update the tokenizer

T5 uses the SentencePiece tokenizer, which is implemented in C++ and is opaque to Python.

We can, however, download the Protobuf schema of the SentencePiece model and use it to load and edit the model in Python.

In [None]:
!wget https://raw.githubusercontent.com/google/sentencepiece/master/src/sentencepiece_model.proto

We compile the protobuf description of the sentencepiece model in order to be able to modify it. 

In [None]:
!sudo apt install protobuf-compiler

In [None]:
!protoc --python_out=. sentencepiece_model.proto

Now we can serialize the model used by the current tokenizer and open it as a protobuf class. 

In [None]:
import sentencepiece_model_pb2 as spmp
smp = tokenizer.sp_model.serialized_model_proto()
m = spmp.ModelProto()
m.ParseFromString(smp)

print('the loaded model has pieces:', len(m.pieces))
new_pieces = [m.pieces[idx] for idx in kept_ids]
print('the new pieces:', len(new_pieces))

# replace the content of the first 30K pieces
for i, p in enumerate(new_pieces):
    m.pieces[i].piece = p.piece
    m.pieces[i].score = p.score
    m.pieces[i].type = p.type

# drop the remaining pieces
n = len(new_pieces)
for i in trange(len(m.pieces) - n):
    m.pieces.pop(len(m.pieces) - 1)

print(len(m.pieces))
with open('new_sp.model', 'wb') as f:
    f.write(m.SerializeToString())

In [None]:
new_tokenizer = T5Tokenizer('new_sp.model', extra_ids=0)
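
One way to sanity-check the pruned tokenizer is to confirm that, for text made up only of kept tokens, each new id maps back to the original id through `kept_ids`. The sketch below illustrates the idea with toy whitespace tokenizers. `check_consistency` and all `toy_*` names are hypothetical, and the check assumes that removing pieces does not change how the remaining text is segmented, which should be verified on real data.

```python
def check_consistency(old_encode, new_encode, kept, texts):
    """Return the texts whose new ids do not map back to the old ids.

    Texts whose old encoding uses a pruned token are skipped, because their
    ids are expected to differ after pruning.
    """
    kept_set = set(kept)
    mismatches = []
    for text in texts:
        old_ids = old_encode(text)
        if not all(i in kept_set for i in old_ids):
            continue
        mapped = [kept[i] for i in new_encode(text)]
        if mapped != old_ids:
            mismatches.append(text)
    return mismatches


# Toy whitespace "tokenizers" standing in for the real encode methods.
toy_old_vocab = {"hej": 0, "bonjour": 1, "hello": 2, "world": 3}
toy_kept = [0, 2, 3]  # "bonjour" is pruned
toy_new_vocab = {"hej": 0, "hello": 1, "world": 2}


def toy_old_encode(text):
    return [toy_old_vocab[w] for w in text.split()]


def toy_new_encode(text):
    return [toy_new_vocab[w] for w in text.split()]


toy_texts = ["hej world", "hello world", "bonjour world"]
print(check_consistency(toy_old_encode, toy_new_encode, toy_kept, toy_texts))  # []
```

With the real objects, the call would presumably be `check_consistency(tokenizer.encode, new_tokenizer.encode, kept_ids, texts)` for some sample of sentences `texts`; an empty result means no mismatch was found.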

### Save the model

In [None]:
model.config.__dict__['vocab_size'] = new_size
model.config.__dict__['_name_or_path'] = 'sarakolding/daT5-base'
model.config

In [None]:
new_tokenizer.save_pretrained('daT5-base')
model.save_pretrained('daT5-base')

In [None]:
!ls daT5-base -alsh

The updated model and tokeniser can be loaded from the Hugging Face Hub with the following commands:

In [None]:
tokenizer = T5Tokenizer.from_pretrained('sarakolding/daT5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('sarakolding/daT5-base')