## mT5 model ukrainization

The aim is to compress the mT5-base model by keeping only the tokens, and their embeddings, that are needed for Ukrainian. We also keep the 10K most frequent English tokens and the top 1K tokens of the original vocabulary.

The idea and most of the code come from [this Medium article](https://medium.com/towards-data-science/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90).

Results: 
- 582M params -> 211M params
- 250K tokens -> 8900 tokens
- 2.2GB size model -> 0.8GB size model

Still, we won't lose much performance if we use only Ukrainian in our task. The model is meant for training on generated synthetic data and fine-tuning for the grammatical error correction (GEC) task.

### Things we need

In [1]:
!pip install transformers sentencepiece torch pandas protobuf




In [2]:
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
import torch


In [3]:
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained('google/mt5-base')

The original tokenizer contains about 250K tokens, and the model has 582M parameters.

In [4]:
def msize(m):
    """Return the total number of parameters in model `m`."""
    return sum(p.numel() for p in m.parameters())

original_size = msize(model)

print(original_size)
print(tokenizer.vocab_size)

582401280
250100


#### Ukrainian corpus for building our new vocabulary

Next, we'll use the [Ukrainian 2019 web corpus](https://wortschatz.uni-leipzig.de/en/download/Ukrainian) of 1M sentences (randomly scraped from web pages), provided by the University of Leipzig.

In [5]:
import tarfile

fname = 'ukr-ua_web_2019_1M/ukr-ua_web_2019_1M-sentences.txt'

# Extract only the sentences file; the context manager closes the archive afterwards
with tarfile.open('data/ukr-ua_web_2019_1M.tar.gz') as archive:
    archive.extract(fname, 'data/')

In [6]:
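import pandas as pd
import csv

# Tab-separated file, one "id<TAB>sentence" pair per line; QUOTE_NONE keeps quote characters literal.
# header=None is not passed, so pandas consumes the first line as the header and 999,999 rows remain.
df_ua = pd.read_csv('data/' + fname, sep='\t', quoting=csv.QUOTE_NONE)
df_ua.columns = ['idx', 'text']
df_ua.sample(5)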

|  | idx | text |
|---|---|---|
| 764133 | 764135 | Сьогодні вранці в с. Пядики Коломийського райо... |
| 66617 | 66619 | Бо це по суті є найлогічніше рішення. |
| 95362 | 95364 | Ви і раніше контролювали, недовіряли їй? |
| 578349 | 578351 | Перевіримо це, розглянувши головні мислимі мож... |
| 6057 | 6059 | Smart Solutions у стислі терміни сформує або н... |


#### English corpus 

We'll also use the English web [corpus](https://wortschatz.uni-leipzig.de/en/download/English) from the same source as the Ukrainian one.

In [7]:
fname = 'eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-sentences.txt'

# Extract only the sentences file; the context manager closes the archive afterwards
with tarfile.open('data/eng-com_web-public_2018_1M.tar.gz') as archive:
    archive.extract(fname, 'data/')

In [8]:
df_en = pd.read_csv('data/' + fname, sep='\t', quoting=csv.QUOTE_NONE)
df_en.columns = ['idx', 'text']
df_en.sample(5)

|  | idx | text |
|---|---|---|
| 771355 | 771357 | The place is a little small but it works, we h... |
| 393406 | 393408 | “In the end, our sensibilities are on the same... |
| 283109 | 283111 | Hosted by 451 Research, the Hosting & Cloud Tr... |
| 371022 | 371024 | In addition, the ceiling in the VIP lounge is ... |
| 876609 | 876611 | Transderma M Moisturizing Serum, - Truth In Ag... |


### Determine new vocabulary

We tokenize both corpora, count the token frequencies, and keep only the tokens that occur frequently enough.

Count the tokens that the current model uses for representing the sentences.

In [9]:
from collections import Counter
from tqdm.auto import tqdm, trange

cnt_ua = Counter()
for text in tqdm(df_ua.text):
    cnt_ua.update(tokenizer.encode(text))


In [10]:
cnt_en = Counter()
for text in tqdm(df_en.text):
    cnt_en.update(tokenizer.encode(text))


The tokens used in the Ukrainian corpus make up 23% of the mT5 tokenizer vocabulary; for the English corpus the figure is 27%.

About 55% of the tokens used in the Ukrainian corpus also appear in the English one. The original article suggests a reason: Russian (in our case, Ukrainian) texts occasionally contain English words or other words written in Latin script.

In [11]:
print(len(cnt_ua), len(cnt_ua)/tokenizer.vocab_size)
print(len(cnt_en), len(cnt_en)/tokenizer.vocab_size)
common = len(set(cnt_ua.keys()).intersection(set(cnt_en.keys())))
print(common, common / len(cnt_ua))

58168 0.23257896841263495
67920 0.2715713714514194
31702 0.5450075642965204
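The overlap figure is computed relative to the Ukrainian token set, so it is not symmetric. A minimal sketch on toy counters illustrates this; the `overlap_share` helper and the toy token ids are hypothetical and not part of the original notebook:

```python
from collections import Counter

def overlap_share(base: Counter, other: Counter) -> float:
    """Share of the distinct tokens in `base` that also occur in `other`."""
    common = base.keys() & other.keys()
    return len(common) / len(base)

# Toy token-id counts (hypothetical data)
toy_ua = Counter({10: 5, 11: 3, 12: 1, 13: 1})
toy_en = Counter({12: 7, 13: 2, 14: 4, 15: 1, 16: 1})

print(overlap_share(toy_ua, toy_en))  # 0.5
print(overlap_share(toy_en, toy_ua))  # 0.4
```

With the counts printed above, the same 31,702 shared tokens are about 47% of the English token set (31702 / 67920).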


The top 10K tokens cover about 98% of all token occurrences in the Ukrainian corpus and about 95% in the English one; the top 20K cover about 99.7% and 98%, respectively.

In [12]:
print('ua')
for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_ua.most_common(top)) / sum(cnt_ua.values()))
print('en')
for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_en.most_common(top)) / sum(cnt_en.values()))

ua
10000 0.9807354043937903
20000 0.996521760465981
30000 0.9986511122211118
en
10000 0.9531899579723471
20000 0.984080976549739
30000 0.9937869235026024
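The coverage numbers above are the share of all token occurrences accounted for by the top-N most frequent tokens. The same computation can be sketched on a toy counter; the `coverage` helper and the counts below are illustrative only:

```python
from collections import Counter

def coverage(counter: Counter, top: int) -> float:
    """Share of all token occurrences covered by the `top` most frequent tokens."""
    total = sum(counter.values())
    covered = sum(count for _, count in counter.most_common(top))
    return covered / total

# Toy token-id counts (hypothetical data): 100 occurrences in total
toy = Counter({1: 50, 2: 30, 3: 15, 4: 5})

print(coverage(toy, 2))  # 0.8
```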


The most common tokens are mostly prefixes, punctuation, or short words (і, у, й):

In [13]:
print(tokenizer.convert_ids_to_tokens([k for k, v in cnt_ua.most_common(30)]))
print(tokenizer.convert_ids_to_tokens([k for k, v in cnt_en.most_common(30)]))

['▁', ',', '</s>', '.', 'і', '▁в', 'у', 'и', '▁на', '▁з', 'а', 'ів', '▁у', '▁за', 'ї', '▁та', '-', '▁до', '▁не', '▁що', 'ого', '▁по', '▁від', 'я', '▁як', 'о', 'их', 'е', 'й', '▁«']
['▁', '</s>', '.', '▁the', ',', 's', '▁to', '▁and', 'a', '▁of', '▁in', '▁is', '▁I', '’', '▁that', 'ed', '▁for', '-', 'ing', "'", '▁you', '▁it', '▁with', '▁on', 'ly', 'y', '▁be', '▁The', '▁as', '▁are']


The new vocabulary is composed as follows:
- The first 1K tokens of the original tokenizer
- The top 10K tokens of the English vocabulary
- About the top 20K tokens of the Ukrainian vocabulary, added until the total reaches 29,900
- The 100 special tokens that T5 uses

In [14]:
print(tokenizer.convert_ids_to_tokens([0,1,2,3,4,5]))

['<pad>', '</s>', '<unk>', '<0x00>', '<0x01>', '<0x02>']


In [15]:
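# Start with the first 1K ids of the original vocabulary (special and byte tokens included)
new_tokens = set(range(1000))
# Add the 10K most frequent English tokens (the set ignores duplicates)
for k, v in cnt_en.most_common(10_000):
    new_tokens.add(k)
# Add Ukrainian tokens by frequency until the vocabulary reaches 29,900 ids
for i, (k, v) in enumerate(cnt_ua.most_common(25_000)):
    if len(new_tokens) == 29_900:
        print(i, 'Ukrainian tokens are included')
        break
    new_tokens.add(k)

# Add the last 100 ids, which hold the special tokens that T5 uses
for t in range(tokenizer.vocab_size - 100, tokenizer.vocab_size):
    new_tokens.add(t)

print(len(new_tokens))
kept_ids = sorted(new_tokens)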

20919 Ukrainian tokens are included
30000
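The selection logic can be tried on a toy vocabulary. This is a sketch under stated assumptions: the `build_kept_ids` function, its parameter names, and the toy counters are hypothetical, and unlike the notebook cell it does not cap the number of primary-language candidates at 25,000.

```python
from collections import Counter

def build_kept_ids(vocab_size, first_n, secondary, secondary_top,
                   primary, special_n, target_size):
    """Pick old token ids to keep, filling a fixed budget with primary-language tokens."""
    kept = set(range(first_n))                                  # first ids of the old vocab
    kept.update(tok for tok, _ in secondary.most_common(secondary_top))
    budget = target_size - special_n                            # leave room for special ids
    for tok, _ in primary.most_common():                        # most frequent first
        if len(kept) >= budget:
            break
        kept.add(tok)
    kept.update(range(vocab_size - special_n, vocab_size))      # trailing special ids
    return sorted(kept)

# Toy data (hypothetical): a 20-token vocabulary whose last 2 ids are special
toy_en = Counter({5: 9, 6: 8, 7: 1})
toy_ua = Counter({6: 10, 8: 9, 9: 8, 10: 7, 11: 6, 12: 1})

kept = build_kept_ids(vocab_size=20, first_n=3, secondary=toy_en, secondary_top=2,
                      primary=toy_ua, special_n=2, target_size=10)
print(kept)  # [0, 1, 2, 5, 6, 8, 9, 10, 18, 19]
```

Sorting the kept ids preserves the relative order of the original vocabulary, which keeps the special tokens at the start and at the end of the new one.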


The new vocabulary is only 12% of the original one.

In [16]:
len(kept_ids) / tokenizer.vocab_size

0.11995201919232307

### Update the embeddings

In [17]:
import numpy as np

In [18]:
new_size = len(kept_ids)
new_emb = torch.nn.Embedding(new_size, model.shared.embedding_dim)
new_head = torch.nn.Linear(in_features=model.lm_head.in_features, out_features=new_size, bias=False)

In [19]:
for new_id, old_id in enumerate(kept_ids):
    new_emb.weight.data[new_id] = model.shared.weight.data[old_id]
    new_head.weight.data[new_id] = model.lm_head.weight.data[old_id]

In [20]:
model.shared.weight = new_emb.weight
model.lm_head.weight = new_head.weight
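The row copy above relies on one rule: row `i` of each new matrix is the old row of the `i`-th kept id. A pure-Python sketch with a toy "embedding matrix" (lists instead of tensors; all values are made up for illustration) shows the resulting mapping:

```python
# Toy "embedding matrix": one row per old token id (hypothetical values)
old_rows = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]]
toy_kept_ids = [0, 2, 4]  # sorted old ids that survive pruning

# Row i of the new matrix is the old row of the i-th kept id
new_rows = [old_rows[old_id] for old_id in toy_kept_ids]

# The same enumeration yields the old-id -> new-id mapping
# that the pruned tokenizer has to follow
old_to_new = {old_id: new_id for new_id, old_id in enumerate(toy_kept_ids)}

print(new_rows)    # [[0.0, 0.0], [0.2, 0.2], [0.4, 0.4]]
print(old_to_new)  # {0: 0, 2: 1, 4: 2}
```

Because the tokenizer pieces are later reordered with the same `kept_ids` list, token ids produced by the new tokenizer index the correct rows.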

The new model has 244M parameters, 42% of the original size.

In [21]:
print(msize(model), msize(model) / original_size)

244309248 0.4194861110195362
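The parameter count can be checked by hand, since only the two vocabulary-sized matrices change. The sketch below assumes that the original embedding matrices have 250,112 rows (slightly more than the tokenizer's 250,100 tokens); `d_model = 768` and the untied `lm_head` come from the model config printed when the model is saved.

```python
d_model = 768               # hidden size from the model config
old_vocab_rows = 250_112    # assumption: rows in the original embedding matrices
new_vocab_rows = 30_000     # size of the pruned vocabulary
n_matrices = 2              # input embedding + untied lm_head

original_total = 582_401_280  # printed by msize(model) before pruning

old_embedding_params = n_matrices * old_vocab_rows * d_model
non_embedding_params = original_total - old_embedding_params
new_total = non_embedding_params + n_matrices * new_vocab_rows * d_model

print(non_embedding_params)  # 198229248
print(new_total)             # 244309248
```

Under that assumption the result matches the printed size exactly, which suggests that everything outside the embedding and output matrices (about 198M parameters) is left untouched.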


### Update the tokenizer

From the original notebook:
> T5 uses Sentencepiece tokenizer, which is implemented in C and is opaque to Python. Fortunately, we can download its model and deploy it into Python using its Protobuf representation.

https://github.com/google/sentencepiece/issues/121



In [22]:
!wget https://raw.githubusercontent.com/google/sentencepiece/master/src/sentencepiece_model.proto




Compile the protobuf description of the sentencepiece model in order to be able to modify it.

In [23]:
! protoc --python_out=. sentencepiece_model.proto

Serialize the model used by the current tokenizer and open it as a protobuf class.

In [24]:
import sentencepiece_model_pb2 as spmp
smp = tokenizer.sp_model.serialized_model_proto()
m = spmp.ModelProto()
m.ParseFromString(smp)

print('the loaded model has pieces:', len(m.pieces))
new_pieces = [m.pieces[idx] for idx in kept_ids]
print('the new pieces:', len(new_pieces))

# replace the content of the first 30K pieces
for i, p in enumerate(new_pieces):
    m.pieces[i].piece = p.piece
    m.pieces[i].score = p.score
    m.pieces[i].type = p.type

# drop the remaining pieces
n = len(new_pieces)
for i in trange(len(m.pieces) - n):
    m.pieces.pop(len(m.pieces) - 1)

print(len(m.pieces))
with open('new_sp.model', 'wb') as f:
    f.write(m.SerializeToString())

the loaded model has pieces: 250100
the new pieces: 30000



30000


In [25]:
new_tokenizer = MT5Tokenizer('new_sp.model', extra_ids=0)

### Save the model

In [26]:
model.config.vocab_size = new_size
model.config._name_or_path = 'kravchenko/uk-t5-base'
model.config

MT5Config {
  "_name_or_path": "kravchenko/uk-t5-base",
  "architectures": [
    "MT5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.17.0",
  "use_cache": true,
  "vocab_size": 30000
}

In [31]:
model.config.use_cache = False

In [32]:
model.config

MT5Config {
  "_name_or_path": "kravchenko/uk-t5-base",
  "architectures": [
    "MT5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.17.0",
  "use_cache": false,
  "vocab_size": 30000
}

In [27]:
new_tokenizer.save_pretrained('uk-t5-base_local')
model.save_pretrained('uk-t5-base_local')

### Load & test new model

In [28]:
model1 = MT5ForConditionalGeneration.from_pretrained('uk-t5-base_local')
tokenizer1 = MT5Tokenizer.from_pretrained('uk-t5-base_local')

One task our model can "somehow" solve is filling in gaps marked with `<extra_id_N>` tokens. However, we'll need to fine-tune this model in the future.

In [29]:
inputs = tokenizer1('Порівнюючи <extra_id_0> відповідним періодом минулого <extra_id_1> покращилася інвестиційна привабливість промислового комплексу району.', return_tensors='pt')
with torch.no_grad():
    hypotheses = model1.generate(
        **inputs, 
        do_sample=True, top_p=0.95, 
        num_return_sequences=3, 
        repetition_penalty=2.5,
        max_length=32,
    )
for h in hypotheses:
    print(tokenizer1.decode(h))

<pad> <extra_id_0> з <extra_id_1> року значно <extra_id_2> сезону, <extra_id_3> до 2020</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
<pad> <extra_id_0> з <extra_id_1> року, в районі <extra_id_2> доби <extra_id_3>, <extra_id_4> року <extra_id_5> пропонується подавати...</s>
<pad> <extra_id_0> з <extra_id_1> року, <extra_id_2> р-ну <extra_id_3> та минулого</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
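Each hypothesis lists the predicted spans after their sentinel tokens, so reading it requires substituting the spans back into the input. A minimal sketch of that substitution follows; the `fill_gaps` helper is hypothetical (not part of `transformers`), and the example strings are simplified English stand-ins for the output above:

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")

def fill_gaps(source: str, generated: str) -> str:
    """Replace each sentinel in `source` with the span generated for it."""
    # Drop padding and end-of-sequence markers from the decoded output
    cleaned = generated.replace("<pad>", "").replace("</s>", "")
    # Splitting on a pattern with one group gives [prefix, id0, span0, id1, span1, ...]
    parts = SENTINEL.split(cleaned)
    spans = {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
    # Sentinels without a generated span are left untouched
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), source)

source = "Compared <extra_id_0> last <extra_id_1> things improved."
generated = "<pad> <extra_id_0> with <extra_id_1> year</s> <pad>"

filled = fill_gaps(source, generated)
print(filled)  # Compared with last year things improved.
```

Extra sentinels that the model emits beyond those present in the input, as in the hypotheses above, are simply ignored by this substitution.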
