<a href="https://colab.research.google.com/gist/avidale/7bc6350f26196918bf339c01261f5c60/rubert-tiny.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal is to build a small English-Russian BERT with informative, compact sentence representations:
* Initialize from scratch a small BERT (3 layers, 45 MB of weights, 12M parameters, 9M of which are embeddings) with the English-Russian subset of the bert multilingual vocabulary (30K tokens)
* The embeddings are also partially initialized from bert multilingual
* Training uses several losses:
    * Distill the output token distribution of the regular multilingual BERT
    * Minimize the whole-word MLM loss
    * Minimize the translation ranking loss, as in LaBSE
    * Minimize the perplexity of a T5 decoder that is conditioned on the CLS token and reproduces the text
    * Distill the CLS tokens by pulling different projections of them toward the embeddings of RuBERT, LaBSE and Laser
* All of this is trained on 2.5M parallel English-Russian sentences collected from several corpora.
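
The size figures in the list above can be sanity-checked with simple arithmetic. The sketch below is only an illustration: it assumes the vocabulary size of 29,564 that the notebook prints after filtering, the `hidden_size=312` used in the model config further down, the rounded "12M parameters" figure from the list, and 32-bit float weights.

```python
# Rough size check; every input is a figure quoted elsewhere in this notebook.
vocab_size = 29_564          # printed after the vocabulary is filtered
hidden_size = 312            # hidden_size passed to BertConfig
total_params = 12_000_000    # "12M parameters", a rounded figure

# the word-embedding matrix has one row per token and one column per hidden unit
word_embedding_params = vocab_size * hidden_size

# assumption: weights are stored as float32 (4 bytes each)
size_mb = total_params * 4 / 2**20

print(f"word embeddings: {word_embedding_params / 1e6:.1f}M parameters")
print(f"approximate checkpoint size: {size_mb:.0f} MB")
```

The embedding matrix alone accounts for roughly three quarters of the parameters, which is why trimming the vocabulary matters so much for a model of this size.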

Inspired by:
* TinyBERT by Huawei https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D
* LaBSE by Google https://arxiv.org/pdf/2007.01852.pdf

### Dependencies

I mount Google Drive because part of the training data is stored there, and I regularly save the model weights there as well.

In [None]:
from google.colab import drive
drive.mount('/gd')

Mounted at /gd


In [None]:
!pip install transformers sentencepiece datasets natasha laserembeddings

In [None]:
!pip install "tensorflow_text>=2.0.0"

In [None]:
!python -m laserembeddings download-models

Downloading models into /usr/local/lib/python3.7/dist-packages/laserembeddings/data

✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/bilstm.93langs.2018-12-26.pt    

✨ You're all set!


### Prepare

In [None]:
from transformers import BertForPreTraining, BertTokenizerFast, BertConfig

In [None]:
base_model = 'bert-base-multilingual-cased'

In [None]:
tok = BertTokenizerFast.from_pretrained(base_model)

The corpus is taken from https://translate.yandex.ru/corpus

In [None]:
corpus_path = 'C:/Users/david/Google Диск/datasets/nlp/1mcorpus/'

In [None]:
import pandas as pd
import csv
df_en = pd.read_csv(corpus_path + 'corpus.en_ru.1m.en', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_ru = pd.read_csv(corpus_path + 'corpus.en_ru.1m.ru', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_en.columns = ['text']
df_ru.columns = ['text']

print(df_ru.shape)
print(df_en.shape)

In [None]:
pd.Series(len(tt) for tt in tok(df_ru.sample(10000).text.tolist())['input_ids']).describe()

In [None]:
pd.Series(len(tt) for tt in tok(df_en.sample(10000).text.tolist())['input_ids']).describe()

### The tokenizer: initialize

In [None]:
from collections import Counter
from tqdm.auto import tqdm, trange

cnt_ru = Counter()
for text in tqdm(df_ru.text):
    cnt_ru.update(tok(text)['input_ids'])
    
cnt_en = Counter()
for text in tqdm(df_en.text):
    cnt_en.update(tok(text)['input_ids'])

In [None]:
print(len(cnt_ru), len(cnt_en))

In [None]:
print(len(sorted(k for k, v in cnt_ru.items() if v >= 5)))
print(len(sorted(k for k, v in cnt_en.items() if v >= 5)))

In [None]:
print(len(sorted(k for k, v in cnt_ru.items() if v >= 10)))
print(len(sorted(k for k, v in cnt_en.items() if v >= 10)))

In [None]:
print(len(sorted(k for k, v in cnt_ru.items() if v >= 100)))
print(len(sorted(k for k, v in cnt_en.items() if v >= 100)))

In [None]:
resulting_vocab = {
    tok.vocab[k] for k in tok.special_tokens_map.values()
}
for k, v in cnt_ru.items():
    if v >= 5 or k <= 3_000:
        resulting_vocab.add(k)
for k, v in cnt_en.items():
    if v >= 100 or k <= 3_000:
        resulting_vocab.add(k)

resulting_vocab = sorted(resulting_vocab)
print(len(resulting_vocab))

29564


In [None]:
NEW_MODEL_NAME = 'tinybert-ru'

In [None]:
tok.save_pretrained(NEW_MODEL_NAME)

('tinybert-ru\\tokenizer_config.json',
 'tinybert-ru\\special_tokens_map.json',
 'tinybert-ru\\vocab.txt',
 'tinybert-ru\\added_tokens.json')

In [None]:
inv_voc = {idx: word for word, idx in tok.vocab.items()}

In [None]:
with open(NEW_MODEL_NAME + '/vocab.txt', 'w', encoding='utf-8') as f:
    for idx in resulting_vocab:
        f.write(inv_voc[idx] + '\n')

### The model: initialize

In [None]:
new_tokenizer = BertTokenizerFast.from_pretrained(NEW_MODEL_NAME)

In [None]:
small_config = BertConfig(
    emb_size=312,
    hidden_size=312,
    intermediate_size=600,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=3,
    vocab_size=new_tokenizer.vocab_size,
)

In [None]:
small_model = BertForPreTraining(small_config)

In [None]:
small_model.save_pretrained(NEW_MODEL_NAME)

Download the weights of the big model to use them for initialization

In [None]:
big_model = BertForPreTraining.from_pretrained(base_model)

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# copy input embeddings
small_model.bert.embeddings.word_embeddings.weight.data = big_model.bert.embeddings.word_embeddings.weight.data[resulting_vocab, :312].clone()
small_model.bert.embeddings.position_embeddings.weight.data = big_model.bert.embeddings.position_embeddings.weight.data[:, :312].clone()
# copy output embeddings
small_model.cls.predictions.decoder.weight.data = big_model.cls.predictions.decoder.weight.data[resulting_vocab, :312].clone()

In [None]:
small_model.save_pretrained(NEW_MODEL_NAME)

### Fine tune the model (multitask and distillation)

In [None]:
NEW_MODEL_NAME = 'tinybert-ru'
NEW_MODEL_NAME = '/gd/MyDrive/models/tinybert-ru'
base_model = 'bert-base-multilingual-cased'
#corpus_path = 'C:/Users/david/Google Диск/datasets/nlp/1mcorpus/'
corpus_path = '/gd/MyDrive/datasets/nlp/1mcorpus/'

In [None]:
from transformers import BertForPreTraining, BertTokenizerFast, BertConfig

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
from tqdm.auto import tqdm, trange

#### Prepare data

In [None]:
import pandas as pd
import csv
df_en = pd.read_csv(corpus_path + 'corpus.en_ru.1m.en', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_ru = pd.read_csv(corpus_path + 'corpus.en_ru.1m.ru', sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_en.columns = ['text']
df_ru.columns = ['text']

print(df_ru.shape)
print(df_en.shape)

(1000000, 1)
(1000000, 1)


We add the opus100 and tatoeba datasets to diversify the examples

https://huggingface.co/datasets/opus100

https://huggingface.co/datasets/tatoeba

In [None]:
from datasets import load_dataset
tatoeba = load_dataset("tatoeba", lang1="en", lang2="ru")

tat_ru = []
tat_en = []
for pair in tqdm(tatoeba['train']):
    tat_ru.append(pair['translation']['ru'])
    tat_en.append(pair['translation']['en'])

df_en = pd.concat([df_en, pd.Series(tat_en, name='text').to_frame()])
df_ru = pd.concat([df_ru, pd.Series(tat_ru, name='text').to_frame()])


Using custom data configuration en-ru-lang1=en,lang2=ru



Downloading and preparing dataset tatoeba/en-ru (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/tatoeba/en-ru-lang1=en,lang2=ru/0.0.0/54423b66d13968ea583b6ac5828448a54b1a69944cabd3368ccd364fdb4f3216...



Dataset tatoeba downloaded and prepared to /root/.cache/huggingface/datasets/tatoeba/en-ru-lang1=en,lang2=ru/0.0.0/54423b66d13968ea583b6ac5828448a54b1a69944cabd3368ccd364fdb4f3216. Subsequent calls will reuse this data.






In [None]:
from datasets import load_dataset
opus100 = load_dataset('opus100', 'en-ru')

tat_ru = []
tat_en = []
for pair in tqdm(opus100['train']):
    tat_ru.append(pair['translation']['ru'])
    tat_en.append(pair['translation']['en'])

df_en = pd.concat([df_en, pd.Series(tat_en, name='text').to_frame()])
df_ru = pd.concat([df_ru, pd.Series(tat_ru, name='text').to_frame()])



Downloading and preparing dataset opus100/en-ru (download: 65.33 MiB, generated: 187.00 MiB, post-processed: Unknown size, total: 252.33 MiB) to /root/.cache/huggingface/datasets/opus100/en-ru/0.0.0/a87abd612d82947c7a2c3991f71095a98f55141af7ad37516dfb31bfa3511ddc...



Dataset opus100 downloaded and prepared to /root/.cache/huggingface/datasets/opus100/en-ru/0.0.0/a87abd612d82947c7a2c3991f71095a98f55141af7ad37516dfb31bfa3511ddc. Subsequent calls will reuse this data.






In [None]:
df_ru.reset_index(drop=True, inplace=True);
df_en.reset_index(drop=True, inplace=True);

In [None]:
df_en.shape, df_ru.shape

((2514195, 1), (2514195, 1))

Hard negative examples (I ended up dropping them)

In [None]:
import math

from collections import Counter, defaultdict
from functools import lru_cache
from typing import List, Dict
from tqdm.auto import tqdm, trange
import re

TOKEN = re.compile(r'([^\W\d]+|\d+|[^\w\s])')


def re_tokenize(text):
    chunks = TOKEN.findall(text)
    return find_substrings(chunks, text)


def find_substrings(chunks, text):
    offset = 0
    for chunk in chunks:
        start = text.find(chunk, offset)
        stop = start + len(chunk)
        yield chunk
        offset = stop


class SimpleSearcher:
    def __init__(self, k=1.5, b=0.75, max_freq=None, df=False):
        self.k = k
        self.b = b
        self.max_freq = max_freq
        self.df = df

    def tokenize(self, text, stem=None):
        return list(re_tokenize(text.lower()))

    def setup(self, texts, owners):
        """ texts: list of texts, owners: list of ids """
        self.texts = texts
        self.owners = owners
        paragraphs = {i: text for i, text in enumerate(texts)}
        self.fit(paragraphs=paragraphs)
        return self

    def fit(self, paragraphs):
        """ paragraphs: dict with ids as keys and texts as values """
        inverse_index = defaultdict(set)
        text_frequencies = Counter()
        text_lengths = Counter()
        wf = Counter()
        for p_id, p in tqdm(paragraphs.items(), total=len(paragraphs)):
            tokens = self.tokenize(p)
            text_lengths[p_id] = len(tokens)
            for w in tokens:
                wf[w] += 1
                # per-(document, word) counts, needed by get_okapi_tf and get_tf_idfs
                text_frequencies[(p_id, w)] += 1
                if self.max_freq and wf[w] >= self.max_freq:
                    inverse_index[w] = set()
                else:
                    inverse_index[w].add(p_id)

        self.inverse_index = inverse_index
        self.text_frequencies = text_frequencies
        self.wf = wf
        self.text_lengths = text_lengths
        self.avg_len = sum(text_lengths.values()) / len(text_lengths)
        self.n_docs = len(paragraphs)
        
    def trim(self, n):
        # remove "stopwords" - words with too many indices
        stopwords = {k for k, v in self.inverse_index.items() if len(v) > n}
        for k in stopwords:
            self.inverse_index[k] = set()

    def get_okapi_idf(self, w):
        n = self.wf[w]
        return math.log(max(1, self.n_docs - n + 0.5) / (n + 0.5))

    def get_okapi_tf(self, w, p_id):
        f = self.text_frequencies[(p_id, w)] if self.df else 1
        return f * (self.k + 1) / (f + self.k * (1 - self.b + self.b * self.text_lengths[p_id] / self.avg_len))

    def get_tf_idfs(self, query):
        words = self.tokenize(query)
        matches = [(w, d) for w in words for d in self.inverse_index[w]]

        tfidfs = Counter()
        for w, d in matches:
            tfidfs[d] += self.text_frequencies[(d, w)] / len(self.inverse_index[w])

        return tfidfs

    def get_okapis(self, query, normalize=False):
        words = self.tokenize(query)
        matches = [(w, d) for w in words for d in self.inverse_index[w]]

        tfidfs = Counter()
        for w, d in matches:
            tfidfs[d] += self.get_okapi_idf(w) * self.get_okapi_tf(w, d)

        return tfidfs

In [None]:
ss = SimpleSearcher(max_freq=10_000)
ss.fit(df_ru.text.sample(100).to_dict())
#ss.fit(df_ru.text.to_dict())





In [None]:
def hard_batch(n=16):
    text = df_ru.text.sample(1).iloc[0]
    indices = [k for k, v in ss.get_okapis(text).most_common(n * 4)]
    indices = df_ru.text[indices].drop_duplicates().index.tolist()[:n]
    if len(indices) < n:
        indices.extend(df_ru.text.sample(n - len(indices)).index)
    return indices

In [None]:
%%time
for _ in df_ru.text[hard_batch(16)]:
    print(_)

Том сел на камень.
Ответ на повторяющиеся мероприятия
– Написать письмо претендуете на должность;
Том в костюме.
Том был голый.
Том делает это гораздо лучше, чем я.
ссылаясь также на свою резолюцию 56/260 от 31 января 2002 года, в которой Ассамблея установила мандат Специального комитета по разработке конвенции против коррупции,
Кроме того, Комитет рекомендует государству-участнику в свете статьи 17 Конвенции принимать все необходимые меры законодательного и иного характера, включая, в частности, проведение просветительских кампаний, направленных на родителей, попечителей и учителей, а также налаживать сотрудничество с провайдерами услуг Интернета в целях защиты детей от пагубного воздействия передаваемой через средства массовой коммуникации и через Интернет информации, в том числе материалы, содержащие сцены насилия и порнографии.
Мы твердо уверены в том, что необходимо безотлагательно создать влиятельную группу видных деятелей, которая займется этим вопросом.
В связи с окончательной 

In [None]:
import gc
gc.collect()

73

#### Setup the model

In [None]:
model = BertForPreTraining.from_pretrained(NEW_MODEL_NAME)
tokenizer = BertTokenizerFast.from_pretrained(NEW_MODEL_NAME)

In [None]:
import gc

def cleanup():
    gc.collect()
    torch.cuda.empty_cache()
    # tf.keras.backend.clear_session()

cleanup()

model.cuda();

In [None]:
from transformers import DataCollatorForWholeWordMask

data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

In [None]:
from tqdm.auto import tqdm, trange

In [None]:
def get_sentence_loss(out_ru, out_en, margin=0.3, mult=1.0):
    """ Calculate translation ranking loss using CLS tokens """
    emb_ru = F.normalize(out_ru) #(out_ru.hidden_states[-1][:, 0])
    emb_en = F.normalize(out_en) #(out_en.hidden_states[-1][:, 0])
    batch_size = emb_ru.shape[0]
    sims = torch.matmul(emb_ru, emb_en.T) - torch.eye(batch_size).cuda() * margin
    loss_fn = torch.nn.CrossEntropyLoss()
    loss = (
        loss_fn(torch.log_softmax(sims, -1) * mult, torch.arange(batch_size).cuda())
        + loss_fn(torch.log_softmax(sims.T, -1) * mult, torch.arange(batch_size).cuda())
    )
    return loss

```
# demonstrate that the task is difficult
emb_ru = F.normalize(pool_ru)
emb_en = F.normalize(pool_en)
batch_size = emb_ru.shape[0]
sims = torch.matmul(emb_ru, emb_en.T) - torch.eye(batch_size).cuda() * margin 
print(torch.softmax(sims, dim=1).diag().mean())  # about 0.09

from matplotlib import pyplot as plt
plt.imshow(torch.softmax(sims, dim=1).detach().cpu().numpy())

emb_ru, emb_en, sims = None, None, None
```
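
To make the ranking objective concrete, here is a minimal pure-Python sketch of an in-batch translation ranking loss with an additive margin, in the spirit of `get_sentence_loss` above. It is an illustration rather than the notebook's implementation: the helper names (`normalize`, `cross_entropy_diag`, `ranking_loss`) and the toy vectors are assumptions, and it works on plain lists instead of tensors.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(sims):
    """Mean cross-entropy where row i has to pick column i."""
    total = 0.0
    for i, row in enumerate(sims):
        log_z = math.log(sum(math.exp(s) for s in row))
        total += log_z - row[i]
    return total / len(sims)

def ranking_loss(emb_a, emb_b, margin=0.3):
    """Bidirectional in-batch ranking loss with an additive margin."""
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    n = len(a)
    # cosine similarities; the margin is subtracted from the true pairs only
    sims = [
        [sum(x * y for x, y in zip(a[i], b[j])) - (margin if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    sims_t = [list(col) for col in zip(*sims)]
    return cross_entropy_diag(sims) + cross_entropy_diag(sims_t)

# toy batch: each "Russian" vector points roughly the same way as its "English" pair
ru = [[1.0, 0.1], [0.1, 1.0]]
en = [[0.9, 0.2], [0.2, 0.9]]
aligned = ranking_loss(ru, en)
shuffled = ranking_loss(ru, en[::-1])   # pairs deliberately mismatched
print(aligned < shuffled)
```

Subtracting the margin from the diagonal makes the true pair look less similar than it is, so the model has to separate translations from the other sentences in the batch by at least that margin to drive the loss down.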

In [None]:
def get_mask_labels(input_ids):
    mask_labels = []
    for e in input_ids:
        ref_tokens = []
        for idx in e:
            token = tokenizer._convert_id_to_token(idx)
            ref_tokens.append(token)
        mask_labels.append(data_collator._whole_word_mask(ref_tokens))
    ml = torch.tensor(mask_labels)
    inputs, labels = data_collator.mask_tokens(input_ids, ml)
    return inputs, labels

In [None]:
def preprocess_inputs(inputs):
    inputs['input_ids'], inputs['labels'] = get_mask_labels(inputs['input_ids'])
    return {k: v.to(model.device) for k, v in inputs.items()}

In [None]:
def get_mlm_loss(inputs, outputs):
    return nn.CrossEntropyLoss()(
        outputs.prediction_logits.view(-1, model.config.vocab_size),
        inputs['labels'].view(-1)
    )

In [None]:
def pool(model, outputs):
    return model.bert.pooler(outputs.hidden_states[-1])

#### MLM distillation loss

In [None]:
big_model = BertForPreTraining.from_pretrained(base_model)
big_tokenizer = BertTokenizerFast.from_pretrained(base_model)

bv = big_tokenizer.vocab
vocab_mapping = sorted(bv[w] for w in tokenizer.vocab)

big_model.cuda();





Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.






In [None]:
def distill(inputs, outputs, temperature=1.0):
    new_inputs = torch.tensor(
        [[vocab_mapping[i] for i in row] for row in inputs['input_ids']]
    ).to(inputs['input_ids'].device)
    with torch.no_grad():
        big_out = big_model(
            input_ids=new_inputs, 
            token_type_ids=inputs['token_type_ids'],
            attention_mask=inputs['attention_mask']
        )
    # the whole batch, all tokens after the [CLS]; the softmax is taken over the vocabulary (last) dimension
    kd_loss = torch.nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(outputs.prediction_logits[:, 1:, :] / temperature, dim=-1), 
        F.softmax(big_out.prediction_logits[:, 1:, vocab_mapping] / temperature, dim=-1)
    ) / outputs.prediction_logits.shape[-1]
    return kd_loss

#### Sentence distillation loss (LaBSE, RuBERT, Laser, USE)

In [None]:
from transformers import AutoModel, AutoTokenizer

In [None]:
# here I tried to make the adapters nonlinear, but gave up

class Adapter(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.proj0 = nn.Linear(n_in, n_in)
        self.proj1 = nn.Linear(n_in, n_in)
        self.proj2 = nn.Linear(n_in, n_in)
        self.proj3 = nn.Linear(n_in, n_out)
        self.nonlin = nn.CELU()
    
    def forward(self, x):
        x = nn.functional.normalize(self.proj0(x))
        x = x + self.nonlin(self.proj1(x))
        x = x + self.nonlin(self.proj2(x))
        x = nn.functional.normalize(self.proj3(x))
        return x

# undo
Adapter = torch.nn.Linear

In [None]:
labse_name = '/gd/MyDrive/models/labse-ru'
labse = AutoModel.from_pretrained(labse_name)
labse_tokenizer = AutoTokenizer.from_pretrained(labse_name)

labse.eval()
labse.cuda();

Some weights of the model checkpoint at /gd/MyDrive/models/labse-ru were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
import os
ladapter_path = '/gd/MyDrive/models/tinybert-ru-labse-adapter.pt'
if os.path.exists(ladapter_path):
    labse_adapter = torch.load(ladapter_path)
    print('loading')
else:
    labse_adapter = Adapter(312, 768)
    print('creating from scratch')
labse_adapter.cuda();

loading


In [None]:
def get_labse_loss(outputs_list, texts):
    inp = {k: v.to(labse.device) for k, v in labse_tokenizer(texts, return_tensors='pt', padding=True, max_length=512, truncation=True).items()}
    with torch.no_grad():
        labse_out = labse(**inp)
    emb = torch.nn.functional.normalize(labse_out.pooler_output)
    lfun = torch.nn.MSELoss()
    loss = sum([
        lfun(torch.nn.functional.normalize(labse_adapter(out)), emb) 
        for out in outputs_list
    ])
    return loss

The clever plan is to pull our embeddings toward the LaBSE embeddings of the English sentences and toward the RuBERT embeddings of the Russian sentences at the same time.
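
Because both the projected student embedding and the teacher embedding are L2-normalized before `MSELoss`, these distillation losses are closely tied to cosine similarity: for unit vectors of dimension `d`, the mean squared error equals `(2 - 2*cos) / d`. The sketch below checks this identity numerically in plain Python; the two vectors are made up for illustration. It also suggests one possible reading of the `* 768`, `* 1024` and `* 512` factors in the training loop: they undo the averaging over `d`.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mse(a, b):
    """Mean squared error over dimensions (the default reduction of nn.MSELoss)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

student = normalize([0.3, -1.2, 0.8, 2.0])   # made-up projected CLS vector
teacher = normalize([0.1, -1.0, 1.1, 1.7])   # made-up teacher embedding

d = len(student)
lhs = mse(student, teacher)
rhs = (2 - 2 * cosine(student, teacher)) / d
print(abs(lhs - rhs) < 1e-12)
```

Under this reading, multiplying the MSE by the teacher's dimension turns each term into `2 - 2*cos`, so teachers with different embedding sizes contribute on a comparable scale.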

In [None]:
rubert_name = 'DeepPavlov/rubert-base-cased-sentence'
rubert = AutoModel.from_pretrained(rubert_name)
rubert_tokenizer = AutoTokenizer.from_pretrained(rubert_name)

rubert.eval()
rubert.cuda();


import os
dpadapter_path = '/gd/MyDrive/models/tinybert-ru-rubert-adapter.pt'
if os.path.exists(dpadapter_path):
    dp_adapter = torch.load(dpadapter_path)
    print('loading')
else:
    dp_adapter = Adapter(312, 768)
    print('creating from scratch')
dp_adapter.cuda();



loading


In [None]:
def get_rubert_loss(outputs_list, texts):
    inp = {k: v.to(rubert.device) for k, v in rubert_tokenizer(texts, return_tensors='pt', padding=True, max_length=512, truncation=True).items()}
    with torch.no_grad():
        dp_out = rubert(**inp)
    emb = torch.nn.functional.normalize(dp_out.last_hidden_state[:, 0, :])  # pooler_output is worse for rubert
    lfun = torch.nn.MSELoss()
    loss = sum([
        lfun(torch.nn.functional.normalize(dp_adapter(out)), emb) 
        for out in outputs_list
    ])
    return loss

In [None]:
from laserembeddings import Laser
laser = Laser()

In [None]:
import os
laser_adapter_path = '/gd/MyDrive/models/tinybert-ru-laser-adapter.pt'
if os.path.exists(laser_adapter_path):
    laser_adapter = torch.load(laser_adapter_path)
    print('loading')
else:
    laser_adapter = Adapter(312, 1024)
    print('creating from scratch')
laser_adapter.cuda();

loading


In [None]:
def get_laser_loss(outputs_list, texts, language='en'):
    with torch.no_grad():
        embeddings = laser.embed_sentences(texts, lang=language)
    emb = torch.nn.functional.normalize(torch.tensor(embeddings).to(outputs_list[0].device))
    lfun = torch.nn.MSELoss()
    loss = sum([
        lfun(torch.nn.functional.normalize(laser_adapter(out)), emb) 
        for out in outputs_list
    ])
    return loss

USE: see https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3

It leaks memory badly, so it may be worth abandoning this idea. But I am trying anyway.

In [None]:
import tensorflow_hub
import tensorflow_text
import tensorflow as tf

use = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

INFO:absl:Using /tmp/tfhub_modules to cache modules.
INFO:absl:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3'.
INFO:absl:Downloaded https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3, Total size: 334.32MB
INFO:absl:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3'.


In [None]:
import os
use_adapter_path = '/gd/MyDrive/models/tinybert-ru-use-adapter.pt'
if os.path.exists(use_adapter_path):
    use_adapter = torch.load(use_adapter_path)
    print('loading')
else:
    use_adapter = Adapter(312, 512)
    print('creating from scratch')
use_adapter.cuda();

loading


In [None]:
def get_use_loss(outputs_list, texts):
    emb = use(texts)
    emb = torch.nn.functional.normalize(torch.tensor(emb.numpy()).to(outputs_list[0].device))
    lfun = torch.nn.MSELoss()
    loss = sum([
        lfun(torch.nn.functional.normalize(use_adapter(out)), emb) 
        for out in outputs_list
    ])
    return loss

#### Reconstruction loss (T5)

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

t5_name = 'cointegrated/rut5-small'
t5_name = '/gd/MyDrive/models/tinybert-ru-t5-decoder'

In [None]:
t5 = T5ForConditionalGeneration.from_pretrained(t5_name)
t5_tokenizer = T5Tokenizer.from_pretrained(t5_name)

t5.train();
t5.cuda();

In [None]:
import os
t5_adapter_path = '/gd/MyDrive/models/tinybert-ru-t5-adapter.pt'
t5_states = 10
if os.path.exists(t5_adapter_path):
    t5_adapter = torch.load(t5_adapter_path)
    print('loading')
else:
    t5_adapter = Adapter(312, 512 * t5_states)
    print('creating from scratch')
t5_adapter.cuda();

loading


In [None]:
def get_t5_loss(outputs, texts):
    t5_repr = [t5_adapter(outputs).reshape([outputs.shape[0], t5_states, 512])]
    targets = t5_tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=512)
    out = t5(
        encoder_outputs=t5_repr, 
        labels=targets['input_ids'].to(t5.device), 
        decoder_attention_mask=targets['attention_mask'].to(t5.device),
    )
    return out.loss

```
labse_adapter = Adapter(312, 768)
dp_adapter = Adapter(312, 768)
laser_adapter = Adapter(312, 1024)
t5_adapter = Adapter(312, 512 * t5_states)

labse_adapter.cuda(), dp_adapter.cuda(), laser_adapter.cuda(), t5_adapter.cuda();
```

#### Training loop

In [None]:
def cleanup():
    gc.collect()
    torch.cuda.empty_cache()
    tf.keras.backend.clear_session()

In [None]:
from itertools import chain
optimizer = torch.optim.Adam(
    params=[p for p in chain(
        model.parameters(), 
        t5.decoder.parameters(), t5.lm_head.parameters(),
        t5_adapter.parameters(),
        labse_adapter.parameters(),
        dp_adapter.parameters(),
        laser_adapter.parameters(),
        use_adapter.parameters(),
        ) if p.requires_grad], 
    lr=1e-5  # larger learning rate is detrimental
)
len(optimizer.param_groups)

1

In [None]:
optimizer.param_groups[0]['lr'] = 1e-5

In [None]:
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer=optimizer, T_0=1765)

In [None]:
from tensorflow.errors import ResourceExhaustedError

In [None]:
batch_size = 16  # the size of 4 seems to be the limit on my local device, while on colab 32 is OK
# with gpt on colab, 8 is maximum, or 16, with t5
# when we do not distill any other models, batch size of 64 seems to be just fine (and 3 epochs promise to pass in less than 24 hours!)
margin = 0.3
temp = 3.0
hard_freq = 0
accumulation_steps = 4  # this really helps when training gets stuck. It also speeds things up!

epochs = 3
save_steps = int(8192 / batch_size)
window = int(1024 / batch_size * 4)
print('window steps', window, 'save steps', save_steps)
ewms = [0] * 20

tq = trange(int(df_ru.shape[0] * epochs / batch_size))
cleanup()

model.train()
t5.train()
labse.train()
#big_model.train()
rubert.train()

for i in tq:
    if hard_freq and i % hard_freq == 0:
        bb = df_ru.text.loc[hard_batch(batch_size)]
    else:
        bb = df_ru.text.sample(batch_size)
    eb = df_en.iloc[bb.index].text

    try:
        inputs_ru = preprocess_inputs(tokenizer(bb.tolist(), return_tensors='pt', padding=True, truncation=True))
        inputs_en = preprocess_inputs(tokenizer(eb.tolist(), return_tensors='pt', padding=True, truncation=True))
        outputs_ru = model(**inputs_ru, output_hidden_states=True)
        outputs_en = model(**inputs_en, output_hidden_states=True)
        pool_ru = pool(model, outputs_ru)
        pool_en = pool(model, outputs_en)
        
        losses = [
            sum([
                get_labse_loss([pool_ru, pool_en], eb.tolist()) * 768, 
                get_rubert_loss([pool_ru, pool_en], bb.tolist()) * 768,
                get_laser_loss([pool_ru, pool_en], eb.tolist()) * 1024,
                get_use_loss([pool_ru, pool_en], eb.tolist()) * 512, 
                get_use_loss([pool_ru, pool_en], bb.tolist()) * 512,
            ]),
            (get_t5_loss(pool_ru, bb.tolist()) + get_t5_loss(pool_en, bb.tolist())) * 1,
            #(distill(inputs_ru, outputs_ru, temperature=temp) + distill(inputs_en, outputs_en, temperature=temp)) * 25,
            get_mlm_loss(inputs_ru, outputs_ru) + get_mlm_loss(inputs_en, outputs_en),
            get_sentence_loss(pool_ru, pool_en, margin=margin),
        ]
        loss = sum(losses)
        loss.backward()

    except (RuntimeError, ResourceExhaustedError) as e:
        print('runtime error on batch', i, e)
        inputs_ru, inputs_en = None, None
        outputs_ru, outputs_en = None, None
        pool_ru, pool_en = None, None
        losses = None
        loss = None
        cleanup()
        tf.keras.backend.clear_session()
        continue

    w = 1 / min(i+1, window)
    ewms = [ewm * (1-w) + loss.item() * w for ewm, loss in zip(ewms, [loss] + losses)]
    desc = 'loss: ' + ' '.join(['{:2.2f}'.format(l) for l in ewms]) + '|{:2.1e}'.format(optimizer.param_groups[0]['lr'])
    tq.set_description(desc)

    if i % accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        
        optimizer.zero_grad()
        tf.keras.backend.clear_session()
        cleanup()
    
    if i % window == 0 and i > 0:
        print(desc)
        # cleanup()

    if i % save_steps == 0 and i > 0:
        model.save_pretrained(NEW_MODEL_NAME)
        tokenizer.save_pretrained(NEW_MODEL_NAME)
        t5_tokenizer.save_pretrained(t5_name)
        t5.save_pretrained(t5_name)
        torch.save(labse_adapter, ladapter_path)
        torch.save(dp_adapter, dpadapter_path)
        torch.save(t5_adapter, t5_adapter_path)
        torch.save(laser_adapter, laser_adapter_path)
        torch.save(use_adapter, use_adapter_path)
        print('saving...', i, optimizer.param_groups[0]['lr'])

window steps 256 save steps 512



runtime error on batch 121 CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.90 GiB total capacity; 13.54 GiB already allocated; 31.75 MiB free; 13.84 GiB reserved in total by PyTorch)
loss: 21.22 6.77 2.09 7.74 4.62|9.8e-06
loss: 21.16 6.81 2.06 7.67 4.62|9.6e-06
saving... 512 9.572540863471041e-06
loss: 21.15 6.80 2.06 7.66 4.63|9.3e-06
loss: 21.15 6.79 2.06 7.67 4.62|9.0e-06
saving... 1024 8.997487975927483e-06
runtime error on batch 1108 CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.90 GiB total capacity; 13.45 GiB already allocated; 27.75 MiB free; 13.84 GiB reserved in total by PyTorch)
loss: 21.18 6.80 2.05 7.70 4.63|8.6e-06
runtime error on batch 1408 CUDA out of memory. Tried to allocate 406.00 MiB (GPU 0; 15.90 GiB total capacity; 12.87 GiB already allocated; 415.75 MiB free; 13.46 GiB reserved in total by PyTorch)
runtime error on batch 1428 CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 15.90 GiB total capacity; 13.46 GiB already allocated; 11

KeyboardInterrupt: ignored
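
The loop above steps the optimizer whenever `i % accumulation_steps == 0`, which also fires at `i = 0` after a single batch, and a batch skipped by the `except` branch never reaches that check at all. A minimal sketch of this schedule in plain Python, with no model involved (the `step_indices` helper and the value of `accumulation_steps` are illustrative and not taken from the notebook):

```python
def step_indices(n_batches, accumulation_steps, skipped=()):
    """Return the batch indices at which the optimizer would step.

    Mirrors the `if i % accumulation_steps == 0` check from the loop;
    `skipped` holds hypothetical indices of batches dropped by `continue`.
    """
    skipped = set(skipped)
    steps = []
    for i in range(n_batches):
        if i in skipped:
            continue  # the out-of-memory branch skips the step check as well
        if i % accumulation_steps == 0:
            steps.append(i)
    return steps


print(step_indices(10, 4))               # [0, 4, 8]
print(step_indices(10, 4, skipped=[4]))  # [0, 8]
```

If stepping right after the first batch is undesirable, `(i + 1) % accumulation_steps == 0` is a common alternative condition.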

The semantics of the sentences in the texts reconstructed by T5 can be recognized only very, very roughly.

But what does come through is the semantics, which is nice.

On the whole, of course, I was a fool. I should have distilled the CLS tokens from LaBSE right from the start. That is what I eventually did, and the representational power of CLS grew noticeably (after a while it even became higher than that of averaging the embeddings of all tokens).

~~I'm going to turn off the damned T5: it never learned anything, and it eats a lot of GPU. That way I can at least make the batches bigger.~~ Turned it back on, let it be. I only changed the adapter. And yes, with a thicker adapter (and with simpler examples from Tatoeba) the decoder learns noticeably better. When the T5 loss is around 2.2-2.3, it already decodes some sentences fairly well (short ones sometimes even completely).

I tried disconnecting the "teachers" from training, but the quality of the CLS token on paraphrase detection dropped almost instantly from 0.58 to 0.56 (and then to 0.54), so I decided to drop such experiments.

Disabling bert-multilingual for token distillation, on the other hand, helped a lot: the loss started to decrease both for CLS and for MLM. Apparently, a bad teacher starts to do harm at some point.
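
For reference, token-level distillation of the kind switched off here is commonly written as a temperature-softened KL divergence between the teacher's and the student's token distributions. The sketch below is plain Python on toy logits and assumes only that standard formulation; the function names, the `temperature ** 2` scaling and the toy numbers are illustrative and are not the notebook's actual loss code.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Multiplying by temperature**2 is a common convention that keeps
    gradient magnitudes comparable across temperatures (an assumption here).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]  # toy teacher logits for one masked position
student = [2.0, 1.5, 0.5]  # toy student logits for the same position
print(distillation_loss(student, teacher, temperature=1.0))
print(distillation_loss(student, teacher, temperature=3.0))
print(distillation_loss(teacher, teacher, temperature=3.0))  # identical logits give 0.0
```

A higher temperature flattens both distributions, so the student is also pushed to match the teacher's low-probability tokens; a separate weight would multiply this term before it is summed with the other losses.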

In [None]:
from transformers.modeling_outputs import BaseModelOutput
model.eval()
t5.eval()

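def decipher(text, **kwargs):
    """Encode `text` with the small BERT and let T5 decode text from the pooled sentence vector."""
    bert_in = {k: v.to(model.device) for k, v in tokenizer(text, return_tensors='pt').items()}
    
    with torch.no_grad():
        bert_out = model(**bert_in, output_hidden_states=True)
        pooled = pool(model, bert_out)
        # The adapter expands the pooled vector into `t5_states` pseudo encoder states of width 512 for T5
        eo = BaseModelOutput(last_hidden_state=t5_adapter(pooled).reshape([pooled.shape[0], t5_states, 512]))

    # `kwargs` are passed to `generate` unchanged (e.g. max_length, num_beams)
    out = t5.generate(encoder_outputs=eo, **kwargs)
    return t5_tokenizer.decode(out[0])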

for text in df_en.text.sample(5):
    print(text)
    print(decipher(text, max_length=64, repetition_penalty=3.0, num_beams=3))
    print()

If you j...
<pad> Если ты...</s>

My delegation encourages the CTC in its efforts to intensify its cooperation with international, regional and subregional organizations.
<pad> Япония приветствует усилия, направленные на содействие сотрудничеству между ЮНЕСКО и другими региональными организациями в области развития.</s>

Don't look so dismal, Arthur.
<pad> Не смотри, не беспокойся.</s>

And Utanka took them back.
<pad> И они пошли в Бостон.</s>

I was in love with Tom once.
<pad> Я был влюблён в Тома.</s>



In [None]:
for text in df_ru.text.sample(5):
    print(text)
    print(decipher(text, max_length=64, repetition_penalty=3.0, num_beams=3))
    print()

Я знаком с теми женщинами.
<pad> Я знаком с ней с мужчинами.</s>

Знаете, я никогда не участвовала в соревнованиях для инвалидов, я всегда выступала против здоровых спортсменов.
<pad> Я никогда не играл в спортивных клубах, но всегда я люблю участвовать в спорте.</s>

3.4 Сервис обязуется использовать только те букмекерские конторы, в которых был размещен по его рекомендации игровой банк Подписчика.
<pad> 2. Для этого пользователь должен использовать в качестве приложения, который используется для использования на основе списка клиентов.</s>

Для того чтобы понять, насколько оправданно введение моратория и на чьей стороне правда: то ли на стороне экологов, выступающих против отлова, то ли сотрудников дельфинариев, утверждающих, что запрет 2008 года никоим образом не способствует спасению популяций дельфинов, мы обратились за разъяснениями к преподавателю кафедры зоологии Таврического национального университета имени Вернадского Павлу Гольдину.
<pad> Поскольку мы не считаем, что этот во

Below is the history of loss values over the course of training.

```
loss: 21.03 6.78 2.02 7.62 4.62
loss: 21.11 6.81 2.03 7.64 4.63
loss: 21.13 6.86 2.03 7.61 4.63
loss: 21.15 6.88 2.03 7.62 4.62
loss: 21.17 6.87 2.02 7.65 4.63
loss: 21.25 6.89 2.01 7.71 4.63
loss: 21.28 6.92 2.08 7.67 4.62
loss: 21.30 6.93 2.08 7.66 4.63
loss: 21.33 6.93 2.10 7.67 4.63
loss: 21.36 6.99 2.05 7.69 4.64
loss: 21.38 7.00 2.10 7.65 4.62
loss: 21.44 7.04 2.12 7.65 4.64
loss: 21.52 7.10 2.06 7.71 4.64
# back to 16, because overflows became more frequent
loss: 22.15 7.09 1.91 7.73 5.43
loss: 22.19 7.12 1.88 7.77 5.42
loss: 22.24 7.14 1.91 7.77 5.43
loss: 22.31 7.22 1.88 7.77 5.43
loss: 22.35 7.24 1.89 7.79 5.43
loss: 22.40 7.24 1.90 7.83 5.43
loss: 22.64 7.32 1.95 7.93 5.43
# set the batch size to 24 instead of 16
loss: 21.81 7.25 2.09 7.83 4.64
# disabled token-level distillation as an experiment
loss: 28.31 7.32 2.10 6.44 7.80 4.65
loss: 28.37 7.33 2.07 6.47 7.85 4.65
loss: 28.48 7.37 2.15 6.53 7.78 4.64
loss: 28.53 7.40 2.12 6.51 7.85 4.64
loss: 28.53 7.45 2.11 6.51 7.81 4.65
loss: 28.57 7.52 2.11 6.51 7.79 4.65
loss: 28.63 7.56 2.14 6.50 7.79 4.64  # or is T5 breaking?
loss: 28.63 7.60 2.09 6.46 7.83 4.65
# Added USE-multilingual-large to the teachers; nothing seems to break
loss: 24.19 3.20 2.10 6.48 7.77 4.64
loss: 24.24 3.20 2.12 6.46 7.80 4.65
loss: 24.26 3.19 2.12 6.50 7.80 4.65
loss: 24.29 3.21 2.16 6.54 7.74 4.64
loss: 24.32 3.23 2.14 6.46 7.84 4.65
loss: 24.36 3.24 2.13 6.49 7.84 4.66
loss: 24.43 3.31 2.15 6.46 7.83 4.68
# enabled truncation in the teachers, and training immediately picked up pace.
```

```
loss: 24.50 3.31 2.14 6.49 7.88 4.67
loss: 24.53 3.31 2.17 6.50 7.87 4.68
loss: 24.55 3.33 2.14 6.49 7.91 4.68
loss: 24.64 3.35 2.16 6.55 7.89 4.69
loss: 24.68 3.35 2.18 6.55 7.90 4.69
loss: 24.71 3.37 2.19 6.51 7.94 4.70
loss: 24.74 3.41 2.17 6.51 7.94 4.72
loss: 24.85 3.42 2.22 6.52 7.96 4.72
# Disabled the "hard" batches, because training became rather unstable, especially for T5, which seems to have freaked out from such massive gradients.
loss: 24.83 3.46 2.36 6.41 7.68 4.92
loss: 24.86 3.48 2.38 6.35 7.71 4.94
loss: 24.91 3.50 2.37 6.38 7.76 4.91
loss: 25.07 3.56 2.39 6.37 7.81 4.95
loss: 25.12 3.55 2.38 6.39 7.83 4.97
loss: 25.17 3.61 2.43 6.42 7.75 4.97
loss: 25.30 3.63 2.47 6.37 7.87 4.97
# made every 4th batch homogeneous ("hard")
loss: 25.24 3.79 2.16 6.47 7.95 4.86
loss: 25.55 3.89 2.16 6.56 7.99 4.94
# fixed the indexing of the training data, and it suddenly turned out that things look better for the English-Russian losses!
loss: 26.82 3.70 2.27 7.26 8.03 5.56
loss: 26.88 3.72 2.25 7.22 8.10 5.58
loss: 26.91 3.71 2.23 7.23 8.15 5.58
loss: 26.96 3.73 2.26 7.30 8.11 5.56
loss: 27.01 3.73 2.27 7.30 8.13 5.57
loss: 27.05 3.73 2.27 7.32 8.14 5.58
loss: 27.19 3.80 2.35 7.31 8.14 5.58
loss: 27.27 3.81 2.38 7.34 8.17 5.57
loss: 27.38 3.82 2.39 7.37 8.21 5.58
loss: 27.43 3.84 2.44 7.37 8.21 5.58
loss: 27.46 3.84 2.43 7.37 8.23 5.59
loss: 27.71 3.95 2.55 7.39 8.20 5.61
loss: 27.75 3.95 2.55 7.40 8.22 5.62
loss: 27.83 4.01 2.51 7.38 8.25 5.67
loss: 27.99 4.05 2.53 7.42 8.29 5.69
loss: 28.24 4.26 2.55 7.39 8.24 5.80
loss: 28.81 4.72 2.58 7.45 8.29 5.77
loss: 30.14 6.33 2.55 7.45 8.21 5.60
# also added Laser to the teachers, so that the sentence meaning is captured even better
loss: 26.67 2.89 2.54 7.41 8.24 5.58
loss: 26.93 2.92 2.77 7.39 8.26 5.59
loss: 27.22 2.92 2.97 7.43 8.29 5.62  # when t5 loss gets lower than 2.5, I'll be satisfied
loss: 27.83 2.93 3.47 7.54 8.27 5.61
loss: 28.32 2.99 3.98 7.38 8.37 5.60
# didn't like it, put T5 back in a new format
loss: 25.31 2.91 7.20 8.23 6.96
loss: 25.50 2.96 7.17 8.37 7.00
# turning T5 off entirely, increasing the batch size to 32
loss: 48.62 2.95 24.28 7.46 8.32 5.59
loss: 49.55 3.01 25.13 7.52 8.32 5.57
# Multiplied the T5 loss by ten, because I got fed up with it not learning at all
loss: 26.91 3.03 2.51 7.46 8.32 5.60
loss: 27.05 2.87 2.67 7.82 8.49 5.20
## added the tatoeba and opus100 corpora to the Yandex one, so as not to train in circles on the same data
loss: 27.29 2.56 3.05 8.60 8.42 4.65
loss: 27.34 2.58 3.06 8.64 8.41 4.66
loss: 27.36 2.59 3.11 8.62 8.38 4.66
loss: 27.43 2.62 3.07 8.64 8.43 4.67
loss: 27.47 2.64 3.11 8.65 8.39 4.67
loss: 27.57 2.66 3.10 8.69 8.44 4.68
loss: 27.61 2.68 3.11 8.67 8.48 4.68
loss: 27.65 2.76 3.07 8.65 8.45 4.72
loss: 27.74 2.77 3.13 8.67 8.45 4.72
loss: 27.83 2.81 3.16 8.67 8.43 4.75
loss: 27.87 2.85 3.13 8.69 8.42 4.77
loss: 28.38 3.11 3.13 8.73 8.50 4.91
loss: 29.10 3.76 3.16 8.72 8.45 5.02
# added rubert to the distillation (its loss is summed with the labse loss)
loss: 26.81 1.86 3.12 8.74 8.44 4.65
loss: 27.00 1.86 3.16 8.79 8.52 4.66
# enabled dropout in the labse and big_model models, just to keep things interesting!
loss: 26.51 1.88 3.10 8.45 8.41 4.67
loss: 26.55 1.88 3.12 8.48 8.43 4.65
loss: 26.61 1.89 3.19 8.39 8.48 4.66
loss: 27.37 2.20 3.18 8.76 8.49 4.76
loss: 27.57 2.34 3.19 8.77 8.50 4.77
loss: 27.74 2.47 3.12 8.83 8.54 4.78
loss: 27.94 2.62 3.20 8.82 8.54 4.77
loss: 28.11 2.81 3.17 8.89 8.48 4.76
loss: 28.17 2.81 3.22 8.89 8.49 4.75
loss: 28.32 2.81 3.22 8.98 8.55 4.76
# raised the MLM distillation temperature to 3 and its weight to 35, because I can!
loss: 21.25 2.82 3.18 2.08 8.41 4.76
loss: 21.28 2.83 3.16 2.06 8.47 4.75
loss: 21.31 2.84 3.22 2.08 8.42 4.75
loss: 21.35 2.86 3.25 2.07 8.41 4.76
loss: 21.38 2.86 3.21 2.08 8.48 4.76
loss: 21.42 2.87 3.26 2.07 8.45 4.76
loss: 21.49 2.89 3.27 2.09 8.48 4.75
loss: 21.55 2.96 3.22 2.08 8.54 4.76
loss: 22.57 3.38 3.42 2.13 8.66 4.99
loss: 22.94 3.63 3.40 2.12 8.66 5.14
loss: 23.12 3.84 3.33 2.12 8.65 5.18
loss: 23.42 4.09 3.47 2.10 8.52 5.23
loss: 23.55 4.11 3.42 2.11 8.68 5.22
loss: 23.69 4.25 3.36 2.13 8.72 5.24
loss: 23.82 4.28 3.36 2.13 8.81 5.24
loss: 23.93 4.42 3.40 2.11 8.77 5.24
loss: 28.08 8.75 3.52 2.11 8.79 4.91
# (above: added LABSE loss to the front, en and ru T5 losses added together, everything disrupted).
loss: 18.77 1.66 1.66 2.10 8.68 4.67
loss: 18.82 1.69 1.70 2.09 8.68 4.66
loss: 18.92 1.69 1.70 2.12 8.73 4.68
loss: 19.00 1.79 1.80 2.12 8.61 4.69
loss: 19.05 1.75 1.75 2.11 8.75 4.70
loss: 19.19 1.78 1.78 2.13 8.81 4.70
#
loss: 19.28 1.87 1.87 2.12 8.73 4.70
loss: 19.35 1.86 1.87 2.13 8.78 4.70
loss: 19.36 1.91 1.91 2.14 8.69 4.71
loss: 19.47 1.91 1.91 2.15 8.80 4.71
loss: 19.95 2.17 2.17 2.11 8.81 4.69
loss: 17.1293, 2.2217, 8.8566, 6.0510
loss: 22.68 2.04 2.04 1.99 1.98 2.15 9.06 3.42
loss: 22.14 1.94 1.95 1.91 1.90 2.13 8.91 3.40

loss: 22.20 1.97 1.98 1.92 1.89 2.14 8.91 3.39
loss: 22.52 2.00 2.01 1.95 1.92 2.15 9.08 3.42
```
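
The values in these logs are not raw per-batch losses: the training loop smooths each one with `w = 1 / min(i+1, window)` and `ewm * (1-w) + loss * w`, that is, a running mean over the first `window` batches followed by an exponential moving average. A small sketch of that update in isolation (the `smooth` helper and the toy loss values are illustrative):

```python
def smooth(values, window):
    """Apply the loop's update: running mean for the first `window` items, then an EMA."""
    ewm = 0.0
    history = []
    for i, value in enumerate(values):
        w = 1 / min(i + 1, window)
        ewm = ewm * (1 - w) + value * w
        history.append(ewm)
    return history


# With a large window the first values behave like a plain running mean.
print(smooth([4.0, 2.0, 3.0], window=256))
# With window=2 every value from the second one on is averaged 50/50 with the history.
print(smooth([1.0, 1.0, 3.0], window=2))
```

Because of this smoothing, a sudden change in one component shows up in the printed numbers only gradually, over roughly `window` batches.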

## Publish the model to huggingface hub


In [None]:
from transformers import BertForPreTraining, BertTokenizer, BertTokenizerFast
NEW_MODEL_NAME = '/gd/MyDrive/models/tinybert-ru'

In [None]:
#model = BertForPreTraining.from_pretrained(NEW_MODEL_NAME)
#tokenizer = BertTokenizerFast.from_pretrained(NEW_MODEL_NAME)

In [None]:
model.config.name_or_path = 'cointegrated/rubert-tiny'
tokenizer.name_or_path = 'cointegrated/rubert-tiny'
tokenizer.init_kwargs['name_or_path'] = 'cointegrated/rubert-tiny'

In [None]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs
!git lfs install

Detected operating system as Ubuntu/bionic.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... done.
Running apt-get update... done.

The repository is setup! You can now install packages.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 48 not upgraded.
Need to get 6,229 kB of archives.
After this operation, 14.5 MB of additional disk space will be used.
Get:1 https://packagecloud.io/github/git-lfs/ubuntu bionic/main amd64 git-lfs amd64 2.13.3 [6,229 kB]
Fetched 6,229 kB in 1s (7,087 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/pe

In [None]:
!pip install huggingface_hub



In [None]:
!huggingface-cli login

In [None]:
#  !huggingface-cli repo create rubert-tiny

In [None]:
! rm -rf rubert-tiny

In the cell below you may need to enter your login and password if the push runs into problems.

In [None]:
!git clone https://huggingface.co/cointegrated/rubert-tiny

Cloning into 'rubert-tiny'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 64 (delta 27), reused 0 (delta 0)
Unpacking objects: 100% (64/64), done.


In [None]:
! cd rubert-tiny && git lfs install && git config --global user.email "dale.david@mail.ru"

Updated git hooks.
Git LFS initialized.


In [None]:
model.save_pretrained('rubert-tiny')
#tokenizer.save_pretrained('rubert-tiny')

In [None]:
!ls -alsh rubert-tiny

total 47M
4.0K drwxr-xr-x 3 root root 4.0K Jun  9 19:43 .
4.0K drwxr-xr-x 1 root root 4.0K Jun  9 19:43 ..
4.0K -rw-r--r-- 1 root root  632 Jun  9 19:43 config.json
4.0K drwxr-xr-x 9 root root 4.0K Jun  9 19:43 .git
4.0K -rw-r--r-- 1 root root  690 Jun  9 19:43 .gitattributes
 46M -rw-r--r-- 1 root root  46M Jun  9 19:43 pytorch_model.bin
4.0K -rw-r--r-- 1 root root 1.1K Jun  9 19:43 README.md
4.0K -rw-r--r-- 1 root root  112 Jun  9 19:43 special_tokens_map.json
4.0K -rw-r--r-- 1 root root  341 Jun  9 19:43 tokenizer_config.json
460K -rw-r--r-- 1 root root 458K Jun  9 19:43 tokenizer.json
236K -rw-r--r-- 1 root root 236K Jun  9 19:43 vocab.txt


In [None]:
! cd rubert-tiny && git status

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   pytorch_model.bin

no changes added to commit (use "git add" and/or "git commit -a")


In [None]:
! cd rubert-tiny && git add . && git status

On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   pytorch_model.bin



In [None]:
! cd rubert-tiny && git diff --cached

diff --git a/pytorch_model.bin b/pytorch_model.bin
index 9da29c2..eacb3f3 100644
--- a/pytorch_model.bin
+++ b/pytorch_model.bin
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:bab658a6372a592efa5a8f6fde8149edf07f69e889d6c821a1e3fbf21cc82099
+oid sha256:7b46f70960011906bf9be3c46ad7490bada3845ddb5c2d7a8830c9517ee66071
 size 47679974
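
The diff above is of a Git LFS pointer file rather than of the weights themselves: only the `oid sha256:` line changes, while the `size` line stays the same. A minimal sketch for computing the same two fields for a local file, so that they can be compared with the pointer by eye (the `lfs_pointer_fields` helper is hypothetical and is not part of git-lfs or huggingface_hub):

```python
import hashlib
import os


def lfs_pointer_fields(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes) of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


# Only runs when the cloned repository with the saved weights is present.
if os.path.exists('rubert-tiny/pytorch_model.bin'):
    oid, size = lfs_pointer_fields('rubert-tiny/pytorch_model.bin')
    print(f'oid sha256:{oid}')
    print(f'size {size}')
```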


In [None]:
! cd rubert-tiny && git add . && git commit -m "Train for a couple of days more" && git push

[main 0dd911d] Train for a couple of days more
 1 file changed, 1 insertion(+), 1 deletion(-)
Uploading LFS objects: 100% (1/1), 48 MB | 6.0 MB/s, done.
Counting objects: 3, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 374 bytes | 374.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
To https://huggingface.co/cointegrated/rubert-tiny
   e032bcd..0dd911d  main -> main
