<a href="https://colab.research.google.com/github/Hesdi/KyrgyzNER/blob/main/dt_iii_bert_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATATHON-III: Teach Artificial Intelligence to Understand the Kyrgyz Language
### Example Solution Based on BERT

First, install the required libraries, which already do much of the work for us.

In [None]:
! pip install -q deeppavlov==1.2.0 transformers==4.31.0 pytorch-crf==0.7.2


Helper functions:
- `get_sentences`: reads a file in the format specified for the hackathon
- `kyrgyz_ner_2_deeppavlov_ner`: converts the data to the format that DeepPavlov works with
- `write_predictions`: writes predictions to the given file in the format the submission system accepts

In [None]:
from itertools import groupby
from collections import namedtuple
from typing import List

# Immutable class; Sentence is a convenient wrapper for the data
Sentence = namedtuple("Sentence", "id tokens tags")


def get_sentences(filename: str) -> List[Sentence]:
    """
      filename - path to the data file
      Returns a list of sentences as Sentence(id, tokens, tags) structures.
    """
    with open(filename, "r", encoding="utf-8") as inp:
        lines = inp.readlines()

    sentences = [list(sentence) for k, sentence in groupby(lines, key=lambda x: x.strip() != "") if k]
    sentences = [Sentence(s[0].strip(),
                          [t.split("\t")[0].strip() for t in s[1:]],
                          [t.split("\t")[-1].strip() for t in s[1:]]) for s in sentences]

    return sentences


def kyrgyz_ner_2_deeppavlov_ner(source: str, destination: str) -> None:
    """
      source - path to the train file to be converted to a format
      suitable for DeepPavlov NER, so that the model can be trained on it

      destination - path where the converted file should be saved

      Converts the source data to the format that DeepPavlov NER
      understands.
    """
    sentences = get_sentences(source)

    with open(destination, "w", encoding="utf-8") as out:
        for s in sentences:
            for tok, tag in zip(s.tokens, s.tags):
                print(tok, tag, file=out)
            print("", file=out)

def write_predictions(sentences: List[Sentence], predictions, destination: str) -> None:
    """
      sentences - list of sentences obtained from `get_sentences`
      predictions - predictions obtained from the trained NER model
      destination - path where the predictions file will be written
    """
    # Debug output: show the type and contents of the predictions
    print(type(predictions), predictions)
    with open(destination, "w", encoding="utf-8") as out:
        for sent, pred in zip(sentences, predictions):
            print(sent.id, file=out)
            for tok, p in zip(sent.tokens, pred):
                print(f"{tok}\t-\t-\t{p}", file=out)
            print("", file=out)
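
To make the input format concrete, here is a minimal sketch of the parsing step that `get_sentences` performs, run on an in-memory sample instead of a file. The sample sentences, their IDs, and the two middle `-` columns are assumptions for illustration (the four-column layout is inferred from what `write_predictions` emits); the real hackathon files may differ.

```python
from collections import namedtuple
from itertools import groupby

Sentence = namedtuple("Sentence", "id tokens tags")

# Hypothetical sample: two sentences, each introduced by an ID line and
# followed by tab-separated rows (token first, tag last).
sample = (
    "sent-1\n"
    "Бишкек\t-\t-\tB-LOCATION\n"
    "шаары\t-\t-\tO\n"
    "\n"
    "sent-2\n"
    "Манас\t-\t-\tB-PERSON\n"
)

lines = sample.splitlines(keepends=True)

# Consecutive non-empty lines form one sentence block; blank lines separate blocks.
blocks = [
    list(group)
    for non_empty, group in groupby(lines, key=lambda x: x.strip() != "")
    if non_empty
]

parsed = [
    Sentence(
        block[0].strip(),                                    # first line: sentence ID
        [row.split("\t")[0].strip() for row in block[1:]],   # first column: token
        [row.split("\t")[-1].strip() for row in block[1:]],  # last column: tag
    )
    for block in blocks
]

for s in parsed:
    print(s.id, s.tokens, s.tags)
```

Because only the first and last columns are read, the same parser should work for files with or without the intermediate columns.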


In [None]:
import os

# Create a directory for the data in the converted format
data_dir = "train_data"
os.makedirs(data_dir, exist_ok=True)

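# Specify the paths to your files here. Note that this example uses the same file
# for train, valid and test, so the validation and test scores are measured on training data.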
kyrgyz_ner_2_deeppavlov_ner("gold.conll2003-formatted.30-texts.txt", f"{data_dir}/train.txt")
kyrgyz_ner_2_deeppavlov_ner("gold.conll2003-formatted.30-texts.txt", f"{data_dir}/valid.txt")
kyrgyz_ner_2_deeppavlov_ner("gold.conll2003-formatted.30-texts.txt", f"{data_dir}/test.txt")
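
Since the cell above writes the same file to all three splits, one possible refinement is to hold out part of the sentences for validation and testing. The sketch below is only an illustration: `split_sentences`, its share parameters, and the seed are hypothetical names and values, not part of the original solution, and the demo list stands in for the output of `get_sentences`.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_sentences(
    items: Sequence[T],
    valid_share: float = 0.1,
    test_share: float = 0.1,
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Hypothetical helper: shuffle the items and cut them into train / valid / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_valid = int(len(shuffled) * valid_share)
    n_test = int(len(shuffled) * test_share)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Stand-in data: in the notebook this would be the list returned by get_sentences.
demo = [f"sentence-{i}" for i in range(20)]
train, valid, test = split_sentences(demo)
print(len(train), len(valid), len(test))
```

Splitting at sentence level can still place sentences from one source text in different parts; splitting by text would be stricter, but the file format shown here does not make text boundaries explicit.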

With everything prepared, we take a ready-made model from the DeepPavlov library that can be fine-tuned.

[This paper](http://www.ijmlc.org/vol9/758-ML0025.pdf) describes the approach for LSTM; the BERT version is built by analogy.

Here is the [configuration file](https://github.com/deeppavlov/DeepPavlov/blob/master/deeppavlov/configs/ner/ner_ontonotes_bert_mult.json); perhaps something else in it is worth changing?

In [None]:
from deeppavlov.core.commands.utils import parse_config

# Load the configuration, point it at our training data, and set the number of epochs and the batch size
model_config = parse_config("ner_ontonotes_bert_mult")
model_config["dataset_reader"]["data_path"] = data_dir
model_config["train"]["epochs"] = 5
model_config["train"]["batch_size"] = 16

model_config["train"]

{'epochs': 5,
 'batch_size': 16,
 'metrics': [{'name': 'ner_f1', 'inputs': ['y', 'y_pred']},
  {'name': 'ner_token_f1', 'inputs': ['y', 'y_pred']}],
 'validation_patience': 100,
 'val_every_n_batches': 20,
 'log_every_n_batches': 20,
 'show_examples': False,
 'pytest_max_batches': 2,
 'pytest_batch_size': 8,
 'evaluation_targets': ['valid', 'test'],
 'class_name': 'torch_trainer'}
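
The parsed configuration is an ordinary nested dictionary, so a misspelled key in an assignment would silently create a new entry instead of changing an existing option. The sketch below shows one way to guard against that when experimenting with several settings. `set_by_path` is a hypothetical helper, the demo dictionary copies only a few keys from the output above, and the override values are arbitrary.

```python
from typing import Any, Dict


def set_by_path(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Hypothetical helper: set config["a"]["b"] = value given the key "a.b"."""
    *parents, last = dotted_key.split(".")
    node = config
    for name in parents:
        node = node[name]  # raises KeyError if a section is missing
    if last not in node:
        raise KeyError(f"unknown option: {dotted_key}")
    node[last] = value


# Stand-in for the parsed configuration; only a few keys are reproduced here.
demo_config = {
    "dataset_reader": {"data_path": "train_data"},
    "train": {
        "epochs": 5,
        "batch_size": 16,
        "validation_patience": 100,
        "val_every_n_batches": 20,
    },
}

# Arbitrary example values, not recommendations.
overrides = {"train.epochs": 10, "train.validation_patience": 5}
for key, value in overrides.items():
    set_by_path(demo_config, key, value)

print(demo_config["train"])
```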

### Model training

In [None]:
from deeppavlov import train_model

ner_model = train_model(model_config)




2023-07-26 09:55:04.54 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 104: [saving vocabulary to /root/.deeppavlov/models/ner_ontonotes_torch_bert_mult_crf/tag.dict]



Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2023-07-26 09:55:22.297 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 198: Initial best ner_f1 of 0.7836


{"valid": {"eval_examples_count": 209, "metrics": {"ner_f1": 0.7836, "ner_token_f1": 4.733}, "time_spent": "0:00:05", "epochs_done": 0, "batches_seen": 0, "train_examples_seen": 0, "impatience": 0, "patience_limit": 100}}



{"train": {"eval_examples_count": 16, "metrics": {"ner_f1": 45.1613, "ner_token_f1": 59.7938}, "time_spent": "0:00:11", "epochs_done": 1, "batches_seen": 20, "train_examples_seen": 305, "loss": 0.9659969881176949}}



2023-07-26 09:55:29.218 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 206: Improved best ner_f1 from 0.7836 to 30.2083
2023-07-26 09:55:29.221 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 208: Saving model
2023-07-26 09:55:29.224 INFO in 'deeppavlov.core.models.torch_model'['torch_model'] at line 207: Saving model to /root/.deeppavlov/models/ner_onto

{"valid": {"eval_examples_count": 209, "metrics": {"ner_f1": 30.2083, "ner_token_f1": 64.7699}, "time_spent": "0:00:12", "epochs_done": 1, "batches_seen": 20, "train_examples_seen": 305, "impatience": 0, "patience_limit": 100}}



{"train": {"eval_examples_count": 16, "metrics": {"ner_f1": 44.4444, "ner_token_f1": 74.4186}, "time_spent": "0:00:35", "epochs_done": 2, "batches_seen": 40, "train_examples_seen": 610, "loss": 0.3777087040245533}}



2023-07-26 09:55:53.359 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 206: Improved best ner_f1 from 30.2083 to 54.7893
2023-07-26 09:55:53.363 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 208: Saving model
2023-07-26 09:55:53.366 INFO in 'deeppavlov.core.models.torch_model'['torch_model'] at line 207: Saving model to /root/.deeppavlov/models/ner_ontonotes_torch_bert_mult_cr

{"valid": {"eval_examples_count": 209, "metrics": {"ner_f1": 54.7893, "ner_token_f1": 66.5185}, "time_spent": "0:00:36", "epochs_done": 2, "batches_seen": 40, "train_examples_seen": 610, "impatience": 0, "patience_limit": 100}}



{"train": {"eval_examples_count": 16, "metrics": {"ner_f1": 79.0698, "ner_token_f1": 75.5906}, "time_spent": "0:00:52", "epochs_done": 4, "batches_seen": 60, "train_examples_seen": 900, "loss": 0.21979666273109616}}



2023-07-26 09:56:10.408 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 206: Improved best ner_f1 from 54.7893 to 72.7273
2023-07-26 09:56:10.412 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 208: Saving model
2023-07-26 09:56:10.416 INFO in 'deeppavlov.core.models.torch_model'['torch_model'] at line 207: Saving model to /root/.deeppavlov/models/ner_o

{"valid": {"eval_examples_count": 209, "metrics": {"ner_f1": 72.7273, "ner_token_f1": 80.3725}, "time_spent": "0:00:53", "epochs_done": 4, "batches_seen": 60, "train_examples_seen": 900, "impatience": 0, "patience_limit": 100}}




{"valid": {"eval_examples_count": 209, "metrics": {"ner_f1": 72.7273, "ner_token_f1": 80.3725}, "time_spent": "0:00:02"}}




{"test": {"eval_examples_count": 209, "metrics": {"ner_f1": 72.7273, "ner_token_f1": 80.3725}, "time_spent": "0:00:02"}}




### Predictions for the test set

In [None]:
sentences = get_sentences("gold.conll2003-formatted.txt")
_, predictions = ner_model([s.tokens for s in sentences])
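
`write_predictions` pairs tokens with tags using `zip`, which stops at the shorter sequence, so a length mismatch would silently drop rows from the output file. Whether the model can ever return a different number of tags than tokens is not stated here, so the check below is a defensive sketch under that assumption. `find_misaligned` is a hypothetical helper and the demo data is made up.

```python
from collections import namedtuple
from typing import List, Sequence

Sentence = namedtuple("Sentence", "id tokens tags")


def find_misaligned(
    sentences: Sequence[Sentence],
    predictions: Sequence[Sequence[str]],
) -> List[str]:
    """Hypothetical check: IDs of sentences whose tag count differs from the token count."""
    if len(sentences) != len(predictions):
        raise ValueError(
            f"{len(sentences)} sentences but {len(predictions)} predictions"
        )
    return [s.id for s, p in zip(sentences, predictions) if len(s.tokens) != len(p)]


# Made-up data for illustration only.
demo_sentences = [
    Sentence("sent-1", ["a", "b", "c"], ["O", "O", "O"]),
    Sentence("sent-2", ["d", "e"], ["O", "O"]),
]
demo_predictions = [["O", "B-PERSON", "I-PERSON"], ["O"]]

print(find_misaligned(demo_sentences, demo_predictions))
```

In the notebook, calling such a check on `sentences` and `predictions` before writing the file would surface any affected sentence IDs early.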

Write the predictions to an output file in the hackathon format.

In [None]:
write_predictions(sentences, predictions, "predictions.txt")

<class 'list'> [['I-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', 'B-TITLE', 'O', 'O', 'O'], ['O', 'B-INSTITUTION', 'I-INSTITUTION', 'I-INSTITUTION', 'I-INSTITUTION', 'I-INSTITUTION', 'O', 'O', 'O', 'O', 'O', 'O'], ['B-PERSON', 'I-PERSON', 'B-LOCATION', 'B-INSTITUTION', 'I-INSTITUTION', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'B-TITLE', 'I-TITLE', 'I-TITLE', 'I-TITLE', 'O', 'O', 'B-TITLE', 'I-TITLE', 'I-TITLE', 'O', 'O'], ['O', 'O', 'O', 'O', 'B-TITLE', 'I-TITLE', 'I-TITLE', 'I-TITLE', 'O', 'O', 'B-INSTITUTION', 'I-INSTITUTION', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'I-PERS