This task was solved with the LightAutoML library. As the name AutoML suggests, most of the work was done automatically by the built-in TabularNLPAutoML class. I framed the problem as follows: the input is a word together with the text it appears in, and the output is the index (counted from zero) of the vowel in that word that carries the stress. The NLP model chosen for fine-tuning was rubert-base-cased-conversational from DeepPavlov, and the task type was classification.
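
To make the label encoding concrete, here is a minimal sketch of how a stress mark can be turned into a class label. It assumes, as the helper code in this notebook suggests, that the stressed vowel is the only uppercase letter in the marked word; the function name `stressed_vowel_index` and the sample words are illustrative and not part of the original notebook.

```python
VOWELS = set("аоуыэеёиюя")

def stressed_vowel_index(marked_word):
    """Return the zero-based index, among the vowels of the word, of the uppercase letter."""
    vowels_seen = 0
    for ch in marked_word:
        if ch.isupper():
            # The uppercase letter is the stressed one (assumed marking scheme).
            return vowels_seen
        if ch in VOWELS:
            vowels_seen += 1
    return None  # no stress mark found

# Hypothetical marked words, used only to illustrate the encoding.
examples = {w: stressed_vowel_index(w) for w in ["Августа", "авгУста", "адОнис"]}
print(examples)
```

With this encoding the classifier predicts a small integer per word instead of a character position, so words of different lengths share the same label space.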

In [1]:
import pandas as pd

train = pd.read_json('train.json')
test = pd.read_json('test.json')
test.head()

   source
0  до АВГУСТА , я думаю , должно что проясниться .
1  заседание перенесли на шестнадцатое АВГУСТА .
2  АВГУСТА - женское имя латинского происхождения .
3  бабушка невилла АВГУСТА поддерживала переписку...
4  АДОНИС


In [4]:
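vowels = set('аоуыэеёиюя')

def get_stress_num(row):
    # Assumes that in 'target' the stressed letter is the only uppercase
    # letter of the marked word.
    for word in row['target'].split():
        if sum(map(str.isupper, word)) == 1:
            cntr = 0  # number of vowels seen before the stressed letter
            for c in word:
                if c.isupper():
                    return cntr
                if c.lower() in vowels:
                    cntr += 1
    return None  # no word with exactly one uppercase letter

def get_word(row):
    # In 'source' the word to be stressed is written entirely in uppercase.
    for word in row['source'].split():
        if word.isupper():
            return word
    return None  # no fully uppercase word found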


In [5]:
train['word'] = train.apply(get_word,axis=1)
train['vowel_num'] = train.apply(get_stress_num,axis=1)

In [6]:
train = train.drop('target',axis=1)

In [10]:
train.vowel_num.value_counts()

0    60932
1    57992
2    18707
3     2938
4      298
5       87
Name: vowel_num, dtype: int64
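
The counts above are strongly imbalanced, which is worth quantifying before training. The sketch below recomputes class shares and the accuracy of a trivial majority-class predictor from the numbers printed above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 60932, 1: 57992, 2: 18707, 3: 2938, 4: 298, 5: 87}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class.
majority_baseline = max(shares.values())
# Combined share of the two rarest classes.
rare_share = shares[4] + shares[5]

for label, share in shares.items():
    print(f"class {label}: {share:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"share of classes 4 and 5: {rare_share:.4f}")
```

Classes 4 and 5 together make up well under one percent of the data, which is consistent with the test predictions later in this notebook covering only classes 0 to 3.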

In [11]:
from lightautoml.automl.presets.text_presets import TabularNLPAutoML
from lightautoml.tasks import Task

roles = {
    'text': ['source','word'],
    'target': 'vowel_num'
}

task = Task('multiclass', metric='crossentropy')

automl = TabularNLPAutoML(
    task=task,
    timeout=30000,
    cpu_limit=8,
    general_params={
        'nested_cv': False,
        'use_algos': [['nn']],
    },
    nested_cv_params={
        'cv': 3
    },
    text_params={
        'lang': 'ru',
        'bert_model': 'DeepPavlov/rubert-base-cased-conversational'
    },
    nn_params={
        'opt_params': {'lr': 1e-5},
        'max_length': 150,
        'bs': 32,
        'n_epochs': 4,
    },
)

In [12]:
oof_pred = automl.fit_predict(train, roles=roles)

Some weights of the model checkpoint at DeepPavlov/rubert-base-cased-conversational were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
train (loss=0.703459): 100%|██████████| 2937/2937 [15:07<00:00,  3.23it/s]
val: 100%|██████████| 
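
The training log only reports the loss, so it can help to score the out-of-fold predictions directly. The following is a sketch of an accuracy check on a matrix of class probabilities. It assumes that column `k` of the matrix corresponds to label `k` and that rows without an out-of-fold prediction are filled with NaN; both are assumptions to verify against the actual output, and `oof_accuracy` is a hypothetical helper, not a LightAutoML function.

```python
import numpy as np

def oof_accuracy(probs, labels):
    """Accuracy of argmax predictions, ignoring rows that contain NaN."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    valid = ~np.isnan(probs).any(axis=1)      # rows that have a full prediction
    preds = probs[valid].argmax(axis=1)       # most probable class per row
    return float((preds == labels[valid]).mean())

# Toy data: three scored rows (two of them correct) and one unscored row.
toy_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [np.nan, np.nan, np.nan],
])
toy_labels = np.array([0, 1, 0, 2])
toy_score = oof_accuracy(toy_probs, toy_labels)
print(toy_score)
```

In this notebook the call would presumably be `oof_accuracy(oof_pred.to_numpy().data, train['vowel_num'].to_numpy())`, mirroring how the test predictions are read later on.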

In [13]:
test['word'] = test.apply(get_word,axis=1)

In [14]:
res = automl.predict(test)

test: 100%|██████████| 534/534 [01:42<00:00,  5.20it/s]

In [16]:
import numpy as np
np.unique(res.to_numpy().data.argmax(axis=1))

array([0, 1, 2, 3], dtype=int64)

In [18]:
test['pred'] = res.to_numpy().data.argmax(axis=1)

In [19]:
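def prediction(row):
    # Re-insert the stress mark: lowercase the word, then uppercase the predicted vowel.
    word = row['word']
    vowel_num = row['pred']
    cntr = 0  # index of the current vowel within the word
    for indx, c in enumerate(word):
        if c.lower() in vowels:
            if cntr == vowel_num:
                new_word = word.lower()
                return new_word[:indx] + new_word[indx].upper() + new_word[indx+1:]
            cntr += 1
    return None  # the predicted index exceeds the number of vowels in the word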
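
Note that `prediction` returns nothing when the predicted class is larger than the number of vowels in the word, which would leave an empty line in the output file. One possible way to guard against this is sketched below: the predicted index is clamped to the last vowel of the word. The clamping rule and the name `mark_stress` are assumptions made for illustration, not the approach used in this notebook.

```python
VOWELS = set("аоуыэеёиюя")

def mark_stress(word, vowel_num):
    """Lowercase the word and uppercase its vowel_num-th vowel (clamped to the last vowel)."""
    lowered = word.lower()
    positions = [i for i, ch in enumerate(lowered) if ch in VOWELS]
    if not positions:
        return lowered  # nothing to stress in a word without vowels
    # Fall back to the last vowel if the predicted index is out of range.
    idx = positions[min(int(vowel_num), len(positions) - 1)]
    return lowered[:idx] + lowered[idx].upper() + lowered[idx + 1:]

print(mark_stress("АДОНИС", 1))  # in-range prediction
print(mark_stress("АДОНИС", 5))  # out-of-range prediction, clamped
```

Whether clamping is better than, say, falling back to the second most probable class depends on the evaluation metric, so this is only one option.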

In [22]:
test['predicted_word'] = test.apply(prediction, axis=1)

In [24]:
test['predicted_word'].to_csv('word.csv',index=None,header=None)