# Homework 3

## Predicting a hotel's user rating from the review text

In this homework you will train on Kaggle data and submit predictions to the [competition](https://www.kaggle.com/t/325e82797935464aa07c254b3cc3d8ad). To make the contest visible, open the invitation link from the Telegram channel and accept the contest rules. The data can be downloaded via the same link.

We collected reviews for 1,500 hotels from all over the world. Which hotels they are is a secret. You are given the review text and the user's rating of the hotel. Your task is to learn to predict the rating from the review.

The main metric is Mean Absolute Error (MAE). In every part of the homework you need a MAE of at most 0.92 on the public leaderboard. Otherwise we will have to reject the assignment :(
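For reference, MAE is the mean of the absolute differences between predicted and true scores. A minimal pure-Python sketch follows; the function name `mean_absolute_error_simple` and the sample numbers are illustrative and not part of the assignment.

```python
def mean_absolute_error_simple(y_true, y_pred):
    """Average of |y_true[i] - y_pred[i]| over all examples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (not taken from the dataset).
example_true = [7.0, 10.0, 5.5]
example_pred = [8.0, 9.0, 5.0]
example_mae = mean_absolute_error_simple(example_true, example_pred)
```

In practice `sklearn.metrics.mean_absolute_error` computes the same quantity.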

#### About the data:
Each review consists of two texts, positive and negative: the hotel's pros and cons. The `score` column holds the user's rating, a real number from 0 to 10. You need to extract features from these texts and use them to predict the rating.

For local testing, use the provided train/test split.

Good luck & have fun! 💪

#### Using any training data other than what the organizers provide is strictly forbidden. In the last part you may use pretrained models from the `transformers` library.

In [None]:
PATH_TO_TRAIN_DATA = 'data/train.csv'

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(PATH_TO_TRAIN_DATA)
df.head()

Unnamed: 0,review_id,negative,positive,score
0,00003c6036f30f590c0ac435efb8739b,There were issues with the wifi connection,No Positive,7.1
1,00004d18f186bf2489590dc415876f73,TV not working,No Positive,7.5
2,0000cf900cbb8667fad33a717e9b1cf4,More pillows,Beautiful room Great location Lovely staff,10.0
3,0000df16edf19e7ad9dd8c5cd6f6925e,Very business,Location,5.4
4,00025e1aa3ac32edb496db49e76bbd00,Rooms could do with a bit of a refurbishment ...,Nice breakfast handy for Victoria train stati...,6.7


Text preprocessing can affect the quality of your model.
Let's do some light preprocessing of the texts: remove punctuation and convert all words to lowercase.
You do not have to stop at this set of transformations, though. Think about what else can be done with the texts to help the models later on, and add the transformations you believe could help.

We have also added tokenization. Each review string is now an array of tokens.
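One possible extra step, sketched under assumptions: the sample rows above contain the placeholders `No Positive` and `No Negative`, which appear to mark an empty field, so they could be mapped to a dedicated token instead of being tokenized as ordinary words. The helper name `normalize_review` and the marker token `emptyfield` are hypothetical.

```python
import re

PLACEHOLDERS = {"no positive", "no negative"}  # placeholders seen in the sample rows
EMPTY_TOKEN = "emptyfield"                     # hypothetical marker token


def normalize_review(text):
    """Lowercase, collapse whitespace, and replace placeholder-only fields."""
    cleaned = re.sub(r"\s+", " ", str(text)).strip().lower()
    if not cleaned or cleaned in PLACEHOLDERS:
        return EMPTY_TOKEN
    return cleaned
```

Such a function could be applied to both columns before tokenization; whether it actually improves MAE has to be checked on the local split.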

In [None]:
import string
import re

import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

def process_text(text):
    return [word for word in word_tokenize(text.lower()) if word not in string.punctuation] 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
df['negative'] = df['negative'].apply(process_text)
df['positive'] = df['positive'].apply(process_text)

df['negative'] = df['negative'].apply(lambda x: [l + '-' for l in x])
df['positive'] = df['positive'].apply(lambda x: [l + '+' for l in x])

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, random_state=1412) # <- for local testing

### Part 1. 1 point

Train a logistic regression on TF-IDF vectors of the texts.
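Logistic regression is a classifier, while the target here is a continuous score evaluated with MAE. The sketch below therefore pairs TF-IDF features with a linear regressor; using `Ridge` is an assumption made for this illustration, and the toy texts and scores are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration; not from the competition dataset.
toy_texts = [
    "great location lovely staff",
    "dirty room rude staff",
    "nice breakfast clean room",
    "noisy room broken tv",
]
toy_scores = [9.5, 3.0, 8.5, 4.0]

# Word-level unigrams and bigrams feeding an L2-regularized linear model.
tfidf_ridge = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    Ridge(alpha=1.0),
)
tfidf_ridge.fit(toy_texts, toy_scores)
toy_prediction = tfidf_ridge.predict(["lovely clean room"])
```

Wrapping the vectorizer and the model in one pipeline guarantees that the same fitted vocabulary is used for training, local validation, and the Kaggle test set.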

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

In [None]:
def join_tokens(tokens):
    return re.sub(r'\d+', '', ' '.join(tokens))

def prepare_data(df_train, df_test):
    X_train = df_train['negative'].apply(join_tokens) + ' ' + df_train['positive'].apply(join_tokens)
    X_test = df_test['negative'].apply(join_tokens) + ' ' + df_test['positive'].apply(join_tokens)
    
    y_train = df_train['score']
    y_test = df_test['score']

    return X_train, y_train, X_test, y_test

In [None]:
X_train, y_train, X_test, y_test = prepare_data(df_train, df_test)

In [None]:
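# stop_words only applies to analyzer='word', so it is omitted for the
# character analyzer; the resulting features are unchanged.
vectorizer = TfidfVectorizer(lowercase=True, analyzer='char',
                        ngram_range=(1,2))

# Character n-gram (1-2) TF-IDF vectors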
tr_vect = vectorizer.fit_transform(X_train)
ts_vect = vectorizer.transform(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error

regressor = LinearRegression(n_jobs=4)
regressor.fit(tr_vect, y_train)
y_predicted = regressor.predict(ts_vect)

mean_absolute_error(y_test, y_predicted)

0.9189446614427927

In [None]:
df_kaggle = pd.read_csv("data/test.csv")
df_kaggle.head()

Unnamed: 0,review_id,negative,positive
0,00026f564b258ad5159aab07c357c4ca,Other than the location everything else was h...,Just the location
1,000278c73da08f4fcb857fcfe4ac6417,No UK TV but this was a minor point as we wer...,Great location very comfortable clean breakfa...
2,000404f843e756fe3b2a477dbefa5bd4,A tiny noisy room VERY deceptively photographed,The breakfast booked the preceding night but ...
3,000a66d32bcf305148d789ac156dd512,Noisy various electrical devices kicking in r...,Great location Nice bathroom
4,000bf1d8c5110701f459ffbedbf0d546,No Negative,Great location and friendly staff


In [None]:
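# Same pipeline as for training, including the '-' / '+' token suffixes.
df_kaggle['negative'] = df_kaggle['negative'].apply(process_text).apply(lambda x: [l + '-' for l in x])
df_kaggle['positive'] = df_kaggle['positive'].apply(process_text).apply(lambda x: [l + '+' for l in x])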

In [None]:
test_kaggle = df_kaggle['negative'].apply(join_tokens) + ' ' + df_kaggle['positive'].apply(join_tokens)
test_kaggle_vect = vectorizer.transform(test_kaggle)
y_kaggle_predicted = regressor.predict(test_kaggle_vect)

In [None]:
df_answer_kaggle = df_kaggle["review_id"].to_frame()
df_answer_kaggle["score"] = y_kaggle_predicted.round(1).tolist()
df_answer_kaggle.to_csv("data/submission.csv", index=False)

df_answer_kaggle

Unnamed: 0,review_id,score
0,00026f564b258ad5159aab07c357c4ca,3.5
1,000278c73da08f4fcb857fcfe4ac6417,5.2
2,000404f843e756fe3b2a477dbefa5bd4,4.0
3,000a66d32bcf305148d789ac156dd512,4.3
4,000bf1d8c5110701f459ffbedbf0d546,5.3
...,...,...
19995,ffe8a7190aee6e3a53ee2e0145a91555,3.9
19996,ffea0e2b84788c9df755efe8e2bedb23,5.0
19997,fff3997a85a1eed7ae7a937bc945fcf0,5.1
19998,fff673fe95ab8f3a0910f112549862e2,3.8


Use this model to predict the test data from the [competition](https://www.kaggle.com/t/325e82797935464aa07c254b3cc3d8ad) and make a submission. What score did you get? Attach a screenshot from Kaggle.

![](https://i.imgur.com/v7e1HMQ.png)
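An unconstrained linear model can predict values outside the valid 0 to 10 range. A small post-processing sketch, assuming true scores always lie in that range (in which case clipping cannot increase MAE); the function name `postprocess_scores` and the sample values are illustrative.

```python
import numpy as np


def postprocess_scores(raw_predictions, low=0.0, high=10.0, decimals=1):
    """Clip predictions to the valid score range, then round."""
    clipped = np.clip(np.asarray(raw_predictions, dtype=float), low, high)
    return np.round(clipped, decimals)


# Illustrative raw outputs of an unconstrained regressor.
toy_processed = postprocess_scores([11.37, -0.2, 7.84])
```

Rounding is optional and mirrors the one-decimal format of the scores in the data; unlike clipping, it can shift MAE slightly in either direction.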

### Part 2. 2 points

Train a logistic regression on averaged Word2Vec vectors.
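A minimal sketch of the averaging step, assuming `vectors` is any mapping from token to a NumPy vector that supports `in` and indexing (a gensim `KeyedVectors` object such as `w2v_model.wv` behaves this way). The function name and the toy 2-dimensional vectors are illustrative.

```python
import numpy as np


def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


# Toy 2-dimensional "embeddings", invented for illustration.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "room": np.array([0.0, 1.0]),
}
toy_doc_vector = average_embedding(["good", "room", "unknownword"], toy_vectors, dim=2)
```

Stacking these document vectors row by row gives a feature matrix that can be passed to any scikit-learn regressor.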

In [None]:
df = pd.read_csv(PATH_TO_TRAIN_DATA)
df.head()

Unnamed: 0,review_id,negative,positive,score
0,00003c6036f30f590c0ac435efb8739b,There were issues with the wifi connection,No Positive,7.1
1,00004d18f186bf2489590dc415876f73,TV not working,No Positive,7.5
2,0000cf900cbb8667fad33a717e9b1cf4,More pillows,Beautiful room Great location Lovely staff,10.0
3,0000df16edf19e7ad9dd8c5cd6f6925e,Very business,Location,5.4
4,00025e1aa3ac32edb496db49e76bbd00,Rooms could do with a bit of a refurbishment ...,Nice breakfast handy for Victoria train stati...,6.7


In [None]:
df['opinion'] = df['positive'].str.cat(df['negative'], sep =" ")

In [None]:
df['opinion'] = df['opinion'].apply(process_text)

In [None]:
df_train, df_test = train_test_split(df, random_state=1412) # <- for local testing

In [None]:
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
import logging
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

In [None]:
EMB_SIZE = 500

w2v_model = Word2Vec(min_count=1,
                     window=2,
                     vector_size=EMB_SIZE,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=4)

INFO - 23:52:51: Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=500, alpha=0.03)', 'datetime': '2021-12-19T23:52:51.641997', 'gensim': '4.1.2', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-21.1.0-x86_64-i386-64bit', 'event': 'created'}


In [None]:
w2v_model.build_vocab(df['opinion'], progress_per=50000)
w2v_model.train(df['opinion'], total_examples=w2v_model.corpus_count, epochs=70, report_delay=1)
w2v_model.init_sims(replace=True)

INFO - 23:52:51: collecting all words and their counts
INFO - 23:52:51: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 23:52:52: PROGRESS: at sentence #50000, processed 1726357 words, keeping 24704 word types
INFO - 23:52:52: collected 34728 word types from a corpus of 3440696 raw words and 100000 sentences
INFO - 23:52:52: Creating a fresh vocabulary
INFO - 23:52:52: Word2Vec lifecycle event {'msg': 'effective_min_count=1 retains 34728 unique words (100.0%% of original 34728, drops 0)', 'datetime': '2021-12-19T23:52:52.504780', 'gensim': '4.1.2', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n[Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-21.1.0-x86_64-i386-64bit', 'event': 'prepare_vocab'}
INFO - 23:52:52: Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 3440696 word corpus (100.0%% of original 3440696, drops 0)', 'datetime': '2021-12-19T23:52:52.505421', 'gensim': '4.1.2', 'python': '3.7.4 (default, Aug 13 2019, 15:17:50) \n

INFO - 23:53:19: worker thread finished; awaiting finish of 3 more threads
INFO - 23:53:19: worker thread finished; awaiting finish of 2 more threads
INFO - 23:53:19: worker thread finished; awaiting finish of 1 more threads
INFO - 23:53:19: worker thread finished; awaiting finish of 0 more threads
INFO - 23:53:19: EPOCH - 10 : training on 3440696 raw words (1154421 effective words) took 2.6s, 439610 effective words/s
INFO - 23:53:20: EPOCH 11 - PROGRESS: at 43.29% examples, 495514 words/s, in_qsize 8, out_qsize 0
INFO - 23:53:21: EPOCH 11 - PROGRESS: at 81.12% examples, 466175 words/s, in_qsize 8, out_qsize 0
INFO - 23:53:21: worker thread finished; awaiting finish of 3 more threads
INFO - 23:53:21: worker thread finished; awaiting finish of 2 more threads
INFO - 23:53:21: worker thread finished; awaiting finish of 1 more threads
INFO - 23:53:21: worker thread finished; awaiting finish of 0 more threads
INFO - 23:53:21: EPOCH - 11 : training on 3440696 raw words (1153860 effective wor

INFO - 23:53:52: worker thread finished; awaiting finish of 0 more threads
INFO - 23:53:52: EPOCH - 23 : training on 3440696 raw words (1154075 effective words) took 2.8s, 406441 effective words/s
INFO - 23:53:54: EPOCH 24 - PROGRESS: at 32.85% examples, 376047 words/s, in_qsize 7, out_qsize 0
INFO - 23:53:55: EPOCH 24 - PROGRESS: at 68.86% examples, 392233 words/s, in_qsize 8, out_qsize 0
INFO - 23:53:55: worker thread finished; awaiting finish of 3 more threads
INFO - 23:53:55: worker thread finished; awaiting finish of 2 more threads
INFO - 23:53:55: worker thread finished; awaiting finish of 1 more threads
INFO - 23:53:55: worker thread finished; awaiting finish of 0 more threads
INFO - 23:53:55: EPOCH - 24 : training on 3440696 raw words (1154308 effective words) took 2.8s, 414831 effective words/s
INFO - 23:53:56: EPOCH 25 - PROGRESS: at 35.97% examples, 414450 words/s, in_qsize 7, out_qsize 0
INFO - 23:53:57: EPOCH 25 - PROGRESS: at 70.03% examples, 402252 words/s, in_qsize 8, o

INFO - 23:54:30: EPOCH - 36 : training on 3440696 raw words (1153027 effective words) took 2.9s, 396367 effective words/s
INFO - 23:54:31: EPOCH 37 - PROGRESS: at 34.22% examples, 396447 words/s, in_qsize 7, out_qsize 0
INFO - 23:54:32: EPOCH 37 - PROGRESS: at 74.58% examples, 428964 words/s, in_qsize 8, out_qsize 0
INFO - 23:54:33: worker thread finished; awaiting finish of 3 more threads
INFO - 23:54:33: worker thread finished; awaiting finish of 2 more threads
INFO - 23:54:33: worker thread finished; awaiting finish of 1 more threads
INFO - 23:54:33: worker thread finished; awaiting finish of 0 more threads
INFO - 23:54:33: EPOCH - 37 : training on 3440696 raw words (1154816 effective words) took 2.7s, 434335 effective words/s
INFO - 23:54:34: EPOCH 38 - PROGRESS: at 39.51% examples, 455967 words/s, in_qsize 8, out_qsize 0
INFO - 23:54:35: EPOCH 38 - PROGRESS: at 79.94% examples, 460790 words/s, in_qsize 8, out_qsize 0
INFO - 23:54:35: worker thread finished; awaiting finish of 3 mo

KeyboardInterrupt: 

In [None]:
w2v_model.wv.most_similar(positive=["good"])

[('excellent', 0.6185586452484131),
 ('great', 0.5960034132003784),
 ('nice', 0.5376293659210205),
 ('very', 0.42820680141448975),
 ('comfortable', 0.42485761642456055),
 ('clean', 0.41823825240135193),
 ('location', 0.4068338871002197),
 ('friendly', 0.405752032995224),
 ('poor', 0.398668110370636),
 ('lovely', 0.3927823305130005)]

By averaging w2v vectors we assume that every word contributes equally to the meaning of a sentence, which may not be quite true. Now try a different approach and reweight the words when building the final text embedding. Use IDF (inverse document frequency) as the weights.

In [None]:
def calc_idf(texts):
    pass
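One way the stub could be filled in, as a sketch rather than the expected solution: it uses the plain `log(N / df)` variant of IDF (other smoothed variants exist) and assumes `texts` is a collection of token lists, as in this notebook. The names `calc_idf_sketch` and `idf_weighted_embedding` are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def calc_idf_sketch(texts):
    """Return {token: log(N / df)}, where df counts documents containing the token."""
    n_docs = len(texts)
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))  # count each token once per document
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def idf_weighted_embedding(tokens, vectors, idf, dim):
    """IDF-weighted mean of token vectors; zero vector if nothing is usable."""
    pairs = [(vectors[t], idf[t]) for t in tokens if t in vectors and t in idf]
    total = sum(w for _, w in pairs)
    if not pairs or total == 0.0:
        return np.zeros(dim, dtype=np.float32)
    return sum(v * w for v, w in pairs) / total


# Toy corpus and vectors, invented for illustration.
toy_corpus = [["good", "room"], ["bad", "room"]]
toy_idf = calc_idf_sketch(toy_corpus)
toy_w2v = {"good": np.array([1.0, 0.0]), "room": np.array([0.0, 1.0])}
toy_weighted = idf_weighted_embedding(["good", "room"], toy_w2v, toy_idf, dim=2)
```

In the toy corpus "room" occurs in every document, so its IDF is zero and it does not influence the weighted embedding at all.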

Experiment with the embedding dimensionality. For each of the two methods, plot model quality against the embedding dimensionality.

#### Draw conclusions:

Now try training a logistic regression on any other embeddings of dimensionality 300 and compare the quality with Word2Vec.
#### Conclusions:
`<YOUR TEXT HERE>`

Use your best model from this part to predict the test data from the [competition](https://www.kaggle.com/t/325e82797935464aa07c254b3cc3d8ad) and make a submission. What score did you get? Attach a screenshot from Kaggle.

### Part 3. 4 points

Now let's use the more advanced text-processing methods covered in our course. Train an RNN/Transformer to predict the user rating. Achieve an error lower than with all the methods above.

If you train an RNN, try limiting the maximum sentence length. Some reviews can be much longer than the rest.

To use a DataLoader, all of its elements must have the same shape. To achieve this, you can add zero padding to all sentences (see the `pad_sequence` example).
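The truncation and padding steps can be sketched as follows. The constants `MAX_LEN_SKETCH` and `PAD_ID` and the toy token ids are assumptions for illustration; in practice the padding id should come from the tokenizer being used.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN_SKETCH = 4   # hypothetical limit, chosen small for the example
PAD_ID = 0           # assumed padding id; use the tokenizer's own pad id in practice

# Toy token-id sequences of different lengths, invented for illustration.
toy_sequences = [[5, 6, 7, 8, 9, 10], [11, 12], [13]]

truncated = [torch.tensor(seq[:MAX_LEN_SKETCH]) for seq in toy_sequences]
lengths = torch.tensor([len(seq) for seq in truncated])

# Shape: (batch, longest_sequence_in_batch); shorter rows are filled with PAD_ID.
padded = pad_sequence(truncated, batch_first=True, padding_value=PAD_ID)

# 1.0 for real tokens, 0.0 for padding; built from lengths so that a real
# token whose id happens to equal PAD_ID is not masked out by mistake.
positions = torch.arange(padded.size(1)).unsqueeze(0)
attention_mask = (positions < lengths.unsqueeze(1)).float()
```

Deriving the mask from sequence lengths rather than from a test like `id > 0` avoids hiding a genuine token whose id is 0.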

In [None]:
!pip install pytorch_transformers

Collecting pytorch_transformers
  Downloading pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting boto3
  Downloading boto3-1.20.24-py3-none-any.whl (131 kB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.0-py3-none-any.whl (79 kB)
Collecting botocore<1.24.0,>=1.23.24
  Downloading botocore-1.23.24-py3-none-any.whl (8.4 MB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading jmespath

In [None]:
import torch
from torch import nn
from torch.nn import functional as F

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.nn.utils.rnn import pad_sequence
from pytorch_transformers import RobertaTokenizer, RobertaForSequenceClassification
from pytorch_transformers import RobertaConfig
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [None]:
df = pd.read_csv(PATH_TO_TRAIN_DATA)
df['opinion'] = df['positive'].str.cat(df['negative'], sep =" ")

In [None]:
set_score = list(set(df.score.tolist()))
dict_score = {set_score[item]:item for item in range(len(set_score))}
def get_class(score):
    return(dict_score[score['score']])
  
df['reviewClass'] = df.apply(get_class, axis=1)

df_train, df_test = train_test_split(df)
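The cell above treats every distinct score as a separate class. Because `list(set(...))` does not guarantee any particular order, a sorted variant makes the class indices reproducible between runs and ordered by score. The sketch below is an alternative under that assumption; `build_score_maps` is a hypothetical helper name.

```python
def build_score_maps(scores):
    """Map each distinct score to a class index in ascending score order."""
    ordered = sorted(set(scores))
    score_to_class = {score: idx for idx, score in enumerate(ordered)}
    class_to_score = {idx: score for score, idx in score_to_class.items()}
    return score_to_class, class_to_score


# Illustrative scores only.
toy_score_to_class, toy_class_to_score = build_score_maps([7.1, 10.0, 5.4, 7.1])
```

A reproducible mapping also matters when a saved model state is loaded in a new session, since the class indices must mean the same scores as during training.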

In [None]:
sentences_train = [i for i in df_train['opinion']]
labels_train = [i for i in df_train['reviewClass'].tolist()]

sentences_test = [i for i in df_test['opinion']]
labels_test = [i for i in df_test['reviewClass'].tolist()]

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base',add_special_tokens=True)

100%|██████████| 898823/898823 [00:00<00:00, 3257651.64B/s]
100%|██████████| 456318/456318 [00:00<00:00, 2102723.10B/s]


In [None]:
MAX_LEN = 150
train_input = [tokenizer.encode(x,add_special_tokens=True) for x in sentences_train]

train_pos_pad = pad_sequence([torch.as_tensor(seq[:MAX_LEN]) for seq in train_input], 
                           batch_first=True)

train_masks = [[float(i>0) for i in seq] for seq in train_pos_pad]

Token indices sequence length is longer than the specified maximum sequence length for this model (515 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (524 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (679 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (658 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (523 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for thi

In [None]:
val_input = [tokenizer.encode(x, add_special_tokens=True) for x in sentences_test]
val_pos_pad = pad_sequence([torch.as_tensor(seq[:MAX_LEN]) for seq in val_input], 
                           batch_first=True)

val_masks = [[float(i>0) for i in seq] for seq in val_pos_pad]

Token indices sequence length is longer than the specified maximum sequence length for this model (521 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (562 > 512). Running this sequence through the model will result in indexing errors


In [None]:
train_inputs = train_pos_pad.clone().detach()
train_labels = torch.tensor(labels_train)
train_masks = torch.tensor(train_masks)

val_inputs = val_pos_pad.clone().detach()
val_labels = torch.tensor(labels_test)
val_masks = torch.tensor(val_masks)

In [None]:
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(
    train_data, shuffle=True,
    batch_size=32
)


val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_dataloader = DataLoader(
    val_data,
    batch_size=32
)

In [None]:
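# num_labels must cover every class index in dict_score (built from the full
# DataFrame); counting unique scores in df_train alone could miss some classes.
config = RobertaConfig.from_pretrained("roberta-base", 
                                       output_hidden_states=True, 
                                       num_labels=len(set_score))
model = RobertaForSequenceClassification.from_pretrained("roberta-base", 
                                                         config=config)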

100%|██████████| 481/481 [00:00<00:00, 262690.13B/s]
100%|██████████| 501200538/501200538 [00:19<00:00, 25909053.67B/s]


In [None]:
dict_score_class = {item:set_score[item] for item in range(len(set_score))}

def mae(predicted, actual):
    predicted =  [dict_score_class[item] for item in predicted]
    actual = [dict_score_class[item] for item in actual]
    return mean_absolute_error(predicted, actual)

In [None]:
df_kaggle = pd.read_csv('data/test.csv')
df_kaggle['opinion'] = df_kaggle['positive'].str.cat(df_kaggle['negative'], sep =" ")
df_kaggle.head()

Unnamed: 0,review_id,negative,positive,opinion
0,00026f564b258ad5159aab07c357c4ca,Other than the location everything else was h...,Just the location,Just the location Other than the location e...
1,000278c73da08f4fcb857fcfe4ac6417,No UK TV but this was a minor point as we wer...,Great location very comfortable clean breakfa...,Great location very comfortable clean breakfa...
2,000404f843e756fe3b2a477dbefa5bd4,A tiny noisy room VERY deceptively photographed,The breakfast booked the preceding night but ...,The breakfast booked the preceding night but ...
3,000a66d32bcf305148d789ac156dd512,Noisy various electrical devices kicking in r...,Great location Nice bathroom,Great location Nice bathroom Noisy various e...
4,000bf1d8c5110701f459ffbedbf0d546,No Negative,Great location and friendly staff,Great location and friendly staff No Negative


In [None]:
sentences_test = [sentence for sentence in df_kaggle['opinion']]

In [None]:
MAX_LEN = 150
test_input = [tokenizer.encode(x,add_special_tokens=True) for x in sentences_test]

test_pos_pad = pad_sequence([torch.as_tensor(seq[:MAX_LEN]) for seq in test_input], 
                           batch_first=True)

test_masks = [[float(i>0) for i in seq] for seq in test_pos_pad]

Token indices sequence length is longer than the specified maximum sequence length for this model (758 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (615 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (684 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (558 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (633 > 512). Running this sequence through the model will result in indexing errors


In [None]:
# test_pos_pad is already a tensor; clone().detach() copies it without the
# UserWarning that torch.tensor() raises when given a tensor.
test_inputs = test_pos_pad.clone().detach()
test_masks = torch.tensor(test_masks)


In [None]:
test_data = TensorDataset(test_inputs, test_masks)
test_dataloader = DataLoader(
    test_data,
    batch_size=32
)

In [None]:
def train_one_epoch(model, train_dataloader, 
                    criterion, optimizer, device="cuda:0"):
    model.to(device).train()
    with tqdm(total=len(train_dataloader)) as pbar:
        for batch in train_dataloader:
            ids, mask, labels = batch
            ids = ids.to(device)
            mask = mask.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()
            output = model(ids, token_type_ids=None, 
                           attention_mask=mask)[0]
            
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            _, predicted = torch.max(output.detach(), 1)
            accuracy_mae = mae(predicted.cpu().detach().numpy(), 
                               labels.cpu().detach().numpy())
            pbar.set_description(
                'CrossEntropyLoss: {:.4f}; MAE: {:.4f}'.format(
                    loss.detach().item(), accuracy_mae))    
            pbar.update(1)
            
def predict(model, val_dataloader, criterion, device="cuda:0"):
    model.to(device).eval()
    losses = []
    predicted_classes = []
    true_classes = []
    with tqdm(total=len(val_dataloader)) as pbar:
        with torch.no_grad():
            for batch in val_dataloader:
                ids, mask, labels = batch
                ids = ids.to(device)
                mask = mask.to(device)
                labels = labels.to(device)
                
                
                output = model(ids, token_type_ids=None, 
                               attention_mask=mask)[0]
            
                loss = criterion(output, labels)
                losses.append(loss.item())
                _, predicted = torch.max(output.detach(), 1)
                predicted_classes.append(predicted)
                true_classes.append(labels)
                
                
                accuracy_mae = mae(predicted.cpu().detach().numpy(), 
                                   labels.cpu().detach().numpy())
                pbar.set_description(
                    'CrossEntropyLoss: {:.4f}; MAE: {:.4f}'.format(
                        loss.detach().item(), accuracy_mae))    
                pbar.update(1)
                
    predicted_classes = torch.cat(predicted_classes).detach().to('cpu').numpy()
    true_classes = torch.cat(true_classes).detach().to('cpu').numpy()
    return losses, predicted_classes, true_classes

def predict_without_labels(model, test_dataloader, device="cuda:0"):
    model.to(device).eval()
    predicted_classes = []
    step = 0
    with tqdm(total=len(test_dataloader)) as pbar:
        with torch.no_grad():
            for batch in test_dataloader:
                ids, mask = batch
                ids = ids.to(device)
                mask = mask.to(device)
                
                
                output = model(ids, token_type_ids=None, 
                              attention_mask=mask)
                predicted = output[0].detach().cpu().numpy()
                batch_predicted = np.argmax(predicted, axis=1)
                predicted_classes.extend(batch_predicted)
                
                pbar.set_description(
                    'Step: {:.4f}'.format(step))    
                pbar.update(1)
                step += 1
                
    return predicted_classes

def train(model, train_dataloader, val_dataloader, test_dataloader, criterion, 
          optimizer, device="cuda:0", n_epochs=10, scheduler=None):
    model.to(device)
    lrs = []
    for epoch in range(n_epochs):
        print('Epoc №', epoch)
        print('Train')
        # Pass the device explicitly so a CPU-only run does not fall back to "cuda:0".
        train_one_epoch(model, train_dataloader, criterion, optimizer, device)
        torch.save(model.state_dict(), 'model')
        print('Model state saved')
        print('Validation')
        losses, predicted_classes, true_classes = predict(model, 
                                                          val_dataloader, 
                                                          criterion, 
                                                          device)
        print('MAE: ', mae(true_classes, predicted_classes))
        print('Test')
        predicted_classes = predict_without_labels(model, 
                                                   test_dataloader, 
                                                   device)
        # The submission file is overwritten after every epoch.
        df_answer_kaggle = df_kaggle["review_id"].to_frame()
        df_answer_kaggle["score"] = [dict_score_class[item] for item in predicted_classes]
        df_answer_kaggle.to_csv("data/submission.csv", index=False)
        print('Submission saved')
 
        lrs.append(optimizer.param_groups[0]['lr'])
        # The scheduler is optional (default None), so step it only when provided.
        if scheduler is not None:
            scheduler.step()
    

In [None]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

learning_rate = 1e-05
n_epochs = 3

optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

In [None]:
torch.cuda.empty_cache()

In [None]:
train(model, train_dataloader, val_dataloader, test_dataloader, 
      criterion, optimizer, device, n_epochs, scheduler)

Epoc № 0
Train


CrossEntropyLoss: 2.1431; MAE: 0.8458: 100%|██████████| 2344/2344 [1:00:49<00:00,  1.56s/it]


Model state saved
Validation


CrossEntropyLoss: 1.8889; MAE: 0.9375: 100%|██████████| 782/782 [07:33<00:00,  1.72it/s]


MAE:  0.806612
Test


Step: 624.0000: 100%|██████████| 625/625 [06:02<00:00,  1.73it/s]


Submission saved
Epoc № 1
Train


CrossEntropyLoss: 1.8722; MAE: 0.7125: 100%|██████████| 2344/2344 [1:00:48<00:00,  1.56s/it]


Model state saved
Validation


CrossEntropyLoss: 1.7562; MAE: 0.6750: 100%|██████████| 782/782 [07:33<00:00,  1.72it/s]


MAE:  0.7771239999999999
Test


Step: 624.0000: 100%|██████████| 625/625 [06:02<00:00,  1.72it/s]


Submission saved
Epoc № 2
Train


CrossEntropyLoss: 2.3880; MAE: 0.9375: 100%|██████████| 2344/2344 [1:00:57<00:00,  1.56s/it]


Model state saved
Validation


CrossEntropyLoss: 1.7811; MAE: 0.8375: 100%|██████████| 782/782 [07:33<00:00,  1.72it/s]


MAE:  0.7456320000000001
Test


Step: 624.0000: 100%|██████████| 625/625 [06:02<00:00,  1.72it/s]

Submission saved





## Loading the trained model: to try it out, define the model as above, download the weights from the link, and then load the state with the command below
### Could not push it to git because the file is too large :(
Link to the trained model: https://drive.google.com/file/d/1qR2UPEAs6OFdcUUamZLExR8qtSYbd7Tl/view?usp=sharing

In [None]:
model.load_state_dict(torch.load('model'))

### Contest (up to 3 points)

Based on all your experiments, choose the model you consider the best and submit it to the contest. Depending on your score on the public leaderboard, we will award points:

 - <0.77 - 3 points
 - [0.77; 0.78) - 2 points
 - [0.78; 0.8) - 1 point
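The thresholds above can be written as a small function. The name `contest_points` is hypothetical, and returning 0 for a MAE of 0.8 or higher is an assumption, since the list does not mention that case.

```python
def contest_points(public_mae):
    """Points for the contest part, following the thresholds listed above."""
    if public_mae < 0.77:
        return 3
    if public_mae < 0.78:
        return 2
    if public_mae < 0.8:
        return 1
    return 0  # assumption: scores of 0.8 and above earn no contest points
```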

![](https://i.imgur.com/gCxNpWC.png)