# **Обучение модели классификации отзывов на базе сета данных IMDB**

In [261]:
!pip install catboost #Устанавливаем Катбуст



Импортируем бибилиотеки

In [262]:
import pandas as pd
import numpy as np
import sys
import html
import sklearn
import os.path
from fastai.text import *
import re
from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix, classification_report,roc_auc_score, mean_absolute_error, mean_squared_error


Скачиваем датасет с сайта. 

In [263]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2020-12-13 23:54:45--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.1’


2020-12-13 23:54:46 (59.8 MB/s) - ‘aclImdb_v1.tar.gz.1’ saved [84125825/84125825]



Данные находятся в архиве, разархивируем  их.

In [264]:
!tar -xf aclImdb_v1.tar.gz

In [265]:
!ls /content/aclImdb

imdbEr.txt  imdb.vocab	README	test  train


Создадим переменную *PATH*, куда запишем путь и переменную *CLASSES* с классами

In [266]:
CLASSES = ['neg', 'pos']
PATH=Path('/content/aclImdb//')

Напишем функцию, которая пробежится по папкам и соберет все отзывы в один файл, заберет окончание названия файлов. Окончание в файлах - оценка пользователя.

In [267]:
def get_texts(path):
    """
Проходим по указаной папке, собираем данные в один файл
    """
    
    # создаем пустые списки для файлов
    texts,labels,score = [],[],[]

    
    # пройдемся по всем папкам
    for idx,label in enumerate(CLASSES):
       
        # пройдемся по всем файлам 
        for fname in (path/label).glob('*.*'):
      
            # откроем файл и склеим
            
            text_clear = re.sub(r'[<br />]', ' ', fname.open('r').read())
            texts.append(text_clear)

            score.append(str(fname).split('_')[1:][0].split('.txt')[0])
    
            # open 
            labels.append(idx)
    return np.array(texts),np.array(labels), np.array(score)


trn_texts,trn_labels,trn_score = get_texts(PATH/'train')
val_texts,val_labels,val_score = get_texts(PATH/'test')

Посмотрим, что получилось.

In [268]:
trn_texts[17001],trn_labels[17001],trn_score[17001]
                                          

('This was a movie that I had hea d a out all my life g owing up,  ut had neve  seen it until a few yea s ago. It\'s  eputation t uly p oceeded it. I knew of Michael Mye s, had seen the mask, saw comme cials fo  all of the c ummy sequels that followed. But I was g owing up du ing the decade whe e Jason and F eddy had a deadly g ip on the ho  o  game, and neve  thought much of the Halloween f anchise. Boy, how I was  eing cheated with cheap knock offs.            Halloween is a genuinely te  ifying movie. Now,  y today\'s standa ds, it isn\'t as g aphic and visce al,  ut this film delive s on all the othe  levels most ho  o  movies fail to achieve today. The atmosphe e that John Ca pente  c eates is so c eepy, and the fact that it is set in a quaint, mid-west town is a testament to his a ility. The lighting effects a e down  ight ho  ifying, with "The Shape" seemingly appea ing and disappea ing into the shadows at will. The simple yet   utally effective music sco e only adds to the susp

Создадим словарик и обернем его в дата фрейм.  
**mood** = 0/1 (негативный/позитивный отзыв.)  
**star** оценка, поставленная пользователями.

In [269]:
d_train = {'review': trn_texts, 'mood': trn_labels, 'stars':trn_score }
data_val_test ={'review': val_texts, 'mood': val_labels, 'stars':val_score }

In [270]:
data_train = pd.DataFrame(d_train)
data_val_test = pd.DataFrame(data_val_test)

In [271]:
data_train

Unnamed: 0,review,mood,stars
0,What on ea th has ecome of ou dea Ramu? Is ...,0,1
1,"So, Steve I win. You have to admi e a man who ...",0,4
2,This was the wo st acted movie I've eve seen ...,0,1
3,Disappointing musical ve sion of Ma ga et Land...,0,4
4,"Jim Belushi is having a mid life c isis, nothi...",0,3
...,...,...,...
24995,I've ead up a little it on Che efo e watchi...,1,7
24996,"A st ong pilot, this two-hou episode does an ...",1,7
24997,"I will admit, I thought this movie wasn't goin...",1,7
24998,Fi st I was caught totally off gua d y the fi...,1,10


Поменяем тип данных в столбце stars и перемешаем датасеты

In [272]:
data_train['stars'] = data_train['stars'].astype('int32')
data_val_test['stars'] = data_val_test['stars'].astype('int32')

In [273]:
data_train = sklearn.utils.shuffle(data_train, random_state =42)
data_train.reset_index(inplace=True, drop=True)
data_train.head()

Unnamed: 0,review,mood,stars
0,Amy Poehle and Rachel D atch a e among the fu...,0,4
1,"Yea s ago, I found a "" a gain in"" copy of thi...",1,7
2,This is the fi st time I eve saw a movie with...,0,4
3,This has long een one of my favou ite adaptat...,1,9
4,If you've ead Mothe Night and enjoyed it so ...,1,7


In [274]:
data_val_test = sklearn.utils.shuffle(data_val, random_state =42)
data_val_test.reset_index(inplace=True, drop=True)
data_val_test.head()

Unnamed: 0,review,mood,stars
0,I have to say that this movie was not what i e...,0,1
1,"I saw this on DVD with subtitles, which made i...",1,10
2,When I watched this movie it was an afternoon ...,0,4
3,"i just wanted to say i liked this movie a lot,...",1,7
4,Aside from the horrendous acting and the ridic...,0,3


Разобьем тестовые данные на валидационную и тестовую выборку.

In [275]:
data_test = data_val_test.sample(frac=0.5,random_state=42).copy()
data_valid = data_val_test[~data_val_test.index.isin(data_test.index)].copy()

Определим и обучим модель Catboost, 

 **Целевой признак:** Настроение коментария.

In [276]:
model = CatBoostClassifier(verbose=100,
                           learning_rate=0.2,
                           early_stopping_rounds=200,
                           eval_metric='F1'
                           )

In [277]:
model.fit(data_train[['review']],data_train[['mood']],
          eval_set=(data_valid[['review']],data_valid[['mood']]),
          text_features=['review'])

0:	learn: 0.8372111	test: 0.7782912	best: 0.7782912 (0)	total: 202ms	remaining: 3m 21s
100:	learn: 0.8995588	test: 0.8292093	best: 0.8292093 (100)	total: 20.6s	remaining: 3m 3s
200:	learn: 0.9278646	test: 0.8319731	best: 0.8330935 (157)	total: 40.7s	remaining: 2m 41s
300:	learn: 0.9475991	test: 0.8358256	best: 0.8358516 (299)	total: 1m 1s	remaining: 2m 21s
400:	learn: 0.9621867	test: 0.8383337	best: 0.8388417 (347)	total: 1m 21s	remaining: 2m 1s
500:	learn: 0.9737819	test: 0.8377590	best: 0.8389177 (409)	total: 1m 41s	remaining: 1m 40s
600:	learn: 0.9833380	test: 0.8381612	best: 0.8389177 (409)	total: 2m 1s	remaining: 1m 20s
Stopped by overfitting detector  (200 iterations wait)

bestTest = 0.8389177285
bestIteration = 409

Shrink model to first 410 iterations.


<catboost.core.CatBoostClassifier at 0x7fecc03b0b70>

Рассмотрим метрики модели на тестовых данных:

In [278]:
roc_auc_score(data_test['mood'], model.predict_proba(data_test[['review']])[:,1])

0.9132599629820211

In [279]:
print(classification_report(data_test['mood'],model.predict(data_test[['review']])))

              precision    recall  f1-score   support

           0       0.85      0.82      0.83      6270
           1       0.83      0.85      0.84      6230

    accuracy                           0.84     12500
   macro avg       0.84      0.84      0.84     12500
weighted avg       0.84      0.84      0.84     12500



Определим и обучим модель Catboost, 


 **Целевой признак:** оценка пользователя.

In [280]:
model_star = CatBoostClassifier(verbose=100,early_stopping_rounds=200)

In [281]:
model_star.fit(data_train[['review']],data_train[['stars']],
          eval_set=(data_valid[['review']],data_valid[['stars']]),
          text_features=['review'])

Learning rate set to 0.115156
0:	learn: 1.9775586	test: 1.9918904	best: 1.9918904 (0)	total: 2.49s	remaining: 41m 26s
100:	learn: 1.4150070	test: 1.6435956	best: 1.6432961 (99)	total: 3m 2s	remaining: 27m 4s
200:	learn: 1.3072172	test: 1.6143785	best: 1.6143785 (200)	total: 6m 2s	remaining: 24m
300:	learn: 1.2295019	test: 1.6049523	best: 1.6049523 (300)	total: 9m 1s	remaining: 20m 58s
400:	learn: 1.1652606	test: 1.6000225	best: 1.5999428 (379)	total: 12m 1s	remaining: 17m 57s
500:	learn: 1.1092948	test: 1.5973684	best: 1.5969896 (493)	total: 14m 59s	remaining: 14m 56s
600:	learn: 1.0541228	test: 1.5948338	best: 1.5948338 (600)	total: 17m 59s	remaining: 11m 56s
700:	learn: 1.0051106	test: 1.5935970	best: 1.5932770 (688)	total: 20m 58s	remaining: 8m 56s
800:	learn: 0.9585214	test: 1.5935471	best: 1.5932770 (688)	total: 23m 59s	remaining: 5m 57s
900:	learn: 0.9143291	test: 1.5944567	best: 1.5928719 (829)	total: 27m 1s	remaining: 2m 58s
999:	learn: 0.8757688	test: 1.5982603	best: 1.5928719

<catboost.core.CatBoostClassifier at 0x7fecc03b0fd0>

In [282]:
print(classification_report(data_test['stars'],model_star.predict(data_test[['review']])))

              precision    recall  f1-score   support

           1       0.43      0.84      0.57      2555
           2       0.24      0.04      0.07      1113
           3       0.29      0.10      0.15      1283
           4       0.36      0.22      0.27      1319
           7       0.32      0.23      0.27      1155
           8       0.32      0.11      0.16      1436
           9       0.46      0.04      0.07      1160
          10       0.42      0.78      0.55      2479

    accuracy                           0.40     12500
   macro avg       0.36      0.30      0.26     12500
weighted avg       0.37      0.40      0.33     12500



In [287]:
print(mean_squared_error(data_test['stars'],model_star.predict(data_test[['review']]), squared=False))

3.0537714387294934


In [296]:
print(mean_absolute_error(data_test['stars'],model_star.predict(data_test[['review']])))

1.84344


Напишем функцию для проверки, с "живым" коментарием.

In [285]:
def comment_analyzer(comment):
  
    data =  pd.DataFrame(comment, columns=['review'])
    rating = model_star.predict(data) # 1 - 10
    comment_type = model.predict(data) # neg / pos
    # 
 
    return rating, comment_type

Сохраним модели, для дальнейшей работы.

In [290]:
model.save_model('mood')
model.save_model('Star')