# Sentiment analysis of show reviews

The goal of this analysis is to ...
* Get hands-on experience with packages and tools for analysing Russian language (natasha, nltk, spacy, rnnmorph, pymorphy2)
* Investigate available pre-trained models for Russian language (wor2vec, fasttext, navec, models from sber, deeppavlov and others)
* Learn how to finetune BERT-like models

## Imports

In [1]:
import gc
import os
import re
import sys
import warnings
from typing import List, Tuple

import dateparser
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

tqdm.pandas()

%matplotlib inline
%config InlineBackend.figure_format='retina'

In [2]:
SEED = 42

## Data

### Loading data

In [3]:
relative_path = "../../../../"

In [4]:
%%time

reviews_df = pd.read_parquet(os.path.join(relative_path, "data/mt_reviews.parquet"))
reviews_df.shape

CPU times: total: 3.7 s
Wall time: 3.87 s


(206737, 9)

### Dataset overview

In [5]:
reviews_df.sample(n=10, random_state=SEED)

Unnamed: 0,show_id,user_id,type,datetime,sentiment,subtitle,review,review_usefulness,score
196236,257386,28525,series,2010-08-27 11:23:00,good,Вечный город.,К истории Древнего мира у меня отношение особо...,40,
128582,688832,44953,movie,2015-02-20 23:19:00,bad,50 оттенков разочарования,"Говорю сразу, книги читала все, да и по нескол...",5,3.0
159673,349,33910,movie,2018-04-15 21:04:00,good,"Господи, спасибо, что не пронесло мимо","Есть два типа фильмов, мой друг. Одни ты прост...",4,
109244,686898,44065,movie,2019-11-09 11:58:00,neutral,Что же стало с клоуном?,"Итак, в первую очередь хотелось бы отметить то...",2,6.5
92610,61455,66782,movie,2017-11-27 18:52:00,good,Они отказываются подчиняться,"Автора этого замечательного фильма, Джосса Уэд...",5,
43727,491724,44563,movie,2012-01-28 23:18:00,good,Жестокая правда,"Финчер снова нас поразил, он всегда нас поража...",7,10.0
50195,102130,67145,movie,2009-08-17 11:56:00,good,"Преодолеть 2 года жизни, что встретиться","Слышал о фильме много, и в основном положитель...",11,10.0
43651,491724,66283,movie,2012-02-19 15:33:00,bad,"Мужчины, которые ненавидели женщин.",Я попробовала рассматривать этот фильм с двух ...,3,6.0
48775,7226,7905,movie,2014-02-20 03:44:00,good,,"«Догвилль» - это один из тех редких фильмов, п...",3,10.0
75262,458,33255,movie,2013-06-22 21:14:00,good,Тайна закрытой двери,Я имела счастье смотреть этот мультфильм в кин...,6,10.0


### Looking at reviews

In [6]:
for review in reviews_df.sample(n=10, random_state=SEED)["review"].values[:3]:
    print(review.replace("<p>", "\n"))
    print("\n")

К истории Древнего мира у меня отношение особое, как и к самой истории в целом. Меня манит великолепие дворцов разных эпох, сильные характеры и яркие личности, грандиозные и эпические битвы. Я обожаю историю Древнего Рима, Греции, Вавилонии, Египта, обожаю историю Средневековья. Есть в прошлом что-то такое притягательное и таинственное, то, что манит. Иногда мне кажется, что я человек той эпохи, а не этой. Поэтому, в силу таких моих личных взглядов исторические картины представляют для меня наибольший интерес. И именно поэтому совершенно неудивительно, что наткнувшись на сериал 'Рим' я им серьёзно увлеклась.
 Римская империя - одна из самых известных цивилизаций, которая положила начало многим основам нашей современной жизни. Римом правили по истине великие люди. Именно Рим дал нам понятие демократии, сената, республики и многих других политических вещей. До сих пор многие римские законы действуют в нашем обществе, конечно, видоизменённые под наш век. Но факт остаётся фактом, Римская и

In the previous step I've removed the scores from the reviews so it is now safe to continue with baseline model creation.

### Selecting needed columns

For baseline model we're interested only in `sentiment` and `review_body` columns

In [7]:
df = reviews_df[["sentiment", "review"]]

In [8]:
del reviews_df
gc.collect()

4

### Splitting the data

In [9]:
train_df, test_df = train_test_split(
    df, test_size=0.1, random_state=SEED, stratify=df["sentiment"]
)
train_df.shape, test_df.shape

((186063, 2), (20674, 2))

In [10]:
train_df["sentiment"].value_counts(normalize=True)

good       0.720331
neutral    0.149971
bad        0.129698
Name: sentiment, dtype: float64

In [11]:
test_df["sentiment"].value_counts(normalize=True)

good       0.720325
neutral    0.149995
bad        0.129680
Name: sentiment, dtype: float64

## Modelling

### Text encoding

For baseline model, I've decided to start with TF-IDF and Logistic Regression

#### Hyperparameter Investigation

##### `lowercase`

In [16]:
%%time

vectorizer = CountVectorizer(lowercase=False)
vectors_wo_lowercase = vectorizer.fit_transform(train_df["review"])

print(
    f"The size of the train dataset is {vectors_wo_lowercase.shape} with lowercase turned off"
)

The size of the train dataset is (186063, 785942) with lowercase turned off
CPU times: total: 40.1 s
Wall time: 40.1 s


In [17]:
%%time

vectorizer = CountVectorizer()
vectors_w_lowercase = vectorizer.fit_transform(train_df["review"])

print(
    f"The size of the train dataset is {vectors_w_lowercase.shape} with lowercase turned on"
)

The size of the train dataset is (186063, 669383) with lowercase turned on
CPU times: total: 41.8 s
Wall time: 41.8 s


In [18]:
vectors_wo_lowercase.shape[1] - vectors_w_lowercase.shape[1]

116559

The difference in vocabulary size without making all characters lowercase and with lowercase is more than 100 000, so we better stick to lowercase 

##### `max_df` and `min_df`

`min_df` is used for removing terms that appear **too infrequently**. For example:

 - `min_df = 0.01` means "ignore terms that appear in **less than 1% of the documents**".
 - `min_df = 5` means "ignore terms that appear in **less than 5 documents**".  
 
The default `min_df` is `1`, which means "ignore terms that appear in **less than 1 document**".  
Thus, the default setting does not ignore any terms.

`max_df` is used for removing terms that appear **too frequently**, also known as "corpus-specific stop words". For example:

 - `max_df = 0.50` means "ignore terms that appear in **more than 50% of the documents**".
 - `max_df = 25` means "ignore terms that appear in **more than 25 documents**".  
 
The default `max_df` is `1.0`, which means "ignore terms that appear in **more than 100% of the documents**".  
Thus, the default setting does not ignore any terms.

In [19]:
vectorizer.get_feature_names_out()[:50]

array(['00', '000', '0000', '00000', '000000',
       '000000000000000000попкорн000000000000', '000000000000001',
       '000000000000на', '00000000000во', '00000000000данной',
       '00000000000есть000000000000000',
       '00000000000есть000000000000000000', '0000000000жевать',
       '0000000000ненавижу00000000', '00000000016', '000000000надо',
       '000000000разговаривать0000000000', '00000000визуальная',
       '00000001', '000001', '00000громко', '00000точек', '00001',
       '00007', '0001', '0002', '000доктора', '000какой',
       '000косметические', '000р', '000теряются', '001', '002', '003',
       '00381', '006', '007', '00в', '00вых', '00е', '00м', '00по', '00с',
       '00седьмого', '00х', '00ые', '00ых', '01', '011', '013'],
      dtype=object)

We can see that if we do not limit the vocabulary, we will have very infrequent words, so we better do it.  
For that we have to choose the `min_df` and `max_df` thresholds.

In [21]:
%%time

vectorizer = CountVectorizer(min_df=0.8)
vectors = vectorizer.fit_transform(train_df["review"])
vectors.shape

CPU times: total: 39.3 s
Wall time: 39.3 s


(186063, 7)

In [22]:
vectorizer.get_feature_names_out()

array(['как', 'на', 'не', 'но', 'то', 'что', 'это'], dtype=object)

These words are in the 80% of all reviews and it is understandable.  

In [24]:
%%time

MIN_DF = 0.01
vectorizer = CountVectorizer(min_df=MIN_DF)
vectors = vectorizer.fit_transform(train_df["review"])

print(
    f"The size of the train dataset is {vectors.shape} with lowercase turned on and min_df={MIN_DF}"
)

The size of the train dataset is (186063, 3284) with lowercase turned on and min_df=0.01
CPU times: total: 39.4 s
Wall time: 39.4 s


In [25]:
vectorizer.get_feature_names_out()[:50]

array(['10', '100', '11', '12', '13', '15', '16', '18', '20', '2012',
       '21', '30', '3d', '40', '50', '60', '70', '80', '90', 'dc',
       'marvel', 'of', 'the', 'абсолютно', 'аватар', 'автор', 'автора',
       'авторов', 'авторы', 'аж', 'актер', 'актера', 'актерам',
       'актерами', 'актерах', 'актеров', 'актером', 'актерская',
       'актерский', 'актерского', 'актерской', 'актерскую', 'актеры',
       'актриса', 'актрисы', 'актёр', 'актёра', 'актёров', 'актёрская',
       'актёрский'], dtype=object)

In [27]:
%%time

MIN_DF = 0.01
MAX_DF = 0.9
vectorizer = CountVectorizer(min_df=MIN_DF, max_df=MAX_DF)
vectors = vectorizer.fit_transform(train_df["review"])

print(
    f"The size of the train dataset is {vectors.shape} with lowercase turned on and min_df={MIN_DF} and max_df={MAX_DF}"
)

The size of the train dataset is (186063, 3281) with lowercase turned on and min_df=0.01 and max_df=0.9
CPU times: total: 39.2 s
Wall time: 39.2 s


##### `ngram_range`

The lower and upper boundary of the range of n-values for different n-grams to be extracted.  
All values of n such that min_n <= n <= max_n will be used.   

For example an `ngram_range` of `(1, 1)` means only `unigrams`, `(1, 2)` means `unigrams` and `bigrams`, and `(2, 2)` means only `bigrams`.

In [29]:
%%time


NGRAM_RANGE = (1, 3)
vectorizer = CountVectorizer(ngram_range=NGRAM_RANGE, min_df=MIN_DF)
train_vectors = vectorizer.fit_transform(train_df["review"])

print(
    f"The size of the train dataset is {vectors.shape} with lowercase turned on and min_df={MIN_DF} and ngram_range={NGRAM_RANGE}"
)

The size of the train dataset is (186063, 3281) with lowercase turned on and min_df=0.01 and ngram_range=(1, 3)
CPU times: total: 5min 35s
Wall time: 7min 7s


In [30]:
vectorizer.get_feature_names_out()[:50]

array(['10', '10 лет', '100', '11', '12', '13', '15', '16', '18', '20',
       '2012', '21', '30', '3d', '40', '50', '60', '70', '80', '90', 'dc',
       'marvel', 'of', 'the', 'абсолютно', 'абсолютно все',
       'абсолютно не', 'аватар', 'автор', 'автора', 'авторов', 'авторы',
       'аж', 'актер', 'актера', 'актерам', 'актерами', 'актерах',
       'актеров', 'актером', 'актерская', 'актерская игра', 'актерский',
       'актерский состав', 'актерского', 'актерской', 'актерской игры',
       'актерскую', 'актерскую игру', 'актеры'], dtype=object)

#### Vectorizing reviews with TF-IDF

In [12]:
vectorizer_params = {
    "min_df": 0.01,
    "ngram_range": (1, 2),
    "max_features": 10000,
    "tokenizer": lambda s: s.split(),
}

vectorizer_article = TfidfVectorizer(**vectorizer_params)

In [13]:
%%time

X_train_review = vectorizer_article.fit_transform(train_df["review"])

CPU times: total: 3min 19s
Wall time: 3min 19s


In [32]:
X_train_review

<186063x4609 sparse matrix of type '<class 'numpy.float64'>'
	with 32904460 stored elements in Compressed Sparse Row format>

In [14]:
%%time

X_test_review = vectorizer_article.transform(test_df["review"])

CPU times: total: 7.52 s
Wall time: 7.52 s


### Label Encoding

In [15]:
le = LabelEncoder()

In [16]:
train_labels = le.fit_transform(train_df["sentiment"])
test_labels = le.transform(test_df["sentiment"])

In [17]:
train_labels.shape, test_labels.shape

((186063,), (20674,))

### LogReg

#### Training

In [33]:
log_reg = LogisticRegression(
    random_state=SEED, n_jobs=-1, solver="sag", max_iter=100_000
)

In [34]:
%%time

log_reg.fit(X_train_review, train_labels)

CPU times: total: 11.7 s
Wall time: 11.7 s


LogisticRegression(max_iter=100000, n_jobs=-1, random_state=42, solver='sag')

#### Estimation

In [29]:
averaging = "micro"

In [25]:
pred_labels = log_reg.predict(X_test_review)

In [27]:
f1 = f1_score(test_labels, pred_labels, average=averaging)

In [31]:
print(f"F1 score with {averaging}-averaging is {f1.round(3)}")

F1 score with micro-averaging is 0.793


## TODO

Add:
 - Saving train/test split
 - sparse tf-idf matrices