# Sentiment analysis of product reviews

## TF-IDF measure


TF-IDF is a statistical measure of how important a word is to a document in a collection, adjusted for the fact that some words occur more frequently in general. The weight of a word grows with its frequency in the document and shrinks with the number of documents in the collection that contain it.

TF-IDF is the product of two statistics, term frequency and inverse document frequency:

$$ tf\text{-}idf(t,d,D) = tf(t,d) \times idf(t,D) $$

A high TF-IDF weight results from a high term frequency in the given document combined with a low document frequency of the term across the whole collection.

### TF

TF (term frequency) is the ratio of the number of occurrences of a word to the total number of words in the document. It measures the importance of term $t$ within an individual document.

$ \Large tf(t,d) = \frac{n_t}{\sum_k n_k} $,

where $n_t$ is the number of times term $t$ occurs in document $d$, and the denominator is the total number of terms in document $d$.
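
As a quick illustration of the formula above, the following minimal sketch computes term frequencies for one tokenized document. The function name `term_frequency` and the toy token list are assumptions made for this example; they are not part of the notebook.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: n_t / sum_k n_k} for one tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # an empty document has no terms, so there is nothing to weight
        return {}
    return {term: n / total for term, n in counts.items()}


# toy document (illustrative only)
doc = ["товар", "хороший", "товар", "пришел"]
tf = term_frequency(doc)
print(tf["товар"])  # 2 occurrences out of 4 tokens
```

By construction, the values for one document sum to 1.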

### IDF
IDF (inverse document frequency) is the inverse of the fraction of documents in the collection that contain the word, taken on a logarithmic scale. Including IDF reduces the weight of words that appear in many documents.

 $ \Large idf(t, D) = \log \frac{|D|} { | \{ d_i \in D \mid t \in d_i \} | } $,
 
where
* $|D|$: total number of documents in the collection;
* $| \{ d_i \in D \mid t \in d_i \} |$: number of documents in collection $D$ in which the term $t$ appears (that is, documents where $n_t \neq 0$).
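
The definitions above can be sketched directly in Python. This is a minimal, unoptimized illustration on a toy corpus; the helper names `tf`, `idf` and `tf_idf` are assumptions for this example. Note that scikit-learn's `TfidfVectorizer`, used later in this notebook, by default applies a smoothed IDF and L2-normalizes each row, so its values differ from this textbook form.

```python
import math
from collections import Counter


def tf(term, doc):
    # n_t divided by the total number of terms in the document
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    # assumes the term occurs in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# toy corpus of already tokenized documents (illustrative only)
docs = [
    ["товар", "хороший"],
    ["товар", "плохой"],
    ["товар", "хороший", "спасибо"],
]

common = tf_idf("товар", docs[0], docs)  # term in every document, so idf = log(1) = 0
rare = tf_idf("плохой", docs[1], docs)   # term in one of three documents
print(common, rare)
```

A term that occurs in every document gets zero weight, while a term confined to a single document gets the largest IDF.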

## Dataset

In [1]:
from nltk.tokenize import word_tokenize
import eli5
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import pandas as pd
import pymorphy2

from tqdm import tqdm
tqdm.pandas()


In [2]:
df = pd.read_csv('women-clothing-accessories.3-class.balanced.csv', encoding='utf8', sep='\t')
df


Unnamed: 0,review,sentiment
0,качество плохое пошив ужасный (горловина напер...,negative
1,"Товар отдали другому человеку, я не получила п...",negative
2,"Ужасная синтетика! Тонкая, ничего общего с пре...",negative
3,"товар не пришел, продавец продлил защиту без м...",negative
4,"Кофточка голая синтетика, носить не возможно.",negative
...,...,...
89995,сделано достаточно хорошо. на ткани сделан рис...,positive
89996,Накидка шикарная. Спасибо большое провдо линяе...,positive
89997,спасибо большое ) продовца рекомендую.. заказа...,positive
89998,Очень довольна заказом! Меньше месяца в РБ. К...,positive


In [3]:
df['sentiment'].value_counts()


sentiment
negative    30000
neautral    30000
positive    30000
Name: count, dtype: int64

## Preprocessing

In [4]:
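# keep only 2 classes: positive and negative ('neautral' is the dataset's own label spelling)
# .copy() avoids SettingWithCopyWarning on later column assignments
df = df[df['sentiment'] != 'neautral'].copy()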


In [5]:
df.iloc[0]['review']


'качество плохое пошив ужасный (горловина наперекос) Фото не соответствует Ткань ужасная рисунок блеклый маленький рукав не такой УЖАС!!!!! не стоит за такие деньги г.......'

In [6]:
# punctuation removal (\w is Unicode-aware, so Cyrillic letters are kept)
df['review_processed'] = df['review'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
df




Unnamed: 0,review,sentiment,review_processed
0,качество плохое пошив ужасный (горловина напер...,negative,качество плохое пошив ужасный горловина напере...
1,"Товар отдали другому человеку, я не получила п...",negative,Товар отдали другому человеку я не получила по...
2,"Ужасная синтетика! Тонкая, ничего общего с пре...",negative,Ужасная синтетика Тонкая ничего общего с предс...
3,"товар не пришел, продавец продлил защиту без м...",negative,товар не пришел продавец продлил защиту без мо...
4,"Кофточка голая синтетика, носить не возможно.",negative,Кофточка голая синтетика носить не возможно
...,...,...,...
89995,сделано достаточно хорошо. на ткани сделан рис...,positive,сделано достаточно хорошо на ткани сделан рису...
89996,Накидка шикарная. Спасибо большое провдо линяе...,positive,Накидка шикарная Спасибо большое провдо линяет...
89997,спасибо большое ) продовца рекомендую.. заказа...,positive,спасибо большое продовца рекомендую заказала ...
89998,Очень довольна заказом! Меньше месяца в РБ. К...,positive,Очень довольна заказом Меньше месяца в РБ Кур...
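
To make the effect of this cleaning concrete, here is a small sketch that applies the same punctuation-stripping regular expression, plus lowercasing, to a single string. The helper name `clean_text` and the sample sentence are assumptions for illustration, and `str.split` is only a stand-in for a real tokenizer such as `word_tokenize`.

```python
import re


def clean_text(text):
    # drop every character that is neither a word character nor whitespace
    no_punct = re.sub(r'[^\w\s]', '', text)
    return no_punct.lower()


sample = "Ужасная синтетика! Тонкая, ничего общего."
cleaned = clean_text(sample)
tokens = cleaned.split()  # whitespace split as a simple tokenizer stand-in
print(cleaned)
print(tokens)
```

Because `\w` also matches the underscore, underscores survive this step; digits are kept as well.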


In [7]:
# to lowercase
df['review_processed'] = df['review_processed'].str.lower()




In [8]:
# tokenization
df['review_processed'] = df['review_processed'].progress_apply(lambda x: word_tokenize(x))


100%|██████████| 60000/60000 [00:14<00:00, 4008.25it/s]


In [9]:
df['review_processed'].iloc[1]


['товар',
 'отдали',
 'другому',
 'человеку',
 'я',
 'не',
 'получила',
 'посылку',
 'ладно',
 'хоть',
 'деньги',
 'вернули']

### Lemmatization

In [10]:
morph = pymorphy2.MorphAnalyzer()


In [11]:
# take the most probable parse of each token and keep its normal form (lemma)
df['review_lemmatized'] = df['review_processed'].progress_apply(lambda text: [morph.parse(word)[0].normal_form for word in text])


100%|██████████| 60000/60000 [03:09<00:00, 316.38it/s]
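
Lemmatization is the slowest step here (about three minutes for 60,000 reviews, per the progress bar above). Since the same word forms recur across reviews, memoizing per-word results is one possible speed-up. The sketch below shows the idea with a toy lookup table standing in for `morph.parse(word)[0].normal_form`, so it runs without pymorphy2; the names `lemmatize_word`, `lemmatize_tokens` and `_TOY_LEMMAS` are hypothetical.

```python
from functools import lru_cache

# toy stand-in for the morphological analyzer (illustrative only)
_TOY_LEMMAS = {"отдали": "отдать", "вернули": "вернуть"}


@lru_cache(maxsize=None)
def lemmatize_word(word):
    # in the notebook this would be: morph.parse(word)[0].normal_form
    return _TOY_LEMMAS.get(word, word)


def lemmatize_tokens(tokens):
    return [lemmatize_word(token) for token in tokens]


lemmas = lemmatize_tokens(["товар", "отдали", "товар", "вернули"])
cache = lemmatize_word.cache_info()
print(lemmas)
print(cache.hits, cache.misses)  # the repeated word is served from the cache
```

How much time this saves depends on how often word forms repeat in the corpus, which is not measured here.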


## Feature extraction using TF-IDF

In [12]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams (captures collocations)


In [13]:
X = vectorizer.fit_transform(df['review_lemmatized'].apply(lambda x: ' '.join(x)))


In [14]:
# (samples, features)
X.shape


(60000, 396100)

## Training

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, df['sentiment'], test_size=0.2, random_state=26)


In [16]:
model = LogisticRegression()


In [None]:
model.fit(X_train, y_train)


## Testing

In [None]:
# column 1 of predict_proba corresponds to model.classes_[1], i.e. 'positive'
predicts = model.predict_proba(X_test)[:, 1]
metrics = roc_auc_score(y_test, predicts)


In [19]:
print(f"ROC AUC score: {metrics:.3%}")

ROC AUC score: 97.661%
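
The `[:, 1]` indexing works because scikit-learn sorts class labels, so with the string labels `negative` and `positive` the second probability column belongs to `positive`. A minimal sketch on made-up one-dimensional data (an assumption for illustration) looks the column up explicitly instead of hard-coding it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# toy, perfectly separable data (illustrative only)
X_toy = np.array([[0.0], [0.2], [0.8], [1.0]])
y_toy = np.array(["negative", "negative", "positive", "positive"])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# find the probability column of the 'positive' class
positive_col = list(toy_model.classes_).index("positive")
scores = toy_model.predict_proba(X_toy)[:, positive_col]

auc = roc_auc_score(y_toy, scores)
print(toy_model.classes_, positive_col, auc)
```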


In [20]:
# weight visualization
eli5.show_weights(estimator=model, feature_names=list(vectorizer.get_feature_names_out()), top=(20,20))

Weight?,Feature
+10.731,отличный
+10.107,хороший
+8.287,супер
+7.749,спасибо
+7.487,немного
+7.130,хорошо
+6.800,классный
+6.799,отлично
+6.477,приятный
+6.399,довольный


## Hyperparameter optimization

### Grid search

In [34]:
import time

In [35]:
parameters = {'C' : [0.25, 0.5, 0.75, 1, 5, 10], 'max_iter' : [50, 100, 150]}
# defaults: 5-fold cross-validation, candidates ranked by the estimator's accuracy
gs_clf = GridSearchCV(model, parameters)

In [None]:
start_time = time.time()
gs_clf.fit(X_train, y_train)
end_time = time.time()
gs_time = end_time - start_time

In [37]:
# read parameters by name instead of relying on dictionary order
gs_C, gs_max_iter = gs_clf.best_params_['C'], gs_clf.best_params_['max_iter']
gs_best_model = LogisticRegression(C=gs_C, max_iter=gs_max_iter)


In [38]:
gs_C, gs_max_iter

(10, 150)

In [None]:
gs_best_model.fit(X_train, y_train)

In [40]:
predicts = gs_best_model.predict_proba(X_test)[:, 1]
metrics = roc_auc_score(y_test, predicts)
print(f"ROC AUC score: {metrics:.3%}")

ROC AUC score: 98.044%
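
Two details of this search may be worth a second look. The selected `C = 10` is the largest value in the grid, which may hint that the optimum lies beyond it. In addition, `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), so the separately retrained model is optional. The sketch below, on synthetic data from `make_classification` (an assumption for illustration), uses `best_estimator_` directly and ranks candidates by ROC AUC, the metric reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the TF-IDF matrix (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {'C': [0.25, 1, 10], 'max_iter': [100, 150]}
search = GridSearchCV(LogisticRegression(), param_grid, scoring='roc_auc')
search.fit(X_demo, y_demo)

best_C = search.best_params_['C']
best_model = search.best_estimator_   # already refit on all of X_demo
demo_scores = best_model.predict_proba(X_demo)[:, 1]
print(search.best_params_, round(search.best_score_, 3))
```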




### Random search

In [62]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

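# scipy's uniform(loc, scale) samples from [loc, loc + scale], so C is drawn from [0.25, 10.25]
distributions = {'C' : uniform(0.25,10), 'max_iter' : [50, 100, 150]}
# n_iter defaults to 10 sampled candidates
rs_clf = RandomizedSearchCV(model, distributions)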

In [None]:
start_time = time.time()
rs_clf.fit(X_train,y_train)
end_time = time.time()

rs_time = end_time-start_time

In [64]:
# read parameters by name instead of relying on dictionary order
rs_C, rs_max_iter = rs_clf.best_params_['C'], rs_clf.best_params_['max_iter']
rs_best_model = LogisticRegression(C=rs_C, max_iter=rs_max_iter)

In [None]:
rs_best_model.fit(X_train, y_train)

In [66]:
predicts = rs_best_model.predict_proba(X_test)[:, 1]
metrics = roc_auc_score(y_test, predicts)
print(f"ROC AUC score: {metrics:.3%}")

ROC AUC score: 98.037%
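
Regularization strength is often searched on a logarithmic scale. As a hedged alternative to the uniform distribution used above, the sketch below draws `C` from `scipy.stats.loguniform` over the same range as the grid and fixes `random_state` so the sampled candidates are reproducible. The synthetic data and the choice of `n_iter=5` are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in for the TF-IDF matrix (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

distributions = {'C': loguniform(0.25, 10), 'max_iter': [50, 100, 150]}
search = RandomizedSearchCV(
    LogisticRegression(), distributions, n_iter=5, random_state=26
)
search.fit(X_demo, y_demo)

# the C values that were actually tried
sampled_C = [float(c) for c in search.cv_results_['param_C']]
print(sampled_C)
print(search.best_params_)
```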




### Comparing

In [67]:
print(f"Grid search runtime: {gs_time/60:.3} min. Random search runtime: {rs_time/60:.3} min.\nRandom search yields nearly the same ROC AUC and runs {gs_time/rs_time:.3} times faster.")

Grid search runtime: 8.57 min. Random search runtime: 5.23 min.
Random search yields nearly the same ROC AUC and runs 1.64 times faster.
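
Much of this speed-up is probably explained by the number of model fits rather than by the search strategy itself. The sketch below counts cross-validation fits for both searches, assuming the scikit-learn defaults that the notebook leaves unchanged (5 folds, `n_iter=10`); the variable names are chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'C': [0.25, 0.5, 0.75, 1, 5, 10], 'max_iter': [50, 100, 150]}
n_folds = 5          # default cv for GridSearchCV and RandomizedSearchCV
n_random_iter = 10   # default n_iter for RandomizedSearchCV

grid_candidates = len(ParameterGrid(parameters))  # every combination in the grid
grid_fits = grid_candidates * n_folds
random_fits = n_random_iter * n_folds
fit_ratio = grid_fits / random_fits
print(grid_candidates, grid_fits, random_fits, fit_ratio)
```

The ratio of fits is in the same range as the measured runtime ratio; differences in per-fit cost across values of `C` and `max_iter` may account for the gap.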
