In this notebook, I use NLP (a pretrained RoBERTa model) to estimate the missing ratings of roughly 2,500 observations.

In [1]:
import pandas as pd
import numpy as np
from simpletransformers.classification import ClassificationModel
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

Since this is not the main focus of the project, I use the Simple Transformers library, which gives decent results quickly.

The library accepts a dataframe with two columns ('text' and 'labels'), preprocesses it, and fine-tunes the base RoBERTa model on it (transfer learning). Our problem is essentially a multiclass classification problem with 5 classes. One major problem in our data is class imbalance, as seen earlier. To address it, I use class weights: I downweight the 5-star reviews quite heavily and upweight some of the less common classes, which improves the model results significantly. I did most of the coding in Kaggle kernels, since my PC does not support CUDA, so the code here is only a general reference.
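The weights used later in this notebook were tuned by hand. As a possible starting point (an assumption, not the method used here), class weights can be derived from inverse class frequencies and scaled to a mean of 1. The label counts below are made up for illustration, and `inverse_frequency_weights` is a hypothetical helper.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / class count, scaled to mean 1."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes).astype(float)
    if (counts == 0).any():
        raise ValueError("every class needs at least one example")
    # Same heuristic as scikit-learn's 'balanced' class weights.
    weights = counts.sum() / (num_classes * counts)
    return weights / weights.mean()

# Hypothetical, heavily imbalanced labels (0 = 1 star ... 4 = 5 stars).
toy_labels = [0] * 5 + [1] * 3 + [2] * 4 + [3] * 8 + [4] * 80
toy_weights = inverse_frequency_weights(toy_labels, num_classes=5)
print(np.round(toy_weights, 2))
```

The rarest class gets the largest weight and the dominant 5-star class the smallest; such values usually still need manual adjustment against a validation split.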

In [134]:
df = pd.read_csv(r'E:\parsing project\merged_data.csv')

In [3]:
df[['review_text', 'num_rating', 'for_nlp']]

Unnamed: 0,review_text,num_rating,for_nlp
0,one of my favorites!!! i don't care when i wea...,5.0,0
1,i love it! i pondered on it for awhile. it’s e...,5.0,0
2,love this and the travel spray bottle is so co...,5.0,0
3,i love the fragrance which is light and not ov...,5.0,0
4,pretty fragrance i get a lot of compliments on...,5.0,0
...,...,...,...
39363,"i wear this fragrance, especially in the sprin...",,1
39364,i love love love love love this perfume. when ...,,1
39365,burbbery weekend is a wonderful strong scent w...,,1
39366,this fragrance smell so good it's such a matur...,,1


In [136]:
nlp_index = df[df['for_nlp'] == 1].index
rest_index = df[df['for_nlp'] == 0].index

In [108]:
# named eval_df to avoid shadowing Python's built-in eval()
eval_df = df.loc[nlp_index][['review_text', 'num_rating']]

In [109]:
data = df.loc[rest_index][['review_text', 'num_rating']].drop_duplicates()

In [110]:
# shift the ratings from 1-5 to 0-4: the model expects class labels 0..num_labels-1
data['num_rating'] = data['num_rating'].astype(int) - 1

In [9]:
data.head(1)

Unnamed: 0,review_text,num_rating
424,absolutely worth the hype - pineapple and soft...,4


In [10]:
data = data.rename(columns = {'review_text': 'text', 'num_rating': 'labels'})

In [None]:
X = data['text']
y = data['labels']

# stratified 80/20 split; features and labels are joined back into dataframes in the next cell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)
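To illustrate what `stratify` does, here is a small sketch on made-up data (the toy dataframe and its class counts are assumptions). It also passes the whole dataframe to `train_test_split`, which is one way to avoid merging the text and labels back together afterwards.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 five-star, 10 four-star, 10 one-star reviews (already shifted to 0-4).
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "labels": [4] * 80 + [3] * 10 + [0] * 10,
})

# Splitting the dataframe directly keeps 'text' and 'labels' together.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, shuffle=True, stratify=toy["labels"]
)

# Class shares should match between the two splits.
train_share = toy_train["labels"].value_counts(normalize=True).sort_index()
test_share = toy_test["labels"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```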

In [None]:
train = pd.DataFrame(X_train).merge(pd.DataFrame(y_train), left_index = True, right_index = True)
test = pd.DataFrame(X_test).merge(pd.DataFrame(y_test), left_index = True, right_index = True)

# reviews whose star rating is missing; these are the texts to predict
predict_this = df.loc[nlp_index]['review_text']
predict_this = predict_this.reset_index(drop=True)

Training takes a long time on a CPU. Since my graphics card does not support CUDA, I used Kaggle kernels to train this model. The code below is commented out intentionally.

In [13]:
# weight: one loss weight per class, index 0 = 1 star ... index 4 = 5 stars
#model = ClassificationModel('roberta', 'roberta-base', num_labels=5, weight=[1, 1.8, 1.15, 0.5, 0.03],
#                            args={'overwrite_output_dir': True, 'n_gpu': 1, 'num_train_epochs': 4,
#                                  'use_early_stopping': True, 'reprocess_input_data': True,
#                                  'early_stopping_delta': 0.01})

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

In [None]:
#model.train_model(train) -> retrain on the whole dataset after calibration

In [None]:
from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')
    
result, model_outputs, wrong_predictions = model.eval_model(test, f1=f1_multiclass, acc=accuracy_score)

# micro-averaged F1 (equal to accuracy for single-label multiclass): 0.87 with the weights above, 0.79 without
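A note on the metric: with exactly one label per review, every wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class), so micro-averaged F1 equals plain accuracy. The sketch below checks this by hand on made-up labels; `micro_f1` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def micro_f1(y_true, y_pred, num_classes):
    """Micro-averaged F1: pool TP, FP and FN over all classes before computing F1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = fp = fn = 0
    for c in range(num_classes):
        tp += int(np.sum((y_pred == c) & (y_true == c)))
        fp += int(np.sum((y_pred == c) & (y_true != c)))
        fn += int(np.sum((y_pred != c) & (y_true == c)))
    return 2 * tp / (2 * tp + fp + fn)

# Made-up labels on the 0-4 scale.
toy_true = [4, 4, 4, 3, 0, 1, 4, 2]
toy_pred = [4, 4, 3, 3, 0, 4, 4, 4]

accuracy = float(np.mean(np.array(toy_true) == np.array(toy_pred)))
print(micro_f1(toy_true, toy_pred, num_classes=5), accuracy)
```

Because accuracy is dominated by the majority class, a macro-averaged F1 (`average='macro'`) would be more sensitive to how well the rare classes are predicted.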

I then use this model to predict the missing ratings (model.predict) and obtain the following predictions:

In [24]:
predictions = pd.read_csv(r'E:\parsing project\preds.csv')

In [25]:
# one erroneous row ended up in the file by accident; drop it
predictions = predictions.dropna()

In [26]:
predictions['preds'] = (predictions['preds'] + 1).astype(int)

In [30]:
predictions['preds'].value_counts().sort_index()

1      76
2       9
3      51
4     125
5    2241
Name: preds, dtype: int64

The importance of the weights is visible here: without them, the model learns to predict class 5 only and ignores the rest of the classes.

Overall, the predictions are quite decent, although not perfect (for example, the sentiment can be positive while the predicted rating is low, which happens mostly with complex sentences). It is worth mentioning that the data contains some noisy reviews as well. Although such cases can be fixed manually in small datasets, that would not be feasible in production with millions of reviews. I did not fix them manually, so the final data has some noise. However, most of the reviews in the data should reflect the true valuation of the product.

Match them back:

In [137]:
df.loc[nlp_index, 'num_rating'] = predictions['preds'].values
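Assigning through `.values` is positional, so it relies on the predictions being in the same order, and of the same length, as `nlp_index`. A small sketch of a guard for the length part, on a made-up dataframe (`fill_missing_ratings` is a hypothetical helper):

```python
import pandas as pd

def fill_missing_ratings(frame, target_index, predicted):
    """Write predictions into 'num_rating' at target_index, refusing mismatched lengths."""
    if len(target_index) != len(predicted):
        raise ValueError(
            f"{len(target_index)} rows to fill but {len(predicted)} predictions"
        )
    filled = frame.copy()
    filled.loc[target_index, "num_rating"] = list(predicted)
    return filled

# Made-up data: two reviews without a rating.
toy = pd.DataFrame({
    "review_text": ["a", "b", "c"],
    "num_rating": [5.0, None, None],
    "for_nlp": [0, 1, 1],
})
toy_index = toy[toy["for_nlp"] == 1].index
toy_filled = fill_missing_ratings(toy, toy_index, [4, 1])
print(toy_filled)
```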

In [138]:
df['num_rating'].isna().sum() # no missing ratings

0

Save:

In [142]:
df.to_csv('merged_after_nlp.csv', index = False)