Using RoBERTa model for the Sentiment Classification

In [1]:
import numpy as np 
import pandas as pd 
import random as rn
import re
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In this model, we only use the user_review variable to predict the sentiment, and see whether the sentiment could accurately predict the user's suggestion. 

In [2]:
df_reviews = pd.read_csv("game_train.csv")
df_test = pd.read_csv("game_test.csv")

# convert review text to string
df_reviews["user_review"] = df_reviews["user_review"].astype(str)
df_reviews.user_review = df_reviews.user_review.apply(lambda s: s.strip())

df_test["user_review"] = df_test["user_review"].astype(str)
df_test.user_review = df_test.user_review.apply(lambda s: s.strip())

Check the balance 

In [3]:
df_reviews["user_suggestion"].value_counts()

1    5986
0    4508
Name: user_suggestion, dtype: int64

Simply data cleaning: we remove the early access review comments and remove duplicated rows. We also figure out the foul language in the review was replaced by ♥ emoji. So to increase the accurancy of the sentiment prediction, we replaced ♥ with **, as the model would consider ** as foul language.

In [5]:
#Remove the "Early Access Review" comments

df_reviews_2 = df_reviews[df_reviews.user_review != "Early Access Review"]
df_reviews_2 = df_reviews[~df_reviews.user_review.isin(['nan'])]
print(df_reviews_2.shape)

# Drop duplicates 
df_reviews_2.drop_duplicates(['user_review', 'user_suggestion'], inplace = True)
print(df_reviews_2.shape)



(10494, 5)
(10494, 5)


In [14]:
# replace ♥
def replace_hearts_with_PAD(text):
    return re.sub(r"[♥]+", ' **** ' ,text)

df_reviews_2['user_review_clean'] = df_reviews_2.user_review.apply(replace_hearts_with_PAD)

df_reviews_3 = df_reviews_2[['user_review_clean', 'user_suggestion']]
df_reviews_3 = df_reviews_3.rename({"user_review_clean": "text", "user_suggestion": "labels"});
df_reviews_3.head()



df_test['user_review_clean'] = df_test.user_review.apply(replace_hearts_with_PAD)
df_test_1 = df_test['user_review_clean']
df_test_1 = df_test_1.rename({"user_review_clean":"text"})
df_test_1.head()

0    I'm scared and hearing creepy voices.  So I'll...
1    Best game, more better than Sam Pepper's YouTu...
2    A littly iffy on the controls, but once you kn...
3    Great game, fun and colorful and all that.A si...
4    Early Access ReviewIt's pretty cute at first, ...
Name: user_review_clean, dtype: object

Split the df_training into train (60%), test(20%) and holdout sets(20%).

In [15]:
train_df, eval_df = train_test_split(df_reviews_3, test_size = 0.4, random_state = 42)
test_df , holdout_df = train_test_split(eval_df, test_size = 0.5, random_state = 42)

print(train_df.shape)
print(test_df.shape)
print(holdout_df.shape)

(6296, 2)
(2099, 2)
(2099, 2)


Roberta model: Roberta model means robustly Optimized BERT Pre-training Approach, we use simpletransformers package to create the model. And it is recommended to use num_train_epochs=1 and for loop to repeat the training, as using for loop could get the same result saw on the epoch, but if set num_train_epochs more than 1, we could not get the same result. 

In [16]:
from simpletransformers.classification import ClassificationModel
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)


# Create a ClassificationModel
roberta_model = ClassificationModel(
                          'roberta', 'roberta-base', use_cuda=False,
                          args={'num_train_epochs' : 1,
                                 "train_batch_size": 16,
                                 "eval_batch_size": 16,
                                 "fp16": False,
                                 "optimizer": "AdamW",
                                 "adam_epsilon": 1e-8,
                                 "learning_rate": 1e-5,
                                 "weight_decay": 0.7,
                                 'overwrite_output_dir': True,
                                 "save_eval_checkpoints": False,
                                 "save_model_every_epoch": False,
                                 "no_cache": True,
                                 "manual_seed": 12345})

for i in range(2):
     # Train the model
    roberta_model.train_model(train_df)

# Evaluate the model on the test data 
    result, model_outputs, wrong_predictions = roberta_model.eval_model(test_df)
    print("Accuracy= " ,(result['tp'] + result['tn']) / (result['tp'] + result['tn'] + \
                                                         result['fp'] + result['fn']))
    print("Recall = ",(result['tn']) / (result['tn'] + result['fn'])) # simpletransformers mistakenly reports fn and fp. have to flip them
    print(result)
    print(classification_report(np.argmax(model_outputs, axis = 1), test_df.user_review_clean.values))


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 13/6296 [00:26<3:33:20,  2.04s/it]
Epochs 0/1. Running Loss:    0.2142: 100%|██████████| 394/394 [7:02:36<00:00, 64.36s/it]
Epoch 1 of 1: 100%|██████████| 1/1 [7:02:36<00:00, 25356.80s/it]
INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.
  0%|          | 0/2099 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 5/2099 [00:15<1:44:54,  3.01s/it]
Running Evaluation: 100%|██████████| 132/132 [12:30<00:00,  5.69s/it]
INFO:simpletransformers.classification.classification_model:{'mcc': 0.7799084084321967, 'tp': 1092, 'tn': 781, 'fp': 120, 'fn': 106, 'auroc': 0.9555733844235398, 'auprc': 0.9661060437059306, 'eval_loss': 0.2844462984551986}


Accuracy=  0.8923296808003811
Recall =  0.8804960541149943
{'mcc': 0.7799084084321967, 'tp': 1092, 'tn': 781, 'fp': 120, 'fn': 106, 'auroc': 0.9555733844235398, 'auprc': 0.9661060437059306, 'eval_loss': 0.2844462984551986}


AttributeError: 'DataFrame' object has no attribute 'user_review'

INFO:simpletransformers.classification.classification_model:{'mcc': 0.7799084084321967, 'tp': 1092, 'tn': 781, 'fp': 120, 'fn': 106, 'auroc': 0.9555733844235398, 'auprc': 0.9661060437059306, 'eval_loss': 0.2844462984551986}
Accuracy=  0.8923296808003811
Recall =  0.8804960541149943
f1 score = 0.90622
{'mcc': 0.7799084084321967, 'tp': 1092, 'tn': 781, 'fp': 120, 'fn': 106, 'auroc': 0.9555733844235398, 'auprc': 0.9661060437059306, 'eval_loss': 0.2844462984551986}

Upload to Kaggle

In [32]:
y_test_id = df_test['review_id']
y_test = roberta_model.predict(df_test_1)

y_test = pd.DataFrame(y_test)
y_test = y_test.assign(review_id = y_test_id)
y_test.columns = ['review_id','user_suggestion']
pd.DataFrame(y_test).to_csv('predictions.kaggle.csv', index=False)


INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.
  0%|          | 0/6996 [00:14<?, ?it/s]
