Using RoBERTa model for the Sentiment Classification. Have tried use Bert in another python file, but Roberta could generate a better result compare with Bert as Bert use static masking while RoBERTa uses dynamic masking. So will try the simple transformers from the hugging face in this file.

In [13]:
import pandas as pd 
import random as rn
import re
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In this model, we only use the user_review variable to predict the sentiment, and see whether the sentiment could accurately predict the user's suggestion. 

In [2]:
df_reviews = pd.read_csv("game_train.csv")
df_test = pd.read_csv("game_test.csv")

# convert review text to string
df_reviews["user_review"] = df_reviews["user_review"].astype(str)
df_reviews.user_review = df_reviews.user_review.apply(lambda s: s.strip())

df_test["user_review"] = df_test["user_review"].astype(str)
df_test.user_review = df_test.user_review.apply(lambda s: s.strip())

The data is quite balance.

In [3]:
df_reviews["user_suggestion"].value_counts()

1    5986
0    4508
Name: user_suggestion, dtype: int64

Data cleaning: we remove the early access review comments and remove duplicated rows. We also figure out the foul language in the review was replaced by ♥ emoji. So to increase the accurancy of the sentiment prediction, we replaced ♥ with **, as the model would consider ** as foul language.

In [4]:
#Remove the "Early Access Review" comments

df_reviews_2 = df_reviews[df_reviews.user_review != "Early Access Review"]
df_reviews_2 = df_reviews[~df_reviews.user_review.isin(['nan'])]
print(df_reviews_2.shape)

# Drop duplicates 
df_reviews_2.drop_duplicates(['user_review', 'user_suggestion'], inplace = True)
print(df_reviews_2.shape)

(10494, 5)
(10494, 5)


In [5]:
# replace ♥
def replace_hearts_with_PAD(text):
    return re.sub(r"[♥]+", ' **** ' ,text)

df_reviews_2['user_review_clean'] = df_reviews_2.user_review.apply(replace_hearts_with_PAD)

df_reviews_3 = df_reviews_2[['user_review_clean', 'user_suggestion']]
df_reviews_3 = df_reviews_3.rename({"user_review_clean": "text", "user_suggestion": "labels"});
df_reviews_3.head()

df_test_1 = df_test['user_review']

Split the df_training into train (60%), test(20%) and holdout sets(20%).

In [6]:
train_df, eval_df = train_test_split(df_reviews_3, test_size = 0.4, random_state = 42)
test_df , holdout_df = train_test_split(eval_df, test_size = 0.5, random_state = 42)

print(train_df.shape)
print(test_df.shape)
print(holdout_df.shape)

(6296, 2)
(2099, 2)
(2099, 2)


Roberta model: Roberta model means robustly Optimized BERT Pre-training Approach, we use simpletransformers package to create the model. And it is recommended to use num_train_epochs=1 and for loop to repeat the training, as using for loop could get the same result saw on the epoch, but if set num_train_epochs more than 1, we could not get the same result. 

In [17]:
conda install pytorch>=1.6 cudatoolkit=11.0 -c pytorch


Note: you may need to restart the kernel to use updated packages.


In [18]:
from simpletransformers.classification import ClassificationModel
import logging

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)


# Create a ClassificationModel
roberta_model = ClassificationModel(
                          'roberta', 'roberta-base', use_cuda=False,
                          args={'num_train_epochs' : 1,
                                 "train_batch_size": 16,
                                 "eval_batch_size": 16,
                                 "fp16": False,
                                 "optimizer": "AdamW",
                                 "adam_epsilon": 1e-8,
                                 "learning_rate": 1e-5,
                                 "weight_decay": 0.7,
                                 'overwrite_output_dir': True,
                                 "save_eval_checkpoints": False,
                                 "save_model_every_epoch": False,
                                 "no_cache": True,
                                 "manual_seed": 12345})

for i in range(2):
     # Train the model
    roberta_model.train_model(train_df)

# Evaluate the model on the test data 
    result, model_outputs, wrong_predictions = roberta_model.eval_model(test_df)
    print("Accuracy= " ,(result['tp'] + result['tn']) / (result['tp'] + result['tn'] + \
                                                         result['fp'] + result['fn']))
    print("Recall = ",(result['tn']) / (result['tn'] + result['fn'])) # simpletransformers mistakenly reports fn and fp. have to flip them
    print(result)
    print(classification_report(np.argmax(model_outputs, axis = 1), test_df.user_review_clean.values))


Downloading:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/478M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/6296 [00:00<?, ?it/s]



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/394 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.
INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/2099 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/132 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model:{'mcc': 0.7886958362445423, 'tp': 1096, 'tn': 786, 'fp': 115, 'fn': 102, 'auroc': 0.9568741094573086, 'auprc': 0.9669629646652218, 'eval_loss': 0.2816459955224259}


Accuracy=  0.8966174368747022
Recall =  0.8851351351351351
{'mcc': 0.7886958362445423, 'tp': 1096, 'tn': 786, 'fp': 115, 'fn': 102, 'auroc': 0.9568741094573086, 'auprc': 0.9669629646652218, 'eval_loss': 0.2816459955224259}


AttributeError: 'DataFrame' object has no attribute 'user_review'

Upload to Kaggle

In [48]:
y_test_id = df_test['review_id']
test_dataframe_from_series = df_test_1.tolist()
#print(test_dataframe_from_series.head(10))
#print(test_dataframe_from_series.dtypes)
y_test = roberta_model.predict(test_dataframe_from_series) 

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/6996 [00:00<?, ?it/s]

  0%|          | 0/438 [00:00<?, ?it/s]

In [67]:
y_test = pd.DataFrame(y_test)
y_test.head(10)
y_test_kaggle = y_test.transpose().drop(1, axis = 1)
y_test_kaggle = y_test_kaggle.assign(review_id = y_test_id)
y_test_kaggle.columns = ['user_suggestion','review_id']
y_test_kaggle.head(5)
pd.DataFrame(y_test_kaggle).to_csv('predictions.kaggle.roberta.csv', index=False)

In [25]:
train_df.head(10)

Unnamed: 0,user_review_clean,user_suggestion
10006,The gameplay is amazing. I really dig it. Walk...,0
5554,JOIN THE OZFORTRESS JOIN THE OZFORTRESS JOIN T...,1
5458,"Product received for freeIt's pretty good, but...",1
8367,A really Fun MMO FPS with good gameplay and a ...,1
9686,"""Formation StrategyThe only clicker/idle game ...",0
425,"Ok lets get this straight, I love this game. T...",1
6390,Early Access Review10/10 refund simulator 2018...,0
5214,Played it on kongregate some time ago when the...,1
228,"ignore my playtime, i have 1000+ hours played....",1
5749,"Early Access Reviewokay, review time:pros: rpg...",1


In [28]:
df_test_1.head(10)

0    I'm scared and hearing creepy voices.  So I'll...
1    Best game, more better than Sam Pepper's YouTu...
2    A littly iffy on the controls, but once you kn...
3    Great game, fun and colorful and all that.A si...
4    Early Access ReviewIt's pretty cute at first, ...
5    This game with its cute little out of the wall...
6    Early Access ReviewRooms 1-20 were cute and ad...
7    Just played this game for about an hour and it...
8    Those puppets, man! They freak me out! Can't g...
9    Early Access ReviewThis Game is a bit simple b...
Name: user_review, dtype: object

In [35]:
type(train_df)

pandas.core.frame.DataFrame

In [33]:
df_test_1.dtypes

dtype('O')

In [31]:
train_df.dtypes

user_review_clean    object
user_suggestion       int64
dtype: object