Our deep learning model, a Transformer, analyzes game reviews and classifies the sentiment each review expresses.  
To train our model, we will use the "Steam Reviews" dataset, available at: https://www.kaggle.com/datasets/andrewmvd/steam-reviews.  
The dataset contains about 6.4 million English-language reviews from the Steam platform, along with other fields such as the game name.  
Each review in the dataset is labeled with its sentiment: 1 for positive and -1 for negative.  
The objective of our model is to take a game review as input and predict whether the sentiment is positive or negative.  
To implement the model, we will use the Keras API provided by TensorFlow.

# 1/ Data preprocessing

Before implementing the Transformer and training it on our data, we need to:  
- Load the dataset (stored in .csv format in the "datasets" folder)
- Separate the dataset into training, validation, and test sets
- Transform the raw review text into token sequences
- Pad these sequences so that they all have the same length
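
To make the last two steps concrete, here is a minimal, library-free sketch of tokenization and padding. It is an illustration only, not the Keras implementation: the helpers `build_vocab`, `texts_to_sequences`, and `pad` are hypothetical names, and the whitespace-based word splitting is a simplifying assumption. Like the default behavior of the Keras `pad_sequences` function, it pads and truncates at the start of each sequence.

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer index, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, vocab):
    """Replace each known word by its index; unknown words are dropped."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]

def pad(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly maxlen items (maxlen >= 1)."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Two toy reviews (made up for this illustration)
reviews = ["great game", "not a great game at all"]
vocab = build_vocab(reviews)
sequences = texts_to_sequences(reviews, vocab)
padded = pad(sequences, maxlen=4)
print(padded)
```

Every row of `padded` has the same length, which is what allows the reviews to be stacked into a single array and fed to the model in batches.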

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import random
from sklearn.model_selection import train_test_split

## Dataset importing and processing

In [2]:
# Loading the dataset
dataset = pd.read_csv("datasets/dataset.csv")

In [3]:
# Data overview
print(dataset.head())

   app_id        app_name                                        review_text  \
0      10  Counter-Strike                                    Ruined my life.   
1      10  Counter-Strike  This will be more of a ''my experience with th...   
2      10  Counter-Strike                      This game saved my virginity.   
3      10  Counter-Strike  • Do you like original games? • Do you like ga...   
4      10  Counter-Strike           Easy to learn, hard to master.             

   review_score  review_votes  
0             1             0  
1             1             1  
2             1             0  
3             1             0  
4             1             1  


In [4]:
print("Number of reviews :", len(dataset))

Number of reviews : 6417106


With 6.4 million reviews, the full dataset is too large to process comfortably here, so we keep only a randomly selected subset of the reviews.

In [5]:
# Choice of the sample size
sample_size = 50000

# Randomly select review indices (without replacement)
index_reviews_kept = random.sample(range(len(dataset)), sample_size)
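
The sampling above draws different reviews on every run. If reproducibility matters, one option is to draw the indices from a seeded generator. The sketch below assumes a hypothetical helper `sample_indices` and an arbitrary seed of 42; neither is part of the original notebook, and the sizes are deliberately small.

```python
import random

def sample_indices(n_rows, sample_size, seed=42):
    """Return sample_size distinct row indices in [0, n_rows), reproducibly."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return rng.sample(range(n_rows), sample_size)

# Small illustrative sizes; the notebook would use len(dataset) and sample_size
first_draw = sample_indices(n_rows=1000, sample_size=10)
second_draw = sample_indices(n_rows=1000, sample_size=10)
print(first_draw == second_draw)
```

Because `random.sample` draws without replacement, no review can be selected twice.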

Then, we keep only the columns of interest: the review content ('review_text') and its associated sentiment ('review_score').

In [6]:
# Extracting data by converting it into a numpy array and setting the review format to string
X = np.array(dataset.iloc[index_reviews_kept].review_text, dtype = "str")
y = np.array(dataset.iloc[index_reviews_kept].review_score)

In [7]:
for i in range(3) :
    print(y[i], X[i])

1 Full of intense motion!!!! for those who like zombies, adventure and a little bit puzzle, this game is for you... manage to end this game less then 6 hours... its fun :)
-1 Looks nothing like the store page, hard to play. I wouldn't recommend.
1 This is a great game but I can't access the mapeditor so can anyone help me? any way great game you should get it.


In [8]:
# Map the negative label from -1 to 0 so that the labels are binary (0 = negative, 1 = positive)
y[y == -1] = 0

Next, we split the data into training, validation, and test sets.  
The training set contains 60% of the sampled reviews, while the validation and test sets each contain 20%.

In [9]:
# Split the data using the train_test_split function from scikit-learn
X_train, X_test_full, y_train, y_test_full = train_test_split(X, y, train_size = 0.6, random_state = 42)

In [10]:
# Split the remaining 40% into two equal halves: validation and test
half = len(X_test_full) // 2
X_val, y_val = X_test_full[:half], y_test_full[:half]
X_test, y_test = X_test_full[half:], y_test_full[half:]

In [11]:
print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.shape, y_val.shape, y_test.shape)

(30000,) (10000,) (10000,)
(30000,) (10000,) (10000,)
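
The shapes above follow directly from the 60/20/20 arithmetic applied to 50,000 sampled reviews. As a sanity check, the same split can be sketched without scikit-learn. The function `split_60_20_20` is a hypothetical helper written for illustration, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def split_60_20_20(items, seed=42):
    """Shuffle items, then cut them into 60% train, 20% validation, 20% test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10       # integer arithmetic avoids float rounding
    n_val = (len(shuffled) - n_train) // 2  # half of the remainder
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Stand-in for the 50,000 sampled review indices
train, val, test = split_60_20_20(range(50000))
print(len(train), len(val), len(test))
```

The three subsets are disjoint and together cover every sampled review, so no review used for training leaks into validation or testing.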
