The goal of our deep learning model, a Transformer, is to analyze game reviews and classify the sentiments associated with these reviews.  
To train our model, we will use the "Steam Reviews" dataset, available at: https://www.kaggle.com/datasets/andrewmvd/steam-reviews.  
This dataset contains about 6.4 million English reviews from the Steam platform, along with other features.  
Each review in the dataset is labeled with its sentiment: 1 for positive and -1 for negative.  
The objective of our model is to take a game review as input and predict whether the sentiment is positive or negative.  
To implement the model, we will use the Keras API provided by TensorFlow.

# 1/ Data preprocessing

Before implementing the Transformer and training it on our data, we need to:
- Load the dataset (stored in .csv format in the "datasets" folder)
- Separate this dataset into training, validation, and test data
- Transform the raw text data into token sequences
- Pad these sequences so that they all have the same length

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import random
from sklearn.model_selection import train_test_split

## Dataset importing and processing

In [2]:
# Loading the dataset
dataset = pd.read_csv("datasets/dataset.csv")

In [3]:
# Data overview
print(dataset.head())

   app_id        app_name                                        review_text  \
0      10  Counter-Strike                                    Ruined my life.   
1      10  Counter-Strike  This will be more of a ''my experience with th...   
2      10  Counter-Strike                      This game saved my virginity.   
3      10  Counter-Strike  • Do you like original games? • Do you like ga...   
4      10  Counter-Strike           Easy to learn, hard to master.             

   review_score  review_votes  
0             1             0  
1             1             1  
2             1             0  
3             1             0  
4             1             1  


In [4]:
print("Number of reviews :", len(dataset))

Number of reviews : 6417106


With 6.4 million reviews, the dataset is too large for our purposes, so we keep only a randomly selected subset of the reviews.

In [5]:
# Choice of the sample size
sample_size = 100000

# Randomly select review indices
index_reviews_kept = random.sample(range(len(dataset)), sample_size)
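
The random selection above returns a different subset on every run, because the generator is not seeded. If a repeatable subset is wanted, one option is a dedicated seeded generator. The following is a minimal sketch on a smaller, hypothetical dataset size; the names n_reviews, seed and index_again are assumptions made for illustration and are not part of the notebook.

```python
import random

# Hypothetical stand-ins: a small "dataset" size and sample size for illustration.
n_reviews = 1_000
sample_size = 100
seed = 42  # assumed value; any fixed integer works

# A dedicated, seeded generator makes the draw repeatable without
# touching the global random state.
rng = random.Random(seed)
index_reviews_kept = rng.sample(range(n_reviews), sample_size)

# Re-creating the generator with the same seed reproduces the same indices.
index_again = random.Random(seed).sample(range(n_reviews), sample_size)

print(index_reviews_kept == index_again)  # True
print(len(set(index_reviews_kept)))       # 100 (sampling is without replacement)
```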

Next, we keep only the data of interest: the review content ('review_text') and its associated sentiment ('review_score').

In [6]:
# Extracting data by converting it into a numpy array and setting the review format to string
X = np.array(dataset.iloc[index_reviews_kept].review_text, dtype = "str")
y = np.array(dataset.iloc[index_reviews_kept].review_score)

In [7]:
for i in range(3) :
    print(y[i], X[i])

1 I recommend this with a proviso. It was a great game when it came out Lo These Many Years Ago, and it's still a great game and a wonderful piece of design. There also isn't really a modern equivalent which does the same things. However, on Win8 (and possibly other moderns OSs) it's quite glitchy and suffers from strange slowdowns which make it, if not unplayable, rather less fun than it could be. So, modern buyers beware.
1 I AM SO GLAD I GOT THIS GAME. The reviews made it sound like a mediocre experience. SO WRONG. Absolutely loving this game, story is fun and interesting and the combat kicks frken ♥♥♥. Game is challenging, but rewarding and the fact you gain new abilites and weapons nearly up until the end makes it bloody hard to put down. Tonnes of replayability (including a one hit kill/death mode) and tonnes of actual content without replaying. Love the platforming that rewards skill and some of the bosses, wow, most intense and insane boss fights I've ever seen. I trully have n

In [8]:
# Set the sentiment value for negative sentiment to 0
y[y == -1] = 0

Next, we split the data into training, validation, and test sets to train the Transformer model.  
The training set will contain 60% of the initial data, while the validation and test sets will each contain 20% of the initial data.

In [9]:
# Split the data using the train_test_split function from scikit-learn
X_train, X_test_full, y_train, y_test_full = train_test_split(X, y, train_size = 0.6, random_state = 42)

In [10]:
# Split the held-out 40% in half: first half for validation, second half for test
half = len(X_test_full) // 2
X_val, y_val = X_test_full[:half], y_test_full[:half]
X_test, y_test = X_test_full[half:], y_test_full[half:]

In [11]:
print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.shape, y_val.shape, y_test.shape)

(60000,) (20000,) (20000,)
(60000,) (20000,) (20000,)
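
To make the 60/20/20 arithmetic explicit, here is a minimal standard-library sketch of the same idea: shuffle the indices once, then cut them into three consecutive slices. It is a stand-in for illustration, not the scikit-learn call used above, and the helper name split_indices and its parameters are hypothetical.

```python
import random

def split_indices(n, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle range(n) and cut it into train / validation / test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 60000 20000 20000
```

Because the three slices come from one shuffled list, no review can end up in more than one set.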


## Tokenisation of text data

This step transforms the words of each review into integer tokens so that the model can process them.

In [12]:
# Maximum number of words for the token dictionary
num_words = 4000

# Instantiate the Tokenizer and fit it on the training data only
tokenizer = Tokenizer(num_words = num_words, oov_token = "UNK")
tokenizer.fit_on_texts(X_train)

In [13]:
print("Number of different words in the data : %d" % len(tokenizer.word_docs))

Number of different words in the data : 66397


In [100]:
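# Only indices below num_words are kept, so index_word[num_words] is the
# most frequent word that falls outside the vocabulary
max_occurrences_oov_words = tokenizer.word_counts[tokenizer.index_word[num_words]]
print("Maximum number of occurrences for OOV words :" , max_occurrences_oov_words)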

Maximum number of occurrences for OOV words : 49
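
A complementary way to judge a vocabulary size is to look at how much of the text it covers. The sketch below computes, on a tiny hypothetical corpus, the fraction of all word occurrences accounted for by the k most frequent words. The corpus, the whitespace splitting and the helper name coverage are assumptions for illustration; they only approximate what the Keras Tokenizer counts.

```python
from collections import Counter

# Toy corpus standing in for X_train (hypothetical reviews).
texts = [
    "great game great fun",
    "bad game",
    "great story bad controls",
]

# Count every word occurrence, a simplified stand-in for tokenizer.word_counts.
word_counts = Counter(word for text in texts for word in text.split())

def coverage(counts, top_k):
    """Fraction of all word occurrences covered by the top_k most frequent words."""
    total = sum(counts.values())
    kept = sum(count for _, count in counts.most_common(top_k))
    return kept / total

for k in (1, 2, 3):
    print(k, round(coverage(word_counts, k), 2))
# 1 0.3
# 2 0.5
# 3 0.7
```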


We then convert each review from a sequence of words into a sequence of tokens.

In [101]:
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_val_sequences = tokenizer.texts_to_sequences(X_val)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

In [104]:
print(X_train_sequences[:3])

[[76, 1076, 27, 174], [12, 6, 1, 203, 4, 136, 8, 1], [910, 246, 139, 37, 311, 39, 8, 2, 3740, 859, 1988, 7, 20, 112, 75, 12, 6, 9, 2013, 17, 544, 14, 119, 1213, 26, 196, 152, 57, 8]]
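
The token 1 in the sequences above is the index of the "UNK" token: it replaces every word that is unknown or too rare to be kept. The following pure-Python sketch imitates that indexing scheme in a simplified form (index 1 for the OOV token, then words by descending frequency, and only indices below num_words kept). The helper names fit_vocabulary and texts_to_sequences are hypothetical, and unlike the real Tokenizer this version does not strip punctuation.

```python
from collections import Counter

def fit_vocabulary(texts, oov_token="UNK"):
    """Map each word to an index: 1 for the OOV token, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ordered = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: index for index, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index, num_words, oov_token="UNK"):
    """Replace each word by its index; unknown or too-rare words get the OOV index."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        sequence = []
        for word in text.lower().split():
            index = word_index.get(word)
            if index is None or index >= num_words:
                index = oov_index
            sequence.append(index)
        sequences.append(sequence)
    return sequences

train_texts = ["good game", "good story", "bad game good music"]
word_index = fit_vocabulary(train_texts)

# With num_words=4, only "good" (2) and "game" (3) are kept besides UNK (1).
print(texts_to_sequences(["good game", "bad soundtrack"], word_index, num_words=4))
# [[2, 3], [1, 1]]
```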


Next, we need to pad these sequences: since they do not all have the same length (the reviews do not all contain the same number of words), the model cannot use them as they are.  
We pad them by truncating the sequences that are longer than a fixed maximum length ("maxlen") and filling those that are shorter than maxlen with 0. By default, pad_sequences both truncates and pads at the beginning of each sequence.

In [17]:
X_train_padded = pad_sequences(X_train_sequences, maxlen = 400)
X_val_padded = pad_sequences(X_val_sequences, maxlen = 400)
X_test_padded = pad_sequences(X_test_sequences, maxlen = 400)
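
One way to sanity-check a choice of maxlen is to measure how many sequences it would cut. The sketch below does this on hypothetical sequences whose lengths were chosen arbitrarily; the helper name fraction_truncated is an assumption and not part of Keras.

```python
def fraction_truncated(sequences, maxlen):
    """Share of sequences longer than maxlen, i.e. those that padding would cut."""
    if not sequences:
        return 0.0
    return sum(len(seq) > maxlen for seq in sequences) / len(sequences)

# Hypothetical token sequences of lengths 4, 8 and 29.
toy_sequences = [[1] * 4, [1] * 8, [1] * 29]

for maxlen in (5, 10, 30):
    print(maxlen, round(fraction_truncated(toy_sequences, maxlen), 2))
# 5 0.67
# 10 0.33
# 30 0.0
```

Applied to X_train_sequences with maxlen = 400, the same function would report the share of training reviews that lose tokens.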

In [18]:
X_train_padded

array([[   0,    0,    0, ..., 1076,   27,  174],
       [   0,    0,    0, ...,  136,    8,    1],
       [   0,    0,    0, ...,  152,   57,    8],
       ...,
       [   0,    0,    0, ...,    1,  156,    6],
       [   0,    0,    0, ..., 3516,    1,  130],
       [   0,    0,    0, ...,   31,    2,  698]], dtype=int32)
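
The leading zeros in this output come from the default behaviour of pad_sequences, which pads and truncates at the start of each sequence. A minimal pure-Python sketch of that behaviour, assuming a positive maxlen and using the hypothetical helper name pad_pre, looks like this:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of longer sequences."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # truncate from the front when the sequence is too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[76, 1076, 27, 174], [5, 6, 7, 8, 9, 10, 11]], maxlen=5))
# [[0, 76, 1076, 27, 174], [7, 8, 9, 10, 11]]
```

The short sequence gains one leading 0, while the long one loses its first two tokens, so every row ends up with exactly maxlen entries.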