# **Transformers for Sentiment Classification:**

Welcome folks! In this project I will show you in detail how to implement four well-known transformer models using the HuggingFace transformers library and the Keras API.

The main task is a multi-class text classification from the Movie Reviews competition. The training set contains 156,060 instances and the testing set contains 66,292, each of which we have to classify into one of 5 classes. The sentiment labels are:

0 → Negative
1 → Somewhat negative
2 → Neutral
3 → Somewhat positive
4 → Positive

At the end of the project we will summarize and compare the models' performance according to our requirements and metrics.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
import os 
!pip install -U -q segmentation-models --user
os.environ["SM_FRAMEWORK"] = "tf.keras"
import segmentation_models as sm

In [3]:
!pip install Keras

In [4]:
import pandas as pd
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
sns.set(style='whitegrid')

from wordcloud import WordCloud

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.metrics import classification_report,confusion_matrix

from collections import defaultdict
from collections import Counter

import re
import gensim
import string

from tqdm import tqdm
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM,Dense, SpatialDropout1D, Dropout
from keras.initializers import Constant

import tensorflow as tf
import warnings
warnings.simplefilter('ignore')

In [5]:

#df=pd.read_csv("../input/amazon-reviews-on-sentiment-analysis/Amazon Reviews on Sentiment Analysis/train.csv")
#encoding='latin1', lineterminator='\n'
dframe = pd.read_csv("../input/tripadvisor-hotel-reviews-20k-dataset/tripadvisor_hotel_reviews.csv")
#df, validate, df_test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

We split the data into training (80%) and testing (20%) sets and confirm the number of instances in each:

In [6]:
df = dframe.sample(frac=0.8, random_state=25)
df_test = dframe.drop(df.index)

print(f"No. of training examples: {df.shape[0]}")
print(f"No. of testing examples: {df_test.shape[0]}")
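As a quick sanity check of the split pattern used above, here is a minimal sketch on a made-up toy frame (the `toy` data is hypothetical, not the reviews file). It shows that `sample` followed by `drop` produces two parts that do not overlap and together cover every row.

```python
import pandas as pd

# Toy frame standing in for the reviews file (values are made up)
toy = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
})

# 80% of the rows, drawn at random but reproducibly
toy_train = toy.sample(frac=0.8, random_state=25)
# Everything that was not sampled becomes the test set
toy_test = toy.drop(toy_train.index)

# The two parts never share a row and together cover the whole frame
no_overlap = set(toy_train.index).isdisjoint(toy_test.index)
covers_all = len(toy_train) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_test), no_overlap, covers_all)
```

Fixing `random_state` keeps the split reproducible between runs, which matters when comparing several models on the same data.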

In [7]:
#df_test=pd.read_csv("../input/amazon-reviews-on-sentiment-analysis/Amazon Reviews on Sentiment Analysis/test.csv")
#encoding='latin1', lineterminator='\n'

In [8]:
df.shape, df_test.shape

In [9]:
df

Above we can see that the Review and Rating columns are all we need from the file to train the models later. Therefore, we will use them as the feature (X) and label (Y) when fitting the transformers.

In [10]:
df_test

If there is a null or empty value in any column, we have to get rid of it. To find out, we will use info() and isnull() as follows:

In [11]:
df.info()

In [12]:
df.isnull().sum()
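If the check had found missing values, one way to remove them could look like the sketch below. It runs on a made-up toy frame, and treating whitespace-only reviews as missing is my own assumption, not something the dataset requires.

```python
import pandas as pd

# Toy frame with one missing and one whitespace-only review (made-up data)
toy = pd.DataFrame({
    "Review": ["great stay", None, "   ", "dirty room"],
    "Rating": [5, 4, 3, 1],
})

# Strip surrounding whitespace, then mark empty strings as missing
stripped = toy["Review"].str.strip()
toy_clean = toy.assign(Review=stripped.mask(stripped == ""))

# Drop every row where the text or the label is missing
toy_clean = toy_clean.dropna(subset=["Review", "Rating"]).reset_index(drop=True)
print(toy_clean)
```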

The dataset looks good. Next, we need to know how the 5 classes are distributed in the label in order to see whether it is balanced or not.

In [13]:
df.Rating.value_counts()
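The class counts can also be turned into per-class weights, which is one common way to compensate for imbalance. The sketch below uses the "balanced" heuristic `n_samples / (n_classes * count)`. The helper name `balanced_class_weights` and the toy labels are hypothetical, and this weighting is not applied anywhere in the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Made-up labels: class 5 is frequent, classes 1 and 3 are rare
toy_labels = [5, 5, 5, 5, 4, 4, 3, 1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes below 1, so that each class contributes roughly equally to a weighted loss.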

In [14]:
df2 = df.copy(deep=True)
# Map the numeric ratings to readable class names
rating_names = {1: 'Negative', 2: 'Somewhat negative', 3: 'Neutral', 4: 'Somewhat positive', 5: 'Positive'}
# Keep the counts as a Series: the column names produced by reset_index() differ across pandas versions
pie1 = df2['Rating'].map(rating_names).value_counts()
pie1.plot(kind='pie', title='Pie chart of Rating Class',
          autopct='%1.1f%%', shadow=False, legend=False, fontsize=14, figsize=(12,12))

Now it is time to find out the number of words in the reviews. To understand it a bit better, we will plot a histogram for each class:

In [15]:
f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1,5,figsize=(25,8))

ax1.hist(df[df['Rating'] == 1]['Review'].str.split().map(lambda x: len(x)), bins=50, color='b')
ax1.set_title('Negative Reviews')

ax2.hist(df[df['Rating'] == 2]['Review'].str.split().map(lambda x: len(x)), bins=50, color='r')
ax2.set_title('Somewhat Negative Reviews')

ax3.hist(df[df['Rating'] == 3]['Review'].str.split().map(lambda x: len(x)), bins=50, color='g')
ax3.set_title('Neutral Reviews')

ax4.hist(df[df['Rating'] == 4]['Review'].str.split().map(lambda x: len(x)), bins=50, color='y')
ax4.set_title('Somewhat Positive Reviews')

ax5.hist(df[df['Rating'] == 5]['Review'].str.split().map(lambda x: len(x)), bins=50, color='k')
ax5.set_title('Positive Reviews')

f.suptitle('Histogram number of words in reviews')

In the 5 histograms we can see that the distribution behaves like a negative exponential function, decreasing significantly as the x-axis increases. It seems that the longest sentence in the Review column belongs to the 'Negative Reviews' class and is around 52 words long. Now let's obtain the length of the longest one by using the max() function:

In [16]:
df['Review'].str.split().map(lambda x: len(x)).max()

In [17]:
# Number of words in each review
word_counts = df['Review'].str.split().map(len)

# The comparison is >=, so the message says "at least"
for threshold in (20, 30, 40, 50):
    print(f'Number of sentences which contain at least {threshold} words: ', (word_counts >= threshold).sum())
    print(' ')
print('Number of sentences which contain 52 words: ', (word_counts == 52).sum())
print(' ')
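These counts help when choosing `max_length` for the tokenizer later on. A minimal sketch of that reasoning, on made-up reviews, is to measure which share of the texts fits under each candidate limit. Keep in mind that this counts whitespace-separated words, while the transformer tokenizers split text into subword tokens, so the real token counts are usually higher.

```python
import numpy as np

# Made-up reviews of 1, 3, 7 and 40 words
toy_reviews = [
    "nice",
    "very nice hotel",
    "room was clean and staff was friendly",
    "a " * 40,
]
toy_word_counts = np.array([len(r.split()) for r in toy_reviews])

# Share of reviews that fit completely under each candidate limit
coverage = {
    limit: float((toy_word_counts <= limit).mean())
    for limit in (5, 10, 50)
}
print(coverage)
```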

# **Modeling:**

In this step we will build, train and compare the following algorithms:

BERT (Bidirectional Encoder Representations from Transformers)

XLNet (Generalized Auto-Regressive model)

RoBERTa (Robustly Optimized BERT Pre-training Approach)

DistilBERT (Distilled BERT)

Each of these models has its pros and cons. BERT is the most widely used because it sits in the middle in terms of performance, whereas RoBERTa and .. are known for their better error metrics and DistilBERT for its faster training. We will consider all of these characteristics and choose the best one for our dataset.

First, we have to install the transformers library offered by HuggingFace to enable all the functions we need when building the four models.

In [18]:
!pip install transformers

Then we import what we need from tensorflow.keras:

In [19]:
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical

import pandas as pd
from sklearn.model_selection import train_test_split

Now we have to gather from the dataset only the two columns useful for training (Review and Rating):

In [20]:
data = df[['Review', 'Rating']].copy()  # copy so the assignments below do not trigger SettingWithCopyWarning

# Set your model output as categorical and save in new label col
data['Rating_label'] = pd.Categorical(data['Rating'])

# Transform your output to numeric
data['Rating'] = data['Rating_label'].cat.codes

In [21]:
data_train, data_val = train_test_split(data, test_size = 0.1)
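With unbalanced classes, a purely random split can leave a rare class under-represented in the validation part. One alternative is a stratified split, sketched here with plain pandas on a made-up toy frame. scikit-learn's `train_test_split` also offers a `stratify` argument for the same purpose. This is an illustration, not the split used in the rest of the notebook.

```python
import pandas as pd

# Toy frame with two equally sized classes (made-up data)
toy = pd.DataFrame({
    "Review": [f"review {i}" for i in range(20)],
    "Rating": [0] * 10 + [1] * 10,
})

# Draw 10% of each class for validation; the rest stays in training
toy_val = toy.groupby("Rating").sample(frac=0.1, random_state=0)
toy_train = toy.drop(toy_val.index)

print(toy_val["Rating"].value_counts().to_dict())
print(len(toy_train), len(toy_val))
```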

# **BERT:**



As a first step, we have to import the Model, Config and Tokenizer classes corresponding to BERT in order to build the model properly.

In [22]:
from transformers import TFBertModel,  BertConfig, BertTokenizerFast

In [23]:
# Name of the BERT model to use
model_name = 'bert-base-uncased'

# Max length of tokens
max_length = 100

# Load transformers config and set output_hidden_states to False
config = BertConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the Transformers BERT model
transformer_bert_model = TFBertModel.from_pretrained(model_name, config = config)

Now that our model has been loaded, we can start building and tuning it for our dataset and task using the functional API of Keras.

In [24]:
### ------- Build the model ------- ###

# Load the MainLayer
bert = transformer_bert_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers BERT model as a layer in a Keras model
bert_model = bert(inputs)[1]
dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
pooled_output = dropout(bert_model, training=False)

# Then build your model output
Ratings = Dense(units=len(data_train.Rating_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='Rating')(pooled_output)
outputs = {'Rating': Ratings}

# And combine it all in a model object
model = Model(inputs=inputs, outputs=outputs, name='BERT_MultiClass')

# Take a look at the model
model.summary()

The next cell covers the tokenization of the training sentences, the conversion of the label to categorical (one-hot) format and, finally, the model training.

In [25]:
### ------- Train the model ------- ###

# Set an optimizer
optimizer = Adam(learning_rate=5e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)

# Set loss and metrics
loss = {'Rating': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_train = to_categorical(data_train['Rating'])

# Tokenize the input (takes some time)
x_train = tokenizer(
          text=data_train['Review'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)
# Fit the model
history = model.fit(
    x={'input_ids': x_train['input_ids']},
    y={'Rating': y_train},
    
    batch_size=64,
    epochs=10,
    verbose=1)
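To make clear what truncation and padding do to each review before it reaches the model, here is a toy sketch in plain Python. The helper `pad_or_truncate` and the token ids are hypothetical. This is not the HuggingFace implementation, which among other things keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad token_ids to exactly max_length and build its mask."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([101, 2307, 3309, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence ends up with the same length, which is what the fixed `shape=(max_length,)` of the Keras `Input` layer expects.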

The model took 31 minutes and 16 seconds to train for 2 epochs.

# **Evaluate on Train+Test set(BERT):**

We will compute the metrics on the training set in order to get an idea of the model's performance.

In [26]:
model_eval = model.evaluate(
    x={'input_ids': x_train['input_ids']},
    y={'Rating': y_train}
)

In [27]:
y_train_predicted = model.predict(
    x={'input_ids': x_train['input_ids']},
)

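y_train_predicted is a dictionary whose 'Rating' entry is a NumPy array with one row per instance and one column per class. Each row holds the raw scores (logits) the model assigns to the classes, while the actual labels in y_train are one-hot encoded with the same shape. Let's see them in detail: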

In [28]:
y_train_predicted['Rating']

In [29]:
y_train

In order to compute the classification report and confusion matrix we will convert the matrices to one column representing the argmax for each row:

In [30]:
y_train_pred_max=[np.argmax(i) for i in y_train_predicted['Rating']]

In [31]:
y_train_actual_max=[np.argmax(i) for i in y_train]
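The same conversion can be done in one vectorized call with `argmax(axis=1)`. The sketch below also shows how a softmax would turn the raw scores into probabilities. The `toy_logits` values are made up, and the `softmax` helper is only for illustration, since the notebook works with the raw outputs directly.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Two instances, three classes (made-up scores)
toy_logits = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

probs = softmax(toy_logits)
# Softmax keeps the ordering, so argmax on logits or probabilities agrees
pred_classes = toy_logits.argmax(axis=1)
print(pred_classes, probs.sum(axis=1))
```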

In [32]:
from sklearn.metrics import classification_report

report = classification_report(y_train_actual_max, y_train_pred_max)  # order is (y_true, y_pred)

print(report)

Because the classes in our dataset are unbalanced, the predictions are strongly biased towards the most frequent class, in this case (2: 'Neutral'). As a result, the model performs poorly when predicting classes 0 or 4, which makes it almost useless for this task. Below we can see that the number of misclassifications for these 2 classes is huge.

In [33]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_train_actual_max, y_train_pred_max), display_labels=np.unique(y_train_actual_max))
disp.plot(cmap='Blues') 
plt.grid(False)
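When reading the plot, it helps to remember that scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, so rows are true classes and columns are predicted classes. The hand-rolled sketch below follows the same convention on made-up labels and derives per-class recall and precision from it. The `confusion` helper is hypothetical and only meant to show the bookkeeping.

```python
import numpy as np


def confusion(y_true, y_pred, n_classes):
    """Count matrix with true classes as rows and predictions as columns."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


toy_true = [0, 0, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 2, 2, 0]
cm = confusion(toy_true, toy_pred, n_classes=3)

# Recall divides by the row totals, precision by the column totals
recall = cm.diagonal() / cm.sum(axis=1)
precision = cm.diagonal() / cm.sum(axis=0)
print(cm)
print(recall, precision)
```

Swapping the two arguments transposes the matrix, which silently exchanges precision and recall in the report.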

# **Inference(BERT):**

In this step we will predict the classes of the held-out test set instances. Because the dataset is large, we can expect almost the same performance.

In [34]:
x_test = tokenizer(
          text=df_test['Review'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = False,
          verbose = True)

In [35]:
label_predicted = model.predict(
    x={'input_ids': x_test['input_ids']},
)

In [36]:
label_predicted['Rating']

In [37]:
label_pred_max=[np.argmax(i) for i in label_predicted['Rating']]

In [38]:
label_pred_max[:10]

We will build the next 3 models the same way as the previous one. Notice that some lines include extra functions specific to each model:

In [39]:
#Cheers!!!

# **XLNet:**

The tokenizer corresponding to XLNet requires an extra library called sentencepiece which we have to install and import as follows:

In [40]:
!pip install sentencepiece 

In [41]:
from transformers import XLNetTokenizer, TFXLNetModel, XLNetConfig
import sentencepiece

In [42]:
### --------- Setup XLNet ---------- ###

model_name = 'xlnet-base-cased'

# Max length of tokens
max_length = 100

# Load transformers config and set output_hidden_states to False
config = XLNetConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load XLNet tokenizer
tokenizer = XLNetTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the XLNet model
transformer_xlnet_model = TFXLNetModel.from_pretrained(model_name, config = config)

In [43]:
from tensorflow.keras.layers import Input  # same namespace as the other Keras imports
import tensorflow as tf

In [44]:
### ------- Build the model ------- ###

# Load the MainLayer
xlnet = transformer_xlnet_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers XLNet model as a layer in a Keras model
xlnet_model = xlnet(inputs)[0]
xlnet_model = tf.squeeze(xlnet_model[:, -1:, :], axis=1)
dropout = Dropout(0.1, name='pooled_output')
pooled_output = dropout(xlnet_model, training=False)

# Then build your model output
Ratings = Dense(units=len(data_train.Rating_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='Rating')(pooled_output)
outputs = {'Rating': Ratings}

# And combine it all in a model object
model4 = Model(inputs=inputs, outputs=outputs, name='XLNet_MultiClass')

# Take a look at the model
model4.summary()
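The slicing in the cell above keeps only the hidden state of the last position, because XLNet places its classification token at the end of the sequence. The NumPy sketch below shows the same indexing on a made-up array of shape (batch, sequence, hidden), next to the first-position variant used by models whose summary token comes first. The array values are arbitrary.

```python
import numpy as np

# Made-up "hidden states": 2 sequences, 4 positions, 3 features each
batch, seq_len, hidden = 2, 4, 3
hidden_states = np.arange(batch * seq_len * hidden).reshape(batch, seq_len, hidden)

# First position of every sequence -> shape (batch, hidden)
first_token = hidden_states[:, 0, :]

# Last position: slicing with -1: keeps the axis, squeeze removes it again
last_token = hidden_states[:, -1:, :].squeeze(axis=1)

print(first_token.shape, last_token.shape)
```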

In [45]:
### ------- Train the model ------- ###

# Set an optimizer
optimizer = Adam(learning_rate=5e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)

# Set loss and metrics
loss = {'Rating': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model4.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_train = to_categorical(data_train['Rating'])

# Tokenize the input (takes some time)
x_train = tokenizer(
          text=data_train['Review'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = False,
          verbose = True)
# Fit the model
history = model4.fit(
    x={'input_ids': x_train['input_ids']},
    y={'Rating': y_train},
    
    batch_size=64,
    epochs=10,
    verbose=1)

The model took 31 minutes and 16 seconds to train for 2 epochs.

# **Evaluate on Train+Test set(XLNET):**

In [46]:
model_eval = model4.evaluate(
    x={'input_ids': x_train['input_ids']},
    y={'Rating': y_train}
)

In [47]:
y_train_predicted = model4.predict(
    x={'input_ids': x_train['input_ids']},
)

In [48]:
y_train_predicted['Rating']

In [49]:
y_train

In [50]:
y_train_pred_max=[np.argmax(i) for i in y_train_predicted['Rating']]

In [51]:
y_train_actual_max=[np.argmax(i) for i in y_train]

In [52]:
from sklearn.metrics import classification_report

report = classification_report(y_train_actual_max, y_train_pred_max)  # order is (y_true, y_pred)

print(report)

In [53]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_train_actual_max, y_train_pred_max), display_labels=np.unique(y_train_actual_max))
disp.plot(cmap='Blues') 
plt.grid(False)

# **Inference(XLNET):**

In [54]:
x_test = tokenizer(
          text=df_test['Review'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = False,
          verbose = True)

In [55]:
label_predicted = model4.predict(
    x={'input_ids': x_test['input_ids']},
)

In [56]:
label_predicted['Rating']

In [57]:
label_pred_max=[np.argmax(i) for i in label_predicted['Rating']]

In [58]:
label_pred_max[:10]

We will build the next 2 models the same way as the previous one. Notice that some lines include extra functions specific to each model:

In [59]:
#Cheers!!!

# **RoBERTa:**

In [60]:
from transformers import RobertaTokenizer, TFRobertaModel, RobertaConfig  

In [61]:
### --------- Setup Roberta ---------- ###

model_name = 'roberta-base'

# Max length of tokens
max_length = 100

# Load transformers config and set output_hidden_states to False
config = RobertaConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load Roberta tokenizer
tokenizer = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the Roberta model
transformer_roberta_model = TFRobertaModel.from_pretrained(model_name, config = config)

In [62]:
### ------- Build the model ------- ###

# Load the MainLayer
roberta = transformer_roberta_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers RoBERTa model as a layer in a Keras model
roberta_model = roberta(inputs)[1]
dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
pooled_output = dropout(roberta_model, training=False)

# Then build your model output
Ratings = Dense(units=len(data_train.Rating_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='Rating')(pooled_output)
outputs = {'Rating': Ratings}

# And combine it all in a model object
model2 = Model(inputs=inputs, outputs=outputs, name='RoBERTa_MultiClass')

# Take a look at the model
model2.summary()

In [63]:
### ------- Train the model ------- ###

# Set an optimizer
optimizer = Adam(learning_rate=5e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)

# Set loss and metrics
loss = {'Rating': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model2.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_train = to_categorical(data_train['Rating'])

# Tokenize the input (takes some time)
x_train = tokenizer(
          text=data_train['Review'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)
# Fit the model
history = model2.fit(
    x={'input_ids': x_train['input_ids']},
    y={'Rating': y_train},
  
    batch_size=64,
    epochs=10,
    verbose=1)

The model took 26 minutes to train for 2 epochs.

# **Evaluate on Train+Test set(RoBERTa):**

In [64]:
model_eval = model2.evaluate(
    x={'input_ids': x_train['input_ids']},
    y={'Rating': y_train}
)

In [65]:
y_train_predicted = model2.predict(
    x={'input_ids': x_train['input_ids']},
)

In [66]:
y_train_predicted['Rating']

In [67]:
y_train

In [68]:
y_train_pred_max=[np.argmax(i) for i in y_train_predicted['Rating']]

In [69]:
y_train_actual_max=[np.argmax(i) for i in y_train]

In [70]:
from sklearn.metrics import classification_report

report = classification_report(y_train_actual_max, y_train_pred_max)  # order is (y_true, y_pred)

print(report)

In [71]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_train_actual_max, y_train_pred_max), display_labels=np.unique(y_train_actual_max))
disp.plot(cmap='Blues') 
plt.grid(False)

# **Inference(RoBERTa):**

In [72]:
x_test = tokenizer(
          text=df_test['Review'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = False,
          verbose = True)

In [73]:
label_predicted = model2.predict(
    x={'input_ids': x_test['input_ids']},
)

In [74]:
label_predicted['Rating']

In [75]:
label_pred_max=[np.argmax(i) for i in label_predicted['Rating']]

In [76]:
label_pred_max[:10]

In [77]:
#Cheers!!!

# **DistilBERT:**

In [78]:
from transformers import DistilBertTokenizer, TFDistilBertModel, DistilBertConfig 

In [79]:
### --------- Setup DistilBERT ---------- ###

model_name = 'distilbert-base-uncased'

# Max length of tokens
max_length = 100

# Load transformers config and set output_hidden_states to False
config = DistilBertConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load Distilbert tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the Distilbert model
transformer_distilbert_model = TFDistilBertModel.from_pretrained(model_name, config = config)

In [80]:
### ------- Build the model ------- ###

# Load the MainLayer
distilbert = transformer_distilbert_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers DistilBERT model as a layer in a Keras model
distilbert_model = distilbert(inputs)[0][:,0,:]
dropout = Dropout(0.1, name='pooled_output')
pooled_output = dropout(distilbert_model, training=False)

# Then build your model output
Ratings = Dense(units=len(data_train.Rating_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='Rating')(pooled_output)
outputs = {'Rating': Ratings}

# And combine it all in a model object
model3 = Model(inputs=inputs, outputs=outputs, name='DistilBERT_MultiClass')

# Take a look at the model
model3.summary()

In [81]:
### ------- Train the model ------- ###

# Set an optimizer
optimizer = Adam(learning_rate=5e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)

# Set loss and metrics
loss = {'Rating': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model3.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_train = to_categorical(data_train['Rating'])

# Tokenize the input (takes some time)
x_train = tokenizer(
          text=data_train['Review'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)
# Fit the model
history = model3.fit(
    x={'input_ids': x_train['input_ids']},
    y={'Rating': y_train},
    
    batch_size=64,
    epochs=10,
    verbose=1)

The model took 14 minutes to train for 2 epochs.

# **Evaluate on Train+Test set(DistilBERT):**

In [82]:
model_eval = model3.evaluate(
    x={'input_ids': x_train['input_ids']},
    y={'Rating': y_train}
)

In [83]:
y_train_predicted = model3.predict(
    x={'input_ids': x_train['input_ids']},
)

In [84]:
y_train_predicted['Rating']

In [85]:
y_train

In [86]:
y_train_pred_max=[np.argmax(i) for i in y_train_predicted['Rating']]

In [87]:
y_train_actual_max=[np.argmax(i) for i in y_train]

In [88]:
from sklearn.metrics import classification_report

report = classification_report(y_train_actual_max, y_train_pred_max)  # order is (y_true, y_pred)

print(report)

In [89]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_train_actual_max, y_train_pred_max), display_labels=np.unique(y_train_actual_max))
disp.plot(cmap='Blues') 
plt.grid(False)

# **Inference(DistilBERT):**

In [90]:
x_test = tokenizer(
          text=df_test['Review'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = False,
          verbose = True)

In [91]:
label_predicted = model3.predict(
    x={'input_ids': x_test['input_ids']},
)

In [92]:
label_predicted['Rating']

In [93]:
label_pred_max=[np.argmax(i) for i in label_predicted['Rating']]

In [94]:
label_pred_max[:10]

In [95]:
#Cheers!!!

# **Discussion:**

In general, the performance of the four models was similar, which supports the idea that BERT is the middle ground in the trade-off between accuracy and training time. DistilBERT was the fastest by far but had lower accuracy than BERT; as explained by HuggingFace, it achieves 95% of BERT's accuracy. Finally, RoBERTa and XLNet were the models with the highest accuracy and, at the same time, the slowest.

I have submitted the predictions on the testing set for all models. The best one was RoBERTa, reaching 68.62% accuracy, and the lowest was DistilBERT, reaching 67.90%. The difference looks slight, but in terms of the number of misclassifications the gap is huge. However, the big challenge of the current task is how to deal with an unbalanced dataset. This is the main and perhaps the only reason why we have such poor performance even with the best model. Increasing the max_length of the sequences can also increase the accuracy a little, but not significantly. The method I would apply to solve this problem is undersampling, in which we reduce every class to the size of the least frequent one, which has 7,072 instances (Negative). That number is not too small, and with 5 classes the dataset would end up with 35,360 sentences. The obvious drawback is that we randomly discard a big portion of the data.
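A minimal sketch of the undersampling idea, on a made-up toy frame rather than the real data, could look like this. Every class is reduced at random to the size of the smallest one and the result is shuffled. The column names mirror the notebook, but the rows and the variable names `smallest` and `balanced` are hypothetical.

```python
import pandas as pd

# Toy frame: class 2 dominates, classes 0 and 4 are rare (made-up data)
toy = pd.DataFrame({
    "Review": [f"review {i}" for i in range(9)],
    "Rating": [2, 2, 2, 2, 2, 0, 0, 4, 4],
})

# Size of the least frequent class
smallest = toy["Rating"].value_counts().min()

balanced = (
    toy.groupby("Rating")
       .sample(n=smallest, random_state=0)   # same number of rows per class
       .sample(frac=1, random_state=0)       # shuffle the rows
       .reset_index(drop=True)
)
print(balanced["Rating"].value_counts().to_dict())
```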

Another possible solution could be to get rid of reviews which are "vague", such as those with only one or two words classified as neutral. Those really do not add much to the training, although there are others with just a couple of words that are useful. This process would take a long time because it has to be done one by one, but it should address the problem.

I should also mention that I trained each model for more than 2 epochs, but the accuracy didn't increase, or even decreased, after the 3rd or 4th epoch. This is why I set the number of epochs to 2, which avoids the need for more complex functions or early stopping.
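For readers who prefer not to fix the number of epochs by hand, the logic behind early stopping is small. The sketch below is a plain-Python illustration with made-up validation losses, and `stop_epoch` is a hypothetical helper. In Keras, the `EarlyStopping` callback (with its `monitor`, `patience` and `restore_best_weights` arguments) plays this role, although it needs validation data passed to `fit`, which this notebook does not do.

```python
def stop_epoch(val_losses, patience=1):
    """Return (epoch at which training stops, best epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best value: remember it and keep training
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up losses that improve until epoch 2 and then get worse
toy_losses = [0.95, 0.82, 0.84, 0.90, 0.97]
stopped_at, best = stop_epoch(toy_losses, patience=1)
print(stopped_at, best)
```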

I would appreciate any feedback on how to increase the performance of the models, or let me know if you find a different one that works even better!

If you liked this notebook and want to see more projects/tutorials like this one, I would very much appreciate your upvote. I encourage you to look at my projects portfolio; I am sure you will love it.

Thank you!