# NLP Disaster Tweets Classifier with Transformers

Welcome! In this project I apply to the Disaster Tweets dataset the approach from my earlier project, "Best Sentiment Classifier Transformers", where I showed in detail how to implement four well-known transformer models with the HuggingFace transformers library and the Keras API.

The task is binary text classification for the Disaster Tweets competition. The training set contains 7,613 instances and the test set contains 3,263, each of which we have to classify as "Disaster" or "non-Disaster".

I also encourage you to see my project "Best NLP Disaster Tweets Classifier!", which explains corpus processing and NLU in detail. Here we focus mainly on implementing state-of-the-art transformer models and comparing their performance.


In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
sns.set(style='whitegrid')

from wordcloud import WordCloud

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.metrics import classification_report,confusion_matrix

from collections import defaultdict
from collections import Counter

import re
import gensim
import string

from tqdm import tqdm
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM,Dense, SpatialDropout1D, Dropout
from keras.initializers import Constant

import tensorflow as tf

import warnings
warnings.simplefilter('ignore')

Let's start by reading the three CSV files (training, testing and sample submission), then run a few functions to learn more about our data.

In [None]:
df=pd.read_csv('../input/nlp-getting-started/train.csv')
df_test=pd.read_csv('../input/nlp-getting-started/test.csv')
sample_submission=pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

In [None]:
df.shape, df_test.shape

In [None]:
df

In [None]:
df.loc[:,['text','target']]

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.target.value_counts()

The class distribution above is slightly unbalanced, so we should expect predictions to be somewhat biased towards class 0.
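To put the imbalance in numbers, a minimal sketch like the following computes each class's share and the accuracy a model would get by always predicting the majority class. The function name `class_balance` and the toy labels are assumptions made for illustration; they are not the real dataset counts.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the labels and the majority-class share."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    # Always predicting the most frequent class yields this accuracy
    return shares, max(shares.values())

# Hypothetical toy labels (NOT the real dataset): 6 non-disaster, 4 disaster
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
shares, baseline = class_balance(toy_labels)
print(shares)    # {0: 0.6, 1: 0.4}
print(baseline)  # 0.6
```

Any classifier trained later should beat this majority-class baseline to be considered useful.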

In [None]:
# Map numeric targets to readable names and count each class
pie1 = df['target'].map({1: 'disaster', 0: 'non-disaster'}).value_counts()
# A Series uses its index as the slice labels, so no reset_index is needed
pie1.plot(kind='pie', title='Pie chart of Disaster/Non-disaster tweets',
          autopct='%1.1f%%', shadow=False, legend=False, fontsize=14, figsize=(12,12))

Now let's look at the number of words per tweet. To understand it better we plot a histogram for each class:


In [None]:
f, (ax1, ax2,) = plt.subplots(1,2,figsize=(25,8))

ax1.hist(df[df['target'] == 0]['text'].str.split().map(lambda x: len(x)), bins=29, color='b')
ax1.set_title('Non-disaster tweets')

ax2.hist(df[df['target'] == 1]['text'].str.split().map(lambda x: len(x)), bins=29, color='r')
ax2.set_title('Disaster tweets')

f.suptitle('Histogram number of words in tweets')

Both distributions are roughly bell-shaped and look similar. The longest tweet in the dataset appears to be a non-disaster tweet of around 31 words; let's confirm the maximum with the max() function:

In [None]:
df['text'].str.split().map(lambda x: len(x)).max()

The maximum is indeed 31 words, so with word-level tokenization max_length would be 31. Transformers, however, use sub-word tokenization, which can raise the sequence length to 40 or more depending on the words used. We have to take that into account when modeling, since longer sequences can make training significantly slower, so we need a trade-off between training time and performance.
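To see why sub-word tokenization lengthens a sequence, here is a minimal sketch of a greedy longest-match split in the style of WordPiece. The function `wordpiece` and the tiny `toy_vocab` are assumptions made for illustration; a real tokenizer uses a vocabulary of tens of thousands of pieces and extra normalization steps.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, only for this example
toy_vocab = {"wild", "##fire", "##s", "near", "the", "lake"}
words = "wildfires near the lake".split()
tokens = [piece for w in words for piece in wordpiece(w, toy_vocab)]

print(len(words), words)    # 4 words
print(len(tokens), tokens)  # 6 tokens: 'wildfires' became 'wild', '##fire', '##s'
```

The model also adds special tokens (such as [CLS] and [SEP] for BERT), so the final sequence is longer still than the sub-word count.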

In [None]:
# Word count of every tweet, computed once
word_counts = df['text'].str.split().map(len)

# The comparison is >=, so the message says "at least"
for threshold in (10, 15, 20, 25):
    print(f'Number of sentences which contain at least {threshold} words: ', (word_counts >= threshold).sum())
    print(' ')
print('Number of sentences which contain 31 words: ', (word_counts == 31).sum())
print(' ')

Above we see that 350 tweets contain 25 words or more and only 3 tweets have 31 words. Our dataset is small (7,613 instances), and the cleaning process below removes a large share of the useless parts of the sentences, so we can expect that after tokenization the sequences will not be much longer than 31 tokens. Below are the three tweets containing 31 words; note the misspelled words, emojis and acronyms, some of which can be decomposed into sub-words:

In [None]:
print(df.loc[954,'text'])
print(' ')
print(df.loc[4432,'text'])
print(' ')
print(df.loc[5005,'text'])

## Cleaning:

The tweets in the dataset are almost raw, so we have to remove 'impurities' such as tags, symbols, punctuation and emojis. These do not add significant information for the prediction and can make our sentences more subjective. The process comprises 6 key steps that make our sentences suitable for training the model.

### Removing URLs: 
Some tweets, disaster or non-disaster, include links (URLs) to videos or other webpages with key information about the subject. As we want to clean the sentences, we remove them. The function that applies this step is called remove_URL:

In [None]:
example="New competition launched :https://www.kaggle.com/c/nlp-getting-started"

In [None]:
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

remove_URL(example)

In [None]:
df['text']=df['text'].apply(lambda x : remove_URL(x))

In [None]:
df_test['text']=df_test['text'].apply(lambda x : remove_URL(x))

### Removing HTML tags:

Some tweets were obtained by web scraping, in which case the components of a post are accompanied by HTML tags identifying them. As these tags are not useful, we remove them to keep only the text. The function that applies this step is called remove_html:

In [None]:
example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""

In [None]:
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)
    
print(remove_html(example))

In [None]:
df['text']=df['text'].apply(lambda x : remove_html(x))

In [None]:
df_test['text']=df_test['text'].apply(lambda x : remove_html(x))

### Removing Emojis:

Emojis are an efficient way to show the publisher's feelings, and we could translate them into words to enrich the message. However, they may confuse the algorithm when the same feeling appears in both disaster and non-disaster tweets, so we prefer to remove them. The function that applies this step is called remove_emoji:

In [None]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔")

In [None]:
df['text']=df['text'].apply(lambda x: remove_emoji(x))

In [None]:
df_test['text']=df_test['text'].apply(lambda x: remove_emoji(x))

### Contractions and acronyms:
People worldwide use acronyms to write tweets faster. Some are misspelled and others can be split into words that make sense. Handling them is laborious and requires a long time looking up the meaning of each one. The function that replaces contractions and acronyms with the words they stand for is called cleaner:


In [None]:
def cleaner(tweet):
  # Acronyms and miswritten words
  tweet = re.sub(r"Typhoon-Devastated", "typhoon devastated", tweet)
  tweet = re.sub(r"TyphoonDevastated", "typhoon devastated", tweet)
  tweet = re.sub(r"typhoondevastated", "typhoon devastated", tweet)
  tweet = re.sub(r"MH370", "Malaysia Airlines Flight", tweet)
  tweet = re.sub(r"\bMH\b", "Malaysia Airlines Flight", tweet)  # whole token only
  tweet = re.sub(r"mh370", "Malaysia Airlines Flight", tweet)
  tweet = re.sub(r"year-old", "years old", tweet)
  tweet = re.sub(r"yearold", "years old", tweet)
  tweet = re.sub(r"yr old", "years old", tweet)
  tweet = re.sub(r"PKK", "Kurdistan Workers Party", tweet)
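  tweet = re.sub(r"\bMP\b", "madhya pradesh", tweet)  # whole token only
  tweet = re.sub(r"\brly\b", "railway", tweet)  # \b avoids matching inside "early"
  tweet = re.sub(r"CDT", "Central Daylight Time", tweet)
  tweet = re.sub(r"sensorsenso", "sensor senso", tweet)
  tweet = re.sub(r"(?<![A-Za-z])pm(?![A-Za-z])", "", tweet)  # matches "5pm" but not "equipment"
  tweet = re.sub(r"(?<![A-Za-z])PM(?![A-Za-z])", "", tweet)
  tweet = re.sub(r"\bnan\b", " ", tweet)  # \b avoids matching inside "financial"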
  tweet = re.sub(r"terrorismturn", "terrorism turn", tweet)
  tweet = re.sub(r"epicente", "epicenter", tweet)
  tweet = re.sub(r"epicenterr", "epicenter", tweet)
  tweet = re.sub(r"WAwildfire", "Washington Wildfire", tweet)
  tweet = re.sub(r"prebreak", "pre break", tweet)
  tweet = re.sub(r"nowplaying", "now playing", tweet)
  tweet = re.sub(r"\bRT\b", "retweet", tweet)  # whole token only
  tweet = re.sub(r"EbolaOutbreak", "Ebola Outbreak", tweet)
  tweet = re.sub(r"LondonFire", "London Fire", tweet)
  tweet = re.sub(r"IDFire", "Idaho Fire", tweet)
  tweet = re.sub(r"withBioterrorism&use", "with Bioterrorism & use", tweet)
  tweet = re.sub(r"NASAHurricane", "NASA Hurricane", tweet)
  tweet = re.sub(r"withweapons", "with weapons", tweet)
  tweet = re.sub(r"NuclearPower", "Nuclear Power", tweet)
  tweet = re.sub(r"WhiteTerrorism", "White Terrorism", tweet)
  tweet = re.sub(r"MyanmarFlood", "Myanmar Flood", tweet)
  tweet = re.sub(r"ExtremeWeather", "Extreme Weather", tweet)

  # Special characters
  tweet = re.sub(r"%20", " ", tweet)
  tweet = re.sub(r"%", " ", tweet)
  tweet = re.sub(r"@", " ", tweet)
  tweet = re.sub(r"#", " ", tweet)
  # Apostrophes are kept here so the contraction rules below can match; remove_punct strips any that remain
  tweet = re.sub(r"\x89û_", " ", tweet)
  tweet = re.sub(r"\x89ûò", " ", tweet)
  tweet = re.sub(r"16yr", "16 year", tweet)
  tweet = re.sub(r"re\x89û_", " ", tweet)
  tweet = re.sub(r"\x89û", " ", tweet)
  tweet = re.sub(r"\x89Û", " ", tweet)
  tweet = re.sub(r"re\x89Û", "re ", tweet)
  tweet = re.sub(r"re\x89û", "re ", tweet)
  tweet = re.sub(r"\x89ûª", "'", tweet)
  tweet = re.sub(r"\x89û", " ", tweet)
  tweet = re.sub(r"\x89ûò", " ", tweet)
  tweet = re.sub(r"\x89Û_", "", tweet)
  tweet = re.sub(r"\x89ÛÒ", "", tweet)
  tweet = re.sub(r"\x89ÛÓ", "", tweet)
  tweet = re.sub(r"\x89ÛÏWhen", "When", tweet)
  tweet = re.sub(r"\x89ÛÏ", "", tweet)
  tweet = re.sub(r"China\x89Ûªs", "China's", tweet)
  tweet = re.sub(r"let\x89Ûªs", "let's", tweet)
  tweet = re.sub(r"\x89Û÷", "", tweet)
  tweet = re.sub(r"\x89Ûª", "", tweet)
  tweet = re.sub(r"\x89Û\x9d", "", tweet)
  tweet = re.sub(r"å_", "", tweet)
  tweet = re.sub(r"\x89Û¢", "", tweet)
  tweet = re.sub(r"\x89Û¢åÊ", "", tweet)
  tweet = re.sub(r"fromåÊwounds", "from wounds", tweet)
  tweet = re.sub(r"åÊ", "", tweet)
  tweet = re.sub(r"åÈ", "", tweet)
  tweet = re.sub(r"JapÌ_n", "Japan", tweet)    
  tweet = re.sub(r"Ì©", "e", tweet)
  tweet = re.sub(r"å¨", "", tweet)
  tweet = re.sub(r"SuruÌ¤", "Suruc", tweet)
  tweet = re.sub(r"åÇ", "", tweet)
  tweet = re.sub(r"å£3million", "3 million", tweet)
  tweet = re.sub(r"åÀ", "", tweet)

  # Contractions
  tweet = re.sub(r"he's", "he is", tweet)
  tweet = re.sub(r"there's", "there is", tweet)
  tweet = re.sub(r"We're", "We are", tweet)
  tweet = re.sub(r"That's", "That is", tweet)
  tweet = re.sub(r"won't", "will not", tweet)
  tweet = re.sub(r"they're", "they are", tweet)
  tweet = re.sub(r"Can't", "Cannot", tweet)
  tweet = re.sub(r"wasn't", "was not", tweet)
  tweet = re.sub(r"don\x89Ûªt", "do not", tweet)
  tweet = re.sub(r"aren't", "are not", tweet)
  tweet = re.sub(r"isn't", "is not", tweet)
  tweet = re.sub(r"What's", "What is", tweet)
  tweet = re.sub(r"haven't", "have not", tweet)
  tweet = re.sub(r"hasn't", "has not", tweet)
  tweet = re.sub(r"There's", "There is", tweet)
  tweet = re.sub(r"He's", "He is", tweet)
  tweet = re.sub(r"It's", "It is", tweet)
  tweet = re.sub(r"You're", "You are", tweet)
  tweet = re.sub(r"I'M", "I am", tweet)
  tweet = re.sub(r"\bIm\b", "I am", tweet)  # whole word only, not "Image"
  tweet = re.sub(r"shouldn't", "should not", tweet)
  tweet = re.sub(r"wouldn't", "would not", tweet)
  tweet = re.sub(r"i'm", "I am", tweet)
  tweet = re.sub(r"I\x89Ûªm", "I am", tweet)
  tweet = re.sub(r"I'm", "I am", tweet)
  tweet = re.sub(r"Isn't", "is not", tweet)
  tweet = re.sub(r"Here's", "Here is", tweet)
  tweet = re.sub(r"you've", "you have", tweet)
  tweet = re.sub(r"you\x89Ûªve", "you have", tweet)
  tweet = re.sub(r"we're", "we are", tweet)
  tweet = re.sub(r"what's", "what is", tweet)
  tweet = re.sub(r"couldn't", "could not", tweet)
  tweet = re.sub(r"we've", "we have", tweet)
  tweet = re.sub(r"it\x89Ûªs", "it is", tweet)
  tweet = re.sub(r"doesn\x89Ûªt", "does not", tweet)
  tweet = re.sub(r"It\x89Ûªs", "It is", tweet)
  tweet = re.sub(r"Here\x89Ûªs", "Here is", tweet)
  tweet = re.sub(r"who's", "who is", tweet)
  tweet = re.sub(r"I\x89Ûªve", "I have", tweet)
  tweet = re.sub(r"y'all", "you all", tweet)
  tweet = re.sub(r"can\x89Ûªt", "cannot", tweet)
  tweet = re.sub(r"would've", "would have", tweet)
  tweet = re.sub(r"it'll", "it will", tweet)
  tweet = re.sub(r"we'll", "we will", tweet)
  tweet = re.sub(r"wouldn\x89Ûªt", "would not", tweet)
  tweet = re.sub(r"We've", "We have", tweet)
  tweet = re.sub(r"he'll", "he will", tweet)
  tweet = re.sub(r"Y'all", "You all", tweet)
  tweet = re.sub(r"Weren't", "Were not", tweet)
  tweet = re.sub(r"Didn't", "Did not", tweet)
  tweet = re.sub(r"they'll", "they will", tweet)
  tweet = re.sub(r"they'd", "they would", tweet)
  tweet = re.sub(r"DON'T", "DO NOT", tweet)
  tweet = re.sub(r"That\x89Ûªs", "That is", tweet)
  tweet = re.sub(r"they've", "they have", tweet)
  tweet = re.sub(r"i'd", "I would", tweet)
  tweet = re.sub(r"should've", "should have", tweet)
  tweet = re.sub(r"You\x89Ûªre", "You are", tweet)
  tweet = re.sub(r"where's", "where is", tweet)
  tweet = re.sub(r"Don\x89Ûªt", "Do not", tweet)
  tweet = re.sub(r"we'd", "we would", tweet)
  tweet = re.sub(r"i'll", "I will", tweet)
  tweet = re.sub(r"weren't", "were not", tweet)
  tweet = re.sub(r"They're", "They are", tweet)
  tweet = re.sub(r"Can\x89Ûªt", "Cannot", tweet)
  tweet = re.sub(r"you\x89Ûªll", "you will", tweet)
  tweet = re.sub(r"I\x89Ûªd", "I would", tweet)
  tweet = re.sub(r"let's", "let us", tweet)
  tweet = re.sub(r"it's", "it is", tweet)
  tweet = re.sub(r"can't", "can not", tweet)
  tweet = re.sub(r"\bcant\b", "can not", tweet)  # whole word only, not "significant"
  tweet = re.sub(r"don't", "do not", tweet)
  tweet = re.sub(r"\bdont\b", "do not", tweet)
  tweet = re.sub(r"you're", "you are", tweet)
  tweet = re.sub(r"i've", "I have", tweet)
  tweet = re.sub(r"that's", "that is", tweet)
  tweet = re.sub(r"i'll", "I will", tweet)
  tweet = re.sub(r"doesn't", "does not", tweet)
  tweet = re.sub(r"i'd", "I would", tweet)
  tweet = re.sub(r"didn't", "did not", tweet)
  tweet = re.sub(r"ain't", "am not", tweet)
  tweet = re.sub(r"you'll", "you will", tweet)
  tweet = re.sub(r"I've", "I have", tweet)
  tweet = re.sub(r"Don't", "do not", tweet)
  tweet = re.sub(r"I'll", "I will", tweet)
  tweet = re.sub(r"I'd", "I would", tweet)
  tweet = re.sub(r"Let's", "Let us", tweet)
  tweet = re.sub(r"you'd", "You would", tweet)
  tweet = re.sub(r"It's", "It is", tweet)
  tweet = re.sub(r"Ain't", "am not", tweet)
  tweet = re.sub(r"Haven't", "Have not", tweet)
  tweet = re.sub(r"Could've", "Could have", tweet)
  tweet = re.sub(r"youve", "you have", tweet)  
  tweet = re.sub(r"donå«t", "do not", tweet)

  return tweet

In [None]:
df['text'] = df['text'].apply(lambda s : cleaner(s))

In [None]:
df_test['text'] = df_test['text'].apply(lambda s : cleaner(s))
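The long list of substitutions above can also be driven by a lookup table, which keeps the rules in one place and guards against matching inside longer words. The following is only a sketch of that alternative, not the function used in this notebook; `REPLACEMENTS` is a hypothetical, abbreviated table and `expand` is an illustrative name.

```python
import re

# Hypothetical, abbreviated lookup table; a real one would be much longer
REPLACEMENTS = {
    "rly": "railway",
    "won't": "will not",
    "can't": "cannot",
    "i'm": "I am",
}

# One alternation, longest keys first; the lookarounds require that the match
# is not glued to other word characters (so "early" is left alone)
_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
    + r")(?!\w)",
    flags=re.IGNORECASE,
)

def expand(tweet):
    """Replace every table entry found in the tweet, ignoring case."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1).lower()], tweet)

print(expand("I'm early, the rly won't stop"))
# I am early, the railway will not stop
```

A single compiled pattern scans each tweet once, whereas the chain of re.sub calls scans it once per rule; the trade-off is that the replacement text no longer preserves the original capitalization.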

### Removing punctuations:

At this point only a few of the cleaned tweets still contain symbols and punctuation. As these do not add key information to the message we remove them. The function that applies this step is called remove_punct:


In [None]:
def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

example="I am a #king"
print(remove_punct(example))

In [None]:
df['text']=df['text'].apply(lambda x : remove_punct(x))

In [None]:
df_test['text']=df_test['text'].apply(lambda x : remove_punct(x))

### Removing multiple spaces:

Some cleaned sentences now contain various kinds of extra whitespace. It obviously adds nothing to the corpus, so we remove it with the following lines:

In [None]:
# Turn long dashes into spaces first
df['text'] = df['text'].str.replace(r'[—–]', ' ', regex=True)
# Collapse any run of whitespace (including \xa0) into one space and trim the ends
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True).str.strip()

In [None]:
# Same normalization for the test set
df_test['text'] = df_test['text'].str.replace(r'[—–]', ' ', regex=True)
df_test['text'] = df_test['text'].str.replace(r'\s+', ' ', regex=True).str.strip()

## Modeling

In another published project I showed several models for this problem. However, they are based on the bag-of-words representation, which is now somewhat dated because it does not use an attention mechanism, which has proven very powerful at capturing the meaning of sentences. This is why in this step we will build, train and compare the following attention-based models:

- BERT (Bidirectional Encoder Representations from Transformers)
- RoBERTa (Robustly Optimized BERT Pre-training Approach)
- DistilBERT (Distilled BERT)
- XLNet (Generalized Auto-Regressive model)

Each of these models has its pros and cons. BERT is the most widely used, being the middle ground in performance, whereas RoBERTa and XLNet are known for their better error metrics and DistilBERT for its faster training. We will consider all of these characteristics and choose the best one for our dataset.

We will start by installing the transformers library and importing the functions needed.

In [None]:
!pip install transformers

Then what we need from tensorflow.keras:

In [None]:
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical

import pandas as pd
from sklearn.model_selection import train_test_split

Now we have to gather from the dataset only the two columns useful for training (text and target), then let's create a new column corresponding to our label as categorical which will be useful later.

In [None]:
# Select required columns (copy so the assignments below do not write into a view of df)
data = df[['text', 'target']].copy()

# Set your model output as categorical and save in new label col
data['target_label'] = pd.Categorical(data['target'])

# Transform your output to numeric
data['target'] = data['target_label'].cat.codes

# BERT

As a first step we import the Model, Config and Tokenizer classes for BERT in order to build the model properly.

In [None]:
from transformers import TFBertModel,  BertConfig, BertTokenizerFast

The model we will use is 'bert-base-uncased' and the chosen max_length is 45 in order to cover even the longest possible sequence. Because the number of instances in the dataset is relatively small, training will not take too much time.

In [None]:
### --------- Setup BERT ---------- ###

# Name of the BERT model to use
model_name = 'bert-base-uncased'

# Max length of tokens
max_length = 45

# Load transformers config and set output_hidden_states to False
config = BertConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the Transformers BERT model
transformer_bert_model = TFBertModel.from_pretrained(model_name, config = config)
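Because the model input will have a fixed width of max_length, every tokenized tweet has to end up with exactly that many ids. The sketch below shows the idea in plain Python; `pad_or_truncate` is a hypothetical helper, the ids are made up, and pad_id=0 is an assumption. It is a simplification: a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token ids to exactly max_length entries."""
    clipped = token_ids[:max_length]                  # drop anything past max_length
    padding = [pad_id] * (max_length - len(clipped))  # fill the rest on the right
    return clipped + padding

# Made-up ids, only to show both cases
short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate(list(range(1, 9)), max_length=5)

print(short)  # [101, 7592, 102, 0, 0]
print(long)   # [1, 2, 3, 4, 5]
```

A larger max_length wastes computation on padding, while a smaller one cuts off the end of long tweets, which is the trade-off behind choosing 45 here.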

Now that our model has been loaded we can start building and tuning it for our dataset and task using the Keras functional API.

As we see below, the input layer has the max_length of the sequences and feeds the BERT model, followed by a dropout layer (rate 0.1) to reduce overfitting and finally a dense layer with as many neurons as classes in our label (2).

In [None]:
# Load the MainLayer
bert = transformer_bert_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers BERT model as a layer in a Keras model
bert_model = bert(inputs)[1]
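dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
# Note: training=False keeps this layer in inference mode, so the dropout is inactive even during fit
pooled_output = dropout(bert_model, training=False)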

# Then build your model output
targets = Dense(units=len(data.target_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='target')(pooled_output)
outputs = {'target': targets}

# And combine it all in a model object
model = Model(inputs=inputs, outputs=outputs, name='BERT_Binary_Classifier')

# Take a look at the model
model.summary()

The next cell trains the model. We set the optimizer, categorical cross-entropy as the loss function and accuracy as the metric, and pass them as arguments to model.compile.

In [None]:
### ------- Train the model ------- ###

from tensorflow.keras.optimizers import RMSprop,Adam,SGD,Adadelta

optimizer = Adam(learning_rate=6e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)

# Set loss and metrics
loss = {'target': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_target = to_categorical(data['target'])

# Tokenize the input (takes some time)
x_train = tokenizer(
            text=data['text'].to_list(),
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding='max_length',  # pad every sequence to max_length so it matches the Input shape
            return_tensors='tf',
            return_token_type_ids = False,
            return_attention_mask = True,
            verbose = True)

# Fit the model
history = model.fit(
    x={'input_ids': x_train['input_ids']},
    y={'target': y_target},
    validation_split=0.25,
    batch_size=64,
    epochs=1,
    verbose=1)

Training took 56 seconds with the following setup: Adam optimizer, 1 epoch and a 25% validation split, reaching train/validation accuracy of 75.1%/83.5%. This is the level of performance we expect from such a complex model, and thanks to the small dataset it trained quickly, so there are almost no disadvantages.
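For reference, the loss used above (categorical cross-entropy computed from logits) can be sketched in plain Python for a single instance. The helper names `softmax` and `categorical_crossentropy_from_logits` and the example logits are assumptions for illustration; Keras computes the same quantity in a batched and numerically fused way.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy_from_logits(one_hot, logits):
    """Negative log-probability assigned to the true class."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# True class is 1 ("Disaster"), encoded one-hot as to_categorical would do
undecided = categorical_crossentropy_from_logits([0, 1], [0.0, 0.0])
confident = categorical_crossentropy_from_logits([0, 1], [-2.0, 3.0])

print(undecided)  # ln(2), about 0.693: the model gives 50% to each class
print(confident)  # much smaller: most of the probability is on the true class
```

This is why the dense layer has no softmax activation: with from_logits=True the loss applies the softmax itself.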

## Inference

Finally, the following cells compute the predicted labels for the test (held-out) instances.

In [None]:
x_test = tokenizer(
          text=df_test['text'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)

In [None]:
label_predicted = model.predict(
    x={'input_ids': x_test['input_ids']},
)

label_predicted contains a key called 'target', the same name as our label. The array it holds is a matrix with one row of class scores per instance, where the highest value in each row marks the predicted class, so we have to apply argmax. First let us look at the predicted matrix:

In [None]:
label_predicted['target']

In [None]:
label_pred_max=[np.argmax(i) for i in label_predicted['target']]

In [None]:
label_pred_max[:10]
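The argmax step above can be illustrated on a tiny made-up score matrix. The values in `toy_logits` and the helper `argmax_rows` are assumptions for illustration, not output of the trained model.

```python
def argmax_rows(score_rows):
    """Index of the highest score in each row, i.e. the predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]

# Made-up scores: one row per tweet, one column per class (0 and 1)
toy_logits = [[1.2, -0.3], [-0.8, 0.9], [0.1, 0.4]]
toy_labels = argmax_rows(toy_logits)
toy_names = ["non-Disaster" if label == 0 else "Disaster" for label in toy_labels]

print(toy_labels)  # [0, 1, 1]
print(toy_names)   # ['non-Disaster', 'Disaster', 'Disaster']
```

Because the scores are logits rather than probabilities, their absolute values are not directly interpretable; only their order within a row matters for the predicted class.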

We will build the next 3 models the same way as the previous one; notice that some lines include extra steps specific to each model:

# RoBERTa

In [None]:
from transformers import RobertaTokenizer, TFRobertaModel, RobertaConfig 

In [None]:
### --------- Setup Roberta ---------- ###

model_name = 'roberta-base'

# Max length of tokens
max_length = 45

# Load transformers config and set output_hidden_states to False
config = RobertaConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load Roberta tokenizer
tokenizer = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the Roberta model
transformer_roberta_model = TFRobertaModel.from_pretrained(model_name, config = config)

In [None]:
### ------- Build the model ------- ###

# Load the MainLayer
roberta = transformer_roberta_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers RoBERTa model as a layer in a Keras model
roberta_model = roberta(inputs)[1]
dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
pooled_output = dropout(roberta_model, training=False)

# Then build your model output
targets = Dense(units=len(data.target_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='target')(pooled_output)
outputs = {'target': targets}

# And combine it all in a model object
model2 = Model(inputs=inputs, outputs=outputs, name='RoBERTa_Binary_Classifier')

# Take a look at the model
model2.summary()

In [None]:
### ------- Train the model ------- ###

optimizer = Adam(learning_rate=6e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)

# Set loss and metrics
loss = {'target': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model2.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_target = to_categorical(data['target'])

# Tokenize the input (takes some time)
x_train = tokenizer(
            text=data['text'].to_list(),
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding='max_length',  # pad every sequence to max_length so it matches the Input shape
            return_tensors='tf',
            return_token_type_ids = False,
            return_attention_mask = True,
            verbose = True)

# Fit the model
history = model2.fit(
    x={'input_ids': x_train['input_ids']},
    y={'target': y_target},
    validation_split=0.25,
    batch_size=64,
    epochs=3,
    verbose=1)

Training took 2 minutes 8 seconds with the following setup: Adam optimizer, 3 epochs and a 25% validation split, reaching train/validation accuracy of 85.0%/82.7%.

## Inference

In [None]:
x_test = tokenizer(
          text=df_test['text'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding='max_length',  # pad every sequence to max_length so it matches the Input shape
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)

In [None]:
label_predicted = model2.predict(
    x={'input_ids': x_test['input_ids']},
)

In [None]:
label_predicted['target']

In [None]:
label_pred_max=[np.argmax(i) for i in label_predicted['target']]

In [None]:
label_pred_max[:10]

# DistilBERT

In [None]:
from transformers import DistilBertTokenizer, TFDistilBertModel, DistilBertConfig 

In [None]:
### --------- Setup DistilBERT ---------- ###

model_name = 'distilbert-base-uncased'

# Max length of tokens
max_length = 45

# Load transformers config and set output_hidden_states to False
config = DistilBertConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load Distilbert tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the Distilbert model
transformer_distilbert_model = TFDistilBertModel.from_pretrained(model_name, config = config)

The default DistilBERT model has no pooling layer to convert the output from (None, 45, 768) to (None, 768). This is why we take the hidden state of the first token of 'layer 0' (index 0 along the second dimension, keeping the first and third dimensions) to get the required output shape. The next layers are the same as before:

In [None]:
# Load the MainLayer
distilbert = transformer_distilbert_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers DistilBERT model as a layer in a Keras model
distilbert_model = distilbert(inputs)[0][:,0,:]
dropout = Dropout(0.1, name='pooled_output')
pooled_output = dropout(distilbert_model, training=False)

# Then build your model output
targets = Dense(units=len(data.target_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='target')(pooled_output)
outputs = {'target': targets}

# And combine it all in a model object
model3 = Model(inputs=inputs, outputs=outputs, name='DistilBERT_Binary_Classifier')

# Take a look at the model
model3.summary()
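The slicing `[:, 0, :]` used in the build cell can be pictured with nested lists standing in for a tensor of shape (batch, sequence, hidden). This is only a sketch: the sizes are tiny, the numbers are made up, and `first_token` and `last_token` are hypothetical helper names.

```python
def first_token(batch):
    """(batch, seq, hidden) -> (batch, hidden): keep position 0 of each sequence."""
    return [sequence[0] for sequence in batch]

def last_token(batch):
    """Same reduction, but keeping the final position of each sequence."""
    return [sequence[-1] for sequence in batch]

# Made-up tensor: 2 tweets, 3 tokens each, hidden size 2
toy_hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],
]

print(first_token(toy_hidden))  # [[0.1, 0.2], [1.1, 1.2]]
print(last_token(toy_hidden))   # [[0.5, 0.6], [1.5, 1.6]]
```

Either way the sequence axis disappears, leaving one vector per tweet, which is the shape the final dense layer expects.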

In [None]:
### ------- Train the model ------- ###

# Set an optimizer
optimizer = Adam(learning_rate=6e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)

# Set loss and metrics
loss = {'target': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model3.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_target = to_categorical(data['target'])

# Tokenize the input (takes some time)
x_train = tokenizer(
    text=data['text'].to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding='max_length',  # pad every sequence to max_length so it matches the Input shape
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

# Fit the model
history = model3.fit(
    x={'input_ids': x_train['input_ids']},
    y={'target': y_target},
    validation_split=0.25,
    batch_size=64,
    epochs=1,
    verbose=1)

Training took 28 seconds with the following setup: Adam optimizer, 1 epoch and a 25% validation split, reaching train/validation accuracy of 79.4%/82.7%.
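Since the goal is to compare the models, the figures reported so far can be collected in one place. The numbers below are the ones stated above for BERT, RoBERTa and DistilBERT (the second accuracy of each pair, measured on the 25% validation split); the dictionary layout and the ranking rule are assumptions made for illustration.

```python
# Results reported above: training time in seconds, accuracies in percent
results = {
    "BERT":       {"seconds": 56,  "train_acc": 75.1, "val_acc": 83.5, "epochs": 1},
    "RoBERTa":    {"seconds": 128, "train_acc": 85.0, "val_acc": 82.7, "epochs": 3},
    "DistilBERT": {"seconds": 28,  "train_acc": 79.4, "val_acc": 82.7, "epochs": 1},
}

# Rank by held-out accuracy (higher first), breaking ties by training time
ranked = sorted(results, key=lambda name: (-results[name]["val_acc"], results[name]["seconds"]))

for name in ranked:
    r = results[name]
    print(f"{name:<11} val_acc={r['val_acc']:.1f}%  time={r['seconds']}s  epochs={r['epochs']}")
```

Keep in mind that the runs differ in number of epochs, so this is a rough comparison rather than a controlled benchmark.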

## Inference

In [None]:
x_test = tokenizer(
    text=df_test['text'].to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding='max_length',  # pad every sequence to max_length so it matches the Input shape
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

In [None]:
label_predicted = model3.predict(
    x={'input_ids': x_test['input_ids']},
)

In [None]:
label_predicted['target']

In [None]:
label_pred_max=[np.argmax(i) for i in label_predicted['target']]

In [None]:
label_pred_max[:10]
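
Because the loss was built with `from_logits=True`, the values in `label_predicted['target']` are raw logits rather than probabilities. The following is a minimal sketch, assuming a plain NumPy array of logits, of how they could be turned into class probabilities and labels in one vectorized step. The helper name `logits_to_predictions` and the example values are made up for illustration.

```python
import numpy as np

def logits_to_predictions(logits):
    """Return (probabilities, predicted_class) for a (n_samples, n_classes) array of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row-wise maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Vectorized equivalent of [np.argmax(i) for i in logits]
    return probs, probs.argmax(axis=1)

# Made-up logits for three tweets (illustrative values, not model output)
example_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-3.0, 4.0]])
probs, labels = logits_to_predictions(example_logits)
print(labels)
```

The predicted labels are the same as those from the list comprehension above, since softmax does not change which entry is largest. The probabilities are only useful if you want to inspect how confident the model is.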

# XLNet

The XLNet tokenizer requires an extra library called sentencepiece, which we have to install and import as follows:

In [None]:
!pip install sentencepiece 

In [None]:
from transformers import XLNetTokenizer, TFXLNetModel, XLNetConfig
import sentencepiece

In [None]:
### --------- Setup XLNet ---------- ###

model_name = 'xlnet-base-cased'

# Max length of tokens
max_length = 45

# Load transformers config and set output_hidden_states to False
config = XLNetConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load XLNet tokenizer
tokenizer = XLNetTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the XLNet model
transformer_xlnet_model = TFXLNetModel.from_pretrained(model_name, config = config)

As with DistilBERT, we have to reduce the output of the model's main layer to the appropriate shape (None, 768). In this case we keep only the last token and drop the extra axis with the tf.squeeze function, as can be seen below:

In [None]:
### ------- Build the model ------- ###

# Load the MainLayer
xlnet = transformer_xlnet_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers XLNet model as a layer in a Keras model
xlnet_model = xlnet(inputs)[0]
xlnet_model = tf.squeeze(xlnet_model[:, -1:, :], axis=1)
dropout = Dropout(0.1, name='pooled_output')
pooled_output = dropout(xlnet_model, training=False)

# Then build your model output
targets = Dense(units=len(data.target_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='target')(pooled_output)
outputs = {'target': targets}

# And combine it all in a model object
model4 = Model(inputs=inputs, outputs=outputs, name='XLNet_Binary_Classifier')

# Take a look at the model
model4.summary()
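
The slicing and squeezing step can be checked in isolation. This is a small sketch that uses a random NumPy array as a stand-in for the transformer output (the sizes 4, 45 and 768 are assumptions that mirror the batch, `max_length` and hidden size used here). It shows that keeping the last position with `-1:` and then squeezing axis 1 gives the same result as indexing that position directly.

```python
import numpy as np

batch_size, seq_len, hidden = 4, 45, 768
# Stand-in for the (batch, sequence, hidden) tensor returned by the transformer
hidden_states = np.random.default_rng(0).normal(size=(batch_size, seq_len, hidden))

# Keep the last position as a length-1 axis, then drop that axis
last_kept = hidden_states[:, -1:, :]          # shape (4, 1, 768)
squeezed = np.squeeze(last_kept, axis=1)      # shape (4, 768)

# Indexing with -1 instead of -1: gives the same values in one step
direct = hidden_states[:, -1, :]              # shape (4, 768)

print(last_kept.shape, squeezed.shape, direct.shape)
```

The last position is used because XLNet places its classification token at the end of the sequence, whereas BERT-style models place it at the start.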

In [None]:
### ------- Train the model ------- ###

optimizer = Adam(learning_rate=6e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)

# Set loss and metrics
loss = {'target': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model4.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_target = to_categorical(data['target'])

# Tokenize the input (takes some time)
x_train = tokenizer(
            text=data['text'].to_list(),
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding='max_length',  # pad every row to max_length so it matches the Input shape
            return_tensors='tf',
            return_token_type_ids = False,
            return_attention_mask = True,
            verbose = True)

# Fit the model
history = model4.fit(
    x={'input_ids': x_train['input_ids']},
    y={'target': y_target},
    validation_split=0.25,
    batch_size=64,
    epochs=1,
    verbose=1)

The model took 59 seconds to train and reached a training/validation accuracy of 76.0%/82.3% (validation split of 25%, Adam optimizer, 1 epoch).

## Inference

In [None]:
x_test = tokenizer(
    text=df_test['text'].to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding='max_length',  # pad every row to max_length so it matches the Input shape
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

In [None]:
label_predicted = model4.predict(
    x={'input_ids': x_test['input_ids']},
)

In [None]:
label_predicted['target']

In [None]:
label_pred_max=[np.argmax(i) for i in label_predicted['target']]

In [None]:
label_pred_max[:10]

Submission with pre-trained model:

In [None]:
sample_submission['target'] = label_pred_max
sample_submission.head(10)

In [None]:
sample_submission.to_csv("submission.csv", index=False, header=True)

# Discussion

In general, the performance of the four models was similar. Based on the experience from the previous project, we can also endorse the idea that BERT is a good middle ground in the trade-off between accuracy and training time. DistilBERT was by far the fastest but less accurate than BERT; as HuggingFace explains, it retains about 95% of BERT's performance. Finally, RoBERTa and XLNet were the slowest models, yet they did not reach the highest validation accuracy, which belongs to BERT.

I have submitted the predictions on the testing set for all models: the best one was BERT, reaching 83.12% accuracy, and the lowest was DistilBERT, reaching 81.98%. The difference is slight, but because we are dealing with a small dataset such a gap becomes more important. As in the previous project, the main reason for the misclassifications is that the labels are (slightly) unbalanced, which skews our predictions. We should therefore find a proper method to solve this problem, either oversampling or undersampling.
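
As an illustration of the oversampling idea, here is a minimal sketch of random oversampling with NumPy. The function `oversample_indices` and the toy labels are hypothetical and are not part of this notebook's pipeline. The function returns row indices in which every class appears as often as the largest one, so the same indices could be used to select both texts and labels.

```python
import numpy as np

def oversample_indices(labels, seed=42):
    """Return row indices in which every class appears as often as the largest class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    chosen = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Sample with replacement only the rows needed to reach the majority count
        extra = rng.choice(idx, size=target - count, replace=True)
        chosen.append(np.concatenate([idx, extra]))
    balanced = np.concatenate(chosen)
    rng.shuffle(balanced)
    return balanced

# Toy labels: four negatives and two positives
toy_labels = np.array([0, 0, 0, 0, 1, 1])
balanced_idx = oversample_indices(toy_labels)
print(np.bincount(toy_labels[balanced_idx]))
```

If you try this, it is probably safer to split off the validation rows first and oversample only the training part. Otherwise, duplicated tweets could end up on both sides of the split and inflate the validation accuracy.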

Another possible reason for the limited accuracy, no matter which model we use, is the vocabulary: the cleaning process didn't cover all the misspelled words, idioms and unusual acronyms that are widely used (people normally communicate informally), so our model understands some sentences poorly. Obviously, tackling this problem would make the cleaning process take a long time, because we would have to look at each sentence and define all the possibly confusing words.
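
To give an idea of what such extra cleaning could look like, here is a small sketch using only the standard `re` module. The `SLANG` table is a hypothetical example with a handful of entries; a real one would have to be built by inspecting the corpus, which is exactly the time-consuming part.

```python
import re

# Hypothetical replacement table; a real one would be built by inspecting the corpus
SLANG = {"u": "you", "pls": "please", "ppl": "people", "omg": "oh my god"}

def normalize_tweet(text, table=SLANG):
    """Lowercase, squeeze long character runs and expand known slang tokens."""
    text = text.lower()
    # Collapse three or more repeats of a character to two ("sooooo" -> "soo")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Replace whole-word matches only, leaving unknown words untouched
    return re.sub(r"[a-z']+", lambda m: table.get(m.group(0), m.group(0)), text)

print(normalize_tweet("OMG pls help, ppl are trapped!!!!"))
# prints: oh my god please help, people are trapped!!
```

Note that lowercasing throws away information that cased models such as `xlnet-base-cased` can use, so that step is an assumption you may want to drop depending on the model.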

I should also mention that I trained each model for more epochs, but the only improvement was in training accuracy, whereas validation accuracy stayed the same or even decreased. I would really encourage you to try more epochs and compare the performance after first applying a more thorough cleaning process, so that the extra training has a better chance of helping.
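
If you do train for more epochs, it helps to check at which epoch the validation metric peaked. Below is a minimal sketch in plain Python, assuming a dictionary shaped like Keras' `history.history`. The helper `best_epoch` and the numbers in `toy_history` are made up to show the pattern described above; they are not measured results.

```python
def best_epoch(history, monitor="val_accuracy"):
    """Return (1-based epoch, value) at which the monitored metric peaked."""
    values = history[monitor]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]

# Made-up curves: training accuracy keeps rising while validation accuracy stalls
toy_history = {
    "accuracy":     [0.79, 0.86, 0.91, 0.95],
    "val_accuracy": [0.82, 0.83, 0.82, 0.80],
}
epoch, value = best_epoch(toy_history)
print(epoch, value)
# prints: 2 0.83
```

The exact metric key depends on how the model was compiled, so check `history.history.keys()` first. Keras also provides the `EarlyStopping` callback with `restore_best_weights=True`, which stops training and keeps the best weights automatically.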

I would welcome any feedback on how to increase the performance of the models, and please let me know if you found a different one that works even better!

If you liked this notebook, I would really appreciate your upvote, especially if you want to see more projects/tutorials like this one. I also encourage you to see my project portfolio; I am sure you will love it.

Thank you!