# Deep Learning methods: BERT model 

## Jose Antonio Jijon Vorbeck

In this notebook, we will apply a BERT model to the tweets data set to perform a sentiment analysis. 

We will compare the results obtained from this model to those obtained using classical ML models.

BERT is a deep learning model that has been shown to give state-of-the-art results in Natural Language Processing. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model developed by a team of researchers at Google. It was pre-trained on a large corpus of unlabeled text, including English Wikipedia (about 2,500 million words) and BooksCorpus (about 800 million words).

BERT relies on the Transformer, an architecture built around the attention mechanism. A standard Transformer consists of an encoder that reads the input and a decoder that produces whatever output we wish to obtain from it. Since the goal of BERT is to generate a bidirectional language representation, only the encoder part is needed.
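
To make the attention mechanism more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation inside a Transformer encoder layer. The sizes, the random weights, and the function names are illustrative assumptions and have nothing to do with BERT's actual parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention for X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every token with every other token
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8           # toy sizes
X = rng.normal(size=(seq_len, d_model))   # stand-in for 4 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)
print(weights.sum(axis=1))
```

Every row of `weights` covers all positions in the sequence, so each token attends to tokens on both its left and its right. That is what makes the encoder representation bidirectional.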

The BERT encoder expects a sequence of token ids in a specific format, so tokenization is a crucial part of the pre-processing for this model. The [CLS] token is inserted at the beginning of each input sequence, and the [SEP] token marks the end of each sentence, which also separates two consecutive sentences.
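
As a rough illustration of this format, the sketch below builds the token sequence and segment ids for a sentence pair. It splits on whitespace for simplicity, which is an assumption made for readability; BERT itself uses WordPiece tokenization. The helper name `format_pair` is hypothetical.

```python
def format_pair(sentence_a, sentence_b):
    """Build BERT-style tokens and segment ids for a sentence pair (toy whitespace tokenization)."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = format_pair("stay calm", "stay safe")
print(tokens)       # ['[CLS]', 'stay', 'calm', '[SEP]', 'stay', 'safe', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

In this notebook each tweet is treated as a single sentence, so only one [SEP] is needed and all segment ids are 0.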

We will load the BERT model through TensorFlow Hub and use Keras, TensorFlow's high-level API, to build and train the classifier.

[BERT - Kaggle](https://www.kaggle.com/nayansakhiya/text-classification-using-bert)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import sklearn

from sklearn.model_selection import train_test_split

In [2]:
# reading the already cleaned data
train_data = pd.read_csv('Data/TweetC_train.csv')
test_data = pd.read_csv('Data/TweetC_test.csv')
print(f'Training obs: {train_data.shape[0]}, and testing obs: {test_data.shape[0]}')

Training obs: 41156, and testing obs: 3798


In [3]:
# dropping indexes with nan

# training set
index_with_nan = train_data.index[train_data.isnull().any(axis=1)]
train_data.drop(index_with_nan, axis=0, inplace=True)

# testing set
index_with_nan = test_data.index[test_data.isnull().any(axis=1)]
test_data.drop(index_with_nan, axis=0, inplace=True)
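
The same filtering can probably be written more compactly with `DataFrame.dropna`. The sketch below compares both approaches on a made-up three-row frame; the toy rows are assumptions, only the column names follow the notebook's data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Corpus": ["food stock panic", np.nan, "stay calm stay safe"],
    "label": [4, 0, 4],
})

# explicit version: collect the index labels of rows with any NaN, then drop them
index_with_nan = toy.index[toy.isnull().any(axis=1)]
explicit = toy.drop(index_with_nan, axis=0)

# shortcut: drop every row that contains at least one NaN
shortcut = toy.dropna(axis=0, how="any")

print(explicit.equals(shortcut))
print(len(shortcut))
```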

In [4]:
train_data.head()

Unnamed: 0,Corpus,label
0,advice talk neighbour family exchange phone nu...,4
1,coronavirus australia woolworth elderly disabl...,4
2,food stock panic food need stay calm stay safe,4
3,ready supermarket outbreak paranoid food stock...,0
4,news regionâs confirmed covid case came sulli...,4


In [5]:
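# labels: the sentiment class of each tweet, selected by column name
y_train = train_data['label']
y_test = test_data['label']

# one-hot encode the labels
y_train = pd.get_dummies(y_train)
y_test = pd.get_dummies(y_test)

# features: the cleaned tweet text
X_train = train_data['Corpus']
X_test = test_data['Corpus']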
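
To show what the one-hot encoding does, here is a small sketch with made-up labels in the range 0 to 4. It also shows how class ids can be recovered from the encoded matrix. The label values are illustrative only.

```python
import numpy as np
import pandas as pd

labels = pd.Series([4, 0, 2, 4, 1, 3])        # made-up sentiment classes

# one column per class that occurs in the series, ordered by class value
one_hot = pd.get_dummies(labels, dtype=int)
print(one_hot.to_numpy())

# argmax gives the column position; the column index maps it back to the class value
recovered = one_hot.columns.to_numpy()[one_hot.to_numpy().argmax(axis=1)]
print(recovered)
```

Note that `get_dummies` only creates columns for classes that actually occur, so the training and test encodings line up only if both sets contain all five labels.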

We will split the training dataset into a training set and a validation set.

In [6]:
# 20% of the training data will be devoted to the validation dataset
train_text, validation_text, train_labels, validation_labels = train_test_split(X_train.to_numpy(), y_train.to_numpy(),test_size=0.2,random_state=0)

# making the test data into numpy formats
test_text = X_test.to_numpy()
test_labels = y_test.to_numpy()

In [7]:
# looking for tweets that consist of a single space (this does not catch empty or digit-only strings)
np.where(train_text == ' ')

#x = numpy.delete(x,(2), axis=1)

(array([], dtype=int64),)
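
A somewhat more thorough check could also catch empty strings, whitespace-only strings, and strings made only of digits. The following is a sketch on made-up texts; the helper name `is_uninformative` is hypothetical.

```python
import numpy as np

texts = np.array(["food stock panic", " ", "", "2020 19", "stay safe"], dtype=object)

def is_uninformative(text):
    """True for empty or whitespace-only strings, and for strings made only of digits and spaces."""
    stripped = text.strip()
    return stripped == "" or stripped.replace(" ", "").isdigit()

mask = np.array([is_uninformative(t) for t in texts])
print(np.where(mask)[0])   # positions of the uninformative entries

cleaned = texts[~mask]     # keep only the informative texts
print(cleaned)
```

If entries are removed this way, the same mask has to be applied to the label array so that texts and labels stay aligned.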

In [8]:
print(f'The training data contains {len(train_text)} tweets, the validation contains {len(validation_text)} tweets')

The training data contains 32869 tweets, the validation contains 8218 tweets


In [9]:
train_labels

array([[1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0],
       ...,
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0]], dtype=uint8)

## Since the data was already cleaned and pre-processed, we will now use it directly to train a neural network using BERT (Bidirectional Encoder Representations from Transformers)

Now we will apply the steps that are specific to BERT, namely the BERT tokenization and the BERT model itself. We will download these tools from the internet.

In [10]:
# import BERT tokenization

!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

Next we install tensorflow_hub, and then import scikit-learn, TensorFlow, and the tokenization module.

In [11]:
! pip install tensorflow_hub



In [12]:
import tokenization
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.utils import to_categorical
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# keras (taken from TensorFlow so that all Keras objects come from the same package)
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam



Now we will load the BERT model from TensorFlow Hub as a hub.KerasLayer.

In [13]:
# importing BERT Layer

m_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(m_url, trainable=True)

### Load the tokenizer functions

In [14]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

### Create the tokenization function

In [15]:
def bert_encode(texts, tokenizer, max_len=160):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
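        text = tokenizer.tokenize(text)

        # truncate so that [CLS] and [SEP] still fit within max_len
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len-len(input_sequence)

        # token ids padded with 0, a mask marking the real tokens, and all-zero segment ids (single sentence)
        tokens = tokenizer.convert_tokens_to_ids(input_sequence) + [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len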
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
        
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
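
To see what this encoding produces without downloading the BERT vocabulary, the sketch below runs the same steps on one text with a stand-in tokenizer. `ToyTokenizer`, its tiny vocabulary, and `encode_one` are hypothetical; the real `FullTokenizer` uses WordPiece and a much larger vocabulary.

```python
import numpy as np

class ToyTokenizer:
    """Hypothetical stand-in exposing the two methods that bert_encode relies on."""
    vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "stay": 4, "calm": 5, "safe": 6}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.vocab["[UNK]"]) for t in tokens]

def encode_one(text, tokenizer, max_len=8):
    tokens = tokenizer.tokenize(text)[:max_len - 2]   # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len   # pad with id 0
    mask = [1] * len(sequence) + [0] * pad_len        # 1 = real token, 0 = padding
    segments = [0] * max_len                          # single sentence: all segment 0
    return np.array(ids), np.array(mask), np.array(segments)

ids, mask, segments = encode_one("stay calm stay safe", ToyTokenizer())
print(ids)    # [2 4 5 4 6 3 0 0]
print(mask)   # [1 1 1 1 1 1 0 0]
```

The mask tells the model which positions hold real tokens, so the padding does not contribute to the attention computation.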

### And the model-building function

In [16]:
def build_model(bert_layer, max_len=160):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # hidden state of the [CLS] token (position 0), used as the representation of the whole tweet
    clf_output = sequence_output[:, 0, :]

    #lay = Dense(128, activation='relu')(clf_output)
    #lay = Dropout(0.2)(lay)
    #lay = Dense(64, activation='relu')(lay)
    #lay = Dropout(0.2)(lay)
    out = Dense(5, activation='softmax')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(learning_rate=2e-6), loss='binary_crossentropy', metrics=['accuracy',keras.metrics.Precision(), keras.metrics.Recall(), keras.metrics.TruePositives()])
    
    return model
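
The model ends in a 5-way softmax and is trained against one-hot labels. As a sketch of how the choice of loss affects the reported number, the following compares binary and categorical cross-entropy on a single made-up prediction. The values are illustrative and are not taken from this notebook.

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 0], dtype=float)   # one-hot label, made up
y_pred = np.array([0.1, 0.1, 0.6, 0.1, 0.1])      # softmax output, made up

# categorical cross-entropy: only the probability assigned to the true class matters
cce = -np.sum(y_true * np.log(y_pred))

# binary cross-entropy: each of the 5 outputs is scored as its own yes/no problem, then averaged
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(f"categorical: {cce:.4f}, binary: {bce:.4f}")
```

For single-label problems with one-hot targets, categorical cross-entropy is the more common choice. The binary variant yields smaller values, so losses computed with the two are not directly comparable.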

### Now we perform the text encoding

In [17]:
train_input = bert_encode(train_text, tokenizer, max_len = 160)
test_input = bert_encode(test_text, tokenizer, max_len = 160)
val_input = bert_encode(validation_text, tokenizer, max_len = 160)

# train_labels and test_labels are used as they are; only the validation labels get a shorter name
val_labels = validation_labels

### Build the model

In [18]:
model = build_model(bert_layer, max_len = 160)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

### Applying Early Stopping and saving the best model

In [19]:
# Save the model at the end of an epoch whenever the validation accuracy improves.
saveBestModel = ModelCheckpoint('best_model.hdf5', monitor='val_accuracy', verbose=0, save_best_only=True, save_weights_only=False, mode='auto', save_freq='epoch')
# Stop training when a monitored quantity has stopped improving.
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
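
The patience mechanism can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Keras's actual implementation; the function name `stopping_epoch` and the loss values are made up.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch after which training would stop, or None if it runs to the end."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

history = [1.00, 0.80, 0.90, 0.85, 0.81, 0.70]   # made-up validation losses
print(stopping_epoch(history))   # 4
```

In this example training stops after three epochs without improvement, so the later drop to 0.70 is never reached. A larger patience trades longer training for a lower chance of stopping too early.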

### Finally we can fit the model

In [20]:
# running this in Google Colab because it takes much longer to complete on the local machine.
train_history = model.fit(
    train_input, train_labels,
    validation_data=(val_input, val_labels),
    epochs=10,
    batch_size=50,
    callbacks=[earlyStopping, saveBestModel]
)

Epoch 1/10
 14/658 [..............................] - ETA: 15:32:42 - loss: 0.8021 - accuracy: 0.1727 - precision: 0.1033 - recall: 0.0254 - true_positives: 11.0714
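
Once training finishes, the softmax outputs have to be turned back into class labels before they can be compared with the classical models. Here is a minimal sketch that uses made-up probabilities in place of `model.predict(test_input)`; the arrays are illustrative and are not results from this notebook.

```python
import numpy as np

# stand-in for model.predict(test_input): one row of 5 class probabilities per tweet
probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.10, 0.50, 0.10],
    [0.05, 0.05, 0.10, 0.20, 0.60],
])
# stand-in for test_labels (one-hot encoded)
true_one_hot = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
])

pred_classes = probs.argmax(axis=1)          # most probable class per row
true_classes = true_one_hot.argmax(axis=1)   # undo the one-hot encoding
accuracy = (pred_classes == true_classes).mean()
print(pred_classes, true_classes, accuracy)
```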