# Real or Not? NLP classification with disaster Tweets

This is a [Kaggle competition](https://www.kaggle.com/c/nlp-getting-started/overview) notebook made to practice NLP classification tasks. Here I will try to predict whether individual tweets that mention disasters (earthquakes, tsunamis, tornadoes and so on) refer to real disasters or not.

## BERT using TFHub

To achieve high accuracy with very little data tweaking I will use Google's BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art model for NLP tasks. For this task BERT offers several advantages over approaches such as LSTMs or Word2Vec, starting with tokenization. Word2Vec handles whole words and can't easily handle words it hasn't seen before. BERT, on the other hand, has its own method of splitting unrecognized words into subword pieces it does recognize (e.g. "circumlocution" might be broken into "circum", "locu" and "tion"), and these pieces can be averaged into whole-word vectors.
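To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy used by WordPiece-style tokenizers. The function name and the tiny vocabulary are invented for illustration; the real BERT vocabulary and tokenizer are far larger and handle more cases (punctuation, casing, length limits).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example (not BERT's real vocabulary).
toy_vocab = {"circum", "##locu", "##tion", "fire"}

pieces = wordpiece_tokenize("circumlocution", toy_vocab)
print(pieces)
```

With this toy vocabulary the word is split into three pieces, while a word with no matching pieces falls back to the unknown token.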

BERT also incorporates context much better than Word2Vec, as it is designed to use whole sentences as context. This can be a disadvantage when the context of a word isn't easily available in the sentence, or when you are working with individual words. That is not the case here, though: disaster tweets are usually whole sentences, and humans can usually pick up their context with no problem.

I will download the BERT model pre-trained by Google (trained for many hours on Wikipedia articles and books), as training the entire model on my computer would take a ridiculous amount of time! Then I will fine-tune the model to output 1 or 0 (disaster or not).

For hyperparameter tuning I will follow some ideas from the [original BERT paper](https://arxiv.org/pdf/1810.04805.pdf): using the Adam optimizer, a learning rate between 2e-5 and 5e-5, and a single sigmoid output placed directly on top of BERT, with no intermediate hidden layer between the model and the output. The authors fine-tuned for 3 epochs with a batch size of 32. I can't use a batch size of 32, because the Kaggle GPU doesn't have enough memory to train that many parameters at once, so I will use a batch size of 4 and 3 epochs instead.

I believe this model will reach high accuracy with almost no tweaking and no feature engineering.

For more information about BERT I recommend reading the original paper or checking this [great article](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/) that explains the most basic concepts.

In [1]:
import numpy as np 
import pandas as pd 
import os
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
from urllib.request import urlretrieve

/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv
/kaggle/input/nlp-getting-started/sample_submission.csv


### Loading the data 

Load the training set into a pandas DataFrame:

In [2]:
train = pd.read_csv(r'../input/nlp-getting-started/train.csv')

In [3]:
train

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |


Each tweet is a row in the dataframe, and its text is stored in the `text` column. The `target` column says whether the tweet is about a real disaster (1) or not (0). The `keyword` and `location` columns appear to contain some missing values.

### Missing values exploration

In [4]:
train.text.isnull().sum()

0

In [5]:
train.keyword.isnull().sum()

61

In [6]:
train.keyword.value_counts()

fatalities               45
armageddon               42
deluge                   42
sinking                  41
harm                     41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 221, dtype: int64

In [7]:
train.location.isnull().sum()

2533

The column we care about the most, `text`, doesn't contain any missing values. Both `keyword` and `location` do, but they aren't as important for BERT's prediction, so I won't drop any rows or impute data.
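The per-column checks above can also be done in a single call. The sketch below uses a small made-up dataframe (not the competition data) to show `isnull().sum()` for counts and `isnull().mean()` for the share of missing values per column.

```python
import pandas as pd

# Tiny made-up frame that mimics the columns of train.csv.
toy = pd.DataFrame({
    "keyword": [None, "fire", None],
    "location": [None, None, "Canada"],
    "text": ["first tweet", "second tweet", "third tweet"],
})

missing_counts = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share.round(2))
```

On the real data the same two calls would be applied to `train` instead of the toy frame.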

### BERT setup 

Let's download the BERT [tokenizer](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#22-tokenization):

In [8]:
from urllib.request import urlretrieve
url = 'https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py'
dst = 'tokenization.py'
urlretrieve(url, dst)

('tokenization.py', <http.client.HTTPMessage at 0x7f11554b51d0>)

In [9]:
import tokenization

Let's define a function that encodes the text in `train.text` so it can be fed to BERT. This encoding determines how our data is turned into numbers that BERT can compute with. First we put an upper limit on the length of the text fed to the model (512 tokens by default), then we surround each sentence with the special tokens `[CLS]` (classification) and `[SEP]` (separator), which mark its beginning and end. We also pad every sequence so they all have the same length.

In [10]:
def bert_encode(texts, tokenizer, max_len=512):
    """Turn raw texts into the three integer arrays BERT expects."""
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        # Truncate, leaving room for the two special tokens added below.
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        # Token ids, right-padded with 0 up to max_len.
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        # Mask: 1 for real tokens, 0 for padding.
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        # Single-sentence input, so every position belongs to segment 0.
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
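To see what this encoding produces, here is a small self-contained sketch that applies the same steps to one sentence. `ToyTokenizer`, `encode_one` and the six-entry vocabulary are hypothetical stand-ins made up for this example; the real `FullTokenizer` uses BERT's own vocabulary and ids.

```python
class ToyTokenizer:
    """Hypothetical stand-in for the real tokenizer: lowercases and splits on spaces."""

    def __init__(self, vocab):
        self.vocab = vocab

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[token] for token in tokens]


def encode_one(text, tokenizer, max_len):
    """Same steps as the encoding function above, for a single text."""
    tokens = tokenizer.tokenize(text)[:max_len - 2]   # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len        # 1 = real token, 0 = padding
    segments = [0] * max_len                          # single sentence: all zeros
    return ids, mask, segments


# Toy vocabulary with invented ids (not BERT's real ids).
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "forest": 3, "fire": 4, "near": 5}

ids, mask, segments = encode_one("Forest fire near", ToyTokenizer(toy_vocab), max_len=8)
print(ids)       # [1, 3, 4, 5, 2, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The mask tells the model which positions hold real tokens, so the padding zeros do not influence the prediction.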

Now let's define our model. We will add a single sigmoid output on top of the pre-trained BERT model:

In [11]:
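def build_model(bert_layer, max_len=512):
    # The three inputs produced by bert_encode: token ids, padding mask, segment ids.
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    # The TF Hub layer returns (pooled_output, sequence_output); only the second is used.
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # Keep the vector at position 0, i.e. the one for the [CLS] token.
    clf_output = sequence_output[:, 0, :]
    # Single sigmoid unit: probability that the tweet is about a real disaster.
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    # Note: 2e-6 is ten times below the 2e-5 to 5e-5 range quoted earlier; the value is kept as in the original run.
    model.compile(Adam(learning_rate=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model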
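The slicing step `sequence_output[:, 0, :]` is easy to misread, so here is a NumPy-only sketch of what it does to the shapes, followed by a hand-written sigmoid unit. The sizes and random values are toy numbers chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: 2 tweets, 5 token positions, hidden size 4 (the model above uses 1024).
batch_size, seq_len, hidden = 2, 5, 4
sequence_output = rng.normal(size=(batch_size, seq_len, hidden))

# Keep only position 0 of every sequence: the [CLS] vector.
cls_vectors = sequence_output[:, 0, :]          # shape (2, 4)

# One dense unit with a sigmoid, written out by hand.
weights = rng.normal(size=(hidden, 1))
bias = np.zeros(1)
logits = cls_vectors @ weights + bias           # shape (2, 1)
probs = 1.0 / (1.0 + np.exp(-logits))           # values strictly between 0 and 1

print(cls_vectors.shape, probs.shape)
```

Each tweet ends up with a single number between 0 and 1, which is what the `Dense(1, activation='sigmoid')` layer outputs.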

Downloading BERT:

In [12]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

CPU times: user 1min 26s, sys: 8.88 s, total: 1min 35s
Wall time: 1min 38s


In [13]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

Now let's encode the text we want to feed to the model, `train['text'].values`, and create an array with the targets we want to predict:

In [14]:
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
train_labels = train.target.values

Let's build the model with a sequence length of 160. It is recommended to keep the product batch size × sequence length below 3000 to avoid memory issues, so these numbers can still be increased somewhat before I run out of memory.

In [15]:
model = build_model(bert_layer, max_len=160)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]             
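The batch size × sequence length rule of thumb mentioned earlier can be checked with a few lines of arithmetic. The helper names below are made up, and the 3000 budget is only the heuristic quoted in this notebook, not a hardware guarantee.

```python
def tokens_per_batch(batch_size, seq_len):
    """Number of token positions processed in one training step."""
    return batch_size * seq_len


def max_batch_size(seq_len, budget=3000):
    """Largest batch size that keeps batch_size * seq_len within the budget."""
    return budget // seq_len


print(tokens_per_batch(4, 160))    # 640, well under the 3000 budget
print(tokens_per_batch(32, 160))   # 5120, over the budget
print(max_batch_size(160))         # 18
```

Under this heuristic, a sequence length of 160 leaves room for a batch size of up to 18, whereas the batch size of 32 used in the BERT paper would exceed it.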

### Training our model

Let's finally train our model! I will train it on a Kaggle GPU, using a validation split of 20%, 3 epochs and a batch size of 4 to avoid memory issues.

In [16]:
train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=3,
    batch_size=4
)

model.save('model.h5')

Train on 6090 samples, validate on 1523 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
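The sample counts in the log above follow from the settings passed to `fit`. The sketch below reproduces them with plain arithmetic; it assumes the training set is cut at `int(n * (1 - validation_split))`, which matches the numbers printed here, and relies on Keras taking the validation samples from the end of the arrays.

```python
import math

n_samples = 6090 + 1523          # total rows, taken from the log above
validation_split = 0.2
batch_size = 4
epochs = 3

n_train = int(n_samples * (1 - validation_split))   # rows used for training
n_val = n_samples - n_train                         # rows held out for validation
steps_per_epoch = math.ceil(n_train / batch_size)   # weight updates per epoch
total_steps = steps_per_epoch * epochs

print(n_train, n_val, steps_per_epoch, total_steps)
```

With a batch size of 4 the model performs 1523 weight updates per epoch, which is part of why training takes a while.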


Interestingly, even though accuracy on the training data went up 12% from the first epoch to the third, validation accuracy went down 1%, suggesting minor overfitting. I didn't expect overfitting in so few epochs.
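One way to act on this observation is to look up which epoch had the best validation accuracy. The sketch below does that on a Keras-style history dictionary. The helper name is hypothetical, the numbers are placeholders invented for illustration (they are not the results of this run), and the `val_accuracy` key is an assumption, since the key name depends on the TensorFlow version.

```python
def best_epoch(history, metric="val_accuracy"):
    """Return (1-based epoch number, value) of the highest value of `metric`."""
    values = history[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Placeholder values, invented for this example. NOT the actual training log.
example_history = {
    "accuracy": [0.70, 0.78, 0.82],
    "val_accuracy": [0.81, 0.805, 0.80],
}

epoch, value = best_epoch(example_history)
print(epoch, value)
```

On a real run the same function could be called with `train_history.history`, and the weights of the best epoch could be kept with the `ModelCheckpoint` callback imported at the top of the notebook.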

### Submit to Kaggle

In [20]:
test = pd.read_csv(r'../input/nlp-getting-started/test.csv')
submission = pd.read_csv(r"../input/nlp-getting-started/sample_submission.csv")

In [18]:
test_input = bert_encode(test.text.values, tokenizer, max_len=160)

test_pred = model.predict(test_input)
submission['target'] = test_pred.round().astype(int)
submission.to_csv('submission.csv', index=False)
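The `round().astype(int)` step above turns the sigmoid outputs into 0/1 labels. Here is a NumPy-only sketch of what happens on a few made-up probabilities. Note that NumPy rounds exact halves to the nearest even number, so 0.5 becomes 0.

```python
import numpy as np

# Made-up model outputs with shape (n, 1), like the output of model.predict.
probs = np.array([[0.10], [0.50], [0.51], [0.93]])

labels = probs.round().astype(int).ravel()        # flatten to a 1-D vector of 0/1
labels_explicit = (probs > 0.5).astype(int).ravel()  # same idea with an explicit threshold

print(labels)           # [0 0 1 1]
print(labels_explicit)  # [0 0 1 1]
```

An explicit threshold makes it easy to try cut-offs other than 0.5 if the validation data suggests one.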

### Score

Leaderboard score: 0.83537 (this competition is scored with the F1 metric rather than plain accuracy).

Position: 602 out of 2974. Top 20% // jvmd95 

We can see the power of BERT: with very few lines of code (though with relatively long and computationally expensive training), and no data cleaning or feature engineering whatsoever, we were able to train a model that performed better than about 80% of the submissions in this Kaggle competition!
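For reference, the percentile behind the "top 20%" figure can be worked out from the leaderboard position reported above:

```python
position = 602
total = 2974

top_percent = 100 * position / total               # share of entries at or above this rank
beaten_percent = 100 * (total - position) / total  # share of entries ranked below

print(round(top_percent, 1))     # 20.2
print(round(beaten_percent, 1))  # 79.8
```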