# Disaster Tweets
------
>In this Third phase of the project, we will:  
>> define another model that takes into account the **metadata** extracted from the text and the **text** itself to classify tweets.  
>> The idea is to define and train a model that takes mixed data inputs:   
>>> **Numerical MetaData** and   
>>> **Tweets Text**   

>> to give one output, that is, the final prediction given these pieces of data.

------

<img src="img/mlp_lstm.png" width="400" height="400">

>In order to build the multi-input neural network we need two branches:
>> The first branch is a **Multi-layer Perceptron (MLP)** designed to handle the **numerical** metadata  
The second branch is a **Long Short-Term Memory (LSTM)** Network to operate over the **text** data

> These branches are then **concatenated** together to form the final multi-input model.

# Import useful Librairies 

In [51]:
import pandas as pd
import numpy as np

# Machine learning librairies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf

# global params
pre_file_path = "data/pre_train.csv"
models_path = 'models/'

seed = 0

# Load Preprocessed Data

In [3]:
tweets = pd.read_csv(pre_file_path)
tweets.head(2)

Unnamed: 0,keyword,text,target,word_count,unique_word_count,stop_word_count,url_count,char_count,punctuation_count,hashtag_count,at_count,clean_text,clean_keyword,keyword_text
0,,Our Deeds are the Reason of this #earthquake M...,1,13,13,8,0,69,1,1,0,deed reason earthquake allah forgive,,deed reason earthquake allah forgive
1,,Forest fire near La Ronge Sask. Canada,1,7,7,0,0,38,1,0,0,forest fire near ronge sask canada,,forest fire near ronge sask canada


# Prepare the data

##  Text Data

In [7]:
max_features = 5_000

#build the vocab and keep the K most common word based on word frequency (K = max_features)
# max_features+ 1(1 OOV token)
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words = max_features + 1)
tokenizer.fit_on_texts(tweets["keyword_text"])

# Transforms each text to a sequence of integers, only the K most common words will be transformed (K = max_features)
tweets["tweet_encoded"] = tokenizer.texts_to_sequences(tweets.keyword_text)

# check whether we have empty lists
tweets['length'] = tweets['tweet_encoded'].apply(lambda x : len(x))
tweets = tweets[tweets["length"]!=0]

# add padding so that all sequences have the same length --> a numpy array of equal length sequences
tweet_pad = tf.keras.preprocessing.sequence.pad_sequences(tweets.tweet_encoded, padding="post")

##  Numerical MetaData

In [8]:
# define the columns related to the numerical metadata
cols = ['word_count', 'unique_word_count',
       'stop_word_count', 'url_count', 'char_count', 'punctuation_count',
       'hashtag_count', 'at_count', 'target']

# split the whole data into train and test sets
trainMetaX, testMetaX, trainTextX, testTextX = train_test_split(tweets[cols], tweet_pad, test_size=0.2, random_state=seed)

# get the target values
trainY = trainMetaX.target.values
testY = testMetaX.target.values

# standardize the numerical metadata
trainMetaX = trainMetaX.drop(columns=['target'])
testMetaX = testMetaX.drop(columns=['target'])

scaler = StandardScaler()
trainMetaX = scaler.fit_transform(trainMetaX)
testMetaX = scaler.transform(testMetaX)

# Define the model

## Helpers

In [39]:
# helper for the multi-layer perceptron (deep feed forword network)
# input = Numerical Metadata
def create_mlp(dim, regress=False):
    model = tf.keras.Sequential()
    # input layer
    model.add(tf.keras.layers.Dense(32, input_dim=dim, activation="relu"))
    # hidden layer
    model.add(tf.keras.layers.Dense(16, activation="relu"))
    # check to see if the output regression node should be added
    if regress:
        model.add(tf.keras.layers.Dense(1, activation="linear"))
    # return our model
    return model

# ------------------------------------------------------------------------------------------------------------------------------
# helper for the LSTM network
# input = Text data
def create_lstm(vocab_size, seq_length, regress=False):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(input_dim = vocab_size +1, output_dim = 16, input_length = seq_length))
    model.add(tf.keras.layers.Dropout(0.2))
    model.add(tf.keras.layers.LSTM(units = 32, return_sequences=False)) 
    model.add(tf.keras.layers.Dropout(0.2))
    if regress:
         model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    return model

## Architecture

In [41]:
seq_length = tweet_pad[0].shape[0]
vocab_size = len(tokenizer.word_index)

mlp = create_mlp(trainMetaX.shape[1], regress=False)
lstm = create_lstm(vocab_size, seq_length, regress=False)

combinedInput = tf.keras.layers.concatenate([mlp.output, lstm.output])
x = tf.keras.layers.Dense(4, activation="relu")(combinedInput)
x = tf.keras.layers.Dense(1, activation="linear")(x)

model = tf.keras.models.Model(inputs=[mlp.input, lstm.input], outputs=x)

## Optimizer

In [None]:
optimizer= tf.keras.optimizers.Adam()

model.compile(optimizer = optimizer,
              loss = tf.keras.losses.BinaryCrossentropy(),
              metrics = [tf.keras.metrics.BinaryAccuracy()])

## Train the model

In [49]:
es_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience = 3)

history = model.fit(
    x = [trainMetaX, trainTextX], y = trainY,
    validation_data = ([testMetaX, testTextX], testY),
    epochs=10, 
    batch_size=32,
    callbacks = [es_callback]
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


In [58]:
print('\n---------------------------- Train Accuracy ------------------------------\n')
print('Mean: ', np.mean(history.history['binary_accuracy']))
print('Std: ', np.std(history.history['binary_accuracy']))
print('\n---------------------------- Validation Accuracy ------------------------------\n')
print('Mean: ', np.mean(history.history['val_binary_accuracy']))
print('Std: ', np.std(history.history['val_binary_accuracy']))


---------------------------- Train Accuracy ------------------------------

Mean:  0.9498193114995956
Std:  0.003527710657804122

---------------------------- Validation Accuracy ------------------------------

Mean:  0.7692181169986725
Std:  0.003897440609891768


>🗒 With the first model that is only trained on the text data, we obtained a mean accuracy of 0.69 and a std of 0.02 over the validation set. With the combined model that takes into account the numerical metadata, we got a mean accuracy of 0.76 and a std of 0.004. The second model improves the prediction accuracy by 7%.

> More tuning may give better results.

## Save the model

In [52]:
model.save(models_path + "model_lstm_mlp.h5")