# RNN classification

Let's try to build an RNN to classify our disaster tweets.

First we recreate the pipeline we built in 'data_inspection.ipynb'. 
Next we load the data from disk, transform it, and drop the columns we no longer need in order to save some memory. Finally, for clarity, we define X and y.

NOTE: The PARALELL_JOBS constant is used for the 'n_jobs' parameter wherever multiprocessing is possible. Set it to a higher value depending on your system.
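
The `Vectorizer` step of the pipeline is what makes every tweet the same size, so that `X` can be fed to a network expecting `(MAX_TOKEN_LEN, VECTOR_DIM)` inputs. Its implementation lives in `data_cleaning.transformers` and is not shown in this notebook, so the following is only a minimal sketch of the general idea, assuming short tweets are zero-padded and long ones truncated. The `pad_or_truncate` helper and the random stand-in word vectors are hypothetical.

```python
import numpy as np

MAX_TOKEN_LEN = 25
VECTOR_DIM = 300

def pad_or_truncate(token_vectors, max_len=MAX_TOKEN_LEN, dim=VECTOR_DIM):
    """Return a (max_len, dim) float32 array built from a list of word vectors."""
    out = np.zeros((max_len, dim), dtype='float32')
    kept = token_vectors[:max_len]          # drop tokens beyond the maximum length
    if len(kept) > 0:
        out[:len(kept)] = np.asarray(kept, dtype='float32')
    return out                              # unused rows stay zero (padding)

# Random vectors stand in for real fastText embeddings.
rng = np.random.default_rng(0)
short_tweet = [rng.normal(size=VECTOR_DIM) for _ in range(4)]
long_tweet = [rng.normal(size=VECTOR_DIM) for _ in range(40)]

X_demo = np.stack([pad_or_truncate(short_tweet), pad_or_truncate(long_tweet)])
print(X_demo.shape)
```

Whatever the real transformer does internally, the result has to be a 3-D array of shape `(n_tweets, MAX_TOKEN_LEN, VECTOR_DIM)` for the models in this notebook to accept it.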

In [1]:
import sys
import logging
import fasttext.util
import numpy as np
import pandas as pd

from data_cleaning.transformers import tokenizer, urlRemover, punctuationRemover, SnakeCaseSplitter, numericsFilter, stopwordsFilter, Vectorizer
from sklearn.pipeline import Pipeline

main_logger = logging.getLogger()
main_logger.setLevel(logging.DEBUG)
stdout_handler = logging.StreamHandler(sys.stdout)
main_logger.addHandler(stdout_handler)

MAX_TOKEN_LEN = 25
VECTOR_DIM = 300
# Change this parameter to use multiprocessing where possible.
PARALELL_JOBS = 1

fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')
                         
pipeline = Pipeline([('tokenize', tokenizer(PARALELL_JOBS)), ('remove_urls', urlRemover(PARALELL_JOBS)), 
                     ('remove_punctuation', punctuationRemover(PARALELL_JOBS)), ('remove_numerics', numericsFilter(PARALELL_JOBS)), 
                     ('stopwords_filter', stopwordsFilter(PARALELL_JOBS)), ('snake_case_splitting', SnakeCaseSplitter(PARALELL_JOBS)),
                     ('vectorize', Vectorizer(ft, MAX_TOKEN_LEN))])

df = pd.read_csv('resources/data/train.csv')
df = pipeline.transform(df)
# drop() returns a new DataFrame, so assign the result to actually free the columns.
df = df.drop(columns=['text', 'tokens', 'keyword', 'location'])

X = np.array(df['vectors'].to_list())
y = df['target']

tokenizer transforming data on 5 processes.




urlRemover transforming data on 5 processes.
punctuationRemover transforming data on 5 processes.
numericsFilter transforming data on 5 processes.
stopwordsFilter transforming data on 5 processes.
SnakeCaseSplitter transforming data on 5 processes.
Vectorizer transforming data.


100%|██████████| 7613/7613 [00:11<00:00, 657.25it/s]


## Basic RNN

Now that the data is prepared, let's start with a basic RNN. RNNs are effective tools for interpreting text because they process data as a sequence: while the model is 'reading' a tweet, it keeps some notion of the previous words as it interprets the next one.
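
To make the idea of "some notion of the previous words" concrete, here is a minimal sketch of a plain (Elman) recurrent layer in NumPy. It is not the LSTM used in this notebook, which adds gating on top of the same principle, and the sizes and random weights are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run a plain RNN over one sequence and return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:                        # one word vector per time step
        # The new state mixes the current word with the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

rng = np.random.default_rng(0)
seq_len, input_dim, units = 5, 8, 4
W_x = rng.normal(scale=0.5, size=(units, input_dim))
W_h = rng.normal(scale=0.5, size=(units, units))
b = np.zeros(units)

sentence = rng.normal(size=(seq_len, input_dim))
reordered = sentence[::-1]                    # same "words", reversed order

h_original = simple_rnn_forward(sentence, W_x, W_h, b)
h_reordered = simple_rnn_forward(reordered, W_x, W_h, b)

print(h_original.shape)
print(np.allclose(h_original, h_reordered))
```

Because the hidden state is carried from step to step, feeding the same vectors in a different order generally leads to a different final state. That order sensitivity is what a bag-of-words model lacks.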

We will use the Keras API to build the model. Keras is easy to read and allows us to build a fully functioning RNN in just a few lines of code. The model will be an LSTM with 100 units connected to a dense layer of 10 neurons, followed by a single sigmoid output neuron, and it is trained with the Adam optimizer.

In [2]:
from tensorflow.keras import layers, models

def basic_rnn():
    model = models.Sequential()
    model.add(layers.Input(shape=(MAX_TOKEN_LEN, VECTOR_DIM), dtype='float32'))
    model.add(layers.LSTM(100))
    model.add(layers.Dense(10))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model
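
As a rough sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard Keras layer layouts with biases enabled (four weight blocks in an LSTM, one weight matrix plus bias in a dense layer); the helper names are made up for this example.

```python
def lstm_params(input_dim, units):
    # Four blocks (input, forget, output gate and cell candidate), each with
    # input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input/unit pair plus one bias per unit.
    return input_dim * units + units

VECTOR_DIM = 300
lstm = lstm_params(VECTOR_DIM, 100)      # LSTM(100) on 300-dimensional word vectors
dense_hidden = dense_params(100, 10)     # Dense(10) on the LSTM output
dense_out = dense_params(10, 1)          # Dense(1) sigmoid output
total = lstm + dense_hidden + dense_out

print(lstm, dense_hidden, dense_out, total)  # 160400 1010 11 161421
```

Almost all of the roughly 161,000 parameters sit in the LSTM layer. Calling `basic_rnn().summary()` should report the same totals if the assumptions above hold.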

Next we define two more functions: one for showing the cross-validation scores and another for displaying a confusion matrix.

In [3]:
from math import sqrt
from statistics import variance
from sklearn.metrics import confusion_matrix

def show_x_val_scores(scores: dict):
    """Print the mean and spread of the cross-validated accuracy and F1 scores."""
    mean_accuracy = round(sum(scores['test_accuracy']) / len(scores['test_accuracy']), 2)
    # sqrt(variance) is the sample standard deviation, even though the label
    # printed below calls it 'squared variance'.
    accuracy_sqrt_var = round(sqrt(variance(scores['test_accuracy'])), 3)

    mean_f1 = round(sum(scores['test_f1']) / len(scores['test_f1']), 2)
    f1_sqrt_var = round(sqrt(variance(scores['test_f1'])), 3)

    print('scores')
    print(f"accuracy: {mean_accuracy} average, {accuracy_sqrt_var} squared variance")
    print(f"f1: {mean_f1} average, {f1_sqrt_var} squared variance")
    
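def conf_matrix(y_true, y_pred):
    """Print a 2x2 confusion matrix plus precision, recall, accuracy and F1."""
    # sklearn puts the true labels on the rows and the predictions on the columns.
    matrix = confusion_matrix(y_true, y_pred)
    TP = int(matrix[1][1])
    FP = int(matrix[0][1])
    TN = int(matrix[0][0])
    FN = int(matrix[1][0])
    
    print(f'\rconfusion matrix (n={len(y_true)})')
    # Escape the backslash so Python does not warn about an invalid '\p' escape sequence.
    print('\\pred:  false | true ')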
    print('truth -------------')
    print(f'false:| {TN} | {FP} |')
    print('      |-----|-----|')
    print(f'true :| {FN} | {TP} |')
    print('      -------------')
    
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    accuracy = ((TP + TN) / (FN + FP + TN + TP)) * 100
    
    print(f'\rprecision: \t{round(precision, 2)}, out of all your positives this part whas true.')
    print(f'recall: \t{round(recall, 2)}, out of all the positives, this is the part we caught.')
    print(f'accuracy: \t{round(accuracy, 2)}.')
    print(f'f1: \t\t{round(2*((precision*recall)/(precision+recall)), 2)}')

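The next step is to build the model and wrap it in a KerasClassifier so we can use sklearn's `cross_validate` function. Note that the `tensorflow.keras.wrappers.scikit_learn` module is only available in older TensorFlow 2.x releases; newer releases have dropped it in favour of the separate SciKeras package.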

In [5]:
from sklearn.model_selection import train_test_split, cross_validate
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
    
model = KerasClassifier(build_fn=basic_rnn)

print('Running cross validation, this will take a minute.')
scores = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1'], n_jobs=PARALELL_JOBS)
show_x_val_scores(scores)

Running cross validation, this will take a minute.
scores
accuracy: 0.78 average, 0.018 squared variance
f1: 0.72 average, 0.04 squared variance


Cross-validating the model across 5 splits gives a mean accuracy of 78%, so about 78% of all predictions are correct. The mean F1 score is somewhat lower, at 0.72.
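
The spread that `show_x_val_scores` prints as 'squared variance' is computed as `sqrt(variance(...))`, which is the sample standard deviation of the fold scores. A minimal sketch of that equivalence, using made-up fold accuracies (not the real per-fold scores of this run, which are not shown):

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical fold accuracies, invented purely for illustration.
fold_accuracies = [0.76, 0.79, 0.80, 0.77, 0.78]

spread = sqrt(variance(fold_accuracies))   # what show_x_val_scores computes
print(round(mean(fold_accuracies), 2), round(spread, 3))
print(abs(spread - stdev(fold_accuracies)) < 1e-12)
```

Read this way, the output above says the accuracy varied by roughly 0.018 and the F1 score by roughly 0.04 between folds.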

To get more insight into the model's performance, let's display a confusion matrix using the `conf_matrix` function we wrote earlier.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
conf_matrix(y_test, y_pred)

Train on 6090 samples
confusion matrix (n=1523)
\pred:  false | true 
truth -------------
false:| 763 | 119 |
      |-----|-----|
true :| 188 | 453 |
      -------------
precision: 	0.79, out of all your positives this part whas true.
recall: 	0.71, out of all the positives, this is the part we caught.
accuracy: 	79.84.
f1: 		0.75
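
The four summary figures follow directly from the four cells of the matrix. As a worked check, the sketch below recomputes them from the counts printed above; the `summarize` helper is made up for this example.

```python
def summarize(tp, fp, tn, fn):
    """Derive the usual binary classification metrics from raw counts."""
    precision = tp / (tp + fp)              # share of predicted positives that are right
    recall = tp / (tp + fn)                 # share of actual positives that were found
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        'precision': round(precision, 2),
        'recall': round(recall, 2),
        'accuracy': round(accuracy * 100, 2),
        'f1': round(f1, 2),
    }

# Counts taken from the confusion matrix above.
metrics = summarize(tp=453, fp=119, tn=763, fn=188)
print(metrics)  # {'precision': 0.79, 'recall': 0.71, 'accuracy': 79.84, 'f1': 0.75}
```

F1 is the harmonic mean of precision and recall, so it is pulled towards the lower of the two, here the recall.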


The model produces more false negatives (188, bottom left of the matrix) than false positives (119, top right of the matrix). This means that the model is slightly biased towards classifying tweets as non-disaster tweets: it misses real disaster tweets more often than it raises false alarms.

This explains the lower F1 score, since recall (0.71) lags behind precision (0.79). It is possibly the result of the imbalance between the number of positively labelled tweets (3271) and the number of negatively labelled tweets (4342).

If we balance the dataset we might get different scores.

In [9]:
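# There are too many 'false' tweets, let's calculate how many.
# Look the counts up by label instead of relying on the sort order of value_counts().
class_counts = y.value_counts()
false_count, true_count = class_counts.loc[0], class_counts.loc[1]
surplus_false_tweets = false_count - true_count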

# Get a list of indices to drop, in order to balance the data.
false_indices = y[y == 0].index.to_list()
indices_to_remove = false_indices[:surplus_false_tweets]

# Create balanced dataset.
X_bal = np.array(df['vectors'].drop(indices_to_remove).to_list())
y_bal = df['target'].drop(indices_to_remove)

# Repeat the process.
X_bal_train, X_bal_test, y_bal_train, y_bal_test = train_test_split(X_bal, y_bal, test_size=0.2, random_state=1)
model.fit(X_bal_train, y_bal_train)
y_bal_pred = model.predict(X_bal_test)
conf_matrix(y_bal_test, y_bal_pred)

Train on 5233 samples
confusion matrix (n=1309)
\pred:  false | true 
truth -------------
false:| 582 | 67 |
      |-----|-----|
true :| 199 | 461 |
      -------------
precision: 	0.87, out of all your positives this part whas true.
recall: 	0.7, out of all the positives, this is the part we caught.
accuracy: 	79.68.
f1: 		0.78


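Balancing the dataset improved the F1 score from 0.75 to 0.78, so the imbalance was a factor. The gain comes from precision (0.79 to 0.87); recall (0.71 to 0.70) and accuracy (79.84% to 79.68%) stayed roughly the same.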
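
The balancing cell above drops the first surplus negative tweets in file order. If the file happens to be ordered in some meaningful way, that could skew the remaining sample. A possible alternative, which is not what this notebook does, is random undersampling with a fixed seed. The sketch below uses only the standard library; the `undersample_indices` helper is hypothetical.

```python
import random
from collections import Counter

def undersample_indices(labels, seed=1):
    """Return sorted indices that shrink every class to the size of the smallest one."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    target = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)               # fixed seed keeps the selection reproducible
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    return sorted(kept)

# Stand-in labels with the same class sizes as the training set.
labels = [0] * 4342 + [1] * 3271
keep = undersample_indices(labels)

print(Counter(labels[i] for i in keep))
```

With the real data, the resulting positions could be used as `X[keep]` and `y.iloc[keep]` to build the balanced arrays.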

## A better RNN?

Now let's see if we can improve the algorithm a bit. Maybe we can get a better score by changing the number of neurons, or by using a GRU instead of an LSTM.

In [17]:
def rnn(rnn_units: int, rnn_type: str, dense_units: int, bidirectional: bool):
    model = models.Sequential()
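    model.add(layers.Input(shape=(MAX_TOKEN_LEN, VECTOR_DIM), dtype='float32'))

    # Pick the recurrent layer type; the local name avoids shadowing the rnn() function itself.
    if rnn_type == 'lstm':
        rnn_layer = layers.LSTM(rnn_units)
    elif rnn_type == 'gru':
        rnn_layer = layers.GRU(rnn_units)
    else:
        raise ValueError(f"Invalid rnn type: {rnn_type}.")

    # Optionally read the tweet in both directions.
    if bidirectional:
        rnn_layer = layers.Bidirectional(rnn_layer)

    model.add(rnn_layer)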
    model.add(layers.Dense(dense_units))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

    return model

The function above allows us to pass hyperparameters to the model and do some fine-tuning. Let's determine the hyperparameters we want to try.

In [11]:
parameters = {
    'rnn_units': [100, 300], 
    'rnn_type': ['lstm', 'gru'], 
    'dense_units': [10, 100], 
    'bidirectional': [True, False]
}

The GridSearchCV object provided by the scikit-learn library allows us to try all combinations of the parameters in the grid above while cross-validating the results. This gives a total of 2 x 2 x 2 x 2 = 16 combinations. With a cv of 5, the GridSearchCV object will train 16 x 5 = 80 models in the next code block. If you want to reduce this, you can lower the cv to 3 or 4.
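
The count of 16 combinations and 80 fits can be checked by enumerating the grid by hand. This is a small sketch using only `itertools`; it mirrors the way a grid search expands its parameter dictionary but does not call scikit-learn.

```python
from itertools import product

parameters = {
    'rnn_units': [100, 300],
    'rnn_type': ['lstm', 'gru'],
    'dense_units': [10, 100],
    'bidirectional': [True, False],
}

# Every combination is one choice per parameter.
names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

cv = 5
print(len(combinations), len(combinations) * cv)  # 16 80
print(combinations[0])
```

With the default `refit=True`, GridSearchCV also trains the best combination once more on the full dataset after the search, so the actual number of fits is one higher.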

In [23]:
from sklearn.model_selection import GridSearchCV

model = KerasClassifier(build_fn=rnn)
gs = GridSearchCV(estimator=model, param_grid=parameters, cv=5, n_jobs=PARALELL_JOBS, scoring='accuracy', verbose=2)
gs.fit(X_bal, y_bal)

print(f'The best found model scored: {round(gs.best_score_,3)}.')
print(f'Using the following hyperparams{gs.best_params_}.')
for param, value in gs.best_params_.items():
    print(f'{param}: {value}')
print()

# Pair every parameter combination with its mean cross-validation score, best first.
result = sorted(zip(gs.cv_results_['params'], gs.cv_results_['mean_test_score']),
                key=lambda item: item[1], reverse=True)

print('other models scored')
for params, mean_score in result:
    print(f'score: {round(mean_score, 3)}')
    for param, value in params.items():
        print(f'{param}: {value}')
    print()

The best found model scored: 0.77.
Using the following hyperparams{'bidirectional': True, 'dense_units': 10, 'rnn_type': 'gru', 'rnn_units': 100}.
bidirectional: True
dense_units: 10
rnn_type: gru
rnn_units: 100

other models scored
score: 0.77
bidirectional: True
dense_units: 10
rnn_type: gru
rnn_units: 100

score: 0.77
bidirectional: True
dense_units: 10
rnn_type: lstm
rnn_units: 100

score: 0.768
bidirectional: True
dense_units: 100
rnn_type: gru
rnn_units: 300

score: 0.768
bidirectional: True
dense_units: 100
rnn_type: gru
rnn_units: 100

score: 0.767
bidirectional: True
dense_units: 100
rnn_type: lstm
rnn_units: 100

score: 0.767
bidirectional: False
dense_units: 100
rnn_type: lstm
rnn_units: 300

score: 0.764
bidirectional: True
dense_units: 10
rnn_type: gru
rnn_units: 300

score: 0.761
bidirectional: False
dense_units: 10
rnn_type: lstm
rnn_units: 300

score: 0.76
bidirectional: False
dense_units: 100
rnn_type: lstm
rnn_units: 100

score: 0.76
bidirectional: False
dense_units: 