# RNN classification

Lets try and build an RNN to classify our disaster tweets.

First we create the pipeline we build in 'data_inspection.ipynb'. 
Next we load the data from disk, transform the data and drop the columns we won't need in order to save some memory. Next, for clarity we define X and y.

NOTE: set the PARALELL_JOBS constant, it is used as for the 'n_jobs' parameter where multi-processing is possible.

In [3]:
import sys
import logging
import fasttext.util
import numpy as np
import pandas as pd

from data_cleaning.transformers import tokenizer, urlRemover, punctuationRemover, SnakeCaseSplitter, numericsFilter, stopwordsFilter, Vectorizer
from sklearn.pipeline import Pipeline


main_logger = logging.getLogger()
main_logger.setLevel(logging.DEBUG)
stdout_handler = logging.StreamHandler(sys.stdout)
main_logger.addHandler(stdout_handler)

MAX_TOKEN_LEN = 25
VECTOR_DIM = 300
PARALELL_JOBS = 5

fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')
                         
pipeline = Pipeline([('tokenize', tokenizer(PARALELL_JOBS)), ('remove_urls', urlRemover(PARALELL_JOBS)), 
                     ('remove_punctuation', punctuationRemover(PARALELL_JOBS)), ('remove_numerics', numericsFilter(PARALELL_JOBS)), 
                     ('stopwords_filter', stopwordsFilter(PARALELL_JOBS)), ('snake_case_splitting', SnakeCaseSplitter(PARALELL_JOBS)),
                     ('vectorize', Vectorizer(ft, MAX_TOKEN_LEN))])

df = pd.read_csv('resources/data/train.csv')
df = pipeline.transform(df)
df.drop(['text', 'tokens', 'keyword', 'location'], 1)

X = np.array(df['vectors'].to_list())
y = df['target']



tokenizer transforming data on 5 processes.
tokenizer transforming data on 5 processes.
urlRemover transforming data on 5 processes.
urlRemover transforming data on 5 processes.
punctuationRemover transforming data on 5 processes.
punctuationRemover transforming data on 5 processes.
numericsFilter transforming data on 5 processes.
numericsFilter transforming data on 5 processes.
stopwordsFilter transforming data on 5 processes.
stopwordsFilter transforming data on 5 processes.
SnakeCaseSplitter transforming data on 5 processes.
SnakeCaseSplitter transforming data on 5 processes.
Vectorizer transforming data.
Vectorizer transforming data.


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7613/7613 [00:12<00:00, 625.97it/s]


Index(['id', 'keyword', 'location', 'text', 'target', 'tokens', 'vectors'], dtype='object')
todo: remove excess columns


## Basic RNN

Lets build a function to create a basic RNN using the Keras api. It uses a lstm with 100 units and is connected to a dense layer of 10 neurons and uses the adam optimizer.

In [5]:
from tensorflow.keras import layers, models

def basic_rnn():
    model = models.Sequential()
    model.add(layers.Input(shape=(MAX_TOKEN_LEN, VECTOR_DIM), dtype='float32'))
    model.add(layers.LSTM(100))
    model.add(layers.Dense(10))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

Next we will need two more functions for showing the cross validation scores (both accuracy and f1), and displaying a confusion matrix.

In [13]:
from math import sqrt
from statistics import variance
from sklearn.metrics import confusion_matrix

def show_x_val_scores(scores: dict):
    mean_accuracy = round(sum(scores['test_accuracy']) / len(scores['test_accuracy']), 2)
    accuracy_sqrt_var = round(sqrt(variance(scores['test_accuracy'])), 3)

    mean_f1 = round(sum(scores['test_f1']) / len(scores['test_f1']), 2)
    f1_sqrt_var = round(sqrt(variance(scores['test_f1'])), 3)

    print('scores')
    print(f"accuracy: {mean_accuracy} average, {accuracy_sqrt_var} squared variance")
    print(f"f1: {mean_f1} average, {f1_sqrt_var} squared variance")
    
def conf_matrix(y_true, y_pred):
    TN, FP, FN, TP = confusion_matrix(y_true, y_pred, [1, 0])
    print(f'\rconfusion matrix (n={len(y_true)})')
    print('\pred:  false | true ')
    print('truth -----------------')
    print(f'false:|  {TN} |  {FP}  |')
    print('      |-------|-------|')
    print(f'true :|  {FN}  |  {TP} |')
    print('      -----------------')

The next step is to build the model and wrap it in a KerasClassifier so we can use the keras cross validation function. We use the `show_x_val_scores` function to display the cross validation result.

In [7]:
from sklearn.model_selection import train_test_split, cross_validate
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
    
model = KerasClassifier(build_fn=basic_rnn)

print('Running cross validation, this will take a minute.')
scores = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1'], n_jobs=PARALELL_JOBS)
show_x_val_scores(scores)

scores:
accuracy: 0.79 average, 0.019 squared variance
f1: 0.73 average, 0.038 squared variance


To get some more insight into the models behavior lets display a confusion matrix using the `conf_matrix` method we wrote earlier.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
conf_matrix(y_test, y_pred)

Train on 6090 samples




ValueError: not enough values to unpack (expected 4, got 2)