# CNN classification

Let's try classification with a Convolutional Neural Network (CNN). Where RNNs are good at evaluating input in the context of previous input, a CNN is able to iterate over multiple subsections of the data and evaluate each of them. When applied to text, a CNN focuses on the different combinations of neighbouring words found in the text (n-grams).
A famous paper on the subject was written by Yoon Kim (https://github.com/yoonkim/CNN_sentence).
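
To make the n-gram idea concrete, here is a minimal NumPy sketch of what a single convolutional filter does with a sentence: it scores every window of adjacent tokens and keeps the strongest response. The toy sizes (6 tokens, 4 dimensions), the random values and the helper name `conv1d_max` are assumptions for illustration only; they are not part of this notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sentence": 6 tokens, each a 4-dimensional embedding
# (the notebook itself uses 25 tokens x 300 dimensions).
tokens = rng.normal(size=(6, 4))

def conv1d_max(tokens, kernel, bias=0.0):
    """Slide one filter over every window of n adjacent tokens, apply ReLU,
    and keep the strongest response (global max pooling)."""
    n = kernel.shape[0]
    windows = np.stack([tokens[i:i + n] for i in range(len(tokens) - n + 1)])
    responses = np.maximum(0.0, (windows * kernel).sum(axis=(1, 2)) + bias)
    return responses, responses.max()

bigram_filter = rng.normal(size=(2, 4))  # looks at 2 adjacent tokens at a time
responses, pooled = conv1d_max(tokens, bigram_filter)
print(responses.shape)  # (5,): one response per bigram position
print(pooled)           # the single value this filter contributes
```

A real layer learns many such filters per window size; the pooled values of all filters together form the feature vector used for classification.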

Let's try to build a simple CNN capable of performing predictions on our tweets.

First we start with the basics and helper functions; please see the RNN notebook for more information. 
Then let's set up the data pipeline; please refer to the data_analysis notebook for more information as well.

NOTE: The PARALELL_JOBS constant is used for the 'n_jobs' parameter wherever multiprocessing is possible. Set it to a higher value depending on your system.
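
If you are unsure which value suits your machine, one possible starting point is to derive it from the number of available cores. This is only a sketch: the name `suggested_jobs` and the choice to leave one core free are assumptions, not something this notebook prescribes.

```python
import os

# os.cpu_count() may return None, so fall back to 1.
# Leaving one core free keeps the system responsive (an arbitrary choice).
suggested_jobs = max(1, (os.cpu_count() or 1) - 1)
print(suggested_jobs)
```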

In [1]:
import sys
import logging
import fasttext.util
import numpy as np
import pandas as pd

from data_cleaning.transformers import tokenizer, urlRemover, punctuationRemover, SnakeCaseSplitter, numericsFilter, stopwordsFilter, Vectorizer
from sklearn.pipeline import Pipeline

# Setup logger.
main_logger = logging.getLogger()
main_logger.setLevel(logging.DEBUG)
stdout_handler = logging.StreamHandler(sys.stdout)
main_logger.addHandler(stdout_handler)

# Parameters
MAX_TOKEN_LEN = 25
VECTOR_DIM = 300
# Change this parameter to use multiprocessing where possible.
PARALELL_JOBS = 5

# Load fasttext embedding.
print('Loading Fasttext model.')
fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')

# Create data pipeline.
pipeline = Pipeline([('tokenize', tokenizer(PARALELL_JOBS)), ('remove_urls', urlRemover(PARALELL_JOBS)), 
                     ('remove_punctuation', punctuationRemover(PARALELL_JOBS)), ('remove_numerics', numericsFilter(PARALELL_JOBS)), 
                     ('stopwords_filter', stopwordsFilter(PARALELL_JOBS)), ('snake_case_splitting', SnakeCaseSplitter(PARALELL_JOBS)),
                     ('vectorize', Vectorizer(ft, MAX_TOKEN_LEN))])

# Load dataset.
print('loading dataset.')
df = pd.read_csv('resources/data/train.csv')
df = pipeline.transform(df)
# drop() returns a new DataFrame, so the result has to be assigned back.
df = df.drop(columns=['text', 'tokens', 'keyword', 'location'])

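# Balance the classes by dropping the surplus of negative (target == 0) tweets.
y = df['target']
class_counts = y.value_counts()
false_count, true_count = class_counts.loc[0], class_counts.loc[1]
surplus_false_tweets = false_count - true_count
false_indices = y[y == 0].index.to_list()
# Note: this removes the first surplus negatives in file order, not a random sample.
indices_to_remove = false_indices[:surplus_false_tweets]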

X = np.array(df['vectors'].drop(indices_to_remove).to_list())
y = np.array(df['target'].drop(indices_to_remove).to_list())

Loading Fasttext model.
loading dataset.
tokenizer transforming data on 5 processes.
urlRemover transforming data on 5 processes.
punctuationRemover transforming data on 5 processes.
numericsFilter transforming data on 5 processes.
stopwordsFilter transforming data on 5 processes.
SnakeCaseSplitter transforming data on 5 processes.
Vectorizer transforming data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7613/7613 [00:13<00:00, 545.64it/s]
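
The balancing step in the cell above drops the first surplus negative tweets in file order. If the file happens to be ordered in some meaningful way, a random sample may be preferable. The sketch below shows that alternative on toy labels; the helper `balanced_indices` and its `seed` parameter are hypothetical and are not used anywhere else in this notebook.

```python
import numpy as np

def balanced_indices(y, seed=1):
    """Return sorted row positions so that every class keeps as many
    rows as the smallest class, chosen at random without replacement."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
            for c in classes]
    return np.sort(np.concatenate(keep))

# Toy labels: five negatives and three positives.
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1])
idx = balanced_indices(y_toy)
print(np.bincount(y_toy[idx]))  # [3 3]
```

With the real data this would be applied as `X[idx]` and `y[idx]` after converting the columns to arrays.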


In [2]:
from sklearn.metrics import confusion_matrix

def conf_matrix(y_true, y_pred):
    """Print a 2x2 confusion matrix plus precision, recall, accuracy and F1."""
    # For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
    TN, FP, FN, TP = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

    print(f'\rconfusion matrix (n={len(y_true)})')
    # The escaped backslash prints a literal '\' in front of 'pred:'.
    print('\\pred:  false | true ')
    print('truth -------------')
    print(f'false:| {TN} | {FP} |')
    print('      |-----|-----|')
    print(f'true :| {FN} | {TP} |')
    print('      -------------')

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    # Accuracy is reported as a percentage; the other metrics are fractions.
    accuracy = ((TP + TN) / (FN + FP + TN + TP)) * 100
    f1 = 2 * ((precision * recall) / (precision + recall))

    print(f'\rprecision: \t{round(precision, 2)}, out of all your positives this part was true.')
    print(f'recall: \t{round(recall, 2)}, out of all the positives, this is the part we caught.')
    print(f'accuracy: \t{round(accuracy, 2)}.')
    print(f'f1: \t\t{round(f1, 2)}')

## The model
Let's create a function for building our model. It will use filters ranging from 1 to 5 words wide.

In [3]:
from tensorflow.keras import layers, models

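def yoon_cnn():
    """Build a CNN with parallel convolutions over 1- to 5-word windows."""
    # One tweet is a (tokens, embedding dimensions) matrix.
    sequence_input = layers.Input(shape=(MAX_TOKEN_LEN, VECTOR_DIM), dtype='float32')

    # Each filter size corresponds to an n-gram width.
    filter_sizes = [1, 2, 3, 4, 5]

    convs = []
    for filter_size in filter_sizes:
        l_conv = layers.Conv1D(filters=200,
                               kernel_size=filter_size,
                               activation='relu')(sequence_input)
        # Keep only the strongest response of each filter across the tweet.
        l_pool = layers.GlobalMaxPooling1D()(l_conv)
        convs.append(l_pool)
    # Concatenate the pooled features of all filter sizes into one vector.
    l_merge = layers.concatenate(convs, axis=1)
    x = layers.Dense(10, activation='relu')(l_merge)
    # A single sigmoid unit gives the probability of the positive class.
    preds = layers.Dense(1, activation='sigmoid')(x)
    model = models.Model(sequence_input, preds)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])

    return model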
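
As a sanity check on the architecture, the layer sizes can be worked out by hand. The sketch below assumes the Keras defaults for `Conv1D` (`'valid'` padding, stride 1) together with the sizes used in `yoon_cnn`; it only does arithmetic and does not build the model.

```python
MAX_TOKEN_LEN, VECTOR_DIM = 25, 300
N_FILTERS, DENSE_UNITS = 200, 10
filter_sizes = [1, 2, 3, 4, 5]

conv_params = 0
for k in filter_sizes:
    # With 'valid' padding a window of k tokens fits in 25 - k + 1 positions.
    positions = MAX_TOKEN_LEN - k + 1
    # Each filter has k * 300 weights plus one bias.
    params = k * VECTOR_DIM * N_FILTERS + N_FILTERS
    conv_params += params
    print(f'kernel_size={k}: {positions} positions, {params:,} parameters')

# Global max pooling leaves one value per filter, so 5 branches x 200 filters.
merged_features = len(filter_sizes) * N_FILTERS
dense_params = merged_features * DENSE_UNITS + DENSE_UNITS
output_params = DENSE_UNITS * 1 + 1
total_params = conv_params + dense_params + output_params

print(f'merged feature vector: {merged_features}')  # 1000
print(f'total parameters: {total_params:,}')        # 911,021
```

Nearly all of the parameters sit in the convolutional branches; the two dense layers add only about ten thousand.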

## Cross validation

Now let's cross-validate the model. Because the `cross_validate` function is only compatible with Keras's `Sequential` model, we'll have to write some custom logic.

*Note: we are importing the worker function from a separate file so that multiprocessing will work on Windows.
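
The `workers` module itself is not shown in this notebook. The sketch below is a guess at what `cnn_worker` might look like, based only on how it is called: it receives one `(train_index, test_index, X, y)` tuple and returns something whose second element is the accuracy. The `build_model` parameter and the `model` module it imports from are hypothetical; they are there so the function can be tried without TensorFlow.

```python
# workers.py (hypothetical contents; the real file is not part of this notebook)

def cnn_worker(fold, build_model=None):
    """Train and evaluate one cross-validation fold; returns (loss, accuracy)."""
    train_index, test_index, X, y = fold
    if build_model is None:
        # Assumption: the model builder is importable from a module named `model`.
        from model import yoon_cnn
        build_model = yoon_cnn

    model = build_model()
    model.fit(X[train_index], y[train_index], verbose=0)
    # A Keras model compiled with one metric returns [loss, metric] from evaluate().
    loss, accuracy = model.evaluate(X[test_index], y[test_index], verbose=0)
    return loss, accuracy
```

Keeping the function at module level in its own file matters because worker processes on Windows are started fresh and must be able to import the function by name.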

In [4]:
import multiprocessing as mp
from sklearn.model_selection import KFold
from workers import cnn_worker
from statistics import mean, stdev

# Split the data into 7 shuffled folds; each worker receives its indices plus the full data.
kf = KFold(n_splits=7, shuffle=True, random_state=1)
folds = [(train_index, test_index, X, y) for train_index, test_index in kf.split(X)]

# Train and evaluate the folds in parallel; the pool is cleaned up when the block exits.
with mp.Pool(PARALELL_JOBS) as pool:
    results = pool.map(cnn_worker, folds)

# Each result holds the fold's accuracy at index 1.
accuracies = [float(result[1]) for result in results]
mean_accuracy = round(mean(accuracies), 2)
# The square root of the variance is the standard deviation.
accuracy_std = round(stdev(accuracies), 3)

print('scores:')
print(f"accuracy: {mean_accuracy} average, {accuracy_std} standard deviation")

scores:
accuracy: 0.78 average, 0.014 standard deviation


The accuracy is the same as that of the RNN trained in the RNN notebook.

In [5]:
from sklearn.model_selection import train_test_split

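# Hold out 20% of the balanced data for a single train/test evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = yoon_cnn()
# fit() trains for a single epoch by default.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Round the sigmoid outputs to 0 or 1 before comparing them with the labels.
conf_matrix(y_test, np.round(y_pred.flatten()))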

Train on 5233 samples
confusion matrix (n=1309)
\pred:  false | true 
truth -------------
false:| 546 | 103 |
      |-----|-----|
true :| 169 | 491 |
      -------------
precision: 	0.83, out of all your positives this part was true.
recall: 	0.74, out of all the positives, this is the part we caught.
accuracy: 	79.22.
f1: 		0.78
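
The reported metrics can be re-derived by hand from the four counts in the confusion matrix above. The snippet below is just that arithmetic, using the same formulas as `conf_matrix`.

```python
# Counts taken from the confusion matrix printed above.
TN, FP, FN, TP = 546, 103, 169, 491

precision = TP / (TP + FP)                          # 491 / 594
recall = TP / (TP + FN)                             # 491 / 660
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 1037 / 1309
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(accuracy * 100, 2), round(f1, 2))
# 0.83 0.74 79.22 0.78
```

Precision is higher than recall here: the model raises relatively few false alarms (103) but misses more of the true positives (169).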


Again, the same performance as the RNN model. This suggests that performance might be bottlenecked by something other than the model's architecture.