# Word2Vec Vectorisation, CNN and RNN

Ideally I would split up these notebooks into separate files but I did not know how to save the output of my Word2Vec Vectorisation for the inputs to my CNN, so I thought it best to run them all in the same notebook. Enjoy!

# Word2Vec Vectorisation

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

To make full use of CNNs and RNNs I will need to develop a more computer readable input for my word data. I will also want to represent each document as a matrix so that I can apply filters in the case of CNNs and _______ for RNNs. 

You can create your own word embedding model by training it on your own dataset. However, this can take time to build and run and, unless you have a lot of data, it can produce poor results.

There are a number of prebuilt word embeddings available, such as LexVec, which was built from Wikipedia articles, and Google's model, which was built from Google News articles (https://machinelearningmastery.com/develop-word-embeddings-python-gensim/). Given this, I will use Google's prebuilt model for my vectorisation: its domain is related to my dataset and, because it was trained on 100 billion words, it will be more useful than any model I could train on my small dataset.

In [2]:
# Import the Gensim package and load the pre-trained Google News model from disk
from gensim.models import KeyedVectors

data_filename = 'data/GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(data_filename, binary=True)

Now that I have my model we can see that it has some really cool functionality. As each word is a vector in some dimensional space, we can look at the most similar words to a given word based on the closest word vectors to that word in the space.

In [3]:
most_similar = model.most_similar("Vancouver")
print(model['Vancouver'].shape)
print(most_similar)

(300,)
[('Calgary', 0.7786356210708618), ('Edmonton', 0.7367619276046753), ('Burnaby', 0.7355164885520935), ('Kelowna', 0.7330868244171143), ('Vancouver_BC', 0.7306336164474487), ('Chilliwack', 0.6999480128288269), ('British_Columbia', 0.6976038813591003), ('Nanaimo', 0.6958351135253906), ('Saskatoon', 0.6886626482009888), ('Kamloops', 0.6885457038879395)]


Here we can see that the closest words to Vancouver are other Canadian cities or provinces, which makes a lot of intuitive sense. Also, note that each vector has 300 dimensions.
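
To make the idea of "closest" concrete, here is a minimal sketch of ranking words by cosine similarity, which as far as I understand is the measure Gensim's `most_similar` uses. The three-dimensional vectors and the `cosine_similarity` helper are made up for illustration; the real vectors have 300 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings", not values from the Google News model
toy_vectors = {
    "Vancouver": np.array([0.9, 0.8, 0.1]),
    "Calgary": np.array([0.8, 0.9, 0.2]),
    "banana": np.array([0.1, 0.0, 0.9]),
}

query = toy_vectors["Vancouver"]

# Score every other word against the query and sort from most to least similar
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "Vancouver"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

In this toy space "Calgary" points in almost the same direction as "Vancouver", so it ranks first, while "banana" ends up far behind.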

Now that I have my vectorisation model, I need to apply it to each document in my dataset. This will result in an output for each document that has d_i rows and n columns, where d_i is the number of words in document i of my processed dataset and n is the number of dimensions of the word vectors (in this case 300).

For each document I will end up with a d_i x n matrix, so I will have just over 8,000 matrices in my training dataset and about 2,000 in my test dataset. I will need to create an array of these matrices for each of my training and test sets to pass into my CNNs. This is because, for Keras, the input needs to be in the shape:

(number of documents, number of rows, number of columns, number of channels) 

Where: \
number of documents = the number of vectorised matrices, one for each document \
number of rows = number of words in each document \
number of columns = number of dimensions in each vector \
number of channels = depth of each document matrix if not 2D (not relevant here)

One issue here is that the number of rows, columns and channels needs to be the same for all the documents. For my datasets, however, the headlines have different numbers of words in them, so the number of rows will not be the same for each document. The number of columns and channels will be the same, as the dimension of each word vector is the same and I only have 1 channel for each document.

To account for this, what I will do is vectorise each document and then see which document has the most rows. I will then add as many additional 0 vector rows as required for all other matrices so that they have the same number of rows.

One issue with this method is that the model will need to learn that these zero-vector rows provide no information about the headline. This could impact my model's classification accuracy, but we shall see how it gets on.
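
As a small illustration of the padding idea, here is a sketch on toy matrices. It uses `np.pad` rather than the list-based approach I take later in this notebook, and the `pad_to_length` helper and the 4-dimensional toy vectors are made up for the example.

```python
import numpy as np

def pad_to_length(doc_matrix, max_length):
    """Append zero rows so that doc_matrix has exactly max_length rows."""
    missing_rows = max_length - doc_matrix.shape[0]
    if missing_rows < 0:
        raise ValueError("document is longer than max_length")
    # Pad only at the bottom of the row axis; leave the columns untouched
    return np.pad(doc_matrix, ((0, missing_rows), (0, 0)), mode="constant")

# Toy documents with 2 and 3 "words" in a 4-dimensional embedding space
docs = [np.ones((2, 4)), np.ones((3, 4))]

# The longest document decides the common number of rows
max_length = max(doc.shape[0] for doc in docs)
padded = np.stack([pad_to_length(doc, max_length) for doc in docs])

print(padded.shape)
```

After padding, every document has the same shape, so they can be stacked into one array of shape (number of documents, number of rows, number of columns).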

First I need to create my vectorised document matrices and store them in a new array.

In [4]:
# Import pre-processed data
df_fake_news = pd.read_csv('data/preprocessed_fake_news.csv')

In [5]:
df_fake_news

| Unnamed: 0 | Headline | Processed Headline | Body | Processed Body | Label |
| --- | --- | --- | --- | --- | --- |
| 0 | Four ways Bob Corker skewered Donald Trump | way Bob Corker skewer Donald Trump | Image copyright Getty Images\nOn Sunday mornin... | image copyright Getty Images \n Sunday morning... | 1 |
| 1 | Linklater's war veteran comedy speaks to moder... | Linklater war veteran comedy speak modern Amer... | LONDON (Reuters) - “Last Flag Flying”, a comed... | LONDON Reuters flag Flying comedy drama Vietna... | 1 |
| 2 | Trump’s Fight With Corker Jeopardizes His Legi... | Trump ’s Fight Corker Jeopardizes legislative ... | The feud broke into public view last week when... | feud break public view week Mr. Corker say Mr.... | 1 |
| 3 | Egypt's Cheiron wins tie-up with Pemex for Mex... | Egypt Cheiron win tie Pemex mexican onshore oi... | MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin... | MEXICO CITY Reuters Egypt ’s Cheiron Holdings ... | 1 |
| 4 | Jason Aldean opens 'SNL' with Vegas tribute | Jason Aldean open SNL Vegas tribute | Country singer Jason Aldean, who was performin... | country singer Jason Aldean perform Las Vegas ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 10318 | State Department says it can't find emails fro... | State Department say find email Clinton specia... | The State Department told the Republican Natio... | State Department tell Republican National Comm... | 1 |
| 10319 | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | p PBS stand plutocratic Pentagon | The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... | p PBS stand plutocratic Pentagon post Oct 27 2... | 0 |
| 10320 | Anti-Trump Protesters Are Tools of the Oligarc... | Anti trump Protesters tool Oligarchy info... | Anti-Trump Protesters Are Tools of the Oligar... | Anti trump Protesters tool Oligarchy reform... | 0 |
| 10321 | In Ethiopia, Obama seeks progress on peace, se... | Ethiopia Obama seek progress peace security Ea... | ADDIS ABABA, Ethiopia —President Obama convene... | ADDIS ABABA Ethiopia —President Obama convene ... | 1 |


I need to drop the Body columns as I am no longer using them for my models. I also don't require the original Headline column as I will only be vectorising the Processed Headline column.

In [6]:
# Drop body columns
df_fake_news.drop(['Body', 'Processed Body', 'Headline'], axis=1, inplace=True)

In [7]:
df_fake_news.isna().sum()

Processed Headline    6
Label                 0
dtype: int64

As discovered before we have some NaN entries in the headlines so I will remove these before proceeding.

In [8]:
df_fake_news.dropna(inplace=True)

In [9]:
df_fake_news.isna().sum()

Processed Headline    0
Label                 0
dtype: int64

---
### Note: 
In preparing my data input for the CNN I found that, while padding the matrices to the same size, one of the headlines had 65 words. This is an issue as the average headline in my dataset has 7 words, which means a lot of the matrices would have almost 10x more padded zero-vector rows than non-zero rows. So I decided to remove any headlines longer than 13 words, because fewer than 3% of all headlines in my dataset have more than 13 words (see below).

By removing these longer headlines I reduce the amount of required padding on all my matrices which will save space and computation time. It will also make it easier for the model to pick out which vectors are relevant as there won't be a bunch of zero vectors in each matrix.

---

In [12]:
# Test to see what headline length cutoff threshold would be best to remove long headlines

thresh_30 = 0
thresh_25 = 0
thresh_20 = 0
thresh_15 = 0
thresh_13 = 0
thresh_10 = 0

for index, row in df_fake_news.iterrows():
    num_headlines = df_fake_news.shape[0]
    headline = row[0]
    headline_words = headline.split(' ')
    num_words = len(headline_words)
    
    if num_words > 30:
        thresh_30 += 1
        continue
        
    elif num_words > 25:
        thresh_25 += 1
        continue
        
    elif num_words > 20:
        thresh_20 += 1
        continue
        
    elif num_words > 15:
        thresh_15 += 1
        continue
        
    elif num_words > 13:
        thresh_13 += 1
        continue
        
    elif num_words > 10:
        thresh_10 += 1
        continue

print(f' Percent of headlines for threshold 30: {100*thresh_30/num_headlines}')
print(f' Percent of headlines for threshold 25: {100*(thresh_30 + thresh_25)/num_headlines}')
print(f' Percent of headlines for threshold 20: {100*(thresh_30 + thresh_25 + thresh_20)/num_headlines}')
print(f' Percent of headlines for threshold 15: {100*(thresh_30 + thresh_25 + thresh_20 + thresh_15)/num_headlines}')
print(f' Percent of headlines for threshold 13: {100*(thresh_30 + thresh_25 + thresh_20 + thresh_15 + thresh_13)/num_headlines}')

 Percent of headlines for threshold 30: 0.02907822041291073
 Percent of headlines for threshold 25: 0.04846370068818455
 Percent of headlines for threshold 20: 0.12600562178927982
 Percent of headlines for threshold 15: 0.9789667539013279
 Percent of headlines for threshold 13: 2.5201124357855966


In [13]:
# Remove all rows from the original dataset that have headlines with more than 13 words. Save this dataset.

for index, row in df_fake_news.iterrows():
    
    words = row[0].split(' ')
    num_words = len(words)
    
    if num_words > 13:
        df_fake_news.drop(labels=index, inplace=True)
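
Dropping rows one at a time inside a loop works, but a boolean mask does the same job in one step. A sketch on a toy DataFrame standing in for `df_fake_news`; the `max_words` value of 3 is only for the toy data (the notebook uses 13).

```python
import pandas as pd

# Toy stand-in for df_fake_news with the same two columns
toy_df = pd.DataFrame({
    "Processed Headline": ["one two three", "one two three four five", "one"],
    "Label": [1, 0, 1],
})
max_words = 3

# Count the words in every headline at once
word_counts = toy_df["Processed Headline"].str.split(' ').str.len()

# Keep only the rows at or under the limit
filtered_df = toy_df[word_counts <= max_words].reset_index(drop=True)
print(filtered_df)
```

Note that `reset_index(drop=True)` renumbers the rows, which the loop version above does not do; leave it out to keep the original index.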

Now that I have a dataset containing just the pre-processed headlines and their classification labels, with each headline having 13 words or fewer, I will split it into my train and test sets and targets.

In [14]:
from sklearn.model_selection import train_test_split

X = df_fake_news.iloc[:, 0]
y = df_fake_news.iloc[:, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Now I have my training and test data sets, it is time to convert each document into a vectorised matrix using the Google embedding. To do this I will create a function that generates an array of vectorised matrices given a dataset.

In [25]:
def buildDocArray(headline_words):
    
    doc_array = []
    
    # Loop over each word in the headline
    for word in headline_words:

        # Look up the vector for each word. If the word is not in the model's
        # vocabulary, skip it and move on to the next word in the headline
        try:
            word_vec = model[word]

        except KeyError:
            continue

        # Append the word vector as a new row of the document matrix
        doc_array.append(word_vec)   
        
    return doc_array

In [26]:
def padDocument(doc_array, max_length):
    
    # Create a zero vector of shape (300,) to pad with, matching the shape of each word vector
    zero_vector = np.zeros(300)
    
    # Get the current number of rows in the document array
    num_rows = len(doc_array)
    
    # Find how many padded rows are needed
    missing_rows = max_length - num_rows
    
    for i in np.arange(missing_rows):
        doc_array.append(zero_vector)
        
    return doc_array

In [52]:
def documentVectoriser(df, max_length):

    # Output array of vectorised matrices
    output_array = []

    # Loop over every entry in my training set
    for i in np.arange(df.shape[0]):
        
        # Extract each headline
        headline = df.iloc[i]

        # Convert the string into words (headline is a plain Python string here)
        headline_words = headline.split(' ')

        # Build the document array, one row per word found in the model
        doc_array = buildDocArray(headline_words)
         
        # Pad the document array as needed
        doc_array = padDocument(doc_array, max_length)

        # Add the new document array to the output array
        output_array.append(doc_array)

    return np.array(output_array)

In [28]:
train_inputs = documentVectoriser(X_train, 13)
test_inputs = documentVectoriser(X_test, 13)

In [156]:
train_inputs.shape

(8045, 13, 300)
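
At the top of this notebook I mentioned not knowing how to save the vectorised output for use elsewhere. One possible approach is NumPy's own `.npy` format via `np.save` and `np.load`. A minimal sketch on a small random array; the file name is hypothetical and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

# Small random stand-in for train_inputs (5 documents, 13 words, 300 dimensions)
toy_inputs = np.random.rand(5, 13, 300).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_inputs.npy")  # hypothetical file name

    # Write the array to disk, then read it back as another notebook would
    np.save(path, toy_inputs)
    restored = np.load(path)

print(restored.shape)
```

The restored array has the same shape, dtype and values, so the CNN and RNN sections could in principle live in separate notebooks that load the saved arrays.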

At this point, I have vectorised all my headlines. In the process I realised some of the words used in the headlines were not recognised by Google's model, so they were dropped from the vectorisation.
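
To see which words get dropped, the lookup can be split into known and unknown words. A sketch using a plain dict as a hypothetical stand-in for the embedding model; my assumption is that the same `in` membership test also works on the Gensim `KeyedVectors` object, but I have not shown that here.

```python
def split_known_unknown(headline, vocabulary):
    """Split a headline into words found in the vocabulary and words missing from it."""
    words = headline.split(' ')
    known = [word for word in words if word in vocabulary]
    unknown = [word for word in words if word not in vocabulary]
    return known, unknown

# Hypothetical toy vocabulary; the values stand in for word vectors
toy_vocabulary = {"Trump": [0.1], "fight": [0.2]}

known, unknown = split_known_unknown("Trump fight Corker Jeopardizes", toy_vocabulary)
print(known)
print(unknown)
```

Counting the unknown words across all headlines would show how much of each headline is lost before it reaches the model.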

# Convolutional Neural Networks

In [29]:
# Specific neural network models & layer types
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout

We are finally at the exciting part of the modelling, CNNs! To start I will warm up with a simple CNN taken from our in-class lecture (modified for size) to give us a baseline accuracy.

First I need to make sure my training and test data are in the correct shape!

In [30]:
print(f'Training input shape: {train_inputs.shape}')
print(f'Test input shape: {test_inputs.shape}')

Training input shape: (8045, 13, 300)
Test input shape: (2012, 13, 300)


The CNN needs input in the form of a 4D tensor of shape:

(n_documents, num_words, n_dim_word_vector, num_channels)

Currently we have our inputs in the following shape:

(n_documents, num_words, n_dim_word_vector)

So we just need to reshape and add in the num_channels, which in this case is just 1.

In [31]:
# Define input image dimensions
num_words, n_dim, num_channels = 13, 300, 1

# Reshape for Keras model types
X_train_CNN = train_inputs.reshape(train_inputs.shape[0], num_words, n_dim, 1)
X_test_CNN = test_inputs.reshape(test_inputs.shape[0], num_words, n_dim, 1)

In [32]:
# Create simple CNN model architecture with Pooling for dimensionality reduction 
# and Dropout to reduce overfitting

num_classes = 2

CNN_model = Sequential()

CNN_model.add(Conv2D(512, kernel_size=(3, 3), activation = 'relu', input_shape = (num_words, n_dim, num_channels)))
CNN_model.add(Conv2D(1024, (3, 3), activation='relu'))
CNN_model.add(MaxPooling2D(pool_size=(2, 2)))

CNN_model.add(Dropout(0.25))
CNN_model.add(Flatten())
CNN_model.add(Dense(12, activation='relu'))
CNN_model.add(Dropout(0.5))
CNN_model.add(Dense(num_classes, activation='softmax'))

CNN_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 11, 298, 512)      5120      
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 9, 296, 1024)      4719616   
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 4, 148, 1024)      0         
_________________________________________________________________
dropout (Dropout)            (None, 4, 148, 1024)      0         
_________________________________________________________________
flatten (Flatten)            (None, 606208)            0         
_________________________________________________________________
dense (Dense)                (None, 12)                7274508   
_________________________________________________________________
dropout_1 (Dropout)          (None, 12)                0         
__________
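
The output shapes and parameter counts in the summary above can be reproduced by hand, which also shows where the cost of this model comes from. A minimal sketch, assuming the Keras defaults of 'valid' padding and stride 1 for Conv2D, non-overlapping 2x2 pooling, and one bias per filter or unit; the helper names are my own.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights per filter (kernel_h * kernel_w * in_channels) plus one bias, times filters."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def valid_conv_output(size, kernel):
    """Output length along one axis for a stride-1 convolution without padding."""
    return size - kernel + 1

conv1_params = conv2d_params(3, 3, 1, 512)
conv2_params = conv2d_params(3, 3, 512, 1024)

# Two 3x3 convolutions shrink 13 x 300 to 11 x 298 and then 9 x 296
rows = valid_conv_output(valid_conv_output(13, 3), 3)
cols = valid_conv_output(valid_conv_output(300, 3), 3)

# 2x2 max pooling halves each axis (rounding down), then Flatten multiplies everything out
flattened = (rows // 2) * (cols // 2) * 1024

# Dense layer with 12 units: one weight per flattened input plus one bias, per unit
dense_params = (flattened + 1) * 12

print(conv1_params, conv2_params, flattened, dense_params)
```

Most of the parameters sit in the second convolution and in the first dense layer, which has to connect over 600,000 flattened values to its 12 units.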

In [33]:
# Compile the model with the desired loss function, optimizer, and metric to optimize
CNN_model.compile(loss = 'sparse_categorical_crossentropy',
                  optimizer = 'Adam',
                  metrics = ['accuracy'])

So running this model initially gave me a runtime of 5 days. To improve this I will take a 10% subsample of my training and test data so that it hopefully only runs in 12 hours.

In [34]:
# Take a sub-sample of my training and test data set

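# np.random.randint excludes the upper bound, so every index drawn is valid
# (indices are drawn with replacement, so duplicates are possible)
X_sub_train_indices = np.random.randint(0, X_train_CNN.shape[0], 850)
X_sub_test_indices = np.random.randint(0, X_test_CNN.shape[0], 200)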

In [35]:
X_sub_train = X_train_CNN[X_sub_train_indices]
X_sub_test = X_test_CNN[X_sub_test_indices]
y_sub_train = y_train.iloc[X_sub_train_indices]
y_sub_test = y_test.iloc[X_sub_test_indices]
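
The index draw above samples with replacement, so the same headline can appear more than once in the sub-sample. If unique rows are wanted, one alternative is to sample without replacement. A sketch on toy arrays, using NumPy's `default_rng` generator; the seed and the toy data are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for reproducibility

# Toy data: 100 "documents" with 2 features each, and a matching label array
toy_X = np.arange(100 * 2).reshape(100, 2)
toy_y = np.arange(100)

# Draw 10 distinct row indices from 0..99
sub_indices = rng.choice(toy_X.shape[0], size=10, replace=False)

# Index inputs and labels with the same indices so they stay aligned
toy_X_sub = toy_X[sub_indices]
toy_y_sub = toy_y[sub_indices]

print(toy_X_sub.shape, toy_y_sub.shape)
```

Using one set of indices for both the inputs and the labels is what keeps each headline paired with its own label.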

In [39]:
# Fit the model on the training data, defining desired batch_size & number of epochs,
# running validation on the test data after each epoch
# THIS WILL TAKE A LONG TIME TO RUN!!!
CNN_model.fit(X_sub_train, y_sub_train,
              batch_size = 128,
              epochs = 10,
              verbose = 1,
              validation_data = (X_sub_test, y_sub_test))

Train on 850 samples, validate on 200 samples
Epoch 1/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a1d41cb38>

In [40]:
# Evaluate the model's performance on the test data
score = CNN_model.evaluate(X_sub_test, y_sub_test, verbose=1)

print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.7057784414291381
Test accuracy: 0.745


After running for 24 hours we managed to achieve a test accuracy of 74.5%. This is a significant improvement from our previous models and so I will save this CNN to file.
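
Because this accuracy was measured on only 200 test headlines, it is worth getting a rough feel for how much it could move by chance. A back-of-the-envelope sketch using the normal approximation to a binomial proportion; this is my own addition and the helper name is made up.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Standard error of an accuracy estimate under the normal approximation."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)

accuracy, n_samples = 0.745, 200

# 0.745 of 200 headlines corresponds to 149 correct predictions
n_correct = round(accuracy * n_samples)

se = accuracy_standard_error(accuracy, n_samples)

# Approximate 95% interval: about two standard errors either side
low, high = accuracy - 1.96 * se, accuracy + 1.96 * se

print(n_correct, round(se, 3), round(low, 3), round(high, 3))
```

The standard error comes out at roughly 3 percentage points, so differences of a few points between models evaluated on this sub-sample should be read with some caution.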

Ref: https://machinelearningmastery.com/save-load-keras-deep-learning-models/

In [43]:
from tensorflow.keras.models import model_from_json

# serialize model to JSON
model_json = CNN_model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
CNN_model.save_weights("model.h5")
print("Saved model to disk")

Saved model to disk


In [44]:
# load json and create model
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")

Loaded model from disk


In [45]:
# evaluate loaded model on test data
loaded_model.compile(loss = 'sparse_categorical_crossentropy',
                  optimizer = 'Adam',
                  metrics = ['accuracy'])
score = loaded_model.evaluate(X_sub_test, y_sub_test, verbose=1)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))

acc: 74.50%


# Recurrent Neural Networks

The last model I will be using is a Recurrent Neural Network (RNN). I have high expectations that this will be the best model for my dataset as RNNs leverage the connectedness of a dataset. In my case, the sequence of words in my headlines plays an important role and an RNN is able to use the sequence as input for the model.

The data input for an RNN is almost the same as for the CNN, but we no longer require the num_channels. So I just need to reshape the data. For this I will use the same sub-sample as for the CNN.

In [167]:
# Reshape the sub-sample training and test data

X_train_RNN_sub_sample = np.reshape(X_sub_train, (X_sub_train.shape[0], X_sub_train.shape[1],X_sub_train.shape[2]))
X_test_RNN_sub_sample = np.reshape(X_sub_test, (X_sub_test.shape[0], X_sub_test.shape[1], X_sub_test.shape[2]))

In [168]:
print(X_train_RNN_sub_sample.shape)
print(X_test_RNN_sub_sample.shape)

(850, 13, 300)
(200, 13, 300)


850 = # headlines \
13 = # words in each headline (including padded zero vectors) \
300 = dimension of the embedded word vectors
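
Dropping the channel axis does not change any values, only the shape. A small sketch on a toy array showing that the reshape used above and `np.squeeze` on the last axis give the same result; the toy array is all zeros apart from one marked entry.

```python
import numpy as np

# Toy CNN-style input: 5 documents, 13 words, 300 dimensions, 1 channel
toy_cnn_input = np.zeros((5, 13, 300, 1))
toy_cnn_input[2, 4, 7, 0] = 1.0  # mark one entry so we can track it

# Option 1: reshape to the first three dimensions, as in the cell above
via_reshape = np.reshape(toy_cnn_input, toy_cnn_input.shape[:3])

# Option 2: squeeze out the trailing axis of length 1
via_squeeze = np.squeeze(toy_cnn_input, axis=-1)

print(via_reshape.shape, via_squeeze.shape)
```

`np.squeeze(..., axis=-1)` raises an error if the last axis is not of length 1, which makes it a slightly safer way to say "remove the channel axis".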

In [169]:
# Building our RNN
from tensorflow.keras.layers import LSTM, BatchNormalization, Flatten
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import model_from_json

RNN_model = Sequential()

RNN_model.add(LSTM(2048, activation='relu', return_sequences=True))
RNN_model.add(Dropout(0.2))
RNN_model.add(BatchNormalization())

RNN_model.add(LSTM(618, activation='relu', return_sequences=True))
RNN_model.add(Dropout(0.2))
RNN_model.add(BatchNormalization())

RNN_model.add(LSTM(1024, activation='relu'))
RNN_model.add(Dropout(0.2))
RNN_model.add(BatchNormalization())

RNN_model.add(Dense(32, activation='relu'))
RNN_model.add(Dropout(0.1))

RNN_model.add(Dense(2, activation='softmax'))

In [170]:
# Compile model
RNN_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='Adam',
    metrics=['accuracy']
)

In [171]:
# Save an image of its architecture to file
plot_model(RNN_model, to_file='data/RNN_model.png', show_shapes=True, show_layer_names=True)

In [172]:
y_RNN_test = np.array(y_sub_test)
y_RNN_train = np.array(y_sub_train)

EPOCHS = 40       # number of full passes the network makes over the training data
BATCH_SIZE = 480   # within each epoch, the data is split into batches of 480 samples and trained on batch by batch

RNN_model.fit(X_train_RNN_sub_sample, y_RNN_train,
               batch_size=BATCH_SIZE,
               epochs=EPOCHS,
              verbose=1,
               validation_data = (X_test_RNN_sub_sample, y_RNN_test))

Train on 850 samples, validate on 200 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x1bb428c6d8>
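
To make the relationship between the batch size and the number of weight updates concrete, here is a small worked calculation. It assumes that a final, smaller batch is used for the leftover samples, which is my understanding of how Keras handles array inputs; the helper name is my own.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover n_samples once, counting a smaller final batch."""
    return math.ceil(n_samples / batch_size)

# Sub-sample of 850 headlines with a batch size of 480
sub_sample_batches = batches_per_epoch(850, 480)
last_batch_size = 850 - 480 * (sub_sample_batches - 1)

# Full training set of 8045 headlines with the same batch size
full_data_batches = batches_per_epoch(8045, 480)

print(sub_sample_batches, last_batch_size, full_data_batches)
```

With only two batches per epoch on the sub-sample, 40 epochs amounts to 80 weight updates in total, so a smaller batch size would be one thing to experiment with.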

Running the RNN on the 10% subsample achieved a test accuracy of 58%, as the evaluation below shows.

In [173]:
# Evaluate the model's performance on the test data
score = RNN_model.evaluate(X_test_RNN_sub_sample, y_RNN_test, verbose=1)

print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 1.1123304891586303
Test accuracy: 0.58


In [None]:
# serialize model to JSON
RNN_model_json = RNN_model.to_json()
with open("RNN_model.json", "w") as json_file:
    json_file.write(RNN_model_json)
# serialize weights to HDF5
RNN_model.save_weights("RNN_model.h5")
print("Saved model to disk")

In [None]:
# load json and create model
RNN_json_file = open('RNN_model.json', 'r')
RNN_loaded_model_json = RNN_json_file.read()
RNN_json_file.close()
RNN_loaded_model = model_from_json(RNN_loaded_model_json)
# load weights into new model
RNN_loaded_model.load_weights("RNN_model.h5")
print("Loaded model from disk")

In [None]:
# evaluate loaded model on test data
RNN_loaded_model.compile(loss = 'sparse_categorical_crossentropy',
                  optimizer = 'Adam',
                  metrics = ['accuracy'])
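# Use the 3D RNN inputs here; X_sub_test still has the extra channel axis used by the CNN
score = RNN_loaded_model.evaluate(X_test_RNN_sub_sample, y_RNN_test, verbose=1)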
print("%s: %.2f%%" % (RNN_loaded_model.metrics_names[1], score[1]*100))

In [175]:
# Compile model
RNN_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='Adam',
    metrics=['accuracy']
)

In [177]:
EPOCHS = 40       # number of full passes the network makes over the training data
BATCH_SIZE = 480   # within each epoch, the data is split into batches of 480 samples and trained on batch by batch

RNN_model.fit(train_inputs, y_train,
               batch_size=BATCH_SIZE,
               epochs=EPOCHS,
              verbose=1,
               validation_data = (test_inputs, y_test))

Train on 8045 samples, validate on 2012 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x1b46db3080>

In [178]:
# Evaluate the model's performance on the 200-headline test sub-sample
score = RNN_model.evaluate(X_test_RNN_sub_sample, y_RNN_test, verbose=1)

print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.7389770400524139
Test accuracy: 0.81
