# Challenge: Building neural networks

Now take your Keras skills and go build another neural network. Pick your own dataset, but it should be of an abstract type, possibly even nonnumeric, and use Keras to build several implementations of your network. Compare them in both computational complexity and accuracy, and given that tradeoff decide which one you like best.

Your dataset should be sufficiently large for a neural network to perform well (samples should really be in the thousands here). Try to pick something that takes advantage of neural networks’ ability to combine feature extraction with supervised learning, so don’t pick something with an easy-to-consume list of features already generated for you (though neural networks can still be useful in those contexts).

## The dataset

I build on the work I already did for the 20 newsgroups dataset. It is a text-based dataset built into the scikit-learn library and comprises around 18,000 newsgroup posts on 20 topics. The main page and instructions for downloading can be found [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

By using this dataset I can compare the results of various supervised algorithms to the results of using various neural networks.

Among the supervised algorithms, the highest accuracy came from a Naive Bayes model (82%), which also had a speedy runtime of 0.03s.

## Summary

I apply four different models to the dataset.

1. **Multi-layer perceptron with tf-idf**: This model performs relatively well. Running with epochs = 15 it obtains an accuracy of 81%, already close to the 82% of the Naive Bayes model. While further gains could be extracted by optimizing the model (no grid search approach is taken here), the biggest cost of shifting to a neural net is the run time. A single epoch (30s) takes orders of magnitude longer than Naive Bayes (0.03s) and is several times slower than a standard SGD model (6s).

2. **Multi-layer perceptron using embedding**: Directly using an embedding layer in the neural network (rather than simply inputting a tf-idf matrix) did not improve the model (75% accuracy). To keep the number of parameters smaller, the vocabulary is capped and each document is padded or truncated to a fixed length. This, I think, leads to information loss (particularly in the padding process, compared to creating a tf-idf matrix).

3. **Recurrent neural network**: The RNN model does surprisingly poorly (45% accuracy). This is likely due to the setup (particularly the restricting of words and condensing of the feature set). However, with a similar setup the MLP still obtains 75% accuracy.

4. **Long short-term memory with word2vec as input**: This model is interesting as it uses vectors created by the word2vec process as input to another neural network, in this case an LSTM. Using word2vec should improve on the bag-of-words/tf-idf approach, as the vectors created by word2vec retain more information. It should also solve some of the padding/embedding issues. However, it takes too long to run (over an hour per epoch).


**Sources:**

This notebook is highly indebted to two sources:

https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/20_Natural_Language_Processing.ipynb

https://github.com/giuseppebonaccorso/Reuters-21578-Classification/blob/master/Text%20Classification.ipynb

# Preprocessing

In [1]:
import tensorflow as tf
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dropout, Flatten, Conv2D, MaxPooling2D
from keras.layers import LSTM, Input, TimeDistributed, SimpleRNN
from keras.models import Model
from keras.optimizers import RMSprop

# Import the backend
from keras import backend as K

Using TensorFlow backend.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
import seaborn as sns
import re
from nltk.corpus import stopwords
import random

In [3]:
from sklearn.datasets import fetch_20newsgroups


# Full dataset categories
categories = ['comp.graphics','comp.os.ms-windows.misc',
                  'comp.sys.ibm.pc.hardware','comp.sys.mac.hardware',
                  'comp.windows.x', 'rec.autos','rec.motorcycles',
                  'rec.sport.baseball','rec.sport.hockey', 'sci.crypt',
                  'sci.electronics','sci.med','sci.space',
                  'misc.forsale', 'talk.politics.misc',
                  'talk.politics.guns','talk.politics.mideast', 'talk.religion.misc',
                  'alt.atheism','soc.religion.christian']

# Reduce size of dataset to improve speed
random.seed(13)
categories_small = random.sample(categories, 8)

# Import dataset (type Bunch)
# Remove headers, footers, and quotes so that classifiers only work on text
dataset = fetch_20newsgroups(subset='train', categories=categories_small,
                             remove=('headers', 'footers', 'quotes'),
                             shuffle=True, random_state=42)


dataset_test = fetch_20newsgroups(subset='test', categories=categories_small,
                                  remove=('headers', 'footers', 'quotes'),
                                  shuffle=True, random_state=42)

# Also import full dataset for cv
dataset_full = fetch_20newsgroups(subset='all', categories=categories_small,
                                  remove=('headers', 'footers', 'quotes'),
                                  shuffle=True, random_state=42)

# Convert to dataframe
news = pd.DataFrame(dataset.data, columns=['Text'])
news_test = pd.DataFrame(dataset_test.data, columns=['Text'])
news_full = pd.DataFrame(dataset_full.data, columns=['Text'])

# Set outcome variable
y = dataset.target
y_test = dataset_test.target
y_full = dataset_full.target

In [5]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Basic stopwords list
stopWords = stopwords.words('english')
# Strip apostrophes from the stop words (e.g. "i'll" becomes "ill");
# a set keeps the membership test in textcleaner fast
stopped = {re.sub(r"'", '', w) for w in stopWords}

def textcleaner(text):
    ''' Takes in raw unformatted text, lowercases it, strips punctuation and
    numbers, tokenizes, drops stopwords and words shorter than three
    characters, and stems what is left.
    Returns a string of processed text to be passed to CountVectorizer
    '''
    # Lowercase and strip everything except words
    cleaner = re.sub(r"[^a-zA-Z ]+", ' ', text.lower())
    # Tokenize
    cleaner = word_tokenize(cleaner)
    ps = PorterStemmer()
    clean = []
    for w in cleaner:
        # filter out stopwords
        if w not in stopped:
            # filter out short words
            if len(w)>2:
                # Stem 
                clean.append(ps.stem(w))
    return ' '.join(clean)
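
To see what the regex and the length filter do without NLTK installed, here is a stripped-down sketch of the same cleaning steps. It is not the function above: it skips stemming, splits on whitespace instead of calling `word_tokenize`, and uses a tiny hypothetical stop-word set (`STOPWORDS`), so its output will differ from `textcleaner` on real posts.

```python
import re

# Hypothetical, deliberately tiny stop-word set (an assumption for this sketch;
# the notebook uses the full NLTK English list).
STOPWORDS = {"the", "and", "for", "was"}

def simple_clean(text, stopwords=STOPWORDS, min_len=3):
    # Lowercase, then replace every run of non-letter characters with a space
    letters_only = re.sub(r"[^a-zA-Z ]+", " ", text.lower())
    # Whitespace split stands in for NLTK's word_tokenize
    tokens = letters_only.split()
    # Drop stop words and words shorter than min_len characters
    kept = [t for t in tokens if t not in stopwords and len(t) >= min_len]
    return " ".join(kept)

sample = "The Astros won 10-8, and I'll join!"
cleaned = simple_clean(sample)
print(cleaned)
```

Note that the apostrophe in "I'll" is replaced by a space, so it becomes the two short tokens "i" and "ll", both of which the length filter removes.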

In [6]:
# Clean up the dfs
news['Clean_text'] = news.Text.apply(textcleaner)
print('Done news')
news_test['Clean_text'] = news_test.Text.apply(textcleaner)
print('Done test')
news_full['Clean_text'] = news_full.Text.apply(textcleaner)
print('Done full')

Done news
Done test
Done full


In [7]:
# Drop unprocessed text
news.drop(['Text'], inplace=True, axis=1)
news_test.drop(['Text'], inplace=True, axis=1)
news_full.drop(['Text'], inplace=True, axis=1)

# Multi-layer perceptron

In [8]:
from keras.models import Sequential
from keras.layers import Dense, GRU, Embedding
from keras.optimizers import Adam
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from scipy.spatial.distance import cdist

### Getting data into correct form

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Counting number of words
count_vec = CountVectorizer(strip_accents='ascii', ngram_range=(1, 1), 
                                analyzer='word',  stop_words='english')
count_fit = count_vec.fit_transform(news_full.loc[:, 'Clean_text'])
#Number of columns = number of unique words
print("Number of unique words:", count_fit.shape[1])

Number of unique words: 31059


In [9]:
# Can use num_words to reduce size to speed up model
num_words = count_fit.shape[1]

# Tokenize words (a bit of duplication here from original cleaning...)
# The +1 is because Keras reserves index 0 (the embedding layer can mask on it),
# so word indices start at 1 and column 0 of the matrix stays empty
# See https://github.com/keras-team/keras/issues/8583
tokenizer = Tokenizer(num_words = (num_words + 1))
tokenizer.fit_on_texts(news_full.loc[:,'Clean_text'])

# tokenizer.word_index is a dict mapping words to indices

# Convert words in sample to index
x_train_tokens = tokenizer.texts_to_sequences(news.loc[:, 'Clean_text'])
x_test_tokens = tokenizer.texts_to_sequences(news_test.loc[:, 'Clean_text'])

# Compare...
print(news.iloc[0:1])
# ...to
print((x_train_tokens[0]))

                      Clean_text
0  make ten eight met astro join
[15, 1383, 2467, 1205, 2337, 1965]


In [10]:
# Currently samples are of different length, move into matrix
# Use built-in tfidf transformation
X_train_matrix = tokenizer.texts_to_matrix(news.loc[:, 'Clean_text'], mode='tfidf')
X_test_matrix = tokenizer.texts_to_matrix(news_test.loc[:, 'Clean_text'], mode='tfidf')

# Compare...
print(news.iloc[0:1])
# ...to
print((x_train_tokens[0]))
print(len(x_train_tokens[0]))
#... to
print(X_train_matrix[0])
print(X_train_matrix.shape)

                      Clean_text
0  make ten eight met astro join
[15, 1383, 2467, 1205, 2337, 1965]
6
[0. 0. 0. ... 0. 0. 0.]
(4733, 31060)
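
The matrix has one more column than there are unique words because index 0 is reserved. The sketch below re-implements the tf-idf weighting in plain Python to make that visible. It assumes, as I understand the Keras `Tokenizer`, that `mode='tfidf'` uses `tf = 1 + log(count)` and `idf = log(1 + n_docs / (1 + doc_freq))`; the function name `tfidf_matrix` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    # Word indices start at 1 (most frequent word first); index 0 is never assigned
    counts = Counter(w for d in docs for w in d.split())
    word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
    # Number of documents each word appears in
    doc_freq = Counter(w for d in docs for w in set(d.split()))
    n_docs = len(docs)

    matrix = []
    for d in docs:
        row = [0.0] * (len(word_index) + 1)  # +1 for the unused column 0
        for w, c in Counter(d.split()).items():
            tf = 1 + math.log(c)
            idf = math.log(1 + n_docs / (1 + doc_freq[w]))
            row[word_index[w]] = tf * idf
        matrix.append(row)
    return matrix, word_index

toy_docs = ["space launch space", "car engine"]
matrix, word_index = tfidf_matrix(toy_docs)
print(len(word_index), "words ->", len(matrix[0]), "columns")
print("column 0:", [row[0] for row in matrix])
```

Column 0 is zero for every document, which is why the model cells below can slice it off with `[:, 1:]` without losing information.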


In [9]:
# One hot encoding for y
y_train_long = keras.utils.to_categorical(y, 8)
print(y_train_long.shape)
y_test_long = keras.utils.to_categorical(y_test, 8)
print(y_test_long.shape)

(4733, 8)
(3151, 8)
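
For reference, one-hot encoding turns each integer class label into a row with a single 1 in the position of that class. A minimal plain-Python sketch of the idea (the helper `one_hot` is a name introduced here, not the Keras implementation):

```python
def one_hot(labels, num_classes):
    # One row per label, with a 1.0 in the column of the label's class
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)

# The original labels can be recovered by taking the position of the maximum
decoded = [row.index(max(row)) for row in encoded]
print(decoded)
```

This is the shape the softmax output layer and the `categorical_crossentropy` loss expect: one column per class.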


### The model

In [14]:
from keras import optimizers

model = Sequential()
model.add(Dense(256, activation='relu', input_shape=(num_words,)))
# Dropout layers remove features and fight overfitting
model.add(Dropout(0.1))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.1))
# End with a number of units equal to the number of classes we have for our outcome
model.add(Dense(8, activation='softmax'))

model.summary()

# Compile the model to put it all together.
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(),
              metrics=['accuracy'])

# Fitting model (dropping column 0, which is reserved by the tokenizer and always empty)
history = model.fit(X_train_matrix[:,1:], y_train_long,
                    batch_size=128,
                    epochs=15,
                    verbose=1,
                    validation_data=(X_test_matrix[:, 1:], y_test_long))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 256)               7951360   
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 128)               32896     
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 8)                 1032      
Total params: 7,985,288
Trainable params: 7,985,288
Non-trainable params: 0
_________________________________________________________________
Train on 4733 samples, validate on 3151 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8
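
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic, using the layer sizes from this model (`dense_params` is a helper name introduced here):

```python
def dense_params(n_inputs, n_units):
    # Weight matrix (n_inputs x n_units) plus one bias per unit
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers; 31059 is the vocabulary size
layers = [(31059, 256), (256, 128), (128, 8)]
per_layer = [dense_params(n_in, n_out) for n_in, n_out in layers]
total = sum(per_layer)

print(per_layer)
print(total)
```

Almost all of the parameters sit in the first layer, so shrinking the vocabulary is the most direct way to make this model smaller.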

In [15]:
score = model.evaluate(X_test_matrix[:, 1:], y_test_long, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.8122931501669416
Test accuracy: 0.80958425900324


In [16]:
all_scores = {}
all_scores['mlp_single'] = score[1]

## Multi-layer perceptron with embedding

In [12]:
# Tokenize words (a bit of duplication here from original cleaning...)
# No +1 this time: cap the vocabulary at roughly the 20,000 most frequent words.
# Note that this overwrites the num_words used for the tf-idf model above.
num_words = 20000
tokenizer = Tokenizer(num_words = num_words)
tokenizer.fit_on_texts(news_full.loc[:,'Clean_text'])


x_train_tokens = tokenizer.texts_to_sequences(news.loc[:, 'Clean_text'])
x_test_tokens = tokenizer.texts_to_sequences(news_test.loc[:, 'Clean_text'])

num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
num_tokens = np.array(num_tokens)

print(np.mean(num_tokens))
print(np.max(num_tokens))
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
print(max_tokens)

# Pad the samples to get them to equal length
pad = 'pre'
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)

x_train_pad.shape


76.00634195839675
6735
573


(4733, 573)
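
To make the padding step concrete, here is a plain-Python sketch of what `padding='pre'` and `truncating='pre'` do to a single sequence, together with the mean-plus-two-standard-deviations cutoff used above. The helper `pad_pre` and the toy lengths are introduced here for illustration only, and the helper assumes `maxlen` is positive.

```python
import statistics

def pad_pre(tokens, maxlen, value=0):
    # truncating='pre': keep only the LAST maxlen tokens
    kept = tokens[-maxlen:]
    # padding='pre': fill with zeros at the FRONT
    return [value] * (maxlen - len(kept)) + kept

short_doc = pad_pre([5, 6], maxlen=4)
long_doc = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_doc)
print(long_doc)

# Cutoff as in the cell above: mean + 2 * population standard deviation
lengths = [2, 6, 3, 5]
cutoff = int(statistics.mean(lengths) + 2 * statistics.pstdev(lengths))
print(cutoff)
```

With an average of 76 tokens per post and a cutoff of 573, most rows of `x_train_pad` are mostly zeros, while the longest posts lose everything before their last 573 tokens.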

In [19]:
from keras.optimizers import RMSprop

model = Sequential()
embedding_size = 100
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
# Dropout layers remove features and fight overfitting
model.add(Dropout(0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
# End with a number of units equal to the number of classes we have for our outcome
model.add(Dense(8, activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 573, 100)          2000000   
_________________________________________________________________
flatten_1 (Flatten)          (None, 57300)             0         
_________________________________________________________________
dense_7 (Dense)              (None, 128)               7334528   
_________________________________________________________________
dropout_5 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_6 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 8)                 520       
Total para

In [20]:
# Compile the model to put it all together.
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(),
              metrics=['accuracy'])

# Fitting model on the padded token sequences
history = model.fit(x_train_pad, y_train_long,
                    batch_size=128,
                    epochs=10,
                    verbose=1,
                    validation_data=(x_test_pad, y_test_long))

score = model.evaluate(x_test_pad, y_test_long, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 4733 samples, validate on 3151 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.858665461626858
Test accuracy: 0.7543636940842922


In [21]:
all_scores['mlp_embed'] = score[1]

# Recurrent neural network

In [22]:
from keras.optimizers import RMSprop

model = Sequential()
#embedding_size = 100
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))
model.add(GRU(units=16, return_sequences=True))
model.add(GRU(units=8, return_sequences=True))
model.add(GRU(units=4))
model.add(Dense(8, activation='softmax'))
# Compile with RMSprop at its default learning rate
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 573, 100)          2000000   
_________________________________________________________________
gru_1 (GRU)                  (None, 573, 16)           5616      
_________________________________________________________________
gru_2 (GRU)                  (None, 573, 8)            600       
_________________________________________________________________
gru_3 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense_10 (Dense)             (None, 8)                 40        
Total params: 2,006,412
Trainable params: 2,006,412
Non-trainable params: 0
_________________________________________________________________
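
The GRU parameter counts can be reproduced by hand as well. The sketch below assumes the classic GRU formulation with one bias vector per gate, which matches the numbers printed above; `gru_params` is a helper name introduced here, and newer Keras versions that default to `reset_after=True` would, as far as I know, report slightly higher counts.

```python
def gru_params(n_inputs, n_units):
    # Three blocks (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights and one bias vector
    return 3 * (n_units * (n_inputs + n_units) + n_units)

embedding = 20000 * 100                      # vocabulary size x embedding size
gru_layers = [gru_params(100, 16), gru_params(16, 8), gru_params(8, 4)]
dense = 4 * 8 + 8                            # 4 inputs, 8 classes, plus biases

recurrent_total = sum(gru_layers)
total = embedding + recurrent_total + dense
print(gru_layers, recurrent_total, total)
```

Only 6,372 of the roughly two million parameters belong to the recurrent layers, and the final GRU squeezes each document into four numbers before the softmax. That bottleneck may be part of why this model underperforms the MLP, although I did not test this.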


In [23]:
history = model.fit(x_train_pad, y_train_long,
                    batch_size=128,
                    epochs=10,
                    verbose=1,
                    validation_data=(x_test_pad, y_test_long))

score = model.evaluate(x_test_pad, y_test_long, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 4733 samples, validate on 3151 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 1.472765001105112
Test accuracy: 0.45826721678495935


In [24]:
all_scores['rnn'] = score[1]
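
A quick way to compare the models run so far is to rank the collected accuracies against the Naive Bayes baseline. The sketch below hardcodes the test accuracies printed above (rounded to four decimals) and the 82% baseline quoted in the dataset section; in the notebook itself the `all_scores` dict could be used directly. The names `reported_scores` and `baseline` are introduced here.

```python
# Test accuracies copied from the outputs above, rounded to four decimals
reported_scores = {
    'mlp_single': 0.8096,
    'mlp_embed': 0.7544,
    'rnn': 0.4583,
}
baseline = 0.82  # Naive Bayes accuracy quoted in the dataset section

# Sort from best to worst accuracy
ranked = sorted(reported_scores.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<12} {acc:.3f}  ({acc - baseline:+.3f} vs Naive Bayes)")
```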

## Using tf-idf matrix rather than embedding layer

In [26]:
num_words = count_fit.shape[1]

# Tokenize words (a bit of duplication here from original cleaning...)
# The +1 is because I'm not using the embedding layer (which masks on index 0)
# See https://github.com/keras-team/keras/issues/8583
tokenizer = Tokenizer(num_words = (num_words + 1))
tokenizer.fit_on_texts(news_full.loc[:,'Clean_text'])

# Tokenizer is a dict map words to index

X_train_matrix = tokenizer.texts_to_matrix(news.loc[:, 'Clean_text'], mode='tfidf')
X_test_matrix = tokenizer.texts_to_matrix(news_test.loc[:, 'Clean_text'], mode='tfidf')


In [44]:
# Change to 3D: (samples, timesteps, features), with one tf-idf value per timestep
X_train_matrix3 = np.reshape(X_train_matrix, (len(X_train_matrix), num_words+1, 1))

model = Sequential()
model.add(SimpleRNN(units=32, return_sequences=True, input_shape=(num_words+1, 1)))
model.add(SimpleRNN(units=16, return_sequences=True))
model.add(SimpleRNN(units=8))
model.add(Dense(8, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])


In [None]:
# DANGER: This takes hours to run... Not attempted here

# The test matrix needs the same 3D shape as the training matrix
X_test_matrix3 = np.reshape(X_test_matrix, (len(X_test_matrix), num_words+1, 1))

# Compile the model to put it all together.
# Note: this replaces the RMSprop optimizer set in the previous cell with Adam.
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(),
              metrics=['accuracy'])

# Fitting model (column 0 is kept here, matching the input_shape of num_words+1)
history = model.fit(X_train_matrix3, y_train_long,
                    batch_size=128,
                    epochs=5,
                    verbose=1,
                    #validation_data=(X_test_matrix3, y_test_long)
                   )

score = model.evaluate(X_test_matrix3, y_test_long, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

In [None]:
all_scores['rnn_tf_idf'] = score[1]

# Long short-term memory (LSTM) with Word2vec as input layer

In [10]:
# Gensim model expects tokens
X_train = news.Clean_text.apply(lambda x: x.split(sep=' '))
X_test = news_test.Clean_text.apply(lambda x: x.split(sep=' '))
X_full = news_full.Clean_text.apply(lambda x: x.split(sep=' '))

In [11]:
import gensim
from gensim.models import word2vec

# Word2Vec number of features
num_features = 100

# Creating instance
# Note: `size` is the gensim 3.x parameter name; gensim 4+ renamed it to `vector_size`
w2v_model = word2vec.Word2Vec(
    X_full,
    #workers=4,     # Number of worker threads used for training.
    #min_count=1,   # Minimum word count threshold.
    window=10,      # Number of words around target word to consider.
    #sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3,    # Downsample frequent words.
    size=num_features,  # Word vector length.
    hs=1            # Use hierarchical softmax.
)

print('done!')



done!


In [13]:
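# Map each word to its vector; this rebinds w2v_model from the gensim model to a plain dict
# (`index2word` is the gensim 3.x attribute; gensim 4+ calls it `index_to_key`)
w2v_model = dict(zip(w2v_model.wv.index2word, w2v_model.wv.vectors))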

In [14]:
# Limit each newsline to a fixed number of words
document_max_num_words = max_tokens

# Categories (should really refer to start rather than hardcoded)
num_categories = 8

# Creating vars to match code below
number_of_documents = len(news_full)

X_wv = np.zeros(shape=(number_of_documents, document_max_num_words, num_features)).astype(np.float32)

empty_word = np.zeros(num_features).astype(np.float32)

for idx, document in enumerate(news_full.loc[:, 'Clean_text']):
    # Split into words first: iterating over the raw string would yield single characters
    for jdx, word in enumerate(document.split()):
        if jdx == document_max_num_words:
            break

        if word in w2v_model:
            X_wv[idx, jdx, :] = w2v_model[word]
        else:
            X_wv[idx, jdx, :] = empty_word


In [15]:
print(X_wv.shape)

(7884, 573, 100)
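
The lookup step is easier to follow on a toy example. The sketch below fills a fixed-size grid of word vectors for one document, using a hypothetical two-dimensional vector dictionary (`toy_vectors`) in place of the trained word2vec model; `embed_document` is a helper name introduced here. The input has to be a list of words, since iterating over a raw string would walk through it character by character.

```python
def embed_document(tokens, vectors, max_words, dim):
    # Start from all-zero rows: unknown words and unused positions stay zero
    rows = [[0.0] * dim for _ in range(max_words)]
    for position, word in enumerate(tokens[:max_words]):
        if word in vectors:
            rows[position] = list(vectors[word])
    return rows

# Hypothetical 2-dimensional word vectors, for illustration only
toy_vectors = {"space": [1.0, 0.0], "launch": [0.0, 1.0]}

doc = "space shuttle launch".split()
embedded = embed_document(doc, toy_vectors, max_words=4, dim=2)
print(embedded)
```

Unlike the `pad_sequences` call earlier, this scheme pads at the end of the document and truncates from the end, so the LSTM sees the start of each post first and then a run of zero vectors.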


In [16]:
Y_wv_long = keras.utils.to_categorical(y_full, 8)
Y_wv_long.shape

(7884, 8)

In [17]:
from sklearn.model_selection import train_test_split
X_train_wv, X_test_wv, y_train_wv, y_test_wv = train_test_split(X_wv, Y_wv_long, test_size=0.2)

In [26]:
from keras.layers import Dense, Dropout, Activation, LSTM

model = Sequential()

model.add(LSTM(int(document_max_num_words*1.1), input_shape=(document_max_num_words, num_features)))
model.add(Dropout(0.3))
model.add(Dense(8))
model.add(Activation('softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_4 (LSTM)                (None, 630)               1842120   
_________________________________________________________________
dropout_4 (Dropout)          (None, 630)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 8)                 5048      
_________________________________________________________________
activation_4 (Activation)    (None, 8)                 0         
Total params: 1,847,168
Trainable params: 1,847,168
Non-trainable params: 0
_________________________________________________________________
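
The size of this model follows directly from the choice of 630 LSTM units (573 timesteps times 1.1, rounded down). An LSTM has four weight blocks where a GRU has three, so the count grows quickly with the number of units. A sketch of the arithmetic, with `lstm_params` as a helper name introduced here:

```python
def lstm_params(n_inputs, n_units):
    # Four blocks (input, forget and output gates plus the cell candidate),
    # each with input weights, recurrent weights and one bias vector
    return 4 * (n_units * (n_inputs + n_units) + n_units)

units = int(573 * 1.1)          # document_max_num_words * 1.1
lstm = lstm_params(100, units)  # 100-dimensional word2vec inputs
dense = units * 8 + 8           # 8 output classes plus biases
total = lstm + dense
print(units, lstm, dense, total)
```

These weights are applied at each of the 573 timesteps for every sample, which is consistent with the hour-long epoch estimate shown in the training output below.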


In [27]:
# An hour is a bit long to wait...

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model.fit(X_train_wv, y_train_wv, batch_size=64, epochs=5)

# Evaluate model
score, acc = model.evaluate(X_test_wv, y_test_wv, batch_size=128)
    
print('Score: %1.4f' % score)
print('Accuracy: %1.4f' % acc)

Epoch 1/5
 448/6307 [=>............................] - ETA: 1:03:52 - loss: 2.0805 - acc: 0.1094

KeyboardInterrupt: 