## Getting Started

In [2]:
#load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from keras.utils import to_categorical
from keras.layers import LSTM, GRU, Dense, GlobalMaxPool1D, Embedding, Dropout
from keras.models import Sequential
from keras.preprocessing import text, sequence
%matplotlib inline
np.random.seed(0)

Using TensorFlow backend.


## Import and Preprocess Text Data

Specifically, we'll need to:

* Import and load the data and labels, and store them separately
* Convert the labels to a one-hot encoded format
* Tokenize our text data
* Convert the tokenized text to sequences
* Pad the sequences so that they are all the same length


In [3]:
#get data and labels
newsgroups = fetch_20newsgroups()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [4]:
#split data
data = newsgroups.data
labels = newsgroups.target

In [6]:
#convert labels to one-hot encoded (OHE) format using to_categorical
labels = to_categorical(labels, 20)
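
To make the one-hot step concrete, here is a minimal NumPy sketch of the layout that `to_categorical` produces for integer class labels. The helper name `one_hot` and the toy labels are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Illustrative stand-in for to_categorical: one row per label, a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # Put a 1 in the column that matches each integer label.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

toy_labels = [0, 2, 1]  # hypothetical labels for a 3-class problem
encoded = one_hot(toy_labels, num_classes=3)
print(encoded)
```

In the notebook the same idea is applied with 20 columns, one per newsgroup.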

### Creating Sequences From Text

Any time we work with text data for deep learning, we can expect to see the following preprocessing pattern:

> **raw text --> tokenized text --> text sequences --> padded sequences**


In [7]:
#instantiate Tokenizer with a limit of the 20000 most used words
tokenizer = text.Tokenizer(num_words=20000)
#fit on list of data
tokenizer.fit_on_texts(list(data))
#convert to sequences
list_tokenized_train = tokenizer.texts_to_sequences(data)
#pad or truncate every sequence to exactly 100 tokens with pad_sequences
X_t = sequence.pad_sequences(list_tokenized_train, maxlen=100)
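
The pipeline above can be hard to picture, so here is a rough pure-Python sketch of what each stage does on a toy corpus. The helper names (`fit_word_index`, `to_sequences`, `pad`) and the toy texts are hypothetical. The sketch only splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation, and it assumes the `pad_sequences` defaults of padding and truncating at the front.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most common, 0 is reserved for padding."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each known word with its index, dropping words ranked at or beyond num_words."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

def pad(seqs, maxlen):
    """Left-pad with zeros and keep only the last maxlen tokens of longer sequences."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in seqs]

toy_texts = ["the cat sat", "the cat sat on the mat today"]  # hypothetical corpus
word_index = fit_word_index(toy_texts)
padded = pad(to_sequences(toy_texts, word_index, num_words=20000), maxlen=5)
print(padded)
```

The short document gains leading zeros and the long one loses its earliest tokens, so every row ends up the same length, just as every row of `X_t` has length 100.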

## Creating Our Models



### LSTM Model

Both of our models will stick to the following architecture:

1. An `Embedding()` layer of size `(20000, 128)`. The first parameter passed to the embedding layer should be `20000`, for the 20,000 words in our text vocabulary, and the second should be `128`, the size of the dense vector the embedding layer will learn for each of those words.
2. An `LSTM()` layer (or `GRU()` layer, for the second model) of size `50`. Also set the `return_sequences` parameter to `True`, so that the layer outputs its hidden state at every step of the sequence rather than only at the final step. The pooling layer that follows needs this full sequence of outputs.
3. A `GlobalMaxPool1D()` layer, which takes the maximum of each of the 50 recurrent features across all time steps, collapsing the sequence into a single vector per document.
4. A `Dropout()` layer set to `0.5`.
5. A `Dense()` layer of size `50`, with this layer's `activation` parameter set to `'relu'`.
6. Another `Dropout()` layer set to `0.5`.
7. A `Dense()` layer that will act as our output layer. This layer should contain `20` neurons (one for each possible predicted class), and should have its `activation` parameter set to `'softmax'`.
8. **compile settings**: `loss='categorical_crossentropy'`, `optimizer='adam'`, `metrics=['accuracy']`
9. **train settings**: `X_t` (data), `labels` (labels), `epochs=2`, `batch_size=32`, `validation_split=0.1`
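
To illustrate how steps 2 and 3 fit together, the NumPy sketch below mimics the shapes involved. The random array `hidden_states` is a hypothetical stand-in for the recurrent layer's output, not something the notebook computes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical recurrent output: 2 documents, 100 time steps, 50 units each.
hidden_states = rng.normal(size=(2, 100, 50))

# Without return_sequences=True, only the last time step would be passed on.
last_step = hidden_states[:, -1, :]
# Global max pooling keeps, per unit, the largest value seen at any time step.
pooled = hidden_states.max(axis=1)

print(last_step.shape, pooled.shape)
```

Both results have one 50-dimensional vector per document, but the pooled version can draw on any position in the sequence rather than only the end.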

In [8]:
# LSTM Model
lstm_model = Sequential()
lstm_model.add(Embedding(20000, 128))
lstm_model.add(LSTM(50, return_sequences=True))
lstm_model.add(GlobalMaxPool1D())
lstm_model.add(Dropout(0.5))
lstm_model.add(Dense(50, activation='relu'))
lstm_model.add(Dropout(0.5))
lstm_model.add(Dense(20, activation='softmax'))




In [9]:
#compile
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [10]:
#inspect model
lstm_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 50)          35800     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 50)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                2550      
_________________________________________________________________
dropout_2 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 20)                1020      
Total para
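
The `Param #` column in the summary above can be reproduced by hand. The sketch below is plain arithmetic using the standard parameter-count formulas for embedding, LSTM, and dense layers; the variable names are illustrative.

```python
vocab_size, embed_dim, units = 20000, 128, 50

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# LSTM: 4 weight blocks (input, forget, cell candidate, output), each with
# input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (units * (embed_dim + units) + units)
# Dense: one weight per input/output pair plus one bias per output neuron.
dense_1_params = units * 50 + 50
dense_2_params = 50 * 20 + 20

print(embedding_params, lstm_params, dense_1_params, dense_2_params)
# → 2560000 35800 2550 1020
```

Nearly all of the model's parameters sit in the embedding layer, which is typical for text models with a large vocabulary.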

In [13]:
#train
lstm_model.fit(X_t, labels, epochs=2, batch_size=32, validation_split=0.1)

Train on 10182 samples, validate on 1132 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1b716304a58>

### GRU Model

In [14]:
# GRU Model
gru_model = Sequential()
gru_model.add(Embedding(20000, 128))
gru_model.add(GRU(50, return_sequences=True))
gru_model.add(GlobalMaxPool1D())
gru_model.add(Dropout(0.5))
gru_model.add(Dense(50, activation='relu'))
gru_model.add(Dropout(0.5))
gru_model.add(Dense(20, activation='softmax'))


In [15]:
#compile
gru_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [16]:
#inspect
gru_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
gru_1 (GRU)                  (None, None, 50)          26850     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 50)                0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 50)                2550      
_________________________________________________________________
dropout_4 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 20)                1020      
Total para
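
The only line of this summary that differs from the LSTM one is the recurrent layer. As a rough check, the sketch below assumes the classic GRU formulation with three weight blocks and one bias vector per block, which matches the count printed here. Keras versions that default to `reset_after=True` keep a second bias vector per block and would report a slightly larger number.

```python
embed_dim, units = 128, 50

# Cost of one weight block: input weights, recurrent weights, and one bias vector.
per_block = units * (embed_dim + units) + units

lstm_params = 4 * per_block  # input, forget, cell candidate, output
gru_params = 3 * per_block   # update, reset, candidate

print(gru_params, lstm_params, gru_params / lstm_params)
# → 26850 35800 0.75
```

Under this assumption a GRU layer carries three quarters of the recurrent parameters of an LSTM layer of the same size.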

In [17]:
#train
gru_model.fit(X_t, labels, epochs=2, batch_size=32, validation_split=0.1)

Train on 10182 samples, validate on 1132 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1b71c4581d0>
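
Since `fit` returns a `History` object, one way to compare the two models is to look at the validation accuracy it recorded. The sketch below is a hypothetical helper (`best_val_accuracy`) that works on a plain history dictionary. The numbers fed to it are made-up placeholders for illustration, not results from this notebook.

```python
def best_val_accuracy(history_dict):
    """Return (epoch, accuracy) for the best validation accuracy in a Keras-style history dict."""
    # Older Keras releases log 'val_acc'; newer ones log 'val_accuracy'.
    for key in ("val_accuracy", "val_acc"):
        if key in history_dict:
            values = history_dict[key]
            best = max(range(len(values)), key=lambda i: values[i])
            return best + 1, values[best]
    raise KeyError("no validation accuracy in history")

# Placeholder numbers for illustration only; NOT results from this notebook.
placeholder_histories = {
    "lstm": {"val_acc": [0.10, 0.20]},
    "gru": {"val_acc": [0.15, 0.20]},
}
for name, hist in placeholder_histories.items():
    epoch, acc = best_val_accuracy(hist)
    print(f"{name}: best validation accuracy {acc:.2f} at epoch {epoch}")
```

With the real models, the dictionary would come from the `history` attribute of the object returned by `fit`, for example by assigning the result of `lstm_model.fit(...)` to a variable and reading its `.history`.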

## Summary
In this particular case, GRUs strongly outperformed LSTMs in the first epoch, but the gap had largely closed by the end of epoch 2. This isn't always the case when comparing LSTMs and GRUs for a given task; there are certainly times when LSTMs will outperform GRUs. Overall, however, GRUs seem to have a slight advantage over LSTMs. The interesting thing is that researchers don't yet know _why_ GRUs tend to slightly outperform LSTMs, especially given that GRU cells are a bit simpler than LSTM cells.