## Using LSTMs to Classify the 20 Newsgroups Data Set
The 20 Newsgroups data set is a well-known text classification benchmark. The goal is to predict which newsgroup a particular post came from. The 20 possible groups are:

`comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
soc.religion.christian`

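Some pairs of groups are closely related (for example, `comp.sys.ibm.pc.hardware` and `comp.sys.mac.hardware`), while others are very different (for example, `rec.sport.hockey` and `sci.crypt`).
The full data set comes with a designated training set of 11,314 posts and a test set of 7,532 posts. The 20 categories are represented in roughly equal proportions, so a classifier that guesses at random, or always predicts the most common class, reaches about 5% accuracy.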
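The 5% figure can be checked from label counts alone. The sketch below uses a synthetic, perfectly balanced label array as a stand-in for `newsgroups_train.target`; the helper name `majority_baseline` is an illustrative assumption, not part of the notebook.

```python
import numpy as np

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = np.bincount(np.asarray(labels))
    return counts.max() / counts.sum()

# Synthetic stand-in: 20 classes with 100 posts each
# (the real classes are only roughly balanced).
toy_labels = np.repeat(np.arange(20), 100)
print(majority_baseline(toy_labels))  # 0.05
```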

The steps are: load the 20 Newsgroups data (or a subset of it), fit a TF-IDF Naive Bayes baseline, load the GloVe vectors, build the word embedding matrix, and build the LSTM model.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import numpy as np

from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

import keras
from keras.layers import LSTM, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


Load the newsgroups data and store the group names in the variable `categories`.

In [3]:
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

categories = list(newsgroups_train.target_names)
print(categories)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


Select a random subset of newsgroups for further analysis and load the data. The selection is not seeded, so each run may pick different groups.

In [4]:
def select_number_of_categories(n, cats):
    """Return n category names drawn at random without replacement."""
    assert 0 < n <= len(cats), 'n must be between 1 and the number of categories'
    idx = np.random.choice(len(cats), int(n), replace=False)
    selected_cats = [cats[i] for i in idx]
    return selected_cats

subcats = select_number_of_categories(4, categories)
print(subcats)

['talk.politics.guns', 'comp.sys.ibm.pc.hardware', 'talk.politics.misc', 'alt.atheism']
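Since `np.random.choice` is called without a seed, the groups printed here will generally differ from run to run. One way to make the subset reproducible, assuming NumPy's `default_rng` generator is available (NumPy 1.17 or later), is sketched below; the function name and the seed value are illustrative.

```python
import numpy as np

def select_categories_seeded(n, cats, seed=0):
    """Pick n distinct categories; the same seed always gives the same subset."""
    if not 0 < n <= len(cats):
        raise ValueError('n must be between 1 and len(cats)')
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(cats), size=n, replace=False)
    return [cats[i] for i in idx]

toy_cats = ['sci.med', 'sci.space', 'rec.autos', 'misc.forsale', 'alt.atheism']
print(select_categories_seeded(3, toy_cats, seed=42))
```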


In [5]:
# Load the train and test splits for the selected categories
def load_categories(cats):
    """
    The keyword 'remove' strips headers, footers, and quoted replies:
    metadata that a classifier could otherwise overfit to.
    """
    newsgroups_train = fetch_20newsgroups(subset='train',
                                          remove=('headers', 'footers', 'quotes'),
                                          categories=cats)
    newsgroups_test = fetch_20newsgroups(subset='test',
                                         remove=('headers', 'footers', 'quotes'),
                                         categories=cats)
    return newsgroups_train, newsgroups_test

newsgroups_train, newsgroups_test = load_categories(subcats)

In [6]:
print('Number of articles in training:', len(newsgroups_train.data))
print('Number of articles in test:', len(newsgroups_test.data))
print('Example of an article:') 
print(newsgroups_train.data[9])

Number of articles in training: 2081
Number of articles in test: 1385
Example of an article:
Ahhh, remember the days of Yesterday?  When we were only 
	going to pay $17 / month?

	When only 1.2% of the population would pay extra taxes?

	Remember when a few of us predicted that it wasn't true?  :)
	Remember the Inaugural?   Dancing and Singing!  Liberation
	at last!  

	Well, figure *this* out:

	5% VAT, estimated to raise $60-100 Billion per year ( on CNN )
	Work it out, chum...

	     $60,000,000,000  /  125,000,000 taxpayers = $480 / year

        But, you exclaim, " I'll get FREE HEALTH CARE! "
	But, I exclaim, " No, you won't! "

	This is only for that poor 37 million who have none.  Not for
	YOU, chum. :)  That comes LATER.

	Add in the estimates of the energy tax costs - $300-500 / year

	Plus, all that extra "corporate and rich" taxes that will 
	trickle down, and what do you have?

	$1,000 / year, just like I said two months ago.

	And, the best part?   You don't GET ANYTHING 

In [7]:
def tfidf(train, test):
    """Fit a TF-IDF + Multinomial Naive Bayes baseline and return test accuracy."""
    vectorizer = TfidfVectorizer()

    # Fit the vocabulary and IDF weights on the training set only
    vectors = vectorizer.fit_transform(train.data)
    vectors_test = vectorizer.transform(test.data)

    clf = MultinomialNB(alpha=.01)
    clf.fit(vectors, train.target)

    pred = clf.predict(vectors_test)
    # metrics.f1_score(test.target, pred, average='macro')
    acc = metrics.accuracy_score(test.target, pred)
    return acc

acc = tfidf(newsgroups_train, newsgroups_test)
print('Accuracy with', len(subcats), 'categories:', acc)

Accuracy with 4 categories: 0.7898916967509025


Next, compare the Naive Bayes results with a recurrent model (a stacked LSTM).

In [8]:
print('Maximum document length:', max([len(doc.split(" ")) for doc in newsgroups_train.data]))

Maximum document length: 10446
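A single maximum can be driven by a handful of very long posts, so it says little about a typical document. Looking at percentiles of the length distribution can help when choosing a sequence length. The sketch below uses a small made-up corpus in place of `newsgroups_train.data`, and the helper name is illustrative.

```python
import numpy as np

def length_percentiles(docs, qs=(50, 90, 99)):
    """Whitespace-token counts at the given percentiles."""
    lengths = np.array([len(doc.split()) for doc in docs])
    return {q: float(np.percentile(lengths, q)) for q in qs}

toy_docs = ['a b c', 'a b', 'a b c d e', 'a', 'a b c d e f g h i j']
print(length_percentiles(toy_docs))
```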


In [9]:
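max_features = 50000  # upper bound on the vocabulary size used by the tokenizer
seq_length = 30       # tokens kept per document (pad_sequences keeps the last ones by default)
batch_size = 32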

In [10]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(newsgroups_train.data)

In [11]:
sequences_train = tokenizer.texts_to_sequences(newsgroups_train.data)
sequences_test = tokenizer.texts_to_sequences(newsgroups_test.data)

In [12]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 27920 unique tokens.
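As a rough picture of what `word_index` contains: the Keras `Tokenizer` assigns integer 1 to the most frequent word, 2 to the next, and so on, keeping 0 free for padding. The pure-Python sketch below imitates that ranking on a toy corpus. It only lowercases and splits on whitespace, whereas the real tokenizer also filters punctuation, so treat it as an approximation with illustrative function names.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank starting at 1 (most frequent first)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]  # keep only the top-ranked words
        sequences.append(ids)
    return sequences

toy_texts = ['the cat sat', 'the cat ran', 'the dog']
toy_index = build_word_index(toy_texts)
print(toy_index['the'], toy_index['cat'])               # 1 2
print(texts_to_sequences(['the cat flew'], toy_index))  # [[1, 2]]
```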


In [13]:
x_train = pad_sequences(sequences_train, maxlen=seq_length)
x_test = pad_sequences(sequences_test, maxlen=seq_length)

In [14]:
x_train

array([[ 2515,    18,    15, ...,    18,   549,  1661],
       [ 1016,   508,    18, ...,  5233,     5,  5234],
       [    8,  7951,   327, ...,  5235,     7,  4750],
       ...,
       [   20,  2885,   315, ..., 27913, 13764,  1373],
       [ 2537,     3,  2619, ...,  3659,    25,  2276],
       [   43,     4, 13757, ...,   307,   706,   813]])
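With its default arguments, `pad_sequences` pads short sequences with zeros on the left and, for sequences longer than `maxlen`, keeps only the last `maxlen` tokens. With `seq_length = 30` and documents up to 10446 words long, most of each long post is therefore discarded. A minimal NumPy imitation of the default behaviour (the name `pad_pre` is illustrative):

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep the last maxlen items, like the Keras defaults."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        tail = seq[-maxlen:]              # truncating='pre' keeps the end of the sequence
        if tail:
            out[row, -len(tail):] = tail  # padding='pre' puts the zeros in front
    return out

print(pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=3))
# [[0 1 2]
#  [5 6 7]]
```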

In [15]:
y_train = keras.utils.to_categorical(np.asarray(newsgroups_train.target))
y_test = keras.utils.to_categorical(np.asarray(newsgroups_test.target))
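`to_categorical` turns each integer label into a one-hot row, which is the target format a softmax output trained with categorical cross-entropy expects. An equivalent NumPy sketch on toy labels (the helper name `one_hot` is illustrative):

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Return a (n_samples, num_classes) array with a single 1 per row."""
    labels = np.asarray(labels)
    if num_classes is None:
        num_classes = labels.max() + 1
    return np.eye(num_classes, dtype='float32')[labels]

print(one_hot([0, 2, 1, 3]))
```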

The GloVe pre-trained word vectors can be downloaded from:

http://nlp.stanford.edu/data/glove.6B.zip (822 MB!)

We will use the file `glove.6B.200d.txt`, which contains 200-dimensional vectors.

In [16]:
embeddings_index = {}
# Each line holds a word followed by the components of its vector
with open('glove/glove.6B.200d.txt', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.
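Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The parsing step can be tried without the large download by feeding a couple of made-up lines through an in-memory file; the values below are invented and 3-dimensional purely for illustration, and `load_glove` is a hypothetical helper.

```python
import io
import numpy as np

def load_glove(lines):
    """Parse an iterable of 'word v1 v2 ...' lines into {word: vector}."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        index[values[0]] = np.asarray(values[1:], dtype='float32')
    return index

fake_file = io.StringIO('dog 0.1 -0.2 0.3\ncat 0.0 0.5 -0.5\n')
toy_vectors = load_glove(fake_file)
print(len(toy_vectors), toy_vectors['dog'].shape)  # 2 (3,)
```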


Example of a word embedding:

In [17]:
dog_vec = embeddings_index['dog']
dog_vec

array([-1.3791e-01, -4.7601e-01, -5.6369e-02, -3.9082e-01, -1.7544e-01,
       -6.2244e-01, -3.9816e-01,  2.9620e-01, -6.0647e-02, -6.7017e-02,
        1.1466e-01, -3.3015e-01, -2.0318e-02,  6.0616e-01, -1.3920e-01,
        1.3896e-01, -5.4781e-01,  3.0864e-01,  1.7354e-01,  3.9927e-01,
        2.1137e-01,  1.3004e+00,  8.8030e-01,  2.3946e-01,  2.8838e-01,
       -4.6336e-01,  2.5745e-01, -3.1755e-01, -3.2877e-01, -5.9534e-01,
        2.3983e-01,  3.4159e-01,  1.2754e-01, -8.8208e-01,  1.4258e-01,
       -1.8857e-01, -1.6961e-01,  2.7808e-01, -2.4600e-01,  1.9122e-01,
        5.0244e-01,  5.3660e-01, -5.3568e-01,  2.4827e-01,  3.2561e-01,
        6.7882e-01,  9.6401e-01, -2.8892e-01,  5.1206e-01,  5.8496e-01,
       -3.1934e-02, -2.4849e-02,  8.8564e-02,  1.7360e-01,  5.4166e-01,
       -8.6743e-02, -3.8412e-01,  1.3974e-01, -7.4122e-03,  9.2210e-01,
       -2.5799e-01, -4.7018e-01, -5.5742e-01, -2.1213e-02, -7.1072e-01,
        8.0995e-02, -4.7254e-01, -3.2925e-01,  6.8052e-01,  1.72
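Individual components of a word vector are hard to interpret; what matters is how vectors relate to each other, commonly measured by cosine similarity. A minimal sketch with invented 3-dimensional vectors (with the real data, entries of `embeddings_index` would be passed instead):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a
c = [0.0, 0.0, 3.0]  # orthogonal to a
print(cosine_similarity(a, b), cosine_similarity(a, c))
```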

Below we create a matrix whose $i$-th row is the GloVe embedding of the word that the tokenizer maps to integer $i$. Row 0 is reserved for padding, and words missing from GloVe keep an all-zero row.
These rows serve as the "weights" of the Embedding layer.
Rather than learning the weights, we load them as-is and "freeze" the layer.

In [18]:
embedding_matrix = np.zeros((len(word_index) + 1, 200))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # Words not found in the embedding index keep their all-zero row.
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [19]:
print('Shape of embedding matrix:', embedding_matrix.shape)

Shape of embedding matrix: (27921, 200)
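Words that do not appear in GloVe keep an all-zero row, so counting non-zero rows gives a quick measure of how much of the vocabulary the pre-trained vectors cover. The sketch below runs on a toy matrix; `embedding_coverage` is an illustrative helper, not something the notebook defines.

```python
import numpy as np

def embedding_coverage(matrix):
    """Fraction of word rows (row 0 is padding) with at least one non-zero entry."""
    word_rows = matrix[1:]
    return float(np.any(word_rows != 0, axis=1).mean())

toy_matrix = np.zeros((5, 3))
toy_matrix[1] = [0.1, 0.2, 0.3]
toy_matrix[3] = [0.0, -0.4, 0.0]
print(embedding_coverage(toy_matrix))  # 0.5
```

With the real data, the same call on `embedding_matrix` would report the share of the 27920 tokens that received a GloVe vector.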


## LSTM Layer
`keras.layers.recurrent.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)`

- Similar in structure to the `SimpleRNN` layer
- `units` defines the dimension of the recurrent state
- `recurrent_...` arguments refer to the recurrent-state aspects of the LSTM
- `kernel_...` arguments refer to the transformations applied to the input
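
To make the roles of the kernel (input) and recurrent (state) weights concrete, here is a single LSTM time step in NumPy. It is a sketch of the standard LSTM equations with a plain logistic sigmoid for the gates, not a reproduction of Keras internals; the weight layout and variable names are assumptions chosen for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, kernel, recurrent_kernel, bias):
    """One LSTM step. kernel: (input_dim, 4*units); recurrent_kernel: (units, 4*units)."""
    z = x @ kernel + h_prev @ recurrent_kernel + bias
    i, f, g, o = np.split(z, 4, axis=-1)               # input, forget, candidate, output
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # new cell state
    h = sigmoid(o) * np.tanh(c)                        # new recurrent (hidden) state
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 200, 30
kernel = rng.normal(scale=0.1, size=(input_dim, 4 * units))
recurrent_kernel = rng.normal(scale=0.1, size=(units, 4 * units))
bias = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x in rng.normal(size=(5, input_dim)):  # five time steps of one sequence
    h, c = lstm_step(x, h, c, kernel, recurrent_kernel, bias)
print(h.shape)  # (30,)
```

The recurrent state `h` has `units` entries regardless of the input dimension, which is why `units` alone fixes the output shape of the layer.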



In [20]:
word_dimension = 200  # Dimension of the GloVe word vectors we are using
lstm_units = 30  # Dimension of each LSTM layer's recurrent state
dropout = 0.1
rec_dropout = 0.1
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                    word_dimension,
                    weights=[embedding_matrix],  # Initialize the weights with the GloVe word vectors
                    input_length=seq_length,
                    trainable=False))  # Setting trainable to False "freezes" the word embeddings.

# The first two LSTM layers return the full sequence so the next LSTM layer can consume it
model.add(LSTM(lstm_units, return_sequences=True, dropout=dropout, recurrent_dropout=rec_dropout))
model.add(LSTM(lstm_units, return_sequences=True, dropout=dropout, recurrent_dropout=rec_dropout))
model.add(LSTM(lstm_units, dropout=dropout, recurrent_dropout=rec_dropout))
#model.add(Dense(20, activation='softmax'))
# One output unit per selected category
model.add(Dense(len(subcats), activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 30, 200)           5584200   
_________________________________________________________________
lstm_1 (LSTM)                (None, 30, 30)            27720     
_________________________________________________________________
lstm_2 (LSTM)                (None, 30, 30)            7320      
_________________________________________________________________
lstm_3 (LSTM)                (None, 30)                7320      
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 124       
Total params: 5,626,684
Trainable params: 42,484
Non-trainable params: 5,584,200
_________________________________________________________________
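
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gate blocks, each with an input kernel, a recurrent kernel, and a bias, giving 4 * (units * (input_dim + units) + units) parameters. A small check using the sizes from this model (the helper names are illustrative):

```python
def lstm_params(input_dim, units):
    """Four gate blocks, each with input kernel, recurrent kernel, and bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

vocab_rows, word_dim, units, n_classes = 27921, 200, 30, 4

print(vocab_rows * word_dim)           # 5584200 (frozen embedding)
print(lstm_params(word_dim, units))    # 27720   (lstm_1)
print(lstm_params(units, units))       # 7320    (lstm_2 and lstm_3)
print(dense_params(units, n_classes))  # 124     (dense_1)
```

Summing the three LSTM layers and the dense layer gives the 42,484 trainable parameters reported above.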


In [21]:
rmsprop = keras.optimizers.RMSprop(lr=0.001)  # newer Keras versions name this argument learning_rate
model.compile(loss='categorical_crossentropy', optimizer=rmsprop, metrics=['accuracy'])
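
The `categorical_crossentropy` loss averages the negative log of the probability the model assigns to the true class. A NumPy sketch of that computation on toy predictions (the clipping constant is an assumption to avoid taking log(0)):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability of the true class (y_true is one-hot)."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return float(-np.mean(np.sum(np.asarray(y_true) * np.log(y_pred), axis=1)))

y_true = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])
uniform = np.full((2, 4), 0.25)
print(categorical_crossentropy(y_true, uniform))  # log(4), about 1.386
```

A model that spreads its probability evenly over four classes therefore starts near a loss of 1.386; values below that indicate it has learned something about the classes.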

In [22]:
model.fit(x_train, y_train, batch_size=batch_size, epochs=20, validation_data=(x_test, y_test))

Train on 2081 samples, validate on 1385 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1b739739710>
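
`model.fit` returns a `History` object whose `history` attribute is a dictionary of per-epoch metric lists. Assigning the return value (for example `history = model.fit(...)`) makes it possible to find the best epoch afterwards. The sketch below works on a hand-written dictionary with invented numbers; the key name `val_acc` is an assumption, since newer Keras versions use `val_accuracy`.

```python
def best_epoch(history_dict, key='val_acc'):
    """Return (1-based epoch, value) where the chosen metric is highest."""
    values = history_dict[key]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]

# Invented numbers for illustration only, not results from this notebook.
fake_history = {'val_acc': [0.41, 0.55, 0.62, 0.60],
                'val_loss': [1.30, 1.10, 0.98, 1.02]}
print(best_epoch(fake_history))  # (3, 0.62)
```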