# fastText-like text classification in Keras on 20newsgroups

fastText:
* Basically word embeddings + averaging layer trained for classification
* https://arxiv.org/abs/1607.01759
* Very fast, but not many options to change the model (no regularization, etc.)
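
To make "word embeddings + averaging layer" concrete, here is a minimal pure-Python sketch of the forward pass. The toy embeddings, weights and function names are hypothetical and only for illustration; a softmax output is assumed here, whereas the Keras model further down uses a sigmoid output.

```python
import math

def average_embeddings(token_ids, embeddings):
    """Average the embedding vectors of the given token ids."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for t in token_ids:
        for j in range(dim):
            total[j] += embeddings[t][j]
    return [v / len(token_ids) for v in total]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict_proba(token_ids, embeddings, weights, biases):
    """Averaged document vector -> linear layer -> softmax."""
    doc = average_embeddings(token_ids, embeddings)
    scores = [sum(w * x for w, x in zip(row, doc)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Toy values (hypothetical): 4-word vocabulary, 2-dim embeddings, 2 classes
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = [[2.0, 0.0], [0.0, 2.0]]
toy_biases = [0.0, 0.0]

probs = predict_proba([1, 1, 3], toy_embeddings, toy_weights, toy_biases)
print(probs)
```

Word order plays no role in the averaged vector, which is one reason this kind of model is so fast.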

Keras:
* Library on top of TensorFlow / Theano
* https://keras.io
* Version used: 1.2.1

20 newsgroups:
* http://scikit-learn.org/stable/datasets/twenty_newsgroups.html

# Loading the Data

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

target_names = newsgroups_train.target_names

train_texts = newsgroups_train.data
train_labels = newsgroups_train.target

test_texts = newsgroups_test.data
test_labels = newsgroups_test.target

print("Training samples: %s" % len(train_texts))
print("Testing samples: %s" % len(test_texts))
print("\n")

print(train_texts[0])
print("Label: %s" % target_names[train_labels[0]])


Training samples: 11314
Testing samples: 7532


From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Label: rec.autos


# Transforming the samples

* We fit a [Tokenizer](https://keras.io/preprocessing/text/#tokenizer) and use it to convert texts to sequences of integers, which are the input to the Model
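
As a rough illustration of what fitting and applying a tokenizer means, the following sketch builds a word index by frequency and maps texts to integer sequences. It is a simplified stand-in, not the Keras implementation: the splitting rule and the alphabetical tie-breaking are assumptions made for this example. Index 0 is left unused, and unseen words are dropped.

```python
import re
from collections import Counter

def to_words(text):
    # Lowercase and split on anything that is not a letter, digit or apostrophe.
    # This is a simplification, not the exact filter set Keras uses.
    return [w for w in re.split(r"[^a-z0-9']+", text.lower()) if w]

def fit_word_index(texts):
    """Assign indices 1..N by descending frequency; 0 stays reserved."""
    counts = Counter(w for t in texts for w in to_words(t))
    # Ties are broken alphabetically here to keep the toy example deterministic
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Map each text to a list of word indices, skipping unknown words."""
    return [[word_index[w] for w in to_words(t) if w in word_index]
            for t in texts]

toy_texts = ["What car is this", "this car is fast"]
toy_index = fit_word_index(toy_texts)
toy_seqs = texts_to_sequences(toy_texts + ["this unseen car"], toy_index)
print(toy_index)
print(toy_seqs)
```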

In [2]:
import keras
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

tk = Tokenizer()
train_texts_str = [x.encode("utf8") for x in train_texts]
tk.fit_on_texts(train_texts_str)

v_size = len(tk.word_index)
print("Vocabulary size: %s" % v_size)

X_train = tk.texts_to_sequences(train_texts_str)

test_texts_str = [x.encode("utf8") for x in test_texts]
X_test = tk.texts_to_sequences(test_texts_str)

print(text_to_word_sequence(train_texts_str[0]))
print(X_train[0])  # integer sequence for the same sample


Using TensorFlow backend.


Vocabulary size: 134142
['from', 'lerxst', 'wam', 'umd', 'edu', "where's", 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac3', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', '15', 'i', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'i', 'saw', 'the', 'other', 'day', 'it', 'was', 'a', '2', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', '60s', 'early', '70s', 'it', 'was', 'called', 'a', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'i', 'know', 'if', 'anyone', 'can', 'tellme', 'a', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'e', 'm

# Transforming the labels

* We transform the labels to binary vectors.
* Keras has a convenience function for this ([keras.utils.np_utils.to_categorical](https://keras.io/utils/np_utils/)), but sklearn's [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) has some useful extra methods, e.g. inverse_transform, which turns binary vectors back into labels in the prediction step
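
The label transformation and its reverse can be sketched without any library. The helper names below are hypothetical, and the reverse step uses an argmax over scores, which is an assumption of this sketch rather than what MultiLabelBinarizer does internally.

```python
def to_one_hot(labels, n_classes):
    """Turn integer class labels into binary indicator rows."""
    rows = []
    for label in labels:
        row = [0] * n_classes
        row[label] = 1
        rows.append(row)
    return rows

def from_scores(score_rows):
    """Pick the index of the highest score in each row (argmax)."""
    return [max(range(len(r)), key=lambda i: r[i]) for r in score_rows]

toy_labels = [7, 0, 19]            # 7 is the position set in the vector printed below
toy_y = to_one_hot(toy_labels, 20)
print(toy_y[0])
print(from_scores(toy_y))
```

With real model outputs, the rows passed to from_scores would be predicted probabilities instead of exact 0/1 vectors.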

In [3]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform([(l, ) for l in train_labels])
y_test = mlb.transform([(l, ) for l in test_labels])

nb_classes = len(mlb.classes_)

print(y_train[0])

[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]


# Padding the input sequences

* For the matrix multiplications, every sample in a batch needs the same length, so we pad shorter sequences with 0
* The Tokenizer reserves index 0, so it is never assigned to a real word
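
A minimal sketch of the padding step, assuming the pad_sequences defaults of padding and truncating at the front of each sequence. The function name pad_pre is hypothetical.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                  # drop tokens from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_padded = pad_pre([[5, 1, 2, 3], [3, 1], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)
```

With front truncation, a document longer than maxlen keeps its ending rather than its beginning.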


In [4]:
from keras.preprocessing import sequence
maxlen = max(len(x) for x in X_train)
print("maxlen: %s" % maxlen) 

maxlen = 1000  # cap sequences at 1000 tokens for faster training

X_train_padded = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test_padded = sequence.pad_sequences(X_test, maxlen=maxlen)

# Because in our Model we just average over the word vectors
# we do not want the zero vectors from the padding in the average,
# so we prepare a special layer that removes those.
# From: https://github.com/fchollet/keras/issues/2728
import keras.backend as K
from keras.engine.topology import Layer

class ZeroMaskedEntries(Layer):
    """
    This layer is called after an Embedding layer.
    It zeros out all of the masked-out embeddings.
    It also swallows the mask without passing it on.
    You can change this to default pass-on behavior as follows:

    def compute_mask(self, x, mask=None):
        if not self.mask_zero:
            return None
        else:
            return K.not_equal(x, 0)
    """

    def __init__(self, **kwargs):
        self.supports_masking = True  # attribute name Keras checks for mask-aware layers
        super(ZeroMaskedEntries, self).__init__(**kwargs)

    def build(self, input_shape):
        self.output_dim = input_shape[1]
        self.repeat_dim = input_shape[2]

    def call(self, x, mask=None):
        mask = K.cast(mask, 'float32')
        mask = K.repeat(mask, self.repeat_dim)
        mask = K.permute_dimensions(mask, (0, 2, 1))
        return x * mask

    def compute_mask(self, input_shape, input_mask=None):
        return None



maxlen: 16333
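
To illustrate what zeroing the masked entries changes, this sketch compares three averages on a toy sequence. It assumes that the averaging layer simply takes the mean over all time steps, so after zeroing, padding no longer contributes to the sum but still counts in the denominator. All names and values are hypothetical.

```python
def zero_masked(ids, vectors):
    """Replace the vector of every padding position (id 0) with zeros."""
    return [[0.0] * len(v) if i == 0 else list(v) for i, v in zip(ids, vectors)]

def mean_over_all(vectors):
    """Plain average over every time step, padding included."""
    n = float(len(vectors))
    return [sum(col) / n for col in zip(*vectors)]

def mean_over_real(ids, vectors):
    """Average over non-padding time steps only."""
    real = [v for i, v in zip(ids, vectors) if i != 0]
    return mean_over_all(real)

toy_ids = [0, 0, 3, 1]
# The embedding row looked up for id 0 is deliberately non-zero here
toy_vectors = [[9.0, 9.0], [9.0, 9.0], [1.0, 1.0], [1.0, 0.0]]

unmasked = mean_over_all(toy_vectors)
masked = mean_over_all(zero_masked(toy_ids, toy_vectors))
true_mean = mean_over_real(toy_ids, toy_vectors)
print(unmasked, masked, true_mean)
```

Under that assumption the zero-masked average is the true average scaled by the fraction of real tokens, so shorter documents end up with smaller vectors.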


# Building the model

In [32]:
# modified version of https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dropout
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import GlobalAveragePooling1D
from keras.regularizers import l2

embedding_dims = 100

model = Sequential()

# input layer
emb = Embedding(v_size + 1, embedding_dims,
                mask_zero=True,  # Mask out padded tokens
                dropout=0.2,
               )
model.add(emb)

# Add special layer because the averaging layer does not support masking
model.add(ZeroMaskedEntries())  

# averaging layer
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.2))

# output layer
model.add(Dense(nb_classes, activation='sigmoid', W_regularizer=l2(0.00001)))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy', 'precision', 'recall', 'fmeasure'])
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_15 (Embedding)         (None, None, 100)     13414300    embedding_input_15[0][0]         
____________________________________________________________________________________________________
zeromaskedentries_15 (ZeroMasked (None, None, 100)     0           embedding_15[0][0]               
____________________________________________________________________________________________________
globalaveragepooling1d_15 (Globa (None, 100)           0           zeromaskedentries_15[0][0]       
____________________________________________________________________________________________________
dropout_11 (Dropout)             (None, 100)           0           globalaveragepooling1d_15[0][0]  
___________________________________________________________________________________________
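
The parameter count of the embedding layer in the summary can be checked by hand: one row per vocabulary index, plus one extra row for the reserved padding index 0, times the embedding size. The helper name below is hypothetical.

```python
def embedding_params(n_rows, dims):
    """An embedding layer stores one `dims`-sized vector per row."""
    return n_rows * dims

v_size = 134142        # vocabulary size printed by the Tokenizer cell
embedding_dims = 100

n_params = embedding_params(v_size + 1, embedding_dims)  # +1 for padding index 0
print(n_params)
```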

# Training the model

* If your data is too big to fit in memory, use [fit_generator](https://keras.io/models/sequential/)
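
A generator for this purpose only has to yield (inputs, targets) batches forever. The sketch below shows the idea with plain lists; the name batch_generator is hypothetical, and in real use each batch would be loaded from disk and converted to arrays before being yielded.

```python
def batch_generator(X, y, batch_size):
    """Yield (X_batch, y_batch) tuples endlessly, restarting after the last sample."""
    n = len(X)
    while True:
        for start in range(0, n, batch_size):
            yield X[start:start + batch_size], y[start:start + batch_size]

toy_X = [[1], [2], [3]]
toy_y = [0, 1, 0]
gen = batch_generator(toy_X, toy_y, batch_size=2)

first = next(gen)
second = next(gen)
third = next(gen)   # wraps around to the start of the data
print(first, second, third)
```

Because the generator never stops, the training loop has to be told how many samples make up one epoch.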

In [33]:
batch_size = 32
nb_epoch = 500

model.fit(X_train_padded, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          verbose=2, # 1 for progress bar (might crash jupyter notebook)
          validation_data=(X_test_padded, y_test))


Train on 11314 samples, validate on 7532 samples
Epoch 1/500
8s - loss: 0.4521 - acc: 0.9480 - precision: 6.5776e-04 - recall: 0.0020 - fmeasure: 7.3241e-04 - val_loss: 0.2932 - val_acc: 0.9500 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00 - val_fmeasure: 0.0000e+00
Epoch 2/500
7s - loss: 0.2732 - acc: 0.9500 - precision: 0.0000e+00 - recall: 0.0000e+00 - fmeasure: 0.0000e+00 - val_loss: 0.2635 - val_acc: 0.9500 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00 - val_fmeasure: 0.0000e+00
Epoch 3/500
7s - loss: 0.2559 - acc: 0.9500 - precision: 0.0000e+00 - recall: 0.0000e+00 - fmeasure: 0.0000e+00 - val_loss: 0.2508 - val_acc: 0.9500 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00 - val_fmeasure: 0.0000e+00
Epoch 4/500
7s - loss: 0.2435 - acc: 0.9500 - precision: 0.0000e+00 - recall: 0.0000e+00 - fmeasure: 0.0000e+00 - val_loss: 0.2393 - val_acc: 0.9500 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00 - val_fmeasure: 0.0000e+00
Epoch 5/500
7s - loss: 0.2319 - acc: 0.

<keras.callbacks.History at 0x7ff60412d7d0>
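
The accuracy of 0.95 in the first epochs, next to a precision and recall of 0, is what we would expect if the model predicts no class at all: with 20 classes and one true label per sample, 19 of the 20 binary outputs are correct. The sketch below reproduces this with hypothetical toy data, assuming accuracy is computed per output element after thresholding.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of individual 0/1 outputs that match after thresholding."""
    correct = 0
    total = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            correct += int((p > threshold) == bool(t))
            total += 1
    return correct / float(total)

n_classes = 20
toy_true = [[1 if j == label else 0 for j in range(n_classes)]
            for label in (7, 0, 19)]
all_zero = [[0.0] * n_classes for _ in toy_true]   # model predicts nothing

acc = binary_accuracy(toy_true, all_zero)
print(acc)
```

For this setup, precision, recall and fmeasure say more about training progress than acc does.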

# Examples for other models

* https://github.com/fchollet/keras/blob/1.2.1/examples/reuters_mlp.py
* http://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
* http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras
* https://github.com/fchollet/keras/blob/1.2.1/examples/imdb_cnn.py
* https://github.com/fchollet/keras/blob/1.2.1/examples/imdb_lstm.py
* https://github.com/fchollet/keras/blob/1.2.1/examples/imdb_bidirectional_lstm.py
* https://github.com/fchollet/keras/blob/1.2.1/examples/imdb_cnn_lstm.py
* https://github.com/explosion/spaCy/blob/v1.6.0/examples/deep_learning_keras.py#L102

# Quick BOW Naive Bayes baseline

In [37]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_recall_fscore_support

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,1))),
    ('clf', OneVsRestClassifier(MultinomialNB(alpha=0.001)))])
classifier.fit(train_texts, y_train)
train_pred = classifier.predict(train_texts)
test_pred = classifier.predict(test_texts)
print(precision_recall_fscore_support(y_train, train_pred, average="micro"))
print(precision_recall_fscore_support(y_test, test_pred, average="micro"))

(0.94602851323828918, 0.98532791232101824, 0.9652783790804399, None)
(0.82634358496427462, 0.70631970260223054, 0.76163206871868294, None)
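
The micro-averaged scores above pool true positives, false positives and false negatives over all classes before computing the ratios. A minimal sketch of that calculation, with a hypothetical function name and toy data:

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for binary indicator matrices."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

toy_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
toy_pred = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]   # one extra label, one sample without any

toy_scores = micro_prf(toy_true, toy_pred)
print(toy_scores)
```

Since the one-vs-rest classifier may assign zero or several labels to a sample, precision and recall can differ even though every sample has exactly one true label.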
