# Bag-of-words document classification

* Bag-of-words (BoW) is the simplest way to do classification: a feature vector goes in, a decision falls out.

* Feature vector: a vector with as many dimensions as we have unique features, with a non-zero value for every feature present in our example
* Binary features: the value is 1 if the feature is present in the example and 0 otherwise
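
To make the idea concrete, here is a minimal sketch of a binary bag-of-words vector built by hand. The three toy documents and the helper `binary_vector` are made up for illustration and are not part of the IMDB data or of any library.

```python
# Toy illustration of binary bag-of-words vectors (hypothetical data, not IMDB).
docs = ["good movie", "bad movie", "good good plot"]

# Vocabulary: every unique word gets its own dimension (column index).
unique_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: idx for idx, word in enumerate(unique_words)}

def binary_vector(text, vocab):
    """Return a 0/1 list with a 1 for every vocabulary word present in `text`."""
    vec = [0] * len(vocab)
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] = 1  # binary: presence only, repeat counts are ignored
    return vec

print(vocab)
for doc in docs:
    print(doc, "->", binary_vector(doc, vocab))
```

Note that "good good plot" gets the same value for "good" as "good movie" does: with binary features only presence matters, not frequency.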

In the following we work with the IMDB data; have a look at [how to read it in](read_imdb.ipynb). Here we simply load the prepared data.


In [1]:
import json
import random
with open("data/imdb_train.json") as f:
    data=json.load(f)
random.shuffle(data) #play it safe!
print(data[0]) #Every item is a dictionary with `text` and `class` keys, here's the first one:

{'text': "This movie doesn't even deserve a one. This was an utter waste of time. It was a waste of film and money. It was not offensive but everything was provocative and disgusting. My spoiler is one that I think should be read by everyone. There is full frontal nudity and disgusting language. But not only that, there is NO plot line, the actors are terrible, the accents are horrible, the actors are small time and I was even EXCITED to watch this movie!   The only reason I rented it was for Brian van Holt (who got only a fifteen second part, by the way). I think this might have been a mistake on the directors and editors parts but they repeated the same segments two or three times, adding only a new sentence.  A film similar to this is Eraser Head, possibly the most disturbing movie in existence. There is no plot line, and is not funny. Although it isn't trying to be funny. DO NOT WATCH EITHER MOVIE.", 'class': 'neg'}


To learn on this data, we will need a few steps:

* Build a data matrix with dimensionality (number of examples, number of possible features), and a value for each feature, 0/1 for binary features
* Build a class label matrix (number of examples, number of classes) with the correct labels for the examples, setting 1 for the correct class, and 0 for others

There is little point in doing all this by hand, so we will mostly use ready-made classes and functions from scikit-learn.

In [2]:
# We need to gather the texts and the labels into two parallel lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print(texts[:2])
print(labels[:2])

["This movie doesn't even deserve a one. This was an utter waste of time. It was a waste of film and money. It was not offensive but everything was provocative and disgusting. My spoiler is one that I think should be read by everyone. There is full frontal nudity and disgusting language. But not only that, there is NO plot line, the actors are terrible, the accents are horrible, the actors are small time and I was even EXCITED to watch this movie!   The only reason I rented it was for Brian van Holt (who got only a fifteen second part, by the way). I think this might have been a mistake on the directors and editors parts but they repeated the same segments two or three times, adding only a new sentence.  A film similar to this is Eraser Head, possibly the most disturbing movie in existence. There is no plot line, and is not funny. Although it isn't trying to be funny. DO NOT WATCH EITHER MOVIE.", "I don't have words to describe how good this movie is. Only a genius like Amrita Pritam c

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))
feature_matrix=vectorizer.fit_transform(texts)
print("shape=",feature_matrix.shape)


shape= (25000, 100000)
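
The `ngram_range=(1,2)` setting means that both single words and pairs of adjacent words become features. The sketch below illustrates this with a hypothetical helper `ngrams`; it assumes plain whitespace tokenization, which is simpler than what `CountVectorizer` does by default (for example, the default tokenizer ignores one-character tokens such as "a").

```python
def ngrams(text, n_min=1, n_max=2):
    """Lowercase, split on whitespace, and return all n-grams with n_min <= n <= n_max."""
    tokens = text.lower().split()
    features = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the text
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

print(ngrams("Not a good movie"))
```

Bigrams such as "not good" let the classifier pick up on some local word order that single words alone would lose.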


Now the feature matrix is done! The next thing we need is the class labels to be predicted, in one-hot encoding. This means:

* one row for every example
* one column for every possible class label
* exactly one column has 1 for every example, corresponding to the desired class
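
As a plain-Python sketch of what the encoding amounts to (the four toy labels and the helper `one_hot` are illustrative assumptions, not library code):

```python
# Toy labels, standing in for the real list of 25000 labels.
labels = ["neg", "pos", "pos", "neg"]

# Step 1: map each class label to an integer index (sorted order of the labels).
classes = sorted(set(labels))
class_to_index = {label: idx for idx, label in enumerate(classes)}

def one_hot(label):
    """Return a row with 1.0 in the column of `label` and 0.0 elsewhere."""
    row = [0.0] * len(classes)
    row[class_to_index[label]] = 1.0
    return row

# Step 2: one row per example, one column per class.
matrix = [one_hot(label) for label in labels]
for label, row in zip(labels, matrix):
    print(label, row)
```

The two steps mirror the division of labour between a label-to-integer mapping and an integer-to-one-hot mapping.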

In [4]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder=LabelEncoder() #Turns class labels into integers
one_hot_encoder=OneHotEncoder(sparse=False) #Turns class integers into one-hot encoding (in scikit-learn 1.2 and later this argument is named sparse_output)
class_numbers=label_encoder.fit_transform(labels)
print("class_numbers shape=",class_numbers.shape)
print("class labels",label_encoder.classes_) #this will let us translate back from indices to labels
#And now the one-hot encoding
classes_1hot=one_hot_encoder.fit_transform(class_numbers.reshape(-1,1)) #running without reshape tells you to reshape
print("classes_1hot",classes_1hot)

class_numbers shape= (25000,)
class labels ['neg' 'pos']
classes_1hot [[1. 0.]
 [0. 1.]
 [0. 1.]
 ...
 [1. 0.]
 [1. 0.]
 [0. 1.]]


* The data is ready, so now we need to build the network:
  - Input layer with one node per feature
  - Hidden Dense layer with some kind of non-linearity and a suitable number of nodes
  - Output Dense layer with the softmax activation (normalizes the output into a probability distribution) and as many nodes as there are classes
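
To see what "normalizes output to distribution" means, here is a small standalone sketch of softmax in plain Python. The input scores are made up; this is an illustration of the formula, not the Keras implementation.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into positive values that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5])  # hypothetical raw outputs for (neg, pos)
print(probs, "sum =", sum(probs))
```

The larger score always gets the larger probability, and the outputs can be read as the network's confidence in each class.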

In [5]:
from keras.models import Model
from keras.layers import Input, Dense

example_count,feature_count=feature_matrix.shape
example_count2,class_count=classes_1hot.shape
assert example_count==example_count2 #sanity check

inp=Input(shape=(feature_count,))
hidden=Dense(200,activation="tanh")(inp)
outp=Dense(class_count,activation="softmax")(hidden)
model=Model(inputs=[inp], outputs=[outp])

Using TensorFlow backend.


...it's **this** simple...!

Once the model is constructed, it needs to be compiled. For that we need to decide:
* which optimizer to use (sgd is fine to begin with)
* what the loss is (categorical_crossentropy is the right choice for single-label multiclass problems like ours)
* which metrics to measure (accuracy is an okay choice)
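
For intuition about the loss, the following sketch computes categorical cross-entropy for a single example in plain Python. The predicted distributions are invented, and the small `eps` clamp is an assumption added here to avoid `log(0)`; it is not meant to reproduce the exact Keras internals.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

target = [1.0, 0.0]  # one-hot: the correct class is the first one
confident = categorical_crossentropy(target, [0.9, 0.1])
unsure = categorical_crossentropy(target, [0.5, 0.5])
wrong = categorical_crossentropy(target, [0.1, 0.9])
print(confident, unsure, wrong)
```

Because the target is one-hot, only the probability assigned to the correct class contributes: the loss is small when that probability is close to 1 and grows quickly as it approaches 0.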

In [6]:
model.compile(optimizer="sgd",loss="categorical_crossentropy",metrics=['accuracy'])

A compiled model can be fitted on data:

In [7]:
hist=model.fit(feature_matrix,classes_1hot,batch_size=100,verbose=1,epochs=10,validation_split=0.1)

Train on 22500 samples, validate on 2500 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [8]:
print(hist.history["val_acc"]) #in newer Keras versions the key is "val_accuracy"

[0.8403999996185303, 0.8616000008583069, 0.8711999988555909, 0.8764000010490417, 0.8812000012397766, 0.883199999332428, 0.8884000039100647, 0.8868000030517578, 0.8868000030517578, 0.8876000022888184]


* We trained for 10 epochs
* We reached about 88.8% accuracy on the validation data and 94.9% accuracy on the training data
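
The validation accuracies can also be inspected programmatically. The sketch below uses the values printed above, rounded to four decimals, and finds the epoch with the best validation accuracy.

```python
# Validation accuracies from the run above, rounded to four decimals.
val_acc = [0.8404, 0.8616, 0.8712, 0.8764, 0.8812,
           0.8832, 0.8884, 0.8868, 0.8868, 0.8876]

# Epochs are numbered from 1, list indices from 0.
best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
best_epoch = best_index + 1

print("best epoch:", best_epoch, "accuracy:", val_acc[best_index])
print("final epoch accuracy:", val_acc[-1])
```

The best value is not at the last epoch, which is one reason to save the weights during training rather than only at the end.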

* But we do not have the model saved, so let's fix that and get the whole thing done
* What constitutes a model (i.e. what we need in order to run the model on new data):
  - The feature dictionary in the vectorizer
  - The list of classes in their correct order
  - The structure of the network
  - The weights the network learned

* Let's do all these things and run again. This time we also increase the number of epochs to 30 to see what happens.
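
Saving is only useful if the pieces can be read back. As a sketch, the snippet below round-trips a class list and a feature dictionary through JSON, using the same `(classes, vocab)` layout that `save_model` in this notebook writes. The function `load_vocabularies`, the file name, and the toy values are hypothetical.

```python
import json
import os
import tempfile

def load_vocabularies(path):
    """Hypothetical counterpart to save_model: read (classes, vocab) back from JSON."""
    with open(path) as f:
        classes, vocab = json.load(f)  # stored as a two-element JSON array
    return classes, vocab

# Toy stand-ins for label_encoder.classes_ and vectorizer.vocabulary_
classes = ["neg", "pos"]
vocab = {"good": 0, "good movie": 1, "movie": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.vocabularies.json")
    with open(path, "w") as f:
        json.dump((classes, vocab), f, indent=2)
    loaded_classes, loaded_vocab = load_vocabularies(path)

print(loaded_classes)
print(loaded_vocab)
```

One plausible way to use the loaded dictionary on new data is to pass it to a fresh `CountVectorizer` through its `vocabulary` argument, with the same `binary` and `ngram_range` settings as in training, so that feature indices match the trained weights.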

In [9]:
import h5py
from keras.models import Model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint

def save_model(file_name,model,label_encoder,vectorizer):
    """Saves model structure and vocabularies"""
    model_json = model.to_json()
    with open(file_name+".model.json", "w") as f:
        print(model_json,file=f)
    with open(file_name+".vocabularies.json","w") as f:
        classes=list(label_encoder.classes_)
        vocab=dict(((str(w),int(idx)) for w,idx in vectorizer.vocabulary_.items())) #must turn numpy objects to python ones
        json.dump((classes,vocab),f,indent=2)
        
example_count,feature_count=feature_matrix.shape
example_count2,class_count=classes_1hot.shape
assert example_count==example_count2 #sanity check

inp=Input(shape=(feature_count,))
hidden=Dense(200,activation="tanh")(inp)
outp=Dense(class_count,activation="softmax")(hidden)
model=Model(inputs=[inp], outputs=[outp])
model.compile(optimizer="sgd",loss="categorical_crossentropy",metrics=['accuracy'])

# Save model and vocabularies, can be done before training
save_model("models/imdb_bow",model,label_encoder,vectorizer)
# Callback function to save weights during training, if validation loss goes down
save_cb=ModelCheckpoint(filepath="models/imdb_bow.weights.h5", monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

hist=model.fit(feature_matrix,classes_1hot,batch_size=100,verbose=1,epochs=30,validation_split=0.1,callbacks=[save_cb])


Train on 22500 samples, validate on 2500 samples
Epoch 1/30

Epoch 00001: val_loss improved from inf to 0.45507, saving model to models/imdb_bow.weights.h5
Epoch 2/30

Epoch 00002: val_loss improved from 0.45507 to 0.38103, saving model to models/imdb_bow.weights.h5
Epoch 3/30

Epoch 00003: val_loss improved from 0.38103 to 0.34400, saving model to models/imdb_bow.weights.h5
Epoch 4/30

Epoch 00004: val_loss improved from 0.34400 to 0.32214, saving model to models/imdb_bow.weights.h5
Epoch 5/30

Epoch 00005: val_loss improved from 0.32214 to 0.30702, saving model to models/imdb_bow.weights.h5
Epoch 6/30

Epoch 00006: val_loss improved from 0.30702 to 0.29654, saving model to models/imdb_bow.weights.h5
Epoch 7/30

Epoch 00007: val_loss improved from 0.29654 to 0.28953, saving model to models/imdb_bow.weights.h5
Epoch 8/30

Epoch 00008: val_loss improved from 0.28953 to 0.28661, saving model to models/imdb_bow.weights.h5
Epoch 9/30

Epoch 00009: val_loss improved from 0.28661 to 0.27901,
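
The "improved from ... to ..." messages reflect a simple rule: keep the weights only when the monitored value gets better. The sketch below replays that rule in plain Python on the nine validation losses shown above. The tenth value is made up to show a non-improving epoch, and the code is an illustration of the idea, not the Keras callback itself.

```python
# Validation losses from the log above; the last value is hypothetical.
val_losses = [0.45507, 0.38103, 0.34400, 0.32214, 0.30702,
              0.29654, 0.28953, 0.28661, 0.27901,
              0.28100]  # hypothetical epoch 10, worse than epoch 9

best = float("inf")   # same starting point as "improved from inf"
saved_epochs = []     # epochs at which weights would be written

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best:
        best = loss
        saved_epochs.append(epoch)  # stand-in for saving the weights file
    # otherwise: keep the previously saved weights untouched

print("saved at epochs:", saved_epochs)
print("best val_loss:", best)
```

With this rule, the weights file on disk always corresponds to the epoch with the lowest validation loss seen so far, even if later epochs start to overfit.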

# Summary

* We put together a program to train a neural network classifier for sentiment detection
* We learned the necessary code/techniques to save models, and feed the training with data in just the right format
* We observed the training across epochs
* We saw how the classifier can be applied to various text classification problems
* The IMDB sentiment classifier ended up at nearly 90% accuracy, while the state of the art is about 95%, so we got surprisingly far with a few lines of code
