# Bag-of-words document classification

Bag-of-words (BoW) is one of the simplest approaches to document classification: a feature vector goes in, a decision comes out.

Feature vector: a vector with one dimension per unique feature, where every feature present in a given example has a non-zero value and all other dimensions are zero.
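As a minimal sketch of this idea, we can build a binary feature vector by hand. The five-word vocabulary and the helper `to_feature_vector` are made up for illustration and are not used elsewhere in this notebook; real tokenization is more careful than a plain whitespace split.

```python
# Toy vocabulary (hypothetical): one dimension per unique feature
vocabulary = ["awesome", "boring", "movie", "plot", "stunts"]
feature_index = {word: i for i, word in enumerate(vocabulary)}

def to_feature_vector(text):
    """Return a binary bag-of-words vector for `text`."""
    vector = [0] * len(vocabulary)
    # Naive tokenization: lowercase and split on whitespace
    for word in text.lower().split():
        if word in feature_index:
            vector[feature_index[word]] = 1  # mark the feature as present
    return vector

example_vector = to_feature_vector("Awesome stunts awesome movie")
print(example_vector)  # [1, 0, 1, 0, 1]
```

Word order and repetition are lost: the vector only records which features occur in the example.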


In [19]:
import json
import random
with open("data/imdb_train.json") as f:
    data=json.load(f)
random.shuffle(data) # Play it safe, in case the examples are stored in some fixed order
print(data[0]) # Every item is a dictionary with `text` and `class` keys, here's one:

{'text': 'This movie has a lot of comedy, not dark and Gordon Liu shines in this one. He displays his comical side and it was really weird seeing him get beat up. His training is \\unorthodox\\" and who would\'ve thought knot tying could be so deadly?? Lots of great stunts and choreography. Very creative!  Add Johnny Wang in the mix and you\'ve got an awesome final showdown! Don\'t mess with Manchu thugs; they\'re ruthless!"', 'class': 'pos'}


To learn from this data, we need a few steps:

* Build a data matrix of shape (number of examples, number of possible features)
* Build a vector of shape (number of examples,) holding the correct label for each example

There is little point in implementing all of this ourselves, so we will mostly use ready-made classes and functions from scikit-learn.
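To get a feel for the two steps, here is a small sketch on three made-up documents, assuming scikit-learn is installed. The `toy_*` names and the example texts are hypothetical; `ngram_range=(1,2)` means that both single words and pairs of adjacent words become features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Made-up toy data, only for illustration
toy_texts = ["good movie", "bad movie", "good good acting"]
toy_labels = ["pos", "neg", "pos"]

# Step 1: data matrix of shape (number of examples, number of features)
toy_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_texts)
print("shape=", toy_matrix.shape)
# vocabulary_ maps every feature (word or word pair) to its column index
print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.toarray())

# Step 2: label vector of shape (number of examples,)
toy_class_numbers = LabelEncoder().fit_transform(toy_labels)
print(toy_class_numbers, toy_class_numbers.shape)
```

With `binary=True`, the repeated "good" in the third document still produces a 1 rather than a count of 2.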

In [20]:
# Gather the texts and the labels into two parallel lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print(texts[:2])
print(labels[:2])

['This movie has a lot of comedy, not dark and Gordon Liu shines in this one. He displays his comical side and it was really weird seeing him get beat up. His training is \\unorthodox\\" and who would\'ve thought knot tying could be so deadly?? Lots of great stunts and choreography. Very creative!  Add Johnny Wang in the mix and you\'ve got an awesome final showdown! Don\'t mess with Manchu thugs; they\'re ruthless!"', "This mini series, also based on a book by Alex Haley as was `Queen', tried to use similar formulas, that is, constructing a long history following the lives of a family over many years. Whereas in `Queen' the result was masterful, here in Mama Flora the inspiration was lacking. Firstly perhaps in the book itself, and most certainly in this TV production. Too much is put in with too much haste over the years, such that the unfolding saga is shallow, superficial, not nearly so authentic as in `Queen'. Full marks for the scenification in the earlier parts of the film, whic

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


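# Keep at most 100000 features, record only presence/absence (binary=True),
# and use both single words and pairs of adjacent words (ngram_range=(1,2))
vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))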
feature_matrix=vectorizer.fit_transform(texts)
print("shape=",feature_matrix.shape)
#print(feature_matrix.todense())




shape= (25000, 100000)


The feature matrix is now done. Next, we need the class labels to be predicted, in one-hot encoding. This means:

* one row for every example
* one column for every possible class label
* for every example, exactly one column holds 1, namely the one corresponding to the correct class
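The three properties above can be checked on a tiny hand-made example. This is only a sketch using NumPy indexing into an identity matrix, not the scikit-learn encoder used in this notebook, and the `toy_*` names are made up.

```python
import numpy as np

# Hypothetical class integers for four examples, e.g. 0 = neg, 1 = pos
toy_class_numbers = np.array([1, 0, 1, 0])
toy_class_count = int(toy_class_numbers.max()) + 1

# Row i of the identity matrix is the one-hot vector of class i,
# so indexing with the class integers picks one row per example
toy_1hot = np.eye(toy_class_count)[toy_class_numbers]
print(toy_1hot)
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]
#  [1. 0.]]
```

Taking `argmax` along each row recovers the original class integers.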

In [22]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder=LabelEncoder() # Turns class labels into integers
# Turns class integers into one-hot encoding
# Note: scikit-learn 1.2 and later name this parameter `sparse_output` instead of `sparse`
one_hot_encoder=OneHotEncoder(sparse=False)
class_numbers=label_encoder.fit_transform(labels)
print("class_numbers shape=",class_numbers.shape)
print("class_numbers",class_numbers)
print("class labels",label_encoder.classes_)
# And now the one-hot encoding
classes_1hot=one_hot_encoder.fit_transform(class_numbers.reshape(-1,1))
print("classes_1hot",classes_1hot)

class_numbers shape= (25000,)
class_numbers [1 0 1 ... 0 0 0]
class labels ['neg' 'pos']
classes_1hot [[0. 1.]
 [1. 0.]
 [0. 1.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]


In [None]:
from keras.models import Model
from keras.layers import Input, Dense

example_count,feature_count=feature_matrix.shape
example_count,class_count=classes_1hot.shape

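inp=Input(shape=(feature_count,)) # One input per feature
hidden=Dense(200,activation="tanh")(inp) # Hidden layer with 200 units
outp=Dense(class_count,activation="softmax")(hidden) # One output per class, softmax turns them into a probability distribution
model=Model(inputs=[inp], outputs=[outp])
model.compile(optimizer="sgd",loss="categorical_crossentropy",metrics=['accuracy'])
# Train for 10 epochs in batches of 100; validation_split=0.2 holds out the last 20% of the examples for validation
hist=model.fit(feature_matrix,classes_1hot,batch_size=100,verbose=1,epochs=10,validation_split=0.2)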


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10

In [None]:
# Validation accuracy after each epoch (newer Keras versions name this key "val_accuracy")
print(hist.history["val_acc"])