# Bag-of-words document classification

How well does the bag-of-words approach work on the Reuters news data, which has 51 classes?

In [1]:
import json
import random
with open("data/reuters_51cls.json") as f:
    data=json.load(f)
random.shuffle(data) # Shuffle in case the file is ordered, to be on the safe side
print(data[0]) # Every item is a dictionary with `text` and `class` keys; here is one:

{'text': "&#2;\nANALYSTS SEE NO OTHER BIDDER FOR PUROLATOR&lt;PCC>\nNew York, March 2 - Several analysts said they do not\nbelieve another suitor will top the 265 mln dlr bid for\nPurolator Courier Corp by E.F. Hutton LBO Inc and a management\ngroup from Purolator's courier division.\nThere had been speculation another offer might be\nforthcoming, but analysts mostly believe the 35 dlrs per share\nprice being paid by Hutton and the managers' PC Acquisition Inc\nis fully valued.\nAnalysts and some Wall Street sources said they doubted\nanother bidder would emerge since Purolator had been for sale\nfor sometime before a deal was struck with Hutton Friday.\nPurolator's stock slipped 3/8 today to close at 34-3/4. It\nhad been trading slightly higher than the 35 dlr offer on\nFriday. At least one analyst Friday speculated the company\nmight fetch 38 to 42 dlrs per share.\nanalysts and wall street sources doubted a competitive\noffer would emerge since the company has been for sale for\nsome

In [2]:
# Gather the texts and the class labels into two parallel lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print(texts[:2])
print(labels[:2])

["&#2;\nANALYSTS SEE NO OTHER BIDDER FOR PUROLATOR&lt;PCC>\nNew York, March 2 - Several analysts said they do not\nbelieve another suitor will top the 265 mln dlr bid for\nPurolator Courier Corp by E.F. Hutton LBO Inc and a management\ngroup from Purolator's courier division.\nThere had been speculation another offer might be\nforthcoming, but analysts mostly believe the 35 dlrs per share\nprice being paid by Hutton and the managers' PC Acquisition Inc\nis fully valued.\nAnalysts and some Wall Street sources said they doubted\nanother bidder would emerge since Purolator had been for sale\nfor sometime before a deal was struck with Hutton Friday.\nPurolator's stock slipped 3/8 today to close at 34-3/4. It\nhad been trading slightly higher than the 35 dlr offer on\nFriday. At least one analyst Friday speculated the company\nmight fetch 38 to 42 dlrs per share.\nanalysts and wall street sources doubted a competitive\noffer would emerge since the company has been for sale for\nsometime bef

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))
feature_matrix=vectorizer.fit_transform(texts)
print("shape=",feature_matrix.shape)
#print(feature_matrix.todense())




shape= (9465, 100000)
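
To make the shape above less abstract, here is a minimal pure-Python sketch of what a binary bag-of-words matrix over unigrams and bigrams looks like. It is only an approximation of `CountVectorizer(binary=True, ngram_range=(1,2))`: the helper `binary_ngram_features` and the two toy documents are made up for illustration, the tokenizer is assumed to be plain whitespace splitting (the real one uses its own token pattern), and there is no `max_features` cut-off.

```python
def binary_ngram_features(text, ngram_range=(1, 2)):
    """Return the set of word n-grams present in `text` (presence only, no counts).

    Hypothetical helper: whitespace tokenization is an assumption made here,
    it is not how CountVectorizer tokenizes.
    """
    tokens = text.lower().split()
    shortest, longest = ngram_range
    features = set()
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features

# Two toy documents (not taken from the Reuters data)
docs = ["oil prices rise", "oil prices fall again"]
doc_features = [binary_ngram_features(d) for d in docs]

# Vocabulary: one column index per distinct n-gram, in sorted order
vocabulary = {ngram: idx for idx, ngram in enumerate(sorted(set().union(*doc_features)))}

# Binary feature matrix: one row per document, 1 if the n-gram occurs in it, else 0
matrix = [[1 if ngram in feats else 0 for ngram in vocabulary] for feats in doc_features]

for ngram, idx in vocabulary.items():
    print(idx, ngram)
for row in matrix:
    print(row)
```

The real feature matrix has the same layout, only with 9465 rows and the 100000 columns allowed by `max_features`, and it is stored as a sparse matrix because nearly all entries are zero.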


The feature matrix is now done. Next, we need the class labels to be predicted, in one-hot encoding. This means:

* one row for every example
* one column for every possible class label
* in each row, exactly one column is 1: the one corresponding to the example's class
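
The three properties above can be checked on a toy example. This is a minimal pure-Python sketch with four made-up labels, not the scikit-learn code used in the next cell; the only thing it borrows from `LabelEncoder` is that classes are numbered in sorted order.

```python
# Four toy labels (class names borrowed from the Reuters label set)
labels = ["earn", "acq", "earn", "crude"]

# Step 1: map each class label to an integer, numbering classes in sorted order
classes = sorted(set(labels))
class_to_index = {label: idx for idx, label in enumerate(classes)}
class_numbers = [class_to_index[label] for label in labels]

# Step 2: one row per example, one column per class, a single 1.0 in each row
one_hot = [
    [1.0 if column == number else 0.0 for column in range(len(classes))]
    for number in class_numbers
]

print(classes)
print(class_numbers)
for row in one_hot:
    print(row)
```

Each row sums to 1, and the position of the 1 is the class number, so taking the index of the largest value in a row recovers the class.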

In [4]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder=LabelEncoder() #Turns class labels into integers
one_hot_encoder=OneHotEncoder(sparse=False) #Turns class integers into one-hot encoding (scikit-learn >= 1.2 names this parameter sparse_output)
class_numbers=label_encoder.fit_transform(labels)
print("class_numbers shape=",class_numbers.shape)
print("class_numbers",class_numbers)
print("class labels",label_encoder.classes_)
# And now the one-hot encoding
classes_1hot=one_hot_encoder.fit_transform(class_numbers.reshape(-1,1))
print("classes_1hot",classes_1hot)

class_numbers shape= (9465,)
class_numbers [ 0  0  0 ... 11 11 46]
class labels ['acq' 'alum' 'bop' 'carcass' 'cocoa' 'coffee' 'copper' 'cotton' 'cpi'
 'crude' 'dlr' 'earn' 'fuel' 'gas' 'gnp' 'gold' 'grain' 'heat' 'housing'
 'income' 'instal-debt' 'interest' 'ipi' 'iron-steel' 'jobs' 'lead' 'lei'
 'livestock' 'lumber' 'meal-feed' 'money-fx' 'money-supply' 'nat-gas'
 'oilseed' 'orange' 'pet-chem' 'potato' 'reserves' 'retail' 'rubber'
 'ship' 'silver' 'strategic-metal' 'sugar' 'tea' 'tin' 'trade' 'veg-oil'
 'wpi' 'yen' 'zinc']
classes_1hot [[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [5]:
import h5py
from keras.models import Model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint

def save_model(file_name,model,label_encoder,vectorizer):
    """Saves model structure and vocabularies"""
    model_json = model.to_json()
    with open(file_name+".model.json", "w") as f:
        print(model_json,file=f)
    with open(file_name+".vocabularies.json","w") as f:
        classes=list(label_encoder.classes_)
        vocab=dict(((str(w),int(idx)) for w,idx in vectorizer.vocabulary_.items()))
        json.dump((classes,vocab),f,indent=2)
        
example_count,feature_count=feature_matrix.shape
example_count,class_count=classes_1hot.shape

inp=Input(shape=(feature_count,))
hidden=Dense(200,activation="tanh")(inp)
outp=Dense(class_count,activation="softmax")(hidden)
model=Model(inputs=[inp], outputs=[outp])
model.compile(optimizer="sgd",loss="categorical_crossentropy",metrics=['accuracy'])

# Save model and vocabularies
save_model("models/reuters_51cls_bow",model,label_encoder,vectorizer)
# Callback function to save weights during training
save_cb=ModelCheckpoint(filepath="models/reuters_51cls_bow.weights.h5", monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
hist=model.fit(feature_matrix,classes_1hot,batch_size=100,verbose=1,epochs=30,validation_split=0.1,callbacks=[save_cb])


Using TensorFlow backend.


Train on 8518 samples, validate on 947 samples
Epoch 1/30

Epoch 00001: val_loss improved from inf to 1.80867, saving model to models/reuters_51cls_bow.weights.h5
Epoch 2/30

Epoch 00002: val_loss improved from 1.80867 to 1.51141, saving model to models/reuters_51cls_bow.weights.h5
Epoch 3/30

Epoch 00003: val_loss improved from 1.51141 to 1.37008, saving model to models/reuters_51cls_bow.weights.h5
Epoch 4/30

Epoch 00004: val_loss improved from 1.37008 to 1.27653, saving model to models/reuters_51cls_bow.weights.h5
Epoch 5/30

Epoch 00005: val_loss improved from 1.27653 to 1.20564, saving model to models/reuters_51cls_bow.weights.h5
Epoch 6/30

Epoch 00006: val_loss improved from 1.20564 to 1.14851, saving model to models/reuters_51cls_bow.weights.h5
Epoch 7/30

Epoch 00007: val_loss improved from 1.14851 to 1.10025, saving model to models/reuters_51cls_bow.weights.h5
Epoch 8/30

Epoch 00008: val_loss improved from 1.10025 to 1.05764, saving model to models/reuters_51cls_bow.weights.