# Bag-of-words document classification

How well does the bag-of-words classifier do on the Reuters news data?

* How to read the original Reuters data in: [read_news.ipynb](read_news.ipynb)
* The Reuters data has about 10,000 news articles classified into 66 classes
* We keep only the classes with at least 5 examples, which leaves 51 classes
* How well can we do on a 51-class classification problem with our BoW?
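
The class filtering itself happens in read_news.ipynb and is not shown here. The following is only a minimal sketch of the idea, assuming each example is a dictionary with `text` and `class` keys; the function name `keep_frequent_classes` and the toy data are made up for illustration:

```python
from collections import Counter

def keep_frequent_classes(examples, min_count=5):
    """Keep only examples whose class occurs at least min_count times."""
    counts = Counter(ex["class"] for ex in examples)
    return [ex for ex in examples if counts[ex["class"]] >= min_count]

# Toy data (hypothetical): 5 'earn' examples and 2 'acq' examples
toy = [{"text": "t%d" % i, "class": "earn"} for i in range(5)]
toy += [{"text": "a%d" % i, "class": "acq"} for i in range(2)]

filtered = keep_frequent_classes(toy, min_count=5)
# Only the 'earn' examples survive; 'acq' has fewer than 5
print(len(filtered), sorted({ex["class"] for ex in filtered}))
```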

This is exactly the same code as in the original bag-of-words example; only the file names have changed:

In [1]:
import json
import random
random.seed(0)
with open("data/reuters_51cls.json") as f:
    data=json.load(f)
random.shuffle(data) # Shuffle the examples, to play it safe
print(data[0]) # Every item is a dictionary with `text` and `class` keys; here is one:

{'class': 'earn', 'text': '&#2;\nUNITED COMPANIES &lt;UNCF> DECLARES STOCK DIVIDEND\nBATON ROUGE, La, March 6 - United Companies Financial Corp\nsaid its board declared a two pct stock dividend payable APril\neight to holders of record March 17.\nThe board also declared a regular quarterly cash dividend\nof 12.5 cts payable April one to holders of record March 16.\nReuter\n&#3;'}


In [2]:
# Gather the texts and the labels into two parallel lists
texts=[one_example["text"] for one_example in data]
labels=[one_example["class"] for one_example in data]
print(texts[:2])
print(labels[:2])

['&#2;\nUNITED COMPANIES &lt;UNCF> DECLARES STOCK DIVIDEND\nBATON ROUGE, La, March 6 - United Companies Financial Corp\nsaid its board declared a two pct stock dividend payable APril\neight to holders of record March 17.\nThe board also declared a regular quarterly cash dividend\nof 12.5 cts payable April one to holders of record March 16.\nReuter\n&#3;', '&#2;\nCANBRA FOODS SETS SPECIAL FIVE DLR/SHR PAYOUT\nLETHBRIDGE, Alberta, March 16 - &lt;Canbra Foods Ltd>, earlier\nreporting a 1986 net profit against a year-ago loss, said it\ndeclared a special, one-time dividend of five dlrs per common\nshare, pay March 31, record March 26.\nCanbra said it set the special payout to allow shareholders\nto participate in the gain on the sale of unit Stafford Foods\nLtd in November, 1986, as well as the company\'s "unusually\nprofitable performance" in 1986.\nCanbra earlier reported 1986 net earnings of 4.2 mln dlrs,\nexcluding a 1.3 mln dlr gain on the Stafford sale, compared to\na year-ago loss o

In [3]:
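from sklearn.feature_extraction.text import CountVectorizer

# Binary presence/absence features over unigrams and bigrams,
# limited to the 100,000 most frequent n-grams
vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))
feature_matrix=vectorizer.fit_transform(texts)
print("shape=",feature_matrix.shape)
print(feature_matrix) # sparse matrix, printed as (row, column) value entries
#print(feature_matrix.todense())
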
shape= (9465, 100000)
  (0, 962)	1
  (0, 52588)	1
  (0, 64220)	1
  (0, 13123)	1
  (0, 23677)	1
  (0, 599)	1
  (0, 60531)	1
  (0, 26329)	1
  (0, 20050)	1
  (0, 72153)	1
  (0, 73704)	1
  (0, 24550)	1
  (0, 9813)	1
  (0, 17949)	1
  (0, 88415)	1
  (0, 1048)	1
  (0, 52589)	1
  (0, 73231)	1
  (0, 62260)	1
  (0, 41669)	1
  (0, 92983)	1
  (0, 28800)	1
  (0, 13045)	1
  (0, 67258)	1
  (0, 26333)	1
  :	:
  (9464, 86960)	1
  (9464, 52051)	1
  (9464, 39878)	1
  (9464, 79885)	1
  (9464, 7285)	1
  (9464, 14650)	1
  (9464, 62684)	1
  (9464, 97844)	1
  (9464, 21591)	1
  (9464, 67940)	1
  (9464, 26746)	1
  (9464, 76654)	1
  (9464, 43074)	1
  (9464, 79603)	1
  (9464, 21573)	1
  (9464, 67925)	1
  (9464, 26570)	1
  (9464, 46456)	1
  (9464, 75025)	1
  (9464, 88027)	1
  (9464, 60494)	1
  (9464, 76235)	1
  (9464, 22906)	1
  (9464, 52581)	1
  (9464, 50580)	1
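
The vectorizer above is configured with `binary=True` and `ngram_range=(1,2)`, so every feature records only whether a given word or word pair occurs in the document. Here is a rough plain-Python sketch of that idea. It is a simplification, not what `CountVectorizer` does internally: it splits on whitespace rather than using the library's regular-expression tokenizer, and the function name `binary_ngram_features` is made up for illustration.

```python
def binary_ngram_features(text, ngram_range=(1, 2)):
    """Return the set of word n-grams present in text (presence only, no counts)."""
    tokens = text.lower().split()  # simplification: whitespace tokenization
    shortest, longest = ngram_range
    features = set()
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features

feats = binary_ngram_features("stock dividend declared stock dividend")
# Repeated words and word pairs collapse into a single feature each
print(sorted(feats))
```

In the real feature matrix, each such n-gram corresponds to one column index, and a document's row has a 1 in the columns of the n-grams it contains.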


The feature matrix is now done. Next, we need the class labels to be predicted in one-hot encoding. This means:

* one row for every example
* one column for every possible class label
* exactly one column has 1 for every example, corresponding to the desired class
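
To make the three points above concrete, here is a small plain-Python sketch on toy labels. It only illustrates the encoding and is not how scikit-learn implements it; the function name `one_hot` and the four example labels are made up.

```python
def one_hot(labels):
    """Return (classes, rows): sorted class names and one 0/1 row per label."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)   # one column per possible class
        row[index[label]] = 1.0      # exactly one column is set
        rows.append(row)
    return classes, rows

classes, rows = one_hot(["earn", "acq", "earn", "crude"])
print(classes)
for row in rows:
    print(row)
```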

In [4]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder=LabelEncoder() #Turns class labels into integers
#Turns class integers into one-hot encoding
#(scikit-learn 1.2 and later name this argument sparse_output instead of sparse)
one_hot_encoder=OneHotEncoder(sparse=False)
class_numbers=label_encoder.fit_transform(labels)
print("class_numbers shape=",class_numbers.shape)
print("class_numbers",class_numbers)
print("class labels",label_encoder.classes_)
#And now the one-hot encoding
classes_1hot=one_hot_encoder.fit_transform(class_numbers.reshape(-1,1))
print("classes_1hot",classes_1hot)

class_numbers shape= (9465,)
class_numbers [11 11 11 ... 11  0  0]
class labels ['acq' 'alum' 'bop' 'carcass' 'cocoa' 'coffee' 'copper' 'cotton' 'cpi'
 'crude' 'dlr' 'earn' 'fuel' 'gas' 'gnp' 'gold' 'grain' 'heat' 'housing'
 'income' 'instal-debt' 'interest' 'ipi' 'iron-steel' 'jobs' 'lead' 'lei'
 'livestock' 'lumber' 'meal-feed' 'money-fx' 'money-supply' 'nat-gas'
 'oilseed' 'orange' 'pet-chem' 'potato' 'reserves' 'retail' 'rubber'
 'ship' 'silver' 'strategic-metal' 'sugar' 'tea' 'tin' 'trade' 'veg-oil'
 'wpi' 'yen' 'zinc']
classes_1hot [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]


In [5]:
import h5py
from keras.models import Model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint

def save_model(file_name,model,label_encoder,vectorizer):
    """Saves model structure and vocabularies"""
    model_json = model.to_json()
    with open(file_name+".model.json", "w") as f:
        print(model_json,file=f)
    with open(file_name+".vocabularies.json","w") as f:
        classes=list(label_encoder.classes_)
        vocab=dict(((str(w),int(idx)) for w,idx in vectorizer.vocabulary_.items()))
        json.dump((classes,vocab),f,indent=2)
        
example_count,feature_count=feature_matrix.shape
example_count,class_count=classes_1hot.shape

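# Input: one BoW feature vector per document
inp=Input(shape=(feature_count,))
# One hidden layer of 300 tanh units
hidden=Dense(300,activation="tanh")(inp)
# Output: a softmax distribution over the classes
outp=Dense(class_count,activation="softmax")(hidden)
model=Model(inputs=[inp], outputs=[outp])
model.compile(optimizer="adam",loss="categorical_crossentropy",metrics=['accuracy'])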

# Save model and vocabularies
save_model("models/reuters_51cls_bow",model,label_encoder,vectorizer)
# Callback that writes a checkpoint whenever val_loss improves (save_best_only=True)
save_cb=ModelCheckpoint(filepath="models/reuters_51cls_bow.weights.h5", monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
hist=model.fit(feature_matrix,classes_1hot,batch_size=100,verbose=1,epochs=10,validation_split=0.1,callbacks=[save_cb])


Using TensorFlow backend.


Train on 8518 samples, validate on 947 samples
Epoch 1/10

Epoch 00001: val_loss improved from inf to 0.26932, saving model to models/reuters_51cls_bow.weights.h5
Epoch 2/10

Epoch 00002: val_loss improved from 0.26932 to 0.26482, saving model to models/reuters_51cls_bow.weights.h5
Epoch 3/10

Epoch 00003: val_loss improved from 0.26482 to 0.26075, saving model to models/reuters_51cls_bow.weights.h5
Epoch 4/10

Epoch 00004: val_loss did not improve
Epoch 5/10

Epoch 00005: val_loss did not improve
Epoch 6/10

Epoch 00006: val_loss did not improve
Epoch 7/10

Epoch 00007: val_loss did not improve
Epoch 8/10

Epoch 00008: val_loss did not improve
Epoch 9/10

Epoch 00009: val_loss did not improve
Epoch 10/10

Epoch 00010: val_loss did not improve
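
The log shows a checkpoint being written only in the epochs where `val_loss` improved, so the saved file holds the state from epoch 3. The decision rule behind `save_best_only=True` can be sketched in plain Python as below. This is an illustration of the rule, not Keras's own code; the function name `best_only_epochs` is made up, and only the first three loss values come from the log, the remaining two are hypothetical.

```python
def best_only_epochs(val_losses):
    """Return the 1-based epochs at which a checkpoint would be written."""
    best = float("inf")  # matches "improved from inf" in the first epoch
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # lower validation loss counts as an improvement
            best = loss
            saved.append(epoch)
    return saved

# First three values are from the log above; the last two are hypothetical
val_losses = [0.26932, 0.26482, 0.26075, 0.27, 0.28]
print(best_only_epochs(val_losses))
```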


In [6]:
import numpy
from sklearn.metrics import classification_report, confusion_matrix

#Validation data used during training:
val_instances,val_labels_1hot,_=hist.validation_data

network_output=model.predict(val_instances) #one row of class probabilities per example
print("Network output=",network_output)
predictions=numpy.argmax(network_output,axis=1) #index of the most probable class
print("Maximum class for each example=",predictions)
gold=numpy.nonzero(val_labels_1hot)[1] #undo 1-hot encoding
conf_matrix=confusion_matrix(list(gold),list(predictions)) #rows: gold classes, columns: predicted classes
print(conf_matrix)

# Fixed version (thanks for reporting the bug during the lecture!):
# map the class numbers back to label names before printing the report
gold_labels=label_encoder.inverse_transform(list(gold))
predicted_labels=label_encoder.inverse_transform(list(predictions))
print("Gold labels=",gold_labels)
print("Predicted labels=",predicted_labels)
print(classification_report(gold_labels,predicted_labels))


Network output= [[2.3797024e-03 3.8961964e-04 2.8251315e-04 ... 1.1926511e-04
  2.8264275e-04 1.3207854e-04]
 [5.3375764e-08 7.1641972e-09 8.0094527e-09 ... 9.6114672e-10
  2.4021600e-09 8.2160332e-09]
 [9.7265923e-01 3.7264079e-04 8.6545653e-05 ... 4.2552263e-05
  8.2533930e-05 1.3641357e-04]
 ...
 [2.4583791e-05 1.6724870e-06 3.5406126e-07 ... 1.4980736e-07
  3.8118338e-07 1.1669617e-06]
 [9.9999809e-01 1.6384947e-08 1.0749545e-09 ... 1.7454846e-10
  1.3519129e-09 3.6480141e-09]
 [9.9994731e-01 3.7715068e-07 5.0317361e-08 ... 1.3922719e-08
  8.2615223e-08 1.3961670e-07]]
Maximum class for each example= [30 11  0 16  0  0 11 11 11  9  0  0 11  0 11  9 19  0  0 11  0 11 48 21
 46  9 11 11  0 30 21 11 38 40 46  0 11  0 11 30 37  5 11  0 11 11  0 11
 11 11  0  0 16 11 11 21  0  8 11 46  0 11  0 11 11 11 21 30 31 46 37 11
 11  0 11 46 11 21  0 11  0 33 11 40  0 21 11 22 11 11 11 11 11  0 11 11
 46  0 11  4 11  0 11  0 21 11  0  0 11 11  0 21  0  6  0  0 46 11 46  8
 11 46  0  0 46 11  0 2
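
The decoding steps in the cell above (take the most probable class per row, undo the one-hot encoding of the gold labels, compare the two) can be traced by hand on a toy example. The sketch below uses plain Python instead of numpy and scikit-learn; all values, the three class names, and the helper `argmax` are made up for illustration.

```python
from collections import Counter

def argmax(row):
    """Index of the largest value in row."""
    return max(range(len(row)), key=row.__getitem__)

# Toy values (hypothetical): 3 classes, 3 examples
classes = ["acq", "crude", "earn"]
network_output = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
gold_1hot = [[1, 0, 0], [0, 0, 1], [0, 0, 1]]

predicted = [classes[argmax(row)] for row in network_output]
gold = [classes[argmax(row)] for row in gold_1hot]  # undo 1-hot encoding
confusion = Counter(zip(gold, predicted))  # (gold, predicted) -> count
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(predicted, gold, dict(confusion), accuracy)
```

The `confusion` counter holds the same information as a confusion matrix, just keyed by (gold, predicted) label pairs instead of row and column indices.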

