## Machine Learning with Keras: Neural Network Classification of Categories, Text Mining, and Parts of Speech

Joseph Rochelle

This notebook covers only Part 2 of Exercise 9.4; Parts 1 and 3 are in separate files. I exported the cleaned text dataframe from my Part 1 notebook to a CSV file for use here, so the time-consuming cleaning code does not have to be re-run.


In [1]:
# import libraries
import numpy as np
import pandas as pd
import random


As described above, the code below imports the cleaned text dataframe, which includes a parts-of-speech feature. The rest of this notebook uses this dataframe.

In [2]:
# read a random sample because the full file is too large (606,475 lines)
# https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame

p = 0.0016  # keep about 0.16% of the lines
# a row is skipped when a random draw from [0, 1) is greater than p;
# the header row (i == 0) is always kept
df = pd.read_csv('categComments.csv',
                 skiprows=lambda i: i > 0 and random.random() > p
)
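With `p = 0.0016` and 606,475 lines, the expected sample is about 970 rows, but `random.random()` is unseeded here, so each run of this cell draws different rows. A minimal sketch of the same `skiprows` technique with a fixed seed follows. The in-memory CSV, the seed value, the larger `p`, and the helper name `sample_csv` are illustrative assumptions, not part of the original notebook.

```python
import io
import random

import pandas as pd

# Hypothetical stand-in for categComments.csv: a header plus 1,000 data rows.
csv_text = "cat,txt\n" + "".join(f"sports,row{i}\n" for i in range(1000))

def sample_csv(text, p, seed):
    """Read roughly a fraction p of the data rows, always keeping the header."""
    rng = random.Random(seed)  # private generator, so the draw is repeatable
    return pd.read_csv(
        io.StringIO(text),
        skiprows=lambda i: i > 0 and rng.random() > p,
    )

sample_a = sample_csv(csv_text, p=0.05, seed=42)
sample_b = sample_csv(csv_text, p=0.05, seed=42)

print(len(sample_a))              # close to 0.05 * 1000, but not exactly
print(sample_a.equals(sample_b))  # same seed, same rows
```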


In [5]:
# preview the first rows of the imported dataframe
df.head()

Unnamed: 0,cat,txt,pos
0,sports,"['trent', 'brown', 'one', 'piec', 'rais', 'lot...","[[('trent', 'NN'), ('brown', 'IN'), ('one', 'C..."
1,sports,"['he', 'right', 'base', 'franchis', 'tag', 'nu...","[[('he', 'PRP'), ('right', 'VBD'), ('base', 'N..."
2,sports,"['agre', 'fuck']","[[('agre', 'NNS'), ('fuck', 'VBD')]]"
3,sports,"['swap', '1st', 'pat', 'approx', '2000', 'poin...","[[('swap', 'NN'), ('1st', 'CD'), ('pat', 'NN')..."
4,sports,"['I', 'dont', 'know', 'hand', 'Im', 'fulli', '...","[[('I', 'PRP'), ('dont', 'VBP'), ('know', 'JJ'..."


> The part-of-speech feature (`pos`) is used as the model input.

In [61]:
# separate into input and output columns (X variable for features and y for target variables)

X = df['pos']
y = df['cat']

In [62]:
# check the feature input
X.head()

0    [[('freeman', 'JJ'), ('pick', 'NN'), ('blitz',...
1    [[('im', 'NN'), ('feel', 'NN'), ('still', 'RB'...
2    [[('bleacher', 'DT'), ('report', 'NN'), ('must...
3                                  [[('delet', 'NN')]]
4    [[('wallpap', 'NN'), ('android', 'NN'), ('pan'...
Name: pos, dtype: object
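Each `pos` entry is stored in the CSV as a string that looks like a nested Python list of `(word, tag)` tuples, and the notebook passes these strings on unchanged. If only the tags were wanted, one possible approach is to parse each string with `ast.literal_eval` and join the tags. This is a sketch of an alternative, not what the notebook does; the helper name `tags_only` is hypothetical, and the example string is a shortened version of a row shown earlier.

```python
import ast

def tags_only(pos_string):
    """Turn "[[('he', 'PRP'), ('right', 'VBD')]]" into "PRP VBD"."""
    # The parsed value is a list of sentences, each a list of (word, tag) tuples.
    sentences = ast.literal_eval(pos_string)
    return " ".join(tag for sentence in sentences for _word, tag in sentence)

# Shortened example in the same format as the 'pos' column.
example = "[[('he', 'PRP'), ('right', 'VBD')]]"
print(tags_only(example))  # PRP VBD
```

Applied to the whole column, this would look like `df['pos'].apply(tags_only)`.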

In [63]:
# convert the target variable to an array
y = np.array(df['cat'])
y[10:15]

array(['science_and_technology', 'science_and_technology',
       'science_and_technology', 'science_and_technology',
       'science_and_technology'], dtype=object)

In [64]:
# Create the tf-idf feature matrix
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features = 3000)
X = tfidf.fit_transform(df['pos'])
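Since `df['pos']` holds raw strings, `TfidfVectorizer` applies its default tokenizer to them: the text is lowercased and every run of two or more word characters becomes a token. The vocabulary therefore mixes the stemmed words with the lowercased tags. The sketch below shows this on two short strings in the same format as the column; the strings themselves are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two illustrative strings in the same format as the 'pos' column.
docs = [
    "[[('he', 'PRP'), ('right', 'VBD')]]",
    "[[('delet', 'NN')]]",
]

vectorizer = TfidfVectorizer(max_features=3000)
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index in the matrix.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)         # ['delet', 'he', 'nn', 'prp', 'right', 'vbd']
print(matrix.shape)  # (2, 6)
```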

In [65]:
# TfidfVectorizer returns a sparse matrix, which causes problems in the model.
# solve the error by converting the sparse matrix to a dense NumPy array
X = X.toarray()

In [66]:
X.shape

(972, 3000)

In [10]:
# import keras libraries
from keras.utils.np_utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras import models
from keras import layers
from keras.layers import Dense
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier

In [47]:
# number of features produced by the tf-idf vectorizer (its max_features setting)
nFeatures = 3000
# number of categories in the target
nClasses = 3

In [48]:
# build the model 
def build_network():
    """
    Return a compiled neural network with two hidden layers and a softmax output.
    """
    nn = Sequential()
    nn.add(Dense(500, activation = 'relu', input_shape =(nFeatures,)))
    nn.add(Dense(150, activation = 'relu'))
    nn.add(Dense(nClasses, activation = 'softmax'))
    nn.compile(loss = 'categorical_crossentropy',
              optimizer = 'adam',
              metrics = ['accuracy']
              )
    return nn
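For a sense of the model's size, the number of trainable parameters in each `Dense` layer is inputs × units + units (weights plus biases). The arithmetic below applies that to the three layers defined in `build_network()`.

```python
# Layer widths from build_network(): 3000 inputs -> 500 -> 150 -> 3 outputs.
layer_sizes = [3000, 500, 150, 3]

def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [1500500, 75150, 453]
print(total)      # 1576103
```

With roughly 1.6 million parameters and fewer than a thousand sampled rows, the network has far more capacity than data, so overfitting is a reasonable concern.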

> Train the Keras model and use validation split for splitting the data between training and validation.

In [68]:
# train the model
nn2 = KerasClassifier(build_fn = build_network, 
                            epochs = 200,
                            batch_size = 128)
nn2.fit(X,y, validation_split=0.33)

Epoch 1/200
...
Epoch 200/200


<tensorflow.python.keras.callbacks.History at 0x175e538ce88>
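`validation_split=0.33` makes Keras hold out the last 33% of the rows, taken before any shuffling. If the rows are ordered by category, that tail slice may not contain every class. One alternative, sketched here under that assumption, is a stratified split with scikit-learn's `train_test_split`. The class sizes come from the row totals of the confusion matrix at the end of this notebook; the label names and the random features are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with the class sizes 35, 251, and 686 (972 rows in total).
y_demo = np.array(["class_a"] * 35 + ["class_b"] * 251 + ["class_c"] * 686)
# Placeholder features; the real notebook would use the tf-idf matrix instead.
X_demo = np.random.default_rng(0).random((len(y_demo), 5))

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo,
    test_size=0.33,    # same hold-out fraction as validation_split
    stratify=y_demo,   # keep the class proportions in both parts
    random_state=0,
)

# Show how many rows of each class land in each part.
for name, part in (("train", y_train), ("validation", y_val)):
    labels, counts = np.unique(part, return_counts=True)
    print(name, dict(zip(labels, counts)))
```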

> Make predictions from fitted model.

In [69]:
# predict a class label for every row in X (this includes the rows used for training)
predicts2 = nn2.predict(X)

Instructions for updating:
Please use instead:
* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).
* `(model.predict(x) > 0.5).astype("int32")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).
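This message is the text TensorFlow prints when the deprecated `predict_classes` method is called, which the scikit-learn wrapper appears to do internally. The replacement it suggests for a softmax output is an argmax over the predicted probabilities. A minimal sketch with made-up probabilities follows; the class names are placeholders, and mapping indices back to labels through a sorted label array is an assumption about how the labels were encoded.

```python
import numpy as np

# Placeholder class labels in sorted order, as np.unique would return them.
classes = np.array(["class_a", "class_b", "class_c"])

# Made-up softmax outputs for four rows (each row sums to 1).
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.4, 0.5, 0.1],
])

class_index = np.argmax(proba, axis=-1)  # position of the largest probability
labels = classes[class_index]            # map positions back to label strings

print(class_index)  # [0 1 2 1]
print(labels)
```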


> Compute the evaluation metrics.

In [70]:
from sklearn.metrics import accuracy_score 
from sklearn.metrics import precision_score 
from sklearn.metrics import recall_score 
from sklearn.metrics import f1_score 
from sklearn.metrics import cohen_kappa_score 
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import confusion_matrix  

# accuracy: correct predictions / all predictions
accuracy = accuracy_score(y, predicts2)
print('Accuracy: %f' % accuracy)

# precision: tp / (tp + fp), macro-averaged over the classes
precision = precision_score(y, predicts2, average = 'macro')
print('Precision: %f' % precision)

# recall: tp / (tp + fn), macro-averaged over the classes
recall = recall_score(y, predicts2, average = 'macro')
print('Recall: %f' % recall)

# f1: 2 tp / (2 tp + fp + fn), macro-averaged over the classes
f1 = f1_score(y, predicts2, average = 'macro')
print('F1 score: %f' % f1)



Accuracy: 0.867284
Precision: 0.876670
Recall: 0.892128
F1 score: 0.868852



In [71]:
# confusion matrix 
matrix2 = confusion_matrix(y, predicts2)

In [72]:
# print confusion matrix
print(matrix2)

[[ 30   5   0]
 [  0 251   0]
 [  1 123 562]]
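The reported scores can be checked directly against this matrix. In scikit-learn's layout, rows are true classes and columns are predicted classes, so per-class recall divides the diagonal by the row sums and per-class precision divides it by the column sums. The sketch below recomputes the macro-averaged values from the printed matrix.

```python
import numpy as np

# Confusion matrix from the notebook output (rows: true, columns: predicted).
cm = np.array([
    [30,   5,   0],
    [0,  251,   0],
    [1,  123, 562],
])

tp = np.diag(cm).astype(float)
precision_per_class = tp / cm.sum(axis=0)  # column sums = predicted counts
recall_per_class = tp / cm.sum(axis=1)     # row sums = true counts
f1_per_class = 2 * precision_per_class * recall_per_class / (
    precision_per_class + recall_per_class
)

accuracy = tp.sum() / cm.sum()
print(f"Accuracy:  {accuracy:.6f}")                    # 0.867284
print(f"Precision: {precision_per_class.mean():.6f}")  # 0.876670
print(f"Recall:    {recall_per_class.mean():.6f}")     # 0.892128
print(f"F1 score:  {f1_per_class.mean():.6f}")         # 0.868852
```

The per-class values show where the errors are: 123 of the 686 rows in the third class are predicted as the second class, which pulls the third class's recall down to about 0.82 and the second class's precision down to about 0.66.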
