Can a Neural Network Do a Better Job than Logistic Regression?
===

In this section, I'll train a feed-forward neural network with the help of TensorFlow and compare it with my previous logistic regression model. I'll use the same bag-of-words features as in the last section.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from nltk.corpus import stopwords
from pymongo import MongoClient
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jiahe_000\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The following script extracts the transaction data from MongoDB, cleans it up, and generates the bag-of-words feature set. It also splits the dataset into a training set and a test set. These are the same steps and code used in the last section to prepare data for logistic regression; I include them here for completeness.

In [5]:
mongoclient = MongoClient('localhost')
##Mongoclient.database_names()
db = mongoclient['bankdata']

# Fetch name, amount, date, and category for every transaction
transactions = pd.DataFrame(list(db.transactions.find({},{"name":1,"amount":1,"date":1,"category":1})))
transactions.head(5)


def name_to_words( raw_name ):
    # Function to convert a raw name to a string of words
    # The input is a single string (a raw transaction name), and 
    # the output is a single string (a preprocessed transaction name)
    #
    # Drop every token that contains a digit (store numbers, reference IDs, etc.)
    ##letters_only = re.sub("[^a-zA-Z]", " ", raw_name)
    letters_only = re.sub(r"\S*\d\S*", "", raw_name)
    # Convert to lower case and split into words
    words = letters_only.lower().split()

    # A set makes the stop-word lookup fast
    stops = set(stopwords.words("english"))
    # Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    # Join the words back into one string separated by spaces,
    # and return the result.
    return " ".join(meaningful_words)


num_trans = transactions["name"].size

# Initialize an empty list to hold the clean names
clean_name = []

for i in range( 0, num_trans ):
    # Call our function for each one, and add the result to the list of clean names
    clean_name.append( name_to_words( transactions["name"][i] ) )

transactions['clean_name'] = clean_name


vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None)
bow_features = vectorizer.fit_transform(clean_name)

# Note: scikit-learn 1.2 and later replace get_feature_names() with get_feature_names_out()
vocab = vectorizer.get_feature_names()
bow_features = pd.DataFrame(bow_features.toarray())
bow_features.columns = vocab



np.random.seed(0)
num_trans = transactions["_id"].size
split_list = np.random.uniform(0,1,num_trans)
p=0.5
train_x = bow_features[split_list < p]
test_x = bow_features[split_list >= p]
train_y = transactions["category"][split_list < p]
test_y = transactions["category"][split_list >= p]
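
To make the cleaning and bag-of-words steps concrete, here is a minimal, dependency-free sketch on three made-up transaction names. The names, the tiny stop list `TOY_STOPS`, and the helpers `clean_transaction_name` and `bag_of_words` are all assumptions for illustration; the notebook itself uses NLTK's English stop words and scikit-learn's `CountVectorizer`, whose default tokenizer also drops one-character tokens.

```python
import re

# Hypothetical mini stop list; the notebook uses NLTK's English stop words instead.
TOY_STOPS = {"the", "of", "at", "to"}

def clean_transaction_name(raw_name, stops=TOY_STOPS):
    """Drop tokens containing a digit, lowercase the rest, remove stop words."""
    no_digit_tokens = re.sub(r"\S*\d\S*", "", raw_name)
    words = no_digit_tokens.lower().split()
    return " ".join(w for w in words if w not in stops)

def bag_of_words(docs):
    """Return (sorted vocabulary, count matrix) for whitespace-tokenized docs."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for w in doc.split():
            row[index[w]] += 1
        rows.append(row)
    return vocab, rows

# Made-up transaction names, not taken from the real dataset
raw = ["STARBUCKS #1234 SEATTLE", "Payment to THE GAS CO 0042", "Starbucks Store 55"]
cleaned = [clean_transaction_name(n) for n in raw]
vocab, counts = bag_of_words(cleaned)

print(cleaned)  # ['starbucks seattle', 'payment gas co', 'starbucks store']
print(vocab)    # ['co', 'gas', 'payment', 'seattle', 'starbucks', 'store']
print(counts)   # [[0, 0, 0, 1, 1, 0], [1, 1, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
```

Each row of `counts` plays the role of one row of `bow_features`: one column per vocabulary word, one count per transaction.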


Softmax Regression
---

In [25]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf

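# Map category names to integer class ids
lenc = LabelEncoder()
lenc.fit(transactions["category"])
num_y = lenc.transform(transactions["category"])
# Expand the ids into one-hot rows (scikit-learn 1.2 and later spell this option sparse_output=False)
oenc = OneHotEncoder(sparse=False)
onehot_y = oenc.fit_transform(num_y.reshape(-1,1))
print(onehot_y)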

[[ 0.  0.  1. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  1.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  1. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
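
The two encoders above do something simple, which the following sketch spells out in plain Python. The category names and the helpers `label_encode` and `one_hot` are hypothetical stand-ins for `LabelEncoder` and `OneHotEncoder`, not the library implementations.

```python
def label_encode(labels):
    """Map each distinct label to an integer id, with classes in sorted order."""
    classes = sorted(set(labels))
    to_index = {c: i for i, c in enumerate(classes)}
    return classes, [to_index[c] for c in labels]

def one_hot(indices, num_classes):
    """Turn each integer id into a row with a single 1.0 at that position."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in indices]

# Hypothetical category names, not the 14 categories in the real data
categories = ["Food", "Travel", "Food", "Shops"]
classes, ids = label_encode(categories)
onehot = one_hot(ids, len(classes))

print(classes)  # ['Food', 'Shops', 'Travel']
print(ids)      # [0, 2, 0, 1]
print(onehot)   # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# The position of the 1.0 recovers the original category
decoded = [classes[row.index(1.0)] for row in onehot]
print(decoded)  # ['Food', 'Travel', 'Food', 'Shops']
```

The same decoding idea is what `tf.argmax(y_, 1)` relies on later when predictions are compared with the labels.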


In [26]:
# Reuse the same random split so the labels stay aligned with train_x and test_x
train_y = onehot_y[split_list < p]
test_y = onehot_y[split_list >= p]
print(train_y.shape)
print(test_y.shape)
print(train_x.shape)
print(test_x.shape)

(790, 14)
(784, 14)
(790, 562)
(784, 562)


In [41]:
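# This cell uses the TensorFlow 1.x graph API (placeholders and sessions).
# Input: one row of 562 bag-of-words counts per transaction
x = tf.placeholder(tf.float32, [None, 562])

# Model parameters: one weight column and one bias per category
W = tf.Variable(tf.zeros([562,14]))
b = tf.Variable(tf.zeros([14]))

logits = tf.matmul(x, W) + b
y = tf.nn.softmax(logits)
# One-hot ground-truth labels
y_ = tf.placeholder(tf.float32,[None,14])

# Per-example cross-entropy, computed from the raw logits
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels = y_)

train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

# Fraction of examples whose most probable class matches the label
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

for i in range(1000):
    # Full-batch gradient descent: every step uses the whole training set
    batch_xs, batch_ys = train_x, train_y
    sess.run(train_step, feed_dict={x:batch_xs, y_:batch_ys})
    # Report test-set accuracy after each step
    print(sess.run(accuracy, feed_dict={x:test_x, y_: test_y}))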

0.492347
0.556122
0.627551
0.702806
0.715561
0.739796
0.771684
0.811224
0.811224
0.836735
0.844388
0.844388
0.844388
0.845663
0.850765
0.852041
0.855867
0.864796
0.864796
0.871173
0.877551
0.878827
0.882653
0.885204
0.885204
0.889031
0.894133
0.894133
0.894133
0.894133
0.896684
0.896684
0.897959
0.897959
0.897959
0.899235
0.901786
0.909439
0.909439
0.909439
0.909439
0.909439
0.909439
0.909439
0.909439
0.909439
0.909439
0.910714
0.910714
0.910714
0.910714
0.910714
0.910714
0.910714
0.910714
0.913265
0.913265
0.913265
0.913265
0.913265
0.913265
0.913265
0.913265
0.913265
0.913265
0.913265
0.913265
0.913265
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.914541
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.917092
0.917092
0.917092
0.917092
0.917092
0.917092
0.917092
0.917092
0.917092
0.917092
0.917092
0.917092
...
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
0.915816
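
To show what the TensorFlow graph above computes at each step, here is a rough NumPy sketch of softmax regression trained by full-batch gradient descent. The toy data and the function names are assumptions for illustration, and the update uses the gradient of the summed cross-entropy, which is my understanding of what TensorFlow 1.x minimizes when it is handed a non-scalar loss. It is not the notebook's code and will not reproduce the accuracies listed above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def train_softmax_regression(x, y_onehot, learning_rate=0.01, steps=1000):
    """Full-batch gradient descent on the summed softmax cross-entropy."""
    n_features, n_classes = x.shape[1], y_onehot.shape[1]
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    for _ in range(steps):
        probs = softmax(x @ W + b)
        # Gradient of the cross-entropy with respect to the logits
        grad_logits = probs - y_onehot
        W -= learning_rate * (x.T @ grad_logits)
        b -= learning_rate * grad_logits.sum(axis=0)
    return W, b

def accuracy(x, y_onehot, W, b):
    """Fraction of rows whose highest-scoring class matches the label."""
    predicted = np.argmax(x @ W + b, axis=1)
    return float(np.mean(predicted == np.argmax(y_onehot, axis=1)))

# Toy data (an assumption): 6 examples, 4 count features, 3 classes
x = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)
labels = np.array([0, 1, 2, 0, 1, 2])
y_onehot = np.eye(3)[labels]

W, b = train_softmax_regression(x, y_onehot)
acc = accuracy(x, y_onehot, W, b)
print(acc)
```

Starting from all-zero weights works here because the model has no hidden layer: the loss is convex in `W` and `b`, so there is no symmetry to break.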


Feed-Forward Neural Network
---

In [17]:
tf.zeros([10])

<tf.Tensor 'zeros:0' shape=(10,) dtype=float32>
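
As a rough sketch of where this section is heading, the forward pass of a feed-forward network with one hidden layer could look like the NumPy code below. Only the input width (562) and the number of classes (14) come from the shapes printed earlier; the hidden size of 32, the ReLU activation, the random initialization, and the function names are all assumptions, and no training step is shown.

```python
import numpy as np

def init_params(n_inputs, n_hidden, n_classes, seed=0):
    """Small random weights and zero biases (hidden size is an assumption)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_inputs, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_hidden, n_classes)),
        "b2": np.zeros(n_classes),
    }

def forward(x, params):
    """Input -> ReLU hidden layer -> softmax class probabilities."""
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

params = init_params(n_inputs=562, n_hidden=32, n_classes=14)

# Placeholder batch standing in for five rows of bag-of-words counts
fake_batch = np.zeros((5, 562))
fake_batch[:, :3] = 1.0

probs = forward(fake_batch, params)
print(probs.shape)  # (5, 14)
```

Unlike the softmax regression above, the hidden-layer weights should not start at zero: with identical initial weights every hidden unit receives the same gradient and the units never learn different features.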