## Naive Bayes

The kernel is restarted every time the parameters are changed to make reproducibility easier. 

#### The Pretrained Model

The first few blocks format the data and create the training and validation split. The blocks up until the section on hyper parameters must be run. Then under the 'loading the model' section the last block should be run. This block is entitled load model. In the 'Final model' section the block, entitled 'accuracy rates' can be run. This will run the model on the preprocessed data. 

In [2]:
import numpy as np
np.random.seed(1337)  # for reproducibility
from sklearn.naive_bayes import MultinomialNB

In [3]:
# look into preprocessor.py file for details
from preprocessor import DocumentTermMatrix, accuracy, indicator_to_matrix

In [4]:
yelp_data = DocumentTermMatrix("yelp_academic_dataset_review.json", "text", "stars", 1000) #  TDM with IDF

In [6]:
nb_classes = len(yelp_data.docs_label_index)

In [8]:
from sklearn import cross_validation

# data split - training and validation data (80, 20)
X_train, X_test,Y_train, Y_test = cross_validation.train_test_split(yelp_data.X_docs,yelp_data.Y_docs,test_size=0.2)

## Hyper Parameter 

The code in this section is specifically for setting hyper parameters.

In [5]:
models = ['' for i in xrange(4)] 
for cv in xrange(4):
    
    print('Building model...', cv)
    models[cv] = MultinomialNB()

('Building model...', 0)
('Building model...', 1)
('Building model...', 2)
('Building model...', 3)


In [7]:
from preprocessor import Kfold_cv

In [8]:
full_index = [i for i in xrange(len(X_train))]
 
indices_cv = Kfold_cv(full_index,4)

In [9]:
histories = []
for cv in xrange(4):
    np.random.seed(1337)  # for reproducibility
    # data split
    #  this could be done in a single shot before hand
    #  at the moment this will preserve memory which seems import in a VM setting
    x_train = X_train[indices_cv[cv]["train"]] 
    x_test  = X_train[indices_cv[cv]["test"]] 

    y_train = [ lang_data.docs_label_index[Y_train[i]] for i in indices_cv[cv]["train"] ] 
    y_test  = [ lang_data.docs_label_index[Y_train[i]] for i in indices_cv[cv]["test"] ] 

    # create appropirate matrix (hot encoded) response
    # y_train, y_test = [indicator_to_matrix(x,lang_data.docs_label_index)  for x in (y_train, y_test)]

    history = models[cv].fit(x_train, y_train) 
    
    # set validation split to 0 or none so all the traning data is used 
    #   the out of sample rate will be determined later

    # do not set verbose = 1
    
    histories.append(history) # don't know how kosher this is, seems worth trying though :)

In [10]:
# in and out scores over the cross folds

out_sample_accuracies = []
in_sample_accuracies = []
confustion_matrices = []
for cv in xrange(4):
    x_train = lang_data.X_docs[indices_cv[cv]["train"]] 
    x_test  = lang_data.X_docs[indices_cv[cv]["test"]] 

    y_train = [ lang_data.docs_label_index[lang_data.Y_docs[i]] for i in indices_cv[cv]["train"] ] 
    y_test  = [ lang_data.docs_label_index[lang_data.Y_docs[i]] for i in indices_cv[cv]["test"] ] 

    # y_test_vec = [ lang_data.docs_label_index[i] for i in y_test ] 

    # create appropirate matrix (hot encoded) response
    #y_train, y_test = [indicator_to_matrix(x,lang_data.docs_label_index)  for x in (y_train, y_test)]
    
    train_pred = models[cv].predict(x_train)
    test_pred = models[cv].predict(x_test)
    
    train_acc = 100*float(np.array([ train_pred[idx] == y_train[idx]  for idx, pred in enumerate(train_pred)]).sum())/len(train_pred)
    test_acc  = 100*float(np.array([ test_pred[idx]  == y_test[idx]   for idx, pred in enumerate(test_pred )]).sum())/len(test_pred )
    
    out_sample_accuracies.append(test_acc)
    in_sample_accuracies.append(train_acc)
    
    #confustion_matrices.append(confusion_matrix(y_test,test_pred))

In [12]:
[np.mean(in_sample_accuracies),np.mean(out_sample_accuracies)]

# [87.907608695652172, 87.273550724637687]

[87.907608695652172, 87.273550724637687]

## Final Model

In [10]:
model = MultinomialNB()
np.random.seed(1337)  # for reproducibility

y_train = [ yelp_data.docs_label_index[i] for i in Y_train ] 
y_test  = [ yelp_data.docs_label_index[i] for i in Y_test ] 

history = model.fit(X_train, y_train) 
    
# set validation split to 0 or none so all the traning data is used 
#   the out of sample rate will be determined later

# do not set verbose = 1


In [12]:
# accuracy rates

y_train = [ yelp_data.docs_label_index[i] for i in Y_train ] 
y_test  = [ yelp_data.docs_label_index[i] for i in Y_test ] 

train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
    
train_acc = 100*float(np.array([ train_pred[idx] == y_train[idx]  for idx, pred in enumerate(train_pred)]).sum())/len(train_pred)
test_acc  = 100*float(np.array([ test_pred[idx]  == y_test[idx]   for idx, pred in enumerate(test_pred )]).sum())/len(test_pred )
    
print([train_acc,test_acc])

[98.25, 38.0]


In [13]:
# confustion matrix

from sklearn.metrics import confusion_matrix
confustion_matrix = (confusion_matrix(y_test,test_pred))
print(confustion_matrix)

[[ 8  8  2  3  0]
 [ 4 10 10  3  2]
 [ 0  5  9 15  7]
 [ 1  5 21 27 15]
 [ 3  4  2 14 22]]


#### Loading the Model

In [14]:
# save model
import cPickle
# save the classifier
with open('yelp_nb.pkl', 'wb') as fid:
    cPickle.dump(model, fid)   

In [6]:
# load model
import cPickle
with open('yelp_nb.pkl', 'rb') as fid:
    model = cPickle.load(fid)