## Random Forest Hyper Parameter Setting

The kernel is restarted every time the parameters are changed to make reproducibility easier. 

#### The Pretrained Model

The first few blocks format the data and create the training and validation split. These bloack up until the section on hyper parameters must be run. Then under the 'loading the model' section the last block should be run. This block is entitled load model. In the 'Final model' section the last block, entitled 'accuracy rates' can be run. This will run the model on the preprocessed data. 

In [1]:
import numpy as np
np.random.seed(1337)  # for reproducibility
from sklearn.ensemble import RandomForestClassifier

In [3]:
# look into preprocessor.py file for details
from preprocessor import DocumentTermMatrix, accuracy, indicator_to_matrix

In [4]:
yelp_data = DocumentTermMatrix("yelp_academic_dataset_review.json", "text", "stars", 1000) #  TDM with IDF

In [5]:
nb_classes = len(yelp_data.docs_label_index)

In [7]:
from sklearn import cross_validation

# data split - training and validation data (80, 20)
X_train, X_test,Y_train, Y_test = cross_validation.train_test_split(yelp_data.X_docs,yelp_data.Y_docs,test_size=0.2)

## Hyper Parameter 

The code in this section is specifically for setting hyper parameters.

In [5]:
models = ['' for i in xrange(4)] 
for cv in xrange(4):
    
    print('Building model...', cv)
    models[cv] = RandomForestClassifier(10,max_features="log2") # 1000 as well

('Building model...', 0)
('Building model...', 1)
('Building model...', 2)
('Building model...', 3)


In [7]:
from preprocessor import Kfold_cv

In [8]:
full_index = [i for i in xrange(len(X_train))]
 
indices_cv = Kfold_cv(full_index,4)

In [9]:
histories = []

out_sample_accuracies = []
in_sample_accuracies = []

for cv in xrange(4):
    np.random.seed(1337)  # for reproducibility
    # data split
    #  this could be done in a single shot before hand
    #  at the moment this will preserve memory which seems import in a VM setting
    x_train = X_train[indices_cv[cv]["train"]] 
    x_test  = X_train[indices_cv[cv]["test"]] 

    y_train = [ lang_data.docs_label_index[Y_train[i]] for i in indices_cv[cv]["train"] ] 
    y_test  = [ lang_data.docs_label_index[Y_train[i]] for i in indices_cv[cv]["test"] ] 

    # create appropirate matrix (hot encoded) response
    # y_train, y_test = [indicator_to_matrix(x,lang_data.docs_label_index)  for x in (y_train, y_test)]

    history = models[cv].fit(x_train, y_train) 
    
    # set validation split to 0 or none so all the traning data is used 
    #   the out of sample rate will be determined later

    # do not set verbose = 1
    
    train_pred = models[cv].predict(x_train)
    test_pred = models[cv].predict(x_test)
    
    train_acc = 100*float(np.array([ train_pred[idx] == y_train[idx]  for idx, pred in enumerate(train_pred)]).sum())/len(train_pred)
    test_acc  = 100*float(np.array([ test_pred[idx]  == y_test[idx]   for idx, pred in enumerate(test_pred )]).sum())/len(test_pred )
    
    out_sample_accuracies.append(test_acc)
    in_sample_accuracies.append(train_acc)
    
    histories.append(history) # don't know how kosher this is, seems worth trying though :)

In [11]:
[np.mean(in_sample_accuracies),np.mean(out_sample_accuracies)]

# log2, sqrt, num_trees

[99.758454106280197, 95.652173913043484]

## Final Model

The code in this section is for the final model.

The Kernel should be reset before running the final model

In [8]:
model = RandomForestClassifier(100,max_features="log2") 

In [9]:
np.random.seed(1337)  # for reproducibility
    
y_train = [ yelp_data.docs_label_index[i] for i in Y_train ] 
y_test  = [ yelp_data.docs_label_index[i] for i in Y_test ] 

# create appropirate matrix (hot encoded) response
# y_train, y_test = [indicator_to_matrix(x,lang_data.docs_label_index)  for x in (y_train, y_test)]

history = model.fit(X_train, y_train) 
    
# set validation split to 0 or none so all the traning data is used 
#   the out of sample rate will be determined later

# do not set verbose = 1

In [10]:
# accuracy rates

y_train = [ yelp_data.docs_label_index[i] for i in Y_train ] 
y_test  = [ yelp_data.docs_label_index[i] for i in Y_test ] 

train_pred = model.predict(X_train)
test_pred  = model.predict(X_test)
    
train_acc = 100*float(np.array([ train_pred[idx] == y_train[idx]  for idx, pred in enumerate(train_pred)]).sum())/len(train_pred)
test_acc  = 100*float(np.array([ test_pred[idx]  == y_test[idx]   for idx, pred in enumerate(test_pred )]).sum())/len(test_pred )

print([train_acc,test_acc])

[100.0, 43.0]


#### Loading the Model

In [11]:
# save model
import cPickle
# save the classifier
with open('yelp_rf.pkl', 'wb') as fid:
    cPickle.dump(model, fid)   

In [6]:
# load model
import cPickle
with open('yelp_rf.pkl', 'rb') as fid:
    model = cPickle.load(fid)