# Neural CI
----
In this notebook, we'll read in the data produced by the data formatting notebook (dataFormmatter.ipynb) and pass it into [my neural net library](https://github.com/RobGeada/nn). Specifically, we'll write functions that make it easy to vary the network architecture and dataset sizes, which simplifies neural net tuning.

## Imports and Initializations

In [1]:
import numpy as np
import os,sys
import time
import pickle
import matplotlib.pyplot as plt
import nn

cwd = os.getcwd()

## Data Helpers
We want to be able to specify dataset sizes here, so we can play around with how much data we train and test on; that calls for a function rather than hardcoded values. We also need to shuffle the data randomly, keeping each input vector paired with its label, as the neural net training algorithm requires. Finally, we want an easy way to save our predictions to file, ideally alongside the true ci_status values for comparison.
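
To illustrate why both arrays must be shuffled with the same permutation, here is a small sketch on toy arrays. The name `shuffle_in_unison`, the `seed` argument, and the toy data are assumptions made for this example; the notebook's own helper is defined in the next cell.

```python
import numpy as np

def shuffle_in_unison(a, b, seed=None):
    """Shuffle two equal-length arrays with one shared permutation."""
    assert len(a) == len(b)
    rng = np.random.RandomState(seed)
    p = rng.permutation(len(a))
    # Indexing both arrays with the same p keeps row i paired with label i.
    return a[p], b[p]

# Toy data: row i of X is [2*i, 2*i + 1], and its label is simply i.
X = np.arange(10).reshape(5, 2)
y = X[:, 0] // 2

Xs, ys = shuffle_in_unison(X, y, seed=0)

# Every shuffled row should still carry its original label.
pairs_intact = bool(np.all(Xs[:, 0] // 2 == ys))
print("pairs intact: {}".format(pairs_intact))
```

Shuffling the two arrays independently would break this pairing, which is why a single permutation is drawn and applied to both.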

In [24]:
def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    return a[p], b[p]

def loadData(trainUB,testUB):
    #load training (t*) and validation (v*) vectors and their ci_status labels
    tX,tY = np.load(cwd+"/formattedData/tvectors.npy"),np.load(cwd+"/formattedData/tstatus.npy")
    vX,vY = np.load(cwd+"/formattedData/vvectors.npy"),np.load(cwd+"/formattedData/vstatus.npy")
    
    #randomly shuffle data, keeping each vector paired with its label
    vX2,vY2 = unison_shuffled_copies(vX,vY)
    tX2,tY2 = unison_shuffled_copies(tX,tY)

    #keep only the first trainUB training and testUB testing examples
    trainX,trainY = tX2[:trainUB],tY2[:trainUB]
    testX,testY  = vX2[:testUB],vY2[:testUB]

    return trainX,trainY,testX,testY

#save predictions to file, flagging those whose rounded value differs from the true label
def savePredictions(predictions,testY,filename):
    print "Saving predictions..."
    lines = []
    for i,prediction in enumerate(predictions):
        line = "ID: {},P: {},A: {}".format(i,round(prediction,3),testY[i])
        if round(prediction,0)!=testY[i]:
            line += ", INCORRECT"
        lines.append(line)
    #the with block closes the file even if the write fails
    with open("{}/{}_Predictions.csv".format(cwd,filename),"w") as f:
        f.write("\n".join(lines))
    print "Done!"
    
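def checkPredictions(logID):
    #check to make sure user has requisite data
    dataDir = cwd+"/formattedData"
    if not os.path.isdir(dataDir):
        print "Could not find {}; run the data formatting notebook first.".format(dataDir)
        return
    
    #get validation logs (pickle files should be opened in binary mode)
    with open(dataDir+"/validLogs.pkl","rb") as f:
        sentences = pickle.load(f)
    tgtLog = sentences[logID]
    
    #merge tags into the same index
    tgtLog[54:56] = [[tgtLog[54],tgtLog[55]]]
    
    #get fieldNames
    with open(dataDir+"/fieldNames.pkl","rb") as f:
        fieldNames = pickle.load(f)
    
    #pair each field name with its value in the target log
    print zip(fieldNames,tgtLog)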
              
checkPredictions(logID=0)

[('@timestamp', u'2016-10-18 19:12:22'), ('__label', u'None'), ('ci_agent_label', u'None'), ('ci_agent_name', u'None'), ('ci_job_build_id', u'356'), ('ci_job_full_url', u'https://rhcs-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/QCI_1.1_RHEV_Conn_API-MLibVirt-QCI_OSE_CFME-Smoke-74ad3-stable-runtest/356/'), ('ci_job_log_url', u'http://shipshift.perf.lab.eng.bos.redhat.com/rhcs-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/QCI_1.1_RHEV_Conn_API-MLibVirt-QCI_OSE_CFME-Smoke-74ad3-stable-runtest/356/consoleText'), ('ci_job_name', u'QCI_1.1_RHEV_Conn_API-MLibVirt-QCI_OSE_CFME-Smoke-74ad3-stable-runtest'), ('ci_job_phase', u'FINISHED'), ('ci_master_hostname', u'None'), ('file', u'consoleText'), ('geoip_location_lat', u'None'), ('geoip_location_lon', u'None'), ('hostname', u'jslave-QCI_1.1_RHEV_Conn_API-MLibVirt-QCI_OSE_CFME-Smoke-74ad3-stable-210fb'), ('ipaddr4', u'None'), ('ipaddr6', u'None'), ('level', u'None'), ('message', u'    /usr/include/features.h:168:0: note: this is the location of the pre
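
The tag-merging step in `checkPredictions` relies on slice assignment: two adjacent entries are replaced by a single nested list, so the log ends up with one value per field name. Here is a minimal sketch on a toy log; the field names and values below are made up for illustration and are much shorter than the real logs.

```python
# Toy log with two separate tag entries at indices 2 and 3 (hypothetical values).
log = ["2016-10-18 19:12:22", "FINISHED", "tagA", "tagB", "consoleText"]
fields = ["@timestamp", "ci_job_phase", "tags", "file"]

# Replace the two-element slice with one element that holds both tags.
log[2:4] = [[log[2], log[3]]]

# The list is now one entry shorter and lines up with the field names.
paired = list(zip(fields, log))
for name, value in paired:
    print("{}: {}".format(name, value))
```

Because slice assignment modifies the list in place, any other reference to the same list sees the merged version as well.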

## Setup NN Test Function
The same goes here; we want to be able to play with network parameters, so let's write a function rather than hardcode anything. You'll notice that I'm passing test data into `net.train()`; don't be alarmed. `net.train()` only uses the testing data to report the test error after each epoch, so we can nip over-fitting in the bud.

In [5]:
def nnTest(parameters):
    #unpack the parameters tuple
    trainUB,testUB,hiddenSize,epochs,learningRate = parameters

    #load training,testing data from the specified sets
    trainX,trainY,testX,testY = loadData(trainUB,testUB)

    #create network
    net = nn.Network(inDim=35,biases=1,hiddenDims=[hiddenSize,],outDim=1,learningRate=learningRate)

    #train network
    tStart = time.time()
    Y = net.train(trainX,trainY,testX,testY,epochs=epochs)

    #display training stats
    print "\n===RESULTS==="
    print "Train time:   {} s".format(time.time()-tStart)

    #make predictions
    tStart = time.time()
    predictionsX = net.predict(testX)
    print "Predict time: {} s".format(time.time()-tStart)
    
    #test predictions and display accuracy stats
    net.error(testX,testY,verbose=True)
    return predictionsX,testY

## Run It!
Here we define the network parameters we want to test. The variables defined below correspond to network parameters as follows:


| Variable | Parameter |
| ------------- |:-------------:|
| trainUB | Size of training dataset |
| testUB | Size of testing dataset |
| hiddenSize | Number of nodes in hidden layer |
| epochs | Number of training epochs (passes over the training set) |
| learningRate | Eta value for neural net backpropagation |

The values below are just the ones I've found to perform best on my particular slice of the dataset, so tune away!
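
If you want to try several settings rather than a single one, a small grid of parameter tuples is one way to organize the search. The sketch below only builds the tuples, in the order `nnTest` unpacks them; the helper name `parameter_grid` and the alternative values (20, 50, 0.05) are hypothetical and have not been tested on this dataset.

```python
from itertools import product

def parameter_grid(trainUB, testUB, hiddenSizes, epochs, learningRates):
    """Build one (trainUB, testUB, hiddenSize, epochs, learningRate) tuple per combination."""
    return [(trainUB, testUB, hiddenSize, epochs, learningRate)
            for hiddenSize, learningRate in product(hiddenSizes, learningRates)]

# Vary the hidden layer size and the learning rate; keep everything else fixed.
grid = parameter_grid(75000, 10000,
                      hiddenSizes=[20, 35, 50],
                      epochs=100,
                      learningRates=[0.05, 0.15])

for params in grid:
    # Each tuple could be handed to nnTest(parameters=params).
    print(params)
```

Each call to `nnTest` reshuffles the data, so results from different tuples are computed on different random slices; keep that in mind when comparing them.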

In [6]:
#define net parameters
trainUB,testUB,hiddenSize,epochs,learningRate = 75000,10000,35,100,.15
netParams = (trainUB,testUB,hiddenSize,epochs,learningRate)

#test said parameters
predictions,testY = nnTest(parameters=netParams)

Training network...
===EPOCH 0===
Train error: 0.145435626526
Holdout error: 0.1414
===EPOCH 1===
Train error: 0.128026573818
Holdout error: 0.1252
===EPOCH 2===
Train error: 0.123344138952
Holdout error: 0.1217
===EPOCH 3===
Train error: 0.116700684356
Holdout error: 0.1174
===EPOCH 4===
Train error: 0.114272755166
Holdout error: 0.1156
===EPOCH 5===
Train error: 0.110684222462
Holdout error: 0.1126
===EPOCH 6===
Train error: 0.10714905084
Holdout error: 0.1091
===EPOCH 7===
Train error: 0.104547698136
Holdout error: 0.1072
===EPOCH 8===
Train error: 0.102439935433
Holdout error: 0.1048
===EPOCH 9===
Train error: 0.101559477595
Holdout error: 0.1038
===EPOCH 10===
Train error: 0.100745721108
Holdout error: 0.1038
===EPOCH 11===
Train error: 0.101546137325
Holdout error: 0.1038
===EPOCH 12===
Train error: 0.102906844893
Holdout error: 0.1061
===EPOCH 13===
Train error: 0.104400955163
Holdout error: 0.1071
===EPOCH 14===
Train error: 0.104561038407
Holdout error: 0.1072
===EPOCH 15===
T
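
The per-epoch log above makes it easy to see where the holdout error stops improving. As a minimal sketch, the logged holdout errors for epochs 0 through 14 can be checked like this; the helper names `best_epoch` and `epochs_since_improvement` are hypothetical and are not part of the `nn` library.

```python
import numpy as np

# Holdout errors for epochs 0-14, copied from the training log above.
holdout = [0.1414, 0.1252, 0.1217, 0.1174, 0.1156, 0.1126, 0.1091, 0.1072,
           0.1048, 0.1038, 0.1038, 0.1038, 0.1061, 0.1071, 0.1072]

def best_epoch(errors):
    """Return the first epoch at which the holdout error reaches its minimum."""
    return int(np.argmin(errors))

def epochs_since_improvement(errors):
    """Count the epochs that have passed since the holdout error last hit a new low."""
    return len(errors) - 1 - best_epoch(errors)

best = best_epoch(holdout)
stale = epochs_since_improvement(holdout)
print("best epoch: {}, holdout error: {}".format(best, holdout[best]))
print("epochs since last improvement: {}".format(stale))
```

For these values the holdout error bottoms out at epoch 9 and does not improve over the next five logged epochs, which suggests trying fewer epochs or a smaller learning rate.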

## Save Predictions

In [7]:
savePredictions(predictions,testY,"CI")

Saving predictions...
Done!
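
Each line of the saved file follows the pattern written by `savePredictions`: an ID, the raw prediction `P`, the true value `A`, and an `INCORRECT` flag when the two disagree. The sketch below parses lines in that format. The function name `parse_prediction_line` is hypothetical, and it assumes that `P` and `A` were written as plain numbers.

```python
def parse_prediction_line(line):
    """Split one saved line into its ID, prediction, true label, and error flag."""
    parts = [part.strip() for part in line.split(",")]
    return {
        "id": int(parts[0].split(":")[1]),
        "predicted": float(parts[1].split(":")[1]),
        "actual": float(parts[2].split(":")[1]),
        # savePredictions appends a fourth field only for wrong predictions.
        "incorrect": len(parts) > 3 and parts[3] == "INCORRECT",
    }

# Sample lines in the saved format. The first matches the ID 0 result discussed
# in this notebook; the second is made up for illustration.
sample = ["ID: 0,P: 0.001,A: 0", "ID: 1,P: 0.734,A: 0, INCORRECT"]

records = [parse_prediction_line(line) for line in sample]
num_incorrect = sum(r["incorrect"] for r in records)
print("{} of {} predictions flagged incorrect".format(num_incorrect, len(records)))
```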


## Examining Predictions
Looking at the CI_Predictions.csv file that the code has generated, we can see that for ID 0, we predicted 0.001. The neural net outputs a value in the range [0,1], which is then rounded to the nearest integer, where 0 is failure and 1 is success. Therefore, the raw prediction of 0.001 corresponds to a rounded prediction of 0, meaning the net has (correctly) predicted failure for this log. Let's take a look at this log message, using the `checkPredictions` function!

In [25]:
checkPredictions(logID=0)

[('@timestamp', u'2016-10-18 19:12:22'), ('__label', u'None'), ('ci_agent_label', u'None'), ('ci_agent_name', u'None'), ('ci_job_build_id', u'356'), ('ci_job_full_url', u'https://rhcs-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/QCI_1.1_RHEV_Conn_API-MLibVirt-QCI_OSE_CFME-Smoke-74ad3-stable-runtest/356/'), ('ci_job_log_url', u'http://shipshift.perf.lab.eng.bos.redhat.com/rhcs-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/QCI_1.1_RHEV_Conn_API-MLibVirt-QCI_OSE_CFME-Smoke-74ad3-stable-runtest/356/consoleText'), ('ci_job_name', u'QCI_1.1_RHEV_Conn_API-MLibVirt-QCI_OSE_CFME-Smoke-74ad3-stable-runtest'), ('ci_job_phase', u'FINISHED'), ('ci_master_hostname', u'None'), ('file', u'consoleText'), ('geoip_location_lat', u'None'), ('geoip_location_lon', u'None'), ('hostname', u'jslave-QCI_1.1_RHEV_Conn_API-MLibVirt-QCI_OSE_CFME-Smoke-74ad3-stable-210fb'), ('ipaddr4', u'None'), ('ipaddr6', u'None'), ('level', u'None'), ('message', u'    /usr/include/features.h:168:0: note: this is the location of the pre
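
The rounding rule described in this section can be sketched in a few lines. Apart from the 0.001 prediction for ID 0, the raw predictions and true labels below are made up for illustration. The function name `to_labels` is hypothetical, and treating a value of exactly 0.5 as success is an assumption of this sketch.

```python
import numpy as np

def to_labels(raw, threshold=0.5):
    """Map raw outputs in [0,1] to class labels: 0 is failure, 1 is success."""
    return (np.asarray(raw) >= threshold).astype(int)

# Hypothetical raw predictions and true ci_status values for four logs.
raw = np.array([0.001, 0.97, 0.42, 0.88])
actual = np.array([0, 1, 1, 1])

labels = to_labels(raw)
wrong = np.flatnonzero(labels != actual)      # IDs that would be flagged INCORRECT
accuracy = float(np.mean(labels == actual))

print("labels:   {}".format(labels.tolist()))
print("wrong:    {}".format(wrong.tolist()))
print("accuracy: {}".format(accuracy))
```

The IDs listed in `wrong` are the ones worth passing to `checkPredictions`, since they show which log messages the net misjudged.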