# ANN Implementation: West Nile in Chicago


### Kaggle: West Nile

The data set comes from a Kaggle competition where users were given data related to mosquito tests, weather, and mosquito spraying in the city of Chicago.

For this neural network implementation, I used my existing cleaned and combined data from the competition.

### Neural Network: Keras & Theano

For this implementation, I used [Keras](http://keras.io/), a wrapper which provides a Scikit-Learn style API for Theano and TensorFlow. I find Keras provides a great framework for getting a neural net up and running quickly, as well as several tools for cleaning and tweaking data. I used Keras over Theano, training on CPU; however, Theano can also train on a GPU using a CUDA backend.

In this example, I used Keras's sequential model. The network has three dense layers of 32 perceptrons each, using the Rectified Linear Unit (ReLU) activation function, with a 0.5 dropout rate (0.2 after the first layer), followed by a 2 perceptron output layer with a softmax activation.
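To make the two activation functions concrete, here is a minimal pure-Python sketch of what ReLU and softmax compute. The input numbers are hypothetical and are not taken from the trained network.

```python
import math

def relu(values):
    """Rectified Linear Unit: negative inputs become 0, positive inputs pass through."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

hidden = relu([-1.5, 0.0, 2.0])  # hypothetical pre-activations of a hidden layer
probs = softmax([0.3, -1.2])     # hypothetical scores for class 0 and class 1
print(hidden)
print(probs)
```

With a 2 perceptron softmax output, the two values can be read as the predicted probabilities of class 0 and class 1.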

#### Dependencies:

In [1]:
#Pandas - dataframes and data management
import pandas as pd

#Theano / numPy - math and matrix operations
import theano
import numpy as np

#Keras - Neural Net API
from keras.layers.core import Dense, Dropout, Activation
from keras.utils import np_utils
from keras.models import Sequential

#Sklearn - metrics, preprocessing, cross-validation
from sklearn import metrics

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import KFold

from sklearn.pipeline import Pipeline

Couldn't import dot_parser, loading of dot files will not be possible.


Using Theano backend.


#### Import Train and Test datasets

In [47]:
train = pd.read_csv('./assets/trainComb.csv')
test_data = pd.read_csv('./assets/testComb.csv')

print train.shape
print test_data.shape

(10506, 37)
(116293, 36)


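## Define Keras Model

#### Model: Sequential

A linear stack of layers, added one at a time.

#### Input Layer

A dense layer of 32 perceptrons that takes `inputD` features, followed by ReLU and a 0.2 dropout.

#### Dense Layers

Two more dense layers of 32 perceptrons each, initialized with `glorot_uniform`, each followed by ReLU and a 0.5 dropout.

#### Output Layer

A dense layer of `outputD` perceptrons with a softmax activation.

#### Loss

Categorical cross-entropy.

#### Optimizer

Adadelta.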



In [3]:

def annModel(inputD,outputD):
    model = Sequential()

    model.add(Dense(32, input_dim=inputD))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))

    model.add(Dense(32, init='glorot_uniform'))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(32, init='glorot_uniform'))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(outputD))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer="adadelta")
    return model
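As a rough sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes 11 input features (the length of the feature list used later in this notebook) and 2 outputs. `dense_params` is a hypothetical helper, not part of Keras.

```python
def dense_params(n_in, n_out):
    """A dense layer has n_in * n_out weights plus one bias per output unit."""
    return n_in * n_out + n_out

# Layer widths: 11 input features (assumed), three hidden layers of 32, 2 outputs.
layer_sizes = [11, 32, 32, 32, 2]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer)     # [384, 1056, 1056, 66]
print(total_params)  # 2562
```

Dropout and activation layers add no parameters, so only the four dense layers contribute.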


#### Set features (X) and target (y)

In [28]:
feature_cols = [ u'Species', u'Latitude', u'Longitude',u'Week', u'Month', u'Tmax_x',
       u'Tmin_x', u'Tavg_x', u'DewPoint_x', u'WetBulb_x',
     u'StnPressure_x',]

X = train[feature_cols]
y = train['WnvPresent']

#### Preprocess Features

In [31]:
# PCAScale = Pipeline([('PCA',PCA(n_components=10)),('scale',StandardScaler())])
# X = PCAScale.fit_transform(X)

scale = StandardScaler()
X = scale.fit_transform(X)
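For reference, standardization subtracts each column's mean and divides by its standard deviation. The sketch below does this for a single hypothetical column in plain Python. It also shows the usual pattern of learning the statistics on training values and reusing them on unseen values. `fit_scaler` and `apply_scaler` are illustrative names, not scikit-learn functions.

```python
import math

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature column."""
    n = float(len(column))
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return mean, std

def apply_scaler(column, mean, std):
    """Standardize values using statistics that were learned elsewhere."""
    return [(v - mean) / std for v in column]

train_tmax = [70.0, 80.0, 90.0]  # hypothetical training values
test_tmax = [85.0, 95.0]         # hypothetical unseen values

mean, std = fit_scaler(train_tmax)
train_scaled = apply_scaler(train_tmax, mean, std)
test_scaled = apply_scaler(test_tmax, mean, std)  # reuses the training statistics
print(train_scaled)
print(test_scaled)
```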

#### Preprocess target

Our model outputs two values per row: one for class 0 and one for class 1. I found this to improve the accuracy of the network compared with predicting a single output of 0 or 1.

Keras has a utility which quickly takes a variable (WnvPresent) and formats it as a numpy array with a binary variable for each category.

In [32]:
yc = np_utils.to_categorical(y)
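The shape of that encoding can be sketched in plain Python. `to_one_hot` below is a hypothetical stand-in for illustration only. It returns nested lists rather than the numpy array Keras produces, and the labels are made up.

```python
def to_one_hot(labels, n_classes):
    """Return one row per label, with a 1.0 in the column matching the label."""
    rows = []
    for label in labels:
        row = [0.0] * n_classes
        row[label] = 1.0
        rows.append(row)
    return rows

y_example = [0, 1, 0]  # hypothetical WnvPresent values
yc_example = to_one_hot(y_example, 2)
print(yc_example)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```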

## KFold Validation of Neural Net

#### KFold and model attributes

5 folds, shuffled to prevent clumps of data, with a random state for consistency through model tweaks.

Input dimensions: # of features
Output dimensions: # of categories (2, 1/0)

In [33]:
kf = KFold(len(y), n_folds=5 ,shuffle=True, random_state=0)
inputD = X.shape[1]
outputD = 2


auc_scores = []
fold = 0
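To illustrate what a shuffled, seeded K-fold split does, here is a small stdlib-only sketch. It is not scikit-learn's implementation and will not reproduce its exact fold assignment. It only shows that every row is held out exactly once and that a fixed seed makes the split repeatable. `shuffled_kfold` is a hypothetical name.

```python
import random

def shuffled_kfold(n_rows, n_folds, seed):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable shuffle
    for k in range(n_folds):
        test_idx = indices[k::n_folds]    # every n_folds-th shuffled row
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

folds = list(shuffled_kfold(n_rows=10, n_folds=5, seed=0))
for train_idx, test_idx in folds:
    print("train: %s  test: %s" % (sorted(train_idx), sorted(test_idx)))
```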

#### KFold Cross-Val

Trains the model 5 times, each time on a new set of training data, holding out a portion of the data for validation. The validation data is passed to `fit` only so Keras can report the held-out loss; it is not used to update the weights.

#### Neural Net parameters:

Epochs - 100
Batch Size - 16

In [34]:
for training, testing in kf:
    fold += 1
    print "Fold Start", fold

    # Split features and one-hot targets by this fold's row indices
    X_train = X[training]
    X_test = X[testing]
    y_train = yc[training]
    y_test = yc[testing]

    # Fresh model per fold; validation_data is only used to report held-out loss
    model = annModel(inputD,outputD)
    model.fit(X_train, y_train, nb_epoch=100, batch_size=16, validation_data=(X_test, y_test), verbose=0)

    # Class probabilities for the held-out rows, scored with ROC AUC
    y_probs = model.predict_proba(X_test,verbose=0)

    roc = metrics.roc_auc_score(y_test, y_probs)
    auc_scores.append(roc)
    print "Fold Score: %.4f" % roc
    print "Fold Complete", fold
    


Fold Start 1
Fold Score: 0.8068
Fold Complete 1
Fold Start 2
Fold Score: 0.8280
Fold Complete 2
Fold Start 3
Fold Score: 0.7962
Fold Complete 3
Fold Start 4
Fold Score: 0.7713
Fold Complete 4
Fold Start 5
Fold Score: 0.8246
Fold Complete 5
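One way to read these scores: ROC AUC is the fraction of (positive, negative) pairs in which the positive row receives the higher predicted probability, with ties counting as half. The brute-force sketch below implements that definition on hypothetical labels and scores. It is meant for intuition and is not how scikit-learn computes the metric internally.

```python
def auc_score(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]           # hypothetical true classes
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities of class 1
example_auc = auc_score(labels, scores)
print(example_auc)  # 0.75: three of the four pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.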


#### Local CV AUC Scores

In [40]:
print "Mean AUC Score:"
print np.mean(auc_scores), '\n'

print "Fold Scores:"
for a in auc_scores:
    print "%.4f" % a

Mean AUC Score:
0.805375828195 

Fold Scores:
0.8068
0.8280
0.7962
0.7713
0.8246
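The spread across folds is also worth a look alongside the mean. The sketch below recomputes the mean from the rounded fold scores printed above and adds a population standard deviation. Because the inputs are rounded to four decimals, the mean only matches the value above to about four decimals.

```python
import math

# Fold scores as printed above (rounded to four decimals)
fold_scores = [0.8068, 0.8280, 0.7962, 0.7713, 0.8246]

n = float(len(fold_scores))
mean_auc = sum(fold_scores) / n
std_auc = math.sqrt(sum((s - mean_auc) ** 2 for s in fold_scores) / n)

print("Mean AUC: %.4f" % mean_auc)
print("Std dev:  %.4f" % std_auc)
```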


#### Train the model on all of the data

Increase the number of epochs, since we're only training the model once rather than 5 times.

In [41]:
inputD = X.shape[1]
outputD = 2

model = annModel(inputD,outputD)
model.fit(X, yc, nb_epoch=200, batch_size=16, verbose=0)

<keras.callbacks.History at 0x11957ff10>

#### Prepare test data, and make predictions using the trained neural network

In [43]:
test_data.head()

Unnamed: 0,Id,Date,Species,Trap,Latitude,Longitude,AddressAccuracy,Week,test_geo,Station,...,WetBulb,Heat,Cool,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2008-06-11,2,1,41.95469,-87.800991,9,24,"(41.95469,-87.800991)",1,...,64.0,0.0,,0,0.0,29.28,29.99,8.9,18,10.0
1,2,2008-06-11,3,1,41.95469,-87.800991,9,24,"(41.95469,-87.800991)",1,...,64.0,0.0,,0,0.0,29.28,29.99,8.9,18,10.0
2,3,2008-06-11,1,1,41.95469,-87.800991,9,24,"(41.95469,-87.800991)",1,...,64.0,0.0,,0,0.0,29.28,29.99,8.9,18,10.0
3,4,2008-06-11,4,1,41.95469,-87.800991,9,24,"(41.95469,-87.800991)",1,...,64.0,0.0,,0,0.0,29.28,29.99,8.9,18,10.0
4,5,2008-06-11,6,1,41.95469,-87.800991,9,24,"(41.95469,-87.800991)",1,...,64.0,0.0,,0,0.0,29.28,29.99,8.9,18,10.0


In [49]:
Xt = test_data[feature_cols]
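# Reuse the scaler fitted on the training features; refitting on the test set would scale it differently
Xt = scale.transform(Xt)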

In [50]:
sub_probs = model.predict_proba(Xt,verbose=0)

### Prepare Kaggle Submission

For this competition, Kaggle used AUC score as the leaderboard metric, so submissions were in the format of predicted probabilities for class 1 (WnvPresent == 1).

In [51]:
submission = pd.read_csv('./assets/sampleSubmission.csv')
subdat = pd.DataFrame(submission)

subdat['WnvPresent'] = sub_probs[:,1]
subdat.to_csv('test_sub.csv', index=False)