# West Nile Mosquito Virus Kaggle competition prediction

## Team of 3
     Development by Nasrudin
     Documentation by 
     Visualizations by 

### Objective and Targets
    Clear Documentation
    AUC Score top 5%
    Clean Code

#### Development:
    
    Provide software/development architecture for team members to fill in, plus a project plan
    Data Cleaning
    Imputations
    Hot Encoding
    Feature Engineering
    Create Models
    Stack Models
    Validate models
    
    
    
#### Documentation:

    EDA, verbal and descriptive analysis of project
    Explain process flow and steps
    Explain Code with documentation, Explain Thought Process
    Output predictions to submission file
    Help convert jupyter notebook into slideshow
    Document model predictive ability with stats/metrics

    
#### Visualization:
     Input visualizations for EDA (exploratory data analysis)
     Visualization on data for insights
     Output: findings, heatmap, and predictions of mosquitoes
     Help convert jupyter notebook into slideshow
     Output visualizations validating model integrity
     
     
     


## Background Information
West Nile virus is most commonly spread to humans through infected mosquitos. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever to serious neurological illnesses that can result in death.



In 2002, the first human cases of West Nile virus were reported in Chicago. By 2004 the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program that is still in effect today.

Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations.

Given weather, location, testing, and spraying data, this competition asks you to predict when and where different species of mosquitos will test positive for West Nile virus. A more accurate method of predicting outbreaks of West Nile virus in mosquitos will help the City of Chicago and CDPH more efficiently and effectively allocate resources towards preventing transmission of this potentially deadly virus.

# This IPython Notebook is in Python 3


### For this project, I'll be exploring and using:
## Models using these packages
### I will also attempt to stack some models
    Keras
    SkLearn
    XGBoost
### Models
    Logistic Regression
    RandomForests
    DeepNeuralNetworks
    CNN
    BootStrap Aggregating / Bagging
    KNN (for imputation)

## Validation and refining, tuning hyperparameters
    K-fold Cross Validation
    Cross Validation w/ hold out
    GridSearch
    Feature Selection
    Model Selection
    
## Visualization Outputs
    Geomap of mosquitoes
    Geomap of predicted mosquitoes
    HeatMaps
    


### Import basic stuff

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import math
from datetime import datetime
#%matplotlib inline

### Import Machine Learning Stuff

In [None]:
# #Install machine learning stuff
# !conda install -c conda-forge keras --yes
# !conda install -c conda-forge tensorflow --yes
# !conda install make --yes
# !conda install -c conda-forge xgboost --yes #May not work for windows
# !conda install -c anaconda py-xgboost --yes
# !pip install xgboost
# !conda install m2w64-toolchain --yes
# !conda install theano --yes

In [None]:
#SKLearn
from sklearn import ensemble, preprocessing, metrics
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed from scikit-learn
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KernelDensity

In [None]:
#Keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import to_categorical  # np_utils is no longer available in recent Keras releases

In [None]:
import xgboost as xgb

### Load Data

In [None]:
# After a first look at the data, Date was being read as a generic object column,
# so it is parsed explicitly with parse_dates below.

StreetMapFile = './Data/mapdata_copyright_openstreetmap_contributors.txt'

#Load into variables
sample = pd.read_csv('./Data/sampleSubmission.csv' )

train = pd.read_csv('./Data/train.csv', parse_dates=['Date'])
test = pd.read_csv('./Data/test.csv', parse_dates=['Date'])
weather = pd.read_csv('./Data/weather.csv', parse_dates=['Date'])
spray = pd.read_csv('./Data/spray.csv', parse_dates=['Date'])

traps = train[['Date', 'Trap','Longitude', 'Latitude', 'WnvPresent']]

#Load the map
mapdata = np.loadtxt(StreetMapFile)

datasets = [train,test,weather]

#### Random Seed

In [None]:
np.random.seed(1337)

# Perform EDA Here
    Describe Data, Create Data Dictionary, background googling.

### Observe Data

In [None]:
(train.head())

In [None]:
(test.head())

In [None]:
(weather.head())

In [None]:
spray.head(5)

In [None]:
(sample.head())

In [None]:
weather.dtypes

In [None]:
train.dtypes

In [None]:
print (train['Date'].unique())
print (weather['Date'].unique())
print (test['Date'].unique())

In [None]:
train.isnull().sum()

# Clean the Data

# PreProcess the Data

In [None]:
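# Restrict to numeric columns; recent pandas versions raise on text columns otherwise
train.corr(numeric_only=True)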

### Create Data Dictionary

## Merging the Data
    We want to merge on the "Date" column

### Date Time Functions

In [None]:
def StrToDate(string):
    """Parse a 'dd/mm/YYYY' string into a datetime."""
    return datetime.strptime(string, '%d/%m/%Y')

In [None]:
# Merge with weather data
train = train.merge(weather, on='Date')
test = test.merge(weather, on='Date')
train.head()
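
One thing worth checking before relying on this merge: if `weather` holds more than one row per date (for example one row per weather station, which `weather['Date'].value_counts()` would reveal), a plain merge on `Date` repeats every trap row. The sketch below uses small made-up frames, not the real data, and the choice to keep station 1 is only an assumption for illustration.

```python
import pandas as pd

# Toy stand-ins for train and weather (hypothetical values, not the real data)
toy_train = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2007-05-29", "2007-06-05"]),
    "Trap": ["T002", "T007", "T002"],
})
toy_weather = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2007-05-29", "2007-06-05", "2007-06-05"]),
    "Station": [1, 2, 1, 2],
    "Tavg": [74, 77, 70, 71],
})

# A plain merge on Date repeats each trap row once per station
naive = toy_train.merge(toy_weather, on="Date")

# Keeping a single station gives one weather row per date;
# validate= makes pandas raise if that assumption is ever violated
station1 = toy_weather[toy_weather["Station"] == 1]
merged = toy_train.merge(station1, on="Date", how="left", validate="many_to_one")

print(len(toy_train), len(naive), len(merged))
```

Averaging the stations per date, or assigning each trap to its nearest station, would be alternative designs; which one suits the real data is left open here.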

# Hot Encoding
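A minimal sketch of one-hot encoding a categorical column, assuming `pd.get_dummies` is an acceptable tool for it. The frames and the `one_hot` helper are hypothetical. Declaring a shared category list first keeps the train and test frames aligned on the same dummy columns even when a species appears in only one of them.

```python
import pandas as pd

# Toy frames (hypothetical values); note the test frame has a species train lacks
toy_train = pd.DataFrame({"Species": ["CULEX PIPIENS", "CULEX RESTUANS", "CULEX PIPIENS"]})
toy_test = pd.DataFrame({"Species": ["CULEX RESTUANS", "CULEX TERRITANS"]})

# Union of all categories seen in either frame
categories = sorted(set(toy_train["Species"]) | set(toy_test["Species"]))

def one_hot(df, column, categories):
    """Return a copy of df with `column` replaced by one dummy column per category."""
    out = df.copy()
    out[column] = pd.Categorical(out[column], categories=categories)
    return pd.get_dummies(out, columns=[column], prefix=column)

train_encoded = one_hot(toy_train, "Species", categories)
test_encoded = one_hot(toy_test, "Species", categories)

print(train_encoded.columns.tolist())
```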

### Preparing the Data
#### Scaling

In [None]:
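def preprocess_data(X, scaler=None):
    # Fit a new scaler only when none is supplied, so the scaler fitted on
    # the training data can be reused unchanged on the test data
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(X)
    X = scaler.transform(X)
    return X, scaler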
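
A short usage sketch for the scaling helper, on random toy arrays rather than the competition data. The helper is repeated so the block runs on its own. The point it illustrates: fit the scaler on the training matrix once, then pass the same scaler in for the test matrix so both are transformed with identical statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def preprocess_data(X, scaler=None):
    # Fit only when no scaler is supplied; otherwise reuse the given one
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(X)
    return scaler.transform(X), scaler

# Toy feature matrices (hypothetical values)
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

X_train_scaled, scaler = preprocess_data(X_train_toy)   # fits on train
X_test_scaled, _ = preprocess_data(X_test_toy, scaler)  # reuses train statistics
```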

##### Bootstrapping

In [None]:
def bootstrap(X, y):
    shuffle = np.arange(len(y))
    np.random.shuffle(shuffle)
    X = X[shuffle]
    y = y[shuffle]
    return X, y
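
Note that the `bootstrap` function above returns a random permutation: every row appears exactly once. Bootstrap aggregating usually draws rows with replacement instead. The sketch below shows that variant on toy arrays; `bootstrap_resample` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

def bootstrap_resample(X, y, rng):
    """Draw len(y) rows with replacement and return them with the drawn indices."""
    n = len(y)
    idx = rng.integers(0, n, size=n)  # indices may repeat
    return X[idx], y[idx], idx

rng = np.random.default_rng(1337)
X_toy = np.arange(20).reshape(10, 2)  # row i is [2*i, 2*i + 1]
y_toy = np.arange(10)

X_boot, y_boot, idx = bootstrap_resample(X_toy, y_toy, rng)

# Rows that were never drawn form the out-of-bag set, usable for validation
oob = np.setdiff1d(np.arange(len(y_toy)), idx)
```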

##### Into Float

In [None]:
def intofloat(string):
    # Return the value as a float when possible, otherwise return it unchanged
    try:
        return float(string)
    except (TypeError, ValueError):
        return string

# Insert Visualizations here

### Visualization of areas with traps

In [None]:
aspect = mapdata.shape[0] * 1.0 / mapdata.shape[1]
lon_lat_box = (-88, -87.5, 41.6, 42.1)

plt.figure(figsize=(10,14))
plt.imshow(mapdata, 
           cmap=plt.get_cmap('gray'), 
           extent=lon_lat_box, 
           aspect=aspect)

locations = traps[['Longitude', 'Latitude']].drop_duplicates().values
plt.scatter(locations[:,0], locations[:,1], marker='x')
plt.show()

# Insert Imputations here
Note: in the weather data, `M` marks a missing value and `T` marks a trace amount; both must be converted to numbers before imputation.
### Using KNN Imputation
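A minimal sketch of this step, assuming scikit-learn's `KNNImputer` is used for the KNN imputation. The toy frame, its values, and `TRACE_VALUE` (a small number standing in for a trace amount) are all assumptions for illustration, not choices taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy weather frame (hypothetical values) with 'M' and 'T' placeholders
toy_weather = pd.DataFrame({
    "Tavg": ["74", "M", "70", "68"],
    "PrecipTotal": ["0.00", "  T", "0.13", "M"],
    "DewPoint": [58, 59, 52, 50],
})

TRACE_VALUE = 0.005  # assumed numeric stand-in for a trace amount

cleaned = toy_weather.copy()
for col in ["Tavg", "PrecipTotal"]:
    stripped = cleaned[col].astype(str).str.strip()
    is_trace = stripped == "T"
    # Anything non-numeric ('M' and 'T') becomes NaN here
    numeric = pd.to_numeric(stripped, errors="coerce")
    # Restore trace readings as a small positive number
    cleaned[col] = numeric.mask(is_trace, TRACE_VALUE)

# Fill the remaining NaNs from the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(cleaned), columns=cleaned.columns)
print(imputed)
```

Scaling the columns before imputation may matter in practice, because the neighbour search is distance based.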

# Insert Data Engineering Here

## Convert categorical data to numbers

In [None]:
lbl = preprocessing.LabelEncoder()
lbl.fit(list(train['Species'].values) + list(test['Species'].values))
train['Species'] = lbl.transform(train['Species'].values)
test['Species'] = lbl.transform(test['Species'].values)

# Create Models

## Model #1 Keras Sequential, Deep Net

In [None]:
def build_kerassequential(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(32, input_dim=input_dim))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(32))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(output_dim))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer="adadelta")
    return model

## Model #2 Random Forests

In [None]:
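# Random Forest Classifier
labels = train['WnvPresent'].values
# Keep numeric columns also present in test (assumes test has no WnvPresent column),
# so the target does not leak into the features
feature_cols = [c for c in train.select_dtypes(include=[np.number]).columns if c in test.columns]
# min_samples_split must be at least 2 in scikit-learn
clf = ensemble.RandomForestClassifier(n_jobs=-1, n_estimators=1000, min_samples_split=2)
clf.fit(train[feature_cols], labels)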

## Model #3 XGBoost

In [None]:
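# Assumed setup (these names were not defined earlier in the notebook):
# numeric features shared by train and test, and the missing-value marker
feature_cols = [c for c in train.select_dtypes(include=[np.number]).columns if c in test.columns]
Xtrain, Ytrain = train[feature_cols].values, train['WnvPresent'].values
Xtest = test[feature_cols].values
MISSING = np.nan
# Class counts, used to up-weight the rare positive class
sum_wpos = (Ytrain == 1).sum()
sum_wneg = (Ytrain == 0).sum()

dtrain = xgb.DMatrix(Xtrain, label=Ytrain, missing=MISSING)
dtest = xgb.DMatrix(Xtest, missing=MISSING)
param = {}
# use logistic regression loss, use raw prediction before logistic transformation
# since we only need the rank
param['objective'] = 'binary:logitraw'
# scale weight of positive examples
param['scale_pos_weight'] = float(sum_wneg) / float(sum_wpos)
param['eta'] = 0.1
param['max_depth'] = 7
param['eval_metric'] = 'auc'
param['verbosity'] = 0  # replaces the deprecated 'silent' flag
param['min_child_weight'] = 100
param['subsample'] = 0.7
param['colsample_bytree'] = 0.7
param['nthread'] = 4

num_round = 50

#xgb.cv(param, dtrain, num_round, nfold=5)
bst = xgb.train(param, dtrain, num_round)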

## K-Fold Cross Validation Function

In [None]:
#Insert cross validation function here
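
One possible shape for this function, sketched on synthetic data. `cross_val_auc` is a hypothetical name, and the stratified split is an assumption made because positive cases are expected to be rare; it reports AUC per fold since AUC is the competition target stated in the objectives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_val_auc(make_model, X, y, n_splits=5, seed=1337):
    """Return the validation AUC of each fold, building a fresh model per fold."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in folds.split(X, y):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        # Probability of the positive class is what AUC ranks
        proba = model.predict_proba(X[valid_idx])[:, 1]
        scores.append(roc_auc_score(y[valid_idx], proba))
    return np.array(scores)

# Synthetic, imbalanced stand-in for the real feature matrix
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)
scores = cross_val_auc(
    lambda: RandomForestClassifier(n_estimators=50, random_state=0), X_toy, y_toy
)
print(scores.mean(), scores.std())
```

Passing a factory (`make_model`) rather than a fitted model keeps folds independent of each other.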

# Stack the Models Layer 1
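A minimal stacking sketch, assuming scikit-learn's `StackingClassifier` is an acceptable way to build the first layer. The base models, their settings, and the synthetic data are placeholders, not the notebook's final choices. Each base model produces out-of-fold probabilities (`cv=5`), and a logistic regression is trained on those as the meta-model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the real data
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=15)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,                          # out-of-fold predictions feed the meta-model
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)

stack_auc = roc_auc_score(y_va, stack.predict_proba(X_va)[:, 1])
print(stack_auc)
```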

## Grid Search for Stacked Models
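A sketch of how a grid search over a stacked model could look, assuming `GridSearchCV` with AUC scoring. The grid values and model choices are illustrative only. Parameters of a named base model are addressed as `<name>__<param>`, and those of the meta-model as `final_estimator__<param>`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic data set so the search finishes quickly
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=30, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Illustrative grid: one base-model parameter and one meta-model parameter
param_grid = {
    "rf__max_depth": [3, None],
    "final_estimator__C": [0.1, 1.0],
}

search = GridSearchCV(stack, param_grid, scoring="roc_auc", cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

The cost grows quickly, since every grid point refits the whole stack for every fold, so a small grid is a reasonable starting point.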

## Stacked Models Validations with Cross Validation Hold - Out
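A minimal sketch of cross validation with a hold-out set, on synthetic data. A plain logistic regression stands in for the stacked model to keep the example short. The hold-out rows are split off first and never touched during cross validation, so the final AUC is an untouched check on the cross-validated estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the real data
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Set aside 20% as a hold-out set before any tuning or validation
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1337
)

model = LogisticRegression(max_iter=1000)

# Cross-validated AUC on the development portion only
cv_auc = cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc")

# Refit on all development data, then score once on the hold-out set
model.fit(X_dev, y_dev)
holdout_auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])

print(cv_auc.mean(), holdout_auc)
```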