# West Nile Mosquito Virus Kaggle Competition Prediction

## Team of 3
     Development by Nasrudin
     Documentation by Nasrudin
     Visualizations by Nasrudin
     
#### I ended up doing the project by myself.

### Objective and Targets
    Clear documentation
    AUC score in the top 5%
    Clean code

#### Development:
    
    Provide a software architecture and project plan for team members to fill in
    Data Cleaning
    Imputations
    One-Hot Encoding
    Feature Engineering
    Create Models
    Stack Models
    Validate models
    
    
    
#### Documentation:

    EDA, verbal and descriptive analysis of project
    Explain process flow and steps
    Explain Code with documentation, Explain Thought Process
    Output predictions to submission file
    Help convert jupyter notebook into slideshow
    Document model predictive ability with stats/metrics

    
#### Visualization:
     Input visualizations for EDA (exploratory data analysis)
     Visualization on data for insights
     Output: On Findings, heatmap and predictions of mosquitoes
     Help convert Jupyter notebook into slideshow
     Output visualizations on validation of model integrity
     
     
     


## Background Information
West Nile virus is most commonly spread to humans through infected mosquitos. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever to serious neurological illnesses that can result in death.



In 2002, the first human cases of West Nile virus were reported in Chicago. By 2004 the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program that is still in effect today.

Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations.

Given weather, location, testing, and spraying data, this competition asks you to predict when and where different species of mosquitos will test positive for West Nile virus. A more accurate method of predicting outbreaks of West Nile virus in mosquitos will help the City of Chicago and CDPH more efficiently and effectively allocate resources towards preventing transmission of this potentially deadly virus.

# This IPython Notebook is in Python 3


### For this project, I'll be exploring and using:
## Packages
    Keras
    scikit-learn (sklearn)
    XGBoost
### Models
#### I will also attempt to stack and/or ensemble some models
    Logistic Regression
    Random Forests
    Deep Neural Networks
    CNN
    Bootstrap Aggregating / Bagging
    KNN (for imputation)
## Validation, refinement, and hyperparameter tuning
    K-fold Cross Validation
    Cross Validation w/ hold out
    GridSearch
    Feature Selection
    Model Selection
    
## Visualization Outputs
    Geomap of mosquitoes
    Geomap of predicted mosquitoes
    HeatMaps
    


### Import basic stuff

In [1]:
import pandas as pd
import numpy as np
from sklearn import ensemble, preprocessing


#### Load dataset 

In [2]:
train = pd.read_csv('assets/kaggle/train.csv')
test = pd.read_csv('assets/kaggle/test.csv')
sample = pd.read_csv('assets/kaggle/sampleSubmission.csv')
weather = pd.read_csv('assets/kaggle/weather.csv')

#### Answer Labels

In [3]:
labels = train.WnvPresent.values

In [4]:
# CodeSum is not used as a feature, so drop it
weather = weather.drop('CodeSum', axis=1)


#### Split stations 1 and 2 and join them side by side on Date

In [5]:
weather_stn1 = weather[weather['Station']==1]
weather_stn2 = weather[weather['Station']==2]
weather_stn1 = weather_stn1.drop('Station', axis=1)
weather_stn2 = weather_stn2.drop('Station', axis=1)
weather = weather_stn1.merge(weather_stn2, on='Date')
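
To see what this join produces, here is a minimal sketch on a made-up frame (the values and the single `Tmax` column are illustrative, not taken from weather.csv). Because both halves share column names, pandas appends its default suffixes: `_x` for the left frame (station 1) and `_y` for the right frame (station 2).

```python
import pandas as pd

# Toy stand-in for weather.csv: two stations, two dates (values are made up)
toy_weather = pd.DataFrame({
    'Station': [1, 2, 1, 2],
    'Date': ['2007-05-01', '2007-05-01', '2007-05-02', '2007-05-02'],
    'Tmax': [83, 84, 59, 60],
})

# Same steps as the cell above, applied to the toy frame
stn1 = toy_weather[toy_weather['Station'] == 1].drop('Station', axis=1)
stn2 = toy_weather[toy_weather['Station'] == 2].drop('Station', axis=1)
wide = stn1.merge(stn2, on='Date')

print(wide.columns.tolist())  # ['Date', 'Tmax_x', 'Tmax_y']
```

Each date ends up on a single row, so the later merge with train and test adds one set of columns per station.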

#### Replace missing-value markers (M, -) and T with -1

In [6]:
weather = weather.replace('M', -1)
weather = weather.replace('-', -1)
weather = weather.replace('T', -1)
weather = weather.replace(' T', -1)
weather = weather.replace('  T', -1)
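
The five calls above can also be written as one call with a list of placeholders. The sketch below runs on a made-up frame; the column names and values are only illustrative, and the final `pd.to_numeric` step is an optional extra that the notebook itself does not perform.

```python
import pandas as pd

# Toy columns containing the placeholder strings handled above (values are made up)
toy = pd.DataFrame({
    'PrecipTotal': ['0.00', '  T', 'M', '0.13'],
    'Depth': ['0', 'M', '-', '0'],
})

# One call with a list of placeholders instead of five separate calls
placeholders = ['M', '-', 'T', ' T', '  T']
cleaned = toy.replace(placeholders, -1)

# Optional extra step (an assumption, not in the notebook): convert to numeric dtypes
numeric = cleaned.apply(pd.to_numeric)

print(numeric)
```

After `replace`, the columns still hold a mix of strings and the integer -1, which is why the optional conversion can be useful before modelling.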


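#### Functions to extract month and day from the Date column
#### Alternatively, use the parse_dates option of pandas' read_csv.

In [7]:
def create_month(x):
    # x is a 'YYYY-MM-DD' string; returns the month as a string, e.g. '05'
    return x.split('-')[1]

def create_day(x):
    # returns the day of the month as a string, e.g. '29'
    return x.split('-')[2]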

In [8]:
train['month'] = train.Date.apply(create_month)
train['day'] = train.Date.apply(create_day)
test['month'] = test.Date.apply(create_month)
test['day'] = test.Date.apply(create_day)
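
As a sketch of the pandas alternative mentioned above, the dates can be parsed once and the parts read from the `.dt` accessor. The toy frame is made up, and the `format` string assumes the `YYYY-MM-DD` layout that the split-based functions rely on. Unlike the string functions, this yields integers.

```python
import pandas as pd

# Made-up rows in the same 'YYYY-MM-DD' layout
toy = pd.DataFrame({'Date': ['2007-05-29', '2007-06-05']})

dates = pd.to_datetime(toy['Date'], format='%Y-%m-%d')
toy['month'] = dates.dt.month  # integers such as 5, not strings such as '05'
toy['day'] = dates.dt.day

print(toy)
```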

#### Add integer-truncated Latitude/Longitude columns as features

In [9]:
train['Lat_int'] = train.Latitude.apply(int)
train['Long_int'] = train.Longitude.apply(int)
test['Lat_int'] = test.Latitude.apply(int)
test['Long_int'] = test.Longitude.apply(int)
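
A small sketch of what `apply(int)` does to coordinates, using illustrative values rather than rows from the dataset. `int` truncates toward zero, so negative longitudes are not floored.

```python
import pandas as pd

# Illustrative coordinates (not rows from train.csv)
toy = pd.DataFrame({
    'Latitude': [41.95469, 41.99499],
    'Longitude': [-87.800991, -87.769279],
})

toy['Lat_int'] = toy.Latitude.apply(int)    # 41, 41
toy['Long_int'] = toy.Longitude.apply(int)  # -87, -87 (truncation, not floor)

print(toy[['Lat_int', 'Long_int']])
```

In this toy frame both rows collapse to the same integer pair, so these columns act as a very coarse location bucket.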

#### Drop the address columns, plus WnvPresent and NumMosquitos from train and Id from test

In [10]:
train = train.drop(['Address', 'AddressNumberAndStreet','WnvPresent', 'NumMosquitos'], axis = 1)
test = test.drop(['Id', 'Address', 'AddressNumberAndStreet'], axis = 1)

#### Merge with weather Data

In [11]:
train = train.merge(weather, on='Date')
test = test.merge(weather, on='Date')
train = train.drop(['Date'], axis = 1)
test = test.drop(['Date'], axis = 1)
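
`merge` defaults to an inner join, so any date without a weather row would silently drop from the result. Since the submission step assigns predictions by position, the merged test frame needs to keep every row. The sketch below, on made-up frames with hypothetical column values, shows one way to guard against that with `how='left'` and `validate='many_to_one'`.

```python
import pandas as pd

# Made-up frames; the second date is deliberately missing from the weather side
toy_train = pd.DataFrame({
    'Date': ['2007-05-29', '2007-05-29', '2007-06-05'],
    'Trap': ['T002', 'T007', 'T002'],
})
toy_weather = pd.DataFrame({'Date': ['2007-05-29'], 'Tmax_x': [88]})

inner = toy_train.merge(toy_weather, on='Date')  # default how='inner'
left = toy_train.merge(toy_weather, on='Date', how='left',
                       validate='many_to_one')   # raises if weather has duplicate dates

print(len(toy_train), len(inner), len(left))
```

Comparing the row counts before and after the merge is a cheap check that nothing was lost.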

#### Convert Categorical Data to numbers using LabelEncoder

In [12]:
lbl = preprocessing.LabelEncoder()
lbl.fit(list(train['Species'].values) + list(test['Species'].values))
train['Species'] = lbl.transform(train['Species'].values)
test['Species'] = lbl.transform(test['Species'].values)

lbl.fit(list(train['Street'].values) + list(test['Street'].values))
train['Street'] = lbl.transform(train['Street'].values)
test['Street'] = lbl.transform(test['Street'].values)

lbl.fit(list(train['Trap'].values) + list(test['Trap'].values))
train['Trap'] = lbl.transform(train['Trap'].values)
test['Trap'] = lbl.transform(test['Trap'].values)
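
The three near-identical blocks above follow one pattern: fit the encoder on the union of train and test values, then transform each frame. A minimal sketch of that pattern as a loop, on made-up frames with only two of the columns:

```python
import pandas as pd
from sklearn import preprocessing

# Made-up frames; the category values are illustrative
toy_train = pd.DataFrame({'Species': ['CULEX PIPIENS', 'CULEX RESTUANS'],
                          'Trap': ['T002', 'T007']})
toy_test = pd.DataFrame({'Species': ['CULEX RESTUANS', 'CULEX TERRITANS'],
                         'Trap': ['T007', 'T002']})

for col in ['Species', 'Trap']:
    lbl = preprocessing.LabelEncoder()
    # Fit on train and test together so categories seen only in test get a code
    lbl.fit(pd.concat([toy_train[col], toy_test[col]]))
    toy_train[col] = lbl.transform(toy_train[col])
    toy_test[col] = lbl.transform(toy_test[col])

print(toy_train)
print(toy_test)
```

The codes follow the sorted order of the category strings, so they carry no meaning beyond identity.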

#### Drop columns that contain only -1s

In [13]:
# Keep a column if at least one of its values is not -1 (.loc replaces the deprecated .ix)
train = train.loc[:, (train != -1).any(axis=0)]
test = test.loc[:, (test != -1).any(axis=0)]



## Random forest classifier model #1

In [15]:
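# Note: scikit-learn reads a float min_samples_split as a fraction of the samples,
# so 1.0 means a node must contain all of them to be split, which leaves the trees
# almost no room to grow. An integer >= 2 (the default is 2) removes that limit.
clf = ensemble.RandomForestClassifier(n_jobs=-1, n_estimators=1000, min_samples_split=1.0)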
clf.fit(train, labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=1.0,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
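
The model above is fitted on all of the training data, so nothing here estimates the AUC before submitting. A minimal sketch of a stratified k-fold check follows. It runs on synthetic data from `make_classification` as a stand-in for the real feature matrix, and the settings (`n_estimators=50`, 5 folds) are assumptions chosen to keep it fast, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training matrix (not the competition data)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

clf_cv = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One ROC AUC value per fold
scores = cross_val_score(clf_cv, X, y, cv=cv, scoring='roc_auc')

print(scores.mean())
```

With the real data, `X` and `y` would be `train` and `labels` from the cells above.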

## Creating the predictions and submission file

In [16]:
predictions = clf.predict_proba(test)[:,1]
sample['WnvPresent'] = predictions
sample.to_csv('submission.csv', index=False)
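
Because the predictions are assigned to `sample` by position, a length check before writing can catch rows lost in earlier steps. The sketch below uses a made-up sample frame and made-up probabilities, and it assumes the submission columns are `Id` and `WnvPresent`. It writes to an in-memory buffer rather than a file.

```python
import io
import pandas as pd

# Made-up stand-ins for sampleSubmission.csv and the model output
toy_sample = pd.DataFrame({'Id': [1, 2, 3], 'WnvPresent': [0, 0, 0]})
toy_predictions = [0.02, 0.40, 0.11]

# Positional assignment needs one prediction per submission row
assert len(toy_predictions) == len(toy_sample)
toy_sample['WnvPresent'] = toy_predictions

buffer = io.StringIO()
toy_sample.to_csv(buffer, index=False)
print(buffer.getvalue())
```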