# West Nile Virus Data Story
https://www.kaggle.com/c/predict-west-nile-virus
#### Instructions:
Pick a dataset - ideally the dataset for your Capstone. If for some reason you want to do this on a different data set, you can find one on Mode Analytics or Google's public data sets directory, or pick another one you like from elsewhere.

Get going by asking the following questions and looking for the answers with some code and plots:
- Can you count something interesting?
- Can you find some trends (high, low, increase, decrease, anomalies)?
- Can you make a bar plot or a histogram?
- Can you compare two related quantities?
- Can you make a scatterplot?
- Can you make a time-series plot?

Having made these plots, what are some insights you get from them? Do you see any correlations? Is there a hypothesis you would like to investigate further? What other questions do they lead you to ask?

By now you’ve asked a bunch of questions, and found some neat insights. Is there an interesting narrative, a way of presenting the insights using text and plots from the above, that tells a compelling story? As you work out this story, what are some other trends/relationships you think will make it more complete?


#### Update:
    (1) Use sklearn.preprocessing OneHotEncoder to turn categorical variables into numeric ones
        I actually used pandas.get_dummies
    (2) Extract zipcode from address data
    (3) Use Pandas merge to join different data sources
    (4) Use Pandas fillna to replace missing data
    
    
    to do:
    Build logistic regression model
    Scrape data from website

In [11]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
from scipy import stats 
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn import metrics
%matplotlib inline

In [13]:
# load data 
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
spray = pd.read_csv('spray.csv')
weather = pd.read_csv('weather.csv')
train_label = train['WnvPresent']




In [17]:
# check data type of dataframe
print('train\n{}\n'.format(train.dtypes))
print('test\n{}\n'.format(test.dtypes))
print('spray\n{}\n'.format(spray.dtypes))
print('weather\n{}'.format(weather.dtypes))

train
Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
dtype: object 

test
Id                          int64
Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
dtype: object 

spray
Date          object
Time          object
Latitude     float64
Longitude    float64
dtype: object 

weather
Station          int64
Date            object
Tmax           

### Baselines

In [16]:
# predict all with virus
print("Accuracy when predicting all with virus = {}".format(
    metrics.accuracy_score(train_label, np.ones(train_label.shape))))
# predict all without virus
print("Accuracy when predicting all without virus = {}".format(
    metrics.accuracy_score(train_label, np.zeros(train_label.shape))))

Accuracy when predicting all with virus =  0.0524462212069
Accuracy when predicting all without virus =  0.947553778793


The baseline probability that no virus is detected is about 0.948.
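
The two accuracies above are just the class proportions of `WnvPresent`. A minimal sketch of that relationship, using a small made-up label series in place of the real `train_label` (the values below are an assumption for illustration only):

```python
import pandas as pd

# Hypothetical labels standing in for train['WnvPresent'] (1 = virus present)
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Share of each class; the majority share equals the accuracy of
# always predicting the majority class
class_share = labels.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print(baseline_accuracy)
```

On the real data the same call on `train_label` should reproduce the two numbers printed above.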

The number of mosquitos is not included in the test dataset, but in the training data it can be used to predict whether the virus is detected.

In [18]:
# predict with virus if number of mosquitos >=50
print("Accuracy when predicting num of mos >=50 with virus = {}".format(
    metrics.accuracy_score(train_label, train['NumMosquitos'] >= 50)))

Accuracy when predicting num of mos >=50 with virus =  0.87112126404
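
The threshold rule scores 0.871, which is below the 0.948 all-negative baseline, and accuracy alone does not show how many positive traps the rule actually catches. A rough sketch of checking precision and recall as well, on made-up rows (the real columns would be `train['NumMosquitos']` and `train['WnvPresent']`; the numbers here are assumptions, not results from the dataset):

```python
import pandas as pd

# Made-up example rows, not taken from train.csv
df = pd.DataFrame({
    'NumMosquitos': [1, 3, 50, 12, 50, 7, 50, 2, 4, 50],
    'WnvPresent':   [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
})

pred = (df['NumMosquitos'] >= 50).astype(int)
actual = df['WnvPresent']

# Fraction of rows where the rule agrees with the label
accuracy = (pred == actual).mean()

# Of the rows flagged positive, how many really are (precision);
# of the truly positive rows, how many were flagged (recall)
true_pos = ((pred == 1) & (actual == 1)).sum()
precision = true_pos / float((pred == 1).sum())
recall = true_pos / float((actual == 1).sum())

print(accuracy)
print(precision)
print(recall)
```

With a rare positive class, precision and recall say more about a rule like this than accuracy does.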


### Extract zip code from address and add it as a new column 'zipcode'

In [4]:
# extract zip code from Address column and add to column 'zipcode'
import re

def address2zip(address):
    # the greedy '.*' makes the group capture the last five consecutive digits
    extzip = re.match(r'^.*(?P<zipcode>\d{5}).*$', address)
    if extzip is None:
        return 'na'
    return extzip.group('zipcode')


# add zipcode column to a dataframe (modifies df in place and returns it)
def addzip2df(df):
    df['zipcode'] = df['Address'].apply(address2zip)
    return df

traindf = addzip2df(train)
testdf = addzip2df(test)

print(traindf.dtypes)

Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
zipcode                    object
dtype: object
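
The same extraction can also be done without a Python-level loop. A minimal sketch using `Series.str.extract`, assuming the zip code is the last run of five digits in each address; the addresses below are hypothetical, not rows from the dataset:

```python
import pandas as pd

# Hypothetical addresses for illustration
addresses = pd.Series([
    '123 Example Street, Chicago, IL 60601, USA',
    '10500 South Example Road, Chicago, IL 60628, USA',
    'Unknown location',
])

# The greedy '^.*' pushes the capture group to the last five digits,
# so a five-digit house number is not mistaken for the zip code.
# Rows without a match come back as NaN and are replaced with 'na'.
zipcodes = addresses.str.extract(r'^.*(\d{5})', expand=False).fillna('na')

print(zipcodes.tolist())
```

On the real frames this would be applied to `train['Address']` and `test['Address']`.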


In [5]:
print('train {}\n'.format(traindf.dtypes))
print('test {}'.format(testdf.dtypes))
print(traindf.iloc[0])

train Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
zipcode                    object
dtype: object 

test Id                          int64
Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
zipcode                    object
dtype: object
Date                                                             2007-05-29
Address                   4100 No

### Convert categorical variable to dummy variables

In [6]:
traindfdummy = pd.get_dummies(traindf,columns = ['Species','zipcode'])
testdfdummy = pd.get_dummies(testdf,columns = ['Species','zipcode'])
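
Because `get_dummies` is called on train and test separately, a category that appears in only one of them produces a column the other lacks. A small sketch of aligning the two with `reindex`, on toy frames with hypothetical category values; on the real data this would be applied to the feature columns only, since `NumMosquitos` and `WnvPresent` exist only in train:

```python
import pandas as pd

# Toy frames: category 'B' appears only in train, 'C' only in test
train_small = pd.DataFrame({'Species': ['A', 'B', 'A']})
test_small = pd.DataFrame({'Species': ['A', 'C']})

train_dummies = pd.get_dummies(train_small, columns=['Species'])
test_dummies = pd.get_dummies(test_small, columns=['Species'])

# Give the test frame exactly the training columns: columns missing in
# test are added and filled with 0, columns unseen in train are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```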

### Fill missing values

In [7]:
# forward-fill missing values
traindfdummy_fill = traindfdummy.ffill()
testdfdummy_fill = testdfdummy.ffill()
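
Note that `fillna` takes the fill value as its first positional argument, so passing the string `'ffill'` writes that literal string into the empty cells, whereas `.ffill()` carries the last valid value forward. A small sketch of the difference on a made-up series:

```python
import pandas as pd

# Made-up column with one missing entry
s = pd.Series(['a', None, 'c'])

literal = s.fillna('ffill')  # the missing cell becomes the string 'ffill'
forward = s.ffill()          # the missing cell takes the previous value 'a'

print(literal.tolist())
print(forward.tolist())
```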

### Merge different tables

In [10]:
# merge train and spray
#train_spray = pd.merge(train,spray, how = 'left',on=['Date','Latitude','Longitude'])
#test_spray = pd.merge(test,spray, how = 'left', on=['Date','Latitude','Longitude'])
#train_spray.iloc[0]

train_spray = pd.merge(spray,train, how = 'right',on=['Date','Latitude','Longitude'])
test_spray = pd.merge(spray,test, how = 'right', on=['Date','Latitude','Longitude'])


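# Time is NaN for every row: no spray record matches a trap row exactly on Date, Latitude and Longitude
print(set(train_spray['Time']))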
# print len(train)
# print len(train_spray)
# print len(spray)

# print train_spray.iloc[0]
# print train.iloc[0]
# print spray.iloc[0]

set([nan])
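
Since every `Time` is NaN, the exact match on coordinates never succeeds. One possible alternative, sketched here as an assumption rather than something this notebook does, is to pair traps and sprays on `Date` only and then flag traps with a spray point within some radius. The frames, the dates, the `RADIUS_DEG` threshold and the `SprayedNearby` column name are all hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up trap and spray rows; the coordinates differ slightly, as GPS readings would
traps = pd.DataFrame({
    'Date': ['2011-08-29', '2011-08-29', '2011-09-07'],
    'Latitude': [41.9500, 41.8000, 41.9500],
    'Longitude': [-87.8000, -87.7000, -87.8000],
})
sprays = pd.DataFrame({
    'Date': ['2011-08-29', '2011-08-29'],
    'Latitude': [41.9504, 42.3900],
    'Longitude': [-87.8003, -88.0900],
})

RADIUS_DEG = 0.01  # hypothetical threshold, in degrees (crude, ignores map projection)

# Pair every trap with every spray point recorded on the same date
pairs = pd.merge(traps.reset_index(), sprays, on='Date', suffixes=('', '_spray'))

# Straight-line distance in degrees between each trap and spray point
dist = np.hypot(pairs['Latitude'] - pairs['Latitude_spray'],
                pairs['Longitude'] - pairs['Longitude_spray'])

# A trap counts as sprayed if at least one same-day spray point is within the radius
near_index = pairs.loc[dist <= RADIUS_DEG, 'index'].unique()
traps['SprayedNearby'] = traps.index.isin(near_index)

print(traps)
```

The date-only merge grows with the number of spray points per day, so on the full files it may need to be done one date at a time.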


In [9]:
'''
This part does not work: Heat is NaN for every row after the merge
'''
# merge train_spray and weather
trainset = pd.merge(weather[['Date','Heat','PrecipTotal']],train_spray, how = 'right', on = ['Date'])
testset = pd.merge(weather[['Date','Heat','PrecipTotal']],test_spray, how = 'right', on = ['Date'])

set(trainset['Heat'])

{nan}
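
The notebook does not show why every `Heat` value comes back as NaN. One way to investigate is to pass `indicator=True` to `pd.merge` and count how many rows found a partner, and to compare the dtype and exact formatting of the key column on both sides. A minimal sketch on toy frames, where a trailing space in one key is used as a hypothetical cause of a failed match (the values and the cause are assumptions, not findings from the data):

```python
import pandas as pd

# Toy stand-ins for the weather and trap tables
weather_small = pd.DataFrame({'Date': ['2007-05-29', '2007-06-05'],
                              'Heat': ['0', '0']})
traps_small = pd.DataFrame({'Date': ['2007-05-29', '2007-06-05 '],  # note the trailing space
                            'Trap': ['T002', 'T007']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(weather_small, traps_small, how='right', on='Date', indicator=True)
match_counts = merged['_merge'].value_counts()

print(match_counts)
print(weather_small['Date'].dtype)
print(traps_small['Date'].dtype)
```

If `weather` holds more than one row per date (it has a `Station` column), a successful merge would also duplicate trap rows, so selecting one station first may be needed.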