# Hello, thanks for viewing

# Problem statement and hypothesis

Every year West Nile Virus is a growing concern and causes death in Chicago and nearby suburban areas. The virus is found in infected birds but is passed on to humans by mosquitoes that bit the birds and became infected. So, the project is to use the data where they have found West Nile Virus in mosquitoes in the past to predict where and when West Nile Virus will be present in the future. For more details please go to https://www.kaggle.com/c/predict-west-nile-virus.


# Description of your data set and how it was obtained

The data is provided by Kaggle competition. The data provides the location, mosquito species, weather, spraying (Chicago sprays anti-mosquito chemical in certain areas) data, and whether the virus was found in the mosquitoes. The datasets are spray.csv, weather.csv, train.csv, test.csv. Refer to https://www.kaggle.com/c/predict-west-nile-virus/data for more details (“noaa_weather_qclcd_documentation” provides more details regarding the weather data). 

Basically 
- Train.csv provides data of when and where they trapped mosquitoes (Chicago trap mosquitoes around the city to test for virus), the species of mosquitoes that were trapped, and if they have West Nile Virus. 
- Weather.csv provides a bunch of weather info (temperature, humidity, wind speed, etc.) from 2 weather station. 
- Spray.csv provides when and where they sprayed anti-mosquitoes chemical.
- Test.csv is a test set, please ignore for now.
- Sample_submission.csv is how to format my data when I submit, so ignore this one too.

So far, I have completed parts of the Data Munging process, and I still need to clean the data further. I have successfully combined the training data with the weather data to add features from the weather data to the training data. I will take you through the steps with the code below.

In [1642]:
#Installing Packages
import json
import datetime
import numpy as np
import pandas as pd
from math import *

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [1643]:
#Reading the data, data acquired from kaggle competition
spray = pd.read_csv('spray.csv')
weather = pd.read_csv('weather.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sampleSubmission.csv')


In [1644]:
sample_submission.head()
sample_submission.WnvPresent.unique()

array([0])

# Data Visualization

Parts of my data visualization is done pn Excel, because it is easier to view on Excel than on an ipython notebook.
Most of data is complete, the concern is the weather data which provides data in different units/formats, and also has missing data in different formats (some are blank, some are replaced by letter "M"). The weather data is also taken from two weather stations (2 locations) which does not provide an accurate weather report of the location that mosquitoes were caught which are spread out across the city. I also need to find a way to create features from the spray data which seems like a stand alone data set at the moment. I am planning to use the distance between the location they trapped the mosquitoes and the sprayed area because spraying should reduce the number of mosquito overall and also mosquito with West Nile Virus.

In [1645]:
spray.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14835 entries, 0 to 14834
Data columns (total 4 columns):
Date         14835 non-null object
Time         14251 non-null object
Latitude     14835 non-null float64
Longitude    14835 non-null float64
dtypes: float64(2), object(2)

In [1646]:
spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


In [1647]:
#some data are null, fill na with median
spray_features = spray.fillna(spray.dropna().median()).values
spray_features

array([['2011-08-29', '6:56:58 PM', 42.3916233333333, -88.0891633333333],
       ['2011-08-29', '6:57:08 PM', 42.3913483333333, -88.0891633333333],
       ['2011-08-29', '6:57:18 PM', 42.3910216666667, -88.0891566666667],
       ..., 
       ['2013-09-05', '8:35:21 PM', 42.006021666666705, -87.8173916666667],
       ['2013-09-05', '8:35:31 PM', 42.0054533333333, -87.8174233333333],
       ['2013-09-05', '8:35:41 PM', 42.004805, -87.81746]], dtype=object)

In [1648]:
weather.head()
weather.describe()

Unnamed: 0,Station,Tmax,Tmin,DewPoint,ResultSpeed,ResultDir
count,2944.0,2944.0,2944.0,2944.0,2944.0,2944.0
mean,1.5,76.166101,57.810462,53.45788,6.960666,17.494905
std,0.500085,11.46197,10.381939,10.675181,3.587527,10.063609
min,1.0,41.0,29.0,22.0,0.1,1.0
25%,1.0,69.0,50.0,46.0,4.3,7.0
50%,1.5,78.0,59.0,54.0,6.4,19.0
75%,2.0,85.0,66.0,62.0,9.2,25.0
max,2.0,104.0,83.0,75.0,24.1,36.0


In [1649]:
#SKIP THIS CELL
#Tried  to use variance to to eliminate redundant data, but was not successful yet
#redundancy was seen when I opened the data in excel
#replacing NA data and removing data with low variance
#weather = pd.read_csv('weather.csv')
#weather_features = weather.fillna(weather.dropna().median()).values

'''
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.9 * (1 - .9)))
sel.fit_transform(weather.values)
weather.head()
'''

'\nfrom sklearn.feature_selection import VarianceThreshold\nsel = VarianceThreshold(threshold=(.9 * (1 - .9)))\nsel.fit_transform(weather.values)\nweather.head()\n'

In [1650]:
print train.shape
train['WnvPresent'].unique()

(10506, 12)


array([0, 1])

In [1716]:
test.head()
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 116293 entries, 0 to 116292
Data columns (total 11 columns):
Id                        116293 non-null int64
Date                      116293 non-null object
Address                   116293 non-null object
Species                   116293 non-null object
Block                     116293 non-null int64
Street                    116293 non-null object
Trap                      116293 non-null object
AddressNumberAndStreet    116293 non-null object
Latitude                  116293 non-null float64
Longitude                 116293 non-null float64
AddressAccuracy           116293 non-null int64
dtypes: float64(2), int64(3), object(6)

# Data Munging

So we have 4 sets of data, 3 which we will be using to create the model (spray, weather, and train). I am planning to first merge
the weather data set into the training dataset to add weather features to the set. Second I am planning to use the distance between
each location and the nearest spray site as another feature to make use of the spray data.

In [2443]:
#Create two new columns of zeroes in the weather dataframe
#We will add latitude and longitude of each station to the data set after this
weather = pd.read_csv('weather.csv')
weather['S_lat'], weather['S_long']  = np.zeros(len(weather.Cool.values)), np.zeros(len(weather.Cool.values))
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,S_lat,S_long
0,1,2007-05-01,83,50,67,14,51,56,0,2,...,M,0.0,0.0,29.1,29.82,1.7,27,9.2,0,0
1,2,2007-05-01,84,52,68,M,51,57,0,3,...,M,M,0.0,29.18,29.82,2.7,25,9.6,0,0
2,1,2007-05-02,59,42,51,-3,42,47,14,0,...,M,0.0,0.0,29.38,30.09,13.0,4,13.4,0,0
3,2,2007-05-02,60,43,52,M,42,47,13,0,...,M,M,0.0,29.44,30.08,13.3,2,13.4,0,0
4,1,2007-05-03,66,46,56,2,40,48,9,0,...,M,0.0,0.0,29.39,30.12,11.7,7,11.9,0,0


In [2444]:
#Add latitutde and longitude of each station to the dataset, S1 = Station 1, S2 = Station 2
#then map it to the table (station latitude and longitude were provided by Kaggle)
S1_lat = 41.995
S1_long= -87.933
S2_lat= 41.786
S2_long=  -87.752
for vec in range(len(weather.Station.values)):
    if weather.Station[vec] == 1:
        weather.S_lat[vec] = S1_lat
        weather.S_long[vec] = S1_long
    else:
        weather.S_lat[vec] = S2_lat
        weather.S_long[vec] = S2_long
weather = weather[weather.columns[1:]]
        


In [2445]:
#View unique values of each feature if there are less than 10 unique feature
for i in range(len(weather.columns)):
    weather_values = weather.values
    if len(np.unique(weather_values[:,i])) <50:
        print weather[weather.columns[i]].head(1), np.unique(weather_values[:,i])


0    14
Name: Depart, dtype: object [' 0' ' 1' ' 2' ' 3' ' 4' ' 5' ' 6' ' 7' ' 8' ' 9' '-1' '-10' '-11' '-12'
 '-13' '-14' '-15' '-16' '-17' '-2' '-3' '-4' '-5' '-6' '-7' '-8' '-9' '10'
 '11' '12' '13' '14' '15' '16' '17' '18' '19' '20' '21' '22' '23' 'M']
0    56
Name: WetBulb, dtype: object ['32' '33' '34' '35' '36' '37' '38' '39' '40' '41' '42' '43' '44' '45' '46'
 '47' '48' '49' '50' '51' '52' '53' '54' '55' '56' '57' '58' '59' '60' '61'
 '62' '63' '64' '65' '66' '67' '68' '69' '70' '71' '72' '73' '74' '75' '76'
 '77' '78' 'M']
0    0
Name: Heat, dtype: object ['0' '1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '21'
 '22' '23' '24' '25' '26' '27' '28' '29' '3' '4' '5' '6' '7' '8' '9' 'M']
0     2
Name: Cool, dtype: object [' 0' ' 1' ' 2' ' 3' ' 4' ' 5' ' 6' ' 7' ' 8' ' 9' '10' '11' '12' '13' '14'
 '15' '16' '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29'
 'M']
0    0
Name: Depth, dtype: object ['0' 'M']
0    M
Name: Water1, dtype: object ['M']
0    

In [2446]:
#Water1 is all empty so eliminate the column, also eliminate CodeSum since the other parameters already account for it
weather = weather.drop(['Water1','Depth','CodeSum'],1)


In [2447]:
#before replacing values
weather.head()

Unnamed: 0,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,S_lat,S_long
0,2007-05-01,83,50,67,14,51,56,0,2,0448,1849,0.0,0.0,29.1,29.82,1.7,27,9.2,41.995,-87.933
1,2007-05-01,84,52,68,M,51,57,0,3,-,-,M,0.0,29.18,29.82,2.7,25,9.6,41.786,-87.752
2,2007-05-02,59,42,51,-3,42,47,14,0,0447,1850,0.0,0.0,29.38,30.09,13.0,4,13.4,41.995,-87.933
3,2007-05-02,60,43,52,M,42,47,13,0,-,-,M,0.0,29.44,30.08,13.3,2,13.4,41.786,-87.752
4,2007-05-03,66,46,56,2,40,48,9,0,0446,1851,0.0,0.0,29.39,30.12,11.7,7,11.9,41.995,-87.933


In [2448]:
#Replace M, space, and - with NA
weather= weather.replace(['M','-', ' ','  T','-inf'], [np.nan, np.nan, np.nan,0.0,np.nan])
weather.head()

Unnamed: 0,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,S_lat,S_long
0,2007-05-01,83,50,67,14.0,51,56,0,2,448.0,1849.0,0.0,0.0,29.1,29.82,1.7,27,9.2,41.995,-87.933
1,2007-05-01,84,52,68,,51,57,0,3,,,,0.0,29.18,29.82,2.7,25,9.6,41.786,-87.752
2,2007-05-02,59,42,51,-3.0,42,47,14,0,447.0,1850.0,0.0,0.0,29.38,30.09,13.0,4,13.4,41.995,-87.933
3,2007-05-02,60,43,52,,42,47,13,0,,,,0.0,29.44,30.08,13.3,2,13.4,41.786,-87.752
4,2007-05-03,66,46,56,2.0,40,48,9,0,446.0,1851.0,0.0,0.0,29.39,30.12,11.7,7,11.9,41.995,-87.933


In [2449]:
#dropna
print weather.shape
weather_na = weather.dropna(0)
print weather_na.shape
weather_features = weather.columns

#if we dropna we lose a lot of data ~half
weather_na.head(6)

(2944, 20)
(1464, 20)


Unnamed: 0,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,S_lat,S_long
0,2007-05-01,83,50,67,14,51,56,0,2,448,1849,0.0,0.0,29.1,29.82,1.7,27,9.2,41.995,-87.933
2,2007-05-02,59,42,51,-3,42,47,14,0,447,1850,0.0,0.0,29.38,30.09,13.0,4,13.4,41.995,-87.933
4,2007-05-03,66,46,56,2,40,48,9,0,446,1851,0.0,0.0,29.39,30.12,11.7,7,11.9,41.995,-87.933
6,2007-05-04,66,49,58,4,41,50,7,0,444,1852,0.0,0.0,29.31,30.05,10.4,8,10.8,41.995,-87.933
8,2007-05-05,66,53,60,5,38,49,5,0,443,1853,0.0,0.0,29.4,30.1,11.7,7,12.0,41.995,-87.933
10,2007-05-06,68,49,59,4,30,46,6,0,442,1855,0.0,0.0,29.57,30.29,14.4,11,15.0,41.995,-87.933


In [2450]:
weather_values = weather_na.values
weather_features = weather_na.columns
np.mean(weather_values[:,2])
weather_values = weather_values[:,1:]


In [2451]:
"""
#centering the data
for c in range(len(weather_na.columns)-1):
    for r in weather.index:
        weather_values[r][c]= weather_values[r][c]- weather_na.mean()[c+1]
"""

'\n#centering the data\nfor c in range(len(weather_na.columns)-1):\n    for r in weather.index:\n        weather_values[r][c]= weather_values[r][c]- weather_na.mean()[c+1]\n'

In [2452]:
# PCA on weather data
from sklearn.decomposition import PCA
pca_n2 = PCA(n_components = .8, whiten = True)
weather_transform = pca_n2.fit_transform(weather_na[weather_features[1:]])
np.unique(weather_transform)




array([-1.0009539 , -1.00049159, -1.00024441, ...,  2.30201739,
        2.30374175,  2.30927813])

In [2453]:
#Center the data
for vec in range(len(weather_transform)):
    weather_transform[vec] = weather_transform[vec] - np.mean(weather_transform)

In [2454]:
np.unique(weather_transform)

array([-1.0009539 , -1.00049159, -1.00024441, ...,  2.30201739,
        2.30374175,  2.30927813])

In [2455]:
#Check variance
(pca_n2.explained_variance_ratio_) * 100

array([ 93.82901034])

In [2456]:
weather['Weather Effect'] = weather_transform
weather_new = weather[['Date','Weather Effect','S_lat','S_long']]
weather_new.shape

ValueError: Length of values does not match length of index

In [2458]:
#join weather and training data by date
train_weather = pd.merge(train, weather_new, how = 'inner', on = 'Date')
print train_weather.shape
train_weather.head(5)


(10413, 15)


Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent,Weather Effect,S_lat,S_long
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,-0.851234,41.995,-87.933
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,-0.851234,41.995,-87.933
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0,-0.851234,41.995,-87.933
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0,-0.851234,41.995,-87.933
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0,-0.851234,41.995,-87.933


In [2328]:
#Since each date has two weather data c,alculate the distance between the location they captured the mosquito and the 
#weather station and pick the closest weather station.
#Return the index of rows with the closest weather station

"""
train_index = []

for i in range(0,len(train_weather),2):
    lat_diff = train_weather.S_lat[i]-train_weather.Latitude[i]
    long_diff = train_weather.S_long[i]-train_weather.Longitude[i]
    
    lat_diff2 = train_weather.S_lat[i+1]-train_weather.Latitude[i+1]
    long_diff2 = train_weather.S_long[i+1]-train_weather.Longitude[i+1]
    
    d1 = 2*3959*acos(sqrt(sin(lat_diff/2)**2+cos(train_weather.S_lat[i])*cos(train_weather.Latitude[i])*sin(long_diff/2)**2))
    d2 = 2*3959*acos(sqrt(sin(lat_diff2/2)**2+cos(train_weather.S_lat[i+1])*cos(train_weather.Latitude[i+1])*sin(long_diff2/2)**2))
    
    if d1 >= d2:
        train_index.append(i+1)
    else:
        train_index.append(i)
'''
        
    
    
    
    
    

'\ntrain_index = []\n\nfor i in range(0,len(train_weather),2):\n    lat_diff = train_weather.S_lat[i]-train_weather.Latitude[i]\n    long_diff = train_weather.S_long[i]-train_weather.Longitude[i]\n    \n    lat_diff2 = train_weather.S_lat[i+1]-train_weather.Latitude[i+1]\n    long_diff2 = train_weather.S_long[i+1]-train_weather.Longitude[i+1]\n    \n    d1 = 2*3959*acos(sqrt(sin(lat_diff/2)**2+cos(train_weather.S_lat[i])*cos(train_weather.Latitude[i])*sin(long_diff/2)**2))\n    d2 = 2*3959*acos(sqrt(sin(lat_diff2/2)**2+cos(train_weather.S_lat[i+1])*cos(train_weather.Latitude[i+1])*sin(long_diff2/2)**2))\n    \n    if d1 >= d2:\n        train_index.append(i+1)\n    else:\n        train_index.append(i)\n'

In [2329]:
'''
#reindex the training and weather table data to have only the data of the closest weather station.
train_weather_min = pd.DataFrame(train_weather, index = train_index)
#view reindexed data, the data frame should give us the data of where and when they caught the mosquitoes, whether it has the virus, 
#and the weather information of the closest weather station
train_weather_min.head(5)
'''

'\n#reindex the training and weather table data to have only the data of the closest weather station.\ntrain_weather_min = pd.DataFrame(train_weather, index = train_index)\n#view reindexed data, the data frame should give us the data of where and when they caught the mosquitoes, whether it has the virus, \n#and the weather information of the closest weather station\ntrain_weather_min.head(5)\n'

# Feedbacks

Next Steps

2. Create a new feature using the spray data - The feature I plan to use is how far that particular location is to the 
closest location that was sprayed, to see if the mosquito that were caught are close to the area they sprayed chemicals to kill mosquitoes.


4. Apply logistic regression model, feature selection, and iterate.

5. Probability Calibration
6. Logistic Regression, Decision Tree


In [2460]:
#View unique values of each feature
for i in range(len(train_weather.columns)):
    weather_min_values = train_weather.values
    if len(np.unique(weather_min_values[:,i])) <10:
        print train_weather[train_weather.columns[i]].head(1), np.unique(train_weather.values[:,i])


0    CULEX PIPIENS/RESTUANS
Name: Species, dtype: object ['CULEX ERRATICUS' 'CULEX PIPIENS' 'CULEX PIPIENS/RESTUANS'
 'CULEX RESTUANS' 'CULEX SALINARIUS' 'CULEX TARSALIS' 'CULEX TERRITANS']
0    9
Name: AddressAccuracy, dtype: int64 [3 5 8 9]
0    0
Name: WnvPresent, dtype: int64 [0 1]
0    41.995
Name: S_lat, dtype: float64 [41.995]
0   -87.933
Name: S_long, dtype: float64 [-87.933]


In [2462]:
train_weather.head(5)

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent,Weather Effect,S_lat,S_long
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,-0.851234,41.995,-87.933
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,-0.851234,41.995,-87.933
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0,-0.851234,41.995,-87.933
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0,-0.851234,41.995,-87.933
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0,-0.851234,41.995,-87.933


In [2463]:
#take out address features because we already have latitudes and longitudes
train_weather2= train_weather.drop(train_weather.columns[[1,3,4,5,6]], axis=1)
train_weather2.head(3)

Unnamed: 0,Date,Species,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent,Weather Effect,S_lat,S_long
0,2007-05-29,CULEX PIPIENS/RESTUANS,41.95469,-87.800991,9,1,0,-0.851234,41.995,-87.933
1,2007-05-29,CULEX RESTUANS,41.95469,-87.800991,9,1,0,-0.851234,41.995,-87.933
2,2007-05-29,CULEX RESTUANS,41.994991,-87.769279,9,1,0,-0.851234,41.995,-87.933


In [2464]:
"""
#labelencoder on mosquito species
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(train_weather2.Species)
"""

'\n#labelencoder on mosquito species\nfrom sklearn import preprocessing\nle = preprocessing.LabelEncoder()\nle.fit(train_weather2.Species)\n'

In [2465]:
"""
#use label encoder on training species data and replace Species in 
species_new = le.transform(train_weather2.Species) 
train_weather2['Species'] = species_new
train_weather2.Species.unique()
"""

"\n#use label encoder on training species data and replace Species in \nspecies_new = le.transform(train_weather2.Species) \ntrain_weather2['Species'] = species_new\ntrain_weather2.Species.unique()\n"

In [2466]:
#create dummy variables for Species data
train_species = pd.get_dummies(train_weather2.Species, prefix='Species')
#train_species['NumMosquitos'] = train_weather2.NumMosquitos
train_species.head(5)

Unnamed: 0,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS
0,0,0,1,0,0,0,0
1,0,0,0,1,0,0,0
2,0,0,0,1,0,0,0
3,0,0,1,0,0,0,0
4,0,0,0,1,0,0,0


In [2467]:
np.sum(train_species[train_species.index == 1].values)

1.0

In [2468]:
#add number of mosquito from all columns
'''
non_zero_sum = []
for vec in range(len(train_species.index)):
    non_zero_sum.append(np.sum(train_species[train_species.index == vec].values))
train_species['Mosquito_Sum'] = non_zero_sum
train_species['NumMosquitos'] = train_weather2['NumMosquitos']

train_species.head(5)
'''

"\nnon_zero_sum = []\nfor vec in range(len(train_species.index)):\n    non_zero_sum.append(np.sum(train_species[train_species.index == vec].values))\ntrain_species['Mosquito_Sum'] = non_zero_sum\ntrain_species['NumMosquitos'] = train_weather2['NumMosquitos']\n\ntrain_species.head(5)\n"

In [2469]:
'''
#Check if there is only one species of mosquito per trap
train_species.NumMosquitos.unique()
#Yes only 1 per trap
'''

'\n#Check if there is only one species of mosquito per trap\ntrain_species.NumMosquitos.unique()\n#Yes only 1 per trap\n'

In [2470]:
'''
#Multiply the number of mosquitos into the species
for x in range(len(train_species.index)):
    for i in train_species.columns:
        if train_species[i][x] > 0:
            train_species[i][x] = train_species[i][x] * train_species.NumMosquitos[x]
train_species2 = train_species.drop(['Mosquito_Sum','NumMosquitos'],1)
train_species2.shape
'''         
        

"\n#Multiply the number of mosquitos into the species\nfor x in range(len(train_species.index)):\n    for i in train_species.columns:\n        if train_species[i][x] > 0:\n            train_species[i][x] = train_species[i][x] * train_species.NumMosquitos[x]\ntrain_species2 = train_species.drop(['Mosquito_Sum','NumMosquitos'],1)\ntrain_species2.shape\n"

In [2471]:
#3rd column corelates with 2nd and 4th column, if value of column 3 is 1 then column 2 and 4 are 1

for r in train_species.index:
    if train_species[train_species.columns[2]][r] > 0:
        train_species[train_species.columns[1]][r] = train_species[train_species.columns[2]][r]
        train_species[train_species.columns[3]][r] = train_species[train_species.columns[2]][r]

#drop column 3
train_species= train_species.drop('Species_CULEX PIPIENS/RESTUANS',1)

train_species.head(5)

        
        


Unnamed: 0,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS
0,0,1,1,0,0,0
1,0,0,1,0,0,0
2,0,0,1,0,0,0
3,0,1,1,0,0,0
4,0,0,1,0,0,0


In [2474]:
#concategate species table back to train_weather table
train_final = pd.concat([train_weather, train_species], axis=1)

In [2475]:
train_final2 = train_final.drop(['Species', 'S_lat', 'S_long'], 1)

In [2476]:
#Insert Spray Data, remove time data
print spray.shape
spray_drop = spray.drop('Time',1)
spray_drop.columns = ['Date','Latitude_Spray','Longitude_Spray']
spray_drop.head()


(14835, 4)


Unnamed: 0,Date,Latitude_Spray,Longitude_Spray
0,2011-08-29,42.391623,-88.089163
1,2011-08-29,42.391348,-88.089163
2,2011-08-29,42.391022,-88.089157
3,2011-08-29,42.390637,-88.089158
4,2011-08-29,42.39041,-88.088858


In [2477]:
spray_drop.Date.head()

0    2011-08-29
1    2011-08-29
2    2011-08-29
3    2011-08-29
4    2011-08-29
Name: Date, dtype: object

In [2478]:
#merge spray data_drop data into train data
#train_spray = pd.merge(train_final2, spray_drop, how = 'inner', on = 'Date')

from time import strptime
for i in range(len(spray_drop.Date)):
    spray_drop.Date[i] = dt.datetime.strptime(spray_drop.Date[i], "%Y-%m-%d")


In [2479]:

spray_drop.columns = (['Spray_Date','Latitude_Spray', 'Longitude_Spray'])
spray_drop.head(3)

Unnamed: 0,Spray_Date,Latitude_Spray,Longitude_Spray
0,2011-08-29 00:00:00,42.391623,-88.089163
1,2011-08-29 00:00:00,42.391348,-88.089163
2,2011-08-29 00:00:00,42.391022,-88.089157


In [2480]:
# format training date string
import datetime as dt
for i in range(len(train_final2.Date)):
    train_final2.Date[i] = dt.datetime.strptime( train_final2.Date[i],"%Y-%m-%d")


#for j in range(len(train_final2.Date)):
#    train_final2.Date[j+1] = train_final2.Date[j+1][2:4]+ ' ' + train_final2.Date[j+1][5:7]+ ' ' + train_final2.Date[j+1][8:10]

In [2481]:
train_final2.head(2)

Unnamed: 0,Date,Address,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent,Weather Effect,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS
0,2007-05-29 00:00:00,"4100 North Oak Park Avenue, Chicago, IL 60634,...",41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,-0.851234,0,1,1,0,0,0
1,2007-05-29 00:00:00,"4100 North Oak Park Avenue, Chicago, IL 60634,...",41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,-0.851234,0,0,1,0,0,0


In [2482]:
train_spray = pd.concat([train_final2, spray_drop], axis =1, join = 'inner')

In [2483]:
print train_final2.shape
print spray_drop.shape
train_spray.shape

(10413, 18)
(14835, 3)


(10413, 21)

In [2505]:
train_date = train_final2[['Date','Latitude','Longitude']]
train_date.head()

Unnamed: 0,Date,Latitude,Longitude
0,2007-05-29 00:00:00,41.95469,-87.800991
1,2007-05-29 00:00:00,41.95469,-87.800991
2,2007-05-29 00:00:00,41.994991,-87.769279
3,2007-05-29 00:00:00,41.974089,-87.824812
4,2007-05-29 00:00:00,41.974089,-87.824812


In [2514]:
#Create the mean distance of the trap from the spray that was sprayed on the same day
min_distance = []
train_date['Distance'] = np.zeros(len(train_final2.index))
all_distance =[]
for d in train_date.index:
    print d
    distance = min_distance
    for i in spray_drop.index:
        if train_date.Date[d] == spray_drop.Spray_Date[i]:
            lat_diff = train_date.Latitude[d]-spray_drop.Latitude_Spray[i]
            long_diff = train_date.Longitude[d]-spray_drop.Longitude_Spray[i]
            d1 = 2*3959*acos(sqrt(sin(lat_diff/2)**2+cos(train_date.Latitude[d])*cos(spray_drop.Latitude_Spray[i])*sin(long_diff/2)**2))
            all_distance.append(d1)
            min_distance.append(np.mean(all_distance))
            
 
#distance.append(np.mean(all_distance))
#print all_distance.mean()
#train_spray.head(2)


0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


Try using .loc[row_index,col_indexer] = value instead
  app.launch_new_instance()


KeyboardInterrupt: 

In [2515]:
min_distance[1:5]

[]

In [None]:
#Create time difference between the trap and the spray that was sprayed in the closest location

all_distance2 = []

for d in train_date.index:
    for i in train_spray.index:
        if s
        #d2 = 2*3959*acos(sqrt(sin(lat_diff/2)**2+cos(train_spray.Latitude[d])*cos(train_spray.Latitude_Spray[i])*sin(long_diff/2)**2))
        
#closest_date = train_spray.Spray_Date[all_distance2.index(min(all_distance2))]
    

    
        









In [None]:
#Create spray time feature


In [None]:
import datetime as dt
for i in range(len(train_final3.Date)):
    train_final3.Date[i] = dt.datetime.strptime( train_final3.Date[i],"%Y-%m-%d")
features_train.head()

In [2498]:
train_final3 = train_final2.drop(train_final2.columns[[1,2,3,4,5]],1)
features_train = train_final3
target_train = train_final2['WnvPresent']
# format training date string


In [2499]:
features_train = features_train.drop('Date',1)

In [2500]:
#separate train test data
from sklearn.cross_validation import train_test_split, cross_val_score
x_train, x_test, y_train, y_test = train_test_split(features_train, target_train, test_size=0.3, random_state=10)






In [2503]:


#Or use decision tree
from sklearn import ensemble, preprocessing
clf = ensemble.RandomForestClassifier(n_jobs=-1, n_estimators=1000, min_samples_split=1)
clf.fit(x_train, x_test)


KeyError: '<ipython-input-1855-f4f0c973ab7a>'

In [2487]:
#evaluate accuracy of model
from sklearn.metrics import accuracy_score
y_predicted = lr.predict(x_test)
accuracy_score(y_test, y_predicted)


AttributeError: 'LogisticRegression' object has no attribute 'coef_'

In [2325]:
#cross validation to evaluate model
scores = cross_val_score(lr, features_train, target_train, cv=5)
scores.mean()

0.94286075482620579

# Clean Testing Data

In [2193]:
test.head()

Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
3,4,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX SALINARIUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
4,5,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TERRITANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9


In [2194]:
test_weather = pd.merge(test, weather_new, how = 'inner', on = 'Date')
test_weather.head(3)

Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,Weather Effect,S_lat,S_long
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,-0.942053,41.995,-87.933
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,-0.942053,41.995,-87.933
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,-0.942053,41.995,-87.933


In [2195]:
#use label encoder on training species data and replace Species in 
"""
le.fit(test_weather.Species)
species_new2 = le.transform(test_weather.Species) 
test_weather['Species'] = species_new2
"""

"\nle.fit(test_weather.Species)\nspecies_new2 = le.transform(test_weather.Species) \ntest_weather['Species'] = species_new2\n"

In [2196]:
#create dummy variables for Species data
test_species = pd.get_dummies(test_weather.Species, prefix = 'Species')
test_species.head()

Unnamed: 0,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS,Species_UNSPECIFIED CULEX
0,0,0,1,0,0,0,0,0
1,0,0,0,1,0,0,0,0
2,0,1,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,1,0


In [2197]:
#2d column corelates with 1st and 3rd column, if value of column 2 is 1 then column 1 and 3 are 1
for r in test_species.index:
    if test_species[test_species.columns[1]][r] == 1:
        test_species[test_species.columns[0]][r] = 1
        test_species[test_species.columns[2]][r] = 1
    else:
         1+1
#drop column 3
test_species = test_species.drop('Species_CULEX PIPIENS/RESTUANS',1)
test_species = test_species.drop('Species_UNSPECIFIED CULEX',1)
test_species['Species_CULEX ERRATICUS'] = np.zeros(len(test_species.index))

test_species.head(5)
        
        
        

Unnamed: 0,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS
0,0,0,0,0,0,0
1,0,0,1,0,0,0
2,0,1,0,0,0,0
3,0,0,0,1,0,0
4,0,0,0,0,0,1


In [2198]:
#combine species back into main dataframe
test_final = pd.concat([test_weather, test_species], axis=1)

In [2199]:
test_final.columns

Index([u'Id', u'Date', u'Address', u'Species', u'Block', u'Street', u'Trap', u'AddressNumberAndStreet', u'Latitude', u'Longitude', u'AddressAccuracy', u'Weather Effect', u'S_lat', u'S_long', u'Species_CULEX ERRATICUS', u'Species_CULEX PIPIENS', u'Species_CULEX RESTUANS', u'Species_CULEX SALINARIUS', u'Species_CULEX TARSALIS', u'Species_CULEX TERRITANS'], dtype='object')

In [2200]:
#drop unnecessary columns
test_final2 = test_final.drop(test_final.columns[[1,2,3,4,5,6,7,12,13]],1)
test_final2.head()

Unnamed: 0,Id,Latitude,Longitude,AddressAccuracy,Weather Effect,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS
0,1,41.95469,-87.800991,9,-0.942053,0,0,0,0,0,0
1,2,41.95469,-87.800991,9,-0.942053,0,0,1,0,0,0
2,3,41.95469,-87.800991,9,-0.942053,0,1,0,0,0,0
3,4,41.95469,-87.800991,9,-0.942053,0,0,0,1,0,0
4,5,41.95469,-87.800991,9,-0.942053,0,0,0,0,0,1


In [2201]:
#create features test
features_test = test_final[test_final2.columns[1:]]
features_test.head(5)

Unnamed: 0,Latitude,Longitude,AddressAccuracy,Weather Effect,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS
0,41.95469,-87.800991,9,-0.942053,0,0,0,0,0,0
1,41.95469,-87.800991,9,-0.942053,0,0,1,0,0,0
2,41.95469,-87.800991,9,-0.942053,0,1,0,0,0,0
3,41.95469,-87.800991,9,-0.942053,0,0,0,1,0,0
4,41.95469,-87.800991,9,-0.942053,0,0,0,0,0,1


In [2202]:
target_predicted = lr.predict(features_test)
target_predicted

array([0, 0, 0, ..., 0, 0, 0])

In [2203]:
#Create Submission file
submission = test_final2.drop(test_final2.columns[1:],1)
submission['WnvPresent'] = target_predicted
submission.head(5)

Unnamed: 0,Id,WnvPresent
0,1,0
1,2,0
2,3,0
3,4,0
4,5,0


In [2204]:
print submission.head()
print sample_submission.head()

   Id  WnvPresent
0   1           0
1   2           0
2   3           0
3   4           0
4   5           0
   Id  WnvPresent
0   1           0
1   2           0
2   3           0
3   4           0
4   5           0


In [2205]:
submission.to_csv('/home/jerd/Desktop/submission.csv',columns=['Id','WnvPresent'])

In [2206]:
submission.WnvPresent.unique()

array([0])