train.csv, test.csv: the training and test sets of the main dataset. The training set contains data from 2007, 2009, 2011, and 2013; for the test set you are asked to predict the WNV test results for 2008, 2010, 2012, and 2014.
* Id: the id of the record
* Date: date on which the WNV test was performed
* Address: approximate address of the trap location. This is the string sent to the GeoCoder.
* Species: the species of the mosquitoes
* Block: block number of address
* Street: street name
* Trap: Id of the trap
* AddressNumberAndStreet: approximate address returned from GeoCoder
* Latitude, Longitude: Latitude and Longitude returned from GeoCoder
* AddressAccuracy: accuracy returned from GeoCoder
* NumMosquitos: number of mosquitoes caught in this trap
* WnvPresent: whether West Nile Virus was present in these mosquitoes. 1 means WNV is present, 0 means it is not

In [78]:
import numpy as np
import pandas as pd
import plotly.express as px

In [79]:
train_df = pd.read_csv('./data/train.csv')

In [80]:
train_df.shape

(10506, 12)

In [81]:
train_df.dtypes

Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
dtype: object

In [82]:
train_df.isnull().sum() # no null values

Date                      0
Address                   0
Species                   0
Block                     0
Street                    0
Trap                      0
AddressNumberAndStreet    0
Latitude                  0
Longitude                 0
AddressAccuracy           0
NumMosquitos              0
WnvPresent                0
dtype: int64

In [83]:
train_df['Species'].unique() # no abnormalities observed in Species

array(['CULEX PIPIENS/RESTUANS', 'CULEX RESTUANS', 'CULEX PIPIENS',
       'CULEX SALINARIUS', 'CULEX TERRITANS', 'CULEX TARSALIS',
       'CULEX ERRATICUS'], dtype=object)

In [84]:
train_df['NumMosquitos'].unique() # no abnormalities observed in NumMosquitos

array([ 1,  4,  2,  3,  5,  9,  7, 10,  8,  6, 19, 20, 25, 16, 11, 12, 28,
       18, 50, 35, 14, 22, 21, 37, 27, 13, 39, 29, 15, 17, 34, 26, 32, 47,
       44, 23, 46, 48, 42, 33, 45, 24, 41, 38, 40, 36, 43, 49, 30, 31])

In [57]:
# Latitude and Longitude already encode the trap location, so the redundant address columns are dropped.
train_df.drop(columns=['Address', 'Block', 'Street', 'AddressNumberAndStreet'], inplace=True)

In [58]:
train_df.columns

Index(['Date', 'Species', 'Trap', 'Latitude', 'Longitude', 'AddressAccuracy',
       'NumMosquitos', 'WnvPresent'],
      dtype='object')

In [59]:
def split_date(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day

In [60]:
split_date(train_df)
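
The `split_date` helper modifies the frame it receives. As a hedged alternative, the sketch below returns a new frame instead and also derives the ISO week number. The function name `split_date_copy`, the extra `Week` column, and the three-row `toy` frame are all assumptions made for illustration; they are not part of the original notebook or dataset.

```python
import pandas as pd


def split_date_copy(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Year, Month, Day and Week columns (hypothetical helper)."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out["Year"] = out["Date"].dt.year
    out["Month"] = out["Date"].dt.month
    out["Day"] = out["Date"].dt.day
    # isocalendar() returns a frame with year/week/day columns; keep the week only
    out["Week"] = out["Date"].dt.isocalendar().week.astype(int)
    return out


# Made-up dates, used only to exercise the helper
toy = pd.DataFrame({"Date": ["2007-05-29", "2007-06-05", "2009-08-01"]})
parts = split_date_copy(toy)
print(parts)
```

Because the input is copied, the cell can be re-run without first reloading the data.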

In [88]:
train_df.columns

Index(['Date', 'Address', 'Species', 'Block', 'Street', 'Trap',
       'AddressNumberAndStreet', 'Latitude', 'Longitude', 'AddressAccuracy',
       'NumMosquitos', 'WnvPresent'],
      dtype='object')

In [85]:
train_df['Address'].unique()

array(['4100 North Oak Park Avenue, Chicago, IL 60634, USA',
       '6200 North Mandell Avenue, Chicago, IL 60646, USA',
       '7900 West Foster Avenue, Chicago, IL 60656, USA',
       '1500 West Webster Avenue, Chicago, IL 60614, USA',
       '2500 West Grand Avenue, Chicago, IL 60654, USA',
       '1100 Roosevelt Road, Chicago, IL 60608, USA',
       '1100 West Chicago Avenue, Chicago, IL 60642, USA',
       '2100 North Stave Street, Chicago, IL 60647, USA',
       '2200 North Cannon Drive, Chicago, IL 60614, USA',
       '2200 West 113th Street, Chicago, IL 60643, USA',
       '1100 South Peoria Street, Chicago, IL 60608, USA',
       '1700 West 95th Street, Chicago, IL 60643, USA',
       '2200 West 89th Street, Chicago, IL 60643, USA',
       'North Streeter Drive, Chicago, IL 60611, USA',
       '6500 North Oak Park Avenue, Chicago, IL 60631, USA',
       '7500 North Oakley Avenue, Chicago, IL 60645, USA',
       '1500 North Long Avenue, Chicago, IL 60651, USA',
       '8900 Sou

In [93]:
# Print every column name together with its values
for col_name, data in train_df.items():
    print("col_name:", col_name, "\ndata:", data)

col_name: Date 
data: 0        2007-05-29
1        2007-05-29
2        2007-05-29
3        2007-05-29
4        2007-05-29
5        2007-05-29
6        2007-05-29
7        2007-05-29
8        2007-05-29
9        2007-05-29
10       2007-05-29
11       2007-05-29
12       2007-05-29
13       2007-05-29
14       2007-05-29
15       2007-05-29
16       2007-05-29
17       2007-05-29
18       2007-05-29
19       2007-05-29
20       2007-05-29
21       2007-05-29
22       2007-05-29
23       2007-05-29
24       2007-05-29
25       2007-06-05
26       2007-06-05
27       2007-06-05
28       2007-06-05
29       2007-06-05
30       2007-06-05
31       2007-06-05
32       2007-06-05
33       2007-06-05
34       2007-06-05
35       2007-06-05
36       2007-06-05
37       2007-06-05
38       2007-06-05
39       2007-06-05
40       2007-06-05
41       2007-06-05
42       2007-06-05
43       2007-06-05
44       2007-06-05
45       2007-06-05
46       2007-06-05
47       2007-06-05
48       2007-06-0

data: 0                        N OAK PARK AVE
1                        N OAK PARK AVE
2                         N MANDELL AVE
3                          W FOSTER AVE
4                          W FOSTER AVE
5                         W WEBSTER AVE
6                           W GRAND AVE
7                           W ROOSEVELT
8                           W ROOSEVELT
9                             W CHICAGO
10                           N STAVE ST
11                          N CANNON DR
12                          N CANNON DR
13                           W 113TH ST
14                           W 113TH ST
15                          S PEORIA ST
16                            W 95TH ST
17                            W 89TH ST
18                            W 89TH ST
19                        N STREETER DR
20                        N STREETER DR
21                       N OAK PARK AVE
22                         N OAKLEY AVE
23                           N LONG AVE
24                       S CARPENT

data: 0                       4100  N OAK PARK AVE, Chicago, IL
1                       4100  N OAK PARK AVE, Chicago, IL
2                        6200  N MANDELL AVE, Chicago, IL
3                         7900  W FOSTER AVE, Chicago, IL
4                         7900  W FOSTER AVE, Chicago, IL
5                        1500  W WEBSTER AVE, Chicago, IL
6                          2500  W GRAND AVE, Chicago, IL
7                          1100  W ROOSEVELT, Chicago, IL
8                          1100  W ROOSEVELT, Chicago, IL
9                            1100  W CHICAGO, Chicago, IL
10                          2100  N STAVE ST, Chicago, IL
11                         2200  N CANNON DR, Chicago, IL
12                         2200  N CANNON DR, Chicago, IL
13                          2200  W 113TH ST, Chicago, IL
14                          2200  W 113TH ST, Chicago, IL
15                         1100  S PEORIA ST, Chicago, IL
16                           1700  W 95TH ST, Chicago, IL
17      

In [70]:
train_df['Year'].unique()

array([2007, 2009, 2011, 2013])

In [62]:
year_virus = train_df[['Year', 'WnvPresent']].groupby('Year', as_index=False).sum()

In [64]:
# Summing the 0/1 WnvPresent flag gives the number of WNV-positive records per year
fig = px.bar(year_virus, x='Year', y='WnvPresent')
fig.update_layout(
    title="WNV-positive records by year",
    xaxis_title="Year",
    yaxis_title="WNV-positive records",
    width=500,
    height=300,
)
fig.show()

In [69]:
train_df['Month'].unique()

array([ 5,  6,  7,  8,  9, 10])

In [65]:
month_virus = train_df[['Month', 'WnvPresent']].groupby('Month', as_index = False).sum()

In [66]:
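# Summing the 0/1 WnvPresent flag gives the number of WNV-positive records per month
fig = px.bar(month_virus, x='Month', y='WnvPresent')
fig.update_layout(
    title="WNV-positive records by month",
    xaxis_title="Month",
    yaxis_title="WNV-positive records",
    width=500,
    height=300,
)
fig.show()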
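
The year, month, day, and species cells all repeat the same group-and-sum step. A small helper could factor that out. This is only a sketch: the name `wnv_positive_counts` and the `toy` frame are hypothetical, and the values are made up.

```python
import pandas as pd


def wnv_positive_counts(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count WNV-positive rows per value of `column` (hypothetical helper)."""
    # WnvPresent is a 0/1 flag, so its sum is the number of positive rows
    return df[[column, "WnvPresent"]].groupby(column, as_index=False).sum()


# Made-up rows, used only to exercise the helper
toy = pd.DataFrame({
    "Month": [7, 7, 8, 8, 8, 9],
    "WnvPresent": [0, 1, 1, 1, 0, 0],
})
month_counts = wnv_positive_counts(toy, "Month")
print(month_counts)
```

The returned frame has the same shape as `month_virus`, so it could be passed to `px.bar` in the same way.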

In [73]:
train_df['Day'].unique()

array([29,  5, 26,  2, 11, 18, 19, 25, 27,  1,  3,  7,  8,  9, 15, 16, 17,
       21, 22, 24, 28,  4,  6, 12, 10, 13, 31, 14, 30, 23])

In [71]:
day_virus = train_df[['Day', 'WnvPresent']].groupby('Day', as_index = False).sum()

In [72]:
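# Summing the 0/1 WnvPresent flag gives the number of WNV-positive records per day of month
fig = px.bar(day_virus, x='Day', y='WnvPresent')
fig.update_layout(
    title="WNV-positive records by day of month",
    xaxis_title="Day",
    yaxis_title="WNV-positive records",
    width=500,
    height=300,
)
fig.show()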

In [74]:
species_virus = train_df[['Species', 'WnvPresent']].groupby('Species', as_index = False).sum()

In [75]:
species_virus

                  Species  WnvPresent
0         CULEX ERRATICUS           0
1           CULEX PIPIENS         240
2  CULEX PIPIENS/RESTUANS         262
3          CULEX RESTUANS          49
4        CULEX SALINARIUS           0
5          CULEX TARSALIS           0
6         CULEX TERRITANS           0


In [76]:
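# Summing the 0/1 WnvPresent flag gives the number of WNV-positive records per species
fig = px.bar(species_virus, x='Species', y='WnvPresent')
fig.update_layout(
    title="WNV-positive records by species",
    xaxis_title="Species",
    yaxis_title="WNV-positive records",
    width=500,
    height=300,
)
fig.show()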
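
Raw positive counts per species depend on how many records each species has, so a species with few records will show a low count regardless of how often it tests positive. One possible complement, sketched here on made-up data, is to report the record count and the positive rate next to the count. The `toy` frame and the output column names `records`, `positives`, and `positive_rate` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; the species names are real values from the Species column
toy = pd.DataFrame({
    "Species": [
        "CULEX PIPIENS", "CULEX PIPIENS",
        "CULEX RESTUANS", "CULEX RESTUANS", "CULEX RESTUANS",
        "CULEX TERRITANS",
    ],
    "WnvPresent": [1, 0, 0, 0, 1, 0],
})

species_rate = toy.groupby("Species", as_index=False).agg(
    records=("WnvPresent", "count"),       # rows per species
    positives=("WnvPresent", "sum"),       # WNV-positive rows
    positive_rate=("WnvPresent", "mean"),  # share of rows that are positive
)
print(species_rate)
```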

In [98]:
a = train_df

In [105]:
train_df.groupby(['Date','Species','Latitude','Longitude'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fefe4c3fb20>

In [113]:
filtered = train_df.groupby(['Date','Species','Latitude','Longitude'])[['NumMosquitos','WnvPresent']].sum()
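
Note that summing `WnvPresent` within a group gives the number of positive rows, not a 0/1 flag. If a presence flag per date, species, and location is wanted, one option is to sum `NumMosquitos` but take the maximum of `WnvPresent`. The sketch below uses a made-up `toy` frame, and the choice of `max` is an assumption about the intended target, not something the notebook states.

```python
import pandas as pd

# Made-up rows: three records share the same key, one record stands alone
toy = pd.DataFrame({
    "Date": ["2007-07-25"] * 3 + ["2007-08-01"],
    "Species": ["CULEX PIPIENS"] * 3 + ["CULEX RESTUANS"],
    "Latitude": [41.673408] * 3 + [41.954690],
    "Longitude": [-87.599862] * 3 + [-87.800991],
    "NumMosquitos": [50, 50, 12, 3],
    "WnvPresent": [1, 1, 0, 0],
})

keys = ["Date", "Species", "Latitude", "Longitude"]
merged = toy.groupby(keys, as_index=False).agg(
    NumMosquitos=("NumMosquitos", "sum"),  # total mosquitoes across the merged rows
    WnvPresent=("WnvPresent", "max"),      # 1 if any merged row was positive
)
print(merged)
```

With `as_index=False` the key columns stay as regular columns, so no separate `reset_index()` call is needed.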

In [116]:
filtered.reset_index()  # call the method; the bare attribute only displays the bound method

In [None]:
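# Generic groupby-count pattern kept for reference; `df`, "state" and "last_name" are placeholders, not columns of this dataset
n_by_state = df.groupby("state")["last_name"].count()
n_by_state.head(10)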

In [96]:
# Looking for WNV-positive records where NumMosquitos equals 50, the largest value observed in that column.
pd.set_option("display.max_rows",None)
wnv_present = train_df[train_df['WnvPresent'] == 1]
wnv_present[wnv_present['NumMosquitos'] == 50]


Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
553,2007-07-18,"3800 East 115th Street, Chicago, IL 60617, USA",CULEX PIPIENS/RESTUANS,38,E 115TH ST,T215,"3800 E 115TH ST, Chicago, IL",41.686398,-87.531635,8,50,1
603,2007-07-25,"South Doty Avenue, Chicago, IL, USA",CULEX PIPIENS/RESTUANS,12,S DOTY AVE,T115,"1200 S DOTY AVE, Chicago, IL",41.673408,-87.599862,5,50,1
611,2007-07-25,"South Doty Avenue, Chicago, IL, USA",CULEX PIPIENS/RESTUANS,12,S DOTY AVE,T115,"1200 S DOTY AVE, Chicago, IL",41.673408,-87.599862,5,50,1
618,2007-07-25,"South Doty Avenue, Chicago, IL, USA",CULEX PIPIENS,12,S DOTY AVE,T115,"1200 S DOTY AVE, Chicago, IL",41.673408,-87.599862,5,50,1
660,2007-07-25,"South Doty Avenue, Chicago, IL, USA",CULEX PIPIENS,12,S DOTY AVE,T115,"1200 S DOTY AVE, Chicago, IL",41.673408,-87.599862,5,50,1
661,2007-07-25,"South Doty Avenue, Chicago, IL, USA",CULEX PIPIENS,12,S DOTY AVE,T115,"1200 S DOTY AVE, Chicago, IL",41.673408,-87.599862,5,50,1
777,2007-08-01,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,50,1
917,2007-08-01,"4200 West 127th Street, Alsip, IL 60803, USA",CULEX PIPIENS/RESTUANS,42,W 127TH PL,T135,"4200 W 127TH PL, Chicago, IL",41.662014,-87.724608,8,50,1
986,2007-08-01,"7000 North Moselle Avenue, Chicago, IL 60646, USA",CULEX PIPIENS,70,N MOSELL AVE,T008,"7000 N MOSELL AVE, Chicago, IL",42.008314,-87.777921,9,50,1
1306,2007-08-01,"South Avenue L, Chicago, IL 60617, USA",CULEX PIPIENS/RESTUANS,11,S AVENUE L,T103,"1100 S AVENUE L, Chicago, IL",41.702724,-87.536497,5,50,1
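
The two-step filter (first `WnvPresent == 1`, then `NumMosquitos == 50`) can also be written as a single combined boolean mask, which makes it easy to compute what share of positive records sit at the maximum count. This is a sketch on made-up rows: the `toy` frame and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; 50 is the largest NumMosquitos value seen in the training data
toy = pd.DataFrame({
    "Trap": ["T115", "T115", "T002", "T008", "T215"],
    "NumMosquitos": [50, 12, 50, 50, 7],
    "WnvPresent": [1, 1, 0, 1, 0],
})

is_positive = toy["WnvPresent"] == 1
is_at_max = toy["NumMosquitos"] == 50

# Parentheses or named masks are required: & binds tighter than ==
at_max_and_positive = toy[is_positive & is_at_max]
share_at_max = len(at_max_and_positive) / is_positive.sum()

print(at_max_and_positive)
print(f"share of positive records at the maximum count: {share_at_max:.2f}")
```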


In [97]:
wnv_present['NumMosquitos'] == 50

553       True
603       True
611       True
618       True
660       True
661       True
777       True
778      False
784      False
812      False
842      False
902      False
917       True
919      False
970      False
986       True
1047     False
1078     False
1159     False
1166     False
1250     False
1306      True
1309      True
1310      True
1332     False
1411      True
1423     False
1476     False
1502     False
1503     False
1515     False
1517     False
1519     False
1561     False
1575      True
1576     False
1591     False
1599     False
1633     False
1653     False
1671     False
1685     False
1699      True
1709      True
1730     False
1745      True
1763     False
1772     False
1779     False
1781     False
1788     False
1789     False
1791     False
1802     False
1818     False
1819     False
1821     False
1831     False
1832     False
1833     False
1834     False
1835     False
1837      True
1838      True
1839     False
1850     False
1867     F