<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4: West Nile Virus Prediction

# Background

West Nile Virus (WNV) is most commonly spread to humans through infected mosquitoes. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever to serious neurological illnesses that can result in death.

In 2002, the first human case of WNV was reported in Chicago. By 2004, the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program that is still in effect today.

Due to the rising seasonal epidemic of WNV in Chicago, an increasing number of people need medical attention, so there is a need for better mosquito control. However, the City of Chicago can currently only react to new cases, and spraying pesticide is costly and must be done regularly.

# Problem Statement

Build a classification model with a recall score of at least 90% to predict the areas of the city where mosquitoes carrying the West Nile Virus are present. Use the predictions to optimise current mosquito control measures so that the mosquito problem is tackled effectively and efficiently, supported by a cost-benefit analysis.


## This notebook consists of data cleaning. Please refer to the relevant notebooks for the other parts.

### Contents:
- **Data Dictionary**
- **Data Cleaning**
- **Exploratory Data Analysis & Visualisation** - *refer to part 2 EDA notebook*
- **Modelling** - *refer to part 3 Modelling notebook*
- **Spray Analysis** - *refer to part 4 spray analysis and conclusion notebook*
- **Conclusion and Recommendations** - *refer to part 4 spray analysis and conclusion notebook*


# Data Dictionary

#### train.csv, test.csv

* The train set consists of data from 2007, 2009, 2011, and 2013
* The test set consists of data from 2008, 2010, 2012 and 2014 (without data on NumMosquitos and WnvPresent)


---
| Feature | Data Type | Description |
| :--- | :--- | :--- |
| **Date**| *object* | Date that the WNV test is performed |
| **Address** | *object* | Approximate address of the location of the trap. This is sent to the GeoCoder. |
| **Species** | *object* | The species of mosquitoes |
| **Block** | *int* | Block number of address |
| **Street** | *object* | Street name |
| **Trap** | *object* | Id of the trap |
| **AddressNumberAndStreet** | *object* | Approximate address returned from GeoCoder |
| **Latitude** | *float* | Latitude returned from GeoCoder |
| **Longitude** | *float* | Longitude returned from GeoCoder |
| **AddressAccuracy** | *int* | Accuracy returned from GeoCoder |
| **NumMosquitos** | *int* | Number of mosquitoes caught in this trap |
| **WnvPresent** | *int* | Whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present. |

---
#### spray.csv

* GIS data of spraying efforts in 2011 and 2013

---
| Feature | Data Type | Description |
| :--- | :--- | :--- |
| **Date** | *object* | The date of the spray |
| **Time** | *object* | The time of the spray |
| **Latitude** | *float* | Latitude of the spray |
| **Longitude** | *float* | Longitude of the spray |

---
#### weather.csv

* Weather data from 2007 to 2014. Column descriptions are in noaa_weather_qclcd_documentation.pdf, saved in the document folder.

# Data Cleaning

In [1]:
# Import Libraries

import pandas as pd
import numpy as np


In [2]:
# Import relevant CSV files

weather = pd.read_csv('data/datasets/weather.csv', parse_dates=True)
train = pd.read_csv('data/datasets/train.csv', parse_dates=True)
test = pd.read_csv('data/datasets/test.csv', parse_dates=True)
spray = pd.read_csv('data/datasets/spray.csv', parse_dates=True)
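
As far as the pandas documentation describes it, `parse_dates=True` only attempts to parse the index, so with no `index_col` set the `Date` columns remain `object`, which matches the `info()` output in this notebook. If datetime values are wanted at load time, the column can be named explicitly. A minimal sketch, using a hypothetical two-row sample in place of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for train.csv (illustration only)
sample_csv = io.StringIO("Date,Trap\n2007-05-29,T002\n2007-06-05,T007\n")

# Naming the column explicitly asks pandas to parse it as datetime
sample = pd.read_csv(sample_csv, parse_dates=['Date'])
print(sample.dtypes)
```

The cells in this notebook keep `Date` as a string, so this is an optional variation rather than the approach used here.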

### Data cleaning for train and test dataset

In [3]:
# Get info on train dataset

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    10506 non-null  object 
 1   Address                 10506 non-null  object 
 2   Species                 10506 non-null  object 
 3   Block                   10506 non-null  int64  
 4   Street                  10506 non-null  object 
 5   Trap                    10506 non-null  object 
 6   AddressNumberAndStreet  10506 non-null  object 
 7   Latitude                10506 non-null  float64
 8   Longitude               10506 non-null  float64
 9   AddressAccuracy         10506 non-null  int64  
 10  NumMosquitos            10506 non-null  int64  
 11  WnvPresent              10506 non-null  int64  
dtypes: float64(2), int64(4), object(6)
memory usage: 985.1+ KB


In [4]:
# Check Null values for train dataset

null_stats = pd.DataFrame(train.isnull().sum(), columns = ['missing_value_count'])
null_stats.loc[:,'% of dataset'] = np.round(null_stats.loc[:,'missing_value_count']/train.shape[0]*100,2)
null_stats.loc[null_stats.loc[:,'missing_value_count']>0,:].sort_values(by='missing_value_count',ascending=False)

Unnamed: 0,missing_value_count,% of dataset


From our preliminary checks:

* There are no null values in the train dataset
* Train dataset contains 12 columns
* Train dataset contains 10506 rows

No cleaning is required at this stage.
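
The same three-line null check is repeated for every dataset in this notebook, so it could be wrapped in a small helper. A minimal sketch, where the function name `null_summary` and the toy frame are hypothetical:

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Count and percentage of missing values per column (hypothetical helper)."""
    stats = pd.DataFrame(df.isnull().sum(), columns=['missing_value_count'])
    # The percentage is always relative to the row count of the frame being checked
    stats['% of dataset'] = np.round(stats['missing_value_count'] / df.shape[0] * 100, 2)
    # Keep only the columns that have at least one missing value
    return stats[stats['missing_value_count'] > 0].sort_values(
        by='missing_value_count', ascending=False)

# Toy frame for illustration only
demo = pd.DataFrame({'a': [1, None, 3, None], 'b': [1, 2, 3, 4]})
summary = null_summary(demo)
print(summary)
```

Passing the frame once also removes the risk of dividing by the row count of a different dataset.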

In [5]:
# Get info on test dataset

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Id                      116293 non-null  int64  
 1   Date                    116293 non-null  object 
 2   Address                 116293 non-null  object 
 3   Species                 116293 non-null  object 
 4   Block                   116293 non-null  int64  
 5   Street                  116293 non-null  object 
 6   Trap                    116293 non-null  object 
 7   AddressNumberAndStreet  116293 non-null  object 
 8   Latitude                116293 non-null  float64
 9   Longitude               116293 non-null  float64
 10  AddressAccuracy         116293 non-null  int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 9.8+ MB


In [6]:
# Check Null values
null_stats = pd.DataFrame(test.isnull().sum(), columns = ['missing_value_count'])
null_stats.loc[:,'% of dataset'] = np.round(null_stats.loc[:,'missing_value_count']/test.shape[0]*100,2)
null_stats.loc[null_stats.loc[:,'missing_value_count']>0,:].sort_values(by='missing_value_count',ascending=False)

Unnamed: 0,missing_value_count,% of dataset


From our preliminary checks:

* There are no null values in the test dataset either
* Test dataset contains 11 columns
    * There is a new column `Id`, and there is no `NumMosquitos` or `WnvPresent`
* Test dataset contains roughly 11 times as many rows as the train dataset (116293)

No cleaning is required at this stage.

### Data Cleaning for weather dataset

In [7]:
# Get info on weather dataset
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Station      2944 non-null   int64  
 1   Date         2944 non-null   object 
 2   Tmax         2944 non-null   int64  
 3   Tmin         2944 non-null   int64  
 4   Tavg         2944 non-null   object 
 5   Depart       2944 non-null   object 
 6   DewPoint     2944 non-null   int64  
 7   WetBulb      2944 non-null   object 
 8   Heat         2944 non-null   object 
 9   Cool         2944 non-null   object 
 10  Sunrise      2944 non-null   object 
 11  Sunset       2944 non-null   object 
 12  CodeSum      2944 non-null   object 
 13  Depth        2944 non-null   object 
 14  Water1       2944 non-null   object 
 15  SnowFall     2944 non-null   object 
 16  PrecipTotal  2944 non-null   object 
 17  StnPressure  2944 non-null   object 
 18  SeaLevel     2944 non-null   object 
 19  Result

In [8]:
# Display top 10 rows of weather dataset
weather.head(10)

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,...,,0,M,0.0,0.00,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,...,,M,M,M,0.00,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,...,BR,0,M,0.0,0.00,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,...,BR HZ,M,M,M,0.00,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,...,,0,M,0.0,0.00,29.39,30.12,11.7,7,11.9
5,2,2007-05-03,67,48,58,M,40,50,7,0,...,HZ,M,M,M,0.00,29.46,30.12,12.9,6,13.2
6,1,2007-05-04,66,49,58,4,41,50,7,0,...,RA,0,M,0.0,T,29.31,30.05,10.4,8,10.8
7,2,2007-05-04,78,51,M,M,42,50,M,M,...,,M,M,M,0.00,29.36,30.04,10.1,7,10.4
8,1,2007-05-05,66,53,60,5,38,49,5,0,...,,0,M,0.0,T,29.4,30.1,11.7,7,12.0
9,2,2007-05-05,66,54,60,M,39,50,5,0,...,,M,M,M,T,29.46,30.09,11.2,7,11.5


According to the weather data documentation, cells with 'M' indicate that the data is missing or not normally recorded by the station. The cell below also treats '-' as missing. Replacing these with null values gives a clearer indication of the sparsity of the data.

In [9]:
# Replace '-' and 'M' with NA
weather = weather.replace('-', np.nan)
weather = weather.replace('M', np.nan)
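
An alternative is to declare the placeholder tokens when the file is read. A minimal sketch, assuming the tokens appear exactly as 'M' and '-' with no surrounding whitespace, and using a hypothetical two-row sample instead of the real weather.csv:

```python
import io
import pandas as pd

# Hypothetical sample mimicking a few weather.csv columns (illustration only)
raw = io.StringIO(
    "Station,Date,Tavg,Depart,Sunrise\n"
    "1,2007-05-01,67,14,0448\n"
    "2,2007-05-01,68,M,-\n"
)

# Treat whole-cell 'M' and '-' as missing while reading;
# negative numbers such as -3 are unaffected because only exact matches count
sample = pd.read_csv(raw, na_values=['M', '-'])
print(sample.isnull().sum())
```

With this approach pandas infers numeric dtypes for the affected columns straight away, which differs from the cells in this notebook, where they stay `object` until they are cast explicitly.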

In [10]:
# Check Null values
null_stats = pd.DataFrame(weather.isnull().sum(), columns = ['missing_value_count'])
null_stats.loc[:,'% of dataset'] = np.round(null_stats.loc[:,'missing_value_count']/weather.shape[0]*100,2)
null_stats.loc[null_stats.loc[:,'missing_value_count']>0,:].sort_values(by='missing_value_count',ascending=False)

Unnamed: 0,missing_value_count,% of dataset
Water1,2944,100.0
Depart,1472,50.0
Sunrise,1472,50.0
Sunset,1472,50.0
Depth,1472,50.0
SnowFall,1472,50.0
Tavg,11,0.37
Heat,11,0.37
Cool,11,0.37
SeaLevel,9,0.31


In [11]:
# Display the number of records for the respective stations
weather.value_counts('Station')

Station
1    1472
2    1472
dtype: int64

Water1 has 100% of its values missing, so it will be dropped completely. SnowFall and Depth will also be dropped: the train and test data were collected between spring and autumn, so these values are not applicable. All other missing values, except for Tavg, will be filled with the value from Station 1. Tavg will have its value calculated from Tmax and Tmin.

In [12]:
# Drop Water1, SnowFall, Depth
weather.drop(['Water1', 'SnowFall', 'Depth'], axis=1, inplace=True)

In [13]:
# Replace missing values with values from station 1
# In the rows shown above, each Station 2 row directly follows the Station 1 row
# for the same date, so a forward fill copies the value from the row above.
# .ffill() replaces the deprecated fillna(method='ffill').
ffill_cols = ['Depart', 'Sunrise', 'Sunset', 'Heat', 'Cool', 'SeaLevel',
              'WetBulb', 'StnPressure', 'AvgSpeed', 'PrecipTotal']
weather[ffill_cols] = weather[ffill_cols].ffill()
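
The forward fill above depends on the row order of the file. A more explicit variation is to sort by date and station and fill within each date only, so a value can never be carried over from a previous day. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame, deliberately out of order; Station 2 is missing both readings
demo = pd.DataFrame({
    'Station': [2, 1, 1, 2],
    'Date': ['2007-05-01', '2007-05-01', '2007-05-02', '2007-05-02'],
    'Depart': [np.nan, 14.0, -3.0, np.nan],
    'SeaLevel': [np.nan, 29.82, 30.09, np.nan],
})

cols = ['Depart', 'SeaLevel']
# Sort so Station 1 precedes Station 2 within each date
demo = demo.sort_values(['Date', 'Station']).reset_index(drop=True)
# Forward fill inside each date group only
demo[cols] = demo.groupby('Date')[cols].ffill()
print(demo)
```

If Station 1 itself is missing a value on some date, this version leaves the gap in place instead of silently borrowing from the day before.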

In [14]:
# Function to fill a missing Tavg with the rounded-up average of Tmax and Tmin
def fill_avg(xmax, xmin, xavg):
    if pd.isnull(xavg):
        return np.ceil((xmax + xmin) / 2)
    return xavg

In [15]:
# Apply function to df to impute missing Tavg values
weather['Tavg'] = weather.apply(lambda x: fill_avg(x['Tmax'], x['Tmin'], x['Tavg']), axis=1)
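
The same imputation can be written without a row-wise `apply`. A minimal sketch on a hypothetical two-row frame, assuming Tavg holds numeric strings or missing values:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the second row has no Tavg
demo = pd.DataFrame({
    'Tmax': [83, 78],
    'Tmin': [50, 51],
    'Tavg': ['67', np.nan],
})

# Rounded-up midpoint of Tmax and Tmin, computed for every row
midpoint = np.ceil((demo['Tmax'] + demo['Tmin']) / 2)
# Parse existing values, then use the midpoint only where Tavg is missing
demo['Tavg'] = pd.to_numeric(demo['Tavg']).fillna(midpoint).astype(int)
print(demo)
```

This also performs the integer cast in the same step, which the notebook does separately later on.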

In [16]:
# Check Null values
null_stats = pd.DataFrame(weather.isnull().sum(), columns = ['missing_value_count'])
null_stats.loc[:,'% of dataset'] = np.round(null_stats.loc[:,'missing_value_count']/weather.shape[0]*100,2)
null_stats.loc[null_stats.loc[:,'missing_value_count']>0,:].sort_values(by='missing_value_count',ascending=False)

Unnamed: 0,missing_value_count,% of dataset


In [17]:
# Check data types
weather.dtypes

Station          int64
Date            object
Tmax             int64
Tmin             int64
Tavg            object
Depart          object
DewPoint         int64
WetBulb         object
Heat            object
Cool            object
Sunrise         object
Sunset          object
CodeSum         object
PrecipTotal     object
StnPressure     object
SeaLevel        object
ResultSpeed    float64
ResultDir        int64
AvgSpeed        object
dtype: object

In [18]:
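# PrecipTotal contains 'T' (with varying leading spaces), which refers to trace precipitation
# Replace T with 0.00; assign the result back instead of using inplace=True on a column selection
weather['PrecipTotal'] = weather['PrecipTotal'].replace(['T', ' T', '  T'], 0.00)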
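
Listing every spacing variant of 'T' is fragile if another variant appears. A minimal sketch that strips whitespace first, using a hypothetical sample Series:

```python
import pandas as pd

# Hypothetical sample of PrecipTotal values (illustration only)
precip = pd.Series(['0.00', 'T', '  T', '0.12'])

# Strip surrounding whitespace so every trace marker becomes a plain 'T',
# then map trace to zero and cast to float
cleaned = precip.str.strip().replace('T', '0.00').astype(float)
print(cleaned.tolist())
```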

In [19]:
# Set correct dtypes
weather['Tavg'] = weather['Tavg'].astype(int)
weather['WetBulb'] = weather['WetBulb'].astype(int)
weather['Heat'] = weather['Heat'].astype(int)
weather['Cool'] = weather['Cool'].astype(int)
weather['PrecipTotal'] = weather['PrecipTotal'].astype(float)
weather['StnPressure'] = weather['StnPressure'].astype(float)
weather['SeaLevel'] = weather['SeaLevel'].astype(float)
weather['AvgSpeed'] = weather['AvgSpeed'].astype(float)
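
Before a cast like the one above, it can help to list any leftover tokens that would make `astype` fail. A minimal sketch, where the helper name `non_numeric_tokens` and the sample values are hypothetical:

```python
import pandas as pd

def non_numeric_tokens(series):
    """Distinct values that cannot be parsed as numbers (hypothetical helper)."""
    parsed = pd.to_numeric(series, errors='coerce')
    # Values that were present but failed to parse become NaN after coercion
    failed = parsed.isna() & series.notna()
    return sorted(series[failed].unique())

# Hypothetical sample with two placeholder tokens left in
demo = pd.Series(['29.10', '29.18', 'M', '  T'])
tokens = non_numeric_tokens(demo)
print(tokens)
```

An empty list would indicate that the column is ready to be cast.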

### Data Cleaning for spray dataset

In [20]:
# Get info on spray dataset
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       14835 non-null  object 
 1   Time       14251 non-null  object 
 2   Latitude   14835 non-null  float64
 3   Longitude  14835 non-null  float64
dtypes: float64(2), object(2)
memory usage: 463.7+ KB


In [21]:
# Display top 5 rows
spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


In [22]:
# Display counts for each date
spray['Date'].value_counts()

2013-08-15    2668
2013-08-29    2302
2013-07-17    2202
2011-09-07    2114
2013-07-25    1607
2013-08-22    1587
2013-08-08    1195
2013-09-05     924
2013-08-16     141
2011-08-29      95
Name: Date, dtype: int64

The spray data covers only 2011 and 2013, and the spraying dates are very irregular. The train and test datasets together span 2007 to 2014, so the spray data covers only a small portion of that period. It is therefore not suitable for model training and will not be used for it, but it could still provide useful insights through visualisations.
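
The coverage per year can be summarised directly from the date counts. A minimal sketch that re-enters the counts from the `value_counts()` output above as a Series, so it runs without the CSV file:

```python
import pandas as pd

# Spray dates and row counts copied from the value_counts() output above
spray_counts = pd.Series({
    '2013-08-15': 2668, '2013-08-29': 2302, '2013-07-17': 2202,
    '2011-09-07': 2114, '2013-07-25': 1607, '2013-08-22': 1587,
    '2013-08-08': 1195, '2013-09-05': 924, '2013-08-16': 141,
    '2011-08-29': 95,
})

# Group the per-date counts by calendar year
years = pd.to_datetime(spray_counts.index).year
per_year = spray_counts.groupby(years).sum()
print(per_year)
```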

In [23]:
# Check Null values
null_stats = pd.DataFrame(spray.isnull().sum(), columns = ['missing_value_count'])
null_stats.loc[:,'% of dataset'] = np.round(null_stats.loc[:,'missing_value_count']/spray.shape[0]*100,2)
null_stats.loc[null_stats.loc[:,'missing_value_count']>0,:].sort_values(by='missing_value_count',ascending=False)

Unnamed: 0,missing_value_count,% of dataset
Time,584,3.94


There are 584 missing Time values in the spray dataset.

In [24]:
# Display the rows with NA
spray[spray.Time.isnull()]

Unnamed: 0,Date,Time,Latitude,Longitude
1030,2011-09-07,,41.987092,-87.794286
1031,2011-09-07,,41.987620,-87.794382
1032,2011-09-07,,41.988004,-87.794574
1033,2011-09-07,,41.988292,-87.795486
1034,2011-09-07,,41.988100,-87.796014
...,...,...,...,...
1609,2011-09-07,,41.995876,-87.811615
1610,2011-09-07,,41.995972,-87.810271
1611,2011-09-07,,41.995684,-87.810319
1612,2011-09-07,,41.994724,-87.810415


In [25]:
# Display 5 rows before the rows with NA
spray.iloc[1024:1029]

Unnamed: 0,Date,Time,Latitude,Longitude
1024,2011-09-07,7:44:32 PM,41.98646,-87.794225
1025,2011-09-07,7:44:32 PM,41.98646,-87.794225
1026,2011-09-07,7:44:32 PM,41.98646,-87.794225
1027,2011-09-07,7:44:32 PM,41.98646,-87.794225
1028,2011-09-07,7:44:32 PM,41.98646,-87.794225


In [26]:
# Display 5 rows after the rows with NA
spray.iloc[1614: 1619]

Unnamed: 0,Date,Time,Latitude,Longitude
1614,2011-09-07,7:46:30 PM,41.973465,-87.827643
1615,2011-09-07,7:46:40 PM,41.973457,-87.828302
1616,2011-09-07,7:46:50 PM,41.973435,-87.829073
1617,2011-09-07,7:47:00 PM,41.973442,-87.829835
1618,2011-09-07,7:47:10 PM,41.973418,-87.830568


All missing values are on 7 Sep 2011, and the rows with a missing Time sit between records timestamped 7:44:32 PM and 7:46:30 PM on that day. These values will be filled with a time that falls in between.
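
The claim that the gaps are confined to a single date can be checked programmatically. A minimal sketch on a hypothetical four-row frame; on the real data the same two lines would be applied to `spray`:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
demo = pd.DataFrame({
    'Date': ['2011-08-29', '2011-09-07', '2011-09-07', '2011-09-07'],
    'Time': ['6:56:58 PM', '7:44:32 PM', np.nan, '7:46:30 PM'],
})

missing = demo['Time'].isnull()
# Distinct dates on which at least one Time value is missing
missing_dates = sorted(set(demo.loc[missing, 'Date']))
print(missing_dates)
```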

In [27]:
# Fill NA with '7:45:00 PM'; assign the result back instead of using inplace=True on a column selection
spray['Time'] = spray['Time'].fillna('7:45:00 PM')

The Spray dataset is now cleaned. 

### Save clean dataset

In [28]:
# Save df as csv
weather.to_csv('data/datasets/weather_clean.csv', index=False)
spray.to_csv('data/datasets/spray_clean.csv', index=False)

# No change has been made to train and test dataset