# Project 4: West Nile Virus Prediction

Notebook 3 of 4

# Feature Engineering

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import datetime

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [2]:
# Import datasets
train = pd.read_csv('./datasets/cleaned_train.csv')
test = pd.read_csv('./datasets/cleaned_test.csv')
spray = pd.read_csv('./datasets/cleaned_spray.csv')
weather = pd.read_csv('./datasets/cleaned_weather.csv')

We've observed that our features generally have a fairly low correlation with `WnvPresent`; the strongest feature is `NumMosquitos`, with a correlation of 0.2. The [CDC](https://www.cdc.gov/westnile/resourcepages/mosqSurvSoft.html#:~:text=The%20simplest%20estimate%2C%20the%20minimum,goals%20of%20the%20surveillance%20program) defines the simplest, traditional estimate, the <b>Minimum Infection Rate (MIR)</b>, which assumes that a positive pool contains only one infected mosquito (an assumption that may be invalid):

$$ \text{MIR} = 1000 * {\text{number of positive pools} \over \text{total number of mosquitos in pools tested}} $$
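
As a quick illustration of the formula above, here is a minimal sketch of the MIR calculation. The function name and the pool counts are hypothetical and are not taken from our dataset.

```python
def minimum_infection_rate(positive_pools: int, mosquitos_tested: int) -> float:
    """Return the MIR: positive pools per 1,000 mosquitos tested."""
    if mosquitos_tested <= 0:
        raise ValueError("mosquitos_tested must be positive")
    return 1000 * positive_pools / mosquitos_tested


# Hypothetical week: 3 positive pools out of 1,500 mosquitos tested
example_mir = minimum_infection_rate(positive_pools=3, mosquitos_tested=1500)
print(example_mir)  # 2.0
```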

CDC has developed easy-to-use programs for calculating virus infection rate (IR) estimates from mosquito pool data using methods that do not require the assumption used in the MIR calculation.

$$ \text{IR} = 1000 * {\text{estimated number of infected mosquitos} \over \text{total number of mosquitos tested}} $$

The CDC encourages surveillance programs to incorporate virus infection rate (IR) into their mosquito-based evaluation of local virus activity patterns. At the county level or below, weekly tracking of mosquito IR can provide important predictive indicators of transmission activity levels associated with elevated human risk.

Unfortunately, our test data doesn't have the information we need to make this a usable feature. We discussed estimating the number of mosquitos based on total rows in the test set, but we ultimately decided that this was a slightly [<i>'hackish'</i> solution](https://www.kaggle.com/c/predict-west-nile-virus/discussion/14790). We'll drop NumMosquitos moving forward.

In [3]:
# Dropping NumMosquitos as it isn't present within test data
train = train.drop(columns='NumMosquitos')

Our remaining features can be categorised as a mixture of 
1. time, 
2. weather (e.g. Temperature, Precipitation), 
3. location variables. 

Each of these variables has an absolute correlation of 0.1 or less with our target. While we certainly could go ahead with these features and jump straight into predictive modelling, feature engineering offers a much better approach. Without engineering, our models consistently scored an AUC-ROC of approximately 0.5.

In this section, we'll look to <b>decompose and split our features</b>, as well as carry out <b>data enrichment</b> in the form of historical temperature records from the [National Weather Service](https://www.weather.gov). We'll also carry out a bit of polynomial feature engineering to try to create features with a higher correlation to our target.

### Preparation for Engineering

In [4]:
# Convert to datetime object
weather['Date'] = pd.to_datetime(weather['Date'])
train['Date'] = pd.to_datetime(train['Date'])

In [5]:
# This gives me a more precise means of accessing certain weeks in a specific year
def year_week(row):
    week = row['Week']
    year = row['Year']
    row['YearWeek'] = f'{year}{week}'
    row['YearWeek'] = int(row['YearWeek'])
    return row

In [6]:
train = train.apply(year_week, axis=1)
weather = weather.apply(year_week, axis=1)
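
Row-wise `apply` works but can be slow on larger frames. As a sketch of a column-wise alternative, the same key can be built with integer arithmetic. The `demo` frame below is illustrative only. Note one difference from `year_week`: for single-digit weeks this yields a zero-padded key (e.g. 200705 rather than 20075).

```python
import pandas as pd

# Toy frame standing in for train/weather (values are illustrative)
demo = pd.DataFrame({'Year': [2007, 2007, 2013], 'Week': [22, 23, 44]})

# Same result as f'{year}{week}' for two-digit weeks, computed column-wise
demo['YearWeek'] = demo['Year'] * 100 + demo['Week']
print(demo['YearWeek'].tolist())  # [200722, 200723, 201344]
```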

In [7]:
train.head(3)

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,Year,Month,Week,DayOfWeek,WnvPresent,LatLong,Coordinate,Nearest_station,YearWeek
0,2007-05-29,"1100 Roosevelt Road, Chicago, IL 60608, USA",CULEX PIPIENS/RESTUANS,11,W ROOSEVELT,T048,"1100 W ROOSEVELT, Chicago, IL",41.867108,-87.654224,8,2007,5,22,1,0,"(41.867108, -87.654224)",POINT (41.867108 -87.654224),2,200722
1,2007-05-29,"1100 Roosevelt Road, Chicago, IL 60608, USA",CULEX RESTUANS,11,W ROOSEVELT,T048,"1100 W ROOSEVELT, Chicago, IL",41.867108,-87.654224,8,2007,5,22,1,0,"(41.867108, -87.654224)",POINT (41.867108 -87.654224),2,200722
2,2007-05-29,"1100 South Peoria Street, Chicago, IL 60608, USA",CULEX RESTUANS,11,S PEORIA ST,T091,"1100 S PEORIA ST, Chicago, IL",41.862292,-87.64886,8,2007,5,22,1,0,"(41.862292, -87.64886)",POINT (41.862292 -87.64886),2,200722


This [article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342965/) discusses the main impacts of climatic variables on the epidemiology of WNV:
1. Temperature
1. Relative Humidity
1. Precipitation
1. Wind

## Feature 1: Weekly Average Temperature

Temperature has been [acknowledged](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342965/) to be the most prominent feature associated with outbreaks of the West Nile Virus. Among other things, high temperature has been shown to correlate positively with viral replication rates, seasonal phenology of mosquito host populations, growth rates of vector populations, viral transmission efficiency to birds and geographical variations in human case incidence.

Rather than looking only at daily temperature, we'll also look at average temperatures by week.

In [8]:
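# Weekly sums of the numeric weather columns, used below for the weekly average temperature
# and the cumulative weekly precipitation (numeric_only skips the datetime 'Date' column)
group_df = weather.groupby('YearWeek').sum(numeric_only=True)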

In [9]:
def WeekAvgTemp(row):
    # Retrieve current week
    YearWeek = row['YearWeek']
    
    # Retrieving sum of average temperature for current week
    temp_sum = group_df.loc[YearWeek]['Tavg']
    
    # Getting number of days recorded by weather station for current week
    n_days = weather[weather['YearWeek'] == YearWeek].shape[0]
    
    # Calculate Week Average Temperature
    row['WeekAvgTemp'] = temp_sum / n_days
    
    return row

In [10]:
weather = weather.apply(WeekAvgTemp, axis=1)
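
The weekly average can also be computed in a single step. The sketch below uses `groupby(...).transform('mean')` on a toy frame (`demo_weather`, with made-up temperatures) and should give the same per-row result as dividing the weekly sum by the number of recorded days.

```python
import pandas as pd

# Toy daily records: two days in each of two weeks (values are illustrative)
demo_weather = pd.DataFrame({
    'YearWeek': [200718, 200718, 200719, 200719],
    'Tavg': [60.0, 64.0, 70.0, 74.0],
})

# transform('mean') broadcasts each week's mean back onto its daily rows
demo_weather['WeekAvgTemp'] = demo_weather.groupby('YearWeek')['Tavg'].transform('mean')
print(demo_weather['WeekAvgTemp'].tolist())  # [62.0, 62.0, 72.0, 72.0]
```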

## Feature 1.1: Winter Temperature

Winter temperatures aren't a very intuitive variable when it comes to predicting the West Nile Virus. However, it turns out that the WNV can <b>[overwinter](https://ugaurbanag.com/811-2/)</b>. Specific species of mosquito, such as those in the Culex genus, can overwinter -- this takes place in the adult stage, in fertilized, non-blood-fed females. Culex pipiens in particular goes into physiological diapause (akin to hibernation) during the winter months, and while it may be active when temperatures rise above 50°F, it will not take a blood meal.

The virus does not replicate within the mosquito at lower temperatures, <b>but is available to begin replication when temperatures increase</b>. This corresponds with the beginning of the nesting period of birds and the presence of young birds. Circulation of the virus in bird populations leads to the amplification of the virus and growth of vector mosquito populations.

The National Weather Service carries [historical records of January temperatures](https://www.weather.gov/lot/January_Temperature_Rankings_Chicago) -- I created a dataset based on this and carried out some minimal cleaning to create a proxy feature measuring winter temperatures.

In [11]:
# This dataset gives us the average January temperature of each year -- we're using this as a proxy for Winter temperatures.
# We can also see how far each temperature differs from the 30 year normal (23.8 degrees F)
winter_df = pd.read_csv('./datasets/jan_winter.csv')
winter_df.head()

| | Year | AvgTemp | JanDepart |
|---|---|---|---|
| 0 | 2006 | 35.8 | 12.0 |
| 1 | 2007 | 27.9 | 4.1 |
| 2 | 2008 | 23.5 | -0.3 |
| 3 | 2009 | 15.9 | -7.9 |
| 4 | 2010 | 22.0 | -1.8 |


In [12]:
def winter_temp(row):
    year = row['Year']
    #row['WinterTemp'] = winter_df[winter_df['Year'] == year]['AvgTemp'].values[0]
    row['WinterDepart'] = winter_df[winter_df['Year'] == year]['JanDepart'].values[0]
    return row

In [13]:
weather = weather.apply(winter_temp, axis=1)
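
Attaching a yearly value to every daily row is a join on `Year`, so a left merge is one possible alternative to the row-wise lookup. The sketch below uses small stand-in frames (`demo_winter`, `demo_weather`); the two departures are the 2007 and 2008 values shown above, and the `Tavg` values are illustrative.

```python
import pandas as pd

# Stand-ins for winter_df and weather
demo_winter = pd.DataFrame({'Year': [2007, 2008], 'JanDepart': [4.1, -0.3]})
demo_weather = pd.DataFrame({'Year': [2007, 2007, 2008], 'Tavg': [67.5, 51.5, 70.0]})

# A left merge keeps every weather row and copies in that year's January departure
demo_weather = demo_weather.merge(
    demo_winter.rename(columns={'JanDepart': 'WinterDepart'}),
    on='Year',
    how='left',
)
print(demo_weather['WinterDepart'].tolist())  # [4.1, 4.1, -0.3]
```

With a left merge, years missing from the lookup table become `NaN` rather than raising an `IndexError`.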

In [14]:
weather.head(3)

Unnamed: 0,Date,AvgSpeed,Cool,Depart,DewPoint,Heat,PrecipTotal,ResultDir,ResultSpeed,SeaLevel,StnPressure,Sunrise,Sunset,Tavg,Tmax,Tmin,WetBulb,lowvis,n_codesum,rain,Year,Month,Week,DayOfWeek,YearWeek,WeekAvgTemp,WinterDepart
0,2007-05-01,9.4,2.5,14.5,51.0,0.0,0.0,26.0,2.2,29.82,29.14,448.0,1849.0,67.5,83.5,51.0,56.5,0.0,0.0,0.0,2007,5,18,1,200718,59.416667,4.1
1,2007-05-02,13.4,0.0,-2.5,42.0,13.5,0.0,3.0,13.15,30.085,29.41,447.0,1850.0,51.5,59.5,42.5,47.0,1.0,1.5,0.0,2007,5,18,2,200718,59.416667,4.1
2,2007-05-03,12.55,0.0,3.0,40.0,8.0,0.0,6.5,12.3,30.12,29.425,446.0,1851.0,57.0,66.5,47.0,49.0,0.5,0.5,0.0,2007,5,18,3,200718,59.416667,4.1


## Feature 1.2: Summer Temperature

While the link between summer temperature and WNV isn't as clear, we thought it worth investigating whether warmer summers (or in this case, warm Julys) affect the spread of the WNV. The virus is said to spread most efficiently in the United States at temperatures [between 75.2 and 77 degrees Fahrenheit](https://www.medicinenet.com/script/main/art.asp?articlekey=247250#:~:text=The%20mosquito%2Dborne%20virus%20spreads,15%20in%20the%20journal%20eLife.).

This data also comes from the [National Weather Service](https://www.weather.gov/lot/July_Temperature_Rankings_Chicago).

In [15]:
# This dataset gives us the average July temperature of each year -- we're using this as a proxy for Summer temperatures.
# We can also see how far each temperature differs from the 30 year normal (74.0 degrees F)
summer_df = pd.read_csv('./datasets/jul_summer.csv')
summer_df.head()

| | Year | AvgTemp | JulDepart |
|---|---|---|---|
| 0 | 2006 | 76.5 | 2.5 |
| 1 | 2007 | 73.7 | -0.3 |
| 2 | 2008 | 74.0 | 0.0 |
| 3 | 2009 | 69.4 | -4.6 |
| 4 | 2010 | 77.7 | 3.7 |


In [16]:
def summer_temp(row):
    year = row['Year']
    #row['SummerTemp'] = summer_df[summer_df['Year'] == year]['AvgTemp'].values[0]
    row['SummerDepart'] = summer_df[summer_df['Year'] == year]['JulDepart'].values[0]
    return row

In [17]:
weather = weather.apply(summer_temp, axis=1)
weather.head(3)

Unnamed: 0,Date,AvgSpeed,Cool,Depart,DewPoint,Heat,PrecipTotal,ResultDir,ResultSpeed,SeaLevel,StnPressure,Sunrise,Sunset,Tavg,Tmax,Tmin,WetBulb,lowvis,n_codesum,rain,Year,Month,Week,DayOfWeek,YearWeek,WeekAvgTemp,WinterDepart,SummerDepart
0,2007-05-01,9.4,2.5,14.5,51.0,0.0,0.0,26.0,2.2,29.82,29.14,448.0,1849.0,67.5,83.5,51.0,56.5,0.0,0.0,0.0,2007,5,18,1,200718,59.416667,4.1,-0.3
1,2007-05-02,13.4,0.0,-2.5,42.0,13.5,0.0,3.0,13.15,30.085,29.41,447.0,1850.0,51.5,59.5,42.5,47.0,1.0,1.5,0.0,2007,5,18,2,200718,59.416667,4.1,-0.3
2,2007-05-03,12.55,0.0,3.0,40.0,8.0,0.0,6.5,12.3,30.12,29.425,446.0,1851.0,57.0,66.5,47.0,49.0,0.5,0.5,0.0,2007,5,18,3,200718,59.416667,4.1,-0.3


## Feature 2: Relative Humidity

High humidity is thought to be a strong factor in the spread of the West Nile Virus -- it's been [reported](https://www.mdpi.com/1660-4601/17/4/1403/pdf#:~:text=caspius.,%25%20%5B19%2C23%5D.) that <b>high humidity increases egg production, larval indices and mosquito activity, and influences mosquito behaviour</b>. Other studies have shown that the range of humidity that stimulates mosquito flight activity is between 44% and 69%, with 65% as a focal percentage.

The climate of Chicago is classified as hot-summer humid continental (Köppen climate classification: Dfa), which means that humidity is worth looking into. To calculate relative humidity, we'll first convert some of our temperature readings into degrees Celsius.

### Convert to Celsius

In [18]:
# To calculate Relative Humidity, we need to convert our features from Fahrenheit to Celsius
def celsius(x):
    c = ((x - 32) * 5.0)/9.0
    return float(c)

In [19]:
weather['TavgC'] = weather['Tavg'].apply(celsius)
weather['TminC'] = weather['Tmin'].apply(celsius)
weather['TmaxC'] = weather['Tmax'].apply(celsius)
weather['DewPointC'] = weather['DewPoint'].apply(celsius)

In [20]:
def r_humid(row):
    row['r_humid'] = round(100*(math.exp((17.625*row['DewPointC'])/(243.04+row['DewPointC'])) \
                          / math.exp((17.625*row['TavgC'])/(243.04+row['TavgC']))))
    return row

Formula for [Relative Humidity](https://bmcnoldy.rsmas.miami.edu/Humidity.html):

$$ \large RH = 100 \, {\exp\left({aT_{d} \over {b + T_{d}}}\right) \over \exp\left({aT \over {b + T}}\right)}$$

where: <br>
$ \small a = 17.625 $ <br>
$ \small b = 243.04 $ <br>
$ \small T = \text{Average Temperature (Celsius)} $ <br>
$ \small T_{d} = \text{Dewpoint Temperature (Celsius)} $ <br>
$ \small RH = \text{Relative Humidity} $ (%)<br>
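
As a worked check of the formula, here is a minimal vectorised sketch using NumPy. The helper names `fahrenheit_to_celsius` and `relative_humidity` are hypothetical; the inputs are the 2007-05-01 readings from our weather data (Tavg 67.5°F, dew point 51.0°F).

```python
import numpy as np

A, B = 17.625, 243.04  # constants a and b from the formula above


def fahrenheit_to_celsius(temp_f):
    """Convert Fahrenheit to Celsius (accepts scalars or arrays)."""
    return (np.asarray(temp_f, dtype=float) - 32.0) * 5.0 / 9.0


def relative_humidity(tavg_f, dewpoint_f):
    """Relative humidity (%) from average temperature and dew point in Fahrenheit."""
    t = fahrenheit_to_celsius(tavg_f)
    td = fahrenheit_to_celsius(dewpoint_f)
    return 100.0 * np.exp(A * td / (B + td)) / np.exp(A * t / (B + t))


rh = relative_humidity(67.5, 51.0)
print(round(float(rh)))  # 55
```

Because the helpers accept arrays, they could also be called on whole columns, e.g. `relative_humidity(weather['Tavg'], weather['DewPoint'])`, without intermediate Celsius columns.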

In [21]:
weather = weather.apply(r_humid, axis=1)

In [22]:
# Dropping as Celsius features are no longer needed
weather = weather.drop(columns=['TavgC', 'TminC', 'TmaxC', 'DewPointC'])

In [23]:
weather.sort_values(by='r_humid', ascending=False).head()

Unnamed: 0,Date,AvgSpeed,Cool,Depart,DewPoint,Heat,PrecipTotal,ResultDir,ResultSpeed,SeaLevel,StnPressure,Sunrise,Sunset,Tavg,Tmax,Tmin,WetBulb,lowvis,n_codesum,rain,Year,Month,Week,DayOfWeek,YearWeek,WeekAvgTemp,WinterDepart,SummerDepart,r_humid
1287,2013-10-31,11.45,0.0,9.5,56.0,9.5,1.535,23.5,9.45,29.46,28.73,623.0,1647.0,55.5,64.5,46.5,57.0,1.0,2.0,1.0,2013,10,44,3,201344,51.125,2.8,-0.8,102
1454,2014-10-14,9.4,0.0,7.0,58.0,5.0,0.915,11.0,2.65,29.545,28.86,603.0,1712.0,60.0,66.0,53.0,59.0,1.0,3.0,1.0,2014,10,42,1,201442,54.0,-8.1,-3.6,93
319,2008-09-13,9.4,7.5,7.5,70.0,0.0,4.855,21.5,6.8,29.705,29.005,529.0,1807.0,72.5,76.0,68.5,71.0,1.0,2.0,1.0,2008,9,37,5,200837,65.214286,-0.3,0.0,92
1286,2013-10-30,6.85,0.0,8.5,52.0,10.5,0.855,14.5,5.95,30.005,29.255,622.0,1649.0,54.5,63.5,45.0,53.0,1.0,3.0,1.0,2013,10,44,2,201344,51.125,2.8,-0.8,91
763,2011-05-28,6.25,0.0,-5.5,55.0,7.5,0.175,18.0,5.2,29.83,29.13,421.0,1916.0,57.5,63.0,51.5,56.0,1.0,3.0,1.0,2011,5,21,5,201121,57.928571,-3.2,5.0,91


In [24]:
# The average humidity in Chicago could be a factor in the spread of the West Nile Virus
print(f'{round(weather["r_humid"].mean(),2)} %')

62.21 %


*Note:* Relative humidity can exceed 100% due to [supersaturation](https://www.chicagotribune.com/news/ct-xpm-2011-07-20-ct-wea-0720-asktom-20110720-story.html#:~:text=Surprisingly%2C%20yes%2C%20the%20condition%20is,is%20needed%20to%20cause%20saturation.). Water vapor begins to condense onto impurities (such as dust or salt particles) in the air as the RH approaches 100 percent, and a cloud or fog forms. In our data, the single reading above 100% (2013-10-31) arises because that day's dew point (56.0°F) is slightly higher than its average temperature (55.5°F).

## Feature 3: Weekly Cumulative Precipitation

It's often thought that above-average precipitation leads to a higher abundance of mosquitoes and increases the potential for disease outbreaks like the West Nile Virus. This positive association has been confirmed by several [studies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342965/#RSTB20130561C42), but precipitation can be slightly more complex as a feature, as heavy rainfall could dilute the nutrients for larvae, thus decreasing development rate. It might also lead to a negative association by [flushing ditches and drainage channels used by Culex larvae](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342965/#RSTB20130561C56).

Regardless, precipitation is still worth looking into. Instead of looking at daily precipitation amounts, which likely don't affect the presence of WNV on that particular day, we can take cumulative weekly precipitation into account and create a feature measuring weeks with heavy rain.

In [25]:
def WeekPrecipTotal(row):
    row['WeekPrecipTotal'] = group_df.loc[row['YearWeek']]['PrecipTotal']
    return row

In [26]:
weather = weather.apply(WeekPrecipTotal, axis=1)
weather.head(3)

Unnamed: 0,Date,AvgSpeed,Cool,Depart,DewPoint,Heat,PrecipTotal,ResultDir,ResultSpeed,SeaLevel,StnPressure,Sunrise,Sunset,Tavg,Tmax,Tmin,WetBulb,lowvis,n_codesum,rain,Year,Month,Week,DayOfWeek,YearWeek,WeekAvgTemp,WinterDepart,SummerDepart,r_humid,WeekPrecipTotal
0,2007-05-01,9.4,2.5,14.5,51.0,0.0,0.0,26.0,2.2,29.82,29.14,448.0,1849.0,67.5,83.5,51.0,56.5,0.0,0.0,0.0,2007,5,18,1,200718,59.416667,4.1,-0.3,55,0.015
1,2007-05-02,13.4,0.0,-2.5,42.0,13.5,0.0,3.0,13.15,30.085,29.41,447.0,1850.0,51.5,59.5,42.5,47.0,1.0,1.5,0.0,2007,5,18,2,200718,59.416667,4.1,-0.3,70,0.015
2,2007-05-03,12.55,0.0,3.0,40.0,8.0,0.0,6.5,12.3,30.12,29.425,446.0,1851.0,57.0,66.5,47.0,49.0,0.5,0.5,0.0,2007,5,18,3,200718,59.416667,4.1,-0.3,53,0.015
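
One possible way to flag weeks with heavy rain is to threshold the weekly total. The sketch below is an assumption rather than part of our pipeline: the column name `HeavyRainWeek`, the 80th-percentile cutoff and the toy weekly totals are all hypothetical.

```python
import pandas as pd

# Toy weekly totals, one row per week (values are illustrative)
demo_weather = pd.DataFrame({
    'YearWeek': [200718, 200719, 200720, 200721, 200722],
    'WeekPrecipTotal': [0.015, 0.4, 2.5, 0.1, 0.9],
})

# Hypothetical cutoff: weeks at or above the 80th percentile of weekly totals
threshold = demo_weather['WeekPrecipTotal'].quantile(0.8)
demo_weather['HeavyRainWeek'] = (demo_weather['WeekPrecipTotal'] >= threshold).astype(int)
print(demo_weather['HeavyRainWeek'].tolist())  # [0, 0, 1, 0, 0]
```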


## Time-lagged Features

Some [studies](https://pubmed.ncbi.nlm.nih.gov/30145430/) have argued that increased precipitation and temperatures might have a <b>lagged direct effect</b> on the incidence of WNV infection. Given that most Culex mosquitos take approximately [7-10 days](https://www.cdc.gov/mosquitoes/about/life-cycles/culex.html#:~:text=Life%20stages%20of%20Culex%20pipiens,develop%20into%20an%20adult%20mosquito) or [8-10 days (Source: CDC, 2021)](https://www.cdc.gov/dengue/resources/factsheets/mosquitolifecyclefinal.pdf) to develop from egg to adult, the temperature, humidity and precipitation of previous weeks could contribute to higher mosquito growth in the following weeks. According to the CDC, eggs are ready to hatch from a few days to several months after being laid.

Thus, we'll create some time-lagged variables going back to a month before the current date.

### Average Temperature (1 week - 4 weeks before)

In [27]:
def create_templag(row):
    # Retrieve current week
    YearWeek = row['YearWeek']
    
    # Look up the weekly average temperature for each of the four preceding weeks
    for i in range(4):
        try:
            row[f'templag{i+1}'] = weather[weather['YearWeek'] == (YearWeek - (i+1))]['WeekAvgTemp'].unique()[0]
            
        # Where no data exists for the earlier week (e.g. the first weeks recorded each year),
        # create a rough estimate based on the current week's temperature
        except IndexError:
            row[f'templag{i+1}'] = row['WeekAvgTemp'] - i
    return row

In [28]:
weather = weather.apply(create_templag, axis=1)
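
The same `YearWeek - k` lookup can be sketched without a row-wise `apply` by building a one-row-per-week table and reindexing it. The `weekly` frame below is a toy stand-in using rounded temperatures, and unlike `create_templag` it leaves weeks with no earlier data as `NaN` instead of estimating them.

```python
import pandas as pd

# One row per week (rounded, illustrative temperatures)
weekly = pd.DataFrame({
    'YearWeek': [201440, 201441, 201442, 201443, 201444],
    'WeekAvgTemp': [57.1, 53.7, 54.0, 54.6, 50.2],
}).set_index('YearWeek')

for k in range(1, 5):
    # Look up the value recorded for week (YearWeek - k); missing weeks become NaN
    weekly[f'templag{k}'] = weekly['WeekAvgTemp'].reindex(weekly.index - k).to_numpy()

print(weekly.loc[201444, ['templag1', 'templag2']].tolist())  # [54.6, 54.0]
```

The resulting columns could then be merged back onto the daily rows on `YearWeek`, with any fallback for missing weeks applied afterwards via `fillna`.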

### Cumulative Weekly Precipitation (1 week - 4 weeks before)

In [29]:
def create_rainlag(row):
    # Retrieve current week
    YearWeek = row['YearWeek']
    
    # Look up the cumulative weekly precipitation for each of the four preceding weeks
    for i in range(4):
        try:
            row[f'rainlag{i+1}'] = weather[weather['YearWeek'] == (YearWeek - (i+1))]['WeekPrecipTotal'].unique()[0]
            
        # Use average of column if no data available
        except IndexError:
            row[f'rainlag{i+1}'] = weather['WeekPrecipTotal'].mean()
    return row

In [30]:
weather = weather.apply(create_rainlag, axis=1)

### Relative Humidity (1 week - 4 weeks before)

In [31]:
def create_humidlag(row):
    # Retrieve current week
    YearWeek = row['YearWeek']
    
    # Look up relative humidity for each of the four preceding weeks
    # (r_humid is a daily value, so this takes the first distinct reading recorded in that week)
    for i in range(4):
        try:
            row[f'humidlag{i+1}'] = weather[weather['YearWeek'] == (YearWeek - (i+1))]['r_humid'].unique()[0]
            
        # Use average of column if no data available
        except IndexError:
            row[f'humidlag{i+1}'] = weather['r_humid'].mean()
    return row

In [32]:
weather = weather.apply(create_humidlag, axis=1)

In [33]:
# Checking that temperature lagged variables are correct
weather.groupby(by='YearWeek').mean()[['WeekAvgTemp', 'templag1', 'templag2', 'templag3', 'templag4']].tail(5)

| YearWeek | WeekAvgTemp | templag1 | templag2 | templag3 | templag4 |
|---|---|---|---|---|---|
| 201440 | 57.142857 | 65.285714 | 62.142857 | 60.071429 | 74.428571 |
| 201441 | 53.714286 | 57.142857 | 65.285714 | 62.142857 | 60.071429 |
| 201442 | 54.0 | 53.714286 | 57.142857 | 65.285714 | 62.142857 |
| 201443 | 54.642857 | 54.0 | 53.714286 | 57.142857 | 65.285714 |
| 201444 | 50.2 | 54.642857 | 54.0 | 53.714286 | 57.142857 |


### Wind-related Features
We'll lag these features by days (1, 3, 7 and 14).

In [34]:
weather.columns

Index(['Date', 'AvgSpeed', 'Cool', 'Depart', 'DewPoint', 'Heat', 'PrecipTotal',
       'ResultDir', 'ResultSpeed', 'SeaLevel', 'StnPressure', 'Sunrise',
       'Sunset', 'Tavg', 'Tmax', 'Tmin', 'WetBulb', 'lowvis', 'n_codesum',
       'rain', 'Year', 'Month', 'Week', 'DayOfWeek', 'YearWeek', 'WeekAvgTemp',
       'WinterDepart', 'SummerDepart', 'r_humid', 'WeekPrecipTotal',
       'templag1', 'templag2', 'templag3', 'templag4', 'rainlag1', 'rainlag2',
       'rainlag3', 'rainlag4', 'humidlag1', 'humidlag2', 'humidlag3',
       'humidlag4'],
      dtype='object')

In [33]:
# Store the wind-related features to apply a time lag to
wind_features = ['AvgSpeed', 'ResultDir', 'ResultSpeed', 'SeaLevel', 'StnPressure']

lag_df = weather[wind_features]

# Set the number of lags in days
lags = (1, 3, 7, 14)

# Add one lagged column per feature and lag. shift(n) moves values down by n rows,
# so this assumes one row per day, sorted by date
wind_lag_df = weather.assign(**{f'{col}_lag_{n}': lag_df[col].shift(n) for n in lags for col in wind_features})