## 1. Data Wrangling - West Nile Virus <a id='1_Data_wrangling'></a>

### 1.1 Table of Contents <a id='1.1_Table_of_Contents'></a>
* [1 Data wrangling - West Nile Virus](#1_Data_wrangling)
     * [1.1 Table of Contents](#1.1_Table_of_Contents)
     * [1.2 Recap Of Data Science Problem](#1.2_Recap_Of_Data_Science_Problem)
     * [1.3 Imports](#1.3_Imports)
     * [1.4 Objects](#1.4_Objects)
* [2. Spray Data](#2_Spray_Data)
     * [2.1 Loading](#2.1_Loading)
     * [2.2 Exploring ](#2.2_Exploring)
     * [2.3 Missing Data](#2.3_Missing_Data)
* [3. Weather Data](#3_Weather_Data)
     * [3.1 Loading](#3.1_Loading)
     * [3.2 Exploring](#3.2_Exploring)
     * [3.3 Missing Data](#3.3_Missing_Data)
          * [3.3.1 Temperature](#3.3.1_Tempature)
          * [3.3.2 Precipitation](#3.3.2_Precipitation)
          * [3.3.3 Dew Point](#3.3.3_Dew_Point)
     * [3.4 Relative Humidity](#3.4_Relitive_Humidity)
          * [3.4.1 Convert to Celsius](#3.4.1_Convert_to_Celsius)
          * [3.4.2 Calculate Relative Humidity](#3.4.2_Calculate_Relative_Humidity)
     * [3.5 Stations](#3.5_Stations)            
     * [3.6 Shift Values](#3.6_Shift_Values)
* [4. Map Data](#4._Map_Data)   
     * [4.1 Loading](#4.1_Loading)
     * [4.2 Exploring ](#4.2_Exploring)
     * [4.3 Missing Data](#4.3_Missing_Data)

### 1.2 Recap Of Data Science Problem <a id='1.2_Recap_Of_Data_Science_Problem'></a>

### 1.3 Imports <a id='1.3_Imports'></a>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pyreadr

### 1.4 Objects <a id='1.4_Objects'></a>

## 2. Spray Data <a id='2_Spray_Data'></a>

### 2.1 Loading <a id='2.1_Loading'></a>

In [2]:
spray_data = pd.read_csv('data/spray.csv')

### 2.2 Exploring <a id='2.2_Exploring'></a>

In [3]:
spray_data.head()

| | Date | Time | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 2011-08-29 | 6:56:58 PM | 42.391623 | -88.089163 |
| 1 | 2011-08-29 | 6:57:08 PM | 42.391348 | -88.089163 |
| 2 | 2011-08-29 | 6:57:18 PM | 42.391022 | -88.089157 |
| 3 | 2011-08-29 | 6:57:28 PM | 42.390637 | -88.089158 |
| 4 | 2011-08-29 | 6:57:38 PM | 42.39041 | -88.088858 |


In [4]:
spray_data.dtypes

Date          object
Time          object
Latitude     float64
Longitude    float64
dtype: object

In [5]:
spray_data.shape

(14835, 4)

### 2.3 Missing Data <a id='2.3_Missing_Data'></a>

In [6]:
missing = pd.concat([spray_data.isnull().sum(), 100 * spray_data.isnull().mean()], axis=1)
missing.columns=['count', '%']

missing.sort_values(by='count',  ascending=False)

| | count | % |
|---|---|---|
| Time | 584 | 3.936636 |
| Date | 0 | 0.0 |
| Latitude | 0 | 0.0 |
| Longitude | 0 | 0.0 |
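
The same count-and-percentage summary is rebuilt several times in this notebook, so it could be wrapped in a small helper. This is a minimal sketch; the name `missing_summary` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest count first."""
    summary = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
    summary.columns = ['count', '%']
    return summary.sort_values(by='count', ascending=False)

# Toy frame standing in for spray_data (values are made up for illustration)
toy = pd.DataFrame({
    'Date': ['2011-09-07', '2011-09-07', '2011-08-29', '2011-08-29'],
    'Time': [None, None, '6:56:58 PM', '6:57:08 PM'],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

Calling `missing_summary(spray_data)` or `missing_summary(weather_data)` would then replace the repeated three-line cell.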


In [7]:
null_data = spray_data[spray_data.isnull().any(axis=1)]
null_data.head()

| | Date | Time | Latitude | Longitude |
|---|---|---|---|---|
| 1030 | 2011-09-07 | NaN | 41.987092 | -87.794286 |
| 1031 | 2011-09-07 | NaN | 41.98762 | -87.794382 |
| 1032 | 2011-09-07 | NaN | 41.988004 | -87.794574 |
| 1033 | 2011-09-07 | NaN | 41.988292 | -87.795486 |
| 1034 | 2011-09-07 | NaN | 41.9881 | -87.796014 |


In [8]:
null_data.Date.unique()

array(['2011-09-07'], dtype=object)

Just to double check:

In [9]:
datetimecheck = null_data[null_data['Date'] == '2011-09-07']
datetimecheck.shape

(584, 4)

All missing values fall on the date 2011-09-07. It is unlikely we would be able to recover the missing times, so these rows are dropped.

In [10]:
spray_data = spray_data[spray_data['Time'].notna()].copy()  # copy so later column assignments do not act on a slice
spray_data.shape

(14251, 4)

14835 - 14251 = 584

In [11]:
null_data = spray_data[spray_data.isnull().any(axis=1)]
null_data.head()

| | Date | Time | Latitude | Longitude |
|---|---|---|---|---|


### 2.4 Datetime Modification <a id='2.4_Datetime_Modification'></a>

In [12]:
spray_data['Date'] = pd.to_datetime(spray_data.Date)
spray_data['Date'].head()

0   2011-08-29
1   2011-08-29
2   2011-08-29
3   2011-08-29
4   2011-08-29
Name: Date, dtype: datetime64[ns]

In [14]:
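# Times use a 12-hour clock with an AM/PM suffix; format='%H:%M:%S' fails with
# "ValueError: unconverted data remains:  PM", so parse with %I and %p instead.
spray_data['Time'] = pd.to_datetime(spray_data.Time, format='%I:%M:%S %p').dt.time
spray_data['Time'].head()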
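
The `Time` strings use a 12-hour clock with an AM/PM suffix, which a 24-hour `%H:%M:%S` format cannot consume. A minimal sketch on values copied from the `head()` output, assuming the whole column follows the same pattern:

```python
import datetime
import pandas as pd

# Sample values copied from the spray_data.head() output
times = pd.Series(['6:56:58 PM', '6:57:08 PM', '6:57:18 PM'])

# %I is the 12-hour clock hour and %p is the AM/PM marker
parsed = pd.to_datetime(times, format='%I:%M:%S %p').dt.time
print(parsed.tolist())
```

The result is a column of `datetime.time` objects on a 24-hour clock, so `6:56:58 PM` becomes `18:56:58`.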

### 2.5 Check for Zero

In [None]:
spray_data.Time[spray_data['Time'] == 0]

In [None]:
spray_data.Time[spray_data['Date'] == 0]

In [None]:
spray_data.Time[spray_data['Latitude'] == 0]

In [None]:
spray_data.Time[spray_data['Longitude'] == 0]
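
The four zero checks above could also be done in one pass over the numeric columns. A minimal sketch on a toy frame (the values are made up; a zero coordinate here simply stands in for a bad GPS reading):

```python
import pandas as pd

# Toy frame standing in for spray_data
toy_spray = pd.DataFrame({
    'Latitude': [42.391623, 0.0, 42.391022],
    'Longitude': [-88.089163, -88.089163, 0.0],
})

coords = toy_spray[['Latitude', 'Longitude']]

# Number of exact zeros per column
zero_counts = (coords == 0).sum()

# Rows where at least one coordinate is zero
zero_rows = toy_spray[(coords == 0).any(axis=1)]
print(zero_counts)
print(zero_rows)
```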

## 3. Weather Data <a id='3_Weather_Data'></a>

### 3.1 Loading <a id='3.1_Loading'></a>

In [None]:
weather_data = pd.read_csv('data/weather.csv', na_values='M')

Isn't it weird that trace amounts are recorded with two spaces before the T (`'  T'`)? Let's not remove trace amounts yet; we may want to infer values for them or make them a legit 0.
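
To see how `na_values='M'` and the padded trace marker interact, here is a minimal sketch on an inline CSV. The three rows are made up to mimic the layout of `weather.csv`; only the `M` and `'  T'` conventions come from this notebook.

```python
import io
import pandas as pd

# Made-up rows in the same shape as weather.csv
csv_text = (
    "Station,Date,PrecipTotal\n"
    "1,2007-05-01,0.00\n"
    "2,2007-05-01,  T\n"
    "1,2007-05-02,M\n"
)

toy = pd.read_csv(io.StringIO(csv_text), na_values='M')

# 'M' becomes NaN, while '  T' survives as text with its leading spaces,
# which keeps the whole column as strings rather than floats
print(toy['PrecipTotal'].tolist())
```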

### 3.2 Exploring <a id='3.2_Exploring'></a>

In [None]:
weather_data.head(12)

In [None]:
weather_data.dtypes

In [None]:
weather_data.shape

### 3.3 Missing Data <a id='3.3_Missing_Data'></a>

In [None]:
missing = pd.concat([weather_data.isnull().sum(), 100 * weather_data.isnull().mean()], axis=1)
missing.columns=['count', '%']

missing.sort_values(by='count',  ascending=False)

Water1, Depart, SnowFall, and Depth have too many missing data points. Off with their cols! CodeSum doesn't look useful either: it is non-numeric, and there is nothing categorical I can think of to compare it with.
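
Instead of naming the sparse columns by hand, they could be selected by their share of missing values. This is a sketch under an assumed cutoff of 50%; the threshold and the toy values are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column and two mostly-missing ones
toy = pd.DataFrame({
    'Tmax': [83, 84, 59, 60],
    'Water1': [np.nan, np.nan, np.nan, np.nan],
    'Depth': [0, np.nan, np.nan, np.nan],
})

threshold = 0.5  # hypothetical cutoff: drop columns with more than 50% missing
missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop)
print(trimmed.columns.tolist())
```

A column such as CodeSum, which is dropped for being uninformative rather than sparse, would still need to be listed explicitly.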


In [None]:
weather_data = weather_data.drop(['Water1', 'Depart', 'SnowFall', 'Depth', 'CodeSum'], axis = 1)

In [None]:
missing = pd.concat([weather_data.isnull().sum(), 100 * weather_data.isnull().mean()], axis=1)
missing.columns=['count', '%']

missing.sort_values(by='count',  ascending=False)

In theory there are only 8 cols to fix for missing values, though in practice there are more because we keep going up and down editing this.

#### 3.3.1 Temperature <a id='3.3.1_Tempature'></a>

In [None]:
weather_data.Tavg.unique()

In [None]:
weather_data['Tavg'] = (weather_data.Tmax + weather_data.Tmin)/2

Apparently I need to review how `.mean()` interacts with everything; every time I use it, it takes the mean of the whole col.
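
The difference comes down to the axis: `Series.mean()` collapses one column to a single number, while `DataFrame.mean(axis=1)` averages across columns and returns one value per row. A minimal sketch with made-up temperatures:

```python
import pandas as pd

# Toy values, not taken from weather.csv
toy_weather = pd.DataFrame({'Tmax': [83, 84, 59], 'Tmin': [50, 52, 42]})

# Mean of the whole column: a single scalar
column_mean = toy_weather['Tmax'].mean()

# Row-wise mean across the two columns: one value per row
row_mean = toy_weather[['Tmax', 'Tmin']].mean(axis=1)

# Same result as the arithmetic used for Tavg
manual = (toy_weather.Tmax + toy_weather.Tmin) / 2
print(row_mean.tolist())
```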

In [None]:
weather_data.Tavg.head(8)

In [None]:
weather_data.Tavg.unique()

#### 3.3.2 Precipitation <a id='3.3.2_Precipitation'></a>

In [None]:
weather_data.PrecipTotal.unique()

According to the NOAA records, T stands for Trace, which likely means less than 0.01. As for dealing with M, if we have the relevant data from other locations on the same datetime, take the mean for that datetime. Otherwise remove?
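
The "mean for that datetime" idea can be sketched with a grouped transform. This assumes one row per station per date, as in `weather.csv`; the values and the `PrecipFilled` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two stations, three dates, some missing readings
toy = pd.DataFrame({
    'Station': [1, 2, 1, 2, 1, 2],
    'Date': pd.to_datetime(['2007-05-01', '2007-05-01',
                            '2007-05-02', '2007-05-02',
                            '2007-05-03', '2007-05-03']),
    'PrecipTotal': [0.10, np.nan, 0.00, 0.02, np.nan, np.nan],
})

# Mean of the available readings on each date, aligned back to every row
same_day_mean = toy.groupby('Date')['PrecipTotal'].transform('mean')

# Fill only the gaps; dates where every station is missing stay NaN
toy['PrecipFilled'] = toy['PrecipTotal'].fillna(same_day_mean)
print(toy)
```

Rows that are still NaN afterwards are the "otherwise remove?" cases, where no station reported a value for that date.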

In [None]:
# Count the number of T's (M's were already read in as NaN)

weather_data.PrecipTotal.str.count("T").sum()

In [None]:
# Arbitrarily choosing T to be 0.001
weather_data['PrecipTotal'] = weather_data['PrecipTotal'].replace('  T', 0.001)
weather_data.PrecipTotal.str.count("T").sum()

In [None]:
weather_data.PrecipTotal.isnull().sum()

In [None]:
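# Simple Mean fill (convert to numeric first; the column is read in as strings)

weather_data['PrecipTotal'] = pd.to_numeric(weather_data['PrecipTotal'])
weather_data['PrecipTotal'] = weather_data['PrecipTotal'].fillna(weather_data['PrecipTotal'].mean())
weather_data.PrecipTotal.isnull().sum()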

#### 3.3.3 Dew Point <a id='3.3.3_Dew_Point'></a>

In [None]:
weather_data.DewPoint.unique()

In [None]:
weather_data.DewPoint.plot()

Nothing obviously weird here

#### 3.3.4 WetBulb

In [None]:
weather_data.WetBulb.unique()

In [None]:
weather_data.WetBulb.plot()

#### 3.3.5 Heat

In [None]:
weather_data.Heat.unique()

In [None]:
# Simple Mean fill

weather_data['Heat'] = weather_data['Heat'].fillna(weather_data['Heat'].mean())
weather_data.Heat.isnull().sum()

In [None]:
weather_data.Heat.plot()

#### 3.3.6 Cool

In [None]:
weather_data.Cool.unique()

In [None]:
# Simple Mean fill

weather_data['Cool'] = weather_data['Cool'].fillna(weather_data['Cool'].mean())
weather_data.Cool.isnull().sum()

In [None]:
weather_data.Cool.plot()

#### 3.3.7 StnPressure

In [None]:
weather_data.StnPressure.unique()

In [None]:
# Simple Mean fill

weather_data['StnPressure'] = weather_data['StnPressure'].fillna(weather_data['StnPressure'].mean())
weather_data.StnPressure.isnull().sum()

In [None]:
weather_data.StnPressure.plot()

#### 3.3.8 SeaLevel

In [None]:
weather_data.SeaLevel.unique()

In [None]:
# Simple Mean fill

weather_data['SeaLevel'] = weather_data['SeaLevel'].fillna(weather_data['SeaLevel'].mean())
weather_data.SeaLevel.isnull().sum()

In [None]:
weather_data.SeaLevel.plot()

#### 3.3.9 ResultSpeed

In [None]:
weather_data.ResultSpeed.unique()

In [None]:
weather_data.ResultSpeed.plot()

#### 3.3.10 ResultDir

In [None]:
weather_data.ResultDir.unique()

In [None]:
weather_data.ResultDir.plot()

#### 3.3.11 AvgSpeed

In [None]:
weather_data.AvgSpeed.unique()

In [None]:
# Simple Mean fill

weather_data['AvgSpeed'] = weather_data['AvgSpeed'].fillna(weather_data['AvgSpeed'].mean())
weather_data.AvgSpeed.isnull().sum()

In [None]:
weather_data.AvgSpeed.plot()

### 3.4 Relative Humidity <a id='3.4_Relitive_Humidity'></a>

Calculation Notes for relative humidity:

Saturation Vapor Pressure:

Temperature = $T$ (in Celsius)
Dewpoint Temp = $T_d$ (in Celsius)

* $e_s = 6.11 \times 10^{\frac{7.5\,T}{237.3 + T}}$ [eq. 1]

Vapor Pressure:

* $e = 6.11 \times 10^{\frac{7.5\,T_d}{237.3 + T_d}}$ [eq. 2]

Relative Humidity:

* $RH = \frac{e}{e_s} \times 100$ [eq. 3]

F to C conversion:

* $C = (F - 32) \times \frac{5}{9}$ [eq. 4]
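
The four equations can be chained into one small calculation. This is a sketch rather than the notebook's own implementation: it reads the `E` in equations 1 and 2 as "times 10 to the power of" and uses the 237.3 constant from the weather.gov sheet listed under the primary sources. The function names are assumptions for illustration.

```python
def fahrenheit_to_celsius(temp_f):
    """Equation 4: convert Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9

def vapor_pressure(temp_c):
    """Equations 1 and 2: vapor pressure in millibars for a temperature in Celsius."""
    return 6.11 * 10 ** (7.5 * temp_c / (237.3 + temp_c))

def relative_humidity(temp_f, dew_point_f):
    """Equation 3: relative humidity in percent from air and dew point temperatures in F."""
    e_s = vapor_pressure(fahrenheit_to_celsius(temp_f))       # saturation vapor pressure
    e = vapor_pressure(fahrenheit_to_celsius(dew_point_f))    # actual vapor pressure
    return e / e_s * 100

# Air at 86 F with a 68 F dew point
rh_warm = relative_humidity(86, 68)

# When the dew point equals the air temperature, the air is saturated
rh_saturated = relative_humidity(68, 68)
print(round(rh_warm, 1), rh_saturated)
```

Because the functions use only arithmetic operators, they should also accept pandas Series such as `weather_data['Tmax']` and `weather_data['DewPoint']`.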

Primary Sources:

* https://www.weather.gov/media/epz/wxcalc/vaporPressure.pdf <-- primary for equations

* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7068582/ <-- Journal concerning primary weather conditions

* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3750695/ <-- specific study attempting to determine mathematical formula based on weather

Secondary Sources:

* https://www.theweatherprediction.com/habyhints/186/

* https://snowball.millersville.edu/~adecaria/ESCI241/esci241_lesson06_humidity.pdf

* https://www.omnicalculator.com/chemistry/vapour-pressure-of-water

#### 3.4.1 Convert to Celsius <a id='3.4.1_Convert_to_Celsius'></a>

In [None]:
# Double check format, NOAA is in F not C, need to convert. Using equation #4

weather_data['TmaxC'] = (weather_data['Tmax'] - 32) * (5/9)
weather_data['TminC'] = (weather_data['Tmin'] - 32) * (5/9)
weather_data['TavgC'] = (weather_data['Tavg'] - 32) * (5/9)
weather_data['DewPointC'] = (weather_data['DewPoint'] - 32) * (5/9)

In [None]:
weather_data.TmaxC.head()

In [None]:
weather_data.TminC.head()

In [None]:
weather_data.TavgC.head()

In [None]:
weather_data.DewPointC.head()

#### 3.4.2 Calculate Relative Humidity <a id='3.4.2_Calculate_Relative_Humidity'></a>

In [None]:
# See Equations #1, 2, and 3

def vapor_pressure(temp_c):
    # Equations 1 and 2: 6.11 * 10^(7.5 * T / (237.3 + T)), with T in Celsius
    return 6.11 * 10 ** ((7.5 * temp_c) / (237.3 + temp_c))

e = vapor_pressure(weather_data['DewPointC'])
weather_data['RHmax'] = (e / vapor_pressure(weather_data['TmaxC'])) * 100
weather_data['RHmin'] = (e / vapor_pressure(weather_data['TminC'])) * 100
weather_data['RHavg'] = (e / vapor_pressure(weather_data['TavgC'])) * 100

In [None]:
weather_data.RHmax.head()

In [None]:
weather_data.RHmin.head()

In [None]:
weather_data.RHavg.head()

### 3.5 Stations, Dates & Format <a id='3.5_Stations'></a>

Backtracking a bit. To properly use shift we need to separate the stations and reformat some cols.

In [None]:
weather_data.Station.unique()

In [None]:
weather_data['Date'] = pd.to_datetime(weather_data.Date)
weather_data['Date'].head()

In [None]:
weather_data.dtypes

In [None]:
station1 = weather_data[weather_data['Station'] == 1].reset_index(drop=True)
station2 = weather_data[weather_data['Station'] == 2].reset_index(drop=True)
station1.head()


### 3.6 Shift Values <a id='3.6_Shift_Values'></a>

Use shift function to create lag variables

Incubation Length Sources:
* https://www.cdc.gov/dengue/resources/factsheets/mosquitolifecyclefinal.pdf

Key Takeaways:
* The entire life cycle, from an egg to an adult, takes approximately 8-10 days
* Eggs are ready to hatch from a few days to several months after being laid.

Research to be done:
* Specific data for specific species
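
A lag variable built with `shift` could look like the sketch below. The lag lengths and toy values are hypothetical; the 8-10 day life cycle noted above suggests lags in that range, but the right choice still depends on the species research listed as outstanding.

```python
import pandas as pd

# Toy single-station frame; values are illustrative, not from weather.csv
toy_station = pd.DataFrame({
    'Date': pd.date_range('2007-05-01', periods=5, freq='D'),
    'Tavg': [66.5, 68.0, 50.5, 58.0, 60.0],
})

# shift(n) moves values down n rows; with one row per day sorted by date,
# each row then holds the value from n days earlier
for lag in (1, 2):  # hypothetical lags
    toy_station[f'Tavg_lag{lag}'] = toy_station['Tavg'].shift(lag)

# With both stations in one frame, shift within each station
# so that a lag never picks up the other station's reading
both = pd.DataFrame({
    'Station': [1, 2, 1, 2],
    'Date': pd.to_datetime(['2007-05-01', '2007-05-01', '2007-05-02', '2007-05-02']),
    'Tavg': [66.5, 67.0, 68.0, 69.0],
})
both['Tavg_lag1'] = both.groupby('Station')['Tavg'].shift(1)
print(toy_station)
print(both)
```

The first `n` rows of each lagged column are NaN because there is no earlier observation to pull from.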

## 4. Map Data <a id='4._Map_Data'></a>

### 4.1 Loading <a id='4.1_Loading'></a>

In [None]:
map_data = pyreadr.read_r('data/mapdata_copyright_openstreetmap_contributors.Rds')

### 4.2 Exploring <a id='4.2_Exploring'></a>

In [None]:
map_data

### 4.3 Missing Data <a id='4.3_Missing_Data'></a>