<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## The Data Science Process - Potential Solution

_Author: Tim Hogan | DSI-DC_

---

**This is an open-ended lab with many possible solutions. These are only a few possible answers.**

In this lab, you will step through a series of questions about the West Nile Virus data set.

https://www.kaggle.com/c/predict-west-nile-virus

The purpose of this lab is to understand the steps a data scientist takes, without getting caught up in the code.

## 1. Describe the problem

---

**In your own words, describe the problem at hand.**

We are given the time, weather, and mosquito spraying information at a specific location in the Chicago area. We are asked to predict whether or not West Nile Virus will be present.

**What are 3 potential goals for this project?**

- Decrease number of people who contract the West Nile Virus.
- Allocate limited public health resources more effectively.
- Limit human and environmental exposure to potentially harmful mosquito spraying.

**What decisions can the city of Chicago make, once they have this data?**

- Spray locations based on likelihood of West Nile Virus.
- Warn citizens in specific locations about the potential of West Nile Virus.
- Pre-allocate proper funding for adequate public health resources depending on potential severity.

## 2. Acquire and Parse

---

https://www.kaggle.com/c/predict-west-nile-virus/data

To better understand the data, investigate this page. Make sure to read the data documentation! (There is a "Data Description" section underneath the list of files -- there is no need to download the actual data files for this activity!)

**What is the target variable in this dataset?**


WnvPresent: whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present. 

**What are the features in this dataset?**

The features describing each location are in `train.csv`/`test.csv`:

```
Date: date that the WNV test is performed
Address: approximate address of the location of trap. This is used to send to the GeoCoder. 
Species: the species of mosquitos
Block: block number of address
Street: street name
Trap: Id of the trap
AddressNumberAndStreet: approximate address returned from GeoCoder
Latitude, Longitude: Latitude and Longitude returned from GeoCoder
AddressAccuracy: accuracy returned from GeoCoder
NumMosquitos: number of mosquitoes caught in this trap
```

For each location, we will add three more features, derived from `spray.csv` and `weather.csv`:
```
NumSprays14Days: number of nearby sprays (within 10 miles) in the last two weeks.
AvgTemp7Days: Average temperature from the nearest weather station over the last week.
WeatherType3Days: Most common weather seen the three days before the date.
```

**How will you put the datasets together?**

`train.csv` contains each date and latitude/longitude of interest.

`spray.csv`: For each location and date, we will iterate through the spray data points and count how many sprays occurred within the two weeks prior to the date AND were within 10 miles of the location.
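
The spray count described above could be sketched as follows. This is only one possible approach, using the standard library. The tuple layout `(date, latitude, longitude)`, the function names, and the sample coordinates are assumptions made for illustration, as is the choice to count sprays on the observation date itself.

```python
from datetime import date, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def num_sprays_14_days(obs_date, lat, lon, sprays, days=14, radius_miles=10.0):
    """Count sprays in the `days` before obs_date (inclusive) within the radius.

    `sprays` is an iterable of (date, latitude, longitude) tuples
    (a hypothetical layout, not necessarily the columns of spray.csv).
    """
    window_start = obs_date - timedelta(days=days)
    return sum(
        1
        for spray_date, spray_lat, spray_lon in sprays
        if window_start <= spray_date <= obs_date
        and haversine_miles(lat, lon, spray_lat, spray_lon) <= radius_miles
    )


# Made-up sample data for illustration only.
sprays = [
    (date(2013, 7, 17), 41.95, -87.70),  # in the window and nearby
    (date(2013, 6, 1), 41.95, -87.70),   # nearby, but too old
    (date(2013, 7, 20), 42.50, -88.50),  # in the window, but too far away
    (date(2013, 7, 30), 41.95, -87.70),  # after the observation date
]
count = num_sprays_14_days(date(2013, 7, 25), 41.95, -87.69, sprays)
print(count)
```

Only the first sample spray satisfies both conditions, so the count for this observation is 1.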

`weather.csv`: We will determine which station is closest to the location. Then, the average temperature from that station will be computed over the past week. The most common weather type will be found over the past three days. If there is no clear most common type (for example, a tie), the most recent weather type will be taken.
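
A minimal sketch of the two weather features follows. It assumes the weather records have already been filtered to the nearest station and reduced to `(date, average temperature, weather type)` tuples. That layout, the function names, and the sample values are hypothetical. It also assumes both windows cover the days strictly before the observation date.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean


def avg_temp_7_days(obs_date, daily_weather, days=7):
    """Mean temperature over the `days` days before obs_date, or None if no data."""
    start = obs_date - timedelta(days=days)
    temps = [temp for day, temp, _ in daily_weather if start <= day < obs_date]
    return mean(temps) if temps else None


def weather_type_3_days(obs_date, daily_weather, days=3):
    """Most common weather type in the `days` days before obs_date.

    Ties are broken by taking the most recent of the tied types.
    """
    start = obs_date - timedelta(days=days)
    window = sorted(
        (day, wtype) for day, _, wtype in daily_weather if start <= day < obs_date
    )
    if not window:
        return None
    counts = Counter(wtype for _, wtype in window)
    top_count = max(counts.values())
    # Walk backwards from the latest day; the first type with the top count wins.
    for _, wtype in reversed(window):
        if counts[wtype] == top_count:
            return wtype


# Made-up readings for one station, for illustration only.
weather = [
    (date(2013, 7, 18), 70.0, "RA"),
    (date(2013, 7, 19), 72.0, "RA"),
    (date(2013, 7, 20), 74.0, "BR"),
    (date(2013, 7, 21), 76.0, "RA"),
    (date(2013, 7, 22), 78.0, "BR"),
    (date(2013, 7, 23), 80.0, "HZ"),
    (date(2013, 7, 24), 82.0, "BR"),
    (date(2013, 7, 25), 90.0, "TS"),
]
print(avg_temp_7_days(date(2013, 7, 25), weather))
print(weather_type_3_days(date(2013, 7, 25), weather))
```

Excluding the observation date keeps the features limited to information available before the test is performed. Whether to include that day is a design choice the lab leaves open.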

**What other potential features could help us solve this problem?**

- Adding features that indicate whether a region contains good conditions for mosquitoes.
- National data regarding West Nile Virus for the year.
- For each year, doubling the features to include that location's data from the previous year and whether West Nile Virus occurred then (perhaps a correlation exists).
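
The previous-year idea in the last bullet could be sketched like this. The dictionary keys (`Trap`, `Year`, `WnvPresent`) follow the column names above, except `Year`, which is assumed to be extracted from `Date`. The new `WnvPresentPrevYear` key and the sample rows are hypothetical.

```python
def add_previous_year_flag(rows):
    """Attach whether WNV was seen at the same trap in the previous year.

    Each row is a dict with "Trap", "Year", and "WnvPresent" keys.
    The new "WnvPresentPrevYear" value is 1, 0, or None when no
    previous-year data exists for that trap.
    """
    seen = {}
    for row in rows:
        key = (row["Trap"], row["Year"])
        # A trap counts as positive for a year if any of its tests was positive.
        seen[key] = max(seen.get(key, 0), row["WnvPresent"])
    return [
        {**row, "WnvPresentPrevYear": seen.get((row["Trap"], row["Year"] - 1))}
        for row in rows
    ]


# Made-up rows for illustration only.
rows = [
    {"Trap": "A", "Year": 2012, "WnvPresent": 0},
    {"Trap": "A", "Year": 2012, "WnvPresent": 1},
    {"Trap": "A", "Year": 2013, "WnvPresent": 0},
    {"Trap": "B", "Year": 2013, "WnvPresent": 1},
]
for row in add_previous_year_flag(rows):
    print(row)
```

Returning `None` rather than 0 for traps with no earlier data keeps "no information" distinct from "tested and not present".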


**What data sources would contain these additional features?**

- Federal CDC website or data.gov
- Google Maps satellite imagery could give mosquito habitat information.
- The given data source contains earlier information about each location.

**What does weather type of 'SG' mean? (See [here](../data/noaa_weather_qclcd_documentation.pdf).)**

Snow Grains. From [Wikipedia](https://en.wikipedia.org/wiki/Snow_grains):

    Snow grains are a form of precipitation. Snow grains are characterized as very small (<1 mm), white, opaque grains of ice that are fairly flat or elongated. Unlike snow pellets, snow grains do not bounce or break up on impact.


## 3. EDA

---

**What are the first 5 features or relationships you would investigate?**

- Are West Nile Virus sightings in nearby regions correlated? What are the sizes of affected regions?
- How does weather affect West Nile Virus?
- Is the number of mosquitos correlated with West Nile Virus?
- Do some months or years have higher rates of infection?
- Does spraying help?
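
As one example, the question about months could be explored with a simple grouped rate. This is a rough sketch on made-up `(date, WnvPresent)` pairs. The function name and data layout are assumptions, not part of the dataset.

```python
from collections import defaultdict
from datetime import date


def infection_rate_by_month(rows):
    """Fraction of tests with WnvPresent == 1, keyed by calendar month."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for obs_date, wnv_present in rows:
        totals[obs_date.month] += 1
        positives[obs_date.month] += wnv_present
    return {month: positives[month] / totals[month] for month in sorted(totals)}


# Made-up observations for illustration only.
observations = [
    (date(2013, 6, 5), 0),
    (date(2013, 6, 12), 0),
    (date(2013, 8, 1), 1),
    (date(2013, 8, 8), 0),
    (date(2013, 8, 15), 1),
    (date(2013, 8, 22), 1),
]
rates = infection_rate_by_month(observations)
print(rates)
```

The same grouping could be keyed on year, trap, or species to explore the other questions in the list.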

**What are 3 visualizations which would aid in your analysis?**

- Geographical plot of infections overlaid with spraying.
- Bar graph of weather types versus number of mosquitos and/or number of infections.
- Heat map of how distance from each spray correlates with number of mosquitos and/or number of infections.

## 4. Build a model

---
We will skip this step for now!

## 5. Interpret the results

---

**How will you structure the results of your analysis?**

- Each business objective will be discussed, along with an analysis of how our solution addresses it.
- Each feature should be analyzed to see whether it affects the presence of West Nile Virus.
- Present the tradeoffs and decisions leading to the final model and its features.
- Recommend, with pros and cons, whether the model is effective and to what extent it might be deployed successfully.

## 6. Communication

---

**What are 3 potential deliverables for this project?**

- **Presentation** to display the main takeaways and how each goal was or was not met.
- **Documentation** that details why the model and its features were chosen, so that someone new to the project understands what led to the final product.
- **Jupyter notebook** demonstrating how the data sources are combined, how the model is built, and how predictions are made.

**Production code** running the model might also be delivered, perhaps as an **API**. (Typically, it is more complicated than just one API call since the customer often wants to retrain the model with new data and make tweaks.)