# Domain Background

Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death. Because it is carried by mosquitoes, the transmission dynamics of dengue are [related to climate variables](http://ehp.niehs.nih.gov/wp-content/uploads/121/11-12/ehp.1306556.pdf) such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant [public health implications worldwide](http://rstb.royalsocietypublishing.org/content/370/1665/20140135.full).

An understanding of the relationship between climate and dengue dynamics can improve research initiatives and resource allocation to help fight life-threatening pandemics.


# Problem Statement

Given data for two cities, San Juan and Iquitos, spanning 5 and 3 years respectively, the objective is to learn from the data and predict the number of dengue cases each week in each location, based on environmental variables describing changes in temperature, precipitation, vegetation, and more.


# Datasets and Inputs

The training data (features and labels) is provided by the DrivenData [competition](https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/). Each entry in the features file has the following fields:

City and date indicators:
* <code>city</code> – City abbreviations: sj for San Juan and iq for Iquitos
* <code>week_start_date</code> – Date given in yyyy-mm-dd format

NOAA's GHCN daily climate data weather station measurements:
* <code>station_max_temp_c</code> – Maximum temperature
* <code>station_min_temp_c</code> – Minimum temperature
* <code>station_avg_temp_c</code> – Average temperature
* <code>station_precip_mm</code> – Total precipitation
* <code>station_diur_temp_rng_c</code> – Diurnal temperature range

PERSIANN satellite precipitation measurements (0.25x0.25 degree scale):
* <code>precipitation_amt_mm</code> – Total precipitation

NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale):
* <code>reanalysis_sat_precip_amt_mm</code> – Total precipitation
* <code>reanalysis_dew_point_temp_k</code> – Mean dew point temperature
* <code>reanalysis_air_temp_k</code> – Mean air temperature
* <code>reanalysis_relative_humidity_percent</code> – Mean relative humidity
* <code>reanalysis_specific_humidity_g_per_kg</code> – Mean specific humidity
* <code>reanalysis_precip_amt_kg_per_m2</code> – Total precipitation
* <code>reanalysis_max_air_temp_k</code> – Maximum air temperature
* <code>reanalysis_min_air_temp_k</code> – Minimum air temperature
* <code>reanalysis_avg_temp_k</code> – Average air temperature
* <code>reanalysis_tdtr_k</code> – Diurnal temperature range

Satellite vegetation: Normalized Difference Vegetation Index (NDVI) from NOAA's CDR (0.5x0.5 degree scale) measurements:
* <code>ndvi_se</code> – Pixel southeast of city centroid
* <code>ndvi_sw</code> – Pixel southwest of city centroid
* <code>ndvi_ne</code> – Pixel northeast of city centroid
* <code>ndvi_nw</code> – Pixel northwest of city centroid
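
As a rough illustration of the layout described above, the sketch below parses a couple of rows shaped like the features file using only the Python standard library. The sample values are made up for illustration and are not taken from the competition data, only a few of the columns are shown, and the helper name `parse_row` is hypothetical. It also assumes that missing measurements appear as empty strings.

```python
import csv
import io
from datetime import date, datetime

# Made-up sample in the shape of the features file (only a few columns shown).
SAMPLE_CSV = """city,week_start_date,station_avg_temp_c,ndvi_ne
sj,1990-04-30,25.4,0.12
iq,2000-07-01,26.4,
"""


def parse_row(row):
    """Convert one raw CSV row (all strings) into typed values."""
    parsed = {}
    for key, value in row.items():
        if key == "city":
            parsed[key] = value  # "sj" or "iq"
        elif key == "week_start_date":
            # Dates are given in yyyy-mm-dd format.
            parsed[key] = datetime.strptime(value, "%Y-%m-%d").date()
        else:
            # Assumption: an empty string marks a missing measurement.
            parsed[key] = float(value) if value != "" else None
    return parsed


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

Keeping missing values as `None` rather than dropping the row leaves the weekly sequence intact, which matters for any later time-ordered processing.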


# Benchmark Model

The competition already provides a [benchmark model](http://drivendata.co/blog/dengue-benchmark/). The benchmark hypothesizes that the spread of dengue may follow different patterns in the two cities, so the dataset was split by city and a separate model was trained for each.
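
The per-city division can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function name `split_by_city` is hypothetical, the records are made up, and the label column name `total_cases` is an assumption.

```python
from collections import defaultdict


def split_by_city(records):
    """Group records by their `city` value, preserving the original order."""
    by_city = defaultdict(list)
    for record in records:
        by_city[record["city"]].append(record)
    return dict(by_city)


# Made-up records for illustration only; `total_cases` is an assumed label name.
records = [
    {"city": "sj", "week_start_date": "1990-04-30", "total_cases": 4},
    {"city": "iq", "week_start_date": "2000-07-01", "total_cases": 0},
    {"city": "sj", "week_start_date": "1990-05-07", "total_cases": 5},
]
per_city = split_by_city(records)
```

Each value in `per_city` can then be pre-processed and modeled independently of the other city.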

The regression model chosen was a negative binomial regression, in part because the variance of the label values is greater than their mean (overdispersion, which a Poisson model, with its equal mean and variance, would not capture). The data was pre-processed by filling NaN values with the previous value, and the four features that showed the strongest correlation with the labels were selected.
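
The pre-processing steps and the overdispersion check described above might look roughly like the sketch below. It uses only the standard library and made-up numbers; the function names are hypothetical, and ranking features by the absolute value of the Pearson correlation is an assumption about what "strongest correlation" means here. Fitting the negative binomial regression itself is left out, since that would typically rely on a statistics library.

```python
from statistics import mean, variance


def forward_fill(values):
    """Replace each None with the most recent non-missing value before it.

    A leading None has no previous value and is left as None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def top_k_features(columns, labels, k=4):
    """Names of the k columns with the largest absolute correlation to labels."""
    scores = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up weekly case counts and feature columns, for illustration only.
labels = [0, 2, 1, 9, 20, 4]
columns = {
    "reanalysis_specific_humidity_g_per_kg": [14.0, 15.0, 14.5, 17.0, 19.0, 15.5],
    "station_avg_temp_c": forward_fill([26.0, None, 27.0, 26.5, None, 28.0]),
    "ndvi_ne": [0.1, 0.3, 0.1, 0.3, 0.1, 0.3],
}

# Overdispersion: sample variance of the labels exceeds their mean.
overdispersed = variance(labels) > mean(labels)
best = top_k_features(columns, labels, k=1)
```

With real data, `k=4` would mirror the four features mentioned above; `k=1` is used here only because the toy example has three columns.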

Finally, because this is a time-series problem, a strict-future holdout set was used when splitting the data into training and test sets: around the first three quarters of the original data were kept for training and the rest for testing. After training, the model was used to predict the total cases for a test set provided by the competition, and the results were submitted. The submission score was 25.8173.
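
A strict-future holdout can be sketched as below: rows are assumed to be already sorted by date, and every held-out row comes after every training row. The function names and the sample counts are hypothetical. The text above does not name the scoring metric, so mean absolute error is used here purely as an assumption, with a naive "predict the training mean" baseline standing in for a real model.

```python
def strict_future_split(rows, train_fraction=0.75):
    """Split chronologically ordered rows so the holdout is strictly later."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up weekly case counts in chronological order, for illustration only.
weekly_cases = [4, 5, 4, 3, 6, 2, 4, 5]
train, holdout = strict_future_split(weekly_cases)

# Naive baseline (not the benchmark model): always predict the training mean.
baseline = sum(train) / len(train)
mae = mean_absolute_error(holdout, [baseline] * len(holdout))
```

Unlike a random split, this keeps future observations out of the training set, so the holdout error is a fairer proxy for forecasting unseen weeks.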