

https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/

https://www.cdc.gov/dengue/

### Goal

Predict the number of dengue cases each week (in each location) based on environmental variables describing changes in temperature, precipitation, vegetation, and more.

In [1]:
import findspark
findspark.init()

import numpy as np

from sklearn.metrics import mean_absolute_error

from pyspark import SparkContext

from pyspark.sql.functions import abs  # column-wise abs; note that this shadows Python's built-in abs

from pyspark.sql import SparkSession

from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import RandomForestRegressor

from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder, StandardScaler

from pyspark.ml.evaluation import RegressionEvaluator

In [None]:
spark = SparkSession.builder.appName("dengue").getOrCreate()

### Loading the data

In [None]:
path_to_data = "data/"


features_train = spark.read.csv(path_to_data + "dengue_features_train.csv", header=True)
#features_train = features_train.withColumn('month', features_train['week_start_date'].substr(6, 2))

labels_train = spark.read.csv(path_to_data + "dengue_labels_train.csv", header=True)

### Data description

Your goal is to predict the total_cases label for each (`city`, `year`, `weekofyear`) in the test set. 
There are two cities, *San Juan* and *Iquitos*, with test data for each city spanning 5 and 3 years respectively.
You will make one submission that contains predictions for both cities.
The data for each city have been concatenated along with a city column indicating the source: `sj` for San Juan and `iq` for Iquitos. 
The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data.
Throughout, missing values have been filled as `NaN`s.
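
Because the test set is a pure future hold-out, a local validation split that mimics it would also be chronological rather than random. The following is a minimal plain-Python sketch of that idea. The function name `chronological_split` and the toy rows are assumptions made for this illustration and are not part of the competition material.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split rows into (train, validation) by time order.

    Each row is a dict with at least 'year' and 'weekofyear' keys.
    The earliest `train_fraction` of the rows go to train, the rest to validation.
    """
    ordered = sorted(rows, key=lambda r: (r["year"], r["weekofyear"]))
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy rows, made up for illustration only
rows = [
    {"year": 2001, "weekofyear": 2, "total_cases": 4},
    {"year": 2000, "weekofyear": 52, "total_cases": 7},
    {"year": 2001, "weekofyear": 1, "total_cases": 5},
    {"year": 2001, "weekofyear": 3, "total_cases": 3},
    {"year": 2000, "weekofyear": 51, "total_cases": 9},
]

train_rows, valid_rows = chronological_split(rows, train_fraction=0.8)
```

With two cities in the data, such a split would probably be applied to each city separately so that both appear in the training and validation parts.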

#### The features in this dataset

You are provided the following set of information on a (`year`, `weekofyear`) timescale:

(Where appropriate, units are provided as a `_unit` suffix on the feature name.)

*City and date indicators*

 - `city` – City abbreviations: `sj` for San Juan and `iq` for Iquitos
 - `week_start_date` – Date given in yyyy-mm-dd format

*NOAA's GHCN daily climate data weather station measurements*

 - `station_max_temp_c` – Maximum temperature
 - `station_min_temp_c` – Minimum temperature
 - `station_avg_temp_c` – Average temperature
 - `station_precip_mm` – Total precipitation
 - `station_diur_temp_rng_c` – Diurnal temperature range
 
*PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)*

 - `precipitation_amt_mm` – Total precipitation

*NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)*

 - `reanalysis_sat_precip_amt_mm` – Total precipitation
 - `reanalysis_dew_point_temp_k` – Mean dew point temperature
 - `reanalysis_air_temp_k` – Mean air temperature
 - `reanalysis_relative_humidity_percent` – Mean relative humidity
 - `reanalysis_specific_humidity_g_per_kg` – Mean specific humidity
 - `reanalysis_precip_amt_kg_per_m2` – Total precipitation
 - `reanalysis_max_air_temp_k` – Maximum air temperature
 - `reanalysis_min_air_temp_k` – Minimum air temperature
 - `reanalysis_avg_temp_k` – Average air temperature
 - `reanalysis_tdtr_k` – Diurnal temperature range

*Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements*

 - `ndvi_se` – Pixel southeast of city centroid
 - `ndvi_sw` – Pixel southwest of city centroid
 - `ndvi_ne` – Pixel northeast of city centroid
 - `ndvi_nw` – Pixel northwest of city centroid
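
Note that the unit suffixes differ between sources: station temperatures are in degrees Celsius (`_c`) while reanalysis temperatures are in kelvin (`_k`). If the two need to be compared directly, a conversion like the sketch below can be used. The helper name `kelvin_to_celsius` is an assumption for this illustration.

```python
def kelvin_to_celsius(temp_k):
    """Convert a temperature from kelvin to degrees Celsius."""
    return temp_k - 273.15


# Example: a hypothetical reanalysis_air_temp_k value of 298.15 K
air_temp_c = kelvin_to_celsius(298.15)
```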


In [None]:
print("features_train = ({}, {})".format(features_train.count(), len(features_train.columns)))

features_train.printSchema()

In [None]:
print("labels_train = ({}, {})".format(labels_train.count(), len(labels_train.columns)))

labels_train.printSchema()

We join the two datasets on their shared keys (`city`, `year`, `weekofyear`).

In [None]:
df_train = features_train.join(labels_train, ['city', 'year', 'weekofyear'])
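
To make the semantics of this join explicit: a row is kept only when the same (`city`, `year`, `weekofyear`) key appears in both tables. The plain-Python sketch below illustrates that behaviour on toy rows. The helper `inner_join` and the sample values are assumptions for illustration, not Spark's implementation.

```python
def inner_join(left_rows, right_rows, keys):
    """Inner-join two lists of dicts on the given key columns."""
    # Index the right-hand rows by their composite key
    index = {}
    for right in right_rows:
        index.setdefault(tuple(right[k] for k in keys), []).append(right)

    joined = []
    for left in left_rows:
        for right in index.get(tuple(left[k] for k in keys), []):
            joined.append({**left, **right})
    return joined


# Toy rows, made up for illustration only
features = [
    {"city": "sj", "year": 1990, "weekofyear": 18, "ndvi_ne": 0.12},
    {"city": "sj", "year": 1990, "weekofyear": 19, "ndvi_ne": 0.17},
]
labels = [
    {"city": "sj", "year": 1990, "weekofyear": 18, "total_cases": 4},
    {"city": "iq", "year": 2000, "weekofyear": 26, "total_cases": 0},
]

joined_rows = inner_join(features, labels, ["city", "year", "weekofyear"])
```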

In [None]:
print("df_train = ({}, {})".format(df_train.count(), len(df_train.columns)))

df_train.printSchema()

### Cleaning

In [None]:
df_train.select('year', 'weekofyear', 'week_start_date', 'precipitation_amt_mm', 'reanalysis_sat_precip_amt_mm').show(50)

In [None]:
#df_train = df_train.drop('precipitation_amt_mm', 'week_start_date')

In [None]:
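# The CSV files were read without schema inference, so every column is a string.
# Cast everything except the city label and the date to float.
for col_name in df_train.columns:
    if col_name not in ['city', 'week_start_date']:
        df_train = df_train.withColumn(col_name, df_train[col_name].cast('float'))

# Drop every row that has at least one missing value
df_train = df_train.dropna()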
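
The cleaning step above can be pictured in plain Python as follows: convert each string to a float, treat empty or non-numeric fields as missing, then keep only complete rows. This is a simplified sketch of the idea. The helpers `to_float` and `drop_incomplete` and the toy rows are assumptions for illustration.

```python
import math


def to_float(value):
    """Convert a CSV string to float; return None when it is empty or not numeric."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def drop_incomplete(rows):
    """Keep only rows in which every value is present and not NaN."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [row for row in rows if not any(is_missing(v) for v in row.values())]


# Toy rows, made up for illustration only
raw_rows = [
    {"city": "sj", "ndvi_ne": "0.12", "station_avg_temp_c": "25.4"},
    {"city": "sj", "ndvi_ne": "", "station_avg_temp_c": "26.1"},
    {"city": "iq", "ndvi_ne": "0.30", "station_avg_temp_c": "NaN"},
]

typed_rows = [
    {k: (v if k == "city" else to_float(v)) for k, v in row.items()}
    for row in raw_rows
]
clean_rows = drop_incomplete(typed_rows)
```

Dropping incomplete rows is the simplest option. It also discards whole weeks of observations, so imputing the gaps may be worth considering as an alternative.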

In [None]:
print("df_train after cleaning = ({}, {})".format(df_train.count(), len(df_train.columns)))

df_train.show()

### Data preparation

In [None]:
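# Map each city label to a numeric index
indexer = StringIndexer(inputCol='city',
                        outputCol='city_')

df_train = indexer.fit(df_train).transform(df_train)

# One-hot encode the index. Since Spark 3.0, OneHotEncoder is an estimator
# and must be fitted before transforming (on Spark 2.x, call transform directly).
encoder = OneHotEncoder(inputCol='city_', outputCol='cityVect')
df_train = encoder.fit(df_train).transform(df_train)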
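
As an illustration of what this encoding produces, the plain-Python sketch below indexes labels by descending frequency and then one-hot encodes them while dropping the last category. These two choices are assumed here to mirror the default settings of the Spark transformers. The helper names and the toy city list are made up for the example.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """One-hot encode an index; with drop_last the last category maps to all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Toy labels, made up for illustration only
cities = ["sj", "sj", "iq", "sj", "iq"]

mapping = fit_string_index(cities)
encoded = [one_hot(mapping[c], len(mapping)) for c in cities]
```

With only two cities, the encoded vector has a single component, which acts as a binary city indicator.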

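We create a new feature, `city_surface_px`, intended as a rough proxy for the "surface" of the city (in pixels). It is the product of the absolute NDVI differences along the two diagonals around the city centroid:

`city_surface_px = |ndvi_ne - ndvi_sw| × |ndvi_nw - ndvi_se|`

The working assumption is that a small `city_surface_px` means a lot of vegetation (and maybe a lot of mosquitoes!).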
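
As a worked example of this definition, the plain-Python sketch below computes the feature for a single row. The NDVI values are made up for illustration.

```python
def city_surface_px(ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw):
    """Product of the absolute NDVI differences along the two diagonals."""
    return abs(ndvi_ne - ndvi_sw) * abs(ndvi_nw - ndvi_se)


# |0.30 - 0.25| * |0.10 - 0.20| = 0.05 * 0.10
value = city_surface_px(ndvi_ne=0.30, ndvi_nw=0.10, ndvi_se=0.20, ndvi_sw=0.25)

# Identical NDVI in all four pixels gives a surface of zero
uniform = city_surface_px(0.2, 0.2, 0.2, 0.2)
```

The feature is zero whenever either diagonal pair has equal NDVI values, whatever their level.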

In [None]:
df_train = df_train.withColumn('city_surface_px', abs(df_train.ndvi_ne - df_train.ndvi_sw) * abs(df_train.ndvi_nw - df_train.ndvi_se))

In [None]:
df_train.select('city_surface_px').show(50)

In [None]:
lr_features = ['year', 'weekofyear', 'precipitation_amt_mm',
#               'ndvi_ne', 'ndvi_nw', 'ndvi_se', 'ndvi_sw',
               'city_surface_px',
               'reanalysis_air_temp_k','reanalysis_avg_temp_k',
               'reanalysis_dew_point_temp_k', 'reanalysis_max_air_temp_k',
               'reanalysis_min_air_temp_k', 'reanalysis_precip_amt_kg_per_m2',
               'reanalysis_relative_humidity_percent', 'reanalysis_sat_precip_amt_mm', 
               'reanalysis_specific_humidity_g_per_kg', 'reanalysis_tdtr_k',
               'station_avg_temp_c','station_diur_temp_rng_c',
               'station_max_temp_c', 'station_min_temp_c',
               'station_precip_mm', 'cityVect']

In [None]:
vectorAssembler = VectorAssembler(inputCols=lr_features, outputCol = 'features')

In [None]:
df_train_vectorised = vectorAssembler.transform(df_train)
df_train_vectorised.select('features').show(10)

In [None]:
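# Split first so that the scaler statistics are computed on the training rows only
train_raw, test_raw = df_train_vectorised.randomSplit([0.8, 0.2], seed=42)

In [None]:
scaler = StandardScaler(inputCol='features', outputCol='scaled_features',
                        withStd=True, withMean=True)

# Fit on the training split only to avoid leaking test-set statistics
scaler_model = scaler.fit(train_raw)

train = scaler_model.transform(train_raw)
test = scaler_model.transform(test_raw)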
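
Standardization rescales each feature to zero mean and unit standard deviation. The plain-Python sketch below shows the computation for one feature, with the statistics estimated on the training values and then reused unchanged on the test values. The helper names and the numbers are assumptions for illustration, and the sample standard deviation is used.

```python
from statistics import mean, stdev


def fit_scaler(values):
    """Estimate the mean and sample standard deviation of one feature."""
    return mean(values), stdev(values)


def transform(values, mu, sigma):
    """Apply z-score scaling with previously estimated statistics."""
    return [(v - mu) / sigma for v in values]


# Toy values, made up for illustration only
train_values = [2.0, 4.0, 6.0]
test_values = [8.0]

mu, sigma = fit_scaler(train_values)
train_scaled = transform(train_values, mu, sigma)
test_scaled = transform(test_values, mu, sigma)
```

Reusing the training statistics on the test rows keeps information about the test distribution out of the fitted pipeline.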

### Linear regression

In [None]:
lr = LinearRegression(featuresCol='scaled_features',
                      labelCol='total_cases')

In [None]:
model_lr = lr.fit(train)

In [None]:
pred_lr = model_lr.transform(test)

In [None]:
pred_lr.select(['total_cases','prediction']).show(150)

In [None]:
# model_lr.summary holds metrics computed on the training data, not on the test split
rmse = model_lr.summary.rootMeanSquaredError
r2 = model_lr.summary.r2

print("train rmse = {:.3f} / train r2 = {:.3f}".format(rmse, r2))

In [None]:
evaluator = RegressionEvaluator(labelCol='total_cases',
                                predictionCol='prediction',
                                metricName='mae')

mae_lr = evaluator.evaluate(pred_lr)

print("mae = {:.3f}".format(mae_lr))
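
For reference, the mean absolute error is the average of the absolute differences between observed and predicted case counts. The plain-Python sketch below shows the computation on made-up numbers. The function name `mae_sketch` is an assumption for this illustration.

```python
def mae_sketch(y_true, y_pred):
    """Mean absolute error between two equally long, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values: absolute errors are 2, 1, 0 and 4, so the mean is 7 / 4
mae_example = mae_sketch([10, 0, 5, 3], [8.0, 1.0, 5.0, 7.0])
```

Unlike RMSE, this metric weights all errors linearly, so a few badly missed outbreak weeks do not dominate the score as strongly.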

### Random forest

In [None]:
rf = RandomForestRegressor(featuresCol='scaled_features', labelCol='total_cases')

In [None]:
model_rf = rf.fit(train)

In [None]:
pred_rf = model_rf.transform(test)

In [None]:
mae_rf = evaluator.evaluate(pred_rf)

print("mae = {:.3f}".format(mae_rf))