

https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/

https://www.cdc.gov/dengue/

NOAA site set up for this challenge:

https://dengueforecasting.noaa.gov/

### Goal

Predict the number of dengue cases each week (in each location) based on environmental variables describing changes in temperature, precipitation, vegetation, and more.

In [4]:
# Load pyspark
import findspark

findspark.init()

from pyspark import SparkContext

# note: this import shadows Python's built-in abs() for the rest of the notebook
from pyspark.sql.functions import abs, to_date
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

In [5]:
# Other modules
import numpy as np

from sklearn.metrics import mean_absolute_error

In [6]:
# create the Spark session

spark = SparkSession.builder.appName("dengue").getOrCreate()

### Loading the data

In [8]:
path_to_data = "data/"

features_train = spark.read.csv(path_to_data + "dengue_features_train.csv",
                                header=True)

labels_train = spark.read.csv(path_to_data + "dengue_labels_train.csv",
                              header=True)

### Data description

Your goal is to predict the `total_cases` label for each (`city`, `year`, `weekofyear`) in the test set. 
There are two cities, *San Juan* and *Iquitos*, with test data for each city spanning 5 and 3 years respectively.
You will make one submission that contains predictions for both cities.
The data for each city have been concatenated along with a city column indicating the source: `sj` for San Juan and `iq` for Iquitos. 
The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data.
Throughout, missing values have been filled as `NaN`s.

#### The features in this dataset

You are provided the following set of information on a (`year`, `weekofyear`) timescale:

(Where appropriate, units are provided as a `_unit` suffix on the feature name.)

*City and date indicators*

 - `city` – City abbreviations: `sj` for San Juan and `iq` for Iquitos
 - `week_start_date` – Date given in yyyy-mm-dd format

*NOAA's GHCN daily climate data weather station measurements*

 - `station_max_temp_c` – Maximum temperature
 - `station_min_temp_c` – Minimum temperature
 - `station_avg_temp_c` – Average temperature
 - `station_precip_mm` – Total precipitation
 - `station_diur_temp_rng_c` – Diurnal temperature range
 
*PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)*

 - `precipitation_amt_mm` – Total precipitation

*NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)*

 - `reanalysis_sat_precip_amt_mm` – Total precipitation
 - `reanalysis_dew_point_temp_k` – Mean dew point temperature
 - `reanalysis_air_temp_k` – Mean air temperature
 - `reanalysis_relative_humidity_percent` – Mean relative humidity
 - `reanalysis_specific_humidity_g_per_kg` – Mean specific humidity
 - `reanalysis_precip_amt_kg_per_m2` – Total precipitation
 - `reanalysis_max_air_temp_k` – Maximum air temperature
 - `reanalysis_min_air_temp_k` – Minimum air temperature
 - `reanalysis_avg_temp_k` – Average air temperature
 - `reanalysis_tdtr_k` – Diurnal temperature range

*Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements*

 - `ndvi_se` – Pixel southeast of city centroid
 - `ndvi_sw` – Pixel southwest of city centroid
 - `ndvi_ne` – Pixel northeast of city centroid
 - `ndvi_nw` – Pixel northwest of city centroid


In [9]:
print("features_train = ({}, {})".format(features_train.count(), len(features_train.columns)))

features_train.printSchema()

features_train = (1456, 24)
root
 |-- city: string (nullable = true)
 |-- year: string (nullable = true)
 |-- weekofyear: string (nullable = true)
 |-- week_start_date: string (nullable = true)
 |-- ndvi_ne: string (nullable = true)
 |-- ndvi_nw: string (nullable = true)
 |-- ndvi_se: string (nullable = true)
 |-- ndvi_sw: string (nullable = true)
 |-- precipitation_amt_mm: string (nullable = true)
 |-- reanalysis_air_temp_k: string (nullable = true)
 |-- reanalysis_avg_temp_k: string (nullable = true)
 |-- reanalysis_dew_point_temp_k: string (nullable = true)
 |-- reanalysis_max_air_temp_k: string (nullable = true)
 |-- reanalysis_min_air_temp_k: string (nullable = true)
 |-- reanalysis_precip_amt_kg_per_m2: string (nullable = true)
 |-- reanalysis_relative_humidity_percent: string (nullable = true)
 |-- reanalysis_sat_precip_amt_mm: string (nullable = true)
 |-- reanalysis_specific_humidity_g_per_kg: string (nullable = true)
 |-- reanalysis_tdtr_k: string (nullable = true)
 |-- stati

In [10]:
print("labels_train = ({}, {})".format(
    labels_train.count(), len(labels_train.columns)))

labels_train.printSchema()

labels_train = (1456, 4)
root
 |-- city: string (nullable = true)
 |-- year: string (nullable = true)
 |-- weekofyear: string (nullable = true)
 |-- total_cases: string (nullable = true)



We join the two datasets on their shared keys (`city`, `year`, `weekofyear`).

In [11]:
df_train = features_train.join(labels_train, ['city', 'year', 'weekofyear'])
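
As a plain-Python illustration of what this inner join does (independent of Spark; the helper `inner_join` and the sample rows are made up for this sketch), rows from both tables are matched on the key tuple and merged:

```python
def inner_join(left, right, keys):
    """Inner-join two lists of dicts on the given key columns."""
    # Index the right-hand rows by their key tuple.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        # Rows without a matching key on the right are dropped.
        for match in index.get(tuple(row[k] for k in keys), []):
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined


# Hypothetical sample rows, not taken from the real files.
features = [
    {"city": "sj", "year": "1990", "weekofyear": "18", "ndvi_ne": "0.1"},
    {"city": "sj", "year": "1990", "weekofyear": "19", "ndvi_ne": "0.2"},
]
labels = [
    {"city": "sj", "year": "1990", "weekofyear": "18", "total_cases": "4"},
]
rows = inner_join(features, labels, ["city", "year", "weekofyear"])
print(rows)
```

Only the week present in both tables survives, so comparing row counts before and after the join is a cheap way to spot unmatched keys.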

In [12]:
print("df_train = ({}, {})".format(df_train.count(), len(df_train.columns)))

df_train.printSchema()

df_train = (1456, 25)
root
 |-- city: string (nullable = true)
 |-- year: string (nullable = true)
 |-- weekofyear: string (nullable = true)
 |-- week_start_date: string (nullable = true)
 |-- ndvi_ne: string (nullable = true)
 |-- ndvi_nw: string (nullable = true)
 |-- ndvi_se: string (nullable = true)
 |-- ndvi_sw: string (nullable = true)
 |-- precipitation_amt_mm: string (nullable = true)
 |-- reanalysis_air_temp_k: string (nullable = true)
 |-- reanalysis_avg_temp_k: string (nullable = true)
 |-- reanalysis_dew_point_temp_k: string (nullable = true)
 |-- reanalysis_max_air_temp_k: string (nullable = true)
 |-- reanalysis_min_air_temp_k: string (nullable = true)
 |-- reanalysis_precip_amt_kg_per_m2: string (nullable = true)
 |-- reanalysis_relative_humidity_percent: string (nullable = true)
 |-- reanalysis_sat_precip_amt_mm: string (nullable = true)
 |-- reanalysis_specific_humidity_g_per_kg: string (nullable = true)
 |-- reanalysis_tdtr_k: string (nullable = true)
 |-- station_avg

### Cleaning

In [13]:
df_train.select('year', 'weekofyear', 'week_start_date',
                'precipitation_amt_mm', 'reanalysis_sat_precip_amt_mm') \
    .show(50)

+----+----------+---------------+--------------------+----------------------------+
|year|weekofyear|week_start_date|precipitation_amt_mm|reanalysis_sat_precip_amt_mm|
+----+----------+---------------+--------------------+----------------------------+
|1990|        18|     1990-04-30|               12.42|                       12.42|
|1990|        19|     1990-05-07|               22.82|                       22.82|
|1990|        20|     1990-05-14|               34.54|                       34.54|
|1990|        21|     1990-05-21|               15.36|                       15.36|
|1990|        22|     1990-05-28|                7.52|                        7.52|
|1990|        23|     1990-06-04|                9.58|                        9.58|
|1990|        24|     1990-06-11|                3.48|                        3.48|
|1990|        25|     1990-06-18|              151.12|                      151.12|
|1990|        26|     1990-06-25|               19.32|                      

In [14]:
# The two columns 'precipitation_amt_mm' and 'reanalysis_sat_precip_amt_mm'
# hold the same values in the rows shown above, so we drop 'precipitation_amt_mm'

df_train = df_train.drop('precipitation_amt_mm')
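
The first 50 rows agree, but a check over every row is a safer basis for dropping a column. A minimal plain-Python sketch of such a check (the helper `columns_identical` is hypothetical; the three sample values are copied from the output above):

```python
def columns_identical(rows, col_a, col_b):
    """Return True if col_a and col_b hold the same value in every row."""
    # Missing values (None) only count as equal to other missing values.
    return all(row.get(col_a) == row.get(col_b) for row in rows)


rows = [
    {"precipitation_amt_mm": 12.42, "reanalysis_sat_precip_amt_mm": 12.42},
    {"precipitation_amt_mm": 22.82, "reanalysis_sat_precip_amt_mm": 22.82},
    {"precipitation_amt_mm": 34.54, "reanalysis_sat_precip_amt_mm": 34.54},
]
same = columns_identical(rows, "precipitation_amt_mm", "reanalysis_sat_precip_amt_mm")
print(same)  # True

# A single differing row is enough to make the check fail.
altered = rows + [{"precipitation_amt_mm": 1.0, "reanalysis_sat_precip_amt_mm": 2.0}]
differs = columns_identical(altered, "precipitation_amt_mm", "reanalysis_sat_precip_amt_mm")
print(differs)  # False
```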

In [15]:
# recast 'week_start_date' as a date. Nice to have for plotting
df_train = df_train.withColumn('week_start_date', to_date('week_start_date', 'yyyy-MM-dd'))

In [16]:
df_train.printSchema()

root
 |-- city: string (nullable = true)
 |-- year: string (nullable = true)
 |-- weekofyear: string (nullable = true)
 |-- week_start_date: date (nullable = true)
 |-- ndvi_ne: string (nullable = true)
 |-- ndvi_nw: string (nullable = true)
 |-- ndvi_se: string (nullable = true)
 |-- ndvi_sw: string (nullable = true)
 |-- reanalysis_air_temp_k: string (nullable = true)
 |-- reanalysis_avg_temp_k: string (nullable = true)
 |-- reanalysis_dew_point_temp_k: string (nullable = true)
 |-- reanalysis_max_air_temp_k: string (nullable = true)
 |-- reanalysis_min_air_temp_k: string (nullable = true)
 |-- reanalysis_precip_amt_kg_per_m2: string (nullable = true)
 |-- reanalysis_relative_humidity_percent: string (nullable = true)
 |-- reanalysis_sat_precip_amt_mm: string (nullable = true)
 |-- reanalysis_specific_humidity_g_per_kg: string (nullable = true)
 |-- reanalysis_tdtr_k: string (nullable = true)
 |-- station_avg_temp_c: string (nullable = true)
 |-- station_diur_temp_rng_c: string (null

In [17]:
# cast every column except 'city' and 'week_start_date' to float
for col_name in df_train.columns:
    if col_name not in ['city', 'week_start_date']:
        df_train = df_train.withColumn(col_name, df_train[col_name].cast('float'))

In [18]:
# drop rows that contain missing values
df_train = df_train.dropna()

In [19]:
# We could also replace NaN with the column mean
# TODO: write a UDF

# pandas-style sketch of a function that fills missing values with the mean
#def fillna_with_mean(df):
#    for col in df:
#        if col not in ['city']:
#            df[col].fillna(df[col].mean(), inplace=True)
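
The idea behind the commented-out function can be sketched in plain Python, outside Spark. The helper name `fill_missing_with_mean` and the sample rows are assumptions for illustration only; Spark ML also provides an `Imputer` transformer with a mean strategy that may be a better fit inside the pipeline.

```python
from statistics import mean


def fill_missing_with_mean(rows, columns):
    """Return copies of rows where None in `columns` is replaced by the column mean."""
    means = {}
    for col in columns:
        present = [row[col] for row in rows if row[col] is not None]
        # Leave the value missing if the whole column is empty.
        means[col] = mean(present) if present else None
    return [
        {k: (means[k] if k in means and v is None else v) for k, v in row.items()}
        for row in rows
    ]


# Hypothetical sample rows.
rows = [
    {"city": "sj", "station_avg_temp_c": 25.0},
    {"city": "sj", "station_avg_temp_c": None},
    {"city": "sj", "station_avg_temp_c": 27.0},
]
filled = fill_missing_with_mean(rows, ["station_avg_temp_c"])
print(filled[1]["station_avg_temp_c"])  # 26.0
```

Unlike `dropna()`, this keeps every week in the series, at the cost of inventing values for the gaps.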


In [20]:
#print("size of the data: ({}, {})".format(df_train.count(), len(df_train.columns)))

df_train.show()

+----+------+----------+---------------+---------+---------+---------+---------+---------------------+---------------------+---------------------------+-------------------------+-------------------------+-------------------------------+------------------------------------+----------------------------+-------------------------------------+-----------------+------------------+-----------------------+------------------+------------------+-----------------+-----------+
|city|  year|weekofyear|week_start_date|  ndvi_ne|  ndvi_nw|  ndvi_se|  ndvi_sw|reanalysis_air_temp_k|reanalysis_avg_temp_k|reanalysis_dew_point_temp_k|reanalysis_max_air_temp_k|reanalysis_min_air_temp_k|reanalysis_precip_amt_kg_per_m2|reanalysis_relative_humidity_percent|reanalysis_sat_precip_amt_mm|reanalysis_specific_humidity_g_per_kg|reanalysis_tdtr_k|station_avg_temp_c|station_diur_temp_rng_c|station_max_temp_c|station_min_temp_c|station_precip_mm|total_cases|
+----+------+----------+---------------+---------+---------+

### Data preparation

In [21]:
indexer = StringIndexer(inputCol='city',
                        outputCol='city_')

df_train = indexer.fit(df_train).transform(df_train)

encoder = OneHotEncoder(inputCol='city_', outputCol='cityVect')
df_train = encoder.transform(df_train)
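
To make the two steps concrete, here is a plain-Python sketch of what indexing followed by one-hot encoding produces. It assumes the Spark defaults (most frequent label gets index 0, last category dropped from the vector); the helper names and the alphabetical tie-break are choices of this sketch, not Spark's API.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (an assumption here).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 vector, optionally dropping the last category."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


cities = ["sj", "sj", "sj", "iq", "iq"]
mapping = string_index(cities)
encoded = [one_hot(mapping[c], len(mapping)) for c in cities]
print(mapping)  # {'sj': 0, 'iq': 1}
print(encoded)  # [[1.0], [1.0], [1.0], [0.0], [0.0]]
```

With only two cities and the last category dropped, the encoded vector has a single component: a binary city flag.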

In [22]:
df_train = df_train \
    .withColumn('ndvi_ne_abs', abs(df_train['ndvi_ne'])) \
    .withColumn('ndvi_nw_abs', abs(df_train['ndvi_nw'])) \
    .withColumn('ndvi_se_abs', abs(df_train['ndvi_se'])) \
    .withColumn('ndvi_sw_abs', abs(df_train['ndvi_sw']))

In [23]:
df_train \
    .select('ndvi_ne', 'ndvi_nw', 'ndvi_se', 'ndvi_se', 'ndvi_ne_abs', 'ndvi_nw_abs', 'ndvi_se_abs', 'ndvi_se_abs') \
    .show(50)

+---------+---------+----------+----------+-----------+-----------+-----------+-----------+
|  ndvi_ne|  ndvi_nw|   ndvi_se|   ndvi_se|ndvi_ne_abs|ndvi_nw_abs|ndvi_se_abs|ndvi_se_abs|
+---------+---------+----------+----------+-----------+-----------+-----------+-----------+
|   0.1226| 0.103725| 0.1984833| 0.1984833|     0.1226|   0.103725|  0.1984833|  0.1984833|
|   0.1699| 0.142175| 0.1623571| 0.1623571|     0.1699|   0.142175|  0.1623571|  0.1623571|
|  0.03225|0.1729667|    0.1572|    0.1572|    0.03225|  0.1729667|     0.1572|     0.1572|
|0.1286333|0.2450667| 0.2275571| 0.2275571|  0.1286333|  0.2450667|  0.2275571|  0.2275571|
|   0.1962|   0.2622|    0.2512|    0.2512|     0.1962|     0.2622|     0.2512|     0.2512|
|   0.1129|   0.0928| 0.2050714| 0.2050714|     0.1129|     0.0928|  0.2050714|  0.2050714|
|   0.0725|   0.0725| 0.1514714| 0.1514714|     0.0725|     0.0725|  0.1514714|  0.1514714|
|  0.10245| 0.146175| 0.1255714| 0.1255714|    0.10245|   0.146175|  0.1255714| 

In [24]:
lr_features = ['year', 'weekofyear',
               'ndvi_ne', 'ndvi_nw', 'ndvi_se', 'ndvi_sw',
               'reanalysis_air_temp_k','reanalysis_avg_temp_k',
               'reanalysis_dew_point_temp_k', 'reanalysis_max_air_temp_k',
               'reanalysis_min_air_temp_k', 'reanalysis_precip_amt_kg_per_m2',
               'reanalysis_relative_humidity_percent', 'reanalysis_sat_precip_amt_mm', 
               'reanalysis_specific_humidity_g_per_kg', 'reanalysis_tdtr_k',
               'station_avg_temp_c','station_diur_temp_rng_c',
               'station_max_temp_c', 'station_min_temp_c',
               'station_precip_mm', 'cityVect']

In [25]:
vectorAssembler = VectorAssembler(inputCols=lr_features, outputCol = 'features')

In [26]:
df_train_vectorised = vectorAssembler.transform(df_train)
df_train_vectorised.select('features').show(10)

+--------------------+
|            features|
+--------------------+
|[1990.0,18.0,0.12...|
|[1990.0,19.0,0.16...|
|[1990.0,20.0,0.03...|
|[1990.0,21.0,0.12...|
|[1990.0,22.0,0.19...|
|[1990.0,24.0,0.11...|
|[1990.0,25.0,0.07...|
|[1990.0,26.0,0.10...|
|[1990.0,28.0,0.19...|
|[1990.0,29.0,0.29...|
+--------------------+
only showing top 10 rows



In [27]:
scaler = StandardScaler(inputCol='features', outputCol="scaled_features",
                        withStd=True, withMean=True)

scaler_model = scaler.fit(df_train_vectorised)

df_train = scaler_model.transform(df_train_vectorised)
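
Standardization subtracts the mean of each feature and divides by its standard deviation. The plain-Python sketch below shows the arithmetic on one column, using the sample standard deviation. It also fits the statistics on training values only and reuses them on held-out values, which avoids leaking test-set statistics into the scaler; the notebook fits on the full data before splitting. Function names and numbers are hypothetical.

```python
from statistics import mean, stdev


def fit_scaler(column):
    """Compute the mean and sample standard deviation of a column."""
    return mean(column), stdev(column)


def transform(column, mu, sigma):
    """Center and scale a column with previously fitted statistics."""
    return [(x - mu) / sigma for x in column]


train_values = [1.0, 2.0, 3.0]
test_values = [4.0]

mu, sigma = fit_scaler(train_values)       # statistics from training data only
train_scaled = transform(train_values, mu, sigma)
test_scaled = transform(test_values, mu, sigma)
print(train_scaled)  # [-1.0, 0.0, 1.0]
print(test_scaled)   # [2.0]
```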

In [28]:
train, test = df_train.randomSplit([0.8, 0.2], seed=42)
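
Since the competition test set is a pure future hold-out, a chronological split may be a closer proxy for it than a random one. A minimal plain-Python sketch, assuming rows carry a `week_start_date` value (the helper `chronological_split` and the ten sample weeks are made up; with two cities, the split would probably be done per city):

```python
from datetime import date, timedelta


def chronological_split(rows, train_fraction=0.8, date_key="week_start_date"):
    """Put the earliest rows in the training set and the latest in the test set."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten hypothetical consecutive weeks.
start = date(1990, 4, 30)
rows = [{"week_start_date": start + timedelta(weeks=i), "total_cases": i} for i in range(10)]

train_rows, test_rows = chronological_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Every training week then precedes every test week, so the model is never evaluated on a week that lies between two weeks it was trained on.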

In [29]:
lr = LinearRegression(featuresCol='scaled_features',
                      labelCol='total_cases')

In [30]:
model_lr = lr.fit(train)

In [31]:
pred_lr = model_lr.transform(test)

In [32]:
pred_lr.select(['total_cases','prediction']).show(150)

+-----------+--------------------+
|total_cases|          prediction|
+-----------+--------------------+
|        0.0|   6.030431526858896|
|        0.0|   13.61870596353615|
|        0.0|   0.686327739799399|
|        0.0|   9.005785253557189|
|        1.0|  1.8895961110443125|
|        0.0|   9.224592653304002|
|        0.0| -1.7899907901514638|
|        0.0|   5.356461604192749|
|        0.0|  -5.319194196427752|
|        0.0|   4.354922355980015|
|        0.0|   6.400502280258307|
|       16.0|  1.9361692674875002|
|       10.0|  16.046465116206647|
|       10.0|     17.718635388772|
|        4.0|   5.615116723911779|
|        5.0| -2.2837708111909585|
|        0.0|  25.433327360697934|
|        1.0|   6.572743171989659|
|        1.0|  0.5675129046281917|
|        1.0|   6.989582962190848|
|        0.0| -2.9075975638428737|
|        2.0|   7.241111595167787|
|        5.0|   24.86825683072162|
|        8.0|  13.240566683100967|
|        3.0|  -5.928593578295011|
|        3.0|  2.736
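
Several predictions above are negative, which cannot happen for a case count. One simple post-processing step, sketched here in plain Python with three values copied from the output above, is to round to the nearest integer and clip at zero (the helper name `to_case_counts` is hypothetical):

```python
def to_case_counts(predictions):
    """Round predictions to integers and replace negatives with zero."""
    # Note: Python's round() sends exact halves to the nearest even integer.
    return [max(0, round(p)) for p in predictions]


raw = [6.030431526858896, -1.7899907901514638, 0.686327739799399]
counts = to_case_counts(raw)
print(counts)  # [6, 0, 1]
```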

In [33]:
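# model_lr.summary reports metrics on the training data;
# the test-set RMSE is computed separately with an evaluator
rmse = model_lr.summary.rootMeanSquaredError
r2 = model_lr.summary.r2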

print("rmse = {:.3f} / r2 = {:.3f}".format(rmse, r2))

rmse = 27.145 / r2 = 0.260


In [34]:
evaluator = RegressionEvaluator(labelCol='total_cases',
                                predictionCol='prediction',
                                metricName='rmse')

rmse_lr = evaluator.evaluate(pred_lr)

print("rmse = {:.3f}".format(rmse_lr))

rmse = 25.391
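
The `mean_absolute_error` import at the top hints that MAE is the metric of interest, and `RegressionEvaluator` also accepts `metricName='mae'`. The plain-Python sketch below shows how the two metrics differ on the same errors; the numbers are made up, and `math.fabs` is used because the built-in `abs` is shadowed in this notebook.

```python
from math import fabs, sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(fabs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical labels and predictions.
y_true = [0.0, 10.0, 4.0]
y_pred = [6.0, 16.0, 4.0]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mae_value)   # 4.0
print(rmse_value)  # sqrt(24), about 4.9
```

RMSE weights large errors more heavily than MAE, so outbreak weeks with big misses dominate it.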


### Random forest

In [35]:
rf = RandomForestRegressor(featuresCol='scaled_features', labelCol='total_cases')

In [36]:
model_rf = rf.fit(train)

In [37]:
pred_rf = model_rf.transform(test)

In [38]:
rmse_rf = evaluator.evaluate(pred_rf)

print("rmse = {:.3f}".format(rmse_rf))

rmse = 22.958


### Logistic regression


In [34]:
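# Note: LogisticRegression is a classifier, so each distinct total_cases
# value is treated as a separate class label here.
logr = LogisticRegression(featuresCol='scaled_features',
                          labelCol='total_cases')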

model_logr = logr.fit(train)

pred_logr = model_logr.transform(test)

rmse_logr = evaluator.evaluate(pred_logr)

print("rmse = {:.3f}".format(rmse_logr))

rmse = 34.978


In [35]:
# Create ParamGrid for cross validation.
# The grid is built from the params of the estimator being tuned (logr).
paramGrid = ParamGridBuilder() \
        .addGrid(logr.regParam, [0.1, 0.01]) \
        .addGrid(logr.fitIntercept, [False, True]) \
        .addGrid(logr.elasticNetParam, [0.0, 0.5, 1.0]) \
        .build()

cv = CrossValidator(estimator=logr,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=5)

# Run cross-validation, and choose the best set of parameters.
model_cv = cv.fit(train)

# Make predictions on test data. model_cv uses the parameter combination
# that performed best during cross-validation.
predictions_cv = model_cv.transform(test)

Py4JJavaError: An error occurred while calling o607.cache.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.reset(TreeNode.scala:61)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
	at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1$$anonfun$apply$5.apply(LogicalPlan.scala:254)
	at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1$$anonfun$apply$5.apply(LogicalPlan.scala:254)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.sql.catalyst.expressions.ExpressionSet.foreach(ExpressionSet.scala:55)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at org.apache.spark.sql.catalyst.expressions.ExpressionSet.scala$collection$SetLike$$super$map(ExpressionSet.scala:55)
	at scala.collection.SetLike$class.map(SetLike.scala:92)
	at org.apache.spark.sql.catalyst.expressions.ExpressionSet.map(ExpressionSet.scala:55)
	at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:254)
	at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:249)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
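
Part of the cost of this cell can be seen by counting model fits. The plain-Python sketch below enumerates the same grid values with `itertools.product` (the dictionary layout is just for illustration, not the Spark API). Each combination is trained once per fold, and the best one is then refit once on the full training set.

```python
from itertools import product

# Same values as in the ParamGridBuilder call above.
grid = {
    "regParam": [0.1, 0.01],
    "fitIntercept": [False, True],
    "elasticNetParam": [0.0, 0.5, 1.0],
}
num_folds = 5

# Cartesian product of all parameter values.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
cv_fits = len(combinations) * num_folds
total_fits = cv_fits + 1  # plus the final refit of the best combination

print(len(combinations), cv_fits, total_fits)  # 12 60 61
```

A smaller grid or fewer folds reduces the work proportionally; raising the driver memory (the `spark.driver.memory` setting) is another option that might help with the error above.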


In [None]:
result_cv = evaluator.evaluate(predictions_cv)

print("CV::rmse = {:.4f}".format(result_cv))