

https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/

https://www.cdc.gov/dengue/

### Goal

Predict the number of dengue cases each week (in each location) based on environmental variables describing changes in temperature, precipitation, vegetation, and more.

In [76]:
import findspark
findspark.init()

import numpy as np

from sklearn.metrics import mean_absolute_error

from pyspark import SparkContext

from pyspark.sql.functions import abs  # note: shadows Python's built-in abs in this notebook

from pyspark.sql import SparkSession

from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.classification import LogisticRegression

from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder, StandardScaler

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator

In [31]:
spark = SparkSession.builder.appName("dengue").getOrCreate()

### Loading the data

In [32]:
path_to_data = "data/"


features_train = spark.read.csv(path_to_data + "dengue_features_train.csv", header=True)
#df_features['month'] = df_features['week_start_date'][5:7]

labels_train = spark.read.csv(path_to_data + "dengue_labels_train.csv", header=True)

### Data description

Your goal is to predict the total_cases label for each (`city`, `year`, `weekofyear`) in the test set. 
There are two cities, *San Juan* and *Iquitos*, with test data for each city spanning 5 and 3 years respectively.
You will make one submission that contains predictions for both cities.
The data for each city have been concatenated along with a city column indicating the source: `sj` for San Juan and `iq` for Iquitos. 
The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data.
Throughout, missing values have been filled as `NaN`s.

#### The features in this dataset

You are provided the following set of information on a (`year`, `weekofyear`) timescale:

(Where appropriate, units are provided as a `_unit` suffix on the feature name.)

*City and date indicators*

 - `city` – City abbreviations: `sj` for San Juan and `iq` for Iquitos
 - `week_start_date` – Date given in yyyy-mm-dd format

*NOAA's GHCN daily climate data weather station measurements*

 - `station_max_temp_c` – Maximum temperature
 - `station_min_temp_c` – Minimum temperature
 - `station_avg_temp_c` – Average temperature
 - `station_precip_mm` – Total precipitation
 - `station_diur_temp_rng_c` – Diurnal temperature range
 
*PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)*

 - `precipitation_amt_mm` – Total precipitation

*NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)*

 - `reanalysis_sat_precip_amt_mm` – Total precipitation
 - `reanalysis_dew_point_temp_k` – Mean dew point temperature
 - `reanalysis_air_temp_k` – Mean air temperature
 - `reanalysis_relative_humidity_percent` – Mean relative humidity
 - `reanalysis_specific_humidity_g_per_kg` – Mean specific humidity
 - `reanalysis_precip_amt_kg_per_m2` – Total precipitation
 - `reanalysis_max_air_temp_k` – Maximum air temperature
 - `reanalysis_min_air_temp_k` – Minimum air temperature
 - `reanalysis_avg_temp_k` – Average air temperature
 - `reanalysis_tdtr_k` – Diurnal temperature range

*Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements*

 - `ndvi_se` – Pixel southeast of city centroid
 - `ndvi_sw` – Pixel southwest of city centroid
 - `ndvi_ne` – Pixel northeast of city centroid
 - `ndvi_nw` – Pixel northwest of city centroid


In [33]:
print("features_train = ({}, {})".format(features_train.count(), len(features_train.columns)))

features_train.printSchema()

features_train = (1456, 24)
root
 |-- city: string (nullable = true)
 |-- year: string (nullable = true)
 |-- weekofyear: string (nullable = true)
 |-- week_start_date: string (nullable = true)
 |-- ndvi_ne: string (nullable = true)
 |-- ndvi_nw: string (nullable = true)
 |-- ndvi_se: string (nullable = true)
 |-- ndvi_sw: string (nullable = true)
 |-- precipitation_amt_mm: string (nullable = true)
 |-- reanalysis_air_temp_k: string (nullable = true)
 |-- reanalysis_avg_temp_k: string (nullable = true)
 |-- reanalysis_dew_point_temp_k: string (nullable = true)
 |-- reanalysis_max_air_temp_k: string (nullable = true)
 |-- reanalysis_min_air_temp_k: string (nullable = true)
 |-- reanalysis_precip_amt_kg_per_m2: string (nullable = true)
 |-- reanalysis_relative_humidity_percent: string (nullable = true)
 |-- reanalysis_sat_precip_amt_mm: string (nullable = true)
 |-- reanalysis_specific_humidity_g_per_kg: string (nullable = true)
 |-- reanalysis_tdtr_k: string (nullable = true)
 |-- stati

In [34]:
print("labels_train = ({}, {})".format(labels_train.count(), len(labels_train.columns)))

labels_train.printSchema()

labels_train = (1456, 4)
root
 |-- city: string (nullable = true)
 |-- year: string (nullable = true)
 |-- weekofyear: string (nullable = true)
 |-- total_cases: string (nullable = true)



We join the two datasets on their shared keys (`city`, `year`, `weekofyear`).

In [35]:
df_train = features_train.join(labels_train, ['city', 'year', 'weekofyear'])
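
The row count is the same before and after the join (1456, as the counts in this notebook show), which is what one expects when each (`city`, `year`, `weekofyear`) key appears once on each side. A minimal plain-Python sketch of that uniqueness check, using made-up rows and a hypothetical helper `duplicate_keys`:

```python
from collections import Counter

def duplicate_keys(rows, key_fields=("city", "year", "weekofyear")):
    """Return the join keys that occur more than once in `rows` (a list of dicts)."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Made-up rows for illustration only (not taken from the dataset files).
features_rows = [
    {"city": "sj", "year": "1990", "weekofyear": "18"},
    {"city": "sj", "year": "1990", "weekofyear": "19"},
    {"city": "iq", "year": "2000", "weekofyear": "26"},
]
print(duplicate_keys(features_rows))  # an empty list means the keys are unique
```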

In [36]:
print("df_train = ({}, {})".format(df_train.count(), len(df_train.columns)))

df_train.printSchema()

df_train = (1456, 25)
root
 |-- city: string (nullable = true)
 |-- year: string (nullable = true)
 |-- weekofyear: string (nullable = true)
 |-- week_start_date: string (nullable = true)
 |-- ndvi_ne: string (nullable = true)
 |-- ndvi_nw: string (nullable = true)
 |-- ndvi_se: string (nullable = true)
 |-- ndvi_sw: string (nullable = true)
 |-- precipitation_amt_mm: string (nullable = true)
 |-- reanalysis_air_temp_k: string (nullable = true)
 |-- reanalysis_avg_temp_k: string (nullable = true)
 |-- reanalysis_dew_point_temp_k: string (nullable = true)
 |-- reanalysis_max_air_temp_k: string (nullable = true)
 |-- reanalysis_min_air_temp_k: string (nullable = true)
 |-- reanalysis_precip_amt_kg_per_m2: string (nullable = true)
 |-- reanalysis_relative_humidity_percent: string (nullable = true)
 |-- reanalysis_sat_precip_amt_mm: string (nullable = true)
 |-- reanalysis_specific_humidity_g_per_kg: string (nullable = true)
 |-- reanalysis_tdtr_k: string (nullable = true)
 |-- station_avg

### Cleaning

In [37]:
df_train.select('year', 'weekofyear', 'week_start_date', 'precipitation_amt_mm', 'reanalysis_sat_precip_amt_mm').show(50)

+----+----------+---------------+--------------------+----------------------------+
|year|weekofyear|week_start_date|precipitation_amt_mm|reanalysis_sat_precip_amt_mm|
+----+----------+---------------+--------------------+----------------------------+
|1990|        18|     1990-04-30|               12.42|                       12.42|
|1990|        19|     1990-05-07|               22.82|                       22.82|
|1990|        20|     1990-05-14|               34.54|                       34.54|
|1990|        21|     1990-05-21|               15.36|                       15.36|
|1990|        22|     1990-05-28|                7.52|                        7.52|
|1990|        23|     1990-06-04|                9.58|                        9.58|
|1990|        24|     1990-06-11|                3.48|                        3.48|
|1990|        25|     1990-06-18|              151.12|                      151.12|
|1990|        26|     1990-06-25|               19.32|                      

In [38]:
#df_train = df_train.drop('precipitation_amt_mm', 'week_start_date')

In [39]:
for col_name in df_train.columns:
    if col_name not in ['city', 'week_start_date']:
        df_train = df_train.withColumn(col_name, df_train[col_name].cast('float'))

df_train = df_train.dropna()
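
`dropna()` removes every row that has at least one missing value, so the weekly series ends up with gaps (the `features` output further down jumps from week 22 to week 24). One possible alternative, not used in this notebook, is to carry the last observed value forward. A plain-Python sketch with a hypothetical helper `forward_fill` and made-up values:

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value.

    Leading None entries stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

weekly_temps = [25.4, None, 26.1, None, None]  # made-up values
print(forward_fill(weekly_temps))  # [25.4, 25.4, 26.1, 26.1, 26.1]
```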

In [40]:
#print("size of the data: {}".format(df_train.shape()))

df_train.show()

+----+------+----------+---------------+---------+---------+---------+---------+--------------------+---------------------+---------------------+---------------------------+-------------------------+-------------------------+-------------------------------+------------------------------------+----------------------------+-------------------------------------+-----------------+------------------+-----------------------+------------------+------------------+-----------------+-----------+
|city|  year|weekofyear|week_start_date|  ndvi_ne|  ndvi_nw|  ndvi_se|  ndvi_sw|precipitation_amt_mm|reanalysis_air_temp_k|reanalysis_avg_temp_k|reanalysis_dew_point_temp_k|reanalysis_max_air_temp_k|reanalysis_min_air_temp_k|reanalysis_precip_amt_kg_per_m2|reanalysis_relative_humidity_percent|reanalysis_sat_precip_amt_mm|reanalysis_specific_humidity_g_per_kg|reanalysis_tdtr_k|station_avg_temp_c|station_diur_temp_rng_c|station_max_temp_c|station_min_temp_c|station_precip_mm|total_cases|
+----+------+-----

### Data preparation

In [41]:
indexer = StringIndexer(inputCol='city',
                        outputCol='city_')

df_train = indexer.fit(df_train).transform(df_train)

encoder = OneHotEncoder(inputCol='city_', outputCol='cityVect')
df_train = encoder.transform(df_train)
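
For reference, `StringIndexer` by default assigns index 0 to the most frequent label, and `OneHotEncoder` by default drops the last category, so with two cities `cityVect` is a one-element vector. A plain-Python sketch of that behavior (the helper names are hypothetical, and this mimics the defaults rather than calling Spark):

```python
from collections import Counter

def index_by_frequency(labels):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, n_categories):
    """One-hot encode `index`, omitting the slot of the last category."""
    return [1.0 if i == index else 0.0 for i in range(n_categories - 1)]

cities = ["sj", "sj", "iq"]  # made-up sample
mapping = index_by_frequency(cities)
print(mapping)
print([one_hot_drop_last(mapping[c], len(mapping)) for c in cities])
```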

We create a new feature characterizing the "surface" of the city (in pixels), defined by:

city_surface_px = |ndvi_ne - ndvi_sw| × |ndvi_nw - ndvi_se|

The working assumption is that a small `city_surface_px` means a lot of vegetation (and maybe a lot of mosquitoes!).
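
As a quick arithmetic check of this definition, here is a plain-Python sketch that evaluates the formula on the NDVI values of the first row displayed in this notebook (the function name is only for illustration):

```python
def city_surface_px(ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw):
    """|ndvi_ne - ndvi_sw| * |ndvi_nw - ndvi_se|, as defined in the text."""
    return abs(ndvi_ne - ndvi_sw) * abs(ndvi_nw - ndvi_se)

# NDVI values of the first row displayed in this notebook.
value = city_surface_px(0.1226, 0.103725, 0.1984833, 0.1776167)
print(round(value, 7))  # 0.0052133
```

The result agrees with the `city_surface_px` value Spark reports for that row, up to float precision.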

In [42]:
df_train = df_train.withColumn('city_surface_px', abs(df_train.ndvi_ne - df_train.ndvi_sw) * abs(df_train.ndvi_nw - df_train.ndvi_se))

In [43]:
df_train.select('ndvi_ne', 'ndvi_nw', 'ndvi_se', 'ndvi_sw', 'city_surface_px').show(50)

+---------+---------+----------+---------+---------------+
|  ndvi_ne|  ndvi_nw|   ndvi_se|  ndvi_sw|city_surface_px|
+---------+---------+----------+---------+---------------+
|   0.1226| 0.103725| 0.1984833|0.1776167|   0.0052132895|
|   0.1699| 0.142175| 0.1623571|0.1554857|    2.909108E-4|
|  0.03225|0.1729667|    0.1572|0.1708429|    0.002185154|
|0.1286333|0.2450667| 0.2275571|0.2358857|   0.0018779475|
|   0.1962|   0.2622|    0.2512|  0.24734|    5.625403E-4|
|   0.1129|   0.0928| 0.2050714|0.2102714|    0.010932025|
|   0.0725|   0.0725| 0.1514714|0.1330286|   0.0047800285|
|  0.10245| 0.146175| 0.1255714|   0.1236|    4.357661E-4|
| 0.192875|  0.08235| 0.1919429|0.1529286|    0.004377841|
|   0.2916|   0.2118|    0.3012|0.2806667|    9.774353E-4|
|0.1505667|   0.1717|    0.2269|0.2145571|   0.0035322697|
|0.1902333|   0.1688| 0.1676571|0.1722857|   2.0512118E-5|
|   0.2529|  0.33075| 0.2641714|0.2843143|     0.00209152|
|   0.2354| 0.200025| 0.2838167|0.2304429|    4.153646E-

In [48]:
lr_features = ['year', 'weekofyear', 'precipitation_amt_mm',
#               'ndvi_ne', 'ndvi_nw', 'ndvi_se', 'ndvi_sw',
               'city_surface_px',
               'reanalysis_air_temp_k','reanalysis_avg_temp_k',
               'reanalysis_dew_point_temp_k', 'reanalysis_max_air_temp_k',
               'reanalysis_min_air_temp_k', 'reanalysis_precip_amt_kg_per_m2',
               'reanalysis_relative_humidity_percent', 'reanalysis_sat_precip_amt_mm', 
               'reanalysis_specific_humidity_g_per_kg', 'reanalysis_tdtr_k',
               'station_avg_temp_c','station_diur_temp_rng_c',
               'station_max_temp_c', 'station_min_temp_c',
               'station_precip_mm', 'cityVect']

In [49]:
vectorAssembler = VectorAssembler(inputCols=lr_features, outputCol = 'features')

In [50]:
df_train_vectorised = vectorAssembler.transform(df_train)
df_train_vectorised.select('features').show(10)

+--------------------+
|            features|
+--------------------+
|[1990.0,18.0,12.4...|
|[1990.0,19.0,22.8...|
|[1990.0,20.0,34.5...|
|[1990.0,21.0,15.3...|
|[1990.0,22.0,7.51...|
|[1990.0,24.0,3.48...|
|[1990.0,25.0,151....|
|[1990.0,26.0,19.3...|
|[1990.0,28.0,22.2...|
|[1990.0,29.0,59.1...|
+--------------------+
only showing top 10 rows



In [51]:
scaler = StandardScaler(inputCol='features', outputCol="scaled_features",
                        withStd=True, withMean=True)

scaler_model = scaler.fit(df_train_vectorised)

df_train = scaler_model.transform(df_train_vectorised)
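
With `withMean=True` and `withStd=True`, each feature is centered and divided by its standard deviation. A plain-Python sketch of that transform for a single column, assuming the sample (n - 1) standard deviation and using made-up values; `standardize` is a hypothetical helper, not part of Spark:

```python
from statistics import mean, stdev

def standardize(column):
    """Center a column on its mean and scale it to unit sample standard deviation."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (n - 1 in the denominator)
    return [(x - mu) / sigma for x in column]

precip = [10.0, 20.0, 30.0]  # made-up values
print(standardize(precip))  # [-1.0, 0.0, 1.0]
```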

In [52]:
train, test = df_train.randomSplit([0.8, 0.2], seed=42)
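
`randomSplit` mixes weeks from all years into both sets. Since the competition's test set is described above as a pure future hold-out, a chronological split may give a more comparable estimate. A plain-Python sketch, with made-up rows and a hypothetical helper `chronological_split`; this is an alternative, not what the notebook does:

```python
def chronological_split(rows, train_fraction=0.8):
    """Sort rows by (year, weekofyear) and cut once, so the test part is strictly later."""
    ordered = sorted(rows, key=lambda r: (r["year"], r["weekofyear"]))
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up rows for illustration only.
rows = [{"year": 1990 + i // 2, "weekofyear": 1 + 26 * (i % 2)} for i in range(10)]
train_rows, test_rows = chronological_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

With two cities in the data, the cut would presumably be applied to each city separately.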

In [53]:
lr = LinearRegression(featuresCol='scaled_features',
                      labelCol='total_cases')

In [54]:
model_lr = lr.fit(train)

In [55]:
pred_lr = model_lr.transform(test)

In [56]:
pred_lr.select(['total_cases','prediction']).show(150)

+-----------+--------------------+
|total_cases|          prediction|
+-----------+--------------------+
|        0.0|   4.969937645790111|
|        0.0|  12.736308121544333|
|        0.0|  1.6810110840208417|
|        0.0|   7.865110482907976|
|        1.0|  3.4411927983483643|
|        0.0|   8.320181079278457|
|        0.0| -2.9654061097641033|
|        0.0|   3.670370987111262|
|        0.0| -6.0720470155882005|
|        0.0|   5.267777363662752|
|        0.0|   6.059438923459071|
|       16.0|   2.338261469779983|
|       10.0|  16.550776887452507|
|       10.0|  17.168606128971074|
|        4.0|   3.604000336602713|
|        5.0|-0.01393082746564...|
|        0.0|  24.606173849548057|
|        1.0|   6.846513886253799|
|        1.0| -0.4476863516008507|
|        1.0|   6.463875073252641|
|        0.0|  -3.462301697215775|
|        2.0|   6.889124704454858|
|        5.0|   22.77945877476527|
|        8.0|  14.224351646477183|
|        3.0|   -8.00171090451801|
|        3.0|  3.724
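
Several predictions above are negative, which cannot be a case count. A simple post-processing step, not applied in this notebook, is to clip at zero and round to whole cases. A plain-Python sketch (`to_case_counts` is a hypothetical helper):

```python
def to_case_counts(predictions):
    """Clip predictions at zero and round them to the nearest integer."""
    return [max(0, round(p)) for p in predictions]

# Three predictions copied from the output above.
print(to_case_counts([4.969937645790111, -2.9654061097641033, 16.550776887452507]))
# [5, 0, 17]
```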

In [57]:
# model_lr.summary reports metrics on the training data, not on the held-out test set
rmse = model_lr.summary.rootMeanSquaredError
r2 = model_lr.summary.r2

print("rmse = {:.3f} / r2 = {:.3f}".format(rmse, r2))

rmse = 27.151 / r2 = 0.259


In [67]:
evaluator = RegressionEvaluator(labelCol='total_cases',
                                predictionCol='prediction',
                                metricName='rmse')

rmse_lr = evaluator.evaluate(pred_lr)

print("rmse = {:.3f}".format(rmse_lr))

rmse = 25.434
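
The two RMSE values differ because `model_lr.summary` is computed on the training data, while the evaluator scores the held-out `test` set. The notebook also imports `mean_absolute_error` without using it. In case MAE is wanted as well, here is a plain-Python sketch of both metrics on made-up numbers (the helper names are hypothetical):

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in cases."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0.0, 0.0]  # made-up values
y_pred = [3.0, 4.0]
print(round(rmse(y_true, y_pred), 4), mae(y_true, y_pred))  # 3.5355 3.5
```

In Spark, `RegressionEvaluator` also accepts `metricName='mae'`.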


### Random forest

In [69]:
rf = RandomForestRegressor(featuresCol='scaled_features', labelCol='total_cases')

In [70]:
model_rf = rf.fit(train)

In [71]:
pred_rf = model_rf.transform(test)

In [72]:
rmse_rf = evaluator.evaluate(pred_rf)

print("rmse = {:.3f}".format(rmse_rf))

rmse = 23.168


### Logistic regression


In [74]:
# LogisticRegression is a classifier: each distinct total_cases value is treated as a class label
logr = LogisticRegression(featuresCol='scaled_features',
                          labelCol='total_cases')

model_logr = logr.fit(train)

pred_logr = model_logr.transform(test)

rmse_logr = evaluator.evaluate(pred_logr)

print("rmse = {:.3f}".format(rmse_logr))

rmse = 44.594


In [77]:
# Create ParamGrid for cross validation (parameters of the linear regression `lr`)
paramGrid = ParamGridBuilder() \
        .addGrid(lr.regParam, [0.1, 0.01]) \
        .addGrid(lr.fitIntercept, [False, True]) \
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
        .build()

# The grid is built from lr's parameters, so lr must also be the estimator
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=5)

# Run cross-validation, and choose the best set of parameters.
model_cv = cv.fit(train)

# Make predictions on test data. model_cv applies the model with the combination
# of parameters that performed best.
predictions_cv = model_cv.transform(test)

Py4JJavaError: An error occurred while calling o1654.cache.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
	at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1$$anonfun$apply$5.apply(LogicalPlan.scala:254)
	at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1$$anonfun$apply$5.apply(LogicalPlan.scala:254)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.sql.catalyst.expressions.ExpressionSet.foreach(ExpressionSet.scala:55)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at org.apache.spark.sql.catalyst.expressions.ExpressionSet.scala$collection$SetLike$$super$map(ExpressionSet.scala:55)
	at scala.collection.SetLike$class.map(SetLike.scala:92)
	at org.apache.spark.sql.catalyst.expressions.ExpressionSet.map(ExpressionSet.scala:55)
	at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:254)
	at org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:249)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
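
The grid has 2 × 2 × 3 = 12 parameter combinations, so 5-fold cross-validation fits 60 models before refitting the best one. A plain-Python sketch that enumerates the same grid (the dictionary below simply restates the values passed to `ParamGridBuilder`):

```python
from itertools import product

grid = {
    "regParam": [0.1, 0.01],
    "fitIntercept": [False, True],
    "elasticNetParam": [0.0, 0.5, 1.0],
}
num_folds = 5

# Every combination of one value per parameter, as dictionaries.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations), len(combinations) * num_folds)  # 12 60
```

Whether the number of fits is related to the `OutOfMemoryError` above cannot be told from the traceback alone.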


In [None]:
result_cv = evaluator.evaluate(predictions_cv)

print("CV::rmse = {:.4f}".format(result_cv))