# Neural Networks for Predicting Covid-19 Morbidity

## Status Update
## February 16, 2022

#### Evan Falkowski, Noah Krieger, Richard Strouss-Rooney, Alex Van Kooy

$\rule{10in}{0.4pt}$

## Overview

* The Dataset: Nearly two years of Covid-19 data for the US by county, including demographics, pollution, and weather
* The Task: Predict the next day and the next five days from the prior 30 days
* The Approach: Design an optimal neural network using the TensorFlow HParams Dashboard
* The Goal: Compare the performance of the neural network to other statistical and classical machine learning approaches



## The Dataset

* Golden data set of 1,879,589 observations
* Cleaned up in DSCI591
* Removed textual data (county names, state names)
* Standardized data

### Geo-Encoding of Counties
* Converted FIPS codes to latitude and longitude of the centroid of each county
* Used a binning technique to place each county in a bin
    * tf.feature_column.bucketized_column
    * tf.feature_column.crossed_column
    * tf.feature_column.embedding_column
    * tf.keras.layers.DenseFeatures
* Encoded into a 9-dimensional vector


    import numpy as np
    import tensorflow as tf

    # 100 evenly spaced bucket boundaries spanning the observed coordinates
    lat_buckets = list(np.linspace(df.latitude.min(), df.latitude.max(), 100))
    long_buckets = list(np.linspace(df.longitude.min(), df.longitude.max(), 100))

    # bucketized feature columns for latitude and longitude
    lat_fc = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('latitude'), lat_buckets)
    long_fc = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('longitude'), long_buckets)

    # crossing the bucketized columns gives one feature per (lat, long) grid cell,
    # hashed into 1000 buckets (no precise rule for this size; 1000 is a starting guess)
    crossed_latlong = tf.feature_column.crossed_column(keys=[lat_fc, long_fc], hash_bucket_size=1000)

    # map each hashed grid cell to a 9-dimensional embedding vector
    embedded_latlong = tf.feature_column.embedding_column(crossed_latlong, 9)

    feature_layer = tf.keras.layers.DenseFeatures(embedded_latlong)

    # convert the resulting tensor to a NumPy array before storing it (assumes eager execution)
    geo_cols = [f'geo{i}' for i in range(9)]
    df[geo_cols] = feature_layer({'latitude': df.latitude, 'longitude': df.longitude}).numpy()
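
To make the bucketize, cross, and embed steps concrete, here is a minimal pure-Python sketch of the same idea. It is an illustration under stated assumptions, not the TensorFlow implementation: the bounding box, the helper names (`linspace`, `bucketize`, `cell_id`, `encode`), and the modulo used in place of the hashing done by `crossed_column` are all hypothetical, and the embedding table is simply filled with random numbers.

```python
import bisect
import random


def linspace(lo, hi, n):
    """Return n evenly spaced boundaries from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def bucketize(value, boundaries):
    """Index of the bucket holding value; each bucket includes its left boundary."""
    return bisect.bisect_right(boundaries, value)


# Hypothetical bounding box, roughly the contiguous US (not taken from the dataset).
lat_bounds = linspace(24.0, 50.0, 100)
lon_bounds = linspace(-125.0, -66.0, 100)

N_CELLS = 1000   # plays the role of hash_bucket_size
EMBED_DIM = 9    # length of the geo vector


def cell_id(lat, lon):
    """Combine the two bucket indices into one of N_CELLS ids.

    The modulo is a stand-in for the hashing in crossed_column, not the same hash.
    """
    n_lon_buckets = len(lon_bounds) + 1
    combined = bucketize(lat, lat_bounds) * n_lon_buckets + bucketize(lon, lon_bounds)
    return combined % N_CELLS


# Randomly initialised lookup table: one EMBED_DIM-vector per cell id.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(N_CELLS)]


def encode(lat, lon):
    """Look up the embedding vector for the grid cell containing (lat, lon)."""
    return embedding_table[cell_id(lat, lon)]


philadelphia = encode(39.95, -75.16)
```

Two centroids that fall in the same grid cell receive the same vector. Because 101 x 101 possible cells are folded into 1000 ids, distant cells can also collide, which is the trade-off behind the choice of `hash_bucket_size`.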


### Time Encoding

#### Goal: represent both the cyclical (seasons) and continuous nature of time

Let $t_i$ represent day $i$ of the Covid-19 pandemic, where $t_0$ = March 11, 2020 

$$
\text{Cyclical}_i = \sin\left(\frac{2i\pi}{365}\right) \oplus \cos\left(\frac{2i\pi}{365}\right)
$$
$$
\text{Continuous}_i = \sin\left(\frac{2i\pi}{3650}\right) \oplus \cos\left(\frac{2i\pi}{3650}\right)
$$

    import numpy as np
    import pandas as pd

    df.dates = pd.to_datetime(df.dates, format='%Y-%m-%d')
    min_date = min(df.dates)
    max_date = max(df.dates)
    min_date, max_date, df.dates.dtype

    # day index i: number of days since the first date in the data
    df['day'] = (df.dates - min_date).dt.days
    df.drop(['dates'], axis=1, inplace=True)

    # sine/cosine pairs with a one-year period and a ten-year period
    cyclical_interval = 365
    continuous_interval = 3650
    df['cyclical_sin'] = np.sin((df.day * 2 * np.pi)/cyclical_interval)
    df['cyclical_cos'] = np.cos((df.day * 2 * np.pi)/cyclical_interval)
    df['continuous_sin'] = np.sin((df.day * 2 * np.pi)/continuous_interval)
    df['continuous_cos'] = np.cos((df.day * 2 * np.pi)/continuous_interval)
    df.drop('day', axis=1, inplace=True)
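
As a quick worked example of the encoding, the sketch below evaluates both pairs for day 0 and day 365 using only the standard library. The function name `encode_day` is hypothetical; the intervals are the ones used above.

```python
import math


def encode_day(day, cyclical_interval=365, continuous_interval=3650):
    """Return (cyclical_sin, cyclical_cos, continuous_sin, continuous_cos) for a day index."""
    cyclical_angle = 2 * math.pi * day / cyclical_interval
    continuous_angle = 2 * math.pi * day / continuous_interval
    return (
        math.sin(cyclical_angle),
        math.cos(cyclical_angle),
        math.sin(continuous_angle),
        math.cos(continuous_angle),
    )


day_0 = encode_day(0)      # start of the series
day_365 = encode_day(365)  # same point in the season, one year later
```

The cyclical pair is (up to floating-point error) identical for the two days, which is what lets the network see "same time of year". The continuous pair has moved a tenth of the way around the circle, so the two days remain distinguishable. Note that the continuous pair is itself periodic, so it only separates days that lie within one 3650-day span.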

## The Task

* Started with predicting 1 day from the prior 30 days
* For each county
    * Create 31-day samples, $t_0, t_1, \cdots, t_{30}; t_1, t_2, \cdots, t_{31}; \cdots; t_{T-30}, t_{T-29}, \cdots, t_T$, where $t_T$ is the last observation
* Process county by county, saving every 200 observations to a file
* Randomly shuffle the resulting files
    * Move 70% to the train directory
    * Move 15% to the eval directory
    * Move 15% to the test directory
* Build a generator to
    * Shuffle the files
    * Pull files 5 at a time
    * Get the observations for all 5 files (1000 observations)
    * Shuffle the results
    * Yield
* Build a split function to separate the 30 X values from the label 
* Construct the train, validation, and test datasets as TensorFlow Dataset objects
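
The sliding-window step in the list above can be sketched as follows. This is a minimal illustration on a toy series; the helper name `make_windows` and the 41-day county are hypothetical.

```python
def make_windows(series, window=31):
    """Return every run of `window` consecutive observations, in order."""
    return [series[i:i + window] for i in range(len(series) - window + 1)]


# Hypothetical county with observations t_0 ... t_40 (41 days).
county = list(range(41))
samples = make_windows(county)

first, last = samples[0], samples[-1]

# Within one sample, the first 30 days are the inputs and day 31 is the label.
features, label = first[:-1], first[-1]
```

A county with $T + 1$ observations yields $T - 29$ samples, so the 41-day toy series produces 11 windows, the last of which covers $t_{10}$ through $t_{40}$.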
    

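    import random
    import numpy as np

    def create_generator(files, cycle_length=5):
        """Yield shuffled arrays built from cycle_length sample files at a time."""
        set_seed()  # set_seed() is defined outside this excerpt
        random.shuffle(files)
        for i in range(0, len(files), cycle_length):
            subset = files[i:i+cycle_length]
            np_arrays = [np.load(s) for s in subset]
            # stack the samples from all files in the subset, then shuffle them
            np_array = np.concatenate(np_arrays, axis=0)
            np.random.shuffle(np_array)
            yield np_array


    def split_xy(np_array):
        """Split each 31-day sample into the 30-day input and the label."""
        X = np_array[:,:-1,:]     # first 30 days, all features
        y = np_array[:,-1:,:1]    # last day, first feature only
        return X,y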


    train_ds = tf.data.Dataset.from_generator(lambda: create_generator(train_files, cycle_length=n_readers), output_types=tf.float32 )
    train_ds = train_ds.map(split_xy, num_parallel_calls=n_parse_threads).prefetch(1)

    val_ds = tf.data.Dataset.from_generator(lambda: create_generator(eval_files, cycle_length=n_readers), output_types=tf.float32 )
    val_ds = val_ds.map(split_xy, num_parallel_calls=n_parse_threads).prefetch(1)

    test_ds = tf.data.Dataset.from_generator(lambda: create_generator(test_files, cycle_length=n_readers), output_types=tf.float32 )
    test_ds = test_ds.map(split_xy, num_parallel_calls=n_parse_threads).prefetch(1)
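
The 70/15/15 file split that produces `train_files`, `eval_files`, and `test_files` is not shown above. A minimal sketch of one way to do it, assuming the split is made over file names (the function name `split_files`, the seed, and the file names are hypothetical):

```python
import random


def split_files(files, train_frac=0.70, eval_frac=0.15, seed=42):
    """Shuffle the file list and cut it into train, eval, and test parts."""
    files = list(files)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(files)
    n_train = round(len(files) * train_frac)
    n_eval = round(len(files) * eval_frac)
    train = files[:n_train]
    evaluation = files[n_train:n_train + n_eval]
    test = files[n_train + n_eval:]          # whatever remains
    return train, evaluation, test


# Hypothetical file names, one per saved block of samples.
file_names = [f"samples_{i:04d}.npy" for i in range(100)]
train_files, eval_files, test_files = split_files(file_names)
```

Because consecutive windows from one county overlap heavily, splitting at the file level still leaves closely related samples on both sides of the split; that is worth keeping in mind when reading the test MAE.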


## The Approach

Design an optimal neural network using the TensorFlow HParams Dashboard


    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.optimizers import Adam
    from tensorboard.plugins.hparams import api as hp

    # search space for the grid search
    HP_LAYER_TYPE = hp.HParam('layer_type', hp.Discrete(['keras.layers.LSTM', 'keras.layers.GRU']))
    HP_N_RECURRENT = hp.HParam('n_recurrent', hp.Discrete([1, 2, 3, 4, 5, 6]))
    HP_N_UNIT = hp.HParam('n_unit', hp.Discrete([32, 64, 128, 256, 512]))
    HP_DROPOUT = hp.HParam('dropout', hp.Discrete([0.0, 0.10, 0.20]))
    HP_LR = hp.HParam('lr', hp.Discrete([1e-2, 1e-3]))
    METRIC_MAE = 'mae'

    # register the hyperparameters and the metric with the HParams dashboard
    with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
        hp.hparams_config(
            hparams=[HP_LAYER_TYPE, HP_N_RECURRENT, HP_N_UNIT, HP_DROPOUT, HP_LR],
            metrics=[hp.Metric(METRIC_MAE, display_name='Mean Absolute Error')],
        )

    EPOCHS = 64

    # map the hyperparameter strings to layer classes (avoids eval)
    LAYER_TYPES = {'keras.layers.LSTM': keras.layers.LSTM, 'keras.layers.GRU': keras.layers.GRU}

    def train_test_model(hparams, shape=(30,101)):
        set_seed()
        layer_cls = LAYER_TYPES[hparams[HP_LAYER_TYPE]]
        n_recurrent = hparams[HP_N_RECURRENT]
        inputs = keras.layers.Input(shape=shape)
        last = inputs
        for i in range(n_recurrent):
            # every recurrent layer except the last returns the full sequence
            return_sequences = i < n_recurrent - 1
            last = layer_cls(hparams[HP_N_UNIT], return_sequences=return_sequences)(last)

            if hparams[HP_DROPOUT]:
                last = keras.layers.Dropout(hparams[HP_DROPOUT])(last)

        # single output unit, added once after the recurrent stack
        outputs = keras.layers.Dense(1)(last)

        model = keras.models.Model(inputs=inputs, outputs=outputs)
        model.compile(optimizer=Adam(learning_rate=hparams[HP_LR]), loss='mae')

        model.fit(train_ds,
                validation_data=val_ds,
                epochs=EPOCHS)

        # the reported metric is the MAE on the held-out test set
        test_loss = model.evaluate(test_ds)
        return test_loss
        

    def run(run_dir, hparams):
        with tf.summary.create_file_writer(run_dir).as_default():
            hp.hparams(hparams)  # record the values used in this trial
            test_loss = train_test_model(hparams)
            tf.summary.scalar(METRIC_MAE, test_loss, step=1)


    # exhaustive grid search: one trial per combination of hyperparameter values
    session_num = 0
    for layer_type in HP_LAYER_TYPE.domain.values:
        for n_recurrent in HP_N_RECURRENT.domain.values:
            for n_unit in HP_N_UNIT.domain.values:
                for dropout in HP_DROPOUT.domain.values:
                    for lr in HP_LR.domain.values:
                        hparams = {
                          HP_LAYER_TYPE: layer_type,
                          HP_N_RECURRENT: n_recurrent,
                          HP_N_UNIT: n_unit,
                          HP_DROPOUT: dropout,
                          HP_LR: lr
                        }
                        run_name = f'run-{session_num}'
                        print(f'--- Starting trial: {run_name}')
                        print({h.name: hparams[h] for h in hparams})
                        run('./logs/hparam_tuning/' + run_name, hparams)
                        session_num += 1

![image.png](tensorboard_scalar.png)

![image.png](tensorboard_hparams.png)

![image.png](parallel_coordinate_view.png)

![image.png](tensorboard_scatter_plot.png)

### Conclusions
* Ran iteratively, starting with a smaller number of epochs to eliminate parts of the search space
* Concluded that LSTM and GRU had similar performance. LSTM slightly outperformed (6 of the top 10 runs), so eliminated GRU
* The lower learning rate (0.001) was generally better, so eliminated 0.01
* Dropout didn't have much impact, so settled on 20%
* After many rounds of testing, settled on a 3-layer LSTM with 256 units
* 4-6 layers did not outperform 3 layers (performance got worse)
* 512 units may have been slightly better, but took much longer to train. Deemed not worth it.
* MAE of just over 11 cases
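
For a sense of scale, the sketch below counts the combinations in the search space from The Approach and shows how far the eliminations above shrink it. It uses only `itertools`; the dictionary `search_space` and the short layer names are written out here for illustration.

```python
from itertools import product

# Same values as the HParams search space, written as plain lists.
search_space = {
    'layer_type': ['LSTM', 'GRU'],
    'n_recurrent': [1, 2, 3, 4, 5, 6],
    'n_unit': [32, 64, 128, 256, 512],
    'dropout': [0.0, 0.10, 0.20],
    'lr': [1e-2, 1e-3],
}

# Every combination of values, in the same order as the dictionary keys.
full_grid = list(product(*search_space.values()))

# Keep only LSTM, dropout 0.20, and learning rate 0.001.
pruned_grid = [
    combo for combo in full_grid
    if combo[0] == 'LSTM' and combo[3] == 0.20 and combo[4] == 1e-3
]

print(len(full_grid), len(pruned_grid))
```

The full grid has 2 x 6 x 5 x 3 x 2 = 360 trials; fixing the layer type, dropout, and learning rate leaves the 6 x 5 = 30 depth and width combinations.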


### Current Architecture

<pre>
Model: "Covid-Prediction-30-1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 Input (InputLayer)          [(None, 30, 101)]         0         

 LSTM-1 (LSTM)               (None, 30, 256)           366592    

 Dropout-1 (Dropout)         (None, 30, 256)           0         

 LSTM-2 (LSTM)               (None, 30, 256)           525312    

 Dropout-2 (Dropout)         (None, 30, 256)           0         

 LSTM-3 (LSTM)               (None, 30, 256)           525312    

 Output (Dense)              (None, 30, 1)             257       

=================================================================
Total params: 1,417,473
Trainable params: <b>1,417,473</b>
Non-trainable params: 0
_________________________________________________________________
</pre>
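
The parameter counts in the summary can be checked by hand. The sketch below uses the standard count for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and for a dense layer; the function names are hypothetical.

```python
def lstm_params(n_inputs, n_units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (n_inputs + n_units + 1) * n_units


def dense_params(n_inputs, n_outputs):
    """One weight per input plus one bias, for each output unit."""
    return (n_inputs + 1) * n_outputs


layer_params = [
    lstm_params(101, 256),  # LSTM-1: 101 input features
    lstm_params(256, 256),  # LSTM-2: fed by the 256 units of LSTM-1
    lstm_params(256, 256),  # LSTM-3
    dense_params(256, 1),   # Output
]
total_params = sum(layer_params)

print(layer_params, total_params)
```

The results match the table: 366,592 + 525,312 + 525,312 + 257 = 1,417,473. The dropout layers contribute no parameters.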



![image.png](model_3_256.png)





![image.png](model_3_256_detailed.png)



## Next Steps
* With and without time encoding
* With and without geolocation/lat-long coordinates
* With and without pre-clustering (need pollution metric)
* Optimizers other than Adam (SGD, RMSprop, etc.)
* Other loss functions (especially MSE)
* Additional layers/more units (I'm already running these)
* Batch size
* Additional dense layers
* Extend the window from 30+1
* Vary activation functions on LSTM and dense layers
* Batch normalization
* Vary y (e.g., predict Covid-19 deaths)
* Calculate baselines (guessing zero, guessing the last observation, guessing the mean of the prior 30 days, guessing the county mean, linear regression)
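
Three of the baselines in the last item are simple enough to sketch now. The example below scores them with MAE on two made-up samples; the data, function names, and sample layout (a list of 30 prior values plus the next-day value) are assumptions for illustration only. The county-mean and linear-regression baselines are left out.

```python
def mae(predictions, targets):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def baseline_zero(window):
    """Always guess zero."""
    return 0.0


def baseline_last(window):
    """Guess that tomorrow equals the last observed day."""
    return window[-1]


def baseline_mean(window):
    """Guess the mean of the 30 observed days."""
    return sum(window) / len(window)


# Made-up toy samples: (30 prior daily values, next-day value).
toy_samples = [
    ([10.0] * 30, 12.0),
    ([float(d) for d in range(30)], 30.0),
]
targets = [y for _, y in toy_samples]

scores = {
    name: mae([fn(x) for x, _ in toy_samples], targets)
    for name, fn in [('zero', baseline_zero), ('last', baseline_last), ('mean', baseline_mean)]
}
print(scores)
```

Scoring the same functions on the real test windows would give reference MAE values to set beside the network's result.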