<a href="https://colab.research.google.com/github/StevenKim1105/StevenKim1105/blob/main/tutorial_quick_start.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MHPI tutorial

This repository contains deep learning code for modeling hydrologic systems, from soil moisture to streamflow and from projection to forecast.

[![PyPI](https://img.shields.io/badge/pypi-version%200.1-blue)](https://pypi.org/project/hydroDL/0.1.0/)  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3993880.svg)](https://doi.org/10.5281/zenodo.3993880) [![CodeStyle](https://img.shields.io/badge/code%20style-Black-black)]()


Welcome to our hydroDL tutorial at The Pennsylvania State University! The following notebook is designed to provide a quick start to our project and get you ready to write your own neural networks.

# Clone the git repo

In [1]:
import os
os.chdir("/content/")
!rm -rf hydroDL
!git clone https://github.com/mhpi/hydroDL.git
!mv hydroDL/hydroDL/* hydroDL

Cloning into 'hydroDL'...
remote: Enumerating objects: 1046, done.
remote: Counting objects: 100% (86/86), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 1046 (delta 46), reused 19 (delta 10), pack-reused 960
Receiving objects: 100% (1046/1046), 60.67 MiB | 30.82 MiB/s, done.
Resolving deltas: 100% (429/429), done.


In [2]:
# import libraries

import os
import sys
os.chdir("/content/hydroDL")
sys.path.append('..')
import torch
import numpy as np
from hydroDL.master.master import loadModel
from hydroDL.model.crit import RmseLoss
from hydroDL.model.rnn import CudnnLstmModel as LSTM
from hydroDL.model.rnn import CpuLstmModel as LSTM_CPU
from hydroDL.model.train import trainModel
from hydroDL.model.test import testModel
from hydroDL.post.stat import statError as cal_metric
from hydroDL.data.load_csv import LoadCSV
from hydroDL.utils.norm import re_folder, trans_norm


loading package hydroDL


In [3]:
# set configuration
output_s = "./output/quick_start/"  # output path
csv_path_s = "/content/hydroDL/example/demo_data"  # demo data path
all_date_list = ["2015-04-01", "2017-03-31"]  # demo data time period
train_date_list = ["2015-04-01", "2016-03-31"]  # training period
# time series variables list
var_time_series = ["VGRD_10_FORA", "DLWRF_FORA", "UGRD_10_FORA", "DSWRF_FORA", "TMP_2_FORA", "SPFH_2_FORA", "APCP_FORA", ]
# constant variables list
var_constant = ["flag_extraOrd", "Clay", "Bulk", "Sand", "flag_roughness", "flag_landcover", "flag_vegDense", "Silt", "NDVI",
         "flag_albedo", "flag_waterbody", "Capa", ]
# target variable list
target = ["SMAP_AM"]
# generate output folder
re_folder(output_s)

In [5]:
# hyperparameter
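EPOCH = 100  # number of training epochs
BATCH_SIZE = 50  # number of pixels sampled in each training iteration
RHO = 30  # length of each training sequence in time steps; streamflow typically uses 365 because it needs a much longer memory
HIDDEN_SIZE = 256  # number of LSTM hidden units; a larger value gives a bigger network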
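
To make `BATCH_SIZE` and `RHO` concrete, the sketch below draws one random mini-batch from a synthetic array: `BATCH_SIZE` pixels are picked at random and, for each, a window of `RHO` consecutive time steps is cut out. This only illustrates the general idea. The helper `sample_minibatch` and the synthetic array are assumptions made for this example, not hydroDL's internal sampler.

```python
import numpy as np

BATCH_SIZE = 50
RHO = 30


def sample_minibatch(x, batch_size, rho, rng):
    """Hypothetical helper: pick random pixels and random time windows."""
    n_pixels, n_time, _ = x.shape
    pixel_idx = rng.integers(0, n_pixels, size=batch_size)
    # Latest valid start index leaves room for a full window of length rho
    start_idx = rng.integers(0, n_time - rho + 1, size=batch_size)
    # Stack one [rho, features] window per sampled pixel
    return np.stack([x[p, s:s + rho, :] for p, s in zip(pixel_idx, start_idx)])


rng = np.random.default_rng(0)
x_demo = rng.normal(size=(412, 366, 7))  # synthetic stand-in for forcing data
batch = sample_minibatch(x_demo, BATCH_SIZE, RHO, rng)
print(batch.shape)  # (50, 30, 7)
```

A larger `RHO` lets the network see longer histories per sample, at the cost of more memory per mini-batch.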


You can replace the demo data with your own. The data structure is as follows:

x_train (forcing data, e.g. precipitation, temperature ...): [pixels, time, features] 

c_train (constant data, e.g. soil properties, land cover ...): [pixels, features]

target (e.g. soil moisture, streamflow ...): [pixels, time, 1]

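Data type: NumPy floating-point arrays (the `numpy.float` alias was removed in NumPy 1.24; use an explicit dtype such as `numpy.float64`)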
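
If you are preparing your own data, it can help to build small placeholder arrays with this layout first. The sketch below does that with random values. The array sizes are arbitrary, and the choice of `float64` is an assumption for illustration, not a documented requirement of the loader.

```python
import numpy as np

n_pixels, n_time = 4, 10       # arbitrary sizes for illustration
n_forcing, n_constant = 7, 12  # same counts as the variable lists above

rng = np.random.default_rng(0)
x_demo = rng.random((n_pixels, n_time, n_forcing)).astype(np.float64)  # [pixels, time, features]
c_demo = rng.random((n_pixels, n_constant)).astype(np.float64)         # [pixels, features]
y_demo = rng.random((n_pixels, n_time, 1)).astype(np.float64)          # [pixels, time, 1]
y_demo[0, 3, 0] = np.nan  # a missing observation stored as NaN

for name, arr in [("x", x_demo), ("c", c_demo), ("y", y_demo)]:
    print(name, arr.shape, arr.dtype)
```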

In [6]:
# load your datasets
train_csv = LoadCSV(csv_path_s, train_date_list, all_date_list)
x_train = train_csv.load_time_series(var_time_series)  # data size: [pixels, time, features]
c_train = train_csv.load_constant(
    var_constant, convert_time_series=False
)  # [pixels, features]
y_train = train_csv.load_time_series(target, remove_nan=False)  # [pixels, time, 1]

In [8]:
print(x_train.shape)

print(c_train.shape)  # no time axis: constant attributes are time-independent

print(y_train.shape)


(412, 366, 7)
(412, 12)
(412, 366, 1)
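
The time axis of length 366 is consistent with a daily time step over the training period, counting both end dates (the period includes 29 February 2016). This is an inference from the printed shapes, and `n_days_inclusive` is a hypothetical helper written for this check:

```python
from datetime import date


def n_days_inclusive(start, end):
    """Number of daily steps from start to end, counting both endpoints."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days + 1


print(n_days_inclusive("2015-04-01", "2016-03-31"))  # 366
print(n_days_inclusive("2016-04-01", "2017-03-31"))  # 365
```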


In [9]:
y_train[1,:,0]

array([ 2.80286782,         nan,  2.81016486,  2.6255408 ,         nan,
        2.74983874,         nan,         nan,  2.70711782,         nan,
        3.14663988,  2.77890012,         nan,  2.98685731,         nan,
               nan,  2.11992055,         nan,  2.58472858,  2.15346136,
               nan,  1.84595256,         nan,         nan,  1.92205748,
               nan,  1.87244605,  1.47171021,         nan,  1.75877124,
               nan,         nan,  1.29849423,         nan,  1.58029246,
        0.95487504,         nan,  1.0380293 ,         nan,         nan,
        0.98722074,         nan,         nan,         nan,         nan,
        1.34121515,         nan,         nan,  1.06385129,         nan,
        1.37390304,  1.26678976,         nan,  1.12265831,         nan,
               nan,  0.63321267,         nan,  1.05244253,  1.00917294,
               nan,  2.0738732 ,         nan,         nan,  0.96183452,
               nan,  1.08690692,  0.41514986,         nan,  0.85
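
The target series contains many `nan` entries because it was loaded with `remove_nan=False`, so any loss computed against it has to skip the missing observations. A minimal NumPy sketch of that idea follows. It assumes a simple mask over the observed values, and `masked_rmse` is a hypothetical helper, not the implementation of `RmseLoss`.

```python
import numpy as np


def masked_rmse(pred, obs):
    """Hypothetical helper: RMSE over entries where obs is not NaN."""
    mask = ~np.isnan(obs)
    return float(np.sqrt(np.mean((pred[mask] - obs[mask]) ** 2)))


obs = np.array([2.0, np.nan, 3.0, np.nan, 1.0])
pred = np.array([2.5, 9.9, 3.0, -4.0, 0.5])
# Only the three observed positions contribute; predictions at NaN positions are ignored
print(masked_rmse(pred, obs))
```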

In [10]:
# define model and loss function
loss_fn = RmseLoss()  # loss function
# select model: keep the cuDNN LSTM on GPU, otherwise fall back to the CPU version
if not torch.cuda.is_available():
    LSTM = LSTM_CPU
model = LSTM(nx=len(var_time_series) + len(var_constant), ny=len(target), hiddenSize=HIDDEN_SIZE)
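
The input size `nx` is the number of forcing variables plus the number of constant attributes (7 + 12 = 19). This suggests that each constant is repeated at every time step and joined to the forcings along the feature axis. The sketch below shows that assembly with placeholder arrays. It is an assumption about how the inputs are combined, not code taken from hydroDL.

```python
import numpy as np

n_pixels, n_time = 3, 5
x_demo = np.zeros((n_pixels, n_time, 7))  # 7 forcing variables
c_demo = np.ones((n_pixels, 12))          # 12 constant attributes

# Repeat the constants at every time step, then join along the feature axis
c_tiled = np.repeat(c_demo[:, np.newaxis, :], n_time, axis=1)
xc_demo = np.concatenate([x_demo, c_tiled], axis=2)
print(xc_demo.shape)  # (3, 5, 19)
```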

In [11]:
print(LSTM)

<class 'hydroDL.model.rnn.CudnnLstmModel.CudnnLstmModel'>


In [12]:
# training the model
last_model = trainModel(
    model,
    x_train,
    y_train,
    c_train,
    loss_fn,
    nEpoch=EPOCH,
    miniBatch=[BATCH_SIZE, RHO],
    saveEpoch=1,
    saveFolder=output_s,
)

Training CudnnLstmModel: 100%|██████████| 100/100 [05:21<00:00,  3.22s/it, loss=0.311]


In [13]:
# load validation datasets
val_date_list = ["2016-04-01", "2017-03-31"]  # validation period
# load your data the same way as the training data
val_csv = LoadCSV(csv_path_s, val_date_list, all_date_list)
x_val = val_csv.load_time_series(var_time_series)
c_val = val_csv.load_constant(var_constant, convert_time_series=False)
y_val = val_csv.load_time_series(target, remove_nan=False)

In [14]:
# Select the epoch you want to validate.
val_epoch = 100
test_model = loadModel(output_s, epoch=val_epoch)

# set the path to save result
save_csv = os.path.join(output_s, "predict.csv")

# validation
pred_val = testModel(
    test_model, x_val, c_val, batchSize=len(x_val), filePathLst=[save_csv],
)

# select the metrics
metrics_list = ["Bias", "RMSE", "ubRMSE", "Corr"]
pred_val = pred_val.numpy()
# denormalization
pred_val = trans_norm(pred_val, csv_path_s, var_s=target[0], from_raw=False)
y_val = trans_norm(y_val, csv_path_s, var_s=target[0], from_raw=False)
pred_val, y_val = np.squeeze(pred_val), np.squeeze(y_val)
metrics_dict = cal_metric(pred_val, y_val)  # calculate the metrics
metrics = [
    "Median {}: {:.2f}".format(x, np.nanmedian(metrics_dict[x]))
    for x in metrics_list
]
print("Epoch {}: {}".format(val_epoch, metrics))
# LSTM tutorial URL: bit.ly/3Fvnwyp

Epoch 100: ['Median Bias: -0.01', 'Median RMSE: 0.03', 'Median ubRMSE: 0.03', 'Median Corr: 0.84']
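
For readers who want to see what the four reported metrics measure, the sketch below computes them per pixel with NumPy using their common definitions, then takes the median across pixels as in the cell above. `pixel_metrics` is a hypothetical helper and may differ in detail from `statError`.

```python
import numpy as np


def pixel_metrics(pred, obs):
    """Hypothetical helper: per-pixel metrics for [pixels, time] arrays."""
    out = {"Bias": [], "RMSE": [], "ubRMSE": [], "Corr": []}
    for p, o in zip(pred, obs):
        ok = ~np.isnan(p) & ~np.isnan(o)  # keep time steps where both values exist
        p, o = p[ok], o[ok]
        err = p - o
        anom = (p - p.mean()) - (o - o.mean())  # error with the mean bias removed
        out["Bias"].append(err.mean())
        out["RMSE"].append(np.sqrt((err ** 2).mean()))
        out["ubRMSE"].append(np.sqrt((anom ** 2).mean()))
        out["Corr"].append(np.corrcoef(p, o)[0, 1])  # Pearson correlation
    return {k: np.array(v) for k, v in out.items()}


obs = np.array([[1.0, 2.0, np.nan, 4.0], [0.0, 1.0, 2.0, 3.0]])
pred = obs + 0.5  # constant offset: Bias 0.5, RMSE 0.5, ubRMSE 0, Corr 1
m = pixel_metrics(pred, obs)
print({k: float(np.nanmedian(v)) for k, v in m.items()})
```

The constant offset in this toy case shows why ubRMSE is reported alongside RMSE: a pure bias inflates RMSE but leaves ubRMSE at zero.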


