# Introduction

This notebook is an annotated example of how to run the code for the data challenge. Each section below describes one step in detail. The functions used are imported from the ``` ./lib ``` folder. 

The structure of the code is:
1. Data preprocessing:
    - Removing NaNs from the data
    - Standardization
2. Training: 
    - Dimension reduction with autoencoder
    - GAN/CNN for super-resolution (SR)
    
## Code structure:
- Data preprocessing:
    * ``` preprocess.py ``` provides the functions needed for data preprocessing
- GAN for super-resolution
    * ``` models.py ``` provides the models needed in GAN
    * ``` GAN_class.py ``` contains the class defined for full GAN training

# Code for running

## Import libraries

In [None]:
%reset -f
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import xarray as xr
import h5py

import matplotlib.pyplot as plt
import sys
sys.path.append("./lib")
from preprocess import *
from models import *
from GAN_class import *
from AE_class import *

## Data preprocessing

The function ``` data_preprocess ``` takes a list of data set paths and a dictionary of preprocessing parameters.

The parameters are explained below:

1. For NaN removal:
    - ``` "nan_dim_along" ```: the dimension along which indices containing NaN values are removed
    - ``` "nan_data_irrelevant" ```: the variable to exclude when checking for NaN values (here ``` "absolute_height" ```)
    - ``` "output_folder" ```: folder for data output
    - ``` "file_format" ```: the file name format used for saving the NaN-removed data
2. For standardization:
    - ``` "stat_dim" ```: the dimension along which statistics (mean, standard deviation, etc.) are computed
    - ``` "std_file_format" ```: the file name format of the h5 file that stores the standardized data
    - ``` "std_data_format" ```: the name format of the data entries inside the standardized h5 file
    - ``` "std_data_list" ```: list of data variables to be standardized
    - ``` "std_dataset_list" ```: list of data sets to be standardized
    - ``` "num_error_tolerance" ```: numerical tolerance for passing the standardization test
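
The two preprocessing steps can be illustrated with a minimal NumPy sketch. It is not the implementation in ``` preprocess.py ```: the function names ``` remove_nan_along_time ``` and ``` standardize ``` are hypothetical, time is assumed to be the first axis, a time step is assumed to be dropped from every array as soon as any array has a NaN there, and the tolerance is assumed to be used for a round-trip check.

```python
import numpy as np

def remove_nan_along_time(arrays):
    """Drop every time index at which any array contains a NaN.

    Assumes time is axis 0 and all arrays share the same time length.
    """
    n_time = arrays[0].shape[0]
    keep = np.ones(n_time, dtype=bool)
    for a in arrays:
        # Flatten all non-time axes, then flag time steps with any NaN.
        keep &= ~np.isnan(a.reshape(n_time, -1)).any(axis=1)
    return [a[keep] for a in arrays], keep

def standardize(data, tol=1e-5):
    """Standardize along the time axis and verify the round trip."""
    mean = data.mean(axis=0, keepdims=True)
    stddev = data.std(axis=0, keepdims=True)
    std = (data - mean) / stddev
    # Assumed test: undoing the transform must reproduce the input within tol.
    assert np.allclose(std * stddev + mean, data, atol=tol)
    return std, mean, stddev

rng = np.random.default_rng(0)
low = rng.normal(size=(6, 2, 2))    # toy low-res field: (time, y, x)
high = rng.normal(size=(6, 4, 4))   # toy high-res field: (time, y, x)
low[1, 0, 0] = np.nan               # gap in the low-res data at time index 1
high[4, 2, 3] = np.nan              # gap in the high-res data at time index 4

(low_clean, high_clean), keep = remove_nan_along_time([low, high])
high_std, high_mean, high_stddev = standardize(high_clean)
print(low_clean.shape, high_clean.shape)  # (4, 2, 2) (4, 4, 4)
```

Keeping the mean and standard deviation alongside the standardized array mirrors the ``` _mean ```, ``` _stddev ```, ``` _std ``` and ``` _raw ``` entries that appear in the h5 file later in this notebook, and makes it possible to map model output back to physical units.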

In [2]:
#set up parameters
output_folder = "../data/preprocessed/"
file_format = "%s"
std_file_format = "np_gan_standard"
std_data_format = "np_%s"
parameters = {"nan_dim_along": "time",
              "nan_data_irrelevant": "absolute_height",
              "output_folder": output_folder,
              "file_format": file_format,
              "stat_dim": "time",
              "std_file_format": std_file_format,
              "std_data_format": std_data_format,
              "std_data_list": ["u", "v"],
              "std_dataset_list": ["perdigao_low_res_1H_2020", "perdigao_high_res_1H_2020"],
              "num_error_tolerance": 1e-5}
list_of_data_set_path = ['../data/perdigao_era5_2020.nc',
                         '../data/perdigao_low_res_1H_2020.nc',
                         '../data/perdigao_high_res_1H_2020.nc']

In [3]:
#start data_preprocess
data_preprocess(list_of_data_set_path, parameters)

Creating DataSets by loading files:['../data/perdigao_era5_2020.nc', '../data/perdigao_low_res_1H_2020.nc', '../data/perdigao_high_res_1H_2020.nc']
Removing NAN indices along dimension:time
Searching NAN in DataSets: dict_keys(['perdigao_era5_2020', 'perdigao_low_res_1H_2020', 'perdigao_high_res_1H_2020'])...
Checking nan pattern of variable: u100  for DataSet: perdigao_era5_2020
Total number of NAN: (0,), along (0,) time indicies
Checking nan pattern of variable: v100  for DataSet: perdigao_era5_2020
Total number of NAN: (0,), along (0,) time indicies
Checking nan pattern of variable: t2m  for DataSet: perdigao_era5_2020
Total number of NAN: (0,), along (0,) time indicies
Checking nan pattern of variable: i10fg  for DataSet: perdigao_era5_2020
Total number of NAN: (0,), along (0,) time indicies
NAN pattern along dimension: time, is CONSISTENT for all other coords, with absolute_height excluded
Checking nan pattern of variable: std  for DataSet: perdigao_low_res_1H_2020
Total number of

### Example for loading data

In [4]:
output_folder = "../data/preprocessed/"
file_format = "np_gan_standard"
xy_keyword_dict = {"x":"low", "y":"high"}
data_xy = get_data_xy_from_h5(output_folder, file_format, xy_keyword_dict, exclude_list = ["stddev", "mean","raw"])

Data in file ../data/preprocessed/np_gan_standard.h5 are: 
 ['np_perdigao_high_res_1H_2020_mean', 'np_perdigao_high_res_1H_2020_raw', 'np_perdigao_high_res_1H_2020_std', 'np_perdigao_high_res_1H_2020_stddev', 'np_perdigao_low_res_1H_2020_mean', 'np_perdigao_low_res_1H_2020_raw', 'np_perdigao_low_res_1H_2020_std', 'np_perdigao_low_res_1H_2020_stddev']
Examining data np_perdigao_high_res_1H_2020_mean
Examining data np_perdigao_high_res_1H_2020_raw
Examining data np_perdigao_high_res_1H_2020_std
Examining data np_perdigao_high_res_1H_2020_stddev
Examining data np_perdigao_low_res_1H_2020_mean
Examining data np_perdigao_low_res_1H_2020_raw
Examining data np_perdigao_low_res_1H_2020_std
Loading data np_perdigao_low_res_1H_2020_std from file ../data/preprocessed/np_gan_standard.h5
Data in file ../data/preprocessed/np_gan_standard.h5 are: 
 ['np_perdigao_high_res_1H_2020_mean', 'np_perdigao_high_res_1H_2020_raw', 'np_perdigao_high_res_1H_2020_std', 'np_perdigao_high_res_1H_2020_stddev', 'np_p
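
The log above suggests that entries are selected by name. The sketch below shows one way such a selection could work. It is an assumption about the behaviour, not the code of ``` get_data_xy_from_h5 ```: the helper ``` select_xy_names ``` is hypothetical and uses plain substring matching for both the keywords and the exclude list.

```python
def select_xy_names(names, xy_keyword_dict, exclude_list):
    """Group entry names into x and y by keyword, skipping excluded types."""
    selected = {key: [] for key in xy_keyword_dict}
    for name in names:
        # Assumption: an entry is skipped if any excluded tag occurs in its name.
        if any(tag in name for tag in exclude_list):
            continue
        for key, keyword in xy_keyword_dict.items():
            if keyword in name:
                selected[key].append(name)
    return selected

# Entry names as listed in the log output above.
names = [
    "np_perdigao_high_res_1H_2020_mean", "np_perdigao_high_res_1H_2020_raw",
    "np_perdigao_high_res_1H_2020_std", "np_perdigao_high_res_1H_2020_stddev",
    "np_perdigao_low_res_1H_2020_mean", "np_perdigao_low_res_1H_2020_raw",
    "np_perdigao_low_res_1H_2020_std", "np_perdigao_low_res_1H_2020_stddev",
]

gan_names = select_xy_names(names, {"x": "low", "y": "high"},
                            ["stddev", "mean", "raw"])
print(gan_names)
```

With these inputs only the two ``` _std ``` entries survive: the low-res one is assigned to x and the high-res one to y, which is consistent with the "Loading data" line in the log.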

## Autoencoder

### Key parameters
The parameters for HIER-AE are explained below:
1. For training, ``` parameters["train"] ```:
    - ``` "batch_size", "shuffle"```: generic training parameters (mini-batch size and whether to shuffle the training data)
    - ``` "n_sub_net" ```: number of sub nets in the hierarchy
    - ``` "latent_dim_en" ```: latent dimension size of the encoder 
    - ``` "latent_dim_de_origin" ```: latent dimension size of the decoder at the lowest level; at later levels it is multiplied by the level number.
    - ``` "log_path_file_format" ```: format string for the path of the log output
    - ``` "sub_net_epochs" ```: number of epochs for each level of sub net
2. For data IO and manipulation, ``` parameters["data"] ```:
    - ``` "output_folder" ```: the output folder of the data preprocessing, used as input for loading data
    - ``` "file_format" ```: the file name format of the h5 file that stores the standardized data
    - ``` "xy_keyword_dict" ```: the keywords used to detect and categorize x and y among the data in the standardized h5 file. Here we only load high-res data, so y is the same as x
    - ``` "xy_exclude_list" ```: list of data types to skip when they appear in the data name, e.g. stddev, mean and std if we only need raw data.
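
As a small worked example of how the per-level settings relate, the sketch below lists the decoder latent dimension and epoch count for each sub net. It assumes that levels are numbered from 1 and that level k uses k times ``` "latent_dim_de_origin" ```; the helper ``` decoder_latent_dims ``` is hypothetical and not part of ``` AE_class ```.

```python
def decoder_latent_dims(n_sub_net, latent_dim_de_origin):
    """Return the decoder latent dimension for each hierarchy level."""
    # Assumption: levels are numbered 1..n_sub_net and level k uses k * origin.
    return [latent_dim_de_origin * level for level in range(1, n_sub_net + 1)]

# Values taken from the training parameters used in this notebook.
train = {"n_sub_net": 4, "latent_dim_de_origin": 18, "sub_net_epochs": [5, 4, 3, 2]}

# One epoch count is expected per sub net.
assert len(train["sub_net_epochs"]) == train["n_sub_net"]

dims = decoder_latent_dims(train["n_sub_net"], train["latent_dim_de_origin"])
for level, (dim, epochs) in enumerate(zip(dims, train["sub_net_epochs"]), start=1):
    print(f"level {level}: decoder latent dim {dim}, epochs {epochs}")
```

Under these assumptions the four levels use decoder latent dimensions 18, 36, 54 and 72.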

In [12]:
parameters_AE = dict()

parameters_AE["train"] = {"batch_size": 128,
                       "shuffle": True,
                       "n_sub_net":4,
                       "latent_dim_en":18,
                       "latent_dim_de_origin":18,
                       "log_path_file_format":"../data/log/%s",
                       "sub_net_epochs":[5,4,3,2]}
parameters_AE["data"] = {'output_folder': "../data/preprocessed/",
                      'file_format': "np_gan_standard",
                      'xy_keyword_dict': {"x":"high", "y":"high"}, #only load high res data
                      'xy_exclude_list': ["stddev", "mean","std"]} #here we only need to load "raw" data

In [None]:
### Step by step running example
#initialize model with parameters_AE
model_AE = AWWSM4_HIER_AE(parameters_AE)
#load data
model_AE.load_data()
#split data
model_AE.split_data()
#perform the training for each sub net one by one
model_AE.generate_AE_one_by_one()

## GAN

### Key parameters
The parameters for GAN are explained below:
1. For training, ``` parameters["train"] ```:
    - ``` "learning_rate_g" ```: the learning rate for the generator
    - ``` "learning_rate_d" ```: the learning rate for the discriminator
    - ``` "beta_1", "beta_2", "epsilon", "batch_size"```: generic training parameters: the optimizer hyperparameters (the names follow the Adam optimizer convention) and the mini-batch size
    - ``` "n_epochs_pretrain" ```: number of epochs for pretraining
    - ``` "n_epochs_GAN" ```: number of epochs for full GAN
2. For data IO and manipulation, ``` parameters["data"] ```:
    - ``` "output_folder" ```: the output folder of the data preprocessing, used as input for loading data
    - ``` "file_format" ```: the file name format of the h5 file that stores the standardized data
    - ``` "xy_keyword_dict" ```: the keywords used to detect and categorize x and y among the data in the standardized h5 file 
    - ``` "xy_exclude_list" ```: list of data types to skip when they appear in the data name, e.g. stddev, mean and raw if we only need standardized data.
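
The configuration in this notebook also sets ``` "alpha_advers" ```, which is not described in the list above. Its exact role in ``` GAN_class.py ``` is not documented here. A common convention in SRGAN-style training is to add the adversarial term to a content loss with a small weight, and the NumPy sketch below assumes that convention purely for illustration. The function ``` generator_loss ``` and the choice of mean squared error as content loss are assumptions.

```python
import numpy as np

def generator_loss(y_true, y_fake, d_fake, alpha_advers=1e-3):
    """Content loss plus a weighted adversarial term (assumed convention)."""
    # Content term: mean squared error between real and generated fields.
    content = np.mean((y_true - y_fake) ** 2)
    # Adversarial term: cross-entropy that pushes D(G(x)) towards 1 ("real").
    eps = 1e-8  # avoids log(0)
    adversarial = -np.mean(np.log(d_fake + eps))
    return content + alpha_advers * adversarial

y_true = np.zeros((2, 4, 4))        # toy "real" high-res batch
y_fake = np.full((2, 4, 4), 0.1)    # toy generator output
d_fake = np.array([0.5, 0.5])       # discriminator scores for the fake batch

loss = generator_loss(y_true, y_fake, d_fake)
print(loss)
```

With a weight as small as 1e-3, the content term dominates and the adversarial term acts as a mild correction, which is the usual motivation for pretraining the generator on the content loss alone before the full GAN phase.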

In [10]:
parameters_GAN = dict()

parameters_GAN["train"] = {"learning_rate_g": 1e-4, 
                       "learning_rate_d": 1e-4,
                       "beta_1": 0.9,
                       "beta_2": 0.999,
                       "epsilon": 1e-08,
                       "batch_size": 128,
                       "alpha_advers": 1e-3,
                       "n_epochs_pretrain":1, 
                       "n_epochs_GAN":1}
parameters_GAN["data"] = {'output_folder': "../data/preprocessed/",
                      'file_format': "np_gan_standard",
                      'xy_keyword_dict': {"x":"low", "y":"high"},
                      'xy_exclude_list': ["stddev", "mean","raw"]} #only keep the "std" (standardized) data

### Step by step running example

In [None]:
#create a GAN model class with doing_pretrain = True 
model = AWWSM4_SR_GAN(parameters=parameters_GAN, is_GAN=True, doing_pretrain=True) 
# load and split data based on the parameters["data"]
model.load_data()
model.split_data()
# pretrain and save the model
model.pretrain() #use default epoch value = 20
model.save_gen_model("./temp_v0_gen.h5")
# OPTIONAL: create another model that loads the pretrained weights and can continue pretraining
model2 = AWWSM4_SR_GAN(parameters=parameters_GAN, is_GAN=True, doing_pretrain=True)
model2.load_gen_model("./temp_v0_gen.h5")
model2.load_data()
model2.split_data()
# continue to work on GAN
model.reset_working_mode(doing_pretrain=False)
model.train_GAN(epochs=10)

In [None]:
model.reset_working_mode(doing_pretrain=False)
model.train_GAN(epochs=10)