# Introduction

This notebook is a worked example of running the code for the data challenge, with explanations. Each section describes one step in detail. The functions used come from the ``` ./lib ``` folder.

The structure of the code is:
1. Data preprocessing:
    - Removing NaNs from the data
    - Standardization
2. Training: 
    - Dimension reduction with autoencoder
    - GAN/CNN for super-resolution (SR)

# Code for running

## Import libraries

In [1]:
%reset -f
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import xarray as xr
import h5py

import matplotlib.pyplot as plt
import sys
sys.path.append("./lib")
from preprocess import *

## Data preprocessing

The function ``` data_preprocess ``` takes the list of data set paths and a dictionary of the parameters required for preprocessing.

The parameters are explained below:

1. For NaN removal:
    - ``` "nan_dim_along" ```: the dimension along which NaN data are removed
    - ``` "nan_data_irrelevant" ```: the data variable to ignore when detecting NaNs
    - ``` "output_folder" ```: the folder for data output
    - ``` "file_format" ```: the file name format used for saving NaN-removed data
2. For standardization:
    - ``` "stat_dim" ```: the statistical dimension of the data, used when computing the mean, standard deviation, etc.
    - ``` "std_file_format" ```: the file name format used for the h5 file of standardized data
    - ``` "std_data_format" ```: the data name format used for standardized data inside the h5 file
    - ``` "std_data_list" ```: the list of data variables to be standardized
    - ``` "std_dataset_list" ```: the list of data sets to be standardized
    - ``` "num_error_tolerance" ```: the numerical tolerance for passing the standardization test
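
The two preprocessing steps listed above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the implementation in ``` ./lib ```: the helper names ``` drop_nan_time_steps ``` and ``` standardize ``` are hypothetical, time is assumed to be axis 0, the last axis is assumed to hold the variables (for example u and v), and one mean and one standard deviation per variable is assumed.

```python
import numpy as np


def drop_nan_time_steps(data):
    """Drop every index along axis 0 (assumed to be time) that contains a NaN.

    Hypothetical helper: returns the cleaned array and the dropped indices.
    """
    other_axes = tuple(range(1, data.ndim))
    has_nan = np.isnan(data).any(axis=other_axes)
    return data[~has_nan], np.flatnonzero(has_nan)


def standardize(data, axis=(0, 1, 2)):
    """Return (standardized data, mean, stddev), with statistics taken over `axis`.

    Hypothetical helper: with the default axes, a (time, y, x, variable) array
    gives one mean and one stddev per variable.
    """
    mean = data.mean(axis=axis)
    stddev = data.std(axis=axis)
    return (data - mean) / stddev, mean, stddev


# Synthetic stand-in for a (time, y, x, variable) wind field; not real data.
rng = np.random.default_rng(0)
raw = rng.normal(loc=0.7, scale=3.1, size=(10, 4, 4, 2))
raw[3, 1, 2, 0] = np.nan  # inject a single NaN at time index 3

clean, dropped = drop_nan_time_steps(raw)
std_data, mean, stddev = standardize(clean)

# Round-trip check: undo the standardization and compare with the cleaned input.
recovered = std_data * stddev + mean
max_error = np.abs(recovered - clean).max()
passed = max_error < 1e-5  # same role as "num_error_tolerance"
```

The round-trip check mirrors the role that ``` "num_error_tolerance" ``` plays in the parameters: the standardized data, multiplied by the standard deviation and shifted by the mean, should reproduce the input to within the tolerance.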

In [2]:
# Set up the preprocessing parameters
output_folder = "../data/preprocessed/"
file_format = "%s"
std_file_format = "np_gan_standard"
std_data_format = "np_%s"
parameters = {
    "nan_dim_along": "time",
    "nan_data_irrelevant": "absolute_height",
    "output_folder": output_folder,
    "file_format": file_format,
    "stat_dim": "time",
    "std_file_format": std_file_format,
    "std_data_format": std_data_format,
    "std_data_list": ["u", "v"],
    "std_dataset_list": ["perdigao_low_res_1H_2020", "perdigao_high_res_1H_2020"],
    "num_error_tolerance": 1e-5,
}
list_of_data_set_path = [
    '/home/twang/data-challenge/data-challenge-2022/data/perdigao_era5_2020.nc',
    '/home/twang/data-challenge/data-challenge-2022/data/perdigao_low_res_1H_2020.nc',
    '/home/twang/data-challenge/data-challenge-2022/data/perdigao_high_res_1H_2020.nc',
]

In [3]:
# Run the preprocessing
data_preprocess(list_of_data_set_path, parameters)

Creating DataSets by loading files:['/home/twang/data-challenge/data-challenge-2022/data/perdigao_era5_2020.nc', '/home/twang/data-challenge/data-challenge-2022/data/perdigao_low_res_1H_2020.nc', '/home/twang/data-challenge/data-challenge-2022/data/perdigao_high_res_1H_2020.nc']
Removing NAN indices along dimension:time
Searching NAN in DataSets: dict_keys(['perdigao_era5_2020', 'perdigao_low_res_1H_2020', 'perdigao_high_res_1H_2020'])...
Checking nan pattern of variable: u100  for DataSet: perdigao_era5_2020
Total number of NAN: (0,), along (0,) time indicies
Checking nan pattern of variable: v100  for DataSet: perdigao_era5_2020
Total number of NAN: (0,), along (0,) time indicies
Checking nan pattern of variable: t2m  for DataSet: perdigao_era5_2020
Total number of NAN: (0,), along (0,) time indicies
Checking nan pattern of variable: i10fg  for DataSet: perdigao_era5_2020
Total number of NAN: (0,), along (0,) time indicies
NAN pattern along dimension: time, is CONSISTENT for all othe

		Stacking data v, stdrzd with mean:-9.072684861166636e-07, std:1.000000238418579
	Stacked 2 data in data_dict: ['u', 'v']
		Stacking data u, mean with mean:0.7011979818344116, std:0.0
		Stacking data v, mean with mean:-1.0068085193634033, std:0.0
	Stacked 2 data in data_dict: ['u', 'v']
		Stacking data u, std with mean:3.149406909942627, std:0.0
		Stacking data v, std with mean:2.8781955242156982, std:0.0
	Stacked 2 data in data_dict: ['u', 'v']
	stacked: mean_shape:(2,), stddev_shape:(2,), std_shape:(8520, 192, 192, 2)
	stacked std stat(over all other dimensions (0, 1, 2)): mean:[-1.8022865e-07 -9.0726849e-07], stddev:[0.9999977 1.0000002]
Checking standardrized data set: perdigao_low_res_1H_2020
	Reconstruct stacked unstandardrized data!
	 data_set_recovered = std_data[(8520, 96, 96, 2)] * stddev[(2,)] + mean[(2,)]
		Stacking data u, raw with mean:0.7051483988761902, std:3.1869051456451416
		Stacking data v, raw with mean:-1.014777421951294, std:2.882791519165039
	Stacked 2 data in 

## GAN for SR

## Load data
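
Judging from the log printed by the loading cell, ``` get_data_xy_from_h5 ``` appears to examine every data name in the h5 file and load only those that match the keyword for x or y while skipping names on the exclude list. The sketch below reproduces just that name selection in plain Python. The helper ``` select_dataset_names ``` is hypothetical and substring matching is an assumption; the actual logic lives in ``` ./lib ```.

```python
def select_dataset_names(names, keyword, exclude_list):
    """Keep names containing `keyword` and none of the substrings in `exclude_list`.

    Hypothetical helper; substring matching is an assumption.
    """
    return [
        name
        for name in names
        if keyword in name and not any(tag in name for tag in exclude_list)
    ]


# Data names as listed in the log for np_gan_standard.h5.
names = [
    "np_perdigao_high_res_1H_2020_mean",
    "np_perdigao_high_res_1H_2020_std",
    "np_perdigao_high_res_1H_2020_stddev",
    "np_perdigao_low_res_1H_2020_mean",
    "np_perdigao_low_res_1H_2020_std",
    "np_perdigao_low_res_1H_2020_stddev",
]
xy_keyword_dict = {"x": "low", "y": "high"}
exclude_list = ["stddev", "mean"]

# One list of matching names per role (x = low resolution, y = high resolution).
selected = {
    role: select_dataset_names(names, keyword, exclude_list)
    for role, keyword in xy_keyword_dict.items()
}
```

Under these assumptions only the standardized arrays (names ending in ``` _std ```) are selected, one for the low-resolution input x and one for the high-resolution target y.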

In [4]:
output_folder = "../data/preprocessed/"
file_format = "np_gan_standard"
xy_keyword_dict = {"x":"low", "y":"high"}
data_xy = get_data_xy_from_h5(output_folder, file_format, xy_keyword_dict, exclude_list = ["stddev", "mean"])

Data in file ../data/preprocessed/np_gan_standard.h5 are: 
 ['np_perdigao_high_res_1H_2020_mean', 'np_perdigao_high_res_1H_2020_std', 'np_perdigao_high_res_1H_2020_stddev', 'np_perdigao_low_res_1H_2020_mean', 'np_perdigao_low_res_1H_2020_std', 'np_perdigao_low_res_1H_2020_stddev']
Examining data np_perdigao_high_res_1H_2020_mean
Examining data np_perdigao_high_res_1H_2020_std
Examining data np_perdigao_high_res_1H_2020_stddev
Examining data np_perdigao_low_res_1H_2020_mean
Examining data np_perdigao_low_res_1H_2020_std
Loading data np_perdigao_low_res_1H_2020_std from file ../data/preprocessed/np_gan_standard.h5
Data in file ../data/preprocessed/np_gan_standard.h5 are: 
 ['np_perdigao_high_res_1H_2020_mean', 'np_perdigao_high_res_1H_2020_std', 'np_perdigao_high_res_1H_2020_stddev', 'np_perdigao_low_res_1H_2020_mean', 'np_perdigao_low_res_1H_2020_std', 'np_perdigao_low_res_1H_2020_stddev']
Examining data np_perdigao_high_res_1H_2020_mean
Examining data np_perdigao_high_res_1H_2020_std
L