# MJO emulation via POD

The Madden-Julian Oscillation (MJO) is an intraseasonal phenomenon that characterizes the tropical atmosphere. Its characteristic period varies between 30 and 90 days, and it arises from a coupling between the large-scale atmospheric circulation and deep convection. The resulting pattern propagates slowly eastward at a speed of $4$ to $8$ $\mathrm{m\,s^{-1}}$. The MJO is rather irregular: at large scales it can be seen as a mix of multiple high-frequency, small-scale convective phenomena. The flow realizations in the dataset represent the total precipitation.

The first step in analyzing this dataset is to import the required libraries, including the custom libraries
- `from pyspod.auxiliary.pod_standard import POD_standard`
- `from pyspod.auxiliary.emulation   import Emulation`
- `import pyspod.utils_weights as utils_weights`
- `import pyspod.auxiliary.utils_emulation as utils_emulation`


In [None]:
import os
import sys
import time
import xarray as xr
import numpy  as np
import opt_einsum as oe
from pathlib  import Path

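# Current, parent and file paths
# (`__file__` is undefined in a notebook session, so fall back to the working directory)
CWD = os.getcwd()
CF  = os.path.realpath(__file__) if '__file__' in globals() else CWD
CFD = os.path.dirname(CF) if '__file__' in globals() else CWD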

# Import library specific modules
sys.path.append(os.path.join(CFD, "../../../../"))
from pyspod.spod_low_storage import SPOD_low_storage
from pyspod.spod_low_ram     import SPOD_low_ram
from pyspod.spod_streaming   import SPOD_streaming
from pyspod.auxiliary.pod_standard  import POD_standard
from pyspod.auxiliary.emulation     import Emulation
import pyspod.utils_weights as utils_weights
import pyspod.postprocessing as post
import pyspod.auxiliary.utils_emulation as utils_emulation  
import mjo_plotting_utils as mjo_plot


The second step consists of downloading the data file `EI_1979_2017_TP228128_reduced5000.nc` from (...) and storing it in the folder `../../../../../pyspod/test/data`, which is where the next cell looks for it.

In [None]:
file = os.path.join('../../../../../pyspod/test/data/', 'EI_1979_2017_TP228128_reduced5000.nc')
ds = xr.open_dataset(file, chunks={"time": 10})

da = ds.to_array()
da = oe.contract('vtij->tijv', da)

## Define global variables and global parameters

The data are stored in an array `X` (called `da` in the code) which, to be compatible with the `PySPOD` library, must have the following features:
- the first dimension must correspond to the number of time snapshots (5000 in our case)
- the last dimension must correspond to the number of variables (1 in our case)
- the remaining dimensions must correspond to the spatial dimensions (241 and 480 in our case, corresponding to the latitude and longitude coordinates).

We note that in the present test case the data array `X`, once the variable axis has been moved to the last position, is in a shape that is suitable for `PySPOD`, as its dimensions are (omitting the trailing variable axis of length 1):
$$\text{$X$ dimensions} = 5000 \times 241 \times 480 $$
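
As a minimal sketch of the axis ordering described above, the following example builds a small synthetic array stored as (variable, time, latitude, longitude) and moves the variable axis to the last position. The array sizes and the names `raw`, `X_toy` and `X_einsum` are made up for illustration; they are not part of the dataset or of `PySPOD`.

```python
import numpy as np

# Small synthetic stand-in for the dataset: (variable, time, lat, lon).
n_vars, n_time, n_lat, n_lon = 1, 6, 4, 5
raw = np.arange(n_vars * n_time * n_lat * n_lon, dtype=float)
raw = raw.reshape(n_vars, n_time, n_lat, n_lon)

# Move the variable axis to the end: (time, lat, lon, variable).
X_toy = np.moveaxis(raw, 0, -1)

# The same reordering written as an einsum index string.
X_einsum = np.einsum('vtij->tijv', raw)

print(X_toy.shape)
```

Both forms only permute the axes; no values are changed or copied into a different order along time or space.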

Other global variables and parameters are also defined. In detail:
- `nt`: integer, number of snapshots, read from the data array
- `t`: vector containing the time instants at which the flow realizations have been stored
- `x1`: list of coordinates along the x1-axis (in this case x1='longitude')
- `x2`: list of coordinates along the x2-axis (in this case x2='latitude')
- `trainingDataRatio`: real, ratio between the number of training snapshots and the total number of snapshots (0.75 in this tutorial, written directly in the code)

The dictionary `params` stores the following variables:
- `time_step`: time step
- `n_space_dims`: integer, number of spatial dimensions (2 in our case)
- `n_variables`: integer, number of variables (1 in our case)
- `n_modes_save`: integer, number of modes that are taken into account (and stored)
- `savedir`: string, name of the directory where results will be saved
- `normalize_weights` (optional): boolean that activates the normalization of the weights by the data variance
- `normalize_data` (optional): boolean that activates the normalization of the data by the data variance

**Note that we do not set any parameter for the weights adopted to compute the inner product in the SPOD calculation. In this case, the algorithm automatically uses uniform weighting (weights equal to 1) and issues a warning stating that default uniform weighting is in use.** 

The dictionary `params_emulation` stores the following variables, which define the relevant parameters of a single-layer neural network:
- `network`: string, type of neural network (`'lstm'` in the present tutorial)
- `epochs`: integer, number of epochs
- `batch_size`: integer, batch size
- `n_seq_in`: integer, length of the input sequence
- `n_seq_out`: integer, number of steps to predict
- `n_neurons`: integer, number of neurons in each layer
- `dropout`: real, dropout rate
- `savedir`: string, name of the directory where results will be saved

In the present test case we use 40 (`n_seq_in`=40) previous values of the coefficients in order to evaluate the next one (`n_seq_out`=1).
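
To make the roles of `n_seq_in` and `n_seq_out` concrete, the sketch below cuts a coefficient time series into input windows of 40 steps and targets of 1 step. The helper `make_windows` and the synthetic `series` are hypothetical and only illustrate the idea; how `Emulation` arranges its training samples internally is not shown in this tutorial and may differ.

```python
import numpy as np

def make_windows(series, n_seq_in, n_seq_out):
    """Split a (n_features, n_time) array into input/target windows.

    Hypothetical helper for illustration only.
    """
    n_features, n_time = series.shape
    n_samples = n_time - n_seq_in - n_seq_out + 1
    inputs = np.empty((n_samples, n_seq_in, n_features))
    targets = np.empty((n_samples, n_seq_out, n_features))
    for k in range(n_samples):
        # n_seq_in consecutive states form one input sample ...
        inputs[k] = series[:, k:k + n_seq_in].T
        # ... and the following n_seq_out states are the target.
        targets[k] = series[:, k + n_seq_in:k + n_seq_in + n_seq_out].T
    return inputs, targets

# Two synthetic coefficient time series with 100 time steps.
t_axis = np.arange(100)
series = np.vstack([np.sin(0.1 * t_axis), np.cos(0.1 * t_axis)])
inputs, targets = make_windows(series, n_seq_in=40, n_seq_out=1)
print(inputs.shape, targets.shape)
```

With 100 time steps, 40 input steps and 1 output step, the series yields 60 overlapping samples.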

In [None]:
# we extract time, longitude and latitude
t = np.array(ds['time'])
nt = t.shape[0]
xshape =  da[0,...,0].shape
nx = da[0,...,0].size
nv = 1 

ntSPOD = int(0.7*len(t))
tSPOD = t[:ntSPOD]
print('t = ', t)
print('tSPOD = ', tSPOD)
x1 = np.array(ds['longitude'])
x2 = np.array(ds['latitude'])
print('shape of t (time): ', t.shape)
print('shape of x1 (longitude): ', x1.shape)
print('shape of x2 (latitude) : ', x2.shape)
variables = ['tp']

# define required and optional parameters for spod
# 12-year monthly analysis
dt_hours     = 12      
period_hours = 24 * 365 
params = {
	'time_step'   	   : dt_hours,
	'n_snapshots' 	   : len(t),
	'n_snapshots_POD'  : ntSPOD, # number of time snapshots for generating SPOD base
	'n_space_dims'	   : 2,
	'n_variables' 	   : len(variables),
	'mean_type'        : 'longtime',
	'normalize_weights': False,
	'normalize_data'   : False,
	'n_modes_save'     : 2,
	'savedir'          : os.path.join(CWD, 'results', Path(file).stem)
}
print('params \n', params)

params_emulation = dict()

params_emulation['network'     ] = 'lstm' 						# type of network
params_emulation['epochs'      ] = 100						# number of epochs
params_emulation['batch_size'  ] = 32							# batch size
params_emulation['n_seq_in'    ] = 40							# dimension of input sequence 
params_emulation['n_seq_out'   ] = 1                          # number of steps to predict
params_emulation['n_neurons'   ] = 60                          # number of neurons
params_emulation['dropout'     ] = 0.15                          # dropout
params_emulation['savedir'     ] = os.path.join(CWD, 'results', Path(file).stem)


The initialization phase ends by setting the weights.

In [None]:
# set weights
st = time.time()
weights = utils_weights.geo_trapz_2D(
	x1_dim=x2.shape[0], 
	x2_dim=x1.shape[0],
	n_vars=len(variables), 
	R=1
)

## Compute POD modes and coefficients 

The following lines of code initialize the variables. `X_train` and `X_test` are numpy data structures that contain the training set and the testing set, respectively; their dimensions are therefore (`nt_train`, 241, 480) and (`nt_test`, 241, 480), where in this test case `nt_train` $= 0.75 \cdot nt$ and `nt_test` $= 0.25 \cdot nt$.

In [None]:
params['mean_type'] = 'blockwise'
params['reuse_blocks'] = False

nt_train = int(0.75 * nt)
nt_test = nt - nt_train
X_train = da[:nt_train,:,:]
X_test  = da[nt_train:,:,:]

Once we have loaded the data, defined the required and optional parameters, and allocated the training and testing structures, we can perform the analysis. This step is accomplished by calling the `PySPOD` constructor `POD_standard(params=params, data_handler=False, variables=variables)` and the `fit` method, `POD_analysis.fit(data=X_train, nt=nt_train)`.

The `POD_standard` constructor takes the following arguments:
  - `params`: a dictionary containing the parameters that we have just defined.
  - `data_handler`: either `False` or a function handler. If it is a function handler, it must hold the function used to read the data. The template for this function takes the data file as its first argument, the time indices used to slice the data in time as its second and third arguments, and a list containing the names of the variables as its fourth argument. See our data reader as an example and modify it according to your needs.
  - `variables`: a list containing our variables.

The `fit` function takes the data and the number of snapshots as input and returns a reference to the instance on which it was called.
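
The reader template described above could look like the following sketch. The function name `read_data`, the `.npz` storage format, and the half-open slicing convention are assumptions made for illustration; they are not taken from `PySPOD`, whose own data reader should be consulted for the actual conventions.

```python
import os
import tempfile
import numpy as np

def read_data(data, t_0, t_end, variables):
    """Hypothetical reader following the four-argument template.

    data       : path to an .npz file holding one array per variable,
                 each of shape (time, lat, lon)
    t_0, t_end : first (inclusive) and last (exclusive) time index
    variables  : list of variable names to load
    """
    with np.load(data) as archive:
        fields = [archive[name][t_0:t_end] for name in variables]
    # Stack the variables along a new last axis: (time, lat, lon, variable).
    return np.stack(fields, axis=-1)

# Usage on a small synthetic file written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, 'toy_precipitation.npz')
np.savez(toy_path, tp=np.random.default_rng(0).random((20, 4, 5)))
block = read_data(toy_path, 5, 15, ['tp'])
print(block.shape)
```

The returned block keeps time as the first axis and the variables as the last one, which matches the layout required for the data array.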

In [None]:
# POD analysis
POD_analysis = POD_standard(
	params=params, 
	data_handler=False, 
	variables=variables
	)

# fit 
pod = POD_analysis.fit(data=X_train, nt=nt_train)

The `transform` function computes the precipitation fluctuations by subtracting the mean field from the snapshots. The POD modes are then evaluated, and the coefficients are obtained by projecting the fluctuation snapshots onto the reduced POD basis formed by the most significant modes. In detail, the `pod.transform` function accepts as input
- `data`: dataset on which the analysis is performed
- `nt`: number of snapshots in the dataset `data`

and it returns a dictionary that contains the following keys:
- `t_mean`: the time average of the snapshots
- `phi_tilde`: the most significant modes, i.e. the ones associated with the `n_modes_save` largest eigenvalues. These modes identify a reduced basis that is significant for the case at hand.
- `coeffs`: the coefficients obtained by projection
- `reconstructed_data`: snapshots reconstructed by superimposing the modes multiplied by the coefficients

`coeffs_test` is instead an array that contains the coefficients evaluated by projecting the snapshots of the testing dataset onto the reduced POD basis previously computed.
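
The sequence of operations described above (mean subtraction, mode computation, projection, reconstruction) can be sketched on synthetic data with a plain SVD. This is a generic snapshot-POD illustration written with NumPy only; it uses no weights and is not the internal implementation of `POD_standard`. All names ending in `_toy` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
nt_toy, ny_toy, nx_toy, n_modes_toy = 50, 6, 8, 2

# Synthetic snapshots: two spatial patterns with oscillating amplitudes plus noise.
time_toy = np.linspace(0.0, 4.0 * np.pi, nt_toy)
pattern_1 = rng.standard_normal((ny_toy, nx_toy))
pattern_2 = rng.standard_normal((ny_toy, nx_toy))
snapshots = (np.sin(time_toy)[:, None, None] * pattern_1
             + 0.5 * np.cos(2.0 * time_toy)[:, None, None] * pattern_2
             + 0.01 * rng.standard_normal((nt_toy, ny_toy, nx_toy)))

# 1) Flatten space and subtract the time mean to obtain the fluctuations.
Q = snapshots.reshape(nt_toy, -1)              # (time, space)
t_mean_toy = Q.mean(axis=0)
Q_fluct = Q - t_mean_toy

# 2) The POD modes are the leading right singular vectors of the fluctuations.
_, sing_vals, Vt = np.linalg.svd(Q_fluct, full_matrices=False)
phi_toy = Vt[:n_modes_toy].T                   # (space, n_modes)

# 3) Coefficients: projection of the fluctuations onto the reduced basis.
coeffs_toy = phi_toy.T @ Q_fluct.T             # (n_modes, time)

# 4) Reconstruction: modes times coefficients, plus the mean.
Q_rec = (phi_toy @ coeffs_toy).T + t_mean_toy
rel_err = np.linalg.norm(Q_rec - Q) / np.linalg.norm(Q)
print(rel_err)
```

Because the synthetic data contain only two dominant patterns, two modes are enough to reconstruct them up to the noise level. Projecting a test set follows step 3 with the test fluctuations in place of `Q_fluct`, using the mean and modes computed from the training set.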

In [None]:
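# project the training snapshots onto the POD basis
coeffs_train = pod.transform(data=X_train, nt=nt_train)

# flatten the test snapshots to (time, space) and subtract the training mean
X_rearrange_test = np.reshape(X_test[:,:,:], [nt_test,pod.nv*pod.nx])
for i in range(nt_test):
	X_rearrange_test[i,:] = np.squeeze(X_rearrange_test[i,:]) - np.squeeze(coeffs_train['t_mean'])
# project the test fluctuations onto the same basis
coeffs_test = np.matmul(np.transpose(coeffs_train['phi_tilde']), X_rearrange_test.T)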

## Learning the latent space dynamics

The following lines initialize the data structures needed to train the neural network and to store its output.

In [None]:
n_modes = params['n_modes_save'] 
n_feature = coeffs_train['coeffs'].shape[0]

data_train = np.zeros([n_modes,coeffs_train['coeffs'].shape[1]],dtype='double')
data_test = np.zeros([n_modes,coeffs_test.shape[1]],dtype='double')
coeffs = np.zeros([coeffs_test.shape[0],coeffs_test.shape[1]],dtype='double')
coeffs_tmp = np.zeros([n_modes,coeffs_test.shape[1]],dtype='double')

The coefficients previously evaluated can now be used for training an LSTM-based neural network.
The `Emulation` constructor requires the following parameter:
- `params_emulation`: dictionary containing the parameters described in the previous sections. It holds all the relevant data for creating a single-layer neural network with dropout.

The neural network is initialized by calling `pod_emulation.model_initialize`, which requires the dataset on which the network will be trained.

In [None]:
# LSTM
pod_emulation = Emulation(params_emulation)

# initialization of the network
pod_emulation.model_initialize(data=data_train)

It is common practice to provide scaled input to the neural network. For this reason a scaling vector is computed by calling the function `utils_emulation.compute_normalization_vector_real`. Three different values can be used for the `normalize_method` argument:
- `localmax`: each coefficient is scaled by its local maximum
- `globalmax`: all the coefficients are scaled by the same value, which represents the global maximum
- `None`: no scaling is applied. The output vector contains ones.

Once the scaling vector is known, the scaling is applied to both the training dataset and the testing dataset.
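
The three scaling options can be illustrated with the sketch below. The function `scaling_vector` is a hypothetical stand-in, not the library routine, and it assumes that the maxima are taken over absolute values, which this tutorial does not state.

```python
import numpy as np

def scaling_vector(coeffs, method=None):
    """Hypothetical scaling vector for a (n_modes, n_time) coefficient array."""
    n_modes = coeffs.shape[0]
    if method == 'localmax':
        # One scale per coefficient: its own maximum over time.
        return np.max(np.abs(coeffs), axis=1)
    if method == 'globalmax':
        # The same scale for every coefficient: the overall maximum.
        return np.full(n_modes, np.max(np.abs(coeffs)))
    if method is None:
        # No scaling: a vector of ones.
        return np.ones(n_modes)
    raise ValueError(f'unknown method: {method}')

coeffs_demo = np.array([[1.0, -4.0, 2.0],
                        [0.5, 0.25, -1.0]])
for method in ('localmax', 'globalmax', None):
    vec = scaling_vector(coeffs_demo, method)
    scaled = coeffs_demo / vec[:, None]
    print(method, vec, np.abs(scaled).max(axis=1))
```

With `localmax` every scaled coefficient reaches magnitude 1, whereas `globalmax` preserves the relative amplitudes between coefficients. The same vector computed from the training data is reused for the test data so that both are on the same scale.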

In [None]:
idx=0

# copy and normalize data 
scaler  = utils_emulation.compute_normalization_vector_real(coeffs_train['coeffs'][:,:],normalize_method='globalmax')
data_train[:,:] = utils_emulation.normalize_data_real(coeffs_train['coeffs'][:,:], normalization_vec=scaler)
data_test[:,:]  = utils_emulation.normalize_data_real(coeffs_test[:,:], normalization_vec=scaler)

The neural network is trained by calling the `pod_emulation.model_train` function. The following inputs are required:
- `idx`: integer, an identifier associated with the neural network. Thanks to this identifier, more than one network can be trained in the same run and the weights can be stored in different files.
- `data_train`: dataset used for the training
- `data_valid`: dataset used for the validation
- `plotHistory` (optional): boolean, plots the training history when set to `True`

In [None]:
# train the network
pod_emulation.model_train(idx,
	data_train=data_train, 
	data_valid=data_test
)

After the neural network has been trained, predictions of the coefficients can be extracted with the aid of the `pod_emulation.model_inference` routine. It receives as inputs:
- `idx`: integer, a value that identifies a previously trained neural network (in this case 0, since we have only one neural network)
- `data_input`: data used to start the prediction. This array can have an arbitrary length. The first `n_seq_in` data are copied into the output vector and used for predicting the next `n_seq_out` steps

The output is a vector that has the same dimensions as `data_input` and contains the predicted scaled coefficients.

The predicted coefficients are then scaled back by calling `utils_emulation.denormalize_data_real`.
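
One way to picture the inference step is an autoregressive rollout, sketched below with `n_seq_out` equal to 1. The function `rollout` and the stand-in predictor `linear_extrapolation` are hypothetical; in particular, feeding earlier predictions back as input is an assumption about how such a routine may work, not a description of `model_inference`.

```python
import numpy as np

def rollout(predict_next, data_input, n_seq_in):
    """Autoregressive prediction over a (n_features, n_time) array.

    predict_next is a hypothetical stand-in for a trained network: it maps
    a (n_features, n_seq_in) window to the next (n_features,) state.
    """
    output = np.zeros_like(data_input)
    # The first n_seq_in columns are copied from the input as a warm start.
    output[:, :n_seq_in] = data_input[:, :n_seq_in]
    for k in range(n_seq_in, data_input.shape[1]):
        # Each window is taken from the output, so it may contain predictions.
        window = output[:, k - n_seq_in:k]
        output[:, k] = predict_next(window)
    return output

def linear_extrapolation(window):
    """Stand-in predictor: extend the line through the last two states."""
    return 2.0 * window[:, -1] - window[:, -2]

series_demo = np.vstack([np.arange(10.0), 2.0 * np.arange(10.0)])
predicted = rollout(linear_extrapolation, series_demo, n_seq_in=3)
print(predicted.shape)
```

The output has the same shape as the input, with the first `n_seq_in` columns equal to the input and the remaining columns filled by the predictor.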

In [None]:
#predict 
coeffs_tmp = pod_emulation.model_inference(
	idx,
	data_input=data_test
)

# denormalize data
coeffs = utils_emulation.denormalize_data_real(coeffs_tmp, scaler)

We now have two distinct sets of coefficients that we can use for reconstructing the snapshots contained in `X_test`:
- `coeffs_test`: the ones obtained by projecting the snapshots onto the POD basis
- `coeffs`: the ones obtained from the prediction of the LSTM-based neural network (the denormalized output of `model_inference`).

Fields are reconstructed and stored in a numpy array by calling `reconstruct_data` and providing the following inputs:
- `coeffs`: the coefficients to be used for reconstructing the fields
- `phi_tilde`: a structure containing the modes computed in the `transform` function
- `t_mean`: the mean flow previously computed with the `transform` function

In [None]:
# reconstruct solutions
emulation_rec = pod.reconstruct_data(
	coeffs=coeffs, 
	phi_tilde=coeffs_train['phi_tilde'],
	t_mean=coeffs_train['t_mean']
)
proj_rec = pod.reconstruct_data(
	coeffs=coeffs_test, 
	phi_tilde=coeffs_train['phi_tilde'],
	t_mean=coeffs_train['t_mean']
)

## Output 

The last section of the code contains some routines for visualizing the results and computing the errors.

`pod.printErrors`: computes and prints the average $L_1$, $L_2$, and $L_\infty$ norm errors for both the learning and the projection error.
The following inputs are required:
- `field_test`: the "true" solutions, i.e. snapshots that belong to the original dataset
- `field_proj`: fields reconstructed using `coeffs_test`; comparing this dataset with the one containing the true solutions gives the projection error
- `field_emul`: fields reconstructed using the emulated coefficients `coeffs`; comparing this dataset with `field_proj` gives the learning error, while comparing it with the true solutions gives the total error
- `n_snaps`: number of snapshots on which the errors are evaluated
- `n_offset`: offset
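
The three error types can be illustrated with the sketch below, which computes time-averaged $L_1$, $L_2$ and $L_\infty$ norms of the difference between two sets of fields. The helper `average_errors` and the synthetic fields are hypothetical; the exact normalization used by `pod.printErrors` is not specified in this tutorial and may differ.

```python
import numpy as np

def average_errors(field_a, field_b):
    """Time-averaged L1, L2 and L-infinity norms of field_a - field_b.

    Both inputs have shape (time, ...). Hypothetical helper for illustration.
    """
    diff = (field_a - field_b).reshape(field_a.shape[0], -1)
    l1 = np.mean(np.sum(np.abs(diff), axis=1))
    l2 = np.mean(np.sqrt(np.sum(diff**2, axis=1)))
    linf = np.mean(np.max(np.abs(diff), axis=1))
    return l1, l2, linf

rng = np.random.default_rng(1)
truth = rng.random((5, 3, 4, 1))   # stand-in for the true snapshots
proj = truth + 0.1                 # stand-in for the projection-based fields
emul = proj + 0.05                 # stand-in for the emulated fields

projection_error = average_errors(proj, truth)
learning_error = average_errors(emul, proj)
total_error = average_errors(emul, truth)
print(projection_error, learning_error, total_error)
```

The projection error measures what is lost by truncating the basis, the learning error measures what the network adds on top of that, and the total error compares the emulated fields directly with the true ones.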

`pod.plot_compareTimeSeries`: compares time series; it is used here for comparing the actual coefficients with the learned ones. It requires as input two time series, two labels for the time series, `legendLocation` (optional), and the filename (optional).

`mjo_plot.plot_2d_2subplot`: generates subplots for visualizing the snapshots. It can show 2 fields in the same frame. Inputs:
- `title1`: title associated with the first snapshot
- `title2`: title associated with the second snapshot
- `var1`: first 2D array of dimensions $241\times480$ that we want to plot
- `var2`: second 2D array of dimensions $241\times480$ that we want to plot
- `x1`: longitude
- `x2`: latitude
- `N_round`: number of decimals to keep in the legend (optional)
- `path`: path where the results are stored (optional)
- `filename`: name of the file in which the plot is saved (optional)

In [None]:
mjo_plot.plot_2d_snap(snaps=X_train,
 	snap_idx=[100], vars_idx=[0], x1=x1-180, x2=x2)

mjo_plot.plot_2d_2subplot(
	title1='Projection-based solution', 
	title2='LSTM-based solution',
	var1=proj_rec[100,:,:,0], 
	var2=emulation_rec[100,:,:,0], 
	x1 = x1-180, x2 = x2,
	N_round=6, path='CWD', filename=None, coastlines='centred', maxVal = 0.002, minVal= -0.0001
	)

pod.printErrors(field_test=X_test, field_proj=proj_rec, field_emul=emulation_rec, n_snaps = 1000, n_offset = 100)
