# Stormer Architecture Breakdown

This notebook breaks down the elements of the Stormer architecture. It serves as documentation for the development of the code and its customizations.

## Acknowledgments 

The content of this breakdown is based on the following public repo:
* **Stormer**: https://github.com/tung-nd/Stormer

## Stormer Architecture

<img src="stormer_architecture.png" alt="Stormer Architecture">

## Equations

The Stormer pipeline is described by the following equations from Sections 3 and 4 of the paper.

$$
\mathcal{L}(\theta)=\mathbb{E}_{\delta t \sim P(\delta t),\left(X_0, X_{\delta t}\right) \sim \mathcal{D}}\left[\left\|f_\theta\left(X_0, \delta t\right)-\Delta_{\delta t}\right\|_2^2\right]
$$

$$
\mathcal{L}(\theta)=\mathbb{E}\left[\frac{1}{V H W} \sum_{v=1}^V \sum_{i=1}^H \sum_{j=1}^W w(v) L(i)\left(\widehat{\Delta}_{\delta t}^{v i j}-\Delta_{\delta t}^{v i j}\right)^2\right]
$$

$$
\mathcal{L}(\theta)=\mathbb{E}\left[\frac{1}{V H W} \sum_{v=1}^V \sum_{i=1}^H \sum_{j=1}^W w(v) L(i)\left(\widehat{\Delta}_{\delta t}^{v i j}-\Delta_{\delta t}^{v i j}\right)^2\right]
$$

$$
\mathcal{L}(\theta)=\mathbb{E}\left[\frac{1}{K V H W} \sum_{k=1}^K \sum_{v=1}^V \sum_{i=1}^H \sum_{j=1}^W w(v) L(i)\left(\widehat{\Delta}_{k \delta t}^{v i j}-\Delta_{k \delta t}^{v i j}\right)^2\right]
$$


### Stormer Architecture Stages

* **Input field tokenization**:
* **Embedding function**:
* **Lead-time (${\delta t}$) embedding**:
* **Stormer Block**: (Graph)
    * **Multi-head self-attention (MSA) module**: Module that lets the model attend to multiple parts of the input simultaneously (`torch.nn.MultiheadAttention`)
        * **Layer Norm + Scale&Shift**:
        * **Multi-head attention (MHA) module**:
        * **Scale and residual connection**:
    * **Position-Wise Feed-Forward (MLP) module**:
        * **Layer Norm + Scale&Shift**:
        * **Feed-forward (MLP) layers**:
        * **Scale and residual connection**:
* **Linear reconstruction and reshape**:
* **Forecast loss computation**: The stage where the equations above are applied
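
The "Layer Norm + Scale&Shift" and "Scale and residual connection" steps listed for each Stormer block branch can be sketched as below. This is a minimal NumPy illustration, not the repo's implementation: it assumes a DiT-style modulation of the form `x * (1 + scale) + shift`, and the names `modulate`, `adaln_branch`, and `w_mod` are hypothetical. The `sublayer` argument stands in for either the attention or the MLP layers.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature axis, without learned affine parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def modulate(x, shift, scale):
    # Scale & shift conditioned on the lead-time embedding (assumed form).
    return x * (1.0 + scale[:, None, :]) + shift[:, None, :]

def adaln_branch(x, cond, w_mod, sublayer):
    # x: (batch, tokens, dim) token embeddings.
    # cond: (batch, dim) lead-time embedding.
    # w_mod: (dim, 3 * dim) hypothetical projection giving shift, scale, gate.
    shift, scale, gate = np.split(cond @ w_mod, 3, axis=-1)
    h = sublayer(modulate(layer_norm(x), shift, scale))
    # Scale the branch output by the gate, then add the residual.
    return x + gate[:, None, :] * h

rng = np.random.default_rng(0)
B, N, D = 2, 4, 8
x = rng.normal(size=(B, N, D))
cond = rng.normal(size=(B, D))

# With a zero projection the gate is zero, so the branch is the identity.
out_identity = adaln_branch(x, cond, np.zeros((D, 3 * D)), sublayer=np.tanh)

# With a random projection the lead-time embedding changes the output.
out_modulated = adaln_branch(x, cond, rng.normal(size=(D, 3 * D)), sublayer=np.tanh)
```

Both the MSA branch and the MLP branch follow this same pattern; only the `sublayer` differs.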


### Equation 1 Overview

$$
\mathcal{L}(\theta)=\mathbb{E}_{\delta t \sim P(\delta t),\left(X_0, X_{\delta t}\right) \sim \mathcal{D}}\left[\left\|f_\theta\left(X_0, \delta t\right)-\Delta_{\delta t}\right\|_2^2\right]
$$

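This equation is Stormer's randomized-horizon dynamics forecasting objective: the lead time $\delta t$ is sampled from $P(\delta t)$, and the model $f_\theta$ predicts the change in weather conditions over that interval rather than the future state itself.

The target is the difference between the weather conditions at lead time $\delta t$ and the initial conditions: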
$\Delta_{\delta t}=X_{\delta t}-X_0$
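
A single-sample version of Equation 1 can be sketched as follows. This is an illustrative sketch under stated assumptions: the lead-time set `[6, 12, 24]` (in hours) and uniform sampling are assumed here for demonstration, and `randomized_dynamics_loss`, `trajectory`, and the toy models are hypothetical names, not part of the Stormer code.

```python
import numpy as np

def randomized_dynamics_loss(model, trajectory, lead_times, rng):
    # trajectory: dict mapping an hour offset to a state array of shape (V, H, W).
    dt = int(rng.choice(lead_times))       # sample dt from P(dt), assumed uniform
    x0 = trajectory[0]
    target = trajectory[dt] - x0           # Delta_dt = X_dt - X_0
    pred = model(x0, dt)                   # f_theta(X_0, dt)
    loss = float(np.sum((pred - target) ** 2))  # squared L2 norm
    return loss, dt

rng = np.random.default_rng(0)
V, H, W = 2, 3, 4

# Toy trajectory: every grid point equals the hour offset.
trajectory = {t: np.full((V, H, W), float(t)) for t in (0, 6, 12, 24)}

# A model that predicts the exact change, and one that predicts no change.
perfect_model = lambda x0, dt: np.full_like(x0, float(dt))
zero_model = lambda x0, dt: np.zeros_like(x0)

loss_perfect, dt_a = randomized_dynamics_loss(perfect_model, trajectory, [6, 12, 24], rng)
loss_zero, dt_b = randomized_dynamics_loss(zero_model, trajectory, [6, 12, 24], rng)
```

In training, the expectation in Equation 1 would be approximated by averaging this quantity over a batch of sampled pairs and lead times.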

### Equation 2 Overview

$$
\mathcal{L}(\theta)=\mathbb{E}\left[\frac{1}{V H W} \sum_{v=1}^V \sum_{i=1}^H \sum_{j=1}^W w(v) L(i)\left(\widehat{\Delta}_{\delta t}^{v i j}-\Delta_{\delta t}^{v i j}\right)^2\right]
$$
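
The weighted average inside the expectation can be sketched in NumPy as below. The sketch assumes that $w(v)$ is a per-variable weight and $L(i)$ a per-latitude weight; the cosine-of-latitude form normalized to unit mean is a common convention and is an assumption here, as are the function names and the toy weights.

```python
import numpy as np

def latitude_weights(lats_deg):
    # Assumed form of L(i): cos(latitude), normalized to have mean 1.
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def weighted_mse(pred, target, var_w, lat_w):
    # pred, target: (V, H, W); var_w: (V,) for w(v); lat_w: (H,) for L(i).
    sq_err = (pred - target) ** 2
    weighted = var_w[:, None, None] * lat_w[None, :, None] * sq_err
    # The mean over all axes is the 1 / (V * H * W) triple sum.
    return float(weighted.mean())

V, H, W = 2, 3, 4
lat_w = latitude_weights(np.array([-60.0, 0.0, 60.0]))
var_w = np.array([1.0, 3.0])   # hypothetical per-variable weights

pred = np.ones((V, H, W))
target = np.zeros((V, H, W))
loss = weighted_mse(pred, target, var_w, lat_w)
```

Because the latitude weights are normalized to mean 1, a uniform unit error yields a loss equal to the mean of the variable weights.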

### Equation 3 Overview

$$
\mathcal{L}(\theta)=\mathbb{E}\left[\frac{1}{V H W} \sum_{v=1}^V \sum_{i=1}^H \sum_{j=1}^W w(v) L(i)\left(\widehat{\Delta}_{\delta t}^{v i j}-\Delta_{\delta t}^{v i j}\right)^2\right]
$$

### Equation 4 Overview

$$
\mathcal{L}(\theta)=\mathbb{E}\left[\frac{1}{K V H W} \sum_{k=1}^K \sum_{v=1}^V \sum_{i=1}^H \sum_{j=1}^W w(v) L(i)\left(\widehat{\Delta}_{k \delta t}^{v i j}-\Delta_{k \delta t}^{v i j}\right)^2\right]
$$
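
The additional average over $K$ rollout steps can be sketched as follows. This is a hedged illustration, not the repo's training loop: it assumes the model is rolled out autoregressively by adding each predicted change to the current state, and that $\Delta_{k \delta t}$ is measured relative to $X_0$, so the error at step $k$ reduces to the difference between the predicted and true states. All names are hypothetical.

```python
import numpy as np

def weighted_mse(pred, target, var_w, lat_w):
    # Same weighted average as in the single-step loss.
    sq_err = (pred - target) ** 2
    return float((var_w[:, None, None] * lat_w[None, :, None] * sq_err).mean())

def multi_step_loss(model, states, dt, var_w, lat_w):
    # states: [X_0, X_dt, ..., X_{K dt}], each of shape (V, H, W).
    x0 = states[0]
    x = x0
    K = len(states) - 1
    total = 0.0
    for k in range(1, K + 1):
        x = x + model(x, dt)            # feed the prediction back in
        delta_hat = x - x0              # predicted change after k steps
        delta_true = states[k] - x0     # true change after k steps (assumed definition)
        total += weighted_mse(delta_hat, delta_true, var_w, lat_w)
    return total / K                    # the 1 / K average over rollout steps

V, H, W, K = 2, 3, 4, 3
states = [np.full((V, H, W), float(k)) for k in range(K + 1)]
var_w = np.ones(V)
lat_w = np.ones(H)

# A model that predicts the exact per-step change, and one that predicts none.
perfect_model = lambda x, dt: np.ones_like(x)
zero_model = lambda x, dt: np.zeros_like(x)

loss_perfect = multi_step_loss(perfect_model, states, 6, var_w, lat_w)
loss_zero = multi_step_loss(zero_model, states, 6, var_w, lat_w)
```

With the zero-change model the error grows with each step ($1^2, 2^2, 3^2$), which shows how the multi-step objective penalizes errors that accumulate over the rollout.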

## Bringing the equations together

Sections 3.3 and 3.4.1 describe the Stormer architecture as follows:

>

## Paper Hyperparameters