# Empirical Orthogonal Function Analysis (EOFs)
Also called Principal Component Analysis (PCA)

#### Framing of the problem:

In climate, we often have lots of data that varies (and co-varies) in space and time.  For example, we have our monthly precipitation data as a time series of maps with dimensions `[time, lat, lon]`. 

We want to understand the variability of the precipitation and answer questions like: 

* Why does it rain more or less at times in this location or that location? 

* What large-scale patterns are there that are associated with more or less rainfall in certain regions?  

* Is there any regularity in time about when it rains more or less?

It is impossible to look at thousands or tens of thousands of maps or even movies of our data to identify patterns and understand this.

__Climate data is complicated because it varies in space and time__

### We use EOFs to simplify our data

We simplify our data by identifying the patterns that are associated with the largest amount of variability. We also want each of these patterns to be unrelated to the others. 

__What do I mean by this?__

We want to identify some simpler set of spatial patterns (i.e. maps) that explain the most variability and a corresponding timeseries that tells us how that spatial pattern varies.  We want each spatial pattern to tell us something different than the other spatial patterns.

### Overview Summary 

EOFs will:

* Find the spatial patterns of variability
* Find their time variation
* Give a measure of importance of each pattern

You can think of EOFs as:

* a method for simplifying our data (data reduction method)
* a way of identifying spatial and temporal patterns of importance (in terms of variance) in climate data 

### What is it?

_Note: This is a high-level explanation designed to not require extensive math.  The detailed mathematical explanation is left for statistics class or this [document]()._

It is a way of reducing the complexity of our data by finding a new coordinate system (instead of x,y,z,...) which aligns with the direction of the most variance in the data.  The coordinates where the data has little variance can then be eliminated, reducing the dimensionality and complexity of our data.
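
As a rough illustration of this idea, the sketch below builds a synthetic 2D cloud of points (not real climate data) in which `y` mostly follows `x`, then rotates it into the coordinate system that aligns with the largest variance. The variable names and the synthetic data are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 2D points (an assumption for illustration): y mostly follows x
x = rng.standard_normal(500)
y = 0.9 * x + 0.1 * rng.standard_normal(500)
points = np.column_stack([x, y])
points = points - points.mean(axis=0)  # remove the mean of each coordinate

# The new coordinate axes are the eigenvectors of the 2x2 covariance matrix
cov = np.cov(points, rowvar=False)
evals, evecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]         # re-order so the largest comes first
evals, evecs = evals[order], evecs[:, order]

# Express the points in the new coordinate system
rotated = points @ evecs

# Fraction of the total variance along each new coordinate
var_fraction = evals / evals.sum()
print(var_fraction)
```

In this toy case almost all of the variance lies along the first new coordinate, so the second one could be dropped with little loss of information.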

__Examples__

A very graphical explanation is provided [here](https://setosa.io/ev/principal-component-analysis/)

In these examples, we could see graphically what is happening for 2D, 3D, and sort of for 17D.  In climate we have many more dimensions to our data. For global 1°x1° data, there are 360 x 180 = 64,800 grid points, so we have at least 64,800 spatial dimensions, plus time.  
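
To make the count concrete, here is a minimal sketch that flattens a `[time, lat, lon]` array into `[time, space]`. The array is filled with random numbers and the sizes (`nt`, `nlats`, `nlons`) are assumptions for illustration, not the course dataset.

```python
import numpy as np

# Assumed global 1x1 degree grid
nlats, nlons = 180, 360
nspace = nlats * nlons
print(nspace)  # 64800

# Small synthetic [time, lat, lon] array standing in for real data
nt = 12
rng = np.random.default_rng(0)
data = rng.standard_normal((nt, nlats, nlons))

# Collapse lat and lon into a single "space" dimension
X = data.reshape(nt, nspace)
print(X.shape)  # (12, 64800)
```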

### How do we calculate them (and some terminology)?

EOFs are calculated by identifying the most important patterns of variability, and how important they are. The patterns are called `eigenvectors` and their degree of importance is measured in `eigenvalues`. 

* The `eigenvectors` and `eigenvalues` are calculated from the `covariance matrix`.   

* The `covariance matrix` holds all the information about how the data at each point in space varies together with the data at every other point over time.

* The `eigenvectors` identify the new coordinates in our data where the variance is largest based on our `covariance matrix`. In our problem setup, each new coordinate is a spatial pattern (a map). An additional constraint called `orthogonality` ensures that the spatial patterns are mutually orthogonal, so each one describes variability that the others do not.

* The `eigenvalues` measure the importance of each `eigenvector`, so they rank the spatial patterns by how much of the variance each one explains.  
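
The steps in the list above can be sketched with NumPy as follows. This is a minimal illustration on a tiny synthetic `[time, space]` array, assuming a single known spatial pattern plus noise; it is not necessarily how a dedicated EOF package does the calculation, and all names and sizes here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
nt, nspace = 200, 6  # assumed sizes, kept tiny so the matrices are easy to inspect

# Synthetic data: one time signal multiplied by a fixed spatial pattern, plus noise
pattern = np.array([1.0, 0.8, 0.5, -0.5, -0.8, -1.0])
signal = rng.standard_normal(nt)
X = np.outer(signal, pattern) + 0.1 * rng.standard_normal((nt, nspace))
X = X - X.mean(axis=0)  # remove the time mean at each point in space

# Covariance matrix between every pair of spatial points: [space, space]
C = (X.T @ X) / (nt - 1)

# Eigen-decomposition; eigh is appropriate because C is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(C)

# eigh returns ascending order, so sort from most to least important
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i is the i-th spatial pattern

# Importance of each pattern as a fraction of the total variance
variance_fraction = eigenvalues / eigenvalues.sum()
print(variance_fraction)
```

Because the synthetic data contains only one real pattern, the first eigenvector should carry nearly all of the variance, and its shape should match `pattern` up to a sign and a scale factor.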

Given data $X$ with mean removed and with dimensions `[time,space]`, the data can be re-defined based on the new coordinate system in terms of its  spatial part `EOF spatial patterns` and its temporal part `PC timeseries` based on its `eigenvectors`:

$X[\text{time},\text{space}] = PC[\text{time},\text{enum}] \times EOF^T[\text{enum},\text{space}]$

where 

* `enum` tells us which `eigenvector`
* `space` is all our points in space (`nlons*nlats`)
* `time` is all our times `nt`

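We get the `EOF spatial patterns` from the `eigenvectors` and we get the corresponding `PC timeseries` by solving for them in the above equation. Because the `eigenvectors` are orthogonal and have unit length, this amounts to projecting the data onto each spatial pattern: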

$PC[\text{time},\text{enum}] = X[\text{time},\text{space}] \times EOF[\text{space},\text{enum}]$
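
The two equations can be checked numerically. The sketch below uses a small random `[time, space]` array (an assumption for illustration) to compute the PCs by projection, rebuild the data from all the modes, and then rebuild an approximation from only the first few modes. The names `k` and `X_truncated` are introduced here for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
nt, nspace = 50, 8                      # assumed sizes for a toy example
X = rng.standard_normal((nt, nspace))
X = X - X.mean(axis=0)                  # remove the time mean at each point

# EOFs: eigenvectors of the [space, space] covariance matrix, largest first
C = (X.T @ X) / (nt - 1)
evals, EOF = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, EOF = evals[order], EOF[:, order]

# PC[time, enum] = X[time, space] x EOF[space, enum]
PC = X @ EOF

# X[time, space] = PC[time, enum] x EOF^T[enum, space]
X_rebuilt = PC @ EOF.T
print(np.allclose(X, X_rebuilt))

# The PC timeseries are uncorrelated and their variances are the eigenvalues
PC_cov = (PC.T @ PC) / (nt - 1)
print(np.allclose(PC_cov, np.diag(evals)))

# Data reduction: keep only the first k modes
k = 3
X_truncated = PC[:, :k] @ EOF[:, :k].T
print(X_truncated.shape)
```

Using all the modes recovers the data exactly (to floating-point precision), while keeping only the leading modes gives a simplified version of the data with the same shape.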