# Practical Guide to Calculating EOFs

## Step 1: Data Preparation

Most programming languages have built-in functions to do the covariance, eigenvectors, and eigenvalue calculations for us.  

Our first job is to make sure we set up the data to use these functions correctly.  

Most of the work to calculate EOFs is in the preparation of our data. 

* garbage in = garbage out

* garbage can look like exciting results

Here's what we need to do:

* subset to the region of interest (this is an important choice because the result is highly dependent on the region selected; this is where your climate expertise is important)
* convert the data to anomalies
* make sure there are no missing values
* weight the data by the square root of the cosine of latitude
* reshape the data to `[time,space]`

#### Import statements

In [None]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt

#### Read in monthly SST

In [None]:
file = '/home/pdirmeye/classes/clim680_2022/OISSTv2/monthly/sst.mnmean.nc'
ds = xr.open_dataset(file)
ds

#### Make anomalies, reverse latitudes

In [None]:
ds_climo = ds.groupby('time.month').mean()
ds_anoms = ds.groupby('time.month')-ds_climo
#ds_anoms

ds_anoms = ds_anoms.reindex(lat=list(reversed(ds_anoms['lat'])))

#### Select Tropical Pacific Region

In [None]:
ds_tpac = ds_anoms.sel(lat=slice(-10,10),lon=slice(105,280))

# A quick plot to see what it looks like:
fig = plt.figure(figsize=(11,3.5))
clevs = [-3,-2,-1.5,-1,-0.5,0,0.5,1,1.5,2,3]
plt.contourf(ds_tpac['lon'],ds_tpac['lat'],ds_tpac['sst'][12,:,:],clevs,cmap='coolwarm',extend='both')
plt.colorbar(orientation='horizontal',aspect=40,pad=0.20)   # Note how we manipulate the shape and position of the colorbar
plt.title(f"Sample SST Anomaly - {str(ds_tpac['time'][12].values).split('T')[0]}")
plt.xlabel("Longitude (Degrees East)")
plt.ylabel("Latitude")
ds_tpac

### No missing or extraneous values

Our data has been "filled" over land with interpolated values.  So this is not a problem. 

What if I had land values that were all missing or other `nan` values?

* For values that are missing at every time step, such as land points, you can set them to zero for the EOF calculation. Because they will have no variance, they will not impact the calculation.

* For other `nan` values like occasional missing data, you will need to interpolate or otherwise find a way to fill the values with a reasonable value that does not impact the variance. 

* You may also need to check for unphysical outliers.
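
The first two cases above can be sketched on a small example. This is a minimal illustration on a hypothetical toy array (`X_demo`, already shaped `[time,space]`), not the SST data, and linear interpolation in time is only one possible way to fill occasional gaps.

```python
import numpy as np

# Hypothetical toy anomalies with shape [time, space]; not the SST data.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((6, 4))
X_demo[:, 1] = np.nan      # a "land" point: missing at every time step
X_demo[2, 3] = np.nan      # an occasional gap at an "ocean" point

# Points missing at all times are set to zero: no variance, no influence on the EOFs.
always_missing = np.isnan(X_demo).all(axis=0)
X_filled = X_demo.copy()
X_filled[:, always_missing] = 0.0

# Occasional gaps: linear interpolation in time, one grid point at a time.
t = np.arange(X_filled.shape[0])
for j in np.flatnonzero(np.isnan(X_filled).any(axis=0)):
    good = ~np.isnan(X_filled[:, j])
    X_filled[:, j] = np.interp(t, t[good], X_filled[good, j])

print(np.isnan(X_filled).sum())   # number of NaNs left after filling
```

With real data, it is worth checking how many values were filled before trusting the resulting EOFs.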

### Weighting of the data 

__Why do we do this?__

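On a regular latitude-longitude grid, grid cells cover less area toward the poles, so higher latitudes contribute more grid points per unit area than the tropics. Atmosphere and ocean data also tend to have higher variance in mid-latitudes than in the tropics. Since we are maximizing variance, unweighted data would let these densely sampled higher latitudes dominate the result.

It is common convention to weight our data by the square root of the cosine of the latitude before calculating the EOFs to mitigate this problem. The covariances are then weighted by the cosine of latitude, which is proportional to grid-cell area.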

In [None]:
coslat = np.cos(np.deg2rad(ds_tpac.coords['lat'].values))
wgts = np.sqrt(coslat)[..., np.newaxis]
ds_tpac = ds_tpac*wgts
wgts.shape

#### Reshape from `[time,lat,lon]` to `[time,space]`

We will calculate a 1-dimensional array of eigenvalues (each corresponding to a principal component or PC),
and a 2-dimensional array of eigenvectors (one dimension for the PCs, 
and one for the dimension of our data over which the PCs will vary).

The EOF algorithm below needs the first dimension of our matrices to be the one corresponding to the dimension over which 
the PCs will vary - in our case, that is _time_. So we need to reshape the data.

Furthermore, to calculate the covariance matrix in _time_ the data that are not _time_ 
can only be 1-dimensional, i.e., the matrix on which we calculate covariance with `np.cov` must be 2-dimensional.
So we will _ravel_ the latitude and longitude dimensions down to one dimension to do our calculation. 
We can _unravel_ the resulting EOF spatial patterns back to two dimensions afterward.
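
To make the ravel and unravel steps concrete, here is a small sketch on a hypothetical toy array (the names `field`, `nt_demo`, `ny_demo`, and `nx_demo` are made up for illustration). It assumes the data are ordered `[time,lat,lon]`, and shows which grid point each column of the flattened array corresponds to.

```python
import numpy as np

nt_demo, ny_demo, nx_demo = 5, 3, 4   # hypothetical toy sizes
field = np.arange(nt_demo * ny_demo * nx_demo, dtype=float).reshape(nt_demo, ny_demo, nx_demo)

# Ravel lat and lon into one "space" axis: [time, lat, lon] -> [time, space]
flat = field.reshape(nt_demo, ny_demo * nx_demo)

# Column k of the flattened array maps back to a (lat index, lon index) pair
k = 7
j, i = np.unravel_index(k, (ny_demo, nx_demo))

# Unravel back to [time, lat, lon]; nothing is lost in the round trip
restored = flat.reshape(nt_demo, ny_demo, nx_demo)
print(flat.shape, (j, i), np.array_equal(restored, field))
```

Because longitude is the last axis, it varies fastest along the raveled dimension, which is why the same `(ny,nx)` ordering has to be used when reshaping the EOFs back.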

In [None]:
nx = len(ds_tpac['lon'])
ny = len(ds_tpac['lat'])
nt = len(ds_tpac['time'])

X = np.reshape(ds_tpac['sst'].values,((nt,ny*nx)))

X.shape

## Step 2: Calculation of EOFs 

#### Calculate the Covariance Matrix
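Because `np.cov` treats each row of `X` (each time step) as a variable, the resulting covariance matrix will be square with a size determined by the time dimension.
It describes how the spatial pattern of SST in our region in each month covaries with every other month's pattern (covariance is the correlation times the product of the two standard deviations). 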

In [None]:
C=np.cov(X)

plt.pcolormesh(ds_tpac['time'].values,ds_tpac['time'].values,C,cmap='PRGn_r',vmin=-1.5,vmax=1.5) 
plt.title("Covariance Matrix for Tropical Pacific SSTs")
plt.colorbar(label="[$˚C^2$]")

C.shape

#### Calculate the eigenvalues and vectors of the Covariance Matrix

https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html

In [None]:
from numpy import linalg as LA

In [None]:
eigenvalues_v1,eigenvectors_v1=LA.eig(C)

This calculation can take a while for large data sets.  
The larger the covariance matrix, the longer it will take.  
Finding eigenvalues and eigenvectors is computationally expensive.  
There are more efficient ways to do this such as using the singular value 
decomposition (SVD) function, but this method is easiest to understand.
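
As a hedged illustration of the SVD route, the sketch below uses a hypothetical random array (`X_demo`) rather than the SST data. It assumes the same conventions as `np.cov`, which removes the mean of each row and divides by the number of columns minus one, and checks that the squared singular values, scaled the same way, reproduce the eigenvalues. `np.linalg.eigh` is used here instead of `eig` because a covariance matrix is symmetric; that substitution is a choice made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.standard_normal((8, 30))   # hypothetical [time, space] anomalies

# np.cov treats each row as a variable, removes its mean, and divides by N-1
Xc = X_demo - X_demo.mean(axis=1, keepdims=True)
n_space = Xc.shape[1]

# Route 1: eigen-decomposition of the covariance matrix
# (eigh returns eigenvalues in ascending order, so reverse them)
evals = np.linalg.eigh(np.cov(X_demo))[0][::-1]

# Route 2: SVD of the centered data; singular values are already sorted descending
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
evals_from_svd = s**2 / (n_space - 1)

print(np.allclose(evals, evals_from_svd))
```

In this sketch the columns of `U` play the role of the PCs, and the rows of `Vt` are proportional to the corresponding spatial patterns, each determined only up to sign.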

#### Sort eigenvalues and eigenvectors

They do not come out of the `eig` function sorted, and we want them in order from most variance to least variance.

In [None]:
eigenvalues_v1.shape,eigenvectors_v1.shape

In [None]:
idx = eigenvalues_v1.argsort()[::-1]  # Indices that order the eigenvalues from largest to smallest
eigenvalues_v1 = eigenvalues_v1[idx]
eigenvectors_v1 = eigenvectors_v1[:,idx]

Let's plot the first 10 eigenvalues (note that Python will notate them as array elements 0-9, 
but it is standard in statistics to label them 1-10)

In [None]:
x = np.arange(1,11)
plt.plot(x,eigenvalues_v1[0:10]) 
plt.xticks(x) 
plt.ylabel("Eigenvalue")
plt.xlabel("Principal Component") ;

#### Get the PC Temporal Patterns

They are just the eigenvectors. 

We saw above that the eigenvector array is square with a size in each dimension equal to the number of time steps.
One of these dimensions represents time (the months in our data set),
while the other represents the different PCs. 

There are as many PCs as there are time steps.

In [None]:
PC_v1=eigenvectors_v1
PC_v1.shape

#### Plot the first couple of PCs (the ones with the most variance)

In [None]:
for pc in range(2):
    plt.plot(ds_anoms['time'],PC_v1[:,pc],label=f"PC {pc+1}")
plt.legend()
plt.title("Principal Component Time Series for Tropical Pacific SSTs") ;

#### Get the EOF Spatial Patterns

The spatial pattern is the dot product of the transposed reshaped data (`[space,time]`) and the eigenvector matrix, i.e., the data projected onto each PC.

Remember to "unweight" the data.
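
The projection can be checked on a small example. The sketch below uses a hypothetical random array (`X_demo`) and, as a simplifying assumption, takes the eigenvectors of the unnormalized product `X_demo @ X_demo.T` with `np.linalg.eigh`. Because the eigenvectors are orthonormal, summing PC times EOF over all modes recovers the original data.

```python
import numpy as np

rng = np.random.default_rng(2)
X_demo = rng.standard_normal((6, 20))          # hypothetical [time, space] data

# Orthonormal temporal eigenvectors; columns are PCs, shape [time, mode]
_, pcs = np.linalg.eigh(X_demo @ X_demo.T)

# Project the data onto each PC to get the spatial patterns, shape [space, mode]
eofs = X_demo.T @ pcs

# Summing PC x EOF over all modes gives back the original data
X_rebuilt = pcs @ eofs.T
print(np.allclose(X_rebuilt, X_demo))
```

Keeping only the first few columns of `pcs` and `eofs` in the last product would give a low-rank approximation of the data instead.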

In [None]:
EOF_v1=np.dot(X.T,PC_v1)
EOF_v1=EOF_v1.reshape((ny,nx,nt)).T/wgts.squeeze()
EOF_v1.shape

#### Plot the first three EOFs (the ones with the most variance)

In [None]:
nrows,ncols = 3,1
clevs = np.arange(-20,21,5)

fig,ax = plt.subplots(3,1,layout="constrained",sharex='col')

for i in np.arange(3):
    panel = ax[i].contourf(ds_tpac['lon'],ds_tpac['lat'],EOF_v1[i,:,:].T,clevs,cmap='RdBu_r',extend='both')
    ax[i].set_title(f"EOF {i+1}")
    
fig.colorbar(panel, ax=ax, shrink=0.7) ;


#### Get the percentage of variance explained by each eigenvector

It is the ratio of variance explained by this eigenvector to the total variance

In [None]:
vexp = eigenvalues_v1/np.sum(eigenvalues_v1)
pct10 = vexp[0:10]*100 # Just the first 10, as percentages

# Plot the first 10
x = np.arange(1,11)
plt.bar(x,pct10)
plt.xticks(x) 
plt.grid(visible=True,which='major',axis='y')
plt.ylabel("Percent Variance Explained")
plt.xlabel("Principal Component") ;
print(pct10)

### Presenting EOFs

It is common practice to choose a sign convention for our EOFs and multiply the PC timeseries and EOF spatial pattern accordingly.

It is also common practice to divide the PC timeseries by its standard deviation and multiply the EOF spatial pattern by the same.  The spatial pattern now has the units of our data (˚C) and the PC time series is in the units of standard deviations.
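
A short numerical check of this rescaling, using hypothetical random vectors (`pc` and `eof` below are stand-ins, not results from the SST data): dividing the PC by its standard deviation and multiplying the EOF by the same number leaves their product, and therefore the mode's contribution to the data, unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
pc = rng.standard_normal(120)        # hypothetical PC time series
eof = rng.standard_normal(50)        # hypothetical EOF pattern (flattened)

sd = np.std(pc)
pc_norm = pc / sd                    # now has unit standard deviation
eof_norm = eof * sd                  # now carries the amplitude, in data units

# The contribution of this mode to the data (PC x EOF) is unchanged
same = np.allclose(np.outer(pc, eof), np.outer(pc_norm, eof_norm))
print(same, np.std(pc_norm))
```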

In [None]:
nrows,ncols = 3,1
clevs = np.arange(-1.6,1.7,0.2)

fig,ax = plt.subplots(3,1,layout="constrained",sharex='col')

for i in np.arange(3):
    eofnorm = EOF_v1[i,:,:].T*np.std(PC_v1[:,i])
    panel = ax[i].contourf(ds_tpac['lon'],ds_tpac['lat'],eofnorm,clevs,cmap='RdBu_r',extend='both')
    ax[i].set_title(f"EOF {i+1}")
    
fig.colorbar(panel, ax=ax, shrink=0.7, label="˚C") ;

In [None]:
nrows,ncols = 3,1
fig,ax = plt.subplots(nrows,ncols,layout="constrained",sharex='col')

for i in np.arange(nrows):
    pcnorm = PC_v1[:,i]/np.std(PC_v1[:,i])
    panel = ax[i].plot(ds_anoms['time'],pcnorm)
    ax[i].set_title(f"PC {i+1}")

## Trying the other way

We could have constructed our matrix along our _raveled_ spatial dimension instead and arrived at (nearly) the same results!

#### Calculate the Covariance Matrix

In [None]:
# Unnormalized space-by-space product matrix (np.cov(X) above was time-by-time)
C=np.matmul(X.T,X)
C.shape

#### Calculate the eigenvalues and vectors of the Covariance Matrix

https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html

In [None]:
eigenvalues_v2,eigenvectors_v2=LA.eig(C)

This calculation will take longer than the one above - now we have a much larger _spatial_ matrix.

#### Sort eigenvalues and eigenvectors

They do not come out of the `eig` function sorted, and we want them in order from most variance to least variance.

In [None]:
eigenvalues_v2.shape,eigenvectors_v2.shape

In [None]:
idx = eigenvalues_v2.argsort()[::-1]   
eigenvalues_v2 = eigenvalues_v2[idx]
eigenvectors_v2 = eigenvectors_v2[:,idx]

x = np.arange(1,11)
plt.plot(x,eigenvalues_v2[0:10]) 
plt.xticks(x) 
plt.ylabel("Eigenvalue")
plt.xlabel("Principal Component") ;

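Notice that this eigenvalue spectrum has nearly the same shape as the one from version 1 above, except that the values along the Y axis are much larger. `np.matmul(X.T,X)` is a raw sum of products, whereas `np.cov` divides by the number of samples minus one (and also removes the mean of each row of `X` before multiplying), so the two sets of eigenvalues differ mainly by a scale factor.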
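
Where the difference in magnitude comes from can be checked on a small example. The sketch below uses a hypothetical random `[time,space]` array (`X_demo`, not the SST data). It shows that the time-by-time and space-by-space product matrices share their nonzero eigenvalues, and that `np.cov` differs from a raw product by removing each row's mean and dividing by the number of columns minus one.

```python
import numpy as np

rng = np.random.default_rng(4)
X_demo = rng.standard_normal((5, 12))    # hypothetical [time, space] data

# eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse to descending
ev_time = np.linalg.eigvalsh(X_demo @ X_demo.T)[::-1]     # 5 eigenvalues
ev_space = np.linalg.eigvalsh(X_demo.T @ X_demo)[::-1]    # 12 eigenvalues

# The leading 5 agree; the remaining 7 spatial eigenvalues are zero to rounding error
print(np.allclose(ev_time, ev_space[:5]))

# np.cov additionally removes each row's mean and divides by N-1
Xc = X_demo - X_demo.mean(axis=1, keepdims=True)
cov_manual = Xc @ Xc.T / (X_demo.shape[1] - 1)
print(np.allclose(cov_manual, np.cov(X_demo)))
```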

#### Get the spatial and temporal patterns


In [None]:
EOF_v2 = eigenvectors_v2

PC_v2 = np.dot(X,EOF_v2) 
EOF_v2.shape,PC_v2.shape

#### Compare the first PC from each method

In [None]:
plt.plot(ds_anoms['time'],PC_v1[:,0]/np.std(PC_v1[:,0]),label="Version 1")
plt.plot(ds_anoms['time'],PC_v2[:,0]/np.std(PC_v2[:,0]),label="Version 2") 
plt.legend();

**Oops**, did we do something wrong?

A: No. The sign of the PC or the EOF is arbitrary - it is their product that matters!
It is acceptable to multiply a PC by -1, as long as we also multiply its corresponding EOF by -1 so they are consistent.
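
One way to make signs comparable between calculations is to apply a fixed rule to every PC/EOF pair. The helper below (`fix_sign`, a hypothetical name) uses one arbitrary rule chosen only for illustration: flip the pair so that the EOF's largest-magnitude loading is positive. Any rule works as long as the PC and its EOF are flipped together.

```python
import numpy as np

def fix_sign(pc, eof):
    """Flip a PC/EOF pair so the EOF's largest-magnitude loading is positive.

    This particular rule is a hypothetical choice for illustration.
    """
    sign = np.sign(eof[np.argmax(np.abs(eof))])
    sign = 1.0 if sign == 0 else sign
    return pc * sign, eof * sign

# The same mode obtained twice, with opposite arbitrary signs
pc_a = np.array([0.5, -1.0, 2.0])
eof_a = np.array([0.1, -0.9, 0.3])
pc_b, eof_b = -pc_a, -eof_a

pc_a2, eof_a2 = fix_sign(pc_a, eof_a)
pc_b2, eof_b2 = fix_sign(pc_b, eof_b)

# After applying the rule, both versions agree and the product is unchanged
print(np.allclose(pc_a2, pc_b2), np.allclose(eof_a2, eof_b2))
```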

In [None]:
#Let's zoom in on the first 2 years
plt.plot(ds_anoms['time'][:24],PC_v1[:24,0]/np.std(PC_v1[:24,0]),label="Version 1")
plt.plot(ds_anoms['time'][:24],-PC_v2[:24,0]/np.std(PC_v2[:24,0]),label="Version 2") 
plt.legend();

#### Get the variance explained by each eigenvector

It is the ratio of variance explained by this eigenvector to the total variance

In [None]:
vexp_v1=eigenvalues_v1/np.sum(eigenvalues_v1)
vexp_v2=eigenvalues_v2/np.sum(eigenvalues_v2)
pct_v1 = vexp_v1[0:10]*100 # Just the first 10, as percentages
pct_v2 = vexp_v2[0:10]*100 # Likewise

# Plot the first 10
x = np.arange(1,11)
plt.bar(x,pct_v1,color="#00000000",edgecolor="tab:blue",linewidth=2,label="Approach 1")
plt.bar(x,pct_v2,color="#00000000",edgecolor="tab:orange",width=0.55,linewidth=2,label="Approach 2")
plt.xticks(x) 
plt.legend()
plt.grid(visible=True,which='major',axis='y')
plt.ylabel("Percent Variance Explained")
plt.xlabel("Principal Component") ;


### Presenting EOFs


In [None]:
EOF_v2=EOF_v2.reshape((ny,nx,ny*nx)).T/wgts.squeeze()

nrows,ncols = 3,1
clevs = np.arange(-1.6,1.7,0.2)

fig,ax = plt.subplots(3,1,layout="constrained",sharex='col')

for i in np.arange(3):
    eofnorm = EOF_v2[i,:,:].T*np.std(PC_v2[:,i])
    panel = ax[i].contourf(ds_tpac['lon'],ds_tpac['lat'],eofnorm,clevs,cmap='RdBu_r',extend='both')
    ax[i].set_title(f"EOF {i+1}")
    
fig.colorbar(panel, ax=ax, shrink=0.7, label="˚C") ;

In [None]:
nrows,ncols = 8,1
fig,ax = plt.subplots(nrows,ncols,figsize=(5,8),
                      layout="constrained",sharex='col')

for i in np.arange(int(nrows/2)):
    pcnorm = PC_v2[:,i]/np.std(PC_v2[:,i])
    panel = ax[i].plot(ds_anoms['time'],pcnorm,color='tab:orange')
    ax[i].set_title(f"PC {i+1}, Approach 2")
    
for i in np.arange(int(nrows/2)):
    pcnorm = PC_v1[:,i]/np.std(PC_v1[:,i])
    panel = ax[i+int(nrows/2)].plot(ds_anoms['time'],pcnorm,color='tab:blue')
    ax[i+int(nrows/2)].set_title(f"PC {i+1}, Approach 1")

Notice above that PC 1 matches, but the sign is arbitrarily reversed (this will happen half the time), and PC 2 matches.
However, PC 3 from the first approach matches PC 4 from the second approach (and vice versa, but again with the sign reversed).
What happened?

Recall we have sorted our PCs based on their explained variance. It happens that the two different methods gave opposite ranks to the 3rd and 4th PCs. This is quite common when neighboring modes explain nearly the same amount of variance, as is typical of the higher-numbered EOFs with their ever-decreasing explained variance.
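
Swapped ranks and flipped signs can be detected automatically by correlating the two sets of PCs. The sketch below is a minimal illustration on hypothetical random time series (`pcs_a` and `pcs_b` are stand-ins for the PCs from the two approaches), where the swap and sign flip are built in by hand so the answer is known.

```python
import numpy as np

rng = np.random.default_rng(5)
pcs_a = rng.standard_normal((200, 4))          # hypothetical PCs, approach A
# Approach B: same modes, but modes 3 and 4 swapped and mode 1 sign-flipped
pcs_b = pcs_a[:, [0, 1, 3, 2]] * np.array([-1.0, 1.0, 1.0, 1.0])

# Cross-correlation block between the two sets: rows = A modes, columns = B modes
n = pcs_a.shape[1]
corr = np.corrcoef(pcs_a.T, pcs_b.T)[:n, n:]

best_match = np.argmax(np.abs(corr), axis=1)   # B index that matches each A mode
signs = np.sign(corr[np.arange(n), best_match])
print(best_match, signs)
```

A correlation near -1 flags a sign reversal, and an off-diagonal maximum flags a swapped rank.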