### Baseline Batch Inversion Example

- Run the script as is, making sure you can step through an example and create plots.

- See if you can set the errors and/or choose the number of constraining observations to make the estimate:

    - As good as possible (tight boxplot/confidence bounds)
    - As poor as possible (loose boxplot/confidence bounds)

- Describe the above estimates:

    - How do the monthly and annual flux estimates differ between oceans and land?

    - Which land regions appear more difficult to constrain with limited global observations?

- Using knowledge from above:
    - About how many of these 1,100,000 observations do you think you need to reasonably constrain most of the land flux regions?

    - How much data do you think you’d need to simply constrain the global annual CO2 flux (all regions summed together)?


#### Setting up Environment for Computing
This cell checks whether we are running on GHGHub or locally and sets up the environment, including directory references and libraries.

In [None]:
####################################################################
#  THIS CELL IS ALL SETUP FOR EACH OF THE NOTEBOOKS
####################################################################

#-- Look for locally installed packages on NASA JupyterHub Resources
.libPaths(new=c("/home/rstudio/shared/lib/R-4.3/x86_64-pc-linux-gnu",.libPaths())) 
.libPaths()

if(Sys.getenv("AWS_WEB_IDENTITY_TOKEN_FILE") == ""){
 code_dir <- "/projects/ssim-ghg-2024/"
 data_dir <-  "/Users/aschuh/SSIM-GHG/data/"
 output_dir <- "~/temp/output/"
 }else{
 code_dir <-  "~/ssim-ghg-2024/"
 data_dir <-  "~/shared/ssim-ghg-data/inversion_examples/"
 output_dir <- "../../output/"
 }

Rcode_dir <- file.path(code_dir,"batch/")

setwd(Rcode_dir)

#######################################################
#-- ***Parent Directory and code for ALL inversions***
#######################################################
###############################################
#-- Load Code
##############################################
source(file.path(Rcode_dir,"util_code_032024.R"))
source(file.path(Rcode_dir,"plot_concentrations.R"))
source(file.path(Rcode_dir,"inversion_032024.R"))
source(file.path(Rcode_dir,"write_inversion_2_netcdf_032024.R"))
source(file.path(Rcode_dir,"generate_transcom_flux_ensemble_from_inversion.R"))
       
###############################################
#-- Required Libraries
###############################################
require(ncdf4,warn.conflicts = FALSE)
require(plyr,warn.conflicts = FALSE)
require(dplyr,warn.conflicts = FALSE)
require(parallel,warn.conflicts = FALSE)
require(ggplot2,warn.conflicts = FALSE)
require(abind,warn.conflicts = FALSE)
require(Matrix,warn.conflicts = FALSE)
require(lattice,warn.conflicts = FALSE)
require(memuse,warn.conflicts = FALSE)
require(EnvStats,warn.conflicts = FALSE)
require(gridExtra,warn.conflicts = FALSE)
require(mvtnorm,warn.conflicts = FALSE)
require(plotly,warn.conflicts = FALSE)

########################
#--  Detect Cores
########################
print(paste("Num CPUs:",detectCores(),"cores"))
memuse::Sys.meminfo()

#### Solving the equation
Recall that we are simply minimizing the cost function J(x) below, so we need inputs for each variable. The remaining equations give the resulting estimate and its posterior covariance.

$$
\newcommand{\transpose}[1]{{#1^{\scriptscriptstyle T}}} 
J(x) = \transpose{(x_0 - x)} {S_x}^{-1}(x_0 - x) + \transpose{(z - Hx)} {S_z}^{-1}(z - Hx)
$$

$$
\newcommand{\transpose}[1]{{#1^{\scriptscriptstyle T}}} 
\hat{x} = (\transpose{H}{S_z}^{-1}H + {S_x}^{-1})^{-1}(\transpose{H}{S_z}^{-1}z+{S_x}^{-1}x_0)
$$

$$
\newcommand{\transpose}[1]{{#1^{\scriptscriptstyle T}}} 
S_{\hat{x}} = {({S_x}^{-1} + \transpose{H}{S_z}^{-1}H )}^{-1}
$$
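
As a minimal sketch of these formulas on a toy problem (3 unknowns, 5 observations), the snippet below computes the posterior mean and covariance directly and checks that the gradient of J(x) vanishes at the estimate. All `_toy` names and numbers are illustrative assumptions and are unrelated to the workshop data or to the `invert_clean_notation` function used later.

```r
# Toy batch inversion; every value here is made up for illustration
set.seed(1)
H_toy      <- matrix(rnorm(15), nrow = 5, ncol = 3)  # toy sensitivity matrix
x0_toy     <- rep(0, 3)                              # prior mean
Sx_toy     <- diag(0.5^2, 3)                         # prior covariance
Sz_toy     <- diag(0.1^2, 5)                         # observation error covariance
x_true_toy <- c(0.3, -0.2, 0.1)
z_toy      <- as.vector(H_toy %*% x_true_toy) + rnorm(5, sd = 0.1)

Sx_inv_toy <- solve(Sx_toy)
Sz_inv_toy <- solve(Sz_toy)

# Posterior covariance: (Sx^-1 + H^T Sz^-1 H)^-1
S_hat_toy <- solve(Sx_inv_toy + t(H_toy) %*% Sz_inv_toy %*% H_toy)

# Posterior mean: S_hat (H^T Sz^-1 z + Sx^-1 x0)
x_hat_toy <- as.vector(
  S_hat_toy %*% (t(H_toy) %*% Sz_inv_toy %*% z_toy + Sx_inv_toy %*% x0_toy)
)

# Gradient of J at x_hat; it should be zero up to rounding error
grad_J_toy <- -2 * Sx_inv_toy %*% (x0_toy - x_hat_toy) -
  2 * t(H_toy) %*% Sz_inv_toy %*% (z_toy - H_toy %*% x_hat_toy)

print(round(x_hat_toy, 3))
print(max(abs(grad_J_toy)))
```

For a problem this small, explicit calls to `solve()` are fine; at the scale of the real sensitivity matrix a different strategy would likely be needed.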




#### Baseline Sensitivity Matrices (H and H^t)

These precalculated sensitivity matrices (jacob object) detail the sensitivity of 1,156,383 different observations to the basis functions, which consist of 22 regions (11 land and 11 ocean) for each of 24 months, giving 22 x 24 = 528 state elements. The jacob_bgd object holds the sensitivity of the observations to emission sources that will not be optimized here, particularly fire emissions (e.g. forest/grassland fires) and fossil fuel emissions. At the end we assign these objects to 'H' to match the notation used in the rest of the exercises/slides.
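
To make the 22 regions x 24 months layout concrete, here is a small sketch of an index table for the 528 state elements. It assumes the ordering is region by region, with 24 consecutive months per region and the 11 land regions before the 11 ocean regions; that ordering is inferred from how the prior covariance is assembled in this notebook, and `state_index_toy` is a hypothetical helper, not an object from the workshop code.

```r
# Hypothetical lookup table for the state vector layout (assumed ordering)
n_regions_toy <- 22
n_months_toy  <- 24

state_index_toy <- data.frame(
  index  = seq_len(n_regions_toy * n_months_toy),
  region = rep(seq_len(n_regions_toy), each = n_months_toy),  # 1..22
  month  = rep(seq_len(n_months_toy), times = n_regions_toy)  # 1..24
)
state_index_toy$type <- ifelse(state_index_toy$region <= 11, "land", "ocean")

nrow(state_index_toy)                   # 528
state_index_toy[c(1, 24, 25, 265), ]    # first/last month of region 1, start of region 2, first ocean element
```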

In [None]:
###############################################
#--  Load sensitivity matrices 
###############################################

load(file.path(data_dir,"jacobians/","trunc_full_jacob_030624_with_dimnames_sib4_4x5_mask.rda"))
load(file.path(data_dir,"jacobians/","jacob_bgd_060524.rda"))

#-- Note: the 12/44 factor (C vs. CO2 mass) is here because the GEOS-Chem forward runs differed in whether mass was expressed as CO2 or C
#-- Assign the jacob objects to H to match notation
H <- jacob * 12/44
H_bgd <- jacob_bgd 
rm(jacob);rm(jacob_bgd)

#-- These represent the fossil and biomass burning contributions to the observations (from fixed emission runs)
fire_fixed <- H_bgd[,2]
fossil_fixed <- H_bgd[,3]
###################################################################
#-- END END END ***Parent Directory and code for ALL inversions***
###################################################################

#### Set the "truth"

This block of code sets up the (simulated) "truth" for the 528-element state vector described above. We've provided real-life examples of what these can look like in the truth_array. You can also simply set state_vector_true to any vector of length 528. Recall that this state 'x' represents the adjustment to a baseline prior guess of fluxes, such that the simulated true flux = 'prior best guess flux' * (1 + x). This state will then be used to simulate our observations 'z'.
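
A short sketch of what the scaling means, using made-up prior fluxes: a state of 0 leaves the prior unchanged, +0.5 inflates it by 50%, and -0.5 halves it. The `clip_toy` function is a hypothetical stand-in for limiting a state to the range -1 to 1; the notebook's own `tm` helper may behave differently.

```r
# Illustrative values only; units and magnitudes are arbitrary
prior_flux_toy <- c(-2.0, 1.5, 0.4)
x_toy          <- c(0.5, -0.5, 0)

true_flux_toy <- prior_flux_toy * (1 + x_toy)
true_flux_toy   # -3.00  0.75  0.40

# Hypothetical clipping helper: values outside [lo, hi] are set to the bound
clip_toy <- function(v, lo, hi) pmin(pmax(v, lo), hi)
clip_toy(c(-1.7, 0.2, 3), -1, 1)   # -1.0  0.2  1.0
```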

In [None]:

##################################################################
#- Inversion #1   *************************
##################################################################

#################################
#- Target truth in state space
#################################

##################################################################
#-- This array holds ratios of OCO2v10MIP fluxes and SiB4 fluxes
#-- as examples of "scalings" to be recovered. It also holds corresponding
#-- differences if the inversion attempts to directly solve for flux
#-- truth_array(24 months, 23 transcom, 98 inversions, (ratio, difference) )
#-- To try another "truth" from these, just increment the third element below:
#-- e.g. set * in xx = truth_array[,-1,*,1] to be between 1 and 98
##################################################################

#-- Don't Change
#load("/projects/sandbox/inversion_workshop_scripts/truth_array.rda")
load(file.path(data_dir,"misc/truth_array.rda"))
#-- pulling out NA transcom region and subset to scalar vs flux adj
truth_array = truth_array[,-1,,1]
#-- Don't Change


#--  Option #1: choose our state from the inversion list and "truncate" it to lie between -1 and 1
inversion_number = 10   #  choose this between 1 and 98
state_vector_true= tm(as.vector(- truth_array[,,inversion_number]),-1,1)

#-- Alternatively choose a "different" true state like the below ones
#-- The first just means the truth IS the prior, the second has a simple structure
#-- Land regions fluxes are (1+0.5) * prior guess and ocean fluxes are (1- 0.5) * prior guess.
#state_vector_true = c(rep(0,24*11),rep(0,24*11))
#state_vector_true = c(rep(0.5,24*11),rep(-0.5,24*11))


#### Define the a priori flux covariance matrix
Here we define what we are calling S_x, the a priori flux covariance matrix. In essence, this defines the bounds within which we expect to find our "simulated" truth, relative to the baseline best guess for prior flux.
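
The construction used below is a diagonal matrix of standard deviations applied on both sides of a temporal correlation matrix. The following sketch shows the idea at a 4 x 4 scale. It assumes an AR(1)-style correlation of the form rho^|i-j|; whether the notebook's `ar_covariance` function uses exactly this form is an assumption, and `ar1_correlation_toy` is a hypothetical name.

```r
# Assumed AR(1) correlation structure: entry (i, j) is rho^|i - j|
ar1_correlation_toy <- function(n, rho) {
  rho^abs(outer(seq_len(n), seq_len(n), "-"))
}

R_toy  <- ar1_correlation_toy(4, 0.5)   # correlation matrix, ones on the diagonal
sd_toy <- c(0.5, 0.5, 1, 1)             # per-element prior standard deviations
D_toy  <- diag(sd_toy)

# Covariance = D R D, so the diagonal holds variances (sd^2)
Sx_small_toy <- D_toy %*% R_toy %*% D_toy

sqrt(diag(Sx_small_toy))   # 0.5 0.5 1.0 1.0
cov2cor(Sx_small_toy)      # recovers R_toy
```

Because the scaling matrix holds standard deviations, a prior standard deviation of 0.5 corresponds to a variance of 0.25 on the diagonal of the covariance.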

In [None]:
#########################################################
# Generate a prior flux covariance matrix Sx
# These first two lines form "diagonal" of Sx, e.g. marginal variances
# Long term, a catalog of predefined choices is best here I think
#########################################################
land_prior_sd = 0.5   #-- free to set this, implies you think "truth" for land is within +/- 3*this
ocean_prior_sd = 1    #-- free to set this, implies you think "truth" for ocean is within +/- 3*this

##############################################################################
#-- This is the structure of the 24 month subblock for each land/ocean region
#-- induce temporal correlations
##############################################################################

#-- This will set up a prior temporal correlation, 
#-- free to set month_to_month_correlation between 0 (independent) and 1
month_to_month_correlation = 0.5
sigma = bdiag(rep(list(ar_covariance(24, month_to_month_correlation)), 22))  #-- free to set 


#################################################
#-- scale by standard deviation for land/ocean (sets the variances on the diagonal of Sx)
#-- This simply puts together pieces above
#################################################
var_scaling_diagonal = diag(c(rep(land_prior_sd,24*11),rep(ocean_prior_sd,24*11)))

Sx = as.matrix(var_scaling_diagonal %*% sigma %*% t(var_scaling_diagonal))

#-- This is an alternative state_vector_true drawn *exactly* from the prior covariance matrix Sx,
#-- as opposed to picking your "truth" separately from the assumed distribution where "truth" lives
#-- Probably don't want to change this unless you know what you are doing
#state_vector_true = t(rmvnorm(n=1,mean=rep(0,528),sigma=Sx))


#### Choose which observations you want to assimilate
In other words, choose which observations will be used to optimize/estimate the unknown fluxes. This problem is somewhat overdetermined, with over a million observations available to constrain a 528-element state. With that in mind, small observation errors combined with LOTS of observations should "nail" the unknown solution quite well. The goal here is to create a TRUE/FALSE vector whose length equals the total number of observations described in the sensitivity matrix we loaded above (1156383). The obs_catalog is a data.frame (think matrix of 'items') with information about each observation, and it can be used to build a subset.
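
As a sketch of how such a TRUE/FALSE vector can be built, the toy data frame below mimics a few columns of the catalog (TYPE, LAT, LON appear in the selection examples in this notebook). The rows themselves, and the "IS" type label, are invented for illustration.

```r
# Toy stand-in for obs_catalog; the rows are invented
obs_catalog_toy <- data.frame(
  TYPE = c("OCO2", "TCCON", "IS", "OCO2", "IS"),
  LAT  = c(45, -10, 19.5, 60, 71.3),
  LON  = c(-90, 130, -155.6, 20, -156.6),
  stringsAsFactors = FALSE
)

# Combine conditions with | (or) and & (and); the result has one entry per observation
keep_toy <- obs_catalog_toy$TYPE == "OCO2" | obs_catalog_toy$LAT > 50

keep_toy          # TRUE FALSE FALSE  TRUE  TRUE
sum(keep_toy)     # 3 observations selected
which(keep_toy)   # row indices 1 4 5
```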

In [None]:
####################################################################################
#-- WHICH obs do you want to use in the inversion? 
#-- examples of selecting on stations, type of data, lat/lon box,etc
####################################################################################

#load(file.path(data_dir,"obs/obs_catalog_030624.rda")) # obs_catalog object
load(file.path(data_dir,"obs/obs_catalog_042424_unit_pulse_hour_timestamp_witherrors_withdates.rda")) 

############################
#-- USE ALL OBS
############################
subset_indicator_obs=rep(TRUE,dim(H)[1])

############################
#-- SAMPLE BY TYPE EXAMPLE
############################
#subset_indicator_obs = obs_catalog$TYPE == "TCCON"
#subset_indicator_obs = obs_catalog$TYPE == "OCO2"


############################
#-- SAMPLE BY NOAA STATION EXAMPLE
############################
# subset_indicator_obs = (
#   grepl("mlo", obs_catalog$ID)
#   | grepl("lef", obs_catalog$ID)
# )

############################
#-- SAMPLE BY TIME EXAMPLE
############################
# subset_indicator_obs=(
#   obs_catalog$YEAR == 2016
#   & obs_catalog$MONTH == 8
# )

############################
#-- SAMPLE BY LON & LAT EXAMPLE
############################
# subset_indicator_obs=(
#   obs_catalog$LON < -10
#   & obs_catalog$LAT > 10
# )

############################
#-- USE SIMPLE SUBSET
############################
#subset_size = 10000
#subset_indicator_obs=rep(FALSE,dim(H)[1])
#subset_indicator_obs[seq(1,1156383,length=subset_size)] = TRUE





#################################################################
#-- Downsample if necessary to 578191 obs, likely RAM constraint
################################################################

if(sum(subset_indicator_obs) > 0.5*length(subset_indicator_obs)) {
  new_ind = rep(FALSE,length(subset_indicator_obs))
  new_ind[sample(x=which(subset_indicator_obs),size=floor(0.5*length(subset_indicator_obs)))] = TRUE
  print(paste("downsampling from",sum(subset_indicator_obs),"to",
              floor(0.5*length(subset_indicator_obs)),"observations"))
  subset_indicator_obs = new_ind
}

#-- LEAVE THIS AS IT SUMMARIZES THE NUMBER OF OBS USED
print(paste("using",sum(subset_indicator_obs),"of",length(subset_indicator_obs),"observations"))

#### Set the observation errors
Recall that this component, the matrix Sz, consists of the sum of (assumed) independent errors describing instrument noise and various transport errors due to representation and aggregation. You can simply set this error to be the same across all observations or use realistic errors as given in the obs_catalog object (from the OCO2MIP project). Note that we don't allow non-zero off-diagonal entries here, so we carry this matrix forward as a vector.
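
One practical consequence of a diagonal Sz is that the term H^T Sz^{-1} H can be formed without ever building the full matrix. The sketch below compares the two routes on toy values, assuming the error vector holds standard deviations (so the variances are its squares); all `_toy` names are illustrative.

```r
set.seed(2)
H_small_toy <- matrix(rnorm(12), nrow = 4, ncol = 3)  # 4 obs, 3 unknowns
obs_sd_toy  <- c(0.5, 1, 2, 1)                        # per-observation error sd
obs_var_toy <- obs_sd_toy^2

# Route 1: build the full diagonal matrix explicitly
full_toy <- t(H_small_toy) %*% diag(1 / obs_var_toy) %*% H_small_toy

# Route 2: divide row i of H by variance i (vector recycling runs down the rows)
fast_toy <- crossprod(H_small_toy, H_small_toy / obs_var_toy)

all.equal(full_toy, fast_toy)   # TRUE
```

With over a million observations, the explicit diagonal matrix in route 1 would be far too large to store densely, which is one reason to keep the errors as a vector.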

In [None]:
##########################################################
#-- sd for Gaussian i.i.d. errors, H is the sensitivity matrix
##########################################################

#-- Simple errors 
Sz_diagonal_in = rep(1,(dim(H)[1]))  # dim(H)[1] is length of obs possible

#-- More realistic errors
#Sz_diagonal_in = obs_catalog$SD

#### Simulate the true observations from the sensitivity matrix and the assumed observation errors
Here we multiply the sensitivity matrix by the prior guess (the 1 in the calculation below) plus our "true" state, and then add random noise drawn according to our expected errors (Sz).
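
A toy version of this step is sketched below, with a fixed seed so that the noise can be reproduced. The matrix, state, and error values are invented; only the form of the calculation mirrors the notebook.

```r
set.seed(42)
H_sim_toy      <- matrix(c(1, 0.5, 2, 1, 0, 3), nrow = 3, ncol = 2)  # rows: (1,1), (0.5,0), (2,3)
x_true_sim_toy <- c(0.5, -0.5)
obs_sd_sim_toy <- rep(0.1, 3)

# Noise-free observations: H %*% (1 + x)
z_clean_toy <- as.vector(H_sim_toy %*% (1 + x_true_sim_toy))
z_clean_toy   # 2.00 0.75 4.50

# Add one Gaussian noise draw per observation
z_sim_toy <- z_clean_toy + rnorm(length(obs_sd_sim_toy), sd = obs_sd_sim_toy)
```

Resetting the seed to the same value before the `rnorm` call reproduces exactly the same simulated observations, which helps when comparing inversion settings against each other.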

In [None]:
#############################################################
#-- Generate obs 'z' (call set.seed() first if reproducible noise is wanted)
#-- currently leaving out bgd and all fixed
#-- non-optimizable contributions including fire and fossil
#############################################################

z_in = H %*% (1+state_vector_true) + rnorm(length(Sz_diagonal_in),sd=Sz_diagonal_in)


### The "calculations"
Now we have every component defined and we simply do the calculations....

$$
\newcommand{\transpose}[1]{{#1^{\scriptscriptstyle T}}} 
J(x) = \transpose{(x_0 - x)} {S_x}^{-1}(x_0 - x) + \transpose{(z - Hx)} {S_z}^{-1}(z - Hx)
$$

$$
\newcommand{\transpose}[1]{{#1^{\scriptscriptstyle T}}} 
\hat{x} = (\transpose{H}{S_z}^{-1}H + {S_x}^{-1})^{-1}(\transpose{H}{S_z}^{-1}z+{S_x}^{-1}x_0)
$$

$$
\newcommand{\transpose}[1]{{#1^{\scriptscriptstyle T}}} 
S_{\hat{x}} = {({S_x}^{-1} + \transpose{H}{S_z}^{-1}H )}^{-1}
$$

Actual baseline "inversion" code is now below...





In [None]:
############################
#-- Run the actual inversion
############################
#-- Be aware DOF calc (DOF arg) and Kalman Gain  calc (output_Kalman_Gain) are a bit costly computationally
#-- Try to leave DOF T or F, but output_Kalman_Gain=FALSE except in the kalman gain notebook example

ret2  = invert_clean_notation(H=H,Sz_diagonal=Sz_diagonal_in,Sx=Sx,z=z_in,H_bgd=H_bgd,
                    subset_indicator_obs=subset_indicator_obs,DOF=FALSE,output_Kalman_Gain=FALSE,
                     state_vector_true=state_vector_true)

#### "Sanity check"
The first sanity check here is to simply compare the predicted state with actual "true" state we defined above. If all is perfect, the points will line up on the 1:1 line.
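
Alongside the scatter plot, a couple of summary numbers can quantify how close the points are to the 1:1 line. This is a sketch on invented vectors, not on the inversion output.

```r
# Invented "true" and "estimated" states for illustration
x_true_chk_toy <- c(0.5, -0.2, 0.1, -0.4)
x_hat_chk_toy  <- c(0.45, -0.25, 0.2, -0.3)

# Root-mean-square error: 0 means every point sits on the 1:1 line
rmse_toy <- sqrt(mean((x_hat_chk_toy - x_true_chk_toy)^2))

# Correlation measures linear agreement but not bias, so report both
cor_toy <- cor(x_true_chk_toy, x_hat_chk_toy)

print(c(rmse = rmse_toy, correlation = cor_toy))
```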

In [None]:
#hist(ret2$posterior$x_hat[,1])
options(repr.plot.width=8, repr.plot.height=8)
plot(state_vector_true,ret2$posterior$x_hat,pch=16,cex=1.5,col=c(rep("orange",264),rep("blue",264)),
     xlab="True State Scaling",ylab="Estimated State Scaling",main="Estimated state vector vs true state vector (all time and regions)")
lines(c(-100,100),c(-100,100),lty=1,lwd=3,col="grey")
legend(min(state_vector_true),max(ret2$posterior$x_hat),c("Land","Ocean"),pch=c(16,16),col=c("orange","blue"))

#### Produce a Monte Carlo estimate from analytical inversion output in order to quickly plot results
We could produce the following plots directly from the analytical inversion solution, but instead we draw a large ensemble of samples (1000 here) from the inversion output to facilitate quick and efficient plotting, in particular of boxplots.
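
The idea can be sketched in base R: draw samples from a Gaussian with a given mean and covariance via a Cholesky factor, then summarize any derived quantity (here a simple sum) from the samples. The mean and covariance below are invented, and this is not the implementation of `generate_transcom_flux_ensemble_from_inversion`.

```r
set.seed(3)
x_hat_mc_toy <- c(0.2, -0.1, 0.4)
S_hat_mc_toy <- matrix(c(0.04,  0.01,  0.00,
                         0.01,  0.09, -0.02,
                         0.00, -0.02,  0.16), nrow = 3)
n_samples_toy <- 1000

# chol() returns upper-triangular U with t(U) %*% U = S, so rows of Z %*% U have covariance S
U_toy <- chol(S_hat_mc_toy)
Z_toy <- matrix(rnorm(n_samples_toy * 3), nrow = n_samples_toy)
samples_toy <- sweep(Z_toy %*% U_toy, 2, x_hat_mc_toy, "+")

# Any aggregate can now be summarized from the ensemble, e.g. the sum of all elements
total_toy <- rowSums(samples_toy)
quantile(total_toy, c(0.025, 0.5, 0.975))

# Analytical values for comparison: mean 0.5, sd sqrt(0.27)
c(mean = sum(x_hat_mc_toy), sd = sqrt(sum(S_hat_mc_toy)))
```

The sample-based summaries only approximate the analytical ones, with an error that shrinks as the number of samples grows.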

In [None]:
org_data = generate_transcom_flux_ensemble_from_inversion(inv_object=ret2,samples=1000)

##### Here we will plot annual flux average for 9/2014 - 8/2016 for each Transcom Region.
***Very important:*** all these boxplots/confidence-bound plots show the deviation from the prior flux, i.e. H\*state_vector_true and not H\*(1+state_vector_true). They also represent the inversion estimate of the deviation from the prior with no fires or fossil fuel emissions added back in.

In [None]:
plot_timeseries_flux_bytranscom(ret=org_data)

In [None]:
plot_transcom_flux_by_month(ret=org_data)

#### Plotting prior/posterior correlations across fluxes
Here we plot the prior/posterior correlations across the 2-year flux average (and then month by month for different regions in the next code block). Note that the correlations are estimated from the samples in "org_data", so sampling noise makes the prior show "some" correlation even where none exists.
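
A quick sketch of that sampling effect: two independent Gaussian series of 1000 draws have a true correlation of exactly zero, yet their sample correlation is a small non-zero number, with a typical size of roughly 1/sqrt(n).

```r
set.seed(4)
n_cor_toy <- 1000
a_toy <- rnorm(n_cor_toy)
b_toy <- rnorm(n_cor_toy)   # generated independently of a_toy

r_toy <- cor(a_toy, b_toy)
print(r_toy)                # small but not exactly zero

1 / sqrt(n_cor_toy)         # rough scale of the sampling noise, about 0.032
```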


In [None]:
plot_inversion_correlations(org_data = org_data)


In [None]:
plot_inversion_correlations_by_transcom(org_data=org_data)

#### Plot concentration time series at different sites
When add_prior_nee=FALSE, add_fossil=FALSE, and add_fire=FALSE, only the inversion-produced adjustments to the site-level concentrations are plotted (note that prior=0 then). Setting add_prior_nee=TRUE adds the underlying best-guess initial prior, which is being scaled by the inversion, to the prior/posterior/obs; you can see the seasonal cycle in the prior *appearing*. Setting add_fossil=TRUE and add_fire=TRUE adds the fixed contributions from fossil fuels and biomass burning to all the concentrations, resulting in more *realistic* concentration time series.

In [None]:
#key default arg here which you can change: site_strings=c("brw","mlo","smo","co2_spo_surface-flask","lef","wkt","wbi","nwr","hun")
plot_concentrations(inversion=ret2,add_prior_nee=TRUE,add_fossil=FALSE,add_fire=FALSE,
           site_strings=c("brw_surface-flask_1_representative","mlo_surface-flask_1_representative",
                          "smo_surface-flask_1_representative","spo_surface-flask_1_representative",
                          "lef","wkt","wbi","nwr_surface-flask_1_representative","hun")        )

In [None]:
sessionInfo()