### The risks of mis-specification of MDM, or observation error
This example illustrates the effects of specifying your MDM (model-data mismatch, Sz) as too small (too precise) or too large (too broad).

- Step through the analysis, the 3 inversions presented, and the resulting boxplots with estimates of surface flux:

    - Describe the risks of being overconfident in your observation errors.
    - Do there appear to be any risks of being underconfident in your observation errors?
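
As a warm-up for the questions above, here is a minimal sketch of how the assumed MDM controls the reported posterior uncertainty. It assumes the standard linear Gaussian form of the inversion, in which the posterior covariance is `(t(H) Sz^-1 H + Sx^-1)^-1`. The names `H_toy`, `Sx_toy` and `posterior_sd` are made up for this illustration and are not part of the workshop code.

```r
# Toy linear Gaussian inversion; sizes and values are illustrative only
set.seed(42)
n_obs  <- 50
H_toy  <- matrix(runif(n_obs * 2), nrow = n_obs)  # hypothetical sensitivity matrix (50 obs, 2 unknowns)
Sx_toy <- diag(c(0.5, 1)^2)                       # prior covariance with sd 0.5 and 1

# Posterior sd of each unknown for a diagonal Sz built from one assumed sd
posterior_sd <- function(assumed_sd) {
  Sz_inv <- diag(1 / assumed_sd^2, n_obs)         # inverse of the assumed observation covariance
  S_post <- solve(t(H_toy) %*% Sz_inv %*% H_toy + solve(Sx_toy))
  sqrt(diag(S_post))
}

sd_mdm1  <- posterior_sd(1)    # assumed MDM of 1 ppm
sd_mdm10 <- posterior_sd(10)   # assumed MDM of 10 ppm
print(rbind(sd_mdm1, sd_mdm10))
```

Note that the observations themselves never enter `posterior_sd`. In this form of the problem, the reported uncertainty reflects only what you assumed about the errors, not the noise that was actually present.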

#### Setting up Environment for Computing
This cell simply checks whether we are running on GHGHub or locally and sets up the environment, including directory references and libraries.

In [None]:
####################################################################
#  THIS CELL IS ALL SETUP FOR EACH OF THE NOTEBOOKS
####################################################################

#-- Look for locally installed packages on NASA JupyterHub Resources
.libPaths(new=c("/home/rstudio/shared/lib/R-4.3/x86_64-pc-linux-gnu",.libPaths())) 
.libPaths()

if(Sys.getenv("AWS_WEB_IDENTITY_TOKEN_FILE") == ""){
 code_dir <- "/projects/ssim-ghg-2024/"
 data_dir <-  "/Users/aschuh/SSIM-GHG/data/"
 output_dir <- "~/temp/output/"
 }else{
 code_dir <-  "~/ssim-ghg-2024/"
 data_dir <-  "~/shared/ssim-ghg-data/inversion_examples/"
 output_dir <- "../../output/"
 }

Rcode_dir <- file.path(code_dir,"batch/")

setwd(Rcode_dir)

#######################################################
#-- ***Parent Directory and code for ALL inversions***
#######################################################
###############################################
#-- Load Code
##############################################
source(file.path(Rcode_dir,"util_code_032024.R"))
source(file.path(Rcode_dir,"plot_concentrations.R"))
source(file.path(Rcode_dir,"inversion_032024.R"))
source(file.path(Rcode_dir,"write_inversion_2_netcdf_032024.R"))
source(file.path(Rcode_dir,"generate_transcom_flux_ensemble_from_inversion.R"))
       
###############################################
#-- Required Libraries
###############################################
require(ncdf4)
require(plyr)
require(dplyr)
require(parallel)
require(ggplot2)
require(abind)
require(Matrix)
require(lattice)
require(memuse)
require(EnvStats)
require(gridExtra)
require(mvtnorm)
require(plotly)

########################
#--  Detect Cores
########################
print(paste("Num CPUs:",detectCores(),"cores"))
memuse::Sys.meminfo()

In [None]:
###############################################
#--  Load sensitivity matrices 
###############################################

load(file.path(data_dir,"jacobians/","trunc_full_jacob_030624_with_dimnames_sib4_4x5_mask.rda"))
load(file.path(data_dir,"jacobians/","jacob_bgd_060524.rda"))

#-- NOTE: the GEOS-Chem forward runs differ in mass units (CO2 vs. C), hence the 12/44 factor here
#-- Assign the jacob objects to H to match notation
H <- jacob * 12/44
H_bgd <- jacob_bgd 
rm(jacob);rm(jacob_bgd)

#-- These represent the fossil and biomass burning contributions to the observations (from fixed emission runs)
fire_fixed <- H_bgd[,2]
fossil_fixed <- H_bgd[,3]
###################################################################
#-- END END END ***Parent Directory and code for ALL inversions***
###################################################################

In [None]:
#################################
#- Target truth in state space
#################################

##################################################################
#-- This array holds ratios of OCO2v10MIP fluxes and SiB4 fluxes
#-- as examples of "scalings" to be recovered. It also holds corresponding
#-- differences if the inversion attempts to directly solve for flux
#-- truth_array(24 months, 23 transcom, 98 inversions, (ratio, difference) )
##################################################################

#-- Don't Change
#load("/projects/sandbox/inversion_workshop_scripts/truth_array.rda")
load(file.path(data_dir,"misc/truth_array.rda"))
#-- pulling out NA transcom region and subset to scalar vs flux adj
truth_array = truth_array[,-1,,1]
#-- Don't Change


#--  Choose our state from inversion list, option #1, and "truncate" to -1 and 1
inversion_number =1   #  choose this between 1 and 98
state_vector_true= tm(as.vector(- truth_array[,,inversion_number]),-1,1)

#-- Alternatively choose a "different" true state like the below ones
#-- The first just means the truth IS the prior, the second has a simple structure
#-- Land regions fluxes are (1+0.5) * prior guess and ocean fluxes are (1- 0.5) * prior guess.
#state_vector_true = c(rep(0,24*11),rep(0,24*11))
#state_vector_true = c(rep(0.5,24*11),rep(-0.5,24*11))


In [None]:
#########################################################
# Generate a prior flux covariance matrix Sx
# These first two lines form "diagonal" of Sx, e.g. marginal variances
# Long term, a catalog of predefined choices is best here I think
#########################################################
land_prior_sd = 0.5   #-- free to set this, implies you think "truth" for land is within +/- 3*this
ocean_prior_sd = 1    #-- free to set this, implies you think "truth" for ocean is within +/- 3*this

##############################################################################
#-- This is the structure of the 24 month subblock for each land/ocean region
#-- induce temporal correlations
##############################################################################

#-- This will set up a prior temporal correlation, 
#-- free to set month_to_month_correlation between 0 (independent) and 1
month_to_month_correlation = 0.5
sigma = bdiag(rep(list(ar_covariance(24, month_to_month_correlation)), 22))  #-- free to set 


#################################################
#-- scale by standard deviation for land/ocean (sets diagonal of Sx to the variances)
#-- This simply puts together pieces above
#################################################
var_scaling_diagonal = diag(c(rep(land_prior_sd,24*11),rep(ocean_prior_sd,24*11)))

Sx = as.matrix(var_scaling_diagonal %*% sigma %*% t(var_scaling_diagonal))

#-- This is an alternative state_vector_true based *exactly* upon the prior covariance matrix
#-- as opposed to being able to pick your "truth" separately from your assumed dist where "truth" lives
#-- Probably don't want to change this unless you know what you are doing
#state_vector_true = t(rmvnorm(n=1,mean=rep(0,528),sigma=Sx))

#### Choose which observations you want to assimilate
In other words, choose which observations will be used to optimize/estimate the unknown fluxes. This problem is somewhat overdetermined, with over a million observations to constrain a 528-element state. With that in mind, small observation errors and LOTS of observations should "nail" the unknown solution quite well. The goal here is to create a TRUE/FALSE vector of length equal to the total number of observations described in the sensitivity matrix we loaded above (1156383). The obs_catalog is a data.frame (think matrix of 'items') with information about each observation, and it can be used to build a subset.
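
A minimal sketch of building such a TRUE/FALSE vector from a catalog is shown here on a tiny stand-in data.frame. The column names `TYPE`, `LAT` and `LON` are hypothetical and chosen only for illustration; check `names(obs_catalog)` for the columns actually available.

```r
# Toy stand-in for obs_catalog; the column names here are hypothetical
toy_catalog <- data.frame(
  TYPE = c("IS", "OCO2", "IS", "TCCON", "OCO2"),
  LAT  = c(40, -10, 65, 35, 50),
  LON  = c(-105, 30, 20, -97, 10),
  stringsAsFactors = FALSE
)

# Keep only observations of type "IS" located north of 30N
toy_subset <- toy_catalog$TYPE == "IS" & toy_catalog$LAT > 30

# One TRUE/FALSE entry per row of the catalog
print(paste("using", sum(toy_subset), "of", length(toy_subset), "observations"))
```

The same pattern, a logical expression on catalog columns combined with `&` and `|`, should carry over to the real catalog once the relevant column names are known.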

In [None]:
####################################################################################
#-- WHICH obs do you want to use in the inversion? 
#-- examples of selecting on stations, type of data, lat/lon box,etc
####################################################################################

#load(file.path(data_dir,"obs/obs_catalog_030624.rda")) # obs_catalog object
load(file.path(data_dir,"obs/obs_catalog_042424_unit_pulse_hour_timestamp_witherrors_withdates.rda")) 

#-- USE ALL OBSERVATIONS
subset_indicator_obs=rep(TRUE,dim(H)[1])

############################
#-- Downsample if necessary
############################

n_keep = floor(0.5*length(subset_indicator_obs))   #-- keep at most half of all obs
if(sum(subset_indicator_obs) > 0.5*length(subset_indicator_obs)) {
  new_ind = rep(FALSE,length(subset_indicator_obs))
  #-- randomly pick n_keep of the currently selected observations
  new_ind[sample(x=which(subset_indicator_obs),size=n_keep)] = TRUE
  print(paste("downsampling from",sum(subset_indicator_obs),"to",
              n_keep,"observations"))
  subset_indicator_obs = new_ind
}

#-- LEAVE THIS AS IT SUMMARIZES THE NUMBER OF OBS USED
print(paste("using",sum(subset_indicator_obs),"of",length(subset_indicator_obs),"observations"))

In [None]:
##########################################################
#-- sd for Gaussian i.i.d. errors, H is the sensitivity matrix
##########################################################

Sz_diagonal_in = rep(1,length(obs_catalog$SD))

##########################################
#-- Generate obs 'z' (call set.seed() first if you need reproducible noise)
##########################################

z_in = H %*% (1+state_vector_true) + rnorm(length(Sz_diagonal_in),sd=1)

In [None]:
############################
#-- Run the actual inversion
############################

ret2  = invert_clean_notation(H=H,Sz_diagonal=Sz_diagonal_in,Sx=Sx,z=z_in,H_bgd=H_bgd,
                    subset_indicator_obs=subset_indicator_obs,DOF=TRUE,output_Kalman_Gain=FALSE,
                     state_vector_true=state_vector_true)


#### Save results from mdm=1, true=1 to org_data_control, i.e. this is how inversion "should" be run


In [None]:
org_data_control = generate_transcom_flux_ensemble_from_inversion(inv_object=ret2,samples=1000)

### HERE WE ARE GOING TO SET OUR ASSUMPTION OF MDM ERRORS TO 1 PPM STANDARD DEVIATION

### BUT IN REALITY WE ARE GOING TO ADD 10 PPM STANDARD DEVIATION


In [None]:
##########################################################
#-- sd for Gaussian i.i.d. errors, H is the sensitivity matrix
##########################################################

Sz_diagonal_in = rep(1,length(obs_catalog$SD))

##########################################
#-- Generate obs 'z' (call set.seed() first if you need reproducible noise)
##########################################

z_in = H %*% (1+state_vector_true) + rnorm(length(Sz_diagonal_in),sd=10)

### NOTE: CHI SQUARE VALUES FROM A WELL-SPECIFIED INVERSION SHOULD BE AROUND 1; CHECK THE VALUES REPORTED BY THE INVERSION BELOW
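
To see why chi-square is a useful check on the MDM, here is a rough sketch using a simplified reduced chi-square: the mean of squared residuals divided by the assumed error variance. This is an assumption made for illustration; the statistic reported by `invert_clean_notation` may be defined differently (for example, it may include a prior term). The residuals are simulated directly as noise, standing in for `z - H x_hat`.

```r
# Simplified reduced chi-square on simulated residuals (illustrative only)
set.seed(123)
n <- 10000

reduced_chi2 <- function(assumed_sd, true_sd) {
  resid <- rnorm(n, sd = true_sd)     # stand-in for observation minus model residuals
  mean(resid^2 / assumed_sd^2)        # near 1 when assumed_sd matches true_sd
}

chi2_control     <- reduced_chi2(assumed_sd = 1,  true_sd = 1)
chi2_mdm1_true10 <- reduced_chi2(assumed_sd = 1,  true_sd = 10)
chi2_mdm10_true1 <- reduced_chi2(assumed_sd = 10, true_sd = 1)
print(c(control = chi2_control, mdm1_true10 = chi2_mdm1_true10,
        mdm10_true1 = chi2_mdm10_true1))
```

In this sketch, values far above 1 point to an MDM that is too small, and values far below 1 point to an MDM that is too large.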

In [None]:
############################
#-- Run the actual inversion
############################

ret2  = invert_clean_notation(H=H,Sz_diagonal=Sz_diagonal_in,Sx=Sx,z=z_in,H_bgd=H_bgd,
                    subset_indicator_obs=subset_indicator_obs,DOF=TRUE,output_Kalman_Gain=FALSE,
                     state_vector_true=state_vector_true)

#### Save results from mdm=1, true=10 to org_data_mdm1_true10

In [None]:
org_data_mdm1_true10 = generate_transcom_flux_ensemble_from_inversion(inv_object=ret2,samples=1000)

### HERE WE ARE GOING TO SET OUR ASSUMPTION OF MDM ERRORS TO 10 PPM STANDARD DEVIATION


In [None]:
##########################################################
#-- sd for Gaussian i.i.d. errors, H is the sensitivity matrix
##########################################################

Sz_diagonal_in = rep(10,length(obs_catalog$SD))


### BUT IN REALITY WE ARE GOING TO ADD 1 PPM STANDARD DEVIATION


In [None]:
##########################################
#-- Generate obs 'z' (call set.seed() first if you need reproducible noise)
##########################################

z_in = H %*% (1+state_vector_true) + rnorm(length(Sz_diagonal_in),sd=1)

In [None]:
############################
#-- Run the actual inversion
############################

ret2  = invert_clean_notation(H=H,Sz_diagonal=Sz_diagonal_in,Sx=Sx,z=z_in,H_bgd=H_bgd,
                    subset_indicator_obs=subset_indicator_obs,DOF=TRUE,output_Kalman_Gain=FALSE,
                     state_vector_true=state_vector_true)



#### Save results from mdm=10, true=1 to org_data_mdm10_true1

In [None]:
org_data_mdm10_true1 = generate_transcom_flux_ensemble_from_inversion(inv_object=ret2,samples=1000)

### Compare and contrast
Compare the confidence bounds carefully below (boxplots) and try to ascertain how errors in the specification of the observation error matrix (i.e. MDM or Sz) affect the posterior predictions of fluxes.
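
One way to make "compare the confidence bounds" concrete is to ask how often a 95% interval actually contains the truth. The sketch below does this for a toy problem with a single unknown scaling and repeated synthetic experiments. It is not the workshop inversion, and `coverage`, `x_true` and the other names are invented for this example.

```r
# Toy coverage experiment with one unknown scaling x (illustrative only)
set.seed(7)
n_obs    <- 100    # observations per experiment
n_rep    <- 2000   # number of repeated experiments
prior_sd <- 0.5    # prior sd on x, prior mean is 0
x_true   <- 0.3    # fixed "truth"

coverage <- function(assumed_sd, true_sd) {
  hits <- replicate(n_rep, {
    z         <- (1 + x_true) + rnorm(n_obs, sd = true_sd)    # synthetic obs
    post_var  <- 1 / (1 / prior_sd^2 + n_obs / assumed_sd^2)  # Gaussian posterior variance
    post_mean <- post_var * sum(z - 1) / assumed_sd^2         # Gaussian posterior mean
    abs(post_mean - x_true) <= 1.96 * sqrt(post_var)          # truth inside 95% interval?
  })
  mean(hits)
}

cov_control     <- coverage(assumed_sd = 1,  true_sd = 1)
cov_mdm1_true10 <- coverage(assumed_sd = 1,  true_sd = 10)
cov_mdm10_true1 <- coverage(assumed_sd = 10, true_sd = 1)
print(c(control = cov_control, mdm1_true10 = cov_mdm1_true10,
        mdm10_true1 = cov_mdm10_true1))
```

In this toy setup, the overconfident case (mdm=1, true=10) tends to cover the truth far less often than the nominal 95%. The underconfident case (mdm=10, true=1) covers it nearly always, but only because its intervals are much wider and its estimate is pulled toward the prior. Look for the same pattern in the boxplots below.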

In [None]:
plot_timeseries_flux_bytranscom(org_data_control)

In [None]:
plot_timeseries_flux_bytranscom(org_data_mdm10_true1)

In [None]:
plot_timeseries_flux_bytranscom(org_data_mdm1_true10)

In [None]:
plot_transcom_flux_by_month(org_data_control)

In [None]:
plot_transcom_flux_by_month(org_data_mdm1_true10)

In [None]:
plot_transcom_flux_by_month(org_data_mdm10_true1)