CEDS User Guide

Table of Contents

1. Getting Started with CEDS.
- 1.1. Prerequisites
- 1.2. Running CEDS
2. System Overview
3. Input Data
4. CEDS System Code
- 4.1. Miscellaneous Coding Notes
- 4.2. R Issues
5. How to Include Supplemental Combustion Energy Activity in CEDS
6. Code Structure and Guide
7. Diagnostic Results
- 7.1. Comparison Graphs
- 7.2. Comparing with previous development version
8. Troubleshooting Odd Results

1. Getting Started with CEDS.

CEDS is an open source framework that aims to be run on any major operating system. The following section offers instructions for installing CEDS on your local machine.

1.1. Prerequisites

1.1.1. The renv Package

R packages continually change, which can break existing R code either directly or indirectly through package dependencies. The CEDS system now uses the renv package to better ensure that the system remains functional as R packages evolve.

renv allows users to create isolated, reproducible, project-specific R libraries. It is cross-platform and allows for the installation of older package versions.

1.1.1.1. User Installation

Upon cloning the repository and navigating to the root CEDS directory, users should activate their renv library and install CEDS R dependencies.

1. Initialize a CEDS-specific Library

Although the renv setup files are in the CEDS repository, it is still necessary to initialize the project. This is done with the renv::init() function.

To initialize a renv library for CEDS, open an R session in your CEDS root directory. Then:

- Install the renv package: install.packages("renv")
- Initialize the project library: renv::init(bare=TRUE)
- From an R session in your CEDS project root directory, run the following command: renv::restore().

By default, the init function will scan the project’s source code for R dependency packages to download, but this can take a while to run and won’t necessarily install the package versions CEDS needs to run. Using the bare = TRUE argument will tell renv to install an empty R library for CEDS that we can populate with packages defined in the lockfile.

renv::init() calls renv::activate(), which writes the infrastructure needed to ensure that R will load the CEDS R library on launch, among other things.

By default, renv::restore() will retrieve the library package metadata from renv.lock and install, if necessary, the specified package versions to the project’s private library located in CEDS/renv/library/…/….

1.1.1.2. renv Background

Structure

renv is extremely lightweight and only adds four files to the repo with a total size of 32 KB. It also automatically adds its installed package directories to the project’s .gitignore file, so users don’t have to worry about accidentally committing hundreds of MB of R packages to the repo. An example of the renv file structure as it would appear in the CEDS repo is below:

CEDS/
 |- renv/
 |   |- .gitignore
 |   |- activate.R
 |   |- settings.dcf
 |
 |- renv.lock

Once initialized, a library/ sub-directory would be added to renv/, which is where R packages would be installed.

Lockfiles

renv utilizes lockfiles to record the state of a project’s library at some point in time. They contain package metadata, such as package names, versions, and sources, as well as the R version that was used to initialize the project. While normally written by snapshot() and read by restore(), lockfiles are stored as JSON, which allows them to be edited by hand. The CEDS lockfile, renv.lock, is located in the root CEDS project directory.

Global Package Cache

A defining feature of renv is the use of a global package cache, which is shared across all projects using renv on a machine. The cache saves time and disk space by allowing various projects to access the same packages, rather than installing the same packages and versions into separate projects.

When using the global package cache, the project library is formed as a directory of symlinks rather than a directory of installed R packages. Each renv project is isolated from other projects on a machine, but they can still re-use the same installed packages as needed.

The global package cache is enabled by default; however, it can be disabled by calling renv::settings$use.cache(FALSE). Packages are then installed to project libraries directly, without attempting to link to the renv cache.

1.1.1.3. Troubleshooting

R package `farver` fails to compile on pic HPC cluster

The installation of the farver package may fail when attempting to compile, resulting in an error message that looks something like this:

* installing *source* package 'farver' ...
** package 'farver' successfully unpacked and MD5 sums checked
** libs
g++ -std=gnu++0x -I"/share/apps/R/3.5.1/lib64/R/include" -DNDEBUG   -I/usr/local/include   -fpic  -I/share/apps/R/3.5.1/include -c     ColorSpace.cpp -o ColorSpace.o
In file included from ColorSpace.cpp:1:
ColorSpace.h:19: error: ISO C++ forbids initialization of member 'valid'
ColorSpace.h:19: error: making 'valid' static
ColorSpace.h:19: error: ISO C++ forbids in-class initialization of non-const static member 'valid'
make: *** [ColorSpace.o] Error 1
ERROR: compilation failed for package 'farver'

In this case, this is due to the package’s C++ backend using features not present in the older (gcc 4.4.7) default gcc compiler on PNNL’s internal HPC pic system.

Solution

Load a newer version (6.1.0 works as of 13 May 2020) of the gcc compiler via module load gcc/6.1.0.

R package `ncdf4` fails to install on pic HPC cluster

CEDS uses the ncdf4 package within the gridding module to produce gridded emissions files. The package is not required to produce CEDS emissions CSV files.

ncdf4 depends on an nc-config file that ships with the Unidata NetCDF library. The Unidata NetCDF library is a documented system requirement for the R ncdf4 package. The NetCDF C library is installed on pic, but is not loaded as a module at the beginning of a remote session. Attempting to install the R ncdf4 package without the netcdf module loaded into your session can result in the following error:

Installing ncdf4 [1.16] ...
        FAILED
Error installing package 'ncdf4':
=================================
* installing *source* package 'ncdf4' ...
** package 'ncdf4' successfully unpacked and MD5 sums checked
configure.ac: starting
checking for nc-config... no
-----------------------------------------------------------------------------------
Error, nc-config not found or not executable.  This is a script that comes with the
netcdf library, version 4.1-beta2 or later, and must be present for configuration
to succeed.

Solution

Load the netcdf library into your session via the command module load netcdf.

An alternative solution is to install a more recent binary version of the ncdf4 R package.

Unable to locate ICU4C library

ICU is a cross-platform Unicode based globalization library. It includes support for locale-sensitive string comparison, date/time/number/currency/message formatting, text boundary detection, character set conversion and so on. When attempting to install some R packages, such as stringi v1.2.2, the ICU4C library cannot be located and the installation fails:

checking for pkg-config... /usr/bin/pkg-config
checking with pkg-config for the system ICU4C... no
*** pkg-config did not detect ICU4C-devel libraries installed
*** Trying with "standard" fallback flags
checking whether we may build an ICU4C-based project... no
*** The available ICU4C cannot be used
checking whether we may compile src/icu61/common/putil.cpp... no
checking whether we may compile src/icu61/common/putil.cpp with -D_XPG6... no
*** The ICU4C bundle could not be build. Upgrade your compiler flags.
ERROR: configuration failed for package 'stringi'
* removing '/pic/projects/GCAM/mnichol/ceds/CEDS-dev/renv/staging/1/stringi'
Error: install of package 'stringi' failed

Solution 1

Load a newer compiler into your remote session: module load gcc/7.3.0 (gcc/7.3.0 works as of 15 May 2020).

Solution 2

Use the install.packages function to modify the compiler flags used in the installation process:

install.packages(c("stringi"),configure.args=c("--disable-cxx11"), lib=lib)

Use the lib argument to install the package into your project’s renv library (can be found using .libPaths()).

NOTE This solution is fine when installing only stringi; however, it may not completely resolve the problem when stringi is installed as a dependency of another R package through renv.

R/renv unable to load shared object

renv has the ability to link packages from a user’s global R library to their project-specific renv library, saving the time and space that re-downloading the same package and version a second time would require. However, once this cache link is established, removing the package from the global R library will break the link, causing errors in the renv library.

The error message below resulted from the cache link between the CEDS renv library and global R library being broken for the stringi package when attempting to load the stringr package, which depends on stringi:

Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared object '/qfs/people/nich980/.local/share/renv/cache/v5/R-3.3/x86_64-pc-linux-gnu/stringi/1.2.2/e99d8d656980d2dd416a962ae55aec90/stringi/libs/stringi.so':
  /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /qfs/people/nich980/.local/share/renv/cache/v5/R-3.3/x86_64-pc-linux-gnu/stringi/1.2.2/e99d8d656980d2dd416a962ae55aec90/stringi/libs/stringi.so)
Couldn't load 'stringr'. Please Install.

Solution

Manually install the package dependency in question into the project renv library, using the library and rebuild arguments:

> .libPaths()
[1] "/pic/projects/GCAM/mnichol/ceds/CEDS-dev/renv/library/R-3.3/x86_64-pc-linux-gnu"
[2] "/tmp/RtmpofmJy0/renv-system-library"
> lib <- .libPaths()[1]  # CEDS renv library
> renv::install("stringi@1.2.2", library=lib, rebuild=TRUE)

This forces renv to install the package in the local library, rather than attempting to create another cache link.

1.1.2. R and R Packages

Note that the Rscript command needs to be in the command path for your system for the CEDS makefile scripts to run. In some installations you may have to add the location of Rscript to your environment’s PATH variable.

CEDS also requires several R packages to be installed. Once the renv system is set up and initialized as described in the previous section, the packages needed for CEDS should be available.

The current list of necessary packages can be found in the ./code/parameters/global_settings.R file, along with a set of package version numbers that have been tested to work.

As noted in the previous section, we have observed issues when installing the netCDF package (ncdf4) from source. The netCDF package is only necessary if you are producing gridded data. If you are having trouble with installation and are not producing gridded data, you can remove that package from ./code/parameters/global_settings.R and also from the renv lockfile.

1.1.3. Proprietary Energy Statistics Data

In order to run the CEDS system as a whole, it is necessary to acquire a copy of the IEA energy statistics data files. In the current CEDS version these are in one file called OECD_and_NonOECD_E_Stat.csv that should be placed in the emissions-data-system/input/energy directory. They are required by the script A1.2.IEA_downscale_ctry.R, but are proprietary data and not allowed to be distributed as part of a public-domain system such as CEDS. More details about using IEA data can be found in the Energy Data section below.

1.1.4. Make

While individual components of CEDS can be run individually with R, the system as a whole should be executed using a Makefile system. Commands of the form:

make [em]-emissions

(where [em] is replaced by the desired emission species) will produce emissions by country, sector, and fuel.

Once the aggregate country-level emissions are produced, they can be mapped to spatial grids using:

make [em]-gridded

Note that before gridding, spatial proxy data must be installed as described in the module-G section below.

To run CEDS with the Makefile, you will need to install Make.

1.1.4.1. Installing Make on OS X

To test if you have Make already installed, simply type make in the command line. If you do not have it, you will see the error:

bash: make: command not found.

Make can easily be installed with Xcode, which can be downloaded for free from the Apple App Store.

Within Xcode, you can install Command Line Tools by selecting Xcode→Preferences→Downloads, then clicking Components and Install on the command line tools line.

Tip	Make can also be installed with the HomeBrew command `brew install make`.

1.1.4.2. Installing Make on Windows

Once Make is installed, Makefiles can be run by opening the command prompt, pointing to the location of the Makefile, and entering make (or nmake).

The commands Rscript or R CMD BATCH are necessary, as they are used in the Makefile. These commands can also be used in the command line independently to run specified individual scripts, if desired.

There are a number of options for running a Makefile in Windows. The make functionality is not native to the Windows operating system, so it must be downloaded. Some options for installing Make are:

Cygwin

During the installation of Cygwin make sure to specify that you would like Make as well. On the Select Packages screen, under All→Devel, ensure that the Bin box is checked for the file labeled make: the GNU version of the ‘make’ utility. Src is not necessary.

Tip	Installing Cygwin also gives the option to install other command line tools as well such as R commands like Rscript and R CMD BATCH, gcc functionalities, or command line text editors.

You will need to tell Cygwin where make and R are located (in addition to them being specified in the system environment variables) as follows:

- Run cygstart .bash_profile
- Open the .bash_profile file in Notepad
- Add the paths to the make and R commands to the .bash_profile file

export PATH=$PATH:/cygdrive/c/cygwin64/bin
export PATH=$PATH:/cygdrive/c/Program\ Files/R/R-3.5.3/bin

GnuWin32

This option will provide the make command to the Windows command prompt.

Visual Studio

This option will provide the nmake command to the Windows command prompt.

1.2. Running CEDS

Use git to download CEDS to a local repository on your system. If git is not already available on your system, you will need to install either a command-line version or a GUI client such as Sourcetree. After making sure all prerequisites are properly installed, run the entire system by navigating to the CEDS folder and executing the make all command. The Makefile system will detect any changes made and re-build the outputs as necessary. If the system is up to date, it will do nothing.

All modules are included in a Makefile in the emissions data system. Running the modules through the Makefile is advantageous because make will automatically run only what needs to be run to keep everything properly updated.

Caution	Make sure you are in the root CEDS directory containing the `Makefile`. If you are not you will see the error: `No targets specified and no makefile found. Stop.`

To rerun the entire system, use the make clean-all command, then the make command. This will remove all intermediate outputs and log files, forcing the Makefile system to rebuild the system from the first output file and run all integrated scripts. If you have made changes to the data processing or input data, the clean-all command is important to ensure accurate processing. For more information on using the Makefile, see the makefile section of the User Guide.

CEDS is set up to run in parallel by species. Example shell scripts that can be modified as needed for your system are in the exe/PIC-job-scripts directory. Note that module A (activity) must be run first; subsequent species-specific modules can then run in parallel. The Makefile also contains commands for running all species in three parts (part1, part2, part3), which can be done manually on any system with multiple processors.

It is also possible to run individual scripts in CEDS without running the whole system. Simply use the Rscript command from the command line, or open the file you wish to run in an R GUI and run it from there. Make sure you are running the script from either the root directory of the system (the CEDS directory, by default) or the input directory. However, a make clean-all (or a make clean for the relevant species or module) is highly recommended after a code change has been made.

Note	`make clean-all` followed by `make` is the only supported method to assure the system will produce accurate results.

2. System Overview

The CEDS system estimates emissions by sector, country and fuel in a few major steps (see Hoesly et al. Figure 1). First a set of default emissions are estimated for the modern era (either 1960 or 1971 to the last estimation year). These default emissions are then scaled to country inventories. Finally these are extended back to 1750. A general outline of the calculation process is given in this section.

2.1. Default Emissions

2.1.1. Combustion vs Process Emissions

CEDS has two categories of emissions, which reflect the way they are calculated: combustion emissions and process emissions. This assignment is done at the sector level, so each CEDS sector is designated as either a combustion or a process sector in the Master_Fuel_Sector_List.

In the CEDS system, combustion emission sectors have emission estimates for each CEDS fuel (hard_coal, diesel, biomass, etc.). In most CEDS intermediate files, therefore, combustion emission sectors have a row for each CEDS fuel (per country/iso). Process emission sectors are assigned a fuel of "process" to differentiate them from combustion emissions. There is only one row in intermediate files for each process emissions sector.

2.1.2. Combustion Emissions

For combustion sectors, default emissions are always calculated using fuel consumption and emission factors. Default emissions are determined as:

Default_Emissions = Fuel_Consumed • Emission_Factor • (1 - Control_Fraction)

For SO₂ there are additional parameters:

Default_Emissions = Fuel_Consumed • Sulfur_Content • 2 • (1 - Ash_retention) • (1 - Control_Fraction)

The default emission factors used above are either from GAINS or regional and country/sector specific values from a variety of sources. The fuel consumed and other activity data is generated in Module A (Activity Data) and the default combustion emission factors in Module B (Combustion Emission Factors).
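
As a worked illustration of the two formulas above, the following Python sketch evaluates them for made-up inputs. It is not part of CEDS (which is written in R); the function names, argument names, and numbers are assumptions chosen only to show the arithmetic.

```python
def default_emissions(fuel_consumed, emission_factor, control_fraction=0.0):
    """Default_Emissions = Fuel_Consumed * Emission_Factor * (1 - Control_Fraction)."""
    return fuel_consumed * emission_factor * (1.0 - control_fraction)


def default_so2_emissions(fuel_consumed, sulfur_content,
                          ash_retention=0.0, control_fraction=0.0):
    """SO2 variant of the formula, with sulfur content and ash retention.

    The factor 2 is taken directly from the guide; it is consistent with
    the mass ratio of SO2 to S (about 64 / 32).
    """
    return (fuel_consumed * sulfur_content * 2.0
            * (1.0 - ash_retention) * (1.0 - control_fraction))


# Hypothetical inputs: 1000 units of fuel, emission factor 0.02,
# and 25% of emissions removed by controls: 1000 * 0.02 * 0.75 = 15.
generic = default_emissions(1000.0, 0.02, control_fraction=0.25)

# Hypothetical inputs: 1% sulfur by mass, 5% retained in ash, 50% controlled.
so2 = default_so2_emissions(1000.0, 0.01, ash_retention=0.05,
                            control_fraction=0.5)
```

Setting the control fraction (and, for SO₂, the ash retention) to zero reduces both functions to the uncontrolled product of activity and emission factor.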

2.1.3. Process Emissions

Default emissions for CEDS process emission sectors are taken from EDGAR, country inventories, or other data sources for some specific sectors in Module C (Non-combustion Emissions).

Note that the term "process emissions" refers to the way in which the default emissions are calculated, and that some sectors classified as "process" may include emissions that result from fuel combustion. For example, flaring from oil and gas operations is a combustion process, but default emissions are specified by taking emissions from some default source and not by multiplying an emission factor times driver data.

2.2. Emissions Scaling

The second major step is scaling default emissions to country level inventories in Module F (Inventory Scaling).

2.3. Emission Extension

The last step in producing emissions is extension of emissions back to 1750 in Module H (Historical Extension). Emissions for any specific sector/emission species can be extended back using a variety of user-specified methods, including exogenous trends in some proxy or emissions data, per-capita trends, or trends in an emission factor (including emission factors trending to zero at a specified year).

2.4. Emission Gridding

The user can elect to produce spatial grids of the emissions generated in the previous steps in Module G (the gridding module).

3. Input Data

3.1. General Assumptions

3.1.1. Updating The Last Inventory Year

To extend the system to run to a later year, the key input data file to change is the BP energy statistics. Overwrite the current file with a more recent version. The file is located in /input/energy/. Note that this file needs to be in .xlsx format.

Then update the parameters BP_years and end_year in the file: /code/parameters/common_data.R

Clean and re-run make. The emissions data should now extend to the latest year specified. The data are simply extrapolated; updating emission inventory data (and detailed IEA energy data) will produce a more accurate estimate for recent years.

You may need to update BP mapping files if country names in the BP data have changed.

Note that the BP data must extend to the latest year specified here. The BP energy statistics only provide consumption by total fuel for the larger countries. More accurate results will be obtained by also updating to the latest version of the IEA energy statistics as described below.

Note that just updating the year will simply run the system to a later year using default emission factor pathways. In order to obtain a more accurate result, inventory data should also be updated. Please contact us as noted in the GitHub README if you are interested in collaborating in updating the CEDS system.

3.1.1.1. Additional Considerations

- There are numerous input files that provide data that extend to the last data year. Some of these will need to be updated. Look for user data files in input/default-emissions-data, input/extension, input/energy/energy-data-adjustment and input/energy/user-defined-energy directories.
- "default" trend extension instructions in CEDS_historical_extension_methods_EF.csv should generally be set to the last CEDS year.
- Check if any minimum EF pathways in input/extension/EF-pathway/ need to be updated to lower values for later years.

3.1.2. Adding A New Sector

CEDS has two types of sectors (set in the file Master_Fuel_Sector_List.xlsx):

- combustion sectors: Emissions from these sectors use energy data by fuel and sector as driver data. Default emissions are calculated by multiplying an emission factor times fuel consumption (minus an optional control fraction).
- non-combustion sectors: Emissions from these sectors use some other data (default is population) as driver data. (Also referred to in CEDS documents as process emissions.) Default emissions are read in from an external inventory source, user data, or a sector-specific script. Note that, physically, emissions from a CEDS non-combustion sector may be from fuel combustion. This designation refers only to how emissions are calculated within the CEDS system.

Adding a process (non-combustion) emission sector

In addition to indicating your data’s sector in your data source (the U.* file you used to import the data), you will need to edit 2 files in CEDS. They are:

CEDS/input/mappings/Master_Sector_Level_map.csv
- Add a new row to the spreadsheet where appropriate. The row will contain five columns of data:
  1. The detailed sector name: a unique sector ID (one word)
  2. working sectors v1 and
  3. working sector v2: these can be either your detailed name or a first-level aggregation; they may not be used in the model itself, but serve as process documentation
  4. The aggregate sector: if appropriate, the aggregate sector name will be identical to an existing aggregate sector
  5. Figure_sector: this should be identical to an existing Figure_sector: this is the category in which your data will be displayed in CEDS graphical outputs
CEDS/input/mappings/Master_Fuel_Sector_List.xlsx
- Add a new row to the spreadsheet at the appropriate location in the “Sectors” sheet only. This row will contain 4 columns of data:
  1. The working_sectors_v1 sector name
  2. The activity type
  3. Units of analysis
  4. Type: comb (combustion) or NC (non-combustion)

3.1.3. Internal Data File Updates

There are two input files that are used by CEDS but must be re-generated by manually running specific scripts (due to dependencies on multiple species). It should not be necessary to do this often, but these should be revisited periodically, and always before a final data release. These are:

File Name	Script	…
input/default-emissions-data/CD.OC_to_PM25_defaultratio.csv input/default-emissions-data/CD.BC_to_PM25_defaultratio.csv	D1.2.BC_OC_to_PM2.5_default_ratios.R	…
input/extension/extension-data/H.N2O_7BC_extension-NH3_and_NOx_sectors_1_2.csv	H1.1a.Aggregate_NH3_NOx_for_N2O_7BC_ext.R	…

3.2. Energy Data

The core data needed to run the data system is the IEA OECD and non-OECD energy statistics.

3.2.1. Adding or updating the IEA Energy Statistics

The IEA energy statistics database needs to be purchased from the IEA and the data exported into csv format in order to run the CEDS system. The instructions below refer to the cd-rom distribution: the entire IEA energy database needs to be exported for use in the data system.

Steps to import the IEA energy data

Export the statistics for OECD and non-OECD countries into two .csv files
1. The first column is full name (spelled out).
2. The second column is flow (as IEA abbreviation, because names are not unique otherwise. To change to abbreviation, click on the flow icon, then go to Dimensions → Change label).
3. The third column is fuel (spelled out).

The necessary format is shown in the files: OECD_E_Stat_Template.csv and NonOECD_E_Stat_template.csv.

(To export from the IEA beyond 20/20 data browser, drag the icon for country to the left to form a column and icon for time to the right to create a row with years. Then drag the icon for flows between the column for countries and data for the first year; it will add a column for flows. Then drag the icon for fuel between column for flows and data for the first year. This will result in a large table that contains all the data that can then be exported as a csv.)

In a text editor:
1. Replace .., c, and x, in the data values with zeros (note these can occur at end of lines)
2. Get rid of special characters and apostrophes
  1. Côte d’Ivoire → Cote dIvoire
  2. Dem. People’s Rep. of Korea → Dem. Peoples Rep. of Korea
  3. People’s Republic of China → Peoples Republic of China
  4. Curaçao → Curacao.
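
The text-editor steps above can also be scripted. The following Python sketch is one possible way to do so, not part of CEDS: the function names are hypothetical, and it assumes that the first three columns are the country, flow, and fuel labels described earlier and that every later column holds a data value.

```python
import unicodedata

# Placeholders that the guide says to replace with zeros.
MISSING_MARKERS = {"..", "c", "x"}
# Assumption: country, flow, and fuel occupy the first three columns.
N_ID_COLUMNS = 3


def to_ascii(text):
    """Drop accents and apostrophes, e.g. 'Curaçao' becomes 'Curacao'."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with 'ignore' drops combining accents and the
    # typographic apostrophe; plain apostrophes are removed explicitly.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.replace("'", "")


def clean_row(row):
    """Clean one CSV row given as a list of strings."""
    ids = [to_ascii(cell) for cell in row[:N_ID_COLUMNS]]
    # Only whole cells equal to a marker are replaced, so a value such as
    # '12.5' is left untouched.
    values = ["0" if cell.strip() in MISSING_MARKERS else cell
              for cell in row[N_ID_COLUMNS:]]
    return ids + values
```

Rows could be read and written with Python's csv module and passed through clean_row one at a time. Checking the result against the template files before running CEDS is still advisable.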

If the data is the same release used in the version of the CEDS system that you have (you can check this in the metadata file that is released with the system) then there are no further steps.

However, if you are using a newer (or older) version of the IEA/OECD statistics, then the following additional steps are needed.

Update year ranges in code\parameters\common_data.R. To replace the 2012 edition of the IEA data with the 2015 edition, change the parameter IEA_years <- 1960:2010 to IEA_years <- 1960:2013.

The BP energy statistics are used to extend energy consumption and production data to the latest CEDS year. If you use the 2015 edition of the IEA data, change the parameter BP_years <- 2011:2014 to BP_years <- 2014.

If there are new countries or new country names, the master country list (input\mappings\Master_Country_List.csv) will need to be updated.
If there are any new fuels these might need to be updated in the master fuel list input\mappings\energy\IEA_product_fuel.csv.
If fuels have changed names, this might require changes in other files. Please contact us for assistance. (We will be working to generalize this process.)

Tip	When updating the IEA energy data check that the data in input/energy/energy-data-adjustment is still valid and update if necessary.

3.2.1.1. Adding or updating the IEA Energy Content Data

Similarly, the system also uses net energy content values from the IEA. These also must be exported, with the format provided in the files: NonOECD_Conversion_Factors_Full_template.csv, NonOECD_Conversion_Factors_template.csv, OECD_Conversion_Factors_Full_template.csv, and OECD_Conversion_Factors_template.csv.

3.3. Process Emissions Driver Data

In order to more accurately extend process emissions time series, driver data for the appropriate emissions time series is needed.

In the first phase of this project, where we are focusing on recent decades, complete, consistent time series estimates exist for most emissions (e.g. EDGAR, FAO, etc.). For this reason, process emissions driver data are not critical to this first phase and most of this data has not been incorporated.

3.3.1. User Added Default Process Emission Data

The user can add process (non-combustion) emissions to CEDS by adding inventory files or instructions for using processed inventory files (from module E) in the intermediate_output folder.

3.3.1.1. Individual Files

CSV files with process emissions data may be added to the input/default-emissions-data/non-combustion-emissions folder. Files should be named with "U.<em>_" followed by a description or identifier. The system will not import files named without the .<em> (example "U.SO2"). Clean commands (executed by the Makefile) will delete files in the folder that begin with "C.", so users should only add files preceded by "U.".
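
To make the naming rule concrete, here is a small Python sketch that checks whether a file name follows the "U.<em>_" pattern for a given species. It is only an illustration: the function name and the example file names are hypothetical, and CEDS itself may match files differently.

```python
import re


def is_user_emission_file(filename, em):
    """Return True if filename looks like 'U.<em>_<description>.csv'."""
    # re.escape guards against species names containing regex characters.
    pattern = r"U\." + re.escape(em) + r"_.+\.csv"
    return re.fullmatch(pattern, filename) is not None


ok = is_user_emission_file("U.SO2_smelter_estimates.csv", "SO2")
wrong_prefix = is_user_emission_file("C.SO2_smelter_estimates.csv", "SO2")
no_species = is_user_emission_file("U.smelter_estimates.csv", "SO2")
```

Requiring the underscore directly after the species also keeps "U.CO2_..." files from being picked up when the species is CO.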

Files should be in standard CEDS format with column headings iso-sector-fuel-units-Xyears, similar to output emissions and EF files produced by the system. Year columns must be in the format “Xyear”, such as X1980 or X2005. Files may contain any number of emission years in any order. The script will automatically order years and linearly interpolate between years. It does not extend emissions to years outside the given data.
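
The ordering and interpolation behaviour described above can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the CEDS script: the function name is hypothetical and a single data row is represented as a dictionary from Xyear headers to values.

```python
def interpolate_years(row):
    """Order Xyear columns and fill gaps between them linearly.

    No values are produced before the first or after the last given year.
    """
    known = sorted((int(header[1:]), value) for header, value in row.items())
    filled = {}
    # Walk over consecutive pairs of known years and fill the span between.
    for (y0, v0), (y1, v1) in zip(known, known[1:]):
        for year in range(y0, y1):
            frac = (year - y0) / (y1 - y0)
            filled[f"X{year}"] = v0 + frac * (v1 - v0)
    last_year, last_value = known[-1]
    filled[f"X{last_year}"] = last_value
    return filled


# Hypothetical input with years out of order and gaps of 2 and 4 years.
result = interpolate_years({"X1986": 70.0, "X1980": 10.0, "X1982": 30.0})
```

Here result holds one entry for every year from 1980 through 1986, with 1981 and 1983 to 1985 filled in on straight lines between the given values.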

Files must contain the iso, sector, fuel, and units columns. Entries whose four id columns do not exactly match entries in the CEDS NC_database will not be added. The script automatically filters out entries which are not mapped to non-combustion sectors (designated by input/mappings/Master_Fuel_Sector_List.xlsx) or have “process” as fuel.

3.3.1.2. Select Emission Data From Module E Inventories

Data lines from processed inventory files (from module E) in the intermediate_output folder can be added to the default dataset by adding lines to input/default-emissions-data/non-combustion-emissions/add_inventory_instructions.csv. This is particularly useful (and recommended) when the default process emissions data is too different from the inventory data, resulting in large scaling factors. (This can be diagnosed by examining scaling script diagnostic files.)

Each instruction line must specify:

- inv: the name of the inventory file in the intermediate-output folder, such as E.SO2_EMEP_NFR09_inventory
- em: the emission species
- iso: the country code
- inv_sector: an exact match of the name of the inventory sector specified in the inventory file (inv)
- ceds_sector: the CEDS sector the emissions should be matched to

Data must be mapped to non-combustion sectors (designated by input/mappings/Master_Fuel_Sector_List.xlsx).

4. CEDS System Code

4.1. Miscellaneous Coding Notes

Note when using GREP to select input files, that one cannot grep for "OC", for example, as this will also capture "NMVOC". You must use an appropriate wildcard match that distinguishes between "NMVOC" and "OC", and "CO" vs "CO2".
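
The pitfall is easy to reproduce. The Python sketch below shows one pattern that avoids it, using lookarounds to require that the species name is not embedded in a longer alphanumeric token. This is an illustration only: the function name and file names are hypothetical, and the same idea would need to be adapted to R's grep() inside CEDS.

```python
import re


def files_for_species(filenames, em):
    """Select names containing `em` as a whole token.

    'OC' must not match inside 'NMVOC', and 'CO' must not match 'CO2'.
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(em) + r"(?![A-Za-z0-9])"
    )
    return [name for name in filenames if pattern.search(name)]


# Hypothetical file names used only for this example.
names = ["U.OC_a.csv", "U.NMVOC_a.csv", "U.CO_a.csv", "U.CO2_a.csv"]
oc_files = files_for_species(names, "OC")
co_files = files_for_species(names, "CO")
naive_oc = [n for n in names if "OC" in n]  # also picks up NMVOC
```

The naive substring test returns both the OC and the NMVOC file, while the token-aware pattern returns only the intended one.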

4.2. R Issues

If you encounter an error where a package is reported to not be available even though you installed it already, try installing without specifying a lib argument (e.g., install.packages( 'package-name' ) ) so that the package is installed in the default location. (Note that GUIs such as RStudio might sometimes install a package in the wrong place.)
When continually running code from individual R scripts, using the function logStart() (called in the initialize function at the beginning of every script) without logStop() (called at the end of every script) will keep the log files open. An R session can only handle so many open log files before the following error occurs:
```
Error in sink(paste(logpath, fn, ".log", sep = ""), split = T) :
  sink stack is full
```
To resolve, clear the global environment manually or by restarting the R session.
Similar to the error above, having too many files open can create the following error:
```
Error in textConnection("rval", "w", local = TRUE) :
  all connections are in use
```
To resolve, enter the command closeAllConnections() into the console.
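As a possible alternative to restarting the session for the "sink stack is full" error, base R's sink.number() and sink() can be used to unwind open output diversions. This is a minimal sketch: close_open_sinks is a hypothetical helper, not a CEDS function, and whether it leaves CEDS logging in a clean state is not confirmed here.

```r
# Hypothetical helper: pop every open output diversion off the sink stack
close_open_sinks <- function() {
    closed <- 0
    while ( sink.number() > 0 ) {
        sink()                 # calling sink() with no arguments removes one diversion
        closed <- closed + 1
    }
    closed                     # number of diversions that were closed
}
```

Calling close_open_sinks() at the console returns how many diversions were still open, which can help confirm that unclosed logs were the cause.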

5. How to Include Supplemental Combustion Energy Activity in CEDS

CEDS has the capacity to dynamically include user-defined activity in a number of ways. This section outlines how to include supplemental combustion activity data in a run of CEDS.

5.1. Formatting the Data

Every supplemental dataset is required to be in a .csv format and must be accompanied by a corresponding instructions file. Additionally, a mapping (.xlsx) file is required for any dataset that is not already in the standard CEDS format.

These files are tied together by their root filename, with the non-data files specified by an extension of -instructions.csv or -mapping.xlsx. All files must be saved to the folder input/extension/user-defined-energy in order to be included. For example, your extension directory might look like this:

input/
├── extension/
│   ├── user-defined-energy/
│   │   ├── mydata.csv
│   │   ├── mydata-instructions.csv
│   │   ├── mydata-mapping.xlsx
│   │   ├── USA_historical_coal.csv
│   │   └── USA_historical_coal-instructions.csv
... ...

If the files are formatted correctly, they need only be placed in this folder, and CEDS will automatically identify and process the data.

Below is a detailed guide to creating and formatting these files.

5.1.1. The Data File: [filename].csv

The data file is expected in wide form. There must be exactly one column giving information on the country, and at least one column giving the fuel type (agg_fuel, CEDS_fuel, or both). Additionally, one or two columns are allowed for specifying sector depending on the level of specificity. The activity data itself should have year or Xyear headers (e.g. 1950, 1951 or X1950, X1951).

A dataframe in CEDS format with all allowed columns might look like this:

iso	agg_fuel	CEDS_fuel	agg_sector	CEDS_sector	X1970	…
deu	coal	coal_coke	1A1_Energy-transformation	1A1a_Electricity-public	1150.79	…
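The column rules above can be sketched as a quick check before placing a file in the folder. This is an illustration only: check_user_data is a hypothetical helper, not part of CEDS, the values are invented, and the check assumes the country column is already named iso (i.e. the data is in CEDS form).

```r
# A small wide-form data frame in the layout described above (illustrative values)
user_data <- data.frame(
    iso         = c( "deu", "deu" ),
    agg_fuel    = c( "coal", "coal" ),
    CEDS_fuel   = c( "coal_coke", "hard_coal" ),
    CEDS_sector = c( "1A1a_Electricity-public", "1A1a_Electricity-public" ),
    X1970       = c( 1150.79, 980.00 ),
    X1971       = c( 1175.20, 1002.50 ),
    stringsAsFactors = FALSE
)

# Hypothetical helper: check the column rules stated in this section
check_user_data <- function( df ) {
    cols      <- names( df )
    has_iso   <- sum( cols == "iso" ) == 1
    has_fuel  <- any( c( "agg_fuel", "CEDS_fuel" ) %in% cols )
    year_cols <- grep( "^X?[0-9]{4}$", cols, value = TRUE )   # 1950 or X1950 style
    has_iso && has_fuel && length( year_cols ) > 0
}

is_valid <- check_user_data( user_data )
```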

5.1.2. The Mapping File: [filename]-mapping.xlsx

Since CEDS operates under the principle of preserving raw input data when possible, the input dataset does not need to be neatly named to CEDS sectors and fuels. The purpose of the mapping file is so the system can identify how input data corresponds to CEDS data.

There should be one sheet in this Excel file for each ID column in the input data, and the sheet names must be the name of the resulting CEDS column. If a data ID column is already in CEDS form, no mapping sheet is needed. There are five possible sheet names:

CEDS_sector
CEDS_fuel
agg_sector
agg_fuel
iso

Any mapping file may include any or all of these, as needed. Other sheets will not be identified.

Each sheet should contain two columns, one headed by the name of the column (same as the sheet name) and the other bearing the header corresponding to the header in the data frame. The data in the columns are the equivalent IDs.

The following is an example of what a mapping sheet titled "CEDS_sector" might look like:

my_sector_name	CEDS_sector
public_electric	1A1a_Electricity-public
auto_electric	1A1a_Electricity-autoproducer
heat_production	1A1a_Heat-production

The raw data corresponding to this example could look something like this:

iso	my_sector_name	agg_fuel	X1970	…
usa	public_electric	oil	16.21	…
usa	auto_electric	oil	105.5	…
usa	heat_production	oil	124.8	…
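To illustrate how such a mapping sheet relates raw IDs to CEDS IDs, the sketch below joins the two example tables with base R's merge(). It is not the CEDS implementation; it only shows the kind of lookup the mapping sheet describes, with the sheet contents typed in as a data frame rather than read from an .xlsx file.

```r
# Contents of the example "CEDS_sector" mapping sheet
mapping <- data.frame(
    my_sector_name = c( "public_electric", "auto_electric", "heat_production" ),
    CEDS_sector    = c( "1A1a_Electricity-public",
                        "1A1a_Electricity-autoproducer",
                        "1A1a_Heat-production" ),
    stringsAsFactors = FALSE
)

# The example raw data, using the user's own sector names
raw <- data.frame(
    iso            = "usa",
    my_sector_name = c( "public_electric", "auto_electric", "heat_production" ),
    agg_fuel       = "oil",
    X1970          = c( 16.21, 105.5, 124.8 ),
    stringsAsFactors = FALSE
)

# Attach the CEDS sector to each row, then drop the user's sector column
mapped <- merge( raw, mapping, by = "my_sector_name", all.x = TRUE )
mapped <- mapped[ , c( "iso", "CEDS_sector", "agg_fuel", "X1970" ) ]
```

Using all.x = TRUE keeps raw rows that have no mapping entry; they show up with NA in CEDS_sector, which makes gaps in the mapping sheet easy to spot.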

If your data cannot be easily mapped, you can make use of the preprocessing_script parameter described in section 5.2 below. If no mapping file is included, it is assumed the data is already correctly mapped. Alternatively, you can use an additional sector mapping file if you wish to retain aggregate sectors that do not correspond to the CEDS default aggregate sectors (see "Master_Sector_Level_map.csv" in the input/mappings directory). In that case, see the subsection "Alternative Mapping File" below.

5.1.2.1. Alternative Mapping File: [filename]_sector_map.csv

If you wish to use aggregate sectoral data that does not correspond to the CEDS default aggregate sectors (see "Master_Sector_Level_map.csv" in the input/mappings directory), you must provide an additional sector mapping file. The file must be a .csv named [filename]_sector_map.csv and placed in the input/extension/user-defined-energy directory. Its name must be entered under the column heading "sector_map" in the corresponding instructions file, in every row that will use this map. The mapping file must have a column named "CEDS_sector" listing the CEDS sectors you wish to map to your aggregate sectors, plus a column holding the aggregate sectors themselves (often headed "sector"). The aggregate sector column in the mapping file must have the same name as the corresponding column in your user data, while in the instructions file that column is headed "agg_sector".

5.1.3. The Instructions File: [filename]-instructions.csv

The instructions file is the place to define any parameters for how specifically to process the input dataset. This file is used to determine both which data to bring into the system from your dataset, and how it should be integrated into the default data.

The instructions file must have exactly one column giving information on the country, and at least one column giving the fuel type (agg_fuel, CEDS_fuel, or both). The instructions file should have a row for each combination of data in the corresponding data file:

iso	CEDS_fuel	CEDS_sector	start_year	end_year	options…
deu	coal_coke	1A1a_Electricity-public	1931	1934	…
deu	hard_coal	1A1a_Electricity-public	1932	1936	…
deu	brown_coal	1A1a_Electricity-public	1931	1936	…
deu	coal_coke	1A1a_Electricity-autoproducer	1931	1936	…

This example shows all of the necessary columns for reading in data with CEDS_fuel and CEDS_sector specificity. To include all isos, enter all in the iso column of the instructions file. To include all sectors, simply leave the sector column out of the instructions file, or alternatively provide the sector name all. CEDS provides several options (listed in Section 5.2 below) for specifying how to integrate the supplemental data into the default data.

Tip: These instructions must be in CEDS ID form because they specify how the system will use the data once mapped; they correspond directly to components of the CEDS activity data.

5.2. User Instructions Options

There are several instruction options that can be specified by the user. If a given option is not included, it will be set to its default. These options can be set for each row of the instructions file for a dataset by including a column with the option as the header (case-sensitive).

priority is a tool for manually specifying the order in which datasets are included in the system (see Default Order in Notes). Priority is given as integers; data with priority 1 will be dominant over priority 2, which will be dominant over data with no priority specified. Defaults to NA.
keep_total_cols is an argument that MUST be specified in a user’s instruction file. The value for this argument needs to be one of the 6 options listed below (an error message will be provided if keep_total_cols is not defined as one of these 6 options). Note that if the user provides a fuel and sector level in their user data which matches the option provided in their corresponding instructions file for keep_total_cols, the data will not be normalized (for example, if the user provides data at the agg_fuel level and has specified keep_total_cols as "agg_fuel"). If the user provides a fuel and sector level in their user data which is less detailed than the option provided for keep_total_cols in the corresponding instructions file, the data will also not be normalized (for example, if the user provides data at the agg_fuel level and has specified keep_total_cols as "agg_fuel, CEDS_fuel").
1. blank or NA - no normalization will occur
2. agg_fuel
3. [agg_fuel], CEDS_fuel
4. agg_fuel, agg_sector
5. [agg_fuel], CEDS_fuel, agg_sector
6. agg_fuel, [agg_sector], CEDS_sector
use_as_trend takes a boolean argument. If TRUE, the data will be used as a trend rather than as raw data; values will be scaled to CEDS values for a given match_year. Defaults to FALSE.
match_year takes an integer year argument. Required if use_as_trend is TRUE, otherwise defaults to NA.
start_continuity is used to specify whether data should be made continuous at its beginning. Takes a boolean; defaults to TRUE.
end_continuity (see start_continuity)
interpolation_method defines how to treat missing values in the data. Must be one of the following:
- linear (default)
- match_to_default — fills in missing values based on the trend of the default activity data
- match_to_trend — fills in missing values based on a trend provided by the user; if specified, the parameter matching_file_name must be present
matching_file_name is the name of a file containing values to be used as a trend for interpolating missing values from the data. Columns outside of the years specified by start_year and end_year will be ignored. Defaults to NA.
preprocessing_script is the name of an R script to be run before attempting to map or load the data associated with this instruction. Expects a file path relative to the user-defined-energy directory.
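As an illustration of how option columns sit alongside the ID columns, the sketch below builds a small instructions table and checks two of the dependencies stated above: match_year is required when use_as_trend is TRUE, and matching_file_name is required for match_to_trend. The option values, the trend file name, and the helper check_options are hypothetical, and the exact way CEDS expects booleans to be written in the .csv is an assumption.

```r
# Illustrative instructions table with a few option columns
instructions <- data.frame(
    iso                  = c( "deu", "deu" ),
    CEDS_fuel            = c( "coal_coke", "hard_coal" ),
    CEDS_sector          = c( "1A1a_Electricity-public", "1A1a_Electricity-public" ),
    start_year           = c( 1931, 1932 ),
    end_year             = c( 1934, 1936 ),
    priority             = c( 1, NA ),
    keep_total_cols      = c( "agg_fuel", "agg_fuel" ),
    use_as_trend         = c( TRUE, FALSE ),
    match_year           = c( 1934, NA ),
    interpolation_method = c( "linear", "match_to_trend" ),
    matching_file_name   = c( NA, "my_trend_file" ),   # hypothetical file name
    stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each row whose dependent options are filled in
check_options <- function( df ) {
    trend_ok  <- !df$use_as_trend | !is.na( df$match_year )
    interp_ok <- df$interpolation_method != "match_to_trend" |
                 !is.na( df$matching_file_name )
    trend_ok & interp_ok
}

rows_ok <- check_options( instructions )
```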

5.3. Operations

This section details some of the major functions of the user data processing system.

5.3.1. Mapping

Mapping occurs during pre-processing of data, but after running any user pre-processing script. This step uses user-specified *-mapping.xlsx files to bring data into CEDS form. Any data at the detail level of CEDS_fuel or CEDS_sector will automatically have the aggregate fuel or sector mapped on.

5.3.2. Interpolation

Interpolation occurs during pre-processing of data. The process fills holes in data that has gaps or that is reported less often than annually (e.g. every 5 years). Interpolation can occur linearly (the default) or along a trend, as selected by the interpolation_method option in [filename]-instructions.csv.
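As a minimal sketch of the linear case, base R's approx() can fill annual values between reported years. The numbers are invented, and this is only an illustration of linear interpolation, not the CEDS routine itself.

```r
# Activity reported every 5 years (illustrative values)
known_years  <- c( 1950, 1955, 1960 )
known_values <- c( 10, 20, 15 )

# Fill every year from 1950 to 1960 by linear interpolation
annual <- approx( x = known_years, y = known_values, xout = 1950:1960 )

# Label the result with Xyear names, as used in CEDS data columns
filled <- setNames( annual$y, paste0( "X", annual$x ) )
```

Here the value for 1952 falls two fifths of the way from 10 to 20, and the value for 1957 two fifths of the way from 20 down to 15.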

5.3.3. Normalization

Normalization is the process by which data is included in the greater activity database without losing aggregate totals. CEDS activity defaults are generated by using percentage breakdowns to disaggregate high-level (aggregate fuel per country) data. When user-specified data is added, the system will include it by offsetting the user-defined changes in other areas of the aggregate group.

By adding specific fuel by sector activity in one place, CEDS adjusts the breakdown of fuel activity, not the total fuel activity.

Normalization Exceptions:

Whole-group overwrite: if all elements of an aggregate group are specified, the aggregate sum is overwritten (see Batching).
If a user-specified subset exceeds an aggregate group total, that total will be overwritten.
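A small worked example may help make the ordinary (non-exception) case concrete. The sketch below assumes that the unspecified members of the aggregate group are rescaled proportionally so that the group total is unchanged; the proportional rule, the helper normalize_group, and the numbers are assumptions for illustration, not a description of the actual CEDS code.

```r
# Hypothetical helper: keep the aggregate total fixed and rescale the
# unspecified members proportionally (an assumed rule)
normalize_group <- function( defaults, user ) {
    total  <- sum( defaults )
    others <- setdiff( names( defaults ), names( user ) )
    # The exceptions listed above (whole group specified, or user data
    # exceeding the total) are not handled in this sketch
    stopifnot( length( others ) > 0, sum( user ) <= total,
               sum( defaults[ others ] ) > 0 )
    out <- defaults
    out[ names( user ) ] <- user
    remaining     <- total - sum( user )
    out[ others ] <- defaults[ others ] * remaining / sum( defaults[ others ] )
    out
}

# Default breakdown of an aggregate coal total of 100 (illustrative)
defaults   <- c( hard_coal = 50, brown_coal = 30, coal_coke = 20 )

# User data raises hard_coal to 70; the other fuels shrink to compensate
normalized <- normalize_group( defaults, user = c( hard_coal = 70 ) )
```

In this example brown_coal and coal_coke fall to 18 and 12, so the group still sums to 100 while its breakdown changes.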

5.3.4. Batching

If several instructions correspond to the same aggregate group, these instructions will need to be processed together all at once. Groups of user data in the same batch are handled as a single input, in that they are normalized in one step. In the case that a user specifies rows of data for an entire aggregate group for a given time period, they will be batched together and will overwrite the normalization process. If they have different but overlapping year ranges, each dataset will be subsetted to year ranges allowing for the processing of overlapping sections separate from non-overlapping sections.

5.3.5. Enforcing Continuity

By default, user-specified data is made continuous with the CEDS defaults at its beginning and end. The data are linearly adjusted over a specified year range (7 years by default, fewer if necessary) so that the value of the first year represents 1/7 new data and 6/7 CEDS data and the value of the 7th year is 6/7 of the new data plus 1/7 of CEDS data.

5.4. Notes

5.4.1. Default Order of Operations

Instructions are ordered by:

Priority
Aggregation specificity
Start year

This means that data with higher priority will supersede data with lower priority; within equal priority, more specific data will supersede less specific (more aggregate) data; and, all else being equal, older data (earlier start year) will supersede newer data. This order only matters if more than one dataset will impact the same activity cell.
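The three-level ordering can be sketched with base R's order(). The dataset labels and the numeric specificity rank are hypothetical (here 1 stands for the most specific level), and rows with no priority are placed after all rows that have one, as described for the priority option.

```r
# Illustrative instructions; specificity is a hypothetical rank (1 = most specific)
instructions <- data.frame(
    dataset     = c( "A", "B", "C", "D" ),
    priority    = c( NA, 2, 1, 1 ),
    specificity = c( 1, 1, 2, 1 ),
    start_year  = c( 1950, 1960, 1970, 1940 ),
    stringsAsFactors = FALSE
)

# Treat a missing priority as lower than any stated priority
prio <- ifelse( is.na( instructions$priority ), Inf, instructions$priority )

# Order by priority, then specificity, then start year
ranked <- instructions[ order( prio,
                               instructions$specificity,
                               instructions$start_year ), ]
```

In this example D comes before C because both have priority 1 but D is more specific, and A comes last because it has no priority.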

6. Code Structure and Guide

The Community Emissions Data System (CEDS) is at its core a selection of R scripts and data files linked together by a Makefile. CEDS is flexible to user input. Throughout the system are built-in mechanics for automatically identifying and processing user-added data and scripts.

CEDS code execution is divided into modules, groups of code executed together for a common purpose. The nine CEDS modules are as follows:

Name	Purpose
Module A	Activity and driver data processing
Module B	Combustion emissions factors
Module C	Non-combustion emissions and emissions factors
Module D	Default emissions calculations
Module E	Emissions inventory processing
Module F	Scaling to inventories
Module G	Gridding
Module H	Historical extension
Module S	Summary and final data processing

This documentation provides information module by module. To find instructions for a desired change or input, identify the module purpose which best fits the aspect of CEDS you will change.

6.1. Module A (Activity Data)

Module A runs initial processing on driver data, and creates the total activity driver database.

Module A is not designed to be as flexible as the other modules. Preserving Module A defaults is recommended, except where overwriting a particular input. In general, additional supplemental data is best added later in the system.

Module A is unique in CEDS in that it contains no emissions-specific processing. It handles activity and driver data, and not emissions or emissions factors. Because of this, Module A only needs to be executed once even during a recursive make.

6.1.1. Module A.1

Population data is created from UN and HYDE population inputs. Adjustments to population data must be made in these inputs or in A1.1.UN_pop_WB_HYDE_extension.R.
A.1* contains other driver scripts dependent on only population (biomass dataset, pre-processing of IEA energy data, coal heat content). Pre-processing emissions-nonspecific scripts can be added to this section.

6.1.2. Module A.2

Module A.2 handles specific adjustments to IEA data, including converting to CEDS sectors and fuels.

6.1.3. Module A.3, A.4

Modules A.3 and A.4 handle expanding the activity database to include complete CEDS specificity and fuel/sector combinations. The results of this section are the activity databases A.comb_activity.csv and A.NC_activity_energy.csv, which store activity data defaults used throughout CEDS.

6.1.4. Module A.5

A.5 is responsible for processing various non-combustion drivers.

6.1.5. Specifying combustion vs. non-combustion sectors

Combustion Energy data is primarily from IEA and BP data (processed in Module A2-A4), while non-combustion driver data is from various sources (Module A5).
Combustion or non-combustion sectors are specified in the Master_Fuel_Sector_List.xlsx. IEA process sectors are identified in IEA_process_sector.csv.
The important distinction between combustion and non-combustion activity is the driver; combustion sectors have fuel drivers, while non-combustion sectors have proxy process driver data (population, pulp paper production, etc.).

6.2. Module B (Combustion Emission Factors)

Module B is responsible for processing combustion emissions factors.

6.2.1. Structure

Module B executes in 3 steps:

B1.1 creates blank or base-level databases for default emissions factors, activity data, and default emissions (B1.1.base_…)
B1.2 reformats specific datasets and uses header functions to add the results to their databases. (B1.2.add_…) There can be any number of “add” scripts per section.
B1.3 “processes” activity (B1.3.proc_…)

Module B uses a parental structure to call scripts. “B1.1.base_comb_EF.R” and “B1.2.add_comb_EF.R” are the only two scripts executed by the Makefile. Each script identifies and executes a series of other scripts based on the emissions species, for example:

```
if ( em == "BC" || em == "OC" ){
  scripts <- c( 'B1.2.add_BCOC_recent_control_percent.R' )
}
…
invisible( lapply( scripts, source_child ) )
```

Any script added to the list “scripts” as a string will be executed by the parent script.

There are two types of B1.2 files. Some files generate processed data as intermediate output files, creating data on control percents, ash content, etc. (most notably for emissions species SO₂). Other scripts read in all data files of a certain type, which may have been produced earlier in B1.2, or may have been included as defaults.

6.2.2. Adding a Processing Script

Adding a processing script to Module B requires:

A script in the module-B folder, named according to conventions described in the CEDS style guide.
A change to whichever parent file is appropriate for sourcing the new script.
Any input data will need to be included in the input folder.

An example of a change in a parent script: if I want to add a new BCOC processing file, 'B1.2.add_BCOC_additional.R', the above would become:

```
if ( em == "BC" || em == "OC" ){
  scripts <- c( 'B1.2.add_BCOC_recent_control_percent.R', 'B1.2.add_BCOC_additional.R' )
}
```

This modular structure means that no changes to the Makefile are needed to add scripts in Module B.

6.2.3. Adding Combustion Emissions Factor Data

Raw emissions factors can be directly incorporated into the CEDS emissions factor database.

Save the data in a .csv file with columns for iso, fuel, sector, unit (usually "fraction"), and data years in Xyears, in the folder input/default-emissions-data/EF_parameters/. Name the file U.[em]_*[suffix].csv where * represents any descriptive, meaningful title for the data and [suffix] is any of the following:

Pattern	Use
"_EF"	Adds the data as raw emissions factors
"_control_percent"	Adds the data as control percents (SO₂ only)
"_s_ash_ret"	Adds data as sulfur ash retention data (SO₂ only)
"s_content"	Adds data as sulfur content data (SO₂ only)

Files without any of these suffixes, or without the emissions species in the file name, are ignored.

Files in this directory are processed by species, in alphabetical order. Data in files read in later will overwrite data in files read in earlier. For this reason, user files should begin with U. so that their data is read late and given priority.
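The naming rules above can be sketched as a filter on file names. Everything here is illustrative: the file names are invented, the way the species is detected in the name is an assumption, and the C-locale sort is only one possible reading of "alphabetical order"; none of it is taken from the CEDS source.

```r
# Hypothetical file names in the EF_parameters folder
files <- c( "U.SO2_mydata_EF.csv", "U.SO2_plant_control_percent.csv",
            "U.SO2_notes.csv", "U.NOx_mydata_EF.csv", "SO2_default_EF.csv" )
em <- "SO2"

# Suffixes from the table above, matched just before the ".csv" extension
suffixes       <- c( "_EF", "_control_percent", "_s_ash_ret", "s_content" )
suffix_pattern <- paste0( "(", paste( suffixes, collapse = "|" ), ")\\.csv$" )

# Assumed species test: the name starts with "<em>_" or contains ".<em>_"
em_pattern <- paste0( "(^|\\.)", em, "_" )

is_used <- grepl( suffix_pattern, files ) & grepl( em_pattern, files )

# Sort the retained files byte-wise (C locale); later files would overwrite earlier ones
read_order <- sort( files[ is_used ], method = "radix" )
```

With these example names, the two U. files sort after the hypothetical default file, so their data would be the last written.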

6.2.4. Output Files

Module B outputs B.[em]_comb_EF_db.csv, a database of combustion emissions factors.

Scripts in the "add" section (B1.2) also write files to the folder input/default-emissions-data/EF_parameters/.

6.3. Module C (Non-combustion Emissions)

Module C is responsible for processing non-combustion emissions and emissions factors.

6.3.1. Structure

Module C follows the same three-part structure as Module B:

C1.1 creates blank or base-level databases for default emissions factors, activity data, and default emissions (C1.1.base_…)
C1.2 reformats specific datasets and uses header functions to add the results to their databases. (C1.2.add_…) There can be any number of “add” scripts per section. C1.2 uses a parent script model.
C1.3 “processes” activity (C1.3.proc_…) C1.3 (the “process” group) does not use a parent script model, so adding a process script requires editing the Makefile.

6.3.2. Adding Non-Combustion Emissions

Non-combustion emissions can be added to Module C without the inclusion of a new script. There are two ways to do this.

Save a dataset as a .csv file in the folder input/non-combustion-emissions with headers indicating iso, fuel, sector, and years of emissions (the data will be in wide form, with a column for each year). Emissions will be linearly interpolated if there are missing years. Emissions are extended forward and backward from the years supplied with a constant emissions factor. This means that for most process emission sectors emissions will be scaled with population. If this is not appropriate, it is best to supply emissions over the entire modern time period (1960 forward).

The second method to add to the default emissions database is to add lines to the input/non-combustion-emissions/add_inventory_instructions.csv instruction file. This will take data from processed emission outputs and add these, as emission factors, to the default emissions database. As above, these will be added as constant emission factors before and after the specified years. This is a good way to correct for large scaling factors in instances where default process emissions data were not a good match for a scaling inventory.

6.4. Module D (Initialize Emission Database)

Module D contains a single script for initializing emissions databases based on driver and activity data calculated in modules A through C. It is relatively inflexible and is meant to bridge emissions factors + drivers and emissions. It calculates emissions, creating a default that will be scaled and extended by modules F and H.

6.5. Module E (Pre-process Emission Inventory Data)

Module E processes emissions inventories. Each script is tailored to its particular inventory. Each script outputs a processed form of the raw inventory made compatible for CEDS analysis.

Module E is typically executed immediately after Module A.

6.5.1. Scripts

Typically, Module E scripts have three sections.

The first defines inventory-specific parameters: file paths, year ranges, etc.
The second reads in and processes the data, shaping the inventory to a standard format (wide-form, iso tags) but does not map to CEDS sectors or fuels.
The third writes the data to intermediate-output.

Module E scripts diverge from this format when further data processing is required to put the inventory data in standard form.

6.5.2. Adding an Inventory Processing Script

Add raw input files to input/emissions-inventories/
Add a processing script to the module-E/ folder
Add a section of code to the Makefile in the area handling emissions inventories. The line should look like the following (for example script “E.myinventory_emissions.R”):
```
# process emissions from 'myinventory'
$(MED_OUT)/E.$(EM)_myinventory.csv : \
	$(MOD_E)/E.myinventory_emissions.R
	Rscript $< $(EM) --nosave --no-restore
```
This code indicates that “module-E/E.myinventory_emissions.R” needs to be executed as an Rscript, and that it will produce the output file E.[em]_myinventory.csv.

6.6. Module F (Inventory Scaling)

The purpose of Module F is to scale subsets of CEDS emissions data to the emissions data reported in other inventories. In doing so, CEDS reinforces its accuracy at an aggregate level while retaining the specificity of CEDS fuels, sectors and isos that distinguish the model from the scaling inventories.

6.6.1. Structure of Execution

Module F consists of:

A header file, emissions_scaling_functions.R
- The header file contains generalized functions that are called in each scaling script. These functions are used to read and write data, apply mapping files, and perform scaling calculations.
A parent script, F1.inventory_scaling.R
- The parent script calls inventory-specific scaling scripts depending on the emissions species.
A series of scaling scripts corresponding to each inventory, (e.g. F1.1.UNFCCC_scaling.R)
- Each scaling script reads in an inventory dataset and updates the default data in the CEDS data sets.
Mapping files for each inventory dataset used

Module F is executed by running the parent script. Depending on the emissions species provided, the parent script calls a series of scaling scripts, which execute scaling and then write to an intermediate output file to be scaled by the next script. Scaling the same region more than once will overwrite the earlier scaled values. This means that the order of the scaling scripts is important, and inventories with greater accuracy should be included later to avoid being overwritten by a less accurate inventory.

6.6.2. Structure of Scripts

Each scaling script has a similar structure:

Section 0: Universal section, the same for all scripts
Section 1: Defines inventory-specific variables such as file names, countries, years the inventory includes, and scaling method
Section 1.5: Import inventory-specific data and put in standard inventory format (iso-sector-fuel-years or iso-sector/fuel-years)
Section 2: Read in all other scaling data and define variables using scaling functions
Section 3: Aggregate CEDS and inventory data to scaling sectors/fuels using scaling functions
Section 4: Calculate scaling factors and apply scaling factors to default emissions and emission factors using scaling functions
Section 5: Write scaled data to intermediate output file

Sections 1 and 1.5 are unique to each inventory used for scaling. Sections 0 and 2-5 can be identical for all scaling scripts, unless the user would like to define different default options in Section 4 to create scaling factors with the function “F.scaling”.

6.6.3. Required Files

Inventory files can be Excel sheets that are imported and processed to standard format within the scaling routine (e.g. Canada), or imported and processed within Module E (e.g. UNFCCC). By Section 2, inventory data must be in standard form with iso, CEDS sector/fuel (or both) columns and years in Xyear format.

Instruction files define how to relate scaling inventory and CEDS default data through scaling sectors or scaling fuels, as well as iso-sector-fuel specific options for scaling routines. Instruction files must be in .csv format and located in the CEDS/input/mappings/scaling folder. A “mapping” instruction file must be provided; “method” and “year” instruction files can optionally be provided if needed. The “mapping” instruction file is named as follows: <inventory_name>_scaling_mapping<_extra_information>.csv (for example “EMEP_NFR09_scaling_mapping_SO2.csv”). For “method” and “year” files, _mapping is replaced by _method or _year, respectively.

The “mapping” instructions relate the inventory data to the CEDS data by scaling method: either fuel, sector or both. It relates the inventory sector/fuel to the scaling sector/fuel and the scaling sector/fuel to CEDS sector/fuel. For example using the sector scaling method, the inv_sector column maps to the scaling_sector column, and the ceds_sector column maps to the scaling_sector column, but the inv_sector column does not map to the ceds_sector column. Entries on the same row in the inv_sector and ceds_sector columns have no meaning. Inventory sectors/fuels or CEDS sectors/fuels should only be mapped to one scaling sector (although multiple inventory or CEDS sectors/fuels can be mapped to one scaling sector). If an inventory or CEDS sector/fuel is mapped to more than one scaling sector/fuel, the system will match to the first pair in the data frame. The selected scaling sectors/fuels are applied to all countries in the inventory. An example section from a mapping file is shown below:

inv_sector	scaling_sector	ceds_sector	Notes
Electricity and gas supply	energy	1A1a_Electricity-public
Industry_Electricity	energy	1A1a_Heat-production
Industry_Oil refinery	other-transformation	1A1bc_Other-transformation
…	…	…	…
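To illustrate how the two independent mappings work under the sector method, the sketch below looks up scaling sectors with base R's match(), which returns the first matching entry and so mirrors the "first pair" behaviour described above. The inventory values are invented, and this is an illustration of the lookup, not the CEDS scaling functions.

```r
# The example mapping rows from the table above
mapping <- data.frame(
    inv_sector     = c( "Electricity and gas supply", "Industry_Electricity",
                        "Industry_Oil refinery" ),
    scaling_sector = c( "energy", "energy", "other-transformation" ),
    ceds_sector    = c( "1A1a_Electricity-public", "1A1a_Heat-production",
                        "1A1bc_Other-transformation" ),
    stringsAsFactors = FALSE
)

# Hypothetical inventory emissions by inventory sector
inv <- data.frame(
    inv_sector = c( "Electricity and gas supply", "Industry_Electricity",
                    "Industry_Oil refinery" ),
    X2000      = c( 10, 5, 3 ),
    stringsAsFactors = FALSE
)

# inv_sector -> scaling_sector; match() uses the first matching row
inv$scaling_sector <- mapping$scaling_sector[ match( inv$inv_sector,
                                                     mapping$inv_sector ) ]

# Aggregate the inventory to scaling sectors
inv_agg <- aggregate( X2000 ~ scaling_sector, data = inv, FUN = sum )

# ceds_sector -> scaling_sector is looked up separately, in the same way
ceds_lookup <- mapping$scaling_sector[ match( "1A1a_Heat-production",
                                              mapping$ceds_sector ) ]
```

Both inventory electricity sectors are summed into the single scaling sector energy, which is then compared with the CEDS sectors mapped to that same scaling sector.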

The optional “method” file defines interpolation and extrapolation methods for handling data if they differ from the default. The F.scaling function is used to execute the instructions in this file. Method file columns include:

iso: can be specified "all" (meaning all CEDS isos) or specific isos
scaling_sector: cannot be "all". This must be specified for each sector that departs from the default method.
other: space for an additional parameter if needed by specified method (see linear_1 example below)
pre_ext_method: how the data will be extended backward in time from its beginning
interp_method: how internal holes (missing years) in inventory data will be filled. The default is that emission factors are linearly interpolated between inventory years.
post_ext_method: how the data will be extended forward in time from its end

An example selection from a method file is shown below:

iso	scaling_sector	other	pre_ext_method	interp_method	post_ext_method
twn	SLV	2000	linear_1	linear	constant
twn	waste_water	2000	linear_1	linear	constant
twn	waste-incineration	2000	linear_1	linear	constant
twn	AGR	2000	linear_1	linear	constant
twn	rail	1999	linear_1	linear	constant

Extension methods include:

method	description	valid columns
constant	use the edge scaling factor constantly across all extension years	all
linear	extend the scaling factor trend linearly	all
linear_1	linearly extend the scaling factor to reach a value of 1 in either the final extension year (post_ext_method) or the year specified in the other column (pre_ext_method)	post_ext_method, pre_ext_method
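As a rough sketch of two of these methods applied forward in time, the function below extends a final scaling factor either as a constant or linearly towards 1 in the last extension year. extend_post is a hypothetical helper written for this illustration; it is not the F.scaling implementation, it assumes end_year is later than last_year, and it ignores the linear method and the pre-extension case.

```r
# Hypothetical helper: extend the last inventory-year scaling factor forward
extend_post <- function( last_sf, last_year, end_year, method = "constant" ) {
    years <- ( last_year + 1 ):end_year
    sf <- switch( method,
        # constant: hold the edge scaling factor for all extension years
        constant = rep( last_sf, length( years ) ),
        # linear_1: move linearly from the edge value to 1 in the final year
        linear_1 = approx( x = c( last_year, end_year ),
                           y = c( last_sf, 1 ),
                           xout = years )$y,
        stop( "unsupported method: ", method ) )
    setNames( sf, paste0( "X", years ) )
}

sf_constant <- extend_post( 2, last_year = 2010, end_year = 2014, method = "constant" )
sf_linear_1 <- extend_post( 2, last_year = 2010, end_year = 2014, method = "linear_1" )
```

Starting from a scaling factor of 2 in 2010, the linear_1 result steps down by 0.25 per year and reaches 1 in 2014.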

The optional "year" file defines the year extent of the scaling process. It allows the user to extend scaling factors to different years for individual iso-sector/fuels. It follows a similar structure to the "method" file with these columns:
- iso: can be "all" or specific isos
- scaling_sector: cannot be "all". Must be specified for each sector.
- pre_ext_year: The year in which the scaling data will begin (after extension, if necessary)
- post_ext_year: The year in which the scaling data will end (after extension, if necessary)

6.6.4. Defined Variables

The following variables must be defined in Section 1 of any scaling script in order to use the modular Sections 2-5.

inventory_data_file - the name of the inventory file, without the extension
inv_data_folder - name of the path to the folder the inventory file is in, from domainmapping.csv (usually "EM_INV" for the CEDS/input/emissions-inventories/ directory)
sector_fuel_mapping - the name of the inventory mapping file, without the extension
mapping_method - mapping method. Must be "sector", "fuel", or "both"
inv_name - name of the inventory (for labeling diagnostic/intermediate output, not for reading input files)
region - iso countries included in the inventory
inv_years - years covered by the inventory
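A Section 1 block defining these variables might look like the sketch below. All values are placeholders for a hypothetical inventory called "myinventory"; only the variable names and the allowed values of mapping_method come from the list above.

```r
# Section 1 sketch for a hypothetical inventory (all values are placeholders)
inventory_data_file <- "myinventory"                  # file name without extension
inv_data_folder     <- "EM_INV"                       # folder key from domainmapping.csv
sector_fuel_mapping <- "myinventory_scaling_mapping"  # mapping file name without extension
mapping_method      <- "sector"                       # "sector", "fuel", or "both"
inv_name            <- "myinventory"                  # label for diagnostic output
region              <- c( "deu", "fra" )              # isos covered by the inventory
inv_years           <- 1990:2012                      # years covered by the inventory

# Guard against an invalid mapping method before the modular sections run
stopifnot( mapping_method %in% c( "sector", "fuel", "both" ) )
```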

6.6.5. Scaling Functions

The following functions are used throughout Module F. They are defined in code/parameters/emissions_scaling_functions.R.

F.readScalingData( inventory=inventory_data_file, inv_data_folder, mapping=sector_fuel_mapping, method=mapping_method, region, inv_name, inv_years )

Reads in all scaling data, defines variables for scaling and assigns them to the global environment.
F.invAggregate( std_form_inv, region, mapping_method, zeroed_terms=c(NA, 'NA', 'NA ', '-'))

Aggregates inventory data to scaling sectors/fuels. There are no user-defined options in this function.
F.cedsAggregate( input_em, region, method=mapping_method )

Aggregates CEDS data to scaling sectors/fuels. There are no user-defined options in this function.
F.scaling( ceds_data, inv_data, region, ext_start_year=start_year, ext_end_year=end_year, ext=TRUE, interp_default='linear', pre_ext_default='constant', post_ext_default='constant', replacement_method='none', max_scaling_factor=100, replacement_scaling_factor=max_scaling_factor )

Calculates scaling factors where both inventory and CEDS data are available. Interpolates and extends scaling factors forward and backward if ‘ext’ = TRUE. Also checks and replaces scaling factors if too small or too large.

Parameters:
- ext_start_year - Year to extend scaling factors back to. Defaults to global environment variable ‘start_year’ (currently 1960)
- ext_end_year - Year to extend scaling factors forward to. Defaults to global environment variable ‘end_year’ (defined in code/parameters/common_data.R)
- interp_default - Default interpolation method for scaling factors within the inventory years. Either ‘interpolation’ or ‘constant’. Defaults to linear interpolation.
- pre_ext_default - Default extrapolation method for pre inventory years. Either ‘interpolation’ or ‘constant’. Defaults to ‘constant’.
- post_ext_default - Default extrapolation method for post inventory years. Either ‘interpolation’ or ‘constant’. Defaults to ‘constant’.
- replacement_method - Either 'none' or ‘replace’. If ‘replace’ then function checks scaling factors and replaces values above and below the threshold defined by max_scaling_factor.
- max_scaling_factor - If replacement_method is ‘replace’, scaling factors greater than max_scaling_factor or less than 1/max_scaling_factor are replaced by replacement_scaling_factor or 1/replacement_scaling_factor, respectively.
- replacement_scaling_factor - Value that replaces scaling factors that are too large. Defaults to max_scaling_factor. Scaling factors that are too small are replaced by 1/replacement_scaling_factor.
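
The behavior described by the F.scaling parameters above can be illustrated with a minimal base-R sketch. This is not the CEDS implementation: the input values, the helper name clamp_scaling_factors, and the use of approx() are assumptions made for illustration only. It computes scaling factors as inventory divided by CEDS default emissions, interpolates linearly between inventory years, holds the first and last factor constant outside them, and then replaces factors outside the allowed range.

```r
# Illustrative inputs (hypothetical values, one country/sector)
inv_years <- c(1990, 1995, 2000)
inv_em    <- c(120, 100, 90)    # inventory emissions
ceds_em   <- c(100, 100, 100)   # CEDS default emissions in the same years

# Scaling factor where both inventory and CEDS data are available
sf_known <- inv_em / ceds_em

# Linear interpolation within the inventory years; rule = 2 holds the
# first/last value constant outside them (a 'constant' extension)
all_years <- 1985:2005
sf <- approx(x = inv_years, y = sf_known, xout = all_years, rule = 2)$y
names(sf) <- all_years

# Hypothetical helper mimicking replacement_method = 'replace'
clamp_scaling_factors <- function(sf, max_scaling_factor = 100,
                                  replacement_scaling_factor = max_scaling_factor) {
  too_large <- which(sf > max_scaling_factor)
  too_small <- which(sf < 1 / max_scaling_factor)
  sf[too_large] <- replacement_scaling_factor
  sf[too_small] <- 1 / replacement_scaling_factor
  sf
}

sf <- clamp_scaling_factors(sf)
```

With the default max_scaling_factor of 100, none of the factors in this example are replaced; passing a vector such as c(0.001, 1, 500) to the helper shows the replacement at both ends.
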
F.applyScale (scaling_factors)

Applies scaling factors to CEDS default data. Creates scaled EF and scaled emissions.
F.write( scaled_ef=scaled_ef, scaled_em=scaled_em, domain="MED_OUT")

Writes scaled emissions factors to intermediate output folder.

6.6.6. Value Metadata

Module F tracks scaling by collecting scaling value metadata. The script global_settings.R contains a boolean switch, Write_value_metadata; if TRUE, CEDS will generate value metadata reports across every combination of fuel, sector, iso, and year indicating which scaling factors were applied and whether the cell was scaled directly to an inventory or to an extension of an inventory.

The output file of this process is F.[em]_scaled_EF-value_metadata.csv.

Two diagnostic scripts, code/diagnostic/Create_Val_Metadata_Heatmap.R and code/diagnostic/Create_Master_Val_Meta_Heatmap.R, provide functions for analyzing and graphically displaying trends in the value metadata.

6.6.7. Adding A New Inventory

Adding a new inventory can be done in the following steps:

Add a Module E script to process the inventory data into CEDS format. Note that in most cases it is advised to leave the inventory data in the inventory's native sectors. Conversion to standard metric units, however, should be done here.
Update the Makefile to reflect the new Module E script and associated dependencies
Add a Module F scaling script. This can be done with minor changes to an existing Module F scaling script.
Add a Module F sector mapping file for that inventory (referenced from within the new scaling script)
Add the new scaling script to the master Module F script F1.inventory_scaling.R for the relevant emission species.

6.7. Module G (Gridding)

Module G handles gridding, the process by which spatial distributions of CEDS final emissions are calculated.

6.7.1. Structure

Module G is composed of three main sections. Each section executes 4 scripts. Scripts are executed sequentially; no parent script is used. Twelve grids are created for each year (monthly emissions, incorporating seasonality) from 1750-2014, for each emissions species and sector.

The three main sections are:

G1 creates yearly spatial grids. Each netCDF file contains 12 months and the sectors appropriate for that section. (These temporary files are stored at intermediate-output/gridded-emissions)
G2 chunks these grids in 50-year groups. These aggregated emission files are the final gridded output and can be found at final-emissions/gridded-emissions.
G3 creates grids and chunks for Methane. Methane is treated separately since the CEDS release data only produced Methane emissions for recent decades. A separate, approximate, extension is provided as supplemental data.

Each section's scripts each handle a different type of input data:

G*.1 handles bulk emissions. The input data for this grid is CEDS final emissions by country and sector (no fuel information) for all sectors except aircraft. These scripts handle each emissions species and each sector.
G*.2 handles speciated VOC ('subVOC') emissions. For NMVOC emissions, individual grids are generated for each VOC sub-species.
G*.3 handles aircraft emissions. In addition to 12 monthly grids for each year, aircraft emissions have 25 levels of gridding corresponding to different altitudes.
G*.4 produces gridded emissions from solid biofuels. (Note that these are already included in the aggregate emission files, but are broken out at the request of users.)
G*.5, like G*.4, produces gridded emissions for user-specified fuels. (Again note that these fuels are already included in the aggregate, but will be broken out and generate additional files.) This works in conjunction with custom_fuels_to_grid.csv in the gridding_mappings directory. In the rows for the fuels you would like to grid, enter the desired extension for your output file.

6.7.2. General Methodology

Spatial distributions are generated by applying CEDS final emissions to normalized country-level spatial proxy data. Spatial proxies are chosen for each gridding sector, emissions species, and year in input/gridding/gridding-mappings/proxy_mapping.csv. The CEDS gridding routines are described in Feng et al. (2019).
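
As a rough illustration of the idea of applying an emissions total to a normalized proxy, the following base-R sketch distributes one national total over a small grid. The proxy values, grid size, and function name distribute_to_grid are hypothetical and are not taken from the CEDS gridding code.

```r
# Hypothetical 2 x 3 proxy grid for one country (e.g., a population-like proxy)
proxy <- matrix(c(10,  0, 30,
                  20, 40,  0), nrow = 2, byrow = TRUE)

# Hypothetical helper: spread a national total over the proxy pattern
distribute_to_grid <- function(total, proxy) {
  stopifnot(sum(proxy) > 0)          # a proxy of all zeros cannot be normalized
  proxy_norm <- proxy / sum(proxy)   # cells now sum to 1 within the country
  total * proxy_norm
}

country_emissions <- 50              # illustrative total for one sector and year
gridded <- distribute_to_grid(country_emissions, proxy)
```

Because the proxy is normalized to sum to 1, the gridded cells add back up to the national total, which is the property the gridding checksums rely on.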

6.7.3. Use Instructions

Emissions by country, sector, and fuel must first be generated by running the CEDS system.

In order to produce gridded emissions, the gridded proxy data must be obtained from Zenodo through this link.

The package there contains four folders: mask, proxy, proxy-backup, and seasonality.

Copy these folders into the input/gridding folder in your CEDS directory. Assuming you do not have any previously modified CEDS proxy data files in your system, you can replace the folders that are already there from the GitHub distribution (you will note that those folders are otherwise empty in the CEDS GitHub distribution).

Gridded emissions can then be produced using the makefile system, for example make BC-gridded. Note that if the final emissions file is not up to date, the make BC-gridded command will re-run the entire system as needed.

Note: users should edit the netCDF metadata as instructed in the file:

`code/parameters/nc_generation_functions.R`

to reflect their project and contact information.

6.7.3.1. Run Gridding via Command Line

For some purposes it may be useful to run the gridding commands individually, such as:

`Rscript code/module-G/G1.1.grid_bulk_emissions.R BC  --nosave --no-restore`

to produce annual emission grids over all years, and

`Rscript code/module-G/G2.1.chunk_bulk_emissions.R BC  --nosave --no-restore`

to aggregate emissions into 50-year chunks.

Specific commands for other emissions can be found in the Makefile.

Note that the gridding routines, when run directly as indicated above, can be used to map user generated data to spatial grids, even using input data generated outside of CEDS. To do this first provide the user-generated data as indicated in the following template:

`intermediate-output/XX_total_CEDS_emissions_template.csv`

Most of the gridding routines use summary data, which will need to be generated by running:

`Rscript code/module-S/S1.1.write_summary_data.R BC  --nosave --no-restore`

If providing user-edited information do not use the makefile system, as this may overwrite the user-supplied emissions information, and instead use the gridding commands directly as described above.

6.7.3.2. Changing Gridding Years

The years used for producing the annual grid files and the chunked grid files are set separately.

In the G*.1 files, the year range for gridding is set by the start_year and end_year arguments in the gridding_initialize function call, which is near the beginning of each gridding script.

The range of years used for chunking, and the length of each chunk, are set in the chunk_emissions function (in nc_generation_functions.R). The settings here will be used for chunking operations for all emissions.
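
To make the notion of a chunk range and chunk length concrete, here is a small base-R sketch that splits a year range into fixed-length groups. The function chunk_years and the way chunks are aligned to the start year are assumptions for illustration; the actual chunk_emissions function may align its chunks differently.

```r
# Hypothetical helper: split a year range into chunks of a given length
chunk_years <- function(start_year, end_year, chunk_length = 50) {
  years    <- start_year:end_year
  chunk_id <- (years - start_year) %/% chunk_length  # 0 for the first chunk, 1 for the next, ...
  unname(split(years, chunk_id))
}

chunks <- chunk_years(1750, 2014)

# Label each chunk by its first and last year
chunk_labels <- vapply(chunks,
                       function(y) paste(min(y), max(y), sep = "-"),
                       character(1))
```

The last chunk is simply shorter when the year range is not a multiple of the chunk length.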

6.7.3.3. Remove Country from Grid

An option is available to remove a single country from the gridded data. This is done by specifying the iso in common_data.R with the variable grid_remove_iso. For a list of isos, refer to the file: input/mappings/Master_Country_List.csv.

The gridded data files and checksums that are generated will have the suffix no_<iso>.

6.7.4. Gridding Diagnostics

The diagnostic data generated by the gridding routines includes:

The metadata within each gridded emissions netCDF file contains a global attribute global_total_emission that contains a value equal to total global emissions for one or more years.
A .csv file with global checksum values equal to global emissions for each final gridding sector and year is also generated. The chunking routines generate a consolidated .csv file with checksum values for all years in each chunked file.
For each species, the file diagnostic-output/G.XX_bulk_emissions_checksum_comparison_diff.csv shows the absolute difference between the sectoral checksum value as summed from the gridded emissions and as summed from the final emissions files. The same data is written as …_per in terms of percentage differences. Some differences are expected, but if these numbers are large then there is a more fundamental problem with the gridding system that should be fixed.
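
The checksum comparison described above amounts to an absolute and a percentage difference per sector. A minimal sketch, using made-up sector names and values rather than real CEDS output, could look like this:

```r
# Hypothetical sector totals summed from the grids and from the final emissions
gridded_sum <- data.frame(sector = c("AGR", "ENE", "IND"),
                          X2000  = c(10.0, 25.0, 40.2))
final_em    <- data.frame(sector = c("AGR", "ENE", "IND"),
                          X2000  = c(10.0, 25.0, 40.0))

# Line the two sources up by sector
cmp <- merge(gridded_sum, final_em, by = "sector",
             suffixes = c("_grid", "_final"))

# Absolute difference, and percentage difference relative to final emissions
cmp$diff <- abs(cmp$X2000_grid - cmp$X2000_final)
cmp$per  <- ifelse(cmp$X2000_final == 0, NA,
                   100 * cmp$diff / cmp$X2000_final)
```

Here only the third sector shows a difference (0.2 in absolute terms, 0.5 percent); values of this size would usually be unremarkable, while much larger ones would warrant investigation.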

NetCDF files with total emissions (summed across all sectors other than aircraft) are provided for convenience (both monthly and summed across months) here:

`diagnostic-output/total-emissions-grids`

6.8. Module H (Historical Extension)

Module H is responsible for the extension of CEDS data from 1960 or 1971, depending on the country, back to 1750. Many CEDS activities, particularly fuel combustion, are extended using CDIAC trends, which hold information per aggregate fuel and country.

The user can specify extension methods and associated data by sector and emission species. These are specified in the files CEDS_historical_extension_drivers_activity.csv and CEDS_historical_extension_methods_EF.csv in the folder input/extension/.

Extension methods should be specified for every sector, emission species, and over the entire extension time period (back to 1750). Species-specific methods can also be specified. Examples of the available methods can be found in CEDS_historical_extension_methods_EF.csv.

Module H can also perform some simple adjustments to emission factor bounds (e.g., min/max emission factors).

Mass balance adjustments for SO₂ and CO₂ emissions are also performed in Module H.

6.9. Module S (Summary Data Processing)

Module S conducts final processing and summary procedures. This is the last Module in the CEDS system. Its input is intermediate-output/[em]_total_CEDS_emissions.csv and its output is a series of final emissions breakdowns and summaries, listed under CEDS Final Outputs.

6.9.1. Code Structure

The main body of Module S is contained in a single script, S1.1.write_summary_data.R.

Module S begins by reading in the final emissions disaggregated data. The script aggregates the data to all levels required.
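
As an illustration of what aggregating disaggregated emissions to different levels means, the following base-R sketch sums a small, made-up table by country and sector and then by country. The column layout (iso, sector, fuel, one column per year) follows the general CEDS convention described in this guide, but the values and the use of aggregate() are illustrative assumptions, not the code used in S1.1.write_summary_data.R.

```r
# Hypothetical disaggregated emissions by country, sector, and fuel
em <- data.frame(
  iso    = c("usa", "usa", "usa", "can"),
  sector = c("1A1", "1A1", "1A2", "1A1"),
  fuel   = c("coal", "oil", "coal", "coal"),
  X2000  = c(5, 3, 2, 4),
  X2001  = c(6, 3, 2, 5)
)

# Sum over fuels: one row per country and sector
by_country_sector <- aggregate(cbind(X2000, X2001) ~ iso + sector,
                               data = em, FUN = sum)

# Sum over fuels and sectors: one row per country
by_country <- aggregate(cbind(X2000, X2001) ~ iso, data = em, FUN = sum)
```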

The script then checks if an older run of this emissions species is present in the final output folder (which is not wiped clean by the Makefile during an execution of make clean-all).

If no older data is present in the final-emissions/current-versions folder, the script writes its summary files.

If an older dataset is present for this emissions species, the script executes a comparison between the two datasets. If the new data differ from the old data, the script overwrites the old data and also produces a series of diagnostic files exploring differences between the outputs of the two runs in the diagnostics/ folder, as described in the sub-section below.

The script then sources three files:

Figures.R creates and outputs a series of figures to the summary-plots/ folder including global emissions graphs and further aggregations.
Compare_to_RCP.R is called except when the emissions species is 'CO2'.
- This script creates global, regional, and sectoral comparisons between the CEDS output and the RCP inventory emissions as *.csv files in the ceds-comparisons/ subfolder of diagnostic-output/.
- It also produces graphical comparisons of the same data.
Compare_to_GAINS.R is called except when the emissions species is 'CO2' or 'NH3'.
- This script creates global comparisons, including specific comparisons for residential and non-residential emissions.
- It also produces graphical comparisons of the same data.

6.9.2. CEDS Final Outputs

Module S produces the following summary files in the final-emissions/current-versions sub-folder:

All bunker (international aviation and shipping) emissions, S.[em]_bunker_emissions.csv
CEDS final emissions aggregated to different levels:
- Aggregated to each country and aggregate sector, CEDS_[em]emissions_by_country_sector[ver].csv
- Aggregated to country totals, CEDS_[em]emissions_by_country[ver].csv
- Global emissions per specific fuel, CEDS_[em]global_emissions_by_fuel[ver].csv
- Emissions aggregated to CEDS sectors and countries, CEDS_[em]emissions_by_country_CEDS_sector[ver].csv
- Global emissions per CEDS sector, CEDS_[em]global_emissions_by_CEDS_sector[ver].csv

Each file is also suffixed by the date of the execution of the run, or the user-specified version number (also in the form of a date).

Note that if there are no changes between the current and previous run, the files will not be updated, and the files in the current-versions folder will retain the modification date from the previous run.

6.9.2.1. CEDS Final Output Diagnostics

If the results of the current run are different from those of the previous run of the CEDS system for this emissions species, the following comparison diagnostics are produced, if there is relevant data.

Files that show changes in rows and columns (new or deleted rows/columns between the old/new versions of the data):
- ./diagnostics/CEDS_[em]emissions_by_country_sector[ver]_dropped-rows
- ./diagnostics/CEDS_[em]emissions_by_country_sector[ver]_added-rows
- ./diagnostics/CEDS_[em]emissions_by_country_sector[ver]_dropped-cols
- ./diagnostics/CEDS_[em]emissions_by_country_sector[ver]_added-cols

The following diagnostic files show, for aggregated sectors, differences between the old and new versions of the output data. Only rows with differences are shown to keep file size reasonable.

Percentage, absolute, and consolidated comparison files identifying changes between the two outputs:
- ./diagnostics/CEDS_[em]emissions_by_country_sector_comparison[ver]_diff-percent.csv
- ./diagnostics/CEDS_[em]emissions_by_country_sector_comparison[ver]_diff.csv (absolute differences)
- ./diagnostics/CEDS_[em]emissions_by_country_sector_comparison[ver]_comparison.csv

The last _comparison.csv file listed above is a consolidated comparison, in long format, that shows old values, new values, and their absolute and percentage differences.

For the two difference diagnostic files (percentage and absolute) only changes above a threshold are shown. In these files, a "0" indicates that there was some change in the data, but that the change was below the threshold value. If nothing is shown (blank), then the data were identical to all digits. This presentation suppresses spurious differences caused by, for example, different package versions.
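
The three display states described above (blank, "0", or a value) can be sketched in base R as follows. The function name format_diff and the threshold value are hypothetical; the actual threshold used by CEDS is not stated here.

```r
# Hypothetical helper reproducing the blank / "0" / value convention
format_diff <- function(old, new, threshold = 1e-6) {
  d   <- new - old
  out <- rep("", length(d))                  # blank: values identical
  changed <- d != 0
  small   <- changed & abs(d) <  threshold   # changed, but below threshold
  large   <- changed & abs(d) >= threshold   # changed, at or above threshold
  out[small] <- "0"
  out[large] <- as.character(d[large])
  out
}

old_run <- c(1, 2, 3)
new_run <- c(1, 2 + 1e-9, 3.5)
diff_display <- format_diff(old_run, new_run)
```

In this example the first value is unchanged (blank), the second differs only by a tiny amount ("0"), and the third shows a real difference.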

6.10. The Makefile

CEDS is executed using a makefile system. A single file, called the Makefile and saved in the main CEDS folder, contains instructions for the execution of the entire CEDS system.

The Makefile is executed on the command line of your choice using the command make * where * is a valid command line argument.

Some Makefile execution commands:

`make all`	Executes a run of CEDS for each valid emissions species except CH4
`make CO2-emissions`	Executes CEDS for emissions species CO2, or any other specified emissions species (generic `make [em]-emissions`)
`make clean-all`	Deletes all intermediate, diagnostic, and final output files
`make clean-modB`	Deletes all files output by Module B (valid for all modules)
`make clean-CO2`	Deletes all intermediate files relating to CO2

The Makefile is made up of "Code blocks". Each code block is headed by the output file that will be created, and is followed by all of the input files and scripts required to create that file. Most code blocks will include an indicator that one or more Rscripts should be executed.

If an input file is missing, or if an Rscript fails to create an intermediate file needed by another script, the Makefile will throw an error, saying that there is no rule to build the missing file.

6.11. The Parameters Folder

The CEDS code contains a "parameters" folder. This folder stores header files. These files are sourced at the beginning of some scripts to load functions and global data.

The files in this folder are as follows:

File	Contains
analysis_functions.R	Functions that map to CEDS, check if all sectors/countries/fuels are present.
common_data.R	Global variables, e.g. years, default conversion factors.
data_functions.R	Various data processing functions, e.g. %!in%, replacing data, build CEDS template, remove NAs or blanks, etc.
diagnostic_functions.R	A function that compares two identically formatted dataframes for equality.
emissions_scaling_functions.R	All functions specific to Module F, e.g. value-metadata functions, scaling functions, functions that add to scaled databases, etc.
global_settings.R	To be called at the beginning of every script. Initializes CEDS version number and global options.
gridding_functions.R	All functions specific to Module G.
header.R	Contains functions required for initializing the log and for sourcing other parameters scripts. Contains the `initialize` function for smoothly sourcing header scripts and beginning the log.
interpolation_extension_functions.R	Contains functions for interpolating or extending time series data (NOT interpolate_NAs, extend on trend).
IO_functions.R	Contains readData, writeData, and printLog, along with other functions for reading in or outputting information.
ModH_extension_functions.R	All functions specific to Module H, including data processing, merging, and disaggregating functions.
nc_generation_functions.R	Some supplemental non-combustion gridding functions
process_db_functions.R	Contains functions for generically adding data to databases (e.g. addToEmissionsDb). Also contains a cleanData function.
timeframe_functions.R	Contains a series of helper functions for dealing with data time range, from identification to truncation.

6.12. Diagnostic Folders

The CEDS code contains a "diagnostic" folder. This folder stores scripts used for processing graphics and figures for publication. It also contains comparison scripts that report how CEDS outputs compare to specific emissions inventories.

7. Diagnostic Results

The CEDS system produces a number of diagnostic graphs and files that can be used to visualize results, evaluate changes from previous versions, and look for aberrant behavior. These include:

7.1. Comparison Graphs

The CEDS/diagnostic-output folder contains a large number of intermediate output files and graphs from various CEDS scripts. Two main sets of graphs are produced that are useful in examining results. Excel files that correspond to the data in graphs are also produced.

Comparison with GAINS: Graphs comparing CEDS emissions by region with GAINS point estimates (every five years) are provided. Graphs are generated for total emissions, residential, combustion, and non-residential.
Comparison with RCP: Another set of graphs shows long-term CEDS trends (1850 forward) by "RCP region" compared with the Lamarque et al. 2010 ("RCP") emissions.

Diagnostic files for emission scaling are also produced. These provide the inventory data, aggregated to scaling sectors, and the scaling factors over time. These are useful to identify sectors where CEDS default emissions were very different from inventory values. This may indicate a sector definition mismatch or a place where CEDS default assumptions may need to be updated. This is particularly important if there is a mis-match in the earliest scaling year.

7.2. Comparing with previous development version

Whenever the CEDS system is run via the "make" system, the final emissions by country and sector are compared with data from the previous run. Differences by sector and country are written out to a set of diagnostic files in the final-emissions/diagnostics folder as described in the CEDS Final Output Diagnostics section.

8. Troubleshooting Odd Results

8.1. Results ordered differently

CEDS should produce the same results when running on different machines or versions of R. We have found, however, that the order of the results can differ between versions of R. Different versions of R, for example, differ in the order in which base::merge combines two data frames, which produces final results in a different order but with equal values.
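
When checking whether two runs differ only in row order, one option is to sort both data frames on their identifier columns before comparing. The sketch below uses made-up data and a hypothetical helper same_values_any_order; it is not part of the CEDS code base.

```r
# Hypothetical helper: compare two data frames ignoring row and column order
same_values_any_order <- function(a, b, keys) {
  sort_df <- function(df) {
    ord <- do.call(order, unname(as.list(df[keys])))  # sort rows by the key columns
    df  <- df[ord, sort(names(df)), drop = FALSE]     # and put columns in a fixed order
    rownames(df) <- NULL
    df
  }
  isTRUE(all.equal(sort_df(a), sort_df(b)))
}

run1 <- data.frame(iso = c("usa", "can"), sector = c("1A1", "1A1"),
                   X2000 = c(5, 4))
run2 <- run1[c(2, 1), ]              # same values, rows in a different order

rows_identical   <- identical(run1, run2)
values_identical <- same_values_any_order(run1, run2, keys = c("iso", "sector"))
```

A direct identical() comparison reports a difference here, while the sorted comparison confirms that the values are the same.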

8.2. Packages masking functions

When adding new scripts to CEDS, make sure to correctly scope functions from any packages you are using. For example, plyr and dplyr contain many functions with the same name but different functionality. You cannot count on CEDS to load packages in the order that you want, and you will get errors or wrong results if you make that assumption. The best practice is to scope functions when they are used, to ensure that the desired version is called (e.g., dplyr::mutate(…)).
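
The effect of masking can be demonstrated without installing plyr or dplyr. In the sketch below, a user-defined function named filter (an illustrative stand-in for a masking package function) hides stats::filter; the explicitly scoped call still reaches the intended version.

```r
# A definition that masks stats::filter(), standing in for a package
# that exports a function with the same name
filter <- function(x, ...) stop("wrong filter() was called")

x <- c(1, 2, 3, 4, 5)

# An unqualified call picks up the masking definition and fails
unscoped_result <- tryCatch(filter(x, rep(1 / 3, 3)),
                            error = function(e) conditionMessage(e))

# Explicit scoping always reaches the intended function:
# a centered 3-point moving average from the stats package
moving_avg <- stats::filter(x, rep(1 / 3, 3), sides = 2)
```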

8.3. Errors reading in new .csv files

While it is convenient to create .csv files using Microsoft Excel, it is also easy to create files that have extra columns or extraneous characters. If you encounter errors reading in a user created .csv file it can be useful to open the file in a text editor to check formatting.

Note that if there are some blank columns in one row but not another you will get an error such as this:

Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  more columns than column names

Cleaning up the .csv file will correct this error.
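
Before opening the file in a text editor, it can help to locate the offending rows from R. The following sketch writes a small example file with trailing blank columns and uses utils::count.fields to find rows whose field count differs from the header; the file contents are made up for illustration.

```r
# Write a small example file; the last row has extra, empty columns
csv_path <- tempfile(fileext = ".csv")
writeLines(c("iso,sector,fuel",
             "usa,1A1,coal",
             "can,1A1,coal,,"), csv_path)

# Number of comma-separated fields on each line
n_fields <- utils::count.fields(csv_path, sep = ",")

# Lines whose field count differs from the header line
bad_rows <- which(n_fields != n_fields[1])
```

Note that count.fields skips blank lines by default, so the reported positions can be offset from the line numbers in the file if it contains empty lines.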

8.4. Machine Specific Errors in DFs Due To BOM Text File Encoding

Most machines will read UTF-8 encoded text files without problems, but we have found that some systems will raise an error if text files are encoded with a Byte Order Mark (BOM). In these cases you might get a machine-specific error such as the following:

Error in `[.data.frame`(exten_df, , c("iso", "sector", "fuel")) :
  undefined columns selected

If you were to look at the dataframe exten_df in a debugger (such as by using the R browser() command) you will see that the first column name shows up as …iso which is indicating that a hidden character is part of this column name.

The solution for the BOM issue is to edit the file to have plain UTF-8 encoding without a BOM. You can try Microsoft Excel - we’ve found that this will sometimes show this in a way that can be edited out. BBEdit for mac OSX also allows you to change the text encoding type.

This type of problem can show up in any column if there is a hidden character in the text. These types of errors are difficult to debug since the hidden characters will not display in many cases in editors. But the issue will always be apparent when looking at the data frame in an R debugger since the problematic name will show up as containing … in addition to the visible text.
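
A BOM can also be detected and handled from R itself. The sketch below creates a small file with a UTF-8 BOM, checks for the three BOM bytes, and reads the file with fileEncoding = "UTF-8-BOM", which base R accepts for reading and which drops the BOM. The helper has_bom and the file contents are illustrative; CEDS's own readData function is not modified here.

```r
# Create an example .csv that starts with a UTF-8 byte order mark
bom_path  <- tempfile(fileext = ".csv")
bom_bytes <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom_bytes, charToRaw("iso,sector,fuel\nusa,1A1,coal\n")), bom_path)

# Hypothetical helper: TRUE if the file begins with the UTF-8 BOM bytes
has_bom <- function(path) {
  identical(readBin(path, what = "raw", n = 3L),
            as.raw(c(0xEF, 0xBB, 0xBF)))
}

file_has_bom <- has_bom(bom_path)

# Reading with the "UTF-8-BOM" encoding strips the BOM from the first name
exten_df <- read.csv(bom_path, fileEncoding = "UTF-8-BOM",
                     stringsAsFactors = FALSE)
col_names <- names(exten_df)
```

This is only a diagnostic aid; re-saving the file as plain UTF-8 without a BOM, as described above, remains the durable fix.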
