# Introduction to the "GCAM Regional Tuning" Framework

## Motivation
Core GCAM assumptions are often regionally uniform, long-term scenario assumptions, with the focus on creating internally consistent scenarios.

However, at times, very specific assumptions or model outcomes are required, perhaps to harmonize with other modeling teams for a model inter-comparison project, or to match sponsor data and projections, which better aligns studies and communication.

Of course, this has been true for some time, and researchers using GCAM, both at PNNL and in the GCAM community at large, have been navigating this challenge.  Thus we started by surveying them on:

* What are the projects?
* What has the experience been like?
* What are the variables / outcomes that need tuning?
* What is the "resolution" (spatial, temporal, sectoral, etc)?
* What sorts of data sets are involved, are they open access or proprietary?


### What we found from our survey
* A wide range of project responses
  * Many country/region level "breakouts"
  * Existing regions (states) that require more near-term "realism"
* Matching "on the ground" policies and energy system developments
* Matching "country specific" data
* A range of GCAM versions
* Often results need to be "in the right ballpark"
* A lot of interest in energy final demand and power sector
  * GCAM sectoral detail may be too coarse
  * Technology detail may be ok
* Datasets are mostly openly available and frequently released



## Coming up with a Solution

### What we see as Regional Tuning
* A structure that is flexible to handle a wide variety of use cases
* Could be new or existing Regions
* Mapping country data to GCAM definitions (sector, fuel, etc)
* Matching GCAM outputs to country specific projections

### A Proposed Approach 
* Synthesize detailed datasets to "map" into GCAM
* Identify and change parameters in GCAM to "match" desired outcomes
* Produce parameters as a standalone XML which can be used in subsequent GCAM scenario runs
* Sounds a lot like gcamdata
* Many of the examples of tuning are characterizing model outputs
* We will need something that goes back and forth between gcamdata and GCAM
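
As a rough illustration of the "map" step, the sketch below aggregates a country dataset to GCAM-style sector and fuel categories with pandas.  All column names, category labels, and values here are placeholders invented for illustration; they are not actual GCAM mappings or real data.

```python
import pandas as pd

# Hypothetical country-level data (placeholder values, not real statistics)
country_data = pd.DataFrame({
    "country_sector": ["cars", "buses", "cars", "buses"],
    "country_fuel": ["petrol", "diesel", "electricity", "electricity"],
    "year": [2025, 2025, 2025, 2025],
    "value": [10.0, 4.0, 1.0, 0.5],
})

# Hypothetical mapping tables from country categories to GCAM-style names
sector_map = pd.DataFrame({
    "country_sector": ["cars", "buses"],
    "gcam_sector": ["passenger road", "passenger road"],
})
fuel_map = pd.DataFrame({
    "country_fuel": ["petrol", "diesel", "electricity"],
    "gcam_fuel": ["refined liquids", "refined liquids", "electricity"],
})

# Attach the GCAM-style names, then aggregate to the GCAM-style resolution
mapped = (country_data
          .merge(sector_map, on="country_sector")
          .merge(fuel_map, on="country_fuel")
          .groupby(["gcam_sector", "gcam_fuel", "year"], as_index=False)["value"]
          .sum())
print(mapped)
```

Keeping the mappings in their own tables, rather than hard-coding them, is one way to let the same pipeline be reused when the source dataset is re-released.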


## A Framework to Perform Tuning

The framework takes a "GCAM" to serve as a starting point:

* gcamdata
* gcamwrapper
* An un-tuned reference scenario configuration to start from

The framework then applies a series of "tuning directives", which update parameters within GCAM that can affect the tuning criteria of interest.  It then evaluates GCAM to see how well its results match the tuning criteria, trying various parameter values until we iterate to a set of GCAM outputs that most closely align with the desired outcomes.  Finally, it exports the resulting parameters to an XML input.

Ultimately a number of "base" tuning directives and helpers will be included in the package to help users build directives that suit their own tailored needs.
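
To make the set / run / evaluate cycle concrete before bringing in GCAM, here is a minimal sketch using a toy stand-in model.  `ToyModel`, its response function, and the secant iteration are assumptions chosen for illustration; they are not part of gcamwrapper or of the framework itself.

```python
class ToyModel:
    """Stand-in for a running GCAM instance (hypothetical, for illustration only)."""

    def __init__(self):
        self.share_weight = 1.0

    def run(self):
        # toy response: a share that rises with the share weight
        return self.share_weight / (self.share_weight + 3.0)


def deviation(model, target):
    # relative difference of the model output from the desired outcome
    return (target - model.run()) / target


def tune(model, target, x0, x1, tol=1e-8, max_iter=50):
    # secant iteration: set a parameter, re-run the model, evaluate, repeat
    model.share_weight = x0
    f0 = deviation(model, target)
    for _ in range(max_iter):
        model.share_weight = x1
        f1 = deviation(model, target)
        if abs(f1) < tol:
            break
        x0, x1, f0 = x1, x1 - f1 * (x1 - x0) / (f1 - f0), f1
    return x1


model = ToyModel()
tuned = tune(model, target=0.4, x0=1.0, x1=1.5)
print(f"tuned share weight: {tuned:.6f}")
```

The real framework follows the same pattern, with GCAM in place of the toy model and a vector of parameters in place of a single value.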

In [1]:
# Import required Packages including GCAM via gcamwrapper
import gcamwrapper as gw
from tuner_utils import broyden
import numpy as np
import pandas as pd
import jax.numpy as jnp
from jax.numpy.linalg import norm as jnorm
import time

## Defining a Tuning Target
A tuning target can be quite generic.  At a high level, it needs to identify parameters in GCAM that can be changed in a way that influences some GCAM outcome of interest, and then evaluate how closely that GCAM outcome matches a target.

A bit more rigorously we require the following methods:

* `initialize` - Do anything needed to start tuning this target.  It will be given an instance of GCAM in case there is anything needed from a running GCAM instance.  In addition, we could imagine pulling information from an external source or from `gcamdata` directly at this step.
* `get_initial_tuning_params` - A vector of GCAM parameter values which will serve as a starting point in our parameter search.
* `get_num_tuning_output` - The number of specific output values we will attempt to "match".
* `set_tuning_params` - Take a current "guess" of parameter values and set them back into GCAM so that we can subsequently evaluate GCAM at those values.  Implicitly the size of this vector will match the one returned by `get_initial_tuning_params`, and it is up to the tuning target to "map" these values back into GCAM, most likely by using the `set_data` feature of `gcamwrapper` with the appropriate query.
* `get_tuning_deviation` - Identify how well the current state of GCAM "matches" the desired outcome.  It is up to the tuning target to get the required information out of GCAM, likely using `get_data` with the appropriate queries, then aggregating / mapping and finally comparing to some desired outcome dataset.  Each value should be the relative difference of the current GCAM state from the desired outcome (the example below expresses it as a fraction of the target rather than as a percentage).
* `export_to_xml` - Leverage data and tools from `gcamdata` to add the final solution parameter values into an XML add-on file which can be used in subsequent GCAM scenarios and analysis outside of the tuning framework.
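
One way to write this interface down is as a `typing.Protocol`, sketched below.  The name `TuningTarget` and the type hints are assumptions for illustration; the method names and argument order follow the example class in this notebook, and the framework itself may define the contract differently.

```python
from typing import Any, Protocol, Sequence, runtime_checkable


@runtime_checkable
class TuningTarget(Protocol):
    """Hypothetical description of the methods a tuning target must provide."""

    def initialize(self, g: Any) -> None: ...

    def get_initial_tuning_params(self) -> Sequence[float]: ...

    def get_num_tuning_output(self) -> int: ...

    def set_tuning_params(self, g: Any, new_values: Sequence[float]) -> None: ...

    def get_tuning_deviation(self, g: Any) -> Sequence[float]: ...

    def export_to_xml(self, xml: Any, solution: Sequence[float]) -> Any: ...


class DoNothingTarget:
    """Trivial target with one parameter and one output, used only to check the protocol."""

    def initialize(self, g):
        pass

    def get_initial_tuning_params(self):
        return [1.0]

    def get_num_tuning_output(self):
        return 1

    def set_tuning_params(self, g, new_values):
        pass

    def get_tuning_deviation(self, g):
        return [0.0]

    def export_to_xml(self, xml, solution):
        return xml


class IncompleteTarget:
    """Missing most of the required methods."""

    def initialize(self, g):
        pass


# isinstance only checks that the methods exist, not their signatures
print(isinstance(DoNothingTarget(), TuningTarget))
print(isinstance(IncompleteTarget(), TuningTarget))
```

A runtime-checkable protocol lets a tuner reject an incomplete directive early, without requiring every directive to inherit from a common base class.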

In [2]:
class BEVDeploymentTuning:
    def __init__(self, target_data_fn, year_limit=None):
        # stash useful data that will be needed later
        if year_limit is None:
            year_limit = 2100
        
        # read the desired output to compare to from a CSV file
        self.target_data = pd.read_csv(target_data_fn)

        # get the query of GCAM outputs that we will want to tune
        self.service_query = gw.get_query("transportation", "service")
        # get the query of GCAM input parameters which we can change to match our target
        self.sw_query = 'world/region{region@name}/sector[+NamedFilter,StringRegexMatches,^trn_]/subsector{subsector@name}/technology{tech@name}/period{year@year}/real-share-weight'

        # limit the time window for which we desire to match
        self.target_data = self.target_data[self.target_data['year'] >= 2025]
        self.target_data = self.target_data[self.target_data['year'] <= year_limit]
    
    def initialize(self, g):
        # capture the initial "tuning params" or GCAM inputs that we will change
        # it is useful to keep a copy because the tuning algorithm will be interested in just the values
        # however we need to keep track of _where_ those values came from (region, tech, etc) so we can
        # map them back in `set_tuning_params`
        trn_sw = g.get_data(self.sw_query)
        self.base_sw = (self.target_data[['region', 'sector', 'subsector', 'technology', 'year']].copy()
            # we do a "merge" of the input params with the tuning target really as a mechanism to filter
            # the input params to only the region, tech, year, etc we are interested in tuning
            .merge(trn_sw.rename(columns={"period": "year"}), on=['region', 'sector', 'subsector', 'technology', 'year']))
    
    def get_initial_tuning_params(self):
        # we already captured these from GCAM in `initialize` so here we can simply pull out the "value" column
        return self.base_sw['real-share-weight'].copy()
    
    def get_num_tuning_output(self):
        # this is going to be the number of rows we have in our `target_data`
        return len(self.target_data['target'])
    
    def set_tuning_params(self, g, new_values):
        # the values we are getting in `new_values` is simply an updated "value" column
        # which we can copy over in the value column of our initial "tuning params" DataFrame
        new_sw = self.base_sw.copy()
        new_sw['real-share-weight'] = new_values
        
        # we can now just set this entire DataFrame back into GCAM using the same query we used
        # to get the data in the first place
        g.set_data(new_sw, self.sw_query)
    
    def get_tuning_deviation(self, g):
        # query GCAM for its current "output" using the query we stashed earlier
        new_service = g.get_data(self.service_query)
        
        # "map" those outcomes to the target data
        # in this case it is just a simple one to one merge
        service_compare = (self.target_data[['region', 'sector', 'subsector', 'technology', 'year', 'target']].copy()
            .merge(new_service, on=['region', 'sector', 'subsector', 'technology', 'year']))
        
        # compute the "error" as relative difference of GCAM output from the target
        service_compare['error'] = (service_compare.target - service_compare['physical-output']) / service_compare.target
        
        # the tuner is just interested in the "error" column so pull that out and return it
        return service_compare['error']

## Create a GCAM instance to tune
Use gcamwrapper to start up a GCAM instance that we can use in our tuning algorithm.

Note that for this demonstration we have configured a very minimal GCAM which essentially contains only the sector of interest and no markets to solve.

In [3]:
# create a GCAM instance
g = gw.Gcam("config_minimal.xml", "./exe/")
# And do an initial run in case there was something in those results needed to initialize a tuning target
g.run_period(g.convert_year_to_period(2100))

Running GCAM model code base version 7.1 revision gcam-v7.1

Configuration file:  config_minimal.xml
Parsing input files...
Parsing ./input/gcamdata/xml/no_climate_model.xml scenario component.
Parsing ./input/gcamdata/xml/socioeconomics_gSSP2.xml scenario component.
Parsing ./input/gcamdata/xml/transportation_UCD_CORE.xml scenario component.
Parsing ./input/gcamdata/debug_test.xml scenario component.
Parsing ./input/solution/cal_broyden_config.xml scenario component.
XML parsing complete.
Starting new scenario: Reference
SEVERE ERROR:renewable in USA is not related to any other activities.
Starting a model run. Running period 21
Model run beginning.
Period 0: 1975
Model solved with last period's prices.

Period 1: 1990
Model solved with last period's prices.

Period 2: 2005
Model solved with last period's prices.

Period 3: 2010
Model solved with last period's prices.

Period 4: 2015
Model solved with last period's prices.

Period 5: 2020
Model solved with last period's prices.

Perio

## Create a tuning target
In this case we are going to have just a single target.  In practice we could have numerous targets which we attempt to "match" simultaneously.
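
A minimal sketch of how several targets might be presented to a solver as one: concatenate their parameter vectors and deviations, and split incoming values back out per target.  `ToyTarget`, `CombinedTarget`, and the dict standing in for a GCAM instance are hypothetical and not part of the framework.

```python
import numpy as np


class ToyTarget:
    """Hypothetical target whose parameters live in a dict standing in for GCAM."""

    def __init__(self, key, goal):
        self.key = key
        self.goal = np.asarray(goal, dtype=float)

    def get_initial_tuning_params(self):
        return np.ones_like(self.goal)

    def set_tuning_params(self, g, new_values):
        g[self.key] = np.asarray(new_values, dtype=float)

    def get_tuning_deviation(self, g):
        # toy "model": the output is simply twice the parameter value
        return (self.goal - 2.0 * g[self.key]) / self.goal


class CombinedTarget:
    """Present several targets to the solver as a single target."""

    def __init__(self, targets):
        self.targets = targets
        sizes = [len(t.get_initial_tuning_params()) for t in targets]
        # indices at which the stacked parameter vector is split back up
        self.splits = np.cumsum(sizes)[:-1]

    def get_initial_tuning_params(self):
        return np.concatenate([t.get_initial_tuning_params() for t in self.targets])

    def set_tuning_params(self, g, new_values):
        chunks = np.split(np.asarray(new_values), self.splits)
        for target, chunk in zip(self.targets, chunks):
            target.set_tuning_params(g, chunk)

    def get_tuning_deviation(self, g):
        return np.concatenate([t.get_tuning_deviation(g) for t in self.targets])


g_toy = {}
combined = CombinedTarget([ToyTarget("a", [2.0, 4.0]), ToyTarget("b", [6.0])])
x0 = combined.get_initial_tuning_params()
combined.set_tuning_params(g_toy, x0)
dev = combined.get_tuning_deviation(g_toy)
print(dev)
```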

In [4]:
bev_target = BEVDeploymentTuning("./bev_target_us.csv", year_limit=2100)
bev_target.initialize(g)
initial_bev_sw = bev_target.get_initial_tuning_params()

## Initialize tuning
Initialize the tuning directives and come up with an initial guess.

The "tuning" process is ultimately a numerical solve.  Thus we will wrap the details of running GCAM and evaluating progress towards our targets into a function of the form `F(x)`, whose output (the deviations from our targets) we will attempt to drive to zero.

In [5]:
# create a randomized initial guess for no other reason than to make the solver's job more difficult
rand_gen = np.random.default_rng(1919)
x_init = initial_bev_sw * rand_gen.random((len(initial_bev_sw))) * 2.0

# A nice wrapper function which can be passed to a numerical optimization routine
def F(x, tuner, g):
    # The steps for evaluating our function:
    # 1. Have the tuners set the given "x" values back into GCAM
    tuner.set_tuning_params(g, x.astype(np.float64))
    
    # 2. Re-run GCAM.  Note gcamwrapper won't understand by itself which periods need to be
    # recalculated so we need to explicitly have it re-run our earliest tuning year followed
    # by running out to the latest tuning year
    g.run_period(g.convert_year_to_period(2025))
    g.run_period(g.convert_year_to_period(2100))
    
    # 3. Have the tuners return the deviation or error which the numerical algorithm will
    # attempt to make zero
    error = tuner.get_tuning_deviation(g)
    error = jnp.array(error.to_numpy())
    return error

## Run it through a numerical solver
In principle all that is left is to hand `F(x)` to a numerical solver and let it iterate until it can find the desired solution within a tolerance.

This could result in tens to hundreds of evaluations of GCAM scenarios.  With our minimal configuration this is no problem at all.  When running a "full" GCAM configuration we can still manage by leveraging techniques such as a "Solution Oracle" to reduce the run time, so the problem remains tractable.
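
The implementation of `tuner_utils.broyden` is not shown in this notebook.  For readers unfamiliar with the method, the following is a generic textbook-style sketch of Broyden's method with simple backtracking, applied to a small toy system.  It is an assumption about the general technique, not the actual code used here, and `broyden_sketch` and `toy_F` are made-up names.

```python
import numpy as np


def broyden_sketch(F, x0, tol=1e-10, max_iter=100):
    """Find x with F(x) ~ 0 without ever computing derivatives of F."""
    x = np.asarray(x0, dtype=float)
    f = np.asarray(F(x), dtype=float)
    B = np.eye(x.size)  # initial guess at the Jacobian
    for _ in range(max_iter):
        if np.linalg.norm(f) < tol:
            break
        step = np.linalg.solve(B, -f)

        # backtracking: shrink the step until the residual norm decreases
        scale = 1.0
        x_new = x + step
        f_new = np.asarray(F(x_new), dtype=float)
        while np.linalg.norm(f_new) >= np.linalg.norm(f) and scale > 1e-4:
            scale *= 0.5
            x_new = x + scale * step
            f_new = np.asarray(F(x_new), dtype=float)

        # rank-one update so that B maps the step taken onto the change observed
        dx = x_new - x
        df = f_new - f
        B += np.outer(df - B @ dx, dx) / (dx @ dx)
        x, f = x_new, f_new
    return x


def toy_F(x):
    # small nonlinear system standing in for "run the model and compute deviations"
    return np.array([x[0] - 0.3 * np.cos(x[1]),
                     x[1] - 0.3 * np.sin(x[0])])


solution = broyden_sketch(toy_F, np.zeros(2))
print(solution, toy_F(solution))
```

Each iteration costs one evaluation of `F` plus any backtracking retries, which is what makes this family of methods attractive when every evaluation is a model run.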

In [6]:
# Use a numerical solver to find the set of GCAM input parameters which best match our
# desired outcomes
# Here we are using a simple Broyden's method with backtracking, without supplying numerical derivatives 
start_time = time.time()
ans = broyden(F, x_init, bev_target, g)
print("Broyden took ", time.time() - start_time, " seconds to run")
print(ans)

Starting a model run. Running period 6
Model run beginning.
Period 6: 2025
Model solved with last period's prices.

All model periods solved correctly.
Model run completed.
Starting a model run. Running period 21
Model run beginning.
Period 7: 2030
Model solved with last period's prices.

Period 8: 2035
Model solved with last period's prices.

Period 9: 2040
Model solved with last period's prices.

Period 10: 2045
Model solved with last period's prices.

Period 11: 2050
Model solved with last period's prices.

Period 12: 2055
Model solved with last period's prices.

Period 13: 2060
Model solved with last period's prices.

Period 14: 2065
Model solved with last period's prices.

Period 15: 2070
Model solved with last period's prices.

Period 16: 2075
Model solved with last period's prices.

Period 17: 2080
Model solved with last period's prices.

Period 18: 2085
Model solved with last period's prices.

Period 19: 2090
Model solved with last period's prices.

Period 20: 2095
Model solved

## Save the "Answer"
The final step of the process is to export the solution tuning parameters to an XML add-on file which can be added to GCAM configuration files to produce "tuned" scenarios outside of this tuning framework and, of course, to produce further analysis from that starting point.

We leverage the XML tools from `gcamdata` to do the heavy lifting of gathering DataFrames paired with "Model Interface Headers" which can then be transformed to a GCAM input XML.  Each tuning directive will be responsible for calls to `add_xml_data()` as appropriate for the input parameters it was changing to produce the desired outcome.

**Note:** For simplicity we have not included in this notebook a functional `gcamdata`, and the `R` and `rpy2` environments which would be required to run it.  Therefore, the following is provided as illustrative only.

In [7]:
%%python -c "pass"

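# This cell is not actually run as we have not installed a working gcamdata and rpy2 in this environment
from rpy2.robjects.packages import importr
gcamdata = importr('gcamdata')

# this method belongs in the BEVDeploymentTuning class defined above; it is attached to the
# existing class here so that the `bev_target` instance created earlier can use it
def export_to_xml(self, xml, solution):
    # create a DataFrame for output with the "value" column set to the given solution
    final_sw = self.base_sw.copy()
    final_sw['real-share-weight'] = solution

    # use gcamdata via rpy2 to call add_xml_data
    xml = gcamdata.add_xml_data(xml, final_sw, "TechShrwt")
    return xml

BEVDeploymentTuning.export_to_xml = export_to_xml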

# use gcamdata to create an XML pipeline object
xml = gcamdata.create_xml("bev_tuned.xml")
# have the tuning directives add XML data
xml = bev_target.export_to_xml(xml, ans)
# finally run the conversion and save the XML
gcamdata.run_xml_conversion(xml)

## Challenges Remain!
Our initial design has proven effective for tuning relatively straightforward outcomes such as BEV deployment or service demands, even when tuning multiple targets simultaneously in a full GCAM scenario.  However, we recognize many challenges lie ahead as we start to think about more complex or interconnected targets.

Especially if:

* There could be multiple parameters (share weight and/or cost adders) that could affect a desired outcome.  How do we tell which to use, and do so in a way that minimizes "unintended consequences"?
* How to detect / avoid over-fitting?  Do we need to think about parameter choice across scenarios and not just in a Reference?
* Where there are strong inter-connections between tuning directives and/or our initial guess is not "close" we may need to come up with a way to approximate a derivative in a manner which is computationally tractable.
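
For reference, the most direct way to approximate a derivative is a forward finite-difference Jacobian, sketched below on a toy function.  The function name and step-size rule are assumptions for illustration.  It needs one extra evaluation of `F` per parameter, which is the cost that becomes a concern when each evaluation is a GCAM run.

```python
import numpy as np


def finite_difference_jacobian(F, x, rel_step=1e-6):
    """Approximate dF/dx column by column using forward differences."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(F(x), dtype=float)
    J = np.empty((f0.size, x.size))
    for j in range(x.size):
        # scale the perturbation to the size of the parameter
        h = rel_step * max(abs(x[j]), 1.0)
        x_pert = x.copy()
        x_pert[j] += h
        J[:, j] = (np.asarray(F(x_pert), dtype=float) - f0) / h
    return J


def toy_outputs(x):
    # toy function with a known Jacobian: [[x1, x0], [1, 2 * x1]]
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])


J_approx = finite_difference_jacobian(toy_outputs, [2.0, 3.0])
print(J_approx)
```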

## Implications for "Core" GCAM Scenarios
Our survey of GCAM users on this subject reflected a desire to gather, organize, and maintain over time a database of, essentially, tuning directives and their related datasets, which can be applied and re-applied as GCAM naturally evolves.  These would then be used to produce "tuned" scenarios which are distributed as the GCAM defaults.

At this stage little progress has been made towards this request.  However, it remains an open area of thought, as well as a natural place for collaboration with regional partners to improve modeling in specific GCAM regions.