# Pre-filtering data
The crunchers use the relationships between variables in the infiller database. However, these relationships may rest on political and economic assumptions very different from those of the scenarios you wish to infill. It can therefore help to feed only a subset of the downloaded data into the cruncher, either by leaving out models whose assumptions differ radically from yours or by selecting only scenarios that are similar in some way.
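
As a rough illustration of the two kinds of pre-filtering described above, the sketch below selects runs by wildcard pattern using only the standard library. The run list and the `select_runs` helper are hypothetical and exist only for this example; in the notebook itself the selection is done with pyam's own `filter` method, whose matching rules may differ from `fnmatch`.

```python
from fnmatch import fnmatchcase

# Hypothetical (model, scenario) pairs standing in for an infiller database.
runs = [
    ("MESSAGE-GLOBIOM 1.0", "SSP2-45"),
    ("AIM/CGE 2.0", "SSP2-34"),
    ("WITCH-GLOBIOM 3.1", "SSP4-26"),
    ("GCAM 4.2", "SSP5-Baseline"),
]


def select_runs(runs, model_patterns=("*",), scenario_patterns=("*",)):
    """Keep runs whose model and scenario each match at least one pattern."""
    return [
        (model, scenario)
        for model, scenario in runs
        if any(fnmatchcase(model, p) for p in model_patterns)
        and any(fnmatchcase(scenario, p) for p in scenario_patterns)
    ]


# Leave out a model family by listing only the ones to keep.
without_gcam = select_runs(runs, model_patterns=["MESSAGE*", "AIM*", "WITCH*"])

# Keep only scenarios that share a label.
only_ssp2 = select_runs(runs, scenario_patterns=["SSP2*"])
```

Either subset could then be passed to a cruncher in place of the full database.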

## Imports

In [1]:
import os.path
import traceback

import pandas as pd
import pyam
import matplotlib.pyplot as plt
import numpy as np

import silicone.database_crunchers
from silicone.utils import (
    _get_unit_of_variable,
    find_matching_scenarios,
    _make_interpolator,
    _make_wide_db,
    download_or_load_sr15,
)

In [2]:
valid_model_ids = [
    "MESSAGE*",
    "AIM*",
    "C-ROADS*",
    "GCAM*",
    "WITCH*",
]
sr15_data = download_or_load_sr15("./sr15_scenarios.csv", valid_model_ids)

pyam - INFO: Running in a notebook, setting up a basic logging at level INFO
pyam.core - INFO: Reading file sr15_scenarios.csv


## Filtering

A simple way to filter for similarity is to complete SSP-labelled scenarios using only scenarios with the same SSP label. The silicone package has a function for detecting which of a group of scenarios produces an interpolation that best matches a dataset. Using this tool, we see that the CO$_2$-CH$_4$ relations in some SSP2 scenarios in the AIM data are more similar to SSP3/SSP1 data from MESSAGE models. The examples below show how to use it.

In [3]:
data_to_classify = sr15_data.filter(model="AIM/CGE 2.0", scenario="SSP2-34")
data_to_search = sr15_data.filter(model=["MESSAGE*", "WITCH*"])
possible_ssps = ["SSP1*", "SSP2*", "SSP3*", "SSP4*", "SSP5*"]
find_matching_scenarios(
    data_to_search,
    data_to_classify,
    "Emissions|CH4",
    ["Emissions|CO2"],
    possible_ssps,
)

('*', 'SSP2*')
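
To give a feel for what "an interpolation that best matches a dataset" can mean, here is a minimal sketch in plain Python. It is not silicone's implementation: the distance measure (root-sum-square of the gaps), the helper names and all of the numbers are assumptions made for illustration only.

```python
def interpolate(xs, ys, x):
    """Piecewise-linear interpolation; xs must be strictly increasing.

    Values outside the range of xs are clamped to the end points.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            y0, y1 = ys[i], ys[i + 1]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def mismatch(candidate, target):
    """Root-sum-square gap between the target's follower values and the
    values the candidate's interpolation predicts at the same lead values."""
    xs, ys = zip(*sorted(candidate))
    return sum((interpolate(xs, ys, x) - y) ** 2 for x, y in target) ** 0.5


# Hypothetical (CO2, CH4) points for two candidate scenario groups.
candidates = {
    "SSP1*": [(0.0, 100.0), (20000.0, 200.0), (40000.0, 300.0)],
    "SSP2*": [(0.0, 150.0), (20000.0, 300.0), (40000.0, 450.0)],
}
# Hypothetical (CO2, CH4) points for the scenario to classify.
target = [(10000.0, 220.0), (30000.0, 380.0)]

# Rank candidate groups from closest to furthest.
ranked = sorted(candidates, key=lambda name: mismatch(candidates[name], target))
```

The group whose interpolated CH$_4$ values sit closest to the target's actual values ranks first, which mirrors the idea behind the filter returned above.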

You can break down results by models and scenarios at the same time:

In [4]:
possible_models = ["MESSAGE*", "WITCH*"]
find_matching_scenarios(
    data_to_search,
    data_to_classify,
    "Emissions|CH4",
    ["Emissions|CO2"],
    possible_ssps,
    possible_models,
)



('MESSAGE*', 'SSP2*')

The answer returned is a tuple containing first the model filter ("*" if none was selected), then the scenario filter. In this case, SSP2 matches best. The results can also be returned in numerical form, which quantifies how much better the best match is. They are reported in increasing order of distance, so the top result is the closest. If a model/scenario combination has no data, a warning is displayed and the 'distance' is reported as infinity. To get the numerical results, set "return_all_info" to True:

In [5]:
find_matching_scenarios(
    data_to_search,
    data_to_classify,
    "Emissions|CH4",
    ["Emissions|CO2"],
    possible_ssps,
    return_all_info=True,
)

[(('*', 'SSP2*'), 14517.36329466991),
 (('*', 'SSP1*'), 16694.15692871152),
 (('*', 'SSP4*'), 20872.869446778597),
 (('*', 'SSP5*'), 25842.838070253136),
 (('*', 'SSP3*'), 48537.48868242612)]

In [6]:
find_matching_scenarios(
    data_to_search,
    data_to_classify,
    "Emissions|CH4",
    ["Emissions|CO2"],
    possible_ssps,
    possible_models,
    return_all_info=True,
)



[(('MESSAGE*', 'SSP2*'), 14355.224200509294),
 (('MESSAGE*', 'SSP1*'), 20613.85546902908),
 (('WITCH*', 'SSP4*'), 20872.869446778597),
 (('WITCH*', 'SSP2*'), 23523.40135904596),
 (('WITCH*', 'SSP5*'), 25842.838070253136),
 (('WITCH*', 'SSP1*'), 30654.27642887084),
 (('WITCH*', 'SSP3*'), 50374.836302783006),
 (('MESSAGE*', 'SSP3*'), 58008.896480477604),
 (('MESSAGE*', 'SSP4*'), inf),
 (('MESSAGE*', 'SSP5*'), inf)]

Here we see that a specific SSP2 scenario in one model does not necessarily match up best with SSP2 scenarios in other models: SSP1 scenarios in MESSAGE and SSP4 scenarios in WITCH are both a closer match in this space than SSP2 scenarios in WITCH.
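
The ranked list above can be post-processed with ordinary Python. The sketch below copies the reported distances, drops the combinations that had no data, and picks the best scenario filter for each model; the variable names are illustrative and not part of silicone.

```python
import math

# Distances copied from the output above.
results = [
    (("MESSAGE*", "SSP2*"), 14355.224200509294),
    (("MESSAGE*", "SSP1*"), 20613.85546902908),
    (("WITCH*", "SSP4*"), 20872.869446778597),
    (("WITCH*", "SSP2*"), 23523.40135904596),
    (("WITCH*", "SSP5*"), 25842.838070253136),
    (("WITCH*", "SSP1*"), 30654.27642887084),
    (("WITCH*", "SSP3*"), 50374.836302783006),
    (("MESSAGE*", "SSP3*"), 58008.896480477604),
    (("MESSAGE*", "SSP4*"), math.inf),
    (("MESSAGE*", "SSP5*"), math.inf),
]

# Combinations without data are reported as infinity; drop them.
finite = [(key, dist) for key, dist in results if math.isfinite(dist)]

# The list is sorted by distance, so the first entry seen for each model
# is that model's closest scenario filter.
best_per_model = {}
for (model, scenario), dist in finite:
    best_per_model.setdefault(model, (scenario, dist))

# Distance of every combination relative to the overall best match.
best_distance = finite[0][1]
relative = {key: dist / best_distance for key, dist in finite}
```

Here `best_per_model` makes the point in the text explicit: the closest WITCH group is SSP4, not SSP2.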

In some cases, we may wish to ignore the initial differences and look only for the closest trendlines, i.e. match the differentials. This is equivalent to setting all initial values to the same number and then performing the above analysis. It can be done by setting the "use_change_not_abs" argument to True. Unfortunately, this requires a slightly more consistent database than the absolute-value comparison does, because the initial point must be subtracted from every series. In the example below, the first call raises a KeyError for the year 2015, so the second call drops that year from both datasets.
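
The difference between matching absolute values and matching changes can be sketched with toy numbers. Everything here is hypothetical: the series, the helper names and the root-sum-square distance are assumptions for illustration, not silicone's internals.

```python
def to_change(series):
    """Express a {year: value} series as change relative to its first year."""
    first_year = min(series)
    return {year: value - series[first_year] for year, value in series.items()}


def distance(a, b):
    """Root-sum-square gap over the years of a.

    Raises KeyError if b lacks one of those years, which is why the
    series being compared need consistent time points.
    """
    return sum((a[year] - b[year]) ** 2 for year in a) ** 0.5


# Hypothetical emissions pathways.
target = {2020: 300.0, 2030: 280.0, 2040: 250.0}
same_trend = {2020: 400.0, 2030: 380.0, 2040: 350.0}  # offset, identical trend
same_level = {2020: 300.0, 2030: 310.0, 2040: 320.0}  # same start, other trend

# Absolute comparison: the pathway with the same starting level is closer.
abs_same_trend = distance(target, same_trend)
abs_same_level = distance(target, same_level)

# Change comparison: subtracting the initial value removes the offset,
# so the pathway with the identical trend becomes the closer one.
change_same_trend = distance(to_change(target), to_change(same_trend))
change_same_level = distance(to_change(target), to_change(same_level))
```

The ranking flips between the two comparisons, which is the effect the "use_change_not_abs" option is meant to expose.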

In [7]:
try:
    find_matching_scenarios(
        data_to_search,
        data_to_classify,
        "Emissions|CH4",
        ["Emissions|CO2"],
        possible_ssps,
        possible_models,
        return_all_info=True,
        use_change_not_abs=True,
    )
except KeyError as w:
    print("Key error for: ", w)

Key error for:  ('AIM/CGE 2.0', 'SSP2-34', 2015)


In [8]:
find_matching_scenarios(
    data_to_search.filter(year=2015, keep=False),
    data_to_classify.filter(year=2015, keep=False),
    "Emissions|CH4",
    ["Emissions|CO2"],
    possible_ssps,
    possible_models,
    return_all_info=True,
    use_change_not_abs=True,
)



[(('MESSAGE*', 'SSP1*'), 12997.669096039546),
 (('WITCH*', 'SSP5*'), 17260.059013121267),
 (('WITCH*', 'SSP4*'), 18136.618660962344),
 (('WITCH*', 'SSP2*'), 19634.197414024595),
 (('WITCH*', 'SSP1*'), 28837.13352613093),
 (('MESSAGE*', 'SSP2*'), 29675.623275766597),
 (('WITCH*', 'SSP3*'), 42931.63612144407),
 (('MESSAGE*', 'SSP3*'), 99308.44011621371),
 (('MESSAGE*', 'SSP4*'), inf),
 (('MESSAGE*', 'SSP5*'), inf)]

So in terms of differentials, SSP2 scenarios do not match up very well. Since the original scenario was an SSP2 scenario, this shows that filtering by SSP label is not necessarily the most appropriate choice. We will use MESSAGE data to perform the calculations in later chapters.