# Correlated and dependent sampling with datapackages

Author: [Aleksandra Kim](https://github.com/aleksandra-kim/)

Brightway 2.5 Autumn school

Grosshöchstetten, Switzerland, 2022

# Uncertainties in life cycle assessment models

<img src="unct_propagation.png" width=1600 />

# Analytical uncertainty propagation

- Variance of the output can be analytically derived knowing variances of the inputs, covariances between the inputs, and model structure.
- Is computationally light!
- But can only be easily derived for simple model structures.
- And is restricted to a single measure - variance.
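
For a linear model the analytical route fits in a few lines. The sketch below is illustrative only: the model `y = 2*x1 + 3*x2`, the variances and the covariance are made-up numbers, and `linear_output_variance` is a hypothetical helper, not part of Brightway.

```python
def linear_output_variance(coeffs, variances, covariances):
    """Variance of y = sum_i a_i * x_i.

    ``covariances`` maps index pairs (i, j) with i < j to Cov(x_i, x_j);
    missing pairs are treated as uncorrelated.
    """
    n = len(coeffs)
    # Contribution of each input on its own: a_i^2 * Var(x_i)
    var_y = sum(a * a * v for a, v in zip(coeffs, variances))
    # Cross terms: 2 * a_i * a_j * Cov(x_i, x_j)
    for i in range(n):
        for j in range(i + 1, n):
            var_y += 2 * coeffs[i] * coeffs[j] * covariances.get((i, j), 0.0)
    return var_y


# Hypothetical model y = 2*x1 + 3*x2
coeffs = [2.0, 3.0]
variances = [0.25, 0.04]  # Var(x1), Var(x2)

independent = linear_output_variance(coeffs, variances, {})
correlated = linear_output_variance(coeffs, variances, {(0, 1): 0.05})
print(independent, correlated)
```

With these numbers the positive covariance raises the output variance from about 1.36 to about 1.96, so ignoring it would understate the uncertainty.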

# Numerical uncertainty propagation
- Uses Monte Carlo simulations, so the LCA model is recomputed many times with varying values of the model inputs.
- Does not assume any model structure.
- But is computationally heavy...
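
As a toy illustration of the numerical route, the sketch below estimates the output variance of a made-up model `y = 2*x1 + 3*x2` by sampling. The means, standard deviations and the correlation `rho` are assumptions chosen for the example, and `sample_output_variance` is a hypothetical helper.

```python
import random
import statistics


def sample_output_variance(n_samples, rho, seed=42):
    """Monte Carlo estimate of Var(y) for y = 2*x1 + 3*x2 with correlated normal inputs."""
    rng = random.Random(seed)
    sd1, sd2 = 0.5, 0.2  # hypothetical standard deviations of x1 and x2
    ys = []
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = 10 + sd1 * z1
        # Mixing z1 into x2 gives corr(x1, x2) == rho
        x2 = 5 + sd2 * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2)
        ys.append(2 * x1 + 3 * x2)  # "recompute the model" for this sample
    return statistics.variance(ys)


var_independent = sample_output_variance(20_000, rho=0.0)
var_correlated = sample_output_variance(20_000, rho=0.5)
print(var_independent, var_correlated)
```

The estimates only approach the exact variances as the number of samples grows, which is where the computational cost comes from.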

# Either way, it is important to account for correlations and dependencies between model inputs.

#### Goal: Learn how numerical uncertainty propagation can be done with datapackages.

# Examples

- Parameterized inventories
- Carbon balancing in combustion processes -> Example
- Implicit correlations in measurement data -> Exercise

Switch to kernel `bw25`!

# Contents

1. [Monte Carlo simulations with datapackages](#1_mc)
2. [Carbon balancing example](#2_carbon)
3. [How to add real measurements as MC samples](#3_entso)

<a id='1_mc'></a>
# 1. Monte Carlo simulations with datapackages 

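There are two ways of running Monte Carlo simulations when exchanges are uncertain: (A) with a datapackage that stores uncertainty distributions, and (B) with a datapackage that stores arrays of presampled values.

Previously we looked at scenario uncertainty. Let's now look at parametric uncertainty (uncertainty in exchanges).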

In [None]:
# Brightway packages
import bw2data as bd
import bw2io as bi
import bw2calc as bc
import bw_processing as bwp

# General packages
import numpy as np
import math
from collections import defaultdict
from fs.zipfs import ZipFS

# Visualization
import plotly.graph_objects as go

# Uncertainties
import stats_arrays as sa

## A. Datapackage with uncertainty distributions

In [None]:
# Technosphere
t_data = np.array([
    1,   # production of natural gas
    1,   # production of carbon fibre
    1,   # production of bike
    237, # input of natural gas
    2.5, # input of carbon fibre
])
t_indices = np.array([
    (101, 101), # production of natural gas
    (102, 102), # production of carbon fibre
    (103, 103), # production of bike
    (101, 102), # input of natural gas
    (102, 103), # input of carbon fibre
    ], 
    dtype=bwp.INDICES_DTYPE
)
t_flip = np.array([False, False, False, True, True]) # Numerical sign of the inputs needs to be flipped negative

In [None]:
# Let's see how distributions can be specified
bwp.UNCERTAINTY_DTYPE

In [None]:
t_distributions = np.array([
    (sa.UndefinedUncertainty.id, 1, np.nan, np.nan, np.nan, np.nan, np.nan),
    (sa.TriangularUncertainty.id, 1, np.nan, np.nan, 0.95, 1.2, np.nan),
    (sa.UniformUncertainty.id, 1, np.nan, np.nan, 0.9, 1.1, np.nan),
    (sa.NormalUncertainty.id, 237, 30, np.nan, np.nan, np.nan, np.nan),
    (sa.NormalUncertainty.id, 2.5, 0.7, np.nan, np.nan, np.nan, np.nan),
    ],
    dtype=bwp.UNCERTAINTY_DTYPE
)
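
The positional tuples can be hard to read, so here is a sketch that builds the same kind of rows with named fields. The field names and the distribution ids below are assumptions about `bwp.UNCERTAINTY_DTYPE` and `stats_arrays`; compare them with the output of the previous cell before relying on them. `make_row` is a hypothetical helper.

```python
import numpy as np

# Local stand-in, assumed to mirror bwp.UNCERTAINTY_DTYPE
UNCERTAINTY_DTYPE = [
    ("uncertainty_type", np.uint8),
    ("loc", np.float32),
    ("scale", np.float32),
    ("shape", np.float32),
    ("minimum", np.float32),
    ("maximum", np.float32),
    ("negative", bool),
]

# Assumed values of sa.NormalUncertainty.id and sa.TriangularUncertainty.id
NORMAL_ID, TRIANGULAR_ID = 3, 5


def make_row(uncertainty_type, loc, scale=np.nan, shape=np.nan,
             minimum=np.nan, maximum=np.nan, negative=False):
    """Return one tuple in field order; unused parameters stay NaN."""
    return (uncertainty_type, loc, scale, shape, minimum, maximum, negative)


rows = np.array(
    [
        make_row(NORMAL_ID, 237, scale=30),                       # mean 237, std 30
        make_row(TRIANGULAR_ID, 1, minimum=0.95, maximum=1.2),    # mode 1
    ],
    dtype=UNCERTAINTY_DTYPE,
)
print(rows["loc"], rows["scale"])
```

Accessing columns by name, e.g. `rows["minimum"]`, makes it easier to spot a value that landed in the wrong field.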

In [None]:
# Biosphere
b_data = np.array([26.6])
b_indices = np.array([
    (201, 102), # emission of CO2
    ], 
    dtype=bwp.INDICES_DTYPE
)

In [None]:
# Characterization
c_data = np.array([1])
c_indices = np.array([
    (201, 201), # CF of CO2
    ], 
    dtype=bwp.INDICES_DTYPE
)

In [None]:
dp_distributions = bwp.create_datapackage()
dp_distributions.add_persistent_vector(
    matrix='technosphere_matrix',
    indices_array=t_indices,
    data_array=t_data,
    distributions_array=t_distributions,
    flip_array=t_flip,
)
dp_distributions.add_persistent_vector(
    matrix='biosphere_matrix',
    indices_array=b_indices,
    data_array=b_data,
)
dp_distributions.add_persistent_vector(
    matrix='characterization_matrix',
    indices_array=c_indices,
    data_array=c_data,
)

In [None]:
bike = 103
lca_a = bc.LCA(
    demand={bike: 1},
    data_objs=[dp_distributions],
    use_distributions=True,
    seed_override=42,
)
lca_a.lci()
lca_a.lcia()
lca_a.score

In [None]:
%%time
iterations = 200
scores_a = np.array([lca_a.score for _ in zip(range(iterations), lca_a)])
scores_a
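
The `zip(range(iterations), lca_a)` idiom may look unusual. The sketch below reproduces it with a hypothetical `ToyModel` class, assuming (as the cell above suggests) that each `next()` on the LCA object draws new values and updates `score`. Nothing here uses Brightway itself.

```python
import random


class ToyModel:
    """Hypothetical stand-in for an LCA object: every ``next()`` resamples ``score``."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.score = 0.0

    def __iter__(self):
        return self

    def __next__(self):
        self.score = self.rng.gauss(10.0, 1.0)
        return self.score


model = ToyModel(seed=42)
iterations = 5
# zip pulls from range() first and stops as soon as it is exhausted,
# so the model is advanced exactly ``iterations`` times.
scores = [model.score for _ in zip(range(iterations), model)]
print(scores)
```

Putting `range(iterations)` first matters: with the arguments swapped, `zip` would advance the model one extra time before noticing that the range is exhausted.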

## B. Datapackage with arrays

In [None]:
t_data_array = np.array([
    [1, 1.1, 0.9],   # production of natural gas
    [1, 2, 1.5],   # production of carbon fibre
    [1, 1, 1],   # production of bike
    [230, 240, 200], # input of natural gas
    [2.8, 2.7, 2.3], # input of carbon fibre
])

In [None]:
dp_arrays = bwp.create_datapackage(sequential=True, seed=25323)
dp_arrays.add_persistent_array(
    matrix='technosphere_matrix',
    indices_array=t_indices,
    data_array=t_data_array,
    flip_array=t_flip,
)
dp_arrays.add_persistent_vector(
    matrix='biosphere_matrix',
    indices_array=b_indices,
    data_array=b_data,
)
dp_arrays.add_persistent_vector(
    matrix='characterization_matrix',
    indices_array=c_indices,
    data_array=c_data,
)
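
Each column of `t_data_array` is one joint sample of all five exchanges, which is what preserves correlations between them. The sketch below contrasts sequential and random column selection with plain NumPy. How `bw_processing` advances the column index internally is not shown here; the wrap-around in `sequential_columns` and both helper functions are assumptions for illustration.

```python
import numpy as np

t_data_array = np.array([
    [1, 1.1, 0.9],    # production of natural gas
    [1, 2, 1.5],      # production of carbon fibre
    [1, 1, 1],        # production of bike
    [230, 240, 200],  # input of natural gas
    [2.8, 2.7, 2.3],  # input of carbon fibre
])


def sequential_columns(array, iterations):
    """Yield columns in order, wrapping around after the last one (assumed behaviour)."""
    ncols = array.shape[1]
    for i in range(iterations):
        yield array[:, i % ncols]


def random_columns(array, iterations, seed):
    """Yield whole columns picked at random, so rows are never mixed across columns."""
    rng = np.random.default_rng(seed)
    ncols = array.shape[1]
    for _ in range(iterations):
        yield array[:, rng.integers(ncols)]


seq = list(sequential_columns(t_data_array, 4))
rnd = list(random_columns(t_data_array, 4, seed=25323))
print(seq[0], seq[3])
```

In both cases one iteration uses a single column for all exchanges, in contrast to sampling each exchange independently from its own distribution.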

In [None]:
lca_b = bc.LCA(
    demand={bike: 1},
    data_objs=[dp_arrays],
    use_distributions=False,
    use_arrays=True,
#     seed_override=42,  # Seed should not be used
)
lca_b.lci()
lca_b.lcia()
lca_b.score

In [None]:
lca_b.keep_first_iteration()
iterations = 10
scores_b = [lca_b.score for _ in zip(range(iterations), lca_b)]
scores_b

## Plot LCIA scores

In [None]:
num_bins = 60
bins_ = np.linspace(min(scores_a), max(scores_a), num_bins, endpoint=True)

freqa, _ = np.histogram(scores_a, bins=bins_, density=False)

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=0.5 * (bins_[:-1] + bins_[1:]),  # bin centres: one x value per histogram count
        y=freqa,
        name="MC with distributions",
        showlegend=True,
    )
)
fig = fig.update_xaxes(title='LCIA scores')
fig = fig.update_yaxes(title='Frequency')

# fig.show()

<img src="mc_distribution.png" width=1600 />

<a id='2_carbon'></a>
# 2. Carbon balancing example

## BW setup

In [None]:
bd.projects.set_current('Correlated and dependent sampling')

In [None]:
# === Option 1 ===
# Usual BW import

# bi.bw2setup()

# ei_path = "/Users/akim/Documents/LCA_files/ecoinvent_38_cutoff/datasets"
# ei_name = "ecoinvent 3.8 cutoff"
# if ei_name in bd.databases:
#     print("ecoinvent database already exists")
# else:
#     ei = bi.SingleOutputEcospold2Importer(ei_path, ei_name)
#     ei.apply_strategies()
#     assert ei.all_linked
#     ei.write_database()
    
# # To backup a project use this command. Note that it will save project in your home directory!
# bi.backup_project_directory('Correlated and dependent sampling')

In [None]:
# === Option 2 ===
# Restoring BW project from a backup. This option is faster and what you should be using in this class.

bi.restore_project_directory(
    f"/srv/data/brightway2-project-Correlated and dependent sampling-backup.19-October-2022-11-08AM.tar.gz"
)
# bi.restore_project_directory(
#     f"brightway2-project-Correlated and dependent sampling-backup.19-October-2022-11-08AM.tar.gz"
# )

bd.databases

## Finding liquid fuel combustors in ecoinvent

In [None]:
ei = bd.Database("ecoinvent 3.8 cutoff")

Other combustion processes measure fuels in megajoules; these will be addressed later.

In [None]:
flows = ('market for diesel,', 'diesel,', 'petrol,', 'market for petrol,')
liquid_fuels = [
    x
    for x in ei
    if x['unit'] == 'kilogram'
    and (any(x['name'].startswith(flow) for flow in flows) or x['name'] == 'market for diesel')
]
{x['name'] for x in liquid_fuels}

### Look into modelling specifics of a fuel activity

In [None]:
petrol = [
    act for act in ei if 'market for petrol, low-sulfur' in act['name'] 
    and 'Europe without Switzerland'==act['location']
][0]
production = list(petrol.production())[0]

print(production['properties']['carbon content'])

In [None]:
# One of the consumers of this liquid fuel activity
consumer_exchange = list(petrol.consumers())[0]
consumer = consumer_exchange.output
consumer_exchange

## Carbon dioxide emissions

In [None]:
# All carbon dioxide emissions from fossil fuels in the biosphere
co2_flows = [x for x in bd.Database('biosphere3') if x['name'] == 'Carbon dioxide, fossil']
co2_flows

In [None]:
# Let's see CO2 emissions of the current consumer:
for exc in consumer.biosphere():
    if exc.input in co2_flows:
        print(exc.input, exc.amount)

In [None]:
total_co2 = sum([exc['amount'] for exc in consumer.biosphere() if exc.input in co2_flows])
total_co2

In [None]:
# Based on stoichiometry, the total CO2 is:
consumer_exchange['amount'] / production['amount'] * \
    production['properties']['carbon content']['amount'] * (12 + 16 * 2) / 12

In [None]:
# These numbers don't match because there is a second petrol input.
[exc for exc in consumer.technosphere() if exc.input in liquid_fuels]

## Carbon balancing in the static LCIA computations

In [None]:
# Let's write a validation function.
def carbon_fuel_emissions_balanced(activity, fuels, co2):
    """Check that in the static case, carbon balance in combustion processes is preserved.
    
    Returns a ``bool``."""
    try:
        total_carbon = sum(
            # Carbon content amount is fraction of mass, unitless
            exc['amount'] * exc['properties']['carbon content']['amount'] 
            for exc in activity.technosphere() 
            if exc.input in fuels)
    except KeyError:  # in case some of the fuels do not have carbon content information
        return False
    conversion = 12 / (12 + 16 * 2)
    total_carbon_in_co2 = sum(
        exc['amount'] * conversion
        for exc in activity.biosphere()
        if exc.input in co2
    )
    print(total_carbon, total_carbon_in_co2)
    return math.isclose(total_carbon, total_carbon_in_co2, rel_tol=1e-06, abs_tol=1e-3)

In [None]:
# Basically, if we find a consumer of liquid fuels, we should be able to estimate the amount of 
# emitted carbon dioxide. We can see that the balance is preserved in the static calculations.
carbon_fuel_emissions_balanced(consumer, liquid_fuels, co2_flows)

## Carbon balancing in the presence of uncertainties

In the presence of uncertainties, only fuels should vary, and the carbon dioxide emissions should be rescaled to satisfy the carbon balance equations. 

For that, let's write a few functions that rescale the amount of CO2.

In [None]:
def get_samples_and_scaling_vector(activity, fuels, size=10, seed=None):
    """Draw ``size`` samples from technosphere exchanges for ``activity`` whose inputs are in ``fuels``.
    
    Returns:
        * Numpy indices array with shape ``(len(found_exchanges),)``
        * Numpy flip array with shape ``(len(found_exchanges),)``
        * Numpy data array with shape ``(len(found_exchanges), size)``
        * Scaling vector with relative total carbon consumption and shape ``(size,)``.
    """
    
    # Find exchanges with liquid fuel inputs
    exchanges = [exc for exc in activity.technosphere() if exc.input in fuels]
    # Save indices and flip arrays of these exchanges to generate a datapackage later on
    indices = np.array([(exc.input.id, exc.output.id) for exc in exchanges], dtype=bwp.INDICES_DTYPE)
    flip = np.ones(indices.shape, dtype=bool)
    # Generate independent samples for all liquid fuel exchanges
    sample = sa.MCRandomNumberGenerator(
        sa.UncertaintyBase.from_dicts(*[exc.as_dict() for exc in exchanges]), 
        seed=seed
    ).generate(samples=size)
    
    # Save total carbon in the static case
    static_total = sum(exc['amount'] * exc['properties']['carbon content']['amount'] for exc in exchanges)
    # Save carbon content of each liquid fuel exchange
    carbon_content = np.array([exc['properties']['carbon content']['amount'] for exc in exchanges]).reshape((-1, 1))
    # Save total carbon in the stochastic case for all samples
    carbon_total_per_sample = (sample * carbon_content).sum(axis=0).ravel()
    # Compute fraction of carbon compared to the static case
    # Scaling vector is needed to rescale carbon dioxide exchanges accordingly in the next step
    scaling = carbon_total_per_sample / static_total
    assert carbon_total_per_sample.shape == (size,)
    
    return indices, flip, sample, scaling

In [None]:
def rescale_biosphere_exchanges_by_scaling_vector(activity, scaling, flows):
    """Rescale biosphere exchanges with flows ``flows`` from ``activity`` by vector ``scaling``.
    
    ``flows`` are biosphere flow objects, with e.g. all the CO2 flows, but also other flows such as metals, 
    volatile organics, etc. 
    Only rescales flows in ``flows`` which are present in ``activity`` exchanges.
    
    Assumes the static values are balanced, i.e. won't calculate CO2 emissions from carbon in 
    fuels but just rescales given values.

    Returns:
        * Numpy indices array with shape ``(len(found_exchanges),)``
        * Numpy flip array with shape ``(len(found_exchanges),)``
        * Numpy data array with shape ``(len(found_exchanges), len(scaling))``
    """
    indices, data = [], []
    assert isinstance(scaling, np.ndarray) and len(scaling.shape) == 1
    
    for exc in activity.biosphere():
        if exc.input in flows:
            indices.append((exc.input.id, exc.output.id))
            data.append(scaling * exc['amount'])
    
    indices = np.array(indices, dtype=bwp.INDICES_DTYPE)
    flip = np.zeros(len(indices), dtype=bool)
    data = np.vstack(data)
            
    return indices, flip, data

In [None]:
nsamples = 10

lf_indices, lf_flip, lf_data, lf_scaling = get_samples_and_scaling_vector(
    consumer, liquid_fuels, size=nsamples, seed=42
)

co2_indices, co2_flip, co2_sample = rescale_biosphere_exchanges_by_scaling_vector(
    consumer, lf_scaling, co2_flows
)
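
To see why this rescaling keeps the carbon balance in every sample, here is a self-contained toy version with made-up numbers: two fuels, two CO2 flows, and uniform perturbations instead of the ecoinvent distributions. All values and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
size = 5

static_fuel = np.array([0.02, 0.01])            # kg of two hypothetical fuels
carbon_content = np.array([[0.865], [0.858]])   # mass fractions, shape (2, 1)
static_carbon = float((static_fuel * carbon_content.ravel()).sum())
# Balanced static CO2 emissions, split over two hypothetical CO2 flows
static_co2 = static_carbon * 44 / 12 * np.array([0.7, 0.3])

# Perturb the fuel inputs: rows are exchanges, columns are samples
fuel_sample = static_fuel.reshape((-1, 1)) * rng.uniform(0.8, 1.2, size=(2, size))

# Relative carbon input per sample, then rescale every CO2 flow by it
scaling = (fuel_sample * carbon_content).sum(axis=0) / static_carbon
co2_sample = np.vstack([scaling * amount for amount in static_co2])

# Check the balance column by column
carbon_in = (fuel_sample * carbon_content).sum(axis=0)
carbon_out = co2_sample.sum(axis=0) * 12 / 44
balanced = np.allclose(carbon_in, carbon_out)
print(balanced)
```

The balance holds per column only because the static values were balanced to begin with, which is the same assumption the rescaling function above makes.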

## Create datapackage with uncertain fuels and balanced carbon

In [None]:
dp_carbon = bwp.create_datapackage(sequential=True)
dp_carbon.add_persistent_array(
    matrix='technosphere_matrix',
    indices_array=lf_indices,
    data_array=lf_data,
    flip_array=lf_flip,
)
dp_carbon.add_persistent_array(
    matrix='biosphere_matrix',
    indices_array=co2_indices,
    data_array=co2_sample,
    flip_array=co2_flip,
)

## Run Monte Carlo simulations

In [None]:
ei = bd.Database("ecoinvent 3.8 cutoff")

ch_low = [
    act for act in ei if act['name'] == "market for electricity, low voltage" and "CH" == act['location']
][0]
ipcc = ('IPCC 2013', 'climate change', 'GWP 100a')

fu, data_objs, _ = bd.prepare_lca_inputs({ch_low: 1}, method=ipcc, remapping=False)
lca = bc.LCA(
    demand=fu, 
    data_objs=(
        data_objs
    ),
    use_arrays=True,
    use_distributions=True,
    seed_override=42,
)
lca.lci()
lca.lcia()

In [None]:
%%time
iterations = 10
scores = np.array([lca.score for _ in zip(range(iterations), lca)])
scores

In [None]:
lca_carbon = bc.LCA(
    demand=fu, 
    data_objs=(
        data_objs + [dp_carbon]
    ),
    use_arrays=True,
    use_distributions=True,
    seed_override=42,
)
lca_carbon.lci()
lca_carbon.lcia()

In [None]:
%%time
scores_carbon = np.array([lca_carbon.score for _ in zip(range(iterations), lca_carbon)])
scores_carbon

In [None]:
# These differences are small, but try adding more liquid fuel consumers!
scores_carbon - scores

## Exercise

#### 1. Create datapackage with uncertain fuels and balanced carbon, but now for all activities in ecoinvent that consume liquid fuels
#### 2. Compare LCIA scores with and without carbon balancing

<a id='3_entso'></a>
# 3. How to add real measurements as Monte Carlo samples on the example of ENTSO-E electricity data

ENTSO-E is a European association for the cooperation of transmission system operators (TSOs) for electricity.

[ENTSO-E Transparency platform](https://transparency.entsoe.eu/) collects and publishes electricity generation, transportation and consumption data and information for the pan-European market.

It is possible to export data from the Transparency platform or query it with Python client [entsoe-py](https://github.com/EnergieID/entsoe-py).

In this exercise we provide a time series datapackage for the years 2019-2021. We used ENTSO-E data to overwrite the low-, medium-, and high-voltage market mixes for 32 European countries.

Generation categories were matched and disaggregated to ecoinvent 3.8 cutoff activities. Disaggregation was necessary in cases where ENTSO-E only listed one generation type (e.g. "Hydro Water Reservoir" was allocated to both alpine and non-alpine reservoirs), and was done using annual production volumes from ecoinvent as allocation factors. 
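
The disaggregation step amounts to a proportional split. The sketch below shows the idea with invented numbers; the activity names, production volumes and the `disaggregate` helper are hypothetical and are not the values used to build the datapackage.

```python
def disaggregate(total_generation, production_volumes):
    """Split one aggregated generation value using production volumes as allocation factors."""
    total_volume = sum(production_volumes.values())
    return {
        name: total_generation * volume / total_volume
        for name, volume in production_volumes.items()
    }


# Hypothetical annual production volumes of two reservoir activities
volumes = {
    "hydro, reservoir, alpine region": 3.0e9,
    "hydro, reservoir, non-alpine region": 1.0e9,
}
# Hypothetical value reported for "Hydro Water Reservoir" in one hour
shares = disaggregate(800.0, volumes)
print(shares)
```

The split sums back to the reported total, so the market mix is unchanged at the aggregated level.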

## Restore BW project

This project differs slightly from the previous one: it contains an additional `swiss residual electricity mix` database. Switzerland is modelled as a special case in ecoinvent, so it needed some corrections.

In [None]:
bi.restore_project_directory(f"/srv/data/brightway2-project-Correlated and dependent sampling-backup.24-October-2022-08-13AM.tar.gz")
# bi.restore_project_directory(f"brightway2-project-Correlated and dependent sampling-backup.24-October-2022-08-13AM.tar.gz")
bd.projects.set_current("Correlated and dependent sampling")
bd.databases

In [None]:
fp_entso_ts = "/srv/data/entso-timeseries.zip"
# fp_entso_ts = "entso-timeseries.zip"

## Load ENTSO-E timeseries datapackage

In [None]:
dp_entso_ts = bwp.load_datapackage(ZipFS(fp_entso_ts))
dp_entso_ts.metadata

In [None]:
# Note that these time series are given on an hourly basis for the years 2019, 2020 (leap year) and 2021.
# The first datapoint corresponds to 01.01.2019 00:00, and the last to 31.12.2021 23:00.
data = dp_entso_ts.get_resource("timeseries ENTSO electricity values.data")
data[0].shape
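
To select a subset of hours it helps to map column positions to timestamps. The sketch below assumes that the columns are in chronological order with exactly one column per hour and no gaps (for example from daylight saving time); check this against `data[0].shape` before using it. The 08:00 to 20:00 daytime window is an arbitrary choice.

```python
from datetime import datetime, timedelta

start = datetime(2019, 1, 1, 0)
end = datetime(2021, 12, 31, 23)

# Number of hourly datapoints, counting both endpoints
n_hours = int((end - start).total_seconds() // 3600) + 1
timestamps = [start + timedelta(hours=i) for i in range(n_hours)]

# Column positions of daytime hours (08:00 up to, but not including, 20:00)
daytime_columns = [i for i, ts in enumerate(timestamps) if 8 <= ts.hour < 20]

# Column positions of one season, here the winter months
winter_columns = [i for i, ts in enumerate(timestamps) if ts.month in (12, 1, 2)]

print(n_hours, len(daytime_columns), len(winter_columns))
```

Index lists like these could then be used to slice the data array along its column axis before building a new datapackage.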

## Exercise

1. Run MC simulations with all timeseries data
2. Run MC simulations with selected timeseries data, for instance with only daytime measurements or a selected season. Compare the LCA scores obtained in the different cases, e.g. by plotting the resulting distributions of LCIA scores and computing their means and standard deviations.