<img src="images/randonneur.png">

Randonneur is a library to make changes to life cycle inventory databases. Specifically, randonneur provides the following:

* A standard data format for specifying life cycle inventory data transformations
* A reference implementation for applying these changes
* A reference implementation for writing files following the standard

Although designed to work with Brightway, this library is not Brightway-specific.

Here's a basic example:

In [None]:
import randonneur as rn

In [None]:
my_lci = [{
    'name': "my process",
    'edges': [{
        'name': 'Xylene {RER}| xylene production | Cut-off, U',
        'amount': 1.0
    }]
}]

In [None]:
transformed = rn.migrate_edges_with_stored_data(
    my_lci,
    'simapro-ecoinvent-3.9.1-cutoff',
    config=rn.MigrationConfig(fields=['name'])
)
transformed

In [None]:
rn.migrate_edges_with_stored_data(
    transformed,
    'ecoinvent-3.9.1-cutoff-ecoinvent-3.10-cutoff',
)

## Data schema

Migration data is specified in a JSON file as a single dictionary. This file **must** include the following keys:

* `name`: Follows the [data package specification](https://specs.frictionlessdata.io/data-package/#name).
* `licenses`: Follows the [data package specification](https://specs.frictionlessdata.io/data-package/#licenses). Must be a list.
* `version`: Follows the [data package specification](https://specs.frictionlessdata.io/data-package/#version). Must be a string.
* `contributors`: Follows the [data package specification](https://specs.frictionlessdata.io/data-package/#contributors). Must be a list.
* `mapping`: A dictionary mapping the labels used in the transformation to data accessors.
* `graph_context`: A list with either the string 'nodes', 'edges', or both 'nodes' and 'edges'. This defines what kinds of objects in the graph should be transformed.

We strongly recommend you provide the following optional attributes:

* `source_id`: An identifier for the source dataset following the *common identifier standard*. Useful if the source data is specific.
* `target_id`: An identifier for the target dataset following the *common identifier standard*. Useful if the target data is specific.
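
The required keys above can be checked with a few lines of plain Python. This is a minimal sketch, not the validation that randonneur itself performs; the function name `check_migration_metadata` and the wording of the messages are assumptions made for illustration.

```python
# Hypothetical helper; randonneur's own validation may differ.
REQUIRED_KEYS = {"name", "licenses", "version", "contributors", "mapping", "graph_context"}
ALLOWED_CONTEXTS = ("nodes", "edges")


def check_migration_metadata(data: dict) -> list:
    """Return a list of problems found in `data`; an empty list means none were found."""
    problems = [f"missing required key: {key}" for key in sorted(REQUIRED_KEYS - set(data))]
    # Type checks only run for keys that are present, to avoid double reporting.
    if "licenses" in data and not isinstance(data["licenses"], list):
        problems.append("licenses must be a list")
    if "version" in data and not isinstance(data["version"], str):
        problems.append("version must be a string")
    if "contributors" in data and not isinstance(data["contributors"], list):
        problems.append("contributors must be a list")
    if "graph_context" in data:
        context = data["graph_context"]
        valid = (
            isinstance(context, list)
            and len(context) > 0
            and all(item in ALLOWED_CONTEXTS for item in context)
        )
        if not valid:
            problems.append("graph_context must be a non-empty list of 'nodes' and/or 'edges'")
    return problems


metadata = {
    "name": "ecoinvent-3.9.1-biosphere-ecoinvent-3.10-biosphere",
    "licenses": [],
    "version": "2.0.0",
    "contributors": [],
    "mapping": {},
    "graph_context": ["edges"],
}
print(check_migration_metadata(metadata))
```

Returning a list of problems instead of raising on the first one makes it easier to fix a file in a single pass.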

## Common database release identifier standard

At Brightcon 2022 we developed the following simple format for common database release identifiers:

`<database name>-<version>-<optional modifier>`

`database name` is usually lower case.

Here are some examples:

* `agribalyse-3.1.1`
* `forwast-1`
* `ecoinvent-3.10-cutoff`
* `simapro-9-biosphere`
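
As a rough illustration, identifiers in this format can be split into their parts with a regular expression. The sketch below assumes that the version consists of digits separated by dots and that everything after it is the modifier; neither assumption is stated by the standard, and `parse_release_id` is a hypothetical helper, not part of randonneur.

```python
import re
from typing import NamedTuple, Optional

# Assumption: versions look like "3", "3.10" or "3.1.1".
RELEASE_ID = re.compile(
    r"^(?P<name>.+?)-(?P<version>\d+(?:\.\d+)*)(?:-(?P<modifier>.+))?$"
)


class ReleaseId(NamedTuple):
    name: str
    version: str
    modifier: Optional[str]


def parse_release_id(identifier: str) -> ReleaseId:
    """Split `<database name>-<version>-<optional modifier>` into its parts."""
    match = RELEASE_ID.match(identifier)
    if match is None:
        raise ValueError(f"Not a recognizable release identifier: {identifier!r}")
    # The modifier group is None when the identifier has no modifier.
    return ReleaseId(match["name"], match["version"], match["modifier"])


for identifier in ["agribalyse-3.1.1", "forwast-1", "ecoinvent-3.10-cutoff", "simapro-9-biosphere"]:
    print(parse_release_id(identifier))
```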

## Additional Properties

The following properties should follow the [data package specification](https://specs.frictionlessdata.io/data-package/) if provided:

* `description`
* `sources`
* `homepage`
* `created`

## Change type

Finally, at least one change type should be included. The change types are:

* `create`
* `replace`
* `update`
* `delete`
* `disaggregate`


Here is an example, migrating from one ecoinvent biosphere version to another:

```json
{
  "name": "ecoinvent-3.9.1-biosphere-ecoinvent-3.10-biosphere",
  "description": "Data migration file from ecoinvent-3.9.1-biosphere to ecoinvent-3.10-biosphere generated with `ecoinvent_migrate` version 0.2.0",
  "contributors": [
    {
      "title": "ecoinvent association",
      "path": "https://ecoinvent.org/",
      "role": "author"
    },
    {
      "title": "Chris Mutel",
      "path": "https://chris.mutel.org/",
      "role": "wrangler"
    }
  ],
  "created": "2024-07-24T11:38:11.144509+00:00",
  "version": "2.0.0",
  "licenses": [
    {
      "name": "CC-BY-4.0",
      "path": "https://creativecommons.org/licenses/by/4.0/legalcode",
      "title": "Creative Commons Attribution 4.0 International"
    }
  ],
  "graph_context": [
    "edges"
  ],
  "mapping": {
    "source": {
      "expression language": "XPath",
      "labels": {
        "name": "//*:elementaryExchange/*:name/text()",
        "unit": "//*:elementaryExchange/*:unitName/text()",
        "uuid": "//*:elementaryExchange/@elementaryExchangeId"
      }
    },
    "target": {
      "expression language": "XPath",
      "labels": {
        "name": "//*:elementaryExchange/*:name/text()",
        "unit": "//*:elementaryExchange/*:unitName/text()",
        "uuid": "//*:elementaryExchange/@elementaryExchangeId"
      }
    }
  },
  "source_id": "ecoinvent-3.9.1-biosphere",
  "target_id": "ecoinvent-3.10-biosphere",
  "homepage": "https://github.com/brightway-lca/ecoinvent_migrate",
  "replace": [
    {
      "source": {
        "uuid": "90a94ea5-bca4-483d-a591-2e886c0ff47f",
        "name": "TiO2, 54% in ilmenite, 18% in crude ore"
      },
      "target": {
        "uuid": "2f033407-6060-4e1e-868c-9f362d10fdb2",
        "name": "Titanium"
      },
      "conversion_factor": 0.599,
      "comment": "To be modelled as pure elements, the titanium content of titanium dioxide is 0.599."
    }
  ]
}
```
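
To make the `replace` entry above more concrete, here is a sketch of how such a change could be applied to a single edge. It assumes that matching means equality on every `source` attribute, that the `target` attributes are laid over the edge, and that `amount` is multiplied by `conversion_factor`. The function `apply_replace` is hypothetical and is not the reference implementation.

```python
def apply_replace(edge: dict, change: dict) -> dict:
    """Return a new edge with `change` applied, or `edge` itself if it does not match."""
    # An edge matches when every key in the change's source has the same value in the edge.
    if any(edge.get(key) != value for key, value in change["source"].items()):
        return edge
    # Overlay the target attributes on a copy of the edge.
    updated = {**edge, **change["target"]}
    # Rescale the amount; a missing conversion factor is treated as 1 (assumption).
    if "amount" in updated:
        updated["amount"] *= change.get("conversion_factor", 1.0)
    return updated


change = {
    "source": {"uuid": "90a94ea5-bca4-483d-a591-2e886c0ff47f"},
    "target": {"uuid": "2f033407-6060-4e1e-868c-9f362d10fdb2", "name": "Titanium"},
    "conversion_factor": 0.599,
}
edge = {
    "uuid": "90a94ea5-bca4-483d-a591-2e886c0ff47f",
    "name": "TiO2, 54% in ilmenite, 18% in crude ore",
    "amount": 2.0,  # illustrative amount, not taken from any database
}
replaced = apply_replace(edge, change)
print(replaced)
```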

## Theory

In normal life cycle assessment practice, we work with a large variety of software and database applications, and often need to harmonize data across these heterogeneous systems. Because many of these systems do not commonly use simple and unique identifiers, we often need to link across systems based on data attributes. For example, if the name, location, and unit of an input are the same in system `A` and `B`, then we can infer that these refer to the same underlying concept.
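
Attribute-based linking of this kind can be sketched as a lookup on a tuple of attribute values. The data and the helper `link_by_attributes` below are invented for illustration only.

```python
def link_by_attributes(system_a, system_b, fields=("name", "location", "unit")):
    """Pair each object in `system_a` with the object in `system_b` sharing all `fields`."""
    # Index system B by the tuple of attribute values used for matching.
    index = {tuple(obj.get(field) for field in fields): obj for obj in system_b}
    links = []
    for obj in system_a:
        key = tuple(obj.get(field) for field in fields)
        if key in index:
            links.append((obj, index[key]))
    return links


# Hypothetical inputs from two systems that use different identifiers.
system_a = [
    {"id": "A-1", "name": "steel", "location": "RER", "unit": "kg"},
    {"id": "A-2", "name": "steel", "location": "GLO", "unit": "kg"},
]
system_b = [
    {"code": "B-7", "name": "steel", "location": "RER", "unit": "kg"},
]
links = link_by_attributes(system_a, system_b)
print(links)
```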

In the real world it's not so simple. Each player in the LCA data world is trying to give their users a positive experience, but over time this has led to many different terms for the same concept. Some legacy system restrictions also prevent complete imports, and cause data transformations that are difficult to reverse engineer.

This library defines both a specification for transformation data files, which allows different systems to be linked together by harmonizing the matching attributes, and a software-agnostic reference implementation of the functions needed to use that format.

Note that *not all verbs or graph object types* are currently supported by the reference implementation.

## Transformations

> [!NOTE]
> Transformations are serialized to JSON. Therefore, only [JSON data types](https://en.wikipedia.org/wiki/JSON) are supported.
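
One practical consequence is that Python-only types change or fail when a transformation is written to disk. The sketch below uses only the standard library `json` module; the helper name `json_round_trip` is an assumption for illustration.

```python
import json


def json_round_trip(value):
    """Serialize `value` to JSON and load it back, as writing and reading a file would."""
    return json.loads(json.dumps(value))


change = {
    "source": {"context": ("resource", "in ground")},
    "target": {"context": ("natural resource", "in ground")},
}
restored = json_round_trip(change)
# Tuples have no JSON equivalent, so they come back as lists.
print(restored["source"]["context"])

# Types without a JSON representation, such as sets, are rejected outright.
try:
    json.dumps({"source": {"context": {"resource", "in ground"}}})
except TypeError as error:
    print(f"Not serializable: {error}")
```

If your own data stores such values as tuples, convert one side before comparing against data loaded from a file.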

## Application

In the reference implementation, all transformation operations can be configured via a `MigrationConfig` object. The following can be specified:

`mapping`: Change the labels in the `migrations` data to match your data schema. `mapping` can
change the labels in the migration `source` and `target` sections. The `mapping` input should be
a dict with keys "source" and "target", and have values of `{old_label: new_label}` pairs:

In [None]:
rn.migrate_edges(
    graph=[{"edges": [{"name": "foo"}]}],
    migrations={"update": [{"source": {"not-name": "foo"}, "target": {"location": "bar"}}]},
    config=rn.MigrationConfig(mapping={"source": {"not-name": "name"}})
)
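
Conceptually, the relabelling described above renames keys inside the `source` and `target` sections before any matching happens. The following is a sketch of that idea in plain Python; `relabel_change` is a hypothetical helper and the library's internals may differ.

```python
def relabel_change(change: dict, mapping: dict) -> dict:
    """Return a copy of `change` with `source` and `target` keys renamed per `mapping`."""
    relabelled = dict(change)
    for section in ("source", "target"):
        if section in change:
            # Labels without an entry in the mapping are kept as they are.
            renames = mapping.get(section, {})
            relabelled[section] = {
                renames.get(label, label): value for label, value in change[section].items()
            }
    return relabelled


change = {"source": {"not-name": "foo"}, "target": {"location": "bar"}}
relabelled = relabel_change(change, {"source": {"not-name": "name"}})
print(relabelled)
```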

`node_filter`: A callable which determines whether or not the given node should be modified.
Applies to both node and edge transformations, with the exception of node creation: it doesn't
make sense to filter existing nodes when we are creating new objects.

`node_filter` needs to be a callable which takes a node object and returns a boolean which tells
if the node *should* be modified. In this example, the filter returns `False` and the node isn't
modified:

In [None]:
rn.migrate_edges(
    graph=[{"edges": [{"name": "foo"}]}],
    migrations={"update": [{"source": {"name": "foo"}, "target": {"location": "bar"}}]},
    config=rn.MigrationConfig(node_filter=lambda node: node.get("sport") == "🏄‍♀️")
)

`edge_filter`: A callable which determines whether or not the given edge should be modified.
Applies only to edge transformations, and does *not* apply to edge creation, as this function is
always called on the edge to be modified, not on the transformation object.

`edge_filter` needs to be a callable which takes an edge object and returns a boolean which
indicates if the edge *should* be modified.
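
The way the two filters combine can be sketched as a generator over the edges that remain eligible for modification. This is an illustration under the assumption that `node_filter` is checked before `edge_filter`; `eligible_edges` is a hypothetical name, not a randonneur function.

```python
def eligible_edges(graph, node_filter=None, edge_filter=None, edges_label="edges"):
    """Yield `(node, edge)` pairs that pass both filters; a missing filter accepts everything."""
    for node in graph:
        if node_filter is not None and not node_filter(node):
            continue  # Skip the node and all of its edges.
        for edge in node.get(edges_label, []):
            if edge_filter is None or edge_filter(edge):
                yield node, edge


graph = [
    {"name": "a", "location": "CH", "edges": [{"name": "foo"}, {"name": "bar"}]},
    {"name": "b", "location": "DE", "edges": [{"name": "foo"}]},
]
selected = list(
    eligible_edges(
        graph,
        node_filter=lambda node: node.get("location") == "CH",
        edge_filter=lambda edge: edge.get("name") == "foo",
    )
)
print(selected)
```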

`fields`: A list of object keys as strings, used when checking if the given transformation
matches the node or edge under consideration. In other words, only use the fields in `fields`
when checking the `source` values in each transformation for a match. Each field in `fields`
doesn't have to be in each transformation.

If you changed labels in `mapping`, use the changed labels, not the original key labels.

In [None]:
rn.migrate_edges(
    graph=[{"edges": [{"name": "foo"}]}],
    migrations={"update": [
        {"source": {"name": "foo", "missing": "🔍"}, "target": {"location": "bar"}}
    ]},
    config=rn.MigrationConfig(fields=["name"]),
)
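
The effect of `fields` on matching can be sketched as follows. The helper `matches` is hypothetical, and the choice to require at least one usable key (so that an empty comparison never matches) is an assumption of this sketch, not documented behaviour.

```python
def matches(obj: dict, source: dict, fields=None) -> bool:
    """Check whether `obj` matches `source`, comparing only keys listed in `fields`."""
    # With no `fields` given, every key in `source` takes part in the comparison.
    keys = [key for key in source if fields is None or key in fields]
    # Assumption: with nothing left to compare, we report no match.
    return bool(keys) and all(obj.get(key) == source[key] for key in keys)


edge = {"name": "foo"}
source = {"name": "foo", "missing": "🔍"}
print(matches(edge, source))                   # compares "name" and "missing"
print(matches(edge, source, fields=["name"]))  # compares "name" only
```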

`verbose`: Display progress bars and more logging messages.

`edges_label`: The label used for edges in the nodes of the `graph`. Defaults to "edges". In
other data formats, this could be "flows" or "exchanges".

In [None]:
rn.migrate_edges(
    graph=[{"e": [{"name": "foo"}]}],
    migrations={"update": [{"source": {"name": "foo"}, "target": {"location": "bar"}}]},
    config=rn.MigrationConfig(edges_label="e"),
)

`verbs`: The list of transformation types from `migrations` to apply. Transformations are run
in the order given in `verbs`, and in some complicated cases you may want to keep the same
verbs but change their order to get the desired output state. In general, such complicated
transformations should be broken down into smaller, discrete, and independent transformations
whenever possible, and logs checked carefully after their application.

The default value of `verbs` is the list of "safe" transformations: replace, update, and disaggregate.
To apply create and delete, you need to specify them in the configuration.

Only the verbs `create`, `disaggregate`, `replace`, `update`, and `delete` are used in our
functions, regardless of what is given in `verbs`, as we don't know how to handle custom verbs.
We need to write custom functions for each verb as they have different behaviour.
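
The selection and ordering rules for verbs can be summarized in a short sketch. The constants and the function `verbs_to_apply` below are assumptions made for illustration, and the order of the default tuple simply follows the text above.

```python
# Hypothetical constants mirroring the description above.
SAFE_VERBS = ("replace", "update", "disaggregate")
KNOWN_VERBS = {"create", "replace", "update", "delete", "disaggregate"}


def verbs_to_apply(migrations: dict, verbs=SAFE_VERBS) -> list:
    """Return the verbs that would run, in the order given by `verbs`."""
    # Unknown verbs and verbs without data in `migrations` are skipped.
    return [verb for verb in verbs if verb in KNOWN_VERBS and verb in migrations]


migrations = {"delete": [], "update": [], "custom": []}
print(verbs_to_apply(migrations))
print(verbs_to_apply(migrations, verbs=["delete", "update", "custom"]))
```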

`case_sensitive`: Flag indicating whether to do case-sensitive matching of transformations to
nodes or edges in the graph. Default is false, as practical experience has shown us that letter
case is commonly changed by software developers or users. Only applies to string values.

In [None]:
rn.migrate_edges(
    graph=[{"edges": [{"name": "foo"}]}],
    migrations={"update": [{"source": {"name": "FOO"}, "target": {"location": "bar"}}]},
    config=rn.MigrationConfig(case_sensitive=False),
)
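
Case-insensitive comparison that only touches string values can be sketched like this. Both helper names are hypothetical, and using `str.lower` for normalization is an assumption of this sketch.

```python
def normalize(value, case_sensitive=False):
    """Lowercase strings when matching is case-insensitive; leave other types untouched."""
    if isinstance(value, str) and not case_sensitive:
        return value.lower()
    return value


def same_value(left, right, case_sensitive=False) -> bool:
    """Compare two attribute values under the chosen case sensitivity."""
    return normalize(left, case_sensitive) == normalize(right, case_sensitive)


print(same_value("foo", "FOO"))
print(same_value("foo", "FOO", case_sensitive=True))
```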

`add_extra_attributes`: Flag indicating whether to include additional attributes when doing
replace, update, and disaggregate changes. Extra attributes are defined outside the "source" and
"target" transformation keys. Note that keys in `randonneur.utils.EXCLUDED_ATTRS` are never
added.

In [None]:
rn.migrate_edges(
    graph=[{"edges": [{"name": "foo"}]}],
    migrations={"update": [{
        "source": {"name": "FOO"},
        "target": {"location": "bar"},
        "comment": "Reason for change",
    }]},
    config=rn.MigrationConfig(add_extra_attributes=True),
)
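
Which keys count as "extra" can be illustrated with a small filter. The set `EXCLUDED` below is a hypothetical stand-in; the actual contents of `randonneur.utils.EXCLUDED_ATTRS` are not listed in this document and may differ.

```python
# Hypothetical stand-in for randonneur.utils.EXCLUDED_ATTRS; the real values may differ.
EXCLUDED = {"source", "target", "targets", "conversion_factor", "allocation"}


def extra_attributes(change: dict) -> dict:
    """Return the keys of `change` that lie outside `source`/`target` and are not excluded."""
    return {key: value for key, value in change.items() if key not in EXCLUDED}


change = {
    "source": {"name": "FOO"},
    "target": {"location": "bar"},
    "comment": "Reason for change",
}
print(extra_attributes(change))
```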

## Transformation verbs

See the README documentation starting at: https://github.com/brightway-lca/randonneur?tab=readme-ov-file#replace-and-update

## Package creation

The randonneur class `randonneur.Datapackage` can be used to generate files which comply with the format standard. Class instantiation takes the following required arguments which are described above:

* name: str
* description: str
* contributors: list 

The required attribute `mapping` is broken up into two inputs in this class:

* mapping_source: dict
* mapping_target: dict

The following optional arguments have default values:

* created: datetime.datetime.now(datetime.timezone.utc).isoformat()
* version: "1.0.0"
* licenses: rn.licenses.LICENSES["CC-BY-4.0"]
* graph_context: ["edges"]
  
`rn.licenses.LICENSES` is a dictionary of selected [SPDX licenses](https://spdx.org/licenses/).

The following optional arguments will be omitted from the datapackage if not provided:

* source_id: Optional[str]
* target_id: Optional[str]
* homepage: Optional[str]

Generating XPath or JSONPath mappings isn't easy; you can use the built-in values from `randonneur.MappingConstants` directly or as a guide to write your own:

* randonneur.MappingConstants.SIMAPRO_CSV
* randonneur.MappingConstants.ECOSPOLD2
* randonneur.MappingConstants.ECOSPOLD1_BIO
* randonneur.MappingConstants.ECOSPOLD2_BIO
* randonneur.MappingConstants.ECOSPOLD2_BIO_FLOWMAPPER

In [None]:
rn.MappingConstants.ECOSPOLD2_BIO_FLOWMAPPER

Here's an example of `randonneur` datapackage creation:

In [None]:
dp = rn.Datapackage(
    name="ecoinvent-2.2-biosphere-context-ecoinvent-3.0-biosphere-context",
    description="Convert context (category and subcategory) labels from ecoinvent 2 to 3 standards",
    contributors=[
        {"title": "Chris Mutel", "path": "https://chris.mutel.org/", "role": "author"},
    ],
    source_id="ecoinvent-2.2-biosphere",
    target_id="ecoinvent-3.0-biosphere",
    mapping_source=rn.MappingConstants.ECOSPOLD1_BIO,
    mapping_target=rn.MappingConstants.ECOSPOLD2_BIO,
    graph_context=["nodes", "edges"],
)

Data for each transformation verb should be added separately using `Datapackage.add_data(verb, data)`.

In [None]:
ECOSPOLD_2_3_BIOSPHERE = {
    ("resource", "in ground"): ("natural resource", "in ground"),
    ("resource", "in air"): ("natural resource", "in air"),
    ("resource", "in water"): ("natural resource", "in water"),
    ("resource", "land"): ("natural resource", "land"),
    ("resource", "biotic"): ("natural resource", "biotic"),
    ("resource",): ("natural resource",),
    ("air", "high population density"): ("air", "urban air close to ground"),
    ("air", "low population density"): ("air", "non-urban air or from high stacks"),
    ("water", "fossil-"): ("water", "ground-"),
    ("water", "lake"): ("water", "surface water"),
    ("water", "river"): ("water", "surface water"),
    ("water", "river, long-term"): ("water", "surface water"),
}

In [None]:
dp.add_data(
    "replace", 
    [
        {"source": {"context": k}, "target": {"context": v}} 
        for k, v in ECOSPOLD_2_3_BIOSPHERE.items()
    ]
)

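We do basic validation to check that the data matches the given mapping. In the following call the label `missing` is not part of the source mapping, so it should not pass validation: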

In [None]:
dp.add_data(
    "replace", 
    [
        {"source": {"missing": k}, "target": {"context": v}} 
        for k, v in ECOSPOLD_2_3_BIOSPHERE.items()
    ]
)
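
The kind of check involved can be sketched as comparing the labels used in the data against the labels declared in the mapping. This is not the validation code in `Datapackage.add_data`; the function `unmapped_labels` and the accessor strings are hypothetical, and only the `labels` layout follows the JSON example shown earlier.

```python
def unmapped_labels(changes: list, mapping: dict) -> list:
    """Return the sorted labels used in `changes` that have no accessor in `mapping`."""
    problems = set()
    for change in changes:
        for section in ("source", "target"):
            allowed = mapping[section]["labels"]
            problems.update(label for label in change.get(section, {}) if label not in allowed)
    return sorted(problems)


# Hypothetical mapping with a single declared label on each side.
mapping = {
    "source": {"expression language": "XPath", "labels": {"context": "<hypothetical accessor>"}},
    "target": {"expression language": "XPath", "labels": {"context": "<hypothetical accessor>"}},
}
changes = [
    {"source": {"missing": ["resource"]}, "target": {"context": ["natural resource"]}},
]
print(unmapped_labels(changes, mapping))
```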

Application of these files **in the Brightway context** is left for the session on the new SimaPro importer.

# `randonneur_data`

This "library" is just a registry of transformation files (large files are transparently compressed). You can also use it for custom registries on your local machine if you want.

In [None]:
import randonneur_data as rd

In [None]:
registry = rd.Registry()

In [None]:
print(registry)

It behaves like a dictionary:

In [None]:
'ecoinvent-2.2-biosphere-ecoinvent-3.0-biosphere' in registry

In [None]:
registry['ecoinvent-2.2-biosphere-ecoinvent-3.0-biosphere']

In [None]:
len(registry)

You can get a data schema:

In [None]:
registry.schema('ecoinvent-2.2-biosphere-ecoinvent-3.0-biosphere')

You can sample a large data file:

In [None]:
registry.sample('ecoinvent-2.2-biosphere-ecoinvent-3.0-biosphere')

You get different results for random sampling:

In [None]:
registry.sample('ecoinvent-2.2-biosphere-ecoinvent-3.0-biosphere')

You can also retrieve the whole file:

In [None]:
registry.get_file('ecoinvent-2.2-biosphere-ecoinvent-3.0-biosphere')