# Import Agribalyse and link to ecoinvent 2.2

[Agribalyse](http://www.ademe.fr/en/expertise/alternative-approaches-to-production/agribalyse-program) is a French LCI database of agricultural products. It builds on ecoinvent 2.2. Because it was exported from SimaPro, the names of ecoinvent processes are mangled and need to be mapped back to the standard ecoinvent names.

This notebook uses Agribalyse 1.2, released in March 2015.

In [11]:
from brightway2 import *
import brightway2 as bw

In [12]:
bw.projects.set_current("EF calculation")

In [13]:
databases

Databases dictionary with 2 object(s):
	agribalyse3
	biosphere3

Create a new project for this notebook:

In [None]:
projects.set_current("Agribalyse")

## Add biosphere flows

Biosphere flow names follow the standard in ecoinvent 3.3. We will need to match these names to those in Agribalyse.

In [4]:
bw2setup()

Writing activities to SQLite3 database:


Creating default biosphere

Applying strategy: normalize_units
Applying strategy: drop_unspecified_subcategories
Applying strategy: ensure_categories_are_tuples
Applied 3 strategies in 0.00 seconds


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


Title: Writing activities to SQLite3 database:
  Started: 04/04/2022 14:52:12
  Finished: 04/04/2022 14:52:12
  Total time elapsed: 00:00:00
  CPU %: 87.90
  Memory %: 2.01
Created database: biosphere3
Creating default LCIA methods

Applying strategy: normalize_units
Applying strategy: set_biosphere_type
Applying strategy: fix_ecoinvent_38_lcia_implementation
Applying strategy: drop_unspecified_subcategories
Applying strategy: link_iterable_by_fields
Applied 5 strategies in 0.80 seconds
Wrote 975 LCIA methods with 254388 characterization factors
Creating core data migrations



## Import ecoinvent 2.2 as background database

In [None]:
path

In [None]:
path = %pwd
path += "/Agribalyse - Processes"
importer = SingleOutputEcospold1Importer(path, "Agribalyse 1.3 - Processes")
importer.apply_strategies()
importer.statistics()

In [None]:
importer.write_database()

## Load Agribalyse data

This notebook uses the ecospold 1 version of Agribalyse. Importing the SimaPro CSV version should be quite similar; it would just use a different `Importer` class.

We only need to give the directory, the `Importer` will find the XML file.

In [None]:
path = "/Users/cmutel/Documents/LCA Documents/Agribalyse"
ag = SingleOutputEcospold1Importer(path, "Agribalyse 1.2")
ag.apply_strategies()
ag.statistics()

That is a lot of unlinked exchanges. Let's export them to a spreadsheet so we can browse them.

In [None]:
ag.write_excel(True)

## 1. Fix biosphere names

One obvious problem is that the names of biosphere flows changed from ecoinvent 2 to ecoinvent 3, **and** SimaPro uses yet another set of biosphere names and categories.

Let's fix the SimaPro-specific problems first.

In [None]:
from bw2io.strategies.simapro import normalize_simapro_biosphere_categories, normalize_simapro_biosphere_names
ag.apply_strategy(normalize_simapro_biosphere_categories)
ag.apply_strategy(normalize_simapro_biosphere_names)

We have modified the source data, but still need to try to link to the biosphere database.

Read more about [currying functions](https://docs.python.org/3/library/functools.html#functools.partial) if this is new to you.
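As a toy illustration of why `functools.partial` helps here: an importer applies a strategy by calling it with the data as its only argument, so any extra arguments have to be bound in advance. The names `link_by_fields` and `apply_strategy` and the sample data below are hypothetical stand-ins, not part of `bw2io`.

```python
import functools


def link_by_fields(data, other, kind=None):
    """Toy strategy: mark exchanges of the given kind whose name appears in `other`."""
    known = {flow["name"] for flow in other}
    for ds in data:
        for exc in ds["exchanges"]:
            if kind and exc["type"] != kind:
                continue
            if exc["name"] in known:
                exc["linked"] = True
    return data


def apply_strategy(data, strategy):
    # Assumption: a strategy is always called with the data as its single argument
    return strategy(data)


biosphere = [{"name": "Carbon dioxide, fossil"}]
data = [{"exchanges": [
    {"name": "Carbon dioxide, fossil", "type": "biosphere"},
    {"name": "Unknown flow", "type": "biosphere"},
]}]

# Bind the extra arguments now; the result only needs `data` later
strategy = functools.partial(link_by_fields, other=biosphere, kind="biosphere")
result = apply_strategy(data, strategy)
linked = [exc["name"] for exc in result[0]["exchanges"] if exc.get("linked")]
print(linked)
```

The real cell follows the same pattern, binding `other` and `kind` before handing the function to `apply_strategy`.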

In [None]:
from bw2io.strategies import link_iterable_by_fields
import functools
ag.apply_strategy(functools.partial(link_iterable_by_fields, other=Database("biosphere3"), kind="biosphere"))
ag.statistics()

That solved 70% of the biosphere flows, but there are still many unmatched flows. Again, we export the full list of unmatched exchanges.

In [None]:
ag.write_excel(True)

The remaining unlinked biosphere flows *can't* be linked, because they don't exist in our biosphere database. This isn't the end of the world - we can add these new flows - but it does mean that they won't be assessed by our current LCIA methods.

You can check what is already in the biosphere database with its search function:

In [None]:
Database("biosphere3").search("nitrogen")

We add these missing biosphere flows. We could add them to the default `biosphere3` database, but it is cleaner to create a new database with just the new flows added for Agribalyse.

In [None]:
Database("Agribalyse new biosphere").register()
ag.add_unlinked_flows_to_biosphere_database("Agribalyse new biosphere")

We should now have no unlinked biosphere flows:

In [None]:
ag.statistics()

## 2. Fix production exchanges

This is a weird one - production exchanges represent the flow produced by an activity, and should have exactly the same name as that activity. This is the standard in ecospold 1; in ecospold 2 there is a difference between activity and product names.

Let's look at the data for an unlinked production exchange and its activity. We are trying to figure out which field is different. We pick the first exchange in our spreadsheet.

In [None]:
def get_unlinked(data):
    """Return the first (activity, exchange) pair for the named production exchange."""
    for ds in data:
        for exc in ds['exchanges']:
            if exc['type'] == 'production' and exc['name'] == 'Alfalfa, conventional, for animal feeding, at farm gate':
                return ds, exc

ds, exc = get_unlinked(ag.data)

# Compare the identifying fields of the activity and its production exchange
for field in ('name', 'unit', 'location', 'categories'):
    print(field)
    print("\tActivity:", field, ds.get(field))
    print("\tProduct:", field, exc.get(field))

In this case, for whatever reason, the `categories` are different. The solution is to link without using the `categories` field. This strategy is smart - if excluding `categories` led to multiple possible links, it would raise an error instead of linking the (possibly) incorrect activity.
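To make the idea concrete, here is a minimal sketch of linking on `name`, `unit`, and `location` only, with an error when the match is ambiguous. It is a toy under stated assumptions, not the `bw2io` implementation; the function name `link_by_name_unit_location` and the sample records are hypothetical.

```python
def link_by_name_unit_location(exchanges, activities):
    """Toy linker: match on (name, unit, location) and ignore categories."""
    fields = ("name", "unit", "location")

    # Group candidate activity codes by their identifying fields
    candidates = {}
    for act in activities:
        key = tuple(act.get(f) for f in fields)
        candidates.setdefault(key, []).append(act["code"])

    for exc in exchanges:
        key = tuple(exc.get(f) for f in fields)
        matches = candidates.get(key, [])
        if len(matches) > 1:
            # Refuse to guess when several activities share the same key
            raise ValueError(f"Ambiguous link for {key}: {matches}")
        if matches:
            exc["input"] = matches[0]
    return exchanges


activities = [
    {"code": "a1", "name": "Alfalfa, at farm gate", "unit": "kg",
     "location": "FR", "categories": ("Plant production",)},
]
exchanges = [
    {"name": "Alfalfa, at farm gate", "unit": "kg", "location": "FR",
     "categories": ("Something else",), "type": "production"},
]

linked = link_by_name_unit_location(exchanges, activities)
print(linked[0]["input"])
```

The exchange links even though its `categories` differ from the activity's, which is the situation described above.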

In [None]:
from bw2io.strategies import link_technosphere_based_on_name_unit_location
ag.apply_strategy(link_technosphere_based_on_name_unit_location)
ag.statistics()

Still a few problems. Let's look at one of them:

In [None]:
def get_unlinked(data):
    """Return the first (activity, exchange) pair with an unlinked production exchange."""
    for ds in data:
        for exc in ds['exchanges']:
            if exc['type'] == 'production' and not exc.get('input'):
                return ds, exc

ds, exc = get_unlinked(ag.data)

# Compare the identifying fields of the activity and its production exchange
for field in ('name', 'unit', 'location', 'categories'):
    print(field)
    print("\tActivity:", field, ds.get(field))
    print("\tProduct:", field, exc.get(field))

All the remaining outputs are disposal or recycling processes.

In [None]:
for exc in ag.unlinked:
    if exc['type'] == "production":
        print(exc['name'])

The disposal processes are in ecoinvent, but the recycling processes aren't.

In [None]:
Database("ecoinvent 2.2").search("Disposal, plastics, mixture")

In [None]:
Database("ecoinvent 2.2").search("recycling mixed plastics")

We have to be a little careful here. SimaPro considers these exchanges *outputs*, but ecoinvent models disposal as an input (you consume the disposal service). The easiest way to handle this is to simply change these outputs into inputs, which will fix the sign.

Note that we can't use `ag.unlinked`, as this only gives each unlinked exchange once, not every time it appears in the original data.

In [None]:
for ds in ag.data:
    for exc in ds['exchanges']:
        if exc['type'] == 'production' and not exc.get('input'):
            print("Fixing:", exc['name'])
            exc['type'] = 'technosphere'
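The difference between a deduplicated list of unlinked exchanges and the full data can be shown with a toy example. The activity names and amounts here are made up for illustration, and the set comprehension is only similar in spirit to `ag.unlinked`.

```python
# Two hypothetical activities that each "produce" the same disposal flow
data = [
    {"name": "Wheat", "exchanges": [
        {"name": "Disposal, plastics, mixture", "type": "production", "amount": 1.0},
    ]},
    {"name": "Maize", "exchanges": [
        {"name": "Disposal, plastics, mixture", "type": "production", "amount": 2.0},
    ]},
]

# A deduplicated view: one entry per distinct unlinked exchange name
unique_names = {
    exc["name"]
    for ds in data
    for exc in ds["exchanges"]
    if not exc.get("input")
}

# Walking the full data touches every occurrence of the exchange
fixed = 0
for ds in data:
    for exc in ds["exchanges"]:
        if exc["type"] == "production" and not exc.get("input"):
            exc["type"] = "technosphere"
            fixed += 1

print(len(unique_names), fixed)
```

Editing only the deduplicated entries would leave some occurrences in the datasets untouched, which is why the loop runs over `ag.data`.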

We will leave the recycling processes alone for now; first, we will fix all the ecoinvent links, including the disposal ones, and then we will get back to recycling.

## 3. Fix technosphere inputs

Looking at the spreadsheet, you notice that there is no `categories` field for any of the inputs. By default, `categories` is used when linking, so if ecoinvent 2.2 has the `categories` field (it does), then no suitable link will be found.

This is a common problem with SimaPro, and we already have a strategy to handle it. We will try to fix both the internal links and the links to ecoinvent 2.2.

In [None]:
ag.apply_strategy(link_technosphere_based_on_name_unit_location)
ag.apply_strategy(functools.partial(link_technosphere_based_on_name_unit_location, external_db_name="ecoinvent 2.2"))
ag.statistics()

So, that was relatively simple.

## 4. Adding recycling processes

The recycling processes don't exist, and don't have any impact, so the easiest way to handle these exchanges is to create new activities that produce the recycling flows. Luckily we have a method that does that for us. Note that the new recycling activities will be created in the Agribalyse database.

In [None]:
ag.add_unlinked_activities()
ag.statistics()
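Conceptually, creating activities for unlinked exchanges could look like the sketch below. This is an assumption about the general approach, not the code behind `add_unlinked_activities`; the function name, the `placeholder-N` codes, and the sample data are all hypothetical.

```python
def add_placeholder_activities(data, db_name):
    """Create one empty activity per distinct unlinked technosphere exchange."""
    created = {}
    for ds in data:
        for exc in ds["exchanges"]:
            if exc["type"] != "technosphere" or exc.get("input"):
                continue
            key = (exc["name"], exc.get("unit"), exc.get("location"))
            if key not in created:
                code = f"placeholder-{len(created)}"
                created[key] = {
                    "database": db_name,
                    "code": code,
                    "name": exc["name"],
                    "unit": exc.get("unit"),
                    "location": exc.get("location"),
                    # Only a production exchange, so the activity has no impact
                    "exchanges": [{
                        "type": "production",
                        "name": exc["name"],
                        "amount": 1.0,
                        "input": (db_name, code),
                    }],
                }
            # Point the exchange at the new (or already created) activity
            exc["input"] = (db_name, created[key]["code"])
    data.extend(created.values())
    return data


data = [
    {"name": "Wheat", "exchanges": [
        {"name": "Recycling, plastics", "unit": "kg", "type": "technosphere"}]},
    {"name": "Maize", "exchanges": [
        {"name": "Recycling, plastics", "unit": "kg", "type": "technosphere"}]},
]
data = add_placeholder_activities(data, "Agribalyse 1.2")
print(len(data))
```

Both exchanges end up pointing at the same new activity, which lives in the same database as the rest of the imported data.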

## Write the modified, fully linked database

We are finished with the importing process.

In [None]:
ag.write_database()

OK, that is not good. The unique identifying codes for the activities come from the source data, which wouldn't be so foolish as to give non-unique identifiers to activities in the same export file, would it? Let's look at the codes.

In [None]:
print(len({ds['code'] for ds in ag.data}), len(ag.data))
print({ds['code'] for ds in ag.data})

That is not good. 826 activities, and only 265 unique codes. Let's look at the source data:

    <dataset number="28" timestamp="2015-02-22T17:27:17" generator="SimaPro 8.0.3.14">
    <referenceFunction name="Bovine feed,MAT18, at farm gate">

    <dataset number="28" timestamp="2014-12-21T14:10:26" generator="SimaPro 8.0.3.14">
    <referenceFunction name="Greenhouse, glass walls and roof, plastic tubes">

    <dataset number="28" timestamp="2013-09-18T16:53:22" generator="CDT V1.2">
    <referenceFunction name="Harrowing, with rotary harrow (standard equipment)">


We need to add unique codes. We have a strategy for this, `set_code_by_activity_hash`, but it won't overwrite codes already present. We can fix that :)

In [None]:
for ds in ag.data:
    del ds['code']

In [None]:
from bw2io.strategies import set_code_by_activity_hash
ag.apply_strategy(set_code_by_activity_hash)
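A rough sketch of the idea behind hash-based codes, including the "don't overwrite existing codes" behaviour described above. The helper names and the choice of MD5 over the identifying fields are assumptions made for illustration; the real strategy may differ in detail.

```python
import hashlib


def activity_hash(ds, fields=("name", "unit", "location", "categories")):
    """Build a hex digest from the identifying fields of a dataset."""
    text = "".join(str(ds.get(field) or "").lower() for field in fields)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def set_code_by_hash(data):
    """Assign a hash-based code, but only where no code is present."""
    for ds in data:
        if "code" not in ds:
            ds["code"] = activity_hash(ds)
    return data


# Two datasets from the source data above that share the number 28
data = [
    {"code": "28", "name": "Bovine feed,MAT18, at farm gate"},
    {"code": "28", "name": "Greenhouse, glass walls and roof, plastic tubes"},
]

set_code_by_hash(data)
unchanged = [ds["code"] for ds in data]  # existing codes are left alone

for ds in data:
    del ds["code"]
set_code_by_hash(data)
codes = [ds["code"] for ds in data]

print(unchanged, len(set(codes)))
```

Once the old codes are deleted, datasets with different names get different codes.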

Only the internal links will need to be redone - the links to ecoinvent 2.2 and the biosphere database are fine.

We can't use `link_technosphere_based_on_name_unit_location`, because we need to pass the parameter `relink`.

In [None]:
ag.apply_strategy(functools.partial(
        link_iterable_by_fields,
        other=ag.data,
        fields=('name', 'location', 'unit'),
        relink=True
))
ag.statistics()
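The reason `relink` matters is that internal exchanges still carry `input` values that point at the old codes. A toy version of the logic, assuming a linker that skips already linked exchanges unless told otherwise (the function name `link_internal` and the sample data are hypothetical):

```python
def link_internal(data, fields=("name", "location", "unit"), relink=False):
    """Toy linker: match exchanges to datasets in `data` on the given fields."""
    lookup = {
        tuple(ds.get(f) for f in fields): (ds["database"], ds["code"])
        for ds in data
    }
    for ds in data:
        for exc in ds["exchanges"]:
            if exc.get("input") and not relink:
                continue  # already linked, leave untouched
            key = tuple(exc.get(f) for f in fields)
            if key in lookup:
                exc["input"] = lookup[key]
    return data


data = [
    {"database": "Agribalyse 1.2", "code": "new-hash", "name": "Harrowing",
     "location": "FR", "unit": "ha", "exchanges": []},
    {"database": "Agribalyse 1.2", "code": "other-hash", "name": "Wheat",
     "location": "FR", "unit": "kg", "exchanges": [
         # Stale link that still refers to the old, non-unique code
         {"name": "Harrowing", "location": "FR", "unit": "ha",
          "type": "technosphere", "input": ("Agribalyse 1.2", "28")},
     ]},
]

link_internal(data)
stale = data[1]["exchanges"][0]["input"]

link_internal(data, relink=True)
fresh = data[1]["exchanges"][0]["input"]

print(stale, fresh)
```

Without `relink`, the stale link survives; with it, the exchange is pointed at the new code.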

## Actually writing the final database

We are now ready to try again.

In [None]:
ag.write_database()

## Checking the imported datasets

We need to do some basic validation to make sure we have meaningful results. Here I just do some basic testing, but you should validate against known scores if you use this database frequently. The following code is rather simple and is not a real validation check.

In [5]:
databases

Databases dictionary with 1 object(s):
	biosphere3

In [7]:
gwp = [x for x in methods if "IPCC 2013" in str(x)][0]
gwp

('IPCC 2013 no LT', 'climate change', 'GTP 100a')

In [6]:
db = Database("Agribalyse 1.2")

lca = LCA({db.random(): 1}, gwp)
lca.lci(factorize=True)
lca.lcia()
lca.score



NameError: name 'gwp' is not defined

Let's calculate the LCIA scores of all activities in Agribalyse

In [8]:
import pyprind

scores = []

for act in pyprind.prog_bar(db):
    lca.redo_lcia({act: 1})
    scores.append(lca.score)

AssertionError: 

In [None]:
import numpy as np

scores = np.array(scores)
mask = scores == 0
print(mask.sum(), len(db))

scores = scores[~mask]

In [None]:
%matplotlib notebook

In [None]:
import seaborn as sns

In [None]:
sns.distplot(scores)

## Conclusion

We have imported the Agribalyse database. In the process of importing, we found and resolved several problems:

1. First, we had to fix the names and categories of biosphere flows, to make them compatible with the names and categories used in ecoinvent version 3.
2. Next, we created a new database for the new biosphere flows that we couldn't match.
3. We linked production flows to the activities that produced them, using a strategy that didn't use the field `categories`, as this field is not given consistently in SimaPro exports.
4. We switched some outputs to inputs, to be consistent with how ecoinvent models disposal and recycling processes.
5. We linked inputs to activities in ecoinvent 2.2, again ignoring the field `categories`, because SimaPro.
6. We created new processes to provide recycling services.
7. We deleted the unique identifying codes used by SimaPro, as they were not actually unique, and created our own codes.

This was a bit of a pain, but compared to other database exports, was actually not all that difficult. This is the sad truth of LCA data compatibility - it currently isn't all that great.