# Importing openLCA data into Python

## Introduction

GreenDelta developed [openLCA](http://www.openlca.org) to provide a set of open source tools for Life Cycle Assessment (LCA) using a process system model. The tools are written in Java.

The main openLCA client application provides access to a suite of databases that provide the basic building blocks of a model, an interface to construct the model as a system of connected products and flows, and a range of impact methodologies that calculate the impact of the system on the environment. 

GreenDelta also provides a range of [software tools](http://www.openlca.org/openlca/), including a database server called the [collaboration server](http://www.openlca.org/collaboration-server/). The collaboration server provides a centralised source of data with version control. You can freely [download](http://www.openlca.org/download/) the tools to your local machine.

This note describes how we can use the openLCA databases for optimisation of life cycle models in Python. To do this we need to import openLCA data into a Python session, and there are a number of ways to do so.

## Database Files

The simplest way to obtain openLCA data is to read an openLCA database file directly into Python. These can be fairly large compressed files (~1 GB), and they will be larger still once loaded into memory. It may not be very efficient to hold all the data in local memory if the database grows by, say, an order of magnitude.

The default format for an openLCA database file is `zolca`, which is a [zipped Derby database](https://ask.openlca.org/811/who-ever-opened-modified-zolca-file-python-possible-thanks). Derby is a relational database system written in Java. The schema for the openLCA database is [here](https://github.com/GreenDelta/olca-modules/blob/master/olca-core/src/main/resources/org/openlca/core/database/internal/current_schema_derby.sql).
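
Since a `zolca` file is a zip archive, it can probably be unpacked from Python as well as with an external tool. The sketch below uses only the standard library; the function name `extract_zolca` and the paths in the usage comment are hypothetical, and it assumes the archive is a plain, unencrypted zip.

```python
import zipfile
from pathlib import Path


def extract_zolca(zolca_path, target_dir):
    """Unpack a .zolca archive into target_dir and return that directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zolca_path) as archive:
        archive.extractall(target)
    return target


# Hypothetical usage (paths are placeholders):
# db_dir = extract_zolca("/path/to/database.zolca", "/path/to/Derby/database")
```

The extracted directory is what a `jdbc:derby:` URL would then point at.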

We can see the contents of a `zolca` file using the SQuirreL database client, roughly following the instructions [here](https://db.apache.org/derby/integrate/SQuirreL_Derby.html). SQuirreL is written in Java, so it makes a JDBC connection to Derby via an embedded driver. A screenshot of the `Processes` table for the file `ecoinvent_36_apos_unit_20191212.zolca` is shown below.

![squirrel](Figures/squirrel.png)

Getting the Derby database into Python is not that straightforward. We have not found a Python package that natively reads the Derby database file structure.

We can use the Python package `JayDeBeApi`, which passes commands to a Java JDBC driver, so this route to openLCA data makes Python depend on Java. The `zolca` database file has been uncompressed with 7zip and placed on a Linux network drive under `/mnt/disk1/`, along with the [Derby JDBC drivers](http://db.apache.org/derby/releases/release-10.13.1.1.html). Note that Java requires double slashes in path names if you are using Windows, and there must be no spaces in the path. Note also that the driver class name can change with the version of Derby that is used.

The Python code below loads the first 100 rows of the `TBL_PROCESSES` table in `ecoinvent_36_apos_unit_20191212.zolca` into a Python session. The `DESCRIPTION` column needs additional processing because it comes back as a Derby CLOB object (`EmbedClob`) rather than as text.

In [1]:
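import jaydebeapi as jdbc
import pandas as pd

# Connect to the uncompressed Derby database through the embedded JDBC driver.
# Arguments: driver class, JDBC URL, [user, password], path to the driver jar.
conn = jdbc.connect("org.apache.derby.iapi.jdbc.AutoloadedDriver",
                    "jdbc:derby:/mnt/disk1/data/openlca/Derby/ecoinvent_36_apos_unit_20191212",
                    ["", ""], "/mnt/disk1/share/db-derby-10.15.1.3-bin/lib/derby.jar")
curs = conn.cursor()

# Fetch only the first 100 rows; the column names come from the cursor metadata.
curs.execute("SELECT * FROM TBL_PROCESSES")
rec = curs.fetchmany(100)
col_names = [i[0] for i in curs.description]
df = pd.DataFrame(rec, columns=col_names)

curs.close()
conn.close()
df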

,ID,REF_ID,NAME,VERSION,LAST_CHANGE,F_CATEGORY,DESCRIPTION,PROCESS_TYPE,DEFAULT_ALLOCATION_METHOD,INFRASTRUCTURE_PROCESS,F_QUANTITATIVE_REFERENCE,F_LOCATION,F_PROCESS_DOC,F_CURRENCY,F_DQ_SYSTEM,DQ_ENTRY,F_EXCHANGE_DQ_SYSTEM,F_SOCIAL_DQ_SYSTEM,LAST_INTERNAL_ID
0,136031,37411101-f6d0-39e9-9fc6-165a784f7025,market for waste polyethylene terephthalate | ...,12884967427,1575304112680,125151,org.apache.derby.impl.jdbc.EmbedClob@2807bdeb,UNIT_PROCESS,,0,136033,905,136032,,,,125001,,7
1,136043,c02e4a8f-145c-38f5-9501-de8b282f06a9,"ethylvinylacetate production, foil | ethylviny...",12892766211,1575304057454,125157,org.apache.derby.impl.jdbc.EmbedClob@5f6722d3,UNIT_PROCESS,,0,136045,955,136044,,,,125001,,3
2,136049,9901f9c3-0540-3b41-ac63-06410f75a8c7,"Mannheim process | sodium sulfate, anhydrite |...",12893356035,1575304054485,125165,org.apache.derby.impl.jdbc.EmbedClob@2c532cd8,UNIT_PROCESS,,0,136051,899,136050,,,,125001,,15
3,136066,205f998f-9be2-349f-8df4-14c67fa1f5cf,"Mannheim process | hydrochloric acid, without ...",12893356035,1575304054485,125165,org.apache.derby.impl.jdbc.EmbedClob@294e5088,UNIT_PROCESS,,0,136068,899,136067,,,,125001,,15
4,136085,d3b43cfc-05cb-3b5f-9742-f9b1827d35e6,aluminium oxide factory construction | alumini...,12894011396,1575304119635,125207,org.apache.derby.impl.jdbc.EmbedClob@51972dc7,UNIT_PROCESS,,0,136087,899,136086,,,,125001,,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,144007,b7c2ab05-13d6-3810-986a-64787ad564d6,"clinker production | waste paint | APOS, U",12885098499,1575304057769,125431,org.apache.derby.impl.jdbc.EmbedClob@3d97a632,UNIT_PROCESS,,0,144009,519,144008,,,,125001,,1
96,144010,f423c941-9e2f-30fc-8863-bb3b9316ba78,"electricity production, natural gas, conventio...",12884967427,1575304082985,125478,org.apache.derby.impl.jdbc.EmbedClob@616fe72b,UNIT_PROCESS,,0,144012,853,144011,,,,125001,,40
97,144054,7be2b324-f492-3727-b50e-f8d85832f5b3,"maintenance, solid oxide fuel cell 125kW elect...",12887195651,1575304062095,129601,org.apache.derby.impl.jdbc.EmbedClob@37efd131,UNIT_PROCESS,,0,144056,904,144055,,,,125001,,6
98,144062,dc7e0307-abc0-3ba5-a8dc-87675564e349,"electricity production, wind, 1-3MW turbine, o...",12884967427,1575304082995,125478,org.apache.derby.impl.jdbc.EmbedClob@7e7b159b,UNIT_PROCESS,,0,144064,129607,144063,,,,125001,,7


We need to remap the table data into Python data structures for use in an optimisation problem.
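
One possible shape for that remapping is sketched below. It takes the column names and row tuples returned by a cursor (like `col_names` and `rec` above) and builds a dictionary of small record objects keyed by `REF_ID`. The `ProcessRecord` class, the choice of columns, and the helper names are all hypothetical. The `clob_to_text` helper assumes the `DESCRIPTION` values behave like `java.sql.Clob` objects, exposing `length()` and `getSubString()`, and it would most likely have to run before the connection is closed.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Sequence


@dataclass
class ProcessRecord:
    """Hypothetical container for a few TBL_PROCESSES columns."""
    ref_id: str
    name: str
    process_type: str
    location_id: Optional[int]
    description: Optional[str]


def clob_to_text(value: Any) -> Optional[str]:
    """Return plain text for a CLOB-like object; pass None and str through."""
    if value is None or isinstance(value, str):
        return value
    # java.sql.Clob positions are 1-based: getSubString(pos, length).
    return str(value.getSubString(1, int(value.length())))


def rows_to_processes(col_names: Sequence[str],
                      rows: Sequence[Sequence[Any]]) -> Dict[str, ProcessRecord]:
    """Build a REF_ID -> ProcessRecord mapping, selecting columns by name."""
    processes = {}
    for row in rows:
        r = dict(zip(col_names, row))
        processes[r["REF_ID"]] = ProcessRecord(
            ref_id=r["REF_ID"],
            name=r["NAME"],
            process_type=r["PROCESS_TYPE"],
            location_id=r.get("F_LOCATION"),
            description=clob_to_text(r.get("DESCRIPTION")),
        )
    return processes
```

Keying on `REF_ID` rather than the numeric `ID` is a design choice made here because the UUID is also what appears in the JSON-LD exports.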

The openLCA impact assessment methods data is stored in a zip file that appears to contain data in JSON-LD format.

## Other data formats

The GreenDelta Format Converter Tool allows conversion between different database formats, which might make it easier to get the openLCA data into Python in a form that we can manipulate.

For example, the ILCD format is XML based so it might be easier to map to an object model, while the JSON-LD format should be easy to parse in Python. The Python libraries [PyLD](https://github.com/digitalbazaar/pyld) and [RDFLib-jsonld](https://github.com/RDFLib/rdflib-jsonld) can parse JSON-LD files. We can convert openLCA data to CSV, but not with the Format Converter Tool. However, we can bulk export to CSV from a Derby database via a JDBC connection. CSV is easy to read into Python, but we then need to map it to an object model.
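
As a rough illustration of the CSV route, the sketch below streams the result of a query from any DB-API cursor (such as one obtained from `JayDeBeApi`) into a CSV file in batches. The function name `export_query_to_csv` and the batch size are hypothetical, and CLOB columns such as `DESCRIPTION` would still need converting to text before they are written.

```python
import csv


def export_query_to_csv(cursor, sql, csv_path, batch_size=1000):
    """Run sql on a DB-API cursor, write the rows to csv_path, return the row count."""
    cursor.execute(sql)
    header = [col[0] for col in cursor.description]
    n_rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        while True:
            # Fetch in batches so a large table is never held in memory at once.
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)
            n_rows += len(batch)
    return n_rows


# Hypothetical usage with an open jaydebeapi cursor named curs:
# export_query_to_csv(curs, "SELECT ID, REF_ID, NAME FROM TBL_PROCESSES", "processes.csv")
```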

In the openLCA client application we can export the database to a variety of formats. It might be easier to use one of these formats to obtain data in the optimiser rather than read the native Derby database. Exporting data to another format seems to take some time, though.

The openLCA Format Converter does not appear to be actively maintained: the latest version is reported as 2015. Its functionality has to some extent been reproduced in the main openLCA client application through its import and export capabilities.

As an example, we load a single JSON file from the openLCA JSON-LD methods zip.

In [2]:
import json
import zipfile

json_zip_filename = "/mnt/disk1/data/openlca/JSON_LD/ecoinvent_36_lcia_methods.zip"
zip_file = zipfile.ZipFile(json_zip_filename, 'r')
name_list = zip_file.namelist()
json_filename = name_list[3]
my_file_contents = zip_file.read(json_filename)
my_object = json.loads(my_file_contents)

print(json_filename, "\n", my_object)

categories/f318fa60-bae9-361f-ad5a-5066a0e2a9d1.json 
 {'@context': 'http://greendelta.github.io/olca-schema/context.jsonld', '@type': 'Category', '@id': 'f318fa60-bae9-361f-ad5a-5066a0e2a9d1', 'name': 'Elementary flows', 'version': '00.00.000', 'modelType': 'FLOW'}


Now let's read the whole zip into memory and store it in a dict keyed by file name.

In [3]:
json_db = {}
for name in name_list:
    raw = zip_file.read(name)
    # skip directory entries and empty files
    if len(raw) > 0:
        json_db[name] = json.loads(raw)

context = json_db['context.json']
context

{'@vocab': 'http://openlca.org/schema/v1.1/',
 '@base': 'http://openlca.org/schema/v1.1/',
 'modelType': {'@type': '@vocab'},
 'flowPropertyType': {'@type': '@vocab'},
 'flowType': {'@type': '@vocab'},
 'distributionType': {'@type': '@vocab'},
 'parameterScope': {'@type': '@vocab'},
 'allocationType': {'@type': '@vocab'},
 'defaultAllocationMethod': {'@type': '@vocab'},
 'allocationMethod': {'@type': '@vocab'},
 'processType': {'@type': '@vocab'},
 'riskLevel': {'@type': '@vocab'}}
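
A dictionary keyed by file name is awkward to search, so one option is to index the objects by their `@type` and `@id` fields instead. The sketch below assumes every data file in the zip is a JSON object carrying both fields, as the category file shown earlier does. The function name `load_by_type` is hypothetical.

```python
import json
import zipfile
from collections import defaultdict


def load_by_type(zip_path):
    """Read every JSON file in the archive and group the objects by '@type', then '@id'."""
    by_type = defaultdict(dict)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            raw = archive.read(name)
            if not raw:
                continue
            obj = json.loads(raw)
            # Files without '@type' and '@id' (such as context.json) are skipped.
            if isinstance(obj, dict) and "@type" in obj and "@id" in obj:
                by_type[obj["@type"]][obj["@id"]] = obj
    return dict(by_type)


# Hypothetical usage:
# db = load_by_type("/mnt/disk1/data/openlca/JSON_LD/ecoinvent_36_lcia_methods.zip")
# category = db["Category"]["f318fa60-bae9-361f-ad5a-5066a0e2a9d1"]
```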

## Embedded Python in openLCA

Python 2.7 is embedded in openLCA using Jython, which is a Java implementation of Python. Jython compiles Python code to Java bytecode, so it is doubtful that many recent Python libraries will work, because of their extensive dependencies. This functionality provides a scripting and calculation tool within the openLCA environment. There appears to be the possibility of providing a UI to the calculation.

There is an [openLCA tutorial](https://github.com/GreenDelta/openlca-python-tutorial) for using this option.

The GitHub repository does not look very actively maintained based on the commit log. It is doubtful that this sort of scripting functionality will be sufficient for building an optimisation tool. It is not clear how we would incorporate optimisation solvers and build a user interface that complements the openLCA interface.

## IPC Python API

GreenDelta provides an implementation of a JSON-RPC based protocol for Inter-Process Communication (IPC). It is possible to call openLCA functions in a live session and process results outside of openLCA in Python. The Python session and the openLCA session must be on the same machine. I think the main use of this functionality is to automate the process of building complex models and tasks in the openLCA interface. It also allows you to retrieve data from openLCA and perform your own calculations.
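
To make the protocol a little more concrete, the sketch below builds a JSON-RPC 2.0 request and posts it to a local IPC server using only the standard library. The envelope fields (`jsonrpc`, `id`, `method`, `params`) come from the JSON-RPC 2.0 specification. The method name `get/descriptors`, the parameter shape, and port 8080 are assumptions that should be checked against the openLCA IPC documentation.

```python
import json
import urllib.request


def build_rpc_request(method, params, request_id=1):
    """Return a JSON-RPC 2.0 request body as a dict."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def call_ipc(method, params, port=8080, timeout=10):
    """POST a JSON-RPC request to a local IPC server and return the decoded reply."""
    body = json.dumps(build_rpc_request(method, params)).encode("utf-8")
    request = urllib.request.Request(
        f"http://localhost:{port}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (needs a running IPC server; the method name is an assumption):
# reply = call_ipc("get/descriptors", {"@type": "Flow"})
```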

For an optimisation model, we need to enhance the openLCA model building capabilities significantly. We need to show a range of values and networks over which to optimise, and it is not clear how one can build a model like this in the current implementation. It would require significant modification to the openLCA user interface, and it is not clear that GreenDelta would want to make these changes.

At present, it is proposed that we just use openLCA as a source of data, and our optimisation tool UI allows the building of a model for optimisation. The openLCA database can provide the building blocks for the model. The advantage of using IPC for retrieving data is that it is represented in an object model that can be more easily manipulated in Python.

The IPC functionality is provided by GreenDelta in the [olca-ipc package](https://github.com/GreenDelta/olca-ipc.py), a convenience API for using the IPC protocol from standard Python (CPython v3.6+). The `olca-ipc` Python package is available on [PyPI](https://pypi.org/project/olca-ipc/) (the Python Package Index) but, it seems, not for conda.

Note that you need your **local openLCA client open with a database** and IPC server active for this library to work, but you get access to the [olca-schema](https://github.com/GreenDelta/olca-ipc.py) data model of openLCA.

This might be a much quicker way of getting at openLCA data in a convenient format. However, it requires openLCA to be up and running for an optimisation tool to work. It tightly couples the optimisation tool to the openLCA client application and so may be cumbersome to use.
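
As a sketch of what reading data this way might look like, the helper below collects the identifiers and names of all objects of one model type. It assumes a client object with a `get_descriptors(model_type)` method that yields references with `id` and `name` attributes, which is how we understand the early `olca-ipc` releases to work. Later versions of the package may differ, and the helper name `descriptor_names` is hypothetical.

```python
def descriptor_names(client, model_type):
    """Map id -> name for every descriptor of the given model type."""
    return {ref.id: ref.name for ref in client.get_descriptors(model_type)}


# Example (needs openLCA running with an open database and the IPC server started):
# import olca
# client = olca.Client(8080)
# process_names = descriptor_names(client, olca.Process)
```

A dictionary like this could serve as the lookup from which the optimisation tool lets a user pick processes, with the full objects fetched only when they are needed.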

There are some questions on [ask.openlca](https://ask.openlca.org/) regarding using openLCA data in Python. For example, I think [this question](https://ask.openlca.org/2782/generating-product-system-results-in-python?show=2782#q2782) just refers to IPC rather than the collaboration server. Also see [this](https://ask.openlca.org/3123/how-create-productsystem-calculation-with-olca-olca-package?show=3123#q3123) which gives some Python code to interact with openLCA.

The script below shows how Python can be used to create a new flow in an open database. It does not currently work in a Jupyter notebook because we need to have the `olca-ipc` package installed in Python and an open database in the openLCA client application. It is easy to get working in an IDE such as Spyder or PyCharm.

In [4]:
# we need openlca running with an open database on the machine where this notebook is running for this script to work
try:
    import olca
    import uuid

    client = olca.Client(8080)

    # find the flow property 'Mass' from the database
    mass = client.find(olca.FlowProperty, 'Mass')

    # create a flow that has 'Mass' as reference flow property
    steel = olca.Flow()
    steel.id = str(uuid.uuid4())
    steel.flow_type = olca.FlowType.PRODUCT_FLOW
    steel.name = "Steel"
    steel.description = "Added from the olca-ipc python API..."
    # in openLCA, conversion factors between different
    # properties/quantities of a flow are stored in
    # FlowPropertyFactor objects. Every flow needs at
    # least one flow property factor for its reference
    # flow property.
    mass_factor = olca.FlowPropertyFactor()
    mass_factor.conversion_factor = 1.0
    mass_factor.flow_property = mass
    mass_factor.reference_flow_property = True
    steel.flow_properties = [mass_factor]

    # save it in openLCA; you may have to refresh
    # (close and reopen the database) to see the new flow
    client.insert(steel)
except Exception as e:
    print(e)

No module named 'olca'


## Collaboration Server

This is a data server for openLCA databases with version control on the data. We may be able to query this server programmatically, but I am not sure if this is supported by GreenDelta. I can find no mention of it in the [documentation](http://www.openlca.org/wp-content/uploads/2018/07/LCA_CS_Manual.pdf). The server uses the Elasticsearch search engine, so perhaps this can be queried externally, but it is not set up to be accessible by default.

I have installed the collaboration server on my local Linux server in order to assess its functionality. The install instructions are [here](http://www.openlca.org/collaboration-server-installation-guide). We need to make sure that `tomcat` is using `java-8-openjdk-amd64` for the collaboration server to work. This is simple to configure in `/etc/default/tomcat8` by uncommenting `JAVA_HOME` and setting it to the correct version of Java. However, I have had trouble getting Elasticsearch working with a relatively new version of Ubuntu.

It is easiest to set up the server on a machine running Ubuntu 16.04, as described in the installation guide. There are issues with Java compatibility and the interoperability of packages if you install on a newer version of Ubuntu or another Linux OS.

![collaboration server](Figures/Collaboration_Server.png)

It may be useful to have a collaboration server within the group because it provides a centralised repository of models. There may be security and licensing issues with holding data on an accessible server. It depends how easy it is for a research group to have such a server.

Maybe the best option for getting server data is to just fetch it into a local database and then read it into Python using the other methods described here.

## Elasticsearch

[Elasticsearch](https://www.elastic.co/elasticsearch/) is a search engine application written in Java that is bundled with the openLCA collaboration server. We may be able to query this server remotely to get at data in an openLCA database. The Elasticsearch port 9200 is not exposed externally by default. The format of the data returned from a query might need a lot of processing.

There is a Python library to interact with Elasticsearch called [elasticsearch-dsl](https://elasticsearch-dsl.readthedocs.io/en/latest/).
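
If the port were made reachable, a query could in principle be sent to the standard Elasticsearch `_search` REST endpoint without any extra library. The sketch below builds a simple `match` query and posts it with the standard library. The index name and field name in the usage comment are placeholders, since we do not know how the collaboration server names its indices or fields.

```python
import json
import urllib.request


def build_match_query(field, text, size=10):
    """Return an Elasticsearch query DSL body for a simple match query."""
    return {"size": size, "query": {"match": {field: text}}}


def search(index, body, host="localhost", port=9200, timeout=10):
    """POST the query to the index's _search endpoint and return the list of hits."""
    request = urllib.request.Request(
        f"http://{host}:{port}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply["hits"]["hits"]


# Example (index and field names are placeholders, not known server settings):
# hits = search("some-index", build_match_query("name", "steel"))
```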