# `bw_simapro_csv`: Python software for robustly reading SimaPro CSV files

The SimaPro CSV file format can be challenging to read for a number of reasons. `bw_simapro_csv` tries to solve problems where it can, and to clearly notify you if it can't.

`bw_simapro_csv` draws heavily from the very nice [olca-simapro-csv](https://github.com/GreenDelta/olca-simapro-csv/), but is implemented in Python, and I think we have some more checks for potential data issues. It also allows for direct use in Brightway.

Because we don't have a formal specification or reference test cases, we have to reverse engineer the format. This means that our understanding is limited by the available input data. Currently, `bw_simapro_csv` **does not** parse "product stage" type exports, as we don't have enough data to understand how this format type works. *Please help* by providing real world data and results so we can fill this hole in the library!

All told, SimaPro CSV isn't that bad. [There are much worse formats](https://github.com/gco/xee/blob/master/XeePhotoshopLoader.m#L108).

## Text encoding

The first meta-challenge is the text encoding. We are quite confident that SimaPro CSV is supposed to be exporting text following Windows 1252 (this is what [olca-simapro-csv](https://github.com/GreenDelta/olca-simapro-csv/blob/c11e40e7722f2ecaf62e813eebcc8d0793c8c3ff/src/test/java/org/openlca/simapro/csv/CsvLineTest.java#L53) uses as well).

However, it is quite common for SimaPro CSV files to use control characters from the [Windows 1252 code page](https://www.ascii-code.com/grid), and we have even seen bytes which are undefined in that code page. This seems to be partially intentional: for example, they use the character `\u007f` (the delete character) as a line break in multiline strings, so that these strings can be stored on a single CSV line. While there isn't a single CSV specification, everyone else uses standard ways to get this behaviour, mostly by just escaping a normal line break. But sometimes it seems like people are entering data with their language set to something else, and these bytes are entered as if they were in Windows 1252.

We don't really know, but we need to do something, so we remove characters which could never be meaningful in Windows 1252 (undefined bytes, or things like the "device control 4" character).
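
As a rough illustration of this kind of cleaning, here is a minimal sketch. It is not the library's actual code: the function name `clean_field`, the exact set of removed control characters, and the choice to turn `\u007f` into a normal line break are all assumptions made for the example.

```python
import re

# C0 control characters except tab, line feed and carriage return.
# The delete character (\x7f) is handled separately below.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_field(raw: bytes) -> str:
    """Decode one field as Windows 1252 and drop meaningless characters."""
    # Bytes which are undefined in Python's cp1252 codec (e.g. 0x81)
    # become U+FFFD with errors="replace"; we then remove them.
    text = raw.decode("cp1252", errors="replace").replace("\ufffd", "")
    # Remove control characters such as "device control 4" (\x14)
    text = CONTROL_CHARS.sub("", text)
    # Assumption: treat the delete character as a line break
    return text.replace("\x7f", "\n")
```

For example, `clean_field(b"caf\xe9\x81 line1\x7fline2\x14")` keeps the `é`, drops the undefined byte and the control character, and splits the text onto two lines.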

## Products versus processes

Despite the presence of a `Products` block in processes, SimaPro doesn't really differentiate between the two. Therefore, all process datasets should be considered as [`ProcessWithReferenceProduct`](https://github.com/brightway-lca/bw_interface_schemas/blob/5fb1d40587aec2a4bb2248505550fc883a91c355/bw_interface_schemas/lci.py#L83). Consider this quote from the tutorial:

    Process name in SimaPro
    Under the Documentation tab, you can enter the process name. Please note that this is only for
    your own reference and this name is not used anywhere. Processes are identified by the name
    defined under the Input/Output tab in the product section. Therefore, if you want to search for a
    certain process, you should use the product name defined in the Input/Output as the keyword.

## SimaPro CSV file format

We can take a divide and conquer approach to these files. We will divide the files into a set of *blocks*, and have custom classes for each block. The first block is the header.

In [2]:
import bw_simapro_csv
from pathlib import Path

In [3]:
my_filepath = Path("/Users/cmutel/Projects/Agribalyse 3.1 import/3.1.1/AGB 3.1.1.csv")

In [4]:
sp = bw_simapro_csv.SimaProCSV(my_filepath)

2024-09-26 22:00:41.053 | INFO     | bw_simapro_csv.main:__init__:119 - Writing logs to /Users/cmutel/Library/Logs/bw_simapro_csv/AGB 3.1.1-2024-09-26T22-00-41
2024-09-26 22:00:41.087 | INFO     | bw_simapro_csv.main:__init__:142 - Using database name 'AGRIBALYSE - Unit'
2024-09-26 22:00:41.087 | INFO     | bw_simapro_csv.main:__init__:147 - SimaPro CSV import started.
	File: '/Users/cmutel/Projects/Agribalyse 3.1 import/3.1.1/AGB 3.1.1.csv'
	Delimiter: ';'
	Name: 'AGRIBALYSE - Unit'
2024-09-26 22:01:55.172 | INFO     | bw_simapro_csv.main:resolve_parameters:314 - Extracted and cleaned 17557 process datasets


In my case, I will be looking at the Agribalyse database. This is actually quite a clean database, and the import doesn't raise any obvious warnings or errors. Here is the header as provided in the file:

```console
{SimaPro 9.5.0.0}
{processus}
{Date: 05/05/2023}
{Time: 11:10:18}
{Projet: AGRIBALYSE - Unit}
{CSV Format version: 9.0.0}
{CSV separator: Semicolon}
{Decimal separator: ,}
{Date separator: /}
{Short date format: dd/MM/yyyy}
{Export platform IDs: No}
{Skip empty fields: Non}
{Convert expressions to constants: Oui}
{Selection: Selection (18557)}
{Related objects (system descriptions, substances, units, etc.): Oui}
{Include sub product stages and processes: Non}
{Skip unused parameters: Oui}
{Ouvrir bibliothèque : 'AGRIBALYSE - Unit'}
```

You can probably see that this part of the file *can't* be treated as a CSV. It needs, and gets, a special parser.
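
To give an idea of what such a special parser has to do, here is a minimal sketch under our own assumptions; it is not the code in `header.py`, and the function name `parse_header_lines` is hypothetical. It assumes every header line is wrapped in braces, and that the first line without braces ends the header.

```python
import re

# A header line is anything fully wrapped in curly braces
HEADER_LINE = re.compile(r"^\{(?P<content>.*)\}$")


def parse_header_lines(lines):
    """Split header lines into `key: value` pairs and bare entries."""
    pairs, bare = {}, []
    for line in lines:
        match = HEADER_LINE.match(line.strip())
        if not match:
            break  # first line not wrapped in braces ends the header
        content = match.group("content")
        # Split on the first colon only, so `Time: 11:10:18` stays intact
        key, separator, value = content.partition(":")
        if separator:
            pairs[key.strip()] = value.strip()
        else:
            bare.append(content.strip())
    return pairs, bare
```

Lines such as `{SimaPro 9.5.0.0}` and `{processus}` have no key, so they end up in the list of bare entries and need to be interpreted by position or content.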

We also notice that some *field names* and even values are identified in French. We have tried to *guess* what [these fields could be called](https://github.com/brightway-lca/bw_simapro_csv/blob/main/bw_simapro_csv/header.py#L44) in other common European languages - please help us if you find terms that we are missing!

We also [try many possible values](https://github.com/brightway-lca/bw_simapro_csv/blob/main/bw_simapro_csv/utils.py#L38) for booleans.
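
The idea can be sketched as follows. The value sets below are illustrative guesses only, not the list used in `utils.py`, and `as_bool` is a hypothetical name.

```python
# Illustrative values only; the real list in the library may differ
TRUE_VALUES = {"yes", "y", "true", "1", "oui", "ja", "si", "sí"}
FALSE_VALUES = {"no", "n", "false", "0", "non", "nein"}


def as_bool(value: str) -> bool:
    """Interpret a header value as a boolean, ignoring case and whitespace."""
    normalized = value.strip().lower()
    if normalized in TRUE_VALUES:
        return True
    if normalized in FALSE_VALUES:
        return False
    # Failing loudly is safer than silently guessing
    raise ValueError(f"Can't interpret {value!r} as a boolean")
```

With this sketch, `as_bool("Oui")` is `True` and `as_bool("Non")` is `False`, while an unknown term raises an error instead of being guessed.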

Here is our parsing of the header:

In [5]:
sp.header

{'simapro_version': '9.5.0.0',
 'kind': <SimaProCSVType.processes: 'processes'>,
 'delimiter': ';',
 'project': 'AGRIBALYSE - Unit',
 'csv_version': '9.0.0',
 'libraries': [],
 'selection': 'Selection (18557)',
 'open_project': None,
 'open_library': 'AGRIBALYSE - Unit',
 'date_separator': '/',
 'dayfirst': True,
 'export_platform_ids': False,
 'skip_empty_fields': False,
 'convert_expressions': True,
 'related_objects': True,
 'include_stages': False,
 'decimal_separator': ',',
 'created': datetime.datetime(2023, 5, 5, 11, 10, 18),
 'exclude_library_processes': None}

## `Process` blocks

We then move to the next type of blocks, which describe processes. We have to be a bit careful here, as SimaPro blocks normally start with a control line, like

```console
Process
```

and end with another control line:

```console
End
```

**But** they don't always have the `End`. We therefore need to iterate through lines to find the implicit end of a block (the start of a new block), and then *backtrack* to finish the processing of the first block. We do this by using a [rewindable iterator](https://github.com/brightway-lca/bw_simapro_csv/blob/main/bw_simapro_csv/csv_reader.py#L32).
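
The backtracking idea can be sketched like this. It is a simplified illustration, not the class in `csv_reader.py`: the names `Rewindable`, `read_block` and `BLOCK_STARTS` are hypothetical, and lines are plain strings here instead of parsed CSV rows.

```python
class Rewindable:
    """Iterator which can step back by one item."""

    _MISSING = object()

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._last = self._MISSING
        self._replay = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._replay:
            # Hand out the previous item again after a rewind
            self._replay = False
            return self._last
        self._last = next(self._iterator)
        return self._last

    def rewind(self):
        if self._last is self._MISSING:
            raise ValueError("Nothing to rewind")
        self._replay = True


# Illustrative subset of control lines which start a new block
BLOCK_STARTS = {"Process", "Units"}


def read_block(lines):
    """Collect lines until `End` or until the next block starts."""
    body = []
    for line in lines:
        if line == "End":
            break
        if line in BLOCK_STARTS:
            lines.rewind()  # leave the control line for the next block
            break
        body.append(line)
    return body
```

After `read_block` hits an implicit end, the next call to `next(lines)` returns the control line of the following block, so that block can be parsed as if nothing had been consumed.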

One tricky point in the processing is that there are [some block headers](https://github.com/brightway-lca/bw_simapro_csv/blob/main/bw_simapro_csv/main.py#L70) which can start new blocks, but can also be used *inside* a `Process` block. If we see a file where these indeterminate section headers are used and `End` control lines aren't, we raise an error.

`Process` blocks do have some metadata:

In [24]:
process_blocks = [block for block in sp if isinstance(block, bw_simapro_csv.blocks.Process)]
one_process = process_blocks[10200]

In [35]:
one_process.parsed

{'metadata': {'Category type': 'material',
  'Process identifier': 'AGRIBALU000000003110191',
  'Type': 'Unit process',
  'Status': 'Finished',
  'Time period': 'Unspecified',
  'Geography': 'Unspecified',
  'Technology': 'Unspecified',
  'Representativeness': 'Unspecified',
  'Multiple output allocation': 'Unspecified',
  'Substitution allocation': 'Unspecified',
  'Cut off rules': 'Unspecified',
  'Capital goods': 'Unspecified',
  'Boundary with nature': 'Unspecified',
  'Date': datetime.date(2020, 4, 15),
  'Comment': 'Inventory of AGRIBALYSE v3.1, 2022 (update to v3.1 in August 2022 by EVEA S.A.S Coopérative). See the complete description of AGRIBALYSE database.',
  'System description': 'AGRIBALYSE'}}

`Process` blocks are themselves made up of smaller blocks.

In [25]:
one_process.blocks

{'Products': <bw_simapro_csv.blocks.products.Products at 0x39ff92440>,
 'Materials/fuels': <bw_simapro_csv.blocks.technosphere_edges.TechnosphereEdges at 0x39ff924a0>,
 'Emissions to air': <bw_simapro_csv.blocks.generic_biosphere.GenericUncertainBiosphere at 0x39ff924d0>}

`Process` blocks can have the following constituent blocks:

* Avoided products
* Calculated parameters
* Economic issues
* Electricity/heat
* Emissions to air
* Emissions to soil
* Emissions to water
* Final waste flows
* Input parameters
* Materials/fuels
* Non material emissions
* Products
* Remaining waste
* Resources
* Separated waste
* Social issues
* Waste scenario
* Waste to treatment
* Waste treatment

Many of these can [reuse a generic parser](https://github.com/brightway-lca/bw_simapro_csv/blob/main/bw_simapro_csv/blocks/process.py#L22) for technosphere inputs or biosphere edges.

### Parameterization

The format for input or output lines is more or less OK. We do need to determine whether the amount field is a number or a formula. This is complicated because SimaPro allows for arbitrary decimal separators (e.g. `10,40`), and for the percentage sign (e.g. `80%`).
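
A minimal sketch of this check could look like the following. It is an illustration under stated assumptions, not the library's implementation: `parse_amount` is a hypothetical name, `80%` is assumed to mean `0.8`, and anything which can't be read as a number is assumed to be a formula.

```python
def parse_amount(text: str, decimal_separator: str = ","):
    """Return a float, or None if the field looks like a formula."""
    cleaned = text.strip()
    is_percentage = cleaned.endswith("%")
    if is_percentage:
        cleaned = cleaned[:-1].strip()
    if decimal_separator != ".":
        cleaned = cleaned.replace(decimal_separator, ".")
    try:
        value = float(cleaned)
    except ValueError:
        return None  # not a plain number, so treat it as a formula
    # Assumption: a percentage is stored as a fraction
    return value / 100 if is_percentage else value
```

Note that `float` is quite permissive (it accepts strings like `inf` or `1_000`), so a real implementation needs stricter checks than this sketch.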

If we determine that there is a formula, we need to keep a reference to it, because the variables used in that formula are sometimes only defined at the end of the SimaPro CSV file.

We also need to deal with SimaPro parsing formulas in a case-insensitive way, i.e. 'FOO' is the same as 'foo'. This is **not** the case when parsing with Python. To handle this, and to handle variables being defined with names which are reserved words in Python (like `yield`), we 1) uppercase all variable names, and 2) prefix all variables with `SP_`.
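
The renaming step can be sketched with a regular expression. This is only an illustration: `normalize_name` and `normalize_formula` are hypothetical names, and the rule that a name directly followed by `(` is a function call which should be left alone is our assumption.

```python
import re

# A name starting at a word boundary and not followed by "(".
# The word boundary keeps the "e5" in "1e5" from being matched.
VARIABLE = re.compile(r"\b[A-Za-z_]\w*\b(?!\s*\()")


def normalize_name(name: str) -> str:
    """Uppercase a variable name and add the `SP_` prefix."""
    return "SP_" + name.upper()


def normalize_formula(formula: str) -> str:
    """Rename every variable in a formula; leave function calls alone."""
    return VARIABLE.sub(lambda match: normalize_name(match.group(0)), formula)
```

With this sketch, `normalize_formula("yield * Foo + 2")` gives `SP_YIELD * SP_FOO + 2`, which Python can evaluate because `SP_YIELD` is no longer a reserved word.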

### Inputs and Outputs

This is actually pretty simple. Parse each line according to its reverse-engineered format and add it to the list.

We check uncertainty values, and convert impossible distributions to "No uncertainty".

In [26]:
one_process.blocks['Products'].parsed

[{'name': 'Mutton, leg, raw, processed in FR | Chilled | PS | at distribution {FR} U',
  'unit': 'kg',
  'waste_type': 'Compost',
  'category': 'Agricultural\\Food\\Distribution\\Meat, egg and fish\\Raw meat\\Lamb and mutton',
  'comment': '',
  'line_no': 4098335,
  'amount': 1.0,
  'allocation': 100.0}]

In [27]:
one_process.blocks['Materials/fuels'].parsed

[{'name': 'Mutton, leg, raw, processed in FR | Chilled | PS | at packaging {FR} U',
  'unit': 'kg',
  'comment': 'Includes losses at distribution',
  'line_no': 4098342,
  'amount': 1.0,
  'uncertainty type': 0,
  'loc': 1.0},
 {'name': 'Electricity, low voltage {FR}| market for | Cut-off, S - Copied from Ecoinvent U',
  'unit': 'MJ',
  'comment': "Energy consumption at distribution, derived from distribution energy default for: 'Chilled' products and product density of 1.0 kg/l",
  'line_no': 4098343,
  'amount': 0.003114,
  'uncertainty type': 0,
  'loc': 0.000865,
  'original unit before conversion': 'kWh',
  'unit conversion factor': 3.6},
 {'name': 'Electricity, low voltage {FR}| market for | Cut-off, S - Copied from Ecoinvent U',
  'unit': 'MJ',
  'comment': "Cooling at distribution, derived from distribution energy default for: 'Chilled' products and product density of 1.0 kg/l",
  'line_no': 4098344,
  'amount': 0.008308800000000002,
  'uncertainty type': 0,
  'loc': 0.002308,


## Units

We convert units to the "natural" unit in that dimension. The natural unit is the one used to define the unit in the `Units` block. The `Units` block looks like this:

```console
Units
kg;Mass;1;kg
p;Amount;1;p
g;Mass;0,001;kg
kWh;Energy;3,6;MJ
l;Volume;0,001;m3
m3;Volume;1;m3
```

In [33]:
units = [block for block in sp.blocks if isinstance(block, bw_simapro_csv.blocks.Units)][0]
units.parsed[:3]

[{'name': 'kg',
  'dimension': 'Mass',
  'conversion': 1.0,
  'reference unit name': 'kg',
  'line_no': 6938958},
 {'name': 'p',
  'dimension': 'Amount',
  'conversion': 1.0,
  'reference unit name': 'p',
  'line_no': 6938959},
 {'name': 'g',
  'dimension': 'Mass',
  'conversion': 0.001,
  'reference unit name': 'kg',
  'line_no': 6938960}]
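
To make the conversion concrete, here is a small sketch which parses `Units` lines and converts an amount to the natural unit. The function names are hypothetical and the code is not taken from the library; the dictionary keys follow the parsed output shown above.

```python
def parse_unit_line(line: str, delimiter: str = ";", decimal_separator: str = ","):
    """Parse one line of the `Units` block into a dictionary."""
    name, dimension, factor, reference = line.split(delimiter)
    return {
        "name": name,
        "dimension": dimension,
        "conversion": float(factor.replace(decimal_separator, ".")),
        "reference unit name": reference,
    }


def to_reference_unit(amount: float, unit: str, units: dict):
    """Convert an amount to the natural unit of its dimension."""
    row = units[unit]
    return amount * row["conversion"], row["reference unit name"]


# Lookup table keyed by unit name
units_by_name = {
    row["name"]: row
    for row in map(parse_unit_line, ["kg;Mass;1;kg", "kWh;Energy;3,6;MJ"])
}
amount, unit = to_reference_unit(0.000865, "kWh", units_by_name)
```

This reproduces the electricity input shown earlier, where 0.000865 kWh is stored as roughly 0.003114 MJ with a `unit conversion factor` of 3.6.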

Because we remove illegal characters, we can run into funny situations. For example, here is a log message from client data:

```console
2024-08-13 17:18:05.833 | CRITICAL | bw_simapro_csv.units:normalize_units:43 - 
    Multiple different unit conversions given for input unit "g".
    After removing illegal characters and fixing potential encoding issues,
    unit "g" has multiple possible conversion factors. This will lead to
    incorrect results and undefined behaviour. To fix this, please remove
    all unwanted unit conversions lines. We found the follow possible conversions:
    Source unit; target unit; conversion; line number:
	('g', 'kg', 0.001, 738)
	('g', 'kg', 1e-09, 849)
```

This has to be fixed manually.
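
A check of this kind can be sketched as follows. The function name `find_conflicting_units` is hypothetical and this is not the code in `units.py`; it only assumes rows shaped like the parsed `Units` output above.

```python
from collections import defaultdict


def find_conflicting_units(rows):
    """Return units which have more than one distinct conversion."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["name"]].add((row["reference unit name"], row["conversion"]))
    # Keep only the unit names with several different definitions
    return {
        name: sorted(options)
        for name, options in seen.items()
        if len(options) > 1
    }
```

An identical line which is simply repeated is not reported, because the set only stores distinct combinations of target unit and conversion factor.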

## Missing references

There is one other class of errors we have seen in real data. SimaPro has metadata blocks at the end of the file, like the list of literature references, units, and different types of biosphere flows. However, there is **no guarantee** that a reference to a unit or literature reference actually exists in that metadata block. Sometimes you are stuck with reference labels and no more info.
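
Finding such dangling references is a simple set difference. The sketch below does this for units; the function name `missing_units` and the shape of the input dictionaries are assumptions for the example.

```python
def missing_units(exchanges, defined_units):
    """List units used in exchanges but absent from the `Units` block."""
    defined = {row["name"] for row in defined_units}
    used = {exchange["unit"] for exchange in exchanges}
    return sorted(used - defined)
```

The same pattern would apply to literature references or biosphere flows, given the labels used in processes and the labels defined in the corresponding metadata block.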

## Exporting to `brightway`

`bw_simapro_csv` is not Brightway-specific, even if `bw` is in the name. In fact, Brightway isn't installed if you run `pip install bw_simapro_csv`. But of course it can export to Brightway if desired. When exporting, we do the following:

* `Process` metadata is turned into tags
* `Waste treatment` inputs and `Products` outputs are labelled as functional edges
* Processes with more than one functional edge are stored as `multifunctional` processes
* Allocation values are added to the `properties` dict as `manual_allocation`
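
The labelling rules above can be sketched with plain dictionaries. This is not the export code and does not use Brightway: the function names, the `functional` key, the `"process"` type label, and the choice to store `properties` on the edge are all assumptions made for the illustration.

```python
# Blocks whose rows count as functional edges, per the list above
FUNCTIONAL_BLOCKS = {"Products", "Waste treatment"}


def to_edges(blocks: dict) -> list:
    """Turn parsed block rows into edges with a `functional` flag."""
    edges = []
    for block_name, rows in blocks.items():
        functional = block_name in FUNCTIONAL_BLOCKS
        for row in rows:
            edge = {
                "name": row["name"],
                "amount": row["amount"],
                "functional": functional,
            }
            if functional and "allocation" in row:
                # Assumption: the allocation value lives on the edge
                edge["properties"] = {"manual_allocation": row["allocation"]}
            edges.append(edge)
    return edges


def process_type(edges: list) -> str:
    """More than one functional edge makes a process multifunctional."""
    count = sum(1 for edge in edges if edge["functional"])
    return "multifunctional" if count > 1 else "process"
```

A process with two rows in its `Products` block would therefore get two functional edges, each carrying its own allocation value, and be labelled `multifunctional`.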