## 0.0. setup

1. if there is not already a _hybridization_ `conda` environment set up, [create a new `conda` environment](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html) using the provided environment file `conda_env_hybridization.yml`:

```
conda env create -f <path to environment yaml file>
```

2. create directories and copy database files from the shared drive (`/srv/data/autumn_school/hybridization/`)

In [3]:
# ensures that we start with a fresh directory, since ecospold2matrix can mess up Ecoinvent files
!rm -rf ~/hybridization_data/databases_raw 
!mkdir -p ~/hybridization_data/databases_raw
!cp -a /srv/data/autumn_school/hybridization/databases_raw/* ~/hybridization_data/databases_raw

In [4]:
!mkdir -p ~/hybridization_data/databases_pickle
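
for readers who prefer to avoid shell escapes, the directory setup above can be approximated in plain Python. this is only a minimal sketch and not part of the original workflow: the helper name `reset_directory` is hypothetical, and `shutil.copytree` behaves roughly (not exactly) like `cp -a`, for example it also copies hidden files.

```python
import shutil
from pathlib import Path
from typing import Optional


def reset_directory(target: Path, source: Optional[Path] = None) -> Path:
    """Delete `target` if present, recreate it, and optionally fill it from `source`."""
    # rough equivalent of `rm -rf`: no error if the directory does not exist yet
    shutil.rmtree(target, ignore_errors=True)
    if source is not None:
        # rough equivalent of `mkdir -p` followed by `cp -a <source>/* <target>`
        shutil.copytree(source, target)
    else:
        # rough equivalent of `mkdir -p`
        target.mkdir(parents=True, exist_ok=True)
    return target


# usage (paths taken from the cells above):
# reset_directory(
#     Path.home() / 'hybridization_data' / 'databases_raw',
#     Path('/srv/data/autumn_school/hybridization/databases_raw'),
# )
# reset_directory(Path.home() / 'hybridization_data' / 'databases_pickle')
```

note that, unlike the `mkdir -p` cell above, calling the helper on `databases_pickle` would also delete any pickles already stored there.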

## 0.1. imports
### 0.1.1. regular imports

In [5]:
# i/o
import sys
import os
from pathlib import Path
import gzip
import pickle
# lca
import ecospold2matrix as e2m
import pymrio
import brightway2 as bw
# data science
import pandas as pd

Using environment variable BRIGHTWAY2_DIR for data directory:
/home/weinold/bw_data


### 0.1.2. local imports

In [6]:
%%capture
pylcaio_directory = os.path.join(Path.home(), 'pylcaio')
!git clone https://github.com/michaelweinold/pylcaio.git $pylcaio_directory # this is a fork with various fixes

In [7]:
sys.path.append(os.path.join(Path.home(), 'pylcaio', 'src')) # required for local import of pylcaio
import pylcaio

## 0.2. file paths

set location of databases (Ecoinvent and Exiobase) for use by the appropriate Python packages

### 0.2.1. directories

In [8]:
%%capture
print(path_dir_databases_raw := os.path.join(Path.home(), 'hybridization_data/databases_raw'))
print(path_dir_databases_pickle := os.path.join(Path.home(), 'hybridization_data/databases_pickle'))

### 0.2.2. databases (Ecoinvent 3.5, Exiobase 3.X)

In [9]:
%%capture
# Exiobase
print(path_file_exiobase_input := os.path.join(path_dir_databases_raw, 'IOT_2012_pxp.zip'))
print(path_file_exiobase_output := os.path.join(path_dir_databases_pickle, 'exiobase_monetary_pxp_2012.pickle'))
# Ecoinvent
print(path_dir_ecoinvent_input := os.path.join(path_dir_databases_raw, 'ecoinvent-3.5-cutoff'))
print(path_file_ecoinvent_characterisation := os.path.join(path_dir_databases_raw, 'LCIA_Implementation_v3.5.xlsx'))

### 0.2.3. databases (Ecoinvent 3.8, Exiobase 3.8)

In [23]:
%%capture
# Exiobase
print(path_file_exiobase_input := os.path.join(path_dir_databases_raw, 'IOT_2011_pxp.zip'))
print(path_file_exiobase_output := os.path.join(path_dir_databases_pickle, 'exiobase_3_8_monetary_pxp_2011.pickle'))
# Ecoinvent
print(path_dir_ecoinvent_input := os.path.join(path_dir_databases_raw, 'ecoinvent-3.8-cutoff'))
print(path_file_ecoinvent_characterisation := os.path.join(path_dir_databases_raw, 'LCIA_Implementation_v3.8.xlsx'))
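
sections 0.2.2 and 0.2.3 assign the same variable names, so whichever cell was executed last determines which database versions are used further down. before starting the long-running parsing steps, it may help to confirm that the selected inputs actually exist. the following is a minimal sketch; `find_missing_paths` is a hypothetical helper, not part of any of the packages used here.

```python
import os


def find_missing_paths(paths: dict) -> list:
    """Return the labels of all entries whose path does not exist on disk."""
    return [label for label, path in paths.items() if not os.path.exists(path)]


# usage with the variables defined in the cells above:
# find_missing_paths({
#     'Exiobase archive': path_file_exiobase_input,
#     'Ecoinvent directory': path_dir_ecoinvent_input,
#     'characterisation file': path_file_ecoinvent_characterisation,
# })
```

an empty list means all inputs were found. the Exiobase archive may legitimately be reported as missing if it has not yet been copied or downloaded.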

## 1.1. read databases and save to disk
### 1.1.1. download the latest version of the Exiobase database and save `pickle` to disk

❔ creates `pymrio.IOSystem` class instance (collection of pd.DataFrames etc.) \
⏳ ~1min

In [15]:
exio_meta = pymrio.download_exiobase3(
    storage_folder = path_dir_databases_raw,
    system = 'pxp',
    years = [2011]
)

In [16]:
%%time
exiobase: pymrio.IOSystem = pymrio.parse_exiobase3(path_file_exiobase_input)
with open(path_file_exiobase_output, 'wb') as file_handle:    
    pickle.dump(obj = exiobase, file = file_handle, protocol=pickle.HIGHEST_PROTOCOL)

CPU times: user 1min 4s, sys: 2.14 s, total: 1min 6s
Wall time: 1min 6s
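
in a later session, the saved object can be restored without parsing the archive again. the sketch below uses only the standard-library `pickle` module; the helper names `save_pickle` and `load_pickle` are hypothetical. restoring the real `pymrio.IOSystem` object requires `pymrio` to be importable, and pickles should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def save_pickle(obj, path: Path) -> None:
    """Serialise `obj` to `path`, mirroring the cell above."""
    with open(path, 'wb') as file_handle:
        pickle.dump(obj, file_handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path: Path):
    """Restore an object previously written with `save_pickle`."""
    with open(path, 'rb') as file_handle:
        return pickle.load(file_handle)


# usage with the path defined in section 0.2:
# exiobase = load_pickle(path_file_exiobase_output)
```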


### 1.1.2. read Ecoinvent database and save `pickle` to disk

❔ creates `e2m.Ecospold2Matrix` class instance \
⏳ ~12min

In [17]:
%%capture
print(e2m_project_name := 'ecoinvent_3_8_cutoff')
print(tmp_dir_e2m := os.path.join(path_dir_databases_pickle, e2m_project_name + '_log'))
print(tmp_pattern_e2m := '*.db')

#### 1.1.2.1. run `ecospold2matrix`

In [28]:
parser = e2m.Ecospold2Matrix(
    sys_dir = path_dir_ecoinvent_input,
    lci_dir = os.path.join(path_dir_ecoinvent_input, 'datasets'),
    project_name = e2m_project_name,
    characterisation_file = path_file_ecoinvent_characterisation,
    out_dir = path_dir_databases_pickle,
    positive_waste = False,
    nan2null = True
)

2022-10-26 20:43:07,993 - ecoinvent_3_8_cutoff - INFO - Ecospold2Matrix Processing
INFO:ecoinvent_3_8_cutoff:Ecospold2Matrix Processing
2022-10-26 20:43:08,160 - ecoinvent_3_8_cutoff - INFO - Current git commit: 3dfc1ab7df1057f40fb75f9b6457cd40b2022c29
INFO:ecoinvent_3_8_cutoff:Current git commit: 3dfc1ab7df1057f40fb75f9b6457cd40b2022c29
2022-10-26 20:43:08,161 - ecoinvent_3_8_cutoff - INFO - Project name: ecoinvent_3_8_cutoff
INFO:ecoinvent_3_8_cutoff:Project name: ecoinvent_3_8_cutoff
2022-10-26 20:43:08,162 - ecoinvent_3_8_cutoff - INFO - Unit process and Master data directory: /home/weinold/hybridization_data/databases_raw/ecoinvent-3.8-cutoff
INFO:ecoinvent_3_8_cutoff:Unit process and Master data directory: /home/weinold/hybridization_data/databases_raw/ecoinvent-3.8-cutoff
2022-10-26 20:43:08,163 - ecoinvent_3_8_cutoff - INFO - Data saved in: /home/weinold/hybridization_data/databases_pickle
INFO:ecoinvent_3_8_cutoff:Data saved in: /home/weinold/hybridization_data/databases_pickl

In [29]:
parser.ecospold_to_Leontief(
    fileformats = 'Pandas',
    with_absolute_flows=True
)

2022-10-26 20:43:10,127 - ecoinvent_3_8_cutoff - INFO - Products extracted from IntermediateExchanges.xml with SHA-1 of 1da23bc8fd24d97422a2a21ba3626d2cdfa6a428
INFO:ecoinvent_3_8_cutoff:Products extracted from IntermediateExchanges.xml with SHA-1 of 1da23bc8fd24d97422a2a21ba3626d2cdfa6a428
2022-10-26 20:43:30,534 - ecoinvent_3_8_cutoff - INFO - Activities extracted from ActivityIndex.xml with SHA-1 of 03403c01ac6f74a5d6cc5ca8820593f7e516b709
INFO:ecoinvent_3_8_cutoff:Activities extracted from ActivityIndex.xml with SHA-1 of 03403c01ac6f74a5d6cc5ca8820593f7e516b709
2022-10-26 20:43:30,566 - ecoinvent_3_8_cutoff - INFO - Processing 19565 files in /home/weinold/hybridization_data/databases_raw/ecoinvent-3.8-cutoff/datasets
INFO:ecoinvent_3_8_cutoff:Processing 19565 files in /home/weinold/hybridization_data/databases_raw/ecoinvent-3.8-cutoff/datasets
2022-10-26 20:44:57,397 - ecoinvent_3_8_cutoff - INFO - Flows saved in /home/weinold/hybridization_data/databases_raw/ecoinvent-3.8-cutoff/f

starting characterisation
            cas                                        aName     bad_cas  \
0       93-65-2                                     mecoprop   7085-19-0   
1   107534-96-3                                 tebuconazole  80443-41-0   
2      302-04-5                                          NaN  71048-69-6   
3   138261-41-3                                          NaN  38261-41-3   
4      108-62-3                                  metaldehyde   9002-91-9   
5      107-15-3                                          NaN    117-15-3   
6       74-89-5                                 methyl amine     75-89-5   
7    17428-41-0                                 arsenic, ion   7440-38-2   
8    22537-48-0                                 cadmium, ion   7440-43-9   
9    18540-29-9                                  chromium vi   7440-47-3   
10   17493-86-6                                  Copper, ion   7440-50-8   
11   14701-22-5                                  Nickel, ion  

  self.STR.cas = self.STR.cas.str.replace('^[0]*','')


OperationalError: duplicate column name: id

#### 1.1.2.2. clean up temporary files

unfortunately, `ecospold2matrix` creates many files (`.log`, `.db`) in locations that cannot be configured. they are not cleaned up automatically and may interfere with repeated runs of the code, so we delete them here.

In [None]:
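def delete_e2m_files(list_string: list) -> None:
    # each entry is a path or a glob pattern (e.g. '*.db');
    # patterns are expanded by the shell relative to the current working directory
    for i in list_string:
        !rm -rf $i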

In [None]:
delete_e2m_files(
        [
            tmp_dir_e2m,
            tmp_pattern_e2m,
        ]
)
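
the same cleanup can be sketched without shell escapes, which also makes it possible to see what was actually removed. this is a hedged alternative, not the original approach: `delete_paths` and its `base_dir` parameter are hypothetical. with the default `base_dir='.'`, relative patterns such as `*.db` are resolved against the current working directory, as in the shell version above.

```python
import glob
import os
import shutil


def delete_paths(patterns: list, base_dir: str = '.') -> list:
    """Delete every file or directory matching one of the patterns; return what was removed."""
    removed = []
    for pattern in patterns:
        # absolute patterns are used as-is, relative ones are resolved against base_dir
        for match in glob.glob(os.path.join(base_dir, pattern)):
            if os.path.isdir(match) and not os.path.islink(match):
                shutil.rmtree(match)
            else:
                os.remove(match)
            removed.append(match)
    return removed


# usage with the variables defined in section 1.1.2:
# delete_paths([tmp_dir_e2m, tmp_pattern_e2m])
```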