## Using the Data Management Framework in IDAES to store data and parameters

In the previous two notebooks, we saw how to use the `parmest` tool with IDAES models (unit models or the state block) to estimate the binary interaction parameters of the NRTL model with benzene and toluene as components. In this module, we use the Data Management Framework (DMF) in IDAES, which enables data provenance, to store the estimated parameters together with the datasets that were used to estimate them. This example builds on the `Parameter_estimation_NRTL_using_unit_model` notebook.

We will complete the following tasks:
* Split the dataset into two sub-datasets and use each one to estimate the binary interaction parameters
* Use DMF to store the estimated parameters with the data source that was used

## Key links to documentation:
* DMF - https://idaes-pse.readthedocs.io/en/stable/user_guide/components/dmf/index.html



<div class="alert alert-block alert-info">
<b>Inline Exercise:</b>
Import `ConcreteModel` from Pyomo, and `FlowsheetBlock` and `Flash` from IDAES.
</div>

In [None]:
# Todo: import ConcreteModel from pyomo.environ
from pyomo.environ import ConcreteModel, value

# Todo: import FlowsheetBlock from idaes.core
from idaes.core import FlowsheetBlock

# Todo: import Flash unit model from idaes.models.unit_models
from idaes.models.unit_models import Flash


In the next cell, we will be importing the parameter block that we will be using in this module and the idaes logger. 

In [None]:
from idaes.models.properties.activity_coeff_models.\
    BTX_activity_coeff_VLE import BTXParameterBlock
import idaes.logger as idaeslog

In the next cell, we import `parmest` from Pyomo and the `pandas` package. We need `pandas` because `parmest` uses `pandas.DataFrame` for handling the input data and the results.

In [None]:
import pyomo.contrib.parmest.parmest as parmest
import pandas as pd

## Setting up an initialized model

We need to provide a method that returns an initialized model to the `parmest` tool in Pyomo.

<div class="alert alert-block alert-info">
<b>Inline Exercise:</b>
Using what you have learned from previous modules, fill in the missing code below to return an initialized IDAES model. 
</div>

In [None]:
def NRTL_model(data):
    
    #Todo: Create a ConcreteModel object
    m = ConcreteModel()
    
    #Todo: Create FlowsheetBlock object
    m.fs = FlowsheetBlock(dynamic=False)
    

    #Todo: Create a properties parameter object with the following options:
    # "valid_phase": ('Liq', 'Vap')
    # "activity_coeff_model": 'NRTL'
    m.fs.properties = BTXParameterBlock(valid_phase=('Liq', 'Vap'),
                                        activity_coeff_model='NRTL')
    m.fs.flash = Flash(property_package=m.fs.properties)

    # Initialize at a certain inlet condition
    m.fs.flash.inlet.flow_mol.fix(1)
    m.fs.flash.inlet.temperature.fix(368)
    m.fs.flash.inlet.pressure.fix(101325)
    m.fs.flash.inlet.mole_frac_comp[0, "benzene"].fix(0.5)
    m.fs.flash.inlet.mole_frac_comp[0, "toluene"].fix(0.5)

    # Set Flash unit specifications
    m.fs.flash.heat_duty.fix(0)
    m.fs.flash.deltaP.fix(0)

    # Fix NRTL specific variables
    # alpha values (0.3 for the binary pair, 0 on the diagonal)
    m.fs.properties.\
        alpha["benzene", "benzene"].fix(0)
    m.fs.properties.\
        alpha["benzene", "toluene"].fix(0.3)
    m.fs.properties.\
        alpha["toluene", "toluene"].fix(0)
    m.fs.properties.\
        alpha["toluene", "benzene"].fix(0.3)

    # initial tau values
    m.fs.properties.\
        tau["benzene", "benzene"].fix(0)
    m.fs.properties.\
        tau["benzene", "toluene"].fix(-0.9)
    m.fs.properties.\
        tau["toluene", "toluene"].fix(0)
    m.fs.properties.\
        tau["toluene", "benzene"].fix(1.4)

    # Initialize the flash unit
    m.fs.flash.initialize(outlvl=idaeslog.INFO_LOW)

    # Fix at actual temperature
    m.fs.flash.inlet.temperature.fix(float(data["temperature"]))

    # Set bounds on variables to be estimated
    m.fs.properties.\
        tau["benzene", "toluene"].setlb(-5)
    m.fs.properties.\
        tau["benzene", "toluene"].setub(5)

    m.fs.properties.\
        tau["toluene", "benzene"].setlb(-5)
    m.fs.properties.\
        tau["toluene", "benzene"].setub(5)

    # Return initialized flash model
    return m


In [None]:
from idaes.core.util.model_statistics import degrees_of_freedom
import pytest

# Testing the initialized model
test_data = {"temperature": 368}

m = NRTL_model(test_data)

# Check that degrees of freedom is 0
assert degrees_of_freedom(m) == 0

# Check for output values
assert value(m.fs.flash.liq_outlet.mole_frac_comp[0, 'benzene']) == pytest.approx(0.389, abs=1e-2)
assert value(m.fs.flash.vap_outlet.mole_frac_comp[0, 'benzene']) == pytest.approx(0.605, abs=1e-2)

assert value(m.fs.flash.liq_outlet.mole_frac_comp[0, 'toluene']) == pytest.approx(0.610, abs=1e-2)
assert value(m.fs.flash.vap_outlet.mole_frac_comp[0, 'toluene']) == pytest.approx(0.394, abs=1e-2)

## Parameter estimation using parmest

In addition to providing a method to return an initialized model, the `parmest` tool needs the following:

* List of variable names to be estimated
* Dataset with multiple scenarios
* Expression to compute the sum of squared errors



In this example, we only estimate the binary interaction parameter (`tau_ij`). Given that this variable is usually indexed as `tau_ij = Var(component_list, component_list)`, there are 2*2=4 degrees of freedom for this two-component system. However, when i=j, the binary interaction parameter is 0. Therefore, in this problem, we estimate only the following two variables:

* fs.properties.tau['benzene', 'toluene']
* fs.properties.tau['toluene', 'benzene']
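As a small illustration of the counting argument above, the following sketch enumerates the index pairs of `tau` in plain Python and keeps only the off-diagonal ones. It does not touch the IDAES model; the name `candidate_names` is introduced here for illustration only.

```python
from itertools import product

# Component list for this example (taken from the text above)
components = ["benzene", "toluene"]

# All index pairs of tau: 2 * 2 = 4 entries
all_pairs = list(product(components, repeat=2))

# Diagonal entries (i == j) are fixed at 0, so only the
# off-diagonal pairs remain to be estimated
estimated_pairs = [(i, j) for i, j in all_pairs if i != j]

# Build variable-name strings in the form used in this notebook
candidate_names = [
    f"fs.properties.tau['{i}', '{j}']" for i, j in estimated_pairs
]

for name in candidate_names:
    print(name)
```

For a mixture with more components, the same enumeration would give n*(n-1) off-diagonal pairs.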

<div class="alert alert-block alert-info">
<b>Inline Exercise:</b>
Create a list called `variable_name` with the above-mentioned variables declared as strings.
</div>

In [None]:
# Todo: Create a list of vars to estimate
variable_name = ["fs.properties.tau['benzene', 'toluene']",
                 "fs.properties.tau['toluene', 'benzene']"]


Pyomo's `parmest` tool supports the following data formats:
- pandas dataframe
- list of dictionaries
- list of json file names.

Please see the `parmest` documentation for more details.
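To make the "list of dictionaries" format concrete, here is a minimal sketch that builds such a list with the Python standard library only. The column names match the ones used in this notebook (`temperature`, `liq_benzene`, `vap_benzene`), but the numbers are made-up placeholders, not values from `BT_NRTL_dataset.csv`.

```python
import csv
import io

# Placeholder CSV text; the numbers are made up for illustration only
csv_text = """temperature,liq_benzene,vap_benzene
365.0,0.48,0.70
368.0,0.39,0.61
"""

# Read each row as a dictionary and convert the values to floats,
# giving one dictionary per scenario
scenarios = [
    {key: float(val) for key, val in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]

for scenario in scenarios:
    print(scenario["temperature"],
          scenario["liq_benzene"],
          scenario["vap_benzene"])
```

Each dictionary plays the same role as the `test_data = {"temperature": 368}` dictionary used earlier to test `NRTL_model`.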

For this example, we load data from the CSV file `BT_NRTL_dataset.csv`. The dataset consists of fifty data points, which provide the mole fraction of benzene in the vapor and liquid phases as a function of temperature.

In [None]:
# Load all data from csv
data = pd.read_csv('BT_NRTL_dataset.csv')

# Display the dataset
display(data)

# Split the data set into two data sets
data_subset_1 = data.loc[0:24]
display(data_subset_1)

data_subset_2 = data.loc[25:49].reset_index()
display(data_subset_2)

### Set up DMF
* Perform imports
* Create a new workspace in the `workshop_workspace` subdirectory under `~/.idaes`. You can modify this path if you need to.
* Set that workspace as the default to use for `%dmf` "magics" in this Jupyter Notebook

In [None]:
# DMF imports
from pathlib import Path
from idaes.core.dmf import DMF, magics
from idaes.core.dmf.resource import Resource, create_relation, Predicates
# use or create idaes "dotfile" in home directory
wspath = Path("~/.idaes").expanduser()
if not wspath.exists():
    wspath.mkdir()
# use/create a subdirectory as a DMF workspace for the workshop
wspath = wspath / "workshop_workspace"
_dmf = DMF(path=wspath, create=not wspath.exists())

%dmf init ~/.idaes/workshop_workspace

### Show contents of DMF workspace
This shows what is currently in the DMF workspace. If you ran this notebook earlier with the same workspace location,
you will see the DMF resources you created at that time. Otherwise it will be empty.

In [None]:
# Show current contents of DMF workspace
%dmf list

In [None]:
# Base dataset
name = "BT NRTL dataset"
ds_base = _dmf.find_one(name=name)
if not ds_base:
    ds_base = _dmf.new(file="BT_NRTL_dataset.csv", name=name)
    
# Splits
ds_splits, new_relations = [], False
for i in (1, 2):
    name = f'BT NRTL split{i}'
    df = data_subset_1 if i == 1 else data_subset_2
    df_file = f'BT_NRTL_dataset_split{i}.csv'
    df.to_csv(df_file)
    dss = _dmf.find_one(name=name)
    if not dss:
        dss = _dmf.new(file=df_file, name=name)
        create_relation(dss, Predicates.derived, ds_base)
        new_relations = True
    ds_splits.append(dss)

# Update if relations were added
if new_relations:
    _dmf.update()

print("done")

### List DMF objects
Check if the raw data and splits are recorded in the DMF.
Note that if you ran this notebook earlier, the estimated parameters will also be listed. This is a feature, not a bug!

In [None]:
_dmf.resource_count
%dmf list

We need to provide a method that returns an expression for the sum of squared errors, which is used as the objective when solving the parameter estimation problem. For this problem, the error is the difference between the model prediction and the data for the mole fraction of benzene in the vapor and liquid phases.

<div class="alert alert-block alert-info">
<b>Inline Exercise:</b>
Complete the following cell by adding an expression to compute the sum of squared errors.
</div>

In [None]:
# Create method to return an expression that computes the sum of squared error
def SSE(m, data):
    expr = ((float(data["vap_benzene"]) -
             m.fs.flash.vap_outlet.mole_frac_comp[0, "benzene"])**2 +
            (float(data["liq_benzene"]) -
             m.fs.flash.liq_outlet.mole_frac_comp[0, "benzene"])**2)
    return expr*1E4

<div class="alert alert-block alert-warning">
<b>Note:</b>
We have scaled the expression up by a factor of 10000 because the SSE computed here is an extremely small number, given that it is built from differences in mole fraction. A well-scaled objective helps improve solve robustness when using IPOPT.
</div>
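The effect of the scaling can be seen with a short plain-Python calculation. This is only a sketch: the measured and predicted mole fractions below are made-up numbers for a single scenario, and no model is solved.

```python
# Made-up measured and predicted benzene mole fractions for one scenario
measured = {"vap_benzene": 0.61, "liq_benzene": 0.39}
predicted = {"vap_benzene": 0.605, "liq_benzene": 0.389}

# Unscaled sum of squared errors for this scenario
sse = sum((measured[k] - predicted[k]) ** 2 for k in measured)

# Same quantity scaled by 1e4, as in the SSE function above
sse_scaled = sse * 1e4

print(f"unscaled: {sse:.2e}")
print(f"scaled:   {sse_scaled:.2e}")
```

Because the scaling is a constant factor, multiplying the scaled value by `1e-4` recovers the unscaled SSE.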


We are now ready to set up the parameter estimation problem. We will create one parameter estimation object per data subset, `pest_data_subset_1` and `pest_data_subset_2`. As shown below, we pass the method that returns an initialized model, the dataset, the list of variable names to estimate, and the SSE expression to the `Estimator` object. Setting `tee=True` prints the solver output when the parameter estimation problem is solved.

In [None]:
# Initialize a parameter estimation object for data subset 1
pest_data_subset_1 = parmest.Estimator(NRTL_model, data_subset_1, variable_name, SSE, tee=True)

# Initialize a parameter estimation object for data subset 2
pest_data_subset_2 = parmest.Estimator(NRTL_model, data_subset_2, variable_name, SSE, tee=True)

# Run parameter estimation using data subset 1
obj_value_1, parameters_1 = pest_data_subset_1.theta_est()

# Run parameter estimation using data subset 2
obj_value_2, parameters_2 = pest_data_subset_2.theta_est()

Let us display the results by running the next cell. 

In [None]:
print("----Using Data Subset 1----")
print()
print("The SSE at the optimal solution is %0.6f" % (obj_value_1*1e-4))
print()
print("The values for the parameters are as follows:")
for k,v in parameters_1.items():
    print(k, "=", v)

print()
print("----Using Data Subset 2----")
print()
print("The SSE at the optimal solution is %0.6f" % (obj_value_2*1e-4))
print()
print("The values for the parameters are as follows:")
for k,v in parameters_2.items():
    print(k, "=", v)

In [None]:
import re

def get_tau(d, chem1, chem2):
    found_key = None
    for key in d.index.to_list():
        if re.match(fr"fs\.properties\.tau.*{chem1}.*{chem2}.*", key):
            found_key = key
            break
    if found_key is None:
        raise KeyError(f"Did not find any key for fs.properties.tau with '{chem1}' followed by '{chem2}'")
    return d.at[found_key]

# reformulate in simpler form using get_tau() function to be robust to formatting details of the keys
# returned by theta_est() above.
dmf_params = {}
for n, p_n in ((1, parameters_1), (2, parameters_2)):
    for chem1, chem2 in (('benzene', 'toluene'), ('toluene', 'benzene')):
        dmf_params[f"{n}:{chem1[0]}{chem2[0]}"] = get_tau(p_n, chem1, chem2)
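The pattern matching used by `get_tau()` can be tried on its own, without `pandas` or a solved model. The sketch below applies the same regular expression to two hypothetical spellings of the keys; the helper name `find_tau_key` and both key lists are assumptions made for illustration, not output from `theta_est()`.

```python
import re

def find_tau_key(keys, chem1, chem2):
    """Return the first key naming tau with chem1 followed by chem2, or None."""
    pattern = fr"fs\.properties\.tau.*{chem1}.*{chem2}.*"
    for key in keys:
        if re.match(pattern, key):
            return key
    return None

# Two hypothetical key spellings for the same pair of variables
style_a = ["fs.properties.tau[benzene,toluene]",
           "fs.properties.tau[toluene,benzene]"]
style_b = ["fs.properties.tau['benzene', 'toluene']",
           "fs.properties.tau['toluene', 'benzene']"]

for keys in (style_a, style_b):
    print(find_tau_key(keys, "benzene", "toluene"))
    print(find_tau_key(keys, "toluene", "benzene"))
```

Since the pattern requires `chem1` to appear before `chem2`, the order of the component names selects the right entry regardless of quoting or spacing.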

In [None]:
# Check for values of the parameter estimation problem
import pytest

assert dmf_params["1:bt"] == pytest.approx(-0.906, 1e-2) 
assert dmf_params["1:tb"] == pytest.approx(1.445, 1e-2)
assert dmf_params["2:bt"] == pytest.approx(-1.02, 1e-2) 
assert dmf_params["2:tb"] == pytest.approx(1.667, 1e-2)

### Save parameters in DMF
The estimated parameters will be saved in the DMF, and a "relation" will be recorded that remembers which data split
each set of estimated parameters came from. When we're done, the relations in the DMF will look like this:
```
BT NRTL dataset
    │
    ├───◀─┤derived│ BT NRTL split2 ◀─┤derived│ BT NRTL est param2
    │
    └───◀─┤derived│ BT NRTL split1 ◀─┤derived│ BT NRTL est param1
```
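To show what these relations make possible, the following sketch represents the same graph as a plain Python dictionary and walks from an estimated parameter set back to the base dataset. This is not the DMF API; the dictionary `derived_from` and the function `provenance` are hypothetical names used only to illustrate the idea of provenance.

```python
# Each resource name maps to the resource it was derived from
# (None for the root dataset)
derived_from = {
    "BT NRTL dataset": None,
    "BT NRTL split1": "BT NRTL dataset",
    "BT NRTL split2": "BT NRTL dataset",
    "BT NRTL est param1": "BT NRTL split1",
    "BT NRTL est param2": "BT NRTL split2",
}

def provenance(name):
    """Follow 'derived' links from a resource back to the root dataset."""
    chain = [name]
    while derived_from[chain[-1]] is not None:
        chain.append(derived_from[chain[-1]])
    return chain

print(" -> ".join(provenance("BT NRTL est param1")))
```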

In [None]:
# save to DMF
# create resources
name = "BT NRTL est param1"
ds_s1 = _dmf.find_one(name=name)
if not ds_s1:
    ds_s1 = _dmf.new(name=name, desc="Solution for data subset 1", data={'SSE': obj_value_1, 'parameters': 
                                                                         {'tau': {'benzene,toluene': dmf_params["1:bt"],
                                                                                  'toluene,benzene': dmf_params["1:tb"]}}})
    create_relation(ds_s1, Predicates.derived, ds_splits[0])
name = "BT NRTL est param2"
ds_s2 = _dmf.find_one(name=name)
if not ds_s2:
    ds_s2 = _dmf.new(name=name, desc="Solution for data subset 2", data={'SSE': obj_value_2, 'parameters': 
                                                                         {'tau': {'benzene,toluene': dmf_params["2:bt"],
                                                                                  'toluene,benzene': dmf_params["2:tb"]}}})

    create_relation(ds_s2, Predicates.derived, ds_splits[1])
# save relations (prints number of objects processed)
_dmf.update()

## Using the Estimated Parameters

In the notebook [Flash Unit Model using NRTL](../ParamEst/DMF_2_flash_unit_Model_with_NRTL_solution.ipynb), we will see how the estimated parameters are used to simulate a flash unit model with the NRTL property package.