# Single Process Regression Optimization

Single process regression optimization targets a single unit or process, as shown in the figure below. The process is defined by its inputs, its output, and its operational behavior. The inputs are represented as input variables: some can be controlled (control variables) and others can only be observed (uncontrollable variables, e.g., sensor data). The process behavior defines the functional relationship between the input variables and the output variable(s). In our framework, this relationship is modeled by a regression function of the controllable and uncontrollable variables, learned as a predictive model from historical data. We support the following regression models: linear regression, decision tree, random forest, MARS, and multilayer perceptron.

The framework works in two phases. It first learns the regression function from historical data; several candidate models can be trained with any of the machine learning models listed above to find the best fit. The trained regression model is then used during the optimization phase to identify the values of the controllable variables that optimize the output.

<img src="figures/Picture1.jpg" alt="Drawing" style="width: 550px;"/>


This notebook demonstrates the *[Process and System Regression Optimization](https://developer.ibm.com/apis/catalog/ai4industry--regression-optimization-product/Introduction)* AI and optimization service, comprising models and algorithms for optimizing set points for process control to achieve greater efficiency, productivity, and reduced risk.

The APIs offer two specific applications:

1. Single process regression optimization aims to learn behavior and provide optimal set points for single unit systems with varying characteristics, and

2. System-wide process optimization provides optimal set points for systems composed of multiple process units, such as complex plants or manufacturing processes.

The service leverages advanced machine learning techniques including piece-wise linear regression models, deep learning and ensemble models with optimization algorithms based on Mixed-Integer Linear Programming (MILP), nonlinear optimization, and derivative-free optimization for ensemble models.

Specifically, in this notebook we demonstrate how to use *single process regression optimization*.

## Credentials

This notebook requires valid credentials to invoke the regression optimization APIs. Please obtain your own credentials when customizing this notebook for your own work. A trial subscription, available at [Regression Optimization @ IBM](https://developer.ibm.com/apis/catalog/ai4industry--regression-optimization-product/Introduction), provides the credentials.

In [2]:
# Credentials required for running notebook

Client_ID = "replace-with-valid-client-ID"
Client_Secret = "replace-with-valid-client-Secret"

In [3]:
# imports, headers and API endpoint URLs
import requests
import time
import pprint

headers = {
    'X-IBM-Client-Id': Client_ID,
    'X-IBM-Client-Secret': Client_Secret,
    'accept': "application/json",
}

reg_job_url = "https://api.ibm.com/ai4industry/run/pred-opt/v1/regression-model" 
opt_job_url = "https://api.ibm.com/ai4industry/run/pred-opt/v1/single-process-optimization" 
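As an alternative to pasting credentials into the notebook, they can be read from the environment. The following is a minimal sketch; the environment variable names `REGOPT_CLIENT_ID` and `REGOPT_CLIENT_SECRET` and both helper functions are hypothetical and not part of the service.

```python
import os

# Hypothetical variable names; choose whatever fits your environment.
ID_VAR = "REGOPT_CLIENT_ID"
SECRET_VAR = "REGOPT_CLIENT_SECRET"


def load_credentials(env=None):
    """Return (client_id, client_secret) from a mapping, os.environ by default."""
    env = os.environ if env is None else env
    missing = [name for name in (ID_VAR, SECRET_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return env[ID_VAR], env[SECRET_VAR]


def build_headers(client_id, client_secret):
    """Build the same header dictionary as the cell above."""
    return {
        "X-IBM-Client-Id": client_id,
        "X-IBM-Client-Secret": client_secret,
        "accept": "application/json",
    }


# Demonstration with an explicit mapping, so no real secrets are needed.
demo_env = {ID_VAR: "demo-id", SECRET_VAR: "demo-secret"}
demo_headers = build_headers(*load_credentials(demo_env))
print(sorted(demo_headers))
```

Calling `load_credentials()` without arguments reads from `os.environ` and fails early with a clear message if either value is absent.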

## Regression Service Creation and Job Submission

We build regression models to capture the inflow-outflow (input-output) relationship for the process node. Before building regression models, the individual sub-processes and their control, observed, and outcome variables need to be identified at the right granularity. The user also needs to complete data cleaning and feature extraction and engineering before using this service, with the intent to maximize the accuracy and interpretability of the trained regression model. The service supports several ML techniques to represent the historical behavior of a process. For each process, the machine learning model with the best test accuracy is deployed and then used during the optimization phase.

The regression service call needs 5 inputs, as shown in the cell below: *model-id, training-data, model-type, input-variables and target-variable*. 
* *model-id* serves as a unique model identifier
* *training-data* contains data for training the regression model in the csv format
* The choices for *model-type* include 
    * LR_SK (LinearRegression from scikit-learn), 
    * RF_SK (RandomForestRegressor from scikit-learn), 
    * RegTree_SK (DecisionTreeRegressor from scikit-learn combined with linear regression in the leaf nodes), 
    * MARS (multivariate adaptive regression splines from pyearth), 
    * NN_SK (MLPRegressor from scikit-learn) and 
    * NN_KERAS (deep neural network trained using Keras).
* Target and input variables may be given any label but need to match the variable names in the header in training data file. 

Any missing values will need to be removed prior to using this service.
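The two data requirements above (matching column names and no missing values) can be checked locally before uploading. The following is a minimal sketch with pandas; the helper `prepare_training_frame`, the small illustrative values, and the output path in the last comment are assumptions made for this example.

```python
import pandas as pd


def prepare_training_frame(df, input_vars, target_var):
    """Keep only the model columns and drop rows with missing values."""
    columns = [name.strip() for name in input_vars.split(",")] + [target_var]
    absent = [name for name in columns if name not in df.columns]
    if absent:
        raise ValueError(f"Columns not found in training data: {absent}")
    return df[columns].dropna().reset_index(drop=True)


# Small illustrative frame with one missing value in the last row.
raw = pd.DataFrame({
    "P1_1": [3558.7, 4443.4, None],
    "P1_2": [4372.1, 5204.9, 5299.0],
    "P1_y": [6910.6, 4229.6, 1502.8],
})
clean = prepare_training_frame(raw, "P1_1,P1_2", "P1_y")
print(len(raw), "->", len(clean))
# clean.to_csv("data/P1_clean.csv", index=False)  # hypothetical output path
```

Dropping rows is only one option; imputing values is another, and the right choice depends on the process data.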

We will use the *Regression Optimization service* to first generate regression models from training data and then optimize the controllable variables to maximize output.

#### Load Training Dataset to Train the Regression Model

The dataset below represents historical data of a single process comprising 5 features (P1_1 through P1_5) and 1 target (P1_y).

In [4]:
#read dataset from a local file
import pandas as pd

datafile_name = 'data/P1.csv'
data_df = pd.read_csv(datafile_name)
data_df.head()

Unnamed: 0,P1_1,P1_2,P1_3,P1_4,P1_5,P1_y
0,3558.734445,4372.057772,4368.767882,0.496077,7654.29693,6910.577119
1,4443.402776,5204.925143,4666.803883,0.57404,4568.129063,4229.638464
2,4704.597316,5299.015293,3789.424103,0.822494,1429.750307,1502.815985
3,4641.848748,5343.532958,5290.086972,0.630075,8514.035452,7669.505625
4,3779.492033,4582.668164,3887.728128,0.535722,5134.960053,4698.800721


#### Specify the Regression Service Parameters

In [12]:
# specify attributes of the regression service call

model_id = 'model_P1_1_LR_SK_nb_example'
model_type = 'LR_SK'
input_vars='P1_1,P1_2,P1_3,P1_4,P1_5'
target_var= 'P1_y'

fields = {
        'model-id':  model_id,
        'target-variable': target_var,
        'input-variables': input_vars,
        'model-type': model_type,
        }

files = {
        'training-data':  ('training-data', open(datafile_name, 'rb'),'text/csv'),
        }

#### Submit the Regression Job

In [13]:
# try to delete the model with the same model-id in case it exists already
response = requests.delete(reg_job_url +  "/" + model_id, headers=headers)
#print(response.status_code)

# Invoke the POST to submit a new Regression model training job
response = requests.post(reg_job_url, data=fields, files=files, headers=headers)
pprint.pprint(response.json())

{'input-variables': 'P1_1,P1_2,P1_3,P1_4,P1_5',
 'model-id': 'model_P1_1_LR_SK_nb_example',
 'model-type': 'LR_SK',
 'queued-time': '2022-02-04T21:19:29.749517',
 'scaling-factor': 1.0,
 'status': 'queued',
 'target-variable': 'P1_y'}


#### Get the Status of the Training Job and Wait Until Finished

In [15]:
# Poll the training job until it finishes or the retry limit is reached
retries = 0
status = "queued"
while retries < 10 and status in ("queued", "running"):
    time.sleep(5)
    get_response = requests.get(reg_job_url + "/" + model_id, headers=headers)
    pprint.pprint(get_response.json())
    status = get_response.json()['status']
    retries += 1  # without this increment the retry limit is never reached

print(status)

{'duration': 0.025615,
 'end-time': '2022-02-04T21:23:15.335681',
 'finish-time': '2022-02-04T21:23:15.961673',
 'input-variables': 'P1_1,P1_2,P1_3,P1_4,P1_5',
 'model-id': 'model_P1_1_LR_SK_nb_example',
 'model-metadata': {'mean-absolute-error': 13.83381422041587,
                    'r2-score': 0.9999491246136101},
 'model-type': 'LR_SK',
 'output-files': ['model_P1_1_LR_SK_nb_example',
                  'model_P1_1_LR_SK_nb_example_stats'],
 'queued-time': '2022-02-04T21:19:29.749517',
 'scaling-factor': 1.0,
 'start-time': '2022-02-04T21:23:15.310066',
 'status': 'finished',
 'target-variable': 'P1_y'}
finished
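When several candidate model types are trained, the status responses can be compared to pick the one to use for optimization. The sketch below selects the finished job with the highest `r2-score` from `model-metadata`, the field shown in the response above. The helper `best_model` is hypothetical, the second response is made up for illustration, and whether the reported score is computed on held-out data is not stated in this notebook.

```python
def best_model(status_responses, metric="r2-score"):
    """Return the model-id of the finished job with the highest metric value."""
    finished = [
        r for r in status_responses
        if r.get("status") == "finished" and metric in r.get("model-metadata", {})
    ]
    if not finished:
        raise ValueError("No finished job reports " + metric)
    return max(finished, key=lambda r: r["model-metadata"][metric])["model-id"]


responses = [
    {   # values taken from the status response shown above
        "model-id": "model_P1_1_LR_SK_nb_example",
        "status": "finished",
        "model-metadata": {"mean-absolute-error": 13.83381422041587,
                           "r2-score": 0.9999491246136101},
    },
    {   # made-up response for a second, hypothetical candidate
        "model-id": "model_P1_1_RF_SK_hypothetical",
        "status": "finished",
        "model-metadata": {"mean-absolute-error": 120.0, "r2-score": 0.98},
    },
]
chosen = best_model(responses)
print(chosen)
```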


## Optimization Service Creation and Job Submission

Our AI-based optimization framework allows the embedding of process behavior information, derived from data-driven regression models, within run-time process and system-wide scale optimization models. The choice of the regression model type has implications for the complexity of the resulting optimization model. The novelty of our approach is the ability to efficiently solve the optimization problem for various types of regressors. 

We have developed several customized algorithms for the generalized network optimization model. For regression models such as feed-forward neural networks with rectified linear unit (ReLU) activation functions (model-type = "NN_SK") or tree-based ensemble models (model-type = "RegTree_SK" or "RF_SK"), we have shown that the optimization model reduces to a mixed-integer linear program (MILP), which can be solved using existing mature MILP solvers. For nonlinear optimization models resulting from complex deep neural networks or general black-box ensemble methods (model-type = "NN_KERAS"), we have developed a novel two-level augmented Lagrangian method.
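To illustrate why a ReLU network can be embedded in a MILP, consider a single neuron with pre-activation $a = w^\top x + b$ and output $y = \max(0, a)$. Assuming known finite bounds $L \le a \le U$ with $L < 0 < U$, a standard textbook big-M encoding with one binary variable $z$ is given below. This is an illustrative sketch under those assumptions; the exact formulation used by the service is not specified in this notebook.

$$
\begin{aligned}
a &= w^\top x + b, \\
y &\ge a, \qquad y \ge 0, \\
y &\le a - L\,(1 - z), \\
y &\le U z, \\
z &\in \{0, 1\}.
\end{aligned}
$$

With $z = 1$ the constraints force $y = a$ (and hence $a \ge 0$); with $z = 0$ they force $y = 0$ (and hence $a \le 0$). All constraints are linear in $(x, a, y, z)$, so stacking them over every neuron yields a MILP.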

The figure below shows the optimization technique ("optimization-type") to be used based on the regression model type ("model-type") chosen above in the regression service call.

<img src="figures/Picture2.jpg" alt="Drawing" style="width: 550px;"/>

Inputs to the optimization service include:
1. optimization-id: optimization ID
2. regression-model: regression model trained by the regression service above (referenced by its model-id)
3. optimization-type: type of optimization model to be used (refer to the figure above)
4. input-regression-config: see explanation below
5. output-regression-config: see explanation below
6. total-period: number of time periods for which set point recommendations are needed

#### Specify the Single Process Optimization Service Parameters

In [16]:
# Setup Single Process Optimization Variables
opt_id = "opt_P1_1_LR_SK_nb_example"
reg_model_for_opt = "model_P1_1_LR_SK_nb_example"
opt_type = 'MILP'
input_reg = 'configs/input_regression_cfg.csv'
output_reg = 'configs/output_regression_cfg.csv'
period="1"

fields = {
            'optimization-id': opt_id,
            'regression-model': reg_model_for_opt,
            'optimization-type': opt_type,
            'total-period': period
        }

files = {
        'input-regression-config':  ('input-regression-config', open(input_reg, 'rb'),'text/csv'),
        'output-regression-config':  ('output-regression-config', open(output_reg, 'rb'),'text/csv'),
        }


#### input-regression-config parameter (input_regression_cfg.csv)
This file describes the process ("plant"), the control and observed variables ("label"), their lower and upper bounds, their initial values, the maximum change allowed within one time period ("rate_change"), and the fixed values of the observed variables ("observed_value"). Please note that the plant ID has to begin with an uppercase "P" followed by a numeral (e.g., "P1"). The user may choose any label for the control and observed variables, but the labels need to match the variable names used for training the regression model.

In [17]:
input_reg_df = pd.read_csv(input_reg)
input_reg_df.head()

Unnamed: 0,plant,label,lower,upper,init,rate_change,observed_value
0,P1,P1_1,3500.0,5500.0,4000.0,0.5,
1,,P1_2,3500.0,5500.0,4500.0,0.5,
2,,P1_3,3500.0,5500.0,4000.0,0.5,
3,,P1_4,,,,,0.7
4,,P1_5,,,,,85.0
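A quick consistency check of this file before submission can catch simple mistakes. The sketch below rebuilds the table above as a DataFrame and verifies two rules that are assumptions of this example rather than documented service requirements: every control variable has lower, upper, and init values with init inside the bounds, and every observed variable only needs an observed value. The helper `check_input_config` is hypothetical.

```python
import pandas as pd


def check_input_config(cfg):
    """Return a list of human-readable problems found in the config frame."""
    problems = []
    for _, row in cfg.iterrows():
        label = row["label"]
        if pd.notna(row["observed_value"]):
            continue  # observed variable: fixed value, no bounds expected
        if row[["lower", "upper", "init"]].isna().any():
            problems.append(f"{label}: control variable needs lower, upper and init")
        elif not (row["lower"] <= row["init"] <= row["upper"]):
            problems.append(f"{label}: init outside [lower, upper]")
    return problems


# Same values as the table displayed above.
cfg = pd.DataFrame({
    "plant": ["P1", None, None, None, None],
    "label": ["P1_1", "P1_2", "P1_3", "P1_4", "P1_5"],
    "lower": [3500.0, 3500.0, 3500.0, None, None],
    "upper": [5500.0, 5500.0, 5500.0, None, None],
    "init": [4000.0, 4500.0, 4000.0, None, None],
    "rate_change": [0.5, 0.5, 0.5, None, None],
    "observed_value": [None, None, None, 0.7, 85.0],
})
print(check_input_config(cfg))
```

An empty list means no problem was found under these assumed rules.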


#### output-regression-config parameter (output_regression_cfg.csv)
This file describes the output (or outflow) variable of a process ("plant"), its label, its product ID (1, 2, 3, etc., if there are multiple outputs from a process), the lower and upper bounds for each product, the "model_type" and "model_name" used in the regression service above, and the "model_stats" generated by the regression service. Please note that the plant ID has to begin with an uppercase "P" followed by a numeral (e.g., "P1").

In [18]:
output_reg_df = pd.read_csv(output_reg)
output_reg_df.head()

Unnamed: 0,plant,label,product,lower,upper,model_type,model_name,model_stats
0,P1,bitumen,1,50,15000,LR_SK,model_P1_1_LR_SK_nb_example,model_P1_1_LR_SK_nb_example_stats


#### Submit the Optimization Job

In [19]:
# try to delete the optimization job with the same optimization-id in case it exists already
response = requests.delete(opt_job_url +  "/" + opt_id, headers=headers)
print(response.status_code)

# Submit Single Process Optimization Job
response = requests.post(opt_job_url, data=fields, files=files, headers=headers)
pprint.pprint(response.json())

400
{'optimization-id': 'opt_P1_1_LR_SK_nb_example',
 'optimization-type': 'MILP',
 'queued-time': '2022-02-04T21:42:05.520628',
 'regression-model': 'model_P1_1_LR_SK_nb_example',
 'status': 'queued',
 'total-periods': '1'}


#### Get the Status of the Optimization Job and Wait Until Finished

In [21]:
# Poll the optimization job until it finishes or the retry limit is reached
retries = 0
status = "queued"
while retries < 50 and status in ("queued", "running"):
    time.sleep(5)
    get_response = requests.get(opt_job_url + "/" + opt_id, headers=headers)
    pprint.pprint(get_response.json())
    status = get_response.json()['status']
    print(status)
    retries += 1  # without this increment the retry limit is never reached
print(status)


{'duration': 0.551591,
 'end-time': '2022-02-04T21:46:41.635804',
 'finish-time': '2022-02-04T21:46:42.803813',
 'optimization-id': 'opt_P1_1_LR_SK_nb_example',
 'optimization-type': 'MILP',
 'output-files': ['input_regression_solution.csv',
                  'output_regression_solution.csv',
                  'tank_level_solution.csv',
                  'flow_solution.csv'],
 'queued-time': '2022-02-04T21:42:05.520628',
 'regression-model': 'model_P1_1_LR_SK_nb_example',
 'start-time': '2022-02-04T21:46:41.084213',
 'status': 'finished',
 'total-periods': '1'}
finished
finished
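The same polling pattern is used for both the training and the optimization job, so it can be factored into a helper. The following is a minimal sketch; `wait_for_job` is a hypothetical helper, not part of the service, and the status source is simulated here so that the example runs without network access.

```python
import time


def wait_for_job(fetch_status, max_retries=50, delay=5.0, sleep=time.sleep):
    """Poll fetch_status() until the job leaves the 'queued'/'running' states.

    fetch_status is any zero-argument callable returning the status string.
    Raises TimeoutError if the job is still pending after max_retries polls.
    """
    for _ in range(max_retries):
        status = fetch_status()
        if status not in ("queued", "running"):
            return status
        sleep(delay)
    raise TimeoutError(f"job still pending after {max_retries} polls")


# Simulated job that finishes on the third poll.
fake_statuses = iter(["queued", "running", "finished"])
result = wait_for_job(lambda: next(fake_statuses), delay=0.0)
print(result)
```

With the service, `fetch_status` could be `lambda: requests.get(opt_job_url + "/" + opt_id, headers=headers).json()['status']`. Raising on timeout makes a stalled job visible instead of silently continuing with a pending status.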


### Outputs

1. input_regression_solution.csv: Columns ‘Plant’, ‘Label’, ‘Index’, and ‘Type’ describe the inputs of each process node. Columns ‘Period 1’, ..., ‘Period T’ contain the solution values of the inputs (decision variables) of each process node for every period. The ‘Index’ column is the feature index of the input within the regression function of the process given in the ‘Plant’ column, and the ‘Type’ column specifies the type of the input feature (primary or secondary).
2. output_regression_solution.csv: Columns ‘Plant’, ‘Label’, and ‘Product’ describe the outputs of each process node. Columns ‘Period 1’, ..., ‘Period T’ contain the solution values of the outputs (decision variables) of each process node for every period.

#### Input Regression Solution

In [22]:
from io import StringIO

# Download the input regression solution and load it into a DataFrame
response = requests.get(opt_job_url + "/" + opt_id + "/solution/input-regression", headers=headers)
print(response.status_code)
in_reg = StringIO(response.text)
in_df = pd.read_csv(in_reg, sep=",")
in_df.head()

200


Unnamed: 0,Plant,Label,Index,Type,Period 1
0,P1,P1_1,1,primary,5500.0
1,P1,P1_2,2,primary,5500.0
2,P1,P1_3,3,primary,5500.0
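The recommended set points can be compared against the bounds and initial values from the input configuration. The sketch below re-enters the values shown in this notebook by hand so that it runs on its own; the derived column names (`within_bounds`, `at_upper`, `change_from_init`) are introduced only for this illustration.

```python
import pandas as pd

# Set points from the solution table above.
solution = pd.DataFrame({
    "Label": ["P1_1", "P1_2", "P1_3"],
    "Period 1": [5500.0, 5500.0, 5500.0],
})
# Bounds and initial values of the control variables from the input config.
config = pd.DataFrame({
    "label": ["P1_1", "P1_2", "P1_3"],
    "lower": [3500.0, 3500.0, 3500.0],
    "upper": [5500.0, 5500.0, 5500.0],
    "init": [4000.0, 4500.0, 4000.0],
})

report = solution.merge(config, left_on="Label", right_on="label").drop(columns="label")
report["within_bounds"] = report["Period 1"].between(report["lower"], report["upper"])
report["at_upper"] = report["Period 1"] == report["upper"]
report["change_from_init"] = report["Period 1"] - report["init"]
print(report)
```

In this example all three control variables sit at their upper bound.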


#### Output Regression Solution

In [23]:
# Output Regression Solution
response = requests.get(opt_job_url +  "/" + opt_id + "/solution/output-regression", headers=headers)
print(response.status_code)
out_reg = StringIO(response.text)
out_df = pd.read_csv(out_reg, sep=",")
out_df.head()

200


Unnamed: 0,Plant,Label,Product,Period 1
0,P1,bitumen,1,382.29
