# Optimization Modeling and Relational Data

This notebook shows the relationship between mathematical models used in optimization and data models used to store and retrieve the data that populates a model instance. It illustrates, by means of an example, how the data structures of the OPL modeling language used in optimization can be constructed using SQL, focusing specifically on how to use Spark dataframes for this purpose.
<p>
In this notebook, you will learn how to set up the optimization problem using IBM's OPL modeling language and how to solve it using IBM's Decision Optimization on Cloud service. The notebook also shows you how to access data from a source in IBM's DSX Community and how to use Apache Spark to manage the data input to and output from the optimization service.
<p>
>This notebook is part of [IBM Decision Optimization on Cloud service with the Python Client](https://developer.ibm.com/docloud/documentation/docloud/python-api/).

>You will need a valid subscription to Decision Optimization on Cloud ([here](https://developer.ibm.com/docloud)). 

Some familiarity with Python is recommended. This notebook runs on Python 2 with Spark 2.0.

## Table of Contents

- [1 Introduction](#1-Introduction)
	- [1.1 The Tableau Form of an Optimization Model](#1.1-The-Tableau-Form-of-an-Optimization-Model)
- [2 An Example – Warehouse Location](#2-An-Example-–-Warehouse-Location)
	- [2.1 The Business Context](#2.1-The-Business-Context)
	- [2.2 The Application Data Model](#2.2-The-Application-Data-Model)
	- [2.3 The OPLCollector Class](#2.3-The-OPLCollector-Class)
	- [2.4 The Spark Data Model for the Warehouse Location Application](#2.4-The-Spark-Data-Model-for-the-Warehouse-Location-Application)
	- [2.5 The Data](#2.5-The-Data)
	- [2.6 Optimization Model](#2.6-Optimization-Model)
	- [2.7 Solving the Warehousing Model with IBM Decision Optimization on Cloud](#2.7-Solving-the-Warehousing-Model-with-IBM-Decision-Optimization-on-Cloud)
		- [2.7.1 Get Your Credentials for IBM Decision Optimization on Cloud](#2.7.1-Get-Your-Credentials-for-IBM-Decision-Optimization-on-Cloud)
		- [2.7.2 The Optimizer Class](#2.7.2-The-Optimizer-Class)
		- [2.7.3 Setting Up and Submitting the Solve Job](#2.7.3-Setting-Up-and-Submitting-the-Solve-Job)
		- [2.7.4 Retrieving the Optimal Solution](#2.7.4-Retrieving-the-Optimal-Solution)
- [3 Transforming an Optimization Problem to Tableau Form](#3-Transforming-an-Optimization-Problem-to-Tableau-Form)
	- [3.1 Using Relational Database Operations to Reshape the Instance Data](#3.1-Using-Relational-Database-Operations-to-Reshape-the-Instance-Data)
		- [3.1.1 The Tableau Data Model](#3.1.1-The-Tableau-Data-Model)
		- [3.1.2 The Transformation Data Model](#3.1.2-The-Transformation-Data-Model)
	- [3.2 Encoding the Decision Variables](#3.2-Encoding-the-Decision-Variables)
	- [3.3 Encoding the Constraints and Decision Expressions](#3.3-Encoding-the-Constraints-and-Decision-Expressions)
	- [3.4 Reshaping the Coefficient Data into the Tableau](#3.4-Reshaping-the-Coefficient-Data-into-the-Tableau)
	- [3.5 Creating the Tableau Input Data](#3.5-Creating-the-Tableau-Input-Data)
	- [3.6 Solving the Tableau Model](#3.6-Solving-the-Tableau-Model)
	- [3.7 Recovering the Warehousing Solution](#3.7-Recovering-the-Warehousing-Solution)
- [4 Conclusion: Equivalence of Tuple Slicing and SQL](#4-Conclusion:-Equivalence-of-Tuple-Slicing-and-SQL)
- [Author](#Author)
- [References](#References)

## 1 Introduction

The purpose of this notebook is to explicate the relationship between mathematical models used in optimization and data models used to store and retrieve the data that populates a model instance. 
Business solutions based on predictive analytics typically have a three-layer architecture:

In [1]:
figure1= """
                                         Extract,
                                         Validate,                 Visualize,
                                         Transform                 Interact
                        +---------------+         +---------------+        +---------------+
                        |               +-------> +               +------> +               |
                        |  Enterprise   |         |  Application  |        |   Business    |
                        |  Data         |         |  Layer        |        |   Interface   |
                        |               + <-------+               + <------+               |
                        +---------------+         +----+-----+----+        +---------------+
                                                       |     ^
                                                       |     |
                                                       |     |  Application Data Model
                                                       |     |
                                                       v     |
                                                  +----+-----+----+
                                                  |               |
                                                  |   Modeling    |
                                                  |   Layer       |
                                                  |               |
                                                  +----+-----+----+
                                                       |     ^
                                                       |     |
                                                       |     |  Tableau Data Model
                                                       |     |
                                                       v     |
                                                  +----+-----+----+
                                                  |               |
                                                  |    Solving    |
                                                  |    Layer      |
                                                  |               |
                                                  +---------------+
"""

The application layer provides the business logic of the solution, encompassing such things as data management (exchange with enterprise sources, validation, and transformation), business process workflow, interaction with business users, and so forth. The modeling layer represents the underlying business problem mathematically. The solving layer executes the optimization algorithms on the mathematical model. This paper focuses on the modeling layer and its relationship with the solving layer.

Given the large sizes of data sets used in optimization and the large-scale capabilities of modern mathematical programming solvers, a primary objective of the modeling layer is to move data efficiently and quickly between external databases and the internal data structures of the solver. For the purposes of this discussion, it is assumed that the data exists in the form of relational tables and that the solver works with a (sparse) matrix representation of the optimization problem.

The general form of the optimization problem is linear:

\begin{align}
&\min z=\sum_{j\in J} c_{j}x_{j}\\
&\text{subject to:}\\
&\sum_{j\in J}a_{ij}x_{j}\leq b_{i} \quad \forall i\in I\\
&l_{j}\leq x_{j}\leq u_{j} \quad \forall j\in J
\end{align}


The formulation could add boolean and integer variables, as will be discussed below, and it could include special structure constraints, such as generalized upper bounds, but these are not particularly germane to the discussion. Adding non-linear constraints, however, raises the issue of how to represent the non-linear functions, which is beyond the scope of this discussion.

The fundamental data structure underlying this optimization problem is the *simplex tableau*, which consists of the coefficient matrix $a_{ij}$, the cost and right-hand side (*rhs*) vectors $c_{j}$ and $b_{i}$, and the lower and upper bound vectors $l_{j}$ and $u_{j}$. In most real-world applications of decision optimization, the tableau is very sparse; that is, very few of its coefficients are non-zero, often fewer than 1%. The algorithms for solving the optimization are specifically structured to take advantage of this sparsity. Thus, only entries for which $a_{ij}\neq 0$ are explicitly represented. Ordinarily, the indices $i$ and $j$ are taken to be integers and the index sets $I$ and $J$ are taken to be sets of integers. This paper will call this formulation the *tableau* representation, in homage to George Dantzig, the inventor of linear programming and of the Simplex algorithm for solving linear programs. (See [References](#References) below. Dantzig himself credits the economist Quesnay with coining the term *tableau* in his 1759 book.) With integer indices, the transformation of the model into the internal matrix representation for the solver is straightforward; each index $j$ corresponds to a column and each $i$ corresponds to a row. However, integer indexing is often too restrictive in real-world optimization problems. Most large optimization models have a great deal of structure, with multiple classes of decision variables and constraints, each of which has its own indexing scheme. The modeling layer then needs to handle the mapping between the index sets and the rows and columns of the tableau representation.
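The mapping between named indices and tableau rows and columns can be sketched in a few lines of plain Python. This is only an illustration under assumed data: the constraint and variable names and the `index_map` helper are hypothetical and do not come from the warehousing model.

```python
# Hypothetical toy data: only the non-zero coefficients a_ij are stored,
# keyed by (constraint name, variable name) rather than by integer position.
entries = {
    ("capacity", "ship_A"): 1.0,
    ("capacity", "ship_B"): 1.0,
    ("demand", "ship_A"): 1.0,
    ("budget", "open_W"): 250.0,
}


def index_map(names):
    """Assign consecutive integers to the distinct names, in sorted order."""
    return {name: position for position, name in enumerate(sorted(set(names)))}


row_of = index_map(row for row, _ in entries)   # constraint name -> row number
col_of = index_map(col for _, col in entries)   # variable name -> column number

# Coordinate (row, column, value) triplets, a common layout for sparse matrices
triplets = sorted((row_of[r], col_of[c], a) for (r, c), a in entries.items())

# Fraction of the full matrix that is actually stored
density = len(entries) / float(len(row_of) * len(col_of))
```

Here four stored entries stand in for a 3 x 3 matrix; in a realistic model the stored fraction would be far smaller.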

Authors of optimization models have long recognized the value of using a more general formulation of an optimization problem. In the more general representation, the indices *i* and *j* are taken to be tuples. A *tuple* is a multicomponent data structure where each component, or *field*, is a simple data type, such as an integer or a character string. A subset of the fields of the tuple that uniquely identifies it is called the *key*; the data in the key fields must be discrete. The data in the non-key fields, on the other hand, may be continuous, e.g. a floating point number. (Note that, while there are similarities between the tuples discussed in this paper and the similarly named Python objects, it is important to keep the two distinct.) Here is an example of a tuple that might arise in a network flow optimization:

<code>tuple Route {
 key string location;
 key string store;
 float shippingCost;	// $/pallet
}</code>

The preceding example is written in the well-known Optimization Programming Language, or *OPL*, one of a number of specialized languages for expressing optimization problems (see __[IBM Knowledge Center](http://www.ibm.com/support/knowledgecenter/SSSA5P_12.6.1/ilog.odms.ide.help/OPL_Studio/maps/groupings/opl_Language.html)__). Henceforth, further examples will be displayed in OPL, rather than in mathematical notation. Using OPL has the advantage of producing executable code; how to call OPL from Python will be discussed in section 2.7 below. 
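For readers more at home in Python, the idea of a keyed tuple can be approximated as follows. This is a rough analogy rather than a description of how OPL works internally: the `ROUTE_KEY`, `key_of`, and `to_tuple_set` names and the sample locations are assumptions made for illustration.

```python
from collections import namedtuple

# A record type with the same fields as the OPL Route tuple
Route = namedtuple("Route", ["location", "store", "shippingCost"])
ROUTE_KEY = ("location", "store")  # the fields marked `key` in the OPL declaration


def key_of(record, key_fields=ROUTE_KEY):
    """Extract the key fields of a record as a plain Python tuple."""
    return tuple(getattr(record, field) for field in key_fields)


def to_tuple_set(records):
    """Index records by key, rejecting two records that share the same key."""
    table = {}
    for record in records:
        k = key_of(record)
        if k in table:
            raise ValueError("duplicate key: %r" % (k,))
        table[k] = record
    return table


# Hypothetical sample data
routes = to_tuple_set([
    Route("Brockton", "Store1", 12.5),
    Route("Brockton", "Store2", 7.0),
    Route("Bristol", "Store1", 9.25),
])
cost = routes[("Brockton", "Store2")].shippingCost
```

The key fields play the role of the index, while the non-key field `shippingCost` carries the continuous data.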

### 1.1 The Tableau Form of an Optimization Model

In OPL, the general optimization problem has the following form:

In [2]:
tableau_data_model = '''
 //Defines a column for a boolean variable
 tuple BooleanColumn {
    key string variable;    //name = variable+index
    int lower;              //lower bound (always 0)
    int upper;              //upper bound (always 1)
    int value;              //optimal value (output only)
 }
 
 //Defines a column for an integer variable
 tuple IntegerColumn {
    key string variable;    //name = variable+index
    int lower;              //lower bound
    int upper;              //upper bound
    int value;              //optimal value (output only)
 }
 
 //Defines a column for a continuous variable
 tuple FloatColumn {
    key string variable;    //name = variable+index
    float lower;            //lower bound
    float upper;            //upper bound
    float value;            //optimal value (output only)
 }
 
 //Defines a row for a constraint or a decision expression
 tuple Row {
    key string cnstraint;   //name = constraint+index  
    string sense;           //GE(>=), EQ(==), LE(<=) or dexpr (must be dexpr for a decision expression)
    float rhs;              //right-hand side term (must be zero for a decision expression)
 }
  
 //Defines a coefficient at a specific row and column
 tuple Entry {
    key string cnstraint;   //name = constraint+index   (can also be used for a decision expression) 
    key string variable;    //name = variable+index  
    float coefficient;      //coefficient
 }
 
 //Defines a decision expression value
 tuple Objective {
    key string name;        //must correspond to the name of one of the decision expressions
    string sense;           //minimize or maximize
    float value;            //optimal decision expression value (output only)
 }
 '''
tableau_inputs = '''
 {BooleanColumn} 	booleanColumns= ...;
 {IntegerColumn} 	integerColumns=  ...; 
 {FloatColumn} 		floatColumns=  ...; 
 {Row} 				rows= ...; 
 {Entry} 			entries= ...;  
 {Objective}		objectives= ...;	//a singleton tuple set designating the decision expression to use as the objective function
 float				objectiveSense= (first(objectives).sense=="maximize" ? 1.0 : -1.0);
 {Row} 				rows_dexpr= {i| i in rows: i.sense=="dexpr"};
 '''
 
tableau_optimization_problem='''
 dvar boolean	x[booleanColumns];
 dvar int		y[j in integerColumns]	in j.lower..j.upper;
 dvar float		z[j in floatColumns]	in j.lower..j.upper;
 
 dexpr float v[i in rows_dexpr]= 			 	
 				  sum(j in booleanColumns,   t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable) t.coefficient*x[j] 
 				+ sum(j in integerColumns,   t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable) t.coefficient*y[j] 
			 	+ sum(j in floatColumns,     t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable) t.coefficient*z[j];

 dexpr float obj= sum(i in rows_dexpr: i.cnstraint==first(objectives).name)v[i]; //selects the decision expression designated as the objective function

 constraint ct[rows];

 maximize obj*objectiveSense;
 subject to {

 forall(i in rows)
   ct[i]:	if(i.sense=="GE")   
	   			sum(j in booleanColumns,	t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*x[j] +
	   			sum(j in integerColumns,	t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*y[j] +
	   			sum(j in floatColumns,		t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*z[j]
	   			>= i.rhs;
   			else if(i.sense=="EQ")
	   			sum(j in booleanColumns,	t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*x[j] +
	   			sum(j in integerColumns,	t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*y[j] +
	   			sum(j in floatColumns, 	 	t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*z[j]
	   			== i.rhs;
   			else if(i.sense=="LE")
	   			sum(j in booleanColumns,	t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*x[j] +
	   			sum(j in integerColumns,	t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*y[j] +
	   			sum(j in floatColumns,		t in entries: t.cnstraint==i.cnstraint && t.variable==j.variable)  t.coefficient*z[j]
	   			<= i.rhs;
				
 }
 '''
tableau_outputs='''
 {BooleanColumn}	booleanDecisions=		{<j.variable, j.lower, j.upper, x[j]> | j in booleanColumns};
 {IntegerColumn}	integerDecisions=		{<j.variable, j.lower, j.upper, y[j]> | j in integerColumns};
 {FloatColumn}		floatDecisions=			{<j.variable, j.lower, j.upper, z[j]> | j in floatColumns};
 {Objective}		optimalObjectives=		{<i.cnstraint, "", v[i]> | i in rows_dexpr};
 '''

This formulation is more general than the mathematics shown in section 1, in the following ways:
- allows decision variables of types <code>boolean</code> ("yes" or "no"), <code>int</code> (integer), and <code>float</code> (double precision)
- allows constraints with senses less than or equal to, equal to, and greater than or equal to
- allows multiple decision expressions, one of which serves as the objective function
- allows either minimization or maximization of the objective function

This formulation takes a modest step towards the tuple representation, using <code>BooleanColumn</code>, <code>IntegerColumn</code>, and <code>FloatColumn</code> to associate an index string with each variable in the general optimization problem, and similarly for the constraints. More importantly, the tuple representation of the tableau model is sparse, requiring an <code>Entry</code> only where the <code>coefficient</code> field is non-zero. OPL permits representing a matrix as <code>float a[Row][Column]</code>, but this representation would entail reading an entry for every row and column, a much larger dataset when, typically, fewer than 10% of the entries are non-zero.
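The way the tableau model combines `Entry` tuples with rows and columns can be mimicked in plain Python. The sketch below is illustrative only: the instance data, the candidate variable values, and the helper names (`row_activity`, `COMPARE`) are assumptions, and no solver is involved.

```python
# Hypothetical instance, with field order following the OPL tuples above
entries = [  # (cnstraint, variable, coefficient), non-zero coefficients only
    ("supply", "x1", 1.0),
    ("supply", "x2", 1.0),
    ("cost", "x1", 3.0),
    ("cost", "x2", 5.0),
]
rows = [  # (cnstraint, sense, rhs)
    ("supply", "GE", 4.0),
    ("cost", "dexpr", 0.0),
]
objective = ("cost", "minimize")        # (name, sense)
values = {"x1": 1.0, "x2": 3.0}         # candidate values for the variables


def row_activity(cnstraint):
    """Sum coefficient * value over the entries that belong to one row."""
    return sum(coef * values[var] for c, var, coef in entries if c == cnstraint)


COMPARE = {
    "GE": lambda lhs, rhs: lhs >= rhs,
    "EQ": lambda lhs, rhs: lhs == rhs,
    "LE": lambda lhs, rhs: lhs <= rhs,
}

# Constraint rows are checked against their right-hand sides
satisfied = {name: COMPARE[sense](row_activity(name), rhs)
             for name, sense, rhs in rows if sense != "dexpr"}

# Decision-expression rows are only evaluated
expressions = {name: row_activity(name)
               for name, sense, rhs in rows if sense == "dexpr"}

# Same convention as objectiveSense: always maximize, flipping the sign to minimize
objective_sense = 1.0 if objective[1] == "maximize" else -1.0
maximized_value = expressions[objective[0]] * objective_sense
```

Filtering `entries` by row name plays the part of the `t.cnstraint==i.cnstraint` condition in the OPL sums, and the sign flip shows how a single `maximize` statement covers both objective senses.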

Note that the solution takes the form of arrays rather than tuple sets:
<code>
 dvar boolean   x[booleanColumns];
 dvar int       y[j in integerColumns]  in j.lower..j.upper;
 dvar float     z[j in floatColumns]    in j.lower..j.upper;
</code>

Thus it is necessary to reshape the solution into tuple sets as shown in <code>tableau_outputs</code>.
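In Python terms, this reshaping amounts to copying each column record with its solved value filled in. The sketch below assumes a hypothetical solution dictionary `x` keyed by variable name; the variable names and values are invented for illustration.

```python
from collections import namedtuple

# Same fields as the OPL BooleanColumn tuple
BooleanColumn = namedtuple("BooleanColumn", ["variable", "lower", "upper", "value"])

# Input columns: the value field is not yet known
boolean_columns = [
    BooleanColumn("open_A", 0, 1, None),
    BooleanColumn("open_B", 0, 1, None),
]

# Hypothetical solver output: an array indexed by variable
x = {"open_A": 1, "open_B": 0}

# Rebuild the tuple set with the optimal values filled in
boolean_decisions = [col._replace(value=x[col.variable]) for col in boolean_columns]
```

Each output record keeps the key and bounds of its input column, so the result can be joined back to the application data by variable name.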

It is important to realize that the tableau model represents *any* linear mathematical optimization model (possibly with extensions to incorporate special structure constraints such as generalized upper bounds and others). It actually embodies the low-level interface to the solver and, thus, is properly part of the solving layer rather than the modeling layer. The remainder of this paper focuses on how the modeling layer translates a specific model instance into the generic tableau representation.

## 2 An Example &ndash; Warehouse Location

This notebook uses an example to make the concepts discussed more concrete. This example is more fully explored in the notebook *Locating Warehouses to Minimize Costs Case 1*.

### 2.1 The Business Context

A consumer packaged goods supplier needs to decide where to locate its warehouses to serve a set of retail stores at different locations. At the same time, it also needs to determine how much capacity each warehouse should have. The cost of opening a warehouse has a fixed component, related to the acquisition of land and designing the facility, and a variable component proportional to the capacity of the warehouse. The cost to ship the goods from a warehouse to a store depends on the distance between them. The objective is to minimize the cost of opening the warehouses and shipping the goods. Such an optimization application would typically be used as part of an annual planning process in which the company’s management would decide on sales targets and the capital investments needed to support them.

### 2.2 The Application Data Model

The application data model is the schema of the data input to and output from the optimization. It is typically realized in several forms: as the table schema of a relational database system, as the tuple structure of the optimization model, or as a set of classes in a programming language. Here is the OPL representation of the application data model for the Warehousing optimization model:

In [3]:
warehousing_data_dotmod = '''
 //Input data
 
 tuple Warehouse {
 	key string location;
 	float fixedCost;	// $/yr
 	float capacityCost;	// $/pallet/yr
 }
 
 tuple Store {
 	key string storeId; 
 }
 
 tuple Route {
 	key string location;
 	key string store;
 	float shippingCost;	// $/pallet
 }
 
 //Note: the mapCoordinates table is not used in the optimization and so is not sent to the optimizer

 tuple Demand {
 	key string store;
 	key string scenarioId;
 	float amount;		// pallets/period
 }
 
  tuple Scenario {
 	key string id;
 	float totalDemand;
 	float periods; 	//the number of periods per year during which this scenario prevails; periods = scenario probability * total periods/year
 }
 
 //Output data
 
  tuple Objective {
	key string problem;
 	key string dExpr;
 	key string scenarioId;
	key int iteration;
	float value;   
 }
 
 tuple Shipment {
  	key string location;
 	key string store;
 	key string scenarioId;
 	key int iteration;
 	float amount; 
 }
 
 tuple OpenWarehouse {
 	key string location;
 	key string scenarioId;
 	key int iteration;
 	int open;
 	float capacity;		// pallets
 }
 '''
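To make the "several forms" of an application data model concrete, part of the same model can be written as a Python dictionary that maps table names to schemas. This is a simplified stand-in that uses plain Python types instead of Spark `StructType` objects; the table names `warehouses` and `routes` and the `conforms` helper are assumptions for illustration.

```python
# Hypothetical stand-in for table schemas: ordered (field name, Python type) pairs
adm = {
    "warehouses": [("location", str), ("fixedCost", float), ("capacityCost", float)],
    "routes": [("location", str), ("store", str), ("shippingCost", float)],
}


def conforms(table_name, row):
    """Check that a row has the right number of fields and the right field types."""
    schema = adm[table_name]
    return (len(row) == len(schema)
            and all(isinstance(value, field_type)
                    for value, (_, field_type) in zip(row, schema)))


ok = conforms("routes", ("Brockton", "Store1", 12.5))
too_short = conforms("warehouses", ("Brockton", 1000.0))
```

Keeping the schemas in a dictionary means that a generic container can be configured for a new application by supplying data rather than by writing new classes.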

### 2.3 The OPLCollector Class

In order to work with the application data, we use a couple of objects (one for input, the other for output) of a class called the <code>OPLCollector</code>. These objects hold the data as Spark datasets. 
<p>
The <code>OPLCollector</code> class itself does not require customization for each application. Instead, it is configured by specifying the schemas of the tables it contains, using a builder method, as will be shown below. The schemas themselves are instances of the Spark <code>StructType</code> class. The design of the <code>OPLCollector</code> class minimizes the amount of custom coding required to build an optimization-based application. (Note that, despite its name, <code>OPLCollector</code> has no dependence on the OPL modeling language and can be used with the DOCplex Python modeling language as well.) Here is the Python code for <code>OPLCollector</code> and some related functions (go to edit mode to see the contents of this hidden cell).

In [4]:
# @hidden_cell

'''
Created on Feb 8, 2017
@author: bloomj
'''
import sys
import os
import json
import requests

try:
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession, Row, functions
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
except ImportError as e:
    print ("Error importing Spark Modules", e)
    sys.exit(1)

SPARK_CONTEXT = sc # sc is predefined
SQL_CONTEXT = sqlContext  # sqlContext is predefined
SPARK_SESSION = SparkSession.builder.config("spark.sql.crossJoin.enabled", "true").getOrCreate()


class OPLCollector(object):
    '''
    Represents an OPL data model in Spark.
    Note: Use of this class does not depend on OPL, and in particular, it can be used with the DOcplex Python API.
    An application data model (ADM) consists of a set of tables (OPL Tuplesets), each with its own schema.
    An ADM is represented by a dictionary in which the keys are the table names and the values are the table schemas.
    A builder is provided to create the ADM.

    The OPLCollector holds the actual data in Spark Datasets. There are several ways to populate
    the data.
    - Spark SQL operations can transform tables into other tables.
    - A builder is provided when the data is generated programmatically.
    - JSON deserialization and serialization are provided when data is exchanged with external applications or stores.

    The design of the OPLCollector class aims to reduce the amount of data that must be
    manipulated outside of Spark. Where possible, data is streamed among applications without
    creating auxiliary in-memory structures or files.

    The design of OPLCollector also aims to minimize the amount
    of custom coding required to build an application. Collectors are configured
    by specifying their schemas through builders rather than by extending with subclasses.
    '''

    def __init__(self, collectorName, applicationDataModel=None, sparkData=None):
        '''
        Creates a new OPLCollector instance.

        :param collectorName: the name of this collector.
        :type collectorName: String
        :param applicationDataModel: holds the table schemas for this collector. Each schema is a Spark StructType.
        Note that each collector has one and only one application data model.
        :type applicationDataModel: dict<String, StructType>
        :param sparkData: holds the actual data tables of this collector as a set of Spark datasets.
        :type sparkData: dict<String, Dataframe>
        '''
        self.name = collectorName
        # Create fresh dicts here: a mutable default argument would be shared by every instance
        self.applicationDataModel = applicationDataModel if applicationDataModel is not None else {}
        self.sparkData = sparkData if sparkData is not None else {}
        self.size = {name: None for name in self.applicationDataModel.keys()}
        self.jsonDestination = None
        self.jsonSource = None

    def copy(self, name, *tables):
        """
        Creates a new OPLCollector instance with copies of the application data model and Spark datasets of this collector.
        The ADM copy is immutable. The Spark datasets themselves are immutable, but the copy supports the addTable, addData, and replaceTable methods.
        Does not copy the JSONSource or JSONDestination fields.

        :param name: name of the new collector
        :param tables: names of the tables to be copied (all tables in this collector, if absent)
        :return: a new OPLCollector instance
        """
        # Python has no method overloading (a second def would replace the first),
        # so one method handles both the copy-all and the copy-selected cases.
        if not tables:
            result = OPLCollector(name, self.applicationDataModel, self.sparkData.copy())
            result.size = self.size.copy()
            return result
        result = OPLCollector(name)
        admBuilder = ADMBuilder(result)
        for table in tables:
            admBuilder.addSchema(table, self.getSchema(table))
        admBuilder.build()
        dataBuilder = DataBuilder(result.applicationDataModel, collector=result)
        for table in tables:
            dataBuilder.addTable(table, self.getTable(table))
            result.size[table] = self.size[table]
        dataBuilder.build()
        return result

    def getName(self):
        """
        Returns the name of this collector.

        :return collector name as a string
        """
        return self.name

    def addTables(self, other):
        """
        Adds a set of tables of data from another collector.
        An individual table can be set only once.

        :param other: another collector
        :type other: OPLCollector
        :raise ValueError: if the other ADM is empty or if a table name duplicates a name already present in this collector.
        """

        if not other.applicationDataModel:  # is empty
            raise ValueError("empty collector")
        for tableName in other.applicationDataModel:  # iterating a dict yields its keys
            if tableName in self.applicationDataModel:
                raise ValueError("table " + tableName + " has already been defined")
        self.applicationDataModel.update(other.applicationDataModel)
        self.sparkData.update(other.sparkData)
        self.size.update(other.size)
        return self

    def replaceTable(self, tableName, table, size=None):
        """
        Replaces an individual table of data.

        :param tableName:
        :type String
        :param table:
        :type Spark Dataframe
        :param size: number of rows in table (None if omitted)
        :return: this collector
        :raise ValueError: if the table is not already defined in the ADM
        """
        if tableName not in self.applicationDataModel:
            raise ValueError("table " + tableName + " has not been defined")
        self.sparkData[tableName] = table
        if size is not None:
            self.size[tableName] = size
        else:
            self.size[tableName] = table.count()
        return self

    def addData(self, tableName, table, size=None):
        """
        Adds data to an existing table.
        Use when a table has several input sources.
        Does not deduplicate the data (i.e. allows duplicate rows).

        :param tableName:
        :type String
        :param table:
        :type Spark Dataframe
        :param size: number of rows in table (None if omitted)
        :return: this collector
        :raise ValueError: if the table is not already defined in the ADM
        """
        if tableName not in self.applicationDataModel:
            raise ValueError("table " + tableName + " has not been defined")
        self.sparkData[tableName] = self.sparkData[tableName].union(table)
        count = (self.size[tableName] + size) if (self.size[tableName] is not None and size is not None) else None
        self.size[tableName] = count
        return self

    #NEW
    def getADM(self):
        """
        Exposes the application data model for this OPLCollector.
        The ADM is represented by a map in which the keys are the table names
        and the values are the table schemas held in Spark StructType objects.

        :return: the application data model
        :rtype: dict<String, StructType>
        """
        return self.applicationDataModel

    def setADM(self, applicationDataModel):
        """
        Sets the application data model for this OPLCollector.
        The ADM cannot be changed once set.
        """
        if (self.applicationDataModel):  # is not empty or None
            raise ValueError("ADM has already been defined")
        self.applicationDataModel = applicationDataModel
        return self

    def getTable(self, tableName):
        return self.sparkData[tableName]

    def getSchema(self, tableName):
        return self.applicationDataModel[tableName]

    def selectSchemas(self, *tableNames):
        """
        Returns a subset of the application data model.
        """
        return {tableName: self.applicationDataModel[tableName] for tableName in tableNames}

    def selectTables(self, collectorName, *tableNames):
        """
        Creates a new OPLCollector from a subset of the tables in this collector.
        The tables in the new collector are copies of the tables in the original.
        """
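        adm = self.selectSchemas(*tableNames)  # unpack the names; selectSchemas takes varargs
        # Copy each table from its underlying RDD (createDataFrame expects an RDD, a list, or a pandas DataFrame)
        data = {tableName: SPARK_SESSION.createDataFrame(self.sparkData[tableName].rdd, self.getSchema(tableName))
                for tableName in tableNames}
        result = OPLCollector(collectorName, adm, data)
        # The constructor takes no size argument, so carry over the known row counts afterwards
        result.size = {tableName: self.size[tableName] for tableName in tableNames}
        return result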

    def getSize(self, tableName):
        """
        Returns the number of rows in a table.
        Note: the Spark data set count method is fairly expensive,
        so it is used only if there is no other way to count the number of rows.
        It is best to count the rows as the table is being deserialized, as is done in the fromJSON method.
        Once counted, the number is stored in the size map for future use.
        """
        if tableName not in self.size:
            raise ValueError("size not defined for table " + tableName)
        if self.size[tableName] is None:
            self.size[tableName] = self.sparkData[tableName].count()
        return self.size[tableName]

    def buildADM(self):
        """
        Creates the application data model for this collector
        """
        if (self.applicationDataModel):  # is not empty
            raise ValueError("application data model has already been defined")
        return ADMBuilder(self)

    def buildData(self):
        """
        Creates a builder for the data tables for this collector.
        Uses this collector's application data model.

        :return: a new DataBuilder instance
        :raise ValueError: if the application data model has not been defined or if data tables have already been loaded
        """
        if not self.applicationDataModel:  # is empty
            raise ValueError("application data model has not been defined")
        if self.sparkData:  # is not empty
            raise ValueError("data tables have already been loaded")
        return DataBuilder(self.applicationDataModel, collector=self)

    def setJsonSource(self, source):
        """
        Sets the source for the JSON text that populates the collector.
        There is a one-to-one correspondence between an OPLCollector instance and its JSON representation;
        that is, the JSON source file must fully include all the data tables to be populated in the collector instance.
        Thus, it makes no sense to have more than one JSON source for a collector or to change JSON sources.

        :param source: a file-like object containing the JSON text.
        :return: this collector instance
        :raise ValueError: if JSON source has already been set
        """
        if self.jsonSource is not None:
            raise ValueError("JSON source has already been set")
        self.jsonSource = source
        return self

    #REVISED
    def fromJSON(self):
        """
        Provides a means to create a collector from JSON.
        You must first set the source (a file-like object containing the JSON text) from which the JSON will be read.
        Then you call the deserializer fromJSON method.
        The application data model for the collector must already have been created.

        There is a one-to-one correspondence between an OPLCollector instance and its JSON representation;
        that is, the JSON source file must fully include all the data tables to be populated in the collector instance.
        Methods are provided to merge two collectors with separate JSON sources (addTables),
        add a data set to a collector (addTable), and to add data from a data set to an existing table in a collector.

        :return: this collector with its data tables filled
        :raise ValueError: if the data tables have already been loaded
        """
        if self.sparkData:  # is not empty
            raise ValueError("data tables have already been loaded")
        # data: dict {tableName_0: [{fieldName_0: fieldValue_0, ...}, ...], ...}
        data = json.load(self.jsonSource)
        builder = self.buildData()
        for tableName, tableData in data.viewitems():
            count = len(tableData)
            tableRows = (Row(**fields) for fields in tableData)
            builder = builder.addTable(tableName,
                                       SPARK_SESSION.createDataFrame(tableRows, self.getADM()[tableName]),
                                       count)   # would like to count the rows as they are read instead,
                                                # but don't see how
        builder.build()
        return self

    def setJsonDestination(self, destination):
        """
        Sets the destination for the JSON serialization.
        Replaces an existing destination if one has been set previously.

        :param destination: an output string, stream, file, or URL
        :return: this collector
        """
        self.jsonDestination = destination
        return self

    #REVISED
    def toJSON(self):
        """
        Provides a means to write the application data as JSON.
        You must first set the destination (an output stream, file, url, or string) where the JSON will be written.
        Then you call the serializer toJSON method.
        """
        self.jsonDestination.write("{\n")                           # start collector object
        firstTable = True
        for tableName in self.sparkData:
            if not firstTable:
                self.jsonDestination.write(',\n')
            else:
                firstTable = False
            self.jsonDestination.write('"' + tableName + '" : [\n') # start table list
            firstRow = True
            for row in self.sparkData[tableName].toJSON().collect():# better to use toLocalIterator() but it gives a timeout error
                if not firstRow:
                    self.jsonDestination.write(",\n")
                else:
                    firstRow= False
                self.jsonDestination.write(row)                     # write row object
            
            self.jsonDestination.write("\n]")                       # end table list
        self.jsonDestination.write("\n}")                           # end collector object

    #REVISED
    def displayTable(self, tableName, out=sys.stdout):
        """
        Prints the contents of a table.

        :param out: a file or other print destination where the table will be written
        """
        out.write("collector: " + self.getName() + "\n")
        out.write("table: " + tableName + "\n")
        self.getTable(tableName).show(self.getSize(tableName), truncate=False)

    # REVISED
    def display(self, out=sys.stdout):
        """
        Prints the contents of all tables in this collector.

        :param out: a file or other print destination where the tables will be written
        """
        for tableName in self.sparkData:
            self.displayTable(tableName, out=out)


# end class OPLCollector

def getFromObjectStorage(credentials, container=None, filename=None):
    """
    Returns a stream containing a file's content from Bluemix Object Storage.

    :param credentials: a dict generated by the Insert to Code service of the host Notebook
    :param container: the name of the container as specified in the credentials (defaults to the credentials entry)
    :param filename: the name of the file to be accessed (note: if there is more than one file in the container,
    you might prefer to enter the names directly; otherwise, defaults to the credentials entry)
    """

    if not container:
        container = credentials['container']
    if not filename:
        filename = credentials['filename']

    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
                                  'password': {
                                      'user': {'name': credentials['username'], 'domain': {'id': credentials['domain_id']},
                                               'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if (e1['type'] == 'object-store'):
            for e2 in e1['endpoints']:
                if (e2['interface'] == 'public' and e2['region'] == credentials['region']):
                    url2 = ''.join([e2['url'], '/', container, '/', filename])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2, stream=True)
    return resp2.raw


class DataBuilder(object):
    """
        Builds the Spark datasets to hold the application data.
        Used when the data are created programmatically.
    """

    def __init__(self, applicationDataModel, collector=None):
        """
        Creates a builder for loading the Spark datasets.

        :param applicationDataModel
        :param collector: if present, loads the data tables and their sizes directly into the collector;
        if not present or null, the Spark data dict is returned directly
        :return: a new DataBuilder instance
        :raise ValueError: if the application data model has not been defined
        """
        if not applicationDataModel:  # is empty
            raise ValueError("application data model has not been defined")
        self.applicationDataModel = applicationDataModel
        self.collector = collector
        self.result = {}
        self.length = {}

    def addTable(self, tableName, data, size=None):
        """
        Get the external data and create the corresponding application dataset.
        Assumes that the schema of this table is already present in the ADM.

        :param data: a Spark dataset
        :param size: number of rows in the table (None if omitted)
        :return this builder instance
        :raise ValueError: if the table is not included in the ADM or if the table has already been loaded
        """
        if tableName not in self.applicationDataModel:
            raise ValueError("table " + tableName + " has not been defined")
        if tableName in self.result:
            raise ValueError("table " + tableName + " has already been loaded")
        self.result[tableName] = data
        self.length[tableName] = size
        return self

    def copyTable(self, tableName, data, size=None):
        # DataFrame.rdd is a property in PySpark, not a method
        return self.addTable(tableName,
                             SPARK_SESSION.createDataFrame(data.rdd), size)

    def addEmptyTable(self, tableName):
        return self.addTable(tableName,
                             SPARK_SESSION.createDataFrame(SPARK_CONTEXT.emptyRDD(),
                                                           self.applicationDataModel[tableName]), 0)

    #NEW
    def referenceTable(self, tableName):
        """
        Enables referring to a table in the collector under construction to create a new table.
        Can be used in SQL statements. 
        
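        :param tableName: name of a table that has already been added to this builder
        :type tableName: String
        :return: the Spark dataset stored under tableName
        :raise ValueError: if the table has not been added to this builder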
        """
        if tableName not in self.result:
            raise ValueError(tableName + " does not exist")
        return self.result.get(tableName)

    def build(self):
        """
        Completes building the Spark data.
        Registers the application data sets as Spark SQL tables.
        If an OPLCollector has been supplied in the constructor, loads the data tables and their sizes into it.

        :return a dict of table names to Spark data sets containing the application data
        :raise ValueError:  if a table in the ADM has no associated data or if data tables have already been loaded into the collector
        """
        for tableName in self.applicationDataModel:
            if tableName not in self.result:
                raise ValueError("table " + tableName + " has no data")
        for tableName in self.result:
            self.result[tableName].createOrReplaceTempView(tableName)
        if self.collector is not None:
            if self.collector.sparkData:  # is not empty
                raise ValueError("data tables have already been loaded")
            self.collector.sparkData = self.result
            self.collector.size = self.length
        return self.result

    def retrieveSize(self):
        """
        :return the size dict created by this builder
        Note: calling this method before the build method could return an inaccurate result
        """
        return self.length


# end class DataBuilder

class ADMBuilder(object):
    """
    Builds an Application Data Model that associates a set of Spark Datasets with their schemas.
    Usage:

    adm= ADMBuilder()\
        .addSchema("warehouse", buildSchema(
            ("location", StringType()),
            ("capacity", DoubleType())))\
        .addSchema("route", buildSchema(
            ("from", StringType()),
            ("to", StringType()),
            ("capacity", DoubleType())))\
        .build()
    """

    def __init__(self, collector=None):
        """
        Creates a new builder.
        :param collector if present, loads the application data model directly into the collector;
        if not present or null, the ADM map is returned directly
        """
        self.collector = collector
        self.result = {}

    def addSchema(self, tableName, tupleSchema):
        """
        Adds a new table schema to the ADM.

        :param tupleSchema can be built with the buildSchema function
        :return this builder
        :raise ValueError: if a schema for tableName has already been defined
        """
        if tableName in self.result:
            raise ValueError("tuple schema " + tableName + " has already been defined")
        self.result[tableName] = tupleSchema
        return self

    #NEW
    def referenceSchema(self, tableName):
        """
        Enables referring to a schema in the ADM under construction to create a  new schema.

        :param tableName
        :return the schema
        """
        if tableName not in self.result:
            raise ValueError("tuple schema " + tableName + " does not exist")
        return self.result[tableName]

    def build(self):
        """
        Completes building the application data model.
        If an OPLCollector has been supplied in the constructor, loads the ADM into it.

        :return the ADM
        :raise ValueError: if the ADM for the collector has already been defined
        """
        if self.collector is not None:
            if self.collector.applicationDataModel:  # is not empty
                raise ValueError("application data model has already been defined")
            self.collector.applicationDataModel = self.result
        return self.result

# end class ADMBuilder

def buildSchema(*fields):
    """
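    Creates a schema from a list of field tuples.
    The resulting schema is an instance of a Spark StructType.
    :param fields: (fieldName, fieldType) pairs, in the order the fields should appear in the schema
    :type fields: tuple<String, DataType>
    :return: a schema whose fields are not nullable and have no metadata
    :rtype: StructType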
    """
    schema = StructType()
    for fieldName, fieldType in fields:
        schema = schema.add(fieldName, fieldType, False, None)
    return schema
# end buildSchema

#NEW
class SchemaBuilder:
    """
    Builds a tuple schema.
    Strictly speaking, this builder is not needed since the StructType class provides the necessary functionality.
    However, it is provided as a convenience.
    Only the following data types are supported in the schema: String, integer, float (represented as Double), and 1-dimensional arrays of integer or float.
    The array types are supported only for internal use and cannot be serialized to or deserialized from JSON.
    Note: the fields in the resulting schema are sorted in dictionary order by name to ensure correct matching with data elements.

    Usage:
        warehouseSchema= SchemaBuilder().addField("location", StringType()).addField("capacity", DoubleType()).buildSchema()
    The fields of the resulting StructType are not nullable and have no metadata.
    """

    def __init__(self):
        # fields is a dictionary<String, StructField>
        self.fields= {}

    def addField(self, fieldName, fieldType):
        if fieldName in self.fields:
            raise ValueError("field " + fieldName + " has already been set")
        self.fields[fieldName]= StructField(fieldName, fieldType, False)
        return self

    def addFields(self, *fields):
        """
        Adds fields from a list of field tuples
        :param fields: tuple<String, DataType>
        :return: this builder
        """
        for field in fields:
            fieldName, fieldType= field
            self.addField(fieldName, fieldType)
        return self

    def copyField(self, otherSchema, fieldName):
        """
        Copies a single field from another schema
        :param otherSchema: StructType
        :param fieldName: String
        :return: this builder
        """
        if fieldName in self.fields:
            raise ValueError("field " + fieldName + " has already been set")
        self.fields[fieldName]= otherSchema[fieldName]
        return self

    def copyFields(self, otherSchema):
        for fieldName in otherSchema.names:
            self.copyField(otherSchema, fieldName)
        return self

    def buildSchema(self):
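        # sort by field name, as promised in the class docstring (dict order is arbitrary in Python 2)
        return StructType([self.fields[name] for name in sorted(self.fields)])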
# end class SchemaBuilder

### 2.4 The Spark Data Model for the Warehouse Location Application

Using the tools in the OPLCollector class, the application data model (ADM) is defined in three collector objects, each of which has a corresponding representation in OPL and an associated JSON data file. The ADM is defined as follows:

In [5]:
networkDataModel = ADMBuilder()\
    .addSchema("warehouses", buildSchema(
        ("location",     StringType()),
        ("fixedCost",    DoubleType()),
        ("capacityCost", DoubleType())))\
    .addSchema("routes", buildSchema(
        ("location",     StringType()),
        ("store",        StringType()),
        ("shippingCost", DoubleType())))\
    .addSchema("stores", buildSchema(
        ("storeId",      StringType())))\
    .addSchema("mapCoordinates", buildSchema(
        ("location",     StringType()),
        ("lon",          DoubleType()),
        ("lat",          DoubleType())))\
    .build()

demandDataModel = ADMBuilder()\
    .addSchema("demands", buildSchema(
        ("store",        StringType()),
        ("scenarioId",   StringType()),
        ("amount",       DoubleType())))\
    .addSchema("scenarios", buildSchema(
        ("id",           StringType()),
        ("totalDemand",  DoubleType()),
        ("periods",      DoubleType())))\
    .build()

warehousingResultDataModel= ADMBuilder()\
    .addSchema("objectives", buildSchema(
        ("problem",     StringType()),
        ("dExpr",       StringType()),
        ("scenarioId",  StringType()),
        ("iteration",   IntegerType()),
        ("value",       DoubleType())))\
    .addSchema("openWarehouses", buildSchema(
        ("location",    StringType()),
        ("scenarioId",  StringType()),
        ("iteration",   IntegerType()),
        ("open",        IntegerType()),
        ("capacity",    DoubleType())))\
    .addSchema("shipments", buildSchema(
        ("location",    StringType()),
        ("store",       StringType()),
        ("scenarioId",  StringType()),
        ("iteration",   IntegerType()),
        ("amount",      DoubleType())))\
    .build()
    
# Note: the "mapCoordinates" table and the "scenarioId" and "iteration" fields are not used in this notebook but are included for use in other contexts.

### 2.5 The Data

The data for the warehouse location problem consist of two JSON files. The first describes the distribution network: the potential warehouse locations and their capital costs, the store locations, and the transportation routes between the stores and the warehouses with their shipping costs. The second contains the demands at each store. Together, these two files constitute all the data required to optimize the warehouse network.

- Warehousing-data.json contains the warehouses, stores, routes, and mapCoordinates
- Warehousing-sales_data-nominal_scenario.json contains the scenarios and demands

These two data files reside in the DSX Community at: __[Potential warehouse locations](https://apsportal.ibm.com/exchange/public/entry/view/2a493fe2f0d475f0b5b52bce6191f129)__ and __[Demand per store](https://apsportal.ibm.com/exchange/public/entry/view/0a5f75c8e2177f0f64fe22e677588b1a)__. 

Extracts of these two files are shown in the notebook *Locating Warehouses to Minimize Costs Case 1*.
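
For readers who do not have that notebook at hand, the following sketch illustrates the JSON layout that `fromJSON` expects according to its code comment: a single object whose keys are table names and whose values are lists of row objects. It is only an orientation aid that uses the standard library instead of Spark. The helper `check_tables`, the `EXPECTED_FIELDS` dict, and the store identifier are hypothetical; the warehouse row matches the first row of the `warehouses` table displayed in this notebook.

```python
import io
import json

# Field names per table, mirroring two of the schemas in networkDataModel.
EXPECTED_FIELDS = {
    "warehouses": {"location", "fixedCost", "capacityCost"},
    "stores": {"storeId"},
}

# Illustrative JSON text; the store id is made up for this sketch.
SAMPLE_JSON = u"""
{
  "warehouses": [
    {"location": "Brockton, MA", "fixedCost": 550000.0, "capacityCost": 148.0}
  ],
  "stores": [
    {"storeId": "Example Store"}
  ]
}
"""


def check_tables(source, expected_fields):
    """Load {tableName: [row, ...]} from a file-like object and verify field names.

    Returns a dict mapping each table name to its row count.
    """
    data = json.load(source)
    counts = {}
    for table_name, rows in data.items():
        if table_name not in expected_fields:
            raise ValueError("table " + table_name + " has not been defined")
        for row in rows:
            if set(row) != expected_fields[table_name]:
                raise ValueError("unexpected fields in table " + table_name)
        counts[table_name] = len(rows)
    return counts


counts = check_tables(io.StringIO(SAMPLE_JSON), EXPECTED_FIELDS)
print(counts)
```

The row counts collected here play the same role as the `count` value that `fromJSON` passes to `addTable`, so the collector does not have to ask Spark to count the rows again.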

Using the <code>OPLCollector</code> class and its adjuncts, the following cell reads the input data and creates the Spark datasets with which to populate the decision model: 

In [6]:
import cStringIO
networkDataSource = cStringIO.StringIO(requests.get("https://apsportal.ibm.com/exchange-api/v1/entries/2a493fe2f0d475f0b5b52bce6191f129/data?accessKey=4d9d8f5f8569ea123076f7ae988649f4").text) # Warehousing-data.json
demandDataSource =  cStringIO.StringIO(requests.get("https://apsportal.ibm.com/exchange-api/v1/entries/0a5f75c8e2177f0f64fe22e677588b1a/data?accessKey=e3e48701299acea80483b1624d696795").text) # Warehousing-sales_data-nominal_scenario.json

warehousingData= OPLCollector("warehousingData", networkDataModel).setJsonSource(networkDataSource).fromJSON()
warehousingData.addTables(OPLCollector("demandData", demandDataModel).setJsonSource(demandDataSource).fromJSON())

warehousingData.displayTable("warehouses", sys.stdout)

networkDataSource.close()
demandDataSource.close()

collector: warehousingData
table: warehouses
+------------------+---------+------------+
|location          |fixedCost|capacityCost|
+------------------+---------+------------+
|Brockton, MA      |550000.0 |148.0       |
|Bristol, CT       |600000.0 |148.0       |
|Union City, NJ    |600000.0 |148.0       |
|New York, NY      |500000.0 |148.0       |
|Philadelphia, PA  |500000.0 |148.0       |
|Parkville, MD     |550000.0 |148.0       |
|Greensboro, NC    |500000.0 |148.0       |
|Goose Creek, SC   |500000.0 |148.0       |
|Lawrenceville, GA |450000.0 |148.0       |
|Jacksonville, FL  |550000.0 |148.0       |
|Birmingham, AL    |450000.0 |148.0       |
|Memphis, TN       |450000.0 |148.0       |
|Frankfort, KY     |500000.0 |148.0       |
|Akron, OH         |500000.0 |148.0       |
|Dayton, OH        |500000.0 |148.0       |
|West Lafayette, IN|500000.0 |148.0       |
|Taylor, MI        |500000.0 |148.0       |
|Dubuque, IA       |400000.0 |148.0       |
|Beloit, WI        |500000.0 |1

In [7]:
#Uncomment the following statement and rerun the cell to see the data. Warning: it creates a lengthy table.
#warehousingData.displayTable("stores", sys.stdout)

### 2.6 Optimization Model

Here is the OPL statement of the Warehousing optimization model:

In [8]:
warehousing_inputs='''
 //Input Data
  
 {Warehouse} warehouses= ...;	//Denotes reading from a data source
 
 {Store} stores= ...;
 
 {Route} routes= ...;
 
 {Demand} demands= ...;
 float demand[routes]= [r: d.amount | r in routes,  d in demands: r.store==d.store]; //demand at the store at the end of route r
 
 {Scenario} scenarios= ...;
 Scenario scenario= first(scenarios); //scenarios is a singleton set
'''

In [9]:
warehousing_dotmod='''
 dvar boolean open[warehouses];
 dvar float+ capacity[warehouses];		//pallets
 dvar float+ ship[routes] in 0.0..1.0;	//percentage of each store's demand shipped on each route
 
 dexpr float capitalCost=	sum(w in warehouses) (w.fixedCost*open[w] + w.capacityCost*capacity[w]);
 dexpr float operatingCost=	sum(r in routes) r.shippingCost*demand[r]*ship[r];
 dexpr float totalCost=		sum(w in warehouses) (w.fixedCost*open[w] + w.capacityCost*capacity[w]) +
 							sum(r in routes) r.shippingCost*demand[r]*ship[r];
 
 constraint ctCapacity[warehouses];
 constraint ctDemand[stores];
 constraint ctSupply[routes];
 
 minimize totalCost;					// $/yr
 subject to {
 	 
 	forall(w in warehouses)
//	  Cannot ship more out of a warehouse than its capacity
 	  ctCapacity[w]: capacity[w] >= sum(r in routes: r.location==w.location) demand[r]*ship[r];
 	 
	forall(s in stores)
//    Must ship at least 100% of each store's demand
	  ctDemand[s]: sum(r in routes: r.store==s.storeId) ship[r] >= 1.0;
   	   
	forall(r in routes, w in warehouses: w.location==r.location)
//	  Can only ship along a supply route if its warehouse is open	  
	  ctSupply[r]: -ship[r] >= -open[w];	//ship[r] <= open[w]
   
 }
'''

In [10]:
warehousing_outputs= '''
 //Output Data
 
 {Objective} objectives= {
    <"Warehousing", "capitalCost", scenario.id, 0, capitalCost>,
    <"Warehousing", "operatingCost", scenario.id, 0, operatingCost>, 
    <"Warehousing", "totalCost", scenario.id, 0, totalCost>};
 
 {Shipment} shipments= {<r.location, r.store, scenario.id, 0, ship[r]*d.amount> | r in routes, d in demands: r.store==d.store && ship[r]>0.0};
 
 {OpenWarehouse} openWarehouses= {<w.location, scenario.id, 0, open[w], capacity[w]> | w in warehouses};
'''
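
To make the objective and the three constraint families concrete, here is a minimal plain-Python sketch that evaluates them for one hand-picked candidate solution. It is not how the model is solved (that is the job of CPLEX through the cloud service); it only mirrors the arithmetic of `capitalCost`, `operatingCost`, `ctCapacity`, `ctDemand`, and `ctSupply`. The instance (two warehouses "A" and "B", two stores, and all cost and demand figures) is made up for illustration, and the function names are hypothetical.

```python
# Tiny made-up instance; none of these numbers come from the notebook's data files.
warehouses = {
    "A": {"fixedCost": 500000.0, "capacityCost": 148.0},
    "B": {"fixedCost": 450000.0, "capacityCost": 148.0},
}
store_demand = {"s1": 100.0, "s2": 200.0}
# Routes are (warehouse location, store) pairs with a per-unit shipping cost.
routes = {
    ("A", "s1"): 10.0, ("A", "s2"): 12.0,
    ("B", "s1"): 11.0, ("B", "s2"): 9.0,
}

# Candidate solution: open only warehouse A and serve both stores from it.
is_open = {"A": 1, "B": 0}
capacity = {"A": 300.0, "B": 0.0}
ship = {("A", "s1"): 1.0, ("A", "s2"): 1.0, ("B", "s1"): 0.0, ("B", "s2"): 0.0}


def capital_cost():
    # sum(w in warehouses) (w.fixedCost*open[w] + w.capacityCost*capacity[w])
    return sum(w["fixedCost"] * is_open[name] + w["capacityCost"] * capacity[name]
               for name, w in warehouses.items())


def operating_cost():
    # sum(r in routes) r.shippingCost*demand[r]*ship[r]
    return sum(cost * store_demand[store] * ship[(loc, store)]
               for (loc, store), cost in routes.items())


def is_feasible():
    # ctCapacity: capacity covers everything shipped out of each warehouse
    for name in warehouses:
        shipped = sum(store_demand[s] * ship[(loc, s)]
                      for (loc, s) in routes if loc == name)
        if capacity[name] < shipped:
            return False
    # ctDemand: the shipped fractions cover each store's full demand
    for store in store_demand:
        if sum(ship[(loc, s)] for (loc, s) in routes if s == store) < 1.0:
            return False
    # ctSupply: a route can be used only if its warehouse is open
    for (loc, s) in routes:
        if ship[(loc, s)] > is_open[loc]:
            return False
    return True


total_cost = capital_cost() + operating_cost()
feasible = is_feasible()
print(feasible, total_cost)
```

For this candidate the capital cost is 544400.0, the operating cost is 3400.0, and the total is 547800.0. Whether opening B instead (or as well) would be cheaper is exactly the question the optimizer answers.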

### 2.7 Solving the Warehousing Model with IBM Decision Optimization on Cloud

#### 2.7.1 Get Your Credentials for IBM Decision Optimization on Cloud

To use IBM Decision Optimization on Cloud, insert your credentials in the following cell. If you don't already have them, you can register for a trial __[here](https://dropsolve-oaas.docloud.ibmcloud.com/software/analytics/docloud)__.

In [11]:
url= "" # ENTER YOUR URL HERE
key= "" # ENTER YOUR KEY HERE

#### 2.7.2 The Optimizer Class

The optimization computation uses IBM Decision Optimization on Cloud, which exposes IBM's CPLEX through a cloud-based interface. To simplify applications built on this cloud platform, the calls to the solver have been abstracted as the Optimizer class, shown below. This class is independent of the actual decision model and instance data, so it can be reused in other decision optimization applications without modification. Here is the Python code for the Optimizer class and some related functions (go to edit mode to see the contents of this hidden cell).

In [12]:
# @hidden_cell
'''
Created on Feb 9, 2017

@author: bloomj
'''
try:
    import docloud
except ImportError:
    if hasattr(sys, 'real_prefix'):
        #we are in a virtual env.
        !pip install docloud 
    else:
        !pip install --user docloud

from docloud.job import JobClient
from docloud.status import JobSolveStatus, JobExecutionStatus

from urlparse import urlparse

import fileinput
import urllib
import cStringIO
from pprint import pprint

class Optimizer(object):
    '''
     Handles the actual optimization task.
     Creates and executes a job builder for an optimization problem instance.
     Encapsulates the DOCloud API.
     This class is designed to facilitate multiple calls to the optimizer, such as would occur in a decomposition algorithm,
     although it transparently supports single use as well.
     In particular, the data can be factored into a constant data set that does not vary from run to run (represented by a JSON or .dat file)
     and a variable piece that does vary (represented by a Collector object).
     The optimization model can also be factored into two pieces, a best practice for large models and multi-models:
     A data model that defines the tuples and tuple sets that will contain the input and output data.
     An optimization model that defines the decision variables, decision expressions, objective function, 
     constraints, and pre- and post-processing data transformations.
     Factoring either the data or the optimization model in this fashion is optional.
     
     The problem instance is specified by the OPL model and input data received from the invoking (e.g. ColumnGeneration) instance.
     Input and output data are realized as instances of OPLCollector, which in turn are specified by their respective schemas.
     This class is completely independent of the specific optimization problem to be solved.
    '''

    def __init__(self, problemName, model=None, resultDataModel=None, credentials=None, *attachments):
        '''
         Constructs an Optimizer instance.
         The instance requires an optimization model as a parameter.
         You can also provide one or more data files as attachments, either in OPL .dat or in JSON format. This data does not
         change from solve to solve. If you have input data that does change, you can provide it to the solve method as an OPLCollector object.
         :param problemName: name of this optimization problem instance
         :type problemName: String    
         :param model: an optimization model written in OPL
         :type model: Model.Source object or String
         :param resultDataModel: the application data model for the results of the optimization
         :type resultDataModel: dict<String, StructType>
         :param credentials: DOcplexcloud url and api key
         :type credentials: {"url":String, "key":String}
         :param attachments: URLs for files representing the data that does not vary from solve to solve
         :type attachments: list<URL>
        '''
        self.name= problemName
        self.model= model
        self.resultDataModel= resultDataModel
        self.attachData(attachments)
        self.streamsRegistry= []
        self.history= []
        
        self.credentials= credentials
 
        self.jobclient= JobClient(credentials["url"], credentials["key"])
        self.solveStatus= JobSolveStatus.UNKNOWN
        
    def getName(self):
        """
        Returns the name of this problem
        """
        return self.name
    
    def setOPLModel(self, name, dotMods=None, modelText=None):
        '''
         Sets the OPL model.
         This method can take any number of dotMod arguments, but
         there are two common use cases:
         First, the optimization model can be composed of two pieces: 
             A data model that defines the tuples and tuple sets that will contain the input and output data.
             An optimization model that defines the decision variables, decision expressions, objective function, 
             constraints, and pre- and post-processing data transformations.
             The two are concatenated, so they must be presented in that order.
             If such a composite model is used, you do not need to import the data model into the optimization model using an OPL include statement.
         Second, you do not have to use a separate data model, in which case a single dotMod must be provided 
         which encompasses both the data model and the optimization model.  
        @param name: the name assigned to this OPL model (should have the format of a file name with a .mod extension)
        @type name: String
        @param dotMods: URLs pointing to OPL .mod files, which will be concatenated in the order given
        @type dotMods: List<URL>
        @param modelText: the text of the OPL model, which will be concatenated in the order given
        @type modelText: List<String>
        @return this optimizer
        @raise ValueError if a model has already been defined or if dotMods or modelText is empty
        '''
        if self.model is not None:
            raise ValueError("model has already been set")
        self.model= ModelSource(name=name, dotMods=dotMods, modelText=modelText)
        return self
    
    def setResultDataModel(self, resultDataModel):
        '''
        Sets the application data model for the results of the optimization
        @param resultDataModel: the application data model for the results of the optimization
        @type resultDataModel: dict<String, StructType>
        '''
        if self.resultDataModel is not None:
            raise ValueError("results data model has already been defined")        
        self.resultDataModel = resultDataModel
        return self
    
    def attachData(self, attachments):
        '''
        Attaches one or more data files, either in OPL .dat or in JSON format. This data does not
        change from solve to solve. If you have input data that does change, you can provide it as a Collector object.
        @param attachments: files representing the data that does not vary from solve to solve
        @type attachments: list<URL>
        @return this optimizer
        @raise ValueError if an item of the same name has already been attached
        '''
        self.attachments= {}
        if attachments is not None:
            for f in attachments:
                # urlparse returns a tuple-like result; the file name is in its path component
                fileName= os.path.splitext(os.path.basename(urlparse(f).path))[0]
                if fileName in self.attachments:
                    raise ValueError(fileName+ " already attached")
                self.attachments[fileName]= f
        return self
    
    def solve(self, inputData=None, solutionId=""):
        '''
        Solves an optimization problem instance by calling the DOCloud solve service (Oaas).
        Creates a new job request, incorporating any changes to the variable input data, 
        for a problem instance to be processed by the solve service. 
        Once the problem is solved, the results are mapped to an instance of an OPL Collector.
        Note: this method will set a new destination for the JSON serialization of the input data.
        @param inputData: the variable, solve-specific input data
        @type inputData: OPLCollector
        @param solutionId: an identifier for the solution, used in iterative algorithms (set to empty string if not needed)
        @type solutionId: String
        @return: a solution collector
        '''
        inputs= []
        if self.model is None:
            raise ValueError("A model attachment must be provided to the optimizer")
        if self.model: #is not empty
            stream= self.model.toStream()
            inputs.append({"name":self.model.getName(), "file":stream})
            self.streamsRegistry.append(stream)
        if self.attachments: #is not empty
            for f in self.attachments:
                stream= urllib.urlopen(self.attachments[f])  # file-like object for the attachment URL
                inputs.append({"name":f, "file":stream})
                self.streamsRegistry.append(stream)
        if inputData is not None:
            outStream = cStringIO.StringIO()
            inputData.setJsonDestination(outStream).toJSON()
            inStream = cStringIO.StringIO(outStream.getvalue())
            inputs.append({"name": inputData.getName()+".json", "file": inStream})
            self.streamsRegistry.extend([outStream, inStream])
       
        response= self.jobclient.execute(
            input= inputs, 
            output= "results.json", 
            load_solution= True, 
            log= "solver.log", 
            gzip= True,
            waittime= 300,  #seconds
            delete_on_completion= False)
         
        self.jobid= response.jobid
        
        status= self.jobclient.get_execution_status(self.jobid)
        solution= None #remains None unless the job was processed successfully
        if status==JobExecutionStatus.PROCESSED:
            results= cStringIO.StringIO(response.solution)
            self.streamsRegistry.append(results)
            self.solveStatus= response.job_info.get('solveStatus') #INFEASIBLE_SOLUTION or UNBOUNDED_SOLUTION or OPTIMAL_SOLUTION or...
            solution= (OPLCollector(self.getName()+"Result"+solutionId, self.resultDataModel)).setJsonSource(results).fromJSON()
            self.history.append(solution)
        elif status==JobExecutionStatus.FAILED:
            # get failure message if defined
            message= ""
            if response.getJob().getFailureInfo() is not None:
                message= response.getJob().getFailureInfo().getMessage()
            print("Failed " +message)
        else:
            print("Job Status: " +str(status))
        
        # release all streams opened for this job, then remove the job from the service
        for s in self.streamsRegistry:
            s.close()
        self.jobclient.delete_job(self.jobid)
        
        return solution
    
    def getSolveStatus(self):
        """
        @return the solve status as a string
        Attributes:
            UNKNOWN: The algorithm has no information about the solution.
            FEASIBLE_SOLUTION: The algorithm found a feasible solution.
            OPTIMAL_SOLUTION: The algorithm found an optimal solution.
            INFEASIBLE_SOLUTION: The algorithm proved that the model is infeasible.
            UNBOUNDED_SOLUTION: The algorithm proved the model unbounded.
            INFEASIBLE_OR_UNBOUNDED_SOLUTION: The model is infeasible or unbounded.
        """
        return self.solveStatus
    
# end class Optimizer        
    
class ModelSource(object):
    '''
     This class manages the OPL source code for an optimization model.
     It can use an OPL model specified by one or more files, indicated by their URLs, or
     it can use an OPL model specified by one or more Strings. 
     Use of one OPL component is the norm, but this class also
     enables factoring an OPL model into a data model and an optimization model.
     Using such a two-piece factorization is a best practice for large models and multi-models:
     The data model defines the tuples and tuple sets that will contain the input and output data.
     The optimization model defines the decision variables, decision expressions, objective function, 
     constraints, and pre- and post-processing data transformations.
     
     When the OPL model consists of multiple components, ModelSource concatenates them in the order
     presented, and it is not necessary to use OPL include statements to import the components.
     The multiple model files need not be located in the same resource folder.
     
     Note: developers generally need not use this class directly. Instead, it is recommended
     to use the setOPLModel method of the Optimizer class.
    '''
    
    def __init__(self, name= "OPL.mod", dotMods= None, modelText= None):
        '''
         Creates a new ModelSource instance from URLs pointing to OPL .mod files.
         This method can take any number of URL arguments, but
         there are two common use cases:
         First, the optimization model can be composed of two pieces: 
         A data model that defines the tuples and tuple sets that will contain the input and output data.
         An optimization model that defines the decision variables, decision expressions, objective function, 
         constraints, and pre- and post-processing data transformations.
         The two are concatenated, so they must be presented in that order.
         If such a composite model is used, you do not need to import the data model into the optimization model using an OPL include statement.
         Second, you do not have to use a separate data model, in which case a single model URL must be provided 
         which encompasses both the data model and the optimization model.
        
        @param name: the name assigned to this OPL model (should have the format of a file name with a .mod extension)
        @type name: String
        @param dotMods: URLs pointing to OPL .mod files, which will be concatenated in the order given
        @type dotMods: List<URL>
        @param modelText: the text of the OPL model, which will be concatenated in the order given
        @type modelText: List<String>
        @raise: ValueError if dotMods or modelText is empty
        '''
        
        self.name= name
        if dotMods is not None and not dotMods: #is empty
            raise ValueError("argument cannot be empty")
        self.dotMods= dotMods
        if modelText is not None and not modelText: #is empty
            raise ValueError("argument cannot be empty")
        self.modelText= modelText
        
    def getName(self):
        '''
         @return:  the name assigned to this OPL model
         @type String 
        '''
        return self.name
    
    def isEmpty(self):
        '''
         @return True if both dotMods and modelText are None; False otherwise
        '''
        return self.dotMods is None and self.modelText is None
    
    def toStream(self):
        '''
         Concatenates the model components and creates an input file for reading them.
         
         @return a file
        '''
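        if self.dotMods: #is not empty
            # read each .mod file and concatenate the texts in the order given
            parts= []
            for f in self.dotMods:
                source= urllib.urlopen(f)
                parts.append(source.read())
                source.close()
            result= cStringIO.StringIO("".join(parts))
            return result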
        if self.modelText: #is not empty
            result= cStringIO.StringIO("".join(self.modelText)) 
            return result           
        raise ValueError("model source is empty")
 
# end class ModelSource
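
The concatenation performed by `ModelSource` can be pictured with a minimal sketch. The two OPL fragments below are placeholders invented for illustration (they do not form a complete model), and `io.StringIO` stands in for the `cStringIO` stream used by the class.

```python
import io

# Placeholder OPL fragments (not a complete model), listed in the required
# order: data model first, optimization model second.
data_model = u"tuple Warehouse { key string location; }\n"
optimization_model = u"dvar boolean open[warehouses];\n"

# Concatenate the pieces in the order given and expose them as a readable stream.
stream = io.StringIO(u"".join([data_model, optimization_model]))
text = stream.read()
stream.close()
print(text)
```

Because the pieces are joined with an empty separator, each piece should end with a newline so that the last statement of one component does not run into the first statement of the next.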

#### 2.7.3 Setting Up and Submitting the Solve Job

Using the Optimizer class, the following solves the Warehousing problem:

In [13]:
problem= Optimizer("Warehousing", credentials={"url":url, "key":key})\
        .setOPLModel("Warehousing.mod", modelText= [warehousing_data_dotmod, warehousing_inputs, warehousing_dotmod, warehousing_outputs])\
        .setResultDataModel(warehousingResultDataModel)
warehousingResult= problem.solve(warehousingData.copy("warehousingDataNoCoord", "warehouses", "routes", "stores", "demands", "scenarios"))
# Note: the mapCoordinates table is not used in the optimization and so is not sent to the optimizer
problem.getSolveStatus()

u'OPTIMAL_SOLUTION'

#### 2.7.4 Retrieving the Optimal Solution

In [14]:
warehousingResult.displayTable("objectives", sys.stdout);

collector: WarehousingResult
table: objectives
+-----------+-------------+----------+---------+--------------------+
|problem    |dExpr        |scenarioId|iteration|value               |
+-----------+-------------+----------+---------+--------------------+
|Warehousing|capitalCost  |Nominal   |0        |6373620.0           |
|Warehousing|operatingCost|Nominal   |0        |4580688.489999998   |
|Warehousing|totalCost    |Nominal   |0        |1.0954308489999998E7|
+-----------+-------------+----------+---------+--------------------+



In [15]:
openWarehouses= warehousingResult.getTable("openWarehouses").select('*').where("open == 1")
print("collector: WarehousingResult")
print("table: openWarehouses")
openWarehouses.show(openWarehouses.count())

collector: WarehousingResult
table: openWarehouses
+-----------------+----------+---------+----+--------+
|         location|scenarioId|iteration|open|capacity|
+-----------------+----------+---------+----+--------+
|     New York, NY|   Nominal|        0|   1|  3961.0|
|Lawrenceville, GA|   Nominal|        0|   1|  1146.0|
|      Chicago, IL|   Nominal|        0|   1|  2720.0|
|       Dallas, TX|   Nominal|        0|   1|  1395.0|
|       Denver, CO|   Nominal|        0|   1|   874.0|
|  Los Angeles, CA|   Nominal|        0|   1|  5581.0|
|San Francisco, CA|   Nominal|        0|   1|  2388.0|
+-----------------+----------+---------+----+--------+



## 3 Transforming an Optimization Problem to Tableau Form

The simplex tableau form is close to the problem format that the solver itself uses. The modeling layer automatically transforms an optimization problem such as the Warehousing example into tableau form and then recovers the solution, transparently to the model developer. This section discusses how these transformations occur, using the Warehousing problem as an example.

### 3.1 Using Relational Database Operations to Reshape the Instance Data

Many readers will have noticed at this point the correspondence between the data structures of OPL and those of the relational model commonly used in database programming. A tuple set in OPL corresponds to a table in a relational database. A tuple in OPL corresponds to a row, or record, in a database table. A data set, or *collector*, for an optimization model consists of one or more tables (tuple sets), usually having foreign keys that link them together. 

This correspondence is more than casual; it reflects a deep relationship between optimization modeling and the structure of data. This relationship, in turn, implies that the common operations on relational data, namely *selects* and *joins*, can also be used to transform the data used in optimization modeling. Since relational database systems are designed to execute these operations very efficiently on very large datasets, using such a system as the basis for optimization modeling can yield substantial gains in speed and efficiency. Not all of the latency in solving an optimization problem comes from the computational algorithms; the upstream and downstream data transformations often contribute significantly as well.

This section illustrates that insight. It builds the transformations described in the preceding section using a database system, in this case Apache Spark (see http://spark.apache.org/), although other database systems could be used just as well. Spark was chosen because it is designed for use with very large, distributed datasets, called *Resilient Distributed Datasets (RDDs)*. While an RDD does not have to have a relational structure, a relational structure is assumed throughout this notebook. A Spark RDD with a relational structure is called a *dataset* (or *dataframe* in Python).
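
Since any relational system would serve, the select and join operations can be tried on a miniature instance before turning to Spark. The following sketch uses Python's built-in `sqlite3` module as a stand-in; the table names mirror the `warehouses` and `routes` tuple sets, but the rows and cost values are invented for illustration.

```python
import sqlite3

# Hypothetical miniature instance: two tuple sets stored as two tables,
# linked by the foreign key routes.location -> warehouses.location.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouses (location TEXT PRIMARY KEY, fixedCost REAL)")
conn.execute("CREATE TABLE routes (location TEXT REFERENCES warehouses(location), "
             "store TEXT, shippingCost REAL)")
conn.executemany("INSERT INTO warehouses VALUES (?, ?)",
                 [("Chicago, IL", 100.0), ("Dallas, TX", 80.0)])
conn.executemany("INSERT INTO routes VALUES (?, ?, ?)",
                 [("Chicago, IL", "S1", 2.0),
                  ("Chicago, IL", "S2", 3.5),
                  ("Dallas, TX", "S2", 1.5)])

# Select: keep only the routes out of one warehouse.
chicago_routes = conn.execute(
    "SELECT store, shippingCost FROM routes WHERE location = ? ORDER BY store",
    ("Chicago, IL",)).fetchall()

# Join: attach to each route the fixed cost of its warehouse via the foreign key.
routes_with_cost = conn.execute(
    "SELECT r.location, r.store, w.fixedCost "
    "FROM routes r JOIN warehouses w ON r.location = w.location "
    "ORDER BY r.location, r.store").fetchall()

print(chicago_routes)
print(routes_with_cost)
conn.close()
```

The select corresponds to a filtered index set in OPL (such as `r in routes: r.location==w.location`), and the join corresponds to looking up one tuple's fields through another tuple's key.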

In [16]:
# Identify the input dataframes
warehouses = warehousingData.getTable('warehouses')
stores =     warehousingData.getTable('stores')
routes =     warehousingData.getTable('routes')
demands =    warehousingData.getTable('demands')
scenarios =  warehousingData.getTable('scenarios')
#scenarios is a singleton (has a single row)
scenarioId = scenarios.first()["id"]


#### 3.1.1 The Tableau Data Model

The tableau data model is reprinted below for reference.

In [17]:
print(tableau_data_model)


 //Defines a column for a boolean variable
 tuple BooleanColumn {
    key string variable;    //name = variable+index
    int lower;              //lower bound (always 0)
    int upper;              //upper bound (always 1)
    int value;              //optimal value (output only)
 }
 
 //Defines a column for an integer variable
 tuple IntegerColumn {
    key string variable;    //name = variable+index
    int lower;              //lower bound
    int upper;              //upper bound
    int value;              //optimal value (output only)
 }
 
 //Defines a column for a continuous variable
 tuple FloatColumn {
    key string variable;    //name = variable+index
    float lower;            //lower bound
    float upper;            //upper bound
    float value;            //optimal value (output only)
 }
 
 //Defines a row for a constraint or a decision expression
 tuple Row {
    key string cnstraint;   //name = constraint+index  
    string sense;           //GE(>=), EQ(==), LE(<=)

The corresponding Spark schemas are built with an <code>OPLCollector</code> instance:

In [18]:
#Create the tableau data model
tableauData= OPLCollector("tableauData")
tableauADMBuilder= tableauData.buildADM()
tableauADMBuilder.addSchema("integerColumns", buildSchema(
    ("variable", StringType()),
    ("lower", IntegerType()),
    ("upper", IntegerType()),
    ("value", IntegerType())))
tableauADMBuilder.addSchema("booleanColumns", SchemaBuilder()\
    .copyFields(tableauADMBuilder.referenceSchema("integerColumns"))\
    .buildSchema())
tableauADMBuilder.addSchema("floatColumns", buildSchema(
    ("variable", StringType()),
    ("lower", DoubleType()),
    ("upper", DoubleType()),
    ("value", DoubleType())))
tableauADMBuilder.addSchema("rows", buildSchema(
    ("cnstraint", StringType()),
    ("sense", StringType()),
    ("rhs", DoubleType())))
tableauADMBuilder.addSchema("entries", buildSchema(
    ("cnstraint", StringType()),
    ("variable", StringType()),
    ("coefficient", DoubleType())))
tableauADMBuilder.addSchema("objectives", buildSchema(
    ("name", StringType()),
    ("sense", StringType()),
    ("value", DoubleType())))
tableauDataModel= tableauADMBuilder.build()

#### 3.1.2 The Transformation Data Model

The instance data for the warehousing model must be transformed into the tableau data model. This transformation occurs in several steps, which are mediated by a transitional collector. The schema of this collector is specified in the next cell.

In [19]:
# Create the data model to transform the warehousing data into the tableau
tableauTransformations = OPLCollector("tableauTransformations")
tableauTransformationsADMBuilder = tableauTransformations.buildADM()
tableauTransformationsADMBuilder.addSchema("columns_open", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("booleanColumns"))\
    .addField("location", StringType())\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("columns_capacity", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("floatColumns"))\
    .addField("location", StringType())\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("columns_ship", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("floatColumns")) \
    .addField("location", StringType())\
    .addField("store", StringType())\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("rows_ctCapacity", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("rows"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("rows_ctDemand", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("rows"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("rows_ctSupply", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("rows"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("rows_dexpr", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("rows"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("entries_ctCapacity_capacity", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("entries"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("entries_ctCapacity_ship", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("entries"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("entries_ctDemand_ship", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("entries"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("entries_ctSupply_open", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("entries"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("entries_ctSupply_ship", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("entries"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("entries_dexpr_open", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("entries"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("entries_dexpr_capacity", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("entries"))\
    .buildSchema())
tableauTransformationsADMBuilder.addSchema("entries_dexpr_ship", SchemaBuilder()\
    .copyFields(tableauData.getADM().get("entries"))\
    .buildSchema())
tableauTransformationsDataModel= tableauTransformationsADMBuilder.build()

### 3.2 Encoding the Decision Variables

The first step is to map the decision variables to columns in the tableau. The purpose of this encoding is two-fold. First, it associates the data in the application data model with the schema of the tableau data model. Second, it establishes a mapping between the keys in the application data model (which are specific to its domain) and the column indices (i.e. the keys) of the tableau, which need to be independent of the application data schema. This mapping enables two-way communication between the application layer and the solving layer.

The Warehousing example has three sets of decision variables: 
<ul>
    <li><code>dvar boolean open[warehouses]</code></li> 
    <li><code>dvar float capacity[warehouses] in 0..infinity</code></li> 
    <li><code>dvar float ship[routes] in 0.0..1.0</code></li>
</ul>

The column indices are computed by appending the key in the underlying index table to the name of the decision variable, e.g. <code>"open_Brockton, MA"</code>.
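
A minimal sketch of this encoding in plain Python, using a hypothetical list of warehouse keys in place of the `warehouses` table, shows both directions of the mapping:

```python
# Hypothetical warehouse keys; in the notebook they come from the warehouses table.
locations = ["Brockton, MA", "Chicago, IL"]

# Forward mapping: application key -> tableau column key.
column_of = {loc: "open_" + loc for loc in locations}

# Reverse mapping: tableau column key -> application key, used later to read
# the solution back into the application data model.
location_of = {column: loc for loc, column in column_of.items()}

print(column_of["Brockton, MA"])
print(location_of["open_Chicago, IL"])
```

Keeping the reverse mapping as a lookup table, rather than parsing the column name, avoids ambiguity when a key itself contains the separator. This is why the transitional tables in the next cell retain the `location` and `store` columns alongside `variable`.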

In [20]:
tableauTransformer = tableauTransformations.buildData()

# Encode the columns
tableauTransformer.addTable("columns_open",
    warehouses.select("location")\
        .withColumn("variable", functions.concat(functions.lit("open_"), warehouses["location"]))\
        .withColumn("upper", functions.lit(1))\
        .withColumn("lower", functions.lit(0))\
        .withColumn("value", functions.lit(0)))
tableauTransformer.addTable("columns_capacity",
    warehouses.select("location")\
        .withColumn("variable", functions.concat(functions.lit("capacity_"), warehouses["location"]))\
        .withColumn("upper", functions.lit(1.0e20))\
        .withColumn("lower", functions.lit(0.0))\
        .withColumn("value", functions.lit(0.0)))
tableauTransformer.addTable("columns_ship",
    routes.select("location", "store")\
        .withColumn("variable", functions.concat(functions.lit("ship_"), routes["location"], functions.lit("_"),
                                                 routes["store"]))\
        .withColumn("upper", functions.lit(1.0))\
        .withColumn("lower", functions.lit(0.0))\
        .withColumn("value", functions.lit(0.0)))

<__main__.DataBuilder at 0x7f5db5b202d0>

### 3.3 Encoding the Constraints and Decision Expressions

The next step is to map the constraints to rows in the tableau. An encoding similar to that used for the decision variables applies to the three sets of constraints: 
<ul>
    <li><code>constraint ctCapacity[warehouses]</code></li> 
    <li><code>constraint ctDemand[stores]</code></li> 
    <li><code>constraint ctSupply[routes]</code></li>
</ul>

Also encoded are the rows corresponding to the decision expressions:
<ul>
    <li><code>dexpr float capitalCost</code></li>
    <li><code>dexpr float operatingCost</code></li>
    <li><code>dexpr float totalCost</code></li>
</ul>

In [21]:
tableauTransformer.addTable("rows_ctCapacity",
    warehouses.select("location")\
        .withColumn("cnstraint", functions.concat(functions.lit("ctCapacity_"), warehouses["location"]))\
        .withColumn("sense", functions.lit("GE"))\
        .withColumn("rhs", functions.lit(0.0)))
tableauTransformer.addTable("rows_ctDemand",
    stores.select("storeId")\
        .withColumn("cnstraint", functions.concat(functions.lit("ctDemand_"), stores["storeId"]))\
        .withColumn("sense", functions.lit("GE"))\
        .withColumn("rhs", functions.lit(1.0))\
        .withColumnRenamed("storeId", "store"))
tableauTransformer.addTable("rows_ctSupply",
    routes.select("location", "store")\
        .withColumn("cnstraint", functions.concat(functions.lit("ctSupply_"), routes["location"], functions.lit("_"),
                                                  routes["store"]))\
        .withColumn("sense", functions.lit("GE"))\
        .withColumn("rhs", functions.lit(0.0)))
tableauTransformer.addTable("rows_dexpr",
    SPARK_SESSION.createDataFrame(
        [   Row(cnstraint= "capitalCost", sense= "dexpr", rhs= 0.0),
            Row(cnstraint= "operatingCost", sense= "dexpr", rhs= 0.0),
            Row(cnstraint= "totalCost", sense= "dexpr", rhs= 0.0)],
        tableauTransformations.getADM().get("rows_dexpr"))\
    .select("cnstraint", "sense", "rhs"))    #orders the columns properly

<__main__.DataBuilder at 0x7f5db5b202d0>

### 3.4 Reshaping the Coefficient Data into the Tableau

The next step is to populate the coefficient matrix of the tableau with the data from the problem instance, a step called *reshaping* for reasons that will become clear. The table below schematically represents the coefficient matrix. The row and column labels (the left-most column and the top row) have already been defined in the preceding sections. The sparsity of the matrix is already apparent in the diagram, as four of the cells are empty. Furthermore, the sub-matrices in the non-empty cells are themselves sparse. The construction of the tableau proceeds for each sub-matrix individually.

In [22]:
figure2= """
+-----------------+-----------------------+-----------------------------+-------------------------+
|                 | columns_open          | columns_capacity            | columns_ship            |
+=================+=======================+=============================+=========================+
| rows_dexpr      | entries_dexpr_open    | entries_dexpr_capacity      | entries_dexpr_ship      |
+-----------------+-----------------------+-----------------------------+-------------------------+
| rows_ctCapacity |                       | entries_ctCapacity_capacity | entries_ctCapacity_ship |
+-----------------+-----------------------+-----------------------------+-------------------------+
| rows_ctDemand   |                       |                             | entries_ctDemand_ship   |
+-----------------+-----------------------+-----------------------------+-------------------------+
| rows_ctSupply   | entries_ctSupply_open |                             | entries_ctSupply_ship   |
+-----------------+-----------------------+-----------------------------+-------------------------+
"""

In [23]:
# @hidden_cell
# Generates the table shown above
from tabulate import tabulate

tableauColumns= ['', 'columns_open', 'columns_capacity', 'columns_ship']
tableauDiagram= [['rows_dexpr', 'entries_dexpr_open','entries_dexpr_capacity', 'entries_dexpr_ship'],
                 ['rows_ctCapacity', '','entries_ctCapacity_capacity', 'entries_ctCapacity_ship'],
                 ['rows_ctDemand', '', '', 'entries_ctDemand_ship'],
                 ['rows_ctSupply', 'entries_ctSupply_open', '', 'entries_ctSupply_ship']]


#print tabulate(tableauDiagram, tableauColumns, tablefmt="grid")

These transformations associate the row and column entries in the tableau with the coefficients of the decision expressions and constraint inequalities. Recall that the decision expressions and constraints of the Warehousing model are:
<p>
<code>
 dexpr float capitalCost=   sum(w in warehouses) (w.fixedCost X open[w] + w.capacityCost X capacity[w]);
 dexpr float operatingCost= sum(r in routes) r.shippingCost X demand[r] X ship[r];
 dexpr float totalCost=     sum(w in warehouses) (w.fixedCost X open[w] + w.capacityCost X capacity[w]) +
                            sum(r in routes) r.shippingCost X demand[r] X ship[r];
forall(w in warehouses)
// Cannot ship more out of a warehouse than its capacity
   ctCapacity[w]: capacity[w] - sum(r in routes: r.location==w.location) demand[r] X ship[r] >= 0.0;	
forall(s in stores)
// Must ship at least 100% of each store's demand
   ctDemand[s]: sum(r in routes: r.store==s.storeId) ship[r] >= 1.0;  
forall(r in routes, w in warehouses: w.location==r.location)
// Can only ship along a supply route if its warehouse is open	  
   ctSupply[r]: open[w] - ship[r] >= 0.0;
</code>
<p>
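
To see how one of these constraints turns into tableau entries, consider `ctCapacity`. The sketch below uses plain Python on invented data rather than Spark, and emits one (row, column, coefficient) triple for each nonzero coefficient of the constraint. The route list and demand values are hypothetical.

```python
# Hypothetical instance data (values invented for illustration).
routes = [("Chicago, IL", "S1"), ("Chicago, IL", "S2"), ("Dallas, TX", "S2")]
demand = {"S1": 40.0, "S2": 25.0}   # demand at the store at the end of a route

# ctCapacity[w]:
#   capacity[w] - sum(r in routes: r.location==w.location) demand[r] X ship[r] >= 0
entries = []
for location in sorted({loc for loc, _ in routes}):
    row = "ctCapacity_" + location
    # coefficient +1 on the capacity variable of the same warehouse
    entries.append((row, "capacity_" + location, 1.0))
    # coefficient -demand on every ship variable leaving that warehouse
    for loc, store in routes:
        if loc == location:
            entries.append((row, "ship_" + loc + "_" + store, -demand[store]))

for entry in entries:
    print(entry)
```

The inner loop with its `if` test plays the role of a join on `location`; a relational engine performs the same matching without an explicit nested loop.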

The corresponding Spark SQL operations, written with the dataframe API, are as follows:

In [24]:
tableauTransformer.addTable(
    "entries_ctCapacity_capacity",
    tableauTransformer.referenceTable("rows_ctCapacity")\
        .join(tableauTransformer.referenceTable("columns_capacity"), on="location")\
        .select("cnstraint", "variable")\
        .withColumn("coefficient", functions.lit(1.0)))
# demand at the store at the end of each route
demandOnRoute = routes\
        .join(demands.where(demands["scenarioId"] == functions.lit(scenarioId)), on="store")\
        .select("location", "store", "amount")\
        .withColumnRenamed("amount", "demand")
tableauTransformer.addTable(
    "entries_ctCapacity_ship",
    tableauTransformer.referenceTable("rows_ctCapacity")\
        .join(tableauTransformer.referenceTable("columns_ship"), on="location")\
        .join(demandOnRoute, on=["location", "store"])\
        .withColumn("coefficient", -demandOnRoute["demand"])\
        .select("cnstraint", "variable", "coefficient"))
tableauTransformer.addTable(
    "entries_ctDemand_ship",
    tableauTransformer.referenceTable("rows_ctDemand")\
        .join(tableauTransformer.referenceTable("columns_ship"), on="store")\
        .select("cnstraint", "variable")\
        .withColumn("coefficient", functions.lit(1.0)))
tableauTransformer.addTable(
    "entries_ctSupply_open",
    tableauTransformer.referenceTable("rows_ctSupply")\
        .join(tableauTransformer.referenceTable("columns_open"), on="location")\
        .select("cnstraint", "variable")\
        .withColumn("coefficient", functions.lit(1.0)))
tableauTransformer.addTable(
    "entries_ctSupply_ship",
    tableauTransformer.referenceTable("rows_ctSupply")\
        .join(tableauTransformer.referenceTable("columns_ship"), on=["location", "store"])\
        .select("cnstraint", "variable")\
        .withColumn("coefficient", functions.lit(-1.0)))
rows_dexpr = tableauTransformer.referenceTable("rows_dexpr")
tableauTransformer.addTable(
    "entries_dexpr_open",
    (rows_dexpr.where((rows_dexpr["cnstraint"] == functions.lit("capitalCost"))\
                    | (rows_dexpr["cnstraint"] == functions.lit("totalCost"))))\
        .join(tableauTransformer.referenceTable("columns_open").join(warehouses, on="location"), 
              how="cross")\
        .select("cnstraint", "variable", "fixedCost")\
        .withColumnRenamed("fixedCost", "coefficient"))
tableauTransformer.addTable(
    "entries_dexpr_capacity",
    (rows_dexpr.where((rows_dexpr["cnstraint"] == functions.lit("capitalCost"))\
                    | (rows_dexpr["cnstraint"] == functions.lit("totalCost"))))\
        .join(tableauTransformer.referenceTable("columns_capacity").join(warehouses, on="location"), 
              how="cross")\
        .select("cnstraint", "variable", "capacityCost")\
        .withColumnRenamed("capacityCost", "coefficient"))
tableauTransformer.addTable(
    "entries_dexpr_ship",
    (rows_dexpr.where((rows_dexpr["cnstraint"] == functions.lit("operatingCost"))\
                    | (rows_dexpr["cnstraint"] == functions.lit("totalCost"))))\
        .join((tableauTransformer.referenceTable("columns_ship")\
                .join((routes.join(demandOnRoute, on=["location", "store"])\
                      .withColumn("coefficient", demandOnRoute["demand"] * routes["shippingCost"])),
                      on=["location", "store"])), 
              how="cross")\
        .select("cnstraint", "variable", "coefficient"))

tableauTransformer.build()

{'columns_capacity': DataFrame[location: string, variable: string, upper: double, lower: double, value: double],
 'columns_open': DataFrame[location: string, variable: string, upper: int, lower: int, value: int],
 'columns_ship': DataFrame[location: string, store: string, variable: string, upper: double, lower: double, value: double],
 'entries_ctCapacity_capacity': DataFrame[cnstraint: string, variable: string, coefficient: double],
 'entries_ctCapacity_ship': DataFrame[cnstraint: string, variable: string, coefficient: double],
 'entries_ctDemand_ship': DataFrame[cnstraint: string, variable: string, coefficient: double],
 'entries_ctSupply_open': DataFrame[cnstraint: string, variable: string, coefficient: double],
 'entries_ctSupply_ship': DataFrame[cnstraint: string, variable: string, coefficient: double],
 'entries_dexpr_capacity': DataFrame[cnstraint: string, variable: string, coefficient: double],
 'entries_dexpr_open': DataFrame[cnstraint: string, variable: string, coefficient: d

### 3.5 Creating the Tableau Input Data

The input tables are inserted into an <code>OPLCollector</code> instance:

In [25]:
# Drop the instance-specific keys (location and store), which are not supported in the tableau model
tableauData.buildData()\
    .addTable("booleanColumns",
           tableauTransformations.getTable("columns_open").drop("location"))\
    .addTable("floatColumns",
               tableauTransformations.getTable("columns_capacity").drop("location")\
        .union(tableauTransformations.getTable("columns_ship").drop("location").drop("store")))\
    .addEmptyTable("integerColumns")\
    .addTable("rows",
               tableauTransformations.getTable("rows_ctCapacity").drop("location")\
        .union(tableauTransformations.getTable("rows_ctDemand").drop("store"))\
        .union(tableauTransformations.getTable("rows_ctSupply").drop("location").drop("store"))\
        .union(tableauTransformations.getTable("rows_dexpr")))\
    .addTable("entries",
               tableauTransformations.getTable("entries_ctSupply_open")\
        .union(tableauTransformations.getTable("entries_ctSupply_ship"))\
        .union(tableauTransformations.getTable("entries_ctCapacity_capacity"))\
        .union(tableauTransformations.getTable("entries_ctCapacity_ship"))\
        .union(tableauTransformations.getTable("entries_ctDemand_ship"))\
        .union(tableauTransformations.getTable("entries_dexpr_open"))\
        .union(tableauTransformations.getTable("entries_dexpr_capacity"))\
        .union(tableauTransformations.getTable("entries_dexpr_ship")))\
    .addTable("objectives",
        SPARK_SESSION.createDataFrame(
            [Row(name= "totalCost", sense= "minimize", value= 0.0)],
            tableauData.getADM().get("objectives"))
        .select("name", "sense", "value"))\
.build()
# note: the select clause in the objectives table is needed to ensure the order of the columns so that the JSON serialization works properly

{'booleanColumns': DataFrame[variable: string, upper: int, lower: int, value: int],
 'entries': DataFrame[cnstraint: string, variable: string, coefficient: double],
 'floatColumns': DataFrame[variable: string, upper: double, lower: double, value: double],
 'integerColumns': DataFrame[variable: string, lower: int, upper: int, value: int],
 'objectives': DataFrame[name: string, sense: string, value: double],
 'rows': DataFrame[cnstraint: string, sense: string, rhs: double]}
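
To make the shape of the `entries` table concrete, the following plain-Python sketch expands a handful of (cnstraint, variable, coefficient) triples into a small dense matrix. The triples are invented for illustration and use shortened hypothetical keys (`A`, `S1`); the notebook itself never needs the dense form.

```python
# Hypothetical triples in the layout of the entries table.
entries = [
    ("ctSupply_A_S1", "open_A", 1.0),
    ("ctSupply_A_S1", "ship_A_S1", -1.0),
    ("ctDemand_S1", "ship_A_S1", 1.0),
]

# Row and column labels, in first-seen order.
rows, cols = [], []
for r, c, _ in entries:
    if r not in rows:
        rows.append(r)
    if c not in cols:
        cols.append(c)

# Dense matrix: every cell not named by a triple is an implicit zero.
dense = [[0.0] * len(cols) for _ in rows]
for r, c, coefficient in entries:
    dense[rows.index(r)][cols.index(c)] = coefficient

for label, values in zip(rows, dense):
    print(label, values)
```

Storing only the nonzero triples is what keeps the tableau data manageable: the dense form grows with the product of the row and column counts, while the triple list grows only with the number of nonzero coefficients.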

### 3.6 Solving the Tableau Model

Once these transformations have been applied, the tableau form of the model shown in section 1, <code>tableau_optimization_problem</code>, can be solved:

In [26]:
tableauProblem = Optimizer("TableauProblem", credentials={"url": url, "key": key})\
    .setOPLModel("TableauProblem.mod",
                 modelText=[tableau_data_model, tableau_inputs, tableau_optimization_problem, tableau_outputs])\
    .setResultDataModel(ADMBuilder()\
        .addSchema("booleanDecisions", tableauData.getSchema("booleanColumns"))\
        .addSchema("integerDecisions", tableauData.getSchema("integerColumns"))\
        .addSchema("floatDecisions", tableauData.getSchema("floatColumns"))\
        .addSchema("optimalObjectives", tableauData.getSchema("objectives"))\
        .build())
tableauResult = tableauProblem.solve(tableauData)
tableauProblem.getSolveStatus()

u'OPTIMAL_SOLUTION'

It bears repeating that the tableau model, whether realized in OPL, Python, or any other language, is independent of the structure of the model instance represented by the modeling layer. The modeling layer constructs the tableau data structures through the series of SQL transformations discussed above. Ordinarily, therefore, the model developer need not concern herself with the tableau model, which is not altered during the development process.

### 3.7 Recovering the Warehousing Solution

Once the solver has run, the solution must be mapped back from the tableau form to the original decision variables. As discussed in section 1, this step also entails reshaping the solution arrays into tuple sets. The following transformations are used:

In [27]:
warehousingResult = OPLCollector("warehousingResult", warehousingResultDataModel)
resultsBuilder = warehousingResult.buildData()
resultsBuilder.addTable("objectives",
    tableauResult.getTable("optimalObjectives").select("name", "value")\
    .withColumnRenamed("name", "dExpr")\
    .withColumn("problem", functions.lit("warehousing"))\
    .withColumn("scenarioId", functions.lit(scenarioId))\
    .withColumn("iteration", functions.lit(0)))
resultsBuilder.addTable("openWarehouses",
    (tableauResult.getTable("booleanDecisions").select("variable", "value").withColumnRenamed("value", "open")\
    .join(tableauTransformations.getTable("columns_open"), on="variable")).drop("variable")\
    .join(
        tableauResult.getTable("floatDecisions").select("variable", "value").withColumnRenamed("value", "capacity")\
            .join(tableauTransformations.getTable("columns_capacity"), on="variable").drop("variable"),
        on="location")
    .select("location", "open", "capacity")\
    .where("open > 0")\
    .withColumn("scenarioId", functions.lit(scenarioId))\
    .withColumn("iteration", functions.lit(0)))
floatDecisions = tableauResult.getTable("floatDecisions").select("variable", "value")
resultsBuilder.addTable("shipments",
    floatDecisions\
    .join(tableauTransformations.getTable("columns_ship"), on="variable").drop("variable")\
    .join(demandOnRoute, on=["location", "store"])\
    .withColumn("amount", demandOnRoute["demand"]*(floatDecisions["value"]))\
    .select("location", "store", "amount")\
    .where("amount > 0.0")\
    .withColumn("scenarioId", functions.lit(scenarioId))\
    .withColumn("iteration", functions.lit(0)))
resultsBuilder.build()

warehousingResult.displayTable("objectives")
warehousingResult.displayTable("openWarehouses")
#to see the lengthy shipments table, uncomment the next line
#warehousingResult.displayTable("shipments")

collector: warehousingResult
table: objectives
+-------------+-----------------+-----------+----------+---------+
|dExpr        |value            |problem    |scenarioId|iteration|
+-------------+-----------------+-----------+----------+---------+
|capitalCost  |6373620.0        |warehousing|Nominal   |0        |
|operatingCost|4580688.489999998|warehousing|Nominal   |0        |
|totalCost    |1.095430849E7    |warehousing|Nominal   |0        |
+-------------+-----------------+-----------+----------+---------+

collector: warehousingResult
table: openWarehouses
+-----------------+----+--------+----------+---------+
|location         |open|capacity|scenarioId|iteration|
+-----------------+----+--------+----------+---------+
|San Francisco, CA|1   |2388.0  |Nominal   |0        |
|Los Angeles, CA  |1   |5581.0  |Nominal   |0        |
|Dallas, TX       |1   |1395.0  |Nominal   |0        |
|Chicago, IL      |1   |2720.0  |Nominal   |0        |
|New York, NY     |1   |3961.0  |Nominal   |0        |

As can readily be seen, this solution is identical to the one obtained by directly solving the warehousing model above.
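The core of the recovery step is a join between the solver's (variable, value) pairs and the table that records which tuple each tableau variable stands for. The sketch below reproduces that one join with Python's built-in `sqlite3` module instead of Spark. The variable names (`open_0`, ...) and the three-row toy data are hypothetical. Only the table and column names mirror `booleanDecisions` and `columns_open` from the cell above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE booleanDecisions (variable TEXT, value INTEGER);
    CREATE TABLE columns_open (variable TEXT, location TEXT);

    -- Toy solver output: one value per tableau variable (hypothetical names).
    INSERT INTO booleanDecisions VALUES ('open_0', 1), ('open_1', 0), ('open_2', 1);

    -- Mapping from tableau variable back to the original tuple key.
    INSERT INTO columns_open VALUES
        ('open_0', 'Chicago, IL'), ('open_1', 'Denver, CO'), ('open_2', 'Dallas, TX');
""")

# Join on the variable name, drop it, and keep only the opened warehouses.
open_warehouses = conn.execute("""
    SELECT c.location, d.value AS "open"
    FROM booleanDecisions AS d
    JOIN columns_open AS c ON c.variable = d.variable
    WHERE d.value > 0
    ORDER BY c.location
""").fetchall()
conn.close()

print(open_warehouses)
```

The Spark version above follows the same pattern with `join(..., on="variable")`, `drop("variable")`, and `where("open > 0")`.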

# 4 Conclusion: Equivalence of Tuple Slicing and SQL

In OPL, operations such as this constraint specification
<code>
	forall(s in stores)
	  ctDemand[s]: sum(r in routes: r.store==s.storeId) ship[r] >= 1.0;
</code>
are called *tuple slicing*, in which a filtering condition is applied within an iteration over index sets.

Section 3.4 implemented such slicing operations using their equivalents in SQL (we have stylized the code somewhat to make the connection clearer):
<code>
entries_ctDemand_ship= rows_ctDemand
        .join(columns_ship on store)
        .select(cnstraint, variable)
        .withColumn(coefficient= 1.0)
</code>

A more complex example:
<code>
float demand[routes]= [r: d.amount | r in routes,  d in demands: r.store==d.store];
...
forall(w in warehouses)
 	  ctCapacity[w]: capacity[w] >= sum(r in routes: r.location==w.location) demand[r]*ship[r];
</code>
is implemented as
<code>
demandOnRoute = 
    routes
        .join(demands.where(demands.scenarioId == literal(scenarioId)) on store)
        .select(location, store, amount as demand)
entries_ctCapacity_capacity= 
   rows_ctCapacity
        .join(columns_capacity on location)
        .select(cnstraint, variable)
        .withColumn(coefficient=1.0)
entries_ctCapacity_ship=
    rows_ctCapacity
        .join(columns_ship on location)
        .join(demandOnRoute on location, store)
        .select(cnstraint, variable)
        .withColumn(coefficient= -demandOnRoute.demand)
</code>

This is a general rule: every slicing operation on tuple sets used in creating an optimization model has an equivalent SQL operation on database tables (or dataframes). Thus, one can regard a modeling language such as OPL as simply a different dialect of SQL. 
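As a small check of this equivalence, the sketch below builds the `ship` coefficients of the capacity constraint twice: once with Python comprehensions that imitate OPL tuple slicing, and once with an SQL join run through the built-in `sqlite3` module. The warehouse and store identifiers and the demand figures are made-up toy data, and plain SQL stands in for the Spark dataframe calls used in the notebook.

```python
import sqlite3

# Toy tuple sets (hypothetical data).
routes = [("W1", "S1"), ("W1", "S2"), ("W2", "S2")]   # (location, store)
demands = [("S1", 40.0), ("S2", 25.0)]                # (store, amount)
warehouses = ["W1", "W2"]

# Slicing style: demand[r] is found by filtering demands on r.store == d.store,
# then each warehouse w picks out the routes with r.location == w.
demand = {r: amount
          for r in routes
          for (store, amount) in demands
          if r[1] == store}
sliced = sorted((w, r[1], -demand[r])
                for w in warehouses
                for r in routes
                if r[0] == w)

# SQL style: the same filters expressed as join conditions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE routes (location TEXT, store TEXT)")
conn.execute("CREATE TABLE demands (store TEXT, amount REAL)")
conn.execute("CREATE TABLE warehouses (location TEXT)")
conn.executemany("INSERT INTO routes VALUES (?, ?)", routes)
conn.executemany("INSERT INTO demands VALUES (?, ?)", demands)
conn.executemany("INSERT INTO warehouses VALUES (?)", [(w,) for w in warehouses])
joined = conn.execute("""
    SELECT w.location, r.store, -d.amount AS coefficient
    FROM warehouses AS w
    JOIN routes AS r ON r.location = w.location
    JOIN demands AS d ON d.store = r.store
    ORDER BY w.location, r.store
""").fetchall()
conn.close()

print(sliced == joined)
```

Each `if` filter in the comprehensions corresponds to one `ON` condition in the query, which is the correspondence between slicing and joins described above.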

The primary purpose of this notebook is to demonstrate that data handling in an optimization modeling language is equivalent to operations on relational database tables. It also illustrates the use of Apache Spark for data handling in optimization.

## Author

**Dr. Jeremy Bloom** is an offering manager for IBM Data Science Experience and an expert on decision optimization. In the course of his more than 35-year career, he has led research programs for energy companies and developed software products using operations research to solve practical business problems. Dr. Bloom has a bachelor's degree in Electrical Engineering from Carnegie-Mellon University and a master's degree and doctorate from Massachusetts Institute of Technology in Operations Research.

## References

 - George B. Dantzig, *Linear Programming and Extensions*, Princeton University Press (1963)
 - __[Decision Optimization on Cloud](http://dropsolve-oaas.docloud.ibmcloud.com/software/analytics/docloud)__
 - __[Decision Optimization on Cloud Python API](https://developer.ibm.com/docloud/documentation/docloud/python-api/)__
 - __[IBM Optimization Programming Language (OPL)](http://www.ibm.com/support/knowledgecenter/SSSA5P_12.7.1/ilog.odms.ide.help/OPL_Studio/maps/groupings/opl_Language.html)__

**Copyright © 2017 IBM. This notebook and its source code are released under the terms of the MIT License.**