## Install (Colab Only)

In [1]:
# install
!pip install pyepo


# Optimization Dataset

PyEPO contains a synthetic data generator and a dataset class ``optDataset`` to wrap data samples.

## 1 Data Generator

``pyepo.data`` includes synthetic datasets for four classic optimization problems: the shortest path problem, the multi-dimensional knapsack problem, the traveling salesperson problem, and the portfolio optimization problem.

The synthetic datasets include features $\mathbf{x}$ and cost coefficients $\mathbf{c}$. The feature vector $\mathbf{x}_i \in \mathbb{R}^p$ follows a standard multivariate Gaussian distribution $\mathcal{N}(0, \mathbf{I})$, and the corresponding cost $\mathbf{c}_i \in \mathbb{R}^d$ comes from a polynomial function $f(\mathbf{x}_i)$ multiplied by random noise $\mathbf{\epsilon}_i \sim U(1-\bar{\epsilon}, 1+\bar{\epsilon})$. The details of $f(\mathbf{x}_i)$ are described [here](https://khalil-research.github.io/PyEPO/build/html/content/examples/data.html).

In general, there are several parameters that users can control:
- num_data ($n$): data size
- num_features ($p$): dimension of the feature vector $\mathbf{x}$
- deg ($deg$): polynomial degree of function $f(\mathbf{x}_i)$
- noise_width ($\bar{\epsilon}$): noise half-width of $\mathbf{\epsilon}$
- seed: random state seed to generate data

### 1.1 Shortest Path

$c_i^j = [\frac{1}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j$, where $\mathcal{B}$ is a random matrix.
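
To make the formula concrete, here is a minimal standard-library sketch that transcribes it term by term. It is not PyEPO's implementation: the function name `sketch_shortest_path_costs` is hypothetical, and because the text only says $\mathcal{B}$ is "a random matrix", the sketch assumes independent 0/1 entries.

```python
import math
import random


def sketch_shortest_path_costs(num_data, num_features, num_edges, deg, noise_width, seed):
    """Transcribe the cost formula above; not PyEPO's own implementation."""
    rng = random.Random(seed)
    # Assumption: B has independent 0/1 entries (the text only says "random matrix").
    B = [[rng.randint(0, 1) for _ in range(num_features)] for _ in range(num_edges)]
    feats, costs = [], []
    for _ in range(num_data):
        # Feature vector x_i drawn from a standard Gaussian.
        x = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
        row = []
        for j in range(num_edges):
            bx = sum(b * v for b, v in zip(B[j], x))  # (B x_i)_j
            poly = (bx / math.sqrt(num_features) + 3) ** deg / 3.5 ** deg + 1
            eps = rng.uniform(1 - noise_width, 1 + noise_width)  # multiplicative noise
            row.append(poly * eps)
        feats.append(x)
        costs.append(row)
    return feats, costs


feats_demo, costs_demo = sketch_shortest_path_costs(
    num_data=3, num_features=5, num_edges=40, deg=4, noise_width=0.5, seed=0
)
print(len(feats_demo[0]), len(costs_demo[0]))  # 5 40
```

With an even degree, the bracketed term is at least 1, so every cost produced by this sketch is at least $1 - \bar{\epsilon}$.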

The following code generates data for the shortest path on the **5x5** grid network:

In [2]:
import pyepo
grid = (5,5) # grid size
n = 1000 # number of data
p = 5 # feature dimension
deg = 4 # polynomial degree
e = 0.5 # noise half-width

Auto-Sklearn cannot be imported.


In [3]:
# generate data for grid network (features and costs)
feats, costs = pyepo.data.shortestpath.genData(num_data=n+1000, num_features=p, grid=grid,
                                               deg=deg, noise_width=e, seed=42)

In [4]:
# features
print("Features:")
print(feats[0])
# costs
print("Costs:")
print(costs[0])

Features:
[-0.68002472  0.2322537   0.29307247 -0.71435142  1.86577451]
Costs:
[0.35391761 1.07858994 1.01396442 0.66493631 0.29919599 0.19120229
 0.92962621 0.24648609 0.33950669 0.19379419 0.78927149 0.52358102
 0.66013164 1.0846318  0.73023344 0.53221019 0.34958448 0.33129203
 1.4367641  0.72844442 1.39402493 0.89432676 1.03169003 0.46667478
 0.60684515 1.7148205  1.62555298 2.13901473 0.375338   0.51937908
 1.30751427 2.39109315 0.51398154 1.02980917 0.73931099 0.23779171
 0.35521389 0.25666491 0.70956306 1.69988172]
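
Each cost vector above has 40 entries, one per edge of the 5x5 grid. The sketch below counts the edges under the assumption that every node is linked only to its right and lower neighbours. The helper `grid_edges` and its edge ordering are hypothetical and may differ from the ordering PyEPO uses.

```python
def grid_edges(height, width):
    """List the edges of a grid where each node links to its right and lower neighbour."""
    edges = []
    for r in range(height):
        for c in range(width):
            node = r * width + c
            if c + 1 < width:
                edges.append((node, node + 1))      # edge to the right neighbour
            if r + 1 < height:
                edges.append((node, node + width))  # edge to the lower neighbour
    return edges


edges = grid_edges(5, 5)
print(len(edges))  # 40 = 5 * 4 horizontal + 4 * 5 vertical
```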


### 1.2 Knapsack

Because we assume that the uncertain coefficients exist only on the objective function, the weights of items are fixed throughout the data. We define the number of items as $m$ and the dimension of resources is $k$.

$c_i^j = \lceil [\frac{5}{{3.5}^{deg}} (\frac{1}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} + 1] \cdot \epsilon_i^j \rceil$, where $\mathcal{B}$ is a random matrix.

In [5]:
import pyepo
m = 32 # number of items
k = 2 # resource dimension
n = 1000 # number of data
p = 5 # feature dimension
deg = 4 # polynomial degree
e = 0.5 # noise half-width

In [6]:
# generate data for 2D knapsack (weights, features and costs)
weights, feats, costs = pyepo.data.knapsack.genData(num_data=n+1000, num_features=p, num_items=m,
                                                    dim=k, deg=deg, noise_width=e, seed=42)

In [7]:
# weights
print("Weights of Items:")
print(weights)
# features
print("Features:")
print(feats[0])
# values
print("Values:")
print(costs[0])

Weights of Items:
[[4.02 7.35 6.48 5.7  4.06 3.71 4.88 3.2  4.02 4.21 7.66 5.14 6.3  7.58
  3.87 6.72 3.99 6.59 4.51 4.3  4.49 6.08 5.57 6.43 7.91 7.13 5.93 6.85
  4.91 7.43 5.76 4.6 ]
 [7.59 6.13 3.21 5.52 5.35 6.44 3.48 7.74 3.58 4.69 7.75 4.87 7.63 5.7
  4.89 7.45 4.74 7.45 3.5  6.63 3.54 5.43 6.19 4.3  7.84 6.06 4.34 3.2
  6.28 4.66 5.73 6.87]]
Features:
[-0.60139725 -0.84661436  0.81615272  0.47170729 -0.2854893 ]
Values:
[2. 5. 6. 4. 1. 2. 1. 4. 2. 2. 5. 5. 5. 5. 3. 2. 4. 2. 2. 2. 4. 2. 2. 4.
 3. 2. 1. 3. 4. 2. 3. 5.]
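
The weights form a $k \times m$ array: one row per resource dimension and one column per item. As a rough illustration of what "multi-dimensional" means here, the sketch below checks a selection of items against one capacity per resource. The toy weights, the capacities, and the helper `is_feasible` are all made up for the example; the text above does not say how capacities are chosen.

```python
def is_feasible(weights, selection, capacities):
    """Check every resource dimension of a multi-dimensional knapsack."""
    for row, cap in zip(weights, capacities):
        used = sum(w * s for w, s in zip(row, selection))
        if used > cap:
            return False
    return True


# Toy instance with k = 2 resources and m = 3 items (hypothetical numbers).
toy_weights = [[4.0, 7.0, 6.0],
               [7.0, 6.0, 3.0]]
toy_capacities = [11.0, 10.0]

print(is_feasible(toy_weights, [1, 0, 1], toy_capacities))  # True: uses 10 and 10
print(is_feasible(toy_weights, [1, 1, 0], toy_capacities))  # False: second resource uses 13
```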


### 1.3 Traveling Salesperson

For the Traveling Salesperson Problem (TSP), the number of nodes $m$ is an additional parameter for generating data.

The distance consists of two parts: one comes from Euclidean distance, the other derived from feature encoding. For Euclidean distance, we generate random coordinates. For feature encoding, it is $\frac{1}{{3}^{deg - 1}} (\frac{1}{\sqrt{p}} (\mathcal{B} \mathbf{x}_i)_j + 3)^{deg} \cdot \epsilon_i^j$, where $\mathcal{B}$ is a random matrix.
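
The two-part construction can be sketched with the standard library as follows. Several details are assumptions made for illustration only: the two parts are simply added, there is one cost per unordered pair of nodes, coordinates are uniform in the unit square, and $\mathcal{B}$ has 0/1 entries. The name `sketch_tsp_costs` is hypothetical and this is not PyEPO's code.

```python
import itertools
import math
import random


def sketch_tsp_costs(x, coords, B, deg, noise_width, rng):
    """Euclidean distance plus the feature-encoded term, one value per node pair."""
    p = len(x)
    costs = []
    pairs = itertools.combinations(range(len(coords)), 2)
    for j, (u, v) in enumerate(pairs):
        euclid = math.dist(coords[u], coords[v])
        bx = sum(b * xi for b, xi in zip(B[j], x))  # (B x_i)_j
        eps = rng.uniform(1 - noise_width, 1 + noise_width)
        encoded = (bx / math.sqrt(p) + 3) ** deg / 3 ** (deg - 1) * eps
        costs.append(euclid + encoded)  # assumption: the two parts are summed
    return costs


rng = random.Random(0)
num_nodes, p = 5, 3
num_pairs = num_nodes * (num_nodes - 1) // 2
coords = [(rng.random(), rng.random()) for _ in range(num_nodes)]
B = [[rng.randint(0, 1) for _ in range(p)] for _ in range(num_pairs)]
x = [rng.gauss(0.0, 1.0) for _ in range(p)]

tsp_costs = sketch_tsp_costs(x, coords, B, deg=4, noise_width=0.5, rng=rng)
print(len(tsp_costs))  # 10 pairs for 5 nodes
```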

In [8]:
import pyepo
m = 20 # number of nodes
n = 1000 # number of data
p = 5 # feature dimension
deg = 4 # polynomial degree
e = 0.5 # noise half-width

In [9]:
feats, costs = pyepo.data.tsp.genData(num_data=n+1000, num_features=p, num_nodes=m,
                                      deg=deg, noise_width=e, seed=42)

In [10]:
# features
print("Features:")
print(feats[0])
# costs
print("Costs:")
print(costs[0])

Features:
[-1.14168911 -0.19365946 -0.71682232 -1.86653662 -0.08268069]
Costs:
[ 3.7469  3.5177  3.5663  3.2863 20.8808  3.5013  4.6786  3.663   9.4076
  1.6293  8.884   2.8502  5.8908  4.8622 18.5146  5.7186  0.5103  3.1516
  8.1697  6.4878 10.9322  0.766  12.8021  7.2463 16.3108  3.0362 11.5717
  9.5445  3.901   4.9748  2.0201  2.4792  5.3137 26.4226 19.2363  2.3725
  1.9566  6.9787  6.7527  6.1478  9.8816  4.8733  4.8745  2.1084  6.8356
  7.9441  6.6348  3.2901 10.0526  4.4324  3.8492  6.6119  8.6355 22.0806
  5.5779  1.3496  8.7577  5.0559  6.8631  6.8527  3.5512  3.4764  4.0971
  4.0949  4.0859  7.7309  7.4587  1.2292 34.8099  4.9677 17.7628  3.1806
  6.3381  3.5345 18.8379  6.9102  6.5611  3.818   4.2513 16.7784  4.5815
  5.3674  3.3281 13.4974  2.304   4.8323  3.4517  7.0979  5.5551 10.3592
  3.6593 15.9906  5.1407  4.9056  2.8864  5.8821  4.7064 16.2864  8.4832
  6.8939  3.0092  6.5853  3.5005  6.8579  4.3257  2.9691  5.9071  4.8555
  4.6766  4.2926  3.0147  4.5696  1.5417  2.2

### 1.4 Portfolio

Let $\bar{r}_{ij} = (\frac{0.05}{\sqrt{p}}(\mathcal{B} \mathbf{x}_i)_j + {0.1}^{\frac{1}{deg}})^{deg}$.

The expected return of the assets is defined as $\bar{\mathbf{r}}_i + \mathbf{L} \mathbf{f} + 0.01 \tau \mathbf{\epsilon}$, and the covariance matrix is $\mathbf{L} \mathbf{L}^{\intercal} + (0.01 \tau)^2 \mathbf{I}$, where $\mathcal{B}$, $\mathbf{L}$, $\mathbf{f}$, and $\mathbf{\epsilon}$ are random variables.
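
A small check of the $\bar{r}_{ij}$ formula helps build intuition: when $(\mathcal{B} \mathbf{x}_i)_j = 0$, the offset ${0.1}^{\frac{1}{deg}}$ raised to the power $deg$ gives a baseline return of exactly 0.1, whatever the degree. The helper name `mean_return` below is hypothetical.

```python
import math


def mean_return(bx, num_features, deg):
    """Evaluate r_bar for a given value of (B x_i)_j, as written in the formula above."""
    return (0.05 / math.sqrt(num_features) * bx + 0.1 ** (1 / deg)) ** deg


baseline = mean_return(0.0, num_features=4, deg=4)
print(math.isclose(baseline, 0.1))  # True
print(mean_return(1.0, num_features=4, deg=4) > baseline)  # True: larger (B x)_j raises the return
```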

In [11]:
import pyepo
m = 50 # number of assets
n = 1000 # number of data
p = 4 # feature dimension
deg = 4 # polynomial degree
e = 1 # noise level

In [12]:
cov, feats, revs = pyepo.data.portfolio.genData(num_data=n+1000, num_features=p, num_assets=m,
                                                deg=deg, noise_level=e, seed=42)

In [13]:
# covariance
print("Covariance:")
print(cov)
# features
print("Features:")
print(feats[0])
# revenues
print("Revenues:")
print(revs[0])

Covariance:
[[ 1.00116613e-02  1.04809132e-05  3.26874944e-06 ... -7.32011315e-06
   2.31052654e-06  2.42490156e-06]
 [ 1.04809132e-05  1.00109443e-02  3.14384952e-06 ... -8.34225311e-06
  -3.85284665e-07  3.65860588e-07]
 [ 3.26874944e-06  3.14384952e-06  1.00099803e-02 ... -5.52447038e-06
   1.11220326e-06 -2.30590331e-06]
 ...
 [-7.32011315e-06 -8.34225311e-06 -5.52447038e-06 ...  1.00101297e-02
  -1.57470787e-06  4.22118562e-06]
 [ 2.31052654e-06 -3.85284665e-07  1.11220326e-06 ... -1.57470787e-06
   1.00075920e-02  1.05277517e-07]
 [ 2.42490156e-06  3.65860588e-07 -2.30590331e-06 ...  4.22118562e-06
   1.05277517e-07  1.00065307e-02]]
Features:
[ 1.30547881  0.02100384  0.68195297 -0.31026676]

## 2 Introduction to optDataset

``optDataset`` is a PyTorch ``Dataset`` that stores the features and their corresponding objective-function costs, and **solves optimization problems to get optimal solutions and optimal objective values**.

``optDataset`` is **not** required for training with PyEPO, but it makes it easier to obtain optimal solutions and objective values when they are not available in the original data.

In the following example, ``optDataset`` and the PyTorch ``DataLoader`` wrap the data samples, which keeps model training cleaner and more organized.

### 2.1 Generate Data

We generate data for the shortest path on the 5x5 grid network first.

In [14]:
# grid size
grid = (5,5)

In [15]:
# generate data
num_data = 1000 # number of data
num_feat = 5 # size of feature
deg = 4 # polynomial degree
e = 0.5 # noise half-width
feats, costs = pyepo.data.shortestpath.genData(num_data+1000, num_feat, grid, deg, e, seed=42)

### 2.2 Build OptModel

> "PyTorch provides two data primitives: ``Dataset`` and ``DataLoader`` that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples. "  -- PyTorch Documentation

``optDataset`` extends the PyTorch ``Dataset``. To obtain optimal solutions, ``optDataset`` requires a corresponding ``optModel``, a PyEPO module designed as a container for any "black box" solver. The tutorial on ``optModel`` is [here](https://github.com/khalil-research/PyEPO/blob/main/notebooks/01%20Optimization%20Model.ipynb).

Here we load the pre-defined [``pyepo.model.grb.shortestPathModel``](https://khalil-research.github.io/PyEPO/build/html/content/examples/model.html#pre-defined-models), which uses Gurobi to build a linear program.

In [16]:
from pyepo.model.grb import shortestPathModel
# init model
optmodel = shortestPathModel(grid)

Restricted license - for non-production use only - expires 2025-11-24


### 2.3 Train-Test Split

Split features and costs into random train and test subsets, where the test size is 1000.

In [17]:
# split train test data
from sklearn.model_selection import train_test_split
x_train, x_test, c_train, c_test = train_test_split(feats, costs, test_size=1000, random_state=42)
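
Because the data were generated with `num_data+1000`, there are 2000 samples, so holding out 1000 for testing leaves 1000 for training. The stand-in below illustrates the idea with a shuffled index split. It is not scikit-learn's algorithm, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(num_samples, test_size, seed):
    """Shuffle sample indices and hold out the first `test_size` of them for testing."""
    idx = list(range(num_samples))
    random.Random(seed).shuffle(idx)
    return idx[test_size:], idx[:test_size]


train_idx, test_idx = split_indices(2000, 1000, seed=42)
print(len(train_idx), len(test_idx))  # 1000 1000
```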

### 2.4 Wrap Features and Costs with optDataset

``optDataset`` accepts features $\mathbf{x}$ and cost coefficients $\mathbf{c}$, then utilizes ``optModel`` to find optimal solutions $\mathbf{w}^*$ and objective values $\mathbf{z}^*$.

In [18]:
# get optDataset
dataset_train = pyepo.data.dataset.optDataset(optmodel, x_train, c_train)
dataset_test = pyepo.data.dataset.optDataset(optmodel, x_test, c_test)

Optimizing for optDataset...


100%|██████████| 1000/1000 [00:03<00:00, 290.63it/s]


Optimizing for optDataset...


100%|██████████| 1000/1000 [00:04<00:00, 216.02it/s]


Building the dataset takes time because an optimization problem is solved for every instance.

The ``optDataset`` contains 4 attributes: **features**, **costs**, (optimal) **solutions**, and (optimal) **objective values**.

In [19]:
# features
dataset_train.feats[:5]

array([[ 0.68356932, -1.36595571,  1.21194399,  0.26125053, -0.36927714],
       [-2.8321556 , -0.45115886,  0.5517408 ,  1.20026175, -0.46316136],
       [-0.85238677,  0.47536561,  0.63245422, -0.47417818, -0.77177196],
       [ 0.72894119, -0.69235077, -0.65903927, -0.57410068,  0.5364139 ],
       [-1.23252331,  0.55229994,  0.62563093, -0.69677182,  0.58202657]])

In [20]:
# costs
dataset_train.costs[:5]

array([[0.66423204, 1.31570609, 0.39314494, 1.56369511, 0.75385019,
        1.31562851, 0.64220475, 0.93906923, 0.91311846, 2.04161427,
        0.74907072, 0.5318552 , 0.68541336, 1.23490138, 0.93002909,
        0.44799287, 0.25995308, 0.34913271, 0.41296698, 0.31735299,
        0.21611971, 0.86564175, 0.8231998 , 0.21084682, 0.34851563,
        0.5123044 , 0.64182268, 0.39734889, 0.5353895 , 0.58975985,
        0.82311134, 0.22680251, 0.78549927, 0.87061416, 1.1983596 ,
        0.2857453 , 1.45720774, 0.99978659, 0.30051617, 0.29062359],
       [1.36202037, 1.60539413, 0.43680829, 0.43538718, 0.09457201,
        0.20794562, 0.15560289, 0.10297456, 1.32280424, 0.17186773,
        0.16530315, 0.06899043, 1.08747827, 0.03893738, 0.1008939 ,
        0.02468207, 0.05977335, 1.20703588, 0.90134767, 0.26220181,
        0.6374565 , 0.6714728 , 0.8947684 , 0.11022986, 0.02432527,
        0.50323957, 0.54521407, 0.61761709, 0.08326333, 0.45347036,
        0.04713232, 0.28148922, 0.68087899, 0.0

In [21]:
# solutions
dataset_train.sols[:5]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0.,
        0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 1., 1.],
       [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 1., 1., 1.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 1., 1.]])

In [22]:
# objective values
dataset_train.objs[:5]

array([[4.52934109],
       [1.45274853],
       [2.33050111],
       [2.11744969],
       [2.72324556]])
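
To make the four attributes concrete, here is a minimal stand-in that mimics the idea in plain Python: solve once per cost vector at construction time, then serve `(features, costs, solution, objective)` tuples by index. It is not PyEPO's ``optDataset``. The class `ToyOptDataset` and the one-item solver `pick_cheapest` are hypothetical.

```python
class ToyOptDataset:
    """Stand-in showing the four stored attributes; not PyEPO's optDataset."""

    def __init__(self, solver, feats, costs):
        self.feats = feats
        self.costs = costs
        self.sols, self.objs = [], []
        # Solve one optimization problem per instance, up front.
        for c in costs:
            sol, obj = solver(c)
            self.sols.append(sol)
            self.objs.append([obj])  # one objective value per row, like the output above

    def __len__(self):
        return len(self.costs)

    def __getitem__(self, i):
        return self.feats[i], self.costs[i], self.sols[i], self.objs[i]


def pick_cheapest(cost):
    """Hypothetical solver: choose exactly one item with minimal cost."""
    j = min(range(len(cost)), key=cost.__getitem__)
    sol = [1.0 if k == j else 0.0 for k in range(len(cost))]
    return sol, cost[j]


toy = ToyOptDataset(
    pick_cheapest,
    feats=[[0.1], [0.2]],
    costs=[[3.0, 1.0, 2.0], [0.5, 4.0, 0.7]],
)
x, c, w, z = toy[0]
print(w, z)  # [0.0, 1.0, 0.0] [1.0]
```

Solving at construction time is what makes building the dataset slow but iteration fast: the solver is never called again during training.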

### 2.5 Set PyTorch DataLoader

PyTorch ``DataLoader`` combines a dataset and a sampler, and provides an iterable over the given dataset. We need to set the **batch size**, which is the number of samples processed before the model is updated.

In [23]:
from torch.utils.data import DataLoader
batch_size = 32
loader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
loader_test = DataLoader(dataset_test, batch_size=batch_size, shuffle=False)

Iterating over the ``DataLoader`` yields batches of **features**, **costs**, (optimal) **solutions**, and (optimal) **objective values**.

In [24]:
for x, c, w, z in loader_train:
    # shape of features batch
    print(x.shape)
    # shape of true costs batch
    print(c.shape)
    # shape of true optimal solutions batch
    print(w.shape)
    # shape of true optimal objective values batch
    print(z.shape)
    break

torch.Size([32, 5])
torch.Size([32, 40])
torch.Size([32, 40])
torch.Size([32, 1])
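
Only the first batch is printed above. With 1000 training samples and a batch size of 32, and with the ``DataLoader`` default of keeping the final partial batch (`drop_last=False`), the batch sizes over one epoch work out as in this small sketch. The helper `batch_sizes` is hypothetical.

```python
def batch_sizes(num_samples, batch_size):
    """Sizes of the batches in one epoch when the last partial batch is kept."""
    full, rest = divmod(num_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = batch_sizes(1000, 32)
print(len(sizes), sizes[-1])  # 32 8
```

So an epoch has 31 full batches plus a final batch of 8 samples, and any code that assumes a fixed leading dimension of 32 should account for that last batch.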
