<a href="https://colab.research.google.com/github/RoetGer/decisions-under-uncertainty/blob/main/data_science_and_stochastic_programming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install cvxpy
!pip install cvxstoc

Collecting cvxstoc
  Downloading https://files.pythonhosted.org/packages/ad/0d/6e47ddb7c55a35c765dc6ddad5b4cc9ade7a0b90fbfa692bf1120819b1d4/cvxstoc-0.2.2-py3-none-any.whl
Collecting pymc>=2.3.4
  Downloading https://files.pythonhosted.org/packages/37/81/9a222c38c65019de9ad5a1ee2448cc4a9b5f7a64eeaf246c77f81c0e6f94/pymc-2.3.8.tar.gz (385kB)
     |████████████████████████████████| 389kB 4.3MB/s
Building wheels for collected packages: pymc
  Building wheel for pymc (setup.py) ... done
  Created wheel for pymc: filename=pymc-2.3.8-cp37-cp37m-linux_x86_64.whl size=1352865 sha256=faa86fb5422cd7303cbfcc08cc7e03cd833cabe99cff4c89058fedb74f274ed0
  Stored in directory: /root/.cache/pip/wheels/0b/a8/e7/8f3ba91a39294d538a92db052fd1fcba1fca74a58c8b022026
Successfully built pymc
Installing collected packages: pymc, cvxstoc
Successfully installed cvxstoc-0.2.2 pymc-2.3.8


# Data Science and Stochastic Programming

In this notebook we explore how stochastic programming can be used to incorporate uncertainty stemming from data science models into our decision-making process.

Let us start by introducing cvxstoc, a Python package for solving stochastic convex optimization problems.

In [78]:
import cvxpy
import cvxstoc
import numpy as np
import pymc

from cvxstoc import NormalRandomVariable, expectation, prob
from cvxpy import Maximize, Problem
from cvxpy.expressions.variable import Variable

In [113]:
# Samples to be taken
num_samples = 100

# Create problem data.
n = 5
mu = np.zeros(n)
Sigma = 0.1*np.eye(n)
returns = NormalRandomVariable(mu, Sigma)
alpha = -0.5
beta = 0.05

# Create the stochastic optimization problem.
weights = Variable(n)
probl = Problem(
    Maximize(expectation(weights.T*returns, num_samples=num_samples)),
    [
      cvxpy.max(weights) <= 0.3,
      weights >= 0, 
      weights.T*np.ones(n) == 1,
      prob(weights.T*returns <= alpha, num_samples=num_samples) <= beta
    ]
)



What we are trying to solve here is a simplified portfolio allocation problem, where the goal is to find a weight vector which maximizes the return under some constraints. 

The main difference from a more classical approach is that we are not working with a fixed vector of returns; instead, we assume that the returns follow a Gaussian distribution (with mean mu and covariance Sigma).

A consequence of this choice is that we are not merely trying to maximize the weighted sum of the returns (= weights.T*returns), but an expectation of this weighted sum with respect to the uncertain returns.
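To make the expectation concrete, here is a minimal NumPy sketch of a sample-average approximation: draw return vectors, compute the weighted sum for each draw, and average. It is not cvxstoc's internal code. The equal weights, the seed, and the helper name `sample_average_return` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Same problem data as in the cell above.
n = 5
mu = np.zeros(n)
Sigma = 0.1 * np.eye(n)

# Illustrative fixed weights (an assumption, not the optimizer's output).
weights = np.full(n, 0.2)


def sample_average_return(weights, num_samples):
    """Estimate E[weights.T * returns] by averaging over sampled returns."""
    draws = rng.multivariate_normal(mu, Sigma, size=num_samples)
    return float(np.mean(draws @ weights))


# For a Gaussian the expectation is available in closed form: weights.T * mu.
exact = float(weights @ mu)

estimate_small = sample_average_return(weights, 100)
estimate_large = sample_average_return(weights, 100_000)

print("exact expectation:      ", exact)
print("estimate, 100 samples:  ", estimate_small)
print("estimate, 100000 samples:", estimate_large)
```

Because mu is the zero vector, the exact expected return is zero for every weight vector. Any nonzero value obtained from 100 samples is therefore sampling noise, which is worth keeping in mind when reading the optimal value reported by the solver.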

Moreover, while the first three constraints are standard (no position may exceed 30% of the portfolio, the weights must be non-negative, and the weights must sum to one), the last one has no counterpart in a deterministic optimization problem. It is a chance constraint: the probability that the portfolio return is at or below alpha = -0.5 (a loss of at least 50%) must not exceed beta = 0.05. In other words, out of 100 samples of the return vector, we would expect at most about 5 to produce a loss of 50% or more with the optimized weights.
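The chance constraint can be checked for any fixed weight vector by counting how often the sampled portfolio return is at or below alpha. The sketch below does this in plain NumPy and compares the result with the closed-form Gaussian probability. The two weight vectors and the helper names are assumptions for illustration, and the fully concentrated portfolio ignores the 30% cap on purpose to show a violation.

```python
import math

import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

n = 5
mu = np.zeros(n)
Sigma = 0.1 * np.eye(n)
alpha, beta = -0.5, 0.05


def empirical_loss_probability(weights, draws, alpha):
    """Fraction of sampled returns with weights.T * returns <= alpha."""
    return float(np.mean(draws @ weights <= alpha))


def gaussian_loss_probability(weights, mu, Sigma, alpha):
    """Closed form: weights.T * returns ~ N(w.T mu, w.T Sigma w)."""
    mean = float(weights @ mu)
    std = math.sqrt(float(weights @ Sigma @ weights))
    return 0.5 * (1.0 + math.erf((alpha - mean) / (std * math.sqrt(2.0))))


draws = rng.multivariate_normal(mu, Sigma, size=200_000)

# Hypothetical portfolios: one diversified, one fully concentrated.
equal = np.full(n, 0.2)
concentrated = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

for label, w in [("equal", equal), ("concentrated", concentrated)]:
    p_emp = empirical_loss_probability(w, draws, alpha)
    p_exact = gaussian_loss_probability(w, mu, Sigma, alpha)
    print(label, "empirical:", p_emp, "closed form:", p_exact,
          "satisfies constraint:", p_emp <= beta)
```

Diversification shrinks the variance of the weighted sum, which is why the equal-weight portfolio satisfies the constraint with a wide margin while the concentrated one does not.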

In [80]:
probl.solve()

print(probl.status)
print("Optimal value:", probl.value)
print("Optimal weights:", weights.value)

optimal
Optimal value: 0.032614506652284166
Optimal weights: [3.00000000e-01 9.99999994e-02 3.00000000e-01 6.16178799e-10
 3.00000000e-01]


While it is fairly straightforward to see how this approach can be integrated with a data science solution (i.e. the data science model provides mean and covariance estimates for the Gaussian distribution), it only works when the model's output can be summarized by one of the parametric distributions that cvxstoc supports.

For example, if we use a Bayesian model to obtain posterior predictive samples, use dropout with a deep learning model to generate samples, or simply do not use one of the distributions currently supported by cvxstoc, we would not be able to solve the resulting optimization problem.
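The following sketch illustrates why a Gaussian summary can be a poor substitute for the samples themselves. It uses synthetic, left-skewed returns as a stand-in for model output (an assumption made purely for illustration), fits a mean and standard deviation to them, and compares the Gaussian tail probability with the frequency observed in the samples.

```python
import math

import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical stand-in for samples produced by a model: portfolio returns
# with mean zero and a heavy left tail, built from a centred lognormal draw.
losses = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
samples = -0.1 * (losses - math.exp(0.5))  # exp(0.5) is the lognormal mean

alpha = -0.5

# Gaussian summary: keep only the first two moments of the samples.
mu_hat = float(samples.mean())
sigma_hat = float(samples.std(ddof=1))


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


p_gaussian = normal_cdf(alpha, mu_hat, sigma_hat)
p_empirical = float(np.mean(samples <= alpha))

print("P(return <= alpha), Gaussian fit:", p_gaussian)
print("P(return <= alpha), from samples:", p_empirical)
```

In this synthetic setup the Gaussian fit understates the probability of a large loss, so a chance constraint evaluated under it would look safer than the samples suggest.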

To simplify working with more complex distributions, we have developed the following function:

In [136]:
import numpy as np
import pymc
from cvxstoc import RandomVariable


def EmpiricalRandomVariable(name,
                            samples,
                            mean,
                            interpolate=False,
                            lower=-np.inf,
                            upper=np.inf):
    '''
    Create a pymc node whose distribution comes either from a
    kernel smoothing density estimate or via bootstrapping from
    the provided samples.
    '''

    if interpolate:
        # Kernel-smoothed density estimate built from the samples.
        rv_pymc = pymc.stochastic_from_data(
            name=name,
            data=samples,
            lower=lower,
            upper=upper)
    else:
        nobs = samples.shape[0]

        def logp(value):
            # Every one of the nobs observations is equally likely.
            return -np.log(nobs)

        def random():
            # Draw one observation (row) uniformly at random.
            ridx = np.random.randint(low=0, high=nobs, size=1)
            return samples[ridx].flatten()

        # Draw once to determine the type of the node's value.
        value = random()
        dtype = type(value)

        rv_pymc = pymc.Stochastic(
            logp=logp,
            doc="A node which bootstrap samples from the provided dataset",
            name=name,
            parents={},
            random=random,
            trace=True,
            dtype=dtype)

    # cvxstoc reads the mean of the random variable from the metadata.
    metadata = {"mu": mean}

    return RandomVariable(rv=rv_pymc, metadata=metadata)
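
The bootstrap branch of the function above can be tried out without pymc or cvxstoc. The sketch below mirrors its `random()` helper in plain NumPy: each draw returns one row of the dataset, chosen uniformly at random with replacement. The synthetic dataset, the seed, and the name `bootstrap_draw` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical dataset: 100 observed return vectors for 5 assets.
samples = rng.normal(loc=0.0, scale=np.sqrt(0.1), size=(100, 5))
nobs = samples.shape[0]


def bootstrap_draw(samples, rng):
    """Return one row of `samples`, picked uniformly at random."""
    ridx = rng.integers(low=0, high=samples.shape[0])
    return samples[ridx]


# Repeated draws resample the dataset with replacement.
draws = np.array([bootstrap_draw(samples, rng) for _ in range(20_000)])

# Each row has probability 1 / nobs, hence log-probability -log(nobs).
log_prob = -np.log(nobs)

print("mean of the dataset:  ", samples.mean(axis=0))
print("mean of the resamples:", draws.mean(axis=0))
print("log-probability of any single row:", log_prob)
```

The resampled mean approaches the dataset mean as the number of draws grows, which is consistent with passing the sample mean as the `mean` argument when the random variable is built from data.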

In [156]:
# Samples to be taken
num_samples = 100

# Create problem data.
n = 5
mu = np.zeros(n)
Sigma = 0.1*np.eye(n)
returns = EmpiricalRandomVariable("EmpiricalRV", 
                                  NormalRandomVariable(mu, Sigma).sample(100),
                                  mean = mu,
                                  interpolate=False)
alpha = -0.5
beta = 0.05

# Create the stochastic optimization problem.
weights = Variable(n)
probl = Problem(
    Maximize(expectation(weights.T*returns, num_samples=num_samples)),
    [
      cvxpy.max(weights) <= 0.3,
      weights >= 0, 
      weights.T*np.ones(n) == 1,
      prob(weights.T*returns <= alpha, num_samples=num_samples) <= beta
    ]
)

probl.solve()

print(probl.status)
print("Optimal value:", probl.value)
print("Optimal weights:", weights.value)



optimal
Optimal value: 0.023429738282685787
Optimal weights: [3.00000000e-01 3.00000000e-01 1.00000000e-01 1.19533753e-11
 3.00000000e-01]
