# Practical Optimization for Stats Nerds

Ryan J. O'Neil  
Data Science DC  
March 20, 2017

[ryanjoneil@gmail.com](mailto:ryanjoneil@gmail.com)  
[https://ryanjoneil.github.io](https://ryanjoneil.github.io)  
[@ryanjoneil](https://twitter.com/ryanjoneil)

## Take-aways, in case you want to skip the talk
  
  
* Many statistical techniques are based on some sort of optimization.
* Optimization has many other uses, such as solving decision models.
* Learning to structure problems you already know for optimization solvers is a great way to understand them!


## Least Squares

We observe noisy data from an unknown function. We want to infer that function.

But let's assume that, deep down, we actually know the function. That way we can generate noisy data and see if our techniques work right.

#### Function:
$$y = 3x^2 - 2x + 10 + \epsilon$$

#### Noise:
$$\epsilon \sim N\left(0, 25^2\right)$$

In [1]:
# Generate some random data.
import numpy as np
import random

# Sort the data so they're easier to plot later.
x = [random.uniform(-10, 10) for _ in range(500)]
x.sort()

y = []
for xi in x:
    eps = random.normalvariate(0, 25)
    yi = 3*xi**2 - 2*xi + 10 + eps
    y.append(yi)
    
x = np.array(x)
y = np.array(y)

In [2]:
from bokeh.charts import Scatter, output_notebook, show
output_notebook()

scatter = Scatter({'x': x, 'y': y}, width=750, height=400)
show(scatter)

### Least Squares the way _you_ do it...

...assuming you use `scikit-learn` like every other sane Python programmer.

We looked at a chart of our data and decided to describe it with:

* A quadratic term  
* A linear term
* An offset

In [3]:
from sklearn.linear_model import LinearRegression

# Note: X is our feature matrix.
#       We intentionally add a "1" for the offset, instead of letting 
#       sklearn do that for us. This will make sense soon.

X = np.array([[xi**2, xi, 1] for xi in x])

lin = LinearRegression(fit_intercept=False)
lin.fit(X, y)

print(lin.coef_)

[ 3.01657041 -2.20803262  9.41616619]




In [4]:
from bokeh.charts import Line

# How'd we do?
y_hat = lin.predict(X)
show(Line({'x': x, 'y': y_hat}, x='x', y='y', width=750, height=400))

### Least Squares the way your _grandparents_ did it...

...with chalk and a slab of slate.

Construct a function to calculate the sum of squared residuals...

$$
\begin{align}
    \text{min}\ f(\beta) & = \frac{1}{2} ||y - X \beta||^2 \\
                         & \\
                         & = \frac{1}{2} (y - X \beta)'(y - X \beta) \\
                         & \\
                         & = \frac{1}{2} y'y - y'X\beta + \frac{1}{2} \beta'X'X\beta
\end{align}
$$

#### First Order Necessary Conditions

...and take its derivative to find a closed-form solution.

$$\nabla f(\beta) = X'X\beta - X'y = 0$$  
$$\beta = (X'X)^{-1}X'y$$

In [9]:
from numpy.linalg import inv

# beta = (X'X)^-1 * X' * y
Xt = X.transpose()
pseudo_inv = inv(np.matmul(Xt, X))
beta = np.matmul(np.matmul(pseudo_inv, Xt), y)
print(beta)

[  2.97001072  -2.28578637  10.56301685]
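
Explicitly inverting $X'X$ works on a chalkboard, but in floating point it is usually safer to solve the normal equations $X'X\beta = X'y$ directly. The sketch below does that on freshly generated data (the seed and the use of `numpy`'s random generator are assumptions for illustration, not what the talk used) and checks that the gradient $X'X\beta - X'y$ vanishes at the solution.

```python
import numpy as np

# Assumed setup for illustration: same quadratic as above, seeded for repeatability.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-10, 10, size=500))
y = 3 * x**2 - 2 * x + 10 + rng.normal(0, 25, size=500)

# Feature matrix with quadratic, linear, and offset columns.
X = np.column_stack([x**2, x, np.ones_like(x)])

# Solve X'X beta = X'y without forming the inverse.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# First order condition: the gradient should be (numerically) zero.
gradient = X.T @ X @ beta - X.T @ y
print(beta)
print(np.abs(gradient).max())
```

The recovered coefficients should land near 3, -2, and 10, with the offset being the noisiest of the three.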


In [10]:
# How'd grandpa and grandma do?
y_hat = [beta[0]*xi**2 + beta[1]*xi + beta[2] for xi in x]
show(Line({'x': x, 'y': y_hat}, x='x', y='y', width=750, height=400))

### Least Squares the way your _crazy uncle Eddie_ does it...

...'cause he used to work at NASA and code in Forth.

[`cvxopt`](http://cvxopt.org/) provides a [`qp`](http://cvxopt.org/userguide/coneprog.html#quadratic-programming) function that can solve any convex problem of this form (that is, any with a positive semidefinite $P$).

$$
\begin{align}
    \text{min}  \ \ \ & \ \frac{1}{2} \beta'P\beta + q'\beta \\
                      & \\
    \text{s.t.} \ \ \ & \ G\beta \preceq h \\
                      & \\
                      & \ A\beta = b
\end{align}
$$


So we need to convert from 

$$\frac{1}{2} \beta'X'X\beta - y'X\beta + \frac{1}{2} y'y $$

to another form

$$\frac{1}{2} \beta'P\beta + q'\beta$$

which is simply

$$P = X'X, \quad q = -X'y$$
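
Matching terms makes the mapping explicit. The constant $\frac{1}{2} y'y$ does not depend on $\beta$, so dropping it changes the optimal objective value but not the minimizer:

$$
\begin{align}
    \frac{1}{2} \beta'X'X\beta - y'X\beta + \frac{1}{2} y'y
        & = \frac{1}{2} \beta' \underbrace{(X'X)}_{P} \beta
          + \underbrace{(-X'y)'}_{q'} \beta
          + \underbrace{\frac{1}{2} y'y}_{\text{constant}}
\end{align}
$$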

In [11]:
import cvxopt as cvx

P = cvx.matrix(np.matmul(Xt, X))
q = cvx.matrix(-1 * np.matmul(y.transpose(), X))
solution = cvx.solvers.qp(P, q)
beta = solution['x'] # unrelated to our x
print(beta)

[ 2.97e+00]
[-2.29e+00]
[ 1.06e+01]



In [12]:
# How'd Crazy Uncle Eddie do?
y_hat = [beta[0]*xi**2 + beta[1]*xi + beta[2] for xi in x]
show(Line({'x': x, 'y': y_hat}, x='x', y='y', width=750, height=400))

#### So what's different about Crazy Uncle Eddie?

Well, besides the obvious.

![](images/crazy-uncle-eddie.jpg)

While all three techniques produced the _same result_, Crazy Uncle Eddie's is interesting because it is more general than the others.

Crazy Eddie can solve any convex quadratic optimization problem, of which Least Squares is _just one instance_.

If we change the structure of the problem slightly:

* We probably can't solve it with `scikit-learn`
* Grams and Gramps have to go back to their chalkboard
* Crazy Eddie can update the inputs to his problem and reoptimize
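
To make "changing the structure" concrete, here is a sketch of one hypothetical change: pinning the offset to exactly 10 with an equality constraint $A\beta = b$. With only equality constraints, the quadratic program still reduces to a single linear system (the KKT conditions), which `numpy` can solve directly; inequality constraints such as $G\beta \preceq h$ are where an iterative solver like `qp` earns its keep. The data, the seed, and the constraint itself are assumptions for illustration.

```python
import numpy as np

# Assumed data: same quadratic as above, seeded for repeatability.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-10, 10, size=500))
y = 3 * x**2 - 2 * x + 10 + rng.normal(0, 25, size=500)
X = np.column_stack([x**2, x, np.ones_like(x)])

# Same P and q as the unconstrained least squares problem.
P = X.T @ X
q = -X.T @ y

# Hypothetical constraint: the offset (third coefficient) must equal 10.
A = np.array([[0.0, 0.0, 1.0]])
b = np.array([10.0])

# KKT system for: min (1/2) beta'P beta + q'beta  s.t.  A beta = b
#   [P  A'] [beta  ]   [-q]
#   [A  0 ] [lambda] = [ b]
kkt = np.block([[P, A.T], [A, np.zeros((1, 1))]])
rhs = np.concatenate([-q, b])
solution = np.linalg.solve(kkt, rhs)
beta, multiplier = solution[:3], solution[3]
print(beta)
```

Only `A` and `b` are new here; `P` and `q` are untouched, which is the sense in which Eddie just updates his inputs.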

## Example: Portfolio Optimization

We have a big pot of money to allocate among different investments. Lucky us!

Some investment returns are correlated. They go up and down together.

Other returns are anticorrelated. They tend to do the opposite things.

How do we allocate our money to maximize our expected return, subject to our tolerance for risk?

We'll use 100 months of [exchange rate data](http://www.federalreserve.gov/datadownload/Build.aspx?rel=H10) from the Fed, circa 2014.

Since this talk isn't about data wrangling, I've already cleaned it up into the important pieces.

* Expected monthly return data for each foreign currency
* A covariance matrix for those investments

### The Markowitz Portfolio Optimization Model

#### Inputs:

$$\mu = \text{vector of expected investment returns}$$

$$\Sigma = \text{covariance matrix for returns}$$

$$\alpha = \text{unitless measure of risk aversion}$$

#### Model:

$x$ tells me how much of my total budget to put in each investment.

$$
\begin{align}
    \text{max}  \ \ \ & \ \mu'x - \alpha x' \Sigma x \\
    \text{s.t.} \ \ \ & \ e'x = 1 \\
                      & \ x \ge 0
\end{align}
$$

This may look more familiar.

$$
\begin{align}
    \text{min}  \ \ \ & \ \alpha x' \Sigma x - \mu'x \\
    \text{s.t.} \ \ \ & \ e'x = 1 \\
                      & \ x \ge 0
\end{align}
$$

The only differences between this model and the least squares model are the constraints we've added.

This one forces the model to allocate all of my budget into investments:

$$e'x = 1$$

This one disallows the model from making negative investments:

$$x \ge 0$$

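But wait! The original model is a _maximization_ problem, and the least squares model used $\text{min}$. That's why the second form above flips the signs in the objective: maximizing $f(x)$ is the same as minimizing $-f(x)$.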
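
A toy check of that max/min equivalence, using two made-up assets (the returns and covariances below are invented for illustration, not taken from the Fed data). With two assets and $e'x = 1$, every feasible portfolio is $(t, 1 - t)$ for $t \in [0, 1]$, so a grid search is enough to see that the maximizer of the objective is the minimizer of its negation, and that a larger $\alpha$ shifts money toward the lower-variance asset.

```python
import numpy as np

# Invented inputs: asset 0 has the higher return and the higher variance.
mu = np.array([0.20, 0.05])
sigma = np.array([[10.0, 1.0],
                  [1.0, 2.0]])

def objective(x, alpha):
    # Expected return minus the risk penalty.
    return mu @ x - alpha * (x @ sigma @ x)

def best_weight(alpha, sign):
    # Grid search over feasible portfolios (t, 1 - t) with 0 <= t <= 1.
    # sign=+1 maximizes f; sign=-1 minimizes -f.
    grid = np.linspace(0, 1, 1001)
    values = np.array([sign * objective(np.array([t, 1 - t]), alpha) for t in grid])
    index = values.argmax() if sign > 0 else values.argmin()
    return grid[index]

for alpha in [0.0, 0.01, 0.1, 1.0]:
    t_max = best_weight(alpha, +1)
    t_min = best_weight(alpha, -1)
    print(alpha, t_max, t_min)
```

Each row should show the same weight twice, with the weight on the riskier asset falling as `alpha` grows.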

In [13]:
# Read in the returns and covariance data.
import pandas as pd
exp_returns = pd.read_csv('portfolio-optimization/currency-returns.csv')
exp_returns.head()

| | Unnamed: 0 | mean | variance |
|---|---|---|---|
| 0 | `RXI$US_N.M.AL` | 0.152151 | 10.996718 |
| 1 | `RXI$US_N.M.EU` | 0.002683 | 5.928188 |
| 2 | `RXI$US_N.M.NZ` | 0.220217 | 9.690793 |
| 3 | `RXI$US_N.M.UK` | -0.159779 | 5.098969 |
| 4 | `RXI_N.M.BZ` | -0.128507 | 12.743632 |


In [14]:
returns_cov = pd.read_csv('portfolio-optimization/currency-covariance.csv', header=None)
returns_cov.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.996718 | 5.181782 | 8.654762 | 4.457704 | 9.789818 | 5.421746 | 0.194428 | 5.166875 | 0.028496 | 4.379632 | ... | 6.494732 | 7.078498 | 7.083824 | 8.574515 | 3.287411 | 0.343673 | 4.163674 | 2.450772 | 2.381722 | 0.54245 |
| 1 | 5.181782 | 5.928188 | 4.48789 | 3.659548 | 4.686754 | 2.467804 | 0.313081 | 5.867674 | 0.038117 | 2.5452 | ... | 2.894461 | 5.350733 | 5.515164 | 4.31448 | 2.265781 | 0.418269 | 5.173168 | 1.559857 | 1.351946 | 1.776855 |
| 2 | 8.654762 | 4.48789 | 9.690793 | 4.330531 | 8.130412 | 4.268307 | 0.129303 | 4.466514 | 0.017975 | 3.91917 | ... | 5.536301 | 5.602586 | 6.054567 | 6.733828 | 2.811946 | 0.451742 | 3.90326 | 2.135899 | 2.569281 | 0.379954 |
| 3 | 4.457704 | 3.659548 | 4.330531 | 5.098969 | 4.301933 | 2.5731 | 0.137068 | 3.632088 | 0.01242 | 1.821456 | ... | 2.775715 | 4.081803 | 4.233658 | 3.749874 | 1.78062 | 0.326828 | 3.335175 | 1.356927 | 1.014109 | 1.584577 |
| 4 | 9.789818 | 4.686754 | 8.130412 | 4.301933 | 12.743632 | 5.433992 | 0.279201 | 4.689891 | -0.034405 | 5.567318 | ... | 7.177333 | 7.288341 | 6.774689 | 8.761299 | 2.99936 | 0.843294 | 4.269524 | 2.163251 | 2.439204 | 0.496304 |


In [15]:
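# A model that returns the expected return of the optimal portfolio
# for any level of risk aversion.
def portfolio(alpha):
    # qp minimizes (1/2) x'Px + q'x, so we negate the returns to maximize them.
    # Note the 1/2: P = alpha * Sigma weights the risk term by alpha / 2.
    # (.values replaces .as_matrix(), which newer pandas versions removed.)
    P = cvx.matrix(alpha * returns_cov.values)
    q = cvx.matrix(-exp_returns['mean'].values)
    n = len(q)

    # G x <= h with G = -I and h = 0 encodes x >= 0.
    G = cvx.matrix(0.0, (n, n))
    G[::n+1] = -1.0
    h = cvx.matrix(0.0, (n, 1))

    # A x = b with A a row of ones encodes e'x = 1.
    A = cvx.matrix(1.0, (1, n))
    b = cvx.matrix(1.0)

    solution = cvx.solvers.qp(P, q, G, h, A, b)
    return exp_returns['mean'].dot(solution['x'])[0]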
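
The `G`, `h`, `A`, and `b` arguments are easier to read in plain `numpy`. The strided assignment `G[::len(q)+1] = -1.0` writes to every $(n+1)$-th entry of the flattened matrix, which is its diagonal, so `G` is $-I$. The sketch below builds the same shapes for a hypothetical three-asset problem and checks two candidate portfolios against them; the helper name `is_feasible` and its tolerance are assumptions for illustration.

```python
import numpy as np

n = 3  # hypothetical number of assets

# G x <= h with G = -I, h = 0 is the same as x >= 0.
G = -np.eye(n)
h = np.zeros(n)

# A x = b with A a row of ones, b = 1 is the same as e'x = 1.
A = np.ones((1, n))
b = np.array([1.0])

def is_feasible(x, tol=1e-9):
    # Hypothetical helper: check the inequality and equality blocks.
    return bool(np.all(G @ x <= h + tol) and np.allclose(A @ x, b, atol=tol))

print(is_feasible(np.array([0.5, 0.5, 0.0])))   # fully invested, no shorts
print(is_feasible(np.array([1.5, -0.5, 0.0])))  # sums to 1 but shorts asset 1
```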

In [16]:
risk_aversion = [ra/2.0 for ra in range(41)]
returns = [portfolio(alpha) for alpha in risk_aversion]

     pcost       dcost       gap    pres   dres
 0: -1.4109e+00 -1.2870e+00  6e+01  9e+00  5e+00
 1: -1.8207e-02 -1.2285e+00  1e+00  5e-15  8e-16
 2: -6.2411e-02 -2.9637e-01  2e-01  7e-16  7e-16
 3: -1.4238e-01 -3.5309e-01  2e-01  8e-16  5e-16
 4: -2.8201e-01 -2.9496e-01  1e-02  4e-16  5e-16
 5: -2.8688e-01 -2.8703e-01  2e-04  1e-16  3e-16
 6: -2.8695e-01 -2.8695e-01  2e-06  3e-16  2e-16
 7: -2.8695e-01 -2.8695e-01  2e-08  1e-16  5e-16
Optimal solution found.
     pcost       dcost       gap    pres   dres
 0: -2.6295e-01 -1.3019e+00  4e+01  5e+00  6e+00
 1: -3.1048e-02 -1.0159e+00  2e+00  2e-01  2e-01
 2:  3.3465e-02 -3.1889e-01  5e-01  3e-02  3e-02
 3: -1.3476e-01 -2.5385e-01  1e-01  2e-17  5e-16
 4: -1.9154e-01 -2.0318e-01  1e-02  2e-16  6e-16
 5: -2.0001e-01 -2.0086e-01  9e-04  2e-16  5e-16
 6: -2.0070e-01 -2.0074e-01  4e-05  1e-16  8e-16
 7: -2.0073e-01 -2.0073e-01  1e-06  2e-16  1e-15
 8: -2.0073e-01 -2.0073e-01  1e-08  1e-16  3e-16
Optimal solution found.
     pcost       dcost 

In [17]:
show(Line(
    {'risk aversion': risk_aversion, 'expected return': returns}, 
    x='risk aversion', 
    y='expected return', 
    width=750, 
    height=400
))

## Support Vector Machines

We want to draw a line that separates two sets with as little misclassification as possible.

If $f(x) \le 5$, the point is of type $-1$, otherwise it is type $+1$.

$$f(x) = (x_1 + \epsilon_1) - (x_2+ \epsilon_2)$$

$$\epsilon_i \sim N\left(0, 1.25^2\right)\ \forall\ i \in \{1, 2\}$$


In [280]:
x1 = [random.uniform(0, 10) for _ in range(150)]
x2 = [random.uniform(0, 10) for _ in range(150)]

y = []
for xi_1, xi_2 in zip(x1, x2):
    eps_1 = random.normalvariate(0, 1.25)
    eps_2 = random.normalvariate(0, 1.25)
    
    if (xi_1 + eps_1) - (xi_2 + eps_2) <= 5:
        y.append(-1)
    else:
        y.append(+1)
        
X = np.array(list(zip(x1, x2)))
y = np.array(y)

In [281]:
show(Scatter(
    {'x1': x1, 'x2': x2, 'y': y}, 
    x='x1', y='x2', color='y', marker='y', palette=['lightblue', 'orange'],
    width=750, height=400
))

### Support Vector Machines a la `scikit-learn`

In [284]:
from sklearn import metrics, svm
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

y_hat = clf.predict(X)

print('coefficients:', clf.intercept_, clf.coef_)
print('confusion matrix:')
metrics.confusion_matrix(y_true=y, y_pred=y_hat)

coefficients: [-2.53047237] [[ 0.61340886 -0.65958552]]
confusion matrix:


array([[117,   4],
       [  6,  23]])
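
For a linear kernel, the fitted model is just a line: a point is classified by the sign of $w'x + b$, where $w$ is `clf.coef_` and $b$ is `clf.intercept_`. The sketch below applies that rule by hand using the coefficients printed above; the two test points are made up for illustration.

```python
import numpy as np

# Coefficients copied from the output above.
w = np.array([0.61340886, -0.65958552])
b = -2.53047237

def classify(point):
    # The sign of w'x + b decides the class.
    return 1 if w @ point + b >= 0 else -1

print(classify(np.array([9.0, 1.0])))  # large x1, small x2
print(classify(np.array([1.0, 9.0])))  # small x1, large x2
```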

In [286]:
correct = ['Correct' if yi == yh_i else 'Incorrect' for yi, yh_i in zip(y, y_hat)]
show(Scatter(
    {'x1': x1, 'x2': x2, 'y': y, 'correct': correct}, 
    x='x1', y='x2', color='correct', palette=['lightgrey', 'darkred'], marker='correct',
    width=750, height=400
))

### Support Vector Machines a la Linear Optimization

In [282]:
import pulp

m = pulp.LpProblem(sense=pulp.LpMinimize)

w1 = pulp.LpVariable('w1')
w2 = pulp.LpVariable('w2')
b = pulp.LpVariable('b')

errors = []
for i, (xi_1, xi_2, y_i) in enumerate(zip(x1, x2, y)):
    e = pulp.LpVariable('e_%d' % (i+1), lowBound=0)
    m += y_i * (w1 * xi_1 + w2 * xi_2 + b) >= (1 - e)
    errors.append(e)
    
m.setObjective(sum(errors))
assert m.solve() == pulp.LpStatusOptimal

def classify(xi_1, xi_2):
    # The sign of w'x + b decides the class.
    score = w1.value() * xi_1 + w2.value() * xi_2 + b.value()
    return 1 if score >= 0 else -1

y_hat = [classify(xi_1, xi_2) for xi_1, xi_2 in zip(x1, x2)]

print('coefficients:', b.value(), w1.value(), w2.value())
print('confusion matrix:')
metrics.confusion_matrix(y_true=y, y_pred=y_hat)

coefficients: -2.7841666 0.65345083 -0.67781657
confusion matrix:


array([[117,   4],
       [  6,  23]])
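
At an optimal solution each error variable takes the smallest value its constraints allow, $e_i = \max(0,\ 1 - y_i (w'x_i + b))$, so the linear program minimizes the total hinge loss. Unlike the standard soft-margin SVM, the objective has no $||w||^2$ term, which is what keeps the model linear. A sketch of that loss in `numpy`, using the coefficients printed above and three made-up points:

```python
import numpy as np

# Coefficients copied from the output above.
w = np.array([0.65345083, -0.67781657])
b = -2.7841666

def hinge_errors(X, y):
    # e_i = max(0, 1 - y_i * (w'x_i + b)), one value per point.
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

# Made-up points: one well inside each class, one labeled +1 on the wrong side.
X = np.array([[9.0, 1.0], [1.0, 9.0], [2.0, 8.0]])
y = np.array([1, -1, 1])

errors = hinge_errors(X, y)
print(errors, errors.sum())
```

Points classified correctly with a margin of at least 1 contribute nothing; only the third point adds to the objective.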

In [283]:
correct = ['Correct' if yi == yh_i else 'Incorrect' for yi, yh_i in zip(y, y_hat)]
plot = Scatter(
    {'x1': x1, 'x2': x2, 'y': y, 'correct': correct}, 
    x='x1', y='x2', color='correct', palette=['lightgrey', 'darkred'], marker='correct',
    width=750, height=400
)

# The separating line is w1*x1 + w2*x2 + b = 0, so x2 = -(b + w1*x1) / w2.
svm_x1 = np.linspace(0, 10)
svm_x2 = [-(b.value() + w1.value() * x1_i) / w2.value() for x1_i in svm_x1]
plot.line(svm_x1, svm_x2)
show(plot)