# Task 3

The solution proposed here is partly taken from the Gurobi [Feature Selection
case](https://colab.research.google.com/github/Gurobi/modeling-examples/blob/master/linear_regression/l0_regression.ipynb),
to which you are referred for a more thorough machine learning pipeline.

## Solution Approach



### Sets and Indices

$i \in I=\{1,2,\dots,n\}$: Set of observations.

$j \in J=\{0,1,2,\dots,p\}$: Set of features, where index $0$ corresponds to the intercept.

$\ell \in L = J \setminus \{0\}$: Set of features, where the intercept is excluded.


### Parameters

$s \in \mathbb{N}$: Number of features to include in the model, ignoring the intercept.


### Decision Variables

$\beta_j \in \mathbb{R}$: Weight of feature $j$, representing the change in the response variable per unit-change of feature $j$.

$z_\ell \in \{0,1\}$: 1 if the weight of feature $\ell$ is forced to be exactly zero, and 0 otherwise. Auxiliary variable used to manage the budget constraint.

### Objective Function

**Training error**: Minimize the Sum of Absolute Residuals (the L1 norm of the residuals), which is more robust with respect to outliers:

\begin{equation*}
\text{Min} \quad Z = \sum_{i \in I}\left|y_i-\sum_{j \in J}\beta_{j}x_{ij}\right|
\end{equation*}

or minimize the Residual Sum of Squares (RSS, the squared L2 norm of the residuals):

\begin{equation*}
\text{Min} \quad Z = \sum_{i \in I}\left(y_i-\sum_{j \in
J}\beta_{j}x_{ij}\right)^2 = \beta^T X^T X\beta- 2y^TX\beta+y^T y
\end{equation*}
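As a quick sanity check of the two training errors and of the expanded quadratic form, the following minimal sketch evaluates them on a tiny hand-made data set. The numbers are made up for illustration and are not the housing data; the first column of `X` is the column of ones for the intercept.

```python
# Hypothetical toy data: 3 observations, intercept column plus one feature.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [3.0, 0.0, 1.5]
beta = [0.5, 1.0]  # beta[0] is the intercept


def predict(row, beta):
    """Linear prediction sum_j beta_j * x_ij for one observation."""
    return sum(b * x for b, x in zip(beta, row))


residuals = [yi - predict(row, beta) for row, yi in zip(X, y)]
sar = sum(abs(r) for r in residuals)  # Sum of Absolute Residuals (L1)
rss = sum(r * r for r in residuals)   # Residual Sum of Squares (L2)

# Expanded form: beta^T X^T X beta - 2 y^T X beta + y^T y
Xb = [predict(row, beta) for row in X]
rss_expanded = (sum(v * v for v in Xb)
                - 2 * sum(yi * v for yi, v in zip(y, Xb))
                + sum(yi * yi for yi in y))

print(sar, rss, rss_expanded)
```

Both ways of computing the RSS should agree up to floating-point rounding.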

In the Lasso approach we add an L1 penalty on the weights (the Lasso term) to the objective function, weighted by a regularization parameter $\lambda \geq 0$:


\begin{equation*}
\text{Min} \quad Z + \lambda \sum_{\ell=1}^p\left|\beta_\ell\right|
\end{equation*}

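The penalty can also be expressed with the squared L2 norm of the weights, which gives Ridge regression rather than Lasso (it shrinks the weights but generally does not set them exactly to zero):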


\begin{equation*}
\text{Min} \quad Z + \lambda \sum_{\ell=1}^p\left(\beta_\ell\right)^2
\end{equation*}
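The two penalty terms can be sketched in a few lines. This is only an illustration with made-up weights; the function names `l1_penalty` and `l2_penalty` are not part of the original notebook. As in the formulas, the intercept (index 0) is not penalized.

```python
def l1_penalty(beta, lam):
    """lam * sum_{l=1..p} |beta_l|; beta[0] is the intercept and is skipped."""
    return lam * sum(abs(b) for b in beta[1:])


def l2_penalty(beta, lam):
    """lam * sum_{l=1..p} beta_l^2; the squared (ridge-type) penalty."""
    return lam * sum(b * b for b in beta[1:])


beta = [2.0, 0.5, -1.5, 0.0]  # hypothetical weights, intercept first
lam = 0.1

print(l1_penalty(beta, lam))  # 0.1 * (0.5 + 1.5 + 0.0)
print(l2_penalty(beta, lam))  # 0.1 * (0.25 + 2.25 + 0.0)
```

Either value is simply added to the training error $Z$ before minimizing.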

### Constraints

Alternatively, advances in mixed integer linear programming allow us to model
feature selection directly by means of binary variables $z_\ell$ and constraints on them.

#### Mutual exclusivity

For each feature $\ell$, if $z_\ell=1$, then $\beta_\ell=0$ (in the
solution by Gurobi the number of non-zero weights is called the $L_0$-norm):

\begin{equation*}
\left|\beta_\ell\right|\leq M (1-z_\ell)
\end{equation*}

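Here the big-M must be a valid upper bound on $\left|\beta_\ell\right|$. Since the
$\beta$ variables are actually unbounded, $M$ has to be chosen large enough not to
cut off the optimal weights, although very large values can cause numerical issues.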

Alternatively, we can express each of these $|L|$ constraints as a Special Ordered
Set of type 1 (SOS-1), meaning that at most one variable in the set is different
from zero (the variables in an SOS-1 can be integer or continuous):

\begin{equation}
(\beta_\ell, z_\ell): \text{SOS-1} \quad \forall \ell \in L
\tag{1}
\end{equation}
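To see how the two formulations of mutual exclusivity relate, the following sketch checks single $(\beta_\ell, z_\ell)$ pairs against each of them. The helper names and the values of `M` are illustrative assumptions, not part of the original notebook.

```python
def satisfies_big_m(beta_l, z_l, M):
    """Big-M form: |beta_l| <= M * (1 - z_l), with z_l in {0, 1}."""
    return abs(beta_l) <= M * (1 - z_l)


def satisfies_sos1(beta_l, z_l, tol=1e-9):
    """SOS-1 form: at most one of (beta_l, z_l) is non-zero."""
    return sum(abs(v) > tol for v in (beta_l, z_l)) <= 1


# Excluded feature, included feature, and a violating pair
print(satisfies_big_m(0.0, 1, M=10), satisfies_sos1(0.0, 1))  # True True
print(satisfies_big_m(2.5, 0, M=10), satisfies_sos1(2.5, 0))  # True True
print(satisfies_big_m(2.5, 1, M=10), satisfies_sos1(2.5, 1))  # False False
# With M too small, the big-M form rejects a legitimate weight
print(satisfies_big_m(2.5, 0, M=1), satisfies_sos1(2.5, 0))   # False True
```

The last line illustrates why the SOS-1 form is attractive here: it needs no bound on the weights at all.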

#### Budget constraint

Exactly $|L| - s$ feature weights must be equal to zero:

\begin{equation}
\sum_{\ell \in L}z_\ell = |L| - s
\tag{2}
\end{equation}
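To get a feel for constraint (2), the sketch below enumerates every binary vector $z$ that satisfies it for a small instance and compares the count with the binomial coefficient. The values $p=8$ and $s=5$ are chosen only as an example.

```python
from itertools import product
from math import comb

p, s = 8, 5  # example: 8 candidate features, 5 allowed in the model

# All binary vectors z with exactly p - s ones (weights forced to zero)
feasible_z = [z for z in product((0, 1), repeat=p) if sum(z) == p - s]

print(len(feasible_z), comb(p, s))  # both counts are 56
```

Each feasible $z$ corresponds to one choice of the $s$ features that may carry a non-zero weight.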

This model, by means of constraint (2), implicitly considers all $\binom{p}{s}$ feature subsets at once. However, we also need to find the value of $s$ that maximizes the performance of the regression on unseen observations. Notice that the training error never increases as more features are allowed, so it is not advisable to use it as the performance metric. Instead, we should estimate the Mean Squared Error (MSE) via cross-validation. This metric is defined as $\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}$, where $y_i$ and $\hat{y}_i$ are the observed and predicted values for the $i$-th observation, respectively. We can then tune $s$ by grid search, which is affordable because $s \leq p$ leaves only a few candidate values.
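The tuning loop can be sketched as follows. This is a minimal, solver-independent outline: `fit_predict` is a hypothetical callable (not part of the original notebook) that trains a model with budget `s` on the training fold and returns predictions for the validation fold; in practice it would wrap one of the optimization models. The folds are contiguous, so the data are assumed to be shuffled beforehand.

```python
def mse(y_true, y_pred):
    """Mean Squared Error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of nearly equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_mse(X, y, s, fit_predict, k=5):
    """Average validation MSE over k folds for a given budget s."""
    scores = []
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        y_hat = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in fold], s)
        scores.append(mse([y[i] for i in fold], y_hat))
    return sum(scores) / len(scores)


def grid_search(X, y, candidates, fit_predict, k=5):
    """Return the budget s with the lowest cross-validated MSE."""
    return min(candidates, key=lambda s: cv_mse(X, y, s, fit_predict, k))
```

A call such as `grid_search(X, y, range(1, p + 1), fit_predict)` would then return the selected budget, after which the model is refitted on the whole training set with that value of $s$.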


### Linearization

L1 terms can be linearized as follows (see slides for another possible way of
linearizing absolute values).

The objective function $\min Z= \sum_{i \in I}\left|y_i-\sum_{j \in
J}\beta_{j}x_{ij}\right|$ becomes

\begin{align*}
\min \quad &\sum_{i \in I} \left(\epsilon_i^+ +\epsilon_i^-\right)\\
 &y_i-\sum_{j \in J}\beta_{j}x_{ij} = \epsilon_i^+ -\epsilon_i^- \quad \forall i \in I\\
 &\epsilon_i^+, \epsilon_i^- \geq 0 \quad \forall i \in I
\end{align*}

and each constraint $\left|\beta_\ell\right|\leq M (1-z_\ell)$ becomes


\begin{align*}
 &\beta_\ell^+ +\beta_\ell^- \leq M(1-z_\ell) \quad \forall \ell \in L\\
 &\beta_{\ell} = \beta_\ell^+ -\beta_\ell^- \quad \forall \ell \in L\\
 &\beta_\ell^+, \beta_\ell^- \geq 0 \quad \forall \ell \in L
\end{align*}
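The idea behind this linearization is that any real number can be written as the difference of two non-negative parts, and that the smallest possible sum of those parts equals its absolute value. A minimal sketch follows; the helper name `split_parts` is illustrative.

```python
def split_parts(r):
    """Return (plus, minus), both >= 0, with plus - minus == r."""
    return max(r, 0.0), max(-r, 0.0)


for r in (1.5, -2.0, 0.0):
    plus, minus = split_parts(r)
    assert plus - minus == r
    assert plus + minus == abs(r)

# Other splits of the same value are feasible but have a larger sum,
# e.g. r = 1.5 written as 2.0 - 0.5:
other_plus, other_minus = 2.0, 0.5
print(other_plus - other_minus, other_plus + other_minus)  # 1.5 2.5
```

In the objective the minimization itself drives one of $\epsilon_i^+, \epsilon_i^-$ to zero. In the constraint on $\beta_\ell$ no such pressure exists, but since $\beta_\ell^+ + \beta_\ell^- \geq \left|\beta_\ell\right|$ always holds, bounding the sum still bounds the absolute value.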


# Python Implementation

In [20]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import math

import gurobipy as gp
from gurobipy import GRB

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing



In [21]:
# Load data and split into train (80%) and test (20%)
housing = fetch_california_housing()
X = housing['data']
y = housing['target']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2,random_state=10101)

In [22]:
scaler = StandardScaler()
scaler.fit(Xtrain)
Xtrain_std = scaler.transform(Xtrain)
Xtest_std = scaler.transform(Xtest)

In [23]:
Xtrain_std.shape[1]

8

In [24]:
# NOTE: This function assumes that the design matrix `features` does not
# contain a column for the intercept
def milp(features, response, non_zero, verbose=False):
    m = gp.Model("fitting")
    if not verbose:
        m.Params.OutputFlag = 0  # silence the solver log

    # Sets
    p = features.shape[1] # n of features
    n = features.shape[0] # n of observations (data points)

    I=range(n)
    J=range(p+1)
    L=range(1,p+1)

    # Create the decision variables
    beta = {}
    for j in J:
        beta[j] = m.addVar(lb=-float('inf'), vtype=GRB.CONTINUOUS, name=f"beta_{j}")

    zeta = {}
    for j in L:
        zeta[j] = m.addVar(vtype=GRB.BINARY, name="zeta"+str(j))
        
    # auxiliary to handle absolute values
    betap = m.addVars(L, lb=0.0, vtype=GRB.CONTINUOUS, name="betap")
    betan = m.addVars(L, lb=0.0, vtype=GRB.CONTINUOUS, name="betan")
            
    # positive and negative parts of the residuals
    errp = m.addVars(I, lb=0.0, vtype=GRB.CONTINUOUS, name="errp")
    errn = m.addVars(I, lb=0.0, vtype=GRB.CONTINUOUS, name="errn")


    # Objective: minimize the sum of absolute residuals
    m.modelSense=gp.GRB.MINIMIZE
    m.setObjective(gp.quicksum(errp[i]+errn[i] for i in I))


    for i in I:
        m.addConstr(errp[i]-errn[i]==response[i]-beta[0]-gp.quicksum(beta[j]*features[i][j-1] for j in L ))

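    # Mutual exclusivity (big-M formulation)
    # Parameter: M must bound |beta[j]|; very large values can hurt numerics
    M=100000    
    for j in L:
        m.addConstr(betap[j]-betan[j]==beta[j])
        m.addConstr(betap[j]+betan[j]<=M*(1-zeta[j]))

    # or, with SOS-1 constraints instead of the big-M constraints:
    # for j in L:
    #    # If zeta[j]=1, then beta[j] = 0
    #    m.addSOS(GRB.SOS_TYPE1, [zeta[j], beta[j]])
        
    # Budget constraint: at least p - non_zero weights are forced to zero
    m.addConstr(gp.quicksum(zeta[j] for j in L)>=p-non_zero)    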

    # Solve
    # m.write("feat_sel.lp")
    m.optimize()
    # m.display()


    if m.status == gp.GRB.status.OPTIMAL:
        print('\nSum of Absolute Residuals (Training Error): %g' % m.ObjVal)
        print('\nbetas:')
        for j in range(p+1):
            if math.fabs(beta[j].X)>0+0.0001:
                print('%s %g' % (j, beta[j].X))
        print('\nzetas:')
        for j in range(1,p+1):
            #if zeta[j].X>0+0.0001:
            print('%s %g' % (j, zeta[j].X))
    else:
        print('No solution')


In [25]:
milp(Xtrain_std, ytrain, 5, True) # 5 features in the model

Gurobi Optimizer version 12.0.0 build v12.0.0rc1 (mac64[arm] - Darwin 24.3.0 24D60)

CPU model: Apple M1 Max
Thread count: 10 physical cores, 10 logical processors, using up to 10 threads

Optimize a model with 16529 rows, 33057 columns and 181688 nonzeros
Model fingerprint: 0x90992897
Variable types: 33049 continuous, 8 integer (8 binary)
Coefficient statistics:
  Matrix range     [9e-06, 1e+05]
  Objective range  [1e+00, 1e+00]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e-01, 1e+05]
Presolve time: 0.09s
Presolved: 16529 rows, 33057 columns, 181688 nonzeros
Variable types: 33049 continuous, 8 integer (8 binary)

Deterministic concurrent LP optimizer: primal and dual simplex (primal and dual model)
Showing primal log only...

Root relaxation presolved: 16529 rows, 33057 columns, 181688 nonzeros

Concurrent spin time: 0.02s

Solved with dual simplex

Root relaxation: objective 8.493985e+03, 172 iterations, 0.14 seconds (0.18 work units)

    Nodes    |    Current Node    |  

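The result looks wrong, plausibly because the large big-M value causes numerical
trouble (note the matrix range [9e-06, 1e+05] in the log). Rerunning with the
SOS-1 constraints (1) in place of the big-M constraints, we obtain the correct
result.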

-- 

The quadratic model:

In [26]:
# NOTE: This function assumes that the design matrix `features` does not
# contain a column for the intercept
def miqp(features, response, non_zero, verbose=False):
    """
    Deploy and optimize the MIQP formulation of L0-Regression.
    """
    assert isinstance(non_zero, (int, np.integer))
    # Create a Gurobi environment and a model object
    with gp.Env() as env, gp.Model("", env=env) as regressor:
        samples, dim = features.shape
        assert samples == response.shape[0]
        assert non_zero <= dim

        # Append a column of ones to the feature matrix to account for the y-intercept
        X = np.concatenate([features, np.ones((samples, 1))], axis=1)

        # Decision variables
        norm_0 = regressor.addVar(lb=non_zero, ub=non_zero, name="norm")
        beta = regressor.addMVar((dim + 1,), lb=-GRB.INFINITY, name="beta") # Weights
        intercept = beta[dim] # Last decision variable captures the y-intercept

        regressor.setObjective(beta.T @ X.T @ X @ beta
                               - 2*response.T @ X @ beta
                               + np.dot(response, response), GRB.MINIMIZE)

        # Budget constraint based on the L0-norm
        regressor.addGenConstrNorm(norm_0, beta[:-1], which=0, name="budget")

        if not verbose:
            regressor.params.OutputFlag = 0
        regressor.params.timelimit = 60
        regressor.params.mipgap = 0.001
        regressor.optimize()

        coeff = np.array([beta[i].X for i in range(dim)])
        return intercept.X, coeff

In [27]:
miqp(Xtrain_std, ytrain, 5, True) # 5 features in the model

Set parameter Username
Set parameter LicenseID to value 2599106
Academic license - for non-commercial use only - expires 2025-12-13
Set parameter TimeLimit to value 60
Set parameter MIPGap to value 0.001
Gurobi Optimizer version 12.0.0 build v12.0.0rc1 (mac64[arm] - Darwin 24.3.0 24D60)

CPU model: Apple M1 Max
Thread count: 10 physical cores, 10 logical processors, using up to 10 threads

Non-default parameters:
TimeLimit  60
MIPGap  0.001

Optimize a model with 0 rows, 10 columns and 0 nonzeros
Model fingerprint: 0xa6f503be
Model has 45 quadratic objective terms
Model has 1 simple general constraint
  1 NORM
Variable types: 10 continuous, 0 integer (0 binary)
Coefficient statistics:
  Matrix range     [0e+00, 0e+00]
  Objective range  [9e+02, 7e+04]
  QObjective range [1e-12, 6e+04]
  Bounds range     [5e+00, 5e+00]
  RHS range        [0e+00, 0e+00]
         Consider reformulating model or setting NumericFocus parameter
         to avoid numerical issues.
       : nonsense results
Pr

(array(2.07596342),
 array([ 0.72353363,  0.12587989,  0.        ,  0.07955282,  0.        ,
         0.        , -0.99118496, -0.94202137]))

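The returned tuple contains the intercept followed by the eight feature coefficients. Three of them are exactly zero, so five features are selected, as requested.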

We refer to the Gurobi solution for a comparison of the models' performance on the
housing learning task.