# Integrate a random forest in a Gurobi model

*Note: The resulting model in this example will be too large for a size-limited license; in order to solve it, please visit <https://www.gurobi.com/free-trial> for a full license*

In this notebook, we solve the student admission problem
[shown in the documentation](https://gurobi-optimization-gurobi-machine-learning.readthedocs-hosted.com/en/latest/mlm-examples/student_admission.html) using a random forest regressor.

In [None]:
import gurobipy as gp
import gurobipy_pandas as gppd
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from gurobi_ml import add_predictor_constr

We now retrieve the historical data used to build the regression from the Janos
repository.

The features we use for the regression are `"merit"` (scholarship), `"SAT"` and
`"GPA"` and the target is `"enroll"`. We store those values.

In [None]:
# Base URL for retrieving data
janos_data_url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
historical_data = pd.read_csv(
    janos_data_url + "college_student_enroll-s1-1.csv", index_col=0
)

# Features used by the regression: "merit" becomes a decision variable of the
# optimization problem, while "SAT" and "GPA" stay fixed
features = ["merit", "SAT", "GPA"]
target = "enroll"

## Fit the regression

For the regression, we use a random forest regressor from `scikit-learn` with
10 trees of maximum depth 5.

In [None]:
# Run our regression
regression = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=1)

regression.fit(X=historical_data.loc[:, features], y=historical_data.loc[:, target])
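
A random forest regressor predicts by averaging the predictions of its individual trees. The following minimal sketch illustrates this with three hand-written one-split trees; the helper `make_stump`, the thresholds and the leaf values are hypothetical and are not taken from the fitted model above.

```python
from statistics import mean


def make_stump(threshold, left_value, right_value):
    """Build a one-split tree: go left when the feature is <= threshold."""

    def predict(merit):
        return left_value if merit <= threshold else right_value

    return predict


# Three hypothetical trees that split on the "merit" feature only
trees = [
    make_stump(0.5, 0.2, 0.6),
    make_stump(1.0, 0.3, 0.7),
    make_stump(2.0, 0.1, 0.9),
]


def forest_predict(merit):
    """The forest prediction is the mean of the tree predictions."""
    return mean(tree(merit) for tree in trees)


prediction = forest_predict(0.8)
print(prediction)
```

Each additional tree therefore adds its own set of split decisions to whatever model has to reproduce the forest's output.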

### Optimization Model

We now turn to building the mathematical optimization model for Gurobi.

First, retrieve the data for the new students. We won't use all of it; instead,
we randomly pick 100 students.

In [None]:
# Retrieve new data used to build the optimization problem
studentsdata = pd.read_csv(janos_data_url + "college_applications6000.csv", index_col=0)

nstudents = 100

# Randomly select nstudents rows from the data
studentsdata = studentsdata.sample(nstudents)

Now build the model as in the documentation example.

In [None]:
m = gp.Model()

y = gppd.add_vars(m, studentsdata, name="enroll_probability")

# Add variable for merit
studentsdata = studentsdata.gppd.add_vars(m, lb=0.0, ub=2.5, name="merit")

# Keep only features
studentsdata = studentsdata.loc[:, features]
# Denote by x the (variable) "merit" feature
x = studentsdata.loc[:, "merit"]

m.setObjective(y.sum(), gp.GRB.MAXIMIZE)

m.addConstr(x.sum() <= 0.2 * nstudents)

pred_constr = add_predictor_constr(m, regression, studentsdata, y)

pred_constr.print_stats()
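
Read off the code above, the model can be written as follows. The notation is introduced here for illustration: $n$ is `nstudents`, $x_i$ is the merit variable of student $i$, $y_i$ is the corresponding enrollment probability variable, and $g$ stands for the fitted random forest.

$$
\begin{aligned}
\max \quad & \sum_{i=1}^{n} y_i \\
\text{s.t.} \quad & \sum_{i=1}^{n} x_i \le 0.2\, n, \\
& y_i = g(x_i, \mathrm{SAT}_i, \mathrm{GPA}_i), && i = 1, \dots, n, \\
& 0 \le x_i \le 2.5, && i = 1, \dots, n.
\end{aligned}
$$

The equalities $y_i = g(\cdot)$ are the part that `add_predictor_constr` formulates with additional variables and constraints.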

We can now optimize the problem.

In [None]:
m.optimize()

We print the maximal approximation error among all students.

In [None]:
print(
    "Error in approximating the regression {:.6}".format(
        np.max(np.abs(pred_constr.get_error()))
    )
)

Note that the error is actually large. This is because, by default, our model may take the wrong path in a decision tree when a feature value is exactly equal to the splitting value of a node. See [the documentation](https://gurobi-machinelearning.readthedocs.io/en/more-docs/mlm-mip-models.html#decision-trees) for more explanation.

To circumvent this, we can pass the `epsilon` parameter to `add_predictor_constr`.

In [None]:
# Remove pred_constr
pred_constr.remove()

# Add new constraint setting epsilon to 1e-5
pred_constr = add_predictor_constr(m, regression, studentsdata, y, epsilon=1e-5)

pred_constr.print_stats()

m.optimize()
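
To see why a positive `epsilon` can help, the sketch below mimics a single split `x <= threshold`. It is a simplified illustration and not the package's actual formulation: the function `allowed_branches` and the tolerance `tol` are hypothetical, and we assume that the right branch is only allowed when `x >= threshold + epsilon`.

```python
def allowed_branches(x, threshold, epsilon=0.0, tol=1e-6):
    """Return the branches of the split `x <= threshold` that remain feasible.

    Assumed rules (up to a numerical tolerance `tol`):
      left  is feasible when x <= threshold
      right is feasible when x >= threshold + epsilon
    """
    branches = []
    if x <= threshold + tol:
        branches.append("left")
    if x >= threshold + epsilon - tol:
        branches.append("right")
    return branches


threshold = 1.5

# Feature value exactly at the splitting value, epsilon = 0: both branches fit
ties_default = allowed_branches(1.5, threshold)

# Same feature value with a small positive epsilon: only the left branch fits
ties_with_eps = allowed_branches(1.5, threshold, epsilon=1e-5)

print(ties_default)
print(ties_with_eps)
```

When both branches are feasible, an optimizer is free to pick the one that improves the objective, even if the regression itself would go the other way.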

We print the maximal error among all students, which should now be almost 0.

In [None]:
print(
    "Error in approximating the regression {:.6}".format(
        np.max(np.abs(pred_constr.get_error()))
    )
)
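
As a rough mental model of the quantity printed above, assume that the error for each student is the gap between the regression's prediction at the solution's input values and the value of the corresponding output variable. The sketch below computes the largest such gap; the helper `max_abs_error` and all numbers are hypothetical and are not results from this notebook.

```python
def max_abs_error(predictions, solution_values):
    """Largest absolute gap between predictions and solver output values."""
    return max(abs(p - s) for p, s in zip(predictions, solution_values))


# Hypothetical values for three students (not taken from the notebook's data)
predictions = [0.25, 0.5, 0.75]  # regression evaluated at the solution's inputs
solution_values = [0.25, 0.5, 0.5]  # values of the y variables in the solution

worst_gap = max_abs_error(predictions, solution_values)
print(f"Maximal error: {worst_gap:.6}")
```

A single student whose tree path differs from the regression's is enough to make this maximum large.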

### Look at the solution

In [None]:
# This is what we predicted
plt.scatter(x.gppd.X, y.gppd.X)

Copyright © 2023 Gurobi Optimization, LLC