# Embed a Decision Tree in a Gurobi model

*Note: The resulting model in this example will be too large for a size-limited license; in order to solve it, please visit https://www.gurobi.com/free-trial for a full license*

In this notebook, we solve the student admission problem
[shown in the documentation](https://gurobi-optimization-gurobi-machine-learning.readthedocs-hosted.com/en/latest/mlm-examples/student_admission.html) using a decision tree regressor.

### Extra required packages:

- matplotlib
- pandas

In [None]:
import gurobipy as gp
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from gurobi_ml import add_predictor_constr

We now retrieve the historical data used to build the regression from the Janos
repository.

The features we use for the regression are `"merit"` (scholarship), `"SAT"` and
`"GPA"` and the target is `"enroll"`. We store those values.

In [None]:
# Base URL for retrieving data
janos_data_url = "https://raw.githubusercontent.com/INFORMSJoC/2020.1023/master/data/"
historical_data = pd.read_csv(
    janos_data_url + "college_student_enroll-s1-1.csv", index_col=0
)

# Features used by the regression and the target it predicts.
# Only "merit" is a decision of the optimization problem; "SAT" and "GPA" are fixed.
features = ["merit", "SAT", "GPA"]
target = "enroll"

## Fit the regression

For the regression, we use a `DecisionTreeRegressor` from `scikit-learn`,
limited to a depth of 10 and at most 50 leaf nodes.

In [None]:
# Run our regression
regression = DecisionTreeRegressor(max_depth=10, max_leaf_nodes=50, random_state=1)

regression.fit(X=historical_data.loc[:, features], y=historical_data.loc[:, target])

### Optimization Model

We now turn to building the mathematical optimization model for Gurobi.

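First, retrieve the data for the new students. We won't use all of it; instead,
we randomly pick 500 students. The sample is drawn without a fixed seed, so the
selected students (and the solution) change from run to run.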

In [None]:
# Retrieve new data used to build the optimization problem
studentsdata = pd.read_csv(janos_data_url + "college_applications6000.csv", index_col=0)

nstudents = 500

# Randomly select nstudents rows from the data
studentsdata = studentsdata.sample(nstudents)

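Now build the model as in the documentation example: `merit` is a decision
variable bounded between 0 and 2.5, while `SAT` and `GPA` are fixed to each
student's values by giving them equal lower and upper bounds.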
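For reference, the model built in the next cell can be written as follows. The symbols are introduced here for illustration only: $n$ is the number of students, $x_i$ the merit scholarship offered to student $i$, $y_i$ the predicted enrollment, and $g$ the fitted decision tree.

$$
\begin{aligned}
\max \quad & \sum_{i=1}^{n} y_i \\
\text{s.t.} \quad & \sum_{i=1}^{n} x_i \le 0.2\, n, \\
& y_i = g(x_i, \mathrm{SAT}_i, \mathrm{GPA}_i), && i = 1, \dots, n, \\
& 0 \le x_i \le 2.5, && i = 1, \dots, n.
\end{aligned}
$$

The budget constraint is what couples the students: without it, each $x_i$ could be chosen independently.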

In [None]:
# Construct lower bounds data frame
feat_lb = studentsdata.copy()
feat_lb.loc[:, "merit"] = 0

# Construct upper bounds data frame
feat_ub = studentsdata.copy()
feat_ub.loc[:, "merit"] = 2.5

# Make sure the columns are ordered in the same way as for the regression model.
feat_lb = feat_lb[features]
feat_ub = feat_ub[features]

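# Start with the classical part of the model
m = gp.Model()

# One row per student, one column per feature; SAT and GPA are fixed by equal bounds
feature_vars = m.addMVar(
    feat_lb.shape, lb=feat_lb.to_numpy(), ub=feat_ub.to_numpy(), name="feats"
)

# Predicted enrollment for each student
y = m.addMVar(nstudents, name="y")

# Column of feature_vars holding the merit scholarship offered to each student
x = feature_vars[:, feat_lb.columns.get_indexer(["merit"])][:, 0]

# Maximize the total predicted enrollment
m.setObjective(y.sum(), gp.GRB.MAXIMIZE)

# Scholarship budget: at most 0.2 per student on average
m.addConstr(x.sum() <= 0.2 * nstudents)

# Link y to feature_vars through the fitted decision tree
pred_constr = add_predictor_constr(m, regression, feature_vars, y)

pred_constr.print_stats()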
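To build some intuition for what the predictor constraint has to encode, it can help to remember that a regression tree is a piecewise-constant function: each leaf covers a box of feature values and returns a single number. The sketch below uses a hand-written toy tree with made-up thresholds and leaf values. It is not the fitted `regression`, and it does not claim to show how `gurobi_ml` represents trees internally; `TOY_TREE`, `predict` and `reachable_leaves` are hypothetical names. For one student with fixed `SAT` and `GPA`, it lists the leaves that can be reached by varying `merit` alone.

```python
# Toy regression tree; every threshold and leaf value is made up for illustration.
# Internal node: (feature, threshold, left_subtree, right_subtree); leaf: float.
TOY_TREE = (
    "merit", 1.0,
    ("GPA", 3.0, 0.10, 0.25),      # branch taken when merit <= 1.0
    ("SAT", 1200.0, 0.60, 0.40),   # branch taken when merit > 1.0
)


def predict(tree, sample):
    """Follow the splits until a leaf (a float) is reached."""
    while not isinstance(tree, float):
        feature, threshold, left, right = tree
        tree = left if sample[feature] <= threshold else right
    return tree


def reachable_leaves(tree, fixed, lo=0.0, hi=2.5):
    """Return (merit_lo, merit_hi, value) for each leaf reachable by varying merit.

    For right branches the lower end of the interval is open (merit > threshold).
    """
    if isinstance(tree, float):
        return [(lo, hi, tree)]
    feature, threshold, left, right = tree
    if feature != "merit":
        # Fixed feature: only one branch is possible for this student.
        branch = left if fixed[feature] <= threshold else right
        return reachable_leaves(branch, fixed, lo, hi)
    leaves = []
    if lo <= threshold:
        leaves += reachable_leaves(left, fixed, lo, min(hi, threshold))
    if hi > threshold:
        leaves += reachable_leaves(right, fixed, max(lo, threshold), hi)
    return leaves


student = {"SAT": 1100.0, "GPA": 3.4}  # hypothetical student
leaves = reachable_leaves(TOY_TREE, student)
best = max(leaves, key=lambda leaf: leaf[2])
print(leaves)
print("best leaf:", best)
```

For a single student the best leaf can be read off directly. In the full model the budget constraint ties all students together, so the choice of leaf for each student has to be made jointly, which is the job of the solver.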

We can now optimize the problem.

In [None]:
m.optimize()

We print the maximum absolute error in approximating the regression.

In [None]:
print(
    "Error in approximating the regression {:.6}".format(
        np.max(np.abs(pred_constr.get_error()))
    )
)
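As far as we understand the `gurobi_ml` API, `get_error()` compares the values of `y` in the solution with what the regressor predicts for the solved feature values. The sketch below reproduces that comparison on made-up numbers; `max_abs_error` and both lists are hypothetical stand-ins, not output from the model above.

```python
def max_abs_error(predicted, solver_values):
    """Largest absolute gap between regressor predictions and solver outputs."""
    return max(abs(p - s) for p, s in zip(predicted, solver_values))


# Made-up values: what the tree would predict vs. what the solver reported for y.
predicted = [0.25, 0.60, 0.10]
solver_values = [0.25, 0.6000001, 0.10]

err = max_abs_error(predicted, solver_values)
print("Error in approximating the regression {:.6}".format(err))
```

An error close to the solver's tolerances suggests that the optimization model reproduces the tree faithfully. A much larger value would be worth investigating, for example a feature value sitting very close to a split threshold.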

### Look at the solution

In [None]:
# Predicted enrollment (y) against the merit scholarship offered (x)
plt.scatter(x.X, y.X)

Copyright © 2022 Gurobi Optimization, LLC