This notebook was used to test possible solutions to https://github.com/microsoft/LightGBM/issues/3713.

In [1]:
import dask.array as da

from dask.distributed import Client, LocalCluster, wait

from lightgbm.dask import DaskLGBMRegressor, DaskLGBMClassifier

Create a cluster with 3 workers. Since this is a `LocalCluster`, those workers are just 3 local processes.

In [2]:
n_workers = 3
cluster = LocalCluster()
client = Client(cluster)
client.wait_for_workers(n_workers)

print(f"View the dashboard: {cluster.dashboard_link}")

View the dashboard: http://127.0.0.1:8787/status


Click the link above to view a diagnostic dashboard while you run the training code below.

In [3]:
num_rows = 1e5
num_features = 1e2
num_partitions = 10
rows_per_chunk = num_rows / num_partitions

data = da.random.random((num_rows, num_features), (rows_per_chunk, num_features))

reg_target = da.random.random((num_rows, 1), (rows_per_chunk, 1))
clf_target = da.random.random((num_rows, 1), (rows_per_chunk, 1)) > 0.5

Right now, the Dask Arrays `data` and `labels` are lazy. Before training, you can force the cluster to compute them by running `.persist()` and then wait for that computation to finish by `wait()`-ing on them.

In [4]:
data = data.persist()
reg_target = reg_target.persist()
clf_target = clf_target.persist()
_ = wait(data)
_ = wait(reg_target)
_ = wait(clf_target)

With the data set up on the workers, train a model. `lightgbm.dask.DaskLGBMRegressor` has an interface that tries to stay as close as possible to the non-Dask scikit-learn interface to LightGBM (`lightgbm.sklearn.LGBMRegressor`).

In [5]:
dask_reg = DaskLGBMRegressor(
    random_state=708,
    objective="regression_l2",
    tree_learner="data",
)

dask_reg.fit(
    client=client,
    X=data,
    y=reg_target,
)

DaskLGBMRegressor(local_listen_port=12400,
                  machines='127.0.0.1:12400,127.0.0.1:12401,127.0.0.1:12402,127.0.0.1:12403',
                  num_machines=4, num_threads=2, objective='regression_l2',
                  random_state=708, time_out=120, tree_learner='data')

In [10]:
# predictions asking for predcontrib should add
# the contribution column
preds = dask_reg.predict(
    data[:1000, :]
).compute()
preds_with_contrib = dask_reg.predict(
    data[:1000, :],
    pred_contrib=True
).compute()

In [8]:
preds_with_contrib.compute().shape

(100000, 101)

In [14]:
preds_with_contrib[1, :20]

array([ 6.45399862e-04, -3.95657555e-05,  7.59298693e-05,  5.69546129e-04,
        2.06662463e-04, -3.71514231e-04, -6.01199797e-05, -1.87284127e-04,
        5.74538695e-04, -7.69852515e-05, -1.36459849e-04, -2.19577372e-04,
        2.28292839e-04,  3.25229481e-05, -1.08228282e-04,  1.79561465e-04,
        7.24739626e-05,  5.76597662e-06, -4.05503338e-05,  9.67508650e-04])

In [18]:
data.compute()[1, :20]

array([0.26912814, 0.64632271, 0.76171869, 0.66172313, 0.15239813,
       0.9929481 , 0.76962697, 0.08387276, 0.19024855, 0.21603516,
       0.58100825, 0.07072624, 0.25820807, 0.02168651, 0.52393139,
       0.73113175, 0.10402213, 0.73453963, 0.05693272, 0.8384313 ])

The model produced by this training run is an instance of `DaskLGBMRegressor`. To get a regular non-Dask model (which can be pickled and saved), run `.to_local()`.

In [None]:
local_model = dask_reg.to_local()
type(local_model)

You can visualize this model by looking at a data frame representation of it.

In [None]:
local_model.booster_.trees_to_dataframe()