# Dask + XGBoost

Dask is designed to interoperate with other Python projects. 

Here we'll look at its integration with distributed XGBoost: Dask loads and prepares the data, then hands it to XGBoost for training.

<img src='./images/xgboost.png'/>

In [None]:
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1, memory_limit='256MB')

client

In [None]:
import dask.dataframe

# Read the file in blocks of about 1 MB, one partition per block.
ddf = dask.dataframe.read_csv('data/diamonds.csv', blocksize='1MB')
ddf

In [None]:
# The target is the price; the remaining columns become the features.
y = ddf.price
ddf = ddf.drop(['Unnamed: 0', 'price'], axis=1)

In [None]:
ddf = ddf.categorize()

In [None]:
# One-hot encode the categorical columns (requires known categories, hence categorize() above).
ddf = dask.dataframe.get_dummies(ddf)
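
Why `categorize()` comes first: each partition of a Dask dataframe may see only some of a column's values, so dummy encoding needs the full category list up front for every partition to produce the same columns. The pure-Python sketch below illustrates that idea on two toy partitions. The `one_hot` helper and the sample values are hypothetical and are not Dask's implementation.

```python
def one_hot(rows, categories):
    """Encode each value in `rows` as a dict of 0/1 indicator columns."""
    return [{f"cut_{c}": int(value == c) for c in categories} for value in rows]

# Two toy partitions of a 'cut' column; neither sees every category.
partition_a = ["Ideal", "Good"]
partition_b = ["Premium", "Ideal"]

# Analogue of categorize(): one pass over all the data to learn the full category list.
categories = sorted(set(partition_a) | set(partition_b))

encoded_a = one_hot(partition_a, categories)
encoded_b = one_hot(partition_b, categories)

# Both partitions now share an identical column layout.
columns_a = list(encoded_a[0])
columns_b = list(encoded_b[0])
print(columns_a)
```

Without the shared category list, the first partition would have no `cut_Premium` column and the second no `cut_Good` column, and the pieces could not be combined into one feature matrix.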

In [None]:
from dask_ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(ddf, y, test_size=0.3)

X_train

We need the `dask-xgboost` library, which exposes high-level APIs like `XGBClassifier` and `XGBRegressor`. Dask-ML makes them available under `dask_ml.xgboost`, which is the import used here.

In [None]:
from dask_ml.xgboost import XGBRegressor

est = XGBRegressor()
est.fit(X_train, y_train)

__How Does this Work?__

Dask sets up XGBoost’s master process on the Dask scheduler and XGBoost’s worker processes on Dask’s worker processes.

Then it hands the pandas dataframes that make up each Dask dataframe to XGBoost and lets XGBoost train.

Because XGBoost has an excellent Python interface, its workers run inside the Dask worker processes and read the pandas dataframes already held there, so no data is transferred between processes. The two distributed services operate together on the same data.

When XGBoost has finished training, Dask cleans up the XGBoost infrastructure and continues as normal.
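
The hand-off described above depends on knowing which worker already holds each partition, so that each XGBoost worker can be given only local data. A minimal pure-Python sketch of that bookkeeping follows. The partition keys, worker addresses, and the `group_by_worker` helper are hypothetical, and the real library's internals may differ.

```python
from collections import defaultdict

def group_by_worker(partition_locations):
    """Invert a partition -> worker mapping into worker -> [partitions]."""
    by_worker = defaultdict(list)
    for partition, worker in partition_locations.items():
        by_worker[worker].append(partition)
    return dict(by_worker)

# Hypothetical placement of four pandas partitions on two Dask workers.
partition_locations = {
    "part-0": "tcp://worker-a",
    "part-1": "tcp://worker-b",
    "part-2": "tcp://worker-a",
    "part-3": "tcp://worker-b",
}

local_data = group_by_worker(partition_locations)
for worker, partitions in sorted(local_data.items()):
    # Each XGBoost worker would train on just these in-memory partitions.
    print(worker, partitions)
```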

In [None]:
y_predicted = est.predict(X_test)

y_predicted

In [None]:
from dask_ml.metrics import mean_squared_error
from math import sqrt

sqrt(mean_squared_error(y_test.to_dask_array(), y_predicted))
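
The cell above reports the root mean squared error (RMSE): the square root of the average squared difference between predicted and actual prices. As a check on the formula, here is the same calculation in plain Python on made-up numbers. The values are illustrative only and are not results from this dataset.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

actual = [300.0, 500.0, 700.0]
predicted = [310.0, 480.0, 700.0]

error = rmse(actual, predicted)
print(round(error, 2))
```

Because RMSE is expressed in the same units as the target, it can be read directly as a typical prediction error in price.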

In [None]:
client.close()