# dswizard Example
This notebook shows how to use our AutoML tool dswizard. The source code is available on
[GitHub](https://github.com/Ennosigaeon/dswizard).

In [None]:
import logging
import os

import openml

from dswizard.core.master import Master
from dswizard.core.model import Dataset
from dswizard.optimizers.bandit_learners.pseudo import PseudoBandit
from dswizard.optimizers.config_generators import Hyperopt
from dswizard.optimizers.structure_generators.mcts import MCTS, TransferLearning
from dswizard.util import util


# Fix the working directory if we are running on Binder
if os.getcwd() == '/home/jovyan/ida':
    os.chdir('..')

First, we specify which data set to optimize and how the optimization is configured. This example uses tasks from
[OpenML](https://www.openml.org/search?type=task). More specifically, we perform a
[supervised classification](https://www.openml.org/t/146606) on the [higgs](https://www.openml.org/d/23512) data set.

In [None]:
# Maximum optimization time in seconds
wallclock_limit = 60
# Maximum cutoff time for a single evaluation in seconds
cutoff = 10
# Directory used for logging
log_dir = 'run/'
# OpenML task id
task = 146606

util.setup_logging(os.path.join(log_dir, str(task), 'log.txt'))
logger = logging.getLogger()
logging.getLogger('matplotlib').setLevel(logging.WARNING)
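
To make the interplay of the two time limits concrete, here is a minimal sketch of how a total budget and a per-evaluation cutoff could interact. It is a simulation under simplifying assumptions (a single sequential worker, made-up runtimes) and not dswizard's actual scheduling logic; the function `simulate_budget` and the list of durations are hypothetical.

```python
def simulate_budget(durations, wallclock_limit, cutoff):
    """Simulate which evaluations finish under a total and a per-run time limit.

    durations: hypothetical runtime (in seconds) each evaluation would need.
    Returns a list of (index, status, elapsed_after) tuples.
    """
    elapsed = 0.0
    results = []
    for i, needed in enumerate(durations):
        if elapsed >= wallclock_limit:
            break  # total budget exhausted, no further evaluation starts
        # A single evaluation never runs longer than the cutoff
        # or than the remaining total budget.
        allowed = min(cutoff, wallclock_limit - elapsed)
        if needed <= allowed:
            elapsed += needed
            results.append((i, 'finished', elapsed))
        else:
            elapsed += allowed
            results.append((i, 'timeout', elapsed))
    return results


# Same limits as in the cell above; the runtimes are invented for illustration.
runs = simulate_budget([4, 25, 8, 9, 30, 12, 7], wallclock_limit=60, cutoff=10)
for index, status, elapsed in runs:
    print(index, status, elapsed)
```

In this simulation, evaluations that would need more than `cutoff` seconds are stopped early, so slow candidates cannot consume the whole `wallclock_limit`.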

Now we load the data set and extract the training and test data.

In [None]:
logger.info('Processing task {}'.format(task))
task = openml.tasks.get_task(task)
train_indices, test_indices = task.get_train_test_split_indices(repeat=0, fold=0, sample=0)

# noinspection PyUnresolvedReferences
X, y, _, _ = task.get_dataset().get_data(task.target_name)
X_train = X.loc[train_indices]
y_train = y[train_indices]
X_test = X.loc[test_indices]
y_test = y[test_indices]

ds = Dataset(X_train.to_numpy(), y_train.to_numpy(), metric='rocauc')
ds_test = Dataset(X_test.to_numpy(), y_test.to_numpy(), metric=ds.metric)
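
The data sets above are created with `metric='rocauc'`. Assuming this name refers to the area under the ROC curve for a binary target, the following sketch shows what that metric measures: the fraction of (positive, negative) sample pairs that the predicted scores rank correctly. The helper `roc_auc` is written for illustration only and is not part of dswizard.

```python
def roc_auc(y_true, y_score):
    """ROC AUC for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as half a pair.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError('ROC AUC needs at least one sample of each class')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

A value of 1.0 means every positive sample is scored above every negative one, while 0.5 corresponds to random ranking. The pairwise loop is quadratic and only meant for small examples.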

Let's take a look at the [higgs](https://www.openml.org/d/23512) data set.

In [None]:
X

Next, we create the optimizer instance. In this example, two parallel workers generate hyperparameters on the fly using
Hyperopt. The pipeline structure is obtained using MCTS with transfer learning from previous evaluations.

In [None]:
master = Master(
    ds=ds,
    # `task` now holds the OpenML task object, so use its numeric id for the path
    working_directory=os.path.join(log_dir, str(task.task_id)),
    n_workers=2,
    model='../dswizard/assets/rf_complete.pkl',

    wallclock_limit=wallclock_limit,
    cutoff=cutoff
)

pipeline, run_history = master.optimize()
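
For readers unfamiliar with MCTS, the sketch below illustrates the selection step of a tree search with the classic UCB1 rule, which balances candidates that performed well so far against candidates that were rarely tried. This notebook does not state which selection policy dswizard uses, so treat this as a generic illustration; the function `ucb1_select` and the candidate statistics are hypothetical.

```python
import math


def ucb1_select(stats, exploration=math.sqrt(2)):
    """Pick the candidate with the highest UCB1 score.

    stats maps a candidate name to (visits, total_reward).
    Unvisited candidates are tried first.
    """
    total_visits = sum(visits for visits, _ in stats.values())
    best_name, best_score = None, -math.inf
    for name, (visits, total_reward) in stats.items():
        if visits == 0:
            return name
        mean_reward = total_reward / visits
        bonus = exploration * math.sqrt(math.log(total_visits) / visits)
        score = mean_reward + bonus
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical statistics for three candidate next pipeline steps.
stats = {
    'scaler': (10, 6.0),
    'pca': (3, 2.1),
    'imputer': (0, 0.0),
}
first = ucb1_select(stats)   # the unvisited candidate is explored first
stats['imputer'] = (1, 0.2)  # record one (poor) result for it
second = ucb1_select(stats)
greedy = ucb1_select(stats, exploration=0.0)
print(first, second, greedy)
```

With the exploration term switched off, the rule degenerates into picking the best mean reward, which shows why the bonus matters when only a few evaluations fit into the time budget.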

Finally, we extract the best pipeline from our run history and print some evaluation statistics. More details about
the evaluated structures and configurations are available in the `run/` folder.

In [None]:
_, incumbent = run_history.get_incumbent()
logging.info('Best found configuration: {}\n{} with loss {}'.format(incumbent.steps,
                                                                    incumbent.get_incumbent().config,
                                                                    incumbent.get_incumbent().loss))
logging.info('A total of {} unique structures were sampled.'.format(len(run_history.data)))
logging.info('A total of {} runs were executed.'.format(len(run_history.get_all_runs())))

logging.info('Final pipeline:\n{}'.format(pipeline))
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)
logging.info('Final test performance {}'.format(util.score(y_test, y_prob, y_pred, ds.metric)))

ensemble = master.build_ensemble()
logging.info('Final ensemble performance {} based on {} pipelines'.format(
    util.score(ds_test.y, ensemble.predict_proba(ds_test.X), ensemble.predict(ds_test.X), ds.metric),
    len(ensemble.estimators_)))
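
Conceptually, the incumbent is the evaluated configuration with the lowest loss. The sketch below shows this selection on a hand-written run history. The `Run` record and the `get_incumbent` helper are hypothetical stand-ins and do not mirror dswizard's internal run history classes; the structures and loss values are invented.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one evaluated pipeline configuration."""
    structure: str
    config: dict
    loss: float


def get_incumbent(runs):
    """Return the run with the lowest loss, or None if nothing was evaluated."""
    return min(runs, key=lambda run: run.loss, default=None)


history = [
    Run('scaler -> random_forest', {'n_estimators': 100}, 0.31),
    Run('pca -> svm', {'C': 1.0}, 0.27),
    Run('imputer -> decision_tree', {'max_depth': 5}, 0.35),
]
best = get_incumbent(history)
print(best.structure, best.loss)
```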