# Run on Kubernetes

Run the benchmarking process on a Kubernetes cluster with the given configuration.

This talks to Kubernetes to create `n` new `pods`, each running a dask worker, which together
form a `dask` cluster. The function specified in `config` is then imported and run with the
given arguments. The tasks created by this `function` run on the `dask` cluster for
distributed computation.

The config dict must contain the following sections:
* run
* dask_cluster
* output

Within the `run` section you need to specify:
* function:
    The complete Python import path of the function to run, for example
    `btb_benchmark.main.run_benchmark`.
* args:
    A dictionary containing the keyword arguments that will be passed to the given function.
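
How a dotted path and an `args` dictionary could be turned into a function call can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the helper name `import_and_run` is hypothetical, and `json.dumps` stands in for a real benchmark function so that the snippet runs without extra dependencies.

```python
import importlib


def import_and_run(run_config):
    """Import the function named in ``run_config`` and call it with its args.

    Hypothetical helper, shown for illustration only.
    """
    # Split 'package.module.function' into a module path and a function name.
    module_path, function_name = run_config['function'].rsplit('.', 1)
    module = importlib.import_module(module_path)
    function = getattr(module, function_name)

    # The ``args`` dictionary is unpacked as keyword arguments.
    return function(**run_config.get('args', {}))


# Stand-in example: ``json.dumps`` plays the role of the benchmark function.
example_run = {
    'function': 'json.dumps',
    'args': {'obj': {'b': 1, 'a': 2}, 'sort_keys': True},
}
result = import_and_run(example_run)
```

Because the arguments are unpacked by keyword, every key in `args` has to match a parameter name of the target function.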

Within the `dask_cluster` section you can specify:
* workers:
    The number of workers to use.

* worker_config: A dictionary with the following keys:
    * resources: A dictionary containing the following keys:
        * memory:
            The amount of RAM.
        * cpu:
            The number of CPUs to use.
    * image: A docker image to be used (optional).
    * setup: A dictionary containing the following keys:
        * script: Path, inside the docker container, to a bash script to be run.
        * git_repository: A dictionary containing the following keys:
            * url: Link to the GitHub repository to be cloned.
            * reference: The branch or commit to check out.
            * install: The command to run to install this repository.
        * pip_packages: A list of pip packages to be installed.
        * apt_packages: A list of apt packages to be installed.
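
For reference, the keys listed above could be laid out as in the following sketch. All values are placeholders chosen for illustration (the repository URL, script path and package names are not from the project), and the worker count is written as a plain integer per the description above, while the notebook example later in this document passes a dictionary with a `maximum` key instead. Whether every `setup` key can be combined in a single config is an assumption.

```python
# Illustrative ``dask_cluster`` section; every value below is a placeholder.
dask_cluster = {
    'workers': 2,  # number of workers, as described above
    'worker_config': {
        'resources': {
            'memory': '2G',
            'cpu': 1,
        },
        'image': 'mlbazaar/btb_benchmark:latest',  # optional
        'setup': {
            'script': '/path/to/setup.sh',  # placeholder path inside the container
            'git_repository': {
                'url': 'https://github.com/example/repo',  # placeholder URL
                'reference': 'master',  # branch or commit to check out
                'install': 'pip install -e .',  # placeholder install command
            },
            'pip_packages': ['example-package'],  # placeholder package name
            'apt_packages': ['git'],  # placeholder package name
        },
    },
}
```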
        
Within the `output` section you can specify:
* path: The path to a local file or S3 location where the output file will be saved.
* bucket: If given, the output will be saved to `s3://bucket/path`, using the path specified above.
* key: AWS authentication key to access the bucket.
* secret_key: AWS secret authentication key to access the bucket.
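
The relationship between `path` and `bucket` can be sketched as below. The helper `resolve_output_location` is hypothetical and only mirrors the rule described above. It does not upload anything and ignores the `key` and `secret_key` credentials. Stripping a leading slash from the path is an assumption made here to keep the resulting URL tidy.

```python
def resolve_output_location(output):
    """Return where results would be written for a given ``output`` section.

    Hypothetical helper: it mirrors the rule described above, nothing more.
    """
    path = output['path']
    bucket = output.get('bucket')
    if bucket:
        # With a bucket, the path becomes a key inside that bucket.
        # Dropping a leading '/' is an assumption of this sketch.
        return 's3://{}/{}'.format(bucket, path.lstrip('/'))

    # Without a bucket, the path is used as given.
    return path


local_target = resolve_output_location({'path': 'results/benchmark.csv'})
s3_target = resolve_output_location({
    'path': 'results/benchmark.csv',
    'bucket': 'my-bucket',  # placeholder bucket name
})
```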



# Run on Kubernetes

There are two ways to run the benchmarking process on Kubernetes:

* `run_dask_function`: start a Dask cluster using dask-kubernetes and run a function.
* `run_on_kubernetes`: run a dask function inside a pod using the given config.

## Run dask function

This talks to Kubernetes to create `n` new `pods`, each running a dask worker, which together
form a `dask` cluster. The function specified in `config` is then imported and run with the
given arguments. The tasks created by this `function` run on the `dask` cluster for
distributed computation.

The config dict must contain the following sections:
* run
* dask_cluster
* output (optional)

*Note*: for more information about the config dict, please refer to the Kubernetes section of our documentation.

```python
config = {
    'run': {
        'function': 'btb_benchmark.main.run_benchmark',
        'args': {
            'iterations': 10,
            'sample': 4,
            'tuners': 'BTB.UniformTuner',
            'challenge_types': 'xgboost',
            'detailed_output': True,
        }
    },
    'dask_cluster': {
        'workers': {
            'maximum': 1
        },
        'worker_config': {
            'resources': {
                'memory': '2G',
                'cpu': 8
            },
            'image': 'mlbazaar/btb_benchmark:latest',
        },
    },
}
```

```python
from btb_benchmark.kubernetes import run_dask_function

results = run_dask_function(config)
```

```
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://192.168.1.132:37203
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.scheduler - INFO - Receive client connection: Client-d0d657fa-906a-11ea-97d4-00d8610cc1df
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://10.244.0.44:41383', name: tcp://10.244.0.44:41383, memory: 0, processing: 4>
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.244.0.44:41383
distributed.core - INFO - Starting established connection
distributed - INFO - [##########                              ] | 25% Completed | 0:00:12.417366 | 0:00:37.252081 | 2020-05-07 13:59:12.964920
distributed - INFO - [####################                    ] 
```

## Run on kubernetes cluster

Run a dask function inside a pod using the given config.

Create a pod, using the local Kubernetes configuration, that starts a Dask cluster
using dask-kubernetes and runs a function specified within the `config` dictionary.

*Note*: if you are running in a *namespace* other than `default`, you can pass it as an argument to
the `run_on_kubernetes` function.

```python
from btb_benchmark.kubernetes import run_on_kubernetes

run_on_kubernetes(config)
```

```
Pod created.
```


After running `run_on_kubernetes`, you can check that the pod has been created by running the following command
in your terminal:
```bash
kubectl get pods
```