# Azure Databricks integration with Kubeflow

The Azure Databricks package provides a set of Kubeflow Pipelines tasks (Ops) for manipulating Databricks resources through the [Azure Databricks Operator for Kubernetes](https://github.com/microsoft/azure-databricks-operator). These Ops are more convenient and less error prone than manipulating the same Databricks resources with a generic ResourceOp.

## Supported Ops

* CreateClusterOp, to create a cluster in Databricks.
* DeleteClusterOp, to delete a cluster created with CreateClusterOp.
* CreateJobOp, to create a Spark job in Databricks.
* DeleteJobOp, to delete a job created with CreateJobOp.
* SubmitRunOp, to submit a job run in Databricks.
* DeleteRunOp, to delete a run submitted with SubmitRunOp.
* CreateSecretScopeOp, to create a secret scope in Databricks.
* DeleteSecretScopeOp, to delete a secret scope created with CreateSecretScopeOp.
* ImportWorkspaceItemOp, to import an item into a Databricks Workspace.
* DeleteWorkspaceItemOp, to delete an item imported with ImportWorkspaceItemOp.
* CreateDbfsBlockOp, to create a DBFS block in Databricks.
* DeleteDbfsBlockOp, to delete a DBFS block created with CreateDbfsBlockOp.

A Kubeflow user can create each of these Ops in one of two ways:

* By passing the complete Databricks spec for the Op as a Python dictionary.
* By using named parameters.
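
The relationship between the two styles can be sketched in plain Python. The helper `build_cluster_spec` below is hypothetical and not part of the package; it only shows how the named parameters used later in this notebook map onto a spec dictionary whose keys follow the Databricks Clusters API. Whether the resulting dictionary is passed through an argument such as `spec` is an assumption to verify against the package documentation.

```python
def build_cluster_spec(spark_version, node_type_id, num_workers, spark_conf=None):
    """Hypothetical helper: assemble a Databricks cluster spec dictionary."""
    spec = {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }
    # Only include optional fields that were actually provided.
    if spark_conf:
        spec["spark_conf"] = dict(spark_conf)
    return spec


# Same values as the named-parameter example in this notebook.
cluster_spec = build_cluster_spec(
    spark_version="5.3.x-scala2.11",
    node_type_id="Standard_D3_v2",
    num_workers=2,
    spark_conf={"spark.speculation": "true"},
)
```

The dictionary style is convenient when a spec is loaded from a file or shared between pipelines, while named parameters keep short definitions readable.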

# Imports

In [None]:
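import os
import warnings

warnings.simplefilter('ignore')

# Read the Databricks token from the environment and fail early if it is missing.
x_auth_token = os.environ.get('X_AUTH_TOKEN')
if not x_auth_token:
    raise RuntimeError('The X_AUTH_TOKEN environment variable is not set.')

# Confirm the token is available without printing the secret itself.
print('X-Auth-Token for Databricks is set.')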

# Define a pipeline function

In [None]:
import kfp.dsl as dsl
import kfp.compiler as compiler
import databricks

def create_cluster(cluster_name):
    return databricks.CreateClusterOp(
        name="createcluster",
        cluster_name=cluster_name,
        spark_version="5.3.x-scala2.11",
        node_type_id="Standard_D3_v2",
        spark_conf={
            "spark.speculation": "true"
        },
        num_workers=2
    )

def submit_run(run_name, cluster_id, parameter):
    return databricks.SubmitRunOp(
        name="submitrun",
        run_name=run_name,
        existing_cluster_id=cluster_id,
        libraries=[{"jar": "dbfs:/docs/sparkpi.jar"}],
        spark_jar_task={
            "main_class_name": "org.apache.spark.examples.SparkPi",
            "parameters": [parameter]
        }
    )

def delete_run(run_name):
    return databricks.DeleteRunOp(
        name="deleterun",
        run_name=run_name
    )

def delete_cluster(cluster_name):
    return databricks.DeleteClusterOp(
        name="deletecluster",
        cluster_name=cluster_name
    )

@dsl.pipeline(
    name="DatabricksCluster",
    description="A toy pipeline that computes an approximation to pi with Azure Databricks."
)
def pipeline_calc(cluster_name="test-cluster", run_name="test-run", parameter="10"):
    create_cluster_task = create_cluster(cluster_name)
    submit_run_task = submit_run(run_name, create_cluster_task.outputs["cluster_id"], parameter)
    delete_run_task = delete_run(run_name)
    delete_run_task.after(submit_run_task)
    delete_cluster_task = delete_cluster(cluster_name)
    delete_cluster_task.after(delete_run_task)

# Compile the pipeline

Compile the pipeline function into a `pipeline_calc.tar.gz` package.

In [None]:
if __name__ == "__main__":
    # Use the public compile() method rather than a private compiler helper.
    compiler.Compiler().compile(
        pipeline_func=pipeline_calc,
        package_path="pipeline_calc.tar.gz")
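
Before submitting the package, it can be useful to check what the compiler actually wrote. The following is a minimal sketch using only the standard library; the helper name `list_package_members` is hypothetical.

```python
import tarfile


def list_package_members(package_path):
    """Return the file names stored in a gzip-compressed tar archive."""
    with tarfile.open(package_path, "r:gz") as archive:
        return archive.getnames()
```

Calling `list_package_members("pipeline_calc.tar.gz")` after the compile step should list the compiled workflow definition. The exact file name inside the archive depends on the SDK version, so treat any specific name as an assumption.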

# Submit and run the pipeline with parameters

In [None]:
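import time
import kfp

# Assumption: the default Kubeflow Pipelines endpoint is reachable; the experiment name is a placeholder.
client = kfp.Client()
exp = client.create_experiment(name='databricks')
run = client.run_pipeline(exp.id, 'pipeline-databricks-' + time.strftime("%Y%m%d-%H%M%S"), 'pipeline_calc.tar.gz',
                          params={'cluster_name': 'test-cluster', 'run_name': 'test-run', 'parameter': '10'})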
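
Submitting a run returns immediately, so a notebook usually needs to wait for the result. The sketch below assumes the client exposes `wait_for_run_completion(run_id, timeout)` and that its result carries the final state in `.run.status`, as the Kubeflow Pipelines v1 SDK client does; check both against the SDK version in use. The helper name `wait_for_success` is hypothetical.

```python
def wait_for_success(client, run_id, timeout_seconds=1800):
    """Block until the run finishes; return True only if it succeeded.

    Assumption: `client.wait_for_run_completion` returns an object whose
    `.run.status` attribute holds the final state as a string.
    """
    result = client.wait_for_run_completion(run_id, timeout_seconds)
    status = result.run.status
    return status is not None and status.lower() == "succeeded"
```

With the `client` and `run` objects from the cell above, `wait_for_success(client, run.id)` would block until the Databricks pipeline completes or the timeout expires.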

## Additional examples

More sample pipelines can be found in:

* [samples/contrib/azure-samples/databricks-pipelines](https://github.com/kubeflow/pipelines/blob/master/samples/contrib/azure-samples/databricks-pipelines) 
* [samples/contrib/azure-samples/kfp-azure-databricks/tests](https://github.com/kubeflow/pipelines/blob/master/samples/contrib/azure-samples/kfp-azure-databricks/tests)

## Additional information

* [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/)
* [Azure Databricks documentation](https://docs.microsoft.com/azure/azure-databricks/)
* [Azure Databricks Operator for Kubernetes](https://github.com/microsoft/azure-databricks-operator)
* [Golang SDK for DataBricks REST API 2.0 and Azure DataBricks REST API 2.0, used by Azure Databricks Operator](https://github.com/xinsnake/databricks-sdk-golang)
* [Databricks REST API 2.0](https://docs.databricks.com/dev-tools/api/latest/index.html)
* [Azure Databricks REST API 2.0](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/)

The following articles provide information on the supported spec fields for the supported Databricks Ops:

* **Cluster Ops**: [Azure Databricks Cluster API](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters)
* **Job Ops**: [Azure Databricks Jobs API](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs)
* **Run Ops**: [Azure Databricks Jobs API - Runs Submit](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-submit)
* **Secret Scope Ops**: [Azure Databricks Secrets API](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/secrets)
* **Workspace Item Ops**: [Azure Databricks Workspace API](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/workspace)
* **DbfsBlock Ops**: [Azure Databricks DBFS API](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/dbfs)