# Databricks SDK for Python

---
#### Learning objectives
* Authentication with the SDK
* How paginated responses and long-running operations work
* How to create and destroy Workspace resources like clusters and jobs

---

Picture this: You're at the helm of a turbocharged sports car, and the open road ahead represents the vast and exciting realm of Databricks Lakehouse. The Databricks SDK for Python? Well, think of it as your supercharged engine, the cutting-edge tech that transforms you into the speedster of data development!

![gif](https://databricks-sdk-py.readthedocs.io/en/latest/_images/notebook-native-auth.gif)

The Databricks SDK is like having a nitro boost for Python development in the Databricks Lakehouse. Just like a sports car can handle all kinds of terrain, this SDK covers all public Databricks REST API operations, making it your all-access pass to the entire spectrum of Databricks features. The SDK's internal HTTP client is as robust as your sports car's suspension, ensuring smooth handling even when things get bumpy: it's got your back with intelligent retries, so you can keep moving forward without a hiccup. Much like the precision engineering behind a high-performance vehicle, this SDK is crafted to give you control and accuracy over your Databricks projects, and it streamlines your development process so you can get more done in less time. You can always find the latest documentation at [https://databricks-sdk-py.readthedocs.io](https://databricks-sdk-py.readthedocs.io/en/latest/).

Whether you're scripting like a code ninja in the shell, orchestrating seamless CI/CD production setups, or conducting symphonies of data from the Databricks Notebook, this SDK is your all-in-one, full-throttle, programmatic powerhouse. For now, Databricks Runtime may ship with an outdated version of the Python SDK, so until further notice, install a known version explicitly within the notebook (this notebook pins `0.8.0`) and restart Python so the new version is picked up:

In [0]:
%pip install databricks-sdk==0.8.0
dbutils.library.restartPython()

Once you've installed it, prepare to witness the magic of simplicity and security! Initializing the WorkspaceClient with the Databricks SDK for Python is like waving a wand that effortlessly picks up authentication from the notebook context. Say goodbye to the brittle hassle of passing tokens around. Thanks to the Unified Client Authentication, it's as if you've cast a spell of harmony across your Databricks tools and integrations. Whether you're running the Databricks Terraform Provider, harnessing the Databricks SDK for Go, wielding the Databricks CLI, or even working with applications targeting Databricks SDKs in other languages, they'll all play together in perfect harmony. It's like a symphony of data tools working seamlessly to empower your data-driven dreams! Read more about Python SDK with Azure CLI, Azure Service Principals, Databricks Account-level Service Principals, and Databricks CLI authentication [here](https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html).

In [0]:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
me = w.current_user.me()
print(f'My email is {me.user_name}')
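
Outside a notebook, there is no notebook context to pick credentials up from. The sketch below illustrates the general idea behind a unified lookup, assuming a precedence of explicit arguments, then environment variables, then a configuration profile. The function `resolve_credentials` and the sample values are hypothetical and only illustrate the idea; this is not the SDK's implementation. `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are the environment variable names commonly used by Databricks tools.

```python
import configparser
from typing import Dict, Mapping, Optional


def resolve_credentials(explicit: Mapping[str, Optional[str]],
                        environ: Mapping[str, str],
                        config_text: str,
                        profile: str = "DEFAULT") -> Dict[str, Optional[str]]:
    """Hypothetical resolver: explicit args win, then env vars, then the profile."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser[profile] if profile in parser else {}

    resolved = {}
    for name in ("host", "token"):
        env_name = f"DATABRICKS_{name.upper()}"
        # Assumed precedence: the first non-empty source is used
        resolved[name] = (explicit.get(name)
                          or environ.get(env_name)
                          or section.get(name))
    return resolved


# Made-up configuration profile content, in the INI style of a CLI config file
sample_config = (
    "[DEFAULT]\n"
    "host = https://example.cloud.databricks.com\n"
    "token = from-config-file\n"
)

creds = resolve_credentials(explicit={},
                            environ={"DATABRICKS_TOKEN": "from-environment"},
                            config_text=sample_config)
print(creds)
```

Here the host falls through to the profile while the token comes from the environment, which is why the same settings can be shared by several tools without passing tokens around in code.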

## Paginated Responses
In the dynamic world of Databricks APIs, dealing with different pagination styles used to be like solving a complex puzzle. But here's where the magic of our SDKs truly shines. Whether you're wielding the power of Go, Python, Java, R, the Databricks CLI, or any other SDK in our arsenal, they all possess an extraordinary superpower: transparently handling paginated responses! It's like having a universal translator for APIs. Regardless of whether an API uses offset-plus-limit, starts offsets from 0 or 1, employs cursor-based iteration, or simply returns everything in one massive response, our SDKs provide you with a consistent list method that seamlessly returns an **iterator**. It's the ultimate weapon in your arsenal, ensuring that you can conquer any API pagination challenge with ease and grace, making you the undisputed champion of data manipulation and exploration!

In the following example we'll list all of the jobs in the workspace, which may span multiple pages of results:

In [0]:
for job in w.jobs.list():
    print(f'Found job: {job.settings.name} (id: {job.job_id})')

Want something more personal? Let's look at our own notebooks!

In [0]:
for obj in w.workspace.list(f'/Users/{w.current_user.me().user_name}', recursive=True):
    print(f'{obj.object_type}: {obj.path}')

Let's cook up a script that's like your very own mission control center, monitoring the pulse of your jobs with style. Imagine a dashboard that not only reveals the latest execution status of each job but also calculates the average run duration for each one. And the best part? We're going full-throttle excitement by putting the most recent updates right at the top, so you're always in the loop with the freshest data insights. This script is your ticket to a dynamic world of job monitoring that's as thrilling as it is efficient.

In [0]:
# Import required libraries
from collections import defaultdict
from datetime import datetime, timezone

latest_state = {} # Initialize an empty dictionary for storing the latest state of jobs
all_jobs = {} # Initialize an empty dictionary for storing all databricks jobs
durations = defaultdict(list) # Initialize a defaultdict object for easily creating lists

# Loop through all jobs in databricks workspace
for job in w.jobs.list():
    all_jobs[job.job_id] = job # Append the current job to the all_jobs dictionary
    if job.settings.schedule is None: # Skip jobs that are not scheduled
        continue 
    for run in w.jobs.list_runs(job_id=job.job_id, expand_tasks=False): # Loop through all runs of the current job
        if run.run_duration is not None: # Capture how long the job ran
            durations[job.job_id].append(run.run_duration)
        if job.job_id not in latest_state: # Capture the latest state of a job if it is not already captured
            latest_state[job.job_id] = run
            continue
        if run.end_time < latest_state[job.job_id].end_time: # Skip run that is older than the existing record
            continue
        latest_state[job.job_id] = run

summary = [] # Initialize an empty list for storing the summary of job statuses
for job_id, run in latest_state.items(): # Loop through previously captured latest states of all jobs
    average_duration = 0
    if len(durations[job_id]) > 0:
        average_duration = sum(durations[job_id]) / len(durations[job_id])
    summary.append({
        # Append to the summary the job's name, last status (success or failure), last finished time, and average run duration
        'job_name': all_jobs[job_id].settings.name,
        'last_status': run.state.result_state,
        'last_finished': datetime.fromtimestamp(run.end_time/1000, timezone.utc),
        'average_duration': average_duration
    })

for line in sorted(summary, key=lambda s: s['last_finished'], reverse=True):
    # Print the summary of all jobs' statuses, sorted by date of last finish, in reversed chronological order
    print(f'Latest: {line}')
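
If you'd like to check the aggregation logic without a workspace, the sketch below applies the same idea (keep the most recent run per job, average the durations, sort newest first) to a few made-up tuples. The sample values and the names `sample_runs`, `latest`, and `sample_summary` are hypothetical and only illustrate the technique.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical sample data: (job_id, end_time in epoch ms, run_duration in ms, result)
sample_runs = [
    (1, 1_700_000_000_000, 60_000, "SUCCESS"),
    (1, 1_700_003_600_000, 90_000, "FAILED"),
    (2, 1_700_001_800_000, 30_000, "SUCCESS"),
]

latest = {}                      # job_id -> (end_time, result) of the most recent run
run_durations = defaultdict(list)  # job_id -> all observed durations

for job_id, end_time, duration, result in sample_runs:
    run_durations[job_id].append(duration)
    # Keep only the run with the greatest end_time for each job
    if job_id not in latest or end_time > latest[job_id][0]:
        latest[job_id] = (end_time, result)

sample_summary = []
for job_id, (end_time, result) in latest.items():
    values = run_durations[job_id]
    sample_summary.append({
        "job_id": job_id,
        "last_status": result,
        # Timestamps are in milliseconds, so divide by 1000 before converting
        "last_finished": datetime.fromtimestamp(end_time / 1000, timezone.utc),
        "average_duration": sum(values) / len(values),
    })

# Most recently finished job first
sample_summary.sort(key=lambda s: s["last_finished"], reverse=True)
for line in sample_summary:
    print(line)
```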

## Long-running Operations

When you invoke a long-running operation, the SDK provides a high-level API to _trigger_ the operation and _wait_ for the related entity to reach the desired state, or to return the error message in case of failure. All long-running operations return a generic `Wait` instance with a `result()` method that returns the result of the operation once it has finished. The Databricks SDK for Python picks reasonable default timeouts for every method, but you can override them by passing a `datetime.timedelta()` as the `timeout` argument to `result()`.

There are a number of long-running operations in Databricks APIs, such as managing:
* Clusters
* Command execution
* Jobs
* Libraries
* Delta Live Tables pipelines
* Databricks SQL warehouses

For example, in the Clusters API, once you create a cluster, you receive a cluster ID, and the cluster is in the `PENDING` state. Meanwhile,
Databricks takes care of provisioning virtual machines from the cloud provider in the background. The cluster is
only usable in the `RUNNING` state, so you have to wait for that state to be reached.

Another example is the API for running a job or repairing a run: right after
the run starts, it is in the `PENDING` state. The run is only considered finished when it reaches either
the `TERMINATED` or `SKIPPED` state. You will also likely want the error message if the long-running
operation times out or fails with an error code. At other times you may want to configure a custom timeout instead of
the default of 20 minutes.
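
The trigger-then-wait pattern described above can be sketched with the standard library alone. The class `SketchWait`, its parameters, and the simulated state sequence are hypothetical and only illustrate polling until a terminal state or a client-side timeout; this is not how the SDK's own waiter is implemented.

```python
import datetime
import time
from typing import Callable, Set


class SketchWait:
    """Hypothetical stand-in for a waiter object returned by a long-running call."""

    def __init__(self, poll: Callable[[], str], success: Set[str], failure: Set[str]):
        self._poll = poll          # returns the current state of the entity
        self._success = success    # states that mean the operation finished well
        self._failure = failure    # states that mean the operation failed

    def result(self,
               timeout: datetime.timedelta = datetime.timedelta(minutes=20),
               interval: float = 0.01) -> str:
        deadline = time.monotonic() + timeout.total_seconds()
        while time.monotonic() < deadline:
            state = self._poll()
            if state in self._success:
                return state
            if state in self._failure:
                raise RuntimeError(f"operation failed in state {state}")
            time.sleep(interval)   # back off before polling again
        raise TimeoutError(f"timed out after {timeout}")


# Simulated cluster that reports PENDING twice before it becomes RUNNING
_states = iter(["PENDING", "PENDING", "RUNNING"])
waiter = SketchWait(lambda: next(_states),
                    success={"RUNNING"},
                    failure={"ERROR", "TERMINATED"})
final_state = waiter.result(timeout=datetime.timedelta(seconds=5))
print(final_state)  # prints "RUNNING"
```

Separating the trigger (constructing the waiter) from the blocking `result()` call lets the caller decide how long it is willing to wait, independently of how long the operation actually takes on the server.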


In [0]:
import datetime
from databricks.sdk.service.compute import DataSecurityMode, RuntimeEngine

# This will use the DBAcademy cluster policy, which we need for the CloudLabs environment
policy_id = [p for p in w.cluster_policies.list() if p.name == "DBAcademy"][0].policy_id

# create() triggers provisioning; result() blocks until the cluster is usable or the timeout expires
info = w.clusters.create(cluster_name=f'Created cluster from {w.current_user.me().user_name}',
                         spark_version=w.clusters.select_spark_version(latest=True),
                         autotermination_minutes=10,
                         runtime_engine=RuntimeEngine.STANDARD,
                         data_security_mode=DataSecurityMode.SINGLE_USER,
                         policy_id=policy_id).result(timeout=datetime.timedelta(minutes=10))
print(f'Created: {info}')

## Example workload

Let's create a notebook in our home directory that shows names of all databases:

In [0]:
notebook_path = f'/Users/{w.current_user.me().user_name}/sdk-sample.py'
notebook_content = 'display(spark.sql("SHOW DATABASES"))'
w.workspace.upload(notebook_path, notebook_content.encode('utf8'), overwrite=True)

and then create a job that runs this notebook on the interactive cluster we created:

In [0]:
from databricks.sdk.service.jobs import Task, NotebookTask

# The task runs on the interactive cluster created earlier (info.cluster_id)

job = w.jobs.create(name=f'Job created from SDK by {w.current_user.me().user_name}',
                    tasks=[Task(task_key='main',
                                notebook_task=NotebookTask(notebook_path),
                                existing_cluster_id=info.cluster_id)])
print(f'Created job with id: {job.job_id}')

and then start this job and wait until it completes or times out after 10 minutes. It's always good practice to set a client-side timeout for long-running operations.

Please open [jobs owned by me](/#job/list?acl=owned_by_me) after running the cell below and confirm it's running.

In [0]:
import datetime
w.jobs.run_now(job.job_id).result(timeout=datetime.timedelta(minutes=10))

... and don't forget to remove the notebook, job, and cluster you've just created once you're done experimenting:

In [0]:
w.workspace.delete(notebook_path)
w.jobs.delete(job.job_id)
w.clusters.permanent_delete(info.cluster_id)

## Debug Logging

The Databricks SDK for Python seamlessly integrates with the standard Logging facility for Python. This allows developers to easily enable and customize logging for their Databricks Python projects. To enable debug logging in your Databricks Python project, you can execute the snippet below and then re-run any cell that calls the SDK:

In [0]:
import logging, sys
logging.basicConfig(stream=sys.stderr,
                    level=logging.INFO,
                    format='%(asctime)s [%(name)s][%(levelname)s] %(message)s')
logging.getLogger('databricks.sdk').setLevel(logging.DEBUG)
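
The reason this enables SDK debug output without flooding the notebook with messages from every other library is that Python loggers are hierarchical: a logger without its own level inherits the level of its nearest configured ancestor. The sketch below shows this with the standard library only; the child logger name `databricks.sdk.core` and the unrelated name `some.other.library` are used purely for illustration.

```python
import logging

# Root logger stays at INFO; only the 'databricks.sdk' namespace is lowered to DEBUG
logging.getLogger().setLevel(logging.INFO)
logging.getLogger('databricks.sdk').setLevel(logging.DEBUG)

# Illustrative logger names: one below the SDK namespace, one unrelated
sdk_child = logging.getLogger('databricks.sdk.core')
other = logging.getLogger('some.other.library')

# The child inherits DEBUG from 'databricks.sdk'; the unrelated logger inherits INFO from root
print(sdk_child.isEnabledFor(logging.DEBUG))
print(other.isEnabledFor(logging.DEBUG))
```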

## More Code Examples

Please check out the [OAuth with Flask](https://github.com/databricks/databricks-sdk-py/tree/main/examples/flask_app_with_oauth.py), 
[Last job runs](https://github.com/databricks/databricks-sdk-py/tree/main/examples/last_job_runs.py), and
[Starting job and waiting](https://github.com/databricks/databricks-sdk-py/tree/main/examples/starting_job_and_waiting.py) examples. You can also dig deeper into different services, like
[alerts](https://github.com/databricks/databricks-sdk-py/tree/main/examples/alerts), 
[billable_usage](https://github.com/databricks/databricks-sdk-py/tree/main/examples/billable_usage), 
[catalogs](https://github.com/databricks/databricks-sdk-py/tree/main/examples/catalogs), 
[cluster_policies](https://github.com/databricks/databricks-sdk-py/tree/main/examples/cluster_policies), 
[clusters](https://github.com/databricks/databricks-sdk-py/tree/main/examples/clusters), 
[credentials](https://github.com/databricks/databricks-sdk-py/tree/main/examples/credentials), 
[current_user](https://github.com/databricks/databricks-sdk-py/tree/main/examples/current_user), 
[dashboards](https://github.com/databricks/databricks-sdk-py/tree/main/examples/dashboards), 
[data_sources](https://github.com/databricks/databricks-sdk-py/tree/main/examples/data_sources), 
[databricks](https://github.com/databricks/databricks-sdk-py/tree/main/examples/databricks), 
[encryption_keys](https://github.com/databricks/databricks-sdk-py/tree/main/examples/encryption_keys), 
[experiments](https://github.com/databricks/databricks-sdk-py/tree/main/examples/experiments), 
[external_locations](https://github.com/databricks/databricks-sdk-py/tree/main/examples/external_locations), 
[git_credentials](https://github.com/databricks/databricks-sdk-py/tree/main/examples/git_credentials), 
[global_init_scripts](https://github.com/databricks/databricks-sdk-py/tree/main/examples/global_init_scripts), 
[groups](https://github.com/databricks/databricks-sdk-py/tree/main/examples/groups), 
[instance_pools](https://github.com/databricks/databricks-sdk-py/tree/main/examples/instance_pools), 
[instance_profiles](https://github.com/databricks/databricks-sdk-py/tree/main/examples/instance_profiles), 
[ip_access_lists](https://github.com/databricks/databricks-sdk-py/tree/main/examples/ip_access_lists), 
[jobs](https://github.com/databricks/databricks-sdk-py/tree/main/examples/jobs), 
[libraries](https://github.com/databricks/databricks-sdk-py/tree/main/examples/libraries), 
[local_browser_oauth.py](https://github.com/databricks/databricks-sdk-py/tree/main/examples/local_browser_oauth.py), 
[log_delivery](https://github.com/databricks/databricks-sdk-py/tree/main/examples/log_delivery), 
[metastores](https://github.com/databricks/databricks-sdk-py/tree/main/examples/metastores), 
[model_registry](https://github.com/databricks/databricks-sdk-py/tree/main/examples/model_registry), 
[networks](https://github.com/databricks/databricks-sdk-py/tree/main/examples/networks), 
[permissions](https://github.com/databricks/databricks-sdk-py/tree/main/examples/permissions), 
[pipelines](https://github.com/databricks/databricks-sdk-py/tree/main/examples/pipelines), 
[private_access](https://github.com/databricks/databricks-sdk-py/tree/main/examples/private_access), 
[queries](https://github.com/databricks/databricks-sdk-py/tree/main/examples/queries), 
[recipients](https://github.com/databricks/databricks-sdk-py/tree/main/examples/recipients), 
[repos](https://github.com/databricks/databricks-sdk-py/tree/main/examples/repos), 
[schemas](https://github.com/databricks/databricks-sdk-py/tree/main/examples/schemas), 
[secrets](https://github.com/databricks/databricks-sdk-py/tree/main/examples/secrets), 
[service_principals](https://github.com/databricks/databricks-sdk-py/tree/main/examples/service_principals), 
[storage](https://github.com/databricks/databricks-sdk-py/tree/main/examples/storage), 
[storage_credentials](https://github.com/databricks/databricks-sdk-py/tree/main/examples/storage_credentials), 
[tokens](https://github.com/databricks/databricks-sdk-py/tree/main/examples/tokens), 
[users](https://github.com/databricks/databricks-sdk-py/tree/main/examples/users), 
[vpc_endpoints](https://github.com/databricks/databricks-sdk-py/tree/main/examples/vpc_endpoints), 
[warehouses](https://github.com/databricks/databricks-sdk-py/tree/main/examples/warehouses), 
[workspace](https://github.com/databricks/databricks-sdk-py/tree/main/examples/workspace), 
[workspace_assignment](https://github.com/databricks/databricks-sdk-py/tree/main/examples/workspace_assignment), 
[workspace_conf](https://github.com/databricks/databricks-sdk-py/tree/main/examples/workspace_conf), 
and [workspaces](https://github.com/databricks/databricks-sdk-py/tree/main/examples/workspaces).

And, of course, https://databricks-sdk-py.readthedocs.io/.