<h1 align="center">Labelspark Cluster Creator</h1>
<b>This Python script gives Labelbox customers a simple way to automate the deployment of a Databricks cluster that is pre-configured for use with Labelbox services.
The script performs two tasks: it creates a Databricks cluster with a specific configuration, and it installs the Labelbox-recommended Python libraries on the newly created cluster.
First, the script creates the cluster through the Databricks REST API. The configuration sets the autoscaling worker range, the Databricks Runtime version, the node types, and more, so users start from a setup that is tried and tested for compatibility and performance with Labelbox.
Next, the script installs the essential Python libraries ("labelbox" and "labelspark") on the new cluster. These libraries are needed to work with Labelbox services and to use its data labeling and management capabilities.
By automating these setup steps, the script lets users start their data labeling and analysis tasks quickly, without configuring the cluster by hand.</b>

<h3 align="left">Add your Databricks instance URL and your access token</h3>

In [25]:
databricks_instance = "<Your databricks instance URL>" # Add your Databricks instance URL here. The URL can be found in the address bar of your Databricks workspace. It should be of the form <location>.gcp.databricks.com. 
personal_access_token = "<Your personal access token>" # Add your personal access token here. The token can be found in the User Settings page of your Databricks workspace.
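
The address bar usually gives a full URL (scheme, path, sometimes a query string) rather than the bare hostname that the API calls expect. As a minimal sketch, assuming a hypothetical helper named `normalize_instance` that is not part of the original script, the standard library's `urllib.parse.urlsplit` can reduce whatever was pasted to the hostname. The `example.gcp.databricks.com` value is a placeholder, not a real workspace.

```python
from urllib.parse import urlsplit


def normalize_instance(raw: str) -> str:
    """Reduce a pasted workspace URL to its bare hostname (hypothetical helper)."""
    raw = raw.strip()
    # urlsplit only recognises the host when a scheme or a leading "//" is present,
    # so add "//" when the user pasted a bare hostname.
    parts = urlsplit(raw if "://" in raw else f"//{raw}")
    # .hostname drops any port, path, query string, or fragment.
    return parts.hostname or ""


# Placeholder URL for illustration only
print(normalize_instance("https://example.gcp.databricks.com/?o=123"))
```

This covers `http://` prefixes and trailing paths as well as the `https://` prefix and trailing slash.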

<h3 align="left">Once you've added your instance URL and personal access token, run the cell below to create a cluster and attach the required libraries</h3>

In [None]:
import requests
import json

# Remove 'https://' from the start of the URL
if databricks_instance.startswith("https://"):
    databricks_instance = databricks_instance[8:]

# Remove trailing slash from the end of the URL
if databricks_instance.endswith("/"):
    databricks_instance = databricks_instance[:-1]

# Function to create a Databricks cluster
def create_cluster(personal_access_token, databricks_instance):
    # JSON payload for the API request
    json_payload = {
        "autoscale": {"min_workers": 1, "max_workers": 10}, # Cluster autoscaling parameters
        "cluster_name": "Labelbox Worker", # Name of the cluster
        "spark_version": "13.3.x-scala2.12", # Databricks Runtime version (13.3 LTS, Scala 2.12)
        "gcp_attributes": {
            "use_preemptible_executors": False,
            "availability": "PREEMPTIBLE_WITH_FALLBACK_GCP",
            "zone_id": "HA"
        }, # GCP-specific attributes
        "node_type_id": "n1-standard-4", # Node type
        "driver_node_type_id": "n1-standard-4", # Driver node type
        "ssh_public_keys": [], # SSH public keys for secure connections
        "custom_tags": {}, # Any custom tags to be associated with the cluster
        "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}}, # Logging configuration
        "spark_env_vars": {}, # Environment variables for Spark
        "autotermination_minutes": 60, # Autotermination time in minutes
        "enable_elastic_disk": False, # Whether to enable elastic disk
        "cluster_source": "UI", # Source of the cluster creation
        "init_scripts": [], # Initialization scripts
        "enable_local_disk_encryption": False, # Whether to enable local disk encryption
        "runtime_engine": "STANDARD", # Runtime engine
    }

    # Headers for the API request
    headers = {"Authorization": f"Bearer {personal_access_token}", "Content-Type": "application/json"}

    # Send the POST request to create a cluster (time out rather than hang on a stalled connection)
    response = requests.post(
        f"https://{databricks_instance}/api/2.0/clusters/create",
        headers=headers,
        data=json.dumps(json_payload),
        timeout=60,
    )

    # If the response is successful, print a message and return the cluster ID
    if response.status_code == 200:
        print("Cluster created successfully!")
        return response.json()["cluster_id"]
    else:
        # If the response is not successful, print an error message and return None
        print(f"Failed to create cluster. Status code: {response.status_code}\nResponse: {response.text}")
        return None


# Function to install libraries in a Databricks cluster
def install_libraries(cluster_id, personal_access_token, databricks_instance):
    # Libraries to be installed
    libraries = [
        {"pypi": {"package": "labelbox"}},
        {"pypi": {"package": "labelspark"}},
    ]

    # Payload for the API request
    data = {"cluster_id": cluster_id, "libraries": libraries}

    # Headers for the API request
    headers = {"Authorization": f"Bearer {personal_access_token}", "Content-Type": "application/json"}

    # Send the POST request to install the libraries
    response = requests.post(
        f"https://{databricks_instance}/api/2.0/libraries/install",
        headers=headers,
        data=json.dumps(data),
        timeout=60,
    )

    # A 200 response means the request was accepted; installation continues in the background
    if response.status_code == 200:
        print("Library installation requested successfully!")
    else:
        # If the response is not successful, print an error message
        print(f"Failed to install libraries. Status code: {response.status_code}\nResponse: {response.text}")

# Create a cluster and get its ID
cluster_id = create_cluster(personal_access_token, databricks_instance)

# If the cluster was successfully created, install the libraries
if cluster_id:
    install_libraries(cluster_id, personal_access_token, databricks_instance)
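
Both API calls return before the work finishes: `clusters/create` hands back a `cluster_id` while the cluster is still starting, and `libraries/install` only queues the installation. A minimal sketch of waiting for the cluster follows, assuming the `GET /api/2.0/clusters/get` endpoint and its `state` field. The helper names `get_cluster_state` and `wait_until_settled`, the timeout, and the polling interval are illustrative choices, not part of the original script. The sketch uses only the standard library so it can also run outside the notebook.

```python
import json
import time
import urllib.parse
import urllib.request

# States in which the cluster is no longer in the middle of starting up
SETTLED_STATES = {"RUNNING", "TERMINATED", "ERROR", "UNKNOWN"}


def get_cluster_state(cluster_id, personal_access_token, databricks_instance):
    """Return the cluster's current state string from the clusters/get endpoint."""
    query = urllib.parse.urlencode({"cluster_id": cluster_id})
    request = urllib.request.Request(
        f"https://{databricks_instance}/api/2.0/clusters/get?{query}",
        headers={"Authorization": f"Bearer {personal_access_token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["state"]


def wait_until_settled(fetch_state, timeout_s=900, poll_s=15, sleep=time.sleep):
    """Call fetch_state() repeatedly until it returns a settled state.

    Raises TimeoutError if the cluster has not settled after roughly timeout_s
    seconds of waiting. The sleep function is injectable so the loop can be
    exercised without real delays.
    """
    waited = 0
    while True:
        state = fetch_state()
        if state in SETTLED_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"Cluster still in state {state!r} after {waited} seconds")
        sleep(poll_s)
        waited += poll_s
```

With the variables from the cell above, a call might look like `wait_until_settled(lambda: get_cluster_state(cluster_id, personal_access_token, databricks_instance))`. The caller should check that the returned state is `RUNNING` before relying on the cluster. Library progress could be polled the same way through `GET /api/2.0/libraries/cluster-status`.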