<h1 align="center">Labelbox <> Databricks Pipeline Creator</h1>
<b>This script creates a Databricks job that uploads data to Labelbox, a data labeling platform. The job definition specifies where the data comes from, which notebook task to run, and how often the job should run.

If the frequency is set to "continuous", the job runs continuously. Otherwise, it runs on the cron schedule you provide.

Once the job definition is built, the script sends a request to the Databricks REST API to create the job. If the request succeeds, it prints a success message; otherwise, it prints the error returned by Databricks.</b>

<b>Configure the job using the cell below.</b>

In [30]:
# ----- CONFIGURATION -----

# User-defined variables
# Databricks cloud instance URL, without the "https://" prefix. Make sure it is in the format <workspace_id>.<cloud>.databricks.com
databricks_instance = ""
# Personal access token for Databricks authentication. This can be generated from the user settings page.
databricks_api_key = ""
# Email address of the Databricks user the job should run as (used in the job's "run_as" setting).
email = ""
# Path to the table which needs to be processed. For example "<metastore>.<database>.<table>"
table_path = ""
# API key for the Labelbox integration. This can be generated from the Labelbox settings page.
labelbox_api_key = ""
# ID of the dataset to be used. 
dataset_id = ""
# How often the job runs. Set to 'continuous' or to a Quartz cron expression.
frequency = ""  # Examples: "continuous" or "0 0/5 * * * ?"
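
The configuration values above are used verbatim when the job is created, so a malformed value only surfaces as an API error later. As a minimal sketch, the checks below could be run first. The helper names `classify_frequency` and `normalize_instance` are hypothetical and not part of this notebook, and the field-count check only assumes that a Quartz cron expression has six fields plus an optional seventh (year); it does not validate the fields themselves.

```python
import re


def classify_frequency(frequency: str) -> str:
    """Return "continuous" or "schedule"; raise ValueError for malformed input.

    Hypothetical helper: it only counts whitespace-separated fields and does
    not check that each cron field is valid.
    """
    value = frequency.strip()
    if value == "continuous":
        return "continuous"
    # Quartz cron expressions have 6 fields, plus an optional 7th (year).
    fields = value.split()
    if len(fields) not in (6, 7):
        raise ValueError(
            f"Expected 'continuous' or a Quartz cron expression with 6 or 7 "
            f"fields, got {len(fields)} field(s): {frequency!r}"
        )
    return "schedule"


def normalize_instance(instance: str) -> str:
    """Strip a leading scheme and trailing slashes from the instance host.

    The job creation code prepends "https://" itself, so the configured
    value should be a bare host name.
    """
    return re.sub(r"^https?://", "", instance.strip()).rstrip("/")
```

For example, `classify_frequency("0 0/5 * * * ?")` returns `"schedule"`, while a five-field Unix-style expression such as `"*/5 * * * *"` raises a `ValueError`.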


In [31]:
# Define the schema
schema_map = [
    ("row_data", "row_data"),
    ("id", "global_key")
]
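
How the upload notebook consumes `schema_map` is not shown here. The sketch below assumes each pair reads as (source table column, Labelbox field name), which is an interpretation rather than something this notebook confirms. The helper `apply_schema_map` and the example record are hypothetical, for illustration only.

```python
from typing import Any, Dict, List, Tuple

# Same pairs as the cell above, assumed to mean (table column, Labelbox field).
schema_map: List[Tuple[str, str]] = [
    ("row_data", "row_data"),
    ("id", "global_key"),
]


def apply_schema_map(
    record: Dict[str, Any], mapping: List[Tuple[str, str]]
) -> Dict[str, Any]:
    """Rename the mapped columns of one record and drop unmapped columns."""
    missing = [src for src, _ in mapping if src not in record]
    if missing:
        raise KeyError(f"Record is missing mapped column(s): {missing}")
    return {dst: record[src] for src, dst in mapping}


# Hypothetical table row; "width" is not in the map, so it is dropped.
example_row = {
    "id": "img-001",
    "row_data": "https://example.com/img-001.jpg",
    "width": 640,
}
mapped = apply_schema_map(example_row, schema_map)
print(mapped)
```

Under this reading, the table's `id` column would become the `global_key` of each uploaded row.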

<b>Do not edit the code below this line unless you want to enable additional or custom functionality.</b>

In [None]:
import requests
import json

# ----- JOB SCHEDULING LOGIC -----

# If the job needs to run continuously, use the "continuous" block
# Else, use the "schedule" block with the specified cron frequency
if frequency == "continuous":
    schedule_block = {
        "continuous": {
            "pause_status": "UNPAUSED"
        }
    }
else:
    schedule_block = {
        "schedule": {
            "quartz_cron_expression": frequency,
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED"
        }
    }

# ----- JOB DEFINITION -----

# Define the parameters and structure of the job to be created in Databricks
payload = {
    "run_as": {"user_name": email},
    "name": "upload_to_labelbox",
    "email_notifications": {"no_alert_for_skipped_runs": False},
    "webhook_notifications": {},
    "timeout_seconds": 0,
    "max_concurrent_runs": 1,
    "tasks": [
        {
            "task_key": "upload_to_labelbox",
            "run_if": "ALL_SUCCESS",
            "notebook_task": {
                "notebook_path": "notebooks/databricks_pipeline_creator/upload_to_labelbox",
                "base_parameters": {
                    "dataset_id": dataset_id,
                    "table_path": table_path,
                    "labelbox_api_key": labelbox_api_key,
                },
                "source": "GIT"
            },
            "job_cluster_key": "Job_cluster",
            "libraries": [
                {"pypi": {"package": "labelspark"}},
                {"pypi": {"package": "labelbox==3.49.1"}},
                {"pypi": {"package": "numpy==1.25"}},
                {"pypi": {"package": "opencv-python==4.8.0.74"}}
            ],
            "timeout_seconds": 0,
            "email_notifications": {},
            "notification_settings": {
                "no_alert_for_skipped_runs": False,
                "no_alert_for_canceled_runs": False,
                "alert_on_last_attempt": False
            }
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "Job_cluster",
            "new_cluster": {
                "cluster_name": "",
                "spark_version": "13.3.x-scala2.12",
                "gcp_attributes": {
                    "use_preemptible_executors": False,
                    "availability": "ON_DEMAND_GCP",
                    "zone_id": "HA"
                },
                "node_type_id": "n2-highmem-4",
                "enable_elastic_disk": True,
                "data_security_mode": "SINGLE_USER",
                "runtime_engine": "STANDARD",
                "autoscale": {
                    "min_workers": 1,
                    "max_workers": 10
                }
            }
        }
    ],
    "git_source": {
        "git_url": "https://github.com/Labelbox/labelspark.git",
        "git_provider": "gitHub",
        "git_branch": "master"
    },
    "format": "MULTI_TASK"
}

# Merge the scheduling configuration into the main job payload
payload.update(schedule_block)

# ----- JOB CREATION -----

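# Build the endpoint URL for the Databricks REST API job creation.
# The payload uses multi-task fields ("tasks", "job_clusters", "git_source"), which belong to Jobs API 2.1.
url = f"https://{databricks_instance}/api/2.1/jobs/create"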
# Define the authentication headers
headers = {
    "Authorization": f"Bearer {databricks_api_key}",
    "Content-Type": "application/json",
}

# Send the POST request to Databricks to create the job
response = requests.post(url, data=json.dumps(payload), headers=headers)

# ----- RESPONSE HANDLING -----

# Print the response
# If the response code is 200, it means the job was created successfully.
# Otherwise, print the error message received.
if response.status_code == 200:
    print("Job created successfully.")
else:
    print(f"Failed to create job. Error: {response.text}")
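
The response handling above prints the raw response text on failure. As a minimal sketch, the response could instead be summarized with a small pure function that is easy to test without calling Databricks. The helper name `describe_create_response` is hypothetical, and the sketch assumes the create endpoint returns a JSON body with a `job_id` field on success and `error_code` and `message` fields on failure. Check this against the API reference for your workspace.

```python
import json


def describe_create_response(status_code: int, body: str) -> str:
    """Turn an HTTP status code and response body into a one-line summary."""
    try:
        parsed = json.loads(body)
    except ValueError:
        # Body was not JSON (for example, an HTML error page from a proxy).
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    if status_code == 200:
        job_id = parsed.get("job_id")
        if job_id is not None:
            return f"Job created successfully. job_id={job_id}"
        return "Job created successfully."

    # Assumed error shape: {"error_code": "...", "message": "..."}
    error_code = parsed.get("error_code", "UNKNOWN")
    message = parsed.get("message", body)
    return f"Failed to create job (HTTP {status_code}, {error_code}): {message}"
```

In the cell above, this would be called as `print(describe_create_response(response.status_code, response.text))`.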