# Labelspark - The Databricks and Labelbox connector

Thanks for trying out the Databricks and Labelbox Connector! Unless you downloaded this directly from our repository, you or someone from your organization signed up for a Labelbox trial through Databricks Partner Connect. This notebook helps illustrate how Labelbox and Databricks can be used together to power unstructured data workflows. 

Labelbox can be used to rapidly annotate a variety of unstructured data from your Data Lake ([images](https://labelbox.com/product/image), [video](https://labelbox.com/product/video), [text](https://labelbox.com/product/text), and [geospatial tiled imagery](https://docs.labelbox.com/docs/tiled-imagery-editor)). The Labelbox Connector for Databricks makes it easy to bring the annotations back into your Lakehouse environment for AI/ML and analytical workflows.

<img src="https://labelbox.com/static/images/partnerships/collab-chart.svg" alt="example-workflow" width="800"/>

<b>Questions or comments?</b> Reach out to us at [hello+databricks@labelbox.com](mailto:hello+databricks@labelbox.com).
  
Last Updated: December 20, 2022

# Set Up Labelspark Client

This tutorial notebook requires a Labelbox API key. Log in to your [Labelbox Account](https://app.labelbox.com/) and generate an [API Key](https://docs.labelbox.com/reference/create-api-key).
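
If you would rather not paste the key into a notebook cell, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `load_api_key` and the variable name `LABELBOX_API_KEY` are assumptions made for this example and are not part of Labelspark.

```python
import os


def load_api_key(env=os.environ, var_name="LABELBOX_API_KEY"):
    """Return the API key stored under var_name, or raise if it is missing or blank.

    Both this helper and the variable name are hypothetical, chosen for illustration.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise ValueError(f"Set {var_name} before creating the Labelspark client")
    return key


# Use an explicit mapping here so the sketch runs without touching the real environment
demo_key = load_api_key({"LABELBOX_API_KEY": "  example-key  "})
print(demo_key)
```

The returned string could then be assigned to `api_key` in the cell that creates the client.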

In [0]:
%pip install labelspark --upgrade -q

In [0]:
import labelspark

In [0]:
api_key = ""

In [0]:
lbspk_client = labelspark.Client(api_key)

# **Create a Databricks Spark Table from a Labelbox Dataset**

`client.create_table_from_dataset()` will create a Databricks Spark table from a Labelbox dataset given the following: 
- `labelbox Client` object (`labelbox.client.Client`)
- An existing `labelbox_dataset` object `(labelbox.schema.dataset.Dataset)`
- An optional `metadata_index` dictionary
  - This must be a dictionary where {key=metadata_field_name : value=metadata_type} 
    - `metadata_field_name` will correspond to a column in your Databricks Spark table  
    - `metadata_type` must be `"string"`, `"enum"`, `"datetime"` or `"number"`
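
Because `metadata_type` is limited to four values, it can help to check a `metadata_index` before passing it along. The following is a minimal sketch in plain Python; `invalid_metadata_entries` is a hypothetical helper written for this example, not a Labelspark function.

```python
# The four metadata types listed above
ALLOWED_METADATA_TYPES = {"string", "enum", "datetime", "number"}


def invalid_metadata_entries(metadata_index):
    """Return {field_name: metadata_type} for every entry whose type is not allowed."""
    return {
        name: mtype
        for name, mtype in metadata_index.items()
        if mtype not in ALLOWED_METADATA_TYPES
    }


good_index = {"split": "enum"}
bad_index = {"split": "enum", "captured_at": "date"}  # "date" is not an allowed type

print(invalid_metadata_entries(good_index))
print(invalid_metadata_entries(bad_index))
```

An empty result means every entry uses one of the allowed types.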

In [0]:
labelbox_dataset_id = ""
lb_dataset = lbspk_client.lb_client.get_dataset(labelbox_dataset_id)

metadata_index = {
    "split" : "enum"
}

In [0]:
spark_table = lbspk_client.create_table_from_dataset(
  lb_client=lbspk_client.lb_client, 
  lb_dataset=lb_dataset, 
  metadata_index=metadata_index
)

In [0]:
# Let's view your generated Databricks Spark table
spark_table.show(5)

# **Create Labelbox Data Rows from a Databricks Spark Table**

`client.create_data_rows_from_table()` will create Labelbox data rows given the following:
- `labelbox Client` object (`labelbox.client.Client`)
- An existing `labelbox_dataset` object `(labelbox.schema.dataset.Dataset)`
- An existing `spark_table` object
- A column name for your Labelbox data row `row_data`
- An optional column name for your Labelbox data row `global_key` - defaults to `row_data`
- An optional column name for your Labelbox data row `external_id` - defaults to `global_key`
- An optional `metadata_index` dictionary
  - This must be a dictionary where {key=metadata_field_name : value=metadata_type} 
    - `metadata_field_name` will correspond to a column in your Databricks Spark table  
    - `metadata_type` must be `"string"`, `"enum"`, `"datetime"` or `"number"`
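
The defaulting chain described above (`global_key` falls back to `row_data`, and `external_id` falls back to `global_key`) can be pictured with a small plain-Python sketch. The function `resolve_key_columns` is hypothetical and only mirrors the description in this list; it is not how Labelspark is necessarily implemented.

```python
def resolve_key_columns(row_data_col, global_key_col=None, external_id_col=None):
    """Apply the documented defaults to the column names used for each data row field."""
    # global_key defaults to the row_data column
    global_key_col = global_key_col or row_data_col
    # external_id defaults to whatever global_key resolved to
    external_id_col = external_id_col or global_key_col
    return {
        "row_data": row_data_col,
        "global_key": global_key_col,
        "external_id": external_id_col,
    }


only_row_data = resolve_key_columns("row_data")
with_global_key = resolve_key_columns("row_data", global_key_col="new_global_key")

print(only_row_data)
print(with_global_key)
```

With only `row_data` given, all three fields come from the same column; with a `global_key` column given, `external_id` follows it.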

In [0]:
# We're going to create a new_global_key column to use for uploading data
from pyspark.sql.functions import lit, concat
import uuid

new_suffix = f"_{uuid.uuid4()}"
# Add a column holding the fixed suffix so it can be concatenated onto the existing global keys
spark_table = spark_table.withColumn("global_key_suffix", lit(new_suffix))
# Build the new_global_key column as global_key + suffix
spark_table = spark_table.withColumn("new_global_key", concat("global_key", "global_key_suffix"))

# We're going to create a new dataset to upload data to; you can also pass in an existing dataset
lb_dataset = lbspk_client.lb_client.create_dataset(name="clone_dataset")
print(f"Labelbox Dataset: {lb_dataset.uid}")

metadata_index = {
    "split" : "enum"
}

spark_table.show(5)

In [0]:
upload_results = lbspk_client.create_data_rows_from_table(
  lb_client=lbspk_client.lb_client, 
  spark_table=spark_table,
  lb_dataset=lb_dataset,
  row_data_col="row_data",
  global_key_col="new_global_key",
  metadata_index=metadata_index
)

In [0]:
upload_results

# **Upsert Labelbox Metadata Values with Databricks Spark table column data**

`client.upsert_labelbox_metadata()` will upsert Labelbox metadata based on the most recent column values from your Databricks Spark table given the following:

- `labelbox Client` object (`labelbox.client.Client`)
- An existing `labelbox_dataset` object `(labelbox.schema.dataset.Dataset)`
- A list of `global_keys` that correspond to data rows to-be-upserted
- A column name for your Labelbox data row `global_key`
- A `metadata_index` where each key is a metadata field you're looking to upsert in Labelbox
  - This must be a dictionary where {key=metadata_field_name : value=metadata_type} 
    - `metadata_type` must be `"string"`, `"enum"`, `"datetime"` or `"number"`
  - Each key passed in to the `metadata_index` will correspond to a column in your Databricks Spark table
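
The list of `global_keys` narrows which data rows are upserted. The sketch below shows one plausible reading of that selection in plain Python, assuming that an empty list means "use every global key in the table" and that keys not present in the table are ignored. Both points are assumptions, and `select_global_keys` is a hypothetical helper, not part of Labelspark.

```python
def select_global_keys(table_keys, global_keys_list=None):
    """Return the keys to upsert: the requested subset, or every table key if none are requested."""
    if not global_keys_list:
        # Assumption: an empty (or missing) list selects all keys from the table
        return list(table_keys)
    requested = set(global_keys_list)
    # Assumption: requested keys that are absent from the table are skipped
    return [key for key in table_keys if key in requested]


table_keys = ["img_1", "img_2", "img_3"]
all_keys = select_global_keys(table_keys, [])
subset = select_global_keys(table_keys, ["img_2", "img_9"])

print(all_keys)
print(subset)
```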

In [0]:
# For this demo, let's change the existing metadata values in our Databricks Spark table and then upsert metadata values on Labelbox
spark_table = spark_table.withColumn("split", lit("train"))
spark_table.show(5)

In [0]:
results = lbspk_client.upsert_labelbox_metadata(
  lb_client=lbspk_client.lb_client,
  spark_table=spark_table,
  global_key_col="new_global_key",
  global_keys_list=[],  # An empty list means all global keys from the table are used
  metadata_index=metadata_index
)

In [0]:
results # You can view metadata changes on Labelbox

# **Upsert Databricks Spark Table Values with the Labelbox Metadata**

`client.upsert_table_metadata()` will upsert a Databricks Spark table based on the most recent metadata from Labelbox given the following:
- `labelbox Client` object (`labelbox.client.Client`)
- An existing `labelbox_dataset` object `(labelbox.schema.dataset.Dataset)`
- An existing `spark_table`
- A column name for your Labelbox data row `global_key`
- A `metadata_index` where each key is a column you're looking to update in your Databricks Spark table
  - This must be a dictionary where {key=metadata_field_name : value=metadata_type} 
    - `metadata_type` must be `"string"`, `"enum"`, `"datetime"` or `"number"`
  - Each key passed in to the `metadata_index` will correspond to a column in your Databricks Spark table
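
Conceptually, this upsert matches table rows to Labelbox data rows by global key and overwrites only the columns named in `metadata_index`. The plain-Python sketch below illustrates that idea on a list of dictionaries. The function `upsert_rows` and the sample data are hypothetical; the real method works on a Spark table and may differ in detail.

```python
def upsert_rows(rows, labelbox_metadata, global_key_col, metadata_index):
    """Return copies of rows with metadata_index columns refreshed from labelbox_metadata.

    labelbox_metadata maps a global key to {field_name: latest_value}.
    """
    updated = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        latest = labelbox_metadata.get(row[global_key_col], {})
        for field in metadata_index:
            if field in latest:
                new_row[field] = latest[field]
        updated.append(new_row)
    return updated


rows = [
    {"global_key": "img_1", "split": "train"},
    {"global_key": "img_2", "split": "train"},
]
labelbox_metadata = {"img_1": {"split": "valid"}}  # only img_1 has newer metadata

refreshed = upsert_rows(rows, labelbox_metadata, "global_key", {"split": "enum"})
print(refreshed)
```

Rows without a matching global key keep their existing values in this sketch.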

In [0]:
metadata_index = {
    "split" : "enum"
}

# We'll reuse the same spark_table, but this time restore the original metadata values by matching on the original global_key column
spark_table.show(5) 

In [0]:
spark_table = lbspk_client.upsert_table_metadata(
  lb_client=lbspk_client.lb_client, 
  spark_table=spark_table, 
  global_key_col="global_key", 
  metadata_index=metadata_index
)

In [0]:
spark_table.show(5)