<td>
   <a target="_blank" href="https://labelbox.com" ><img src="https://labelbox.com/blog/content/images/2021/02/logo-v4.svg" width=256/></a>
</td>

# __*Labelbox Connector for Databricks Tutorial Notebook*__

# _**Creating Data Rows with LabelSpark**_

<td>
<a href="https://github.com/Labelbox/labelspark/blob/master/urls.ipynb" target="_blank"><img
src="https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white" alt="GitHub"></a>
</td>

Thanks for trying out the Databricks and Labelbox Connector! You or someone from your organization signed up for a Labelbox trial through Databricks Partner Connect. This notebook was loaded into your Shared directory to help illustrate how Labelbox and Databricks can be used together to power unstructured data workflows. 

Labelbox can be used to rapidly annotate a variety of unstructured data from your Data Lake ([images](https://labelbox.com/product/image), [video](https://labelbox.com/product/video), [text](https://labelbox.com/product/text), and [geospatial tiled imagery](https://docs.labelbox.com/docs/tiled-imagery-editor)). The Labelbox Connector for Databricks then makes it easy to bring the annotations back into your Lakehouse environment for AI/ML and analytical workflows.

If you would like to watch a video of the workflow, check out our [Data & AI Summit Demo](https://databricks.com/session_na21/productionizing-unstructured-data-for-ai-and-analytics). 


<img src="https://labelbox.com/static/images/partnerships/collab-chart.svg" alt="example-workflow" width="800"/>

<h5>Questions or comments? Reach out to us at <a href="mailto:ecosystem+databricks@labelbox.com">ecosystem+databricks@labelbox.com</a></h5>

## _**Documentation**_

### **Data Rows**
_____________________

**Requirements:**

- A `row_data` column - Each value in this column must be a URL that points to the asset to be uploaded

- Either a `dataset_id` column or an input argument for `dataset_id`
  - If uploading to multiple datasets, provide a `dataset_id` column 
  - If uploading to one dataset, provide a `dataset_id` input argument
    - _This can still be a column if it's already in your CSV file_
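The requirements above can be checked before uploading. The following is a minimal sketch, not part of LabelSpark: `check_required_columns` is a hypothetical helper that takes the column names of your table (for a Spark DataFrame, `table.columns`) and an optional `dataset_id` argument, and reports anything that is missing.

```python
from typing import Iterable, List, Optional


def check_required_columns(columns: Iterable[str], dataset_id: Optional[str] = None) -> List[str]:
    """Hypothetical pre-upload check; an empty list means the requirements are met."""
    present = set(columns)
    problems = []
    # `row_data` must always be a column, since it holds the asset URLs
    if "row_data" not in present:
        problems.append("missing required column: row_data")
    # `dataset_id` may come from a column or from an input argument
    if "dataset_id" not in present and not dataset_id:
        problems.append("provide a dataset_id column or a dataset_id argument")
    return problems


# One dataset: pass the ID as an argument ("<your-dataset-id>" is a placeholder)
print(check_required_columns(["row_data", "global_key"], dataset_id="<your-dataset-id>"))
# Multiple datasets: the ID comes from a column instead
print(check_required_columns(["row_data", "dataset_id"]))
```

Running a check like this first gives a readable message instead of a failure partway through an upload.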

**Recommended:**
- A `global_key` column
  - This column contains unique identifiers for your data rows
  - If none is provided, it defaults to your `row_data` column
- An `external_id` column
  - This column contains non-unique identifiers for your data rows
  - If none is provided, it defaults to your `global_key` column
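The fallback chain described above (`external_id` falls back to `global_key`, which falls back to `row_data`) can be illustrated with plain Python. This is only a sketch of the documented behavior for a single row; `fill_identifier_defaults` is a hypothetical name, and the connector's actual implementation may differ.

```python
from typing import Dict, Optional


def fill_identifier_defaults(row: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Return a copy of `row` with the documented identifier defaults applied."""
    filled = dict(row)
    # No global key: fall back to the asset URL in `row_data`
    if not filled.get("global_key"):
        filled["global_key"] = filled["row_data"]
    # No external ID: fall back to the (possibly defaulted) global key
    if not filled.get("external_id"):
        filled["external_id"] = filled["global_key"]
    return filled


# The URL below is a placeholder, not a real asset
row = {"row_data": "https://example.com/image-1.jpg"}
print(fill_identifier_defaults(row))
```

Because global keys must be unique, relying on the `row_data` fallback means the same URL cannot be uploaded twice under different data rows unless you supply your own `global_key` values.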

**Optional:**
- A `project_id` column or an input argument for `project_id`
  - If batching to multiple projects, provide a `project_id` column
  - If batching to one project, provide a `project_id` input argument
    - _This can still be a column if it's already in your CSV file_
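To make the two `project_id` options concrete, here is a small sketch of how rows might be grouped by target project. The CSV contents, URLs, and IDs are placeholders, `group_keys_by_project` is a hypothetical helper, and the rule that a column value takes precedence over the input argument is an assumption made for illustration, not documented LabelSpark behavior.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Optional

# Placeholder CSV: two rows target project-1, one row targets project-2
CSV_TEXT = """row_data,global_key,project_id
https://example.com/a.jpg,key-a,project-1
https://example.com/b.jpg,key-b,project-2
https://example.com/c.jpg,key-c,project-1
"""


def group_keys_by_project(csv_text: str, project_id: Optional[str] = None) -> Dict[str, List[str]]:
    """Group global keys by project, reading the column first and the argument second."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: a per-row column value wins over the input argument
        target = row.get("project_id") or project_id
        if target:
            groups[target].append(row["global_key"])
    return dict(groups)


print(group_keys_by_project(CSV_TEXT))
```

If the CSV had no `project_id` column, passing `project_id="project-1"` to the helper would place every row in that single project, which mirrors the one-project case above.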

## _**Code**_

Install LabelSpark

In [0]:
%pip install labelspark -q

In [0]:
import labelspark as ls

In [0]:
table_name = "labelspark_demo" # Name to register test table under
csv_path = "https://raw.githubusercontent.com/Labelbox/labelspark/master/datasets/urls.csv" # Path to your CSV file
api_key = "" # Your Labelbox API key

Load a CSV as a Spark Table

In [0]:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('labelspark-demo').getOrCreate() # Create a Spark Session
df = pd.read_csv(csv_path) # Read CSV in as Pandas DataFrame
table = spark.createDataFrame(df) # Convert your Pandas DataFrame into a Spark DataFrame

Register your Spark table (for this demo, we only register it as a temporary view)

In [0]:
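existing_tables = spark.catalog.listTables() # Tables and views visible in the current session

# Register the temporary view only if the name is not already in use
if not any(t.name == table_name for t in existing_tables):
  table.createOrReplaceTempView(table_name)
  print(f"Registered table: {table_name}")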

In [0]:
display(spark.sql(f"select * from {table_name} LIMIT 5")) # Preview the first 5 rows using the session created above

Create a Labelbox client and a dataset

In [0]:
client = ls.Client(lb_api_key=api_key)

In [0]:
dataset = client.lb_client.create_dataset(name="LabelSpark-demo")

Upload to Labelbox

In [0]:
results = client.create_data_rows_from_table(
    table = table,
    dataset_id = dataset.uid,
    skip_duplicates = False, # If True, will skip data rows where a global key is already in use
    verbose = True, # If True, prints information about code execution
)