<td>
   <a target="_blank" href="https://labelbox.com" ><img src="https://labelbox.com/blog/content/images/2021/02/logo-v4.svg" width=256/></a>
</td>

# _**Creating Data Rows with Attachments Using LabelSpark**_

<td>
<a href="https://github.com/Labelbox/labelspark/blob/master/attachments.ipynb" target="_blank"><img
src="https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white" alt="GitHub"></a>
</td>

## _**Documentation**_

### **Data Rows**
_____________________

**Requirements:**

- A `row_data` column - Each value must be a URL that points to the asset to be uploaded

- Either a `dataset_id` column or an input argument for `dataset_id`
  - If uploading to multiple datasets, provide a `dataset_id` column 
  - If uploading to one dataset, provide a `dataset_id` input argument
    - _This can still be a column if it's already in your CSV file_

**Recommended:**
- A `global_key` column
  - This column contains unique identifiers for your data rows
  - If none is provided, it defaults to the values in your `row_data` column
- An `external_id` column
  - This column contains non-unique identifiers for your data rows
  - If none is provided, it defaults to the values in your `global_key` column

**Optional:**
- A `project_id` column or an input argument for `project_id`
  - If batching to multiple projects, provide a `project_id` column
  - If batching to one project, provide a `project_id` input argument
    - _This can still be a column if it's already in your CSV file_
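
The column rules above can be checked before an upload. The following is a minimal sketch, not LabelSpark's own validation: `check_columns` and `resolve_identifiers` are hypothetical helper names, and the fallback logic only mirrors the defaults described in the lists above.

```python
def check_columns(columns, dataset_id=None):
    """Return a list of problems found in a table's column names.

    Hypothetical pre-flight helper; not part of LabelSpark.
    """
    cols = set(columns)
    problems = []
    # `row_data` is always required
    if "row_data" not in cols:
        problems.append("missing required column: row_data")
    # `dataset_id` may come from a column or from an input argument
    if "dataset_id" not in cols and not dataset_id:
        problems.append("provide a dataset_id column or a dataset_id argument")
    return problems


def resolve_identifiers(row):
    """Apply the documented fallbacks to one row (a plain dict).

    global_key falls back to row_data; external_id falls back to global_key.
    """
    global_key = row.get("global_key") or row["row_data"]
    external_id = row.get("external_id") or global_key
    return {"global_key": global_key, "external_id": external_id}
```

With a pandas DataFrame, `check_columns(df.columns, dataset_id=...)` would be one way to call it; an empty list means the required columns are present.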

### **Attachments**
_____________________

For attachments, the column name must follow the pattern `attachment` + `divider` + `attachment_type` + `divider` + `column_name`
  - Example, with `///` as the divider: `attachment///raw_text///sample_column_name`
  - `attachment_type` must be one of the following:
    - `image`, `video`, `raw_text`, `html`, `text_url`
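
To make the naming pattern concrete, here is a small sketch that splits a column name into its parts. `parse_attachment_column` is a hypothetical helper written for this illustration, not a LabelSpark function, and the `///` default is taken from the example above.

```python
ATTACHMENT_TYPES = {"image", "video", "raw_text", "html", "text_url"}


def parse_attachment_column(name, divider="///"):
    """Split an attachment column name into (attachment_type, column_name).

    Returns None for columns that are not attachment columns.
    Raises ValueError for malformed names or unsupported attachment types.
    """
    prefix = "attachment" + divider
    if not name.startswith(prefix):
        return None
    # Split only once so the column name itself may contain the divider
    parts = name[len(prefix):].split(divider, 1)
    if len(parts) != 2 or not parts[1]:
        raise ValueError(f"malformed attachment column name: {name!r}")
    attachment_type, column_name = parts
    if attachment_type not in ATTACHMENT_TYPES:
        raise ValueError(f"unsupported attachment type: {attachment_type!r}")
    return attachment_type, column_name
```

Running this over every column name of a table before uploading is one way to catch typos such as a misspelled attachment type.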


Attachment values must match the format Labelbox expects for the given attachment type
  - More here:
    - [Labelbox docs on attachments](https://docs.labelbox.com/docs/asset-attachments)
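
As a rough illustration of matching values to types, the sketch below flags values that do not look right for their attachment type. It assumes that every type except `raw_text` takes an `http(s)` URL, which is inferred from the sample CSV used later in this notebook; the Labelbox docs linked above are the authority, and `attachment_value_problem` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Assumption for this sketch: these types take a URL, raw_text takes plain text
URL_TYPES = {"image", "video", "html", "text_url"}


def attachment_value_problem(attachment_type, value):
    """Return a description of the problem with a value, or None if it looks fine."""
    if not isinstance(value, str) or not value.strip():
        return "empty value"
    if attachment_type in URL_TYPES:
        parsed = urlparse(value)
        # Restricting to http(s) is a simplification made for this sketch
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            return f"{attachment_type} expects a URL, got {value!r}"
    return None
```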

## _**Code**_

Install LabelSpark

In [0]:
%pip install labelspark -q

In [0]:
import labelspark as ls

In [0]:
table_name = "labelspark_attachments_demo" # Name to register test table under
csv_path = "https://raw.githubusercontent.com/Labelbox/labelspark/master/datasets/attachments.csv" # Path to your CSV file
api_key = "" # Your Labelbox API key
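
Rather than pasting the key into the notebook, one option is to read it from the environment. This is a sketch under assumptions: `LABELBOX_API_KEY` is a variable name chosen for this example, and `load_api_key` is a hypothetical helper, not part of LabelSpark.

```python
import os


def load_api_key(env=None, var_name="LABELBOX_API_KEY"):
    """Read the API key from an environment mapping and fail early if it is missing."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before running the upload cells")
    return key
```

Calling `api_key = load_api_key()` in place of the empty string keeps the key out of the saved notebook.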

Load a CSV as a Spark Table

In [0]:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('labelspark-demo').getOrCreate() # Create a Spark Session
df = pd.read_csv(csv_path) # Read CSV in as Pandas DataFrame
table = spark.createDataFrame(df) # Convert your Pandas DataFrame into a Spark DataFrame

Register your Spark table (for this demo, it is only registered as a temporary view)

In [0]:
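tblList = spark.catalog.listTables() # Tables and views already registered in the current database

# Register the DataFrame as a temporary view only if the name is not already taken
if not any(x.name == table_name for x in tblList):
    table.createOrReplaceTempView(table_name)
    print(f"Registered table: {table_name}")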

In [0]:
display(spark.sql(f"select * from {table_name} LIMIT 5")) # display() is provided by Databricks notebooks

external_id,row_data,global_key,attachment///image///sample_col_1,attachment///video///sample_col_2,attachment///text_url///sample_col_3,attachment///raw_text///sample_col_4,attachment///html///sample_col_5
Euq7yrfb8tbDFpd-cv_cpg.jpg,https://labelbox.s3-us-west-2.amazonaws.com/datasets/mapillary_traffic/images/Euq7yrfb8tbDFpd-cv_cpg.jpg,labelspark-attachments-test-Euq7yrfb8tbDFpd-cv_cpg.jpg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4,https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt,Sample Raw Text,https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html
gCbn5IeZtE92OaUbyl1ZjQ.jpg,https://labelbox.s3-us-west-2.amazonaws.com/datasets/mapillary_traffic/images/gCbn5IeZtE92OaUbyl1ZjQ.jpg,labelspark-attachments-test-gCbn5IeZtE92OaUbyl1ZjQ.jpg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4,https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt,Sample Raw Text,https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html
9Y6-Vl3bwsZFTNxX8gqHYw.jpg,https://labelbox.s3-us-west-2.amazonaws.com/datasets/mapillary_traffic/images/9Y6-Vl3bwsZFTNxX8gqHYw.jpg,labelspark-attachments-test-9Y6-Vl3bwsZFTNxX8gqHYw.jpg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4,https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt,Sample Raw Text,https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html
1MnLIosQZmXH3T-iU-4mtQ.jpg,https://labelbox.s3-us-west-2.amazonaws.com/datasets/mapillary_traffic/images/1MnLIosQZmXH3T-iU-4mtQ.jpg,labelspark-attachments-test-1MnLIosQZmXH3T-iU-4mtQ.jpg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4,https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt,Sample Raw Text,https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html
y_9N4kVjlc_AO3C63k2L9w.jpg,https://labelbox.s3-us-west-2.amazonaws.com/datasets/mapillary_traffic/images/y_9N4kVjlc_AO3C63k2L9w.jpg,labelspark-attachments-test-y_9N4kVjlc_AO3C63k2L9w.jpg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/disease_attachment.jpeg,https://storage.googleapis.com/labelbox-sample-datasets/Docs/drone_video.mp4,https://storage.googleapis.com/labelbox-sample-datasets/Docs/text_attachment.txt,Sample Raw Text,https://storage.googleapis.com/labelbox-sample-datasets/Docs/windy.html


Create a LabelSpark client and a Labelbox dataset

In [0]:
client = ls.Client(lb_api_key=api_key)

In [0]:
dataset = client.lb_client.create_dataset(name="LabelSpark-attachments-demo")

Upload to Labelbox

In [0]:
results = client.create_data_rows_from_table(
    table = table,
    dataset_id = dataset.uid,
    skip_duplicates = False, # If True, will skip data rows where a global key is already in use
    verbose = True, # If True, prints information about code execution
)