<td>
   <a target="_blank" href="https://labelbox.com" ><img src="https://labelbox.com/blog/content/images/2021/02/logo-v4.svg" width=190/></a>
</td>

# Labelpandas - The Labelbox <> Pandas Connector
**Instantly Load CSVs (and other Tables) into Labelbox**

---

This notebook walks through basic usage of the Labelpandas Python SDK.

**Pandas** is a widely used Python library for loading and manipulating CSVs and other tabular data.

**Labelpandas** combines Labelbox and Pandas to make uploading CSVs and tabular data to Labelbox straightforward. It handles both local files and cloud-hosted assets.

In [1]:
!pip install labelpandas --upgrade -q

In [2]:
import labelpandas as lbpd

# Set up Labelpandas Client

In [3]:
labelbox_api_key = ""

In [4]:
client = lbpd.Client(lb_api_key=labelbox_api_key)

# Load CSV

To upload data rows from a CSV, your CSV **must** have the following:

- A column containing your **row data** as a string value - either an asset URL (pointing to cloud storage) or a local file path
    
- A column containing your **global key** as a string value - an externally facing ID that must be unique (Labelbox enforces this)
    - If you attempt to upload a data row with an existing global key, Labelpandas either skips the row (`skip_duplicates=True`) or generates a unique global key with a suffix such as "_1" (`skip_duplicates=False`)
    
**To upload data rows with metadata, your CSV must have one column per metadata field name**. Labelpandas matches the column names to Labelbox metadata names when uploading metadata.
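The requirements above can be checked locally before anything is uploaded. The following is a minimal sketch, not part of Labelpandas: `check_upload_table` is a hypothetical helper, and the default column names `row_data` and `global_key` are assumptions that match the demo CSV in this notebook. It only catches duplicates within the table itself, not collisions with global keys that already exist in Labelbox.

```python
import pandas as pd


def check_upload_table(df, row_data_col="row_data", global_key_col="global_key"):
    """Hypothetical pre-upload check: return a list of problems (empty if none)."""
    problems = []

    # Both required columns must be present before any further checks make sense.
    for col in (row_data_col, global_key_col):
        if col not in df.columns:
            problems.append(f"missing required column: {col}")
    if problems:
        return problems

    # Every row needs an asset URL or file path, and a global key.
    for col in (row_data_col, global_key_col):
        if df[col].isna().any():
            problems.append(f"{col} contains empty values")

    # Global keys must be unique within the table.
    keys = df[global_key_col].dropna()
    dupes = keys[keys.duplicated(keep=False)]
    if not dupes.empty:
        problems.append(f"duplicate global keys: {sorted(set(dupes.astype(str)))}")

    return problems


# Example table with one repeated global key (URLs are placeholders).
example = pd.DataFrame({
    "global_key": ["a", "b", "b"],
    "row_data": [
        "https://example.com/1.jpg",
        "https://example.com/2.jpg",
        "https://example.com/3.jpg",
    ],
})
problems = check_upload_table(example)
print(problems)
```

An empty list means the table passed these local checks; it does not guarantee that the upload itself will succeed.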

In [5]:
from io import StringIO
import uuid

demo_csv = f"""global_key,row_data,split
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000569539.jpg,train
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000288451.jpg,train
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000240902.jpg,train
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_val2014_000000428116.jpg,train
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_val2014_000000459566.jpg,train
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000442982.jpg,train
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000569538.jpg,valid
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000022415.jpg,valid
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_val2014_000000146981.jpg,test
{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000173046.jpg,test"""

In [6]:
# Load a CSV file into pandas with df = pd.read_csv(file_path_as_string)
import pandas as pd

df = pd.read_csv(StringIO(demo_csv))
df.head()

,global_key,row_data,split
0,99aad74a-0ce1-41d4-b172-97abbe4ae8b2,https://storage.googleapis.com/diagnostics-dem...,train
1,b47b1358-3d81-4384-920a-d4a08cbe7ffe,https://storage.googleapis.com/diagnostics-dem...,train
2,2a75b633-2266-4ec0-8441-8e41374a04e2,https://storage.googleapis.com/diagnostics-dem...,train
3,7c75170d-26c8-4a8b-8e74-3d1483d7719b,https://storage.googleapis.com/diagnostics-dem...,train
4,53d6ecd1-cc80-42ef-9e11-2be9b7cf879d,https://storage.googleapis.com/diagnostics-dem...,train
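Because the global key and row data columns are expected to hold string values, it can help to tell pandas not to infer numeric types when reading your own CSV. The sketch below uses a small made-up CSV (the keys and URLs are placeholders) to show the difference. The `dtype` argument is standard `pd.read_csv` behavior, but whether Labelpandas would accept integer keys is not covered here.

```python
from io import StringIO

import pandas as pd

# A made-up CSV whose global keys look like numbers with leading zeros.
csv_text = (
    "global_key,row_data,split\n"
    "001,https://example.com/a.jpg,train\n"
    "002,https://example.com/b.jpg,test\n"
)

# Default type inference parses the keys as integers and drops the leading zeros.
inferred = pd.read_csv(StringIO(csv_text))

# Forcing the two required columns to str keeps the values exactly as written.
as_text = pd.read_csv(StringIO(csv_text), dtype={"global_key": str, "row_data": str})

print(inferred["global_key"].tolist())  # [1, 2]
print(as_text["global_key"].tolist())   # ['001', '002']
```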


# Create a `metadata_index`

* Your `metadata_index` is a dictionary of the form `{column_name : metadata_field_type}`
    * `column_name` must be a column in your table and must correspond to a Labelbox metadata field name. Labelpandas uses these names to sync data.
    * `metadata_field_type` must be one of the following string values: **"datetime"**, **"enum"**, **"string"**, **"number"**
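The two rules above can be sketched as a small local check. `check_metadata_index` and `ALLOWED_METADATA_TYPES` are hypothetical names introduced for illustration and are not part of the Labelpandas API. The check only confirms that each key is a column in the table and that each value is one of the four listed type strings; it cannot confirm that the field exists in your Labelbox metadata ontology.

```python
import pandas as pd

# The four type strings listed above.
ALLOWED_METADATA_TYPES = {"datetime", "enum", "string", "number"}


def check_metadata_index(df, metadata_index):
    """Hypothetical helper: return a list of problems with a metadata_index."""
    problems = []
    for column_name, field_type in metadata_index.items():
        # Each key must name a column that exists in the table.
        if column_name not in df.columns:
            problems.append(f"column not in table: {column_name}")
        # Each value must be one of the supported type strings.
        if field_type not in ALLOWED_METADATA_TYPES:
            problems.append(f"unsupported type for {column_name}: {field_type}")
    return problems


table = pd.DataFrame({
    "global_key": ["k1"],
    "row_data": ["https://example.com/1.jpg"],
    "split": ["train"],
})

good = check_metadata_index(table, {"split": "enum"})
bad = check_metadata_index(table, {"split": "category", "source": "string"})
print(good)
print(bad)
```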

In [7]:
metadata_index={ 
    "split" : "enum"
}

# Get or Create a Labelbox Dataset

* Labelpandas will create data rows for you in existing datasets. If you don't have a dataset, create one.

In [11]:
dataset_name = "Labelpandas Demo Dataset" # Desired or existing dataset name
integration_name = "DEFAULT" # Desired delegated access integration name (ignore if using an existing dataset)

dataset = client.base_client.get_or_create_dataset(name=dataset_name, integration=integration_name, verbose=True)

Creating a Labelbox dataset with name Labelpandas Demo Dataset and the default delegated access integration setting
Created a new dataset with ID clchxm7q011xs073n6kxe3otq


# Upload Data Rows from CSV to Labelbox

**`client.create_data_rows_from_table()`** has the following arguments:
```
df              :   Required (pandas.core.frame.DataFrame) - Pandas DataFrame    
lb_dataset      :   Required (labelbox.schema.dataset.Dataset) - Labelbox dataset to add data rows to            
row_data_col    :   Required (str) - Column containing asset URL or file path
global_key_col  :   Optional (str) - Column name containing the data row global key - defaults to row data
external_id_col :   Optional (str) - Column name containing the data row external ID - defaults to global key
metadata_index  :   Optional (dict) - Dictionary where {key=column_name : value=metadata_type}
local_files     :   Optional (bool) - If True, will create urls for local files; if False, uploads `row_data_col` as urls
skip_duplicates :   Optional (bool) - If True, will skip duplicate global_keys, otherwise will generate a unique global_key with a suffix     
verbose         :   Optional (bool) - If True, prints information about code execution
```
This function returns a list of errors, if any.
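One way to act on that return value is to summarize it after the upload. The sketch below assumes only that the result is a list, possibly empty; the structure of each error entry is not described in this notebook, so entries are printed with `repr` rather than parsed. `report_upload` is a hypothetical helper, and the sample error string is made up for the demonstration.

```python
def report_upload(upload_results):
    """Hypothetical helper: build a short text summary of the returned error list."""
    # Treat None the same as an empty list so the helper never fails on "no errors".
    errors = list(upload_results or [])
    if not errors:
        return "Upload finished with no reported errors"

    lines = [f"Upload finished with {len(errors)} reported error(s):"]
    # The shape of each entry is unknown here, so show it verbatim.
    lines += [f"  {i}. {err!r}" for i, err in enumerate(errors, start=1)]
    return "\n".join(lines)


# Stand-in values; in the notebook you would pass the upload's return value instead.
clean_summary = report_upload([])
failed_summary = report_upload(["made-up error entry"])
print(clean_summary)
print(failed_summary)
```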

In [9]:
upload_results = client.create_data_rows_from_table(
    df=df, 
    lb_dataset=dataset, 
    row_data_col="row_data", 
    global_key_col="global_key", 
    external_id_col=None, 
    metadata_index=metadata_index,
    local_files=False,
    skip_duplicates=False,
    verbose=True)

Valid metadata_index
Creating upload list - 10 rows in Pandas DataFrame
Generated upload list - 10 data rows to upload
Beginning data row upload: uploading 10 data rows
Batch #1: 10 data rows
Success: upload batch number 1 complete
Upload complete
