<td>
   <a target="_blank" href="https://labelbox.com" ><img src="https://labelbox.com/blog/content/images/2021/02/logo-v4.svg" width=256/></a>
</td>


<td>
<a href="https://colab.research.google.com/github/Labelbox/labelbox-python/blob/develop/examples/basics/custom_embeddings.ipynb" target="_blank"><img
src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</td>

<td>
<a href="https://github.com/Labelbox/labelbox-python/tree/develop/examples/basics/custom_embeddings.ipynb" target="_blank"><img
src="https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white" alt="GitHub"></a>
</td>

# Custom Embeddings

Adding your own custom embeddings can improve data exploration and similarity search. Labelbox lets you upload up to 10 different custom embeddings per workspace, on any kind of data, so you can experiment with different embeddings to power your data selection.

# Set up 

In [None]:
%pip install -q --upgrade "labelbox[data]"

In [None]:
import labelbox as lb
import numpy as np
import json
import uuid
import random

# Replace with your API key

In [None]:
API_KEY = ""
client = lb.Client(API_KEY)

# Select data rows

- Get images from a Labelbox dataset
- To improve similarity search, you need to upload custom embeddings to at least 1,000 data rows.


In [None]:
DATASET_ID = ""

In [None]:
dataset = client.get_dataset(dataset_id=DATASET_ID)
export_task = dataset.export()
export_task.wait_till_done()

In [None]:
data_rows = []


def json_stream_handler(output: lb.BufferedJsonConverterOutput):
    data_row = output.json
    data_rows.append(data_row)


if export_task.has_errors():
    export_task.get_buffered_stream(stream_type=lb.StreamType.ERRORS).start(
        stream_handler=lambda error: print(error))

if export_task.has_result():
    export_json = export_task.get_buffered_stream(
        stream_type=lb.StreamType.RESULT).start(
            stream_handler=json_stream_handler)

In [None]:
data_row_dict = [{"data_row_id": dr["data_row"]["id"]} for dr in data_rows]
# Keep the first 1,000 data rows for the sake of this demo
data_row_dict = data_row_dict[:1000]
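
The selection step can also be written as a small function that skips malformed or duplicate rows and warns when fewer than 1,000 data rows are available. This is a minimal sketch, not part of the Labelbox SDK: `select_data_row_ids` and `MIN_ROWS_FOR_SIMILARITY` are hypothetical names, and the function assumes each exported row has the `["data_row"]["id"]` layout used in the cell above.

```python
MIN_ROWS_FOR_SIMILARITY = 1000  # threshold mentioned in this notebook


def select_data_row_ids(exported_rows, limit=MIN_ROWS_FOR_SIMILARITY):
    """Hypothetical helper: collect up to `limit` unique data row IDs."""
    ids, seen = [], set()
    for row in exported_rows:
        # Rows without a "data_row" -> "id" entry are skipped
        row_id = row.get("data_row", {}).get("id")
        if row_id and row_id not in seen:
            seen.add(row_id)
            ids.append(row_id)
        if len(ids) == limit:
            break
    if len(ids) < MIN_ROWS_FOR_SIMILARITY:
        print(f"Only {len(ids)} data rows found; similarity search "
              f"benefits from at least {MIN_ROWS_FOR_SIMILARITY}.")
    return ids
```

With the export collected above, `select_data_row_ids(data_rows)` would return a plain list of IDs rather than the list of dictionaries used in the rest of this notebook.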

# Create custom embedding payload 

Generate random vectors to use as embeddings (maximum 2048 dimensions)

In [None]:
nb_data_rows = len(data_row_dict)
print("Number of data rows: ", nb_data_rows)
# Labelbox supports custom embedding vectors of up to 2048 dimensions
# .tolist() converts NumPy floats to plain Python floats
custom_embeddings = [np.random.random(2048).tolist() for _ in range(nb_data_rows)]
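
Real embeddings usually come from a model rather than a random generator, so it can help to check them before building the payload. The following is a minimal sketch, not part of the Labelbox SDK: `validate_vectors` is a hypothetical helper that checks each vector has the expected length and stays within the 2048-dimension limit stated above. The check for NaN and infinity is a precaution, not a documented Labelbox requirement.

```python
import math

MAX_DIMS = 2048  # dimension limit stated in this notebook


def validate_vectors(vectors, dims):
    """Hypothetical helper: return vectors as lists of floats, or raise ValueError."""
    if not 1 <= dims <= MAX_DIMS:
        raise ValueError(f"dims must be between 1 and {MAX_DIMS}, got {dims}")
    cleaned = []
    for index, vector in enumerate(vectors):
        values = [float(v) for v in vector]  # also converts NumPy scalars
        if len(values) != dims:
            raise ValueError(
                f"vector {index} has {len(values)} values, expected {dims}")
        if not all(math.isfinite(v) for v in values):
            raise ValueError(f"vector {index} contains NaN or infinity")
        cleaned.append(values)
    return cleaned
```

In this notebook it could be called as `custom_embeddings = validate_vectors(custom_embeddings, 2048)`.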

List all custom embeddings available in your Labelbox workspace

In [None]:
embeddings = client.get_embeddings()

Choose an existing embedding type or create a new one

In [None]:
# The name of a custom embedding must be unique
embedding = client.create_embedding("my_custom_embedding_2048_dimensions", 2048)
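
Because embedding names must be unique, re-running the create call with the same name is likely to fail. One way to make the cell repeatable is to look for an existing embedding first. This is a minimal sketch: `get_or_create_embedding` is a hypothetical helper, and it assumes the objects returned by `client.get_embeddings()` expose `name` and `dims` attributes.

```python
def get_or_create_embedding(client, name, dims):
    """Hypothetical helper: reuse an embedding with this name, else create it."""
    for existing in client.get_embeddings():
        # Assumes each embedding object exposes `name` and `dims`
        if existing.name == name:
            if existing.dims != dims:
                raise ValueError(
                    f"embedding {name!r} has {existing.dims} dims, "
                    f"expected {dims}")
            return existing
    return client.create_embedding(name, dims)
```

It could be used as `embedding = get_or_create_embedding(client, "my_custom_embedding_2048_dimensions", 2048)`.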

Create payload

The payload should contain the `key` (data row ID or global key) and the new embedding vector data. Note that `dataset.upsert_data_rows()` only updates the values you pass in the payload; all other existing data row fields are left unchanged.

In [None]:
payload = []
# Use a separate loop variable so the data_row_dict list is not overwritten
for row, custom_embedding in zip(data_row_dict, custom_embeddings):
    payload.append({
        "key":
            lb.UniqueId(row["data_row_id"]),
        "embeddings": [{
            "embedding_id": embedding.id,
            "vector": custom_embedding
        }],
    })

print("payload", len(payload), payload[:1])
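
Note that `zip()` silently stops at the shorter input, so a mismatch between the number of keys and the number of vectors would go unnoticed. The sketch below builds the same payload structure but checks the lengths first. `build_embedding_payload` is a hypothetical helper, not an SDK function; it takes already-constructed keys so it works with either kind of identifier.

```python
def build_embedding_payload(keys, vectors, embedding_id):
    """Hypothetical helper: pair each key with its vector in the upsert format."""
    keys, vectors = list(keys), list(vectors)
    if len(keys) != len(vectors):
        # zip() would silently drop the unmatched items
        raise ValueError(f"{len(keys)} keys but {len(vectors)} vectors")
    return [{
        "key": key,
        "embeddings": [{
            "embedding_id": embedding_id,
            "vector": list(vector)
        }],
    } for key, vector in zip(keys, vectors)]
```

For example, `keys = [lb.UniqueId(dr["data_row"]["id"]) for dr in data_rows[:1000]]` followed by `payload = build_embedding_payload(keys, custom_embeddings, embedding.id)`. To address rows by global key instead, the keys could be wrapped in `lb.GlobalKey(...)`, assuming that class is available in your SDK version.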

# Upload payload

Upsert data rows with custom embeddings

In [None]:
task = dataset.upsert_data_rows(payload)
task.wait_till_done()
print(task.errors)
print(task.status)
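
Printing the errors and status is enough for a demo, but in a script you may prefer to stop as soon as the upload fails. A minimal sketch follows; `ensure_task_succeeded` is a hypothetical helper, and the status string `"COMPLETE"` is an assumption about what a finished task reports, so adjust it if your SDK version differs.

```python
def ensure_task_succeeded(task):
    """Hypothetical helper: raise if the task reported errors or did not complete."""
    if task.errors:
        raise RuntimeError(f"task finished with errors: {task.errors}")
    # Assumption: a finished task reports the status string "COMPLETE"
    if task.status != "COMPLETE":
        raise RuntimeError(f"unexpected task status: {task.status}")
    return task
```

It would be called after `task.wait_till_done()`, for example `ensure_task_succeeded(task)`.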

Get the count of imported vectors for a custom embedding

In [None]:
# Count how many data rows have a specific custom embedding (this can take a couple of minutes)
count = embedding.get_imported_vector_count()
print(count)
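
Since the count can take a couple of minutes to update, a single call may return a number lower than what was uploaded. One option is to poll until the expected count is reached. This is a minimal sketch: `wait_for_vector_count` is a hypothetical helper, and the default timeout and interval are arbitrary choices.

```python
import time


def wait_for_vector_count(embedding, expected, timeout=300.0, interval=10.0):
    """Hypothetical helper: poll until `expected` vectors are counted or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        count = embedding.get_imported_vector_count()
        # Return the latest count on success or once the deadline has passed
        if count >= expected or time.monotonic() >= deadline:
            return count
        time.sleep(interval)
```

For example, `count = wait_for_vector_count(embedding, expected=len(payload))` returns the last observed count, which the caller can compare with the expected value.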

Delete custom embedding type

In [None]:
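# Uncomment to delete the custom embedding. Keep it if you plan to run
# the next section, which fetches this embedding by name.
# embedding.delete()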

# Upload custom embeddings during data row creation

Create a dataset

In [None]:
# Create a dataset
dataset_new = client.create_dataset(name="data_rows_with_embeddings")

Fetch an embedding (2048 dimensions)

In [None]:
embedding = client.get_embedding_by_name("my_custom_embedding_2048_dimensions")
vector = [random.uniform(1.0, 2.0) for _ in range(embedding.dims)]

Upload data rows with embeddings

In [None]:
uploads = []
# Generate data rows
for i in range(1, 9):
    uploads.append({
        "row_data":
            f"https://storage.googleapis.com/labelbox-datasets/People_Clothing_Segmentation/jpeg_images/IMAGES/img_000{i}.jpeg",
        "global_key":
            f"TEST-ID-{uuid.uuid1()}",
        "embeddings": [{
            "embedding_id": embedding.id,
            "vector": vector
        }],
    })

task1 = dataset_new.create_data_rows(uploads)
task1.wait_till_done()
print("ERRORS: ", task1.errors)
print("RESULTS:", task1.result)