<td>
   <a target="_blank" href="https://labelbox.com" ><img src="https://labelbox.com/blog/content/images/2021/02/logo-v4.svg" width=190/></a>
</td>

<td>
<a href="https://colab.research.google.com/github/Labelbox/labelbox-python/blob/master/examples/basics/batches.ipynb" target="_blank"><img
src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
</td>

<td>
<a href="https://github.com/Labelbox/labelbox-python/blob/master/examples/basics/batches.ipynb" target="_blank"><img
src="https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white" alt="GitHub"></a>
</td>

# Batches
https://docs.labelbox.com/docs/batches

* A batch is a collection of data rows.
* A data row cannot be part of more than one batch in a given project.
* Batches work for all data types, but there can only be one data type per project.
* Batches cannot be shared between projects.
* Batches may have data rows from multiple datasets.
* Currently, batch projects support only the benchmark quality setting.
* You can set the priority for each batch.
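
The rule that a data row belongs to at most one batch per project can be checked locally before any batch is created. The snippet below is a minimal sketch in plain Python, not part of the Labelbox SDK: `find_conflicts`, the batch names, and the `gk-*` global keys are all hypothetical and exist only for illustration.

```python
from collections import Counter


def find_conflicts(planned_batches):
    """Return global keys that appear in more than one planned batch.

    planned_batches maps a (hypothetical) batch name to a list of global keys.
    """
    counts = Counter(
        key
        for keys in planned_batches.values()
        for key in set(keys)  # ignore repeats inside a single batch
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical plan for one project: "gk-2" is assigned twice.
plan = {
    "batch-a": ["gk-1", "gk-2"],
    "batch-b": ["gk-2", "gk-3"],
}
conflicts = find_conflicts(plan)
print("Keys assigned to more than one batch:", conflicts)
```

An empty result means the plan respects the one-batch-per-project rule for every key.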

In [None]:
!pip install "labelbox[data]" -q

In [None]:
import labelbox as lb
import random
import uuid

## API key and client
Provide a valid API key below to connect to the Labelbox client.

In [None]:
# Add your API key
API_KEY = ""
client = lb.Client(api_key=API_KEY)

## Create a dataset and data rows

In [None]:
# Create a dataset
dataset = client.create_dataset(name="Demo-Batches-Colab")

uploads = []
# Generate data rows
for i in range(1,9):
    uploads.append({
        "row_data": f"https://storage.googleapis.com/labelbox-datasets/People_Clothing_Segmentation/jpeg_images/IMAGES/img_000{i}.jpeg",
        "global_key": f"TEST-ID-{uuid.uuid1()}",
    })

data_rows = dataset.create_data_rows(uploads)
data_rows.wait_till_done()
print("ERRORS: ", data_rows.errors)
print("RESULT URL: ", data_rows.result_url)

## Set up a batch project

In [None]:
project = client.create_project(
  name="Demo-Batches-Project",                                 
  media_type=lb.MediaType.Image
)
print("Project Name: ", project.name, "Project ID: ", project.uid)

## Create batches

### Select all data rows from the dataset


In [None]:
global_keys = [data_row.global_key for data_row in dataset.export_data_rows()]
print("Number of global keys:", len(global_keys))

### Select a random sample
Random sampling is useful when you have a large dataset and want to work with only a handful of data rows.

In [None]:
sample = random.sample(global_keys, 4)
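
If the same subset needs to be reproduced across notebook runs, a seeded random number generator can be used. The following is a minimal sketch using only the Python standard library. `reproducible_sample` and the `demo_keys` list are hypothetical stand-ins for the `global_keys` list built above.

```python
import random


def reproducible_sample(keys, k, seed=0):
    """Return up to k distinct keys, chosen the same way for a given seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    k = min(k, len(keys))      # random.sample raises ValueError if k > len(keys)
    return rng.sample(list(keys), k)


# Hypothetical global keys standing in for the ones exported from the dataset.
demo_keys = [f"TEST-ID-{i}" for i in range(8)]
first = reproducible_sample(demo_keys, 4, seed=42)
second = reproducible_sample(demo_keys, 4, seed=42)
print("Same sample on both calls:", first == second)
```

Capping `k` at the list length is a design choice for small demo datasets; drop it if an oversized request should fail loudly instead.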

### Create a batch
The `project.create_batch()` method accepts either a list of data row IDs or `DataRow` objects through the `data_rows` argument, or a list of global keys through the `global_keys` argument. The two arguments cannot be combined in a single call.

In [None]:
batch = project.create_batch(
  name="Demo-First-Batch", # Each batch in a project must have a unique name
  global_keys=sample, # A list of global keys (use data_rows instead for data row IDs or DataRow objects)
  priority=5 # Priority from 1 (highest) to 5 (lowest)
)
# number of data rows in the batch
print("Number of data rows in batch: ", batch.size)
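
Because `data_rows` and `global_keys` cannot be passed together, it can help to validate the choice before calling the SDK. The helper below is a hypothetical sketch and not part of the Labelbox SDK: `batch_selector` only builds the keyword argument, and the commented-out call shows one way it might be used with `project.create_batch()`.

```python
def batch_selector(data_rows=None, global_keys=None):
    """Return the single keyword argument to forward to create_batch.

    Raises ValueError unless exactly one of the two arguments is given.
    """
    if (data_rows is None) == (global_keys is None):
        raise ValueError("Pass exactly one of data_rows or global_keys")
    if data_rows is not None:
        return {"data_rows": list(data_rows)}
    return {"global_keys": list(global_keys)}


# Hypothetical global keys used only for illustration.
selector = batch_selector(global_keys=["TEST-ID-1", "TEST-ID-2"])
print(selector)

# Possible usage with a real project (requires a live Labelbox client):
# project.create_batch(name="Demo-Validated-Batch", priority=5, **selector)
```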

### Create multiple batches
The `project.create_batches()` method accepts up to 1 million data rows. If necessary, the data rows are chunked into batches of 100k, which is the maximum batch size. This method takes either a list of data row IDs or `DataRow` objects in the `data_rows` argument, or global keys in the `global_keys` argument; the two cannot be used in the same call.

Batches are created with the specified `name_prefix` argument and a unique suffix to ensure unique batch names. The suffix is a 4-digit number starting at `0000`.

For example, if the name prefix is `demo-create-batches-` and three batches are created, the names will be `demo-create-batches-0000`, `demo-create-batches-0001`, and `demo-create-batches-0002`. This method will throw an error if a batch with the same name already exists.
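
The chunking and naming scheme described above can be pictured with a few lines of plain Python. This is only an illustrative sketch of the stated rules (100k rows per batch, 4-digit suffix starting at `0000`), not the SDK's actual implementation. `plan_batches` and the `gk-*` keys are hypothetical.

```python
MAX_BATCH_SIZE = 100_000  # maximum batch size stated in the text above


def plan_batches(keys, name_prefix, max_size=MAX_BATCH_SIZE):
    """Split keys into chunks of at most max_size and name each chunk.

    Names follow the pattern <name_prefix><4-digit index>, starting at 0000.
    """
    plan = {}
    for index, start in enumerate(range(0, len(keys), max_size)):
        plan[f"{name_prefix}{index:04d}"] = keys[start:start + max_size]
    return plan


# 250k hypothetical global keys yield three batches.
keys = [f"gk-{i}" for i in range(250_000)]
plan = plan_batches(keys, "demo-create-batches-")
for name, chunk in plan.items():
    print(name, len(chunk))
```

With 250k keys, the sketch yields two full batches and one partial batch, named as in the example above.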

In the code below, only one batch will be created, since we are only using the few data rows we created above. Creating over 100k data rows for this demonstration is not sensible, but this method is the preferred approach for batch creation as it will gracefully handle massive sets of data rows.

In [None]:
# First, we must create a second project so that we can re-use the data rows we already created.
second_project = client.create_project(
  name="Second-Demo-Batches-Project",                                 
  media_type=lb.MediaType.Image
)
print("Project Name: ", second_project.name, "Project ID: ", second_project.uid)

# Then, use the method that will create multiple batches if necessary.
task = second_project.create_batches(
  name_prefix="demo-create-batches-",
  global_keys=global_keys,
  priority=5
)

print("Errors: ", task.errors())
print("Result: ", task.result())

### Create batches from a dataset

If you want to create batches in a project from all the data rows of a dataset, without gathering global keys or IDs and selecting subsets of data rows, you can use the `project.create_batches_from_dataset()` method. This method takes a dataset ID and creates a batch (or multiple batches if there are more than 100k data rows) made up of all data rows not already in the project.

The same logic applies to the `name_prefix` argument and the naming of batches as described in the section immediately above.

In [None]:
# First, we must create a third project so that we can re-use the data rows we already created.
third_project = client.create_project(
  name="Third-Demo-Batches-Project",                                 
  media_type=lb.MediaType.Image
)
print("Project Name: ", third_project.name, "Project ID: ", third_project.uid)

# Then, use the method to create batches from a dataset.
task = third_project.create_batches_from_dataset(
    name_prefix="demo-batches-from-dataset-",
    dataset_id=dataset.uid,
    priority=5
)

print("Errors: ", task.errors())
print("Result: ", task.result())
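
The selection rule for `create_batches_from_dataset()`, "all data rows not already in the project", amounts to an order-preserving set difference. The sketch below illustrates that idea in plain Python. It is an assumption about the behavior as described in the text, not the SDK's code, and `rows_to_queue` and the `dr-*` IDs are hypothetical.

```python
def rows_to_queue(dataset_row_ids, project_row_ids):
    """Return dataset rows that are not yet in the project, keeping dataset order."""
    already_in_project = set(project_row_ids)
    return [row_id for row_id in dataset_row_ids if row_id not in already_in_project]


# Hypothetical IDs: two of the four dataset rows are already in the project.
dataset_rows = ["dr-1", "dr-2", "dr-3", "dr-4"]
project_rows = ["dr-2", "dr-4"]
pending = rows_to_queue(dataset_rows, project_rows)
print("Rows that would be batched:", pending)
```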

## Manage batches
Note: You can view your batch data through the **Data Rows** tab.

### View batches

In [None]:
# Export the data rows in the batch
data_rows = [dr for dr in batch.export_data_rows()]
print("Data rows in batch: ", data_rows)

# List the batches in your project
# (a separate loop variable keeps `batch` pointing at the batch created above)
for project_batch in project.batches():
    print("Batch name: ", project_batch.name, "  Batch ID:", project_batch.uid)


### Archive a batch

In [None]:
# Archiving a batch removes all queued data rows in the batch from the project
batch.remove_queued_data_rows()

## Clean up 
Uncomment and run the cell below to optionally delete the batch, dataset, and/or project created in this demo.

In [None]:
# Delete Batch
#batch.delete()

# Delete Project
#project.delete()

# Delete Dataset
#dataset.delete()