# Hub Dataset Annotation

This notebook is a proof of concept showing how to explore a dataset and create user annotations using only the Dataset Viewer API from the Hub.

The goal of this experiment is to validate the feasibility of integrating this kind of annotation into Argilla. Doing so could change Argilla's core approach, but it could also bring significant benefits for user experience and data integration.

## 1. Create a hub dataset instance

The first step is to create a hub dataset instance. This instance will be used to interact with the dataset viewer API.


In [None]:
from hub_dataset import HubDataset

dataset = HubDataset("stanfordnlp/imdb", split="train")
dataset.info()

The `HubDataset` class wraps the Dataset Viewer API from the Hub. The `info` method returns the dataset information. See the source code for the full list of available methods.
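To make the wrapping more concrete, here is a minimal sketch of how a request URL for the Dataset Viewer API could be built. It is not the `HubDataset` implementation: the helper name `viewer_url` is hypothetical, and the `plain_text` config name for `stanfordnlp/imdb` is an assumption.

```python
from urllib.parse import urlencode

# Public base URL of the Dataset Viewer API.
BASE_URL = "https://datasets-server.huggingface.co"


def viewer_url(endpoint: str, **params) -> str:
    """Hypothetical helper: build a Dataset Viewer URL, skipping None values."""
    query = {key: value for key, value in params.items() if value is not None}
    return f"{BASE_URL}/{endpoint}?{urlencode(query)}"


# URL for the first five rows of the train split.
# Assumption: the dataset config is named "plain_text".
rows_url = viewer_url(
    "rows",
    dataset="stanfordnlp/imdb",
    config="plain_text",
    split="train",
    offset=0,
    length=5,
)
print(rows_url)
```

The same helper would cover endpoints such as `search` (with a `query` parameter) and `filter` (with a `where` parameter). Sending the request is left out so the sketch runs without network access.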

In [None]:
# You can also inspect the dataset features
dataset.features

In [None]:
# the dataset size
dataset.size()

In [None]:
# The feature statistics
dataset.statistics()

In [None]:
# Fetch some rows
dataset.rows(offset=0, length=5)

In [None]:
# Search for some specific rows
dataset.search(query="I rented I AM CURIOUS-YELLOW from my video store", length=2)

In [None]:
# Filter by some specific features
dataset.filter(where="label=1", length=5)

In [None]:
# Each method that returns dataset rows also has an iterable version
for row in dataset.iterable_rows(limit=5):
    print(row)
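An iterable method like the one above could be built by paging through the API. The sketch below is an assumption about the approach, not the notebook's actual code: `iterate_rows` and `fetch_page` are hypothetical names, and the page size of 100 reflects the maximum `length` the rows endpoint is assumed to accept per request.

```python
from typing import Callable, Iterator, List


def iterate_rows(
    fetch_page: Callable[[int, int], List[dict]],
    limit: int,
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield up to `limit` rows by requesting consecutive pages.

    `fetch_page(offset, length)` is a hypothetical callable returning a list
    of rows; an empty list signals the end of the dataset.
    """
    offset = 0
    while offset < limit:
        length = min(page_size, limit - offset)
        page = fetch_page(offset, length)
        if not page:
            break
        yield from page
        offset += len(page)
```

Passing the fetch function as an argument keeps the paging logic independent of the endpoint, so the same generator could serve rows, search results, and filtered rows.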

## 2. Create an annotation session over the dataset

The next step is to create an annotation session over the dataset. This session is used to manage and store the annotations made by different users.

In [None]:
from annotation_session import AnnotationSession

session = AnnotationSession(dataset)

In [None]:
for row in session.list(limit=2, status="pending"):
    print(row)

In [None]:
# We define two different users

user1 = "mark"
user2 = "peter"

# And we start annotating some rows for each user given their specific batch 

In [None]:
# User 1
for row in session.pending_rows_batch(user1, limit=10):
    print(row["row_idx"])
    session.annotate(row["row_idx"], user1, label=1)

In [None]:
# User 2
for row in session.pending_rows_batch(user2, limit=10):
    print(row["row_idx"])
    session.annotate(row["row_idx"], user2, label=0)

In [None]:
# Now we can list the annotations for each user
for row in session.annotated_rows(user1):
    print(row["row_idx"])

In [None]:
for row in session.annotated_rows(user2):
    print(row["row_idx"])

Each call to `pending_rows_batch` returns a batch of rows to annotate, randomized per user, so different users do not see the same rows. This helps distribute the annotation work among users.
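One possible way to get per-user randomization is to shuffle the row indices with a seed derived from the user name. The sketch below is an illustration under that assumption, not how `AnnotationSession` is implemented; `pending_batch` is a hypothetical name. Note that this approach makes overlap between users unlikely for large datasets, but does not rule it out.

```python
import hashlib
import random
from typing import List, Set


def pending_batch(user: str, num_rows: int, annotated: Set[int], limit: int = 10) -> List[int]:
    """Return up to `limit` row indices the user has not annotated yet.

    The order is a shuffle seeded from the user name, so it is stable for
    one user across calls and differs between users.
    """
    # hashlib gives a seed that is stable across processes, unlike hash().
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")

    order = list(range(num_rows))
    random.Random(seed).shuffle(order)

    batch = []
    for row_idx in order:
        if row_idx in annotated:
            continue
        batch.append(row_idx)
        if len(batch) == limit:
            break
    return batch
```

Shuffling the full index list is fine for a sketch; for very large datasets a different sampling strategy would be needed.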

Filtering and searching are also available through the session. You can combine them with the `status` parameter to select rows by status.

In [None]:
for row in session.filter(where="label=1", status="annotated", limit=5):
    print("Annotated row id", row["row_idx"])
    
for row in session.search(query="I rented I AM CURIOUS-YELLOW from my video store", status="pending", limit=5):
    print("Pending row id", row["row_idx"])

In [None]:
# We can also access the annotations for a specific row
for annotated_row in session.list(limit=10, status="annotated"):
    print(session.get_row_annotations(annotated_row["row_idx"]))
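The calls used in this section suggest that annotations are stored per row and per user. A minimal in-memory sketch of such a store, assuming that structure, could look like the following. The class `InMemoryAnnotations` is hypothetical and only mirrors the method names used above; the real `AnnotationSession` may store and return data differently.

```python
from collections import defaultdict
from typing import Dict, List


class InMemoryAnnotations:
    """Hypothetical store mapping row_idx -> {user: annotation fields}."""

    def __init__(self) -> None:
        self._by_row: Dict[int, Dict[str, dict]] = defaultdict(dict)

    def annotate(self, row_idx: int, user: str, **fields) -> None:
        # A later annotation by the same user overwrites the earlier one.
        self._by_row[row_idx][user] = dict(fields)

    def get_row_annotations(self, row_idx: int) -> Dict[str, dict]:
        # Return a copy so callers cannot mutate the store by accident.
        return dict(self._by_row.get(row_idx, {}))

    def annotated_rows(self, user: str) -> List[int]:
        return sorted(idx for idx, by_user in self._by_row.items() if user in by_user)


store = InMemoryAnnotations()
store.annotate(3, "mark", label=1)
store.annotate(3, "peter", label=0)
print(store.get_row_annotations(3))
```

Keying by row first makes it cheap to compare what different users said about the same row, which is the lookup shown in the last cell.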
