The Taskframe Class: Adding Datasets
This page details the methods for adding a Dataset to a Taskframe. As in most data science tools, a Dataset is simply an iterable collection containing the data you want to annotate. Each item in the Dataset results in one annotation Task.
The library offers several convenience methods to load a Dataset from different formats.
Note that these methods are lazily evaluated: the Dataset is not actually submitted until you call taskframe.submit().
# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])
# Adding a dataset from a list of urls or local files:
tf.add_dataset_from_list(["https://server.com/img1.jpg", "https://server.com/img2.jpg"])
tf.add_dataset_from_list(["local/path/img1.jpg", "local/path/img2.jpg"])
# Adding a dataset from a folder:
tf.add_dataset_from_folder("local/path")
# Adding a dataset from a CSV containing raw data, urls or paths to local files:
tf.add_dataset_from_csv("local/data.csv", column="items")
# Adding a dataset from a Pandas dataframe containing raw data, urls or paths to local files:
import pandas as pd
dataframe = pd.DataFrame(...)
tf.add_dataset_from_dataframe(dataframe, column="items")
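The lazy evaluation mentioned above can be pictured with a toy class. This is only a sketch of the general deferred-submission pattern, not Taskframe's implementation: the class `DeferredDatasetQueue` and its method `add_items` are hypothetical names made up for illustration.

```python
class DeferredDatasetQueue:
    """Toy stand-in (NOT the Taskframe class) showing deferred submission."""

    def __init__(self):
        self.pending = []    # items registered but not yet sent
        self.submitted = []  # items "sent" once submit() runs

    def add_items(self, items):
        # Only record the items; nothing is uploaded at this point.
        self.pending.extend(items)

    def submit(self):
        # The actual work happens here, in a single step.
        count = len(self.pending)
        self.submitted.extend(self.pending)
        self.pending = []
        return count


queue = DeferredDatasetQueue()
queue.add_items(["a.jpg", "b.jpg"])
pending_before = len(queue.pending)      # items are waiting
submitted_before = len(queue.submitted)  # nothing has been sent yet
sent = queue.submit()                    # now everything is sent at once
```

The practical consequence is that you can add a Dataset, inspect or adjust your setup, and only trigger the upload once you are ready to submit.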
Add a Dataset directly from a list of items. The items may be raw data (for example text), URLs, or paths to local files.
Signature:
def add_dataset_from_list(
self, items, input_type=None, custom_ids=None, labels=None
):
Parameters:
- items: a list of items that will be annotated. Items may be file paths, URLs, or raw data (see below).
- input_type (optional): the type of items: file, url, or data. If not provided, it will be inferred.
- custom_ids (optional): list of unique item ids. Its length should match the length of items.
- labels (optional): list of initial labels for your items, used to initialize the annotation form. This is useful, for example, when you already have a machine learning model that can generate baseline labels and you want workers to correct them. The length of the labels list should match the length of items (fill with None values if necessary).
Returns: None
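Because custom_ids and labels must both have the same length as items, it can help to build them together before the call. The following is a minimal sketch: `align_optional_lists` is a hypothetical helper (not part of the library), the baseline labels are made up, and it assumes that a classification label is simply the class name.

```python
def align_optional_lists(items, labels_by_item=None, id_prefix="item"):
    """Hypothetical helper: build custom_ids and labels lists matching items."""
    labels_by_item = labels_by_item or {}
    # One unique id per item, derived from its position in the list.
    custom_ids = [f"{id_prefix}-{i}" for i in range(len(items))]
    # Items without a known label get None so the lengths stay equal.
    labels = [labels_by_item.get(item) for item in items]
    return custom_ids, labels


items = ["this product is really awesome!", "I don't like it", "it arrived on time"]
# Made-up baseline predictions; only the first item has one.
model_predictions = {"this product is really awesome!": "positive"}

custom_ids, labels = align_optional_lists(items, model_predictions)

# Both lists line up with items, and the ids are unique.
assert len(custom_ids) == len(labels) == len(items)
assert len(set(custom_ids)) == len(custom_ids)

# With a configured Taskframe `tf`, these would then be passed as:
# tf.add_dataset_from_list(items, custom_ids=custom_ids, labels=labels)
```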
Example:
# Assuming you have a text classification Taskframe:
tf = Taskframe(data_type="text", task_type="classification", classes=["positive", "negative"])
tf.add_dataset_from_list(["this product is really awesome!", "I don't like it"])
Add a Dataset from all files in a specific folder.
Signature:
def add_dataset_from_folder(
self, path, custom_ids=None, labels=None, recursive=False, pattern="*"
):
Parameters:
- path: string or Path instance pointing to the folder containing your files.
- recursive: Boolean. If true, sub-directories are also loaded.
- pattern: pattern used to filter which files are loaded, for example by extension. Defaults to "*".
Returns: None
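To check which files a given combination of recursive and pattern would pick up, you can preview the folder yourself first. This sketch assumes that pattern is a glob-style pattern (suggested by the "*" default, but not confirmed here); `preview_folder` is a hypothetical helper and the library's own matching may differ.

```python
from pathlib import Path
import tempfile


def preview_folder(path, recursive=False, pattern="*"):
    """Hypothetical helper: list the files a glob-style pattern would match."""
    root = Path(path)
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    # Keep files only, since directories also match "*".
    return sorted(p.relative_to(root).as_posix() for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["cat.jpg", "notes.txt", "sub/dog.jpg"]:
        (root / name).write_bytes(b"")

    top_level_jpg = preview_folder(root, pattern="*.jpg")
    all_jpg = preview_folder(root, recursive=True, pattern="*.jpg")
    everything = preview_folder(root, recursive=True)

print(top_level_jpg)  # only the .jpg directly inside the folder
print(all_jpg)        # .jpg files including sub-directories
```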
Example:
# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])
tf.add_dataset_from_folder("local/path/to/images")
Add a Dataset from a local CSV file. Rows may contain raw data (for example text), URLs, or paths to local files.
Signature:
def add_dataset_from_csv(
self,
csv_path,
column=None,
input_type=None,
base_path=None,
custom_id_column=None,
label_column=None,
):
Parameters:
- csv_path (required): string or Path containing the path to the CSV file.
- column: the column of the CSV containing your data. If undefined, the first column is used.
- input_type: the type of items: file, url, or data. If not provided, it will be inferred.
- base_path: if you are passing relative file paths, this base path is prepended to each file's path.
- custom_id_column: the column containing unique item ids.
- label_column: the column containing initial labels for your items.
Returns: None
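A CSV that uses all three column parameters could be prepared as in the sketch below, which relies only on the standard library. The column names, ids, file paths and base_path are made up for illustration, and the way base_path is combined with each relative path is assumed here to be a plain path join.

```python
import csv
import os
import tempfile
from pathlib import PurePosixPath

rows = [
    {"id": "img-001", "item": "cats/1.jpg", "label": "cat"},
    # Label left empty; how the library treats empty cells is not documented here.
    {"id": "img-002", "item": "dogs/7.jpg", "label": ""},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "mydata.csv")
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "item", "label"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back to sanity-check it before handing it to the library.
    with open(csv_path, newline="") as f:
        loaded = list(csv.DictReader(f))

ids = [row["id"] for row in loaded]
assert len(set(ids)) == len(ids), "custom ids must be unique"

# Illustration of prepending a base path to each relative path (assumed behaviour).
base_path = "/data/images"  # hypothetical location
resolved = [str(PurePosixPath(base_path) / row["item"]) for row in loaded]

# With a configured Taskframe `tf`, the call would then look like:
# tf.add_dataset_from_csv("mydata.csv", column="item", input_type="file",
#                         base_path=base_path, custom_id_column="id",
#                         label_column="label")
```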
Example:
# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])
tf.add_dataset_from_csv("mydata.csv", column="item")
Add a Dataset from a Pandas dataframe.
Signature:
def add_dataset_from_dataframe(
self,
dataframe,
column=None,
input_type=None,
base_path=None,
custom_id_column=None,
label_column=None,
):
Parameters:
- dataframe: the Pandas dataframe.
- column: the column of the dataframe containing your data. If undefined, the first column is used.
- input_type: the type of items: file, url, or data. If not provided, it will be inferred.
- base_path: if you are passing relative file paths, this base path is prepended to each file's path.
- custom_id_column: the column containing unique item ids.
- label_column: the column containing initial labels for your items.
Returns: None
Example:
# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])
import pandas as pd
dataframe = pd.DataFrame(...)
tf.add_dataset_from_dataframe(dataframe, column="item")