
Life of a dataset

Aryalfrat edited this page Nov 16, 2022 · 3 revisions


  1. Introduction to a dataset

A dataset consists of metadata and media files. The metadata has the following characteristics:

  • Each metadata state has a unique ID, and the system starts with a default initial metadata state of null.

  • A list of resources, where each element points to an actual resource. The metadata does not hold the resources themselves; it only maintains this list.

  • A number of keywords by which a user can search for a particular metadata state.

  • Users can create a new metadata branch and perform operations on it. Operations on the new branch do not affect the state of the original metadata, and the original metadata remains traceable by the user. These operations include, but are not limited to, the following:

    (1) Adding resources (2) Adding or modifying annotations (3) Adding or modifying keywords (4) Filtering resources (5) Merging two different metadata states

  • You can switch freely between different metadata states.

  • You can query the history of the metadata.

  • You can tag the metadata to facilitate precise search by tag.

  • You can also add keywords to metadata to facilitate fuzzy search through keywords.

  • You can read the resources referenced by a metadata state and use them for browsing, training, and so on.
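The characteristics above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `MetadataState`, `Repo`, and their methods are hypothetical names invented for this example and are not the project's actual API or storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class MetadataState:
    """One immutable metadata state (hypothetical model)."""
    state_id: str
    # References to media files; the files themselves live elsewhere.
    resources: FrozenSet[str] = frozenset()
    keywords: FrozenSet[str] = frozenset()
    # Previous state, kept so that history stays traceable.
    parent: Optional[str] = None


@dataclass
class Repo:
    states: Dict[str, MetadataState] = field(default_factory=dict)
    branches: Dict[str, str] = field(default_factory=dict)  # branch name -> head state ID

    def __post_init__(self) -> None:
        # The initial default state is empty ("null" in the text above).
        self.states["s0"] = MetadataState("s0")
        self.branches["master"] = "s0"

    def head(self, branch: str) -> MetadataState:
        return self.states[self.branches[branch]]

    def create_branch(self, name: str, source: str) -> None:
        # A new branch starts at the same state as its source branch.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, resources: Iterable[str],
               keywords: Iterable[str]) -> MetadataState:
        # Every change creates a new state; older states are never modified.
        new_id = f"s{len(self.states)}"
        state = MetadataState(new_id, frozenset(resources), frozenset(keywords),
                              parent=self.branches[branch])
        self.states[new_id] = state
        self.branches[branch] = new_id
        return state

    def history(self, branch: str) -> List[str]:
        # Walk parent links from the branch head back to the initial state.
        ids: List[str] = []
        cur: Optional[str] = self.branches[branch]
        while cur is not None:
            ids.append(cur)
            cur = self.states[cur].parent
        return ids


repo = Repo()
repo.commit("master", {"img_001.jpg", "img_002.jpg"}, {"cat"})
repo.create_branch("feature", "master")
repo.commit("feature", {"img_001.jpg"}, {"cat"})  # e.g. the result of a filter
```

Because states are immutable and branches are just pointers to states, work on `feature` leaves the head of `master` untouched, which mirrors the traceability requirement described above.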

From the above description, it can be seen that metadata management is similar to a version control system (VCS), and it supports the following distinct usage scenarios:

Scenario 1: Starting directly from the initial metadata, the user filters it to select the data that meets the requirements and then uses that data, as shown in the following figure:

Scenario1

Whenever the user needs to start a new task:

  • The user checks out a new feature branch from the current master branch, obtaining the metadata in the feature#1 state.

  • The user performs data filtering and other operations on the metadata of this new branch, obtaining the metadata in the feature#2 state.

  • Once the user confirms that this metadata is suitable for the training task, the user can start training with this data.

  • From this point on, changes made by other users to the master branch's metadata do not affect the training data the user is using.
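Scenario 1 can be sketched as follows. This is a minimal sketch, assuming branches are modeled as a plain dictionary of immutable sets; the helper names `checkout_new` and `filter_branch` and the `(resource ID, class label)` record shape are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Asset = Tuple[str, str]  # (resource ID, class label): a hypothetical record shape
Branches = Dict[str, FrozenSet[Asset]]


def checkout_new(branches: Branches, new: str, source: str) -> None:
    # The new branch starts from the current state of the source branch.
    branches[new] = branches[source]


def filter_branch(branches: Branches, name: str,
                  keep: Callable[[Asset], bool]) -> None:
    # Filtering produces a new immutable state for this branch only.
    branches[name] = frozenset(a for a in branches[name] if keep(a))


branches: Branches = {
    "master": frozenset({("img_001", "cat"), ("img_002", "dog"), ("img_003", "cat")}),
}

checkout_new(branches, "feature", "master")                   # feature#1
filter_branch(branches, "feature", lambda a: a[1] == "cat")   # feature#2
training_set = branches["feature"]

# Another user later adds data to master; the feature branch is unaffected.
branches["master"] = branches["master"] | {("img_004", "cat")}
```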

Scenario 2: The user searches for existing metadata by tag or keyword, then filters it until the data meets the requirements, and finally uses the data, as shown below:

Scenario2

In this scenario, whenever a user needs to carry out a new task:

  • The user searches by keyword, tag, and so on for metadata that roughly matches the requirements.

  • On this basis, the user checks out a new branch.

  • The user continues filtering or cleansing the data on the new branch until it actually meets the requirements.

  • The user uses this data for training.
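The search step in Scenario 2 can be illustrated with exact tag lookup and fuzzy keyword matching. This is a sketch under assumptions: `Snapshot`, `find_by_tag`, and `find_by_keyword` are hypothetical names, and "fuzzy" is modeled here as a case-insensitive substring match, which may differ from how the real system searches.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """A metadata state with its resource references and keywords (hypothetical)."""
    resources: FrozenSet[str]
    keywords: FrozenSet[str]


def find_by_tag(tags: Dict[str, str], tag: str) -> Optional[str]:
    # Precise search: a tag names exactly one snapshot ID, or nothing.
    return tags.get(tag)


def find_by_keyword(snapshots: Dict[str, Snapshot], query: str) -> List[str]:
    # Fuzzy search: case-insensitive substring match against each keyword.
    q = query.lower()
    return sorted(sid for sid, snap in snapshots.items()
                  if any(q in kw.lower() for kw in snap.keywords))


snapshots = {
    "a1": Snapshot(frozenset({"img_001", "img_002"}), frozenset({"street-cat", "night"})),
    "b2": Snapshot(frozenset({"img_003"}), frozenset({"dog"})),
}
tags = {"v1.0": "b2"}

matches = find_by_keyword(snapshots, "cat")
tagged = find_by_tag(tags, "v1.0")

# Check out a new branch from the first match, then keep cleansing it there.
branches = {"feature": snapshots[matches[0]].resources}
branches["feature"] = frozenset(r for r in branches["feature"] if r != "img_002")
```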

Scenario 3: Incremental merging. Suppose the user has completed a model training task using certain metadata. Since then, the repository's metadata on the master branch has been updated, and the user wishes to merge this update into the metadata currently in use.

Scenario3

Suppose the user is now on feature#2 and needs to do the following:

  • Switch back to the master branch.

  • Repeat the previously performed operations on the incremental part (master#2 - master#1) to obtain feature#2+.

  • Switch back to feature#2 and merge feature#2+ into it to get feature#3.
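The incremental merge can be sketched with set operations. This is a minimal sketch, assuming the user's earlier task is a per-item filter (here the hypothetical `keep_cats`); for operations that depend on the whole dataset, processing only the increment would not generally give the same result as reprocessing everything.

```python
from typing import FrozenSet, Tuple

Asset = Tuple[str, str]  # (resource ID, class label): a hypothetical record shape


def keep_cats(assets: FrozenSet[Asset]) -> FrozenSet[Asset]:
    # Stand-in for "the task previously done": a per-item filter.
    return frozenset(a for a in assets if a[1] == "cat")


master_1 = frozenset({("img_001", "cat"), ("img_002", "dog")})
feature_2 = keep_cats(master_1)

# The master branch receives an update after training has finished.
master_2 = master_1 | {("img_003", "cat"), ("img_004", "dog")}

# Repeat the task on the increment only, then merge into feature#2.
increment = master_2 - master_1          # master#2 - master#1
feature_2_plus = keep_cats(increment)    # feature#2+
feature_3 = feature_2 | feature_2_plus   # feature#3
```

For a per-item filter like this one, `feature_3` equals the result of running the filter on all of `master_2`, while only the new items had to be processed.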

  2. Branch and dataset management

The discussion in this section is based on the following assumptions:

  • The user's data is imported in batches, one dataset at a time.

  • Each dataset is a separate branch.

  • Changes to and maintenance of each dataset are carried out on its own branch.

  • The master branch is always empty.

This management approach is shown in the following figure:

branch and dataset
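The assumptions in this section can be expressed as a small sketch in which every imported dataset gets its own branch and the master branch stays empty. The function `import_dataset` and the branch names are hypothetical and only illustrate the convention.

```python
from typing import Dict, FrozenSet, Iterable

Branches = Dict[str, FrozenSet[str]]


def import_dataset(branches: Branches, name: str, resources: Iterable[str]) -> None:
    # Each dataset is imported onto its own branch; master is never written to.
    if name == "master":
        raise ValueError("master must stay empty; import into a dataset branch")
    if name in branches:
        raise ValueError(f"branch {name!r} already exists")
    branches[name] = frozenset(resources)


branches: Branches = {"master": frozenset()}
import_dataset(branches, "dataset_a", ["img_001", "img_002"])
import_dataset(branches, "dataset_b", ["img_101"])

# Later maintenance of dataset_a happens on its own branch only.
branches["dataset_a"] = branches["dataset_a"] - {"img_002"}
```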