Skip to content

Latest commit

 

History

History
159 lines (115 loc) · 5.05 KB

File metadata and controls

159 lines (115 loc) · 5.05 KB

THCHS-30

This topic describes how to manage the THCHS-30 Dataset, which is a dataset with :ref:`reference/label_format/Sentence:Sentence` label

Authorize a Client Instance

An :ref:`reference/glossary:accesskey` is needed to authenticate identity when using TensorBay.

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Authorize a Client Instance"""
   :end-before: """"""

Create Dataset

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Create Dataset"""
   :end-before: """"""

Organize Dataset

It takes the following steps to organize the “THCHS-30” dataset by the :class:`~tensorbay.dataset.dataset.Dataset` instance.

Step 1: Write the Catalog

A :ref:`Catalog <reference/dataset_structure:catalog>` contains all label information of one dataset, which is typically stored in a json file. However the catalog of THCHS-30 is too large, instead of reading it from json file, we read it by mapping from subcatalog that is loaded by the raw file. Check the :ref:`dataloader <THCHS30-dataloader>` below for more details.

Important

See :ref:`catalog table <reference/dataset_structure:catalog>` for more catalogs with different label types.

Step 2: Write the Dataloader

A :ref:`dataloader <THCHS30-dataloader>` is needed to organize the dataset into a :class:`~tensorbay.dataset.dataset.Dataset` instance.

.. literalinclude:: ../../../../tensorbay/opendataset/THCHS30/loader.py
   :language: python
   :name: THCHS30-dataloader
   :linenos:

See :ref:`Sentence annotation <reference/label_format/Sentence:Sentence>` for more details.

There are already a number of dataloaders in TensorBay SDK provided by the community. Thus, instead of writing, importing an available dataloadert is also feasible.

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Organize dataset / import dataloader"""
   :end-before: """"""

Note

Note that catalogs are automatically loaded in available dataloaders, users do not have to write them again.

Important

See :ref:`dataloader table <reference/glossary:dataloader>` for dataloaders with different label types.

Visualize Dataset

Optionally, the organized dataset can be visualized by Pharos, which is a TensorBay SDK plug-in. This step can help users to check whether the dataset is correctly organized. Please see :ref:`features/visualization:Visualization` for more details.

Upload Dataset

The organized "THCHS-30" dataset can be uploaded to TensorBay for sharing, reuse, etc.

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Upload Dataset"""
   :end-before: """"""

Similar with Git, the commit step after uploading can record changes to the dataset as a version. If needed, do the modifications and commit again. Please see :ref:`features/version_control/index:Version Control` for more details.

Read Dataset

Now "THCHS-30" dataset can be read from TensorBay.

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Read Dataset / get dataset"""
   :end-before: """"""

In :ref:`reference/dataset_structure:Dataset` "THCHS-30", there are three :ref:`Segments <reference/dataset_structure:Segment>`: dev, train and test. Get the segment names by listing them all.

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Read Dataset / list segment names"""
   :end-before: """"""

Get a segment by passing the required segment name.

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Read Dataset / get segment"""
   :end-before: """"""

In the dev :ref:`reference/dataset_structure:Segment`, there is a sequence of :ref:`reference/dataset_structure:Data`, which can be obtained by index.

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Read Dataset / get data"""
   :end-before: """"""

In each :ref:`reference/dataset_structure:Data`, there is a sequence of :ref:`reference/label_format/Sentence:Sentence` annotations, which can be obtained by index.

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Read Dataset / get label"""
   :end-before: """"""

There is only one label type in "THCHS-30" dataset, which is Sentence. It contains sentence, spell and phone information. See :ref:`Sentence <reference/label_format/Sentence:Sentence>` label format for more details.

Delete Dataset

.. literalinclude:: ../../../../docs/code/THCHS30.py
   :language: python
   :start-after: """Delete Dataset"""
   :end-before: """"""