20 Newsgroups

This topic describes how to manage the 20 Newsgroups dataset, which uses the reference/label_format/Classification:Classification label type.

Authorize a Client Instance

An reference/glossary:accesskey is needed to authenticate identity when using TensorBay.

../../../../docs/code/Newsgroups20.py

Create Dataset

../../../../docs/code/Newsgroups20.py

Organize Dataset

Normally, dataloader.py and catalog.json are required to organize the "20 Newsgroups" dataset into a ~tensorbay.dataset.dataset.Dataset instance. In this example, they are stored in the same directory:

20 Newsgroups/
    catalog.json
    dataloader.py

Organizing the "20 Newsgroups" dataset into a ~tensorbay.dataset.dataset.Dataset instance takes the following steps.

Step 1: Write the Catalog

A Catalog <reference/dataset_structure:Catalog> contains all label information of one dataset and is typically stored in a JSON file such as catalog.json.

../../../../tensorbay/opendataset/Newsgroups20/catalog.json

The only annotation type for "20 Newsgroups" is reference/label_format/Classification:Classification, and there are 20 reference/label_format/CommonLabelProperties:Category types.

Note

  • The categories<reference/label_format/CommonLabelProperties:Category> in reference/dataset_structure:Dataset "20 Newsgroups" have parent-child relationships, and "." is used to separate the different levels.
  • Passing the path of catalog.json to ~tensorbay.dataset.dataset.DatasetBase.load_catalog loads the catalog into the dataset.
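
The parent-child naming can be illustrated with a short sketch that uses only the standard library. The catalog fragment below is an assumption modeled on the description above: the key names "CLASSIFICATION" and "categoryDelimiter" and the three sample entries are illustrative and should be checked against the real catalog.json.

```python
import json

# Hypothetical catalog fragment; the real catalog.json lists all 20 categories.
CATALOG_TEXT = """
{
    "CLASSIFICATION": {
        "categoryDelimiter": ".",
        "categories": [
            {"name": "alt.atheism"},
            {"name": "comp.graphics"},
            {"name": "rec.sport.hockey"}
        ]
    }
}
"""


def split_levels(catalog):
    """Map each category name to its list of hierarchy levels."""
    subcatalog = catalog["CLASSIFICATION"]
    delimiter = subcatalog["categoryDelimiter"]
    return {
        category["name"]: category["name"].split(delimiter)
        for category in subcatalog["categories"]
    }


levels = split_levels(json.loads(CATALOG_TEXT))
for name, parts in levels.items():
    # The first level is the top-most parent, e.g. "rec" for "rec.sport.hockey".
    print(f"{name}: parent={parts[0]!r}, depth={len(parts)}")
```

Splitting on the delimiter is enough to recover the hierarchy because each category name already encodes its full path from the top level down.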

Important

See catalog table <reference/dataset_structure:catalog> for more catalogs with different label types.

Step 2: Write the Dataloader

A reference/glossary:Dataloader is needed to organize the dataset into a ~tensorbay.dataset.dataset.Dataset instance.

../../../../tensorbay/opendataset/Newsgroups20/loader.py

See Classification annotation <reference/label_format/Classification:Classification> for more details.

Note

The data files in "20 Newsgroups" have no file extensions, so a "txt" extension is added to the remote path of each data file to ensure that the loaded dataset works properly on TensorBay.
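
The renaming rule in this note can be sketched without the SDK. The helper names `remote_path_for` and `category_for`, the sample path, and the assumption that each file sits in a directory named after its newsgroup are all hypothetical; a real dataloader would pass the resulting name on as the remote path of each data file.

```python
import os


def remote_path_for(local_path, extension="txt"):
    """Return the file name to use remotely, adding an extension if none exists."""
    name = os.path.basename(local_path)
    _, ext = os.path.splitext(name)
    if ext:
        # The file already has an extension, so keep the name unchanged.
        return name
    return f"{name}.{extension}"


def category_for(local_path):
    """Assume the parent directory name is the category, e.g. 'alt.atheism'."""
    return os.path.basename(os.path.dirname(local_path))


sample = "20news-18828/alt.atheism/51060"  # hypothetical local file
print(remote_path_for(sample), category_for(sample))
```

Only the remote name changes in this sketch; the local file is left untouched.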

The TensorBay SDK already includes a number of dataloaders provided by the community, so importing an available dataloader is an alternative to writing one.

../../../../docs/code/Newsgroups20.py

Note

Available dataloaders load their catalogs automatically, so users do not have to write them again.

Important

See dataloader table <reference/glossary:dataloader> for dataloaders with different label types.

Visualize Dataset

Optionally, the organized dataset can be visualized by Pharos, which is a TensorBay SDK plug-in. This step can help users to check whether the dataset is correctly organized. Please see features/visualization:Visualization for more details.

Upload Dataset

The organized "20 Newsgroups" dataset can be uploaded to TensorBay for sharing, reuse, etc.

../../../../docs/code/Newsgroups20.py

Similar to Git, the commit step after uploading records changes to the dataset as a version. If needed, modify the dataset and commit again. Please see features/version_control/index:Version Control for more details.
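
The steps up to this point (authorize, create, organize, upload, commit) might be combined as in the sketch below. It assumes the TensorBay SDK is installed and that the dataset is named "Newsgroups20" on TensorBay; the function name, the `jobs` default, and the commit message are illustrative choices rather than values taken from this page.

```python
def upload_newsgroups20(access_key, dataset_path, jobs=8):
    """Create, organize, upload, and commit the dataset; return the dataset client."""
    # Imported lazily so that this sketch can be loaded without the SDK installed.
    from tensorbay import GAS
    from tensorbay.opendataset import Newsgroups20

    gas = GAS(access_key)                 # authorize a client instance
    gas.create_dataset("Newsgroups20")    # create the remote dataset (assumed name)
    dataset = Newsgroups20(dataset_path)  # organize with the bundled dataloader
    dataset_client = gas.upload_dataset(dataset, jobs=jobs)
    dataset_client.commit("initial commit")  # record this upload as a version
    return dataset_client
```

A call would look like `upload_newsgroups20("<YOUR_ACCESSKEY>", "path/to/dataset/directory")`, where both arguments are placeholders.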

Read Dataset

Now the "20 Newsgroups" dataset can be read from TensorBay.

../../../../docs/code/Newsgroups20.py

In reference/dataset_structure:Dataset "20 Newsgroups", there are four Segments <reference/dataset_structure:Segment>: 20news-18828, 20news-bydate-test, 20news-bydate-train, and 20_newsgroups. Get the segment names by listing them all.

../../../../docs/code/Newsgroups20.py

Get a segment by passing the required segment name.

../../../../docs/code/Newsgroups20.py

In the 20news-18828 reference/dataset_structure:Segment, there is a sequence of reference/dataset_structure:Data, which can be obtained by index.

../../../../docs/code/Newsgroups20.py

In each reference/dataset_structure:Data, there is a sequence of reference/label_format/Classification:Classification annotations, which can be obtained by index.

../../../../docs/code/Newsgroups20.py

There is only one label type in the "20 Newsgroups" dataset, which is Classification. The information stored in reference/label_format/CommonLabelProperties:Category is one of the category names in the "categories" list of catalog.json <Newsgroups20-catalog>. See this page <reference/label_format/Classification:Classification> for more details about the structure of Classification.
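
The read steps above might be chained as in the following sketch. It assumes the TensorBay SDK is installed, that the dataset is named "Newsgroups20", and that the classification label of a data item is exposed as `data.label.classification`; the function name and its default argument are illustrative.

```python
def read_first_category(access_key, segment_name="20news-18828"):
    """Return the category of the first data item in the given segment."""
    # Imported lazily so that this sketch can be loaded without the SDK installed.
    from tensorbay import GAS
    from tensorbay.dataset import Dataset

    gas = GAS(access_key)
    dataset = Dataset("Newsgroups20", gas)  # read the dataset from TensorBay
    print(list(dataset.keys()))             # list all segment names
    segment = dataset[segment_name]         # get a segment by name
    data = segment[0]                       # get a data item by index
    return data.label.classification.category
```

The returned string is expected to match one of the names in the catalog's "categories" list.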

Delete Dataset

../../../../docs/code/Newsgroups20.py