This topic describes how to manage the 20 Newsgroups dataset, which uses the reference/label_format/Classification:Classification
label type.
An reference/glossary:accesskey
is needed to authenticate identity when using TensorBay.
../../../../docs/code/Newsgroups20.py
../../../../docs/code/Newsgroups20.py
Normally, dataloader.py
and catalog.json
are required to organize the "20 Newsgroups" dataset into a ~tensorbay.dataset.dataset.Dataset
instance. In this example, they are stored in the same directory, as follows:
20 Newsgroups/
catalog.json
dataloader.py
The following steps organize the "20 Newsgroups" dataset into a ~tensorbay.dataset.dataset.Dataset
instance.
A Catalog <reference/dataset_structure:Catalog>
contains all label information of one dataset, and is typically stored in a JSON file like catalog.json
.
../../../../tensorbay/opendataset/Newsgroups20/catalog.json
The only annotation type for "20 Newsgroups" is reference/label_format/Classification:Classification
, and there are 20 reference/label_format/CommonLabelProperties:Category
types.
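As a rough illustration of what such a catalog holds, the sketch below builds a catalog-like dictionary for the 20 newsgroup names and round-trips it through JSON. The "categories" list comes from the text on this page; the "CLASSIFICATION" key and the "categoryDelimiter" field are assumptions about the file layout, so check them against the actual catalog.json before relying on them.

```python
import json

# The 20 newsgroup names of the original 20 Newsgroups collection.
CATEGORY_NAMES = [
    "alt.atheism",
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "comp.windows.x",
    "misc.forsale",
    "rec.autos",
    "rec.motorcycles",
    "rec.sport.baseball",
    "rec.sport.hockey",
    "sci.crypt",
    "sci.electronics",
    "sci.med",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
]

# Assumed layout: a single "CLASSIFICATION" entry that holds a delimiter
# and a "categories" list with one {"name": ...} item per category.
catalog = {
    "CLASSIFICATION": {
        "categoryDelimiter": ".",
        "categories": [{"name": name} for name in CATEGORY_NAMES],
    }
}

# Serialize and parse again, as a catalog.json file would be written and read.
catalog_text = json.dumps(catalog, indent=4)
loaded = json.loads(catalog_text)
category_count = len(loaded["CLASSIFICATION"]["categories"])
print(category_count)
```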
Note
- The
categories<reference/label_format/CommonLabelProperties:Category>
in reference/dataset_structure:Dataset
"20 Newsgroups" have a parent-child relationship, and "." is used to separate the levels.
- By passing the path of the
catalog.json
, ~tensorbay.dataset.dataset.DatasetBase.load_catalog
supports loading the catalog into the dataset.
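To make the parent-child relationship concrete, here is a minimal sketch in plain Python that splits a few category names on "." and groups them by their top-level parent. The helper name `split_levels` and the sample list are illustrative only and are not part of the TensorBay SDK.

```python
from collections import defaultdict

DELIMITER = "."  # the separator between levels of a category name


def split_levels(category, delimiter=DELIMITER):
    """Split a category name such as 'comp.graphics' into its levels."""
    return category.split(delimiter)


# A few of the 20 category names, used as sample input.
sample_categories = [
    "comp.graphics",
    "comp.sys.mac.hardware",
    "rec.autos",
    "sci.space",
]

# Group the remainder of each name under its top-level parent.
children_by_parent = defaultdict(list)
for name in sample_categories:
    levels = split_levels(name)
    children_by_parent[levels[0]].append(DELIMITER.join(levels[1:]))

for parent, children in children_by_parent.items():
    print(parent, children)
```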
Important
See catalog table <reference/dataset_structure:catalog>
for more catalogs with different label types.
A reference/glossary:Dataloader
is needed to organize the dataset into a ~tensorbay.dataset.dataset.Dataset
instance.
../../../../tensorbay/opendataset/Newsgroups20/loader.py
See Classification annotation <reference/label_format/Classification:Classification>
for more details.
Note
The data files in "20 Newsgroups" do not have extensions, so a "txt" extension is added to the remote path of each data file to ensure the loaded dataset functions well on TensorBay.
There are already a number of dataloaders in TensorBay SDK provided by the community. Thus, instead of writing a new dataloader, it is also feasible to import an available one.
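The core of such a dataloader can be sketched without the SDK. The example below assumes a local layout of `<segment>/<category>/<file>` (an assumption, not confirmed by this page), derives the category from the parent folder name, and appends ".txt" to the remote path as described in the note. The function name `collect_samples` and the file names in the fake directory tree are hypothetical; a real dataloader would wrap each record in the SDK's data and label objects instead of returning tuples.

```python
import os
import tempfile
from glob import glob


def collect_samples(root_path, segment_names):
    """Return {segment_name: [(local_path, remote_path, category), ...]}."""
    samples = {}
    for segment_name in segment_names:
        segment_path = os.path.join(root_path, segment_name)
        if not os.path.isdir(segment_path):
            continue  # skip segments that are not present locally

        records = []
        # Assumed layout: <segment>/<category>/<file without extension>
        for text_path in sorted(glob(os.path.join(segment_path, "*", "*"))):
            category = os.path.basename(os.path.dirname(text_path))
            # The local file has no extension, so ".txt" is added remotely.
            remote_path = f"{category}/{os.path.basename(text_path)}.txt"
            records.append((text_path, remote_path, category))
        samples[segment_name] = records
    return samples


# Build a tiny fake directory tree (placeholder file names) to try it out.
with tempfile.TemporaryDirectory() as root:
    for category, file_name in [("sci.space", "60804"), ("rec.autos", "101551")]:
        folder = os.path.join(root, "20news-18828", category)
        os.makedirs(folder)
        with open(os.path.join(folder, file_name), "w") as handle:
            handle.write("placeholder text")
    result = collect_samples(root, ["20news-18828", "20_newsgroups"])

remote_paths = [remote for _, remote, _ in result["20news-18828"]]
categories = [category for _, _, category in result["20news-18828"]]
print(remote_paths)
```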
../../../../docs/code/Newsgroups20.py
Note
Catalogs are automatically loaded in available dataloaders, so users do not have to write them again.
Important
See dataloader table <reference/glossary:dataloader>
for dataloaders with different label types.
Optionally, the organized dataset can be visualized by Pharos, which is a TensorBay SDK plug-in. This step can help users to check whether the dataset is correctly organized. Please see features/visualization:Visualization
for more details.
The organized "20 Newsgroups" dataset can be uploaded to TensorBay for sharing, reuse, etc.
../../../../docs/code/Newsgroups20.py
Similar to Git, the commit step after uploading records changes to the dataset as a version. If needed, modify the dataset and commit again. Please see features/version_control/index:Version Control
for more details.
Now the "20 Newsgroups" dataset can be read from TensorBay.
../../../../docs/code/Newsgroups20.py
In reference/dataset_structure:Dataset
"20 Newsgroups", there are four Segments <reference/dataset_structure:Segment>
: 20news-18828
, 20news-bydate-test
, 20news-bydate-train
, and 20_newsgroups
. Get the segment names by listing them all.
../../../../docs/code/Newsgroups20.py
Get a segment by passing the required segment name.
../../../../docs/code/Newsgroups20.py
In the 20news-18828 reference/dataset_structure:Segment
, there is a sequence of reference/dataset_structure:Data
, which can be obtained by index.
../../../../docs/code/Newsgroups20.py
In each reference/dataset_structure:Data
, there is a sequence of reference/label_format/Classification:Classification
annotations, which can be obtained by index.
../../../../docs/code/Newsgroups20.py
There is only one label type in the "20 Newsgroups" dataset, which is Classification
. The information stored in reference/label_format/CommonLabelProperties:Category
is one of the category names in the "categories" list of catalog.json <Newsgroups20-catalog>
. See this page <reference/label_format/Classification:Classification>
for more details about the structure of Classification.
../../../../docs/code/Newsgroups20.py
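The read steps above (list the segment names, get a segment by name, get a data item by index, read its category) can be pictured with a plain-Python stand-in. This is not the TensorBay SDK: the `Sample` class, the dictionary-based dataset, and the sample file name are hypothetical and only mirror the access pattern described on this page.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for one data item and its classification label."""

    remote_path: str
    category: str


# Hypothetical in-memory dataset: segment name -> sequence of samples.
dataset = {
    "20news-18828": [Sample("sci.space/60804.txt", "sci.space")],
    "20news-bydate-test": [],
    "20news-bydate-train": [],
    "20_newsgroups": [],
}

segment_names = list(dataset)      # list all segment names
segment = dataset["20news-18828"]  # get a segment by name
data = segment[0]                  # get a data item by index
category = data.category           # read the classification category
print(segment_names, category)
```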