# PrimeHub Datasets

On PrimeHub, you can manage datasets and the files they contain.

The `datasets` command group provides the operations for doing so.

## Set up the PrimeHub Python SDK


In [None]:
from primehub import PrimeHub, PrimeHubConfig
ph = PrimeHub(PrimeHubConfig())

if ph.is_ready():
    print("PrimeHub Python SDK was set up successfully")
else:
    print("PrimeHub Python SDK couldn't get the group information; follow 00-getting-started.ipynb to complete the setup")

## Help documentation

In [None]:
help(ph.datasets)

## Datasets management

---

```
Usage: 
  primehub datasets <command>

Manage datasets

Available Commands:
  create               Create a datasets
  delete               Delete the dataset
  files-download       download files from the dataset
  files-list           lists files of the dataset
  files-upload         upload files to the dataset
  get                  Get the dataset
  list                 List datasets
  update               Update a dataset

Options:
  -h, --help           Show the help

Global Options:
  --config CONFIG      Change the path of the config file (Default: ~/.primehub/config.json)
  --endpoint ENDPOINT  Override the GraphQL API endpoint
  --token TOKEN        Override the API Token
  --group GROUP        Override the current group
  --json               Output the json format (output human-friendly format by default)

```

---

### Fields for creating or updating

| field | required | type | description |
| --- | --- | --- | --- |
| id | required | string | the name of the dataset |
| tags | optional | object | dataset's tags |

*`tags` is a JSON array, e.g., `["dataset-tag-1", "dataset-tag-2"]`*
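
The field rules above can be checked locally before calling the SDK. The following is a minimal sketch only: `validate_dataset_config` and its `require_id` parameter are hypothetical names, not part of the PrimeHub SDK, and the sketch assumes that `id` may be omitted from the dict when updating, as in the update example in this notebook.

```python
from typing import Any, Dict


def validate_dataset_config(config: Dict[str, Any], require_id: bool = True) -> Dict[str, Any]:
    """Check a config dict against the field table (hypothetical helper, not an SDK call)."""
    # Only the two documented fields are accepted.
    unknown = set(config) - {"id", "tags"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    # `id` must be a non-empty string when present, and present when required.
    if "id" in config:
        if not isinstance(config["id"], str) or not config["id"]:
            raise ValueError("`id` must be a non-empty string")
    elif require_id:
        raise ValueError("`id` is required")

    # `tags` is optional; when given it must be a list of strings.
    tags = config.get("tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("`tags` must be a list of strings")

    return config
```

For example, `validate_dataset_config({"id": "test-dataset", "tags": ["training"]})` returns the dict unchanged, while a dict without `id` raises `ValueError` unless `require_id=False` is passed.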

## Examples

### Create a dataset

We can create a dataset named `test-dataset` with the tags `test-dataset-tag` and `training`:

In [None]:
config = {
  "id": "test-dataset",
  "tags": ["test-dataset-tag", "training"]
}

ph.datasets.create(config)

### List datasets or get the dataset
After creating a dataset, you can find it with the `list` or `get` command:

In [None]:
# list all datasets
datasets = list(ph.datasets.list())
n_datasets = len(datasets)
print(f'number of datasets: {n_datasets}')

# get the dataset by name
dataset = ph.datasets.get('test-dataset')
print(dataset)

### Update a dataset

We can update a dataset with new tags:

In [None]:
config = {
    "tags": ["deprecated"]
}
dataset = ph.datasets.update(dataset['name'], config)
print(dataset)

### Delete a dataset

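With `delete`, we can delete the whole dataset along with its files. The file examples that follow use `test-dataset`, so run this cell last or re-create the dataset before continuing.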

In [None]:
result = ph.datasets.delete(dataset['name'])
print(result)

### Upload files to the dataset with a given path

To upload a file or a directory to a dataset, use the `files-upload` command.

*Note: set the `recursive` option when uploading a directory.*

In [None]:
# upload a folder to the dataset
!mkdir -p test-dir
!touch test-dir/a.out
ph.datasets.files_upload('test-dir', 'test-dataset', recursive=True)

# upload a file to the dataset
!touch test.txt
!echo "Test for Python SDK" > test.txt
ph.datasets.files_upload('test.txt', 'test-dataset/test-dir')
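
The `!` lines above are notebook shell magics and do not work in a plain Python script. As a rough equivalent, the same local fixtures can be created with `pathlib`; `make_upload_fixtures` is a hypothetical helper name used only for this sketch, and it touches local files only, not PrimeHub.

```python
from pathlib import Path
from typing import List


def make_upload_fixtures(base: Path = Path(".")) -> List[Path]:
    """Create the local files used by the upload example (hypothetical helper)."""
    test_dir = base / "test-dir"
    test_dir.mkdir(parents=True, exist_ok=True)  # same as: mkdir -p test-dir

    empty_file = test_dir / "a.out"
    empty_file.touch()                           # same as: touch test-dir/a.out

    text_file = base / "test.txt"
    text_file.write_text("Test for Python SDK\n")  # same as: echo "..." > test.txt

    return [empty_file, text_file]
```

Calling `make_upload_fixtures()` in the working directory prepares `test-dir/a.out` and `test.txt`, after which the two `ph.datasets.files_upload(...)` calls can run unchanged.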

### List files of the dataset with a given path
We can use `files-list` to view the files in a dataset.

`files-list` requires a `path` parameter, which always starts with the name of the dataset:

In [None]:
files = ph.datasets.files_list('test-dataset')
print(files)

# sub-directory
files = ph.datasets.files_list('test-dataset/test-dir')
print(files)
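
Because every path starts with the dataset name, it can help to build paths in one place instead of concatenating strings by hand. The sketch below assumes paths are `/`-separated, as in the examples above; `dataset_path` is a hypothetical helper, not an SDK function.

```python
import posixpath


def dataset_path(dataset: str, *parts: str) -> str:
    """Join a dataset name and optional sub-paths into 'dataset/sub/path' (hypothetical helper)."""
    if not dataset or "/" in dataset:
        raise ValueError("dataset must be a non-empty name without '/'")
    # Drop empty segments and stray slashes so the result has no '//' or trailing '/'.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return posixpath.join(dataset, *cleaned)
```

With this helper, `ph.datasets.files_list(dataset_path('test-dataset', 'test-dir'))` requests the same path as the sub-directory example above.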

### Download files from the dataset with a given path
Use `files-download` to download a file or a directory from a dataset. The example below passes `recursive=True` to download a whole directory.

In [None]:
ph.datasets.files_download('test-dataset', 'local-dir', recursive=True)
!tree local-dir
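
The `tree` command is not installed on every system. As a fallback, the downloaded files can be listed with the standard library; `list_tree` is a hypothetical helper name, and it only inspects the local directory.

```python
from pathlib import Path
from typing import List


def list_tree(root: str) -> List[str]:
    """Return the files under `root` as sorted paths relative to it (hypothetical helper)."""
    base = Path(root)
    # rglob("*") walks the directory recursively; keep regular files only.
    return sorted(p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file())
```

After the download cell has run, `print(list_tree('local-dir'))` shows which files arrived locally.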