## Uploading a dataset

In [1]:
from pubweb import PubWeb
client = PubWeb()

List the projects you have access to and pick the one you want to upload to.
You can also find the project ID in the project's URL on the data portal.

In [2]:
projects = client.project.list()
projects
project = projects[2]  # the index depends on the projects returned for your account
project

Project(id='9a31492a-e679-43ce-9f06-d84213c8f7f7', name='Test Project', description='Project used to test updates to the Portal')
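
Picking a project by list position can break when the list changes. As a minimal sketch, the lookup could be done by attribute instead. `find_one` is a hypothetical helper, not part of `pubweb`, and the stand-in objects only assume that each item exposes `id` and `name` attributes, as the output above suggests.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel so a missing attribute never equals a requested value


def find_one(items, **attrs):
    """Return the single item whose attributes equal every value in `attrs`."""
    matches = [
        item for item in items
        if all(getattr(item, key, _MISSING) == value for key, value in attrs.items())
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match for {attrs}, found {len(matches)}")
    return matches[0]


# Stand-in objects for illustration; in the notebook the list would come
# from client.project.list().
example_projects = [
    SimpleNamespace(id="p-1", name="Demo"),
    SimpleNamespace(id="p-2", name="Test Project"),
]

selected = find_one(example_projects, name="Test Project")
print(selected.id)  # p-2
```

Assuming process objects expose an `id` attribute in the same way, the same helper could select an ingest process, for example `find_one(ingest_processes, id='custom_dataset')`.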

Select the ingest process that will be used to load the data.

In [3]:
from pubweb.models.process import Executor

ingest_processes = client.process.list(Executor.INGEST)
process = ingest_processes[3]  # the index depends on the processes available to you
process

Process(id='custom_dataset', name='Custom Dataset', description='Any collection of files provided by the user', child_process_ids=None, executor=<Executor.INGEST: 'INGEST'>, documentation_url=None, code=None, form_spec_json=None, sample_sheet_path=None, file_requirements_message=None, file_mapping_rules=None)

Two helper functions are included: `get_files_in_directory` lists the files in the specified directory, and `filter_files_by_pattern` filters that list by a pattern.

You can also create the list of files manually, using paths relative to the upload directory.

In [4]:
from pubweb.file_utils import get_files_in_directory, filter_files_by_pattern

directory_to_upload = '/test'

files = get_files_in_directory(directory_to_upload)
files_to_upload = filter_files_by_pattern(files, '*.fastq.gz')
files_to_upload

['test.fastq.gz']
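
The manual alternative mentioned above can be sketched with the standard library alone. `list_relative_files` is a hypothetical helper, not part of `pubweb`; whether its output matches the helpers' output exactly (ordering, path separators) is an assumption to verify against your own directory.

```python
import fnmatch
import os


def list_relative_files(directory, pattern="*"):
    """Walk `directory` and return the sorted relative paths of files
    whose file name matches the glob `pattern`."""
    relative_paths = []
    for root, _dirs, filenames in os.walk(directory):
        for filename in filenames:
            if fnmatch.fnmatch(filename, pattern):
                full_path = os.path.join(root, filename)
                # Store the path relative to the upload directory.
                relative_paths.append(os.path.relpath(full_path, directory))
    return sorted(relative_paths)
```

For the directory used in this notebook, the call would be `list_relative_files(directory_to_upload, '*.fastq.gz')`.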

Set the `name` and `description` variables for the new dataset, then run the cell to check the files and upload them.

In [5]:
from pubweb.models.dataset import CreateIngestDatasetInput
from pubweb.file_utils import check_dataset_files

name = 'Test dataset'
description = ''

file_mapping_rules = client.process.get_process(process.id).file_mapping_rules
check_dataset_files(files_to_upload, file_mapping_rules, directory_to_upload)

dataset_create_request = CreateIngestDatasetInput(
    project_id=project.id,
    process_id=process.id,
    name=name,
    description=description,
    files=files_to_upload
)

create_response = client.dataset.create(dataset_create_request)

client.dataset.upload_files(
    project_id=project.id,
    dataset_id=create_response['datasetId'],
    directory=directory_to_upload,
    files=dataset_create_request.files
)

create_response['datasetId']

Uploading file test.fastq.gz (180.76 KB) | 100.0%|█████████████████████████ | 14


'186b1d46-eb8b-428a-a9c1-01eeb37cf697'
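
Before running an upload cell like the one above on a large directory, it can help to confirm locally that every listed file exists and to see how much data will be sent. The sketch below uses only the standard library and is independent of whatever `check_dataset_files` validates; `summarize_upload` is a hypothetical helper, not part of `pubweb`.

```python
import os


def summarize_upload(directory, relative_paths):
    """Return (missing, total_bytes) for the files expected under `directory`.

    `missing` lists the relative paths that are not regular files;
    `total_bytes` is the combined size of the files that were found.
    """
    missing = []
    total_bytes = 0
    for relative_path in relative_paths:
        full_path = os.path.join(directory, relative_path)
        if os.path.isfile(full_path):
            total_bytes += os.path.getsize(full_path)
        else:
            missing.append(relative_path)
    return missing, total_bytes
```

In this notebook it could be called as `missing, total_bytes = summarize_upload(directory_to_upload, files_to_upload)` before creating the dataset, stopping if `missing` is not empty.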