[SYNPY-1384] Create uploading data in bulk tutorial (#1101)
* Create uploading data in bulk tutorial
BryanFauble committed May 30, 2024
1 parent 486aefc commit 7b98106
Showing 21 changed files with 478 additions and 360 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -107,7 +107,7 @@ Perform the following one-time steps to set up your local environment.
pipenv install --dev
# Set your active session to the virtual environment you created
pipenv shell
# Note: The 'Python Environment Manager' extension in vscode is reccomended here
# Note: The 'Python Environment Manager' extension in vscode is recommended here
```

4. Once completed you are ready to start developing. Commands run through the CLI, or through an IDE like visual studio code within the virtual environment will have all required dependencies automatically installed. Try running `synapse -h` in your shell to read over the available CLI commands. Or view the `Usage as a library` section in the README.md to get started using the library to write more python.
18 changes: 13 additions & 5 deletions README.md
@@ -44,16 +44,24 @@ The Python Synapse client has been tested on 3.8, 3.9, 3.10 and 3.11 on Mac OS X

The [Python Synapse Client is on PyPI](https://pypi.python.org/pypi/synapseclient) and can be installed with pip:

(sudo) pip install synapseclient[pandas,pysftp]
# Here are a few ways to install the client. Choose the one that fits your use-case
# sudo may optionally be needed depending on your setup

pip install --upgrade synapseclient
pip install --upgrade "synapseclient[pandas]"
pip install --upgrade "synapseclient[pandas, pysftp, boto3]"

...or to upgrade an existing installation of the Synapse client:

(sudo) pip install --upgrade synapseclient
# sudo may optionally be needed depending on your setup
pip install --upgrade synapseclient

The dependencies on `pandas` and `pysftp` are optional. Synapse [Tables](https://python-docs.synapse.org/reference/tables/) integrate
The dependencies on `pandas`, `pysftp`, and `boto3` are optional. Synapse
[Tables](https://python-docs.synapse.org/reference/tables/) integrate
with [Pandas](http://pandas.pydata.org/). The library `pysftp` is required for users of
[SFTP](https://python-docs.synapse.org/guides/data_storage/#sftp) file storage. Both libraries require native code
to be compiled or installed separately from prebuilt binaries.
[SFTP](https://python-docs.synapse.org/guides/data_storage/#sftp) file storage. All
libraries require native code to be compiled or installed separately from prebuilt
binaries.
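
As a rough check of which optional dependencies are available in the current environment, the sketch below uses only the standard library. The helper name `installed_extras` is made up for this example and is not part of `synapseclient`; it assumes the import names match the package names `pandas`, `pysftp`, and `boto3`.

```python
import importlib.util

# Import names of the optional extras mentioned above (assumed to match
# the package names used in the pip commands).
OPTIONAL_EXTRAS = ("pandas", "pysftp", "boto3")


def installed_extras(names=OPTIONAL_EXTRAS):
    """Map each module name to True if it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in installed_extras().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

Any extra reported as missing can be added later by re-running one of the `pip install --upgrade` commands with the matching extra.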

### Install from source

2 changes: 1 addition & 1 deletion docs/explanations/benchmarking.md
@@ -93,7 +93,7 @@ This test includes adding 5 annotations to each file, a Text, Integer, Floating

S3 was not benchmarked again.

As a result of these tests the sweet spot for thread count is around 50 threads. It is not reccomended to go over 50 threads as it resulted in signficant instability in the client.
As a result of these tests the sweet spot for thread count is around 50 threads. It is not recommended to go over 50 threads as it resulted in significant instability in the client.

| Test | Thread Count | Synapseutils Sync | os.walk + syn.store | Per file size |
|---------------------------|--------------|-------------------|---------------------|---------------|
2 changes: 1 addition & 1 deletion docs/news.md
@@ -80,7 +80,7 @@
with Synapse.
- **Date type Annotations on Synapse entities are now timezone aware**. Review our
[reference documentation for Annotations](https://python-docs.synapse.org/reference/annotations/).
The [`pytz` package](https://pypi.org/project/pytz/) is reccomended if you regularly
The [`pytz` package](https://pypi.org/project/pytz/) is recommended if you regularly
work with data across time zones.
- If you do not set the `tzinfo` field on a date or datetime instance we will use the
timezone of the machine where the code is executing.
22 changes: 19 additions & 3 deletions docs/tutorials/installation.md
@@ -11,7 +11,13 @@ The [synapseclient](https://pypi.python.org/pypi/synapseclient/) package is avai
```bash
conda create -n synapseclient python=3.9
conda activate synapseclient
(sudo) pip install (--upgrade) synapseclient[pandas, pysftp]

# Here are a few ways to install the client. Choose the one that fits your use-case
# sudo may optionally be needed depending on your setup

pip install --upgrade synapseclient
pip install --upgrade "synapseclient[pandas]"
pip install --upgrade "synapseclient[pandas, pysftp, boto3]"
```

- pyenv: Use [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your python environment:
@@ -21,10 +27,20 @@ pyenv install -v 3.9.13
pyenv global 3.9.13
python -m venv env
source env/bin/activate
(sudo) python3 -m pip3 install (--upgrade) synapseclient[pandas, pysftp]

# Here are a few ways to install the client. Choose the one that fits your use-case
# sudo may optionally be needed depending on your setup

python -m pip install --upgrade synapseclient
python -m pip install --upgrade "synapseclient[pandas]"
python -m pip install --upgrade "synapseclient[pandas, pysftp, boto3]"

python3 -m pip install --upgrade synapseclient
python3 -m pip install --upgrade "synapseclient[pandas]"
python3 -m pip install --upgrade "synapseclient[pandas, pysftp, boto3]"
```

The dependencies on pandas and pysftp are optional. The Synapse `synapseclient.table` feature integrates with Pandas. Support for sftp is required for users of SFTP file storage. Both require native libraries to be compiled or installed separately from prebuilt binaries.
The dependencies on pandas, pysftp, and boto3 are optional. The Synapse `synapseclient.table` feature integrates with Pandas. Support for sftp is required for users of SFTP file storage. Both require native libraries to be compiled or installed separately from prebuilt binaries.

## Local

79 changes: 13 additions & 66 deletions docs/tutorials/python/annotation.md
@@ -5,6 +5,12 @@ Annotations are stored as key-value pairs in Synapse, where the key defines a pa

Annotations can be based on an existing ontology or controlled vocabulary, or can be created as needed and modified later as your metadata evolves.


**Note:** You may optionally follow the [Uploading data in bulk](./upload_data_in_bulk.md)
tutorial instead. The bulk tutorial may fit your needs better as it limits the amount
of code that you are required to write and maintain.


## Tutorial Purpose
In this tutorial you will:

@@ -19,58 +25,19 @@ In this tutorial you will:

#### First let's retrieve all of the Synapse IDs we are going to use
```python
import os
import synapseclient
from synapseclient import File

syn = synapseclient.login()

# Retrieve the project ID
my_project_id = syn.findEntityId(
name="My uniquely named project about Alzheimer's Disease"
)

# Retrieve the folders I want to annotate files in
batch_1_folder_id = syn.findEntityId(
name="single_cell_RNAseq_batch_1", parent=my_project_id
)

print(f"Batch 1 Folder ID: {batch_1_folder_id}")

{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=5-22}
```

#### Next let's define the annotations I want to set

```python
annotation_values = {
"species": "Homo sapiens",
"dataType": "geneExpression",
"assay": "SCRNA-seq",
"fileFormat": "fastq",
}
{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=25-31}
```

#### Finally we'll loop over all of the files and set their annotations

```python
for file_batch_1 in syn.getChildren(parent=batch_1_folder_id, includeTypes=["file"]):
# Grab and print the existing annotations this File may already have
existing_annotations_for_file = syn.get_annotations(entity=file_batch_1)

print(
f"Got the annotations for File: {file_batch_1['name']}, ID: {file_batch_1['id']}, Annotations: {existing_annotations_for_file}"
)

# Merge the new annotations with anything existing
existing_annotations_for_file.update(annotation_values)

existing_annotations_for_file = syn.set_annotations(
annotations=existing_annotations_for_file
)

print(
f"Set the annotations for File: {file_batch_1['name']}, ID: {file_batch_1['id']}, Annotations: {existing_annotations_for_file}"
)
{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=33-51}
```
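
The merge step in the loop above relies on ordinary dictionary update semantics: keys already present on the file are overwritten by the new values, and all other keys are kept. A minimal sketch follows, with plain dicts standing in for the annotations returned by Synapse. The `tissue` key and the `merge_annotations` helper are hypothetical, and unlike the tutorial code this version copies the existing values instead of updating them in place.

```python
def merge_annotations(existing, new_values):
    """Return a copy of `existing` with `new_values` applied on top."""
    merged = dict(existing)    # copy so the caller's mapping is left untouched
    merged.update(new_values)  # new values win when a key appears in both
    return merged


# Hypothetical annotations already present on a file.
existing = {"species": "Mus musculus", "tissue": "brain"}

# The values defined earlier in the tutorial.
annotation_values = {
    "species": "Homo sapiens",
    "dataType": "geneExpression",
    "assay": "SCRNA-seq",
    "fileFormat": "fastq",
}

merged = merge_annotations(existing, annotation_values)
print(merged)
```

Because new values take precedence, running the same script twice leaves the annotations unchanged the second time.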


@@ -93,31 +60,11 @@ Assuming we have a few new files we want to upload we'll follow a similar patter
in the [File tutorial](./file.md), except now we'll specify the `annotations` attribute before
uploading the file to Synapse.

```python
batch_1_scrnaseq_new_file_1 = File(
path=os.path.expanduser(
"~/my_ad_project/single_cell_RNAseq_batch_1/SRR92345678_R1.fastq.gz"
),
parent=batch_1_folder_id,
annotations=annotation_values,
)
batch_1_scrnaseq_new_file_2 = File(
path=os.path.expanduser(
"~/my_ad_project/single_cell_RNAseq_batch_1/SRR92345678_R2.fastq.gz"
),
parent=batch_1_folder_id,
annotations=annotation_values,
)
batch_1_scrnaseq_new_file_1 = syn.store(obj=batch_1_scrnaseq_new_file_1)
batch_1_scrnaseq_new_file_2 = syn.store(obj=batch_1_scrnaseq_new_file_2)

print(
f"Stored file: {batch_1_scrnaseq_new_file_1['name']}, ID: {batch_1_scrnaseq_new_file_1['id']}, Annotations: {batch_1_scrnaseq_new_file_1['annotations']}"
)
print(
f"Stored file: {batch_1_scrnaseq_new_file_2['name']}, ID: {batch_1_scrnaseq_new_file_2['id']}, Annotations: {batch_1_scrnaseq_new_file_2['annotations']}"
)
In order for the following script to work please replace the files with ones that
already exist on your local machine.

```python
{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=53-78}
```

<details class="example">
15 changes: 7 additions & 8 deletions docs/tutorials/python/download_data_in_bulk.md
@@ -9,13 +9,12 @@ This tutorial will follow a
With a project that has this example layout:
```
.
├── experiment_notes
│   ├── notes_2022
│   │   ├── fileA.txt
│   │   └── fileB.txt
│   └── notes_2023
│   ├── fileC.txt
│   └── fileD.txt
├── biospecimen_experiment_1
│   ├── fileA.txt
│   └── fileB.txt
├── biospecimen_experiment_2
│   ├── fileC.txt
│   └── fileD.txt
├── single_cell_RNAseq_batch_1
│   ├── SRR12345678_R1.fastq.gz
│   └── SRR12345678_R2.fastq.gz
@@ -36,7 +35,7 @@ In this tutorial you will:
* Make sure that you have completed the following tutorials:
* [Folder](./folder.md)
* [File](./file.md)
* This tutorial is setup to download the data to `~/temp`, make sure that this or
* This tutorial is setup to download the data to `~/my_ad_project`, make sure that this or
another desired directory exists.
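
One way to make sure the download directory exists before running the tutorial is sketched below, using only the standard library. `ensure_directory` is a hypothetical helper, not part of `synapseclient`; swap in whichever path you want to download to.

```python
from pathlib import Path


def ensure_directory(path):
    """Expand `~` and create the directory (and any parents) if it is missing."""
    directory = Path(path).expanduser()
    # exist_ok=True makes this safe to call when the directory is already there.
    directory.mkdir(parents=True, exist_ok=True)
    return directory


if __name__ == "__main__":
    print(ensure_directory("~/my_ad_project"))
```

Calling the helper more than once is harmless, so it can sit at the top of a download script.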

