[SYNPY-1384] Create uploading data in bulk tutorial #1101

Merged
merged 4 commits into from
May 30, 2024
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -107,7 +107,7 @@ Perform the following one-time steps to set up your local environment.
pipenv install --dev
# Set your active session to the virtual environment you created
pipenv shell
# Note: The 'Python Environment Manager' extension in vscode is reccomended here
# Note: The 'Python Environment Manager' extension in vscode is recommended here
```

4. Once completed you are ready to start developing. Commands run through the CLI, or through an IDE like visual studio code within the virtual environment will have all required dependencies automatically installed. Try running `synapse -h` in your shell to read over the available CLI commands. Or view the `Usage as a library` section in the README.md to get started using the library to write more python.
18 changes: 13 additions & 5 deletions README.md
@@ -44,16 +44,24 @@ The Python Synapse client has been tested on 3.8, 3.9, 3.10 and 3.11 on Mac OS X

The [Python Synapse Client is on PyPI](https://pypi.python.org/pypi/synapseclient) and can be installed with pip:

(sudo) pip install synapseclient[pandas,pysftp]
# Here are a few ways to install the client. Choose the one that fits your use-case
# sudo may optionally be needed depending on your setup

pip install --upgrade synapseclient
pip install --upgrade "synapseclient[pandas]"
pip install --upgrade "synapseclient[pandas, pysftp, boto3]"

...or to upgrade an existing installation of the Synapse client:

(sudo) pip install --upgrade synapseclient
# sudo may optionally be needed depending on your setup
pip install --upgrade synapseclient

The dependencies on `pandas` and `pysftp` are optional. Synapse [Tables](https://python-docs.synapse.org/reference/tables/) integrate
The dependencies on `pandas`, `pysftp`, and `boto3` are optional. Synapse
[Tables](https://python-docs.synapse.org/reference/tables/) integrate
with [Pandas](http://pandas.pydata.org/). The library `pysftp` is required for users of
[SFTP](https://python-docs.synapse.org/guides/data_storage/#sftp) file storage. Both libraries require native code
to be compiled or installed separately from prebuilt binaries.
[SFTP](https://python-docs.synapse.org/guides/data_storage/#sftp) file storage. All
libraries require native code to be compiled or installed separately from prebuilt
binaries.
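
As a rough illustration of what "optional" means here, the sketch below checks which of the three extras can be imported in the current environment. The helper name `available_extras` is hypothetical and is not part of `synapseclient`; it relies only on the standard library.

```python
from importlib.util import find_spec

# Import names of the optional extras named in the README text above.
OPTIONAL_MODULES = ("pandas", "pysftp", "boto3")


def available_extras(modules=OPTIONAL_MODULES):
    """Map each module name to True if it can be imported in this environment."""
    return {name: find_spec(name) is not None for name in modules}


# Report which extras are present; output depends on the local environment.
for module_name, installed in available_extras().items():
    status = "installed" if installed else "missing"
    print(f"{module_name}: {status}")
```

A missing entry only matters if you use the matching feature, for example Tables with Pandas or SFTP file storage.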

### Install from source

2 changes: 1 addition & 1 deletion docs/explanations/benchmarking.md
@@ -93,7 +93,7 @@ This test includes adding 5 annotations to each file, a Text, Integer, Floating

S3 was not benchmarked again.

As a result of these tests the sweet spot for thread count is around 50 threads. It is not reccomended to go over 50 threads as it resulted in signficant instability in the client.
As a result of these tests the sweet spot for thread count is around 50 threads. It is not recommended to go over 50 threads as it resulted in signficant instability in the client.

| Test | Thread Count | Synapseutils Sync | os.walk + syn.store | Per file size |
|---------------------------|--------------|-------------------|---------------------|---------------|
2 changes: 1 addition & 1 deletion docs/news.md
@@ -80,7 +80,7 @@
with Synapse.
- **Date type Annotations on Synapse entities are now timezone aware**. Review our
[reference documentation for Annotations](https://python-docs.synapse.org/reference/annotations/).
The [`pytz` package](https://pypi.org/project/pytz/) is reccomended if you regularly
The [`pytz` package](https://pypi.org/project/pytz/) is recommended if you regularly
work with data across time zones.
- If you do not set the `tzinfo` field on a date or datetime instance we will use the
timezone of the machine where the code is executing.
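
The difference between an aware and a naive datetime can be sketched with the standard library alone. This is only an illustration of the fallback described above; whether the client uses this exact mechanism internally is an assumption, and the dates are arbitrary examples.

```python
from datetime import datetime, timezone

# An aware datetime: tzinfo is set explicitly, so the instant is unambiguous.
aware = datetime(2024, 5, 30, 12, 0, tzinfo=timezone.utc)

# A naive datetime: tzinfo is None, so its meaning depends on where the code runs.
naive = datetime(2024, 5, 30, 12, 0)

# astimezone() with no argument treats a naive value as local time and
# attaches the timezone of the machine running the code.
localized = naive.astimezone()

print(aware.isoformat())
print(naive.tzinfo)
print(localized.tzinfo is not None)
```

Setting `tzinfo` explicitly, as in `aware`, avoids results that change with the machine the code runs on.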
22 changes: 19 additions & 3 deletions docs/tutorials/installation.md
@@ -11,7 +11,13 @@ The [synapseclient](https://pypi.python.org/pypi/synapseclient/) package is avai
```bash
conda create -n synapseclient python=3.9
conda activate synapseclient
(sudo) pip install (--upgrade) synapseclient[pandas, pysftp]

# Here are a few ways to install the client. Choose the one that fits your use-case
# sudo may optionally be needed depending on your setup

pip install --upgrade synapseclient
pip install --upgrade "synapseclient[pandas]"
pip install --upgrade "synapseclient[pandas, pysftp, boto3]"
```

- pyenv: Use [virtualenv](https://virtualenv.pypa.io/en/latest/) to manage your python environment:
@@ -21,10 +27,20 @@ pyenv install -v 3.9.13
pyenv global 3.9.13
python -m venv env
source env/bin/activate
(sudo) python3 -m pip3 install (--upgrade) synapseclient[pandas, pysftp]

# Here are a few ways to install the client. Choose the one that fits your use-case
# sudo may optionally be needed depending on your setup

python -m pip install --upgrade synapseclient
python -m pip install --upgrade "synapseclient[pandas]"
python -m pip install --upgrade "synapseclient[pandas, pysftp, boto3]"

python3 -m pip install --upgrade synapseclient
python3 -m pip install --upgrade "synapseclient[pandas]"
python3 -m pip install --upgrade "synapseclient[pandas, pysftp, boto3]"
```

The dependencies on pandas and pysftp are optional. The Synapse `synapseclient.table` feature integrates with Pandas. Support for sftp is required for users of SFTP file storage. Both require native libraries to be compiled or installed separately from prebuilt binaries.
The dependencies on pandas, pysftp, and boto3 are optional. The Synapse `synapseclient.table` feature integrates with Pandas. Support for sftp is required for users of SFTP file storage. Both require native libraries to be compiled or installed separately from prebuilt binaries.

## Local

79 changes: 13 additions & 66 deletions docs/tutorials/python/annotation.md
@@ -5,6 +5,12 @@ Annotations are stored as key-value pairs in Synapse, where the key defines a pa

Annotations can be based on an existing ontology or controlled vocabulary, or can be created as needed and modified later as your metadata evolves.


**Note:** You may optionally follow the [Uploading data in bulk](./upload_data_in_bulk.md)
tutorial instead. The bulk tutorial may fit your needs better as it limits the amount
of code that you are required to write and maintain.


## Tutorial Purpose
In this tutorial you will:

@@ -19,58 +25,19 @@ In this tutorial you will:

#### First let's retrieve all of the Synapse IDs we are going to use
```python
import os
import synapseclient
from synapseclient import File

syn = synapseclient.login()

# Retrieve the project ID
my_project_id = syn.findEntityId(
name="My uniquely named project about Alzheimer's Disease"
)

# Retrieve the folders I want to annotate files in
batch_1_folder_id = syn.findEntityId(
name="single_cell_RNAseq_batch_1", parent=my_project_id
)

print(f"Batch 1 Folder ID: {batch_1_folder_id}")

{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=5-22}
Member

This is neat, although does this put us at risk of having to make sure we don't modify these tutorial_scripts?

Contributor Author

I agree there is some risk here. In my opinion, manually copying the code over is a risk too. At least this way the code gets updated, as long as the line ranges for each step are updated along with it. I guess it'll be something to look out for in future PRs if there are changes.

```
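
One way to keep an eye on the risk raised in the review comments above is to preview which lines a `lines=M-N` range would pull from a script. The helper below is a hypothetical sketch, not part of the include plugin, and it assumes the range is 1-indexed and inclusive.

```python
def preview_include(source: str, first: int, last: int) -> str:
    """Return lines first..last of source, assuming 1-indexed inclusive bounds."""
    lines = source.splitlines()
    return "\n".join(lines[first - 1 : last])


# A stand-in for the contents of a tutorial script (illustrative text only).
script = "import os\nimport synapseclient\n\nsyn = synapseclient.login()\n"

# Show what a range of lines=1-2 would select under the assumption above.
print(preview_include(script, 1, 2))
```

If a script gains or loses lines, rerunning a preview like this for each range shows whether a snippet has drifted.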

#### Next let's define the annotations I want to set

```python
annotation_values = {
"species": "Homo sapiens",
"dataType": "geneExpression",
"assay": "SCRNA-seq",
"fileFormat": "fastq",
}
{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=25-31}
```

#### Finally we'll loop over all of the files and set their annotations

```python
for file_batch_1 in syn.getChildren(parent=batch_1_folder_id, includeTypes=["file"]):
# Grab and print the existing annotations this File may already have
existing_annotations_for_file = syn.get_annotations(entity=file_batch_1)

print(
f"Got the annotations for File: {file_batch_1['name']}, ID: {file_batch_1['id']}, Annotations: {existing_annotations_for_file}"
)

# Merge the new annotations with anything existing
existing_annotations_for_file.update(annotation_values)

existing_annotations_for_file = syn.set_annotations(
annotations=existing_annotations_for_file
)

print(
f"Set the annotations for File: {file_batch_1['name']}, ID: {file_batch_1['id']}, Annotations: {existing_annotations_for_file}"
)
{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=33-51}
```
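
The merge step in the loop relies on ordinary `dict.update` behavior. The sketch below uses plain dictionaries as stand-ins for what `syn.get_annotations` returns, and the existing values (`Mus musculus`, `brain`) are made up for illustration.

```python
# Hypothetical annotations already present on a file.
existing_annotations = {"species": "Mus musculus", "tissue": "brain"}

# The new values from the tutorial step above.
annotation_values = {
    "species": "Homo sapiens",
    "dataType": "geneExpression",
    "assay": "SCRNA-seq",
    "fileFormat": "fastq",
}

# update() keeps keys that exist only in the existing annotations and
# overwrites keys present in both with the new value.
existing_annotations.update(annotation_values)

print(existing_annotations["species"])
print(existing_annotations["tissue"])
```

In this sketch `species` is overwritten while `tissue` is kept, which is why the tutorial fetches the existing annotations before setting them.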


@@ -93,31 +60,11 @@ Assuming we have a few new files we want to upload we'll follow a similar patter
in the [File tutorial](./file.md), except now we'll specify the `annotations` attribute before
uploading the file to Synapse.

```python
batch_1_scrnaseq_new_file_1 = File(
path=os.path.expanduser(
"~/my_ad_project/single_cell_RNAseq_batch_1/SRR92345678_R1.fastq.gz"
),
parent=batch_1_folder_id,
annotations=annotation_values,
)
batch_1_scrnaseq_new_file_2 = File(
path=os.path.expanduser(
"~/my_ad_project/single_cell_RNAseq_batch_1/SRR92345678_R2.fastq.gz"
),
parent=batch_1_folder_id,
annotations=annotation_values,
)
batch_1_scrnaseq_new_file_1 = syn.store(obj=batch_1_scrnaseq_new_file_1)
batch_1_scrnaseq_new_file_2 = syn.store(obj=batch_1_scrnaseq_new_file_2)

print(
f"Stored file: {batch_1_scrnaseq_new_file_1['name']}, ID: {batch_1_scrnaseq_new_file_1['id']}, Annotations: {batch_1_scrnaseq_new_file_1['annotations']}"
)
print(
f"Stored file: {batch_1_scrnaseq_new_file_2['name']}, ID: {batch_1_scrnaseq_new_file_2['id']}, Annotations: {batch_1_scrnaseq_new_file_2['annotations']}"
)
In order for the following script to work please replace the files with ones that
already exist on your local machine.

```python
{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=53-78}
```

<details class="example">
15 changes: 7 additions & 8 deletions docs/tutorials/python/download_data_in_bulk.md
@@ -9,13 +9,12 @@ This tutorial will follow a
With a project that has this example layout:
```
.
├── experiment_notes
│   ├── notes_2022
│   │   ├── fileA.txt
│   │   └── fileB.txt
│   └── notes_2023
│   ├── fileC.txt
│   └── fileD.txt
├── biospecimen_experiment_1
│   ├── fileA.txt
│   └── fileB.txt
├── biospecimen_experiment_2
│   ├── fileC.txt
│   └── fileD.txt
├── single_cell_RNAseq_batch_1
│   ├── SRR12345678_R1.fastq.gz
│   └── SRR12345678_R2.fastq.gz
@@ -36,7 +35,7 @@ In this tutorial you will:
* Make sure that you have completed the following tutorials:
* [Folder](./folder.md)
* [File](./file.md)
* This tutorial is setup to download the data to `~/temp`, make sure that this or
* This tutorial is setup to download the data to `~/my_ad_project`, make sure that this or
another desired directory exists.
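
A minimal sketch of making sure the download directory exists before running the tutorial. The helper name `ensure_download_dir` is hypothetical and uses only the standard library.

```python
import os


def ensure_download_dir(path: str) -> str:
    """Expand ~ in path, create the directory if needed, and return the full path."""
    full_path = os.path.expanduser(path)
    # exist_ok=True makes this safe to call when the directory is already there.
    os.makedirs(full_path, exist_ok=True)
    return full_path
```

Calling `ensure_download_dir("~/my_ad_project")` would create the directory named in the prerequisite above if it is missing.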

