
[SYNPY-1416] File model finishing touches for OOP #1060

Merged
BryanFauble merged 15 commits into develop from the SYNPY-1416-file-model branch on Feb 7, 2024

Conversation

BryanFauble (Contributor):

Problem:

  1. The File model was not complete
  2. A file could not be stored in Synapse unless the Parent ID was specified, even when the Synapse ID of the file was already known

Solution:

  1. Adding the various flags for get and store that modify how they function
  2. Adding a copy function
  3. Adding a change_metadata function
  4. Adding guard conditions in several areas
  5. Updating the order in which Annotations and Activities are concurrently saved, to reflect the issue noted in https://sagebionetworks.jira.com/browse/PLFM-8251
  6. Updating the File script
  7. Adding logic to client.py to allow storing a file when only the file's ID is specified (see the usage sketch after the Testing list)

Testing:

  1. Unit/integration testing
  2. File script provided
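
For illustration, here is a minimal usage sketch (not part of this PR) of the behaviors described above. The class and method names (File, store, get, copy, change_metadata) come from the PR description and the sample script later in this thread; the exact parameter names (download_file, parent_id) and the Synapse IDs are assumptions for the example and may differ from the final API.

import asyncio

import synapseclient
from synapseclient.models import File

syn = synapseclient.Synapse()
syn.login()


async def main():
    # Store an update to a file when only its Synapse ID is known - previously
    # this also required specifying the parent ID.
    file = File(id="syn123", path="~/temp/my_file.txt")  # hypothetical ID and path
    await file.store()

    # Retrieve the file without downloading it, then change its metadata and
    # copy it into another container.
    retrieved = await File(id="syn123", download_file=False).get()
    await retrieved.change_metadata(name="renamed_file.txt")
    await retrieved.copy(parent_id="syn456")  # hypothetical destination container


asyncio.run(main())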

BryanFauble requested a review from a team as a code owner on February 2, 2024 20:30
pep8speaks commented Feb 2, 2024

Hello @BryanFauble! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1515:89: E501 line too long (110 > 88 characters)
Line 1640:89: E501 line too long (90 > 88 characters)

Line 20:89: E501 line too long (102 > 88 characters)
Line 27:89: E501 line too long (89 > 88 characters)
Line 65:89: E501 line too long (110 > 88 characters)

Line 60:89: E501 line too long (93 > 88 characters)
Line 71:89: E501 line too long (89 > 88 characters)
Line 85:89: E501 line too long (89 > 88 characters)
Line 128:89: E501 line too long (90 > 88 characters)
Line 130:89: E501 line too long (89 > 88 characters)
Line 131:89: E501 line too long (89 > 88 characters)
Line 145:89: E501 line too long (89 > 88 characters)
Line 154:89: E501 line too long (92 > 88 characters)
Line 160:89: E501 line too long (89 > 88 characters)
Line 163:89: E501 line too long (89 > 88 characters)
Line 180:89: E501 line too long (89 > 88 characters)
Line 196:89: E501 line too long (89 > 88 characters)
Line 237:89: E501 line too long (89 > 88 characters)
Line 323:89: E501 line too long (89 > 88 characters)
Line 466:89: E501 line too long (89 > 88 characters)
Line 468:89: E501 line too long (95 > 88 characters)
Line 470:89: E501 line too long (89 > 88 characters)
Line 472:89: E501 line too long (101 > 88 characters)
Line 474:89: E501 line too long (96 > 88 characters)
Line 481:89: E501 line too long (101 > 88 characters)
Line 498:89: E501 line too long (90 > 88 characters)
Line 569:89: E501 line too long (90 > 88 characters)
Line 576:89: E501 line too long (92 > 88 characters)
Line 580:89: E501 line too long (116 > 88 characters)
Line 631:89: E501 line too long (89 > 88 characters)
Line 639:89: E501 line too long (93 > 88 characters)
Line 702:89: E501 line too long (110 > 88 characters)
Line 731:89: E501 line too long (110 > 88 characters)
Line 800:89: E501 line too long (100 > 88 characters)
Line 807:89: E501 line too long (90 > 88 characters)
Line 809:89: E501 line too long (110 > 88 characters)
Line 815:89: E501 line too long (115 > 88 characters)
Line 817:89: E501 line too long (89 > 88 characters)
Line 821:89: E501 line too long (133 > 88 characters)

Line 24:89: E501 line too long (106 > 88 characters)
Line 117:89: E501 line too long (106 > 88 characters)

Line 503:89: E501 line too long (125 > 88 characters)
Line 1127:89: E501 line too long (94 > 88 characters)
Line 1176:89: E501 line too long (94 > 88 characters)

Comment last updated at 2024-02-06 21:24:10 UTC

@@ -195,17 +195,18 @@ def create_external_file_handle(syn, path, mimetype=None, md5=None, file_size=No
url = as_url(os.path.expandvars(os.path.expanduser(path)))
if is_url(url):
parsed_url = urllib_parse.urlparse(url)
if parsed_url.scheme == "file" and os.path.isfile(parsed_url.path):
actual_md5 = md5_for_file(parsed_url.path).hexdigest()
parsed_path = file_url_to_path(url)
BryanFauble (Contributor, Author) commented:

This is a Windows-specific bug that I found with the integration tests I wrote for my new logic. The parsed_url.path was showing up as /c:/WINDOWS/asdf.txt, which prevented isfile and all subsequent code from getting the correct data.

No errors were showing up, but that means we haven't been storing the MD5 checksum or file size in Synapse on Windows for any local files.
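
For reference, a minimal sketch (not taken from the PR) of the failure mode and the conversion that fixes it. The stdlib url2pathname is used here to stand in for the synapseclient file_url_to_path helper referenced in the diff above:

import os
import urllib.parse
import urllib.request

url = "file:///c:/WINDOWS/asdf.txt"  # hypothetical local file URL on Windows

parsed_url = urllib.parse.urlparse(url)
print(parsed_url.path)                  # '/c:/WINDOWS/asdf.txt' - leading slash, not a valid Windows path
print(os.path.isfile(parsed_url.path))  # False on Windows, so the MD5 and file size were silently skipped

parsed_path = urllib.request.url2pathname(parsed_url.path)
print(parsed_path)                      # 'c:\WINDOWS\asdf.txt' on Windows
print(os.path.isfile(parsed_path))      # True when the file actually exists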

@@ -180,7 +180,8 @@ jobs:
export EXTERNAL_S3_BUCKET_AWS_ACCESS_KEY_ID="${{secrets.EXTERNAL_S3_BUCKET_AWS_ACCESS_KEY_ID}}"
export EXTERNAL_S3_BUCKET_AWS_SECRET_ACCESS_KEY="${{secrets.EXTERNAL_S3_BUCKET_AWS_SECRET_ACCESS_KEY}}"
if [ ${{ steps.otel-check.outputs.run_opentelemetry }} == "true" ]; then
export SYNAPSE_OTEL_INTEGRATION_TEST_EXPORTER="otlp"
# Set to 'otlp' to enable OpenTelemetry
BryanFauble (Contributor, Author) commented:

After inactivity or some amount of time, the service catalog instance we set up with Jaeger seems to stop. This doesn't prevent integration tests from running; however, it creates more noise when it prints that it cannot export the trace data.

Leaving this off for now; we can flip it on when we want the data.

thomasyu888 (Member) left a comment:

🔥 This looks awesome! Just some comments.

(Resolved review threads: synapseclient/models/file.py, synapseclient/models/project.py)
from synapseclient.models import File, Folder, Project, Table


async def store_entity_components(
Member commented:

Is there a place that collects all the failed async calls? For example, let's say we're storing 10,000 files; it'd be nice if the process doesn't stop because 2 files fail to upload.

BryanFauble (Contributor, Author) replied:

Currently the behavior is that when an exception is raised (see the sample script below), the futures that have yet to be executed are cancelled and do not finish. This means it is as you are saying: if the upload fails for file 003 of the 10,000 files, we raise the exception and cancel the rest of the upload.

import asyncio
import os
import uuid

from synapseclient.models import File, Folder
from datetime import date, datetime, timedelta, timezone
import synapseclient

PROJECT_ID = "syn52948289"

syn = synapseclient.Synapse(debug=True)
syn.login()


def create_random_file(
    path: str,
) -> None:
    """Create a random file with random data.

    :param path: The path to create the file at.
    """
    with open(path, "wb") as f:
        f.write(os.urandom(1))


async def store_file():
    # Cleanup synapse for previous runs - Does not delete local files/directories:
    script_file_folder = Folder(name="file_script_folder", parent_id=PROJECT_ID)
    if not os.path.exists(os.path.expanduser("~/temp/myNewFolder")):
        os.mkdir(os.path.expanduser("~/temp/myNewFolder"))
    # Hack to get the ID as Folder does not support get by name/id yet
    await script_file_folder.store()
    await script_file_folder.delete()
    await script_file_folder.store()

    # Creating annotations for my file ==================================================
    annotations_for_my_file = {
        "my_single_key_string": "a",
        "my_key_string": ["b", "a", "c"],
        "my_key_bool": [False, False, False],
        "my_key_double": [1.2, 3.4, 5.6],
        "my_key_long": [1, 2, 3],
        "my_key_date": [date.today(), date.today() - timedelta(days=1)],
        "my_key_datetime": [
            datetime.today(),
            datetime.today() - timedelta(days=1),
            datetime.now(tz=timezone(timedelta(hours=-5))),
            datetime(2023, 12, 7, 13, 0, 0, tzinfo=timezone(timedelta(hours=0))),
            datetime(2023, 12, 7, 13, 0, 0, tzinfo=timezone(timedelta(hours=-7))),
        ],
    }

    # 1. Creating a file =================================================================
    files_to_upload = []
    for i in range(10):
        name_of_file = f"file_script_my_file_with_random_data_{uuid.uuid4()}.txt"
        path_to_file = os.path.join(os.path.expanduser("~/temp"), name_of_file)
        create_random_file(path_to_file)
        file = File(
            path=path_to_file,
            annotations=annotations_for_my_file,
            parent_id=script_file_folder.id,
            description="This is a file with random data.",
        )
        files_to_upload.append(file)
    files_to_upload.append(
        File(
            path="~/temp/myNewFolder/file_that_does_not_exist.txt",
            parent_id=script_file_folder.id,
        )
    )

    script_sub_folder = Folder(
        name="file_script_sub_folder", parent_id=script_file_folder.id
    )
    script_sub_folder.files = files_to_upload
    await script_sub_folder.store()


asyncio.run(store_file())

Are you thinking that instead of this, the default behavior is that we log the exception/stack trace but let the rest of the program execute until it finishes? Do you have any thoughts about the behavior in this case as it relates to filling back in the failed File instance? If 10k files are being uploaded and 100 fail, I don't want to sift through console logs as the only way to tell which files failed. My initial (not thought out) solution is to stash an attribute on the File that notes the last interaction with Synapse. Something like:

InteractionStatus:
    successful: bool
    exception: Exception
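
For reference, a minimal sketch (not from this PR) of the "log and continue" behavior being discussed, assuming File.store() is awaitable as in the sample script above. asyncio.gather(return_exceptions=True) keeps the remaining uploads running and lets the caller collect the per-file failures afterwards:

import asyncio
from typing import List, Tuple

from synapseclient.models import File


async def store_all(files: List[File]) -> List[Tuple[File, Exception]]:
    """Store every file, collecting failures instead of stopping at the first one."""
    # return_exceptions=True returns each failure as an Exception instance in the
    # results list rather than cancelling the tasks that have not yet run.
    results = await asyncio.gather(
        *(file.store() for file in files), return_exceptions=True
    )
    # Pair each failed File back up with the exception it raised.
    return [
        (file, result)
        for file, result in zip(files, results)
        if isinstance(result, Exception)
    ]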

BryanFauble (Contributor, Author) commented:

cc @BWMac @jaymedina @danlu1 Any thoughts on the behavior of this piece?

Contributor commented:

I agree that it would be annoying if one failure out of 10,000 prevented the rest of the uploads. This is a pattern we often encounter with Nextflow workflows.

My initial thought is to do something like @BryanFauble suggests, with a status updated upon success/failure, or to have some way to optionally return the files that failed to upload in a data structure. The latter would let a user retry in the same script by iterating back through the failed files, and you could decide how many times to retry before giving up on persistently failing uploads.

BryanFauble (Contributor, Author) commented on Feb 5, 2024:

@BWMac @thomasyu888
Maybe for methods that kick off bulk actions (think Project.store(), Folder.store()) we could have an argument you can pass into store, like:

from enum import Enum


class FailureStrategy(Enum):
    """
    When storing a large number of items through bulk actions like `Project.store()` or 
    `Folder.store()` individual failures may occur. Passing this ENUM will allow you to 
    define how you want to respond to failures.
    """

    RAISE_EXCEPTION = "RAISE_EXCEPTION"
    """An exception is raised on the first failure and all tasks yet to be completed 
    are cancelled."""
    
    LOG_EXCEPTION = "LOG_EXCEPTION"
    """An exception is logged and all tasks yet to be completed continue to be 
    processed."""
    
    RETURN_INTERACTION_FAILURE = "RETURN_INTERACTION_FAILURE"
    """
    When a File or Folder fails to be stored in Synapse a `synchronization_failure` 
    exception will be stored on the failed File or Folder instance. Use the `.failures` 
    property to retrieve a generator that yields a named tuple of:
    {
        "exception": Exception,
        "entity": Union[File, Folder],
        "parent": Union[Project, Folder],
    }
    This will also log the exceptions as they occur.
    """

    CALLBACK = "CALLBACK"
    """
    Call back to a user defined function when a failure occurs. The callback function
    should take in a single argument which will be a named tuple of:
    {
        "exception": Exception,
        "entity": Union[File, Folder],
        "parent": Union[Project, Folder],
    }
    """

BryanFauble (Contributor, Author) commented:

Also, in a first iteration we could provide only RAISE_EXCEPTION or LOG_EXCEPTION.

Member commented:

I like this idea, we can add a ticket for this.

BryanFauble (Contributor, Author) commented:

I'll plan to include this with the changes in https://sagebionetworks.jira.com/browse/SYNPY-1415


sonarcloud bot commented Feb 6, 2024

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

5 New issues
0 Security Hotspots
94.3% Coverage on New Code
27.3% Duplication on New Code

See analysis details on SonarCloud

BWMac (Contributor) left a comment:

LGTM, I was able to run the POC script successfully

BryanFauble merged commit 5ea1b7f into develop on Feb 7, 2024
38 checks passed
BryanFauble deleted the SYNPY-1416-file-model branch on February 7, 2024 16:48