[PLT-344] Add dataset.upsert_data_rows method #1460
Conversation
vbrodsky
left a comment
I have a question regarding Task/returns.
A few months ago, when we were discussing upserts, we talked about returning results, specifically successes and failures (and warnings? partial successes?). I think there was an issue with how create_data_rows currently works, but I don't remember all the details. One of them was about how many data rows we return errors for, i.e. whether there is a limit. Perhaps Matt C. might recall better. Is it within the scope of the current implementation to review the Task return so it supports an 'infinite' number of rows?
# Conflicts: # labelbox/schema/asset_attachment.py
labelbox/schema/data_row.py
Outdated
name: Optional[str]

class DataRowSpec(BaseModel):
I'm not sure we should include these models here: they are only used for upsert and have very generic names, which might lead to confusion about when to use them and why. IMO we do not want to set any new standards as part of this change.
We can think on this more but we most likely want to get buy-in from code-owners and/or MLSE.
In the follow-up refactor ticket, I was gonna reuse this for the create methods.
IMO we'd better start using types to make this SDK library easier to use. Without them, clients need to either read the docs or the underlying code itself, especially when we use *args and **kwargs, which makes it even harder to understand what variables are supported.
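For illustration, the discoverability argument looks like this. The PR's DataRowSpec extends pydantic's BaseModel; a plain dataclass is used here only to keep the sketch dependency-free, and the field names are hypothetical, not the SDK's actual spec.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical typed spec: field names are illustrative. A typed model
# surfaces the supported fields in editors and fails fast on bad input,
# unlike an opaque *args/**kwargs signature.
@dataclass
class UpsertDataRowSpec:
    row_data: str
    global_key: Optional[str] = None
    external_id: Optional[str] = None

spec = UpsertDataRowSpec(row_data="https://example.com/image.png",
                         global_key="gk-1")
```
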
Can you provide more details re: what you plan to refactor?
I'm worried the scope is too large: you mention refactoring create_data_rows, but you also need to be mindful that we do not break existing implementations, as that would seemingly imply a major upgrade to the SDK.
Personally, I believe we should aim our sights lower with a minimal integration, minimizing changes to existing patterns.
Per above, if you want to continue down this path we likely need to get buy-in from the SDK team, as they would likely own this work.
I'd love to have types in all these methods, but I think Matt's right; it's not trivial to add typing to these methods without breaking backward compatibility, and until we do that, we'd have 2 similar methods taking in arguments in different formats.
Since we can't add typing to other methods immediately, we should aim for consistency for now. If we do upgrade to typed arguments, we can do it in all the relevant methods at once, in a single PR, and announce it in patch notes for the relevant SDK version.
Could you file a ticket (if we don't have one already) for adding types here and put it in backlog so this doesn't get lost?
def _create_descriptor_file(self,
                            items,
                            max_attachments_per_data_row=None,
                            is_upsert=False):
I understand you wanted to keep it DRY, but the code is arguably more complex as a result.
I will address this in the follow-up work.
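One possible shape for that follow-up, sketched with hypothetical names (this is not the SDK's actual API): keep the shared serialization in a private helper and expose two thin, intention-revealing entry points, rather than threading an is_upsert flag through one large function.

```python
# Shared logic lives in one place; the flag only toggles what truly differs.
def _serialize_items(items, upsert):
    return [{"item": it, "upsert": upsert} for it in items]

# Two small public wrappers keep the call sites self-explanatory.
def create_descriptor_file(items):
    return _serialize_items(items, upsert=False)

def upsert_descriptor_file(items):
    return _serialize_items(items, upsert=True)
```
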
labelbox/schema/dataset.py
Outdated
| f"Cannot upsert more than {MAX_DATAROW_PER_API_OPERATION} DataRows per function call." | ||
| ) | ||
|
|
||
| class ManifestFile: |
I understand functions within functions is a pattern already used in this file, but I would avoid it unless there's no viable alternative: there's a performance penalty, and it makes the code more difficult to read.
I didn't want to use a class in the first place. Sync up with @vbrodsky
I'm gonna remove it for now.
My comment was not so much about the class as about the functions within functions; personally, I would extract those to avoid the performance penalty, as I do not think you need the closure.
I agree, solely on readability grounds: there's no need to nest the function instead of making it a private func if it doesn't use anything from the outer scope.
Or even just not have a function for this at all; it's defined inline right before its only use, so it's not encapsulating anything or reducing line count.
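The extraction being suggested looks roughly like this (names are illustrative, not the SDK's real code): a function nested inside a method is re-created on every call and captures the enclosing scope; if it needs nothing from that scope, a module-level private helper avoids both.

```python
# After extraction: the helper takes everything it needs explicitly,
# so no closure over the method's locals is required.
def _convert_items_to_upsert_format(items):
    return [dict(it) for it in items]

class Dataset:
    def upsert_data_rows(self, items):
        # Previously a nested `def` was defined inline right here.
        return _convert_items_to_upsert_format(items)
```
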
…/labelbox-python into attila/PLT-344-upsert-data-rows
    created_by (Relationship): `ToOne` relationship to User
    organization (Relationship): `ToOne` relationship to Organization
"""
__upsert_chunk_size: Final = 10_000
how was this number chosen?
nit: In Python, constants are typically written in all capital letters.
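A minimal sketch of both points in this thread, assuming nothing about the SDK internals: an all-caps constant per the PEP 8 naming convention, and slice-based chunking. The 10_000 mirrors the value in the diff; whether it is API-mandated or empirical is the open question above.

```python
from typing import Final

# PEP 8 style for a module-level constant (vs. __upsert_chunk_size).
UPSERT_CHUNK_SIZE: Final = 10_000

def chunk(items, size=UPSERT_CHUNK_SIZE):
    # Split the item list into consecutive slices of at most `size` rows.
    return [items[i:i + size] for i in range(0, len(items), size)]
```
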
mrobers1982
left a comment
A couple minor things, but almost there - thanks for taking care of this!
labelbox/schema/dataset.py
Outdated
| >>> {"name": "tag", "value": "tag value"}, | ||
| >>> ] | ||
| >>> }, | ||
| >>> # update existing data row by global key |
While it does not look like the code supports this case, I would not give an end user multiple ways of doing the same thing; it will only create confusion.
labelbox/schema/dataset.py
Outdated
| f"Cannot upsert more than {MAX_DATAROW_PER_API_OPERATION} DataRows per function call." | ||
| ) | ||
|
|
||
| def _convert_items_to_upsert_format(_items): |
There was an ask to remove the nested functions, can you take care of that too?
labelbox/schema/dataset.py
Outdated
def _upload_chunk(_chunk):
    return self._create_descriptor_file(_chunk, is_upsert=True)

file_upload_thread_count = 20
Personally, I would expose this value in the method signature and provide a reasonable default.
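The suggestion, sketched with illustrative names (the real method is an instance method on Dataset, and the upload callable here is a stand-in): surface the worker count as a keyword argument whose default is the current hard-coded value.

```python
from concurrent.futures import ThreadPoolExecutor

def _upload_chunk(chunk):
    # Stand-in for the real descriptor-file upload; returns chunk size.
    return len(chunk)

def upsert_data_rows(chunks, file_upload_thread_count: int = 20):
    # Caller can tune parallelism; behavior is unchanged by default.
    with ThreadPoolExecutor(max_workers=file_upload_thread_count) as pool:
        return list(pool.map(_upload_chunk, chunks))
```
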
labelbox/schema/dataset.py
Outdated
    key = {'type': 'AUTO', 'value': ''}
elif isinstance(item['key'], UniqueId):
    key = {'type': 'ID', 'value': item['key'].key}
del item['key']
nit: when setting the value in the line above, you can use pop to retrieve the value and delete the key from the dictionary in one step.
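Concretely (simplified: in the actual code item['key'] is a UniqueId wrapper with a .key attribute, reduced to a plain string here):

```python
item = {'key': 'abc', 'row_data': 'https://example.com/img.png'}

# Before:
#   key = {'type': 'ID', 'value': item['key']}
#   del item['key']
# After: pop reads the value and removes the entry in one call.
key = {'type': 'ID', 'value': item.pop('key')}
```
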
mrobers1982
left a comment
Let's ultimately wait to merge until we have an approval from @sfendell-labelbox or @vbrodsky.
Let's potentially hold off on merging this change if product wants us to move the implementation from
Product wants to get feedback from stakeholders before we make this change.
Product wants to leave as-is for now - https://labelbox.atlassian.net/browse/PLT-111?focusedCommentId=192239. We can go ahead and merge if everyone is happy with the current implementation.
This PR adds support for upserting data rows to a dataset, meaning the provided data row specifications either create new data rows or update existing ones.
Please see the added unit tests for example usages of this new functionality.
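Pieced together from the docstring and diff snippets in this PR (not verified against the released SDK), the core dispatch is per-item: a spec without a key creates a new data row, while a spec keyed by an identifier such as UniqueId updates an existing one. A self-contained sketch of that split, with UniqueId reduced to a tiny stand-in:

```python
class UniqueId:
    # Stand-in for labelbox's identifier wrapper seen in the diff above.
    def __init__(self, key):
        self.key = key

def split_upsert_items(items):
    """Hypothetical helper: partition specs into creates and updates."""
    creates = [it for it in items if 'key' not in it]
    updates = [it for it in items if 'key' in it]
    return creates, updates

creates, updates = split_upsert_items([
    {'row_data': 'https://example.com/new.png'},                      # create
    {'key': UniqueId('cl123'), 'row_data': 'https://example.com/x'},  # update
])
```
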