[PLT-344] Add dataset.upsert_data_rows method #1460
Conversation
vbrodsky
left a comment
I have a question regarding Task/returns.
A few months ago, when we were discussing upserts, we talked about returning results, specifically successes and failures (and warnings? partial successes?). I think there was an issue with how create_data_rows currently works, but I don't remember all the details. One of them was about how many data rows we return errors for, i.e. whether there is a limit. Perhaps Matt C. might recall better. Is it within the scope of the current implementation to review the Task return so it supports an 'infinite' number of rows?
# Conflicts: # labelbox/schema/asset_attachment.py
labelbox/schema/data_row.py
Outdated
name: Optional[str]

class DataRowSpec(BaseModel):
I'm not sure we should include these models here: they are only used for upsert and have very generic names, which might lead to confusion about when to use them and why. IMO we do not want to set any new standards as part of this change.
We can think on this more but we most likely want to get buy-in from code-owners and/or MLSE.
In the follow-up refactor ticket, I was gonna reuse this for the create methods.
IMO we'd better start using types to make this SDK library easier to use. Without them, clients need to either read the docs or the underlying code itself, especially when we use *args and **kwargs, which makes it even harder to understand what variables are supported.
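For illustration, the discoverability argument looks like this. The PR's DataRowSpec extends pydantic's BaseModel; a plain dataclass is used here only to keep the sketch dependency-free, and the field names are hypothetical, not the SDK's actual spec.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical typed spec: field names are illustrative. A typed model
# surfaces the supported fields in editors and fails fast on bad input,
# unlike an opaque *args/**kwargs signature.
@dataclass
class UpsertDataRowSpec:
    row_data: str
    global_key: Optional[str] = None
    external_id: Optional[str] = None

spec = UpsertDataRowSpec(row_data="https://example.com/image.png",
                         global_key="gk-1")
```
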
Can you provide more details re: what you plan to refactor?
I'm worried the scope is too large: you mention refactoring create_data_rows, but you also need to be mindful that we do not break existing implementations, as that would seemingly imply a major upgrade to the SDK.
Personally, I believe we should aim our sights lower with a minimal integration, minimizing changes to existing patterns.
Per above, if you want to continue down this path we likely need to get buy-in from the SDK team, as they would likely own this work.
I'd love to have types in all these methods, but I think Matt's right; it's not trivial to add typing to these methods without breaking backward compatibility, and until we do that, we'd have 2 similar methods taking in arguments in different formats.
Since we can't add typing to other methods immediately, we should aim for consistency for now. If we do upgrade to typed arguments, we can do it in all the relevant methods at once, in a single PR, and announce it in patch notes for the relevant SDK version.
Could you file a ticket (if we don't have one already) for adding types here and put it in backlog so this doesn't get lost?
def _create_descriptor_file(self,
                            items,
                            max_attachments_per_data_row=None,
                            is_upsert=False):
I understand you wanted to keep it DRY, but the code is arguably more complex as a result.
I will address this in the follow-up work.
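One possible shape for that follow-up, sketched with hypothetical names (this is not the SDK's actual API): keep the shared serialization in a private helper and expose two thin, intention-revealing entry points, rather than threading an is_upsert flag through one large function.

```python
# Shared logic lives in one place; the flag only toggles what truly differs.
def _serialize_items(items, upsert):
    return [{"item": it, "upsert": upsert} for it in items]

# Two small public wrappers keep the call sites self-explanatory.
def create_descriptor_file(items):
    return _serialize_items(items, upsert=False)

def upsert_descriptor_file(items):
    return _serialize_items(items, upsert=True)
```
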
labelbox/schema/dataset.py
Outdated
| f"Cannot upsert more than {MAX_DATAROW_PER_API_OPERATION} DataRows per function call." | ||
| ) | ||
|
|
||
| class ManifestFile: |
I understand functions within functions is a pattern already used in this file, but I would avoid it unless there's no viable alternative: there's a performance penalty, and it makes the code more difficult to read.
I didn't want to use a class in the first place. Sync up with @vbrodsky
I'm gonna remove it for now.
My comment was not so much about the class as about the functions within functions; personally, I would extract those to avoid the performance penalty, as I do not think you need the closure.
I agree, solely on readability grounds: there's no need to nest the function instead of making it a private func if it doesn't use anything from the outer scope.
Or even just not have a function for this at all; it's defined inline right before its only use, so it's not encapsulating anything or reducing line count.
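The extraction being suggested looks roughly like this (names are illustrative, not the SDK's real code): a function nested inside a method is re-created on every call and captures the enclosing scope; if it needs nothing from that scope, a module-level private helper avoids both.

```python
# After extraction: the helper takes everything it needs explicitly,
# so no closure over the method's locals is required.
def _convert_items_to_upsert_format(items):
    return [dict(it) for it in items]

class Dataset:
    def upsert_data_rows(self, items):
        # Previously a nested `def` was defined inline right here.
        return _convert_items_to_upsert_format(items)
```
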
…/labelbox-python into attila/PLT-344-upsert-data-rows
    created_by (Relationship): `ToOne` relationship to User
    organization (Relationship): `ToOne` relationship to Organization
"""
__upsert_chunk_size: Final = 10_000
how was this number chosen?
nit: In Python, constants are typically written in all capital letters.
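A minimal sketch of both points in this thread, assuming nothing about the SDK internals: an all-caps constant per the PEP 8 naming convention, and slice-based chunking. The 10_000 mirrors the value in the diff; whether it is API-mandated or empirical is the open question above.

```python
from typing import Final

# PEP 8 style for a module-level constant (vs. __upsert_chunk_size).
UPSERT_CHUNK_SIZE: Final = 10_000

def chunk(items, size=UPSERT_CHUNK_SIZE):
    # Split the item list into consecutive slices of at most `size` rows.
    return [items[i:i + size] for i in range(0, len(items), size)]
```
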
mrobers1982
left a comment
A couple minor things, but almost there - thanks for taking care of this!
labelbox/schema/dataset.py
Outdated
| >>> {"name": "tag", "value": "tag value"}, | ||
| >>> ] | ||
| >>> }, | ||
| >>> # update existing data row by global key |
While it does not look like the code supports this case, I would not give an end user multiple ways of doing the same thing; it will only create confusion.
labelbox/schema/dataset.py
Outdated
| f"Cannot upsert more than {MAX_DATAROW_PER_API_OPERATION} DataRows per function call." | ||
| ) | ||
|
|
||
| def _convert_items_to_upsert_format(_items): |
There was an ask to remove the nested functions, can you take care of that too?
labelbox/schema/dataset.py
Outdated
def _upload_chunk(_chunk):
    return self._create_descriptor_file(_chunk, is_upsert=True)

file_upload_thread_count = 20
Personally, I would expose this value in the method signature and provide a reasonable default.
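The suggestion, sketched with illustrative names (the real method is an instance method on Dataset, and the upload callable here is a stand-in): surface the worker count as a keyword argument whose default is the current hard-coded value.

```python
from concurrent.futures import ThreadPoolExecutor

def _upload_chunk(chunk):
    # Stand-in for the real descriptor-file upload; returns chunk size.
    return len(chunk)

def upsert_data_rows(chunks, file_upload_thread_count: int = 20):
    # Caller can tune parallelism; behavior is unchanged by default.
    with ThreadPoolExecutor(max_workers=file_upload_thread_count) as pool:
        return list(pool.map(_upload_chunk, chunks))
```
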
labelbox/schema/dataset.py
Outdated
    key = {'type': 'AUTO', 'value': ''}
elif isinstance(item['key'], UniqueId):
    key = {'type': 'ID', 'value': item['key'].key}
del item['key']
nit: when setting the value in the line above, you can use pop to retrieve the value and delete the key from the dictionary in one step.
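Concretely (simplified: in the actual code item['key'] is a UniqueId wrapper with a .key attribute, reduced to a plain string here):

```python
item = {'key': 'abc', 'row_data': 'https://example.com/img.png'}

# Before:
#   key = {'type': 'ID', 'value': item['key']}
#   del item['key']
# After: pop reads the value and removes the entry in one call.
key = {'type': 'ID', 'value': item.pop('key')}
```
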
mrobers1982
left a comment
Let's ultimately wait to merge until we have an approval from @sfendell-labelbox or @vbrodsky.
Let's potentially hold off on merging this change if product wants us to move the implementation from
Product wants to get feedback from stakeholders before we make this change.
Product wants to leave as-is for now - https://labelbox.atlassian.net/browse/PLT-111?focusedCommentId=192239. We can go ahead and merge if everyone is happy with the current implementation.
This PR adds support for upserting data rows to a dataset, meaning the provided data row specifications either create new data rows or update existing ones.
Please see the added unit tests for example usages of this new functionality.
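Pieced together from the docstring and diff snippets in this PR (not verified against the released SDK), the core dispatch is per-item: a spec without a key creates a new data row, while a spec keyed by an identifier such as UniqueId updates an existing one. A self-contained sketch of that split, with UniqueId reduced to a tiny stand-in:

```python
class UniqueId:
    # Stand-in for labelbox's identifier wrapper seen in the diff above.
    def __init__(self, key):
        self.key = key

def split_upsert_items(items):
    """Hypothetical helper: partition specs into creates and updates."""
    creates = [it for it in items if 'key' not in it]
    updates = [it for it in items if 'key' in it]
    return creates, updates

creates, updates = split_upsert_items([
    {'row_data': 'https://example.com/new.png'},                      # create
    {'key': UniqueId('cl123'), 'row_data': 'https://example.com/x'},  # update
])
```
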