Conversation

Contributor

@attila-papai attila-papai commented Mar 7, 2024

This PR adds support for upserting data rows to a dataset, meaning the provided data row specifications either create new data rows or update existing ones.

Please see the added unit tests for example usages of this new functionality.

Contributor

@vbrodsky vbrodsky left a comment


I have a question regarding Task/returns.

A few months ago, when we were discussing upserts, we talked about returning results, specifically successes and failures (and warnings? partial successes?). I think there was an issue with how create_data_rows currently works, but I don't remember all the details. One of them was about how many data rows we return errors for, i.e. whether there is a limit. Perhaps Matt C. might recall better. Is it within the scope of the current implementation to review the task return so that it supports an 'infinite' number of rows?

name: Optional[str]


class DataRowSpec(BaseModel):
Contributor


I'm not sure we should include these models here: they are only used for upsert and have very generic names, which might create confusion about when and why to use them. IMO we do not want to set any new standards as part of this change.

We can think on this more, but we most likely want to get buy-in from code owners and/or MLSE.

Contributor Author

@attila-papai attila-papai Mar 25, 2024


In the follow-up refactor ticket, I was going to reuse this for the create methods.
IMO we should start using types to make this SDK library easier to use. Without them, clients need to read either the docs or the underlying code itself, especially when we use *args and **kwargs, which makes it even harder to understand which variables are supported.
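To illustrate the point about types, here is a minimal sketch using stdlib dataclasses (the PR itself uses Pydantic models, and the names here are hypothetical, not the SDK's actual API): a typed spec makes the accepted fields discoverable to tooling in a way *args/**kwargs cannot:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AttachmentSpec:
    type: str
    value: str

@dataclass
class DataRowSpec:
    row_data: str
    global_key: Optional[str] = None
    attachments: List[AttachmentSpec] = field(default_factory=list)

def upsert_data_rows(specs: List[DataRowSpec]) -> List[str]:
    # A type checker (and the IDE) can now validate every call site;
    # with **kwargs, the supported fields are invisible to tooling.
    return [spec.row_data for spec in specs]
```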

Contributor

@mrobers1982 mrobers1982 Mar 25, 2024


Can you provide more details re: what you plan to refactor?

I'm worried the scope is too large: you mention refactoring create_data_rows, but you also need to be mindful that we do not break existing implementations, as that would seemingly imply a major version upgrade to the SDK.

Personally, I believe we should aim our sights lower with a minimal integration, minimizing changes to existing patterns.

Per the above, if you want to continue down this path we likely need to get buy-in from the SDK team, as they would likely own this work.

Contributor


I'd love to have types in all these methods, but I think Matt's right; it's not trivial to add typing to these methods without breaking backward compatibility, and until we do that, we'd have two similar methods taking arguments in different formats.

Since we can't add typing to other methods immediately, we should aim for consistency for now. If we do upgrade to typed arguments, we can do it in all the relevant methods at once, in a single PR, and announce it in patch notes for the relevant SDK version.

Could you file a ticket (if we don't have one already) for adding types here and put it in backlog so this doesn't get lost?

def _create_descriptor_file(self,
items,
max_attachments_per_data_row=None,
is_upsert=False):
Contributor


I understand you wanted to keep it DRY, but the code is arguably more complex as a result.

Contributor Author


I will address this in the follow-up work.

f"Cannot upsert more than {MAX_DATAROW_PER_API_OPERATION} DataRows per function call."
)

class ManifestFile:
Contributor


I understand functions within functions is a pattern already used in this file, but I would avoid it unless there's no viable alternative: there's a performance penalty, not to mention it makes the code more difficult to read.

Contributor Author

@attila-papai attila-papai Mar 25, 2024


I didn't want to use a class in the first place; I'll sync up with @vbrodsky.
I'm going to remove it for now.

Contributor


My comment was not so much about the class as the functions within functions; personally, I would extract those to avoid the performance penalty, as I don't think you need the closure.

Contributor


I agree, purely on readability grounds: there's no need to nest the function instead of making it a private function if it doesn't use anything from the outer scope.

Or even just don't have a function for this at all; it's defined inline right before its only use, so it isn't encapsulating anything or reducing line count.
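A small illustration of the suggestion above (names invented for the example, not the PR's actual code): the nested helper is re-created as a new function object on every call, while the extracted private function is defined once and needs no closure:

```python
# Nested: _convert is rebuilt on every call to convert_items_nested.
def convert_items_nested(items):
    def _convert(item):
        return {"row_data": item}
    return [_convert(i) for i in items]

# Extracted: defined once at module scope. Nothing from the enclosing
# scope is captured, so the closure buys nothing here.
def _convert_item(item):
    return {"row_data": item}

def convert_items(items):
    return [_convert_item(i) for i in items]
```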

@attila-papai attila-papai changed the title [WIP] add dataset.upsert_data_rows method Add dataset.upsert_data_rows method Mar 25, 2024
@attila-papai attila-papai changed the title Add dataset.upsert_data_rows method [PLT-344] Add dataset.upsert_data_rows method Mar 25, 2024
@attila-papai attila-papai marked this pull request as ready for review March 25, 2024 12:45
@attila-papai attila-papai requested a review from a team as a code owner March 25, 2024 12:45
created_by (Relationship): `ToOne` relationship to User
organization (Relationship): `ToOne` relationship to Organization
"""
__upsert_chunk_size: Final = 10_000
Contributor


How was this number chosen?

Contributor


nit: In Python, constants are typically written in all capital letters.

Contributor

@mrobers1982 mrobers1982 left a comment


A couple minor things, but almost there - thanks for taking care of this!


>>> {"name": "tag", "value": "tag value"},
>>> ]
>>> },
>>> # update existing data row by global key
Contributor


While it does not look like the code supports this case, I would not give an end user multiple ways of doing the same thing; it will only create confusion.

f"Cannot upsert more than {MAX_DATAROW_PER_API_OPERATION} DataRows per function call."
)

def _convert_items_to_upsert_format(_items):
Contributor


There was an ask to remove the nested functions, can you take care of that too?

def _upload_chunk(_chunk):
return self._create_descriptor_file(_chunk, is_upsert=True)

file_upload_thread_count = 20
Contributor


Personally, I would expose this value in the method signature and provide a reasonable default.
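A hedged sketch of the suggestion, with a stand-in for the real descriptor-file upload (names illustrative): the thread count becomes a keyword parameter whose default preserves the current hard-coded value of 20:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def _upload_chunk(chunk: list) -> int:
    # Stand-in for the real upload; returns the chunk size so the
    # example has an observable result.
    return len(chunk)

def upsert_chunks(chunks: List[list],
                  file_upload_thread_count: int = 20) -> List[int]:
    # Exposing the pool size lets callers tune parallelism for their
    # environment; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=file_upload_thread_count) as pool:
        return list(pool.map(_upload_chunk, chunks))
```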

key = {'type': 'AUTO', 'value': ''}
elif isinstance(item['key'], UniqueId):
key = {'type': 'ID', 'value': item['key'].key}
del item['key']
Contributor


nit: When setting the value in the line above, you can also use pop to retrieve the value and delete the key from the dictionary.
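The nit in concrete form (the item shape is illustrative): dict.pop returns the value and removes the key in one step, replacing the read-then-del pair:

```python
item = {"key": "some-id", "row_data": "https://example.com/a.png"}

# One step instead of reading item['key'] and then del item['key']:
key = {"type": "ID", "value": item.pop("key")}
```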

Contributor

@mrobers1982 mrobers1982 left a comment


Let's ultimately wait to merge until we have an approval from @sfendell-labelbox or @vbrodsky.

@mrobers1982
Contributor

Let's potentially hold off on merging this change if product wants us to move the implementation from Dataset to Client.

https://labelbox.atlassian.net/browse/PLT-111

@mrobers1982
Contributor

mrobers1982 commented Mar 27, 2024

> Let's potentially hold off on merging this change if product wants us to move the implementation from Dataset to Client.
>
> https://labelbox.atlassian.net/browse/PLT-111

Actually, let's assume we are moving to the Client and make these changes.

Product wants to get feedback from stakeholders before we make this change.

@mrobers1982
Copy link
Contributor

> Let's potentially hold off on merging this change if product wants us to move the implementation from Dataset to Client.
> https://labelbox.atlassian.net/browse/PLT-111
>
> Actually, let's assume we are moving to the Client and make these changes.
>
> Product wants to get feedback from stakeholders before we make this change.

Product wants to leave as-is for now - https://labelbox.atlassian.net/browse/PLT-111?focusedCommentId=192239.

We can go ahead and merge if everyone is happy with the current implementation.

@mrobers1982 mrobers1982 merged commit 9225f08 into develop Apr 1, 2024
@mrobers1982 mrobers1982 deleted the attila/PLT-344-upsert-data-rows branch April 1, 2024 16:24


7 participants