-
Notifications
You must be signed in to change notification settings - Fork 72
[SYNPY-1632] Deprecate tables from the Synapse class and table.py module #1233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
list_columns = [] | ||
dtype = {} | ||
|
||
for select_column in self.headers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to deal with self.headers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
headers
for this logic is coming from the response of this API:
https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/table/DownloadFromTableResult.html
We shouldn't need to store this data onto the Table-like class, so once you get the DownloadFromTableResult, you should be able to pass the result everywhere it's needed like this asDataFrame
method. It shouldn't need to be exposed to the user querying for data on the Synapse Tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the current code and it looks like headers go through two transformations:
-
In this section, the headers from DownloadFromTableResult (initially dictionaries) are converted into SelectColumn objects.
-
Later, in this section and this one, the column headers are transformed again.
If we’re not storing this information in CsvResult (or CsvFileTable), do we still need to convert headers into SelectColumn objects? Or are we moving away from SelectColumn entirely and planning to just treat headers as dictionaries?
To match our current behavior, I have:
class CsvResult:
def __init__(self, file_path, include_row_id_and_row_version=True):
self.file_path = file_path
self.include_row_id_and_row_version = include_row_id_and_row_version
if result and result.get("headers"):
headers = result.get("headers")
headers = [SelectColumn(**header) for header in headers]
self.headers = self.set_column_headers(headers)
else:
self.headers = None
Based on your suggestion, it sounds like we don’t need self.headers at all. Instead, in asDataFrame we could just do something like:
if result.get("headers") is not None:
headers = result["headers"]
for column in headers:
xxx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing that we need to return back to the user in the case that they call
query_async
-> The DataFrame, OR the path to the downloaded CSV
query_part_mask_async
-> The results wrapped in the QueryResultBundle
object
We do not have to maintain how the previous code worked at all for any of the intermediate objects or class types.
With that being said, in our case:
If we’re not storing this information in CsvResult (or CsvFileTable), do we still need to convert headers into SelectColumn objects? Or are we moving away from SelectColumn entirely and planning to just treat headers as dictionaries?
We do not need to maintain a CsvResult
, SelectColumn
, or any of the concepts the original table had implemented. In fact - The new Tables class also got rid of the SchemaBase
and inheritance structure that was previous in place.
Based on your suggestion, it sounds like we don’t need self.headers at all. Instead, in asDataFrame we could just do something like:
if result.get("headers") is not None:
headers = result["headers"]
for column in headers:
xxx
That is exactly right, we could do something like that.
In the end - If we are not exposing the interface to an end user, we have quite a bit more flexibility with how we need to maintain the "guts", or actual implementation of the function/method. However - Consider this, if it makes our lives easier for development, then there is no harm in creating dataclasses/classes for ourselves.
…ter match the original function
…ble_next_page to return QueryResultBundle; add test for _query_table_next_page
…ake sure the output type align with the async counterpart
… in QueryJob to align with synapse doc;
…with to_synapse_request
"quoteCharacter": self.quote_character, | ||
"escapeCharacter": self.escape_character, | ||
"lineEnd": self.line_end, | ||
"isFirstLineHeader": self.is_file_line_header, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a typo in the original CsvTableDescriptor
class. It should be is_first_line_header
headers: Optional[List[SelectColumn]] = None | ||
"""The list of SelectColumns that describes the rows of this set.""" | ||
|
||
rows: Optional[List[Row]] = field(default_factory=list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize this as a list per conversation
query_job_request = QueryJob( | ||
entity_id=entity_id, | ||
sql=query, | ||
write_header=header, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition of write_header
based on the documentaiton of DownloadFromTableRequest
is: Should the first line contain the columns names as a header in the resulting file? Set to 'true' to include the headers else, 'false'. The default value is 'true'.
There's also a isFirstLineHeader
parameter under CsvTableDescriptor. I think both parameters mean the same thing. As you can see here, I have both: is_first_line_header=header,
and is_first_line_header=header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 LGTM! Im going to defer to @BryanFauble for final review, but thanks for doing the giant deprecation.
Will there be a tutorial page we are going to update to use the new functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR deprecates multiple methods from the Synapse class and table.py module while introducing new table component data classes as part of modernizing the table querying API. The changes map deprecated methods to new implementations and provide comprehensive test coverage for the new functionality.
- Deprecated 11 methods from synapseclient.client.Synapse class related to table operations
- Added 11 new data classes for structured table operations: SumFileSizes, QueryResultOutput, Row, RowSet, SelectColumn, ActionRequiredCount, Query, QueryResult, QueryResultBundle, QueryNextPageToken, QueryJob, QueryBundleRequest
- Migrated table query functionality to use new async-based implementations
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
tests/unit/synapseclient/mixins/unit_test_table_components.py | Added comprehensive unit tests for all new data classes and query functions |
tests/integration/synapseclient/models/synchronous/test_table.py | Updated test expectations to reflect changes in batch processing behavior |
tests/integration/synapseclient/models/async/test_table_async.py | Updated test expectations and refined spy behavior for async table operations |
tests/integration/synapseclient/models/async/test_entityview_async.py | Enhanced test to capture call stack information for better verification |
synapseclient/table.py | Added deprecation warnings to row_labels functions |
synapseclient/models/table_components.py | Added 11 new data classes with complete REST API mappings and type conversions |
synapseclient/models/mixins/table_components.py | Implemented new query functions and converted existing methods to use new data structures |
synapseclient/models/mixins/asynchronous_job.py | Added endpoint mappings for new query request types |
synapseclient/models/init.py | Exported new data classes for public API |
synapseclient/core/constants/concrete_types.py | Added concrete type constants for new REST API models |
synapseclient/client.py | Added deprecation decorators and migration examples to 11 deprecated methods |
docs/reference/experimental/sync/table.md | Added documentation references for new data classes |
docs/reference/experimental/async/table.md | Added documentation references for new data classes |
.pre-commit-config.yaml | Updated bandit version from 1.7.5 to 1.8.0 |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
This result is modeled from: <https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/table/QueryResultBundle.html> | ||
""" | ||
|
||
concrete_type: str = QUERY_TABLE_CSV_REQUEST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default concrete_type for QueryResultBundle should be the bundle type, not the CSV request type. This should be QUERY_BUNDLE_REQUEST
or a dedicated query result bundle constant, not QUERY_TABLE_CSV_REQUEST
.
concrete_type: str = QUERY_TABLE_CSV_REQUEST | |
concrete_type: str = QUERY_BUNDLE_REQUEST |
Copilot uses AI. Check for mistakes.
Yes, https://sagebionetworks.jira.com/browse/SYNPY-1377 should capture the work to write up the tutorial page. @linglp Do you mind reviewing this jira and add onto the topic for anything that we should cover in addition to whats there (if anything), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate all your hard work that you've put into this!
I appreciate your review. Thank you again, Bryan! |
Problem:
Deprecate the following:
Solution:
Map the following to new methods:
New data classes:
SumFileSizes
QueryResultOutput (created by using the old
QueryBundleRequest
)Row
RowSet
SelectColumn
ActionRequiredCount
Query
QueryResult
QueryResultBundle
QueryNextPageToken
QueryJob
QueryBundleRequest
Testing:
_query_table_row_set
_query_table_next_page
_query_table_csv