Skip to content

feat(csharp/src/Drivers/Databricks): Optimize GetColumnsExtendedAsync via DESC TABLE EXTENDED #2953

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

jackyhu-db
Copy link
Contributor

@jackyhu-db jackyhu-db commented Jun 11, 2025

Optimize GetColumnsExtendedAsync in Databricks via DESC TABLE EXTENDED

Motivation

Currently, HiveServer2Statement.GetColumnsExtendedAsync will make 3 thrift calls GetColumns, GetPrimaryKeys and GetCrossReference to get all the information and then join them into one result set. Now Databricks introduces a SQL DESC TABLE EXTENDED <table> AS JSON that will return all of these information in one query, this can save 2 extra roundtrips and improve the performance.

Description

Override HiveServer2Statement.GetColumnsExtendedAsync in DatabricksStatement by executing single SQL DESC TABLE EXTENDED <table> AS JSON to get all the required column info for GetColumnsExtendedAsync then combine and join these info into the result. As this SQL DESC TABLE EXTENDED <table> AS JSON is only available in Databricks Runtime 16.2 or above, it will check the ServerProtocolVersion. if it does not meet the requirement, it will fallback the implementation of base class.

Change

  • Added DescTableExtendedResult that represents the result of DESC TABLE EXTENDED <table> AS JSON (see format here). It also
    • Parses the string-based table_constraints to the structured primary key and foreign key constraints
    • Adds calculated properties DataType, ColumnSize, FullTypeName which are calculated from the column type and type specific properties
  • Changed the access modifiers of some properties and methods in HiveServer2Statement to protected so that they can be used/overridden in subclass DatabricksStatement
    • PrimaryKeyFields
    • ForeignKeyFields
    • PrimaryKeyPrefix
    • ForeignKeyPrefix
    • GetColumnsExtendedAsync
    • CreateEmptyExtendedColumnsResult
  • Updated DatabricksStatement with the changes below
    • Added GetColumnsExtendedAsync in DatabricksStatement, it executes query DESC TABLE EXTENDED <table> AS JSON to get all the required info and then json them into the QueryResult
    • Moved the column metadata schema creation logic from GetColumnsAsync to a reusable method CreateColumnMetadataSchema
    • Moved the column metadata data array creation logic from GetColumnsAsync to a reusable method CreateColumnMetadataEmptyArray
  • Added a Databricks connection Parameter adbc.databricks.use_desc_table_extended
  • Added an internal calculated property CanUseDescTableExtended in DatabricksConnection, DatabricksStatement will call it to decide if it will override GetColumnsExtendedAsync
  • Improved the Databricks driver integration test StatementTest:CanGetColumnsExtended
    • Setup the required table resources during the test instead of depending on manual setup before the test
    • Support running multiple test cases from the inputs
    • Add a deep result check to make sure all the properties of all the columns are set correctly

Testing

  • Added Test class DescTableExtendedResultTest to cover all the deserialization and parsing cases from raw result of DESC TABLE EXTENDED <table> AS JSON
  • Tested all the ColumnsExtended relevant test cases in Databricks/StatementTest

@github-actions github-actions bot modified the milestone: ADBC Libraries 19 Jun 11, 2025
@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch from 81b326f to 7d83b75 Compare June 11, 2025 02:05
@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch from 49b3f6e to 8c73c0b Compare June 11, 2025 04:36
* limitations under the License.
*/

using Apache.Arrow.Adbc.Drivers.Databricks.Result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mocify/improve the existing GetColumnExtended tests to verify that:
With the flag on/off, they get same number of columns, same schema, and same data?

@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch 2 times, most recently from 4d4af63 to 4a0d05a Compare June 12, 2025 03:53
@jackyhu-db jackyhu-db force-pushed the desc-table-extended branch from 4a0d05a to bb93562 Compare June 12, 2025 04:19
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good! I've just got a few nits about formatting and casing of method names.

}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra blank line

/// </summary>
internal bool CanUseDescTableExtended => _useDescTableExtended && ServerProtocolVersion != null && ServerProtocolVersion >= TProtocolVersion.SPARK_CLI_SERVICE_PROTOCOL_V7;


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra blank line

return string.Join(".", parts);
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra blank line

return CreateExtendedColumnsResult(columnMetadataSchema,result);
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra blank line

];
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra blank line

using System.Text.RegularExpressions;
using static Apache.Arrow.Adbc.Drivers.Apache.Hive2.HiveServer2Connection;


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra blank lines here and at line 35

}
}

private int getIntervalSize()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private int getIntervalSize()
private int GetIntervalSize()

}
}

private void parseConstraints()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private void parseConstraints()
private void ParseConstraints()

// Constraints string format example:
// "[ (pk_constraint, PRIMARY KEY (`col1`, `col2`)), (fk_constraint, FOREIGN KEY (`col3`) REFERENCES `catalog`.`schema`.`table` (`refcol1`, `refcol2`)) ]"

var constraintString = TableConstraints.Trim();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when not null, we're trimming twice. Consider instead doing

var constraintString = TableConstraints?.Trim();
if (constraintString == null || constraintString.Length == 0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants