
[FEATURE] Enhance QueryDQ feature to capture the source and target values #55

Closed
vigneshwarrvenkat opened this issue Nov 10, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@vigneshwarrvenkat
Contributor

Is your feature request related to a problem? Please describe.
The query DQ feature reports its output as boolean values. A FALSE value tells us that something went wrong in the query validation, but not what: users have to rerun the queries manually to figure out the difference. A production support team is highly unlikely to be familiar with the validation scripts, so it becomes tough to get actionable insights out of the query DQ feature. If the query results of both the source and the target were fetched and stored in a custom stats table, users could build actionable insights or work items from those results.

Describe the solution you'd like
Right now, QueryDQ is programmed to accept a single query. Instead, we could pass three queries, as below.

select X from table1; select Y from table2; select x=y from t1 join t2

Queries are separated by semicolons. With one query, the default behaviour applies; with three, the behaviour is as follows.

X and Y are the values to be compared from the source and target respectively.
The third query is the validation query. If the validation returns FALSE, we fetch the X and Y values and store them as JSON in a custom stats table. The custom table is user managed and should be passed as an argument, as below.

SparkExpectations(custom_dq_info_table="...")
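For concreteness, here is a minimal sketch of how that three-query flow could capture source and target values, assuming a live SparkSession; the function name `run_query_dq` and the `rule`/`source_output`/`target_output` schema of the stats table are hypothetical names for illustration, not part of the SparkExpectations API.

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_query_dq(rule_name: str, dq_query: str, custom_dq_info_table: str) -> bool:
    """Run a query_dq rule; on failure, persist source/target values as JSON."""
    queries = [q.strip() for q in dq_query.split(";") if q.strip()]
    if len(queries) == 1:
        # Default behaviour: a single validation query returning a boolean.
        return bool(spark.sql(queries[0]).collect()[0][0])

    # Proposed behaviour: source query, target query, validation query.
    source_sql, target_sql, validation_sql = queries
    passed = bool(spark.sql(validation_sql).collect()[0][0])
    if not passed:
        # Capture the diverging values so support teams do not have to
        # rerun the source and target queries by hand.
        source_rows = [r.asDict() for r in spark.sql(source_sql).collect()]
        target_rows = [r.asDict() for r in spark.sql(target_sql).collect()]
        record = [(rule_name,
                   json.dumps(source_rows, default=str),
                   json.dumps(target_rows, default=str))]
        (spark.createDataFrame(
            record, "rule string, source_output string, target_output string")
            .write.mode("append")
            .saveAsTable(custom_dq_info_table))
    return passed

# Example call, reusing the delimited query from above:
# run_query_dq("count_match",
#              "select X from table1; select Y from table2; select x=y from t1 join t2",
#              "my_db.custom_dq_info")
```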

Describe alternatives you've considered
We are currently implementing the above option as a separate module and using it alongside the other features of SparkExpectations.

Additional context
The custom table is user managed; permissions and other administrative concerns have to be handled by the user.
The number of records stored in the custom stats table could initially be restricted to 200 rows, as in the sketch below.
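A minimal sketch of that cap, assuming the captured rows come from a Spark DataFrame; `capture_rows` and `MAX_CAPTURED_ROWS` are hypothetical names, and 200 mirrors the initial limit proposed above.

```python
from typing import List

from pyspark.sql import DataFrame

MAX_CAPTURED_ROWS = 200  # initial cap proposed above; intended to be adjustable

def capture_rows(df: DataFrame, limit: int = MAX_CAPTURED_ROWS) -> List[dict]:
    """Collect at most `limit` rows as plain dicts, ready for JSON serialization."""
    return [row.asDict() for row in df.limit(limit).collect()]
```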

@vigneshwarrvenkat vigneshwarrvenkat added the enhancement New feature or request label Nov 10, 2023
@vigneshwarrvenkat
Contributor Author

Connected with @asingamaneni and @jskrajareddy21 on this enhancement. Below are the bug fixes and feature requests coupled to this enhancement request:

  1. The query_dq should execute as is with multiple delimited queries.
  2. Since there is an ask to send the contents of the detailed stats table to Kafka, any data stored in the detailed stats table has to be masked before it is sent to Kafka (see the masking sketch after this list).
  3. There should not be any limitation on the number of delimited query_dq queries.
  4. Handle edge cases where one of the delimited query_dq queries returns an int or a float.
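As a concrete illustration of point 2, here is a minimal masking sketch, assuming the detailed stats record is serialized as JSON and published with the kafka-python client; the topic name, the masked field names, and the fixed-token masking rule are all assumptions for illustration, not the library's actual Kafka integration.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

MASK = "****"
SENSITIVE_KEYS = {"source_output", "target_output"}  # assumed field names

def mask_stats(stats: dict) -> dict:
    """Replace captured values with a fixed token before the record leaves the cluster."""
    return {k: (MASK if k in SENSITIVE_KEYS else v) for k, v in stats.items()}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("dq-detailed-stats", mask_stats({
    "rule": "count_match",
    "source_output": '[{"x": 10}]',
    "target_output": '[{"y": 12}]',
}))
producer.flush()
```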

We have started working on this.
