
[FEATURE] Enhance QueryDQ feature to capture the source and target values #55

Closed
vigneshwarrvenkat opened this issue Nov 10, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@vigneshwarrvenkat
Contributor

Is your feature request related to a problem? Please describe.
The query DQ feature reports its output as boolean values. A FALSE value tells us that something went wrong in the query validation, but not what: users have to rerun the queries manually to figure out the difference. A production support team is highly unlikely to be familiar with the validation scripts, so it becomes tough to get actionable insights out of the query DQ feature. If the query results of both the source and the target were fetched and stored in a custom stats table, users could build actionable insights or work items from those results.

Describe the solution you'd like
Right now, QueryDQ is programmed to accept a single query. Instead, we could pass three queries, as below.

select X from table1; select Y from table2; select x=y from t1 join t2

Queries are separated by semicolons. With one query, the default behaviour applies; with three, the behaviour is as follows.

X and Y are the values to be compared from the source and target respectively.
The third query is the validation query. If the validation returns FALSE, we fetch the X and Y values and store them as JSON in a custom stats table. The custom table is user managed and should be passed as an argument, as below.

SparkExpectations(custom_dq_info_table="...")
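For concreteness, here is a minimal sketch of how that three-query flow could capture source and target values, assuming a live SparkSession; the function name `run_query_dq` and the `rule`/`source_output`/`target_output` schema of the stats table are hypothetical names for illustration, not part of the SparkExpectations API.

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_query_dq(rule_name: str, dq_query: str, custom_dq_info_table: str) -> bool:
    """Run a query_dq rule; on failure, persist source/target values as JSON."""
    queries = [q.strip() for q in dq_query.split(";") if q.strip()]
    if len(queries) == 1:
        # Default behaviour: a single validation query returning a boolean.
        return bool(spark.sql(queries[0]).collect()[0][0])

    # Proposed behaviour: source query, target query, validation query.
    source_sql, target_sql, validation_sql = queries
    passed = bool(spark.sql(validation_sql).collect()[0][0])
    if not passed:
        # Capture the diverging values so support teams do not have to
        # rerun the source and target queries by hand.
        source_rows = [r.asDict() for r in spark.sql(source_sql).collect()]
        target_rows = [r.asDict() for r in spark.sql(target_sql).collect()]
        record = [(rule_name,
                   json.dumps(source_rows, default=str),
                   json.dumps(target_rows, default=str))]
        (spark.createDataFrame(
            record, "rule string, source_output string, target_output string")
            .write.mode("append")
            .saveAsTable(custom_dq_info_table))
    return passed

# Example call, reusing the delimited query from above:
# run_query_dq("count_match",
#              "select X from table1; select Y from table2; select x=y from t1 join t2",
#              "my_db.custom_dq_info")
```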

Describe alternatives you've considered
We are currently implementing the above option as a separate module and using it alongside the other features of SparkExpectations.

Additional context
The custom table is user managed; permissions and other administrative concerns have to be handled by the user.
The number of records stored in the custom stats table could initially be restricted to 200 rows, as in the sketch below.
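A minimal sketch of that cap, assuming the captured rows come from a Spark DataFrame; `capture_rows` and `MAX_CAPTURED_ROWS` are hypothetical names, and 200 mirrors the initial limit proposed above.

```python
from typing import List

from pyspark.sql import DataFrame

MAX_CAPTURED_ROWS = 200  # initial cap proposed above; intended to be adjustable

def capture_rows(df: DataFrame, limit: int = MAX_CAPTURED_ROWS) -> List[dict]:
    """Collect at most `limit` rows as plain dicts, ready for JSON serialization."""
    return [row.asDict() for row in df.limit(limit).collect()]
```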

@vigneshwarrvenkat vigneshwarrvenkat added the enhancement New feature or request label Nov 10, 2023
@vigneshwarrvenkat
Contributor Author

Connected with @asingamaneni and @jskrajareddy21 on this enhancement. Below are the bug fixes and feature requests coupled to this enhancement request:

  1. The query_dq should execute as is with multiple delimited queries.
  2. Since there is an ask to send the contents of the detailed stats table to Kafka, any data stored in the detailed stats table has to be masked before it is sent to Kafka (see the masking sketch after this list).
  3. There should not be any limitation on the number of delimited query_dq queries.
  4. Handle edge cases where one of the delimited query_dq queries returns an int or a float.
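As a concrete illustration of point 2, here is a minimal masking sketch, assuming the detailed stats record is serialized as JSON and published with the kafka-python client; the topic name, the masked field names, and the fixed-token masking rule are all assumptions for illustration, not the library's actual Kafka integration.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

MASK = "****"
SENSITIVE_KEYS = {"source_output", "target_output"}  # assumed field names

def mask_stats(stats: dict) -> dict:
    """Replace captured values with a fixed token before the record leaves the cluster."""
    return {k: (MASK if k in SENSITIVE_KEYS else v) for k, v in stats.items()}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("dq-detailed-stats", mask_stats({
    "rule": "count_match",
    "source_output": '[{"x": 10}]',
    "target_output": '[{"y": 12}]',
}))
producer.flush()
```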

We have started working on this.
