added a new tool called "llm_based_sql_translation_validator" (#1272)

* add tool llm_based_sql_translation_validator
* Update README.md

Co-authored-by: Andrew Gold <41129777+agold-rh@users.noreply.github.com>
Showing 3 changed files with 170 additions and 0 deletions.
@@ -0,0 +1,35 @@

# LLM Based SQL Translation Validator

An LLM-based utility tool to identify translation errors between any source database SQL and its BigQuery SQL equivalent.

## Business Requirement:

1. To validate the output BigQuery SQL by comparing it with the legacy source SQL.
2. To quickly identify any translation errors, right at the beginning of the migration.
3. To get a sense of translation accuracy.
4. To provide a sense of confidence before query deployment.
5. To save time during migration projects by catching errors before the data validation process.

## Asset Features:

1. This tool is built entirely with BigQuery SQL and does not need any additional infrastructure to run the scripts.
2. This tool consists of only two procedures and iterates over all the files in a single execution.
3. This tool also includes the DDLs to create the output BigQuery tables that store the results, so no additional deployment scripts are needed.
4. This tool is generic and can be used to compare SQL from any source database against BigQuery SQL.

## Instructions to Run:

1. Create a new dataset.
2. Create a new BigQuery connection of type Cloud Resource.
3. Call the run_sql_validator stored procedure with the required input.
4. A sample procedure call is shown below:

```sql
DECLARE var_Dataset_name STRING DEFAULT 'gemini_sql_validator';
DECLARE var_connection_name STRING DEFAULT 'bq-data-project.us.gemini-sql-validator-connection';
DECLARE var_source_database STRING DEFAULT 'Teradata';
DECLARE var_source_gcs_path STRING DEFAULT 'gs://gemini-sql-validator/td/*';
DECLARE var_target_gcs_path STRING DEFAULT 'gs://gemini-sql-validator/bq/*';
CALL `dataset_name.run_sql_validator`(var_connection_name, var_source_database, var_source_gcs_path, var_target_gcs_path, var_Dataset_name);
```

where var_Dataset_name is the dataset created in step 1 and var_connection_name is the connection created in step 2.

5. See the output by querying the "validation_report" view.
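The call in step 4 can also be submitted programmatically. Below is a minimal Python sketch that assembles the same script; the project, dataset, connection, and bucket names are the illustrative sample values from above, not fixed names, and actually submitting the job (commented out) assumes the google-cloud-bigquery client library and valid credentials.

```python
# Sketch: invoking run_sql_validator from Python instead of the BigQuery
# console. All names passed in below are placeholder sample values.

def build_validator_call(dataset: str, connection: str, source_db: str,
                         source_gcs: str, target_gcs: str) -> str:
    """Assemble the CALL script from the README as one multi-statement query."""
    return "\n".join([
        f"DECLARE var_Dataset_name STRING DEFAULT '{dataset}';",
        f"DECLARE var_connection_name STRING DEFAULT '{connection}';",
        f"DECLARE var_source_database STRING DEFAULT '{source_db}';",
        f"DECLARE var_source_gcs_path STRING DEFAULT '{source_gcs}';",
        f"DECLARE var_target_gcs_path STRING DEFAULT '{target_gcs}';",
        f"CALL `{dataset}.run_sql_validator`(var_connection_name, var_source_database,"
        " var_source_gcs_path, var_target_gcs_path, var_Dataset_name);",
    ])

if __name__ == "__main__":
    sql = build_validator_call(
        "gemini_sql_validator",
        "bq-data-project.us.gemini-sql-validator-connection",
        "Teradata",
        "gs://gemini-sql-validator/td/*",
        "gs://gemini-sql-validator/bq/*",
    )
    # Uncomment to submit (needs credentials and google-cloud-bigquery):
    # from google.cloud import bigquery
    # bigquery.Client().query(sql).result()
    print(sql)
```

Keeping the statement assembly in a small function makes it easy to run the validator for several dataset/bucket combinations from one driver script.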
@@ -0,0 +1,36 @@
CREATE OR REPLACE PROCEDURE `bq-data-project.bq_dataset_name.call_llm`(var_dataset_name STRING, var_Batch_Id STRING, var_Source_File STRING, var_Target_File STRING, var_Source_database STRING, var_Insert_Time TIMESTAMP, var_Model_Name STRING, var_Source_Data BYTES, var_Target_Data BYTES, Var_Prompt STRING)
BEGIN

--Copyright 2024 Google. This software is provided as-is,
--without warranty or representation for any use or purpose.
--Your use of it is subject to your agreement with Google.

--Purpose: This procedure calls the LLM to generate the validation output by comparing the input and output files.
--The result is also stored in the validation_output table.

--insert the LLM output into the validation output table
EXECUTE IMMEDIATE """
INSERT INTO """||var_dataset_name||""".validation_output (Batch_Id,Source_File,Target_File,Validation_Result,Source_Database,Target_Database,Insert_Time)
--call the LLM with the prompt, input file, and output file, and return the response from the LLM
SELECT '"""||var_Batch_Id||"""' as Batch_Id,'"""||var_Source_File||"""' as Source_File,'"""||var_Target_File||"""' as Target_File,
ml_generate_text_llm_result as Validation_Result,'"""||var_Source_database||"""' as Source_Database,
'Bigquery' as Target_Database,CAST('"""||var_Insert_Time||"""' AS TIMESTAMP) as Insert_Time
FROM
ML.GENERATE_TEXT(
MODEL `"""||var_Model_Name||"""`,
(SELECT '''"""||Var_Prompt||"""\n"""||var_Source_database||""" SQL:"""||SAFE_CONVERT_BYTES_TO_STRING(var_Source_Data)||""",Bigquery SQL:"""||SAFE_CONVERT_BYTES_TO_STRING(var_Target_Data)||""" ''' AS prompt
),
STRUCT(
0.8 AS temperature,
8192 AS max_output_tokens,
0.95 AS top_p,
1 as top_k,
true as flatten_json_output));"""
;
EXCEPTION WHEN ERROR THEN
SELECT
@@error.message,
@@error.stack_trace,
@@error.statement_text,
@@error.formatted_stack_trace;
END;
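The prompt crafted in run_sql_validator asks the model to emit one JSON object per identified difference, so the Validation_Result column stored above usually holds JSON-ish text. A hedged Python sketch for consuming those rows downstream; the model's output shape is not guaranteed, and the markdown-fence stripping is only a heuristic:

```python
import json

def parse_validation_result(validation_result: str) -> list[dict]:
    """Best-effort parse of an LLM validation answer into difference records.

    Models sometimes wrap JSON answers in markdown code fences, so strip a
    leading ```/```json line and the trailing ``` before decoding. A single
    JSON object is normalized to a one-element list.
    """
    text = validation_result.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]      # drop the opening fence line
        text = text.rsplit("```", 1)[0]    # drop the closing fence
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```

In practice a consumer would call this per row of the validation_report view and route unparseable answers (json.JSONDecodeError) to manual review rather than failing the batch.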
99 changes: 99 additions & 0 deletions
tools/llm_based_sql_translation_validator/run_sql_validator.sql

@@ -0,0 +1,99 @@
CREATE OR REPLACE PROCEDURE `bq-data-project.bq_dataset_name.run_sql_validator`(var_connection_name STRING, var_source_database STRING, var_source_gcs_path STRING, var_target_gcs_path STRING, var_Dataset_name STRING)
BEGIN

--Copyright 2024 Google. This software is provided as-is,
--without warranty or representation for any use or purpose.
--Your use of it is subject to your agreement with Google.

--Purpose: This procedure creates the object tables that read both the source and the target files from GCS. It also creates the output tables that store the validation results. The prompt is crafted in this procedure, which then calls a child procedure to generate the validation output.

--Prerequisite: A separate dataset needs to be created, and the two procedures run_sql_validator and call_llm need to be created in that dataset. A BigQuery connection of type Cloud Resource needs to be created before executing this procedure.

--Sample Input
/*
DECLARE var_connection_name STRING DEFAULT 'bq-data-project.us.gemini-sql-validator-connection';
DECLARE var_source_database STRING DEFAULT 'Teradata';
DECLARE var_source_gcs_path STRING DEFAULT 'gs://gemini-sql-validator/td/*';
DECLARE var_target_gcs_path STRING DEFAULT 'gs://gemini-sql-validator/bq/*';
DECLARE var_Dataset_name STRING DEFAULT 'gemini_sql_validator';
*/

DECLARE var_Batch_Id STRING;
DECLARE var_Insert_Time TIMESTAMP;
DECLARE var_Model_Name STRING;
DECLARE Var_Prompt STRING;

SET var_Batch_Id = GENERATE_UUID();
SET var_Insert_Time = current_timestamp();
SET @@dataset_id = var_Dataset_name;
SET var_Model_Name = var_Dataset_name||".sql-validator-model";
--Create the output table to store the validation results
CREATE TABLE IF NOT EXISTS `validation_output`
(
Batch_Id STRING,
Source_File STRING,
Target_File STRING,
Validation_Result STRING,
Source_Database STRING,
Target_Database STRING,
Insert_Time TIMESTAMP
);

--Create a view to retrieve the latest validation results for each file
EXECUTE IMMEDIATE """
CREATE VIEW IF NOT EXISTS validation_report as
(
select Batch_Id,Source_File,Target_File,Validation_Result,Source_Database,Target_Database,Insert_Time
from """||var_Dataset_name||""".validation_output
where (Insert_Time,Source_File) in ( select (max(Insert_Time),Source_File) from """||var_Dataset_name||""".validation_output
group by Source_File )
) """;
--Create the LLM model to execute the validation
EXECUTE IMMEDIATE "CREATE MODEL IF NOT EXISTS `"||var_Model_Name||"` REMOTE WITH CONNECTION `"||var_connection_name ||"` OPTIONS (ENDPOINT = 'gemini-pro')";

--Create the object table to read the source input files from GCS
EXECUTE IMMEDIATE "CREATE EXTERNAL TABLE IF NOT EXISTS source_object_table WITH CONNECTION `"||var_connection_name||"` OPTIONS(object_metadata = 'SIMPLE',uris = ['"||var_source_gcs_path||"'],max_staleness=INTERVAL 1 DAY,metadata_cache_mode = 'MANUAL')";

--Create the object table to read the target input files from GCS
EXECUTE IMMEDIATE "CREATE EXTERNAL TABLE IF NOT EXISTS target_object_table WITH CONNECTION `"||var_connection_name||"` OPTIONS(object_metadata = 'SIMPLE',uris = ['"||var_target_gcs_path||"'],max_staleness=INTERVAL 1 DAY,metadata_cache_mode = 'MANUAL')";

--Refresh both input and output object tables to include newly added files in GCS
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('source_object_table');
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('target_object_table');

--Craft the prompt to send to the LLM
SET Var_Prompt="""
Identify and document all SQL differences between the below """||var_source_database||""" SQL and its equivalent BigQuery SQL.
Input:
"""||var_source_database||""" SQL:
Bigquery SQL:
Process:
Analyze the SQL statements. Deeply compare the syntax, functions, and data types used in both """||var_source_database||""" and BigQuery SQL.
Identify any inconsistencies or differences in the logic between the two statements.
Output:
For each identified difference, create a JSON object with a clear explanation of the specific SQL difference and its potential impact.""";
--Iterate over each pair of source and target files
FOR record in (
select
src.data as var_Source_Data,
tgt.data as var_Target_Data,
src.uri as var_Source_File,
tgt.uri as var_Target_File,
var_source_database as var_Source_database
from source_object_table src
inner join target_object_table tgt
on lower(SPLIT(src.uri,"/")[4])=lower(SPLIT(tgt.uri,"/")[4])
)
DO

--Call the child procedure for each pair of files
CALL call_llm(var_Dataset_name,var_Batch_Id,record.var_Source_File,record.var_Target_File,record.var_Source_database,var_Insert_Time,var_Model_Name,record.var_Source_Data,record.var_Target_Data,Var_Prompt);

END FOR;

END;
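The FOR loop above pairs a source file with a target file when the path segment at index 4 of the GCS URI (the object name directly under gs://bucket/prefix/) matches case-insensitively. A small Python sketch of that same pairing rule, using illustrative URIs; note that deeper folder nesting would shift which segment index 4 refers to:

```python
# Sketch of the file-pairing rule used in run_sql_validator's FOR loop:
# files match when SPLIT(uri, "/")[4] is equal, compared case-insensitively.

def pair_key(uri: str) -> str:
    """Mirror of lower(SPLIT(uri, "/")[4]): for gs://bucket/prefix/file.sql,
    segment 4 is the file name (segments 0-3 are 'gs:', '', bucket, prefix)."""
    return uri.split("/")[4].lower()

def pair_files(source_uris: list[str], target_uris: list[str]) -> list[tuple[str, str]]:
    """Inner-join the two listings on the pairing key, as the procedure does;
    files without a counterpart are silently dropped by the join."""
    targets = {pair_key(t): t for t in target_uris}
    return [(s, targets[pair_key(s)]) for s in source_uris if pair_key(s) in targets]

pairs = pair_files(
    ["gs://gemini-sql-validator/td/Query1.sql",
     "gs://gemini-sql-validator/td/only_in_source.sql"],
    ["gs://gemini-sql-validator/bq/query1.sql"],
)
# Query1.sql pairs with query1.sql despite the case difference;
# only_in_source.sql has no target counterpart and is skipped.
```

This also makes the implied naming contract explicit: a translated file must keep the same file name in the target bucket prefix for its validation to run.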