added a new tool called "llm_based_sql_translation_validator" (#1272)
* add tool llm_based_sql_translation_validator

* Update README.md

---------

Co-authored-by: Andrew Gold <41129777+agold-rh@users.noreply.github.com>
aaslam17 and agold-rh committed May 15, 2024
1 parent 8e13add commit fc9f443
Showing 3 changed files with 170 additions and 0 deletions.
35 changes: 35 additions & 0 deletions tools/llm_based_sql_translation_validator/README.md
@@ -0,0 +1,35 @@
# LLM Based SQL Translation Validator

An LLM-based utility tool that identifies translation errors between SQL from any source database and its equivalent BigQuery SQL.

## Business Requirements:
1. To validate the translated BigQuery SQL by comparing it with the legacy source SQL.
2. To quickly identify translation errors right at the beginning of the migration.
3. To get a sense of translation accuracy.
4. To provide confidence before query deployment.
5. To save time during migration projects by reducing the need for a full data validation process.

## Asset Features:
1. The tool is built entirely with BigQuery SQL and does not need any additional infrastructure to run the scripts.
2. The tool consists of only two stored procedures and iterates over all files in a single execution.
3. The tool also includes the DDL to create the output BigQuery tables that store the results, so no additional deployment scripts are needed.
4. The tool is generic and can be used to compare SQL from any database with BigQuery SQL.

## Instructions to Run:
1. Create a new dataset.
2. Create a new BigQuery connection of type Cloud Resource.
3. Call the run_sql_validator stored procedure with the required input.
4. A sample procedure call is shown below:

DECLARE var_Dataset_name STRING DEFAULT 'gemini_sql_validator';
DECLARE var_connection_name STRING DEFAULT 'bq-data-project.us.gemini-sql-validator-connection';
DECLARE var_source_database STRING DEFAULT 'Teradata';
DECLARE var_source_gcs_path STRING DEFAULT 'gs://gemini-sql-validator/td/*';
DECLARE var_target_gcs_path STRING DEFAULT 'gs://gemini-sql-validator/bq/*';
CALL `dataset_name.run_sql_validator`(var_connection_name, var_source_database, var_source_gcs_path, var_target_gcs_path, var_Dataset_name);

where var_Dataset_name is the dataset created in step 1 and var_connection_name is the connection created in step 2.

5. View the output by querying the `validation_report` view; a sample query is shown below.
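
A minimal report query, for illustration; the dataset name `gemini_sql_validator` follows the sample call above:

```sql
-- Latest validation result for each source/target file pair
SELECT
  Source_File,
  Target_File,
  Validation_Result,
  Insert_Time
FROM `gemini_sql_validator.validation_report`
ORDER BY Insert_Time DESC;
```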


36 changes: 36 additions & 0 deletions tools/llm_based_sql_translation_validator/call_llm.sql
@@ -0,0 +1,36 @@
CREATE OR REPLACE PROCEDURE `bq-data-project.bq_dataset_name.call_llm`(var_dataset_name STRING, var_Batch_Id STRING, var_Source_File STRING, var_Target_File STRING, var_Source_database STRING, var_Insert_Time TIMESTAMP, var_Model_Name STRING, var_Source_Data BYTES, var_Target_Data BYTES, Var_Prompt STRING)
BEGIN

--Copyright 2024 Google. This software is provided as-is,
--without warranty or representation for any use or purpose.
--Your use of it is subject to your agreement with Google.

--Purpose: This procedure calls the LLM to generate the validation output by comparing the source and target files.
--The result is also stored in the validation_output table.

--Insert the LLM output into the validation_output table
EXECUTE IMMEDIATE """
INSERT INTO """||var_dataset_name||""".validation_output (Batch_Id,Source_File,Target_File,Validation_Result,Source_Database,Target_Database,Insert_Time)
--Call the LLM with the prompt, source file, and target file, and return the response from the LLM
SELECT '"""||var_Batch_Id||"""' as Batch_Id,'"""||var_Source_File||"""' as Source_File,'"""||var_Target_File||"""' as Target_File,
ml_generate_text_llm_result as Validation_Result,'"""||var_Source_database||"""' as Source_Database,
'Bigquery' as Target_Database,CAST('"""||var_Insert_Time||"""' AS TIMESTAMP) as Insert_Time
FROM
ML.GENERATE_TEXT(
MODEL `"""||var_Model_Name||"""`,
(SELECT '''"""||Var_Prompt||"""\n"""||var_Source_database||""" SQL:"""||SAFE_CONVERT_BYTES_TO_STRING(var_Source_Data)||""",Bigquery SQL:"""||SAFE_CONVERT_BYTES_TO_STRING(var_Target_Data)||""" ''' AS prompt
),
STRUCT(
0.8 AS temperature,
8192 AS max_output_tokens,
0.95 AS top_p,
1 as top_k,
true as flatten_json_output));"""
;
EXCEPTION WHEN ERROR THEN
SELECT
@@error.message,
@@error.stack_trace,
@@error.statement_text,
@@error.formatted_stack_trace;
END;
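
For readability, here is a rough sketch of what the dynamic statement above expands to once the variables are substituted; the dataset, model, file, batch, and timestamp values below are placeholders based on the README sample, not literal output:

```sql
-- Illustrative expansion of the EXECUTE IMMEDIATE above (all names are placeholders)
INSERT INTO gemini_sql_validator.validation_output
  (Batch_Id, Source_File, Target_File, Validation_Result,
   Source_Database, Target_Database, Insert_Time)
SELECT
  'generated-uuid' AS Batch_Id,
  'gs://gemini-sql-validator/td/query1.sql' AS Source_File,
  'gs://gemini-sql-validator/bq/query1.sql' AS Target_File,
  ml_generate_text_llm_result AS Validation_Result,
  'Teradata' AS Source_Database,
  'Bigquery' AS Target_Database,
  CAST('2024-05-15 00:00:00' AS TIMESTAMP) AS Insert_Time
FROM ML.GENERATE_TEXT(
  MODEL `gemini_sql_validator.sql-validator-model`,
  (SELECT '<validation prompt, Teradata SQL, and BigQuery SQL concatenated here>' AS prompt),
  STRUCT(0.8 AS temperature, 8192 AS max_output_tokens,
         0.95 AS top_p, 1 AS top_k, TRUE AS flatten_json_output));
```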
99 changes: 99 additions & 0 deletions tools/llm_based_sql_translation_validator/run_sql_validator.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
CREATE OR REPLACE PROCEDURE `bq-data-project.bq_dataset_name.run_sql_validator`(var_connection_name STRING, var_source_database STRING, var_source_gcs_path STRING, var_target_gcs_path STRING, var_Dataset_name STRING)
BEGIN


--Copyright 2024 Google. This software is provided as-is,
--without warranty or representation for any use or purpose.
--Your use of it is subject to your agreement with Google.

--Purpose: This procedure creates the object tables that read the source and target files from GCS, creates the output table that stores the validation results, crafts the prompt, and calls a child procedure to generate the validation output for each file pair.

--Prerequisites: A separate dataset must be created, and the two procedures run_sql_validator and call_llm must be created in that dataset. A BigQuery connection of type Cloud Resource must also be created before executing this procedure.

--Sample Input
/*
DECLARE var_connection_name STRING DEFAULT 'bq-data-project.us.gemini-sql-validator-connection';
DECLARE var_source_database STRING DEFAULT 'Teradata';
DECLARE var_source_gcs_path STRING DEFAULT 'gs://gemini-sql-validator/td/*';
DECLARE var_target_gcs_path STRING DEFAULT 'gs://gemini-sql-validator/bq/*';
DECLARE var_Dataset_name STRING DEFAULT 'gemini_sql_validator';
*/

DECLARE var_Batch_Id STRING;
DECLARE var_Insert_Time TIMESTAMP;
DECLARE var_Model_Name STRING;
DECLARE Var_Prompt STRING;

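--Initialize the batch id, run timestamp, default dataset, and remote model name for this run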
SET var_Batch_Id = GENERATE_UUID();
SET var_Insert_Time=current_timestamp();
SET @@dataset_id=var_Dataset_name;
SET var_Model_Name = var_Dataset_name||".sql-validator-model";


--Create the output table to store the validation results
CREATE TABLE IF NOT EXISTS `validation_output`
(
Batch_Id STRING,
Source_File STRING,
Target_File STRING,
Validation_Result STRING,
Source_Database STRING,
Target_Database STRING,
Insert_Time TIMESTAMP
);

--Create a view to retrieve the latest validation results for each file
EXECUTE IMMEDIATE """
CREATE VIEW IF NOT EXISTS validation_report as
(
select Batch_Id,Source_File,Target_File,Validation_Result,Source_Database,Target_Database,Insert_Time
from """||var_Dataset_name||""".validation_output
where (Insert_Time,Source_File) in ( select (max(Insert_Time),Source_File) from """||var_Dataset_name||""".validation_output
group by Source_File )
) """;

--Create the LLM Model to execute the validation
EXECUTE IMMEDIATE "CREATE MODEL IF NOT EXISTS `"||var_Model_Name||"` REMOTE WITH CONNECTION `"||var_connection_name ||"` OPTIONS (ENDPOINT = 'gemini-pro')";

--Create the object table to read the source input files from GCS
EXECUTE IMMEDIATE "CREATE EXTERNAL TABLE IF NOT EXISTS source_object_table WITH CONNECTION `"||var_connection_name||"` OPTIONS(object_metadata = 'SIMPLE',uris = ['"||var_source_gcs_path||"'],max_staleness=INTERVAL 1 DAY,metadata_cache_mode = 'MANUAL')";

--Create the object table to read the target input files from GCS
EXECUTE IMMEDIATE "CREATE EXTERNAL TABLE IF NOT EXISTS target_object_table WITH CONNECTION `"||var_connection_name||"` OPTIONS(object_metadata = 'SIMPLE',uris = ['"||var_target_gcs_path||"'],max_staleness=INTERVAL 1 DAY,metadata_cache_mode = 'MANUAL')";

--Refresh both the input and output object tables to include newly added files in GCS
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('source_object_table');
CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('target_object_table');

--Craft the prompt to send to the LLM
SET Var_Prompt="""
Identify and document all SQL differences between the below """||var_source_database||""" SQL and its equivalent BigQuery SQL.
Input:
"""||var_source_database||""" SQL:
Bigquery SQL:
Process:
Analyze the SQL statements. Deeply compare the syntax, functions, and data types used in both """||var_source_database||""" and BigQuery SQL.
Identify any inconsistencies or differences in the logic between the two statements.
Output:
For each identified difference, create a JSON object with clear explanation of the specific SQL difference and its potential impact.""" ;

--Iterate over each pair of source and target files
FOR record in (
select
src.data as var_Source_Data,
tgt.data as var_Target_Data,
src.uri as var_Source_File,
tgt.uri as var_Target_File,
var_source_database as var_Source_database
from source_object_table src
inner join target_object_table tgt
on lower(SPLIT(src.uri,"/")[4])=lower(SPLIT(tgt.uri,"/")[4])
)
DO

--Call the child procedure for each set of files
CALL call_llm(var_Dataset_name,var_Batch_Id,record.var_Source_File,record.var_Target_File,record.var_Source_database,var_Insert_Time,var_Model_Name,(record.var_Source_Data),(record.var_Target_Data),Var_Prompt);

END FOR;

END;
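
The FOR loop above pairs files by the fifth path segment of the GCS URI, i.e. the file name, so a source script and its translated target script are expected to share the same file name (with the layout of the sample paths, bucket/folder/file). A quick check of the matching expression, assuming the sample GCS paths from the README and a hypothetical file name `query1.sql`:

```sql
-- Both expressions return 'query1.sql', so these two objects would be paired by the join
SELECT
  SPLIT('gs://gemini-sql-validator/td/query1.sql', '/')[4] AS source_key,
  SPLIT('gs://gemini-sql-validator/bq/query1.sql', '/')[4] AS target_key;
```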
