Feat/snowflake ddl sql import solve #789 #790


Open
wants to merge 10 commits into main

Conversation

dmaresma
Contributor

Hi, I use Snowflake as my main EDW workspace. I noticed some regressions when simple-ddl-parser was replaced by sqlglot; unfortunately, the broken Snowflake DDL import is a show stopper for me when migrating the catalog into data contracts.
I took care to keep the sqlglot / SQL Server functionality as is; the SQL Server tests pass.

  • Tests pass
  • ruff format
  • README.md updated (if relevant)
  • CHANGELOG.md entry added

@simonharrer
Contributor

Can you elaborate on the error with sqlglot? We'd rather fix the issue there than re-introduce the previously removed library.

@dmaresma
Contributor Author

Hi, here is the elaboration. Running the simple-ddl-parser test DDL through sqlglot (as now used in datacontract-cli) shows that sqlglot introduces a huge discrepancy in DDL parsing compared to simple-ddl-parser, and not just for Snowflake. Please reconsider the PR until sqlglot fixes these issues.

# bench sqlglot vs simple-ddl-parser

import sqlglot
import sqlglot.expressions

# (statement, dialect) pairs taken from the simple-ddl-parser test suite
ddl_statements = [
    (
        """
    CREATE TABLE public."Diagnosis_identifier" (
        "Diagnosis_id" text NOT NULL
    )
    INHERITS (public.identifier);
    """,
        "postgres",
    ),
    (
        """CREATE TABLE test (
      timestamp TIMESTAMP,
      date DATE GENERATED ALWAYS AS (CAST(timestamp AS DATE))
    )""",
        "postgres",
    ),
    (
        """
    CREATE TABLE public.test (date_updated timestamp with time zone);""",
        "postgres",
    ),
    (
        """CREATE TABLE table (
        surrogatekey_SK NUMBER(38,0) NOT NULL autoincrement start 1 increment 1
        ORDER COMMENT 'Record Identification Number Ordered')""",
        "snowflake",
    ),
    (
        """CREATE TABLE table (
        surrogatekey_SK NUMBER(38,0) NOT NULL DEFAULT DBTEST.SCTEST.SQTEST.NEXTVAL COMMENT 'Record Identification Number',
        myColumnComment VARCHAR(255) COMMENT 'Record Identification Number from Sequence')""",
        "snowflake",
    ),
    (
        """CREATE TABLE table (
        surrogatekey_SK NUMBER(38,0) NOT NULL autoincrement start 1 increment 1
        NOORDER COMMENT 'Record Identification Number NoOrdered')""",
        "snowflake",
    ),
    (
        """
    create external table if not exists TABLE_DATA_SRC.EXT_PAYLOAD_MANIFEST_WEB (
       "type" VARCHAR(255) AS (SPLIT_PART(SPLIT_PART(METADATA$FILENAME, '/', 1), '=', 2 )),
       "year" VARCHAR(255) AS (SPLIT_PART(SPLIT_PART(METADATA$FILENAME, '/', 2), '=', 2)),
       "month" VARCHAR(255) AS (SPLIT_PART(SPLIT_PART(METADATA$FILENAME, '/', 3), '=', 2)),
       "day" VARCHAR(255) AS (SPLIT_PART(SPLIT_PART(METADATA$FILENAME, '/', 4), '=', 2)),
       "cast_YEAR" VARCHAR(200) AS (GET(VALUE,'c1')::string),
       "path" VARCHAR(255) AS (METADATA$FILENAME)
       )
    partition by ("type", "year", "month", "day", "path")
    location=@schema_name.StageName/year=2023/month=08/
    auto_refresh=false
    pattern='*.csv'
    file_format = (TYPE = JSON NULL_IF = () STRIP_OUTER_ARRAY = TRUE )
    ;
    """,
        "snowflake",
    ),
    (
        """
    create or replace table if not exists TABLE_DATA_SRC.EXT_PAYLOAD_MANIFEST_WEB (
       id bigint,
       derived bigint as (id * 10),
       "year" NUMBER(38,0) AS (EXTRACT(year from METADATA$FILE_LAST_MODIFIED)),
       PERIOD VARCHAR(200) AS (CAST(col1 AS VARCHAR(16777216))),
       field VARCHAR(205) AS (CAST(GET(VALUE, 'c3') AS VARCHAR(16777216)))
       )
    location = @sc.stage/entity=events/
    auto_refresh = false
    file_format = (TYPE=JSON NULL_IF=('field') DATE_FORMAT=AUTO TRIM_SPACE=TRUE)
    stage_file_format = (TYPE=JSON NULL_IF=())
    ;
    """,
        "snowflake",
    ),
    (
        """CREATE TABLE ${database_name}.MySchemaName."MyTableName"
    (ID NUMBER(38,0) NOT NULL, "DocProv" VARCHAR(2)) cluster by ("DocProv");""",
        "snowflake",
    ),
    (
        """CREATE TABLE ${database_name}.MySchemaName."MyTableName"
    cluster by ("DocProv") (
    ID NUMBER(38,0) NOT NULL,
    "DocProv" VARCHAR(2)
    );""",
        "snowflake",
    ),
    (
        """
        CREATE TABLE mydataset.newtable ( x INT64 );
        """,
        "bigquery",
    ),
    (
        """
    CREATE TABLE mydataset.newtable
     (
       x INT64 ,
       y STRUCT<a ARRAY<STRING>,b BOOL>
     )
    """,
        "bigquery",
    ),
    (
        """
    CREATE TABLE name.hub.REF_CALENDAR (
    calendar_dt DATE,
    )
    OPTIONS (
    description="Calendar table records reference list of calendar dates and related attributes used for reporting."
    );
    """,
        "bigquery",
    ),
    (
        """
    CREATE TABLE name.hub.REF_CALENDAR (
    calendar_dt DATE OPTIONS(description="Field Description")
    )
    OPTIONS (
    description="Calendar table records reference list of calendar dates and related attributes used for reporting."
    );
    """,
        "bigquery",
    ),
    (
        """
    CREATE TABLE name.hub.REF_CALENDAR (
    calendar_dt DATE OPTIONS(description="Field Description")
    )
    CLUSTER BY year_reporting_week_no
    OPTIONS (
    description="Calendar table records reference list of calendar dates and related attributes used for reporting."
    );

    """,
        "bigquery",
    ),
    (
        """
    CREATE TABLE mydataset.newtable
    (
    x INT64 OPTIONS(description="An optional INTEGER field")
    )
    OPTIONS(
    expiration_timestamp="2023-01-01 00:00:00 UTC",
    description="a table that expires in 2023",
    )

    """,
        "bigquery",
    ),
    (
        """
    CREATE SCHEMA IF NOT EXISTS project.calendar
    OPTIONS (
    location="project-location"
    );
    CREATE TABLE project_id.calendar.REF_CALENDAR (
    calendar_dt DATE,
    calendar_dt_id INT,
    fiscal_half_year_reporting_week_no INT
    )
    OPTIONS (
    description="Calendar table records reference list of calendar dates and related attributes used for reporting."
    )
    PARTITION BY DATETIME_TRUNC(fiscal_half_year_reporting_week_no, DAY)
    CLUSTER BY calendar_dt



    """,
        "bigquery",
    ),
    (
        """
        create TABLE project_id.schema.ChildTableName(
                parentTable varchar
                );
        ALTER TABLE project_id.schema.ChildTableName
        ADD CONSTRAINT "fk_t1_t2_tt"
        FOREIGN KEY ("parentTable")
        REFERENCES project_id.schema.ChildTableName2 ("columnName")
        ON DELETE CASCADE
        ON UPDATE CASCADE;
    """,
        "bigquery",
    ),
    (
        """
CREATE TABLE project_id.calendar.REF_CALENDAR (
    calendar_dt DATE,
    calendar_dt_id INT,
    fiscal_half_year_reporting_week_no INT
    )
    OPTIONS (
    value_1="some value",
   labels=[("org_unit", "development", "ci")])
        """,
        "bigquery",
    ),
    (
        """
            CREATE TABLE `my.data-cdh-hub-REF-CALENDAR` (
    calendar_dt DATE,
    calendar_dt_id INT
    )
    OPTIONS (
        location="location"
        )
    OPTIONS (
    description="Calendar table records reference list of calendar dates and related attributes used for reporting."
    )
    OPTIONS (
        name ="path"
    )
    OPTIONS (
        kms_two="path",
        two="two two"
    )
    OPTIONS (
        kms_three="path",
        three="three",
        threethree="three three"
    )
    OPTIONS (
        kms_four="path",
        four="four four",
        fourin="four four four",
        fourlast="four four four four"
    );
            """,
        "bigquery",
    ),
    (
        """
    CREATE TABLE data.test ( col STRING OPTIONS(description='test') ) OPTIONS(description='test');

    """,
        "bigquery",
    ),
    (
        """
    CREATE TABLE data.test(
        field_a INT OPTIONS(description='some description')
    )
    PARTITION BY RANGE_BUCKET(field_a, GENERATE_ARRAY(10, 1000, 1));""",
        "bigquery",
    ),
    (
        """CREATE TABLE data.test(
       field_a INT OPTIONS(description='some description')
     )
     PARTITION BY RANGE_BUCKET(field_a, [1,2,3]]) ;""",
        "bigquery",
    ),
    (
        """CREATE TABLE data.test(
       field_a INT OPTIONS(description='some description')
     )
     PARTITION BY DATE_TRUNC(field, MONTH);""",
        "bigquery",
    ),
    (
        """CREATE TABLE t1 (
      val INT,
    );
    CREATE INDEX idx1 ON t1(val);""",
        "bigquery",
    ),
    (
        """CREATE TABLE student (id INT, name STRING, age INT) USING CSV
        COMMENT 'this is a comment'
        TBLPROPERTIES ('foo'='bar');""",
        "databricks",
    ),
    (
        """CREATE TABLE student (id INT, name STRING, age INT)
        USING CSV
        PARTITIONED BY (age);""",
        "databricks",
    ),
    (
        """CREATE TABLE t1 (
    ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    dt DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP);""",
        "mysql",
    ),
    (
        """create table test(
    `id` bigint not null,
    `updated_at` timestamp(3) not null default current_timestamp(3) on update current_timestamp(3),
    primary key (id));""",
        "mysql",
    ),
    (
        """
    CREATE TABLE t_table_records (
    id VARCHAR (255) NOT NULL,
    create_time datetime DEFAULT CURRENT_TIMESTAMP NOT NULL,
    creator VARCHAR (32) DEFAULT 'sys' NOT NULL,
    current_rows BIGINT,
    edit_time datetime DEFAULT CURRENT_TIMESTAMP NOT NULL,
    editor VARCHAR (32) DEFAULT 'sys' NOT NULL,
    managed_database_database VARCHAR (255) NOT NULL,
    managed_database_schema VARCHAR (255),
    managed_database_table VARCHAR (255) NOT NULL,
    source_database_database VARCHAR (255) NOT NULL,
    source_database_jdbc VARCHAR (255) NOT NULL,
    source_database_schema VARCHAR (255),
    source_database_table VARCHAR (255) NOT NULL,
    source_database_type VARCHAR (255) NOT NULL,
    source_rows BIGINT,
    PRIMARY KEY (id)
    ) ENGINE = INNODB DEFAULT CHARSET = utf8mb4 COMMENT = '导入元数据管理';
    """,
        "mysql",
    ),
    (
        """
CREATE TABLE IF NOT EXISTS database.table_name
    (
        [cifno] [numeric](10, 0) IDENTITY(1,1) NOT NULL,
    )
""",
        "mysql",
    ),
    (
        """CREATE TABLE IF NOT EXISTS `ohs`.`authorized_users` (
      `id` INT(6) UNSIGNED NOT NULL AUTO_INCREMENT,
      `signum` VARCHAR(256) NOT NULL,
      `role` INT(2) UNSIGNED NOT NULL,
      `first_name` VARCHAR(64) NOT NULL,
      `last_name` VARCHAR(64) NOT NULL,
      `created_at` DATETIME NULL DEFAULT NULL,
      `created_by` VARCHAR(128) NOT NULL,
      `updated_at` TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
      `updated_by` VARCHAR(128) NULL DEFAULT NULL,
      PRIMARY KEY (`id`),
      INDEX `id` (`id` ASC) VISIBLE)
    ENGINE = InnoDB""",
        "mysql",
    ),
    (
        """CREATE TABLE `employee` (
      `user_id` int(11) NOT NULL AUTO_INCREMENT,
      `user_name` varchar(50) NOT NULL,
      `authority` int(11) DEFAULT '1' COMMENT 'user auth',
      PRIMARY KEY (`user_id`),
      KEY `FK_authority` (`user_id`,`user_name`)
    ) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8;""",
        "mysql",
    ),
    (
        """CREATE TABLE `posts`(
        `integer_column__index` INT NOT NULL INDEX
    );""",
        "mysql",
    ),
    (
        """CREATE TABLE `posts`(
        `integer_column__index` INT NOT NULL INDEX
    ) ENGINE=InnoDB AUTO_INCREMENT=4682 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='test';""",
        "mysql",
    ),
    (
        """
CREATE TABLE myset (
     cancellation_type enum('enabled','disabled') NOT NULL DEFAULT 'enabled'
);
""",
        "mysql",
    ),
    (
        """
    CREATE TABLE `table_notes` (
    `id` int NOT NULL AUTO_INCREMENT,
    `notes` varchar(255) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci NOT NULL,
    );""",
        "mysql",
    ),
    (
        """
CREATE TABLE employee (
     first_name VARCHAR2(128),
     last_name VARCHAR2(128),
     salary_1 NUMBER(6) ENCRYPT,
     empID NUMBER ENCRYPT NO SALT,
     salary NUMBER(6) ENCRYPT USING '3DES168');

CREATE TABLE employee_2 (
     first_name VARCHAR2(128),
     last_name VARCHAR2(128),
     empID NUMBER ENCRYPT 'NOMAC' ,
     salary NUMBER(6));
""",
        "oracle",
    ),
    (
        """

    CREATE TABLE employee (
        first_name VARCHAR2(128),
        last_name VARCHAR2(128),
        salary_1 NUMBER(6) ENCRYPT,
        empID NUMBER ENCRYPT NO SALT,
        salary NUMBER(6) ENCRYPT USING '3DES168');

    CREATE TABLE employee_2 (
        first_name VARCHAR2(128),
        last_name VARCHAR2(128),
        empID NUMBER ENCRYPT 'NOMAC' ,
        salary NUMBER(6));
    """,
        "oracle",
    ),
    (
        """
Create Table emp_table (
empno Number,
ename Varchar2(100),
sal Number,
photo Blob
)
Storage ( Initial 5m Next 5m Maxextents Unlimited )
""",
        "oracle",
    ),
    (
        """
CREATE TABLE order_items
    ( order_id           NUMBER(12) NOT NULL,
      line_item_id       NUMBER(3)  NOT NULL,
      product_id         NUMBER(6)  NOT NULL,
      unit_price         NUMBER(8,2),
      quantity           NUMBER(8),
      CONSTRAINT order_items_fk
      FOREIGN KEY(order_id) REFERENCES orders(order_id)
    )
    PARTITION BY REFERENCE(order_items_fk);
""",
        "oracle",
    ),
    (
        """
    create table ACT_RU_VARIABLE (
        ID_ NVARCHAR2(64) not null,
        REV_ INTEGER,
        TYPE_ NVARCHAR2(255) not null,
        NAME_ NVARCHAR2(255) not null,
        EXECUTION_ID_ NVARCHAR2(64),
        PROC_INST_ID_ NVARCHAR2(64),
        TASK_ID_ NVARCHAR2(64),
        BYTEARRAY_ID_ NVARCHAR2(64),
        DOUBLE_ NUMBER(*,10),
        LONG_ NUMBER(19,0),
        TEXT_ NVARCHAR2(2000),
        TEXT2_ NVARCHAR2(2000),
        primary key (ID_)
    );
    """,
        "oracle",
    ),
    (
        """
CREATE TABLE meta_criteria_combo
(
  parent_criterion_id NUMBER(3),
  child_criterion_id  NUMBER(3),
  include_exclude_ind CHAR(1) NOT NULL CONSTRAINT chk_metalistcombo_logicalopr
  CHECK (include_exclude_ind IN ('I', 'E')),
  CONSTRAINT pk_meta_criteria_combo PRIMARY KEY(parent_criterion_id, child_criterion_id),
  CONSTRAINT fk_metacritcombo_parent FOREIGN KEY(parent_criterion_id) REFERENCES meta_criteria ON DELETE CASCADE,
  CONSTRAINT fk_metacritcombo_child FOREIGN KEY(child_criterion_id) REFERENCES meta_criteria
) ORGANIZATION INDEX;

GRANT SELECT ON meta_criteria_combo TO PUBLIC;
""",
        "oracle",
    ),
    (
        """
create table test (
  col varchar2(30 char) default user not null
);
""",
        "oracle",
    ),
    (
        """create table event_types (
    id number constraint event_types_id_pk primary key ) ;""",
        "oracle",
    ),
    (
        """create table event_types
    ( id number generated by default on null as identity  ) ;""",
        "oracle",
    ),
    (
        """create table event_types
    ( id number GENERATED BY DEFAULT AS IDENTITY  ) ;""",
        "oracle",
    ),
    (
        """create table event_types
    ( id number GENERATED ALWAYS AS IDENTITY ) ;""",
        "oracle",
    ),
    (
        """
    create table sales(
    qtysold smallint not null encode mostly8,
    pricepaid decimal(8,2) encode delta32k,
    commission decimal(8,2) encode delta32k,
    )
    """,
        "redshift",
    ),
    (
        """
    create table sales(
    salesid integer not null,
    listid integer not null,
    sellerid integer not null,
    buyerid integer not null,
    eventid integer not null encode mostly16,
    dateid smallint not null,
    qtysold smallint not null encode mostly8,
    pricepaid decimal(8,2) encode delta32k,
    commission decimal(8,2) encode delta32k,
    saletime timestamp,
    primary key(salesid),
    foreign key(listid) references listing(listid),
    foreign key(sellerid) references users(userid),
    foreign key(buyerid) references users(userid),
    foreign key(dateid) references date(dateid))
    distkey(listid)
    compound sortkey(listid,sellerid);
    """,
        "redshift",
    ),
    (
        """
    create table t1(col1 int distkey) diststyle key;
    """,
        "redshift",
    ),
    (
        """
    create table t2(c0 int, c1 varchar) encode auto;
    """,
        "redshift",
    ),
    (
        """
    create table customer_interleaved (
    c_custkey     	integer        not null,
    c_name        	varchar(25)    not null,
    c_address     	varchar(25)    not null,
    c_city        	varchar(10)    not null,
    c_nation      	varchar(15)    not null,
    c_region      	varchar(12)    not null,
    c_phone       	varchar(15)    not null,
    c_mktsegment      varchar(10)    not null)
    diststyle all
    interleaved sortkey (c_custkey, c_city, c_mktsegment);
    """,
        "redshift",
    ),
    (
        """
    create temp table tempevent(
        qtysold smallint not null encode mostly8,
        pricepaid decimal(8,2) encode delta32k,
        commission decimal(8,2) encode delta32k,
        );
    """,
        "redshift",
    ),
    (
        """
    create temporary table tempevent(
        qtysold smallint not null encode mostly8,
        pricepaid decimal(8,2) encode delta32k,
        commission decimal(8,2) encode delta32k,
        );
    """,
        "redshift",
    ),
    (
        """
    create temp table tempevent(like event);
    """,
        "redshift",
    ),
]
for statement, dialect in ddl_statements:
    try:
        parsed = sqlglot.parse_one(sql=statement, read=dialect.lower())
        tables = list(parsed.find_all(sqlglot.expressions.Table))
    except Exception:
        print("sqlglot_failure on", dialect)
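Printing one line per failure works, but a per-dialect tally makes the size of the discrepancy easier to see. A minimal sketch (the `tally_failures` helper and the injectable `parse` callback are illustrative assumptions, not part of the PR):

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def tally_failures(
    samples: Iterable[Tuple[str, str]],
    parse: Callable[[str, str], object],
) -> Counter:
    """Count parse failures per dialect.

    `parse` is any callable that raises on a failed parse, e.g.
    lambda s, d: sqlglot.parse_one(s, read=d.lower()).
    """
    failures: Counter = Counter()
    for statement, dialect in samples:
        try:
            parse(statement, dialect)
        except Exception:
            failures[dialect] += 1
    return failures
```

With sqlglot installed, something like `tally_failures(ddl_statements, lambda s, d: sqlglot.parse_one(s, read=d.lower()))` would then report the failure count for each dialect in the list above.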

@dmaresma
Contributor Author

I found that `AUTOINCREMENT START # INCREMENT [NOORDER|ORDER]` makes sqlglot fail; a PR has already been sent to the sqlglot team (tobymao/sqlglot#5223).
Also, I use the ${} syntax as an internal token in my DDL; sqlglot does not accept it, so I automatically substitute it out of the SQL (as a workaround).
I fixed the table description and column description in the mapping,
and I fixed the column tag capture too.
Thanks for your attention.
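The ${} substitution mentioned above could look roughly like this: mask `${...}` template tokens with parse-safe identifiers before handing the SQL to sqlglot, then restore them afterwards. A hedged sketch; the function names, the `__TPL_...__` placeholder format, and the regex are my own assumptions, not the PR's actual implementation:

```python
import re
from typing import Dict, Tuple

# matches templating tokens such as ${database_name}
TOKEN_RE = re.compile(r"\$\{(\w+)\}")


def mask_template_tokens(sql: str) -> Tuple[str, Dict[str, str]]:
    """Swap ${name} tokens for plain identifiers a SQL parser can accept."""
    mapping: Dict[str, str] = {}

    def _sub(match: re.Match) -> str:
        placeholder = f"__TPL_{match.group(1)}__"
        mapping[placeholder] = match.group(0)
        return placeholder

    return TOKEN_RE.sub(_sub, sql), mapping


def restore_template_tokens(sql: str, mapping: Dict[str, str]) -> str:
    """Put the original ${...} tokens back after parsing/rendering."""
    for placeholder, original in mapping.items():
        sql = sql.replace(placeholder, original)
    return sql
```

For example, `CREATE TABLE ${database_name}.MySchemaName."MyTableName" (...)` would be parsed as `CREATE TABLE __TPL_database_name__.MySchemaName."MyTableName" (...)` and the token restored afterwards.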
