
Feature: implement snowflake export function #156 #163


Merged: 2 commits merged into develop from feature/snowflake-export-parquet on May 4, 2023

Conversation

@onlyjackfrost (Contributor) commented on Apr 24, 2023

Description

This pull request implements the Snowflake export function, which exports the selected data to a local directory in Parquet format.

Export flow:

  • Use the Snowflake "COPY INTO ..." command to convert the selected data into Parquet format and store it in the user stage. Snowflake generates a UUID for this request; we append this unique ID to the stage file name so that concurrent copy requests cannot collide on the same name, which would cause Snowflake to throw an error.
  • Use the Snowflake "GET ..." command to download the Parquet file from the user stage to the local directory. We use the "pattern" parameter of this command to filter for the file with the unique ID.
  • Use the Snowflake "REMOVE ..." command to delete the Parquet file from the user stage to avoid accruing storage costs. Again, we use the "pattern" parameter to filter for the file with the unique ID. The whole flow is sketched below.
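
Putting the three steps together, here is a minimal sketch of the flow (hypothetical, not the exact PR code: `sql`, `directory`, and `queryId` — the UUID Snowflake generates for the COPY request — are assumed to be in scope):

// 1. COPY INTO: unload the query result into the user stage as Parquet.
//    INCLUDE_QUERY_ID=true makes Snowflake embed the query UUID in the
//    generated file names, so concurrent exports cannot collide.
const stageFilePath = '@~/vulcan_cache_stage/';
const copySQL = `COPY INTO ${stageFilePath} FROM (${sql}) FILE_FORMAT = (TYPE = 'parquet') HEADER=true INCLUDE_QUERY_ID=true MAX_FILE_SIZE=5368709120;`;

// 2. GET: download only the staged files that carry this query's UUID.
const getSQL = `GET ${stageFilePath} file://${directory} PATTERN='.*${queryId}.*';`;

// 3. REMOVE: delete the staged files afterwards to avoid storage costs.
const removeSQL = `REMOVE ${stageFilePath} PATTERN='.*${queryId}.*';`;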

Issue ticket number

Issue #156

Additional Context

  • The encoding type of the columns in the Parquet file apparently cannot be specified in Snowflake.
  • We add a query ID (a UUID generated by Snowflake) to our stage file name, and we allow partitioning when the file size limit is reached.
  • We set the max file size of each file to the Snowflake maximum (5 GB).
  • The max file size has no documented lower bound, but in practice the minimum file size is 1 MB (observed while developing).
  • Each partition file has a suffix in its name matching a pattern like "_\d{1,2}_\d{1,2}_\d{1,2}\.parquet".


  • The COPY command does not validate data type conversions for Parquet files; see the usage notes.
  • The GET command automatically decrypts the Parquet file with the key that was used to encrypt it; see the usage notes.
  • We still could not get the proper column encoding type after adjusting the session parameter ENABLE_UNLOAD_PHYSICAL_TYPE_OPTIMIZATION; see the note. (You may encounter an invalid Parquet type when reading the file with parquetjs or a derivative package such as parquets.)

Result

Use parquet-cli to view the downloaded Parquet file:


@codecov-commenter commented on May 2, 2023

Codecov Report

Patch coverage: 92.68% and project coverage change: +0.02% 🎉

Comparison is base (7a79c10) 92.91% compared to head (dc72b97) 92.94%.


Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #163      +/-   ##
===========================================
+ Coverage    92.91%   92.94%   +0.02%     
===========================================
  Files          291      291              
  Lines         4473     4506      +33     
  Branches       604      606       +2     
===========================================
+ Hits          4156     4188      +32     
- Misses         203      204       +1     
  Partials       114      114              
Flag Coverage Δ
build 94.64% <ø> (ø)
core 93.96% <0.00%> (-0.05%) ⬇️
extension-dbt 97.43% <ø> (ø)
extension-debug-tools 98.11% <ø> (ø)
extension-driver-bq 85.21% <ø> (ø)
extension-driver-duckdb 97.53% <ø> (ø)
extension-driver-pg 96.11% <ø> (ø)
extension-driver-snowflake 96.22% <95.00%> (+1.63%) ⬆️
integration-testing 96.15% <ø> (ø)
serve 90.04% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
packages/core/src/models/extensions/dataSource.ts 87.50% <0.00%> (-5.84%) ⬇️
...on-driver-snowflake/src/lib/snowflakeDataSource.ts 94.59% <95.00%> (+4.11%) ⬆️


@kokokuo (Contributor) left a comment:

Aside from some suggestions, everything else LGTM 👍

// before:
this.logger.debug(`Acquiring connection from ${profileName}`);
const client = await pool.acquire();
this.logger.debug(`Acquired connection from ${profileName}`);
// after:
const { connection, pool } = await this.getConnection(profileName);

Great to refactor it! 👍


// use protected for testing purposes
protected getCopyToStageSQL(sql: string, stageFilePath: string) {
  return `COPY INTO ${stageFilePath} FROM (${sql}) FILE_FORMAT = (TYPE = 'parquet') HEADER=true INCLUDE_QUERY_ID=true MAX_FILE_SIZE=5368709120;`;
}

Suggest adding a comment to explain why MAX_FILE_SIZE is set to 5368709120 rather than another value, and pasting the documentation link to help other members understand :D
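
For instance, the comment could read something like this (a hypothetical wording; per this PR, 5368709120 bytes is 5 GB, the maximum Snowflake allows):

// MAX_FILE_SIZE is set to 5368709120 bytes (5 GB), the largest value Snowflake
// allows for a single unloaded file; see the COPY INTO <location> documentation.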

const { profileName, sql } = exportOptions;

const { connection, pool } = await this.getConnection(profileName);
directory = /^\./.exec(directory) ? directory : `./${directory}`;

Currently, our directory may be a relative path based on the project, or an absolute temp-directory path from the root if the user does not set folderPath in the cache settings of vulcan.yaml:

Relative path:
> `tmp/schem1/pg/cache_department`
Absolute path:
> `/var/folders/51/049hk_157kgc8ht18q61c1lm0000gn/T/vulcan/cache/schem1/pg/cache_department`

If the Snowflake GET path needs an absolute path, then the current check that prefixes the directory with `.` may cause an issue, and the prefix would need to be removed.

We could use path.resolve to normalize the above two cases directly (it seems you have already written it in the following code):


Btw, I suggest adding a comment to explain why the Snowflake GET command needs an absolute path, and pasting the documentation link to help other members know.
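
A minimal sketch of the path.resolve approach (hypothetical; the variable names are assumed):

import * as path from 'path';

// path.resolve normalizes both cases to an absolute path:
//   'tmp/schem1/pg/cache_department'  -> '<cwd>/tmp/schem1/pg/cache_department'
//   '/var/.../cache_department'       -> unchanged (already absolute)
const absoluteDirectory = path.resolve(directory);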

const copyStatement = await this.getStatementFromSQL(
  connection,
  builtSQL,
  []
);

Maybe we could refactor the getStatementFromSQL method so that the binding arguments default to an empty array; then we wouldn't always need to pass [] when copying to the stage, getting files, and removing from the stage.
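
For example, a sketch of the suggested refactor (the exact signature and types here are assumptions):

// Hypothetical: default the bind parameters to an empty array.
protected async getStatementFromSQL(
  connection: Connection,
  sql: string,
  binds: unknown[] = []
): Promise<Statement> {
  // ...body unchanged...
}

// Callers can then drop the trailing [] argument:
const copyStatement = await this.getStatementFromSQL(connection, builtSQL);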

public override async export(exportOptions: ExportOptions): Promise<void> {
  let { directory } = exportOptions;
  const { profileName, sql } = exportOptions;


Suggest removing the newline.

// Assert
const files = fs.readdirSync(directory);
expect(files.length).toBe(1);
expect(/parquet$/.exec(path.basename(files[0])));

Well, expect(/parquet$/.exec(path.basename(files[0]))) may not actually assert the result.

The value passed to expect(value) is only the actual value; we need to call a matcher method to perform the assertion.

Please use an expect matcher to assert the result, e.g.:

expect(files[0]).toMatch(/\.parquet$/);

const files = fs.readdirSync(directory);
expect(files.length).toBe(4);
// check each file in the files exists
files.forEach((file) => {

The same issue and suggestion as the lines above.


Suggest enhancing the test cases: according to the export method's logic, we may still need two more test cases. The first checks that an error is thrown when the directory does not exist; the second checks the error case that can happen while caching.
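
A rough sketch of the first case (hypothetical names; assuming Jest and a dataSource instance under test):

it('should throw when the export directory does not exist', async () => {
  await expect(
    dataSource.export({ sql, directory: 'no/such/dir', profileName })
  ).rejects.toThrow();
});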

const stage = '@~/vulcan_cache_stage';
const stageFilePath = `${stage}/`;
const builtSQL = this.getCopyToStageSQL(sql, stageFilePath);
// copy data to a named stage

Suggest enriching the comment to explain why we need to copy the data to the stage, so members understand that moving it to the stage is necessary.
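
For example, the comment might read (a hypothetical wording, based on the flow described in this PR):

// Snowflake cannot write query results directly to a local file, so we first
// COPY the result into the user stage as Parquet, then GET it to the local directory.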

await this.waitStatement(getStatement);
this.logger.debug(`Exported parquet files to ${directory}`);

// remove parquet file from stage

Suggest enriching the comment to describe that we remove the staged Parquet files because they were kept in the stage only temporarily for downloading, so they need to be removed 😄
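
For instance (a hypothetical wording):

// Remove the Parquet files from the stage: they were staged only temporarily
// so GET could download them; leaving them would keep incurring storage cost.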

@kokokuo (Contributor) left a comment:

Thanks for the fixes and the updated commits, LGTM 👍 👍

@kokokuo kokokuo merged commit a75172c into develop May 4, 2023
@kokokuo kokokuo deleted the feature/snowflake-export-parquet branch May 4, 2023 04:23