
Snowpark Extensions

Overview

The Snowpark Extensions project aims to simplify the migration process from Apache Spark to Snowpark for Scala developers. It provides a set of helper methods and utilities built as extensions on top of the existing Snowpark Scala APIs.

The core goal is to minimize the manual code changes required when migrating from Spark to Snowpark. This is achieved by leveraging Scala's implicit classes to effectively "overload" existing Snowpark classes such as Column, DataFrame, and Session with additional functionality that is not available out of the box in the Snowpark Scala API.
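To illustrate the mechanism (this is a sketch of the general pattern, not the library's actual source; the ColumnOps name and isNullOrEmpty method are invented for the example), a Scala implicit class can graft a new method onto Snowpark's Column so that, once imported, it reads like a built-in:

import com.snowflake.snowpark.Column
import com.snowflake.snowpark.functions.lit

object ExampleExtensions {
  // Illustrative implicit class: after `import ExampleExtensions._`,
  // isNullOrEmpty can be called as if it were defined on Column itself.
  implicit class ColumnOps(val c: Column) extends AnyVal {
    def isNullOrEmpty: Column = c.is_null or (c === lit(""))
  }
}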

Features

The Snowpark Extensions project offers the following features:

  • Implicit Column Extensions - Additional helper methods for Column to simplify common data transformation tasks
  • Implicit DataFrame Extensions - Extra functionality for DataFrame to streamline migrations, such as improved join APIs
  • Implicit Session Extensions - Helper utilities for Session to simplify setup and configuration

By leveraging implicits, these extensions provide overlaid APIs without requiring changes to existing Snowpark imports or references.

In some situations, a function is easier to implement by registering a SQL or JavaScript UDF; the code for several of these can be found in the scripts folder. A sketch of the registration mechanism follows.
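The snippet below is illustrative only: it registers a SQL UDF roughly in the spirit of the _datediff helper described later, but the authoritative definitions are the scripts themselves, which may differ in details.

// Assumes an existing Snowpark session. Registers a SQL UDF once;
// afterwards it can be called from any query in that database/schema.
session.sql("""
  CREATE OR REPLACE FUNCTION _datediff(d1 DATE, d2 DATE)
  RETURNS NUMBER
  AS 'DATEDIFF(day, d2, d1)'
""").collect()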

Usage

To use the Snowpark Extensions project, simply import the extension classes:

import com.snowflake.snowpark_extensions.Extensions._

This will bring all extended Column, DataFrame, and Session functionalities into scope. You can then utilize the additional methods as if they were available directly on the base classes.
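A minimal end-to-end sketch is shown below; the connection file name and the sample query are placeholders, and no specific extension method is assumed (see the API documentation for the actual surface):

import com.snowflake.snowpark.Session
import com.snowflake.snowpark_extensions.Extensions._

// Placeholder properties file holding account, user, and role settings.
val session = Session.builder.configFile("connection.properties").create

// With the extensions imported, any extra Session, DataFrame, or Column
// methods are in scope alongside the standard Snowpark API.
val df = session.sql("select 'hello' as greeting")
df.show()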

Building

The project uses Maven for building:

mvn clean compile package

This will compile the code and package it into a JAR file for distribution and dependency management.

The output JAR can then be included in any Scala application to leverage the Snowpark Extensions helpers.
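For example, an sbt project could pick up the locally built artifact as an unmanaged dependency. The JAR path and file name below are guesses at the default Maven output; check your target directory for the actual name.

// build.sbt (sketch): reference the locally built JAR directly.
Compile / unmanagedJars += Attributed.blank(file("target/snowpark-extensions-0.0.1.jar"))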

SQL Extensions

The following SQL UDFs are provided as scripts in the scripts folder:

  • _concat_ws: Concatenates multiple input string columns into a single string column, using the given separator.
  • _datediff: Calculates the difference in days between two dates. NOTE: The function name has a leading underscore to differentiate it from Snowflake's existing built-in.
  • array_zip: Returns a merged array of arrays.
  • conv: Converts num from from_base to to_base.
  • date_add: Adds days to a date.
  • date_format: Converts a date/timestamp/string to a string in the format specified by the given date format.
  • date_sub: Subtracts days from a date.
  • format_string: Returns a formatted string from printf-style format strings.
  • from_unixtime: Returns a date string representation of a Unix epoch timestamp using the specified format.
  • instr: Returns the position of the first occurrence of the substr column in the given string.
  • isnan: Returns true if expr is NaN, or false otherwise.
  • nanvl: Returns expr1 if it is not NaN, or expr2 otherwise.
  • map_from_arrays: Creates a new map from two arrays.
  • regexp_extract: Extracts the group specified based on the regexp.
  • regexp_like: Returns true/false based on whether a regexp matches.
  • regexp_replaceall: Replaces all matches of a regexp with another string.
  • regexp_split: Splits a string into an array based on the regexp.
  • substring_index: Returns the substring from str before count occurrences of the delimiter.
  • unix_timestamp: Converts a date or date string to seconds since the Unix epoch.
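Once a script has been run, the UDF behaves like any other SQL function and can be called from Snowpark. A small sketch (assumes an existing session and that the _datediff script above has been executed in the current database and schema):

// Call the registered UDF directly in SQL; the dates are placeholders.
session
  .sql("select _datediff('2023-01-10'::date, '2023-01-01'::date) as days_between")
  .show()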

Links

See the full API documentation here: https://snowflake-labs.github.io/snowpark-extensions/