# Data Transformation Using Snowpark for Python

This script demonstrates simple data transformations on Snowflake objects using Snowpark for Python. We begin with a Snowflake table containing hourly sales data spanning 27 years and perform the following transformations:

- Filter the data to 2005 onwards
- Aggregate the number of sales by month and category
- Sort the data by month and category
- Store the result in a new table in Snowflake 

## Import the InterWorks Snowpark package

Before we can begin, we must import the session builder from the InterWorks Snowpark package and use it to create a Snowpark Session object that is connected to our Snowflake environment. Alternatively, you can modify the code to establish a Snowpark Session through any method of your choice.

In [20]:

# Import module to build Snowpark sessions
from interworks_snowpark.snowpark_session_builder import build_snowpark_session_via_parameters_json as build_snowpark_session

# Generate Snowpark session
snowpark_session = build_snowpark_session()
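
If you prefer not to depend on the InterWorks package, a session can also be created with Snowpark's own `Session.builder`. The following is a minimal sketch, assuming the connection details are stored in a local JSON file. The file name `snowflake_connection.json`, the helper names, and the list of required keys are illustrative assumptions rather than part of the original setup.

```python
import json
from pathlib import Path

# Keys this sketch expects in the JSON file (an assumption; adjust to your authentication method)
REQUIRED_KEYS = ("account", "user", "password")


def load_connection_parameters(path):
    """Read connection parameters from a JSON file and check the expected keys are present."""
    parameters = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in parameters]
    if missing:
        raise KeyError(f"Missing connection parameters: {missing}")
    return parameters


def build_session_from_json(path="snowflake_connection.json"):
    """Create a Snowpark session from the parameters stored in a JSON file."""
    # Imported here so the loader above can be used without Snowpark installed
    from snowflake.snowpark import Session

    return Session.builder.configs(load_connection_parameters(path)).create()
```

With such a file in place, `snowpark_session = build_session_from_json()` would stand in for the InterWorks builder call above.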

## Retrieve source table from staging

Our source data is contained in the table `"SALES_DB"."STAGING"."PRODUCT_SALES"`, which we read into a Snowpark DataFrame object.

In [22]:
df_source = snowpark_session.table('"SALES_DB"."STAGING"."PRODUCT_SALES"')

We can quickly query this DataFrame to confirm its contents.

In [23]:
df_source.show()

----------------------------------------------------------------------------
|"SALE_DATE"  |"CATEGORY"   |"SUBCATEGORY"            |"SALES"             |
----------------------------------------------------------------------------
|1995-01-01   |PRO EDITION  |PRO ADMIN                |24.625859817152804  |
|1995-01-01   |PRO EDITION  |PRO DEVELOPER            |41.043099695254675  |
|1995-01-01   |PRO EDITION  |PRO CONSUMER             |16.417239878101874  |
|1995-01-01   |ENTERPRISE   |ENTERPRISE ADMIN         |25.752198519808445  |
|1995-01-01   |ENTERPRISE   |ENTERPRISE DEVELOPER     |34.33626469307793   |
|1995-01-01   |ENTERPRISE   |ENTERPRISE COLLABORATOR  |17.168132346538965  |
|1995-01-01   |ENTERPRISE   |ENTERPRISE CONSUMER      |8.584066173269482   |
|1995-01-02   |PRO EDITION  |PRO ADMIN                |5.088506747932085   |
|1995-01-02   |PRO EDITION  |PRO DEVELOPER            |8.480844579886808   |
|1995-01-02   |PRO EDITION  |PRO CONSUMER             |3.3923378319547237  |
----------------------------------------------------------------------------

## Transformations

We are now ready to begin our transformations. Before we can do so, we must import a few functions from the `snowflake.snowpark.functions` module:

- `col`, which allows us to target a specific field in a Snowpark DataFrame
- `year`, the native [Snowflake YEAR function](https://docs.snowflake.com/en/sql-reference/functions/year.html), so that we can quickly identify and filter by the year of a timestamp
- `date_trunc`, the native [Snowflake DATE_TRUNC function](https://docs.snowflake.com/en/sql-reference/functions/date_trunc.html), so that we can reduce a timestamp to the first day of its month
- `to_date`, the native [Snowflake TO_DATE function](https://docs.snowflake.com/en/sql-reference/functions/to_date.html), so that we can convert our truncated timestamps into dates
- `sum`, which we import under the alias `sf_sum` to avoid shadowing the standard Python `sum` function

In [25]:
from snowflake.snowpark.functions import col
from snowflake.snowpark.functions import year
from snowflake.snowpark.functions import date_trunc
from snowflake.snowpark.functions import to_date
from snowflake.snowpark.functions import sum as sf_sum

### Filter the data to 2005 onwards

We leverage the `YEAR` function in Snowflake to determine the year for each record, then filter for when this is greater than or equal to 2005.

In [27]:
df_filtered = df_source.filter(year(col("SALE_DATE")) >= 2005)

df_filtered.show()

----------------------------------------------------------------------------
|"SALE_DATE"  |"CATEGORY"   |"SUBCATEGORY"            |"SALES"             |
----------------------------------------------------------------------------
|2005-01-01   |PRO EDITION  |PRO ADMIN                |13.499431243680485  |
|2005-01-01   |PRO EDITION  |PRO DEVELOPER            |22.499052072800808  |
|2005-01-01   |PRO EDITION  |PRO CONSUMER             |8.999620829120323   |
|2005-01-01   |ENTERPRISE   |ENTERPRISE ADMIN         |11.302098078867544  |
|2005-01-01   |ENTERPRISE   |ENTERPRISE DEVELOPER     |15.069464105156724  |
|2005-01-01   |ENTERPRISE   |ENTERPRISE COLLABORATOR  |7.534732052578362   |
|2005-01-01   |ENTERPRISE   |ENTERPRISE CONSUMER      |3.767366026289181   |
|2005-01-02   |PRO EDITION  |PRO ADMIN                |35.80700202224469   |
|2005-01-02   |PRO EDITION  |PRO DEVELOPER            |59.67833670374115   |
|2005-01-02   |PRO EDITION  |PRO CONSUMER             |23.87133468149646   |
----------------------------------------------------------------------------
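
To make the filter condition concrete, here is a plain-Python analogue that runs locally without a Snowflake connection. It shows that keeping rows whose year is at least 2005 selects the same rows as keeping those dated on or after 1 January 2005. The helper name `from_year_onwards` and the sample rows (with made-up sales values) are illustrative assumptions and are not part of the Snowpark API.

```python
from datetime import date


def from_year_onwards(rows, first_year):
    """Keep rows whose sale date falls in first_year or later."""
    return [row for row in rows if row["SALE_DATE"].year >= first_year]


# Illustrative rows with made-up sales values
rows = [
    {"SALE_DATE": date(2004, 12, 31), "SALES": 1.0},
    {"SALE_DATE": date(2005, 1, 1), "SALES": 2.0},
    {"SALE_DATE": date(2021, 6, 15), "SALES": 3.0},
]

filtered = from_year_onwards(rows, 2005)

# The same rows are selected by comparing against the first day of the year
by_boundary = [row for row in rows if row["SALE_DATE"] >= date(2005, 1, 1)]

print([row["SALE_DATE"].isoformat() for row in filtered])
# prints ['2005-01-01', '2021-06-15']
```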

### Aggregate the number of sales by month and category

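Now that we have filtered our data, we are ready to group it with the `group_by` method, truncating each `SALE_DATE` to the first day of its month and summing `SALES` within each month and category. Note how we use `sf_sum` rather than the standard Python `sum` function. The final `select` renames the automatically generated column names to `SALE_MONTH` and `SALES`.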

In [28]:
df_grouped = df_filtered \
  .group_by(to_date(date_trunc('month', col("SALE_DATE"))), col("CATEGORY")) \
  .agg(sf_sum(col("SALES"))) \
  .select(col("TO_DATE(DATE_TRUNC(MONTH, SALE_DATE))").alias("SALE_MONTH"), col("CATEGORY"), col("SUM(SALES)").alias("SALES"))

df_grouped.show()

----------------------------------------
|"SALE_MONTH"  |"CATEGORY"   |"SALES"  |
----------------------------------------
|2005-01-01    |PRO EDITION  |2525.0   |
|2005-01-01    |ENTERPRISE   |2114.0   |
|2005-02-01    |PRO EDITION  |2459.5   |
|2005-02-01    |ENTERPRISE   |2109.0   |
|2005-03-01    |PRO EDITION  |2364.75  |
|2005-03-01    |ENTERPRISE   |2366.0   |
|2005-04-01    |PRO EDITION  |2041.5   |
|2005-04-01    |ENTERPRISE   |2300.0   |
|2005-05-01    |PRO EDITION  |2174.25  |
|2005-05-01    |ENTERPRISE   |2569.0   |
----------------------------------------
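
As a rough local illustration of what this step computes, the sketch below reproduces the logic in plain Python: each date is truncated to the first day of its month, and sales are summed per month and category. The function names and the sample rows (with made-up sales values) are assumptions for illustration only. In the actual pipeline, Snowflake performs this work.

```python
from collections import defaultdict
from datetime import date


def truncate_to_month(day):
    """Return the first day of the month containing `day`, like DATE_TRUNC('month', ...)."""
    return day.replace(day=1)


def monthly_sales_by_category(rows):
    """Sum sales per (month, category) pair."""
    totals = defaultdict(float)
    for sale_date, category, sales in rows:
        totals[(truncate_to_month(sale_date), category)] += sales
    return dict(totals)


# Illustrative rows with made-up sales values
rows = [
    (date(2005, 1, 1), "PRO EDITION", 10.0),
    (date(2005, 1, 2), "PRO EDITION", 5.0),
    (date(2005, 1, 2), "ENTERPRISE", 4.0),
    (date(2005, 2, 1), "ENTERPRISE", 2.5),
]

monthly_totals = monthly_sales_by_category(rows)
```

Here the two January rows for `PRO EDITION` collapse into a single entry keyed by `2005-01-01`, which mirrors how the grouped output above has one row per month and category.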



### Sort the data

Using the `sort()` method, we can sort the data by month and then by category.

In [29]:
df_sorted = df_grouped.sort(col("SALE_MONTH"), col("CATEGORY"))

df_sorted.show()

----------------------------------------
|"SALE_MONTH"  |"CATEGORY"   |"SALES"  |
----------------------------------------
|2005-01-01    |ENTERPRISE   |2114.0   |
|2005-01-01    |PRO EDITION  |2525.0   |
|2005-02-01    |ENTERPRISE   |2109.0   |
|2005-02-01    |PRO EDITION  |2459.5   |
|2005-03-01    |ENTERPRISE   |2366.0   |
|2005-03-01    |PRO EDITION  |2364.75  |
|2005-04-01    |ENTERPRISE   |2300.0   |
|2005-04-01    |PRO EDITION  |2041.5   |
|2005-05-01    |ENTERPRISE   |2569.0   |
|2005-05-01    |PRO EDITION  |2174.25  |
----------------------------------------
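
The ordering can be checked locally with a plain-Python analogue: sorting on a `(month, category)` tuple orders by month first and breaks ties by category, just as passing two columns to `sort()` does. This is only an illustrative sketch. The rows are the first four from the grouped output shown earlier, and the variable names are assumptions.

```python
from datetime import date

# First four rows of the grouped output shown earlier
grouped_rows = [
    (date(2005, 1, 1), "PRO EDITION", 2525.0),
    (date(2005, 1, 1), "ENTERPRISE", 2114.0),
    (date(2005, 2, 1), "PRO EDITION", 2459.5),
    (date(2005, 2, 1), "ENTERPRISE", 2109.0),
]

# Sort by month first, then by category within each month
sorted_rows = sorted(grouped_rows, key=lambda row: (row[0], row[1]))
```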



### Store the result in a new table in Snowflake

Finally, we can output the data into a table in Snowflake. The `overwrite` mode replaces the target table if it already exists.

In [30]:
df_sorted.write.mode("overwrite").save_as_table('"SALES_DB"."CLEAN"."PRODUCT_SALES"')
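
The `mode` argument controls what happens when the target table already exists. As a hedged sketch, the hypothetical helper `save_result` below wraps the write call and rejects unexpected modes before anything is sent to Snowflake. The set of modes listed is an assumption based on the save modes documented for Snowpark, so check it against your installed version.

```python
# Save modes accepted by this sketch (an assumption; newer Snowpark versions may support more)
SUPPORTED_MODES = {"append", "overwrite", "errorifexists", "ignore"}


def save_result(df, table_name, mode="overwrite"):
    """Write a Snowpark DataFrame to a table after validating the save mode."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"Unsupported save mode: {mode!r}")
    df.write.mode(mode).save_as_table(table_name)
```

Calling `save_result(df_sorted, '"SALES_DB"."CLEAN"."PRODUCT_SALES"')` would then behave like the cell above, while a typo such as `mode="overwite"` fails early with a clear error.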

### Verify Results

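We can read our new table back from Snowflake to verify the results. Note that Snowflake does not guarantee row order when a table is queried without an explicit sort, so the order you see may differ from the output below.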

In [31]:
snowpark_session.table('"SALES_DB"."CLEAN"."PRODUCT_SALES"').show()

----------------------------------------
|"SALE_MONTH"  |"CATEGORY"   |"SALES"  |
----------------------------------------
|2005-01-01    |ENTERPRISE   |2114.0   |
|2005-01-01    |PRO EDITION  |2525.0   |
|2005-02-01    |ENTERPRISE   |2109.0   |
|2005-02-01    |PRO EDITION  |2459.5   |
|2005-03-01    |ENTERPRISE   |2366.0   |
|2005-03-01    |PRO EDITION  |2364.75  |
|2005-04-01    |ENTERPRISE   |2300.0   |
|2005-04-01    |PRO EDITION  |2041.5   |
|2005-05-01    |ENTERPRISE   |2569.0   |
|2005-05-01    |PRO EDITION  |2174.25  |
----------------------------------------

