<img src="../../images/NxLogoTransparent.png" alt="Nx Icon" width="200" align="left" />


# Workshop: DataFrame Transformations with Snowpark

### Scenario: 

In this exercise, you will use the Snowpark API to examine raw campaign spend and monthly revenue data and transform it into a DataFrame that your organization’s analysts can easily explore.


### Steps:
1. Install libraries and set connection parameters.
2. Connect and create a Session object.
3. Examine the campaign spend and monthly revenue data sets.
4. Explore the data set through a series of DataFrame transformations.
5. Use a UDF to make DataFrame transformations.

### 1. Install libraries and set connection parameters.

Configure the connection parameters for your Snowflake account.

In [None]:
%pip install ipython-sql
%pip install snowflake-snowpark-python

In [None]:
import os
import getpass
from urllib.parse import quote

# Load Jupyter/IPython sql magic
%load_ext sql

> **&#128221; Note:** Run the following cell and fill in the prompts with your free trial account details.

In [None]:
# Gather account credentials
sf_account   = input('Snowflake Account: ')  # Example: iv29806.us-east-2.aws
sf_user      = input('Snowflake User: ')  # Example: WORKSHOP_USER
sf_password  = getpass.getpass('Snowflake Password: ')  # Input is hidden while typing

# Generate default object names
wh_name    = "COMPUTE_WH"
db_name    = "SNOWPARK_DEMO_DB"

print("\r\nAccount credentials gathered. Select the next code cell to continue.")

### 2. Connect and create a Session object.

The following cell connects to your Snowflake account and creates an instance of `Session`. 

*You needn't modify anything in this cell. Just run it.*

In [None]:
# Import Snowpark Session
from snowflake.snowpark import Session

connection_parameters = {
    "account": sf_account.upper(),
    "user": sf_user.upper(),
    "password": sf_password,
    "warehouse": wh_name,  # default object names defined in the previous cell
    "database": db_name
}

session = Session.builder.configs(connection_parameters).create()
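If you would rather not type credentials interactively, the same dictionary could be assembled from environment variables. The following is a minimal sketch, not part of the workshop: the helper `build_connection_parameters` and the variable names `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`, `SNOWFLAKE_WAREHOUSE`, and `SNOWFLAKE_DATABASE` are assumptions chosen for illustration.

```python
import os


def build_connection_parameters(env=None):
    """Build a connection dictionary from environment variables.

    The variable names are hypothetical; adapt them to your own setup.
    """
    env = os.environ if env is None else env
    required = ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    params = {
        "account": env["SNOWFLAKE_ACCOUNT"].upper(),
        "user": env["SNOWFLAKE_USER"].upper(),
        "password": env["SNOWFLAKE_PASSWORD"],
    }
    # Optional settings are added only when they are present.
    optional = (("warehouse", "SNOWFLAKE_WAREHOUSE"), ("database", "SNOWFLAKE_DATABASE"))
    for key, name in optional:
        if env.get(name):
            params[key] = env[name]
    return params
```

The returned dictionary has the same shape as `connection_parameters` and could be passed to `Session.builder.configs(...)` in the same way.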

### 3. Examine the campaign spend and monthly revenue data sets.

Import the Snowpark `functions` and `types` libraries for use throughout this exercise.

Create a `DataFrame` from each of the tables `campaign_spend` and `monthly_revenue`, then examine their schemas to discover what kind of information is available in these data sets.

*You needn't edit anything in the following cell. Just run it.*

In [None]:
# Import Snowpark functions and types
from snowflake.snowpark.functions import *
from snowflake.snowpark.types import *

# Create a DataFrame for the table 
snow_df_campaing_spend = session.table('SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.CAMPAIGN_SPEND')
snow_df_revenue = session.table('SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.MONTHLY_REVENUE')

print('\nTable CAMPAIGN_SPEND')
for field in snow_df_campaing_spend.schema.fields:
    print(field)

print('\nTable MONTHLY_REVENUE')
for field in snow_df_revenue.schema.fields:
    print(field)

### 4. Explore the data set through a series of DataFrame transformations.

#### 4.1 Select pertinent fields from the snow_df_campaing_spend DataFrame.

Select the following fields from the `snow_df_campaing_spend` DataFrame using the `select()` method (remember to wrap each column name in a `col()` function call):
- CAMPAIGN
- CHANNEL
- DATE
- TOTAL_CLICKS
- TOTAL_COST
- ADS_SERVED


*HINT: Want to peek at the results of your transformations? Use the* `show()` *function to execute the query and display the first ten rows.*

> **&#128221; Note:** See the [Snowpark API Reference (Python)](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/index.html#snowpark-api-reference-python) for details on the DataFrame methods used in this exercise.


In [None]:
snow_df_spend = (snow_df_campaing_spend
        .select(
             col("CAMPAIGN")
            ,col("CHANNEL")
            ,col("DATE")
            ,col("TOTAL_CLICKS")
            ,col("TOTAL_COST")
            ,col("ADS_SERVED")
          ))

# Uncomment the following 'action' statement to execute transformations and view first ten results
#snow_df_spend.show()

#### 4.2 Total Spend per Year and Month For All Channels

Let's transform the data so we can see the total cost per year/month per channel using the `group_by()` and `agg()` Snowpark DataFrame methods.

In [None]:
from snowflake.snowpark.functions import month,year,col,sum

# Group by year, month and channel, sum the cost, then rename the generated columns
snow_df_spend_per_channel = (snow_df_spend
        .group_by(year('DATE'), month('DATE'), 'CHANNEL')
        .agg(sum('TOTAL_COST').as_('TOTAL_COST'))
        .with_column_renamed('"YEAR(DATE)"', "YEAR")
        .with_column_renamed('"MONTH(DATE)"', "MONTH")
        .sort('YEAR', 'MONTH'))

print("Total Spend per Year and Month For All Channels")
snow_df_spend_per_channel.show()
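To make the effect of `group_by()` and `agg()` concrete, the following plain-Python sketch performs the same kind of grouping on a few made-up rows. The sample data and the helper `spend_per_channel` are hypothetical; in the workshop this work is done by Snowflake, not by local Python.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows shaped like (DATE, CHANNEL, TOTAL_COST)
rows = [
    (date(2012, 5, 14), "video", 100.0),
    (date(2012, 5, 20), "video", 50.0),
    (date(2012, 5, 20), "email", 30.0),
    (date(2012, 6, 1), "video", 70.0),
]


def spend_per_channel(rows):
    """Sum TOTAL_COST for every (year, month, channel) combination."""
    totals = defaultdict(float)
    for day, channel, cost in rows:
        totals[(day.year, day.month, channel)] += cost
    # Sort by year and month; channel is included only to make the order deterministic.
    return [(y, m, ch, total) for (y, m, ch), total in sorted(totals.items())]


for row in spend_per_channel(rows):
    print(row)
```

Each output tuple corresponds to one row of `snow_df_spend_per_channel`: a year, a month, a channel, and the summed cost.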

#### 4.3 Total Spend Across All Channels

Let's further transform the campaign spend data so that each row represents the total cost across all channels per year/month, using the `pivot()` and `sum()` Snowpark DataFrame methods.

This transformation will enable us to join with the revenue table so that our input features and target variable are in a single table for model training.

Finally, alias the pivoted columns to give them more user-friendly names.


In [None]:
# Turn each channel value into its own column holding the summed cost
snow_df_spend_per_month = (snow_df_spend_per_channel
        .pivot('CHANNEL', ['search_engine', 'social_media', 'video', 'email'])
        .sum('TOTAL_COST')
        .sort('YEAR', 'MONTH'))
snow_df_spend_per_month = snow_df_spend_per_month.select(
    col("YEAR"),
    col("MONTH"),
    col("'search_engine'").as_("SEARCH_ENGINE"),
    col("'social_media'").as_("SOCIAL_MEDIA"),
    col("'video'").as_("VIDEO"),
    col("'email'").as_("EMAIL")
)

print("Total Spend Across All Channels")

# 'Action' statement: executes the transformations and displays the first ten results
snow_df_spend_per_month.show()
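As an illustration of what the pivot does, the sketch below reshapes a few made-up per-channel rows into one row per year/month with one column per channel. The sample data and the helper `pivot_by_channel` are hypothetical, and `None` stands in here for a channel with no spend in a given month.

```python
CHANNELS = ["search_engine", "social_media", "video", "email"]

# Hypothetical (YEAR, MONTH, CHANNEL, TOTAL_COST) rows
per_channel = [
    (2012, 5, "email", 30.0),
    (2012, 5, "video", 150.0),
    (2012, 6, "video", 70.0),
]


def pivot_by_channel(rows, channels=CHANNELS):
    """Return one dict per (year, month) with an upper-case column per channel."""
    pivoted = {}
    for year, month, channel, cost in rows:
        row = pivoted.setdefault((year, month), {ch: None for ch in channels})
        if channel not in channels:
            continue  # channels outside the pivot list are ignored in this sketch
        row[channel] = (row[channel] or 0.0) + cost
    return [
        {"YEAR": y, "MONTH": m, **{ch.upper(): value for ch, value in cols.items()}}
        for (y, m), cols in sorted(pivoted.items())
    ]


for row in pivot_by_channel(per_channel):
    print(row)
```

The upper-case keys play the role of the aliases assigned with `as_()` in the Snowpark version.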

#### 4.4 Total Revenue per Year and Month Data

Now let's transform the revenue data into revenue per year/month using `group_by()` and `agg()` functions.


In [None]:
# Sum the revenue per year and month, then rename the generated column
snow_df_revenue_per_month = (snow_df_revenue
        .group_by('YEAR', 'MONTH')
        .agg(sum('REVENUE'))
        .sort('YEAR', 'MONTH')
        .with_column_renamed('SUM(REVENUE)', 'REVENUE'))

print("Total Revenue per Year and Month")
snow_df_revenue_per_month.show()

#### 4.5 Join Total Spend and Total Revenue per Year and Month Across All Channels

Next let's join this revenue data with the transformed campaign spend data so that our input features (i.e. cost per channel) and target variable (i.e. revenue) can be loaded into a single table for further analysis and model training.


In [None]:
snow_df_spend_and_revenue_per_month = snow_df_spend_per_month.join(snow_df_revenue_per_month, ["YEAR","MONTH"])

print("Total Spend and Revenue per Year and Month Across All Channels")
snow_df_spend_and_revenue_per_month.show()
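When `join()` is given a list of column names and no join type, Snowpark performs an inner join on those columns. The plain-Python sketch below shows the effect on made-up data: only year/month pairs present in both inputs survive. The helper `inner_join` and the sample rows are hypothetical.

```python
def inner_join(left, right, keys):
    """Inner-join two lists of dicts on the given key columns."""
    index = {}
    for right_row in right:
        index.setdefault(tuple(right_row[k] for k in keys), []).append(right_row)

    joined = []
    for left_row in left:
        for right_row in index.get(tuple(left_row[k] for k in keys), []):
            # Key columns appear once; the remaining columns are merged.
            joined.append({**left_row, **right_row})
    return joined


# Hypothetical sample rows
spend = [
    {"YEAR": 2012, "MONTH": 5, "VIDEO": 150.0, "EMAIL": 30.0},
    {"YEAR": 2012, "MONTH": 6, "VIDEO": 70.0, "EMAIL": None},
]
revenue = [
    {"YEAR": 2012, "MONTH": 5, "REVENUE": 1000.0},
    {"YEAR": 2012, "MONTH": 7, "REVENUE": 900.0},
]

joined = inner_join(spend, revenue, ["YEAR", "MONTH"])
print(joined)
```

Here June has spend but no revenue and July has revenue but no spend, so only the May row appears in the result.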

#### 4.6 Examine Query Explain Plan

Snowpark makes it convenient to inspect the SQL query and execution plan behind a DataFrame using the `explain()` DataFrame method.


In [None]:
snow_df_spend_and_revenue_per_month.explain()

#### 4.7 Save Transformed Data

Let's save the transformed data into a Snowflake table `SPEND_AND_REVENUE_PER_MONTH` so it can be used for further analysis and/or for training a model.

In [None]:
snow_df_spend_and_revenue_per_month.write.mode('overwrite').save_as_table('SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.SPEND_AND_REVENUE_PER_MONTH')

### 5. Use UDF to make DataFrame transformations.

#### 5.1 Select fields from the `dfSpendRevenuePerMonthUDF` DataFrame.

Create a DataFrame named `dfSpendRevenuePerMonthUDF` from the table `SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.SPEND_AND_REVENUE_PER_MONTH`.
Then select the following fields from it using the `select()` method (remember to wrap each column name in a `col()` function call):

- YEAR
- MONTH
- SEARCH_ENGINE
- SOCIAL_MEDIA
- VIDEO
- EMAIL

In [None]:
tableName = 'SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.SPEND_AND_REVENUE_PER_MONTH'
dfSpendRevenuePerMonthUDF = session.table(tableName)

dfOnTimeReporting = (dfSpendRevenuePerMonthUDF
    .select(
        col("YEAR")
        ,col("MONTH")
        ,col("SEARCH_ENGINE")
        ,col("SOCIAL_MEDIA")
        ,col("VIDEO")
        ,col("EMAIL")
        ))
dfOnTimeReporting.show()

#### 5.2 Use the UDF to check the top investment

Invoke the UDF `SNOWPARK_DEMO_DB.PUBLIC.findTopInvestment` using the `call_builtin` function to find the channel with the highest spend in each month.
Pass the following columns to the UDF, each followed by its name as a string label:

- SEARCH_ENGINE
- SOCIAL_MEDIA
- VIDEO
- EMAIL

> **&#128221; Note:** The DDL of the UDF is shown here for reference.
```
create or replace function SNOWPARK_DEMO_DB.PUBLIC.FINDTOPINVESTMENT(val1 int, text1 text, val2 int, text2 text, val3 int, text3 text, val4 int, text4 text)
returns string
language python
runtime_version = '3.8'
handler = 'FINDTOPINVESTMENT'
as
$$
def FINDTOPINVESTMENT(val1, text1, val2, text2, val3, text3, val4, text4):
    values = {val1: text1, val2: text2, val3: text3, val4: text4}
    max_val = max(values)
    sum_val = sum(values)
    porc_val = max_val / sum_val * 100
    return f"{values[max_val]} with {porc_val} of total"
$$;
```

> &#10071; The UDF must already be deployed in your account before you run the next cell.
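To try the handler logic locally before calling the UDF, a plain-Python variant can be useful. The sketch below is not the deployed UDF: the function name `find_top_investment` is hypothetical, and it keeps values and labels in lists instead of a dictionary keyed by value, so two channels with the same spend do not overwrite each other. The zero-spend message and the one-decimal formatting are also assumptions.

```python
import builtins


def find_top_investment(*pairs):
    """Return the label with the largest value and its share of the total.

    `pairs` alternates value, label, value, label, ...
    """
    values = list(pairs[0::2])
    labels = list(pairs[1::2])

    # The notebook imports Snowpark's `sum` and `max`, which shadow the
    # built-ins, so the built-in versions are referenced explicitly here.
    total = builtins.sum(values)
    if total == 0:
        return "no spend recorded"  # hypothetical message; avoids division by zero

    top_index = builtins.max(range(len(values)), key=lambda i: values[i])
    share = values[top_index] / total * 100
    return f"{labels[top_index]} with {share:.1f}% of total"


print(find_top_investment(100, "SEARCH_ENGINE", 50, "SOCIAL_MEDIA", 50, "VIDEO", 0, "EMAIL"))
```

With a dictionary keyed by value, the two channels at 50 in this example would collapse into a single entry and the total would be understated; keeping parallel lists avoids that.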



In [None]:
dfOnTimeReporting = (dfSpendRevenuePerMonthUDF
    .select(
        col("YEAR")
        ,col("MONTH")
        ,col("SEARCH_ENGINE")
        ,col("SOCIAL_MEDIA")
        ,col("VIDEO")
        ,col("EMAIL")
        ,call_builtin(
            "SNOWPARK_DEMO_DB.PUBLIC.FINDTOPINVESTMENT"
            ,col("SEARCH_ENGINE"), "SEARCH_ENGINE"
            ,col("SOCIAL_MEDIA"), "SOCIAL_MEDIA"
            ,col("VIDEO"), "VIDEO"
            ,col("EMAIL"), "EMAIL").alias("TOP INVESTMENT")
        )
)

    
dfOnTimeReporting.show()

### &#10071; `Shut Down Kernel`
> After completing the activities in this notebook, shut down its kernel.