### Establish Snowpark connection and load tables from source data.

This notebook is based on the example described in [Building and deploying a time series forecast with Hex + Snowflake](https://quickstarts.snowflake.com/guide/hex/index.html#0). The example highlights how we can use Snowflake to perform parallel hyperparameter tuning for a foot traffic forecast. See Chase Romano's article [Parallel Hyperparameter tuning using Snowpark](https://medium.com/snowflake/parallel-hyperparameter-tuning-using-snowpark-53cdec2faf77) for more information.

We will begin by establishing our Snowflake connection and Snowpark session. This demo assumes the user has access to the `SYSADMIN` role and that a virtual warehouse named `COMPUTE_WH` exists and is available for use.

If the database or schema does not exist yet, the connection is still established, just without database and schema context. We create both later in this notebook.

In [None]:
import os

import pandas as pd
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
from snowflake.snowpark import Session

connection_params = {
    "account": os.environ.get("SNOWFLAKE_ACCOUNT"),
    "user": os.environ.get("SNOWFLAKE_USER"),
    "password": os.environ.get("SNOWFLAKE_PASSWORD"),
    "database": os.environ.get("SNOWFLAKE_DATABASE"),
    "schema": os.environ.get("SNOWFLAKE_SCHEMA"),
    "role": "SYSADMIN",
    "warehouse": "COMPUTE_WH",
}

session = Session.builder.configs(connection_params).create()
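Every value in `connection_params` that comes from `os.environ.get` silently becomes `None` when the variable is unset. A minimal sketch of checking for that before building the session follows. The helper name `missing_env_vars` is hypothetical and not part of Snowpark.

```python
import os
from typing import List, Mapping, Sequence

# Environment variables read by this notebook's connection parameters.
REQUIRED_VARS = (
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_SCHEMA",
)


def missing_env_vars(
    names: Sequence[str] = REQUIRED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty (hypothetical helper)."""
    return [name for name in names if not environ.get(name)]


# Report anything that is missing before attempting to connect.
missing = missing_env_vars()
if missing:
    print("Unset environment variables:", ", ".join(missing))
```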

We connected earlier using the `SYSADMIN` role and the `COMPUTE_WH` virtual warehouse. Let's create the database and schema if they do not already exist.

In [None]:
# Read the target names once instead of repeating the lookups.
database = os.environ.get("SNOWFLAKE_DATABASE")
schema = os.environ.get("SNOWFLAKE_SCHEMA")

session.sql(f"CREATE DATABASE IF NOT EXISTS {database}").collect()
session.sql(f"CREATE SCHEMA IF NOT EXISTS {database}.{schema}").collect()
session.sql(f"USE DATABASE {database}").collect()
session.sql(f"USE SCHEMA {database}.{schema}").collect()
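These statements interpolate environment variables directly into SQL, so a missing or malformed name would produce an invalid or unintended statement. As a hedged sketch, the statements could be built by a small helper that first checks both names against Snowflake's rules for unquoted identifiers. The helper `setup_statements` and the example names `DEMO_DB` and `DEMO_SCHEMA` are hypothetical.

```python
import re
from typing import List, Optional

# Unquoted Snowflake identifiers start with a letter or underscore and
# contain only letters, digits, underscores, and dollar signs.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def setup_statements(database: Optional[str], schema: Optional[str]) -> List[str]:
    """Build the CREATE/USE statements after validating both names."""
    for name in (database, schema):
        if not name or not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Not a valid unquoted identifier: {name!r}")
    return [
        f"CREATE DATABASE IF NOT EXISTS {database}",
        f"CREATE SCHEMA IF NOT EXISTS {database}.{schema}",
        f"USE DATABASE {database}",
        f"USE SCHEMA {database}.{schema}",
    ]


# Hypothetical names, used only to show the generated SQL.
for statement in setup_statements("DEMO_DB", "DEMO_SCHEMA"):
    print(statement)
```

Each returned statement could then be run with `session.sql(statement).collect()`, as in the cell above.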

I'm going to create two Pandas DataFrames based on some CSV files that I have available. These files were generated using a process described in [Building and deploying a time series forecast with Hex + Snowflake](https://quickstarts.snowflake.com/guide/hex/index.html#0). The data is in the `data` directory of this repository.

In [None]:
calendar_df = pd.read_csv("../data/calendar.csv.gz")
traffic_df = pd.read_csv("../data/hourly_traffic.csv.gz")

Let's look at our first Pandas DataFrame.

In [None]:
calendar_df.head(5)

Let's get some information about this DataFrame to see what we're working with.

In [None]:
calendar_df.info()

We can adjust those "object" types to be more specific.

In [None]:
calendar_df["CALENDAR_DATE"] = pd.to_datetime(calendar_df["CALENDAR_DATE"])
calendar_df["HOLIDAY_NAME"] = calendar_df["HOLIDAY_NAME"].astype("string")

At the time of this writing, Snowpark's `create_dataframe` method converts pandas `datetime64[ns]` columns to the `LongType()` Snowpark type, representing [unix time](https://en.wikipedia.org/wiki/Unix_time). We can convert this specific column to make it easier to work with inside of Snowflake. We understand it to hold plain dates, so we will convert it with the `to_date` function.
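To make the unix time representation concrete: pandas stores `datetime64[ns]` values internally as nanoseconds since 1970-01-01 UTC. The sketch below uses only the standard library and assumes the integer is in nanoseconds. The unit Snowpark actually produces may differ by version, so treat that constant as an assumption. The helper name `epoch_ns_to_date` is hypothetical.

```python
from datetime import date, datetime, timezone

NANOS_PER_SECOND = 1_000_000_000  # assumption: the integer is in nanoseconds


def epoch_ns_to_date(epoch_ns: int) -> date:
    """Convert nanoseconds since the unix epoch to a UTC calendar date."""
    seconds = epoch_ns // NANOS_PER_SECOND
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


# 2022-01-01 00:00:00 UTC expressed as nanoseconds since the epoch.
print(epoch_ns_to_date(1_640_995_200_000_000_000))  # 2022-01-01
```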

Let's persist this table in Snowflake.

I'm using the `overwrite` mode here, but in a typical workflow you would likely want to append to the table.

In [None]:
session.create_dataframe(calendar_df).with_column(
    "CALENDAR_DATE", F.to_date(F.cast("CALENDAR_DATE", T.StringType()))
).write.save_as_table("CALENDAR_INFO", mode="overwrite")

Let's peek at our table. We can also view the schema to see that the `CALENDAR_DATE` column is now a `DATE` type.

In [None]:
session.table("CALENDAR_INFO").show()

Now for our other table, which holds hourly traffic.

In [None]:
traffic_df.head()

In [None]:
traffic_df.describe()

In [None]:
traffic_df.info()

`STORE_ID` and `COLLEGE_TOWN` probably need some adjustments. I don't imagine these columns will need to store numbers up to 9,223,372,036,854,775,807. Let's make them `int16` and `boolean` respectively.

We will apply conversions similar to the ones we used for the previous DataFrame. For our time conversion, the `to_datetime` function will still let us use the hour value in the `TIME_POINTS` column.

In [None]:
traffic_df["STORE_ID"] = pd.to_numeric(traffic_df["STORE_ID"], downcast="signed")
traffic_df["COLLEGE_TOWN"] = traffic_df["COLLEGE_TOWN"].astype("boolean")
traffic_df["TIME_POINTS"] = pd.to_datetime(traffic_df["TIME_POINTS"])
traffic_df["HOLIDAY_NAME"] = traffic_df["HOLIDAY_NAME"].astype("string")

In [None]:
traffic_df.info()

Yay, less memory. 🎉 Our memory usage in this example went from 201.6+ MB to 141.1 MB. 
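The effect of these conversions can be reproduced on a tiny, made-up sample. This is only a sketch: the values below are invented for illustration, and the integer type chosen by `downcast="signed"` depends on the range of the data, so the real column may end up with a different width.

```python
import pandas as pd

# Small, made-up sample; the real columns come from hourly_traffic.csv.gz.
sample = pd.DataFrame({"STORE_ID": [1, 200, 3000], "COLLEGE_TOWN": [0, 1, 1]})
before = sample.memory_usage(index=False, deep=True).sum()

# Pick the smallest signed integer type that can hold every value.
sample["STORE_ID"] = pd.to_numeric(sample["STORE_ID"], downcast="signed")
# Store the 0/1 flags as pandas' nullable boolean type.
sample["COLLEGE_TOWN"] = sample["COLLEGE_TOWN"].astype("boolean")
after = sample.memory_usage(index=False, deep=True).sum()

print(sample.dtypes)
print(f"{before} bytes -> {after} bytes")
```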

Finally, we'll create our Snowflake table.

In [None]:
session.create_dataframe(traffic_df).with_column(
    "TIME_POINTS", F.to_timestamp(F.cast("TIME_POINTS", T.StringType()))
).write.save_as_table("HOURLY_TRAFFIC", mode="overwrite")

Let's preview our table.

In [None]:
session.table("HOURLY_TRAFFIC").show()