

![](../jpg/stock_small.jpg)


Snowflake recently announced [Snowpark](https://docs.snowflake.com/en/developer-guide/snowpark/index.html) for [public preview](https://www.snowflake.com/blog/welcome-to-snowpark-new-data-programmability-for-the-data-cloud/). So what is Snowpark? Snowpark is an API for querying and processing data in a data pipeline. It works not only with Jupyter Notebooks but also with a variety of Scala IDEs. Instructions on how to set up your favorite development environment can be found [here](https://docs.snowflake.com/en/developer-guide/snowpark/setup.html).

This is the first notebook of a series to show the ease of use as well as the power of running Jupyter Notebooks via the Snowpark API on top of Snowflake. All notebooks will be fully self-contained, meaning all you need for processing and analyzing datasets with billions of rows is a Snowflake account. If you do not already have a Snowflake account, just follow the steps in [sign up](https://signup.snowflake.com/). A time-limited trial account is free and doesn't even require a credit card.

In this notebook you will learn how to connect to Snowflake via Snowpark. It is organized as follows:

- [Quick Start](#Quick-Start): Setting up your environment
- [Hello World](#Hello-World): First steps 
- [Snowflake Dataframes](#Snowflake-Dataframes): Query the Snowflake Sample Datasets via Snowflake Dataframes
- [Conclusion](#Conclusion): Conclusion and What's next

All notebooks in this series require a Jupyter notebook environment with a Scala kernel. If you do not already have access to that type of environment, I would highly recommend using [Snowtire V2](https://github.com/zoharsan/snowtire_v2) and this excellent [post](https://medium.com/snowflake/from-zero-to-snowpark-in-5-minutes-72c5f8ec0b55). Additional instructions on the versions used in this series, as well as how to make the notebook look nicer by using nbextensions, can be found in this GitHub repo.

Versions used in this notebook are up-to-date as of August 2021. Please update them as necessary in the Snowtire setup step.

In case you see any unexpected errors, restart the kernel via *Kernel -> Restart* and start the notebook from the beginning.

Lastly, in my experience, when running on the Almond kernel, most problems reported by the Ammonite REPL can be resolved by deleting the Ammonite cache and restarting the kernel (as mentioned above).

        docker exec -d SnowTrekPost rm -rf /home/jovyan/.cache/almond


# Quick Start

## Step 1

Configure the notebook to use a Maven repository for a library that Snowpark depends on.

In [None]:
import sys.process._
val osgeoRepo = coursierapi.MavenRepository.of("https://repo.osgeo.org/repository/release")
interp.repositories() ++= Seq(osgeoRepo)

## Step 2

Create a directory (if it doesn't exist) for temporary files created by the [REPL](https://ammonite.io/#Ammonite-REPL) environment. To avoid any side-effects from previous runs, we also delete any files that might exist in that directory.

**Note: Make sure that you have the operating system permissions to create a directory in that location.**

**Note: If you are using multiple notebooks, you’ll need to create and configure a separate REPL class directory for each notebook.**

In [None]:
import ammonite.ops._
import ammonite.ops.ImplicitWd._

// This folder stores the generated REPL classes, which are later used in UDFs.
// It must be an empty folder; this is essential for Snowpark UDFs to work.
val replClassPath = pwd+"/../repl_classes"

import sys.process._

// Create the REPL class folder (no error if it already exists)
s"mkdir -p $replClassPath".!

// Delete any old files in the directory. The command goes through a shell
// because sys.process does not expand the * wildcard on its own.
Seq("sh", "-c", s"rm -rf $replClassPath/*").!
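
The shell commands above assume a Unix-like system. As a hedged alternative, the same preparation can be sketched with the JDK file API only, without starting an external process. The helper names `deleteRecursively` and `prepareReplClassDir` are hypothetical and not part of Snowpark or Ammonite.

```scala
import java.io.File

// Hypothetical helper: delete a file, or a directory together with its content.
def deleteRecursively(f: File): Unit = {
  // listFiles() returns null for regular files, hence the Option wrapper
  Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete()
}

// Hypothetical helper: make sure `dir` exists and is empty, then return it.
def prepareReplClassDir(dir: File): File = {
  dir.mkdirs() // does nothing when the directory already exists
  Option(dir.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  dir
}

// Possible use in the notebook (the relative path mirrors the cell above):
// val replClassPath = prepareReplClassDir(new File("../repl_classes")).getCanonicalPath
```

With this variant the directory itself is kept and only its content is removed, which matches the intent of the cell above.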

## Step 3

Configure the compiler for the Scala REPL. This does the following:
- Configures the compiler to generate classes for the REPL in the directory that you created earlier.
- Configures the compiler to wrap code entered in the REPL in classes, rather than in objects.
- Adds the directory that you created earlier as a dependency of the REPL interpreter.

In [None]:
// Generate all repl classes in repl class folder
interp.configureCompiler(_.settings.outputDirs.setSingleOutput(replClassPath))
interp.configureCompiler(_.settings.Yreplclassbased.value = true)
interp.load.cp(os.Path(replClassPath))

## Step 4
Import the Snowpark library from Maven.

In [None]:
import $ivy.`com.snowflake:snowpark:0.8.0`

To create a session, we need to authenticate ourselves to the Snowflake instance. Though it might be tempting to just override the authentication variables below with hard-coded values, it's not considered best practice to do so. If you ever share your version of the notebook, you could disclose your credentials to the recipient by mistake. Even worse, if you upload your notebook to a public code repository, you might advertise your credentials to the whole wide world. To prevent that, you should keep your credentials in an external file (like we are doing here).

Copy the credentials template file creds/template_credentials.txt to creds/credentials.txt and update the copy with your credentials; that way they stay safe on your local machine. Put your key files into the same directory or update the location accordingly in your credentials file.

Even better, do not use user/password authentication but [private key authentication](https://docs.snowflake.com/en/user-guide/key-pair-auth.html).

In [None]:
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

val session = Session.builder.configFile("../creds/credentials.txt").create
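
As an alternative to a credentials file, connection settings can also be handed to the session builder as a `Map` via `Session.builder.configs`. The sketch below collects them from environment variables. The variable names (`SNOWFLAKE_URL` and so on) and the helper `configFromEnv` are assumptions made for this illustration, not Snowpark conventions. The session creation itself is left as a comment because it needs a live Snowflake account.

```scala
// Hypothetical mapping from environment variable names (an assumption of this
// sketch) to the connection property names used in a Snowpark configuration.
val envToProperty: Seq[(String, String)] = Seq(
  "SNOWFLAKE_URL"       -> "URL",
  "SNOWFLAKE_USER"      -> "USER",
  "SNOWFLAKE_PASSWORD"  -> "PASSWORD",
  "SNOWFLAKE_ROLE"      -> "ROLE",
  "SNOWFLAKE_WAREHOUSE" -> "WAREHOUSE"
)

// Returns the configuration map, or the names of the variables that are not set.
def configFromEnv(env: Map[String, String]): Either[Seq[String], Map[String, String]] = {
  val missing = envToProperty.collect { case (name, _) if !env.contains(name) => name }
  if (missing.nonEmpty) Left(missing)
  else Right(envToProperty.map { case (name, prop) => prop -> env(name) }.toMap)
}

// Possible use in the notebook (requires the Snowpark imports from the cell above):
// val session = configFromEnv(sys.env) match {
//   case Right(cfg)    => Session.builder.configs(cfg).create
//   case Left(missing) => sys.error(s"Missing environment variables: ${missing.mkString(", ")}")
// }
```

Returning the missing names instead of failing on the first one makes it easier to fix an incomplete environment in a single pass.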

## Step 5
Add the Ammonite kernel classes as dependencies for your UDFs.

In [None]:
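def addClass(session: Session, className: String): String = {
  // Locate the .class resource for the given class on the kernel's classpath
  val cls1 = Class.forName(className)
  val resourceName = "/" + cls1.getName().replace(".", "/") + ".class"
  val url = cls1.getResource(resourceName)
  // Strip the URL scheme and the entry inside the JAR, keeping only the file path
  val path = url.getPath().split(":").last.split("!").head
  // Register that file as a dependency of the session so UDFs can use it
  session.addDependency(path)
  path
}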
addClass(session, "ammonite.repl.ReplBridge$")
addClass(session, "ammonite.interp.api.APIHolder")
addClass(session, "pprint.TPrintColors")
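
The path handling inside `addClass` is easiest to follow on a concrete value. The sketch below applies the same two `split` calls to example strings of the kind `URL.getPath` returns for a class inside a JAR and for a class in a plain directory. The paths and the helper name `dependencyPath` are made up for illustration.

```scala
// Same string handling as in addClass, isolated as a hypothetical helper.
def dependencyPath(urlPath: String): String =
  urlPath
    .split(":").last // drop a leading "file:" scheme, if present
    .split("!").head // drop the entry inside the JAR, if present

// A class loaded from a JAR: the path has the form "file:<jar>!/<entry>".
val fromJar = dependencyPath("file:/tmp/demo/ammonite-repl.jar!/ammonite/repl/ReplBridge$.class")

// A class loaded from a directory: the path is returned unchanged.
val fromDir = dependencyPath("/tmp/demo/repl_classes/Foo.class")

println(fromJar) // /tmp/demo/ammonite-repl.jar
println(fromDir) // /tmp/demo/repl_classes/Foo.class
```

Because only the text after the last colon is kept, a path that itself contains a colon (for example a Windows drive letter) would be cut short by this logic.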

# Hello World

Congratulations! You have successfully connected from a Jupyter notebook to a Snowflake instance. Now we are ready to write our first "Hello World" program using Snowpark. That is as easy as the line in the cell below. After you have executed the cell, you should see output similar to the following:

    [scala-interpreter-1]  INFO (Logging.scala:22) - Actively querying parameter snowpark_lazy_analysis from server.
    [scala-interpreter-1]  INFO (Logging.scala:22) - Execute query [queryID: 019e28e6-05025203-0000-22410336b00a]  SELECT  *  FROM (SELECT 'Hello World!' greeting) LIMIT 10
    ----------------
    |"GREETING"    |
    ----------------
    |Hello World!  |
    ----------------
    

In [None]:
session.sql("SELECT 'Hello World!' greeting").show()

Note that Snowpark has automatically translated the Scala code into the familiar "Hello World!" SQL statement. This means that we can execute arbitrary SQL by using the **sql** method of the **Session** class.

However, this doesn't really show the power of the new Snowpark API. At this point it's time to review the [Snowpark API documentation](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/index.html). It provides valuable information on how to use Snowpark. 

Let's now create a new *"Hello World!"* cell that uses the Snowpark API, specifically the [dataframe API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/index.html?search=dataframe). Here we create a dataframe from a local sequence holding a single value and name its column with *toDF*.

In [None]:
import com.snowflake.snowpark.types._

val helloWorldDf=session.createDataFrame(Seq(("Hello World!"))).toDF("Greeting")

helloWorldDf
   .show

# Snowflake Dataframes


After having mastered the *Hello World!* stage, we now can query Snowflake tables using the dataframe API. To do so, we will query the [Snowflake Sample Database](https://docs.snowflake.com/en/user-guide/sample-data.html) included in any Snowflake instance. 

As you may know, the TPCH datasets come in different sizes, from scale factor 1 (TPCH_SF1) to scale factor 1000 (TPCH_SF1000), which in TPC-H terms corresponds to roughly 1 GB to 1 TB of raw data. For starters we will query the orders table in the scale factor 10 dataset (TPCH_SF10). Instead of writing a SQL statement we will use the dataframe API. The advantage is that dataframes can be built as a pipeline. Let's take a look at demoOrdersDf.

    val demoOrdersDf=session.table(demoDataSchema :+ "ORDERS")
    
In contrast to the initial *Hello World!* example above, we now map a Snowflake table to a dataframe. Defining a dataframe takes practically no time because it only defines metadata. To get the result, for instance the content of the Orders table, we need to [evaluate](https://docs.snowflake.com/en/developer-guide/snowpark/working-with-dataframes.html#performing-an-action-to-evaluate-a-dataframe) the dataframe. One way of doing that is to apply the *[count()](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=count)* action, which returns the row count of the dataframe, i.e. in this case the row count of the *Orders* table. Another useful method is *[schema](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=schema)*, which returns the column definitions of the dataframe.

In [None]:
val size:String="10"
val demoDataSchema:Seq[String]=Seq("SNOWFLAKE_SAMPLE_DATA","TPCH_SF"+size)
val demoOrdersDf=session.table(demoDataSchema :+ "ORDERS")

demoOrdersDf
    .count()
demoOrdersDf
    .schema
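
To make the distinction between defining and evaluating a dataframe concrete, here is a toy model in plain Scala. It is not Snowpark's implementation, and the class `QuerySketch` is invented for this illustration: transformations only build up SQL text, and nothing is sent anywhere until an action is called. The `runQuery` function stands in for the round trip to Snowflake.

```scala
// Toy model of a lazily evaluated dataframe (illustration only, not Snowpark code).
final case class QuerySketch(sql: String, runQuery: String => Long) {
  // Transformations return a new description of the query; nothing is executed.
  def select(cols: String*): QuerySketch =
    copy(sql = s"SELECT ${cols.mkString(", ")} FROM ($sql)")
  def filter(condition: String): QuerySketch =
    copy(sql = s"SELECT * FROM ($sql) WHERE $condition")

  // An action is the only place where the query is actually run.
  def count(): Long = runQuery(s"SELECT COUNT(*) FROM ($sql)")
}

var executed = Vector.empty[String] // records every simulated round trip
val fakeServer: String => Long = { q => executed :+= q; 42L }

val orders   = QuerySketch("SELECT * FROM ORDERS", fakeServer)
val filtered = orders.select("O_ORDERKEY", "O_ORDERSTATUS").filter("O_ORDERSTATUS = 'O'")
println(executed.size) // 0: defining the pipeline ran nothing

filtered.count()
println(executed.size) // 1: the action triggered a single query
println(executed.head) // the nested SQL text built up by the transformations
```

The nesting of sub-queries in the generated text mirrors the pattern visible in the log output of the *Hello World!* cell, where the original statement was wrapped in an outer `SELECT`.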

Next, we want to apply a projection. In SQL terms, this is the *select* clause, i.e. instead of getting all of the columns in the Orders table, we are only interested in a few. This is accomplished by the *[select()](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=select)* transformation. Note that we can just add additional qualifications to the already existing dataframe *demoOrdersDf* and create a new dataframe that includes only a subset of columns. Lastly, instead of counting the rows in the dataframe, this time we want to see the content of the dataframe. To do so we need to evaluate the dataframe. We can do that using another action, i.e. *[show](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=show)*.

In [None]:
val demoOrdersProjectedDf=
        demoOrdersDf
            .select(col("O_ORDERKEY"), col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_ORDERDATE"))

demoOrdersProjectedDf
    .show(20)

Let's now assume that we do not want **all** the rows but only a subset of rows in a dataframe. We can accomplish that with the *[filter()](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=filter)* transformation.

In [None]:
val demoOrdersProjectedFilteredDf=
        demoOrdersProjectedDf
            .filter(col("O_ORDERSTATUS")===lit("O"))
                    
demoOrdersProjectedFilteredDf
    .sort(col("O_ORDERKEY"))
    .show(20)

Next, we want to create a new dataframe which joins the Orders table with the LineItem table. We start from our previous dataframe, which is a projection and a filter against the Orders table, join it to the *LineItem* table, and then apply the *select()* transformation to the result. Again, to see the result we need to evaluate the new dataframe, for instance via the *show()* action.

In [None]:
val demoLinesDf=session.table(demoDataSchema :+ "LINEITEM")
val demoOrdersLinesDf=
        demoOrdersProjectedFilteredDf
            .join(demoLinesDf,col("L_ORDERKEY")===col("O_ORDERKEY"))
            .select(col("O_ORDERKEY"),col("O_ORDERSTATUS"),col("L_LINENUMBER"),col("L_LINESTATUS"))

demoOrdersLinesDf
    .sort(col("O_ORDERKEY"),col("L_LINENUMBER"))
    .show(100)

Finally, dataframes also support aggregations. The next cell groups the projected orders by order date and order status, counts the orders in each group, and uses the *collect()* action to bring the sorted result back into the notebook as a local array.

In [None]:
val demoOrdersCountByDateAndStatus=
        demoOrdersProjectedDf
            .select(col("O_ORDERDATE"),col("O_ORDERSTATUS"), col("O_ORDERKEY"))
            .groupBy(col("O_ORDERDATE"),col("O_ORDERSTATUS"))
            .agg(count(col("O_ORDERKEY")).name("O_COUNT"))

val demoOrdersCountByDateAndStatusArr=demoOrdersCountByDateAndStatus
    .sort(col("O_ORDERDATE"),col("O_ORDERSTATUS"))
    .collect()
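
For readers who think in Scala collections, the aggregation above has the same shape as a `groupBy` followed by a count on an in-memory sequence. The sketch below uses a made-up `Order` case class and three made-up rows. It runs locally, whereas the Snowpark version is executed inside Snowflake and `collect()` returns the result as an array of `Row` objects.

```scala
// Made-up rows for illustration; the real data lives in the ORDERS table.
final case class Order(orderKey: Long, orderDate: String, orderStatus: String)

val sampleOrders = Seq(
  Order(1L, "1995-01-01", "O"),
  Order(2L, "1995-01-01", "O"),
  Order(3L, "1995-01-02", "F")
)

// Group by (date, status), count the orders per group, then sort: the local
// counterpart of groupBy(...).agg(count(...)) followed by sort(...).
val countByDateAndStatus: Seq[(String, String, Int)] =
  sampleOrders
    .groupBy(o => (o.orderDate, o.orderStatus))
    .map { case ((date, status), group) => (date, status, group.size) }
    .toSeq
    .sortBy { case (date, status, _) => (date, status) }

countByDateAndStatus.foreach(println)
// (1995-01-01,O,2)
// (1995-01-02,F,1)
```

The difference that matters in practice is where the work happens: the local version needs all rows in the notebook's memory, while the dataframe version only transfers the aggregated result.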

# Conclusion

In this notebook we have seen how to set up a Jupyter notebook and configure it to use Snowpark to connect to a Snowflake database. Next, we have built a simple *Hello World!* cell via embedded SQL as well as the Snowpark Dataframe API. Lastly, we have explored the power of the Snowpark Dataframe API to build scalable analytics against datasets of billions of rows without first having to build any additional infrastructure like a Spark cluster. Snowpark brings the power of building scalable analytics applications to Scala and Jupyter notebooks.

In the next post of this series, we will learn how to add custom Scala-based functions to an analytics pipeline and execute arbitrary logic directly in Snowflake via User Defined Functions (UDFs) just by defining the logic in a Jupyter notebook!