![](../../jpg/stock_small.jpg)


This is the first notebook in a series that shows how to use Snowpark on Snowflake. It provides a quick-start guide and an introduction to the Snowpark DataFrame API. The notebook explains how to set up the environment (REPL) and how to resolve Snowpark's dependencies. After a simple "Hello World" example, you will learn about the Snowflake DataFrame API, including projections, filters, and joins.

In this notebook we will cover the following:

- [Quick Start](#Quick-Start): Set up the environment
- [Hello World](#Hello-World): First steps 
- [Snowflake DataFrame API](#Snowflake-DataFrames): Query the Snowflake Sample Datasets via Snowflake DataFrames
- [Conclusion](#Conclusion): Conclusion and What's next


# Prerequisite

To get started you need a Snowflake account and read/write access to a database. If you do not have a Snowflake account, you can sign up for a [free trial](https://signup.snowflake.com/). It doesn't even require a credit card.


# Quick Start

First, we have to set up the environment for our notebook. The instructions are in the Snowpark documentation, in the section [Configuring the Jupyter Notebook for Snowpark](https://docs.snowflake.com/en/developer-guide/snowpark/quickstart-jupyter.html#configuring-the-jupyter-notebook-for-snowpark).

## Step 1

Configure the notebook to use a Maven repository for a library that Snowpark depends on.

In [None]:
import sys.process._
val osgeoRepo = coursierapi.MavenRepository.of("https://repo.osgeo.org/repository/release")
interp.repositories() ++= Seq(osgeoRepo)

## Step 2

Create a directory (if it doesn't exist) for temporary files created by the [REPL](https://ammonite.io/#Ammonite-REPL) environment. To avoid any side effects from previous runs, we also delete any files in that directory.

**Note: Make sure that you have the operating system permissions to create a directory in that location.**

**Note: If you are using multiple notebooks, you’ll need to create and configure a separate REPL class directory for each notebook.**

In [None]:
import ammonite.ops._
import ammonite.ops.ImplicitWd._

// This folder is used to store generated REPL classes, which will later be used in UDFs.
// Please provide an empty folder path. This is essential for Snowpark UDFs to work.
val replClassPath = pwd+"/repl_classes"

// Delete any old files in the directory.
import sys.process._
s"rm -rf $replClassPath".!

// Create the REPL class folder.
s"mkdir -p $replClassPath".!
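
The `rm` and `mkdir` commands in the cell above assume a Unix-like shell. As a hedged alternative, the same reset can be sketched with the Java file API, which avoids shelling out. The helper names `deleteRecursively` and `resetDirectory` are hypothetical, and the sketch runs against a throwaway directory under the system temp folder rather than the real REPL class directory.

```scala
import java.io.File

// Delete a file, or a directory together with everything below it.
def deleteRecursively(f: File): Unit = {
  // listFiles returns null for anything that is not a directory, so wrap it in Option.
  Option(f.listFiles).foreach(_.foreach(deleteRecursively))
  f.delete()
}

// Remove the directory if it exists, then create it again, empty.
def resetDirectory(path: String): File = {
  val dir = new File(path)
  if (dir.exists) deleteRecursively(dir)
  dir.mkdirs()
  dir
}

// Demonstration on a temporary location (an assumption made for this sketch).
val demoPath = new File(System.getProperty("java.io.tmpdir"), "repl_classes_demo").getPath
val firstRun = resetDirectory(demoPath)
new File(firstRun, "stale.class").createNewFile() // simulate a leftover from a previous run
val cleaned = resetDirectory(demoPath)
println(s"${cleaned.getPath} contains ${cleaned.list().length} files")
```

If this approach suits your setup, calling `resetDirectory(replClassPath)` would take the place of the two shell commands.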

## Step 3

Configure the compiler for the Scala REPL. This does the following:
- Configures the compiler to generate classes for the REPL in the directory that you created earlier.
- Configures the compiler to wrap code entered in the REPL in classes, rather than in objects.
- Adds the directory that you created earlier as a dependency of the REPL interpreter.

In [None]:
// Generate all repl classes in REPL class folder
interp.configureCompiler(_.settings.outputDirs.setSingleOutput(replClassPath))
interp.configureCompiler(_.settings.Yreplclassbased.value = true)
interp.load.cp(os.Path(replClassPath))

## Step 4

Import the Snowpark library from Maven.

In [None]:
import $ivy.`com.snowflake:snowpark:0.9.0`

To create a session, we need to authenticate to the Snowflake instance. It might be tempting to overwrite the authentication variables below with hard-coded values, but that is not considered best practice. If you share your version of the notebook, you might disclose your credentials to the recipient by mistake. Even worse, if you upload your notebook to a public code repository, you might advertise your credentials to the whole world. To prevent that, keep your credentials in an external file, as we do here.

Copy the credentials template file creds/template_credentials.txt to creds/credentials.txt and update the copy with your credentials. The file stays on your local machine. Put your key files into the same directory or update their location in your credentials file. Even better would be to switch from user/password authentication to [private key authentication](https://docs.snowflake.com/en/user-guide/key-pair-auth.html).
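
Before creating the session, it can help to check that the credentials file defines the settings you expect. The following is a minimal sketch in plain Scala. The key names in `expectedKeys` are an assumption based on the Snowpark configuration-file format, since the contents of the template file are not shown here, and `missingKeys` is a hypothetical helper. The sample text contains placeholders only.

```scala
import java.io.StringReader
import java.util.Properties

// Assumed key names; adjust this list to match your own template file.
val expectedKeys = Seq("URL", "USER", "ROLE", "WAREHOUSE", "DB", "SCHEMA")

// Return the expected keys that are missing or blank in the given properties text.
def missingKeys(propertiesText: String, expected: Seq[String]): Seq[String] = {
  val props = new Properties()
  props.load(new StringReader(propertiesText))
  expected.filter { key =>
    val value = props.getProperty(key)
    value == null || value.trim.isEmpty
  }
}

// Placeholder content for illustration; never put real credentials in a notebook.
val sampleText =
  """URL = https://<account_identifier>.snowflakecomputing.com
    |USER = <username>
    |ROLE = <role_name>
    |WAREHOUSE = <warehouse_name>
    |DB = <database_name>
    |""".stripMargin

val missing = missingKeys(sampleText, expectedKeys)
println(s"Missing keys: ${missing.mkString(", ")}")
```

To check your real file, you could read its text with `scala.io.Source.fromFile` and pass it to the same helper.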

In [None]:
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

val session = Session.builder.configFile("creds/credentials.txt").create

## Step 5

Add the Ammonite kernel classes as dependencies for your UDF.

In [None]:
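// Locate the JAR (or directory) that contains the given class and register it
// as a dependency of the session, so that UDFs can use it.
def addClass(session: Session, className: String): String = {
  val cls = Class.forName(className)
  val resourceName = "/" + cls.getName().replace(".", "/") + ".class"
  val url = cls.getResource(resourceName)
  // Strip the URL scheme prefix and the "!/entry" suffix to get the file system path.
  val path = url.getPath().split(":").last.split("!").head
  session.addDependency(path)
  path
}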
addClass(session, "ammonite.repl.ReplBridge$")
addClass(session, "ammonite.interp.api.APIHolder")
addClass(session, "pprint.TPrintColors")
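
The path handling inside `addClass` is compact, so here is a small sketch of what the two `split` calls do. The sample path is hypothetical and only shaped like the value `getPath` returns for a class stored inside a JAR; `jarPathOf` is a name invented for this illustration.

```scala
// Hypothetical resource path for a class inside a JAR file.
val samplePath =
  "file:/home/user/.cache/coursier/ammonite-repl.jar!/ammonite/repl/ReplBridge$.class"

// Keep the part after the last ":" (dropping the "file:" prefix),
// then keep the part before "!" (dropping the entry inside the JAR).
def jarPathOf(resourcePath: String): String =
  resourcePath.split(":").last.split("!").head

val jarPath = jarPathOf(samplePath)
println(jarPath) // prints /home/user/.cache/coursier/ammonite-repl.jar
```

Because the first split keeps only the text after the last colon, a path that contains further colons would be cut short by this logic.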

# Hello World

Congratulations! You have successfully connected from a Jupyter Notebook to a Snowflake instance. Now we are ready to write our first "Hello World" program using Snowpark, which takes just the one line in the cell below. After you execute the cell, you should see output similar to this:

    [scala-interpreter-1]  INFO (Logging.scala:22) - Actively querying parameter snowpark_lazy_analysis from server.
    [scala-interpreter-1]  INFO (Logging.scala:22) - Execute query [queryID: 019e28e6-05025203-0000-22410336b00a]  SELECT  *  FROM (SELECT 'Hello World!' greeting) LIMIT 10
    ----------------
    |"GREETING"    |
    ----------------
    |Hello World!  |
    ----------------
    

In [None]:
session.sql("SELECT 'Hello World!' greeting").show()

Note that Snowpark has sent our "Hello World!" SQL statement to Snowflake. As the log output shows, the statement is wrapped in an outer query with a `LIMIT 10`, which restricts the rows that *show()* displays. This means that we can execute arbitrary SQL by using the **sql** method of the **Session** class.

However, this doesn't really show the power of the new Snowpark API. At this point it's time to review the [Snowpark API documentation](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/index.html). It provides valuable information on how to use the Snowpark API. 

Let's now create a new *Hello World!* cell that uses the Snowpark API, specifically the [DataFrame API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/index.html?search=dataframe). To use the DataFrame API, we first create a sequence that holds our greeting, turn it into a DataFrame with *createDataFrame()*, and name its single column with *toDF()*.

In [None]:
import com.snowflake.snowpark.types._

val helloWorldDf=session.createDataFrame(Seq(("Hello World!"))).toDF("Greeting")

helloWorldDf
   .show

# Snowflake DataFrames


Having mastered the *Hello World!* stage, we can now query Snowflake tables using the DataFrame API. To do so, we will query the [Snowflake Sample Database](https://docs.snowflake.com/en/user-guide/sample-data.html) included in any Snowflake instance.

As you may know, the TPC-H data sets come in different scale factors. In the sample database, the schemas TPCH_SF1, TPCH_SF10, TPCH_SF100, and TPCH_SF1000 hold roughly 1 GB, 10 GB, 100 GB, and 1 TB of data, respectively. For starters, we will query the Orders table at scale factor 10 (schema TPCH_SF10). Instead of writing a SQL statement, we will use the DataFrame API. The advantage is that DataFrames can be built as a pipeline. Let's take a look at the demoOrdersDf.

    val demoOrdersDf=session.table(demoDataSchema :+ "ORDERS")
    
In contrast to the initial *Hello World!* example above, we now map a Snowflake table to a DataFrame. Defining a DataFrame takes almost no time because it only defines metadata; no data is retrieved yet. To get the result, for instance the content of the Orders table, we need to [evaluate](https://docs.snowflake.com/en/developer-guide/snowpark/working-with-dataframes.html#performing-an-action-to-evaluate-a-dataframe) the DataFrame. One way of doing that is to apply the *[count()](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=count)* action, which returns the row count of the DataFrame, in this case the row count of the *Orders* table. The *[schema](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=schema)* method, in turn, returns the column names and types of the DataFrame.

In [None]:
val size:String="10"
val demoDataSchema:Seq[String]=Seq("SNOWFLAKE_SAMPLE_DATA","TPCH_SF"+size)
val demoOrdersDf=session.table(demoDataSchema :+ "ORDERS")

demoOrdersDf
    .count()
demoOrdersDf
    .schema
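
To build intuition for why defining a DataFrame is cheap, here is a toy model in plain Scala. It is not how Snowpark is implemented; `ToyFrame` and its methods are hypothetical names invented for this sketch. Each transformation only wraps the previous SQL text in a new query, and in the real API nothing is sent to Snowflake until an action runs.

```scala
// Toy model for illustration only; this is NOT Snowpark's implementation.
// ToyFrame is a hypothetical class that just carries SQL text.
final case class ToyFrame(sql: String) {
  // Projection: keep only the named columns.
  def select(cols: String*): ToyFrame = {
    val columnList = cols.mkString(", ")
    ToyFrame(s"SELECT $columnList FROM ($sql)")
  }

  // Filter: keep only the rows that satisfy the condition.
  def filter(condition: String): ToyFrame =
    ToyFrame(s"SELECT * FROM ($sql) WHERE $condition")

  // Row limit, similar to the LIMIT seen in the Hello World log output.
  def limit(n: Int): ToyFrame =
    ToyFrame(s"SELECT * FROM ($sql) LIMIT $n")
}

val toyOrders = ToyFrame("SELECT * FROM ORDERS")

// Building the pipeline only composes strings; no query is executed here.
val toyPipeline = toyOrders
  .select("O_ORDERKEY", "O_ORDERSTATUS")
  .filter("O_ORDERSTATUS = 'O'")
  .limit(20)

println(toyPipeline.sql)
```

Each step returns a new value and leaves `toyOrders` unchanged, which mirrors how a new DataFrame can be derived from an existing one without modifying it.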

Next, we want to apply a projection. In SQL terms, this is the *select* clause. Instead of getting all of the columns in the Orders table, we are only interested in a few. This is accomplished by the *[select()](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=select)* transformation. Note that we can just add additional qualifications to the already existing DataFrame of *demoOrdersDf* and create a new DataFrame that includes only a subset of columns. Lastly, instead of counting the rows in the DataFrame, this time we want to see the content of the DataFrame. To do so we need to evaluate the DataFrame. We can do that using another action *[show](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=show)*.

In [None]:
val demoOrdersProjectedDf=
        demoOrdersDf
            .select(col("O_ORDERKEY"), col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_ORDERDATE"))

demoOrdersProjectedDf
    .show(20)

Let's now assume that we do not want **all** the rows but only a subset of rows in a DataFrame. We can accomplish that with the *[filter()](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=filter)* transformation.

In [None]:
val demoOrdersProjectedFilteredDf=
        demoOrdersProjectedDf
            .filter(col("O_ORDERSTATUS")===lit("O"))
                    
demoOrdersProjectedFilteredDf
    .sort(col("O_ORDERKEY"))
    .show(20)

Next, we want to create a new DataFrame that joins the Orders table with the LineItem table. Again, we are using our previous DataFrame, which is a projection and a filter against the Orders table. We can join that DataFrame to the *LineItem* table and create a new DataFrame. We then apply the *[select()](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html?search=select)* transformation. Again, to see the result we need to evaluate the DataFrame, for instance by using the *show()* action.

In [None]:
val demoLinesDf=session.table(demoDataSchema :+ "LINEITEM")
val demoOrdersLinesDf=
        demoOrdersProjectedFilteredDf
            .join(demoLinesDf,col("L_ORDERKEY")===col("O_ORDERKEY"))
            .select(col("O_ORDERKEY"),col("O_ORDERSTATUS"),col("L_LINENUMBER"),col("L_LINESTATUS"))

demoOrdersLinesDf
    .sort(col("O_ORDERKEY"),col("L_LINENUMBER"))
    .show(100)
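
To see what the filter, join, and projection produce together, the same logic can be traced on a few rows with plain Scala collections. This is only an analogy that runs locally. The case classes and all of the rows are invented for illustration and are not taken from the sample data.

```scala
// Hypothetical miniature versions of the Orders and LineItem tables.
final case class SampleOrder(orderKey: Long, status: String)
final case class SampleLine(orderKey: Long, lineNumber: Int, lineStatus: String)

val sampleOrders = Seq(SampleOrder(1L, "O"), SampleOrder(2L, "F"), SampleOrder(3L, "O"))
val sampleLines = Seq(
  SampleLine(1L, 1, "O"),
  SampleLine(1L, 2, "F"),
  SampleLine(2L, 1, "F"),
  SampleLine(3L, 1, "O")
)

val joinedRows = for {
  o <- sampleOrders if o.status == "O"          // filter on order status
  l <- sampleLines if l.orderKey == o.orderKey  // inner join on the order key
} yield (o.orderKey, o.status, l.lineNumber, l.lineStatus) // projection

// Sort by order key and line number, like the sort in the cell above.
val sortedRows = joinedRows.sortBy { case (key, _, line, _) => (key, line) }
sortedRows.foreach(println)
```

Order 2 does not appear in the result because the filter removes it before the join, so its line item finds no matching order.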

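DataFrames also support aggregations. The cell below groups the projected orders by order date and order status with *groupBy()*, counts the orders in each group with *agg()*, and uses the *collect()* action to return the sorted result as an array of rows.

In [None]: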
val demoOrdersCountByDateAndStatus=
        demoOrdersProjectedDf
            .select(col("O_ORDERDATE"),col("O_ORDERSTATUS"), col("O_ORDERKEY"))
            .groupBy(col("O_ORDERDATE"),col("O_ORDERSTATUS"))
            .agg(count(col("O_ORDERKEY")).name("O_COUNT"))

val demoOrdersCountByDateAndStatusArr=demoOrdersCountByDateAndStatus
    .sort(col("O_ORDERDATE"),col("O_ORDERSTATUS"))
    .collect()

# Conclusion

Snowpark is a brand new developer experience that brings scalable data processing to the Data Cloud. In Part 1 of this series, we learned how to set up a Jupyter Notebook and configure it to use Snowpark to connect to the Data Cloud. Next, we built a simple *Hello World!* program to test connectivity using embedded SQL. Then we enhanced that program by introducing the Snowpark DataFrame API. Lastly, we explored the power of the Snowpark DataFrame API using filter, projection, and join transformations.

In the next post of this series, we will learn how to create custom Scala-based functions and execute arbitrary logic directly in Snowflake using user-defined functions (UDFs), just by defining the logic in a Jupyter Notebook!