
Jupyter


The Kotlin Spark API also supports Kotlin Jupyter notebooks. To use it, simply add

%use spark

to the top of your notebook. This fetches the latest version of the API together with the latest version of Spark. To pin a specific version of Spark or of the API itself, pass them as arguments:

%use spark(spark=3.3.2, scala=2.13, v=1.2.4)

Other arguments you can pass to this %use magic include displayLimit and displayTruncate:

%use spark(displayLimit=30, displayTruncate=-1)

You can also pass any Spark property you like:

%use spark(spark.app.name=MyApp, spark.master=local[*])

Inside the notebook, a Spark session is started automatically. It can be accessed through the spark value, and the underlying JavaSparkContext is available directly as sc. Apart from that, the API works the same way as it does outside of notebooks.
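
For example, a cell could look like this (a minimal sketch; the Person data class is purely illustrative):

// spark (SparkSession) and sc (JavaSparkContext) are already in scope;
// the API's imports are added automatically by the %use magic.
data class Person(val name: String, val age: Int)

val people = spark.dsOf(Person("Alice", 30), Person("Bob", 25))
people.show()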

There is also support for HTML rendering of Datasets and simple (Java)RDDs. The appearance of these renders can be adjusted by setting either sparkProperties.displayTruncate (the maximum number of characters per cell) or sparkProperties.displayLimit (the maximum number of rows per table).
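
For instance, to shrink the rendered tables at runtime (a minimal sketch using the two properties named above):

sparkProperties.displayLimit = 5      // render at most 5 rows per table
sparkProperties.displayTruncate = 50  // cut cell contents off after 50 characters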

To use the Spark Streaming capabilities, instead use

%use spark-streaming

This does not start a Spark session right away; instead, you can call withSparkStreaming(batchDuration) {} in whichever cell you want. Check out the example. If a running stream is interrupted from Jupyter, an attempt is made to close the stream itself so that no Spark session keeps running in the background.
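
A minimal sketch of such a cell, assuming a text source is listening on localhost:9999 (the timeout argument, which stops the stream after the given number of milliseconds, is an optional illustration):

withSparkStreaming(batchDuration = Durations.seconds(1), timeout = 10_000) {
    // ssc, a JavaStreamingContext, is available inside this block
    val lines = ssc.socketTextStream("localhost", 9999)

    // split each line into words and print a sample of each batch
    lines.flatMap { it.split(" ").iterator() }.print()
}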

NOTE: You need an up-to-date kotlin-jupyter-kernel for the Kotlin Spark API to work. Also, if the %use spark magic does not output "Spark session has been started...", or %use spark-streaming doesn't work at all, add %useLatestDescriptors above it.
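
For example:

%useLatestDescriptors
%use spark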