By default, the latest version of the API and the latest supported Spark version are chosen.
To specify your own: `%use spark(spark=3.3.0, scala=2.13, v=1.2.0)`

You can also define `displayLimit` and `displayTruncate` to control the display of the result.

Finally, any other property you pass, like `spark.master=local[4]`, will be passed on to Spark.
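
As a sketch, the options above can be combined in a single `%use` line. The values `10`, `50` and `local[4]` are only illustrative assumptions; pick whatever suits your setup.

```kotlin
%use spark(spark=3.3.0, scala=2.13, v=1.2.0, displayLimit=10, displayTruncate=50, spark.master=local[4])
```

Here `spark`, `scala` and `v` select versions, `displayLimit` and `displayTruncate` affect rendering in the notebook, and `spark.master` is forwarded to Spark.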

In [2]:
%use spark

received properties: Properties: {v=1.2.0-SNAPSHOT, spark=3.3.0, scala=2.13, displayLimit=20, displayTruncate=30, spark.app.name=Jupyter, spark.master=local[*], spark.sql.codegen.wholeStage=false, fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem, fs.file.impl=org.apache.hadoop.fs.LocalFileSystem}, providing Spark with: {spark.app.name=Jupyter, spark.master=local[*], spark.sql.codegen.wholeStage=false, fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem, fs.file.impl=org.apache.hadoop.fs.LocalFileSystem}
22/07/25 12:17:17 INFO SparkContext: Running Spark version 3.3.0
22/07/25 12:17:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/07/25 12:17:18 INFO ResourceUtils: No custom resources configured for spark.driver.
22/07/25 12:17:18 INFO SparkContext: Submitted application: Jupyter
22/07/25 12:17:18 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: c

Let's define some enums and data classes to work with.

In [4]:
enum class EyeColor {
    BLUE, BROWN, GREEN
}

enum class Gender {
    MALE, FEMALE, OTHER
}

In [5]:
data class Person(
    val eyeColor: EyeColor,
    val name: String,
    val gender: Gender,
    val length: Double,
    val age: Int,
)

Now we can create a Dataset. To see its contents, simply state the Dataset as the last expression of a cell, as seen below:

In [6]:
val ds: Dataset<Person> = dsOf(
    Person(
        eyeColor = EyeColor.BLUE,
        name = "Alice",
        gender = Gender.FEMALE,
        length = 1.70,
        age = 25,
    ),
    Person(
        eyeColor = EyeColor.BLUE,
        name = "Bob",
        gender = Gender.MALE,
        length = 1.67,
        age = 25,
    ),
    Person(
        eyeColor = EyeColor.BROWN,
        name = "Charlie",
        gender = Gender.OTHER,
        length = 1.80,
        age = 17,
    ),
)

ds

eyeColor,name,gender,length,age
BLUE,Alice,FEMALE,1.7,25
BLUE,Bob,MALE,1.67,25
BROWN,Charlie,OTHER,1.8,17


The effects of operations such as filtering, sorting, and selecting columns can also be seen immediately.

In [7]:
ds.filter { it.age > 20 }

eyeColor,name,gender,length,age
BLUE,Alice,FEMALE,1.7,25
BLUE,Bob,MALE,1.67,25


In [8]:
ds.sort(col(Person::age), col(Person::length))

eyeColor,name,gender,length,age
BROWN,Charlie,OTHER,1.8,17
BLUE,Bob,MALE,1.67,25
BLUE,Alice,FEMALE,1.7,25
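
Since each of these operations returns a new Dataset, they can also be chained in one cell. The following is a minimal sketch that reuses `ds` and `Person` from above. With the sample data it should keep Alice and Bob and list Bob first, since he has the smaller `length`.

```kotlin
// Keep only people older than 20, then order them by length (ascending).
ds
    .filter { it.age > 20 }
    .sort(col(Person::length))
```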


In [9]:
val res: Dataset<Tuple2<Int, Double>> = ds.select(col(Person::age), col(Person::length))
res

age,length
25,1.7
25,1.67
17,1.8


In [10]:
"Average length: " +
    ds
        .map { it.length }
        .reduceK { a, b -> a + b } / ds.count()

Average length: 1.7233333333333334

Extension methods that usually only work in the `withSpark {}` context of the Kotlin Spark API work out of the box in Jupyter.
This means we can also create a Dataset like this:

In [11]:
listOf(1, 2, 3, 4).toDS()

value
1
2
3
4
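
For comparison, this is roughly what the same Dataset looks like outside Jupyter, where the `withSpark {}` context is needed. It is a sketch that assumes the Kotlin Spark API dependency is on the classpath of a regular Kotlin project.

```kotlin
// Assumption: the Kotlin Spark API library is available as a dependency.
import org.jetbrains.kotlinx.spark.api.*

fun main() = withSpark {
    // withSpark provides the Spark session that toDS() needs.
    // Nothing is rendered automatically here, so we call show() explicitly.
    listOf(1, 2, 3, 4).toDS().show()
}
```

In the notebook, both the surrounding `withSpark {}` block and the explicit `show()` call can be left out.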


We can also create RDDs using `sc: JavaSparkContext`; they are rendered similarly to Datasets.
You can see that all Tuple helper functions are immediately available too.

In [12]:
val rdd: JavaRDD<Tuple2<Int, String>> = rddOf(
    1 X "aaa",
    t(2, "bbb"),
    tupleOf(3, "cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc"),
)

rdd

Values
"[1, aaa]"
"[2, bbb]"
"[3, ccccccccccccccccccccccc..."
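
The three tuple notations used in the cell above are interchangeable. As a small sketch to try in a notebook cell (the names `a`, `b` and `c` are just for illustration):

```kotlin
// All three helpers build a scala.Tuple2<Int, String>.
val a: Tuple2<Int, String> = 1 X "aaa"
val b: Tuple2<Int, String> = t(1, "aaa")
val c: Tuple2<Int, String> = tupleOf(1, "aaa")

// Tuple2 compares by value, so this expression evaluates to true.
a == b && b == c
```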


Finally, we can also set `displayLimit` and `displayTruncate` on the fly using `sparkProperties`.

In [13]:
sparkProperties {
    displayLimit = 2
    displayTruncate = -1
}

rdd

Values
"[1, aaa]"
"[2, bbb]"
