## SparkSession and SparkContext

SparkSession is accessible via the `spark` object and SparkContext is accessible via the `spark.sparkContext` object.

SparkContext gives access to the lower-level API (RDDs).

Explore the methods available on each object using autocomplete.

In [0]:
# Spark Session object
spark

In [0]:
# Spark Context object
spark.sparkContext

In [0]:
# or simply
sc

## RDDs vs DataFrames

The following example illustrates the expressiveness of the RDD and DataFrame APIs.

### RDD

In [0]:
# synthetic input data
# array of tuples (name, age)
data = [("Brooke", 20), ("Denny", 20), ("Jules", 30), ("TD", 35), ("Brooke", 25)]

dataRDD = sc.parallelize(data)

dataRDD.take(4)

Q1: Has dataRDD been loaded into memory?

Q2: What does the code below do?

In [0]:
# operate on the RDD
agesRDD = (
    dataRDD
    .map(lambda x: (x[0], (x[1], 1)))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .map(lambda x: (x[0], x[1][0]/x[1][1]))
)

agesRDD.take(4)
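
To check an answer to Q2, the three RDD steps can be traced in plain Python without Spark. This is only a sketch of the logic on a single machine: the names `pairs`, `totals`, and `avg_ages` are illustrative and not part of the notebook, and the loop stands in for what `reduceByKey` does across partitions.

```python
from collections import defaultdict

# Same synthetic (name, age) records as in the RDD cell
data = [("Brooke", 20), ("Denny", 20), ("Jules", 30), ("TD", 35), ("Brooke", 25)]

# Step 1 (first map): pair each name with (age, 1)
pairs = [(name, (age, 1)) for name, age in data]

# Step 2 (reduceByKey): per name, add up the ages and the counts
totals = defaultdict(lambda: (0, 0))
for name, (age, count) in pairs:
    age_sum, n = totals[name]
    totals[name] = (age_sum + age, n + count)

# Step 3 (second map): divide the summed ages by the count
avg_ages = {name: age_sum / n for name, (age_sum, n) in totals.items()}

print(avg_ages)
# {'Brooke': 22.5, 'Denny': 20.0, 'Jules': 30.0, 'TD': 35.0}
```

Carrying the count alongside the sum is what lets the reduction combine partial results in any order, which a direct average would not allow.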

In [0]:
# Inspect the resulting DAG in the Spark UI
agesRDD.collect()

In [0]:
# Try again
# Inspect the resulting DAG using the UI tool
# What is the difference?
agesRDD.collect()

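A skipped stage indicates that Spark reused the output of an earlier run instead of recomputing it.

Whenever a shuffle is involved (for example, grouping or partitioning by key), Spark keeps the intermediate shuffle output it generated, so a later job over the same lineage can read it back (credit: https://stackoverflow.com/a/34581152/11217701).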

### DataFrame

The same computation as above, now using the DataFrame API.

In [0]:
import pyspark.sql.functions as fn

In [0]:
# create DataFrame
data_df = spark.createDataFrame(data, schema=["name", "age"])
data_df

In [0]:
data_df.show()

In [0]:
avg_df = data_df.groupBy("name").agg(fn.avg("age"))
avg_df

In [0]:
avg_df.show()

**Q:** Which code is more readable?