# Reading and Writing Data with Spark

This notebook contains the code from the previous screencast. The only difference is that the dataset is read from a local file instead of a remote cluster. You can see the file by clicking on the "jupyter" icon and opening the folder titled "data".

Run the code cells to see how everything works.

First, let's import the project settings and the `get_spark_session` helper.

In [None]:
from src.config import settings
from src.spark_lakehouse import get_spark_session

Since we're using Spark locally, we already have both a SparkContext and a SparkSession running. We can update some of the parameters, such as our application's name. Let's call it "data_inputs_and_outputs".

In [None]:
spark = get_spark_session("data_inputs_and_outputs")

Let's check whether the change went through.

In [None]:
spark.conf.getAll

As you can see, the app name is exactly how we set it.

Let's create our first dataframe from a fairly small sample dataset. Throughout the course we'll work with a log file dataset that describes user interactions with a music streaming service. The records describe events such as logging in to the site, visiting a page, listening to the next song, or seeing an ad.

In [None]:
# Use workspace data path so the local Spark driver can access the file
path = settings.SPARK_CLUSTER_DATA_DIR + "sparkify_log_small.json"

user_log = spark.read.json(str(path))
user_log.printSchema()
user_log.show(1)
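
By default, `spark.read.json` expects JSON Lines input: one complete JSON object per line, not a single top-level array. The cell below is a minimal plain-Python sketch of that layout using only the standard library. The field names and values are made up for illustration and are not taken from `sparkify_log_small.json`.

```python
import json
import os
import tempfile

# Hypothetical event records; field names and values are illustrative only.
events = [
    {"userId": "1", "page": "NextSong", "song": "Example Song"},
    {"userId": "2", "page": "Home", "song": None},
]

# Write one JSON object per line (the JSON Lines layout).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_log.json")
with open(sample_path, "w", encoding="utf-8") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Read the file back line by line; each line parses on its own.
with open(sample_path, encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]

print(len(parsed), "records; keys:", sorted(parsed[0]))
```

A file in this layout can be passed to `spark.read.json` as-is. A file holding one JSON document spread over several lines would need the reader's `multiLine` option instead.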

In [None]:
user_log.describe()

In [None]:
user_log.show(n=1)

In [None]:
user_log.take(5)

In [None]:
out_path = "data/sparkify_log_small.csv"

In [None]:
user_log.write.save(out_path, format="csv", header=True, mode="overwrite")

In [None]:
user_log_2 = spark.read.csv(out_path, header=True)

In [None]:
user_log_2.printSchema()
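
CSV files carry no type information, so a schema printed after a CSV round trip can differ from the one inferred from JSON. With only `header=True`, `spark.read.csv` reads every column as a string. The cell below is a rough standard-library analogy of that effect, not Spark code; the rows and column names are hypothetical.

```python
import csv
import io

# Hypothetical rows with mixed types; names and values are illustrative only.
rows = [
    {"userId": 1046, "length": 237.22, "page": "NextSong"},
    {"userId": 1000, "length": 195.23, "page": "NextSong"},
]

# Write the rows to an in-memory CSV with a header line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["userId", "length", "page"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back; every value comes back as text.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
restored_types = {k: type(v).__name__ for k, v in round_tripped[0].items()}
print(original_types)
print(restored_types)
```

In Spark, passing `inferSchema=True` or an explicit schema to `spark.read.csv` is the usual way to get typed columns back.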

In [None]:
user_log_2.take(2)

In [None]:
user_log_2.select("userID").show()

In [None]:
user_log_2.take(1)