# Reading and Writing Data with Spark

In [1]:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

Since we're running Spark locally, we already have both a SparkContext and a SparkSession running. We can update some of the parameters, such as our application's name. Let's call it "Our first Python Spark SQL example".

In [2]:
spark = SparkSession.builder.getOrCreate()
#spark = pyspark.SparkContext.getOrCreate()

Let's check whether the change went through.

In [3]:
spark.sparkContext.getConf().getAll()

[('spark.driver.extraJavaOptions',
  '-XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED'),
 ('spark.driver.host', '6ca9d335e183'),
 ('spark.executor.id', 'driver'),
 ('spark.app.name', 'pyspark-shell'),
 ('spark.app.submitTime', '1671043038877'),
 ('spark.driver.port', '39861'),
 ('spark.rdd.compress', 'True'),
 ('spark.ex

In [4]:
spark

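The configuration listed above reports `spark.app.name` as `pyspark-shell`, which is the default. The builder call in cell 2 did not pass a name, so the application name is unchanged.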

Let's create our first DataFrame from a fairly small sample data set. Throughout the course we'll work with a log file data set that describes user interactions with a music streaming service. The records describe events such as logging in to the site, visiting a page, listening to the next song, and seeing an ad.

In [5]:
path = "/FileStore/tables/music_log_small.json"
# path = "data/music_log_small.json"
user_log = spark.read.json(path)

AnalysisException: Path does not exist: file:/FileStore/tables/music_log_small.json

In [None]:
user_log.printSchema()

In [None]:
# describe() computes summary statistics for the columns, so it can be slow on the full logs
user_log.describe()

In [None]:
user_log.show(n=1)

In [None]:
user_log.take(1)

In [None]:
out_path = "/data/music_log.csv"

In [None]:
user_log.limit(10).write.mode('overwrite').csv(out_path, header=True)

In [None]:
user_log_2 = spark.read.csv(out_path, header=True)

In [None]:
user_log_2.printSchema()

In [None]:
user_log_2.take(2)

In [None]:
user_log_2.select("userID").show()

In [None]:
user_log_2.take(1)