# Read and Write DataFrames with `PySpark`  
> [PySpark SQL module documentation](<https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html>)

In [2]:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

## Create a new `SparkSession`
if one doesn't already exist  
> [SparkSession Class Docs](<https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/sql/SparkSession.html>)

In [4]:
spark = SparkSession \
    .builder \
    .appName("Our first Spark SQL example") \
    .getOrCreate()

## Retrieve the Spark configuration settings

In [5]:
spark.sparkContext.getConf().getAll()

[('spark.driver.port', '33547'),
 ('spark.app.name', 'Our first Spark SQL example'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.id', 'local-1584618453379'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.host', '4c932bda68b4')]

## Display the `SparkSession` version
and other attributes:

In [6]:
spark

## Read a `json` file
from a subfolder of this notebook's folder; this creates a `DataFrame` object  
> [PySpark SQL DataFrame Documentation](<https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html#pyspark.sql.DataFrame>)

In [7]:
# set target file path
log_path = 'data/sparkify_log_small.json'

# read json data to "user_log" variable
user_log = spark.read.json(log_path)

## Check the `user_log` object type
It's a PySpark SQL `DataFrame`

In [8]:
type(user_log)

pyspark.sql.dataframe.DataFrame

## Use the DataFrame's `printSchema()` method
to print each column's name, data type, and nullability

In [10]:
user_log.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



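## Use `describe()`
to compute summary statistics (count, mean, stddev, min, max) for numeric and string columns. It returns a new `DataFrame` and is evaluated lazily, so the cell below only displays that DataFrame's column list; call `.show()` on the result to see the statistics: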

In [11]:
user_log.describe()

DataFrame[summary: string, artist: string, auth: string, firstName: string, gender: string, itemInSession: string, lastName: string, length: string, level: string, location: string, method: string, page: string, registration: string, sessionId: string, song: string, status: string, ts: string, userAgent: string, userId: string]

## Use `show(n=)`
to display the first "n" DataFrame rows

In [13]:
user_log.show(n=1)

+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|       artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|Showaddywaddy|Logged In|  Kenneth|     M|          112|Matthews|232.93342| paid|Charlotte-Concord...|   PUT|NextSong|1509380319284|     5132|Christmas Tears W...|   200|1513720872284|"Mozilla/5.0 (Win...|  1046|
+-------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+

## Use `take(n)`
to obtain a Python list of the first "n" rows as `Row` objects  
> [Row Class Documentation](<https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html#pyspark.sql.Row>)  
  
Notice that each `Row` holds its fields as `key=value` pairs

In [14]:
user_log.take(1)

[Row(artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046')]

## Set the data output path

In [22]:
out_path = 'data/sparkify_csv_log'

## Use `select()`
to retrieve specific columns as a new DataFrame. Column names are matched case-insensitively by default, so `"userID"` resolves to the `userId` column.

In [25]:
user_log.select("userID").show()

+------+
|userID|
+------+
|  1046|
|  1000|
|  2219|
|  2373|
|  1747|
|  1747|
|  1162|
|  1061|
|   748|
|   597|
|  1806|
|   748|
|  1176|
|  2164|
|  2146|
|  2219|
|  1176|
|  2904|
|   597|
|   226|
+------+
only showing top 20 rows



## Use the DataFrame's `write` property
along with the `save()` method to save the DataFrame's contents to external storage.

> `write` definition: Interface for saving the content of the non-streaming `DataFrame` out into external storage.

In [23]:
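# Save the "user_log" DataFrame as CSV under the folder specified in the
# "out_path" variable. Spark writes a folder of part files, not a single file.
user_log.write.save(
    path=out_path,
    format='csv',
    header=True,    # write the column names as the first line of each part file
    mode='append',  # add to existing data; re-running this cell duplicates rows
)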

## Read the saved CSV data
with the same `spark.read` interface used above for JSON, for example `spark.read.csv(out_path, header=True)`.