# Reading from CSV / JSON files and writing to Parquet files

This sample code reads a few fields from nested JSON and creates a dataframe.

It then writes the dataframe to storage.

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import *

In [2]:
spark = SparkSession.builder.appName('02 reading').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/02/15 09:03:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


The Spark UI is available at http://localhost:4040 when running locally on a PC.

# Read nested json into a dataframe

HINT: During testing, create a tiny jsonl file so reading is fast. I did `head -n 12 the-file.json > test_12.json`
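If `head` is not available (for example on Windows), the same trick can be sketched in plain Python. The helper name `write_head` is my own, not part of any library, and it assumes the input has one JSON record per line.

```python
from itertools import islice


def write_head(src, dst, n=12):
    """Copy the first n lines of src to dst, like `head -n`."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        # islice stops after n lines, so the big file is never fully read
        fout.writelines(islice(fin, n))
    return dst
```

Calling `write_head("the-file.json", "test_12.json")` should produce the same small file as the shell command above.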

In [None]:
# https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/
# https://bigdataprogrammers.com/read-nested-json-in-spark-dataframe/
# https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/
from pyspark.sql.types import MapType
fname = "../data/sample.json"

# Note: by default, the schema is inferred from the data.
# This is slower and may sometimes fail on bad input files.
# A possible workaround is to read a short, well-defined file, extract its schema,
# and then read the full file using that schema (fname_ref is the path of the short file):
# inferred = spark.read.json(fname_ref)
# inferred.printSchema()
# df = spark.read.schema(inferred.schema).json(fname)
df = spark.read.json(fname)
df.show()

In [None]:
# Now select a few columns of our choice. Note the dotted path used to reach the nested field.
subset = df.select('address.zip', 'name')
subset.show(4)

If the notebook runs inside a Docker container, the container needs access to the data directory on the host.

For example, create a directory on the host and configure it in docker-compose (for instance as a mounted volume).

## Reading multiple files at a time
https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/

The read.json() method can also read multiple JSON files from different paths in one call: pass all the fully qualified file names together, for example as a list of paths.

# Writing the dataframe to storage

What if you want to persist (save the values of) a DF?
It can be saved to a database (covered in another lesson) or to a file in the file system.
The Parquet format is very efficient for this, as the example below shows.


For example, in one test I read a jsonl file (602MB) into a DF, then wrote it to a Parquet file (actually, Spark creates a directory with several files).
The Parquet output is compressed, so the saved data takes only 92MB on disk.
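Because the Parquet output is a directory, its size has to be summed over all the files inside it. A small helper along these lines can do that; the name `dir_size_mb` is my own and it uses only the standard library.

```python
import os


def dir_size_mb(path):
    """Return the total size of all files under path, in MB (1 MB = 1024 * 1024 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 * 1024)
```

For example, `dir_size_mb("./testing_noam.parquet")` can be compared with the size of the original input file once the write cell below has run.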


In [None]:
%%time 
# Read a CSV into a dataframe, inferring the schema.
dataPath = "../data/Open_Parking_and_Camera_Violations_100.csv"
fines = spark.read.format("csv")\
  .option("header","true")\
  .option("inferSchema", "true")\
  .load(dataPath)
  

In [None]:
%%time
# The output path must not already exist.
# Column names must not include spaces (or some other special characters).
fines.select(['Plate','Amount Due']).withColumnRenamed('Amount Due','AmountDue').write.parquet("./testing_noam.parquet")
# On my Docker/PC setup, saving a 250K-line DF took about 9 seconds
# (CPU times: user 16 µs, sys: 3.98 ms, total: 4 ms, Wall time: 8.6 s),
# and reading it back took 2.5 seconds.

In [None]:
%%time 
# read the DF from the parquet file:
restored_df = spark.read.parquet("./testing_noam.parquet")

In [None]:
restored_df.count()

In [None]:
%%time
restored_df.select(f.max("Plate")).collect()

### Repartition before writing to storage

Spark's DataFrameWriter provides a partitionBy method, which partitions the data on write. It splits the output into separate subdirectories, one per distinct value of the provided columns. [2]