## Overview
This notebook shows how to read and write JSON data in PySpark with the DataFrame API, using several methods and options.

#### **Contents :**
- Reading file via `DataFrameReader.json()` method 
- Reading file via `DataFrameReader.format().load()` method 
- Writing file via `DataFrame.write.json()` method
- Writing file via `DataFrame.write.format().save()` method

This is a **Python** notebook, so the default cell language is Python. You can switch languages within a cell by using the `%LANGUAGE` magic command: `%python`, `%scala`, `%sql` and `%r` are all supported, and `%fs` runs file system commands.

**Input JSON File Used :** 
- https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/flight-data/json/2011-summary.json

**Spark Read/Write Documentation Link**
- https://spark.apache.org/docs/latest/sql-data-sources.html

In [0]:
# check the files under FileStore
dbutils.fs.ls('FileStore/tables')

Out[1]: [FileInfo(path='dbfs:/FileStore/tables/2010_summary.csv', name='2010_summary.csv', size=7121, modificationTime=1728547018000),
 FileInfo(path='dbfs:/FileStore/tables/2010_summary_write/', name='2010_summary_write/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/tables/2010_summary_write.csv/', name='2010_summary_write.csv/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/tables/2010_summary_write_02/', name='2010_summary_write_02/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/tables/2011_summary.json', name='2011_summary.json', size=21301, modificationTime=1729515377000),
 FileInfo(path='dbfs:/FileStore/tables/2011_summary_write.json/', name='2011_summary_write.json/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/tables/NewFile/', name='NewFile/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/tables/RangeFile/', name='RangeFile/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/FileStore/tables/Rang

#### Reading file via `DataFrameReader.json()` method

In [0]:
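# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
# By default, each line must hold one complete JSON object (JSON Lines).

# Define read options.
# Note: "inferSchema" and "header" are CSV options. The JSON reader ignores them
# and infers the schema from the data, so only "mode" takes effect here.
options = {
    "inferSchema": "True",
    "header": "True",
    "mode": "FAILFAST"
}

file_path = "/FileStore/tables/2011_summary.json"
df = spark.read.options(**options).json(file_path)

# Display the first five rows and the inferred schema
display(df.head(5))
df.printSchema()  # prints directly and returns None, so no display() is needed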

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Saint Martin,2
United States,Guinea,2
United States,Croatia,1
United States,Romania,3
United States,Ireland,268


root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
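
The `"mode": "FAILFAST"` option in the cell above controls how the reader treats malformed records; the other two modes are `PERMISSIVE` (the default) and `DROPMALFORMED`. The sketch below is a plain-Python analogy of the three behaviours, not Spark's implementation. The helper name `parse_json_lines`, its fixed column list, and the deliberately truncated second record are assumptions made for illustration.

```python
import json

def parse_json_lines(lines, mode="PERMISSIVE",
                     columns=("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")):
    """Hypothetical helper that mimics Spark's JSON parse modes in plain Python."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("not a JSON object")
            rows.append({c: record.get(c) for c in columns})
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            if mode == "FAILFAST":
                # Abort on the first malformed record
                raise ValueError(f"Malformed record: {line!r}") from err
            if mode == "DROPMALFORMED":
                # Silently skip the malformed record
                continue
            # PERMISSIVE: keep the row, null out the fields, keep the raw text
            row = {c: None for c in columns}
            row["_corrupt_record"] = line
            rows.append(row)
    return rows

sample_lines = [
    '{"DEST_COUNTRY_NAME":"United States","ORIGIN_COUNTRY_NAME":"Saint Martin","count":2}',
    '{"DEST_COUNTRY_NAME":"United States","ORIGIN_COUNTRY_NAME":',  # truncated on purpose
]

permissive_rows = parse_json_lines(sample_lines, "PERMISSIVE")
dropped_rows = parse_json_lines(sample_lines, "DROPMALFORMED")
print(permissive_rows)
print(dropped_rows)
```

With `FAILFAST`, the same input raises an error instead of returning rows, which is why it is a reasonable choice when a bad record should stop the job rather than be hidden.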



#### Reading file via `DataFrameReader.format().load()` method

In [0]:
file_path = "/FileStore/tables/2011_summary.json"
file_type = "json"

# Reader options
infer_schema = "false"
first_row_is_header = "true"

# "inferSchema" and "header" are CSV options; the JSON reader ignores them and
# still infers the schema from the data. Only "mode" takes effect for JSON.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("mode", "FAILFAST") \
  .load(file_path)

# Display the first five rows and the inferred schema
display(df.head(5))
df.printSchema()  # prints directly and returns None, so no display() is needed


DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Saint Martin,2
United States,Guinea,2
United States,Croatia,1
United States,Romania,3
United States,Ireland,268


root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



#### Writing file via `DataFrame.write.json()` method

In [0]:
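# Read a JSON file by chaining multiple options
# ("inferSchema" and "header" are CSV options and are ignored by the JSON reader)
df_read = spark.read.option("inferSchema", True) \
                .option("header", True) \
                .json("/FileStore/tables/2011_summary.json")

# Display the first five rows and the inferred schema
display(df_read.head(5))
df_read.printSchema()  # prints directly and returns None, so no display() is needed

# Write the DataFrame as gzip-compressed JSON via the write.json() method.
# mode('ignore') skips the write without error if the target path already exists.
df_read.write.option("header", True) \
               .mode('ignore') \
               .option("compression", "gzip") \
               .json("/FileStore/tables/2011_summary_write.json")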

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Saint Martin,2
United States,Guinea,2
United States,Croatia,1
United States,Romania,3
United States,Ireland,268


root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
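
The write above uses `mode('ignore')`, one of the save modes accepted by `DataFrameWriter.mode()` alongside `append`, `overwrite`, and `error`/`errorifexists` (the default). The following is a plain-Python analogy of what each mode does when the target path already exists; it is not Spark code. The helper `apply_save_mode`, the part-file naming, and the sample payloads are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

def apply_save_mode(target: Path, payload: str, mode: str = "errorifexists") -> bool:
    """Hypothetical helper mimicking Spark save modes. Returns True if data was written."""
    valid = {"append", "overwrite", "ignore", "error", "errorifexists"}
    if mode not in valid:
        raise ValueError(f"unknown save mode: {mode}")
    if target.exists():
        if mode in ("error", "errorifexists"):
            raise FileExistsError(f"path {target} already exists")
        if mode == "ignore":
            return False  # leave existing data untouched, report nothing
        if mode == "overwrite":
            for old in target.iterdir():
                old.unlink()  # drop the existing part files first
    target.mkdir(parents=True, exist_ok=True)
    # "append" (and a first write) adds a new part file next to any existing ones
    next_index = len(list(target.iterdir()))
    (target / f"part-{next_index:05d}.json").write_text(payload + "\n")
    return True

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "2011_summary_write.json"
    first = apply_save_mode(out_dir, '{"count": 2}', "ignore")   # path is new: written
    second = apply_save_mode(out_dir, '{"count": 3}', "ignore")  # path exists: skipped
    files_after_ignore = sorted(p.name for p in out_dir.iterdir())
    apply_save_mode(out_dir, '{"count": 4}', "append")
    files_after_append = sorted(p.name for p in out_dir.iterdir())
    apply_save_mode(out_dir, '{"count": 5}', "overwrite")
    contents_after_overwrite = [p.read_text() for p in sorted(out_dir.iterdir())]

print(first, second, files_after_append, contents_after_overwrite)
```

One practical consequence: the directory listing at the top of this notebook already contains `2011_summary_write.json/`, so rerunning the `mode('ignore')` cell leaves the existing output unchanged.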



#### Writing file via `DataFrame.write.format().save()` method

In [0]:
# Read a JSON file by chaining multiple options
# ("inferSchema" and "header" are CSV options and are ignored by the JSON reader)
df_read = spark.read.option("inferSchema", True) \
                .option("header", True) \
                .json("/FileStore/tables/2011_summary.json")

# Keep only the rows whose origin country is India, then display them
df_less = df_read.where(df_read.ORIGIN_COUNTRY_NAME=='India')
display(df_less)

# Write the filtered DataFrame as gzip-compressed JSON via the format().save() method.
# mode('overwrite') replaces any data already stored at the target path.
df_less.write.format("json").option("header", True) \
                .mode('overwrite') \
                .option("compression", "gzip") \
                .save("/FileStore/tables/2011_summary_write.json")

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,India,76
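
Spark writes its output as a directory of part files rather than a single file, and with `compression` set to `gzip` each part is a gzip-compressed JSON Lines file. As a rough sketch of how such output could be inspected without Spark, the code below reads every `part-*.json.gz` file in a directory using only the standard library. The helper `read_gzipped_json_parts` is hypothetical, and the temporary directory and its single part file are stand-ins created here for the demonstration, not real Spark output.

```python
import gzip
import json
import tempfile
from pathlib import Path

def read_gzipped_json_parts(directory):
    """Hypothetical helper: read all gzip-compressed JSON Lines part files in a directory."""
    rows = []
    for part in sorted(Path(directory).glob("part-*.json.gz")):
        with gzip.open(part, "rt", encoding="utf-8") as handle:
            # One JSON object per non-empty line
            rows.extend(json.loads(line) for line in handle if line.strip())
    return rows

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "2011_summary_write.json"
    out_dir.mkdir()
    # Stand-in for a Spark part file (file name chosen for illustration)
    with gzip.open(out_dir / "part-00000.json.gz", "wt", encoding="utf-8") as handle:
        row = {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "India", "count": 76}
        handle.write(json.dumps(row) + "\n")
    # Marker file like the one Spark leaves next to the parts; the glob skips it
    (out_dir / "_SUCCESS").touch()
    rows = read_gzipped_json_parts(out_dir)

print(rows)
```

Inside Spark itself no extra step is needed: pointing `spark.read.json()` at the output directory reads all part files and decompresses them transparently.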


In [0]:
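# Reference example from the Spark documentation. `peopleDF` is not defined in this
# notebook; in the documentation it is created with
# spark.read.json("examples/src/main/resources/people.json").
# Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")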

# SQL statements can be run by using the sql methods provided by spark
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

# Alternatively, a DataFrame can be created for a JSON dataset represented by
# an RDD[String] storing one JSON object per string
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
# +---------------+----+
# |        address|name|
# +---------------+----+
# |[Columbus,Ohio]| Yin|
# +---------------+----+

In [0]:
df_read = spark.read.option("inferSchema", True) \
                .option("header", True) \
                .json("/FileStore/tables/2011_summary.json")

# Creates a temporary view using the DataFrame
df_read.createOrReplaceTempView("2011Summary")

In [0]:
# SQL statements can be run by using the sql methods provided by spark
dest_df = spark.sql("SELECT DISTINCT DEST_COUNTRY_NAME FROM 2011Summary WHERE count = 1")
dest_df.show()

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|            Yemen|
|             Togo|
|       The Gambia|
|    United States|
|          Belarus|
|            Malta|
|       Mauritania|
|          Georgia|
|            Libya|
|          Namibia|
|     Saint Martin|
|          Lebanon|
|        Greenland|
|          Vietnam|
+-----------------+



In [0]:
# Since this table is registered as a temp view, it will only be available to this notebook. If you'd like other users to be able to query this table, you can also create a table from the DataFrame.
# Once saved, this table will persist across cluster restarts and can be queried by other users from other notebooks.
# To do so, choose your table name and run the line below.

permanent_table_name = "2011_summary_table"

df_read.write.mode('overwrite').format('delta').saveAsTable(permanent_table_name)

In [0]:
# Alternatively, a DataFrame can be created for a JSON dataset represented by
# an RDD[String] storing one JSON object per string

jsonStrings = ['{"DEST_COUNTRY_NAME":"JAPAN", "ORIGIN_COUNTRY_NAME":"INDIA", "count":"7"}']
print(jsonStrings)

otherRDD = sc.parallelize(jsonStrings)
# spark.read.json() returns a DataFrame, not an RDD
new_df = spark.read.json(otherRDD)
new_df.show()

['{"DEST_COUNTRY_NAME":"JAPAN", "ORIGIN_COUNTRY_NAME":"INDIA", "count":"7"}']
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|            JAPAN|              INDIA|    7|
+-----------------+-------------------+-----+
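
One detail in the cell above is easy to miss: `"count":"7"` is a quoted JSON string, so the inferred column type is `string`, not `long` as in the file-based reads, even though `show()` prints it the same way. The plain-Python sketch below shows the same distinction with the standard `json` module; the variable names are illustrative. In Spark, the usual fix would be a cast such as `col("count").cast("long")`.

```python
import json

raw = '{"DEST_COUNTRY_NAME":"JAPAN", "ORIGIN_COUNTRY_NAME":"INDIA", "count":"7"}'
record = json.loads(raw)

# The quoted value is parsed as text, mirroring the string column Spark would infer
count_type = type(record["count"]).__name__

# Convert explicitly when a numeric value is needed
record["count"] = int(record["count"])
print(count_type, record["count"])
```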

