
1. What is JSON data, and how do you read it in Spark?
2. What if every line has 3 keys except one line that has 4?
3. What are multiline and line-delimited JSON?
4. Which one is faster: multiline or line-delimited?
5. How do you convert nested JSON into a Spark DataFrame?
6. What happens if the JSON file is corrupted or invalid?

What is JSON data
------------------

JSON, short for JavaScript Object Notation, is a lightweight data interchange format. It uses a human-readable text format to represent data objects consisting of 
key-value pairs. 
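
To make the key-value structure concrete, here is a minimal plain-Python sketch that parses one JSON object with the standard-library `json` module. The record is hypothetical; its field names simply mirror the sample data used in this notebook.

```python
import json

# A hypothetical record; field names mirror the sample data in this notebook.
raw = '{"name": "Manish", "age": 20, "salary": 20000}'

record = json.loads(raw)   # JSON object -> Python dict of key-value pairs
keys = sorted(record)      # the keys, in alphabetical order

print(keys)
print(record["name"], record["age"], record["salary"])
```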

***See the figure below, where Mahish gives a brief introduction to JSON data***

<img src="https://drive.google.com/uc?id=1f-9OKUvPAwIZgI86tqy9fleFaBRu1jPD" width="700" height="500">

[CLICK HERE](https://chat.openai.com/share/59558685-6f99-46b8-80ec-de51a6be3f71) to learn more about the benefits of JSON over CSV (Thanks ChatGPT💗)

In [0]:
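# Read a line-delimited JSON file in PySpark
# Note: the JSON reader infers the schema by default; as far as I know, "inferSchema"
# is a CSV option and has no effect here.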

line_delimited_df = spark.read.format("json")\
                              .option("inferSchema","true")\
                              .option("mode","PERMISSIVE")\
                              .load('/FileStore/tables/line_delimited_json.json')
line_delimited_df.show()

+---+--------+------+
|age|    name|salary|
+---+--------+------+
| 20|  Manish| 20000|
| 25|  Nikita| 21000|
| 16|  Pritam| 22000|
| 35|Prantosh| 25000|
| 67|  Vikash| 40000|
+---+--------+------+



In [0]:
# What if every line has 3 keys except one line that has 4?
line_delimited_extrafield_df = spark.read.format("json")\
                              .option("inferSchema","true")\
                              .option("mode","PERMISSIVE")\
                              .load('/FileStore/tables/line_delimited_json_extrafield.json')
line_delimited_extrafield_df.show()

# Takeaway: if a field appears in only some records, it still becomes a column,
# and the records that lack it get null in that column.

+---+------+--------+------+
|age|gender|    name|salary|
+---+------+--------+------+
| 20|  null|  Manish| 20000|
| 25|  null|  Nikita| 21000|
| 16|  null|  Pritam| 22000|
| 35|  null|Prantosh| 25000|
| 67|     M|  Vikash| 40000|
+---+------+--------+------+
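
The behaviour in the output above can be imitated in plain Python. This is only a sketch of the idea (the column set is the union of all keys, and missing values become null), not Spark's actual implementation, and the two input lines are hypothetical.

```python
import json

# Hypothetical line-delimited input: only the last record has "gender".
lines = [
    '{"name": "Manish", "age": 20, "salary": 20000}',
    '{"name": "Vikash", "age": 67, "salary": 40000, "gender": "M"}',
]
records = [json.loads(line) for line in lines]

# The "schema" is the union of all keys, sorted like the columns in the output above.
columns = sorted({key for rec in records for key in rec})

# Records that lack a column get None, which Spark displays as null.
rows = [[rec.get(col) for col in columns] for rec in records]

print(columns)
for row in rows:
    print(row)
```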



What is Multiline and line-delimited JSON?
-------------------------------------------
In PySpark (the Python API for Apache Spark), multiline JSON and line-delimited JSON are two common formats used for reading 
and writing JSON data.

***Multiline JSON:*** 

Multiline JSON is a format where a single JSON object (or an array of objects) spans multiple lines in a file. This format is useful 
when dealing with large or pretty-printed JSON objects, or when the whole file is one JSON document.

***Line-delimited JSON:***

Line-delimited JSON, also known as JSON Lines or newline-delimited JSON, is a format where each line of a file contains a 
standalone JSON object. It's a compact and efficient way to store multiple JSON documents in a single file.

 [Source](https://chat.openai.com/share/fe6d4e3e-5379-44d5-987c-7ed29dd3de37)
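
The difference between the two layouts is easiest to see side by side. The sketch below uses only the standard-library `json` module and hypothetical data: the same two records are written once as line-delimited JSON and once as a multiline array, and both parse to the same result.

```python
import json

# Line-delimited: one complete JSON object per line (hypothetical data).
line_delimited = (
    '{"name": "Manish", "age": 20}\n'
    '{"name": "Nikita", "age": 25}\n'
)

# Multiline: a single JSON document (an array here) spread over many lines.
multiline = """[
  {
    "name": "Manish",
    "age": 20
  },
  {
    "name": "Nikita",
    "age": 25
  }
]"""

# Line-delimited input can be parsed one line at a time.
from_lines = [json.loads(line) for line in line_delimited.splitlines()]

# Multiline input only makes sense when the whole text is parsed at once.
from_document = json.loads(multiline)

same_records = from_lines == from_document
print(same_records)
```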

In [0]:
# Reading Multiline JSON

# Note:
# By default, the PySpark JSON reader expects each line of the file to hold one complete JSON record.
# If a record spans several lines, set the multiline option to true so that the file is parsed
# as a whole JSON document instead of line by line.

multiline_json_df = spark.read.format("json")\
                              .option("inferSchema","true")\
                              .option("mode","PERMISSIVE")\
                              .option("multiline","true")\
                              .load('/FileStore/tables/Multi_line_correct.json')
# The multiline option defaults to false. Without it, every line is treated as a separate record,
# so this file would not parse correctly and show() would raise an exception.
multiline_json_df.show()


+---+--------+------+
|age|    name|salary|
+---+--------+------+
| 20|  Manish| 20000|
| 25|  Nikita| 21000|
| 16|  Pritam| 22000|
| 35|Prantosh| 25000|
| 67|  Vikash| 40000|
+---+--------+------+



Which is faster, multiline or line-delimited JSON, and why?
-----------------------------------------------------------

<img src="https://drive.google.com/uc?id=1UJsdHNweUYatVm2yTF4ke4GQAi3W2mPy" width="500" height="300">
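
A commonly given explanation is that a line-delimited file can be cut at any newline and the pieces parsed independently (and therefore in parallel), whereas a multiline file has to be parsed as one unit. The plain-Python sketch below illustrates only that property. It is not Spark code, the data is hypothetical, and `parse_chunk` is a helper name made up for this example.

```python
import json

# Hypothetical data: the same four records in both layouts.
line_delimited = "\n".join(json.dumps({"id": i}) for i in range(4))
multiline = json.dumps([{"id": i} for i in range(4)], indent=2)


def parse_chunk(chunk):
    """Parse a chunk made of whole lines, one JSON object per line."""
    return [json.loads(line) for line in chunk.splitlines() if line.strip()]


# Cut the line-delimited text at a newline: each half parses on its own,
# so two workers could handle the halves independently.
lines = line_delimited.splitlines(keepends=True)
first_half, second_half = "".join(lines[:2]), "".join(lines[2:])
parsed = parse_chunk(first_half) + parse_chunk(second_half)

# Cut the multiline document at a newline: half of it is not valid JSON by itself.
doc_lines = multiline.splitlines(keepends=True)
half_document = "".join(doc_lines[: len(doc_lines) // 2])
try:
    json.loads(half_document)
    half_is_valid = True
except json.JSONDecodeError:
    half_is_valid = False

print(parsed)
print(half_is_valid)
```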

In [0]:
# Reading an incorrectly formatted multiline JSON file
multiline_incorrect_json_df = spark.read.format("json")\
                              .option("inferSchema","true")\
                              .option("mode","PERMISSIVE")\
                              .option("multiline","true")\
                              .load('/FileStore/tables/Multi_line_incorrect.json')

# Only the first record is shown because the JSON objects in this file are not wrapped in an array.
# With multiline enabled, the file is parsed as a single JSON document, so several objects must be
# enclosed in square brackets ([]) to be read as separate records. Here parsing stops after the first object.

multiline_incorrect_json_df.show()

+---+------+------+
|age|  name|salary|
+---+------+------+
| 20|Manish| 20000|
+---+------+------+
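
The "first record only" result can be reproduced with a parser that reads a single JSON value from the start of the text. The sketch below uses `json.JSONDecoder.raw_decode` from the standard library on hypothetical file content. It assumes Spark's multiline reader behaves in a similar way, which matches the output above but is not taken from Spark's source.

```python
import json

# Hypothetical file content: two objects that are NOT wrapped in [ ].
text = """{
  "name": "Manish",
  "age": 20
},
{
  "name": "Nikita",
  "age": 25
}"""

decoder = json.JSONDecoder()
# raw_decode parses one JSON value from the start and reports where it ended.
first_value, end = decoder.raw_decode(text)
leftover = text[end:].strip()   # everything after the first object is left unread

# Wrapping the same text in square brackets makes it one valid JSON array.
all_values = json.loads("[" + text + "]")

print(first_value)
print(len(all_values))
```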



In [0]:
# Reading corrupted JSON file
corrupted_json_df = spark.read.format("json")\
                              .option("inferSchema","true")\
                              .option("mode","PERMISSIVE")\
                              .load('/FileStore/tables/corrupted_json.json')
# In PERMISSIVE mode, Spark adds a _corrupt_record column that holds the raw text of each malformed record.
# For valid records, _corrupt_record is null; for the malformed record, all other columns are null.
corrupted_json_df.show(truncate=False)

+----------------------------------------+----+--------+------+
|_corrupt_record                         |age |name    |salary|
+----------------------------------------+----+--------+------+
|null                                    |20  |Manish  |20000 |
|null                                    |25  |Nikita  |21000 |
|null                                    |16  |Pritam  |22000 |
|null                                    |35  |Prantosh|25000 |
|{"name":"Vikash","age":67,"salary":40000|null|null    |null  |
+----------------------------------------+----+--------+------+
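
The PERMISSIVE behaviour shown above can be imitated in a few lines of plain Python. This is a sketch of the idea only, not Spark's implementation. The input lines are hypothetical, and `parse_permissive` and its `columns` parameter are names made up for this example.

```python
import json

# Hypothetical line-delimited input; the last line lacks its closing brace.
lines = [
    '{"name":"Prantosh","age":35,"salary":25000}',
    '{"name":"Vikash","age":67,"salary":40000',
]


def parse_permissive(line, columns=("age", "name", "salary")):
    """Return a row dict; a malformed line keeps its raw text in _corrupt_record."""
    row = {"_corrupt_record": None, **{col: None for col in columns}}
    try:
        row.update(json.loads(line))
    except json.JSONDecodeError:
        # Keep the bad line instead of failing the whole read.
        row["_corrupt_record"] = line
    return row


rows = [parse_permissive(line) for line in lines]
for row in rows:
    print(row)
```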

