# Working with JSON data in PySpark

- JSON stands for JavaScript Object Notation. It is a long-standing data interchange format that became massively popular for its readability and relatively small size.

- JSON consists of key-value pairs. Keys are always strings, and values can be numbers, Booleans, strings, null, arrays, or nested objects.
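
To make the key-value structure concrete, here is a minimal sketch that parses a small JSON document with Python's standard json module and prints the Python type of each value. The document itself is hypothetical; it only borrows a few field names from the data used later in this notebook.

```python
import json

# Hypothetical document, made up for illustration only.
document = """
{
  "name": "Example Show",
  "id": 1,
  "rating": 8.5,
  "ended": true,
  "webChannel": null,
  "genres": ["Comedy", "Drama"],
  "schedule": {"time": "22:00", "days": ["Sunday"]}
}
"""

# json.loads turns a JSON object into a Python dictionary.
record = json.loads(document)

for key, value in record.items():
    print(f"{key!r}: {type(value).__name__}")
```

Every key comes back as a Python string, while the values map to int, float, bool, None, list, and dict.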

## 1. Data Ingestion

In [1]:
# Creating a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [2]:
# Loading a file
shows = spark.read.json("./data/shows/shows-silicon-valley.json") 
shows.count()

1

In [3]:
# Loading many files 
three_shows = spark.read.json("./data/shows/shows-*.json", multiLine=True)
three_shows.count()

3
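
By default, spark.read.json expects one complete JSON document per line (the JSON Lines layout), whereas multiLine=True tells the reader to parse each file as a single document that may span several lines. The sketch below illustrates the difference in plain Python with made-up text. It is only an analogy: Spark handles malformed lines according to its mode option instead of raising the way json.loads does.

```python
import json

# One complete JSON document per line (the JSON Lines layout).
json_lines_text = '{"id": 1, "name": "A"}\n{"id": 2, "name": "B"}\n'

# A single pretty-printed document spread over several lines.
multi_line_text = '{\n  "id": 3,\n  "name": "C"\n}\n'


def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Line-by-line parsing works for the JSON Lines layout.
records = parse_line_by_line(json_lines_text)

# The same strategy cannot make sense of a pretty-printed document,
# because no single line is a complete JSON value.
try:
    parse_line_by_line(multi_line_text)
    line_by_line_ok = True
except json.JSONDecodeError:
    line_by_line_ok = False

# Parsing the whole text at once (the multiLine idea) succeeds.
whole_document = json.loads(multi_line_text)

print(len(records), line_by_line_ok, whole_document)
```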

## 2. Breaking the second dimension with complex data types

In [4]:
shows.printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: string (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true

In [5]:
print(shows.columns)

['_embedded', '_links', 'externals', 'genres', 'id', 'image', 'language', 'name', 'network', 'officialSite', 'premiered', 'rating', 'runtime', 'schedule', 'status', 'summary', 'type', 'updated', 'url', 'webChannel', 'weight']


### 2.1 When you have more than one value: The array

PySpark arrays are containers for values of the same type. To experiment with arrays, let's select a small subset of the shows data frame.

In [6]:
# Selecting two columns
array_subset = shows.select("name", "genres")
array_subset.show()

+--------------+--------+
|          name|  genres|
+--------------+--------+
|Silicon Valley|[Comedy]|
+--------------+--------+



To get at the values inside an array, you need to extract them. There are several equivalent ways to extract elements from an array, as the next cell shows.

In [7]:
import pyspark.sql.functions as F
array_subset = array_subset.select(
    "name",
    array_subset.genres[0].alias("dot_and_index"), 
    F.col("genres")[0].alias("col_and_index"),
    array_subset.genres.getItem(0).alias("dot_and_method"), 
    F.col("genres").getItem(0).alias("col_and_method"),
)
array_subset.show()

+--------------+-------------+-------------+--------------+--------------+
|          name|dot_and_index|col_and_index|dot_and_method|col_and_method|
+--------------+-------------+-------------+--------------+--------------+
|Silicon Valley|       Comedy|       Comedy|        Comedy|        Comedy|
+--------------+-------------+-------------+--------------+--------------+



Let's take a look at how to perform multiple operations on an array column.

In [8]:
array_subset_repeated = array_subset.select(
    "name",
    # lit() creates a literal (constant) column.
    F.lit("Comedy").alias("one"),
    F.lit("Horror").alias("two"),
    F.lit("Drama").alias("three"),
    F.col("dot_and_index"),
).select(
    "name",
    # array() builds an array column from the given columns.
    F.array("one", "two", "three").alias("Some_Genres"), 
    # array_repeat() builds an array that repeats the value five times.
    F.array_repeat("dot_and_index", 5).alias("Repeated_Genres"), 
)
array_subset_repeated.show(1, False)

+--------------+-----------------------+----------------------------------------+
|name          |Some_Genres            |Repeated_Genres                         |
+--------------+-----------------------+----------------------------------------+
|Silicon Valley|[Comedy, Horror, Drama]|[Comedy, Comedy, Comedy, Comedy, Comedy]|
+--------------+-----------------------+----------------------------------------+



In [9]:
# Using the size function
array_subset_repeated.select(
    "name", 
    # size() returns the number of elements in an array.
    F.size("Some_Genres"), F.size("Repeated_Genres") 
).show()

+--------------+-----------------+---------------------+
|          name|size(Some_Genres)|size(Repeated_Genres)|
+--------------+-----------------+---------------------+
|Silicon Valley|                3|                    5|
+--------------+-----------------+---------------------+



In [10]:
# Using the array_distinct function
array_subset_repeated.select(
    "name",
    # array_distinct() removes duplicate values from an array.
    F.array_distinct("Some_Genres"),
    F.array_distinct("Repeated_Genres"), 
).show(1, False)

+--------------+---------------------------+-------------------------------+
|name          |array_distinct(Some_Genres)|array_distinct(Repeated_Genres)|
+--------------+---------------------------+-------------------------------+
|Silicon Valley|[Comedy, Horror, Drama]    |[Comedy]                       |
+--------------+---------------------------+-------------------------------+



In [11]:
# Using the array_intersect function
array_subset_repeated = array_subset_repeated.select(
    "name",
    # array_intersect() keeps only the values present in both arrays.
    F.array_intersect("Some_Genres", "Repeated_Genres").alias("Genres"),
)
array_subset_repeated.show()

+--------------+--------+
|          name|  Genres|
+--------------+--------+
|Silicon Valley|[Comedy]|
+--------------+--------+



In [12]:
# Using the array_position function
array_subset_repeated.select(
    "Genres", 
    # array_position() returns the 1-based position of a value in an array (0 if absent).
    F.array_position("Genres", "Comedy")
).show()

+--------+------------------------------+
|  Genres|array_position(Genres, Comedy)|
+--------+------------------------------+
|[Comedy]|                             1|
+--------+------------------------------+



### 2.2 The map type: Keys and values within a column

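A map has keys and values, just like a Python dictionary. Unlike a dictionary, all keys of a PySpark map must share one type, and all values must share one type. One of the easiest ways to create a map is from two columns of type array.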

In [13]:
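# Creating a map
columns = ["name", "language", "type"]
# First, create one literal column per key name, plus an array of the matching values.
shows_map = shows.select(
    *[F.lit(column) for column in columns],
    F.array(*columns).alias("values"),
)
# Each literal column is named after its value, so array() now packs the key names.
shows_map = shows_map.select(F.array(*columns).alias("keys"), "values")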
shows_map.show(1)

+--------------------+--------------------+
|                keys|              values|
+--------------------+--------------------+
|[name, language, ...|[Silicon Valley, ...|
+--------------------+--------------------+



In [14]:
shows_map = shows_map.select(
    F.map_from_arrays("keys", "values").alias("mapped")
)

In [15]:
shows_map.printSchema()

root
 |-- mapped: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



In [16]:
shows_map.show(1, False)

+---------------------------------------------------------------+
|mapped                                                         |
+---------------------------------------------------------------+
|[name -> Silicon Valley, language -> English, type -> Scripted]|
+---------------------------------------------------------------+
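
As a mental model, map_from_arrays pairs the first key with the first value, the second key with the second value, and so on. The plain-Python sketch below is an analogy, not PySpark code: it does the same pairing with dict(zip(...)), using the keys and values displayed in the output above.

```python
# Plain-Python analogy of map_from_arrays: pair keys and values by position.
keys = ["name", "language", "type"]
values = ["Silicon Valley", "English", "Scripted"]

mapped = dict(zip(keys, values))

# Looking up a key mirrors mapped["name"] on the PySpark column.
print(mapped["name"])  # Silicon Valley
```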



In [17]:
shows_map.select(
    F.col("mapped.name"), 
    F.col("mapped")["name"], 
    shows_map.mapped["name"], 
).show()

+--------------+--------------+--------------+
|          name|  mapped[name]|  mapped[name]|
+--------------+--------------+--------------+
|Silicon Valley|Silicon Valley|Silicon Valley|
+--------------+--------------+--------------+



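You can get the value corresponding to a key using the dot notation within the col function, or by passing the key in brackets to the column object, as in the three equivalent selections above.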


## 3. The struct: Nesting columns within columns

The struct is similar to a JSON object: the key (or name) of each pair is a string, and each field can be of a different type. Let's select a subset from the dataset.

In [19]:
shows.select("schedule").printSchema()

root
 |-- schedule: struct (nullable = true)
 |    |-- days: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- time: string (nullable = true)



Here, the schedule column is a struct. The struct contains two named fields: days (an array of strings) and time (a string). As a result, you can think of the struct as **a small data frame** within your column records.

### 3.1 Navigating structs as if they were nested columns

Let's look at how to extract values from nested structs inside a data frame using the _embedded column.

In [20]:
shows.select(F.col("_embedded")).printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: string (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true

The _embedded column contains a single field, episodes. Let's access this field using the dot notation.

In [21]:
shows_clean = shows.withColumn("episodes", F.col("_embedded.episodes")).drop("_embedded")
shows_clean.printSchema()

root
 |-- _links: struct (nullable = true)
 |    |-- previousepisode: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |    |-- self: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |-- externals: struct (nullable = true)
 |    |-- imdb: string (nullable = true)
 |    |-- thetvdb: long (nullable = true)
 |    |-- tvrage: long (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: long (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- medium: string (nullable = true)
 |    |-- original: string (nullable = true)
 |-- language: string (nullable = true)
 |-- name: string (nullable = true)
 |-- network: struct (nullable = true)
 |    |-- country: struct (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- timezone: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nul

Here, we dropped the _embedded column and promoted its only field (episodes) to a top-level column. Now, let's select a field inside the array of structs to create a new column.

In [22]:
episodes_name = shows_clean.select(F.col("episodes.name"))
episodes_name.printSchema()

root
 |-- name: array (nullable = true)
 |    |-- element: string (containsNull = true)
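
Selecting episodes.name reaches into every struct of the array and returns an array that holds only that field, which is why the schema above shows an array of strings. A rough plain-Python analogy, using hypothetical records where the array of structs becomes a list of dictionaries, is a list comprehension:

```python
# Hypothetical stand-in for one row: an array of structs as a list of dicts.
row = {
    "episodes": [
        {"id": 1, "name": "First Episode"},
        {"id": 2, "name": "Second Episode"},
    ]
}

# Equivalent in spirit to F.col("episodes.name"): keep one field per element.
names = [episode["name"] for episode in row["episodes"]]

print(names)  # ['First Episode', 'Second Episode']
```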



## 4. Building and using the data frame schema

This section covers how to define and use a schema with a PySpark data frame.

### 4.1 Using Spark types as the base blocks of a schema

In this subsection, I explain the column types in the context of a schema definition. The data types we use to build a schema are located in the pyspark.sql.types module. Let's import this module.

In [23]:
import pyspark.sql.types as T

Within pyspark.sql.types, there are two main kinds of objects. First, you have the type objects, such as StringType() and LongType(), each of which represents a column of a certain type. Second, you have the field object, StructField().

A StructField() takes two mandatory and two optional parameters:
- The name of the field, passed as a string
- The dataType of the field, passed as a type object
- (Optional) A nullable flag, which determines if the field can be null or not (by default True)
- (Optional) A metadata dictionary that contains arbitrary information, such as the column metadata used by ML pipelines

Let's take a look at the schema for the _embedded field.

In [24]:
# The _links field contains a self struct that itself contains a single-string field: href.
episode_links_schema = T.StructType(
    [T.StructField("self", T.StructType([T.StructField("href", T.StringType())]))]
)

In [25]:
# The image field is a struct of two string fields: medium and original.
episode_image_schema = T.StructType(
    [
        T.StructField("medium", T.StringType()), 
        T.StructField("original", T.StringType()),
    ]
) 

In [26]:
episode_schema = T.StructType(
    [
        T.StructField("_links", episode_links_schema), 
        T.StructField("airdate", T.DateType()),
        T.StructField("airstamp", T.TimestampType()),
        T.StructField("airtime", T.StringType()),
        T.StructField("id", T.StringType()),
        T.StructField("image", episode_image_schema), 
        T.StructField("name", T.StringType()),
        T.StructField("number", T.LongType()),
        T.StructField("runtime", T.LongType()),
        T.StructField("season", T.LongType()),
        T.StructField("summary", T.StringType()),
        T.StructField("url", T.StringType()),
    ]
)

In [27]:
embedded_schema = T.StructType(
    [
        T.StructField(
            "_embedded",
            T.StructType(
                [
                    T.StructField("episodes", T.ArrayType(episode_schema))
                ]
            ),
        )
    ]
)

### 4.2 Reading a JSON document with a strict schema in place

In this section I'm going to talk about how to read a JSON document while enforcing a precise schema. Let's read a JSON document with an explicit partial schema.

In [28]:
shows_with_schema = spark.read.json(
    "./data/shows/shows-silicon-valley.json",
    # Only the fields defined in the schema are read.
    schema=embedded_schema, 
    # FAILFAST makes the read fail if a record is incompatible with the schema.
    mode="FAILFAST", 
)

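Let's look at the airdate and airstamp values, which the schema declares as a date and a timestamp. Note that this cell queries shows, the data frame read without a schema, so the values are displayed as the raw strings. Run the same loop on shows_with_schema to see the parsed values.

In [29]:
# shows was read without a schema, so both columns are still strings here.
for column in ["airdate", "airstamp"]:
    shows.select(f"_embedded.episodes.{column}").select(
        F.explode(column)
    ).show(5)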

+----------+
|       col|
+----------+
|2014-04-06|
|2014-04-13|
|2014-04-20|
|2014-04-27|
|2014-05-04|
+----------+
only showing top 5 rows

+--------------------+
|                 col|
+--------------------+
|2014-04-07T02:00:...|
|2014-04-14T02:00:...|
|2014-04-21T02:00:...|
|2014-04-28T02:00:...|
|2014-05-05T02:00:...|
+--------------------+
only showing top 5 rows



Here you go. Both columns contain well-formed ISO 8601 values, so they can be read as a date and a timestamp.

### 4.3 Going full circle: Specifying your schemas in JSON

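In this section, I'll cover a different approach to schema definition. Since a schema is itself hierarchical data, PySpark can represent it as JSON: the jsonValue() method returns that representation as a Python dictionary.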

In [30]:
# Pretty-printing the schema
import pprint
pprint.pprint(
    shows_with_schema.select(
        F.explode("_embedded.episodes").alias("episode")
    )
    .select("episode.airtime")
    .schema.jsonValue()
)

{'fields': [{'metadata': {},
             'name': 'airtime',
             'nullable': True,
             'type': 'string'}],
 'type': 'struct'}


In [31]:
# Pretty-printing dummy complex types
pprint.pprint(
    T.StructField("array_example", T.ArrayType(T.StringType())).jsonValue()
)

{'metadata': {},
 'name': 'array_example',
 'nullable': True,
 'type': {'containsNull': True, 'elementType': 'string', 'type': 'array'}}


In [32]:
pprint.pprint(
    T.StructField(
        "map_example", T.MapType(T.StringType(), T.LongType())
    ).jsonValue()
)

{'metadata': {},
 'name': 'map_example',
 'nullable': True,
 'type': {'keyType': 'string',
          'type': 'map',
          'valueContainsNull': True,
          'valueType': 'long'}}


## 5. Reducing duplicate data with complex data types

This section takes the hierarchical data model and presents its advantages in a big data setting.

### 5.1 Getting to the “just right” data frame: Explode and collect

This section covers how to use explode and collect operations to go from hierarchical to tabular and back. 

In [33]:
# Exploding the _embedded.episodes into 53 distinct records
episodes = shows.select(
    "id", F.explode("_embedded.episodes").alias("episodes")
) 
episodes.show(5, truncate=70)

+---+----------------------------------------------------------------------+
| id|                                                              episodes|
+---+----------------------------------------------------------------------+
|143|[[[http://api.tvmaze.com/episodes/10897]], 2014-04-06, 2014-04-07T0...|
|143|[[[http://api.tvmaze.com/episodes/10898]], 2014-04-13, 2014-04-14T0...|
|143|[[[http://api.tvmaze.com/episodes/10899]], 2014-04-20, 2014-04-21T0...|
|143|[[[http://api.tvmaze.com/episodes/10900]], 2014-04-27, 2014-04-28T0...|
|143|[[[http://api.tvmaze.com/episodes/10901]], 2014-05-04, 2014-05-05T0...|
+---+----------------------------------------------------------------------+
only showing top 5 rows



In [34]:
# Exploding a map using posexplode()
episode_name_id = shows.select(
    F.map_from_arrays(
        F.col("_embedded.episodes.id"), F.col("_embedded.episodes.name")
    ).alias("name_id")
)
episode_name_id = episode_name_id.select(
    F.posexplode("name_id").alias("position", "id", "name") 
)
episode_name_id.show(5)

+--------+-----+--------------------+
|position|   id|                name|
+--------+-----+--------------------+
|       0|10897|Minimum Viable Pr...|
|       1|10898|       The Cap Table|
|       2|10899|Articles of Incor...|
|       3|10900|    Fiduciary Duties|
|       4|10901|      Signaling Risk|
+--------+-----+--------------------+
only showing top 5 rows



In [35]:
# Collecting our results back into an array
collected = episodes.groupby("id").agg(
    F.collect_list("episodes").alias("episodes")
)
collected.count() 

1
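
The round trip above can be summarized with a plain-Python sketch of what the two operations mean. The explode and collect_list helpers below are hypothetical stand-ins written for illustration, not the PySpark functions, and the rows are made up.

```python
from collections import defaultdict

# Hypothetical hierarchical rows: one row per show, episodes nested in a list.
shows_rows = [
    {"id": 1, "episodes": ["e1", "e2", "e3"]},
    {"id": 2, "episodes": ["e4"]},
]


def explode(rows, array_key):
    """Return one row per array element, repeating the other fields."""
    return [
        {**{k: v for k, v in row.items() if k != array_key}, array_key: element}
        for row in rows
        for element in row[array_key]
    ]


def collect_list(rows, group_key, value_key):
    """Group rows by group_key and gather value_key back into a list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return [{group_key: key, value_key: items} for key, items in groups.items()]


exploded = explode(shows_rows, "episodes")            # hierarchical -> tabular
collected = collect_list(exploded, "id", "episodes")  # tabular -> hierarchical

print(len(exploded), len(collected))  # 4 2
```

One difference to keep in mind: in Spark, the order of the elements gathered by collect_list is not guaranteed after a shuffle, whereas this sketch always preserves the input order.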

### 5.2 Building your own hierarchies: Struct as a function

This section concludes the notebook by showing how you can create structs within a data frame. The struct function takes one or more column objects (or column names). Here, I pass a literal column to indicate that I’ve watched the show.

In [36]:
# Creating a struct column using the struct function
struct_ex = shows.select(
    F.struct( 
        F.col("status"), F.col("weight"), F.lit(True).alias("has_watched")
    ).alias("info")
)

In [37]:
struct_ex.show(1, False)

+-----------------+
|info             |
+-----------------+
|[Ended, 96, true]|
+-----------------+



In [38]:
struct_ex.printSchema()

root
 |-- info: struct (nullable = false)
 |    |-- status: string (nullable = true)
 |    |-- weight: long (nullable = true)
 |    |-- has_watched: boolean (nullable = false)



## Conclusion

In this notebook, we ingested, processed, navigated, and molded a JSON document with the same data frame and set of functions that we use for textual and tabular data.

### Takeaway

- You can use the DataFrameReader's json method (spark.read.json) to ingest JSON documents into a data frame.

- You can think of JSON data as a Python dictionary.

- In PySpark, hierarchical data models are represented through complex column types. For example, the array represents a list of elements of the same type, the map represents multiple keys and values, and the struct represents an object in the JSON sense.

- PySpark offers a programmatic API to build data frame schemas on top of a JSON representation.

Thanks for reading. I hope you enjoyed it 😀


## Resource
- Data Analysis with Python and PySpark