# Working with JSON data in PySpark

- JSON stands for JavaScript Object Notation. It is a long-standing data interchange format that became massively popular for its readability and its relatively small size.

- JSON consists of key-value pairs. Keys are always strings, and values can be numbers, Booleans, strings, or null. Values can also be arrays or nested objects, which is what gives JSON data its hierarchical shape.
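
To make the value types concrete, here is a minimal sketch using Python's standard json module. The record is invented for illustration and is not taken from the shows dataset; it only shows how each JSON value type, including nested arrays and objects, maps onto a Python type.

```python
import json

# A hypothetical show record written as JSON text (not from the dataset).
raw = """
{
  "name": "Example Show",
  "id": 42,
  "rating": 8.5,
  "ended": false,
  "webChannel": null,
  "genres": ["Comedy", "Drama"],
  "schedule": {"time": "22:00", "days": ["Sunday"]}
}
"""

record = json.loads(raw)

# Print the Python type that each JSON value was parsed into.
for key, value in record.items():
    print(f"{key:12} -> {type(value).__name__}")
```

Arrays become Python lists and nested objects become dictionaries, which mirrors the array and struct columns PySpark infers later on.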

## 1. Data Ingestion

In [7]:
# Creating a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [8]:
# Loading a file
shows = spark.read.json("./data/shows/shows-silicon-valley.json") 
shows.count()

In [10]:
# Loading many files 
three_shows = spark.read.json("./data/shows/shows-*.json", multiLine=True)
three_shows.count()

## 2. Breaking the second dimension with complex data types

In [16]:
shows.printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: string (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true)

In [17]:
print(shows.columns)

['_embedded', '_links', 'externals', 'genres', 'id', 'image', 'language', 'name', 'network', 'officialSite', 'premiered', 'rating', 'runtime', 'schedule', 'status', 'summary', 'type', 'updated', 'url', 'webChannel', 'weight']


### 2.1 When you have more than one value: The array

PySpark arrays are containers for values of the same type. To experiment with arrays, let's select a subset of the shows dataset.

In [29]:
# Selecting two columns
array_subset = shows.select("name", "genres")
array_subset.show()

+--------------+--------+
|          name|  genres|
+--------------+--------+
|Silicon Valley|[Comedy]|
+--------------+--------+



To work with the values inside an array, you first need to extract them. There are several ways to extract elements from an array, as the next cell shows.

In [19]:
import pyspark.sql.functions as F
array_subset = array_subset.select(
    "name",
    array_subset.genres[0].alias("dot_and_index"), 
    F.col("genres")[0].alias("col_and_index"),
    array_subset.genres.getItem(0).alias("dot_and_method"), 
    F.col("genres").getItem(0).alias("col_and_method"),
)
array_subset.show()

+--------------+-------------+-------------+--------------+--------------+
|          name|dot_and_index|col_and_index|dot_and_method|col_and_method|
+--------------+-------------+-------------+--------------+--------------+
|Silicon Valley|       Comedy|       Comedy|        Comedy|        Comedy|
+--------------+-------------+-------------+--------------+--------------+



Let's take a look at how to perform multiple operations on an array column.

In [20]:
array_subset_repeated = array_subset.select(
    "name",
    # lit() is used to create scalar columns.
    F.lit("Comedy").alias("one"),
    F.lit("Horror").alias("two"),
    F.lit("Drama").alias("three"),
    F.col("dot_and_index"),
).select(
    "name",
    # array() combines several columns into a single array column.
    F.array("one", "two", "three").alias("Some_Genres"), 
    # array_repeat() builds an array that repeats a value, here five times.
    F.array_repeat("dot_and_index", 5).alias("Repeated_Genres"), 
)
array_subset_repeated.show(1, False)

+--------------+-----------------------+----------------------------------------+
|name          |Some_Genres            |Repeated_Genres                         |
+--------------+-----------------------+----------------------------------------+
|Silicon Valley|[Comedy, Horror, Drama]|[Comedy, Comedy, Comedy, Comedy, Comedy]|
+--------------+-----------------------+----------------------------------------+



In [21]:
# Using the size function
array_subset_repeated.select(
    "name", 
    # size() returns the number of elements in each array.
    F.size("Some_Genres"), F.size("Repeated_Genres") 
).show()

+--------------+-----------------+---------------------+
|          name|size(Some_Genres)|size(Repeated_Genres)|
+--------------+-----------------+---------------------+
|Silicon Valley|                3|                    5|
+--------------+-----------------+---------------------+



In [22]:
# Using the array_distinct function
array_subset_repeated.select(
    "name",
    # array_distinct() removes duplicate values from an array.
    F.array_distinct("Some_Genres"),
    F.array_distinct("Repeated_Genres"), 
).show(1, False)

+--------------+---------------------------+-------------------------------+
|name          |array_distinct(Some_Genres)|array_distinct(Repeated_Genres)|
+--------------+---------------------------+-------------------------------+
|Silicon Valley|[Comedy, Horror, Drama]    |[Comedy]                       |
+--------------+---------------------------+-------------------------------+



In [23]:
# Using the array_intersect function
array_subset_repeated = array_subset_repeated.select(
    "name",
    # array_intersect() keeps only the values present in both arrays.
    F.array_intersect("Some_Genres", "Repeated_Genres").alias("Genres"),
)
array_subset_repeated.show()

In [25]:
# Using the array_position function
array_subset_repeated.select(
    "Genres", 
    # array_position() returns the 1-based position of a value in the array.
    F.array_position("Genres", "Comedy")
).show()

+--------+------------------------------+
|  Genres|array_position(Genres, Comedy)|
+--------+------------------------------+
|[Comedy]|                             1|
+--------+------------------------------+



### 2.2 The map type: Keys and values within a column

A map has keys and values, just like a Python dictionary. One of the easiest ways to create a map is from two array columns: one holding the keys and one holding the values.

In [31]:
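# Creating a map: first build an array of keys and an array of values
columns = ["name", "language", "type"]
shows_map = shows.select(
    # Literal columns; each one is named after its own value ("name", ...).
    *[F.lit(column) for column in columns],
    # The actual column values, gathered into a single array.
    F.array(*columns).alias("values"),
)
# F.array(*columns) now picks up the literal columns created above, giving the keys.
shows_map = shows_map.select(F.array(*columns).alias("keys"), "values")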
shows_map.show(1)

+--------------------+--------------------+
|                keys|              values|
+--------------------+--------------------+
|[name, language, ...|[Silicon Valley, ...|
+--------------------+--------------------+



In [32]:
shows_map = shows_map.select(
    F.map_from_arrays("keys", "values").alias("mapped")
)

In [33]:
shows_map.printSchema()

root
 |-- mapped: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



In [34]:
shows_map.show(1, False)

+---------------------------------------------------------------+
|mapped                                                         |
+---------------------------------------------------------------+
|[name -> Silicon Valley, language -> English, type -> Scripted]|
+---------------------------------------------------------------+



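You can get the value corresponding to a key in three equivalent ways: dot notation inside the col function, bracket notation on the column returned by col, or bracket notation on the column accessed as a DataFrame attribute.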

In [37]:
shows_map.select(
    F.col("mapped.name"), 
    F.col("mapped")["name"], 
    shows_map.mapped["name"], 
).show()

+--------------+--------------+--------------+
|          name|  mapped[name]|  mapped[name]|
+--------------+--------------+--------------+
|Silicon Valley|Silicon Valley|Silicon Valley|
+--------------+--------------+--------------+

