# Working with JSON data

In addition to simple tabular data, Spark also supports nested data containing structures, arrays and maps. This is particularly interesting when working with non-relational, semi-structured data like JSON.

### Example Data

This time we will not work with weather data, since that data set does not contain the features we want to discuss. Instead we use Twitter data, which is provided as JSON data (one record per tweet). As we will see, even simple things like tweets end up as fairly complex data structures with lots of information. Welcome to the new world!

# 1 Inspect Data

So as a simple first step, let's try to load the data and inspect it like we did before.

In [1]:
storageLocation = "s3://dimajix-training/data/twitter-sample/00.json"

In [2]:
from pyspark.sql.functions import *

### Load and Inspect
Load data as JSON and convert it to a Pandas DataFrame.

In [3]:
twitter = spark.read.json(storageLocation)
twitter.limit(5).toPandas()

Unnamed: 0,contributors,coordinates,created_at,delete,entities,extended_entities,favorite_count,favorited,filter_level,geo,...,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
0,,,Fri Jul 29 08:00:00 +0000 2016,,"([], None, [], [], [])",,0,False,low,,...,,,0,False,,"<a href=""https://github.com/mispy/twitter_eboo...",Carrots coming in clutch with two Christmases.,1469779200658,False,"(False, Thu May 26 02:35:14 +0000 2016, False,..."
1,,,Fri Jul 29 08:00:00 +0000 2016,,"([], None, [], [], [])",,0,False,low,,...,,,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",明日朝から夕方暇な方いらっしゃいませんかあ〜あ〜,1469779200662,False,"(False, Wed Apr 27 11:58:49 +0000 2016, True, ..."
2,,,Fri Jul 29 08:00:00 +0000 2016,,"([], None, [], [(twitter.com/Bangin_is_15/s…, ...",,0,False,low,,...,7.589258e+17,7.589257632193004e+17,0,False,,"<a href=""http://ifttt.com"" rel=""nofollow"">IFTT...",\76trdf\n— ゆぐどらしる(՞ةڼ◔) (Bangin_is_15) July 29...,1469779200666,False,"(False, Fri Jun 12 18:17:46 +0000 2015, True, ..."
3,,,Fri Jul 29 08:00:00 +0000 2016,,"([], None, [], [], [])",,0,False,low,,...,,,0,False,,"<a href=""http://twitter.com/download/iphone"" r...",Tシャツ諦めようと思ってたのに…,1469779200666,False,"(False, Tue Mar 15 14:06:45 +0000 2016, False,..."
4,,,Fri Jul 29 08:00:00 +0000 2016,,"([([109, 114], عاجل), ([115, 124], السعودية)],...",,0,False,low,,...,,,0,False,,"<a href=""http://twitterfeed.com"" rel=""nofollow...",تراجع أسعار المستهلكين في اليابان للشهر الرابع...,1469779200661,False,"(False, Thu Jun 27 07:28:44 +0000 2013, True, ..."


### Inspect Schema

We already saw that some columns seem to contain nested data (for example the `entities` column), so let's inspect the schema.

In [4]:
twitter.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- delete: struct (nullable = true)
 |    |-- status: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- id_str: string (nullable = true)
 |    |    |-- user_id: long (nullable = true)
 |    |    |-- user_id_str: string (nullable = true)
 |    |-- timestamp_ms: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |  

### Remarks

This large and complex schema gives you an impression of what to expect from social networking platforms. Similarly complex structures also appear in event sourcing architectures.

But the big question now is: how can we work with this data? There are multiple challenges:
* Nested data
* Arrays of sub-entities

In theory Spark also supports maps, but JSON cannot distinguish between maps and structs, since both are written as plain JSON objects. A good schema design would always use structs instead of maps, because a struct gives a static schema and therefore a reliable contract.
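
To see why JSON cannot tell the two apart, the following sketch parses two made-up records with Python's standard `json` module. It does not use Spark, and the records, the field names and the helper `field_names` are assumptions for illustration only; none of them come from the Twitter data set. A struct-like object (`user`, with a fixed set of fields) and a map-like object (`counts`, with arbitrary keys) both arrive as ordinary JSON objects, so only the set of keys seen across records hints at the intent.

```python
import json

# Two hypothetical records (not from the Twitter data set), one JSON document per line.
lines = [
    '{"user": {"id": 1, "name": "alice"}, "counts": {"spark": 3}}',
    '{"user": {"id": 2, "name": "bob"}, "counts": {"json": 1, "udf": 2}}',
]
records = [json.loads(line) for line in lines]


def field_names(records, column):
    """Collect the union of keys seen in a nested object across all records."""
    names = set()
    for record in records:
        names.update(record.get(column, {}).keys())
    return sorted(names)


# Both nested objects are parsed into the same Python type (dict).
print(type(records[0]["user"]).__name__, type(records[0]["counts"]).__name__)

# Struct-like: every record carries the same fields.
print(field_names(records, "user"))
# Map-like: the key set grows with the data.
print(field_names(records, "counts"))
```

With a fixed set of fields, the key set stays stable no matter how many records are read, which is what makes a struct usable as a contract.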

# 2 Accessing Elements

So let's start with the first simple exercise: we try to access a nested element by its top-level name. We choose the `geo` element.

In [5]:
result = twitter.filter(twitter["geo"].isNotNull()).select(twitter["geo"])
result.limit(5).toPandas()

Unnamed: 0,geo
0,"([51.826199, 4.62477], Point)"
1,"([51.6301173, 0.81552029], Point)"
2,"([36.19877126, 29.63925128], Point)"
3,"([13.28694, 100.929589], Point)"
4,"([30.32469225, 120.05983768], Point)"


### Inspect Schema

In [6]:
result.printSchema()

root
 |-- geo: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)



## 2.1 Accessing nested entries

You can also access nested entries by using a dotted path, which simply consists of the element names joined by a dot (.).

In [7]:
result = twitter.filter(twitter["geo.coordinates"].isNotNull()).select(
    twitter["geo.coordinates"]
)
result.limit(5).toPandas()

Unnamed: 0,coordinates
0,"[51.826199, 4.62477]"
1,"[51.6301173, 0.81552029]"
2,"[36.19877126, 29.63925128]"
3,"[13.28694, 100.929589]"
4,"[30.32469225, 120.05983768]"


In [8]:
result.printSchema()

root
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)
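
As a mental model for the dotted path used above, here is a small plain-Python sketch. It does not use Spark; the helper `get_path` and the two sample records are hypothetical and only mimic the shape of the `geo` column. The point it illustrates is that a missing parent yields a missing child, which is why the example filters on `isNotNull()` before selecting.

```python
def get_path(record, path):
    """Follow a dot-separated path through nested dicts; return None if any step is missing."""
    current = record
    for name in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(name)
    return current


# Hypothetical records shaped like the `geo` column shown above.
with_geo = {"id": 1, "geo": {"coordinates": [51.826199, 4.62477], "type": "Point"}}
without_geo = {"id": 2, "geo": None}

print(get_path(with_geo, "geo.coordinates"))     # [51.826199, 4.62477]
print(get_path(with_geo, "geo.type"))            # Point
print(get_path(without_geo, "geo.coordinates"))  # None
```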



## 2.2 Accessing Array Entries

The next challenge after accessing nested elements is to access entries inside an array. This can be achieved by subscripting a column with a numerical index.

In [9]:
result = twitter.filter(twitter["geo.coordinates"].isNotNull()).select(
    twitter["geo.coordinates"][0], twitter["geo.coordinates"][1]
)
result.limit(5).toPandas()

Unnamed: 0,geo.coordinates AS `coordinates`[0],geo.coordinates AS `coordinates`[1]
0,51.826199,4.62477
1,51.630117,0.81552
2,36.198771,29.639251
3,13.28694,100.929589
4,30.324692,120.059838


In [10]:
result.printSchema()

root
 |-- geo.coordinates AS `coordinates`[0]: double (nullable = true)
 |-- geo.coordinates AS `coordinates`[1]: double (nullable = true)



# 3 Exploding Entries

Accessing individual elements of an array via their index works fine as long as the number of entries is known. In other scenarios, however, an array can contain an arbitrary number of elements. The Twitter data, for example, contains an array of the hashtags used in each tweet. Spark 2.3 does not provide much support for working with such arrays directly, but it is possible to convert an array of entries into multiple records using the `explode` function.

In [11]:
result = twitter.select(
    twitter["id"],
    twitter["created_at"],
    explode(twitter["entities.hashtags"]).alias("hashtags"),
)
result.limit(5).toPandas()

Unnamed: 0,id,created_at,hashtags
0,758935090906882049,Fri Jul 29 08:00:00 +0000 2016,"([109, 114], عاجل)"
1,758935090906882049,Fri Jul 29 08:00:00 +0000 2016,"([115, 124], السعودية)"
2,758935090911031296,Fri Jul 29 08:00:00 +0000 2016,"([52, 62], KCAMexico)"
3,758935090911031296,Fri Jul 29 08:00:00 +0000 2016,"([63, 86], ValentinaZenereVillana)"
4,758935090902687744,Fri Jul 29 08:00:00 +0000 2016,"([27, 36], security)"


### Inspect Schema

In [12]:
result.printSchema()

root
 |-- id: long (nullable = true)
 |-- created_at: string (nullable = true)
 |-- hashtags: struct (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- text: string (nullable = true)
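
The effect of `explode` on the rows can be modelled in a few lines of plain Python. This is only a sketch of the row-level behaviour, not of how Spark implements it; the helper `explode_rows` and the sample tweets are hypothetical, and only the `id` and a simplified `hashtags` field are modelled.

```python
def explode_rows(rows, array_field, alias):
    """Yield one output row per array element, copying all other fields."""
    for row in rows:
        # An empty or missing array contributes no output rows at all.
        for element in row.get(array_field) or []:
            out = {k: v for k, v in row.items() if k != array_field}
            out[alias] = element
            yield out


# Hypothetical tweets with a simplified list of hashtag texts.
tweets = [
    {"id": 1, "hashtags": []},
    {"id": 2, "hashtags": ["KCAMexico", "ValentinaZenereVillana"]},
    {"id": 3, "hashtags": None},
    {"id": 4, "hashtags": ["security"]},
]

for row in explode_rows(tweets, "hashtags", "hashtag"):
    print(row)
```

Tweet 2 appears twice in the output, once per hashtag, while tweets 1 and 3 do not appear at all.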



## 3.1 Exploding sub-entities

In the example above, it might be useful to access only a sub-entry of each array element, for example the hashtag text. If no subscript is used, the result is again an array, which can be exploded afterwards.

In [13]:
result = twitter.select(
    twitter["id"],
    twitter["created_at"],
    twitter["entities.hashtags.text"].alias("hashtags"),
)
result.limit(5).toPandas()

Unnamed: 0,id,created_at,hashtags
0,758935090894217216,Fri Jul 29 08:00:00 +0000 2016,[]
1,758935090911072256,Fri Jul 29 08:00:00 +0000 2016,[]
2,758935090927849477,Fri Jul 29 08:00:00 +0000 2016,[]
3,758935090927841280,Fri Jul 29 08:00:00 +0000 2016,[]
4,758935090906882049,Fri Jul 29 08:00:00 +0000 2016,"[عاجل, السعودية]"


#### Inspecting the Schema

In [14]:
result.printSchema()

root
 |-- id: long (nullable = true)
 |-- created_at: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)



### Exploding

The `explode` function creates one record for each entry in an array while retaining the other non-array columns.

In [15]:
result = twitter.select(
    twitter["id"],
    twitter["created_at"],
    explode(twitter["entities.hashtags.text"]).alias("hashtags"),
)
result.limit(5).toPandas()

Unnamed: 0,id,created_at,hashtags
0,758935090906882049,Fri Jul 29 08:00:00 +0000 2016,عاجل
1,758935090906882049,Fri Jul 29 08:00:00 +0000 2016,السعودية
2,758935090911031296,Fri Jul 29 08:00:00 +0000 2016,KCAMexico
3,758935090911031296,Fri Jul 29 08:00:00 +0000 2016,ValentinaZenereVillana
4,758935090902687744,Fri Jul 29 08:00:00 +0000 2016,security


#### Inspecting the Schema

In [16]:
result.printSchema()

root
 |-- id: long (nullable = true)
 |-- created_at: string (nullable = true)
 |-- hashtags: string (nullable = true)



## Remark

Note that `explode` creates no record at all for a tweet with an empty list of hashtags. If you still need the records which do not have any hashtags, you can use the function `explode_outer` instead.

In [29]:
result = twitter.select(
    twitter["id"],
    twitter["created_at"],
    explode_outer(twitter["entities.hashtags.text"]).alias("hashtags"),
)
result.limit(5).toPandas()

Unnamed: 0,id,created_at,hashtags
0,758935090894217216,Fri Jul 29 08:00:00 +0000 2016,
1,758935090911072256,Fri Jul 29 08:00:00 +0000 2016,
2,758935090927849477,Fri Jul 29 08:00:00 +0000 2016,
3,758935090927841280,Fri Jul 29 08:00:00 +0000 2016,
4,758935090906882049,Fri Jul 29 08:00:00 +0000 2016,عاجل
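
The row-level behaviour of `explode_outer` can be sketched in plain Python as well. Again this is only a model under assumptions: the helper `explode_outer_rows` and the sample tweets are hypothetical, and the code says nothing about Spark's internals. Compared with a plain explode, a row whose array is empty or missing is kept once, with a missing value in the exploded column.

```python
def explode_outer_rows(rows, array_field, alias):
    """Yield one row per array element; keep rows with an empty or missing array once."""
    for row in rows:
        # Fall back to a single None element so the row is not dropped.
        elements = row.get(array_field) or [None]
        for element in elements:
            out = {k: v for k, v in row.items() if k != array_field}
            out[alias] = element
            yield out


# Hypothetical tweets with a simplified list of hashtag texts.
tweets = [
    {"id": 1, "hashtags": []},
    {"id": 2, "hashtags": ["KCAMexico", "ValentinaZenereVillana"]},
    {"id": 3, "hashtags": None},
]

for row in explode_outer_rows(tweets, "hashtags", "hashtag"):
    print(row)
```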


# 4 Working with UDFs

Another approach to working with nested data (specifically with arrays) is to use UDFs. For example, let us try to extract the longest hashtag of every tweet. This would be rather difficult with the current functionality of Spark, since we cannot create subselects inside a single record.

But a small Python UDF will do the job.

## 4.1 Define Python Function

First we define and test a small Python function which performs the task.

In [22]:
# The wildcard import from pyspark.sql.functions shadows Python's own max,
# so we access the builtin explicitly via the builtins module
import builtins


def select_longest(tags):
    if tags:
        return builtins.max(tags, key=lambda t: len(t))
    else:
        return None

### Test Python function

We should test the function with some common cases:
* non-empty list
* empty list
* `NULL` value (i.e. `None`)

In [32]:
print(select_longest(["x", "12345", "abc"]))

12345


In [30]:
print(select_longest([]))

None


In [31]:
print(select_longest(None))

None
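
One more case worth checking is a tie between hashtags of equal length. The sketch below repeats the logic of the function defined above so that it runs on its own; the sample inputs are made up. Python's built-in `max` keeps the first maximal element it encounters, so the earliest of the longest hashtags wins.

```python
import builtins


def select_longest(tags):
    # Same logic as the function defined above.
    if tags:
        return builtins.max(tags, key=len)
    return None


# Two tags of equal length: the first one in the list is returned.
print(select_longest(["ab", "cd", "e"]))
# A single-element list returns that element.
print(select_longest(["spark"]))
```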


## 4.2 Convert Python function to UDF

Now we have to encapsulate the Python function into a Spark UDF.

In [26]:
from pyspark.sql.types import *


select_longest_udf = udf(select_longest, StringType())

### Use UDF

Now we can use the Python UDF in a simple `select` statement.

In [27]:
result = twitter.select(
    twitter["id"],
    twitter["created_at"],
    select_longest_udf(twitter["entities.hashtags.text"]).alias("longest_hashtag"),
)
result.limit(5).toPandas()

Unnamed: 0,id,created_at,longest_hashtag
0,758935090894217216,Fri Jul 29 08:00:00 +0000 2016,
1,758935090911072256,Fri Jul 29 08:00:00 +0000 2016,
2,758935090927849477,Fri Jul 29 08:00:00 +0000 2016,
3,758935090927841280,Fri Jul 29 08:00:00 +0000 2016,
4,758935090906882049,Fri Jul 29 08:00:00 +0000 2016,السعودية


## 4.3 Use Pandas UDF

Of course a Pandas UDF might improve performance significantly. Let's try that instead of the classic Python UDF.

In [34]:
import builtins

from pyspark.sql.functions import PandasUDFType, pandas_udf


@pandas_udf('string', PandasUDFType.SCALAR)
def select_longest(series):
    def f(tags):
        if tags is not None and len(tags) > 0:
            return builtins.max(tags, key=lambda t: len(t))
        else:
            return None

    return series.apply(f)

### Use Pandas UDF

We can use the Pandas UDF in the same way as we did with the original Python UDF.

In [35]:
result = twitter.select(
    twitter["id"],
    twitter["created_at"],
    select_longest(twitter["entities.hashtags.text"]).alias("longest_hashtag"),
)
result.limit(5).toPandas()

Unnamed: 0,id,created_at,longest_hashtag
0,758935090894217216,Fri Jul 29 08:00:00 +0000 2016,
1,758935090911072256,Fri Jul 29 08:00:00 +0000 2016,
2,758935090927849477,Fri Jul 29 08:00:00 +0000 2016,
3,758935090927841280,Fri Jul 29 08:00:00 +0000 2016,
4,758935090906882049,Fri Jul 29 08:00:00 +0000 2016,السعودية
