# Working with JSON data

In addition to simple tabular data, Spark also supports nested data containing structures, arrays and maps. This is particularly interesting when working with non-relational, semi-structured data such as JSON.

### Example Data

This time we will not work with weather data, since that data set does not contain the features we want to discuss. Instead we use Twitter data, which is provided as JSON data (one record for one tweet). As we will see, even simple things like Tweets end up in fairly complex data structures with lots of information. Welcome to the new world!

# 1 Inspect Data

So as a simple first step, let's try to load the data and inspect it like we did before.

In [None]:
storageLocation = "s3://dimajix-training/data/twitter-sample/00.json"

In [None]:
from pyspark.sql.functions import *

### Load and Inspect
Load the data as JSON and convert the first five records to a Pandas DataFrame for display.

In [None]:
twitter = spark.read.json(storageLocation)
twitter.limit(5).toPandas()

### Inspect Schema

Now that we have seen that some columns seem to contain nested data (for example the `entities` column), let's inspect the schema.

In [None]:
twitter.printSchema()

### Remarks

That pretty large and complex schema gives you an impression of what to expect from social networking platforms. Similarly complex structures also appear in event sourcing architectures.

But the big question now is: how can we work with this data? There are multiple challenges:
* Nested data
* Arrays of sub-entities

In principle Spark also supports maps, but JSON cannot distinguish between maps and structs. A good schema design should prefer structs over maps, because a struct gives a static schema and therefore a reliable contract.

# 2 Accessing Elements

So let's start with a first simple exercise: we access a nested element by its top-level name. We choose the `geo` element.

In [None]:
result = # YOUR CODE HERE
result.limit(5).toPandas()

### Inspect Schema

In [None]:
# YOUR CODE HERE

## 2.1 Accessing nested entries

You can also access nested entries by their path, which simply consists of the element names joined by a dot (.).
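To make the dotted-path idea concrete, the following plain-Python sketch resolves such a path against a nested dictionary. It only models the lookup semantics and is not how Spark resolves columns internally; the helper `get_path` and the sample record are hypothetical.

```python
def get_path(record, path):
    """Follow a dot-separated path through nested dicts; return None if a step is missing."""
    current = record
    for name in path.split("."):
        if not isinstance(current, dict) or name not in current:
            return None
        current = current[name]
    return current


# Hypothetical, heavily simplified tweet record (not taken from the real data set)
tweet = {"id": 1, "user": {"name": "alice", "lang": "en"}, "geo": None}

print(get_path(tweet, "user.name"))        # alice
print(get_path(tweet, "geo.coordinates"))  # None
```

In PySpark the same kind of path is written as a column name, as in the `twitter["geo.coordinates"]` expression used further down in this notebook.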

In [None]:
result = # YOUR CODE HERE
result.limit(5).toPandas()

In [None]:
result.printSchema()

## 2.2 Accessing Array Entries

The next challenge after accessing nested elements is to access entries inside an array. This can be achieved by subscripting a column with a numerical index.

In [None]:
result = twitter \
    .filter(twitter["geo.coordinates"].isNotNull()) \
    .select(
        # YOUR CODE HERE
    )
result.limit(5).toPandas()

In [None]:
result.printSchema()

# 3 Exploding Entries

Accessing individual elements of an array via their index works fine as long as the number of entries is known. In many scenarios, however, an array can contain an arbitrary number of elements. The Twitter data, for example, contains an array of the hashtags used in each tweet. Spark 2.3 offers only limited built-in support for such arrays, but it is possible to convert an array of entries into multiple records using the `explode` function.

In [None]:
result = # YOUR CODE HERE
result.limit(5).toPandas()

### Inspect Schema

In [None]:
result.printSchema()

## 3.1 Exploding sub-entities

In the example above, it might be useful to access sub-entries of the array elements. If no subscript is used, the result is again an array, which can be exploded afterwards.
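As a rough plain-Python model of this behaviour: selecting a field through an array of structs yields an array of that field's values. The helper `project_field` and the sample entries (including the `indices` field) are hypothetical and only illustrate the shape of the result.

```python
def project_field(structs, field):
    """Pick one field from every struct in an array; a missing array stays None."""
    if structs is None:
        return None
    return [s.get(field) for s in structs]


# Hypothetical hashtag entries, loosely modelled on the `entities.hashtags` array
hashtags = [
    {"text": "spark", "indices": [0, 6]},
    {"text": "json", "indices": [7, 12]},
]

print(project_field(hashtags, "text"))  # ['spark', 'json']
```

The resulting list corresponds to the array column that a path such as `entities.hashtags.text` produces, which can then be passed to `explode`.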

In [None]:
result = # YOUR CODE HERE
result.limit(5).toPandas()

#### Inspecting the Schema

In [None]:
result.printSchema()

### Exploding

The `explode` function creates one record for each entry in an array while retaining the other, non-array columns.

In [None]:
result = twitter \
    .select(
        # YOUR CODE HERE
    )
result.limit(5).toPandas()

#### Inspecting the Schema

In [None]:
result.printSchema()

## Remark

Note that `explode` creates no record at all for tweets with an empty list of hashtags. If you still need the records that do not have any hashtags, you can use the function `explode_outer` instead.
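The difference between the two functions can be modelled in a few lines of plain Python. This is only a sketch of the row-level semantics, not Spark code; the helper `explode_rows` and the sample tweets are hypothetical.

```python
def explode_rows(rows, column, outer=False):
    """Plain-Python model: emit one output row per element of row[column]."""
    result = []
    for row in rows:
        values = row.get(column)
        if values:
            for value in values:
                result.append({**row, column: value})
        elif outer:
            # Empty or missing array: keep the row, with None in place of the element
            result.append({**row, column: None})
    return result


# Hypothetical tweets: one with two hashtags, one with none
tweets = [
    {"id": 1, "hashtags": ["spark", "json"]},
    {"id": 2, "hashtags": []},
]

print(explode_rows(tweets, "hashtags"))              # tweet 2 disappears
print(explode_rows(tweets, "hashtags", outer=True))  # tweet 2 is kept with None
```

With `outer=False` the tweet without hashtags is dropped, as with `explode`; with `outer=True` it is kept with a `None` value, as with `explode_outer`.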

In [None]:
result = twitter \
    .select(
        # YOUR CODE HERE
    )
result.limit(5).toPandas()

# 4 Working with UDFs

Another approach to working with nested data (specifically with arrays) is to use UDFs. For example, let us try to extract the longest hashtag of every tweet. This would be rather difficult with the current functionality of Spark, since we cannot create subselects inside a single record.

But a small Python UDF will do the job.

## 4.1 Define Python Function

First we define and test a small Python function, which should perform the task.

In [None]:
# Import builtin Python functions, like max
import builtins


def select_longest(tags):
    # YOUR CODE HERE

### Test Python function

We should test the function with some common cases:
* non-empty list
* empty list
* `NULL` value (i.e. `None`)
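One possible way to cover these three cases is sketched below. The name `longest_tag_sketch` is hypothetical and chosen so that it does not replace your own `select_longest`; treat it as one option among several, not as the reference solution.

```python
import builtins


def longest_tag_sketch(tags):
    """Return the longest string in `tags`, or None for an empty list or None."""
    if not tags:
        return None
    # builtins.max is the plain Python max, not the Spark column function of the
    # same name that `from pyspark.sql.functions import *` brings into scope
    return builtins.max(tags, key=len)


print(longest_tag_sketch(["x", "12345", "abc"]))  # 12345
print(longest_tag_sketch([]))                     # None
print(longest_tag_sketch(None))                   # None
```

Returning `None` for both the empty list and the missing value means the resulting column simply contains `NULL` for tweets without hashtags.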

In [None]:
print(select_longest(["x", "12345", "abc"]))

In [None]:
print(select_longest([]))

In [None]:
print(select_longest(None))

## 4.2 Convert Python function to UDF

Now we have to wrap the Python function in a Spark UDF.

In [None]:
from pyspark.sql.types import *


select_longest_udf = # YOUR CODE HERE

### Use UDF

Now we can use the Python UDF in a simple `select` statement.

In [None]:
result = twitter \
    .select(
        # YOUR CODE HERE
    )
result.limit(5).toPandas()

## 4.3 Use Pandas UDF

A Pandas UDF might improve performance significantly. Let's try that instead of the classic Python UDF.

In [None]:
import builtins

from pyspark.sql.functions import PandasUDFType, pandas_udf


@pandas_udf('string', PandasUDFType.SCALAR)
def select_longest(series):
    # YOUR CODE HERE

### Use Pandas UDF

We can use the Pandas UDF in the same way as we did with the original Python UDF.

In [None]:
result = twitter \
    .select(
        twitter["id"],
        twitter["created_at"],
        select_longest(twitter["entities.hashtags.text"]).alias("longest_hashtag")
    )
result.limit(5).toPandas()