# Glue ETL Transformation - Flatten and Unnest JSON

[CN]

First, let's define the problem: what is Flatten, and what is Unnest? Consider the following record.

```python
{
    "user_id": 1,
    "contact": {"email": "alice@example.com", "phone": "111-222-3333"},
    "accounts": ["acc1", "acc2", "acc3"],
}
```

**Unnest Struct**

A nested field is one like ``contact``: its value is a struct that itself holds key-value pairs, which may be nested further. After unnesting, we want the result to be:

```python
{
    "contact.email": "alice@example.com",
    "contact.phone": "111-222-3333",
}
```
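
As a rough illustration of the idea (not the Glue implementation), the unnest step can be sketched in plain Python. The helper name `unnest` and its `sep` parameter are made up for this example; lists are left untouched because arrays are handled separately below.

```python
def unnest(record, parent_key="", sep="."):
    """Move every leaf value of nested dicts to the top level,
    using the full path (joined by ``sep``) as the new key."""
    flat = {}
    for key, value in record.items():
        path = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into the struct and merge its flattened leaves
            flat.update(unnest(value, path, sep))
        else:
            # scalars and lists are kept as-is
            flat[path] = value
    return flat


record = {
    "user_id": 1,
    "contact": {"email": "alice@example.com", "phone": "111-222-3333"},
    "accounts": ["acc1", "acc2", "acc3"],
}
unnested = unnest(record)
print(unnested)
```

The `contact` struct disappears and its two leaves become the top-level keys `contact.email` and `contact.phone`, while `user_id` and `accounts` are unchanged.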

**Flatten Array**

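Flatten means removing all nesting so that the data structure consists only of simple key-value pairs. Conceptually it includes Unnest; that is, Unnest is one kind of flatten. Here we focus on the ``accounts`` array.

There are usually two ways to flatten an array:

Method 1: add an index to each element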

```python
{
    "accounts[0]": "acc1",
    "accounts[1]": "acc2",
    "accounts[2]": "acc3",
}
```
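
A minimal plain-Python sketch of Method 1, assuming the array field holds scalar values. The function name `flatten_array_with_index` is hypothetical and only serves this example.

```python
def flatten_array_with_index(record, field):
    """Replace the array stored under ``field`` with one key per element,
    named ``field[0]``, ``field[1]``, and so on."""
    # copy every key except the array itself
    flat = {k: v for k, v in record.items() if k != field}
    for i, item in enumerate(record[field]):
        flat[f"{field}[{i}]"] = item
    return flat


indexed = flatten_array_with_index(
    {"user_id": 1, "accounts": ["acc1", "acc2", "acc3"]},
    "accounts",
)
print(indexed)
```

The number of keys now depends on the array length, so different records can end up with different columns.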

Method 2: duplicate the record N times based on the array, where N is the number of elements in the array

```python
{"user_id": 1, "account": "acc1", ...}
{"user_id": 1, "account": "acc2", ...}
{"user_id": 1, "account": "acc3", ...}
```

In big data analytics, the second method is the more common choice.

Next, let's see how to perform Unnest / Flatten operations in a Glue Job.
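
Method 2 can be sketched in plain Python as a generator that yields one copy of the record per array element. The helper `explode_record` and its `alias` parameter are illustrative names, not part of any library.

```python
def explode_record(record, field, alias=None):
    """Yield one copy of ``record`` per element of ``record[field]``.
    The element is stored under ``alias`` (defaults to ``field``)."""
    alias = alias or field
    # everything except the array is repeated on every output row
    base = {k: v for k, v in record.items() if k != field}
    for item in record[field]:
        yield {**base, alias: item}


rows = list(
    explode_record(
        {"user_id": 1, "accounts": ["acc1", "acc2", "acc3"]},
        "accounts",
        alias="account",
    )
)
for row in rows:
    print(row)
```

Every output row has the same fixed set of keys, which is why this shape suits tabular analytics. Note that a record whose array is empty produces no rows in this sketch, which is also how Spark's `explode` behaves.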

[EN]

**Unnest Struct**

For a struct, unnesting moves each leaf node to the root level and uses its full JSON path as the key:

```python
# Input
{"id": 1, "specs": {"color": "red"}}

# Output
{"id": 1, "specs.color": "red"}
```

**Flatten Array**

For an array, one record expands into as many records as the array has elements:

```python
# Input
{"id": 1, "categories": ["cate1", "cate2", "cate3"]}

# Output
{"id": 1, "categories": "cate1"}
{"id": 1, "categories": "cate2"}
{"id": 1, "categories": "cate3"}
```

If you flatten on multiple array fields, the record expands into every combination (the Cartesian product) of the fields' elements:

```python
# Input
{"id": "7e3f", "array1": [1, 2], "array2": [3, 4, 5]}

# Output
{"id": "7e3f", "array1": 1, "array2": 3}
{"id": "7e3f", "array1": 1, "array2": 4}
{"id": "7e3f", "array1": 1, "array2": 5}
{"id": "7e3f", "array1": 2, "array2": 3}
{"id": "7e3f", "array1": 2, "array2": 4}
{"id": "7e3f", "array1": 2, "array2": 5}
```
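
The combination behaviour above can be reproduced with `itertools.product` from the standard library. This is only a sketch of the semantics; the helper name `explode_many` is an assumption for illustration.

```python
from itertools import product


def explode_many(record, fields):
    """Yield one row per combination of elements taken from the
    array fields listed in ``fields``."""
    # non-array fields are repeated on every output row
    base = {k: v for k, v in record.items() if k not in fields}
    arrays = [record[f] for f in fields]
    for combo in product(*arrays):
        yield {**base, **dict(zip(fields, combo))}


record = {"id": "7e3f", "array1": [1, 2], "array2": [3, 4, 5]}
rows = list(explode_many(record, ["array1", "array2"]))
for row in rows:
    print(row)
```

The row count is the product of the array lengths (here 2 × 3 = 6), so flattening on several large arrays at once can grow the dataset quickly.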

In [1]:
import sys
from pyspark.context import SparkContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import *

# Create SparkContext
sparkContext = SparkContext.getOrCreate()

# Create Glue Context
glueContext = GlueContext(sparkContext)

# Get spark session
spark = glueContext.spark_session

# Resolve job parameters
# Uncomment this in Glue ETL job
# args = getResolvedOptions(sys.argv, ["JOB_NAME"])
# job = Job(glueContext)
# job.init(args['JOB_NAME'], args)

In [2]:
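class Config:
    bucket = "aws-data-lab-sanhe-for-everything-us-east-2"
    prefix = "poc/learn-big-data-on-aws/glue-job-examples/03-transformation-examples/05-flatten-and-unnest-json"

    @property
    def s3path_prefix(self):
        # S3Path is not imported anywhere in this notebook; assuming it is
        # the class from the third-party s3pathlib package.
        from s3pathlib import S3Path
        return S3Path(self.bucket, self.prefix)

config = Config()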

In [3]:
# gdf = Glue Dynamic Frame
gdf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3", 
    connection_options=dict(
        paths=[
            f"s3://{config.bucket}/{config.prefix}/"
        ],
        recurse=True,
    ),
    format="json",
    format_options=dict(multiLine=True),
    transformation_ctx="datasource",
)

In [4]:
# print data schema
gdf.printSchema()

root
|-- id: int
|-- name: string
|-- price: int
|-- specs: struct
|    |-- color: string
|-- categories: array
|    |-- element: string
|-- reviews: array
|    |-- element: struct
|    |    |-- rank: int
|    |    |-- comment: string

In [5]:
# preview the data
gdf.toDF().show(3, truncate=False, vertical=True)

-RECORD 0--------------------------------------------------------
 id         | 1                                                  
 name       | report                                             
 price      | 74                                                 
 specs      | [LightSeaGreen]                                    
 categories | [simple]                                           
 reviews    | [[4, In simple way eat.], [2, Best care network.]] 
-RECORD 1--------------------------------------------------------
 id         | 2                                                  
 name       | style                                              
 price      | 87                                                 
 specs      | [White]                                            
 categories | [answer, nation]                                   
 reviews    | [[2, Agreement modern test.]]                      
-RECORD 2--------------------------------------------------------
 id       

## Unnest struct example

In [6]:
gdf_unnest_struct_selected = SelectFields.apply(frame=gdf, paths=["id", "specs"])
gdf_unnest_struct_selected.toDF().show(3, truncate=False, vertical=True)

-RECORD 0----------------
 id    | 1               
 specs | [LightSeaGreen] 
-RECORD 1----------------
 id    | 2               
 specs | [White]         
-RECORD 2----------------
 id    | 3               
 specs | [Aqua]          
only showing top 3 rows

In [7]:
# apply the unnest transformation: DynamicFrame.unnest() is the method
# form of the ``UnnestFrame`` transform
gdf_unnested_struct = gdf_unnest_struct_selected.unnest()
for row in gdf_unnested_struct.toDF().toPandas().head(3).to_dict(orient="records"):
    print(row)

{'id': 1, 'specs.color': 'LightSeaGreen'}
{'id': 2, 'specs.color': 'White'}
{'id': 3, 'specs.color': 'Aqua'}

## Flatten array example

In [8]:
# Double check the "before" state
gdf.toDF().show(3, truncate=False, vertical=True)

-RECORD 0--------------------------------------------------------
 id         | 1                                                  
 name       | report                                             
 price      | 74                                                 
 specs      | [LightSeaGreen]                                    
 categories | [simple]                                           
 reviews    | [[4, In simple way eat.], [2, Best care network.]] 
-RECORD 1--------------------------------------------------------
 id         | 2                                                  
 name       | style                                              
 price      | 87                                                 
 specs      | [White]                                            
 categories | [answer, nation]                                   
 reviews    | [[2, Agreement modern test.]]                      
-RECORD 2--------------------------------------------------------
 id       

In [9]:
# import the explode function
# ref: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.explode.html
from pyspark.sql.functions import explode

# pdf = PySpark DataFrame; convert the Glue DynamicFrame to a PySpark DataFrame
pdf = gdf.toDF()

In [10]:
# flatten based on an array of strings: one output row per category
pdf_unnest_array_of_string = pdf.select(
    pdf.id,
    explode(pdf.categories).alias("category"),
)
for row in pdf_unnest_array_of_string.toPandas().head(10).to_dict(orient="records"):
    print(row)

{'id': 1, 'category': 'simple'}
{'id': 2, 'category': 'answer'}
{'id': 2, 'category': 'nation'}
{'id': 3, 'category': 'thousand'}
{'id': 3, 'category': 'mouth'}
{'id': 3, 'category': 'style'}
{'id': 4, 'category': 'do'}
{'id': 5, 'category': 'government'}
{'id': 6, 'category': 'today'}
{'id': 7, 'category': 'mouth'}

In [11]:
# flatten based on an array of structs: one output row per review,
# with the review still held as a struct (Row)
pdf_unnest_array_of_struct = pdf.select(
    pdf.id,
    explode(pdf.reviews).alias("review"),
)
for row in pdf_unnest_array_of_struct.toPandas().head(10).to_dict(orient="records"):
    print(row)

{'id': 1, 'review': Row(rank=4, comment='In simple way eat.')}
{'id': 1, 'review': Row(rank=2, comment='Best care network.')}
{'id': 2, 'review': Row(rank=2, comment='Agreement modern test.')}
{'id': 3, 'review': Row(rank=2, comment='Seat minute record policy soon.')}
{'id': 4, 'review': Row(rank=5, comment='Agreement production as environmental.')}
{'id': 5, 'review': Row(rank=2, comment='Cultural his generation ask movie.')}
{'id': 5, 'review': Row(rank=3, comment='Third table available law themselves some economy officer.')}
{'id': 6, 'review': Row(rank=1, comment='Place tend mouth discover sport.')}
{'id': 6, 'review': Row(rank=2, comment='Mean what sometimes animal our sometimes.')}
{'id': 6, 'review': Row(rank=2, comment='Cold leader ok market.')}

In [12]:
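# flatten then unnest: explode categories, then explode reviews,
# then convert back to a DynamicFrame and unnest the review struct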
pdf_unnest_tmp_1 = pdf.select(
    pdf.id,
    explode(pdf.categories).alias("category"),
    pdf.reviews,
)
pdf_unnest_tmp_2 = pdf_unnest_tmp_1.select(
    pdf_unnest_tmp_1.id,
    pdf_unnest_tmp_1.category,
    explode(pdf_unnest_tmp_1.reviews).alias("review"),
)
gdf_unnest_tmp = DynamicFrame.fromDF(pdf_unnest_tmp_2, glueContext, "gdf_unnest_tmp")
gdf_unnest_everything = gdf_unnest_tmp.unnest()
for row in gdf_unnest_everything.toDF().toPandas().head(10).to_dict(orient="records"):
    print(row)

{'id': 1, 'category': 'simple', 'review.rank': 4, 'review.comment': 'In simple way eat.'}
{'id': 1, 'category': 'simple', 'review.rank': 2, 'review.comment': 'Best care network.'}
{'id': 2, 'category': 'answer', 'review.rank': 2, 'review.comment': 'Agreement modern test.'}
{'id': 2, 'category': 'nation', 'review.rank': 2, 'review.comment': 'Agreement modern test.'}
{'id': 3, 'category': 'thousand', 'review.rank': 2, 'review.comment': 'Seat minute record policy soon.'}
{'id': 3, 'category': 'mouth', 'review.rank': 2, 'review.comment': 'Seat minute record policy soon.'}
{'id': 3, 'category': 'style', 'review.rank': 2, 'review.comment': 'Seat minute record policy soon.'}
{'id': 4, 'category': 'do', 'review.rank': 5, 'review.comment': 'Agreement production as environmental.'}
{'id': 5, 'category': 'government', 'review.rank': 2, 'review.comment': 'Cultural his generation ask movie.'}
{'id': 5, 'category': 'government', 'review.rank': 3, 'review.comment': 'Third table available law themselv
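
To make the row multiplication concrete, here is a plain-Python sketch of the same explode, explode, unnest sequence applied to the first two records from the preview. It mirrors the semantics only and does not use Spark or Glue; the helpers `explode_rows` and `unnest_row` are hypothetical names.

```python
# First two records from the preview, restricted to the fields used here
records = [
    {
        "id": 1,
        "categories": ["simple"],
        "reviews": [
            {"rank": 4, "comment": "In simple way eat."},
            {"rank": 2, "comment": "Best care network."},
        ],
    },
    {
        "id": 2,
        "categories": ["answer", "nation"],
        "reviews": [{"rank": 2, "comment": "Agreement modern test."}],
    },
]


def explode_rows(rows, field, alias):
    """One output row per element of ``row[field]``, stored under ``alias``."""
    for row in rows:
        base = {k: v for k, v in row.items() if k != field}
        for item in row[field]:
            yield {**base, alias: item}


def unnest_row(row, sep="."):
    """Lift the leaves of nested dicts to the top level with dotted keys."""
    flat = {}
    for key, value in row.items():
        if isinstance(value, dict):
            for sub_key, sub_value in unnest_row(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


step1 = explode_rows(records, "categories", "category")
step2 = explode_rows(step1, "reviews", "review")
result = [unnest_row(row) for row in step2]
for row in result:
    print(row)
```

Record 1 has 1 category and 2 reviews, so it yields 2 rows; record 2 has 2 categories and 1 review, so it also yields 2 rows. These match the first four rows of the output above.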

In [13]:
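# Build (old_name, new_name) pairs for renaming.
# The old name is wrapped in backticks so that a dot in the column name
# is treated literally rather than as a path into a nested field;
# the new name replaces each dot with an underscore.
col_mapper = [
    (f"`{col}`", col.replace(".", "_"))
    for col in gdf_unnest_everything.toDF().columns
]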
col_mapper

[('`id`', 'id'), ('`category`', 'category'), ('`review.rank`', 'review_rank'), ('`review.comment`', 'review_comment')]

In [14]:
# rename the columns one at a time; rename_field returns a new DynamicFrame
gdf_unnest_result = gdf_unnest_everything
for col_old, col_new in col_mapper:
    gdf_unnest_result = gdf_unnest_result.rename_field(col_old, col_new)
gdf_unnest_result.toDF().show()

+---+----------+-----------+--------------------+
| id|  category|review_rank|      review_comment|
+---+----------+-----------+--------------------+
|  1|    simple|          4|  In simple way eat.|
|  1|    simple|          2|  Best care network.|
|  2|    answer|          2|Agreement modern ...|
|  2|    nation|          2|Agreement modern ...|
|  3|  thousand|          2|Seat minute recor...|
|  3|     mouth|          2|Seat minute recor...|
|  3|     style|          2|Seat minute recor...|
|  4|        do|          5|Agreement product...|
|  5|government|          2|Cultural his gene...|
|  5|government|          3|Third table avail...|
|  6|     today|          1|Place tend mouth ...|
|  6|     today|          2|Mean what sometim...|
|  6|     today|          2|Cold leader ok ma...|
|  6|     today|          5| Wind thank law gun.|
|  7|     mouth|          1|Girl difficult li...|
|  7|     mouth|          5|Responsibility ci...|
|  7|     mouth|          2|   We dog a contain.|
