# Tidy data and nested schemas

Data tidying is the concept of structuring datasets to facilitate analysis.

The principles of tidy data were described in 2013 by statistician [Hadley Wickham](http://hadley.nz/) and are closely tied to the principles of relational databases and Codd's relational algebra. They provide a standard way to organize data values within a dataset and can be summarized as:

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

## What will you learn in this course? 🧐🧐
This course will demonstrate how to tidy up a DataFrame's schema. Here's the outline:

* Array operations and nested schemas
    * `F.size(...)`
    * `F.explode(...)`
    * `.groupBy()` again ;)
    * `F.collect_list(...)`
* Deeply nested schema
    * `.getField(...)`
* Even deeper
* Advanced groupBy



## Array operations and nested schemas ⚙️⚙️

In this lecture we will introduce some Spark SQL functions that we'll need to clean our datasets before we run further analysis.

In [0]:
# Not needed on Databricks, where `spark` and `sc` are already defined

# spark
# sc = spark.sparkContext

In [0]:
# prelude

from pyspark.sql import functions as F # module that contains the Spark SQL functions
from pyspark.sql import Row # lets us build and manipulate rows
from pyspark.sql.types import * # types used to define schemas and convert columns

Let's say we have some data about users. Here we create an RDD from a list of dicts, but in real life we would obtain it through a pipeline or a database query.

# LEVEL 1

In [0]:
users_dct = [
    {'id': 1, 'name': 'George', 'orders': [50.61, 31.32, 20.9]},
    {'id': 2, 'name': 'Hugues', 'orders': [133.8, 59.0, 40.03, 27.91]}
]

users_rdd = sc.parallelize(users_dct)

# Both calls below work here, probably because the data has a single level of nesting
# users_df = spark.createDataFrame(users_rdd.map(lambda x: Row(**x))) # Row(**x) unpacks the dict into keyword arguments
users_df = spark.createDataFrame(users_rdd)

# try the commented command with Row(x) and Row(*x) to understand what unpacking does
users_df.show()

+---+------+--------------------+
| id|  name|              orders|
+---+------+--------------------+
|  1|George|[50.61, 31.32, 20.9]|
|  2|Hugues|[133.8, 59.0, 40....|
+---+------+--------------------+



In [0]:
# The .createDataFrame(...) method is able to infer the data schema by itself
users_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: double (containsNull = true)



Although Spark is able to infer the data schema by itself, it can be useful to define it yourself. Let's try it.

In [0]:
users_dct = [
    {'id': 1, 'name': 'George', 'orders': [50, 31, 20]},
    {'id': 2, 'name': 'Hugues', 'orders': [133, 59, 40, 27]}
]
users_rdd = sc.parallelize(users_dct)

# we create a variable schema as a list of StructField inside a StructType object
schema = StructType([
    StructField('id', IntegerType(), True), # the first column is of type Integer
    StructField('name', StringType(), True), # the second column is a String
    StructField('orders', ArrayType(IntegerType()), True) # the third column contains Array of Integer
])

# Note the lambda in the commented alternative
#users_df = spark.createDataFrame(users_rdd.map(lambda x: Row(**x)), schema=schema) 
# we pass the schema to the function using the `schema` argument
users_df = spark.createDataFrame(users_rdd, schema=schema)

users_df.printSchema()
users_df.show()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: integer (containsNull = true)

+---+------+-----------------+
| id|  name|           orders|
+---+------+-----------------+
|  1|George|     [50, 31, 20]|
|  2|Hugues|[133, 59, 40, 27]|
+---+------+-----------------+



### `F.size(...)`

This function returns the number of elements in each value of an array-type column.

In [0]:
# We drop orders because it is not useful for display
# Note that orders_quantity holds the number of elements of each list
users_df \
    .withColumn('orders_quantity', F.size('orders')) \
    .drop('orders') \
    .show()

+---+------+---------------+
| id|  name|orders_quantity|
+---+------+---------------+
|  1|George|              3|
|  2|Hugues|              4|
+---+------+---------------+



In [0]:
# Keep in mind that users_df has NOT been modified
users_df.show()

+---+------+-----------------+
| id|  name|           orders|
+---+------+-----------------+
|  1|George|     [50, 31, 20]|
|  2|Hugues|[133, 59, 40, 27]|
+---+------+-----------------+



We get the size of the array, which is pretty nice, but what if we want to compute other aggregates such as a sum or an average? It turns out this is not trivial. We will go through one method, but there are others; you can read more about them [here](https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html).

### `F.explode(...)`
Before we try to compute aggregates, let's ask another question: what if we want one row per order?  
An order is an observational unit which, according to the tidy principles, deserves its own table.

The `F.explode(...)` function takes a column of type array and makes copies of the entire row, so that each element of the array appears on a separate row of the table.
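
To make the row-copying behaviour concrete, here is a minimal plain-Python sketch of the same idea. It is not Spark code and not how Spark implements it: `explode_rows` is a hypothetical helper written only for illustration, and the rows are ordinary dicts.

```python
# Plain-Python illustration of explode semantics (no Spark required).
users = [
    {'id': 1, 'name': 'George', 'orders': [50, 31, 20]},
    {'id': 2, 'name': 'Hugues', 'orders': [133, 59, 40, 27]},
]

def explode_rows(rows, column):
    """Yield one copy of each row per element of the list stored in `column`."""
    for row in rows:
        # A missing (None) or empty list produces no output row at all.
        for element in row[column] or []:
            # Copy every other field and replace the list by a single element.
            yield {**row, column: element}

exploded = list(explode_rows(users, 'orders'))
for row in exploded:
    print(row)
```

Like `F.explode`, this sketch emits nothing for a row whose array is empty or null. In Spark, `F.explode_outer` is the variant that keeps such rows and puts a null in the exploded column.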

In [0]:
# We create a "new" orders column that overwrites the previous one
# We "explode" the content of the old orders column
# The row is repeated once for each element of the list
orders_df = users_df.withColumn('orders', F.explode('orders'))
orders_df.printSchema()
orders_df.show()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: integer (nullable = true)

+---+------+------+
| id|  name|orders|
+---+------+------+
|  1|George|    50|
|  1|George|    31|
|  1|George|    20|
|  2|Hugues|   133|
|  2|Hugues|    59|
|  2|Hugues|    40|
|  2|Hugues|    27|
+---+------+------+



### `.groupBy(...)`
Now we can compute the average order by customer with a `.groupBy(...)`.

In [0]:
orders_df.groupBy('id', 'name') \
    .mean('orders') \
    .show()

# here it's ok to just write the column names, but don't forget that it's usually
# better to use the column objects instead to avoid errors 

+---+------+------------------+
| id|  name|       avg(orders)|
+---+------+------------------+
|  1|George|33.666666666666664|
|  2|Hugues|             64.75|
+---+------+------------------+



### `F.collect_list(...)`
The opposite transformation is **`F.collect_list(...)`**, used as an aggregate after a `.groupBy(...)`.

In [0]:
orders_df.groupBy('id', 'name') \
    .agg(F.collect_list('orders').alias('orders')) \
    .show()

+---+------+-----------------+
| id|  name|           orders|
+---+------+-----------------+
|  1|George|     [50, 31, 20]|
|  2|Hugues|[133, 59, 40, 27]|
+---+------+-----------------+



We got our original DataFrame back.
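
The round trip can be pictured with a small plain-Python sketch of the grouping step. This only mimics the behaviour on ordinary dicts; `collect_list_rows` is a hypothetical helper, not a Spark API.

```python
from collections import defaultdict

# One row per order, as produced by the explode step.
orders = [
    {'id': 1, 'name': 'George', 'orders': 50},
    {'id': 1, 'name': 'George', 'orders': 31},
    {'id': 1, 'name': 'George', 'orders': 20},
    {'id': 2, 'name': 'Hugues', 'orders': 133},
    {'id': 2, 'name': 'Hugues', 'orders': 59},
    {'id': 2, 'name': 'Hugues', 'orders': 40},
    {'id': 2, 'name': 'Hugues', 'orders': 27},
]

def collect_list_rows(rows, keys, column):
    """Group rows by `keys` and gather the values of `column` into a list."""
    groups = defaultdict(list)
    for row in rows:
        group_key = tuple(row[k] for k in keys)
        groups[group_key].append(row[column])
    # Rebuild one row per group: the key fields plus the collected list.
    return [
        {**dict(zip(keys, group_key)), column: values}
        for group_key, values in groups.items()
    ]

regrouped = collect_list_rows(orders, ['id', 'name'], 'orders')
for row in regrouped:
    print(row)
```

One difference worth keeping in mind: this sketch preserves the input order, whereas in Spark the order of the elements returned by `F.collect_list` depends on the order of the rows, which is not guaranteed after a shuffle. On a small example like this one the lists usually come back in the original order, but it should not be relied on.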

# LEVEL 2
* There are 2 levels of nesting
* This time our schema is a bit more complex: we have a list of users with their orders, but in addition to the order amount we also have some additional details.

In [0]:
from pyspark.sql.types import *

In [0]:
users = [
    {'id': 1, 'name': 'George', 'orders': [
        {'id': 1, 'value': 55.1},
        {'id': 2, 'value': 78.31},
        {'id': 4, 'value': 52.13}
    ]},
    {'id': 2, 'name': 'Hughes', 'orders': [
        {'id': 3, 'value': 31.19},
        {'id': 5, 'value': 131.1}
    ]}
]
users_rdd = sc.parallelize(users)

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('name', StringType(), True),
    StructField('orders', ArrayType(
        StructType([
            StructField('id', IntegerType(), True),
            StructField('value', FloatType(), True)
        ])
    ), True)
])

users_df = spark.createDataFrame(users_rdd, schema=schema)
users_df.printSchema()
users_df.show()

# You'll see that the schema this time is a little deeper than before!

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- value: float (nullable = true)

+---+------+--------------------+
| id|  name|              orders|
+---+------+--------------------+
|  1|George|[{1, 55.1}, {2, 7...|
|  2|Hughes|[{3, 31.19}, {5, ...|
+---+------+--------------------+



In [0]:
# Let's explode the orders column to start unnesting the schema
orders_df = users_df.withColumn('orders', F.explode('orders'))
orders_df.printSchema()
orders_df.show()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: struct (nullable = true)
 |    |-- id: integer (nullable = true)
 |    |-- value: float (nullable = true)

+---+------+----------+
| id|  name|    orders|
+---+------+----------+
|  1|George| {1, 55.1}|
|  1|George|{2, 78.31}|
|  1|George|{4, 52.13}|
|  2|Hughes|{3, 31.19}|
|  2|Hughes|{5, 131.1}|
+---+------+----------+



### `.getField(...)`
We can access the fields nested inside a struct column using `.getField(fieldname)`.

In [0]:
# Recall that we had this schema
# root
#  |-- id: integer (nullable = true)
#  |-- name: string (nullable = true)
#  |-- orders: struct (nullable = true)
#  |    |-- id: integer (nullable = true)
#  |    |-- value: float (nullable = true)

# F.col("col_name") returns the column object just like df.col_name or df["col_name"]

# orders_df.printSchema()

orders_df \
    .withColumn('order_id', F.col('orders').getField('id')) \
    .show()



+---+------+----------+--------+
| id|  name|    orders|order_id|
+---+------+----------+--------+
|  1|George| {1, 55.1}|       1|
|  1|George|{2, 78.31}|       2|
|  1|George|{4, 52.13}|       4|
|  2|Hughes|{3, 31.19}|       3|
|  2|Hughes|{5, 131.1}|       5|
+---+------+----------+--------+



### Should we prefer the `F.col('orders').getField('id')` notation?
Alternatively, you can use **`.`** notation, just like you would to access a column inside a DataFrame object.

In [0]:
orders_df \
    .withColumn('order_id', F.col('orders.id')) \
    .show()

+---+------+----------+--------+
| id|  name|    orders|order_id|
+---+------+----------+--------+
|  1|George| {1, 55.1}|       1|
|  1|George|{2, 78.31}|       2|
|  1|George|{4, 52.13}|       4|
|  2|Hughes|{3, 31.19}|       3|
|  2|Hughes|{5, 131.1}|       5|
+---+------+----------+--------+



In [0]:
# Let's extract both the nested columns to get a flat schema
orders_df_flattened = orders_df \
    .withColumn('order_id', F.col('orders').getField('id')) \
    .withColumn('order_value', F.col('orders').getField('value')) \
    .drop('orders')
orders_df_flattened.show()

+---+------+--------+-----------+
| id|  name|order_id|order_value|
+---+------+--------+-----------+
|  1|George|       1|       55.1|
|  1|George|       2|      78.31|
|  1|George|       4|      52.13|
|  2|Hughes|       3|      31.19|
|  2|Hughes|       5|      131.1|
+---+------+--------+-----------+



In [0]:
# Same thing with dot notation
orders_df_flattened = orders_df \
    .withColumn('order_id', F.col('orders.id')) \
    .withColumn('order_value', F.col('orders.value')) \
    .drop('orders')
orders_df_flattened.show()

+---+------+--------+-----------+
| id|  name|order_id|order_value|
+---+------+--------+-----------+
|  1|George|       1|       55.1|
|  1|George|       2|      78.31|
|  1|George|       4|      52.13|
|  2|Hughes|       3|      31.19|
|  2|Hughes|       5|      131.1|
+---+------+--------+-----------+



In [0]:
# It is now possible to aggregate this table using groupBy and an aggregation function like .sum
orders_df_flattened                 \
    .groupBy('name')                \
    .sum('order_value')             \
    .orderBy('sum(order_value)')    \
    .show()

+------+------------------+
|  name|  sum(order_value)|
+------+------------------+
|Hughes|162.29000663757324|
|George|185.53999710083008|
+------+------------------+
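
The long decimals in these totals are not a bug in the aggregation. The `value` field was declared as `FloatType`, a 32-bit type, so each amount is rounded to the nearest 32-bit float before being summed as a 64-bit double. The sketch below reproduces that rounding in plain Python with the standard `struct` module; `to_float32` is a hypothetical helper, and the sketch only imitates the arithmetic, not Spark itself.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Order values from the LEVEL 2 dataset.
george_orders = [55.1, 78.31, 52.13]
hughes_orders = [31.19, 131.1]

# Each value is first squeezed into 32 bits, then the sum is done in 64 bits.
george_total = sum(to_float32(v) for v in george_orders)
hughes_total = sum(to_float32(v) for v in hughes_orders)

print(george_total)  # close to 185.54, but not exactly
print(hughes_total)  # close to 162.29, but not exactly
```

If exact-looking totals matter, for instance for monetary amounts, declaring the field as `DoubleType` or `DecimalType` in the schema is one possible way to reduce or avoid this effect.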



In [0]:
# Aliasing inline and descending sort
orders_df_flattened                                 \
    .groupBy('name')                                \
    .agg(F.sum('order_value').alias('total_value')) \
    .orderBy(F.desc('total_value'))                 \
    .show()

+------+------------------+
|  name|       total_value|
+------+------------------+
|George|185.53999710083008|
|Hughes|162.29000663757324|
+------+------------------+



# LEVEL 3
Let's now simulate an even more deeply nested schema, and we will walk you through the process of unnesting it!

In [0]:
users = [
    {'id': 1, 'name': 'George', 'orders': [
        {'id': 1, 'items': [
            {'id': 1, 'category': 'shirt', 'price': 80, 'quantity': 4},
            {'id': 2, 'category': 'jeans', 'price': 130, 'quantity': 2}
        ]},
        {'id': 4, 'items': [
            {'id': 1, 'category': 'shirt', 'price': 80, 'quantity': 1},
            {'id': 3, 'category': 'shoes', 'price': 240, 'quantity': 1}
        ]}
    ]},
    {'id': 2, 'name': 'Hughes', 'orders': [
        {'id': 2, 'items': [
            {'id': 4, 'category': 'shorts', 'price': 120, 'quantity': 3},
            {'id': 1, 'category': 'shirt', 'price': 180, 'quantity': 2},
            {'id': 3, 'category': 'shoes', 'prices': 240, 'quantity': 1}  # key is 'prices', not 'price': it does not match the schema, so price is read as null
        ]},
        {'id': 3, 'items': [
            {'id': 5, 'category': 'suit', 'price': 2000, 'quantity': 1}
        ]}
    ]}
]
users_rdd = sc.parallelize(users)

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('name', StringType(), True),
    StructField('orders', ArrayType(
        StructType([
            StructField('id', IntegerType(), True),
            StructField('items', ArrayType(
                StructType([
                    StructField('id', IntegerType(), True),
                    StructField('category', StringType(), True),
                    StructField('price', IntegerType(), True),
                    StructField('quantity', IntegerType(), True)
                ])
            ))
        ])
    ), True)
])

users_df = spark.createDataFrame(users_rdd, schema=schema)
users_df.printSchema()
users_df.show()

# This schema is much deeper than the other two!

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = true)
 |    |    |-- items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- id: integer (nullable = true)
 |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |-- price: integer (nullable = true)
 |    |    |    |    |-- quantity: integer (nullable = true)

+---+------+--------------------+
| id|  name|              orders|
+---+------+--------------------+
|  1|George|[{1, [{1, shirt, ...|
|  2|Hughes|[{2, [{4, shorts,...|
+---+------+--------------------+



In [0]:
# We start by exploding the orders column, which is where the nesting resides
# Note the name orders_df: we are going down one level
orders_df = users_df.withColumn('orders', F.explode('orders'))
orders_df.show()

+---+------+--------------------+
| id|  name|              orders|
+---+------+--------------------+
|  1|George|{1, [{1, shirt, 8...|
|  1|George|{4, [{1, shirt, 8...|
|  2|Hughes|{2, [{4, shorts, ...|
|  2|Hughes|{3, [{5, suit, 20...|
+---+------+--------------------+



Now brace yourselves as we will walk you step by step through the process of unnesting this data schema!

## Intermediate step


### Step 1

In [0]:
# Do not assign the result: .show() returns None and would overwrite orders_df
orders_df.withColumn('order_id', F.col('orders.id')) \
    .withColumn('items', F.col('orders.items')) \
    .drop('orders') \
    .show()



+---+------+--------+--------------------+
| id|  name|order_id|               items|
+---+------+--------+--------------------+
|  1|George|       1|[{1, shirt, 80, 4...|
|  1|George|       4|[{1, shirt, 80, 1...|
|  2|Hughes|       2|[{4, shorts, 120,...|
|  2|Hughes|       3|[{5, suit, 2000, 1}]|
+---+------+--------+--------------------+



### Step 2
* **The code below starts from the exploded `orders_df`**, which must still contain the `orders` column.
* `.show()` returns `None`: if `orders_df` was overwritten by assigning it the result of a chain ending in `.show()`, re-run the cell that creates it.
* See `orders_df = users_df.withColumn('orders', F.explode('orders'))`

In [0]:
orders_df.withColumn('order_id', F.col('orders.id')) \
    .withColumn('items', F.col('orders.items')) \
    .drop('orders') \
    .withColumn('item', F.explode('items')) \
    .drop('items') \
    .withColumn('item_id', F.col('item').getField('id')) \
    .withColumn('category', F.col('item').getField('category')) \
    .withColumn('price', F.col('item').getField('price')) \
    .withColumn('quantity', F.col('item').getField('quantity')) \
    .drop('item') \
    .show()

+---+------+--------+-------+--------+-----+--------+
| id|  name|order_id|item_id|category|price|quantity|
+---+------+--------+-------+--------+-----+--------+
|  1|George|       1|      1|   shirt|   80|       4|
|  1|George|       1|      2|   jeans|  130|       2|
|  1|George|       4|      1|   shirt|   80|       1|
|  1|George|       4|      3|   shoes|  240|       1|
|  2|Hughes|       2|      4|  shorts|  120|       3|
|  2|Hughes|       2|      1|   shirt|  180|       2|
|  2|Hughes|       2|      3|   shoes| null|       1|
|  2|Hughes|       3|      5|    suit| 2000|       1|
+---+------+--------+-------+--------+-----+--------+



In [0]:
items_df = (
    orders_df.withColumn('order_id', F.col('orders').getField('id'))
    .withColumn('items', F.col('orders').getField('items'))
    .drop('orders')
    .withColumnRenamed('name', 'user_name')
    .withColumnRenamed('id', 'user_id')
    .withColumn('items', F.explode('items'))
    .withColumn('item_id', F.col('items').getField('id'))
    .withColumn('item_category', F.col('items').getField('category'))
    .withColumn('item_price', F.col('items').getField('price'))
    .withColumn('item_quantity', F.col('items').getField('quantity'))
    .withColumn('total_price', F.col('item_price') * F.col('item_quantity'))
    .drop('items')
)
items_df.show()

+-------+---------+--------+-------+-------------+----------+-------------+-----------+
|user_id|user_name|order_id|item_id|item_category|item_price|item_quantity|total_price|
+-------+---------+--------+-------+-------------+----------+-------------+-----------+
|      1|   George|       1|      1|        shirt|        80|            4|        320|
|      1|   George|       1|      2|        jeans|       130|            2|        260|
|      1|   George|       4|      1|        shirt|        80|            1|         80|
|      1|   George|       4|      3|        shoes|       240|            1|        240|
|      2|   Hughes|       2|      4|       shorts|       120|            3|        360|
|      2|   Hughes|       2|      1|        shirt|       180|            2|        360|
|      2|   Hughes|       2|      3|        shoes|      null|            1|       null|
|      2|   Hughes|       3|      5|         suit|      2000|            1|       2000|
+-------+---------+--------+-------+-------------+----------+-------------+-----------+

The same thing using dot (`.`) notation:

In [0]:
items_df = (
    orders_df.withColumn('order_id', F.col('orders.id'))
    .withColumn('items', F.col('orders.items'))
    .drop('orders')
    .withColumnRenamed('name', 'user_name')
    .withColumnRenamed('id', 'user_id')
    .withColumn('items', F.explode('items'))
    .withColumn('item_id', F.col('items.id'))
    .withColumn('item_category', F.col('items.category'))
    .withColumn('item_price', F.col('items.price'))
    .withColumn('item_quantity', F.col('items.quantity'))
    .withColumn('total_price', F.col('item_price') * F.col('item_quantity'))
    .drop('items')
)
items_df.show()

+-------+---------+--------+-------+-------------+----------+-------------+-----------+
|user_id|user_name|order_id|item_id|item_category|item_price|item_quantity|total_price|
+-------+---------+--------+-------+-------------+----------+-------------+-----------+
|      1|   George|       1|      1|        shirt|        80|            4|        320|
|      1|   George|       1|      2|        jeans|       130|            2|        260|
|      1|   George|       4|      1|        shirt|        80|            1|         80|
|      1|   George|       4|      3|        shoes|       240|            1|        240|
|      2|   Hughes|       2|      4|       shorts|       120|            3|        360|
|      2|   Hughes|       2|      1|        shirt|       180|            2|        360|
|      2|   Hughes|       2|      3|        shoes|      null|            1|       null|
|      2|   Hughes|       3|      5|         suit|      2000|            1|       2000|
+-------+---------+--------+-------+-------------+----------+-------------+-----------+

This is much better.
Unnesting may be a tedious task, but it is an essential step towards facilitating analysis. Running analyses and SQL-type queries on a nested schema is hard, so it is definitely worth spending some time preparing your data so that everyone else saves time when they query your tables.
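
As a mental model of the whole unnesting process, the same flattening can be written as nested loops over ordinary Python dicts. This is only a sketch on a trimmed copy of the data, not Spark code; `flatten_items` is a hypothetical helper, and the output column names simply follow the ones used for `items_df`.

```python
# Trimmed copy of the LEVEL 3 data: one user, two orders.
users = [
    {'id': 1, 'name': 'George', 'orders': [
        {'id': 1, 'items': [
            {'id': 1, 'category': 'shirt', 'price': 80, 'quantity': 4},
            {'id': 2, 'category': 'jeans', 'price': 130, 'quantity': 2},
        ]},
        {'id': 4, 'items': [
            {'id': 3, 'category': 'shoes', 'price': 240, 'quantity': 1},
        ]},
    ]},
]

def flatten_items(users):
    """Yield one flat row per item: each loop level plays the role of one explode."""
    for user in users:
        for order in user['orders']:      # first explode: orders
            for item in order['items']:   # second explode: items
                yield {
                    'user_id': user['id'],
                    'user_name': user['name'],
                    'order_id': order['id'],
                    'item_id': item['id'],
                    # .get() returns None for a missing key, like a null in Spark
                    'item_category': item.get('category'),
                    'item_price': item.get('price'),
                    'item_quantity': item.get('quantity'),
                }

flat_rows = list(flatten_items(users))
for row in flat_rows:
    print(row)
```

Each nesting level costs one loop here and one `F.explode` plus a few field extractions in Spark, which is why deeper schemas quickly become tedious to flatten by hand.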

## Advanced groupBy 🧮🧮

In [0]:
# Here we group the data by item category and calculate the sum
items_df \
    .groupBy('item_category') \
    .sum('item_quantity') \
    .orderBy(F.desc('sum(item_quantity)')) \
    .show()

You might want to alias the aggregated column. In that case, replace `.sum()` with `.agg()`. This is a little beyond the scope of today's lecture, but we'll show it to you before spending more time on aggregates in the following days.

In [0]:
items_df \
    .groupBy('item_category') \
    .agg(F.sum('item_quantity').alias('total_quantity')) \
    .orderBy(F.desc('total_quantity')) \
    .show()

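`.agg()` also lets you alias an expression that combines several aggregates, such as an average sale price per category: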

In [0]:
items_df \
    .groupBy('item_category') \
    .agg((F.sum('total_price') / F.sum('item_quantity')).alias('avg_sale')) \
    .orderBy(F.desc('avg_sale')) \
    .show()
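
One caveat for this ratio: `items_df` contains a shoes row whose `item_price` is null. Spark's `sum` ignores nulls, so that row adds nothing to the numerator while its quantity still counts in the denominator. The plain-Python sketch below imitates this on the two shoes rows; `sum_skipping_nulls` is a hypothetical helper that mimics the null handling, not a Spark function.

```python
# The two 'shoes' rows of items_df, reduced to the columns that matter.
shoes_rows = [
    {'item_price': 240, 'item_quantity': 1},   # price known
    {'item_price': None, 'item_quantity': 1},  # price is null
]

def sum_skipping_nulls(values):
    """Mimic an aggregate sum that ignores null (None) inputs."""
    return sum(v for v in values if v is not None)

# total_price is null whenever item_price is null (null * n gives null).
total_prices = [
    None if r['item_price'] is None else r['item_price'] * r['item_quantity']
    for r in shoes_rows
]
quantities = [r['item_quantity'] for r in shoes_rows]

# Ratio as written in the cell above: the null row still adds to the quantity.
naive_avg = sum_skipping_nulls(total_prices) / sum_skipping_nulls(quantities)

# Alternative: only count rows whose price is known.
known = [r for r in shoes_rows if r['item_price'] is not None]
safe_avg = (
    sum(r['item_price'] * r['item_quantity'] for r in known)
    / sum(r['item_quantity'] for r in known)
)

print(naive_avg, safe_avg)
```

In Spark, one possible way to get the second behaviour is to filter out the rows with a null `item_price` before grouping, or to fix the missing price upstream.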

## Resources 📚📚

* We strongly advise you to take the time to read [the original paper by Wickham](https://vita.had.co.nz/papers/tidy-data.pdf).
You might also want to look at this resource, which will be used in the exercises to flatten a deeply nested data schema.
* [Automatically and Elegantly flatten DataFrame in Spark SQL](https://stackoverflow.com/questions/37471346/automatically-and-elegantly-flatten-dataframe-in-spark-sql) on StackOverflow