# Part 2: Normalisation

## Load nested data with auto normalisation

When converting nested data to a tabular format, we want to keep fragmentation to a minimum:
* Nested dictionaries can be flattened into the parent row.
* Nested lists, however, need to be expressed as separate tables because they have a different granularity (a 1:n relationship with the parent row).
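
To make the two rules concrete, here is a minimal pure-Python sketch of the idea. It is not how dlt implements it; the `flatten` helper and the trimmed `example` record are hypothetical, and the `__` separator is only chosen to match the column and table names shown later in this section.

```python
from typing import Any


def flatten(record: dict[str, Any], table: str, sep: str = "__"):
    """Split one nested record into a flat parent row plus child-table rows."""
    row: dict[str, Any] = {}
    children: dict[str, list[Any]] = {}

    def walk(obj: dict[str, Any], prefix: str) -> None:
        for key, value in obj.items():
            name = f"{prefix}{sep}{key}" if prefix else key
            if isinstance(value, dict):
                # nested dict: keep descending, extending the column name
                walk(value, name)
            elif isinstance(value, list):
                # nested list: 1:n relationship, so it goes to its own table
                children[f"{table}{sep}{name}"] = value
            else:
                row[name] = value

    walk(record, "")
    return row, children


# Trimmed-down, illustrative record
example = {
    "vendor_name": "VTS",
    "coordinates": {"start": {"lon": -73.787442, "lat": 40.641525}},
    "passengers": [{"name": "John"}, {"name": "Jack"}],
}

row, children = flatten(example, "rides")
print(row)       # flat parent row with coordinates__start__lon / __lat columns
print(children)  # one child table: rides__passengers
```

The parent row stays a single row, while each list becomes a candidate child table that still needs a key back to its parent.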

When going from JSON to a database, we also want some things standardised:
* Data types such as timestamps should be detected correctly
* Column names should be converted to db-compatible names
* Unnested sub-tables should be linked to parent tables via auto-generated keys
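
As a rough illustration of these three requirements, the sketch below uses only the standard library. All helper names (`to_snake_case`, `parse_timestamp`, `link_children`) and the `_id` / `_parent_id` key names are assumptions made for this example; a real tool handles many more cases (other timestamp formats, name collisions, deterministic keys).

```python
import re
import uuid
from datetime import datetime


def to_snake_case(name: str) -> str:
    """Turn an arbitrary key into a lowercase, underscore-separated column name."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)           # non-alphanumerics -> "_"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # split camelCase words
    return name.strip("_").lower()


def parse_timestamp(value):
    """Return a datetime if the string matches one known format, else the value unchanged."""
    if not isinstance(value, str):
        return value
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return value


def link_children(parent_row: dict, child_rows: list) -> tuple:
    """Give the parent a generated key and copy it onto every child row."""
    parent_id = uuid.uuid4().hex
    parent = {**parent_row, "_id": parent_id}
    children = [{**child, "_parent_id": parent_id} for child in child_rows]
    return parent, children


print(to_snake_case("Trip_Distance"))          # trip_distance
print(parse_timestamp("2009-06-14 23:23:00"))  # 2009-06-14 23:23:00 (a datetime object)

parent, children = link_children(
    {"vendor_name": "VTS"},
    [{"name": "John"}, {"name": "Jack"}],
)
print(all(child["_parent_id"] == parent["_id"] for child in children))  # True
```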


For this work, we will use the `dlt` library, which is purpose-built to solve such tasks in a scalable way, for example by using generators.



### Introducing dlt

dlt is a Python library created to help data engineers build simpler, faster and more robust pipelines with minimal effort.

dlt automates much of the tedious work a data engineer would do, and does it in a way that is robust.

dlt can handle things like:

- Schema: inferring and evolving the schema, alerting on changes, using schemas as data contracts.
- Typing data, flattening structures, renaming columns to fit database standards.
- Processing a stream of events/rows without filling memory. This includes extraction from generators. In our example we will pass the “data” you can see above.
- Loading to a variety of databases or file formats.
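
The point about streaming can be sketched with a plain generator: rows are produced one at a time, so the whole dataset never has to sit in memory. The `stream_rows` function and the two sample lines below are hypothetical; the assumption is only that a generator like this can be handed to the pipeline in place of a fully materialised list.

```python
import io
import json
from typing import Iterable, Iterator


def stream_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily parse newline-delimited JSON, yielding one row at a time."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Stand-in for a large file or API response; the content is made up for the demo
fake_file = io.StringIO('{"vendor_name": "VTS"}\n\n{"vendor_name": "CMT"}\n')

rows = stream_rows(fake_file)
first = next(rows)      # only the first line has been read and parsed so far
print(first)
remaining = list(rows)  # consuming the generator reads the rest
print(remaining)
```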

Read more about dlt [here](https://dlthub.com/docs/intro).

Now let’s use it to load our nested JSON into DuckDB:

In [None]:
import dlt
import duckdb

data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "Trip_Distance": 17.52,
        # nested dictionaries could be flattened
        "coordinates": { # coordinates__start__lon
            "start": {
                "lon": -73.787442,
                "lat": 40.641525
            },
            "end": {
                "lon": -73.980072,
                "lat": 40.742963
            }
        },
        "Rate_Code": None,
        "store_and_forward": None,
        "Payment": {
            "type": "Credit",
            "amt": 20.5,
            "surcharge": 0,
            "mta_tax": None,
            "tip": 9,
            "tolls": 4.15,
            "status": "booked"
        },
        "Passenger_Count": 2,
        # nested lists need to be expressed as separate tables
        "passengers": [
            {"name": "John", "rating": 4.9},
            {"name": "Jack", "rating": 3.9}
        ],
        "Stops": [
            {"lon": -73.6, "lat": 40.6},
            {"lon": -73.5, "lat": 40.5}
        ]
    },
    # ... more data
]


# Define the connection to load to.
# We use DuckDB for now, but you can switch to BigQuery later.
pipeline = dlt.pipeline(destination='duckdb', dataset_name='taxi_rides')



# run with merge write disposition.
# This is so scaffolding is created for the next example,
# where we look at merging data

info = pipeline.run(
    data,
    table_name="rides",
    write_disposition="merge",
    primary_key="record_hash",
)

print(info)

### Inspecting the nested structure, joining the child tables

Let's look at what happened during the load:
- Looking at the loaded tables, we can see that our JSON document got flattened and its sub-documents got split into separate tables.
- We can re-join those child tables to the parent table by using the generated keys: `on parent_table._dlt_id = child_table._dlt_parent_id`.
- Data types: if you pay attention to the data types, you will note that the timestamps, which are strings in the JSON, are now of timestamp type in the db.


In [None]:
# show the outcome

conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# let's see the tables
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")
print('Loaded tables: ')
display(conn.sql("show tables"))


print("\n\n\n Rides table below: Note the times are properly typed")
rides = conn.sql("SELECT * FROM rides").df()
display(rides)

print("\n\n\n Passengers table")
passengers = conn.sql("SELECT * FROM rides__passengers").df()
display(passengers)
print("\n\n\n Stops table")
stops = conn.sql("SELECT * FROM rides__stops").df()
display(stops)


# to reflect the relationships between parent and child rows, let's join them
# for the single ride above this yields 4 rows (2 passengers x 2 stops),
# because the two 1:n joins multiply each other

print("\n\n\n joined table")

joined = conn.sql("""
SELECT *
FROM rides as r
left join rides__passengers as rp
  on r._dlt_id = rp._dlt_parent_id
left join rides__stops as rs
  on r._dlt_id = rs._dlt_parent_id
""").df()
display(joined)

What are we looking at?
- Nested dicts got flattened into the parent row: the structure `{"coordinates":{"start": {"lat": ...}}}` became `coordinates__start__lat`.

- Nested lists got broken out into separate tables with generated columns that would allow us to join the data back when needed.
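
To see why joining two child tables to the same parent multiplies rows, here is a small self-contained sketch. It uses the standard library's `sqlite3` as a stand-in for DuckDB so it runs without the loaded pipeline; the tables are hand-built with only a few columns, and the key value `'r1'` is made up. Only the `_dlt_id` / `_dlt_parent_id` column names and the table names are taken from the load above.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE rides (_dlt_id TEXT, vendor_name TEXT);
CREATE TABLE rides__passengers (_dlt_parent_id TEXT, name TEXT);
CREATE TABLE rides__stops (_dlt_parent_id TEXT, lon REAL, lat REAL);

INSERT INTO rides VALUES ('r1', 'VTS');
INSERT INTO rides__passengers VALUES ('r1', 'John'), ('r1', 'Jack');
INSERT INTO rides__stops VALUES ('r1', -73.6, 40.6), ('r1', -73.5, 40.5);
""")

# Joining both child tables at once pairs every passenger with every stop
joined = demo_conn.execute("""
SELECT r.vendor_name, rp.name, rs.lon, rs.lat
FROM rides AS r
LEFT JOIN rides__passengers AS rp ON r._dlt_id = rp._dlt_parent_id
LEFT JOIN rides__stops AS rs ON r._dlt_id = rs._dlt_parent_id
""").fetchall()
print(len(joined))  # 4 rows: 2 passengers x 2 stops

# Aggregating one child table at a time keeps one row per ride
passenger_counts = demo_conn.execute("""
SELECT r._dlt_id, COUNT(rp.name)
FROM rides AS r
LEFT JOIN rides__passengers AS rp ON r._dlt_id = rp._dlt_parent_id
GROUP BY r._dlt_id
""").fetchall()
print(passenger_counts)  # [('r1', 2)]
```

When a per-ride metric is needed, joining or aggregating each child table separately avoids counting the same passenger or stop more than once.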