# Structured types

In IoT, complex machinery means complex data. How does Parquet handle complex structures? Let's read an inventory of devices with a more interesting structure.

In [None]:
import daft
from pathlib import Path
from time import time
import json
import tempfile
import os
from IPython.display import display, Markdown

%reload_ext autoreload
%autoreload 2
from helpers import inspect
daft.set_execution_config(parquet_target_row_group_size=128*1024*1024) # Set back to defaults

## Using automated schema discovery

Daft is excellent at inferring schemas from JSON data, even when the data is nested or slightly inconsistent.

In [None]:
df = daft.read_json('../data/input/cmdata.jsonl')
df.show(10)
display(df.schema())

files = df.write_parquet('../data/output/cmdata.parquet', write_mode='overwrite')
cmdata_parquet_file = files.to_pydict()['path'][0]
inspect(Path(cmdata_parquet_file))

What do you notice?

* What do the arrays look like in the table and in the actual Parquet file? What are the path, the definition level and the repetition level?
* How are the nested fields in `c8y_ActiveAlarmStatus` shown in the Parquet file? What are the path, the definition level and the repetition level?

Even though the first rows contain empty `c8y_ActiveAlarmStatus`, Daft evolves the schema as it goes along to discover the full structure of the alarm statuses. Not all libraries can do that; in PyArrow you would have to specify the schema yourself if it's not adequately represented in the first record.
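To make the idea of an evolving schema concrete, here is a minimal pure-Python sketch that merges the type of every record into a running schema. It is an illustration only, not Daft's actual inference algorithm; the helper names, the type labels, the fallback to `string` and the alarm severity fields (`major`, `critical`) are assumptions made up for this example.

```python
def infer_type(value):
    """Describe a JSON value as a nested type; None means 'only nulls seen so far'."""
    if value is None:
        return None
    if isinstance(value, dict):
        return {key: infer_type(item) for key, item in value.items()}
    if isinstance(value, list):
        element = None
        for item in value:
            element = merge_types(element, infer_type(item))
        return ("list", element)
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    return "string"


def merge_types(left, right):
    """Combine two inferred types into one that covers both."""
    if left is None:
        return right
    if right is None:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        # Structs are merged field by field; fields missing on one side are kept.
        return {key: merge_types(left.get(key), right.get(key)) for key in {**left, **right}}
    if isinstance(left, tuple) and isinstance(right, tuple):
        return ("list", merge_types(left[1], right[1]))
    if left == right:
        return left
    return "string"  # assumption: conflicting types fall back to string


# Hypothetical records: the first one carries an empty alarm status.
records = [
    {"id": 1, "c8y_ActiveAlarmStatus": {}},
    {"id": 2, "c8y_ActiveAlarmStatus": {"major": 1}},
    {"id": 3, "c8y_ActiveAlarmStatus": {"critical": 2, "major": 0}},
]

first_record_schema = infer_type(records[0])

evolved_schema = None
for record in records:
    evolved_schema = merge_types(evolved_schema, infer_type(record))

print(first_record_schema)
print(evolved_schema)
```

A reader that only looks at the first record ends up with an empty struct for `c8y_ActiveAlarmStatus`, while the merged schema contains every field seen in any record.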

### Encoding structured types

Arbitrarily deeply nested structures are still represented as a flat list of columns in Parquet. The algorithm for this originates from [Google's Dremel](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf).

* **Repetition level** tells us at what level a value is repeated. If it is 0, a new top-level record is started. If it is 1, the repetition is in a top-level list in the record, and so on.
* **Definition level** tells us where in the path to a value a null appeared. If it is 0, the top-level property was not present. If it is 1, the top-level property was present but the next part of the path was null, and so on.

Here is the example from the original paper from Google:

![Shredding](../images/dremel-shredding.png)
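The two levels can also be computed by hand for a very small case. The following sketch shreds a single optional list column into `(value, repetition_level, definition_level)` triples. It assumes a simplified Dremel-style schema with one optional list of required integers, so the maximum definition level is 2; Parquet's real list encoding adds further levels. The function name and the sample records are made up for illustration.

```python
def shred_list_column(records):
    """Shred an optional list column into (value, repetition, definition) triples.

    Definition levels in this simplified schema:
      0 = the list itself is null, 1 = the list is present but empty,
      2 = an element is present.
    Repetition levels:
      0 = first entry of a new record, 1 = further element of the same list.
    """
    triples = []
    for values in records:
        if values is None:
            triples.append((None, 0, 0))
        elif len(values) == 0:
            triples.append((None, 0, 1))
        else:
            for index, value in enumerate(values):
                repetition = 0 if index == 0 else 1
                triples.append((value, repetition, 2))
    return triples


# Four records: a two-element list, an empty list, a null and a one-element list.
shredded = shred_list_column([[1, 2], [], None, [3]])
for triple in shredded:
    print(triple)
```

Every record contributes at least one triple, even when it holds no value. That is how a reader can rebuild the record boundaries and tell an empty list from a missing one.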

## Processing complex structured data

Let's work with the `radiator.jsonl` dataset. This file is a classic example of "dirty data" from an industrial source. 

Specifically, the `meas_ModelNumber` field has conflicting types:
* In most records, it is a struct `{"name": "Audi RS6", "value": 4}` where `value` is an integer.
* In some records (around line 172,000), the `value` is sent as a string: `"value": " 3"` or `"value": "1"`.

How will `meas_ModelNumber.value` turn out?

In [None]:
radiator_df = daft.read_json('../data/input/radiator.jsonl').with_column("source_id", daft.col("source")["value"])
files = radiator_df.write_parquet('../data/output/radiator', write_mode='overwrite')
radiator_parquet_path = files.to_pydict()['path'][0]

inspect(Path(radiator_parquet_path))

How did Daft handle the type conflict?

1. It infers `Int64` for `meas_ModelNumber.value` from the clean records at the beginning.
2. When it eventually encounters the string values (`" 3"`, `"1"`), it effectively **coerces** them into the target type (`Int64`).
3. This allows the read to complete successfully without dropping data or requiring a manual "cleaning pass" beforehand. 

Stricter readers such as PyArrow typically fail to parse this file once the type changes. Daft tolerates a certain amount of such drift.
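If a reader does not coerce values for you, the manual "cleaning pass" mentioned above could look roughly like this. It is a pure-Python sketch, not what Daft does internally; the function name, the sample lines and the choice to map unparseable values to `None` are assumptions for illustration.

```python
import json


def coerce_model_value(record):
    """Normalise meas_ModelNumber.value to an int, or None if it cannot be parsed."""
    model = record.get("meas_ModelNumber")
    if isinstance(model, dict) and "value" in model:
        try:
            # str() + strip() handles both the integer 4 and the string " 3".
            model["value"] = int(str(model["value"]).strip())
        except ValueError:
            model["value"] = None  # assumption: keep the row, drop the bad value
    return record


# Hypothetical sample lines in the shape described above.
lines = [
    '{"meas_ModelNumber": {"name": "Audi RS6", "value": 4}}',
    '{"meas_ModelNumber": {"name": "Audi RS6", "value": " 3"}}',
    '{"meas_ModelNumber": {"name": "Audi RS6", "value": "n/a"}}',
]

cleaned = [coerce_model_value(json.loads(line)) for line in lines]
cleaned_values = [record["meas_ModelNumber"]["value"] for record in cleaned]
print(cleaned_values)
```

Mapping unparseable values to `None` keeps the row but loses the original text, so it is worth counting how many values were dropped before trusting the result.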

## Parquet predicate pushdown and tuning

As discussed, sorting data helps query engines skip unnecessary data. Let's create a sorted version of the radiator data. We sort by the ID of the device and the time of measurement.

Note: To demonstrate the effect, we write the DataFrame with a small target row group size (4 MiB instead of 128 MiB). We also flatten the `source` property into a top-level `source_id` column -- we will see why.

In [None]:
radiator_sorted_df = radiator_df.with_column("source_id", daft.col("source")["value"]).sort([daft.col("source_id"), daft.col("time")])

daft.set_execution_config(parquet_target_row_group_size=4*1024*1024)
files = radiator_sorted_df.write_parquet('../data/output/radiator_sorted', write_mode='overwrite')
radiator_sorted_parquet_path = files.to_pydict()['path'][0]

inspect(Path(radiator_sorted_parquet_path))

Open some column chunks and check the statistics of the `source_id` property. 

Now let's query one measurement of a single device. Device "1822301" appears only in the last column chunk of the sorted file, as can be seen by the statistics.
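The decision a query engine makes from these statistics can be sketched in a few lines. The min/max pairs below are invented for illustration and are not read from the actual files; the function name is hypothetical as well. All IDs have the same length, so comparing them as strings gives the expected order.

```python
def row_groups_to_read(stats, wanted):
    """Return the indices of row groups whose [min, max] range can contain `wanted`."""
    return [index for index, (low, high) in enumerate(stats) if low <= wanted <= high]


# Unsorted file: every row group spans almost the whole ID range.
unsorted_stats = [("1000001", "1822301"), ("1000001", "1822301"), ("1000001", "1822301")]
# Sorted file: each row group covers a narrow slice of the ID range.
sorted_stats = [("1000001", "1300000"), ("1300000", "1600000"), ("1600000", "1822301")]

print(row_groups_to_read(unsorted_stats, "1822301"))
print(row_groups_to_read(sorted_stats, "1822301"))
```

With overlapping ranges no row group can be ruled out, whereas the sorted layout leaves only the last one to read.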

In [None]:
print("Querying unoptimized file...")
df_unopt = daft.read_parquet(radiator_parquet_path)

start = time()
result = df_unopt.filter(
    daft.col("source_id") == "1822301"
).select(
    "time",
    daft.col("meas_PressureBalance")["current_ForceValue"]["value"].alias("force")
).collect()
full_time = time() - start
print(f"Query time on unoptimized file: {full_time:.2f} seconds")

print("Querying sorted file...")
df_opt = daft.read_parquet(radiator_sorted_parquet_path)

start = time()
result = df_opt.filter(
    daft.col("source_id") == "1822301"
).select(
    "time",
    daft.col("meas_PressureBalance")["current_ForceValue"]["value"].alias("force")
).collect()
full_time = time() - start
print(f"Query time on optimized file: {full_time:.2f} seconds")

daft.set_execution_config(parquet_target_row_group_size=128*1024*1024) # Set back to defaults

Of course, the files used in our tests are comparatively small and local filesystem access is very fast. You should still see a difference in execution speed. In my case, the query on the sorted data took about a sixth of the time.


## Bonus: Daft schema inference

As we have seen, real-world machine data is rarely clean. Let's try out some more challenging type conflicts and schema changes with "micro datasets" and verify the output. For example,

* What is the result if a property is encountered first as numeric `1` and then as string `"1.0.1"`?
* What happens if there are arrays with mixed types?
* What happens when first an array and then a struct is encountered? 

Feel encouraged to find your own "evil" scenarios. You will certainly run into such a situation at some point in real life.

In [None]:
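def check_evil_scenario(scenario_name, data):
    """Write `data` as JSON Lines to a temp file, read it with Daft, and show schema and rows."""
    display(Markdown(f"### {scenario_name}"))
    # delete=False: the file must outlive the `with` block so that Daft can read it by path.
    with tempfile.NamedTemporaryFile(dir="../data/output", mode='w', suffix='.jsonl', delete=False) as tmp:
        tmp_path = tmp.name
        try:
            for record in data:
                tmp.write(json.dumps(record) + '\n')
        except Exception as e:
            print(f"Error creating temp file: {e}")
            # Do not leave a half-written file behind.
            tmp.close()
            os.remove(tmp_path)
            return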

    try:
        df = daft.read_json(tmp_path)
        print("Inferred Schema:")
        print(df.schema())
        print("Data Preview:")
        display(df.collect())
    except Exception as e:
        print(f"Daft failed to read: {e}")
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)

check_evil_scenario("Numeric vs string", [{"a": 1}, {"a": "1.0.1"}] )
check_evil_scenario("Struct evolution", [{"a": {"x": 1}}, {"a": {"y": 2}}])
check_evil_scenario("Array vs struct", [{"a": [1]}, {"a": {"b": 1}}])
check_evil_scenario("Mixed types in array", [{"a": [1, "2", [3]]}])
check_evil_scenario("Null handling", [{"a": 1}, {"a": None}])

Daft does quite a good job. But when do you need to expect a loss of input data? How can you handle it? Check out Section "Querying" for an option to rewrite data with a different structure.

## Summary

Parquet is able to "flatten" any kind of structured data into a columnar representation by using Dremel's "shredding" algorithm. However, Parquet always expects well-defined data structures, which real-world input data does not always provide. Daft does quite a decent job of discovering a viable schema from the input data, but it cannot handle all types of changes automatically. Schemas may also degenerate over time. Eventually, you may have to intervene manually.

We have also executed a short test to evaluate "predicate pushdown" on the complex data, i.e., using metadata to only read the parts of a file that are required for querying. The effects are hard to demonstrate in test environments, but are very relevant for large, real-world data sets on object stores. We'll look into object stores in the next section.