In [0]:
%sql
use catalog `get_started`;
use schema `labuser`;

SHOW TABLES

In [0]:
%sql
SELECT string(key), string(value) FROM PARQUET.`/Volumes/get_started/labuser/myfiles/events`

In [0]:
%sql
-- The raw events come from a Kafka payload, where key and value are usually binary-encoded JSON. Cast both to strings to make them readable.

CREATE OR REPLACE TEMP VIEW events_strings AS
SELECT string(key) AS key, string(value) AS value FROM PARQUET.`/Volumes/get_started/labuser/myfiles/events`;

SELECT * FROM events_strings LIMIT 10;
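
To see what the cast does in isolation, here is a minimal sketch that does not touch the events files. The JSON literal and the column names raw_value and value_string are made up for illustration; casting a string to BINARY only simulates the bytes a Kafka record would carry.

```sql
-- Hypothetical one-row sample: encode a JSON string as BINARY (UTF-8 bytes),
-- then decode it back with string(), as the view above does for key and value.
WITH sample AS (
  SELECT CAST('{"event_type": "error"}' AS BINARY) AS raw_value
)
SELECT
  raw_value,                          -- binary column, displayed as bytes
  string(raw_value) AS value_string   -- readable JSON text again
FROM sample;
```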

In [0]:
%sql
SELECT * FROM events_strings
WHERE value:event_type = 'error'
ORDER BY key LIMIT 5;
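
The query above relies on the colon (:) operator, which Databricks SQL provides for extracting fields from a JSON string by path. Here is a minimal sketch on a hypothetical one-row sample; the sample CTE and its shortened JSON literal are assumptions, not part of the events data.

```sql
-- Hypothetical sample row holding a JSON string in a column named value.
WITH sample AS (
  SELECT '{"event_type": "error", "location": {"country": "IN", "city": "New York"}, "devices": ["tablet", "mobile"]}' AS value
)
SELECT
  value:event_type    AS event_type,   -- top-level field
  value:location.city AS city,         -- nested field via dot notation
  value:devices[0]    AS first_device  -- array element by index
FROM sample;
```

Values extracted this way come back as strings, so cast them when a numeric or other type is needed.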

Let's take a sample JSON string from the results above, derive its schema, and then parse the entire JSON column into STRUCT types.

• schema_of_json() returns the schema derived from an example JSON string.
• from_json() parses a column containing a JSON string into a STRUCT type using the specified schema.

After we unpack the JSON string to a STRUCT type, let's unpack and flatten all STRUCT fields into columns.

• Star (*) unpacking flattens a STRUCT: col_name.* pulls the subfields of col_name out into their own columns.

In [0]:
%sql
SELECT schema_of_json('{"event_type": "purchase", "timestamp": 1744492072, "location": {"country": "IN", "city": "New York"}, "devices": ["tablet", "mobile", "mobile"], "items": [{"sku": "65ac8e80", "qty": 4, "price": 180.23}, {"sku": "cbb9ac76", "qty": 4, "price": 153.06}, {"sku": "e1a387bb", "qty": 3, "price": 127.65}], "error": null, "tags": ["evening", "morning"]}') AS schema;

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW parsed_events AS
SELECT json.* FROM (
  SELECT from_json(value, 'STRUCT<devices: ARRAY<STRING>, error: STRING, event_type: STRING, items: ARRAY<STRUCT<price: DOUBLE, qty: BIGINT, sku: STRING>>, location: STRUCT<city: STRING, country: STRING>, tags: ARRAY<STRING>, timestamp: BIGINT>') AS json
  FROM events_strings
);

SELECT * FROM parsed_events LIMIT 5
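
As an alternative to pasting the schema string by hand, from_json() should also accept a schema_of_json() call on a literal as its schema argument. The sketch below is self-contained and uses hypothetical, shortened JSON literals rather than the events data, so treat it as an illustration of the pattern, not as the notebook's method.

```sql
-- Hypothetical one-row sample; the second literal only serves as a schema template.
WITH sample AS (
  SELECT '{"event_type": "error", "timestamp": 1744492072, "location": {"country": "IN", "city": "New York"}}' AS value
)
SELECT
  json.*,                          -- flatten top-level STRUCT fields into columns
  json.location.city AS city       -- reach into the nested location STRUCT
FROM (
  SELECT from_json(
           value,
           schema_of_json('{"event_type": "purchase", "timestamp": 0, "location": {"country": "", "city": ""}}')
         ) AS json
  FROM sample
);
```

Deriving the schema inline keeps the template and the parse in one place, but an explicit schema string is easier to review when the payload has many fields.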

### Manipulate Arrays
Spark SQL has a number of functions for manipulating array data, including the following:
- explode() separates the elements of an array into multiple rows, creating a new row for each element.
- size() returns the number of elements in an array for each row.

The code below explodes the items field (an array of structs) into multiple rows, then shows the events whose items array has 3 or more elements.

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW exploded_events AS
SELECT *, explode(items) AS item
FROM parsed_events;

SELECT * FROM exploded_events WHERE size(items) > 2;
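
To make the behaviour of explode() and size() concrete, here is a small self-contained sketch. The sample CTE is hypothetical; its three structs reuse the sku and qty values from the example JSON string.

```sql
-- Hypothetical one-row sample with an array of three structs.
WITH sample AS (
  SELECT array(
           named_struct('sku', '65ac8e80', 'qty', 4),
           named_struct('sku', 'cbb9ac76', 'qty', 4),
           named_struct('sku', 'e1a387bb', 'qty', 3)
         ) AS items
)
SELECT
  size(items) AS item_count,  -- counts the original array, so it is the same on every row
  item.sku,                   -- fields of the exploded struct
  item.qty
FROM (
  -- explode() emits one output row per array element
  SELECT items, explode(items) AS item FROM sample
);
```

Each exploded row keeps the full items array, which is why filtering on size(items) after the explode still works.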