Let's create a table we can work with. It should be a time-series table.

In [0]:
%sql
create table workspace.default.sales (
    timestamp TIMESTAMP,
    amount DOUBLE
)


I created a basic notebook, `insert 1h of data.ipynb`, to fill the table with data. Then I set up a job to run that notebook every hour.


In [0]:
%sql
select * from workspace.default.sales
limit 10


Create the Monitor via Unity Catalog Explorer 👇

I set up the monitor with the `TimeSeries` profile. I selected `timestamp` as the timestamp column and chose a granularity of 1 hour. Note that the monitor itself is scheduled to run daily, not hourly.

![](./img/create_monitor.png)

By default, two new tables are created:

- `<table_name>_profile_metrics`
- `<table_name>_drift_metrics`

Let's inspect them.

In [0]:
%sql
SHOW TABLES IN workspace.default;

In [0]:
%sql
select * from workspace.default.sales_profile_metrics

The **profile** table has one row for each pair of:
- `window` (the beginning and end of every hour)
- `column_name` (every column of the table, plus a special `:table` entry that holds the table-level profile)

Optionally, the rows can also be sliced by column values, if slicing is specified when the _Monitor_ is created.

For each row, it computes a set of statistics such as `avg`, `quantiles`, `min`, and `max` (when applicable, e.g. for float columns).
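To make the row layout concrete, here is a minimal pure-Python sketch of how such a profile could be computed by hand for the `sales` table. It is only an illustration of the "one row per window and column" idea, not how Lakehouse Monitoring is implemented. The helper name `hourly_profile`, the sample rows, and the choice of statistics are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_profile(rows):
    """Return one stats dict per (window, column_name) pair.

    `rows` is an iterable of (timestamp, amount) tuples, mirroring the
    `sales` table. Hypothetical helper, for illustration only.
    """
    buckets = defaultdict(list)
    for ts, amount in rows:
        # Truncate the timestamp to the start of its 1-hour window.
        start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(start, start + timedelta(hours=1))].append(amount)

    profile = []
    for window, amounts in sorted(buckets.items()):
        # Column-level row: statistics of the numeric `amount` column.
        # (Only `amount` is profiled here to keep the sketch short.)
        profile.append({
            "window": window,
            "column_name": "amount",
            "count": len(amounts),
            "avg": mean(amounts),
            "min": min(amounts),
            "max": max(amounts),
        })
        # Table-level row: only statistics that apply to the whole table.
        profile.append({
            "window": window,
            "column_name": ":table",
            "count": len(amounts),
        })
    return profile


# Made-up sample data: two rows in the 09:00 window, one in the 10:00 window.
rows = [
    (datetime(2024, 1, 1, 9, 5), 10.0),
    (datetime(2024, 1, 1, 9, 40), 30.0),
    (datetime(2024, 1, 1, 10, 15), 5.0),
]
profile = hourly_profile(rows)
for row in profile:
    print(row)
```

Two windows times two entries (`amount` and `:table`) gives four profile rows, which is the same shape you see when querying the real profile table.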

In [0]:
%sql
select * from workspace.default.sales_drift_metrics

The **drift** table is similar to the profile table. It has one row for each pair of:
- `window` (the beginning and end of every hour)
- `column_name` (every column of the table, plus a special `:table` entry for the table-level metrics)

In addition, it has a `window_cmp` column, where _cmp_ stands for _compare_. All the statistics are computed by comparing the window against another window (the previous one). There are various statistics, such as:
- `count_delta`
- `ks_test`: in statistics, the two-sample Kolmogorov–Smirnov test can be used to check whether two samples came from the same distribution
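To get a feel for these two metrics, here is a small pure-Python sketch that compares a "previous" and a "current" window. It computes only the Kolmogorov–Smirnov statistic (the largest gap between the two empirical CDFs), not a p-value, and it assumes `count_delta` means "current count minus previous count". The function name `ks_statistic` and the sample values are made up for this example; the values stored by the monitor may be defined or structured differently.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of `a` that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of `b` that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up `amount` values for two consecutive windows.
previous_window = [10.0, 12.0, 11.0, 13.0]
current_window = [20.0, 22.0, 21.0, 23.0, 24.0]

# Assumption: delta = current count minus previous count.
count_delta = len(current_window) - len(previous_window)
drift = ks_statistic(previous_window, current_window)

print(count_delta, drift)
```

A statistic close to 0 means the two windows look alike, while a value close to 1 means the distributions barely overlap, which is the case for the sample values above.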

### Dashboard

Lakehouse Monitoring also automatically creates a dashboard that displays the data in these _profile and drift_ tables.

However, I find this dashboard too crowded and not ready to use out of the box. You need to customize it yourself.

### Alerts
Monitor alerts are created and used the same way as other Databricks SQL alerts. You create a Databricks SQL query on the monitor profile metrics table or drift metrics table. You then create a Databricks SQL alert for this query.
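The alert condition itself is just a predicate over rows of a metrics table. As a rough sketch of the logic only, the Python snippet below flags table-level drift rows whose row count changed by more than a threshold. In Databricks you would express the same condition as a SQL query and attach the alert to it. The function name `rows_to_alert_on`, the threshold, and the sample rows (plain dicts, not the real table schema) are assumptions for illustration.

```python
def rows_to_alert_on(drift_rows, max_abs_count_delta):
    """Return table-level drift rows whose row count moved past the threshold."""
    return [
        row for row in drift_rows
        if row["column_name"] == ":table"
        and abs(row["count_delta"]) > max_abs_count_delta
    ]


# Made-up rows, shaped loosely like the drift table described above.
drift_rows = [
    {"window": "2024-01-01 09:00", "column_name": ":table", "count_delta": 3},
    {"window": "2024-01-01 10:00", "column_name": ":table", "count_delta": -250},
    {"window": "2024-01-01 10:00", "column_name": "amount", "count_delta": -250},
]

alerts = rows_to_alert_on(drift_rows, max_abs_count_delta=100)
print(alerts)
```

Filtering on `:table` avoids raising the same alert once per column when the whole table loses or gains rows.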

### Pricing

Lakehouse Monitoring is billed under a serverless jobs SKU. You can monitor its usage via the `system.billing.usage` table or via the Usage dashboard in the account console.

Keep an eye on this. I expect that costs may rise for tables with a high number of columns if you don't fine-tune the monitor.

In [0]:
%sql
SELECT usage_date, sum(usage_quantity) as dbus
FROM system.billing.usage
WHERE
  usage_date >= DATE_SUB(current_date(), 30) AND
  sku_name like "%JOBS_SERVERLESS%" AND
  custom_tags["LakehouseMonitoring"] = "true"
GROUP BY usage_date
ORDER BY usage_date DESC

### Opinion

Lakehouse monitoring is all about these two profile and drift tables. It takes little effort