Problem
DJ's existing `Column.type` is the physical SQL type (`bigint`, `varchar`, `double`, etc.), derived from the underlying query or source. It tells you how the value is stored, not what it means. The same SQL type backs many semantically distinct concepts:

- `bigint` could be a count, an epoch-ms timestamp, a boolean flag, an ID, or a regular number
- `varchar` could be free text, a UUID, an email, a JSON blob, or an ISO date string
- `double` could be a generic number, a percentage, a probability, or a currency amount

Consumers downstream of DJ have no way to recover this intent other than through name-suffix heuristics or other custom conventions.
Proposal
Add an optional `semantic_data_type` field on `Column`, structured as `{kind, code}`, mirroring the design of `unit` in #2149:
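from typing import Optional
from pydantic import BaseModel

class SemanticDataType(BaseModel):
    kind: SemanticDataKind  # enum of the kinds listed under "Initial kinds"
    code: Optional[str] = None  # interpreted per-kind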
Initial kinds
| kind | meaning | code |
| --- | --- | --- |
| boolean | flag value | must be None |
| string | textual data | optional format hint (email, uuid, json, ...) |
| number | a numeric value | must be None |
| date | a calendar date | optional type |
| timestamp | a point in time | precision: sec / ms / us / ns |
New kinds can be added incrementally over time.
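
The per-kind rules for `code` in the table above could be enforced at construction time. The following is a minimal sketch, not the proposed implementation: it uses stdlib dataclasses instead of pydantic to stay dependency-free, the `SemanticDataKind` enum members are inferred from the table, and it assumes that a timestamp `code` may be omitted but must be one of the listed precisions when present (the proposal does not say whether it is required).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SemanticDataKind(Enum):
    # Members inferred from the "Initial kinds" table.
    BOOLEAN = "boolean"
    STRING = "string"
    NUMBER = "number"
    DATE = "date"
    TIMESTAMP = "timestamp"


# Kinds for which the table says code "must be None".
NO_CODE_KINDS = {SemanticDataKind.BOOLEAN, SemanticDataKind.NUMBER}
# Precisions the table lists for timestamp.
TIMESTAMP_PRECISIONS = {"sec", "ms", "us", "ns"}


@dataclass
class SemanticDataType:
    kind: SemanticDataKind
    code: Optional[str] = None  # interpreted per-kind

    def __post_init__(self) -> None:
        # Accept either an enum member or its string value ("number", ...).
        self.kind = SemanticDataKind(self.kind)
        if self.kind in NO_CODE_KINDS and self.code is not None:
            raise ValueError(f"code must be None for kind={self.kind.value}")
        # Assumption: a missing timestamp code is allowed; a present one
        # must be one of the listed precisions.
        if (
            self.kind is SemanticDataKind.TIMESTAMP
            and self.code is not None
            and self.code not in TIMESTAMP_PRECISIONS
        ):
            raise ValueError(f"unknown timestamp precision: {self.code!r}")
        # string and date codes are left unchecked: the table describes
        # them as optional hints without a closed set of values.
```

With this sketch, `SemanticDataType("timestamp", "ms")` is accepted, while `SemanticDataType("boolean", "flag")` and `SemanticDataType("timestamp", "minutes")` raise `ValueError`.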
Examples
The `kind` of `semantic_data_type` composes with `unit` (from #2149). The two fields are orthogonal:

- `semantic_data_type` = what the value is
- `unit` = what measurement scale applies (when relevant)

`unit` is meaningful only when `semantic_data_type.kind` is `number`. For all other kinds, `unit` must be unset.
columns:
  - name: order_count
    type: bigint
    semantic_data_type:
      kind: number
  - name: revenue
    type: double
    semantic_data_type:
      kind: number
    unit:
      kind: currency
      code: USD
  - name: response_time_ms
    type: bigint
    semantic_data_type: {kind: number}
    unit: {kind: time, code: ms}
  - name: order_date
    type: date
    semantic_data_type: {kind: date}
  - name: event_ts
    type: bigint
    semantic_data_type: {kind: timestamp, code: ms}
  - name: user_email
    type: varchar
    semantic_data_type: {kind: string, code: email}
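
The rule that `unit` is only meaningful for number columns could be checked per column roughly as follows. This is a sketch under assumptions: `unit_violation` is a hypothetical helper (not an existing DJ function), columns are plain dicts shaped like the YAML above, and a column with no `semantic_data_type` at all is left alone, since the proposal does not say how `unit` should behave in that case.

```python
from typing import Any, Mapping, Optional


def unit_violation(column: Mapping[str, Any]) -> Optional[str]:
    """Return an error message if `unit` is set on a non-number column, else None."""
    if column.get("unit") is None:
        return None
    semantic_type = column.get("semantic_data_type")
    if semantic_type is None:
        # Assumption: without a semantic_data_type there is nothing to check.
        return None
    kind = semantic_type.get("kind")
    if kind != "number":
        return f"{column.get('name')}: unit requires kind 'number', got {kind!r}"
    return None


columns = [
    {
        "name": "revenue",
        "type": "double",
        "semantic_data_type": {"kind": "number"},
        "unit": {"kind": "currency", "code": "USD"},
    },
    {
        "name": "event_ts",
        "type": "bigint",
        "semantic_data_type": {"kind": "timestamp", "code": "ms"},
    },
    # Deliberately invalid: a unit on a timestamp column.
    {
        "name": "bad_ts",
        "type": "bigint",
        "semantic_data_type": {"kind": "timestamp", "code": "ms"},
        "unit": {"kind": "time", "code": "ms"},
    },
]

# Collect one message per offending column.
errors = [msg for msg in map(unit_violation, columns) if msg is not None]
```

Here only `bad_ts` is reported; `revenue` and `event_ts` pass.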