Add semantic_data_type field on Column for author-declared intent #2151

@shangyian

Problem

DJ's existing Column.type is the physical SQL type (bigint, varchar, double, etc.), derived from the underlying query or source. It tells you how the value is stored, not what it means. The same SQL type backs many semantically distinct concepts:

  • bigint could be a count, an epoch-ms timestamp, a boolean flag, an ID, or a regular number
  • varchar could be free text, a UUID, an email, a JSON blob, or an ISO date string
  • double could be a generic number, a percentage, a probability, or a currency amount

Consumers downstream of DJ have no way to recover this intent other than name-suffix heuristics or other custom conventions.

Proposal

Add an optional semantic_data_type field on Column, structured as {kind, code}, mirroring the design of unit in #2149:

class SemanticDataType(BaseModel):
    kind: SemanticDataKind
    code: Optional[str] = None  # interpreted per-kind

Initial kinds

| kind | meaning | code |
| --- | --- | --- |
| boolean | flag value | must be None |
| string | textual data | optional format hint (email, uuid, json, ...) |
| number | a numeric value | must be None |
| date | a calendar date | optional type |
| timestamp | a point in time | precision: sec / ms / us / ns |

New kinds can be added incrementally over time.
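
As a rough illustration of how the per-kind `code` rules could be enforced, here is a minimal sketch. It uses stdlib dataclasses rather than pydantic so it runs without dependencies. The enum members, `TIMESTAMP_PRECISIONS`, and the validation logic are assumptions drawn from the table above, not DJ's actual implementation. In particular, it assumes a timestamp may omit its precision and that string/date hints are not validated.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SemanticDataKind(str, Enum):
    # Hypothetical enum; members mirror the initial kinds in the proposal.
    BOOLEAN = "boolean"
    STRING = "string"
    NUMBER = "number"
    DATE = "date"
    TIMESTAMP = "timestamp"


# Precision codes listed for kind=timestamp.
TIMESTAMP_PRECISIONS = {"sec", "ms", "us", "ns"}


@dataclass(frozen=True)
class SemanticDataType:
    kind: SemanticDataKind
    code: Optional[str] = None  # interpreted per-kind

    def __post_init__(self) -> None:
        # Accept plain strings such as "number"; unknown kinds raise ValueError.
        kind = SemanticDataKind(self.kind)
        object.__setattr__(self, "kind", kind)

        if kind in (SemanticDataKind.BOOLEAN, SemanticDataKind.NUMBER):
            if self.code is not None:
                raise ValueError(f"code must be None for kind={kind.value}")
        elif kind is SemanticDataKind.TIMESTAMP:
            # Assumption: a missing precision is allowed; unknown values are not.
            if self.code is not None and self.code not in TIMESTAMP_PRECISIONS:
                raise ValueError(f"unknown timestamp precision: {self.code!r}")
        # string and date: code is an optional hint and is not validated here.
```

Keeping the check in one place means a new kind only needs a new enum member and, if its `code` is constrained, one extra branch.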

Examples

semantic_data_type composes with unit (from #2149). The two fields are orthogonal:

  • semantic_data_type = what the value is
  • unit = what measurement scale applies (when relevant)

unit is meaningful only when semantic_data_type.kind is number. For all other kinds, unit must be unset.
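
The unit rule could be checked per column along these lines. This is a sketch only: `check_unit_rule` is a hypothetical helper operating on plain dicts shaped like the YAML in this issue, and it assumes a column that sets unit without any semantic_data_type is left alone, since the proposal makes the field optional.

```python
from typing import Any, Mapping, Optional


def check_unit_rule(column: Mapping[str, Any]) -> Optional[str]:
    """Return an error message if the column violates the unit rule, else None."""
    unit = column.get("unit")
    if unit is None:
        return None  # nothing to check

    semantic_type = column.get("semantic_data_type")
    if semantic_type is None:
        # Assumption: unit without a declared semantic type is not an error.
        return None

    kind = semantic_type.get("kind")
    if kind != "number":
        return (
            f"column {column.get('name')!r}: unit is only allowed when "
            f"semantic_data_type.kind is 'number' (got {kind!r})"
        )
    return None
```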

  columns:
    - name: order_count
      type: bigint
      semantic_data_type:
        kind: number

    - name: revenue
      type: double
      semantic_data_type:
        kind: number
      unit:
        kind: currency
        code: USD

    - name: response_time_ms
      type: bigint
      semantic_data_type: {kind: number}
      unit: {kind: time, code: ms}

    - name: order_date
      type: date
      semantic_data_type: {kind: date}

    - name: event_ts
      type: bigint
      semantic_data_type: {kind: timestamp, code: ms}

    - name: user_email
      type: varchar
      semantic_data_type: {kind: string, code: email}
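
To show what a downstream consumer gains, here is a sketch of decoding a column like event_ts using its declared precision instead of guessing from the name. `epoch_to_datetime` and `TICKS_PER_SECOND` are hypothetical names, and the sketch assumes timestamp values are offsets from the Unix epoch in UTC, which the proposal implies ("epoch-ms timestamp") but does not spell out.

```python
from datetime import datetime, timedelta, timezone

# Ticks per second for each precision code in the proposal.
TICKS_PER_SECOND = {"sec": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_to_datetime(value: int, code: str) -> datetime:
    """Convert an integer epoch offset to a UTC datetime using the declared precision."""
    ticks = TICKS_PER_SECOND[code]  # KeyError for an unknown precision code
    seconds, remainder = divmod(value, ticks)
    # datetime resolves to microseconds, so ns values are truncated.
    micros = remainder * 1_000_000 // ticks
    return EPOCH + timedelta(seconds=seconds, microseconds=micros)
```

With `semantic_data_type: {kind: timestamp, code: ms}`, a consumer would call `epoch_to_datetime(value, "ms")` rather than inferring the precision from a `_ms` or `_ts` suffix.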
