Column Types

Introduction

Workbench manipulates tabular data in Apache Arrow, Apache Parquet 2.0 and Pandas DataFrame formats. (See Table Data Structures to understand how each format is used.)

Arrow is the canonical data format.

Here are the valid column types and how to use them.

What is a column?

A column in Arrow

In Arrow, a column is a ChunkedArray (C++ docs, Python docs). For example:

import pyarrow as pa
data = pa.chunked_array([["a", "b", "c", None]])

(Currently, Workbench only supports zero or one Array in the ChunkedArray.)

All columns support null values.

A table is a Table (C++ docs, Python docs). For example:

table = pa.table({"A": data})

A table must not have any metadata. The schema can always be derived exclusively from the table's ChunkedArrays' data types.
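
As a quick sketch, the invariant can be checked like this:

assert table.schema.metadata is None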

A column in Parquet

The Parquet format is used as a render cache: all Parquet data has a one-to-one mapping to Arrow data. Parquet tables have no metadata.
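
For illustration, a cache round trip might look like this (a sketch; "render-cache.parquet" is a hypothetical path, not Workbench's actual cache location):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"A": ["a", "b", None]})
pq.write_table(table, "render-cache.parquet")
assert pq.read_table("render-cache.parquet").equals(table)  # one-to-one mapping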

A column in Pandas

Every Pandas column can be represented as a pandas.Series:

import pandas as pd

# a DataFrame
table: pd.DataFrame = pd.DataFrame({
    'A': ['a', 'b', 'c'],
    'B': [1, 2, 3],
})

# a Series
a_series: pd.Series = table['A']

Nulls

Workbench does not have a null column type, but any value (in any column) may be null.

Workbench supports nulls because its input data may include them. Workbench itself assigns no meaning to null; the nuance of null comes from our community of modules.

Our guiding principle with null is: it should remind users of SQL's NULL. Here are the basics of NULL-handling in SQL:

SELECT NULL + 3; -- NULL: arithmetic returns null
SELECT NULL || 'foo'; -- NULL: string operations produce null
SELECT SUM(v)
  FROM (SELECT UNNEST(ARRAY[1,2,NULL,4]) AS v) t
; -- 7: aggregations ignore null values...
SELECT SUM(v)
  FROM (SELECT UNNEST(ARRAY[NULL,NULL]::BIGINT[]) AS v) t
; -- NULL: ... or return null when all values are null

Each module can handle null in its own way. The module's documentation should clarify what null means.
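
In Pandas, the analogous behavior looks like this (a sketch; note the last line, where Pandas differs from SQL unless you pass min_count=1):

import numpy as np
import pandas as pd

np.nan + 3                          # nan: arithmetic propagates null
pd.Series([1, 2, np.nan, 4]).sum()  # 7.0: aggregations skip nulls
pd.Series([np.nan, np.nan]).sum()   # 0.0 -- not null! pass min_count=1 to mimic SQL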

Table rules

Regardless of format, each table must follow these rules:

  • Each column must have a different name.
  • Column names cannot include ASCII control characters such as \n. (The full disallowed range is \x00-\x1f.)
  • Column names cannot exceed 120 bytes when UTF-8-encoded. (All formats store column names as UTF-8.)
  • Column names must be valid Unicode.
  • Tables cannot exceed a million rows.
  • Tables cannot exceed 500 columns.
  • Values must be of the types listed below; and they must follow type-specific rules.

If your module output does not follow these rules, Workbench will tell the user your module has a bug.
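
As a sketch, a hypothetical validator for these rules (check_table_rules is not a real Workbench function) might look like:

import re
import pyarrow as pa

def check_table_rules(table: pa.Table) -> None:
    names = table.column_names
    if len(names) != len(set(names)):
        raise ValueError("two columns share a name")
    for name in names:
        if re.search(r"[\x00-\x1f]", name):
            raise ValueError("column name contains a control character")
        if len(name.encode("utf-8")) > 120:
            raise ValueError("column name exceeds 120 bytes of UTF-8")
    if table.num_rows > 1_000_000:
        raise ValueError("table exceeds a million rows")
    if table.num_columns > 500:
        raise ValueError("table exceeds 500 columns")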

Calling convention

A module's render() must accept and return a pd.DataFrame or ArrowTable:

from typing import Any, Dict
import pandas as pd

def render(table: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    # ...
    return table  # or another pd.DataFrame

Column types

Text type

A valid Text value

Each valid Text value is made up of 0 to 32,767 bytes of valid UTF-8 (or null).

Text in Arrow

Arrow text is either a StringArray (C++ docs, Python docs) or a DictionaryArray (C++ docs, Python docs) with StringArray values.

DictionaryArray is an optimization, and all modules must read it. Only use dictionaries to store large amounts of repetitive data. A good guideline: don't store more than 100 unique values or values over 100 bytes each. (TODO create hard limits.)

A DictionaryArray must not contain unused dictionary values or duplicate dictionary values.

While Step code sees the difference between DictionaryArray and StringArray, users do not see the difference. To them, text data is text data.

import pyarrow as pa

# these all produce equivalent text columns
string_array = pa.array(["a", "b", "c", None, "a"])
dictionary_array = pa.DictionaryArray.from_arrays([0, 1, 2, None, 0], ["a", "b", "c"])
dictionary_array = string_array.dictionary_encode()
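
If you've filtered a dictionary-encoded column, its dictionary may contain unused values. One way to compact it (a sketch):

import pyarrow as pa

# "z" is unused -- invalid in Workbench
sparse = pa.DictionaryArray.from_arrays([0, 1, 0], ["a", "b", "z"])
# decode to StringArray, then re-encode: unused and duplicate values disappear
compact = sparse.cast(pa.string()).dictionary_encode()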

Text in Parquet

Parquet mimics Arrow.

A StringArray is encoded in any valid Parquet encoding, except a dictionary encoding.

A DictionaryArray is encoded in RLE_DICTIONARY encoding.

Text in Pandas

A StringArray becomes a pd.Series of objects, in which each object is either a str or None.

A DictionaryArray becomes a pd.Series of unordered Categorical without unused values. Nulls are stored as numpy.nan in this case.

Beware: many Pandas algorithms behave differently when faced with String versus Categorical data. To be safe, you can always "un-optimize" and convert a categorical to objects.

All these series are equivalent, as far as the user is concerned:

import pandas as pd
import pyarrow as pa

str_series = pd.Series(["a", "b", "c", None, "a"])
str_series_from_arrow = pa.array(["a", "b", "c", None, "a"]).to_pandas()
cat_series = str_series.astype("category")
cat_series_from_arrow = pa.array(["a", "b", "c", None, "a"]).dictionary_encode().to_pandas()

# Converting from category to str takes two lines of code:
str_series_from_dict = cat_series.astype(str)
str_series_from_dict[cat_series.isnull()] = None

is series text?

Workbench only uses categorical arrays and object arrays to handle strings. So here's the test logic:

def is_text(series):
    return hasattr(series, 'cat') or series.dtype == object

manipulate text data in Pandas

  • Treat either as an object array using str functions: for instance, series.str.lower()
  • You can also test if hasattr(series, 'cat') and then use series.cat.rename_categories() and so forth. These functions can be fast and memory-efficient, but category logic is tricky: for instance, you can't simply write series = series.cat.rename_categories(series.cat.categories.str.upper()), because the upper() call may return duplicates, which would raise an error (see the sketch below). If you know you want categorical output and you're not optimizing, the simplest approach is series.str.func().astype('category')
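
A sketch of the pitfall and of the safe route (the column values here are invented for illustration):

import pandas as pd

cats = pd.Series(["a", "A", "b"], dtype="category")

# raises ValueError: upper() maps "a" and "A" to the same category
# cats.cat.rename_categories(cats.cat.categories.str.upper())

# the safe, unoptimized route: operate as str, then re-encode
upper = cats.str.upper().astype("category")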

Pandas is inefficient with text. It is RAM-heavy: Each value costs 50 bytes of overhead, plus a pointer, all in physical RAM. (In comparison, an Arrow value costs only a pointer in overhead, mmapped so text does not need to consume physical RAM.) And it is fragmented: text is scattered across physical RAM, making for many cache misses. (In comparison, Arrow pointers and text data are contiguous.) If your text-manipulation module is too slow, rewrite it as pure Arrow before you even begin optimizing.
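
For instance, a lowercasing step written in pure Arrow might look like this -- a sketch using pyarrow.compute, which operates on whole arrays at a time:

import pyarrow as pa
import pyarrow.compute as pc

data = pa.chunked_array([["a", "B", None]])
lowered = pc.utf8_lower(data)  # vectorized; nulls stay null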

convert number => text

Beware series.astype(str) because it will try to convert np.nan to str, which will not give the result you expect. To convert properly, use this two-step process:

text_series = number_series.astype(str)  # np.nan => "nan"
text_series[number_series.isna()] = np.nan  # "nan" from previous line => np.nan
text_series = text_series.astype('category')  # optional

Number type

Number arrays are arrays of signed integers or finite floats of any size (or null).

The rules of precision when operating on integer scalars and columns (see the sketch after this list):

  • Treat float values that can be losslessly converted to integer as integer.
  • If the input values are all integers that can be represented as (signed) int64, the output column must be an integer type; output values must be accurate as per int64 math.
  • Otherwise, the output column must be a float type; output values must be accurate as per float64 math.
  • To optimize space and speed, integers and floats may optionally be represented as int32, int16, int8 or float32 only if converting to the smaller type does not lose any data.
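
A sketch of lossless downcasting in Pandas: pd.to_numeric(downcast="integer") only converts when no data would be lost, which matches the rules above.

import pandas as pd

floats = pd.Series([1.0, 2.0, 3.0])  # whole floats: treat as integer
ints = pd.to_numeric(floats, downcast="integer")  # dtype becomes int8 -- lossless
mixed = pd.Series([1.0, 2.5])
still_float = pd.to_numeric(mixed, downcast="integer")  # unchanged: 2.5 is not integral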

Floats must be finite or null. Workbench does not support NaN, Infinity and -Infinity.

Module documentation should explain to the user how null is handled in that module. Usually, modules should mimic SQL NULL handling: when null is an input to simple math, return null; when null is a value being aggregated, ignore it.

Any Number column has a "format" that dictates how to convert its values to String. This is stored in metadata as a string adhering to a subset of the Python format specification. For instance: ${:,d}M means "dollar-sign prefix" ($), "thousands separator" (,), "cast as integer" (d) and "M suffix" (M). The default format is {:,}, which usually renders as a decimal of arbitrary precision. Whole numbers -- even whole floating-point numbers -- are cast to int before being passed to Python's format() function, because format strings must also work in JavaScript, and JavaScript has only one Number type. Common, useful formats: {:,.1%} (percentage), {:,.2f} (fixed-point notation).
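
For illustration, here is how such a format behaves in plain Python (note the cast-whole-floats-to-int step described above):

value = 1234.0
if isinstance(value, float) and value.is_integer():
    value = int(value)  # whole floats are cast to int before formatting
"${:,d}M".format(value)   # '$1,234M'
"{:,.1%}".format(0.123)   # '12.3%'
"{:,.2f}".format(1234.5)  # '1,234.50'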

Numbers in Arrow

The Arrow format for numbers is NumericArray (C++ docs, Python docs).

The concrete types each module must support are Int8Array, Int16Array, Int32Array, Int64Array, FloatArray and DoubleArray. Canonical logic operates on Int64Array and DoubleArray.

Nulls can be stored in any type. NaN, Infinity and -Infinity cannot be stored in any type.

Numbers in Parquet

The Parquet format for numbers is INT32, INT64, FLOAT32 or FLOAT64 -- whatever fits the data, using a sensible encoding. Integers' Logical Types must be signed-integer.

Nulls can be stored in any type. NaN, Infinity and -Infinity cannot be stored in any type.

Numbers in Pandas

Pandas numbers are backed by NumPy arrays (np.ndarray).

Nulls are represented as np.nan. True NaN, Infinity and -Infinity values cannot be stored. This usually isn't a big concern: Pandas's handling of np.nan usually does what we want to do with Workbench nulls.

BUG: numpy does not allow nulls in an int64-typed array. When the module-loading code converts an Arrow table to Pandas, it automatically converts a null-containing int64 array to float64. This is lossy, and it makes the module break the "rules of precision" above. If your users complain, rewrite your algorithm in Arrow to correct the problem.
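
You can see the conversion in action (a sketch):

import pyarrow as pa

pa.array([1, 2], pa.int64()).to_pandas().dtype     # int64
pa.array([1, None], pa.int64()).to_pandas().dtype  # float64 -- lossy for big ints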

is series number?

from pandas.api.types import is_numeric_dtype
def is_number(series):
    return is_numeric_dtype(series)

generate test number series

import numpy as np
import pandas as pd

series1 = pd.Series([1, 2, 3, np.nan])  # automatic dtype -- float64 in this case
series2 = pd.Series([1, 2.1, 3], dtype=np.float32)  # manual dtype

manipulate number data

A slew of numpy/pandas operations operate on an array at a time, instead of a value at a time. For instance:

series2 = series * 2  # series2 has same length as series, doubled values
series3 = series2 - series  # elementwise subtraction -- here, series3 equals series

convert text => number

number_series = pandas.to_numeric(text_series)  # throws ValueError
number_series = pandas.to_numeric(text_series, errors='coerce')  # np.nan for each parse error

Datetime type

This "timestamp" type is a 64-bit number (or null) representing nanoseconds since the UNIX Epoch (1970-01-01T00:00:00.000Z). It does not store timezone. All times are UTC.

Leap seconds are ignored: a leap second has the value of the second before it.

The minimum timestamp is 1677-09-21 00:12:43.145225Z; the maximum timestamp is 2262-04-11 23:47:16.854775807Z.

Datetime in Arrow

In Arrow, we use a TimestampArray (C++ docs, Python docs). The type must always be in ns.

The following are equivalent:

import datetime
import dateutil.parser
import pyarrow as pa
by_int64 = pa.array([1234678901234567000, None], pa.timestamp("ns"))
by_pyobject = pa.array([datetime.datetime(2009, 2, 15, 6, 21, 41, 234567), None], pa.timestamp("ns"))
by_parse = pa.array([dateutil.parser.parse("2009-02-15T06:21:41.234567Z"), None], pa.timestamp("ns"))

pyarrow.array() defaults to microsecond precision when parsing datetimes. Microsecond precision is not allowed. So this won't work:

by_pyobject = pa.array([datetime.datetime(2009, 2, 15)])  # INVALID

Datetime in Parquet

In Parquet, we use the TIMESTAMP logical type to store an int64 (or null). isAdjustedToUTC is always true, and precision is always NANOS.

Datetime in Pandas

In Pandas, we use the datetime64[ns] numpy type. See Pandas timeseries docs. null becomes pd.NaT.

is series datetime?

def is_datetime(series):
    return pandas.api.types.is_datetime64_dtype(series.dtype)

generate test datetime series

# Dates are long to write
series = pandas.Series(['2018-08-07T17:12:00.003-04:00'], dtype='datetime64[ns]')

convert text => datetime

If the input series is ISO8601-formatted, all is well. Otherwise, chaos.

For instance: pandas will interpret 03:12 as that time today. A proper converter has its work cut out for it.

datetime_series = pandas.to_datetime(text_series)  # throws ValueError
datetime_series = pandas.to_datetime(text_series, errors='coerce')  # pd.NaT for each parse error

manipulate datetime data

Pandas docs

A few cool ideas:

series.dt.floor('d')  # round timestamps down to the day
series2 - series1  # subtraction gives timedelta64 values (not Workbench-compatible)
(series2 - series1) / np.timedelta64(1, 's')  # np.float64 -- Workbench-compatible

datetime to text

iso8601_series = series.dt.strftime('%Y-%m-%dT%H:%M:%SZ')

Date type

This "date" type is a 32-bit number (or null) representing days since the UNIX Epoch (1970-01-01). It does not store timezone.

The minimum date is -5877641-06-23; the maximum is 5881580-07-11.

Beware, Pythonistas: Python's native datetime.date raises errors for years under 0001 or over 9999. Do not use Python datetime.date for Workbench date math. What's more, pyarrow and pandas tend to convert to Python datetime.date for some operations (in particular, repr()): you must avoid this.

A Date column has a "unit": "year", "month", "day", "week", "quarter". The "unit" declares which dates are valid and how dates should be formatted. "Day" is the default.

Date in Arrow

In Arrow, we use a Date32Array (C++ docs, Python docs).

The following are equivalent:

import datetime
import pyarrow as pa
by_pyobject = pa.array([datetime.date(2021, 4, 5), None])
by_int32 = pa.array([18722, None], pa.date32())

Date in Parquet

In Parquet, we use the DATE logical type to store an int32 (or null).

Date in Pandas

In Pandas, we use the period[D] Pandas type. See Pandas timeseries docs. null becomes pd.NaT.

Beware mixed metaphors. A Pandas Period is a timespan (with a duration and timezone); a Workbench Date is a calendar date. So while the physical storage is correct, you'll want to steer clear of Pandas' tendency to convert anything to timestamp. Workbench draws a clear line between timestamp (with timezone) and date (without timezone); your module should, too.
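
A sketch of date math that stays within Period and never round-trips through Timestamp:

import pandas as pd

dates = pd.Series([pd.Period("2021-04-05", "D")])
dates + 1                      # adds one day; the result is still period[D]
dates.dt.strftime("%Y-%m-%d")  # format without converting to Timestamp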

is series date?

def is_date(series):
    return pd.PeriodDtype('D') == series.dtype

generate test date series

series = pd.Series([pd.Period('2021-04-05', 'D')])

convert text => date

If the input series is ISO8601-formatted (YYYY-MM-DD), all is well. Otherwise, Pandas will silently leave the values as text -- hence the dtype check below.

import pandas as pd

date_series = pd.Series(text_series, dtype=pd.PeriodDtype('D'))
if pd.PeriodDtype('D') != date_series.dtype:
    raise ValueError("At least one date was badly formatted")

manipulate date data

Pandas docs

Cool ideas:

series.dt.asfreq('M').dt.asfreq('D', 'start')  # round down to the first day of the month

date to text

iso8601_series = series.dt.strftime('%Y-%m-%d')