initial antora documentation
cwensel committed Nov 14, 2023
1 parent ea836c5 commit fc07b7f
Showing 10 changed files with 402 additions and 0 deletions.
7 changes: 7 additions & 0 deletions tessellate-main/src/main/antora/antora.yml
@@ -0,0 +1,7 @@
name: tessellate
title: Tessellate
version: 1.0-wip
start_page: ROOT:index.adoc
nav:
- modules/ROOT/nav.adoc
- modules/reference/nav.adoc
2 changes: 2 additions & 0 deletions tessellate-main/src/main/antora/modules/ROOT/nav.adoc
@@ -0,0 +1,2 @@
* xref:install.adoc[]
* xref:support.adoc[]
57 changes: 57 additions & 0 deletions tessellate-main/src/main/antora/modules/ROOT/pages/index.adoc
@@ -0,0 +1,57 @@
= Tessellate

Tessellate is a command line tool for reading and writing data to/from multiple locations and across multiple formats.

This project is under active development and many features are considered alpha.

== About

A primary activity of any data-engineering effort is to format and organize data for different access patterns.

For example, logs frequently arrive as lines of text, but are often best consumed as structured data. And different
stakeholders may have different needs for the log data, so it must be organized (partitioned) in different ways that
support those needs.

Tessellate was designed to support data engineers and data scientists in their efforts to automate the management of
data for use by different platforms and tools.

== Overview

Tessellate is a command line interface (CLI).

[source,console]
.Show Help
----
tess -h
----

It expects a simple xref:reference:pipeline.adoc[JSON formatted pipeline] file to
xref:reference:source-sink.adoc[declare sources, sinks] and any xref:reference:transforms.adoc[transforms] to perform.

It can read and write files locally, on HDFS, or in AWS S3.

It also supports most text formats and Apache Parquet natively.

And during writing, it will efficiently partition data into different paths based on input or derived data available
in the pipeline.

== Use

Tessellate may be used from the command line, or in a container.

It also natively supports the https://github.com/ClusterlessHQ/clusterless[Clusterless] workload model.

Other uses include:

- Data inspection from a terminal
- Host log processing (push rotated logs to HDFS or the cloud)
- Processing data as it arrives in the cloud (for example, on AWS Fargate or ECS via AWS Batch)
- As a serverless function (like AWS Lambda); we plan to publish artifacts to Maven for inclusion in Lambda functions

== Cascading

Tessellate uses https://cascading.wensel.net/[Cascading] under the hood for all of its processing.

Historically, Cascading has been used to run large and complex Apache Hadoop and Tez applications, but it also supports
local execution without any Hadoop runtime dependencies. This makes it well suited to local processing, or to running
in a cloud environment on AWS ECS/Fargate or AWS Lambda.
14 changes: 14 additions & 0 deletions tessellate-main/src/main/antora/modules/ROOT/pages/install.adoc
@@ -0,0 +1,14 @@
= Installation

All Tessellate releases are available via https://brew.sh[Homebrew]:

[source,console]
----
brew tap clusterlesshq/tap
brew install tessellate
tess --version
----

Or, you can download the latest releases from GitHub:

- https://github.com/ClusterlessHQ/tessellate/releases
13 changes: 13 additions & 0 deletions tessellate-main/src/main/antora/modules/ROOT/pages/support.adoc
@@ -0,0 +1,13 @@
= Getting Help

We use a discussion board to support users.

- https://github.com/orgs/ClusterlessHQ/discussions

If you find bugs, or would like to suggest features or additional documentation, please use the discussion board; we
can then convert the suggestion into an issue for the appropriate project.

== Support

For ongoing corporate support, reach out to:

- https://chris.wensel.net/
5 changes: 5 additions & 0 deletions tessellate-main/src/main/antora/modules/reference/nav.adoc
@@ -0,0 +1,5 @@
.Reference
* xref:pipeline.adoc[]
* xref:source-sink.adoc[]
* xref:transforms.adoc[]
* xref:types.adoc[]
@@ -0,0 +1,51 @@
= Pipeline

`tess` expects a JSON formatted "pipeline" file that declares the xref:source-sink.adoc[sources, sinks], and
xref:transforms.adoc[transforms] to be run.

[source,console]
.Print Pipeline Template
----
tess --print-pipeline
----

[source,json]
----
{
"source" : {
"inputs" : [ ], <1>
"schema" : {
"declared" : [ ], <2>
"format" : null, <3>
"compression" : "none", <4>
"embedsSchema" : false <5>
},
"partitions" : [ ] <6>
},
"transform" : [ ], <7>
"sink" : {
"output" : null, <8>
"schema" : {
"declared" : [ ], <9>
"format" : null, <10>
"compression" : "none", <11>
"embedsSchema" : false <12>
},
"partitions" : [ ] <13>
}
}
----

<1> URLs to read from, required
<2> Schema fields to declare; required if the schema is not embedded, or if type information should be declared
<3> Format type
<4> Compression type
<5> Whether the schema is embedded in the files (has headers)
<6> Partitions to parse into fields
<7> Transforms to apply to the data
<8> URL to write to, required
<9> Schema fields to declare, by default all fields are written
<10> Format type
<11> Compression type
<12> Whether the schema should be embedded in the files (add headers)
<13> Partitions to write out
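
For illustration, a filled-in pipeline might look like the following, reading headerless CSV from AWS S3 and writing
partitioned Parquet. The structure follows the template above; the paths, field names, and values are hypothetical,
and the example assumes that declared fields, transforms, and partitions are given as plain strings in the formats
described in xref:source-sink.adoc[] and xref:transforms.adoc[].

[source,json]
.Example Pipeline (illustrative)
----
{
  "source" : {
    "inputs" : [ "s3://example-bucket/logs/2023/01/01/access.csv" ], <1>
    "schema" : {
      "declared" : [ "time|DateTime|yyyyMMdd", "status|String", "path|String" ],
      "format" : "csv",
      "compression" : "none",
      "embedsSchema" : false
    },
    "partitions" : [ ]
  },
  "transform" : [ "time+>year|DateTime|yyyy" ],
  "sink" : {
    "output" : "s3://example-bucket/curated/access/", <2>
    "schema" : {
      "declared" : [ ],
      "format" : "parquet",
      "compression" : "none",
      "embedsSchema" : false
    },
    "partitions" : [ "year" ]
  }
}
----

<1> Hypothetical input URL, field names, and values; only the keys come from the template above.
<2> Hypothetical output URL; `year` is partitioned using the simple form described in xref:source-sink.adoc[].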
@@ -0,0 +1,107 @@
= Source and Sink

Pipelines read data from sources and write data to sinks.

Each source or sink has a format and protocol.

The data may also be partitioned by values in the data set, allowing data from new data sets to be interleaved into
existing data sets.

This can significantly improve the performance of queries on the data.

== Formats

Every source and sink supports its own set of formats.

[source,console]
----
tess --show-source=formats
----

text/regex:: Lines of text parsed by regex (like Apache or S3 log files).
csv:: With or without headers.
tsv:: With or without headers.
parquet:: https://parquet.apache.org[Apache Parquet]

== Protocols

Every source and sink supports its own set of protocols.

[source,console]
----
tess --show-source=protocols
----

`file://`:: Read/write local files.
`s3://`:: Read/write files in AWS S3.
`hdfs://`:: Read/write files on the Apache Hadoop HDFS filesystem.

== Compression

Every source and sink supports its own set of compression formats.

[source,console]
----
tess --show-source=compression
----

Some common formats supported are:

* none
* gzip
* lz4
* bzip2
* brotli
* snappy

== Partitioning

Partitioning can be performed on values read from the data or created in the pipeline.

Path partitioning:: Data can be partitioned by intrinsic values in the data set.
named partitions::: e.g. `year=2023/month=01/day=01`, or
unnamed partitions::: e.g. `2023/01/01`

Partitions, when declared in the pipeline file, can be simple, or represent a transform.

Simple:: `<field_name>` becomes `/<field_name>=<field_value>/`
Transform:: `<field_name>+><partition_name>|<field_type>` becomes `/<partition_name>=<transformed_value>/`

Note the `+>` operator.

Consider the following example, where `time` is either a `long` timestamp, or an `Instant`.

* `time+>year|DateTime|yyyy`
* `time+>month|DateTime|MM`
* `time+>day|DateTime|dd`

The above produces a path like `/year=2023/month=01/day=01/`.
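
In the pipeline file, these could be declared in the sink's `partitions` array. A minimal sketch, assuming each
partition transform is given as a plain string in the operator format shown above:

[source,json]
----
"partitions" : [
  "time+>year|DateTime|yyyy", <1>
  "time+>month|DateTime|MM",
  "time+>day|DateTime|dd"
]
----

<1> Assumes plain-string partition declarations; see xref:pipeline.adoc[] for where the `partitions` array appears.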

== File naming

Workload processes can fail. And when they do, it is important not to overwrite existing files. It is also important to
find the files that were created and written before the failure.

The following metadata can help disambiguate files across processing runs, and also help detect schema changes.

Filename metadata:: `[prefix]-[field-hash]-[guid].parquet`
`prefix`::: The value `part` by default.
`field-hash`::: A hash of the schema (field names and field types), so that schema changes can be detected.
`guid`::: A random UUID or a provided value.

The JSON model for this metadata is:

[source,json]
----
"filename" : {
"prefix" : null, <1>
"includeGuid" : false, <2>
"providedGuid" : null, <3>
"includeFieldsHash" : false <4>
}
----

<1> The prefix to use for the filename. Defaults to `part`.
<2> Whether to include a random UUID in the filename. Defaults to `false`.
<3> A provided UUID to use in the filename. Defaults to using a random UUID.
<4> Whether to include a hash of the schema (field name + type) in the filename. Defaults to `false`.
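
For example, a sketch of a `filename` block with the guid and field hash enabled; the keys come from the model above,
and the values here are illustrative:

[source,json]
----
"filename" : {
  "prefix" : "part", <1>
  "includeGuid" : true,
  "providedGuid" : null,
  "includeFieldsHash" : true
}
----

<1> With these illustrative settings, files would be named like `part-[field-hash]-[guid].parquet`.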
@@ -0,0 +1,101 @@
= Transforms

== Fields

Input and output files/objects (also referred to as sources and sinks) are made up of rows and columns, also called tuples and fields.

A tuple has a set of fields, and a field has an optional xref:types.adoc[type] (and any associated metadata).

Data files, or objects, have paths and names. Field values can be parsed from the paths and embedded in the tuple stream
as fields. This is common when data has been partitioned into files where common values (like month and/or day) can
be embedded in the path name to help select relevant files (push down predicates are applied to path values by many
query engines).

Declared fields in a pipeline have the following format: `<field_name>|<field_type>`, where `<field_name>` is a string,
or an ordinal (number representing the position).

`<field_type>` is optional, depending on the use, and may be further formatted as `<type>|<metadata>`.

The actual supported types and associated metadata are described in xref:types.adoc[].
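
As a sketch, assuming declared fields are listed as plain `<field_name>|<field_type>` strings in a schema's `declared`
array (see xref:pipeline.adoc[]), a declaration might look like:

[source,json]
----
"declared" : [
  "id|String", <1>
  "ratio|Double",
  "time|DateTime|yyyyMMdd"
]
----

<1> Hypothetical field names; the types and metadata shown follow the examples on this page and xref:types.adoc[].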

== Transforms

Transforms manipulate the tuple stream. They are applied to every tuple in the tuple stream.

Insert literal:: Insert a literal value into a field.
Coerce field:: Coerce the type of a field value, in every tuple.
Copy field:: Copy a field value to a new field.
Rename field:: Rename a field, optionally coercing its type.
Discard field:: Remove a field.
Apply function:: Apply intrinsic functions against one or more fields.

=== Operators

There are three transform operators:

`pass:[=>]`:: Assign a literal value to a new field.
Format::: `literal pass:[=>] new_field|type`
`+>`:: Retain the input field, and assign the result value to a new field.
Format::: `field +> new_field|type`
`pass:[->]`:: Discard the input fields, and assign the result value to a new field.
Format::: `field pass:[->] new_field|type`

For example:

- `US pass:[=>] country|String` - assigns the value `US` to the field `country` as a string.
- `0.5 pass:[=>] ratio|Double` - assigns the value `0.5` to the field `ratio` as a double.
- `1689820455 pass:[=>] time|DateTime|yyyyMMdd` - convert the long value to a date time using the format `yyyyMMdd` and assign the result to the field `time`.
- `ratio +> ratio|Double` - Coerces the string field "ratio" to a double, `null` ok.
- `ratio|Double` - Same as above, coerces the string field "ratio" to a double, `null` ok.
- `name +> firstName|String` - assigns the value of the field "name" to the field "firstName" as a string. The field `name` is retained.
- `name pass:[->] firstName|String` - assigns the value of the field "name" to the field "firstName" as a string. The field `name` is discarded (dropped from the tuple stream).
- `password pass:[->]` - discards the field `password` from the tuple stream.
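
As a sketch, in the pipeline file such expressions would be listed in the `transform` array (assuming each transform
is given as a plain string and applied in order):

[source,json]
----
"transform" : [
  "US => country|String", <1>
  "name +> firstName|String",
  "password ->"
]
----

<1> Hypothetical transform list; the expressions reuse the examples above.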

==== Expressions

Expressions are applied to incoming fields and the results are assigned to a new field. Expressions can have zero or
more field arguments.

There are two types of expression:

- functions - combine arguments into new values
- filters - drop tuples from the tuple stream (currently unimplemented)

NOTE: Many more expression types are planned, including native support for regular expressions and JSON paths.

Currently, only `intrinsic` functions are supported. `intrinsic` functions are built-in functions, with optional
parameters.

No arguments:: `^intrinsic{} +> new_field|type`
No arguments, with parameters:: `^intrinsic{param1:value1, param2:value2} +> new_field|type`
With arguments:: `from_field1+from_field2+from_fieldN ^intrinsic{} +> new_field|type`
With arguments, with parameters:: `from_field1+from_field2+from_fieldN ^intrinsic{param1:value1, param2:value2} +> new_field|type`

Expressions may retain or discard the argument fields, depending on the operator used.

== Intrinsic Functions

NOTE: Many more functions are planned.

Built-in functions on fields can be applied to one or more fields in every tuple in the tuple stream.

`tsid`:: Create a unique id as a long or string (using https://github.com/f4b6a3/tsid-creator).
Def:::
`^tsid{node:...,nodeCount:...,epoch:...,format:...,counterToZero:...} +> intoField|type`
`type`:::: must be `string` or `long`, defaults to `long`. When `string`, the `format` is honored.
Params:::
`node`:::: The node id, defaults to a random int.
* If a string is provided, it is hashed to an int.
* `SIP_HASHER.hashString(s, StandardCharsets.UTF_8).asInt() % nodeCount;`
`nodeCount`:::: The number of nodes, defaults to `1024`
`epoch`:::: The epoch, defaults to `Instant.parse("2020-01-01T00:00:00.000Z").toEpochMilli()`
`format`:::: The format, defaults to `null`. Example: `K%S` where `%S` is a placeholder.
Placeholders:::::
- `%S`: canonical string in upper case
- `%s`: canonical string in lower case
- `%X`: hexadecimal in upper case
- `%x`: hexadecimal in lower case
- `%d`: base-10
- `%z`: base-62
`counterToZero`:::: Resets the counter portion when the millisecond changes, defaults to `false`.
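
For example, a sketch of a `tsid` transform that adds a string id field; the parameter values are hypothetical, and
the expression follows the `No arguments, with parameters` format above:

[source,json]
----
"transform" : [
  "^tsid{node:1,format:K%S} +> id|String" <1>
]
----

<1> Hypothetical parameters; `format:K%S` uses the `%S` placeholder described above.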
