initial antora documentation
cwensel committed Nov 14, 2023
1 parent ea836c5 commit fc07b7f
Showing 10 changed files with 402 additions and 0 deletions.
7 changes: 7 additions & 0 deletions tessellate-main/src/main/antora/antora.yml
@@ -0,0 +1,7 @@
name: tessellate
title: Tessellate
version: 1.0-wip
start_page: ROOT:index.adoc
nav:
- modules/ROOT/nav.adoc
- modules/reference/nav.adoc
2 changes: 2 additions & 0 deletions tessellate-main/src/main/antora/modules/ROOT/nav.adoc
@@ -0,0 +1,2 @@
* xref:install.adoc[]
* xref:support.adoc[]
57 changes: 57 additions & 0 deletions tessellate-main/src/main/antora/modules/ROOT/pages/index.adoc
@@ -0,0 +1,57 @@
= Tessellate

Tessellate is a command line tool for reading and writing data to/from multiple locations and across multiple formats.

This project is under active development and many features are considered alpha.

== About

A primary activity of any data-engineering effort is to format and organize data for different access patterns.

For example, logs frequently arrive as lines of text, but are often best consumed as structured data. And different
stakeholders may have different needs for the log data, so it must be organized (partitioned) in different ways that
support those needs.

Tessellate was designed to support data engineers and data scientists in their efforts to automate the management of
data for use by different platforms and tools.

== Overview

Tessellate is a command line interface (CLI).

[source,console]
.Show Help
----
tess -h
----

It expects a simple xref:reference:pipeline.adoc[JSON formatted pipeline] file to
xref:reference:source-sink.adoc[declare sources, sinks] and any xref:reference:transforms.adoc[transforms] to perform.

It can read and write files locally, on HDFS, or in AWS S3.

It also supports most text formats and Apache Parquet natively.

And during writing, it will efficiently partition data into different paths based on input or derived data available
in the pipeline.

== Use

Tessellate may be used from the command line, or in a container.

It also natively supports the https://github.com/ClusterlessHQ/clusterless[Clusterless] workload model.

Other uses include:

- Data inspection from a terminal
- Host log processing (push rotated logs to HDFS or the cloud)
- Processing data as it arrives in the cloud (for example, on AWS Fargate or ECS via AWS Batch)
- As a serverless function (like AWS Lambda); we plan to publish artifacts to Maven for inclusion in Lambda functions

== Cascading

Tessellate uses https://cascading.wensel.net/[Cascading] under the hood for all of its processing.

Historically, Cascading has been used to run large and complex Apache Hadoop and Tez applications, but it also supports
local execution without any Hadoop runtime dependencies. This makes it well suited to local processing, or to running
in a cloud environment on AWS ECS/Fargate or AWS Lambda.
14 changes: 14 additions & 0 deletions tessellate-main/src/main/antora/modules/ROOT/pages/install.adoc
@@ -0,0 +1,14 @@
= Installation

All Tessellate releases are available via https://brew.sh[Homebrew]:

[source,console]
----
brew tap clusterlesshq/tap
brew install tessellate
tess --version
----

Or, you can download the latest releases from GitHub:

- https://github.com/ClusterlessHQ/tessellate/releases
13 changes: 13 additions & 0 deletions tessellate-main/src/main/antora/modules/ROOT/pages/support.adoc
@@ -0,0 +1,13 @@
= Getting Help

We use a discussion board to support users.

- https://github.com/orgs/ClusterlessHQ/discussions

If you find bugs, or would like to suggest features or additional documentation, please use the discussion board; we
can then convert the suggestion into an issue for the appropriate project.

== Support

For ongoing corporate support, reach out to:

- https://chris.wensel.net/
5 changes: 5 additions & 0 deletions tessellate-main/src/main/antora/modules/reference/nav.adoc
@@ -0,0 +1,5 @@
.Reference
* xref:pipeline.adoc[]
* xref:source-sink.adoc[]
* xref:transforms.adoc[]
* xref:types.adoc[]
@@ -0,0 +1,51 @@
= Pipeline

`tess` expects a JSON formatted "pipeline" file that declares the xref:source-sink.adoc[sources, sinks], and
xref:transforms.adoc[transforms] to be run.

[source,console]
.Print Pipeline Template
----
tess --print-pipeline
----

[source,json]
----
{
"source" : {
"inputs" : [ ], <1>
"schema" : {
"declared" : [ ], <2>
"format" : null, <3>
"compression" : "none", <4>
"embedsSchema" : false <5>
},
"partitions" : [ ] <6>
},
"transform" : [ ], <7>
"sink" : {
"output" : null, <8>
"schema" : {
"declared" : [ ], <9>
"format" : null, <10>
"compression" : "none", <11>
"embedsSchema" : false <12>
},
"partitions" : [ ] <13>
}
}
----

<1> URLs to read from, required
<2> Schema fields to declare; required if the schema is not embedded, or if type information should be declared
<3> Format type
<4> Compression type
<5> Whether the schema is embedded in the files (has headers)
<6> Partitions to parse into fields
<7> Transforms to apply to the data
<8> URL to write to, required
<9> Schema fields to declare, by default all fields are written
<10> Format type
<11> Compression type
<12> Whether the schema should be embedded in the files (add headers)
<13> Partitions to write out
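
For illustration, a filled-in pipeline might look like the following, reading headerless CSV from AWS S3 and writing
partitioned Parquet. The structure follows the template above; the paths, field names, and values are hypothetical,
and the example assumes that declared fields, transforms, and partitions are given as plain strings in the formats
described in xref:source-sink.adoc[] and xref:transforms.adoc[].

[source,json]
.Example Pipeline (illustrative)
----
{
  "source" : {
    "inputs" : [ "s3://example-bucket/logs/2023/01/01/access.csv" ], <1>
    "schema" : {
      "declared" : [ "time|DateTime|yyyyMMdd", "status|String", "path|String" ],
      "format" : "csv",
      "compression" : "none",
      "embedsSchema" : false
    },
    "partitions" : [ ]
  },
  "transform" : [ "time+>year|DateTime|yyyy" ],
  "sink" : {
    "output" : "s3://example-bucket/curated/access/", <2>
    "schema" : {
      "declared" : [ ],
      "format" : "parquet",
      "compression" : "none",
      "embedsSchema" : false
    },
    "partitions" : [ "year" ]
  }
}
----

<1> Hypothetical input URL, field names, and values; only the keys come from the template above.
<2> Hypothetical output URL; `year` is partitioned using the simple form described in xref:source-sink.adoc[].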
@@ -0,0 +1,107 @@
= Source and Sink

Pipelines read data from sources and write data to sinks.

Each source or sink has a format and protocol.

The data may also be partitioned by values in the data set, allowing data from new data sets to be interleaved into
existing data sets.

This can significantly improve the performance of queries on the data.

== Formats

Every source and sink supports its own set of formats.

[source,console]
----
tess --show-source=formats
----

text/regex:: Lines of text parsed by regex (like Apache or S3 log files).
csv:: With or without headers.
tsv:: With or without headers.
parquet:: https://parquet.apache.org[Apache Parquet]

== Protocols

Every source and sink supports its own set of protocols.

[source,console]
----
tess --show-source=protocols
----

`file://`:: Read/write local files.
`s3://`:: Read/write files in AWS S3.
`hdfs://`:: Read/write files on the Apache Hadoop HDFS filesystem.

== Compression

Every source and sink supports its own set of compression formats.

[source,console]
----
tess --show-source=compression
----

Some common formats supported are:

* none
* gzip
* lz4
* bzip2
* brotli
* snappy

== Partitioning

Partitioning can be performed on values read from the data or created in the pipeline.

Path partitioning:: Data can be partitioned by intrinsic values in the data set.
named partitions::: e.g. `year=2023/month=01/day=01`, or
unnamed partitions::: e.g. `2023/01/01`

Partitions, when declared in the pipeline file, can be simple, or represent a transform.

Simple:: `<field_name>` becomes `/<field_name>=<field_value>/`
Transform:: `<field_name>+><partition_name>|<field_type>` becomes `/<partition_name>=<transformed_value>/`

Note the `+>` operator.

Consider the following example, where `time` is either a `long` timestamp, or an `Instant`.

* `time+>year|DateTime|yyyy`
* `time+>month|DateTime|MM`
* `time+>day|DateTime|dd`

The above produces a path like `/year=2023/month=01/day=01/`.
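
In the pipeline file, these could be declared in the sink's `partitions` array. A minimal sketch, assuming each
partition transform is given as a plain string in the operator format shown above:

[source,json]
----
"partitions" : [
  "time+>year|DateTime|yyyy", <1>
  "time+>month|DateTime|MM",
  "time+>day|DateTime|dd"
]
----

<1> Assumes plain-string partition declarations; see xref:pipeline.adoc[] for where the `partitions` array appears.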

== File naming

Workload processes can fail. And when they do, it is important not to overwrite existing files. It is also important to
find the files that were created and written before the failure.

The following metadata can help disambiguate files across processing runs, and also help detect schema changes.

Filename metadata:: `[prefix]-[field-hash]-[guid].parquet`
`prefix`::: The value `part` by default.
`field-hash`::: A hash of the schema (field names and field types), so that schema changes can be detected.
`guid`::: A random UUID or a provided value.

The JSON model for this metadata is:

[source,json]
----
"filename" : {
"prefix" : null, <1>
"includeGuid" : false, <2>
"providedGuid" : null, <3>
"includeFieldsHash" : false <4>
}
----

<1> The prefix to use for the filename. Defaults to `part`.
<2> Whether to include a random UUID in the filename. Defaults to `false`.
<3> A provided UUID to use in the filename. Defaults to using a random UUID.
<4> Whether to include a hash of the schema (field name + type) in the filename. Defaults to `false`.
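
For example, a sketch of a `filename` block with the guid and field hash enabled; the keys come from the model above,
and the values here are illustrative:

[source,json]
----
"filename" : {
  "prefix" : "part", <1>
  "includeGuid" : true,
  "providedGuid" : null,
  "includeFieldsHash" : true
}
----

<1> With these illustrative settings, files would be named like `part-[field-hash]-[guid].parquet`.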
@@ -0,0 +1,101 @@
= Transforms

== Fields

Input and output files/objects (also referred to as sources and sinks) are made up of rows and columns, also called tuples and fields.

A tuple has a set of fields, and a field has an optional xref:types.adoc[type] (and any associated metadata).

Data files, or objects, have paths and names. Field values can be parsed from the paths and embedded in the tuple stream
as fields. This is common when data has been partitioned into files where common values (like month and/or day) can
be embedded in the path name to help select relevant files (push down predicates are applied to path values by many
query engines).

Declared fields in a pipeline have the following format: `<field_name>|<field_type>`, where `<field_name>` is a string,
or an ordinal (number representing the position).

`<field_type>` is optional, depending on the use, and may be further formatted as `<type>|<metadata>`.

The actual supported types and associated metadata are described in xref:types.adoc[].
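
As a sketch, assuming declared fields are listed as plain `<field_name>|<field_type>` strings in a schema's `declared`
array (see xref:pipeline.adoc[]), a declaration might look like:

[source,json]
----
"declared" : [
  "id|String", <1>
  "ratio|Double",
  "time|DateTime|yyyyMMdd"
]
----

<1> Hypothetical field names; the types and metadata shown follow the examples on this page and xref:types.adoc[].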

== Transforms

Transforms manipulate the tuple stream. They are applied to every tuple in the tuple stream.

Insert literal:: Insert a literal value into a field.
Coerce field:: Coerce the type of a field value, in every tuple.
Copy field:: Copy a field value to a new field.
Rename field:: Rename a field, optionally coercing its type.
Discard field:: Remove a field.
Apply function:: Apply intrinsic functions against one or more fields.

=== Operators

There are three transform operators:

`pass:[=>]`:: Assign a literal value to a new field.
Format::: `literal pass:[=>] new_field|type`
`+>`:: Retain the input field, and assign the result value to a new field.
Format::: `field +> new_field|type`
`pass:[->]`:: Discard the input fields, and assign the result value to a new field.
Format::: `field pass:[->] new_field|type`

For example:

- `US pass:[=>] country|String` - assigns the value `US` to the field `country` as a string.
- `0.5 pass:[=>] ratio|Double` - assigns the value `0.5` to the field `ratio` as a double.
- `1689820455 pass:[=>] time|DateTime|yyyyMMdd` - convert the long value to a date time using the format `yyyyMMdd` and assign the result to the field `time`.
- `ratio +> ratio|Double` - Coerces the string field "ratio" to a double, `null` ok.
- `ratio|Double` - Same as above, coerces the string field "ratio" to a double, `null` ok.
- `name +> firstName|String` - assigns the value of the field "name" to the field "firstName" as a string. The field `name` is retained.
- `name pass:[->] firstName|String` - assigns the value of the field "name" to the field "firstName" as a string. The field `name` is discarded (dropped from the tuple stream).
- `password pass:[->]` - discards the field `password` from the tuple stream.
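
As a sketch, in the pipeline file such expressions would be listed in the `transform` array (assuming each transform
is given as a plain string and applied in order):

[source,json]
----
"transform" : [
  "US => country|String", <1>
  "name +> firstName|String",
  "password ->"
]
----

<1> Hypothetical transform list; the expressions reuse the examples above.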

==== Expressions

Expressions are applied to incoming fields and the results are assigned to a new field. Expressions can have zero or
more field arguments.

There are two types of expression:

- functions - combine arguments into new values
- filters - drop tuples from the tuple stream (currently unimplemented)

NOTE: Many more expression types are planned, including native support for regular expressions and JSON paths.

Currently, only `intrinsic` functions are supported. `intrinsic` functions are built-in functions, with optional
parameters.

No arguments:: `^intrinsic{} +> new_field|type`
No arguments, with parameters:: `^intrinsic{param1:value1, param2:value2} +> new_field|type`
With arguments:: `from_field1+from_field2+from_fieldN ^intrinsic{} +> new_field|type`
With arguments, with parameters:: `from_field1+from_field2+from_fieldN ^intrinsic{param1:value1, param2:value2} +> new_field|type`

Expressions may retain or discard the argument fields, depending on the operator used.

== Intrinsic Functions

NOTE: Many more functions are planned.

Built-in functions on fields can be applied to one or more fields in every tuple in the tuple stream.

`tsid`:: Create a unique id as a long or string (using https://github.com/f4b6a3/tsid-creator).
Def:::
`^tsid{node:...,nodeCount:...,epoch:...,format:...,counterToZero:...} +> intoField|type`
`type`:::: must be `string` or `long`, defaults to `long`. When `string`, the `format` is honored.
Params:::
`node`:::: The node id, defaults to a random int.
* If a string is provided, it is hashed to an int.
* `SIP_HASHER.hashString(s, StandardCharsets.UTF_8).asInt() % nodeCount;`
`nodeCount`:::: The number of nodes, defaults to `1024`
`epoch`:::: The epoch, defaults to `Instant.parse("2020-01-01T00:00:00.000Z").toEpochMilli()`
`format`:::: The format, defaults to `null`. Example: `K%S` where `%S` is a placeholder.
Placeholders:::::
- `%S`: canonical string in upper case
- `%s`: canonical string in lower case
- `%X`: hexadecimal in upper case
- `%x`: hexadecimal in lower case
- `%d`: base-10
- `%z`: base-62
`counterToZero`:::: Resets the counter portion when the millisecond changes, defaults to `false`.
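
For example, a sketch of a `tsid` transform that adds a string id field; the parameter values are hypothetical, and
the expression follows the `No arguments, with parameters` format above:

[source,json]
----
"transform" : [
  "^tsid{node:1,format:K%S} +> id|String" <1>
]
----

<1> Hypothetical parameters; `format:K%S` uses the `%S` placeholder described above.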
