update docs and README.md
cwensel committed Nov 15, 2023
1 parent 58e1a62 commit 404bf31
Showing 3 changed files with 116 additions and 186 deletions.
188 changes: 2 additions & 186 deletions README.md
This project is under active development and many features are considered alpha.
Please do play around with this project to provide early feedback, but expect things to change until we reach the
1.0 release.

Documentation can be found at: https://docs.clusterless.io/tessellate/1.0-wip/index.html

All tessellate releases are available via [Homebrew](https://brew.sh).
Tessellate may be used from the command line, but also natively supports the
[Clusterless](https://github.com/ClusterlessHQ/clusterless) workload model.

## Features

### Pipeline definition

Tessellate pipelines are defined in JSON files.

For a copy of a template pipeline JSON file, run:

```shell
tess --print-pipeline > pipeline.json
```

Some command line options are merged at runtime with the pipeline JSON file. Command line options take precedence over
the pipeline JSON file.

Overriding command line options include:

- `--inputs`
- `--input-manifest`
- `--input-manifest-lot`
- `--output`
- `--output-manifest`
- `--output-manifest-lot`

### Supported data formats

- text/regex - lines of text parsed by regex
- csv - with or without headers
- tsv - with or without headers
- [Apache Parquet](https://parquet.apache.org)

Regex support is based on regex groups. Groups are matched by ordinal against the fields declared in the schema.

Provided named formats include:

- AWS S3 Access Logs
- named: `aws-s3-access-log`
- https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html

Usage:

```json
{
"source": {
"schema": {
"name": "aws-s3-access-log"
}
}
}
```

### Supported data locations/protocols

- `file://`
- `s3://`
- `hdfs://`

### Supported path and filename patterns

- Path partitioning - data can be partitioned by intrinsic values in the data set.
- partitioning can be named, e.g. `year=2023/month=01/day=01`, or
- unnamed, e.g. `2023/01/01`
- Filename metadata - `[prefix]-[field-hash]-[guid].parquet`
- `prefix` is `part` by default
- `field-hash` is a hash of the schema: field names, and field types
- `guid` is a random UUID or a provided value
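
As a sketch, these filename elements correspond to a `filename` declaration in the sink (the same keys appear in the template-expression example later in this README):

```json
{
  "filename": {
    "prefix": "access",
    "includeGuid": true,
    "includeFieldsHash": true
  }
}
```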

### Supported operations

#### Transforms

- insert - insert a literal value into a field
- `value => intoField|type`
- coerce - transform a field to a new type
- `field|newType`
- copy - copy a field value to a new field
- `fromField +> toField|type`
- rename - rename a field, optionally coercing its type
- `fromField -> toField|type`
- discard - remove a field
- `field ->`
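
For illustration, the transform notations above can be combined in a single `transform` declaration (the field names here are hypothetical):

```json
{
  "transform": [
    "processed=>status|string",
    "count|long",
    "user +> user_copy|string",
    "src -> origin|string",
    "tmp ->"
  ]
}
```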

#### Intrinsic Functions

- `tsid` - create a unique id as a long or string (using https://github.com/f4b6a3/tsid-creator)
- `^tsid{node:...,nodeCount:...,epoch:...,format:...,counterToZero:...} +> intoField|type`
- `type` must be `string` or `long`, defaults to `long`. When `string`, the `format` is honored.
- Params:
- `node` - the node id, defaults to a random int.
- if a string is provided, it is hashed to an int: `SIP_HASHER.hashString(s, StandardCharsets.UTF_8).asInt() % nodeCount`
- `nodeCount` - the number of nodes, defaults to `1024`
- `epoch` - the epoch, defaults to `Instant.parse("2020-01-01T00:00:00.000Z").toEpochMilli()`
- `format` - the format, defaults to `null`. Example: `K%S` where `%S` is a placeholder.
- Placeholders:
- `%S`: canonical string in upper case
- `%s`: canonical string in lower case
- `%X`: hexadecimal in upper case
- `%x`: hexadecimal in lower case
- `%d`: base-10
- `%z`: base-62
- `counterToZero` - resets the counter portion when the millisecond changes, defaults to `false`
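
As a hedged sketch of the `^tsid` notation above (the parameter values and target field name are illustrative; unspecified parameters fall back to their defaults):

```json
{
  "transform": [
    "^tsid{node:1,counterToZero:true} +> id|long"
  ]
}
```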

### Supported types

- `String`
- `int` - `null` coerced to `0`
- `Integer`
- `long` - `null` coerced to `0`
- `Long`
- `float` - `null` coerced to `0`
- `Float`
- `double` - `null` coerced to `0`
- `Double`
- `boolean` - `null` coerced to `false`
- `Boolean`
- `DateTime|format` - canonical type is `Long`, format defaults to `yyyy-MM-dd HH:mm:ss.SSSSSS z`
- `Instant|format` - canonical type is `java.time.Instant`, supports nanos precision, format defaults to ISO-8601
instant format, e.g. `2011-12-03T10:15:30Z`
- `json` - canonical type is `com.fasterxml.jackson.databind.JsonNode`, supports nested objects and arrays
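
Types appear after the `|` in the transform notations above. As a sketch, coercions over hypothetical fields might look like the following (the inline `DateTime` format suffix is an assumption based on the `DateTime|format` type syntax):

```json
{
  "transform": [
    "amount|Double",
    "updated|DateTime|yyyy-MM-dd HH:mm:ss.SSSSSS z"
  ]
}
```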

## Pipeline Template expressions

In order to embed system properties, environment variables, or other provided intrinsic values, [MVEL
templates](http://mvel.documentnode.com) are supported in the pipeline JSON file.

Provided intrinsic values include:

- `env[...]` - Environment variables
- `sys[...]` - System properties
- `source.*` - Pipeline source properties
- `sink.*` - Pipeline sink properties
- `pid` - `ProcessHandle.current().pid()`
- `rnd32` - `Math.abs(random.nextInt())`, generated once; returns the same value on every reference
- `rnd64` - `Math.abs(random.nextLong())`, generated once; returns the same value on every reference
- `rnd32Next` - `Math.abs(random.nextInt())`, returns a new value on every reference
- `rnd64Next` - `Math.abs(random.nextLong())`, returns a new value on every reference
- `hostAddress` - `localHost.getHostAddress()`
- `hostName` - `localHost.getCanonicalHostName()`
- `currentTimeMillis` - `now.toEpochMilli()`
- `currentTimeISO8601` - `now.toString()` at millis precision
- `currentTimeYear` - `utc.getYear()`
- `currentTimeMonth` - `utc.getMonthValue()` zero padded
- `currentTimeDay` - `utc.getDayOfMonth()` zero padded
- `currentTimeHour` - `utc.getHour()` zero padded
- `currentTimeMinute` - `utc.getMinute()` zero padded
- `currentTimeSecond` - `utc.getSecond()` zero padded

Where:

- `Random random = new Random()`
- `InetAddress localHost = InetAddress.getLocalHost()`
- `Instant now = Instant.now()`
- `ZonedDateTime utc = now.atZone(ZoneId.of("UTC"))`

For example:

- `@{env['USER']}` - resolve an environment variable
- `@{sys['user.name']}` - resolve a system property
- `@{sink.manifestLot}` - resolve a sink property from the pipeline JSON definition

Used in a transform to embed the current `lot` value into the output:

```json
{
"transform": [
"@{source.manifestLot}=>lot|string"
]
}
```

Or create a filename that prevents collisions but simplifies duplicate removal:

```json
{
"filename": {
"prefix": "access",
"includeGuid": true,
"providedGuid": "@{sink.manifestLot}-@{currentTimeMillis}",
"includeFieldsHash": true
}
}
```

This will result in a filename similar to `access-1717f2ea-20230717PT5M250-1689896792672-00000-00000-m-00000.gz`.

## Building

So that the Cascading WIP releases can be retrieved, add the following to `gradle.properties`:
`tess` expects a JSON formatted "pipeline" file that declares the xref:source-sink.adoc[sources, sinks], and
xref:transforms.adoc[transforms] to be run.

Some values in the pipeline file can be overridden by command line options.

== Pipeline Declaration Format

[source,console]
.Print Pipeline Template
----
tess --print-pipeline
----
<11> Compression type
<12> Whether the schema should be embedded in the files (add headers)
<13> Partitions to write out

To view all pipeline options:

[source,console]
.Print Complete Pipeline Template
----
tess --print-pipeline all
----

== Pipeline File Overrides

Some command line options are merged at runtime with the pipeline JSON file. Command line options take precedence over
the pipeline JSON file.

Overriding command line options include:

- `--inputs`
- `--input-manifest`
- `--input-manifest-lot`
- `--output`
- `--output-manifest`
- `--output-manifest-lot`

== Pipeline Template Expressions

In order to embed system properties, environment variables, or other provided intrinsic values,
http://mvel.documentnode.com[MVEL templates] are supported in the pipeline JSON file.

Provided intrinsic values include:

`env[...]`:: Environment variables.
`sys[...]`:: System properties.
`source.*`:: Pipeline source properties.
`sink.*`:: Pipeline sink properties.
`pid`:: `ProcessHandle.current().pid()`.
`rnd32`:: `Math.abs(random.nextInt())`, generated once; returns the same value on every reference.
`rnd64`:: `Math.abs(random.nextLong())`, generated once; returns the same value on every reference.
`rnd32Next`:: `Math.abs(random.nextInt())`, returns a new value on every reference.
`rnd64Next`:: `Math.abs(random.nextLong())`, returns a new value on every reference.
`hostAddress`:: `localHost.getHostAddress()`.
`hostName`:: `localHost.getCanonicalHostName()`.
`currentTimeMillis`:: `now.toEpochMilli()`.
`currentTimeISO8601`:: `now.toString()` at millis precision.
`currentTimeYear`:: `utc.getYear()`.
`currentTimeMonth`:: `utc.getMonthValue()` zero padded.
`currentTimeDay`:: `utc.getDayOfMonth()` zero padded.
`currentTimeHour`:: `utc.getHour()` zero padded.
`currentTimeMinute`:: `utc.getMinute()` zero padded.
`currentTimeSecond`:: `utc.getSecond()` zero padded.

Where:

- `Random random = new Random()`
- `InetAddress localHost = InetAddress.getLocalHost()`
- `Instant now = Instant.now()`
- `ZonedDateTime utc = now.atZone(ZoneId.of("UTC"))`

For example:

- `@{env['USER']}` - resolve an environment variable
- `@{sys['user.name']}` - resolve a system property
- `@{sink.manifestLot}` - resolve a sink property from the pipeline JSON definition

Used in a transform to embed the current `lot` value into the output:

[source,json]
----
{
"transform": [
"@{source.manifestLot}=>lot|string"
]
}
----

Or create a filename that prevents collisions but simplifies duplicate removal:

[source,json]
----
{
"filename": {
"prefix": "access",
"includeGuid": true,
"providedGuid": "@{sink.manifestLot}-@{currentTimeMillis}",
"includeFieldsHash": true
}
}
----

This will result in a filename similar to `access-1717f2ea-20230717PT5M250-1689896792672-00000-00000-m-00000.gz`.
csv:: With or without headers.
tsv:: With or without headers.
parquet:: https://parquet.apache.org[Apache Parquet]

Regex support is based on regex groups. Groups are matched by ordinal against the fields declared in the schema.

Provided named formats include:

AWS S3 Access Logs::
- named: `aws-s3-access-log`
- https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html

Usage:

[source,json]
----
{
"source": {
"schema": {
"name": "aws-s3-access-log"
}
}
}
----

== Protocols

Every source and sink supports its own set of protocols.
