
bellboy

Highly performant JavaScript data stream ETL engine.

How does it work?

Bellboy streams input data row by row. Each row, in turn, goes through a user-defined function where it can be transformed. Once enough rows have been collected into a batch, the batch is loaded to the destination.
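
The flow described above can be pictured with a small plain-JavaScript sketch. It does not use bellboy and is not meant to show how the library is implemented internally; `source`, `runPipeline`, `transform`, `load` and `batchSize` are names made up for this illustration.

```javascript
// Illustration only: a hand-rolled version of "stream rows, transform each
// one, load in batches". None of these names come from bellboy.
async function* source() {
    for (let i = 1; i <= 5; i++) {
        yield { id: i };
    }
}

async function runPipeline(rows, transform, load, batchSize) {
    let batch = [];
    for await (const row of rows) {
        batch.push(await transform(row));
        if (batch.length === batchSize) {
            await load(batch);
            batch = [];
        }
    }
    // load whatever is left once the input is exhausted
    if (batch.length > 0) {
        await load(batch);
    }
}

const loaded = [];
const done = runPipeline(
    source(),
    async (row) => ({ ...row, status: 'done' }),
    async (batch) => { loaded.push(batch); },
    2
);
```

With five rows and a batch size of 2, `load` is called three times in this sketch: twice with a full batch and once with the remaining row.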

Installation

Before installing, make sure you are using the latest version of Node.js.

npm install bellboy

Example

This example shows how bellboy can extract rows from Excel files, modify them on the fly, load them into a Postgres database, move each processed file to another folder and then process the remaining files.

All in five simple steps.

const bellboy = require('bellboy');
const fs = require('fs');
const path = require('path');

(async () => {
    const srcPath = `C:/source`;
    // 1. create a processor which will process 
    // Excel files one by one in the folder 
    const processor = new bellboy.ExcelProcessor({
        path: srcPath,
        hasHeader: true,
    });
    // 2. create a destination which will add a new 'status' 
    // field to each row and load processed data to Postgres database
    const destination = new bellboy.PostgresDestination({
        connection: {
            user: 'user',
            password: 'password',
            server: 'localhost',
            database: 'bellboy',
        },
        table: 'stats',
        recordGenerator: async function* (record) {
            yield {
                ...record.raw.obj,
                status: 'done',
            };
        }
    });
    // 3. create a job which will glue processor and destination together
    const job = new bellboy.Job(processor, [destination]);
    // 4. tell bellboy to move file away as soon as it was processed
    job.on('processedFile', async (file) => {
        const filePath = path.join(srcPath, file);
        const newFilePath = path.join(`./destination`, file);
        await fs.promises.rename(filePath, newFilePath);
    });
    // 5. run your job
    await job.run();
})();

Jobs

A job in bellboy links a processor to one or more destinations. Running the job starts data processing and loading.

Initialization

To initialize a Job instance, pass a processor and one or more destinations.

const job = new bellboy.Job(processor_instance, [destination_instance], job_options); // job_options is optional and defaults to {}

Instance methods

  • run async function()
    Starts processing data.
  • on function(event, async function listener)
    Intercepts the specified event and pauses processing until the listener function has finished executing.
    If the listener returns a truthy value, processing will be stopped.
// move file to the new location when processedFile event is fired
// (assumes path is required and srcPath is defined, as in the example above)
const { rename } = require('fs').promises;

job.on('processedFile', async (file) => {
    const filePath = path.join(srcPath, file);
    const newFilePath = path.join(`./destination`, file);
    await rename(filePath, newFilePath);
});
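
The contract described for on (the job waits for the listener, and a truthy return value stops processing) can be illustrated without bellboy. `TinyJob` below is a hypothetical stand-in written for this sketch; it only mimics the described behaviour and is not bellboy's implementation.

```javascript
// Hypothetical stand-in for a job, written only to illustrate the
// "await the listener, stop on a truthy result" contract.
class TinyJob {
    constructor(rows) {
        this.rows = rows;
        this.listeners = {};
    }
    on(event, listener) {
        (this.listeners[event] = this.listeners[event] || []).push(listener);
    }
    async emit(event, ...args) {
        let stop = false;
        for (const listener of this.listeners[event] || []) {
            // processing is paused here until the listener settles
            if (await listener(...args)) {
                stop = true;
            }
        }
        return stop;
    }
    async run() {
        const processed = [];
        for (const row of this.rows) {
            if (await this.emit('startProcessingRow', row)) {
                break; // a listener asked to stop
            }
            processed.push(row);
        }
        return processed;
    }
}

const tinyJob = new TinyJob([1, 2, 3, 4]);
// stop as soon as a row greater than 2 arrives
tinyJob.on('startProcessingRow', async (row) => row > 2);
const result = tinyJob.run();
```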

Events

The following table lists the job life-cycle events and parameters they emit.

| Event | Parameters | Description |
| --- | --- | --- |
| startProcessing | processorInstance, destinationInstance[] | Job has started execution. |
| startProcessingRow | data | Received row is about to be processed. |
| rowAddedToBatch | destinationIndex, data | Received row has been added to the destination batch (whether as is or by the recordGenerator function). |
| rowProcessingError | destinationIndex, error | Received row processing has failed. |
| endProcessingRow | | Received row has been processed. |
| transformingBatch | destinationIndex, data | Batch is about to be transformed (before calling the batchTransformer function). |
| transformingBatchError | destinationIndex, error | Batch transformation has failed (the batchTransformer function has thrown an error). |
| transformedBatch | destinationIndex, data | Batch has been transformed (after calling the batchTransformer function). |
| loadingBatch | destinationIndex, data | Batch is about to be loaded into the destination. |
| loadingBatchError | destinationIndex, error | Batch load has failed. |
| loadedBatch | destinationIndex | Batch load has finished. |
| endProcessing | | Job has finished execution. |
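
As a sketch of how the error events listed above might be used, the helper below registers one logging listener per error event. `attachErrorLogging` and its `log` parameter are made up for this example, and it assumes that the emitted `error` is an Error-like object with a `message` property. The helper only needs an object with an `on(event, listener)` method, so a job instance should fit.

```javascript
// Hypothetical helper: logs every error event together with the index of
// the destination it came from. The listeners return nothing (a falsy
// value), so they do not ask the job to stop.
function attachErrorLogging(job, log = console.error) {
    const errorEvents = [
        'rowProcessingError',
        'transformingBatchError',
        'loadingBatchError',
    ];
    for (const event of errorEvents) {
        job.on(event, async (destinationIndex, error) => {
            log(`${event} in destination #${destinationIndex}: ${error.message}`);
        });
    }
}

// Usage, assuming `job` was created as shown in the Initialization section:
// attachErrorLogging(job);
```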

Processors

Each processor in bellboy is a class with a single responsibility: processing data of a specific type.

MqttProcessor

Usage examples

Listens for messages and processes them one by one. It also handles backpressure by queuing messages, so that all messages are eventually processed.

Options

  • url string required
  • topics string[] required

HttpProcessor

Usage examples

Processes data received from an HTTP call. Can process JSON as well as delimited data. Can handle pagination by using the nextRequest function.

Options

  • connection object required
    Options from request library.
  • dataFormat delimited | json required
  • delimiter string required for delimited
  • jsonPath string required for json
    Only values that match provided JSONPath will be processed.
  • nextRequest async function(header)
    Function which must return the connection for the next request, or null if no further request is needed. If the data format is json, it receives a header parameter which contains the data before the first jsonPath match.
const processor = new bellboy.HttpProcessor({
    // gets next connection from the header until last page is reached
    nextRequest: async function (header) {
        if (header) {
            const pagination = header.pagination;
            if (pagination.total_pages > pagination.current_page) {
                return {
                    ...connection,
                    url: `${url}&current_page=${pagination.current_page + 1}`
                };
            }
        }
        return null;
    },
    // ...
});

Directory processors

Directory processors stream text data from files in a directory. There are currently four types of directory processors: ExcelProcessor, JsonProcessor, DelimitedProcessor and TailProcessor. They search for files in the source directory and process them one by one.

Options

  • path string required
    Path to the directory where files are located.
  • filePattern string
    Regex pattern for the files to be processed. If not specified, all files in the directory will be matched.
  • files string[]
    Array of file names. If not specified, all files in the directory will be matched against filePattern regex and processed in alphabetical order.
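
How files and filePattern interact, as described above, can be sketched as a plain function. `selectFiles` is a hypothetical helper, not part of bellboy; using an explicit `files` list as given and sorting with the default `Array.prototype.sort` order are assumptions of this sketch.

```javascript
// Hypothetical helper illustrating the documented selection rules.
function selectFiles(directoryListing, options = {}) {
    if (options.files) {
        // assumption: an explicit list is used as given
        return options.files;
    }
    const pattern = options.filePattern ? new RegExp(options.filePattern) : null;
    return directoryListing
        .filter((name) => !pattern || pattern.test(name))
        .sort();
}

const listing = ['b.xlsx', 'notes.txt', 'a.xlsx'];
console.log(selectFiles(listing, { filePattern: '\\.xlsx$' })); // [ 'a.xlsx', 'b.xlsx' ]
console.log(selectFiles(listing, { files: ['b.xlsx'] }));       // [ 'b.xlsx' ]
```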

Events

  • processingFile (file, filePath)
    Emitted when file is about to be processed.
  • processedFile (file, filePath)
    Emitted after file has been processed.

ExcelProcessor

Usage examples

Processes XLSX files in the directory.

Options

  • Directory processor options
  • hasHeader boolean
    Whether the worksheet has a header or not, false by default.
  • sheetName string
  • sheetIndex number
    Starts from 0.
  • sheetGetter async function(sheets)
    Function which receives an array of sheets as a parameter and must return the name of the required sheet.
const processor = new bellboy.ExcelProcessor({
    // returns last sheet name
    sheetGetter: async (sheets) => {
        return sheets[sheets.length - 1];
    },
    // ...
});

If no sheetName is specified, the value of sheetIndex will be used. If it isn't specified either, the sheetGetter function will be called. If none of these options are specified, the first sheet will be processed.
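
The precedence described above (sheetName, then sheetIndex, then sheetGetter, then the first sheet) can be written down as a short function. `resolveSheetName` is a hypothetical name used only for this sketch, and `sheets` is assumed to be an array of sheet names, as in the sheetGetter example.

```javascript
// Hypothetical helper mirroring the documented order of precedence.
async function resolveSheetName(sheets, options = {}) {
    if (options.sheetName) {
        return options.sheetName;
    }
    if (options.sheetIndex !== undefined) {
        return sheets[options.sheetIndex]; // index starts from 0
    }
    if (options.sheetGetter) {
        return options.sheetGetter(sheets);
    }
    return sheets[0];
}

const sheets = ['Jan', 'Feb', 'Mar'];
// sheetIndex takes precedence over sheetGetter here
resolveSheetName(sheets, {
    sheetIndex: 0,
    sheetGetter: async (all) => all[all.length - 1],
}).then((name) => console.log(name));
```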

Produced row

To see what a processed row looks like, refer to the documentation of the xlstream library, which is used for Excel processing.

JsonProcessor

Processes JSON files in the directory.

Options

DelimitedProcessor

Usage examples

Processes files with delimited data in the directory.

Options

TailProcessor

Usage examples

Watches for file changes and outputs new lines as soon as they are added to the file.

Options

  • Directory processor options
  • fromBeginning boolean
    In addition to emitting new lines, emits lines from the beginning of the file, false by default.

Produced row

  • file string
    Name of the file the data came from.
  • data string
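
As a sketch of how rows of this shape might be handled, the generator below could serve as a destination's recordGenerator. The "LEVEL message" line format and the name `parseLogLine` are assumptions made for this example; the source does not say what happens to a row for which nothing is yielded.

```javascript
// Hypothetical recordGenerator for rows shaped like { file, data }.
// It assumes each line looks like "LEVEL some message".
async function* parseLogLine(row) {
    const match = /^(\w+)\s+(.*)$/.exec(row.data);
    if (!match) {
        return; // this generator yields nothing for lines that do not match
    }
    yield { file: row.file, level: match[1], message: match[2] };
}
```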

Database processors

Processes a SELECT query row by row. There are two database processors - PostgresProcessor (usage examples) and MssqlProcessor (usage examples). Both have the same options.

Options

  • query string required
    Query to execute.
  • connection object required
    • user
    • password
    • server
    • host
    • database
    • schema
      Currently available only for PostgresProcessor.

DynamicProcessor

Processor which generates records on the fly. Can be used to define custom data processors.

Options

  • generator async generator function required
    Generator function which must yield records to process.
// processor which generates 10 records dynamically
const processor = new bellboy.DynamicProcessor({
    generator: async function* () {
        for (let i = 0; i < 10; i++) {
            yield i;
        }
    },
});

Destinations

Every job can have as many destinations (outputs) as needed. For example, one job can load processed data into a database, log this data to stdout and post it by HTTP simultaneously.

Options

  • batchSize number
    Number of records to be processed before loading them to the destination. If it is not specified or 0 is passed, all records will be processed.
  • recordGenerator async generator function(row)
    Function which receives the row produced by the processor and can apply transformations to it.
  • batchTransformer async function(rows)
    Function which receives the whole batch of rows. It is called once the row count reaches batchSize. Data is loaded to the destination immediately after this function has been executed.
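
A minimal sketch of these three options together is shown below. The row shape (`id`, `tags`) is made up for the example, and the sketch assumes that batchTransformer returns the transformed batch, which the text above does not state explicitly.

```javascript
// Hypothetical general destination options.
const destinationOptions = {
    batchSize: 2,
    // one incoming row may yield zero, one or several records
    recordGenerator: async function* (row) {
        for (const tag of row.tags) {
            yield { id: row.id, tag };
        }
    },
    // receives the whole batch before it is loaded
    // (assumption: the returned array is what gets loaded)
    batchTransformer: async function (rows) {
        return rows.map((r) => ({ ...r, tag: r.tag.toUpperCase() }));
    },
};
```

An object like this would be combined with destination-specific options (such as table and connection) and passed to a destination constructor, for example the PostgresDestination shown in the Example section.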

StdoutDestination

Logs all data to stdout (console).

Options

HttpDestination

Usage examples

Puts processed records one by one into the request body and executes the specified HTTP request.

Options

PostgresDestination

Usage examples

Inserts data into PostgreSQL.

Options

  • General destination options
  • table string required
    Table name.
  • upsertConstraints string[]
    If specified, UPSERT command will be executed based on provided constraints.
  • connection object required
    • user
    • password
    • server
    • host
    • database
    • schema

MssqlDestination

Usage examples

Inserts data into MSSQL.

Options

  • General destination options
  • table string required
    Table name.
  • upsertConstraints string[]
    If specified, UPSERT command will be executed based on provided constraints.
  • connection object required
    • user
    • password
    • server
    • host
    • database

Extendability

New processors and destinations can be made by extending existing ones. Feel free to make a pull request if you create something interesting.

Creating a new processor

Processor class examples

To create a new processor, extend the Processor class and implement an async process function. This function accepts two parameters:

  • processStream async function(readStream) required
    Callback function which accepts a Readable stream. After this function is called, the job instance handles the passed stream internally.
  • emit async function(event, ...arguments)
    Callback function which accepts an event name and custom arguments. Such events can then be caught with the job's on function.
class CustomProcessor extends bellboy.Processor {
    async process(processStream, emit) {
        // await processStream(readStream);
        // await emit('customEvent', 'hello', 'world');
    }
}
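
A slightly fuller sketch of the same idea is given below: a processor that turns an in-memory array into a stream. So that it runs on its own, it extends a local stand-in `Processor` class instead of bellboy.Processor, whose constructor arguments are not covered here. `ArrayProcessor` and the `loadedArray` event are names made up for this example, and it is an assumption that an object-mode stream is acceptable to processStream.

```javascript
const { Readable } = require('stream');

// Stand-in base class so the sketch runs without bellboy installed.
// In real code this would be bellboy.Processor.
class Processor {}

class ArrayProcessor extends Processor {
    constructor(items) {
        super();
        this.items = items;
    }
    async process(processStream, emit) {
        // custom event; could be caught with job.on('loadedArray', ...)
        await emit('loadedArray', this.items.length);
        // Readable.from builds an object-mode stream from an iterable
        await processStream(Readable.from(this.items));
    }
}
```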

Creating a new destination

Destination class examples

To create a new destination, extend the Destination class and implement an async loadBatch function. This function accepts one parameter:

  • data any[] required
    Array of processed data that needs to be loaded.
class CustomDestination extends bellboy.Destination {
    async loadBatch(data) {
        console.log(data);
    }
}
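
As a further sketch, a destination could append each batch to a file as newline-delimited JSON. The example extends a local stand-in `Destination` class so that it runs without bellboy; in real code the base class would be bellboy.Destination, and how its constructor receives the general destination options is not shown here. `JsonLinesDestination` and `filePath` are hypothetical names.

```javascript
const fs = require('fs');

// Stand-in base class so the sketch runs without bellboy installed.
class Destination {}

class JsonLinesDestination extends Destination {
    constructor(filePath) {
        super();
        this.filePath = filePath;
    }
    async loadBatch(data) {
        if (data.length === 0) {
            return; // nothing to write
        }
        // one JSON document per line
        const lines = data.map((record) => JSON.stringify(record)).join('\n');
        await fs.promises.appendFile(this.filePath, lines + '\n');
    }
}
```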

Testing

Tests can be run with the docker-compose up --abort-on-container-exit --exit-code-from test command.
