
Discovery Node: Architecture


Background

An Audius Discovery Node is a service that indexes the contents of the Audius contracts on the Ethereum blockchain, for Audius users to query. The indexed content includes user, track, and album/playlist information along with social features. The data is stored for quick access, updated on a regular interval, and made available for clients via a RESTful API.

GitHub repository

NOTE - the discovery node was previously also referred to as the "discovery provider". Both names refer to the same service, the discovery node.

Design Goals

  1. Expose a queryable endpoint which listeners/creators can interact with
  2. Reliably store relevant blockchain events
  3. Continuously monitor the blockchain and ensure stored data is up to date with the network

Component Details

db

Our Postgres database is managed through SQLAlchemy, an object relational mapper, and Alembic, a lightweight database migration tool. The data models are defined in src/models.py, which Alembic uses to automatically generate the migrations under alembic/versions. The connection between Alembic and our data models is defined in alembic/env.py.
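As an illustration of the pattern (this is not the actual Audius schema), a model in src/models.py might look like the sketch below; Alembic compares declarations like this against the live database to autogenerate migration scripts.

```python
# Illustrative only -- not the real Audius schema.
from sqlalchemy import Boolean, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Track(Base):
    __tablename__ = "tracks"

    track_id = Column(Integer, primary_key=True)
    owner_id = Column(Integer, nullable=False, index=True)
    title = Column(String, nullable=True)
    is_current = Column(Boolean, nullable=False, default=True)
```

Running `alembic revision --autogenerate` then diffs the models against the database and writes the resulting migration script under alembic/versions.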

flask

The discovery node web server serves as the entrypoint for reading data through the audius protocol. All queries are returned as JSON objects parsed from SQLAlchemy query results and can be found in src/queries. Some examples of queries include user-specific feeds, track data, playlist data, etc.
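A minimal sketch of that read path, assuming a hypothetical /tracks endpoint, a DATABASE_URL environment variable, and the illustrative Track model from the db section above (the real handlers in src/queries are considerably richer):

```python
# Hypothetical read-only endpoint; the real query handlers live in src/queries.
import os
from flask import Flask, jsonify, request
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(os.environ["DATABASE_URL"])  # assumed env var for the Postgres URL
Session = sessionmaker(bind=engine)

app = Flask(__name__)

@app.route("/tracks")
def get_tracks():
    limit = min(int(request.args.get("limit", 100)), 500)
    session = Session()
    try:
        rows = (
            session.query(Track)  # illustrative Track model from the db sketch above
            .filter(Track.is_current == True)
            .limit(limit)
            .all()
        )
        # Serialize SQLAlchemy results into plain dicts before returning JSON.
        return jsonify([{"track_id": r.track_id, "title": r.title} for r in rows])
    finally:
        session.close()
```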

celery

Celery is simply a task queue - it allows us to define a set of tasks that are run repeatedly throughout the lifetime of the discovery node.

Currently, a single task (src/tasks/index.py:update_task()) handles all database write operations. The flask application only reads from our database and plays no part in keeping the data correct or up to date.

Celery worker and beat are the key underlying concepts behind celery usage in the discovery node. Celery beat is responsible for periodically scheduling index tasks and is run as a separate container from the worker. Details about periodic task scheduling can be found in the official documentation.

Celery worker is the component that actually runs tasks - in this case ‘index_blocks’.
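In outline, and with illustrative names (the real app configuration lives in src/__init__.py and the task in src/tasks/index.py), the wiring looks roughly like this:

```python
# Illustrative wiring only; the task name and broker URL are assumptions.
from celery import Celery

celery = Celery("discovery_node", broker="redis://localhost:6379/0")

@celery.task(name="update_discovery_provider")  # name assumed for this sketch
def update_task():
    # Single write path: acquire the redis lock, then index any new blocks
    # into Postgres (see the "celery worker" section below).
    pass
```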

celery worker

What happens when 'index_blocks' is actually executed? The celery task performs the following operations (a condensed sketch follows the list):

  1. Check whether the latest block differs from the last processed block in the ‘blocks’ table. If so, an array of blocks is generated, spanning from the last blockhash present in our database up to the latest block number and capped by the block indexing window.

    • the block indexing window is the maximum number of blocks processed in a single indexing operation
  2. Traverse each block in the array produced by the previous step.

    For each block, check whether any transactions relevant to the audius smart contracts are present. If so, we retrieve specific event information from the associated transaction hash - examples include creator and track metadata. To do so, the discovery node must be aware of both the contract ABIs and each contract's address - these are shipped with each discovery node image.

  3. For the audius contract operations found in a given block, the task updates the corresponding tables in the database. Certain index operations require a metadata fetch from decentralized storage (IPFS - the InterPlanetary File System). Metadata formats can be found here.
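Putting the three steps together, a condensed sketch of the loop might look like the following. Block, AUDIUS_CONTRACT_ADDRESSES, process_audius_events, and the window size are illustrative names and values, not the actual src/tasks/index.py code:

```python
BLOCK_INDEXING_WINDOW = 20  # assumed: max blocks processed per indexing operation

def index_blocks(web3, session):
    latest = web3.eth.get_block("latest")
    last_indexed = session.query(Block).filter(Block.is_current == True).one_or_none()
    if last_indexed is not None and last_indexed.blockhash == latest.hash.hex():
        return  # already caught up with the chain head

    # Step 1: build the range of blocks to process, capped by the indexing window.
    start = (last_indexed.number + 1) if last_indexed is not None else 0
    end = min(latest.number, start + BLOCK_INDEXING_WINDOW - 1)

    # Step 2: traverse each block, looking for transactions against audius contracts.
    for block_number in range(start, end + 1):
        block = web3.eth.get_block(block_number, full_transactions=True)
        for tx in block.transactions:
            if tx["to"] in AUDIUS_CONTRACT_ADDRESSES:
                receipt = web3.eth.get_transaction_receipt(tx["hash"])
                # Step 3: decode events with the shipped ABIs, fetch any IPFS
                # metadata, and update the user/track/playlist tables.
                process_audius_events(session, receipt)
        session.add(Block(blockhash=block.hash.hex(), number=block.number))
        session.commit()  # commit per block so a failure only reverts the block in flight
```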

Why index blocks instead of using event filters?

This is a great question - the main reason we have chosen to index blocks in this manner is to handle cases of false progress and rollback. Each indexing task opens a fresh database session, which means DB transactions can be reverted at a block level. While rollback handling for the discovery node has yet to be implemented, block-level indexing will be immediately useful when it becomes necessary.

custom task context (DatabaseTask):

We define a custom celery task context which overrides the celery worker's notion of a ‘Task’ - this can be found in src/__init__.py:configure_celery().

By default, celery workers have no knowledge of the application from which they are spawned - this is by design, as a celery ‘Task’ is independent of the context in which it is first declared.

However, we have specific requirements within our celery task that require some additional context. In order to ensure each scheduled task has the requisite information, we override the default celery.Task (per the official documentation) to add the following context (a rough outline follows the list):

  • Contract ABI values
  • Web3 provider
  • Database Connection object
  • Configuration object (shared_config)
  • IPFS Client object
  • Redis connection object
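In rough outline (property names are illustrative; the actual class is defined in src/__init__.py:configure_celery(), which also assigns these values at worker start-up):

```python
# Rough outline of a custom task base class; not the actual Audius implementation.
from celery import Task

class DatabaseTask(Task):
    """Task base class carrying the shared resources listed above."""

    _db = None
    _web3 = None
    _abi_values = None
    _shared_config = None
    _ipfs_client = None
    _redis = None

    @property
    def db(self):
        return self._db

    @property
    def redis(self):
        return self._redis

    # ...analogous properties for web3, abi_values, shared_config, and ipfs_client.

# A task declared with `bind=True, base=DatabaseTask` can then read `self.db`,
# `self.redis`, etc. from inside its body.
```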

celery beat

An identical container to the celery worker, but run as a 'beat scheduler' to ensure indexing is run at a periodic interval. By default this interval is 5 seconds.
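As a sketch, and reusing the assumed task name from the celery section above, the schedule amounts to a single beat_schedule entry:

```python
# Illustrative beat configuration: re-schedule the indexing task every 5 seconds.
celery.conf.beat_schedule = {
    "update_discovery_provider": {
        "task": "update_discovery_provider",  # assumed task name
        "schedule": 5.0,  # seconds between scheduling attempts
    },
}
```

The beat container then runs `celery beat` against this schedule, while the worker container runs `celery worker` to execute the tasks beat enqueues.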

redis

Redis is used as the broker for celery - a ‘broker’ is the message transport celery uses to queue and distribute tasks.

A second, equally important function of our local redis instance is to allow locking across multiple tasks. This is why a redis connection is included as part of our custom task context mentioned above.

In order to ensure only a single indexing task executes at a time, we use a redis “lock” during each indexing operation - details can be found in src/tasks/index.py.
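A simplified sketch of that guard, assuming a lock timeout value and reusing names from the earlier sketches (the real logic, including lock expiry handling, is in src/tasks/index.py):

```python
import redis

redis_client = redis.Redis.from_url("redis://localhost:6379/0")

have_lock = False
update_lock = redis_client.lock("disc_prov_lock", timeout=1500)  # timeout value assumed
try:
    # Non-blocking acquire: if another indexing task already holds the lock,
    # give up immediately instead of waiting for it.
    have_lock = update_lock.acquire(blocking=False)
    if have_lock:
        index_blocks(web3, session)  # indexing pass from the earlier sketch
    # Otherwise simply exit; celery beat will schedule another task shortly.
finally:
    if have_lock:
        update_lock.release()
```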

So, let us take an example: given a task interval of 10s, an ‘indexing_task’ (A) has been executing for 10 seconds when a second task (B) is scheduled. At this point A still holds the “disc_prov_lock”, so when B attempts to acquire the lock it fails to do so and exits - this is expected behavior. If A then completes in 3 more seconds, at 13 seconds total, there is no indexing for the remaining 7 seconds until the next task is scheduled at the 20-second mark.