Infrastructure

Here is an idea of a potential Infrastructure for Turing.

[Infrastructure diagram]

Components

The Architecture will be divided into several parts. Each component is described in the paragraphs below. Some information is still missing, such as user login. I would suggest Auth0, but I left it out of the Infrastructure for now until we think about the whole picture. This represents the bare minimum to make things work.

Turing UI

The UI will be responsible for the Playgrounds, the User profile, etc. The user will interact with the UI only. I know it's obvious, but I think it's important that the user does not have to do extra work to start using the platform.

Turing APIs

The APIs are responsible for storing all the information regarding the platform. The APIs will be updated every time a new component is added to the Playground or a Setting is modified.
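
As a rough illustration of the kind of endpoint the APIs could expose, here is a minimal sketch using Flask. The route, payload fields, and in-memory store are all assumptions, not a defined contract; persistence would go through MongoDB in practice.

```python
# Hypothetical sketch of a Turing API endpoint that records Playground changes.
from flask import Flask, request, jsonify

app = Flask(__name__)
playgrounds = {}  # stand-in for the real persistence layer (MongoDB)

@app.route("/playgrounds/<playground_id>/components", methods=["POST"])
def add_component(playground_id):
    component = request.get_json()
    playgrounds.setdefault(playground_id, []).append(component)
    return jsonify({"playground": playground_id,
                    "components": len(playgrounds[playground_id])}), 201
```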

Mongo

The reason we chose MongoDB in the first place is the flexibility it gives us for storing the Playground schema. In fact, we can consider the schema a JSON document that is constantly updated, and Mongo makes this interaction easier.
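
A minimal sketch of what this could look like with pymongo, assuming a `turing` database and a `playgrounds` collection (both names are placeholders): the schema is kept as one document per Playground and upserted whenever it changes.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
playgrounds = client["turing"]["playgrounds"]

schema = {
    "playground_id": "demo-1",
    "components": [
        {"type": "kafka_source", "topic": "raw-events"},
        {"type": "ml_model", "name": "classifier"},
    ],
}

# Upsert keeps a single document per Playground and lets the schema evolve freely.
playgrounds.replace_one({"playground_id": "demo-1"}, schema, upsert=True)
```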

Vault

To store the credentials of the data connectors securely, Vault is one of the potential solutions we can use. It supports many Authentication/Authorization mechanisms but, most importantly, it natively supports an encrypted Key/Value storage backend. This is very powerful because we can define how to make each key unique and store the credentials in the most convenient way. Besides, it's very easy to set up in Kubernetes.
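
As a sketch of the Key/Value usage, here is how a connector's credentials could be written and read back with the hvac client. The mount point, secret path, and token are assumptions for illustration only.

```python
import hvac

client = hvac.Client(url="http://vault:8200", token="dev-only-token")

# Write the credentials under a key derived from the Playground and connector.
client.secrets.kv.v2.create_or_update_secret(
    path="playgrounds/demo-1/connectors/postgres",
    secret={"username": "turing", "password": "s3cret"},
)

# Read them back when the connector is instantiated.
read = client.secrets.kv.v2.read_secret_version(
    path="playgrounds/demo-1/connectors/postgres"
)
credentials = read["data"]["data"]
```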

Converter

I decided to have a dedicated service for converting the schema produced by the UI into Apache NiFi templates. I assume there will be many templates to define, and we need to make sure we can translate them correctly. Having this logic within the APIs would also work, but it would violate the Single Responsibility Principle for microservices. This way, deployment is also a bit easier. The Converter reads the schema it receives as input and translates it into a NiFi flow.
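
A rough sketch of the Converter's core job follows: mapping Playground component types onto NiFi processor definitions. The mapping table and output shape are assumptions; the processor class names are NiFi's Kafka processors, used here only as examples.

```python
PROCESSOR_MAPPING = {
    "kafka_source": "org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_0",
    "kafka_sink": "org.apache.nifi.processors.kafka.pubsub.PublishKafka_2_0",
}

def schema_to_nifi(schema):
    """Translate each Playground component into a NiFi processor descriptor."""
    processors = []
    for component in schema["components"]:
        processor_type = PROCESSOR_MAPPING.get(component["type"])
        if processor_type is None:
            raise ValueError(f"No NiFi mapping for component type {component['type']!r}")
        processors.append({
            "type": processor_type,
            "properties": component.get("properties", {}),
        })
    return {"processors": processors}
```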

Apache NiFi

NiFi will actually execute the work given the translated schema. All the components will be created at run-time based on the template, and the flow should start working out of the box. We should perhaps add some best practices when creating each individual Processor.
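
For instance, instantiating a previously uploaded template at run-time could go through NiFi's REST API, roughly as sketched below. The broker host and template ID are placeholders.

```python
import requests

NIFI = "http://nifi:8080/nifi-api"

# Look up the root process group, then instantiate the template inside it.
root = requests.get(f"{NIFI}/flow/process-groups/root").json()
root_group_id = root["processGroupFlow"]["id"]

requests.post(
    f"{NIFI}/process-groups/{root_group_id}/template-instance",
    json={"templateId": "<template-uuid>", "originX": 0.0, "originY": 0.0},
)
```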

Apache Kafka

We should use Kafka extensively when moving data from component A to component B, and so on. This way we make the communication between components reliable and reduce data loss.
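
A minimal sketch of passing data between two components through Kafka, using the kafka-python client as one possible choice (topic names and broker address are placeholders):

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Component A publishes its output to a topic...
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("component-a-output", {"event": "row_processed", "id": 42})
producer.flush()

# ...and component B consumes from the same topic.
consumer = KafkaConsumer(
    "component-a-output",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
```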

ML Models

The ML models (still to be defined) will read data from Kafka and execute whatever they are supposed to do. The results should be placed back in Kafka and either sent to Druid for querying or made available to other steps. We can also use Kafka for storing checkpoint datasets between components.
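
As a sketch, each model service could run a consume-score-produce loop like the one below. The topic names and the `predict` function are placeholders, since the actual models are still to be defined.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "features",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def predict(record):
    # Placeholder for the real model inference.
    return {"score": 0.5, **record}

for message in consumer:
    producer.send("predictions", predict(message.value))
```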

Druid

The cherry on top of the cake. Druid will guarantee fast querying of the dataset. It also offers a JDBC connector so that users can query the database with SQL via Apache Calcite. Druid can also be queried through Superset.
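
For example, a dataset could be queried over Druid's SQL HTTP endpoint (`/druid/v2/sql`); the broker address and datasource name below are assumptions.

```python
import requests

response = requests.post(
    "http://druid-broker:8082/druid/v2/sql",
    json={
        "query": "SELECT COUNT(*) AS events FROM predictions "
                 "WHERE __time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR"
    },
)
print(response.json())
```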

Superset

Superset could be integrated and made available so that the user can try out their newly queryable dataset.

Kubernetes

Kubernetes is the natural choice to make everything work together.
