Infrastructure
Here is an idea of a potential Infrastructure for Turing.
The architecture will be divided into several parts. Each component is described in the paragraphs below. There is some information missing, such as the Login of the user. I would suggest Auth0, but I left it out of the Infrastructure for now until we think about the whole picture. This represents the bare minimum to make things work.
The UI will be responsible for taking care of the Playgrounds, the User profile, etc. The user will interact with the UI only. I know it's obvious, but I think it's important that the user does not have to do extra work to start using the platform.
The APIs are responsible for storing all the information regarding the platform. The APIs will be updated every time a new component is added to the Playground or a Setting is modified.
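As a rough sketch of what one of these endpoints could look like (the framework, route, and payload shape here are assumptions, not decisions):

```python
# Hypothetical sketch of an API endpoint that records a new Playground component.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Component(BaseModel):
    type: str          # e.g. "csv-source" (illustrative component type)
    settings: dict = {}

@app.post("/playgrounds/{playground_id}/components")
def add_component(playground_id: str, component: Component):
    # Here the APIs would persist the updated Playground schema (see MongoDB below).
    return {"playground": playground_id, "added": component.type}
```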
The reason why we chose MongoDB in the first place is the flexibility it offers for storing the Schema of the Playground. In fact, we can consider the schema as a JSON document that is constantly updated, and Mongo makes this interaction easier.
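A minimal sketch of that interaction, assuming a hypothetical `playgrounds` collection and document layout:

```python
# Sketch only: the collection name and document shape are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
playgrounds = client["turing"]["playgrounds"]

# The Playground schema is just a JSON document; adding a component is a partial update.
playgrounds.update_one(
    {"_id": "demo-playground"},
    {"$push": {"components": {"type": "csv-source", "settings": {"path": "/data/in.csv"}}}},
    upsert=True,
)
```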
To store the credentials of the data connectors safely, Vault is one of the potential solutions we can use. It supports tons of Authentication/Authorization mechanisms but, most importantly, it natively supports an encrypted Key/Value store as backend. This is very powerful because we can define how to make the key unique and store the credentials in the most convenient way. Besides, it's very easy to set up in Kubernetes.
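A minimal sketch with the hvac client and Vault's KV v2 secrets engine; the path convention and credential fields are placeholders:

```python
# Sketch only: Vault address, token handling, and key layout are assumptions.
import hvac

client = hvac.Client(url="http://vault.vault.svc:8200", token="dev-only-token")

# Store the credentials of a data connector under a key that is unique per user/connector.
client.secrets.kv.v2.create_or_update_secret(
    path="connectors/user-42/postgres-main",
    secret={"username": "turing", "password": "s3cr3t"},
)

# Read them back when the connector needs to authenticate.
secret = client.secrets.kv.v2.read_secret_version(path="connectors/user-42/postgres-main")
credentials = secret["data"]["data"]
```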
I decided to have a dedicated service for converting the schema produced by the UI into Apache NiFi templates. I assume that there will be a lot of templates we need to define, and we need to make sure that we are able to translate them correctly. Having this logic within the APIs would also work, but it would violate the SRP principle for microservices. This way, deployment is also a bit easier.
The converter will read the schema it receives as input and translate it into a NiFi flow, as sketched below.
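A very rough sketch of the converter's core translation step; the component types, NiFi processor class names, and descriptor structure below are illustrative assumptions, not the final template format:

```python
# Sketch only: maps Playground components to NiFi processor descriptors.
from typing import Dict, List

# Hypothetical mapping from Playground component types to NiFi processor classes.
COMPONENT_TO_PROCESSOR = {
    "csv-source": "org.apache.nifi.processors.standard.GetFile",
    "kafka-sink": "org.apache.nifi.processors.kafka.pubsub.PublishKafka_2_6",
}

def to_nifi_template(schema: Dict) -> List[Dict]:
    """Translate a Playground schema (as stored in Mongo) into processor descriptors."""
    processors = []
    for component in schema.get("components", []):
        processors.append({
            "type": COMPONENT_TO_PROCESSOR[component["type"]],
            "properties": component.get("settings", {}),
        })
    return processors
```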
NiFi will actually execute the work given the translated schema. All the components will be created at run-time based on the template, and it should start working out of the box. We should perhaps add some best practices when creating each individual Processor.
We should use Kafka extensively when moving data from component A to component B, and so on. In this way we make the communication between components more reliable and reduce data loss.
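For example, a component could publish its output with a plain Kafka producer (the topic naming and serialization below are assumptions):

```python
# Sketch only: component A publishes its output; component B consumes the same topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.kafka.svc:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

producer.send("playground.demo.component-a.out", {"row_id": 1, "value": 42.0})
producer.flush()
```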
The ML models (still to be defined) will read data from Kafka and execute whatever they are supposed to execute. The result should be placed back into Kafka and either sent to Druid to be queried or made available to the next steps. We can also use Kafka to store Checkpoint datasets in between components.
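A minimal sketch of that consume, score, produce loop; the topic names and the scoring step are placeholders for whatever models we end up defining:

```python
# Sketch only: reads feature records from Kafka, scores them, writes predictions back.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "playground.demo.features",
    bootstrap_servers="kafka.kafka.svc:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka.kafka.svc:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

for message in consumer:
    # The actual model call goes here; a constant score keeps the sketch self-contained.
    prediction = {"row_id": message.value["row_id"], "score": 0.0}
    producer.send("playground.demo.predictions", prediction)  # e.g. ingested by Druid
```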
Druid is the cherry on top of the cake. It guarantees fast querying of the dataset. It also offers a JDBC connector so that users can query the database with SQL via Apache Calcite. Druid can also be queried via Superset.
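As an example, the dataset could also be queried through Druid's SQL-over-HTTP endpoint (the router URL and datasource name are assumptions):

```python
# Sketch only: runs a SQL query against Druid's /druid/v2/sql endpoint.
import requests

response = requests.post(
    "http://druid-router.druid.svc:8888/druid/v2/sql",
    json={"query": "SELECT __time, score FROM predictions ORDER BY __time DESC LIMIT 10"},
)
rows = response.json()  # list of result rows as JSON objects
```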
Superset could be integrated and made available so that the user can try out their newly queryable dataset.
Kubernetes is the natural choice to make everything work together.