Scalability of CLARIAH tools & infrastructure #126

Open
proycon opened this issue Jun 14, 2022 · 0 comments
Labels
discussion (This is a discussion point; invitation to discussion), FAIR Distribution & Deployment

proycon commented Jun 14, 2022

We had a meeting between WP3 and WP6 today about certain use cases where a
high(er) degree of scalability is needed; specifically the need to invoke certain
processing tasks in parallel so the output can be obtained in a more reasonable
time.

As this is of course a central theme in any large infrastructure, I wanted to
open up this issue to track any progress, solutions and discussion on this,
from a generic perspective.

There are different aspects to the need to scale that we need to distinguish:

  1. Multithreading, parallel execution to use multiple cores (this is a software matter rather than an infrastructure matter). This comes down to efficient software design that fits contemporary hardware, but is definitely not trivial. Aside from CPUs, the role of GPUs should also be considered here.
  2. Distributed computing, i.e. the ability for a single user to dedicate parallel computing resources working together towards a single task, reducing the time in which it runs and results can be obtained. Here we can also distinguish parallelisation on a single computing node (multiple processes) from distribution over a larger computing cluster. The common solution here is to partition the input into n splits (if feasible of course) and run one process for each.
  3. Concurrency: scaling up deployments when there are more users at the same time (which is what our Infrastructure Requirements cover in point 23), and scaling down deployments as the number of users drops again.
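The "partition the input into n splits and run one process for each" approach from point 2 can be sketched on a single node as follows. This is only a minimal illustration using Python's standard library; `partition`, `process_split` and `run_parallel` are hypothetical names, and the word-counting worker is a stand-in for whatever expensive task (a parser, for instance) would actually be run.

```python
from concurrent.futures import ProcessPoolExecutor


def partition(items, n_splits):
    """Split items into n_splits contiguous chunks of near-equal size."""
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    base, extra = divmod(len(items), n_splits)
    chunks, start = [], 0
    for i in range(n_splits):
        # The first `extra` chunks get one additional item each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_split(sentences):
    """Hypothetical stand-in for an expensive per-split task."""
    return [len(sentence.split()) for sentence in sentences]


def run_parallel(items, n_splits):
    """Run one worker process per split and merge results in input order."""
    chunks = partition(items, n_splits)
    with ProcessPoolExecutor(max_workers=n_splits) as pool:
        results = pool.map(process_split, chunks)
        return [value for chunk_result in results for value in chunk_result]


if __name__ == "__main__":
    corpus = ["de kat zit op de mat", "het regent", "dit is een zin"]
    print(run_parallel(corpus, n_splits=2))
```

This only works when the input can be split into independent pieces. On a cluster the same pattern applies, with a job scheduler or workflow manager launching one job per split instead of a local process pool.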

For 1 we need robust software design (and algorithmic design in particular). This is something we need to encourage if the problem can be solved on this level.
For 3 we need load balancing and container orchestration, which should be handled by the infrastructure and is viable with solutions like Kubernetes.
Point 2 is typically addressed in high-performance computing clusters using job schedulers like SLURM or complete workflow management solutions (e.g. DANE, Nextflow, Airflow, Luigi). Solutions like kube-scheduler may also fit our service-oriented architecture.
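To make the scaling up and down in point 3 a bit more concrete: the sketch below shows the proportional rule that the Kubernetes Horizontal Pod Autoscaler documentation describes (desired replicas = ceil(current replicas × current metric / target metric)). The function name, the parameter names and the min/max bounds are illustrative assumptions. The real autoscaler also applies tolerances and stabilisation windows, which are left out here.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling rule, clamped to assumed min/max bounds.

    current_load and target_load are per-replica metric values
    (for example average CPU utilisation in percent).
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # More users: load per replica is above target, so scale up.
    print(desired_replicas(current_replicas=2, current_load=90, target_load=50))
    # Fewer users: load per replica is well below target, so scale down.
    print(desired_replicas(current_replicas=4, current_load=10, target_load=50))
```

In practice this decision would be left to the orchestrator rather than implemented by us; the sketch is only meant to show what the infrastructure needs to measure (a per-replica load metric) in order to scale.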

These three are not mutually exclusive; in real situations there may be demands for all three at the same time (which complicates matters).

Any views on this or ongoing efforts that address this?

  • @mmisworking: to what extent is work currently being done on 2 and 3 in the KNAW HuC Kubernetes cluster for CLARIAH? I think 3 is probably the 'lowest hanging fruit' or 'most minimal viable solution'.
  • Certain CLARIAH software may currently form a performance bottleneck for certain use cases and may require extra attention. WP6 identified Alpino as one such tool.

(Poking all participants in the WP3/WP6 meeting (who are on github): @JanOdijk @jorisvanzundert @karinavdo @julianeugarten)

proycon added the discussion and FAIR Distribution & Deployment labels Jun 14, 2022