
Separate workers for parsing and database insertions #137

Open
j-rausch opened this issue Sep 5, 2018 · 5 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is required)

Comments

@j-rausch
Contributor

j-rausch commented Sep 5, 2018

Is your feature request related to a problem? Please describe.
Decouple UDF processes from the backend/database session.
Right now, when we run UDFRunner.apply_mt(), we create a number of UDF worker processes. These processes all own an sqlalchemy Session object and add/commit to the database at the end of their respective parsing loop.
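To make the coupling concrete, here is a minimal sketch of the pattern described above; it is not Fonduer's actual code, and the names (`FakeSession`, `parse_document`, `udf_worker`) are illustrative. A thread-backed `multiprocessing.dummy.Process` is used so the sketch runs anywhere, whereas the real implementation uses separate OS processes:

```python
# Sketch of the current coupling: every UDF worker owns a "session"
# and commits to the database inside its own parsing loop.
from multiprocessing.dummy import Process  # thread-backed, for portability
from queue import Queue

class FakeSession:
    """Stand-in for a SQLAlchemy Session; just records committed rows."""
    def __init__(self, store):
        self.store = store
        self.pending = []
    def add_all(self, rows):
        self.pending.extend(rows)
    def commit(self):
        self.store.extend(self.pending)
        self.pending = []

def parse_document(doc):
    # Placeholder for the real parser: two "sentences" per document.
    return [f"{doc}:{s}" for s in ("s1", "s2")]

def udf_worker(in_queue, store):
    session = FakeSession(store)   # per-worker session: this is the coupling
    while True:
        doc = in_queue.get()
        if doc is None:            # poison pill terminates the worker
            break
        session.add_all(parse_document(doc))
        session.commit()           # DB work happens inside the parsing loop

store, q = [], Queue()
for doc in ("doc1", "doc2"):
    q.put(doc)
q.put(None)
w = Process(target=udf_worker, args=(q, store))
w.start()
w.join()
print(sorted(store))  # ['doc1:s1', 'doc1:s2', 'doc2:s1', 'doc2:s2']
```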

Describe the solution you'd like
Make the UDF processes backend-agnostic, e.g. by having a set of separate BackendWorker processes handle the insertion of sentences. One possible approach: connect the output_queue of each UDF to the input of a BackendWorker, which receives Sentence lists and handles the sqlalchemy commits.

This will not fully decouple UDF from the backend, because the parser returns sqlalchemy-specific Sentence objects, but it could be one step towards that goal.
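A hedged sketch of the proposed split, under the same illustrative assumptions as above (the real UDF and queue classes differ): UDF workers only parse and push sentence lists onto an output queue, while a single BackendWorker owns the session and performs all commits.

```python
# Proposed split: parsing workers never touch the database; a dedicated
# BackendWorker drains the output queue and does all persistence.
from multiprocessing.dummy import Process  # thread-backed, for portability
from queue import Queue

def parse_document(doc):
    # Placeholder for the real parser: two "sentences" per document.
    return [f"{doc}:{s}" for s in ("s1", "s2")]

def udf_worker(in_q, out_q):
    while True:
        doc = in_q.get()
        if doc is None:
            out_q.put(None)            # forward the poison pill downstream
            break
        out_q.put(parse_document(doc)) # no DB access in the UDF process

def backend_worker(out_q, store, n_udf_workers):
    pills = 0
    while pills < n_udf_workers:       # wait until every UDF worker is done
        batch = out_q.get()
        if batch is None:
            pills += 1
            continue
        store.extend(batch)            # stands in for session.add_all + commit

in_q, out_q, store = Queue(), Queue(), []
for doc in ("doc1", "doc2"):
    in_q.put(doc)
in_q.put(None)
udf = Process(target=udf_worker, args=(in_q, out_q))
db = Process(target=backend_worker, args=(out_q, store, 1))
udf.start(); db.start()
udf.join(); db.join()
print(sorted(store))  # ['doc1:s1', 'doc1:s2', 'doc2:s1', 'doc2:s2']
```

With this shape, only the BackendWorker needs a Session, so the number of database connections no longer scales with the number of parsing workers.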

Additional context
This feature request refers to decoupling of parsing and backend.
There's likely more coupling with the backend later in the processing pipeline.

@lukehsiao lukehsiao added enhancement New feature or request help wanted Extra attention is required labels Sep 21, 2018
@HiromuHota
Contributor

The apply/reducer architecture, which has been used by the snorkel-extraction project, may be used here too.

@HiromuHota
Contributor

The Snorkel team has discussed multithreaded reduce at snorkel-team/snorkel#562.

@senwu
Collaborator

senwu commented Nov 9, 2019

This is a great question! Is there a way to use an existing package for this instead of building our own code?

@HiromuHota
Contributor

We could use (Py)Spark, Dask, etc. for distributed computing, but the bottleneck would be the data persistence layer, i.e., PostgreSQL.
In other words, as long as we use PostgreSQL, it will remain the bottleneck, and we will end up doing ad-hoc performance optimizations here and there.

One idea is to use different appliers for different storage backends: one for in-memory, another for PostgreSQL, another for Hive, etc.
The snorkel project (not snorkel-extraction) takes this approach for different computing frameworks (LFApplier, DaskLFApplier, SparkLFApplier), but Fonduer has more appliers to take care of, i.e., parser, mention_extractor, candidate_extractor, labeler, featurizer; and Fonduer has to worry about the data persistence layer too.
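One way the "one applier per storage backend" idea could look, as an illustrative sketch only (these class names are not Fonduer's or snorkel's API): the parsing logic lives in a shared base class, and each subclass supplies only the persistence step.

```python
# Illustrative applier hierarchy: shared parsing, backend-specific persistence.
from abc import ABC, abstractmethod

class BaseApplier(ABC):
    def apply(self, docs):
        for doc in docs:
            self.persist(self.parse(doc))

    def parse(self, doc):
        # Placeholder for the real parser: two "sentences" per document.
        return [f"{doc}:{s}" for s in ("s1", "s2")]

    @abstractmethod
    def persist(self, sentences):
        ...

class MemoryApplier(BaseApplier):
    """In-memory backend: keeps parsed sentences in a Python list."""
    def __init__(self):
        self.rows = []
    def persist(self, sentences):
        self.rows.extend(sentences)

class PostgresApplier(BaseApplier):
    """PostgreSQL backend: would wrap a SQLAlchemy session."""
    def __init__(self, session):
        self.session = session
    def persist(self, sentences):
        self.session.add_all(sentences)
        self.session.commit()

m = MemoryApplier()
m.apply(["doc1", "doc2"])
print(m.rows)  # ['doc1:s1', 'doc1:s2', 'doc2:s1', 'doc2:s2']
```

A `HiveApplier` or `SparkApplier` would slot in the same way, which is the multiplication Fonduer faces: each of parser, mention_extractor, candidate_extractor, labeler, and featurizer would need such a hierarchy.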

@senwu
Collaborator

senwu commented Nov 9, 2019

That's one idea! I think it would be better to modularize so we can 1) better support distributed computing frameworks from other parties (e.g., PySpark, Dask), and 2) make it easy to extend to other data layers.
