
No database dependency during inference/serving #316

Closed
HiromuHota opened this issue Sep 26, 2019 · 3 comments · Fixed by #368

Comments

@HiromuHota (Contributor)

Is your feature request related to a problem? Please describe.

I think a Fonduer-based app's lifecycle has three phases: development, training, and serving.
During development, you may go through many iterations of redefining mention/candidate subclasses and labeling functions, which involve re-parsing, re-extraction, re-labeling, etc.
I understand that it is important to persist intermediate results to a database to save time.
Depending on the size of the training dataset, you may need to persist intermediate results during training too.

On the other hand, there is no need to persist intermediate results to a database during inference/serving.
The dependency on a database (PostgreSQL) makes a Fonduer-based app less portable and less scalable.

Describe the solution you'd like

I'd like no database dependency during inference/serving.

Describe alternatives you've considered

SQLite is easier to install than PostgreSQL, but even then there is no need to persist intermediate results to a database during inference/serving.

Additional context

Related to #137.

@HiromuHota (Contributor, Author)

As one step further towards this goal, I'd like to make all the child classes of UDF, like ParserUDF, unaware of the database. Meanwhile, the child classes of UDFRunner, like Parser, will remain intact to maintain the public APIs.

Currently, the apply method of each child class of UDF either returns objects that are to be saved to the database:

    ParserUDF.apply(self, doc: Document, ...) -> Iterator[Sentence]
    MentionExtractorUDF.apply(self, doc: Document, ...) -> Iterator[Mention]
    CandidateExtractorUDF.apply(self, doc: Document, ...) -> Iterator[Candidate]

or returns nothing, because the objects are saved to the database within the method:

    LabelerUDF.apply(self, doc: Document, ...) -> None
    FeaturizerUDF.apply(self, doc: Document, ...) -> None

To make these methods unaware of the database:

  1. They should not use the session object.
  2. They should return objects that will be used by the following process.

For example:

    ParserUDF.apply(self, doc: Document, ...) -> Document
    MentionExtractorUDF.apply(self, doc: Document, ...) -> Document
    CandidateExtractorUDF.apply(self, doc: Document, ...) -> Document

and (with less confidence):

    LabelerUDF.apply(self, doc: Document, ...) -> np.ndarray
    FeaturizerUDF.apply(self, doc: Document, ...) -> csr_matrix

@senwu, @lukehsiao, thoughts?

@lukehsiao (Contributor)

At a high level, this sounds like an excellent idea to me. I wonder if/how this might affect performance as well. Perhaps we can get less lock contention if the different UDFs are not using the session object directly.

@HiromuHota (Contributor, Author)

Good point.
With this approach, the database will be accessed only by a single process, the one that runs UDFRunner.apply. So yes, much less, or even no, lock contention at the database.
The downside is the overhead of transferring objects from each UDF to the UDFRunner, which I hope is cheaper than database lock contention.

Also, we have to be careful not to run out of memory when a large corpus is processed.

As you may have noticed, the architecture looks like map-reduce (mapper: each UDF; reducer: UDFRunner).
In fact, snorkel-extraction does exactly this; see snorkel-extraction/snorkel/udf.py.
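A minimal sketch of that map-reduce shape, using a multiprocessing pool: worker processes run the database-unaware computation (map), and only the parent process collects and persists results (reduce), so no two processes ever contend for a database lock. The parse function here is a hypothetical stand-in for any UDF.apply.

```python
from multiprocessing import Pool


def parse(text: str) -> int:
    """Map step: pure computation, no session, no shared state."""
    return len(text.split())


if __name__ == "__main__":
    corpus = ["one two three", "four five", "six"]
    results = {}
    with Pool(2) as pool:
        # imap yields results lazily as workers finish, which helps bound
        # memory on a large corpus compared to collecting everything at once.
        for doc, n in zip(corpus, pool.imap(parse, corpus)):
            results[doc] = n  # reduce step: a single process persists results
    print(results)
```

The lazy iteration over pool.imap also speaks to the memory concern above: the parent can persist and discard each result as it arrives instead of buffering the whole corpus.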
