Persist doc only when no error happens during parsing #490
PyTorch 1.6.0 was released just two days ago.
…rn only successfully parsed documents
Codecov Report
@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
+ Coverage   85.84%   85.90%   +0.05%
==========================================
  Files          88       88
  Lines        4565     4568       +3
  Branches      850      851       +1
==========================================
+ Hits         3919     3924       +5
+ Misses        464      463       -1
+ Partials      182      181       -1
Please squash it when merging.
@@ -141,6 +141,11 @@ def apply(  # type: ignore
            progress_bar=progress_bar,
        )

    def _add(self, doc: Union[Document, None]) -> None:
Do we use this function? From this PR, I didn't find a place to use this function. Let me know if I am wrong.
Parser._add is called in the main process after getting a result from ParserUDF.apply, at fonduer/src/fonduer/utils/udf.py (lines 143 to 150 in 5909011):
while (
    any([udf.is_alive() for udf in self.udfs]) or not out_queue.empty()
) and count_parsed < total_count:
    # Get doc from the out_queue and persist the result into postgres
    try:
        (doc_name, y) = out_queue.get()  # block until an item is available
        self._add(y)
        self.last_docs.add(doc_name)
UDF._add was meant to add features and labels to the database, specifically in fonduer/src/fonduer/features/featurizer.py (lines 248 to 252 in 5909011):
def _add(self, records_list: List[List[Dict[str, Any]]]) -> None:
    # Make a flat list of all records from the list of list of records.
    # This helps reduce the number of queries needed to update.
    all_records = list(itertools.chain.from_iterable(records_list))
    batch_upsert_records(self.session, Feature, all_records)
and in fonduer/src/fonduer/supervision/labeler.py (lines 309 to 311 in 5909011):
def _add(self, records_list: List[List[Dict[str, Any]]]) -> None:
    for records in records_list:
        batch_upsert_records(self.session, self.table, records)
So far, only Featurizer._add and Labeler._add have been implemented, but I realized that Parser._add can be implemented to store the "transient" Document in the database.
We don't need _add for MentionExtractor and CandidateExtractor, as extracted mentions and candidates are persisted without being explicitly added to the session.
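To illustrate the idea described above, here is a minimal sketch of how a Parser._add that accepts Union[Document, None] could persist only successfully parsed documents. It is not the code from this PR: Document, FakeSession, parse, and the "empty input fails" rule are hypothetical stand-ins, and the real implementation works with a database session and worker processes rather than a plain loop.

```python
from typing import Dict, List, Tuple, Union


class Document:
    """Hypothetical stand-in for a parsed document."""

    def __init__(self, name: str, text: str) -> None:
        self.name = name
        self.text = text


class FakeSession:
    """Hypothetical stand-in for a database session; it only records adds."""

    def __init__(self) -> None:
        self.added: List[Document] = []

    def add(self, obj: Document) -> None:
        self.added.append(obj)


def parse(name: str, raw: str) -> Tuple[str, Union[Document, None]]:
    """Worker side: return (name, Document), or (name, None) on failure."""
    try:
        if not raw.strip():
            # Assumed failure condition, for illustration only.
            raise ValueError(f"{name}: nothing to parse")
        return name, Document(name, raw.strip())
    except ValueError:
        # The failed document is never handed to the session.
        return name, None


class Parser:
    def __init__(self, session: FakeSession) -> None:
        self.session = session

    def _add(self, doc: Union[Document, None]) -> None:
        # Main-process side: persist only when parsing produced a document.
        if doc is not None:
            self.session.add(doc)

    def apply(self, raw_docs: Dict[str, str]) -> None:
        for name, raw in raw_docs.items():
            _, doc = parse(name, raw)
            self._add(doc)


session = FakeSession()
Parser(session).apply({"good.html": "<p>hello</p>", "bad.html": "   "})
persisted = [doc.name for doc in session.added]
```

With this shape, a document that fails to parse never reaches the session, so nothing partial is left to persist.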
Left one comment
Description of the problems or issues
Is your pull request related to a problem? Please describe.
See #489.
Does your pull request fix any issue?
Fixes #489.
Description of the proposed changes
Persist doc only when no error happens during parsing.
Test plan
Added a test that demonstrates #489.
Checklist