Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persist doc only when no error happens during parsing #490

Merged
merged 13 commits into from
Jul 31, 2020

Conversation

HiromuHota
Copy link
Contributor

@HiromuHota HiromuHota commented Jul 30, 2020

Description of the problems or issues

Is your pull request related to a problem? Please describe.

See #489.

Does your pull request fix any issue.

Fix #489.

Description of the proposed changes

Persist doc only when no error happens during parsing.

Test plan

Added a test that demonstrates #489.

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.rst accordingly.

@HiromuHota HiromuHota changed the title Fix/489 Persist doc only when no error happens during parsing Jul 30, 2020
@HiromuHota
Copy link
Contributor Author

PyTorch 1.6.0 has been released just two days ago.
Mypy complains that "unused 'type: ignore' comment" at torch.__version__.
I couldn't track down which change at PyTorch v1.6.0 affects this, but removing that ignore comment satisfies Mypy.

@codecov-commenter
Copy link

codecov-commenter commented Jul 31, 2020

Codecov Report

Merging #490 into master will increase coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
+ Coverage   85.84%   85.90%   +0.05%     
==========================================
  Files          88       88              
  Lines        4565     4568       +3     
  Branches      850      851       +1     
==========================================
+ Hits         3919     3924       +5     
+ Misses        464      463       -1     
+ Partials      182      181       -1     
Flag Coverage Δ
#unittests 85.90% <100.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/fonduer/features/featurizer.py 86.02% <100.00%> (ø)
src/fonduer/parser/parser.py 93.03% <100.00%> (+0.08%) ⬆️
src/fonduer/utils/udf.py 88.57% <100.00%> (-0.11%) ⬇️
src/fonduer/candidates/models/span_mention.py 84.11% <0.00%> (+1.86%) ⬆️

@HiromuHota HiromuHota marked this pull request as ready for review July 31, 2020 22:09
@HiromuHota
Copy link
Contributor Author

Please squash it when merging.

@@ -141,6 +141,11 @@ def apply( # type: ignore
progress_bar=progress_bar,
)

def _add(self, doc: Union[Document, None]) -> None:
Copy link
Collaborator

@senwu senwu Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use this function? From this PR, I didn't find a place to use this function. Let me know if I am wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parser._add is called in the main process after getting a result from ParserUDF.apply at

while (
any([udf.is_alive() for udf in self.udfs]) or not out_queue.empty()
) and count_parsed < total_count:
# Get doc from the out_queue and persist the result into postgres
try:
(doc_name, y) = out_queue.get() # block until an item is available
self._add(y)
self.last_docs.add(doc_name)

The UDF._add has been meant to add features/labels to the database, more specifically by

def _add(self, records_list: List[List[Dict[str, Any]]]) -> None:
# Make a flat list of all records from the list of list of records.
# This helps reduce the number of queries needed to update.
all_records = list(itertools.chain.from_iterable(records_list))
batch_upsert_records(self.session, Feature, all_records)

and

def _add(self, records_list: List[List[Dict[str, Any]]]) -> None:
for records in records_list:
batch_upsert_records(self.session, self.table, records)

.
So only Featurizer._add and Labeler._add have been implemented, but I realized that Parser._add can be implemented to store the "transient" Document to database.
We don't need _add for MentionExtractor and CandidateExtractor as extracted mentions and candidates will be persisted without explicitly adding them to the session.

Copy link
Collaborator

@senwu senwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Document "{document.name}" not added to database, because of parse error" but being added
3 participants