Persist doc only when no error happens during parsing #490

HiromuHota · 2020-07-30T20:49:20Z

Description of the problems or issues

Is your pull request related to a problem? Please describe.

See #489.

Does your pull request fix any issue.

Fix #489.

Description of the proposed changes

Persist doc only when no error happens during parsing.

Test plan

Added a test that demonstrates #489.

Checklist

I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.
I have updated the CHANGELOG.rst accordingly.

HiromuHota · 2020-07-30T21:39:24Z

PyTorch 1.6.0 was released just two days ago.
Mypy now reports "unused 'type: ignore' comment" at torch.__version__.
I couldn't track down which change in PyTorch v1.6.0 causes this, but removing that ignore comment satisfies Mypy.

codecov-commenter · 2020-07-31T21:48:13Z

Codecov Report

Merging #490 into master will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
+ Coverage   85.84%   85.90%   +0.05%     
==========================================
  Files          88       88              
  Lines        4565     4568       +3     
  Branches      850      851       +1     
==========================================
+ Hits         3919     3924       +5     
+ Misses        464      463       -1     
+ Partials      182      181       -1

Flag	Coverage Δ
#unittests	`85.90% <100.00%> (+0.05%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
src/fonduer/features/featurizer.py	`86.02% <100.00%> (ø)`
src/fonduer/parser/parser.py	`93.03% <100.00%> (+0.08%)`	⬆️
src/fonduer/utils/udf.py	`88.57% <100.00%> (-0.11%)`	⬇️
src/fonduer/candidates/models/span_mention.py	`84.11% <0.00%> (+1.86%)`	⬆️
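
The percentages in the coverage diff can be reproduced from the raw counts. This is a minimal sketch, assuming coverage is hits divided by total lines; how Codecov rounds the displayed figures is not stated in the report, so the values below are printed unrounded to three decimals rather than matched digit for digit.

```python
# Counts copied from the Codecov coverage diff for master and for #490.
master = {"hits": 3919, "misses": 464, "partials": 182, "lines": 4565}
pr_490 = {"hits": 3924, "misses": 463, "partials": 181, "lines": 4568}


def coverage_percent(counts: dict) -> float:
    """Return line coverage in percent, assuming coverage = hits / lines."""
    return 100.0 * counts["hits"] / counts["lines"]


for counts in (master, pr_490):
    # Every tracked line is a hit, a miss, or a partial hit.
    assert counts["hits"] + counts["misses"] + counts["partials"] == counts["lines"]

delta = coverage_percent(pr_490) - coverage_percent(master)
print(f"master: {coverage_percent(master):.3f}%")
print(f"#490:   {coverage_percent(pr_490):.3f}%")
print(f"delta:  {delta:+.3f}%")
```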

HiromuHota · 2020-07-31T23:18:30Z

Please squash it when merging.

senwu · 2020-07-31T23:32:02Z

src/fonduer/parser/parser.py

@@ -141,6 +141,11 @@ def apply(  # type: ignore
            progress_bar=progress_bar,
        )

+    def _add(self, doc: Union[Document, None]) -> None:


Do we use this function? I didn't find a place in this PR where it is called. Let me know if I am wrong.

HiromuHota · 2020-07-31

Parser._add is called in the main process after getting a result from ParserUDF.apply, at

fonduer/src/fonduer/utils/udf.py

Lines 143 to 150 in 5909011

    while (
        any([udf.is_alive() for udf in self.udfs]) or not out_queue.empty()
    ) and count_parsed < total_count:
        # Get doc from the out_queue and persist the result into postgres
        try:
            (doc_name, y) = out_queue.get()  # block until an item is available
            self._add(y)
            self.last_docs.add(doc_name)
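
The consumer loop in that excerpt can be sketched in isolation. This is a minimal, standard-library illustration and not Fonduer's code: threads stand in for the worker processes, and `worker`, `drain`, and the `add` callback are hypothetical names. It shows the same loop condition (keep reading while any worker is alive or the queue is non-empty, and stop once every item is accounted for), with persistence happening only on the consumer side.

```python
import queue
import threading
from typing import Callable, List, Tuple


def worker(names: List[str], out_queue: "queue.Queue[Tuple[str, str]]") -> None:
    """Hypothetical worker: puts one (doc_name, result) pair per input name."""
    for name in names:
        out_queue.put((name, name.upper()))


def drain(names: List[str], add: Callable[[str], None], n_workers: int = 2) -> int:
    """Run the workers and hand each result to `add` from the calling thread."""
    out_queue: "queue.Queue[Tuple[str, str]]" = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(names[i::n_workers], out_queue))
        for i in range(n_workers)
    ]
    for w in workers:
        w.start()

    count = 0
    # Keep going while results may still arrive and not all have been seen.
    while (
        any(w.is_alive() for w in workers) or not out_queue.empty()
    ) and count < len(names):
        try:
            _doc_name, result = out_queue.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing ready yet; re-check the loop condition
        add(result)  # only the consumer touches the "database"
        count += 1

    for w in workers:
        w.join()
    return count


collected: List[str] = []
processed = drain(["a", "b", "c"], collected.append)
```

The sketch uses a timeout on `get` so the loop can re-check its condition; the excerpt above blocks instead.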

UDF._add was meant to add features and labels to the database, more specifically in

fonduer/src/fonduer/features/featurizer.py

Lines 248 to 252 in 5909011

    def _add(self, records_list: List[List[Dict[str, Any]]]) -> None:
        # Make a flat list of all records from the list of list of records.
        # This helps reduce the number of queries needed to update.
        all_records = list(itertools.chain.from_iterable(records_list))
        batch_upsert_records(self.session, Feature, all_records)

and

fonduer/src/fonduer/supervision/labeler.py

Lines 309 to 311 in 5909011

    def _add(self, records_list: List[List[Dict[str, Any]]]) -> None:
        for records in records_list:
            batch_upsert_records(self.session, self.table, records)
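
The two `_add` variants quoted above differ in how many batch calls they issue. The following sketch makes that visible with a hypothetical `CountingStore` that only counts calls; it does not reproduce `batch_upsert_records`, whose behavior is not shown in this thread.

```python
import itertools
from typing import Any, Dict, List

Record = Dict[str, Any]


class CountingStore:
    """Hypothetical stand-in for a database session; counts upsert calls."""

    def __init__(self) -> None:
        self.calls = 0
        self.rows: List[Record] = []

    def batch_upsert(self, records: List[Record]) -> None:
        self.calls += 1
        self.rows.extend(records)


def add_flattened(store: CountingStore, records_list: List[List[Record]]) -> None:
    # Flatten the list of lists first, then issue a single batch call.
    all_records = list(itertools.chain.from_iterable(records_list))
    store.batch_upsert(all_records)


def add_per_list(store: CountingStore, records_list: List[List[Record]]) -> None:
    # Issue one batch call per inner list.
    for records in records_list:
        store.batch_upsert(records)


records_list = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
flat, per_list = CountingStore(), CountingStore()
add_flattened(flat, records_list)
add_per_list(per_list, records_list)
print(flat.calls, per_list.calls)
```

Both variants store the same rows; only the number of calls differs.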
So far only Featurizer._add and Labeler._add have been implemented, but I realized that Parser._add can be implemented to store the "transient" Document in the database.
We don't need _add for MentionExtractor and CandidateExtractor, as extracted mentions and candidates are persisted without explicitly adding them to the session.
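
The hook design described here can be sketched as follows. Everything in this block is an assumption made for illustration: `Runner`, `ParserRunner`, `FakeSession`, and this `Document` class are hypothetical stand-ins, and the body of the real Parser._add is not shown in this thread. The `None` check reflects the commit titled "Make it explicit that ParserUDF.apply returns None on error".

```python
from typing import Any, List, Optional


class Document:
    """Hypothetical stand-in for a parsed document."""

    def __init__(self, name: str) -> None:
        self.name = name


class FakeSession:
    """Hypothetical stand-in for a database session."""

    def __init__(self) -> None:
        self.added: List[Any] = []

    def add(self, obj: Any) -> None:
        self.added.append(obj)


class Runner:
    """Base runner with an `_add` hook that does nothing by default."""

    def __init__(self, session: FakeSession) -> None:
        self.session = session

    def _add(self, result: Any) -> None:
        """Called in the main process with each worker result."""


class ParserRunner(Runner):
    def _add(self, doc: Optional[Document]) -> None:
        # None signals that parsing failed, so nothing is persisted.
        if doc is not None:
            self.session.add(doc)


runner = ParserRunner(FakeSession())
for result in [Document("ok"), None]:
    runner._add(result)
print([d.name for d in runner.session.added])
```

Runners whose results are persisted by other means would simply keep the default no-op hook.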

senwu

Left one comment

HiromuHota changed the title ~~Fix/489~~ Persist doc only when no error happens during parsing Jul 30, 2020

HiromuHota mentioned this pull request Jul 30, 2020

No need to ignore type for torch.__version__ as of PyTorch 1.6.0 #491

Merged

Hiromu Hota added 4 commits July 30, 2020 15:54

Add a test that demonstrates HazyResearch#489

9e85625

Persist doc only when no error happens during parsing (fix HazyResearch#489)

1ae61fe

Make it explicit that ParserUDF.apply returns None on error

91551f0

Update comments

c5edaf7

HiromuHota force-pushed the fix/489 branch from 3e9a08a to 1c39a24 Compare July 30, 2020 22:54

Update CHANGELOG

c851709

HiromuHota force-pushed the fix/489 branch from 1c39a24 to c851709 Compare July 30, 2020 23:19

Hiromu Hota added 8 commits July 30, 2020 19:46

Let GitHub Actions timeout after 60 minutes

0ba16cf

Persist the object only when it is a Document

ca5cc30

Simply add the object to the session if transient

cd519f9

Add a type hint for readability

bdb8f85

Simplify if-statement

a170648

Add a transient doc in Parser._add

cfebf87

Document that Parser.get_documents and Parser.get_last_documents return only successfully parsed documents

c3fa5d9

Fix mypy error

7872698

HiromuHota marked this pull request as ready for review July 31, 2020 22:09

senwu reviewed Jul 31, 2020

senwu approved these changes Jul 31, 2020

senwu merged commit 2ab25d8 into HazyResearch:master Jul 31, 2020

HiromuHota deleted the fix/489 branch July 31, 2020 23:48

HiromuHota mentioned this pull request Aug 21, 2020

parser.apply does not return for a long time even though the progress bar indicates it finishes parsing #494

Closed
