Execute preprocessing and parsing in parallel #439

Merged
3 commits merged into HazyResearch:master from fix/435 on Jun 19, 2020

Conversation

@HiromuHota
Contributor

HiromuHota commented Jun 10, 2020

Description of the problems or issues

Is your pull request related to a problem? Please describe.

Currently, the preprocessor and parser are executed in a completely sequential order,
i.e., preprocess N docs (and load them into a queue), then parse N docs.
This has two drawbacks:

  1. the progress bar shows nothing during preprocessing.
  2. the machine RAM has to be large enough to hold N preprocessed docs at a time.

These drawbacks become more serious when N is large and/or each HTML file is large.

Does your pull request fix any issue?

Fix #435

Description of the proposed changes

A clear and concise description of what you propose.

This PR

  • places a cap on in_queue so that only a certain number of documents are loaded into in_queue at a time.
  • executes the preprocessor and parser in parallel (i.e., the main process does preprocessing while child process(es) do parsing); see the sketch below.
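
Roughly, the new pattern looks like the following. This is a minimal, self-contained sketch of the idea only, not the actual udf.py code; the function names and fake workloads are illustrative.

import time
from multiprocessing import Process, Queue

PARALLEL = 2

def preprocess_docs(n):
    # Stand-in for a document preprocessor: yields one "document" at a time.
    for i in range(n):
        time.sleep(0.01)  # pretend preprocessing takes some time
        yield f"doc-{i}"

def parse_worker(in_queue):
    # Child process: pull documents off the queue and "parse" them.
    while True:
        doc = in_queue.get()
        if doc is None:  # sentinel: no more documents
            break
        time.sleep(0.02)  # pretend parsing takes some time
        print(f"parsed {doc}")

if __name__ == "__main__":
    # Cap the buffer so only a few preprocessed docs sit in memory at once.
    in_queue = Queue(maxsize=PARALLEL * 2)
    workers = [Process(target=parse_worker, args=(in_queue,)) for _ in range(PARALLEL)]
    for w in workers:
        w.start()
    # The main process keeps preprocessing while the workers parse; put() blocks
    # when the queue is full, so preprocessing never runs far ahead of parsing.
    for doc in preprocess_docs(20):
        in_queue.put(doc)
    for _ in workers:
        in_queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()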

Test plan

A clear and concise description of how you test the new changes.

For the 1st issue: I manually checked that the progress bar starts showing progress right after parser.apply is called.

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.rst accordingly.

@codecov-commenter

codecov-commenter commented Jun 17, 2020

Codecov Report

Merging #439 into master will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #439      +/-   ##
==========================================
+ Coverage   83.22%   83.28%   +0.06%     
==========================================
  Files          88       88              
  Lines        4559     4564       +5     
  Branches      837      837              
==========================================
+ Hits         3794     3801       +7     
  Misses        572      572              
+ Partials      193      191       -2     
Flag         Coverage Δ
#unittests   83.28% <100.00%> (+0.06%) ⬆️

Impacted Files                                          Coverage Δ
src/fonduer/utils/udf.py                                88.67% <100.00%> (+1.17%) ⬆️
src/fonduer/parser/visual_linker.py                     84.18% <0.00%> (+0.22%) ⬆️
...c/fonduer/parser/preprocessors/doc_preprocessor.py   88.09% <0.00%> (+2.38%) ⬆️

@HiromuHota changed the title from "Cap the number of docs in in_queue" to "Execute preprocessing and parsing in parallel" on Jun 17, 2020
@HiromuHota
Contributor Author

Tests passed, but they were extremely slow.
While the pytest step took 18m to finish on Linux on the master branch (8edfacd), it took 28m on 26ac90f.

@senwu
Collaborator

senwu commented Jun 17, 2020

Great! How about we make maxsize a config argument so the user can override it?

Is there a way to automatically estimate the maxsize?

@HiromuHota force-pushed the fix/435 branch 2 times, most recently from 3ba328e to e2dbfa8, on June 17, 2020 22:51
Hiromu Hota added 2 commits June 17, 2020 17:32
Currently, the preprocessor and parser are executed in a completely sequential order,
i.e., preprocess N docs (and load them into a queue), then parse N docs.
This has two drawbacks:
  1. the progress bar shows nothing during preprocessing.
  2. the machine RAM may not be large enough to hold N preprocessed docs.
They become more serious when N is large and/or each HTML file is large.
@HiromuHota
Contributor Author

Good point. I think the current magic number (maxsize=parallelism * 2) is good enough, or at least there would be little benefit in optimizing it.

This maxsize is the buffer size between preprocessing and parsing.
If preprocessing is slower than parsing, the in_queue stays close to empty and the Parser will be waiting. In this case there is no point in optimizing maxsize; maxsize=1 would be good enough.
If preprocessing is faster than parsing, the in_queue piles up, up to maxsize. If maxsize is large, it could eat up the whole RAM (#435). Capping the queue with a low maxsize prevents this, but then the Preprocessor will be waiting.
So what should actually be optimized is the speed of both preprocessing and parsing; in other words, they have to be balanced for the best performance and the best use of memory.

Remember that only one process does preprocessing while multiple child process(es) do parsing in parallel, so you might want to parallelize preprocessing too.
But then you would have to find the best parallelism for preprocessing and for parsing separately, which would be a nightmare.
By the time we start thinking about whole-pipeline optimization, we should leverage parallel/distributed computing frameworks like Dask or PySpark instead of using multiprocessing directly.
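
To make the throttling concrete, here is a tiny self-contained illustration (not Fonduer code): a fast producer feeding a slow consumer through a bounded queue. The first couple of put() calls return immediately, after which each put() blocks until the consumer frees a slot.

import time
from multiprocessing import Process, Queue

def slow_consumer(q):
    # Simulates a slow parser: drains roughly one item per second.
    while True:
        item = q.get()
        if item is None:
            break
        time.sleep(1.0)

if __name__ == "__main__":
    q = Queue(maxsize=2)  # a small cap, in the spirit of maxsize = parallelism * 2
    p = Process(target=slow_consumer, args=(q,))
    p.start()
    start = time.time()
    for i in range(5):
        q.put(i)  # blocks once two items are buffered
        print(f"put {i} at t={time.time() - start:.1f}s")
    q.put(None)
    p.join()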

@HiromuHota
Contributor Author

Regarding performance, the current commit (b01a9d5) took 21m in the pytest step on Linux. I think this is within an allowable margin of error. The performance on GitHub Actions fluctuates anyway.

@senwu requested review from lukehsiao and senwu on June 18, 2020 03:29
@HiromuHota
Contributor Author

I can confirm that the first issue has been resolved, but I am still seeing a memory issue.
I thought I was hitting a spaCy memory issue (explosion/spaCy#3618 and explosion/spaCy#4486), but that was fixed in v2.1.9.

@HiromuHota
Contributor Author

The 2nd issue, "the machine RAM has to be large enough to hold N preprocessed docs at a time," turns out not to be the real problem.
I ran only the preprocessing part:

from multiprocessing import Queue

# Unbounded queue: put() never blocks, so this exercises only the preprocessing side.
queue = Queue()
for i, doc in enumerate(doc_preprocessor):
    print(i)
    queue.put(doc)

I can see memory usage increase, but this did not kill the Python process.

Running preprocessing + parsing together, i.e., parser.apply(doc_preprocessor), does kill it.
So I suspected a memory leak somewhere in parsing and updated spaCy from 2.1.9 to 2.2.4 and SQLAlchemy from 1.3.13 to 1.3.17, but the issue remains.
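
For reference, one rough way to watch for this kind of growth is to print the process's peak resident set size between documents. This is an illustrative, self-contained sketch (Unix-only, with a stand-in parse function), not what was actually run here:

import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def parse_one(doc):
    # Stand-in for parsing a single document.
    return doc.upper() * 1000

docs = (f"<html>document {j}</html>" for j in range(100))
for i, doc in enumerate(docs):
    parse_one(doc)
    if i % 10 == 0:
        print(f"doc {i}: peak RSS ~ {peak_rss_mb():.1f} MB")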

@HiromuHota
Contributor Author

HiromuHota commented Jun 18, 2020

Now I have a better picture of the issue.
The memory issue is actually twofold (numbering continues from the 1st issue, the progress bar):
  2. All the preprocessed documents are loaded into memory at once.
  3. The Parser itself is very memory-hungry (e.g., a machine with 1GB of RAM barely processes ONSMS04099-1.html, 1.3MB on disk, from the hardware tutorial).

This PR addresses only the 2nd issue (and the 1st issue, the progress bar).

@HiromuHota
Contributor Author

It is typically difficult to test the parallel execution of multiple things (preprocessing and parsing in this case).
I hard-coded a print in udf.py for testing purposes and got the following output when running test_e2e.py:

Main process put        112823 into in_queue (in_queue.qsize: 0)
1-th worker process got 112823 from in_queue (in_queue.qsize: 0)
Main process put        2N3906-D into in_queue (in_queue.qsize: 0)
0-th worker process got 2N3906-D from in_queue (in_queue.qsize: 0)
Main process put        2N3906 into in_queue (in_queue.qsize: 1)
Main process put        2N4123-D into in_queue (in_queue.qsize: 2)
Main process put        2N4124 into in_queue (in_queue.qsize: 3)
Main process put        2N6426-D into in_queue (in_queue.qsize: 4)
1-th worker process got 2N3906 from in_queue (in_queue.qsize: 4)
Main process put        2N6427 into in_queue (in_queue.qsize: 4)
0-th worker process got 2N4123-D from in_queue (in_queue.qsize: 4)
Main process put        AUKCS04635-1 into in_queue (in_queue.qsize: 4)
1-th worker process got 2N4124 from in_queue (in_queue.qsize: 4)
Main process put        BC182-D into in_queue (in_queue.qsize: 4)
Main process put        BC182 into in_queue (in_queue.qsize: 4)
0-th worker process got 2N6426-D from in_queue (in_queue.qsize: 4)
1-th worker process got 2N6427 from in_queue (in_queue.qsize: 3)
Main process put        BC337-D into in_queue (in_queue.qsize: 4)
0-th worker process got AUKCS04635-1 from in_queue (in_queue.qsize: 4)
Main process put        BC337 into in_queue (in_queue.qsize: 4)
0-th worker process got BC182-D from in_queue (in_queue.qsize: 3)
1-th worker process got BC182 from in_queue (in_queue.qsize: 2)
0-th worker process got BC337-D from in_queue (in_queue.qsize: 1)
1-th worker process got BC337 from in_queue (in_queue.qsize: 0)

in_queue.qsize() returns "the approximate size of the queue" (https://docs.python.org/3/library/queue.html#queue.Queue.qsize), but you can roughly see that the number of documents in in_queue is capped (at 4 in this case) and that preprocessing and parsing are executed in parallel.

@senwu
Collaborator

senwu commented Jun 19, 2020

Great! Glad to have the 1st issue fixed!

For the 3rd issue, what's the bottleneck? spaCy or other parts of the parsing procedure? My guess is that spaCy is the bottleneck here (correct me if I am wrong).

Also, one thing I am aware of is that we create a spaCy model for parsing each document, which is not memory-friendly.
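
For illustration, the memory-friendly pattern with spaCy in general is to load the pipeline once and stream documents through it, rather than constructing a model per document. A minimal sketch (assumes the en_core_web_sm model is installed; not Fonduer's actual code):

import spacy

# Load the pipeline once and reuse it for every document.
nlp = spacy.load("en_core_web_sm")

texts = ["First document text.", "Second document text."]
for doc in nlp.pipe(texts, batch_size=50):
    print(len(doc))  # number of tokens per document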

@HiromuHota marked this pull request as ready for review on June 19, 2020 00:35
@HiromuHota
Contributor Author

Good question. I'd like to address the 3rd issue (the Parser itself is very memory-hungry) in a separate issue/PR.
The culprit could be spaCy, lxml, SQLAlchemy, or the parsed Document itself.

IMO, before addressing the 3rd issue, there should be a guideline for how much memory (and which PARALLEL setting) is recommended for a given HTML file size.
For example, 1GB+ of RAM is a must and 2GB+ is recommended for ~1MB HTML files with PARALLEL=1.
Another example could be: 4GB+ is recommended for ~1MB HTML files with PARALLEL=2.
This recommendation could also be affected by settings such as visual=True or lingual=True.

@senwu
Collaborator

senwu commented Jun 19, 2020

Agreed! Let's address the 3rd issue in another PR. Before giving any recommendation for memory usage, I think we had better do some tests first.

Also, is there a smart queue we could use to control the pool size by estimating the total memory usage of all the documents in the pool?
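
(One possible shape for such a queue, sketched for illustration: bound on estimated bytes rather than item count. This is a thread-based sketch only; a process-based version would need shared state such as multiprocessing.Value and multiprocessing.Condition, and sys.getsizeof gives only a shallow size estimate.)

import sys
import threading
from queue import Queue

class MemoryBoundedQueue:
    # Blocks put() once the estimated bytes buffered exceed a budget.

    def __init__(self, max_bytes):
        self._queue = Queue()
        self._max_bytes = max_bytes
        self._used = 0
        self._cond = threading.Condition()

    def put(self, item, size=None):
        size = sys.getsizeof(item) if size is None else size
        with self._cond:
            # Allow a single oversized item through when nothing is buffered.
            while self._used > 0 and self._used + size > self._max_bytes:
                self._cond.wait()
            self._used += size
        self._queue.put((item, size))

    def get(self):
        item, size = self._queue.get()
        with self._cond:
            self._used -= size
            self._cond.notify_all()
        return item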

@lukehsiao added this to the v0.8.3 milestone on Jun 19, 2020
@lukehsiao added the "enhancement" (New feature or request) label on Jun 19, 2020
Contributor

@lukehsiao left a comment

This is a great step forward. Looking forward to the reduced parser memory usage, too.

@senwu removed their request for review on June 19, 2020 09:02
@senwu self-requested a review on June 19, 2020 09:03
Collaborator

@senwu left a comment

LGTM

@lukehsiao merged commit b0ac254 into HazyResearch:master on Jun 19, 2020
@HiromuHota deleted the fix/435 branch on June 19, 2020 18:18
@HiromuHota
Contributor Author

For future reference, disabling lxml's global ID cache (i.e., collect_ids=False) may help solve the remaining issue (https://benbernardblog.com/tracking-down-a-freaky-python-memory-leak-part-2/).
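
For example, something along these lines (an untested sketch; the filename is a placeholder, and it assumes a recent lxml where HTMLParser accepts the collect_ids option):

from lxml import etree

# Parse with lxml's per-document ID hash table disabled, which the linked
# post identifies as a source of retained memory.
parser = etree.HTMLParser(collect_ids=False)
tree = etree.parse("some_document.html", parser)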
