Non-deterministic behavior in featurization #412

lukehsiao · 2020-05-06T21:59:53Z

Describe the bug
When working with a large (~7k docs) corpus of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic across runs, especially with parallelism=1 set in the Featurizer. However, we find a small number of differences (e.g., < 5) between feature tables, which result in slightly different sparse matrices and thus slightly different results.

To Reproduce
Running on the HACK transistor dataset will reproduce the error. However, it takes a long time, and we have not yet found a minimal example that reproduces it. Attached are feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.

feature_table.tar.gz

Note that it isn't always one difference, and the difference is not deterministic. The attached difference is just an example.
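To locate such differences without scanning the dumps by hand, a line-by-line comparison is usually enough. The following is a minimal sketch, not part of Fonduer: the helper names `diff_lines` and `diff_dumps` are hypothetical, and it assumes both dumps were exported with the same row ordering, so that equal line numbers refer to the same candidate.

```python
from itertools import zip_longest
from typing import Iterable, List, Optional, Tuple

Diff = Tuple[int, Optional[str], Optional[str]]


def diff_lines(a: Iterable[str], b: Iterable[str]) -> List[Diff]:
    """Return (1-based line number, line_a, line_b) for every differing line.

    A missing line (one dump is shorter than the other) is reported as None.
    """
    diffs: List[Diff] = []
    for lineno, (line_a, line_b) in enumerate(zip_longest(a, b), start=1):
        if line_a != line_b:
            diffs.append((lineno, line_a, line_b))
    return diffs


def diff_dumps(path_a: str, path_b: str) -> List[Diff]:
    """Compare two dump files (hypothetical paths) line by line."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return diff_lines(fa, fb)
```

Calling `diff_dumps` on the two dumps from separate runs would list each differing line number together with both versions of the line.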

Expected behavior
We would expect that these feature tables are identical between runs.

Error Logs/Screenshots
For convenience, here is the differing line in screenshot form

Additional context
If the issue is in the UDF implementation, this might affect the Labeler in addition to the Featurizer, since they share a lot of the UDF code.

HiromuHota · 2020-06-11T16:27:20Z

I'm working on #439, and the tests currently fail (https://github.com/HazyResearch/fonduer/pull/439/checks?check_run_id=759815347) at assertions on the number of feature keys, as shown below.
Mysteriously, only the assertions on the number of feature keys fail; the other assertions pass.

# I refer to this error as 1B
...
        assert num_features == len(train_cands[0]) + len(train_cands[1])
        num_feature_keys = session.query(FeatureKey).count()
>       assert num_feature_keys == 4577
E       assert 4621 == 4577

tests/e2e/test_e2e.py:235: AssertionError

...
# I refer to this error as 2
        # Update features
        featurizer.update(new_docs, parallelism=PARALLEL)
        assert session.query(Feature).count() == len(train_cands[0])
        num_feature_keys = session.query(FeatureKey).count()
>       assert num_feature_keys == 2526
E       assert 2544 == 2526

tests/e2e/test_incremental.py:192: AssertionError

It seems to be reproducible as the same commit fails on my forked repo too (https://github.com/HiromuHota/fonduer/runs/759811429?check_suite_focus=true).

HiromuHota · 2020-06-11T16:46:38Z

One more thing: the same GitHub Actions run (https://github.com/HazyResearch/fonduer/pull/439/checks?check_run_id=759815303) produced a different error on Mac (3.6).

# I refer to this error as 1A
        # Test that FeatureKey is properly reset
        featurizer.apply(split=1, train=True, parallelism=PARALLEL)
>       assert session.query(Feature).count() == 214
E       assert 213 == 214
E        +  where 213 = <bound method Query.count of <sqlalchemy.orm.query.Query object at 0x138f72c50>>()
E        +    where <bound method Query.count of <sqlalchemy.orm.query.Query object at 0x138f72c50>> = <sqlalchemy.orm.query.Query object at 0x138f72c50>.count
E        +      where <sqlalchemy.orm.query.Query object at 0x138f72c50> = <bound method Session.query of <sqlalchemy.orm.session.Session object at 0x1355437b8>>(Feature)
E        +        where <bound method Session.query of <sqlalchemy.orm.session.Session object at 0x1355437b8>> = <sqlalchemy.orm.session.Session object at 0x1355437b8>.query

tests/e2e/test_e2e.py:188: AssertionError

Two differences:

test_e2e.py failed on both Mac (3.6) and Linux (3.7), but on Mac it failed on the number of features rather than the number of feature keys.
test_incremental.py passed on Mac.

HiromuHota · 2020-06-11T17:39:22Z

Looking at two runs (https://github.com/HazyResearch/fonduer/actions/runs/131557275 and https://github.com/HiromuHota/fonduer/actions/runs/131556294), the three types of errors occurred as follows:

Platform  Python 3.6  Python 3.7
Mac       1A          1B
Linux     1A, 2       1B, 2

HiromuHota · 2020-06-11T20:38:55Z

Regarding the 1A type of error, session.query(Feature).count() actually equals the total number of candidates for split=1, i.e., session.query(Feature).count() == len(dev_cands[0]) + len(dev_cands[1]).
On my local Mac, the numbers were 214 == 61 + 153.
But on GitHub Actions, they were 213 == 60 + 153.
Thus, the difference occurs in PartTemp and not in PartVolt.

HiromuHota · 2020-06-11T21:01:21Z

The numbers of constituent mentions of PartTemp (Part and Temp) were the same on my local Mac and on GitHub Actions.
This means temp_throttler drops one extra candidate on GitHub Actions.

def temp_throttler(c):
    (part, attr) = c
    if same_table((part, attr)):
        return is_horz_aligned((part, attr)) or is_vert_aligned((part, attr))
    return True

The difference comes either from same_table, or from is_horz_aligned/is_vert_aligned.
I suspect the non-deterministic behaviour comes from the visual_linker.
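One way to tell which of these predicates is responsible is to record their individual results for every candidate and compare the traces from a passing and a failing run. The following is only a sketch under assumptions: `make_traced_throttler` and the trace format are hypothetical, and the three predicates are passed in as parameters so the example runs without Fonduer (in the tests they would be the same_table, is_horz_aligned and is_vert_aligned functions used above).

```python
from typing import Any, Callable, Dict, List, Tuple

Candidate = Tuple[Any, Any]
Predicate = Callable[[Candidate], bool]


def make_traced_throttler(
    same_table: Predicate,
    is_horz_aligned: Predicate,
    is_vert_aligned: Predicate,
    trace: List[Dict[str, Any]],
) -> Callable[[Candidate], bool]:
    """Build a throttler with the same keep/drop logic as temp_throttler
    that also appends one record per candidate to `trace`."""

    def temp_throttler(c: Candidate) -> bool:
        in_table = same_table(c)
        # Unlike the original, both alignment checks are evaluated (no
        # short-circuit) so the trace shows each result; this assumes the
        # predicates have no side effects.
        horz = is_horz_aligned(c) if in_table else None
        vert = is_vert_aligned(c) if in_table else None
        keep = (bool(horz) or bool(vert)) if in_table else True
        trace.append(
            {"cand": c, "same_table": in_table, "horz": horz, "vert": vert, "keep": keep}
        )
        return keep

    return temp_throttler
```

Comparing the two traces entry by entry would show whether the extra dropped candidate differs in same_table or in one of the alignment checks.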

HiromuHota · 2020-06-13T00:46:10Z

A few updates:

The differing number of candidates reported in my earlier comment on this issue (#412) comes from tests/data/html/2N6427.html.
VisualLinker behaves non-deterministically on this document.

Case 1 (the test assertion assumes the following result from VisualLinker)

[INFO] Extracted 2622 pdf words
[INFO] Extracted 2746 html words
[DEBUG] Global exact matching:
[INFO] (1730/2746) = 0.63
[DEBUG] Local exact matching:
[INFO] (2077/2746) = 0.76
[DEBUG] Local approximate matching:
[INFO] (2205/2746) = 0.80
[DEBUG] Linked 2205/2746 (0.80) html words exactly

Case 2 (this happens sometimes; not sure what makes this happen)

[INFO] Extracted 2622 pdf words
[INFO] Extracted 2746 html words
[DEBUG] Global exact matching:
[INFO] (1730/2746) = 0.63
[DEBUG] Local exact matching:
[INFO] (1994/2746) = 0.73
[DEBUG] Local approximate matching:
[INFO] (2200/2746) = 0.80
[DEBUG] Linked 2200/2746 (0.80) html words exactly

HiromuHota · 2020-06-15T23:58:27Z

One update: the non-deterministic behaviour of VisualLinker comes from the fact that doc.sentences is not sorted.

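I think the backref in the following relationship should specify order_by (a column expression such as the sentence position, rather than a boolean) so that doc.sentences is returned in a defined order.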

fonduer/src/fonduer/parser/models/sentence.py, lines 248 to 252 in 246a92d:

    document = relationship(
        "Document",
        backref=backref("sentences", cascade="all, delete-orphan"),
        foreign_keys=document_id,
    )

Findings that led me to the above statement:

"The local search for exact matches" by link_exact depends on how html_word_list and pdf_word_list are sorted.
The order of words in pdf_word_list depends on how block/line/word (in the output of pdftotext -bbox-layout) is sorted, which looks stable.
The order of words in html_word_list depends on how doc.sentences is sorted.
According to the SQLAlchemy documentation, order_by defaults to False for relationship (and backref), i.e., no ordering is applied unless one is specified (https://docs.sqlalchemy.org/en/13/orm/relationship_api.html#sqlalchemy.orm.relationship).
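To illustrate why the order of sentences can change the number of linked words, here is a toy example. It is explicitly not Fonduer's link_exact: `link_local`, `flatten`, the `(position, words)` sentence shape, the window size, and the sample words are all assumptions made up for illustration. The matcher only looks for an identical PDF word near the previous match, so feeding it the same sentences in a different order yields fewer links.

```python
from typing import Dict, List, Tuple

Sentence = Tuple[int, List[str]]  # (position in document, words); hypothetical shape


def flatten(sentences: List[Sentence]) -> List[str]:
    """Concatenate sentence words in the order the sentences are given."""
    return [word for _, words in sentences for word in words]


def link_local(
    html_words: List[str], pdf_words: List[str], window: int = 2
) -> Dict[int, int]:
    """Toy order-sensitive matcher (NOT Fonduer's link_exact).

    Links each HTML word to the first unused identical PDF word found
    within `window` positions of the point just past the previous match.
    """
    links: Dict[int, int] = {}
    used = set()
    cursor = 0  # PDF index just past the previous match
    for i, word in enumerate(html_words):
        lo = max(0, cursor - window)
        hi = min(len(pdf_words), cursor + window + 1)
        for j in range(lo, hi):
            if j not in used and pdf_words[j] == word:
                links[i] = j
                used.add(j)
                cursor = j + 1
                break
    return links


pdf_words = ["VCE", "40", "V", "IC", "500", "mA", "TJ", "150", "C"]
# The same three sentences, as they might come back without an ORDER BY.
unordered = [(2, ["TJ", "150", "C"]), (0, ["VCE", "40", "V"]), (1, ["IC", "500", "mA"])]
ordered = sorted(unordered, key=lambda s: s[0])  # sort by position

links_unordered = link_local(flatten(unordered), pdf_words)
links_ordered = link_local(flatten(ordered), pdf_words)
print(len(links_unordered), len(links_ordered))  # prints "6 9"
```

In this toy setup, sorting the sentences by position before flattening restores all nine links, which mirrors the idea of giving doc.sentences a defined order.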

HiromuHota · 2020-06-16T01:33:28Z

A few more updates:

To reproduce this, I had to delete and re-create the database (dropdb e2e_test and createdb e2e_test), which is why I could not reproduce it on my local Mac until I did so. This suggests that the order of doc.sentences depends on some internal state of the PostgreSQL database.
Another observation that supports this: I have not seen the non-deterministic behaviour when no database is used (i.e., just using the UDFs).
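The underlying point is that SQL does not guarantee any row order unless the query asks for one. As a minimal sketch of the remedy at the query level, the following uses Python's built-in sqlite3 purely as a stand-in for PostgreSQL so that it runs without a server; the table layout and the `sentences_in_order` helper are hypothetical and much simpler than Fonduer's schema.

```python
import sqlite3
from typing import List, Tuple


def sentences_in_order(conn: sqlite3.Connection, document_id: int) -> List[Tuple[int, str]]:
    """Fetch (position, text) rows for one document with an explicit ORDER BY."""
    return conn.execute(
        "SELECT position, text FROM sentence WHERE document_id = ? ORDER BY position",
        (document_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentence ("
    "id INTEGER PRIMARY KEY, "
    "document_id INTEGER NOT NULL, "
    "position INTEGER NOT NULL, "
    "text TEXT NOT NULL)"
)
# Rows are inserted out of order on purpose.
conn.executemany(
    "INSERT INTO sentence (document_id, position, text) VALUES (?, ?, ?)",
    [(1, 2, "third"), (1, 0, "first"), (1, 1, "second"), (2, 0, "other doc")],
)

print(sentences_in_order(conn, 1))
```

With the ORDER BY in place, the result no longer depends on insertion order or on how the database happens to store the rows.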

lukehsiao added the bug Something isn't working label May 6, 2020

lukehsiao added this to the v0.8.3 milestone May 6, 2020

HiromuHota mentioned this issue Jun 16, 2020

Fix the non-deterministic behavior in VisualLinker #458

Merged

senwu closed this as completed in #458 Jun 17, 2020
