Non-deterministic behavior in featurization #412

Closed
lukehsiao opened this issue May 6, 2020 · 8 comments · Fixed by #458

@lukehsiao
Contributor

lukehsiao commented May 6, 2020

Describe the bug
When working with a large (~7k docs) corpus of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic across runs, especially with parallelism=1 set in the Featurizer. However, we find small differences (e.g., fewer than 5) between feature tables, which result in slightly different sparse matrices and thus slightly different results.

To Reproduce
Running on the HACK transistor dataset will reproduce the error. However, it takes a long time, and we have not yet found a minimal example that reproduces it. Attached are feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.

feature_table.tar.gz

Note that there isn't always exactly one difference, and which lines differ is not deterministic. The attached difference is just an example.
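
For anyone comparing their own dumps, a minimal sketch of locating the differing lines follows. It assumes both dumps are plain text with rows in the same order; the function names and the sample rows are hypothetical and not part of Fonduer.

```python
from itertools import zip_longest


def differing_lines(lines_a, lines_b):
    """Return (line_number, a, b) for each 1-indexed line where the dumps differ.

    A missing line in the shorter dump is reported as None.
    """
    diffs = []
    for n, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            diffs.append((n, a, b))
    return diffs


def differing_lines_in_files(path_a, path_b):
    """Same comparison, reading the two dumps from disk."""
    with open(path_a) as fa, open(path_b) as fb:
        return differing_lines(fa.read().splitlines(), fb.read().splitlines())


# Hypothetical rows standing in for two feature table dumps.
run_1 = ["cand_1\tfeat_a", "cand_2\tfeat_b", "cand_3\tfeat_c"]
run_2 = ["cand_1\tfeat_a", "cand_2\tfeat_x", "cand_3\tfeat_c"]
print(differing_lines(run_1, run_2))
```

If the dumps are not guaranteed to list rows in the same order, sorting both before comparing avoids reporting ordering noise as feature differences.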

Expected behavior
We would expect that these feature tables are identical between runs.

Error Logs/Screenshots
For convenience, here is the differing line in screenshot form
image

Additional context
If the issue is in the UDF implementation, this might affect the Labeler in addition to the Featurizer, since they share a lot of the UDF code.

lukehsiao added the bug label on May 6, 2020
lukehsiao added this to the v0.8.3 milestone on May 6, 2020
@HiromuHota
Contributor

HiromuHota commented Jun 11, 2020

I'm working on #439, and the tests currently fail (https://github.com/HazyResearch/fonduer/pull/439/checks?check_run_id=759815347) at assertions on the number of feature keys, as shown below.
Mysteriously, only the assertions on the number of feature keys fail; the other assertions pass.

# I refer to this error as 1B
...
        assert num_features == len(train_cands[0]) + len(train_cands[1])
        num_feature_keys = session.query(FeatureKey).count()
>       assert num_feature_keys == 4577
E       assert 4621 == 4577

tests/e2e/test_e2e.py:235: AssertionError

...
# I refer to this error as 2
        # Update features
        featurizer.update(new_docs, parallelism=PARALLEL)
        assert session.query(Feature).count() == len(train_cands[0])
        num_feature_keys = session.query(FeatureKey).count()
>       assert num_feature_keys == 2526
E       assert 2544 == 2526

tests/e2e/test_incremental.py:192: AssertionError

It seems to be reproducible as the same commit fails on my forked repo too (https://github.com/HiromuHota/fonduer/runs/759811429?check_suite_focus=true).

@HiromuHota
Contributor

HiromuHota commented Jun 11, 2020

One more thing: the same GitHub Actions run (https://github.com/HazyResearch/fonduer/pull/439/checks?check_run_id=759815303) produced a different error on Mac (3.6).

# I refer to this error as 1A
        # Test that FeatureKey is properly reset
        featurizer.apply(split=1, train=True, parallelism=PARALLEL)
>       assert session.query(Feature).count() == 214
E       assert 213 == 214
E        +  where 213 = <bound method Query.count of <sqlalchemy.orm.query.Query object at 0x138f72c50>>()
E        +    where <bound method Query.count of <sqlalchemy.orm.query.Query object at 0x138f72c50>> = <sqlalchemy.orm.query.Query object at 0x138f72c50>.count
E        +      where <sqlalchemy.orm.query.Query object at 0x138f72c50> = <bound method Session.query of <sqlalchemy.orm.session.Session object at 0x1355437b8>>(Feature)
E        +        where <bound method Session.query of <sqlalchemy.orm.session.Session object at 0x1355437b8>> = <sqlalchemy.orm.session.Session object at 0x1355437b8>.query

tests/e2e/test_e2e.py:188: AssertionError

Two differences:

  • test_e2e.py failed on both Mac (3.6) and Linux (3.7), but on Mac it failed on the number of features rather than the number of feature keys.
  • test_incremental.py passed on Mac.

@HiromuHota
Contributor

Looking at two runs (https://github.com/HazyResearch/fonduer/actions/runs/131557275 and https://github.com/HiromuHota/fonduer/actions/runs/131556294), the three types of errors occurred as follows:

OS      Python 3.6   Python 3.7
Mac     1A           1B
Linux   1A, 2        1B, 2

@HiromuHota
Contributor

Regarding the 1A type of error, session.query(Feature).count() actually equals the total number of candidates for split=1, i.e., session.query(Feature).count() == len(dev_cands[0]) + len(dev_cands[1]).
On my local Mac, this was 214 == 61 + 153.
On GitHub Actions, it was 213 == 60 + 153.
Thus, the difference occurs in PartTemp and not in PartVolt.

@HiromuHota
Contributor

The numbers of constituent mentions of PartTemp (Part and Temp) were the same on my local Mac and on GitHub Actions.
This means temp_throttler drops one extra candidate on GitHub Actions.

def temp_throttler(c):
    (part, attr) = c
    if same_table((part, attr)):
        return is_horz_aligned((part, attr)) or is_vert_aligned((part, attr))
    return True

The difference comes either from same_table, or from is_horz_aligned/is_vert_aligned.
I suspect the non-deterministic behaviour comes from the visual_linker.

@HiromuHota
Contributor

A few updates:

  1. The differing number of candidates described in my earlier comment on this issue (#412) comes from tests/data/html/2N6427.html.
  2. VisualLinker behaves non-deterministically.

Case 1 (the test assertion assumes the following result from VisualLinker)

[INFO] Extracted 2622 pdf words
[INFO] Extracted 2746 html words
[DEBUG] Global exact matching:
[INFO] (1730/2746) = 0.63
[DEBUG] Local exact matching:
[INFO] (2077/2746) = 0.76
[DEBUG] Local approximate matching:
[INFO] (2205/2746) = 0.80
[DEBUG] Linked 2205/2746 (0.80) html words exactly

Case 2 (this happens sometimes; I'm not sure what makes it happen)

[INFO] Extracted 2622 pdf words
[INFO] Extracted 2746 html words
[DEBUG] Global exact matching:
[INFO] (1730/2746) = 0.63
[DEBUG] Local exact matching:
[INFO] (1994/2746) = 0.73
[DEBUG] Local approximate matching:
[INFO] (2200/2746) = 0.80
[DEBUG] Linked 2200/2746 (0.80) html words exactly

@HiromuHota
Contributor

One update: the non-deterministic behaviour of VisualLinker comes from the fact that doc.sentences is not sorted.

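I think the following relationship should pass an order_by argument to backref. (Note that SQLAlchemy expects a column expression for order_by rather than True; the default, False, means no ordering is applied.)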

document = relationship(
    "Document",
    backref=backref("sentences", cascade="all, delete-orphan"),
    foreign_keys=document_id,
)

Findings that lead me to the above statement:

  • "The local search for exact matches" by link_exact depends on how html_word_list and pdf_word_list are sorted.
  • The order of words in pdf_word_list depends on how block/line/word (in the output of pdftotext -bbox-layout) is sorted, which looks stable.
  • The order of words in html_word_list depends on how doc.sentences is sorted.
  • According to the SQLAlchemy docs, order_by defaults to False for relationship (and backref), so no ordering is applied when the collection is loaded (https://docs.sqlalchemy.org/en/13/orm/relationship_api.html#sqlalchemy.orm.relationship)
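
To illustrate why a local search can be sensitive to sentence order, here is a toy sketch. It is not Fonduer's link_exact; link_local, the window size, and the sample words are all hypothetical. It only shows that a position-windowed matcher links fewer words when the HTML sentences arrive in a different order than the PDF words.

```python
def link_local(html_words, pdf_words, window=2):
    """Toy local search (NOT Fonduer's link_exact).

    Each HTML word is linked to the first unused, identical PDF word found
    within `window` positions of the HTML word's own index.
    """
    used = set()
    links = {}
    for i, word in enumerate(html_words):
        lo = max(0, i - window)
        hi = min(len(pdf_words), i + window + 1)
        for j in range(lo, hi):
            if j not in used and pdf_words[j] == word:
                used.add(j)
                links[i] = j
                break
    return links


def flatten(sentences):
    """Concatenate sentences (lists of words) into one word list."""
    return [word for sentence in sentences for word in sentence]


# Hypothetical sentences; the PDF word list follows the document order.
sentences = [["part", "number"], ["max", "temp"], ["min", "volt"]]
pdf_words = flatten(sentences)

in_order = flatten(sentences)
reordered = flatten([sentences[2], sentences[0], sentences[1]])

print(len(link_local(in_order, pdf_words)))   # sentences in document order
print(len(link_local(reordered, pdf_words)))  # same sentences, different order
```

With the sentences in document order every word is linked, while the reordered input leaves the words of the displaced sentence outside their search window, which mirrors the drop in local exact matches between Case 1 and Case 2.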

@HiromuHota
Contributor

HiromuHota commented Jun 16, 2020

A few more updates:

  • To reproduce this, I had to delete and re-create the database (dropdb e2e_test and createdb e2e_test), which is why I could not reproduce it on my local Mac until I did so. This suggests that the order of doc.sentences depends on some internal state of the Postgres database.
  • Another observation supporting this: I have not seen the non-deterministic behaviour when no database is used (i.e., when just using UDFs).
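
Since a database gives no ordering guarantee unless one is requested, one way to pin the order of doc.sentences is to give the backref an explicit order_by column. The sketch below uses simplified, hypothetical Document and Sentence models, assumes a position column exists to sort on, and runs against in-memory SQLite rather than Postgres. It illustrates the idea and is not necessarily the change made in #458.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import backref, relationship, sessionmaker

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # SQLAlchemy 1.3
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Document(Base):
    """Simplified stand-in for a document model (hypothetical)."""

    __tablename__ = "document"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Sentence(Base):
    """Simplified stand-in for a sentence model (hypothetical)."""

    __tablename__ = "sentence"
    id = Column(Integer, primary_key=True)
    document_id = Column(Integer, ForeignKey("document.id"))
    position = Column(Integer, nullable=False)  # assumed sort key
    text = Column(String)

    document = relationship(
        "Document",
        # order_by takes a column, so doc.sentences loads sorted by position.
        backref=backref("sentences", order_by=position, cascade="all, delete-orphan"),
        foreign_keys=document_id,
    )


def load_positions():
    """Insert sentences out of order and return the positions as loaded."""
    engine = create_engine("sqlite://")  # in-memory database
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()
    doc = Document(name="2N6427")
    for pos in (2, 0, 3, 1):
        session.add(Sentence(document=doc, position=pos, text=f"s{pos}"))
    session.commit()
    session.expire_all()  # force a reload from the database
    doc = session.query(Document).one()
    return [s.position for s in doc.sentences]


print(load_positions())
```

The sentences are inserted out of order on purpose, so the loaded order reflects the ORDER BY clause that the relationship adds rather than insertion order.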
