Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Snorkel v0.9.X instead of snorkel-metal #320

Merged
merged 22 commits into from Nov 7, 2019

Conversation

HiromuHota
Copy link
Contributor

To leverage the new features of Snorkel, I'd like Fonduer to keep up with it.
This will close #310.

Copy link
Collaborator

@senwu senwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left few comments.

@@ -421,14 +421,14 @@ def test_e2e(caplog):
]
labeler.update(split=0, lfs=[stg_temp_lfs_2, ce_v_max_lfs], parallelism=PARALLEL)
assert session.query(Label).count() == 6478
assert session.query(LabelKey).count() == 16
assert session.query(LabelKey).count() == 19
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there are 3 more LabelKey here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The total number of unique LFs, len(set(stg_temp_lfs + ce_v_max_lfs + stg_temp_lfs_2)), is 20, while session.query(LabelKey).all() does not include LF_temp_outside_table, as a result session.query(LabelKey).count() is 19.

Like LF_operating_row returns 0 (ABSTAIN) for all candidates when ABSTAIN==0, LF_temp_outside_table returns 0 (FALSE) for all of them when ABSTAIN==-1 (FALSE==0).

Because Labeler skips records with value==0, no labels for LF_operating_row is recorded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Snorkel team standardized their coding strategy and API as follows:

label matrix is dense outside of compute modules
Internal computation for LF analysis and label model is with shifted sparse matrix: csr_matrix(L + 1)

snorkel-team/snorkel#1309 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using -1 as ABSTAIN is not a computation efficient way since most of the values are ABSTAIN.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thought here, the total number of LabelKey should be 20 since the 0 means negative (no longer means abstain) and labeler should not ignore it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
To avoid any confusion and computational inefficiency, how about stick with snorkel's policy?

ABSTAIN is now represented by None (LFs), -1 (label matrix), or 0 (LFAnalysis, except for when we use L and then it's -1).
snorkel-team/snorkel#1309 (review)

So -1 for user-facing labels and 0 for internal computation and database?


train_marginals = gen_model.predict_proba(L_train[0])
train_marginals = gen_model.predict_proba(L_train[0].toarray())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use dense format here? It will cause potential overhead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to the API change at snorkel. They are now mainly use a dense format for matrices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the conversion (sparse to dense) overhead, labeler.get_label_matrices should return a dense format as it's been standardized at Snorkel like below:

label matrix is dense outside of compute modules

Copy link
Collaborator

@senwu senwu Sep 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reason to return dense matrix here since the label matrix supposes to be sparse. Thoughts?

tests/shared/hardware_lfs.py Outdated Show resolved Hide resolved
src/fonduer/utils/utils_udf.py Outdated Show resolved Hide resolved
@senwu
Copy link
Collaborator

senwu commented Sep 30, 2019

Besides the several design choices of snorkel, LGTM, can merge. @lukehsiao, thoughts?

@HiromuHota HiromuHota marked this pull request as ready for review September 30, 2019 21:55
@lukehsiao
Copy link
Contributor

It does seem strange that snorkel switched to non-zero ABSTAIN and dense matrices. Can we chat with them about that? I would imagine that that results in some performance regressions?

If it doesn't seem like a major issue, I'm fine with the change.

@HiromuHota
Copy link
Contributor Author

@lukehsiao I also wonder the reason behind the switch. I'd appreciate if you could ask them about it.

@lukehsiao lukehsiao changed the title Use Snorkel of v0.9.X instead of snorkel-metal Use Snorkel v0.9.X instead of snorkel-metal Oct 11, 2019
Copy link
Collaborator

@senwu senwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left several minor comments.

setup.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
src/fonduer/learning/classifier.py Outdated Show resolved Hide resolved
src/fonduer/learning/classifier.py Outdated Show resolved Hide resolved
src/fonduer/supervision/labeler.py Outdated Show resolved Hide resolved
src/fonduer/supervision/labeler.py Show resolved Hide resolved
src/fonduer/supervision/labeler.py Show resolved Hide resolved
src/fonduer/supervision/labeler.py Show resolved Hide resolved
src/fonduer/supervision/labeler.py Show resolved Hide resolved
@senwu
Copy link
Collaborator

senwu commented Oct 25, 2019

Can merge when conflict resolved.

@lukehsiao lukehsiao added this to the v0.8.0 milestone Oct 29, 2019
@senwu senwu merged commit 5016cd0 into HazyResearch:master Nov 7, 2019
@HiromuHota HiromuHota deleted the snorkel/0.9 branch November 7, 2019 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make -1 as abstain to be consistent with snorkel v0.9.0
3 participants