New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Snorkel v0.9.X instead of snorkel-metal #320
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left few comments.
tests/e2e/test_e2e.py
Outdated
@@ -421,14 +421,14 @@ def test_e2e(caplog): | |||
] | |||
labeler.update(split=0, lfs=[stg_temp_lfs_2, ce_v_max_lfs], parallelism=PARALLEL) | |||
assert session.query(Label).count() == 6478 | |||
assert session.query(LabelKey).count() == 16 | |||
assert session.query(LabelKey).count() == 19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why there are 3 more LabelKey here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The total number of unique LFs, len(set(stg_temp_lfs + ce_v_max_lfs + stg_temp_lfs_2))
, is 20, while session.query(LabelKey).all()
does not include LF_temp_outside_table
, as a result session.query(LabelKey).count()
is 19.
Like LF_operating_row
returns 0
(ABSTAIN) for all candidates when ABSTAIN==0
, LF_temp_outside_table
returns 0
(FALSE) for all of them when ABSTAIN==-1 (FALSE==0)
.
Because Labeler
skips records with value==0
, no labels for LF_operating_row
is recorded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Snorkel team standardized their coding strategy and API as follows:
label matrix is dense outside of compute modules
Internal computation for LF analysis and label model is with shifted sparse matrix: csr_matrix(L + 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using -1 as ABSTAIN is not a computation efficient way since most of the values are ABSTAIN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thought here, the total number of LabelKey should be 20 since the 0 means negative (no longer means abstain) and labeler should not ignore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
To avoid any confusion and computational inefficiency, how about stick with snorkel's policy?
ABSTAIN is now represented by None (LFs), -1 (label matrix), or 0 (LFAnalysis, except for when we use L and then it's -1).
snorkel-team/snorkel#1309 (review)
So -1 for user-facing labels and 0 for internal computation and database?
tests/e2e/test_e2e.py
Outdated
|
||
train_marginals = gen_model.predict_proba(L_train[0]) | ||
train_marginals = gen_model.predict_proba(L_train[0].toarray()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use dense format here? It will cause potential overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to the API change at snorkel. They are now mainly use a dense format for matrices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid the conversion (sparse to dense) overhead, labeler.get_label_matrices
should return a dense format as it's been standardized at Snorkel like below:
label matrix is dense outside of compute modules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no reason to return dense matrix here since the label matrix supposes to be sparse. Thoughts?
Besides the several design choices of snorkel, LGTM, can merge. @lukehsiao, thoughts? |
It does seem strange that snorkel switched to non-zero ABSTAIN and dense matrices. Can we chat with them about that? I would imagine that that results in some performance regressions? If it doesn't seem like a major issue, I'm fine with the change. |
@lukehsiao I also wonder the reason behind the switch. I'd appreciate if you could ask them about it. |
585178e
to
e68e2e3
Compare
8d49244
to
1dbe0db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left several minor comments.
Can merge when conflict resolved. |
To leverage the new features of Snorkel, I'd like Fonduer to keep up with it.
This will close #310.