Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore original index with PandasParallelLFApplier #1589

Open
henryre opened this issue May 17, 2020 · 3 comments
Open

Restore original index with PandasParallelLFApplier #1589

henryre opened this issue May 17, 2020 · 3 comments
Assignees
Labels
no-stale Auto-stale bot skips this issue

Comments

@henryre
Copy link
Member

henryre commented May 17, 2020

Describe the solution you'd like

Using PandasParallelLFApplier includes an index sort on the original DataFrame, which can result in unexpected row order if the index is not sorted when passed in. This should be documented, or the original index order should be restored.

Discussion: https://spectrum.chat/snorkel/help/how-to-use-the-pandasparallelapplier~cf50f563-28e6-418c-93a3-337384566c13

Additional context

Related issues: #1587 #1581 #1524

@bendeaton
Copy link

+1 to this. We just spent several cycles tracking down this issue in a model that uses Snorkel.

@github-actions
Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@henryre henryre added no-stale Auto-stale bot skips this issue and removed no-issue-activity labels Aug 26, 2020
@henryre henryre reopened this Aug 26, 2020
@BenjaminFraser
Copy link

BenjaminFraser commented Jan 6, 2021

Not sure on the exact solution you were looking for on this, but one possibility would be to add an additional default keyword arg to PandasParallelLFApplier on whether to index sort or not.

For example, at line 97 within snorkel/snorkel/labeling/apply/dask.py, dd.from_pandas() is called:

df = dd.from_pandas(df, npartitions=n_parallel)

If you want the index to remain unsorted, and prevent the problem highlighted in this issue, we can simply call dd.from_pandas() like so:

df = dd.from_pandas(df, npartitions=n_parallel, sort=False)

We might not always want this to be false however, since dask makes this by default for performance purposes and for obtaining meaningful divisions (Ref: dask/dask#1428).

Therefore, it could be worth letting users call PandasParallelLFApplier with sort=True / False, as required. Perhaps it could be made clear in the documentation that this sorting occurs by default, and if users do not want this to happen, they should provide sort=False.

Just a suggestion anyway!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-stale Auto-stale bot skips this issue
Projects
None yet
Development

No branches or pull requests

3 participants