Fix "idx" bug in split_data_by_length.py of BGE-M3 #601

Merged
staoxiao merged 1 commit into FlagOpen:master from nntoan209:fix-split-data
Mar 24, 2024
@nntoan209
Contributor

In split_data_by_length.py inside BGE-M3, after the dataset is filtered on the "max_length" field, the values in the "idx" field are somehow changed, so split_dataset = dataset.select(idxs["idx"]) selects the wrong rows.
To deal with this issue, I suggest using the real list of row indices of the filtered dataset, given by list(idxs._indices.to_pandas()['indices'].values).
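
To illustrate the failure mode, here is a minimal, dependency-free sketch. It is not the code from split_data_by_length.py: the rows, the select() helper, and in_range() are hypothetical stand-ins for the `datasets` objects, and the sketch simply assumes that the stored "idx" values have drifted from the true row positions (the PR does not say why that happens).

```python
# Toy stand-in for a dataset: a list of rows. The real script uses the
# Hugging Face `datasets` library; plain lists are an assumption made here
# so the sketch runs without dependencies.
rows = [
    {"idx": 2, "text": "a", "max_length": 10},   # stored idx != position 0
    {"idx": 0, "text": "b", "max_length": 600},  # stored idx != position 1
    {"idx": 1, "text": "c", "max_length": 20},   # stored idx != position 2
]


def select(data, positions):
    """Return the rows at the given positions (mimics a positional select)."""
    return [data[p] for p in positions]


def in_range(row, lo, hi):
    """True when the row's max_length falls in the half-open range [lo, hi)."""
    return lo <= row["max_length"] < hi


lo, hi = 0, 500

# Buggy pattern: trust the stored "idx" values of the rows that pass the filter.
stored_idx = [row["idx"] for row in rows if in_range(row, lo, hi)]
buggy = select(rows, stored_idx)

# Safer pattern: use the real positions of the rows that pass the filter.
positions = [i for i, row in enumerate(rows) if in_range(row, lo, hi)]
fixed = select(rows, positions)

print([r["text"] for r in buggy])  # includes "b", whose max_length is out of range
print([r["text"] for r in fixed])  # only rows that satisfy the length range
```

The fix proposed in this PR follows the same idea, but reads the positions from the filtered dataset's private _indices attribute instead of recomputing them. Because that attribute is private, it may change between library versions.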

@staoxiao
Collaborator

staoxiao commented Mar 24, 2024

Thanks! We have further fixed the bug in #605.

@staoxiao staoxiao merged commit f961f12 into FlagOpen:master Mar 24, 2024