merge_sets creates duplicates in annotations.csv #384

Closed
William-N-Havard opened this issue Jul 28, 2022 · 4 comments · Fixed by #385
Labels
bug Something isn't working enhancement New feature or request

Comments

@William-N-Havard

Running merge_sets more than once with the same arguments (see below) creates duplicate lines in annotations.csv.

am.merge_sets(
    left_set="vtc",
    right_set="alice",
    left_columns=["speaker_type"],
    right_columns=["phonemes", "syllables", "words"],
    output_set="alice_vtc",
)

Duplicate lines in annotations.csv when running merge_sets twice

alice_vtc,14T_0_20220422_151000.wav,0,0,444797,"all.rttm,ALICE_output_utterances.txt",,14T_0_20220422_151000,14T_0_20220422_151000_0_444797.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,234T_0_20220425_135300.wav,0,0,413106,"all.rttm,ALICE_output_utterances.txt",,234T_0_20220425_135300,234T_0_20220425_135300_0_413106.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220321_095400.wav,0,0,120857,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220321_095400,2_FG_20220321_095400_0_120857.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220325_074500.wav,0,0,239157,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220325_074500,2_FG_20220325_074500_0_239157.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220419_195000.wav,0,0,187617,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220419_195000,2_FG_20220419_195000_0_187617.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220419_195059.wav,0,0,344957,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220419_195059,2_FG_20220419_195059_0_344957.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,6T_0_20220425_135300.wav,0,0,383117,"all.rttm,ALICE_output_utterances.txt",,6T_0_20220425_135300,6T_0_20220425_135300_0_383117.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,12T_0_20220425_185200.wav,0,0,145197,"all.rttm,ALICE_output_utterances.txt",,12T_0_20220425_185200,12T_0_20220425_185200_0_145197.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,23T_0_20220425_135300.wav,0,0,101437,"all.rttm,ALICE_output_utterances.txt",,23T_0_20220425_135300,23T_0_20220425_135300_0_101437.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,14T_0_20220422_151000.wav,0,0,444797,"all.rttm,ALICE_output_utterances.txt",,14T_0_20220422_151000,14T_0_20220422_151000_0_444797.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,234T_0_20220425_135300.wav,0,0,413106,"all.rttm,ALICE_output_utterances.txt",,234T_0_20220425_135300,234T_0_20220425_135300_0_413106.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220321_095400.wav,0,0,120857,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220321_095400,2_FG_20220321_095400_0_120857.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220325_074500.wav,0,0,239157,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220325_074500,2_FG_20220325_074500_0_239157.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220419_195000.wav,0,0,187617,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220419_195000,2_FG_20220419_195000_0_187617.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220419_195059.wav,0,0,344957,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220419_195059,2_FG_20220419_195059_0_344957.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,6T_0_20220425_135300.wav,0,0,383117,"all.rttm,ALICE_output_utterances.txt",,6T_0_20220425_135300,6T_0_20220425_135300_0_383117.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,12T_0_20220425_185200.wav,0,0,145197,"all.rttm,ALICE_output_utterances.txt",,12T_0_20220425_185200,12T_0_20220425_185200_0_145197.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,23T_0_20220425_135300.wav,0,0,101437,"all.rttm,ALICE_output_utterances.txt",,23T_0_20220425_135300,23T_0_20220425_135300_0_101437.csv,2022-07-28 11:58:41,0.0.5,

The existing lines for the output set should be dropped before (re)merging the sets, and only then should the resulting new annotation lines be added.
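A minimal sketch of that drop-then-append step, using pandas on a toy index. The helper name `replace_set_rows` is hypothetical and is not part of the package; the column names (`set`, `recording_filename`, `range_onset`, `range_offset`) are assumed from the rows shown above.

```python
import pandas as pd


def replace_set_rows(index: pd.DataFrame, new_rows: pd.DataFrame, output_set: str) -> pd.DataFrame:
    """Return the index with output_set's old rows replaced by new_rows.

    Hypothetical helper for illustration only.
    """
    # Drop every row that belongs to the set about to be regenerated.
    kept = index[index["set"] != output_set]
    # Append the freshly merged rows.
    return pd.concat([kept, new_rows], ignore_index=True)


# Toy index with a single source-set row (column names are assumptions).
index = pd.DataFrame(
    {
        "set": ["vtc"],
        "recording_filename": ["2_FG_20220321_095400.wav"],
        "range_onset": [0],
        "range_offset": [120857],
    }
)

# Rows produced by one merge into the output set.
merged_rows = pd.DataFrame(
    {
        "set": ["alice_vtc"],
        "recording_filename": ["2_FG_20220321_095400.wav"],
        "range_onset": [0],
        "range_offset": [120857],
    }
)

once = replace_set_rows(index, merged_rows, "alice_vtc")
twice = replace_set_rows(once, merged_rows, "alice_vtc")
print(len(once), len(twice))  # same length after the second merge
```

With this ordering, repeating the merge leaves the index the same size instead of appending a second copy of each line.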

@William-N-Havard added the bug and enhancement labels on Jul 28, 2022
@LoannPeurey
Contributor

The merge overwrites the previously generated set files without requiring an 'overwrite' argument and without any warning, which could cause problems if one is not careful. We could enforce that output_set must not already exist. Can a set be made up of several different merges? (It sounds to me like each merge should have its own set.)
In this case, rerunning your merge would require removing the previously generated set first (or we could add a 'replace-set' argument to the merging function that performs the removal before merging).

What do you think?
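The two options above (refuse an existing output set, or replace it on request) could be sketched roughly as follows. The function `check_output_set` and its `replace` flag are hypothetical names for illustration, not the package's actual API.

```python
def check_output_set(existing_sets, output_set, replace=False):
    """Decide whether a merge may write to output_set.

    Hypothetical guard: returns True when old rows and files must be
    removed before merging, False when the set does not exist yet.
    """
    if output_set not in existing_sets:
        return False
    if not replace:
        # Refuse to silently overwrite a previously generated set.
        raise ValueError(
            f"set '{output_set}' already exists; remove it first or pass replace=True"
        )
    return True


existing = {"vtc", "alice", "alice_vtc"}

try:
    check_output_set(existing, "alice_vtc")
except ValueError as err:
    print(err)

# With replace=True the caller is told to clean up before merging.
needs_cleanup = check_output_set(existing, "alice_vtc", replace=True)
```

Raising by default keeps the destructive path opt-in, which matches the concern about overwriting without a warning.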

@William-N-Havard
Author

Yes, I think the best option would be to raise an error if the set already exists, so that the user deletes it first and then re-merges.
Another problem with set merging is that the resulting set can become outdated if new files are added to one of the sets that was used for the merge.
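One partial check for the outdated case is to compare recordings: any recording annotated in both source sets but absent from the merged set suggests the merge predates it. This is only a sketch. The helper `missing_from_merged` is hypothetical, the column names are assumed from the rows shown earlier, and it assumes a merge covers recordings present in both source sets. It does not detect changes to recordings that were already merged.

```python
import pandas as pd


def missing_from_merged(index, left_set, right_set, merged_set):
    """Return recordings found in both source sets but not in merged_set."""

    def recordings(name):
        return set(index.loc[index["set"] == name, "recording_filename"])

    # Assumption: the merge is expected to cover recordings shared by both sources.
    expected = recordings(left_set) & recordings(right_set)
    return expected - recordings(merged_set)


index = pd.DataFrame(
    {
        "set": ["vtc", "alice", "alice_vtc", "vtc", "alice"],
        "recording_filename": [
            "2_FG_20220321_095400.wav",
            "2_FG_20220321_095400.wav",
            "2_FG_20220321_095400.wav",
            "2_FG_20220325_074500.wav",  # added to both sources after the merge
            "2_FG_20220325_074500.wav",
        ],
    }
)

stale = missing_from_merged(index, "vtc", "alice", "alice_vtc")
print(stale)
```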

@LoannPeurey
Contributor

Yeah, tracking down outdated sets will be kind of hard to do.

@LoannPeurey
Contributor

#385
