Dev #4

Merged
merged 15 commits into from
Jul 10, 2023


Contributor

@rob-p rob-p commented Jul 10, 2023

Several important changes and fixes:

  • Fixes bug that would cause roers to fail with the --dedup option (due to a file OpenMode mistake).
  • Matches functionality of pyroe by writing a duplicate_entries.tsv file containing, for each deduplicated (removed) sequence, the (retained, duplicate) pair.
  • Vastly simplifies handling of sequence duplicates.
  • Substantially improves (reduces) memory usage when deduplicating sequences.
  • Prevents the IDs of duplicate sequences from being written to the t2g or t2g_3col files.
  • Ensures a consistent deduplication order across runs by sorting the feature data frame prior to deduplicating (this was an issue because polars, unlike the pandas that backs pyroe, is multithreaded).
  • Ensures that if an unspliced sequence is a duplicate of a spliced sequence (e.g. single-exon genes when preparing a spliceu reference), the spliced sequence is retained and the unspliced sequence is removed as the duplicate.
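The ordering and retention rules in the list above can be sketched with the standard library alone. This is a minimal illustration, not the roers implementation (which works on a polars data frame): the `SeqRecord` type and `dedup_records` function are hypothetical names, and sorting spliced-first and then by name is an assumed sort key chosen only to make the result independent of input order.

```rust
use std::collections::HashMap;

/// Hypothetical record type for illustration; not the roers data model.
#[derive(Debug, Clone, PartialEq)]
pub struct SeqRecord {
    pub name: String,
    pub seq: String,
    pub is_spliced: bool,
}

/// Deduplicate records by sequence content.
/// Returns the retained records and one (retained, duplicate) name pair
/// for every record that was removed.
pub fn dedup_records(mut records: Vec<SeqRecord>) -> (Vec<SeqRecord>, Vec<(String, String)>) {
    // Sort first so the outcome does not depend on input order:
    // spliced records come before unspliced ones, ties are broken by name.
    records.sort_by(|a, b| {
        b.is_spliced
            .cmp(&a.is_spliced)
            .then_with(|| a.name.cmp(&b.name))
    });

    // Maps each sequence to the name of the record retained for it.
    let mut first_seen: HashMap<String, String> = HashMap::new();
    let mut retained = Vec::new();
    let mut duplicates = Vec::new();

    for rec in records {
        match first_seen.get(&rec.seq) {
            // Sequence already seen: record the (retained, duplicate) pair.
            Some(kept) => duplicates.push((kept.clone(), rec.name)),
            // First occurrence after sorting: keep it.
            None => {
                first_seen.insert(rec.seq.clone(), rec.name.clone());
                retained.push(rec);
            }
        }
    }
    (retained, duplicates)
}
```

Because spliced records sort first, a spliced transcript and an unspliced sequence with identical bases always resolve to the spliced one being kept, whatever order they arrive in.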

Rob Patro and others added 15 commits July 5, 2023 16:54
…entries.tsv

* roers would not run when the --dedup-seqs flag was passed because the output
  file was created in an inconsistent mode (no .write(true), but .append(false)).
  This is fixed and the file is now created with .write(true).truncate(true).

* write out a duplicate_entries.tsv file when --dedup-seqs is used to track the
  name of the deduplicated (removed) sequence and the retained sequence to which
  it maps (matching the behavior of pyroe).
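As a rough sketch of the two points above, the following uses `std::fs::OpenOptions` to show an open mode that fails and one that works, then writes the pairs as tab-separated lines. The function names are hypothetical, the exact call chain in roers is not shown in this commit message, and the headerless two-column layout is an assumption rather than the documented format of duplicate_entries.tsv.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Illustrates the failing mode: asking for creation without write or
/// append access is rejected by the standard library with an error.
pub fn open_without_write_access(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(false).open(path)
}

/// Writes one "retained<TAB>duplicate" line per pair (assumed layout).
/// The file is created if missing and truncated if it already exists.
pub fn write_duplicate_entries(path: &Path, pairs: &[(String, String)]) -> io::Result<()> {
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    let mut writer = BufWriter::new(file);
    for (retained, duplicate) in pairs {
        writeln!(writer, "{}\t{}", retained, duplicate)?;
    }
    // Flush explicitly so write errors surface here rather than on drop.
    writer.flush()
}
```

Truncating on open means a rerun into the same output directory replaces the previous file instead of leaving stale trailing lines behind.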

TODO: Come up with a more considered policy of which sequence to keep and which to
eliminate when deduplicating. Also, decide if we should deduplicate *across* sequence
types (spliced/unspliced) or only within.
@rob-p rob-p merged commit 9d33201 into main Jul 10, 2023