Support for inputting .sam files and .fasta files (with read offset support) #474

Merged
merged 3 commits into from
Aug 5, 2024


davidgicev
Contributor

resolves #416

Summary

PR Checklist

  • All necessary documentation has been adapted or there is an issue to do so.
  • The implemented feature is covered by an appropriate test.

Contributor

github-actions bot commented Jul 9, 2024

This is a preview of the changelog of the next release:

0.2.12 (2024-07-31)

Features

  • bump serialization version to 2 (35b7b62)
  • change base image to ubuntu (64dee0d)
  • support SAM files as sequence input and allow partial sequence input with an offset (607bdde)
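As a rough illustration of what "SAM input with an offset" can mean, the sketch below reads one SAM alignment line and places the read at its mapping position inside a reference-length string. The column layout (QNAME in column 1, 1-based POS in column 4, SEQ in column 10, at least 11 mandatory tab-separated fields) follows the SAM format. Everything else is an assumption made for illustration and is not taken from this PR: the names `SamRecord`, `parseSamLine`, and `placeWithOffset` are hypothetical, the `'N'` fill character is a guess, and CIGAR operations are ignored.

```cpp
#include <algorithm>
#include <charconv>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical record type; not taken from the PR.
struct SamRecord {
   std::string query_name;  // QNAME, column 1
   std::size_t position;    // POS, column 4, 1-based leftmost mapping position
   std::string sequence;    // SEQ, column 10
};

// Splits one tab-separated SAM alignment line and extracts the fields above.
// Returns std::nullopt for empty lines, '@' header lines, lines with fewer
// than 11 fields, or a POS field that is not a plain unsigned integer.
std::optional<SamRecord> parseSamLine(const std::string& line) {
   if (line.empty() || line.front() == '@') {
      return std::nullopt;
   }
   std::vector<std::string> fields;
   std::stringstream stream(line);
   std::string field;
   while (std::getline(stream, field, '\t')) {
      fields.push_back(field);
   }
   if (fields.size() < 11) {
      return std::nullopt;
   }
   const std::string& pos_text = fields[3];
   std::size_t position = 0;
   const char* const end = pos_text.data() + pos_text.size();
   const auto result = std::from_chars(pos_text.data(), end, position);
   if (result.ec != std::errc{} || result.ptr != end) {
      return std::nullopt;
   }
   return SamRecord{fields[0], position, fields[9]};
}

// Assumption: the read is written at 0-based offset POS - 1 and every other
// position is filled with `fill`. Parts of the read that would extend past
// the reference are cut off. CIGAR is deliberately ignored in this sketch.
std::string placeWithOffset(
   const SamRecord& record,
   std::size_t reference_length,
   char fill = 'N'
) {
   std::string padded(reference_length, fill);
   if (record.position == 0) {
      return padded;  // POS 0 marks a read without a mapping position
   }
   const std::size_t offset = record.position - 1;
   if (offset >= reference_length) {
      return padded;
   }
   const std::size_t count = std::min(record.sequence.size(), reference_length - offset);
   padded.replace(offset, count, record.sequence, 0, count);
   return padded;
}
```

With a record whose POS is 3 and whose SEQ is `ACGT`, `placeWithOffset(record, 8)` yields `NNACGTNN`.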

@Taepper Taepper force-pushed the newInputFormats branch 3 times, most recently from 2b36c6a to 1ab4fc9 Compare July 10, 2024 07:29
Collaborator

Taepper commented Jul 10, 2024

note: we need to increase the serialization version!

Contributor

fengelniederhammer left a comment

This is a first round. I'll need to take another look tomorrow. The preprocessing is quite difficult to grasp.

include/silo/file_reader/fasta_reader.h (outdated, resolved)
src/silo/file_reader/fasta_reader.test.cpp (outdated, resolved)
src/silo/file_reader/fasta_reader.cpp (outdated, resolved)
include/silo/file_reader/sam_reader.h (outdated, resolved)
include/silo/file_reader/file_reader.h (outdated, resolved)
src/silo/storage/sequence_store.cpp (outdated, resolved)
include/silo/storage/sequence_store.h (outdated, resolved)
include/silo/storage/sequence_store.h (outdated, resolved)
src/silo/preprocessing/preprocessor.cpp (outdated, resolved)
src/silo/preprocessing/preprocessor.cpp (outdated, resolved)
Contributor

fengelniederhammer left a comment

Overall it looks good IMO. Some minor code style things that should be fixed.

And as I mentioned, the preprocessing is really hard to understand. In the long term we should think of a concept for making the connections between all the DuckDB tables clear. But this is not an issue for this PR.

src/silo/preprocessing/preprocessor.cpp (outdated, resolved)
src/silo/preprocessing/preprocessor.test.cpp (outdated, resolved)
@fengelniederhammer
Contributor

Could you please rebase (and maybe squash already)? This is probably good to be merged then.

@Taepper Taepper force-pushed the newInputFormats branch 2 times, most recently from 7efcbbd to cb380cf Compare July 29, 2024 13:06
Contributor

fengelniederhammer left a comment

LGTM


// optional header data
while ((data.empty() || data.at(0) == '@') && getline(in_file.getInputStream(), data)) {
;
Contributor

Do we need this line? Or could we also simply delete it?

Contributor Author

We definitely need it; some of the files have many header lines, which we need to skip.

Contributor

I meant the ; :)

Collaborator

Taepper commented Jul 30, 2024

After testing, this degrades preprocessing performance (with alpine base) for 2M GenBank sequences from 00:29:34.527 to 00:30:53.022.

For 200k sequences: 00:02:33.373 to 00:02:40.413

@fengelniederhammer
Contributor

> After testing, this degrades preprocessing performance (with alpine base) for 2M GenBank sequences from: 00:29:34.527 to 00:30:53.022.
>
> For 200k sequences: 00:02:33.373 to 00:02:40.413

Let's try it on a full (open) dataset? Those numbers look quite acceptable. It would be good to know how they scale on large datasets.

Collaborator

Taepper commented Jul 30, 2024

I don't think it makes a big difference between 2M and 8M sequences; the general trend should stay the same.

Collaborator

Taepper commented Aug 2, 2024

57 minutes for the open dataset, whereas the previous image needs 51 minutes.

@Taepper Taepper merged commit e897203 into main Aug 5, 2024
10 checks passed
@Taepper Taepper deleted the newInputFormats branch August 5, 2024 07:04
Development

Successfully merging this pull request may close these issues.

Add optional offset to FastaReader
4 participants