Support for inputting .sam files and .fasta files (with read offset support) #474

davidgicev · 2024-06-10T13:59:45Z

resolves #416

Summary

PR Checklist

All necessary documentation has been adapted or there is an issue to do so.
The implemented feature is covered by an appropriate test.

github-actions · 2024-07-09T13:06:25Z

This is a preview of the changelog of the next release:

0.2.12 (2024-07-31)

Features

bump serialization version to 2 (35b7b62)
change base image to ubuntu (64dee0d)
support SAM files as sequence input and allow partial sequence input with an offset (607bdde)
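
For readers unfamiliar with SAM, here is a minimal sketch of how one alignment line could be turned into a key, an offset, and a partial sequence. The column order (QNAME first, POS fourth, SEQ tenth, eleven mandatory tab-separated fields) and the 1-based POS come from the SAM specification. The `SamRecord` struct, the `parseSamLine` function, and the choice to store a 0-based offset are assumptions made for illustration and are not taken from this PR.

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type; not the type used in the PR.
struct SamRecord {
   std::string key;       // QNAME (column 1)
   std::size_t offset;    // 0-based start, derived from the 1-based POS (column 4)
   std::string sequence;  // SEQ (column 10)
};

// Hypothetical helper: parses one SAM alignment line.
// Returns std::nullopt for header lines, blank lines, malformed lines,
// and unmapped reads (POS == 0).
std::optional<SamRecord> parseSamLine(const std::string& line) {
   if (line.empty() || line.front() == '@') {
      return std::nullopt;
   }
   std::vector<std::string> fields;
   std::istringstream stream(line);
   std::string field;
   while (std::getline(stream, field, '\t')) {
      fields.push_back(field);
   }
   if (fields.size() < 11) {
      return std::nullopt;  // SAM requires 11 mandatory fields
   }
   std::size_t position = 0;
   try {
      position = std::stoul(fields[3]);
   } catch (const std::exception&) {
      return std::nullopt;  // POS is not a number
   }
   if (position == 0) {
      return std::nullopt;  // POS 0 marks a read without a mapping position
   }
   return SamRecord{fields[0], position - 1, fields[9]};
}
```

Under these assumptions, a read aligned at POS 5 would be stored with offset 4, so its first base lines up with the fifth position of the reference.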

Taepper · 2024-07-10T08:08:34Z

note: we need to increase the serialization version!

fengelniederhammer

This is a first round. I'll need to take another look tomorrow. The preprocessing is quite difficult to grasp.

include/silo/file_reader/fasta_reader.h

src/silo/file_reader/fasta_reader.test.cpp

src/silo/file_reader/fasta_reader.cpp

include/silo/file_reader/sam_reader.h

include/silo/file_reader/file_reader.h

src/silo/storage/sequence_store.cpp

include/silo/storage/sequence_store.h

src/silo/preprocessing/preprocessor.cpp

fengelniederhammer

Overall it looks good IMO. Some minor code style things that should be fixed.

And as I mentioned, the preprocessing is really hard to understand. In the long term, we should come up with a concept for making the connections between all the DuckDB tables clear. But that is outside the scope of this PR.

src/silo/preprocessing/preprocessor.cpp

src/silo/preprocessing/preprocessor.test.cpp

src/silo/file_reader/sam_reader.test.cpp

fengelniederhammer · 2024-07-29T12:16:34Z

Could you please rebase (and maybe squash already)? This is probably good to be merged then.

fengelniederhammer

LGTM

fengelniederhammer · 2024-07-29T14:35:57Z

src/silo/file_reader/sam_reader.cpp

+
+   // optional header data
+   while ((data.empty() || data.at(0) == '@') && getline(in_file.getInputStream(), data)) {
+      ;


Do we need this line? Or could we also simply delete it?

davidgicev · 2024-07-31

We definitely need it; some of the files have many header lines, which we need to skip.

fengelniederhammer · 2024-08-02

I meant the ; :)

Taepper · 2024-07-30T07:22:58Z

After testing, this degrades preprocessing performance (with alpine base) for 2M GenBank sequences from 00:29:34.527 to 00:30:53.022.

For 200k sequences: 00:02:33.373 to 00:02:40.413
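
To put the two measurements on a common scale, the timings can be converted to seconds and compared as a relative slowdown. This is a small sketch; `toSeconds` and `relativeSlowdown` are hypothetical helper names, and the only inputs intended are the timings quoted above.

```cpp
#include <cstdio>
#include <string>

// Converts a "HH:MM:SS.mmm" timestamp into seconds. Assumes well-formed input.
double toSeconds(const std::string& timestamp) {
   int hours = 0;
   int minutes = 0;
   double seconds = 0.0;
   std::sscanf(timestamp.c_str(), "%d:%d:%lf", &hours, &minutes, &seconds);
   return hours * 3600.0 + minutes * 60.0 + seconds;
}

// Relative slowdown of `after` compared to `before`; 0.05 means 5 % slower.
double relativeSlowdown(const std::string& before, const std::string& after) {
   const double before_seconds = toSeconds(before);
   return (toSeconds(after) - before_seconds) / before_seconds;
}
```

With the numbers above, both the 2M and the 200k run come out at a slowdown between 4 % and 5 %.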

fengelniederhammer · 2024-07-30T13:16:51Z

After testing, this degrades preprocessing performance (with alpine base) for 2M GenBank sequences from: 00:29:34.527 to 00:30:53.022.

For 200k sequences: 00:02:33.373 to 00:02:40.413

Let's try it on a full (open) dataset. Those numbers look quite acceptable. It would be good to know how they scale on large datasets.

Taepper · 2024-07-30T15:16:14Z

I don't think it makes a big difference between 2M and 8M sequences; the general trend should stay the same.

Taepper · 2024-08-02T07:47:09Z

57 minutes for the open dataset, whereas the image from before this change needs 51 minutes.

davidgicev requested a review from Taepper June 10, 2024 13:59

davidgicev linked an issue Jun 10, 2024 that may be closed by this pull request

Add optional offset to FastaReader #416

Closed

davidgicev force-pushed the newInputFormats branch from 994c618 to 40d9217 June 24, 2024 21:04

Taepper force-pushed the newInputFormats branch from 2cbd26c to 0867963 July 9, 2024 13:05

Taepper force-pushed the newInputFormats branch 3 times, most recently from 2b36c6a to 1ab4fc9 July 10, 2024 07:29

Taepper requested a review from fengelniederhammer July 10, 2024 07:29

fengelniederhammer reviewed Jul 10, 2024

fengelniederhammer requested changes Jul 11, 2024

src/silo/preprocessing/preprocessor.cpp (outdated)

src/silo/preprocessing/preprocessor.test.cpp (outdated)

fengelniederhammer reviewed Jul 22, 2024

src/silo/file_reader/sam_reader.test.cpp (outdated)

Taepper requested a review from fengelniederhammer July 26, 2024 09:04

Taepper force-pushed the newInputFormats branch 2 times, most recently from 7efcbbd to cb380cf July 29, 2024 13:06

fengelniederhammer approved these changes Jul 30, 2024

Taepper force-pushed the newInputFormats branch from ff5734d to ce39ad0 July 30, 2024 08:53

David Gichev and others added 3 commits July 31, 2024 17:25

feat: support SAM files as sequence input and allow partial sequence input with an offset

607bdde

chore: make github linter action not run on deleted files

57d40a4

feat: bump serialization version to 2

35b7b62

Taepper force-pushed the newInputFormats branch from ce39ad0 to 35b7b62 July 31, 2024 15:25

Taepper merged commit e897203 into main Aug 5, 2024
10 checks passed

Taepper deleted the newInputFormats branch August 5, 2024 07:04
