Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A new step call fastqprocessing for Optimus to speed up #82

Merged
merged 18 commits into from
Sep 28, 2020

Conversation

kishorikonwar
Copy link
Contributor

Purpose

To speed-up Optimus for large datasets

Copy link
Contributor

@jessicaway jessicaway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments and suggested changes, but most importantly I think we need to find a different solution that does not include checking in all the libStatGen files

.gitignore Show resolved Hide resolved
fastqpreprocessing/.gitmodules Outdated Show resolved Hide resolved
fastqpreprocessing/src/big-run.sh Outdated Show resolved Hide resolved
fastqpreprocessing/src/create_fastq.sh Outdated Show resolved Hide resolved
fastqpreprocessing/src/fastqprocess.cpp Outdated Show resolved Hide resolved
fastqpreprocessing/src/utilities.cpp Outdated Show resolved Hide resolved
fastqpreprocessing/src/utilities.cpp Outdated Show resolved Hide resolved
fastqpreprocessing/src/utilities.cpp Show resolved Hide resolved
fastqpreprocessing/src/utilities.cpp Show resolved Hide resolved
fastqpreprocessing/src/utilities.cpp Outdated Show resolved Hide resolved
@jessicaway
Copy link
Contributor

Are there any tests for this code?

@kbergin
Copy link

kbergin commented Sep 17, 2020

I think we said we were going to add tests with this ticket https://broadinstitute.atlassian.net/browse/GL-1179 .. or at least add more if there are some already.

@codecov-commenter
Copy link

codecov-commenter commented Sep 18, 2020

Codecov Report

Merging #82 into master will not change coverage.
The diff coverage is 91.83%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #82   +/-   ##
=======================================
  Coverage   88.85%   88.85%           
=======================================
  Files          28       28           
  Lines        3007     3007           
=======================================
  Hits         2672     2672           
  Misses        335      335           
Impacted Files Coverage Δ
src/sctools/gtf.py 65.24% <0.00%> (ø)
src/sctools/fastq.py 92.72% <70.58%> (ø)
src/sctools/metrics/merge.py 94.73% <83.33%> (ø)
src/sctools/metrics/writer.py 93.33% <90.00%> (ø)
src/sctools/consts.py 100.00% <100.00%> (ø)
src/sctools/count.py 79.64% <100.00%> (ø)
src/sctools/encodings.py 96.19% <100.00%> (ø)
src/sctools/groups.py 98.80% <100.00%> (ø)
src/sctools/metrics/gatherer.py 95.55% <100.00%> (ø)
src/sctools/reader.py 100.00% <100.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d09540...d7ae6d5. Read the comment docs.

@jessicaway
Copy link
Contributor

@kishorikonwar I think you may have missed several of my comments that github had hidden, please take another look

Copy link
Contributor

@jessicaway jessicaway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more minor comments. Looks good overall!

// sam records in its buffer. This is the same behavior across all
// reader threads
if (r == block_size || !fastQFileR1.keepReadingFile()) {
submit_block_tobe_written(samrecord_data, tindex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
submit_block_tobe_written(samrecord_data, tindex);
submit_block_to_be_written(samrecord_data, tindex);

for (int j=0; j < 5; j++) {
char c = tp[i];
tp[i] = ATCG[j];
/* if the mutation in any of the positions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough. Should we put a ticket in the backlog to look into this ?

#!/bin/bash


# --tool=memcheck \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove unused code

@kishorikonwar kishorikonwar merged commit fd352ae into master Sep 28, 2020
}
/* getopt_long already printed an error message. */
return;
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary break; after return;

namespace fs = std::experimental::filesystem;

/** @copydoc filesize */
int32_t filesize(const char *filename) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is int32 sufficient here, or is a long type needed for large files? If file size is in bytes doesn't this top out at 2.1G?

FILE *f = fopen(filename, "rb"); /* open the file in read only */

int32_t size = 0;
if (fseek(f, 0, SEEK_END) ==0 ) /* seek was successful */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a better way to get a file's size, such as fstat?

fastqpreprocessing/src/create_fastq.sh Outdated Show resolved Hide resolved
@@ -0,0 +1 @@
./fastqproc ../../L8TX/L8TX_180221_01_F12_R1.fastq.gz ../../L8TX/L8TX_180221_01_F12_I1.fastq.gz ../../L8TX/L8TX_180221_01_F12_R2.fastq.gz ../../L8TX/L8TX_171026_01_F03_R1.fastq.gz ../../L8TX/L8TX_171026_01_F03_I1.fastq.gz ../../L8TX/L8TX_171026_01_F03_R2.fastq.gz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these scripts refer to files that aren't checked in, what use are they for developers? This script in particular doesn't seem worth checking in once the development work is complete, since it's just calling fastqproc on a list of fastqs.



/**
* @brief Compute the number of bam files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this match the function? It looks like it's computing the number of blocks in a bam file.

@khajoue2 khajoue2 deleted the kmk-fastqprocessing branch March 9, 2022 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants