Add logged warning for spaces in sample names.
Spaces in sample names can lead to downstream failures in some contexts. In that case, the error message only mentions not being able to match read groups, with no extra information provided. With this change, we will at least have that reason available in the logs.

See https://smrtlink-sms.nanofluidics.com:8243/sl/analysis/results/32588?show=status
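The sanitization rule this commit touches can be sketched in isolation as follows. This is a minimal illustration, not the project's code: `SanitizeSketch` and the `sawSpace` flag are hypothetical names, and the sketch reports spaces through a flag instead of logging with `PBLOG_WARN`, so a caller could warn once per name rather than once per space character.

```cpp
#include <string>

// Hypothetical stand-in for the sanitizer in this commit (assumed name).
// Every character outside the printable, non-space ASCII range '!'..'~'
// is replaced with '_'. `sawSpace` is set to true if at least one of the
// replaced characters was a literal space, so the caller can decide how
// and where to log a warning.
std::string SanitizeSketch(const std::string& trimmed, bool& sawSpace)
{
    sawSpace = false;
    std::string sanitized;
    sanitized.reserve(trimmed.size());
    for (const char c : trimmed) {
        if (c < '!' || c > '~') {
            sanitized += '_';
            if (c == ' ') sawSpace = true;
        } else {
            sanitized += c;
        }
    }
    return sanitized;
}
```

Under these assumptions, calling `SanitizeSketch("TEST bla", sawSpace)` yields `TEST_bla` and sets `sawSpace`, which matches the `SM:TEST_bla` read group tag expected by the cram tests in this commit.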
derekwbarnett committed Dec 15, 2020
1 parent e1718d2 commit d4d0722
Showing 6 changed files with 33 additions and 4 deletions.
10 changes: 8 additions & 2 deletions src/SampleNames.cpp
@@ -28,9 +28,15 @@ std::string SampleNames::SanitizeSampleName(const std::string& in)
     if (trimmed.empty()) return fallbackSampleName;
     std::string sanitizedName;
     for (const char& c : trimmed) {
-        if (c < '!' || c > '~')
+        if (c < '!' || c > '~') {
             sanitizedName += '_';
-        else
+            if (c == ' ') {
+                std::ostringstream msg;
+                msg << "Sample name '" << in << "' contains a space character. "
+                    << "This may interfere with matching read groups and samples later.";
+                PBLOG_WARN << msg.str();
+            }
+        } else
             sanitizedName += c;
     }
     return sanitizedName;
7 changes: 7 additions & 0 deletions tests/cram/biosampleConsensus.t
@@ -17,6 +17,7 @@
 *\tSM:testSample\t* (glob)

 $ $__PBTEST_PBMM2_EXE align $IN $REF $CRAMTMP/ccs3.bam --sample " TEST bla "
+*Sample name ' TEST bla ' contains a space character* (glob)
 $ samtools view -H $CRAMTMP/ccs3.bam | grep "@RG"
 *\tSM:TEST_bla\t* (glob)

@@ -37,12 +38,18 @@
 *\tSM:UnnamedSample\t* (glob)

 $ $__PBTEST_PBMM2_EXE align $MERGED $REF $CRAMTMP/ccs6.bam
+*Sample name 'test test' contains a space character* (glob)
+*Sample name 'test test' contains a space character* (glob)
+*Sample name 'test test' contains a space character* (glob)
 $ samtools view -H $CRAMTMP/ccs6.bam | grep "@RG"
 *\tSM:bamSample\t* (glob)
 *\tSM:test_test\t* (glob)
 *\tSM:test_test\t* (glob)

 $ $__PBTEST_PBMM2_EXE align $MERGED $REF $CRAMTMP/ccs7.bam --sample testSample
+*Sample name 'test test' contains a space character* (glob)
+*Sample name 'test test' contains a space character* (glob)
+*Sample name 'test test' contains a space character* (glob)
 $ samtools view -H $CRAMTMP/ccs7.bam | grep "@RG"
 *\tSM:testSample\t* (glob)
 *\tSM:testSample\t* (glob)
6 changes: 6 additions & 0 deletions tests/cram/biosampleSubreads.t
@@ -11,6 +11,7 @@
 *\tSM:testSample\t* (glob)

 $ $__PBTEST_PBMM2_EXE align $IN $REF $CRAMTMP/out3.bam --sample " TEST bla "
+*Sample name ' TEST bla ' contains a space character* (glob)
 $ samtools view -H $CRAMTMP/out3.bam | grep "@RG"
 *\tSM:TEST_bla\t* (glob)

@@ -23,12 +24,17 @@
 *\tSM:UnnamedSample\t* (glob)

 $ $__PBTEST_PBMM2_EXE align $MERGED $REF $CRAMTMP/out6.bam
+*Sample name ' UCLA 1023 ' contains a space character* (glob)
+*Sample name 'test test ' contains a space character* (glob)
 $ samtools view -H $CRAMTMP/out6.bam | grep "@RG"
 *\tSM:3260208_188nM-GTAC_2xGCratio_LP7_100fps_15min_5kEColi_SP2p1_3uMSSB_BA243494\t* (glob)
 *\tSM:test_test\t* (glob)
 *\tSM:UCLA_1023\t* (glob)

+
 $ $__PBTEST_PBMM2_EXE align $MERGED $REF $CRAMTMP/out7.bam --sample testSample
+*Sample name ' UCLA 1023 ' contains a space character* (glob)
+*Sample name 'test test ' contains a space character* (glob)
 $ samtools view -H $CRAMTMP/out7.bam | grep "@RG"
 *\tSM:testSample\t* (glob)
 *\tSM:testSample\t* (glob)
10 changes: 8 additions & 2 deletions tests/cram/splitsample.t
@@ -3,7 +3,8 @@
 $ NO_SM_BIOSAMPLES=$TESTDIR/data/no_sm_biosamples.subreadset.xml

 $ $__PBTEST_PBMM2_EXE align $MERGED $REF $CRAMTMP/split.bam --split-by-sample
-
+*Sample name ' UCLA 1023 ' contains a space character* (glob)
+*Sample name 'test test ' contains a space character* (glob)
 $ ls -l $CRAMTMP/split.*.bam | wc -l | tr -d ' '
 3

@@ -26,6 +27,8 @@
 10

 $ $__PBTEST_PBMM2_EXE align $MERGED $REF $CRAMTMP/split_dataset.alignmentset.xml --split-by-sample
+*Sample name ' UCLA 1023 ' contains a space character* (glob)
+*Sample name 'test test ' contains a space character* (glob)

 $ [[ -f $CRAMTMP/split_dataset.3260208_188nM-GTAC_2xGCratio_LP7_100fps_15min_5kEColi_SP2p1_3uMSSB_BA243494.alignmentset.xml ]] || echo "File does not exist!"
 $ [[ -f $CRAMTMP/split_dataset.test_test.alignmentset.xml ]] || echo "File does not exist!"
@@ -53,8 +56,11 @@

 When both --split-by-sample and --sample were set, expect to see only one bam file with SM overridden.
 $ IN=$TESTDIR/data/merged.consensusreadset.xml
-$ $__PBTEST_PBMM2_EXE align $REF $IN $CRAMTMP/splitsampleoverride.consensusalignmentset.xml --sort -j 8 --split-by-sample --sample "MySample" 2>&1 | fgrep -v 'Requested more threads'
+$ $__PBTEST_PBMM2_EXE align $REF $IN $CRAMTMP/splitsampleoverride.consensusalignmentset.xml --sort -j 8 --split-by-sample --sample "MySample" 2>&1 | fgrep -v 'Requested more threads'
 *Options --split-by-sample and --sample are mutually exclusive. Option --sample will be applied and --split-by-sample is ignored! (glob)
+*Sample name 'test test' contains a space character* (glob)
+*Sample name 'test test' contains a space character* (glob)
+*Sample name 'test test' contains a space character* (glob)
 $ [[ -f $CRAMTMP/splitsampleoverride.bam ]] || echo "File does not exist!"
 $ samtools view -H $CRAMTMP/splitsampleoverride.bam | grep "@RG" | grep -v "@PG ID:samtools" | cut -f 6 | sort | uniq
 SM:MySample
2 changes: 2 additions & 0 deletions tests/cram/splitsamplejson.t
@@ -3,6 +3,8 @@
 $ REF=ecoliK12_pbi_March2013.fasta

 $ $__PBTEST_PBMM2_EXE align $MERGED $REF $CRAMTMP/split_dataset_sorted_json.alignmentset.xml --split-by-sample
+*Sample name ' UCLA 1023 ' contains a space character* (glob)
+*Sample name 'test test ' contains a space character* (glob)

 $ [[ -f $CRAMTMP/split_dataset_sorted_json.3260208_188nM-GTAC_2xGCratio_LP7_100fps_15min_5kEColi_SP2p1_3uMSSB_BA243494.alignmentset.xml ]] || echo "File does not exist!"
 $ [[ -f $CRAMTMP/split_dataset_sorted_json.test_test.alignmentset.xml ]] || echo "File does not exist!"
2 changes: 2 additions & 0 deletions tests/cram/splitsamplesorted.t
@@ -2,6 +2,8 @@
 $ REF=$TESTDIR/data/ecoliK12_pbi_March2013.fasta

 $ $__PBTEST_PBMM2_EXE align $MERGED $REF $CRAMTMP/split_dataset_sorted.alignmentset.xml --split-by-sample --sort
+*Sample name ' UCLA 1023 ' contains a space character* (glob)
+*Sample name 'test test ' contains a space character* (glob)

 $ [[ -f $CRAMTMP/split_dataset_sorted.3260208_188nM-GTAC_2xGCratio_LP7_100fps_15min_5kEColi_SP2p1_3uMSSB_BA243494.alignmentset.xml ]] || echo "File does not exist!"
 $ [[ -f $CRAMTMP/split_dataset_sorted.test_test.alignmentset.xml ]] || echo "File does not exist!"
