Putting here for discussion, as a thorny issue that could be dangerous if implemented incorrectly. Feedback welcome!
We generally take sample names from input filenames where possible. If a tool takes more than one input (eg. a pair of FastQ files), we typically ignore the second and take the first. This issue is about a suggested method to use both to try to create a "cleaner" resulting sample identifier (typically without a _1 suffix).
We could look into a generalised function, usable by every module that may find a pair of input FastQ filenames, that tries to resolve the pair of names into a single identifier. This could be done by diffing the two filenames (we can't make assumptions about syntax like _1, because this could break many valid sample identifiers).
For example, given:
sample_1_R1_L1
sample_1_R2_L1
We could remove the _R1/_R2 diff from the two strings to get sample_1_L1 as a sample identifier.
We would need to be very careful not to remove other data here, so maybe we only do this if the diff is exactly _R1/_R2 or _1/_2 or something similar. We don't want a pair like this:
sample_one_R1_Lane1
sample_1_2_L1
being reduced to sample (or similar). In this case we should just keep the current behaviour, that is, use the first FastQ input (sample_one_R1_Lane1).
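To make the proposal concrete, here is a minimal sketch of what such a helper might look like. It is not existing MultiQC code: the function name resolve_pair_name, the PAIR_TOKENS whitelist and the delimiter set are all assumptions made for illustration. It only merges the names when they differ in exactly one character and that character sits inside a whitelisted read-pair token; otherwise it falls back to the first name, as described above.

```python
# Hypothetical whitelist of (read 1 token, read 2 token) pairs.
# These values are an assumption for illustration, not a MultiQC setting.
PAIR_TOKENS = [("_R1", "_R2"), ("_1", "_2")]

# Characters allowed directly after a token (assumed set of delimiters).
DELIMITERS = "_.-"


def resolve_pair_name(name1: str, name2: str) -> str:
    """Merge a pair of read names into one identifier, or return name1."""
    # Names of different length cannot differ by a single token character.
    if len(name1) != len(name2):
        return name1

    # Positions at which the two names disagree.
    diff = [i for i, (a, b) in enumerate(zip(name1, name2)) if a != b]
    if len(diff) != 1:
        return name1
    pos = diff[0]

    for tok1, tok2 in PAIR_TOKENS:
        # The differing character is the last character of the token.
        start = pos - (len(tok1) - 1)
        end = pos + 1
        if start < 0:
            continue
        if name1[start:end] != tok1 or name2[start:end] != tok2:
            continue
        # Require end-of-name or a delimiter next, so "_10"/"_20" is left alone.
        if end == len(name1) or name1[end] in DELIMITERS:
            return name1[:start] + name1[end:]

    # Anything else: keep the current behaviour and use the first input.
    return name1


print(resolve_pair_name("sample_1_R1_L1", "sample_1_R2_L1"))
print(resolve_pair_name("sample_one_R1_Lane1", "sample_1_2_L1"))
```

With the two examples from this issue, the first call gives sample_1_L1 and the second leaves sample_one_R1_Lane1 unchanged. Being this strict means some genuine pairs with unusual naming would not be merged, but that seems the safer failure mode.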
fastp - if you run fastp in PE mode and output general stats as is currently done in 1.17, the report gives the misleading impression that the analysis began with a paired-end dataset but fastp then somehow collapsed it all to be interleaved, and/or that only the _R1 dataset from fastp's --in1 argument was used. I have had detailed discussions with @vladsavelyev on this and he can fill in on what potential problems this could cause.
kallisto - this tool has a similar problem/feature/issue: whether you do SE or PE analysis, the MultiQC module takes the stderr file and uses _R1 for the general stats display and the detailed kallisto sections. Note - I am not saying this is a MultiQC problem; I am just highlighting it. Hence, after I run kallisto and before I run MultiQC, I run the command below and then give the resulting file to MultiQC. I know I could have done some multiqc_config.yaml ninja like @ewels and @vladsavelyev and told fn_clean_trim to remove every _1, but that would probably end up removing the _1 that I want to keep for fastqc.
sed -i -e '/^ /d' -e 's/_1.fastq.gz\$/.fastq.gz/' "${sample_id}.kallisto_stderr.txt"