Discussion: Sample name cleaning with pairs of input filenames #2162

ewels · 2023-11-07T11:19:52Z

Putting here for discussion, as a thorny issue that could be dangerous if implemented incorrectly. Feedback welcome!

We generally take sample names from input filenames where possible. If a tool takes more than one input (eg. a pair of FastQ files), we typically ignore the second and take the first. This issue is about a suggested method to use both to try to create a "cleaner" resulting sample identifier (typically without a _1 suffix).

We could look into a generalised function that every module could use when it may find a pair of input FastQ filenames, to try to resolve that pair of names into a single identifier. This could be done by diffing the two filenames (we can't make assumptions about syntax like _1, because this could break many valid sample identifiers).

For example, given:

sample_1_R1_L1
sample_1_R2_L1

We could remove the _R1/_R2 diff from the two strings to get sample_1_L1 as a sample identifier.

We would need to be very careful not to remove other data here, so maybe we only do this if the diff is _R1/_R2 or _1/_2 (or similar). We don't want this pair:

sample_one_R1_Lane1
sample_1_2_L1

to collapse to sample (or similar). In this case we should keep the current behaviour, that is, use FastQ input 1 (sample_one_R1_Lane1).
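To make the idea concrete, here is a rough sketch of what such a helper might look like. It is not MultiQC code: the function name `resolve_pair_name`, the separator set and the list of accepted marker pairs are all assumptions for illustration. It splits both names into tokens, and only drops the differing token when exactly one token differs and that difference is a known read-pair marker; anything else falls back to the first name.

```python
import re

# Assumption: only these token pairs count as read-pair markers.
PAIR_TOKENS = {("R1", "R2"), ("1", "2")}
# Assumption: tokens are delimited by "_", "." or "-".
# The capturing group keeps the separators so the name can be rejoined.
SEPARATOR = re.compile(r"([_.\-])")


def resolve_pair_name(name1: str, name2: str) -> str:
    """Return one identifier for a read pair, or name1 if unsure (hypothetical helper)."""
    parts1 = SEPARATOR.split(name1)
    parts2 = SEPARATOR.split(name2)
    if len(parts1) != len(parts2):
        return name1

    # Positions where the two names disagree.
    diffs = [i for i, (a, b) in enumerate(zip(parts1, parts2)) if a != b]
    if len(diffs) != 1:
        return name1

    i = diffs[0]
    if (parts1[i], parts2[i]) not in PAIR_TOKENS:
        return name1

    if i > 0:
        # Drop the marker together with the separator in front of it.
        merged = parts1[: i - 1] + parts1[i + 1 :]
    else:
        # Marker is the first token: drop it and the separator after it.
        merged = parts1[i + 2 :]

    # Never return an empty identifier.
    return "".join(merged) or name1
```

With the two examples from this issue, `resolve_pair_name("sample_1_R1_L1", "sample_1_R2_L1")` gives `sample_1_L1`, while `resolve_pair_name("sample_one_R1_Lane1", "sample_1_2_L1")` has three differing tokens and so returns the first name unchanged.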

tamuanand · 2023-11-07T22:26:34Z

Thanks @ewels for starting this discussion.

My thoughts here:

fastp - if you run fastp in PE mode and output general stats as is done currently in 1.17, it mistakenly conveys the impression that the analysis began with a paired-end dataset but fastp then somehow collapsed it all to be interleaved and/or kept only the _R1 dataset coming from the --in1 argument of fastp. I have had detailed discussions with @vladsavelyev on this and he can fill in on what potential problems this could cause.
FWIW, fastp also gives separate _R1 and _R2 stats

"read1_before_filtering": {
                "total_reads": 114077968,
                "total_bases": 17111695200,
                "q20_bases": 16805928223,
                "q30_bases": 16233110450,
                "total_cycles": 150,
                "quality_curves": {
SNIPPED
        "read2_before_filtering": {
                "total_reads": 114077968,
                "total_bases": 17111695200,
                "q20_bases": 16355237292,
                "q30_bases": 15526339378,
                "total_cycles": 150,
                "quality_curves": 
SNIPPED
        "read1_after_filtering": {
                "total_reads": 110452640,
                "total_bases": 16352330858,
                "q20_bases": 16090855963,
                "q30_bases": 15555936668,
                "total_cycles": 150,
                "quality_curves":
SNIPPED
        "read2_after_filtering": {
                "total_reads": 110452640,
                "total_bases": 16353386063,
                "q20_bases": 15931142799,
                "q30_bases": 15191423293,
                "total_cycles": 150,
                "quality_curves": {
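As an illustration of how the separate read 1 / read 2 numbers could be pulled out of a report like the one above, here is a small sketch. The section names (`read1_before_filtering` and so on) and the `total_reads` field come from the excerpt; the function name `per_read_totals` is made up for this example and is not part of MultiQC or fastp.

```python
def per_read_totals(report: dict) -> dict:
    """Collect total_reads per read and filtering stage (hypothetical helper).

    `report` is the parsed fastp JSON, e.g. from json.load().
    Sections missing from the report (such as read2_* for single-end
    data) are simply skipped.
    """
    totals = {}
    for read in ("read1", "read2"):
        for stage in ("before_filtering", "after_filtering"):
            section = report.get(f"{read}_{stage}")
            if section is not None:
                totals[(read, stage)] = section["total_reads"]
    return totals
```

For the paired-end report above this would yield four entries, one per read and stage, which is the information that gets lost when only the --in1 name is shown.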

kallisto - this tool has a similar problem/feature/issue: whether you do SE or PE analysis, the MultiQC module takes the stderr file and uses _R1 for the general stats display and the detailed kallisto sections. Note: I am not saying this is a MultiQC problem; I am just highlighting it. Hence, after I run kallisto and before I run multiqc, I run the command below and then give the edited file to multiqc. I know I could have done some multiqc_config.yaml ninja like @ewels and @vladsavelyev and used fn_clean_trim to strip every _1, but that would probably end up removing the _1 that I want for fastqc.

sed -i -e '/^ /d' -e 's/_1.fastq.gz\$/.fastq.gz/' "${sample_id}.kallisto_stderr.txt"
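For anyone who would rather do this step in Python, a rough equivalent of the sed command might look like the sketch below. It assumes the `\$` in the command is an escaped end-of-line anchor (as it would be inside a workflow script) and, unlike the sed pattern, it escapes the dots so they only match literal dots. The function name `clean_kallisto_stderr` is made up for this example.

```python
import re

# Matches "_1.fastq.gz" only at the very end of a line.
_R1_SUFFIX = re.compile(r"_1\.fastq\.gz$")


def clean_kallisto_stderr(text: str) -> str:
    """Mimic the sed call: drop space-indented lines, trim a trailing _1 (hypothetical helper)."""
    kept = []
    for line in text.splitlines():
        if line.startswith(" "):
            # Same as sed's '/^ /d'.
            continue
        kept.append(_R1_SUFFIX.sub(".fastq.gz", line))
    cleaned = "\n".join(kept)
    # Preserve the trailing newline if the input had one.
    return cleaned + "\n" if text.endswith("\n") else cleaned
```

Because the suffix is only trimmed inside this one file, the _1 in other tools' outputs (fastqc, for example) is left alone.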

ewels added the core: back end label Nov 7, 2023

ewels mentioned this issue Nov 7, 2023

Fastp: correctly parse sample name from --in1/--in2 command. Fallback to file name #2139

Merged

vladsavelyev mentioned this issue Nov 16, 2023

Sample name cleaning with pairs of input filenames #2181

Merged

vladsavelyev closed this as completed Dec 13, 2023

vladsavelyev added this to the MultiQC v1.19 milestone Dec 13, 2023
