Discussion: Sample name cleaning with pairs of input filenames #2162

Closed
ewels opened this issue Nov 7, 2023 · 1 comment

Comments

@ewels
Member

ewels commented Nov 7, 2023

Putting here for discussion, as a thorny issue that could be dangerous if implemented incorrectly. Feedback welcome!

We generally take sample names from input filenames where possible. If a tool takes more than one input (eg. a pair of FastQ files), we typically ignore the second and take the first. This issue is about a suggested method to use both to try to create a "cleaner" resulting sample identifier (typically without a _1 suffix).

We could look into a generalised function, usable by every module that may find a pair of input FastQ filenames, which tries to resolve that pair of names into a single identifier. This could be done by taking a diff of the two filenames (we can't make assumptions about syntax like _1, because this could break many valid sample identifiers).

For example, given:

  • sample_1_R1_L1
  • sample_1_R2_L1

We could remove the _R1/_R2 diff from the two strings to get sample_1_L1 as a sample identifier.

We would need to be very careful not to remove other data here, so maybe we only do this if the diff is exactly _R1/_R2 or _1/_2 or similar. For example, we don't want to collapse a pair like this:

  • sample_one_R1_Lane1
  • sample_1_2_L1

That pair should not be reduced to sample (or similar). In this case we should keep the current behaviour, that is, use FastQ input 1 (sample_one_R1_Lane1).
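
A minimal sketch of the guarded approach described above, for discussion only. The names `resolve_pair_name` and `ALLOWED_PAIRS` are hypothetical and not part of any existing MultiQC API, and splitting on `_` and the exact whitelist are assumptions. The idea: compare the two names token by token, and only drop the differing token when it is the single difference and matches a known read-pair marker. Anything else falls back to the first name.

```python
# Hypothetical whitelist of read-pair markers (an assumption, not MultiQC config).
ALLOWED_PAIRS = {("R1", "R2"), ("1", "2")}


def resolve_pair_name(name1: str, name2: str, sep: str = "_") -> str:
    """Return one sample name for a pair of input names.

    Falls back to name1 (the current behaviour) unless the names differ
    in exactly one separator-delimited token and that difference is a
    whitelisted read-pair marker.
    """
    tokens1 = name1.split(sep)
    tokens2 = name2.split(sep)

    # Different token counts: the names are not a simple pair, so bail out.
    if len(tokens1) != len(tokens2):
        return name1

    # Positions at which the two names disagree.
    diff = [i for i, (a, b) in enumerate(zip(tokens1, tokens2)) if a != b]
    if len(diff) != 1:
        return name1

    i = diff[0]
    if (tokens1[i], tokens2[i]) not in ALLOWED_PAIRS:
        return name1

    # Drop only the differing token; never return an empty name.
    cleaned = sep.join(tokens1[:i] + tokens1[i + 1:])
    return cleaned or name1


print(resolve_pair_name("sample_1_R1_L1", "sample_1_R2_L1"))  # sample_1_L1
print(resolve_pair_name("sample_one_R1_Lane1", "sample_1_2_L1"))  # sample_one_R1_Lane1
```

Comparing whole tokens rather than characters matters here: a character-level diff of the first pair only differs in `1` vs `2`, which would leave a stray `R` behind.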

@tamuanand

tamuanand commented Nov 7, 2023

Thanks @ewels for starting this discussion.

My thoughts here:

  • fastp - if you run fastp in PE mode and output general stats as is currently done in 1.17, the report gives the mistaken impression that the analysis began with a paired-end dataset but that fastp then somehow collapsed it all into interleaved reads and/or kept only the _R1 dataset coming from the --in1 argument of fastp. I have had detailed discussions with @vladsavelyev on this, and he can fill in on what potential problems this could cause.
  • FWIW, fastp also gives separate _R1 and _R2 stats:
        "read1_before_filtering": {
                "total_reads": 114077968,
                "total_bases": 17111695200,
                "q20_bases": 16805928223,
                "q30_bases": 16233110450,
                "total_cycles": 150,
                "quality_curves": {
SNIPPED
        "read2_before_filtering": {
                "total_reads": 114077968,
                "total_bases": 17111695200,
                "q20_bases": 16355237292,
                "q30_bases": 15526339378,
                "total_cycles": 150,
                "quality_curves": 
----
----
----
        "read1_after_filtering": {
                "total_reads": 110452640,
                "total_bases": 16352330858,
                "q20_bases": 16090855963,
                "q30_bases": 15555936668,
                "total_cycles": 150,
                "quality_curves":
SNIPPED
        "read2_after_filtering": {
                "total_reads": 110452640,
                "total_bases": 16353386063,
                "q20_bases": 15931142799,
                "q30_bases": 15191423293,
                "total_cycles": 150,
                "quality_curves": {
  • kallisto - this tool has a similar problem/feature/issue: whether you do SE or PE analysis, the multiqc module takes the stderr file and uses _R1 for the general stats display and the detailed kallisto sections. Note that I am not saying this is a MultiQC problem; I am just highlighting it. I know I could have done some multiqc_config.yaml ninja work like @ewels and @vladsavelyev and used fn_clean_trim to strip every _1, but that would probably also remove the _1 that I want to keep for fastqc. So instead, after I run kallisto and before I run multiqc, I run the command below and then give the edited file to multiqc:
sed -i -e '/^ /d' -e 's/_1.fastq.gz\$/.fastq.gz/' "${sample_id}.kallisto_stderr.txt"
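
For anyone who wants the same clean-up outside of sed, here is a rough Python sketch of what the command above appears to do: drop every line that starts with a space, then strip a trailing _1 before .fastq.gz at the end of each remaining line. It assumes the `\$` in the sed expression is an escaped end-of-line anchor (as it would be inside a workflow script template) rather than a literal dollar sign. The function name `clean_kallisto_stderr` is hypothetical, and the input lines below are made up for illustration; they are not real kallisto output.

```python
import re


def clean_kallisto_stderr(text: str) -> str:
    """Mimic the sed clean-up on the contents of a stderr log.

    1. Delete lines that start with a space (sed: /^ /d).
    2. Replace a trailing "_1.fastq.gz" with ".fastq.gz" (sed: s/.../.../).
    Unlike the sed pattern, the dots are escaped so they match only ".".
    """
    kept = [line for line in text.splitlines() if not line.startswith(" ")]
    cleaned = [re.sub(r"_1\.fastq\.gz$", ".fastq.gz", line) for line in kept]
    # Preserve a trailing newline if the input had one.
    return "\n".join(cleaned) + ("\n" if text.endswith("\n") else "")


# Made-up example lines (an assumption, not copied from a real log).
example = "pair 1: sampleA_1.fastq.gz\n        sampleA_2.fastq.gz\n"
print(clean_kallisto_stderr(example), end="")  # pair 1: sampleA.fastq.gz
```

Because the substitution is anchored to the end of the line, a _1 elsewhere in the sample name is left alone, which is the property a blanket fn_clean_trim rule would not give.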


3 participants