Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging FASTQ files in paired-end sequencing? #16

Open
zapaterc opened this issue Jul 20, 2021 · 3 comments
Open

Merging FASTQ files in paired-end sequencing? #16

zapaterc opened this issue Jul 20, 2021 · 3 comments
Labels
question Further information is requested resolved Issue is resolved

Comments

@zapaterc
Copy link

zapaterc commented Jul 20, 2021

Hi Thomas and Matt,

I'm building some code to do QC in one RNAseq project that I'm working on... I have 2 fastq.gz files per sample and I just realized this is because there's one fastq.gz file per each paired-end sequencing run.

I was wondering if there's a way (a set of functions?) to integrate both files together so that I end up with just one fastq.gz per sample (so that I can also use that one file to align my reads)? Or which way would you recommend in order to move forward?

Thanks a lot!

Carolina

P.S.: @matthew-paul-2006, I wonder if the reason why my QC looked "clean" at the ends was because I did the analysis with run number 2 of the paired-end sequencing??

@ThomasCarroll
Copy link

hi Caroline,

To concatenate two fastqs samples you could use the Rfastp package's catfastq function. @cafeblue wrote this so can help if you have any issues.
http://www.bioconductor.org/packages/release/bioc/manuals/Rfastp/man/Rfastp.pdf

Just to be clear on when to merge,

If you have paired-end data you should typically have at least one read1 and one read2 fastq file.

These represent either end of the fragment and so are used in alignment as separate files (not merged). Typically denoted by R1 and R2 in name of file i.e. NAMEofSampleR1.fq and NAMEofSampleR2.fq

If you have multiples of R1 and R2, you can merge the R1 files together and then R2 files together prior to alignment as long as you ensure files being merged for R1 and R2 are in the same order i.e Sequencing Run1, Run2, Run3 fastqs are concatenated in that order for both the final merged R1 and R2 files.

vb

tom

@zapaterc
Copy link
Author

zapaterc commented Jul 21, 2021

Hi Tom,

This was very helpful, thank you! I do have just one R1 and R2 file per sample.
I just did the alignment but I'm still curious: When/How do I merge the information for the aligned data? There has to be a moment in the process where we merge those two runs, right? I might have missed something in the lecture.

Also, while using the rfastp() function in order to do some basic QC I observe the following:

  • If I only input read1 the % of duplication rate is about 67%.
    Screen Shot 2021-07-21 at 02 21 40

  • If I input both read1 and read2 I get about 6%.

Screen Shot 2021-07-21 at 02 18 43

I'm not sure I understand why the duplication rate is overestimated when you make the function think that the data comes from single-end sequencing.

As a follow-up question: If i use the readFastq() function followed by srduplicated() in order to get the duplicate reads in R1 I obtain approximately a 82% of duplicate reads.

Screen Shot 2021-07-21 at 02 18 03

Sorry for the lengthy questions and thanks a lot in advance!

Carolina

@matthew-paul-2006
Copy link

We talk about paired end data in subsequent sessions on technique specific analysis, so you haven't missed anything. So the question of merging all depends on the experimental approach, but for most cases you would do this at the alignment step. For RNAseq paired end, you can provide the aligner with both fastqs and it will align considering both reads. You can check the function help page to get the exact input variables and order you need to supply the file names.

As for the QC plots you showed me, it is unlikely that they would be gated in that way, based on just providing the fastq read2, instead of whole dataset. In fact it is often the case that the second read has slightly lower quality. Based on what you said the data had already been trimmed, and it fits with that explanation (especially as you said the read length varied a lot), but your best bet is to just check in with the folks who did the initial processing for you.

@matthew-paul-2006 matthew-paul-2006 added question Further information is requested resolved Issue is resolved labels Jul 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested resolved Issue is resolved
Projects
None yet
Development

No branches or pull requests

3 participants