Skip to content

Subsampling Metagenomic FASTQs

Gavin Douglas edited this page Oct 5, 2017 · 6 revisions

Rarefaction is a commonly used method in amplicon sequencing studies to ensure that all samples have the same read depth (by randomly subsampling reads). This method is controversial as it requires that many reads are discarded. Shotgun metagenomics reads are not usually rarified in practice although the same arguments can be made as for amplicon data for why this should be the case.

If you want to rarify your data you can use seqtk to randomly subsample a set number of reads from a FASTQ.

If you have paired-end data it is important that you make sure your read-pairs are in the same order in each FASTQ and use the same random seed when subsampling so that the reads from the same pairs will be subsampled correctly.

Sorting FASTQs

If your paired-end read pairs are not in the same order you can use the below commands to sort the FASTQs (note this assumes that every read in the forward FASTQ has a corresponding pair in the reverse FASTQ, and vice versa). This command was taken from the Edwards lab website.

mkdir sorted_fastqs

parallel -j 20 'cat {} | paste - - - - | sort -k1,1 -t " " | tr "\t" "\n" > sorted_fastqs/{/.}.sorted.fastq' ::: /path/to/fastqs/*.fastq

Subsampling FASTQs

Once your paired-end reads are in the same order you can use the sample command to subsample the reads. The input parameters are the input FASTQ and the number of reads to subsample ("DEPTH" in the below commands). It is also important to set the random seed (especially for paired-end data) so that reads from the same pairs are subsampled in the forward and reverse FASTQs. This option is set as "SEED" in the below commands and can be any number as long as it is the same for the forward and reverse FASTQs.

mkdir rarified_fastqs

parallel -j 4 'seqtk sample -s SEED {} DEPTH > rarified_fastqs/{/.}_subsampled.fastq' ::: path/to/fastqs/*_R1_*_paired.sorted.fastq

parallel -j 4 'seqtk sample -s SEED {} DEPTH > rarified_fastqs/{/.}_subsampled.fastq' ::: path/to/fastqs/*_R2_*_paired.sorted.fastq
Clone this wiki locally