# Instructions for extracting sequences from HTS target enrichment reads

Miao Sun (_cactusresponsible[AT]gmail.com_)

July 16, 2019

This page documents the steps of processing target enrichment reads, from raw reads all the way down to phylogenetic tree reconstruction. Please also read the comments inside each script before execution, since they are all heavily commented.

Most scripts used here are written in bash/shell and R.

I have also made some assumptions:

General workflow

(workflow diagram)


The data used in this instruction was generated by RAPiD Genomics.

Within the data directory there is a SampleSheet csv file with the barcodes, filenames, and sample codes.

Note that Plates1-4 were sequenced on two lanes (L001 and L002), so there are two sets of fastq files per sample.

Raw Data:

  • This data has been demultiplexed using Illumina's bcl2fastq. No quality trimming or processing has been done beyond demultiplexing.

  • The adapters used are below; "BCBCBCBC" stands for the barcodes.



Assembly methods

Currently, there are three ways you can analyze high-throughput sequencing reads from target enrichment:

  1. HybPiper
[Code in github](  
  2. aTRAM
[Code in github](  
  3. [Code in github](  


In this tutorial, I only focus on HybPiper.




1. Concatenate all lanes (L001 and L002; only needed if your samples were sequenced on two lanes!)


`cat RAPiD-Genomics_F076_UFL_###_P003_WD02_i5-503_i7-72_S22_L001_R1_001.fastq.gz RAPiD-Genomics_F076_UFL_###_P003_WD02_i5-503_i7-72_S60_L002_R1_001.fastq.gz > P003_WD02_72_R1.fastq.gz`  

or run it in batch:

`bash sample_ID_file Seq_ID_table`  


`bash Evgeny_13.txt UFL_394803_SampleSheet.csv`  

This bash script will take two input files: one is the sample ID file, and the other is the sequence ID table. The format and content of each file are shown below:

```
[cactus]$ head -6 XXX_88.txt
CPG00213
CPG00216
```

```
[cactus]$ head -6 UFL_XXX_SampleSheet_XXX86.csv
RG_Sample_Code,Customer_Code,i5_Barcode_Seq,i7_Barcode_Seq,Sequence_Name,Sequencing_Cycle
UFL_394803_P002_WG08,D_4566,TAAGATTA,TTCACGCA,RAPiD-Genomics_F076_UFL_394803_P002_WG08_i5-506_i7-68_S171_L001_R1_001.fastq.gz,2x150
...
```
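The batch script's name is omitted above, but its job can be sketched as below. This is a hypothetical reimplementation, not the actual script: it assumes Customer_Code is column 2 and Sequence_Name is column 5 of the sample sheet (as in the header shown), and that file names carry the lane (L001/L002) and read (R1/R2) tags.

```shell
#!/bin/bash
# Hypothetical sketch: for each sample code in the ID file, collect its
# fastq file names from the sample sheet and merge the L001/L002 lanes
# per read direction. Column positions are assumptions.
concat_lanes() {
  ids=$1     # sample ID file, one code per line (e.g. Evgeny_13.txt)
  sheet=$2   # sample sheet csv (e.g. UFL_394803_SampleSheet.csv)
  while read -r code; do
    for R in R1 R2; do
      # column 2 = Customer_Code, column 5 = Sequence_Name (assumed layout)
      files=$(awk -F, -v c="$code" -v r="_${R}_" \
        '$2 == c && index($5, r) {print $5}' "$sheet" | sort)
      if [ -n "$files" ]; then
        cat $files > "${code}_${R}.fastq.gz"   # unquoted: one file per word
      fi
    done
  done < "$ids"
}
```

Run as `concat_lanes Evgeny_13.txt UFL_394803_SampleSheet.csv`; the `sort` keeps L001 ahead of L002 so the lane order is stable.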


2. Run fastqc to quickly check the quality; the results can later be used for comparison after trimming and cleaning.

  • scripts needed: mean.R


```
module load ufrc fastqc
srundev -t time
fastqc *.gz -o FastQC_result
```

For slurm job scripts see:

[fastqc.sbatch](./Scripts/fastqc/fastqc.sbatch)

  • after running, it will put the FastQC results into a folder called FastQC_result;

  • Copy the scripts and mean.R into FastQC_result, then execute the bash script; it will generate a summary table, Illumina_FastQC_report.csv, of read quality. For other details see the folder unzip_file.


    • the bash script and mean.R have to work together, so you have to put them in the same directory

    • the R script is invoked automatically; you don't need to modify anything.

    • here is an example command (assuming you are in the FastQC_result folder)

      cp /path/to/scripts/ /path/to/scripts/mean.R .
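The bash script itself is not reproduced here, but the gist of the summary step can be sketched as follows. This is an assumption about how the report is built, relying on FastQC's standard summary.txt format inside each unzipped *_fastqc folder; the real script also calls mean.R for per-base quality means, which this sketch does not cover.

```shell
# Hypothetical sketch: collect each FastQC module's PASS/WARN/FAIL flag
# into one csv. summary.txt lines are: STATUS <tab> Module <tab> Filename
summarize_fastqc() {
  out=Illumina_FastQC_report.csv
  echo "file,module,status" > "$out"
  for s in */summary.txt; do
    # swap any commas for semicolons so the csv stays well-formed
    awk -F'\t' -v OFS=',' '{gsub(/,/, ";"); print $3, $2, $1}' "$s" >> "$out"
  done
}
```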


3. Trim and clean reads using Trimmomatic, and prepare for the next step --- HybPiper.

  • scripts needed:


    If you only have a few samples, you can just run the bash script on a dev node; there is no need to schedule a SLURM job.

    For a large number of samples, submission to SLURM on the HPC is required.

  • run: sbatch Trimmomatic.sbatch
    modify the requested resources to suit your samples

  • Note:

    • You also need to run _fastqc.sh_ again, in order to make sure all the adapters and low-quality reads were removed.
    • Keep in mind that you need to modify the extension of the sequence files if they are .fastq rather than .gz; FastQC is pretty flexible with sequence file formats
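Trimmomatic.sbatch is not reproduced here, but the core paired-end Trimmomatic call it presumably wraps looks roughly like the sketch below. The jar path, adapter file, and thresholds are illustrative assumptions, and the outputs land in the paired/ and unpaired/ folders the HybPiper step expects. The function only prints the command so you can inspect it before piping it to bash.

```shell
# Hypothetical sketch of a paired-end Trimmomatic call (settings are
# illustrative; the actual sbatch script may differ). Prints, does not run.
trim_cmd() {
  s=$1  # sample prefix, e.g. P003_WD02_72
  echo java -jar "$TRIMMOMATIC_JAR" PE -phred33 \
    "${s}_R1.fastq.gz" "${s}_R2.fastq.gz" \
    "paired/${s}_R1_paired.fastq.gz" "unpaired/${s}_R1_unpaired.fastq.gz" \
    "paired/${s}_R2_paired.fastq.gz" "unpaired/${s}_R2_unpaired.fastq.gz" \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
    SLIDINGWINDOW:4:20 MINLEN:36
}
```

Usage: `TRIMMOMATIC_JAR=/path/to/trimmomatic-0.39.jar; trim_cmd P003_WD02_72 | bash`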

Sequence Assembly using HybPiper:

4. Run HybPiper

  • run this on a dev node by

    ml ufrc

    srundev -t xxx

  • you need to put this script under the "trimmed_data" folder

  • you should have "paired" and "unpaired" folders generated by Trimmomatic

  • the reference sequence is already shared here: /ufrc/soltis/share/Miao/Angiosperms353_targetSequences.fasta


  • or, if you have a lot of samples, you need to schedule a job in SLURM

    sbatch HybPiper_array_pairedonly.sbatch
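The sbatch script is not shown here, but a per-sample loop in the style of HybPiper 1.x (whose entry point is reads_first.py) can be sketched like this. The namelist file name and the leading echo are mine: it is a dry run that prints the commands so you can inspect them first.

```shell
# Dry-run sketch of a HybPiper 1.x loop; remove the `echo` to execute.
hybpiper_loop() {
  while read -r name; do
    echo reads_first.py --bwa \
      -b /ufrc/soltis/share/Miao/Angiosperms353_targetSequences.fasta \
      -r "paired/${name}_R1_paired.fastq.gz" "paired/${name}_R2_paired.fastq.gz" \
      --prefix "$name"
  done < "${1:-namelist.txt}"
}
```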

5. If you want introns, run the intron script on the accession folders output from the previous step

Skipped here; please see the HybPiper manual

6. When the first HybPiper script has finished, the results generated are enough for us to compute some statistics to evaluate them. At this step, we do three things:

- Generate a sequence length table using a script from HybPiper --- ``  

- Assembly stats for each gene and each sample using a script from HybPiper --- ``  

- Use the R script `gene_recovery_heatmap.R` to generate a heatmap showing which genes were recovered for each sample
  • All three functionalities are wrapped in one bash script; you run:

    bash XXX

  • Note:

  • here "XXX" is the prefix string for all the generated results
  • prepare a plain text file --- "XXXnamelist.txt" (it has to be named this way, or you can modify the script as you wish) --- to store the sequence IDs.


`[cactus]$ head -3 XXXnamelist.txt`  
  • The R script should be in the "./Script/" directory; otherwise modify the path
  • You should have these files in your current dir:
    • XXXseq_length.txt
    • XXX_assemble_stats.txt
    • XXX_heatmap.pdf
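The wrapper script's name is elided above; in HybPiper 1.x the three pieces it is assumed to chain are get_seq_lengths.py, hybpiper_stats.py, and the heatmap R script. A dry-run sketch (the bait-file name is a placeholder, and the commands are printed rather than executed):

```shell
# Prints the three commands the wrapper is assumed to run; the argument
# is your chosen results prefix (the "XXX" above).
stats_steps() {
  p=$1
  echo "python get_seq_lengths.py baits.fasta ${p}namelist.txt dna > ${p}seq_length.txt"
  echo "python hybpiper_stats.py ${p}seq_length.txt ${p}namelist.txt > ${p}_assemble_stats.txt"
  echo "Rscript ./Script/gene_recovery_heatmap.R ${p}seq_length.txt"
}
```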

7. Before we retrieve the supercontig sequences (the assembled sequences for each gene) from the first run above, we need to put all the Seq_ID folders in one place (so mv P*W* seq_dir):

  • If you want to run each individually:

    module load python
    python HybPiper/ baits1.fasta seq_dir dna

    If you only ran exons, use dna; if you ran intronerate, use supercontig.

  • If you want one-stop-shop, you need to run a comprehensive bash script:

    bash XXX

  • Note:
    • This bash script will:
      1) retrieve sequences;
      2) invoke MAFFT to do the alignments;
      3) combine all genes into one supermatrix and rename sequences, using phyx to remove sites with 50% gaps (can be modified);
      4) generate a gene-by-sample presence/absence binary matrix

    • You also need to prepare a csv file with your "samples_ID,Seq_ID" or "species_ID,Seq_ID", depending on how you want your sequences and matrix represented in the later summary and naming.

      `[cactus]$ head -3 XXX.csv`  
    • You can also run the mafft script on an individual gene (modify as needed)

    • You can also trim gaps in the alignment based on different criteria
      ml trimal/1.2
      trimal -in <inputfile> -out <outputfile> -gappyout

      _**Note:** Please be careful when using `trimal` automated modes, because they can trim out a lot of informative sites (e.g., leaving only 3 columns in the alignment), and such tiny leftover alignments will cause `raxml-ng` to fail with "ERROR in SPR round (LIBPLL-2240): BL opt converged to a worse likelihood score ..."._  

      Quick fix: use the original alignment from MAFFT mentioned above.
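For a single gene, the align-and-trim part of the one-stop script can be sketched as below. The file names, the 0.5 occupancy threshold, and the pxclsq flags are illustrative assumptions; as before, this is a dry run that only prints the commands.

```shell
# Prints the per-gene MAFFT alignment and phyx gap-removal commands.
align_gene() {
  g=$1           # e.g. gene001.FNA
  b=${g%.FNA}    # basename without extension
  echo "mafft --auto ${g} > ${b}.aligned.fasta"
  echo "pxclsq -s ${b}.aligned.fasta -p 0.5 -o ${b}.cln.fasta"
}
```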

8. After you are done with HybPiper, you'd better run a cleanup under "sequence_dir" to remove tons of intermediate results, saving space on the HPC

  • need to put this script in the "sequence_dir" folder
    bash XXXnamelist.txt

  • and a list with all IDs (used as "$file" in the script), one per line, inside XXXnamelist.txt

  • the names list is the same as the folder names under "sequence_dir"

    head XXXnamelist.txt
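HybPiper ships a cleanup.py that removes the bulky per-sample SPAdes working directories; the elided script is assumed to loop it over the name list, roughly like this dry-run sketch:

```shell
# Prints one cleanup command per sample folder; remove `echo` to execute.
cleanup_loop() {
  while read -r name; do
    echo python cleanup.py "$name"
  done < "${1:-XXXnamelist.txt}"
}
```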


skip this step if you already have outgroup data from Target Enrichment or don't need 1KP data

Besides the data generated from Target Enrichment with the 353 universal probe set, I also included some species with 1KP transcriptome data as outgroups.

Since I had no prior knowledge of how compatible the 1KP transcriptome data would be with alignments of the 353 nuclear genes, I used the reference sequences of the 353 nuclear genes to assemble the 1KP data of those outgroup species in two ways. Then I aligned them, compared them, and selected the better one, or chose the consensus sequence using Geneious.


9. Run raxml-ng

(1). Three scripts are used (./Scripts/raxml-ng/):


  • raxml_NG_check.sbatch

  • raxml_NG_model.sbatch



These three scripts run sequentially. Provided with a list of all the genera, they will go through each genus folder, create a "raxml" folder (where the raxml tree reconstruction will happen), and look up how many gene alignments were assembled for each genus; these numbers will be inserted as array-job parameters for the first raxml script, raxml_NG_check.sbatch.

For each alignment, raxml_NG_check.sbatch will run raxml-ng "--parse" checking, in order to:

  • Run an MSA sanity check (see the Tutorial)

  • Compress alignment patterns into RAxML binary alignment format (an .rba file), which loads faster in raxml-ng than FASTA or PHYLIP (see the Tutorial)

  • Get the estimated computational resources (e.g., model, memory, and optimal number of CPUs/threads)

If the script detects that an alignment requires more memory (default is 1g) or more threads (default is 1), it will launch the third script, raxml_NG_model.sbatch; otherwise it completes the job in the current script with the default resource request.

If the third script raxml_NG_model.sbatch is launched, it submits a new, independent SLURM job whose resource request is updated based on the "--parse" results from the raxml_NG_check.sbatch script.
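The resource decision described above can be sketched as a small helper that reads the raxml-ng --parse log. The exact log wording varies across raxml-ng versions, so the grep patterns here are assumptions to adapt:

```shell
# Returns success (0) if the --parse log suggests more than the default
# 1g memory or 1 thread, i.e. the bigger raxml_NG_model.sbatch job is needed.
needs_big_job() {
  log=$1   # e.g. output of: raxml-ng --parse --msa gene.fasta --model GTR+G
  mem=$(grep -oE 'memory requirements *: *[0-9]+' "$log" | grep -oE '[0-9]+' | head -n1)
  threads=$(grep -oE 'number of threads[^:]*: *[0-9]+' "$log" | grep -oE '[0-9]+' | head -n1)
  [ "${mem:-0}" -gt 1024 ] || [ "${threads:-1}" -gt 1 ]
}
```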

(2). Post-run scripts used (./Scripts/raxml-ng/):


Note: You only need these scripts when your raxml-ng jobs have failed.

Visualization of Gene Tree Conflict With Pie Charts

10. Please go to my website for instructions on making PhyParts pie charts

If you run this on your own, you need Python 3, ETE3, and an X server installed.



Rebecca L. Stubbs & Johanna R. Jantzen
for sharing their modified scripts for running PhyPartsPieCharts

Andre A. Naranjo
for sharing SLURM job scripts for running HybPiper

Matt Gitzendanner
for helping with SLURM job scheduling, raxml-ng MPI issues, and other miscellaneous troubleshooting

