Instructions for extracting sequences from HTS target enrichment reads

title: Instructions for extracting sequences from HTS target enrichment reads
author: Miao Sun (cactusresponsible[AT]gmail.com)
date: July 16, 2019
output: html_document (default), pdf_document (default), word_document (default)

This page documents the steps for processing target enrichment reads, from raw reads all the way to phylogenetic tree reconstruction. Please also read the comments inside each script before execution; the scripts are all heavily commented.

Most scripts used here are written in bash/shell and R.

I also made some assumptions that:

General workflow

[Figure: workflow diagram]

Data

The data used in these instructions were generated by RAPiD Genomics.

Within the data directory there is a SampleSheet csv file with the barcodes, filenames, and sample codes.

Note that Plates1-4 were sequenced on two lanes (L001 and L002), so there are two sets of fastq files per sample.

Raw Data:

  • This data has been demultiplexed using Illumina's bcl2fastq. No quality trimming or other processing has been done beyond demultiplexing.

  • The adapters used are listed below; "BCBCBCBC" stands for the barcode.

    • i7: GATCGGAAGAGCACACGTCTGAACTCCAGTCAC-BCBCBCBC-ATCTCGTATGCCGTCTTCTGCTTG

    • i5: AATGATACGGCGACCACCGAGATCTACAC-BCBCBCBC-ACACTCTTTCCCTACACGACGCTCTTCCGATCT

Assembly methods

Currently, there are three ways to analyze high-throughput sequencing reads from target enrichment:

  1. HybPiper
[Publication](https://bsapubs.onlinelibrary.wiley.com/doi/full/10.3732/apps.1600016)  
[Code on GitHub](https://github.com/mossmatters/HybPiper)  
  2. aTRAM
[Publication](https://journals.sagepub.com/doi/10.1177/1176934318774546)  
[Code on GitHub](https://github.com/moskalenko/aTRAM)  
  3. SECAPR
[Publication](https://peerj.com/articles/5175/)  
[Code on GitHub](https://github.com/AntonelliLab/seqcap_processor)  

#######################################

In this tutorial, I only focus on HybPiper

#######################################

HybPiper

Preprocess:

  1. Concatenate all lanes (L001 and L002; only if you have them on separate plates!)
    Example,
`cat RAPiD-Genomics_F076_UFL_###_P003_WD02_i5-503_i7-72_S22_L001_R1_001.fastq.gz RAPiD-Genomics_F076_UFL_###_P003_WD02_i5-503_i7-72_S60_L002_R1_001.fastq.gz > P003_WD02_72_R1.fastq.gz`  

Or process all samples in one batch:

`bash fastq_lane_cat.sh sample_ID_file Seq_ID_table`  

Example,

`bash fastq_lane_cat.sh Evgeny_13.txt UFL_394803_SampleSheet.csv`  

This bash script takes two input files: a sample ID file and a sequence ID table. The format and content of each file are shown below:

Example,

    [cactus]$ head -6 XXX_88.txt
    CPG00213
    CPG00216

    [cactus]$ head -6 UFL_XXX_SampleSheet_XXX86.csv
    RG_Sample_Code,Customer_Code,i5_Barcode_Seq,i7_Barcode_Seq,Sequence_Name,Sequencing_Cycle
    UFL_394803_P002_WG08,D_4566,TAAGATTA,TTCACGCA,RAPiD-Genomics_F076_UFL_394803_P002_WG08_i5-506_i7-68_S171_L001_R1_001.fastq.gz,2x150
    ...
    UFL_394803_P002_WG12,D_4571,TAAGATTA,CGCATACA,RAPiD-Genomics_F076_UFL_394803_P002_WG12_i5-506_i7-42_S175_L001_R1_001.fastq.gz,2x150
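
The lane concatenation described above can be sketched as a small bash function. This is a minimal illustration, not the repository's `fastq_lane_cat.sh`: the function name `cat_lanes` and the assumption that the sample code (e.g. `P003_WD02`) appears between underscores in the file name are hypothetical, inferred from the example file names above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge the L001 and L002 files of one sample and one
# read direction into a single file.
# Usage: cat_lanes <raw_dir> <sample_code> <R1|R2> <out_file>
cat_lanes() {
    local raw_dir=$1 sample=$2 read=$3 out=$4
    local lane f
    local files=()
    # Collect lane 1 first, then lane 2, so the order is the same for R1 and R2.
    for lane in L001 L002; do
        for f in "$raw_dir"/*_"${sample}"_*_"${lane}"_"${read}"_001.fastq.gz; do
            # An unmatched glob stays literal, so keep only files that exist.
            [ -e "$f" ] && files+=("$f")
        done
    done
    if [ "${#files[@]}" -eq 0 ]; then
        echo "no files found for ${sample} ${read}" >&2
        return 1
    fi
    # Concatenated gzip members form a valid gzip stream, so the files can be
    # joined without decompressing them.
    cat "${files[@]}" > "$out"
}
```

For example, `cat_lanes raw_data P003_WD02 R1 P003_WD02_72_R1.fastq.gz` would write both lanes of read 1 to one file; the same call with `R2` handles the reverse reads.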

  1. Run FastQC for a quick quality check; the results can be used later for comparison after trimming and cleaning.
  • scripts needed:
    fastqc.sh check_result.sh mean.R

Example,

module load ufrc fastqc
srundev -t time
fastqc *.gz -o FastQC_result

For slurm job scripts see:

fastqc.sbatch [./Scripts/fastqc/fastqc.sbatch]

  • After running fastqc.sh, the FastQC results are placed in a folder called FastQC_result;

  • Copy the scripts check_result.sh and mean.R into FastQC_result, then execute the bash script; it will generate a summary table of read quality, Illumina_FastQC_report.csv. For other details, see the folder unzip_file.

    note:

    • check_result.sh and mean.R work together, so they must be in the same directory

    • the R script is invoked automatically; you don't need to modify anything.

    • here is an example command (assuming you are in the FastQC_result folder)

      cp /path/to/scripts/check_result.sh /path/to/scripts/mean.R .

      bash check_result.sh
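
For a quick look at the results without the R step, the per-sample `summary.txt` files that FastQC writes can be tallied directly. The sketch below is not `check_result.sh` and does not reproduce Illumina_FastQC_report.csv; the function name `summarize_fastqc` is hypothetical, and it assumes the FastQC zip archives have already been unpacked into `*_fastqc/` folders, each holding a tab-separated `summary.txt` (status, module, file name).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count PASS/WARN/FAIL modules per sample from
# unpacked FastQC output folders.
# Usage: summarize_fastqc <dir_with_unzipped_fastqc_folders>
summarize_fastqc() {
    local dir=$1 s
    echo "file,pass,warn,fail"
    for s in "$dir"/*_fastqc/summary.txt; do
        [ -e "$s" ] || continue
        # Column 1 is the status, column 3 the sequence file name.
        awk -F'\t' '
            { n[$1]++; name = $3 }
            END { printf "%s,%d,%d,%d\n", name, n["PASS"], n["WARN"], n["FAIL"] }
        ' "$s"
    done
}
```

Samples with a non-zero `fail` column are the ones worth opening in the full HTML report.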

  1. Trim and clean reads using Trimmomatic to prepare for the next step, HybPiper.
  • scripts needed:

    Trimmomatic.sbatch

    If you have only a few samples, you can simply run bash Trimmomatic.sh on a dev node; there is no need to schedule a SLURM job.

    For a large number of samples, submit the job to SLURM on the HPC.

  • run: sbatch Trimmomatic.sbatch
    modify the resources requested to suit your samples

  • Note:

    • You also need to run _fastqc.sh_ again to make sure all the adapters and low-quality reads have been removed.
    • Keep in mind that you need to change the sequence file extension in the script if your files end in .fastq rather than .gz; FastQC is quite flexible about sequence file formats.
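
As a rough illustration of what a paired-end Trimmomatic call looks like, the function below only prints the command for one sample so it can be inspected before running. It is a sketch, not the content of Trimmomatic.sbatch: the function name `trimmomatic_cmd`, the `trimmomatic` wrapper being on the PATH, the adapter FASTA file, the input/output file names, and the trimming thresholds (taken from the standard Trimmomatic example settings) are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a paired-end Trimmomatic command for one sample.
# Outputs go to "paired" and "unpaired" folders, which must already exist.
# Usage: trimmomatic_cmd <sample_prefix> <adapter_fasta>
trimmomatic_cmd() {
    local sample=$1 adapters=$2
    # Argument order for PE mode: two inputs, then paired/unpaired outputs
    # for R1, then paired/unpaired outputs for R2, then the trimming steps.
    echo trimmomatic PE -phred33 \
        "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz" \
        "paired/${sample}_R1_paired.fastq.gz" "unpaired/${sample}_R1_unpaired.fastq.gz" \
        "paired/${sample}_R2_paired.fastq.gz" "unpaired/${sample}_R2_unpaired.fastq.gz" \
        "ILLUMINACLIP:${adapters}:2:30:10" LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
}
```

Calling `trimmomatic_cmd P003_WD02_72 adapters.fa` prints one command line; once it looks right, it can be pasted into a job script.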

Sequence Assembly using HybPiper:

  1. Run HybPiper
  • Run this on a dev node:

    ml ufrc

    srundev -t xxx

  • You need to put this script in the "trimmed_data" folder

  • You should have the "paired" and "unpaired" folders generated by Trimmomatic

  • The reference sequences are already shared here: /ufrc/soltis/share/Miao/Angiosperms353_targetSequences.fasta

    bash HybPiper_array_pairedonly.sh

  • Or, if you have many samples, schedule a job in SLURM:

    sbatch HybPiper_array_pairedonly.sbatch
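
To make the per-sample loop concrete, here is a sketch that prints one HybPiper assembly command per sample ID. It is not the content of HybPiper_array_pairedonly.sh: the function name `hybpiper_cmds` and the paired read file names are hypothetical, and the command line assumes the HybPiper 1.x script `reads_first.py` with a nucleotide target file (hence `--bwa`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a reads_first.py command for every sample ID
# listed (one per line) in a name list.
# Usage: hybpiper_cmds <namelist.txt> <target_fasta>
hybpiper_cmds() {
    local namelist=$1 targets=$2 name
    # The extra test keeps a last line that lacks a trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue   # skip blank lines
        echo "python HybPiper/reads_first.py -b $targets -r paired/${name}_R1_paired.fastq.gz paired/${name}_R2_paired.fastq.gz --prefix $name --bwa"
    done < "$namelist"
}
```

Printing the commands first makes it easy to check paths for a couple of samples before committing a whole array job.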

  1. If you want introns, run the intron script on the accession folders output by the previous step.
    This step is skipped here; please see the HybPiper manual.

  2. When the first HybPiper script has finished, the results are sufficient for computing some statistics to evaluate them. At this step, we do three things:

- Generate sequence length table using script from HybPiper --- `get_seq_lengths.py`  

- Assembling stats for each gene and each sample using script from HybPiper --- `hybpiper_stats.py`  

- Use the R script `gene_recovery_heatmap.R` to generate a heatmap showing which genes were recovered for each sample
  • All three functions are wrapped in a bash script called HybPiper_summary.sh; run:

    bash HybPiper_summary.sh XXX

  • Note:

  • Here "XXX" is the prefix string for all the results generated
  • Prepare a plain text file, "XXXnamelist.txt" (it must be named this way unless you modify the script), to store the sequence IDs.

Example:

`[cactus]$ head -3 XXXnamelist.txt`  
    P002_WA01_59  
    P002_WA02_27  
    P002_WA03_82  
  • The R script should be in the "./Script/" directory; otherwise, modify the path
  • Afterwards, you should have these files in your current directory:
    • XXXseq_length.txt
    • XXX_assemble_stats.txt
    • XXX_heatmap.pdf
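
Before opening the heatmap, a quick per-sample count of recovered genes can be read straight from the sequence length table. The sketch below assumes the table layout written by HybPiper 1.x `get_seq_lengths.py` (tab-separated; first row gene names, second row mean target lengths, then one row per sample); the function name `genes_recovered` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count, for each sample, the genes with a non-zero
# recovered length.
# Usage: genes_recovered <XXXseq_length.txt>
genes_recovered() {
    # Rows 1 and 2 hold gene names and mean target lengths, so start at row 3.
    awk -F'\t' 'NR > 2 {
        n = 0
        for (i = 2; i <= NF; i++) if ($i + 0 > 0) n++
        print $1 "\t" n
    }' "$1"
}
```

Samples with very low counts are candidates for a closer look at read depth or trimming before alignment.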
  1. Before retrieving the supercontig sequences (the assembled sequence for each gene) from the first run above, put all the Seq_ID folders in one place (e.g., mv P*W* seq_dir):
  • If you want to run each individually:

module load python

python HybPiper/retrieve_sequences.py baits1.fasta seq_dir dna
For exons only, use dna; if you ran intronerate, use supercontig.

  • If you want a one-stop shop, run the comprehensive bash script:

    bash Extract_seq_align_supmtx_rename.sh XXX

  • Note:
    • This bash script will:
      1) retrieve sequences;
      2) invoke MAFFT to do the alignment;
      3) combine all genes into one supermatrix and rename the sequences, using phyx to remove sites with 50% gaps (can be modified);
      4) generate a gene-by-sample presence/absence binary matrix

    • You also need to prepare a csv file with "samples_ID,Seq_ID" or "species_ID,Seq_ID" pairs, depending on how you want your sequences and matrix labeled in later summaries and naming.

      Example:  
      `[cactus]$ head -3 XXX.csv`  
      
        ZYP_10,P002_WA04_7  
        ZYP_12,P002_WA05_38  
        ZYP_151_152,P002_WD03_30  
      
    • You can also run the MAFFT script on an individual gene (modify as needed)
      bash Mafft_alignment.sh

    • You can also trim gaps in the alignment based on different criteria
      ml trimal/1.2
      trimal -in <inputfile> -out <outputfile> -gappyout
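
The presence/absence matrix mentioned in item 4) above can be illustrated with a short awk-based function. This is a sketch, not the logic of Extract_seq_align_supmtx_rename.sh: the function name `presence_matrix` is hypothetical, and it assumes each per-gene FASTA file (e.g. the `.FNA` files from `retrieve_sequences.py`) uses the sample ID as the first word of each header line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a gene-by-sample 0/1 matrix (tab-separated).
# Usage: presence_matrix <namelist.txt> <gene1.FNA> [<gene2.FNA> ...]
presence_matrix() {
    local namelist=$1
    shift
    awk '
        # First file: the list of sample IDs, one per line.
        FNR == NR { if ($0 != "") order[++ns] = $0; next }
        # Each following file is one gene; empty files are not listed.
        FNR == 1  { genes[++ng] = FILENAME }
        # FASTA header: the sample ID is the first word after ">".
        /^>/      { seen[ng, substr($1, 2)] = 1 }
        END {
            line = "gene"
            for (s = 1; s <= ns; s++) line = line "\t" order[s]
            print line
            for (g = 1; g <= ng; g++) {
                line = genes[g]
                for (s = 1; s <= ns; s++)
                    line = line "\t" (((g, order[s]) in seen) ? 1 : 0)
                print line
            }
        }
    ' "$namelist" "$@"
}
```

A call such as `presence_matrix XXXnamelist.txt *.FNA > presence.tsv` gives one row per gene file and one column per sample.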

  1. After you are done with HybPiper, you should run clean_up.sh under "sequence_dir" to remove the many intermediate results and save space on the HPC
  • You need to put this script in the "sequence_dir" folder
    bash clean_up.sh XXXnamelist.txt

  • XXXnamelist.txt holds the list of all IDs (read by the script as "$file"), one per line

  • The names in the list are the same as the folder names under "sequence_dir"

    head XXXnamelist.txt
    P002_WG08_68
    P002_WG09_41
    P002_WG10_37

Outgroup
Skip this step if you already have outgroup data from target enrichment or don't need 1kP data.

Besides the data generated from target enrichment with the 353 universal probe set, I also included some species with 1kP transcriptome data as outgroups.

Because I had no prior knowledge of how compatible the 1kP transcriptome data would be with alignments of the 353 nuclear genes, I used the reference sequences of the 353 nuclear genes to assemble the 1kP data of those outgroup species in two ways. I then aligned the results, compared them, and selected the better one, or chose the consensus sequence using Geneious.

RAxML-NG
10. Run raxml

Three scripts used (./Scripts/raxml-ng/):

  • raxmlng_laucher.sh

  • raxml_NG_check.sbatch

  • raxml_NG_model.sbatch

These three scripts run sequentially. Given a list of all the genera, raxmlng_laucher.sh goes through each genus folder, creates a "raxml" folder (where the RAxML tree reconstruction will happen), and counts how many gene alignments were assembled for each genus; this number is passed as the array job parameter to the first RAxML script, raxml_NG_check.sbatch.

For each alignment, raxml_NG_check.sbatch runs raxml-ng with "--parse" in order to:

  • MSA sanity check (see Tutorial)

  • Compress alignment patterns as RAxML Binary Alignment (.rba file)
    It loads faster in raxml-ng than FASTA or PHYLIP (see Tutorial)

  • Get the estimated computational resources (e.g., model, memory, and optimal number of CPUs/threads)

If the script detects that an alignment requires more memory (default is 1g) or more threads (default is 1), it launches the third script, raxml_NG_model.sbatch; otherwise, it completes the job within the current script using the default computational resource request.

If the third script raxml_NG_model.sbatch is launched, it submits a new, independent SLURM job with an updated computational resource request based on the "--parse" results from the raxml_NG_check.sbatch script.
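
The decision between finishing in the current job and launching raxml_NG_model.sbatch can be sketched as a check on the "--parse" log. This is an illustration only, not the logic in raxml_NG_check.sbatch: the function name `needs_more_resources` is hypothetical, and it assumes the log contains lines of the form `* Estimated memory requirements : 2 MB` and `* Recommended number of threads / MPI processes: 2`, with the defaults of 1g memory and 1 thread taken from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) when the raxml-ng --parse log
# asks for more than 1024 MB of memory or more than 1 thread.
# Usage: needs_more_resources <prefix.raxml.log>
needs_more_resources() {
    awk -v max_mb=1024 -v max_threads=1 '
        /Estimated memory requirements/ {
            mb = $(NF - 1)                  # value precedes the unit
            if ($NF == "GB") mb *= 1024     # normalise to MB
        }
        /Recommended number of threads/ { threads = $NF }
        END { exit !(mb > max_mb || threads > max_threads) }
    ' "$1"
}
```

In a job script this could gate the relaunch, for example `if needs_more_resources aln.raxml.log; then sbatch raxml_NG_model.sbatch; fi`, where `aln.raxml.log` is a placeholder for the log written under the chosen `--prefix`.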

Visualization of Gene Tree Conflict With Pie Charts

  1. Please go to my website for instructions on making PhyParts pie charts

Phyparts_Piecharts

Acknowledgements

Rebecca L. Stubbs & Johanna R Jantzen

sharing their modified scripts for running PhyPartsPieCharts

Andre A Naranjo
sharing slurm job scripts for running HybPiper

Matt Gitzendanner
helping with slurm job scheduling, raxml-ng MPI issues, and other miscellaneous troubleshooting
