Instructions for extracting sequences from HTS target enrichment reads
Miao Sun ( _cactusresponsible[AT]gmail.com_)
July 16, 2019
This page documents the steps for processing target enrichment reads, from raw reads all the way to phylogenetic tree reconstruction. Please also read the comments inside each script before execution, since they are all heavily commented.
Most scripts used here are written in bash/shell and R.
I also make some assumptions:
You are working on HiPerGator at the University of Florida.
You apply the same rules I used to name the files, and your file tree is the same as mine (see below).
The data used in this instruction was generated by RAPiD Genomics.
Within the data directory there is a SampleSheet csv file with the barcodes, filenames, and sample codes.
Note that Plates1-4 were sequenced on two lanes (L001 and L002), so there are two sets of fastq files per sample.
This data has been demultiplexed using Illumina's bcl2fastq. No quality trimming or processing has been done beyond demultiplexing.
The adapters used are below, "BCBCBCBC" stands for the barcodes.
Currently, there are three ways you can analyze high-throughput sequencing reads from target enrichment:
[Publication](https://bsapubs.onlinelibrary.wiley.com/doi/full/10.3732/apps.1600016) [Code in github](https://github.com/mossmatters/HybPiper)
[Publication](https://journals.sagepub.com/doi/10.1177/1176934318774546) [Code in github](https://github.com/moskalenko/aTRAM)
[Publication](https://peerj.com/articles/5175/) [Code in github](https://github.com/AntonelliLab/seqcap_processor)
In this tutorial, I focus only on HybPiper.
- Concatenate all lanes (L001 and L002; only if you have them on separate plates!)
`cat RAPiD-Genomics_F076_UFL_###_P003_WD02_i5-503_i7-72_S22_L001_R1_001.fastq.gz RAPiD-Genomics_F076_UFL_###_P003_WD02_i5-503_i7-72_S60_L002_R1_001.fastq.gz > P003_WD02_72_R1.fastq.gz`
or run in a batch manner:
`bash fastq_lane_cat.sh sample_ID_file Seq_ID_table`
`bash fastq_lane_cat.sh Evgeny_13.txt UFL_394803_SampleSheet.csv`
This bash script takes two input files: a sample ID file and a sequence ID table. The format and content of each file are shown below:
`head -6 XXX_88.txt`
`head -6 UFL_XXX_SampleSheet_XXX86.csv`
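To make the logic concrete, the per-sample lane concatenation can be sketched as a small shell loop (an illustrative sketch, not the actual fastq_lane_cat.sh; the file names below are dummies). Note that gzip streams concatenate validly, so `cat` works directly on .fastq.gz files; plain-text demo files are used here so the loop can be run anywhere.

```shell
# Illustrative sketch of per-sample lane concatenation (not the real fastq_lane_cat.sh).
# Create two tiny demo "lane" files standing in for the real fastq.gz files:
mkdir -p lanes
printf '@read1\nACGT\n+\nIIII\n' > lanes/P003_WD02_72_L001_R1.fastq
printf '@read2\nTGCA\n+\nIIII\n' > lanes/P003_WD02_72_L002_R1.fastq

for l1 in lanes/*_L001_R1.fastq; do
    # derive the matching lane-2 file and the merged output name
    l2=$(printf '%s\n' "$l1" | sed 's/_L001_/_L002_/')
    out=$(printf '%s\n' "$l1" | sed 's/_L001_R1/_R1/')
    cat "$l1" "$l2" > "$out"
done
```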
- Run FastQC to quickly check read quality; the results can later be used for comparison after trimming and cleaning.
- Scripts needed: fastqc.sh, check_result.sh, mean.R
`module load ufrc fastqc`
`srundev -t time`
`fastqc *.gz -o FastQC_result`
For slurm job scripts see:
After running fastqc.sh, the FastQC results will be put into a folder called FastQC_result.
Copy the scripts check_result.sh and mean.R into FastQC_result, then execute the bash script; it will generate a summary table, Illumina_FastQC_report.csv, for read quality. For other details, see the folder unzip_file.
check_result.sh and mean.R work together, so you have to put them in the same directory;
the R script is invoked automatically, and you don't need to modify anything.
Here is an example command (assuming you are in the FastQC_result folder):
`cp /path/to/scripts/check_result.sh /path/to/scripts/mean.R .`
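To give an idea of what such a summary step looks like, here is a minimal sketch of aggregating FastQC results (assumed logic, not the actual check_result.sh/mean.R pair): each FastQC output folder contains a summary.txt with one PASS/WARN/FAIL line per module, which can be tallied into a CSV.

```shell
# Minimal sketch of FastQC result aggregation (not the real check_result.sh).
# Each <sample>_fastqc folder holds a summary.txt with PASS/WARN/FAIL per module;
# a dummy folder is created here so the loop is runnable anywhere.
mkdir -p FastQC_demo/sampleA_fastqc
printf 'PASS\tBasic Statistics\tsampleA.fastq.gz\nWARN\tPer base sequence content\tsampleA.fastq.gz\nFAIL\tAdapter Content\tsampleA.fastq.gz\n' \
    > FastQC_demo/sampleA_fastqc/summary.txt

echo 'sample,pass,warn,fail' > FastQC_demo_report.csv
for d in FastQC_demo/*_fastqc; do
    name=$(basename "$d" _fastqc)
    pass=$(grep -c '^PASS' "$d/summary.txt")
    warn=$(grep -c '^WARN' "$d/summary.txt")
    fail=$(grep -c '^FAIL' "$d/summary.txt")
    echo "$name,$pass,$warn,$fail" >> FastQC_demo_report.csv
done
```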
- Trim and clean reads using Trimmomatic, and prepare for the next step, HybPiper.
If you have only a few samples, you can just run `bash Trimmomatic.sh` on a dev node; it is not necessary to schedule a SLURM job.
For a large number of samples, submission to SLURM on the HPC is required;
modify the requested resources to suit your samples.
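For reference, a typical paired-end Trimmomatic invocation looks like the following (the parameter values follow the example in the Trimmomatic manual and are not necessarily what Trimmomatic.sh uses; adjust the adapter file and thresholds for your data):

```shell
java -jar trimmomatic-0.39.jar PE \
  P003_WD02_72_R1.fastq.gz P003_WD02_72_R2.fastq.gz \
  paired/P003_WD02_72_R1_paired.fastq.gz unpaired/P003_WD02_72_R1_unpaired.fastq.gz \
  paired/P003_WD02_72_R2_paired.fastq.gz unpaired/P003_WD02_72_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```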
- You also need to run _fastqc.sh_ again, to make sure all the adapters and low-quality reads have been removed.
- Keep in mind you need to modify the file extensions if your sequence files are .fastq rather than .gz; FastQC itself is pretty flexible with sequence file formats.
Sequence Assembly Using HybPiper:
- Run HybPiper
Run this on a dev node via `srundev -t xxx`.
You need to put this script in the "trimmed_data" folder, which should contain the "paired" and "unpaired" folders generated by Trimmomatic.
The reference sequences are already shared here: /ufrc/soltis/share/Miao/Angiosperms353_targetSequences.fasta
If you have many samples, you need to schedule a SLURM job instead.
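As a sketch of what the HybPiper step runs per sample (based on HybPiper's documented reads_first.py interface; the read-file naming pattern below is an assumption matching the trimmed output layout used here):

```shell
module load python
while read -r name; do
  python HybPiper/reads_first.py \
    -b /ufrc/soltis/share/Miao/Angiosperms353_targetSequences.fasta \
    -r paired/"$name"_R1_paired.fastq paired/"$name"_R2_paired.fastq \
    --unpaired unpaired/"$name"_unpaired.fastq \
    --prefix "$name" --bwa
done < XXXnamelist.txt
```

The `--bwa` flag is used because the Angiosperms353 target file contains nucleotide sequences.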
If you want introns, run the intron script on the accession folders output by the previous step.
Skipped here; please see the HybPiper manual.
When the first HybPiper script has finished, the results generated are enough for some statistics to evaluate them. At this step, we do three things:
- Generate a sequence length table using the HybPiper script `get_seq_lengths.py`
- Compute assembly stats for each gene and each sample using the HybPiper script `hybpiper_stats.py`
- Use the R script `gene_recovery_heatmap.R` to generate a heatmap showing which genes were recovered for each sample
All three of these functionalities are wrapped in a bash script called HybPiper_summary.sh; you run:
`bash HybPiper_summary.sh XXX`
- Here "XXX" is the prefix string for all the generated results.
- Prepare a plain text file, "XXXnamelist.txt" (it has to be named this way, or you can modify the script as you wish), to store the sequence IDs.
`[cactus]$ head -3 XXXnamelist.txt`
P002_WA01_59
P002_WA02_27
P002_WA03_82
- The R script should be in the "./Script/" directory; otherwise, modify the path.
- You should have these files in your current dir:
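The three steps wrapped by HybPiper_summary.sh correspond roughly to the following commands from the HybPiper distribution (paths and the "XXX" prefix are placeholders; the exact arguments inside the wrapper may differ):

```shell
python HybPiper/get_seq_lengths.py Angiosperms353_targetSequences.fasta XXXnamelist.txt dna > XXX_seq_lengths.txt
python HybPiper/hybpiper_stats.py XXX_seq_lengths.txt XXXnamelist.txt > XXX_stats.txt
Rscript ./Script/gene_recovery_heatmap.R   # edit the script to point at XXX_seq_lengths.txt
```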
- Before we retrieve the supercontig sequences (the assembled sequences for each gene) from the first run above, we need to put all the Seq_ID folders in one place (so: `mv P*W* seq_dir`):
- If you want to run each step individually:
`module load python`
`python HybPiper/retrieve_sequences.py baits1.fasta seq_dir dna`
For exons only, use "dna"; if you ran intronerate, use "supercontig".
If you want a one-stop shop, run a comprehensive bash script instead:
`bash Extract_seq_align_supmtx_rename.sh XXX`
This bash script will:
1) retrieve sequences;
2) invoke MAFFT to do the alignments;
3) combine all genes into one supermatrix, rename the sequences, and use phyx to remove sites with more than 50% gaps (this can be modified);
4) generate a gene-by-sample presence/absence binary matrix.
You also need to prepare a csv file with "samples_ID,Seq_ID" or "species_ID,Seq_ID", depending on how you want your sequences and matrix represented in later summaries and naming.
Example:
`[cactus]$ head -3 XXX.csv`
ZYP_10,P002_WA04_7
ZYP_12,P002_WA05_38
ZYP_151_152,P002_WD03_30
You can also run the MAFFT script on individual genes (modify as needed).
You can also trim gaps in the alignment based on different criteria:
`trimal -in <inputfile> -out <outputfile> -gappyout`
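As an illustration, running MAFFT and trimAl gene by gene can be sketched as a loop (the .FNA suffix matches the per-gene FASTA files written by retrieve_sequences.py in dna mode; the output suffixes are assumptions):

```shell
for gene in *.FNA; do
    mafft --auto "$gene" > "${gene%.FNA}_aligned.fasta"
    trimal -in "${gene%.FNA}_aligned.fasta" \
           -out "${gene%.FNA}_trimmed.fasta" -gappyout
done
```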
- After you are done with HybPiper, you should run `clean_up.sh` under "sequence_dir" to remove tons of intermediate results and save space on the HPC.
You need to put this script in the "sequence_dir" folder.
`bash clean_up.sh XXXnamelist.txt`
XXXnamelist.txt should list all the IDs, one per line;
the names in the list are the same as the folder names under "sequence_dir".
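The idea behind the clean-up can be sketched as follows (illustrative only, not the actual clean_up.sh; the intermediate directory name "spades_dir" and the folder layout are assumptions made for this demo):

```shell
# Illustrative clean-up sketch (not the real clean_up.sh): for every sample ID
# in the namelist, delete assumed intermediate assembly directories.
# Build a tiny demo tree so the loop is runnable anywhere:
mkdir -p sequence_dir/P002_WA01_59/gene001/P002_WA01_59/spades_dir
touch sequence_dir/P002_WA01_59/gene001/P002_WA01_59/spades_dir/contigs.fasta
printf 'P002_WA01_59\n' > demo_namelist.txt

while read -r id; do
    # remove the intermediate dirs under every gene folder of this sample
    rm -rf sequence_dir/"$id"/*/"$id"/spades_dir
done < demo_namelist.txt
```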
Skip this step if you already have outgroup data from target enrichment or don't need 1KP data.
Besides the data generated from target enrichment with the 353 universal probe set, I also included some species with 1KP transcriptome data as outgroups.
Since I had no prior knowledge of how compatible the 1KP transcriptome data would be with alignments of the 353 nuclear genes, I used the reference sequences of the 353 nuclear genes to assemble the 1KP data of those outgroup species in two ways. I then aligned the results, compared them, and either selected the better one or chose the consensus sequence using Geneious.
10. Run RAxML
Three scripts are used (./Scripts/raxml-ng/):
These three scripts run sequentially. Given a list of all the genera, raxmlng_laucher.sh goes through each genus folder, creates a "raxml" folder (where the RAxML tree reconstruction will happen), and checks how many gene alignments were assembled for each genus; these numbers are inserted as array-job parameters for the first raxml script.
For each alignment, raxml_NG_check.sbatch runs the raxml-ng "--parse" check in order to:
- sanity-check the MSA (see the raxml-ng tutorial);
- compress alignment patterns into the RAxML Binary Alignment format (an .rba file), which loads faster in raxml-ng than FASTA or PHYLIP (see the tutorial);
- get the estimated computational resources (e.g., model, memory, and optimal number of CPUs/threads).
If the script detects that an alignment requires more memory (default is 1g) or more threads (default is 1), it launches the third script, raxml_NG_model.sbatch; otherwise it completes the job in the current script with the default resource configuration.
If the third script raxml_NG_model.sbatch is launched, it submits a new, independent SLURM job with an updated resource request based on the "--parse" results.
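For orientation, the "--parse" check and a subsequent full analysis look like this in raxml-ng (the flags follow the raxml-ng documentation; the alignment name, model, seed, and thread count are placeholders):

```shell
# Sanity-check the MSA and write the compressed .rba file, plus the
# estimated memory requirement and optimal thread count:
raxml-ng --parse --msa gene001_aligned.fasta --model GTR+G

# Then run tree search plus bootstrapping on the binary alignment:
raxml-ng --all --msa gene001_aligned.fasta.raxml.rba \
         --model GTR+G --bs-trees 100 --threads 2 --seed 42
```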
Visualization of Gene Tree Conflict With Pie Charts
- Please go to my website for instructions on making PhyParts pie charts.
sharing their modified scripts for running PhyPartsPieCharts
Andre A Naranjo
sharing SLURM job scripts for running HybPiper
helping with the SLURM job scheduling, raxml-ng MPI issues, and other miscellaneous troubleshooting