Please cite the article if you use the pipeline or dataset.
Pipeline/Scripts for evaluating the impact of repetitive elements on coalescent demographic inferences.
-BWA
- SAMTOOLS
- BCFTOOLS
- BEDTOOLS
- PSMC
- SEQTK
- MSMC
- MSMC-TOOLS
- R
**Add mod_dec.pl, decode2bed.pl and psmc_plot.pl file to the path and make sure it is available globally on the PC.
Prior to assessing the quality of genomic regions for performing coalescent analysis, we need to prepare an alignment of the short read data against a reference genome in the bam format. While this can be done using a short-read alignment program of your choice, we recommend using the coalmap command that uses the bwa read mapper to perform the read mapping. The following programs need to be installed and present in the PATH: bwa, samtools
coalmap -g genome -f fastqread1 -r fastqread2 -p prefix -n threads
As an example, we will download the genome assembly of the *Populus trichocarpa* from the NCBI website and use the publicly available short read data from the SRA dataset.
#Download genome reference file
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/775/GCF_000002775.4_Pop_tri_v3/GCF_000002775.4_Pop_tri_v3_genomic.fna.gz
#Uncompress the reference fasta file
gzip -d GCF_000002775.4_Pop_tri_v3_genomic.fna.gz
#Download short read data files
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR625/009/SRR6256359/SRR6256359_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR625/009/SRR6256359/SRR6256359_2.fastq.gz
#Use coalmap command to index the genome and map the short read data.
coalmap -g GCF_000002775.4_Pop_tri_v3_genomic.fna -f SRR6256359_1.fastq.gz -r SRR6256359_2.fastq.gz -p pt -n 24
The following information about the genome assembly used and percent of reads mapped to the genome are reported.
Percent of N's or hardmasked bases is 0.
Analysis will be performed by using this fasta file but make sure you are not using hardmasked genome.
Now, mapping the reads to the reference genome.
Mapping finished
The mapped percentage for this assembly is 98.91%.
Now, calculating coverage.
The mean coverage of reads is 27.1474
#Download the repeat masker output file from NCBI (or generate your own by running repeat masker).
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/775/GCF_000002775.4_Pop_tri_v3/GCF_000002775.4_Pop_tri_v3_rm.out.gz
Uncompress the repeat masker output and convert it to a BED format file. Note that the repeat type is encoded as a the fourth column.
gzip -d GCF_000002775.4_Pop_tri_v3_rm.out.gz
cat GCF_000002775.4_Pop_tri_v3_rm.out|tail -n +4|awk '{print $5,$6,$7,$11}'| sed 's/\?//g' | sed 's/\//_/g' |sed 's/ /\t/g' > GCF_000002775.4_Pop_tri_v3_rm.bed
Having created the read alignment file in bam format and the repeat location and type file in bed format, we are ready to start running the CoalRep module.
coalrep -g GCF_000002775.4_Pop_tri_v3_genomic.fna -b pt/pt.sort.bam -p pt -n 24 -m GCF_000002775.4_Pop_tri_v3_rm.bed -t 5 -u 3.75e-08 -P "4+25*2+4+6"
This will print on the screen some stuff like this
Approximate genome size is 439719174
Now, calculating coverage
The mean coverage of reads is 42.5397
The variants will be called now
[warning] samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[mpileup] 1 samples in 1 input files
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[warning] samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
Variants are ready
Making PSMC fasta from variants
Creating heterozygosity profile
Now, running PSMC
PSMC output is done
Now, creating plotting information
Now, running MSMC analysis
[warning] samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
parsing position 135
parsing position 10773
parsing position 20773
parsing position 30773
parsing position 40773
parsing position 50926
parsing position 60926
parsing position 70926
parsing position 80926
parsing position 90926
This will go on for quite a long time as it will perform PSMC and MSMC analyses on your dataset and explore the repeat families and its effects on the demography. After run finishes you will see some lines like this.
pdf
2
Finished running Coalrep
If you get this output at the end that means your run was successful and the results are produced and already kept in the results folder for your observation. In results folder there would be 6 folder namely, PSMC_plots, Recomb_events, Repeat_correlation, Repeat_Distribution_in_AI, Repeat_length, Ts_Tv. What results these folders contain and what to look for in these, their details are given below.
The results directory will have 6 subdirectories: 1.PSMC_plots, 2.Recomb_events, 3.Repeat_correlation, 4.Repeat_Distribution_in_AI, 5.Repeat_length and 6.Ts_Tv.
PSMC trajectories differ due to inclusion/exclusion of repeats. Extent of change in trajectories due to these repeats are assessed in these plots. PSMC plots showing repeat content across atomic intervals, and comparative msmc inferred curves. Filename strucure is followed like this: {bamfile}.{prefix}_repeat_effect_{bin_size}.pdf for masked/unmasked psmc runs and {bamfile}.{prefix}_repclass_PSMC_s100.pdf for all repeat class PSMC. File ending with repclass_PSMC_s100.pdf will have PSMC plot having all the repeat classes plotted in single panel to get comparative differences due to inclusion of each repeat class during PSMC analyses. There should be 5 plots, one for each bin sizes, one combined and one for repeat classes. Occurance of sufficient number of recombinations in each atomic interval after 20th interval is one of the necessities for accurate coalescent inferrence in PSMC. These plots will show if there are sufficient number of recombiantions for your PSMC runs. Plots for Number of recombinations in each interval across 25 iterations for each PSMC run. Filename structure for masked and unmasked genomes PSMC runs is followed like this: {bamfile}.{prefix}.{mask_status}.{bin_size}.nrcomb_RecQC.pdf. For repeat class PSMC runs filemname structure is followed like this: {repeat_class}.nrcomb_RecQC.pdf. There will be 6 plots for masked/unmasked PSMC runs and equal number of plots corresponding to repeat classes. Correlation for repeat content in atomic intervals with differences in Effective population sizes (Ne) is tested using regression and Kendall's correlation test. If repeats are really affecting the PSMC inferrence, this correlation should be high. Filename structure is followed like this: {bamfile}.{prefix}_correlation_repeat_{bin_size}.pdf. In total 4 plots should be there. Contribution or abundance of repeat classes across intervals in each interval and across genome is shown to better assess which repeat classes have higher contribution to that atomic interval. Filename structure is followed like this: {bin_size}.AI_perc.pdf for abundance in each interval and {bin_size}.AI_perc_geno.pdf for abundance across genome. In total there should be 6 plots, 2 for each bin size. Lengths of sequences used for inference of Ne in each atomic interval varies. The distribution of lengths of sequences across atomic intervals is plotted for each PSMC run. Filename structure for masked/unmasked PSMC is followed like this: {bamfile}.{prefix}.{mask_status}.{bin_size}.RepLength_{bin_size}.pdf and for repeat class PSMC runs a single file RepLength_repclass.pdf will be produced. In total there should be 7 plots, 2 for each bin size and one for repeats. Heterozygosity and Ts/Tv ratio of corresponding atomic interval is calculated and plotted. Distribution of heterozygosity is expected to increase across intervals in increasing order of atomic itervals but corresponding Ts/Tv values should be randomly distributed. Filename structure is followed like this: {bamfile}.{prefix}.rep_tstv.{bin_size}.pdf There should be 3 plots in total one for each bin size.