AlleleSorting

The R scripts for automated sorting of alleles obtained from polyploid genomes to distinct homeologs as done in Šlenker et al. 2021. Currently, we are able to sort alleles of suspected allotetraploids. However, future developments will focus on resolving genomes of higher ploidy levels as well as autopolyploids.

Identifying homeologs inherited from different parents

Find two pairs of alleles (sequences), in which the alleles are closest to each other within the pairs while more distant between the pairs

First, interallelic distances are computed from the branch lengths of the corresponding ML tree (computed by cophenetic function of R base package stats). Distances between alleles are computed as an average length of path connecting all possible allele pairs or trios. As next, we compare the computed distances. If an average distance between alleles within any two pairs is more than between_homeolog_distance (the desired threshold, see below)-time shorter than the average distance between alleles within any other possible arrangements, these pairs of alleles are considered to be unequivocally different and attributable to different homeologs. The treatment of each sample is written into a logfile.

distance	formula	tree A	tree B	tree C	tree D	tree E
12_34	average [ (a1_a2) + (a3_a4) ]	2	10	14	10	6
13_24	average [ (a1_a3) + (a2_a4) ]	20	20	20	16	8
14_23	average [ (a1_a4) + (a2_a3) ]	20	20	20	16	8
123	average [ (a1_a2) + (a1_a3) + (a2_a3) ]	14	19.33	19.33	14	2.66
124	average [ (a1_a2) + (a1_a4) + (a2_a4) ]	14	19.33	19.33	14	8.66
134	average [ (a1_a3) + (a1_a4) + (a3_a4) ]	14	14	16.66	11.33	8.66
234	average [ (a2_a3) + (a2_a4) + (a3_a4) ]	14	14	16.66	16.66	8.66

lets set between_homeolog_distance = 4
Tree A:    The shortest distance (2) was found in the arrangement as a1 and a2 as a one pair and a3 and a4 as another pair. The second shortest distance was found in trios (14). This distance is more than 4-time longer (between_homeolog_distance), thus we consider pairs a1-a2 and a3-a4 as unequivocally different and attributable to different homeologs (in the following steps).
Trees B, C, D:    The shortest distance is not a 4-time shorter than the distance of any other possible arrangement. These alleles cannot be equivocally sorted to homeologs.
Tree E:    The shortest distance was found in a trio, what interrupts any further attempt to find two pairs of distinct alleles.

For more details about searching for the optimal between_homeolog_distance value, see Supplementary Text 1.

Sorting is successful (unequivocal) -> attribute to one parent

If the sorting is successful, we have two pairs. We don't know where they belong (to which subgenomes/parental genomes), the only thing we know is, that they are sufficiently similar within pairs and different between pairs.
The next step is to calculate the distance of each pair to a potential parent (Parent_A), e.g. the maternal parent suggested from cpDNA phylogeny. The pair closer to Parent_A is labelled as homeolog A, and its alleles as A1 and A2 (in random order). The pair more distant is labelled as homeolog B and its alleles as B1 and B2. The "*** conditions are met" is written to logfile.
Arguments use_between_parents_distance (logical) and between_parents_distance (numeric) can be used for more precise filtering. If use_between_parents_distance=TRUE, homeolog A has to be more than between_parents_distance closer to Parent_A than to homeolog B, to consider them as unequivocally sorted.

Sorting is NOT successful - mask/remove

If the sorting is not successful, sample alleles can be masked by using Ns on the positions of intra-allelic SNPs, or removed. Removing unsorted alleles is required for coalescent-based species tree estimation. However, if we are able to concatenate sequences to longer blocks (to genes, for example), partially masked sequences can be concatenated with sorted sequences to prolong alignment length. In this case, partially masked sequences will not disrupt phylogenetic analysis. On the contrary, these sequences can bring better support on the species level. After concatenation, the allele sorting into two homeologs has to be verified again.

As a result, you will get alignment(s) where suffixes of sequences of sorted samples are renamed to A1, A2, B1 and B2, indicating their assignment to parental genomes. Unsorted samples will be removed if if_below_treshold = "remove" is used. If you chose masking instead of removing, unsorted samples will be masked but renamed as A1, A2, B1 and B2 too, to allow following concatenation. The treatment of each sample is written into logfile. The "masking" approach homogenizes inter-allelic variation, however, if we are able to concatenate sequences to longer blocks (to genes, for example), even masked sequences can provide additional support for sample/species-level differentiation. The sorting into homeologs has to be verified after concatenation, to confirm unambiguity, or to remove the equivocal sample from the gene alignments.

The pipeline

Input files:

Aligned sequences in fasta format, all the same length
tetraploid samples, which should be sorted to two homeologs have to be named as follow:
        <4xSample>-a1
        <4xSample>-a2
        <4xSample>-a3
        <4xSample>-a4
        barC007_103-a1
        barC007_103-a2
        barC007_103-a3
        barC007_103-a4
where <4xSample>s have to be listed in a sample file (see bullet 3).

other sequences, or at least sequences of samples belonging to a diploid Parent A has to be named as:
        <2xSample>-a1
        <2xSample>-a2
        ambC014_101-a1
        ambC014_101-a2
Phylogenetic trees in newick format, built from aligned sequences.
A sample file, containing a list of tetraploid samples, which alleles have to be sorted to homeologs. Each sample in a separate line.
        <4xSample1>
        <4xSample2>
        barC007_103
        barC010_102
A species mapping file containing a list of diploid species and individuals (samples). Not all samples need to be listed here. To sort alleles of tetraploids, only samples of Parent A are required. However, if you want to use other tools, e.g. to calculate distances between identified homeologs of tetraploids to other diploid species, these other species have to be listed here. This mapping file should have one line per species, and each line needs to be in the following format:
        species1:<2xSample1>,<2xSample2>
        amara:ambC014_101,ambC024_103,ammC029_102
        matthioli:matGRM9
An exon/gene list to process. Aligned sequences and the corresponding phylogenetic trees must have the same basis of the file name. This is the list of those names. The script will iterate through this list and take a sequence and corresponding phylogenetic tree.

Pipeline Output:

Aligned sequences in fasta format. Alleles of tetraploids are either sorted or removed.
log file, containing a treatment of each sample for each sequence.

Running the pipeline

Use the Main.R script as a reference and edit it according to your needs. Set arguments and load functions from functions.R into the environment.

source('./functions.R')

Scripts depend on the following list of variables. You have to set them in the Main.R script. Please ensure they are set up correctly.

samples_2x_File: path to a species mapping file containing a list of diploid species and samples (see above).
samples_4x_File: path to a file containing a list of tetraploid samples, which alleles have to be sorted to homeologs (see above).
Parent_A: one of a potential parent. Species name has to be included in samples_2x_File. Can be represented by multiple samples (individuals) as specified in the species mapping file.
between_homeolog_distance: two allele pairs are considered to be different homeologs inherited from different parents if the distance between this arrangement of alleles is between_homeolog_distance-shorter than the distance of any other possible arrangement.
use_between_parents_distance and between_parents_distance: can be used for more precise filtering. If use_between_parents_distance=TRUE, homeolog A has to be more than between_parents_distance closer to Parent_A than to homeolog B, to consider them as unequivocally sorted.
treeDir: path to a directory containing phylogenetic trees.
seqDir: path to a directory containing aligned sequences.
patterns_exonsFile, patterns_genesFile: An exon/gene list to process.
if_below_treshold: two options - "remove" or "mask", see above.
ResDir: a results directory for sorted sequences and log file.
logfile: the log file, containing a treatment of each sample for each sequence.

Alleles are sorted to A and B homeologs using a sortAllels() function. If sorted sequences are concatenated, phylogenetic trees have to be computed and sorting has to be reevaluated again, using confirmSorting() function. Distances of each homeolog pair to diploid samples can be calculated using distanceTable() function.

The functions are implemented in a thread-safe way. A critical sections in code are protected by exclusive locking. Thus, multithreading (using a mclapply() function) is a preferred approach as it significantly speeds up computation time. Alternatively, use a for() loop to do all in a single thread.

Software Dependencies

R: https://www.r-project.org/

Following packages are required: ape, seqinr, and filelock. Within R command line use e.g. command

install.packages(pkgs=c("ape", "seqinr", "filelock"), repos="https://mirrors.nic.cz/R/", dependencies="Imports")

to install needed packages.

Installation

You can download this repository zipped (button on the right-hand side of the screen) or, if you have git installed on your system, clone it with:

git clone https://github.com/MarekSlenker/AlleleSorting.git

How to cite

Use the following formula: Allele sequences were sorted to homeologs using the scripts and following the workflow available at https://github.com/MarekSlenker/AlleleSorting, described in detail by Šlenker et al. (2021).

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
test_dataset		test_dataset
LICENSE		LICENSE
Main.R		Main.R
README.md		README.md
functions.R		functions.R
trees3.pdf		trees3.pdf
trees3.png		trees3.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_dataset

test_dataset

LICENSE

LICENSE

Main.R

Main.R

README.md

README.md

functions.R

functions.R

trees3.pdf

trees3.pdf

trees3.png

trees3.png

Repository files navigation

AlleleSorting

Identifying homeologs inherited from different parents

Find two pairs of alleles (sequences), in which the alleles are closest to each other within the pairs while more distant between the pairs

Sorting is successful (unequivocal) -> attribute to one parent

Sorting is NOT successful - mask/remove

The pipeline

Input files:

Pipeline Output:

Running the pipeline

Software Dependencies

Installation

How to cite

Other questions not covered here and reporting problems

About

Languages

License

MarekSlenker/AlleleSorting

Folders and files

Latest commit

History

Repository files navigation

AlleleSorting

Identifying homeologs inherited from different parents

Find two pairs of alleles (sequences), in which the alleles are closest to each other within the pairs while more distant between the pairs

Sorting is successful (unequivocal) -> attribute to one parent

Sorting is NOT successful - mask/remove

The pipeline

Input files:

Pipeline Output:

Running the pipeline

Software Dependencies

Installation

How to cite

Other questions not covered here and reporting problems

About

Resources

License

Stars

Watchers

Forks

Languages