Separate sub genomes of an allopolyploid

Theory

Polyploid chromosome-resolved genome assemblies have (finally!) become a reality in biology research. A polyploid genome contains two (or more) 'subgenomes', each comprising a set of chromosomes. It's crucial to distinguish between two types of polyploids: autopolyploids, which arise from a single species doubling its chromosome number, and allopolyploids, which emerge from the hybridization of two distinct species. This tutorial will focus exclusively on allopolyploids.

We can represent the evolutionary history of an allopolyploid lineage as follows:

Representation of the evolutionary history of an allopolyploid lineage

Note the period of divergence between the allopolyploid lineages (ancestral lineages A and B) and consider the accumulation of transposable elements (TEs; consider this before reading the spoilers just below ;) ).

Spoilers: During the initial phase, TE accumulation will be similar for A and B. During the second phase, distinct TEs will accumulate on the genomes of lineages A and B, as they are separate lineages. During the third phase, both subgenomes will possess common TEs once more.

In this tutorial, we will retrieve the subgenomes from A and B by identifying "fossil transposable elements" – ancestral transposable elements that accumulated separately in subgenomes (ancestral lineages) A and B. We will do so by fishing out k-mers with two properties:

Highly represented (some TEs tend to have repeats that are highly represented when decomposed in k-mers)
Distinct between homeolog pairs (i.e. groups - pairs, trios - of chromosomes in the same species that originated by speciation and were brought back together in the same genome by allopolyploidization

Before embarking on this analysis, you need to determine the homeolog pairings. For instance, if you have six chromosomes and a allopolyploid with two subgenomes, you need to identify the three pairs. This can be accomplished using UCEs (ultra-conserved elements), COS (conserved ortholog sequences), or synteny/circos. Consult Cerca et al for detailed instructions and literature.

Practical part

1. First, we organize our working area.

# I like to label folders numerically because it's easy to follow, and once you finish a whole set of analyses and come back to it a few months later, you're able to see the order very clearly.
mkdir 00_data 01_separating_genome 02_jellyfish_counts 03_jellyfish_dumps

2. Second, we download the publicly available Scalesia atractyloides genome published by Cerca et al 2022.

We move in into the folder data, and download the genome.

cd 00_data
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/947/069/175/GCA_947069175.1_scalesia_atractyloides_v1/GCA_947069175.1_scalesia_atractyloides_v1_genomic.fna.gz

In the interest of having clean and easy-to-read names, let's rename it, and unzip it.

mv GCA_947069175.1_scalesia_atractyloides_v1_genomic.fna.gz scalesia.fna.gz
gunzip scalesia.fna.gz*

3. Third, we will separate the fasta file.

From a single genome fasta file, we want a file for each of the chromosomes. If you notice, the fasta has 1,634 sequences (grep -c ">" scalesia.fna), but only 34 chromosomes (grep -c "chr" scalesia.fna). This means that the genome has 34 chromosomes and 1,600 'loose' scaffolds. For our approach, we only want the chromosomes.

NCBI fasta files are 'interleaved' - simply put, the sequence comes in multiple sequences. It's often better to operate with fastas that have a simple and repeated two-line structure (i.e. the first line with the header, starting with ">", and the second with the sequence.

awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' scalesia.fna > scalesia.deinterleaved.fna

Now, we obtain a index of the chromosomes, and we modify the names.

grep "chr" scalesia.deinterleaved.fna | \
sed "s/.*chr/chr/; s/,.*/_/" > ../01_separating_genome/chromosome.ids.tsv

I will also replace the "," by "_" from the headers so we operate with easier characters.

sed -i "s/,/_/g" scalesia.deinterleaved.fna

Look into the file.

head ../01_separating_genome/chromosome.ids.tsv

It should look like this: chr_20_ chr_28_ chr_14_ ...

Notice, I added an underscore in the end so it is easier to pull out the chromosomes. If we do: grep "chr1", this will fetch: chr1, chr10, chr11, ... , chr19. By having an underscore in the end, we are able to do: grep "chr1_", without pulling all other data.

Now I will make a while loop, which will work on the genome file (scalesia.deinterleaved.fna), and separate chromosomes into to separate files

while read chr; do \
echo "Working on $chr" ; \
grep --no-group-separator -A 1 "$chr" scalesia.deinterleaved.fna > ../01_separating_genome/${chr}.fa; \
done < ../01_separating_genome/chromosome.ids.tsv

Take a look at the files we created.

ls 01_separating_genome/

It should look like this: chr_10_.fa chr_11_.fa chr_12_.fa ... chr_34_.fa

4. Forth, now we do k-mer counting.

Remember that for your own genomes you may want to try different k-mer sizes. I used 13 since there was a lot of divergence in the Scalesia atractyloides subgenomes, and k-mer 25 gave me no results.

For this step we need Jellyfish. I used version 2.2.10, but novel versions may be faster. Here, you may want to play with the parameters. I am using 20 cpus (-t 20), a k-mer size of 13 (-m 13) Also, notice I used unix to create a new name for the files ("newname"). In essence chr_24_.fa creates chr_24_.jf The output of jellyfish will be k-mer counts.

for i in 01_separating_genome/*fa; do \
echo $i; newname=$(echo ${i#*/} | sed "s/\.fa/.jf/"); \
jellyfish count -m 13 -s 100M -o 02_jellyfish_counts/$newname -t 20 $i; \
done

5. Fifth, now we do k-mer dumping.

What we are doing now is to retrieve human-readable (also R-readable) data from jellyfish using its "dump" function. Here, again, you could tweet some of the parameters. I am using a column format (-c), and only k-mers present >100 times. Since the point of subgenome separation is to retrieve transposable elements, we are looking for regions of the genome that have high counts.

for i in 02_jellyfish_counts/*jf; do \
echo $i; \
jellyfish dump -c -L 100 $i > 03_jellyfish_dumps/${i#*/}.dumps.larger100.col; \
done

Before we move onto R, we will clean the files, by adding headers and replacing spaces " " with tabs "\t".

touch header; echo "information" >> header
for i in *col; do echo $i; cat header $i > tmp; mv tmp $i; sed -i "s/ /\t/; s/information/kmer\tfrequency/" $i; done

This is how the files should look like:

kmer    frequency
AAAAAAAAAAAAA   2253
AAAAAAAAAAAAC   504
AAAAAAAAAAAAG   237
AAAAAAAAAAAAT   214
AAAAAAAAAAACA   338

6. Sixth, we move on to R to do a hierarchical clustering and a plot.

Setting up your environment. We only need to install tidyverse, which is an excellent package.

getwd()
setwd("/Users/josecer/OneDrive - Universitetet i Oslo/tmp/tutorial") # Navigate to the folder where you have the col data.
library(tidyverse) # Load up your library

I will explain one pair. The code below will create an object a variable called s12 and s25 based on the files (ScDrF4C_12.jf.dumps.larger100.col)). We specify it is separated by tabs , and header is true. The weird symbol (%>%) is a pipe (|) provided by tidyverse, and then we rename the column

c12<-read.table("./chr_12_.jf.dumps.larger100.col",sep="\t", header=T) %>%
  rename(c12=frequency)
c19<-read.table("./chr_19_.jf.dumps.larger100.col",sep="\t",header=T)  %>%
  rename(c19=frequency)

We merge both datasets on an object called pair01

pair01<-full_join(c12, c19, by="kmer")

We now filter for k-mers that are either twice-represented on a member of the pair. E.g.: where c12 = 101, and c19 = 300 (keep) where c12 = 1000, and c19 = 1900 (remove) where c12 = 500, and c25 = 201 (keep) where c12 = 600, and c25 = 1004 (remove)

filtered_pair01<- pair01 %>% filter(c12 > 2*c19 | c19 > 2*c12)
nrow(filtered_pair01)

I will repeat the code for two other pairs. I will not do the whole genome here for the interest of the tutorial not becoming too big :-)

11-5


c20<-read.table("./chr_20_.jf.dumps.larger100.col",sep="\t", header=T) %>%
  rename(c20=frequency)
c1<-read.table("./chr_1_.jf.dumps.larger100.col",sep="\t",header=T)  %>%
  rename(c1=frequency)
pair02<-full_join(c20, c1, by="kmer")
filtered_pair02<- pair02 %>% filter(c20 > 2*c1 | c1 > 2*c20)
nrow(filtered_pair02)


c11<-read.table("./chr_11_.jf.dumps.larger100.col",sep="\t", header=T) %>%
  rename(c11=frequency)
c5<-read.table("./chr_5_.jf.dumps.larger100.col",sep="\t",header=T)  %>%
  rename(c5=frequency)
pair03<-full_join(c11, c5, by="kmer")
filtered_pair03<- pair03 %>% filter(c11 > 2*c5 | c5 > 2*c11)
nrow(filtered_pair03)

We make a merged dataframe.

df<-full_join(filtered_pair01, filtered_pair02, by="kmer") %>%
  full_join(filtered_pair03, by="kmer")

Prepare ourselves for plotting par(mfrow=c(1,1))

It is my experience that it is best to remove missing data. You may need to play here with your data (i.e. skip or keep this).

cleaned_df<-df[complete.cases(df), ]

We add names to the rownames, and clean the last column

rownames(cleaned_df)<-cleaned_df$kmer
squeaky_cleaned_df<-cleaned_df[,-1]

We transverse it so we can do a hierarchical clustering)

transversed_squeaky_cleaned_df<-t(squeaky_cleaned_df)

We calculate distances between variables, make a hierarchical clustering, and plot it.

dist_transversed_squeaky_cleaned_df <- dist(transversed_squeaky_cleaned_df)
hclust_avg <- hclust(dist_transversed_squeaky_cleaned_df)
plot(hclust_avg, main="13bp kmers at least 100x per chromossome, no missing data; scalesia")

You see the two separate groupings: (c20, c12, c11), (c19, c1, c5)? They're your subgenomes!