Skip to content

Crossmapper Tutorial

Hrant Hovhannisyan edited this page Nov 24, 2021 · 14 revisions

Introduction

In this tutorial we will guide the user of Crossmapper through the main steps of the analysis starting from the software installation up to interpretation of obtained results.

The workflow of Crossmapper is illustrated in the scheme below.

More broadly, the Crossmapper pipeline consists of three major steps:

  1. Read simulation from two or more reference genomes/transcriptomes;
  2. Data merging and mapping of simulated reads back to concatenated reference genomes;
  3. Cross-mapping quantification, plotting and reporting.

Overall description of the main steps

Read simulation

Read simulation in Crossmapper is performed using wgsim software. With standard parameters it simulates 10 mln 2x100 bp reads under 3 minutes for the primary assembly of human genome GRCh38, and under 1 min for Saccharomyces cerevisiae genome.

Crossmapper allows the user to simulate two types of data - reads from the complete genome (DNAseq, DNA mode) and reads from the transcriptome (RNAseq,RNA mode). In the latter case the user must specify the gtf/gff files for the corresponding genomes and Crossmapper will automatically extract the transcriptome using gffread utility, simulating the reads from the transcriptome. In principle, using this approach a user can simulate data from any part of the genome (e.g. 16sRNA for metagenomic studies, lncRNA annotations, etc.) by feeding the Crossmapper an annotation file with regions of interest.

Merging and mapping the data

When simulation step is finished, Crossmapper performs the merging step, which concatenates the genomes and fastq files of studied organisms together. After the merging step, Crossmapper will perform read mapping of the concatenated fastq file to concatenated genome file. Depending on what kind of data was simulated, the software will use either BWA (with MEM algorithm) or STAR for mapping DNAseq and RNAseq data, respectively. The final output .bam file will contain the alignments of reads from all studied organisms mapped to the reference genomes of all organisms.

NOTE: In case when the reference genome is not available (e.g. for non-model organisms), users can simulate reads directly from the transcriptomic fasta file, but setting Crossmapper to DNA mode to use BWA aligner (by default) to restrict splice-junction identification when mapping to transcriptome. On the other hand, this approach of simulating reads from transcriptome and back-mapping to it has a potential downside in case when simulated reads from transcripts of one organism map to non-coding regions of another organism (pseudogenes, lncRNAs, repeats, etc) which are not present in transcriptomes. As a consequence, the cross-mapping information about these reads will be lost. Thus, when both reference genome and annotations are available, it is highly recommended to use RNA mode which will simulate reads from transcripts but will map them to the entire genome.

Quantification and plotting

Considering that wgsim keeps track of information from which genome and which exact location each read has originated, Crossmapper is able to calculate the rate of erroneous mappings (ermaps), i.e. the reads that originated from one genome but map to another genome/s. This step of ermap quantification is implemented in a custom python script. There could be different types of ermaps, which we will describe in more details further in our tutorial. After the ermap quantification and depending on input parameters, Crossmapper will produce several colourful and highly interactive plots to summarize the ermap rates and help the user to decide the optimal configuration of the experimental setup minimizing the ermap rates.

Usage case

Rationale

There are numerous scenarios where Crossmapper can be used - dual RNAseq, metagenomics, hybrid species analysis, allele-specific gene expression, etc. All these experimental setups have a common feature that the data obtained by sequencing originates from multiple species combined in one sample. For example, today dual RNAseq experiments became an important and widely used technique for studying host-pathogen or cohabitant species interactions. In dual RNAseq, RNA from both species is extracted in one step, thus the sample contains the mixture of RNA originating from two organisms. In this case, the experimentator solely assumes that it would be possible to separate the reads of two organisms using read mapping. However, even in case of human-salmonella interaction studies, around 1.5% of total reads could be multimapped ermaps, without taking into account uniquely mapped ermaps [Westermann and Vogel, 2018]. A similar or more extreme situation could happen when analyzing more closely related species, for example, yeast or bacterial communities. Moreover, when analyzing allele-specific expression from two highly similar haplotypes, the ermaps and multimaps usually constitute the majority of alignments.

Obviously, in these cases high rates of ermaps can bias any downstream analysis, while the experimentator would not notice it. To address the issue before doing any sequencing experiments, Crossmapper can assess the ermap rates using different combinations of sequencing parameters, allowing the researches optimize their experimental design.
Another situation when Crossmapper can be applied is when dealing with large sequencing project or in sequencing facilities, where many species and samples have to be analyzed. While the costs of sequencing are continuously dropping, the cost of library preparation could become a bottleneck for large projects. In this situation to save resources on library preparation, sequencing facility can combine the DNA or RNA samples from different species together and sequence them as one sample. With Crossmapper, that experimental design can be assessed and the most optimal sequencing parameters can be decided in advance.

Software installation

Installation of the Crossmapper can be achieved by using Anaconda or miniconda environment manager. The user is only required to have Anaconda installed (see the manual for Anaconda or Miniconda installation steps), which will efficiently manage all dependencies required for Crossmapper. Our pipeline is available as a command-line tool for Linux and MacOS operating systems. Crossmapper is written in python 3.6, hence we highly recommend our users to run it in a separate environment with python 3.6 or later to avoid potential syntax conflicts with previous python versions.

Once Anaconda is set up, the user can install Crossmapper simply by:

## create environment for python 3.6 or later 
conda create -n py36 python=3.6
 
## activate the new environment 
source activate py36

## install crossmapper
conda install -c gabaldonlab -c bioconda crossmapper=1.1.1

The fist two steps are optional if python3.6 is already used. To check if the installation was successful, simply type

crossmapper -h

which should print the help page of the package.

Analyzed data

For this tutorial we will use several examples of Crossmapper usage.

Example 1

The first example will address allele-specific expression using RNA-Seq in human fungal pathogen Candida albicans. The genome of C. albicans is phased, i.e. both parental haplotypes are available, and thus allele-specific expression can be assessed by mapping RNA-Seq reads to diploid genome, which as a rule produces a substantial amount of multimapped reads. Here we will demonstrate how Crossmapper can help to design a sequencing setup which will minimize the multi-mapping rates. Reference genomes and annotations are available through the link.

crossmapper RNA -g C_alb_A.fasta C_alb_B.fasta -a C_alb_A.gff C_alb_B.gff -gn A_hapl B_hapl -r 0.01 -gb -N 10000000 10000000 -rlen 50,75,100,125,150 -rlay both -o output_folder -t 10 -max_mismatch 5

First we will describe in details the options used in the command above, then will describe output files and plots generated by Crossmapper.

Description of command-line options

As stated above, Crossmapper can simulate two types of data - DNA and RNA, and in this example we will simulate reads from transcriptomes.

RNA - simulate reads from transcriptome.

-g - list the reference genomes separated by whitespaces. In our case we use two genomes, but Crossmapper can handle any number of input reference genomes.

-a - list the corresponding annotation files in gtf or gff formats separated by whitespaces. If RNA option is used, -a is a required argument. Note, that the listing order of reference genomes and annotations must be the same. By default, gff format will be converted to gtf using gffread utility.

-gn - this option stands for --genome_names, which will appear in the report file. If this option is omitted, basenames of genome files will appear in the report.

-r - mutation rate. This float number is a mutation rate based on which wgsim simulator adds variation to the reads. The user can use higher numbers if there is a prior knowledge that the reference genome and the sequenced data are not very similar, e.g. a different strain from the reference genome.

-gb - if this option is specified the output report will be grouped by read lengths.

-N - number of reads to simulate for each genome. Wgsim simulator allows to specify the number of reads to simulate from the reference with -N option. In Crossmapper, alongside with -N, we have implemented -C option, which stands for coverage. Using this option wgsim will simulate enough number of reads to reach the specified genome/transcriptome coverage. Note that -N and -C are mutually exclusive since they can be derived from each other. Please use the help page of Crossmapper to see how the coverage is calculated. Whitespace separated list of coverages will correspond to the list of reference genomes.

-rlen - read length to simulate. Using -rlen option a user can specify which length of reads has to be simulated. This is one of the key options in Crossmapper since it allows to simulate several different read lengths in one run and then compare how the rates of ermaps change based on it. In this case, we will perform the whole analysis using 5 different read lengths - 50, 75, 100, 125 and 150. Note, that in this case the values have to be separate by coma. Also note that we have limited the possible read lengths to the ones available for Illumina machines to make the workflow more realistic. The list of possible read lengths is available in the Crossmapper help page.

-rlay - read layout. This parameter specifies whether to simulate single-end (SE), paired-end (PE) or both types (both) of data. If the latter is specified, for each read length Crossmapper will simulate both single- and paired end data, which doubles the possible combinations of experimental designs for further assessment.

-o - specifies the output directory where all the files will be produced and stored.

-t - number of threads used for parallelizable steps, which include genome index generation, mapping and bam file sorting.

-max_mismatch - this parameter is specific for the STAR mapper, and it defines the maximum number of base mismatches in the read alignment after which the read will be counted as unmapped. Lower the number higher the stringency of mapping is, which on the other hand can result in a substantial number of unmapped reads.

The above command will extract the transcriptomes from genomes, simulate 50, 75, 100, 125 and 150 bp single- and paired-end reads. Then Crossmapper will concatenate corresponding files, will generate genome index using STAR and will map all combinations of reads to the genome index. Then it will calculate the proportions of multimaps and ermaps and based on that will generate report plots and table depending on read lengths. Let's inspect the output files generated by Crossmapper.

Output folders

Crossmapper outputs all results to the folder specified by -o option, in our case output_folder. For the read simulation and mapping steps Crossmapper saves the corresponding intermediate files to wgsim_output and {bwa|star}_output folders, respectively.

Output files

In output_folder there are several files:

C_alb_A_rename.fasta and C_alb_B_rename.fasta - these are original genome files with changed chromosome names (hence rename in the filename). Chromosome renaming is done according to either genome basename or -gn option and is performed to avoid the same chromosome/contig names in different genome files.

*.fai - corresponding indexes of fasta files.

C_alb_A_rename.gtf and C_alb_B_rename.gtf - corresponding gtf files with renamed chromosome names, which are generated by gffread software from original *.gff files.

C_alb_A_rename_transcriptome1.fasta and C_alb_B_rename_transcriptome2.fasta - transcriptomes, produced by gffread using corresponding genome and annotation files.

concat.fasta and concat.gtf - concatenated genomes and annotation files.

In wgsim_output folder there are

*.fastq files - fastq files produced by wgsim. The header of each file has the following format: contigName_readStart_readEnd_2{nErr:nSub:nInd}_readIdx\nMate,

where: contigName - the name of the fasta entry from which the read has originated, readStart - the genomic coordinate of the first base of the first mate readEnd - the genomic coordinate of the last base of the second mate nErr:nSub:nInd - number of sequencing errors, mutations and indels, respectively, for each mate readIdx - the alphanumeric index of each read pair nMate - the number of each mate, 1 or 2.

For example, the read Ca22chr1A_C_albicans_SC5314_1127892_1128091_0:0:0_3:0:0_2e/1 was simulated from chromosome Ca22chr1A_C_albicans_SC5314 from coordinate 1127892 to 1128091, its second mate has 3 sequencing errors, its unique index is 2e and this is the mate 1 from the read pair.

In star_output folder there are files produced by STAR mapper. Please refer to STAR manual for details about each output file. Here we will describe only the *.bam file.

The .bam file is a sorted binary version of .sam file. It is generated by STAR, converted and coordinate sorted by samtools. .bai file is an index of .bam file generated by samtools; .bai file is used by the pysam module to parse a .bam file. Please see the sam/bam file specifications for details.

In "STAR_index" folder there are genome indexing files produced by STAR mapper.

Log files

Crossmapper generates several log files. The major log file for the whole workflow is crossmapper.log, where the user can track the steps of the pipeline. On the other hand, for the read simulation and mapping steps Crossmapper also generates and saves output and error log files. For the simulation step they are named as cmd_wgsim_stdout.txt and cmd_wgsim_stderr.txt, and for mapping steps - {bwa|star}_mapping_stdout.txt and {bwa|star}_mapping_stderr.txt. For other intermediate steps, for example, file merging or index generation, Crossmapper also generates both log files, but removes them if the steps were finished successfully. On top of that, the STAR mapper also produces its own log files (such as *Log.out, Log.final.out, etc.), where the user can find a very detailed information about software runs.

Report files and plots

The major file produced by Crossmapper is report.html (please download to display properly), which can be opened in any internet browser. The report file contains several highly interactive and intuitive plots and tables allowing the user to inspect different mapping rates. The first graph is a line plot that shows the total number cross-mapped reads per read length. This plot shows mapping statistics for the following ermap categories: For each single- and paired-end reads it shows

  1. uniquely crossmapped reads (reads originated on one genome but uniquely mapped to other genomes);
  2. multi-crossmapped reads (reads originated on one genome but multimapped on the source and other genomes);
  3. total crossmapped reads (sum of two previous).

Hovering the mouse over the line plot will show the details for each ermaping category. Additionally, by clicking on the legend of the plot on the left, the user can hide/show ermap categories.

The second graph is a bar plot showing the distribution of total cross-mapped reads per species, i.e. to which genomes reads of a given species are mapped. If -gb option is specified, this graph will be grouped by species, otherwise individual plots will be displayed. As the previous plot, this graph is interactive - user can click on read lengths to show/hide specific configurations, or can click on the specific barplot which will show the distribution of ermaps across species for that particular read length.

By default both graphs show read counts, but users can change them to percentages of total reads simply by clicking the switch on top-right corner of the report. Moreover, users can download both graphs in different formats, including png, jpeg, pdf, etc or in the form of tables.

The plots are followed by tables with detailed mapping statistics for each read length and layout. There are three tables for each configuration.

The first table show the overall mapping statistics, i.e. number of reads, uniquely mapped reads, unmapped reads, etc.

The second table summarizes mapping rates for each species. This table has information about different categories of reads. Two main categories are Correctly and Incorrectly mapped reads.

Correctly mapped reads refer to the reads that originate and map on the same genome regardless of their mapping position, e.g. even if the read has originated on chromosome 1 but mapped to chromosome 2 of the same species, it is counted as correctly mapped. Correctly mapped reads are subdivided into uniquely mapped and multimapped.

Incorrectly mapped reads are those that originate on one species, but map to another either in a unique or multimapped manner. In other words, incorrectly mapped read has at least one alignment in a different species. Hence, the category of Incorrectly mapped reads is subdivided into uniquely crossmapped and multi-crossmapped reads, and the latter is further subdivided into multimapped only to other species and multimapped to the source and other species.

Noteworthy, that by pointing the mouse on each category, the user can see the its description.

The final third table represents a detailed breakdown of three categories of ermapped reads (unique ermaps, multi ermaps and total ermaps) per species, and in fact represents a data matrix. By hovering the mouse over the ermap rate of interest, the user can additionally see the percentage of total mapped reads of the species that particular ermap rate represents.

As in the case of graphs, these tables also can be switched to show either percentages or raw read count.

Alongside with report.html file, Crossmapper provides the information of tables in the form of a text file (report.txt) in case if users need to analyze the data programmatically.

Coming back to the example of C. albicans, we can observe in the report file, than indeed mapping to the diploid genome produces a significant amount of reads that map equally well to both haplotypes. However, with Crossmapper it is possible to quantitatively assess this bias. In case of single end 50 bp reads, cross-multimapped reads constitute ~87% of alignment, while paired-end layout and longer reads (i.e. 150bp) can significantly reduce this number to 58%. Based on these results, researchers can decide which sequencing configuration will be more suitable for addressing a particular biological question.

Example 2

In this example we show the usage of Crossmapper in planning host-pathogen interaction study. We will design the optimal sequencing experiment for dual RNA-Seq analysis of human cells interacting with pathogenic fungus Candida glabrata. The reference genome and annotations for human and fungus were downloaded from Ensemble and Candida Genome Database, respectively.

crossmapper -g C_glabrata_CBS138_current_chromosomes.fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa -a C_glabrata_CBS138_current_features.gff Homo_sapiens.GRCh38.94.gtf -gn yeast human -rlay both -rlen 50,75,100,125 -N 30000000 10000000 -r 0.03 -R 0.03 -gb -t 10

The results demonstrate that there is a negligible amount of crossmapping between the two species, thus dual RNA-Seq analysis with these two species will not suffer from poor read separation.

Example 3

In the third example of Crossmapper usage we will show how to optimize resources/sequencing costs by pooling the genetic material of several species together and then separating the reads by mapping strategy. This procedure can be done to save resources on library preparation, since its costs can be substantial for large projects. Here, we will analyze the data of the 4 most frequently sequenced organisms, namely human, mouse, fruit fly and nematode and see if these 4 species can be sequenced as one sample without interfering with each other in downstream analysis.

All reference genomes for human, mouse, fly and nematode Ensemble database.

The following Crossmapper command will be used:

crossmapper DNA -t 8 -gb -rlay both -g Homo_sapiens.GRCh38.dna.primary_assembly.fa Mus_musculus.GRCm38.dna.primary_assembly.fa Drosophila_melanogaster.BDGP6.dna.toplevel.fa Caenorhabditis_elegans.WBcel235.dna.toplevel.fa -gn human mouse fly nematode -N 2500000 2500000 2500000 2500000 -rlen 50,75,100,125,150 -r 0.01

The results of Crossmapper show that crossmapping rate between 4 species is generally very low with the highest observed between human reads and mouse genome (0.64%). On the other hand, using paired-end sequencing even with 50 bp reads almost entirely eliminates crossmapping, demonstrating that 4 abovementioned species can be successfully sequenced as one sample almost without crossmapping bias.

Running the Crossmapper with a desired mapping software

Crossmapper allows the users to use any desired mapping software instead of default BWA and STAR. To do it, users need to supply the Crossmapper with the configuration file indicating the software with all its parameters. Several configuration files, namely for Bowtie2, HISAT2 and NGM mappers are provided in config_examples folder with .yaml extension. For example, on can run Crossmapper using Hisat2 software:

crossmapper RNA -g C_alb_A.fasta C_alb_B.fasta -a C_alb_A.gff C_alb_B.gff -gn A_hapl B_hapl --mapper-template hisat2.yaml -gb -N 1000000 1000000 -rlen 50,75 -rlay both -o output_folder_hisat

A detailed documentation on how to build a configuration file is given here.

References

Westermann, A. J., & Vogel, J. (2018). Host-Pathogen Transcriptomics by Dual RNA-Seq. In Bacterial Regulatory RNA (pp. 59-75). Humana Press, New York, NY.