Skip to content

B-UMMI/ReMatCh

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

97 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ReMatCh

Reads mapping against target sequences, checking mapping and consensus sequences production

Codacy Badge

Table of Contents

Rational

ReMatCh was designed to map HTS reads onto a set of reference sequences in order to determine whether those sequences are present or absent in each sample, and to identify any variation compared to the reference. ReMatCh determines if a sequence is present or absent based on the proportion of reference sequence length covered by at least a predefined number of reads; and the sequence similarity. It relies mainly on the strength of high read numbers to correctly identify two types of variants: SNPs and INDELs. However, when a position does not meet the criteria for being unambiguously called, ReMatCh will designate it as a potential heterozygous position. In order to correctly identify variants over the entire length of a target region, references containing additional sequences flanking the region of interest can be provided and will grant a scaffold for proper read mapping. Moreover, to avoid errors when calling a position due to improper read mapping resulting from divergence between the allele in the genome of interest and the reference sequence, ReMatCh has the option to be executed in double run mode, in which the resulting consensus sequences are used as reference sequences in a second run, thereby facilitating read mapping. ReMatCh can use locally stored sequence data, but it can also directly interact with the ENA database and download the read files from sample/run accession numbers provided by the user or all data associated with a given taxon name. A ReMatCh module was designed to get the MLST sequence type from HTS reads. Using a provided MLST curated schema with flanking regions or an allele for each MLST loci obtained from PubMLST database (https://pubmlst.org) as reference sequences, the consensus sequences produced by ReMatCh are compared to the ones found in PubMLST database for allele scoring and ST determination. ReMatCh software dependencies are: Bowtie2 (Langmead and Salzberg 2012) for read mapping, Samtools (Li et al. 2009) for sam/bam manipulation and variant calling and Bcftools (Li 2011) for consensus sequence production. These software dependencies are provided together with ReMatCh to facilitate the installation and guarantee that the users have the correct versions. Besides the parallelization implemented within Bowtie2 and Samtools, ReMatCh assigns one sequence variant analysis and coverage determination to each available thread.

Dependencies

Mandatory

  • Python >= v3

Required to run ReMatch analysis (provided)

  • Bowtie2 >= v2.2.9
  • Samtools = v1.3.1
  • Bcftools = v1.3.1

These three executables are provided, but user's own executables can be used by providing --doNotUseProvidedSoftware option.

Optional
Required to download sequence data from ENA/SRA database:

  • Aspera Connect 2 >= v3.6.1
  • wget (normally found in Linux OS)
  • gzip >= v1.6 (normally found in Linux OS)
  • curl (optional)
  • SRA toolkit >= v2.8.2 (optional) (for SRA interaction)
  • GNU Awk (optional) (normally found in Linux OS) (for SRA interaction)

Installation

git clone https://github.com/B-UMMI/ReMatCh.git
cd ReMatCh
python3 setup.py install

NOTE:
If you don't have permission for global system installation, try the following install command instead:
python3 setup.py install --user

Input

Reference

ReMatCh requires for the reference sequeces, in fasta file, to be provided through the -roption. In our experience, the addition of 200nt upstream and downstream of the target region when using Illumina Miseq data (150nt reads), will have the desired effect, and these flanking regions will be ignored in variant calling, unless there is an INDEL affecting the target sequence. the size of the flanking regions can be set with the opion --extraSeq. If the --mlst option is used, the --mlstReference can be used intead of the -r, telling ReMatCh to use the curated scheme for the MLST scheme, if available, as reference sequences with 200nt flanking the target regions, or the first alleles of each MLST gene fragment in PubMLST as reference sequences.

Samples

The samples can be provided through the -w option, if stored locally in a directory, or by telling ReMatCh to interact directly with ENA. This can be done by passing rematch a list of IDs to download, through the -loption, or to download all the reads from a given taxon, though the --taxon option. The sample files are required to be in "fq.gz" (or "fastq.gz") format.

Usage

usage: rematch.py [-h] [--version]
                  (-r /path/to/reference_sequence.fasta | --mlstReference)
                  [-w /path/to/workdir/directory/] [-j N]
                  [--mlst "Streptococcus agalactiae"]
                  [--doNotUseProvidedSoftware]
                  [-l /path/to/list_IDs.txt | -t "Streptococcus agalactiae"]
                  [--extraSeq N] [--minCovPresence N] [--minCovCall N]
                  [--minFrequencyDominantAllele 0.6] [--minGeneCoverage N]
                  [--minGeneIdentity N] [--doubleRun]
                  [--reportSequenceCoverage] [--notWriteConsensus]
                  [--bowtieAlgo="--very-sensitive-local"]
                  [--bowtieOPT="--no-mixed"]
                  [--debug]
                  [--mlstSchemaNumber N] [--mlstConsensus noMatter]
                  [--mlstRun first]
                  [-a /path/to/asperaweb_id_dsa.openssh] [-k]
                  [--downloadLibrariesType PAIRED]
                  [--downloadInstrumentPlatform ILLUMINA] [--downloadCramBam]
                  [--SRA | --SRAopt]

Reads mapping against target sequences, checking mapping and consensus
sequences production

optional arguments:
  -h, --help            show this help message and exit
  --version             Version information
  -l /path/to/list_IDs.txt, --listIDs /path/to/list_IDs.txt
                  Path to list containing the IDs to be downloaded (one
                  per line) (default: None)
  -t "Streptococcus agalactiae", --taxon "Streptococcus agalactiae"
                  Taxon name for which ReMatCh will download fastq files
                  (default: None)

General facultative options:
  -r /path/to/reference_sequence.fasta, --reference /path/to/reference_sequence.fasta
                        Fasta file containing reference sequences (default:
                        None)
  -w /path/to/workdir/directory/, --workdir /path/to/workdir/directory/
                        Path to the directory where ReMatCh will run and
                        produce the outputs with reads (ended with
                        fastq.gz/fq.gz and, in case of PE data, pair-end
                        direction coded as _R1_001 / _R2_001 or _1 / _2)
                        already present (organized in sample folders) or
                        to be downloaded (default: .)
  -j N, --threads N     Number of threads to use (default: 1)
  --mlst "Streptococcus agalactiae"
                        Species name (same as in PubMLST) to be used in MLST
                        determination (default: None)
  --doNotUseProvidedSoftware
                        Tells ReMatCh to not use Bowtie2, Samtools and
                        Bcftools that are provided with it (default: False)

Download list options (one of the following):
  -l /path/to/list_IDs.txt, --listIDs /path/to/list_IDs.txt
                        Path to list containing the IDs to be downloaded (one
                        per line) (default: None)
  -t "Streptococcus agalactiae", --taxon "Streptococcus agalactiae"
                        Taxon name for which ReMatCh will download fastq files
                        (default: None)

ReMatCh module facultative options:
  --extraSeq N          Sequence length added to both ends of target sequences
                        (usefull to improve reads mapping to the target one)
                        that will be trimmed in ReMatCh outputs (default: 0)
  --minCovPresence N    Reference position minimum coverage depth to consider
                        the position to be present in the sample (default: 5)
  --minCovCall N        Reference position minimum coverage depth to perform a
                        base call. Lower coverage will be coded as N (default:
                        10)
  --minFrequencyDominantAllele 0.6
                        Minimum relative frequency of the dominant allele
                        coverage depth (value between [0, 1]). Positions with
                        lower values will be considered as having multiple
                        alleles (and will be coded as N) (default: 0.6)
  --minGeneCoverage N   Minimum percentage of target reference gene sequence covered
                        by --minCovPresence to consider a gene to be present
                        (value between [0, 100]) (default: 80)
  --minGeneIdentity N   Minimum percentage of identity of reference gene sequence
                        covered by --minCovCall to consider a gene to be present
                        (value between [0, 100]). One INDEL will be considered
                        as one difference (default: 70)
  --doubleRun           Tells ReMatCh to run a second time using as reference the
                        noMatter consensus sequence produced in the first run.
                        This will improve consensus sequence determination for
                        sequences with high percentage of target reference gene
                        sequence covered (default: False)
  --reportSequenceCoverage
                        Produce an extra combined_report.data_by_gene with
                        the sequence coverage instead of coverage depth
                        (default: False)
  --notWriteConsensus   Do not write consensus sequences (default: False)
  --summary             Produce extra report files containing only sequences
                        present in at least one sample (usefull when using a
                        large number of reference sequences, and only for
                        first run) (default: False)
  --bowtieAlgo="--very-sensitive-local"
                        Bowtie2 alignment mode. It can be an end-to-end alignment
                        (unclipped alignment) or local alignment (soft clipped
                        alignment). Also, can choose between fast or sensitive
                        alignments. Please check Bowtie2 manual for extra information:
                        http://bowtie-bio.sourceforge.net/bowtie2/index.shtml .
                        This option should be provided between quotes and starting
                        with an empty space (like --bowtieAlgo " --very-fast") or
                        using equal sign (like --bowtieAlgo="--very-fast")
                        (default: "--very-sensitive-local")
  --bowtieOPT="--no-mixed"
                        Extra Bowtie2 options. This option should be provided between
                        quotes and starting with an empty space
                        (like --bowtieOPT " --no-mixed") or using equal sign
                        (like --bowtieOPT="--no-mixed") (default: None)
  --debug               DeBug Mode: do not remove temporary files (default: False)
  --mlstReference       If the curated scheme for MLST alleles is available, tells
                        ReMatCh to use these as reference (force Bowtie2 to run
                        with very-sensitive-local parameters, and sets --extraSeq
                        to 200), otherwise ReMatCh uses the first alleles of each
                        MLST gene fragment in PubMLST as reference sequences (force
                        Bowtie2 to run with very-sensitive-local parameters, and
                        sets --extraSeq to 0)

MLST facultative options:
  --mlstSchemaNumber N  Number of the species PubMLST schema to be used in
                        case of multiple schemes available (by default will
                        use the first schema) (default: None)
  --mlstConsensus noMatter
                        Consensus sequence to be used in MLST determination
                        (default: noMatter)
  --mlstRun first       ReMatCh run outputs to be used in MLST determination
                        (default: all)

Download facultative options:
  -a /path/to/asperaweb_id_dsa.openssh, --asperaKey /path/to/asperaweb_id_dsa.openssh
                        Tells ReMatCh to download fastq files from ENA using
                        Aspera Connect. With this option, the path to Private-
                        key file asperaweb_id_dsa.openssh must be provided
                        (normaly found in
                        ~/.aspera/connect/etc/asperaweb_id_dsa.openssh).
                        (default: None)
  -k, --keepDownloadedFastq
                        Tells ReMatCh to keep the fastq files downloaded
                        (default: False)
  --downloadLibrariesType PAIRED
                        Tells ReMatCh to download files with specific library
                        layout (default: BOTH)
  --downloadInstrumentPlatform ILLUMINA
                        Tells ReMatCh to download files with specific library
                        layout (default: ILLUMINA)
  --downloadCramBam     Tells ReMatCh to also download cram/bam files and
                        convert them to fastq files (default: False)
                        SRA download options (one of the following):
  --SRA                 Tells getSeqENA.py to download reads in fastq format
                        only from NCBI SRA database (not recommended)
                        (default: False)
  --SRAopt              Tells getSeqENA.py to download reads from NCBI SRA
                        if the download from ENA fails

Usage Examples

Running ReMatCh Beginner

Using local samples for provided reference file

To run ReMatCh in local fastq files, please organize those files into sample folders, as shown bellow. E.g.:

  workir/
    sample_1/
      fastq_file_a_1.fq.gz
      fastq_file_a_2.fq.gz
    sample_2/
      fastq_file_b_R1_001.fastq.gz
      fastq_file_b_R2_001.fastq.gz

It is advisable to use copied fastq files or symbolic links to the original files. This directory, containing the sample folders, should then be provided through the --workdir option. ReMatCh will store the output files there. As so, the command should look something like:

rematch.py -r reference.fasta --workdir /path/to/workdir/

Running ReMatCh Moderate

Using specific ENA sequencing data for provided reference file

To run ReMatCh in a specific set of ENA IDs you need to provide a file to --listIDs containing a list of ENA IDs to be downloaded. The IDs can be Sample Accession numbers or Run Accession numbers (for example), as long as there's only one ID per line in the file. In case of IDs containing more than one Run Accession number (like Study accession numbers), only one of them will be downloaded and the remaining will be stored in sample_report..tab* file under extra_run_accession column in a comma separated style.
ReMatCh will store the output files in the --workdir.

rematch.py -r reference.fasta --listIDs IDs.txt --workdir /path/to/workdir/

By default ReMatCh uses wget to download the sample files from ENA. We recommend using Aspera Connect 2 to speed up this process by providing the path Private-key file asperaweb_id_dsa.openssh to -a.

rematch.py -r reference.fasta --listIDs IDs.txt -a /path/to/asperaweb_id_dsa.openssh --workdir /path/to/workdir/
Using ENA sequencing data of a given taxon for provided reference file

To run ReMatCh in all ENA data of a given taxon, provide the taxon name to --taxon. The ENA Run Accession numbers for the given taxon will be stored in IDs_list.seqFromWebTaxon.tab file.
The column content will be:

  1. Run Accession numbers
  2. Sequencing instrument models
  3. (secondary) Study Accession numbers
  4. library types
  5. library layouts

The first line of IDs_list.seqFromWebTaxon.tab will contain the date of accession.

rematch.py -r reference.fasta --taxon "Streptococcus dysgalactiae" /path/to/asperaweb_id_dsa.openssh --workdir /path/to/workdir/

Running ReMatCh Advanced

MultiLocus Sequence Typing for local samples

To run ReMatCh in a set of samples for MLST the option --mlst needs to be provided with species name (same as in PubMLST - https://pubmlst.org/databases/) to be used in MLST determination. If more than one scheme is available for the species, the desired schema number should be passed to ReMatCh with the --mlstSchemaNumber option. A fasta file containing the MLST reference sequences (-r) is required, along with the size of the flanking regions as recomended by us (set with the opion --extraSeq). Alternatively the --mlstReference option can be used, telling ReMatCH to use the curated scheme for the MLST scheme, if available, as reference sequences with 200nt flanking the target regions, or the first alleles of each MLST gene fragment in PubMLST as reference sequences. The MLST results will be in the mlst_report..tab* in the --workdir. Here's an example to run ReMatCh for MLST in all "Streptococcus agalactiae" samples in ENA:

rematch.py --mlst "Streptococcus agalactiae" --mlstReference --workdir /path/to/workdir/

As default, ReMatCh uses the consensus sequence "noMatter" in MLST determination, but this can be changed with the --mlstConsensus option. IF the option --doubleRun is used, ReMatCh can determine the MLST for the second run only, or for both runs, with the --mlstRun option. By default the MLST will be determined in both runs.

rematch.py --mlst "Streptococcus agalactiae" --mlstReference --taxon "Streptococcus agalactiae" --workdir /path/to/workdir/ --mlstConsensus all --doubleRun --mlstRun second
MultiLocus Sequence Typing for ENA list of IDs or taxon

As described above, you can run ReMatCh in a specific set of ENA IDs or in all taxon data for MLST by providing the -l or --taxon options respectively.

rematch.py --mlst "Streptococcus agalactiae" --mlstReference -l IDs.txt --workdir /path/to/workdir/

rematch.py --mlst "Streptococcus agalactiae" --mlstReference --taxon "Streptococcus agalactiae" --workdir /path/to/workdir/

Outputs

run.*.log
ReMatCh running log file.

sample_report.*.tab

  • sample - Sample ID
  • sample_run_successfully - Reports whether the sample globally run successfully
  • sample_run_time - Global sample running time (in seconds)
  • files_size - Sum of files size (in bytes)
  • download_run_successfully - Reports whether the sample downloading (if requested) run successfully
  • download_run_time - Download running time
  • rematch_run_successfully_first - Reports whether the first run of ReMatCh module run successfully
  • rematch_run_successfully_second - Reports whether the second run of ReMatCh module run successfully
  • rematch_run_time_first - ReMatCh first running time
  • rematch_run_time_second - ReMatCh second running time
  • number_absent_genes_first - Number of absent genes determined in the first ReMatch run
  • number_genes_multiple_alleles_first - Number of genes with multiple alleles among the genes present determined in the first ReMatch run
  • mean_sample_coverage_first - Mean sample coverage depth (only considering the genes present) determined in the first ReMatch run
  • number_absent_genes_second - Number of absent genes determined in the second ReMatch run
  • number_genes_multiple_alleles_second - Number of genes with multiple alleles among the genes present determined in the second ReMatch run
  • mean_sample_coverage_second - Mean sample coverage depth (only considering the genes present) determined in the second ReMatch run
  • run_accession - ENA Run Accession number used to download
  • instrument_platform - Sequencing technology used reported by ENA
  • instrument_model - Instrument model used for sequencing reported by ENA
  • library_layout - Single or paired-end sequencing reported by ENA
  • library_source - Library type reported by ENA (e.g. genomic, transcriptomic, synthetic)
  • extra_run_accession - Extra ENA Run Accession numbers found for the ID provided for download
  • date_download - Date of downloading try
  • fastq_used - Fastq files used in ReMatCh module

combined_report.data_by_gene.*.tab
combined_report.data_by_gene.first_run.*.tab and combined_report.data_by_gene.second_run.*.tab
This file contains a report with gene (in columns) presence/absence and coverage depth for the different samples (in lines).
In the case of genes being present (genes with at least --minGeneCoverage percentage of target reference gene sequence covered with --minCovPresence reads and with at least --minGeneIdentity percentage identity of target reference gene sequence covered with --minCovCall reads), the script will provide the mean target sequence coverage, otherwise will report "absent_" for genes not present.
In case of multiple alleles occurrence, if the frequency of the dominant allele is lower than --minFrequencyDominantAllele and the frequency of the most frequent minority allele is higher than 50% of the total of the minority alleles or is 50% but only 2 minority alleles exist, "multiAlleles_" will be reported.

cpu_information.*.cpu.txt and cpu_information.*.slurm.txt
Store CPUs and SLURM information at the time of run.

mlst_report.*.tab This file contains a report with the MLST information (columns) for the different samples (in lines). For each sample, the file will have information on the run the MLST was determined (first or second), the consensus sequenced used (noMatter, correct or alignment), the ST obtained ( or '-' if no ST was obtained) and the allele number (or '-' if not an exact match) for each loci in the scheme.

Samples folders
For each sample, three fasta files will be produced:

  • sample.noMatter.fasta - Fasta file containing the target gene sequence with the more probable nucleotides (determined by the majority rule for positions covered by >= --minCovPresence). Positions with less than --minCovPresence coverage depth will be considered as deletions.
  • sample.correct.fasta - Fasta file containing the target gene sequence with the correct nucleotides. Positions with less than --minCovPresence coverage depth will be considered as deletions, with less than --minCovCall coverage depth will be coded as "N" (due to low certainty in calling SNP) and positions with possible multiple alleles will also be considered as "N".
  • sample.alignment.fasta - Will be exactly the same as sample.correct.fasta, but INDELs will be coded as "N" in order to produce sequences with the same size that can be then concatenated to produce an alignment file
  • rematchModule_report.txt - Report file containing gene information: 1) gene name, 2) percentage of target gene sequence covered with at least --minCovPresence read depth, 3) Mean target gene coverage depth of present positions, 4) percentage of target gene sequence with lower --minCovCall coverage depth, 5) number of positions in target gene sequence containing multiple alleles, 6) percentage identity of target gene sequence covered with at least --minCovCall read depth. The general sample information will also be stored: number of absent genes, number of genes with multiple alleles among the genes present and the mean sample coverage depth (only considering the genes present).
  • rematch_module/ - Folder containing the temporary files. Only kept if --debug option is specified. It will contain the alignment.bam, bam and fasta indexes, sequence_data/ folder with subfolders (named with numbers) for each sequence in --reference file. In each sequence folder the different consensus *.vcf files and the original samtools_mpileup.*.vcf and samtools_depth.*.vcf files
  • rematch_second_run/ - Folder containing the same files/folders described above, but for the second_run. Only created if --doubleRun is set

Citation

MP Machado, B Ribeiro-Goncalves, M Silva, I Mendes, M Rossi, M Ramirez, JA Carrico. ReMatCh GitHub https://github.com/B-UMMI/ReMatCh

Contact

Miguel Machado
mpmachado@medicina.ulisboa.pt