# apytram v1.0
## Automated ( PYthon implemented ) Target Restricted Assembly Method

## Contact

carine.rey@ens-lyon.fr

## Preamble

This software is inspired by aTRAM (see References) but is implemented in Python, and its internal strategies are designed differently.


## Why use apytram?

apytram allows assembling sequences from RNA-seq data (paired-end or not) using a reference homologous sequence, possibly coming from a different species.

## How does it work?


### Database building
The RNA-seq data (a fastq or fasta file), given by the -fq or -fa option, is formatted into a BLAST database whose name is given by the -d option.

The data type, paired-end or single-end, must be given by the -dt option. If the data are paired-end, all reads (1 and 2) must be concatenated into a single file. WARNING: Paired read names must end with 1 or 2.
All reads contained in this file will be used, so the file must already have been cleaned up.
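
The naming requirement can be checked before the database is built. The sketch below is not part of apytram: it assumes a standard FASTQ file with four lines per record and mates whose names differ only by the final 1 or 2. The function name `check_paired_names` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def check_paired_names(fastq_lines: Iterable[str]) -> List[str]:
    """Return the names of reads that break the pairing convention.

    Assumes 4-line FASTQ records. The read name is taken as the first
    whitespace-delimited token of the header, without the leading '@'.
    """
    names = []
    for index, line in enumerate(fastq_lines):
        if index % 4 == 0:  # header line of each record
            tokens = line.strip()[1:].split()
            if tokens:
                names.append(tokens[0])

    # Names that do not end with 1 or 2
    problems = [name for name in names if name[-1] not in "12"]

    # Names that end with 1 or 2 but whose mate is absent from the file
    seen = Counter((name[:-1], name[-1]) for name in names if name[-1] in "12")
    for stem, suffix in seen:
        mate_suffix = "2" if suffix == "1" else "1"
        if (stem, mate_suffix) not in seen:
            problems.append(stem + suffix)
    return problems
```

An empty list means that every read name ends with 1 or 2 and has a mate in the file.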

If this file is a fastq file, it will be converted into a fasta file. This conversion can take a lot of time (between XX and YY hours on a laptop). 
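
To illustrate what this conversion involves, here is a minimal sketch. It is not the code used by apytram; it assumes well-formed FASTQ input with exactly four lines per record and no blank lines, and the function name `fastq_to_fasta` is hypothetical.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from 4-line FASTQ records (header, sequence, '+', quality)."""
    for index, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        position = index % 4
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {index + 1}: expected a header starting with '@'")
            yield ">" + line[1:]  # FASTA header keeps the read name
        elif position == 1:
            yield line  # nucleotide sequence
        # position 2 ('+' separator) and position 3 (quality string) are dropped
```

Because the function is a generator, the file is processed line by line and never loaded entirely into memory.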

As the database building step is time consuming, if a BLAST database is already present, the database building will be skipped.

### Iterative process

The sequence in the query file (-q option) will serve as the first bait sequence. At this time, only one reference sequence must be present in the query file. The possibility to use a multi-reference query file is in development.

#### A classical iteration (see Figure XX)
*    Read recruiting (BLAST):

     Bait sequences are used to recruit homologous reads by BLAST. The -e option sets the e-value threshold.
     If the data are paired-end (-dt option), the mates of all recruited reads are also added to the read list.
     
     The sequences of all these reads are written to a temporary file, which is accessible if the -tmp option is used.
     
     
*    Read assembly (Trinity):

     All reads present in this temporary file are assembled (de novo) by Trinity with default parameters. If the data contains paired-end reads, Trinity takes reads as paired reads.
     
     
*    Reconstructed sequences quality (Exonerate):

     Exonerate is used to align each sequence to the reference. The length of the sequence, the length of the alignment, the percentage of identity and the alignment score are collected.
     
     
*    Reconstructed sequence filtering:

     Reconstructed sequences are filtered according to the -id, -mal and -len options. Sequences which pass all filters are used as bait sequences for the next iteration.
     
     
*    Comparison with the previous iteration (Exonerate):
     
     Exonerate is used to find, for each contig, its "parent" contig from the previous iteration and to check whether they are really different.
     

*    Coverage calculation (Mafft):

    apytram calculates two coverage values. The Strict coverage is the percentage of sites of the query that have a homologous site in the reconstructed sequences. The Large coverage is the number of sites in the alignment with at least one representative in the reconstructed sequences, divided by the length of the reference and expressed as a percentage; by definition, it can exceed 100.
    
![ Figure to explain.](https://github.com/CarineRey/apytram/blob/master/Documentation/CoverageExplanation.png?raw=true "Coverage explain")
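
The two coverage values can be illustrated on a small alignment. The sketch below is an interpretation of the definitions above, not apytram's own code: it assumes the reference and the reconstructed sequences are already aligned (same length, `-` for gaps), and the function name `coverages` is hypothetical.

```python
from typing import Sequence, Tuple

GAP = "-"


def coverages(aligned_ref: str, aligned_contigs: Sequence[str]) -> Tuple[float, float]:
    """Return (strict coverage, large coverage) as percentages."""
    if any(len(contig) != len(aligned_ref) for contig in aligned_contigs):
        raise ValueError("all aligned sequences must have the same length")
    ref_length = sum(1 for base in aligned_ref if base != GAP)
    if ref_length == 0:
        raise ValueError("the reference contains no residue")

    strict_sites = 0  # reference sites covered by at least one contig
    large_sites = 0   # alignment columns covered by at least one contig
    for column, ref_base in enumerate(aligned_ref):
        if any(contig[column] != GAP for contig in aligned_contigs):
            large_sites += 1
            if ref_base != GAP:
                strict_sites += 1
    return 100.0 * strict_sites / ref_length, 100.0 * large_sites / ref_length
```

With the reference `ACGT--` and a single contig `-CGTAA`, three of the four reference sites are covered (strict coverage 75) while five columns contain a contig base (large coverage 125), which shows why the second value can exceed 100.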

At the end of an iteration, reconstructed sequences become the bait sequences for the next iteration.
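
The selection of the next baits by the filtering step can be sketched as follows. This is only an illustration: the `ContigStats` class and the `keep_as_bait` function are hypothetical, the default values are those listed in the help message for -id, -mal and -len, and the use of "greater than or equal to" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContigStats:
    name: str
    length: int       # length of the reconstructed sequence
    ali_length: int   # length of its alignment on the reference
    identity: float   # percentage of identity over the alignment


def keep_as_bait(contigs: List[ContigStats], min_id: float = 20.0,
                 min_ali_len: int = 180, min_len: int = 200) -> List[ContigStats]:
    """Return the contigs that pass all three per-iteration thresholds."""
    return [
        contig for contig in contigs
        if contig.identity >= min_id
        and contig.ali_length >= min_ali_len
        and contig.length >= min_len
    ]
```

Note that these per-iteration thresholds on lengths are absolute values, unlike the thresholds of the final filter, which are percentages of the query length.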


#### Criteria to stop the iterative process
If one of these criteria is met during an iteration, the iterative process stops.
The following criteria are implemented as default settings:
*    The maximum number of iterations (-i option) is reached.
*    The recruited reads are the same as during the previous iteration.
*    The reconstructed sequences at the end of an iteration are almost the same as after the previous iteration. That is, each sequence of an iteration has a corresponding sequence in the previous iteration with at least 99% identity and at least 98% of its length.
*    The number of reconstructed sequences has not changed AND the total length, score and Large coverage of all reconstructed sequences have not improved.

If the --required_coverage option is used, the iterative process also stops when the Strict coverage exceeds the required coverage.

If the --finish_all_iter option is used, these criteria are not applied and apytram runs all iterations (-i option).
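
A simplified sketch of how the first three criteria combine is given below. It is not apytram's implementation: the function names and the `Match` tuple are hypothetical, and the criteria on total length, score and coverage are left out for brevity.

```python
from typing import Iterable, Optional, Set, Tuple

# (identity %, alignment length, contig length) of a contig compared with
# its closest contig from the previous iteration; None if no parent was found
Match = Tuple[float, int, int]


def almost_identical(matches: Iterable[Optional[Match]]) -> bool:
    """True if every contig has a parent with >= 99% identity over >= 98% of its length."""
    matches = list(matches)
    if not matches:
        return False
    for match in matches:
        if match is None:
            return False
        identity, ali_length, length = match
        if identity < 99.0 or ali_length < 0.98 * length:
            return False
    return True


def should_stop(iteration: int, max_iterations: int,
                reads: Set[str], previous_reads: Set[str],
                matches: Iterable[Optional[Match]],
                finish_all_iter: bool = False) -> bool:
    """Decide whether the iterative process stops after this iteration."""
    if iteration >= max_iterations:
        return True   # the maximum number of iterations always applies
    if finish_all_iter:
        return False  # the other criteria are ignored
    return reads == previous_reads or almost_identical(matches)
```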

### Final Filter

A final filter (-fmal, -fid and -flen options) can be applied to the reconstructed sequences to be more stringent than the thresholds used during the iterative process.
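
According to the help message, -flen and -fmal are percentages of the query length, whereas -fid is an identity percentage. The sketch below illustrates this conversion; it is an assumption-based example, not apytram's code, and the function name `final_filter` and the dictionary keys are hypothetical.

```python
from typing import Dict, List


def final_filter(contigs: List[Dict[str, float]], query_length: int,
                 flen: float = 0.0, fid: float = 0.0,
                 fmal: float = 0.0) -> List[Dict[str, float]]:
    """Keep the contigs that reach the final thresholds.

    flen and fmal are percentages of the query length, fid is an identity
    percentage. A value of 0 (the documented default) disables a threshold.
    """
    min_len = query_length * flen / 100.0      # absolute minimum length
    min_ali_len = query_length * fmal / 100.0  # absolute minimum alignment length
    return [
        contig for contig in contigs
        if contig["length"] >= min_len
        and contig["ali_length"] >= min_ali_len
        and contig["identity"] >= fid
    ]
```

For a query of 1000 bases, `flen=50` therefore keeps only sequences of at least 500 bases.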

### Writing output files
Finally, output files are written:
*    `$OUTPUT_PREFIX`**.fasta**: 

     A fasta file containing all reconstructed sequences of the last iteration which pass the final filter.


*    `$OUTPUT_PREFIX`**.best.fasta**:

     A fasta file containing the reconstructed sequences of the last iteration with the best scores, and which pass the final filter.


*    `$OUTPUT_PREFIX`**.stats.pdf**:
    
    available with the --plot option. A pdf containing 2 figures which give global information on each iteration.


*    `$OUTPUT_PREFIX`**.stats.csv**:
      
      available with the --plot or --stats option. A csv file containing the raw data needed to draw the figures in the .stats.pdf file.


*    `$OUTPUT_PREFIX`**.ali.png**:

        available with the --plot_ali option. A figure representing the alignment on the query of all reconstructed sequences which pass the final filter. White represents a gap, blue a base of the reference, green a base of a reconstructed sequence identical to the reference, red a base different from the reference and yellow a base corresponding to a gap in the reference.

## Dependencies
### Software (Must be available in the \$PATH)

*    Trinity >=v2.1 [Samtools = v0.1.19, Java 1.7, bowtie] ([Download Trinity here](https://github.com/trinityrnaseq/trinityrnaseq/releases)) ([Download Samtools here](https://sourceforge.net/projects/samtools/files/samtools/0.1.19/)) 
    
*    Mafft >=v7 ([Download Mafft here](http://mafft.cbrc.jp/alignment/software/))

*    exonerate ([Download exonerate here](https://github.com/nathanweeks/exonerate))

*    BLAST+ (blastplus)

To add a program to the \$PATH environment variable:
```sh
PATH=/path/to/the/software:$PATH
```

You can also install this software via a package management program (e.g. aptitude, synaptic ...).

For example to install exonerate:
```bash
apt-get install exonerate
```
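
Before a first run, it can be useful to check that these programs are found in the \$PATH. The following sketch uses only the Python standard library. The executable names are assumptions (they may differ between installations), and `find_tools` is a hypothetical helper, not an apytram function.

```python
import shutil
from typing import Dict, Iterable, Optional

# Executable names are assumptions; adapt them to your installation.
DEFAULT_TOOLS = ("Trinity", "mafft", "exonerate", "blastn", "makeblastdb")


def find_tools(tools: Iterable[str] = DEFAULT_TOOLS) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not in $PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND in $PATH'}")
```
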
### Python library
*    pandas
*    Matplotlib >= 1.13


## How to get apytram?

### By downloading
apytram is available on github [https://github.com/CarineRey/apytram](https://github.com/CarineRey/apytram).

### By command line
This command will create a directory named "apytram".
```sh
git clone https://github.com/CarineRey/apytram.git
```


## Basic use

### Using the included example (Nucleotide query)
This example is included in the apytram directory. It should run in about a dozen seconds.
The code must be executed in the parent directory of apytram. Before running this example, check that all dependencies are installed.

```bash
export OUT="exec_example"
apytram/apytram.py -d $OUT/db/examplefq \
             -dt paired \
             -fq apytram/example/example_db.fastq \
             -q apytram/example/ref_gene.fasta \
             -out $OUT/apytram \
             -log $OUT/apytram.log \
             --plot \
             --plot_ali 
```

### How to get help?
```sh
/apytram_path/apytram.py -h
```

### Simplest use
```sh
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name
```

### Change the number of threads
```sh
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -threads $nb_threads
```

### Access to temporary files
```sh
./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -tmp $tmp_directory_name
```

### Database building only
```sh
./apytram.py -d $db_name -dt $db_type -fq $fastq_name
```

### Use of the final filter
*    If you want to keep only reconstructed sequences whose length is greater than **X percent** of the reference sequence:
    ```sh
    ./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -flen X
    ```
    
*    If you want to keep only reconstructed sequences which have an identity percentage greater than **Y** with the reference sequence over the length of their alignment:
    ```sh
    ./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -fid Y
    ```
    
*    If you want to keep only reconstructed sequences which align on the reference over a length greater than **Z percent** of the reference sequence:
    ```sh
    ./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -fmal Z
    ```
    
*    If you want to combine all these filters:
    ```sh
    ./apytram.py -d $db_name -dt $db_type -fq $fastq_name -q $query_file_name -flen X -fid Y -fmal Z
    ```

## Speed optimization

Paired-end RNA-seq data run faster than single-end data.

To save time:
*     do not use the -tmp option
*     do not use the --keep_iterations option
*     do not use the --finish_all_iter option
*     do not use the --plot and --plot_ali options

If you want to run apytram on several query files and several threads are available, it is more efficient to run many jobs with few threads each than few jobs with many threads each. Each job will be slower, but the whole set will finish sooner. This is because the time saved by Trinity and BLAST does not scale linearly with the number of threads.
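
One possible way to apply this advice is to launch one single-thread job per query file and to run several jobs at the same time. The sketch below only uses options listed in the help message. It assumes that the database has already been built (so that concurrent jobs only read it) and that apytram.py is in the current directory; `build_command` and `run_jobs` are hypothetical helpers.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def build_command(db_name: str, db_type: str, query: str, out_prefix: str,
                  threads: int = 1, apytram: str = "./apytram.py") -> List[str]:
    """Build the command line of one apytram job."""
    return [apytram, "-d", db_name, "-dt", db_type, "-q", query,
            "-out", out_prefix, "-log", out_prefix + ".log",
            "-threads", str(threads)]


def run_jobs(commands: Sequence[List[str]], workers: int,
             runner: Callable[[List[str]], int] = subprocess.call) -> List[int]:
    """Run the commands, at most `workers` at a time, and return their exit codes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))
```

For example, with 8 available threads, `run_jobs(commands, workers=8)` runs eight single-thread jobs side by side instead of one job with `-threads 8`.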

## Optimizing the accuracy

See the following options:
*    -e
*    -mal
*    -id
*    -len
*    -fid
*    -fmal
*    -flen

## Following the program's progress

A job can take from a few minutes to several hours to complete. To follow the progress of your job, you can look at the log file (-log option, apytram.log by default).

You can look at the `$OUTPUT_PREFIX.stats.pdf` file (--plot option) at the end of the job to get general information on its progress. All values needed to create the plots in `$OUTPUT_PREFIX.stats.pdf` are available in `$OUTPUT_PREFIX.stats.csv`. The `$OUTPUT_PREFIX.stats.csv` file can also be created on its own with the --stats option.
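
The csv file can also be inspected without the plots. The sketch below makes no assumption about column names: it only assumes a comma-separated file with a header row, which should be checked against your own output. The function name `read_stats` is hypothetical.

```python
import csv
from typing import Dict, List


def read_stats(path: str) -> List[Dict[str, str]]:
    """Load a stats file as a list of rows, one dictionary per row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def show_last_row(path: str) -> None:
    """Print every column of the last row of the stats file."""
    rows = read_stats(path)
    if not rows:
        print("no row found in", path)
        return
    for column, value in rows[-1].items():
        print(f"{column}: {value}")
```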

## Advanced use
For the moment, this section only contains the help message.

```
usage: apytram.py [-h] [--version] -d DATABASE -dt {single,paired} [-fa FASTA]
                  [-fq FASTQ] [-q QUERY] [-pep QUERY_PEP] [-i ITERATION_MAX]
                  [-out OUTPUT_PREFIX] [-log LOG] [-tmp TMP]
                  [--keep_iterations] [--no_best_file] [--no_last_iter_file]
                  [--stats] [--plot] [--plot_ali] [-e EVALUE] [-id MIN_ID]
                  [-mal MIN_ALI_LEN] [-len MIN_LEN]
                  [--required_coverage REQUIRED_COVERAGE] [--finish_all_iter]
                  [-flen FINAL_MIN_LEN] [-fid FINAL_MIN_ID]
                  [-fmal FINAL_MIN_ALI_LEN] [-threads THREADS]

Run apytram.py on a fastq file to retrieve homologous sequences of bait
sequences.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Required arguments:
  -d DATABASE, --database DATABASE
                        Database prefix name. If a database with the same name
                        already exists, the existing database will be kept and
                        the database will NOT be rebuilt.
  -dt {single,paired}, --database_type {single,paired}
                        single or paired end RNA-seq data. WARNING: Paired
                        read names must finished by 1 or 2.

Input Files:
  -fa FASTA, --fasta FASTA
                        Fasta formated RNA-seq data to build the database of
                        reads
  -fq FASTQ, --fastq FASTQ
                        Fastq formated RNA-seq data to build the database of
                        reads. (The fastq will be first converted to a fasta
                        file. This process can require some time.
  -q QUERY, --query QUERY
                        Fasta file (nucl) with bait sequences for the apytram
                        run. If no query is submitted, the program will just
                        build the database.
  -pep QUERY_PEP, --query_pep QUERY_PEP
                        Fasta file containing the query in the peptide format.
                        It will be used at the first iteration as bait
                        sequences to fish reads. It is compulsory to include
                        also the query in nucleotide format (-q option)
  -i ITERATION_MAX, --iteration_max ITERATION_MAX
                        Maximum number of iterations. (Default 5)

Output Files:
  -out OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Output prefix (Default ./apytram)
  -log LOG              a log file to report avancement (default: apytram.log)
  -tmp TMP              Directory to stock all intermediary files for the
                        apytram run. (default: a directory in /tmp which will
                        be removed at the end)
  --keep_iterations     A fasta file will be created at each iteration.
                        (default: False)
  --no_best_file        By default, a fasta file (Outprefix.best.fasta)
                        containing only the best sequence is created. If this
                        option is used, it will NOT be created.
  --no_last_iter_file   By default, a fasta file (Outprefix.fasta) containing
                        all sequences from the last iteration is created. If
                        this option is used, it will NOT be created.
  --stats               Create files with statistics on each iteration.
                        (default: False)
  --plot                Create plots to represent the statistics on each
                        iteration. (default: False)
  --plot_ali            Create file with a plot representing the alignement of
                        all sequences from the last iteration on the query
                        sequence. Take some seconds. (default: False)

Thresholds for EACH ITERATION:
  -e EVALUE, --evalue EVALUE
                        Evalue threshold of the blastn of the bait queries on
                        the database of reads. (Default 1e-3)
  -id MIN_ID, --min_id MIN_ID
                        Minimum identity percentage of a sequence with a query
                        on the length of their alignment so that the sequence
                        is kept at the end of a iteration (Default 20)
  -mal MIN_ALI_LEN, --min_ali_len MIN_ALI_LEN
                        Minimum alignment length of a sequence on a query to
                        be kept at the end of a iteration (Default 180)
  -len MIN_LEN, --min_len MIN_LEN
                        Minimum length to keep a sequence at the end of a
                        iteration. (Default 200)

Criteria to stop iteration:
  --required_coverage REQUIRED_COVERAGE
                        Required coverage of a bait sequence to stop iteration
                        (Default: No threshold)
  --finish_all_iter     By default, iterations are stop if there is no
                        improvment, if this option is used apytram will finish
                        all iteration (-i).

Thresholds for Final output files:
  -flen FINAL_MIN_LEN, --final_min_len FINAL_MIN_LEN
                        Minimum PERCENTAGE of the query length to keep a
                        sequence at the end of the run. (Default: 0)
  -fid FINAL_MIN_ID, --final_min_id FINAL_MIN_ID
                        Minimum identity PERCENTAGE of a sequence with a query
                        on the length of their alignment so that the sequence
                        is kept at the end of the run (Default 0)
  -fmal FINAL_MIN_ALI_LEN, --final_min_ali_len FINAL_MIN_ALI_LEN
                        Alignment length between a sequence and a query must
                        be at least this PERCENTAGE of the query length to
                        keep this sequence at the end of the run. (Default: 0)

Miscellaneous options:
  -threads THREADS      Number of available threads. (Default 1)
  
```

## References
Allen JM, Huang DI, Cronk QC, Johnson KP. 2015. aTRAM - automated target restricted assembly method: a fast method for assembling loci across divergent taxa from next-generation sequencing data. BMC Bioinformatics 16:98. DOI: 10.1186/s12859-015-0515-2