# Guide to the Source and the Results

## Part 1. Alignments and Feature Extraction
Alignments were computed and analyzed on Saga, which is part of the [Norwegian Research Infrastructure Services](https://documentation.sigma2.no/).
Saga runs the Linux operating system and the Slurm job control system.
Many of our shell scripts include Slurm directives for resource allocation.
The directives are formatted as comments, so they are ignored in other environments.
These scripts may run as-is on any Linux system,
but they may need customization to exploit other grid computing environments.
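
Before porting a script to another scheduler, it can help to list the Slurm directives it contains. The sketch below is one way to do that in Python. It assumes the directives use the usual `#SBATCH` prefix; the function name and the sample script text are hypothetical and are not taken from this repository.

```python
def list_slurm_directives(script_text):
    """Return the option part of each '#SBATCH' line in a shell script."""
    directives = []
    for line in script_text.splitlines():
        stripped = line.strip()
        # Slurm reads these lines; a plain shell treats them as comments.
        if stripped.startswith("#SBATCH"):
            directives.append(stripped[len("#SBATCH"):].strip())
    return directives


# Hypothetical script text, for illustration only.
example_script = """#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --time=04:00:00
echo "alignment would run here"
"""

if __name__ == "__main__":
    for directive in list_slurm_directives(example_script):
        print(directive)
```

The listed options are the ones that would need an equivalent on a different grid system.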

### Data organization on disk
This shows the directory hierarchy for one aligner and one genus. Other datasets were organized similarly.
* Equus
    * Transcriptomes - two gzipped cDNA FASTA files
    * Genomes - two gzipped gDNA FASTA files
    * Reads - two gzipped FASTQ files, trimmed, with R1 and R2 in the filenames
    * bowtie - align to each reference separately (for machine learning)
        * index_asinus
            * inputs = soft link to cDNA file
            * outputs = index files in bt2 format
        * index_caballus - similar
        * map_caballus_to_caballus
            * inputs = soft links to reads and index files
            * outputs = alignments to transcripts, filename Sorted.bam
        * map_caballus_to_asinus - similar
        * map_asinus_to_asinus - similar
        * map_asinus_to_caballus - similar
    * diploid_bowtie - align to concatenation of references (no machine learning)
        * map_caballus_to_diploid
            * inputs = soft links to reads and index files
            * outputs = alignments to transcripts, filename Primary.bam
        * map_asinus_to_diploid - similar

### Scripts
See the scripts directory.
This shows scripts related to one aligner.
Similar scripts were used for other aligners.

Scripts to prepare for machine learning
* bowtie_index.sh - create an index of the transcriptome
* bowtie_best1.sh - align reads to transcripts
* make_stats.sh - launch feature extraction on particular alignments
* bam_two_targets.py - feature extraction, see Table 1, input BAM, output csv

Scripts for comparison to the aligner by itself
* make_diploid_fasta.sh - concatenate transcriptomes, modify deflines for counting
* analyze_diploid_maps.sh - count read pair alignments to either reference in the diploid file
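
The defline modification and the later counting step can be pictured with a short Python sketch. This is not the code in make_diploid_fasta.sh or analyze_diploid_maps.sh. It assumes that a species tag is prepended to each sequence identifier, so that alignments to the concatenated file can be attributed to one reference. The function names, the tag format, and the file names are hypothetical.

```python
import gzip
from collections import Counter


def tag_deflines(lines, tag):
    """Yield FASTA lines with `tag` prepended to each sequence identifier."""
    for line in lines:
        if line.startswith(">"):
            # Assumed tag format: >tag_originalIdentifier
            yield ">" + tag + "_" + line[1:]
        else:
            yield line


def make_diploid_fasta(paths_and_tags, out_path):
    """Concatenate gzipped FASTA files into one gzipped file, tagging deflines."""
    with gzip.open(out_path, "wt") as out:
        for path, tag in paths_and_tags:
            with gzip.open(path, "rt") as fin:
                out.writelines(tag_deflines(fin, tag))


def count_by_tag(reference_names):
    """Count aligned reference names by the tag before the first underscore."""
    return Counter(name.split("_", 1)[0] for name in reference_names)
```

A call such as `make_diploid_fasta([("caballus.fa.gz", "caballus"), ("asinus.fa.gz", "asinus")], "diploid.fa.gz")` would build the combined file, with placeholder file names. The reference names passed to `count_by_tag` would come from the alignments in Primary.bam, read with whatever BAM library is at hand.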

## Part 2. Machine learning

The Jupyter notebooks ran on
[Google CoLab](https://colab.research.google.com/) Pro virtual computers.
All the RF notebooks contain two sets of statistics.
Stats under "Comparison" refer to the comparison of alignment scores (columns B and E in most tables).
Stats under "Validation" refer to the machine learning method (columns C and F in most tables).
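
As a rough illustration of the "Comparison" statistic, the sketch below assigns each read pair to whichever reference gave the higher alignment score and reports the accuracy. It is an assumption about the general idea, not the notebooks' code. The function name, the tie-breaking rule, and the toy scores are hypothetical.

```python
def comparison_accuracy(scores_to_ref0, scores_to_ref1, true_labels):
    """Fraction of read pairs whose higher-scoring reference matches the label."""
    correct = 0
    for s0, s1, label in zip(scores_to_ref0, scores_to_ref1, true_labels):
        # Assumption: ties are assigned to reference 0.
        predicted = 1 if s1 > s0 else 0
        if predicted == label:
            correct += 1
    return correct / len(true_labels)


# Toy values for illustration only; not data from the paper.
toy_ref0 = [-2, -10, -5, 0]
toy_ref1 = [-8, -3, -5, -1]
toy_labels = [0, 1, 1, 0]

if __name__ == "__main__":
    print(comparison_accuracy(toy_ref0, toy_ref1, toy_labels))
```

In the toy data the third pair has equal scores, which is the case where a score comparison alone cannot decide.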

* Table 1 - List of features extracted by bam_two_targets.py
* Table 2 - Arabidopsis RNA
    * RF_203 - Bowtie2
    * RF_207 - STAR RNA
* Table 3 - Arabidopsis DNA
    * RF_205 - HiSat2
    * RF_209 - STAR DNA
* Table 4 - Brassica RNA
    * RF_211 - Bowtie2
    * RF_213 - STAR RNA
* Table 5 - Brassica DNA
    * RF_215 - HiSat2
    * RF_217 - STAR DNA
* Table 6 - Equus RNA
    * RF_219 - Bowtie2
    * RF_223 - STAR RNA
* Table 7 - Equus DNA
    * RF_221 - HiSat2
    * RF_225 - STAR DNA
* Figure 1 - no source
* Figure 2 and Figure 3
    * See Figures3 notebook
* Table S1 - Compare architectures
    * GB_141
    * SVM_141b
    * MLP_141
* Table S2 - Compare forests
    * RF_141a
    * RF_141b
* Table S3 - Mus RNA
    * RF_149 - Bowtie2
    * RF_158 - STAR RNA
* Table S4 - Mus DNA
    * RF_150 - HiSat2
    * RF_159 - STAR DNA
* Table S5 - no source
* Table S6 - Equus generalization
    * RF_147 - Primary data used to train and test
    * RF_156 - Secondary data used for test only

## Part 3. Sample data
See the sample_data directory.

The reads and references are public data and too large to copy here.

The alignment files (BAM) are easily generated and too large to copy here.

The extracted feature files, in csv format, are also large and easily generated.

However, subsets with 10,000 rows are provided here.
These can be used with the appropriate RF notebook
by editing these variables in the notebook:
MAX_LINES_TO_LOAD (set to 10000),
DATA_FILE_0 (actual filename),
DATA_FILE_1 (actual filename).
We used similar files for quick initial testing of RF notebooks.
Even though the subset data size is 1/100 of that used for the paper,
the accuracy of models trained on the subsets was usually within
a percentage point of the result shown in the paper.
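
For orientation, the settings might look like the following in a notebook cell. The variable names are the ones listed above. The file names are placeholders, and the loader is a stand-in written with the Python standard library, not the notebooks' actual loading code. It also assumes the csv file starts with a header row.

```python
import csv
from itertools import islice

MAX_LINES_TO_LOAD = 10000
DATA_FILE_0 = "subset_0.csv"  # placeholder, replace with the actual filename
DATA_FILE_1 = "subset_1.csv"  # placeholder, replace with the actual filename


def load_rows(path, max_rows):
    """Read the header and up to max_rows data rows from a csv file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # assumption: first line is a header
        rows = list(islice(reader, max_rows))
    return header, rows
```

Calling `load_rows(DATA_FILE_0, MAX_LINES_TO_LOAD)` would then return at most 10,000 rows from the first subset file.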