# Guide to the Source and the Results

## Part 1. Alignments and Feature Extraction
This process ran on Saga under the Linux O/S and the Slurm job control system.

The process is described here for one experimental setup,
specifically the one that uses Bowtie2 to align reads to transcripts of genus Equus.
The other experiments used similar layouts and scripts.

Many of the shell scripts include Slurm directives for resource allocation.
The directives are formatted as comments, so they would be ignored in other environments.

These scripts run as-is on any Linux O/S, 
but they would need customization to exploit other grid computing environments.

### Data organization on disk
* Equus
    * Transcriptomes - two gzipped cDNA FASTA files
    * Genomes - two gzipped gDNA FASTA files
    * Reads - two gzipped FASTQ files, trimmed, with R1 and R2 in the filenames
    * bowtie - align to each reference separately (for machine learning)
        * index_asinus
            * inputs = soft link to cDNA file
            * outputs = index files in bt2 format
        * index_caballus - similar
        * map_caballus_to_caballus
            * inputs = soft links to reads and index files
            * outputs = alignments to transcripts, filename Sorted.bam
        * map_caballus_to_asinus - similar
        * map_asinus_to_asinus - similar
        * map_asinus_to_caballus - similar
    * diploid_bowtie - align to concatenation of references (no machine learning)
        * map_caballus_to_diploid
            * inputs = soft links to reads and index files
            * outputs = alignments to transcripts, filename Primary.bam
        * map_asinus_to_diploid - similar

### Scripts
See the scripts directory.

Scripts to prepare for machine learning
* bowtie_index.sh - create an index of the transcriptome
* bowtie_best1.sh - align reads to transcripts
* make_stats.sh - launch feature extraction on particular alignments
* bam_two_targets.py - feature extraction, see Table 1, input BAM, output csv

Scripts for comparison to the aligner by itself
* make_diploid_fasta.sh - concatenate transcriptomes, modify deflines for counting
* count.sh - count reads mapped to correct parent in concatenated transcriptome
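The idea behind the two comparison scripts can be sketched in Python. This is an illustration only, not the code in make_diploid_fasta.sh or count.sh: the `parent_` prefix added to each defline, the function names, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter


def tag_deflines(fasta_lines, parent):
    """Yield FASTA lines with the parent label prefixed to each sequence ID."""
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumed tagging scheme: ">T1 ..." becomes ">caballus_T1 ..."
            yield ">" + parent + "_" + line[1:]
        else:
            yield line


def make_diploid(references):
    """Concatenate (parent, lines) references into one list of tagged lines."""
    diploid = []
    for parent, lines in references:
        diploid.extend(tag_deflines(lines, parent))
    return diploid


def count_by_parent(reference_names, parents):
    """Count alignments per parent from the reference name of each alignment."""
    counts = Counter()
    for name in reference_names:
        for parent in parents:
            if name.startswith(parent + "_"):
                counts[parent] += 1
                break
    return counts


# Toy transcriptomes (hypothetical data): same transcript ID in both parents
caballus = [">T1 gene=A", "ACGT"]
asinus = [">T1 gene=A", "ACGA"]
diploid = make_diploid([("caballus", caballus), ("asinus", asinus)])

# Reference names that caballus reads aligned to (hypothetical data)
hits = ["caballus_T1", "caballus_T1", "asinus_T1"]
counts = count_by_parent(hits, ["caballus", "asinus"])
fraction_correct = counts["caballus"] / len(hits)
```

Tagging the deflines keeps transcripts from the two parents distinguishable after concatenation, so the parent of each alignment can be read directly from its reference name.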

## Part 2. Machine learning

See the notebooks directory.

The Jupyter notebooks ran on Google CoLab Pro virtual computers.
All the RF notebooks contain two sets of statistics.
Stats under "Comparison" refer to the comparison of alignment scores (columns B and E in most tables). 
Stats under "Validation" refer to the machine learning method (columns C and F in most tables).
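As a rough illustration of the "Comparison" statistics, the sketch below assigns each read to the parent whose alignment score is higher and reports the fraction assigned correctly. The function names, the treatment of ties as incorrect, and the example scores are assumptions; the notebooks may compute this differently.

```python
def predict_by_score(score_parent1, score_parent2):
    """Return 1 or 2 for the parent with the higher score, or 0 for a tie."""
    if score_parent1 > score_parent2:
        return 1
    if score_parent2 > score_parent1:
        return 2
    return 0


def accuracy(score_pairs, true_parent):
    """Fraction of reads assigned to true_parent; ties count as incorrect."""
    correct = sum(
        1 for s1, s2 in score_pairs if predict_by_score(s1, s2) == true_parent
    )
    return correct / len(score_pairs)


# Hypothetical scores for four reads that truly come from parent 1.
# Each pair is (score vs parent 1, score vs parent 2); higher is better.
pairs = [(0, -6), (-12, -5), (-3, -3), (0, -10)]
baseline_accuracy = accuracy(pairs, true_parent=1)
```

In this toy input, two reads score higher against parent 1, one scores higher against parent 2, and one is tied, so only half are assigned correctly.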

* Columns A and D of all tables:
    * Scores
* Table 1 - see feature extraction above
* Table 2 - Arabidopsis RNA
    * RF_141 - Bowtie2
    * RF_142 - STAR RNA
* Table 3 - Arabidopsis DNA
    * RF_144 - HiSat2
    * RF_143 - STAR DNA
* Table 4 - Brassica RNA
    * RF_145 - Bowtie2
    * RF_148 - STAR RNA
* Table 5 - Brassica DNA
    * RF_146 - HiSat2
    * RF_151 - STAR DNA
* Table 6 - Equus RNA
    * RF_147 - Bowtie2
    * RF_157 - STAR RNA
* Table 7 - Equus DNA
    * RF_153 - HiSat2
    * RF_154 - STAR DNA
* Figure 1 - no source
* Figure 2 - Figures
* Figure 3 - Figures
* Table S1 - Compare architectures
    * GB_141
    * SVM_141b
    * MLP_141
* Table S2 - Compare forests
    * RF_141a
    * RF_141b
* Table S3 - Mus RNA
    * RF_149 - Bowtie2
    * RF_158 - STAR RNA
* Table S4 - Mus DNA
    * RF_150 - HiSat2
    * RF_159 - STAR DNA
* Table S5 - no source
* Table S6 - Equus generalization 
    * RF_147 - Primary data used to train and test
    * RF_156 - Secondary data used for test only

## Part 3. Sample data
See the sample_data directory.

The reads and references are public data and too large to copy here.

The alignment files (BAM) are easily generated and too large to copy here.

The extracted feature files, in csv format, are also large and easily generated.

However, subsets with 10,000 rows are provided here.
These can be used with the appropriate RF notebook
by editing the DATA_FILE setting in the notebook.
We used similar files for quick initial testing of RF notebooks.
Even though the subset data size is 1/100 of that used for the paper,
the accuracy of models trained on the subsets was usually within
a percentage point of the result shown in the paper.
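Before pointing a notebook at a subset, it can help to check that the file loads and has the expected number of rows. The sketch below uses only the Python standard library; the file name and the `load_features` helper are hypothetical, and the notebooks themselves may read the data differently.

```python
import csv

# Hypothetical file name; substitute a file from the sample_data directory
DATA_FILE = "Equus_bowtie_subset.csv"


def load_features(path, max_rows=None):
    """Read a feature csv and return (header, rows), optionally truncated."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first line holds the column names
        rows = []
        for row in reader:
            if max_rows is not None and len(rows) >= max_rows:
                break
            rows.append(row)
    return header, rows


# Example usage once DATA_FILE points to a real file:
#   header, rows = load_features(DATA_FILE)
#   print(len(header), "columns;", len(rows), "rows")
```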