# Data Augmentation

Neural networks generally get more accurate with more training data,
but labeled training data is scarce and valuable.
Data augmentation means artificially enlarging the training set
with modified copies of real examples.
Data augmentation is established practice 
in the world of image classification where
real images are flipped, rotated, cropped, darkened, colorized, etc.
The real and almost-real images are all used in training.

## How should data augmentation work for RNA?

We cannot be sure a mutated RNA is still labeled correctly. 
After you change a letter, it may not be protein coding any more!
For example, a single substitution can turn the tyrosine codon UAC into the stop codon UAA.

Hill et al. created an inflated training set
where every instance was a true RNA with a one-nucleotide mutation.
Aware that a mutant's label may no longer apply,
they used the augmented data for pre-training only.
That is, train initial layers on the augmented data,
then freeze their weights and train deeper layers on real data.
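
A minimal Python sketch of the one-nucleotide mutation step, for illustration only.
This is not Hill et al.'s code: the function name `point_mutants`, the ACGU alphabet,
and the uniform random choice of position and replacement base are all assumptions made here.

```python
import random

def point_mutants(seq, n, alphabet="ACGU", rng=None):
    """Return n copies of seq, each with exactly one substituted base."""
    rng = rng or random.Random()
    mutants = []
    for _ in range(n):
        pos = rng.randrange(len(seq))  # position to mutate
        # Only allow bases that differ from the original, so every mutant is a true mutant.
        choices = [b for b in alphabet if b != seq[pos]]
        mutants.append(seq[:pos] + rng.choice(choices) + seq[pos + 1:])
    return mutants

# Example: five mutants of a toy sequence, reproducible via the seed.
mutants = point_mutants("AUGGCCUAA", 5, rng=random.Random(0))
```

Each mutant would inherit its parent's label, which may be wrong; that is the reason to restrict such data to pre-training.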

## Literature

Very long and entirely about images.  
2019: A survey on Image Data Augmentation for Deep Learning. Journal of Big Data [free html](https://link.springer.com/article/10.1186/s40537-019-0197-0). 

Naturally, someone has tried to automate the process.  
2019: AutoAugment: Learning Augmentation Strategies From Data. CVPR [free pdf](https://openaccess.thecvf.com/content_CVPR_2019/html/Cubuk_AutoAugment_Learning_Augmentation_Strategies_From_Data_CVPR_2019_paper.html). 

Image augmentation using GANs (generative adversarial networks).  
2017: The Effectiveness of Data Augmentation in Image Classification using Deep Learning. arXiv [free pdf](https://arxiv.org/abs/1712.04621)

Image augmentation using GAN.  
2017: Data Augmentation Generative Adversarial Networks. arXiv [free pdf](https://arxiv.org/abs/1711.04340)

Sentence augmentation by replacing words.  
2018: Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. arXiv [free pdf](https://arxiv.org/abs/1805.06201)
* Previous work augments sentences by replacing words with synonyms.
* Here, words are replaced with words predicted by a model that fills in the missing word from the rest of the sentence (context). This seems like GAN.
* Tested CNN and LSTM.
* Very small gain achieved through training on augmented data.
* Could we train ANNs to predict a masked letter? Only accept changes that don't match the original? Extend to K-mers?
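
The masked-letter idea in the last bullet could start from training pairs like the ones built below.
This is only a sketch of the data preparation, not of the model:
the function name `masked_examples` and the use of `N` as the mask symbol are assumptions made here.

```python
def masked_examples(seq, mask="N"):
    """Return (masked sequence, hidden letter) pairs, one per position of seq."""
    return [(seq[:i] + mask + seq[i + 1:], seq[i]) for i in range(len(seq))]

# Each pair is one training example: the input hides one letter, the target is that letter.
pairs = masked_examples("ACG")
# pairs == [("NCG", "A"), ("ANG", "C"), ("ACN", "G")]
```

A model trained on such pairs could then propose substitutions, and only proposals that differ from the hidden letter would count as augmentations.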

## Ideas

### Preserve the K-mer profile
Mutate an RNA but preserve its K-mer profile.
This should not change MLP accuracy,
assuming the MLP takes K-mer counts as input, because that input is unchanged.
How would it change RNN accuracy?

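Here is a trivial mutation algorithm. 
For K=3, find any two 5-mers that are identical except for their middle base, and exchange their middle bases.
So, ...AATCC...AAGCC... becomes ...AAGCC...AATCC...  
Every 3-mer that covers a middle base lies inside its 5-mer, so the 3-mer counts are unchanged.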
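
The middle-base exchange can be sketched in Python as follows.
This is one possible implementation under stated assumptions:
the names `kmer_profile` and `swap_middle_bases` are made up here,
the flanks are K-1 bases on each side (so 5-mers for K=3),
and the two windows are required not to overlap so that the flanks stay intact.

```python
from collections import Counter

def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def swap_middle_bases(seq, k=3):
    """Exchange the middle bases of the first pair of (2k-1)-mers that share
    both flanks but differ in the middle. Return None if no such pair exists."""
    flank = k - 1
    seen = {}  # (left flank, right flank) -> center positions seen so far
    for i in range(flank, len(seq) - flank):
        key = (seq[i - flank:i], seq[i + 1:i + 1 + flank])
        for j in seen.get(key, []):
            # Middle bases must differ, and the two windows must not overlap.
            if seq[j] != seq[i] and i - j > 2 * flank:
                bases = list(seq)
                bases[i], bases[j] = bases[j], bases[i]
                return "".join(bases)
        seen.setdefault(key, []).append(i)
    return None

original = "GGAATCCGGGAAGCCTT"
mutated = swap_middle_bases(original, k=3)
# mutated == "GGAAGCCGGGAATCCTT", with the same 3-mer profile as the original
```

Comparing `kmer_profile(original, 3)` with `kmer_profile(mutated, 3)` is a cheap sanity check to run on every augmented sequence.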

### Reverse, rotate, or swap

Reversing the RNA should have no effect on MLP accuracy.
We predict no effect on RNN accuracy since 
a bidirectional LSTM did no better than a unidirectional LSTM.

Rotating the RNA could be safe.
A 1-base rotation of ABCDE would be BCDEA.
Some rotations will break an ORF, 
but most rotations probably preserve the important sequence features.

Swapping segments of an RNA is the riskiest. 
A 50:50 swap of ABCDEFGH would be EFGHABCD.
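
The three operations are simple string manipulations; a minimal Python sketch follows.
The function names and the `fraction` parameter are made up here for illustration.

```python
def reverse(seq):
    """Return the sequence read backwards (not the reverse complement)."""
    return seq[::-1]

def rotate(seq, n=1):
    """Move the first n bases to the end of the sequence."""
    if not seq:
        return seq
    n %= len(seq)
    return seq[n:] + seq[:n]

def swap(seq, fraction=0.5):
    """Cut the sequence at the given fraction of its length and exchange the two parts."""
    cut = int(len(seq) * fraction)
    return seq[cut:] + seq[:cut]

rotate("ABCDE", 1)   # "BCDEA"
swap("ABCDEFGH")     # "EFGHABCD"
```

Written this way, a swap is just a rotation by a large offset, so the difference between "safe" and "risky" is one of degree.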

### Splice or alternate exons
Cells generate multiple transcripts from one gene sequence.
One mechanism is splicing out parts.
The parts that are always spliced out are called introns. 
The parts usually left in are called exons.
But alternative splicing is observed sometimes.
For example, if a gene usually retains exons 1,2,3,4,
then RNA transcripts with exons 1,2,4 
or 1,3,4 are called alternative splices.
We could get the predicted splice sites (exon boundaries)
from the sequence databases.
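
Given exon sequences for a gene, exon-skipping variants could be generated along these lines.
This is a sketch under assumptions: the function name `exon_skipping_variants` is made up here,
exactly one internal exon is skipped per variant, and the first and last exons are always kept.
Real exon boundaries would have to come from database annotations.

```python
def exon_skipping_variants(exons):
    """Return transcripts that each omit exactly one internal exon.

    exons: list of exon sequences in transcript order.
    """
    variants = []
    for skip in range(1, len(exons) - 1):  # never skip the first or last exon
        kept = exons[:skip] + exons[skip + 1:]
        variants.append("".join(kept))
    return variants

# Toy gene with four exons: yields exon sets 1,3,4 and 1,2,4.
variants = exon_skipping_variants(["AAA", "CCC", "GGG", "UUU"])
# variants == ["AAAGGGUUU", "AAACCCUUU"]
```

As with point mutations, a skipped exon can shift the reading frame, so the parent's label is not guaranteed to carry over.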

### Evolution
Perhaps we could augment human transcripts
by including primate transcripts or even other mammals.
Note Hill et al. tested their human-trained model on mouse RNA.

The NONCODE database has primate lncRNA including ~15K from gorilla.
See their stats by species on this [page](http://www.noncode.org/analysis.php).
See their 2017 publication in [NAR](https://academic.oup.com/nar/article/46/D1/D308/4616876).