# Preprocessing DNA sequences for ML

Computational biologists work with DNA and RNA sequence data daily. These data are usually stored in a text format called FASTA. Each record consists of a header line starting with ">" followed by the sequence itself, a string over four nucleotide types ({A, C, G, T} for DNA, {A, C, G, U} for RNA):
>information about the sequence
ATGTTTCGCATCACCAACATTGAGTTTCTTCCCGAATACCGACAAAAGGAGTCCAGGGAA
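
Real FASTA files often hold several records, and a sequence may be wrapped over several lines. The following is a minimal sketch of a reader for that layout, assuming plain text input; the function name `parse_fasta` and the example records are illustrative and not part of the original notebook.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    Sequence lines that appear before the first header are ignored.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # emit the last record
    if header is not None:
        yield header, "".join(chunks)


# hypothetical two-record input; the first sequence is wrapped over two lines
example = ">seq1 first record\nATGTTTCG\nCATCACCA\n>seq2\nGGGAA\n"
records = list(parse_fasta(example))
print(records)
# [('seq1 first record', 'ATGTTTCGCATCACCA'), ('seq2', 'GGGAA')]
```

For production work a dedicated parser is usually a better choice; this sketch only shows the structure of the format.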



How can we use these sequences for machine or deep learning? 
The first preprocessing step is to encode the data so it can be used by different ML algorithms.

There are three general approaches for this:

1. Label-encoding
2. One-hot encoding
3. Using various "language" processing methods
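
As a rough illustration of the third approach, a sequence can be split into overlapping k-mers that play the role of words in a sentence. This is only one possible reading of "language" processing, and it is an assumption here; the function name `kmer_tokens` and the choice of k=3 are illustrative.

```python
from collections import Counter


def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokens("ATGTGT", k=3)
print(tokens)  # ['ATG', 'TGT', 'GTG', 'TGT']

# counting the tokens gives a simple bag-of-words style feature vector
counts = Counter(tokens)
print(counts["TGT"])  # 2
```

The resulting tokens or counts could then be passed to standard text-processing tools, which is outside the scope of the two encodings covered below.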


## 1. Label-encoding

This approach is very simple and it involves converting each value in a column to a number.

For DNA, this means encoding each nucleotide character as an ordinal value. For example, “ACGT” becomes [0, 1, 2, 3]. Any other character, such as N, is encoded as 4. Let's see an example:

In [36]:
# EXAMPLE
fasta = ">test_sequence\nATGTGTCGTAGTCGTACGNN"
# load the required libraries
import numpy as np
from sklearn.preprocessing import LabelEncoder
import re

# check for a header line and grab the sequence name
if fasta.startswith(">"):
    lines = fasta.split("\n")
    name = lines[0][1:]            # drop the leading ">"
    sequence = "".join(lines[1:])  # the sequence may span several lines
else:
    name = 'unknown_sequence'
    sequence = fasta
sequence = sequence.lower()
# replace anything that is not a, c, g or t with the placeholder 'z'
sequence = re.sub('[^acgt]', 'z', sequence)
# get the sequence into an array of single characters
seq_array = np.array(list(sequence))
print(seq_array)
# fit the label encoder on the fixed alphabet: a=0, c=1, g=2, t=3, z=4
# (the sequence itself is integer-encoded later with label_encoder.transform)
label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','c','g','t','z']))
print(label_encoder)

['a' 't' 'g' 't' 'g' 't' 'c' 'g' 't' 'a' 'g' 't' 'c' 'g' 't' 'a' 'c' 'g'
 'z' 'z']
LabelEncoder()


And here is a function that encodes a DNA sequence array as an ordinal vector. It returns a NumPy array with a=0.25, c=0.50, g=0.75, t=1.00, and 0.00 for any other character (such as n).

In [37]:
# function to encode a DNA sequence array as an ordinal vector
# returns a numpy vector with a=0.25, c=0.50, g=0.75, t=1.00, anything else (z)=0.00
def ordinalEncoder(dna_array):
    # encode the argument that was passed in, not the global seq_array
    integer_encoded_seq = label_encoder.transform(dna_array)
    print(integer_encoded_seq)
    # position i holds the float value for integer label i (a, c, g, t, z)
    lookup = np.array([0.25, 0.50, 0.75, 1.00, 0.00])
    return lookup[integer_encoded_seq]

In [38]:
ordinalEncoder(seq_array)

[0 3 2 3 2 3 1 2 3 0 2 3 1 2 3 0 1 2 4 4]


array([0.25, 1.  , 0.75, 1.  , 0.75, 1.  , 0.5 , 0.75, 1.  , 0.25, 0.75,
       1.  , 0.5 , 0.75, 1.  , 0.25, 0.5 , 0.75, 0.  , 0.  ])
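
The same label encoding can be sketched without scikit-learn by using a NumPy lookup table indexed by character code. This is an alternative illustration, not the notebook's method; the names `LABELS`, `SCALE` and `label_encode` are hypothetical, and the sketch assumes ASCII input.

```python
import numpy as np

# Lookup table: ASCII code -> integer label. Anything not listed maps to 4.
LABELS = np.full(256, 4, dtype=np.int64)
for label, base in enumerate("ACGT"):
    LABELS[ord(base)] = label          # upper case
    LABELS[ord(base.lower())] = label  # lower case

# Float value for each integer label (a, c, g, t, other), as in the text above.
SCALE = np.array([0.25, 0.50, 0.75, 1.00, 0.00])


def label_encode(sequence):
    """Return integer labels (A=0, C=1, G=2, T=3, other=4) for an ASCII string."""
    codes = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    return LABELS[codes]


encoded = label_encode("ATGTGTCGTAGTCGTACGNN")
print(encoded)         # [0 3 2 3 2 3 1 2 3 0 2 3 1 2 3 0 1 2 4 4]
scaled = SCALE[encoded]
print(scaled[:4])      # [0.25 1.   0.75 1.  ]
```

Keep in mind that any ordinal encoding imposes an order on the nucleotides (for example, that G lies between C and T) that has no biological meaning, which is one motivation for one-hot encoding.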

## 2. One-hot encoding

One-hot encoding is a way to represent categorical data as binary vectors. For DNA, we have four categories: A, T, G, and C. For RNA, the four categories are A, U, G, and C.

Then, each nucleotide value is represented as a binary vector that is all zero values except the index of the nucleotide, which is marked with a 1.

Thus a one hot code for DNA could be:
A = [1, 0, 0, 0]
T = [0, 1, 0, 0]
G = [0, 0, 1, 0]
C = [0, 0, 0, 1]

So the sequence AATTC would be:
[[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 0, 0, 1]]
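
The hand-written matrix above can be reproduced with a few lines of NumPy. This is a minimal sketch that assumes the A, T, G, C column order of this example; the names `ALPHABET`, `INDEX` and `one_hot` are illustrative.

```python
import numpy as np

ALPHABET = "ATGC"  # column order used in the example above
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot(sequence):
    """Return one row per character, with a single 1 in that character's column."""
    rows = [INDEX[base] for base in sequence]
    # each row index selects the matching row of the identity matrix
    return np.eye(len(ALPHABET), dtype=int)[rows]


encoded = one_hot("AATTC")
print(encoded)
# [[1 0 0 0]
#  [1 0 0 0]
#  [0 1 0 0]
#  [0 1 0 0]
#  [0 0 0 1]]
```

The column order is arbitrary, but it has to stay the same for every sequence fed to a given model.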


By the same logic, a one-hot code for RNA could be:
A = [1, 0, 0, 0]
U = [0, 1, 0, 0]
G = [0, 0, 1, 0]
C = [0, 0, 0, 1]

So the sequence AAUUC would be:
[[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 0, 0, 1]]

In [4]:
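# one-hot encode function for DNA
import numpy as np
def one_hot_encodedna(seq):
    # column order is A, C, G, T (index 0..3); this differs from the
    # A, T, G, C order of the hand-written example above
    mapping = dict(zip("ACGT", range(4)))
    # map each character to its column index (raises KeyError on anything
    # other than uppercase A, C, G, T)
    seq2 = [mapping[i] for i in seq]
    # pick the matching rows of the 4x4 identity matrix
    return np.eye(4)[seq2]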

one_hot_encodedna("AACGTACGTGCGTAATGC")

array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.]])
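
The function above raises a `KeyError` on lowercase input or on ambiguous bases such as N. One possible way to handle them, sketched here as an assumption rather than as part of the original notebook, is to encode every unknown character as an all-zero row. The name `one_hot_with_unknown` is illustrative.

```python
import numpy as np


def one_hot_with_unknown(seq, alphabet="ACGT"):
    """One-hot encode seq; characters outside the alphabet become all-zero rows."""
    mapping = {base: i for i, base in enumerate(alphabet)}
    n = len(alphabet)
    # index n is a spare slot used for every unknown character
    idx = np.array([mapping.get(base, n) for base in seq.upper()], dtype=int)
    # identity matrix plus one extra row of zeros for the spare slot
    table = np.vstack([np.eye(n), np.zeros(n)])
    return table[idx]


encoded_n = one_hot_with_unknown("acgNt")
print(encoded_n)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 1.]]
```

Other conventions exist, such as spreading 0.25 across all four columns for N; which one is appropriate depends on the downstream model.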

In [6]:
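# one-hot encode function for RNA
import numpy as np
def one_hot_encodeRna(seq):
    # column order is A, C, G, U (index 0..3); this differs from the
    # A, U, G, C order of the hand-written example above
    mapping = dict(zip("ACGU", range(4)))
    # map each character to its column index (raises KeyError on anything
    # other than uppercase A, C, G, U)
    seq2 = [mapping[i] for i in seq]
    # pick the matching rows of the 4x4 identity matrix
    return np.eye(4)[seq2]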

one_hot_encodeRna("AACGUACGUGCGUAAUGC")

array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.]])
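
A one-hot matrix can be turned back into a sequence by taking the position of the 1 in each row with `np.argmax`. The sketch below assumes the A, C, G, U column order of the function above; the name `one_hot_decode` is illustrative.

```python
import numpy as np


def one_hot_decode(matrix, alphabet="ACGU"):
    """Map each row of a one-hot matrix back to its character."""
    # argmax returns the column index of the largest value in every row
    indices = np.argmax(matrix, axis=1)
    return "".join(alphabet[i] for i in indices)


encoded_rna = np.eye(4)[[0, 0, 1, 2, 3]]  # rows for A, A, C, G, U
decoded = one_hot_decode(encoded_rna)
print(decoded)  # AACGU
```

Note that `np.argmax` returns 0 for an all-zero row, so a matrix that uses zero rows for unknown characters would decode those positions as the first letter of the alphabet.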

## References

* https://www.biorxiv.org/content/10.1101/186965v1
* https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/