# RNN with one-hot encoded RNA

One open question is whether to apply LabelEncoder before OneHotEncoder.
LabelEncoder would map each letter to an integer category.
I found one code demo for FASTA that did this.
I also found one Stack Overflow answer (to a different question) that said this step is required.
But the LabelEncoder documentation says it is meant for the target y, not for the features X.
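
To check the question above on a small scale, here is a minimal sketch that feeds single-letter strings straight into OneHotEncoder with no LabelEncoder step. It assumes a scikit-learn version recent enough to accept string categories (0.20 or later). The names `bases`, `direct_encoder` and `onehot` are illustrative only. The sketch calls `.toarray()` on the default sparse output instead of passing a `sparse` flag, because the name of that flag differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of single-letter strings: shape (4, 1).
bases = np.array(list("ACGT")).reshape(-1, 1)

# No LabelEncoder step: the encoder is fit on the strings directly.
direct_encoder = OneHotEncoder()
onehot = direct_encoder.fit_transform(bases).toarray()  # default output is sparse

print(direct_encoder.categories_)
print(onehot)
```

If this runs, the LabelEncoder step is not needed for this kind of input.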

I got OneHotEncoder to work on one sequence at a time.
Whenever I pass an array of sequences, I get confusing errors.
Would I have to build X_train in a loop? That seems wrong.
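
One possible source of the confusion, sketched on toy data: OneHotEncoder reads rows as samples and columns as features. A 2-D block of equal-length sequences is therefore treated as one sample per sequence and one feature per position, not as one sample per base. The names `block` and `position_encoder` are illustrative, and this is only a guess at what goes wrong with the real data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two equal-length toy sequences stacked into a 2-D array of shape (2, 7).
block = np.array([list("ACCGTTT"), list("ACCGTAT")])

# Each of the 7 positions (columns) gets its own list of categories.
position_encoder = OneHotEncoder()
encoded_block = position_encoder.fit_transform(block).toarray()

print(block.shape)                        # (2, 7)
print(len(position_encoder.categories_))  # 7
print(encoded_block.shape)                # (2, 8)
```

The result has one row per sequence, and its columns depend on which letters happened to occur at each position. That is not the per-base encoding wanted here.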

Once this works, we must start again and save the length of each input sequence.
The lengths are required for the stratified sampling.
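
A minimal sketch of recording the lengths, assuming the sequences are available as Python strings. The toy sequences and the bin edges are hypothetical, and binning by length is only one way the stratification might be set up.

```python
import numpy as np

# Hypothetical toy sequences standing in for the real FASTA records.
toy_seqs = ["ACGT", "ACGTACGT", "AC", "ACGTAC"]

# Save the length of each sequence alongside the data.
lengths = np.array([len(s) for s in toy_seqs])

# Hypothetical bin edges: group sequences into length classes.
bin_edges = np.array([3, 5, 7])
length_bins = np.digitize(lengths, bin_edges)

print(lengths)      # [4 8 2 6]
print(length_bins)  # [1 3 0 2]
```

Labels like `length_bins` could then be passed to the `stratify` argument of `sklearn.model_selection.train_test_split`, provided every bin holds enough sequences.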

## OneHot on toy example

In [1]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse=False on older versions.
enc = OneHotEncoder(handle_unknown='ignore',sparse_output=False)
RNA1="ACCGTTT"
rna1=np.array(list(RNA1))
RNA2="ACCGTAT"
rna2=np.array(list(RNA2))
RNA=np.array([rna1,rna2])
rna=RNA.reshape(-1, 1)
rna

array([['A'],
       ['C'],
       ['C'],
       ['G'],
       ['T'],
       ['T'],
       ['T'],
       ['A'],
       ['C'],
       ['C'],
       ['G'],
       ['T'],
       ['A'],
       ['T']], dtype='<U1')

In [2]:
enc.fit(rna)
enc.categories_

[array(['A', 'C', 'G', 'T'], dtype='<U1')]

In [3]:
enc.transform(rna1.reshape(-1, 1))

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]])
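
The toy encoder is built with `handle_unknown='ignore'`. As a small illustration of what that setting does, the sketch below encodes a letter that was absent during fitting (the ambiguity code N is used as an example). The names `acgt_encoder` and `with_n` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the four expected letters only.
acgt_encoder = OneHotEncoder(handle_unknown='ignore')
acgt_encoder.fit(np.array(list("ACGT")).reshape(-1, 1))

# 'N' was not seen during fit; with handle_unknown='ignore' it becomes an all-zero row.
with_n = np.array(list("ACNT")).reshape(-1, 1)
encoded_n = acgt_encoder.transform(with_n).toarray()

print(encoded_n)
```

The third row of the result is all zeros, so an unknown base does not raise an error but carries no signal either.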

## OneHot on real data
Use the GENCODE FASTA files of protein-coding and non-coding RNA.
We preprocessed both files so that each sequence occupies exactly one line.

In [4]:
enc=None

def load_one_fasta(filename):
    # Return only the first qualifying sequence, as an array of shape (length, 1).
    # Assume the file was preprocessed to contain one line per sequence.
    MIN_SEQ_LEN=200
    MAX_SEQ_LEN=25000
    DEFLINE='>'
    chars=np.array([],dtype='<U1')  # stays empty if no line qualifies
    with open(filename,'r') as infile:
        for line in infile:
            # Note: len(line) still counts the trailing newline here.
            if line[0]!=DEFLINE and len(line)>=MIN_SEQ_LEN and len(line)<=MAX_SEQ_LEN:
                chars=np.array(list(line.rstrip()))
                break  # stop after the first qualifying sequence
    # reshape changes (any,) to (any,1)
    return chars.reshape(-1, 1)

ncfile='ncRNA.fasta' 
print("Load "+ncfile)
nc_seqs = load_one_fasta(ncfile)
pcfile='pcRNA.fasta' 
print("Load "+pcfile)
pc_seqs = load_one_fasta(pcfile)
nc_seqs

Load ncRNA.fasta
Load pcRNA.fasta


array([['T'],
       ['C'],
       ['A'],
       ['T'],
       ['C'],
       ['A'],
       ['G'],
       ['T'],
       ['C'],
       ['C'],
       ['A'],
       ['A'],
       ['A'],
       ['G'],
       ['T'],
       ['C'],
       ['C'],
       ['A'],
       ['G'],
       ['C'],
       ['A'],
       ['G'],
       ['T'],
       ['T'],
       ['G'],
       ['T'],
       ['C'],
       ['C'],
       ['C'],
       ['T'],
       ['C'],
       ['C'],
       ['T'],
       ['G'],
       ['G'],
       ['A'],
       ['A'],
       ['T'],
       ['C'],
       ['C'],
       ['G'],
       ['T'],
       ['T'],
       ['G'],
       ['G'],
       ['C'],
       ['T'],
       ['T'],
       ['G'],
       ['C'],
       ['C'],
       ['T'],
       ['C'],
       ['C'],
       ['G'],
       ['G'],
       ['C'],
       ['A'],
       ['T'],
       ['T'],
       ['T'],
       ['T'],
       ['T'],
       ['G'],
       ['G'],
       ['C'],
       ['C'],
       ['C'],
       ['T'],
       ['T'],
       ['G'],
       ...

In [5]:
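def load_fasta(filename):
    # Return all qualifying sequences, each as an array of shape (length, 1).
    # Assume the file was preprocessed to contain one line per sequence.
    MIN_SEQ_LEN=200
    MAX_SEQ_LEN=25000
    DEFLINE='>'
    seqs=[]
    with open(filename,'r') as infile:
        for line in infile:
            # Note: len(line) still counts the trailing newline here.
            if line[0]!=DEFLINE and len(line)>=MIN_SEQ_LEN and len(line)<=MAX_SEQ_LEN:
                chars=np.array(list(line.rstrip()))
                seqs.append(chars.reshape(-1, 1)) # reshape changes (any,) to (any,1)
    # The sequences differ in length, so the result is a 1-D object array of 2-D arrays.
    # Recent NumPy requires dtype=object for such ragged input; a plain list would also work.
    nparray=np.array(seqs,dtype=object)
    return nparray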

ncfile='ncRNA.fasta' 
print("Load "+ncfile)
nc_seqs = load_fasta(ncfile)
pcfile='pcRNA.fasta' 
print("Load "+pcfile)
pc_seqs = load_fasta(pcfile)
nc_seqs

Load ncRNA.fasta
Load pcRNA.fasta


array([array([['T'],
       ['C'],
       ['A'],
       ['T'],
       ['C'],
       ['A'],
       ['G'],
       ['T'],
       ['C'],
       ['C'],
       ['A'],
       ['A'],
       ['A'],
       ['G'],
       ['T'],
       ['C'],
       ['C'],
       ['A'],
       ['G'],
       ['C'],
       ['A'],
       ['G'],
       ['T'],
       ['T'],
       ['G'],
       ['T'],
       ['C'],
       ['C'],
       ['C'],
       ['T'],
       ['C'],
       ['C'],
       ['T'],
       ['G'],
       ['G'],
       ['A'],
       ['A'],
       ['T'],
       ['C'],
       ['C'],
       ['G'],
       ['T'],
       ['T'],
       ['G'],
       ['G'],
       ['C'],
       ['T'],
       ['T'],
       ['G'],
       ['C'],
       ['C'],
       ['T'],
       ['C'],
       ['C'],
       ['G'],
       ['G'],
       ['C'],
       ['A'],
       ['T'],
       ['T'],
       ['T'],
       ['T'],
       ['T'],
       ['G'],
       ['G'],
       ['C'],
       ['C'],
       ['C'],
       ['T'],
       ['T'],
       ['G'],
       ...

In [6]:
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse=False on older versions.
encoder = OneHotEncoder(handle_unknown='ignore',sparse_output=False)
print("Fit")
encoder.fit(nc_seqs[0])
encoder.categories_

Fit


[array(['A', 'C', 'G', 'T'], dtype='<U1')]

In [7]:
example = encoder.transform(nc_seqs[0]) # Works!
example
# Each base is one sample, encoded as four one-hot features (A, C, G, T).
# I wondered why the encoding uses float, not bool.
# The encoder's dtype parameter defaults to float64; pass another dtype to OneHotEncoder to change the output type.

array([[0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.]])
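
A short sketch of the `dtype` parameter, which controls the output type. It uses `np.int8` as an example of a compact type. The names `int_encoder` and `int_onehot` are illustrative, and `.toarray()` is again used to densify the default sparse output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

bases = np.array(list("ACGT")).reshape(-1, 1)

# Request 8-bit integers instead of the default float64.
int_encoder = OneHotEncoder(dtype=np.int8)
int_onehot = int_encoder.fit_transform(bases).toarray()

print(int_onehot.dtype)
print(int_onehot)
```

A smaller dtype may matter once thousands of long sequences are encoded, since each base becomes four numbers.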

In [8]:
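# This fails: nc_seqs is a 1-D array of variable-length arrays, not a 2-D table of samples and features.
# instances=encoder.transform(nc_seqs)
# Error: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
# This also fails: after the reshape, each element is still a whole array, not a single letter.
# instances=encoder.transform(nc_seqs.reshape(-1,1))
# Error: AttributeError: 'bool' object has no attribute 'any'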
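
One possible way to encode many sequences with a single `transform` call, sketched on toy data and not tested on the real files: concatenate all the (length, 1) columns, encode once, then split the result at the saved sequence boundaries. The toy sequences and the names `columns`, `stacked` and `per_seq` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy sequences of different lengths.
toy_seqs = ["ACCGTTT", "ACG", "TTGCA"]
columns = [np.array(list(s)).reshape(-1, 1) for s in toy_seqs]
lengths = [len(s) for s in toy_seqs]

batch_encoder = OneHotEncoder(handle_unknown='ignore')
batch_encoder.fit(np.array(list("ACGT")).reshape(-1, 1))

# Stack every base of every sequence into one (total_length, 1) column.
stacked = np.concatenate(columns, axis=0)
encoded = batch_encoder.transform(stacked).toarray()

# Cut the encoded rows back into one array per sequence.
boundaries = np.cumsum(lengths)[:-1]
per_seq = np.split(encoded, boundaries, axis=0)

print([a.shape for a in per_seq])  # [(7, 4), (3, 4), (5, 4)]
```

The result is still a list of arrays with different lengths, so this only removes the per-sequence calls to the encoder. It does not produce a rectangular block.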


In [12]:
print("non-coding")
nc_all=[]
for seq in nc_seqs:
    encoded=encoder.transform(seq)
    nc_all.append(encoded)
print("protein-coding")
pc_all=[]
for seq in pc_seqs:
    encoded=encoder.transform(seq)
    pc_all.append(encoded)

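# The encoded sequences differ in length, so recent NumPy requires dtype=object here.
nc_all=np.array(nc_all,dtype=object).reshape(-1,1)
pc_all=np.array(pc_all,dtype=object).reshape(-1,1)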
nc_all.shape,pc_all.shape

non-coding
protein-coding


((17711, 1), (20152, 1))

## Conclusion
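We have shown how to one-hot encode a single RNA sequence.
We have not been able to encode a block of sequences in one call, because sequences of different lengths do not form a rectangular array.
In this notebook, the only approach that worked was to encode the sequences one at a time in a loop.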