In [1]:
import sys
sys.path.append(r'/home/martinha/propythia/propythia/src/propythia/')
sys.path.append(r'/home/martinha/propythia/propythia/src/')

# DNA physicochemical descriptors and encodings 
This Jupyter notebook demonstrates how to obtain DNA sequence-based features and DNA encodings with Propythia.

All the DNA notebooks use the dataset from the tutorial linked to the manuscript A Primer on Deep Learning in Genomics (Nature Genetics, 2018) by James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani & Amalio Telenti.

tutorial: https://colab.research.google.com/drive/17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr#scrollTo=eiiwjw4yhX0P

paper: https://www.nature.com/articles/s41588-018-0295-5

This notebook explains how to read a DNA sequence and calculate DNA descriptors or DNA encodings. These steps are necessary before applying either ML or DL strategies. The notebook includes:

- Data reading and validation
- DNA Descriptors
- Encoders


In [3]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import json

# from propythia.dna.descriptors import ProteinDescritors
# from propythia.dna.encoding import Encoding

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Data reading and validation

The first step is to get the data.

This module comprises functions to read and validate DNA sequences.

First, it is necessary to create a `ReadDNA` object.


In [5]:
from propythia.dna.sequence import ReadDNA
reader = ReadDNA()

Sequence objects can be created from a single DNA sequence, a *CSV* file, or a *FASTA* file. A single sequence is validated (all letters must belong to the DNA alphabet) and returned in upper case.

In [6]:
data = reader.read_sequence("ACGTACGAGCATGCAT")
print(data)

ACGTACGAGCATGCAT
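
The validation step described above can be pictured with a small stand-alone sketch. This is not the `ReadDNA` implementation: the helper name `validate_dna` and the four-letter alphabet `ACGT` are assumptions made purely for illustration.

```python
def validate_dna(sequence: str, alphabet: str = "ACGT") -> str:
    """Return the sequence in upper case, or raise if it has non-DNA letters.

    Hypothetical helper: the name and the default alphabet are assumptions,
    not part of the Propythia API.
    """
    cleaned = sequence.strip().upper()
    if not cleaned:
        raise ValueError("empty sequence")
    # any character outside the alphabet makes the sequence invalid
    invalid = sorted(set(cleaned) - set(alphabet))
    if invalid:
        raise ValueError(f"invalid characters: {invalid}")
    return cleaned


print(validate_dna("acgtACGAgcatgcat"))
```

Upper-casing before the check means that mixed-case input is accepted, while anything outside the alphabet is rejected early instead of silently producing wrong features later.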


With *CSV*, the file must contain at least a column named 'sequence'. The labels may also be retrieved and validated: to do so, the user must set the `with_labels` parameter to **True**, and the column with the labels must be named 'label'.

In [7]:
filename = "primer/dataset.csv"

data = reader.read_csv(filename, with_labels=False)
print(data.head())
print(data.shape)

print("-" * 100)

data = reader.read_csv(filename, with_labels=True)
print(data.head())
print(data.shape)

                                            sequence
0  CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA...
1  GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC...
2  GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA...
3  GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC...
4  GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA...
(2000, 1)
----------------------------------------------------------------------------------------------------
                                            sequence  label
0  CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA...      0
1  GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC...      0
2  GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA...      0
3  GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC...      1
4  GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA...      1
(2000, 2)
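
To make the column requirements concrete, here is a rough standard-library sketch of the same idea. It is not how `read_csv` is implemented in Propythia; the function name `read_sequences_csv` and the returned list-of-dicts structure are assumptions for illustration.

```python
import csv
import io


def read_sequences_csv(handle, with_labels=False):
    """Read rows from a CSV with a 'sequence' column and, optionally, 'label'.

    Hypothetical helper, not the Propythia reader.
    """
    reader = csv.DictReader(handle)
    required = ["sequence"] + (["label"] if with_labels else [])
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing column(s): {missing}")
    rows = []
    for row in reader:
        record = {"sequence": row["sequence"].strip().upper()}
        if with_labels:
            # labels are kept as the raw strings found in the file
            record["label"] = row["label"]
        rows.append(record)
    return rows


example = io.StringIO("sequence,label\nacgt,0\nTTGA,1\n")
print(read_sequences_csv(example, with_labels=True))
```

Checking the header before iterating gives a clear error when the 'sequence' or 'label' column is missing, rather than a `KeyError` on the first row.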


Reading a *FASTA* file works like reading a *CSV* file: the sequences are always read, and the labels only if the user asks for them. The *FASTA* file must follow one of the two layouts below (without and with labels, respectively):

```
>sequence_id1
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG...
>sequence_id2
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG...
``` 

```
>sequence_id1,label1
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG...
>sequence_id2,label2
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG...
``` 
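
The two header layouts above can be parsed with a few lines of plain Python. The sketch below is only illustrative: `parse_fasta` is a hypothetical name, and splitting the header on its last comma is an assumption about how the label is separated from the identifier.

```python
def parse_fasta(text, with_labels=False):
    """Parse FASTA text whose headers look like '>id' or '>id,label'.

    Hypothetical helper, not the Propythia reader.
    """
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            current = {"sequence": ""}
            if with_labels:
                # assumption: the label is whatever follows the last comma
                parts = header.rsplit(",", 1)
                if len(parts) != 2:
                    raise ValueError(f"no label in header: {line!r}")
                current["label"] = parts[1].strip()
            records.append(current)
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            # FASTA sequences may wrap over several lines; join them
            current["sequence"] += line.upper()
    return records


fasta_text = ">sequence_id1,0\nACTGACTG\nactg\n>sequence_id2,1\nTTGACCA\n"
print(parse_fasta(fasta_text, with_labels=True))
```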

In [8]:
filename = "primer/example.fasta"
data = reader.read_fasta(filename, with_labels=False)
print(data.head())
print(data.shape)

print("-" * 100)

data = reader.read_fasta(filename, with_labels=True)
print(data.head())
print(data.shape)

                                            sequence
0  CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA...
1  GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC...
2  GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA...
3  GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC...
4  GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA...
(19, 1)
----------------------------------------------------------------------------------------------------
                                            sequence  label
0  CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA...      0
1  GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC...      0
2  GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA...      0
3  GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC...      1
4  GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA...      1
(19, 2)


## 2. DNA Descriptors

Descriptors are hand-crafted values calculated from the sequence that serve as features for the classification model. 
This module comprises functions to compute different types of DNA descriptors. It receives a sequence object (from the previous module) and returns a dictionary mapping each feature name to its value. The user can calculate individual descriptors or all of them at once. The user can also define the physicochemical indices used by the autocorrelation descriptors, either to replace the default values or to add new ones. 




There are a total of 17 implemented DNA descriptors. They are listed below:

<table>
<thead>
  <tr>
    <th>Group</th>
    <th>Name</th>
    <th>Output type</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td rowspan="3">Physicochemical</td>
    <td>length</td>
    <td>int</td>
  </tr>
  <tr>
    <td>gc_content</td>
    <td>float</td>
  </tr>
  <tr>
    <td>at_content</td>
    <td>float</td>
  </tr>
  <tr>
    <td rowspan="6">Nucleic Acid Composition</td>
    <td>nucleic_acid_composition</td>
    <td>dict</td>
  </tr>
  <tr>
    <td>dinucleotide_composition</td>
    <td>dict</td>
  </tr>
  <tr>
    <td>trinucleotide_composition</td>
    <td>dict</td>
  </tr>
  <tr>
    <td>k_spaced_nucleic_acid_pairs</td>
    <td>dict</td>
  </tr>
  <tr>
    <td>kmer</td>
    <td>dict</td>
  </tr>
  <tr>
    <td>accumulated_nucleotide_frequency</td>
    <td>list of dict</td>
  </tr>
  <tr>
    <td rowspan="6">Autocorrelation and Cross Covariance</td>
    <td>DAC</td>
    <td>list</td>
  </tr>
  <tr>
    <td>DCC</td>
    <td>list</td>
  </tr>
  <tr>
    <td>DACC</td>
    <td>list</td>
  </tr>
  <tr>
    <td>TAC</td>
    <td>list</td>
  </tr>
  <tr>
    <td>TCC</td>
    <td>list</td>
  </tr>
  <tr>
    <td>TACC</td>
    <td>list</td>
  </tr>
  <tr>
    <td rowspan="2">Pseudo Nucleic Acid Composition</td>
    <td>PseDNC</td>
    <td>dict</td>
  </tr>
  <tr>
    <td>PseKNC</td>
    <td>dict</td>
  </tr>
</tbody>
</table>

As mentioned above, the user can calculate all descriptors or individual descriptors. To calculate individual descriptors, the user must specify the name/names of the descriptor/descriptors in the `descriptor_list` parameter. If this parameter is not specified, the user will calculate all descriptors.

To calculate a single descriptor, called 'nucleic_acid_composition', for a single sequence, the user must do:

In [9]:
reader = ReadDNA()
data = reader.read_sequence("ACGTACGAGCATGCAT")

from propythia.dna.descriptors import DNADescriptor
calculator = DNADescriptor(data)

descriptor_list = ['nucleic_acid_composition']
result = calculator.get_descriptors(descriptor_list)
print(json.dumps(result, indent=4))

{
    "nucleic_acid_composition": {
        "A": 0.312,
        "C": 0.25,
        "G": 0.25,
        "T": 0.188
    }
}
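
To see where these numbers come from, the composition can be recomputed by hand: each value is the count of a k-mer divided by the number of k-mer positions in the sequence, rounded to three decimals. The sketch below is a minimal re-derivation, not the `DNADescriptor` code; the name `kmer_frequencies` and the `ACGT` alphabet are assumptions.

```python
from itertools import product


def kmer_frequencies(sequence, k=1, alphabet="ACGT"):
    """Relative frequency of every possible k-mer, rounded to 3 decimals.

    Hypothetical helper. Assumes the sequence was already validated, so
    every letter belongs to the alphabet.
    """
    total = len(sequence) - k + 1  # number of overlapping windows
    if total < 1:
        raise ValueError("k must not exceed the sequence length")
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for start in range(total):
        counts[sequence[start:start + k]] += 1
    return {kmer: round(n / total, 3) for kmer, n in counts.items()}


seq = "ACGTACGAGCATGCAT"
print(kmer_frequencies(seq, k=1))        # single-nucleotide composition
print(kmer_frequencies(seq, k=2)["AC"])  # one dinucleotide frequency
```

For this sequence, `A` occurs 5 times in 16 positions (5/16 = 0.3125, shown as 0.312), and with `k=2` the dinucleotide `AC` occurs twice in 15 windows (0.133).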


To calculate all of them, the user must pass an empty `descriptor_list`.

In [10]:
descriptor_list = []
result = calculator.get_descriptors(descriptor_list)
for key, val in result.items():
    print(key, val)
    print("-" * 100)

length 16
----------------------------------------------------------------------------------------------------
gc_content 0.5
----------------------------------------------------------------------------------------------------
at_content 0.5
----------------------------------------------------------------------------------------------------
nucleic_acid_composition {'A': 0.312, 'C': 0.25, 'G': 0.25, 'T': 0.188}
----------------------------------------------------------------------------------------------------
dinucleotide_composition {'AA': 0.0, 'AC': 0.133, 'AG': 0.067, 'AT': 0.133, 'CA': 0.133, 'CC': 0.0, 'CG': 0.133, 'CT': 0.0, 'GA': 0.067, 'GC': 0.133, 'GG': 0.0, 'GT': 0.067, 'TA': 0.067, 'TC': 0.0, 'TG': 0.067, 'TT': 0.0}
----------------------------------------------------------------------------------------------------
trinucleotide_composition {'AAA': 0.0, 'AAC': 0.0, 'AAG': 0.0, 'AAT': 0.0, 'ACA': 0.0, 'ACC': 0.0, 'ACG': 0.143, 'ACT': 0.0, 'AGA': 0.0, 'AGC': 0.071, 'AGG': 0.0

It is also possible to calculate the descriptors for *CSV* and *FASTA* files, which contain a list of sequences.

In [11]:
reader = ReadDNA()
filename = 'primer/dataset.csv'
data = reader.read_csv(filename=filename, with_labels=True)

# get the sequences from the dataframe
sequences = data['sequence'].to_list()

# specify the descriptor list
descriptor_list = ['nucleic_acid_composition']

# only for the first 10 sequences
for i in range(10):
    sequence = sequences[i]
    calculator = DNADescriptor(sequence)
    
    print(sequence)
    print(calculator.get_descriptors(descriptor_list))
    print("-" * 100)

CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGACACC
{'nucleic_acid_composition': {'A': 0.18, 'C': 0.26, 'G': 0.34, 'T': 0.22}}
----------------------------------------------------------------------------------------------------
GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGCGTCG
{'nucleic_acid_composition': {'A': 0.12, 'C': 0.16, 'G': 0.32, 'T': 0.4}}
----------------------------------------------------------------------------------------------------
GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTAGCCT
{'nucleic_acid_composition': {'A': 0.34, 'C': 0.26, 'G': 0.28, 'T': 0.12}}
----------------------------------------------------------------------------------------------------
GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGCCCTG
{'nucleic_acid_composition': {'A': 0.24, 'C': 0.42, 'G': 0.22, 'T': 0.12}}
----------------------------------------------------------------------------------------------------
GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATATGGT
{'nucleic_acid_composition': {'A': 0.28, '

### 2.1. Descriptors processing


So far we have seen how to read and validate DNA sequences. We've also seen how to calculate descriptors from a single sequence or multiple sequences. Now, we can use the descriptors to train a model.

However, as seen above, when calculating the descriptors for multiple sequences, the result is a list of dictionaries and each dictionary holds the calculated descriptors for a single sequence. So, the next step is to convert this data structure to a dataframe.

We can directly convert the list of dictionaries to a dataframe using the `pd.DataFrame()` function. The result of this step would be similar to the following:

(considering only the first few columns)

| sequence | length | gc_content | at_content | nucleic_acid_composition                         | ...
|----------|--------|------------|------------|--------------------------------------------------|---
| ACTGCGAT | 8      | 0.5        | 0.5        | {'A': 0.25, 'C': 0.25, 'T': 0.25, 'G': 0.25}     | ...
| TTGTTACT | 8      | 0.25       | 0.75       | {'A': 0.125, 'C': 0.125, 'T': 0.625, 'G': 0.125} | ...
| ...      | ...    | ...        | ...        | ...                                              | ...

As we can see, some of the descriptors are not numerical values (e.g. 'nucleic_acid_composition'). Descriptors that produce dictionaries or lists still need to be normalized, because the model cannot process data in those forms.

To normalize the data, dicts and lists need to "explode" into more columns. 

E.g. dicts:

| descriptor_hello |
| ---------------- |
| {'a': 1, 'b': 2} |

will be transformed into:

| descriptor_hello_a | descriptor_hello_b |
| ------------------ | ------------------ |
| 1                  | 2                  |

E.g. lists:

| descriptor_hello |
| ---------------- |
| [1, 2, 3]        |

will be transformed into:

| descriptor_hello_0 | descriptor_hello_1 | descriptor_hello_2 |
| ------------------ | ------------------ | ------------------ |
| 1                  | 2                  | 3                  |
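
The "explode" step can be sketched as a small recursive flattening function. This is an illustration of the idea only, not the code in `calculate_features.py`; the name `flatten_descriptors` and the `_` separator are assumptions (the separator was chosen to match the column names in the tables above).

```python
def flatten_descriptors(descriptors, sep="_"):
    """Expand dict- and list-valued descriptors into one entry per value.

    Hypothetical helper. Dict keys and list indices are appended to the
    descriptor name, so nested structures (e.g. a list of dicts) work too.
    """
    flat = {}

    def _walk(prefix, value):
        if isinstance(value, dict):
            for key, item in value.items():
                _walk(f"{prefix}{sep}{key}", item)
        elif isinstance(value, (list, tuple)):
            for index, item in enumerate(value):
                _walk(f"{prefix}{sep}{index}", item)
        else:
            # scalar value: this becomes one column
            flat[prefix] = value

    for name, value in descriptors.items():
        _walk(name, value)
    return flat


row = {
    "length": 8,
    "nucleic_acid_composition": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "descriptor_hello": [1, 2, 3],
}
print(flatten_descriptors(row))
```

A list of such flattened dictionaries, one per sequence, can then be passed to `pd.DataFrame()` to obtain a purely numerical table.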

The `calculate_and_normalize` function will be used to calculate the descriptors and normalize them. It can be found in the `calculate_features.py` file.

In [12]:
reader = ReadDNA()
filename = 'primer/dataset.csv'
data = reader.read_csv(filename=filename, with_labels=True)

# specify the descriptor list
descriptor_list = []

from propythia.dna.calculate_features import calculate_and_normalize
fps_x, fps_y = calculate_and_normalize(data)

fps_x

0 / 2000
100 / 2000
200 / 2000
300 / 2000
400 / 2000
500 / 2000
600 / 2000
700 / 2000
800 / 2000
900 / 2000
1000 / 2000
1100 / 2000
1200 / 2000
1300 / 2000
1400 / 2000
1500 / 2000
1600 / 2000
1700 / 2000
1800 / 2000
1900 / 2000
Done!


Unnamed: 0,length,gc_content,at_content,nucleic_acid_composition_A,nucleic_acid_composition_C,nucleic_acid_composition_G,nucleic_acid_composition_T,dinucleotide_composition_AA,dinucleotide_composition_AC,dinucleotide_composition_AG,...,accumulated_nucleotide_frequency_0_G,accumulated_nucleotide_frequency_0_T,accumulated_nucleotide_frequency_1_A,accumulated_nucleotide_frequency_1_C,accumulated_nucleotide_frequency_1_G,accumulated_nucleotide_frequency_1_T,accumulated_nucleotide_frequency_2_A,accumulated_nucleotide_frequency_2_C,accumulated_nucleotide_frequency_2_G,accumulated_nucleotide_frequency_2_T
0,50,0.60,0.40,0.18,0.26,0.34,0.22,0.041,0.061,0.061,...,0.462,0.154,0.20,0.12,0.40,0.28,0.184,0.184,0.368,0.263
1,50,0.48,0.52,0.12,0.16,0.32,0.40,0.000,0.020,0.061,...,0.308,0.385,0.20,0.16,0.36,0.28,0.158,0.132,0.316,0.395
2,50,0.54,0.46,0.34,0.26,0.28,0.12,0.082,0.061,0.163,...,0.385,0.154,0.44,0.12,0.36,0.08,0.368,0.263,0.263,0.105
3,50,0.64,0.36,0.24,0.42,0.22,0.12,0.020,0.143,0.082,...,0.231,0.077,0.24,0.44,0.16,0.16,0.237,0.421,0.211,0.132
4,50,0.54,0.46,0.28,0.26,0.28,0.18,0.082,0.102,0.041,...,0.308,0.077,0.32,0.36,0.20,0.12,0.289,0.342,0.211,0.158
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,50,0.66,0.34,0.16,0.26,0.40,0.18,0.000,0.020,0.082,...,0.538,0.154,0.12,0.20,0.52,0.16,0.184,0.211,0.421,0.184
1996,50,0.44,0.56,0.22,0.22,0.22,0.34,0.041,0.061,0.041,...,0.308,0.308,0.20,0.12,0.28,0.40,0.237,0.211,0.263,0.289
1997,50,0.46,0.54,0.18,0.24,0.22,0.36,0.020,0.061,0.041,...,0.154,0.231,0.16,0.36,0.16,0.32,0.132,0.263,0.211,0.395
1998,50,0.48,0.52,0.28,0.24,0.24,0.24,0.102,0.061,0.041,...,0.462,0.077,0.24,0.32,0.28,0.16,0.237,0.289,0.289,0.184


The resulting dataframe contains all calculated descriptors for the input dataset. It now has 247 columns instead of just 17, because every dict- or list-valued descriptor has been expanded into one column per value. The data is now in a form that the model can use. Note also that, regardless of the length of the sequences, the final dataframe always has the same number of columns, since each implemented descriptor always produces the same number of values.


Keep in mind that the features may also need to be scaled before they are used to train a model.
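
As an illustration of what scaling means here, the sketch below standardizes each column to zero mean and unit variance (a z-score), which is the transformation that scikit-learn's `StandardScaler` applies. The function name `standardize_columns` and the plain list-of-lists input are assumptions chosen to keep the example dependency-free.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Z-score each column of a list of equal-length numeric rows.

    Hypothetical helper for illustration only.
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # population standard deviation; constant columns get a divisor of 1
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# two toy rows with the columns (length, gc_content)
print(standardize_columns([[50, 0.60], [50, 0.40]]))
```

Columns such as `length`, which is constant in this dataset, carry no information after scaling: every value becomes 0.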


## 3. Encoders

Deep learning models automatically extract features from the sequences, but the sequences must first be given a numerical representation, because models cannot handle anything other than numerical values. Encoders are cheap to compute and provide such representations, which can then be used as model input.

This module comprises functions to encode DNA sequences. The encoding step converts each sequence into numerical values in order to build the input matrix for the model. The implemented encoders are:

- One-hot encoding
- Chemical encoding
- K-mer One-hot encoding

Below there's an example for each of them.

| Encoder             | Sequence | Encoded sequence                             |
| ------------------- | -------- | -------------------------------------------- |
| One-Hot             | ACGT     | [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]] |
| Chemical            | ACGT     | [[1,1,1], [0,0,1], [1,0,0], [0,1,0]]         |
| K-mer One-Hot (k=2) | ACGT     | [[0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0], [0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]] |

### 3.1. One-hot encoding

One-hot encoding is extensively used in deep learning models and is well suited for most of them. It is a simple encoding that converts each letter of the DNA alphabet into a binary vector.

- A -> [1,0,0,0]
- C -> [0,1,0,0]
- G -> [0,0,1,0]
- T -> [0,0,0,1]

To encode a sequence, we first need to create a `DNAEncoder` object.

In [13]:
from propythia.dna.encoding import DNAEncoder
encoder = DNAEncoder('ACGTACGAGCATGCAT')

Now, we only need to call the method of the desired encoder (one-hot, chemical, or k-mer one-hot).

In [14]:
encoded_sequence = encoder.one_hot_encode()
print(encoded_sequence)

[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]
 [0 0 1 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]
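
The mapping itself is simple enough to write out by hand. The following sketch is not the `DNAEncoder` implementation; the names `ONE_HOT`, `one_hot_encode` and `one_hot_encode_many` are hypothetical, and nested lists are used instead of arrays to keep the example dependency-free.

```python
# lookup table matching the A/C/G/T mapping listed above
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence):
    """Encode one sequence as a list of 4-element binary vectors."""
    return [list(ONE_HOT[base]) for base in sequence.upper()]


def one_hot_encode_many(sequences):
    """Encode several sequences; they must share one length to form a matrix."""
    lengths = {len(s) for s in sequences}
    if len(lengths) > 1:
        raise ValueError(f"sequences have different lengths: {sorted(lengths)}")
    return [one_hot_encode(s) for s in sequences]


print(one_hot_encode("ACGT"))
```

A sequence of length L therefore becomes an L x 4 matrix, and N equal-length sequences become an N x L x 4 structure.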


### 3.2. Chemical encoding

The chemical encoding is a more complex encoding that uses the chemical properties of the DNA alphabet. Each letter is assigned a chemical property and the chemical properties are combined to create a vector. In a nutshell, the chemical properties are:

<table>
  <thead>
    <tr>
      <th>Chemical property</th>
      <th>Class</th>
      <th>Nucleotides</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2">Ring structure</td>
      <td>Purine</td>
      <td>A, G</td>
    </tr>
    <tr>
      <td>Pyrimidine</td>
      <td>C, T</td>
    </tr>
    <tr>
      <td rowspan="2">Hydrogen bond</td>
      <td>Weak</td>
      <td>A, T</td>
    </tr>
    <tr>
      <td>Strong</td>
      <td>C, G</td>
    </tr>
    <tr>
      <td rowspan="2">Functional group</td>
      <td>Amino</td>
      <td>A, C</td>
    </tr>
    <tr>
      <td>Keto</td>
      <td>G, T</td>
    </tr>
  </tbody>
</table>

For each property, a nucleotide in the first class listed (purine, weak, amino) is assigned the value 1, and a nucleotide in the second class (pyrimidine, strong, keto) is assigned the value 0:

- A -> [1, 1, 1]
- C -> [0, 0, 1]
- G -> [1, 0, 0]
- T -> [0, 1, 0]

The encoder object is already created so we just need to specify the encoder method.

In [15]:
encoded_sequence = encoder.chemical_encode()
print(encoded_sequence)

[[1 1 1]
 [0 0 1]
 [1 0 0]
 [0 1 0]
 [1 1 1]
 [0 0 1]
 [1 0 0]
 [1 1 1]
 [1 0 0]
 [0 0 1]
 [1 1 1]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [1 1 1]
 [0 1 0]]
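
The three binary properties from the table can be turned into code directly. This is a minimal sketch under the assumption that the vector order is (ring structure, hydrogen bond, functional group), as in the mapping listed above; the names are hypothetical and this is not the `DNAEncoder` implementation.

```python
PURINES = set("AG")  # ring structure: purine -> 1, pyrimidine -> 0
WEAK = set("AT")     # hydrogen bond: weak -> 1, strong -> 0
AMINO = set("AC")    # functional group: amino -> 1, keto -> 0


def chemical_encode(sequence):
    """Encode each nucleotide as [is_purine, is_weak, is_amino]."""
    encoded = []
    for base in sequence.upper():
        if base not in "ACGT":
            raise ValueError(f"invalid nucleotide: {base!r}")
        encoded.append([int(base in PURINES), int(base in WEAK), int(base in AMINO)])
    return encoded


print(chemical_encode("ACGT"))
```

Each nucleotide gets a distinct 3-bit code, so the encoding is as unambiguous as one-hot while using three columns instead of four.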


### 3.3. K-mer One-hot encoding

Plain one-hot encoding of a DNA sequence preserves only the identity and position of each individual nucleotide. Recent investigations, however, have shown that including higher-order dependencies among nucleotides may improve the performance of DNA models. K-mer one-hot encoding addresses this by encoding overlapping k-mers instead of single nucleotides.

If k = 1, the encoder will create the same vector as the one-hot encoding.

If k = 2, 16 dinucleotides will be created, and the encoder will create a vector with the following values:

- AA = [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
- AC = [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
- AG = [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]
- ...
- TT = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]

If k = 3, 64 trinucleotides will be created, and the encoder will create a vector with the following values:

- AAA = [1,0,0,0,...,0,0,0,0]
- AAC = [0,1,0,0,...,0,0,0,0]
- ...
- TTT = [0,0,0,0,...,0,0,0,1]

The value of k can be any positive integer less than or equal to the length of the sequence; as noted above, k = 1 reproduces the plain one-hot encoding.
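
The indexing scheme described above can be sketched as follows: each k-mer is read as a number in base |alphabet|, and that number is the position of the single 1 in its vector. This is an illustrative re-implementation, not the library code; the function signature and the `alphabet` parameter are assumptions.

```python
def kmer_one_hot_encode(sequence, k=2, alphabet="ACGT"):
    """One-hot encode every overlapping k-mer of the sequence.

    Hypothetical helper. Each row has len(alphabet) ** k entries with a
    single 1 at the k-mer's index in lexicographic alphabet order.
    """
    if not 1 <= k <= len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    base = len(alphabet)
    position = {letter: i for i, letter in enumerate(alphabet)}
    encoded = []
    for start in range(len(sequence) - k + 1):
        index = 0
        for letter in sequence[start:start + k]:
            # treat the k-mer as a number written in base len(alphabet)
            index = index * base + position[letter]
        row = [0] * (base ** k)
        row[index] = 1
        encoded.append(row)
    return encoded


for row in kmer_one_hot_encode("ACGT", k=2):
    print(row)
```

With the four-letter alphabet, `ACGT` yields the dinucleotides AC, CG and GT at indices 1, 6 and 11 of 16-element vectors, as in the overview table. The output of the notebook cell that follows has 25 columns per row, with the 1s at indices 1, 7 and 13 for the same dinucleotides. That pattern is what this sketch produces with a five-symbol alphabet (for example `alphabet="ACGTN"`), which suggests, though the notebook does not state it, that the library reserves a fifth symbol.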

In [16]:
encoded_sequence = encoder.kmer_one_hot_encode(k=2)
encoded_sequence

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 

This module also allows the user to encode multiple sequences at once. The encoder can receive a column of a dataframe full of sequences and return an array of all encoded sequences.

In [17]:
df = pd.DataFrame(
    [
        ['CGACGATGCAT', 1], 
        ['CGAAGGTGTAC', 0], 
        ['AGTAGGGGTAA', 1]
    ], 
    columns=['sequence', 'labels']
)

column = df['sequence'].values
encoder = DNAEncoder(column)
encoded_sequences = encoder.one_hot_encode()
print(encoded_sequences)

[[[0 1 0 0]
  [0 0 1 0]
  [1 0 0 0]
  [0 1 0 0]
  [0 0 1 0]
  [1 0 0 0]
  [0 0 0 1]
  [0 0 1 0]
  [0 1 0 0]
  [1 0 0 0]
  [0 0 0 1]]

 [[0 1 0 0]
  [0 0 1 0]
  [1 0 0 0]
  [1 0 0 0]
  [0 0 1 0]
  [0 0 1 0]
  [0 0 0 1]
  [0 0 1 0]
  [0 0 0 1]
  [1 0 0 0]
  [0 1 0 0]]

 [[1 0 0 0]
  [0 0 1 0]
  [0 0 0 1]
  [1 0 0 0]
  [0 0 1 0]
  [0 0 1 0]
  [0 0 1 0]
  [0 0 1 0]
  [0 0 0 1]
  [1 0 0 0]
  [1 0 0 0]]]




So, at this point, the user can choose to use either encoders or descriptors to proceed to the next step.

Using descriptors it would be something like:

In [18]:
reader = ReadDNA()
data = reader.read_csv(filename='primer/dataset.csv', with_labels=True)

from propythia.dna.calculate_features import calculate_and_normalize
from sklearn.preprocessing import StandardScaler

fps_x, fps_y = calculate_and_normalize(data)

scaler = StandardScaler().fit(fps_x)
fps_x = scaler.transform(fps_x)
fps_y = fps_y.to_numpy()
print(fps_x.shape)

0 / 2000
100 / 2000
200 / 2000
300 / 2000
400 / 2000
500 / 2000
600 / 2000
700 / 2000
800 / 2000
900 / 2000
1000 / 2000
1100 / 2000
1200 / 2000
1300 / 2000
1400 / 2000
1500 / 2000
1600 / 2000
1700 / 2000
1800 / 2000
1900 / 2000
Done!
(2000, 247)


Using encodings it would be something like:

In [19]:
reader = ReadDNA()
data = reader.read_csv(filename='primer/dataset.csv', with_labels=True)

fps_x = data['sequence'].values
fps_y = data['label'].values

# choosing one hot encoding
encoder = DNAEncoder(fps_x)
fps_x = encoder.one_hot_encode()
print(fps_x.shape)

(2000, 50, 4)


Now, with either physicochemical descriptors or encodings, one can move on to ML and DL models. 
With physicochemical descriptors, feature selection or dimensionality reduction may be needed.
Please check the corresponding notebooks. 