A Python module written in Rust to encode DNA sequences for machine learning

Mvila035/dna_parser

dna_parser

dna-parser is a Python library written in Rust to encode (or perform feature extraction on) DNA/RNA sequences for machine learning.

Table of contents

  1. Install
  2. Usage
    1. Loading Fasta Files
    2. Encodings
    3. Other Functions

Install

To install dna-parser simply run:

pip install dna-parser

If no Python wheel is available for your OS, install Rust and then re-install dna-parser, which should now compile on your machine. Run the following command on a Unix-like OS to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

or see more options at https://www.rust-lang.org/tools/install.

Usage

import dna_parser

Loading Fasta Files

# load both metadata and sequences in tuples (metadata, sequences)
metadata_and_sequences = dna_parser.load_fasta("path/to/fasta/file")

# load sequences only
sequences = dna_parser.seq_from_fasta("path/to/fasta/file")

# load metadata only
metadata = dna_parser.metadata_from_fasta("path/to/fasta/file")
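
To make the (metadata, sequence) pairing concrete, here is a minimal pure-Python sketch of a FASTA reader. It is not dna-parser's implementation; `parse_fasta` is a hypothetical helper, and the assumption that one tuple is produced per record is ours.

```python
import io

def parse_fasta(handle):
    """Return a list of (metadata, sequence) tuples from a FASTA text stream."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # sequence lines may be wrapped, so collect and join them
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# In-memory FASTA text stands in for a file on disk.
example = io.StringIO(">seq1 sample A\nACGT\nTTGA\n>seq2 sample B\nGGCC\n")
records = parse_fasta(example)
metadata = [m for m, _ in records]
sequences = [s for _, s in records]
```

Splitting the records into separate `metadata` and `sequences` lists mirrors the three loading functions listed above.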

Encodings

Currently, only ordinal encoding, one-hot encoding, cross encoding and Term Frequency Inverse Document Frequency (TF-IDF) are supported.

Ordinal Encoding

Nucleotides are currently encoded as follows:

  • A= 0.25
  • C= 0.50
  • G= 0.75
  • T/U= 1.0
  • Other characters or gaps = 0

# returns a list of 1D numpy arrays representing the encoding
encoding = dna_parser.ordinal_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
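
The mapping and the `pad_type` / `pad_length` rules above can be sketched in plain Python. This is an illustration only, not the library's code: `resolve_length`, `fit` and `ordinal_sketch` are hypothetical names, padding with the gap value 0 is an assumption, and so is the choice that trimming "before" drops characters from the start of the sequence.

```python
ORDINAL = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.0, "U": 1.0}

def resolve_length(sequences, pad_length):
    """Translate pad_length into a target length (None means leave as-is)."""
    if pad_length == -2:
        return max(len(s) for s in sequences)  # pad to the longest sequence
    if pad_length == -1:
        return min(len(s) for s in sequences)  # trim to the shortest sequence
    if pad_length == 0:
        return None                            # no padding
    return pad_length                          # fixed length

def fit(values, target, pad_type="after", fill=0.0):
    """Pad or trim a list of values to the target length (assumed semantics)."""
    if target is None:
        return values
    if len(values) >= target:
        # assumption: "after" trims the end, "before" trims the start
        return values[:target] if pad_type == "after" else values[len(values) - target:]
    padding = [fill] * (target - len(values))
    return values + padding if pad_type == "after" else padding + values

def ordinal_sketch(sequences, pad_type="after", pad_length=0):
    # Case handling is not documented here; this sketch only recognizes
    # uppercase letters and maps everything else to 0.
    target = resolve_length(sequences, pad_length)
    return [fit([ORDINAL.get(ch, 0.0) for ch in s], target, pad_type) for s in sequences]

padded = ordinal_sketch(["ACGT", "AC-"], pad_type="after", pad_length=-2)
trimmed = ordinal_sketch(["ACGT", "GU"], pad_type="after", pad_length=-1)
```

The library itself returns NumPy arrays; plain lists are used here to keep the sketch dependency-free.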

OneHot Encoding

Nucleotides are currently encoded as follows:

  • A= [1,0,0,0]
  • C= [0,1,0,0]
  • G= [0,0,1,0]
  • T/U= [0,0,0,1]
  • Other characters or gaps = [0,0,0,0]

# returns a list of 2D numpy arrays representing the encoding
encoding = dna_parser.onehot_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
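
As a reference for the one-hot table above, the following pure-Python sketch encodes a single sequence without padding. `onehot_sketch` is a hypothetical helper, and the row-per-nucleotide layout (length × 4) is an assumption; check the shape of the arrays the library actually returns.

```python
ONEHOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
}

def onehot_sketch(seq):
    """Return one 4-element row per character; unknown characters give all zeros."""
    return [list(ONEHOT.get(ch, [0, 0, 0, 0])) for ch in seq]

encoded = onehot_sketch("AGN")
```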

Cross Encoding

Nucleotides are currently encoded as follows:

  • A= [0,-1]
  • C= [-1,0]
  • G= [1,0]
  • T/U= [0,1]
  • Other characters or gaps = [0,0]

# returns a list of 2D numpy arrays representing the encoding
encoding = dna_parser.cross_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
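
One property worth noticing in the cross-encoding table is that complementary bases (A/T and C/G) map to opposite vectors. The sketch below reproduces the table in plain Python and checks that property. `cross_sketch` is a hypothetical helper written for illustration, not the library's implementation, and no padding is applied.

```python
CROSS = {"A": (0, -1), "C": (-1, 0), "G": (1, 0), "T": (0, 1), "U": (0, 1)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def cross_sketch(seq):
    """Return one 2-element row per character; unknown characters give [0, 0]."""
    return [list(CROSS.get(ch, (0, 0))) for ch in seq]

# Each base's vector is the negation of its complement's vector.
for base, comp in COMPLEMENT.items():
    x, y = CROSS[base]
    assert CROSS[comp] == (-x, -y)

encoded = cross_sketch("ACGT-")
```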

TF-IDF Encoding

Note that for this function, your sequences need to be split into words (or k-mers), with each word separated by a whitespace. To do so, you can use the make_kmers function (see the Other Functions section).

encoding = dna_parser.tfidf_encoding(corpus)

Function Arguments:

  • corpus (list of str): genomic sequences, each already split into whitespace-separated words (k-mers).
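
For readers unfamiliar with TF-IDF, the sketch below computes the textbook version on a whitespace-separated corpus: term frequency (count divided by document length) times log(N / document frequency). The exact weighting, smoothing and normalization used by dna-parser are not stated in this README, so the numbers may differ from what `tfidf_encoding` returns; `tfidf_sketch` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_sketch(corpus):
    """Textbook TF-IDF: returns (vocabulary, one dense row per document)."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # document frequency: number of documents containing each word
    df = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for word in vocab:
            if not doc:
                row.append(0.0)
                continue
            tf = counts[word] / len(doc)
            idf = math.log(n_docs / df[word])
            row.append(tf * idf)
        rows.append(row)
    return vocab, rows

corpus = ["ACG TTA ACG", "ACG GGC"]
vocab, rows = tfidf_sketch(corpus)
```

Here "ACG" appears in every document, so its weight is zero, while "TTA" and "GGC" each get a positive weight in the one document that contains them.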

Other Functions

Generating Random Sequences

This function generates random DNA, RNA or amino acid sequences and returns them in a list.

sequences = dna_parser.random_seq(length, nb_of_seq, seq_type, n_jobs)

Function Arguments:

  • length (int): length of the sequences.
  • nb_of_seq (int): number of sequences to generate.
  • seq_type (str; default= "dna"): type of sequences. "dna", "rna" or "aa" (for amino acid).
  • n_jobs (int; default= 1): number of threads to use to generate the sequences. 0 to use all CPUs available.
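
A single-threaded, pure-Python equivalent can be useful for small tests. The sketch below is an assumption-laden stand-in, not the library's generator: `random_seq_sketch` and its `seed` parameter are hypothetical, the alphabets (4 nucleotides, 20 standard amino acids) are our choice, and characters are drawn uniformly.

```python
import random

ALPHABETS = {
    "dna": "ACGT",
    "rna": "ACGU",
    "aa": "ACDEFGHIKLMNPQRSTVWY",  # the 20 standard amino acids (assumed alphabet)
}

def random_seq_sketch(length, nb_of_seq, seq_type="dna", seed=None):
    """Return nb_of_seq random strings of the given length."""
    rng = random.Random(seed)  # local generator so the seed does not leak globally
    alphabet = ALPHABETS[seq_type]
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(nb_of_seq)]

sequences = random_seq_sketch(10, 3, "rna", seed=0)
```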

Making K-mers in Sequences

This function takes a string and returns a new one with whitespaces inserted to form words of length k.

seq_k_mers = dna_parser.make_kmers(seq, k)

Function Arguments:

  • seq (str): the genomic sequence.
  • k (int): length of words to create in the sequence.
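
A plain-Python reading of this description is sketched below. It assumes the words are consecutive, non-overlapping chunks and that a shorter final chunk is kept when the length is not a multiple of k; the README does not confirm either detail, and `make_kmers_sketch` is a hypothetical helper.

```python
def make_kmers_sketch(seq, k):
    """Insert a space every k characters (non-overlapping chunks, assumed)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return " ".join(seq[i:i + k] for i in range(0, len(seq), k))

seq_k_mers = make_kmers_sketch("ACGTAC", 3)
# Each element of a TF-IDF corpus is one such whitespace-separated string.
corpus = [make_kmers_sketch(s, 3) for s in ["ACGTAC", "GGCTTA"]]
```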