# Introduction to the RBP package

This notebook demonstrates the functionality of the `rbp` modules.

## Random

The `random` module contains functions for generating random genomic coordinates and for permuting sequences.

In [1]:
%load_ext autoreload
%autoreload 2

from rbp.random import random_genomic_position, random_genomic_interval, seq_permutation
import pandas as pd

### Permutations

For a sequence, the package provides the `seq_permutation` function, which randomly shuffles its characters:

In [2]:
seq_permutation('ALPHABET')

'TLAEAHBP'

Optionally, we can permute k-mers instead of individual characters. In the example below, note that the k-mers `ALP`, `HAB` and `ET` stay intact; only their order changes.

In [3]:
seq_permutation('ALPHABET', k=3)

'ETALPHAB'
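To make the k-mer behaviour concrete, here is a minimal sketch of how such a shuffle could work. It is not the `rbp` implementation: `kmer_shuffle` and its `rng` parameter are hypothetical names, and the chunking into consecutive, non-overlapping k-mers is an assumption based on the output above.

```python
import random

def kmer_shuffle(seq, k=1, rng=None):
    """Shuffle the order of consecutive k-mers in seq (hypothetical helper)."""
    rng = rng or random.Random()
    # Assumption: k-mers are consecutive and non-overlapping;
    # the last chunk may be shorter than k.
    chunks = [seq[i:i + k] for i in range(0, len(seq), k)]
    rng.shuffle(chunks)
    return ''.join(chunks)

shuffled = kmer_shuffle('ALPHABET', k=3, rng=random.Random(0))
print(shuffled)
```

With `k=1` the sketch reduces to a plain character shuffle; with `k=3` the chunks `ALP`, `HAB` and `ET` are only reordered.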

### Random positions and intervals

With the `random_genomic_position` function, it is easy to get a random position in the human genome (GRCh38). Note that all coordinates are 1-based.

In [4]:
random_genomic_position()

('18', 72138332)

We can, of course, use any other genome by providing its chromosome lengths (either as a dictionary or as a `pandas.Series` object).

In [5]:
MINI_GENOME = {'1': 5, '2': 3}
random_genomic_position(MINI_GENOME)

('1', 5)
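As an illustration of what sampling from a custom genome might involve, the sketch below draws a chromosome with probability proportional to its length and then a 1-based position within it. Whether `rbp` weights chromosomes this way is an assumption; `sample_position` is a hypothetical helper and only handles a plain dictionary.

```python
import random

def sample_position(chrom_lengths, rng=None):
    """Draw a (chromosome, position) pair, 1-based (hypothetical helper)."""
    rng = rng or random.Random()
    names = list(chrom_lengths)
    # Assumption: every base in the genome is equally likely,
    # so chromosomes are weighted by their length.
    weights = [int(chrom_lengths[name]) for name in names]
    chrom = rng.choices(names, weights=weights, k=1)[0]
    # randint includes both endpoints, which matches 1-based coordinates.
    position = rng.randint(1, int(chrom_lengths[chrom]))
    return chrom, position

MINI_GENOME = {'1': 5, '2': 3}
print(sample_position(MINI_GENOME))
```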

Instead of one position, we can generate a random interval as follows:

In [6]:
random_genomic_interval(interval_length=200)

('1', 137233750, 137233949)
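The output above suggests that both endpoints are part of the interval, since the end equals the start plus the length minus one. The short sketch below spells out that arithmetic; `closed_interval` is a hypothetical helper, not part of `rbp`.

```python
def closed_interval(start, length):
    """Return (start, end) of a 1-based interval that includes both endpoints."""
    return start, start + length - 1

start, end = closed_interval(137233750, 200)
print(start, end)       # 137233750 137233949
print(end - start + 1)  # 200
```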

There is also a straightforward way to generate many genomic positions or intervals at once, e.g.

In [7]:
N = 1000
SEQ_LENGTH = 200
many_intervals = [random_genomic_interval(SEQ_LENGTH) for _ in range(N)]
many_intervals_df = pd.DataFrame(many_intervals, columns=['chr', 'start', 'end'])
many_intervals_df.head()

| | chr | start | end |
|---|---|---|---|
| 0 | 3 | 197472075 | 197472274 |
| 1 | 7 | 127998691 | 127998890 |
| 2 | 7 | 58337411 | 58337610 |
| 3 | 12 | 124938624 | 124938823 |
| 4 | 11 | 109996544 | 109996743 |


## Encoding / Decoding

Encoding is the process of converting genomic sequences to numerical values. Decoding is the opposite: converting numerical values back to genomic sequences.

Currently, only one-hot encoding is implemented. It uses a boolean representation.

In [8]:
from rbp.encoding import one_hot_encoding, one_hot_decoding

X = one_hot_encoding(['AACT', 'CCTG'])
X

array([[[ True, False, False, False],
        [ True, False, False, False],
        [False,  True, False, False],
        [False, False,  True, False]],

       [[False,  True, False, False],
        [False,  True, False, False],
        [False, False,  True, False],
        [False, False, False,  True]]])

Note that the resulting array is three-dimensional. Its shape is (number of sequences × length of one sequence × size of the alphabet), which is (2, 4, 4) in this example.
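To show where the three dimensions come from, here is a dependency-free sketch that builds the same structure with nested lists instead of an array. It is not the `rbp` implementation: `one_hot_sketch` is a hypothetical helper, the column order A, C, T, G is inferred from the output above, and all sequences are assumed to have the same length.

```python
ALPHABET = 'ACTG'  # assumed column order, inferred from the output shown above

def one_hot_sketch(sequences, alphabet=ALPHABET):
    """Encode equal-length sequences as nested lists of booleans (hypothetical helper)."""
    # One row per character, with True only in the column of that character.
    return [
        [[base == letter for letter in alphabet] for base in seq]
        for seq in sequences
    ]

X_sketch = one_hot_sketch(['AACT', 'CCTG'])
shape = (len(X_sketch), len(X_sketch[0]), len(X_sketch[0][0]))
print(shape)  # (2, 4, 4)
```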

With `one_hot_decoding`, we can easily get back the original sequences:

In [9]:
one_hot_decoding(X)

['AACT', 'CCTG']
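Decoding can be thought of as looking up, for each row, which column holds `True`. The sketch below does this for the encoded values shown earlier; `one_hot_decode_sketch` is a hypothetical helper, and the column order A, C, T, G is again an assumption inferred from that output.

```python
ALPHABET = 'ACTG'  # assumed column order, inferred from the encoded output shown earlier

def one_hot_decode_sketch(encoded, alphabet=ALPHABET):
    """Map each one-hot row back to its character (hypothetical helper)."""
    # The index of the single True in a row selects the character.
    return [
        ''.join(alphabet[list(row).index(True)] for row in seq)
        for seq in encoded
    ]

encoded = [
    [[True, False, False, False],
     [True, False, False, False],
     [False, True, False, False],
     [False, False, True, False]],
    [[False, True, False, False],
     [False, True, False, False],
     [False, False, True, False],
     [False, False, False, True]],
]
decoded = one_hot_decode_sketch(encoded)
print(decoded)  # ['AACT', 'CCTG']
```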