# Introduction to RBP package

This notebook demonstrates the functionality of the `rbp` modules.

## Random

The `random` module contains functions for generating random genomic coordinates and for permuting sequences.

In [1]:
%load_ext autoreload
%autoreload 2

from rbp.random import random_genomic_position, random_genomic_interval, seq_permutation, gen_random_intervals
import pandas as pd

### Permutations

For a sequence, the package provides the `seq_permutation` function, which randomly shuffles its characters:

In [2]:
seq_permutation('ALPHABET')

'HPEBLAAT'

Optionally, we can permute k-mers instead of individual characters. In the example below, note that the k-mers `ALP`, `HAB` and `ET` stay intact.

In [3]:
seq_permutation('ALPHABET', k=3)

'ETALPHAB'
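
To make the k-mer behaviour concrete, here is a minimal sketch of one way such a shuffle could work. It is not the package's implementation: `shuffle_kmers` and its `rng` parameter are hypothetical names, and the sketch assumes the sequence is cut into consecutive, non-overlapping chunks of length `k`, with a shorter final chunk when the length is not a multiple of `k`.

```python
import random


def shuffle_kmers(seq, k=1, rng=None):
    """Shuffle consecutive, non-overlapping k-mers of `seq` (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    if rng is None:
        rng = random.Random()
    # Cut the sequence into chunks of length k; the last chunk may be shorter.
    kmers = [seq[i:i + k] for i in range(0, len(seq), k)]
    # Reorder the chunks; the characters inside each chunk keep their order.
    rng.shuffle(kmers)
    return "".join(kmers)


# With k=3, 'ALPHABET' is split into 'ALP', 'HAB' and 'ET' before shuffling.
print(shuffle_kmers("ALPHABET", k=3, rng=random.Random(0)))
```

With `k=1` the same function reduces to a plain character shuffle.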

### Random positions and intervals

With the `random_genomic_position` function, it is easy to get a random position in the human genome (GRCh38). Note that all coordinates are 1-based.

In [4]:
random_genomic_position()

('11', 108213279)

We can, of course, use any other genome by providing its chromosome lengths (either as a dictionary or as a `pandas.Series` object).

In [5]:
MINI_GENOME = {'1': 5, '2': 3}
random_genomic_position(MINI_GENOME)

('2', 2)
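
The following sketch shows one plausible way to draw a 1-based position from a mapping of chromosome lengths. It is an illustration only: `sample_position` is a hypothetical helper, and weighting chromosomes by their length (so that every base is equally likely) is an assumption, not something the package documentation confirms.

```python
import random


def sample_position(chrom_lengths, rng=None):
    """Draw a 1-based position uniformly over all bases (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    # .items() yields (name, length) pairs for a plain dict.
    chroms, lengths = zip(*chrom_lengths.items())
    # Weight chromosomes by length so that every base is equally likely.
    chrom = rng.choices(chroms, weights=lengths, k=1)[0]
    # randint is inclusive on both ends, which matches 1-based coordinates.
    return chrom, rng.randint(1, chrom_lengths[chrom])


MINI_GENOME = {"1": 5, "2": 3}
print(sample_position(MINI_GENOME, rng=random.Random(42)))
```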

Instead of one position, we can generate a random interval as follows:

In [6]:
random_genomic_interval(interval_length=200)

('12', 20274153, 20274352)
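
The output above spans 20274352 - 20274153 + 1 = 200 bases, which is consistent with a closed, 1-based interval. A minimal sketch of that arithmetic follows; `sample_interval` is a hypothetical helper, and the way it picks the chromosome (weighted by the number of valid start positions) is an assumption rather than the package's documented behaviour.

```python
import random


def sample_interval(chrom_lengths, interval_length, rng=None):
    """Draw a 1-based, closed interval of fixed length (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    # Number of valid start positions per chromosome; skip chromosomes
    # that are too short to hold the whole interval.
    n_starts = {c: n - interval_length + 1
                for c, n in chrom_lengths.items() if n >= interval_length}
    if not n_starts:
        raise ValueError("no chromosome is long enough for this interval")
    chroms = list(n_starts)
    chrom = rng.choices(chroms, weights=[n_starts[c] for c in chroms], k=1)[0]
    start = rng.randint(1, n_starts[chrom])
    # Closed interval: end - start + 1 == interval_length.
    return chrom, start, start + interval_length - 1


print(sample_interval({"1": 5, "2": 3}, interval_length=3, rng=random.Random(7)))
```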

To generate several random genomic intervals at once, use `gen_random_intervals`; passing a `seed` makes the result reproducible.

In [7]:
gen_random_intervals(sample_size=10, interval_size=25, reference='hg19', seed=10)

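|   | chr | start | end | name | score | strand |
|---|---|---|---|---|---|---|
| 0 | chr21 | 24062875 | 24062900 | 1 | 25 | + |
| 1 | chr10 | 103186149 | 103186174 | 2 | 25 | - |
| 2 | chr1 | 164187383 | 164187408 | 3 | 25 | - |
| 3 | chr15 | 86182851 | 86182876 | 4 | 25 | - |
| 4 | chr9 | 42140825 | 42140850 | 5 | 25 | + |
| 5 | chr1 | 104836096 | 104836121 | 6 | 25 | - |
| 6 | chr2 | 109157020 | 109157045 | 7 | 25 | + |
| 7 | chrX | 34974953 | 34974978 | 8 | 25 | - |
| 8 | chr1 | 17061260 | 17061285 | 9 | 25 | - |
| 9 | chr9 | 66562180 | 66562205 | 10 | 25 | + |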


There is also a straightforward way to generate many genomic positions or intervals rather than just one, e.g.

In [8]:
N = 1000
SEQ_LENGTH = 200
many_intervals = [random_genomic_interval(SEQ_LENGTH) for _ in range(N)]
many_intervals_df = pd.DataFrame(many_intervals, columns=['chr', 'start', 'end'])
many_intervals_df.head()

|   | chr | start | end |
|---|---|---|---|
| 0 | 1 | 245542258 | 245542457 |
| 1 | 12 | 61192098 | 61192297 |
| 2 | 4 | 69680053 | 69680252 |
| 3 | 9 | 129972462 | 129972661 |
| 4 | 11 | 53602178 | 53602377 |


Generate a random set of nucleotide sequences with `random_nucleotides`.

In [12]:
from rbp.random.random_sequence import random_nucleotides

random_nucleotides(sample_size=10, seq_length=20)

['GGTACGGCGTGCTCCCGTGC',
 'CCGGTGAGGTAGGAGACTGA',
 'AAAAGCAGTCATCACCCTTG',
 'CGTATCGTTGGGGCGGATTG',
 'CGAACGCGCGCCTGGATGCT',
 'TGTGGTACCCACACTACCGG',
 'TAATTAGAACACTAGCCTTA',
 'TGGAGCTCAGCTATACTTCA',
 'AGACCGACACGAGGAGAAAT',
 'CACAGTCACATCCAGCACAA']
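
For readers who want to see the idea without the package, a rough equivalent can be written with the standard library. This is only a sketch: `sample_nucleotides`, its `alphabet` and `seed` parameters, and the uniform sampling over `ACGT` are assumptions made for illustration.

```python
import random


def sample_nucleotides(sample_size, seq_length, alphabet="ACGT", seed=None):
    """Return `sample_size` random sequences of `seq_length` (hypothetical helper)."""
    rng = random.Random(seed)
    # Each base is drawn independently and uniformly from the alphabet.
    return ["".join(rng.choices(alphabet, k=seq_length))
            for _ in range(sample_size)]


for seq in sample_nucleotides(sample_size=3, seq_length=20, seed=1):
    print(seq)
```

Fixing `seed` makes the sketch reproducible from run to run.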

## Encoding / Decoding

Encoding converts genomic sequences to numerical values. Decoding is the reverse: it converts numerical values back to genomic sequences.

Currently, only one-hot encoding is implemented. It uses a boolean representation.

In [9]:
from rbp.encoding import one_hot_encoding, one_hot_decoding
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

Convert sequences to a one-hot encoding with `one_hot_encoding`.

Note that the resulting array is three-dimensional. The shape of the array is:

    (number of sequences x length of one sequence x size of the alphabet)

In [10]:
X = one_hot_encoding(['AACT', 'CCTG'])
print('array shape is:', X.shape, '\n', sep=' ')
#print('one hot encoded sequences are:', X.astype(int), sep='\n')

array shape is: (2, 4, 4) 



With `one_hot_decoding`, we can easily get back the original sequences.

In [11]:
Xd = one_hot_decoding(X)
print('decoded array is:', Xd)

decoded array is: ['AACT', 'CCTG']
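
The round trip above can be illustrated with a small NumPy sketch. It is not the package's code: `encode`, `decode` and the `ACGT` column order are assumptions, and the sketch expects all sequences to have the same length and to contain only letters from the alphabet.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order; the package may use a different one


def encode(seqs, alphabet=ALPHABET):
    """Return a boolean array of shape (n_sequences, seq_length, len(alphabet))."""
    index = {base: i for i, base in enumerate(alphabet)}
    out = np.zeros((len(seqs), len(seqs[0]), len(alphabet)), dtype=bool)
    for n, seq in enumerate(seqs):
        for pos, base in enumerate(seq):
            # Exactly one True per position marks the observed base.
            out[n, pos, index[base]] = True
    return out


def decode(array, alphabet=ALPHABET):
    """Invert `encode` by locating the True entry at each position."""
    letters = np.array(list(alphabet))
    return ["".join(letters[row.argmax(axis=-1)]) for row in array]


X_sketch = encode(["AACT", "CCTG"])
print(X_sketch.shape)    # (2, 4, 4)
print(decode(X_sketch))  # ['AACT', 'CCTG']
```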


A generic dot matrix of sequence-versus-sequence complementarity can be generated with `dot_matrix`.

Note that the resulting array is four-dimensional. The shape of the array is:

    (number of sequence pairs x length of seq_X x length of seq_Y x 1)

In [13]:
from rbp.random.random_sequence import random_nucleotides
from rbp.encoding import dot_matrix
import pandas as pd

# Build a dataframe of sequence pairs, one column per side of the comparison
df = pd.DataFrame({"seq_X": random_nucleotides(sample_size=10, seq_length=20),
                   "seq_Y": random_nucleotides(sample_size=10, seq_length=20)})

array_ohe = dot_matrix(df)
print('array shape is:', array_ohe.shape)


array shape is: (10, 20, 20, 1)
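
To show where the four dimensions come from, here is a hedged sketch of a complementarity matrix built by hand. `complementarity_matrix` is a hypothetical helper, and scoring only Watson-Crick pairs (A-T and C-G) with 1 and everything else with 0 is an assumption; `dot_matrix` may use a different pairing rule or different values.

```python
import numpy as np

# Assumed pairing rule: Watson-Crick pairs only.
PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def complementarity_matrix(seqs_x, seqs_y):
    """Return an array of shape (n_pairs, len_x, len_y, 1), 1.0 where bases pair."""
    n, len_x, len_y = len(seqs_x), len(seqs_x[0]), len(seqs_y[0])
    out = np.zeros((n, len_x, len_y, 1), dtype=np.float32)
    for p, (sx, sy) in enumerate(zip(seqs_x, seqs_y)):
        for i, a in enumerate(sx):
            for j, b in enumerate(sy):
                if (a, b) in PAIRS:
                    out[p, i, j, 0] = 1.0
    return out


m = complementarity_matrix(["ACGT"], ["TGC"])
print(m.shape)  # (1, 4, 3, 1)
print(m[0, :, :, 0])
```

The trailing axis of size 1 acts like a single image channel, which is a common layout for feeding such matrices to convolutional models.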
