# Introduction to the RBP package

This notebook demonstrates the functionality of the `rbp` modules.

## Random

The `random` module contains functions for generating random genomic coordinates and for permuting sequences.

In [1]:
# notebook setup - ignore this cell
%load_ext autoreload
%autoreload 2
import pandas as pd

### Permutations

The `seq_permutation` function randomly shuffles the characters of a sequence:

In [2]:
from rbp.random import seq_permutation

seq_permutation('ALPHABET')

'EAPLHTAB'

Optionally, we can permute k-mers instead of individual characters. In the example below, note that the k-mers `ALP`, `HAB` and `ET` stay intact.

In [3]:
seq_permutation('ALPHABET', k=3)

'ALPHABET'
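
To make the k-mer behaviour concrete, here is a minimal sketch of k-mer shuffling in plain Python. It is not the `rbp` implementation: `shuffle_kmers` and its `seed` parameter are hypothetical names, and the sketch assumes the sequence is cut into consecutive, non-overlapping k-mers starting at the first character (which matches the `ALP`, `HAB`, `ET` split mentioned above).

```python
import random


def shuffle_kmers(seq, k=1, seed=None):
    """Shuffle the order of consecutive k-mers (hypothetical helper, not part of rbp)."""
    rng = random.Random(seed)
    # Cut the sequence into consecutive chunks of length k;
    # the last chunk may be shorter when len(seq) is not a multiple of k.
    kmers = [seq[i:i + k] for i in range(0, len(seq), k)]
    # Only the order of the chunks changes; each chunk stays intact.
    rng.shuffle(kmers)
    return ''.join(kmers)


print(shuffle_kmers('ALPHABET', k=3, seed=0))
```

With `k=1` the same function shuffles individual characters. Because there are only three chunks here, a shuffle returns the original order fairly often.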

### Random positions and intervals

The `random_genomic_position` function returns a random position in the human genome (GRCh38).

In [4]:
from rbp.random import random_genomic_position

random_genomic_position()

('2', 105958476)

It is also possible to use a custom genome by providing the lengths of its chromosomes (either as a dictionary or as a `pandas.Series` object).

In [5]:
MINI_GENOME = {'chr1': 5, 'chr2': 3}
MINI_GENOME2 = pd.Series([5, 3], index=['chr1', 'chr2'])
random_genomic_position(MINI_GENOME), random_genomic_position(MINI_GENOME2)

(('chr1', 5), ('chr2', 3))
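
One plausible way to draw such a position is to pick a chromosome with probability proportional to its length and then a coordinate on it, so that every base is equally likely. The sketch below illustrates that idea only; `sample_position` is a hypothetical name, and the 1-based inclusive coordinates are an assumption (consistent with the `('chr1', 5)` result above, but not confirmed by the package).

```python
import random


def sample_position(chrom_lengths, rng=random):
    """Draw (chromosome, position) uniformly over all bases (hypothetical helper)."""
    names = list(chrom_lengths.keys())
    lengths = [int(chrom_lengths[name]) for name in names]
    # Longer chromosomes are chosen proportionally more often.
    chrom = rng.choices(names, weights=lengths, k=1)[0]
    # Assumption: positions are 1-based and inclusive.
    pos = rng.randint(1, int(chrom_lengths[chrom]))
    return chrom, pos


print(sample_position({'chr1': 5, 'chr2': 3}))
```

The function only uses `.keys()` and label lookup, so it accepts a dictionary or a `pandas.Series` alike.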

Instead of one position, we can generate a random interval.

Currently, the package contains two implementations:

1) `gen_random_intervals`, a wrapper that calls `bedtools`

In [6]:
from rbp.random import gen_random_intervals

gen_random_intervals(sample_size=10, interval_size=25, reference='hg19', seed=10)

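|   | chr | start | end | name | score | strand |
|---|---|---|---|---|---|---|
| 0 | chr21 | 24062875 | 24062900 | 1 | 25 | + |
| 1 | chr10 | 103186149 | 103186174 | 2 | 25 | - |
| 2 | chr1 | 164187383 | 164187408 | 3 | 25 | - |
| 3 | chr15 | 86182851 | 86182876 | 4 | 25 | - |
| 4 | chr9 | 42140825 | 42140850 | 5 | 25 | + |
| 5 | chr1 | 104836096 | 104836121 | 6 | 25 | - |
| 6 | chr2 | 109157020 | 109157045 | 7 | 25 | + |
| 7 | chrX | 34974953 | 34974978 | 8 | 25 | - |
| 8 | chr1 | 17061260 | 17061285 | 9 | 25 | - |
| 9 | chr9 | 66562180 | 66562205 | 10 | 25 | + |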


...and 2) `random_genomic_interval`, which is implemented directly in Python

In [7]:
from rbp.random import random_genomic_interval

random_genomic_interval(interval_length=200)

('5', 120549503, 120549702)
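
For intuition, a fixed-length interval can be sampled so that it always fits on its chromosome. The sketch below is an illustration under stated assumptions, not the package's code: `sample_interval` is a hypothetical name, and coordinates are taken to be 1-based with an inclusive end, which is inferred from the output above (`end - start` is 199 for `interval_length=200`).

```python
import random


def sample_interval(chrom_lengths, interval_length, rng=random):
    """Draw (chromosome, start, end) uniformly over all intervals that fit (hypothetical helper)."""
    # Number of valid start positions per chromosome; chromosomes that are too short are skipped.
    valid_starts = {chrom: int(length) - interval_length + 1
                    for chrom, length in chrom_lengths.items()
                    if int(length) >= interval_length}
    if not valid_starts:
        raise ValueError('interval_length exceeds every chromosome length')
    names = list(valid_starts)
    weights = [valid_starts[name] for name in names]
    chrom = rng.choices(names, weights=weights, k=1)[0]
    # Assumption: 1-based start, inclusive end.
    start = rng.randint(1, valid_starts[chrom])
    end = start + interval_length - 1
    return chrom, start, end


print(sample_interval({'chr1': 1000, 'chr2': 500}, interval_length=200))
```

Weighting by the number of valid starts, rather than by raw chromosome length, keeps every admissible interval equally likely.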

To generate many genomic positions or intervals rather than one, call the function in a list comprehension:

In [8]:
N = 1000
SEQ_LENGTH = 200
many_intervals = [random_genomic_interval(SEQ_LENGTH) for _ in range(N)]
many_intervals_df = pd.DataFrame(many_intervals, columns=['chr', 'start', 'end'])
many_intervals_df.head()

|   | chr | start | end |
|---|---|---|---|
| 0 | 8 | 101851930 | 101852129 |
| 1 | 11 | 15970859 | 15971058 |
| 2 | 13 | 80964409 | 80964608 |
| 3 | 18 | 34343109 | 34343308 |
| 4 | 4 | 38321182 | 38321381 |


The package can also generate a random set of nucleotide sequences with `random_nucleotides` (each of the four letters is drawn with a 25% chance).

In [9]:
from rbp.random.random_sequence import random_nucleotides

random_nucleotides(sample_size=10, seq_length=20)

['AAGACATTATATCGAGAAAC',
 'GGCCCGGACGAAGGATAGCG',
 'TAGTTGTTTGTATGCATACT',
 'CTTGGTGGAACGTCGTTAAG',
 'CCTCTTGACCAGTACGTTGT',
 'ATCTTTAACGATGGCGTAGA',
 'GGCAACATTGAGGAGGCACC',
 'TAAATTGGCGCGAGTCTGGA',
 'TGGGACTATATAGAGAAGAG',
 'CATGACGATCAGTATAATCA']
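
The same kind of output can be reproduced with the standard library alone. This is a minimal sketch, not the `rbp` implementation; `random_dna` and its `seed` parameter are hypothetical names, and the alphabet `ACGT` is assumed from the sequences shown above.

```python
import random


def random_dna(sample_size, seq_length, seed=None):
    """Generate sample_size random DNA strings of length seq_length (hypothetical helper)."""
    rng = random.Random(seed)
    # random.choices draws with replacement; without weights each letter has probability 1/4.
    return [''.join(rng.choices('ACGT', k=seq_length)) for _ in range(sample_size)]


for seq in random_dna(sample_size=3, seq_length=20, seed=42):
    print(seq)
```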

## Encoding / Decoding

Encoding converts genomic sequences to numerical values. Decoding is the reverse: it converts numerical values back to genomic sequences.

Currently, only one-hot encoding is implemented. The encoded arrays use a boolean representation.

### One-hot encoding

Convert sequences to a one-hot encoding with `one_hot_encoding`.

Note that the resulting array is three-dimensional. The shape of the array is:

    (number of sequences x length of one sequence x size of the alphabet)

In [10]:
from rbp.encoding import one_hot_encoding

X = one_hot_encoding(['AACT', 'CCTG'])
print('array shape is:', X.shape, '\n', sep=' ')
print('one hot encoded sequences are:\n', X.astype(int), sep='\n')

array shape is: (2, 4, 4) 

one hot encoded sequences are:

[[[1 0 0 0]
  [1 0 0 0]
  [0 1 0 0]
  [0 0 1 0]]

 [[0 1 0 0]
  [0 1 0 0]
  [0 0 1 0]
  [0 0 0 1]]]
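
As an illustration of what happens under the hood, the following NumPy sketch builds the same kind of array by hand. It is not the package's code: `one_hot` is a hypothetical name, all sequences are assumed to have equal length, and the column order `A, C, T, G` is an assumption inferred from the output above (where `T` maps to the third column and `G` to the fourth).

```python
import numpy as np

# Assumption: column order inferred from the notebook output, not from the rbp source.
ALPHABET = 'ACTG'


def one_hot(seqs, alphabet=ALPHABET):
    """One-hot encode equal-length sequences into a boolean array (hypothetical helper)."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    # Shape: (number of sequences, sequence length, alphabet size)
    encoded = np.zeros((len(seqs), len(seqs[0]), len(alphabet)), dtype=bool)
    for i, seq in enumerate(seqs):
        for j, letter in enumerate(seq):
            encoded[i, j, index[letter]] = True
    return encoded


X_sketch = one_hot(['AACT', 'CCTG'])
print(X_sketch.shape)  # (2, 4, 4)
print(X_sketch.astype(int))
```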


### One-hot decoding

With `one_hot_decoding`, we can easily get back the original sequences.

In [11]:
from rbp.encoding import one_hot_decoding

Xd = one_hot_decoding(X)
print('decoded array is:', Xd)

decoded array is: ['AACT', 'CCTG']
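
Decoding amounts to finding, for each position, which column holds the 1. The sketch below shows that idea with `argmax`; it is not the package's implementation, `one_hot_decode` is a hypothetical name, and the column order `A, C, T, G` is again an assumption inferred from the encoded output shown earlier in this notebook.

```python
import numpy as np

# Assumption: column order inferred from the notebook output, not from the rbp source.
ALPHABET = 'ACTG'


def one_hot_decode(encoded, alphabet=ALPHABET):
    """Turn a (sequences x length x alphabet) one-hot array back into strings (hypothetical helper)."""
    letters = np.array(list(alphabet))
    # argmax over the last axis gives the column index of the single 1 at each position.
    indices = np.asarray(encoded).argmax(axis=-1)
    return [''.join(letters[row]) for row in indices]


encoded = np.array([
    [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    [[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
])
print(one_hot_decode(encoded))  # ['AACT', 'CCTG']
```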


### Sequence vs sequence complementarity

A generic dot matrix of sequence vs sequence complementarity can be generated with `dot_matrix`.

Note that the resulting array is four-dimensional. The shape of the array is:

    (number of sequence pairs x length of seq_X x length of seq_Y x 1)

In [12]:
from rbp.random.random_sequence import random_nucleotides
from rbp.encoding import dot_matrix

import pandas as pd

# create a dataframe with one column per set of sequences
df = pd.DataFrame({"seq_X": random_nucleotides(sample_size=5, seq_length=10),
                   "seq_Y": random_nucleotides(sample_size=5, seq_length=10)})

array_ohe = dot_matrix(df)
print('array shape is:', array_ohe.shape, "\n")

print('First pair of samples:', df["seq_X"][0], df["seq_Y"][0])

array_ohe[0,:,:,0] # just the first pair of samples

array shape is: (5, 10, 10, 1) 

First pair of samples: ACAAGACATG ACGCAGTATG


array([[0., 0., 0., 0., 0., 0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 0., 1., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0.]], dtype=float32)
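
The matrix above can be read as follows: cell `(i, j)` is 1 when base `i` of `seq_X` pairs with base `j` of `seq_Y`. The NumPy sketch below reproduces the matrix for the first pair using NumPy broadcasting. It is an illustration, not the `dot_matrix` implementation: `complementarity_matrix` is a hypothetical name, and the pairing rule (Watson-Crick pairs A-T and C-G only, no wobble pairs) is an assumption inferred from the output.

```python
import numpy as np

# Assumption: only Watson-Crick pairs count as complementary.
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}


def complementarity_matrix(seq_x, seq_y):
    """Return a len(seq_x) x len(seq_y) float32 matrix of complementary base pairs (hypothetical helper)."""
    x = np.array(list(seq_x))
    y_complement = np.array([COMPLEMENT[base] for base in seq_y])
    # Broadcasting compares every base of seq_x with the complement of every base of seq_y.
    return (x[:, None] == y_complement[None, :]).astype(np.float32)


matrix = complementarity_matrix('ACAAGACATG', 'ACGCAGTATG')
print(matrix.shape)  # (10, 10)
print(matrix[0])     # row for the first base of seq_X, 'A'
```

Stacking one such matrix per row of the dataframe and adding a trailing axis of size 1 would give the `(5, 10, 10, 1)` shape reported above.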