# Complex Data Transformations
Deep learning models are complex enough that overfitting is a common problem in practice. In genomics, we often augment the sequences fed to the model with transformations such as jittering, reverse complementing, or in silico mutation. Although augmented sequences can be generated before training, doing so limits the number of possible transformations and increases the size of the dataset. This notebook demonstrates how to apply transformations for data augmentation on the fly using the `transform` argument of `get_torch_dataloader`.
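
To make the on-the-fly idea concrete, here is a minimal sketch in plain Python that does not use SeqData or PyTorch. The helpers `random_point_mutation` and `iter_batches` are hypothetical names introduced only for illustration. The stored sequences are never modified; a fresh random variant is produced each time a batch is drawn.

```python
import random


def random_point_mutation(seq, rng):
    """Replace one randomly chosen base with a different base."""
    pos = rng.randrange(len(seq))
    new_base = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]


def iter_batches(seqs, batch_size, transform, rng):
    """Yield batches, applying `transform` to each sequence at draw time."""
    for i in range(0, len(seqs), batch_size):
        yield [transform(s, rng) for s in seqs[i:i + batch_size]]


seqs = ["ACGTAC", "GGGTTT", "CATCAT"]
rng = random.Random(0)
for epoch in range(2):
    # The stored sequences stay the same; each epoch sees new variants.
    print(epoch, list(iter_batches(seqs, 2, random_point_mutation, rng)))
```

Because the variants are generated when a batch is requested, the dataset on disk stays the same size no matter how many epochs are run.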

In [1]:
import seqdata as sd
from pathlib import Path
sd.__version__

'0.1.2'

## SeqData to PyTorch dataloader

In [2]:
# Locate the test data shipped in the SeqData repository
data_dir = Path(sd.__file__).resolve().parent.parent / 'tests' / 'data'
bw_fname = data_dir / 'tangermeme.bw'
bed_fname = data_dir / 'tangermeme.bed'
fasta_fname = data_dir / 'tangermeme.fa'
bw_fname, bed_fname, fasta_fname

(PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/tangermeme.bw'),
 PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/tangermeme.bed'),
 PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/tangermeme.fa'))

In [3]:
from seqdata import read_bigwig

In [10]:
sdata = read_bigwig(
    bigwigs=[bw_fname],  # bigwig files
    fasta=fasta_fname,  # reference genome
    seq_name="seq",  # name of resulting xarray variable containing sequences
    cov_name="cov",  # name of resulting xarray variable containing coverage
    bed=bed_fname,  # bed file with regions to extract
    samples=["tangermeme"],  # sample names
    out=bw_fname.with_suffix(".zarr"),  # Zarr store to write the dataset to
    fixed_length=True,  # whether all sequences are the same length
    batch_size=1000,  # number of sequences to load at once
    n_jobs=1,  # number of parallel jobs
    overwrite=True,  # overwrite the output file if it exists
    max_jitter=10,  # read extra flanking bases so windows can be jittered later
)
sdata

100%|██████████| 5/5 [00:00<00:00, 61141.46it/s]
100%|██████████| 5/5 [00:00<00:00, 12764.16it/s]


Dataset summary (each array is stored as 1 Dask chunk in 2 graph layers, so the chunk shape equals the array shape):

| Data type | Shape | Bytes |
| --- | --- | --- |
| object | (5,) | 40 B |
| int64 | (5,) | 40 B |
| int64 | (5,) | 40 B |
| float32 | (5, 1, 40) | 800 B |
| \|S1 | (5, 40) | 200 B |
| object | (5,) | 40 B |

In [11]:
from seqdata import get_torch_dataloader

In [12]:
import seqpro as sp

In [13]:
def one_hot_encode(batch):
    batch["seq"] = sp.ohe(batch["seq"], alphabet=sp.DNA)
    return batch
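
As a sketch of what one-hot encoding produces, the plain-Python function below maps each base to a length-4 indicator vector. It assumes the channel order A, C, G, T and is not seqpro's implementation; `one_hot` is a hypothetical helper. Under that assumption, rows such as `[0., 1., 0., 0.]` in the dataloader output correspond to C.

```python
ALPHABET = "ACGT"  # assumed channel order


def one_hot(seq):
    """Encode a DNA string as a (length, 4) nested list.

    Bases outside the alphabet (for example N) become all-zero rows.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]


for base, row in zip("CGAC", one_hot("CGAC")):
    print(base, row)
```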

In [14]:
dataloader = get_torch_dataloader(
    sdata,
    sample_dims=["_sequence"],
    variables=["seq", "cov"],
    transform=one_hot_encode,
    batch_size=10
)

In [15]:
next(iter(dataloader))

{'seq': tensor([[[0., 1., 0., 0.],
          [0., 0., 1., 0.],
          [1., 0., 0., 0.],
          [0., 1., 0., 0.],
          [0., 0., 0., 1.],
          [1., 0., 0., 0.],
          [0., 1., 0., 0.],
          [0., 0., 0., 1.],
          [1., 0., 0., 0.],
          [0., 1., 0., 0.],
          [0., 1., 0., 0.],
          [0., 0., 1., 0.],
          [1., 0., 0., 0.],
          [0., 1., 0., 0.],
          [0., 0., 0., 1.],
          [1., 0., 0., 0.],
          [1., 0., 0., 0.],
          [0., 1., 0., 0.],
          [0., 0., 0., 1.],
          [0., 0., 1., 0.],
          [1., 0., 0., 0.],
          [0., 1., 0., 0.],
          [0., 0., 0., 1.],
          [0., 0., 1., 0.],
          [1., 0., 0., 0.],
          [0., 0., 0., 1.],
          [0., 0., 1., 0.],
          [1., 0., 0., 0.],
          [0., 0., 0., 1.],
          [0., 0., 1., 0.],
          [1., 0., 0., 0.],
          [0., 0., 0., 1.],
          [0., 0., 1., 0.],
          [0., 1., 0., 0.],
          [1., 0., 0., 0.],
          [0.

In [16]:
import numpy as np

In [49]:
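def transform(batch):
    # Jitter: cut the same random window from seq and cov, with a separate offset per example (axis 0)
    batch['seq'], batch['cov'] = sp.jitter(
        batch['seq'], batch['cov'], max_jitter=10, length_axis=-1, jitter_axes=0
    )
    # Crop 5 positions from each end of the coverage track
    batch['cov'] = batch['cov'][..., 5:-5]
    # One-hot encode, then move channels first: (batch, length, 4) -> (batch, 4, length)
    batch['seq'] = sp.DNA.ohe(batch['seq']).transpose(0, 2, 1)
    # Reverse complement the whole batch with probability 0.5
    if np.random.rand() < 0.5:
        batch['seq'] = sp.reverse_complement(batch['seq'], alphabet=sp.DNA, length_axis=-1, ohe_axis=1).copy()
        # Flip coverage along the length axis so it stays aligned with the sequence
        batch['cov'] = np.flip(batch['cov'], axis=-1).copy()
    return batch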
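
The reverse-complement branch can be checked by hand on a tiny example. The sketch below uses nested lists instead of arrays and assumes the channel order A, C, G, T, in which A/T and C/G are mirror images of each other. Under that assumption, reversing the channel axis complements the sequence and reversing the length axis reverses it. The helper names are hypothetical, and this is not how seqpro implements `reverse_complement`.

```python
ALPHABET = "ACGT"  # assumed channel order


def one_hot_channels_first(seq):
    """Encode a DNA string as a (4, length) nested list."""
    return [[1.0 if base == letter else 0.0 for base in seq] for letter in ALPHABET]


def reverse_complement_ohe(ohe):
    """Reverse the channel axis (complement) and the length axis (reverse)."""
    return [row[::-1] for row in ohe[::-1]]


seq = "AACG"
cov = [0.0, 1.0, 2.0, 3.0]  # stand-in coverage, one value per base

rc_ohe = reverse_complement_ohe(one_hot_channels_first(seq))
rc_cov = cov[::-1]  # flip coverage so position i still matches base i

print(rc_ohe == one_hot_channels_first("CGTT"))
print(rc_cov)
```

Flipping the coverage along the length axis matters: without it, each coverage value would be paired with the wrong base after the sequence is reversed.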

In [50]:
dataloader = get_torch_dataloader(
    sdata,
    sample_dims=["_sequence"],
    variables=["seq", "cov"],
    transform=transform,
    batch_size=10
)

In [51]:
batch = next(iter(dataloader))
batch['seq'].shape, batch['cov'].shape

(torch.Size([5, 4, 20]), torch.Size([5, 1, 10]))
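
These shapes follow from simple arithmetic. The stored sequence array has length 40 (the `(5, 40)` entry in the dataset repr), jittering with `max_jitter=10` yields windows of 40 - 2 * 10 = 20, and cropping 5 positions from each end of the coverage leaves 10. The sketch below reproduces that arithmetic in plain Python. `jitter_pair` is a hypothetical helper that illustrates the idea of jittering and is not seqpro's implementation.

```python
import random


def jitter_pair(seq, cov, max_jitter, rng):
    """Cut the same random window out of a sequence and its coverage track.

    The window is 2 * max_jitter shorter than the input, and its start is
    drawn uniformly from 0..2*max_jitter so seq and cov stay aligned.
    """
    out_len = len(seq) - 2 * max_jitter
    start = rng.randint(0, 2 * max_jitter)
    return seq[start:start + out_len], cov[start:start + out_len]


rng = random.Random(0)
seq = "ACGT" * 10                     # length 40, like the stored windows
cov = [float(i) for i in range(40)]   # stand-in coverage, one value per base

seq_j, cov_j = jitter_pair(seq, cov, max_jitter=10, rng=rng)
cov_c = cov_j[5:-5]                   # crop 5 positions from each end

print(len(seq_j), len(cov_c))         # 20 10
```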

## Transforms that add new keys (unresolved)

In [55]:
def transform(batch):
    batch['seq'], batch['cov'] = sp.jitter(batch['seq'], batch['cov'], max_jitter=10, length_axis=-1, jitter_axes=0)  # jitter
    batch['cov'] = batch['cov'][..., 5:-5]  # crop
    batch['seq'] = sp.DNA.ohe(batch['seq']).transpose(0, 2, 1)  # one hot encode
    print("here")  # debug marker: shows the transform was called
    # Store the reverse complement under a new key that is not listed in `variables`
    batch['rev_seq'] = sp.reverse_complement(batch['seq'], alphabet=sp.DNA, length_axis=-1, ohe_axis=1).copy()
    return batch

In [56]:
dataloader = get_torch_dataloader(
    sdata,
    sample_dims=["_sequence"],
    variables=["seq", "cov"],
    transform=transform,
    batch_size=10
)

In [57]:
batch = next(iter(dataloader))
batch['seq'].shape, batch['cov'].shape

here


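KeyError: 'rev_seq'

The transform itself runs (it prints `here`), but fetching the batch then fails with `KeyError: 'rev_seq'`. In this version (0.1.2), a transform that adds a key not listed in `variables` does not work as written.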