# Reading Sequences from Flat Files
The simplest way to store genomic sequence data is in a flat file in which the sequences are stored as plain text. In this notebook, we will learn how to read sequences from two types of flat files using SeqData.

In [1]:
import os
import seqdata as sd
from pathlib import Path
sd.__version__


'0.0.0'

In [2]:
# Make a temporary directory for the output
os.makedirs(Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'tmp', exist_ok=True)

## Reading tabular files (CSV, TSV, etc.)

Though reading tabular data can easily be accomplished with existing packages (e.g. pandas), the SeqData interface keeps the resulting on-disk and in-memory objects standardized with the rest of the SeqData API.
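
For orientation, here is a minimal sketch of the kind of tabular file this section deals with: one sequence per row in a `seq` column, alongside other columns. The file contents and the `target` column name are hypothetical and are not taken from `variable.tsv`; the parsing uses only the Python standard library, not SeqData.

```python
import csv
import io

# Hypothetical file contents: one sequence per row plus a made-up numeric column.
tsv_text = "seq\ttarget\nACGT\t0.5\nACGTACG\t1.2\nTTGA\t-0.3\n"

# Parse the tab-separated text into one dict per row, keyed by the header.
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

seqs = [row["seq"] for row in rows]
targets = [float(row["target"]) for row in rows]
lengths = [len(s) for s in seqs]

print(seqs)
print(lengths)
```

The sequences in this toy table have different lengths, which is the situation the `fixed_length=False` argument used later in this notebook is meant for.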

In [6]:
# Get file name
fname = Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'variable.tsv'
fname

PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/variable.tsv')

To load this flat file as a SeqData object, we need to specify the path to the file and the output file name.

In [7]:
from seqdata import read_table

In [8]:
sdata = read_table(
    tables=fname,
    out=Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'tmp' / 'variable_table.zarr',
    seq_col="seq",
    name="seq",
    fixed_length=False,
    batch_size=1000,
    overwrite=True,
)
sdata

7it [00:00, 301.73it/s]


[Output: dataset summary showing two Dask-backed arrays, each with shape (7,), 56 B, stored as a single chunk (1 chunk in 2 graph layers); one has data type object and the other float64.]


This will generate a `.zarr` file containing the sequences from the column of `variable.tsv` named by `seq_col`. The resulting `sdata` object can then be used for downstream analysis.

You can also pass in a list of tabular files. This will generate a SeqData object with each table concatenated along the row axis in the order they were passed in.
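
As a rough illustration of what concatenation along the row axis in input order means, the sketch below stacks the rows of two small tables using only the standard library. The table contents are hypothetical, and this is not how SeqData implements `read_table`; it only shows the expected ordering of the result.

```python
import csv
import io


def read_rows(text):
    """Parse tab-separated text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical contents standing in for two separate TSV files.
table_a = "seq\ttarget\nACGT\t0.1\nGGC\t0.2\n"
table_b = "seq\ttarget\nTTTTAA\t0.3\n"

# Rows from the first table come first, followed by rows from the second.
combined = []
for text in [table_a, table_b]:
    combined.extend(read_rows(text))

combined_seqs = [row["seq"] for row in combined]
print(combined_seqs)
```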

In [9]:
# Get file name
fnames = [fname, Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'fixed.tsv']
fnames

[PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/variable.tsv'),
 PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/fixed.tsv')]

In [10]:
sdata2 = read_table(
    name="seq",
    tables=fnames,
    out=Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'tmp' / 'combo_table.zarr',
    seq_col="seq",
    fixed_length=False,
    batch_size=1000,
    overwrite=True,
)
sdata2

14it [00:00, 540.54it/s]


[Output: dataset summary showing two Dask-backed arrays, each with shape (14,), 112 B, stored as a single chunk (1 chunk in 2 graph layers); one has data type object and the other float64.]


## Reading "flat" FASTA files

Sometimes, sequences are stored without any labels in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format). These can be read in much the same way as tabular files:
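
For readers unfamiliar with the format, the following is a minimal sketch of a FASTA parser: header lines start with `>`, and the lines that follow, up to the next header, are joined into one sequence. The record names and sequences are made up, and `parse_fasta` is a hypothetical helper, not part of SeqData.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may wrap; collect them until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical FASTA text with one wrapped and one single-line sequence.
fasta_text = ">seq0\nACGTAC\nGT\n>seq1\nTTGACA\n"
records = parse_fasta(fasta_text)
print(records)
```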

In [11]:
# Get file name
fname = Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'variable.fa'
fname

PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/variable.fa')

In [12]:
from seqdata import read_flat_fasta

In [13]:
sdata3 = read_flat_fasta(
    name="seq",
    out=Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'tmp' / 'variable_fasta.zarr',
    fasta=fname,
    fixed_length=False,
    batch_size=1000,
    overwrite=True,
)
sdata3

100%|██████████| 7/7 [00:00<00:00, 1581.31it/s]


[Output: dataset summary showing one Dask-backed array with shape (7,), 56 B, stored as a single chunk (1 chunk in 2 graph layers), with data type object.]


Unlike `read_table`, however, `read_flat_fasta` does not support reading multiple files at once. If this behavior is desired, simply concatenate the FASTA files on disk (e.g. with `cat`) or use xarray's `concat` function after reading in the files separately.
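
If you prefer to stay in Python, on-disk concatenation can be sketched as below. `concat_fasta` is a hypothetical helper, and the file names and contents are made up. Unlike a plain `cat`, this sketch adds a newline when a file does not end with one, so that the next file's header does not get glued onto the previous sequence line.

```python
import tempfile
from pathlib import Path


def concat_fasta(paths, out_path):
    """Append the given FASTA files, in order, into out_path."""
    with open(out_path, "w") as out:
        for path in paths:
            text = Path(path).read_text()
            out.write(text)
            # Guard against a missing final newline in the input file.
            if text and not text.endswith("\n"):
                out.write("\n")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.fa"
    b = Path(tmp) / "b.fa"
    a.write_text(">a0\nACGT")  # deliberately no trailing newline
    b.write_text(">b0\nTTGA\n")
    merged = concat_fasta([a, b], Path(tmp) / "merged.fa")
    merged_text = merged.read_text()

print(merged_text)
```

The sketch reads each file fully into memory, which is fine for small inputs; for large FASTA files a chunked copy would be more appropriate.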

## Composing readers

The functions we used (`read_table` and `read_flat_fasta`) are themselves specific uses of a more basic set of building blocks called readers, `Table` and `FlatFASTA` respectively. These readers are called 'FlatReaders' and are designed to be composable, so you can create your own custom readers by combining these basic readers in a way that suits your needs.
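
To make the idea of composition concrete, here is a toy sketch in which each reader carries a variable name and a set of sequences, and a combining function stores each reader's data under its own name. `ToyReader` and `combine` are hypothetical and do not reflect SeqData's classes or internals; in particular, raising an error on duplicate names is an assumption of this toy, not documented SeqData behavior.

```python
class ToyReader:
    """Hypothetical stand-in for a reader: a variable name plus its sequences."""

    def __init__(self, name, seqs):
        self.name = name
        self.seqs = list(seqs)


def combine(*readers):
    """Collect each reader's sequences under that reader's name."""
    out = {}
    for reader in readers:
        # Two readers writing to the same name would clash, so the toy rejects it.
        if reader.name in out:
            raise ValueError(f"duplicate variable name: {reader.name!r}")
        out[reader.name] = reader.seqs
    return out


combined = combine(
    ToyReader("seq", ["ACGT", "GGCA"]),
    ToyReader("seq2", ["TTGA", "CCAT"]),
)
print(sorted(combined))
```

The point of the toy is that each reader needs a distinct name, since the name decides where its data ends up in the combined object.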

In [14]:
from seqdata import Table, FlatFASTA

Let's take an example where we want to read two sets of sequences into separate variables in the resulting xarray object. Perhaps you want to do some kind of contrastive learning between pairs of sequences:

In [15]:
# Get file names
fnames = [Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'variable.fa', Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'fixed.fa']
fnames

[PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/variable.fa'),
 PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/fixed.fa')]

We can build a reader for each of the files:

In [16]:
# Note that we specify different names for the two readers
reader1 = FlatFASTA(fasta=fnames[0], name="seq", batch_size=1000)
reader2 = FlatFASTA(fasta=fnames[1], name="seq2", batch_size=1000)

In [17]:
from seqdata import from_flat_files

And load them into an xarray object using the `from_flat_files` function:

In [18]:
sdata4 = from_flat_files(
    reader1,
    reader2,
    path=Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'tmp' / 'combo_fasta.zarr',
    fixed_length=False,
    overwrite=True,
)
sdata4

100%|██████████| 7/7 [00:00<00:00, 4391.28it/s]
100%|██████████| 7/7 [00:00<00:00, 2411.71it/s]


[Output: dataset summary showing two Dask-backed arrays, each with shape (7,), 56 B, stored as a single chunk (1 chunk in 2 graph layers), both with data type object.]


You can see that the above SeqData object includes both the `seq` and `seq2` variables, which can be used for downstream analysis.

## Clean-up

In [19]:
import shutil

In [20]:
# Remove all directories and files created
shutil.rmtree(Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'tmp')