# Reading tabular data into SeqData objects
The simplest way to store genomic sequence data is in a flat "tabular" file. Such a file can easily be read with something like `pandas.read_csv`, but the SeqData interface keeps the resulting on-disk and in-memory objects standardized with the rest of the SeqData and the larger ML4GLand API.
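
For orientation, a flat tabular file of this kind might look like the sketch below: one row per sequence, with a sequence column and any number of metadata columns. The column names `seq` and `target` mirror the ones used later in this notebook. The rows themselves are made up for illustration and are not the contents of `sample100.tsv`.

```python
import csv
import io

# Hypothetical TSV content: one sequence per row plus a numeric target column.
tsv_text = "seq\ttarget\nACGTAC\t11.0\nTTGACCA\t7.0\nGGC\t13.0\n"

# Parse the tab-separated text into a list of dicts keyed by column name.
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Split the rows into a sequence column and a numeric target column.
seqs = [row["seq"] for row in rows]
targets = [float(row["target"]) for row in rows]

print(seqs)     # ['ACGTAC', 'TTGACCA', 'GGC']
print(targets)  # [11.0, 7.0, 13.0]
```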

In [3]:
import os
import seqdata as sd
from pathlib import Path
sd.__version__

'0.0.0'

In [4]:
# Make a temporary directory for the output
os.makedirs(Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'tmp', exist_ok=True)

In [5]:
# Get file name
fname = Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'sample100.tsv'
fname

PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/sample100.tsv')

To load this flat file as a SeqData object, we need to specify the path to the file, the name of the column that holds the sequences, and the path of the output Zarr store.

In [10]:
from seqdata import read_table

In [11]:
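sdata = read_table(
    tables=fname,                    # input table to read
    out=fname.with_suffix(".zarr"),  # output Zarr store, written next to the input file
    seq_col="seq",                   # column in the table that holds the sequences
    name="seq",                      # name given to the sequence variable in the result
    fixed_length=False,              # do not assume all sequences share one length
    batch_size=1000,                 # number of rows processed per batch
    overwrite=True,                  # replace the output store if it already exists
)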
sdata

100it [00:00, 3488.13it/s]


| Data type | Array bytes | Chunk bytes | Array shape | Chunk shape | Dask graph |
| --- | --- | --- | --- | --- | --- |
| object numpy.ndarray | 800 B | 800 B | (100,) | (100,) | 1 chunks in 2 graph layers |
| float64 numpy.ndarray | 800 B | 800 B | (100,) | (100,) | 1 chunks in 2 graph layers |

In [12]:
sdata["target"].values[:15]

array([11.      ,  7.      , 13.      ,  3.      , 13.      , 10.      ,
        5.      , 12.      , 11.      , 11.      ,  8.      ,  9.171174,
        4.      , 12.8937  ,  0.      ])

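The `read_table` call above writes a Zarr store next to the input file (`sample100.zarr`, since `out=fname.with_suffix(".zarr")`). The store contains the sequences from the `seq` column of `sample100.tsv`, along with the table's other columns, such as `target`. The resulting `sdata` object can then be used for downstream analysis.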
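
The `read_table` call above passes `fixed_length=False`, which we take to mean that the sequences in the table are not assumed to share a single length. A quick way to check which case a table falls into before choosing the flag might look like the sketch below. `has_fixed_length` is a hypothetical helper, not part of SeqData, and the example sequences are invented.

```python
def has_fixed_length(seqs):
    """Return True if every sequence has the same length (also True for an empty list)."""
    return len({len(s) for s in seqs}) <= 1


# Invented examples: one set with mixed lengths, one with a single shared length.
variable = ["ACGT", "ACGTAC", "AC"]
fixed = ["ACGT", "TTGA", "GGCC"]

print(has_fixed_length(variable))  # False
print(has_fixed_length(fixed))     # True
```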

You can also pass in a list of tabular files. This will generate a SeqData object with each table concatenated along the row axis in the order they were passed in.
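
Conceptually, row-wise concatenation of this kind behaves like the plain-Python sketch below. It is an illustration of the ordering only, not SeqData's implementation. The two tables and the `read_rows` helper are hypothetical.

```python
import csv
import io

# Hypothetical contents of two small TSV files with the same columns.
table_a = "seq\ttarget\nACGT\t1.0\nTTGA\t2.0\n"
table_b = "seq\ttarget\nGGCC\t3.0\n"


def read_rows(text):
    """Parse tab-separated text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Concatenate along the row axis, preserving the order the tables were given in.
combined = [row for text in (table_a, table_b) for row in read_rows(text)]

print([row["seq"] for row in combined])  # ['ACGT', 'TTGA', 'GGCC']
```

Rows from the first table come first, followed by rows from the second, so the row index of the combined data depends on the order of the input list.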

In [37]:
# Get file name
fnames = [fname, Path(sd.__file__).resolve().parent.parent / 'tests' / 'data' / 'sample2.tsv']
fnames

[PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/sample.tsv'),
 PosixPath('/cellar/users/aklie/projects/ML4GLand/SeqData/tests/data/sample2.tsv')]

In [38]:
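sdata2 = read_table(
    tables=fnames,                   # list of tables, concatenated in this order
    out=fname.with_suffix(".zarr"),  # same path as the first call, so overwrite=True replaces that store
    seq_col="seq",
    name="seq",
    fixed_length=False,
    batch_size=1000,
    overwrite=True,
)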
sdata2

200000it [00:05, 35884.36it/s]


| Data type | Array bytes | Chunk bytes | Array shape | Chunk shape | Dask graph |
| --- | --- | --- | --- | --- | --- |
| object numpy.ndarray | 781.25 kiB | 7.81 kiB | (100000,) | (1000,) | 100 chunks in 2 graph layers |
| float64 numpy.ndarray | 781.25 kiB | 7.81 kiB | (100000,) | (1000,) | 100 chunks in 2 graph layers |
