# Tutorial: Interacting with Data

This notebook demonstrates how to convert common data formats into formats that the data generators/loaders can handle. Most of the dataset utilities used here come from the [dnadb](https://github.com/DLii-Research/dnadb) library.

---

## Paths

The cells below import the `Path` class from `pathlib` to make working with paths easier, and define the locations of the input files used throughout this tutorial.

In [1]:
from pathlib import Path

In [2]:
sfd_fasta_file = "/home/shared/walker_lab/alex/P_A_221205_cmfp.trim.contigs.pcr.good.unique.good.filter.unique.precluster.denovo.vsearch.pick.opti_mcc.0.03.pick.0.03.abund.0.03.pick.fasta"
otu_list_path = "/home/shared/walker_lab/digitalocean/Alex_SFD/shared_list/221205_cmfp.trim.contigs.pcr.good.unique.good.filter.unique.precluster.denovo.vsearch.asv.list"
otu_shared_path = "/home/shared/walker_lab/digitalocean/Alex_SFD/shared_list/221205_cmfp.trim.contigs.pcr.good.unique.good.filter.unique.precluster.denovo.vsearch.asv.shared"

---

## Interfacing with FASTA Files



In [3]:
from dnadb import fasta

To read entries out of a FASTA file, we use the `fasta.entries` function. This function returns a generator, meaning that it does not read an entry until one is requested. This allows us to work with FASTA files of any size without loading the whole file into memory.

In [21]:
entries = fasta.entries(sfd_fasta_file)
entries

<generator object entries at 0x7f4b5f51cac0>

In [24]:
# Read a single entry
entry = next(entries)
print(entry)

>M03064_61_000000000-CKV98_1_2104_9506_7395 CMFP657	Otu000002	NumRep=1
T--AC--GT-AG-GGT----GCG-A-G----C--G--T---T--GT-C-CGG-AA-----TT-A--T-T--GG-GC------GT--A-----AA-GA-GC-TT-------G-TA-G-G-C-G---------------G--T-TT-G-T-C-------------GC----G-T-C-T----------------G-C-T--G--TG--A-AA-AT--C-C-GG-G-G------------------------CT-C-AA-------------------------C-C-C-C-G-G-A--C-T----T-G--C-A---G--T-----------------------G--GG-T-A---C-----------G--G-G--CA--G-A-C------------------------------------------------------------------------------T-A-G-A-G-T--G-----T-GG------TA-G-G-------------------G-G-A-G---AC-T------------------------------GG--A--ATT--------------------------C-C-T-G-GT--GT-A-G-CG-GT--G---------G--A-A-----------------TG-C-GC-AG--AT-A-TC-------------------A-G------G-A------A-G-A-AC-A-CC-----------------GA--T--T--GC-GAA-G--G-C----A--------G--G-T-C-T---CTG--------G--GC-C-A-----------------------------C-T--------A-C-T--GA--CG-----C--------T-G--A-GA--A-G-CG-A--AA-G-C------A-TG--GG-G--AG-C-G-AA
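
To make the lazy behaviour described above concrete, here is a minimal sketch of a generator that yields FASTA records one at a time. It is not dnadb's implementation: `SimpleEntry`, `iter_fasta`, and `_make_entry` are hypothetical names used only for illustration, and the input is a small in-memory string rather than a real file.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class SimpleEntry(NamedTuple):
    """Hypothetical stand-in for a FASTA entry (not dnadb's class)."""
    identifier: str
    extra: str
    sequence: str


def _make_entry(header: str, chunks: list) -> SimpleEntry:
    # Split the header into the identifier and whatever metadata follows it.
    identifier, _, extra = header.partition(" ")
    return SimpleEntry(identifier, extra, "".join(chunks))


def iter_fasta(handle: TextIO) -> Iterator[SimpleEntry]:
    """Lazily yield one entry at a time from an open FASTA text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, so emit it now.
            if header is not None:
                yield _make_entry(header, chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield _make_entry(header, chunks)


demo = io.StringIO(">seq1 CMFP13\tOtu000002\nT--AC--GT\nAG-GG\n>seq2\nACGT\n")
demo_entries = iter_fasta(demo)
first = next(demo_entries)  # reads just far enough to complete the first record
print(first.identifier, first.sequence)
```

Because the generator only advances when `next` is called (or a loop asks for another item), memory use depends on the size of a single record rather than the size of the file.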

In [27]:
# Access the entry identifier
entry.identifier

'M03064_61_000000000-CKV98_1_2104_9506_7395'

In [28]:
# Access the entry's sequence
entry.sequence

'T--AC--GT-AG-GGT----GCG-A-G----C--G--T---T--GT-C-CGG-AA-----TT-A--T-T--GG-GC------GT--A-----AA-GA-GC-TT-------G-TA-G-G-C-G---------------G--T-TT-G-T-C-------------GC----G-T-C-T----------------G-C-T--G--TG--A-AA-AT--C-C-GG-G-G------------------------CT-C-AA-------------------------C-C-C-C-G-G-A--C-T----T-G--C-A---G--T-----------------------G--GG-T-A---C-----------G--G-G--CA--G-A-C------------------------------------------------------------------------------T-A-G-A-G-T--G-----T-GG------TA-G-G-------------------G-G-A-G---AC-T------------------------------GG--A--ATT--------------------------C-C-T-G-GT--GT-A-G-CG-GT--G---------G--A-A-----------------TG-C-GC-AG--AT-A-TC-------------------A-G------G-A------A-G-A-AC-A-CC-----------------GA--T--T--GC-GAA-G--G-C----A--------G--G-T-C-T---CTG--------G--GC-C-A-----------------------------C-T--------A-C-T--GA--CG-----C--------T-G--A-GA--A-G-CG-A--AA-G-C------A-TG--GG-G--AG-C-G-AA----CA-GG'

In [75]:
# Access the entry's metadata line
print(entry.extra)

CMFP13	Otu000002	NumRep=1


In [23]:
# Iterate over the entries
for entry in entries:
    print(entry)
    break # so we don't print them all

>M03064_23_000000000-C8T53_1_1103_13633_21201 CMFP171-CMFP377-CMFP900	Otu000002	NumRep=6
T--AC--GT-AG-GGT----GCA-A-G----C--G--T---T--GT-C-CGG-AA-----TT-A--T-T--GG-GC------GT--A-----AA-GA-GC-TC-------G-TA-G-G-C-G---------------G--T-TT-G-T-C-------------GC----G-T-C-T----------------G-C-T--G--TG--A-AA-AT--C-C-GG-G-G------------------------CT-C-AA-------------------------C-C-C-C-G-G-A--C-T----T-G--C-G---G--T-----------------------G--GG-T-A---C-----------G--G-G--CA--G-A-C------------------------------------------------------------------------------T-A-G-A-G-T--G-----T-GG------TA-G-G-------------------G-G-A-G---AC-T------------------------------GG--A--ATT--------------------------C-C-T-G-GT--GT-A-G-CG-GT--G---------A--A-A-----------------TG-C-GC-AG--AT-A-TC-------------------A-G------G-A------G-G-A-AC-A-CC-----------------GA--T--G--GC-GAA-G--G-C----A--------G--G-T-C-T---CTG--------G--GC-C-A-----------------------------C-T--------A-C-T--GA--CG-----C--------T-G--A-GA--A-G-CG-A--AA-G-C------G-T

---

## Cleaning FASTA Entries

Sometimes the entries within a FASTA file need to be cleaned up in some way. This quick demo removes every character that is not a nucleotide base (here, the dashes) from the sequence part of each entry.

In [50]:
from dnadb import fasta

In [53]:
from dnadb import dna
from dataclasses import replace
import re

def clean_entry(entry: fasta.FastaEntry):
    """
    Clean the sequence in the given entry by removing all non-nucleotide-base characters.
    """
    sequence = re.sub(r"[^" + dna.ALL_BASES + r"]", "", entry.sequence)
    return replace(entry, sequence=sequence)
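
The regular expression used in `clean_entry` can be tried on a plain string without any library objects. The sketch below assumes a base alphabet of just `ACGT`; `ASSUMED_BASES` and `strip_non_bases` are hypothetical names, and the actual contents of `dna.ALL_BASES` may differ.

```python
import re

# Assumption: only plain A/C/G/T bases are kept; dnadb's `dna.ALL_BASES`
# may contain a different set of symbols.
ASSUMED_BASES = "ACGT"
NON_BASE = re.compile(f"[^{ASSUMED_BASES}]")


def strip_non_bases(sequence: str) -> str:
    """Remove every character that is not in ASSUMED_BASES."""
    return NON_BASE.sub("", sequence)


aligned = "T--AC--GT-AG-GGT----GCG"
print(strip_non_bases(aligned))  # prints TACGTAGGGTGCG
```

Note that a negated character class removes anything outside the alphabet, not only dashes, so lowercase letters or ambiguity codes would also be dropped under this assumption.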

In [56]:
raw_entry = next(fasta.entries(sfd_fasta_file))
cleaned_entry = clean_entry(raw_entry)

print("Raw entry:")
print(raw_entry)
print()
print("Cleaned entry:")
print(cleaned_entry)
print()

Raw entry:
>M03064_22_000000000-C8H43_1_1101_10211_8648 CMFP13	Otu000002	NumRep=1
T--AC--GT-AG-GGT----GCG-A-G----C--G--T---T--GT-C-CGG-AA-----TT-A--T-T--GG-GC------GT--A-----AA-GA-GC-TT-------G-TA-G-G-C-G---------------G--T-CT-G-T-C-------------GC----G-T-C-C----------------G-C-T--G--TG--A-AA-AT--C-C-GG-G-G------------------------CT-C-AA-------------------------C-C-C-C-G-G-A--C-T----T-G--C-A---G--T-----------------------G--GG-T-A---C-----------G--G-G--CA--G-A-C------------------------------------------------------------------------------T-A-G-A-G-T--G-----T-GG------TA-G-G-------------------G-G-A-G---AC-T------------------------------GG--A--ATT--------------------------C-C-T-G-GT--GT-A-G-CG-GT--G---------A--A-A-----------------TG-C-GC-AG--AT-A-TC-------------------A-G------G-A------G-G-A-AC-A-CC-----------------GA--T--G--GC-GAA-G--G-C----A--------G--G-T-C-T---CTG--------G--GC-C-A-----------------------------C-T--------A-C-T--GA--CG-----C--------T-G--A-GA--A-G-CG-A--AA-G-C------G-TG--GG-G


Another convenience of the entries generator is that it works well with Python's built-in `map` function, which lets you apply an operation to each entry as it is read. The following cleans each entry as you iterate over it.

In [57]:
for cleaned_entry in map(clean_entry, fasta.entries(sfd_fasta_file)):
    print(cleaned_entry)
    break

>M03064_22_000000000-C8H43_1_1101_10211_8648 CMFP13	Otu000002	NumRep=1
TACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTTGTAGGCGGTCTGTCGCGTCCGCTGTGAAAATCCGGGGCTCAACCCCGGACTTGCAGTGGGTACGGGCAGACTAGAGTGTGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCCACTACTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGG
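
As a small illustration of why this works on large files, the sketch below shows that `map` is itself lazy, and that `itertools.islice` can take the first few results instead of using `break`. The `fake_entries` generator and `remove_gaps` function are hypothetical stand-ins, not part of dnadb.

```python
from itertools import islice

produced = []  # records which items the generator actually yielded


def fake_entries():
    """Hypothetical stand-in for a lazy entry generator."""
    for sequence in ["A-C", "G--T", "T-T", "C-A-G"]:
        produced.append(sequence)
        yield sequence


def remove_gaps(sequence: str) -> str:
    return sequence.replace("-", "")


# map() is lazy as well, so islice() stops the whole pipeline after two items.
first_two = list(islice(map(remove_gaps, fake_entries()), 2))
print(first_two)  # ['AC', 'GT']
print(produced)   # ['A-C', 'G--T'] (the remaining items were never read)
```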


---

## FASTA DB

![img](./images/fasta_to_fastadb.png)

FASTA files can get quite large, and there may be many different FASTA files that need to be interfaced with. Loading them all into memory can quickly exhaust resources. To get around this, the library introduces the `.fasta.db` file format, which lets you access all of the sequence data directly from disk without loading it into memory. It also allows random reads anywhere in the file.
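
To give an intuition for how random reads from disk can work in general, here is a minimal sketch that records the byte offset of each record and then seeks directly to the one requested. This is not the `.fasta.db` format, whose internals are not described here; `build_offsets` and `read_entry` are hypothetical helpers operating on a throwaway temporary file.

```python
import os
import tempfile


def build_offsets(path: str) -> list:
    """Record the byte offset of every header line ('>') in a FASTA file."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            position = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                offsets.append(position)
    return offsets


def read_entry(path: str, offsets: list, index: int) -> tuple:
    """Seek straight to entry `index` and read only that record."""
    with open(path, "rb") as f:
        f.seek(offsets[index])
        header = f.readline().decode().strip()[1:]
        sequence_lines = []
        for line in f:
            if line.startswith(b">"):
                break  # reached the next record
            sequence_lines.append(line.decode().strip())
    return header, "".join(sequence_lines)


# Write a tiny throwaway FASTA file to demonstrate.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">a\nACGT\n>b\nGG\nTT\n>c\nCCC\n")
    demo_path = tmp.name

offsets = build_offsets(demo_path)
third = read_entry(demo_path, offsets, 2)   # jumps directly to the third record
second = read_entry(demo_path, offsets, 1)
os.remove(demo_path)
print(len(offsets), second, third)
```

Only the list of offsets is held in memory; the sequence data stays on disk until a specific entry is requested.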

### Conversion

The following shows how to clean and convert a FASTA file to a FASTA DB.

In [3]:
from dnadb import fasta

In [59]:
with fasta.FastaDbFactory("/tmp/sfd.fasta.db") as factory:
    entries = fasta.entries(sfd_fasta_file)     # Get the entries from the FASTA file
    cleaned_entries = map(clean_entry, entries) # For each entry, pass it through clean_entry.
    factory.write_entries(cleaned_entries)     # Write the entries to the database.

### Interfacing

The following cells demonstrate how one can interface with a FASTA DB.

In [4]:
# Open the FASTA DB
fasta_db = fasta.FastaDb("/tmp/sfd.fasta.db")

In [65]:
# Get the number of entries/sequences in the DB
len(fasta_db)

1883478

In [78]:
# Get an entry at an arbitrary index
print(fasta_db[5])

>M03064_41_000000000-CRMWT_1_2104_7498_19863 CMFP363	Otu000002	NumRep=1
TACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGCTTGTAGGCGGTTTGTCGCGTCTGCTGTGAAATCCCGGGGCTCAACCCCGGACTTGCAGTGGGTACGGGCAGACTAGAGTGTGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCCACTACTGACGCTGAGAAGCGAAAGCATGGGGAGCGAACAGG


In [79]:
# Get an entry from its identifier
print(fasta_db["M03064_45_000000000-CRYB8_1_1103_5880_16744"])

>M03064_45_000000000-CRYB8_1_1103_5880_16744 CMFP415	Otu000002	NumRep=1
TACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTTGTAGGCGGTTTGTCGCGTCTGCTGTGAAAATCCGGGGCTCAACCCCGGACTTGCAGTGGGTACGGGCAGACTAGAGTGTGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCCACTACTGACGCTGAGAAGCGAAAGCATTAGGAGCGAACAGT


In [80]:
# Loop over entries in the DB
for entry in fasta_db:
    print(entry)
    break

>M03064_22_000000000-C8H43_1_1101_10211_8648 CMFP13	Otu000002	NumRep=1
TACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTTGTAGGCGGTCTGTCGCGTCCGCTGTGAAAATCCGGGGCTCAACCCCGGACTTGCAGTGGGTACGGGCAGACTAGAGTGTGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCCACTACTGACGCTGAGAAGCGAAAGCGTGGGGAGCAAACAGG


---

## Sample Interface

Built on top of the FASTA DB, the Sample interface provides additional features for interacting with FASTA DBs.

In [5]:
from dnadb import fasta, sample

In [7]:
sample_fasta = sample.load_fasta("/tmp/sfd.fasta.db", name="Sample A")
sample_fasta.name

'Sample A'

The Sample interface can also draw sequences at random from the sample.

In [13]:
for entry in sample_fasta.sample(3):
    print(entry)
    print()

>M03064_53_000000000-J6PGC_1_1106_19270_23222 CMFP124	Otu002281	NumRep=3
TACGAAGGGGGCTAGCGTTGCTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGCTTGCCAAGTCAGGCGTGAAATTCCTGGGCTCAACCTGGGGACTGCGCTTGATACTGGCTGAGCTTGAGGATGGAAGAGGCTCGTGGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCAACCTGGTCCATTACTGACGCTGAGGCGCGAAAGCGTGAGGAGCAAACAGG

>M03064_22_000000000-C8H43_1_2110_14193_24272 CMFP70	Otu007455	NumRep=1
TACGGAGGGGGCTAGCGTTATTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGATTGGAAAGTCAGAGGTGAAATCCCAGGGCTCAACCTTGGAACTGCCTTTGAAACTCCCAGTCTTGAGGTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGCACGAAAGCATGGGGAGCAAACAGG

>M03064_61_000000000-CKV98_1_1104_5261_18907 CMFP874	Otu009109	NumRep=2
TACGTAGGGTCCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTTGTCACGTCGGGAGTGAAAACTCGGGGCTCAACCCCGAGCCTGCTTCCGATACGGGCAGACTAGAGGGATGCAGGGGAGAACGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCATTACCTGACGCTGAGGAGCGAAAGCATGGGGAGCGAACAGG
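
Conceptually, drawing random entries only requires a store that supports `len()` and integer indexing, which the FASTA DB provides. The sketch below illustrates that idea with a hypothetical `TinyDb` class and `draw_random` function; it samples uniformly with replacement, which is an assumption and may not match how `sample` behaves in dnadb.

```python
import random


class TinyDb:
    """Hypothetical stand-in for an indexable sequence store (len() and [i])."""

    def __init__(self, sequences):
        self._sequences = list(sequences)

    def __len__(self):
        return len(self._sequences)

    def __getitem__(self, index):
        return self._sequences[index]


def draw_random(db, n, rng):
    """Pick n entries uniformly at random (with replacement) by index."""
    return [db[rng.randrange(len(db))] for _ in range(n)]


tiny_db = TinyDb(["ACGT", "GGCC", "TTAA", "CAGT"])
drawn = draw_random(tiny_db, 3, random.Random(0))  # seeded for repeatability
print(drawn)
```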



---

## Multiplexed Samples

![img](./images/sample_mappings.png)

One of the big features of the FASTA DB format is the ability to create samples that map to corresponding sequences present in the FASTA DB. Not only can these samples specify which sequences occur, but they can also indicate their abundance. This drastically reduces the disk space required, and also allows storing all of your sequences in one location.
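
The idea can be sketched with plain Python data structures: the unique sequences are stored once, and each sample stores only which sequences it contains and how abundant they are. All names and numbers below are made up for illustration, and weighting the random draw by abundance is an assumption about how such a mapping might be used, not a description of dnadb's internals.

```python
import random

# One shared pool of unique sequences, stored once.
sequence_pool = ["ACGT", "GGCC", "TTAA", "CAGT"]

# Each sample only stores {pool index: abundance}; the values are made up.
sample_mappings = {
    "sample_1": {0: 120, 2: 3},
    "sample_2": {1: 7, 2: 1, 3: 40},
}


def total_reads(name):
    """Total abundance recorded for one sample."""
    return sum(sample_mappings[name].values())


def draw_from_sample(name, n, rng):
    """Draw n sequences from one sample, weighted by abundance."""
    mapping = sample_mappings[name]
    indices = list(mapping.keys())
    weights = list(mapping.values())
    chosen = rng.choices(indices, weights=weights, k=n)
    return [sequence_pool[i] for i in chosen]


weighted_draw = draw_from_sample("sample_1", 5, random.Random(0))
print(weighted_draw)
print(total_reads("sample_2"))  # 48
```

Storing a pair of integers per sequence per sample is far smaller than repeating the sequence text in every sample, which is where the disk savings come from.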

In [16]:
from dnadb import fasta, sample

### Creating the Corresponding Index DB

The index maps integer identifiers to their corresponding FASTA IDs, and also lets you refer to a FASTA ID through a custom key (an alias). Such keys are useful for OTU labels like the ones given with this dataset. An index is currently required for creating sample mappings, as demonstrated later, but this will likely no longer be the case in the future. For this dataset, we are given an OTU list that maps each ASV label to a corresponding FASTA ID. The sample information refers to these labels rather than to the FASTA IDs, so the labels need to be stored in the index.

In [99]:
with open(otu_list_path) as f:
    print(f.readline()[:100])
    print(f.readline()[:100])

label	numASVs	ASV0000001	ASV0000002	ASV0000003	ASV0000004	ASV0000005	ASV0000006	ASV0000007	ASV000000
asv	3748806	M03064_63_000000000-JHTY5_1_1110_16981_26249	M03064_63_000000000-JHTY5_1_2103_18141_1718


Here we load the OTU map and convert it to a Python dictionary. The first two columns (`label` and `numASVs`) are skipped.

In [96]:
with open(otu_list_path) as f:
    keys = f.readline().strip().split('\t')
    values = f.readline().strip().split('\t')
otu_index = dict(zip(keys[2:], values[2:]))
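
To see what the resulting dictionary looks like, the same two-line parsing can be run on a tiny made-up excerpt. The read names `read_a` and `read_b` are placeholders, not values from the real file.

```python
import io

# Made-up two-line excerpt in the same tab-separated layout as the .list file.
toy_list = io.StringIO(
    "label\tnumASVs\tASV0000001\tASV0000002\n"
    "asv\t2\tread_a\tread_b\n"
)
toy_keys = toy_list.readline().strip().split("\t")
toy_values = toy_list.readline().strip().split("\t")

# Skip the first two columns (label, numASVs) and pair each ASV label
# with the FASTA ID found in the same column of the second line.
toy_index = dict(zip(toy_keys[2:], toy_values[2:]))
print(toy_index)  # {'ASV0000001': 'read_a', 'ASV0000002': 'read_b'}
```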

Finally, we create the actual index on disk. We write one entry at a time, skipping any FASTA ID that does not exist in the FASTA DB.

In [97]:
with fasta.FastaIndexDbFactory("/tmp/sfd.fasta.index.db") as factory:
    for otu, fasta_id in otu_index.items():
        if fasta_id not in fasta_db:
            continue
        factory.write_entry(fasta_db[fasta_id], key=otu)

In [105]:
index_db = fasta.FastaIndexDb("/tmp/sfd.fasta.index.db")

### Creating the Sample DB

Here we create the sample mappings. First we examine how the sample information is given to us. In this file, each row represents a sample: the `Group` column holds the sample name, and the remaining columns give the abundance of each OTU in that sample.

In [109]:
with open(otu_shared_path) as f:
    print(f.readline()[:50])
    print(f.readline()[:50])
    print(f.readline()[:50])

label	Group	numASVs	ASV0000001	ASV0000002	ASV00000
asv	ANSP6neg	3748806	121545	0	0	0	0	0	0	0	0	0	0	0	
asv	ANSP8neg	3748806	33	1	1	0	2	0	1	0	0	1	0	0	0	0	


In [115]:
with open(otu_shared_path) as f:
    header = f.readline().strip().split('\t')
    rows = [line.strip().split('\t') for line in f]
len(rows) # number of samples

887
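
The row layout can be made more tangible by parsing a tiny made-up table in the same format into a sparse dictionary of non-zero abundances per sample. The sample names and counts below are invented, and `toy_abundances` is a hypothetical name.

```python
import io

# Made-up excerpt in the same tab-separated layout as the .shared file.
toy_shared = io.StringIO(
    "label\tGroup\tnumASVs\tASV0000001\tASV0000002\tASV0000003\n"
    "asv\tsampleA\t3\t12\t0\t5\n"
    "asv\tsampleB\t3\t0\t0\t1\n"
)
toy_header = toy_shared.readline().strip().split("\t")

toy_abundances = {}
for line in toy_shared:
    fields = line.strip().split("\t")
    sample_name = fields[1]
    # Columns 0-2 are label, Group and numASVs; abundances start at column 3.
    toy_abundances[sample_name] = {
        label: int(count)
        for label, count in zip(toy_header[3:], fields[3:])
        if int(count) > 0
    }

print(toy_abundances)
```

Keeping only the non-zero counts mirrors the later step where entries with an abundance of zero are skipped.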

In [119]:
# Find indices of OTU columns that are present in the FASTA DB.
indices = [i for i in range(3, len(header)) if index_db.contains_key(header[i])]

# Map those indices to their corresponding FASTA IDs.
index_to_fasta_id_map = {i: index_db.key_to_fasta_id(header[i]) for i in indices}

In [120]:
with sample.SampleMappingDbFactory("/tmp/sfd.fasta.mapping.db") as factory:
    for row in rows:
        sample_name = row[1]
        sample_factory = sample.SampleMappingEntryFactory(sample_name, index_db)
        for index, fasta_id in index_to_fasta_id_map.items():
            abundance = int(row[index])
            if abundance == 0:
                continue # don't add empty abundance
            entry = fasta_db[fasta_id]
            sample_factory.add_entry(entry, abundance)
        factory.write_entry(sample_factory.build())

### Demultiplexing the FASTA DB

In [17]:
samples = sample.load_multiplexed_fasta("/tmp/sfd.fasta.db", "/tmp/sfd.fasta.mapping.db")
len(samples)

887

In [18]:
samples[:5]

(DemultiplexedFastaSample: ANSP6neg,
 DemultiplexedFastaSample: ANSP8neg,
 DemultiplexedFastaSample: CMFP1,
 DemultiplexedFastaSample: CMFP10,
 DemultiplexedFastaSample: CMFP100)

In [25]:
for entry in samples[0].sample(3):
    print(entry.sequence)
    print()

TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGTAAGACAGAGGTGAAATCCCCGGGCTCAACCTGGGAACGGCCTTTGTGACTGCATAGCTAGAGTACGGTAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGACCTGTACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG

TACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGACTGGAAAGTCAGAGGTGAAATCCCAGGGCTCAACCTTGGAACTGCCTTTGAAACTATCAGTCTGGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG

TACGAAGGGGGCTAGCGTTGCTCGGAATTACTGGGCGTAAAGGGCGCGTAGGCGGACATTTAAGTCAGGGGTGAAATCCCAGAGCTCAACTCTGGAACTGCCTTTGATACTGGATGTCTTGAGTGTGAGAGAGGTATGTGGAACTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACATACTGGCTCATTACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGG

