# File IO

When scientists share or store DNA sequences, they use standard file formats such as `FASTA`.


## FASTA

Here's what a simple FASTA file looks like:

```shell
> seq1 | Human beta-globin gene
ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
```

- The first line starts with `>` and contains a description and ID.
- The following lines contain the sequence, which is usually wrapped at 60-80 characters per line.
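
As a rough illustration of the header layout described above, the sketch below splits a header line into an ID and a description. The helper name `split_header` is hypothetical (it is not part of any library), and it assumes the ID is the first whitespace-delimited token after the leading `>`.

```python
from typing import Tuple


def split_header(line: str) -> Tuple[str, str]:
    """Split a FASTA header line into (id, description).

    Hypothetical helper for illustration only: assumes the ID is the
    first whitespace-delimited token after the leading '>'.
    """
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    # Drop the '>' and split once: first token is the ID, the rest is free text
    parts = line[1:].strip().split(maxsplit=1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


print(split_header("> seq1 | Human beta-globin gene"))
```

Real-world headers follow many different conventions, so treating everything after the first token as free-form description is only one reasonable choice.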


FASTA is the de facto standard format for biological sequences (DNA, RNA, and proteins). It is simple, human-readable plain text, and is supported by virtually every bioinformatics tool.

A single FASTA file often contains **many** sequences, e.g. whole genomes or protein databases:

```shell
> geneA
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
> geneB
ATGACCATGATTACGGATTCAAGGAGGACGACGAGCGTA
```
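
Because every record begins with a header line, the records in a multi-sequence file can be counted by looking for lines that start with `>`. The snippet below is a minimal sketch of that idea on an in-memory string; the variable names are illustrative only.

```python
fasta_text = """\
>geneA
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
>geneB
ATGACCATGATTACGGATTCAAGGAGGACGACGAGCGTA
"""

# Each record starts with exactly one '>' line, so collecting those lines
# gives one entry per record.
headers = [line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")]

print(len(headers), headers)
```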

Let's start by creating a class to hold the sequence along with the metadata.

In [9]:
from dataclasses import dataclass, field
from typing import Dict, Optional, Union
from dna import DNASequence

@dataclass
class SeqRecord:
    """Data Structure to represent a single biological sequence (DNA, RNA or Protein) along with its metadata"""
    seq: DNASequence # The sequence object
    id: str # A unique identifier (often from FASTA header)
    name: Optional[str] = "" # A short name or label
    description: Optional[str] = "" # A human readable line of information
    annotations: Dict[str, Union[str, float, int]] = field(default_factory=dict) # Optional dict for extra info

    def __len__(self):
        return len(self.seq.sequence)

    def __getitem__(self, key):
        """Allow slicing like record[0:10] -> returns a new SeqRecord"""
        sliced_seq = DNASequence(self.seq.sequence[key])
        return SeqRecord(
            seq = sliced_seq,
            id = self.id,
            name = self.name,
            description = f"{self.description} (slice)",
            annotations = self.annotations.copy()
        )

    def __str__(self):
        seq_str = self.seq.sequence
        # Split into 60-character lines (a common FASTA convention)
        wrapped_seq = "\n".join(seq_str[i:i+60] for i in range(0, len(seq_str), 60))
        return f">{self.id} {self.description}\n{wrapped_seq}"

    def __repr__(self):
        return f"<SeqRecord(id='{self.id}', length={len(self)})>"

In [10]:
seq = DNASequence.random(24)
seq

<DNASequence(length=24)>

In [11]:
record = SeqRecord(seq, id="seq1", description="Test gene sequence")
print(record)
print("Length:", len(record))
print("Slice:", record[0:5].seq.sequence)

>seq1 Test gene sequence
TTGTGAATTTCAAGAACTGACTAG
Length: 24
Slice: TTGTG
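
The 24-base test sequence is too short to show the line wrapping done in `__str__`. The sketch below isolates that step on a plain string; `wrap_sequence` is a hypothetical helper name, and the default width of 60 is an assumption that mirrors the class above.

```python
def wrap_sequence(seq: str, width: int = 60) -> str:
    """Break a sequence string into lines of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Step through the string in chunks of `width`; the last chunk may be shorter
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


wrapped = wrap_sequence("ATGC" * 40)  # 160 bases
print(wrapped)
print([len(line) for line in wrapped.splitlines()])
```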


## Parser

In [None]:
from typing import Iterator
from dna import DNASequence

class SeqIO:
    @staticmethod
    def parse(filepath: str, format: str = "fasta") -> Iterator[SeqRecord]:
        """Parse a FASTA file and yield SeqRecord objects"""
        if format.lower() != "fasta":
            raise ValueError("Only FASTA format is currently supported")

        with open(filepath, 'r') as file:
            seq_id = None
            description = ""
            seq_lines = []

            for line in file:
                line = line.strip()
                if not line:
                    continue  # Skip blank lines

                if line.startswith('>'):
                    # If there's an existing record, yield it before starting the next one
                    if seq_id:
                        yield SeqRecord(
                            seq = DNASequence("".join(seq_lines)),
                            id = seq_id,
                            description = description
                        )
                    # Start a new record: the first token is the ID, the rest is the description
                    header = line[1:].strip()
                    parts = header.split(maxsplit=1)
                    seq_id = parts[0]
                    description = parts[1] if len(parts) > 1 else ""
                    seq_lines = []  # Reset the buffer so sequences don't run together
                else:
                    seq_lines.append(line)

        # Don't forget the last record
        if seq_id:
            yield SeqRecord(
                seq = DNASequence("".join(seq_lines)),
                id = seq_id,
                description = description
            )
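
To try the parsing logic without a file on disk or the `DNASequence` class, here is a minimal self-contained sketch of the same state-machine idea. It is not the `SeqIO` class itself: `iter_fasta` is a hypothetical name, it reads from any open text handle, and it yields plain `(id, description, sequence)` tuples instead of `SeqRecord` objects.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, description, sequence) tuples from an open FASTA handle."""
    seq_id: Optional[str] = None
    description = ""
    seq_lines: List[str] = []

    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if seq_id is not None:
                yield seq_id, description, "".join(seq_lines)
            parts = line[1:].strip().split(maxsplit=1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            seq_lines = []  # reset the buffer for the new record
        elif seq_id is not None:
            # Sequence lines before the first header are ignored
            seq_lines.append(line)

    # The final record has no following header to trigger its yield
    if seq_id is not None:
        yield seq_id, description, "".join(seq_lines)


sample = io.StringIO(
    ">seq1 Example sequence 1\nATGCGT\nACGTAG\n\n>seq2 Example sequence 2\nTTGGCATAGCTA\n"
)
records = list(iter_fasta(sample))
for rec in records:
    print(rec)
```

Because it is a generator, only one record is held in memory at a time, which matters for genome-sized files.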

Let's create a small FASTA file for testing:

In [30]:
%%writefile test_sequences.fasta
>seq1 Example sequence 1
ATGCGTACGTAG
>seq2 Example sequence 2
TTGGCATAGCTA

Overwriting test_sequences.fasta


In [31]:
for record in SeqIO.parse("test_sequences.fasta", "fasta"):
    print(record)