# Importing and viewing sequence files in pydna
> Visit the full library documentation [here](https://bjornfjohansson.github.io/pydna/)

pydna can work with FASTA, GenBank, EMBL, and SnapGene files (.fasta, .gb, .embl, .dna). Specifically, pydna provides ways to read these file types and store each sequence as a record (i.e., a `Dseqrecord` object) that you can view and work with.

Importing files into pydna is simple. pydna provides the `parse` function, which reads all sequences in a file into a list. As input, it takes either the path to a file on your computer or a string with the file content. The following code shows how to use `parse` to import a hypothetical downloaded FASTA file; other file types can be imported in the same way.

manu_todo:
* Use an example in which you read from a string (e.g. write a string with a FASTA file and read it).
* Make two sections here: read from file and read from string. Reading from a string is an important use case if people are, for example, reading sequences that they get from the GenBank API.
* Do not use pseudopaths (/your/absolute...) in the examples. Instead, simply put an actual file. (in other words the second example is enough)
* Don't use absolute paths, instead of `/Users/PeilunXie/Desktop/pydna/docs/notebooks/sequence.gb` use `./sequence.gb` or `../etc` if it's in a directory above the current one. You can read about relative paths or ask chatgpt.
* I added an extra info section for some more control of the parsing

In [4]:
from pydna.parsers import parse

file_path = '/your/absolute/path/to/the/file/file.fasta'
files = parse(file_path)
files[0].format("fasta")

IndexError: list index out of range

The last line of code uses the `format` method to view the imported file in your Python interpreter (e.g., the interactive window in Visual Studio Code). Note that `parse` returns a `list`, so `[0]` is needed to take its first element. If your FASTA file contains multiple sequences, index the list accordingly (e.g., `[0]`, `[1]`, ...). The `IndexError` shown above most likely arises because the placeholder path does not point to an existing file: `parse` then finds no sequences and returns an empty list, so `files[0]` fails.

Another example, using a real GenBank file, is shown below. The GenBank file can be downloaded [here](https://www.ncbi.nlm.nih.gov/nucleotide/U49845). I did this on my Mac and placed the sequence.gb file in a series of subfolders on my Desktop. Replace my file path with yours to access the file.

In [None]:
from pydna.parsers import parse

file_path = '/Users/PeilunXie/Desktop/pydna/docs/notebooks/sequence.gb'
files = parse(file_path)
files[0].format("gb")


LOCUS       SCU49845                5028 bp    DNA     linear   PLN 29-OCT-2018
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (brewer's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;
            Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
   PUBMED   8846915
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Biology, Yale University, New Haven, CT
            06520, USA
FEATURES            
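Since `parse` appears to treat a nonexistent path as file content rather than raising an error, it can help to check the path before parsing. The helper below, `require_file`, is a hypothetical name introduced for this sketch and is not part of pydna; it uses only the standard library. It also illustrates a relative path (`./sequence.gb`), which is resolved against the current working directory and so does not depend on one computer's folder layout.

```python
from pathlib import Path


def require_file(path_str):
    """Return path_str as a Path, or raise FileNotFoundError if it is not a file.

    Hypothetical helper for this sketch; not part of pydna.
    """
    path = Path(path_str)
    if not path.is_file():
        # resolve() shows the absolute location that was actually checked
        raise FileNotFoundError(f"No sequence file at {path.resolve()}")
    return path


try:
    # Relative to the directory the interpreter was started from
    gb_path = require_file("./sequence.gb")
    print(f"Found {gb_path}")
except FileNotFoundError as err:
    print(err)
```

If the file is found, `str(gb_path)` can be passed to `parse` in the same way as the path strings used in the examples above.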

Now you can work with the sequence record in pydna through the `Dseqrecord` class. `Dseqrecord` provides ways to select regions of interest on the sequence, add new features to the record, remove features, and create new `Dseqrecord` objects to store and export your changes. Please refer to the `Dseq_Features` notebook for more information.

## Extra info

Note that pydna's `parse` guesses whether the argument passed is a file path or a string, and also guesses the file type from the content, so it can behave unexpectedly if your files are not well formatted. For more control over the parsing of sequences, you can use Biopython's `parse` from `Bio.SeqIO`, and then instantiate a `Dseqrecord` from the resulting Biopython `SeqRecord`.

In [None]:
from Bio.SeqIO import parse as seqio_parse
from pydna.dseqrecord import Dseqrecord

file_path = './sequence.gb'

# Extract the first SeqRecord from the SeqIO.parse iterator
seq_record = next(seqio_parse(file_path, 'genbank'))

# Biopython stores circularity in the 'topology' annotation of the SeqRecord
is_circular = seq_record.annotations.get('topology') == 'circular'

# Convert into a Dseqrecord
dseq_record = Dseqrecord(seq_record, circular=is_circular)

dseq_record

Dseqrecord(-5028)