## DIY FASTA parser

Now we'd like to do some simple sequence analysis on our BLAST hits.

A simple FASTA parser can be written in ~10 lines of Python (probably fewer).


Biological sequences are often stored in FASTA format.
It looks like this:
```
>sp|P0C8E7|EVA1_RHISA Evasin-1 OS=Rhipicephalus sanguineus PE=1 SV=1
MTFKACIAIITALCAMQVICEDDEDYGDLGGCPFLVAENKTGYPTIVACKQDCNGTTETA
PNGTRCFSIGDEGLRRMTANLPYDCPLGQCSNGDCIPKETYEVCYRRNWRDKKN
>sp|P0C8E9|EVA4_RHISA Evasin-4 OS=Rhipicephalus sanguineus PE=1 SV=1
MAFKYWFVFAAVLYARQWLSTKCEVPQMTSSSAPDLEEEDDYTAYAPLTCYFTNSTLGLL
APPNCSVLCNSTTTWFNETSPNNASCLLTVDFLTQDAILQENQPYNCSVGHCDNGTCAGP
PRHAQCW
```

Each sequence is preceded by a single-line 'header', beginning with a `>`.
Headers can contain anything, but usually contain identifiers and free-text.
Sequences can be split across multiple lines.

In [1]:
# Start simple - just print the lines in the file
# The evasins_5.fasta file contains just 5 evasin sequences, to make the Jupyter output shorter
with open('evasins_canonical.fasta', 'r') as handle:
    for line in handle:
        print(line)

>sp|P0C8E7|EVA1_RHISA Evasin-1 OS=Rhipicephalus sanguineus PE=1 SV=1

MTFKACIAIITALCAMQVICEDDEDYGDLGGCPFLVAENKTGYPTIVACKQDCNGTTETA

PNGTRCFSIGDEGLRRMTANLPYDCPLGQCSNGDCIPKETYEVCYRRNWRDKKN

>sp|P0C8E9|EVA4_RHISA Evasin-4 OS=Rhipicephalus sanguineus PE=1 SV=1

MAFKYWFVFAAVLYARQWLSTKCEVPQMTSSSAPDLEEEDDYTAYAPLTCYFTNSTLGLL

APPNCSVLCNSTTTWFNETSPNNASCLLTVDFLTQDAILQENQPYNCSVGHCDNGTCAGP

PRHAQCW
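
The blank lines in the output above appear because each line read from the file still ends with its own newline character, and `print` then adds a second one. Below is a minimal sketch of two ways to avoid this. It assumes an in-memory `io.StringIO` object as a stand-in for the open file, and the two short records are made up for illustration rather than taken from the real FASTA file.

```python
import io

# Stand-in for an open FASTA file; these short records are illustrative only
fasta_text = ">seq1 first example\nMTFK\nACIA\n>seq2 second example\nMAFK\n"

# Option 1: strip the trailing newline before printing
stripped_lines = []
with io.StringIO(fasta_text) as handle:
    for line in handle:
        stripped = line.rstrip('\n')
        stripped_lines.append(stripped)
        print(stripped)

# Option 2: keep the line as-is and tell print not to add its own newline
with io.StringIO(fasta_text) as handle:
    for line in handle:
        print(line, end='')
```

Either option gives one printed line per line of the file. Stripping is usually the more useful habit, because the cleaned line can then be stored or compared without a stray `\n`.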



## Challenge

Write some code to print just the text in the header lines.

e.g.

```
sp|P0C8E7|EVA1_RHISA Evasin-1 OS=Rhipicephalus sanguineus PE=1 SV=1

.. etc ..
```

## Solution

In [2]:
with open('evasins_canonical.fasta', 'r') as handle:
    for line in handle:
        if line.startswith('>'):
            print(line[1:])

sp|P0C8E7|EVA1_RHISA Evasin-1 OS=Rhipicephalus sanguineus PE=1 SV=1

sp|P0C8E9|EVA4_RHISA Evasin-4 OS=Rhipicephalus sanguineus PE=1 SV=1
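
Since headers usually combine an identifier with free text, it can be handy to separate the two. The sketch below is one possible approach, not part of the original notebook: `split_header` is a hypothetical helper that treats everything up to the first whitespace as the identifier. The final step assumes a UniProt-style identifier with three `|`-separated fields, which will not hold for FASTA files from other sources.

```python
def split_header(header):
    """Split header text (without the leading '>') into (identifier, description)."""
    # The identifier is everything up to the first whitespace; the rest is free text
    parts = header.strip().split(None, 1)
    identifier = parts[0] if parts else ''
    description = parts[1] if len(parts) > 1 else ''
    return identifier, description

header = 'sp|P0C8E7|EVA1_RHISA Evasin-1 OS=Rhipicephalus sanguineus PE=1 SV=1'
identifier, description = split_header(header)
print(identifier)
print(description)

# Assumption: UniProt-style identifiers have exactly three '|'-separated fields
database, accession, entry_name = identifier.split('|')
print(accession)
```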



In [4]:
# Now, let's expand this to parse sequences into a dictionary,
# with the header line as the dictionary key

seqs = dict()  # an empty dictionary
seq = ''
seq_id = None
with open('evasins_canonical.fasta', 'r') as handle:
    for line in handle:
        if line.startswith('>'):
            # A new header: store the previous record (if any) before starting the next
            if seq_id is not None:
                seqs[seq_id] = seq
            seq_id = line[1:].strip()
            seq = ''  # reset, otherwise sequences accumulate across records
        else:
            seq += line.strip()
    # Store the final record; the check avoids a None key if no header was seen
    if seq_id is not None:
        seqs[seq_id] = seq

for seq_id, seq in seqs.items():
    print(seq_id)
    print(seq)
    print()


sp|P0C8E9|EVA4_RHISA Evasin-4 OS=Rhipicephalus sanguineus PE=1 SV=1
MAFKYWFVFAAVLYARQWLSTKCEVPQMTSSSAPDLEEEDDYTAYAPLTCYFTNSTLGLLAPPNCSVLCNSTTTWFNETSPNNASCLLTVDFLTQDAILQENQPYNCSVGHCDNGTCAGPPRHAQCW

sp|P0C8E7|EVA1_RHISA Evasin-1 OS=Rhipicephalus sanguineus PE=1 SV=1
MTFKACIAIITALCAMQVICEDDEDYGDLGGCPFLVAENKTGYPTIVACKQDCNGTTETAPNGTRCFSIGDEGLRRMTANLPYDCPLGQCSNGDCIPKETYEVCYRRNWRDKKN



We should wrap this up as a function to make it easy to reuse. Also, let's use the `OrderedDict` data structure from the `collections` module in the standard library so we maintain the order of the records in the original FASTA file. (On Python 3.7 and later a plain `dict` also preserves insertion order, so `OrderedDict` matters mainly for older versions.)

(*Bonus points*: what would be a better thing to pass into the function instead of the `filename` as a string?)

In [5]:
from collections import OrderedDict

def parse_fasta(filename):
    """Parse a FASTA file into an OrderedDict mapping header text to sequence."""
    seqs = OrderedDict()
    seq = ''
    seq_id = None
    with open(filename, 'r') as handle:
        for line in handle:
            if line.startswith('>'):
                # A new header: store the previous record (if any) before starting the next
                if seq_id is not None:
                    seqs[seq_id] = seq
                seq_id = line[1:].strip()
                seq = ''  # reset, otherwise sequences accumulate across records
            else:
                seq += line.strip()
        # Store the final record; the check avoids a None key if no header was seen
        if seq_id is not None:
            seqs[seq_id] = seq
    return seqs

seqs = parse_fasta('evasins_canonical.fasta')

for seq_id, seq in seqs.items():
    print(seq_id)
    print(seq)
    print()

sp|P0C8E7|EVA1_RHISA Evasin-1 OS=Rhipicephalus sanguineus PE=1 SV=1
MTFKACIAIITALCAMQVICEDDEDYGDLGGCPFLVAENKTGYPTIVACKQDCNGTTETAPNGTRCFSIGDEGLRRMTANLPYDCPLGQCSNGDCIPKETYEVCYRRNWRDKKN

sp|P0C8E9|EVA4_RHISA Evasin-4 OS=Rhipicephalus sanguineus PE=1 SV=1
MAFKYWFVFAAVLYARQWLSTKCEVPQMTSSSAPDLEEEDDYTAYAPLTCYFTNSTLGLLAPPNCSVLCNSTTTWFNETSPNNASCLLTVDFLTQDAILQENQPYNCSVGHCDNGTCAGPPRHAQCW
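
One possible answer to the bonus question is to pass in an open file handle, or more generally any iterable of lines, rather than a filename. The sketch below illustrates that idea under a few assumptions: `parse_fasta_handle` is a hypothetical name, the test data are made-up short records, and the choices to skip blank lines and ignore sequence lines that appear before the first header are design decisions of this example, not requirements of the FASTA format.

```python
import io
from collections import OrderedDict

def parse_fasta_handle(handle):
    """Parse FASTA records from any iterable of lines (hypothetical variant of parse_fasta)."""
    seqs = OrderedDict()
    seq_id = None
    chunks = []  # sequence lines of the current record, joined once at the end
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if seq_id is not None:
                seqs[seq_id] = ''.join(chunks)
            seq_id = line[1:].strip()
            chunks = []
        elif seq_id is not None:
            # Sequence lines before the first header are ignored
            chunks.append(line)
    if seq_id is not None:
        seqs[seq_id] = ''.join(chunks)
    return seqs

# An in-memory "file" stands in for a real one, which makes the function easy to test
example = io.StringIO(">seq1 first\nMTFK\nACIA\n>seq2 second\nMAFK\n")
records = parse_fasta_handle(example)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```

With a real file the call would look like `with open('evasins_canonical.fasta') as handle: seqs = parse_fasta_handle(handle)`. The same function then also works on compressed files, standard input, or a plain list of strings, because it only needs something it can loop over line by line.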

