# Reading and Writing FASTA in Python
This notebook demonstrates how to import, read, and interpret data from FASTA format protein files, how to sort and filter that data, and how to rewrite it into other formats.

## Libraries Used
For this notebook the following Python libraries will be used.

| Library | Uses | Abbreviation |
| :-------: | :----: | :------------: |
| os | File management | os |
| Bio.SeqIO | Parsing FASTA files | SeqIO |
| Bio.SeqUtils.ProtParam | Protein calculations | PA |
| pandas | Managing data | pd |

In [1]:
import os
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis as PA
import pandas as pd
from IPython.display import Image

## Why is FASTA necessary?
FASTA is a standard file format for bioinformatics data. It is also the format accepted by the JBioframework 1D and 2D Electrophoresis simulations.

<img src="Images/1DE.png" height="100"/>

## What is FASTA?
The FASTA format (.fasta, .faa, .fa) is a text-based way of representing nucleotide or protein sequences using single letters to represent each residue in the chain. Each entry begins with a greater-than symbol ('>') to indicate a new sequence. This is then followed by a description line, which will typically contain a sequence ID and additional information about the origin, purpose and properties of the sequence. An example description might look like this (via NCBI):

Below the description line comes the sequence, which typically consists of anywhere from a few dozen to a few thousand residues. For readability, each line usually holds a fixed number of characters. Each residue is represented by a capital letter, whose meaning depends on whether the file represents a protein or a nucleic acid sequence. The meaning of each character can be found [here](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation).
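
To make the layout concrete, the following is a minimal pure-Python sketch that splits FASTA text into (description, sequence) pairs. The two entries, their IDs, and the helper name `read_fasta_text` are invented for illustration; real files are better read with a tested parser such as Bio.SeqIO.

```python
def read_fasta_text(text):
    """Split FASTA-formatted text into (description, sequence) pairs."""
    entries = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous entry, if any.
            if description is not None:
                entries.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence lines are joined, since one sequence may span many lines.
            chunks.append(line)
    if description is not None:
        entries.append((description, "".join(chunks)))
    return entries


# Hypothetical two-entry file; the IDs and sequences are invented.
example_text = """>seq1 hypothetical protein A
MTKRNSKAYIL
WQKIIKILG
>seq2 hypothetical protein B
MFGKLLKYEF
"""

for description, sequence in read_fasta_text(example_text):
    print(description, len(sequence), sequence)
```

Note that the line breaks inside a sequence carry no meaning: the two sequence lines of the first entry are joined into a single string.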

A complete protein sequence in a FASTA file might look like:

## Reading a FASTA file
For this notebook, the proteome of the species *Streptococcus pyogenes* will be used as our data set. The file is stored in the directory "data" under the name *Streptococcus_pyogenes.faa*.

The Bio.SeqIO library can iterate through a FASTA file using the .parse() method, which returns an iterator of SeqRecord objects. The following code prints the first SeqRecord in the file:

In [2]:
with open("data/Streptococcus_pyogenes.faa") as prot_file:
    parsed = SeqIO.parse(prot_file,"fasta")
    for record in parsed:
        print(record)
        break

ID: ERL19293
Name: ERL19293
Description: ERL19293 pep supercontig:SpyogGA06023v1.0:contig00001:421:1035:1 gene:HMPREF1231_2265 transcript:ERL19293 gene_biotype:protein_coding transcript_biotype:protein_coding description:SNARE-like domain protein
Number of features: 0
Seq('MTKRNSKAYILWQKIIKILGIIALIGTFFLAFWLYRLGILNDSNALKDLVQRYR...LQH')
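
Because .parse() returns an iterator, the first record can also be taken with Python's built-in `next()` rather than a `for` loop with `break`. The sketch below assumes Biopython is installed and uses an invented two-entry FASTA string held in memory with `io.StringIO` in place of the notebook's data file.

```python
from io import StringIO

from Bio import SeqIO

# Invented in-memory FASTA text standing in for a file on disk.
fasta_text = (
    ">seq1 example protein\n"
    "MTKRNSKAYIL\n"
    ">seq2 another protein\n"
    "MFGKLLKYEF\n"
)

records = SeqIO.parse(StringIO(fasta_text), "fasta")

# next() consumes exactly one record from the iterator.
first_record = next(records)
print(first_record.id)
print(first_record.seq)

# The iterator remembers its position, so only the later records remain.
remaining = list(records)
print(len(remaining))
```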


When pulling from a FASTA file, a SeqRecord will be populated with:
- **seq** - The amino acid sequence of the protein, stored as a Biopython Seq object.
- **id** - The first word of the first line of the entry with the ">" removed.
- **name** - Same as **id** when reading a FASTA.
- **description** - The entire first line of the entry with the ">" removed.
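
The relationship between **id** and **description** can be sketched in plain Python. This is only an illustration of the rule described in the list above, not Biopython's actual implementation; the helper name `split_header` is hypothetical, and the header is a shortened version of the description line printed earlier.

```python
def split_header(header_line):
    """Return (id, description) for one FASTA description line."""
    # Drop the leading ">" marker if it is present.
    if header_line.startswith(">"):
        header_line = header_line[1:]
    description = header_line.strip()
    # The ID is the first whitespace-delimited word of the description.
    record_id = description.split(None, 1)[0] if description else ""
    return record_id, description


# Shortened version of the first header in the notebook's data file.
header = ">ERL19293 pep supercontig:SpyogGA06023v1.0:contig00001:421:1035:1"
record_id, description = split_header(header)
print(record_id)
print(description)
```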


In [3]:
with open("data/Streptococcus_pyogenes.faa","r") as prot_file:
    parsed = SeqIO.parse(prot_file,"fasta")
    for record in parsed:
        sample_record = record
        break

print(f"Sequence: {str(sample_record.seq)}")
print(f"ID: {sample_record.id}")
print(f"Name: {sample_record.name}")
print(f"Description: {sample_record.description}")

Sequence: MTKRNSKAYILWQKIIKILGIIALIGTFFLAFWLYRLGILNDSNALKDLVQRYRLWGPFVFIVVQIIQIVFPVIPGGLTTVAGFLIFGPVTGFIYNYVGIIIGSIVLFLLVKTYGRKFILLFVNDKTFYKYERRLETPGYEKLFIFCMASPVSPADIMVMITGLTDMSLKRFVTILLITKPISIIGYSYLFIFGKDVISWFLQH
ID: ERL19293
Name: ERL19293
Description: ERL19293 pep supercontig:SpyogGA06023v1.0:contig00001:421:1035:1 gene:HMPREF1231_2265 transcript:ERL19293 gene_biotype:protein_coding transcript_biotype:protein_coding description:SNARE-like domain protein


The SeqRecord object also has the attributes **.dbxrefs**, **.annotations**, **.letter_annotations**, and **.features**, but none of these are populated when reading from a FASTA file. As such, the following code prints empty values.

In [4]:
print(f"dbxrefs: {sample_record.dbxrefs}")
print(f"annotations: {sample_record.annotations}")
print(f"letter_annotations: {sample_record.letter_annotations}")
print(f"features: {sample_record.features}")

dbxrefs: []
annotations: {}
letter_annotations: {}
features: []


## Pulling data from a FASTA file
While iterating with the .parse() method, any field of interest can be copied out of each record. To be compatible with a pandas dataframe, the data is compiled into a dictionary of lists, with each value appended to the list for its field.

In [5]:
datadict = {
    'sequence' : [],
    'description' : [],
    'length' : []
}

with open("data/Streptococcus_pyogenes.faa","r") as prot_file:
    for record in SeqIO.parse(prot_file,"fasta"):
        # Converting the Seq object to a plain string is not strictly necessary, but strings display more cleanly in a dataframe.
        sequence = str(record.seq)
        datadict['sequence'].append(sequence)
        datadict['description'].append(record.description)
        datadict['length'].append(len(sequence))

With the data formatted this way, it can be loaded into a pandas dataframe using **pd.DataFrame.from_dict()**. This command turns each key of the dictionary into a column and the contents of each list into the data of that column.

In [6]:
df = pd.DataFrame.from_dict(datadict)
df

|  | sequence | description | length |
| :-------: | :----: | :----: | :----: |
| 0 | MTKRNSKAYILWQKIIKILGIIALIGTFFLAFWLYRLGILNDSNAL... | ERL19293 pep supercontig:SpyogGA06023v1.0:cont... | 204 |
| 1 | MFGKLLKYEFRSIGKWYFALNAFVIAIAAILSFTIKLFAQSNSDGL... | ERL19300 pep supercontig:SpyogGA06023v1.0:cont... | 261 |
| 2 | MLEFFSDFYSDFDNSKATSLLRDLELDPEDRFKTLSKGNKEKVQLI... | ERL19240 pep supercontig:SpyogGA06023v1.0:cont... | 139 |
| 3 | MSWKFDEKSPIYAQIAQHVMMQIISQEIKSGDQLPTVREYAEIAGV... | ERL19267 pep supercontig:SpyogGA06023v1.0:cont... | 123 |
| 4 | MFAQLDTKTVYSFMDSLIDLNHYFERAKQFGYHTIGIMDKDNLYGA... | ERL19294 pep supercontig:SpyogGA06023v1.0:cont... | 1036 |
| ... | ... | ... | ... |
| 2275 | TPTDRRCLFVFTILVTLIFYGILASIQTAYLLVSTLSIATSFSAVY... | ERL20374 pep supercontig:SpyogGA06023v1.0:cont... | 128 |
| 2276 | DKPISFKDKDGNFVSAADVWNAEKLEELFNLLNPNRRLRLEREKLK... | ERL20373 pep supercontig:SpyogGA06023v1.0:cont... | 50 |
| 2277 | MNIEVKEKISFVFFINSLFSGVIILLCFNITLSKEIIINNFVDILA... | ERL22652 pep supercontig:SpyogGA06023v1.0:cont... | 54 |
| 2278 | MTKKLDVRDARDFFINSEMDEYAANDFKAGDKIAVFSVPFDWN | ERL22653 pep supercontig:SpyogGA06023v1.0:cont... | 43 |
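
The description column packs several `key:value` fields into one string, and any of them could be pulled into its own column. The following is a rough sketch of one way to split them. It assumes every description is laid out like the first record of this file (space-separated `key:value` tokens, ending in a free-text `description:` field), and the helper name `parse_description_fields` is hypothetical.

```python
def parse_description_fields(description):
    """Extract 'key:value' fields from a description line.

    Assumes space-separated 'key:value' tokens, followed by a final
    'description:' field whose value may itself contain spaces.
    """
    fields = {}
    # Split off the trailing free-text field first, since it contains spaces.
    head, separator, free_text = description.partition(" description:")
    if separator:
        fields["description"] = free_text
    for token in head.split():
        key, colon, value = token.partition(":")
        if colon:  # bare words such as the ID and "pep" are ignored
            fields[key] = value
    return fields


# Description of the first record in the notebook's data file.
sample_description = (
    "ERL19293 pep supercontig:SpyogGA06023v1.0:contig00001:421:1035:1 "
    "gene:HMPREF1231_2265 transcript:ERL19293 gene_biotype:protein_coding "
    "transcript_biotype:protein_coding description:SNARE-like domain protein"
)

fields = parse_description_fields(sample_description)
print(fields["gene"])
print(fields["description"])
```

Values extracted this way could be appended to additional lists in the dictionary, in the same manner as the sequence and length.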


The parsing loop is also a good place to apply any data filters. For example, if only proteins with an isoelectric point below 7.0 are wanted in the data set, that condition can be checked before an entry is added. The following dataframe is sorted by isoelectric point to show that every protein with an isoelectric point of 7.0 or above has been excluded.

In [7]:
datadict_withIEP = {
    'sequence' : [],
    'description' : [],
    'length' : [],
    'isoelectric_point' : []
}

with open("data/Streptococcus_pyogenes.faa","r") as prot_file:
    for record in SeqIO.parse(prot_file,"fasta"):
        sequence = str(record.seq)
        # Compute the isoelectric point once and reuse it for both the filter and the stored value.
        isoelectric_point = PA(sequence).isoelectric_point()
        if isoelectric_point < 7.0:
            datadict_withIEP['sequence'].append(sequence)
            datadict_withIEP['description'].append(record.description)
            datadict_withIEP['length'].append(len(sequence))
            datadict_withIEP['isoelectric_point'].append(isoelectric_point)

df_withIEP = pd.DataFrame.from_dict(datadict_withIEP)
df_withIEP.sort_values(by="isoelectric_point",inplace = True)
df_withIEP

|  | sequence | description | length | isoelectric_point |
| :-------: | :----: | :----: | :----: | :----: |
| 1219 | QELALEMDYKDVVDGNNATITGQWSDSPQIILDGGN | ERL07827 pep supercontig:SpyogGA06023v1.0:cont... | 36 | 4.050028 |
| 223 | MKLDVFAGQEKSELSMIEVARAILEERGRDNEMYFSDLVNEIQNYL... | ERL12475 pep supercontig:SpyogGA06023v1.0:cont... | 191 | 4.050028 |
| 1171 | MATLDEVLSFAKGLADTGQGVDLDNVYGTQCVDLPNWITTKYFGIA... | ERL17586 pep supercontig:SpyogGA06023v1.0:cont... | 66 | 4.050028 |
| 901 | MSDLGIIIVSHSKNIAQGVVDLISEVATDVAITYVGGTEDGGIGTS... | ERL23309 pep supercontig:SpyogGA06023v1.0:cont... | 124 | 4.050028 |
| 1170 | MATLDEVLSFAKGLADTGQGVDLDNVYGTQCVDLPNWITTKYFGIA... | ERL14875 pep supercontig:SpyogGA06023v1.0:cont... | 56 | 4.050028 |
| ... | ... | ... | ... | ... |
| 1023 | MTQMTVQVVTPDGIKYDHHAKCISVTTPDGEMGILPNHINLIAPLQ... | ERL15522 pep supercontig:SpyogGA06023v1.0:cont... | 138 | 6.975588 |
| 708 | MEYDKINQYLVDIFNRILVIEEMSLKTSQFSDVSLKEMHTIEIIGK... | ERL10459 pep supercontig:SpyogGA06023v1.0:cont... | 144 | 6.976441 |
| 144 | MLTKIGLYTGSFDPVTNGHLDIVKRASGLFDQIYVGIFDNPTKKSY... | ERL06642 pep supercontig:SpyogGA06023v1.0:cont... | 163 | 6.976782 |
| 1358 | APLLSVSHRAVKGRSGYIYDGATFTTKTLKVKGRVTVSNVERLLDK... | ERL07030 pep supercontig:SpyogGA06023v1.0:cont... | 111 | 6.978487 |
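
An alternative, sketched here on the assumption that every record has already been loaded into a dataframe with an isoelectric point column, is to filter with a pandas boolean mask and then confirm the cutoff programmatically instead of by inspection. The three rows below are invented values, not taken from the proteome.

```python
import pandas as pd

# Invented example values; not taken from the Streptococcus pyogenes file.
df_example = pd.DataFrame({
    "description": ["protein A", "protein B", "protein C"],
    "isoelectric_point": [4.5, 9.2, 6.9],
})

# Keep only rows whose isoelectric point is below the cutoff, then sort.
cutoff = 7.0
df_acidic = df_example[df_example["isoelectric_point"] < cutoff]
df_acidic = df_acidic.sort_values(by="isoelectric_point")

# Programmatic check that nothing at or above the cutoff remains.
assert df_acidic["isoelectric_point"].max() < cutoff
print(df_acidic)
```

Filtering during parsing avoids storing records that will be discarded, while filtering the dataframe afterwards makes it easy to try several cutoffs without re-reading the file.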


## Writing FASTA files
One of the easiest ways to generate new FASTA files is with **SeqIO.write()**. It takes three arguments:
- **sequences** - A list or iterator of SeqRecord objects
- **handle** - The file handle to write to, or the file path given as a string
- **format** - The file format to write in given as a string, in this case "fasta"

The sequence records can be collected while iterating through the original file. The same isoelectric point filter is applied, and the filtered copy is saved in the *data* folder as *Streptcoccus_pyogenes_IEP_below_7.faa*. SeqIO.write() returns the number of records written, which appears as the output of the cell.

In [8]:
records_list = []
with open("data/Streptococcus_pyogenes.faa","r") as prot_file:
    for record in SeqIO.parse(prot_file,"fasta"):
        sequence = str(record.seq)
        parameters = PA(sequence)
        if parameters.isoelectric_point() < 7.0:
            records_list.append(record)
            
SeqIO.write(records_list,"data/Streptcoccus_pyogenes_IEP_below_7.faa","fasta")

1366
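
Since SeqIO.write() also accepts an iterator, the intermediate list can be skipped by passing it a generator. The sketch below assumes Biopython is installed; the two peptides, their IDs, and the helper name `records_below_cutoff` are invented, and `io.StringIO` handles stand in for the input and output files.

```python
from io import StringIO

from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Invented in-memory FASTA text: one strongly acidic and one strongly basic peptide.
fasta_text = (
    ">acidic invented acidic peptide\n"
    "DDDDEEEE\n"
    ">basic invented basic peptide\n"
    "KKKKRRRR\n"
)


def records_below_cutoff(records, cutoff=7.0):
    """Yield only the records whose isoelectric point is below the cutoff."""
    for record in records:
        if ProteinAnalysis(str(record.seq)).isoelectric_point() < cutoff:
            yield record


output_handle = StringIO()
parsed = SeqIO.parse(StringIO(fasta_text), "fasta")

# Records are read, filtered, and written one at a time.
count = SeqIO.write(records_below_cutoff(parsed), output_handle, "fasta")
print(count)
print(output_handle.getvalue())
```

With this approach only one record needs to be held in memory at a time, which can matter for files much larger than a single bacterial proteome.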

The new file should contain the same data as the filtered dataframe above. Loading it into a dataframe shows this to be true.

In [9]:
datadict_modified_file = {
    'sequence' : [],
    'description' : [],
    'length' : [],
    'isoelectric_point' : []
}

with open("data/Streptcoccus_pyogenes_IEP_below_7.faa","r") as prot_file:
    for record in SeqIO.parse(prot_file,"fasta"):
        sequence = str(record.seq)
        parameters = PA(sequence)
        datadict_modified_file['sequence'].append(sequence)
        datadict_modified_file['description'].append(record.description)
        datadict_modified_file['length'].append(len(sequence))
        datadict_modified_file['isoelectric_point'].append(parameters.isoelectric_point())
        
df_modified_file = pd.DataFrame.from_dict(datadict_modified_file)
df_modified_file.sort_values(by="isoelectric_point",inplace=True)
df_modified_file

|  | sequence | description | length | isoelectric_point |
| :-------: | :----: | :----: | :----: | :----: |
| 1219 | QELALEMDYKDVVDGNNATITGQWSDSPQIILDGGN | ERL07827 pep supercontig:SpyogGA06023v1.0:cont... | 36 | 4.050028 |
| 223 | MKLDVFAGQEKSELSMIEVARAILEERGRDNEMYFSDLVNEIQNYL... | ERL12475 pep supercontig:SpyogGA06023v1.0:cont... | 191 | 4.050028 |
| 1171 | MATLDEVLSFAKGLADTGQGVDLDNVYGTQCVDLPNWITTKYFGIA... | ERL17586 pep supercontig:SpyogGA06023v1.0:cont... | 66 | 4.050028 |
| 901 | MSDLGIIIVSHSKNIAQGVVDLISEVATDVAITYVGGTEDGGIGTS... | ERL23309 pep supercontig:SpyogGA06023v1.0:cont... | 124 | 4.050028 |
| 1170 | MATLDEVLSFAKGLADTGQGVDLDNVYGTQCVDLPNWITTKYFGIA... | ERL14875 pep supercontig:SpyogGA06023v1.0:cont... | 56 | 4.050028 |
| ... | ... | ... | ... | ... |
| 1023 | MTQMTVQVVTPDGIKYDHHAKCISVTTPDGEMGILPNHINLIAPLQ... | ERL15522 pep supercontig:SpyogGA06023v1.0:cont... | 138 | 6.975588 |
| 708 | MEYDKINQYLVDIFNRILVIEEMSLKTSQFSDVSLKEMHTIEIIGK... | ERL10459 pep supercontig:SpyogGA06023v1.0:cont... | 144 | 6.976441 |
| 144 | MLTKIGLYTGSFDPVTNGHLDIVKRASGLFDQIYVGIFDNPTKKSY... | ERL06642 pep supercontig:SpyogGA06023v1.0:cont... | 163 | 6.976782 |
| 1358 | APLLSVSHRAVKGRSGYIYDGATFTTKTLKVKGRVTVSNVERLLDK... | ERL07030 pep supercontig:SpyogGA06023v1.0:cont... | 111 | 6.978487 |
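
Comparing two truncated displays by eye only checks the rows that happen to be shown. A fuller check, sketched here with invented stand-in dataframes instead of the notebook's own, is to compare the two frames directly with `DataFrame.equals()`.

```python
import pandas as pd

# Invented stand-ins for the filtered dataframe and the one rebuilt from the new file.
df_filtered = pd.DataFrame(
    {"sequence": ["DDDDEEEE", "MKLDVFAG"], "length": [8, 8]}, index=[5, 2]
)
df_reloaded = pd.DataFrame(
    {"sequence": ["DDDDEEEE", "MKLDVFAG"], "length": [8, 8]}, index=[1, 0]
)

# Row labels may differ between the two frames, so drop them before comparing.
same_data = df_filtered.reset_index(drop=True).equals(
    df_reloaded.reset_index(drop=True)
)
print(same_data)
```

`equals()` compares values and column types across the whole frame, so it covers the rows hidden behind the `...` in the displayed output as well.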
