# Reading and Writing FASTA in Python
This notebook demonstrates how to import, read, and interpret data from protein files in FASTA format, how to sort that data, and how to rewrite it into other formats.

## Libraries Used
This notebook uses the following Python libraries.

| Library | Uses | Abbreviation |
| :-------: | :----: | :------------: |
| os | File management | os |
| Bio.SeqIO | Parsing FASTA files | SeqIO |
| Bio.SeqUtils.IsoelectricPoint | Calculating isoelectric point<br />and protein charge | IP |
| Bio.SeqUtils.ProtParam | Other protein calculations | PA |
| pandas | Managing data | pd |
| collections.defaultdict | Convenience when converting files<br />(not strictly necessary) | defaultdict |

In [2]:
import os
from Bio import SeqIO
from Bio.SeqUtils.IsoelectricPoint import IsoelectricPoint as IP
from Bio.SeqUtils.ProtParam import ProteinAnalysis as PA
import pandas as pd
from collections import defaultdict

## What is FASTA?
The FASTA format (.fasta, .faa, .fa) is a text-based way of representing nucleotide or protein sequences, using a single letter for each residue in the chain. Each entry begins with a description line that starts with a greater-than symbol ('>'), which marks the start of a new sequence. The description line typically contains a sequence ID followed by additional information about the origin, purpose, and properties of the sequence. An example description might look like this (via NCBI):

Below the description line comes the sequence, which can range from a few dozen to a few thousand residues. For readability, the sequence is usually wrapped so that each line contains a fixed number of characters, typically no more than 80. Each residue is represented by a capital letter, whose meaning depends on whether the file represents a protein or a nucleic acid sequence. The meaning of each character can be found [here](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation).
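
To make the layout of the format concrete, here is a minimal sketch of a pure-Python reader that separates each description line from the wrapped sequence lines beneath it. The function name `parse_fasta` and the two sample entries are invented for illustration, and the sketch assumes the common convention that the sequence ID is the first whitespace-delimited token of the description line. It is not a replacement for `SeqIO`, which handles many more cases.

```python
def parse_fasta(text):
    """Yield (seq_id, description, sequence) tuples from FASTA-formatted text."""
    seq_id, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous entry, if there is one.
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # Assumed convention: first token is the ID, the rest is free text.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        else:
            # Sequence lines are wrapped for readability, so join them back up.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


# Hypothetical two-entry file; the IDs and sequences are made up.
sample = """>seq1 hypothetical protein A
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical protein B
MSLNE
"""

entries = list(parse_fasta(sample))
for seq_id, description, sequence in entries:
    print(seq_id, description, len(sequence))
```

Joining the wrapped lines back into one string is the important step: line breaks inside a sequence carry no meaning and exist only for readability.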

A complete protein sequence in a FASTA file might look like:

## Reading a FASTA file
This notebook uses the proteome of the species *Streptococcus pyogenes* as its data set. The file is stored in the directory "data" under the name *Streptococcus_pyogenes.faa*.
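
As a rough sketch of the general pattern, the example below writes a small, made-up FASTA file to a temporary directory and reads it back with `SeqIO.parse`. The file name `example.faa` and the sample entries are assumptions for illustration only. For the real data set, the path would instead be built as `os.path.join("data", "Streptococcus_pyogenes.faa")`.

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical stand-in for the proteome file; IDs and sequences are invented.
sample = (
    ">prot1 hypothetical protein\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    ">prot2 another hypothetical protein\n"
    "MSLNE\n"
)

with tempfile.TemporaryDirectory() as data_dir:
    fasta_path = os.path.join(data_dir, "example.faa")
    with open(fasta_path, "w") as handle:
        handle.write(sample)

    # SeqIO.parse yields one SeqRecord per entry; list() reads them all
    # before the file is closed and the temporary directory is removed.
    with open(fasta_path) as handle:
        records = list(SeqIO.parse(handle, "fasta"))

for record in records:
    # record.id is the first token of the description line;
    # record.description is the full line without the leading '>'.
    print(record.id, len(record.seq), record.description)
```

Collecting the records into a list is convenient for a small file. For very large files, iterating over `SeqIO.parse` directly avoids holding every record in memory at once.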

In [None]:
with open