# Assignment 3.2 - Working with sequence data
In this assignment you will work with FASTA files and explore some of the biological data we can extract from them. We will use the Biopython library, which is commonly used for working with biological data. Here is a link to the [Biopython manual](https://biopython.org).

In [None]:
from Bio import SeqIO  # Import the SeqIO module for reading sequence files

First, let's have a look at a FASTA file. The FASTA format is one of the most common formats in bioinformatics and is used to represent sequence data. The first line of the file starts with `>`, which marks the start of a new sequence record. This line is called the FASTA header. The text from the `>` up to the first space is the accession ID, and everything after the first space is the description. The format of the header and the description can differ depending on where the FASTA file was obtained.  

After the header line come one or more lines containing the sequence of the record. The sequence can be on a single line or, for readability, split across several lines. It can represent either a DNA sequence or a protein sequence.  

After the sequence, the next line that starts with `>` is the header of the second record in the FASTA file.
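
To make the structure concrete, here is a minimal sketch that splits a small FASTA text into accession ID, description and sequence using only plain Python. The toy records and the helper names `split_header` and `parse_fasta` are made up for illustration and are not part of Biopython.

```python
# Toy FASTA text; the records and names here are invented for illustration.
fasta_text = """>seq1 example protein one
MSDKDKIF
IFDTTMRD
>seq2 example protein two
MKMFYEKD
"""


def split_header(header):
    """Split a header (without '>') at the first space into id and description."""
    accession_id, _, description = header.partition(" ")
    return accession_id, description


def parse_fasta(text):
    """Return a list of (accession_id, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((*split_header(header), "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence lines of one record are joined into a single string.
            chunks.append(line)
    if header is not None:
        records.append((*split_header(header), "".join(chunks)))
    return records


toy_records = parse_fasta(fasta_text)
for accession_id, description, sequence in toy_records:
    print(accession_id, "|", description, "|", sequence)
```

Note how the two sequence lines of `seq1` end up as one continuous sequence.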

The output below shows the first lines of a FASTA file of protein sequences from a bacterium called _Ca._ Pelagibacter ubique.

In [33]:
%%bash
head Ca_Pelagibacter_ubique_proteins.faa

>lcl|NC_007205.1_prot_1 [locus_tag=SAR11_RS00005] [db_xref=GeneID:66294503] [protein=2-isopropylmalate synthase] [pseudo=true] [partial=3'] [location=complement(join(<1308756..1308759,1..410))] [gbkey=CDS]
MSDKDKIFIFDTTMRDGEQSPGASMSLEEKIQIARIFDELGIDVIEAGFPIASPGDFEAVTAISKILKNS
IPCGLSRASKKDIDACHEALKASPRFRIHTFISTSPLHMKHKLNKTPEQVLDAIKESVTYARNLTDEV
>lcl|NC_007205.1_prot_WP_006997909.1_2 [gene=ilvC] [locus_tag=SAR11_RS00010] [db_xref=GeneID:66294504] [protein=ketol-acid reductoisomerase] [protein_id=WP_006997909.1] [location=complement(516..1535)] [gbkey=CDS]
MKMFYEKDADVDLIKSKKIAIFGYGSQGHAHALNLKDSGAKEVVVALRDGSASKAKAESKGLRVMNMSDA
AEWAEVAMILTPDELQASIYKNHIEQRIKQGTSLAFAHGLNIHYKLIDARKDLDVFMVAPKGPGHLVRSE
FERGGGVPCLFAVHQDGTGKARDLALSYASAIGGGKSGIIETTFKDECETDLFGEQSVLCGGLVELIKNG
FETLTEAGYEPEMAYFECLHEVKLIVDLIYEGGIANMNYSISNTAEYGEYVSGKKVVDSESKKRMKEVLA
DIQSGKFTKDWMKECEGGQKNFLKMRKDLADHPIEKVGAELRAMMPWIGKKKLIDSDKS
>lcl|NC_007205.1_prot_WP_006997908.1_3 [gene=ilvN] [locus_tag=SAR11_RS00015] [db_xref=GeneID:662

The code below illustrates how we can use the `SeqIO` module to parse FASTA files and work with the data. Here we generate a list called `records`, in which each item is an amino-acid sequence from _Ca._ Pelagibacter ubique.  
As we will see later in this assignment, the items can also be DNA sequences, either entire chromosomes or genomic features such as genes or exons.  

Each record has an `id` attribute, which is the identifier of the record, a `description` attribute, which holds the entire FASTA header, and a `seq` attribute, which contains the sequence of the record. The example below shows how to access these attributes for the first protein in the `records` list.

In [None]:
records = list(SeqIO.parse("Ca_Pelagibacter_ubique_proteins.faa", "fasta"))
print(records[0].id)
print(records[0].description)
print(records[0].seq)

__Task 1:__ Your first task is to use the `records` list to find how many proteins there are in _Ca._ Pelagibacter ubique.

In [None]:
# Write the code to find the number of proteins below

__Task 2:__ Your next task is to find the longest protein in _Ca._ Pelagibacter ubique. We can use the `len()` function on the `seq` attribute to find the length of a protein sequence. You will need one variable to store the length of the longest sequence and another to store the id of the longest sequence.
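
As a hint, the two-variable pattern can be sketched on toy data. The list `toy_proteins` and its contents are invented for illustration; in the task you would loop over `records` and use each record's `id` and `seq` instead.

```python
# Toy data: (id, sequence) pairs invented for illustration.
toy_proteins = [("protA", "MKV"), ("protB", "MSDKDKIF"), ("protC", "MKMFY")]

longest_length = 0  # length of the longest sequence seen so far
longest_id = None   # id of that sequence

for protein_id, sequence in toy_proteins:
    # Update both variables whenever a longer sequence is found.
    if len(sequence) > longest_length:
        longest_length = len(sequence)
        longest_id = protein_id

print(longest_id, longest_length)
```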

In [None]:
# Write the code to find the longest protein below.

Now we will load a new FASTA file. This time we load the DNA sequence of the _Ca._ Pelagibacter ubique genome into a list called `genome_records`.

In [None]:
genome_records = list(SeqIO.parse('Ca_Pelagibacter_ubique.fna', 'fasta'))

We can have a look at the first 100 nucleotides of the genome:

In [None]:
print(genome_records[0].seq[:100])

__Task 3:__ Your final task is to calculate the GC-content of _Ca._ Pelagibacter ubique. The GC-content is defined as the percentage of the nucleotides in a sequence that are either cytosine (C) or guanine (G). As you will see later in the course, this measure is important when we search for genes in a genome. We can use a `for` loop with the `in` keyword to iterate over each nucleotide in the sequence.
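
As a hint, here is a minimal sketch of looping over a sequence one nucleotide at a time. It counts adenine (A) in a made-up toy string rather than computing the GC-content itself. The names `toy_dna`, `adenine_count` and `adenine_percentage` are only illustrative.

```python
toy_dna = "ATGCGATATTAA"  # made-up sequence for illustration

adenine_count = 0
for nucleotide in toy_dna:
    # Each pass through the loop looks at one character of the sequence.
    if nucleotide == "A":
        adenine_count += 1

# Percentage of the sequence that is adenine.
adenine_percentage = 100 * adenine_count / len(toy_dna)
print(adenine_count, len(toy_dna), adenine_percentage)
```

The same loop works on a Biopython sequence such as `genome_records[0].seq`, since it can be iterated over like a string.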

In [None]:
# Write the code to calculate the GC-content below