# Lecture 3 - From sequences to genes

In the previous lecture, you learned how to assemble raw sequencing data into contigs (*i.e. longer DNA sequences*). Now, you will learn how to find genes present in those sequences. 

## Exercise 1 - Gene prediction with Prokka/Prodigal

In this exercise we will use a genome annotation tool called [Prokka](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517), which relies on a gene prediction method called [Prodigal](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119).

> As in the previous lecture, you can try one of two options for running Prokka/Prodigal:

### Option 1: Using the command line

Run the commands below to install **Prodigal** and confirm that it was installed correctly:

In [None]:
!conda install -c bioconda -y prodigal

In [None]:
!prodigal -v

Now let's run **Prodigal** to find *open reading frames* (ORFs) in our assembled contigs.

In [None]:
!prodigal -i files/input/contigs.fasta -o files/output/ORFs.gbk -d files/output/ORFs.ffn -a files/output/ORFs.faa 

If everything went well, you will now find several files in the `files/output` folder. One of them is a FASTA file with the DNA sequences of the ORFs:

In [None]:
!head -20 files/output/ORFs.ffn

You can also find another FASTA file with the translated amino acid sequences of the ORFs:

In [None]:
!head -10 files/output/ORFs.faa

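The header line of each entry in these FASTA files also describes where the ORF was found. As a minimal sketch, assuming the `#`-separated layout that Prodigal normally writes (name, start, end, strand, then a `;`-separated attribute list), the function below turns one header into a dictionary. The function name `parse_prodigal_header` and the example header are made up for illustration, so check them against your own output.

```python
def parse_prodigal_header(description):
    """Parse one Prodigal-style FASTA header (without the leading '>')."""
    # Assumed layout: name # start # end # strand # key=value;key=value;...
    name, start, end, strand, attributes = [
        field.strip() for field in description.split("#")
    ]
    # Attributes are assumed to look like "ID=1_1;partial=00;..."
    info = dict(item.split("=", 1) for item in attributes.split(";") if item)
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),  # 1 = forward strand, -1 = reverse strand
        **info,
    }


# Hypothetical header, written by hand to mimic the assumed layout
example = "contig_1_1 # 3 # 98 # 1 # ID=1_1;partial=10;start_type=Edge;gc_cont=0.385"
orf = parse_prodigal_header(example)
print(orf["name"], orf["start"], orf["end"], orf["strand"], orf["partial"])
```

When the file is read with Biopython, the same header string is available as `record.description` for each record.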
### Option 2: Using Galaxy

- Go to [usegalaxy.no](https://usegalaxy.no/) and log in with your FEIDE credentials
- Download the file `files/input/contigs.fasta` and upload it to Galaxy
- Search for `prokka` in the tools menu and open the main dialog
- Under **Contigs to annotate** select the file you just uploaded 
- Select the option **--compliant**
- Scroll down and click **Execute** (this will take several minutes)
- Explore the different output files that were generated
- Download the file with `.ffn` extension

## Exercise 2 - Biopython

Let's start by reading one of the generated files, which contains the predicted gene sequences in FASTA format. We will also load the original contigs file for comparison.

In [None]:
from Bio import SeqIO

contigs = list(SeqIO.parse('files/input/contigs.fasta', 'fasta'))

annotated = list(SeqIO.parse('files/output/ORFs.ffn', 'fasta'))

Let's check how many genes were predicted:

In [None]:
len(annotated)

And what the first five genes look like:

In [None]:
for seq in annotated[:5]:
    print(seq)
    print()

### Exercise 2.1

Estimate the coding density of the given genome.

> Tip: divide the total length of the coding regions by the total length of the contigs.

In [None]:
# type your code here...

Solution:

In [None]:
coding = sum(len(seq) for seq in annotated)
total = sum(len(contig) for contig in contigs)
density = coding / total

print(f'Coding density: {density:.1%}')

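The solution above sums the gene lengths, so any position where two predicted genes overlap is counted twice. Neighbouring genes in prokaryotes can overlap by a few bases, which makes the estimate slightly too high. As a hedged sketch, the helper below (`merged_length`, a hypothetical name) counts each covered position only once. The coordinates are invented toy values on a single imaginary 400-base contig, not taken from the exercise data.

```python
def merged_length(intervals):
    """Count positions covered by 1-based, inclusive (start, end) intervals."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # No overlap with the running block: close it and start a new one
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap: extend the running block if this gene reaches further
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered


# Toy gene coordinates; the first two genes overlap by 4 bases
genes = [(1, 90), (87, 200), (300, 359)]
contig_length = 400

naive = sum(end - start + 1 for start, end in genes)
merged = merged_length(genes)

print(f'Naive density:  {naive / contig_length:.1%}')
print(f'Merged density: {merged / contig_length:.1%}')
```

On real data the intervals would have to be merged separately for each contig, since coordinates restart at 1 on every contig.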
### Exercise 2.2

Use the `translate()` method to translate the first gene sequence to a protein sequence.

In [None]:
# type your code here...

Solution:

In [None]:
gene = annotated[0]
protein = gene.seq.translate()
print(protein)

Compare this result with the first entry of the [translated FASTA file](files/output/ORFs.faa). Is the output the same?

> If only the first amino acid is different, [here is an explanation](https://www.biostars.org/p/364080/).

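One possible reason for a different first amino acid is an alternative start codon: codons such as GTG or TTG encode valine or leucine inside a gene, but are read as methionine when they start a gene. As a minimal sketch on a made-up nine-base sequence (not one of the predicted genes), Biopython's `translate()` accepts `table=11` for the bacterial genetic code and `cds=True` to treat the sequence as a complete coding sequence.

```python
from Bio.Seq import Seq

# Toy sequence for illustration: GTG start codon, one lysine codon, stop codon
toy_gene = Seq("GTGAAATAA")

# Default behaviour: every codon is translated literally, including the start
plain = toy_gene.translate(table=11)

# cds=True: a valid start codon is translated as M and the final stop is dropped
as_cds = toy_gene.translate(table=11, cds=True)

print(plain)
print(as_cds)
```

Note that `cds=True` raises an error if the sequence is not a complete coding sequence, for example when a gene is cut off at the edge of a contig.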
### Exercise 2.3

A frameshift is a mutation in which a stretch of DNA whose length is not a multiple of 3 is inserted or deleted, which shifts the codon reading frame for everything downstream of the change.

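To see why the reading frame matters, the minimal sketch below splits a made-up 12-base sequence into codons starting at offsets 0, 1 and 2. The helper `codons` and the toy sequence are illustrative only and are not part of Biopython. Every codon changes as soon as the starting position shifts by one or two bases.

```python
def codons(sequence, offset=0):
    """Split a sequence into complete codons, starting at the given offset."""
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


# Toy sequence: start codon, two sense codons, stop codon
toy = "ATGGCCAAATAA"

for offset in range(3):
    print(offset, codons(toy, offset))
```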
Remove the first nucleotide of the gene you just translated and translate the result. Then do the same after removing the first two nucleotides.

What do the resulting protein sequences look like?

In [None]:
# type your code here...

Solution:

In [None]:
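gene = annotated[0]

# Drop the first one or two nucleotides to shift the reading frame
frameshift1 = gene.seq[1:]
frameshift2 = gene.seq[2:]

# Biopython may warn about a partial codon here, because the shortened
# sequences are no longer a multiple of three in length
protein1 = frameshift1.translate()
protein2 = frameshift2.translate()

print(protein1)
print(protein2)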

Great job :)