# Lecture 3 - From sequences to genes

In the previous lecture, you learned how to assemble raw sequencing data into *contigs* (longer fragments of a chromosome). In this session, you will learn how to identify the coding regions of the genome, also called ORFs (*open reading frames*), which encode individual genes.

## Learning objectives:

- Learn to find open reading frames in a genome assembly
- Learn to load and handle sequence data with Biopython

## Exercise 1 - Gene prediction with Prodigal:

Here, we will use a popular gene prediction method called [**Prodigal**](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119). 

> Note: before proceeding with the exercises, always open the links (like the one above) to become more familiar with the tools/methods we are working with.

### Option 1:

Just like we did in the previous lecture, we will explore two options for this exercise. The first (and recommended) option is to run it directly from the command line. If everything is installed correctly, you should already have [**Pyrodigal**](https://pyrodigal.readthedocs.io/en/stable/) (a Python library that wraps Prodigal) available.

Let's test it:

In [None]:
!pyrodigal -h

Now let's run it, using the contigs FASTA file that we assembled in the previous lecture as input (`-i` flag):

In [None]:
!pyrodigal -i files/input/contigs.fasta \
           -o files/output/ORFs.gbk     \
           -d files/output/ORFs.ffn     \
           -a files/output/ORFs.faa

If everything went well, you will now find several output files. One of them is a FASTA file with the DNA sequences of the ORFs.

We will use the Linux `head` command to print the first 20 lines of the file:

In [None]:
!head -20 files/output/ORFs.ffn

You can also find another FASTA file with the translated amino acid sequence of the ORFs:

In [None]:
!head -10 files/output/ORFs.faa
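
The header line of each entry in these files carries more than a name. As a rough sketch, assuming the usual Prodigal layout (`>ID # start # end # strand # key=value;...`), the fields could be pulled apart with plain Python. The header below is made up for illustration, and `parse_prodigal_header` is a hypothetical helper, not part of Pyrodigal:

```python
def parse_prodigal_header(header):
    """Split a Prodigal-style FASTA header into coordinates and attributes."""
    # Assumed layout: >ID # start # end # strand # key=value;key=value;...
    name, start, end, strand, attributes = header.lstrip('>').split(' # ')
    pairs = attributes.strip().rstrip(';').split(';')
    info = dict(pair.split('=', 1) for pair in pairs)
    return {
        'id': name,
        'start': int(start),
        'end': int(end),
        'strand': '+' if strand.strip() == '1' else '-',
        'attributes': info,
    }

# Made-up header for illustration (not copied from the real output file)
example = '>contig_1_1 # 3 # 1250 # -1 # ID=1_1;partial=00;start_type=ATG;gc_cont=0.402'
parsed = parse_prodigal_header(example)

print(parsed['id'], parsed['start'], parsed['end'], parsed['strand'])
print(parsed['attributes']['start_type'])
```

Compare the fields with the headers printed by `head` above; if your file uses a different layout, the split pattern will need adjusting.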

### Option 2: Using Galaxy

Here we use [**Prokka**](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517), a genome annotation tool available in **Galaxy** (which uses **Prodigal** under the hood), to find and *annotate*(*) genes in a genome.

- Go to [usegalaxy.no](https://usegalaxy.no/) and log in with your FEIDE credentials
- Download file `files/input/contigs.fasta` and upload it in Galaxy
- Search for `prokka` in the tools menu and open the main dialog
- Under **Contigs to annotate** select the file you just uploaded 
- Select the option **--compliant**
- Scroll down and click **Execute** (this will take several minutes)
- Explore the different output files that were generated
- Download the file with `.ffn` extension

(*) We will discuss *gene annotation* in the next lecture.

## Exercise 2 - Biopython

Let's start by reading the generated file that contains the predicted gene sequences in FASTA format (and we will also load the original *contigs* for comparison).

We will again use the [**Biopython**](https://biopython.org/) library we used in the previous lecture:

In [None]:
from Bio import SeqIO

contigs = list(SeqIO.parse('files/input/contigs.fasta', 'fasta'))

annotated = list(SeqIO.parse('files/output/ORFs.ffn', 'fasta'))

Let's check how many genes were predicted:

In [None]:
len(annotated)

> 🧠 Does this correspond to the number of genes reported for this species (*M. pneumoniae*)?

And now let's see what the first five genes look like:

In [None]:
for seq in annotated[:5]:
    print(seq)
    print()

🤔 Does this output look a bit strange? 

That's because Biopython loads every entry in a FASTA file as a **SeqRecord** object, which contains not only the gene sequence,
but additional information extracted from the header line of each entry. 

> Before proceeding with the exercise, take a moment to get
familiar with the [documentation of the **SeqRecord** class](https://biopython.org/wiki/SeqRecord). 
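
To make the difference between a record and its sequence concrete, here is a minimal sketch that parses a tiny, made-up FASTA entry from memory (via `StringIO`) instead of a file on disk. It only uses the `id`, `description` and `seq` attributes of **SeqRecord**:

```python
from io import StringIO
from Bio import SeqIO

# Toy FASTA entry, invented for illustration
fasta_text = """>gene_1 toy example
ATGGCCAAATAA
"""

record = next(SeqIO.parse(StringIO(fasta_text), 'fasta'))

print(record.id)           # first word of the header line
print(record.description)  # the full header line (without '>')
print(record.seq)          # the sequence itself, as a Seq object
print(len(record))         # length of the sequence
```

Printing `record` itself shows all of these fields at once, which is what you saw in the output above.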

### Exercise 2.1

We mentioned in the lecture that, unlike eukaryotes, prokaryotes have a high *coding density* (most of their genome encodes genes).

Estimate the *coding density* of the assembled genome.

> Tip: divide the total length of the coding regions by the total length of the contigs.

In [None]:
# type your code here...

Solution (click to expand):

In [None]:

coding = sum(len(seq) for seq in annotated)
total = sum(len(contig) for contig in contigs)
density = coding / total

print(f'Coding density: {density:.1%}')
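
Summing gene lengths is a good first estimate, but predicted genes can overlap by a few bases, and overlapping positions are then counted twice. A possible refinement, sketched below with invented coordinates, is to merge overlapping intervals before counting. `covered_length` is a hypothetical helper, and the sketch assumes all intervals lie on the same contig (for a real assembly you would group the genes per contig first):

```python
def covered_length(intervals):
    """Count positions covered by at least one (start, end) interval.

    Coordinates are assumed to be 1-based and inclusive.
    """
    covered = 0
    block_start, block_end = None, None
    for start, end in sorted(intervals):
        if block_end is None or start > block_end:
            # no overlap with the current block: close it and open a new one
            if block_end is not None:
                covered += block_end - block_start + 1
            block_start, block_end = start, end
        else:
            # overlap: extend the current block if needed
            block_end = max(block_end, end)
    if block_end is not None:
        covered += block_end - block_start + 1
    return covered

# Invented gene coordinates on a single 1000 bp contig
toy_genes = [(1, 300), (290, 600), (700, 900)]
contig_length = 1000

summed = sum(end - start + 1 for start, end in toy_genes)
merged = covered_length(toy_genes)

print(f'Summed lengths:   {summed / contig_length:.1%}')
print(f'Merged intervals: {merged / contig_length:.1%}')
```

In this toy case the first two genes share 11 positions, so the merged estimate is slightly lower than the simple sum.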

### Exercise 2.2

When running **Prodigal**, we used the `-a` option to translate the ORFs into amino acid sequences and generate a protein FASTA file.

But we can also use **Biopython**'s [*translate()*](https://biopython.org/docs/1.75/api/Bio.Seq.html#Bio.Seq.Seq.translate) method to translate individual nucleotide sequences.

Try to translate the first sequence record in the file.

In [None]:
# type your code here...

Solution (click to expand):

In [None]:

gene = annotated[0]

# option 1: apply translate to the SeqRecord object (returns a new SeqRecord)
protein_record = gene.translate()
print(protein_record.seq)

# option 2: apply translate to the Seq object (returns a new Seq)
protein = gene.seq.translate()
print(protein)


Compare this result with the first entry of the [translated fasta file](files/output/ORFs.faa). Is the output the same?

> If only the first amino acid is different, [here is an explanation](https://www.biostars.org/p/364080/).
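
Two parameters of `translate()` are worth knowing when comparing the results. The sketch below uses a made-up sequence and assumes NCBI translation table 4 (the Mycoplasma/Spiroplasma code, in which `TGA` encodes tryptophan instead of a stop); which table Pyrodigal actually used for this assembly is not stated here. With `cds=True`, Biopython treats the sequence as a complete coding sequence, so a valid alternative start codon is translated as `M` and the final stop codon is dropped:

```python
from Bio.Seq import Seq

# Made-up sequence: GTG AAA TGG TGA TAA
toy = Seq('GTGAAATGGTGATAA')

standard = toy.translate()                 # default: standard genetic code
mycoplasma = toy.translate(table=4)        # TGA is read as tryptophan (W)
as_cds = toy.translate(table=4, cds=True)  # start codon becomes M, stop removed

print(standard)
print(mycoplasma)
print(as_cds)
```

If your translation differs from the `.faa` file in more than the first residue, the translation table is a good place to start looking.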

### Exercise 2.3

A *frameshift* is a type of mutation where a portion of DNA (with a length that is not a multiple of 3) 
gets deleted or inserted and changes the codon reading frame.

Remove the first (and then also the second) nucleotide of the gene you just translated, and then try translating it again. 

What do the resulting protein sequences look like?

In [None]:
# type your code here...

Solution (click to expand):

In [None]:

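gene = annotated[0]

# drop the first one or two nucleotides to shift the reading frame
frameshift1 = gene.seq[1:]
frameshift2 = gene.seq[2:]

# Biopython may warn about a partial codon here, because the
# shortened sequences are no longer a multiple of three in length
protein1 = frameshift1.translate()
protein2 = frameshift2.translate()

print(protein1)
print(protein2)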

> 🧠 What would happen if the cell tried to translate these sequences? 
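
To see why a shift of a single base changes every downstream amino acid, it can help to look at what translation does codon by codon. The following is a minimal pure-Python sketch using the standard genetic code and a made-up sequence. `toy_translate` is a hypothetical helper written for illustration; it is not how Biopython implements `translate()`:

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G)
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def toy_translate(dna):
    """Translate codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return ''.join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Made-up gene: ATG GCC AAA TAA
toy_gene = 'ATGGCCAAATAA'

for offset in range(3):
    # offset 0 is the original frame; 1 and 2 mimic the frameshifts above
    print(offset, toy_translate(toy_gene[offset:]))
```

The same nucleotides are grouped into entirely different codons in each frame, so the shifted translations share nothing with the original protein, and the stop codon is no longer read in frame.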

## Wrap-up

You should now be comfortable using Biopython to load a FASTA file and perform basic operations such as counting, printing, and modifying sequences.

Finished early? Consider walking around the room and helping a colleague... 😉