
Module 7 Playing with annotations


Questions we can address with simple command line analyses:

  • What is the copy-number of the rRNA operon in your genome bin?
  • How many tRNAs are there?
  • Are there tRNAs missing for certain amino acids?
  • How many protein coding sequences are in your genome bin?
  • What is the function of the gene with the highest GC content?

Play with rRNAs

run barrnap

let's run barrnap to find the rRNAs in the genome bins

 barrnap YOURFASTA.fasta 

barrnap outputs a table of the predicted genes as a gff file. This output can be redirected into a file using >

 barrnap YOURFASTA.fasta > YOURFASTA.gff

If you want to further process your 16S predictions, the bedtools suite can be helpful; see below for some help.
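
For example, 'bedtools getfasta' can pull the predicted rRNA sequences out of your assembly using the gff from barrnap. A minimal sketch (the output file name is just a suggestion):

 bedtools getfasta -fi YOURFASTA.fasta -bed YOURFASTA.gff -fo YOURFASTA.rRNA.fasta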

OPTIONAL run a loop on all files

If we use a for loop, we can automatically do this for all the fasta files in the directory

 for n in *.fasta; do barrnap $n; done

inspect headers to look at the coverages of all contigs

SPAdes encodes both length and coverage in the header of a scaffold, e.g. >NODE_60803+_length_9320_cov_122.682

you can get the headers of a fasta file with

 grep '>' YOURFASTA.fasta

To find out what the copy number of your rRNA operon is, you have many options. One way is to establish a baseline for the coverage of the larger parts of the genomes using the information encoded in the headers. We can then compare those coverages to the coverages of the contigs that carry the rRNA operons.

So let's try this - we take the headers (cat YOURFASTA.fasta | grep '>'), sort them by length (sort -t '_' -k4 -rh), take the longest 100 contigs (head -n 100), get the reported coverage (cut -d '_' -f6), trim the coverage to the number before the decimal point (sort can be a little stupid at times... cut -d '.' -f1) and sort those numbers to see the coverages of the longest 100 contigs (sort -n). If the coverages of the larger parts of the genome are in the same ballpark as the coverage of the contig with the rRNAs, we could argue that we have a copy number of 1.

  for n in *.fasta; do echo $n; cat $n | grep '>' | sort -t '_' -k4 -rh | head -n 100 | cut -d '_' -f6 | cut -d '.' -f1 | sort -n; done
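
To then get the coverage of the contig(s) that carry the 16S rRNA prediction, you can read it straight from the contig name stored in column 1 of the barrnap gff. A small sketch, assuming you saved the barrnap output as YOURFASTA.gff as above:

 grep '16S' YOURFASTA.gff | cut -f1 | sort -u | cut -d '_' -f6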

ALTERNATIVE plot the bins using Brandon's gblite

we can use bbmap with the reads of the library to get the coverage, GC etc. with the covstats option. Check out the secret bbmap helper page for instructions!
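
A minimal sketch of such a bbmap call (READS_R1/READS_R2 are placeholders for the read files of your library; check the bbmap helper page for the exact call used in the course):

 bbmap.sh ref=YOURFASTA.fasta in=READS_R1.fastq.gz in2=READS_R2.fastq.gz covstats=YOURFASTA_covstats.txt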

We could then upload the covstats data into gblite and look where the contig falls compared to the cloud of other contigs.

play with tRNAs

Aragorn

Aragorn is Arathorn's son, but also a neat program to search for tRNAs

 aragorn YOURFASTA.fasta 

If you want to know how many tRNAs are in a given fasta file, you could use line counting (wc -l) on all reported tRNA sequence headers (remember - grepping for the > gives you headers)

 aragorn -t -fon YOURFASTA.fasta | grep '>' | wc -l

Now because we are lazy we want the work done by the machine, for all libraries!

 for n in *.fasta; do echo $n; aragorn -t -fon $n | grep '>' | wc -l; done

Did you compare the reported number to the number given at the end of a normal aragorn run?

do we have all tRNAs?

The headers of the fasta files also contain the amino acid the tRNA carries. We could use this list and some sorting and counting to get the number of amino acids that are covered.

We could solve this with a little pipe where

  • take the headers (you know this part)
  • get the part that codes for the amino acid (that is the second column in the header)
  • cut away the tRNA-
  • and the codon (nnn)
  • then sort them into a unique list
  • and finally we count the lines from this list

as you see - all of this can be done with grep, cut, sort and wc

 for n in *.fasta; do echo $n; aragorn -t -fon $n | grep '>' | cut -d' ' -f2 | cut -d'-' -f2 | cut -d '(' -f1 |sort -u |wc -l; done
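
If you also want to know which amino acids are missing (one of the questions at the top), a possible extension of the same pipe is to compare the list against the 20 standard amino acids. A sketch, where all_aa.txt and found_aa.txt are just scratch files:

 # the 20 standard amino acids, sorted so that comm can compare the lists
 printf '%s\n' Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val | sort > all_aa.txt
 # the amino acids for which aragorn predicted a tRNA
 aragorn -t -fon YOURFASTA.fasta | grep '>' | cut -d' ' -f2 | cut -d'-' -f2 | cut -d'(' -f1 | sort -u > found_aa.txt
 # comm -23 prints what is only in the first list, i.e. the missing amino acids
 comm -23 all_aa.txt found_aa.txt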

CDS prediction

prodigal

prodigal predicts coding sequences in a multifasta file and can e.g. write a genbank or gff file for these annotations.

 prodigal -i YOURFASTA.fasta -f gff -o YOURFASTA.gff

Explore the options you have with prodigal - how can you e.g. get the predicted peptides?
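
If you want to peek: the -a option writes the predicted peptides and -d the nucleotide sequences of the genes. A sketch of a loop that names the outputs so they match the loops further down ($n.gff and $n.faa):

 for n in *.fasta; do prodigal -i $n -f gff -o $n.gff -a $n.faa; done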

OPTIONAL bedtools

A powerful toolset to work with genomic annotations in gff format is the bedtools suite. It can read annotation files and the corresponding .fasta files and process them in various ways - http://bedtools.readthedocs.io/en/latest/

'bedtools nuc' profiles the nucleotide sequences that were annotated and e.g. calculates their length or GC content. 'bedtools getfasta' can give you the fasta sequences for e.g. the annotations in a .gff file
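
For a quick look at what 'bedtools nuc' reports, a minimal sketch (assuming your prodigal gff is named YOURFASTA.fasta.gff as in the loop below; the first line of the output names the columns):

 bedtools nuc -fi YOURFASTA.fasta -bed YOURFASTA.fasta.gff | head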

How many protein coding sequences are in your genome bin?

you need to count fasta sequences in a multifasta... da, that's easy!
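
For example, on the peptide file from prodigal (here assumed to be called YOURFASTA.fasta.faa - use whatever name you gave the -a output):

 grep -c '>' YOURFASTA.fasta.faa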

What are the proteins that are encoded with the highest GC?

To search for the proteins encoded with the highest GC and then extract those gene products, one could e.g.

use 'bedtools nuc' to calculate the GC for each gene (reported in column 11), sort them, get the ten with the highest GC score and output those annotations to a list of annotations (a new gff file). That will be our first step:

 for n in *.fasta; do echo $n;bedtools nuc -fi $n -bed $n.gff | sort -k11 -n -r | head -n 10 >  ${n}_longest.gff; done

Now to look at a function we would need the peptides - we got those from prodigal (using the -a option). A slight problem is that if we want to find certain genes in the list of peptides output by prodigal, each peptide has a header name that is derived from the gff annotation - NODE_X from ID=NODEcount_X (X being the gene number on that contig or NODE). No worries - the ID=NODEcount_X is also written to the headers, so we can search and select the peptide fasta sequences that way. Let's first get the IDs of those genes we are interested in with a little cutting

 for n in *longest*; do cat $n | cut -f9 | cut -d';' -f1 > ${n}.ID; done

then we can use this list of IDs to get a list of the fasta headers for the peptide sequences of those genes

 for n in *.ID; do grep -f $n ${n%_longest.gff.ID}.faa | cut -d ' ' -f1 | sed 's/>//' > $n.list; done

we can then use the fasta headers to extract those genes with the simple faSomeRecords program. It takes a list of fasta headers, but the headers must not contain the > sign - we already removed it with the sed 's/>//' command above. The call is simple:

 faSomeRecords MULTIFASTA.fa list.txt MULTIFASTA_of_list.fa

to do this for all genomes, we can do

 for n in *.ID; do faSomeRecords ${n%_longest.gff.ID}.faa $n.list ${n%_longest.gff.ID}.longest_gene.faa; done

To get an idea what those curious genes are doing you could now search for related sequences in databases, e.g. using BLAST - https://blast.ncbi.nlm.nih.gov/. Use the protein BLAST module, upload the file with the 10 highest-GC genes and BLAST them against SWISSPROT or nr.

Annotation with Prokka

A powerful piece of software to generate annotations is Prokka - https://github.com/tseemann/prokka

To be able to annotate with Prokka, we need to fix the headers of our files so that they are short and free of special symbols. Look at your headers e.g. with the 'head' command. You can use e.g. sed to fix them prior to annotation

 sed 's/\+.*//' TS_Lvic_1094_B.fasta > TS_Lvic_1094_B.fixed.fasta 

We are using a regular expression to have sed remove the + after the node name and everything that follows. For a guide to sed and regular expressions see e.g. https://www.tutorialspoint.com/unix/unix-regular-expressions.htm
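
Applied to the example header from further up, this should turn >NODE_60803+_length_9320_cov_122.682 into >NODE_60803. You can check the fixed headers with:

 grep '>' TS_Lvic_1094_B.fixed.fasta | head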

You can then activate the prokka environment with

 conda activate /bioinf/transfer/Marmics_SPMG2021/SPMG_software_env/prokka

Prokka is a very flexible annotation pipeline for prokaryote genomes. A simple prokka call would be

 prokka YOURBIN.fasta
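
If you want to keep the results in one place, you could also set an output directory and a file prefix (the names here are just suggestions):

 prokka --outdir YOURBIN_prokka --prefix YOURBIN YOURBIN.fasta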

Prokka offers the flexibility to annotate with a custom set of proteins. You can try this using one of the annotation databases provided in /bioinf/transfer/Marmics_SPMG2021/lecture_8_annotation_and_RNASeq.

 prokka YOURBIN.fasta --proteins /bioinf/transfer/Marmics_SPMG2021/lecture_8_annotation_and_RNASeq/custom.faa

Have fun