# Genome annotation
This notebook prepares the evidence data for genome annotation and then runs the annotation itself with MAKER-P.
First, evidence data (transcripts, proteins and gene models) are collected and arranged. Next, the Heinz reference is annotated and the result is compared to the official annotation to assess the quality of annotation that this procedure can produce. Finally, all genomes are annotated.

In [1]:
import sys
sys.path.append('../../queue_utilities/')
from queueUtils import *
queue_conf = '../../queue_utilities/queue.conf'
sys.path.append('../python/')
from os.path import realpath
from os import chdir
import os
from get_genome_stats import get_stats
import pandas
from IPython.display import display
pandas.set_option('display.float_format', lambda x: "{:,.2f}".format(x) if int(x) != x else "{:,.0f}".format(x))
from shutil import rmtree
from subprocess import Popen

DATA_PATH = realpath("../data/")
PY_PATH = realpath("../python/")
FIGS_PATH = realpath("../figs/")
OUT_PATH = realpath("../output/")

bedtools_exe = "/share/apps/bedtools2/bin/bedtools"

## Collect evidence data
### Proteins
The protein evidence data comprise five proteomes taken from high-quality reference annotations:
- A. thaliana
- V. vinifera (grapevine)
- S. lycopersicum (tomato)
- O. sativa (rice)
- G. max (soybean)

Additionally, all SwissProt (i.e., manually curated/reviewed) proteins from other Embryophyte plants were used.

In [20]:
proteins_evidence_dir = DATA_PATH + "/proteins_evidence"

In [3]:
! mkdir -p $proteins_evidence_dir

In [3]:
# download and extract proteins data
s_lyc_proteins_url = "ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_proteins.fasta"
a_thaliana_proteins_url = "ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz"
o_sativa_proteins_url = "ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/oryza_sativa/pep/Oryza_sativa.IRGSP-1.0.pep.all.fa.gz"
v_vinifera_proteins_url = "ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/vitis_vinifera/pep/Vitis_vinifera.IGGP_12x.pep.all.fa.gz"
g_max_proteins_url = "ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/glycine_max/pep/Glycine_max.Glycine_max_v2.0.pep.all.fa.gz"
swiss_prot_embryophte_proteins_url = "https://www.uniprot.org/uniprot/?sort=&desc=&compress=yes&query=reviewed:yes%20taxonomy:3193&fil&format=fasta&force=yes"

In [19]:
# queue one wget per proteome; files keep their remote names
protein_download_commands = []
for url in (s_lyc_proteins_url, a_thaliana_proteins_url, o_sativa_proteins_url,
            v_vinifera_proteins_url, g_max_proteins_url):
    protein_download_commands.append("wget \"%s\" -P %s" % (url, proteins_evidence_dir))
# the SwissProt URL is a query string, so the output file name is set explicitly
protein_download_commands.append("wget \"%s\" -O %s/swissProt_emb_proteins.fasta.gz" % (swiss_prot_embryophte_proteins_url, proteins_evidence_dir))
# decompress every downloaded .gz archive
protein_download_commands.append("gzip -d %s/*.gz" % proteins_evidence_dir)

send_commands_to_queue("download_extract_proteins", protein_download_commands, queue_conf)

NameError: name 's_lyc_proteins_url' is not defined

In [10]:
# filter SwissProt proteins to remove species for which a proteome was downloaded
filter_commands = ['module load python/python-3.3.0']
filter_commands.append("python %s/filter_fasta.py %s/swissProt_emb_proteins.fasta -out_fasta %s/swissProt_emb_proteins_filtered.fasta -v %s" %(PY_PATH, proteins_evidence_dir, proteins_evidence_dir, "'Arabidopsis thaliana' 'Solanum lycopersicum' 'Oryza sativa' 'Glycine max' 'Vitis vinifera'"))
send_commands_to_queue('filter_swissprot', filter_commands, queue_conf)

Job filter_swissprot (job id 5977785) completed successfully


('5977785', 0)
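
The filtering step above relies on `filter_fasta.py`, which is not shown in this notebook. As a rough illustration only, the sketch below shows one way such a filter could work, assuming UniProt-style headers that carry the organism in an `OS=` field and assuming `-v` means "drop matching records". The function names `parse_fasta`, `species_of` and `drop_species` are hypothetical and are not taken from the actual script.

```python
import re

def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

# Assumption: the organism name follows "OS=" and ends at the next
# two-letter "XX=" field (e.g. OX=, GN=, PE=) or at the end of the header.
OS_FIELD = re.compile(r"OS=(.+?)(?= [A-Z]{2}=|$)")

def species_of(header):
    """Return the organism name found in a UniProt-style header, or None."""
    match = OS_FIELD.search(header)
    return match.group(1) if match else None

def drop_species(records, excluded):
    """Keep only records whose organism does not start with an excluded name."""
    for header, seq in records:
        organism = species_of(header) or ""
        if not any(organism.startswith(name) for name in excluded):
            yield header, seq

demo = [
    ">sp|P1|A_ARATH Prot A OS=Arabidopsis thaliana OX=3702 GN=a PE=1 SV=1",
    "MKV",
    "LLA",
    ">sp|P2|B_SOLTU Prot B OS=Solanum tuberosum OX=4113 PE=2 SV=1",
    "MAA",
]
kept = list(drop_species(parse_fasta(demo), ["Arabidopsis thaliana"]))
```

Matching on the start of the organism name means that subspecies entries such as "Oryza sativa subsp. japonica" would also be dropped when "Oryza sativa" is excluded, which seems to be the intent here.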

In [21]:
# finally, concatenate all protein fasta files
all_protein_evidence_fasta = "%s/all_proteins.fasta" % proteins_evidence_dir
s_lyc_proteins_fasta = "%s/ITAG3.2_proteins.fasta" % proteins_evidence_dir
a_thaliana_proteins_fasta = "%s/Arabidopsis_thaliana.TAIR10.pep.all.fa" % proteins_evidence_dir
o_sativa_proteins_fasta = "%s/Oryza_sativa.IRGSP-1.0.pep.all.fa" % proteins_evidence_dir
v_vinifera_proteins_fasta = "%s/Vitis_vinifera.IGGP_12x.pep.all.fa" % proteins_evidence_dir
g_max_proteins_fasta = "%s/Glycine_max.Glycine_max_v2.0.pep.all.fa" % proteins_evidence_dir
swiss_prot_embryophte_proteins_filtered_fasta = "%s/swissProt_emb_proteins_filtered.fasta" % proteins_evidence_dir

In [11]:
command = "cat %s > %s" %(' '.join([s_lyc_proteins_fasta, a_thaliana_proteins_fasta,
                               o_sativa_proteins_fasta, v_vinifera_proteins_fasta, g_max_proteins_fasta,
                               swiss_prot_embryophte_proteins_filtered_fasta]), all_protein_evidence_fasta)
send_commands_to_queue('cat_protein_evidence', command, queue_conf)

Job cat_protein_evidence (job id 5988646) completed successfully


('5988646', 0)
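
Since several sources are merged into one file, it may be worth checking that no sequence identifier appears more than once in the result. The sketch below is an optional sanity check that is not part of the original workflow; `fasta_ids` and `duplicated_ids` are hypothetical helper names, and the identifier is assumed to be the first whitespace-delimited token of each header.

```python
from collections import Counter

def fasta_ids(lines):
    """Yield the first whitespace-delimited token of every FASTA header."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""

def duplicated_ids(lines):
    """Return a dict mapping each repeated identifier to its count."""
    counts = Counter(fasta_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}

demo = [">a first\n", "MK\n", ">b\n", "MA\n", ">a second\n", "MM\n"]
demo_duplicates = duplicated_ids(demo)
```

On the real data this could be run by passing an open file handle for `all_protein_evidence_fasta` to `duplicated_ids`; an empty dict would mean all identifiers are unique.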

### Transcripts
All transcript assemblies are collected by concatenating them into one file, with the data set name included in each header.
This was already done in the transcriptome assembly notebook.
At this point, I decided to simply concatenate all transcripts, but I may come back here to filter them in some way or to merge redundant transcripts.

In [12]:
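s_lyc_transcript_assemblies_dir = OUT_PATH + "/S_lyc_transcriptome_assembly"
# use the directory variable defined on the line above (transcript_assemblies_dir was undefined)
s_lyc_all_transcripts_concat = s_lyc_transcript_assemblies_dir + "/all_transcripts_concat.fasta"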
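
The concatenation itself lives in the transcriptome assembly notebook, so the exact header format is not visible here. As a minimal sketch of the idea, assuming the data set name is simply prefixed to each header with an underscore (an illustrative choice, not necessarily the one actually used), it could look like this; `tag_headers` and `concat_with_tags` are hypothetical names.

```python
def tag_headers(lines, dataset):
    """Prefix every FASTA header with the data set name; pass sequence lines through."""
    for line in lines:
        if line.startswith(">"):
            # assumed format: >{dataset}_{original header}
            yield ">%s_%s" % (dataset, line[1:])
        else:
            yield line

def concat_with_tags(datasets):
    """Chain several (name, lines) pairs into one tagged stream of FASTA lines."""
    for name, lines in datasets:
        for line in tag_headers(lines, name):
            yield line

demo = [
    ("setA", [">t1 len=4\n", "ACGT\n"]),
    ("setB", [">t1 len=3\n", "GGA\n"]),
]
merged = list(concat_with_tags(demo))
```

Tagging headers this way keeps transcripts with the same name in different assemblies distinguishable after merging.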

### Repeats library
In addition to the RepBase library, a species-specific repeats library was created from the ITAG3.2_RepeatModeler_repeats_light.gff file downloaded from SolGenomics. The coordinates in the gff file were used to extract the corresponding fasta sequences from the reference genome.

In [10]:
# get repeats gff
!wget ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_RepeatModeler_repeats_light.gff -P $DATA_PATH

--2018-07-23 14:31:57--  ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_RepeatModeler_repeats_light.gff
           => “/groups/itay_mayrose/liorglic/Projects/tomato_pan_genome/data/ITAG3.2_RepeatModeler_repeats_light.gff”
Resolving ftp.solgenomics.net... 132.236.81.147
Connecting to ftp.solgenomics.net|132.236.81.147|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /tomato_genome/annotation/ITAG3.2_release ... done.
==> SIZE ITAG3.2_RepeatModeler_repeats_light.gff ... 86039779
==> PASV ... done.    ==> RETR ITAG3.2_RepeatModeler_repeats_light.gff ... done.
Length: 86039779 (82M) (unauthoritative)


2018-07-23 14:32:09 (8.40 MB/s) - “/groups/itay_mayrose/liorglic/Projects/tomato_pan_genome/data/ITAG3.2_RepeatModeler_repeats_light.gff” saved [86039779]



In [15]:
s_lyc_repeats_gff = "%s/ITAG3.2_RepeatModeler_repeats_light.gff" % DATA_PATH
s_lyc_genome_fasta = "%s/S_lycopersicum_chromosomes.3.00.fa" % DATA_PATH
s_lyc_repeats_fasta = "%s/ITAG3.2_RepeatModeler_repeats_light.fasta" % OUT_PATH

In [17]:
gff_to_fasta_command = "%s getfasta -fi %s -bed %s -fo %s" %(bedtools_exe, s_lyc_genome_fasta, s_lyc_repeats_gff, s_lyc_repeats_fasta)
send_commands_to_queue("repeats_gff_to_fasta", gff_to_fasta_command, queue_conf)

Job repeats_gff_to_fasta (job id 6087294) completed successfully


('6087294', 0)
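
To make the coordinate handling of this step explicit: GFF intervals are 1-based and inclusive, so a feature spanning `start..end` corresponds to the Python slice `[start - 1:end]`. The sketch below illustrates that conversion on an in-memory genome. It is a toy stand-in for what the extraction does conceptually, not a replacement for `bedtools getfasta`; the helper names and the `seqid:start-end` header format are illustrative assumptions. Strand is ignored here, as it is by `bedtools getfasta` unless `-s` is given.

```python
def gff_intervals(lines):
    """Yield (seqid, start, end) from GFF lines, skipping comments and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GFF columns 4 and 5 hold the 1-based, inclusive start and end
        yield fields[0], int(fields[3]), int(fields[4])

def extract(genome, intervals):
    """Yield (name, subsequence) for each interval; genome maps seqid -> sequence."""
    for seqid, start, end in intervals:
        # convert to a 0-based, half-open slice
        name = "%s:%d-%d" % (seqid, start - 1, end)
        yield name, genome[seqid][start - 1:end]

demo_genome = {"chr1": "ACGTACGTAC"}
demo_gff = [
    "##gff-version 3\n",
    "chr1\tRepeatModeler\trepeat\t3\t6\t.\t+\t.\tID=r1\n",
]
demo_repeats = list(extract(demo_genome, gff_intervals(demo_gff)))
```

In the demo, the feature at positions 3 to 6 yields the four bases `GTAC`, confirming that both end points are included.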

In [23]:
print(s_lyc_repeats_fasta)

/groups/itay_mayrose/liorglic/Projects/tomato_pan_genome/output/ITAG3.2_RepeatModeler_repeats_light.fasta
