# Transcriptome assembly and assessment
This notebook documents the analyses performed to assemble transcriptome data from multiple tomato accessions, as a step towards transcript-based genome annotation.  
The analysis starts by exploring the official ITAG _S. lycopersicum_ annotation in order to set a baseline for quality metrics.  
Then, a pan-transcriptome assembly procedure is performed for each _Solanum_ species. It includes the following general steps (see the detailed description below):
1. Download RNA-seq data from multiple studies (hosted on SRA), covering diverse variants, tissues and conditions.
2. Assemble each data set independently, using Trinity in its genome-guided mode.
3. QA each result and filter out unreliable outputs.
4. Merge the transcriptomes from all data sets, using StringTie (merge mode), to obtain a single non-redundant species-specific pan-transcriptome.
5. Perform further QA and cleanup on the merged transcriptome.

In [1]:
import sys
sys.path.append('../../queue_utilities/')
from queueUtils import *
queue_conf = '../../queue_utilities/queue.conf'
sys.path.append('../python/')
from os.path import realpath
from os import chdir
import os
from get_genome_stats import get_stats
import pandas
from IPython.display import display
pandas.set_option('display.float_format', lambda x: "{:,.2f}".format(x) if int(x) != x else "{:,.0f}".format(x))
from shutil import rmtree

DATA_PATH = realpath("../data/")
PY_PATH = realpath("../python/")
FIGS_PATH = realpath("../figs/")
OUT_PATH = realpath("../output/")

## Reference transcriptome assessment
The first stage of this analysis is to get some statistics on the official tomato reference transcriptome.  
This should serve as a baseline for later QA of newly-assembled transcriptomes from non-reference accessions.  
The QA procedures outlined in the [Trinity tutorial](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment) were followed.

### Get data
The data are taken from the Heinz 1706 reference assembly SL3.0, annotation build ITAG3.2.

In [7]:
itag32_ftp_url = "ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_cDNA.fasta"
download_command = "wget %s -P %s" % (itag32_ftp_url, realpath(DATA_PATH))
send_commands_to_queue('download_itag32_cdna', download_command, queue_conf)

Job download_itag32_cdna (job id 1277211) completed successfully


('1277211', '0')

In [6]:
itag32_cdna_fasta_path = "%s/ITAG3.2_cDNA.fasta" % DATA_PATH

### Basic stats

In [33]:
itag32_stats = get_stats(itag32_cdna_fasta_path)

In [35]:
display(pandas.DataFrame.from_dict(itag32_stats, orient='index'))

| Statistic | Value |
|---|---|
| Total length | 54,440,050 |
| Total scaffolds | 35,768 |
| # of gaps | 241,894 |
| % gaps | 0.44 |
| N50 | 2,226 |
| L50 | 7,838 |
| N90 | 801 |
| L90 | 22,989 |
| Min scaffold length | 63 |
| Max scaffold length | 23,222 |
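
For readers unfamiliar with the N50/L50 and N90/L90 rows, the sketch below shows how these statistics are conventionally computed from a list of sequence lengths. The implementation of `get_stats` is not shown in this notebook, so `nx_lx` and the toy lengths are hypothetical and only illustrate the standard definition: Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the total length, and Lx is the number of sequences in that set.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a collection of sequence lengths.

    Hypothetical helper illustrating the standard definition;
    it is not taken from get_genome_stats.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running_total = 0
    for rank, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= threshold:
            # 'length' is Nx, 'rank' is Lx
            return length, rank
    raise ValueError("lengths must be non-empty")


# Toy example (made-up lengths, not ITAG data)
toy_lengths = [100, 200, 300, 400, 500]
n50, l50 = nx_lx(toy_lengths)
n90, l90 = nx_lx(toy_lengths, fraction=0.9)
print(n50, l50, n90, l90)
```

For a transcript set, the "scaffolds" in the table are simply the individual cDNA sequences.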


### Run BUSCO
Run BUSCO to assess the completeness of the transcript set.

In [31]:
busco_dir = "/groups/itay_mayrose/liorglic/software/busco"
busco_command = "python %s/scripts/run_BUSCO.py --in %s --out ITAG3.2_BUSCO --lineage_path %s/embryophyta_odb9/ --mode transcriptome" \
% (busco_dir, realpath(itag32_cdna_fasta_path), busco_dir)
env_commands = ['export PATH="/share/apps/augustus/bin:$PATH"',
               'export PATH="/share/apps/augustus/scripts:$PATH"',
               'export AUGUSTUS_CONFIG_PATH="/groups/itay_mayrose/liorglic/software/busco/augustus_config"',
               "cd %s" % realpath(OUT_PATH)]
busco_run_commands = env_commands + [busco_command]
send_commands_to_queue("ITAG32_BUSCO", busco_run_commands, queue_conf, block=False, n_cpu=20)

('1281124', None)

### Full length transcripts
In this analysis, transcripts are aligned to all known proteins from SwissProt in order to determine how much of each protein sequence is covered by its matching transcript, summarized as a distribution of % coverage. See the Trinity wiki for a [further explanation](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Counting-Full-Length-Trinity-Transcripts).

In [3]:
# download and extract SwissProt
sp_ftp_url = "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
download_command = "wget %s -P %s" %(sp_ftp_url, realpath(DATA_PATH))
extract_command = "gzip -d %s/uniprot_sprot.fasta.gz" % realpath(DATA_PATH)
send_commands_to_queue("get_SwissProt",[download_command, extract_command],queue_conf)

In [7]:
# build blast DB
sp_fasta = "%s/uniprot_sprot.fasta" % DATA_PATH
build_blast_db_commands = ['module load blast/blast240',
                          'makeblastdb -in %s -dbtype prot' % realpath(sp_fasta)]
send_commands_to_queue("build_SP_blast_DB",build_blast_db_commands,queue_conf, block=False)

In [8]:
# Perform the blast search, reporting only the top alignment.
# os.path.join supplies the path separator that plain concatenation with OUT_PATH omits.
blast_res = os.path.join(OUT_PATH, "ITAG_vs_SP.blastx.outfmt6")
blast_search_commands = ['module load blast/blast240',
                        'blastx -query %s -db %s -out %s -evalue 1e-20 -num_threads 10 -max_target_seqs 1 -outfmt 6'
                         %(realpath(itag32_cdna_fasta_path), realpath(sp_fasta), realpath(blast_res))]
send_commands_to_queue("blast_ITAG_vs_SP",blast_search_commands,queue_conf, block=False, n_cpu=10)

In [10]:
# use utility script provided within Trinity to parse blast result
script_path = "/share/apps/Trinity-v2.6.6/util/analyze_blastPlus_topHit_coverage.pl"
parse_blast_res_commands = ["%s %s %s %s"
                            %(script_path, realpath(blast_res), realpath(itag32_cdna_fasta_path), realpath(sp_fasta))]
send_commands_to_queue('parse_ITAG_blast_res', parse_blast_res_commands, queue_conf)

Job parse_ITAG_blast_res (job id 1508779) completed successfully


('1508779', '0')

In [13]:
# the histogram is written next to the blast result, with a ".hist" suffix
full_length_hist = blast_res + ".hist"
# pandas.DataFrame.from_csv was removed in pandas 1.0; read_csv is its replacement
display(pandas.read_csv(full_length_hist, sep='\t', index_col=0))

| #hit_pct_cov_bin | count_in_bin | >bin_below |
|---|---|---|
| 100 | 6751 | 6751 |
| 90 | 1538 | 8289 |
| 80 | 759 | 9048 |
| 70 | 534 | 9582 |
| 60 | 447 | 10029 |
| 50 | 418 | 10447 |
| 40 | 318 | 10765 |
| 30 | 260 | 11025 |
| 20 | 218 | 11243 |
| 10 | 62 | 11305 |
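
As a reading aid for the histogram above, the following sketch recomputes the cumulative `>bin_below` column from `count_in_bin` and expresses it as a percentage of all matched proteins. The counts are copied from the table; the variable names are illustrative. According to the Trinity wiki page linked earlier, each bin label is the upper edge of a 10% coverage range (bin 100 holds proteins covered by more than 90% and up to 100% of their length), so this interpretation should be checked against that page before quoting exact thresholds.

```python
from itertools import accumulate

# Values copied from the histogram table above
bins = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]
count_in_bin = [6751, 1538, 759, 534, 447, 418, 318, 260, 218, 62]

# Running sum from the highest-coverage bin downwards reproduces '>bin_below'
cumulative = list(accumulate(count_in_bin))
total_hits = cumulative[-1]

for bin_label, running in zip(bins, cumulative):
    share = 100.0 * running / total_hits
    print("bin %3d and above: %6d proteins (%.1f%% of matched proteins)"
          % (bin_label, running, share))
```

The same calculation can be applied to each newly assembled transcriptome so that its coverage profile is compared with the ITAG baseline on equal terms.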


## _S. lycopersicum_ pan transcriptome

The table below describes the data sets used in the analysis. Each data set corresponds to a single variant from a single study. Multiple tissues, conditions, developmental stages and library types may be included. A data set ID is composed of the study ID (s) and the variant ID (v); for example, s4v2 is variant 2 of study 4.

In [17]:
s_lyc_datasets_tsv = DATA_PATH + "/S_lyc_RNA_seq_datasets.tsv"
# pandas.DataFrame.from_csv was removed in pandas 1.0; read_csv is its replacement
s_lyc_datasets = pandas.read_csv(s_lyc_datasets_tsv, sep='\t', index_col=0)
s_lyc_datasets.index.name = 'ID'
s_lyc_datasets.reset_index(inplace=True)
display(s_lyc_datasets)

  


|  | ID | Variant | SRR list |
|---|---|---|---|
| 0 | s1v1 | Micro-Tom | DRR074670 DRR074671 DRR074672 DRR074673 |
| 1 | s2v1 | Moneymaker | ERR1533151 ERR1533152 ERR1533153 ERR1533156 ER... |
| 2 | s4v1 | PI114490 | SRR390335 SRR390336 |
| 3 | s4v2 | FL7600 | SRR389806 SRR389807 |
| 4 | s4v3 | NC84173 | SRR389808 SRR390315 |
| 5 | s4v4 | OH9242 | SRR390328 SRR390329 |
| 6 | s4v5 | T5 | SRR390330 SRR390331 |
| 7 | s5v1 | M82 | SRR863016 SRR863017 SRR863018 SRR863024 SRR863025 |
| 8 | s6v1 | Ailsa Craig | SRR863042 SRR863043 SRR863044 SRR863045 SRR863... |
| 9 | s7v1 | HG6-61 | SRR1759290 SRR1759289 SRR1759288 SRR1759287 SR... |
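
The data set IDs in the table follow an `s<study>v<variant>` pattern. A minimal sketch of splitting such an ID into its two parts is given here; `parse_dataset_id` is a hypothetical helper that is not part of the notebook's utilities, and it assumes both parts are plain integers.

```python
import re

# Assumed ID layout: 's', study number, 'v', variant number (e.g. "s4v2")
DATASET_ID_PATTERN = re.compile(r"^s(\d+)v(\d+)$")


def parse_dataset_id(dataset_id):
    """Split a data set ID such as 's4v2' into (study, variant) integers."""
    match = DATASET_ID_PATTERN.match(dataset_id)
    if match is None:
        raise ValueError("unexpected data set ID: %r" % dataset_id)
    return int(match.group(1)), int(match.group(2))


study, variant = parse_dataset_id("s4v2")
print(study, variant)
```

Grouping IDs by the study number makes it easy to see which variants share a source study, for example the five variants of study 4.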


Each data set was processed using the following steps:
1. Download raw RNA-seq data using fastq-dump (SRA Toolkit)
1. Unzip the downloaded files
1. Parse the downloaded file names to detect paired-end libraries and prepare for assembly
1. Align reads to the reference genome using TopHat and sort the BAM file using samtools
1. Use the alignment results and raw data for genome-guided assembly using Trinity
1. Run BUSCO on the assembly
1. Clean up: remove raw data and other intermediate files to save disk space
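
The paired-end detection step can be sketched as follows. The pipeline script itself is not shown in this notebook, so this is only an illustration under an assumption: that the files follow the `<run>_1.fastq` / `<run>_2.fastq` naming produced by `fastq-dump --split-files`. The function name `group_fastq_files` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming: SRRxxxxxx_1.fastq / SRRxxxxxx_2.fastq for mates,
# anything else is treated as a single-end library.
MATE_PATTERN = re.compile(r"^(?P<run>.+)_(?P<mate>[12])\.fastq$")


def group_fastq_files(file_names):
    """Return (paired, single): a list of (mate1, mate2) tuples and a list of single files."""
    mates = defaultdict(dict)
    single = []
    for name in sorted(file_names):
        match = MATE_PATTERN.match(name)
        if match:
            mates[match.group("run")][match.group("mate")] = name
        else:
            single.append(name)
    paired = []
    for run, files in sorted(mates.items()):
        if "1" in files and "2" in files:
            paired.append((files["1"], files["2"]))
        else:
            # A mate file without its partner is handled as single-end
            single.extend(files.values())
    return paired, sorted(single)


paired, single = group_fastq_files(
    ["SRR390335_1.fastq", "SRR390335_2.fastq", "SRR390336.fastq"])
print(paired, single)
```

Keeping paired and single-end files separate matters because the aligner and the assembler take them through different arguments.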

Before these steps could be run, the reference genome was downloaded and indexed with Bowtie, as required for genome-guided transcriptome assembly. The reference annotation was also downloaded and used to improve read alignment.

In [12]:
FASTQ_DUMP_PATH = "/groups/itay_mayrose/liorglic/sratoolkit.2.9.0-ubuntu64/bin/fastq-dump"
download_to_dir = "%s/S_lyc_RNA_Seq" % realpath(DATA_PATH)
analysis_dir = "%s/S_lyc_RNA_Seq" % realpath(OUT_PATH)
TOPHAT_EXEC_PATH = "/share/apps/tophat210/bin/tophat"
SAMTOOLS_EXEC_PATH = "/share/apps/samtools12/bin/samtools"
BOWTIE_EXEC_PATH = "/share/apps/bowtie112/bin/bowtie-build"

In [10]:
!mkdir -p $download_to_dir

In [32]:
# download reference genome and annotation
S_lyc_ref_annotation_url = "ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_gene_models.gff"
!wget $S_lyc_ref_annotation_url -P $DATA_PATH
S_lyc_ref_genome_url = "ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/S_lycopersicum_chromosomes.3.00.fa"
!wget $S_lyc_ref_genome_url -P $DATA_PATH

--2018-05-27 14:53:08--  ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_gene_models.gff
           => “/groups/itay_mayrose/liorglic/Projects/tomato_pan_genome/data/ITAG3.2_gene_models.gff.2”
Resolving ftp.solgenomics.net... 132.236.81.147
Connecting to ftp.solgenomics.net|132.236.81.147|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /tomato_genome/annotation/ITAG3.2_release ... done.
==> SIZE ITAG3.2_gene_models.gff ... 56409515
==> PASV ... done.    ==> RETR ITAG3.2_gene_models.gff ... done.
Length: 56409515 (54M) (unauthoritative)


2018-05-27 14:53:20 (5.99 MB/s) - “/groups/itay_mayrose/liorglic/Projects/tomato_pan_genome/data/ITAG3.2_gene_models.gff.2” saved [56409515]

--2018-05-27 14:53:20--  ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/S_lycopersicum_chromosomes.3.00.fa
           => “/groups/itay_mayrose/liorglic/Projects/tomato_pan_genome/data/S_lycop

In [10]:
s_lyc_ref_annotation_path = "%s/ITAG3.2_gene_models.gff" % DATA_PATH
s_lyc_ref_genome_path = "%s/S_lycopersicum_chromosomes.3.00.fa" % DATA_PATH

In [35]:
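# index reference genome; the index base name is the genome path without its extension
# (defined here so that the cell does not depend on a later cell having been run first)
genome_index_base = os.path.splitext(s_lyc_ref_genome_path)[0]
bowtie_index_command = "%s %s %s" %(BOWTIE_EXEC_PATH, s_lyc_ref_genome_path, genome_index_base)
send_commands_to_queue("bowtie_index_s_lyc", [bowtie_index_command], queue_conf)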

Job bowtie_index_s_lyc (job id 3743509) completed successfully


('3743509', '0')

In [17]:
genome_index_base = os.path.splitext(s_lyc_ref_genome_path)[0]

In [11]:
ds = "s5v1"
sra = "SRR863016 SRR863017 SRR863018 SRR863024 SRR863025"
download_dir = "%s/S_lyc_RNA_Seq/%s" %(DATA_PATH, ds)
analysis_dir = "%s/S_lyc_transcriptome_assembly/%s" %(OUT_PATH,ds)

pipeline_script = "%s/transcriptome_assembly_pipeline.py" % PY_PATH
%run $pipeline_script $queue_conf $ds "$sra" $download_dir $analysis_dir -a $s_lyc_ref_annotation_path -g $s_lyc_ref_genome_path --first_command 5

FileNotFoundError: [Errno 2] No such file or directory: '/groups/itay_mayrose/liorglic/Projects/tomato_pan_genome/data/S_lyc_RNA_Seq/s5v1'

In [12]:
ds = "s4v1"
sra = "SRR390335 SRR390336"
download_dir = "%s/S_lyc_RNA_Seq/%s" %(DATA_PATH, ds)
analysis_dir = "%s/S_lyc_transcriptome_assembly/%s" %(OUT_PATH,ds)
pipeline_script = "%s/transcriptome_assembly_pipeline.py" % PY_PATH
%run $pipeline_script $queue_conf $ds "$sra" $download_dir $analysis_dir -a $s_lyc_ref_annotation_path -g $s_lyc_ref_genome_path --first_command 4

Job s4v1_BUSCO (job id 3932479) completed successfully
Failed to clean seembly dir


In [13]:
get_job_exit_code(3932479)

137
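
The pipeline reports the BUSCO job as completed, yet the exit code returned above is 137. Under the common Unix convention, exit codes above 128 mean that the process was terminated by signal number `code - 128`. Whether the queueing system used here follows that convention is an assumption, not something this notebook confirms. The sketch below decodes a code on that basis; `describe_exit_code` is a hypothetical helper.

```python
import signal


def describe_exit_code(code):
    """Interpret a job exit code, assuming the '128 + signal number' convention."""
    if code == 0:
        return "success"
    if code > 128:
        signal_number = code - 128
        try:
            name = signal.Signals(signal_number).name
        except ValueError:
            name = "unknown signal"
        return "terminated by signal %d (%s)" % (signal_number, name)
    return "failed with exit status %d" % code


print(describe_exit_code(137))
```

If the assumption holds, 137 corresponds to SIGKILL, which on clusters frequently indicates that a job exceeded a memory or run-time limit, so the BUSCO result for s4v1 should be checked before it is used.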

In [None]:
ds = "s4v2"
sra = "SRR389806 SRR389807"
download_dir = "%s/S_lyc_RNA_Seq/%s" %(DATA_PATH, ds)
analysis_dir = "%s/S_lyc_transcriptome_assembly/%s" %(OUT_PATH,ds)
pipeline_script = "%s/transcriptome_assembly_pipeline.py" % PY_PATH
%run $pipeline_script $queue_conf $ds "$sra" $download_dir $analysis_dir -a $s_lyc_ref_annotation_path -g $s_lyc_ref_genome_path

In [None]:
send_commands_to_queue('try',['echo "hi"', 'sleep 30', 'echo "bye"'], queue_conf)