# Astrangia de novo Transcriptome Assembly
The following pipeline was used to de novo assemble an *Astrangia poculata* transcriptome from RNAseq data. The following scripts were run on the LEAP server at Texas State University. The code here provides both the commands and the SBATCH parameters for the shell scripts that were submitted to the job manager at each step.

## 1. Install programs
The following pipeline uses FastQC, MultiQC, Cutadapt, bbmap, bowtie2, Trinity, BUSCO, and bioperl. If these are not already installed, you will need to do that before moving forward. On the LEAP server, these programs can be installed using the following scripts. Some of the provided install commands rely on conda, which can be installed following the instructions on the Miniconda website (https://docs.conda.io/en/latest/miniconda.html). 

*FastQC*

Download the FastQC zipfile (https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc) and unzip the file in the location you want it installed.

In [None]:
unzip fastqc_v0.11.9.zip

*MultiQC*

In [None]:
pip install multiqc 

*Cutadapt*

In [None]:
conda install -c bioconda cutadapt

*bbmap*

In [None]:
conda install -c bioconda bbmap

*Bowtie2*

In [None]:
conda install -c bioconda bowtie2

*Trinity*

In [None]:
conda install -c bioconda trinity

*BUSCO*

BUSCO can be installed using conda following the command below. However, before installing BUSCO you should make sure you have all of the necessary dependencies installed. For the list of dependencies, see the BUSCO website (https://busco.ezlab.org/busco_userguide.html#manual-installation).

In [None]:
conda install -c conda-forge -c bioconda busco=5.3.2

*bioperl*

In [None]:
conda install -c bioconda perl-bioperl

## 2. Download data
Data was downloaded from Novogene directly to the LEAP server using the `wget` code provided by Novogene. The data was downloaded to a folder designated for the transcriptome data and assembly titled `AstrangiaTranscriptome_042622` and within the transcriptome directory, the sequence files were sorted into a subdirectory titled `raw_data`.

In [None]:
mkdir AstrangiaTranscriptome_042622
cd AstrangiaTranscriptome_042622
mkdir raw_data
cd raw_data

## 3. Quality assessment
Raw RNAseq files were quality assessed first using FastQC and MultiQC.

In [None]:
~/FastQC/fastqc ~/AstrangiaTranscriptome_042622/raw_data/EB*.fq

In [None]:
multiqc ~/AstrangiaTranscriptome_042622/raw_data/

Once the MultiQC report was generated, it was copied to my local computer with `scp` and opened to view the sequence quality plots and statistics.

## 4. Trimming and quality filtering
Reads were filtered and trimmed using cutadapt: reads containing any N base calls were discarded (`--max-n 0`), and low-quality bases (Phred quality score <20) were trimmed from the 3' end of each read (`-q 20`). This was done by submitting the following script titled `cutadapt.sh` to the job manager:

In [None]:
#!/bin/bash
#SBATCH --job-name=cutadapt
#SBATCH -N 1
#SBATCH -t 6-24:00
#SBATCH --partition=shared
#SBATCH --mem=50G
#SBATCH --mail-type=end
#SBATCH --mail-user=eborbee@txstate.edu
#SBATCH -o trim_%j.out
#SBATCH -e trim_%j.err

cutadapt --max-n 0 -q 20 -o allReads_1_trimmed.fq -p allReads_2_trimmed.fq allReads_1.fq allReads_2.fq
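
After trimming, it can be worth confirming that the forward and reverse files still contain the same number of reads, since the later steps expect the pairs to stay in sync. The following is a minimal sketch rather than part of the original pipeline: it uses two tiny made-up FASTQ files in place of `allReads_1_trimmed.fq` and `allReads_2_trimmed.fq`, and it assumes four lines per read. The helper name `count_reads` is hypothetical.

```shell
#!/bin/bash
# Toy stand-ins for the trimmed files; these reads are invented for illustration.
tmp=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > "$tmp/R1.fq"
printf '@r1/2\nTTGA\n+\nIIII\n@r2/2\nCCAA\n+\nIIII\n' > "$tmp/R2.fq"

# A FASTQ record spans four lines (assuming unwrapped sequences),
# so the number of reads is the line count divided by four.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

n1=$(count_reads "$tmp/R1.fq")
n2=$(count_reads "$tmp/R2.fq")
echo "forward: $n1 reads, reverse: $n2 reads"

# Compare the two counts and report whether the pairs still line up.
if [ "$n1" -eq "$n2" ]; then
    sync_status="in sync"
else
    sync_status="MISMATCH"
fi
echo "pairs are $sync_status"

rm -rf "$tmp"
```

To use this on real data, point `count_reads` at the two trimmed files instead of the toy files.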

## 5. Separating host and symbiont reads
Before assembling you should start by separating host and symbiont reads into separate files. This is done using `bbsplit` (command from bbmap) and mapping reads to the *Breviolum psygmophilum* reference transcriptome. Reads that map to the transcriptome will sort into one file, while the reads that don't map to the reference will sort into a separate file that will be designated as reads belonging to the host (*Astrangia poculata*). 

The *Breviolum psygmophilum* reference transcriptome can be accessed on the Reef Genomics database (http://zoox.reefgenomics.org/download/).

This process was done by submitting the following code in a script titled `bbsplit.sh` to the job manager:

In [None]:
#!/bin/bash
#SBATCH --job-name=bbsplit
#SBATCH -N 1
#SBATCH -t 13-24:00
#SBATCH --partition=himem
#SBATCH --mem=250G
#SBATCH --mail-type=end
#SBATCH --mail-user=eborbee@txstate.edu
#SBATCH -o bbsplit_%j.out
#SBATCH -e bbsplit_%j.err

~/miniconda3/bin/bbsplit.sh ref=~/BrevPsygmophilum_transcriptome/psyg_assembly_longest_250.fa \
in1=allReads_1_trimmed.fq in2=allReads_2_trimmed.fq basename=out_%.fa refstats=sampleStats.txt \
outu1=unmatched_reads1.fa outu2=unmatched_reads2.fa

The script will result in an output file with the reads mapping to the reference *B. psygmophilum* transcriptome (`out_psyg_assembly_longest_250.fa`), and two files for the forward and reverse reads that did not map to the reference (`unmatched_reads1.fa` and `unmatched_reads2.fa`). The two unmatched files will contain the sequences belonging to *Astrangia poculata* and will be used in the next step for the assembly.
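
A quick way to see how the reads were partitioned is to count the FASTA records in each output file and compute the share assigned to the host. This is a sketch under stated assumptions, not part of the original pipeline: the three files below are tiny invented stand-ins for the `bbsplit` outputs, and the helper name `count_records` is hypothetical.

```shell
#!/bin/bash
# Toy stand-ins for the bbsplit outputs; the sequences are invented for illustration.
tmp=$(mktemp -d)
printf '>s1/1\nACGT\n>s1/2\nTTGA\n' > "$tmp/out_symbiont.fa"
printf '>h1/1\nACGT\n>h2/1\nGGCC\n>h3/1\nTTAA\n' > "$tmp/unmatched_reads1.fa"
printf '>h1/2\nACGT\n>h2/2\nGGCC\n>h3/2\nTTAA\n' > "$tmp/unmatched_reads2.fa"

# Each FASTA record starts with a ">" header line, so counting those lines counts reads.
count_records() {
    grep -c '^>' "$1"
}

symbiont=$(count_records "$tmp/out_symbiont.fa")
host=$(( $(count_records "$tmp/unmatched_reads1.fa") + $(count_records "$tmp/unmatched_reads2.fa") ))

# Percentage of all reads that did not map to the symbiont reference.
host_pct=$(awk -v h="$host" -v s="$symbiont" 'BEGIN { printf "%.1f", 100 * h / (h + s) }')
echo "symbiont reads: $symbiont, host reads: $host (${host_pct}% host)"

rm -rf "$tmp"
```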

## 6. Transcriptome assembly
Once we have the symbiont and host reads separated, we can move into transcriptome assembly with the host reads. To assemble the transcriptome we use the program Trinity. For an explanation of how Trinity works, check out their GitHub wiki (https://github.com/trinityrnaseq/trinityrnaseq/wiki). This is a computationally heavy step and requires a high memory node on the LEAP server, as indicated in the SBATCH parameters in the script. The high memory nodes have a maximum time limit of 60 days. You should plan to request a minimum of 1 month to be sure the job has enough time to complete. The assembly can be run by submitting the following script titled `trinity.sh` to the job manager:

In [None]:
#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH -N 1
#SBATCH -t 30-24:00
#SBATCH --partition=himem
#SBATCH --mem=500G
#SBATCH --mail-type=end
#SBATCH --mail-user=eborbee@txstate.edu
#SBATCH -o trinity_%j.out
#SBATCH -e trinity_%j.err

Trinity --seqType fa --max_memory 500G --left unmatched_reads1.fa --right unmatched_reads2.fa

Trinity will generate an output directory titled `trinity_out_dir`. Inside that directory, you will find outputs from each step of the program, and the assembled transcriptome in fasta format titled `Trinity.fasta`. The sequences in the Trinity output are broken across multiple lines. You will need to remove these line breaks, so that each sequence sits on a single line, before moving forward with the next steps. You can do this with the following awk command:

In [None]:
awk '/^>/ {if (NR>1) printf("\n"); print; next} {printf("%s",$0)} END {printf("\n")}' Trinity.fasta > Trinity_fixed.fa
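
The effect of removing the line breaks can be checked on a tiny made-up FASTA file before running it on the full assembly. The sketch below uses an awk program with the same intent as the command above: header lines are printed as they are, and all sequence lines between two headers are joined into one. The file names are placeholders.

```shell
#!/bin/bash
# A small wrapped FASTA file, invented for illustration.
tmp=$(mktemp -d)
printf '>seq1\nACGT\nTTGA\n>seq2\nGG\nCC\nAA\n' > "$tmp/wrapped.fa"

# Headers: end the previous sequence line (except before the first header), then print the header.
# Sequence lines: print without a trailing newline so consecutive lines are joined.
awk '/^>/ { if (NR > 1) printf("\n"); print; next }
     { printf("%s", $0) }
     END { printf("\n") }' "$tmp/wrapped.fa" > "$tmp/unwrapped.fa"

cat "$tmp/unwrapped.fa"

# Two records should now take exactly four lines: header, sequence, header, sequence.
n_lines=$(awk 'END { print NR }' "$tmp/unwrapped.fa")
seq2=$(awk 'NR == 4' "$tmp/unwrapped.fa")
echo "lines: $n_lines, second sequence: $seq2"

rm -rf "$tmp"
```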

## 7. Evaluating assembly quality
### 7.1 Assessment of read content in transcriptome assembly
One way to evaluate the quality of a transcriptome assembly is to map reads from your original sequence files back to the newly assembled transcriptome. This can be done using bowtie2 and the following script, which also requires samtools to convert the alignments to `bam` format. This will take a few days to run on the LEAP server, so be sure to request an appropriate amount of time (I requested 7 days to be safe). More information on this process can be found here: https://github.com/trinityrnaseq/trinityrnaseq/wiki/RNA-Seq-Read-Representation-by-Trinity-Assembly.

In [None]:
#!/bin/bash
#SBATCH --job-name=bowtie
#SBATCH -N 1
#SBATCH -t 6-24:00
#SBATCH --partition=himem
#SBATCH --mem=250G
#SBATCH --mail-type=end
#SBATCH --mail-user=eborbee@txstate.edu
#SBATCH -o bowtie_%j.out
#SBATCH -e bowtie_%j.err

bowtie2-build Trinity_fixed.fa Trinity_fixed.fa

bowtie2 -p 10 -q --no-unal -k 20 -x Trinity_fixed.fa \
-1 ~/AstrangiaTranscriptome_042622/allReads_1_trimmed.fq \
-2 ~/AstrangiaTranscriptome_042622/allReads_2_trimmed.fq  \
     2>align_stats.txt| samtools view -@10 -Sb -o bowtie2.bam
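
The mapping summary that bowtie2 writes to standard error ends up in `align_stats.txt`, and its last line reports the overall alignment rate. The sketch below pulls that number out with awk. The summary file here is a shortened, made-up example with invented counts, and it assumes the final line has the form `NN.NN% overall alignment rate`.

```shell
#!/bin/bash
# A shortened, made-up bowtie2 summary; the counts are invented for illustration.
tmp=$(mktemp -d)
cat > "$tmp/align_stats.txt" <<'EOF'
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
96.70% overall alignment rate
EOF

# Find the summary line and strip the percent sign from its first field.
rate=$(awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$tmp/align_stats.txt")
echo "overall alignment rate: ${rate}%"

rm -rf "$tmp"
```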

### 7.2 Evaluating completeness of ortholog content with BUSCO
BUSCO allows us to evaluate the completeness of our transcriptome based on the content of highly conserved single-copy orthologs in closely related species. On the LEAP server, we have to run BUSCO in offline mode which means you need to download the lineage dataset and upload it to the server manually. Lineage datasets can be downloaded from the BUSCO website (https://busco-data.ezlab.org/v5/data/lineages/). Once downloaded, you will have to designate the path to the lineage dataset in the BUSCO script. BUSCO can be run by submitting the following script to the job manager:

In [None]:
#!/bin/bash
#SBATCH --job-name=busco
#SBATCH -N 1
#SBATCH -t 30-24:00
#SBATCH --partition=himem
#SBATCH --mem=250G
#SBATCH --mail-type=end
#SBATCH --mail-user=eborbee@txstate.edu
#SBATCH -o busco_%j.out
#SBATCH -e busco_%j.err

busco -i Trinity_fixed.fasta  \
-l ~/AstrangiaTranscriptome_042622/trinityOutput/trinity_out_dir/busco_downloads/eukaryota_odb10 \
-o busco_output -m transcriptome --offline

### 7.3 Counting number of transcripts in the assembly
A good transcriptome assembly should contain between 50,000 and 100,000 transcripts. Each transcript in the fasta file has one header line starting with `>`, so we can count the transcripts by using `fgrep` to count the lines containing `>` in the assembly fasta file with the following command.

In [None]:
fgrep -c ">" Trinity_fixed.fa

If you have a high number of transcripts you can try either a genome-guided assembly or using methods detailed below to reduce the number of transcripts by filtering isoforms and other steps. If you do not have a high number of transcripts, skip ahead to the "Annotating transcriptome" section of this file.
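
When deciding whether isoform filtering is worthwhile, it can help to compare the number of transcripts with the number of Trinity genes. The sketch below is an illustration, not part of the original pipeline. It assumes Trinity-style headers of the form `>TRINITY_DN0_c0_g1_i1`, where the trailing `_i<number>` marks the isoform, and it runs on three invented sequences.

```shell
#!/bin/bash
# Toy assembly with Trinity-style headers; the sequences are invented for illustration.
tmp=$(mktemp -d)
cat > "$tmp/toy.fa" <<'EOF'
>TRINITY_DN0_c0_g1_i1 len=8
ACGTACGT
>TRINITY_DN0_c0_g1_i2 len=6
ACGTAC
>TRINITY_DN1_c0_g1_i1 len=4
GGCC
EOF

# Transcripts: one header line per sequence.
n_transcripts=$(grep -c '^>' "$tmp/toy.fa")

# Genes: drop the leading ">" and the trailing isoform suffix, then count unique identifiers.
n_genes=$(grep '^>' "$tmp/toy.fa" \
    | awk '{ sub(/^>/, "", $1); sub(/_i[0-9]+$/, "", $1); print $1 }' \
    | sort -u | wc -l | tr -d ' ')

echo "transcripts: $n_transcripts, genes: $n_genes"

rm -rf "$tmp"
```

A large gap between the two counts suggests that many transcripts are isoforms of the same gene.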

### 7.4 Assembly statistics (N50)
Nx statistics summarize transcript lengths: when transcripts are sorted from longest to shortest, Nx is the length of the transcript at which the running total of bases first reaches x% of the total assembly length. In other words, at least x% of the assembled bases are in transcripts of that length or longer. The statistic most commonly reported in publications is the N50 value. N10, N20, N30, N40, and N50 values can all be calculated using the `TrinityStats.pl` script found in the `util` directory of the `trinityrnaseq` GitHub repository. The script is explained at the link below.

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Contig-Nx-and-ExN50-stats

In [None]:
~/trinityrnaseq/util/TrinityStats.pl Trinity_fixed.fa
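
To make the N50 definition concrete, the sketch below computes it by hand for a toy assembly. It is an illustration only and does not replace the Trinity script. It assumes each sequence is on a single line, and the five sequences (lengths 5, 4, 3, 2, and 1) are invented. The total length is 15, so the running total first reaches half of that (7.5) at the second-longest transcript.

```shell
#!/bin/bash
# Toy assembly with transcript lengths 5, 4, 3, 2 and 1; invented for illustration.
tmp=$(mktemp -d)
printf '>t1\nACGTA\n>t2\nACGT\n>t3\nACG\n>t4\nAC\n>t5\nA\n' > "$tmp/toy.fa"

# 1. Print the length of every sequence line (assumes one line per sequence).
# 2. Sort the lengths from longest to shortest.
# 3. Walk down the list until the running total reaches half of the total length.
n50=$(awk '!/^>/ { print length($0) }' "$tmp/toy.fa" \
    | sort -rn \
    | awk '{ len[NR] = $1; total += $1 }
           END {
               for (i = 1; i <= NR; i++) {
                   cum += len[i]
                   if (cum >= total / 2) { print len[i]; exit }
               }
           }')

echo "N50: $n50"

rm -rf "$tmp"
```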

## 8. Genome-guided assembly
Genome-guided assemblies work similarly to *de novo* assemblies, with the difference that they use a previously assembled genome for the species of interest as a reference for assembling transcripts. The code in this section was constructed using the script and information provided on the Trinity GitHub page linked below.

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Genome-Guided-Trinity-Transcriptome-Assembly

### 8.1 Installing additional programs
For a genome-guided assembly you will need to provide Trinity with read alignments to the reference genome as a coordinate-sorted `bam` file. This file can be generated using `GSNAP`, `TopHat`, or `STAR`. The code for installing each of these is provided below.

In [None]:
conda install -c compbiocore gsnap

In [None]:
conda install -c bioconda tophat

In [None]:
conda install -c bioconda star

### 8.2 Genome-guided assembly
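The following code runs the genome-guided assembly. Trinity takes as input a coordinate-sorted `bam` file (here `rnaseq.coordSorted.bam`) of RNAseq reads aligned to the reference genome with one of the aligners above, such as GSNAP.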

In [None]:
 Trinity --genome_guided_bam rnaseq.coordSorted.bam \
         --genome_guided_max_intron 10000 \
         --max_memory 10G --CPU 10 

## Annotating transcriptome
The steps used to annotate the new transcriptome assembly were adapted from Misha Matz's GitHub (https://github.com/z0on/annotatingTranscriptomes/blob/master/annotating%20trascriptome.txt). To start, we need to download a copy of the git repository to our accounts on the LEAP server so we have access to all of the scripts. You will also want to make a directory to store all of the transcriptome annotation files and navigate to that directory.

In [None]:
wget https://github.com/z0on/annotatingTranscriptomes/archive/master.zip
unzip master

In [None]:
mkdir transcriptomeAnnotations
cd transcriptomeAnnotations

### Download UniProt database
We will annotate the transcriptome using the UniProt database. To do this we have to first download and unzip the UniProt database in a new directory on our LEAP account. Then we will use `makeblastdb` from BLAST to construct the database on the LEAP server.

In [None]:
mkdir uniprotDB
cd uniprotDB
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz

In [None]:
makeblastdb -in uniprot_sprot.fasta -dbtype prot

In [None]:
cd ../

### BLAST transcriptome against UniProt database
The next step in annotating the transcriptome is BLASTing the sequences against the UniProt database. To make this process run quicker, we first split the transcriptome into 40 chunks that can then run in parallel. After the BLAST is complete, we can combine the outputs back into one.

In [None]:
../annotatingTranscriptomes-master/splitFasta.pl Trinity_fixed.fasta 40

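The following perl command will construct all of the necessary BLAST commands for each of the 40 subsets of the transcriptome generated above and place them in a file titled `blast.sh`. Once the commands are generated, you will have to add the `SBATCH` parameters to the top of the script and then change the permissions to make the script executable using `chmod`. Note that commands listed one per line in a single script run one after another within the job. To run the chunks in parallel, each command would need to be submitted as its own job or job-array task.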

In [None]:
ls subset* | perl -pe 's/^(\S+)$/blastx -query $1 -db uniprot_sprot\.fasta -evalue 0\.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out $1.br/'>blast.sh
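
The same list of commands can also be produced with a plain shell loop, which may be easier to read than the perl substitution. The sketch below is an alternative under stated assumptions, not the method used above: it runs against three empty placeholder files in a temporary directory and only writes the `blastx` command lines to a file. It does not run BLAST.

```shell
#!/bin/bash
# Three empty placeholder files standing in for the split transcriptome chunks.
tmp=$(mktemp -d)
for i in 1 2 3; do
    : > "$tmp/subset${i}_Trinity_fixed.fasta"
done

# Write one blastx command per subset file, using the same options as the perl one-liner.
for f in "$tmp"/subset*; do
    name=$(basename "$f")
    echo "blastx -query $name -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out $name.br"
done > "$tmp/blast.sh"

cat "$tmp/blast.sh"

# One command line is expected per subset file.
n_cmds=$(wc -l < "$tmp/blast.sh" | tr -d ' ')
first_cmd=$(head -n 1 "$tmp/blast.sh")
echo "commands written: $n_cmds"

rm -rf "$tmp"
```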

In [None]:
#!/bin/bash
#SBATCH --job-name=blast
#SBATCH -N 1
#SBATCH -t 30-24:00
#SBATCH --partition=himem
#SBATCH --mem=250G
#SBATCH --mail-type=end
#SBATCH --mail-user=eborbee@txstate.edu
#SBATCH -o blast_%j.out
#SBATCH -e blast_%j.err

blastx -query subset10_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset10_Trinity_fixed.fasta.br
blastx -query subset11_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset11_Trinity_fixed.fasta.br
blastx -query subset12_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset12_Trinity_fixed.fasta.br
blastx -query subset13_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset13_Trinity_fixed.fasta.br
blastx -query subset14_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset14_Trinity_fixed.fasta.br
blastx -query subset15_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset15_Trinity_fixed.fasta.br
blastx -query subset16_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset16_Trinity_fixed.fasta.br
blastx -query subset17_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset17_Trinity_fixed.fasta.br
blastx -query subset18_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset18_Trinity_fixed.fasta.br
blastx -query subset19_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset19_Trinity_fixed.fasta.br
blastx -query subset1_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset1_Trinity_fixed.fasta.br
blastx -query subset20_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset20_Trinity_fixed.fasta.br
blastx -query subset21_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset21_Trinity_fixed.fasta.br
blastx -query subset22_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset22_Trinity_fixed.fasta.br
blastx -query subset23_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset23_Trinity_fixed.fasta.br
blastx -query subset24_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset24_Trinity_fixed.fasta.br
blastx -query subset25_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset25_Trinity_fixed.fasta.br
blastx -query subset26_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset26_Trinity_fixed.fasta.br
blastx -query subset27_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset27_Trinity_fixed.fasta.br
blastx -query subset28_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset28_Trinity_fixed.fasta.br
blastx -query subset29_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset29_Trinity_fixed.fasta.br
blastx -query subset2_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset2_Trinity_fixed.fasta.br
blastx -query subset30_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset30_Trinity_fixed.fasta.br
blastx -query subset31_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset31_Trinity_fixed.fasta.br
blastx -query subset32_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset32_Trinity_fixed.fasta.br
blastx -query subset33_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset33_Trinity_fixed.fasta.br
blastx -query subset34_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset34_Trinity_fixed.fasta.br
blastx -query subset35_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset35_Trinity_fixed.fasta.br
blastx -query subset36_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset36_Trinity_fixed.fasta.br
blastx -query subset37_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset37_Trinity_fixed.fasta.br
blastx -query subset38_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset38_Trinity_fixed.fasta.br
blastx -query subset39_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset39_Trinity_fixed.fasta.br
blastx -query subset3_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset3_Trinity_fixed.fasta.br
blastx -query subset40_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset40_Trinity_fixed.fasta.br
blastx -query subset4_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset4_Trinity_fixed.fasta.br
blastx -query subset5_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset5_Trinity_fixed.fasta.br
blastx -query subset6_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset6_Trinity_fixed.fasta.br
blastx -query subset7_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset7_Trinity_fixed.fasta.br
blastx -query subset8_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset8_Trinity_fixed.fasta.br
blastx -query subset9_Trinity_fixed.fasta -db uniprot_sprot.fasta -evalue 0.0001 -num_threads 3 -num_descriptions 5 -num_alignments 5 -out subset9_Trinity_fixed.fasta.br

In [None]:
chmod a+x blast.sh

Once your script is executable, submit the script to the job manager using the `sbatch` command. The output from this script should be files ending in `.br` for each of the 40 subsets. Once that script has finished running, we concatenate those files together into a single output file titled `myblast.br`.

In [None]:
cat subset*br > myblast.br
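
Before concatenating, it may help to confirm that every subset produced a non-empty `.br` file, since a missing file would silently drop part of the transcriptome from `myblast.br`. The sketch below is a hypothetical check, not part of the original pipeline. It uses four expected subsets instead of 40, with invented placeholder files where one is empty and one is absent on purpose.

```shell
#!/bin/bash
# Placeholder outputs: subsets 1 and 2 have content, subset 3 is empty, subset 4 is absent.
tmp=$(mktemp -d)
printf 'hit\n' > "$tmp/subset1_Trinity_fixed.fasta.br"
printf 'hit\n' > "$tmp/subset2_Trinity_fixed.fasta.br"
: > "$tmp/subset3_Trinity_fixed.fasta.br"

expected=4   # would be 40 for the real run
missing=0
for i in $(seq 1 "$expected"); do
    f="$tmp/subset${i}_Trinity_fixed.fasta.br"
    # -s is true only when the file exists and is not empty.
    if [ ! -s "$f" ]; then
        echo "missing or empty: $(basename "$f")"
        missing=$((missing + 1))
    fi
done

echo "$missing of $expected outputs are missing or empty"

rm -rf "$tmp"
```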

After concatenating the output file you can either delete the subset files using the `rm` command, or you can organize them into a separate folder as done below.

In [None]:
mkdir subsets
mv subset*_* subsets/

### Annotating transcriptome with isogroup
For transcriptomes assembled using Trinity, each sequence in the assembled transcriptome file needs to be annotated with its isogroup (the Trinity gene identifier). We can do this using the following `perl` commands, where `transcriptome.fasta` is the assembled transcriptome file.

In [None]:
grep ">" transcriptome.fasta | perl -pe 's/>((TRINITY.+_g\d+)\S+)/$1\t$2/' >transcriptome_seq2iso.tab
cat transcriptome.fasta | perl -pe 's/>((TRINITY.+_g\d+)\S+)/>$1 gene=$2/' >transcriptome_iso.fasta
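
The shape of the sequence-to-isogroup table can be previewed on a toy file. The sketch below is an awk variant with the same goal as the perl commands above, not a drop-in replacement: it derives the isogroup by stripping the trailing `_i<number>` from the sequence name, and it keeps only the two identifier columns. The headers and sequences are invented and assume Trinity-style names.

```shell
#!/bin/bash
# Toy transcriptome with Trinity-style headers; invented for illustration.
tmp=$(mktemp -d)
cat > "$tmp/toy.fasta" <<'EOF'
>TRINITY_DN0_c0_g1_i1 len=8
ACGTACGT
>TRINITY_DN0_c0_g1_i2 len=6
ACGTAC
>TRINITY_DN1_c0_g2_i1 len=4
GGCC
EOF

# For each header: take the sequence name without ">", copy it,
# remove the isoform suffix from the copy, and print both separated by a tab.
awk '/^>/ { id = substr($1, 2); gene = id; sub(/_i[0-9]+$/, "", gene); print id "\t" gene }' \
    "$tmp/toy.fasta" > "$tmp/seq2iso.tab"

cat "$tmp/seq2iso.tab"

# Number of distinct isogroups in the second column.
n_iso=$(cut -f 2 "$tmp/seq2iso.tab" | sort -u | wc -l | tr -d ' ')
first_row=$(head -n 1 "$tmp/seq2iso.tab")
echo "isogroups: $n_iso"

rm -rf "$tmp"
```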