This data set is part of the Polistes dominula genome project, and details the assembly of the P. dominula transcriptome, as described in (Standage et al., Molecular Ecology, 2016). Included in this data set is the final transcript assembly itself, as well as documentation providing complete disclosure of the assembly procedure.
RNA-Seq reads were processed using Trimmomatic version 0.22 to remove adapter contamination and low-confidence base calls, and the processed reads were assembled using Trinity version r20131110. The assembled transcripts were then processed with mRNAmarkup version 10-3-2013 to remove contaminants and correct erroneously assembled chimeric transcripts.
Raw Illumina data is available from the NCBI Sequence Read Archive under the accession number PRJNA291219.
The GetTranscriptomeSRA.make
script automates the process of downloading these data files and converting them from SRA format to Fastq format.
This script in turn depends on the fastq-dump
command included in the SRA toolkit binary distribution.
./GetTranscriptomeSRA.make
First, we designated the number of processors available, and provided the path of the trimmomatic-0.22.jar
file distributed with the Trimmomatic source code distribution.
NumProcs=16
TrimJar=/usr/local/src/Trimmomatic-0.22/trimmomatic-0.22.jar
We then applied the following filters to each read pair (see run-trim.sh
for details).
- remove adapter contamination
- remove any nucleotides at either end of the read whose quality score is below 3
- trim the read once the average quality in a 5bp sliding window falls below 20
- discard any reads which, after processing, fall below the 40bp length threshold
for caste in q w
do
for rep in {1..6}
do
sample=${caste}${rep}
./run-trim.sh $sample $TrimJar $NumProcs
done
done
Trinity requires a pair of input files for paired-end data.
cat pdom-rnaseq-*-trim-1.fq > pdom-rnaseq-all-trim-1.fq
cat pdom-rnaseq-*-trim-2.fq > pdom-rnaseq-all-trim-2.fq
We then executed the Trinity assember using the --CuffFly
algorithm.
Trinity.pl --seqType fq \
--JM 100G \
--bflyHeapSpaceMax 50G \
--output pdom-trinity \
--CPU $NumProcs \
--left pdom-rnaseq-all-trim-1.fq \
--right pdom-rnaseq-all-trim-2.fq \
--full_cleanup \
--jaccard_clip \
--CuffFly
Contaminant, reference protein, and miRNA databases were collected as described in the mRNAmarkup documentation (db/0README
and db/0README-hy
).
The mRNAmarkup procedure was then run on the Trinity output.
Be sure to edit the mRNAmarkup.conf
file with the correct paths to the databases.
mRNAmarkup -c mRNAmarkup.conf \
-i pdom-trinity/Trinity.fasta \
-o output-mRNAmarkup
This work is licensed under a Creative Commons Attribution 4.0 International License.