Instructions to take raw transcriptomics reads and progress to a counts file that can be used in R packages for statistical analysis of differential expression.
To follow this tutorial you will need to have some basic familiarity with working in a unix-like environment. An excellent place to start would be to attend a software carpentry or data carpentry workshop. The VLSCI also have several tutorials for working with unix and/or a HPC environment
This will usually mean opening a terminal and using ssh to connect. The general format of this command is;
ssh username@my.hpc.server.edu.auOften the sequencing center will provide a url to download raw sequencing data do that a tool like wget can be used to do the download. For example;
wget -O SequencingData.tar https://data.sequencingcenter.edu.au/download?keyOnce you have downloaded the data it is a good idea to keep the original unmodified files somewhere.
How to do this depends on how the files were packaged up. The general tool to use is tar, possibly with options to unzip. Assuming the downloaded data is in a file called Project_ANDI1971.tar we would use this command to unpack.
tar -xvf Project_ANDI1971.tarFor gzipped files (.gz extension) you can use the -z option to unzip and untar at the same time.
tar -zxvf Project_ANDI1971.tar.gzNow move all these raw reads into a folder called raw_data
mkdir raw_data
mv Project_ANDI1971 raw_data/Check the quality of your sequence data with fastqc. If your reads contain adapter sequences you may need to trim those before going forward.
This might involve a whole extra step of assembling the transcriptome from your own data. Alternatively, a reference might already be available. We assume a reference transcriptome is already available and is called Trinity.fasta. Put this reference transcriptome inside the raw_data folder.
This is done with bowtie2. On many HPC systems this can be loaded by doing
module load bowtie2Check that it is installed by doing
which bowtie2This should return a path showing the location of the bowtie2 program
Now make a directory where we are going to do all the alignment and read counting. Change into this new directory
mkdir corset
cd corsetRun the following script to make symbolic links (aliases) for all your raw data files inside the corset directory
file_paths=`find ../raw_data/ -name *.fastq.gz`
for f in $file_paths; do
link_name=`basename $f`
ln -s $f $link_name
doneAlso make a symbolic link for the reference transcriptome
ln -s ../raw_data/Trinity.fasta transcriptome.fastaFrom within the corset directory run the bowtie2-build command to index the transcriptome
bowtie2-build transcriptome.fasta transcriptomeNow make a file called bowtie.sh containing the following script. Note that bowtie2 options used in this script were derived from this post on the corset users forum.
This script requires both bowtie2 and samtools to be installed and available. You might need to do module load samtools. The reason we need samtools is to avoid creating a large intermediate file in sam format. Instead we pipe outputs from bowtie2 directly to samtools which converts to the more compacts bam.
for f in `ls *R1*.fastq.gz`; do
r1=$f
r2=${f/R1/R2}
r12name=${f/R1/both}
outname=${r12name%.fastq.gz}
echo "Running bowtie2 on $outname and writing outputs to $outputs.bam"
bowtie2 --no-mixed --no-discordant --end-to-end --all \
--min-score L,-0.1,-0.1 \
--threads 32 \
-x transcriptome -1 $r1 -2 $r2 | samtools view --threads 4 -b - > $outname.bam
done