3. Obtain data files

Frederick Tan edited this page Oct 9, 2015 · 5 revisions

##Getting the RNA-seq data

Fastq-dump is a command included in the Sequence Read Archive (SRA) Toolkit. It lets you download publicly available sequencing runs from the SRA as fastq files via the command line. The SRA makes this sequence data available online to researchers in order to increase transparency, reproducibility, and collaboration within the scientific community.

We have preselected one individual's RNA-seq data for the analysis. The sample identifier is NA12878, but this individual has been sequenced many times, so there are many fastq files to pick from. To access the data from the specific sequencing run we have in mind, we will use its unique run identifier: SRR1153470.

We will store the fastq files in the subdirectory "rawdata". Switch to that directory and run the fastq-dump script with the following commands:

Command to type:

$ cd ~/rawdata
$ fastq-dump --split-files SRR1153470

More information about the --split-files argument is below.

Once the download is complete you should see two files in the rawdata directory:

File output:

SRR1153470_1.fastq SRR1153470_2.fastq

This experiment is "paired-end": the cDNA was chopped into many pieces about 100 bp in size, and then the sequence of each 100 bp cDNA fragment was read forward (SRR1153470_1.fastq) and backward (SRR1153470_2.fastq). The --split-files argument places the forward and reverse reads into separate files; without it they would be placed into the same file. When the sequence is read in only one direction, the experiment is called "single-end". To simplify your first alignment we will map only the forward reads, so our analysis will be single-end.
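
As a quick sanity check on the download, the sketch below counts the reads in each file and compares the two counts, since the forward and reverse files of a paired run should normally hold the same number of reads. The `count_reads` helper is hypothetical (it is not part of the SRA Toolkit), and it assumes the usual four-lines-per-read fastq layout.

```shell
# Hypothetical helper (not part of the SRA Toolkit): count the reads in a fastq file.
# Assumes each read occupies exactly 4 lines (header, sequence, "+", qualities).
count_reads() {
    # Total number of lines divided by 4 gives the number of reads.
    echo $(( $(wc -l < "$1") / 4 ))
}

fwd="$HOME/rawdata/SRR1153470_1.fastq"
rev="$HOME/rawdata/SRR1153470_2.fastq"

# Only run the comparison if both files are present.
if [ -f "$fwd" ] && [ -f "$rev" ]; then
    n_fwd=$(count_reads "$fwd")
    n_rev=$(count_reads "$rev")
    echo "forward reads: $n_fwd"
    echo "reverse reads: $n_rev"
    if [ "$n_fwd" -eq "$n_rev" ]; then
        echo "OK: read counts match"
    else
        echo "WARNING: read counts differ; the download may be incomplete"
    fi
fi
```

If the counts differ, re-running the fastq-dump command is probably the simplest fix.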

##Getting references for aligners

All aligners require a reference: the known sequence of the genome that the reads will be aligned to. We will be using four different aligners in this tutorial (blastmapper, hisat, bwa, and star), and each requires index files in its own format.

The following commands retrieve the reference file for human chromosome 20, save it under a shorter, more manageable name (the -O option of wget sets the output file name), and unzip it:

Command to type:

NOTE: The ftp link for wget is really long, so make sure you scroll the box all the way to the right (ends in .fna.gz)

$ cd ~/genome
$ wget -O chr20.fa.gz ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001405.30_GRCh38.p4/GCF_000001405.30_GRCh38.p4_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr20.fna.gz
$ gunzip chr20.fa.gz

To simplify your first alignment we will only be aligning to chromosome 20.
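
Before building any index, it can help to confirm that the reference unzipped correctly. The sketch below prints the header line(s) of the FASTA file and counts the sequences in it; for a single chromosome we would expect one header. The `fasta_summary` helper is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: print the header line(s) of a FASTA file
# and the number of sequences it contains.
fasta_summary() {
    # Header lines in a FASTA file start with ">".
    grep '^>' "$1"
    echo "sequences: $(grep -c '^>' "$1")"
}

ref="$HOME/genome/chr20.fa"

# Only summarize the reference if it has been downloaded and unzipped.
if [ -f "$ref" ]; then
    fasta_summary "$ref"
fi
```

An empty result or a missing file suggests the wget or gunzip step did not finish.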