# RNA-Seq from scratch - Kallisto

Notes on the example data and experiment

## Organizing our inputs

Kallisto needs only two inputs to run:

- A reference transcriptome
- FastQ files (your sequence data)

Since this data is already available on the computers you are connecting to, let's find, inspect, and organize it.

## Organizing files and directories

This notebook is running the bash shell. We could run all of these commands from a terminal instead; if you repeat them on your own after the workshop, the commands will be similar, although the locations of your files may differ.

### Get the working directory and set the locations of files

First, let's go to our home directory. If you ever get lost there is no place like home:

In [2]:
cd

Let's see the contents of our home directory:

In [4]:
ls

-i  kallisto-rnaseq-jupyter  notebooks  tutorial-data


All of the data we need for our bulk RNA-Seq experiment should be in the `tutorial-data` folder; let's inspect its contents:

In [6]:
ls ./tutorial-data

data           kallisto_bulk_rna-seq_ouputs  kallisto-single-cell
kallisto-bulk  kallisto_sc_rna-seq_ouputs    processed_files


For this experiment, all of our input data files will be in the `kallisto-bulk/data` folder: 

In [8]:
ls ./tutorial-data/kallisto-bulk/data

fastq_files  hnpc_SRA_accessions.txt  study_design  transcriptomes


The `fastq_files` directory contains the data generated from the example experiment:

In [10]:
ls ./tutorial-data/kallisto-bulk/data/fastq_files

SRR3191541_1.fastq.gz  SRR3191544_1.fastq.gz  SRR3194428_1.fastq.gz
SRR3191541_2.fastq.gz  SRR3191544_2.fastq.gz  SRR3194429_1.fastq.gz
SRR3191543_1.fastq.gz  SRR3191545_1.fastq.gz  SRR3194430_1.fastq.gz
SRR3191543_2.fastq.gz  SRR3191545_2.fastq.gz  SRR3194431_1.fastq.gz


As you may remember from the previous exercise, we have already done some quality control on these files. Rather than repeat it, let's move those QC files to this directory. First, though, let's separate the files into the set that was paired-end sequenced and the set that was single-end sequenced. A file in the `study_design` folder has this information:

In [13]:
cat ./tutorial-data/kallisto-bulk/data/study_design/zikadesignmatrix.csv

﻿,sample,condition,read,fragments,Instrument,LoadDate
1,SRR3191542,mock,paired,7927777,Illumina MiSeq,2016-02-26
2,SRR3191543,mock,paired,7391076,Illumina MiSeq,2016-02-26
3,SRR3191544,zika,paired,7361527,Illumina MiSeq,2016-02-26
4,SRR3191545,zika,paired,7621347,Illumina MiSeq,2016-02-26
5,SRR3194428,mock,single,72983243,NextSeq 500,2016-02-29
6,SRR3194429,mock,single,94729809,NextSeq 500,2016-02-29
7,SRR3194430,zika,single,71055823,NextSeq 500,2016-02-29
8,SRR3194431,zika,single,66528035,NextSeq 500,2016-02-29
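
The `read` column can also be filtered on the command line rather than by eye. Here is a minimal sketch, assuming the column order shown above (sample ID in column 2, read type in column 4). The temporary file holds a few rows copied from the table and exists only to keep the example self-contained:

```shell
# Recreate a few rows of the design matrix in a temporary file (illustration only).
design="$(mktemp)"
cat > "$design" <<'EOF'
,sample,condition,read,fragments,Instrument,LoadDate
1,SRR3191542,mock,paired,7927777,Illumina MiSeq,2016-02-26
5,SRR3194428,mock,single,72983243,NextSeq 500,2016-02-29
6,SRR3194429,mock,single,94729809,NextSeq 500,2016-02-29
EOF

# Skip the header (NR > 1) and print the sample ID of every single-end row.
single_samples="$(awk -F, 'NR > 1 && $4 == "single" {print $2}' "$design")"
echo "$single_samples"
```

On the workshop machine, the same `awk` command could be pointed at `./tutorial-data/kallisto-bulk/data/study_design/zikadesignmatrix.csv` instead of the temporary file.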

Samples `SRR3194428`, `SRR3194429`, `SRR3194430`, and `SRR3194431` are single-end sequenced, so we will separate the corresponding fastq files from the paired-end ones by creating a directory for each set:

In [17]:
mkdir ./tutorial-data/kallisto-bulk/data/fastq_files/single-end ./tutorial-data/kallisto-bulk/data/fastq_files/paired-end 

We now have two new directories:

In [20]:
ls ./tutorial-data/kallisto-bulk/data/fastq_files

paired-end             SRR3191543_2.fastq.gz  SRR3194428_1.fastq.gz
single-end             SRR3191544_1.fastq.gz  SRR3194429_1.fastq.gz
SRR3191541_1.fastq.gz  SRR3191544_2.fastq.gz  SRR3194430_1.fastq.gz
SRR3191541_2.fastq.gz  SRR3191545_1.fastq.gz  SRR3194431_1.fastq.gz
SRR3191543_1.fastq.gz  SRR3191545_2.fastq.gz


Let's change into this directory so our commands don't get too long:

In [22]:
cd ./tutorial-data/kallisto-bulk/data/fastq_files/

Now we can move the single-end files:

In [23]:
mv SRR3194428_1.fastq.gz SRR3194429_1.fastq.gz SRR3194430_1.fastq.gz SRR3194431_1.fastq.gz -t ./single-end

Only the paired-end files are left at this level, so a wildcard moves them all:

In [24]:
mv *.fastq.gz ./paired-end


Now we can see that both folders contain the appropriate sequence files:

In [34]:
ls -R

.:
paired-end  single-end

./paired-end:
SRR3191541_1.fastq.gz  SRR3191543_2.fastq.gz  SRR3191545_1.fastq.gz
SRR3191541_2.fastq.gz  SRR3191544_1.fastq.gz  SRR3191545_2.fastq.gz
SRR3191543_1.fastq.gz  SRR3191544_2.fastq.gz

./single-end:
SRR3194428_1.fastq.gz  SRR3194430_1.fastq.gz
SRR3194429_1.fastq.gz  SRR3194431_1.fastq.gz
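
If you would like to rehearse this reorganization before touching real data, the sketch below repeats the same steps in a throwaway directory. It assumes nothing about your real files: the scratch directory and the empty placeholder files are created only for the demonstration, and only a subset of the sample names is used.

```shell
# Work in a scratch directory with empty placeholder files (illustration only).
scratch="$(mktemp -d)"
cd "$scratch" || exit 1
touch SRR3191541_1.fastq.gz SRR3191541_2.fastq.gz \
      SRR3194428_1.fastq.gz SRR3194429_1.fastq.gz

# -p keeps mkdir from failing if the directories already exist.
mkdir -p single-end paired-end

# Move the single-end samples by name.
for sample in SRR3194428 SRR3194429; do
    mv "${sample}"_*.fastq.gz single-end/
done

# The glob matches only files ending in .fastq.gz, so the two
# directories themselves are never passed to mv.
mv ./*.fastq.gz paired-end/

ls -R
```

Quoting the variables and anchoring the glob with `./` guards against file names that contain spaces or begin with a dash.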


### Import transcriptome data

We obtained the human reference transcriptome data from [Ensembl](https://uswest.ensembl.org/Homo_sapiens/Info/Index). Specifically, we want the set of cDNAs available from the [ftp site]().

In [18]:
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz

--2018-11-27 01:40:38--  ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
           => ‘Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... failed: Name or service not known.
wget: unable to resolve host address ‘ftp.ensemblgenomes.org’


: 4

Verify the checksum of the downloaded file against the [published sum](ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/arabidopsis_thaliana/cdna/CHECKSUMS):

In [None]:
sum Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
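
The comparison itself can be scripted. The sketch below is a self-contained illustration and does not use the real download: `example.fa` and the local `CHECKSUMS` file are stand-ins, and the layout of the published file (checksum, block count, file name on each line) is an assumption you should check against the real one.

```shell
# Create a tiny stand-in file and a CHECKSUMS file in the assumed layout.
demo_dir="$(mktemp -d)"
printf 'ACGT\n' > "$demo_dir/example.fa"
printf '%s %s\n' \
    "$(sum < "$demo_dir/example.fa" | awk '{print $1, $2}')" \
    "example.fa" > "$demo_dir/CHECKSUMS"

# Look up the published values for our file, then recompute them locally.
# Adding 0 in awk strips zero padding so both sides compare as plain numbers.
expected="$(awk '$3 == "example.fa" {print $1 + 0, $2 + 0}' "$demo_dir/CHECKSUMS")"
actual="$(sum < "$demo_dir/example.fa" | awk '{print $1 + 0, $2 + 0}')"

if [ "$expected" = "$actual" ]; then
    checksum_status="OK"
else
    checksum_status="MISMATCH"
fi
echo "$checksum_status"
```

Reading the file on standard input (`sum < file`) sidesteps differences between `sum` versions in whether the file name is printed.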

Let's also organize our downloaded data

In [None]:
mkdir transcriptome && mv Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz transcriptome

### Index transcriptome

We will now use Kallisto's indexing function to prepare the transcriptome for analysis. First let's organize our files:

In [None]:
mkdir $HOME/analysis

Next, run the indexing command. This prepares the transcriptome so that we can pseudoalign reads to it.

In [None]:
kallisto index --index="athalianaTAIR10_index" $HOME/raw_data/transcriptome/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz

We now have a transcriptome index that can be used for pseudoalignment. We'll move it into the transcriptome folder:

In [None]:
mv athalianaTAIR10_index transcriptome/

### Quantify reads

In this final step, we will run Kallisto on all of our files to quantify the reads. We will write a for loop to do this. 

In [None]:
pwd
ls

All instructions for the commands we are using are in the Kallisto manual: https://pachterlab.github.io/kallisto/manual. Since we are using single-end read data, we need to provide the fragment length used for the library (200) and an estimate of its standard deviation; here we will have to guess (20). With paired-end reads, Kallisto estimates these values from the data itself, so they only have to be supplied for single-end reads.

*If needed, the results for this are located on the CyVerse Data commons at (/iplant/home/shared/cyverse_training/datasets/PRJNA79729/kallisto_quantified) and on the Amazon AMI in the dcuser home directory in the `.quantfied` folder.* 



In [None]:
cd "$HOME/raw_data/fastq_files"
# Quantify each FASTQ file as single-end reads; results go to <file name>_quant
for file in *.fastq.gz; do output="${file%.*.*}_quant"; kallisto quant\
 --index="$HOME/raw_data/transcriptome/athalianaTAIR10_index"\
 --single\
 --bootstrap-samples=25\
 --fragment-length=200\
 --sd=20\
 --output-dir="$output"\
 "$file"; done
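
The output folder names in the loop come from bash parameter expansion. The dry run below only prints names and never calls Kallisto, so it is safe to run anywhere. The second loop is a hypothetical extension that is not part of the workflow above: it shows how the two mate files of a paired-end sample could be matched up, since Kallisto takes both files of a pair when `--single` is omitted.

```shell
# Single-end: strip the last two extensions (.fastq.gz) and append _quant.
for file in SRR3194428_1.fastq.gz SRR3194429_1.fastq.gz; do
    output="${file%.*.*}_quant"
    echo "$file -> $output"
done

# Paired-end (hypothetical): derive the mate file and a shared
# output name from the _1 file of each pair.
for r1 in SRR3191543_1.fastq.gz SRR3191544_1.fastq.gz; do
    sample="${r1%_1.fastq.gz}"
    r2="${sample}_2.fastq.gz"
    echo "$r1 + $r2 -> ${sample}_quant"
done
```

Printing the names first is a cheap way to confirm that every input maps to a distinct output folder before starting a long quantification run.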

Finally, we can move our results folders into our analysis folder:

In [None]:
mv $HOME/raw_data/fastq_files/*/ $HOME/analysis 
ls $HOME/analysis
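
A quick sanity check is to confirm that every result folder contains an abundance table, since `kallisto quant` writes `abundance.tsv` (alongside `run_info.json`) into its output directory. The sketch below demonstrates the check on placeholder folders in a scratch directory. The folder names follow the `_quant` pattern used above, and everything else is made up for illustration.

```shell
# Build two placeholder result folders in a scratch directory (illustration only).
analysis="$(mktemp -d)"
mkdir -p "$analysis/SRR3194428_1_quant" "$analysis/SRR3194429_1_quant"
touch "$analysis/SRR3194428_1_quant/abundance.tsv"   # pretend this run finished

# Report, for each result folder, whether the abundance table is present.
missing=0
for dir in "$analysis"/*_quant/; do
    if [ -f "${dir}abundance.tsv" ]; then
        echo "ok:      $dir"
    else
        echo "missing: $dir"
        missing=$((missing + 1))
    fi
done
echo "$missing folder(s) without abundance.tsv"
```

To run the check on real results, set `analysis` to `$HOME/analysis` instead of the scratch directory.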