# RNA-Seq from scratch - Kallisto

In the experiment described [in this paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175744), a total of eight RNA-seq samples of ZIKV-infected and mock-infected human neural progenitor cells (hNPCs) were analyzed. Here we analyze the sequence data from that experiment, as available on the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA313294).

## Organizing our inputs

Kallisto needs only two inputs to run:

- A reference transcriptome
- FastQ files (your sequence data)

Since this data is already available on the computers you are connecting to, let's find, inspect, and organize it.

## Organizing files and directories

This notebook runs the bash shell. All of these commands could also be run from a terminal; if you do so on your own after the workshop, the commands will be similar, although the locations of your files may differ.

### Get the working directory and set the locations of files

First, let's go to our home directory. If you ever get lost there is no place like home:

In [None]:
cd

Let's see the contents of our home directory

In [None]:
ls

Another way of specifying the home directory is to use the `$HOME` shell variable:

In [None]:
ls $HOME

All of the data we need for our bulk RNA-Seq experiment should be in the `tutorial-data` folder; let's inspect its contents:

In [None]:
ls $HOME/tutorial-data

For this experiment, all of our input data files will be in the `kallisto-bulk/data` folder: 

In [None]:
ls $HOME/tutorial-data/kallisto-bulk/data

The `fastq_files` directory contains the data generated in the example experiment:

In [None]:
ls -R $HOME/tutorial-data/kallisto-bulk/data/fastq_files

Let's see which files came from paired-end sequenced libraries and which came from single-end sequenced libraries. A file in the `study_design` folder has this information:

In [None]:
cat $HOME/tutorial-data/kallisto-bulk/data/study_design/zikadesignmatrix.csv

Samples `SRR3194428`, `SRR3194429`, `SRR3194430`, and `SRR3194431` are single-end sequenced.

As you may remember from the previous exercise, we have already done some quality control on these files. Rather than repeat this, we will use the pre-processed data as inputs for Kallisto:

In [None]:
ls -R $HOME/tutorial-data/kallisto-bulk/analyses/pre-processed_fastp

### Obtain reference transcriptome

We obtained the human reference transcriptome from [Ensembl](https://uswest.ensembl.org/Homo_sapiens/Info/Index). Specifically, we want the set of cDNAs available from the [ftp site](ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/). Those data are available here:

In [None]:
ls $HOME/tutorial-data/kallisto-bulk/data/transcriptomes

### Using Kallisto to index the transcriptome

We will now use Kallisto's indexing function to prepare the transcriptome for analysis. First let's organize our files:

In [None]:
cd $HOME/tutorial-data/kallisto-bulk/indicies
ls

The `pre-processed-index` folder contains a backup index, but we can make our own. 

Now we are finally ready to use Kallisto. Let's check that Kallisto is installed, check its version, and get a little help:

In [None]:
kallisto

Next, run the indexing command (5-7 minutes). This prepares the transcriptome so that we can pseudoalign reads to it.

In [None]:
kallisto index --index="human_GRCh38_transcriptome_index" $HOME/tutorial-data/kallisto-bulk/data/transcriptomes/Homo_sapiens.GRCh38.cdna.all.fa.gz

We now have a transcriptome index that can be used for pseudoalignment. As long as we keep using this version of the transcriptome, we can reuse this index for all our future Kallisto experiments, with no need to index again.

In [None]:
ls
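
Since the index can be reused, it can help to check for it before spending several minutes rebuilding it. The following is a minimal sketch, not part of the original workshop: `index_status` is a hypothetical helper, and the index path is assumed to be the `indicies` folder and `--index` name used above.

```shell
# Hypothetical helper: report whether an index file already exists and is non-empty.
index_status() {
    if [ -s "$1" ]; then
        echo "reuse"
    else
        echo "build"
    fi
}

# Assumed location, matching the folder and --index name used above.
index="$HOME/tutorial-data/kallisto-bulk/indicies/human_GRCh38_transcriptome_index"

if [ "$(index_status "$index")" = "reuse" ]; then
    echo "Found $index; no need to run kallisto index again"
else
    echo "No index at $index; run the kallisto index command above"
fi
```

The `-s` test is true only when the file exists and is not empty, so an interrupted run that left an empty file behind is reported as `build`.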

### Quantify reads

In this final step, we will run Kallisto on all of our files to quantify the reads. We will create a directory and then write some bash shell for loops to run kallisto independently on each read file (single-end) or pair of read files (paired-end). 

In [None]:
cd $HOME/tutorial-user/tutorial-data/kallisto-bulk/analyses/pre-processed_fastp/
ls -R

All instructions for the commands we are using are in the Kallisto manual: https://pachterlab.github.io/kallisto/manual. 

First, let's do the analysis of the single-end reads, based on the parameters used in the paper:

In [None]:
cd $HOME/tutorial-user/tutorial-data/kallisto-bulk/analyses/pre-processed_fastp/single-end
# Quantify each single-end file; --single requires --fragment-length and --sd
for file in *.fastq.gz; do
    # Strip ".fastq.gz" and append "_quant" to name the output folder
    output="${file%.*.*}_quant"
    kallisto quant \
        --threads=4 \
        --single \
        --index="$HOME/tutorial-data/kallisto-bulk/indicies/human_GRCh38_transcriptome_index" \
        --bootstrap-samples=100 \
        --fragment-length=187 \
        --sd=70 \
        --output-dir="$output" \
        "$file"
done
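
The loop above names each output folder with the `${file%.*.*}` parameter expansion. Here is a small sketch of what that expansion does, which you can run without Kallisto. The file names are hypothetical and assume the pre-processed single-end files follow a `fastp_<run>.fastq.gz` pattern.

```shell
# Hypothetical file names (assumed pattern: fastp_<run>.fastq.gz).
for file in fastp_SRR3194428.fastq.gz fastp_SRR3194429.fastq.gz; do
    # %.*.* removes the shortest trailing match of ".<anything>.<anything>",
    # which for these names is ".fastq.gz".
    output="${file%.*.*}_quant"
    echo "$file -> $output"
done
```

Printing the names like this before running the real loop is a cheap way to confirm that each output folder will get the name you expect.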

Now we should have four folders containing results of our quantification of the single-end reads:

In [None]:
ls $HOME/tutorial-user/tutorial-data/kallisto-bulk/analyses/pre-processed_fastp/single-end

Let's create a folder for these results and organize the outputs there. The `-p` option keeps `mkdir` from failing if the folder already exists:

In [None]:
mkdir -p $HOME/tutorial-data/kallisto-bulk/analyses/kallisto-quantification/

Let's move the folders to our analyses folder:

In [None]:
mv $HOME/tutorial-user/tutorial-data/kallisto-bulk/analyses/pre-processed_fastp/single-end/*/ $HOME/tutorial-data/kallisto-bulk/analyses/kallisto-quantification

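Next, let's do the same for the paired-end reads. Kallisto estimates the fragment length from the read pairs, so `--fragment-length` and `--sd` are not needed here:

In [None]:
cd $HOME/tutorial-user/tutorial-data/kallisto-bulk/analyses/pre-processed_fastp/paired-end
for file in *_1.fastq.gz; do
    # Take the run accession (second "_"-separated field) from the read 1 file name
    sample=$(echo "$file" | cut -f2 -d _)
    re1input="$file"
    re2input="fastp_${sample}_2.fastq.gz"
    output="fastp_${sample}_quant"
    kallisto quant \
        --threads=4 \
        --index="$HOME/tutorial-data/kallisto-bulk/indicies/human_GRCh38_transcriptome_index" \
        --bootstrap-samples=100 \
        --output-dir="$output" \
        "$re1input" "$re2input"
done
# Move the paired-end results next to the single-end results
mv $HOME/tutorial-user/tutorial-data/kallisto-bulk/analyses/pre-processed_fastp/paired-end/*/ $HOME/tutorial-data/kallisto-bulk/analyses/kallisto-quantification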
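
The paired-end loop derives the second read file and the output folder from the name of the first read file. The sketch below isolates that name handling so you can try it without Kallisto. The file name is hypothetical and assumes the `fastp_<run>_1.fastq.gz` pattern that the loop relies on.

```shell
# Hypothetical read 1 file name (assumed pattern: fastp_<run>_1.fastq.gz).
file="fastp_SRR3191542_1.fastq.gz"

# Split the name on "_" and keep the second field, the run accession.
sample=$(echo "$file" | cut -f2 -d _)

re1input="$file"
re2input="fastp_${sample}_2.fastq.gz"
output="fastp_${sample}_quant"

echo "sample: $sample"
echo "reads:  $re1input $re2input"
echo "output: $output"
```

Because the accession is taken by field position, this only works while every file name has exactly the same `_`-separated layout.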

We should now have folders for the results of both the single-end and the paired-end quantification:

In [None]:
ls $HOME/tutorial-data/kallisto-bulk/analyses/kallisto-quantification/

### Examining Kallisto Outputs

Given the parameters we used, we expect the following outputs (as taken from the [kallisto manual](https://pachterlab.github.io/kallisto/manual)):

- **abundance.h5**: HDF5 binary file containing run info, abundance estimates, bootstrap estimates, and transcript length information. This file can be read in by sleuth.
- **abundance.tsv**: Plaintext file of the abundance estimates. It does not contain bootstrap estimates (these can be output in plaintext by running with the `--plaintext` argument; alternatively, `kallisto h5dump` can be used to convert an HDF5 file to plaintext). The first line contains a header for each column, including estimated counts, TPM, and effective length.
- **run_info.json**: JSON file containing information about the run.

Let's look at the first few lines of the abundances in `fastp_SRR3191542_quant`:

In [None]:
head $HOME/tutorial-data/kallisto-bulk/analyses/kallisto-quantification/fastp_SRR3191542_quant/abundance.tsv
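
Beyond `head`, a few standard shell tools are enough for a first look at an abundance table. The sketch below builds a tiny stand-in file so it runs anywhere. The transcript names and numbers are made up for illustration, and only the column order (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`) is meant to mirror the `abundance.tsv` header.

```shell
# Made-up abundance table for illustration only; real values come from kallisto quant.
tmpdir=$(mktemp -d)
abundance="$tmpdir/abundance.tsv"
printf 'target_id\tlength\teff_length\test_counts\ttpm\n' > "$abundance"
printf 'tx_A\t1000\t814\t50\t12.5\n' >> "$abundance"
printf 'tx_B\t2000\t1814\t0\t0\n' >> "$abundance"
printf 'tx_C\t1500\t1314\t400\t61.9\n' >> "$abundance"

# Skip the header, sort by the tpm column (field 5) from highest to lowest,
# and keep the transcript ID and TPM of the top two rows.
top=$(tail -n +2 "$abundance" | sort -k5,5gr | head -n 2 | cut -f1,5)
echo "$top"

# Count transcripts with a non-zero estimated count (field 4).
expressed=$(awk -F '\t' 'NR > 1 && $4 > 0 { n++ } END { print n + 0 }' "$abundance")
echo "expressed transcripts: $expressed"

rm -r "$tmpdir"
```

To try this on real output, point `abundance` at one of the `abundance.tsv` files in the `kallisto-quantification` folder instead of the stand-in file.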

We can also get information about the run from the JSON file:

In [None]:
cat $HOME/tutorial-data/kallisto-bulk/analyses/kallisto-quantification/fastp_SRR3191542_quant/run_info.json
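
A quick number to pull out of the run information is the fraction of reads that pseudoaligned. The sketch below is a rough illustration under stated assumptions: it assumes the file has `n_processed` and `n_pseudoaligned` keys written one per line, `json_number` is a hypothetical helper, and the stand-in file uses made-up counts. A proper JSON parser would be more robust than `sed`.

```shell
# Stand-in run_info.json with made-up counts; the key names are an assumption.
tmpdir=$(mktemp -d)
run_info="$tmpdir/run_info.json"
cat > "$run_info" <<'EOF'
{
	"n_processed": 1000,
	"n_pseudoaligned": 850
}
EOF

# Hypothetical helper: print the numeric value stored under a key,
# assuming one "key": value pair per line.
json_number() {
    sed -n "s/.*\"$1\": *\([0-9.]*\).*/\1/p" "$2"
}

processed=$(json_number n_processed "$run_info")
aligned=$(json_number n_pseudoaligned "$run_info")

# Percentage of processed reads that pseudoaligned, to one decimal place.
rate=$(awk -v a="$aligned" -v p="$processed" 'BEGIN { printf "%.1f", 100 * a / p }')
echo "pseudoaligned: $aligned of $processed reads ($rate%)"

rm -r "$tmpdir"
```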