# Welcome to the Bash notebook for Neuronal Differentiation 2021

In this notebook, all the major steps of the RNASeq analysis process (using bash) are outlined to the best of my ability, so that everyone can follow along!

Entries will be formatted in the following manner:

1. **Input**: If the data is external to our shared directory at **/project/data/neuronaldifferentiation2021**, it will be provided as such (i.e. a hyperlink leading to the data or script used); if the data is internal to the directory, its path and filename will be included.

2. **Code**: This part contains the actual bash script/code used to execute any steps in the RNASeq analysis process.

3. **Output**: If the output is another file, it will be outlined as such; if the output is a set of files, the directory (and path) containing them will be presented; if the output is in some other format or transient in nature, it will be pasted directly into the notebook.

*Note: Running this code inside the notebook is not advised and would be pointless: the notebook is entirely separate from the Imperial College London servers, so none of the directories, files or required modules exist here*

*This notebook uses the following Bash kernel: https://github.com/takluyver/bash_kernel*

## Part 1: Acquiring the relevant RNASeq data and pre-processing it

**1.1** Acquiring the data from the SRA Cloud

* The raw human and mouse data, obtained through Illumina sequencing, can be found <a href="https://www.ncbi.nlm.nih.gov/Traces/study/?acc=%20%09PRJNA590754&o=acc_s%3Aa&s=SRR10503052,SRR10503053,SRR10503054,SRR10503055,SRR10503056,SRR10503057,SRR10503058,SRR10503059,SRR10503060,SRR10503061,SRR10503062,SRR10503063,SRR10503064,SRR10503065,SRR10503066,SRR10503067,SRR10503068,SRR10503069,SRR10503070,SRR10503071,SRR10503072,SRR10503073,SRR10503074,SRR10503075,SRR10503076,SRR10503077,SRR10503078,SRR10503079,SRR10503080,SRR10503081,SRR10503082,SRR10503083,SRR10503085,SRR10503086,SRR10503087,SRR10503088,SRR10503089,SRR10503090,SRR10503091,SRR10503092,SRR10503093,SRR10503094,SRR10503095,SRR10503096,SRR10503097,SRR10503100,SRR10503101,SRR10503103,SRR10503104,SRR10503105,SRR10503106,SRR10503107,SRR10503108,SRR10503109,SRR10503110,SRR10503111,SRR10503112,SRR10503113,SRR10503114,SRR10503115,SRR10503116,SRR10503117,SRR10503118,SRR10503119,SRR10503120,SRR10503121,SRR10503122,SRR10503123,SRR10503124,SRR10503125,SRR10503126,SRR10503127,SRR10503128,SRR10503129,SRR10503130,SRR10503131,SRR10503132,SRR10503133,SRR10503134,SRR10503135,SRR10503136" target="_blank">here</a>.


**1.2** The following script was designed to `wget` each individual sample listed above: 

In [None]:
wget https://sra-download.ncbi.nlm.nih.gov/traces/sra13/SRR/010256/*

wget https://sra-download.ncbi.nlm.nih.gov/traces/sra20/SRR/010256/*

wget https://sra-download.ncbi.nlm.nih.gov/traces/sra33/SRR/010256/*


**1.3** The resulting output was all 81 raw samples, in order, which can be found at **/project/data/neuronaldifferentiation2021/sra**
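
As a check on the download, the expected accession list can be regenerated and compared with the contents of the **sra** directory. The following is a minimal sketch rather than a script that was actually run: `list_accessions` and `report_missing` are hypothetical helper names, and the four accession ranges are read off the link in **1.1**.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original workflow.

# Print every accession in the study, one per line.
# The ranges are copied from the accession list in the link from 1.1.
list_accessions() {
    local range first last n
    for range in "10503052 10503083" "10503085 10503097" \
                 "10503100 10503101" "10503103 10503136"; do
        first=${range% *}   # text before the space
        last=${range#* }    # text after the space
        for ((n = first; n <= last; n++)); do
            printf 'SRR%d\n' "$n"
        done
    done
}

# Print the accessions that have no matching file in the given directory.
report_missing() {
    local dir=$1 acc
    list_accessions | while read -r acc; do
        # compgen -G succeeds only if the glob matches at least one path
        compgen -G "$dir/$acc*" > /dev/null || echo "$acc"
    done
}

list_accessions | wc -l   # number of accessions expected in the sra directory
```

The last line should report 81, matching the sample count in **1.3**. Running `report_missing /project/data/neuronaldifferentiation2021/sra` on the server would then list any sample that failed to download.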

**2.1** Converting the data into FASTQ format
* In order to make the sample data compatible with any future pipelines, it must be converted into FASTQ format.
* The toolkit used to make this conversion is **sratoolkit/2.10.9.0**, configured using the `vdb-config` command
* The raw data in its original format can be found at **/project/data/neuronaldifferentiation2021/sra** (from **1.3**)

**2.2** Due to the lengthy nature of this conversion process, a script named **convert.sh** was implemented using `nohup` to ensure that the process could run its course without being interrupted. This script is detailed below:

*Note:* ***/project/home20/nam220/ND2021*** *is a shortcut to the base directory used in* ***1.3***

In [None]:
#!/bin/sh

cd /project/data/neuronaldifferentiation2021/sra || exit 1 # work from the directory holding the raw SRA files

echo "Starting at $(date)" > convert.log # record the script start time in convert.log

module load sratoolkit/2.10.9.0 # load the module needed for sample conversion

for i in SRR* # iterate through each SRR sample
do
echo "Processing:  $i" >> convert.log # record which sample is being converted
fastq-dump -O /project/data/neuronaldifferentiation2021/fastq "$i" # convert the sample to FASTQ
done

echo "finished at $(date)" >> convert.log # record the script stop time

In [None]:
nohup ./convert.sh > messages.out 2>&1 & #runs the script in the background; standard output and error messages are stored in messages.out

**2.3** This script created two log-type output files: **/project/data/neuronaldifferentiation2021/sra/convert.log**, which tracks the time taken for the task, and **/project/data/neuronaldifferentiation2021/sra/messages.out**, which records any error messages and progress reports. According to **convert.log**, this task took approximately 3.5 hours.

The primary output of **convert.sh** is the set of FASTQ-formatted versions of the SRR files from **1.3**, which can be found in **/project/data/neuronaldifferentiation2021/fastq**

*Note: these files have been sorted by species into separate subdirectories within the* ***fastq*** *directory*
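
A converted file can be sanity-checked by confirming that it holds whole four-line FASTQ records. This is a minimal sketch under that assumption (one record is exactly four lines, with no wrapped sequences); `count_reads` is a hypothetical helper and not part of sratoolkit.

```shell
#!/bin/bash
# Hypothetical helper: print the number of reads in a FASTQ file, or fail
# if the line count is not a multiple of four (truncated or malformed file).
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    if [ $((lines % 4)) -ne 0 ]; then
        echo "ERROR: $1 has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Example on a tiny two-read file written to a temporary location
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$demo"
count_reads "$demo"   # prints 2
rm -f "$demo"
```

On the server the same function could be looped over the files in the **fastq** directory. A non-zero exit status would suggest that `fastq-dump` was interrupted for that sample.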

## Part 2.1: Simplified pipeline (known as orange_pipeline)

The orange pipeline is run manually using guidance provided in our genomics practical and **Protocol 1** from <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6373869/" target="_blank">this</a> paper.


**1.1** FastQC was first used to perform a quality control check on the raw reads. This required the .fastq files, which are in separate folders for each species at:

* **/project/home20/nam220/ND2021/fastq/human_fastq** for human
* **/project/home20/nam220/ND2021/fastq/mouse_fastq** for mouse

**1.2** The following commands were used to run this process in the background for each species:

In [None]:
module load fastqc

cd /project/home20/nam220/ND2021/fastq/human_fastq
nohup fastqc *.fastq -o /project/home20/nam220/ND2021/orange_pipeline/FastQC/human_qc/ > progress_human.out 2>&1 &

cd /project/home20/nam220/ND2021/fastq/mouse_fastq 
nohup fastqc *.fastq -o /project/home20/nam220/ND2021/orange_pipeline/FastQC/mouse_qc/ > progress_mouse.out 2>&1 &

**1.3** The output files are found in **/project/home20/nam220/ND2021/orange_pipeline/FastQC**
* The key output files are .html files named after the .fastq files, which provide a wealth of information regarding the QC parameters.
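
Alongside each .html report, FastQC writes a .zip archive containing a tab-separated `summary.txt` with a PASS, WARN or FAIL status per QC module. The sketch below lists only the modules that did not pass, which is quicker than opening every report. `flagged_modules` is a hypothetical helper, and the results directory passed as the first argument is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: read a FastQC summary.txt on stdin and print the
# modules that did not pass, as "STATUS<TAB>module name".
flagged_modules() {
    awk -F '\t' '$1 == "WARN" || $1 == "FAIL" { print $1 "\t" $2 }'
}

# Loop over every FastQC archive in a results directory
# (defaults to the current directory; the path is an assumption)
qc_dir=${1:-.}
for zip in "$qc_dir"/*_fastqc.zip; do
    [ -e "$zip" ] || continue            # skip if the glob matched nothing
    echo "== $(basename "$zip" .zip)"
    # unzip -p streams the matching archive member to stdout
    unzip -p "$zip" '*/summary.txt' | flagged_modules
done
```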

**2.1** CutAdapt was used for quality trimming of the FastQ files in:
* **/project/home20/nam220/ND2021/fastq/human_fastq** for human
* **/project/home20/nam220/ND2021/fastq/mouse_fastq** for mouse

**2.2** The following script was used to achieve this, run in the background using `nohup`:

*Note: The script included here is for the human fastq, the same would apply for the mouse fastq*

In [None]:
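#!/bin/sh

cd /project/data/neuronaldifferentiation2021/fastq/human_fastq || exit 1 # work from the directory holding the human FASTQ files

echo "Starting at $(date)" > trim.log # record the script start time

module load cutadapt/1.9.1-python3

for i in SRR* # iterate through each untrimmed sample
do
echo "Processing:  $i" >> trim.log
cutadapt -q 30,30 -o "tr_$i" "$i" # quality-trim both read ends with a Phred cutoff of 30
done

echo "finished at $(date)" >> trim.log # record the script stop time

for n in tr_SRR* # move the trimmed files into their own directory
do
echo "Moving: $n" >> movetrim.log
mv "$n" ./trimmed_human/
done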


In [None]:
nohup ./trimmer.sh > cutadapt_human.out 2>&1 &

**2.3** Three log output files are produced:

* cutadapt_human.out or cutadapt_mouse.out (depending on species)
* trim.log 
* movetrim.log 

These files tracked the progress of the script and noted any standard outputs or error messages.

The main files produced by this script are the tr_SRR* fastq files: the same reads as the originals, but quality-trimmed at both ends with a Phred score cutoff of 30. These are stored in the CutAdapt directory at **/project/home20/nam220/ND2021/orange_pipeline/CutAdapt/**
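
Because the trimming command sets no minimum length, a read whose bases all fall below the cutoff can be left with an empty sequence (cutadapt keeps such reads unless `-m`/`--minimum-length` is given). The sketch below counts them in a trimmed file. `count_empty_reads` is a hypothetical helper that assumes four lines per FASTQ record.

```shell
#!/bin/bash
# Hypothetical helper: count reads whose sequence line is empty in a
# four-line-per-record FASTQ file.
count_empty_reads() {
    awk 'NR % 4 == 2 && length($0) == 0 { n++ } END { print n + 0 }' "$1"
}

# Example: the second read has been trimmed down to nothing
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\n\n+\n\n' > "$demo"
count_empty_reads "$demo"   # prints 1
rm -f "$demo"
```

If a trimmed file reports a non-zero count, one option would be to rerun cutadapt with a minimum length such as `-m 1` so that empty reads are discarded before mapping.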

**3.1** The next step is to use STAR to index the reference genomes and align our trimmed .fastq reads to them. This is done in multiple steps: 

* First, the genome is indexed
* Second, the reads are aligned to the genome

The reference genome and annotation files were downloaded with `wget` from <a href="https://www.gencodegenes.org/" target="_blank">GENCODE</a>.

The following commands download the files for each species in the background using `nohup`, and `gunzip` is then used to decompress them:

In [None]:
# download the human annotation and genome in the background
nohup wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/GRCh38.primary_assembly.genome.fa.gz > genome_log.out 2>&1 &
wait # block until the download has finished before decompressing

gunzip gencode.v36.annotation.gtf.gz 
gunzip GRCh38.primary_assembly.genome.fa.gz

# download the mouse annotation and genome (this overwrites the human genome_log.out)
nohup wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz > genome_log.out 2>&1 &
wait # block until the download has finished before decompressing

gunzip gencode.vM25.annotation.gtf.gz
gunzip GRCm38.primary_assembly.genome.fa.gz

*Note: these are stored in our scratch directories, as they are disposable files (they're huge)*
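
Since these downloads are large, it can be worth confirming that each archive arrived intact before running `gunzip`. This is a minimal sketch using `gzip -t`, which tests an archive without extracting it. `is_complete_gzip` is a hypothetical helper, and the loop assumes it is run from the directory holding the human downloads.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file is a complete, valid gzip archive.
is_complete_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Report the state of each expected human download in the current directory
for f in gencode.v36.annotation.gtf.gz GRCh38.primary_assembly.genome.fa.gz; do
    if [ ! -e "$f" ]; then
        echo "$f: not found in $(pwd)"
    elif is_complete_gzip "$f"; then
        echo "$f: OK"
    else
        echo "$f: truncated or corrupt, download again"
    fi
done
```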

**3.2.1** Indexing: using the downloaded fasta and gtf files, we can create a genome index for each species, using **STAR** and `nohup` to run it in background:

In [None]:
module load star/2.6.0

nohup star --runThreadN 2 --runMode genomeGenerate --genomeDir /project/home20/nam220/ND2021/orange_pipeline/STAR/human_genome_annot --genomeFastaFiles /project/home20/nam220/scratch/genomes/human_genomes/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /project/home20/nam220/scratch/genomes/human_genomes/gencode.v36.annotation.gtf > STAR_human_log.out 2>&1 &

#this command is the one run for the human genome, a similar process was applied to the mouse genome


**3.2.2** Mapping: each trimmed fastq file is aligned to the genome index in genomeDir, with the gtf file supplied for splice-junction annotation. The command is run through the following script using `nohup`:

*Note: The script included here is for the human fastq, the same would apply for the mouse fastq*

In [None]:
#!/bin/sh

cd /project/data/neuronaldifferentiation2021/orange_pipeline/CutAdapt/trimmed_human || exit 1 # work from the directory holding the trimmed human reads

echo "Starting at $(date)" > STAR_nam220_errors.log # record the script start time

module load star/2.6.0

for i in tr_SRR* # iterate through each trimmed sample
do 
echo "Processing:  $i" >> STAR_nam220_errors.log
# align the sample to the indexed genome and write a coordinate-sorted BAM, using the sample name as the output prefix
star --runThreadN 2 --genomeDir /project/data/neuronaldifferentiation2021/orange_pipeline/STAR/human_genome_annot --sjdbGTFfile /project/scratch/nam220/genomes/human_genomes/gencode.v36.annotation.gtf --sjdbOverhang 100 --readFilesIn "$i" --outSAMtype BAM SortedByCoordinate --outFileNamePrefix "/project/data/neuronaldifferentiation2021/orange_pipeline/STAR/human_starmapped/$i"
done

echo "finished at $(date)" >> STAR_nam220_errors.log # record the script stop time

In [None]:
nohup ./map.sh > map_nam220_log.out 2>&1 &
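
The mapping script passes `--sjdbOverhang 100`, which is STAR's default. The STAR manual gives the ideal value as the read length minus one, while noting that the default generally works about as well. The sketch below shows one way to derive that ideal value from a FASTQ file. `max_read_length` is a hypothetical helper, and it would need to be run on the untrimmed reads, since trimming shortens them.

```shell
#!/bin/bash
# Hypothetical helper: print the length of the longest read among the first
# N records (default 10000) of a four-line-per-record FASTQ file.
max_read_length() {
    local fastq=$1 records=${2:-10000}
    awk -v limit="$records" '
        NR % 4 == 2 { if (length($0) > max) max = length($0) }
        NR >= limit * 4 { exit }
        END { print max + 0 }
    ' "$fastq"
}

# Example on a tiny two-read file
demo=$(mktemp)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' > "$demo"
len=$(max_read_length "$demo")
echo "max read length: $len, suggested --sjdbOverhang: $((len - 1))"
rm -f "$demo"
```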

**3.3** The following output files were produced for each trimmed tr_SRR*.fastq sample:

* tr_SRR10503052.fastqAligned.out.sam
* tr_SRR10503052.fastqAligned.sortedByCoord.out.bam
* tr_SRR10503052.fastqLog.final.out
* tr_SRR10503052.fastqLog.out 
* tr_SRR10503052.fastqLog.progress.out 
* tr_SRR10503052.fastqSJ.out.tab 
* tr_SRR10503052.fastq_STARgenome 

*Note: the same output files were created for all other samples*

These files are located in **/project/data/neuronaldifferentiation2021/orange_pipeline/STAR/human_starmapped** and **/project/data/neuronaldifferentiation2021/orange_pipeline/STAR/mouse_starmapped**
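
The `Log.final.out` file for each sample holds STAR's mapping statistics, including a `Uniquely mapped reads %` line. The sketch below pulls that figure out for every sample in a directory so that poorly mapped samples stand out. `uniquely_mapped_pct` is a hypothetical helper, and the directory passed as the first argument is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: print the "Uniquely mapped reads %" value from a
# STAR Log.final.out file (fields are separated by "|").
uniquely_mapped_pct() {
    awk -F '|' '/Uniquely mapped reads %/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Print "sample<TAB>percentage" for every log in a directory
# (defaults to the current directory; the path is an assumption)
map_dir=${1:-.}
for log in "$map_dir"/*Log.final.out; do
    [ -e "$log" ] || continue            # skip if the glob matched nothing
    printf '%s\t%s\n' "$(basename "$log" Log.final.out)" "$(uniquely_mapped_pct "$log")"
done
```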

**4.1** Using StringTie, we now count the reads mapped to the annotated genes. This uses only the sorted .bam files created previously (as seen in **3.3**) and the original .gtf annotation file (as seen in **3.1**).

**4.2.1** The following script was used along with `nohup` to count the reads:

*Note: this is the script for the human files*

In [None]:
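#!/bin/sh

cd /project/data/neuronaldifferentiation2021/orange_pipeline/StringTie/human_count || exit 1 # work from the human count directory

echo "Starting at $(date)" > StringTie_errors.log # record the script start time

module load stringtie/1.3.4c

for i in /project/data/neuronaldifferentiation2021/orange_pipeline/STAR/human_starmapped/*.bam # iterate through each mapped sample
do 
name=$(echo "$i" | cut -c81-94) # characters 81-94 of the full path are the sample name, e.g. tr_SRR10503052
echo "Processing:  $name" >> StringTie_errors.log
stringtie "$i" -p 2 -e -G gencode.v36.annotation.gtf -o "ballgown/$name/$name.gtf" # -e restricts counting to the annotated transcripts given by -G
done

echo "finished at $(date)" >> StringTie_errors.log # record the script stop time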

In [None]:
nohup ./counter.sh > stringtie_nam220_log.out 2>&1 &
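
The `cut -c81-94` step in the script depends on the exact length of the directory path, so it would silently return the wrong text if the files were moved. As a minimal sketch of a path-independent alternative, the sample name can be taken from the file name itself. `sample_name` is a hypothetical helper that assumes the sample name is everything before the first dot.

```shell
#!/bin/bash
# Hypothetical helper: strip the directory, then everything from the first
# dot onwards, leaving only the sample name.
sample_name() {
    local base
    base=$(basename "$1")
    echo "${base%%.*}"
}

sample_name /project/data/neuronaldifferentiation2021/orange_pipeline/STAR/human_starmapped/tr_SRR10503052.fastqAligned.sortedByCoord.out.bam
# prints tr_SRR10503052
```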

**4.2.2** In order for this data to be used in downstream analysis, a premade python script was run in `python 2.7.11` to convert the resulting files into **matrix.csv** files:

*Note the python script can be found at* ***/project/data/huntley/rnaseq_safe/StringTie/prepDE.py*** *and on the GitHub repository under* ***ND2021-ICL/pipe-ABC/RNASeqPipeline/StringTie***

In [None]:
module load python/2.7.11

nohup python prepDE.py > prepDE_nam220_log.out 2>&1 &

**4.2.3** At a later date, RSEM was also used to obtain read counts, in order to compare the outputs of the two programs:

In [None]:
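#!/bin/sh

cd /project/data/neuronaldifferentiation2021/orange_pipeline/STAR_RSEM || exit 1 # work from the STAR_RSEM directory

echo "Starting at $(date)" > RSEM_time_human.log # record the script start time

module load perl/5.20.3
module load rsem/1.3.1

for i in /project/data/neuronaldifferentiation2021/orange_pipeline/STAR_RSEM/human_starmapped/*.toTranscriptome.out.bam # iterate through each transcriptome-aligned sample
do 
name=$(echo "$i" | cut -c86-96) # characters 86-96 of the full path are the sample name, e.g. SRR10503052
echo "Processing:  $name" >> RSEM_time_human.log
# arguments after the options: input BAM, RSEM reference prefix, output prefix for this sample
rsem-calculate-expression --bam --no-bam-output -p 8 "$i" /project/data/neuronaldifferentiation2021/orange_pipeline/STAR_RSEM/human_RSEM_ref/ref "/project/data/neuronaldifferentiation2021/orange_pipeline/RSEM/human_counts/$name"

done

echo "finished at $(date)" >> RSEM_time_human.log # record the script stop time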


In [None]:
nohup ./RSEM_counter_human.sh > RSEM_nam220_log.out 2>&1 &

**4.3** The resulting output files can be found in **/StringTie/human_count** and **/StringTie/mouse_count**:

* .py, .sh and .log/out files for running the above scripts and tracking them
* transcript_count_matrix.csv
* gene_count_matrix.csv
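
Before passing a count matrix downstream, its dimensions can be checked against the expected number of samples. This is a minimal sketch, assuming the layout written by prepDE.py (a header row with `gene_id` followed by one column per sample, and no commas inside fields). `matrix_shape` is a hypothetical helper, and the gene names in the example are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: report the number of genes (rows minus the header)
# and samples (columns minus the gene_id column) in a count matrix.
matrix_shape() {
    awk -F ',' '
        NR == 1 { cols = NF - 1 }
        END { printf "%d genes x %d samples\n", (NR ? NR - 1 : 0), cols }
    ' "$1"
}

# Example on a tiny placeholder matrix
demo=$(mktemp)
printf 'gene_id,tr_SRR10503052,tr_SRR10503053\nGENE_A,10,0\nGENE_B,3,7\n' > "$demo"
matrix_shape "$demo"   # prints: 2 genes x 2 samples
rm -f "$demo"
```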

**At this point, these .csv files are used in conjunction with DESeq2 and DP_GP cluster for downstream analysis. Please refer to the RNASeqAnalysis and Clustering notebooks for the next instructions.**