# RNA-seq calling and QC

This document shows how to use the following modules to prepare reference data, perform RNA-seq calling, quantify expression levels, and run quality control:

1. [`reference_data.ipynb`](../data_preprocessing/reference_data.html)
2. [`RNA_calling.ipynb`](calling/RNA_calling.html)
3. [`readCount_QC.ipynb`](QC/readCount_QC.html)

A minimal working example (MWE) is available on [Google Drive](https://drive.google.com/drive/u/0/folders/11kQv7PXozsKkgeqADH-28bC_kZ-w_oHo).

## Download reference data (first-time use)

In [None]:
sos run reference_data.ipynb download_hg_reference --cwd reference_data
sos run reference_data.ipynb download_gene_annotation --cwd reference_data
sos run reference_data.ipynb download_ercc_reference --cwd reference_data
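
Before reformatting, it can help to confirm that the downloads completed. The sketch below assumes the download steps write the files referenced by the later commands in this document directly into `reference_data`; `check_files` is a hypothetical helper, not part of the pipeline.

```shell
# check_files DIR FILE...: report every FILE that is missing or empty in DIR.
# Returns non-zero if at least one file is missing or empty.
check_files() {
    dir=$1
    shift
    status=0
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return $status
}

# File names are taken from the commands later in this document
# (assumed to be what the download steps produce).
check_files reference_data \
    ERCC92.fa \
    ERCC92.gtf \
    GRCh38_full_analysis_set_plus_decoy_hla.fa \
    Homo_sapiens.GRCh38.103.chr.gtf \
    || echo "Re-run the download steps for the files listed above." >&2
```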

## Reformat reference data (first-time use)

In [None]:
sos run reference_data.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container container/rna_quantification.sif

**Note: before running the command below, determine whether your RNA-seq data is stranded or unstranded. If it is stranded, add `--is-stranded` to the command.**

In [None]:
sos run reference_data.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container container/rna_quantification.sif
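
For stranded data, the flag can be appended conditionally. This is only a sketch: `gene_annotation_cmd` is a hypothetical helper that prints the command instead of running it, and every other argument is copied from the command above.

```shell
# gene_annotation_cmd yes|no: print the gene_annotation command,
# appending --is-stranded only when the first argument is "yes".
gene_annotation_cmd() {
    stranded=$1
    # Reuse the positional parameters as the argument list.
    set -- sos run reference_data.ipynb gene_annotation \
        --cwd reference_data \
        --ercc-gtf reference_data/ERCC92.gtf \
        --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
        --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
        --container container/rna_quantification.sif
    if [ "$stranded" = "yes" ]; then
        set -- "$@" --is-stranded
    fi
    printf '%s\n' "$*"
}

# Review the printed command before running it in a shell.
gene_annotation_cmd yes
```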

## Index reference data

In [None]:
sos run reference_data.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container container/rna_quantification.sif \
    --mem 40G

In [None]:
sos run reference_data.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container container/rna_quantification.sif \
    --mem 40G

## Perform data quality summary via `fastqc`

In [None]:
sos run RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --samples data/sample_fastq.list \
    --data-dir data \
    --container container/rna_quantification.sif
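
A quick pre-flight check can catch FASTQ files that are listed but absent from `--data-dir`. The layout of `data/sample_fastq.list` is not described here, so this sketch assumes a whitespace-separated text file in which FASTQ names appear as fields ending in `.fastq`, `.fq`, `.fastq.gz`, or `.fq.gz`, relative to the data directory. `check_fastqs` is a hypothetical helper.

```shell
# check_fastqs LIST DATA_DIR: print every FASTQ named in LIST that is not
# found under DATA_DIR. Returns non-zero if any file is missing.
check_fastqs() {
    list=$1
    data_dir=$2
    # Put each field on its own line and keep only FASTQ-looking names.
    missing=$(tr -s ' \t' '\n\n' < "$list" \
        | grep -E '\.(fastq|fq)(\.gz)?$' \
        | while read -r fq; do
              [ -e "$data_dir/$fq" ] || echo "$fq"
          done)
    if [ -n "$missing" ]; then
        echo "$missing"
        return 1
    fi
}

# Paths are the ones used in the fastqc command above.
if [ -f data/sample_fastq.list ]; then
    check_fastqs data/sample_fastq.list data \
        || echo "The FASTQ files listed above were not found under data/." >&2
fi
```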

## Call gene-level RNA expression via `rnaseqc`

In [None]:
sos run RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq \
    --samples data/sample_fastq.list \
    --data-dir data \
    --fasta_with_adapters_etc TruSeq3-PE.fa \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf \
    --container container/rna_quantification.sif \
    --mem 40G

## Call transcript-level RNA expression via `RSEM`

In [None]:
sos run RNA_calling.ipynb rsem_call \
    --cwd output/rnaseq \
    --samples data/sample_fastq.list \
    --data-dir data \
    --fasta_with_adapters_etc TruSeq3-PE.fa \
    --STAR-index reference_data/STAR_Index/ \
    --RSEM-index reference_data/RSEM_Index/ \
    --container container/rna_quantification.sif \
    --mem 40G

## Multi-sample RNA-seq QC

This step requires a different MWE dataset that contains multiple samples, available on [Google Drive](https://drive.google.com/drive/u/0/folders/1Rv2bWHBbX_tastTh49ToYVDMV6rFP5Wk).

In [None]:
sos run bulk_expression_QC.ipynb qc \
    --cwd output \
    --tpm-gct data/mwe.TPM.gct \
    --counts-gct data/mwe.Counts.gct \
    --container container/rna_quantification.sif
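
Since the QC step takes a TPM matrix and a counts matrix, it seems reasonable to confirm beforehand that both describe the same samples. The sketch below relies on the GCT v1.2 layout, where line 3 holds `Name`, `Description`, and then the sample IDs. `gct_samples` and `same_samples` are hypothetical helpers, and whether the workflow itself requires identical sample order is an assumption.

```shell
# gct_samples FILE: print the sample IDs of a GCT v1.2 file, one per line.
# Line 3 holds "Name", "Description", then the sample IDs (tab-separated).
gct_samples() {
    sed -n '3p' "$1" | tr '\t' '\n' | sed '1,2d'
}

# same_samples TPM COUNTS: succeed only if both files list identical
# sample IDs in the same order.
same_samples() {
    [ "$(gct_samples "$1")" = "$(gct_samples "$2")" ]
}

# Paths are the ones used in the qc command above.
if [ -f data/mwe.TPM.gct ] && [ -f data/mwe.Counts.gct ]; then
    same_samples data/mwe.TPM.gct data/mwe.Counts.gct \
        || echo "TPM and counts GCT files list different samples." >&2
fi
```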

## Multi-sample read count normalization

In [None]:
sos run bulk_expression_normalization.ipynb normalize \
    --cwd output \
    --tpm-gct data/mwe.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct data/mwe.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf \
    --sample-participant-lookup data/sampleSheetAfterQC.txt \
    --container container/rna_quantification.sif \
    --count-threshold 1