# RNA Oxford Nanopore Processing and Analysis

* **Project:** African-ancestry intronic *GBA1* branch point variant
* **Language:** Bash 
* **Last updated:** 20-DEC-2023

## Notebook Overview
- Process raw RNA-seq data from basecalling to mapping
- Calculate coverage across regions and plot
- Get TPMs for *GBA1* transcripts

**Note**: This notebook shows the processing of the CRISPR-edited lines only. The other ONT RNA-seq data were processed the same way.

### CHANGELOG
20-DEC-2023: Notebook final draft

---

**CRISPR NAMING KEY**  \
    CT_37 --> ND01137_TT \
    CT_89 --> ND22789_GG \
    MT_37 --> ND01137_GG_Mock \
    MT_89 --> ND22789_TT_Mock \
    PT_37 --> ND01137_GT \
    PT_89 --> ND22789_GT \
    WT_89 --> ND22789_TT_OG 

In [3]:
MAIN=./GBA1_CRISPR/RNA/

In [None]:
cat $MAIN/sample_names_RNA.txt

MT_37   PAM80561
MT_89   PAM79877
CT_37   PAG71406
CT_89   PAM08329
PT_37   PAM72829
PT_89   PAM30096
WT_89   PAQ45921
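
The loops in the following sections all walk this two-column sheet with `while read -r first second`. A minimal sketch of that pattern, assuming a whitespace-separated sheet like the one above; the temporary file and the `n_jobs` counter are illustrative only and not part of the project scripts.

```bash
# Mock copy of the sample sheet (sample prefix, run identifier).
sheet=$(mktemp)
printf '%s\t%s\n' \
  MT_37 PAM80561 MT_89 PAM79877 CT_37 PAG71406 CT_89 PAM08329 \
  PT_37 PAM72829 PT_89 PAM30096 WT_89 PAQ45921 > "$sheet"
printf '\n' >> "$sheet"   # trailing blank line, as in the listing above

n_jobs=0
while read -r first second; do
  # Skip blank or incomplete rows instead of building a malformed job name.
  [ -n "$first" ] && [ -n "$second" ] || continue
  echo "${first}_${second}"
  n_jobs=$((n_jobs + 1))
done < "$sheet"

echo "rows read: $n_jobs"
rm -f "$sheet"
```

Reading from a redirect rather than `cat file |` keeps the loop in the current shell, so a counter such as `n_jobs` is still set after `done`.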


## 1a. Basecalling

In [None]:
cat sample_names_RNA.txt | while read -r first second ; do
sbatch --partition=gpu --cpus-per-task=10 --mem=50g --gres=gpu:a100:2,lscratch:200 --time=5-0 \
--wrap="bash basecalling_RNA_R9_NO_METH.sh $MAIN/CRISPR_"$first"_RNA/fast5/ $MAIN/CRISPR_"$first"_RNA/out_GUP/"
done

## 1b. Cleaning post basecalling

In [None]:
cat sample_names_RNA.txt | while read -r first second ; do
sbatch --mem=80g --cpus-per-task=5 --time=1-0 2_ONT_basecalling_clean_up.sh \
$MAIN/CRISPR_"$first"_RNA/out_GUP/pass/ \
"$first"_"$second"
done

In [None]:
cat sample_names_RNA.txt | while read -r first second ; do
(
# Subshell: each cd stays local, so a relative $MAIN keeps resolving
cd $MAIN/CRISPR_"$first"_RNA/out_GUP/ || exit
mkdir log_files
mv *log log_files
mv sequencing_summary.txt ../other_reports/
mv sequencing_telemetry.js ../other_reports/
mv log_files ../other_reports/
mv ./pass/*.fastq.gz ../
)
done

In [None]:
cat sample_names_RNA.txt | while read -r first second ; do
(
# Subshell keeps each cd local; stop before any rm if the cd fails
cd $MAIN/CRISPR_"$first"_RNA/out_GUP/ || exit
mv ./pass/pycoQC* ../other_reports/
mv ./pass/stats.pass.tsv ../other_reports/
rm ./pass/*.fastq
rm -r ./pass/
rm -r ./fail/
cd ../
du -sh ./out_GUP/
rm -r ./out_GUP/
)
done

## 2. Mapping

In [None]:
mkdir $MAIN/pychopper/
mkdir $MAIN/pychopper/stats/
mkdir $MAIN/pychopper/fastqs/
mkdir $MAIN/minimap2/

In [None]:
cat sample_names_RNA.txt | while read -r first second; do
sbatch --mem=80g --cpus-per-task=10 --time=4-0 --mail-type=END pychopper_minimap.sh "$first"_"$second" $first
done

## 3. Calculate coverage

In [None]:
mkdir $MAIN/depth/
cd $MAIN/depth/

Four files are needed: \
    - GBA1.bed : BED file with coordinates for GBA1 gene regions of interest \
    - GBA1_regions.txt : File with the names of these regions \
    - sample_names.txt : File with sample name prefixes \
    - sample_names_geno.txt : File with sample name prefixes in column 1 and their genotypes in column 2

In [None]:
cat GBA1.bed

chr1    155235845       155235885       intron8transcript
chr1    155235681       155235844       exon9
chr1    155236245       155236469       exon8
chr1    155235845       155236244       intron8
chr1    155235886       155236244       intron8_minus_transcript
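
As a quick check on the intervals above, the length of each region can be computed from the BED columns. This sketch assumes the file follows the standard BED convention (0-based start, exclusive end), so length is simply end minus start; the temporary file is a mock copy of `GBA1.bed`.

```bash
bed=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  chr1 155235845 155235885 intron8transcript \
  chr1 155235681 155235844 exon9 \
  chr1 155236245 155236469 exon8 \
  chr1 155235845 155236244 intron8 \
  chr1 155235886 155236244 intron8_minus_transcript > "$bed"

# BED intervals are half-open, so length = end - start.
region_lengths=$(awk -F'\t' '{ print $4 "\t" ($3 - $2) }' "$bed")
printf '%s\n' "$region_lengths"
# intron8transcript         40
# exon9                     163
# exon8                     224
# intron8                   399
# intron8_minus_transcript  358
rm -f "$bed"
```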


In [None]:
cat GBA1_regions.txt

intron8transcript
exon9
exon8
intron8
intron8_minus_transcript
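
The region names are the fourth column of `GBA1.bed`, in the same order. One way to keep the two files in sync is to derive the names file from the BED file; this is a sketch on mock copies in a temporary directory, not how the original files were necessarily produced.

```bash
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
  chr1 155235845 155235885 intron8transcript \
  chr1 155235681 155235844 exon9 \
  chr1 155236245 155236469 exon8 \
  chr1 155235845 155236244 intron8 \
  chr1 155235886 155236244 intron8_minus_transcript > "$workdir/GBA1.bed"

# Hand-written names file, as listed above.
printf '%s\n' intron8transcript exon9 exon8 intron8 intron8_minus_transcript \
  > "$workdir/GBA1_regions.txt"

# Compare column 4 of the BED file against the names file.
if cut -f4 "$workdir/GBA1.bed" | diff - "$workdir/GBA1_regions.txt" > /dev/null; then
  regions_status="in sync"
else
  regions_status="mismatch"
fi
echo "GBA1_regions.txt vs GBA1.bed: $regions_status"
rm -r "$workdir"
```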


In [2]:
cat sample_names_geno.txt

MT_37   GG
MT_89   TT
CT_89   GG
PT_37   GT
PT_89   GT
WT_89   TT
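
The genotype table feeds both coverage scripts, so it is worth confirming that every sample prefix has a genotype row. A minimal sketch, assuming `sample_names.txt` holds the seven prefixes from the sample sheet (its contents are not shown in this notebook); with the six genotype rows listed above it would flag CT_37.

```bash
workdir=$(mktemp -d)
# Assumed contents of sample_names.txt: one prefix per line.
printf '%s\n' MT_37 MT_89 CT_37 CT_89 PT_37 PT_89 WT_89 > "$workdir/sample_names.txt"
# Genotype table as listed above.
printf '%s\t%s\n' MT_37 GG MT_89 TT CT_89 GG PT_37 GT PT_89 GT WT_89 TT \
  > "$workdir/sample_names_geno.txt"

# comm needs sorted input; -23 keeps lines found only in the first file.
sort "$workdir/sample_names.txt" > "$workdir/prefixes.sorted"
cut -f1 "$workdir/sample_names_geno.txt" | sort > "$workdir/geno.sorted"
missing_geno=$(comm -23 "$workdir/prefixes.sorted" "$workdir/geno.sorted")

echo "Samples without a genotype: ${missing_geno:-none}"
rm -r "$workdir"
```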


In [None]:
# Regional coverage
sh GBA_regional_coverage.sh \
sample_names_geno.txt \
$MAIN/minimap2/ \
sample_names.txt
# Note: counting the primary aligned reads is the slowest step if the counts were not generated previously
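
`GBA_regional_coverage.sh` is not shown here, so the following is only a sketch of one plausible normalization: mean depth over a region divided by the number of primary aligned reads in millions. The depth values, the 3 bp window, and the read count are all made up, and whether the project script normalizes this way is an assumption.

```bash
depth=$(mktemp)
# Mock per-base depth in `samtools depth` layout: chrom, 1-based position, depth.
printf '%s\t%s\t%s\n' \
  chr1 155235845 5 \
  chr1 155235846 10 \
  chr1 155235847 20 \
  chr1 155235848 30 > "$depth"

start=155235845        # BED start (0-based) of a hypothetical 3 bp window
end=155235848          # BED end (exclusive)
primary_reads=2000000  # hypothetical count, e.g. from `samtools view -c -F 0x904`

norm_depth=$(awk -F'\t' -v s="$start" -v e="$end" -v n="$primary_reads" '
  # Depth positions are 1-based, so the BED interval covers s+1 through e.
  ($2 + 0) > (s + 0) && ($2 + 0) <= (e + 0) { sum += $3 }
  # Divide by region length so positions with no reported depth count as zero.
  END { printf "%.2f\n", (sum / (e - s)) / (n / 1e6) }
' "$depth")

echo "$norm_depth"   # 10.00
rm -f "$depth"
```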

In [None]:
# Whole gene coverage, GBA1 and GBAP1
sh GBA_whole_coverage.sh \
sample_names_geno.txt \
$MAIN/minimap2/ \
sample_names.txt
# Note: counting the primary aligned reads is the slowest step if the counts were not generated previously

### Make plots

First, replace the short sample names in the coverage files with the full sample names

In [None]:
# Regional
sed -i 's/MT_89/ND22789_TT_Mock/g; s/MT_37/ND01137_GG_Mock/g; s/PT_89/ND22789_GT/g; s/PT_37/ND01137_GT/g; s/CT_89/ND22789_GG/g; s/WT_37/ND01137_GG_OG/g; s/WT_89/ND22789_TT_OG/g' regional_cov_all.txt

In [None]:
# GBA1
sed -i 's/MT_89/ND22789_TT_Mock/g; s/MT_37/ND01137_GG_Mock/g; s/PT_89/ND22789_GT/g; s/PT_37/ND01137_GT/g; s/CT_89/ND22789_GG/g; s/WT_37/ND01137_GG_OG/g; s/WT_89/ND22789_TT_OG/g' cov_all_whole.txt

In [None]:
# GBAP1
sed -i 's/MT_89/ND22789_TT_Mock/g; s/MT_37/ND01137_GG_Mock/g; s/PT_89/ND22789_GT/g; s/PT_37/ND01137_GT/g; s/CT_89/ND22789_GG/g; s/WT_37/ND01137_GG_OG/g; s/WT_89/ND22789_TT_OG/g' cov_all_whole.GBAP1.txt
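
The same substitutions are applied to all three coverage files, so the key could be written once as a sed script and reused. A minimal sketch on a mock table; the `rename.sed` file name and the two-column layout of the mock coverage file are assumptions, since the real column layout is not shown here.

```bash
workdir=$(mktemp -d)
# Short-to-full name key, written once as a sed script.
cat > "$workdir/rename.sed" <<'EOF'
s/MT_89/ND22789_TT_Mock/g
s/MT_37/ND01137_GG_Mock/g
s/PT_89/ND22789_GT/g
s/PT_37/ND01137_GT/g
s/CT_89/ND22789_GG/g
s/WT_37/ND01137_GG_OG/g
s/WT_89/ND22789_TT_OG/g
EOF

# Mock coverage table with invented values.
printf '%s\t%s\n' MT_89 0.10 PT_37 0.25 WT_89 0.05 > "$workdir/regional_cov_all.txt"

# Without -i the input file is left untouched, which is handy for a dry run.
renamed=$(sed -f "$workdir/rename.sed" "$workdir/regional_cov_all.txt")
printf '%s\n' "$renamed"
rm -r "$workdir"
```

Adding `-i` to the `sed -f` call would edit each coverage file in place, as the cells above do.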

Run plots in R

In [None]:
Rscript plots.r
# Will output a png of each plot and a pdf with all plots compiled

## 4. Coverage stats

In [3]:
Rscript coverage_stats.r


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h[?25h
Call:
glm(formula = NORMALIZEDDEPTH ~ geno_new, data = only_transcript_intron)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.01942    0.07855   0.247   0.8146  
geno_new     0.26358    0.09294   2.836   0.0364 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.01234099)

    Null deviance: 0.160951  on 6  degrees of freedom
Residual deviance: 0.061705  on 5  degrees of freedom
AIC: -7.254

Number of Fisher Scoring iterations: 2

[?25h[?25h[?25h(Intercept)    geno_new 
 0.81458235  0.03642375 
[?25h[?25h
Call:
glm(formula = NORMALIZEDDEPTH ~ geno_new, data = exon8)

Coefficients:
            Estimate S

: 1
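
As a sanity check on the first model summary above (`only_transcript_intron`), the reported t value is the estimate divided by its standard error, and for a Gaussian GLM the dispersion is the residual deviance divided by the residual degrees of freedom. The numbers below are copied from that output; the arithmetic is only a sketch for reading the summary, not part of `coverage_stats.r`.

```bash
# geno_new: estimate 0.26358, standard error 0.09294
t_value=$(awk 'BEGIN { printf "%.3f", 0.26358 / 0.09294 }')
# Residual deviance 0.061705 on 5 degrees of freedom
dispersion=$(awk 'BEGIN { printf "%.5f", 0.061705 / 5 }')

echo "t = $t_value, dispersion = $dispersion"
# t = 2.836, dispersion = 0.01234
```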

## 5. Transcript quantification with StringTie2

In [None]:
mkdir $MAIN/stringtie/
cd $MAIN/stringtie/

In [None]:
# Run de novo
cat $MAIN/sample_names_RNA.txt | while read -r first second; do
sbatch --cpus-per-task=5 --mem=20g  --time=1-0 stringtie_denovo.sh \
$MAIN/stringtie/ \
$first \
$MAIN/minimap2/"$first".sorted.bam
done

In [None]:
# Run against a reference annotation that includes the novel intron 8 transcript called in the de novo GG runs
cat $MAIN/sample_names_RNA.txt | while read -r first second; do
sbatch --cpus-per-task=5 --mem=20g  --time=1-0 stringtie_ref.sh \
$MAIN/stringtie/ \
$first \
$MAIN/minimap2/"$first".sorted.bam
done
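
To get the *GBA1* TPMs listed in the overview, the transcript records in each StringTie GTF can be filtered by gene name. This is a sketch on mock records: the transcript IDs and values are invented, the output file naming used by `stringtie_ref.sh` is not shown here, and the code assumes the attribute layout StringTie writes for reference-matched transcripts (`ref_gene_name`, `TPM`).

```bash
gtf=$(mktemp)
# Mock StringTie records; IDs and values are invented for illustration.
{
  printf 'chr1\tStringTie\ttranscript\t100\t900\t1000\t-\t.\t%s\n' \
    'gene_id "MOCK.1"; transcript_id "MOCK_TX_A"; ref_gene_name "GBA1"; cov "12.0"; FPKM "3.1"; TPM "42.5";'
  printf 'chr1\tStringTie\texon\t100\t900\t1000\t-\t.\t%s\n' \
    'gene_id "MOCK.1"; transcript_id "MOCK_TX_A"; exon_number "1"; ref_gene_name "GBA1"; cov "12.0";'
  printf 'chr1\tStringTie\ttranscript\t2000\t2900\t1000\t-\t.\t%s\n' \
    'gene_id "MOCK.2"; transcript_id "MOCK_TX_B"; ref_gene_name "GBAP1"; cov "4.0"; FPKM "1.0"; TPM "7.5";'
} > "$gtf"

# Keep transcript rows for GBA1 only (the closing quote excludes GBAP1),
# then pull transcript_id and TPM out of the attribute column.
gba1_tpm=$(awk -F'\t' '$3 == "transcript" && $9 ~ /ref_gene_name "GBA1"/ {
  tid = ""; tpm = ""
  if (match($9, /transcript_id "[^"]+"/)) tid = substr($9, RSTART + 15, RLENGTH - 16)
  if (match($9, /TPM "[^"]+"/))           tpm = substr($9, RSTART + 5, RLENGTH - 6)
  print tid "\t" tpm
}' "$gtf")

printf '%s\n' "$gba1_tpm"
rm -f "$gtf"
```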