# Day 3: Epitranscriptomics Analysis
## Detecting m6A RNA Modifications Using Oxford Nanopore Technology

### Introduction to Epitranscriptomics

**Epitranscriptomics** refers to the study of chemical modifications to RNA molecules that don't change the underlying sequence but can dramatically alter RNA function, stability, and translation. Just as epigenetics studies modifications to DNA and histones, epitranscriptomics explores the "chemical decorations" on RNA that add an additional layer of gene regulation.

### N6-methyladenosine (m6A): The Most Abundant mRNA Modification

Among the more than 150 known RNA modifications (including m5C, pseudouridine (Ψ), m7G, and inosine), **N6-methyladenosine (m6A)** is the most prevalent internal modification found in eukaryotic mRNAs. This modification:

- Occurs primarily in the sequence motif **DRACH** (D=A/G/U, R=A/G, H=A/C/U)
- Regulates mRNA stability, splicing, translation, and localization
- Plays crucial roles in development, stress response, and disease
- Can be dynamically added and removed by "writer" and "eraser" enzymes
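
The DRACH definition above translates directly into a regular expression. The following is a minimal sketch, not part of the original pipeline: the function name `find_drach_sites` and the example sequence are hypothetical, chosen only to illustrate how the motif can be scanned for in a transcript sequence.

```python
import re

# DRACH written as a regular expression over the RNA alphabet:
# D = A/G/U, R = A/G, then A, C, and H = A/C/U.
# The lookahead lets overlapping motifs be reported as well.
DRACH = re.compile(r"(?=([AGU][AG]AC[ACU]))")


def find_drach_sites(sequence):
    """Return (position of the central A, 5-mer) for every DRACH match.

    Positions are 0-based. DNA-style input (T instead of U) is accepted,
    because basecalled reads and reference transcripts are usually
    written with T.
    """
    rna = sequence.upper().replace("T", "U")
    return [(m.start() + 2, m.group(1)) for m in DRACH.finditer(rna)]


example = "CCGGACTTTAAACAGG"  # hypothetical sequence, for illustration only
sites = find_drach_sites(example)
print(sites)  # [(4, 'GGACU'), (11, 'AAACA')]
```

A motif match only marks a candidate adenosine. Whether that site is actually methylated has to come from the signal-based prediction.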

### Experimental Design: Light vs. Dark Conditions

Our dataset compares RNA from **Arabidopsis thaliana** samples grown under two conditions:
- 🌞 **Light condition**: Plants exposed to normal light cycles
- 🌙 **Dark condition**: Plants grown in darkness

This experimental design allows us to investigate how light exposure affects m6A modification patterns, potentially revealing circadian or photomorphogenic regulation of epitranscriptomic marks.

### Computational Pipeline Overview

We processed Oxford Nanopore direct RNA sequencing data through the following pipeline:

#### 1. **Basecalling with Dorado**
- Converted raw electrical signals (FAST5 files) to nucleotide sequences
- Used the `rna002_70bps_hac@v3` model optimized for direct RNA sequencing
- Generated BAM files with move information for downstream analysis

#### 2. **Data Format Conversion**
- Converted BAM files to FASTQ format for compatibility with downstream tools
- Converted POD5 files to BLOW5 format for efficient signal access

#### 3. **Signal-to-Sequence Alignment**
- Used **nanopolish** to align electrical signals back to reference sequences
- Generated event-level data showing how each nucleotide position corresponds to raw signals

#### 4. **m6A Detection with m6anet**
- **m6anet** is a machine learning tool specifically designed to detect m6A modifications from nanopore data
- Used the pre-trained `arabidopsis_RNA002` model for species-specific detection
- Performed both data preprocessing and probabilistic inference

### What We'll Analyze Today

In this notebook, we will:

1. **📊 Load and explore** the m6anet output tables
2. **🔍 Examine** m6A modification sites and their confidence scores
3. **📈 Compare** modification patterns between light and dark conditions
4. **🧬 Investigate** sequence contexts and motif preferences
5. **📋 Identify** differentially modified transcripts
6. **🎯 Visualize** modification sites on gene models
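
As a rough preview of steps 3 and 5, the sketch below joins two site-level tables and flags sites that pass a probability cutoff in only one condition. It assumes the tables carry the columns `transcript_id`, `transcript_position`, `probability_modified`, and `mod_ratio`, as in m6anet's site-level output. The helper `compare_conditions`, the 0.9 cutoff, and all values in the toy tables are illustrative assumptions, not results from this dataset.

```python
import pandas as pd

KEYS = ["transcript_id", "transcript_position"]


def compare_conditions(light, dark, min_prob=0.9):
    """Join two site-level tables and flag sites called in each condition."""
    merged = light.merge(dark, on=KEYS, how="outer", suffixes=("_light", "_dark"))
    # A site missing from one table has NaN there; NaN >= min_prob is False.
    merged["called_light"] = merged["probability_modified_light"] >= min_prob
    merged["called_dark"] = merged["probability_modified_dark"] >= min_prob
    merged["delta_mod_ratio"] = merged["mod_ratio_light"] - merged["mod_ratio_dark"]
    return merged


# Toy values, invented purely to illustrate the join.
light = pd.DataFrame({
    "transcript_id": ["tx1", "tx1", "tx2"],
    "transcript_position": [120, 450, 88],
    "probability_modified": [0.95, 0.40, 0.97],
    "mod_ratio": [0.60, 0.10, 0.75],
})
dark = pd.DataFrame({
    "transcript_id": ["tx1", "tx2", "tx3"],
    "transcript_position": [120, 88, 300],
    "probability_modified": [0.93, 0.30, 0.99],
    "mod_ratio": [0.55, 0.05, 0.80],
})

result = compare_conditions(light, dark)
light_only = result[result["called_light"] & ~result["called_dark"]]
print(light_only[KEYS + ["delta_mod_ratio"]])
```

A site that is absent from one table is not necessarily unmodified in that condition. It may simply not have been scored, for example because too few reads covered it, so read depth is worth checking before interpreting condition-specific sites.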

### Learning Objectives

By the end of this session, you should be able to:
- Understand the biological significance of m6A modifications
- Interpret m6anet output and quality metrics
- Perform comparative analysis of epitranscriptomic data
- Visualize modification patterns across different experimental conditions
- Discuss the potential functional implications of observed modifications

---

## Complete Computational Pipeline Commands

Below are the exact bash commands used to process the raw nanopore data through the entire m6anet pipeline:

### Step 1: Basecalling with Dorado
```bash
# Process light condition samples
dorado basecaller rna002_70bps_hac@v3 /.../RNA/raw_data/rna_total_light/fast5/*.fast5 -o /.../RNA/rna_bam/light_fast5 --emit-moves --models-directory /.../

# Process dark condition samples
dorado basecaller rna002_70bps_hac@v3 /.../RNA/raw_data/rna_total_dark/fast5/*.fast5 -o /.../RNA/rna_bam/dark_fast5/ --emit-moves --models-directory /.../
```

### Step 2: BAM to FASTQ Conversion
```bash
# Convert dark condition BAM to FASTQ
samtools fastq /.../RNA/rna_bam/dark/basecall_dark.bam > /.../RNA/rna_fastq/dark/basecall_dark.fastq

# Convert light condition BAM to FASTQ
samtools fastq /.../RNA/rna_bam/light/basecall_light.bam > /.../RNA/rna_fastq/light/basecall_light.fastq
```
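
After the conversion, a quick sanity check can confirm that the FASTQ files are well formed and contain a plausible number of reads. The sketch below is one possible check using only the standard library. The function name `fastq_stats` and the two demo records are hypothetical, and it assumes the plain four-lines-per-record layout.

```python
import io


def fastq_stats(handle):
    """Count records and bases in a four-line-per-record FASTQ stream."""
    n_reads = 0
    n_bases = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed record near read {n_reads + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in read {n_reads + 1}")
        n_reads += 1
        n_bases += len(sequence)
    mean_length = n_bases / n_reads if n_reads else 0.0
    return n_reads, n_bases, mean_length


# Two made-up records stand in for a real file.
demo = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n")
print(fastq_stats(demo))  # (2, 12, 6.0)
```

For a real file, pass an open handle instead of the demo stream, for example `with open(path_to_fastq) as fh: fastq_stats(fh)`.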

### Step 3: Signal File Format Conversion
```bash
# Install blue-crab for POD5 to BLOW5 conversion
pip install blue-crab

# Convert POD5 files to BLOW5 format
blue-crab p2s -o /.../RNA/raw_data/rna_total_BLOW5/rna_dark.blow5 /.../RNA/raw_data/rna_total_POD5/rna_dark.pod5

blue-crab p2s -o /.../RNA/raw_data/rna_total_BLOW5/rna_light.blow5 /.../RNA/raw_data/rna_total_POD5/rna_light.pod5
```

### Step 4: Nanopolish Indexing
```bash
# Index FASTQ files with corresponding signal files (using FAST5 format)
nanopolish index -d /.../RNA/raw_data/rna_total_dark/fast5 /.../RNA/rna_fastq/dark/basecall_dark.fastq

nanopolish index -d /.../RNA/raw_data/rna_total_light/fast5 /.../RNA/rna_fastq/light/basecall_light.fastq
```

### Step 5: Signal-to-Sequence Alignment (Nanopolish Eventalign)
```bash
# Generate event alignments for dark condition
nanopolish eventalign --reads /.../RNA/rna_fastq/dark/basecall_dark.fastq \
 --bam /.../RNA/rna_aligned_bam/dark/rna_dark.aligned.bam \
 --genome /.../RNA/mapping/reference/mod_refs/AtRTDv2_1_QUASI.LS.fa \
 --scale-events --signal-index \
 --threads 14 > /.../RNA/nanopolish/dark/rna_dark_eventalign.txt

# Generate event alignments for light condition
nanopolish eventalign --reads /.../RNA/rna_fastq/light/basecall_light.fastq \
 --bam /.../RNA/rna_aligned_bam/light/rna_light.aligned.bam \
 --genome /.../RNA/mapping/reference/mod_refs/AtRTDv2_1_QUASI.LS.fa \
 --scale-events --signal-index \
 --threads 14 > /.../RNA/nanopolish/light/rna_light_eventalign.txt
```
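
To get a feel for the event-level table before handing it to m6anet, the rows can be summarised per reference position. The sketch below assumes the tab-separated eventalign output has a header line with `contig`, `position`, `reference_kmer`, and `event_level_mean` columns. Check the header of your own file, since the exact column set depends on the nanopolish version and flags. The helper name and the demo rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def mean_level_per_position(handle):
    """Average event_level_mean for each (contig, position, reference_kmer)."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, event count]
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["contig"], int(row["position"]), row["reference_kmer"])
        sums[key][0] += float(row["event_level_mean"])
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up rows with a reduced set of columns (assumed names, see above).
demo = io.StringIO(
    "contig\tposition\treference_kmer\tread_index\tevent_level_mean\n"
    "tx1\t10\tGGACT\t0\t120.0\n"
    "tx1\t10\tGGACT\t1\t124.0\n"
    "tx1\t11\tGACTT\t0\t98.5\n"
)
for key, level in mean_level_per_position(demo).items():
    print(key, round(level, 2))
```

Because the function streams the file row by row, it avoids loading the whole eventalign table into memory, which matters for files of the size produced here.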

### Step 6: m6anet Data Preparation
```bash
# Prepare data for m6anet analysis - dark condition
m6anet dataprep --eventalign /.../RNA/nanopolish/dark/rna_dark_eventalign.txt \
                --out_dir /.../RNA/m6anet/dark/dataprep --n_processes 10

# Prepare data for m6anet analysis - light condition                
m6anet dataprep --eventalign /.../RNA/nanopolish/light/rna_light_eventalign.txt \
                --out_dir /.../RNA/m6anet/light/dataprep --n_processes 10
```

### Step 7: m6anet Inference (Final m6A Detection)
```bash
# Run m6A detection inference - dark condition
m6anet inference --input_dir /.../RNA/m6anet/dark/dataprep \
                 --out_dir /.../RNA/m6anet/dark/result \
                 --pretrained_model arabidopsis_RNA002 --n_processes 10 --num_iterations 1000

# Run m6A detection inference - light condition
m6anet inference --input_dir /.../RNA/m6anet/light/dataprep \
                 --out_dir /.../RNA/m6anet/light/result \
                 --pretrained_model arabidopsis_RNA002 --n_processes 10 --num_iterations 1000
```

---

**⚠️ Note**: This analysis uses real nanopore sequencing data, so processing times and file sizes reflect the computational demands of working with full-scale direct RNA sequencing datasets.

Let's dive into the data and discover what the epitranscriptome can tell us about light-dependent gene regulation! 🚀

---
## Loading the data (Python, pandas)

For the next step of the analysis, we will use the pandas library, which lets us load dataframes and filter and parse data with a few lines of code!



In [None]:
# Most Python libraries can be installed with 'pip install <package-name>'
!pip install pandas

In [None]:
import pandas as pd  # 'pd' is the conventional alias; it just keeps later calls short

In [None]:
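# Let's load the dataframes
from google.colab import files

# Small snippet to upload files to Colab.
# Upload the m6anet result tables (CSV files), not the raw POD5 signal files
uploaded = files.upload()
print(list(uploaded))  # names of the uploaded files
# m6anet inference writes its site-level table as data.site_proba.csv inside the result directory
#df_light = pd.read_csv("/.../RNA/m6anet/light/result/data.site_proba.csv")
#df_dark = pd.read_csv("/.../RNA/m6anet/dark/result/data.site_proba.csv")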
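
Once a table is loaded, a common first step is to keep only confidently called sites. The sketch below assumes the site-level table has the columns `transcript_id`, `transcript_position`, `n_reads`, `probability_modified`, `kmer`, and `mod_ratio`. The helper `load_sites`, the 0.9 cutoff, and the demo rows are illustrative assumptions. An in-memory CSV stands in for the real file so the cell runs without any upload.

```python
import io

import pandas as pd


def load_sites(source, min_prob=0.9):
    """Read a site-level table and keep sites at or above the probability cutoff."""
    table = pd.read_csv(source)
    confident = table[table["probability_modified"] >= min_prob]
    # Most confident sites first, with a clean 0..n-1 index.
    return confident.sort_values("probability_modified", ascending=False).reset_index(drop=True)


# Made-up rows in the assumed column layout.
demo_csv = io.StringIO(
    "transcript_id,transcript_position,n_reads,probability_modified,kmer,mod_ratio\n"
    "tx1,120,45,0.95,GGACT,0.60\n"
    "tx1,450,30,0.40,AAACA,0.10\n"
    "tx2,88,52,0.97,AGACC,0.75\n"
)
sites = load_sites(demo_csv)
print(sites[["transcript_id", "transcript_position", "probability_modified"]])
```

`load_sites` accepts anything `pd.read_csv` accepts, so the same call works with a path to an uploaded file. Lowering `min_prob` trades precision for sensitivity, so it is worth trying a few cutoffs and seeing how the number of retained sites changes.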

-----------------

## IGV - Demonstrative example

Data from: https://hasindu2008.github.io/slow5tools/datasets.html#a-few-more-rna004-direct-rna