# Guidelines for preparing files for methylation calling with modkit

For methylation calling, we chose to use reads basecalled with Dorado, together with the assemblies built from those Dorado-basecalled reads, because Dorado ran much faster than Guppy while giving similar basecalling accuracy and assembly quality.

## I. Dorado trimming

First, we need to trim the adapters from the uBAM files produced by Dorado basecalling. Dorado normally does this by default, but as a precaution the trimming was run again.

Note: uBAM files are the official Dorado basecalling output format for storing methylation data. A uBAM is a BAM file with unaligned reads, and it carries the MM and ML tags used to store methylation calls. It contains the same information as a FASTQ file: the read identifier, the corresponding sequence, and the quality scores.
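
To check that a uBAM actually carries these tags, one can print a few records as text and look for the MM and ML fields. The sketch below is only an illustration: the helper name `list_mod_tags` and the commented file name are hypothetical, and the demo record is synthetic.

```shell
#!/bin/bash
# Hypothetical helper: read SAM-formatted records on stdin and print, for each
# read, its name followed by any MM/ML tags (older files may use Mm/Ml).
list_mod_tags() {
    awk -F'\t' '{
        out = $1
        # Optional tags start at column 12 of a SAM record
        for (i = 12; i <= NF; i++)
            if ($i ~ /^(MM|ML|Mm|Ml):/) out = out "\t" $i
        print out
    }'
}

# Synthetic record for illustration only (not real data)
printf 'read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tMM:Z:C+m,0;\tML:B:C,200\n' | list_mod_tags

# With samtools installed, the same helper could be applied to a real file
# (the file name below is an example):
# samtools view sample_dorado_modbasecalling.bam | head -n 5 | list_mod_tags
```

A read that prints with no MM/ML fields after its name carries no modification calls, which would make the later modkit steps uninformative for that file.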

## II. Dorado aligner

Once the adapters are removed from the reads, we need to align our uBAM files to our reference genome (the one obtained from HyPo polishing).


## III. Samtools sort + index

After aligning our uBAM files to the reference genome, we need to sort the resulting files and then index them.

## Below you will find code that performs trimming, alignment, and sort + index for all samples

In [None]:
#!/bin/bash
## Using a for loop over all samples


# Find all files in the current directory with the specified extension
files=$(find . -type f -name "*_dorado_modbasecalling.bam")

# Iterate over each file found
for file in $files
do
    # Extract the filename without extension
    filename=$(basename "$file" "_dorado_modbasecalling.bam")

    # Execute the dorado trim command for each file
    dorado trim -t 50 "$file" > "${filename}_trimmed.bam"

    # Execute the alignment command for each trimmed BAM file
    dorado aligner /bigvol/omion/HyPo/hypo_"$filename"_Dorado_modbasecalling.fasta ./"$filename"_trimmed.bam --bandwidth "500,20000" -t 50 > "aligned_trimmed_$filename.bam"

    # Sort the aligned BAM file
    samtools sort "aligned_trimmed_$filename.bam" -O BAM -o "aligned_sort_$filename.bam" -@ 50

    # Index the sorted BAM file
    samtools index -@ 50 "aligned_sort_$filename.bam"

done
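
Because `for file in $files` splits on whitespace, a file name containing a space would break the loop. A minimal sketch of a more defensive variant follows. It only prints the commands it would run (a dry run), and the function name `plan_sample` is hypothetical; the paths and options are the ones used in the loop above.

```shell
#!/bin/bash
# Hypothetical helper: print the commands for one basecalled BAM (dry run).
# The printed lines are for inspection only and are not shell-quoted.
plan_sample() {
    local file=$1 sample ref
    # Strip the directory and the common suffix to recover the sample name
    sample=$(basename "$file" "_dorado_modbasecalling.bam")
    ref="/bigvol/omion/HyPo/hypo_${sample}_Dorado_modbasecalling.fasta"
    printf 'dorado trim -t 50 %s > %s_trimmed.bam\n' "$file" "$sample"
    printf 'dorado aligner %s %s_trimmed.bam --bandwidth 500,20000 -t 50 > aligned_trimmed_%s.bam\n' \
        "$ref" "$sample" "$sample"
    printf 'samtools sort -@ 50 -O BAM -o aligned_sort_%s.bam aligned_trimmed_%s.bam\n' \
        "$sample" "$sample"
    printf 'samtools index -@ 50 aligned_sort_%s.bam\n' "$sample"
}

# Null-delimited find output keeps file names with spaces intact
find . -type f -name "*_dorado_modbasecalling.bam" -print0 |
while IFS= read -r -d '' file; do
    plan_sample "$file"
done
```

Reviewing the printed commands before launching the real run is a cheap way to catch a wrong reference path or an unexpected sample name.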


# Modkit

Now we can process the resulting files with Modkit, which outputs .tsv or .bed files containing the methylation information (probability), the reference position in the genome, the type of methylation, and so on.

01- Pileup
Gives the methylation level (percentage of reads called as modified) at each genomic position.
Here I did not apply any filtering (no methylation threshold). I also created CG motif files (the position of each CG in the reference genome) for each isolate and used them in the methylation pileup (`--include-bed`).

02- Bedtools complement

The goal is to obtain the complement of the positions in the pileup output, because pileup only reports the genome positions where methylation is present.
You first need to create a genome file in which the first column is the contig name and the second is the contig length.

03- Cat to merge the complement and pileup outputs
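
Steps 02 and 03 could be sketched as follows. The genome file is built here with plain awk; the helper name `make_genome_file` and all file names (`ref.fasta`, `genome.txt`, `pileup.bed`, and so on) are placeholders, not the ones used in this project. `bedtools complement` expects its BED input sorted in the same contig order as the genome file, hence the `bedtools sort -g` step.

```shell
#!/bin/bash
# Hypothetical helper: print "contig<TAB>length" for every record of a FASTA file.
make_genome_file() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END  { if (name != "") print name "\t" len }' "$1"
}

# Placeholder file names; adapt them to your own layout.
ref=ref.fasta
pileup=pileup.bed

# Only run when the inputs and bedtools are actually available
if [ -f "$ref" ] && [ -f "$pileup" ] && command -v bedtools >/dev/null 2>&1; then
    make_genome_file "$ref" > genome.txt
    # 02- intervals of the genome that are absent from the pileup output
    bedtools sort -i "$pileup" -g genome.txt > pileup.sorted.bed
    bedtools complement -i pileup.sorted.bed -g genome.txt > complement.bed
    # 03- merge both sets of intervals into one file
    cat pileup.sorted.bed complement.bed > pileup_plus_complement.bed
fi
```

Note that the complement intervals only have three columns (contig, start, end), whereas the pileup rows keep all of their bedMethyl columns, so the merged file has rows of two different widths.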

In [None]:
### Modkit pileup, working command (24 May)

/bigvol/omion/Software/dist/modkit pileup \
  /bigvol/omion/11-Methylation/02-samtools_sort_index/aligned_sort_Gd293.bam \
  pileup_Gd293.bed \
  --combine-strands --cpg --only-tabs \
  -r /bigvol/omion/07-Filtered_Assemblies/filtered_hypo_Gd293_Dorado_modbasecalling.fasta \
  --include-bed /bigvol/omion/11-Methylation/03-modkit-motif_bed/filtered_hypo_Gd293_Dorado_modbasecalling_cg_modifs.bed \
  --no-filtering -t 40


## Calculate Genome-wide 5mCpG and 5hmCpG %

We calculated the mean % of 5mC and 5hmC at CpG sites using the following command lines:

The calculation is based on the number of reads with a detected methylation mark at a given genomic position.
For each position, the calculation is therefore:
number of aligned CpGs with 5mC (or 5hmC) / total number of CpGs
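
Written out, the genome-wide values produced by the awk commands in this section are coverage-weighted ratios, summed over all CpG positions $i$ on rows with modification code `m`. The symbols are shorthand introduced here and assume the modkit bedMethyl layout: $N_{\mathrm{valid}}$ is column 10, $N_{\mathrm{mod}}$ column 12 and $N_{\mathrm{other}}$ column 14.

```latex
\[
f_{\mathrm{5mC}} = \frac{\sum_i N_{\mathrm{mod},i}}{\sum_i N_{\mathrm{valid},i}},
\qquad
f_{\mathrm{5hmC}} = \frac{\sum_i N_{\mathrm{other},i}}{\sum_i N_{\mathrm{valid},i}}
\]
```

Multiplying each fraction by 100 gives the corresponding percentage.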

In [None]:
## Global number of methylated sites (includes both 5mC and 5hmC, as the model is probabilistic)
## bedMethyl columns: 4 = modification code, 11 = percent modified
awk '($4=="m") && ($11>0.0)' "${bedmethyl_cpg}" | wc -l

## Global methylation level (5mC, 5hmC), printed as fractions of valid calls
## bedMethyl columns: 10 = N_valid_cov, 12 = N_mod, 13 = N_canonical, 14 = N_other_mod
awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} \
  END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' "${pileup_bed}"
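
To make the arithmetic concrete, the same two calculations can be tried on a tiny hand-made table. The values below are synthetic and the variable names (`toy`, `n_methylated`, `levels`) are illustrative; the column positions assume the modkit bedMethyl layout.

```shell
#!/bin/bash
# Toy bedMethyl table (synthetic values, whitespace-separated for readability).
# Columns used: 4 = modification code, 10 = N_valid_cov, 11 = percent modified,
# 12 = N_mod, 13 = N_canonical, 14 = N_other_mod.
toy=$(mktemp)
cat > "$toy" <<'EOF'
ctg1 0 1 m 10 . 0 1 255,0,0 10 50.00 5 4 1 0 0 0 0
ctg1 0 1 h 10 . 0 1 255,0,0 10 10.00 1 4 5 0 0 0 0
ctg1 10 11 m 10 . 10 11 255,0,0 10 0.00 0 10 0 0 0 0 0
ctg1 10 11 h 10 . 10 11 255,0,0 10 0.00 0 10 0 0 0 0 0
EOF

# Number of CpG sites with at least one 5mC call
n_methylated=$(awk '$4 == "m" && $11 > 0 { n++ } END { print n + 0 }' "$toy")

# Coverage-weighted fractions over all "m" rows: canonical, 5mC, 5hmC
levels=$(awk '$4 == "m" { can += $13; mod += $12; oth += $14; valid += $10 }
    END { if (valid > 0) print can / valid, mod / valid, oth / valid }' "$toy")

echo "methylated sites: $n_methylated"
echo "canonical, 5mC, 5hmC fractions: $levels"
rm -f "$toy"
```

For this toy table, one site out of two carries 5mC calls, and the fractions are 14/20, 5/20 and 1/20 of the valid calls.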

The data were then stored in a Google Sheets file (see Table/Genome_wide_methylation), and an R script was written to plot the results.

## Rscript

In [None]:
setwd("C:/Users/ocean/Downloads")

library(ggplot2)
library(dplyr)
library(gridExtra)
library(ggtext)

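# Read the data (read.delim() assumes tab-separated input by default;
# add sep = "," if the exported file is comma-separated)
data <- read.delim("Genome_wide_methylation(1).csv", header = TRUE)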

# Assign colors based on category in column 2
data$color <- case_when(
  data[[2]] == 1 ~ "#FF5733",
  data[[2]] == 2 ~ "#3377FF",
  data[[2]] == "Outgroup" ~ "#9ACD32",
  TRUE ~ "black"  # Default color for any other categories
)

# Function to order data by color and a specific column in decreasing order
order_data <- function(data, column_index) {
  data %>%
    mutate(color = factor(color, levels = c("#FF5733", "#3377FF", "#9ACD32"))) %>%
    arrange(color, desc(data[[column_index]]))
}

# Order the data for each plot
data1 <- order_data(data, 3)
data2 <- order_data(data, 4)
data3 <- order_data(data, 5)

# Create a factor for the first column based on the new order for each plot
data1[[1]] <- factor(data1[[1]], levels = data1[[1]])
data2[[1]] <- factor(data2[[1]], levels = data2[[1]])
data3[[1]] <- factor(data3[[1]], levels = data3[[1]])

# Create the three plots
plot1 <- ggplot(data1, aes(x = data1[[1]], y = data1[[3]], fill = color)) +
  geom_col() +
  scale_fill_identity() +
  labs(x = colnames(data1)[1], y = "Nb of methylated sites",
       title = "Number of methylated sites per isolate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot2 <- ggplot(data2, aes(x = data2[[1]], y = data2[[4]], fill = color)) +
  geom_col() +
  scale_fill_identity() +
  labs(x = colnames(data2)[1], y = "5mCpG %",
       title = "Genome-wide 5mCpG % per isolate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot3 <- ggplot(data3, aes(x = data3[[1]], y = data3[[5]], fill = color)) +
  geom_col() +
  scale_fill_identity() +
  labs(x = colnames(data3)[1], y = "5hmCpG %",
       title = "Genome-wide 5hmCpG % per isolate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_markdown(hjust = 0.5))

# Arrange the plots in a grid
grid.arrange(plot1, plot2, plot3, ncol = 2)


## Calculate contig-wide 5mCpG and 5hmCpG %

We then also checked whether some contigs in our assemblies display more or less methylation than others.

## Rscript

In [None]:
setwd("C:/Users/ocean/Desktop")

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)

# Read the CSV file
data <- read.delim("contig_wide_methylation.csv", header = TRUE, sep=",")

# Rename columns for clarity (adjust these names as needed)
colnames(data)[c(2,3,4,5)] <- c("Legend", "name", "category4", "value")

# Function to filter unique contig names and reorder by value
filter_unique_contigs_and_reorder <- function(df) {
  df <- df %>%
    group_by(name) %>%
    filter(row_number() == 1) %>%
    ungroup() %>%
    arrange(desc(value))
  return(df)
}

# Define colors for each category in Legend
colors <- c("1" = "#FF5733", "2" = "#3377FF", "Outgroup" = "#9ACD32")

# Create separate plots for 'h' and 'm', with unique contig names and ordered by value
plot_h <- data %>%
  filter(category4 == "h") %>%
  filter_unique_contigs_and_reorder() %>%
  ggplot(aes(x = reorder(name, -value), y = value, fill = Legend)) +  # Use reorder for ordering
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), color = "black") +
  facet_wrap(~ Legend, ncol = 1, scales = "free_x") +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), limits = NULL, name = "5hmCpG %") +  # Y-axis label for the first plot
  scale_fill_manual(values = colors) +  # Use manual scale for fill colors
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.spacing = unit(0.1, "lines")
  ) +
  labs(title = expression(paste("Mean 5hmCpG % in ", italic("P. destructans"), " and outgroup contigs")), x = NULL, y = NULL)

plot_m <- data %>%
  filter(category4 == "m") %>%
  filter_unique_contigs_and_reorder() %>%
  ggplot(aes(x = reorder(name, -value), y = value, fill = Legend)) +  # Use reorder for ordering
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), color = "black") +
  facet_wrap(~ Legend, ncol = 1, scales = "free_x") +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), limits = NULL, name = "5mCpG %") +  # Y-axis label for the second plot
  scale_fill_manual(values = colors) +  # Use manual scale for fill colors
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.spacing = unit(0.1, "lines")
  ) +
  labs(title = expression(paste("Mean 5mCpG % in ", italic("P. destructans"), " and outgroup contigs")), x = NULL, y = NULL)

# Arrange and display plots in two rows
grid.arrange(plot_h, plot_m, nrow = 2)
