vignettes/cell_competition_bulk_data_mt_example_notebook.Rmd

---
title: "Analysing mtDNA cell lines in bulk RNA seq data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Analysing mtDNA cell lines in bulk RNA seq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=7, 
  fig.height=5
)


```

```{r setup}
library(MitoHEAR)
```


In this notebook it is shown the heteroplasmy analysis performed on bulk RNA seq mouse data from [Lima et al, 2021](https://www.nature.com/articles/s42255-021-00422-7).
There are in total 8 samples that belong to the cell line BG (95%) and 8 samples that belong to the cell line HB (24%). 
The two cell lines are built in order to have a different mitochondrial DNA from each other and from the reference genome in some bases.
The cell lines BG and HB are respectively classified as Loser and Winner. 

### Get counts for the four alleles in each base-sample pair


The first step of the library MitoHEAR is to generate a raw counts allele matrix with cells as rows and the four alleles for each base in the fasta file on the columns.
This task is achieved with the function *get_raw_counts_allele*. As input we need to provide the sorted bam files (one for each cell, with full path), the fasta file of the genomic region of interest and the cell names.
The matrices *meta_data_ana_final_big* and *meta_data_ana_final_small*  contain meta data information about the cells (i.e. cell names, condition). 


```{r}

load(system.file("extdata", "meta_data_ana_final_big.Rda", package = "MitoHEAR"))
load(system.file("extdata", "meta_data_ana_final_small.Rda", package = "MitoHEAR"))
cell_names <- as.vector(meta_data_ana_final_big$name_cell)
cell_names_bulk <- cell_names
```

We don't execute the function *get_raw_counts_allele* here and  we directly load his output. 
A command line implementation of the function *get_raw_counts_allele* is also available (see github README file for info).

```{r}

load(system.file("extdata", "output_SNP_ana_mt.Rda", package = "MitoHEAR"))

```


The output of *get_raw_counts_allele* is a list with three elements (see *?get_raw_counts_allele* for more info). The first element is the matrix of counts (n_rows = number of cells, n_cols= 4*number of bases) of the four alleles in each base. The row names are equal to cell_names.


```{r}
matrix_allele_counts <- output_SNP_ana_mt[[1]]
name_position_allele <- output_SNP_ana_mt[[2]]
name_position <- output_SNP_ana_mt[[3]]

```

```{r}
row.names(meta_data_ana_final_big) <- meta_data_ana_final_big$name_cell
meta_data_ana_final_big <- meta_data_ana_final_big[row.names(matrix_allele_counts),]
delete_duplicate <- meta_data_ana_final_big$name_cell[seq(1,48,3)]
meta_data_ana_final_big_filter <- meta_data_ana_final_big[delete_duplicate,]
bulk_sample <- meta_data_ana_final_big_filter$name_cell_original

matrix_allele_counts <- matrix_allele_counts[delete_duplicate,]
row.names(matrix_allele_counts) <- bulk_sample


row.names(meta_data_ana_final_small) <- meta_data_ana_final_small$name_fastq
```


The next step is to obtain a matrix with allele frequencies and a matrix with heteroplasmy values for each pair of cell-base. This is obtained with the function *get_heteroplasmy*. 
This function performs a two step filtering procedure, the first on the cells and the second on the bases. The aim is to keep only the cells that have more than *number_reads* counts in more than *number_positions* bases and to keep only the bases that are covered by more than *number_reads* counts in all the cells (*filtering*=1)  or in at least 50% of cells in each cluster (*filtering*=2).


```{r}

bulk_data_competition <- get_heteroplasmy(matrix_allele_counts[bulk_sample,],name_position_allele,name_position,50,2000,filtering = 1)

```

Among the output of *get_heteroplasmy* there are the matrix with heteroplasmy values and the matrix with allele frequencies, for all the cells and bases that pass the two step filtering procedure. 
The heteroplasmy is computed as *1-max(f)*, where *f* are the frequencies of the four alleles for every cell-base pair.
For more info about the output see *?get_heteroplasmy*.

```{r}
sum_matrix <- bulk_data_competition[[1]]
sum_matrix_qc <- bulk_data_competition[[2]]
heteroplasmy_matrix_bulk <- bulk_data_competition[[3]]
allele_matrix_bulk <- bulk_data_competition[[4]]
cluster_bulk <- as.character(meta_data_ana_final_small[row.names(heteroplasmy_matrix_bulk),]$status)
condition_bulk <- as.character(meta_data_ana_final_small[row.names(heteroplasmy_matrix_bulk),]$condition)
index_bulk <- bulk_data_competition[[5]]
```


```{r}

name_position_allele_qc <- name_position_allele[name_position%in%colnames(sum_matrix_qc)]
name_position_qc <- name_position[name_position%in%colnames(sum_matrix_qc)]

```

It is possible to perform an additional filtering step on the bases keeping only the ones with an heteroplasmy value above *min_heteroplasmy* in more than *min_cells*.

```{r}
relevant_bases <- filter_bases(heteroplasmy_matrix_bulk,0.01,8)
```


### Identification of most different bases according to heteroplasmy between clusters
For detecting the difference in heteroplasmy values between two group of cells (i.e. two clusters), an unpaired two-samples Wilcoxon test is performed. In this case we run the test between the groups *Winner* and *Loser*. As output, for each bases, there is the adjusted p value (FDR).

```{r warning = FALSE}
p_value_wilcox_test <- get_wilcox_test(heteroplasmy_matrix_bulk[,relevant_bases],cluster_bulk,"Loser","Winner" )
```

We sort the bases according to the adjusted p value in order to identify the bases where the heteroplasmy is most different between the two groups.

```{r}
p_value_wilcox_test_sort <- sort(p_value_wilcox_test,decreasing = F)

```

The heteroplasmy and the corresponding allele frequencies for the most relevant bases (according to Wilcoxon test) are shown. The difference in heteroplasmy values is sharp, as expected since the two cell lineages (Winner and Loser) have a different mitochondrial DNA by construction.

```{r}


q <- list()
for ( i in 1:length(p_value_wilcox_test_sort[1:2])) {
p <- plot_heteroplasmy(names(p_value_wilcox_test_sort)[i],heteroplasmy_matrix_bulk,cluster_bulk,index_bulk)+ggplot2::ggtitle(paste(names(p_value_wilcox_test_sort)[i],round(p_value_wilcox_test_sort[i],4),sep = "-"))
q <- list(q,p)
}
q


q <- list()
for ( i in names(p_value_wilcox_test_sort)[1:2]) {
p <- plot_allele_frequency(i,heteroplasmy_matrix_bulk,allele_matrix_bulk,cluster_bulk,name_position_qc,name_position_allele_qc,5,index_bulk)  
q <- list(q,p)
}
q


```

We can check if there is a batch effect in the most relevant positions (i.e. the heteroplasmy levels are constantly higher only in a specific batch). For the top 2 positions defined with the GAM fit there is not a batch effect, since the difference in  heteroplasmy levels are driven by group (Winner and Loser) and not by batch (co cultured-CO or separate culture-Sep).


```{r}
utile <- meta_data_ana_final_small[row.names(heteroplasmy_matrix_bulk),]
batch <- rep(0,length(utile$condition))
batch[utile$status == "Loser"&utile$condition == "CO"] <- "Loser-CO"
batch[utile$status == "Loser"&utile$condition == "Sep"] <- "Loser-Sep"
batch[utile$status == "Winner"&utile$condition == "CO"] <- "Winner-CO"
batch[utile$status == "Winner"&utile$condition == "Sep"] <- "Winner-Sep"


cluster <- utile$status

q <- list()
for ( i in names(p_value_wilcox_test_sort)[1:2]) {
p <- plot_batch(i, heteroplasmy_matrix_bulk, batch, cluster, 6, index_bulk)
q <- list(q, p)
}
q


```

### Unsupervised cluster analysis among cells based on allele frequency values
MitoHEAR offers the possibility to perform an unsupervised hierarchical clustering on the cells based on a distance matrix with the function *clustering_dist_ang*. Given a base, the distance between two cells is the angular distance of the allele frequencies.
Given a base, the variance of the distance values between two cells is also computed.
Top bases with highest variance are selected for down stream analysis.
We can represent the difference between two cells as a vector whose coordinates are the angular distances of the top bases.
The total distance between two cell is the euclidean norm of the vector of difference between the two cells.
The output of *clustering_dist_ang* is a list. The first element is a data frame which contains the old classification (partition available before the cluster analysis based on allele frequencies) and the new classification (partition provided by the cluster analysis based on allele frequencies ). The second element is the distance matrix, on which the hierarchical clustering is done.
The third element is a vector with the top bases according to variance.

It is also possible to run *clustering_dist_ang* in a supervised approach. In this case the bases used for hierarchical clustering are not selected according to variance, but are directly provided with the parameter *relevant_bases*.
The heatmap of the distance matrix with cells sorted according to the new classification is shown below.
The cluster analysis based on allele frequencies information can be a powerful way to perform a lineage tracing analysis, by grouping together cells which are from the same embryo.
See *?clustering_dist_ang* for more info.


```{r}
result_clustering_sc <- clustering_angular_distance(heteroplasmy_matrix_bulk, allele_matrix_bulk, cluster_bulk, length(colnames(heteroplasmy_matrix_bulk)), deepSplit_param = 1, minClusterSize_param = 8, 0.2, min_value = 0.001, index = index_bulk, relevant_bases = NULL)


old_new_classification <- result_clustering_sc[[1]]
dist_matrix_sc <- result_clustering_sc[[2]]
top_dist <- result_clustering_sc[[3]]
common_idx <- result_clustering_sc[[4]]

old_classification <- as.vector(old_new_classification[, 1])
new_classification <- as.vector(old_new_classification[, 2])

plot_distance_matrix(dist_matrix_sc, old_classification)
```


Below we plot the top 2 bases selected for the unsupervised cluster analysis

```{r}
q <- list()
for ( i in 1:length(top_dist[1:2])) {
p <- plot_heteroplasmy(top_dist[i], heteroplasmy_matrix_bulk, cluster_bulk, index_bulk)
q <- list(q, p)
}
q


```


Comparison between the ground truth and the new partition obtained with unsupervised cluster analysis
```{r}
plot_heatmap(new_classification, old_classification, (dist_matrix_sc), cluster_columns = F, cluster_rows = F, "Euclidean distance")

```


The new classification coincides with the old classification. This means that we can perfectly distinguish between the two cell lines only by looking at the heteroplasmy values of the mitochondrial bases. This result was expected since the samples derived from two cell lines, whose mitochondrial DNA is built to be different between each other.

```{r}
utils::sessionInfo()
```