<div style="text-align: center;">
    <img src="../images/logo.png" alt="GML QIMR Logo">
</div>

**This R script/tutorial has been generated by Prakrithi from the *Genomics and Machine Learning Lab*:**
* **Data used below is currently not published or publicly available. For more information about the scripts, or to use this dataset beyond this tutorial, please contact uqacause@uq.edu.au**
* **Code and *conda* environments used are available from https://github.com/GMLTestLab/qimr-teaching**

# TABLE OF CONTENTS

1. [Introduction to CNV profiling](#Introduction)
2. [scRNA-seq Data Loading](#Data-Loading)
3. [CopyKAT - Identifying Malignant melanocytes and identifying patient with the tumor](#CopyKAT)
4. [InferCNV](#InferCNV)
5. [Consistency of prediction between inferCNV and CopyKAT](#Evaluation)
6. [Analyzing subclones in spatial data using CopyKAT](#spatial)
7. [Summary](#SUMMARY)


This is just a tutorial and not a hands-on notebook

# CNV Profiling of Single cell and Spatial RNA sequencing data to identify Tumor cells

### What are CNVs?

A copy number variation (CNV) is an instance in which the number of copies of a specific DNA segment varies among different individuals' genomes. These variations can involve deletions or duplications of segments of the genome and can range from a few kilobases to several megabases in size.

### How are CNVs related to cancer?

- Oncogene Amplification
- Tumor Suppressor Gene Deletion
- Genomic Instability

<img src="images/CNVs.jpeg" alt="CNVs">

### How can we make use of this DNA profile information for RNA-seq data?

Distinguishing malignant from non-malignant cells is a critical step in the downstream analysis of scRNA-seq tumor datasets. The basic idea is to estimate the copy number alterations that characterize transformed cells. Copy number profiles are obtained by treating the gene expression profile of each cell as a function of genomic coordinates. Each profile is smoothed with a moving average along the genome, and the smoothed profiles are then clustered into malignant and non-malignant cells. The underlying logic is that the expression levels of many adjacent genes together provide depth information from which the genomic copy number of that region can be inferred.
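
The smoothing idea can be illustrated with a minimal base-R sketch on simulated data. Everything here is an assumption made for illustration (the number of genes, the position and size of the simulated gain, and the window of 25 genes); it is not the procedure implemented by CopyKAT or inferCNV.

```r
# Toy example: 200 genes ordered by genomic position, one cell.
set.seed(1)
n_genes <- 200
# Baseline log-expression centred on 0 (i.e. already expressed relative to a reference).
expr <- rnorm(n_genes, mean = 0, sd = 1)
# Simulate a copy-number gain over genes 81-120 by shifting their expression up.
gain <- 81:120
expr[gain] <- expr[gain] + 1

# Centred moving average over a window of 25 adjacent genes.
win <- 25
smoothed <- as.numeric(stats::filter(expr, rep(1 / win, win), sides = 2))

# Gene-level noise shrinks, while the regional shift stays visible.
mean_gain <- mean(smoothed[gain], na.rm = TRUE)
mean_rest <- mean(smoothed[-gain], na.rm = TRUE)
c(gain = mean_gain, rest = mean_rest)
```

Plotting `expr` and `smoothed` against gene index shows why single genes are uninformative while a window of neighbouring genes reveals the gained region.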

## scRNA-Seq Melanoma dataset

An in-house melanoma dataset consisting of three samples, one with a malignant tumor and two with dysplastic nevi (unusual moles that have the potential to turn malignant), is used to demonstrate the application of CNVs in identifying cancerous cells.

### Load necessary libraries and set data directory

In [None]:
library(Seurat)
data_dir <- R.utils::getAbsolutePath('../data/infercnv')

## Load the integrated Melanoma dataset .rds file

Load the Seurat object (.rds file converted from an AnnData object) as Mel_scRNA. <br> Mel_scRNA@meta.data$cell_type holds the annotated cell type information.

In [None]:
Mel_scRNA<-readRDS(glue::glue("{data_dir}/Mel_3samples_75pcs.rds"))
Mel_scRNA

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)
Idents(Mel_scRNA)<-Mel_scRNA@meta.data$cell_type
DimPlot(Mel_scRNA)

<img src="images/umap.png">

**Melanocytes** are the cells from which melanoma arises. Not all melanocytes are cancerous. It is necessary to correctly identify the malignant cells, to resolve clonal substructure within the tumor, and to identify the patient sample they come from for accurate diagnosis.

<a id="CopyKAT"></a>
## CopyKAT

**CopyKAT (Copy Number Karyotyping of Aneuploid Tumors)** is a tool for identifying CNVs from high-dimensional RNA-seq datasets such as scRNA-seq, and it can be applied to spatial transcriptomics data as well.

It is a reference-free method, i.e. it does not require prior cell type annotations defining normal cell profiles.

**Steps involved**

- **step 1**: Read and filter data. **Input:** gene-by-barcode count matrix.
- **step 2**: Annotate genes and order them by genomic coordinates.
- **step 3**: Perform a Freeman–Tukey transformation to stabilize variance, followed by polynomial dynamic linear modeling (DLM) to smooth outliers in the single-cell UMI counts.
- **step 4**: Detect a subset of diploid cells with high confidence to infer the copy number baseline of the normal 2N cells. To do this, single cells are pooled into several small hierarchical clusters, and the variance of each cluster is estimated using a Gaussian mixture model (GMM). The cluster with minimal estimated variance is defined as the 'confident diploid cells' under a strict classification criterion.
- **step 5**: Segmentation. Relative gene expression values in single cells are used for MCMC segmentation, and segments are merged by KS testing.
- **step 6**: Prediction. Aneuploid tumor and normal cell clusters are classified using normal cell enrichment and GMM distribution tests. The clonal substructure of tumor cells is delineated by clustering, and subclones are used for downstream analysis.
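
To make the variance-stabilization idea in step 3 concrete, here is a minimal base-R sketch of the classic Freeman–Tukey transformation, sqrt(x) + sqrt(x + 1), applied to simulated Poisson counts. The helper name `freeman_tukey` and the simulated means are assumptions for illustration; CopyKAT's internal normalization may differ in detail.

```r
set.seed(42)
# Classic Freeman-Tukey transformation for count data.
freeman_tukey <- function(x) sqrt(x) + sqrt(x + 1)

# Poisson counts at increasing mean expression: the raw variance grows with the mean.
means <- c(5, 20, 80)
counts <- lapply(means, function(m) rpois(10000, lambda = m))

raw_var <- sapply(counts, var)
ft_var  <- sapply(counts, function(x) var(freeman_tukey(x)))

# After the transformation the variance is roughly constant across expression levels.
data.frame(mean = means, raw_var = round(raw_var, 2), ft_var = round(ft_var, 2))
```

A roughly constant variance means that lowly and highly expressed genes contribute on a comparable scale to the later smoothing and segmentation steps.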


![copyKAT](images/41587_2020_795_Fig1_HTML.png)

Figure source: Gao R et al. (2021), Nature Biotechnology.

More details at https://github.com/navinlabcode/copykat

Publication: Gao R et al. (2021). Nat Biotechnol. doi:10.1038/s41587-020-00795-2.

### copyKAT installation

Uncomment the lines below to install copyKAT and load the library once installed.

In [None]:
# devtools::install_github("navinlabcode/copykat")
library(copykat)

## Run CopyKAT

In [None]:
# sep and header belong to read.csv(), not glue(): the count matrix is tab-delimited
exp.rawdata<-read.csv(glue::glue("{data_dir}/MPSs_count_mat.txt"), sep="\t", header=TRUE)
rownames(exp.rawdata)<-exp.rawdata$GENE
exp.rawdata<-exp.rawdata[,-1]

In [None]:
# genome="hg20" is CopyKAT's built-in label for the hg38 genome build coordinates
copykat.test <- copykat(rawmat=exp.rawdata, id.type="S", ngene.chr=5, win.size=25, KS.cut=0.1, sam.name="test",distance="euclidean", 
                        norm.cell.names="",output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=1)


### Read CopyKAT predictions

In [None]:
p<-read.csv(glue::glue("{data_dir}/test_copykat_prediction.txt"), sep="\t",header=TRUE)
head(p)

In [None]:
# CopyKAT cell names use "." where the Seurat barcodes use "-", so rename the Seurat cells to match
colnames(Mel_scRNA)<-gsub("-1",".1",colnames(Mel_scRNA))
seurat_colnames<-colnames(Mel_scRNA)
rownames(p)<-p$cell.names

# Reorder the rows of p to follow the Seurat cell order
# (cells missing from the CopyKAT output become NA rows)
p<-p[seurat_colnames,]

# View the aligned data frame
dim(p)
head(p)

## Identify Malignant Tumor cells

In [None]:
Mel_scRNA@meta.data$copykat_pred<-p$copykat.pred
#Mel_scRNA@meta.data$copykat_aneuploid <- ifelse(Mel_scRNA@meta.data$copykat_pred == "aneuploid", "Aneuploid", "")
aneuploid_cells <- rownames(Mel_scRNA@meta.data)[Mel_scRNA@meta.data$copykat_pred == "aneuploid"]

DimPlot(Mel_scRNA, group.by = "copykat_pred")
DimPlot(Mel_scRNA, group.by = "copykat_pred", cells.highlight = aneuploid_cells, cols.highlight = "red")

## Identify the sample with malignant tumor

In [None]:
options(repr.plot.width = 12, repr.plot.height = 6)
DimPlot(Mel_scRNA, group.by = "copykat_pred", split.by = "orig.ident")
DimPlot(Mel_scRNA, group.by = "copykat_pred", cells.highlight = aneuploid_cells,cols.highlight = "red", split.by = "orig.ident")

## InferCNV

InferCNV is another widely used tool for analyzing the CNV profiles of tumor cells and classifying subclones.

- InferCNV uses a corrected moving average of gene expression data to determine CNV profiles. 
- Genes are sorted by absolute genomic position. That is, they are first ordered by chromosome, and then by genomic start position within the chromosome. Averaging the expression of genomically adjacent genes removes gene-specific expression variability and yields profiles that reflect chromosomal copy number variations.
- To further refine the CNV profile of tumor cells, InferCNV constructs the CNV profile of a known normal sample, and then for each gene and each cell, the normal sample is subtracted from the tumor sample to determine the final tumor CNV profile.
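
The reference-subtraction idea in the last point can be sketched in base R on simulated data. The matrices, the simulated deletion and the use of a simple per-gene mean as the baseline are assumptions for illustration only; inferCNV's actual pipeline includes further normalization and denoising steps.

```r
set.seed(7)
n_genes <- 100
# Smoothed log-expression for 5 reference (normal) and 5 observation (tumor) cells;
# rows are genes in genomic order, columns are cells.
# A shared per-gene offset stands in for expression differences unrelated to copy number.
gene_offset <- rnorm(n_genes, sd = 0.5)
normal <- gene_offset + matrix(rnorm(n_genes * 5, sd = 0.1), n_genes, 5)
tumor  <- gene_offset + matrix(rnorm(n_genes * 5, sd = 0.1), n_genes, 5)
# Simulated deletion over genes 41-60 in the tumor cells.
del <- 41:60
tumor[del, ] <- tumor[del, ] - 0.5

# The per-gene mean of the normal cells is the baseline; subtract it from every tumor cell.
baseline <- rowMeans(normal)
tumor_rel <- sweep(tumor, 1, baseline, FUN = "-")

# The shared offset cancels, so values sit near 0 except across the deleted region.
c(deleted = mean(tumor_rel[del, ]), elsewhere = mean(tumor_rel[-del, ]))
```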

More Information at https://github.com/broadinstitute/infercnv/wiki

![inferCNV](https://github.com/broadinstitute/infercnv/wiki/images/InferCNV_procedure.png)

### Data requirements

inferCNV requires:

- a raw counts matrix of single-cell RNA-Seq expression
- an annotations file which indicates which cells are tumor vs. normal.
- a gene/chromosome positions file
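
As a rough guide to what these three inputs look like, the sketch below builds toy versions in base R. The gene names, cell names, coordinates and file names are hypothetical; the layouts (a two-column annotations file and a four-column gene order file, both tab-delimited without a header) follow the inferCNV wiki as we understand it, so check the wiki before relying on them.

```r
# Hypothetical toy inputs written to a temporary directory.
tmp <- tempdir()

# 1. Raw counts matrix: genes in rows, cells in columns.
counts <- matrix(c(0, 3, 1, 5, 2, 0), nrow = 3,
                 dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                                 c("cell_1", "cell_2")))

# 2. Annotations: tab-delimited, no header; columns = cell name, group label.
annot <- data.frame(cell = c("cell_1", "cell_2"),
                    group = c("Melanocyte", "Fibroblast"))
annot_file <- file.path(tmp, "annotations.txt")
write.table(annot, annot_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# 3. Gene order: tab-delimited, no header; columns = gene, chromosome, start, stop.
gene_order <- data.frame(gene = rownames(counts),
                         chr = c("chr1", "chr1", "chr2"),
                         start = c(1000, 5000, 2000),
                         stop = c(2000, 7000, 4000))
gene_file <- file.path(tmp, "gene_order.txt")
write.table(gene_order, gene_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Every cell needs an annotation and every gene needs a position.
stopifnot(all(colnames(counts) %in% annot$cell),
          all(rownames(counts) %in% gene_order$gene))
```

In this tutorial the annotations are passed as a data frame built from the Seurat metadata instead of a file, which CreateInfercnvObject also accepts.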

Detailed documentation: https://www.bioconductor.org/packages/release/bioc/manuals/infercnv/man/infercnv.pdf

### Installation

In [None]:
library(infercnv)

### Preparing metadata file containing cell-type information from previously loaded seurat object

In [None]:
# Ask Seurat v5 to create v3-style assays
options(Seurat.object.assay.version = "v3")
infercnv_meta <- data.frame(V1 = Mel_scRNA@meta.data$cell_type)
rownames(infercnv_meta)<-rownames(Mel_scRNA@meta.data)
head(infercnv_meta)

In [None]:
unique(Mel_scRNA@meta.data$Level1_res1)

### Create InferCNV object

You can set the reference 'Normal' cell types as a character vector using ref_group_names, or you can set it to NULL. <br>
Either way, inferCNV identifies aneuploid cells and infers subclones using the predicted CNV profiles.

In [None]:
infercnv_obj = CreateInfercnvObject(
  raw_counts_matrix=GetAssayData(Mel_scRNA, slot="counts"),
  annotations_file=infercnv_meta,
  delim="\t",
  gene_order_file=glue::glue("{data_dir}/hg38_gencode_v27.txt"),
  ref_group_names=NULL) #c("KC","Immune","Fibroblast","Endothelial cell"))

### Run InferCNV

In [None]:
# Results are written to {data_dir}/all, which the later cells read from
infercnv_obj_default = infercnv::run(
    infercnv_obj,
    cutoff=1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
    out_dir=glue::glue("{data_dir}/all"),
    cluster_by_groups=TRUE, 
    plot_steps=FALSE,
    denoise=TRUE,
    HMM=FALSE,
    no_prelim_plot=TRUE,
    png_res=60
)


![infercnv_hm_ref](images/infercnv.png)

When you specify reference cell types, they are clustered separately. If you do not, they are still clustered separately but are labelled with cluster numbers.

In both cases, the aneuploid cells are subclustered. The diploid cells are excluded from the subcluster-prediction file infercnv_subclusters.observation_groupings.txt. Pick the cells whose label starts with the prefix 1_ in the heatmap, since this is the largest aneuploid cluster (Melanocytes). The other clusters could be artifacts or false predictions and might need further investigation. They can also be validated against the CopyKAT predictions by keeping only the cells that both tools call aneuploid.

![infercnv_subclusters](images/QIMR_infercnv_subclusters.png)

Alternatively, you could calculate CNV scores <br> **scores=apply(infercnv_obj@expr.data,2,function(x){ sum(x < 0.95 | x > 1.05)/length(x) })**
 <br> and plot the distribution of the per-cell scores. If you see a bimodal distribution, you could select aneuploid cells with a cut-off placed where the first peak drops off.

Example:
![hist](images/infercnv_hist.png)
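
The scoring approach can be tried without an inferCNV run using the base-R sketch below. The matrix `expr` is a simulated stand-in for infercnv_obj@expr.data (genes in rows, cells in columns, values centred on 1), and the cut-off of 0.15 is an assumption chosen for this toy data, not a recommended value.

```r
set.seed(3)
n_genes <- 500
# Simulated stand-in for infercnv_obj@expr.data: 80 diploid-like and 20 aneuploid-like cells.
diploid   <- matrix(rnorm(n_genes * 80, mean = 1, sd = 0.02), nrow = n_genes)
aneuploid <- matrix(rnorm(n_genes * 20, mean = 1, sd = 0.02), nrow = n_genes)
# Give the aneuploid-like cells a gain across the first 150 genes.
aneuploid[1:150, ] <- aneuploid[1:150, ] + 0.15
expr <- cbind(diploid, aneuploid)
colnames(expr) <- paste0("cell_", seq_len(ncol(expr)))

# CNV score: fraction of genes per cell whose value falls outside [0.95, 1.05].
scores <- apply(expr, 2, function(x) sum(x < 0.95 | x > 1.05) / length(x))

# Inspect the distribution; two separated peaks are expected for this toy data.
hist(scores, breaks = 30, main = "CNV score per cell", xlab = "CNV score")

# Hypothetical cut-off placed in the gap after the first peak.
cutoff <- 0.15
aneuploid_cells <- names(scores)[scores > cutoff]
length(aneuploid_cells)
```

On real data the two peaks usually overlap more than they do here, so the cut-off should be read off the histogram for each dataset.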

In [None]:
infercnv_sub_groupings<-read.csv(glue::glue("{data_dir}/all/infercnv_subclusters.observation_groupings.txt"), sep=" ",header=TRUE)
tail(infercnv_sub_groupings)

infercnv_subclusters.observation_groupings.txt contains the sub-clusters of the Aneuploid cells. 284 cells are predicted aneuploid and are further classified into 12 clusters.

In [None]:
unique(infercnv_sub_groupings$Annotation.Group)
dim(infercnv_sub_groupings)

In [None]:
rownames(infercnv_sub_groupings)<-infercnv_sub_groupings$BC

# Find the order of the BC values in the seurat_colnames
bc_order <- match(rownames(infercnv_sub_groupings), colnames(Mel_scRNA))

# Reorder the rows of infercnv_sub_groupings to follow the Seurat cell order
infercnv_sub_groupings_reordered <- infercnv_sub_groupings[order(bc_order), ]

# View the reordered DataFrame
head(infercnv_sub_groupings_reordered)

In [None]:

# Create a new column in Mel_scRNA@meta.data with empty values
Mel_scRNA@meta.data$infercnv_pred <- ""

# Get the row names of both data frames
mel_row_names <- rownames(Mel_scRNA@meta.data)
infercnv_row_names <- rownames(infercnv_sub_groupings)

# Find the matching row names
matching_row_names <- intersect(mel_row_names, infercnv_row_names)

# Update the new column with "Aneuploid" for matching rows
Mel_scRNA@meta.data[matching_row_names, "infercnv_pred"] <- "Aneuploid"

# Check the updated meta.data
head(Mel_scRNA@meta.data)

In [None]:
infercnv_aneuploid_cells <- rownames(Mel_scRNA@meta.data)[Mel_scRNA@meta.data$infercnv_pred == "Aneuploid"]
DimPlot(Mel_scRNA, group.by = "infercnv_pred", cells.highlight = infercnv_aneuploid_cells, cols.highlight = "red")

<a id="Evaluation"></a>

## Consistency of prediction between inferCNV and CopyKAT

In [None]:
# Create a new column in Mel_scRNA@meta.data with empty values
Mel_scRNA@meta.data$copykat_infercnv_consistent <- ""

# Update the new column with "Aneuploid" where both conditions are met
Mel_scRNA@meta.data$copykat_infercnv_consistent[Mel_scRNA@meta.data$infercnv_pred == "Aneuploid" & Mel_scRNA@meta.data$copykat_pred == "aneuploid"] <- "Aneuploid"

# Highlight the cells that are consistent between infercnv and copykat predictions
consistent_cells <- rownames(Mel_scRNA@meta.data)[Mel_scRNA@meta.data$copykat_infercnv_consistent == "Aneuploid"]
DimPlot(Mel_scRNA, group.by = "copykat_infercnv_consistent", cells.highlight = consistent_cells, cols.highlight = "red")


The red cells are those predicted aneuploid by both CopyKAT and inferCNV, which indicates a 77.89% consistency. <br> 15% of the melanocytes are predicted malignant by inferCNV, 10% by CopyKAT, and 8% by both tools.
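
Percentages of this kind can be computed directly from the two prediction columns. The sketch below uses ten hypothetical cells rather than the tutorial data, and the agreement measure shown (shared calls divided by calls made by either tool) is only one possible definition; it is not necessarily how the consistency figure above was derived.

```r
# Hypothetical predictions for ten melanocytes, using the label conventions of this tutorial.
copykat_pred  <- c("aneuploid", "aneuploid", "aneuploid", rep("diploid", 7))
infercnv_pred <- c("Aneuploid", "Aneuploid", "", "Aneuploid", rep("", 6))

ck   <- copykat_pred == "aneuploid"
inf  <- infercnv_pred == "Aneuploid"
both <- ck & inf

# Percentage of cells called aneuploid by each tool and by both tools.
pct <- function(x) 100 * mean(x)
c(copykat = pct(ck), infercnv = pct(inf), both = pct(both))

# One possible agreement measure: calls shared by both tools among calls made by either.
agreement <- 100 * sum(both) / sum(ck | inf)
agreement

# Cross-tabulation of the two sets of calls.
table(copykat = ck, infercnv = inf)
```

With the real object, `copykat_pred` and `infercnv_pred` would be the corresponding columns of Mel_scRNA@meta.data, restricted to the melanocytes.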

<a id="spatial"></a>

## Analyzing subclones in SPATIAL data using CopyKAT

These tools can also be applied to spatial data to visualize subclones. Because the cell type annotations are unknown for the public 10X Visium Melanoma dataset used here, the following demonstrates CopyKAT, which does not need them.

Dataset link https://www.10xgenomics.com/datasets/human-melanoma-if-stained-ffpe-2-standard

In [None]:
library(copykat)

### Extract and use counts from the .h5 file for copyKAT

In [None]:
exp.rawdata <- read.csv(glue::glue("{data_dir}/gene_counts_mat.txt"),sep="\t", header=TRUE) #gene-expression matrix
rownames(exp.rawdata)<-exp.rawdata$GENE
exp.rawdata<-exp.rawdata[,-1]

In [None]:
# genome="hg20" is CopyKAT's built-in label for the hg38 coordinates
copykat.test <- copykat(rawmat=exp.rawdata, id.type="S", ngene.chr=5, win.size=25, KS.cut=0.1, sam.name="test",
                        distance="euclidean", norm.cell.names="",output.seg="FALSE", plot.genes="TRUE", 
                        genome="hg20",n.cores=1)

**Output heatmap showing clustering of diploid and aneuploid cells based on inferred CNV profiles**

![HM](images/QIMR_spatial_test_copykat_heatmap.jpeg)

### Extract subclone annotations based on visualizing the output Heatmap

The tree clearly splits into three branches, broadly indicating 3 subclones. So, we cut the tree into 3 clusters and label the subclones for further visualization on the tissue.
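
The tree-cutting step can be tried on simulated data with the base-R sketch below. The toy CNA matrix and its three simulated subclones are assumptions for illustration, and base `dist()` is used here in place of `parallelDist::parDist()` so that the example has no extra dependencies.

```r
set.seed(11)
# Toy CNA matrix: 60 genomic bins x 30 tumor cells drawn from three simulated subclones.
profile <- function(bins, shift) { p <- rep(0, 60); p[bins] <- shift; p }
centres <- cbind(profile(1:20, 0.5), profile(21:40, -0.5), profile(41:60, 0.5))
truth <- rep(1:3, each = 10)
tumor.mat <- centres[, truth] + matrix(rnorm(60 * 30, sd = 0.05), nrow = 60)
colnames(tumor.mat) <- paste0("cell_", 1:30)

# Cluster cells (columns) on Euclidean distance with Ward linkage, then cut into k = 3 groups.
hcc <- hclust(dist(t(tumor.mat), method = "euclidean"), method = "ward.D2")
subclones <- cutree(hcc, k = 3)

# Compare the recovered clusters with the simulated subclones.
table(subclone = subclones, simulated = truth)
```

`cutree()` returns a named integer vector (cell name to cluster number), which is the form later written to file and joined to the spatial metadata.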

In [None]:
## extract ploidy predictions and drop the cells that CopyKAT could not classify
pred.test <- data.frame(copykat.test$prediction)
pred.test <- pred.test[pred.test$copykat.pred != "not.defined", ]  # logical filter is safe even when no cell is "not.defined"
CNA.test <- data.frame(copykat.test$CNAmat)

### define sub-clones
tumor.cells <- pred.test$cell.names[which(pred.test$copykat.pred=="aneuploid")]
tumor.mat <- CNA.test[, which(colnames(CNA.test) %in% tumor.cells)]
# hierarchical clustering of the aneuploid cells on their CNA profiles
hcc <- hclust(parallelDist::parDist(t(tumor.mat),threads =4, method = "euclidean"), method = "ward.D2")
hc.umap <- cutree(hcc,3)  # cut the tree into 3 subclones
# write.table(hc.umap,"3subclones_BCs.csv", sep=",", quote=TRUE)

#### The .h5 file has all information needed to visualize the tissue section and associated data

We need to add the subclone information into a new column of the metadata.
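
A minimal base-R sketch of that join is shown here, using hypothetical stand-ins: `meta` plays the role of the Seurat metadata table and `hc.umap` the named vector returned by `cutree()`. The point is to look labels up by barcode rather than by position, since the two objects are generally not in the same order.

```r
# Hypothetical stand-in for spatial_data@meta.data (row names are spot barcodes).
meta <- data.frame(nCount = c(100, 250, 80, 310),
                   row.names = c("AAAC.1", "AAAG.1", "AACT.1", "AAGA.1"))
# Hypothetical subclone labels: different order, and one spot has no label.
hc.umap <- c(AACT.1 = 2, AAAC.1 = 1, AAGA.1 = 3)

# Index by name so each label lands on its own barcode; unlabelled spots get NA.
meta$copykat_subclones <- unname(hc.umap[rownames(meta)])
meta
```

Assigning the vector positionally would silently attach labels to the wrong spots whenever the orders differ.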

In [None]:
spatial_data<-Load10X_Spatial(glue::glue("{data_dir}/Mel_spatial"),
                    filename = "CytAssist_FFPE_Human_Skin_Melanoma_filtered_feature_bc_matrix.h5",
                    assay = "RNA",
                    slice = "slice1",
                    filter.matrix = TRUE,
                    to.upper = FALSE,
                    image = NULL
)

In [None]:
# normalize data
spatial_data[["percent.mt"]] <- PercentageFeatureSet(spatial_data, pattern = "^MT-")
spatial_data <- NormalizeData(spatial_data, normalization.method = "LogNormalize", scale.factor = 10000)
spatial_data <- FindVariableFeatures(spatial_data, selection.method = "vst", nfeatures = 2000)

# scale and run PCA
spatial_data <- ScaleData(spatial_data, features = rownames(spatial_data))
spatial_data <- RunPCA(spatial_data, features = VariableFeatures(object = spatial_data))

# Check the number of PC components with an elbow plot (30 PCs are used for the downstream analysis below)
#ElbowPlot(spatial_data)

# cluster and visualize
spatial_data <- FindNeighbors(spatial_data, dims = 1:30)
spatial_data <- FindClusters(spatial_data, resolution = 0.8)
spatial_data <- RunUMAP(spatial_data, dims = 1:30)


### Visualizing the copyKAT predictions to identify tumor region

In [None]:
p<-read.csv(glue::glue("{data_dir}/Mel_spatial/test_copykat_prediction.txt"), sep="\t",header=TRUE)
colnames(spatial_data)<-gsub("-1",".1",colnames(spatial_data))
# Keep only the spots that have a CopyKAT prediction
spatial_data_sub<-subset(spatial_data, cells = p$cell.names)

# Reorder the rows of p to follow the spot order of the Seurat object
seurat_colnames<-colnames(spatial_data_sub)
rownames(p)<-p$cell.names
p<-p[seurat_colnames,]

spatial_data_sub@meta.data$copykat_pred<-p$copykat.pred
SpatialDimPlot(spatial_data_sub, group.by = "copykat_pred")

### Adding subclone information to the spatial object and visualizing subclones

In [None]:
subclones<-read.csv(glue::glue("{data_dir}/3subclones_BCs.csv"))
dim(subclones)
head(subclones)

In [None]:
spatial_data_subclones<-subset(spatial_data_sub, cells = rownames(subclones))

seurat_colnames<-colnames(spatial_data_subclones)

# Look up each spot's subclone by barcode so the labels follow the Seurat spot order
spatial_data_subclones@meta.data$copykat_subclones <- subclones[seurat_colnames, "x"]
SpatialDimPlot(spatial_data_subclones, group.by = "copykat_subclones")

### Save the processed data for further analysis

In [None]:
#saveRDS(spatial_data,"spatial_data.rds")
#saveRDS(spatial_data_subclones,"spatial_data_subclones.rds")

<a id="SUMMARY"></a>

# SUMMARY

Both of the tools discussed can be applied to scRNA-seq and spatial transcriptomics data.

- CopyKAT
    - Cell-type annotation-free tool to identify aneuploid (tumor) cells
    - Can also infer tumor subclones
    - Time consuming but easy to interpret
    
- InferCNV
    - Mainly to predict CNV profiles of annotated tumor cells by comparing with the annotated normal cells and to identify subclones.
    - Can also predict tumor cells when cell-type annotations are unknown
    - Quick to run but not as easy as CopyKAT to interpret results
    
Other new tools to explore: 'SCEVAN' (https://www.nature.com/articles/s41467-023-36790-9)
    