# Attack on scAnnotatR on axilla 10k cells dataset

- How to train a scAnnotatR classifier
- How to format the classifier to use it with adverSCarial
- How to run an IGD4C attack

Nguyen, V., Griss, J. scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data. BMC Bioinformatics, 2022. 23(44) https://doi.org/10.1186/s12859-022-04574-5 

In [1]:
library(scAnnotatR)
library(IRdisplay)
library(adverSCarial)

Loading required package: Seurat

Loading required package: SeuratObject

Loading required package: sp

The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, will retire in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.
The sp package is now running under evolution status 2
     (status 2 uses the sf package in place of rgdal)


Attaching package: ‘SeuratObject’


The following objects are masked from ‘package:base’:

    intersect, t


Loading required package: SingleCellExperiment

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, c

# scAnnotatR classifier training

Load previously splitted train/test pbmc3k dataset
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

annotated from the Seurat tutorial method
https://satijalab.org/seurat/articles/pbmc3k_tutorial

In [2]:
c_basen = c("hgnc_axilla_10k", "hgnc_brain_7k", "hgnc_liver_6k")
basen = c_basen[1]

In [3]:
se_train <- readRDS(paste0("data//v5/data//sc//",basen,"_train.rds"))
se_test <- readRDS(paste0("data//v5/data//sc//",basen,"_test.rds"))

In [4]:
unique(se_train@meta.data[['chr_seurat_cluster']])

In [5]:
Idents(se_train) <- "chr_seurat_cluster"
Idents(se_test) <- "chr_seurat_cluster"

In [6]:
head(se_train@meta.data)

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,replicate,condition,labels_unif,labels_cl_unif,labels_cl_unif2_broad,compartments,cnv_pass_mal,⋯,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid,chr_seurat_cluster
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<ord>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<chr>,<ord>
HTAPP-878-SMP-7149-TST-channel1_CGTTCTGCAGTCGCTG-1,HTAPP-878-SMP-7149-TST-channel1,19735,6278,1,TST,Epithelial,Epithelial,Epithelial,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,axilla,European,37-year-old human stage,KYeaGobvIQ,malignant cell
HTAPP-878-SMP-7149-TST-channel1_GCACGGTTCGATACTG-1,HTAPP-878-SMP-7149-TST-channel1,19848,7140,1,TST,Epithelial,Epithelial,Epithelial,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,axilla,European,37-year-old human stage,@dk<quR-I{,malignant cell
HTAPP-878-SMP-7149-TST-channel1_TGTCCACAGTCATAGA-1,HTAPP-878-SMP-7149-TST-channel1,19288,5699,1,TST,Epithelial,Epithelial,Epithelial,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,axilla,European,37-year-old human stage,g%ROJ`3nnu,malignant cell
HTAPP-878-SMP-7149-TST-channel1_TTTAGTCTCCACGTCT-1,HTAPP-878-SMP-7149-TST-channel1,19569,6949,1,TST,Epithelial,Epithelial,Epithelial,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,axilla,European,37-year-old human stage,(HRUf-FqnM,malignant cell
HTAPP-878-SMP-7149-TST-channel1_CCCGGAACAAGTGGAC-1,HTAPP-878-SMP-7149-TST-channel1,19440,6553,1,TST,Epithelial,Epithelial,Epithelial,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,axilla,European,37-year-old human stage,X_LhOX=$9;,malignant cell
HTAPP-878-SMP-7149-TST-channel1_GAGACTTTCATTTCCA-1,HTAPP-878-SMP-7149-TST-channel1,19421,7104,1,TST,Epithelial,Epithelial,Epithelial,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,axilla,European,37-year-old human stage,r0fjWR$I_},malignant cell


## Function to get the most significant genes between a cluster and the other cells

In [7]:
getSignGenesNot <- function(expr, clusters, target, method="wilcox", adjMethod="BH", verbose=FALSE){
        if (verbose) {message("Cluster ",target," vs all the other clusters")}
        pvals <- apply(t(expr), 1, function(x){
            c1 <- x[clusters == target]
            c1 <- c1[!is.na(c1)]
            c2 <- x[clusters != target]
            c2 <- c2[!is.na(c2)]
            if ( length(c1) == 0 || length(c2) == 0){
                return(1)
            }
            if (length(unique(c1))==1){
                c1[1] = c1[1] + 0.00001
            }
            if (length(unique(c2))==1){
                c2[1] = c2[1] + 0.00001
            }
            if (method=="wilcox"){
                return(wilcox.test(c1, c2)$p.value)
            }
            if (method=="ttest"){
                return(t.test(c1, c2)$p.value)
            }
        })
        means <- apply(t(expr), 1, function(x){
                c1 <- x[clusters == target]
                c1 <- c1[!is.na(c1)]
                c2 <- x[clusters != target]
                c2 <- c2[!is.na(c2)]
                if (length(c1) == 0){
                    c1 = c(0)
                }
                if (length(c2) == 0){
                    c2 = c(0)
                }
                return(mean(c1)-mean(c2))
        })  
        dfPvals <- data.frame(gene=colnames(expr), pval=unname(pvals), mean=means)
        rownames(dfPvals) <- dfPvals$gene

        for (clustInt in setdiff(unique(clusters), target)){
            if(verbose){message("Cluster ", target, " vs cluster ", clustInt)}
            newPvals <- pvals <- apply(t(expr), 1, function(x){
                c1 <- x[clusters == target]
                c1 <- c1[!is.na(c1)]
                c2 <- x[clusters == clustInt]
                c2 <- c2[!is.na(c2)]
                if ( length(c1) == 0 || length(c2) == 0){
                    return(1)
                }
                if (length(unique(c1))==1){
                    c1[1] = c1[1] + 0.001
                }
                if (length(unique(c2))==1){
                    c2[1] = c2[1] + 0.001
                }
                if (method=="wilcox"){
                    return(wilcox.test(c1, c2)$p.value)
                }
                if (method=="ttest"){
                    return(t.test(c1, c2)$p.value)
                }    
            })
            newMeans <- pvals <- apply(t(expr), 1, function(x){
                c1 <- x[clusters == target]
                c1 <- c1[!is.na(c1)]
                c2 <- x[clusters == clustInt]
                c2 <- c2[!is.na(c2)]
                if (length(c1) == 0){
                    c1 = c(0)
                }
                if (length(c2) == 0){
                    c2 = c(0)
                }
                return(mean(c1)-mean(c2))
            })  
            if(verbose){message(sum(unname(newPvals) < dfPvals$pval)," pvalues replaced by lower values")}
            dfPvals[unname(newPvals) < dfPvals$pval, "pval"] <- newPvals[unname(newPvals) < dfPvals$pval]
            dfPvals[unname(newPvals) < dfPvals$pval, "mean"] <- newMeans[unname(newPvals) < dfPvals$pval]
    }
    dfPvals$adjPval <- p.adjust(dfPvals$pval, method=adjMethod)
    dfPvals <- dfPvals[order(dfPvals$pval),]
    return(dfPvals)
}

In [8]:
dfScaled <- as.data.frame(se_train@assays$RNA@layers$scale.data)
colnames(dfScaled) <- rownames(se_train@assays$RNA@cells)
rownames(dfScaled) <- rownames(se_train@assays$RNA@features)
dfScaled <- as.data.frame(t(dfScaled))

The scAnnotatR will build one classifier to predict each cell type, so we one-hot encode the cell type by adding a meta.data column for each cell type with binary information: cell type or "unknown".

In [9]:
for (cellType in unique(se_train@meta.data$chr_seurat_cluster)){
    se_train@meta.data[[cellType]] <- unlist(lapply(se_train@meta.data$chr_seurat_cluster, function(x){
        if (x == cellType){
            return(cellType)
        } else {
            return("unknown")
        }
    }))
}

## Train and export one classifier for each cell type

In [10]:
listClassifiers <- list()
for (cellType in unique(se_train@meta.data$chr_seurat_cluster)){
    display(cellType)
    sg <- getSignGenesNot(dfScaled,
                   se_train@meta.data$chr_seurat_cluster,
                   cellType)
    # Selection of the 20 most significants with a minimum mean difference of 0.5 for the SVM
    selectedMarkers <- rownames(sg[abs(sg$mean)>0.5,])[1:20]
    classifier <- train_classifier(train_obj = se_train, cell_type = cellType, 
                                 marker_genes = selectedMarkers,
                                 assay = 'RNA', tag_slot = cellType)
    listClassifiers[[cellType]] <- classifier
    
    save_new_model(new_model = classifier,
                   path_to_models = paste0("repr_data/classifiers/scAnnotatR/trainedClass_", basen),
               include.default = FALSE) 
}
listClassifiers[[cellType]]

# Format the Classifier
To work with adverSCarial the classifier needs to be formated in a certain way.

In [5]:
scAnnotatR_classifier = function(expr, clusters, target){
    if (!"scAnnotatR" %in% loadedNamespaces()){
        library(scAnnotatR)
    }
    if ( !exists("sca_default_models")){
        sca_default_models <<- load_models(paste0("repr_data/classifiers/scAnnotatR/trainedClass_", basen))
        sca_cell_types <<- names(sca_default_models)
    }
    
    seurat.obj <- classify_cells(classify_obj = expr, 
                             assay = 'RNA', slot = 'scale.data',
                             cell_types = sca_cell_types, 
                             path_to_models = paste0("repr_data/classifiers/scAnnotatR/trainedClass_", basen))
    typePredictions <- seurat.obj@meta.data[, stringr::str_replace_all( paste0(sca_cell_types, "_p"), " ", "_")]
    
    pred_table <- table(seurat.obj@meta.data[clusters == target, "most_probable_cell_type"])
    cluster_pred <- names(pred_table[order(pred_table, decreasing = TRUE)])[1]
    odd <- unname(pred_table[cluster_pred]/sum(pred_table))
    colnames(typePredictions) <- unlist(lapply(colnames(typePredictions), function(x){
        unlist(strsplit(x,"_p$"))[1]
    }))
    typePredictions <- as.data.frame(t(typePredictions))
    
    result <- list(
        # Cell type prediction for the cluster
        prediction=stringr::str_replace_all(cluster_pred," ","_"),
        # Score of the predicted cell type
        odd=odd, 
        # Score for each cell type for each cell
        typePredictions=typePredictions,
        # Cell type for each cell
        cellTypes=seurat.obj@meta.data$most_probable_cell_type)
    return(result)
}

# Adversarial attack of the malignant cell cells cluster

In [6]:
so_pbmc_test <- se_test
clusters_so = so_pbmc_test@meta.data$chr_seurat_cluster
names(clusters_so) <- rownames(so_pbmc_test@meta.data)

cell types clusters

In [7]:
unique(clusters_so)

We check if the classifier is working properly, it predicts the NK cells cluster as NK cells

In [8]:
class_results <- scAnnotatR_classifier(so_pbmc_test, clusters_so, 'macrophage')

In [9]:
class_results$prediction

In [10]:
dfScaled <- as.data.frame(so_pbmc_test@assays$RNA@layers$scale.data)
colnames(dfScaled) <- rownames(so_pbmc_test@assays$RNA@cells)
rownames(dfScaled) <- rownames(so_pbmc_test@assays$RNA@features)
dfScaled <- t(dfScaled)

In [11]:
dfScaled[1:5,1:5]

Unnamed: 0,ENSG00000238009,ENSG00000241599,ENSG00000235146,LINC01409,FAM87B
HTAPP-878-SMP-7149-TST-channel1_GATGTTGCAAACGTGG-1,-0.2027578,-0.01920534,-0.04245598,0.7910192,-0.01895955
HTAPP-878-SMP-7149-TST-channel1_CTTTCAAGTAGGTACG-1,-0.2027578,-0.01920534,-0.04245598,0.2370915,-0.01895955
HTAPP-878-SMP-7149-TST-channel1_CACCGTTGTTCTGACA-1,-0.2027578,-0.01920534,-0.04245598,0.7840407,-0.01895955
HTAPP-878-SMP-7149-TST-channel1_GAGTTTGCACAACGTT-1,1.5589332,-0.01920534,-0.04245598,0.8048589,-0.01895955
HTAPP-878-SMP-7149-TST-channel1_GATGACTTCTTTGCTA-1,-0.2027578,-0.01920534,-0.04245598,0.2525219,-0.01895955


We use the adverSCarial getSignGenes function to get the most significant genes between clusters, ensuring that all pairs of clusters are equivalently representated.

In [12]:
start_time <- Sys.time()

In [13]:
sign_genes <- getSignGenes(dfScaled, clusters_so)

In [14]:
Sys.time() - start_time

Time difference of 17.79068 mins

In [15]:
head(sign_genes$results)

Unnamed: 0_level_0,gene,pval
Unnamed: 0_level_1,<chr>,<dbl>
HSPG2,HSPG2,0.0
MSR1,MSR1,6.526758e-284
PRKG1,PRKG1,3.4567790000000003e-180
CASZ1,CASZ1,0.0
ARHGAP15,ARHGAP15,2.87697e-183
MCAM,MCAM,1.122424e-226


We launch the attack with the advCGD function with alpha=2 and epsilon=2 parameters, which lead to a big modification on a few genes. You can try with alpha=1 and epsilon=1 to have small modifications of more genes.

In [16]:
start_time <- Sys.time()

In [17]:
igd4c_results <- advCGD(so_pbmc_test, clusters_so, 'macrophage',
                          scAnnotatR_classifier, alpha=2,
                          epsilon=2, slot="scale.data",
                          genes=sign_genes$results$gene[1:500],
                         verbose=T)

macrophage

53126

5312

392

New cluster target: T_cell

HSPG2 1

Number of original annot macrophage : 370

 mean 0.959160038401541 delt 0

Number of T_cell : 14

 mean 0.737670271885174 delt 0

Number of modified cells 0

MSR1 2

Number of original annot macrophage : 370

 mean 0.959160038401541 delt 0

Number of T_cell : 14

 mean 0.737670271885174 delt 0

Number of modified cells 378

PRKG1 3

Number of original annot macrophage : 361

 mean 0.936429968420805 delt -0.022730069980736

Number of T_cell : 16

 mean 0.762200657720953 delt 0.0245303858357792

Number of modified cells 0

CASZ1 4

Number of original annot macrophage : 361

 mean 0.936429968420805 delt 0

Number of T_cell : 16

 mean 0.762200657720953 delt 0

Number of modified cells 0

ARHGAP15 5

Number of original annot macrophage : 361

 mean 0.936429968420805 delt 0

Number of T_cell : 16

 mean 0.762200657720953 delt 0

Number of modified cells 376

MCAM 6

Number of original annot macrophage : 361

 mean 0.93605772

# Computation time

In [18]:
Sys.time() - start_time

Time difference of 21.19147 mins

We can see the genes that have been modified

In [19]:
igd4c_results$modGenes

And the modified Seurat object

In [20]:
igd4c_results$expr

An object of class Seurat 
25345 features across 5312 samples within 1 assay 
Active assay: RNA (25345 features, 0 variable features)
 3 layers present: counts, data, scale.data

The malignant cell cluster is now classified as CD8 T cells

In [21]:
new_classification <- scAnnotatR_classifier(igd4c_results$expr, clusters_so, 'macrophage')
new_classification$prediction

In [22]:
sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] adverSCarial_1.3.6          IRdisplay_1.1              
 [3] scAnnotatR_1.8.0            SingleCellExperiment_1.23.0
 [5] SummarizedExperiment_1.31.1 Biobase_2.61.0             
 [7] GenomicR