# Attack on scAnnotatR on liver 10k cells dataset

- How to train a scAnnotatR classifier
- How to format the classifier to use it with adverSCarial
- How to run an IGD4C attack

Nguyen, V., Griss, J. scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data. BMC Bioinformatics, 2022. 23(44) https://doi.org/10.1186/s12859-022-04574-5 

In [1]:
library(scAnnotatR)
library(IRdisplay)
library(adverSCarial)

Loading required package: Seurat

Loading required package: SeuratObject

Loading required package: sp

The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, will retire in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.
The sp package is now running under evolution status 2
     (status 2 uses the sf package in place of rgdal)


Attaching package: ‘SeuratObject’


The following objects are masked from ‘package:base’:

    intersect, t


Loading required package: SingleCellExperiment

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, c

# scAnnotatR classifier training

Load previously splitted train/test pbmc3k dataset
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

annotated from the Seurat tutorial method
https://satijalab.org/seurat/articles/pbmc3k_tutorial

In [2]:
c_basen = c("hgnc_axilla_10k", "hgnc_kidney_10k", "hgnc_liver_10k")
basen = c_basen[3]

In [3]:
se_train <- readRDS(paste0("data//v5/data//sc//",basen,"_train.rds"))
se_test <- readRDS(paste0("data//v5/data//sc//",basen,"_test.rds"))

In [4]:
unique(se_train@meta.data[['chr_seurat_cluster']])

In [5]:
Idents(se_train) <- "chr_seurat_cluster"
Idents(se_test) <- "chr_seurat_cluster"

In [6]:
head(se_train@meta.data)

Unnamed: 0_level_0,orig.ident,nCount_RNA,nFeature_RNA,replicate,condition,labels_unif,labels_cl_unif,labels_cl_unif2_broad,compartments,cnv_pass_mal,⋯,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid,chr_seurat_cluster
Unnamed: 0_level_1,<fct>,<dbl>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<ord>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<chr>,<ord>
HTAPP-944-SMP-7479-TST-channel1_CACAACACATCGTTCC-1,HTAPP-944-SMP-7479-TST-channel1,10284,5095,1,TST,Epithelial,Epithelial,Epithelial_neuro,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,liver,European,46-year-old stage,@60N2&n{|#,malignant cell
HTAPP-944-SMP-7479-TST-channel1_TAACTTCAGCAACTCT-1,HTAPP-944-SMP-7479-TST-channel1,9907,6055,1,TST,Neurons,Mesangial cells,Epithelial_neuro,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,liver,European,46-year-old stage,SK498wkiE8,malignant cell
HTAPP-944-SMP-7479-TST-channel1_AACGAAAGTCTCTCCA-1,HTAPP-944-SMP-7479-TST-channel1,19862,7590,1,TST,Epithelial,Epithelial,Epithelial_neuro,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,liver,European,46-year-old stage,JP$p}wXdg0,malignant cell
HTAPP-944-SMP-7479-TST-channel1_TGGTAGTTCTTACGGA-1,HTAPP-944-SMP-7479-TST-channel1,19169,7672,1,TST,Endothelial,Endothelial,Endothelial,Stromal,False,⋯,endothelial cell,10x 3' v3,breast cancer,Homo sapiens,female,liver,European,46-year-old stage,rhaxk3B9!C,endothelial cell
HTAPP-944-SMP-7479-TST-channel1_TTGTTTGCAGACACCC-1,HTAPP-944-SMP-7479-TST-channel1,19005,6660,1,TST,Epithelial,Epithelial,Epithelial_neuro,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,liver,European,46-year-old stage,v!j2{U6mqJ,malignant cell
HTAPP-944-SMP-7479-TST-channel1_GGCTTTCGTTTCACTT-1,HTAPP-944-SMP-7479-TST-channel1,19107,7349,1,TST,Epithelial,Epithelial,Epithelial_neuro,Malignant,True,⋯,malignant cell,10x 3' v3,breast cancer,Homo sapiens,female,liver,European,46-year-old stage,#?`|gL34x<,malignant cell


## Function to get the most significant genes between a cluster and the other cells

In [7]:
getSignGenesNot <- function(expr, clusters, target, method="wilcox", adjMethod="BH", verbose=FALSE){
        if (verbose) {message("Cluster ",target," vs all the other clusters")}
        pvals <- apply(t(expr), 1, function(x){
            c1 <- x[clusters == target]
            c1 <- c1[!is.na(c1)]
            c2 <- x[clusters != target]
            c2 <- c2[!is.na(c2)]
            if ( length(c1) == 0 || length(c2) == 0){
                return(1)
            }
            if (length(unique(c1))==1){
                c1[1] = c1[1] + 0.00001
            }
            if (length(unique(c2))==1){
                c2[1] = c2[1] + 0.00001
            }
            if (method=="wilcox"){
                return(wilcox.test(c1, c2)$p.value)
            }
            if (method=="ttest"){
                return(t.test(c1, c2)$p.value)
            }
        })
        means <- apply(t(expr), 1, function(x){
                c1 <- x[clusters == target]
                c1 <- c1[!is.na(c1)]
                c2 <- x[clusters != target]
                c2 <- c2[!is.na(c2)]
                if (length(c1) == 0){
                    c1 = c(0)
                }
                if (length(c2) == 0){
                    c2 = c(0)
                }
                return(mean(c1)-mean(c2))
        })  
        dfPvals <- data.frame(gene=colnames(expr), pval=unname(pvals), mean=means)
        rownames(dfPvals) <- dfPvals$gene

        for (clustInt in setdiff(unique(clusters), target)){
            if(verbose){message("Cluster ", target, " vs cluster ", clustInt)}
            newPvals <- pvals <- apply(t(expr), 1, function(x){
                c1 <- x[clusters == target]
                c1 <- c1[!is.na(c1)]
                c2 <- x[clusters == clustInt]
                c2 <- c2[!is.na(c2)]
                if ( length(c1) == 0 || length(c2) == 0){
                    return(1)
                }
                if (length(unique(c1))==1){
                    c1[1] = c1[1] + 0.001
                }
                if (length(unique(c2))==1){
                    c2[1] = c2[1] + 0.001
                }
                if (method=="wilcox"){
                    return(wilcox.test(c1, c2)$p.value)
                }
                if (method=="ttest"){
                    return(t.test(c1, c2)$p.value)
                }    
            })
            newMeans <- pvals <- apply(t(expr), 1, function(x){
                c1 <- x[clusters == target]
                c1 <- c1[!is.na(c1)]
                c2 <- x[clusters == clustInt]
                c2 <- c2[!is.na(c2)]
                if (length(c1) == 0){
                    c1 = c(0)
                }
                if (length(c2) == 0){
                    c2 = c(0)
                }
                return(mean(c1)-mean(c2))
            })  
            if(verbose){message(sum(unname(newPvals) < dfPvals$pval)," pvalues replaced by lower values")}
            dfPvals[unname(newPvals) < dfPvals$pval, "pval"] <- newPvals[unname(newPvals) < dfPvals$pval]
            dfPvals[unname(newPvals) < dfPvals$pval, "mean"] <- newMeans[unname(newPvals) < dfPvals$pval]
    }
    dfPvals$adjPval <- p.adjust(dfPvals$pval, method=adjMethod)
    dfPvals <- dfPvals[order(dfPvals$pval),]
    return(dfPvals)
}

In [8]:
dfScaled <- as.data.frame(se_train@assays$RNA@layers$scale.data)
colnames(dfScaled) <- rownames(se_train@assays$RNA@cells)
rownames(dfScaled) <- rownames(se_train@assays$RNA@features)
dfScaled <- as.data.frame(t(dfScaled))

The scAnnotatR will build one classifier to predict each cell type, so we one-hot encode the cell type by adding a meta.data column for each cell type with binary information: cell type or "unknown".

In [9]:
for (cellType in unique(se_train@meta.data$chr_seurat_cluster)){
    se_train@meta.data[[cellType]] <- unlist(lapply(se_train@meta.data$chr_seurat_cluster, function(x){
        if (x == cellType){
            return(cellType)
        } else {
            return("unknown")
        }
    }))
}

## Train and export one classifier for each cell type

In [10]:
listClassifiers <- list()
for (cellType in unique(se_train@meta.data$chr_seurat_cluster)){
    display(cellType)
    sg <- getSignGenesNot(dfScaled,
                   se_train@meta.data$chr_seurat_cluster,
                   cellType)
    # Selection of the 20 most significants with a minimum mean difference of 0.5 for the SVM
    selectedMarkers <- rownames(sg[abs(sg$mean)>0.5,])[1:20]
    classifier <- train_classifier(train_obj = se_train, cell_type = cellType, 
                                 marker_genes = selectedMarkers,
                                 assay = 'RNA', tag_slot = cellType)
    listClassifiers[[cellType]] <- classifier
    
    save_new_model(new_model = classifier,
                   path_to_models = paste0("repr_data/classifiers/scAnnotatR/trainedClass_", basen),
               include.default = FALSE) 
}
listClassifiers[[cellType]]

Loading required package: ggplot2

Loading required package: lattice

Saving new models to repr_data/classifiers/scAnnotatR/trainedClass_hgnc_liver_10k/new_models.rda...

Finished saving new model



Saving new models to repr_data/classifiers/scAnnotatR/trainedClass_hgnc_liver_10k/new_models.rda...

Finished saving new model



Saving new models to repr_data/classifiers/scAnnotatR/trainedClass_hgnc_liver_10k/new_models.rda...

Finished saving new model



Saving new models to repr_data/classifiers/scAnnotatR/trainedClass_hgnc_liver_10k/new_models.rda...

Finished saving new model



Saving new models to repr_data/classifiers/scAnnotatR/trainedClass_hgnc_liver_10k/new_models.rda...

Finished saving new model



Saving new models to repr_data/classifiers/scAnnotatR/trainedClass_hgnc_liver_10k/new_models.rda...

Finished saving new model



An object of class scAnnotatR for mature NK T cell 
* 20 marker genes applied: RUNX3, CD2, PYHIN1, FCRL6, FCGR3A, CD247, PTPRC, GNLY, ARHGAP15, CX3CR1, CD38, FGFBP2, DTHD1, TXK, FYB1, GZMA, ITK, SCML4, SAMD3, AOAH 
* Predicting probability threshold: 0.5 
* No parent model

# Format the Classifier
To work with adverSCarial the classifier needs to be formated in a certain way.

In [11]:
scAnnotatR_classifier = function(expr, clusters, target){
    if (!"scAnnotatR" %in% loadedNamespaces()){
        library(scAnnotatR)
    }
    if ( !exists("sca_default_models")){
        sca_default_models <<- load_models(paste0("repr_data/classifiers/scAnnotatR/trainedClass_", basen))
        sca_cell_types <<- names(sca_default_models)
    }
    
    seurat.obj <- classify_cells(classify_obj = expr, 
                             assay = 'RNA', slot = 'scale.data',
                             cell_types = sca_cell_types, 
                             path_to_models = paste0("repr_data/classifiers/scAnnotatR/trainedClass_", basen))
    typePredictions <- seurat.obj@meta.data[, stringr::str_replace_all( paste0(sca_cell_types, "_p"), " ", "_")]
    
    pred_table <- table(seurat.obj@meta.data[clusters == target, "most_probable_cell_type"])
    cluster_pred <- names(pred_table[order(pred_table, decreasing = TRUE)])[1]
    odd <- unname(pred_table[cluster_pred]/sum(pred_table))
    colnames(typePredictions) <- unlist(lapply(colnames(typePredictions), function(x){
        unlist(strsplit(x,"_p$"))[1]
    }))
    typePredictions <- as.data.frame(t(typePredictions))
    
    result <- list(
        # Cell type prediction for the cluster
        prediction=stringr::str_replace_all(cluster_pred," ","_"),
        # Score of the predicted cell type
        odd=odd, 
        # Score for each cell type for each cell
        typePredictions=typePredictions,
        # Cell type for each cell
        cellTypes=seurat.obj@meta.data$most_probable_cell_type)
    return(result)
}

# Adversarial attack of the microglial cell cells cluster

In [12]:
so_pbmc_test <- se_test
clusters_so = so_pbmc_test@meta.data$chr_seurat_cluster
names(clusters_so) <- rownames(so_pbmc_test@meta.data)

cell types clusters

In [13]:
unique(clusters_so)

We check if the classifier is working properly, it predicts the blood vessel endothelial cells cluster as blood vessel endothelial cells.

In [14]:
class_results <- scAnnotatR_classifier(so_pbmc_test, clusters_so, 'monocyte')

In [15]:
class_results$prediction

In [16]:
dfScaled <- as.data.frame(so_pbmc_test@assays$RNA@layers$scale.data)
colnames(dfScaled) <- rownames(so_pbmc_test@assays$RNA@cells)
rownames(dfScaled) <- rownames(so_pbmc_test@assays$RNA@features)
dfScaled <- t(dfScaled)

In [17]:
dfScaled[1:5,1:5]

Unnamed: 0,ENSG00000238009,ENSG00000241599,ENSG00000235146,LINC01409,FAM87B
HTAPP-944-SMP-7479-TST-channel1_ACCTACCGTTCCTAAG-1,1.7450717,-0.01717637,-0.04188288,0.2710382,-0.03529206
HTAPP-944-SMP-7479-TST-channel1_CATGCCTAGAGCCCAA-1,-0.1656401,-0.01717637,-0.04188288,0.2823981,-0.03529206
HTAPP-944-SMP-7479-TST-channel1_ATTTCTGTCGCCAATA-1,-0.1656401,-0.01717637,-0.04188288,0.2896639,-0.03529206
HTAPP-944-SMP-7479-TST-channel1_TTGAGTGTCTAGACCA-1,-0.1656401,-0.01717637,-0.04188288,0.2896639,-0.03529206
HTAPP-944-SMP-7479-TST-channel1_AGACACTGTCTCGACG-1,-0.1656401,-0.01717637,-0.04188288,-0.4820482,-0.03529206


We use the adverSCarial getSignGenes function to get the most significant genes between clusters, ensuring that all pairs of clusters are equivalently representated.

In [18]:
start_time <- Sys.time()

In [19]:
sign_genes <- getSignGenes(dfScaled, clusters_so)

In [20]:
Sys.time() - start_time

Time difference of 17.69299 mins

In [21]:
head(sign_genes$results)

Unnamed: 0_level_0,gene,pval
Unnamed: 0_level_1,<chr>,<dbl>
VWA1,VWA1,0.0
AOAH,AOAH,1.2301700000000002e-128
CD247,CD247,6.1708640000000006e-55
AKT3,AKT3,0.0
PTPRC,PTPRC,0.0
LST1,LST1,8.826905e-144


We launch the attack with the advCGD function with alpha=2 and epsilon=2 parameters, which lead to a big modification on a few genes. You can try with alpha=1 and epsilon=1 to have small modifications of more genes.

In [22]:
start_time <- Sys.time()

In [23]:
igd4c_results <- advCGD(so_pbmc_test, clusters_so, 'monocyte',
                          scAnnotatR_classifier, alpha=2,
                          epsilon=2, slot="scale.data",
                          genes=sign_genes$results$gene[1:500],
                         verbose=T)

monocyte

50006

5000

175

New cluster target: macrophage

VWA1 1

Number of original annot monocyte : 91

 mean 0.997460120659876 delt 0

Number of macrophage : 84

 mean 0.998245809077059 delt 0

Number of modified cells 0

AOAH 2

Number of original annot monocyte : 91

 mean 0.997460120659876 delt 0

Number of macrophage : 84

 mean 0.998245809077059 delt 0

Number of modified cells 91



# Computation time

In [24]:
Sys.time() - start_time

Time difference of 54.82041 secs

We can see the genes that have been modified

In [25]:
igd4c_results$modGenes

And the modified Seurat object

In [26]:
igd4c_results$expr

An object of class Seurat 
25571 features across 5000 samples within 1 assay 
Active assay: RNA (25571 features, 0 variable features)
 3 layers present: counts, data, scale.data

The blood vessel endothelial cell cluster is now classified as CD8 T cells

In [27]:
new_classification <- scAnnotatR_classifier(igd4c_results$expr, clusters_so, 'monocyte')
new_classification$prediction

In [28]:
sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] caret_6.0-94                lattice_0.21-8             
 [3] ggplot2_3.4.2               adverSCarial_1.3.6         
 [5] IRdisplay_1.1               scAnnotatR_1.8.0           
 [7] SingleCe