scripts/imc_generatespillmat_long.Rmd

---
title: "Spillover estimation IMC"
author: 'Vito Zanotelli et al.'
output:
  html_document:
    df_print: paged
    keep_md: true
---

# Aim
This script shows how to estimate spillover from single metal spots on an agarose coated slide.
Each spot should be imaged with a single acquisition. The name of the acquisition should be the metal that is used:
E.g. PanormaA_1_Yb176_23.txt

When run with the example data it reproduces the spillover estimation shown in Fig S5A as well as Fig 4A

# Script
## load all libraries
```{r Libraries, message=FALSE}
library(CATALYST)
library(data.table)
library(ggplot2)
library(flowCore)
library(dplyr)
library(dtplyr)
library(stringr)
library(ggpmisc)

source('spillover_imc_helpers.R')
```


## setup the configuration variables
```{r Setup}
# list of folders that contain each a complete single stain acquisition (e.g. in case that one wants to run and compare multiple single stains from different days)
fols_ss = c('../data/Figure_S5/Spillover_Matrix_2','../data/Figure_S5/Spillover_Matrix_1' )

# output folder
fol_out = '../data/Figure_S5/'
# name prefix for all output
prefix ='paper_version_'
```


## load single stains
### Data loading
```{r}
# load the data
list_img_ss <-lapply(fols_ss, load_ss_fol)
names(list_img_ss) <- fols_ss
```

### Adapt the column names to be recognized metal names by CATALYST 
CATALYST needs to have the metal names in the format (METAL)(MASS)Di
```{r}
list_img_ss = lapply(list_img_ss, function(x) lapply(x, fixnames))
dats_raw = lapply(list_img_ss, imglist2dat)
```


### Extract the single stain masses from the acquisition name
This needs to be changed in case a different naming scheme is used!
```{r Get bc masses}
for (dat in dats_raw){
  dat[, metal:= strsplit(.BY[[1]], '_')[[1]][3],by=file]
  dat[, mass:= as.numeric(str_extract_all(.BY[[1]], "[0-9]+")[[1]]),by=metal]
}
```

## Visualization of the raw data
In the following section the raw data is visualized

### Calculate per-file medians
```{r}
dats_raw_sum = rbindlist(lapply(dats_raw, calc_file_medians),idcol = T)
```


### Visualize per-file medians
Plots the median of the data. It is recommended to have >200 counts for all the channels.
This is also a good plot to check if the metal spots really contain the correct metal!
```{r fig.height=13, fig.width=20}
dats_raw_sum %>%
  ggplot(aes(x=1, y=med, color=.id))+
  facet_wrap(~file+metal, scales = 'free_y')+
  geom_label(aes(label=variable), size=4)
```


###  Optional data bining

If the median per-pixel intensities are to low, it could be worth to sum up some consecuteive pixels to get a better accuracy for the estimation
(here not the case). This is valid because for segmentation based quantitative image analysis usually anyways pixels are aggregated. If the binning is choosen to big, there is however a potential accumulation of background noise.

```{r}
# defines over how many pixels the aggregation should happen
# 1 = no aggregation
npixelbin = 1

dats_agg <- lapply(dats_raw, function(x) aggregate_pixels(x, n=npixelbin))
dats_agg_sum = rbindlist(lapply(dats_agg, calc_file_medians), idcol = T)
```


### Visualize per-file medians after binning
The intensities increase according to the aggregation factor
```{r fig.width=17, fig.height=10}
dats_agg_sum %>%
  ggplot(aes(x=1, y=med, color=.id))+
  facet_wrap(~file+metal, scales = 'free_y')+
  geom_label(aes(label=variable))

```

## CATALYST based compensation


## estimate the spillover
To estimate the spillover, the (aggregated) pixel values are first debarcoded using CATALYST, treating them like single cells. This step acts as a quality filter to remove background/noisy/weak pixels as well as pixels with artefacts (e.g. specles with strong signal in many channels).
If the true metal was correctly encoded in the filename, the 'remove_incorrect_bc' option will check the debarcoding and remove events assigned to the wrong barcode.

Then this identified, strong single stain pixels will be used for the spillover estimation.

```{r Binned}
res = lapply(dats_agg, function(x) re_from_dat(x,
                                               ss_ms=x[!is.na(mass), unique(mass)],
                                               minevents = 40,
                                              correct_bc = T))
sms = lapply(res, function(x) computeSpillmat(x))


```

### save the spillover matrices

```{r}
for (i in seq_along(sms)){
  outname = file.path(fol_out, paste0(prefix, basename(fols_ss[i]),'_sm.csv'))
  write.csv(sms[[i]],file = outname)
}
```


### Visualization of the spillover matrix

```{r}
for (i in seq_along(sms)){
  print(names(dats_agg)[i])
  ss_ms = dats_agg[[i]][!is.na(mass), unique(mass)]
  p = CATALYST::plotSpillmat(ss_ms,sms[[i]])
  print(p)
}
```


### Some quality indicators

Here we calculate e.g. number of debarcoded events/metal, median levels of highest signal and second highest signal

```{r}

for (i in seq_along(res)){
  
  dat = dats_agg[[i]]
  re = res[[i]]
  
  name = names(dats_agg)[i]
  tdat = dat %>%
    mutate(bcid = bc_ids(re)) %>%
    filter(bcid != '0') %>%
    dplyr::select(-c(Start_push, End_push, Pushes_duration,   X , Y  ,Z)) %>%
  melt.data.table(id.vars = c('metal', 'mass','file', 'bcid')) %>%
  do(data.table(.)[, list(med=median(value), n=.N), by=.(variable, metal, mass, bcid,file)]) 
  
  
  # find the highest metal, second highest metal
  sumdat = tdat[ , .(
    highestvariable = variable[med == max(med)],
    highestmed = max(med),
    secondhighestvariable = variable[med == sort(med,partial=length(med)-1)[length(med)-1]],
    secondhighestmed = sort(med,partial=length(med)-1)[length(med)-1],
    n=max(n)
  )  ,by=.( mass, bcid,file)]
  
  print(sumdat)
}

```


```{r Binned uncor}
res_uncor = lapply(dats_agg, function(x) re_from_dat(x,
                                               ss_ms=x[!is.na(mass), unique(mass)],
                                               minevents = 40,
                                              correct_bc = NULL ))
sms_uncor = lapply(res_uncor, function(x) computeSpillmat(x))


```

Assure that the results are exactly the same when enforcing that no debarcoding error happened by using the annotation from the file names:

```{r}
ndig = 8
for (i in seq_along(sms)){
print('all equal?')
print(all(round(sms_uncor[[i]], digits=ndig) == round(sms[[i]],digits = ndig)))
#diffmat = abs(round(sms_uncor[[i]], digits=ndig)-round(sms[[i]],digits = ndig))/round(sms[[i]],digits = ndig)
#match(T,diffmat > 0.01)
}
```
-> The results are exactly equal. Thus just debarcoding - without using any information about where the pixels actually belong - seems to be a vaild option to estimate the spillover.


## The plot below reproduces plots to check the linearity of spillover

### Define a helper function

```{r}
plot_binplot <- function(imgs, fn, x_var, y_var, perc=0.99, nbins=100, fkt=median){
    # This function makes a 'quantile' binning, binning the data in nbins that contain an equal number of events.
    dat = copy(imgs[[fn]])
    print(dat)
    #dat[, bins:= cut(get(x_var), seq(0, quantile(get(x_var),perc),length.out = nbins), right=T, include.lowest = T)]
    dat = subset(dat, get(x_var) < quantile(get(x_var),perc))
    dat[, bins:=  ntile(get(x_var), nbins)]
    x = melt.data.table(dat, id.vars = 'bins')
    x = x[, .(binmean=fkt(value)), by=.(bins, variable)]
    x = dcast.data.table(x[!is.na(bins),], 'bins~variable', value.var='binmean')
    
    ggplot(x, aes(x=get(x_var), y=get(y_var))) +
        geom_smooth(method = 'lm', alpha =0.5)+
        geom_point()+
        xlab(x_var) +
        ylab(y_var)+
        expand_limits(x=0, y=0)+
           stat_poly_eq(formula=as.formula('y~x'), aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
                parse = TRUE) +  
        theme(aspect.ratio=1)
    
}
```
Plot the spillover relationships
```{r}

pltimgs=list_img_ss[[1]]
fn = "Er166_27_Er166_28.txt"
x_var = "Er166Di"
y_var= "Er167Di"
p = plot_binplot(pltimgs, fn, x_var, y_var, perc=0.95, nbins=20, fkt = median)
p= p+ggtitle('Fig4 A upper')
print(p)


fn ="Er166_27_Er166_28.txt"
x_var = "Er166Di"
y_var= "Er168Di"

p = plot_binplot(pltimgs, fn, x_var, y_var, perc=0.9, nbins=25, fkt = median)
p= p+ggtitle('Fig4 A lower')
print(p)

```
-> This reproduces Fig 4A

```{r}
sessionInfo()
```