## TYK2 FlowDMS Offsets

For the most recent full TYK2 FlowDMS, we obtained conflicting summary statistics from different runs. The underlying reason is that in one run the chunks were separated before processing, and in another they were not. This does not matter to the model itself, which operates per-position and only uses WT counts from the same chunk. However, it _does_ matter for computing the _offset_, which is taken as `mean(log(count))` within each sample. When the chunks were pre-separated, this quantity was computed within each sample-chunk; otherwise it was computed across the whole sample.
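
To make the difference concrete, here is a minimal sketch on made-up counts. The `toy_chunks` data frame and its values are hypothetical and are not taken from the TYK2 data. It computes `mean(log(count))` once per sample and once per sample-chunk using base R's `ave()`:

```r
# Hypothetical toy counts: one sample whose variants fall into two chunks
toy_chunks <- data.frame(
  sample = "A",
  chunk  = rep(c("1", "2"), each = 2),
  count  = c(10, 100, 1000, 10000)
)

# Offset as computed when chunks are NOT pre-separated: one value per sample
toy_chunks$offset_sample <- ave(log(toy_chunks$count),
                                toy_chunks$sample, FUN = mean)

# Offset as computed when chunks ARE pre-separated: one value per sample-chunk
toy_chunks$offset_chunk <- ave(log(toy_chunks$count),
                               toy_chunks$sample, toy_chunks$chunk, FUN = mean)

print(unique(toy_chunks[, c("sample", "chunk", "offset_sample", "offset_chunk")]))
```

In this toy case the two chunks have very different depths, so the per-chunk offsets fall on either side of the single per-sample offset. Any model that uses the offset will therefore see different values depending on how the input was split.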

To see how this leads to the effect we observe in the midpoints, let's consider several models that differ only in the offset:

  - `mean(log(count))`
  - `log(sum(stop_counts))`
  - `log(sum(all_counts))`
  - no offset
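
As a rough illustration of why the choice matters, the sketch below fits a Poisson GLM with a log link to made-up counts under three of these offsets. The Poisson family, the `toy_fit` data, and the column names are assumptions made for illustration only; the actual FlowDMS model is not reproduced here. With a log link, the offset enters the linear predictor with its coefficient fixed at 1, so any between-sample difference in the offset is subtracted from the estimated sample effect.

```r
# Hypothetical toy counts for two samples (not real FlowDMS data)
toy_fit <- data.frame(
  sample = factor(rep(c("s1", "s2"), each = 3)),
  count  = c(10, 20, 30, 40, 40, 160)
)

# Two candidate per-sample offsets
toy_fit$off_total    <- log(ave(toy_fit$count, toy_fit$sample, FUN = sum))
toy_fit$off_mean_log <- ave(log(toy_fit$count), toy_fit$sample, FUN = mean)

# Same model, differing only in the offset (Poisson is an assumption here)
fit_none     <- glm(count ~ sample, family = poisson, data = toy_fit)
fit_total    <- glm(count ~ sample + offset(off_total),
                    family = poisson, data = toy_fit)
fit_mean_log <- glm(count ~ sample + offset(off_mean_log),
                    family = poisson, data = toy_fit)

# Compare the estimated s2-vs-s1 effect under each offset
print(c(none     = coef(fit_none)[["samples2"]],
        total    = coef(fit_total)[["samples2"]],
        mean_log = coef(fit_mean_log)[["samples2"]]))
```

In this toy data the mean count in `s2` is four times that in `s1`, so the sample coefficient is `log(4)` with no offset and essentially zero with the total-count offset. With `mean(log(count))` it lands in between, because that offset tracks the geometric mean rather than the total and the two samples have differently shaped count distributions.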

As an example, let's grab a position in chunk 2, run these regressions, pull out the WT marginals, and compute the midpoints.

In [3]:
library(data.table)
library(tidyverse)

In [4]:
# Load the mapped counts and split the oligo name into its components
mapped_counts <- data.table::fread("../../dms/pipeline/OCNT-VAMPLIB-1-assay-run2-all-assigned.mapped-counts.tsv") %>%
    separate(oligo, c("lib", "chunk", "wt_aa", "pos",
                        "mut_aa", "wt_codon", "mut_codon"), "_") %>%
    mutate(condition_conc = as.factor(condition_conc),
        condition = as.factor(paste0(condition, condition_conc)),
        mut_aa = if_else(wt_aa == mut_aa | is.na(mut_aa), "WT", mut_aa),
        mut_aa = relevel(as.factor(mut_aa), ref = "WT")) %>%
    # Candidate offsets computed across each whole sample
    group_by(sample) %>%
    mutate(log_stop_counts_sample = log(sum(count[which(mut_aa %in% c("*", "X", "Stop", "stop"))])),
        mean_log_count_sample = mean(log(count)),
        log_total_count_sample = log(sum(count))) %>%
    ungroup() %>%
    # The same candidate offsets computed within each sample-chunk
    group_by(sample, chunk) %>%
    mutate(log_stop_counts_chunk = log(sum(count[which(mut_aa %in% c("*", "X", "Stop", "stop"))])),
        mean_log_count_chunk = mean(log(count)),
        log_total_count_chunk = log(sum(count)))

write_tsv(mapped_counts,
          "../../dms/pipeline/OCNT-VAMPLIB-1-assay-run2-offsets.mapped-counts.tsv")

“[1m[22mExpected 7 pieces. Missing pieces filled with `NA` in 503411 rows [3, 16, 21,
23, 39, 47, 49, 59, 60, 84, 107, 127, 143, 178, 181, 184, 201, 214, 230, 245,
...].”


In [13]:
mapped_counts %>%
    ungroup() %>%
    select(sample, chunk, mean_log_count_sample, mean_log_count_chunk,
           log_total_count_sample, log_total_count_chunk) %>%
    distinct() %>%
    arrange(chunk)

sample,chunk,mean_log_count_sample,mean_log_count_chunk,log_total_count_sample,log_total_count_chunk
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
A100,1,3.688674,3.769271,17.60177,15.39581
A25,1,3.461891,3.355068,17.53353,14.93132
A50,1,3.839800,3.797552,17.79617,15.44883
A75,1,3.901290,3.938609,17.87131,15.65246
B100,1,3.555128,3.627367,17.41468,15.24655
B25,1,3.241474,3.130687,17.29550,14.71873
B50,1,3.392322,3.328332,17.40494,15.01256
B75,1,3.466787,3.496835,17.46120,15.20297
C100,1,3.483102,3.561759,17.13221,14.92179
C25,1,3.152562,3.132170,16.92379,14.53925
