In [1]:
library(tidyverse)
library(sva)

source("../../utils/plots_eda.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: mgcv

Loading required package: nlme


Attaching package: ‘nlme’


The following object is masked from ‘package:dplyr’:

    collapse


This is

# Load data

In [2]:
datasets <- c("GSE6008", "GSE14407", "GSE26712", "GSE40595", "GSE36668", "GSE69428")

In [3]:
all_metadata <- read.table("before/all_design.tsv", header = TRUE, sep = "\t")
all_expression <- read.table("before/all_expr_for_correction.tsv", header = TRUE, sep = "\t") %>%
    column_to_rownames("gene_id")

all_expression <- all_expression[, all_metadata$sample_id]

# remove genes (rows) with an NA in any sample
all_expression <- na.omit(all_expression)
# remove genes with 0 variance
all_expression <- all_expression[apply(all_expression, 1, var) > 0, ]
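The column subsetting above assumes that every `sample_id` in the metadata exists as a column name in the expression table (note that `read.table` with its default `check.names = TRUE` rewrites column names that are not syntactically valid). A minimal sketch of that check, followed by the same two filters, on toy data; `toy_meta`, `toy_expr`, and all values in them are hypothetical stand-ins, not data from this notebook:

```r
# Toy stand-ins for all_metadata and all_expression (hypothetical values)
toy_meta <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  batch = c("A", "A", "B"),
  Status = c("tumor", "normal", "tumor")
)
toy_expr <- data.frame(
  S3 = c(5.1, NA, 2.0, 7.0),
  S1 = c(4.9, 3.3, 2.0, 6.5),
  S2 = c(5.0, 3.1, 2.0, 8.2),
  row.names = c("g1", "g2", "g3", "g4")
)

# Samples listed in the metadata but absent from the expression table
missing_samples <- setdiff(toy_meta$sample_id, colnames(toy_expr))
stopifnot(length(missing_samples) == 0)

# Reorder columns to metadata order, then apply the same filters as above
toy_expr <- toy_expr[, toy_meta$sample_id]
toy_expr <- na.omit(toy_expr)                        # drops g2 (contains an NA)
toy_expr <- toy_expr[apply(toy_expr, 1, var) > 0, ]  # drops g3 (constant row)

aligned <- identical(colnames(toy_expr), toy_meta$sample_id)
kept_genes <- rownames(toy_expr)
```

Failing early on `missing_samples` gives a clearer message than the "undefined columns selected" error that the subsetting itself would raise.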

In [5]:
print("Loading metadata and expression data")
print("Metadata dimensions:")
print(dim(all_metadata))
print("Expression data dimensions:")
print(dim(all_expression))

[1] "Loading metadata and expression data"
[1] "Metadata dimensions:"
[1] 309   3
[1] "Expression data dimensions:"
[1] 13237   309


In [6]:
design <- model.matrix(~all_metadata$Status)

corrected_expr <- sva::ComBat(dat = all_expression, 
                              batch = all_metadata$batch, 
                              mod = design)

corrected_expr <- as.data.frame(corrected_expr)


Found6batches

Adjusting for1covariate(s) or covariate level(s)

Standardizing Data across genes

Fitting L/S model and finding priors

Finding parametric adjustments

Adjusting the Data
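
Passing `mod = design` asks ComBat to preserve the `Status` effect while removing batch effects, which only works if `Status` is not completely confounded with `batch`. A rough sketch of how one might inspect this before correcting, using base R only; `toy_meta` and its values are hypothetical, and only the column names `batch` and `Status` are taken from the notebook:

```r
# Hypothetical metadata with the same column names as all_metadata
toy_meta <- data.frame(
  batch = c("A", "A", "A", "B", "B", "C", "C"),
  Status = c("tumor", "normal", "tumor", "tumor", "normal", "tumor", "tumor")
)

# Cross-tabulate the number of samples per batch and condition
batch_by_status <- table(toy_meta$batch, toy_meta$Status)
print(batch_by_status)

# Batches containing a single Status level carry no within-batch contrast
single_status_batches <- rownames(batch_by_status)[rowSums(batch_by_status > 0) < 2]

# Rank check of a combined batch + Status design matrix
full_design <- model.matrix(~ batch + Status, data = toy_meta)
is_full_rank <- qr(full_design)$rank == ncol(full_design)
```

In this toy layout batch `C` holds only tumor samples, yet the combined design is still full rank because the other batches contain both conditions; the rank check would fail only if `Status` could be written as a combination of the batch indicators.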




In [7]:
# Status is already present in all_metadata; add Dataset as a character copy of batch
all_metadata$Dataset <- as.character(all_metadata$batch)

In [8]:

# plot the combined corrected data
print("Plotting combined corrected data")
plot_res <- plot_diagnostic(corrected_expr, all_metadata, "Combined Corrected",
                            log_transform = TRUE, with_rowname = TRUE)
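# arrange panels 1 and 2 side by side above panel 3; the `+` and `/` operators
# on plots come from patchwork, assumed to be attached by plots_eda.R
layout <- (plot_res[[1]] + plot_res[[2]]) /
          (plot_res[[3]])
ggsave("after/diagnostic_plot_corrected.png",
       plot = layout, width = 12, height = 12)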


[1] "Plotting combined corrected data"
[1] "..plotting.."


“`aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.”
No id variables; using all as measure variables



In [9]:
# write out the corrected expression data, with gene IDs as the first column
write.table(corrected_expr %>% rownames_to_column("gene_id"),
            "after/all_corrected_R_expr.tsv", sep = "\t",
            quote = FALSE, col.names = TRUE, row.names = FALSE)
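
A quick way to confirm that the written table can be loaded again with gene IDs as row names is a round-trip check. The sketch below uses a hypothetical two-gene table (`toy_corrected`) and a temporary file rather than the real output path, and replaces `rownames_to_column()` with a base-R `cbind()` so that it runs without the tidyverse:

```r
# Hypothetical corrected table standing in for corrected_expr
toy_corrected <- data.frame(
  S1 = c(1.5, 2.25),
  S2 = c(3.0, 4.75),
  row.names = c("g1", "g2")
)

out_file <- tempfile(fileext = ".tsv")

# Base-R equivalent of rownames_to_column("gene_id")
to_write <- cbind(gene_id = rownames(toy_corrected), toy_corrected)
write.table(to_write, out_file, sep = "\t",
            quote = FALSE, col.names = TRUE, row.names = FALSE)

# Read back with gene_id as row names and compare with the original
read_back <- read.table(out_file, header = TRUE, sep = "\t",
                        row.names = "gene_id")
round_trip_ok <- isTRUE(all.equal(read_back, toy_corrected))
```

`all.equal()` compares numeric values within a tolerance, which suits this check because `write.table` stores numbers as text with a limited number of significant digits.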