process idat files with R package gsrc #40

Open
ekarlins opened this issue Mar 27, 2017 · 17 comments

@ekarlins
Collaborator

I've started looking into this again, since my ultimate goal is to be able to start the pipeline from idat files without needing to use an Illumina GUI. This R package would allow for that. I've figured out how to process an Illumina CSV manifest and output the two annotation files that the vignette uses as "dictionary" and "chrPos". These are simply data frames created by reading in the CSV files produced by my script "scripts/IllManifest2gsrc.py".
I'll add the GSA example files to the repo.

@ekarlins
Collaborator Author

ekarlins commented Mar 27, 2017

Now we should be able to follow the gsrc vignette with the appropriate annotation. Here's how you can load the appropriate data frames into R for annotation:

dictionary <- read.csv("files/gsrc/GSA_dictionary.csv", colClasses=c("integer", "character"))
chrPos <- read.csv("files/gsrc/GSA_chrPos.csv", colClasses=c("character", "character", "integer"))

Note: it's important to specify colClasses for the chrPos object. Without colClasses, read.csv turns the "chromosome" column into a factor (character columns are read as factors by default), and a factor fails the is.vector test in the read_intensities function.
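
As a quick illustration (hypothetical values; the point is that factors carry attributes that is.vector rejects):

chr <- factor(c("1", "2", "X"))  # what you get without colClasses
is.vector(chr)                   # FALSE: factors have class/levels attributes
is.vector(as.character(chr))     # TRUE: a plain character vector passes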

@ekarlins
Collaborator Author

I'm going to continue working on this issue, but others should feel free to as well. My goal is to create two processes (a rough interface sketch follows the list):

  1. creating an object from many samples that can be used for normalization.
  2. creating a way to process one sample, normalizing it against the object from 1) and outputting the appropriate data for just that one sample.
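
A hypothetical shape for those two processes (all names here are made up, just to pin down the interface):

## 1) run once over many samples; keep only what normalization needs,
##    e.g. reference quantiles and per-probe scaling factors
build_norm_reference <- function(idat_files) {
  ## read the idats, normalize jointly, return the stripped-down reference
}

## 2) run per sample, against the stored reference
process_one_sample <- function(idat_pair, norm_reference) {
  ## normalize this one sample, then compute theta, BAF, and LRR for it
}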

@ngiangre
Collaborator

@ekarlins can you describe the normalization rationale? When I initially coded this we used a test PennCNV input file, and I didn't really know exactly what it represented. I'll try to work on this, but I was wondering: if one loads multiple of those input files, what are you actually normalizing?

@ekarlins
Collaborator Author

@ngiangre, the normalization is needed to calculate theta, and theta is needed to calculate LRR and BAF. For the PennCNV input txt files that I created from the gtc files, the normalization to create theta was done by the Illumina software that created the gtc file. So there was already a step that used a lot of samples for normalization (the "a lot of samples" part lives in the egt file produced by GenomeStudio).

For gsrc, if we want to create theta we need to give it a bunch of samples. We need theta before we can get LRR and BAF. So this is about figuring out a way to generate theta without using the Illumina software at all.

Does that make sense?
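
For reference, here is a minimal sketch of the standard Illumina-style quantities (after Peiffer et al. 2006); gsrc's exact formulas may differ:

theta_r <- function(x, y) {             # x = A-allele, y = B-allele intensity
  list(theta = (2 / pi) * atan2(y, x),  # allele angle: ~0 AA, ~0.5 AB, ~1 BB
       r     = x + y)                   # total intensity
}
## LRR = log2(R_observed / R_expected), and BAF is interpolated from canonical
## genotype-cluster positions; both depend on a many-sample reference, which
## is exactly what the egt file encodes.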

FYI, I ran through the gsrc vignette last night up to generating BAF and LRR on our data. There were a couple of things that didn't function as expected, though I found fixes; that may have been because of the small number of input samples. I'm not sure. Here's the code I used to get the vignette working:

setwd("Scan2CNV")
library(gsrc)
dictionary <- read.csv("files/gsrc/GSA_dictionary.csv", colClasses=c("integer", "character"))
chrPos <- read.csv("files/gsrc/GSA_chrPos.csv", colClasses=c("character", "character", "integer"))
files <- list.files("files", pattern = "idat",full.names = TRUE)
column_names <- sapply(strsplit(files, split = "/"), FUN=function(x) x[length(x)])
head(dictionary)
head(chrPos)
raw_data <- read_intensities(files = files,
                             dict = dictionary,
                             cnames = column_names,
                             pos = chrPos, beads = FALSE,
                             sd = FALSE)
str(raw_data)

boxplot(as.vector(raw_data$raw[, seq(1, length(raw_data$samples), 2)]),
        as.vector(raw_data$raw[, seq(2, length(raw_data$samples), 2)]),
        names = c("Green", "Red"))

norm_dat <- intens_theta(raw_data, norm = "both", scaling = "mean", transf = "log")
str(norm_dat)


head(norm_dat$samples)
## Workaround: something odd here -- the samples vector comes back padded
## with a bunch of NAs, so keep only the 3 real sample names.
norm_dat$samples <- head(norm_dat$samples, 3)

norm_dat <- remove_suffix(norm_dat, "_Grn.idat")
head(norm_dat$samples)
hist(norm_dat$intensity, breaks = 1000)
hist(norm_dat$theta, breaks = 1000)

## geno_baf_rratio expects a row for each SNP and a column for each sample,
## but the theta and intensity matrices come out the opposite way.
## Transpose them with t() -- assigning dim() would keep the column-major
## element order and scramble the values.
norm_dat$theta <- t(norm_dat$theta)
norm_dat$intensity <- t(norm_dat$intensity)

norm_dat <- geno_baf_rratio(norm_dat)

str(norm_dat)

hist(norm_dat$baf, breaks = 1000)
tmp <- table(norm_dat$geno, useNA = "ifany")
barplot(tmp, names.arg = c(names(tmp)[1:4], "NA"))
hist(norm_dat$rratio, breaks = 1000)

@ngiangre
Collaborator

ngiangre commented Mar 29, 2017 via email

@ekarlins
Collaborator Author

ekarlins commented Mar 31, 2017

I'm going through the code of the normalization functions in "gsrc" and trying to figure out how they work, to see if there are ways to run these functions (or similar ones) one sample at a time. I know certain steps benefit from having many samples, say 1000. So one approach would be to create a single object containing 1000 samples; each time we run one sample, we would also load that 1000-sample object. If we are running 150k samples and only have to create the 1000-sample object once, that can provide some computational benefit. I'm hoping it's possible to strip this object down, though: currently I estimate it would take about 90GB of RAM to load into R.
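
Back-of-the-envelope on that estimate: a single dense matrix of doubles for 1000 samples x 700,078 SNPs x 2 channels is already large, before counting any copies R makes along the way:

1000 * 700078 * 2 * 8 / 1e9  # ~11.2 GB for one copy at 8 bytes per value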

The first normalization comes from this function:

library(gsrc)
intens_theta <- function(raw, norm = "quantile", scaling = "mean", transf = "log", pn = 2)

The vignette runs it with:

norm = "both"

You can see the code for this function here: https://github.com/cran/gsrc/blob/master/R/preprocess.R

The normalization steps are in lines 41 to 47 of the link above.
They use this function for norm = "both":

normalize <- function(raw) {
    ## quantile-normalize each adjacent pair of columns together
    for (i in seq(2, ncol(raw), 2)) {
      raw[, c(i - 1, i)] <-
        preprocessCore::normalize.quantiles(raw[, c(i - 1, i)])
    }
    ## then scale so median absolute values match across columns
    limma::normalizeMedianAbsValues(raw)
  }

For the object raw, the rows are the samples and the columns are the SNPs.

dim(raw)
##[1]      6 700078

So preprocessCore::normalize.quantiles is doing this quantile normalization one SNP at a time, using the red and green intensities from all of the samples; in our example, that's a normalization across the 1000 samples.

The next normalization function is this:

## source of limma::normalizeMedianAbsValues
function (x) {
    narrays <- NCOL(x)
    if (narrays == 1)
        return(x)
    cmed <- log(apply(abs(x), 2, median, na.rm = TRUE))
    cmed <- exp(cmed - mean(cmed))
    t(t(x)/cmed)
}

The 2 in the apply call means this runs one column, i.e. one SNP, at a time, though this time red and green are handled separately. So again, it's a normalization across the 1000 samples.

It looks like the rest of the calculations in "intens_theta" would be the same whether running just one sample or many samples (maybe someone can confirm this).

So what I'd like to figure out is if it's possible to do the two normalization steps above without storing the full info from the matrix of 1000 samples.

For "normalizeMedianAbsValues" it seems like we could just calculate "cmed" and store it, which has a median value for each probe. Then we could just run that last line:

t(t(x)/cmed)

to give the normalized value, where x is the intensities after quantile normalization.
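
A minimal sketch of that idea (the helper names are made up), assuming rows are samples and columns are probes as above:

## run once on the big reference matrix and save the result
precompute_cmed <- function(ref) {
  cmed <- log(apply(abs(ref), 2, median, na.rm = TRUE))
  exp(cmed - mean(cmed))
}

## later, rescale any single sample (a 1 x nprobe matrix) with the stored cmed
apply_cmed <- function(x, cmed) t(t(x) / cmed)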

I spoke with a statistician today who I'm hoping will give some more insight into whether it's possible to run quantile normalization using only partial information from the set of 1000 samples. If we kept just the deciles, for example, would it work as well?
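
One concrete version of the "partial information" idea is a "frozen" quantile normalization: store a grid of reference quantiles once, then map each new value onto that grid by rank. A sketch under that assumption (this is not something gsrc provides):

## ref_q: reference quantiles saved once, e.g. the deciles
##   ref_q <- quantile(ref_values, probs = seq(0, 1, by = 0.1))
normalize_to_reference <- function(x, ref_q) {
  p <- (rank(x, ties.method = "average") - 0.5) / length(x)
  quantile(ref_q, probs = p, names = FALSE)  # linear interpolation on the grid
}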

@ngiangre
Collaborator

What you're describing, keeping only some rather than all of the information from the 1000 samples, sounds like asking whether we can store a sufficient statistic of the distributions (thinking of each SNP as 'emitting' 1000 random variables, making as many distributions as there are SNPs), which would represent the data without storing all the extra dimensionality. Great idea to talk to a statistician; they would know the answer best.

We can make a parallelized version of this by using parallel::mclapply, which can take advantage of multiple cores (sketch below). Maybe you can try it out on your cluster? At least it'll be faster, though it won't avoid the memory issue...
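
For example, the per-pair loop in gsrc's normalize function could be farmed out across cores; a sketch, assuming the same raw matrix as above:

library(parallel)
## quantile-normalize each adjacent column pair on its own core
pairs <- lapply(seq(2, ncol(raw), 2), function(i) c(i - 1, i))
norm_cols <- mclapply(pairs,
                      function(p) preprocessCore::normalize.quantiles(raw[, p]),
                      mc.cores = 4)
raw_norm <- do.call(cbind, norm_cols)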

We can also delete the object after reducing the matrix, so at least it's not in memory for long. I'll have to think more about this.

@ekarlins
Collaborator Author

ekarlins commented Mar 31, 2017 via email

@ngiangre
Collaborator

ngiangre commented Mar 31, 2017 via email

@ekarlins
Collaborator Author

ekarlins commented Apr 1, 2017

@ngiangre, yes, for Snakemake we'd have to write to files after each step and then bring it all together, so I/O would certainly be a cost of doing it this way. The time to submit a lot of jobs could be another cost. But yes, we can make the intermediate files temp files.

Yep, it will involve rewriting the functions. I'm not sure "gsrc" is the best package to use for this. More on that in a separate comment.

@ekarlins
Collaborator Author

ekarlins commented Apr 1, 2017

So I've been talking to a statistician about quantile normalization and about the low-level operations in "gsrc". I think we uncovered an issue with "gsrc" that makes it unsuitable for our purposes. It's possible this is because the package is really only for "Genome Structure Rearrangement Calling in Genomes with High Synteny", which is not what we're looking for. So maybe it's not flexible enough for general CNV calling.

The quantile normalization step seems to run one SNP at a time, across all samples, including both the red and green signal. Suppose you have a SNP where all samples have the AA genotype. Your matrix of red and green signals across samples at this SNP might look like this:

raw <- cbind(runif(10, 6000, 6001), runif(10, 100, 101))
raw
          [,1]     [,2]
[1,] 6000.348 100.0071
[2,] 6000.868 100.7038
[3,] 6000.544 100.7438
[4,] 6000.777 100.8689
[5,] 6000.987 100.4893
[6,] 6000.501 100.3422
[7,] 6000.239 100.2562
[8,] 6000.615 100.6452
[9,] 6000.906 100.9429
[10,] 6000.315 100.3039

The first column is the signal for the "A" allele and the second the signal for the "B" allele, so all samples are clearly "AA". Now run the quantile normalization using the code from "gsrc" and you get this:

> raw2 <- preprocessCore::normalize.quantiles(raw)
> raw2
          [,1]     [,2]
[1,] 3050.326 3050.123
[2,] 3050.806 3050.740
[3,] 3050.517 3050.806
[4,] 3050.740 3050.888
[5,] 3050.965 3050.517
[6,] 3050.422 3050.422
[7,] 3050.123 3050.286
[8,] 3050.630 3050.630
[9,] 3050.888 3050.965
[10,] 3050.286 3050.326

What?! It makes them all look like "AB" genotypes. This doesn't seem right.
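
You can see the damage directly by computing theta (theta = 2/pi * atan(B/A), as above) on the toy matrices before and after normalization:

theta_raw  <- atan2(raw[, 2], raw[, 1]) / (pi / 2)    # all ~0.01, i.e. AA
theta_norm <- atan2(raw2[, 2], raw2[, 1]) / (pi / 2)  # all ~0.5, i.e. AB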

@ngiangre
Collaborator

ngiangre commented Apr 1, 2017 via email

@ekarlins
Collaborator Author

ekarlins commented Apr 1, 2017 via email

@ngiangre
Collaborator

ngiangre commented Apr 1, 2017 via email

@ekarlins
Collaborator Author

ekarlins commented Apr 1, 2017

I think the main thing right now is to document which function(s) are used for normalization and the "unit" each one normalizes.
So for gsrc we would say that it runs preprocessCore::normalize.quantiles one SNP at a time, across all samples, including both red and green intensities; then it runs limma::normalizeMedianAbsValues one SNP at a time, across all samples, separately for red and green.

@ngiangre
Collaborator

ngiangre commented Apr 1, 2017 via email

@ekarlins
Collaborator Author

ekarlins commented Apr 1, 2017

Started issue #41 for crlmm normalization.
