Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impute gates that failed the QA check #10

Open
mikejiang opened this issue Jun 27, 2013 · 11 comments
Open

impute gates that failed the QA check #10

mikejiang opened this issue Jun 27, 2013 · 11 comments

Comments

@mikejiang
Copy link
Member

a new API is going to take a list of failed samples as input and return a list of reference samples

refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples)

Once the reference samples are selected, it should be fairly straightforward to do the the gate imputation with the existing APIs ( getGate and setGate )

#extract reference gates
refGates <- sapply(refSamples,function(i)getGate(gs[[i]],"MTG_gate"))

#impute the gates
setGate(gs[failedSamples],"MTG_gate",refGates)
mikejiang pushed a commit that referenced this issue Jun 27, 2013
@raphg
Copy link
Member

raphg commented Jun 27, 2013

Really nice Mike. I can't wait to see some real tests. Perhaps on the
Newell data.

On Wed, Jun 26, 2013 at 5:10 PM, Mike Jiang notifications@github.comwrote:

a new API is going to take a list of failed samples as input and return
a list of reference samples

refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples)

Once the reference samples are selected, it should be fairly
straightforward to do the the gate imputation with the existing APIs (
getGate and setGate )

#extract reference gatesrefGates <- sapply(refSamples,function(i)getGate(gs[[i]],"MTG_gate"))
#impute the gatessetGate(gs[failedSamples],"MTG_gate",refGates)


Reply to this email directly or view it on GitHubhttps://github.com//issues/10
.

@mikejiang
Copy link
Member Author

By using flowCore::expressionFilter

expression1 <- paste0("`",params,"`>0")
ef <- char2ExpressionFilter(expression1)
> ef
expression filter '`<Blue F 525/50-A>`>0' evaluating the expression:
`<Blue F 525/50-A>`>0

I am able to gate out the outlier events (below zero)

tData <- Subset(tData,ef)

And here is the new emd distance, which seems to be more reasonable.
rplot001

@raphg
Copy link
Member

raphg commented Jun 28, 2013

Yes it does look more resonable. I am surprised that outliers make such a
big difference, because the density is so low.
I just did a quick test comparing emd and ks.test, and the latter seems
more robust and it's faster.
Can you try it (without) filtering outliers to see what it gives.

On Fri, Jun 28, 2013 at 9:56 AM, Mike Jiang notifications@github.comwrote:

By using flowCore::expressionFilter

expression1 <- paste0("",params,">0")ef <- char2ExpressionFilter(expression1)> efexpression filter '<Blue F 525/50-A>>0' evaluating the expression:<Blue F 525/50-A>>0

I am able to gate out the outlier events

tData <- Subset(tData,ef)

And here is the new emd distance, which seems more reasonable.
[image: rplot001]https://f.cloud.github.com/assets/1385649/723264/ac65f00c-e013-11e2-9883-0cc2e00a4411.png


Reply to this email directly or view it on GitHubhttps://github.com//issues/10#issuecomment-20200514
.

@mikejiang
Copy link
Member Author

EM with outlier filter

> system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em")))
   user  system elapsed 
 30.002   0.000  29.466

and ks.test without outlier filter is indeed faster and more robust

> system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test"))
   user  system elapsed 
  4.096   0.000   4.313

rplot001

@raphg
Copy link
Member

raphg commented Jun 28, 2013

Great.

On Fri, Jun 28, 2013 at 10:44 AM, Mike Jiang notifications@github.comwrote:

EM with outlier filter

system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em"))) user system elapsed 30.002 0.000 29.466

and ks.test without outlier filter is indeed faster and more robust

system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test")) user system elapsed 4.096 0.000 4.313

[image: rplot001]https://f.cloud.github.com/assets/1385649/723492/8458c868-e019-11e2-9739-732f6923196a.png


Reply to this email directly or view it on GitHubhttps://github.com//issues/10#issuecomment-20203117
.

@gfinak
Copy link
Member

gfinak commented Jun 28, 2013

Nice!

On 06/28/2013 01:44 PM, Mike Jiang wrote:

|EM| with |outlier filter|

system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples , method = "em")))
user system elapsed
30.002 0.000 29.466

and |ks.test| without |outlier filter| is indeed faster and more robust

system.time(refSamples <- .nearestSamples(gs, "MTG_gate", failedSamples, method = "ks.test"))
user system elapsed
4.096 0.000 4.313

rplot001
https://f.cloud.github.com/assets/1385649/723492/8458c868-e019-11e2-9739-732f6923196a.png


Reply to this email directly or view it on GitHub
#10 (comment).

@ramhiser
Copy link

@mikejiang This is nice.

While I think the K-S approach deserves some credit here, it will give some peculiar results in other cases. Consider for example the following contrived example.

In this example, we are trying to determine if x1 is nearer to x2 or x3 using the K-S criterion. Notice that there is no overlap in the figure of the empirical CDFs, but in the density plot, there is some overlap. Because the K-S statistic simply finds the largest difference in empirical CDFs, samples x2 and x3 are equidistant from x1 because there is no overlap of the samples. Even if there were, it would be relatively straightforward to contrive an example where we would choose x3 as often as x2.

With this in mind, I am in favor of computing some divergence (e.g., Kullback-Leibler) between the estimated densities. I will concoct something soon.

@raphg
Copy link
Member

raphg commented Jun 28, 2013

In this case, 1 is the largest distance possible, and it makes sense given
that there is no overlap between any of the distributions.
So it's doing the right thing, that is all three distributions are as far
as they can be.

This being said, I am in favor of doing more comparison. John, thanks for
looking at it.

On Fri, Jun 28, 2013 at 3:35 PM, John Ramey notifications@github.comwrote:

@mikejiang https://github.com/mikejiang This is nice.

While I think the K-S approach deserves some credit here, it will give
some peculiar results in other cases. Consider for example the following
contrived example https://gist.github.com/ramey/5888640.

In this example, we are trying to determine if x1 is nearer to x2 or x3using the K-S criterion. Notice that there is no overlap in the figure of
the empirical CDFs, but in the density plot, there is some overlap. Because
the K-S statistic simply finds the largest difference in empirical CDFs,
samples x2 and x3 are equidistant from x1 because there is no overlap of
the samples. Even if there were, it would be relatively straightforward to
contrive an example where we would choose x3 as often as x2.

With this in mind, I am in favor of computing some divergence (e.g.,
Kullback-Leibler) between the estimated densities. I will concoct something
soon.


Reply to this email directly or view it on GitHubhttps://github.com//issues/10#issuecomment-20218235
.

@mikejiang
Copy link
Member Author

Add an options passed to specify the samples passes QA and can be served as references.
By default, all samples other than failed are used. But sometime it is helpful to narrow it down to a few of really good samples.

@mikejiang mikejiang reopened this Oct 9, 2015
mikejiang pushed a commit that referenced this issue Oct 9, 2015
mikejiang pushed a commit that referenced this issue Apr 7, 2017
@mikejiang
Copy link
Member Author

mikejiang commented Apr 7, 2017

@gfinak , 2d gate imputing is added through emd2d as you suggested, it is very slow though ,which is one of the reason @raphg asked me to switch to ks.test for 1d gate. (See the previous discussion of this thread).
I've added parallel through mc.cores and hopefully it can finish the job in the reasonable time for you.

Here is the example for matching one bad sample against 8 good samples

> system.time(res <- nearestSamples(gs, node = "CD4", failed = "1349_3_Tcell_A06.fcs", gridsize = c(70, 70), mc.cores = 8))
Finding reference sample for: 1349_3_Tcell_A06.fcs
    user   system  elapsed 
1123.817    1.582  160.287 

We can fiddle with gridsize for bkde2D to find the optimal default settings (trade off between speed and accuracy)

mikejiang pushed a commit that referenced this issue Apr 7, 2017
mikejiang pushed a commit that referenced this issue Apr 7, 2017
@mikejiang
Copy link
Member Author

image

> pData(gs)[, 3:4]
                      Sample Replicate
12828_1_Tcell_A01.fcs  12828         1
12828_2_Tcell_A02.fcs  12828         2
12828_3_Tcell_A03.fcs  12828         3
1349_1_Tcell_A04.fcs    1349         1
1349_2_Tcell_A05.fcs    1349         2
1349_3_Tcell_A06.fcs    1349         3
1369_1_Tcell_A07.fcs    1369         1
1369_2_Tcell_A08.fcs    1369         2
1369_3_Tcell_A09.fcs    1369         3

earth mover on 2d data is pretty expensive, which costs more than 2min with parallel computing in 6 cores.

> system.time(res <- nearestSamples(gs, node = node
+                                   , failed = failed
+                                   , passed = passed
+                                   , gridsize = c(50, 50)
+                                    , method = "em"
+                                   , mc.cores = 6
+                                   )
+             )
 user  system elapsed 
657.522  10.993 156.177
> res
  12828_1_Tcell_A01.fcs    1349_1_Tcell_A04.fcs    1369_1_Tcell_A07.fcs 
"12828_2_Tcell_A02.fcs"  "1349_3_Tcell_A06.fcs"  "1369_2_Tcell_A08.fcs" 

As @raphg suggested, we could use ks.test on each dimension and sum the test statistics, here is the result, which finished the job within seconds and matches the correct replicates!

> system.time(res <- nearestSamples(gs, node = node
+                                   , failed = failed
+                                   , passed = passed
+                                   , method = "ks.test"
+                                   , mc.cores = 6
+                                   )
+             )
   user  system elapsed 
  2.391   1.529   1.179 
> res
  12828_1_Tcell_A01.fcs    1349_1_Tcell_A04.fcs    1369_1_Tcell_A07.fcs 
"12828_3_Tcell_A03.fcs"  "1349_2_Tcell_A05.fcs"  "1369_2_Tcell_A08.fcs" 

mikejiang pushed a commit that referenced this issue Apr 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants