Error in knn(train = pcaND[rownames(eigenvect)[-1 * nrow(eigenvect)]: no missing values are allowed #179

mdozmorov · 2022-10-28T18:03:13Z

I was able to run all steps except the last, Step 4 - Run the ancestry inference on the external study. The error is in this function call:

    resCall <- computeAncestryFromSyntheticFile(gds = gds, 
                                    gdsSample = gdsSample,
                                    listFiles = listFiles,
                                    sample.ana.id = listSamples[i],
                                    spRef = spRef,
                                    study.id.syn = studyDF.syn$study.id,
                                    np = 4L)

I tried to debug the source code - the error is generated by the following:

    listKNNSample <- computeKNNRefSample(listEigenvector = listPCASample, 
                                         listCatPop = listCatPop, spRef = spRef, fieldPopInfAnc = fieldPopInfAnc, 
                                         kList = kList, pcaList = pcaList)

I tried to go deeper, but I keep running into questions, such as which package the knn function comes from.

The strange thing is that I was able to successfully run the code on two samples. But then, this error started to occur. I'm running the same samples. I took the original vignette and again adjusted the code for my samples, to minimize the chance I modified something incorrectly. The error persists. Any suggestions?

    Error in knn(train = pcaND[rownames(eigenvect)[-1 * nrow(eigenvect)],  : 
      no missing values are allowed


adeschen · 2022-10-28T20:33:55Z

Hi Mikhail,

The knn function is part of the class package.
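One way to confirm where a function comes from in your own session is sketched below. It assumes only that the class package is installed; nothing here is specific to this package's code.

```r
library(class)

# Name of the namespace in which the function `knn` is defined
knnNamespace <- environmentName(environment(knn))
knnNamespace

# Entries on the search path that provide an object named "knn";
# more than one entry would mean another package masks it
knnProviders <- find("knn")
knnProviders
```

Calling the function as `class::knn(...)` removes any ambiguity if another attached package also exports a `knn`.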

The computeKNNRefSample function is relatively simple. It loops over all K (number of neighbors) and D (number of PCA dimensions) values and runs KNN for each pair to extract the resulting ancestry.

I can see a problem with listCatPop and the redundant variable listSuperPop inside the code.

    computeKNNRefSample <- function(listEigenvector,
                            listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"),
                            spRef, fieldPopInfAnc="SuperPop",
                            kList=seq(2, 15, 1), pcaList=seq(2, 15, 1)) {

        if(is.null(kList)){
            kList <- seq_len(15)#c(seq_len(14), seq(15,100, by=5))
        }
        if(is.null(pcaList)){
            pcaList <- 2:15
        }
        if(length(listEigenvector$sample.id) != 1) {
            stop("Number of sample in study.annot not equal to 1\n")
        }

        resMat <- data.frame(sample.id=rep(listEigenvector$sample.id,
                                        length(pcaList) * length(kList)),
                            D=rep(0,length(pcaList) * length(kList)),
                            K=rep(0,length(pcaList) * length(kList)),
                            # SuperPop=character(length(pcaList) * length(kList)),
                            stringsAsFactors=FALSE)
        resMat[[fieldPopInfAnc]] <- character(length(pcaList) * length(kList))

        listSuperPop <- c("EAS", "EUR", "AFR", "AMR", "SAS")

        #curPCA <- listPCA.Samples[[sample.id[sample.pos]]]
        eigenvect <- rbind(listEigenvector$eigenvector.ref,
                            listEigenvector$eigenvector)

        # rownames(eigenvect) <- c(sample.ref,
        #                          listEigenvector$sample.id)

        totR <- 1
        for(pcaD in pcaList) {
            for(kV in seq_len(length(kList))) {
                dCur <- paste0("d", pcaD)
                kCur <- paste0("k", kList[kV])
                resMat[totR,c("D", "K")] <- c(pcaD, kList[kV])

                pcaND <- eigenvect[ ,seq_len(pcaD)]
                y_pred <- knn(train=pcaND[rownames(eigenvect)[-1*nrow(eigenvect)],],
                        test=pcaND[rownames(eigenvect)[nrow(eigenvect)],,
                                    drop=FALSE],
                        cl=factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
                                        levels=listCatPop, labels=listCatPop),
                        k=kList[kV],
                        prob=FALSE)

                resMat[totR, fieldPopInfAnc] <- listSuperPop[as.integer(y_pred)]

                totR <- totR + 1
            } # end k
        } # end pca Dim

        listKNN <- list(sample.id=listEigenvector$sample.id, matKNN=resMat)

        return(listKNN)
    }

mdozmorov · 2022-10-28T23:47:33Z

Thanks, Astrid. I follow the explanation, but even after sourcing your code the same error remains. As expected, it is raised by the knn call:

    y_pred <- knn(train=pcaND[rownames(eigenvect)[-1*nrow(eigenvect)],],
                  test=pcaND[rownames(eigenvect)[nrow(eigenvect)],,
                             drop=FALSE],
                  cl=factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
                            levels=listCatPop, labels=listCatPop),
                  k=kList[kV],
                  prob=FALSE)

All values passed to the arguments appear to be complete, e.g.,

    > nrow(pcaND)
    [1] 3250
    > nrow(pcaND[complete.cases(pcaND),])
    [1] 3250

I am using the knn function from the class package, latest 7.2-20 version. Do you have any other suggestions?

adeschen · 2022-10-29T00:00:48Z

Hi Mikhail,

Do you have any NA values in the train or cl parameters (or maybe test)? That would cause this message.
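A minimal sketch of that behaviour, on made-up toy data rather than the package's objects: knn from class rejects a missing value in any of train, test or cl, so complete matrices are not enough if a class label is NA.

```r
library(class)

set.seed(1)
# Toy data (hypothetical): 6 reference points in 2 dimensions, 1 point to classify
train <- matrix(rnorm(12), ncol = 2)
test  <- matrix(rnorm(2), ncol = 2)

# Complete labels: knn runs and returns one prediction
clOk <- factor(c("EUR", "EUR", "EUR", "AFR", "AFR", "AFR"))
predOk <- knn(train = train, test = test, cl = clOk, k = 3)

# One missing label: knn stops even though train and test are complete
clNA <- clOk
clNA[6] <- NA
msg <- tryCatch(
    knn(train = train, test = test, cl = clNA, k = 3),
    error = function(e) conditionMessage(e)
)
msg

# Check all three arguments in one go
naCheck <- vapply(list(train = train, test = test, cl = clNA), anyNA, logical(1))
naCheck
```

Running the same `anyNA` check on the actual train, test and cl values inside the loop should show which of the three carries the missing values.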

mdozmorov · 2022-10-29T00:37:43Z

Both train and test have complete values. They are just subsets of pcaND, which is complete. I also tested different values of pcaD and kV in the whole loop - same error but the values are, again, complete. I'll try debugging knn, but never imagined it would be so difficult.

adeschen · 2022-10-29T00:55:05Z

What about the cl parameter?

mdozmorov · 2022-10-29T00:59:16Z

This is indeed the cause:

    > sum(is.na(factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
    ...                  levels=listCatPop, labels=listCatPop)))
    [1] 780

It looks like these NAs are at the end of the factor, which has length 3249. Trying to understand why.
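For reference, a small sketch of how NAs can end up in such a factor and how to locate them. The `labels` vector is invented for illustration; only `listCatPop` is taken from the thread.

```r
listCatPop <- c("EAS", "EUR", "AFR", "AMR", "SAS")

# Hypothetical labels: two are missing and one ("XYZ") is not a listed level
labels <- c("EUR", "AFR", NA, "XYZ", "SAS", NA)

# Missing inputs stay NA, and values outside `levels` also become NA
cl <- factor(labels, levels = listCatPop, labels = listCatPop)

# Positions of the NA entries in the factor
naPos <- which(is.na(cl))
naPos

# Run-length view: shows whether the NAs form one contiguous block
rle(is.na(cl))
```

Applying `which(is.na(...))` to the real factor and using the positions to index the corresponding row names would give the sample IDs that have no label.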

adeschen · 2022-10-29T01:11:26Z

What does spRef[rownames(eigenvect)[-1*nrow(eigenvect)]] look like? It should contain only values from listCatPop.

mdozmorov · 2022-10-29T01:24:14Z

It has the expected c("EAS", "EUR", "AFR", "AMR", "SAS"), but also NAs:

    > table(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]], useNA = "always")

     AFR  AMR  EAS  EUR  SAS <NA> 
     627  344  507  512  479  780

adeschen · 2022-10-29T01:27:02Z

780 corresponds to the number of synthetic samples generated in the previous steps. It looks as if those were not stored correctly in the Sample GDS, or are not being retrieved correctly.
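A sketch of one way to list which training samples have no label, assuming spRef is a character vector named by sample ID (as the indexing in this thread suggests). The IDs and the `checkLabels` helper below are hypothetical and not part of the package.

```r
# Hypothetical stand-ins: three reference samples with labels, plus two
# synthetic sample IDs that have no entry in spRef
spRef <- c(ref1 = "EUR", ref2 = "AFR", ref3 = "EAS")
ids <- c("ref1", "ref2", "ref3", "syn1", "syn2")

# Indexing a named vector by a name it does not contain returns NA
lab <- spRef[ids]
table(lab, useNA = "always")

# IDs used for training that have no label
missingIds <- setdiff(ids, names(spRef))
missingIds

# Hypothetical guard that could run before knn to fail with a clearer message
checkLabels <- function(ids, spRef) {
    missingIds <- setdiff(ids, names(spRef))
    if (length(missingIds) > 0) {
        stop(length(missingIds), " training sample(s) have no population ",
             "label, e.g. ", missingIds[1])
    }
    invisible(TRUE)
}
```

With the real objects, `setdiff(rownames(eigenvect)[-nrow(eigenvect)], names(spRef))` would list the unlabelled IDs, which can then be compared against the synthetic sample IDs.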

mdozmorov · 2022-10-29T01:29:47Z

Will keep debugging the previous steps. Thank you, Astrid.

adeschen · 2022-10-29T01:41:15Z

Hopefully, we will have more time to work on the package after final submission of the paper (probably beginning of next week).
