Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in knn(train = pcaND[rownames(eigenvect)[-1 * nrow(eigenvect)]: no missing values are allowed #179

Open
mdozmorov opened this issue Oct 28, 2022 · 11 comments

Comments

@mdozmorov
Copy link

I was able to run all steps except the last, Step 4 - Run the ancestry inference on the external study. The error is in this function call:

    resCall <- computeAncestryFromSyntheticFile(gds = gds, 
                                    gdsSample = gdsSample,
                                    listFiles = listFiles,
                                    sample.ana.id = listSamples[i],
                                    spRef = spRef,
                                    study.id.syn = studyDF.syn$study.id,
                                    np = 4L)

I tried to debug the source code - the error is generated by the following:

    listKNNSample <- computeKNNRefSample(listEigenvector = listPCASample, 
                                         listCatPop = listCatPop, spRef = spRef, fieldPopInfAnc = fieldPopInfAnc, 
                                         kList = kList, pcaList = pcaList)

I tried to go deeper, but running into questions like which package the knn function is coming from, etc.

The strange thing is that I was able to successfully run the code on two samples. But then, this error started to occur. I'm running the same samples. I took the original vignette and again adjusted the code for my samples, to minimize the chance I modified something incorrectly. The error persists. Any suggestions?

Error in knn(train = pcaND[rownames(eigenvect)[-1 * nrow(eigenvect)],  : 
  no missing values are allowed
@adeschen
Copy link
Member

Hi Mikhail,

The knn function is part of class library.

The computeKNNRefSample is relatively simple. It loops on all K and D values to extract the resulting ancestry using KNN.

I can see a problem with listCatPop and the redundant variable listSuperPop inside the code.

  computeKNNRefSample <- function(listEigenvector,
                        listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"),
                        spRef, fieldPopInfAnc="SuperPop",
                        kList=seq(2, 15, 1), pcaList=seq(2, 15, 1)) {

  if(is.null(kList)){
      kList <- seq_len(15)#c(seq_len(14), seq(15,100, by=5))
  }
  if(is.null(pcaList)){
      pcaList <- 2:15
  }
  if(length(listEigenvector$sample.id) != 1) {
      stop("Number of sample in study.annot not equal to 1\n")
  }

  resMat <- data.frame(sample.id=rep(listEigenvector$sample.id,
                                    length(pcaList) * length(kList)),
                        D=rep(0,length(pcaList) * length(kList)),
                        K=rep(0,length(pcaList) * length(kList)),
                    # SuperPop=character(length(pcaList) * length(kList)),
                        stringsAsFactors=FALSE)
   resMat[[fieldPopInfAnc]] <- character(length(pcaList) * length(kList))

   listSuperPop <- c("EAS", "EUR", "AFR", "AMR", "SAS")

  #curPCA <- listPCA.Samples[[sample.id[sample.pos]]]
   eigenvect <- rbind(listEigenvector$eigenvector.ref,
                        listEigenvector$eigenvector)

   # rownames(eigenvect) <- c(sample.ref,
   #                          listEigenvector$sample.id)

  totR <- 1
  for(pcaD in pcaList) {
      for(kV in  seq_len(length(kList))) {
        dCur <- paste0("d", pcaD)
        kCur <- paste0("k", kList[kV])
        resMat[totR,c("D", "K")] <- c(pcaD, kList[kV])

        pcaND <- eigenvect[ ,seq_len(pcaD)]
        y_pred <- knn(train=pcaND[rownames(eigenvect)[-1*nrow(eigenvect)],],
                test=pcaND[rownames(eigenvect)[nrow(eigenvect)],,
                            drop=FALSE],
                cl=factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
                                levels=listCatPop, labels=listCatPop),
                k=kList[kV],
                prob=FALSE)

        resMat[totR, fieldPopInfAnc] <- listSuperPop[as.integer(y_pred)]

        totR <- totR + 1
      } # end k
  } # end pca Dim

  listKNN <- list(sample.id=listEigenvector$sample.id, matKNN=resMat)

  return(listKNN)
}

@mdozmorov
Copy link
Author

Thanks, Astrid. I follow the explanation, but even with sourcing your code the same error remains. It is expectedly in the knn function:

y_pred <- knn(train=pcaND[rownames(eigenvect)[-1*nrow(eigenvect)],],
              test=pcaND[rownames(eigenvect)[nrow(eigenvect)],,
                         drop=FALSE],
              cl=factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
                        levels=listCatPop, labels=listCatPop),
              k=kList[kV],
              prob=FALSE)

All values to the arguments are complete, e.g.,

> nrow(pcaND)
[1] 3250
> nrow(pcaND[complete.cases(pcaND),])
[1] 3250

The knn function is used from the class package, latest 7.2-20 version. Do you have any other suggestions?

@adeschen
Copy link
Member

Hi Mikhail,

Do you have any NA value in the train or cl parameters (maybe test)? It would cause this message.

@mdozmorov
Copy link
Author

Both train and test have complete values. They are just subsets of pcaND, which is complete. I also tested different values of pcaD and kV in the whole loop - same error but the values are, again, complete. I'll try debugging knn, but never imagined it would be so difficult.

@adeschen
Copy link
Member

What about cl parameter ?

@mdozmorov
Copy link
Author

This is indeed the cause:

> sum(is.na(factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
...                  levels=listCatPop, labels=listCatPop)))
[1] 780

Looks like these NAs are at the end of 3249-long factor. Trying to understand why..

@adeschen
Copy link
Member

What spRef[rownames(eigenvect)[-1*nrow(eigenvect)]] looks like? It should be values from listCapPop .

@mdozmorov
Copy link
Author

It has the expected c("EAS", "EUR", "AFR", "AMR", "SAS"), but also NAs

> table(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]], useNA = "always")

 AFR  AMR  EAS  EUR  SAS <NA> 
 627  344  507  512  479  780 

@adeschen
Copy link
Member

780 corresponds to the number of synthetic samples that are generated in the previous steps. It is like those have not been stored correctly in the Sample GDS or are not retrieved correctly.

@mdozmorov
Copy link
Author

Will keep debugging the previous steps. Thank you, Astrid.

@adeschen
Copy link
Member

Hopefully, we will have more time to work on the package after final submission of the paper (probably beginning of next week).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants