Skip to content

As per PR #38 #39

@GregJohnsonJr

Description

@GregJohnsonJr

A few number of things came up as I was going through the vignette. I've fixed a number of things with this PR, but take a look to see what I've done. Most of my code fixes are hacks on the R end. You may want to correct on the RCpp end. The tests are failing and need to be fixed...

  • Change repo name to "clustur" (all lower case)
  • Vignette has get_distance, but I think the function is actually get_distance_data_frame. Can we make it get_distance_df?
  • get_distance doesn't appear to actually be removing values below the cutoff, at least as written in the vignette. Both get_distance_data_frame(column_distance) and get_distance_data_frame(phylip_distance) have values larger than 0.03 and I see -1 values. They both have self comparisons as well as the upper triangle of the matrix (there are 9604 values = 98 * 98).
  • The output of cluster() had "OTU98" in one data frame and "otu98" in another data frame. I hacked a solution in cluster.R to make everything upper case. I think I'd rather it all be in lower case, but regardless of the case, the two data frames need to be the same
  • The label should be its own field in the list coming out of cluster() and doesn't need to be included in the data frames. Again I added a hack to get this how I'd like to see it. There's probably a more elegant way to do this in the Rcpp code. With my hack the hierarchical methods all return a cutoff of 0.00 rather than 0.03. Looking at the cluster_dfs object it appears the first data frame doesn't have the label, but the second does. I'm grabbing it from the first
  • Leave both data frames ordered by the OTU number, not the count. I commented out that code, but it needs to be removed.
  • In cluster_dfs in cluster() the second data frame has a bins column, this should be sequences
  • get_cutoff() should likely be get_label() to be consistent with how we call things in mothur. I changed this in the code/documentation, but it probably broke stuff
  • I changed the other_metrics field in cluster_dfs to be iteration_metrics.
  • I think cluster_metrics and iteration_metrics is giving garbage results when using a cutoff of 0.03
  • Don't need the "This is a column file. Processing now..." type output from read_dist()
  • I feel like cluster() needs a cutoff field if it isn't given in read_distance(). An example would be if someone reads in the full distance matrix, but wants a specific cutoff. This would likely be needed with average neighbor where someone would read in with a cutoff of 0.20, but cluster to 0.03.
  • There isn't any checking to make sure values are valid. E.g., if I put in method = 'neighbor' it doesn't error out or do anything
  • Finishing lingering issues

A more general question is what we should be calling things. We have label and cutoff, abundance and shared, and sequences/bins/otus. I see my own inconsistency in all this :) Now that I think about it, let's leave it as cutoff rather than label. I'm not sure what to do about sequences/bins/otus. For 16S/mothur we would use sequence/otus. For mums2 we'd want features/omus. A more generic scheme could be features/bins. Let me think more on this...

Metadata

Metadata

Assignees

Labels

duplicateThis issue or pull request already exists

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions