You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A few number of things came up as I was going through the vignette. I've fixed a number of things with this PR, but take a look to see what I've done. Most of my code fixes are hacks on the R end. You may want to correct on the RCpp end. The tests are failing and need to be fixed...
Change repo name to "clustur" (all lower case)
Vignette has get_distance, but I think the function is actually get_distance_data_frame. Can we make it get_distance_df?
get_distance doesn't appear to actually be removing values below the cutoff, at least as written in the vignette. Both get_distance_data_frame(column_distance) and get_distance_data_frame(phylip_distance) have values larger than 0.03 and I see -1 values. They both have self comparisons as well as the upper triangle of the matrix (there are 9604 values = 98 * 98).
The output of cluster() had "OTU98" in one data frame and "otu98" in another data frame. I hacked a solution in cluster.R to make everything upper case. I think I'd rather it all be in lower case, but regardless of the case, the two data frames need to be the same
The label should be its own field in the list coming out of cluster() and doesn't need to be included in the data frames. Again I added a hack to get this how I'd like to see it. There's probably a more elegant way to do this in the Rcpp code. With my hack the hierarchical methods all return a cutoff of 0.00 rather than 0.03. Looking at the cluster_dfs object it appears the first data frame doesn't have the label, but the second does. I'm grabbing it from the first
Leave both data frames ordered by the OTU number, not the count. I commented out that code, but it needs to be removed.
In cluster_dfs in cluster() the second data frame has a bins column, this should be sequences
get_cutoff() should likely be get_label() to be consistent with how we call things in mothur. I changed this in the code/documentation, but it probably broke stuff
I changed the other_metrics field in cluster_dfs to be iteration_metrics.
I think cluster_metrics and iteration_metrics is giving garbage results when using a cutoff of 0.03
Don't need the "This is a column file. Processing now..." type output from read_dist()
I feel like cluster() needs a cutoff field if it isn't given in read_distance(). An example would be if someone reads in the full distance matrix, but wants a specific cutoff. This would likely be needed with average neighbor where someone would read in with a cutoff of 0.20, but cluster to 0.03.
There isn't any checking to make sure values are valid. E.g., if I put in method = 'neighbor' it doesn't error out or do anything
Finishing lingering issues
A more general question is what we should be calling things. We have label and cutoff, abundance and shared, and sequences/bins/otus. I see my own inconsistency in all this :) Now that I think about it, let's leave it as cutoff rather than label. I'm not sure what to do about sequences/bins/otus. For 16S/mothur we would use sequence/otus. For mums2 we'd want features/omus. A more generic scheme could be features/bins. Let me think more on this...
A few number of things came up as I was going through the vignette. I've fixed a number of things with this PR, but take a look to see what I've done. Most of my code fixes are hacks on the R end. You may want to correct on the RCpp end. The tests are failing and need to be fixed...
get_distance, but I think the function is actuallyget_distance_data_frame. Can we make itget_distance_df?get_distancedoesn't appear to actually be removing values below the cutoff, at least as written in the vignette. Bothget_distance_data_frame(column_distance)andget_distance_data_frame(phylip_distance)have values larger than 0.03 and I see -1 values. They both have self comparisons as well as the upper triangle of the matrix (there are 9604 values = 98 * 98).cluster()had "OTU98" in one data frame and "otu98" in another data frame. I hacked a solution incluster.Rto make everything upper case. I think I'd rather it all be in lower case, but regardless of the case, the two data frames need to be the samecluster()and doesn't need to be included in the data frames. Again I added a hack to get this how I'd like to see it. There's probably a more elegant way to do this in the Rcpp code. With my hack the hierarchical methods all return a cutoff of 0.00 rather than 0.03. Looking at thecluster_dfsobject it appears the first data frame doesn't have the label, but the second does. I'm grabbing it from the firstcluster_dfsincluster()the second data frame has abinscolumn, this should besequencesget_cutoff()should likely beget_label()to be consistent with how we call things in mothur. I changed this in the code/documentation, but it probably broke stuffother_metricsfield incluster_dfsto beiteration_metrics.cluster_metricsanditeration_metricsis giving garbage results when using a cutoff of 0.03read_dist()cluster()needs a cutoff field if it isn't given inread_distance(). An example would be if someone reads in the full distance matrix, but wants a specific cutoff. This would likely be needed with average neighbor where someone would read in with a cutoff of 0.20, but cluster to 0.03.method = 'neighbor'it doesn't error out or do anythingA more general question is what we should be calling things. We have label and cutoff, abundance and shared, and sequences/bins/otus. I see my own inconsistency in all this :) Now that I think about it, let's leave it as cutoff rather than label. I'm not sure what to do about sequences/bins/otus. For 16S/mothur we would use sequence/otus. For mums2 we'd want features/omus. A more generic scheme could be features/bins. Let me think more on this...