As per PR #38

A few number of things came up as I was going through the vignette. I've fixed a number of things with this PR, but take a look to see what I've done. Most of my code fixes are hacks on the R end. You may want to correct on the RCpp end. The tests are failing and need to be fixed...

* [x] Change repo name to "clustur" (all lower case)
* [x] Vignette has `get_distance`, but I think the function is actually `get_distance_data_frame`. Can we make it `get_distance_df`?
* [x] `get_distance` doesn't appear to actually be removing values below the cutoff, at least as written in the vignette. Both `get_distance_data_frame(column_distance)` and `get_distance_data_frame(phylip_distance)` have values larger than 0.03 and I see -1 values. They both have self comparisons as well as the upper triangle of the matrix (there are 9604 values = 98 * 98). 
* [x] The output of `cluster()` had "OTU98" in one data frame and "otu98" in another data frame. I hacked a solution in `cluster.R` to make everything upper case. I think I'd rather it all be in lower case, but regardless of the case, the two data frames need to be the same
* [x] The label should be its own field in the list coming out of `cluster()` and doesn't need to be included in the data frames. Again I added a hack to get this how I'd like to see it. There's probably a more elegant way to do this in the Rcpp code. With my hack the hierarchical methods all return a cutoff of 0.00 rather than 0.03. Looking at the `cluster_dfs` object it appears the first data frame doesn't have the label, but the second does. I'm grabbing it from the first
* [x] Leave both data frames ordered by the OTU number, not the count. I commented out that code, but it needs to be removed.
* [x] In `cluster_dfs` in `cluster()` the second data frame has a `bins` column, this should be `sequences`
* [x] `get_cutoff()` should likely be `get_label()` to be consistent with how we call things in mothur. I changed this in the code/documentation, but it probably broke stuff
* [x] I changed the `other_metrics` field in `cluster_dfs` to be `iteration_metrics`. 
* [x] I think `cluster_metrics` and `iteration_metrics` is giving garbage results when using a cutoff of 0.03
* [x] Don't need the "This is a column file. Processing now..." type output from `read_dist()`
* [x] I feel like `cluster()` needs a cutoff field if it isn't given in `read_distance()`. An example would be if someone reads in the full distance matrix, but wants a specific cutoff. This would likely be needed with average neighbor where someone would read in with a cutoff of 0.20, but cluster to 0.03.
* [x] There isn't any checking to make sure values are valid. E.g., if I put in `method = 'neighbor'` it doesn't error out or do anything
* [x] Finishing lingering issues


A more general question is what we should be calling things. We have label and cutoff, abundance and shared, and sequences/bins/otus. I see my own inconsistency in all this :) Now that I think about it, let's leave it as cutoff rather than label. I'm not sure what to do about sequences/bins/otus. For 16S/mothur we would use sequence/otus. For mums2 we'd want features/omus. A more generic scheme could be features/bins. Let me think more on this...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

As per PR #38 #39

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

As per PR #38 #39

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions