Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge values not taking most frequent string #5

Closed
ChrisMuir opened this issue Mar 9, 2018 · 1 comment
Closed

Merge values not taking most frequent string #5

ChrisMuir opened this issue Mar 9, 2018 · 1 comment

Comments

@ChrisMuir
Copy link
Owner

Seeing a bug in which the edit value assigned to a cluster is not the most frequent string in that cluster. Example:

refinr::key_collision_merge(c("cat bike", "bike cat", "bike cat"))
#> "cat bike" "cat bike" "cat bike"

I think the issue is within this line from file key_collision_merge_funcs.cpp:

// Get the string that appears most often in curr_vect.
String most_freq_string = curr_vect[which_max(table(curr_vect))];

I think the solution is to apply .sort() to curr_vect prior to calling table on it. Need to do some more testing.

@ChrisMuir
Copy link
Owner Author

ChrisMuir commented Mar 9, 2018

Solution was to add this function to utils.cpp:

// Given a CharacterVector, return the string that appears most frequently.
// Ties are determined by the string that appears first alphabetically.
// [[Rcpp::export]]
String most_freq_str(CharacterVector x) {
  IntegerVector x_tab = table(x);
  CharacterVector tab_names = x_tab.attr("names");
  return(tab_names[which_max(x_tab)]);
}

The new function gets sourced during both key_collision_merge and n_gram_merge.

Going back to the example:

refinr::key_collision_merge(c("cat bike", "bike cat", "bike cat"))
#> "bike cat" "bike cat" "bike cat"

ChrisMuir added a commit that referenced this issue Mar 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant