Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detecting anagrams with stringdist #1

Open
markvanderloo opened this issue May 30, 2018 · 3 comments
Open

detecting anagrams with stringdist #1

markvanderloo opened this issue May 30, 2018 · 3 comments

Comments

@markvanderloo
Copy link

Just noticed this little package. Two strings s and t are anagrams when

s != t && stringdist::stringdist(s,t,method='qgram', q=1) == 0
@ChrisMuir
Copy link
Owner

lol never thought I'd get an issue opened on this package....this is cool though, thanks for bring this up!

Yeah so using stringdist::seq_dist, the performance is much better than the base R function I wrote. The "spirit" of the speed tests in the README was to compare the cpp functions to a base R equivalent. But I will probably add the stringdist solution to the README as well, just for completeness.

Here's some metrics (and if I can speed up the stringdist solution more than what I have, please let me know):

library(anagrams)
library(stringdist)
library(microbenchmark)
library(stringi)

# pure R function
r_is_anagram <- function(string, terms) {
  out <- rep(FALSE, length(terms))
  terms_to_insp <- which(nchar(terms) == nchar(string))
  if (length(terms_to_insp) == 0) {
    return(out)
  }
  
  string_spl <- unlist(strsplit(string, "", fixed = TRUE), FALSE, FALSE)
  str_counts <- vapply(string_spl, function(x) sum(string_spl == x), integer(1))
  terms_spl <- strsplit(terms, "", fixed = TRUE)
  
  out[terms_to_insp] <- vapply(terms_spl[terms_to_insp], function(x) {
    anagram <- TRUE
    for (char in string_spl) {
      if (str_counts[char] != sum(x == char)) {
        anagram <- FALSE
        break
      }
    }
    anagram
  }, logical(1), USE.NAMES = FALSE)
  
  out
}


# stringdist function
sd_is_anagram <- function(string, terms) {
  out <- rep(FALSE, length(terms))
  terms_to_insp <- which(nchar(terms) == nchar(string))
  if (length(terms_to_insp) == 0) {
    return(out)
  }
  
  string <- utf8ToInt(string)
  terms[terms_to_insp] <- lapply(terms[terms_to_insp], utf8ToInt)
  out[terms_to_insp] <- seq_dist(string, terms[terms_to_insp], method = "qgram", q = 1) == 0
  
  out
}

# Test to make sure output of all three functions is identical.
identical(
  r_is_anagram("stac", c("cats are great", "tacs", "frogs", "cats", "ts", "stac")), 
  is_anagram("stac", c("cats are great", "tacs", "frogs", "cats", "ts", "stac"))
)
#> [1] TRUE
identical(
  sd_is_anagram("stac", c("cats are great", "tacs", "frogs", "cats", "ts", "stac")), 
  is_anagram("stac", c("cats are great", "tacs", "frogs", "cats", "ts", "stac"))
)
#> [1] TRUE


# Test in which each element is shorter than the input string.
test_vect <- stringi::stri_rand_strings(100000, 3)
microbenchmark(
  rfn = r_is_anagram("cats", test_vect), 
  cpp = is_anagram("cats", test_vect), 
  s_d = sd_is_anagram("cats", test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr       min        lq      mean    median        uq       max neval
#> rfn 20.496519 20.810660 20.974659 20.975246 21.129934 21.348403    10
#> cpp  8.093992  8.312461  8.530418  8.498306  8.546693  9.199534    10
#> s_d 20.572764 20.924660 21.897508 21.164940 21.470101 28.950472    10


# Test in which each element is the same length as the input string.
test_vect <- stringi::stri_rand_strings(100000, 4)
microbenchmark(
  rfn = r_is_anagram("cats", test_vect), 
  cpp = is_anagram("cats", test_vect), 
  s_d = sd_is_anagram("Cats", test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr        min         lq       mean     median         uq       max neval
#> rfn 214.011869 218.977646 235.162696 222.838795 231.949439 328.99156    10
#> cpp   9.301804   9.519174   9.785149   9.709601   9.933386  10.66834    10
#> s_d 102.262294 110.689488 114.720318 111.783666 114.861296 133.32195    10


# Test in which each element is an anagram of the input string.
test_vect <- rep("tacs", 100000)
microbenchmark(
  rfn = r_is_anagram("cats", test_vect), 
  cpp = is_anagram("cats", test_vect), 
  s_d = sd_is_anagram("cats", test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr       min        lq      mean    median        uq      max neval
#> rfn 465.93072 497.54204 502.21724 501.01738 514.15890 532.3813    10
#> cpp  12.55538  12.81344  12.90141  12.93166  13.06087  13.0759    10
#> s_d  95.04035  98.97059 109.39026 100.10986 102.71389 191.7881    10


# Test in which each element is a string with length between two and six chars.
test_vect <- stringi::stri_rand_strings(100000, 2:6)
microbenchmark(
  rfn = r_is_anagram("cats", test_vect), 
  cpp = is_anagram("cats", test_vect), 
  s_d = sd_is_anagram("cats", test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr       min        lq      mean    median        uq        max neval
#> rfn 72.933533 74.797852 80.432746 78.241309 81.078110 101.496918    10
#> cpp  8.214224  8.299998  8.488043  8.493175  8.583715   8.991696    10
#> s_d 39.914255 40.280815 43.870087 41.912185 47.424685  50.398578    10


# Test in which each element is a long string (nchar == 1000).
test_str <- stringi::stri_rand_strings(1, 1000)
test_vect <- stringi::stri_rand_strings(100000, 1000)
microbenchmark(
  rfn = r_is_anagram(test_str, test_vect), 
  cpp = is_anagram(test_str, test_vect), 
  s_d = sd_is_anagram(test_str, test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr        min         lq       mean     median        uq        max neval
#> rfn 3856.02951 3876.51687 4055.42689 3959.61362 4078.4752 4802.05662    10
#> cpp   81.71299   83.41126   83.89886   84.12348   84.3973   85.17294    10
#> s_d 1938.79379 1998.42673 2177.81082 2145.90702 2180.8416 2918.15138    10

@markvanderloo
Copy link
Author

markvanderloo commented May 30, 2018

This is fun :)

I did some tests with this one:

qg_is_anagram <- function(term, strings){
  stringdist(term, strings, method="qgram", q=1) == 0
}

I have to leave so I don't have time to post the results but its really insightful to see that pure, dedicated cpp is usually a factor of 10 faster. qgrams comes in second or third. usually.

Stringdist does the conversion from UTF8 to integer for you under the hood, in parallel.

@ChrisMuir
Copy link
Owner

Oh, nice! For some reason I was thinking I had to use one of the stringdist::seq_ functions to get the vectorization I wanted, didn't realize stringdist::stringdist() will recycle args like that. Yeah using your function gives a nice speed-up versus what I was using.

Yeah I initially started working on this just to see how much faster than base R I could get using cpp functions. My cpp code could probably be improved/optimized further, I'm no expert and I didn't spend a ton of time working on this.

And in case you didn't see the link in the README, Romain Francois has an anagram package on GH. His attempts to fetch anagrams of an input str from a dictionary of terms, so not quite the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants