detecting anagrams with stringdist #1

markvanderloo · 2018-05-30T12:40:27Z

Just noticed this little package. Two strings s and t are anagrams when

s != t && stringdist::stringdist(s,t,method='qgram', q=1) == 0

The text was updated successfully, but these errors were encountered:

ChrisMuir · 2018-05-30T14:18:47Z

lol never thought I'd get an issue opened on this package....this is cool though, thanks for bring this up!

Yeah so using stringdist::seq_dist, the performance is much better than the base R function I wrote. The "spirit" of the speed tests in the README was to compare the cpp functions to a base R equivalent. But I will probably add the stringdist solution to the README as well, just for completeness.

Here's some metrics (and if I can speed up the stringdist solution more than what I have, please let me know):

library(anagrams)
library(stringdist)
library(microbenchmark)
library(stringi)

# pure R function
r_is_anagram <- function(string, terms) {
  out <- rep(FALSE, length(terms))
  terms_to_insp <- which(nchar(terms) == nchar(string))
  if (length(terms_to_insp) == 0) {
    return(out)
  }
  
  string_spl <- unlist(strsplit(string, "", fixed = TRUE), FALSE, FALSE)
  str_counts <- vapply(string_spl, function(x) sum(string_spl == x), integer(1))
  terms_spl <- strsplit(terms, "", fixed = TRUE)
  
  out[terms_to_insp] <- vapply(terms_spl[terms_to_insp], function(x) {
    anagram <- TRUE
    for (char in string_spl) {
      if (str_counts[char] != sum(x == char)) {
        anagram <- FALSE
        break
      }
    }
    anagram
  }, logical(1), USE.NAMES = FALSE)
  
  out
}


# stringdist function
sd_is_anagram <- function(string, terms) {
  out <- rep(FALSE, length(terms))
  terms_to_insp <- which(nchar(terms) == nchar(string))
  if (length(terms_to_insp) == 0) {
    return(out)
  }
  
  string <- utf8ToInt(string)
  terms[terms_to_insp] <- lapply(terms[terms_to_insp], utf8ToInt)
  out[terms_to_insp] <- seq_dist(string, terms[terms_to_insp], method = "qgram", q = 1) == 0
  
  out
}

# Test to make sure output of all three functions is identical.
identical(
  r_is_anagram("stac", c("cats are great", "tacs", "frogs", "cats", "ts", "stac")), 
  is_anagram("stac", c("cats are great", "tacs", "frogs", "cats", "ts", "stac"))
)
#> [1] TRUE
identical(
  sd_is_anagram("stac", c("cats are great", "tacs", "frogs", "cats", "ts", "stac")), 
  is_anagram("stac", c("cats are great", "tacs", "frogs", "cats", "ts", "stac"))
)
#> [1] TRUE


# Test in which each element is shorter than the input string.
test_vect <- stringi::stri_rand_strings(100000, 3)
microbenchmark(
  rfn = r_is_anagram("cats", test_vect), 
  cpp = is_anagram("cats", test_vect), 
  s_d = sd_is_anagram("cats", test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr       min        lq      mean    median        uq       max neval
#> rfn 20.496519 20.810660 20.974659 20.975246 21.129934 21.348403    10
#> cpp  8.093992  8.312461  8.530418  8.498306  8.546693  9.199534    10
#> s_d 20.572764 20.924660 21.897508 21.164940 21.470101 28.950472    10


# Test in which each element is the same length as the input string.
test_vect <- stringi::stri_rand_strings(100000, 4)
microbenchmark(
  rfn = r_is_anagram("cats", test_vect), 
  cpp = is_anagram("cats", test_vect), 
  s_d = sd_is_anagram("Cats", test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr        min         lq       mean     median         uq       max neval
#> rfn 214.011869 218.977646 235.162696 222.838795 231.949439 328.99156    10
#> cpp   9.301804   9.519174   9.785149   9.709601   9.933386  10.66834    10
#> s_d 102.262294 110.689488 114.720318 111.783666 114.861296 133.32195    10


# Test in which each element is an anagram of the input string.
test_vect <- rep("tacs", 100000)
microbenchmark(
  rfn = r_is_anagram("cats", test_vect), 
  cpp = is_anagram("cats", test_vect), 
  s_d = sd_is_anagram("cats", test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr       min        lq      mean    median        uq      max neval
#> rfn 465.93072 497.54204 502.21724 501.01738 514.15890 532.3813    10
#> cpp  12.55538  12.81344  12.90141  12.93166  13.06087  13.0759    10
#> s_d  95.04035  98.97059 109.39026 100.10986 102.71389 191.7881    10


# Test in which each element is a string with length between two and six chars.
test_vect <- stringi::stri_rand_strings(100000, 2:6)
microbenchmark(
  rfn = r_is_anagram("cats", test_vect), 
  cpp = is_anagram("cats", test_vect), 
  s_d = sd_is_anagram("cats", test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr       min        lq      mean    median        uq        max neval
#> rfn 72.933533 74.797852 80.432746 78.241309 81.078110 101.496918    10
#> cpp  8.214224  8.299998  8.488043  8.493175  8.583715   8.991696    10
#> s_d 39.914255 40.280815 43.870087 41.912185 47.424685  50.398578    10


# Test in which each element is a long string (nchar == 1000).
test_str <- stringi::stri_rand_strings(1, 1000)
test_vect <- stringi::stri_rand_strings(100000, 1000)
microbenchmark(
  rfn = r_is_anagram(test_str, test_vect), 
  cpp = is_anagram(test_str, test_vect), 
  s_d = sd_is_anagram(test_str, test_vect), 
  times = 10
)
#> Unit: milliseconds
#> expr        min         lq       mean     median        uq        max neval
#> rfn 3856.02951 3876.51687 4055.42689 3959.61362 4078.4752 4802.05662    10
#> cpp   81.71299   83.41126   83.89886   84.12348   84.3973   85.17294    10
#> s_d 1938.79379 1998.42673 2177.81082 2145.90702 2180.8416 2918.15138    10

markvanderloo · 2018-05-30T15:33:33Z

This is fun :)

I did some tests with this one:

qg_is_anagram <- function(term, strings){
  stringdist(term, strings, method="qgram", q=1) == 0
}

I have to leave so I don't have time to post the results but its really insightful to see that pure, dedicated cpp is usually a factor of 10 faster. qgrams comes in second or third. usually.

Stringdist does the conversion from UTF8 to integer for you under the hood, in parallel.

ChrisMuir · 2018-05-30T19:39:39Z

Oh, nice! For some reason I was thinking I had to use one of the stringdist::seq_ functions to get the vectorization I wanted, didn't realize stringdist::stringdist() will recycle args like that. Yeah using your function gives a nice speed-up versus what I was using.

Yeah I initially started working on this just to see how much faster than base R I could get using cpp functions. My cpp code could probably be improved/optimized further, I'm no expert and I didn't spend a ton of time working on this.

And in case you didn't see the link in the README, Romain Francois has an anagram package on GH. His attempts to fetch anagrams of an input str from a dictionary of terms, so not quite the same.

ChrisMuir added a commit that referenced this issue May 31, 2018

add stringdist func to benchmarks, issue #1

8d3b912

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

detecting anagrams with stringdist #1

detecting anagrams with stringdist #1

markvanderloo commented May 30, 2018

ChrisMuir commented May 30, 2018

markvanderloo commented May 30, 2018 •

edited

Loading

ChrisMuir commented May 30, 2018

detecting anagrams with stringdist #1

detecting anagrams with stringdist #1

Comments

markvanderloo commented May 30, 2018

ChrisMuir commented May 30, 2018

markvanderloo commented May 30, 2018 • edited Loading

ChrisMuir commented May 30, 2018

markvanderloo commented May 30, 2018 •

edited

Loading