Comparing two separate columns #10

thomasdebeus · 2018-05-03T09:08:50Z

Hi Chris,

Great package! I'm a journalist and a lot of my time is spend in merging two data frames on the country, municipality, name or party column. These columns often contain different spellings for the same entity.

Now your package comes in handy, only I haven't figured out yet how to compare two of 'the same' columns and name the strings like I use refinr on a single vector. I'm not that experienced in R so maybe this sounds a little bit vague. Maybe my examples make things a bit clearer.

library(tidyverse)
library(refinr)

# I would like to add the values (and the right name's) of this example df...
df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
  )

# ...to the following df
df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "blu",
  "C", "YellowPurple",
  "C", "green"
  )

# The following code of course produces NA values
df3 <- left_join(df1, df2, by = c("uid", "name"))

# While the following is the desired outcome

# A tibble: 4 x 3
  uid   name             value
  <chr> <chr>          <dbl>
1 A     Red                    13 
2 B     Blue                     3
3 C     yellow-purple  789   
4 C     green                 17

If this is possible, it would safe me so much time!

Thanks in advance.

The text was updated successfully, but these errors were encountered:

ChrisMuir · 2018-05-03T19:43:10Z

Hi @thomasdebeus , glad to hear you've found the package useful!

For your question, yes this is possible to do. Here's an example:

library(tidyverse)
library(refinr)

df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
)

df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "bluee",
  "C", "Yellow Purple",
  "C", "green"
)

# Run the values from both "name" variables through the refinr functions (as a 
# single vector).
names_r <- c(df1$name, df2$name) %>% 
  key_collision_merge(dict = df1$name) %>% 
  n_gram_merge()

# Create vector of labels, then split up "names_r" by those labels, and assign
# the refined values back to their original data frames.
names_labels <- c(rep("df1", nrow(df1)), rep("df2", nrow(df2)))
df1$name <- names_r[names_labels == "df1"]
df2$name <- names_r[names_labels == "df2"]

# Perform join.
left_join(df2, df1, by = c("uid", "name"))
#> # A tibble: 4 x 3
#>   uid   name          value
#>  <chr> <chr>         <dbl>
#> 1 A     Red             13.
#> 2 B     Blue             3.
#> 3 C     yellow-purple  789.
#> 4 C     green           17.

There's probably a cleaner, more elegant way to achieve this using tidyverse functions and the pipe %>% that wouldn't require creating intermediate variables on the way to the solution, but this example should give the basic idea.

I must point out that, for this example, I edited two of the str values in variable name in the input data frames. Here's a quick explanation:

Blue and blu: These are not treated as a cluster and thus won't get merged together. The reason for this is because the approximate string matching used in function n_gram_merge() is only applied to str values that have an identical unigram (unigram values are derived via function get_fingerprint_ngram()). The algorithm only considers a set of strings to be suitable for clustering if their unigram values are identical. Blue and blu return different unigram values, versus the initial strings I used my example (Blue and bluee). Here's a quick demo:

refinr:::get_fingerprint_ngram(c("Blue", "blu"), numgram = 1)
#> "belu" "blu" 
refinr:::get_fingerprint_ngram(c("Blue", "bluee"), numgram = 1)
#> "belu" "belu"

yellow-purple and YellowPurple: These two values would be clustered and merged together by function n_gram_merge(), but NOT by function key_collision_merge(). n_gram_merge() does not feature the dict argument, while key_collision_merge() does (this arg allows the user to dictate merge values for specific clusters). You specified in your example that you wanted the merged values to take the form of the names in df1, however for me just running the names from both df's through n_gram_merge(), the "yellow purple" cluster was taking the form of the string from df2:

df1 <- tribble(
  ~uid, ~name, ~value,
  "A", "Red", 13,
  "A", "violet", 145,
  "B", "Blue", 3,
  "B", "yellow", 56,
  "C", "yellow-purple", 789,
  "C", "green", 17
)

df2 <- tribble(
  ~uid, ~name,
  "A", "red",
  "B", "blu",
  "C", "YellowPurple",
  "C", "green"
)

n_gram_merge(c(df1$name, df2$name))
#> [1] "Red"          "violet"       "Blue"         "yellow"       "YellowPurple" "green"       
#> [7] "Red"          "blu"         "YellowPurple" "green"

Each algorithm has it's quirks and pros/cons, unfortunately they aren't perfect....I've tried to strike a balance between merging true-positives and ignoring false-positives. If you run into an example of specific text combinations that seem to "stump" the functions, I encourage you to open issues for those (something similar to issue #6).

Thanks!

thomasdebeus · 2018-05-07T12:32:24Z

Wow thanks a lot for your thorough reply! Other than the practical code examples, the theory behind this helps me a lot as well. Thanks again!

ChrisMuir · 2018-05-07T13:05:40Z

No problem, happy to help!

thomasdebeus closed this as completed May 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparing two separate columns #10

Comparing two separate columns #10

thomasdebeus commented May 3, 2018 •

edited

Loading

ChrisMuir commented May 3, 2018

thomasdebeus commented May 7, 2018

ChrisMuir commented May 7, 2018

Comparing two separate columns #10

Comparing two separate columns #10

Comments

thomasdebeus commented May 3, 2018 • edited Loading

ChrisMuir commented May 3, 2018

thomasdebeus commented May 7, 2018

ChrisMuir commented May 7, 2018

thomasdebeus commented May 3, 2018 •

edited

Loading