-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comparing two separate columns #10
Comments
Hi @thomasdebeus , glad to hear you've found the package useful! For your question, yes this is possible to do. Here's an example: library(tidyverse)
library(refinr)
df1 <- tribble(
~uid, ~name, ~value,
"A", "Red", 13,
"A", "violet", 145,
"B", "Blue", 3,
"B", "yellow", 56,
"C", "yellow-purple", 789,
"C", "green", 17
)
df2 <- tribble(
~uid, ~name,
"A", "red",
"B", "bluee",
"C", "Yellow Purple",
"C", "green"
)
# Run the values from both "name" variables through the refinr functions (as a
# single vector).
names_r <- c(df1$name, df2$name) %>%
key_collision_merge(dict = df1$name) %>%
n_gram_merge()
# Create vector of labels, then split up "names_r" by those labels, and assign
# the refined values back to their original data frames.
names_labels <- c(rep("df1", nrow(df1)), rep("df2", nrow(df2)))
df1$name <- names_r[names_labels == "df1"]
df2$name <- names_r[names_labels == "df2"]
# Perform join.
left_join(df2, df1, by = c("uid", "name"))
#> # A tibble: 4 x 3
#> uid name value
#> <chr> <chr> <dbl>
#> 1 A Red 13.
#> 2 B Blue 3.
#> 3 C yellow-purple 789.
#> 4 C green 17. There's probably a cleaner, more elegant way to achieve this using tidyverse functions and the pipe I must point out that, for this example, I edited two of the str values in variable
refinr:::get_fingerprint_ngram(c("Blue", "blu"), numgram = 1)
#> "belu" "blu"
refinr:::get_fingerprint_ngram(c("Blue", "bluee"), numgram = 1)
#> "belu" "belu"
df1 <- tribble(
~uid, ~name, ~value,
"A", "Red", 13,
"A", "violet", 145,
"B", "Blue", 3,
"B", "yellow", 56,
"C", "yellow-purple", 789,
"C", "green", 17
)
df2 <- tribble(
~uid, ~name,
"A", "red",
"B", "blu",
"C", "YellowPurple",
"C", "green"
)
n_gram_merge(c(df1$name, df2$name))
#> [1] "Red" "violet" "Blue" "yellow" "YellowPurple" "green"
#> [7] "Red" "blu" "YellowPurple" "green" Each algorithm has it's quirks and pros/cons, unfortunately they aren't perfect....I've tried to strike a balance between merging true-positives and ignoring false-positives. If you run into an example of specific text combinations that seem to "stump" the functions, I encourage you to open issues for those (something similar to issue #6). Thanks! |
Wow thanks a lot for your thorough reply! Other than the practical code examples, the theory behind this helps me a lot as well. Thanks again! |
No problem, happy to help! |
Hi Chris,
Great package! I'm a journalist and a lot of my time is spend in merging two data frames on the country, municipality, name or party column. These columns often contain different spellings for the same entity.
Now your package comes in handy, only I haven't figured out yet how to compare two of 'the same' columns and name the strings like I use
refinr
on a single vector. I'm not that experienced in R so maybe this sounds a little bit vague. Maybe my examples make things a bit clearer.If this is possible, it would safe me so much time!
Thanks in advance.
The text was updated successfully, but these errors were encountered: