feat: assess if fuzzyjoin may simplify/enhance the implementation of match_name #302

Open
maurolepore opened this issue Sep 12, 2020 · 3 comments
Labels
ADO Add issue to ADO feature a feature request or enhancement medium Likely finished in under a week

Comments

@maurolepore
Contributor

maurolepore commented Sep 12, 2020

https://cran.r-project.org/web/packages/fuzzyjoin/

AB#10180

@jdhoffa
Member

jdhoffa commented Sep 14, 2020

Very cool.

@jdhoffa jdhoffa added feature a feature request or enhancement and removed enhancement labels Apr 14, 2023
@jdhoffa jdhoffa added medium Likely finished in under a week ADO Add issue to ADO labels Feb 6, 2024
@jdhoffa jdhoffa self-assigned this Feb 6, 2024
@cjyetman
Member

cjyetman commented Feb 6, 2024

A word of caution: faster is not always better. By my estimation, the first example in the docs for zoomerjoin produces 1 correct match and 7 incorrect ones, and leaves the rest of the 500 rows in each corpus unmatched. To be fair, it's primarily failing on numbers, where it likely does not see much difference between the strings, but to a human those matches look obviously false.

Also to be fair, this is likely not worse than what is currently being done in this package. But it's likely not much better either, even if it's faster.

```r
library(tidyverse)
library(zoomerjoin)
options(width = 130)

corpus_1 <- dime_data %>% # dime data is packaged with zoomerjoin
  head(500)
names(corpus_1) <- c("a", "field")

corpus_2 <- dime_data %>% # dime data is packaged with zoomerjoin
  tail(500)
names(corpus_2) <- c("b", "field")

jaccard_inner_join(corpus_1, corpus_2,
  by = "field", n_gram_width = 6,
  n_bands = 20, band_width = 6, threshold = .8
)
#> # A tibble: 8 × 4
#>       a field.x                                                      b field.y
#>   <dbl> <chr>                                                    <dbl> <chr>
#> 1   302 americans for good government inc                          910 americans for good government
#> 2   230 pipefitters local union 524                                998 pipefitters local union 533
#> 3   292 bill bradley for u s senate '84                            913 bill bradley for u s senate '90
#> 4   378 guarini for congress 1982                                  606 guarini for congress 1984
#> 5   378 guarini for congress 1982                                  883 guarini for congress 1986
#> 6   238 4th congressional district democratic party                518 16th congressional district democratic party
#> 7    88 scheuer for congress 1980                                  667 scheuer for congress 1984
#> 8   319 7th congressional district democratic party of wisconsin   792 8th congressional district democratic party of wisconsin
```
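
A rough sketch of why the year mismatches clear the 0.8 threshold, assuming the score is the plain Jaccard index over character 6-grams. zoomerjoin approximates this with MinHash and locality-sensitive hashing, so its internal scores may differ. The helpers `shingles()` and `jaccard()` are hypothetical, written in base R for this illustration only; they are not part of zoomerjoin or of this package.

```r
# Hypothetical helper: the set of distinct character n-grams ("shingles") in a string.
shingles <- function(x, n = 6) {
  # Strings shorter than n yield a single shingle holding the whole string.
  starts <- seq_len(max(nchar(x) - n + 1, 1))
  unique(substring(x, starts, starts + n - 1))
}

# Hypothetical helper: exact Jaccard similarity between the shingle sets of two strings.
jaccard <- function(a, b, n = 6) {
  sa <- shingles(a, n)
  sb <- shingles(b, n)
  length(intersect(sa, sb)) / length(union(sa, sb))
}

# Row 4 of the output above: the strings differ only in the last digit of the year.
jaccard("guarini for congress 1982", "guarini for congress 1984")
```

Each of these strings has 20 shingles and only the final one contains the differing digit, so 19 of the 21 distinct shingles are shared (about 0.90). A one-character change in a year barely moves a character n-gram score, even though it changes the meaning for a human reader.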

@jdhoffa
Member

jdhoffa commented Feb 6, 2024

Fair enough.

And code speed isn't really the main blocker with this package; it's the "time it takes to manually verify".

Still neat to look into!

@jdhoffa jdhoffa changed the title Explore if fuzzyjoin may simplify/enhance the implementation of match_name() feat: assess if fuzzyjoin may simplify/enhance the implementation of match_name Mar 6, 2024
@jdhoffa jdhoffa added ADO Add issue to ADO and removed ADO Add issue to ADO labels Mar 6, 2024
@jdhoffa jdhoffa removed their assignment Apr 10, 2024