Generalize player table scraping in fb_league_stats() #359

Merged 6 commits on Jan 18, 2024

tonyelhabr
Collaborator

It turns out that, in at least one known case, the player table for league shooting stats is hidden by default on the page. One solution would be to try to identify when this occurs, "click" on the show button, and then parse the table as usual. But this adds a lot of overhead.

The more general solution implemented in this PR is to parse the player table out of an HTML comment that is always loaded with the page. The resulting code is a little more "specific", but I wouldn't deem it "hard-coded" by any means. Further, I think it's ok to make the code very specific in this case, since we don't use chromote for any other functions in the package.
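
For readers unfamiliar with the technique, here is a minimal sketch of pulling a table out of an HTML comment with xml2 and rvest. It works on a static, made-up HTML string rather than a chromote session, so it is not the code in this PR; the `id` values, column names, and player names are assumptions for illustration only.

```r
library(xml2)
library(rvest)

# Toy page (hypothetical markup): the table only exists inside an HTML comment.
html <- '<html><body>
<div id="all_stats_shooting">
<!--
<table id="stats_shooting">
<thead><tr><th>Rk</th><th>Player</th><th>Gls</th></tr></thead>
<tbody>
<tr><td>1</td><td>Player A</td><td>10</td></tr>
<tr><td>2</td><td>Player B</td><td>7</td></tr>
</tbody>
</table>
-->
</div>
</body></html>'

page <- read_html(html)

# Select the comment node itself; xml_text() returns the commented-out markup.
comment_node <- xml_find_first(page, "//div[@id='all_stats_shooting']/comment()")

# Re-parse the comment text as HTML, then read the table as usual.
commented_html <- read_html(xml_text(comment_node))
player_table <- html_table(html_element(commented_html, "table"))
```

Because the comment ships with the initial page load, this approach avoids simulating a click on the "show" button.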

Appendix

worldfootballr_html_player_table <- function(session) {
stopifnot(identical(class(session), c("WorldfootballRDynamicPage", "R6")))

## find element "above" commented out table
Collaborator Author

Note that I couldn't figure out how to identify the element containing the commented-out table directly, so I've opted to do it "indirectly", by identifying the node above it (which always has the same CSS class).
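
The "indirect" lookup described above can be sketched on static HTML as follows. This is only an illustration using xml2 and rvest, not the chromote-based code in the PR; the `placeholder` class name and the toy markup are assumptions.

```r
library(xml2)
library(rvest)

# Toy markup (hypothetical): a node with a stable CSS class sits directly
# above the comment that holds the table.
html <- '<div id="wrapper">
  <div class="placeholder"></div>
  <!--
  <table>
    <thead><tr><th>Rk</th><th>Player</th></tr></thead>
    <tbody><tr><td>1</td><td>Player A</td></tr></tbody>
  </table>
  -->
</div>'

page <- read_html(html)

# Step 1: find the node "above" the commented-out table by its CSS class.
anchor <- html_element(page, "div.placeholder")

# Step 2: walk to the first comment node that follows it as a sibling.
comment_node <- xml_find_first(anchor, "following-sibling::comment()[1]")

# Step 3: parse the comment text as HTML and extract the table.
table_html <- read_html(xml_text(comment_node))
player_table <- html_table(html_element(table_html, "table"))
```

Anchoring on the preceding node trades a direct selector for a dependency on page layout, so it would need revisiting if the site changed the order of these elements.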

session <- worldfootballr_chromote_session(url)
page <- worldfootballr_html_page(session)
player_table <- worldfootballr_html_player_table(session)
Collaborator Author

The new solution is a little more specific. This function returns the HTML for only 1 table, while the prior solution returned the HTML for all (3) tables on the page, from which the player table was later plucked. Arguably, the new solution is better.

Comment on lines +48 to +63
player_table_elements <- xml2::xml_children(xml2::xml_children(player_table))
parsed_player_table <- rvest::html_table(player_table_elements)
renamed_player_table <- .rename_fb_cols(parsed_player_table[[1]])
renamed_player_table <- renamed_player_table[renamed_player_table$Rk != "Rk", ]
renamed_player_table <- .add_player_href(
  renamed_player_table,
  parent_element = player_table_elements,
  player_xpath = ".//tbody/tr/td[@data-stat='player']/a"
)
}

-  suppressMessages(
-    readr::type_convert(
-      clean_table,
-      guess_integer = TRUE,
-      na = "",
-      trim_ws = TRUE
+  suppressMessages(
+    readr::type_convert(
+      renamed_player_table,
+      guess_integer = TRUE,
+      na = "",
+      trim_ws = TRUE
+    )
Collaborator Author

There's basically no logic change here. I've just added _player to the variable names.
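
As a rough illustration of what the unchanged logic does, the sketch below drops repeated header rows and then lets readr guess column types, using the same `type_convert()` arguments as the diff. The toy data frame and its values are made up for the example.

```r
library(readr)

# Toy scraped table (hypothetical values): every column is character, and the
# header row is repeated in the middle of the body.
raw_table <- data.frame(
  Rk     = c("1", "2", "Rk", "3"),
  Player = c("Player A", "Player B", "Player", "Player C"),
  Gls    = c("10", "7", "Gls", ""),
  stringsAsFactors = FALSE
)

# Drop the repeated header rows, as in the diff above.
body_only <- raw_table[raw_table$Rk != "Rk", ]

# Guess column types; empty strings become NA, and whole numbers become integers.
converted <- suppressMessages(
  readr::type_convert(
    body_only,
    guess_integer = TRUE,
    na = "",
    trim_ws = TRUE
  )
)
```

Filtering before conversion matters: if the repeated "Rk" row stayed in, the numeric columns would contain non-numeric strings and would be kept as character.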

@@ -406,16 +406,16 @@ test_that("fb_league_stats() for players works", {
testthat::skip_on_cran()
testthat::skip_on_ci()
expected_player_shooting_cols <- c("Rk", "Player", "Player_Href", "Nation", "Pos", "Squad", "Age", "Born", "Mins_Per_90", "Gls_Standard", "Sh_Standard", "SoT_Standard", "SoT_percent_Standard", "Sh_per_90_Standard", "SoT_per_90_Standard", "G_per_Sh_Standard", "G_per_SoT_Standard", "Dist_Standard", "FK_Standard", "PK_Standard", "PKatt_Standard", "xG_Expected", "npxG_Expected", "npxG_per_Sh_Expected", "G_minus_xG_Expected", "np:G_minus_xG_Expected", "Matches", "url")
-  epl_player_shooting_22 <- fb_league_stats(
+  single_player_shooting_22 <- fb_league_stats(
Collaborator Author

Realized this variable has epl_ in it, but we're scraping for Brazil. I've renamed it to single_ to implicitly reflect its usage for testing just 1 league.

Owner

@JaseZiv left a comment

Changes you've made look really good.

Thanks so much

@JaseZiv JaseZiv merged commit 703904a into main Jan 18, 2024
2 of 4 checks passed
Development

Successfully merging this pull request may close these issues.

fb_league_stats won't return the tables when the stats are hidden