Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sector_profile() now accounts for unmatched type, sector or subsector #739

Merged
merged 84 commits into from
May 20, 2024

Conversation

maurolepore
Copy link
Member

@maurolepore maurolepore commented Feb 27, 2024

reprex of the approved behaviour

--

Given a clustered matching one but not a second type of scenario, when the scenarios dataset has the two types, then the second type and its corresponding scenario are still present in grouped_by, and the mismatch is reflected correctly in the value.

The expected behaviour is captured in this GoogleSheet and explained in #739 (comment) (thanks @Tilmon).


TODO

  • Link related issue/PR.
  • Describe the goal of the PR. Avoid details that are clear in the diff.
  • Mark the PR as draft.
  • Include a unit test.
  • Review your own PR in "Files changed".
  • Ensure the PR branch is updated.
  • Ensure the checks pass.
  • Change the status from draft to ready.
  • Polish the PR title and description.
  • Assign a reviewer.

EXCEPTIONS

  • Slide here any item that you intentionally choose to not do.

@maurolepore maurolepore changed the title Handle missing benchmark The sector*() functions now handle missing "benchmark" Feb 27, 2024
@maurolepore
Copy link
Member Author

maurolepore commented Mar 4, 2024

@Tilmon (cc' @AnneSchoenauer)

I'm struggling to see the difference between the case "unmatched products" and the case "missing benchmarks".

For emissions*() I could clearly understand the difference and and create different tests for the cases with "unmatched product" versus "missing benchmark". For example, to test the case "unmatched product" I could add a product in the companies dataset that did not exist in the products dataset. And to test the case "missing benchmark" I could add an NA in the isic_4digit column of the products dataset.

However, for sector*() I can't understand the difference as clearly, and I realize that my tests for both cases essentially use the same type of toy input data: An NA or "unmatched" in either the columns sector, subsector, or type of the companies dataset.

Your #738 (comment) seems to support the idea that these two cases are no different:

I agree that in this case the unmatched value in sector is the important relationship.

The best way to clarify things is through a reproducible example that you are familiar with because it's based on the one you created in this GoogleSheet. At this point I'm so confused that I don't know if you think it shows one or both cases, but I hope it's a good start.

reprex
library(tibble)
devtools::load_all()
#> ℹ Loading tiltIndicator

packageVersion("tiltIndicator")
#> [1] '0.0.0.9211'

withr::local_options(list(tibble.print_max = Inf, width = 500))

companies <- tribble(
  ~companies_id, ~clustered, ~activity_uuid_product_uuid, ~tilt_sector, ~tilt_subsector,       ~type,     ~sector,  ~subsector,
            "a",        "a",                         "a",          "a",             "a",       "ipr",     "total",    "energy",
            "a",        "a",                         "a",          "a",             "a",       "weo",     "total",    "energy",
            "a",        "b",                 "unmatched",  "unmatched",     "unmatched", "unmatched", "unmatched", "unmatched",
            "a",        "c",                 "unmatched",          "c",             "c",       "ipr", "land use",   "land use",
            "a",        "c",                 "unmatched",          "c",             "c",       "weo",         NA,           NA
  )

scenarios <- tribble(
     ~sector,   ~subsector,  ~year, ~reductions, ~type, ~scenario,
     "total",     "energy",   2050,         1.0, "ipr",       "a",
     "total",     "energy",   2050,         0.6, "weo",       "a",
  "land use",   "land use",   2050,         0.3, "ipr",       "a"
)

sector_profile(companies, scenarios) |> unnest_product()
#> # A tibble: 5 × 11
#>   companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid tilt_sector scenario  year type      tilt_subsector
#>   <chr>        <chr>      <chr>                   <dbl> <chr>     <chr>                      <chr>       <chr>    <dbl> <chr>     <chr>         
#> 1 a            ipr_a_2050 high                      1   a         a                          a           a         2050 ipr       a             
#> 2 a            weo_a_2050 medium                    0.6 a         a                          a           a         2050 weo       a             
#> 3 a            <NA>       <NA>                     NA   b         unmatched                  unmatched   <NA>        NA unmatched unmatched     
#> 4 a            ipr_a_2050 low                       0.3 c         unmatched                  c           a         2050 ipr       c             
#> 5 a            <NA>       <NA>                     NA   c         unmatched                  c           <NA>        NA weo       c

sector_profile(companies, scenarios) |> unnest_company()
#> # A tibble: 8 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            ipr_a_2050 high          0.25 
#> 2 a            ipr_a_2050 medium        0    
#> 3 a            ipr_a_2050 low           0.25 
#> 4 a            ipr_a_2050 <NA>          0.5  
#> 5 a            weo_a_2050 high          0    
#> 6 a            weo_a_2050 medium        0.333
#> 7 a            weo_a_2050 low           0    
#> 8 a            weo_a_2050 <NA>          0.667

It it possible that this is already all you want? If yes, we're done and I can close this PR. If not, what do you expect?

@Tilmon
Copy link
Contributor

Tilmon commented Mar 4, 2024

Hi @maurolepore ,

let me respond to the points you raised below. I first start with a more detailed description of how to think about the two different cases and then explain why I think the reprex is unfortunately wrong.

Description of the "two cases"

I'm struggling to see the difference between the case "unmatched products" and the case "missing benchmarks".

I must admit that I find the difference in the case of sector*() also more complicated, but it's still there. Let's think about it that way: For the sector*() it doesn't matter, whether a clustered is matched to Ecoinvent or not. What matters is:

  1. Whether the clustered has a tilt_sector. If it doesn't have a tilt_sector , the clustered will neither have a sector or subsector for any of the types. Hence, you can think of the tilt_sector as the activity_uuid_product_uuid of sector*(): If the tilt_sector (activity_uuid_product_uuid) is missing, you won't have any sector data (co2_footprint) that can be used to calculate the profile. Because ultimately, the scenario sectors give us the info on the reduction targets. I.e., no scenario sector = no result at all. Hence, tilt_sector == unmatched should be have as the activity_uuid_product_uuid == unmatched in the emission*().
  2. If the tilt_sector & tilt_subsector leads to a sector & subsector for either of the type ipr or weo or both or none. You can think of the sector x subsector x type x year combination as the 6 benchmarks in the emission*() . For each clustered with a tilt_sector, we want to show all benchmarks, i.e. all corresponding combinations of sector x subsector x type x year for the specific tilt_sector x tilt_subsector , even if some are NAs. I.e., for every clustered with a tilt_sector, we should show the benchmarks weo_2030, weo_2050, ipr_2030, ipr_2050, even if some are NA (as in your reprex clustered c has no corresponding weo sector) - similar to an NA in isic_4digit in emission_profile().

Reprex
Your reprex shows BOTH cases.

  • clustered b has no tilt_sector (equivalent of no activity_uuid_product_uuid) and will hence lead to no results for the indicator (as there is no sector, equivalent to no co2_footprint). This is what I describe under 1.
  • clustered c has a tilt_sector (equivalent to matched product with activity_uuid_product_uuid) and hence will lead to results for the sector x subsector x type x year combination, even if there will be some NAs (equivalent to an activity_uuid_product_uuid leading to results for at least some of the 6 benchmarks).

So the reprex example is great to discuss the issue. I see two problems with the reprex you shared, one on product-level and one on company-level:

  • product-level: clustered c should have the grouped_by value "weo_a_2050" instead of "NA". We have a tilt_sector for that clustered and hence should show all benchmarks, even if they are NA.
  • company-level: The results show that for ipr_a_2050, there are 25% of products in high and 25% in low risk category. That can't be right, because we only have 3 products. Instead, we have 1 clustered with high, 1 with low, and one with NA benchmark , i.e. we should have for the benchmark ipr_a_2050 the following values: high 1/3, medium 0, low 1/3, NA 1/3.

I hope this helps. Let me know if it does!

P.S. I realize my comment here was not very helpful, as it's not wrong but leads to more confusion about the two different cases.

Your #738 (comment) seems to support the idea that these two cases are no different: "But I agree that in this case the unmatched value in sector is the important relationship."

I agree that in this case the unmatched value in sector is the important relationship.

@maurolepore maurolepore self-assigned this Mar 5, 2024
@maurolepore maurolepore changed the title The sector*() functions now handle missing "benchmark" The sector*() functions should preserve "missing benchmarks" Mar 5, 2024
@maurolepore maurolepore added the bug Something isn't working label Mar 14, 2024
@Tilmon Tilmon added this to the Database launch milestone Apr 23, 2024
@maurolepore maurolepore force-pushed the 733_sector_missing_benchmark branch from 7e45582 to 6e56a7d Compare May 13, 2024 14:39
@maurolepore
Copy link
Member Author

maurolepore commented May 17, 2024

it's best to pause the sector_upstream_profile() work for now -- @Tilmon in #739 (comment)

Noted (#784).

Thanks!

@maurolepore maurolepore marked this pull request as ready for review May 20, 2024 17:21
@maurolepore maurolepore merged commit 61b4fa9 into main May 20, 2024
14 checks passed
@maurolepore maurolepore deleted the 733_sector_missing_benchmark branch May 20, 2024 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NAs in benchmark-columns should yield NAs in risk_category (except "all")
3 participants