Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: replace id_r2dii with company_id input from ald_demo #375

Open
jdhoffa opened this issue Jul 5, 2021 · 11 comments
Open

feat: replace id_r2dii with company_id input from ald_demo #375

jdhoffa opened this issue Jul 5, 2021 · 11 comments
Labels
ADO Add issue to ADO breaking change ☠️ API change likely to affect existing code feature a feature request or enhancement medium Likely finished in under a week

Comments

@jdhoffa
Copy link
Member

jdhoffa commented Jul 5, 2021

Currently, we uniquely identify matches using an internally generate id, which is the unique group indices for sector and company.

ald_demo now contains the column company_id which uniquely identifies each company. HOWEVER, it likely does not uniquely identify each company + sector combination. We should consider how we can move towards using this company_id column, and move away from the id_r2dii column.

Relates to #355

AB#10179

@jdhoffa
Copy link
Member Author

jdhoffa commented Sep 21, 2021

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(r2dii.data)

select(ald_demo, company_id, name_company) %>% 
  unique() %>% 
  head()
#> # A tibble: 6 × 2
#>   company_id name_company                    
#>   <chr>      <chr>                           
#> 1 1          aba hydropower generation co ltd
#> 2 2          achinsky glinoziemskij kombinat 
#> 3 3          affinity renewables inc.        
#> 4 4          africa oil corp                 
#> 5 5          africon shipping sa             
#> 6 6          agni steels private limited

Created on 2021-09-21 by the reprex package (v2.0.0)

@maurolepore
Copy link
Contributor

maurolepore commented Sep 21, 2021

Currently, we uniquely identify matches using ... the unique group indices for sector and company.

the [new] column company_id ... uniquely identifies each company [but does't consider sector].

So is this the job?:

Produce group indices from company_id + sector
(instead of company + sector).

@maurolepore
Copy link
Contributor

Also, are you saying that now ald_demo is no longer only a demo dataset but instead a "real" dataset we use to match companies in the wild with an id we came up with?

I need to read the comments in the PR that addes company_id to ald_demo -- likely I can answer my own question. But still I want to capture my first reaction to this "mixed" responsability of ald_demo in case I didn't notice the issue the first time.

@jdhoffa
Copy link
Member Author

jdhoffa commented Sep 21, 2021

No, it is definitely still a completely demo dataset.

@jdhoffa
Copy link
Member Author

jdhoffa commented Sep 21, 2021

The only change that was made to ald_demo is that it gained the column company_id (with entries of made-up company ids, still fake). This change just reflects the fact that the REAL data (that is owned and distributed by Asset Resolution) also gained that column (and the real data obviously has real ID values in that column).

@jdhoffa
Copy link
Member Author

jdhoffa commented Sep 21, 2021

The change proposed here is basically to make use of that new column that is now being served with the most recent data.

@jdhoffa
Copy link
Member Author

jdhoffa commented Sep 21, 2021

I've tried to explain this more clearly in code, along with a failing test:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(r2dii.data)
library(r2dii.match)

# match_name currently outputs a column id_2dii. this "ID" is generated 
# arbitrarily by the function
out <- r2dii.data::loanbook_demo %>% 
  match_name(r2dii.data::ald_demo)

out %>% 
  select(id_2dii, level, sector_ald, name_ald)
#> # A tibble: 410 × 4
#>    id_2dii level                 sector_ald name_ald                            
#>    <chr>   <chr>                 <chr>      <chr>                               
#>  1 UP15    ultimate_parent       power      alpine knits india pvt. limited     
#>  2 IP19    intermediate_parent_1 power      ykk usa, inc.                       
#>  3 UP288   ultimate_parent       power      university of iowa                  
#>  4 DL305   direct_loantaker      power      yukon development corp              
#>  5 UP104   ultimate_parent       power      garland power & light               
#>  6 DL304   direct_loantaker      power      yukon development corp              
#>  7 UP83    ultimate_parent       power      earthpower technologies sydney pty …
#>  8 UP163   ultimate_parent       power      kraftwerk mehrum gmbh               
#>  9 UP138   ultimate_parent       power      jai bharat gum & chemicals ltd.     
#> 10 UP32    ultimate_parent       power      bhagwan energy ltd.                 
#> # … with 400 more rows

# ald_demo now has a column `company_id`
r2dii.data::ald_demo %>% 
  select(company_id, name_company) %>% 
  unique() %>% 
  head()
#> # A tibble: 6 × 2
#>   company_id name_company                    
#>   <chr>      <chr>                           
#> 1 1          aba hydropower generation co ltd
#> 2 2          achinsky glinoziemskij kombinat 
#> 3 3          affinity renewables inc.        
#> 4 4          africa oil corp                 
#> 5 5          africon shipping sa             
#> 6 6          agni steels private limited

# let's look at the company "alpine knits ..." which is in both ald_demo and the 
# demo output of `match_name`

# in ald_demo
r2dii.data::ald_demo %>% 
  filter(grepl("alpine", name_company)) %>% 
  select(company_id, name_company) %>% 
  unique() %>% 
  head()
#> # A tibble: 1 × 2
#>   company_id name_company                   
#>   <chr>      <chr>                          
#> 1 16         alpine knits india pvt. limited

# in output of match_name
out %>% 
  filter(grepl("alpine", name_ald)) %>% 
  select(id_2dii, level, sector_ald, name_ald)
#> # A tibble: 1 × 4
#>   id_2dii level           sector_ald name_ald                       
#>   <chr>   <chr>           <chr>      <chr>                          
#> 1 UP15    ultimate_parent power      alpine knits india pvt. limited

# in the match_name output, I would want to see instead a column `company_id` 
# with the value "16" (as above)

# You could, for example, use the following as a failing test
out_alpine <- out %>% 
  filter(grepl("alpine", name_ald))

expect_equal(out_alpine$company_id, 16L) 
#> Error in expect_equal(out_alpine$company_id, 16L): could not find function "expect_equal"
# of course this will fail, since the column company_id does not yet exist

Created on 2021-09-21 by the reprex package (v2.0.0)

@jdhoffa
Copy link
Member Author

jdhoffa commented Sep 21, 2021

Currently, we uniquely identify matches using ... the unique group indices for sector and company.

the [new] column company_id ... uniquely identifies each company [but does't consider sector].

So is this the job?:

Produce group indices from company_id + sector
(instead of company + sector).

I actually no longer think this is an issue. company_id does not need to be unique to the sector on output, because we also output explicitly the column sector_ald. (ie. the user can always explicitly determine uniquely companies by usng the combination of company_id and sector_ald from the output data).

^If the above doesn't make sense, don't worry. But I believe you can disregard my initial concerns about sector.

@jdhoffa
Copy link
Member Author

jdhoffa commented Sep 21, 2021

One last consideration, is that this will change how the argument match_name(..., overwrite = ...) is handled, as explained here: https://2degreesinvesting.github.io/r2dii.match/articles/r2dii-match.html#maybe-overwrite-matches

In particular, overwrite expects a dataframe input, with a column id_2dii. This will also need to be replaced by company_id

@jacobvjk
Copy link
Member

Just want wo highlight that I would very much support the use of company_id soon, as tis is also what we would normally use to match data from different sources in stress testing. The credit risk stress test for banks would be so much more robust if I did not only have to join on company names for which I can't be sure I am doing the right transformation of the format. Big vote from my side

@jdhoffa
Copy link
Member Author

jdhoffa commented Feb 6, 2024

Relates to #212

@jdhoffa jdhoffa added medium Likely finished in under a week feature a feature request or enhancement breaking change ☠️ API change likely to affect existing code ADO Add issue to ADO labels Feb 6, 2024
@jdhoffa jdhoffa self-assigned this Feb 6, 2024
@jdhoffa jdhoffa removed the ADO Add issue to ADO label Mar 6, 2024
@jdhoffa jdhoffa changed the title Replace id_r2dii with company_id input from ald_demo feat: replace id_r2dii with company_id input from ald_demo Mar 6, 2024
@jdhoffa jdhoffa added ADO Add issue to ADO and removed ADO Add issue to ADO labels Mar 6, 2024
@jdhoffa jdhoffa removed their assignment Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ADO Add issue to ADO breaking change ☠️ API change likely to affect existing code feature a feature request or enhancement medium Likely finished in under a week
Projects
None yet
Development

No branches or pull requests

3 participants