This repository has been archived by the owner on Sep 30, 2022. It is now read-only.
forked from RMI-PACTA/r2dii.match
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
New Get started article shows details
1 parent
7370674
commit 51e1d25
Showing 3 changed files with 144 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,3 +6,4 @@ | |
^\.github/workflows$ | ||
^\.travis\.yml$ | ||
^vignettes/articles$ | ||
^vignettes$ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
*.html | ||
*.R |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
--- | ||
title: "r2dii.match" | ||
--- | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
``` | ||
|
||
This example walks through the entire matching process. As usual, we start by attaching the required packages. For convenience, we'll also use the tidyverse. | ||
|
||
```{r} | ||
library(r2dii.match) | ||
library(r2dii.dataraw) | ||
library(tidyverse) | ||
``` | ||
|
||
We'll use some fake datasets from the r2dii.dataraw package, whose names end with `_demo`, for example: | ||
|
||
```{r} | ||
loanbook_demo | ||
``` | ||
|
||
Before matching, both the loanbook and the asset-level data (ald) must be prepared. | ||
To this end, there are several mandatory steps and several optional steps. | ||
|
||
We can bridge from multiple sector classification codes to 2Dii's sectors: `r sort(unique(bridge_sector(loanbook_demo)$sector))`. | ||
|
||
```{r} | ||
loanbook_demo %>% | ||
bridge_sector() %>% | ||
# Focusing on columns related to sector | ||
select( | ||
sector_classification_system, | ||
sector_classification_input_type, | ||
sector_classification_direct_loantaker, | ||
sector, | ||
borderline | ||
) | ||
``` | ||
|
||
If the loanbook has non-unique IDs, you can generate name+sector specific IDs | ||
(this is especially important if one company is classified in two sectors for two loans). | ||
|
||
```{r} | ||
loanbook_demo %>% | ||
id_by_loantaker_sector() | ||
``` | ||
|
||
Before we run the fuzzy matching algorithm, we simplify the loanbook and ald names using `replace_customer_name()`: | ||
|
||
```{r} | ||
some_customer_names <- c("3M Company", "Abbott Laboratories", "AbbVie Inc.") | ||
replace_customer_name(some_customer_names) | ||
# replacements can be defined from scratch using: | ||
custom_replacement <- tibble(from = "AAAA", to = "B") | ||
replace_customer_name("Aa Aaaa", from_to = custom_replacement) | ||
# or appended to the existing list of replacements: | ||
get_replacements() | ||
appended_replacements <- get_replacements() %>% | ||
add_row( | ||
.before = 1, | ||
from = c("AA", "BB"), to = c("alpha", "beta") | ||
) | ||
appended_replacements | ||
# And in combination with `replace_customer_name()` | ||
replace_customer_name(c("AA", "BB", "1"), from_to = appended_replacements) | ||
``` | ||
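
To make the idea of name simplification more concrete, here is a toy sketch in base R. It is not the implementation of `replace_customer_name()`: the helper `simplify_name_sketch()`, its rules (lowercase, apply literal replacements, drop non-alphanumeric characters), and the replacement pairs are all assumptions made up for illustration.

```r
# Toy illustration only: NOT how replace_customer_name() is implemented.
# `simplify_name_sketch()` and its rules are hypothetical.
simplify_name_sketch <- function(x, from_to) {
  out <- tolower(x)
  # Apply each replacement in order, matching literal (lowercased) text
  for (i in seq_len(nrow(from_to))) {
    out <- gsub(tolower(from_to$from[i]), from_to$to[i], out, fixed = TRUE)
  }
  # Drop everything except lowercase letters and digits
  gsub("[^a-z0-9]", "", out)
}

# Hypothetical replacement pairs, in the same from/to shape used above
from_to <- data.frame(
  from = c("company", "laboratories"),
  to = c("co", "labs"),
  stringsAsFactors = FALSE
)

simplified <- simplify_name_sketch(
  c("3M Company", "Abbott Laboratories"),
  from_to
)
simplified
```

The point of such a step is that trivially different spellings of the same company collapse to the same string before any fuzzy comparison is made.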
|
||
The following function takes a loanbook with non-corrupt IDs and outputs a list of all unique name and sector combinations at every level, including the simplified name, to be used in the matching process: | ||
|
||
```{r} | ||
prep_loanbook <- loanbook_demo %>% | ||
id_by_loantaker_sector() %>% | ||
prepare_loanbook_for_matching() | ||
prep_loanbook | ||
``` | ||
|
||
And similarly for the ald: | ||
|
||
```{r} | ||
prep_ald <- r2dii.dataraw::ald_demo %>% | ||
prepare_ald_for_matching() | ||
prep_ald | ||
``` | ||
|
||
For the purpose of manual matching, you can substitute the name and/or sector of particular loans at the desired level when preparing the loanbook data. To do so, specify the `overwrite` argument in `prepare_loanbook_for_matching()`. (To substitute only the name, leave the sector as `NA`, and vice versa.) | ||
|
||
```{r} | ||
overwrite_demo <- r2dii.dataraw::overwrite_demo | ||
overwrite_demo | ||
prep_loanbook <- loanbook_demo %>% | ||
id_by_loantaker_sector() %>% | ||
prepare_loanbook_for_matching(overwrite = overwrite_demo) | ||
prep_loanbook | ||
``` | ||
|
||
`match_all_against_all()` scores the similarity between `simpler_name` values in the prepared loanbook and ald datasets. The `by_sector` argument flags whether names should only be compared against ald names in the same sector. (Setting `by_sector = TRUE` reduces the matching runtime on large datasets, and reduces the number of nonsensical matches.) | ||
|
||
```{r} | ||
# Using default `by_sector = TRUE` | ||
matched <- match_all_against_all(prep_loanbook, prep_ald) | ||
matched | ||
``` | ||
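
To give an intuition for what a similarity score between two simplified names can look like, the sketch below scores every pair of names with a normalized edit distance from base R (`utils::adist()`). This is a swapped-in technique for illustration: the helper `score_names_sketch()` is hypothetical, and the metric that `match_all_against_all()` actually uses is not stated here and may differ.

```r
# Sketch only: normalized edit-distance similarity using base R.
# `score_names_sketch()` is a hypothetical helper, not part of r2dii.match.
score_names_sketch <- function(x, y) {
  # Every combination of a name in x with a name in y
  pairs <- expand.grid(
    simpler_name_x = x,
    simpler_name_y = y,
    stringsAsFactors = FALSE
  )
  # Edit distance for each pair of names
  distance <- mapply(
    function(a, b) utils::adist(a, b)[1, 1],
    pairs$simpler_name_x,
    pairs$simpler_name_y
  )
  longest <- pmax(nchar(pairs$simpler_name_x), nchar(pairs$simpler_name_y))
  # 1 means identical; values near 0 mean very different
  pairs$score <- 1 - unname(distance) / longest
  pairs
}

scores <- score_names_sketch(
  c("abbottlabs", "3mco"),
  c("abbotlabs", "abbvie")
)
# Best candidates first
scores[order(-scores$score), ]
```

Comparing all names against all names grows with the product of the two input sizes, which is one reason restricting comparisons to the same sector helps on large datasets.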
|
||
You may use common dplyr functions to recover all columns from the loanbook dataset and to keep only rows at or above some threshold. | ||
|
||
```{r} | ||
threshold <- 0.9 | ||
matched %>% | ||
left_join(prep_loanbook, by = c("simpler_name_x" = "simpler_name")) %>% | ||
filter(score >= threshold) | ||
``` | ||
|
||
The data frame of matches should be saved and manually verified. To do so, try something like: | ||
|
||
```r | ||
readr::write_csv(matched, "path/to/save/matches_to_be_verified.csv") | ||
``` | ||
|
||
and open the .csv file in Excel, Google Sheets, or any other spreadsheet editor. Once open, compare `simpler_name_x` and `simpler_name_y` manually, along with the loanbook sector. If you are happy with the match, set the `score` value to `1`. (Note: only values of exactly `1` will be considered valid; all other potential matches will be considered invalid.) | ||
|
||
When you are happy with the match validation, read the verified file back into R: | ||
|
||
```r | ||
readr::read_csv("path/to/load/verified_matches.csv") | ||
``` | ||
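
Because only scores of exactly `1` count as valid, the verified data can be reduced to the validated rows before going further. A minimal sketch in base R, assuming the verified file keeps the `simpler_name_x`, `simpler_name_y` and `score` columns; the small data frame below is a made-up stand-in for the file contents.

```r
# Stand-in for the data read from the verified .csv file (values are made up)
verified <- data.frame(
  simpler_name_x = c("abbottlabs", "3mco", "abbvie"),
  simpler_name_y = c("abbotlabs", "3mcorp", "abbey"),
  score = c(1, 1, 0.8),
  stringsAsFactors = FALSE
)

# Keep only the matches that were manually set to exactly 1
validated <- verified[verified$score == 1, ]
validated
```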
|
||
**Work in progress: the next step of the analysis is to join in validated matches in order of priority**. |