This repository has been archived by the owner on Sep 30, 2022. It is now read-only.

New Get started article shows details
maurolepore committed Dec 5, 2019
1 parent 7370674 commit 51e1d25
Showing 3 changed files with 144 additions and 0 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -6,3 +6,4 @@
^\.github/workflows$
^\.travis\.yml$
^vignettes/articles$
^vignettes$
2 changes: 2 additions & 0 deletions vignettes/.gitignore
@@ -0,0 +1,2 @@
*.html
*.R
141 changes: 141 additions & 0 deletions vignettes/r2dii-match.Rmd
@@ -0,0 +1,141 @@
---
title: "r2dii.match"
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This article walks through the entire matching process. We start by loading the required packages; for convenience, we also load the tidyverse.

```{r}
library(r2dii.match)
library(r2dii.dataraw)
library(tidyverse)
```

We'll use some fake datasets from the r2dii.dataraw package, whose names end with `_demo`, for example:

```{r}
loanbook_demo
```

Before matching, both the loanbook and the asset-level data (ald) must be prepared.
Some of the preparation steps are mandatory and others are optional.

We can bridge from multiple sector classification codes to 2Dii's sectors: `r sort(unique(bridge_sector(loanbook_demo)$sector))`.

```{r}
loanbook_demo %>%
  bridge_sector() %>%
  # Focusing on columns related to sector
  select(
    sector_classification_system,
    sector_classification_input_type,
    sector_classification_direct_loantaker,
    sector,
    borderline
  )
```

If the loanbook has non-unique IDs, you can generate IDs that are specific to each name and sector combination with `id_by_loantaker_sector()`
(this is especially important if one company is classified in two sectors for two different loans).

```{r}
loanbook_demo %>%
  id_by_loantaker_sector()
```
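
To make the idea of name and sector specific IDs concrete, here is a minimal base R sketch. It is not the implementation of `id_by_loantaker_sector()`: the data, the `C` prefix, and the way the key is built are all assumptions made for illustration.

```r
# Hypothetical loanbook: "Acme Corp" has loans in two different sectors.
loans <- data.frame(
  name_direct_loantaker = c("Acme Corp", "Acme Corp", "Beta Ltd", "Acme Corp"),
  sector = c("power", "automotive", "power", "power"),
  stringsAsFactors = FALSE
)

# Build a key from name and sector, then number the distinct keys in order of
# first appearance. This mimics the idea only, not the package's ID format.
key <- paste(loans$name_direct_loantaker, loans$sector, sep = "|")
loans$id_direct_loantaker <- paste0("C", match(key, unique(key)))

loans
```

Rows that share both name and sector get the same ID, while the same company in a different sector gets a new one.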

Before running the fuzzy matching algorithm, we simplify the names in the loanbook and the ald with `replace_customer_name()`:

```{r}
some_customer_names <- c("3M Company", "Abbott Laboratories", "AbbVie Inc.")
replace_customer_name(some_customer_names)
# replacements can be defined from scratch using:
custom_replacement <- tibble(from = "AAAA", to = "B")
replace_customer_name("Aa Aaaa", from_to = custom_replacement)
# or appended to the existing list of replacements:
get_replacements()
appended_replacements <- get_replacements() %>%
  add_row(
    .before = 1,
    from = c("AA", "BB"), to = c("alpha", "beta")
  )
appended_replacements
# And in combination with `replace_customer_name()`
replace_customer_name(c("AA", "BB", "1"), from_to = appended_replacements)
```
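
The sketch below illustrates the general idea of simplifying names with a table of `from`/`to` replacements, using base R only. `simplify_name()` and its replacement table are hypothetical; the actual rules applied by `replace_customer_name()` may well differ.

```r
# Hypothetical helper; NOT the implementation of replace_customer_name().
simplify_name <- function(x, from_to) {
  out <- tolower(x)
  # Apply each replacement in turn, matching the text literally
  for (i in seq_len(nrow(from_to))) {
    out <- gsub(tolower(from_to$from[i]), from_to$to[i], out, fixed = TRUE)
  }
  # Drop everything that is not a lowercase letter or a digit
  gsub("[^a-z0-9]", "", out)
}

# Hypothetical replacement table with the same `from`/`to` layout
replacements <- data.frame(
  from = c(" company", " inc."),
  to = c(" co", " inc"),
  stringsAsFactors = FALSE
)

simplified <- simplify_name(c("3M Company", "AbbVie Inc."), replacements)
simplified
```

Reducing names to a canonical form like this means that spelling variants of the same company are more likely to score as similar during fuzzy matching.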

`prepare_loanbook_for_matching()` takes a loanbook with non-corrupt IDs and outputs every unique name and sector combination at each level, including the simplified name, to be used in the matching process:

```{r}
prep_loanbook <- loanbook_demo %>%
  id_by_loantaker_sector() %>%
  prepare_loanbook_for_matching()
prep_loanbook
```

Similarly, `prepare_ald_for_matching()` prepares the ald:

```{r}
prep_ald <- r2dii.dataraw::ald_demo %>%
  prepare_ald_for_matching()
prep_ald
```

For manual matching, you can substitute the name and/or sector of particular loans at the desired level when preparing the loanbook data. To do so, specify the `overwrite` argument of `prepare_loanbook_for_matching()`. (To substitute only the name, leave the sector as `NA`, and vice versa.)

```{r}
overwrite_demo <- r2dii.dataraw::overwrite_demo
overwrite_demo
prep_loanbook <- loanbook_demo %>%
  id_by_loantaker_sector() %>%
  prepare_loanbook_for_matching(overwrite = overwrite_demo)
prep_loanbook
```

`match_all_against_all()` scores the similarity between `simpler_name` values in the prepared loanbook and ald datasets. The `by_sector` argument controls whether loanbook names are compared only against ald names in the same sector. (Setting `by_sector = TRUE` reduces the matching runtime on large datasets and reduces the number of nonsensical matches.)

```{r}
# Using default `by_sector = TRUE`
matched <- match_all_against_all(prep_loanbook, prep_ald)
matched
```
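
As a rough illustration of scoring names within the same sector, the base R sketch below pairs every loanbook name with every ald name that shares its sector and scores each pair between 0 and 1. The data are hypothetical, and `utils::adist()` (edit distance) is only a stand-in: the similarity measure used by `match_all_against_all()` may differ.

```r
# Hypothetical prepared data with `simpler_name` and `sector` columns
lbk <- data.frame(
  simpler_name = c("acmeco", "betaltd"),
  sector = c("power", "automotive"),
  stringsAsFactors = FALSE
)
ald <- data.frame(
  simpler_name = c("acmecorp", "betaltd", "acmeco"),
  sector = c("power", "automotive", "automotive"),
  stringsAsFactors = FALSE
)

# All loanbook-ald pairs that share a sector (the idea behind by_sector = TRUE)
pairs <- merge(lbk, ald, by = "sector", suffixes = c("_x", "_y"))

# Similarity in [0, 1]: 1 minus the edit distance scaled by the longer name.
# adist() is a stand-in; the package's scoring method may differ.
distance <- mapply(
  function(a, b) utils::adist(a, b)[1, 1],
  pairs$simpler_name_x, pairs$simpler_name_y
)
longest <- pmax(nchar(pairs$simpler_name_x), nchar(pairs$simpler_name_y))
pairs$score <- 1 - unname(distance) / longest

pairs
```

Joining on sector first keeps the number of comparisons down: "acmeco" in the power sector is never compared against the identically named automotive entry.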

You can use common dplyr functions to recover all columns from the loanbook dataset and to keep only rows with a score at or above some threshold.

```{r}
threshold <- 0.9
matched %>%
  left_join(prep_loanbook, by = c("simpler_name_x" = "simpler_name")) %>%
  filter(score >= threshold)
```

This data frame of matches should be saved and verified manually. To do so, try something like:

```r
readr::write_csv(matched, "path/to/save/matches_to_be_verified.csv")
```

Then open the .csv file in Excel, Google Sheets, or any other spreadsheet editor. Compare `simpler_name_x` and `simpler_name_y` manually, along with the loanbook sector. If you are happy with a match, set its `score` to `1`. (Note: only values of exactly `1` are considered valid; all other potential matches are considered invalid.)

Once you have validated the matches, read the verified file back into R:

```r
readr::read_csv("path/to/load/verified_matches.csv")
```
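
Because only scores of exactly `1` count as valid, one plausible follow-up is to keep just those rows after reloading the file. The base R sketch below uses hypothetical data and a temporary file in place of the real spreadsheet; it is an assumption about how the filtering could be done, not part of the package.

```r
# Hypothetical matches after manual verification in a spreadsheet
verified <- data.frame(
  simpler_name_x = c("acmeco", "betaltd", "gammainc"),
  simpler_name_y = c("acmecorp", "betaltd", "gammaindustries"),
  score = c(1, 1, 0.62),
  stringsAsFactors = FALSE
)

# Round-trip through a temporary file to stand in for the saved spreadsheet
path <- tempfile(fileext = ".csv")
write.csv(verified, path, row.names = FALSE)
reloaded <- read.csv(path, stringsAsFactors = FALSE)

# Keep only the rows whose score was set to exactly 1
valid <- reloaded[reloaded$score == 1, ]
valid
```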

**Work in progress: the next step of the analysis is to join in the validated matches in order of priority.**
