New Get started article

2DegreesInvesting · Dec 5, 2019 · 51e1d25
1 parent 7370674
commit 51e1d25

Showing 3 changed files with 144 additions and 0 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -6,3 +6,4 @@
 ^\.github/workflows$
 ^\.travis\.yml$
 ^vignettes/articles$
+^vignettes$
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/r2dii-match.Rmd b/vignettes/r2dii-match.Rmd
@@ -0,0 +1,141 @@
+---
+title: "r2dii.match"
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+This article walks through the entire matching process. As usual, we start by attaching the required packages. For convenience, we'll also use the tidyverse.
+
+```{r}
+library(r2dii.match)
+library(r2dii.dataraw)
+library(tidyverse)
+```
+
+We'll use some fake datasets from the r2dii.dataraw package, whose names end with `_demo`, for example:
+
+```{r}
+loanbook_demo
+```
+
+Before matching, both the loanbook and the asset-level data (ald) must be prepared.
+This involves several mandatory steps and several optional ones.
+
+We can bridge from multiple sector classification codes to 2Dii's sectors: `r sort(unique(bridge_sector(loanbook_demo)$sector))`.
+
+```{r}
+loanbook_demo %>%
+  bridge_sector() %>%
+  # Focusing on columns related to sector
+  select(
+    sector_classification_system,
+    sector_classification_input_type,
+    sector_classification_direct_loantaker,
+    sector,
+    borderline
+  )
+```
+
+If the loanbook has non-unique IDs, we can generate IDs that are specific to each name and sector combination
+(this is especially important if one company is classified in two sectors for two loans).
+
+```{r}
+loanbook_demo %>%
+  id_by_loantaker_sector()
+```
+
+Before we run the fuzzy matching algorithm, we simplify the loanbook and ald names using `replace_customer_name()`:
+
+```{r}
+some_customer_names <- c("3M Company", "Abbott Laboratories", "AbbVie Inc.")
+replace_customer_name(some_customer_names)
+
+# replacements can be defined from scratch using:
+custom_replacement <- tibble(from = "AAAA", to = "B")
+replace_customer_name("Aa Aaaa", from_to = custom_replacement)
+
+# or appended to the existing list of replacements:
+get_replacements()
+
+appended_replacements <- get_replacements() %>%
+  add_row(
+    .before = 1,
+    from = c("AA", "BB"), to = c("alpha", "beta")
+  )
+appended_replacements
+
+# And in combination with `replace_customer_name()`
+replace_customer_name(c("AA", "BB", "1"), from_to = appended_replacements)
+```
+
+`prepare_loanbook_for_matching()` takes a loanbook with non-corrupt IDs and outputs all unique name and sector combinations at every level, including the simplified name, to be used in the matching process:
+
+```{r}
+prep_loanbook <- loanbook_demo %>%
+  id_by_loantaker_sector() %>%
+  prepare_loanbook_for_matching()
+
+prep_loanbook
+```
+
+And similarly for the ald:
+
+```{r}
+prep_ald <- r2dii.dataraw::ald_demo %>%
+  prepare_ald_for_matching()
+
+prep_ald
+```
+
+For the purpose of manual matching, you can substitute the name and/or sector of particular loans at the desired level when preparing the loanbook data. To do so, specify the `overwrite` argument of `prepare_loanbook_for_matching()`. (To substitute only the name, leave the sector as `NA`, and vice versa.)
+
+```{r}
+overwrite_demo <- r2dii.dataraw::overwrite_demo
+overwrite_demo
+
+prep_loanbook <- loanbook_demo %>%
+  id_by_loantaker_sector() %>%
+  prepare_loanbook_for_matching(overwrite = overwrite_demo)
+
+prep_loanbook
+```
+
+`match_all_against_all()` scores the similarity between `simpler_name` values in the prepared loanbook and ald datasets. The `by_sector` argument controls whether loanbook names are compared only against ald names in the same sector. Setting `by_sector = TRUE` reduces the matching runtime on large datasets and the number of nonsensical matches.
+
+```{r}
+# Using default `by_sector = TRUE`
+matched <- match_all_against_all(prep_loanbook, prep_ald)
+
+matched
+```
+
+You may use common dplyr functions to recover all columns from the loanbook dataset and to keep only rows with a score at or above some threshold.
+
+```{r}
+threshold <- 0.9
+
+matched %>%
+  left_join(prep_loanbook, by = c("simpler_name_x" = "simpler_name")) %>%
+  filter(score >= threshold)
+```
+
+The matched data frame should be saved and verified manually. To do so, try something like:
+
+```r
+readr::write_csv(matched, "path/to/save/matches_to_be_verified.csv")
+```
+
+Then open the .csv file in Excel, Google Sheets, or any other spreadsheet editor. Compare `simpler_name_x` and `simpler_name_y` manually, along with the loanbook sector. If you are happy with a match, set its `score` value to `1`. (Note: only values of exactly `1` are considered valid; all other potential matches are treated as invalid.)
+
+Once you have finished validating the matches, read the verified file back in:
+
+```r
+readr::read_csv("path/to/load/verified_matches.csv")
+```
+
+**Work in progress: the next step of the analysis is to join in the validated matches in order of priority.**