This repository has been archived by the owner on Sep 30, 2022. It is now read-only.

Commit

Merge ac04821 into 0bb2ad3
Jackson Hoffart committed Feb 19, 2020
2 parents 0bb2ad3 + ac04821 commit a0ed5c3
Showing 8 changed files with 226 additions and 378 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -13,3 +13,4 @@
^\.github/workflows/R-CMD-check\.yaml$
^\.github/workflows/pr-commands\.yaml$
^\.github/workflows/pkgdown\.yaml$
^vignettes/r2dii-match\.Rmd$
5 changes: 4 additions & 1 deletion R/prioritize.R
@@ -2,7 +2,10 @@
#'
#' @template ignores-but-preserves-existing-groups
#'
#' @param data A dataframe, commonly the output of [match_name()].
#' @param data A dataframe like the validated output of [match_name()]. Ensure
#' the output of [match_name()] has been manually validated, and the score of
#' all correct matches has been set to `1`, otherwise matching coverage may be
#' poor.
#' @param priority One of:
#' * `NULL`: defaults to the default level priority as returned by
#' [prioritize_level()].
134 changes: 12 additions & 122 deletions README.Rmd
@@ -24,7 +24,7 @@ knitr::opts_chunk$set(
[![R build status](https://github.com/2DegreesInvesting/r2dii.match/workflows/R-CMD-check/badge.svg)](https://github.com/2DegreesInvesting/r2dii.match/actions)
<!-- badges: end -->

The goal of r2dii.match is to match generic loanbook data with physical asset level data (ald).
The goal of r2dii.match is to match counterparties from a generic loanbook dataset with physical asset level data (ald).

## Installation

@@ -42,8 +42,6 @@ devtools::install_github("2DegreesInvesting/r2dii.match")

## Example

We'll use required packages from r2dii, and some convenient packages from the tidyverse.

```{r}
library(r2dii.match)
library(r2dii.dataraw)
@@ -52,136 +50,28 @@ suppressPackageStartupMessages(
)
```

The process for matching loanbook and ald datasets has multiple steps:

### 1. Create two datasets: [loanbook](https://2degreesinvesting.github.io/r2dii.dataraw/reference/loanbook_description.html) and [asset-level data (ald)](https://2degreesinvesting.github.io/r2dii.dataraw/reference/ald_description.html)

Start by creating datasets like [`loanbook_demo`](https://2degreesinvesting.github.io/r2dii.dataraw/reference/loanbook_demo.html) and [`ald_demo`](https://2degreesinvesting.github.io/r2dii.dataraw/reference/ald_demo.html) (from the [r2dii.dataraw package](https://2degreesinvesting.github.io/r2dii.dataraw)).

```{r}
loanbook_demo
ald_demo
```
Matching is achieved in two main steps:

You may use these datasets as a template:
### 1. Run fuzzy matching

* Write _loanbook\_demo.csv_ and _ald\_demo.csv_ with:
`match_name()` extracts all unique counterparty names from the columns `direct_loantaker`, `ultimate_parent`, and `intermediate_parent*`, then runs fuzzy matching against all company names in the `ald`:

```r
# Writing to current working directory
loanbook_demo %>%
write_csv(path = "loanbook_demo.csv")
match_result <- match_name(loanbook_demo, ald_demo)
match_result

ald_demo %>%
write_csv(path = "ald_demo.csv")
```

* For each dataset, replace our demo data with your data.
* Save each dataset as, say, _your\_loanbook.csv_ and _your\_ald.csv_.
* Read your datasets back into R with:

```r
# Reading from current working directory
your_loanbook <- read_csv("your_loanbook.csv")
your_ald <- read_csv("your_ald.csv")
```

Here we'll continue to use our `*_demo` datasets, pretending they contain the data of your own.

```{r}
# WARNING: Skip this to avoid overwriting your data with our demo data
your_loanbook <- loanbook_demo
your_ald <- ald_demo
```
### 2. Prioritize validated matches

### 2. Score the goodness of the match between the loanbook and ald datasets

`match_name()` scores the match between names in a loanbook dataset (lbk) and names in an asset-level dataset (ald). The names come from the columns `name_direct_loantaker` and `name_ultimate_parent` of the loanbook dataset, and from the column `name_company` of the asset-level dataset. The raw names are internally transformed by applying best practices commonly used in name matching algorithms, such as:

* Remove special characters.
* Replace language specific characters.
* Abbreviate certain names to reduce their importance in the matching.
* Spell out numbers to increase their importance.

Then, the similarity is scored between the internally-transformed names from the loanbook versus ald datasets. The scoring algorithm is `stringdist::stringsim()`.
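
As a rough illustration of the transform-then-score idea described above, here is a minimal base R sketch. It is not the package's implementation: `clean_name()` and `name_similarity()` are hypothetical helpers, the cleaning rules are deliberately simplified, and base R's `adist()` (Levenshtein distance) is swapped in for `stringdist::stringsim()` so the example has no dependencies.

```r
# Hypothetical helper: a simplified stand-in for the internal name cleaning.
clean_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9 ]", "", x)  # drop special characters
  x <- gsub("\\s+", " ", x)       # collapse repeated whitespace
  trimws(x)
}

# Hypothetical helper: similarity in [0, 1], where 1 means identical
# after cleaning. Uses Levenshtein distance from utils::adist().
name_similarity <- function(a, b) {
  a <- clean_name(a)
  b <- clean_name(b)
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(adist(a, b)) / longest
}

name_similarity("Acme Power, Inc.", "ACME  Power Inc")
name_similarity("Acme Power Inc", "Acme Powers Inc")
```

The first call compares names that differ only in case, punctuation, and spacing, so they become identical after cleaning. The second pair still differs by one character after cleaning and therefore scores slightly below `1`.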

```{r}
match_name(your_loanbook, your_ald)
```

`match_name()` defaults to scoring matches between name strings that belong to the same sector. Using `by_sector = FALSE` removes this limitation -- increasing computation time, and the number of matches with a low score.

```{r}
match_name(your_loanbook, your_ald, by_sector = FALSE) %>%
nrow()
# Compare
match_name(your_loanbook, your_ald, by_sector = TRUE) %>%
nrow()
```

`min_score` allows you to pick rows of a minimum `score` and above.

```{r}
matched <- match_name(your_loanbook, your_ald, min_score = 0.9)
range(matched$score)
```

### 3. Write the output of the previous step into a .csv file

Write the output of the previous step into a .csv file with:

```r
# Writing to current working directory
matched %>%
write_csv("matched.csv")
```

### 4. Compare, edit, and save the data manually

* Open _matched.csv_ with any spreadsheet editor (e.g. MS Excel, Google Sheets).

* Visually compare names from loanbook versus ald datasets, along with the loanbook sector.

* Edit the data manually:
* If you are happy with the match, set the `score` value to `1`.
* Otherwise set or leave the `score` value to anything other than `1`.

* Save the edited file as, say, _matched_edited.csv_.

### 5. Re-read the data from the previous step

Re-read the data from the previous step with:
The user should then manually validate `match_result`, ensuring that the `score` value is equal to `1` only for perfect matches.
Once validated, the `prioritize()` function will choose only the valid matches, prioritizing (by default) `direct_loantaker` matches over `ultimate_parent` matches:

```r
# Reading from current working directory
matched <- read_csv("matched_edited.csv")
```

### 6. Pick validated matches and prioritize by level

The `matched` dataset may have multiple matches per loan. To get the best match only, use `prioritize()` -- it picks rows where `score` is 1 and `level` per loan is of highest `priority`.
prioritize(match_result)

```{r}
some_interesting_columns <- vars(id_2dii, level, score)
matched %>%
prioritize() %>%
select(!!! some_interesting_columns)
```

The default priority is set internally via `prioritize_level()`.

```{r}
prioritize_level(matched)
```

You may use a different priority. One way to do that is to pass a function to `priority`. For example, use `rev` to reverse the default priority.

```{r}
matched %>%
prioritize(priority = rev) %>%
select(!!! some_interesting_columns)
```
The result is a dataset with the same columns as the input loanbook, plus columns bridging all matched loans to their ald counterpart.
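
To make the selection rule concrete, the following base R sketch mimics what the text says `prioritize()` does: keep only rows with `score` equal to `1`, then keep the highest-priority `level` per loan. It is an illustration under assumptions, not the package's code: the toy data, the `id_loan` column name, and the two-level priority vector are all hypothetical.

```r
# Toy, hypothetical validated matches: two loans, two candidate matches each.
matched <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2"),
  level   = c("ultimate_parent", "direct_loantaker",
              "ultimate_parent", "direct_loantaker"),
  score   = c(1, 1, 1, 0.9),
  stringsAsFactors = FALSE
)

# Assumed priority order: direct loantaker first, then ultimate parent.
priority <- c("direct_loantaker", "ultimate_parent")

# 1. Keep only validated matches (score exactly 1).
valid <- matched[matched$score == 1, ]

# 2. Rank each row by the position of its level in the priority vector.
valid$rank <- match(valid$level, priority)

# 3. Per loan, keep the row with the best (lowest) rank.
best <- valid[valid$rank == ave(valid$rank, valid$id_loan, FUN = min), ]
best <- best[order(best$id_loan), c("id_loan", "level", "score")]
print(best)
```

In this toy data, loan `L1` keeps its `direct_loantaker` match, while loan `L2` falls back to `ultimate_parent` because its `direct_loantaker` candidate was not validated (score `0.9`). This is why unvalidated output can lead to poor matching coverage.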

For a more detailed walkthrough of the functionality [see the documentation](https://2degreesinvesting.github.io/r2dii.match/articles/r2dii.match.html)
