# `fuzzymatcher` examples

## Basic usage - `link_table`

In the most basic usage, the user provides `fuzzymatcher` with two pandas dataframes, indicating which columns to join on.

The central output of `fuzzymatcher` is the `link_table`.

For each record in the left table, the link table includes one or more possible matching records from the right table.

The user can then inspect the link table and decide which matches to retain, e.g. by choosing a score threshold (`match_score > chosen_threshold`) or by keeping only the best match (`match_rank == 1`).

In [1]:
import fuzzymatcher
import pandas as pd

df_left = pd.read_csv("tests/data/left_1.csv")
df_left

Unnamed: 0,id,fname,mname,lname,dob,another_field
0,1,Alistair,Paul,Johnston,20/05/1980,other data
1,2,James,Paul,Smith,15/06/1990,more data
2,3,Alisdair,Paul,Jonson,20/05/1961,another thing
3,4,David,Paul,Williams,01/01/2000,final thing


In [2]:
df_right = pd.read_csv("tests/data/right_1.csv")
df_right

Unnamed: 0,id,name,middlename,surname,date,other
0,1,Alistair,Paul,Johnston,20/05/1980,other data
1,2,James,Paul,Smith,15/06/1990,more data
2,3,Alasdair,Paul,Johnson,20/05/1960,another thing


In [3]:
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# Note that if left_id_col or right_id_col are omitted, a unique id will be autogenerated
fuzzymatcher.link_table(df_left, df_right, left_on, right_on, left_id_col="id", right_id_col="id")

Unnamed: 0,__id_left,__id_right,match_score,match_rank,fname,name,mname,middlename,lname,surname,dob,date
0,0_left,0_right,0.452393,1,Alistair,Alistair,Paul,Paul,Johnston,Johnston,20/05/1980,20/05/1980
1,1_left,1_right,0.534373,1,James,James,Paul,Paul,Smith,Smith,15/06/1990,15/06/1990
2,2_left,2_right,0.296763,1,Alisdair,Alasdair,Paul,Paul,Jonson,Johnson,20/05/1961,20/05/1960
3,3_left,0_right,0.067482,1,David,Alistair,Paul,Paul,Williams,Johnston,01/01/2000,20/05/1980
4,3_left,1_right,0.067482,2,David,James,Paul,Paul,Williams,Smith,01/01/2000,15/06/1990
5,3_left,2_right,0.067482,3,David,Alasdair,Paul,Paul,Williams,Johnson,01/01/2000,20/05/1960
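
The two ways of retaining matches mentioned earlier (a score threshold, or the best match only) can be sketched with ordinary pandas filtering. This is a minimal illustration rather than part of `fuzzymatcher` itself: the small `link_table` frame below is hand-copied from the output above with most columns dropped, and the threshold value `0.2` is an arbitrary assumption chosen for this example.

```python
import pandas as pd

# Hand-copied subset of the link table shown above (columns trimmed for brevity).
link_table = pd.DataFrame({
    "__id_left": ["0_left", "1_left", "2_left", "3_left", "3_left", "3_left"],
    "__id_right": ["0_right", "1_right", "2_right", "0_right", "1_right", "2_right"],
    "match_score": [0.452393, 0.534373, 0.296763, 0.067482, 0.067482, 0.067482],
    "match_rank": [1, 1, 1, 1, 2, 3],
})

# Assumed threshold for illustration only; pick one that suits your data.
chosen_threshold = 0.2

# Option 1: keep every candidate whose score exceeds the threshold.
above_threshold = link_table[link_table["match_score"] > chosen_threshold]

# Option 2: keep only the top-ranked candidate for each left record.
best_only = link_table[link_table["match_rank"] == 1]

# Both conditions combined: top-ranked candidates that also clear the threshold.
best_above_threshold = link_table[
    (link_table["match_rank"] == 1) & (link_table["match_score"] > chosen_threshold)
]
```

With this data, the threshold drops all three low-scoring candidates for `3_left`, whereas `match_rank == 1` alone still keeps one of them.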


## Basic usage - `fuzzy_left_join`

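A second option is to use `fuzzy_left_join`, which automatically links the two dataframes based on the highest-scoring match. As in an ordinary left join, left records with no match are kept and their right-hand columns are left empty.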

In [12]:
import fuzzymatcher
import pandas as pd

df_left = pd.read_csv("tests/data/left_1.csv")
df_right = pd.read_csv("tests/data/right_1.csv")
left_on = ["fname", "lname", "dob"]
right_on = ["name", "surname", "date"]

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

Unnamed: 0,best_match_score,id_left,fname,mname,lname,dob,another_field,id_right,name,middlename,surname,date,other
0,0.35987,1,Alistair,Paul,Johnston,20/05/1980,other data,1.0,Alistair,Paul,Johnston,20/05/1980,other data
1,0.438719,2,James,Paul,Smith,15/06/1990,more data,2.0,James,Paul,Smith,15/06/1990,more data
2,0.21363,3,Alisdair,Paul,Jonson,20/05/1961,another thing,3.0,Alasdair,Paul,Johnson,20/05/1960,another thing
3,,4,David,Paul,Williams,01/01/2000,final thing,,,,,,
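
In the output above, the last left record has an empty `best_match_score` and empty right-hand columns. A common follow-up is to separate matched from unmatched rows. The sketch below does this with plain pandas; the `joined` frame is hand-copied from the output above with only a few columns kept, so it stands in for the real return value rather than reproducing it exactly.

```python
import pandas as pd

# Hand-copied subset of the fuzzy_left_join output shown above.
# None marks the cells that are empty in that output.
joined = pd.DataFrame({
    "best_match_score": [0.35987, 0.438719, 0.21363, None],
    "id_left": [1, 2, 3, 4],
    "fname": ["Alistair", "James", "Alisdair", "David"],
    "id_right": [1.0, 2.0, 3.0, None],
    "name": ["Alistair", "James", "Alasdair", None],
})

# Left records for which no right-hand match is present.
unmatched = joined[joined["best_match_score"].isna()]

# Left records that were linked to a right-hand record.
matched = joined.dropna(subset=["best_match_score"])
```

The `id_right` values display as `1.0`, `2.0`, `3.0` because pandas stores a numeric column containing missing values as floating point by default.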
