# recordlinkage
`recordlinkage` is a more powerful alternative to `thefuzz`. Besides
Levenshtein similarity scores between two string columns of pandas DataFrames,
it supports other string methods such as `'jaro'`, `'jarowinkler'`,
`'damerau_levenshtein'`, `'qgram'`, and `'cosine'`. It also handles other data
types: it can compare date columns, compare numeric columns with the `'step'`,
`'linear'`, `'exp'`, `'gauss'`, or `'squared'` methods, and find exact matches.
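
To make "similarity score" concrete, here is a minimal pure-Python sketch of a Levenshtein distance scaled to the range 0 to 1. It is an illustration only, not the library's implementation: the function names are hypothetical, and dividing by the longer string's length is an assumed normalization.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1], where 1 means identical strings.
    Assumption: normalize by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


score = levenshtein_similarity("kitten", "sitting")
is_match = score >= 0.8  # a threshold turns the score into a yes/no decision
```

A threshold such as 0.8 is a judgment call: set it lower and more typos are tolerated, but more false matches slip through.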

> [Main Table of Contents](../../README.md)

## In This Notebook
- Use Cases
- The Flow

## Use Cases
- Goal: a complex merge between two or more dirty datasets

Complex Merge Examples | 
--- |
Collapsing a large range of values into a few categories when methods like `df.replace()` are not enough | 


## The Flow
1. Clean and normalize to increase record linkage accuracy
	- `recordlinkage.preprocessing` has several methods that may be helpful
2. Use the `Index` class as a prep step to generate candidate pairs between two datasets.  
There are several indexing algorithms, e.g. blocking and sorted  
neighbourhood indexing.  
3. Use the `Compare` class to set similarity measurement algorithms between two  
DataFrame columns, one from dataset A and one from dataset B
	- Comparison types: Exact, String, Numeric, Date, ...  
4. Call `.compute()` on the `Compare` object to get a DataFrame of feature vectors,  
i.e. similarity scores, with one row per candidate pair from step 2
5. Use the scores to isolate matches, handle duplicates
6. Combine the datasets
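
To see why the indexing step matters, the sketch below mimics blocking in plain pandas and compares the number of candidate pairs with a full pairing of every record. The toy data and variable names are made up for illustration, and the merge is only a stand-in for what an indexer does, not the library's own code.

```python
import pandas as pd

# Toy data, made up for illustration
df_a = pd.DataFrame({
    "rest_name": ["Arnie Morton's", "Cafe Bizou", "Campanile"],
    "cuisine_type": ["american", "french", "american"],
})
df_b = pd.DataFrame({
    "rest_name": ["Arnie Mortons", "Cafe Bizu", "Grill on the Alley"],
    "cuisine_type": ["american", "french", "american"],
})

# Full indexing: every record in A is paired with every record in B
full_pairs = pd.MultiIndex.from_product([df_a.index, df_b.index])

# Blocking: only pair records that share the same cuisine_type
merged = df_a.reset_index().merge(
    df_b.reset_index(), on="cuisine_type", suffixes=("_a", "_b")
)
blocked_pairs = pd.MultiIndex.from_arrays(
    [merged["index_a"], merged["index_b"]]
)

print(len(full_pairs), len(blocked_pairs))
```

Blocking trades recall for speed: a pair whose blocking key is misspelled in one dataset is never compared, which is one reason to clean the data first.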

### Example of the flow
```python
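import pandas as pd
import recordlinkage

# restaurants and restaurants_new are assumed to be existing DataFrames

# Create an indexer object to find possible pairs
indexer = recordlinkage.Index()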

# Block pairing on cuisine_type. Blocking is one type of indexing algo.
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city and cuisine_type
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')

# Find similar matches of rest_name; with a threshold, scores >= 0.8 become 1, else 0
comp_cl.string('rest_name', 'rest_name', label='name', threshold=0.8)

# Compute the feature vectors for the candidate pairs and print them
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)

# Decide the minimum number of columns, n, that must match for a pair to count
# as a duplicate. In this example n = 3 because all three columns must match.

# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis=1) >= 3]

# Get the values of the second index level (the restaurants_new rows) of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

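# Append non_dup to restaurants. DataFrame.append was removed in pandas 2.0,
# so use pd.concat instead (requires `import pandas as pd`).
full_restaurants = pd.concat([restaurants, non_dup])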
print(full_restaurants)
```
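
The last three steps can be tried without the library or the real data. The sketch below uses a hand-built score table shaped like the feature vectors above, with a two-level index of (row in `restaurants`, row in `restaurants_new`). All values are made up, and the small DataFrames are hypothetical stand-ins for the real ones.

```python
import pandas as pd

# Made-up feature vectors: one row per candidate pair, one column per comparison
scores = pd.DataFrame(
    {"city": [1, 1, 0], "cuisine_type": [1, 1, 1], "name": [1.0, 0.0, 1.0]},
    index=pd.MultiIndex.from_tuples([(0, 10), (1, 11), (2, 12)]),
)

# Hypothetical stand-ins for the two datasets
restaurants = pd.DataFrame({"rest_name": ["a", "x", "y"]}, index=[0, 1, 2])
restaurants_new = pd.DataFrame(
    {"rest_name": ["a", "b", "c"]}, index=[10, 11, 12]
)

# Keep only pairs where all three comparisons matched
matches = scores[scores.sum(axis=1) >= 3]

# Rows of restaurants_new that duplicate a row in restaurants
duplicate_ids = matches.index.get_level_values(1)

# Drop the duplicates, then stack the remaining new rows under the originals
non_dup = restaurants_new[~restaurants_new.index.isin(duplicate_ids)]
combined = pd.concat([restaurants, non_dup])
print(combined)
```

Only the pair `(0, 10)` scores 3, so row 10 is dropped from `restaurants_new` and the other two rows are appended.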