In [1]:
import numpy as np
import pandas as pd
import recordlinkage

### Record Linkage:

`Record linkage` is the act of linking data from different sources that describe the same entity. It does not require exact matches between pairs of records; it can instead find close matches using `string similarity`. This makes record linkage effective when the data sources share no common unique key, such as a unique identifier, that you could rely on to join them.
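To make the idea of a close match concrete, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. It is only a stand-in for illustration: `recordlinkage` has its own string similarity measures, and the helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 (illustrative helper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# An exact comparison treats a small spelling difference as a mismatch ...
exact = "feenix" == "fenix"

# ... while a similarity score still rates the two names as close.
score = similarity("feenix", "fenix")

print(exact, round(score, 2))
```

A threshold on such a score (0.8 is used later in this notebook) turns the ratio into a match / no-match decision.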

In [2]:
restaurants = pd.read_csv("restaurants_L2.csv")
#restaurants.head()

In [4]:
restaurants_new = pd.read_csv("restaurants_L2_dirty.csv")
restaurants_new.head()

Unnamed: 0.1,Unnamed: 0,name,addr,city,phone,type
0,0,kokomo,6333 w. third st.,la,2139330773,american
1,1,feenix,8358 sunset blvd. west,hollywood,2138486677,american
2,2,parkway,510 s. arroyo pkwy .,pasadena,8187951001,californian
3,3,r-23,923 e. third st.,los angeles,2136877178,japanese
4,4,gumbo,6333 w. third st.,la,2139330358,cajun/creole


### Ex 1:

In this exercise, you will perform the first step in `record linkage` and generate possible `pairs` of rows between `restaurants` and `restaurants_new`.

### Steps of Record Linkage:

1. Generate pairs

2. Compare between columns

3. Score the comparison

4. Link the DataFrames

### Generate Pairs:

Steps to Generate pairs:

  a. Create an indexing object with `recordlinkage.Index()`.

  b. Block the pairing on a column with the indexer's `.block()` method, so that only rows sharing the same value in that column are paired.

  c. Finally, generate the pairs by calling `.index()` on `restaurants` and `restaurants_new`, in that order.
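The effect of blocking can be sketched with plain pandas. This is not how `recordlinkage` implements it; it is a toy illustration, with made-up DataFrames `left` and `right`, of how pairing only rows that share a blocking key shrinks the set of candidate pairs.

```python
import pandas as pd

# Toy data (hypothetical, not the restaurants CSV files)
left = pd.DataFrame({"name": ["kokomo", "gumbo", "r-23"],
                     "type": ["american", "cajun", "japanese"]})
right = pd.DataFrame({"name": ["kokomo cafe", "feenix", "r 23"],
                      "type": ["american", "american", "japanese"]})

# Without blocking: every row of left is paired with every row of right
full_pairs = pd.MultiIndex.from_product([left.index, right.index])

# With blocking on "type": an inner join keeps only pairs that share the key
merged = left.reset_index().merge(right.reset_index(), on="type",
                                  suffixes=("_left", "_right"))
blocked_pairs = pd.MultiIndex.from_frame(merged[["index_left", "index_right"]])

print(len(full_pairs), len(blocked_pairs))
```

Each entry of `blocked_pairs` is a (row in `left`, row in `right`) tuple, which is also the shape of the index the later steps rely on.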

In [6]:
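# Create an indexer object to find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type. The column must exist in both DataFrames;
# the preview of restaurants_new above labels it `type`, so rename it first if needed.
indexer.block("cuisine_type")

# Generate pairs (a MultiIndex of row labels: restaurants first, restaurants_new second)
pairs = indexer.index(restaurants, restaurants_new)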

### Ex 2:

When performing `record linkage`, there are different types of `matching` you can perform between different columns of your DataFrames, including `exact matches`, `string similarities`, and more.

Now that your pairs have been generated and stored in `pairs`, you will find exact matches in the `city` and `cuisine_type` columns between each pair, and similar strings for each pair in the `rest_name` column.

### Compare between columns and score the comparison:

Steps to Compare between columns :

1. Instantiate a comparison object using `recordlinkage.Compare()`.

2. Find `exact` matches with the `.exact()` method.

3. Find `similar` strings with the `.string()` method and a `threshold`.

4. Compute the comparison of the pairs by using the `.compute()` method.

5. Print out `potential_matches`.

In [None]:
# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city and cuisine_type
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')

# Find similar matches of rest_name (similarity of at least 0.8 counts as a match)
comp_cl.string('rest_name', 'rest_name', label='name', threshold=0.8)

## Scoring
# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)
# Keep only the pairs where all three compared columns match
print(potential_matches[potential_matches.sum(axis=1) >= 3])
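Since the output of the cell above is not shown, the following toy sketch rebuilds the same kind of table by hand: one row per candidate pair, one 0/1 column per comparison, and a row sum used as the score. It assumes made-up data and uses `difflib` as a stand-in similarity measure, so the helper `similar` and the DataFrames here are hypothetical and not part of `recordlinkage`.

```python
from difflib import SequenceMatcher

import pandas as pd


def similar(a: str, b: str, threshold: float = 0.8) -> int:
    """Return 1 if the two strings are at least `threshold` similar, else 0."""
    return int(SequenceMatcher(None, a, b).ratio() >= threshold)


left = pd.DataFrame({"rest_name": ["kokomo", "feenix"],
                     "city": ["la", "hollywood"],
                     "cuisine_type": ["american", "american"]})
right = pd.DataFrame({"rest_name": ["kokomo", "fenix"],
                      "city": ["la", "hollywood"],
                      "cuisine_type": ["american", "american"]})

# Candidate pairs as (row in left, row in right)
candidate_pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]

rows = []
for i, j in candidate_pairs:
    rows.append({
        "city": int(left.at[i, "city"] == right.at[j, "city"]),
        "cuisine_type": int(left.at[i, "cuisine_type"] == right.at[j, "cuisine_type"]),
        "name": similar(left.at[i, "rest_name"], right.at[j, "rest_name"]),
    })

features = pd.DataFrame(rows, index=pd.MultiIndex.from_tuples(candidate_pairs))

# Score each pair by its row sum and keep pairs where all three columns agree
likely = features[features.sum(axis=1) >= 3]
print(likely)
```

Only the pairs (0, 0) and (1, 1) reach a score of 3; the misspelled "fenix" still matches "feenix" because the name column uses similarity rather than equality.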

### Ex 3: 

Now it's finally time to `link` both DataFrames. You will do so by first extracting all `row indices` of `restaurants_new` that are `matching` across the `columns` mentioned above from `potential_matches`. Then you will `subset` restaurants_new on these indices, then `append` the `non-duplicate` values to `restaurants`.

### Linking DataFrames:

Steps of Linking DataFrames:

1. Isolate the rows of `potential_matches` whose row sum is greater than or equal to `n`, using the `.sum()` method. Here `n` is the minimum number of columns that must match for a pair to count as a duplicate:

`potential_matches[potential_matches.sum(axis = 1) >= n]`

2. Find the `duplicate` rows by taking the second index level with the `.get_level_values()` method.

3. Find the `non-duplicate` rows by subsetting `restaurants_new`.

4. Finally, `append` the non-duplicate rows to `restaurants`.

In [None]:
# Isolate potential matches with row sum >= 3 (all three compared columns match)
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
full_restaurants = pd.concat([restaurants, non_dup])
print(full_restaurants)
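To check the linking logic without the CSV files, the same steps can be run on toy data. This is a sketch under stated assumptions: the `*_toy` DataFrames and the single hand-written match are made up, and `ignore_index=True` is an optional choice here to avoid repeated row labels in the result.

```python
import pandas as pd

restaurants_toy = pd.DataFrame({"rest_name": ["kokomo", "feenix"]})
restaurants_new_toy = pd.DataFrame({"rest_name": ["kokomo", "gumbo", "r-23"]})

# Pretend the comparison step flagged one duplicate:
# row 0 of restaurants_toy matches row 0 of restaurants_new_toy
matches_toy = pd.DataFrame(
    {"city": [1], "cuisine_type": [1], "name": [1]},
    index=pd.MultiIndex.from_tuples([(0, 0)]),
)

# The second index level holds the row labels from restaurants_new_toy
matching_indices = matches_toy.index.get_level_values(1)

# Keep only the rows of restaurants_new_toy that were not flagged as duplicates
non_dup = restaurants_new_toy[~restaurants_new_toy.index.isin(matching_indices)]

# Stack the original rows and the new, non-duplicate rows
full_toy = pd.concat([restaurants_toy, non_dup], ignore_index=True)
print(full_toy)
```

The duplicate "kokomo" row from the new data is dropped, and the two remaining new rows are added below the original ones.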