# Linking them together!

In the last lesson, you finished the bulk of the work needed to link `restaurants` and `restaurants_new`. You generated the pairs of potentially matching rows, searched for exact matches in the `cuisine_type` and `city` columns, and compared the `rest_name` column for similar strings. You stored the DataFrame containing the scores in `potential_matches`.

Now it's finally time to link both DataFrames. You will first extract from `potential_matches` all row indices of `restaurants_new` that match across the columns mentioned above. Then you will subset `restaurants_new` to the rows that are not among those indices and append these non-duplicate rows to `restaurants`. All DataFrames are in your environment, alongside `pandas` imported as `pd`.

In [1]:
import pandas as pd
import numpy as np
from faker import Faker
import datetime as dt
import missingno as msno
import matplotlib.pyplot as plt
from thefuzz import fuzz
import recordlinkage

fake = Faker()
path=r'Z:/'
file='restaurants_L2.csv'
restaurants = pd.read_csv(path+file,index_col = [0]) 
restaurants = restaurants.rename(columns={'name':'rest_name','addr':'rest_addr','type':'cuisine_type'})
print(restaurants.head(),'\n')

file2='restaurants_L2_dirty.csv'
restaurants_new = pd.read_csv(path+file2,index_col = [0])
restaurants_new = restaurants_new.rename(columns={'name':'rest_name','addr':'rest_addr','type':'cuisine_type'})
print(restaurants_new.head(),'\n')

                   rest_name                  rest_addr         city  \
0  arnie morton's of chicago   435 s. la cienega blv .   los angeles   
1         art's delicatessen       12224 ventura blvd.   studio city   
2                  campanile       624 s. la brea ave.   los angeles   
3                      fenix    8358 sunset blvd. west     hollywood   
4         grill on the alley           9560 dayton way   los angeles   

        phone cuisine_type  
0  3102461501     american  
1  8187621221     american  
2  2139381447     american  
3  2138486677     american  
4  3102760615     american   

  rest_name                 rest_addr         city       phone  cuisine_type
0    kokomo         6333 w. third st.           la  2139330773      american
1    feenix   8358 sunset blvd. west     hollywood  2138486677      american
2   parkway      510 s. arroyo pkwy .     pasadena  8187951001   californian
3      r-23          923 e. third st.  los angeles  2136877178      japanese
4     

In [4]:
# Create an indexer object and find possible pairs
indexer = recordlinkage.Index()
# Block pairing on cuisine_type
indexer.block(['cuisine_type'])
# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

comp_cl = recordlinkage.Compare()

# Find exact matches on city and cuisine_type
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label = 'cuisine_type')

# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold= 0.8) 

potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)

        city  cuisine_type  name
0   0      0             1   0.0
    1      0             1   0.0
    7      0             1   0.0
    12     0             1   0.0
    13     0             1   0.0
...      ...           ...   ...
40  18     0             1   0.0
281 18     0             1   0.0
288 18     0             1   0.0
302 18     0             1   0.0
308 18     0             1   0.0

[3631 rows x 3 columns]
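
The `name` column above holds only 0.0 or 1.0 because a `threshold` was passed to the string comparison. The following is a minimal pure-Python sketch of that idea, assuming the comparison is a normalized Levenshtein similarity (which appears to be the default method of `Compare.string()`, but treat that as an assumption). The helper names `levenshtein_distance` and `thresholded_similarity` are hypothetical and are not part of `recordlinkage`.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def thresholded_similarity(a: str, b: str, threshold: float = 0.8) -> float:
    """Return 1.0 if the normalized similarity reaches the threshold, else 0.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # illustrative choice: two empty strings count as identical
    similarity = 1 - levenshtein_distance(a, b) / longest
    return 1.0 if similarity >= threshold else 0.0


# "fenix" and "feenix" both appear in the sample rows printed earlier
print(thresholded_similarity("fenix", "feenix"))
print(thresholded_similarity("fenix", "kokomo"))
```

Under this assumption, "fenix" and "feenix" differ by one edit over six characters, giving a similarity of about 0.83, which clears the 0.8 threshold and counts as a name match.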


* Isolate the rows of `potential_matches` where the row sum is greater than or equal to 3 by using the `.sum()` method.
* Extract the second index level from `matches`, which holds the row indices of the matching records in `restaurants_new`, by using the `.get_level_values()` method.
* Subset `restaurants_new` for rows that are not in `matching_indices`.
* Append `non_dup` to `restaurants`.
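
The four steps above can be tried on a tiny, self-contained example before running them on the real data. This sketch uses hypothetical toy DataFrames and hand-written scores shaped like the output of `Compare.compute()`, so it needs only `pandas`. It joins the frames with `pd.concat()`, since `DataFrame.append()` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical toy data standing in for restaurants and restaurants_new
restaurants = pd.DataFrame(
    {"rest_name": ["fenix", "campanile"],
     "city": ["hollywood", "los angeles"],
     "cuisine_type": ["american", "american"]}
)
restaurants_new = pd.DataFrame(
    {"rest_name": ["kokomo", "feenix"],
     "city": ["la", "hollywood"],
     "cuisine_type": ["american", "american"]}
)

# Hand-written scores (an assumption for illustration, not computed):
# index level 0 = row in restaurants, level 1 = row in restaurants_new
index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)])
potential_matches = pd.DataFrame(
    {"city": [0, 1, 0, 0],
     "cuisine_type": [1, 1, 1, 1],
     "name": [0.0, 1.0, 0.0, 0.0]},
    index=index,
)

# Step 1: keep pairs that agree on all three columns
matches = potential_matches[potential_matches.sum(axis=1) >= 3]

# Step 2: second index level = matching rows of restaurants_new
matching_indices = matches.index.get_level_values(1)

# Step 3: rows of restaurants_new that have no match in restaurants
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Step 4: add the non-duplicate rows to restaurants
full_restaurants = pd.concat([restaurants, non_dup], ignore_index=True)
print(full_restaurants)
```

Only the pair (0, 1) reaches a row sum of 3, so "feenix" is treated as a duplicate of "fenix" and just "kokomo" is added to the combined frame.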

In [5]:
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis=1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)

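AttributeError: 'DataFrame' object has no attribute 'append'

`DataFrame.append()` was removed in pandas 2.0, so the cell above fails on current versions. The next cell uses `pd.concat()` instead.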

In [7]:
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis=1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Concatenate non_dup onto restaurants (DataFrame.append was removed in pandas 2.0)
full_restaurants = pd.concat([restaurants,non_dup], ignore_index=True)
print(full_restaurants)

                     rest_name                  rest_addr               city  \
0    arnie morton's of chicago   435 s. la cienega blv .         los angeles   
1           art's delicatessen       12224 ventura blvd.         studio city   
2                    campanile       624 s. la brea ave.         los angeles   
3                        fenix    8358 sunset blvd. west           hollywood   
4           grill on the alley           9560 dayton way         los angeles   
..                         ...                        ...                ...   
391                        don        1136 westwood blvd.           westwood   
392                      feast        1949 westwood blvd.            west la   
393                   mulberry        17040 ventura blvd.             encino   
394                    jiraffe      502 santa monica blvd       santa monica   
395                   martha's  22nd street grill 25 22nd  st. hermosa beach   

          phone cuisine_type  
0    310