In [1]:
import numpy as np
import pandas as pd
import datetime as dt
import missingno as msno
import matplotlib.pyplot as plt

## Comparing strings

#### In this chapter
Chapter 4 - Record linkage


#### Minimum edit distance algorithms

In [None]:
Algorithm                 Operations
Damerau-Levenshtein       insertion, substitution, deletion, transposition
Levenshtein               insertion, substitution, deletion
Hamming                   substitution only
Jaro distance             transposition only
...                       ...
Possible packages: nltk, thefuzz, textdistance, ...
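
To make the table above concrete, here is a minimal pure-Python sketch of two of the listed algorithms. It is an illustrative implementation, not the code used by nltk, thefuzz or textdistance; the function names `hamming` and `levenshtein` are chosen for this example only.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ (substitution only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(hamming("reeding", "reading"))      # 1
print(levenshtein("reeding", "reading"))  # 1
print(levenshtein("kitten", "sitting"))   # 3
```

Hamming only works on strings of equal length, which is why Levenshtein-style distances are usually the more practical choice for typo detection.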

#### Simple string comparison

In [2]:
# Let's compare two strings
from thefuzz import fuzz
# Compare 'Reeding' vs 'Reading'
fuzz.WRatio('Reeding', 'Reading')

86

In [3]:
fuzz.WRatio('reading','Reading')

100

#### Partial strings and different orderings

In [4]:
# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')

90

In [5]:
# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')

86
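
One way to see why word order matters little here is to sort the tokens before comparing. The sketch below illustrates that idea with the standard library's `difflib`; it is a simplified stand-in, not thefuzz's actual `WRatio` (which combines several scorers), so its scores will differ from the ones above. The helper names `simple_ratio` and `token_sort_ratio` are defined only for this example.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale, based on difflib."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Sort the words of each string first, so word order no longer matters."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(sorted_a, sorted_b)


# The plain ratio is penalised by the swapped team names
print(simple_ratio("Rockets vs Lakers", "Lakers vs Rockets"))
# After sorting tokens, both strings become "lakers rockets vs"
print(token_sort_ratio("Rockets vs Lakers", "Lakers vs Rockets"))  # 100
```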

#### Comparison with arrays

In [6]:
# Import process
from thefuzz import process
# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets', 'Houson vs Los Angeles', 'Heat vs Bulls'])
process.extract(string, choices, limit = 2)

[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]
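
Conceptually, extracting the best matches means scoring every choice and keeping the top `limit` results. The following is a rough sketch of that pattern, assuming a difflib-based scorer instead of thefuzz's; `similarity` and this `extract` are hypothetical helpers written for illustration, and only the (choice, score, index) tuple shape mirrors the output above.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in scorer: case-insensitive difflib ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract(query, choices, limit=5):
    """Return the `limit` best (choice, score, index) tuples, best first."""
    scored = [(choice, similarity(query, choice), index)
              for index, choice in enumerate(choices)]
    return heapq.nlargest(limit, scored, key=lambda item: item[1])


choices = ['Rockets vs Lakers', 'Lakers vs Rockets',
           'Houson vs Los Angeles', 'Heat vs Bulls']

# Each result unpacks into the matched string, its score and its position
for choice, score, index in extract('Heat vs Bulls', choices, limit=2):
    print(index, score, choice)
```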

#### Collapsing categories with string similarity
Chapter 2

Use .replace() to collapse "eur" into "Europe"

What if there are too many variations?

"EU", "eur", "Europ", "Europa", "Erope", "Evropa"...

String similarity!


#### Collapsing categories with string matching

In [7]:
survey = pd.read_csv('survey.csv')
survey

Unnamed: 0,id,state
0,0,California
1,1,Cali
2,2,Calefornia
3,3,Calefornie
4,4,Californie
5,5,Calfornia
6,6,Calefernia
7,7,New York
8,8,New York City


In [8]:
categories = pd.read_csv('categories.csv')
categories

Unnamed: 0,state
0,California
1,New York


#### Collapsing all of the states

In [9]:
# For each correct category
for state in categories['state']:
    # Find potential matches in states with typos
    matches = process.extract(state, survey['state'], limit = survey.shape[0])
    # For each potential match
    for potential_match in matches:
        # If high similarity score
        if potential_match[1] >= 80:
            # Replace typo with correct category
            survey.loc[survey['state'] == potential_match[0], 'state'] = state

In [10]:
survey['state'].unique()

array(['California', 'New York'], dtype=object)

#### Minimum edit distance
In the video exercise, you saw how minimum edit distance is used to identify how similar two strings are. As a reminder, minimum edit distance is the minimum number of steps needed to get from string A to string B, with the available operations being:

- Insertion of a new character.
- Deletion of an existing character.
- Substitution of an existing character.
- Transposition of two existing consecutive characters.

What is the minimum edit distance from 'sign' to 'sing', and which operation(s) gets you there?

Answer:
- 1, by transposing 'g' with 'n'

- Correct! Transposing the last two letters of 'sign' is the easiest way to get to 'sing'. In the next exercise, you'll use edit distance at scale to remap categories!
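
The answer can be checked with a small implementation that supports all four operations. This is a sketch of the optimal string alignment variant of Damerau-Levenshtein distance (each substring is edited at most once), written for illustration; `osa_distance` is a name chosen for this example, not a library function.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insertion, deletion, substitution and
    transposition of two adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition: the last two characters are swapped
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("sign", "sing"))  # 1
```

Without the transposition step the same pair would cost 2 (two substitutions), which is why the choice of algorithm affects the scores you see.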

### The cutoff point
In this exercise, and throughout this chapter, you'll be working with the restaurants DataFrame which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.

This version of restaurants has been collected from many sources, where the cuisine_type column is riddled with typos, and should contain only italian, american and asian cuisine types. There are so many unique categories that remapping them manually isn't scalable, and it's best to use string similarity instead.

Before doing so, you want to establish the cutoff point for the similarity score using thefuzz's process.extract() function by finding the similarity score of the most distant typo of each category.
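
The idea of "the most distant typo" can be sketched as taking the lowest score among values you have identified as typos, then checking that it sits above the best score of any unrelated value. The example below assumes a difflib-based scorer rather than thefuzz, and the `observed` and `known_typos` values are made up for illustration; they are not taken from the restaurants data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in scorer: difflib ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical column values and the ones flagged as typos by inspection
observed = ["asian", "asiane", "asiaan", "italian", "american"]
known_typos = {"asiane", "asiaan"}

scores = {value: similarity("asian", value) for value in observed}

# The cutoff is the score of the most distant typo
cutoff = min(scores[value] for value in known_typos)

# Sanity check: unrelated categories should score below the cutoff
highest_other = max(score for value, score in scores.items()
                    if value not in known_typos and value != "asian")

print(scores)
print("cutoff:", cutoff, "| best unrelated score:", highest_other)
```

If the best unrelated score were at or above the cutoff, a single threshold could not separate typos from genuinely different categories.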

In [11]:
restaurants = pd.read_csv('restaurants_L2.csv',index_col=0)
restaurants

Unnamed: 0,name,addr,city,phone,type
0,arnie morton's of chicago,435 s. la cienega blv .,los angeles,3102461501,american
1,art's delicatessen,12224 ventura blvd.,studio city,8187621221,american
2,campanile,624 s. la brea ave.,los angeles,2139381447,american
3,fenix,8358 sunset blvd. west,hollywood,2138486677,american
4,grill on the alley,9560 dayton way,los angeles,3102760615,american
...,...,...,...,...,...
331,vivande porta via,2125 fillmore st.,san francisco,4153464430,italian
332,vivande ristorante,670 golden gate ave.,san francisco,4156739245,italian
333,world wrapps,2257 chestnut st.,san francisco,4155639727,american
334,wu kong,101 spear st.,san francisco,4159579300,asian


In [12]:
restaurants = restaurants.rename(columns={'addr':'address','type':'cuisine_type'})

In [13]:
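# Caution: int16 cannot hold 10-digit phone numbers, so the original values
# do not survive this cast; a string dtype would preserve the digits
restaurants['phone'] = restaurants['phone'].astype('int16')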

In [14]:
restaurants.info()

<class 'pandas.core.frame.DataFrame'>
Index: 336 entries, 0 to 335
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          336 non-null    object
 1   address       336 non-null    object
 2   city          336 non-null    object
 3   phone         336 non-null    int16 
 4   cuisine_type  336 non-null    object
dtypes: int16(1), object(4)
memory usage: 13.8+ KB


In [15]:
# Import process from thefuzz
from thefuzz import process

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))
print('\n')
# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))
print('\n')
# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))


[('asian', 100), ('italian', 67), ('american', 62), ('mexican', 50), ('steakhouses', 40), ('cajun', 40), ('southwestern', 36), ('southern', 31), ('coffeebar', 26)]


[('american', 100), ('mexican', 80), ('cajun', 68), ('asian', 62), ('italian', 53), ('southwestern', 49), ('southern', 38), ('coffeebar', 24), ('steakhouses', 21)]


[('italian', 100), ('asian', 67), ('american', 53), ('mexican', 43), ('cajun', 33), ('southwestern', 33), ('steakhouses', 33), ('southern', 27), ('coffeebar', 12)]


### Remapping categories II
In the last exercise, you determined that the distance cutoff point for remapping typos of 'american', 'asian', and 'italian' cuisine types stored in the cuisine_type column should be 80.

In this exercise, you're going to put it all together by finding matches with similarity scores equal to or higher than 80 using thefuzz's process.extract() function for each correct cuisine type, and replacing these matches with it. Remember, when comparing a string with an array of strings using process.extract(), the output is a list of tuples where each is formatted like:

(closest match, similarity score, index of match)

The restaurants DataFrame is in your environment, and you have access to a categories list containing the correct cuisine types ('italian', 'asian', and 'american').
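
The remapping logic described above can be sketched without pandas or thefuzz. This is a simplified illustration, assuming a difflib-based scorer and a plain list in place of the cuisine_type column; `similarity`, `extract`, `remap` and the sample `cuisines` list are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in scorer: difflib ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def extract(query, values):
    """Return (closest match, similarity score, index of match) tuples, best first."""
    scored = [(value, similarity(query, value), index)
              for index, value in enumerate(values)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def remap(values, categories, cutoff=80):
    """Replace every value scoring >= cutoff against a category with that category."""
    cleaned = list(values)
    for category in categories:
        for match, score, index in extract(category, cleaned):
            if score >= cutoff:
                cleaned[index] = category
    return cleaned


# Made-up values with typos, for illustration only
cuisines = ["italian", "italiann", "asian", "asiane",
            "american", "americann", "coffeebar"]
print(remap(cuisines, ["italian", "asian", "american"]))
```

Values that score below the cutoff for every category, such as 'coffeebar' here, are left untouched.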

In [16]:
# Inspect the unique values of the cuisine_type column
print(restaurants['cuisine_type'].unique())

['american' 'asian' 'italian' 'coffeebar' 'mexican' 'southwestern'
 'steakhouses' 'southern' 'cajun']


In [17]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Inspect the first 5 matches
print(matches[0:5])

[('italian', 100, 6), ('italian', 100, 10), ('italian', 100, 11), ('italian', 100, 16), ('italian', 100, 19)]


In [18]:
# Iterate through the list of matches to italian
for match in matches:
  # Check whether the similarity score is greater than or equal to 80
  if match[1] >= 80:
    # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
    restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'

In [19]:
restaurants['cuisine_type'].value_counts()

cuisine_type
american        137
italian          78
asian            72
coffeebar        25
mexican           9
southern          8
steakhouses       5
southwestern      1
cajun             1
Name: count, dtype: int64

In [20]:
restaurants.info()

<class 'pandas.core.frame.DataFrame'>
Index: 336 entries, 0 to 335
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          336 non-null    object
 1   address       336 non-null    object
 2   city          336 non-null    object
 3   phone         336 non-null    int16 
 4   cuisine_type  336 non-null    object
dtypes: int16(1), object(4)
memory usage: 13.8+ KB


In [21]:
categories = ['italian', 'asian', 'american']

In [22]:
# Iterate through categories
for cuisine in categories:
  # Create a list of matches, comparing cuisine with the cuisine_type column
  matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

  # Iterate through the list of matches
  for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
      # If it is, select all rows where the cuisine_type is spelled this way, and set only that column to the correct cuisine
      restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine

# Inspect the final result
print(restaurants['cuisine_type'].unique())

['american' 'asian' 'italian' 'coffeebar' 'southwestern' 'steakhouses'
 'southern' 'cajun']



- Tremendous work! All your cuisine types are properly mapped! Now you'll build on string similarity, by jumping into record linkage!

## Generating pairs

In [23]:
import numpy as np
import pandas as pd
import recordlinkage

In [24]:
census_A = pd.read_csv('census_a.csv')
census_B = pd.read_csv('census_b.csv')

In [25]:
census_A

Unnamed: 0,rec_id,given_name,surname,date_of_birth,suburb,state,address_id
0,rec-1070-org,michaela,neumann,19151111,winston hills,cal,stanley street
1,rec-1016-org,courtney,painter,19161214,richlands,txs,pinkerton circuit


In [26]:
census_B

Unnamed: 0,rec_id,given_name,surname,date_of_birth,suburb,state,address_id
0,rec-561-dup-0,elton,,19651013,windermere,ny,light setreet
1,rec-2642-dup-0,mitchell,maxon,19390212,north ryde,cal,edkins street


In [27]:
# Create indexing object
indexer = recordlinkage.Index()

# Generate pairs blocked on state
indexer.block('state')
pairs = indexer.index(census_A, census_B)
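
Blocking reduces the number of candidate pairs by only pairing rows that share a key. The sketch below shows the idea in plain Python, using the two sample rows from each census table above; `full_pairs` and `blocked_pairs` are hypothetical helpers written for illustration, not part of recordlinkage.

```python
from collections import defaultdict
from itertools import product


def full_pairs(left, right):
    """Every (left index, right index) combination."""
    return list(product(range(len(left)), range(len(right))))


def blocked_pairs(left, right, key):
    """Only pair records whose value for `key` is identical."""
    buckets = defaultdict(list)
    for j, record in enumerate(right):
        buckets[record[key]].append(j)
    return [(i, j)
            for i, record in enumerate(left)
            for j in buckets.get(record[key], [])]


census_a = [{"surname": "neumann", "state": "cal"},
            {"surname": "painter", "state": "txs"}]
census_b = [{"surname": None, "state": "ny"},
            {"surname": "maxon", "state": "cal"}]

print(len(full_pairs(census_a, census_b)))          # 4
print(blocked_pairs(census_a, census_b, "state"))   # [(0, 1)]
```

The trade-off is that a typo in the blocking column itself (for example 'cal' vs 'ca') prevents the pair from ever being compared.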

#### Comparing the DataFrames

In [28]:
# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object
compare_cl = recordlinkage.Compare()

# Find exact matches for pairs of date_of_birth and state
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')

# Find similar matches for pairs of surname and address_id using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_id', 'address_id', threshold=0.85, label='address_id')

# Find matches
potential_matches = compare_cl.compute(pairs, census_A, census_B)
potential_matches

Unnamed: 0,Unnamed: 1,date_of_birth,state,surname,address_id
0,1,0,1,0.0,0.0
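
Each row of the result is a comparison vector: one score per compared column for a candidate pair. The following sketch rebuilds the row above by hand, assuming a difflib ratio in place of the string similarity recordlinkage computes; `exact` and `similar` are illustrative helpers, and the record values come from the sample rows shown earlier.

```python
from difflib import SequenceMatcher


def exact(a, b) -> int:
    """1 when both values are present and identical, else 0."""
    return int(a is not None and a == b)


def similar(a, b, threshold=0.85) -> float:
    """1.0 when the similarity ratio reaches the threshold, else 0.0."""
    if a is None or b is None:
        return 0.0
    return float(SequenceMatcher(None, a, b).ratio() >= threshold)


record_a = {"date_of_birth": 19151111, "state": "cal",
            "surname": "neumann", "address_id": "stanley street"}
record_b = {"date_of_birth": 19390212, "state": "cal",
            "surname": "maxon", "address_id": "edkins street"}

features = {
    "date_of_birth": exact(record_a["date_of_birth"], record_b["date_of_birth"]),
    "state": exact(record_a["state"], record_b["state"]),
    "surname": similar(record_a["surname"], record_b["surname"]),
    "address_id": similar(record_a["address_id"], record_b["address_id"]),
}

print(features)
# The row sum counts how many columns agree for this pair
print(sum(features.values()))
```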


#### Finding the only pairs we want

In [29]:
potential_matches[potential_matches.sum(axis = 1) >= 1]

Unnamed: 0,Unnamed: 1,date_of_birth,state,surname,address_id
0,1,0,1,0.0,0.0


### Pairs of restaurants
In the last lesson, you cleaned the restaurants dataset to make it ready for building a restaurant recommendation engine. You have a new DataFrame named restaurants_new, scraped from a new data source, with new restaurants to train your model on.

You've already cleaned the cuisine_type and city columns using the techniques learned throughout the course. However, you saw duplicates with typos in restaurant names, which require record linkage rather than joins with restaurants.

In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between restaurants and restaurants_new. Both DataFrames, pandas and recordlinkage are in your environment.

In [30]:
restaurants = pd.read_csv('restaurants_L2.csv',index_col=0)
restaurants

Unnamed: 0,name,addr,city,phone,type
0,arnie morton's of chicago,435 s. la cienega blv .,los angeles,3102461501,american
1,art's delicatessen,12224 ventura blvd.,studio city,8187621221,american
2,campanile,624 s. la brea ave.,los angeles,2139381447,american
3,fenix,8358 sunset blvd. west,hollywood,2138486677,american
4,grill on the alley,9560 dayton way,los angeles,3102760615,american
...,...,...,...,...,...
331,vivande porta via,2125 fillmore st.,san francisco,4153464430,italian
332,vivande ristorante,670 golden gate ave.,san francisco,4156739245,italian
333,world wrapps,2257 chestnut st.,san francisco,4155639727,american
334,wu kong,101 spear st.,san francisco,4159579300,asian


In [31]:
restaurants_new = pd.read_csv('restaurants_L2_dirty.csv',index_col=0)
restaurants_new

Unnamed: 0,name,addr,city,phone,type
0,kokomo,6333 w. third st.,la,2139330773,american
1,feenix,8358 sunset blvd. west,hollywood,2138486677,american
2,parkway,510 s. arroyo pkwy .,pasadena,8187951001,californian
3,r-23,923 e. third st.,los angeles,2136877178,japanese
4,gumbo,6333 w. third st.,la,2139330358,cajun/creole
...,...,...,...,...,...
77,feast,1949 westwood blvd.,west la,3104750400,chinese
78,mulberry,17040 ventura blvd.,encino,8189068881,pizza
79,matsuhissa,129 n. la cienega blvd.,beverly hills,3106599639,asian
80,jiraffe,502 santa monica blvd,santa monica,3109176671,californian


In [32]:
# Create an indexer object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine type (the 'type' column)
indexer.block('type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

### Similar restaurants
In the last exercise, you generated pairs between restaurants and restaurants_new in an effort to cleanly merge both DataFrames using record linkage.

When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.

Now that your pairs have been generated and stored in pairs, you will find exact matches in the city and cuisine_type columns between each pair, and similar strings for each pair in the rest_name column (cuisine_type and rest_name are called type and name in the DataFrames loaded above). Both DataFrames, pandas and recordlinkage are in your environment.

In [33]:
# Create a comparison object
comp_cl = recordlinkage.Compare()

In [34]:
# Find exact matches on city and cuisine type (the 'type' column)
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('type', 'type', label = 'type')

# Find similar matches of restaurant name (the 'name' column)
comp_cl.string('name', 'name', label='name', threshold = 0.8)

<Compare>

In [35]:
# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)

        city  type  name
0   0      0     1   0.0
    1      0     1   0.0
    7      0     1   0.0
    12     0     1   0.0
    13     0     1   0.0
...      ...   ...   ...
334 79     0     1   0.0
335 26     0     1   0.0
    65     0     1   0.0
    71     0     1   0.0
    79     0     1   0.0

[3631 rows x 3 columns]


## Linking DataFrames

In [None]:
# Import recordlinkage and generate full pairs
import recordlinkage

indexer = recordlinkage.Index()
indexer.block('state')
full_pairs = indexer.index(census_A, census_B)

# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_id', 'address_id', threshold=0.85, label='address_id')

# Generate potential matches
potential_matches = compare_cl.compute(full_pairs, census_A, census_B)

# Isolate matches with matching values for 3 or more columns
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get index for matching census_B rows only
duplicate_rows = matches.index.get_level_values(1)
print(duplicate_rows)

# Finding new rows in census_B
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Link the DataFrames (DataFrame.append was removed in pandas 2.0)
full_census = pd.concat([census_A, census_B_new])
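
The linking step boils down to three moves: keep pairs whose row sum reaches the threshold, collect the second-level indices as duplicates, and append only the remaining rows. Here is a dependency-free sketch of that logic; the `link` helper and the `scores` values are hypothetical, and only the restaurant names are borrowed from the tables above.

```python
def link(left, right, scores, min_matching=3):
    """Append to `left` the rows of `right` that are not flagged as duplicates.

    `scores` maps (left index, right index) to a dict of per-column scores.
    """
    # Second element of each key plays the role of index level 1
    duplicate_right = {
        right_index
        for (left_index, right_index), row in scores.items()
        if sum(row.values()) >= min_matching
    }
    new_rows = [record for index, record in enumerate(right)
                if index not in duplicate_right]
    return left + new_rows


restaurants = [{"name": "fenix"}, {"name": "campanile"}]
restaurants_new = [{"name": "feenix"}, {"name": "kokomo"}]

# Made-up comparison scores for two candidate pairs
scores = {
    (0, 0): {"city": 1, "type": 1, "name": 1.0},  # fenix vs feenix: duplicate
    (1, 1): {"city": 0, "type": 1, "name": 0.0},  # campanile vs kokomo: new
}

full_restaurants = link(restaurants, restaurants_new, scores)
print([record["name"] for record in full_restaurants])
# ['fenix', 'campanile', 'kokomo']
```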

### Linking them together!
In the last lesson, you finished the bulk of the work on your effort to link restaurants and restaurants_new. You generated the different pairs of potentially matching rows, searched for exact matches in the cuisine_type and city columns, and compared for similar strings in the rest_name column. You stored the DataFrame containing the scores in potential_matches.

Now it's finally time to link both DataFrames. You will do so by first extracting all row indices of restaurants_new that are matching across the columns mentioned above from potential_matches. Then you will subset restaurants_new on these indices, then append the non-duplicate values to restaurants. All DataFrames are in your environment, alongside pandas imported as pd.

In [None]:
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants (DataFrame.append was removed in pandas 2.0)
full_restaurants = pd.concat([restaurants, non_dup])
print(full_restaurants)

- Awesome work! Linking the DataFrames is arguably the most straightforward step of record linkage. You are now ready to get started on that recommendation engine!