**Table of contents**<a id='toc0_'></a>    
- [Import Statements](#toc1_1_)    
- [String similarity and minimum edit distance](#toc1_2_)    
- [Levenshtein algorithm for comparing two strings](#toc1_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

### <a id='toc1_1_'></a>[Import Statements](#toc0_)

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pyreadr
import datetime as pydt
import missingno as msno
from thefuzz import fuzz, process

### <a id='toc1_2_'></a>[String similarity and minimum edit distance](#toc0_)

Minimum edit distance is a systematic way to measure how close two strings are. It is the smallest number of operations needed to turn the first string into the second, where the available operations are inserting a character, deleting a character, substituting one character for another, and transposing two adjacent characters. Several algorithms are built on edit distance; they differ in which operations they allow, how much weight they give each operation, which kinds of strings they suit, and more.
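
To make the definition concrete, here is a minimal dynamic-programming sketch of edit distance. It assumes every operation costs 1, and the `allow_transpositions` flag (a hypothetical name chosen for this example) switches between plain Levenshtein distance and a variant that also counts swapping two adjacent characters as one step. It is an illustration only, not the implementation any particular library uses.

```python
def edit_distance(source, target, allow_transpositions=False):
    """Minimum number of unit-cost edits that turn `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = edits needed to turn source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Optional: swap of two adjacent characters counts as one edit
            if (
                allow_transpositions
                and i > 1
                and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("kitten", "sitting"))                   # 3
print(edit_distance("ab", "ba"))                            # 2
print(edit_distance("ab", "ba", allow_transpositions=True)) # 1
```

The last two calls show how the choice of operations changes the result: without transpositions, swapping `ab` to `ba` takes two edits, and with them it takes one.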

### <a id='toc1_3_'></a>[Levenshtein algorithm for comparing two strings](#toc0_)

We'll compare strings with the `thefuzz` package, which is based on Levenshtein distance, a general-purpose edit distance built on insertions, deletions, and substitutions. For usage details, see https://github.com/seatgeek/thefuzz.

> `thefuzz.fuzz.WRatio(s1, s2, force_ascii=True, full_process=True)` returns a measure of the two sequences' similarity between 0 and 100, combining several matching algorithms. `WRatio` is highly robust to partial matches and to differences in word order.

In [2]:
fuzz.WRatio("Houston Rockets vs Los Angeles Lakers", "Lakers vs Rockets")

86
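
One reason a scorer can tolerate different word orders is that it normalizes the tokens before comparing. The sketch below illustrates that idea with the standard library's `difflib` as a stand-in scorer; `plain_similarity` and `token_sort_similarity` are hypothetical helper names, and this is not how `WRatio` is implemented. `thefuzz` exposes a scorer in the same spirit as `fuzz.token_sort_ratio`.

```python
from difflib import SequenceMatcher


def plain_similarity(a, b):
    """0-100 similarity of two strings compared exactly as given."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_similarity(a, b):
    """0-100 similarity after lowercasing and sorting the words of each string."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return plain_similarity(norm_a, norm_b)


left, right = "Lakers vs Rockets", "Rockets vs Lakers"
print(plain_similarity(left, right))       # below 100: word order differs
print(token_sort_similarity(left, right))  # 100: same words once sorted
```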

> `thefuzz.process.extract(query, choices, processor=None, scorer=None, limit=5, score_cutoff=0)` finds the best matches in a list or dictionary of choices and returns a list of tuples, each containing a match and its score. If a dictionary is used, each tuple also includes the key of the match. If `score_cutoff` is set, only matches with a similarity score greater than `score_cutoff` are returned.
>
>> `thefuzz.process.extractOne(query, choices, processor=None, scorer=None, score_cutoff=0)` works like `extract` but returns only the best match.

In [3]:
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(["Rockets vs Lakers", "Lakers vs Rockets", "Houston vs Los Angeles", "Heat vs Bulls"])

process.extract(string, choices)

[('Rockets vs Lakers', 86, 0),
 ('Lakers vs Rockets', 86, 1),
 ('Houston vs Los Angeles', 86, 2),
 ('Heat vs Bulls', 86, 3)]
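
The ranking step itself is simple: score every choice against the query, sort by score, and keep the top few. The following sketch reproduces that shape with a `difflib`-based scorer so it runs without extra packages. The names `simple_score`, `rank_choices`, and `min_score` are hypothetical, the scores will differ from those of `thefuzz`, and the inclusive `min_score` filter is a choice made for this sketch rather than a statement about the library's behavior.

```python
from difflib import SequenceMatcher


def simple_score(query, choice):
    """Stand-in scorer: case-insensitive 0-100 similarity."""
    return round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())


def rank_choices(query, choices, scorer=simple_score, limit=5, min_score=0):
    """Return up to `limit` (choice, score, position) tuples, best score first."""
    scored = [(choice, scorer(query, choice), idx) for idx, choice in enumerate(choices)]
    kept = [item for item in scored if item[1] >= min_score]
    # sorted() is stable, so equal scores keep their original order
    return sorted(kept, key=lambda item: item[1], reverse=True)[:limit]


options = ["Heat vs Bulls", "Lakers vs Rockets", "Rockets vs Lakers"]
for match in rank_choices("Lakers vs Rockets", options, limit=2):
    print(match)
```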

- Let's add a new column called `cuisine_type` to the `df_restaurants` DataFrame to store the cuisine category of each restaurant.

In [4]:
df_restaurants = pyreadr.read_r("../datasets/zagat.rds")[None]

In [5]:
df_restaurants.head()

Unnamed: 0,id,name,addr,city,phone,type,class
0,0,apple pan the,10801 w. pico blvd.,los angeles,310-475-3585,american,534
1,1,asahi ramen,2027 sawtelle blvd.,los angeles,310-479-2231,noodle shops,535
2,2,baja fresh,3345 kimber dr.,los angeles,805-498-4049,mexican,536
3,3,belvedere the,9882 little santa monica blvd.,los angeles,310-788-2306,pacific new wave,537
4,4,benita's frites,1433 third st. promenade,los angeles,310-458-2889,fast food,538


In [6]:
df_restaurants.type.value_counts()

type
american ( new )            25
italian                     19
californian                 18
steakhouses                 17
chinese                     13
american                    12
french ( new )              12
seafood                     12
mexican                     11
japanese                    11
thai                        10
hamburgers                  10
french bistro                9
french ( classic )           9
continental                  9
diners                       8
indian                       7
southern/soul                6
coffee shops                 6
pizza                        5
vietnamese                   4
bbq                          4
pacific new wave             4
mediterranean                4
cafeterias                   4
southwestern                 4
eclectic                     4
delis                        4
hot dogs                     4
asian                        3
vegetarian                   3
cuban                        3
ukr

In [7]:
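# Start with an empty (all-NaN) object column; the loop below fills in matches
df_restaurants["cuisine_type"] = pd.Series(np.nan, index=df_restaurants.index, dtype="object")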

In [8]:
cuisine_categories = ["american", "asian", "indian", "chinese", "japanese", "italian", "french", "mexican", "caribbean"]

In [9]:
# process.extract("italian", df_restaurants["type"], limit=len(df_restaurants))

In [10]:
# Iterate through the target categories
for cuisine in cuisine_categories:
  # Score every value in the `type` column against this category
  matches = process.extract(cuisine, df_restaurants['type'], limit=len(df_restaurants))

  # Each match is a (value, score, index) tuple
  for match in matches:
    # Keep matches whose similarity score is at least 80
    if match[1] >= 80:
      # Label every row with this `type` value; later categories overwrite earlier ones
      df_restaurants.loc[df_restaurants['type'] == match[0], "cuisine_type"] = cuisine


In [11]:
# Inspect the final result
df_restaurants.cuisine_type.value_counts(dropna=False)

cuisine_type
NaN          171
french        31
american      26
mexican       23
italian       20
chinese       13
japanese      11
indian         9
asian          4
caribbean      2
Name: count, dtype: int64

**`Note:`** This is just a demo of how string comparison can be useful. The final dataset has many NaN values because the restaurant `type` values vary widely and our `cuisine_categories` list covers only a fraction of them.
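
In a loop like the one above, a `type` value that clears the threshold for more than one category ends up with whichever category is processed last. One alternative, sketched below under stated assumptions, is to score each distinct label once and keep only its single best category. The helpers `token_score` and `assign_categories` are hypothetical, and the `difflib`-based word-level scorer is a stand-in, so its scores will not match those of `thefuzz`.

```python
from difflib import SequenceMatcher


def token_score(label, category):
    """Best 0-100 similarity between `category` and any single word of `label`."""
    tokens = label.lower().split() or [""]
    return max(
        round(100 * SequenceMatcher(None, token, category.lower()).ratio())
        for token in tokens
    )


def assign_categories(labels, categories, threshold=80):
    """Map each distinct label to its best-scoring category, or None below threshold."""
    mapping = {}
    for label in set(labels):
        best_category = max(categories, key=lambda category: token_score(label, category))
        best_score = token_score(label, best_category)
        mapping[label] = best_category if best_score >= threshold else None
    return mapping


raw_types = ["american ( new )", "french bistro", "thai", "american"]
mapping = assign_categories(raw_types, ["american", "french", "mexican"])
for label in raw_types:
    print(label, "->", mapping[label])
```

Because the mapping is keyed by distinct labels, it could then be applied to a whole column in one step, for example with `df_restaurants["type"].map(mapping)`.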