## Researching Fuzzywuzzy

#### Info gathered from https://github.com/seatgeek/fuzzywuzzy and https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49

##### The Levenshtein distance is a metric that measures how far apart two strings are. In other words, it is the minimum number of single-character edits needed to change one string into the other. These edits can be insertions, deletions, or substitutions. The metric is named after Vladimir Levenshtein, who first considered it in 1965.
##### FuzzyWuzzy is a Python library for fuzzy string matching, which is the process of finding strings that approximately match a given pattern. It uses the Levenshtein distance to calculate the differences between sequences. FuzzyWuzzy was developed and open-sourced by SeatGeek.
##### FuzzyWuzzy's fuzz module has the functions ratio, partial_ratio, token_sort_ratio, and token_set_ratio. There is also a separate process module.
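
##### To make the Levenshtein distance described above concrete, here is a minimal pure-Python sketch of the standard dynamic-programming approach. The function name `levenshtein` is only for illustration; this is not FuzzyWuzzy's own code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))            # 3
print(levenshtein("sammy smith", "sammy smith."))  # 1
```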

In [30]:
import fuzzywuzzy
from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd

## Example of simple ratio usage:

In [7]:
Str1 = "Sammy Smith"
Str2 = "sammy smith."
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)

96
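
##### The 96 above can be reproduced by hand. This is a sketch that assumes the score is the standard-library `difflib.SequenceMatcher` ratio scaled to 0-100, which is what FuzzyWuzzy falls back to when python-Levenshtein is not installed. The helper name `simple_ratio` is made up for this note.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity as an integer from 0 to 100:
    2 * (number of matching characters) / (total characters in both strings)."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# 11 characters match, and the two strings have 11 + 12 = 23 characters
# in total, so the score is 2 * 11 / 23 = 0.9565..., which rounds to 96.
print(simple_ratio("sammy smith", "sammy smith."))  # 96
```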


##### The output, 96, is the similarity score on a scale of 0 to 100. Without the period at the end of Str2, the score would have been 100. Here is another example using partial ratio:

In [11]:
Str1 = "Sammy Smith"
Str2 = "sammy smith."
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)

96
100


In [12]:
Str1 = "Debra Mae Walkins"
Str2 = "Walkins"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)

58
100


##### fuzz.partial_ratio() detects that both strings refer to the last name Walkins, so it yields a score of 100. It does this using "optimal partial" logic.
##### If the shorter string has length k and the longer string has length m, the algorithm looks for the length-k substring of the longer string that best matches the shorter string, and returns the score of that substring.
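
##### The "best matching length-k substring" idea can be sketched as a brute-force sliding window. This is a simplified illustration, not FuzzyWuzzy's implementation (which narrows down the candidate windows first), and the names `simple_ratio` and `partial_ratio_sketch` are made up for this note.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """0-100 similarity based on difflib's SequenceMatcher ratio."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Score the shorter string against every length-k window of the
    longer string and keep the best score."""
    shorter, longer = sorted((s1, s2), key=len)
    k = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + k])
        for start in range(len(longer) - k + 1)
    )


# The window "walkins" at the end of the longer string matches exactly.
print(partial_ratio_sketch("debra mae walkins", "walkins"))  # 100
```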

##### What happens when the strings contain the same words, but in a different order?
## Example of token function usage:

In [17]:
Str1 = "Debra Mae Walkins"
Str2 = "Walkins Mae Debra"

Ratio = fuzz.ratio(Str1.lower(),Str2.lower())

Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())

Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)

print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

41
41
100


In [32]:
Str1 = "Debra Mae Walkins"
Str2 = "Walkins M Debra"

Ratio = fuzz.ratio(Str1.lower(),Str2.lower())

Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())

Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)

print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

44
47
94


##### The fuzz.token functions are probably better for us than ratio and partial_ratio. They tokenize the strings and preprocess them by converting them to lower case and removing punctuation.
##### In fuzz.token_sort_ratio(), the tokens are sorted alphabetically and then joined back together. After that, a simple fuzz.ratio() is applied to obtain the similarity score. This allows reordered strings, such as the names in this example, to be marked as the same.
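
##### The sort-then-compare steps described above can be sketched in plain Python. This only approximates FuzzyWuzzy's preprocessing (the regular expression is an assumption), and `normalize`, `simple_ratio`, and `token_sort_ratio_sketch` are made-up names.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """0-100 similarity based on difflib's SequenceMatcher ratio."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def normalize(s: str) -> str:
    """Lower-case the string and replace punctuation with spaces."""
    return re.sub(r"[\W_]+", " ", s.lower()).strip()


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Sort each string's tokens alphabetically, rejoin them, then compare."""
    sorted1 = " ".join(sorted(normalize(s1).split()))
    sorted2 = " ".join(sorted(normalize(s2).split()))
    return simple_ratio(sorted1, sorted2)


# Both strings become "debra mae walkins" after sorting.
print(token_sort_ratio_sketch("Debra Mae Walkins", "Walkins Mae Debra"))  # 100
```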

##### What if the two strings have widely differing lengths? In that case we should use fuzz.token_set_ratio(), which ignores duplicated words. Example:

In [29]:
Str1 = "2315 West West Denver St WI wi Milwaukee"
Str2 = "Denver St. wi Milwaukee"

Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)

print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

70
87
71
100
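
##### A rough sketch of why token_set_ratio gives 100 here: the tokens are turned into sets (so duplicates disappear), and the shared tokens are compared against each string's shared-plus-leftover tokens, keeping the best score. This follows my understanding of the approach and is not FuzzyWuzzy's actual code; all function names are made up.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """0-100 similarity based on difflib's SequenceMatcher ratio."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def normalize(s: str) -> str:
    """Lower-case the string and replace punctuation with spaces."""
    return re.sub(r"[\W_]+", " ", s.lower()).strip()


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    tokens1 = set(normalize(s1).split())  # sets drop duplicated words
    tokens2 = set(normalize(s2).split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    # Keep the best of the three pairwise comparisons.
    return max(
        simple_ratio(common, combined1),
        simple_ratio(common, combined2),
        simple_ratio(combined1, combined2),
    )


# Every token of the second string appears in the first, so the shared
# tokens and the second string's combined form are identical.
print(token_set_ratio_sketch(
    "2315 West West Denver St WI wi Milwaukee",
    "Denver St. wi Milwaukee",
))  # 100
```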


#### FuzzyWuzzy also has a module called process, which lets you find the string with the highest similarity to a query out of a list of strings. I don't think that this is important for our use case...right?
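
##### For reference, picking the best match out of a list boils down to scoring every candidate and keeping the highest. Below is a plain-Python sketch of that idea using difflib. The helper `best_match` is hypothetical; FuzzyWuzzy's own `process.extractOne` uses a different default scorer, so its scores may differ.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """0-100 similarity based on difflib's SequenceMatcher ratio."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def best_match(query: str, choices: list) -> tuple:
    """Return (choice, score) for the choice most similar to the query."""
    scored = [
        (choice, simple_ratio(query.lower(), choice.lower()))
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])


names = ["Debra Mae Walkins", "Sammy Smith", "Sam Smithers"]
print(best_match("sammy smith.", names))  # ('Sammy Smith', 96)
```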