# Fuzzy String Matching

Fuzzy string matching measures how close two strings are to each other rather than testing whether they are identical. A common way to quantify closeness is a distance metric known as the “edit distance”: the minimum number of “edits” required to transform one string into the other.

For example, the edit distance between “London” and “Londin” is one, since replacing the “i” in “Londin” with an “o” leads to an exact match.

There are several ways to calculate the distance or similarity between strings, for instance the Levenshtein distance, the Hamming distance, and the Jaro distance. They differ in which edits they allow and how they count them.
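To make the contrast between these metrics concrete, here is a minimal sketch of the Hamming distance, which counts only substitutions and is therefore defined only for strings of equal length. The function name `hamming_distance` is chosen for this illustration and is not part of any library used in this tutorial.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Insertions and deletions are not allowed, so lengths must match.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


print(hamming_distance("London", "Londin"))  # prints 1
```

Metrics such as Levenshtein relax the equal-length restriction by also counting insertions and deletions.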

## The Levenshtein Distance

The Levenshtein distance is named after Vladimir Levenshtein, who introduced it in 1965 as a measure of the difference between two sequences. It gives the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one string into the other.

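The Levenshtein distance between two strings $a$ and $b$ (of length $|a|$ and $|b|$ respectively) is given by $\operatorname{lev}(a, b)$, where

$$
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}\big(\operatorname{tail}(a), \operatorname{tail}(b)\big) & \text{if } a[0] = b[0], \\
1 + \min
\begin{cases}
\operatorname{lev}\big(\operatorname{tail}(a), b\big) \\
\operatorname{lev}\big(a, \operatorname{tail}(b)\big) \\
\operatorname{lev}\big(\operatorname{tail}(a), \operatorname{tail}(b)\big)
\end{cases} & \text{otherwise.}
\end{cases}
$$

Here $\operatorname{tail}(x)$ is the string $x$ without its first character, and $x[0]$ is that first character.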
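The definition of lev(a, b) is recursive, but it is usually evaluated bottom-up with dynamic programming. The following is a minimal pure-Python sketch of that idea, not the implementation TheFuzz uses internally; the function name `levenshtein` is chosen for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("London", "Londin"))   # prints 1
print(levenshtein("kitten", "sitting"))  # prints 3
```

Keeping only two rows of the table at a time means memory grows with the length of `b` rather than with the product of both lengths.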

## Relevance

TheFuzz is one of the most widely used open-source libraries for fuzzy string matching in Python. It is the successor to FuzzyWuzzy, which SeatGeek first developed to determine whether two ticket listings with similar names were for the same event.

Like FuzzyWuzzy, TheFuzz uses the Levenshtein edit distance to calculate the degree of closeness between two strings. It also provides several scoring functions for measuring string similarity in different situations, as you will see in this tutorial.

## Examples

### String Matching

In [1]:
from thefuzz import fuzz

string1 = "apple"
string2 = "apples"

# Calculate the similarity ratio between the two strings
similarity_ratio = fuzz.ratio(string1, string2)
print(similarity_ratio)  # Output: 91

# Calculate the similarity using a partial ratio
partial_ratio = fuzz.partial_ratio(string1, string2)
print(partial_ratio)  # Output: 100

# Calculate the similarity using token sort ratio
token_sort_ratio = fuzz.token_sort_ratio(string1, string2)
print(token_sort_ratio)  # Output: 91

# Calculate the similarity using token set ratio
token_set_ratio = fuzz.token_set_ratio(string1, string2)
print(token_set_ratio)  # Output: 91

91
100
91
91
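The scores above are on a scale from 0 to 100. As a rough illustration of where 91 and 100 come from, the sketch below uses the standard library's `difflib` as a stand-in: it computes 2·M/T, where M is the number of matched characters and T is the combined length of both strings. This is an assumption-laden approximation rather than TheFuzz's own code; it agrees with the outputs above for these short inputs but may differ on others. The helper names `simple_ratio` and `naive_partial_ratio` are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Round 100 * 2*M/T, with M matched characters and T total length."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def naive_partial_ratio(a: str, b: str) -> int:
    """Best simple_ratio of the shorter string against every
    same-length window of the longer string."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0  # convention chosen for this sketch only
    windows = (
        long_[start:start + len(short)]
        for start in range(len(long_) - len(short) + 1)
    )
    return max(simple_ratio(short, window) for window in windows)


# "apple" and "apples" share 5 characters out of 5 + 6 = 11 in total.
print(simple_ratio("apple", "apples"))         # 2 * 5 / 11 = 0.909..., prints 91
# The window "apple" inside "apples" matches the shorter string exactly.
print(naive_partial_ratio("apple", "apples"))  # prints 100
```

This also explains why the partial ratio is the more forgiving score whenever one string is contained in the other.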




### String Matching in a List

In [2]:
string = "apple"
string_list = ["apples", "banana", "orange", "pineapple"]

# Find the closest match in the list using fuzz ratio
closest_match = max(string_list, key=lambda x: fuzz.ratio(string, x))
print(closest_match)  # Output: "apples"

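# Find the closest match in the list using fuzz partial ratio
# Note: "apple" is a substring of both "apples" and "pineapple", so both score 100;
# max() returns the first of the tied candidates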
closest_partial_match = max(string_list, key=lambda x: fuzz.partial_ratio(string, x))
print(closest_partial_match)  # Output: "apples"


apples
apples
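`max()` returns only the winning string and discards both its score and the runners-up. When the scores matter, for example to spot ties or near-ties, it can help to rank every candidate. The sketch below does this with the standard library's `difflib` as a stand-in scorer (an assumption made so the example has no dependencies; its scores can differ from `fuzz.ratio` on other inputs). The helper names `simple_ratio` and `rank_matches` are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Stand-in similarity score on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def rank_matches(query: str, candidates: list) -> list:
    """Return (candidate, score) pairs sorted from best to worst match."""
    scored = [(candidate, simple_ratio(query, candidate)) for candidate in candidates]
    # sorted() is stable, so tied candidates keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


string_list = ["apples", "banana", "orange", "pineapple"]
for candidate, score in rank_matches("apple", string_list):
    print(candidate, score)
```

With TheFuzz installed, the `thefuzz.process` module provides `extract` and `extractOne`, which return matches together with their scores.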


### String Matching with a Threshold

In [3]:
string1 = "apple"
string2 = "applesauce"

# Calculate the similarity ratio and check if it meets or exceeds a threshold
similarity_ratio = fuzz.ratio(string1, string2)
threshold = 80

if similarity_ratio >= threshold:
    print("Strings are similar")
else:
    print("Strings are not similar")

Strings are not similar


### Duplicate Record Detection

In [4]:
record1 = {"name": "John Doe", "email": "jdoe@example.com"}
record2 = {"name": "John D.", "email": "johndoe@example.com"}
record3 = {"name": "Jane Smith", "email": "janesmith@example.com"}

# Check the similarity ratio between two records based on name
name_similarity_ratio = fuzz.ratio(record1["name"], record2["name"])
threshold = 90

if name_similarity_ratio >= threshold:
    print("Possible duplicate records:", record1, record2)
else:
    print("Records are not duplicates")

# You can apply the same approach for other fields such as email

Records are not duplicates
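Comparing a single field can miss duplicates: the two John records fall below the name threshold even though their emails are very similar. One possible extension, sketched below, scores both the name and the email for every pair of records. It uses the standard library's `difflib` as a stand-in scorer rather than `fuzz.ratio` (an assumption, so scores may differ on other data), and the thresholds of 75 and 85 as well as the helper names are hypothetical values picked for this illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def simple_ratio(a: str, b: str) -> int:
    """Stand-in similarity score on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_possible_duplicates(records, name_threshold=75, email_threshold=85):
    """Return (name, name, name_score, email_score) for each pair of
    records whose name and email both reach their thresholds."""
    flagged = []
    for left, right in combinations(records, 2):
        name_score = simple_ratio(left["name"], right["name"])
        # A shared domain inflates email scores, hence the stricter threshold.
        email_score = simple_ratio(left["email"], right["email"])
        if name_score >= name_threshold and email_score >= email_threshold:
            flagged.append((left["name"], right["name"], name_score, email_score))
    return flagged


records = [
    {"name": "John Doe", "email": "jdoe@example.com"},
    {"name": "John D.", "email": "johndoe@example.com"},
    {"name": "Jane Smith", "email": "janesmith@example.com"},
]

print(find_possible_duplicates(records))
# prints [('John Doe', 'John D.', 80, 91)]
```

Suitable thresholds depend on the data, so they are best tuned against a sample of records whose duplicate status is already known.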
