# The Fuzz
- `thefuzz` uses the Levenshtein edit distance to measure how close two strings are.
    - It also provides several scoring functions suited to different string-matching situations.
- [Reference](https://www.datacamp.com/tutorial/fuzzy-string-python)
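
To make the idea of edit distance concrete, here is a minimal pure-Python sketch of the Levenshtein distance. It is an illustration only and is not how `thefuzz` computes it internally; the function name `levenshtein` is made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))                 # 3
print(levenshtein("Kurtis Pykes", "Kurtis K D Pykes"))  # 4
```

The second call needs four insertions (`K`, a space, `D`, a space), which is why the distance is 4.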

In [None]:
# !conda install thefuzz

In [1]:
from thefuzz import fuzz

## String Matching Methods
|Technique|Description|Code Example|
|:------:|:------|:------|
|Simple Ratio|Scores the similarity of the full strings; character order matters.|`fuzz.ratio(name, full_name)`|
|Partial Ratio|Scores the shorter string against the best-matching substring of the longer string.|`fuzz.partial_ratio(name, full_name)`|
|Token Sort Ratio|Sorts the words in each string before comparing, so word order is ignored.|`fuzz.token_sort_ratio(full_name_reordered, full_name)`|
|Token Set Ratio|Separates the tokens shared by both strings from the rest before comparing, so extra words in one string matter less.|`fuzz.token_set_ratio(name, full_name)`|

### Simple Ratio 
- `ratio()` returns a similarity score from 0 to 100 based on the edit distance between the two strings exactly as given, so character order matters.
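
As a rough illustration of how a 0 to 100 score can come out of a string comparison, the sketch below uses the standard library's `difflib.SequenceMatcher`, which scores two strings as `2 * M / T` (`M` matched characters, `T` total characters in both strings). This is a stand-in under that assumption and not necessarily the exact computation `fuzz.ratio()` performs; `simple_ratio` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of a and b as an integer from 0 to 100 (difflib-based sketch)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# 12 matched characters out of 12 + 16 = 28 total: 2 * 12 / 28 is about 0.857
print(simple_ratio("Kurtis Pykes", "Kurtis K D Pykes"))  # 86
```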


In [5]:
# Check the similarity score
name = "Kurtis Pykes"
full_name = "Kurtis K D Pykes"

print(f"Similarity score: {fuzz.ratio(name, full_name)}")


Similarity score: 86


### Partial Ratio
- `partial_ratio()` measures how well the shorter string matches some part of the longer one.
    - It takes the **shorter** string (here, `name`) and compares it against **substrings** of the same length in the longer string (here, `full_name`).
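
The substring comparison described above can be sketched with a sliding window. This is a simplified illustration built on `difflib.SequenceMatcher`; the real `partial_ratio()` may choose its candidate substrings differently, and `sliding_partial_ratio` is a hypothetical name used only here.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against every
    same-length window of the longer string (illustrative sketch)."""
    shorter, longer = sorted((a, b), key=len)
    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, candidate).ratio())
    return round(100 * best)


# The shorter string appears intact at the start of the longer one
print(sliding_partial_ratio("Kurtis Pykes", "Kurtis Pykes K D"))  # 100
# No window of "Kurtis K D Pykes" equals "Kurtis Pykes", so the score is below 100
print(sliding_partial_ratio("Kurtis Pykes", "Kurtis K D Pykes"))
```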

In [7]:
print(f"Similarity score: {fuzz.partial_ratio(name, full_name)}")

Similarity score: 67


- Since order matters in partial ratio, the score drops to 67 here: "K D" sits in the middle of `full_name`, so no substring of it matches `name` exactly.
- To get a score of 100, "K D" has to move to the end of the string so that `name` appears intact inside `full_name`.

In [8]:
# Order matters with partial ratio
# Check the similarity score
name = "Kurtis Pykes"
full_name = "Kurtis Pykes K D" # move K D to the end 

print(f"Partial ratio similarity score: {fuzz.partial_ratio(name, full_name)}")

# But the reordering does not affect the simple ratio here
print(f"Simple ratio similarity score: {fuzz.ratio(name, full_name)}")

Partial ratio similarity score: 100
Simple ratio similarity score: 86


### Token Sort Ratio
- `token_sort_ratio()` ignores word order: it sorts the tokens in each string before comparing them, so strings that contain the same words in a different order, as in the example above, still match.
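
A minimal sketch of the sorting step, assuming tokens are simply lowercased, whitespace-separated words (the library's own preprocessing may do more, such as stripping punctuation). `sort_tokens` is a hypothetical helper, not part of `thefuzz`.

```python
def sort_tokens(text: str) -> str:
    """Lowercase the text, split it on whitespace, and rejoin the words in sorted order."""
    return " ".join(sorted(text.lower().split()))


a = sort_tokens("Kurtis K D Pykes")
b = sort_tokens("Kurtis Pykes K D")

print(a)       # d k kurtis pykes
print(a == b)  # True
```

Once both strings are normalized this way, an ordinary ratio comparison sees them as identical, which is why reordered names can score 100.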

In [9]:
# Check the similarity score
full_name = "Kurtis K D Pykes"
full_name_reordered = "Kurtis Pykes K D"

# Order does not matter for token sort ratio
print(f"Token sort ratio similarity score: {fuzz.token_sort_ratio(full_name_reordered, full_name)}")

# Order matters for partial ratio
print(f"Partial ratio similarity score: {fuzz.partial_ratio(full_name, full_name_reordered)}")

# The reordering does not affect the simple ratio of name vs. full_name
print(f"Simple ratio similarity score: {fuzz.ratio(name, full_name)}")


Token sort ratio similarity score: 100
Partial ratio similarity score: 75
Simple ratio similarity score: 86


- Words that appear in only one of the strings still lower the token sort ratio.

In [None]:
# Check the similarity score
name = "Kurtis Pykes"
full_name = "Kurtis K D Pykes" # "Kurtis Pykes K D"

print(f"Token sort ratio similarity score: {fuzz.token_sort_ratio(name, full_name)}")

Token sort ratio similarity score: 86


### Token Set Ratio
- `token_set_ratio()` is similar to `token_sort_ratio()`, except that it separates the tokens common to both strings from the remaining tokens before scoring. This is especially helpful when the strings differ significantly in length.
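
One way to picture the token-set idea is sketched below: build the shared tokens and each string's leftover tokens, then keep the best pairwise score. This follows the commonly described approach and is an assumption about the general technique, not a copy of the library's code; `token_set_sketch` and `_score` are hypothetical names, and `difflib` stands in for the library's scorer.

```python
from difflib import SequenceMatcher


def _score(a: str, b: str) -> int:
    """0-100 similarity using difflib as a stand-in scorer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_sketch(a: str, b: str) -> int:
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())

    # Tokens shared by both strings, in sorted order
    shared = " ".join(sorted(tokens_a & tokens_b))
    # Shared tokens followed by the tokens unique to each string
    full_a = (shared + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    full_b = (shared + " " + " ".join(sorted(tokens_b - tokens_a))).strip()

    # Keep the best of the three pairwise comparisons
    return max(_score(shared, full_a), _score(shared, full_b), _score(full_a, full_b))


# Every token of the shorter name also occurs in the longer one
print(token_set_sketch("Kurtis Pykes", "Kurtis K D Pykes"))  # 100
```

Because all tokens of `"Kurtis Pykes"` are shared, the shared part equals the first string's full token list, and that comparison alone yields 100.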

In [11]:
# Check the similarity score
name = "Kurtis Pykes"
full_name = "Kurtis K D Pykes"

print(f"Token set ratio similarity score: {fuzz.token_set_ratio(name, full_name)}")

Token set ratio similarity score: 100


## Process
- The `process` module finds the best matches for a query in a collection of strings using fuzzy matching. Calling `process.extract()` returns a list of `(string, score)` tuples, best match first, with at most `limit` entries.
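
Conceptually, extraction amounts to scoring every candidate, sorting by score, and keeping the top few. The sketch below shows that idea with a `difflib`-based scorer and case-insensitive comparison; both are assumptions made for illustration, and `extract_sketch` is a hypothetical function, not the library's implementation.

```python
from difflib import SequenceMatcher


def extract_sketch(query: str, choices: list[str], limit: int = 5) -> list[tuple[str, int]]:
    """Return up to `limit` (choice, score) pairs, best score first."""
    def score(choice: str) -> int:
        # Compare case-insensitively; scores run from 0 to 100
        matcher = SequenceMatcher(None, query.lower(), choice.lower())
        return round(100 * matcher.ratio())

    scored = [(choice, score(choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


collection = ["AFC Barcelona", "Barcelona AFC", "barcelona fc", "afc barcalona"]
top = extract_sketch("barcelona", collection, limit=2)
print(top[0])    # ('barcelona fc', 86)
print(len(top))  # 2
```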

In [None]:
from thefuzz import process

collection = ["AFC Barcelona", "Barcelona AFC", "barcelona fc", "afc barcalona"]
print(process.extract("barcelona", collection, scorer=fuzz.ratio, limit=2))


[('barcelona fc', 86), ('AFC Barcelona', 82)]
