# thefuzz

> [Main Table of Contents](../../README.md)

## In This Notebook
- What is thefuzz
- Use Cases
	- Example: Use the fuzz similarity score to collapse a wide range of values into fewer category bins

## What is thefuzz
- `thefuzz` uses Levenshtein distance to measure how different two sequences (such as strings) are
- It reports a similarity score from 0 to 100, where 100 is an exact match
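
To make the idea behind the score concrete, here is a minimal pure-Python sketch of Levenshtein distance and one way to scale it to a 0-100 similarity. The helper names `levenshtein` and `similarity` are hypothetical, and the scaling is an illustrative assumption; it is not the exact formula `thefuzz` applies internally, so its scores may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Illustrative 0-100 score (assumption, not thefuzz's formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - levenshtein(a, b) / longest))
```

Under this sketch, identical strings score 100 and the score drops as more single-character edits are needed.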

## Use Cases
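- A dataframe column of type `category` has too many distinct values (for example, misspellings or variants of the same label) that need to be collapsed into fewer categories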

### Example: Use the fuzz similarity score to collapse a wide range of values into fewer category bins

```python
# Import pandas and the process module from thefuzz
import pandas as pd
from thefuzz import process

restaurants = pd.read_csv('./restaurants_L2_dirty.csv', header=0, names=['rest_name', 'rest_addr', 'city', 'phone', 'cuisine_type'])

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants.cuisine_type.unique()
# print(f'unique cuisine types pre-processing\n{unique_types}')

# To find a threshold, print the similarity of each target category
# ('asian', 'american', 'italian') to every value in unique_types.
# For this dataset, the threshold turns out to be 80.
print(process.extract('asian', unique_types, limit=len(unique_types)))
print(process.extract('american', unique_types, limit=len(unique_types)))
print(process.extract('italian', unique_types, limit=len(unique_types)))


# Iterate through categories
categories = ['italian', 'asian', 'american']
for cuisine in categories:
    # Create a list of matches, comparing cuisine with the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where cuisine_type is spelled this way
            # and set only the cuisine_type column to the correct cuisine
            # (omitting the column label would overwrite every column in those rows)
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine

# Inspect the final result
print(f'unique cuisine types post-processing\n{restaurants["cuisine_type"].unique()}')
```
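
The threshold logic can also be tried without the CSV file. The sketch below is a dependency-free illustration: it swaps in `difflib.SequenceMatcher` from the standard library as a stand-in scorer (it is not the scorer `thefuzz` uses), and the helper names `score` and `collapse` as well as the sample values are hypothetical. Unlike the loop above, it assigns each value to its single best-scoring category.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Stand-in similarity on a 0-100 scale (assumption: not thefuzz's scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def collapse(values, categories, threshold=80):
    """Map each value to its best-scoring category, or keep the value
    unchanged when no category reaches the threshold."""
    collapsed = []
    for value in values:
        best = max(categories, key=lambda c: score(c, value))
        collapsed.append(best if score(best, value) >= threshold else value)
    return collapsed


# Hypothetical sample data with a few misspellings
raw = ['italian', 'italiann', 'asiane', 'americann', 'french']
cleaned = collapse(raw, ['italian', 'asian', 'american'])
print(cleaned)
```

Values that do not reach the threshold for any category, such as `'french'` here, are left as they are, which is why the threshold should be checked against the actual scores before any values are overwritten.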
