# FuzzyWuzzy: Fuzzy String Matching in Python, Deep Guide
## ... and a hands-on practice on a real-world dataset
<img src="images/repo.jpg"></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://pixabay.com/users/stephennorris-7555778/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=3052477'>Steve Norris</a>
        on 
        <a href='https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=3052477'>Pixabay</a>
    </strong>
</figcaption>

### Introduction <small id='intro'></small>

### Setup <small id='setup'></small>

In [1]:
# Load necessary libraries
import pandas as pd
# fuzzywuzzy to be imported later

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

### How String Matching Is Performed <small id='comparison'></small>

To understand string matching, let's get you up to speed with Minimum Edit Distance. As humans, we have no trouble at all telling whether two or more strings are similar. To give computers this ability, many algorithms have been created, and almost all of them depend on Minimum Edit Distance.

Minimum Edit Distance (MED) is the smallest number of steps needed to turn one string into another. MED is calculated using only 4 operations:
- Insertion
- Deletion
- Substitution
- Transposition (swapping two adjacent characters, counted only by some variants such as Damerau-Levenshtein)

Consider these two words: **Program** and **Sonogram**:
<img src='images/1.png'></img>

These two strings share the same five characters at the end ('ogram'). We can ignore them and focus on the beginning of the strings. To get from Program to Sonogram, we need 3 steps:
1. Add letter 'S' to the beginning of 'Program'.
2. Substitute 'P' with 'O'.
3. Substitute 'R' with 'N'.
<img src='images/2.png'></img>
<figcaption style="text-align: center;">
    <strong>
        Minimum Edit Distance of 3
    </strong>
</figcaption>
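
If you want to check the count of 3 yourself, here is a minimal sketch of the classic dynamic-programming approach. It only counts insertions, deletions and substitutions (plain Levenshtein, no transposition), and `levenshtein` is just a name I picked for illustration. It is not how `fuzzywuzzy` computes its scores.

```python
def levenshtein(source: str, target: str) -> int:
    """Smallest number of insertions, deletions and substitutions
    needed to turn `source` into `target`."""
    # Row 0: turning an empty prefix into the first j characters of
    # `target` takes j insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of `source` into an empty
        # string takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("program", "sonogram"))  # 3
```

The function keeps only two rows of the distance table at a time, so memory stays proportional to the length of the target string.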

As I said, there are many algorithms for measuring how far apart two strings are, most of them variations on MED:
- Damerau-Levenshtein
- Levenshtein
- Hamming
- Jaro Distance

Also, there are packages that use these algorithms: `nltk`, `fuzzywuzzy`, `textdistance`, `difflib`, ...

In this article, we will cover `fuzzywuzzy`.

### FuzzyWuzzy: Installation <small id='install'></small>

Even though the basic installation can be done easily with `pip`, there are some other options and caveats to `fuzzywuzzy`'s installation:

- Using PIP via PyPI (standard):

```pip install fuzzywuzzy```

The above method installs the latest version of the package. At first, I installed it this way. But whenever I imported it, it raised a warning saying that the pure-Python implementation is slow and that I should install the `python-Levenshtein` package for more speed. If you hate warnings in your Jupyter Notebook like me, here is how you can install the extra dependency:
- Directly install `python-Levenshtein`:

```pip install python-Levenshtein```

or

```pip install fuzzywuzzy[speedup]```

**Warning for Windows users**: if you don't have the Microsoft Visual Studio build tools installed, installing `python-Levenshtein` may fail. You can download them from [here](https://visualstudio.microsoft.com/downloads/).
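
If you are not sure whether the speedup ended up in your environment, a quick check like the one below can tell you before you import `fuzzywuzzy`. This is only a sketch: it assumes the `python-Levenshtein` package exposes a module named `Levenshtein`, and `has_levenshtein_speedup` is a hypothetical helper name.

```python
import importlib.util


def has_levenshtein_speedup() -> bool:
    """Return True if a module named `Levenshtein` can be found."""
    # find_spec only looks the module up; it does not import it.
    return importlib.util.find_spec("Levenshtein") is not None


print("python-Levenshtein available:", has_levenshtein_speedup())
```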

### FuzzyWuzzy: The Basics with WRatio <small id='wratio'></small>

To get started with `fuzzywuzzy`, we first import the `fuzz` sub-module:

In [2]:
from fuzzywuzzy import fuzz

In this sub-module, there are 5 functions for different methods of comparison between 2 strings. The most flexible one, and the best for everyday use, is the `WRatio` (Weighted Ratio) function:

In [3]:
fuzz.WRatio('Python', 'Cython')

83

Here, we are comparing 'Python' to 'Cython'. The function returns a similarity score between 0 and 100, 0 meaning not similar at all and 100 meaning identical:

In [4]:
fuzz.WRatio('program', 'sonogram')

67

In [5]:
fuzz.WRatio('insert', 'concert')

62

In [6]:
fuzz.WRatio('notebook', 'note')

90

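`WRatio` is case-insensitive because it lowercases both strings before comparing them (the plain `fuzz.ratio` and `fuzz.partial_ratio` compare the strings as given):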

In [7]:
fuzz.WRatio('Data Science', 'data science')

100
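
The score of 100 above comes from a clean-up step that runs before the comparison. As far as I understand, the library's default preprocessing lowercases the text, replaces anything that is not a letter or number with a space, and trims the ends. The sketch below approximates that with a regular expression. `normalize` is a hypothetical helper of mine, not a `fuzzywuzzy` function.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical helper: lowercase and turn punctuation into spaces."""
    # \W matches every character that is not a letter, digit or underscore.
    return re.sub(r"\W", " ", text).lower().strip()


print(normalize("Data Science") == normalize("data science"))  # True
print(normalize("Mercedez-Benz"))  # mercedez benz
```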

`WRatio` also handles partial strings and strings with different word orderings well:

In [8]:
fuzz.WRatio('data science', 'science')

90

In [9]:
fuzz.WRatio('United States', 'United States of America')

90

In [10]:
fuzz.WRatio('Barcelona, Spain', 'ESP, Barcelona')

82

### FuzzyWuzzy: Comparison of Different Methods <small id='methods'></small>

Apart from `WRatio`, there are 4 other functions to compute string similarity:
- fuzz.ratio
- fuzz.partial_ratio
- fuzz.token_sort_ratio
- fuzz.token_set_ratio

`fuzz.ratio` is perfect for strings with similar lengths and order:

In [11]:
fuzz.ratio('program', 'sonogram')
fuzz.ratio('response', 'respond')
fuzz.ratio('plant', 'grant')
fuzz.ratio('word', 'world')
fuzz.ratio('data science', 'data sience')

67

80

60

89

96

In [12]:
# comparison with WRatio
fuzz.WRatio('program', 'sonogram')
fuzz.WRatio('response', 'respond')
fuzz.WRatio('plant', 'grant')
fuzz.WRatio('word', 'world')
fuzz.WRatio('data science', 'data sience')

67

80

60

89

96
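
To see where the `fuzz.ratio` numbers above come from, here is a rough standard-library sketch. It assumes the pure-Python code path, which to my knowledge relies on `difflib.SequenceMatcher`. With `python-Levenshtein` installed, the library may arrive at slightly different values. `simple_ratio` is a name I made up for this illustration.

```python
from difflib import SequenceMatcher


def simple_ratio(first: str, second: str) -> int:
    """Similarity between 0 and 100 based on matching characters."""
    # SequenceMatcher.ratio() is 2 * M / T, where M is the number of
    # matched characters and T is the combined length of both strings.
    return round(100 * SequenceMatcher(None, first, second).ratio())


print(simple_ratio("program", "sonogram"))  # 67
print(simple_ratio("plant", "grant"))       # 60
print(simple_ratio("word", "world"))        # 89
```

For 'program' and 'sonogram', the only matching block is 'ogram' (5 characters) and the combined length is 15, so the score is 2 * 5 / 15, or about 67.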

For strings with differing lengths, it is better to use `fuzz.partial_ratio`:

In [13]:
fuzz.ratio('maths', 'mathematics')
fuzz.partial_ratio('maths', 'mathematics')
fuzz.WRatio('maths', 'mathematics')

62

80

72

In [14]:
fuzz.ratio('barcelona', 'barca')
fuzz.partial_ratio('barcelona', 'barca')
fuzz.WRatio('barcelona', 'barca')

71

80

72
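
The idea behind the partial score can be sketched as sliding the shorter string along the longer one and keeping the best plain ratio. The version below is a simplified approximation built on `difflib`, not the library's actual implementation (which picks candidate windows more cleverly). `partial_ratio_sketch` and `simple_ratio` are hypothetical names.

```python
from difflib import SequenceMatcher


def simple_ratio(first: str, second: str) -> int:
    """Plain similarity score between 0 and 100."""
    return round(100 * SequenceMatcher(None, first, second).ratio())


def partial_ratio_sketch(first: str, second: str) -> int:
    """Best score of the shorter string against any equally long
    slice of the longer string."""
    shorter, longer = sorted((first, second), key=len)
    if not shorter:
        return 0  # nothing to compare against
    window = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + window])
        for start in range(len(longer) - window + 1)
    )


print(partial_ratio_sketch("maths", "mathematics"))  # 80
print(partial_ratio_sketch("barcelona", "barca"))    # 80
```

In the first example the best window is 'mathe', which shares 'math' with 'maths': 4 matching characters out of 10 in total, hence 80.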

If the strings have the same meaning but their word order is different, use `fuzz.token_sort_ratio`:

In [15]:
fuzz.ratio('Barcelona vs. Real Madrid', 'Real Madrid vs. Barcelona')
fuzz.partial_ratio('Barcelona vs. Real Madrid', 'Real Madrid vs. Barcelona')
fuzz.WRatio('Barcelona vs. Real Madrid', 'Real Madrid vs. Barcelona')
fuzz.token_sort_ratio('Barcelona vs. Real Madrid', 'Real Madrid vs. Barcelona')

44

46

95

100
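
The perfect score makes sense once you picture what "token sort" means: split both strings into words, sort the words alphabetically, join them back and compare. Here is a minimal sketch of that idea with the standard library. The clean-up regex and the name `token_sort_ratio_sketch` are my own assumptions, not the package's code.

```python
import re
from difflib import SequenceMatcher


def sorted_tokens(text: str) -> str:
    """Lowercase, drop punctuation, and sort the words alphabetically."""
    tokens = re.sub(r"\W", " ", text).lower().split()
    return " ".join(sorted(tokens))


def token_sort_ratio_sketch(first: str, second: str) -> int:
    """Plain ratio computed on the alphabetically sorted words."""
    a, b = sorted_tokens(first), sorted_tokens(second)
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(sorted_tokens("Barcelona vs. Real Madrid"))  # barcelona madrid real vs
print(token_sort_ratio_sketch("Barcelona vs. Real Madrid",
                              "Real Madrid vs. Barcelona"))  # 100
```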

For more edge cases, there is `fuzz.token_set_ratio`:

In [16]:
fuzz.ratio('Manchester United vs Manchester City', 'United vs City')
fuzz.partial_ratio('Manchester United vs Manchester City', 'United vs City')
fuzz.WRatio('Manchester United vs Manchester City', 'United vs City')
fuzz.token_set_ratio('Manchester United vs Manchester City', 'City vs United')

56

71

86

100
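
The "token set" approach goes one step further and treats each string as a set of words, so repeated or extra words hurt the score much less. Below is a rough sketch of the idea as I understand it: build one string from the shared words and two more from the shared words plus each side's leftovers, then keep the best pairwise ratio. All function names here are hypothetical and the details may differ from the library.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str) -> set:
    """Lowercase, drop punctuation, and return the set of words."""
    return set(re.sub(r"\W", " ", text).lower().split())


def ratio(first: str, second: str) -> int:
    """Plain similarity score between 0 and 100."""
    return round(100 * SequenceMatcher(None, first, second).ratio())


def token_set_ratio_sketch(first: str, second: str) -> int:
    tokens_a, tokens_b = tokenize(first), tokenize(second)
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    # Shared words first, then whatever is unique to each side.
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    return max(
        ratio(common, combined_a),
        ratio(common, combined_b),
        ratio(combined_a, combined_b),
    )


print(token_set_ratio_sketch("Manchester United vs Manchester City",
                             "City vs United"))  # 100
```

Every word of 'City vs United' also appears in the longer string, so the shared-words string equals one of the combined strings and the score is 100.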

As you can see, these 5 functions are full of caveats. Comparing them is a whole other topic, so I am leaving you a link to the [article](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) written by the package creators, which explains their differences beautifully.
> As you have probably noticed, `WRatio` offers a middle ground between all the functions of `fuzzywuzzy`. When you face many edge cases and different issues, `WRatio` is usually the safest choice.

### Using `fuzzywuzzy.process` to Extract Best Matches to a String from a List of Options

Now that we have some understanding of `fuzzywuzzy`'s different functions, we can move on to more complex problems. With real-life data, most of the time you have to find the value most similar to your string in a list of options. Consider this example:

In [17]:
string_to_match = 'Mercedez-Benz'
options = ['Ford', 'Mustang', 'mersedez benz', 'MAZDA', 'Mercedez']

We have to find the best matches for `Mercedez-Benz` so that we can replace them with the correct spelling of the car. We could loop over every option and score it by hand, but this operation is so common that `fuzzywuzzy` provides us with a helpful sub-module for it:

In [18]:
from fuzzywuzzy import process

With this sub-module, you can extract best matches to your string from a sequence of strings. Let's solve our initial problem:

In [23]:
process.extract(query=string_to_match, choices=options, limit=3)

[('mersedez benz', 92), ('Mercedez', 90), ('Ford', 45)]

The parameters of interest in `process.extract` are `query`, `choices` and `limit`. This function computes the similarity between the string given in `query` and each option in `choices`, and returns a list of tuples sorted from best to worst match. `limit` controls the number of tuples to return. Each of these tuples contains two elements: the first one is the matching string and the second one is the similarity score.
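
Conceptually, the call above boils down to "score every option, sort, keep the top few". Here is a small sketch of that idea using only the standard library. `extract_sketch`, `normalize` and `simple_ratio` are hypothetical names, and the scorer is a plain `difflib` ratio rather than `WRatio`, so the numbers will not always match the library's output.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, trim the ends."""
    return re.sub(r"\W", " ", text).lower().strip()


def simple_ratio(first: str, second: str) -> int:
    """Plain similarity score between 0 and 100."""
    return round(100 * SequenceMatcher(None, first, second).ratio())


def extract_sketch(query, choices, scorer=simple_ratio, limit=5):
    """Return up to `limit` (choice, score) tuples, best match first."""
    cleaned_query = normalize(query)
    scored = [
        (choice, scorer(cleaned_query, normalize(choice)))
        for choice in choices
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


options = ['Ford', 'Mustang', 'mersedez benz', 'MAZDA', 'Mercedez']
best = extract_sketch('Mercedez-Benz', options, limit=3)
print(best[:2])  # [('mersedez benz', 92), ('Mercedez', 76)]
```

Passing the scorer as a parameter mirrors the way the real function lets you swap the comparison method without touching the ranking logic.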

Under the hood, `process.extract` uses the `WRatio` function by default. However, depending on your case, and knowing the differences between the 5 functions, you can change the scoring function with `scorer`:

In [24]:
process.extract(query=string_to_match, choices=options, limit=3, scorer=fuzz.ratio)

[('mersedez benz', 92), ('Mercedez', 76), ('Ford', 24)]

If you have many options, it is best to stick with `WRatio` because it is the most flexible.

In the `process` module, there are other functions that perform similar operations. `process.extractOne` returns a single tuple containing the string with the highest matching score:

In [26]:
process.extractOne(string_to_match, options)

('mersedez benz', 92)

### Text Cleaning With FuzzyWuzzy On a Real Dataset <small id='real'></small>