# FuzzyWuzzy: Fuzzy String Matching in Python, Beginner's Guide
## ... and a hands-on practice on a real-world dataset
<img src="images/repo.jpg"></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://pixabay.com/users/stephennorris-7555778/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=3052477'>Steve Norris</a>
        on 
        <a href='https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=3052477'>Pixabay</a>
    </strong>
</figcaption>

### Introduction <small id='intro'></small>

If you have dealt with text data before, you know that its issues are among the hardest to deal with. There is no one-size-fits-all solution to text problems, and for each dataset you have to come up with new ways to clean your data. In one of my [previous](https://towardsdatascience.com/master-the-most-hated-task-in-ds-ml-3b9779276d7c?source=your_stories_page-------------------------------------) articles, I talked about the worst-case scenario of such problems:
> For example, consider this worst-case scenario: you are working on a survey data conducted across the USA and there is a state column for the state of each observation in the dataset. There are 50 states in the USA and imagine all the damn variations of state names people can come up with. You are in even bigger problem if data collectors decide to use abbreviations:
CA, ca, Ca, Caliphornia, Californa, Calfornia, calipornia, CAL, CALI, …
Such columns will always be filled with typos, errors, inconsistencies.

Problems with text often arise from free-text fields during data collection. Such fields end up full of typos and inconsistencies. Of course, the most basic problems can be solved using simple regular expressions or built-in Python functions, but for cases like the one above, which occur very often, you have to arm yourself with more complex tools.

Today's special is `fuzzywuzzy`, a package with a very simple API which helps us to calculate string similarity.

### Overview
1. [Introduction](#intro)
1. [Setup](#setup)
1. [How String Matching Is Performed](#comparison)
1. [Installation](#install)
1. [FuzzyWuzzy: The Basics with WRatio](#wratio)
1. [FuzzyWuzzy: Comparison of Different Methods](#methods)
1. [Using `fuzzywuzzy.process` to Extract Best Matches to a String from a List of Options](#process)
1. [Text Cleaning With FuzzyWuzzy On a Real Dataset](#real)

### Setup <small id='setup'></small>

In [1]:
# Load necessary libraries
import pandas as pd
# fuzzywuzzy to be imported later

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

### How String Matching Is Performed <small id='comparison'></small>

To understand string matching, let's get you up to speed with Minimum Edit Distance. As humans, we have no trouble telling whether two or more strings are similar. To give computers this ability, many algorithms have been created, and many of them build on Minimum Edit Distance.

Minimum Edit Distance (MED) is the smallest number of single-character edits needed to turn one string into another. Depending on the algorithm, MED is calculated using up to 4 operations:
- Insertion
- Deletion
- Substitution
- Transposition (swapping two adjacent characters)

Consider these two words: **Program** and **Sonogram**:
<img src='images/1.png'></img>

To get from Program to Sonogram, we need 3 steps:
1. Add letter 'S' to the beginning of 'Program'.
2. Substitute 'P' with 'O'.
3. Substitute 'R' with 'N'.
<img src='images/2.png'></img>
<figcaption style="text-align: center;">
    <strong>
        Minimum Edit Distance of 3
    </strong>
</figcaption>
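The three steps above can be reproduced with a short dynamic-programming routine. This is only a minimal sketch of plain Levenshtein distance (insertions, deletions and substitutions, without transpositions); the function name `levenshtein` is illustrative and is not part of `fuzzywuzzy`:

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn source into target."""
    # previous[j] is the distance between the source prefix processed
    # so far and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        # Turning i source characters into an empty string takes i deletions.
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("program", "sonogram"))  # 3
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends on the row above it and the cell to its left.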

As I said, there are many algorithms for measuring the distance between two strings, and they differ in which operations they allow:
- Damerau-Levenshtein
- Levenshtein
- Hamming
- Jaro Distance

Also, there are packages that use these algorithms: `nltk`, `fuzzywuzzy`, `textdistance`, `difflib`, ...

In this article, we will only cover `fuzzywuzzy`.

### FuzzyWuzzy: Installation <small id='install'></small>

Even though the basic installation can be done easily with `pip`, there are some other options and caveats to `fuzzywuzzy`'s installation:

- Using PIP via PyPI (standard):

```pip install fuzzywuzzy```

The above method installs the latest version of the package. At first, I installed it this way. But whenever I imported it, it gave a warning saying that the package is slow on its own and that I should install the `python-Levenshtein` package for more speed. If you hate warnings in your Jupyter Notebook like me, here is how you can install the extra dependency:
- Directly install `python-Levenshtein`:

```pip install python-Levenshtein```

or

```pip install fuzzywuzzy[speedup]```

**Warning for Windows users**: if you don't have Microsoft Visual Studio build tools installed, installing `python-Levenshtein` may fail because it has to compile a C extension. You can download the build tools from [here](https://visualstudio.microsoft.com/downloads/).
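If you are not sure whether the speedup is in place, one way to check before importing `fuzzywuzzy` is sketched below. It assumes that `python-Levenshtein` is importable under the module name `Levenshtein`; the variable name `has_speedup` is just illustrative:

```python
import importlib.util

# find_spec returns None when the module cannot be found,
# without actually importing it.
has_speedup = importlib.util.find_spec("Levenshtein") is not None
print("python-Levenshtein available:", has_speedup)
```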

### FuzzyWuzzy: The Basics with WRatio <small id='wratio'></small>

To get started with `fuzzywuzzy`, we first import the `fuzz` sub-module:

In [2]:
from fuzzywuzzy import fuzz

In this sub-module, there are 5 functions for different methods of comparison between 2 strings. The most flexible and best one for everyday use is `WRatio` (Weighted Ratio) function:

In [3]:
fuzz.WRatio('Python', 'Cython')

83

Here, we are comparing 'Python' to 'Cython'. The function returns a score between 0 and 100, 0 being not similar at all and 100 being identical:

In [4]:
fuzz.WRatio('program', 'sonogram')

67

In [5]:
fuzz.WRatio('insert', 'concert')

62

In [6]:
fuzz.WRatio('notebook', 'note')

90

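`WRatio` and the token-based functions ignore case by default, because they lowercase both strings and strip punctuation before comparing them (`fuzz.ratio` and `fuzz.partial_ratio` compare the strings exactly as given):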

In [7]:
fuzz.WRatio('Data Science', 'data science')

100
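The score of 100 above comes from that preprocessing step. Here is a rough sketch of it, assuming ASCII-only text; the helper name `normalize` is hypothetical, and the library's own processor also handles non-ASCII letters:

```python
import re


def normalize(text: str) -> str:
    """Lowercase text and replace every run of characters that are
    not letters or digits with a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


print(normalize("Data Science"))   # data science
print(normalize("Mercedez-Benz"))  # mercedez benz
```

After this step, 'Data Science' and 'data science' are the same string, so any similarity measure rates them as identical.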

`WRatio` is also very good for partial strings with different orderings:

In [8]:
fuzz.WRatio('data science', 'science')

90

In [9]:
fuzz.WRatio('United States', 'United States of America')

90

In [10]:
fuzz.WRatio('Barcelona, Spain', 'ESP, Barcelona')

82

### FuzzyWuzzy: Comparison of Different Methods <small id='methods'></small>

Apart from `WRatio`, there are 4 other functions to compute string similarity:
- fuzz.ratio
- fuzz.partial_ratio
- fuzz.token_sort_ratio
- fuzz.token_set_ratio

`fuzz.ratio` works best for strings with similar lengths and order:

In [11]:
fuzz.ratio('program', 'sonogram')
fuzz.ratio('response', 'respond')
fuzz.ratio('plant', 'grant')
fuzz.ratio('word', 'world')
fuzz.ratio('data science', 'data sience')

67

80

60

89

96
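To see where these numbers come from, the plain ratio can be approximated with the standard library alone. This is a sketch under the assumption that the score is `2 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings; `simple_ratio` is an illustrative name, and scores may differ slightly from `fuzz.ratio` when `python-Levenshtein` is installed:

```python
from difflib import SequenceMatcher


def simple_ratio(first: str, second: str) -> int:
    """Return the difflib similarity of two strings as a 0-100 integer."""
    return round(100 * SequenceMatcher(None, first, second).ratio())


print(simple_ratio("python", "cython"))     # 83
print(simple_ratio("program", "sonogram"))  # 67
```

For 'python' and 'cython', the shared block 'ython' has 5 characters and the strings have 12 characters in total, so the score is 2 * 5 / 12, or about 83.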

In [12]:
# comparison with WRatio
fuzz.WRatio('program', 'sonogram')
fuzz.WRatio('response', 'respond')
fuzz.WRatio('plant', 'grant')
fuzz.WRatio('word', 'world')
fuzz.WRatio('data science', 'data sience')

67

80

60

89

96

For strings with differing lengths, it is better to use `fuzz.partial_ratio`:

In [13]:
fuzz.ratio('maths', 'mathematics')
fuzz.partial_ratio('maths', 'mathematics')
fuzz.WRatio('maths', 'mathematics')

62

80

72

In [14]:
fuzz.ratio('barcelona', 'barca')
fuzz.partial_ratio('barcelona', 'barca')
fuzz.WRatio('barcelona', 'barca')

71

80

72
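The idea behind a partial ratio can be sketched as sliding the shorter string along the longer one and keeping the best score of any window. This is a simplified standard-library illustration, not the library's exact implementation (which picks its candidate windows differently); `sliding_partial_ratio` is a hypothetical name:

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(first: str, second: str) -> int:
    """Score the shorter string against every same-length window of the
    longer string and return the best score as a 0-100 integer."""
    shorter, longer = sorted((first, second), key=len)
    if not shorter:
        return 0  # nothing to compare against
    window = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[start:start + window]).ratio()
        for start in range(len(longer) - window + 1)
    )
    return round(100 * best)


print(sliding_partial_ratio("maths", "mathematics"))  # 80
print(sliding_partial_ratio("barcelona", "barca"))    # 80
```

In both examples the best window is the start of the longer string ('mathe' and 'barce'), which differs from the shorter string by a single character.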

If the strings have the same meaning but their order is different, use `fuzz.token_sort_ratio`:

In [15]:
fuzz.ratio('Barcelona vs. Real Madrid', 'Real Madrid vs. Barcelona')
fuzz.partial_ratio('Barcelona vs. Real Madrid', 'Real Madrid vs. Barcelona')
fuzz.WRatio('Barcelona vs. Real Madrid', 'Real Madrid vs. Barcelona')
fuzz.token_sort_ratio('Barcelona vs. Real Madrid', 'Real Madrid vs. Barcelona')

44

46

95

100
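The perfect score above is easier to understand once you see what sorting the tokens does. The sketch below assumes ASCII text and uses `difflib` in place of the library's matcher; `sort_tokens` and `token_sort_sketch` are illustrative names:

```python
import re
from difflib import SequenceMatcher


def sort_tokens(text: str) -> str:
    """Lowercase, drop punctuation, and return the words in sorted order."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(sorted(tokens))


def token_sort_sketch(first: str, second: str) -> int:
    """Compare two strings after sorting their words alphabetically."""
    matcher = SequenceMatcher(None, sort_tokens(first), sort_tokens(second))
    return round(100 * matcher.ratio())


print(sort_tokens("Barcelona vs. Real Madrid"))  # barcelona madrid real vs
print(token_sort_sketch("Barcelona vs. Real Madrid",
                        "Real Madrid vs. Barcelona"))  # 100
```

Both inputs collapse to the same sorted string, so word order no longer affects the score.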

When one string repeats or adds words that the other lacks, there is `fuzz.token_set_ratio`, which focuses on the set of words the two strings share:

In [16]:
fuzz.ratio('Manchester United vs Manchester City', 'United vs City')
fuzz.partial_ratio('Manchester United vs Manchester City', 'United vs City')
fuzz.WRatio('Manchester United vs Manchester City', 'United vs City')
fuzz.token_set_ratio('Manchester United vs Manchester City', 'City vs United')

56

71

86

100
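A rough sketch of the set-based comparison follows, again with `difflib` standing in for the library's matcher and ASCII text assumed. The names `tokens`, `ratio` and `token_set_sketch` are hypothetical. The shared words are compared against each string's "shared words plus leftovers", and the best of the three pairings wins:

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> set:
    """Return the set of lowercase words in text, ignoring punctuation."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def ratio(first: str, second: str) -> int:
    return round(100 * SequenceMatcher(None, first, second).ratio())


def token_set_sketch(first: str, second: str) -> int:
    first_tokens, second_tokens = tokens(first), tokens(second)
    common = " ".join(sorted(first_tokens & second_tokens))
    only_first = " ".join(sorted(first_tokens - second_tokens))
    only_second = " ".join(sorted(second_tokens - first_tokens))
    # Shared words followed by the words unique to each string.
    combined_first = f"{common} {only_first}".strip()
    combined_second = f"{common} {only_second}".strip()
    return max(
        ratio(common, combined_first),
        ratio(common, combined_second),
        ratio(combined_first, combined_second),
    )


print(token_set_sketch("Manchester United vs Manchester City",
                       "City vs United"))  # 100
```

Every word of 'City vs United' also appears in the longer string, so the shared words match the second string exactly and the score is 100.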

As you can see, these 5 functions are full of caveats. Comparing them is a whole other topic, so I am leaving you a link to the [article](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) written by the package creators, which explains their differences beautifully.
> You have probably noticed that `WRatio` gives a middle ground between all the functions of `fuzzywuzzy`. When you face many edge cases and different issues, `WRatio` is usually the best choice.

### Using `fuzzywuzzy.process` to Extract Best Matches to a String from a List of Options <small id='process'></small>

Now that we have some understanding of `fuzzywuzzy`'s different functions, we can move on to more complex problems. With real-life data, most of the time you have to find the value most similar to your string in a list of options. Consider this example:

In [17]:
string_to_match = 'Mercedez-Benz'
options = ['Ford', 'Mustang', 'mersedez benz', 'MAZDA', 'Mercedez']

We have to find the options that best match `Mercedez-Benz` so that we can replace them with the correct spelling. We could loop over the options and score each one ourselves, but since this operation is so common, `fuzzywuzzy` provides us with a helpful sub-module:

In [18]:
from fuzzywuzzy import process

With this sub-module, you can extract best matches to your string from a sequence of strings. Let's solve our initial problem:

In [19]:
process.extract(query=string_to_match, choices=options, limit=3)

[('mersedez benz', 92), ('Mercedez', 90), ('Ford', 45)]

The parameters of interest in `process.extract` are `query`, `choices` and `limit`. This function scores the string given in `query` against each string in `choices` and returns a list of tuples, ordered from the best match to the worst. `limit` controls the number of tuples to return (the default is 5). Each of these tuples contains two elements: the first is the matching string and the second is its similarity score.

Under the hood, `process.extract` uses the `WRatio` function by default. However, depending on your case and knowing the differences between the 5 functions, you can change the scoring function with `scorer`:

In [20]:
process.extract(query=string_to_match,
                choices=options,
                limit=3,
                scorer=fuzz.ratio)

[('mersedez benz', 92), ('Mercedez', 76), ('Ford', 24)]
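Conceptually, extraction is "score every choice, keep the top few". The sketch below illustrates that idea with the standard library only. The names `normalize`, `score` and `extract_sketch` are hypothetical, and the scorer is a plain `difflib` ratio on normalized text rather than the library's own code:

```python
import heapq
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase text and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def score(query: str, choice: str) -> int:
    matcher = SequenceMatcher(None, normalize(query), normalize(choice))
    return round(100 * matcher.ratio())


def extract_sketch(query, choices, limit=3):
    """Return the `limit` best (choice, score) pairs, best first."""
    scored = ((choice, score(query, choice)) for choice in choices)
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


options = ['Ford', 'Mustang', 'mersedez benz', 'MAZDA', 'Mercedez']
best = extract_sketch('Mercedez-Benz', options)
print(best)
# [('mersedez benz', 92), ('Mercedez', 76), ('Ford', 24)]
```

Every choice still has to be scored once, so the cost grows with the number of choices; `heapq.nlargest` only avoids sorting the full list when a few results are needed.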

If you have many options, it is best to stick with `WRatio` because it is the most flexible.

In the `process` module, there are other functions that perform similar operations. `process.extractOne` returns only the single best match together with its score:

In [21]:
process.extractOne(string_to_match, options)

('mersedez benz', 92)

### Text Cleaning With FuzzyWuzzy On a Real Dataset <small id='real'></small>

Now we are ready to tackle a real-world problem. I will load the raw data to practice:

In [22]:
cars = pd.read_csv('data/raw.csv',
                   usecols=['zip', 'year', 'vehicle_make', 'vehicle_model'],
                   na_values=0)
cars.dropna(inplace=True)
cars.drop_duplicates(inplace=True)

In [23]:
cars.sample(5)

| | zip | year | vehicle_make | vehicle_model |
|---|---|---|---|---|
| 1712 | 94541 | 1997.0 | MERCEDES-BENZ | C-CLASS |
| 6501 | 94587 | 2016.0 | TOYOTA | COROLLA |
| 8212 | 94607 | 2007.0 | VOLKSWAGEN | JETTA |
| 9890 | 94705 | 2013.0 | TESLA | MODEL S |
| 1380 | 94539 | 2007.0 | HONDA | CR-V |


In [24]:
cars.shape

(8504, 4)

I used this dataset in one of my personal projects and the task was to correct the spelling of each vehicle make and model according to the correct values given in another file:

In [25]:
# Import pickle library to load pickle files
import pickle
# load data file
with open('data/make_model.pkl', 'rb') as file:
    make_model = pickle.load(file)

After loading the pickle file, `make_model` is a dictionary whose keys are the correctly spelled car makes and whose values are sets of the correctly spelled models for each make.

For example, let's see the spellings of the models listed for `Toyota`:

In [26]:
make_model['Toyota']

{'4Runner',
 '86',
 'Avalon',
 'Avalon Hybrid',
 'C-HR',
 'Camry',
 'Camry Hybrid',
 'Celica',
 'Corolla',
 'Corolla Hatchback',
 'Corolla Hybrid',
 'Corolla iM',
 'Cressida',
 'Echo',
 'FJ Cruiser',
 'GR Supra',
 'Highlander',
 'Highlander Hybrid',
 'Land Cruiser',
 'MR2',
 'Matrix',
 'Mirai',
 'Paseo',
 'Previa',
 'Prius',
 'Prius Plug-in Hybrid',
 'Prius Prime',
 'Prius c',
 'Prius v',
 'RAV4',
 'RAV4 Hybrid',
 'Regular Cab',
 'Sequoia',
 'Sienna',
 'Solara',
 'Supra',
 'T100 Regular Cab',
 'T100 Xtracab',
 'Tacoma Access Cab',
 'Tacoma Double Cab',
 'Tacoma Regular Cab',
 'Tacoma Xtracab',
 'Tercel',
 'Tundra Access Cab',
 'Tundra CrewMax',
 'Tundra Double Cab',
 'Tundra Regular Cab',
 'Venza',
 'Xtra Cab',
 'Yaris',
 'Yaris Hatchback',
 'Yaris iA'}

Now, let's subset the raw data for `Toyota` cars:

In [27]:
cars[cars['vehicle_make'] == 'TOYOTA']

| | zip | year | vehicle_make | vehicle_model |
|---|---|---|---|---|
| 4 | 94612 | 2014.0 | TOYOTA | PRIUS PLUG-IN HYBRID |
| 7 | 94612 | 2003.0 | TOYOTA | TUNDRA |
| 10 | 94706 | 2017.0 | TOYOTA | PRIUS |
| 11 | 94706 | 2008.0 | TOYOTA | PRIUS |
| 12 | 94706 | 2012.0 | TOYOTA | PRIUS V |
| ... | ... | ... | ... | ... |
| 9992 | 94707 | 2016.0 | TOYOTA | PRIUS |
| 9995 | 94707 | 2012.0 | TOYOTA | PRIUS PLUG-IN HYBRID |
| 9997 | 94707 | 2018.0 | TOYOTA | PRIUS PRIME |
| 9998 | 94707 | 2004.0 | TOYOTA | COROLLA |


The dataset contains dozens of unique car makes like Audi, Bentley and BMW, and each one has several models that are full of edge cases. We cannot just convert each label to title case or lower case. We also don't know whether the labels contain spelling errors or inconsistencies, and checking them by eye is not an option for a dataset of this size. In some cases, make labels with more than one word separate the words with a `space`, while others use a `dash`. If you have this many inconsistencies and there is no clear pattern, use string matching.

Let's start by cleaning up car make labels. For comparison, here are the make labels in both datasets:

In [28]:
cars['vehicle_make'].unique()

array(['FORD', 'DODGE', 'CHEVROLET', 'SUBARU', 'TOYOTA', 'MERCEDES-BENZ',
       'BMW', 'HONDA', 'HYUNDAI', 'LEXUS', 'SCION', 'NISSAN', 'SAAB',
       'PORSCHE', 'KIA', 'JEEP', 'MITSUBISHI', 'VOLKSWAGEN', 'ACURA',
       'GMC', 'MINI', 'MAZDA', 'CHRYSLER', 'MERCURY', 'CADILLAC',
       'INFINITI', 'VOLVO', 'LINCOLN', 'AUDI', 'TESLA', 'BUICK', 'FIAT',
       'SATURN', 'LAND ROVER', 'FREIGHTLINER', 'PONTIAC', 'JAGUAR',
       'PETERBILT', 'GEO', 'RAM', 'ISUZU', 'PLYMOUTH', 'MASERATI',
       'ASTON MARTIN', 'INTERNATIONAL', 'HINO', 'OLDSMOBILE', 'SUZUKI',
       'UD TRUCKS', 'HUMMER', 'WORKHORSE', 'COUNTRY COACH', 'LAMBORGHINI',
       'MG', 'SMART', 'GENESIS', 'KENWORTH', 'BENTLEY', 'OSHKOSH'],
      dtype=object)

In [29]:
make_model.keys()

dict_keys(['Acura', 'Alfa Romeo', 'Aston Martin', 'Audi', 'Bentley', 'BMW', 'Buick', 'Cadillac', 'Chevrolet', 'Chrysler', 'Dodge', 'Ferrari', 'FIAT', 'Ford', 'Freightliner', 'Genesis', 'GMC', 'Honda', 'Hyundai', 'INFINITI', 'Jaguar', 'Jeep', 'Kia', 'Lamborghini', 'Land Rover', 'Lexus', 'Lincoln', 'Lotus', 'Maserati', 'MAZDA', 'McLaren', 'Mercedes-Benz', 'MINI', 'Mitsubishi', 'Nissan', 'Porsche', 'Ram', 'Rolls-Royce', 'smart', 'Subaru', 'Tesla', 'Toyota', 'Volkswagen', 'Volvo', 'HUMMER', 'Maybach', 'Mercury', 'Pontiac', 'Saab', 'Saturn', 'Scion', 'Suzuki'])

The differences are easy to spot: the raw labels are all upper case, while the reference spellings use mixed case. We will use `process.extract` to match each make with the correct spelling:

In [30]:
# for each correct make:
for make in make_model.keys():
    # find potential matches
    matches = process.extract(make, cars['vehicle_make'], limit=cars.shape[0])
    # for each match
    for match in matches:
        # if high similarity score
        if match[1] >= 90:
            # replace the incorrect spelling with the make
            cars.loc[cars['vehicle_make'] == match[0], 'vehicle_make'] = make

In [31]:
cars['vehicle_make'].unique()

array(['Ford', 'Dodge', 'Chevrolet', 'Subaru', 'Toyota', 'Mercedes-Benz',
       'BMW', 'Honda', 'Hyundai', 'Lexus', 'Scion', 'Nissan', 'Saab',
       'Porsche', 'Kia', 'Jeep', 'Mitsubishi', 'Volkswagen', 'Acura',
       'GMC', 'MINI', 'MAZDA', 'Chrysler', 'Mercury', 'Cadillac',
       'INFINITI', 'Volvo', 'Lincoln', 'Audi', 'Tesla', 'Buick', 'FIAT',
       'Saturn', 'Land Rover', 'Freightliner', 'Pontiac', 'Jaguar',
       'PETERBILT', 'GEO', 'Ram', 'ISUZU', 'PLYMOUTH', 'Maserati',
       'Aston Martin', 'INTERNATIONAL', 'HINO', 'OLDSMOBILE', 'Suzuki',
       'UD TRUCKS', 'HUMMER', 'WORKHORSE', 'COUNTRY COACH', 'Lamborghini',
       'MG', 'smart', 'Genesis', 'KENWORTH', 'Bentley', 'OSHKOSH'],
      dtype=object)
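The same cleaning pattern can be sketched on a tiny example without `pandas` or `fuzzywuzzy`. This is an illustration under simplifying assumptions: it scores with a plain `difflib` ratio on normalized text rather than `WRatio`, the sample labels are made up for the demo, and `normalize`, `score` and `build_mapping` are hypothetical names:

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase text and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def score(first: str, second: str) -> int:
    matcher = SequenceMatcher(None, normalize(first), normalize(second))
    return round(100 * matcher.ratio())


def build_mapping(raw_labels, reference_labels, threshold=90):
    """Map each distinct raw label to its best-scoring reference label,
    but only when that score reaches the threshold."""
    mapping = {}
    for raw in set(raw_labels):
        best = max(reference_labels, key=lambda candidate: score(raw, candidate))
        if score(raw, best) >= threshold:
            mapping[raw] = best
    return mapping


# Made-up sample data for the demonstration.
raw_makes = ["TOYOTA", "MERCEDES-BENZ", "LAND ROVER", "PETERBILT", "TOYOTA"]
reference_makes = ["Toyota", "Mercedes-Benz", "Land Rover", "Ford"]

mapping = build_mapping(raw_makes, reference_makes)
cleaned = [mapping.get(make, make) for make in raw_makes]
print(cleaned)
# ['Toyota', 'Mercedes-Benz', 'Land Rover', 'PETERBILT', 'Toyota']
```

Scoring each distinct label once, instead of every row, keeps the amount of work proportional to the number of unique labels. Labels with no good match, such as 'PETERBILT' here, are left untouched.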

As you can see, the make labels that exist in `make_model` were converted to their correct spelling, while makes without a reference entry, such as `PETERBILT`, were left unchanged. Now, it is time for model labels:

In [32]:
# for each make
for make in make_model:
    # if make exists in the main data
    if make in cars['vehicle_make'].unique():
        # for each model
        for model in make_model[make]:
            # subset main data for current make and get its models
            options = cars[cars['vehicle_make'] == make]['vehicle_model']
            # find potential matches
            matches = process.extract(model, options, limit=options.shape[0])
            # for each match
            for match in matches:
                # if high similarity score
                if match[1] >= 90:
                    # replace incorrect spelling with the correct one
                    cars.loc[((cars['vehicle_make'] == make) &
                              (cars['vehicle_model'] == match[0])),
                             'vehicle_model'] = model

In [33]:
cars.sample(15)

| | zip | year | vehicle_make | vehicle_model |
|---|---|---|---|---|
| 6421 | 94587 | 2016.0 | Ford | Explorer |
| 7506 | 94603 | 1999.0 | Chevrolet | Silverado 3500 HD Crew Cab |
| 9282 | 94619 | 2016.0 | BMW | 3 Series |
| 6512 | 94587 | 2000.0 | Toyota | Corolla |
| 7363 | 94602 | 2001.0 | Jeep | Cherokee |
| 8735 | 94610 | 1991.0 | Honda | Accord Hybrid |
| 2248 | 94544 | 2013.0 | INFINITI | G37 SEDAN |
| 8334 | 94608 | 2017.0 | Toyota | RAV4 Hybrid |
| 9493 | 94621 | 2016.0 | Dodge | Caravan Cargo |
| 8825 | 94611 | 2014.0 | BMW | M2 |
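The key idea in the model-cleaning loop is that each model is only compared with the reference models of its own make. Here is a small standard-library sketch of that idea, with made-up records, a plain `difflib` ratio instead of `WRatio`, and hypothetical helper names (`normalize`, `score`, `best_match`):

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase text and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def score(first: str, second: str) -> int:
    matcher = SequenceMatcher(None, normalize(first), normalize(second))
    return round(100 * matcher.ratio())


def best_match(label, candidates, threshold=90):
    """Return the best-scoring candidate, or the label itself when no
    candidate reaches the threshold."""
    best = max(sorted(candidates), key=lambda candidate: score(label, candidate))
    return best if score(label, best) >= threshold else label


# Made-up reference data and records for the demonstration.
reference = {
    "Toyota": {"Prius", "Prius v", "Corolla", "Tundra CrewMax"},
    "Honda": {"CR-V", "Civic"},
}
records = [("Toyota", "PRIUS V"), ("Toyota", "COROLLA"),
           ("Honda", "CR-V"), ("Toyota", "TUNDRA")]

cleaned = [
    (make, best_match(model, reference[make])) if make in reference else (make, model)
    for make, model in records
]
print(cleaned)
# [('Toyota', 'Prius v'), ('Toyota', 'Corolla'), ('Honda', 'CR-V'), ('Toyota', 'TUNDRA')]
```

Taking only the single best candidate per label is a deliberate choice in this sketch. With token-based scorers, several reference models that share a word can clear the same threshold for one raw label, and keeping just the top score avoids overwriting a label more than once.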


> The last two code snippets were a little hairy. To fully understand how they work, you should get some practice with `process.extract`.

There you go! Without string matching, this task would have been far harder, and regular expressions alone would not have been much help.