# FuzzyWuzzy: Fuzzy String Matching in Python, Deep Guide
## ... and a hands-on practice on a real-world dataset
<img src="images/repo.jpg"></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://pixabay.com/users/stephennorris-7555778/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=3052477'>Steve Norris</a>
        on 
        <a href='https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=3052477'>Pixabay</a>
    </strong>
</figcaption>

### Introduction <small id='intro'></small>

### Setup <small id='setup'></small>

```python
# Load necessary libraries
import pandas as pd
# fuzzywuzzy to be imported later

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
```

### How String Matching Is Performed <small id='comparison'></small>

To understand string matching, let's get you up to speed with Minimum Edit Distance. As humans, we have no trouble telling whether two or more strings are similar. To give computers the same ability, many algorithms have been created, and most of the popular ones build on the idea of Minimum Edit Distance.

Minimum Edit Distance (MED) is the smallest number of steps needed to turn one string into another. Depending on the algorithm, MED is calculated using up to 4 operations:
- Insertion
- Deletion
- Substitution
- Transposition (swapping two consecutive characters)
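
To make the operations concrete, here is a minimal sketch of each one on plain Python strings. The helper names (`insert`, `delete`, `substitute`, `transpose`) are made up for illustration and are not part of any library; the last one treats the operation on consecutive characters as a swap of two neighbors.

```python
# Hypothetical helpers, written only to illustrate the four edit operations.

def insert(s, i, ch):
    """Insert character ch before position i."""
    return s[:i] + ch + s[i:]

def delete(s, i):
    """Remove the character at position i."""
    return s[:i] + s[i + 1:]

def substitute(s, i, ch):
    """Replace the character at position i with ch."""
    return s[:i] + ch + s[i + 1:]

def transpose(s, i):
    """Swap the characters at positions i and i + 1."""
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]

print(insert("cat", 3, "s"))      # cats
print(delete("cats", 3))          # cat
print(substitute("cat", 0, "b"))  # bat
print(transpose("form", 1))       # from
```

Each call counts as a single step when measuring edit distance.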

Consider these two words: **Program** and **Sonogram**:
<img src='images/1.png'></img>

These two strings share the same five characters at the end ("ogram"). We can ignore them and focus on the beginning of the strings. To get from Program to Sonogram, we need 3 steps:
1. Add letter 'S' to the beginning of 'Program'.
2. Substitute 'P' with 'O'.
3. Substitute 'R' with 'N'.
<img src='images/2.png'></img>
<figcaption style="text-align: center;">
    <strong>
        Minimum Edit Distance of 3
    </strong>
</figcaption>
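
The count above can be checked with a few lines of code. Below is a minimal sketch of the classic dynamic-programming approach to Levenshtein distance (insertion, deletion, and substitution only). The function name `levenshtein` is my own, and this is not how `fuzzywuzzy` computes its scores internally; it is only meant to show the idea.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

print(levenshtein("program", "sonogram"))  # 3
```

The comparison is case-sensitive, so the words are lowercased here to match the example.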

As I said, there are many algorithms for measuring the distance or similarity between strings, each with its own set of allowed operations:
- Damerau-Levenshtein
- Levenshtein
- Hamming
- Jaro Distance

Also, there are packages that use these algorithms: `nltk`, `fuzzywuzzy`, `textdistance`, `difflib`, ...
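
Of these packages, `difflib` ships with Python's standard library, so it is the quickest one to try. The sketch below uses `difflib.SequenceMatcher` to score the same pair of words. Note that its `ratio()` is a similarity score between 0 and 1 based on matching blocks, not an edit distance, so treat it as a rough point of comparison only. The helper name `similarity` is made up for this example.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio (0.0 to 1.0) for two strings."""
    # The first argument is an optional "junk" filter; None disables it.
    return SequenceMatcher(None, a, b).ratio()

print(round(similarity("program", "sonogram"), 2))
print(similarity("program", "program"))
```

Identical strings score 1.0, and the score drops as fewer characters line up.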

In this article, we will cover `fuzzywuzzy`.

### FuzzyWuzzy: Installation <small id='install'></small>

Even though the basic installation can be done easily with `pip`, there are a few other options and caveats to `fuzzywuzzy`'s installation:

- Using PIP via PyPI (standard):

```pip install fuzzywuzzy```

The above method installs the latest version of the package. When I first installed it this way, every import produced a warning saying that the package is slow on its own and that I should install the `python-Levenshtein` package for more speed. If you hate warnings in your Jupyter Notebook like me, here is how you can install the extra dependency:
- Directly install `python-Levenshtein`:

```pip install python-Levenshtein```

or

```pip install fuzzywuzzy[speedup]```

**Warning for Windows users**: if you don't have the Microsoft Visual Studio build tools installed, installing `python-Levenshtein` can fail. You can download them from the [Visual Studio downloads page](https://visualstudio.microsoft.com/downloads/).
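
If you are not sure which of the two packages ended up in your environment, a quick check like the one below can help. It is a small sketch using the standard library's `importlib`; the helper name `is_installed` is made up for this example. Keep in mind that `python-Levenshtein` is imported under the module name `Levenshtein`.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

# python-Levenshtein is imported under the module name "Levenshtein"
for name in ("fuzzywuzzy", "Levenshtein"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```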

### FuzzyWuzzy: The Basics with WRatio <small id='wratio'></small>

### FuzzyWuzzy: Comparison of Different Methods <small id='methods'></small>

### Text Cleaning With FuzzyWuzzy On a Real Dataset <small id='real'></small>