## Fuzzy Search in Python with FuzzyWuzzy

Let's experiment with fuzzy text search in Python using the `fuzzywuzzy` package!

This notebook is part of the [Fuzzy Search - Buscando texto por aproximação](http://datenworks.com) blog post on the [Datenworks Medium Blog](http://medium.com/datenworks).

### Setup

To get started, install `fuzzywuzzy` in your local environment with `pip`. The `speedup` extra also pulls in `python-Levenshtein`, which makes the comparisons faster:

In [1]:
!pip install "fuzzywuzzy[speedup]"

If all goes well, you can import the core `fuzzywuzzy` modules, as listed below:

In [2]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

### Let's roll!

The most basic way to use `fuzzywuzzy` is the `ratio` function, which measures the similarity between two strings by calculating a "distance" between them and returns a score from 0 (nothing in common) to 100 (identical). To try it, call the function on two sentences:

In [3]:
fuzz.ratio("This is my first sentence","This is my first sentence.")

98
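To make the score less of a black box, here is a minimal sketch of how a 0 to 100 similarity can be derived with the standard library's `difflib.SequenceMatcher`. This is an illustration, not the library's own code: `simple_ratio` is a hypothetical name, and the assumption is that the score is the matcher's ratio scaled to 100 and rounded.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100 (illustrative sketch)."""
    # SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched
    # characters and T is the combined length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


# 25 of 25 characters match and the combined length is 51: 2*25/51 ≈ 0.98
print(simple_ratio("This is my first sentence", "This is my first sentence."))  # 98
```

The single extra character in the second sentence is enough to pull the score below 100, because the combined length grows while the number of matched characters stays the same.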

According to the `ratio` function, the two sentences above are 98% similar; the only difference is the trailing period. For situations like this there is a function called `partial_ratio`, which scores the shorter string against the best-matching substring of the longer one:

In [4]:
fuzz.partial_ratio("This is my first sentence","This is my first sentence.")

100
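The idea of partial matching can be sketched with a brute-force sliding window: compare the shorter string with every same-length slice of the longer one and keep the best score. This is a simplified illustration under that assumption, not how the library selects its candidate windows, and `simple_partial_ratio` is a hypothetical name.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    """Best similarity (0 to 100) of the shorter string against any
    same-length window of the longer string (brute-force sketch)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # nothing to match against
    best = 0.0
    # Slide a window of len(shorter) characters across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


# The first window of the longer sentence equals the shorter sentence exactly.
print(simple_partial_ratio("This is my first sentence", "This is my first sentence."))  # 100
```

Because the trailing period falls outside the best window, it no longer affects the score.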

There are situations where terms (words or tokens) are the same between sentences, but the order of terms may vary:

In [5]:
fuzz.ratio("São Clemente ganhou o Carnaval", "São Clemente o Carnaval ganhou")

77

Both sentences mean the same thing, but because the terms appear in a different order, the score does not reflect that.

Now try the `token_sort_ratio` function, which sorts the tokens before comparing them, to get a more accurate output:

In [6]:
fuzz.token_sort_ratio("São Clemente ganhou o Carnaval", "São Clemente o Carnaval ganhou")

100
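A rough sketch of the token-sort idea, using only the standard library: lowercase each sentence, split it on whitespace, sort the tokens, rejoin them, and compare the results. The helper name `simple_token_sort_ratio` is hypothetical, and the normalization here is an assumption that is deliberately simpler than whatever preprocessing the library applies.

```python
from difflib import SequenceMatcher


def simple_token_sort_ratio(a: str, b: str) -> int:
    """Similarity (0 to 100) after sorting each string's tokens (sketch)."""

    def normalize(text: str) -> str:
        # Lowercase, split on whitespace, sort the tokens, and rejoin them.
        return " ".join(sorted(text.lower().split()))

    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


# Both sentences normalize to "carnaval clemente ganhou o são".
print(simple_token_sort_ratio("São Clemente ganhou o Carnaval",
                              "São Clemente o Carnaval ganhou"))  # 100
```

Once the tokens are sorted, word order no longer influences the comparison, so the two sentences become identical strings.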

What if you have to deal with duplicate terms? Plain `ratio` penalizes the repetition, while the `token_set_ratio` function compares the sets of unique tokens and handles these cases well:

In [7]:
fuzz.ratio("São Clemente ganhou o Carnaval", "São Clemente ganhou ganhou o Carnaval")

90

In [8]:
fuzz.token_set_ratio("São Clemente ganhou o Carnaval", "São Clemente ganhou ganhou o Carnaval")

100
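The token-set idea can be sketched as follows: reduce each sentence to its set of unique tokens, split those into the shared tokens and the tokens unique to each side, and take the best score among the resulting combinations. This is an approximation for illustration only; `simple_token_set_ratio` and `_ratio` are hypothetical helpers built on `difflib`.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    # Scaled difflib similarity, 0 to 100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_token_set_ratio(a: str, b: str) -> int:
    """Similarity (0 to 100) based on sets of unique tokens (sketch)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())

    # Shared tokens, plus the tokens that appear on one side only.
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))

    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()

    # Keep the most favorable of the three pairwise comparisons.
    return max(
        _ratio(common, combined_a),
        _ratio(common, combined_b),
        _ratio(combined_a, combined_b),
    )


# The duplicated "ganhou" disappears once tokens are treated as a set.
print(simple_token_set_ratio("São Clemente ganhou o Carnaval",
                             "São Clemente ganhou ganhou o Carnaval"))  # 100
```

Since both sentences contain exactly the same unique tokens, every comparison is between identical strings and the score is 100.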

Finally, one of `fuzzywuzzy`'s most powerful tools is the `process` module, which matches a search term against a list of options and returns the closest options together with their respective scores. For example:

In [9]:
options = ["Futbol Club Barcelona", "Real Madrid Club de Fútbol", "Valencia Club de Fútbol", "Real Sociedad de Fútbol"]

In [10]:
process.extract("real futbol", options, limit=2)

[('Futbol Club Barcelona', 86), ('Real Madrid Club de Fútbol', 86)]

If you want only the best (closest) option, just use the `extractOne` function:

In [11]:
process.extractOne("real", options)

('Real Madrid Club de Fútbol', 90)