# <p style="text-align: center;"> Part Six: Fuzzy Matching </p>

![title](https://miro.medium.com/max/1000/1*a_YDKmKItp5JJRehUrJc_w.png)


# <p style="text-align: center;"> Table of Contents </p>
- ## 1. [Introduction](#Introduction)
   - ### 1.1 [Abstract](#abstract)
   - ### 1.2 [Importing Libraries](#importing_libraries)
- ## 2. [Fuzzy Matching](#fz)
    - ### 2.1 [Ratio](#r)
    - ### 2.2 [Partial Ratio](#pr)
    - ### 2.3 [Token Sort Ratio](#tr)
    - ### 2.4 [Token Set Ratio](#ts)
    - ### 2.5 [Fuzzy Process Extract](#fz1)
- ## 3. [Conclusion](#Conclusion)
- ## 4. [Contribution](#Contribution)
- ## 5. [Citation](#Citation)
- ## 6. [License](#License)

# <p style="text-align: center;"> 1.0 Introduction </p> <a id='Introduction'> </a>

## 1.1 Abstract <a id="abstract"> </a>

As a data scientist, you often have to gather information from various sources, whether by using publicly available APIs, asking for data, or simply scraping it from a web page yourself. All this information is only useful if we can combine it without ending up with duplicates in the data. But how do we make sure that there are no duplicates?

I know … “Duh! You can just use a function that returns only the unique values, thus removing duplicates.” Well, that’s one way, but such a function can’t tell that a name like “Barack Obama” is the same as “Barack H. Obama”, right? (Assume we were retrieving the names of the most famous people in the world.) The two strings are clearly different, but they most likely refer to the same person. So, how do we match these names?

This is where fuzzy string matching comes in. This post explains what fuzzy string matching is, covers its use cases, and gives examples using the Python library FuzzyWuzzy.

![](https://miro.medium.com/max/784/1*VigSVyiXcvoGmNZJh4_gdA.gif)

##  1.2 Importing Libraries <a id="importing_libraries"> </a>

In [2]:
import numpy as np
import pandas as pd

## <p style="text-align: center;"> 2.0 Fuzzy Matching </p> <a id='fz'> </a>

We know we need to match records to identify duplicates and link records for entity resolution. But how exactly do we go about identifying matching records? What properties should we focus on?

#### Deterministic Data Matching
Let’s start with ‘unique identifiers’. These are properties of the records you want to match that are unlikely to change over time, such as a customer ID. You can assign weights to each property to improve your matching process. Think about it: if you are migrating customer data from one system to another and need to check for duplicates pre- and post-migration, you could, for instance, choose customer ID as the first unique identifier and phone number as the second. Now it’s just a matter of running a search for matching customer IDs and phone numbers, and you have all potential matches identified. That method is known as ‘deterministic data matching’.

#### Problems with Deterministic Data Matching?

Although effective in theory, the method is rarely used because of its inflexibility: The approach assumes that all entries are free of mistakes and standardized across systems – which is almost never the case in real-world linkage scenarios. 
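To make that limitation concrete, here is a minimal sketch of deterministic matching in plain Python. The records, the field names (`customer_id`, `phone`, `name`), and the helper `deterministic_matches` are all made up for illustration; none of them come from the dataset used later in this notebook.

```python
# Hypothetical records from two systems; every value is invented for illustration.
old_system = [
    {"customer_id": "C001", "phone": "555-0101", "name": "Barack Obama"},
    {"customer_id": "C002", "phone": "555-0102", "name": "Angela Merkel"},
]
new_system = [
    {"customer_id": "C001", "phone": "555-0101", "name": "Barack H. Obama"},
    # Same customer as C002 above, but the phone number is formatted differently
    {"customer_id": "C002", "phone": "5550102", "name": "Angela Merkel"},
]


def deterministic_matches(left, right, keys):
    """Return pairs of records whose values are exactly equal on every key."""
    index = {}
    for record in right:
        index.setdefault(tuple(record[k] for k in keys), []).append(record)
    pairs = []
    for record in left:
        for candidate in index.get(tuple(record[k] for k in keys), []):
            pairs.append((record, candidate))
    return pairs


matches = deterministic_matches(old_system, new_system, keys=("customer_id", "phone"))
print(len(matches))  # 1
```

Only `C001` is found. The second customer is missed purely because of a formatting difference in the phone number, which is exactly the kind of variation that exact matching cannot absorb.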

#### How do you go about determining a match when so many variations exist?

By performing probabilistic data matching, that’s how. More commonly known as ‘fuzzy matching’, this approach lets the user account for variations like spelling errors, nicknames, punctuation differences, and many more by combining a variety of algorithms.

### Fuzzy matching is a computer-assisted technique to score the similarity of data.

![](https://storage.ning.com/topology/rest/1.0/file/get/2808309149?profile=RESIZE_1024x1024)


Fuzzy matching is a technique used in computer-assisted translation as a special case of record linkage. It works with matches that may be less than 100% perfect when finding correspondences between segments of a text and entries in a database of previous translations. It usually operates at sentence-level segments, but some translation technology allows matching at a phrasal level. It is used when the translator is working with translation memory.

Here’s a list of the various fuzzy matching techniques that are in use today:

- Levenshtein Distance (or Edit Distance)
- Damerau-Levenshtein Distance
- Jaro-Winkler Distance
- Keyboard Distance
- Kullback-Leibler Distance
- Jaccard Index
- Metaphone 3
- Name Variant
- Syllable Alignment
- Acronym

#### We won't go into detail on each technique.
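As a single illustration, here is a minimal sketch of the first item in the list above, Levenshtein (edit) distance, written as a textbook dynamic-programming routine. The function name `levenshtein` is our own, and this is not the implementation used by any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A small distance means the strings are close; similarity scores like the ones used in the rest of this notebook rescale that idea to a 0 to 100 range.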

In [3]:
df_fuzzy_match=pd.read_csv("Datasets/room_type.csv")

In [4]:
df_fuzzy_match.head(3)

Unnamed: 0,Expedia,Booking.com
0,"Deluxe Room, 1 King Bed",Deluxe King Room
1,"Standard Room, 1 King Bed, Accessible",Standard King Roll-in Shower Accessible
2,"Grand Corner King Room, 1 King Bed",Grand Corner King Room


#### Importing Library for doing fuzzy match

In [5]:
from fuzzywuzzy import fuzz



#### Looks like we do not have that installed, so the module needs to be installed. Here is a command for that:

In [6]:
!pip3 install fuzzywuzzy



In [7]:
!python -m pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org fuzzywuzzy



##  2.1 RATIO - Compares the entire string similarity <a id="r"> </a>

The `ratio` function computes the standard Levenshtein distance similarity ratio between two sequences and returns it as a score from 0 to 100.

In [8]:
from fuzzywuzzy import fuzz
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)

95
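For intuition about where a score like 95 comes from, the sketch below reproduces it with Python's standard-library `difflib.SequenceMatcher`, whose `ratio()` is `2 * M / T` (`M` is the number of matched characters, `T` the total number of characters in both strings). Treat it as an approximation for illustration: the helper name `simple_ratio` is ours, and FuzzyWuzzy's own result can depend on which backend it has available.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


# 9 matching characters out of 10 + 9 = 19 in total: 2 * 9 / 19 ≈ 0.947
print(simple_ratio("apple inc.", "apple inc"))  # 95
```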


In [9]:
fuzz.ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

62

In [10]:
fuzz.ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

69

In [11]:
fuzz.ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

74

## 2.2  PARTIAL RATIO - Compares partial string similarity  <a id="pr">  </a>

It is a powerful function that lets us deal with more complex situations such as substring matching.
If the shorter string has length k and the longer string has length m, the algorithm seeks the score of the best-matching length-k substring of the longer string.

In [12]:
Str1 = "Los Angeles Lakers"
Str2 = "Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)

50
100
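The "best matching length-k substring" idea can be sketched as a brute-force sliding window. This is an assumption-laden illustration rather than FuzzyWuzzy's code: the name `simple_partial_ratio` is ours, the scoring uses the standard-library `difflib.SequenceMatcher`, and the library may pick its candidate windows differently, so scores may not agree on every input.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(s1: str, s2: str) -> int:
    """Best similarity (0-100) between the shorter string and any
    equally long substring of the longer string."""
    shorter, longer = sorted((s1, s2), key=len)
    k = len(shorter)
    if k == 0:
        return 0  # our own convention for empty input
    best = 0.0
    # Slide a window of length k across the longer string
    for start in range(len(longer) - k + 1):
        window = longer[start:start + k]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(simple_partial_ratio("los angeles lakers", "lakers"))  # 100
```

The window that lines up with the final six characters is an exact copy of "lakers", which is why the partial score is 100 even though the full-string ratio is only 50.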


In [13]:
fuzz.partial_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

69

In [14]:
fuzz.partial_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

83

In [15]:
fuzz.partial_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

63

## 2.3 TOKEN SORT RATIO - Ignores word order <a id="tr"> </a>

#### What happens when the strings being compared contain the same words, but in a different order?

The fuzz.token functions have an important advantage over ratio and partial_ratio. They tokenize the strings and preprocess them by turning them to lower case and getting rid of punctuation. In the case of fuzz.token_sort_ratio(), the string tokens get sorted alphabetically and then joined together. After that, a simple fuzz.ratio() is applied to obtain the similarity percentage.

In [16]:
Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)


59
74
100
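The tokenize, sort, and compare steps described above can be sketched with the standard library alone. The helper names `normalize_tokens` and `simple_token_sort_ratio` are ours, and the ASCII-only regular expression is a simplifying assumption; FuzzyWuzzy's own preprocessing may treat some characters differently.

```python
import re
from difflib import SequenceMatcher


def normalize_tokens(text: str) -> list:
    """Lowercase, turn non-alphanumeric characters into spaces, split into tokens."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def simple_token_sort_ratio(s1: str, s2: str) -> int:
    """Compare two strings after sorting their tokens alphabetically."""
    sorted1 = " ".join(sorted(normalize_tokens(s1)))
    sorted2 = " ".join(sorted(normalize_tokens(s2)))
    return round(100 * SequenceMatcher(None, sorted1, sorted2).ratio())


# Both inputs become "nixon states united v" once sorted
print(simple_token_sort_ratio("united states v. nixon", "Nixon v. United States"))  # 100
```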


In [17]:
fuzz.token_sort_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

84

In [18]:
fuzz.token_sort_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

78

In [19]:
fuzz.token_sort_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

83

##  2.4 TOKEN SET RATIO - Ignores duplicate words and, like token sort ratio, word order <a id="ts"> </a>

#### What happens if these two strings are of widely differing lengths? 

That's where fuzz.token_set_ratio() comes in.
Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

- s1 = Sorted_tokens_in_intersection
- s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens
- s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens

The logic behind these comparisons is that since Sorted_tokens_in_intersection is always the same, the score will tend to go up as these words make up a larger chunk of the original strings or the remaining tokens are closer to each other.


In [20]:
Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

57
77
58
95


In [21]:
fuzz.token_set_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

100

In [22]:
fuzz.token_set_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

78

In [23]:
fuzz.token_set_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

97

#### As TOKEN SET RATIO is the best for this dataset, let's explore it a bit more.

In [24]:
def get_ratio(row):
    name1 = row['Expedia']
    name2 = row['Booking.com']
    return fuzz.token_set_ratio(name1, name2)

rated = df_fuzzy_match.apply(get_ratio, axis=1)
rated.head(10)

0    100
1     81
2    100
3    100
4    100
5     78
6     72
7    100
8    100
9     97
dtype: int64

#### Which ones got a set ratio greater than 70%?

In [25]:
greater_than_70_percent = df_fuzzy_match[rated > 70]
greater_than_70_percent.count()

Expedia        93
Booking.com    93
dtype: int64

In [26]:
greater_than_70_percent.head(10)

Unnamed: 0,Expedia,Booking.com
0,"Deluxe Room, 1 King Bed",Deluxe King Room
1,"Standard Room, 1 King Bed, Accessible",Standard King Roll-in Shower Accessible
2,"Grand Corner King Room, 1 King Bed",Grand Corner King Room
3,"Suite, 1 King Bed (Parlor)",King Parlor Suite
4,"High-Floor Premium Room, 1 King Bed",High-Floor Premium King Room
5,"Traditional Double Room, 2 Double Beds",Double Room with Two Double Beds
6,"Room, 1 King Bed, Accessible",King Room - Disability Access
7,"Deluxe Room, 1 King Bed",Deluxe King Room
8,Deluxe Room,Deluxe Room (Non Refundable)
9,"Room, 2 Double Beds (19th to 25th Floors)",Two Double Beds - Location Room (19th to 25th ...


In [27]:
len(greater_than_70_percent) / len(df_fuzzy_match)

0.9029126213592233

##### More than 90% of the records have a score greater than 70%.

In [28]:
# Threshold lowered to 60, so use a name that matches the contents
greater_than_60_percent = df_fuzzy_match[rated > 60]
greater_than_60_percent.count()


Expedia        101
Booking.com    101
dtype: int64

In [29]:
len(greater_than_60_percent) / len(df_fuzzy_match)

0.9805825242718447

##### And more than 98% of the records have a score greater than 60%.
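The two threshold checks above repeat the same arithmetic, so a small helper can make it easier to try other cut-offs. The sketch below uses only the ten scores printed by `rated.head(10)`, not the full dataset, and the helper name `share_above` is our own.

```python
def share_above(scores, threshold):
    """Fraction of scores strictly greater than threshold."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(score > threshold for score in scores) / len(scores)


# The first ten token_set_ratio scores shown earlier in this section
first_ten = [100, 81, 100, 100, 100, 78, 72, 100, 100, 97]

for threshold in (70, 80, 90):
    print(threshold, share_above(first_ten, threshold))
```

Passing the full `rated` Series instead of `first_ten` should give the same fractions as the divisions in the cells above.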

##  2.5 Fuzzy Process Extract <a id="fz1" > </a>

FuzzyWuzzy also provides a module called `process`, which lets you find the string with the highest similarity in a list of candidate strings.


In [30]:
from fuzzywuzzy import process
str2Match = "apple inc"
strOptions = ["Apple Inc.","apple park","apple incorporated","iphone"]
Ratios = process.extract(str2Match,strOptions)
print(Ratios)
# You can also select the string with the highest matching percentage
highest = process.extractOne(str2Match,strOptions)
print(highest)


[('Apple Inc.', 100), ('apple incorporated', 90), ('apple park', 67), ('iphone', 40)]
('Apple Inc.', 100)
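Conceptually, this kind of extraction is "score every candidate, then sort". The sketch below shows that idea with the standard library only. The names `simple_ratio` and `simple_extract` are ours, and the scorer is a plain case-insensitive ratio, which is an assumption: the library's default scorer is a weighted combination of several ratios, so its numbers differ (100 rather than 95 for the top match here).

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Case-insensitive similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, s1.lower(), s2.lower()).ratio())


def simple_extract(query, choices, scorer=simple_ratio, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


options = ["Apple Inc.", "apple park", "apple incorporated", "iphone"]
ranked = simple_extract("apple inc", options)
print(ranked[0])  # ('Apple Inc.', 95)
```

Because the scorer is a parameter, the same helper can rank candidates with any of the ratio variants discussed in the earlier sections.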


# <p style="text-align: center;"> 3.0 Conclusion </p> <a id="Conclusion"> </a>

The world of fuzzy string matching has come a long way. There are many more advanced approaches that build on these concepts in their fuzzy string searches, and there is still room to make them more efficient.

# <p style="text-align: center;"> 4.0 Contribution</p> <a id='Contribution'> </a>

 

This was a fun project in which we explored the ideas of data cleaning and data preprocessing. We took inspiration from a Kaggle learning course and created our own notebook, building on the same idea and supplementing it with contributions from our own experience and past projects.
       
- Code by self : 65%
- Code from external Sources : 35%

# <p style="text-align: center;"> 5.0 Citations </p> <a id="Citation"> </a>

- https://dataladder.com/fuzzy-matching-101/
- https://medium.com/tim-black/fuzzy-string-matching-at-scale-41ae6ac452c2
- https://www.kaggle.com/leandrodoze/fuzzy-string-matching-with-hotel-rooms
- https://medium.com/@julientregoat/an-introduction-to-fuzzy-string-matching-178805cca2ab

# <p style="text-align: center;"> 6.0 License </p> <a id="License"> </a>

Copyright (c) 2020 Manali Sharma, Rushabh Nisher

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.