# Lecture 4: Min-Hashing

## Overview

1. Using PyMinHash to find matches of strings.
2. A more detailed example showing the implementation of Min-Hashing.


This example shows how to use PyMinHash to find matches of strings.

First, import Pandas and fix some settings.

In [9]:
%config Completer.use_jedi = False

import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 200)

PyMinHash comes with a toy dataset containing various name and address combinations of Stoxx50 companies.

In [10]:
from pyminhash.datasets import load_data
df = load_data()
df.head()

| | name |
|---|---|
| 0 | adidas ag adi dassler strasse 1 91074 germany |
| 1 | adidas ag adi dassler strasse 1 91074 herzogenaurach |
| 2 | adidas ag adi dassler strasse 1 91074 herzogenaurach germany |
| 3 | airbus se 2333 cs leiden netherlands |
| 4 | airbus se 2333 cs netherlands |


We're going to match various representations that belong to the same company. For this, we import the `MinHash` class, create a `MinHash` object, and tell it to use 10 hash tables. More hash tables mean a more accurate Jaccard similarity estimate but also require more time and memory.

In [11]:
from pyminhash.pyminhash import MinHash
myHasher = MinHash(n_hash_tables=10)
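
To make the accuracy versus cost trade-off concrete, here is a minimal from-scratch sketch of the MinHash idea. It is not PyMinHash's actual implementation: it assumes whitespace-separated word tokens and uses seeded BLAKE2b hashes from `hashlib`, one seed per hash table. The names `token_hash`, `minhash_signature`, and `estimate_jaccard` are hypothetical.

```python
import hashlib


def token_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; each seed acts as one hash table."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, n_hash_tables: int = 10) -> list:
    """One minimum hash value per table, taken over the word set of `text`.

    Assumes `text` contains at least one word.
    """
    tokens = set(text.split())
    return [min(token_hash(t, seed) for t in tokens) for seed in range(n_hash_tables)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of tables in which both signatures have the same minimum."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_a = minhash_signature("airbus se 2333 cs leiden netherlands", n_hash_tables=10)
sig_b = minhash_signature("airbus se 2333 cs netherlands", n_hash_tables=10)
est = estimate_jaccard(sig_a, sig_b)
print(est)  # a multiple of 1/10
```

Because the estimate is a count of matching tables divided by `n_hash_tables`, adding tables gives finer steps and a lower-variance estimate, at the cost of computing more hashes per string.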

The `fit_predict` method needs the dataframe and the name of the column to which minhashing should be applied. The result is a dataframe containing all pairs that have a non-zero Jaccard similarity:

In [12]:
result = myHasher.fit_predict(df, 'name')
result.head()

| | row_number_1 | row_number_2 | name_1 | name_2 | jaccard_sim |
|---|---|---|---|---|---|
| 0 | 0 | 1 | adidas ag adi dassler strasse 1 91074 germany | adidas ag adi dassler strasse 1 91074 herzogenaurach | 1.0 |
| 296 | 32 | 33 | bayerische motoren werke aktiengesellschaft petuelring 130 80788 munich | bayerische motoren werke aktiengesellschaft petuelring 130 munich germany | 1.0 |
| 1 | 0 | 2 | adidas ag adi dassler strasse 1 91074 germany | adidas ag adi dassler strasse 1 91074 herzogenaurach germany | 1.0 |
| 588 | 24 | 25 | banco santander s a 28660 madrid | banco santander s a 28660 madrid spain | 1.0 |
| 593 | 12 | 13 | anheuser busch inbev sa nv 3000 leuven belgium | anheuser busch inbev sa nv brouwerijplein 1 3000 leuven belgium | 1.0 |
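
One way to see why only pairs with a non-zero similarity show up in such a result is the sketch below. It is an illustrative assumption about how candidate pairs can be generated, not PyMinHash's own code: for each hash table, rows are bucketed by their minimum hash value, and two rows form a pair only if they share a bucket in at least one table. The helper names `token_hash` and `find_pairs` are hypothetical, and word tokens are assumed.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def token_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; each seed acts as one hash table."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def find_pairs(names: list, n_hash_tables: int = 10) -> list:
    """Return (row_number_1, row_number_2, estimated_jaccard) for colliding rows.

    Assumes every name contains at least one word.
    """
    shared = defaultdict(int)  # (i, j) -> number of tables where rows i and j collide
    for seed in range(n_hash_tables):
        buckets = defaultdict(list)  # minimum hash value -> rows with that minimum
        for row, name in enumerate(names):
            min_hash = min(token_hash(t, seed) for t in set(name.split()))
            buckets[min_hash].append(row)
        for rows in buckets.values():
            # Rows are appended in increasing order, so i < j in every pair.
            for i, j in combinations(rows, 2):
                shared[(i, j)] += 1
    pairs = [(i, j, count / n_hash_tables) for (i, j), count in shared.items()]
    # Highest estimated similarity first.
    return sorted(pairs, key=lambda p: -p[2])


names = [
    "adidas ag adi dassler strasse 1 91074 germany",
    "adidas ag adi dassler strasse 1 91074 herzogenaurach",
    "airbus se 2333 cs leiden netherlands",
    "airbus se 2333 cs netherlands",
]
for row_1, row_2, sim in find_pairs(names):
    print(row_1, row_2, sim)
```

Pairs that never land in the same bucket are simply never created, which avoids comparing every row with every other row.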


As one can see below, pairs with a Jaccard similarity of 1.0 share almost all of their words; in most cases every word of the shorter string also appears in the longer one. For lower Jaccard similarity values, the match is less than perfect. Note that the reported Jaccard similarity is a MinHash estimate with a granularity of 1/n_hash_tables, in this example 0.1.

In [13]:
result.groupby('jaccard_sim').head(2)

| | row_number_1 | row_number_2 | name_1 | name_2 | jaccard_sim |
|---|---|---|---|---|---|
| 0 | 0 | 1 | adidas ag adi dassler strasse 1 91074 germany | adidas ag adi dassler strasse 1 91074 herzogenaurach | 1.0 |
| 296 | 32 | 33 | bayerische motoren werke aktiengesellschaft petuelring 130 80788 munich | bayerische motoren werke aktiengesellschaft petuelring 130 munich germany | 1.0 |
| 351 | 55 | 56 | engie sa 1 place samuel de champlain 92400 courbevoie | engie sa 1 place samuel de champlain 92400 france | 0.9 |
| 78 | 62 | 64 | fresenius se co kgaa else kroner strasse 1 61352 bad homburg vor der hohe germany | fresenius se co kgaa else kroner strasse 1 bad homburg vor der hohe germany | 0.9 |
| 581 | 23 | 24 | banco santander s a 28660 | banco santander s a 28660 madrid | 0.8 |
| 783 | 83 | 84 | l air liquide s a 75007 paris france | l air liquide s a paris france | 0.8 |
| 306 | 92 | 93 | munchener ruckversicherungs gesellschaft aktiengesellschaft koniginstrasse 107 | munchener ruckversicherungs gesellschaft aktiengesellschaft koniginstrasse 107 80802 munich | 0.7 |
| 808 | 90 | 91 | lvmh moet hennessy louis vuitton societe europeenne 22 avenue montaigne paris | lvmh moet hennessy louis vuitton societe europeenne 75008 france | 0.7 |
| 159 | 4 | 5 | airbus se 2333 cs netherlands | airbus se leiden netherlands | 0.6 |
| 23 | 39 | 41 | daimler ag 70372 stuttgart germany | daimler ag mercedesstrasse 120 70372 stuttgart germany | 0.6 |
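
For comparison with the values above, the exact Jaccard similarity of two word sets is the size of their intersection divided by the size of their union. The sketch below computes it for three of the listed pairs. It assumes whitespace-separated word tokens, which may differ from the tokenization PyMinHash applies; `exact_jaccard` is a hypothetical helper.

```python
def exact_jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity of the word sets of two strings.

    Assumes at least one of the strings contains a word.
    """
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


pairs = [
    ("adidas ag adi dassler strasse 1 91074 germany",
     "adidas ag adi dassler strasse 1 91074 herzogenaurach"),
    ("banco santander s a 28660 madrid",
     "banco santander s a 28660 madrid spain"),
    ("airbus se 2333 cs netherlands",
     "airbus se leiden netherlands"),
]
for a, b in pairs:
    print(round(exact_jaccard(a, b), 3))
# prints 0.778, 0.857 and 0.5 (one value per line)
```

Under this word-token assumption the exact values (7/9, 6/7, and 3/6) differ from the reported 1.0, 1.0, and 0.6, which is consistent with the reported numbers being estimates from only 10 hash tables.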
