# Implementing LSH in Python

This tutorial is adapted from [a LearnDataSci tutorial](https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/) by **Ray McLendon**. The original tutorial reports the top-k nearest neighbors to the user using a MinHash Forest. In this tutorial, however, we are more interested in seeing how MinHash and LSH work.

We will use the `MinHash` and `MinHashLSH` classes from the datasketch library, which are documented [here](http://ekzhu.com/datasketch/documentation.html).

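`MinHash` is the MinHash sketch we learned: a fixed-size signature of a set from which the Jaccard similarity of two sets can be estimated.

`MinHashLSH` is the LSH index built over those MinHash signatures.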
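To make the idea behind `MinHash` concrete, here is a minimal pure-Python sketch of the technique. It does not use datasketch, and it is not how datasketch implements it: the function names and the salted BLAKE2 hashing are assumptions made only for illustration. Each "permutation" is simulated by a differently salted hash function, and the fraction of signature positions on which two sets agree estimates their Jaccard similarity.

```python
import hashlib

def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token, salted by seed (illustrative choice)."""
    digest = hashlib.blake2b(
        token.encode("utf8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(tokens, num_perm=128):
    """For each of num_perm hash functions, keep the minimum hash over the set."""
    token_set = set(tokens)
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox leaps over a lazy dog".split()

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
print("exact Jaccard:   ", round(exact_jaccard(doc_a, doc_b), 3))
print("MinHash estimate:", round(estimate_jaccard(sig_a, sig_b), 3))
```

The estimate will usually be close to, but not exactly equal to, the exact value; the signature trades some accuracy for a fixed, small size.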

# Step 1: Load Python Packages



In [1]:
import numpy as np
import pandas as pd
import re
import time
from datasketch import MinHash, MinHashLSH

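If this cell fails with `ModuleNotFoundError: No module named 'datasketch'`, install the package first (for example with `pip install datasketch`) and run the cell again.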

# Step 2: Exploring Your Data

Our goal in this tutorial is to recommend conference papers by using LSH to quickly query all of the known conference papers. As a general rule, you should always examine your data first. You need a thorough understanding of the dataset in order to pre-process it properly and to determine the best parameters. Any guideline for selecting parameters depends on this kind of exploration of your dataset.


For the purposes of this tutorial, we will be working with an easy dataset: the Neural Information Processing Systems (NIPS) conference papers, hosted on Kaggle. You can find them [here](http://www.kaggle.com/benhamner/nips-papers).


An initial data exploration of these papers can be found [here](http://www.kaggle.com/benhamner/exploring-the-nips-papers).


# Step 3: Preprocess your data

For the purposes of this article, we will only use a rough version of unigrams as our shingles. We use the following steps:


1. Remove all punctuation.
1. Lowercase all text.
1. Create unigram shingles (tokens) by splitting on whitespace.

For better results, you may try using a natural language processing library like NLTK or spaCy to produce unigrams and bigrams, remove stop words, and perform lemmatization.
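As a small illustration of the bigram suggestion, the sketch below generalizes the same punctuation-stripping and lowercasing steps to word n-grams using only the standard library. The `shingles` helper is a hypothetical name introduced here; stop-word removal and lemmatization are not shown because they need a library such as NLTK or spaCy.

```python
import re

def shingles(text, n=1):
    """Return word n-gram shingles after removing punctuation and lowercasing."""
    words = re.sub(r"[^\w\s]", "", text).lower().split()
    # Slide a window of n words over the token list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "The devil went down to Georgia"
print("unigrams:", shingles(text, n=1))
print("bigrams: ", shingles(text, n=2))
# bigrams: ['the devil', 'devil went', 'went down', 'down to', 'to georgia']
```

Bigrams preserve some word order, so two texts that merely share a vocabulary look less similar than they do with unigrams.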

In [None]:
#Preprocess will split a string of text into individual tokens/shingles based on whitespace.
def preprocess(text):
    text = re.sub(r'[^\w\s]','',text)
    tokens = text.lower()
    tokens = tokens.split()
    return tokens

In [None]:
text = 'The devil went down to Georgia'
print('The shingles (tokens) are:', preprocess(text))

# Step 4: Choose your parameters
To start our example, we will use 128 permutations, which is also the default in datasketch. We will also start by just making one recommendation.

In [None]:
#Number of Permutations
permutation = 128
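One way to reason about the number of permutations is through the accuracy of the similarity estimate. Under the idealized assumption that every hash function behaves like an independent random permutation, each signature position matches with probability equal to the true Jaccard similarity J, so the estimate has standard error sqrt(J(1 - J) / k) for k permutations. The helper below is a hypothetical illustration of that formula, not part of datasketch.

```python
import math

def minhash_std_error(jaccard, num_perm):
    """Standard error of the MinHash estimate under the idealized binomial model."""
    return math.sqrt(jaccard * (1.0 - jaccard) / num_perm)

# The error is largest at J = 0.5, so use that as the worst case.
for k in (32, 64, 128, 256, 512):
    print(f"num_perm={k:4d}  std. error at J=0.5: {minhash_std_error(0.5, k):.3f}")
```

Quadrupling the number of permutations only halves the error, while the signature size and hashing cost grow linearly, which is why a moderate value such as 128 is a common starting point.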

# Step 5: Define the minHash function

We define a function called `myMinHash` that adds all tokens to a `MinHash` object.


In [None]:
def myMinHash(tokens, perms):
    m = MinHash(num_perm=perms)
    for s in tokens:
        m.update(s.encode('utf8'))
    return m



# Step 6: Evaluate Queries

We will start by loading the CSV containing all the conference papers and creating a new field that holds our shingles. The code below builds the shingles from the title only; to use both title and abstract, combine the two columns into one field before preprocessing.

Finally, we can query any string of text such as a title or general topic to return a list of recommendations. Note, for our example below, we have actually picked the title of a conference paper. Naturally, we get the exact paper as one of our recommendations. We will follow the steps below:

1. Preprocess your text into shingles.
1. Set the same number of permutations for MinHash.
1. Build the `MinHashLSH` index from all the MinHash objects.
1. Query the LSH index with your MinHash to retrieve the candidate matches.
1. Provide the titles of each conference paper recommended.

In [None]:
db = pd.read_csv('papers.csv')

#Prepare a field called data that contains the shingles of the title
db['data'] = db['title'].apply(preprocess)

#Create a MinHash for each row and save it into the column mhash
db['mhash'] = db['data'].apply(myMinHash, args=(permutation,))

#Create a MinHashLSH object called lsh with 32 bands of 4 rows each (32 * 4 = 128 permutations)
lsh = MinHashLSH(num_perm=permutation, params=(32, 4))

#Insert every paper into the index, keyed by its id
for paper_id, paper_mhash in zip(db['id'], db['mhash']):
    lsh.insert(paper_id, paper_mhash)
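To see what 32 bands of 4 rows means in practice, the sketch below evaluates the standard banding formula: two sets with Jaccard similarity s become candidates (share at least one identical band) with probability 1 - (1 - s^r)^b, and the similarity at which this curve rises most steeply is approximately (1/b)^(1/r). The function names are hypothetical, and the formula assumes idealized, independent hash functions.

```python
def candidate_probability(s, bands, rows):
    """Probability that two sets with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

def approximate_threshold(bands, rows):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)

bands, rows = 32, 4
print(f"approximate threshold: {approximate_threshold(bands, rows):.2f}")
for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s, bands, rows):.3f}")
```

With these parameters, pairs well above a similarity of roughly 0.4 are very likely to be returned by a query, while pairs well below it are rarely returned. More rows per band push the threshold up, and more bands push it down.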

In [None]:

# Create some titles that are similar to a real paper title and see whether the index retrieves the near-identical papers
titles = ['Self-Organization of Associative Database and Its Applications',
          'Self-Organization of Associative Database and  Applications',
          'Self-Organization of Associative Database ants Applications',
          'Self-Organization of Associative Database d Its Applications',
          'Self-Organization of Associative Datase and Its Applications',
          'Self-Organization of and Its Applications'
         ]

for title in titles:
    a = preprocess(title)
    mhash = myMinHash(a, permutation)

    u = lsh.query(mhash)
    
    idx_array = np.array(u)
    result = db.loc[db['id'].isin(idx_array)]['title']
    
    print('\n\nTitle', title, '\t number of matches:', len(result), '\n---------------------------')
    for i in result:
        print(i)
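To interpret which of the perturbed titles are retrieved, it can help to compute the exact Jaccard similarity between each variant and the unmodified title, since that is the quantity MinHash estimates. The sketch below is self-contained and does not query the index; it assumes the first title in the cell above is the unmodified reference, and the `jaccard` helper is a hypothetical name.

```python
import re

def preprocess(text):
    """Same preprocessing as in Step 3: strip punctuation, lowercase, split on whitespace."""
    text = re.sub(r"[^\w\s]", "", text)
    return text.lower().split()

def jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity between two token lists, treated as sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

original = "Self-Organization of Associative Database and Its Applications"
variants = [
    "Self-Organization of Associative Database and  Applications",
    "Self-Organization of Associative Database ants Applications",
    "Self-Organization of and Its Applications",
]

reference = preprocess(original)
for variant in variants:
    print(f"{jaccard(reference, preprocess(variant)):.3f}  {variant}")
# 0.857  (one token dropped: 6 shared out of 7)
# 0.625  (two tokens replaced by one: 5 shared out of 8)
# 0.714  (two tokens dropped: 5 shared out of 7)
```

Because titles contain only a handful of tokens, changing a single word moves the similarity a lot, so whether a variant is returned depends on where its similarity falls relative to the threshold implied by the band and row settings.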
    