# Lecture 6: Theory of Locality-Sensitive Hashing

## Overview

Locality-Sensitive Hashing (LSH) is an approximate nearest neighbor search technique aimed at large, high-dimensional datasets. Exact methods take a long time to find nearest neighbors in such data, a problem that often comes up in recommender systems, data mining, and similar applications. Linear search is practical for low-dimensional datasets, but it becomes inefficient when the data is high-dimensional. To improve speed while keeping accuracy acceptable, LSH uses a particular type of hash function designed so that similar data points are likely to receive the same hash value.

The current LSH algorithm still has some challenges. For example, when LSH is used to find similarities between documents, the hash tables have to be rebuilt every time a document is added, which is costly and inefficient. Our objective is to improve the efficiency and accuracy of LSH.

In this project, we propose an LSH-based algorithm to improve hashing techniques and compression performance. The improved LSH is applied to an image retrieval application. The process includes feature extraction, result comparison, and evaluation, which are explained in detail in the rest of this report.

We will compare the results of our approach under different parameter settings. The final outcome is evaluated by accuracy for each setting.

# LSH Algorithm Improvement By Applying Bitmap Indexing

In [None]:
import argparse
import sys
from os import listdir
from os.path import isfile, join
from typing import Dict, List, Optional, Tuple
import imagehash
from PIL import Image
import os, os.path
import cv2
from collections import Counter
import scipy as sp
import numpy as np # Import numpy library 
from skimage.feature import hog # Import Hog model to extract features
from sklearn.metrics import confusion_matrix # Import confusion matrix to evaluate the performance
import pandas as pd
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
from pyspark.conf import SparkConf
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql import SparkSession
from sklearn.model_selection import train_test_split

In [None]:
imgs = []
y = []
file_size = []
k = 0
path = "./data/101_ObjectCategories" # Give the dataset path here

## Data Preprocessing
1. Load the images using cv2
2. Resize the images
3. Feature extraction: BGR to grayscale conversion
4. Feature extraction: Histogram of Oriented Gradients (HOG)

In [None]:
folder = os.listdir(path)  # category directory names such as accordion, airplanes, ...
for file in folder:  # visit every category directory and read its images
    subpath = os.path.join(path, file)  # full path of the category directory
    files = os.listdir(subpath)  # image file names in this category
    for name in files:
        im = cv2.imread(os.path.join(subpath, name))  # returns None if the file cannot be read
        imgs.append(im)  # collect every image in the list imgs
        y.append(k)  # the category index is the label of every image in it
    if files:
        file_size.append(len(files))  # number of files in this category
    k += 1

# Drop unreadable images (cv2.imread returned None) together with their labels.
# Building new lists avoids deleting from a list while iterating over it,
# which skips elements and can misalign images and labels.
ix = [index for index, item in enumerate(imgs) if item is None]
keep = [index for index, item in enumerate(imgs) if item is not None]
imgs = [imgs[index] for index in keep]
y = [y[index] for index in keep]

y = np.array(y).astype(np.float64)

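# Resize an image to 256x256 pixels
def resize_(image):
    u = cv2.resize(image,(256,256))
    return u

# Convert an image from color (BGR, as loaded by cv2) to grayscale
def rgb2gray(rgb):
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    return gray

# A feature descriptor simplifies the representation of an image by extracting
# useful information and discarding irrelevant information. The HOG feature
# descriptor can convert a 3-channel color image into a feature vector of a
# fixed length (here it is applied to the grayscale image).
def fd_hog(image):
    fd = hog(image, orientations=8, pixels_per_cell=(64, 64),block_norm ='L2', cells_per_block=(2, 2))
    return fd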

In [None]:
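import progressbar

a = []  # HOG feature vector of every image
with progressbar.ProgressBar(max_value=len(imgs)) as bar:
    for i, img in enumerate(imgs, start=1):
        b = resize_(img)   # resize to 256x256
        c = rgb2gray(b)    # convert to grayscale
        d = fd_hog(c)      # extract the HOG feature vector
        a.append(d)
        bar.update(i)
# One row per image: the HOG features, followed by the label and a 1-based id
df = pd.DataFrame(a)
df['label'] = y
id_ = np.arange(1,len(df)+1,1)
df['id'] = id_
X = df.values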

# init spark
spark = SparkSession.builder \
 .master("local") \
 .appName("Image Retrieval") \
 .config("spark.some.config.option", "some-value") \
 .getOrCreate()

100% (9176 of 9176) |####################| Elapsed Time: 0:02:41 Time:  0:02:41


In [None]:
print("HOG dimension: ")
len(a[0])

HOG dimension: 


288
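
The value 288 follows from the HOG parameters used in `fd_hog`. The sketch below reproduces the arithmetic, assuming blocks slide one cell at a time over the cell grid; the helper name `hog_feature_length` is hypothetical and not part of skimage.

```python
def hog_feature_length(image_size, pixels_per_cell, cells_per_block, orientations):
    """Length of a HOG descriptor for a 2-D image (blocks step one cell at a time)."""
    # Number of cells along each image axis
    cells = [s // p for s, p in zip(image_size, pixels_per_cell)]
    # Number of block positions along each axis
    blocks = [c - b + 1 for c, b in zip(cells, cells_per_block)]
    # Every block contributes one histogram per cell it contains
    return (blocks[0] * blocks[1]
            * cells_per_block[0] * cells_per_block[1]
            * orientations)

# 256x256 image, 64x64 pixels per cell, 2x2 cells per block, 8 orientations
print(hog_feature_length((256, 256), (64, 64), (2, 2), 8))  # 288
```

With 64x64 cells there are 4x4 cells and 3x3 block positions, so the vector has 3 * 3 * 2 * 2 * 8 = 288 entries.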

In [None]:
# print the hog vectors
a[0]

array([0.0150423 , 0.00702026, 0.02142409, 0.00513724, 0.03626991,
       0.00412899, 0.00851918, 0.00341131, 0.3910884 , 0.11769633,
       0.05951159, 0.0338695 , 0.05744279, 0.03126868, 0.07697637,
       0.18629866, 0.13693709, 0.09183842, 0.06311768, 0.04393364,
       0.08685511, 0.05885088, 0.08278085, 0.09256898, 0.45369873,
       0.40324878, 0.33785404, 0.22942793, 0.24095177, 0.15504705,
       0.21395993, 0.24244829, 0.25443706, 0.07657171, 0.03871747,
       0.02203506, 0.03737153, 0.020343  , 0.05007983, 0.1212035 ,
       0.44146027, 0.19044215, 0.11511823, 0.06033758, 0.05521417,
       0.05508732, 0.0997878 , 0.24385815, 0.29517052, 0.26234844,
       0.21980347, 0.14926284, 0.15676011, 0.1008716 , 0.13919956,
       0.15773372, 0.29398333, 0.15013317, 0.13212652, 0.11454095,
       0.11917495, 0.0850342 , 0.14977189, 0.30966194, 0.44515194,
       0.1920347 , 0.11608089, 0.06084214, 0.05567589, 0.05554798,
       0.10062226, 0.24589739, 0.17177813, 0.03998121, 0.01170

## Split the data into training and testing datasets

In [None]:
# Each row of X holds the HOG features followed by the label (x[-2]) and the id (x[-1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Train = map(lambda x: (int(x[-1]),int(x[-2]),Vectors.dense(x[:-2])), X_train)
Train_df = spark.createDataFrame(Train,schema=['id','label',"features"])
Test = map(lambda x: (int(x[-1]),int(x[-2]),Vectors.dense(x[:-2])), X_test)
Test_df = spark.createDataFrame(Test,schema=['id','label',"features"])

## Set up BucketedRandomProjectionLSH

In [None]:
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", 
                                      bucketLength=0.002,numHashTables=5)
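# The key to using the LSH model is the choice of the two parameters bucketLength and
# numHashTables (setBucketLength and setNumHashTables), because they determine
# the quality of the results and the performance of the model.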
model = brp.fit(Train_df)
transformedModel = model.transform(Train_df)
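
To see what the two parameters control, here is a minimal pure-Python sketch of a bucketed random projection hash: each hash table projects the vector onto a random unit vector and divides the projection into buckets of width `bucketLength`. The helper names `random_unit_vector` and `brp_hash`, the toy vectors, and the bucket length 0.5 are assumptions for illustration; this is not Spark's implementation.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and scale it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def brp_hash(x, unit_vectors, bucket_length):
    """One bucket index per hash table: floor(dot(x, v) / bucket_length)."""
    return [math.floor(sum(a * b for a, b in zip(x, v)) / bucket_length)
            for v in unit_vectors]

rng = random.Random(0)
tables = [random_unit_vector(4, rng) for _ in range(5)]  # 5 hash tables

x = [0.10, 0.20, 0.30, 0.40]
x_near = [0.10, 0.20, 0.30, 0.41]  # a slightly perturbed copy of x
print(brp_hash(x, tables, 0.5))
print(brp_hash(x_near, tables, 0.5))
```

A smaller bucket length makes buckets narrower, so fewer vectors collide; more hash tables give nearby vectors more chances to collide in at least one table.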

## Print the hashes
Each column holds the hash values of one hash table. Within one column (one table), vectors are hashed into the same bucket if they have the same hash value.

In [None]:
transformedModel.select("hashes").collect()

[Row(hashes=[DenseVector([-4.0]), DenseVector([43.0]), DenseVector([13.0]), DenseVector([-187.0]), DenseVector([-50.0])]),
 Row(hashes=[DenseVector([35.0]), DenseVector([52.0]), DenseVector([48.0]), DenseVector([-61.0]), DenseVector([55.0])]),
 Row(hashes=[DenseVector([5.0]), DenseVector([81.0]), DenseVector([41.0]), DenseVector([-136.0]), DenseVector([-35.0])]),
 Row(hashes=[DenseVector([4.0]), DenseVector([66.0]), DenseVector([77.0]), DenseVector([-112.0]), DenseVector([15.0])]),
 Row(hashes=[DenseVector([9.0]), DenseVector([-1.0]), DenseVector([-37.0]), DenseVector([-92.0]), DenseVector([-68.0])]),
 Row(hashes=[DenseVector([-68.0]), DenseVector([10.0]), DenseVector([19.0]), DenseVector([-78.0]), DenseVector([-45.0])]),
 Row(hashes=[DenseVector([-126.0]), DenseVector([34.0]), DenseVector([-22.0]), DenseVector([-143.0]), DenseVector([-24.0])]),
 Row(hashes=[DenseVector([-124.0]), DenseVector([-4.0]), DenseVector([-17.0]), DenseVector([-37.0]), DenseVector([-63.0])]),
 Row(hashes=[Dens

## Query the first image from the test dataset to see how well LSH works

In [None]:
key = Vectors.dense(X_test[0][0:-2])
result = model.approxNearestNeighbors(Train_df, key, 10000, distCol="EuclideanDistance")

In [None]:
resultList = result.select("label").rdd.flatMap(lambda x: x).collect()
targetLabel = Vectors.dense(X_test[0][-2])
print("targetLabel:")
print(targetLabel)
print("The top 10 neighbours:")
resultList[:10]

targetLabel:
[6.0]
The top 10 neighbours:


[85, 91, 36, 70, 65, 91, 91, 35, 95, 97]

## Check the size of returned neighbours
The training dataset holds about 6,400 images (70% of the 9,176 loaded). 
If resultList is large, we probably hashed more images into the same bucket than we expected. In that case the BucketedRandomProjectionLSH parameters bucketLength and numHashTables need to be changed.

In [None]:
len(resultList)

169
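
To make the link between bucket collisions and the number of returned neighbours concrete, here is a minimal sketch. It assumes the common LSH rule that a training vector becomes a candidate when it shares a bucket with the query in at least one hash table; the Spark version in use may select candidates differently. The helper names, the row ids, and the query hashes are hypothetical; the three hash rows are the first three printed earlier.

```python
def shares_bucket(query_hashes, row_hashes):
    """True if the two hash lists agree in at least one table (same position)."""
    return any(q == r for q, r in zip(query_hashes, row_hashes))

def candidates(query_hashes, table):
    """Ids of all rows that collide with the query in at least one hash table."""
    return [row_id for row_id, row_hashes in table.items()
            if shares_bucket(query_hashes, row_hashes)]

# Hash rows with 5 tables each; the ids "a", "b", "c" are made up
table = {
    "a": [-4, 43, 13, -187, -50],
    "b": [35, 52, 48, -61, 55],
    "c": [5, 81, 41, -136, -35],
}
query = [35, 43, 0, 0, 0]  # hypothetical hashes of a query vector
print(candidates(query, table))  # ['a', 'b']
```

Under this rule, wider buckets or more hash tables both increase the number of candidates.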

## Run the test data through LSH and evaluate the result

In [None]:
def getBestPerformance(numOfTest, bucketLength,numHashTables,numOfNeighbor):
 
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#     Train = map(lambda x: (int(x[-1]),int(x[-2]),Vectors.dense(x[:-2])), X_train)
#     Train_df = spark.createDataFrame(Train,schema=['id','label',"features"])
#     Test = map(lambda x: (int(x[-1]),int(x[-2]),Vectors.dense(x[:-2])), X_test)
#     Test_df = spark.createDataFrame(Test,schema=['id','label',"features"])

    brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", 
                                      bucketLength=bucketLength,numHashTables=numHashTables)
    model = brp.fit(Train_df)
    transformedDF = model.transform(Train_df)
  
    nnToMajorityAccMap = {}
    nnToWeightedAccMap = {}
    with progressbar.ProgressBar(max_value = numOfTest) as bar:
        for i in range(0, numOfTest):
            Catg = X_test[i][-2]
            key = Vectors.dense(X_test[i][0:-2])
            # Query once with the largest neighbour count (the last entry of numOfNeighbor)
            result = model.approxNearestNeighbors(transformedDF, key, numOfNeighbor[-1], distCol="EuclideanDistance")
            # Convert the PySpark DataFrame column to a Python list
            labelList = result.select("label").rdd.flatMap(lambda x: x).collect()
            # Build a distance dataframe for the distance between query and result
            result1 = result.select('label', 'EuclideanDistance').collect()
            df_weighted = pd.DataFrame()
            df_weighted['Label'] = [int(row['label']) for row in result1]
            df_weighted['EuclideanDistance'] = [float(row['EuclideanDistance']) for row in result1]
            df_weighted['Weight'] = 1 / df_weighted['EuclideanDistance'] # the larger the Euclidean distance between query and result, the lower the weight
            # Slice labelList into sub-lists of different lengths
            nnList = []
            for numberNN in numOfNeighbor:
                slicedList = labelList[0:numberNN]
                nnList.append(slicedList)
                
            for index in range(0, len(nnList)):
                majority_vote = Counter(nnList[index]).most_common(1)[0][0]
                if  Catg == majority_vote:
                    key = "M" + str(numOfNeighbor[index])
                    if key in nnToMajorityAccMap:
                        nnToMajorityAccMap[key] = nnToMajorityAccMap.get(key) + 1
                    else:
                        nnToMajorityAccMap[key] = 1
                        
            for n in numOfNeighbor:
                weighted_result = df_weighted.head(n).groupby('Label')['Weight'].sum()
                df_weighted_result = pd.DataFrame(weighted_result)
                df_weighted_result = df_weighted_result.reset_index()
                df_weighted_result.sort_values(by='Weight',inplace=True,ascending=False)
                weighted_vote = df_weighted_result.iloc[0]['Label'].astype('int')
                if  Catg == weighted_vote:
                    key = "W" + str(n)
                    if key in nnToWeightedAccMap:
                        nnToWeightedAccMap[key] = nnToWeightedAccMap.get(key) + 1
                    else:
                        nnToWeightedAccMap[key] = 1
                        
            bar.update(i)
        
        # Calculate accuracy based on the majority vote
        for key in nnToMajorityAccMap:
            nnToMajorityAccMap[key] = nnToMajorityAccMap.get(key) / numOfTest
        # Calculate accuracy based on the weighted vote
        for key in nnToWeightedAccMap:
            nnToWeightedAccMap[key] = nnToWeightedAccMap.get(key) / numOfTest    
    return nnToMajorityAccMap, nnToWeightedAccMap
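
The two voting rules used in `getBestPerformance` can be sketched in isolation as follows. The labels and distances are made up for illustration, and the small `eps` term is an assumption added here to avoid division by zero; the function above divides by the distance directly.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Return the label that occurs most often among the neighbours."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, distances, eps=1e-12):
    """Return the label with the largest sum of inverse-distance weights."""
    scores = defaultdict(float)
    for label, dist in zip(labels, distances):
        scores[label] += 1.0 / (dist + eps)  # closer neighbours weigh more
    return max(scores, key=scores.get)

labels = [85, 91, 91]        # labels of the 3 nearest neighbours (made up)
distances = [0.1, 0.5, 0.5]  # their distances to the query (made up)
print(majority_vote(labels))             # 91
print(weighted_vote(labels, distances))  # 85
```

In this example the two rules disagree: label 91 wins by count, while the single very close neighbour with label 85 wins by weight.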

In [None]:
# Set the parameter grid
bucketLengthList = np.arange(0.008, 0.010,0.002)
numHashTablesList = np.arange(18, 20, 1)
numHashTablesList[0] = 3
# Make sure the last element of numOfNeighborList is the largest
numOfNeighborList = [3]
numOfTest = 1000
print("Checking bucketLength Param:")
print(bucketLengthList)
print("Checking numHashTablesList Param:")
print(numHashTablesList)

Checking bucketLength Param:
[0.008]
Checking numHashTablesList Param:
[ 3 19]


In [None]:
%%time
bucketLengthList_para=[]
numHashTablesList_para=[]
resultMList = []
resultWList = []
for i in bucketLengthList:
    for j in numHashTablesList:
            resultM, resultW = getBestPerformance(numOfTest ,i, j, numOfNeighborList)
            print( "bucketLen:" + str(i) + "  #Hashtable:" + str(j) +  "  Majority Vote Acc: " + str(resultM))
            print( "bucketLen:" + str(i) + "  #Hashtable:" + str(j) +  "  Weighted Vote Acc: " + str(resultW))
            bucketLengthList_para.append(i)
            numHashTablesList_para.append(j)
            resultMList.append(resultM)
            resultWList.append(resultW)

100% (1000 of 1000) |####################| Elapsed Time: 0:32:35 Time:  0:32:35


bucketLen:0.008  #Hashtable:3  Majority Vote Acc: {'M3': 0.374}
bucketLen:0.008  #Hashtable:3  Weighted Vote Acc: {'W3': 0.375}


100% (1000 of 1000) |####################| Elapsed Time: 0:40:57 Time:  0:40:57


bucketLen:0.008  #Hashtable:19  Majority Vote Acc: {'M3': 0.47}
bucketLen:0.008  #Hashtable:19  Weighted Vote Acc: {'W3': 0.473}
CPU times: user 58.9 s, sys: 14.8 s, total: 1min 13s
Wall time: 1h 13min 41s


In [None]:
df_result = pd.DataFrame()
df_result['BucketLength'] = bucketLengthList_para
df_result['NumHashTables'] = numHashTablesList_para
# Convert the lists of dicts to pandas DataFrames
df_result_map =  pd.DataFrame(resultMList)
df_result_map2 = pd.DataFrame(resultWList)
df_result = pd.concat([df_result, df_result_map], axis=1, join='inner')
df_result = pd.concat([df_result, df_result_map2], axis=1, join='inner')
#save the result to a csv file
df_result.to_csv('./result3.csv') 

In [None]:
df_result

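|   | BucketLength | NumHashTables | M3 | W3 |
|---|---|---|---|---|
| 0 | 0.002 | 3 | 0.33 | 0.33 |
| 1 | 0.002 | 11 | 0.41 | 0.415 |
| 2 | 0.002 | 19 | 0.415 | 0.415 |
| 3 | 0.004 | 3 | 0.37 | 0.37 |
| 4 | 0.004 | 11 | 0.47 | 0.47 |
| 5 | 0.004 | 19 | 0.47 | 0.47 |
| 6 | 0.006 | 3 | 0.405 | 0.405 |
| 7 | 0.006 | 11 | 0.48 | 0.485 |
| 8 | 0.006 | 19 | 0.485 | 0.485 |
| 9 | 0.008 | 3 | 0.385 | 0.385 |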
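
As a small sketch of how the best setting can be read off these results, the snippet below ranks the rows by weighted-vote accuracy and breaks ties by majority-vote accuracy. The rows are copied from the results above; the tie-breaking rule is an assumption made here for illustration.

```python
# (BucketLength, NumHashTables, M3, W3), copied from the results above
rows = [
    (0.002, 3, 0.33, 0.33),
    (0.002, 11, 0.41, 0.415),
    (0.002, 19, 0.415, 0.415),
    (0.004, 3, 0.37, 0.37),
    (0.004, 11, 0.47, 0.47),
    (0.004, 19, 0.47, 0.47),
    (0.006, 3, 0.405, 0.405),
    (0.006, 11, 0.48, 0.485),
    (0.006, 19, 0.485, 0.485),
    (0.008, 3, 0.385, 0.385),
]

# Rank by weighted-vote accuracy first, then by majority-vote accuracy
best = max(rows, key=lambda r: (r[3], r[2]))
print(best)  # (0.006, 19, 0.485, 0.485)
```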
