Hereby is a description of the functions which are found in each designated file The functions are intended to be loaded to replicate the paper.
I describe the files in order of which they occur when running the code. To implement it, one needs to load the functions in the "functions" folder, and then run the bootstraps in the "Bootstraps" folder. Within the bootstraps folder is another folder in which all information is assembled.
Next, I explain what the functions in the functions folder do:
The very first step is manipulating the JSON file to make it workable in R. This is done in the file "Extract_Model_Words". A fundamental part of LSH is to come up with binary vector representations. The papers we have studied make use of the title. I use the the feature list and the title to create the model words. Notably, I do not use all features that are available in the dataset. I first analyse which features occur in at least 48% of all the products. This means that I use 30 features from the dataset out of the 351 total features. After identifying the most important features, the values need to be identified. This is done in object feat_value_pair. I further clean these by removing any syntax (?":.,!) and irrelevant words, such as, "featuresMap.", etc. After obtaining the feat_value_pair, I turn to the titles. The title are cleaned in a similar fashion. Using the features and the titles, the model words are created.
In this file, the test and the train data is created. The train-test split is 63-27, respectively.
This file contains the procedure of compressing the binary vectors to signature vectors using min-hashing. The figure below shows an example of a small binary matrix. The model words are a few examples of some of the model words used in this paper.
In this file the following procedure is conducted. LSH is applied on the signature matrix from function "Binary_Signature_Matrix". Signature matrix is divided into bands and each band contains r rows with the following property: n = r * b where n is the length of the signature matrix, b is the number of bands, r is the number of rows. "t", also represents the relation between false positives and false negatives in the following fashion: t = (1/b)^(1/r)
After creating the dissimilarity matrix, it is used as iput in the agglomerative complete-linkage clustering algorithm. The LSH algorithm is meant to reduce the amount of time that the algorithm. Takes to process the input matrix. The "hclust" function is used and height at which to cut the tree is chosen to be 0.8.
The test and train dataset have been set. Due to some difficulty in the implementation of complete-linkage hierarchical clustering the bootstraps need to be set individually, which is often unwanted, however, unfortunately unavoidable. The bootstraps for all the different values of t are evaluated separately and can be found in the folder "bootstraps".