
Implementation of Enhanced k-Nearest-Neighbors

What is the Enhanced K-Nearest Neighbour Algorithm:

The Enhanced K-Nearest Neighbour algorithm was developed by the ADReM research group at the University of Antwerp to classify multi-label documents.

Algorithm:

The algorithm consists of three main steps:

  1. Find the k nearest neighbours according to BM25 similarity
  2. Take a weighted vote of the k nearest neighbours over the candidate classes
  3. Decide the final predictions via a thresholding strategy
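The three steps above can be sketched as follows. This is an illustrative sketch, not the project's actual code: the BM25 scores are assumed to be already computed, and the `Neighbour` struct and the threshold value are hypothetical stand-ins.

```cpp
// Sketch of the three EkNN steps on toy data (illustrative only; the real
// project computes BM25 from token statistics -- here the scores are given).
#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct Neighbour {
    double score;                    // BM25 similarity to the test document
    std::vector<std::string> labels; // labels of the training document
};

std::vector<std::string> predict(std::vector<Neighbour> neighbours,
                                 std::size_t k, double threshold) {
    // Step 1: keep the k most similar neighbours.
    std::sort(neighbours.begin(), neighbours.end(),
              [](const Neighbour& a, const Neighbour& b) {
                  return a.score > b.score;
              });
    if (neighbours.size() > k) neighbours.resize(k);

    // Step 2: each neighbour votes for its labels, weighted by its score.
    std::map<std::string, double> votes;
    for (const auto& n : neighbours)
        for (const auto& label : n.labels) votes[label] += n.score;

    // Step 3: keep labels whose accumulated weight clears the threshold.
    std::vector<std::string> prediction;
    for (const auto& [label, weight] : votes)
        if (weight >= threshold) prediction.push_back(label);
    return prediction;
}
```

With two strong "cat" neighbours and a weak "fish" one, only "cat" accumulates enough weighted votes to clear a threshold of 1.0.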

About Project:

You can use this project either to find nearest neighbors according to BM25 similarity or to predict the label(s) of test documents.
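For reference, the per-term BM25 score in its common textbook form looks like the sketch below. The k1 and b defaults are the usual textbook values; the exact BM25 variant and parameters the project uses are not shown here, so treat this as an assumption.

```cpp
// Textbook-style BM25 score for a single query term (hypothetical sketch;
// the project's exact variant and parameter values may differ).
#include <cmath>

// tf: term frequency in the document; df: number of documents containing
// the term; num_docs: collection size; doc_len / avg_doc_len: length norm.
double bm25_term(double tf, double df, double num_docs,
                 double doc_len, double avg_doc_len,
                 double k1 = 1.2, double b = 0.75) {
    // The common "+1" IDF form, which stays non-negative.
    double idf = std::log((num_docs - df + 0.5) / (df + 0.5) + 1.0);
    double norm = tf * (k1 + 1.0) /
                  (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len));
    return idf * norm;
}
```

The document's BM25 similarity is then the sum of these per-term scores over the terms shared with the query document.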

The algorithm is implemented in two different ways:

  1. Vector: fast on small datasets
  2. Inverted Index: efficient on large sparse datasets
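The difference between the two can be sketched with simplified structures (hypothetical forms; the project's real structures also carry the statistics BM25 needs). A vector stores one term-count map per document, while an inverted index maps each term to the documents containing it, so a query only touches documents that share at least one term:

```cpp
// Sketch contrasting the two lookup structures (simplified, illustrative).
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Vector ("matrix") form: one term -> count map per training document.
using DocVector = std::unordered_map<std::string, int>;

// Inverted-index form: term -> list of (document id, count) postings.
using InvertedIndex =
    std::unordered_map<std::string, std::vector<std::pair<int, int>>>;

InvertedIndex build_index(const std::vector<DocVector>& docs) {
    InvertedIndex index;
    for (int id = 0; id < static_cast<int>(docs.size()); ++id)
        for (const auto& [term, count] : docs[id])
            index[term].push_back({id, count});
    return index;
}

// Only documents sharing a term with the query are ever visited, which is
// what makes the inverted index efficient on large sparse datasets.
std::vector<int> candidates(const InvertedIndex& index,
                            const std::vector<std::string>& query_terms) {
    std::vector<bool> seen;
    std::vector<int> result;
    for (const auto& term : query_terms) {
        auto it = index.find(term);
        if (it == index.end()) continue;
        for (const auto& posting : it->second) {
            int id = posting.first;
            if (static_cast<std::size_t>(id) >= seen.size())
                seen.resize(id + 1, false);
            if (!seen[id]) { seen[id] = true; result.push_back(id); }
        }
    }
    return result;
}
```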

The input data may already be pre-processed or may still need pre-processing; in the latter case, you need to set the command line argument preprocess to true.

How To Compile And Run:

Because the project is implemented in standard C++,
you need a UNIX-like operating system (e.g. Ubuntu, macOS) and a C++ compiler (e.g. g++).

You can download the code with the following commands:

git clone https://github.com/EnesKilicaslan/Enhanced-k-Nearest-Neighbors.git
cd Enhanced-k-Nearest-Neighbors

Open a terminal in the directory containing the source code, then run the following command:

g++ *.cpp -o exec

That command creates an executable named exec; you can now run it with the appropriate command line arguments.

First,
a --help option is available, which you can use as follows:

./exec --help

The output lists the command line arguments you can use and explains their meaning.

Second,
the k, train, and test arguments are mandatory, so you need to set them correctly in order to use the executable.

  • k: an integer specifying the number of nearest-neighbour documents to consider
  • train: is the path to a file or directory that contains training input
  • test: is the path to a file or directory that contains test input

Example usage:

./exec k=4 train="path/to/training/file" test="path/to/test/file"

Lastly,
if you don't set plabel to true, the application outputs the k nearest neighbours according to BM25 similarity; otherwise it makes class (label) predictions using weighted voting and thresholding.

As specified earlier, you can use two different data structures: one is the vector and the other is the inverted index. To use the vector (matrix) data structure, set vector to true. If you don't set it, the application uses the inverted index by default.

If you would like to see the data structure's content in a file, set save to true. This saves the inverted index or the vector (depending on which one represents the training documents) to a file: "train_vectors.txt" for the vector, "train_invertedindex.txt" for the inverted index.

If your data needs preprocessing, set preprocess to true. In that case, the train and test arguments must each point to a directory containing a .key file and a .txt file for each document, like the dataset here.
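Arguments in the key=value style shown above could be parsed with something like the following sketch. This is hypothetical; the project's actual parser may handle the arguments differently.

```cpp
// Hypothetical sketch of parsing "key=value" arguments such as
// k=4 train=path plabel=true (the project's real parser may differ).
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::string> parse_args(
        const std::vector<std::string>& args) {
    std::map<std::string, std::string> options;
    for (const auto& arg : args) {
        auto eq = arg.find('=');
        if (eq == std::string::npos) continue;  // ignore malformed tokens
        options[arg.substr(0, eq)] = arg.substr(eq + 1);
    }
    return options;
}
```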

Some Running Examples:

I assume you have compiled the code without errors and have an executable named exec.

  • Please download the data from here. This is the NLM_500 dataset, with some documents separated for testing. Because the data needs preprocessing, we need to use the preprocess argument. I want to use the sparse vector implementation and save that vector to a file. I also want it to predict labels instead of listing nearest neighbours, using k=10. The following command does all of that:

    ./exec k=10 train=NLM_500/train test=NLM_500/test preprocess=true vector=true save=true plabel=true
  • This time I want to use the inverted index with the same data, but retrieve the nearest neighbours according to BM25 similarity. I still need preprocessing, but I don't want to save the inverted index. The following command is exactly what I am looking for:

    ./exec k=10 train=NLM_500/train test=NLM_500/test preprocess=true
  • Of course, you can also use a dataset that is already preprocessed, like the data here. This is Wikipedia-500K, which is really big for this kind of algorithm, so we must use the inverted index implementation. Again I want to save the inverted index built from the training data to a file, and I expect label predictions as the result. This time I supply paths to files rather than directories for the train and test arguments, since no preprocessing is needed. Note that this run takes a couple of hours:

    ./exec k=20 train=WikiLSHTC/wikiLSHTC_train.txt test=WikiLSHTC/wikiLSHTC_test.txt plabel=true save=true

PS: I assume the NLM_500 and WikiLSHTC folders are in the same directory as the executable.

ToDo's:

  • read necessary materials to understand the concept

  • pre-processing code

  • implement BM25 similarity to compare two documents

  • implement k-NN

  • implement weighted vote scheme

  • implement thresholding strategy

  • add --help command

  • write documentation

  • use inverted index to make k-NN faster

  • run the implementation on the datasets and compare results

  • make the implementation multi-threaded

  • write report

About

A k-nearest-neighbours algorithm for multi-label documents.
