GitHub - MrAlexSee/Fingerprints: A proof-of-concept program for lightweight fingerprints for fast approximate keyword matching using bitwise operations.

Basic information

A proof-of-concept program for lightweight fingerprints for fast approximate keyword matching. Authors: Aleksander Cisłak, Szymon Grabowski.

Published as Lightweight Fingerprints for Fast Approximate Keyword Matching Using Bitwise Operations (Cisłak and Grabowski), Computing and Informatics, 02/2019, pp. 367-389. Available under the following link: http://www.cai.sk/ojs/index.php/cai/article/view/2019_2_367.

Compilation

Add Boost library to the path for compilation by setting BOOST_DIR in the makefile. Requires Boost program options module to be compiled for static linking. Requires support for the C++14 standard.

Type make for optimized compile. Comment out OPTFLAGS in the makefile in order to disable optimization.

Tested with gcc 64-bit 7.2.0 and Boost 1.67.0 (the latter is not performance-critical, used only for parameter and data parsing and formatting) on Ubuntu 17.10 Linux version 4.13.0-36 64-bit.

A binary (compiled executable) for Linux is available in the release (file name fingerprints).

Usage

Basic usage: ./fingerprints [options] <input dictionary file> <input pattern file>

Input dictionary file (positional parameter 1 or named parameter -i or --in-dict-file) should contain the list of words, separated with newline characters. Input pattern file (positional parameter 2 or named parameter -I or --in-pattern-file) should contain the list of patterns, separated with newline characters. Attached as part of this package are scripts test_real.sh for testing on real-world data and test_synth.sh for testing on synthetic data.

End-to-end tests are located in the end_to_end_tests folder and they can be run using the run_tests.sh script in that folder.
Unit tests are located in the unit_tests folder and they can be run by issuing the make run command in that folder.
The scripts directory contains some helpful Python 2 tools.

Command-line parameter description

Short name	Long name	Parameter description
	`--calc-rejection`	calculate percentages of rejected words instead of measuring time
`-d`	`--dump`	dump input files and params info with elapsed time and throughput to output file (useful for testing)
	`--dump-construction`	dump fingerprint construction time
`-D`	`--distance arg`	distance metric: ham (Hamming), lev (Levenshtein) (default = ham)
`-f`	`--fingerprint-type arg`	fingerprint type: none, occ (occurrence), occhalved (occurrence halved), count, pos (position) (default = occ)
`-h`	`--help`	display help message
`-i`	`--in-dict-file arg`	input dictionary file path (positional arg 1)
`-I`	`--in-pattern-file arg`	input pattern file path (positional arg 2)
	`--iter arg`	number of iterations per pattern lookup (default = 1)
`-k`	`--approx arg`	perform approximate search (Hamming or Levenshtein) for k errors
`-l`	`--letters-type arg`	letters type: common, mixed, rare (default = common)
`-o`	`--out-file arg`	output file path (default = res.txt)
`-p`	`--pattern-count arg`	maximum number of patterns read from top of the pattern file (non-positive values are ignored)
	`--pattern-size arg`	if set, only patterns of this size (letter count) will be read from the pattern file (non-positive values are ignored)
`-s`	`--separator arg`	input data (dictionary and patterns) separator (default = newline)
`-v`	`--version`	display version info
`-w`	`--word-count arg`	maximum number of words read from top of the dictionary file (non-positive values are ignored)

Data files description

All files listed below are located in the data folder.

File name	Description
`dict_english_200.txt`	words extracted from the English 200 collection from the Pizza&Chili index, 815,935 words
`dict_iamerican_insane.txt`	English dictionary, 350,518 words
`queries_english_200_8.txt`	queries of length 8 with errors extracted from the corresponding dictionary, 10,000 words
`queries_iamerican_insane_8.txt`	queries of length 8 with errors extracted from the corresponding dictionary, 10,000 words
`queries_misspellings.txt`	a list of common English misspellings, 4,278 words
`queries_urls_69.txt`	queries of length 69 with errors extracted from a dictionary of real-world URL data, 10,000 words

References:

Pizza&Chili: http://pizzachili.dcc.uchile.cl/texts/nlang/english.200MB.gz
iamerican-insane: Ubuntu package
English misspellings: https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines
URL dictionary: http://data.law.di.unimi.it/webdata/in-2004/in-2004.urls.gz

Name		Name	Last commit message	Last commit date
Latest commit History 242 Commits
data		data
end_to_end_tests		end_to_end_tests
related		related
scripts		scripts
unit_tests		unit_tests
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
fingerprints.cpp		fingerprints.cpp
fingerprints.hpp		fingerprints.hpp
helpers.hpp		helpers.hpp
main.cpp		main.cpp
makefile		makefile
params.hpp		params.hpp
test_real.sh		test_real.sh
test_synth.sh		test_synth.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Basic information

Compilation

Usage

Command-line parameter description

Data files description

About

Releases 3

Packages

Languages

License

MrAlexSee/Fingerprints

Folders and files

Latest commit

History

Repository files navigation

Basic information

Compilation

Usage

Command-line parameter description

Data files description

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

Packages