WordParser

WordParser builds a vocabulary of words by scanning text files. It can also track word relationships (e.g. which words are most likely to come before or after a given word) and use those relationships to compute the semantic similarity between words. WordParser is used to generate some of the word files used by Wordsmith.

Note: WordParser expects UTF-8 or ASCII text files and skips any word containing a character with a code greater than 255. Words can therefore still contain some non-ASCII Unicode characters, which is why the output word files are saved in UTF-8 format.
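
To illustrate the character-code rule in the note above, here is a minimal sketch of a check that accepts a UTF-8 encoded word only if every code point is 255 or lower. The function name FitsInLatin1 is hypothetical and is not part of WordParser; the library's actual filtering code may work differently.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not part of WordParser's API.
// Returns true if every code point in a UTF-8 string is <= 255 (U+00FF).
bool FitsInLatin1(const std::string& utf8)
{
	for (std::size_t i = 0; i < utf8.size(); ++i)
	{
		const unsigned char c = static_cast<unsigned char>(utf8[i]);
		if (c < 0x80)
			continue; // plain ASCII, encoded as a single byte

		// U+0080..U+00FF are encoded as two bytes whose lead byte is 0xC2 or 0xC3.
		if ((c == 0xC2 || c == 0xC3) && i + 1 < utf8.size())
		{
			const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
			if ((next & 0xC0) == 0x80)
			{
				++i; // skip the continuation byte
				continue;
			}
		}
		return false; // code point above 255, or malformed UTF-8
	}
	return true;
}
```

Under this sketch a word such as "café" passes, while a word containing "€" (U+20AC) is rejected.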

USAGE

Generate, save, and load a vocabulary file (one word per line):

#include "WordParser.h"

int main()
{
	WordParser::MinOccurr = 3;
	WordParser::MaxWordLen = 32;

	phmap::parallel_flat_hash_map<std::string, uint64_t> words;

	WordParser::GenEnglishWordList(words, "C:/text_files/");
	WordParser::SaveWordList(words, "C:/english_words.txt");

	phmap::parallel_flat_hash_set<std::string> engWords;
	WordParser::LoadWordList(engWords, "C:/english_words.txt");
}
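
The README does not document what MinOccurr and MaxWordLen do. Going by their names, a plausible reading is that words longer than MaxWordLen are skipped and words seen fewer than MinOccurr times are dropped. The sketch below shows that interpretation using std::unordered_map instead of phmap so that it compiles on its own. CountWords and its parameters are hypothetical, and the tokenization (whitespace split, ASCII letters only, lowercased) is an assumption rather than WordParser's actual behavior.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical sketch; not WordParser's implementation.
// Counts words in `text`, skipping words longer than maxWordLen and
// discarding words that occur fewer than minOccurr times.
std::unordered_map<std::string, std::uint64_t> CountWords(
	const std::string& text, std::size_t maxWordLen, std::uint64_t minOccurr)
{
	std::unordered_map<std::string, std::uint64_t> counts;
	std::istringstream in(text);
	std::string token;

	while (in >> token)
	{
		// Keep ASCII letters only and lowercase them (assumed normalization).
		std::string word;
		for (unsigned char c : token)
			if (std::isalpha(c))
				word += static_cast<char>(std::tolower(c));

		if (word.empty() || word.size() > maxWordLen)
			continue;
		++counts[word];
	}

	// Drop words that are too rare.
	for (auto it = counts.begin(); it != counts.end();)
		it = (it->second < minOccurr) ? counts.erase(it) : std::next(it);

	return counts;
}
```

Calling CountWords("the cat and the dog and the bird", 32, 2) keeps only "the" and "and", since every other word appears once.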

Generate and save a "word web" file (for each word, records the words used before and after it):

#include "WordParser.h"

int main()
{
	WordParser::MinOccurr = 4;
	WordParser::MaxWordLen = 28;

	phmap::parallel_flat_hash_map<std::string, WordMaps> words;

	WordParser::GenEnglishWordWeb(words, "C:/text_files/");
	WordParser::SaveWordWeb(words, "C:/english_wordweb.txt");
}
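
As a rough picture of what a word web records, the sketch below counts, for each word in a token sequence, which words appear immediately before and after it. The NeighborCounts struct and BuildWordWeb function are hypothetical names; the real layout of WordMaps is not described here and may differ. std::unordered_map stands in for phmap to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using CountMap = std::unordered_map<std::string, std::uint32_t>;

// Hypothetical stand-in for WordMaps: neighbor counts on each side of a word.
struct NeighborCounts
{
	CountMap before; // words seen immediately before this word
	CountMap after;  // words seen immediately after this word
};

// Hypothetical sketch; not WordParser's implementation.
std::unordered_map<std::string, NeighborCounts> BuildWordWeb(
	const std::vector<std::string>& tokens)
{
	std::unordered_map<std::string, NeighborCounts> web;

	for (std::size_t i = 0; i < tokens.size(); ++i)
	{
		NeighborCounts& entry = web[tokens[i]];
		if (i > 0)
			++entry.before[tokens[i - 1]];
		if (i + 1 < tokens.size())
			++entry.after[tokens[i + 1]];
	}
	return web;
}
```

For the tokens "the cat sat the cat ran", the entry for "cat" would record "the" twice in before, and "sat" and "ran" once each in after.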

Once you have a word web, you can use it to compute word similarities:

#include "WordParser.h"

int main()
{
	phmap::parallel_flat_hash_map<std::string, std::pair<uint32_t,WordLinks>> wordMap;
	std::vector<std::pair<std::string, uint32_t>> wordVec;
	WordParser::LoadWordWeb(wordMap, wordVec, "C:/english_wordweb.txt");

	phmap::parallel_node_hash_map<std::string, WP_PQ> simWords;
	WordParser::GenSimWords(simWords, wordVec, wordMap);
	WordParser::SaveSimWords(simWords, "C:/english_wordsims.txt");
}
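
The metric used by GenSimWords is not documented in this README. One common way to turn neighbor counts into a similarity score is cosine similarity between the count vectors of two words, shown below purely as an illustration of the general idea. CosineSimilarity and CountMap are hypothetical names, and this is a swapped-in technique, not a description of what WordParser computes.

```cpp
#include <cmath>
#include <cstdint>
#include <string>
#include <unordered_map>

using CountMap = std::unordered_map<std::string, std::uint32_t>;

// Hypothetical sketch; not WordParser's implementation.
// Cosine similarity between two sparse neighbor-count vectors.
// Returns a value in [0, 1]; 0 if either vector is empty.
double CosineSimilarity(const CountMap& a, const CountMap& b)
{
	double dot = 0.0, normA = 0.0, normB = 0.0;

	for (const auto& kv : a)
	{
		const double x = static_cast<double>(kv.second);
		normA += x * x;

		// Only words present in both maps contribute to the dot product.
		const auto it = b.find(kv.first);
		if (it != b.end())
			dot += x * static_cast<double>(it->second);
	}
	for (const auto& kv : b)
	{
		const double y = static_cast<double>(kv.second);
		normB += y * y;
	}

	if (normA == 0.0 || normB == 0.0)
		return 0.0;
	return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

Two words that are always preceded by the same words in the same proportions score close to 1, while words with no shared neighbors score 0.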

Dependencies

parallel-hashmap
