JacobBruce/WordParser

WordParser

WordParser builds a vocabulary of words by scanning text files. It can also track word relationships (e.g. which words are most likely to come before or after a given word) and use those relationships to compute the semantic similarity between words. WordParser is used to generate some of the word files used by Wordsmith.

Note: WordParser expects UTF-8 or ASCII text files and skips words containing characters with a character code greater than 255. Words can therefore contain some non-ASCII Unicode characters, so the output word files are saved in UTF-8 format.
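To make the 255 limit concrete, here is a minimal sketch of such a check on a UTF-8 encoded word. It assumes the limit applies to decoded code points, which the note does not spell out; `FitsInLatin1` is a hypothetical helper and not part of WordParser's API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not WordParser's own code): returns true when every
// code point in a UTF-8 encoded word is <= 255 (U+00FF).
bool FitsInLatin1(const std::string& word)
{
	for (std::size_t i = 0; i < word.size(); ++i) {
		const unsigned char lead = static_cast<unsigned char>(word[i]);
		if (lead < 0x80) continue; // plain ASCII, one byte

		// U+0080..U+00FF are encoded as two bytes: a lead byte of 0xC2 or
		// 0xC3 followed by one continuation byte in the range 0x80..0xBF.
		if (lead != 0xC2 && lead != 0xC3) return false;
		if (i + 1 >= word.size()) return false; // truncated sequence

		const unsigned char next = static_cast<unsigned char>(word[i + 1]);
		if ((next & 0xC0) != 0x80) return false; // not a continuation byte
		++i; // skip the continuation byte
	}
	return true;
}
```

Under this assumption a word such as "café" (é is U+00E9) would be kept, while a word containing "ł" (U+0142) or "€" (U+20AC) would be skipped.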

USAGE

Generate, save, and load a vocabulary file (one word per line):

#include "WordParser.h"

int main()
{
	WordParser::MinOccurr = 3;
	WordParser::MaxWordLen = 32;

	phmap::parallel_flat_hash_map<std::string, uint64_t> words;

	WordParser::GenEnglishWordList(words, "C:/text_files/");
	WordParser::SaveWordList(words, "C:/english_words.txt");

	phmap::parallel_flat_hash_set<std::string> engWords;
	WordParser::LoadWordList(engWords, "C:/english_words.txt");
}

Generate and save a "word web" file (records the lists of words used before and after each word):

#include "WordParser.h"

int main()
{
	WordParser::MinOccurr = 4;
	WordParser::MaxWordLen = 28;

	phmap::parallel_flat_hash_map<std::string, WordMaps> words;

	// Directory of text files to scan (example path, as in the first example).
	WordParser::GenEnglishWordWeb(words, "C:/text_files/");
	WordParser::SaveWordWeb(words, "C:/english_wordweb.txt");
}
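To illustrate what a word web records, the following sketch counts, for each word in a token sequence, which words appear immediately before and after it. It is only an illustration using standard containers: `NeighborCounts`, `WordWeb`, and `BuildWordWeb` are hypothetical names, and the library's own `WordMaps` type and file format may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the library's WordMaps type.
struct NeighborCounts {
	std::unordered_map<std::string, std::uint32_t> before; // words seen just before
	std::unordered_map<std::string, std::uint32_t> after;  // words seen just after
};

using WordWeb = std::unordered_map<std::string, NeighborCounts>;

// For each adjacent pair of tokens, record that the left token precedes
// the right one and that the right token follows the left one.
WordWeb BuildWordWeb(const std::vector<std::string>& tokens)
{
	WordWeb web;
	for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
		const std::string& left = tokens[i];
		const std::string& right = tokens[i + 1];
		++web[left].after[right];
		++web[right].before[left];
	}
	return web;
}
```

For the tokens "the cat sat the cat", this records that "cat" follows "the" twice and that "sat" follows "cat" once.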

Once you have a word web, you can use it to compute word similarities:

#include "WordParser.h"

int main()
{
	phmap::parallel_flat_hash_map<std::string, std::pair<uint32_t,WordLinks>> wordMap;
	std::vector<std::pair<std::string, uint32_t>> wordVec;
	WordParser::LoadWordWeb(wordMap, wordVec, "C:/english_wordweb.txt");

	phmap::parallel_node_hash_map<std::string, WP_PQ> simWords;
	WordParser::GenSimWords(simWords, wordVec, wordMap);
	WordParser::SaveSimWords(simWords, "C:/english_wordsims.txt");
}
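This README does not say which similarity measure `GenSimWords` uses. As one common way to compare words by their neighbors, the sketch below computes the cosine similarity between two sparse neighbor-count maps. Cosine similarity is swapped in here purely for illustration, and `Counts` and `CosineSimilarity` are hypothetical names rather than part of WordParser.

```cpp
#include <cmath>
#include <cstdint>
#include <string>
#include <unordered_map>

// Sparse vector of neighbor counts: neighbor word -> times seen.
using Counts = std::unordered_map<std::string, std::uint32_t>;

// Cosine similarity between two sparse count vectors.
// Returns 0 when either vector is empty (or all zero).
double CosineSimilarity(const Counts& a, const Counts& b)
{
	double dot = 0.0, normA = 0.0, normB = 0.0;
	for (const auto& entry : a) {
		const double count = static_cast<double>(entry.second);
		normA += count * count;
		const auto it = b.find(entry.first);
		if (it != b.end()) dot += count * static_cast<double>(it->second);
	}
	for (const auto& entry : b) {
		const double count = static_cast<double>(entry.second);
		normB += count * count;
	}
	if (normA == 0.0 || normB == 0.0) return 0.0;
	return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

Two words that are always surrounded by the same neighbors in the same proportions score close to 1, and words that share no neighbors score 0.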

Dependencies

About

Fast C++ library for extracting words from text files