WordParser builds a vocabulary of words by scanning text files. It can also track word relationships (e.g. which words are most likely to come before or after a given word) and use those relationships to compute the semantic similarity between words. WordParser is used to generate some of the word files used by Wordsmith.
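To make the idea of "word relationships" concrete, here is a minimal sketch of recording which words appear directly before and after each word. It is not WordParser's code: `NeighbourCounts`, `buildNeighbourCounts`, and `mostLikelyAfter` are hypothetical names, standard containers are used in place of the phmap containers, and the input is assumed to be already split into lowercase tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical structure (not from WordParser): neighbour counts for one word.
struct NeighbourCounts {
    std::map<std::string, uint32_t> before; // words seen immediately before
    std::map<std::string, uint32_t> after;  // words seen immediately after
};

// For every token, count the tokens that appear directly before and after it.
std::map<std::string, NeighbourCounts>
buildNeighbourCounts(const std::vector<std::string>& tokens)
{
    std::map<std::string, NeighbourCounts> web;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        NeighbourCounts& entry = web[tokens[i]];
        if (i > 0) ++entry.before[tokens[i - 1]];
        if (i + 1 < tokens.size()) ++entry.after[tokens[i + 1]];
    }
    return web;
}

// Return the most frequent follower of `word`, or "" if none was recorded.
std::string mostLikelyAfter(const std::map<std::string, NeighbourCounts>& web,
                            const std::string& word)
{
    auto it = web.find(word);
    if (it == web.end()) return "";
    std::string best;
    uint32_t bestCount = 0;
    for (const auto& kv : it->second.after) {
        if (kv.second > bestCount) { best = kv.first; bestCount = kv.second; }
    }
    return best;
}

// Example usage; call from main() to run the checks.
void demoNeighbourCounts()
{
    const std::vector<std::string> tokens =
        {"the", "cat", "sat", "the", "cat", "ran", "the", "dog"};
    const auto web = buildNeighbourCounts(tokens);
    assert(mostLikelyAfter(web, "the") == "cat");
    assert(web.at("cat").before.at("the") == 2);
}
```

A real word web would likely also apply frequency and length filters while scanning; those are left out here to keep the sketch short.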
Note: WordParser expects UTF-8 or ASCII text files and skips words containing characters with a character code greater than 255. Because words can still contain some non-ASCII Unicode characters, the output word files are saved in UTF-8 format.
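As an illustration of the 255 limit, the sketch below checks whether every character of a UTF-8 encoded word has a code point of at most 255. It assumes that "character code" means the Unicode code point; `fitsInLatin1Range` is a hypothetical helper, not part of WordParser. It relies on the fact that in UTF-8 the code points U+0080 to U+00FF are encoded as two bytes whose lead byte is 0xC2 or 0xC3.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from WordParser): returns true if every code point
// in the UTF-8 string `word` is <= U+00FF. Malformed input returns false.
bool fitsInLatin1Range(const std::string& word)
{
    for (std::size_t i = 0; i < word.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(word[i]);
        if (byte < 0x80) continue; // single-byte ASCII, U+0000..U+007F
        // Two-byte sequences for U+0080..U+00FF start with 0xC2 or 0xC3.
        if ((byte == 0xC2 || byte == 0xC3) && i + 1 < word.size()) {
            const unsigned char next = static_cast<unsigned char>(word[i + 1]);
            if ((next & 0xC0) == 0x80) { // valid continuation byte
                ++i;                     // skip the continuation byte
                continue;
            }
        }
        return false; // code point above U+00FF, or malformed UTF-8
    }
    return true;
}

// Example usage; call from main() to run the checks.
void demoFitsInLatin1Range()
{
    assert(fitsInLatin1Range("hello"));
    assert(fitsInLatin1Range("caf\xC3\xA9"));     // e with acute accent, U+00E9
    assert(!fitsInLatin1Range("z\xC5\x82oty"));   // l with stroke, U+0142
}
```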
Generate, save, and load a vocabulary file (one word per line):
#include "WordParser.h"

int main()
{
    // Settings used while generating the word list
    WordParser::MinOccurr = 3;
    WordParser::MaxWordLen = 32;

    // Build the word list by scanning the text files in this directory
    phmap::parallel_flat_hash_map<std::string, uint64_t> words;
    WordParser::GenEnglishWordList(words, "C:/text_files/");

    // Save the word list (one word per line)
    WordParser::SaveWordList(words, "C:/english_words.txt");

    // Load the saved word list back into a set
    phmap::parallel_flat_hash_set<std::string> engWords;
    WordParser::LoadWordList(engWords, "C:/english_words.txt");
}

Generate and save a "word web" file (records which words are used before and after each word):
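#include "WordParser.h"

int main()
{
    // Settings used while generating the word web
    WordParser::MinOccurr = 4;
    WordParser::MaxWordLen = 28;

    // Build the word web by scanning the text files in this directory
    // (assumed here to be the same directory as in the first example)
    phmap::parallel_flat_hash_map<std::string, WordMaps> words;
    WordParser::GenEnglishWordWeb(words, "C:/text_files/");

    // Save the word web
    WordParser::SaveWordWeb(words, "C:/english_wordweb.txt");
}

Once you have a word web, it can be used to compute word similarities: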
#include "WordParser.h"

int main()
{
    // Load the saved word web
    phmap::parallel_flat_hash_map<std::string, std::pair<uint32_t,WordLinks>> wordMap;
    std::vector<std::pair<std::string, uint32_t>> wordVec;
    WordParser::LoadWordWeb(wordMap, wordVec, "C:/english_wordweb.txt");

    // Compute the word similarities and save them
    phmap::parallel_node_hash_map<std::string, WP_PQ> simWords;
    WordParser::GenSimWords(simWords, wordVec, wordMap);
    WordParser::SaveSimWords(simWords, "C:/english_wordsims.txt");
}
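
How GenSimWords scores similarity is not described here. As a rough illustration of how neighbour counts from a word web could be turned into a similarity score, the sketch below uses cosine similarity between two words' neighbour-count maps. This is an assumed technique, not WordParser's documented method, and `Counts` and `cosineSimilarity` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical type (not from WordParser): neighbour word -> occurrence count.
using Counts = std::map<std::string, uint32_t>;

// Cosine similarity between two neighbour-count maps.
// Returns a value in [0, 1]; 0 if either map is empty.
double cosineSimilarity(const Counts& a, const Counts& b)
{
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& kv : a) {
        normA += static_cast<double>(kv.second) * kv.second;
        auto it = b.find(kv.first);
        if (it != b.end()) dot += static_cast<double>(kv.second) * it->second;
    }
    for (const auto& kv : b) {
        normB += static_cast<double>(kv.second) * kv.second;
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Example usage; call from main() to run the checks.
void demoCosineSimilarity()
{
    const Counts cat = {{"the", 3}, {"a", 1}};
    const Counts dog = {{"the", 3}, {"a", 1}};
    const Counts run = {{"to", 2}};
    assert(std::fabs(cosineSimilarity(cat, dog) - 1.0) < 1e-9); // same neighbours
    assert(cosineSimilarity(cat, run) == 0.0);                  // no shared neighbours
}
```

Words that tend to share the same neighbours score close to 1, while words with no neighbours in common score 0.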