Word2vec is a research and exploration pipeline designed to analyze biomedical grants, publication abstracts, and other natural language corpora. While this repository is primarily a research platform, it is used internally within the Office of Portfolio Analysis at the National Institutes of Health.
When installing in a new environment, the following commands may be useful:

```
sudo apt install libssl1.0*  # For Ubuntu 18.04
pyenv install 2.7.11
pyenv local 2.7.11
pip install pip -U
pip install setuptools -U
pip install -r requirements.txt
```
Pipeline parameters and options for word2vec are set through the configuration file; the defaults can serve as a guide for new projects.
Bracketed sections within the config file outline each step of the word2vec pipeline; for instance, the parameters that affect word2vec embedding are found in the `[embed]` section.
Within each step, output data is stored in the directory named by that step's `output_data_directory` option.
Each step of the pipeline, along with its corresponding function, is listed in the table below:
| Step | Function |
| --- | --- |
| `import_data` | Imports documents and concatenates text fields |
| `phrase` | Assigns single definitions to abbreviated words or phrases |
| `parse` | Removes non-contextual language |
| `embed` | Assigns numerical weights to the words |
| `score` | Assigns numerical weights to the documents |
| `metacluster` | Separates the data into clusters based on the embedding |
| `analyze` | Provides statistical data for each cluster |
| `predict` | Predicts input features from the document vectors |
`import_data` does not perform any text processing; its purpose is to assign each document a unique reference ID (`_ref`) and to concatenate the specified fields. Text processing requires that CSV documents containing labeled column headers be imported into the pipeline and given a unique reference ID.
```
[import_data]
    input_data_directories = datasets,
    merge_columns = title, abstract, "specific aims"
    output_data_directory = data_import
```
To properly import documents, create a data folder that is listed under `input_data_directories`; by default this field is set to recognize a folder entitled `datasets`.
As the word2vec pipeline is limited to processing one text field per document, the `import_data` step requires that different fields be concatenated into one. For instance, the option:

```
merge_columns = title, abstract, "specific aims"
```
would create a new text column combining each document's title, abstract, and specific aims into a single text field that can then be parsed.
Note that "specific aims" must be quoted because it is two words, and that matching is case-sensitive ("abstract" is not the same as "Abstract").
The merged column text can be found in the `import_data` output folder (`data_import`).
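The concatenation performed by `import_data` can be sketched in a few lines of plain Python. This is an illustration only, not the pipeline's actual implementation; the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical sketch of the import_data merge: concatenate the configured
# columns into a single text field, keyed by a unique _ref id per document.
MERGE_COLUMNS = ["title", "abstract", "specific aims"]

raw = io.StringIO(
    'title,abstract,"specific aims"\n'
    "A study of cats,Cats are studied.,We aim to study cats.\n"
)

documents = []
for ref_id, row in enumerate(csv.DictReader(raw)):
    text = " ".join(row[col] for col in MERGE_COLUMNS if row.get(col))
    documents.append({"_ref": ref_id, "text": text})

print(documents[0]["text"])
# A study of cats Cats are studied. We aim to study cats.
```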
Abbreviated terms and phrases within the dataset can be replaced with single definitions using the `phrase` step. The resulting file lists abbreviated terms and phrases as well as their prevalence within the dataset; this information is stored in the `output_data_directory` folder in the file `abbreviations.csv`.
```
[phrase]
    output_data_directory = data_document_scores/
    f_abbreviations = abbreviations.csv
```
Concatenated document fields within the pipeline can be parsed for word2vec embedding.
Stripping the text of stop words, punctuation, errors, and content lacking semantic information is performed with the NLPre library, a (pre)-processing library capable of smoothing data inconsistencies.
Parsed documents are automatically saved to the `data_parsed` output folder.
```
[parse]
    output_table = parsed
    output_data_directory = data_parsed
    pipeline = dedash, titlecaps, replace_acronyms, separated_parenthesis, replace_from_dictionary, token_replacement, decaps_text, pos_tokenizer

    [[replace_from_dictionary]]
        prefix = 'MeSH_'

    [[replace_acronyms]]
        prefix = 'PHRASE_'

    [[separated_parenthesis]]
        # Only keep long parenthetical content
        min_keep_length = 10

    [[pos_tokenizer]]
        POS_blacklist = connector, cardinal, pronoun, symbol, punctuation, modal_verb, adverb, verb, w_word, adjective
```
The embed step of the pipeline scans the pre-processed text and creates word vectors by assigning numerical weights according to their distributed representation. This is the eponymous word2vec step.
```
[embed]
    input_data_directory = data_parsed
    output_data_directory = data_embeddings
    embedding_commands = w2v_embedding,

    [[w2v_embedding]]
        f_db = w2v.gensim
        skip_gram = 0
        hierarchical_softmax = 1
        epoch_n = 30
        window = 5
        negative = 0
        sample = 1e-5
        size = 300
        min_count = 10
```
Modifications can be made to this step to tailor it for individual analyses.
Common adjustments include changes to the `window`, `size`, and `min_count` settings: `window` is the size of the frame used to scan the text, `size` is the dimensionality of the word vectors generated, and `min_count` is the minimum number of times a word must appear before it is recognized as a term by the algorithm.
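What the `window` setting controls can be illustrated with a small sketch that pairs each center word with the context words inside its frame. This is purely illustrative; gensim performs this scan internally during training.

```python
# Illustrative sketch of the window parameter: each center word is
# paired with up to `window` context words on either side.
def context_pairs(tokens, window):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

tokens = ["cells", "express", "the", "receptor"]
print(context_pairs(tokens, window=1))
# [('cells', 'express'), ('express', 'cells'), ('express', 'the'),
#  ('the', 'express'), ('the', 'receptor'), ('receptor', 'the')]
```

A larger `window` therefore generates more (center, context) pairs per word, trading training speed for broader context.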
The output gensim data is then stored in the `data_embeddings` output folder under the filename `w2v.gensim`.
The stored data can be accessed using the gensim library.
The learned vectors can be utilized for other machine learning tasks such as unsupervised clustering or prediction; this process therefore requires enough document data for accurate training.
You can read more about word2vec embeddings in Mikolov et al.'s original papers.
Using the score step, word vectors are generated for each document's embedded text to compare similarity across the entire dataset.
The `count_commands` subsection determines the weights assigned to each word within a document. At least one method must be listed under `score_commands`; the most common is `score_unique_IDF`.
A full description of each score command can be found in the table below.
These scoring measures create a 300-dimensional vector for each document, which represents its position in word2vec space.
Due to size restrictions, the scored data is stored as an HDF5 file in the `data_document_scores` output folder.
Each of the scoring functions assumes a bag-of-words model; each adds up the contribution of every word and renormalizes the vector to unit length. As an example, assume your document has only two words: "cat", which appears twice, and "dog", which appears only once. Let their word vectors be v1 and v2, and their IDF scores from `count_commands` be f1 and f2.
| Score command | Description | Example |
| --- | --- | --- |
| `score_simple` | Adds the word vectors | `2*v1 + v2` |
| `score_unique` | Adds the word vectors only once | `v1 + v2` |
| `score_simple_IDF` | Adds the word vectors weighted by IDF | `2*v1*f1 + v2*f2` |
| `score_unique_IDF` | Adds the word vectors weighted by IDF only once | `v1*f1 + v2*f2` |
| | Same as simple IDF, but removes the first principal component per doc | |
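The cat/dog example above can be worked through directly. The sketch below uses toy 2-dimensional vectors instead of the pipeline's 300-dimensional ones, and the vector and IDF values are made up for illustration.

```python
import math

# Toy walk-through of the scoring formulas: "cat" appears twice
# (vector v1, IDF f1) and "dog" appears once (vector v2, IDF f2).
v1, f1 = [1.0, 0.0], 0.5   # "cat"
v2, f2 = [0.0, 1.0], 2.0   # "dog"

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

simple     = normalize([2 * a + b for a, b in zip(v1, v2)])        # 2*v1 + v2
unique     = normalize([a + b for a, b in zip(v1, v2)])            # v1 + v2
unique_IDF = normalize([2 * f1 * a + f2 * b for a, b in zip(v1, v2)]) if False \
             else normalize([f1 * a + f2 * b for a, b in zip(v1, v2)])  # v1*f1 + v2*f2

print(simple)   # [0.894..., 0.447...]
print(unique)   # [0.707..., 0.707...]
```

Note that every result is renormalized to unit length, so only the direction of the document vector carries information.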
Principal component analysis (PCA) dimensionality reduction can be applied to these 300-dimensional vectors to identify the most influential components; the default is to reduce to 25 dimensions, as specified by `n_components` under `[[reduced_representation]]`.
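The reduced representation can be sketched with numpy via the singular value decomposition. This is an illustration of the idea only, assuming numpy is available; the array shapes and the choice of 2 components (rather than 25) are for the example's sake.

```python
import numpy as np

# Sketch of PCA reduction: project document vectors onto their top
# principal components (2 here for illustration; the pipeline's
# default n_components is 25).
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 10))      # stand-in for an (n_docs, 300) matrix

centered = docs - docs.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ Vt[:2].T          # keep the top 2 components

print(reduced.shape)                   # (100, 2)
```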
Document scores are determined based on the gensim word2vec model created by the `embed` step.
To speed up the scoring process, word2vec embedding models from previous runs can be reused to score other documents.
To use a set of approximate "stop-words", adjust the values under `[[downsample_weights]]`.
For each downsampled word, a Gaussian is expanded around the center word (ci), and all words (cj) are downsampled by a factor of exp(-alpha*(ci·cj)), where alpha is the weight. Words are never upsampled, as the value above is clipped at unity. A warning is issued if a downsampled word is not in the embedding.
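The downsampling factor above can be computed in a few lines. This sketch assumes the word vectors are unit-normalized so the dot product ci·cj is their cosine similarity; the vectors and weight used are hypothetical.

```python
import math

# Downsampling factor for a word u relative to a downsampled center
# word c with weight alpha: exp(-alpha * (c . u)), clipped at 1 so
# that words are never upsampled.
def downsample_factor(c, u, alpha):
    dot = sum(a * b for a, b in zip(c, u))
    norm = math.sqrt(sum(a * a for a in c)) * math.sqrt(sum(b * b for b in u))
    return min(1.0, math.exp(-alpha * dot / norm))

center = [1.0, 0.0]
print(downsample_factor(center, [1.0, 0.0], alpha=0.5))   # exp(-0.5) ~ 0.607
print(downsample_factor(center, [-1.0, 0.0], alpha=0.5))  # clipped to 1.0
```

Words similar to a downsampled center word are thus suppressed most strongly, while dissimilar words are left untouched.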
```
[score]
    output_data_directory = data_document_scores
    f_db = document_scores.h5
    compute_reduced_representation = True
    count_commands = term_document_frequency, term_frequency,
    score_commands = score_unique_IDF, score_simple,

    [[downsample_weights]]
        # Downsampling weights, adjust as needed
        understand = 0.50
        scientific = 0.25

    [[reduced_representation]]
        n_components = 25

    [[term_frequency]]
        f_db = TF.csv

    [[term_document_frequency]]
        f_db = TDF.csv
```
Document score outputs can be used to create interpretive clustering algorithms.
Document similarity, based on the embedding outputs, can be analyzed by cluster size and proximity.
Document vectors are pulled from only one scoring method, specified by `score_method`.
Since document vectors are not distributed according to the assumptions underlying k-means, spectral clustering is preferred.
However, spectral clustering is too computationally expensive to run on large datasets,
so we perform "metaclustering" using random sampling of subsets of the data.
The parameters of the metacluster step can be adjusted depending on the analysis.
Each subcluster has size `subcluster_m`, the number of subclusters generated is `subcluster_kn`, and `subcluster_pcut` sets the fraction of clusters kept after the most dissimilar ones are discarded.
For example, if `subcluster_kn = 32` and `subcluster_pcut = .8`, documents will be assigned to only int(32 * .8) = 25 total clusters.
The `subcluster_repeats` variable determines how many times the clustering algorithm is performed.
```
[metacluster]
    score_method = unique_IDF
    subcluster_m = 1000
    subcluster_kn = 15
    subcluster_pcut = 0.80
    subcluster_repeats = 1
    output_data_directory = data_clustering
    f_centroids = meta_cluster_centroids.h5
```
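The number of final metaclusters these parameters imply can be computed directly. This is a sketch of the arithmetic only, under the assumption that centroids accumulate across repeats before the dissimilarity cut; it is not the clustering itself.

```python
# Number of metaclusters kept: subcluster_kn clusters per repeat,
# with only the top subcluster_pcut fraction surviving the cut.
# (Assumes centroids accumulate across repeats.)
def n_metaclusters(subcluster_kn, subcluster_repeats, subcluster_pcut):
    return int(subcluster_kn * subcluster_repeats * subcluster_pcut)

print(n_metaclusters(32, 1, 0.8))   # 25, as in the example above
print(n_metaclusters(15, 1, 0.8))   # 12, with the default config values
```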
This step of the pipeline has multiple options: `analyze metacluster` and `analyze LIME`.
The `analyze metacluster` step returns additional document and cluster information. Under this command, each document is labeled with the cluster it is assigned to.
The labels assigned to each document should capture broad themes of semantic content.
Cluster and document statistics can be used for comparing average document similarity as well as inter-document similarity.
The output location of this command is determined by the variable `output_data_directory`. Document analysis data for the corresponding clusters are stored in the output data folder (`results`). Cluster statistics, including document similarity, can be found in the `cluster_desc.csv` file in the same folder.
These statistics are informative, but must be verified by human interpretation.
This information is a measure of document semantic similarity given the model's training and the similarity of the portfolio; data quality issues will therefore impact the outcome of this algorithm.
The average distance of each document within a cluster from the centroid is reported in its own column. If `compute_dispersion` is True, the output contains a column labeled `intra_document_dispersion` that measures the average document similarity, and `dispersion_order` attempts to re-arrange each cluster in an order that reflects inter-document similarity.
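Dispersion of this kind can be sketched as the average distance of a cluster's document vectors from their centroid. This is illustrative only; the pipeline's exact dispersion formula may differ, and the sample vectors are hypothetical.

```python
import math

# Average distance of each document vector from the cluster centroid
# (illustrative; the pipeline's exact measure may differ).
def dispersion(vectors):
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return sum(math.dist(v, centroid) for v in vectors) / len(vectors)

tight = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]   # semantically similar docs
loose = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # semantically scattered docs
print(dispersion(tight) < dispersion(loose))   # True
```

Lower values indicate clusters whose documents sit closer together in word2vec space.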
The `analyze LIME` step attempts to differentiate words between all pairs of close metaclusters. LIME is often informative, but be aware that it may take a while to compute.
Results are stored in the `output_data_directory`.
Metaclusters are considered "close" if the cosine similarity between their centroids is greater than `metacluster_cosine_minsim`.
```
[postprocessing]
    compute_dispersion = True
    output_data_directory = results
    master_columns = PMID, title

    [[LIME_explainer]]
        metacluster_cosine_minsim = 0.6
        score_method = unique_IDF
        n_lime_samples = 25  # Make this higher for more accuracy
        n_lime_features = 50
        n_estimators = 50
```
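The closeness test for a pair of metacluster centroids can be sketched directly. The centroid vectors below are hypothetical; the threshold matches the default `metacluster_cosine_minsim` of 0.6.

```python
import math

# Two metaclusters are "close" when the cosine similarity of their
# centroids exceeds metacluster_cosine_minsim (0.6 by default).
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def are_close(c1, c2, minsim=0.6):
    return cosine_similarity(c1, c2) > minsim

print(are_close([1.0, 0.0], [1.0, 0.2]))   # True: nearly parallel centroids
print(are_close([1.0, 0.0], [0.0, 1.0]))   # False: orthogonal centroids
```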
The `predict` step tries to learn a model that accurately predicts the categories in the columns listed under `categorical_columns`. The data are fit against the document vectors produced by the `score` step using a random forest with `n_estimators` trees. To robustly test the accuracy of the model, the fit is repeated using the number of folds in `cross_validation_folds`. If `use_reduced` is True, the data are fit using the PCA-reduced vectors; otherwise the full document vectors are used. If `use_SMOTE` is True, the minority and majority classes are over- and under-sampled with the SMOTE algorithm so that the training data is evenly balanced. A meta-estimator is used if `use_meta` is True, combining all the scoring methods listed under `meta_methods`. The final output is stored under the `output_data_directory`, and the `extra_columns` from the original dataset are copied over for convenience.
```
[predict]
    categorical_columns = journal,
    n_estimators = 200
    cross_validation_folds = 12
    use_SMOTE = False
    use_reduced = True
    use_meta = True
    meta_methods = unique_IDF,
    output_data_directory = data_predict
    extra_columns = journal, title,
```
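The fold splitting implied by `cross_validation_folds` can be sketched in plain Python. This illustrates only the index partitioning (the actual model fitting is presumably handled by a library such as scikit-learn); the sample size of 100 is arbitrary.

```python
# Sketch of k-fold cross-validation splitting: partition document
# indices into cross_validation_folds disjoint folds; each fold is
# held out once for testing while the rest train the model.
def k_fold_indices(n_samples, n_folds):
    folds = [[] for _ in range(n_folds)]
    for i in range(n_samples):
        folds[i % n_folds].append(i)
    return folds

folds = k_fold_indices(100, 12)
print(len(folds))                     # 12
print(sum(len(f) for f in folds))     # 100, every document used exactly once
```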
This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.