Aztec-TagGeneration

Running Steps:

1. Use the modified SegPhrase code to generate parsed files in which high-frequency phrases are marked. All the parsed results are stored in data/parsed in this repo. To generate the parsed results, run: ./parse.sh
2. Combine all the parsed files into a single file and feed that file to Doc2Vec.py, which generates doc2vec models with the specified word embedding dimension. Run with: python Doc2Vec.py
3. doc2vec_visual.py uses a trained Doc2Vec model with the specified word embedding dimension to produce a single file containing all the significant phrases for each of the parsed files.
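
The combining step is not shown in this README. A minimal sketch of one way to do it, assuming each parsed file holds one paper and the combined file should hold one paper per line (the function name `combine_parsed_files` and both paths are hypothetical, not part of this repo):

```python
from pathlib import Path


def combine_parsed_files(parsed_dir, combined_path):
    """Write every file in parsed_dir to combined_path, one document per line.

    Hypothetical helper: the one-paper-per-line layout is an assumption.
    """
    # List the inputs before opening the output file.
    paths = sorted(p for p in Path(parsed_dir).iterdir() if p.is_file())
    with open(combined_path, "w", encoding="utf-8") as out:
        for path in paths:
            text = path.read_text(encoding="utf-8", errors="replace")
            # Collapse all whitespace so each paper occupies exactly one line.
            out.write(" ".join(text.split()) + "\n")
    return len(paths)
```

Keeping one paper per line makes the line number usable as a document tag later on.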

Data

data/PaperSet: all 4953 PubMed papers
ProcessedNew: all 4953 processed PubMed papers, lowercased, with the spaces between capital letters in the original papers removed
parsed_0.8: output of SegPhrase parsing with parameter 0.8; all frequent phrases are connected with "_"
parsed: output of SegPhrase parsing with the default parameter 0.6; all frequent phrases are connected with "_"

Source Code

Doc2Vec.py: reads the processed and combined file of 4953 PubMed papers, then builds and trains a doc2vec model with the dimension specified in the parameters
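
For orientation, a minimal sketch of what such a training script might look like. It assumes the gensim 4.x `Doc2Vec` API and a combined file with one paper per line. Neither assumption is confirmed by this README, and `read_corpus`, `train_doc2vec`, `min_count=2` and `epochs=20` are illustrative choices rather than the settings used in Doc2Vec.py:

```python
def read_corpus(combined_path):
    """Yield (tokens, tag) pairs, one per non-empty line of the combined file."""
    with open(combined_path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            tokens = line.split()
            if tokens:
                # The line number serves as the document tag (an assumption).
                yield tokens, index


def train_doc2vec(combined_path, dimension=100, epochs=20):
    """Build and train a model; dimension is the embedding size (e.g. 50, 100, 400)."""
    # gensim is imported here so the corpus reader above works without it.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    documents = [TaggedDocument(words=tokens, tags=[tag])
                 for tokens, tag in read_corpus(combined_path)]
    model = Doc2Vec(vector_size=dimension, min_count=2, epochs=epochs)
    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

Because phrases are already joined with "_" in the parsed files, splitting on whitespace treats each phrase as a single token.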

dumpFiles.py: extracts all 4953 papers from a single file, Output2.txt. The extracted files are better formatted and have fewer syntax errors. The extracted papers are further processed to train the doc2vec model. All the processed files are stored in /ProcessedNew

insertSpace.py: inserts a space between capital letters in the original papers, then converts the papers to lowercase for SegPhrase training.
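
A minimal sketch of this transformation, reading the description literally (a space goes between every pair of adjacent capital letters, then everything is lowercased). The function name is hypothetical and the actual script may handle more cases:

```python
import re


def insert_space_and_lowercase(text):
    """Put a space between adjacent capital letters, then lowercase the text."""
    # Zero-width match: the position between two adjacent capital letters.
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z])", " ", text)
    return spaced.lower()


print(insert_space_and_lowercase("DNA Polymerase"))
```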

doc2vec_visual.py: dumps the most significant words for each paper, as given by the trained doc2vec model, to results.txt. Usage: python doc2vec_visual.py [model dimension] [file range]
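
This README does not say how "most significant" is computed. One plausible reading, sketched here purely as an assumption, is to rank words by the cosine similarity between their word vectors and the paper's document vector. The vectors below are toy values, and `cosine` and `significant_words` are hypothetical names:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def significant_words(doc_vector, word_vectors, topn=3):
    """Return the topn words whose vectors lie closest to doc_vector."""
    ranked = sorted(word_vectors,
                    key=lambda word: cosine(doc_vector, word_vectors[word]),
                    reverse=True)
    return ranked[:topn]


# Toy 2-dimensional vectors, for illustration only.
toy_words = {"gene_expression": [0.9, 0.1], "the": [0.0, 1.0], "protein": [0.5, 0.5]}
print(significant_words([1.0, 0.0], toy_words, topn=2))
```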

Modified SegPhrase

SegPhrase was git-cloned from the original author's repository. The source code was modified for the current task.

src/online_query/segphrase_parser.cpp: the original code uses [] to mark frequent phrases. This was changed to "_", which joins each phrase into a single word for doc2vec training.
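
To illustrate the effect of this change (not the C++ implementation itself), here is a small Python sketch that converts the original bracketed output into the underscore-joined form. The function name and the sample sentence are hypothetical:

```python
import re


def join_bracketed_phrases(text):
    """Rewrite "[multi word phrase]" as "multi_word_phrase"."""
    # Match one bracketed span and join its words with underscores.
    return re.sub(r"\[([^\[\]]+)\]",
                  lambda match: "_".join(match.group(1).split()),
                  text)


print(join_bracketed_phrases("we study [gene expression] in [e coli] cells"))
```

Joining phrases this way lets a whitespace tokenizer treat each frequent phrase as one vocabulary entry.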

The modified code and results are uploaded to Google Drive.

Models

Trained Doc2Vec models with word embedding dimensions 50, 100 and 400 are uploaded to Google Drive.

Results

data/results.txt: contains each paper with its corresponding significant words, from the doc2vec model trained with the specified word embedding dimension
