Node2Vec implementation based on paper by A. Grover and J. Leskovec. Paper link. The dataset used is BlogCatalog and can be found here
This project generates walks based on the algorithm mentioned in the paper. Learning vectors from these walks will be implemented in next update.
Can first calculate all probabilities for AliasSampling in a file, but not used since it might take days to generate them all.
Current implementation generates the tables on a as needed basis and routinely cleans them up to avoid slowdown due to a large hashmap and also possibly running out of memory. Can resume previous runs if program has to quit in the middle.
To run a complete experiment run following commands:
-
Compile
Main.Java
. Also make sure the dataset is stored in a directory and has appropriate paths set inconfig.txt
. For using default configuration create a directory calleddata
in same directory as the project and addnodes.csv
,groups.csv
andgroups-edges.csv
files obtained from BlogCatalog. -
Run
java Main.Class <config_file_location>
. The main class takes just one command line parameter, the location of the configuration file. Look atconfig.txt
in main repository for more info. Make sure correct output directories for writing the walks exist along with proper permissions. -
Go to the directory containing the
embeddingLearner.py
. Create a directory calledruns
or with any name but it should match withlog_dir
intrain_config.txt
-
Run
python embeddingLearner.py --config-file=<config_file> --input-file=<walks_file>
.config_file
is the location of configuration file for training the embedder. Look at train_config.txt in the main repository for more info. Thewalks_file
is the file to which the Java program wrote the output of its walks. The output of the runs will be stored inlog_dir\{timestamp}
. Use this location in next step. -
Run
python embeddingLearner.py --load-embed-from=<directory of previous run> --vocabulary-size=<num_of_nodes_in_graph>
. The directory of run should be as mentioned in previous step. Vocabulary size is the number of nodes in the entire graph. This will create a file 'embeddings.txt' in the same folder provided inload-embed-from
option. -
Run
python onevAll.py --embedding-file=<embeddings.txt location> --meta-file=<embedding_config> --config-file=<classifier_config>
.embedding-file
takes the location of "embeddings.txt" created in previous step.meta-file
takes the value of config file to train the embeddings which will be used, set it totrain_config.txt
found in the same directory as the run for that embedding.config-file
is configuration for training the classifier. It is in same file as the one used for embedding.