Learning Based Proximity Matrix Factorization for Node Embedding

Updates

We have updated our paper on arXiv.
We observed that for Torch.FloatTensor A and B, the results computed by using command Torch.mm(A,B) are unstable on different machines. We have changed all FloatTensor to DoubleTensor, and the training process are conducted on a Linux machine with an NVIDIA RTX A6000 48G GPU.
There were some implementation issues in our evaluation code for node classification on directed graphs and we have fixed them.
We observed some significant improvement on Orkut dataset when normalization is applied. Thus, we take normalization for the embedding vectors on all datasets.

Tested Environment

First install Anaconda. Then open new.yaml file, and change the prefix in the last line (line 190).

prefix: /[please set the path]/anaconda3/envs

Example:

prefix: /home/user_A/anaconda3/envs

Use the following command to create a new conda environment.

conda env create -f new.yaml
conda activate new

Enviroment:

Layout

data
lp_data
embds
alpha
frPCA
label
evaluation

Input

Place the prepared edgelist file [graphname].txt in data/. Note that the first row of data is the number of nodes.

Example

wiki.txt

Compilations

bash compile.sh

Algorithm

Learning Process

The trained alpha files are saved in alpha/ folder. linkpred.py and classification.py are designed for small graphs, linkpred_sample.py and classification_sample.py are used for large graphs (n > 10k). We also provide some suggestions to train alphas on a new graph in Evaluation on new dataset section.

Parameters

nepoch: number of training epochs, default = 50
lr: learning rate
nhop: maximum steps for random walks, default = 15
sample: number of negative samples per node, default = 10
data: dateset name
dist: initialization distribution for alphas, p for Possion and g for geometric, default = 'p'
param: initialization parameter, e.g., 1 for Possion and 0.5 for geometric distribution, default = 5
beta: hyperparameter in loss function

gamma: hyperparameter in loss function

task	dataset	dist	param	beta	gamma	lr	delta for push	svd for push
link prediction	Wikipedia	p	7	0.01	1	0.01	1e-5	JacobiSVD
link prediction	Wikivote	p	1	0.5	1	0.5	1e-6	frPCA
link prediction	BlogCatalog	g	0.5	0.01	1	0.1	1e-7	frPCA
link prediction	Slashdot	p	5	0.1	1	0.001	1e-5	frPCA
link prediction	Tweibo	g	0.5	0.1	1	0.01	1e-5	frPCA
link prediction	Orkut	p	1	1	2	0.01	1e-4	frPCA
node classification	Wikipedia	p	1	3	1	0.5	1e-4	JacobiSVD
node classification	BlogCatalog	p	5	3	1	0.5	1e-5	frPCA
node classification	Tweibo	p	1	1	3	0.5	1e-4	frPCA
node classification	Orkut	p	1	1	2	0.01	1e-4	frPCA

Examples

Wikivote, link prediction, note that we need to split the data before training:

./gendata_d -graph wikivote -test_ratio 0.3
python linkpred.py --data wikivote --lr 0.5 --dist p --param 1 --beta 0.5 --gamma 1

BlogCatalog, classification:

python classification.py --data BlogCatalog --dist p --param 5 --beta 3 --gamma 1 --lr 0.5

Generalized Push

We provide two versions of Randomized SVD to generate embeddings, i.e., frPCA (filename: lemane_frpca_u.cpp, lemane_frpca_d.cpp) and JacobiSVD (filename: lemane_svd_u.cpp, lemane_svd_d.cpp), 'u' for undirected graphs, 'd' for directed graphs. See example_link.sh and example_class.sh for more details.

Parameters

graph: name of input graph
graph_path: graph file path, default = 'data/'
emb_path: embedding file path, default = 'embds/'
alpha_path: trained alpha file path, default = 'alpha/'
task: downstream task, link or class, default = 'class'
d: embedding dimension, default = 128
delta: push error, default = 1e-5
num_thread: number of threads used for push, default = 24

Examples

Wikipedia, link prediction:

./lemane_svd_d -graph wiki -graph_path lp_data/train_graph/ -task link -delta 0.00001

BlogCatalog, classification:

./lemane_frpca_u -graph BlogCatalog

Experiments

Use the following commands to reproduce the results. example_link.sh for link prediction and example_class for node classification.

bash example_link.sh
bash example_class.sh

Link Prediction

First, split the graph into training/testing set and generate negative samples. Training set will be saved into lp_data/train_graph/ folder, positive testing set will be saved into lp_data/test/positive/ folder, and negative testing set will be saved into lp_data/test/negative/ folder.

Parameters

graph: name of target graph
testing ratio: testing ratio, default = 0.3
method: embedding method name, default = 'lemane_frpca_link'

Examples

Splitting graph. For link prediction task, you should split the graph before the training process.

./gendata_u -graph BlogCatalog

./gendata_d -graph wiki

Then get embeddings of the training set.

./lemane_frpca_u -graph BlogCatalog -graph_path lp_data/ -task link -delta 0.0000001

The code to calculate link prediction precision:

./linkpre_u -graph BlogCatalog -method lemane_frpca_link

Node Classification

Generate the embeddings of full graph.

Parameters

graph: name of target graph
method: embedding method name, default = 'lemane_frpca_class'

Example

./lemane_svd_u -graph BlogCatalog

Train a classifier(one vs all logistic regression) using the embeddings of full graph and the provided labels. The performance is evaluated in terms of Micro-F1 metric.

python labelclassification.py --graph BlogCatalog --method lemane_frpca_class

Training and evaluation on new dataset

Since the training process is complex and easily to fall into a local minimum, if you have a new dataset, three initialized distributions, i.e. Poisson distribution with t = 1 and t = 5, geometric distribution with a = 0.5, are suggested for training alphas and evaluation.

If you have any questions on hyperparameter settings and training process, feel free to contact us zxy96cuhk@gmail.com and we are willing to provide suggestions or help to train our model on new datasets.

Training Process

We suggest using grid search to set hyperparameters beta and gamma from {0.01, 0.1, 0.5, 1, 2, 3}, learning rate from {0.001, 0.005, 0.01, 0.05, 0.1, 0.5} for each initialized distribution. Run at least 5 times and select the one with the minimum loss under each hyperparameter setting, i.e., you will get several groups of alphas for one task (link prediction or node classification) after the training process, each group is corresponding to a certain hyperparameter setting.

Examples

Suppose the number of nodes in [graphname] is more than 10k, and [graphname] is undirected.

Suppose we use Possion distribution with t = 1 as initialized distribution, set learning rate = 0.001, beta = 1, and gamma = 0.01 for link prediction, first split the data, then train alphas on the training graph:

./gendata_u -graph [graphname] -test_ratio 0.3
python linkpred_sample.py --data [graphname] --lr 0.001 --dist p --param 1 --beta 1 --gamma 0.01

Suppose we use geometric distribution with a = 0.5 as initialized distribution, set learning rate = 0.5, beta = 3 and gamma = 1 for node classification:

python classification_sample.py --data [graphname] --lr 0.5 --dist g --param 0.5 --beta 3 --gamma 1

Generalized Push

We suggest using frPCA for social networks. For other types, using frPCA for graphs with more than 10k nodes and using JacobiSVD for small graphs with less than 10k nodes.

Examples

Suppose the number of nodes in a social network graph [graphname] is more than 10k, and [graphname] is undirected.

Generate embedding, and evaluate lemane for link prediction:

./lemane_frpca_u -graph [graphname] -graph_path lp_data/train_graph/ -task link
./linkpred_u -graph [graphname] -method lemane_frpca_link

Generate embedding, and evaluate lemane for node classification:

./lemane_frpca_u -graph [graphname]
python labelclassification.py --graph [graphname] --method lemane_frpca_class

For each task, generate embeddings with different alphas trained under different settings and report the best result.

Citation

@inproceedings{lemane,
author = {Xingyi, Zhang and Kun, Xie and Sibo Wang and Zengfeng, Huang},
title = {Learning Based Proximity Matrix Factorization for Node Embedding},
year = {2021},
booktitle = {KDD},
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Learning Based Proximity Matrix Factorization for Node Embedding

Updates

Tested Environment

Layout

Input

Compilations

Algorithm

Learning Process

Generalized Push

Experiments

Link Prediction

Node Classification

Training and evaluation on new dataset

Training Process

Generalized Push

Citation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
alpha		alpha
data		data
embds/wiki		embds/wiki
evaluation		evaluation
frPCA		frPCA
label		label
lp_data		lp_data
README.md		README.md
SVD.h		SVD.h
classification.py		classification.py
classification_sample.py		classification_sample.py
compile.sh		compile.sh
example_class.sh		example_class.sh
example_linkpred.sh		example_linkpred.sh
graph_util.py		graph_util.py
lemane_svd_d.cpp		lemane_svd_d.cpp
lemane_svd_u.cpp		lemane_svd_u.cpp
linkpred.py		linkpred.py
linkpred_sample.py		linkpred_sample.py
new.yaml		new.yaml
svd.py		svd.py

Josiah96Zhang/Lemane

Folders and files

Latest commit

History

Repository files navigation

Learning Based Proximity Matrix Factorization for Node Embedding

Updates

Tested Environment

Layout

Input

Compilations

Algorithm

Learning Process

Generalized Push

Experiments

Link Prediction

Node Classification

Training and evaluation on new dataset

Training Process

Generalized Push

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages