Skip to content

Learning Based Proximity Matrix Factorization for Node Embedding

Notifications You must be signed in to change notification settings

Josiah96Zhang/Lemane

Repository files navigation

Learning Based Proximity Matrix Factorization for Node Embedding

Updates

  • We have updated our paper on arXiv.
  • We observed that for Torch.FloatTensor A and B, the results computed by using command Torch.mm(A,B) are unstable on different machines. We have changed all FloatTensor to DoubleTensor, and the training process are conducted on a Linux machine with an NVIDIA RTX A6000 48G GPU.
  • There were some implementation issues in our evaluation code for node classification on directed graphs and we have fixed them.
  • We observed some significant improvement on Orkut dataset when normalization is applied. Thus, we take normalization for the embedding vectors on all datasets.

Tested Environment

First install Anaconda. Then open new.yaml file, and change the prefix in the last line (line 190).

prefix: /[please set the path]/anaconda3/envs

Example:

prefix: /home/user_A/anaconda3/envs

Use the following command to create a new conda environment.

conda env create -f new.yaml
conda activate new

Enviroment:

Layout

  • data
  • lp_data
  • embds
  • alpha
  • frPCA
  • label
  • evaluation

Input

Place the prepared edgelist file [graphname].txt in data/. Note that the first row of data is the number of nodes.

Example

wiki.txt

4777
0 0
0 1
0 2
0 3
0 4
...

Compilations

bash compile.sh

Algorithm

Learning Process

The trained alpha files are saved in alpha/ folder. linkpred.py and classification.py are designed for small graphs, linkpred_sample.py and classification_sample.py are used for large graphs (n > 10k). We also provide some suggestions to train alphas on a new graph in Evaluation on new dataset section.

Parameters

  • nepoch: number of training epochs, default = 50

  • lr: learning rate

  • nhop: maximum steps for random walks, default = 15

  • sample: number of negative samples per node, default = 10

  • data: dateset name

  • dist: initialization distribution for alphas, p for Possion and g for geometric, default = 'p'

  • param: initialization parameter, e.g., 1 for Possion and 0.5 for geometric distribution, default = 5

  • beta: hyperparameter in loss function

  • gamma: hyperparameter in loss function

    task dataset dist param beta gamma lr delta for push svd for push
    link prediction Wikipedia p 7 0.01 1 0.01 1e-5 JacobiSVD
    link prediction Wikivote p 1 0.5 1 0.5 1e-6 frPCA
    link prediction BlogCatalog g 0.5 0.01 1 0.1 1e-7 frPCA
    link prediction Slashdot p 5 0.1 1 0.001 1e-5 frPCA
    link prediction Tweibo g 0.5 0.1 1 0.01 1e-5 frPCA
    link prediction Orkut p 1 1 2 0.01 1e-4 frPCA
    node classification Wikipedia p 1 3 1 0.5 1e-4 JacobiSVD
    node classification BlogCatalog p 5 3 1 0.5 1e-5 frPCA
    node classification Tweibo p 1 1 3 0.5 1e-4 frPCA
    node classification Orkut p 1 1 2 0.01 1e-4 frPCA

Examples

Wikivote, link prediction, note that we need to split the data before training:

./gendata_d -graph wikivote -test_ratio 0.3
python linkpred.py --data wikivote --lr 0.5 --dist p --param 1 --beta 0.5 --gamma 1

BlogCatalog, classification:

python classification.py --data BlogCatalog --dist p --param 5 --beta 3 --gamma 1 --lr 0.5

Generalized Push

We provide two versions of Randomized SVD to generate embeddings, i.e., frPCA (filename: lemane_frpca_u.cpp, lemane_frpca_d.cpp) and JacobiSVD (filename: lemane_svd_u.cpp, lemane_svd_d.cpp), 'u' for undirected graphs, 'd' for directed graphs. See example_link.sh and example_class.sh for more details.

Parameters

  • graph: name of input graph
  • graph_path: graph file path, default = 'data/'
  • emb_path: embedding file path, default = 'embds/'
  • alpha_path: trained alpha file path, default = 'alpha/'
  • task: downstream task, link or class, default = 'class'
  • d: embedding dimension, default = 128
  • delta: push error, default = 1e-5
  • num_thread: number of threads used for push, default = 24

Examples

Wikipedia, link prediction:

./lemane_svd_d -graph wiki -graph_path lp_data/train_graph/ -task link -delta 0.00001

BlogCatalog, classification:

./lemane_frpca_u -graph BlogCatalog

Experiments

Use the following commands to reproduce the results. example_link.sh for link prediction and example_class for node classification.

bash example_link.sh
bash example_class.sh

Link Prediction

First, split the graph into training/testing set and generate negative samples. Training set will be saved into lp_data/train_graph/ folder, positive testing set will be saved into lp_data/test/positive/ folder, and negative testing set will be saved into lp_data/test/negative/ folder.

Parameters

  • graph: name of target graph
  • testing ratio: testing ratio, default = 0.3
  • method: embedding method name, default = 'lemane_frpca_link'

Examples

Splitting graph. For link prediction task, you should split the graph before the training process.

./gendata_u -graph BlogCatalog
./gendata_d -graph wiki

Then get embeddings of the training set.

./lemane_frpca_u -graph BlogCatalog -graph_path lp_data/ -task link -delta 0.0000001

The code to calculate link prediction precision:

./linkpre_u -graph BlogCatalog -method lemane_frpca_link

Node Classification

Generate the embeddings of full graph.

Parameters

  • graph: name of target graph
  • method: embedding method name, default = 'lemane_frpca_class'

Example

./lemane_svd_u -graph BlogCatalog 

Train a classifier(one vs all logistic regression) using the embeddings of full graph and the provided labels. The performance is evaluated in terms of Micro-F1 metric.

python labelclassification.py --graph BlogCatalog --method lemane_frpca_class

Training and evaluation on new dataset

Since the training process is complex and easily to fall into a local minimum, if you have a new dataset, three initialized distributions, i.e. Poisson distribution with t = 1 and t = 5, geometric distribution with a = 0.5, are suggested for training alphas and evaluation.

If you have any questions on hyperparameter settings and training process, feel free to contact us zxy96cuhk@gmail.com and we are willing to provide suggestions or help to train our model on new datasets.

Training Process

We suggest using grid search to set hyperparameters beta and gamma from {0.01, 0.1, 0.5, 1, 2, 3}, learning rate from {0.001, 0.005, 0.01, 0.05, 0.1, 0.5} for each initialized distribution. Run at least 5 times and select the one with the minimum loss under each hyperparameter setting, i.e., you will get several groups of alphas for one task (link prediction or node classification) after the training process, each group is corresponding to a certain hyperparameter setting.

Examples

Suppose the number of nodes in [graphname] is more than 10k, and [graphname] is undirected.

Suppose we use Possion distribution with t = 1 as initialized distribution, set learning rate = 0.001, beta = 1, and gamma = 0.01 for link prediction, first split the data, then train alphas on the training graph:

./gendata_u -graph [graphname] -test_ratio 0.3
python linkpred_sample.py --data [graphname] --lr 0.001 --dist p --param 1 --beta 1 --gamma 0.01

Suppose we use geometric distribution with a = 0.5 as initialized distribution, set learning rate = 0.5, beta = 3 and gamma = 1 for node classification:

python classification_sample.py --data [graphname] --lr 0.5 --dist g --param 0.5 --beta 3 --gamma 1

Generalized Push

We suggest using frPCA for social networks. For other types, using frPCA for graphs with more than 10k nodes and using JacobiSVD for small graphs with less than 10k nodes.

Examples

Suppose the number of nodes in a social network graph [graphname] is more than 10k, and [graphname] is undirected.

Generate embedding, and evaluate lemane for link prediction:

./lemane_frpca_u -graph [graphname] -graph_path lp_data/train_graph/ -task link
./linkpred_u -graph [graphname] -method lemane_frpca_link

Generate embedding, and evaluate lemane for node classification:

./lemane_frpca_u -graph [graphname]
python labelclassification.py --graph [graphname] --method lemane_frpca_class

For each task, generate embeddings with different alphas trained under different settings and report the best result.

Citation

@inproceedings{lemane,
author = {Xingyi, Zhang and Kun, Xie and Sibo Wang and Zengfeng, Huang},
title = {Learning Based Proximity Matrix Factorization for Node Embedding},
year = {2021},
booktitle = {KDD},
}

About

Learning Based Proximity Matrix Factorization for Node Embedding

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages