# Example: facebook

## Prepare the Dataset Instance

Let's begin with a small social network dataset: facebook.
The data is included in the XGCN repository: ``data/raw_facebook``. 
You can also download it from SNAP: http://snap.stanford.edu/data/facebook_combined.txt.gz (decompress it to get ``facebook_combined.txt``).

We recommend arranging the data in a clear directory structure. 
To get started, you can manually set up an ``XGCN_data`` directory (or use any name you like) as follows: 
(We recommend putting ``XGCN_data`` somewhere outside this repository.)

```
XGCN_data
└── dataset
    └── raw_facebook
        └── facebook_combined.txt
```

From now on, we'll use this directory to hold all the datasets 
and model outputs. 
We refer to its path as ``all_data_root`` in our Python code and shell scripts. 
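
If you prefer to create the layout above from Python, the following is a minimal sketch using only the standard library. The helper name `make_data_layout` is hypothetical (it is not part of XGCN), and the demo writes into a temporary directory, so substitute your real ``all_data_root``.

```python
import tempfile
from pathlib import Path


def make_data_layout(all_data_root, dataset="facebook"):
    """Create <all_data_root>/dataset/raw_<dataset> and return that path.

    Hypothetical helper for illustration; not an XGCN function.
    """
    raw_dir = Path(all_data_root) / "dataset" / f"raw_{dataset}"
    raw_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return raw_dir


# Demo in a throwaway location; replace with your own XGCN_data path.
demo_root = Path(tempfile.mkdtemp()) / "XGCN_data"
raw_dir = make_data_layout(demo_root)
print(raw_dir)
```

After this, copy ``facebook_combined.txt`` into the printed directory.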


In [1]:
import XGCN
from XGCN.data import io, csr
from XGCN.utils.utils import ensure_dir, set_random_seed
import os.path as osp

In [3]:
# set your own all_data_root:
all_data_root = '/home/sxr/code/XGCN_and_data/XGCN_data'

### Load & save the raw graph

In [5]:
dataset = 'facebook'
raw_data_root = osp.join(all_data_root, 'dataset/raw_' + dataset)
file_raw_graph = osp.join(raw_data_root, 'facebook_combined.txt')
E_src, E_dst = io.load_txt_edges(file_raw_graph)
print(E_src)
print(E_dst)

[   0    0    0 ... 4027 4027 4031]
[   1    2    3 ... 4032 4038 4038]


In [6]:
info, indptr, indices = csr.from_edges_to_csr_with_info(
    E_src, E_dst, graph_type='homo'
)
print(info)

# from_edges_to_csr ...
# remove_repeated_edges ...
## 0 edges are removed
{'graph_type': 'homo', 'num_nodes': 4039, 'num_edges': 88234}
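
For readers new to the CSR (compressed sparse row) layout, here is a small pure-Python sketch of how an edge list maps to an `indptr`/`indices` pair. It illustrates the general format only; it is not XGCN's implementation, and `edges_to_csr` is a hypothetical name.

```python
def edges_to_csr(src, dst, num_nodes):
    """Build (indptr, indices) from a directed edge list (illustrative sketch)."""
    # Count the out-degree of every source node.
    degree = [0] * num_nodes
    for u in src:
        degree[u] += 1
    # indptr[u]:indptr[u + 1] delimits the neighbors of u inside `indices`.
    indptr = [0] * (num_nodes + 1)
    for u in range(num_nodes):
        indptr[u + 1] = indptr[u] + degree[u]
    # Fill in the neighbors, keeping one write cursor per node.
    cursor = indptr[:-1].copy()
    indices = [0] * len(src)
    for u, v in zip(src, dst):
        indices[cursor[u]] = v
        cursor[u] += 1
    return indptr, indices


src = [0, 0, 2, 1]
dst = [1, 2, 0, 2]
indptr_demo, indices_demo = edges_to_csr(src, dst, num_nodes=3)
print(indptr_demo)   # [0, 2, 3, 4]
print(indices_demo)  # [1, 2, 2, 0]
# Neighbors of node 0:
print(indices_demo[indptr_demo[0]:indptr_demo[1]])  # [1, 2]
```

The appeal of this layout is that the neighbors of any node form one contiguous slice, which makes neighbor lookups cheap.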


Loading large graphs from text files can be time-consuming (though the facebook graph here is small), so we cache the CSR arrays with ``io.save_pickle`` and the graph info with ``io.save_yaml``: 

In [7]:
raw_csr_root = osp.join(raw_data_root, 'csr')
ensure_dir(raw_csr_root)  # mkdir if not exists

io.save_yaml(osp.join(raw_csr_root, 'info.yaml'), info)
io.save_pickle(osp.join(raw_csr_root, 'indptr.pkl'), indptr)
io.save_pickle(osp.join(raw_csr_root, 'indices.pkl'), indices)
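
The point of the cache is that later sessions can skip the text parsing. The sketch below shows the round trip with the standard `pickle` module, assuming the ``.pkl`` files are ordinary pickle files (an assumption; check the ``XGCN.data.io`` source if in doubt). The local `save_pickle` and `load_pickle` functions are stand-ins defined here, not the XGCN ones.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(path, obj):
    """Stand-in helper: serialize obj to path with the stdlib pickle module."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Stand-in helper: read an object back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


cache_dir = Path(tempfile.mkdtemp())     # throwaway directory for the demo
indptr_cached = [0, 2, 3, 4]             # toy data, not the facebook graph
save_pickle(cache_dir / "indptr.pkl", indptr_cached)
restored = load_pickle(cache_dir / "indptr.pkl")
print(restored == indptr_cached)
```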

### Split validation/test set

Assume that we don't have an existing evaluation set 
and want to split off some edges for model evaluation.

In [8]:
set_random_seed(1999)

num_sample = 10_000       # number of edges to split
min_src_out_degree = 3    # guarantee the minimum out-degree of a source node
min_dst_in_degree = 1     # guarantee the minimum in-degree of a destination node

indptr, indices, pos_edges = XGCN.data.split.split_edges(
    indptr, indices,
    num_sample, min_src_out_degree, min_dst_in_degree
)
info['num_edges'] = len(indices)
print(info)

sampling edges 9999/10000 (99.99%)
num sampled edges: 10000
{'graph_type': 'homo', 'num_nodes': 4039, 'num_edges': 78234}
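
To make the two degree parameters concrete, here is a pure-Python sketch of one way such a split could work: edges are visited in random order and held out only if both endpoints stay at or above their minimum degree afterwards. This is one reading of the comments in the cell above, not the actual algorithm in ``XGCN.data.split.split_edges``, and `split_edges_sketch` is a hypothetical name.

```python
import random


def split_edges_sketch(edges, num_sample, min_src_out_degree,
                       min_dst_in_degree, seed=1999):
    """Hold out up to num_sample edges while respecting degree floors."""
    rng = random.Random(seed)
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1

    order = list(range(len(edges)))
    rng.shuffle(order)
    held_out = set()
    for i in order:
        if len(held_out) == num_sample:
            break
        u, v = edges[i]
        # Remove the edge only if neither endpoint drops below its floor.
        if out_deg[u] > min_src_out_degree and in_deg[v] > min_dst_in_degree:
            out_deg[u] -= 1
            in_deg[v] -= 1
            held_out.add(i)

    remaining = [e for i, e in enumerate(edges) if i not in held_out]
    positives = [edges[i] for i in sorted(held_out)]
    return remaining, positives


# Toy graph: every ordered pair among 5 nodes (20 directed edges).
toy_edges = [(u, v) for u in range(5) for v in range(5) if u != v]
train_edges, pos_demo = split_edges_sketch(
    toy_edges, num_sample=5, min_src_out_degree=3, min_dst_in_degree=1)
print(len(train_edges), len(pos_demo))
```

In the toy graph every node starts with out-degree 4, so a floor of 3 lets each source give up at most one edge.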


Now that we have the positive edges in ``pos_edges``, let's divide them into a 
validation set and a test set. We'll use the "whole-graph-multi-pos" evaluation method:

In [9]:
num_validation = 2000
val_edges = pos_edges[:num_validation]
test_edges = pos_edges[num_validation:]

val_set = XGCN.data.split.from_edges_to_adj_eval_set(val_edges)
test_set = XGCN.data.split.from_edges_to_adj_eval_set(test_edges)
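
Conceptually, a multi-positive evaluation set groups the held-out edges by source node, so that each source is ranked once against all of its positives. The sketch below shows that grouping in plain Python. The dictionary keys `src` and `pos_list` are assumptions made for illustration; the structure returned by ``from_edges_to_adj_eval_set`` may differ.

```python
from collections import defaultdict


def group_positives_by_source(pos_edges):
    """Group (src, dst) pairs into one list of positives per source node."""
    grouped = defaultdict(list)
    for src, dst in pos_edges:
        grouped[src].append(dst)
    src_nodes = sorted(grouped)
    # Hypothetical layout: parallel lists of sources and their positives.
    return {"src": src_nodes, "pos_list": [grouped[s] for s in src_nodes]}


demo_edges = [(0, 5), (2, 7), (0, 9), (2, 1), (4, 3)]
eval_set_demo = group_positives_by_source(demo_edges)
print(eval_set_demo)
# {'src': [0, 2, 4], 'pos_list': [[5, 9], [7, 1], [3]]}
```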

### Save the Dataset Instance

We now have a complete dataset instance. Let's save it:

In [10]:
data_root = osp.join(all_data_root, 'dataset/instance_' + dataset)
ensure_dir(data_root)

io.save_yaml(osp.join(data_root, 'info.yaml'), info)
io.save_pickle(osp.join(data_root, 'indptr.pkl'), indptr)
io.save_pickle(osp.join(data_root, 'indices.pkl'), indices)
io.save_pickle(osp.join(data_root, 'pos_edges.pkl'), pos_edges)
io.save_pickle(osp.join(data_root, 'val_set.pkl'), val_set)
io.save_pickle(osp.join(data_root, 'test_set.pkl'), test_set)

We also save ``pos_edges``, so you can later build evaluation sets for the 
"one-pos-k-neg" or "whole-graph-one-pos" method by concatenating randomly 
sampled negative nodes. 
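
As an illustration of "concatenating randomly sampled negative nodes", the sketch below turns each positive edge into a row of the form `[src, pos, neg_1, ..., neg_k]`. The row layout and the function name `add_random_negatives` are assumptions made for this example; consult the XGCN documentation for the exact format its "one-pos-k-neg" evaluator expects. The sampler only avoids the source and the positive node; it does not check whether a sampled node is a true neighbor in the graph.

```python
import random


def add_random_negatives(pos_edges, num_nodes, k, seed=1999):
    """Return rows [src, pos, neg_1..neg_k] with uniformly sampled negatives."""
    if num_nodes < 3:
        raise ValueError("need at least 3 nodes to sample a negative")
    rng = random.Random(seed)
    rows = []
    for src, pos in pos_edges:
        negs = []
        while len(negs) < k:
            cand = rng.randrange(num_nodes)
            # Skip the source and the positive; repeats among negatives are allowed.
            if cand != src and cand != pos:
                negs.append(cand)
        rows.append([src, pos] + negs)
    return rows


rows_demo = add_random_negatives([(0, 1), (2, 3)], num_nodes=10, k=3)
for row in rows_demo:
    print(row)
```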

If the steps above succeeded, your data directory will look like this: 

```
XGCN_data
└── dataset
    ├── raw_facebook
    │   ├── facebook_combined.txt
    │   └── csr
    │       ├── indices.pkl
    │       ├── indptr.pkl
    │       └── info.yaml
    └── instance_facebook
        ├── indices.pkl
        ├── indptr.pkl
        ├── info.yaml
        ├── pos_edges.pkl
        ├── test_set.pkl
        └── val_set.pkl
```


## Run GraphSAGE

In [11]:
# set the path to GraphSAGE-config.yaml in your own copy of the repository:
config_file = '/home/sxr/code/XGCN_and_data/XGCN_library/config/GraphSAGE-config.yaml'
config = io.load_yaml(config_file)
config

{'data_root': '',
 'results_root': '',
 'epochs': 200,
 'val_freq': 1,
 'key_score_metric': 'r100',
 'convergence_threshold': 20,
 'Dataset_type': 'BlockDataset',
 'num_workers': 0,
 'num_gcn_layers': 2,
 'train_num_layer_sample': '[10, 20]',
 'NodeListDataset_type': 'LinkDataset',
 'pos_sampler': 'ObservedEdges_Sampler',
 'neg_sampler': 'RandomNeg_Sampler',
 'num_neg': 1,
 'BatchSampleIndicesGenerator_type': 'SampleIndicesWithReplacement',
 'train_batch_size': 1024,
 'epoch_sample_ratio': 0.1,
 'val_method': '',
 'val_batch_size': 256,
 'file_val_set': '',
 'test_method': '',
 'test_batch_size': 256,
 'file_test_set': '',
 'model': 'GraphSAGE',
 'seed': 1999,
 'graph_device': 'cuda:0',
 'emb_table_device': 'cuda:0',
 'gnn_device': 'cuda:0',
 'out_emb_table_device': 'cuda:0',
 'forward_mode': 'sample',
 'infer_num_layer_sample': '[10, 20]',
 'from_pretrained': 0,
 'file_pretrained_emb': '',
 'freeze_emb': 0,
 'use_sparse': 0,
 'emb_dim': 64,
 'emb_init_std': 0.1,
 'emb_lr': 0.005,
 'gn

In [16]:
config['data_root'] = data_root
results_root = osp.join(all_data_root, 'model_output', dataset, 'GraphSAGE')
ensure_dir(results_root)
config['results_root'] = results_root

config['val_method'] = 'MultiPosWholeGraph_Evaluator'
config['file_val_set'] = osp.join(data_root, 'val_set.pkl')
config['test_method'] = 'MultiPosWholeGraph_Evaluator'
config['file_test_set'] = osp.join(data_root, 'test_set.pkl')

config['Dataset_type'] = 'NodeListDataset'
config['num_gcn_layers'] = 2
config['train_num_layer_sample'] = "[]"

config['forward_mode'] = 'full_graph'
config['infer_num_layer_sample'] = "[]"

config['epochs'] = 20  # for demonstration
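
Besides `epochs`, the config listing contains `key_score_metric: 'r100'` and `convergence_threshold: 20`, and the training log prints lines such as `distance_between_best_epoch: 3 threshold: 20`. A plausible reading, stated here as an assumption and not as XGCN's documented behavior, is patience-based early stopping: training ends once the key metric has gone that many validation rounds without improving. The sketch below shows the idea with the first eight validation `r100` values from the log; `run_with_patience` is a hypothetical name.

```python
def run_with_patience(scores, patience):
    """Return (best_epoch, best_score, last_epoch_run) for a score sequence."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_epoch, best_score, epoch
    return best_epoch, best_score, len(scores) - 1


r100_log = [0.059, 0.5145, 0.634, 0.6775, 0.6415, 0.6485, 0.668, 0.693]
print(run_with_patience(r100_log, patience=20))  # (7, 0.693, 7)
print(run_with_patience(r100_log, patience=2))   # (3, 0.6775, 5)
```

With a threshold of 20 and only 20 epochs, the run in this example can never stop early, which matches the log reaching epoch 20.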

In [17]:
io.save_yaml(osp.join(results_root, 'config.yaml'), config)

In [18]:
seed = config.get('seed', 1999)  # fall back to 1999 if the config has no seed
set_random_seed(seed)

In [19]:
data = {}  # containing some global data objects

model = XGCN.build_Model(config, data)

train_dl = XGCN.build_DataLoader(config, data)

val_method = XGCN.build_val_Evaluator(config, data, model)
test_method = XGCN.build_test_Evaluator(config, data, model)

trainer = XGCN.build_Trainer(config, data, model, train_dl,
                             val_method, test_method)

In [20]:
trainer.train_and_test()

val: 100%|██████████| 8/8 [00:03<00:00,  2.44it/s]


val: {'r20': 0.013000000000000001, 'r50': 0.0315, 'r100': 0.059000000000000004, 'r300': 0.1585, 'n20': 0.0035900108441710476, 'n50': 0.007235271885991097, 'n100': 0.011683212518692018, 'n300': 0.024925463270395992}
>> new best score - r100 : 0.059000000000000004
epoch 1


train: 100%|██████████| 7/7 [00:00<00:00, 84.14it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.55it/s]


val: {'r20': 0.16349999999999998, 'r50': 0.3325, 'r100': 0.5145, 'r300': 0.8135000000000001, 'n20': 0.057526278354227536, 'n50': 0.09067361905425787, 'n100': 0.1201671567633748, 'n300': 0.16068945047631858}
>> new best score - r100 : 0.5145
epoch 2


train: 100%|██████████| 7/7 [00:00<00:00, 81.30it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.29it/s]


val: {'r20': 0.211, 'r50': 0.4335, 'r100': 0.634, 'r300': 0.8775, 'n20': 0.07587603640556335, 'n50': 0.11962771809101105, 'n100': 0.15214757758378983, 'n300': 0.18534035896882417}
>> new best score - r100 : 0.634
epoch 3


train: 100%|██████████| 7/7 [00:00<00:00, 126.24it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 11.00it/s]


val: {'r20': 0.232, 'r50': 0.477, 'r100': 0.6775, 'r300': 0.916, 'n20': 0.08253836344927551, 'n50': 0.1308656261265278, 'n100': 0.1633737542703748, 'n300': 0.19593326587602497}
>> new best score - r100 : 0.6775
epoch 4


train: 100%|██████████| 7/7 [00:00<00:00, 57.54it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 13.27it/s]


val: {'r20': 0.2215, 'r50': 0.436, 'r100': 0.6415, 'r300': 0.9065000000000001, 'n20': 0.08090560722351074, 'n50': 0.1232845614105463, 'n100': 0.156492164760828, 'n300': 0.1926167269051075}
>> distance_between_best_epoch: 1 threshold: 20
epoch 5


train: 100%|██████████| 7/7 [00:00<00:00, 142.35it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 11.98it/s]


val: {'r20': 0.243, 'r50': 0.4495, 'r100': 0.6485, 'r300': 0.912, 'n20': 0.08498106802254915, 'n50': 0.12590246862918136, 'n100': 0.15823239166289568, 'n300': 0.19429039805009962}
>> distance_between_best_epoch: 2 threshold: 20
epoch 6


train: 100%|██████████| 7/7 [00:00<00:00, 59.64it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.08it/s]


val: {'r20': 0.256, 'r50': 0.46499999999999997, 'r100': 0.668, 'r300': 0.904, 'n20': 0.09241263209283351, 'n50': 0.13377996470779183, 'n100': 0.1666352772563696, 'n300': 0.19891404315456748}
>> distance_between_best_epoch: 3 threshold: 20
epoch 7


train: 100%|██████████| 7/7 [00:00<00:00, 123.35it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 11.87it/s]


val: {'r20': 0.2555, 'r50': 0.48450000000000004, 'r100': 0.6930000000000001, 'r300': 0.9315, 'n20': 0.09128188960254192, 'n50': 0.13651855089515447, 'n100': 0.1703202478811145, 'n300': 0.2028881103955209}
>> new best score - r100 : 0.6930000000000001
epoch 8


train: 100%|██████████| 7/7 [00:00<00:00, 118.75it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 10.23it/s]


val: {'r20': 0.252, 'r50': 0.497, 'r100': 0.698, 'r300': 0.93, 'n20': 0.08928789292275906, 'n50': 0.13788158389925959, 'n100': 0.17039455630630254, 'n300': 0.20219868839904667}
>> new best score - r100 : 0.698
epoch 9


train: 100%|██████████| 7/7 [00:00<00:00, 139.16it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 11.96it/s]


val: {'r20': 0.275, 'r50': 0.4945, 'r100': 0.706, 'r300': 0.925, 'n20': 0.10096251881867647, 'n50': 0.1442385149821639, 'n100': 0.1787556702271104, 'n300': 0.20875201932340862}
>> new best score - r100 : 0.706
epoch 10


train: 100%|██████████| 7/7 [00:00<00:00, 141.92it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 11.91it/s]


val: {'r20': 0.2735, 'r50': 0.4915, 'r100': 0.698, 'r300': 0.93, 'n20': 0.10016111679375171, 'n50': 0.1432922118604183, 'n100': 0.17669726830720903, 'n300': 0.20847842105105518}
>> distance_between_best_epoch: 1 threshold: 20
epoch 11


train: 100%|██████████| 7/7 [00:00<00:00, 132.05it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 11.93it/s]


val: {'r20': 0.277, 'r50': 0.487, 'r100': 0.695, 'r300': 0.929, 'n20': 0.10383798963576556, 'n50': 0.1452993296533823, 'n100': 0.17909001976251604, 'n300': 0.21108173427730798}
>> distance_between_best_epoch: 2 threshold: 20
epoch 12


train: 100%|██████████| 7/7 [00:00<00:00, 148.14it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.22it/s]


val: {'r20': 0.29950000000000004, 'r50': 0.526, 'r100': 0.734, 'r300': 0.9299999999999999, 'n20': 0.10883612235635519, 'n50': 0.15344094064831731, 'n100': 0.18705170691013334, 'n300': 0.21396440940350295}
>> new best score - r100 : 0.734
epoch 13


train: 100%|██████████| 7/7 [00:00<00:00, 126.78it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.03it/s]


val: {'r20': 0.274, 'r50': 0.4875, 'r100': 0.6885, 'r300': 0.9249999999999999, 'n20': 0.09739882365614175, 'n50': 0.13953693313896656, 'n100': 0.17207467442005872, 'n300': 0.20449494194611909}
>> distance_between_best_epoch: 1 threshold: 20
epoch 14


train: 100%|██████████| 7/7 [00:00<00:00, 33.34it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 11.63it/s]


val: {'r20': 0.252, 'r50': 0.495, 'r100': 0.6890000000000001, 'r300': 0.9245000000000001, 'n20': 0.09332643342763186, 'n50': 0.14137147800624372, 'n100': 0.1729602329060435, 'n300': 0.205282624270767}
>> distance_between_best_epoch: 2 threshold: 20
epoch 15


train: 100%|██████████| 7/7 [00:00<00:00, 130.93it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.26it/s]


val: {'r20': 0.258, 'r50': 0.4865, 'r100': 0.6864999999999999, 'r300': 0.9245, 'n20': 0.09679963165521621, 'n50': 0.14209914100915194, 'n100': 0.17449942503124477, 'n300': 0.20712741608172658}
>> distance_between_best_epoch: 3 threshold: 20
epoch 16


train: 100%|██████████| 7/7 [00:00<00:00, 30.85it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.04it/s]


val: {'r20': 0.249, 'r50': 0.46799999999999997, 'r100': 0.6775, 'r300': 0.9295, 'n20': 0.0911653943657875, 'n50': 0.13437426361441612, 'n100': 0.16825335029512645, 'n300': 0.20266326193511486}
>> distance_between_best_epoch: 4 threshold: 20
epoch 17


train: 100%|██████████| 7/7 [00:00<00:00, 139.31it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.13it/s]


val: {'r20': 0.2535, 'r50': 0.482, 'r100': 0.712, 'r300': 0.948, 'n20': 0.09300272885710001, 'n50': 0.13803596653789282, 'n100': 0.17543301368504763, 'n300': 0.20784425151348113}
>> distance_between_best_epoch: 5 threshold: 20
epoch 18


train: 100%|██████████| 7/7 [00:00<00:00, 60.89it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 12.12it/s]


val: {'r20': 0.2955, 'r50': 0.5095000000000001, 'r100': 0.7235, 'r300': 0.9404999999999999, 'n20': 0.10859645496308803, 'n50': 0.15065764221549033, 'n100': 0.1854883262515068, 'n300': 0.21523110443353652}
>> distance_between_best_epoch: 6 threshold: 20
epoch 19


train: 100%|██████████| 7/7 [00:00<00:00, 129.00it/s]
val: 100%|██████████| 8/8 [00:00<00:00, 13.05it/s]


val: {'r20': 0.2735, 'r50': 0.505, 'r100': 0.7084999999999999, 'r300': 0.9349999999999999, 'n20': 0.09955779647827148, 'n50': 0.14570224852114916, 'n100': 0.17868383438885213, 'n300': 0.20972430708259343}
>> distance_between_best_epoch: 7 threshold: 20
epoch 20


train: 100%|██████████| 7/7 [00:00<00:00, 86.56it/s]
test: 100%|██████████| 11/11 [00:01<00:00, 10.91it/s]

test: {'r20': 0.28091589798480376, 'r50': 0.5174031882777184, 'r100': 0.7274887186773148, 'r300': 0.9430090295189377, 'n20': 0.14019608510895787, 'n50': 0.2046973804199135, 'n100': 0.25180337509420503, 'n300': 0.2920429244948151, 'formatted': 'r20:0.2809 || r50:0.5174 || r100:0.7275 || r300:0.9430 || n20:0.1402 || n50:0.2047 || n100:0.2518 || n300:0.2920 || '}
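
The metric keys in the log are not defined in this example. Assuming `r20`, `r50`, ... stand for Recall@k and `n20`, `n50`, ... for NDCG@k, the sketch below shows common textbook definitions of both for a single source node with several positives. XGCN's own evaluator may differ in details such as the recall denominator or how scores are averaged across sources.

```python
import math


def recall_at_k(ranked, positives, k):
    """Fraction of the positives that appear in the top-k of the ranking."""
    hits = sum(1 for node in ranked[:k] if node in positives)
    return hits / len(positives)


def ndcg_at_k(ranked, positives, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, node in enumerate(ranked[:k]) if node in positives)
    ideal_hits = min(k, len(positives))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


ranked_demo = [7, 3, 9, 1, 4]   # candidate nodes, best score first
positives_demo = {3, 4, 8}      # held-out neighbors of one source node
print(recall_at_k(ranked_demo, positives_demo, k=3))
print(ndcg_at_k(ranked_demo, positives_demo, k=3))
```

In this toy case only node 3 is found in the top 3, so Recall@3 is 1/3, while NDCG@3 additionally penalizes the hit for sitting at rank 2 instead of rank 1.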



