# Example: facebook

## Prepare the Dataset Instance

Let's begin with a small social network dataset: facebook.
The data is included in the XGCN repository: ``data/raw_facebook``. 
You can also download it from SNAP (decompress the archive to obtain ``facebook_combined.txt``): http://snap.stanford.edu/data/facebook_combined.txt.gz .

We recommend arranging the data in a clear directory structure. 
To get started, you can manually set up an ``XGCN_data`` directory (or any other name you like) as follows. 
(We recommend placing ``XGCN_data`` somewhere outside this repository.)

```
XGCN_data
└── dataset
    └── raw_facebook
        └── facebook_combined.txt
```

From now on, we'll use this directory to hold all the datasets 
and model outputs. 
We refer to its path as ``all_data_root`` in our Python code and shell scripts. 
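
The layout above can also be created programmatically. The following is a minimal sketch using only the standard library; ``make_data_root`` is a hypothetical helper (not part of XGCN), and the temporary directory stands in for wherever you choose to keep ``XGCN_data``.

```python
import os
import os.path as osp
import tempfile

def make_data_root(all_data_root, dataset='facebook'):
    """Create <all_data_root>/dataset/raw_<dataset> and return its path."""
    raw_data_root = osp.join(all_data_root, 'dataset', 'raw_' + dataset)
    os.makedirs(raw_data_root, exist_ok=True)  # no error if it already exists
    return raw_data_root

# Demonstrated in a temporary directory; replace it with your own location.
demo_root = osp.join(tempfile.mkdtemp(), 'XGCN_data')
raw_dir = make_data_root(demo_root)
print(raw_dir)
```

After running it, copy ``facebook_combined.txt`` into the printed directory.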


In [1]:
import XGCN
from XGCN.data import io, csr
from XGCN.utils.utils import ensure_dir, set_random_seed

import os.path as osp

In [2]:
# set your own all_data_root:
all_data_root = '/home/wuyao/songxiran/code/XGCN_coda_and_data/XGCN_data'

### Load & save the raw graph

In [3]:
dataset = 'facebook'
raw_data_root = osp.join(all_data_root, 'dataset/raw_' + dataset)
file_raw_graph = osp.join(raw_data_root, 'facebook_combined.txt')
E_src, E_dst = io.load_txt_edges(file_raw_graph)
print(E_src)
print(E_dst)

[   0    0    0 ... 4027 4027 4031]
[   1    2    3 ... 4032 4038 4038]
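
For readers curious about what loading an edge list involves, here is a rough standard-library sketch. It assumes one whitespace-separated ``src dst`` pair per line, which is the layout of ``facebook_combined.txt``. ``read_edge_list`` is a hypothetical helper and is not how ``io.load_txt_edges`` is necessarily implemented; in particular it returns plain lists rather than arrays.

```python
import io as std_io  # aliased so it does not shadow XGCN's ``io`` module

def read_edge_list(lines):
    """Parse lines of the form 'src dst' into two parallel lists of ints."""
    src, dst = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment lines
        u, v = line.split()[:2]
        src.append(int(u))
        dst.append(int(v))
    return src, dst

# A tiny in-memory stand-in for the real file.
sample = std_io.StringIO("0 1\n0 2\n0 3\n1 2\n")
demo_src, demo_dst = read_edge_list(sample)
print(demo_src)  # [0, 0, 0, 1]
print(demo_dst)  # [1, 2, 3, 2]
```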


In [4]:
info, indptr, indices = csr.from_edges_to_csr_with_info(
    E_src, E_dst, graph_type='homo'
)
print(info)

# from_edges_to_csr ...
# remove_repeated_edges ...
## 0 edges are removed
{'graph_type': 'homo', 'num_nodes': 4039, 'num_edges': 88234}
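
The CSR (compressed sparse row) format stores a graph as two arrays: ``indptr`` holds row offsets and ``indices`` holds the concatenated neighbor lists, so the neighbors of node ``u`` are ``indices[indptr[u]:indptr[u + 1]]``. The sketch below illustrates the idea with plain lists; ``edges_to_csr`` is a hypothetical helper, and XGCN's own routine may differ in details such as array types and ordering.

```python
def edges_to_csr(src, dst, num_nodes):
    """Build CSR arrays (indptr, indices) from parallel src/dst lists."""
    # Drop repeated edges, then sort by (src, dst).
    edges = sorted(set(zip(src, dst)))
    indptr = [0] * (num_nodes + 1)
    for u, _ in edges:
        indptr[u + 1] += 1          # count the out-degree of u
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]  # prefix sum turns counts into offsets
    indices = [v for _, v in edges]
    return indptr, indices

# The edge (0, 1) appears twice and is stored only once.
demo_indptr, demo_indices = edges_to_csr(
    [0, 0, 0, 1, 0], [1, 2, 3, 2, 1], num_nodes=4
)
print(demo_indptr)   # [0, 3, 4, 4, 4]
print(demo_indices)  # [1, 2, 3, 2]
```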


Loading a large graph from a text file can be time-consuming (although the facebook graph here is small), so we cache the CSR arrays with ``io.save_pickle``: 

In [5]:
raw_csr_root = osp.join(raw_data_root, 'csr')
ensure_dir(raw_csr_root)  # mkdir if not exists

io.save_yaml(osp.join(raw_csr_root, 'info.yaml'), info)
io.save_pickle(osp.join(raw_csr_root, 'indptr.pkl'), indptr)
io.save_pickle(osp.join(raw_csr_root, 'indices.pkl'), indices)
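
As an illustration of the caching idea, the following sketch does a pickle round trip with the standard library. ``demo_save_pickle`` and ``demo_load_pickle`` are hypothetical stand-ins written for this example; they are not XGCN's implementation.

```python
import os.path as osp
import pickle
import tempfile

def demo_save_pickle(filename, obj):
    """Serialize obj to filename in binary pickle format."""
    with open(filename, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def demo_load_pickle(filename):
    """Load a previously pickled object."""
    with open(filename, 'rb') as f:
        return pickle.load(f)

cache_file = osp.join(tempfile.mkdtemp(), 'indptr.pkl')
demo_save_pickle(cache_file, [0, 3, 4, 4, 4])
restored = demo_load_pickle(cache_file)
print(restored)  # [0, 3, 4, 4, 4]
```

Reading the cached binary file back is typically much faster than re-parsing the text file, which matters more as the graph grows.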

### Split validation/test set

Assume that we don't have an existing evaluation set 
and want to hold out some edges for model evaluation.

In [6]:
set_random_seed(1999)

num_sample = 10_000       # number of edges to split
min_src_out_degree = 3    # guarantee the minimum out-degree of a source node
min_dst_in_degree = 1     # guarantee the minimum in-degree of a destination node

indptr, indices, pos_edges = XGCN.data.split.split_edges(
    indptr, indices,
    num_sample, min_src_out_degree, min_dst_in_degree
)
info['num_edges'] = len(indices)
print(info)

sampling edges 9999/10000 (99.99%)
num sampled edges: 10000
{'graph_type': 'homo', 'num_nodes': 4039, 'num_edges': 78234}
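
To make the role of the degree thresholds concrete, here is a simplified sketch of degree-constrained edge sampling on a plain edge list. It assumes the thresholds apply to the graph that remains after removal; that reading, and the helper ``sample_eval_edges`` itself, are assumptions made for illustration and not a description of how ``split_edges`` works internally.

```python
import random

def sample_eval_edges(edges, num_sample, min_src_out_degree,
                      min_dst_in_degree, seed=1999):
    """Hold out up to num_sample edges without dropping any node below
    the given out-/in-degree in the remaining graph."""
    rng = random.Random(seed)
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1

    order = list(range(len(edges)))
    rng.shuffle(order)  # visit candidate edges in random order
    held_out = set()
    for i in order:
        if len(held_out) == num_sample:
            break
        u, v = edges[i]
        # Removing (u, v) must leave u and v with enough remaining edges.
        if (out_deg[u] - 1 >= min_src_out_degree
                and in_deg[v] - 1 >= min_dst_in_degree):
            out_deg[u] -= 1
            in_deg[v] -= 1
            held_out.add(i)

    train = [e for i, e in enumerate(edges) if i not in held_out]
    pos = [edges[i] for i in sorted(held_out)]
    return train, pos

demo_edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 1)]
demo_train, demo_pos = sample_eval_edges(demo_edges, 2, 3, 1)
print(len(demo_train), demo_pos)
```

In this toy graph only node 0 has enough outgoing edges to give one up, so a single edge is held out even though two were requested.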


Now that we have all the positive edges in ``pos_edges``, let's divide them into a 
validation set and a test set. We'll use the "whole-graph-multi-pos" evaluation method:

In [7]:
num_validation = 2000
val_edges = pos_edges[:num_validation]
test_edges = pos_edges[num_validation:]

val_set = XGCN.data.split.from_edges_to_adj_eval_set(val_edges)
test_set = XGCN.data.split.from_edges_to_adj_eval_set(test_edges)
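
In a multi-positive setting, each source node is evaluated against all of its held-out targets at once, so the edges need to be grouped by source. The sketch below shows that grouping on a plain edge list. The helper name and the ``(src, pos_list)`` return layout are assumptions for illustration; the actual structure returned by ``from_edges_to_adj_eval_set`` may differ.

```python
def group_positives_by_source(edges):
    """Map each source node to the list of its held-out positive targets."""
    grouped = {}
    for u, v in edges:
        grouped.setdefault(u, []).append(v)
    src = sorted(grouped)
    pos_list = [grouped[u] for u in src]
    return src, pos_list

demo_src_nodes, demo_pos_list = group_positives_by_source(
    [(0, 1), (2, 5), (0, 3)]
)
print(demo_src_nodes)  # [0, 2]
print(demo_pos_list)   # [[1, 3], [5]]
```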

### Save the Dataset Instance

We have now generated a complete dataset instance. Let's save it:

In [8]:
data_root = osp.join(all_data_root, 'dataset/instance_' + dataset)
ensure_dir(data_root)

io.save_yaml(osp.join(data_root, 'info.yaml'), info)
io.save_pickle(osp.join(data_root, 'indptr.pkl'), indptr)
io.save_pickle(osp.join(data_root, 'indices.pkl'), indices)
io.save_pickle(osp.join(data_root, 'pos_edges.pkl'), pos_edges)
io.save_pickle(osp.join(data_root, 'val_set.pkl'), val_set)
io.save_pickle(osp.join(data_root, 'test_set.pkl'), test_set)

Here we also save the ``pos_edges``, so you can use it to make evaluation sets for 
"one-pos-k-neg" or "whole-graph-one-pos" method by concatenating some randomly 
sampled negative nodes. 
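
A rough sketch of that concatenation is shown below. It assumes each evaluation row has the layout ``[src, pos, neg_1, ..., neg_k]``; this layout and the helper ``add_random_negatives`` are assumptions for illustration, so check the format the evaluator expects before using it.

```python
import random

def add_random_negatives(pos_edges, num_nodes, k, seed=1999):
    """Return rows [src, pos, neg_1, ..., neg_k] with uniformly sampled negatives."""
    rng = random.Random(seed)
    rows = []
    for u, v in pos_edges:
        # Uniform sampling over all nodes; a sampled node may occasionally
        # be a true neighbor of u, which is usually tolerated on sparse graphs.
        negs = [rng.randrange(num_nodes) for _ in range(k)]
        rows.append([u, v] + negs)
    return rows

demo_rows = add_random_negatives([(0, 1), (2, 5)], num_nodes=10, k=3)
for row in demo_rows:
    print(row)
```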

If you have completed the above steps successfully, your data directory should look like this: 

```
XGCN_data
└── dataset
    ├── raw_facebook
    |   ├── facebook_combined.txt
    |   └── csr
    |       ├── indices.pkl
    |       ├── indptr.pkl
    |       └── info.yaml
    └── instance_facebook
        ├── indices.pkl
        ├── indptr.pkl
        ├── info.yaml
        ├── pos_edges.pkl
        ├── test_set.pkl
        └── val_set.pkl
```


## Run GraphSAGE

In [9]:
config_file = '/home/wuyao/songxiran/code/XGCN_coda_and_data/XGCN_library/config/GraphSAGE-config.yaml'
config = io.load_yaml(config_file)
config

{'data_root': '',
 'results_root': '',
 'epochs': 200,
 'val_freq': 1,
 'key_score_metric': 'r100',
 'convergence_threshold': 20,
 'Dataset_type': 'BlockDataset',
 'num_workers': 0,
 'num_gcn_layers': 2,
 'train_num_layer_sample': '[10, 20]',
 'NodeListDataset_type': 'LinkDataset',
 'pos_sampler': 'ObservedEdges_Sampler',
 'neg_sampler': 'RandomNeg_Sampler',
 'num_neg': 1,
 'BatchSampleIndicesGenerator_type': 'SampleIndicesWithReplacement',
 'train_batch_size': 1024,
 'epoch_sample_ratio': 0.1,
 'val_evaluator': '',
 'val_batch_size': 256,
 'file_val_set': '',
 'test_evaluator': '',
 'test_batch_size': 256,
 'file_test_set': '',
 'model': 'GraphSAGE',
 'seed': 1999,
 'graph_device': 'cuda:0',
 'emb_table_device': 'cuda:0',
 'gnn_device': 'cuda:0',
 'out_emb_table_device': 'cuda:0',
 'forward_mode': 'sample',
 'infer_num_layer_sample': '[10, 20]',
 'from_pretrained': 0,
 'file_pretrained_emb': '',
 'freeze_emb': 0,
 'use_sparse': 0,
 'emb_dim': 64,
 'emb_init_std': 0.1,
 'emb_lr': 0.005

In [10]:
config['data_root'] = data_root
results_root = osp.join(all_data_root, 'model_output', dataset, 'GraphSAGE')
ensure_dir(results_root)
config['results_root'] = results_root

config['val_evaluator'] = 'WholeGraph_MultiPos_Evaluator'
config['file_val_set'] = osp.join(data_root, 'val_set.pkl')
config['test_evaluator'] = 'WholeGraph_MultiPos_Evaluator'
config['file_test_set'] = osp.join(data_root, 'test_set.pkl')

config['Dataset_type'] = 'NodeListDataset'
config['num_gcn_layers'] = 2
config['train_num_layer_sample'] = "[]"

config['forward_mode'] = 'full_graph'
config['infer_num_layer_sample'] = "[]"

config['epochs'] = 5  # for demonstration

In [11]:
io.save_yaml(osp.join(results_root, 'config.yaml'), config)

In [12]:
seed = config.get('seed', 1999)  # fall back to 1999 if no seed is configured
set_random_seed(seed)

In [13]:
data = {}  # containing some global data objects

model = XGCN.build_Model(config, data)

train_dl = XGCN.build_DataLoader(config, data)

val_evaluator = XGCN.build_val_Evaluator(config, data, model)
test_evaluator = XGCN.build_test_Evaluator(config, data, model)

trainer = XGCN.build_Trainer(config, data, model, train_dl,
                                val_evaluator, test_evaluator)

In [14]:
trainer.train_and_test()

val: 100%|██████████| 8/8 [00:05<00:00,  1.37it/s]


val: {'r20': 0.008, 'r50': 0.029000000000000005, 'r100': 0.063, 'r300': 0.172, 'n20': 0.0022217263951897622, 'n50': 0.006249060936272145, 'n100': 0.011755600214004517, 'n300': 0.026065643906593324}
>> new best score - r100 : 0.063
train epoch 1


100%|██████████| 7/7 [00:00<00:00, 23.91it/s]
val: 100%|██████████| 8/8 [00:01<00:00,  5.42it/s]


val: {'r20': 0.182, 'r50': 0.3625, 'r100': 0.5435000000000001, 'r300': 0.8225, 'n20': 0.061383630290627475, 'n50': 0.09682544178515673, 'n100': 0.12624080485105516, 'n300': 0.16416783578693867}
>> new best score - r100 : 0.5435000000000001
train epoch 2


100%|██████████| 7/7 [00:00<00:00, 35.16it/s]
val: 100%|██████████| 8/8 [00:01<00:00,  6.08it/s]


val: {'r20': 0.201, 'r50': 0.3955, 'r100': 0.6055, 'r300': 0.8665, 'n20': 0.07034942269325256, 'n50': 0.10879766258597375, 'n100': 0.14273371067643165, 'n300': 0.17818819119036197}
>> new best score - r100 : 0.6055
train epoch 3


100%|██████████| 7/7 [00:00<00:00, 26.37it/s]
val: 100%|██████████| 8/8 [00:01<00:00,  5.48it/s]


val: {'r20': 0.23049999999999998, 'r50': 0.43400000000000005, 'r100': 0.6325000000000001, 'r300': 0.895, 'n20': 0.08142697764933109, 'n50': 0.12165832801163198, 'n100': 0.1537973643690348, 'n300': 0.1896073479913175}
>> new best score - r100 : 0.6325000000000001
train epoch 4


100%|██████████| 7/7 [00:00<00:00, 27.83it/s]
val: 100%|██████████| 8/8 [00:01<00:00,  5.42it/s]


val: {'r20': 0.2625, 'r50': 0.47, 'r100': 0.6759999999999999, 'r300': 0.9075, 'n20': 0.09483492197841406, 'n50': 0.13604466869682075, 'n100': 0.16931749348342418, 'n300': 0.20104179226234553}
>> new best score - r100 : 0.6759999999999999
train epoch 5


100%|██████████| 7/7 [00:00<00:00, 33.79it/s]
test: 100%|██████████| 11/11 [00:01<00:00,  5.70it/s]

test: {'r20': 0.2606913787674581, 'r50': 0.48381444972196996, 'r100': 0.678367787569825, 'r300': 0.921091709989611, 'n20': 0.12727378382057442, 'n50': 0.18800182983608477, 'n100': 0.23156760687158523, 'n300': 0.276839482918626, 'formatted': 'r20:0.2607 || r50:0.4838 || r100:0.6784 || r300:0.9211 || n20:0.1273 || n50:0.1880 || n100:0.2316 || n300:0.2768 || '}
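
The keys in the result dictionaries appear to follow a ``<metric><k>`` pattern. Assuming ``r`` stands for Recall@k and ``n`` for NDCG@k (an assumption, since the log itself does not spell them out), the sketch below shows how these two metrics are commonly computed for a single source node with several positives. The exact definitions used by the evaluator may differ.

```python
import math

def recall_at_k(ranked, positives, k):
    """Fraction of the positives that appear in the top-k of ``ranked``."""
    hits = sum(1 for node in ranked[:k] if node in positives)
    return hits / len(positives)

def ndcg_at_k(ranked, positives, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, node in enumerate(ranked[:k]) if node in positives)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(positives))))
    return dcg / ideal

ranked_nodes = [7, 3, 9, 1, 4]  # candidates sorted by descending score
true_positives = {3, 4, 8}      # held-out targets of one source node

demo_recall = recall_at_k(ranked_nodes, true_positives, k=3)
demo_ndcg = ndcg_at_k(ranked_nodes, true_positives, k=3)
print(demo_recall, demo_ndcg)
```

Here only one of the three positives (node 3) lands in the top 3, so Recall@3 is one third. Dataset-level scores would then be averages of these per-node values.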



