# Merit Tutorial
#### This tutorial illustrates the use of the Merit algorithm, [Multi-Scale Contrastive Siamese Networks for Self-Supervised Graph Representation Learning](https://arxiv.org/abs/2105.05682), a self-supervised graph representation learning method. Merit trains a pair of siamese networks (an online network and a momentum-updated target network) on augmented views of the graph and contrasts node representations across views and across networks at multiple scales.
#### The tutorial is organized as follows:
#### 1. [Preprocessing Data and Loading Configuration](InfoGraph.ipynb#L48)
#### 2. [Training the model](InfoGraph.ipynb#L100)
#### 3. [Evaluating the model](InfoGraph.ipynb#L206)

## 1. Preprocessing Data and Loading Configuration 
#### First, we load the configuration from a YAML file and then load the dataset.
#### For ease of use, we searched for hyperparameters on three datasets (Amazon Photo, Coauthor CS, and WikiCS) and chose values for which the implemented Merit performs close to the results reported in the paper.
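#### Before training, it can help to sanity-check the loaded values. The sketch below is a hypothetical helper (`check_merit_config` is not part of the repository) that checks the keys shown in the configuration printed later in this section. It assumes that the drop rates and `momentum` lie in [0, 1] and that the size parameters are positive integers; those ranges are assumptions, not rules taken from the code base.

```python
# Hypothetical helper, not part of the repository: basic range checks for
# the hyperparameters that appear in the printed Merit configuration.

# Assumed to be rates, so they should lie in [0, 1].
UNIT_INTERVAL_KEYS = ("drop_edge", "drop_feat1", "drop_feat2", "momentum")
# Assumed to be layer or sample sizes, so they should be positive integers.
POSITIVE_INT_KEYS = (
    "projection_size",
    "prediction_size",
    "projection_hidden_size",
    "prediction_hidden_size",
    "sample_size",
)


def check_merit_config(config):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for key in UNIT_INTERVAL_KEYS:
        value = config.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            problems.append(f"{key} should be a number in [0, 1], got {value!r}")
    for key in POSITIVE_INT_KEYS:
        value = config.get(key)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or value <= 0:
            problems.append(f"{key} should be a positive integer, got {value!r}")
    return problems


# The values below are copied from the configuration printed in this tutorial.
example_config = {
    "dataset": "wikics", "torch_seed": 0,
    "drop_edge": 0.4, "drop_feat1": 0.4, "drop_feat2": 0.4,
    "projection_size": 512, "prediction_size": 512,
    "prediction_hidden_size": 4096, "projection_hidden_size": 4096,
    "beta": 0.6, "momentum": 0.8, "alpha": 0.05, "sample_size": 2000,
}
print(check_merit_config(example_config))  # prints []
```

#### Calling `check_merit_config(config)` right after loading the YAML file reports a mistyped or missing key before any data is downloaded.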

In [3]:
import torch

print(torch.__version__)

1.12.1


In [7]:
import os

import numpy as np
import torch
import torch_geometric.transforms as T
import yaml
from sklearn.impute import SimpleImputer
from torch_geometric.loader import DataLoader

from src.datasets import Planetoid, Amazon, WikiCS, Coauthor
from src.evaluation import LogisticRegression
from src.methods.merit import Merit, GCN
from src.trainer import SimpleTrainer
from src.transforms import NormalizeFeatures, GCNNorm, Edge2Adj, Compose
from src.utils.add_adj import add_adj_t
from src.utils.create_data import create_masks

In [8]:
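# Load the hyperparameters; the context manager closes the file afterwards.
with open("./configuration/merit.yml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)
print(config)
# Seed PyTorch from the configuration for reproducibility.
torch.manual_seed(config["torch_seed"])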
# np.random.seed(config.torch_seed)
# device = torch.device("cuda:{}".format(config.gpu_idx) if torch.cuda.is_available() and config.use_cuda else "cpu")

# -------------------- Data --------------------
pre_transforms = Compose([NormalizeFeatures(ord=1), Edge2Adj(norm=GCNNorm(add_self_loops=1))])
data_name = config['dataset']

current_folder = os.path.abspath('')
# path = os.path.join(current_folder, config.dataset.root, config.dataset.name)

if data_name=="photo": #91.4101
    dataset = Amazon(root="pyg_data", name="photo", pre_transform=pre_transforms) 
elif data_name=="coauthor": # 92.0973
    dataset = Coauthor(root="pyg_data", name='cs', transform=pre_transforms)
elif data_name=="wikics": #82.0109
    dataset = WikiCS(root="pyg_data", transform=T.NormalizeFeatures())
    dataset = add_adj_t(dataset)
    # The WikiCS features may contain NaNs; replace them with the column mean
    # (the default strategy of SimpleImputer).
    imputer = SimpleImputer()
    dataset[0].x = torch.tensor(imputer.fit_transform(dataset[0].x))

# dataset = Amazon(root="pyg_data", name="photo", pre_transform=pre_transforms)
data_loader = DataLoader(dataset)
data = dataset.data

{'dataset': 'wikics', 'torch_seed': 0, 'drop_edge': 0.4, 'drop_feat1': 0.4, 'drop_feat2': 0.4, 'projection_size': 512, 'prediction_size': 512, 'prediction_hidden_size': 4096, 'projection_hidden_size': 4096, 'beta': 0.6, 'momentum': 0.8, 'alpha': 0.05, 'sample_size': 2000}


Downloading https://github.com/pmernyei/wiki-cs-dataset/raw/master/dataset/data.json
Processing...
Done!
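#### The preprocessing pipeline above combines `NormalizeFeatures(ord=1)` with `GCNNorm(add_self_loops=1)`. The sketch below shows on a tiny dense graph what these two steps are assumed to compute: L1 row normalization of the features, and the symmetric GCN normalization D^{-1/2}(A + I)D^{-1/2} of the adjacency matrix. The function names are hypothetical and the repository's transforms may differ in detail (for example, they operate on sparse tensors).

```python
import math


def l1_normalize_rows(features):
    """Scale every feature row so that its absolute values sum to 1."""
    normalized = []
    for row in features:
        total = sum(abs(v) for v in row)
        # Leave all-zero rows unchanged to avoid dividing by zero.
        normalized.append([v / total for v in row] if total > 0 else list(row))
    return normalized


def gcn_normalize(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense, non-negative adjacency matrix."""
    n = len(adj)
    # Add a self-loop of weight 1 to every node: A + I.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees of A + I; each is at least 1 because of the self-loops.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]


# Toy example: a path graph 0 - 1 - 2 with two features per node.
toy_adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
toy_x = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]

norm_x = l1_normalize_rows(toy_x)
norm_adj = gcn_normalize(toy_adj)
print(norm_x)
print(norm_adj)
```

#### In the toy graph, node 0 has degree 2 after the self-loop is added, so its diagonal entry becomes 1/2, while the entry between nodes 0 and 1 becomes 1/sqrt(2 * 3).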


## 2. Training the Model
#### In the second step, we initialize Merit. The encoder backbone is a Graph Convolutional Network (GCN).
#### You may replace it with a user-defined encoder. Keep in mind that the encoder must provide a class initializer, a forward() function, and a get_embs() function.

In [None]:
# ------------------- Method -----------------
encoder = GCN(
    in_ft=data.x.shape[1],
    out_ft=512,
    projection_hidden_size=config["projection_hidden_size"],
    projection_size=config["projection_size"],
)
method = Merit(encoder=encoder, data=data, config=config, device="cuda:0", is_sparse=True)
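#### When swapping in a user-defined encoder, a quick duck-typing check can catch a missing method before training starts. The sketch below is only an illustration: `missing_encoder_members` and `ToyEncoder` are hypothetical names, the `(x, adj)` argument list is an assumption, and a real encoder would normally subclass `torch.nn.Module` and follow the signatures used by the repository's `GCN` class.

```python
def missing_encoder_members(encoder, required=("forward", "get_embs")):
    """Return the names of required methods that the encoder does not provide."""
    return [name for name in required if not callable(getattr(encoder, name, None))]


class ToyEncoder:
    """Stand-in that exposes the three pieces the tutorial asks for."""

    def __init__(self, in_ft, out_ft):
        # Class initialization: store the input and output feature sizes.
        self.in_ft = in_ft
        self.out_ft = out_ft

    def forward(self, x, adj):
        # Placeholder computation: a real encoder would apply graph convolutions.
        return x

    def get_embs(self, x, adj):
        # Embeddings used for evaluation; here simply the forward output.
        return self.forward(x, adj)


print(missing_encoder_members(ToyEncoder(in_ft=4, out_ft=2)))  # prints []
print(missing_encoder_members(object()))  # prints ['forward', 'get_embs']
```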

#### We train the model by calling the trainer.train() function.

In [10]:
trainer = SimpleTrainer(method=method, data_loader=data_loader, device="cuda:0")
trainer.train()

  warn('spsolve is more efficient when sparse b '


Epoch 0: loss: 7.9260, time: 891.0707s
Epoch 1: loss: 7.7113, time: 889.0046s
Epoch 2: loss: 7.6641, time: 880.7199s
Epoch 3: loss: 7.5954, time: 853.4735s
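#### The configuration contains a `momentum` value (0.8 in the printed output), which suggests that a target network is updated as a moving average of the online network during training. The sketch below shows the standard exponential-moving-average rule on plain lists of numbers. It is an assumption about how `momentum` is used; `ema_update` is a hypothetical name, and the update inside `Merit` may differ.

```python
def ema_update(target_params, online_params, momentum):
    """Return momentum * target + (1 - momentum) * online, element by element."""
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]


# With momentum 0.8, the target keeps 80% of its old value at every step.
target = [1.0, 0.0]
online = [0.0, 1.0]
updated = ema_update(target, online, momentum=0.8)
print(updated)  # approximately [0.8, 0.2]
```

#### A momentum close to 1 makes the target network change slowly, while a momentum of 0 copies the online parameters at every step.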


## 3. Evaluating the performance of Merit
#### In the last step, we evaluate the performance of Merit. We first obtain the embeddings by calling the method.get_embs() function and then evaluate them with logistic regression.
#### More classifiers, including SVM and RandomForest, can be found in [classifier.py](https://github.com/IDEA-ISAIL/ssl/edit/molecure/src/evaluation/classifier.py).
#### Other evaluation methods for the unsupervised setting, including K-means clustering and similarity search, can be found in [cluster.py](https://github.com/IDEA-ISAIL/ssl/edit/molecure/src/evaluation/cluster.py) or [sim_search.py](https://github.com/IDEA-ISAIL/ssl/edit/molecure/src/evaluation/sim_search.py).

In [None]:
# ------------------ Evaluator -------------------
data_pyg = dataset.data.to(method.device)
embs = method.get_embs(data_pyg, data_pyg.adj_t).detach()

lg = LogisticRegression(lr=0.01, weight_decay=0, max_iter=2000, n_run=50, device="cuda")
create_masks(data=data_pyg.cpu())
lg(embs=embs, dataset=data_pyg)
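#### Similarity search, mentioned above as an alternative evaluation, can be illustrated with a few lines of plain Python. The sketch below ranks nodes by the cosine similarity of their embeddings to a query node. It is not the implementation in sim_search.py; `top_k_similar` is a hypothetical name and the toy embeddings are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(embs, query_idx, k):
    """Return the indices of the k embeddings most similar to embs[query_idx]."""
    scores = [
        (cosine_similarity(embs[query_idx], emb), idx)
        for idx, emb in enumerate(embs)
        if idx != query_idx  # exclude the query itself
    ]
    # Sort by descending similarity; break ties by ascending index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:k]]


# Toy embeddings: node 1 points almost in the same direction as node 0.
toy_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_similar(toy_embs, query_idx=0, k=2))  # prints [1, 2]
```

#### If nodes that share a label tend to appear among each other's nearest neighbors, the embeddings separate the classes well even without a trained classifier.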