# Sugrl Tutorial
#### This tutorial illustrates the use of the Sugrl algorithm from [Simple Unsupervised Graph Representation Learning](https://ojs.aaai.org/index.php/AAAI/article/view/20748), an unsupervised graph representation learning method. Sugrl uses a multiplet loss that exploits the complementary nature of structural and neighbor information to enlarge inter-class variation, and an upper bound loss that keeps the distance between positive and anchor embeddings finite to reduce intra-class variation. Because it needs neither data augmentation nor a discriminator, it produces low-dimensional embeddings efficiently.
#### The tutorial is organized as follows:
#### 1. [Preprocessing Data and Loading Configuration](InfoGraph.ipynb#L48)
#### 2. [Training the model](InfoGraph.ipynb#L100)
#### 3. [Evaluating the model](InfoGraph.ipynb#L206)
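
#### The two losses described in the introduction can be sketched in plain PyTorch. This is a simplified illustration, not the implementation in `src.methods.SUGRL`: the function name `sugrl_style_loss` and its argument names are hypothetical, the default margins and weights are borrowed from the `margin1`, `margin2`, `w_loss1`, `w_loss2` and `w_loss3` values of the Amazon Photo configuration used later in this tutorial (which weight belongs to which term is an assumption), and drawing negatives by shuffling rows is also an assumption.

```python
import torch
import torch.nn.functional as F


def sugrl_style_loss(anchor, pos_struct, pos_neigh, negative,
                     margin1=0.9, margin2=0.9,
                     w_struct=100.0, w_neigh=100.0, w_upper=1.0):
    """Simplified multiplet + upper bound loss on [num_nodes, dim] embeddings."""
    # Squared Euclidean distance between each anchor and its counterparts.
    d_struct = (anchor - pos_struct).pow(2).sum(dim=1)
    d_neigh = (anchor - pos_neigh).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)

    # Multiplet part: each positive should be closer to the anchor
    # than the negative by at least margin1.
    loss_struct = F.relu(d_struct - d_neg + margin1).mean()
    loss_neigh = F.relu(d_neigh - d_neg + margin1).mean()

    # Upper bound part: penalize negatives that are more than
    # margin1 + margin2 farther away than the structural positive.
    loss_upper = F.relu(d_neg - d_struct - margin1 - margin2).mean()

    return w_struct * loss_struct + w_neigh * loss_neigh + w_upper * loss_upper


torch.manual_seed(0)
h = torch.randn(6, 4)                # anchor embeddings (e.g. from an MLP)
h_s = h + 0.1 * torch.randn(6, 4)    # structural positives (e.g. from a GCN)
h_n = h + 0.1 * torch.randn(6, 4)    # neighbor positives
h_neg = h[torch.randperm(6)]         # negatives obtained by shuffling rows
loss = sugrl_style_loss(h, h_s, h_n, h_neg)
```

#### Under these assumptions the total is zero only when each negative lies inside the band between `margin1` and `margin1 + margin2` beyond the positives, which is how the two terms work together.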

## 1. Preprocessing Data and Loading Configuration 
#### First, we load the configuration from a YAML file and then load the dataset.
#### For ease of use, we searched for the best parameters across three datasets and provide values for which the performance of the implemented Sugrl is close to that reported in the paper.

In [3]:
import torch
print(torch.__version__)

1.12.1


In [4]:
import torch 
from src.augment import RandomMask, RandomDropEdge, RandomDropNode, AugmentSubgraph, AugmentorList
from src.methods import SugrlMLP, SugrlGCN
from src.methods import SUGRL
from src.trainer import SimpleTrainer
from torch_geometric.loader import DataLoader
import torch_geometric.transforms as T
from src.transforms import NormalizeFeatures, GCNNorm, Edge2Adj, Compose
from src.datasets import Planetoid, Amazon, WikiCS,Coauthor
from src.utils.create_data import create_masks
from src.evaluation import LogisticRegression
import yaml
from src.utils.add_adj import add_adj_t
from sklearn.impute import SimpleImputer
import os

In [5]:
# Load the configuration file (uncomment another path to switch datasets).
# config_path = './configuration/sugrl_wikics.yml'
config_path = "./configuration/sugrl_amazon.yml"
# config_path = "./configuration/sugrl_coauthor.yml"
# config_path = "./configuration/sugrl_cora.yml"
with open(config_path, 'r', encoding='utf-8') as f:  # close the file after reading
    config = yaml.safe_load(f)
print(config)
data_name = config['dataset']
torch.manual_seed(0)
# np.random.seed(config.torch_seed)
# device = torch.device("cuda:{}".format(config.gpu_idx) if torch.cuda.is_available() and config.use_cuda else "cpu")

# -------------------- Data --------------------
pre_transforms = Compose([NormalizeFeatures(ord=1), Edge2Adj(norm=GCNNorm(add_self_loops=1))])

current_folder = os.path.abspath('')
# path = os.path.join(current_folder, config.dataset.root, config.dataset.name)

if data_name=="cora":
    dataset = Planetoid(root="pyg_data", name="cora", pre_transform=pre_transforms)
elif data_name=="photo": #92.9267
    dataset = Amazon(root="pyg_data", name="photo", pre_transform=pre_transforms) 
elif data_name=="coauthor": # 92.0973
    dataset = Coauthor(root="pyg_data", name='cs', transform=pre_transforms)
elif data_name=="wikics": #82.0109
    dataset = WikiCS(root="pyg_data", transform=T.NormalizeFeatures())
    dataset = add_adj_t(dataset)
    nan_mask = torch.isnan(dataset[0].x)
    imputer = SimpleImputer()
    dataset[0].x = torch.tensor(imputer.fit_transform(dataset[0].x))


# dataset = Amazon(root="pyg_data", name="photo", pre_transform=pre_transforms)
data_loader = DataLoader(dataset)
data = dataset.data

{'dataset': 'photo', 'black_list': [4, 5, 6], 'lr': 0.01, 'out_heads': 1, 'task_type': 'Node_Transductive', 'val_interval': 1, 'num_hidden_features': 8, 'epochs': 1000, 'to_undirected_at_neg': True, 'w_loss1': 100, 'w_loss2': 100, 'w_loss3': 1, 'margin1': 0.9, 'margin2': 0.9, 'dim': 128, 'cfg': [512, 128], 'NewATop': 0, 'dropout': 0.1, 'NN': 1, 'num1': 200, 'wd': 0.0, 'weight_decay': 0.0001}
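
#### The `pre_transforms` pipeline above normalizes the node features and builds a normalized adjacency matrix. The dense sketch below shows the standard definitions these names usually refer to, assuming that `NormalizeFeatures(ord=1)` divides each feature row by its L1 norm and that `GCNNorm(add_self_loops=1)` computes D^{-1/2}(A + I)D^{-1/2}. The helper names `l1_normalize_rows` and `gcn_normalize_dense` are hypothetical, and the transforms in `src.transforms` may differ in detail (for example by working on sparse tensors).

```python
import torch


def l1_normalize_rows(x, eps=1e-12):
    """Divide every feature row by its L1 norm (all-zero rows stay zero)."""
    return x / x.abs().sum(dim=1, keepdim=True).clamp(min=eps)


def gcn_normalize_dense(edge_index, num_nodes):
    """Dense D^{-1/2} (A + I) D^{-1/2} from a [2, num_edges] edge index."""
    adj = torch.zeros(num_nodes, num_nodes)
    adj[edge_index[0], edge_index[1]] = 1.0
    adj = adj + torch.eye(num_nodes)          # add self-loops
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)   # degrees are >= 1 after self-loops
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)


# Toy graph: an undirected path 0 - 1 - 2 with two features per node.
x = torch.tensor([[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]])
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
x_norm = l1_normalize_rows(x)
adj_norm = gcn_normalize_dense(edge_index, num_nodes=3)
```

#### The toy edge index is assumed to contain no self-loops and to list both directions of every edge, so the resulting matrix is symmetric.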


## 2. Training the Model
#### In the second step, we first initialize Sugrl, which uses two encoder backbones: an MLP and a GCN.
#### You may replace either encoder with a user-defined one. Keep in mind that a custom encoder must provide a class initializer, a forward function, and a get_embs() function.
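
#### A user-defined encoder with these three pieces might look like the sketch below. The class name `MyMLPEncoder`, the `adj` argument, and the layer sizes (taken from the `cfg` and `dropout` values of the Amazon Photo configuration) are illustrative assumptions; the exact signatures that `SUGRL` expects should be checked against `SugrlMLP` and `SugrlGCN`.

```python
import torch
import torch.nn as nn


class MyMLPEncoder(nn.Module):
    """Hypothetical two-layer MLP encoder: __init__, forward and get_embs."""

    def __init__(self, in_channels, hidden_channels=512, out_channels=128, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_channels, hidden_channels),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_channels, out_channels),
        )

    def forward(self, x, adj=None):
        # `adj` is accepted but unused, so an MLP and a graph encoder
        # can be called in the same way.
        return self.net(x)

    @torch.no_grad()
    def get_embs(self, x, adj=None):
        # Inference-time embeddings: dropout disabled, no gradients tracked.
        was_training = self.training
        self.eval()
        embs = self.forward(x, adj)
        self.train(was_training)  # restore the previous mode
        return embs


encoder = MyMLPEncoder(in_channels=16)
embs = encoder.get_embs(torch.randn(5, 16))
```

#### Keeping `forward` for training and `get_embs` for evaluation separates the gradient-tracked pass from the frozen embeddings used by the downstream classifier.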

In [6]:
# ------------------- Method -----------------
encoder_1 = SugrlMLP(in_channels=data.x.shape[1])
encoder_2 = SugrlGCN(in_channels=data.x.shape[1])
method = SUGRL(encoder=[encoder_1,encoder_2],data = data, config=config,device="cuda:0")



#### We train the model by calling the trainer.train() function.

In [7]:
trainer = SimpleTrainer(method=method, data_loader=data_loader, device="cuda:0")
trainer.train()



Epoch 0: loss: 199.5476, time: 1.0656s
Epoch 1: loss: 196.8908, time: 0.3950s
Epoch 2: loss: 194.3576, time: 0.3982s
Epoch 3: loss: 191.9660, time: 0.4052s
Epoch 4: loss: 189.7411, time: 0.3905s
Epoch 5: loss: 187.6877, time: 0.3903s
Epoch 6: loss: 185.7750, time: 0.3863s
Epoch 7: loss: 184.0428, time: 0.3959s
Epoch 8: loss: 182.5275, time: 0.3971s
Epoch 9: loss: 181.3188, time: 0.3848s
Epoch 10: loss: 180.4411, time: 0.2732s
Epoch 11: loss: 179.9785, time: 0.2572s
Epoch 12: loss: 179.8774, time: 0.2544s
Epoch 13: loss: 180.0378, time: 0.2523s
Epoch 14: loss: 180.2350, time: 0.2543s
Epoch 15: loss: 180.3821, time: 0.2486s
Epoch 16: loss: 180.4006, time: 0.2488s
Epoch 17: loss: 180.3510, time: 0.2612s
Epoch 18: loss: 180.1465, time: 0.2593s
Epoch 19: loss: 179.8565, time: 0.2777s
Epoch 20: loss: 179.4822, time: 0.2602s
Epoch 21: loss: 179.0675, time: 0.2536s
Epoch 22: loss: 178.6110, time: 0.3458s
Epoch 23: loss: 178.1984, time: 0.2663s
Epoch 24: loss: 177.8410, time: 0.2584s
Epoch 25: 

## 3. Evaluating the performance of Sugrl
#### In the last step, we evaluate the performance of Sugrl. We first obtain the embeddings by calling the method.get_embs() function and then train a logistic regression classifier on them.
#### More classifiers, including SVM and RandomForest, can be found in [classifier.py](https://github.com/IDEA-ISAIL/ssl/edit/molecure/src/evaluation/classifier.py).
#### Other evaluation methods for the unsupervised setting, such as K-means clustering and similarity search, can be found in [cluster.py](https://github.com/IDEA-ISAIL/ssl/edit/molecure/src/evaluation/cluster.py) and [sim_search.py](https://github.com/IDEA-ISAIL/ssl/edit/molecure/src/evaluation/sim_search.py).
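
#### The idea behind this evaluation is a linear probe: the embeddings stay frozen, a linear classifier is trained on the labeled training nodes, and accuracy is averaged over several runs. The sketch below illustrates that protocol in plain PyTorch on synthetic embeddings. It does not reproduce `src.evaluation.LogisticRegression`; the function `linear_probe`, its defaults, and the random splits are assumptions chosen so that the example runs quickly.

```python
import torch
import torch.nn.functional as F


def linear_probe(embs, labels, train_mask, test_mask,
                 lr=0.01, weight_decay=0.0, max_iter=200):
    """Fit a linear softmax classifier on frozen embeddings; return test accuracy."""
    num_classes = int(labels.max()) + 1
    clf = torch.nn.Linear(embs.size(1), num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(max_iter):
        opt.zero_grad()
        loss = F.cross_entropy(clf(embs[train_mask]), labels[train_mask])
        loss.backward()
        opt.step()
    with torch.no_grad():
        pred = clf(embs[test_mask]).argmax(dim=1)
    return (pred == labels[test_mask]).float().mean().item()


torch.manual_seed(0)
# Synthetic stand-in for learned embeddings: two well-separated Gaussian blobs.
labels = torch.cat([torch.zeros(50), torch.ones(50)]).long()
embs = torch.randn(100, 8) + 4.0 * (labels.float().unsqueeze(1) - 0.5)

accs = []
for run in range(5):
    perm = torch.randperm(100)                      # a fresh random split per run
    train_mask = torch.zeros(100, dtype=torch.bool)
    train_mask[perm[:20]] = True
    accs.append(linear_probe(embs, labels, train_mask, ~train_mask))
accs = torch.tensor(accs)
print(f"Test: {100 * accs.mean().item():.4f} ({100 * accs.std().item():.4f})")
```

#### Reporting the mean together with the standard deviation across runs mirrors the `Val` and `Test` summary printed by the evaluator in this tutorial.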

In [8]:
# ------------------ Evaluator -------------------
data_pyg = dataset.data.to(method.device)
embs = method.get_embs(data_pyg.x, data_pyg.adj_t).detach()
lg = LogisticRegression(lr=0.001, weight_decay=0, max_iter=3000, n_run=50, device="cuda")
create_masks(data=data_pyg.cpu())
lg(embs=embs, dataset=data_pyg)

Evaluate node classification results
** Val: 83.8699 (1.5726) | Test: 84.0484 (0.7522) **
