bugs in minibatch training #131

Closed
suxnju opened this issue Aug 16, 2022 · 2 comments

suxnju commented Aug 16, 2022

🐛 Bug

To Reproduce

An error occurs in the _mini_train_step function in trainerflow/node_classification.py when mini_batch_flag is enabled for the node_classification task with the SimpleHGN model.

import argparse
from openhgnn.experiment import Experiment

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', '-m', default='SimpleHGN', type=str, help='name of models')
    parser.add_argument('--task', '-t', default='node_classification', type=str, help='name of task')
    # link_prediction / node_classification
    parser.add_argument('--dataset', '-d', default='imdb4MAGNN', type=str, help='name of datasets')
    parser.add_argument('--gpu', '-g', default=0, type=int, help='-1 means cpu')
    parser.add_argument('--use_best_config', action='store_true', help='will load utils.best_config')
    parser.add_argument('--load_from_pretrained', action='store_true', help='load model from the checkpoint')
    args = parser.parse_args()

    experiment = Experiment(model=args.model, dataset=args.dataset, task=args.task, gpu=args.gpu,
                            use_best_config=args.use_best_config, load_from_pretrained=args.load_from_pretrained,
                            mini_batch_flag=True, batch_size=64)
    experiment.run()

Expected behavior

Minibatch training on a large heterograph

Environment

  • torch==1.12.1
  • dgl-cu113==0.9.0 # for CUDA support
  • openhgnn==0.3.0
  • Linux
  • Python 3.8.13

Additional context

  • The default mini-batch sampler is MultiLayerFullNeighborSampler.
  • blocks is a list (line 164), but the forward function of the model (e.g. SimpleHGN) expects a single heterograph hg (line 159).

In _mini_train_step (trainerflow/node_classification.py):

for i, (input_nodes, seeds, blocks) in enumerate(loader_tqdm):
    blocks = [blk.to(self.device) for blk in blocks]
    ...
    logits = self.model(blocks, emb)[self.category]

In the forward function of the model (e.g. SimpleHGN):

def forward(self, hg, h_dict):
    with hg.local_scope():
        hg.ndata['h'] = h_dict
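
The mismatch between the two snippets can be reproduced without DGL. The sketch below uses a hypothetical stand-in class, FakeHeteroGraph, in place of a real DGL heterograph; only the call pattern (a list passed where a single graph with local_scope() is expected) mirrors the code in this report. The exact error raised inside OpenHGNN is not quoted here, so the AttributeError is an assumption about how the failure surfaces.

```python
from contextlib import contextmanager


class FakeHeteroGraph:
    """Hypothetical stand-in for a DGL heterograph (not the real class)."""

    def __init__(self):
        self.ndata = {}

    @contextmanager
    def local_scope(self):
        # Mimics the idea of a local scope: feature changes are discarded on exit.
        saved = dict(self.ndata)
        try:
            yield
        finally:
            self.ndata = saved


def forward(hg, h_dict):
    """Same call pattern as the model's forward: expects ONE graph."""
    with hg.local_scope():
        hg.ndata['h'] = h_dict
        return hg.ndata['h']


def try_forward(graph_or_blocks, h_dict):
    """Return the output, or the name of the exception that was raised."""
    try:
        return forward(graph_or_blocks, h_dict)
    except AttributeError as exc:
        return type(exc).__name__


# A single graph works; a list of blocks has no local_scope() method.
full_graph_result = try_forward(FakeHeteroGraph(), {'movie': [1.0]})
blocks_result = try_forward([FakeHeteroGraph(), FakeHeteroGraph()], {'movie': [1.0]})
```

A Python list has no local_scope attribute, so any model whose forward opens with hg.local_scope() will fail as soon as it receives the list of blocks produced by the sampler.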

dddg617 (Collaborator) commented Aug 17, 2022

Currently, many models do not support mini-batch training; we are working on fixing this. You can refer to RGCN.py as an example of how to support mini-batch training. However, models like SimpleHGN, HGT, and HetSANN may be more difficult, because they need dgl.to_homogeneous. As far as I know, this API has bugs when used with mini-batches, and we are reporting this to the DGL team.
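
The general shape of that kind of mini-batch support can be sketched as follows. This is a toy illustration, not the code in RGCN.py: ToyGraph, ToyLayer, ToyModel, and the ntypes check are assumptions, and plain Python objects stand in for DGL graphs and blocks. The idea is that forward accepts either one full graph (reused by every layer) or a list of blocks (one per layer, as produced by a multi-layer neighbor sampler).

```python
class ToyGraph:
    """Hypothetical stand-in for a heterograph or a sampled block."""
    ntypes = ['movie']

    def __init__(self, name):
        self.name = name


class ToyLayer:
    """Hypothetical layer that only records which graph it was applied to."""

    def __init__(self, index):
        self.index = index

    def __call__(self, graph, h):
        return h + [(self.index, graph.name)]


class ToyModel:
    def __init__(self, num_layers):
        self.layers = [ToyLayer(i) for i in range(num_layers)]

    def forward(self, hg, h):
        if hasattr(hg, 'ntypes'):
            # Full-graph training: every layer sees the same graph.
            for layer in self.layers:
                h = layer(hg, h)
        else:
            # Mini-batch training: hg is a list with one block per layer.
            if len(hg) != len(self.layers):
                raise ValueError('expected one block per layer')
            for layer, block in zip(self.layers, hg):
                h = layer(block, h)
        return h


model = ToyModel(num_layers=2)
full = model.forward(ToyGraph('g'), [])
mini = model.forward([ToyGraph('block0'), ToyGraph('block1')], [])
```

With this dispatch the trainer can pass either input unchanged. Models that first convert the whole heterograph to a homogeneous one do not fit this pattern as easily, which matches the difficulty described above.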

suxnju (Author) commented Sep 2, 2022

Thank you for your reply. I eventually solved the mini-batch training problem for my scenario: in short, I use dgl.dataloading.GraphDataLoader, because my dataset consists of many small graphs.
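
For readers unfamiliar with graph-level batching, the idea behind loading many small graphs per batch can be sketched in plain Python. This is a conceptual stand-in, not DGL's implementation and not the author's code: SmallGraph, batch_graphs, and iter_batches are hypothetical names. Each batch merges several small graphs into one larger graph with disjoint components by offsetting node IDs.

```python
from typing import List, NamedTuple, Tuple


class SmallGraph(NamedTuple):
    num_nodes: int
    edges: List[Tuple[int, int]]  # (src, dst) pairs with local node IDs


def batch_graphs(graphs: List[SmallGraph]) -> SmallGraph:
    """Merge graphs into one graph made of disjoint components."""
    offset = 0
    merged_edges = []
    for g in graphs:
        # Shift local node IDs so they do not collide across graphs.
        merged_edges.extend((src + offset, dst + offset) for src, dst in g.edges)
        offset += g.num_nodes
    return SmallGraph(num_nodes=offset, edges=merged_edges)


def iter_batches(dataset: List[SmallGraph], batch_size: int):
    """Yield one merged graph per group of batch_size small graphs."""
    for start in range(0, len(dataset), batch_size):
        yield batch_graphs(dataset[start:start + batch_size])


dataset = [
    SmallGraph(2, [(0, 1)]),
    SmallGraph(3, [(0, 1), (1, 2)]),
    SmallGraph(2, [(1, 0)]),
]
batches = list(iter_batches(dataset, batch_size=2))
```

Because every batch is itself one ordinary graph, a model whose forward expects a single graph can consume it directly, which avoids the list-of-blocks mismatch that node-level neighbor sampling produces.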

However, I found a strange little problem: the process does not shut down properly.

To Reproduce

from openhgnn.config import Config

config = Config(file_path="./model/config.ini", model="SimpleHGN", dataset="imdb4MAGNN", task="node_classification", gpu=2)

print(config)

The console output is

[Config Info]   Model: SimpleHGN,       Task: node_classification,      Dataset: imdb4MAGNN

However, the program does not exit. The file ./model/config.ini is copied from openhgnn/config.ini.

By contrast, the following code exits normally.

import configparser
import numpy as np
import torch as th
import sys

print(111)
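
One way to investigate a script that prints its output but never exits is to list the threads that are still alive at the end of the script, since the Python interpreter waits for non-daemon threads before shutting down. This is a generic diagnostic sketch; whether a lingering thread is actually the cause here is not established in this thread, and lingering_threads is a hypothetical helper name.

```python
import threading


def lingering_threads():
    """Return names of live non-daemon threads other than the main thread."""
    main = threading.main_thread()
    return [
        t.name
        for t in threading.enumerate()
        if t is not main and t.is_alive() and not t.daemon
    ]


# Demo: a non-daemon worker blocked on an event keeps the process alive.
stop = threading.Event()
worker = threading.Thread(target=stop.wait, name='demo-worker')
worker.start()
while_running = lingering_threads()

stop.set()      # let the worker finish
worker.join()
after_join = lingering_threads()
```

Calling lingering_threads() as the last line of the script that hangs, after print(config), would show whether any such thread exists.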

suxnju closed this as completed Sep 11, 2022