Error running my own data with Graph4Rec on AI Studio #435

Closed
zouhan6806504 opened this issue Jul 15, 2022 · 4 comments

Comments

@zouhan6806504

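I generated the input files in the required data format and set up two IPs. The environment is AI Studio with 32 GB of GPU memory, and PGL was installed with pip install pgl.
The configuration file is as follows: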

# configuration for multi-metapath2vec

task_name: metapath2vec.0712

# ------------------------Data Configuration--------------------------------------------#
etype2files: "item2other:/home/aistudio/data/edges/item_other.txt,other2item:/home/aistudio/data/edges/other_item.txt,other2other:/home/aistudio/data/edges/other_other.txt"
ntype2files: "item:/home/aistudio/data/nodes/item_other_types.txt,other:/home/aistudio/data/nodes/item_other_types.txt"
symmetry: False
shard_num: 1000
# [ntype, name, feat_type, length]
nfeat_info: null
slots: []

meta_path: "item2other-other2item;item2other-other2other-other2item;other2item-item2other-other2item;other2item-item2other-other2other-other2item"


walk_len: 24
win_size: 3
neg_num: 10
walk_times: 10


# -----------------Model HyperParams Configuration--------------------------------------#
dataset_type: WalkBasedDataset
collatefn: CollateFn
model_type: WalkBasedModel
warm_start_from: null
num_nodes: 13806619
embed_size: 128
hidden_size: 128

# ----------------------Training Configuration------------------------------------------#
epochs: 100
num_workers: 1
lr: 0.001
lazy_mode: True
batch_node_size: 20
batch_pair_size: 100
pair_stream_shuffle_size: 10000
log_dir: /home/aistudio/logs_custom
output_dir: /home/aistudio/outputs_custom
save_dir: /home/aistudio/ckpt_custom
files2saved: ["*.yaml", "*.py", "*.sh", "./models", "./datasets", "./utils"]
log_steps: 100

# -------------Distributed CPU Training Configuration-----------------------------------#

# if you want to save model per epoch, then save_steps will be set by below equation
# save_steps = num_nodes * walk_len * win_size * walk_times / batch_pair_size / worker_num
# but the equation is not very precise since the neighbors of each node is not the same.
save_steps: 100000
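
The comment in the configuration above gives a rough formula for the number of steps in one epoch. As a quick illustration, the sketch below plugs the values from this configuration into that formula. The function name `estimate_save_steps` is made up for this example, `worker_num=1` is an assumption (the thread does not state it), and integer division is used where the comment writes `/`. As the comment itself warns, the result is only approximate.

```python
def estimate_save_steps(num_nodes, walk_len, win_size, walk_times,
                        batch_pair_size, worker_num):
    """Approximate steps per epoch, following the formula in the config comment.

    Hypothetical helper for illustration; it is not part of Graph4Rec.
    """
    total_pairs = num_nodes * walk_len * win_size * walk_times
    return total_pairs // batch_pair_size // worker_num


# Values copied from the configuration above; worker_num=1 is an assumption.
steps = estimate_save_steps(
    num_nodes=13806619, walk_len=24, win_size=3, walk_times=10,
    batch_pair_size=100, worker_num=1)
print(steps)  # 99407656
```

By this estimate one epoch is roughly 99.4 million steps, so `save_steps: 100000` would save a checkpoint many times per epoch rather than once.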

Launch command:
!python PGL-main/apps/Graph4Rec/env_run/src/train.py --config /home/aistudio/metapath2vec.yaml --ip /home/aistudio/ip_list.txt

The error output is as follows:

backup ./metapath2vec.yaml to /home/aistudio/logs_custom/metapath2vec.0712
[INFO] 2022-07-15 23:20:03,813 [    train.py:  134]:	=========================================================================
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	task_name: metapath2vec.0712
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	etype2files: item2other:/home/aistudio/data/edges/item_other.txt,other2item:/home/aistudio/data/edges/other_item.txt,other2other:/home/aistudio/data/edges/other_other.txt
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	ntype2files: item:/home/aistudio/data/nodes/item_other_types.txt,other:/home/aistudio/data/nodes/item_other_types.txt
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	symmetry: False
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	shard_num: 1000
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	nfeat_info: None
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	slots: []
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	meta_path: item2other-other2item;item2other-other2other-other2item;other2item-item2other-other2item;other2item-item2other-other2other-other2item
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	walk_len: 24
[INFO] 2022-07-15 23:20:03,813 [    train.py:  137]:	win_size: 3
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	neg_num: 10
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	walk_times: 10
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	dataset_type: WalkBasedDataset
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	collatefn: CollateFn
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	model_type: WalkBasedModel
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	warm_start_from: None
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	num_nodes: 13806619
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	embed_size: 128
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	hidden_size: 128
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	epochs: 100
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	num_workers: 1
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	lr: 0.001
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	lazy_mode: True
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	batch_node_size: 20
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	batch_pair_size: 100
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	pair_stream_shuffle_size: 10000
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	log_dir: /home/aistudio/logs_custom/metapath2vec.0712
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	output_dir: /home/aistudio/outputs_custom/metapath2vec.0712
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	save_dir: /home/aistudio/ckpt_custom/metapath2vec.0712
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	files2saved: ['*.yaml', '*.py', '*.sh', './models', './datasets', './utils']
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	log_steps: 100
[INFO] 2022-07-15 23:20:03,814 [    train.py:  137]:	save_steps: 100000
[INFO] 2022-07-15 23:20:03,814 [    train.py:  139]:	=========================================================================
W0715 23:20:03.816249  2101 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0715 23:20:03.820690  2101 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
[INFO] 2022-07-15 23:20:07,897 [    train.py:  107]:	starting training...
[INFO] 2022-07-15 23:20:07,899 [  dataset.py:   83]:	gpu train data generator
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/pgl/distributed/helper.py:60: UserWarning: node_batch_stream_shuffle_size attribute is not existed, return None
  warnings.warn("%s attribute is not existed, return None" % attr)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/pgl/distributed/dist_graph.py:180: UserWarning: node_batch_stream_shuffle_size is not specified, default value is 20000
  warnings.warn("node_batch_stream_shuffle_size is not specified, "
I0715 23:20:07.900080  2140 graph_py_service.cc:102] start to build server
I0715 23:20:07.900151  2140 graph_py_service.cc:112] build server done
/home/aistudio/PGL-main/apps/Graph4Rec/env_run/src/utils/config.py:83: UserWarning: sample_num_list attribute is not existed, return None
  warnings.warn("%s attribute is not existed, return None" % attr)
[INFO] 2022-07-15 23:20:07,911 [ego_graph.py:  198]:	sample_num_list is None
/home/aistudio/PGL-main/apps/Graph4Rec/env_run/src/utils/config.py:83: UserWarning: sage_mode attribute is not existed, return None
  warnings.warn("%s attribute is not existed, return None" % attr)
Traceback (most recent call last):
  File "PGL-main/apps/Graph4Rec/env_run/src/train.py", line 142, in <module>
    main(config, args.ip)
  File "PGL-main/apps/Graph4Rec/env_run/src/train.py", line 108, in main
    train(config, model, train_loader, optim)
  File "PGL-main/apps/Graph4Rec/env_run/src/train.py", line 68, in train
    optim.step()
  File "<decorator-gen-252>", line 2, in step
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 299, in __impl__
    return func(*args, **kwargs)
  File "<decorator-gen-250>", line 2, in step
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py", line 434, in __impl__
    return func(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/optimizer/adam.py", line 451, in step
    loss=None, startup_program=None, params_grads=params_grads)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 963, in _apply_optimize
    optimize_ops = self._create_optimization_pass(params_grads)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 767, in _create_optimization_pass
    param_and_grad)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/optimizer/adam.py", line 351, in _append_optimize_op
    'beta2', _beta2, 'multi_precision', find_master)
OSError: (External) CUDA error(700), an illegal memory access was encountered. 
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/phi/backends/gpu/gpu_context.cc:624)
  [operator < adam > error]
@Liwb5
Collaborator

Liwb5 commented Jul 21, 2022

Is the environment installed correctly?
pip install paddlepaddle==2.2.2 -U
pip install pgl==2.1.5 -U
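
To confirm that the two pinned versions above are what is actually installed, a small check along the following lines may help. It is only a sketch: the helper names are invented for this example, it reads package metadata instead of importing `paddle`, and it assumes the CPU package name `paddlepaddle` (a GPU build is published under a different name, `paddlepaddle-gpu`, and would need its own entry). On Python 3.7 it falls back to the `importlib_metadata` backport, which may need to be installed separately.

```python
def compare_versions(installed, expected):
    """Return {package: (installed_version, expected_version)} for each mismatch.

    `installed` maps package name -> version string, or None if missing.
    """
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)
        if have != want:
            mismatches[name] = (have, want)
    return mismatches


def installed_versions(names):
    """Look up installed versions without importing the packages themselves."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        import importlib_metadata as metadata  # backport for Python 3.7
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


# Versions recommended in the comment above.
EXPECTED = {"paddlepaddle": "2.2.2", "pgl": "2.1.5"}

if __name__ == "__main__":
    report = compare_versions(installed_versions(EXPECTED), EXPECTED)
    for pkg, (have, want) in report.items():
        print(f"{pkg}: installed {have}, expected {want}")
```

An empty report means both packages match the recommended versions.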

@zouhan6806504
Author

After upgrading the environment, I got a different error.
!pip install --upgrade --user pip
!pip install paddlepaddle==2.2.2 -U
!pip install pgl==2.1.5 -U

Starting multiple servers:
python -m pgl.distributed.launch --ip_config /home/aistudio/ip_list.txt --conf /home/aistudio/metapath2vec.yaml --shard_num 1000 --server_id 0
Resulting error:

Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/pgl/__init__.py", line 20, in <module>
    from pgl import graph
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/pgl/graph.py", line 20, in <module>
    import paddle
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/__init__.py", line 47, in <module>
    import paddle.distribution  # noqa: F401
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distribution/__init__.py", line 15, in <module>
    from paddle.distribution import transform
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distribution/transform.py", line 24, in <module>
    from paddle.distribution import (constraint, distribution,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distribution/distribution.py", line 33, in <module>
    from paddle.fluid.framework import _non_static_mode, in_dygraph_mode
ImportError: cannot import name '_non_static_mode' from 'paddle.fluid.framework' (/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py)
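
In the traceback above, a module under `paddle/distribution` imports a name that `paddle/fluid/framework.py` in the same site-packages directory does not define. One possible reading, which this thread does not confirm, is that files from two different PaddlePaddle installs ended up in the same `paddle` package directory. A rough way to look for that is to list the PaddlePaddle metadata directories in site-packages, as sketched below. The helper `find_paddle_dists` is hypothetical, and the demo directory names and versions are made up for illustration.

```python
import os
import tempfile


def find_paddle_dists(site_packages):
    """List metadata directories of PaddlePaddle distributions in one directory."""
    hits = []
    for entry in sorted(os.listdir(site_packages)):
        lowered = entry.lower()
        if lowered.startswith("paddlepaddle") and lowered.endswith(
                (".dist-info", ".egg-info")):
            hits.append(entry)
    return hits


# Demonstration on a throwaway directory; the contents are invented.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("paddlepaddle-2.2.2.dist-info",
                 "paddlepaddle_gpu-2.3.0.dist-info",
                 "pgl-2.1.5.dist-info"):
        os.mkdir(os.path.join(demo_dir, name))
    demo_hits = find_paddle_dists(demo_dir)

print(demo_hits)
# ['paddlepaddle-2.2.2.dist-info', 'paddlepaddle_gpu-2.3.0.dist-info']
```

To run it against the real environment, pass the site-packages path that appears in the traceback. More than one entry in the result would suggest overlapping installs.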

@Liwb5
Collaborator

Liwb5 commented Jul 26, 2022

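Could it be that the wrong paddlepaddle version was selected when your AI Studio project was first set up? For example, when you create a new project, it asks you to choose a paddlepaddle version. If you select 2.2.2 there, you do not need to run pip install paddlepaddle==2.2.2 yourself.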

@zouhan6806504
Author

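> Could it be that the wrong paddlepaddle version was selected when your AI Studio project was first set up? For example, when you create a new project, it asks you to choose a paddlepaddle version. If you select 2.2.2 there, you do not need to run pip install paddlepaddle==2.2.2 yourself.

With that selection it runs now. Thanks!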
