which config file is used? #53
@zhanghaoo:
Besides, I think the NUM_WORKERS parameter should be modified according to the number of GPUs; in other words, NUM_WORKERS should be 1 when there is only 1 GPU. Is that assumption right? What should I do?
You don't need to change num_workers. I have trained with 1, 4, and 8 Tesla V100 (32 GB) GPUs while keeping num_workers=4. Just wait: the first run may take some time to build the annotation cache. BTW, 4-GPU V100 (32 GB) training reaches the mAP the author reported in the paper.
Hello, I'm sorry: the server was running other programs over the past few days, so I could not demonstrate the problem.
I saw your reply that night, right after you responded.
The program is now running train_net.py, as shown in the attached picture.
My questions are as follows:
1. It is really time-consuming, but what I don't understand is why GPU utilization is often 0 after "start training" is printed. What is the program doing?
2. I added the print command marked with the red arrow. Why doesn't the program print this line, instead directly executing the command marked with the green arrow and printing "start training"?
Part of my configuration is as follows:
1. NVIDIA GeForce GTX 1080 Ti
2. CUDA: 10.1.243
3. cuDNN: 7.6.5
Thank you very much for your answers!
…On Mon, Aug 31, 2020, 19:20, LauncH wrote:
> (quoted reply, as above) 8-GPU V100 (32 GB) training may give a slightly lower mAP (0.002 base learning rate, 60k iterations).
For question 1, check your log.txt to find how many GPUs you are using. If you are using all of them, GPU utilization should be above 80%. The task is time-consuming: I used 4 Tesla V100 (32 GB) GPUs, trained for nearly a day, and tested the whole validation set for nearly 2 hours, finally getting the result the author reported in the paper. You will no doubt take much longer on an NVIDIA 1080 Ti.
The training set holds 109,815 pictures: 53,621 from DET and 56,194 from VID. You could train on the VID data alone and accept somewhat lower accuracy (around 76.8% mAP). The validation set holds 176,126 pictures. The full 4-GPU MEGA training schedule is 120K iterations with batch size 4.
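As a rough sanity check on that schedule, the numbers in this thread (109,815 training images, batch size 4, 120K iterations; treat them as approximate) imply only a few effective epochs:

```python
# Rough epoch count for the 4-GPU MEGA schedule described in this thread.
train_images = 53621 + 56194   # DET + VID frames = 109815
iterations = 120_000
batch_size = 4                 # total images per iteration across GPUs

images_seen = iterations * batch_size
epochs = images_seen / train_images
print(train_images, images_seen, round(epochs, 2))
```

So the schedule sees each image only about four times, which is one reason the first hours of training can look unproductive.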
Sorry, I may not have expressed myself clearly. I replied by email, so the attachments are in the email; let me restate things in this issue.
1. GPU usage
3. About training (PS: thank you for so patiently answering a beginner's questions. Thank you very much, and I wish you well!)
I changed both the number of pictures in the dataset and the count of training pictures listed in the index txt file. Training used only the 30 pictures mentioned above. Right now I just want the training and testing modules to run through, so I can try your method on my own dataset. But it will not run through.
I am not the author; I have only reproduced the original author's results. You probably have not actually started training: once training starts, the program prints the iteration count and loss, and saves a checkpoint every 2,500 iterations. Try killing all the unneeded processes currently occupying the GPUs, then rerun the program.
Impressive. Ah, I don't understand. Following your hints I tried each of the suggestions, but nothing gets printed. It is really frustrating; I feel I have nowhere to start. I believe all my earlier setup steps were executed correctly, and this is the best and clearest open-source code I have seen, but for some reason I run into many problems implementing it myself. So far everything is in place, but it just will not train. Reply when you have time; meanwhile I will keep looking myself. I think MEGA fits my current work well and I don't want to give up on it.
Hello, where do I change the batch size? After I changed it, training still needs a lot of memory, so I am not sure whether I changed it in the wrong place.
You can add it in the corresponding config file, or change it in BASE_RCNN_xgpu.yaml: find the config file matching your number of GPUs. The parameter name is imgs_per_batch (in maskrcnn-benchmark's defaults it appears as `SOLVER.IMS_PER_BATCH`). That should be the one.
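For reference, in maskrcnn-benchmark-style yacs configs the batch-size keys look like the snippet below. The key names follow maskrcnn-benchmark's defaults.py and are an assumption here; verify them against your local mega_core/config/defaults.py:

```yaml
# Override in your config yaml (e.g. the file based on BASE_RCNN_1gpu.yaml).
SOLVER:
  IMS_PER_BATCH: 1   # total training images per batch, summed over all GPUs
TEST:
  IMS_PER_BATCH: 1   # batch size used during evaluation
```

Note that `SOLVER.IMS_PER_BATCH` is a global total, so with 4 GPUs a value of 4 means one image per GPU.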
OK, thank you.
Let's learn together and keep in touch (^^)
I'm back; now that the busy stretch is over, I need to continue with this. Brothers, did you all get it running successfully?
I got it running; it's just that the accuracy isn't great. Where does your error occur?
Very strange: mine gets stuck right after printing "start training". I still think it is a GPU-detection problem. In make_data_loader, num_gpus = get_world_size() fails to read num_gpus, but no error is raised either. My email is zh_pure@sina.com; I'd like to ask you some questions and could use some guidance. Something came up and I have to head back to the dorm; looking forward to hearing from you!
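For context, `get_world_size()` in maskrcnn-benchmark-style code usually falls back to 1 when torch.distributed has not been initialized, so by itself it should not hang. A sketch of that utility in the spirit of mega_core/utils/comm.py (not a verbatim copy; the ImportError guard is added here so the sketch also runs without torch installed):

```python
try:
    import torch.distributed as dist
except ImportError:  # allow the sketch to run without torch
    dist = None

def get_world_size():
    # Fall back to single-process mode when distributed training has
    # not been set up (e.g. a plain `python tools/train_net.py` run).
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return 1
    return dist.get_world_size()

print(get_world_size())
```

If this returns 1 but training still stalls, the hang is more likely in data loading (e.g. workers waiting on the annotation cache) than in GPU detection.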
Solved. It runs through now. I also ran it on my own dataset. What remains is analyzing why the results are not so good.
I tried it and it runs through. Training from scratch or directly loading the author's trained model both reproduce the author's results, for both RDN and MEGA. On my own dataset it does a bit better, but the gain is small. ImageNet VID is too easy.
I configured everything exactly per INSTALL.md: Ubuntu 16.04, CUDA 9.2, PyTorch 1.3.0+cu92, torchvision 0.4.1+cu92, Python 3.7; either 4 or 8 Tesla V100 GPUs work. Facebook no longer updates the maskrcnn-benchmark framework, so PyTorch 1.4 or above may cause problems.
If the accuracy is too low, it may be that the backbone's pretrained model was loaded incorrectly offline. ./mega_core/config/paths_catalog.py shows the paths of the various pretrained models. R101.pkl is the Detectron 1 MSRA pretrained model; it is the C4 variant, not FPN. If you load it online, this should not be the issue.
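One quick, hedged way to check whether a Detectron-style checkpoint is the C4 variant rather than FPN is to inspect its parameter names. The key patterns below are assumptions based on typical Detectron 1 naming, and the sample key names are made up for illustration:

```python
def looks_like_c4_backbone(param_names):
    # Heuristic: a C4 ResNet checkpoint has res-stage weights but
    # no FPN lateral / top-down layers in its parameter names.
    has_res_stages = any(n.startswith("res") for n in param_names)
    has_fpn = any("fpn" in n.lower() for n in param_names)
    return has_res_stages and not has_fpn

# Made-up key names in the Detectron 1 naming style:
c4_keys = ["conv1_w", "res2_0_branch1_w", "res4_5_branch2c_w"]
fpn_keys = c4_keys + ["fpn_inner_res5_sum_w"]
print(looks_like_c4_backbone(c4_keys), looks_like_c4_backbone(fpn_keys))
```

To get real key names from an R101.pkl, Detectron 1 pickles generally need `pickle.load(f, encoding="latin1")`; the exact structure of the loaded dict varies by checkpoint, so inspect it before indexing.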
@launchauto Brother, could you share the network architecture, i.e. the network structure diagram?
@ZhijunHou Could you contact me by email to discuss some MEGA problems, bro?
What's the problem, buddy? 🤣
@ZhijunHou Good brother, I wonder whether you have looked closely at the Relation model part that MEGA cites.
@ZhijunHou Also, bro: suppose the test set has 20 frames, but by the formula the number of aggregated frames is Tm*Nl+Tg = 40. How are the extra 20 frames handled?
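On that boundary question: a common way video samplers handle reference windows that extend past the ends of a short clip is to clamp out-of-range indices to the valid range, so boundary frames simply get sampled more than once. This is a hedged sketch of that general idea, not MEGA's actual sampling code:

```python
def clamp_window(center, window, num_frames):
    # Sample `window` reference indices around `center`, clamping any
    # out-of-range index into the valid interval [0, num_frames - 1].
    half = window // 2
    return [min(max(center + off, 0), num_frames - 1)
            for off in range(-half, window - half)]

print(clamp_window(center=18, window=8, num_frames=20))
```

With this scheme a 40-frame aggregation budget over a 20-frame test video just repeats boundary frames; check the repo's data loader to confirm which variant (clamping, wrapping, or truncating) it actually uses.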
@launchauto T_T T_T |
Good brothers, bro!
@joe660 ??? Well... good grief, you asked me something? I never saw it.
Why does it feel like this has become popular again recently? People are asking me questions on Zhihu too…
…On Mon, Apr 19, 2021, joe660 wrote:
> You really are a little genius. But I guess nobody will join...
> QQ group number: 728816033; QR code: https://user-images.githubusercontent.com/33448536/115327370-67f80680-a1c1-11eb-8ab4-2aad0dc1cb36.png
> Get the experts above to join the group, haha. Weren't you discussing problems up there before?
Are you from Zhejiang University?
python demo/demo.py mega configs/MEGA/vid_R_101_C4_MEGA_1x.yaml configs/MEGA/MEGA_R_101.pth --video --visualize-path datasets/vid/1.mp4 --output-folder visualization/1_MEGA
Has anyone else run into this problem? @launchauto @ZhijunHou @asmallcat @zhanghaoo @liwenjielongren
For building my own dataset, how should the files be laid out, and how is the txt file recording the frame numbers generated? I downloaded the VID dataset and found it also contains the corresponding videos. Are those needed here?
I know that the following config files are used when there is 1 GPU: BASE_RCNN_1gpu.yaml and vid_R_101_C4_MEGA_1x.yaml.
But where can I modify the NUM_WORKERS parameter of DATALOADER?
Changing that parameter in defaults.py has no effect, and I cannot find any yaml that contains it.
Who can tell me?
Thank you very much!
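On the original question: with yacs-style configs, defaults.py only supplies defaults; values are overridden by the yaml chain and by command-line opts, and a stale installed copy of the package can also mask an edit to defaults.py. One way to force the value without editing any file is a command-line override appended after the config flag (the script path and config name here follow the repo's README and are assumptions):

```shell
python tools/train_net.py \
    --config-file configs/MEGA/vid_R_101_C4_MEGA_1x.yaml \
    DATALOADER.NUM_WORKERS 1
```

Key/value pairs given after the named flags are merged into the config last, so they win over both defaults.py and the yaml files.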