
Unstable training #1

Closed
alwynmathew opened this issue Jun 22, 2018 · 13 comments
@alwynmathew

alwynmathew commented Jun 22, 2018

Results with lr=1e-3

Epoch 1:
epoch001_disp_left
Epoch 2:
epoch003_disp_left
Epoch 5:
epoch005_disp_left

Disparities are degrading as I train more.

@NikolasEnt
Member

Hello, @alwynmathew!
What data and hyperparameters do you use for training?
I'd like to reproduce the issue.

@alwynmathew
Author

I used the exact same data and hyperparameters as in the demo notebook.

@NikolasEnt
Member

NikolasEnt commented Jun 22, 2018

Ok, I'll examine it. There were some issues with the parameters.

Meanwhile, you can experiment with our pretrained model: it was trained for 75 epochs with lr=1e-2 and batch=20 (you may also try training with those parameters). Or use the better one: it was trained for an extra 35 epochs with lr=1e-4, starting from the first model as a pretrained checkpoint.
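In PyTorch terms, that two-stage schedule could be sketched as follows. This is a minimal illustration, not the repo's actual training script: the tiny `Linear` layer stands in for the depth network, and the optimizer choice is an assumption.

```python
import io
import torch

# Stage 1: train from scratch at lr=1e-2 (batch=20 in the thread above).
net = torch.nn.Linear(8, 1)  # stand-in for the depth network
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

# ... stage 1: train for 75 epochs ...

# Keep the stage-1 checkpoint (an in-memory buffer here; a file in practice).
buf = io.BytesIO()
torch.save(net.state_dict(), buf)

# Stage 2: reload the stage-1 weights and fine-tune for 35 more epochs
# at the lower learning rate.
buf.seek(0)
net.load_state_dict(torch.load(buf))
for group in opt.param_groups:
    group['lr'] = 1e-4
```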

@alwynmathew
Author

I have also ported the original monodepth to PyTorch here. I faced the exact same issue: the disparities degrade after a few epochs. What hyperparameters did you use to get better results?

@NikolasEnt
Member

lr=1e-2 and batch=20 are good enough (see above) for our implementation.

@alwynmathew
Author

alwynmathew commented Jun 22, 2018

@NikolasEnt batch_size=20 is too big.
Epoch 10 with batch_size = 8 and lr=1e-2
epoch010_disp_left
Epoch 16
epoch016_disp_left
It's still unstable.

@voeykovroman
Member

Hello, @alwynmathew!

Once again: are you sure you downloaded the correct dataset, as described here?
There are 38,237 images, yet you attached results for your first 10 epochs roughly an hour after @NikolasEnt answered you about the lr. You also used a smaller batch size, which means training should take you longer than it takes us, and we needed about 45 minutes per epoch on a single GTX 1080 Ti. It looks like something is wrong with your data or equipment.
During this week we will try to reproduce our training using exactly this repo, without any changes, downloaded onto a fresh machine, and will publish the disparities we get after 10 epochs in this thread.

@alwynmathew
Author

Hi @Sparkling-Brick, according to the notebook provided in the repo, the data loader only loads from one of the KITTI dataset subfolders: 'data_dir':'../../2011_09_26/'.

@voeykovroman
Member

Yes, thanks for noting it; the path was changed to check whether the notebook was working before publishing. However, since you noticed that the path in the notebook points to just one subfolder, you should be able to change it to load the whole dataset.
Moreover, if you read our README, you will notice that the data structure and the path variable are described there.
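As an illustration, collecting frames from every drive rather than a single date folder might look like the sketch below. The glob pattern assumes the standard KITTI raw layout (date folder, `*_sync` drive folders, `image_02` for the left camera); the function name and pattern are assumptions, not the repo's actual loader.

```python
import os
from glob import glob

def list_kitti_left_images(data_dir):
    """Collect left-camera (image_02) frames from every drive under
    data_dir instead of a single date subfolder. Returns a sorted list
    of file paths; the layout below follows the KITTI raw structure."""
    pattern = os.path.join(data_dir, '*', '*_sync',
                           'image_02', 'data', '*.png')
    return sorted(glob(pattern))
```

Pointing `data_dir` at the dataset root then yields all 38,237 images rather than the frames of one recording date.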

@alwynmathew
Author

alwynmathew commented Jun 26, 2018

But @Sparkling-Brick, do you think just adding more data will solve the problem? The original implementation used a batch size as small as 8, so why do you recommend a larger one?

@NikolasEnt
Member

Hi, @alwynmathew, we retrained our model from scratch with the following parameters:

'model':'resnet18_md',
'learning_rate':1e-2,
'batch_size':8,
'adjust_lr':True,
'do_augmentation':True,
'augment_parameters':[0.8, 1.2, 0.5, 2.0, 0.8, 1.2],

Here is the result. Obviously, it should be trained further; however, it is stable with lr=1e-2 and batch size 8.
demo
The full KITTI dataset from the original repo was used for training.
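The `augment_parameters` list above appears to follow the original monodepth scheme, where the six values are read as (gamma_low, gamma_high, brightness_low, brightness_high, color_low, color_high). A pure-Python sketch of such an augmentation (an illustration under that assumption, not the repo's implementation) could be:

```python
import random

def augment(image, params=(0.8, 1.2, 0.5, 2.0, 0.8, 1.2)):
    """Apply random gamma, brightness, and per-channel color shifts.
    `image` is a nested list [H][W][3] of floats in [0, 1]; the result
    is clipped back into [0, 1]."""
    g_lo, g_hi, b_lo, b_hi, c_lo, c_hi = params
    gamma = random.uniform(g_lo, g_hi)
    brightness = random.uniform(b_lo, b_hi)
    colors = [random.uniform(c_lo, c_hi) for _ in range(3)]
    return [[[min(1.0, (px[c] ** gamma) * brightness * colors[c])
              for c in range(3)]
             for px in row]
            for row in image]
```

In a real pipeline the same random factors would be applied to the left and right images of a stereo pair so the photometric loss stays consistent.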

@alwynmathew
Author

alwynmathew commented Jun 29, 2018

Thank you @NikolasEnt for the effort of training the model from scratch.

I do get a perfect disparity map for selected images, but it doesn't seem to apply to all of the KITTI test images, even after training for 17 epochs with the same parameters on the full KITTI dataset.

'model':'resnet18_md',
'learning_rate':1e-2,
'batch_size':8,
'adjust_lr':True,
'do_augmentation':True,
'augment_parameters':[0.8, 1.2, 0.5, 2.0, 0.8, 1.2]

Is it just me or do you face the same problem?

Reconstructed images and corresponding disparities during my training:
drawing

@NikolasEnt
Member

Hi, @alwynmathew. It looks like the second original raw image has some issues. They may come from the video-to-image conversion process or from the image encoding in the dataset. Personally, I didn't observe such examples; however, I didn't examine the whole dataset.
Ideally, such images should be excluded from the train/val subsets.
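One hypothetical way to prune such frames from the train/val lists is to drop files that are missing or suspiciously small (e.g. truncated during conversion). The helper name and the size threshold here are assumptions, not part of the repo:

```python
import os

def find_valid_images(paths, min_bytes=1024):
    """Keep only paths that exist and exceed a minimal file size.
    Truncated or missing frames are silently dropped; a stricter
    check could also attempt to decode each image."""
    return [p for p in paths
            if os.path.isfile(p) and os.path.getsize(p) > min_bytes]
```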
