Abnormal Training Phenomena and Bad Performance #22
Comments
Thank you for your appreciation. In my experience, the training loss is quite high, too. I would double-check whether the model is using the backbone pretrained on ImageNet: does it print something like "Encoder is pretrained from..." at the beginning of training? Any additional information may be helpful in understanding where the problem lies.
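For instance, the return value of `load_state_dict(strict=False)` already tells whether the checkpoint matched the encoder. A minimal sketch of such a check (my own, not the repository's code; the helper name is made up):

```python
import torch


def report_pretrained_load(encoder: torch.nn.Module, pretrained_state: dict) -> None:
    """Report how many encoder parameters were actually matched by the checkpoint."""
    info = encoder.load_state_dict(pretrained_state, strict=False)
    total = len(encoder.state_dict())
    print(f"missing keys: {len(info.missing_keys)} / {total}, "
          f"unexpected keys: {len(info.unexpected_keys)}")
    if len(info.missing_keys) > 0.5 * total:
        print("Warning: most encoder weights were NOT loaded; "
              "the backbone is effectively training from scratch.")
```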
Thanks for your reply. I have investigated the code as per your suggestion. Since I already have the pretrained model locally, I modified the code that originally loaded the model from a URL so that it loads from a local file path instead.

Before the modification:

```python
if pretrained:
    print(f"\t-> Encoder is pretrained from: {pretrained}")
    pretrained_state = load_state_dict_from_url(pretrained, map_location="cpu")["model"]
    info = self.load_state_dict(deepcopy(pretrained_state), strict=False)
    print("Loading pretrained info:", info)
```
After the modification:

```python
if pretrained:
    from urllib.parse import urlparse

    def is_url(path):
        # Check whether `pretrained` is a URL or a local file path
        result = urlparse(path)
        return all([result.scheme, result.netloc])

    print(f"\t-> Encoder is pretrained from: {pretrained}")
    if is_url(pretrained):
        pretrained_state = load_state_dict_from_url(pretrained, map_location="cpu")["model"]
    else:
        # Local checkpoint: load it directly (requires `torch` to be imported in the module)
        pretrained_state = torch.load(pretrained, map_location="cpu")["model"]
    # The state dict is then loaded the same way in both cases
    info = self.load_state_dict(deepcopy(pretrained_state), strict=False)
    print("Loading pretrained info:", info)
```
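For illustration (this snippet is my own addition, and the checkpoint names are made up), `is_url` distinguishes the two cases as follows:

```python
>>> from urllib.parse import urlparse
>>> def is_url(path):
...     result = urlparse(path)
...     return all([result.scheme, result.netloc])
...
>>> is_url("https://example.com/checkpoints/encoder.pth")  # remote checkpoint
True
>>> is_url("/data/checkpoints/encoder.pth")                # local checkpoint
False
```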
Therefore, when I run the training program, this message is printed four times, once by each of the four processes.

Secondly, I looked into the alignment issue you mentioned between the training set and the test set. I use the Eigen splits on KITTI for both, and I could not find anything in the program that could cause a mismatch between them; loading the training set and loading the test set go through the same code module.

Additionally, I evaluated both the weights you provided and the weights I trained myself, on the training set and on the test set.
Both models perform similarly on the training set (my trained model is even slightly better). On the test set, however, there is a significant gap between them, which suggests overfitting. Yet during training I never observed the evaluation metrics improving at first and then deteriorating later. I look forward to hearing your further suggestions. Thank you once again for your reply. Best wishes to you!
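For reference, here is a minimal sketch (my own, not the repository's evaluation code) of the standard monocular-depth metrics usually compared in such a table, i.e. absolute relative error, RMSE, and the δ < 1.25 accuracy:

```python
import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray, min_depth=1e-3, max_depth=80.0):
    """Standard depth metrics computed on valid ground-truth pixels (KITTI cap: 80 m)."""
    valid = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[valid], gt[valid]

    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = float(np.mean(ratio < 1.25))
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

Computing these on the training split and on the Eigen test split for both checkpoints makes the gap described above explicit.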
You could try using the provided checkpoint, test it with your data/code, and see whether the results match the ones provided.
Yes, I did exactly that; the table I provided above reports those results. The fact that the checkpoint you provided performs well on both my training and test sets suggests that there might not be a problem with my dataset.
Honestly, I do not know. You are not seeing classic overfitting, but the model does not generalize either, since the training metrics are good while the validation ones are not. Moreover, the KITTI validation and training sets are pretty similar, so I wonder why there is such a drop. Either the training set is different from the one I used (I used the "new" Eigen split, namely the one released after 2019), or the configs (i.e., augmentations, training schedule/LR, etc.) differ in something.
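One way to rule out the split/config hypothesis (a sketch of my own; the file paths are placeholders, not the repository's actual layout) is to hash the split lists and diff the config files directly:

```python
import difflib
import hashlib
from pathlib import Path


def file_digest(path: str) -> str:
    """MD5 of a file, handy for checking that two Eigen split lists are identical."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def show_diff(path_a: str, path_b: str, max_lines: int = 20) -> None:
    """Print the first differing lines between two text files (e.g. two configs)."""
    a = Path(path_a).read_text().splitlines()
    b = Path(path_b).read_text().splitlines()
    diff = list(difflib.unified_diff(a, b, fromfile=path_a, tofile=path_b, lineterm=""))
    print("\n".join(diff[:max_lines]) if diff else "files are identical")


if __name__ == "__main__":
    # Placeholder paths: point these at the actual split lists and config files.
    print(file_digest("splits/kitti_eigen_train.txt"))
    show_diff("configs/kitti_mine.json", "configs/kitti_reference.json")
```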
Thank you for your assistance! This situation is indeed perplexing. I believe we can rule out differences in the dataset and configuration, since I used the split files and config provided in your repository.

I would like to build on your work, so I will continue to try to debug the issue. Once again, thank you for your help, and I wish you a pleasant day!
Dear Luigi Piccinelli,
I hope this message finds you well. I want to express my sincere appreciation for your exceptional article. Inspired by your work, I attempted to train your project on the KITTI dataset using the Eigen split.
However, during training I encountered several abnormal phenomena that I would like to bring to your attention.
Here is a screenshot depicting the issue:
To accommodate my hardware (a single machine with four RTX 3090s and no SLURM), I replaced the SLURM-based launch of the distributed training with standard DDP (DistributedDataParallel).
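A minimal sketch of the kind of change involved (my own illustration, not the author's code): reading the rank and device from the environment variables set by `torchrun` instead of from SLURM:

```python
import os

import torch
import torch.distributed as dist


def setup_ddp() -> int:
    """Initialize DDP from the env vars set by e.g. `torchrun --nproc_per_node=4 train.py`."""
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # RANK / WORLD_SIZE are read from the environment
    return local_rank


def wrap_model(model: torch.nn.Module, local_rank: int) -> torch.nn.Module:
    """Move the model to its GPU and wrap it for distributed training."""
    model = model.cuda(local_rank)
    return torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

With such a launch, each of the four processes runs the same initialization code, which is consistent with the pretrained-encoder message being printed four times, as noted above.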
Additionally, I made some modifications in the dataloader directory to align the paths with the directory structure of my existing KITTI dataset. I believe these changes are not the cause of the poor results, since the code correctly outputs messages such as "Loaded 23158 images. Totally 0 invalid pairs are filtered" and "Loaded 652 images. Totally 45 invalid pairs are filtered."
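As a further sanity check on that kind of path remapping (a sketch of my own; the split-file format and field order are assumptions, not necessarily what the repository uses), one can verify that every pair listed in a split actually resolves to files on disk:

```python
from pathlib import Path


def check_split(split_file: str, data_root: str) -> None:
    """Count how many entries of a split file resolve to existing files under data_root."""
    root = Path(data_root)
    loaded, invalid = 0, 0
    for line in Path(split_file).read_text().splitlines():
        if not line.strip():
            continue
        # Assumed format: "<image_relpath> <depth_relpath> ..." (whitespace separated)
        paths = [root / p for p in line.split()[:2]]
        if all(p.exists() for p in paths):
            loaded += 1
        else:
            invalid += 1
    print(f"Loaded {loaded} images. Totally {invalid} invalid pairs are filtered")


# check_split("splits/kitti_eigen_train.txt", "/data/kitti")  # placeholder paths
```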
Furthermore, in order to track the training process with TensorBoard, I added some code in the training loop to generate and save log information.
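A minimal sketch of that kind of addition (my own illustration; the tag names and log directory are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/kitti_eigen")  # placeholder log directory


def log_step(loss: float, lr: float, global_step: int) -> None:
    """Called once per optimization step inside the training loop."""
    writer.add_scalar("train/loss", loss, global_step)
    writer.add_scalar("train/lr", lr, global_step)
```

Logging the training loss and the validation metrics side by side makes the train/test gap discussed above easy to see over time.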
Apart from these adjustments, I have not made any additional modifications to the code. Specifically, the config file remains the same as the one you provided.
I would greatly appreciate your insights and guidance regarding these issues. If there are any specific details or additional information I can provide to assist in troubleshooting, please let me know. Thank you once again for your remarkable contribution to the field.

Best regards