
Why do the best model and results reproduced on the IU-Xray dataset appear at the 3rd epoch? #14

Open
Leepoet opened this issue Aug 28, 2023 · 7 comments

Comments


Leepoet commented Aug 28, 2023

Hi, I have successfully reproduced your work and obtained exactly the same results as described in your paper. However, I noticed that when experimenting on the IU-Xray dataset, the best model and results appear at the 3rd epoch. Does this phenomenon indicate that the validity of the method proposed in the paper needs to be re-examined? Could you explain whether this phenomenon is reasonable?
Generally speaking, reports generated from checkpoints saved in the early epochs have poor diversity. I tried generating reports with both my retrained best model and the best model you provided, and found that this is indeed the case.
Below is an excerpt of the experiment log containing the best results, to demonstrate that I successfully reproduced them.

07/24/2023 16:11:39 - INFO - modules.trainer - [3/30] Start to evaluate in the validation set.
07/24/2023 16:12:32 - INFO - modules.trainer - [3/30] Start to evaluate in the test set.
07/24/2023 16:13:57 - INFO - modules.trainer - epoch : 3
07/24/2023 16:13:57 - INFO - modules.trainer - ce_loss : 2.3583324741023457
07/24/2023 16:13:57 - INFO - modules.trainer - img_con : 0.010452255175282905
07/24/2023 16:13:57 - INFO - modules.trainer - txt_con : 0.02370573818510355
07/24/2023 16:13:57 - INFO - modules.trainer - img_bce_loss : 0.6931472420692444
07/24/2023 16:13:57 - INFO - modules.trainer - txt_bce_loss : 0.6931472420692444
07/24/2023 16:13:57 - INFO - modules.trainer - val_BLEU_1 : 0.4875411346726625
07/24/2023 16:13:57 - INFO - modules.trainer - val_BLEU_2 : 0.32324968962851985
07/24/2023 16:13:57 - INFO - modules.trainer - val_BLEU_3 : 0.2303989906968061
07/24/2023 16:13:57 - INFO - modules.trainer - val_BLEU_4 : 0.16892974375553144
07/24/2023 16:13:57 - INFO - modules.trainer - val_METEOR : 0.19912841341017073
07/24/2023 16:13:57 - INFO - modules.trainer - val_ROUGE_L : 0.3893886781595059
07/24/2023 16:13:57 - INFO - modules.trainer - test_BLEU_1 : 0.5247745358089907
07/24/2023 16:13:57 - INFO - modules.trainer - test_BLEU_2 : 0.35656897214407807
07/24/2023 16:13:57 - INFO - modules.trainer - test_BLEU_3 : 0.2620523629665125
07/24/2023 16:13:57 - INFO - modules.trainer - test_BLEU_4 : 0.19875032988045743
07/24/2023 16:13:57 - INFO - modules.trainer - test_METEOR : 0.21969653608856185
07/24/2023 16:13:57 - INFO - modules.trainer - test_ROUGE_L : 0.4113942119889325
07/24/2023 16:14:09 - INFO - modules.trainer - Saving checkpoint: /data/XProNet/results_RETRAIN_withReportGen/iu_xray/current_checkpoint.pth ...
07/24/2023 16:14:30 - INFO - modules.trainer - Saving current best: model_best.pth ...

Leepoet changed the title from “Why does the best model reproduced on the iu-xray dataset appear in the 3rd epoch?” to “Why does the best model and results reproduced on the iu-xray dataset appear in the 3rd epoch?” on Aug 28, 2023
Markin-Wang (Owner) commented Aug 28, 2023

Hi, thanks for your interest in our work. Note that for radiology report generation, precision is more important than the diversity of the reports. Regarding the validity of our method, our work follows the methodology of the most notable works in this area: we utilized six widely-used evaluation metrics to gauge the performance of our model. We also observed the same phenomenon in our experiments with different models, e.g., R2Gen (see this issue) and R2GenCMN, on the IU-Xray dataset. A possible reason is that IU-Xray contains both frontal and lateral views, so it is difficult for the visual extractor to capture the differences between samples, and the model is therefore likely to generate similar reports. Besides, IU-Xray is a small dataset, so the diversity of its reports is lower than that of the MIMIC-CXR dataset. Hope this helps you figure out the problem.

Leepoet (Author) commented Aug 28, 2023

Hi, thanks for your reply.
I have tried to reproduce the work of R2GenCMN and found that its best model and best results appear around the 25th epoch, which is within an acceptable range in my opinion.
As I said earlier, a best model obtained in the first few epochs is usually of little reference value. In my recent reproduction experiments, I got fairly good results in the first epoch, but eventually found that only one kind of report was generated. Does this mean that a model that is not yet well trained in the first few epochs can generate reports with poor diversity that nevertheless score well on the six evaluation metrics? If so, please forgive my bold doubts, but the validity of the method you propose may then be less convincing.
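(For reference, the check I used to spot this collapse is simple. This is a minimal sketch, not from the XProNet repo; "generated_reports.txt" is a hypothetical file with one generated report per line:)

# Count how many distinct reports a checkpoint actually produces.
from collections import Counter

with open("generated_reports.txt") as f:
    reports = [line.strip() for line in f if line.strip()]

counts = Counter(reports)
print(f"{len(counts)} distinct reports out of {len(reports)} generated")
top_report, top_freq = counts.most_common(1)[0]
print(f"most frequent report covers {top_freq / len(reports):.1%} of samples")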

Markin-Wang (Owner) commented Aug 28, 2023


Hi, I guess the epoch at which the best performance occurs is also influenced by hyper-parameters such as the learning rate, and by the working environment, in addition to the method itself. As mentioned earlier, our work follows the methodology of the most notable works in this area, such as R2Gen and R2GenCMN: we utilized six widely-used evaluation metrics to gauge the performance of our model. In addition, from my perspective, the deeper problem is that NLP evaluation metrics may not reflect the true performance of the model, which is a common issue in text generation tasks. This is why we normally focus more on larger datasets such as MIMIC-CXR to mitigate this problem. Moreover, higher diversity does not always come with higher precision; subject-matter expert (SME) involvement is required to truly gauge this.
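(To make the metric concern concrete, here is a hypothetical sanity check — not from our released code, and the reports below are made up — that scores one fixed generic report against every reference with corpus BLEU, to see how far n-gram overlap alone can get on a homogeneous test set:)

from nltk.translate.bleu_score import corpus_bleu

# Made-up references standing in for IU-Xray ground-truth reports.
ground_truth = [
    "the heart size is normal . the lungs are clear .",
    "heart size is within normal limits . no focal consolidation .",
    "the lungs are clear . no pleural effusion or pneumothorax .",
]
references = [[r.split()] for r in ground_truth]

# One generic report used as the hypothesis for every sample.
generic = "the heart size is normal . the lungs are clear .".split()
hypotheses = [generic] * len(references)

for n in range(1, 5):
    score = corpus_bleu(references, hypotheses,
                        weights=tuple(1.0 / n for _ in range(n)))
    print(f"BLEU-{n}: {score:.4f}")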

Leepoet (Author) commented Aug 29, 2023

Maybe you are right; it is possible that something in the IU-Xray dataset itself is causing this. I can agree with most of the points in your answer. By the way, I also appreciate your patient replies and your excellent work. Thank you.

Markin-Wang (Owner) commented

Never mind, and thank you for your interest in our work and the concrete discussion. Please feel free to reach out again if you have any other questions.

Xqq2620xx commented

Hello! I've encountered the same problem as you! I also achieved very good results in the 1st epoch, but the generated sentences were all repetitive. I would like to share my thoughts and discuss them with you:

I have tried R2Gen, R2GenCMN, and XProNet, and their results on IU-Xray were very unstable. (You mentioned that R2GenCMN peaked at the 25th epoch, but I have also seen cases where the highest value occurred within the first five epochs.) I have also modified my own model and encountered situations where the 1st epoch gave very high results.

Currently, everyone (in the previous papers) reports the best validation result, and the evaluation metrics do not include any measure of diversity. I don't think there is a good solution at the moment: taking the average over all epochs, or just using the final epoch's results, doesn't seem appropriate either. One cheap option would be to report a distinct-n score alongside BLEU/ROUGE, as sketched below.
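(Distinct-n is the ratio of unique n-grams to total n-grams over all generated reports, in the spirit of the diversity metrics used for dialogue generation. A minimal sketch, assuming whitespace-tokenized reports:)

def distinct_n(reports, n):
    # Ratio of unique n-grams to total n-grams across all generated reports;
    # 1.0 means every n-gram is unique, values near 0 mean heavy repetition.
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in (report.split() for report in reports)
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(ngrams)) / max(len(ngrams), 1)

# Example with made-up reports: a collapsed model scores near zero.
print(distinct_n(["the lungs are clear ."] * 100, 2))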

In addition, I found that using an LSTM as the decoder yields better diversity than a Transformer, but I don't understand the specific reasons behind this.

However, on MIMIC-CXR the above situation is largely alleviated, and the results are relatively stable. At least in the experiments I conducted, I did not encounter cases where the first five epochs gave very high results. Perhaps we can explore more on MIMIC-CXR.

I think we need better and more reasonable metrics to evaluate the ability of radiology report generation models 😂~

ThatNight commented Apr 26, 2024

@Leepoet Hello Leepoet, I have repeated the experiment many times and it is difficult to reproduce the results on the IU-Xray dataset. Could you share the parameters in utils.py for the IU-Xray dataset, or the random seed?
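(For context, in my runs I pin randomness in the usual PyTorch way — a generic sketch, not the repo's actual configuration — so what I am most likely missing is the seed value itself or other hyper-parameter choices:)

import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trades speed for run-to-run repeatability on GPU.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # 42 is a placeholder, not the seed used in the paper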
