Reproduction of Doc2EDAG #37

Closed
CarlanLark opened this issue Jun 30, 2022 · 9 comments
Labels: discussion (Discussion on DocEE and SentEE), help wanted (Extra attention is needed)

Comments

CarlanLark commented Jun 30, 2022

Hello, Spico! I'm very glad to talk with you about event extraction. Is the order of event types (o2o, o2m, m2m) in the training data important for model performance? I found that the reproduction of Doc2EDAG in your paper reaches (P=86.2, R=70.8, F=79.0, overall scores), but my reproduction only reaches (P=79.7, R=73.2, F=76.3, overall scores). I just git cloned the code from the GitHub repo in the Doc2EDAG paper and ran it without modifying the data preprocessing.

CarlanLark added the discussion label Jun 30, 2022
Spico197 (Owner) commented Jul 1, 2022

Hi there. I think your reproduced results may be macro averaged, while the F=79.0 reported in my paper is micro averaged.
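To make the distinction concrete: micro averaging pools TP/FP/FN over all event types before computing one F1, while macro averaging computes a per-type F1 and then takes the unweighted mean. A minimal sketch (the per-type counts below are made up for illustration; the type names are from ChFinAnn):

    # Micro vs. macro F1, with hypothetical per-type (TP, FP, FN) counts.
    counts = {
        "EquityFreeze": (50, 10, 40),   # a rare type with weaker scores
        "EquityPledge": (500, 50, 80),  # a frequent type with stronger scores
    }

    def f1(tp, fp, fn):
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return 2 * p * r / (p + r)

    # Micro: pool the counts over all types, then compute one F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    print("micro F1:", f1(tp, fp, fn))

    # Macro: compute F1 per type, then average; rare types weigh as much as frequent ones.
    print("macro F1:", sum(f1(*c) for c in counts.values()) / len(counts))

With imbalanced types the two can differ substantially, which is why a micro score of 79.0 and a macro score around 76 can describe the same model.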

CarlanLark (Author) commented

Thanks for your reply. Unfortunately, the results above are actually micro scores, and here are my reproduction results:
{
"MacroPrecision": 0.7605239514335949,
"MacroRecall": 0.6683277432764718,
"MacroF1": 0.7092646497139452,
"MicroPrecision": 0.797208663819402,
"MicroRecall": 0.7323874583990191,
"MicroF1": 0.7634245649911446,
"TP": 20906,
"FP": 5318,
"FN": 7639
}

We chose the results on the test set from the best model on the dev set; the best epoch id was 96.
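For reference, the micro scores above follow directly from the TP/FP/FN counts; a quick check in Python:

    tp, fp, fn = 20906, 5318, 7639
    precision = tp / (tp + fp)                          # 0.79720...
    recall = tp / (tp + fn)                             # 0.73238...
    f1 = 2 * precision * recall / (precision + recall)  # 0.76342...
    print(precision, recall, f1)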

CarlanLark (Author) commented

By the way, I find that the main difference between your reproduction results and ours comes from the precision score. I wonder whether different Python package versions could lead the model to different "styles", such as (high precision, low recall) or (low precision, high recall).

CarlanLark (Author) commented

What's more, in the README.md of this repo you said

# generate data with doc type (o2o, o2m, m2m) for better evaluation

So do you keep the document order of train.json, or do you use a new document order? I mean, if documents in the training dataset are ordered (first o2o, then o2m, then m2m), the model will be trained on o2o documents first. This order may lead the model to prefer less generation (higher precision, lower recall) rather than more generation (lower precision, higher recall), which could be the reason for the difference between your reproduction and mine.

Spico197 (Owner) commented Jul 1, 2022

> So do you keep the document order of train.json, or do you use a new document order? … This order may lead the model to prefer less generation (higher precision, lower recall) rather than more generation (lower precision, higher recall).

doc_type is only used in evaluation. We keep the same data loading strategy as the original Doc2EDAG code.
The order isn't really a problem in evaluation: doc_type is only used to produce fine-grained evaluation results for each document type.
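A minimal sketch of what such fine-grained evaluation looks like; the (doc_type, TP, FP, FN) record layout below is an assumption for illustration, not the repo's actual data structure:

    from collections import defaultdict

    # Hypothetical per-document counts: (doc_type, TP, FP, FN).
    results = [("o2o", 40, 5, 8), ("o2m", 25, 9, 12), ("m2m", 10, 6, 15)]

    # Pool the counts within each document type.
    pooled = defaultdict(lambda: [0, 0, 0])
    for doc_type, tp, fp, fn in results:
        for i, v in enumerate((tp, fp, fn)):
            pooled[doc_type][i] += v

    # Micro P/R/F1 per document type, independent of the order in train.json.
    for doc_type, (tp, fp, fn) in pooled.items():
        p, r = tp / (tp + fp), tp / (tp + fn)
        print(doc_type, p, r, 2 * p * r / (p + r))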

Spico197 (Owner) commented Jul 1, 2022

Your reproduced results are interesting, since the original Doc2EDAG paper reports a macro-averaged F1 score of 76.3. Your micro-averaged score is only 76.3, which means your macro-averaged score is far below the reported one.
Did you change the exec script that Doc2EDAG provides?

CarlanLark (Author) commented Jul 2, 2022

I made the reproduction with the following steps:

  1. git clone https://github.com/dolphin-zs/Doc2EDAG.git
  2. unzip Data.zip
  3. modify the launch command in train_multi.sh from
     python -m torch.distributed.launch --nproc_per_node ${NUM_GPUS} run_dee_task.py $*
     to
     python3 -m torch.distributed.launch --nproc_per_node ${NUM_GPUS} run_dee_task.py $*
  4. ./train_multi.sh 8 --task_name [TASK_NAME]

I didn't change any parameters in the Doc2EDAG code, and I got the results above.

Now I think the package versions may be causing the differences between our reproductions.
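One way to check is to record the exact package versions on both sides and diff them; a minimal sketch (the package list is an assumption, and importlib.metadata needs Python 3.8+):

    from importlib.metadata import PackageNotFoundError, version

    # Print versions of packages most likely to affect training behaviour.
    for pkg in ("torch", "numpy", "pytorch-pretrained-bert"):
        try:
            print(pkg, version(pkg))
        except PackageNotFoundError:
            print(pkg, "not installed")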

Spico197 (Owner) commented Jul 2, 2022

Sorry, I don't know... Maybe you could open an issue in the original repo.

Spico197 added the help wanted label Jul 2, 2022
CarlanLark (Author) commented

I will change my Python package versions and run the reproduction again. Looking forward to the final results.

Spico197 closed this as completed Jul 9, 2022