# Reproducing results

This notebook will walk you through reproducing the main results from [our paper](https://aclanthology.org/2022.bionlp-1.2/).

## 🔧 Install the prerequisites

In [None]:
# The Colab environment comes with Python 3.7, but several dependencies (like NumPy) require Python >=3.8.
# This cell can be removed if Colab ever updates the Python in its environment to >=3.8.
# For the solution, see: https://stackoverflow.com/q/60775160/6578628
# For the issue tracking Colab's python update, see: https://github.com/googlecolab/colabtools/issues/1880
!wget -O mini.sh https://repo.anaconda.com/miniconda/Miniconda3-py38_4.8.2-Linux-x86_64.sh
!chmod +x mini.sh
!bash ./mini.sh -b -f -p /usr/local
!pip install ipykernel

In [None]:
!pip install git+https://github.com/JohnGiorgi/seq2rel.git
!pip install git+https://github.com/JohnGiorgi/seq2rel-ds.git

In [16]:
import os

DATA_DIR = "datasets"
OUTPUT_DIR = "output"

## CDR

We will use the CDR corpus to explain the details, and the rest of the datasets will be given without comment.

### End-to-end

To evaluate the end-to-end models, all you need to do is provide the `model_name`, which can be any of the pretrained models found [here](https://github.com/JohnGiorgi/seq2rel/releases/tag/pretrained-models).

In [None]:
model_name = "cdr"

In [23]:
# Set the directories to save the datasets and model results.
preprocessed_data_dir = os.path.join(DATA_DIR, model_name)
output_dir = os.path.join(OUTPUT_DIR, model_name)

# AllenNLP doesn't create this directory for us, so create it here.
!mkdir -p "$output_dir"

# Set the URL of the pretrained model.
pretrained_model_url = f"https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/{model_name}.tar.gz"
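
Every section of this notebook repeats the same setup, so it can help to see it in one place. The following is a minimal sketch, not part of `seq2rel` itself: the helper name `experiment_paths` and the constant `RELEASE_URL` are hypothetical and introduced only for illustration.

```python
import os

DATA_DIR = "datasets"
OUTPUT_DIR = "output"
# Hypothetical constant: the release URL prefix used throughout this notebook.
RELEASE_URL = "https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models"


def experiment_paths(model_name):
    """Return the dataset directory, output directory and model URL for a model name."""
    return {
        "preprocessed_data_dir": os.path.join(DATA_DIR, model_name),
        "output_dir": os.path.join(OUTPUT_DIR, model_name),
        "pretrained_model_url": f"{RELEASE_URL}/{model_name}.tar.gz",
    }


paths = experiment_paths("cdr")
# Same effect as `mkdir -p`: create the output directory if it is missing.
os.makedirs(paths["output_dir"], exist_ok=True)
print(paths["pretrained_model_url"])
```

Swapping `"cdr"` for any other pretrained model name gives the values used in the later sections.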

Then, we can use [seq2rel-ds](https://github.com/JohnGiorgi/seq2rel-ds) to download and preprocess the dataset.

In [25]:
!seq2rel-ds cdr main "$preprocessed_data_dir" --combine-train-valid

✔ Downloaded the corpus.
ℹ Training and validation sets will be combined into one train set.
✔ Preprocessed the data.
✔ Preprocessed data saved to /content/datasets/cdr.


Lastly, we can evaluate this model using the [`allennlp evaluate`](https://docs.allennlp.org/main/api/commands/evaluate/) command.

In [32]:
!allennlp evaluate "$pretrained_model_url" "$preprocessed_data_dir/test.tsv" \
    --output-file "$output_dir/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "$output_dir/test_predictions.jsonl" \
    --include-package "seq2rel"

2022-04-15 20:51:41,976 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-04-15 20:51:43,635 - INFO - cached_path - cache of https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr.tar.gz is up-to-date
2022-04-15 20:51:43,635 - INFO - allennlp.models.archival - loading archive file https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr.tar.gz from cache at /root/.allennlp/cache/da436b73452adc6becdb387839c37b12a8fbf93f16990fd6a63accdc56cc39c1.b998c18b6f8de86129a44212ca3cf410f9b51dafcb920e3e7211b62c89d602ff
2022-04-15 20:51:43,636 - INFO - allennlp.models.archival - extracting archive file /root/.allennlp/cache/da436b73452adc6becdb387839c37b12a8fbf93f16990fd6a63accdc56cc39c1.b998c18b6f8de86129a44212ca3cf410f9b51dafcb920e3e7211b62c89d602ff to temp dir /tmp/tmpit3od2c5
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the 
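
Once the evaluation finishes, the metrics can be read from the file passed to `--output-file`. The sketch below assumes the file holds either a single JSON object or one JSON object per line, and it makes no assumption about which metric names appear. The helper `load_metrics` is hypothetical and introduced here for illustration.

```python
import json
import os


def load_metrics(path):
    """Load metrics from a file holding one JSON object, or one JSON object per line."""
    with open(path) as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: merge the objects found on each non-empty line.
        metrics = {}
        for line in text.splitlines():
            if line.strip():
                metrics.update(json.loads(line))
        return metrics


metrics_path = os.path.join("output", "cdr", "test_metrics.jsonl")
if os.path.exists(metrics_path):
    metrics = load_metrics(metrics_path)
    for name in sorted(metrics):
        print(f"{name}: {metrics[name]}")
```

The `os.path.exists` guard means the cell does nothing if the evaluation has not been run yet.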

### Entity hinting

Evaluating the models with entity hinting works similarly: append `_hints` to `model_name`.

In [None]:
model_name = "cdr_hints"

In [None]:
preprocessed_data_dir = os.path.join(DATA_DIR, model_name)
output_dir = os.path.join(OUTPUT_DIR, model_name)
!mkdir -p "$output_dir"
pretrained_model_url = f"https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/{model_name}.tar.gz"

Then, add the argument `--entity-hinting "gold"` to the call to `seq2rel-ds`.

In [None]:
!seq2rel-ds cdr main "$preprocessed_data_dir" --combine-train-valid --entity-hinting "gold"
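
The only difference between the end-to-end and entity hinting preprocessing calls is the extra flag. As a rough sketch, the command can be assembled programmatically. The helper `preprocess_command` is hypothetical, and it uses only the `seq2rel-ds` arguments that appear in this notebook.

```python
def preprocess_command(corpus, output_dir, input_dir=None,
                       entity_hinting=False, combine_train_valid=False):
    """Build the argument list for a `seq2rel-ds <corpus> main` call."""
    command = ["seq2rel-ds", corpus, "main"]
    if input_dir is not None:
        # Some corpora (DGM in this notebook) take a local input directory first.
        command.append(input_dir)
    command.append(output_dir)
    if combine_train_valid:
        command.append("--combine-train-valid")
    if entity_hinting:
        command.extend(["--entity-hinting", "gold"])
    return command


command = preprocess_command(
    "cdr", "datasets/cdr_hints", entity_hinting=True, combine_train_valid=True
)
print(" ".join(command))
# prints: seq2rel-ds cdr main datasets/cdr_hints --combine-train-valid --entity-hinting gold
```

The returned list could be passed to `subprocess.run(command, check=True)` outside of a notebook, where the `!` shell syntax is not available.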

The call to `allennlp evaluate` is unchanged.

In [34]:
# Takes ~5min.
!allennlp evaluate "$pretrained_model_url" "$preprocessed_data_dir/test.tsv" \
    --output-file "$output_dir/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "$output_dir/test_predictions.jsonl" \
    --include-package "seq2rel"

✔ Downloaded the corpus.
ℹ Entity hints will be inserted into the source text using the gold annotations.
ℹ Training and validation sets will be combined into one train set.
✔ Preprocessed the data.
✔ Preprocessed data saved to /content/datasets/cdr_hints.
2022-04-15 21:01:35,039 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-04-15 21:01:37,202 - INFO - cached_path - cache of https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr_hints.tar.gz is up-to-date
2022-04-15 21:01:37,202 - INFO - allennlp.models.archival - loading archive file https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/cdr_hints.tar.gz from cache at /root/.allennlp/cache/5d845bebc5887213bab7c90a311e51d6dff9a03fb60648a6498d58be8397166c.82548b1687f75978154d471c6ead95e2dd4d865a01baaba9fa7873d62232ffbe
2022-04-15 21:01:37,203 - INFO - allennlp.models.archival - extracting arc

## GDA

### End-to-end

In [37]:
model_name = "gda"

preprocessed_data_dir = os.path.join(DATA_DIR, model_name)
output_dir = os.path.join(OUTPUT_DIR, model_name)
!mkdir -p "$output_dir"
pretrained_model_url = f"https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/{model_name}.tar.gz"

!seq2rel-ds gda main "$preprocessed_data_dir"

# Takes ~10min.
!allennlp evaluate "$pretrained_model_url" "$preprocessed_data_dir/test.tsv" \
    --output-file "$output_dir/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "$output_dir/test_predictions.jsonl" \
    --include-package "seq2rel"

✔ Downloaded the corpus.
✔ Preprocessed the training data.
✔ Preprocessed the test data.
ℹ Holding out 20.00% of the training data as a validation set.
✔ Preprocessed data saved to /content/datasets/gda.
2022-04-15 21:15:53,401 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-04-15 21:15:55,568 - INFO - cached_path - cache of https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/gda.tar.gz is up-to-date
2022-04-15 21:15:55,568 - INFO - allennlp.models.archival - loading archive file https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/gda.tar.gz from cache at /root/.allennlp/cache/473102e8cdb77dbc7cc8b70355bce7f765767b987cebe6b64028772bfd438f59.2c835eb34375fc9c9206dcfe8fa5ad0c17af767af48e3a2d332667b8d785b59b
2022-04-15 21:15:55,569 - INFO - allennlp.models.archival - extracting archive file /root/.allennlp/cache/473102e8cdb77dbc7cc8b70355bce

### Entity hinting

In [40]:
model_name = "gda_hints"

preprocessed_data_dir = os.path.join(DATA_DIR, model_name)
output_dir = os.path.join(OUTPUT_DIR, model_name)

!mkdir -p "$output_dir"
pretrained_model_url = f"https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/{model_name}.tar.gz"

!seq2rel-ds gda main "$preprocessed_data_dir" --entity-hinting "gold"

# Takes ~10min.
!allennlp evaluate "$pretrained_model_url" "$preprocessed_data_dir/test.tsv" \
    --output-file "$output_dir/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "$output_dir/test_predictions.jsonl" \
    --include-package "seq2rel"

✔ Downloaded the corpus.
ℹ Entity hints will be inserted into the source text using the gold annotations.
✔ Preprocessed the training data.
✔ Preprocessed the test data.
ℹ Holding out 20.00% of the training data as a validation set.
✔ Preprocessed data saved to /content/datasets/gda_hints.
2022-04-15 21:29:32,675 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-04-15 21:29:34,629 - INFO - cached_path - https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/gda_hints.tar.gz not found in cache, downloading to /root/.allennlp/cache/85c523a0511e0717bf7736a8ca65fcf60a0b17a75370d2d03a4f2f4510a547ad.c3821cfed4a85d8d0ae199653b4c856e7506ab4786d45c182e0c3f47110f19ba
downloading: 100%|##########| 420M/420M [00:47<00:00, 9.28MiB/s]
2022-04-15 21:30:22,220 - INFO - allennlp.models.archival - loading archive file https://github.com/JohnGiorgi/seq2rel/releases/down

## DGM

The DGM corpus must be downloaded manually from [here](https://hanover.azurewebsites.net/downloads/naacl2019.aspx). The cells below expect the corpus to exist at `DATA_DIR/naacl2019`.
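
Because this corpus is not downloaded automatically, it can be worth checking that it is in place before preprocessing. This is a minimal sketch: the helper `corpus_is_present` is hypothetical, and it only checks that the directory exists and is non-empty, not that its contents are complete.

```python
import os

DATA_DIR = "datasets"


def corpus_is_present(path):
    """Return True if `path` is an existing, non-empty directory."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0


dgm_dir = os.path.join(DATA_DIR, "naacl2019")
if corpus_is_present(dgm_dir):
    print(f"Found the DGM corpus at {dgm_dir}.")
else:
    print(f"Expected the DGM corpus at {dgm_dir}. Download and extract it there before continuing.")
```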

### End-to-end

In [39]:
model_name = "dgm"

# This expects that you have downloaded the DGM corpus and that it lives at DATA_DIR/naacl2019
data_dir = os.path.join(DATA_DIR, "naacl2019")

preprocessed_data_dir = os.path.join(DATA_DIR, model_name)
output_dir = os.path.join(OUTPUT_DIR, model_name)
!mkdir -p "$output_dir"
pretrained_model_url = f"https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/{model_name}.tar.gz"

!seq2rel-ds dgm main "$data_dir" "$preprocessed_data_dir"

!allennlp evaluate "$pretrained_model_url" "$preprocessed_data_dir/test.tsv" \
    --output-file "$output_dir/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "$output_dir/test_predictions.jsonl" \
    --include-package "seq2rel"

### Entity hinting

In [None]:
model_name = "dgm_hints"

# This expects that you have downloaded the DGM corpus and that it lives at DATA_DIR/naacl2019
data_dir = os.path.join(DATA_DIR, "naacl2019")

preprocessed_data_dir = os.path.join(DATA_DIR, model_name)
output_dir = os.path.join(OUTPUT_DIR, model_name)
!mkdir -p "$output_dir"
pretrained_model_url = f"https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/{model_name}.tar.gz"

!seq2rel-ds dgm main "$data_dir" "$preprocessed_data_dir" --entity-hinting "gold"

!allennlp evaluate "$pretrained_model_url" "$preprocessed_data_dir/test.tsv" \
    --output-file "$output_dir/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "$output_dir/test_predictions.jsonl" \
    --include-package "seq2rel"

## DocRED

DocRED is only evaluated in the end-to-end setting.
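
For reference, the model names used across the sections of this notebook can be enumerated as follows. The `CORPORA` mapping is a hypothetical convenience that records which corpora are also evaluated with entity hinting.

```python
# Corpora covered in this notebook, and whether an entity hinting model is also evaluated.
CORPORA = {"cdr": True, "gda": True, "dgm": True, "docred": False}

model_names = []
for corpus, has_hints in CORPORA.items():
    model_names.append(corpus)
    if has_hints:
        model_names.append(f"{corpus}_hints")

print(model_names)
# prints: ['cdr', 'cdr_hints', 'gda', 'gda_hints', 'dgm', 'dgm_hints', 'docred']
```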

In [42]:
model_name = "docred"

preprocessed_data_dir = os.path.join(DATA_DIR, model_name)
output_dir = os.path.join(OUTPUT_DIR, model_name)
!mkdir -p "$output_dir"
pretrained_model_url = f"https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/{model_name}.tar.gz"

!seq2rel-ds docred main "$preprocessed_data_dir"

# Takes ~30min.
!allennlp evaluate "$pretrained_model_url" "$preprocessed_data_dir/test.tsv" \
    --output-file "$output_dir/test_metrics.jsonl" \
    --cuda-device 0 \
    --predictions-output-file "$output_dir/test_predictions.jsonl" \
    --include-package "seq2rel"

✔ Downloaded the corpus.
✔ Preprocessed the data.
✔ Preprocessed data saved to /content/datasets/docred.
2022-04-15 21:48:17,659 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-04-15 21:48:19,999 - INFO - cached_path - cache of https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/docred.tar.gz is up-to-date
2022-04-15 21:48:19,999 - INFO - allennlp.models.archival - loading archive file https://github.com/JohnGiorgi/seq2rel/releases/download/pretrained-models/docred.tar.gz from cache at /root/.allennlp/cache/fc93f4b028785d1b77ece2ed95d5a278dbd8f48bb77b9be3c0e0b0713a694fe5.36dd151a95000c7b53adee599c6002eb74b5393a52d4f9044eb78c8c82c186b5
2022-04-15 21:48:20,000 - INFO - allennlp.models.archival - extracting archive file /root/.allennlp/cache/fc93f4b028785d1b77ece2ed95d5a278dbd8f48bb77b9be3c0e0b0713a694fe5.36dd151a95000c7b53adee599c6002eb74b5393a52d4f9044eb78c8c82c186b5 to temp dir /tmp/tmpvi


## ♻️ Conclusion

That's it! In this notebook, we covered how to reproduce the main results from our paper. Please see [our paper](https://aclanthology.org/2022.bionlp-1.2/) and [repo](https://github.com/JohnGiorgi/seq2rel) for more details, and don't hesitate to open an issue if you have any trouble!

