# Reproduction Notebook for
# "Temporal Alignment: Evaluation for Temporal Relation Extraction"

## Setup

You will need two Python environments to fully reproduce all experiments in our paper.

1. Python 3.8+ - the environment you will be using to run this Jupyter notebook
2. Python 2.7 - an additional Python environment used to run commands that depend on legacy code

We will do as much work as possible in Python 3.8+ (within this notebook) and execute some commands in the Python 2.7 env from time to time.

### Python 3.8+ Env

This notebook is tested on Python `3.8.16`. However, it should work with any Python version `3.8` or above.

In [1]:
# check Python
!python --version

Python 3.8.16


Install dependencies

The only dependency we need is `tieval`. The notebook is tested with `tieval==0.1.1`, so that is what we will be installing.

In principle, installation should be as simple as

```
pip install tieval==0.1.1
```

However, we have found better success with the following commands.

In [2]:
# !pip install torch==1.11.0+cpu --extra-index-url https://download.pytorch.org/whl/cpu
# !pip install --default-timeout=300 tieval==0.1.1 allennlp==2.9.3

Make our evaluation library importable

In [3]:
# !pip install -r requirements.txt -r dev-requirements.txt

In [4]:
import sys


sys.path.append("lib")

### Python 2.7

This environment is used for running UzZaman's Temporal Awareness implementation.

#### Setting up a clean Python 2.7 env with docker.

Pull the `python:2.7.18` docker image
```
docker pull python:2.7.18
```

Run the docker image as a detached container named "temporal-py27" with this repository mounted.  
Be sure to replace `/home/user/dev/repo` with an absolute path to this repo on your machine.
```
docker run -d --name temporal-py27 -v /home/user/dev/repo:/workspace -u 1000:1000 python:2.7.18 tail -f /dev/null
```

To shell into the "temporal-py27" container, use the following command
```
docker exec -it temporal-py27 bash
```

## Experiments on TempEval3

### Prep data

NOTE: since tieval stores the tlinks in a `set`, the iteration order of annotations is inconsistent across runs (Python restarts). This leads to different outcomes for some evaluation metrics.

We sort the annotation entries for consistent evaluation. Since the original order of the dataset is lost, we sort them by their position in the text.

Sorting does not affect greedy algorithms that remove closure violations, since the annotations in TE3 do not contain any closure violations.

In [5]:
from tieval import datasets

# `te3_ordered` will be sorted
te3_ordered = datasets.read("tempeval_3")
# `te3` kept as is
te3 = datasets.read("tempeval_3")

100%|██████████| 275/275 [00:00<00:00, 624.48it/s]
100%|██████████| 275/275 [00:00<00:00, 571.04it/s]


In [6]:
def tlink_pos(tlink):
    "get position of a tlink"
    source_offset = tlink.source.offsets
    target_offset = tlink.target.offsets
    if source_offset is None:
        assert tlink.source.id == "t0"
        source_offset = (0, 0)
    if target_offset is None:
        assert tlink.target.id == "t0"
        target_offset = (0, 0)
        
    if source_offset < target_offset:
        return source_offset, target_offset
    else:
        return target_offset, source_offset

te3_test_tlinks = {}

for entry in te3_ordered.test:
    tlinks = sorted(entry.tlinks, key=tlink_pos)
    te3_test_tlinks[entry.name] = tlinks
    entry.tlinks = tlinks
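
As a minimal, self-contained sketch of why this sort key yields a reproducible order, the snippet below applies the same idea to stand-in objects. The `Link` class, its field names, and the offsets are hypothetical and only mimic the shape of tieval's tlinks; they are not part of tieval.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]


@dataclass(frozen=True)
class Link:
    # Hypothetical stand-in for a tlink: a name plus character offsets.
    name: str
    source: Optional[Span]
    target: Optional[Span]


def link_pos(link: Link) -> Tuple[Span, Span]:
    # Entities without offsets (e.g. the document creation time) sort first.
    src = link.source if link.source is not None else (0, 0)
    tgt = link.target if link.target is not None else (0, 0)
    # Order the two endpoints so the key ignores link direction.
    return (src, tgt) if src < tgt else (tgt, src)


# A set has no guaranteed iteration order across interpreter restarts.
links = {
    Link("l1", (10, 15), (2, 5)),
    Link("l2", None, (7, 9)),
    Link("l3", (2, 5), (20, 25)),
}

# Sorting by position gives the same list on every run.
names = [link.name for link in sorted(links, key=link_pos)]
print(names)
```

Because the key depends only on text offsets, the resulting order does not depend on how the `set` happened to be iterated.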

### Inference with CogComp2

Load model and perform inference on dataset

#### Load model

In [7]:
# import nltk

# nltk.download('punkt')
# nltk.download('wordnet')

In [8]:
# from tieval.models.classification.temporal_relation import CogCompTime2

# model = CogCompTime2()
# model

#### Inference & Save

We perform three inference passes since `CogCompTime2` is not deterministic. The body of the paper reports on the 1st run. The 2nd and 3rd runs are reported in the appendix.

Readers looking to reproduce the exact evaluation should load the saved inference output. We have inlined the values you should get for the 1st inference in this notebook. For values from the 2nd and 3rd inference, please refer to the paper.

In [9]:
# import pickle

In [10]:
# prediction = model.predict(te3_ordered.test)
# with open("output/inference-1.pkl", "wb") as file:
#     pickle.dump(prediction, file)

Two additional runs for the appendix.

In [11]:
# prediction = model.predict(te3.test)
# with open("output/inference-2.pkl", "wb") as file:
#     pickle.dump(prediction, file)

In [12]:
# prediction = model.predict(te3.test)
# with open("output/inference-3.pkl", "wb") as file:
#     pickle.dump(prediction, file)

#### Load Inference

Loading the saved inference is preferred since CogCompTime2 is not deterministic.

In [13]:
infer_pass = 1

In [14]:
# infer_pass = 2

In [15]:
# infer_pass = 3

In [16]:
infer_path = f"output/inference-{infer_pass}.pkl"

In [17]:
import pickle

with open(infer_path, "rb") as file:
    prediction = pickle.load(file)

### UzZaman - Temporal Awareness

#### Prep

Prepare files as input for [UzZaman - Temporal Awareness evaluator](https://github.com/naushadzaman/tempeval3_toolkit)

In [18]:
import os
from pathlib import Path

ref_path = Path(f"output/tempeval3_toolkit/ref-{infer_pass}")
pred_path = Path(f"output/tempeval3_toolkit/pred-{infer_pass}")

os.makedirs(ref_path, exist_ok=True)
os.makedirs(pred_path, exist_ok=True)

Tooling for rendering TLinks to TimeML format

Relations not in `tlink_rel_types` will be removed

In [19]:
from tieval import entities

from temporal_extract.rep.raw_timeml import tlink_rel_types


print("tlink_rel_types:", tlink_rel_types)

def render_timeml_tlinks(tlinks):
    """Render tieval tlinks into UzZaman
    Temporal Awareness TimeML TLINKS compatible file"""
    
    buffer = ""
    for tlink in tlinks:
        if isinstance(tlink.source, entities.Event):
            src_type = "eventInstanceID"
            src_id = tlink.source.eiid
        else:
            src_type = "timeID"
            src_id = tlink.source.id
        if isinstance(tlink.target, entities.Event):
            tgt_type = "relatedToEventInstance"
            tgt_id = tlink.target.eiid
        else:
            tgt_type = "relatedToTime"
            tgt_id = tlink.target.id
            
        if tlink.relation.interval not in tlink_rel_types:
            # skip relations not in `tlink_rel_types` and report them
            print("removed relation", tlink.relation.interval)
            continue
            
        buffer += f'<TLINK lid="{tlink.id}" {src_type}="{src_id}" relType="{tlink.relation.interval}" {tgt_type}="{tgt_id}"/>\n'
    return buffer

tlink_rel_types: {'BEFORE', 'ENDS', 'AFTER', 'DURING', 'IAFTER', 'IBEFORE', 'BEGINS', 'INCLUDES', 'SIMULTANEOUS', 'BEGUN_BY', 'ENDED_BY', 'DURING_INV', 'IDENTITY', 'IS_INCLUDED'}


In [20]:
for name, tlinks in te3_test_tlinks.items():
    with open(ref_path / f"{name}.tml", "w") as file:
        file.write(render_timeml_tlinks(tlinks))

In [21]:
for name, tlinks in prediction.items():
    with open(pred_path / f"{name}.tml", "w") as file:
        file.write(render_timeml_tlinks(tlinks))

removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
removed relation VAGUE
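
As a quick sanity check, a single rendered line can be parsed back with the standard library to confirm that it is a well-formed element. This is a minimal sketch; the IDs `l1`, `ei1`, and `t0` are hypothetical example values and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# One line in the shape produced by `render_timeml_tlinks` (IDs are made up).
line = '<TLINK lid="l1" eventInstanceID="ei1" relType="BEFORE" relatedToTime="t0"/>'

# Parsing raises xml.etree.ElementTree.ParseError if the line is malformed.
elem = ET.fromstring(line)
attrs = dict(elem.attrib)
print(elem.tag, attrs)
```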


#### Evaluation

Clone https://github.com/naushadzaman/tempeval3_toolkit

Use the following code to generate the commands for evaluating with UzZaman's Temporal Awareness implementation in a Python 2.7 env.

In [22]:
path_to_tempeval3_toolkit_repo = "repos/tempeval3_toolkit"
path_to_this_repo = "temporal_alignment"

In [23]:
ta_script_path = f"{path_to_tempeval3_toolkit_repo}/evaluation-relations/temporal_evaluation.py"
output_prefix = f"output/tempeval3_toolkit/eval-ta-{infer_pass}"
print(f"""python {ta_script_path} \\
    {path_to_this_repo}/{ref_path} \\
    {path_to_this_repo}/{pred_path} \\
    1 > {path_to_this_repo}/{output_prefix}.txt

python {ta_script_path} \\
    {path_to_this_repo}/{ref_path} \\
    {path_to_this_repo}/{pred_path} \\
    1 acl11 > {path_to_this_repo}/{output_prefix}-acl11.txt
    
python {ta_script_path} \\
    {path_to_this_repo}/{ref_path} \\
    {path_to_this_repo}/{pred_path} \\
    1 implicit_in_recall > {path_to_this_repo}/{output_prefix}-implicit_in_recall.txt

""")

python repos/tempeval3_toolkit/evaluation-relations/temporal_evaluation.py \
    temporal_alignment/output/tempeval3_toolkit/ref-1 \
    temporal_alignment/output/tempeval3_toolkit/pred-1 \
    1 > temporal_alignment/output/tempeval3_toolkit/eval-ta-1.txt

python repos/tempeval3_toolkit/evaluation-relations/temporal_evaluation.py \
    temporal_alignment/output/tempeval3_toolkit/ref-1 \
    temporal_alignment/output/tempeval3_toolkit/pred-1 \
    1 acl11 > temporal_alignment/output/tempeval3_toolkit/eval-ta-1-acl11.txt
    
python repos/tempeval3_toolkit/evaluation-relations/temporal_evaluation.py \
    temporal_alignment/output/tempeval3_toolkit/ref-1 \
    temporal_alignment/output/tempeval3_toolkit/pred-1 \
    1 implicit_in_recall > temporal_alignment/output/tempeval3_toolkit/eval-ta-1-implicit_in_recall.txt




In [24]:
def display_scores(path):
    with open(path) as file:
        lines = file.readlines()
    # print tail of evaluation file
    print("".join(lines[-4:-1]))
    
    # format scores as a LaTeX table row (P & R & F1)
    f1, p, r = lines[-3].split("\t")[2:5]
    f1 = float(f1) / 100
    p = float(p) / 100
    r = float(r) / 100
    return f"{p:.4f} & {r:.4f} & {f1:.4f}"

In [25]:
display_scores(f"output/tempeval3_toolkit/eval-ta-{infer_pass}.txt")

Temporal Score	F1	P	R
		39.8088	40.2537	39.3736	
Overall Temporal Awareness Score (F1 score): 39.8088



'0.4025 & 0.3937 & 0.3981'

should be
```
Temporal Score	F1	P	R
		39.8088	40.2537	39.3736	
Overall Temporal Awareness Score (F1 score): 39.8088

'0.4025 & 0.3937 & 0.3981'
```
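
The parsing step in `display_scores` can be illustrated in isolation. The sketch below assumes the score line is tab-separated with two leading empty fields, which is what the `split("\t")[2:5]` slice in `display_scores` relies on; the line itself is copied from the output above.

```python
# Score line as written by the evaluator (assumed tab-separated).
score_line = "\t\t39.8088\t40.2537\t39.3736\t\n"

# Fields 0 and 1 are empty; fields 2-4 hold F1, precision, and recall.
fields = score_line.split("\t")
f1, p, r = (float(value) / 100 for value in fields[2:5])

# Same "P & R & F1" row layout that `display_scores` returns.
row = f"{p:.4f} & {r:.4f} & {f1:.4f}"
print(row)
```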

In [26]:
display_scores(f"output/tempeval3_toolkit/eval-ta-{infer_pass}-acl11.txt")

Temporal Score	F1	P	R
		40.6634	41.1894	40.1507	
Overall Temporal Awareness Score (F1 score): 40.6634



'0.4119 & 0.4015 & 0.4066'

should be
```
Temporal Score	F1	P	R
		40.6634	41.1894	40.1507	
Overall Temporal Awareness Score (F1 score): 40.6634

'0.4119 & 0.4015 & 0.4066'
```

In [27]:
display_scores(f"output/tempeval3_toolkit/eval-ta-{infer_pass}-implicit_in_recall.txt")

Temporal Score	F1	P	R
		39.8094	40.2537	39.3747	
Overall Temporal Awareness Score (F1 score): 39.8094



'0.4025 & 0.3937 & 0.3981'

should be
```
Temporal Score	F1	P	R
		39.8094	40.2537	39.3747	
Overall Temporal Awareness Score (F1 score): 39.8094

'0.4025 & 0.3937 & 0.3981'
```

### TiEval - Temporal Awareness

Note: we are not using `evaluate.temporal_awareness` since it would produce scores on a per-document basis.
However, we want to evaluate on the whole dataset.

We will use `te3` since `tieval.evaluate.temporal_recall` and `tieval.evaluate.temporal_precision` operate on a `set` instead of a `list`.

NOTE: since tieval's Temporal Awareness implementation is also affected by ordering, the numerical results will not be consistent across restarts.

In [28]:
from tieval import evaluate


ref_total = 0
ref_correct = 0
pred_total = 0
pred_correct = 0

for doc in te3.test:

    ref = doc.tlinks
    pred = set(prediction[doc.name])

    cor, total = evaluate.temporal_recall(pred, ref)
    ref_total += total
    ref_correct += cor
    cor, total = evaluate.temporal_precision(pred, ref)
    pred_total += total
    pred_correct += cor
    
recall = ref_correct / ref_total
precision = pred_correct / pred_total

temporal_awareness = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0

In [29]:
f"{precision:.4f} & {recall:.4f} & {temporal_awareness:.4f}"

'0.3983 & 0.3940 & 0.3961'

should be **approximately**

`'0.3983 & 0.3940 & 0.3961'`
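
The difference between scoring the whole dataset and scoring per document can be seen in a small sketch. The counts below are hypothetical and chosen only for illustration; `f1_score` is a local helper, not a tieval function.

```python
# Hypothetical per-document counts:
# (pred_correct, pred_total, ref_correct, ref_total)
doc_counts = {
    "doc_a": (1, 2, 1, 4),
    "doc_b": (3, 4, 3, 4),
}


def f1_score(precision, recall):
    # Harmonic mean of precision and recall, guarded against division by zero.
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Dataset-level (micro) scores: pool the counts, then compute the scores once.
pred_correct = sum(c[0] for c in doc_counts.values())
pred_total = sum(c[1] for c in doc_counts.values())
ref_correct = sum(c[2] for c in doc_counts.values())
ref_total = sum(c[3] for c in doc_counts.values())
micro_f1 = f1_score(pred_correct / pred_total, ref_correct / ref_total)

# Per-document (macro) scores: score each document, then average.
macro_f1 = sum(
    f1_score(c[0] / c[1], c[2] / c[3]) for c in doc_counts.values()
) / len(doc_counts)

print(f"micro F1: {micro_f1:.4f}, macro F1: {macro_f1:.4f}")
```

The two values differ because pooling weights each relation equally, while averaging weights each document equally.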

### Ours

Tooling for handling TiEval data and annotations

In [30]:
from typing import List, Tuple, Union

from tieval import entities


def get_id(ins: Union[entities.Event, entities.Timex]) -> str:
    "get ID from a TiEval entity"
    if isinstance(ins, entities.Event):
        return ins.eiid
    elif isinstance(ins, entities.Timex):
        return ins.id
    else:
        raise NotImplementedError

def tieval_rels_2_te_rels(rels) -> List[Tuple[str, str, str]]:
    """Convert TiEval relations into our temporal entity relations format,
    which is a list of Tuple[source_id, target_id, relation]
    """
    return [
        (get_id(rel.source), get_id(rel.target), rel.relation.interval)
        for rel in rels
        # remove "VAGUE" relation
        if rel.relation.interval != "VAGUE"
    ]

In [31]:
from temporal_extract.rep.timeml_graph import TmlGraph
from temporal_extract.scorer.base import f_measure

#### No prediction post-processing

Prediction outputs are evaluated as-is, without conflict removal.

##### Temporal Awareness (Our implementation)

In [32]:
from temporal_extract.scorer.temporal_awareness import TemporalAwarenessScorer


for i in range(5):
    ta_scorer = TemporalAwarenessScorer(suppress_warning=i!=0)

    doc_eval = {}

    for idx, doc in enumerate(te3_ordered.test):
        # iterate over each document

        ref = doc.tlinks
        pred = prediction[doc.name]

        # convert reference relations
        ref_te_rels = tieval_rels_2_te_rels(ref)
        # convert predicted relations
        pred_te_rels = tieval_rels_2_te_rels(pred)

        # evaluate the document
        doc_eval[doc.name] = ta_scorer.evaluate_relations(ref_te_rels, pred_te_rels)

    # summarize evaluation of the dataset (all documents)
    res = ta_scorer.summarize()

    precision = res["precision"]
    recall = res["recall"]
    temporal_awareness = res["fscore"]
    print(f"{precision:.4f} & {recall:.4f} & {temporal_awareness:.4f}")


    For accurate evaluation of temporal relations between system-prediction and ref-standard please use TemporalPointAlignmentScorer or TemporalEntityAlignmentScorer instead.

    The TemporalAwarenessScorer computes "Temporal Awarness score" accoring to the paper titled "UzZaman, Naushad. Interpreting the temporal aspects of language. University of Rochester, 2012." and the reference implementation from https://github.com/naushadzaman/tempeval3_toolkit.

    This reimplementation also inherit all of the quirks of `tempeval3_toolkit` not discussed in the paper, while NOT inheriting the issues present in the original closure graph implementation.

    quirks include
     * greedy removal of closure violation
     * matching relations with removed violation causing relations

    This leads to this implementation of TemporalAwareness favoring outputs with closure violation, which is not ideal.
    
0.3979 & 0.3895 & 0.3937
0.3979 & 0.3895 & 0.3937
0.3979 & 0.3895 & 0.3937
0.3979 & 0.389

should be
```
'0.3979 & 0.3895 & 0.3937'
```

##### Temporal Entity Alignment

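We run the evaluation metric 5 times to demonstrate robustness to annotation ordering. Note that the `random.shuffle` calls in the cell below are commented out; uncomment them to evaluate with randomly ordered annotations.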

In [33]:
import random

from temporal_extract.scorer.temporal_alignment import TemporalEntityAlignmentScorer


print("Temporal Entity Alignment")

for i in range(5):
    tea_scorer = TemporalEntityAlignmentScorer()

    # store evaluation of each document (used for further analysis)
    doc_eval = {}

    for idx, doc in enumerate(te3_ordered.test):
        # iterate over each document

        # reference and predicted tlinks for this document
        ref = doc.tlinks
        pred = prediction[doc.name]

        # convert reference relations
        ref_te_rels = tieval_rels_2_te_rels(ref)
        # convert predicted relations
        pred_te_rels = tieval_rels_2_te_rels(pred)
        
        # shuffle ref and pred to demonstrate our evaluation robustness
        # random.shuffle(ref_te_rels)
        # random.shuffle(pred_te_rels)

        # evaluate the document
        doc_eval[doc.name] = tea_scorer.evaluate_relations(ref_te_rels, pred_te_rels)

    # summarize evaluation of the dataset (all documents)
    res = tea_scorer.summarize()

    precision = res["precision"]
    recall = res["recall"]
    temporal_awareness = res["fscore"]
    
    print("Run:", i)
    print(f"{precision:.4f} & {recall:.4f} & {temporal_awareness:.4f}")
    print(f"doc affected by violations: {tea_scorer.model_doc_violation_count}")
    print()
    
print(f"total docs: {len(te3.test)}")

Temporal Entity Alignment
Run: 0
0.3651 & 0.3321 & 0.3478
doc affected by violations: 4

Run: 1
0.3651 & 0.3321 & 0.3478
doc affected by violations: 4

Run: 2
0.3651 & 0.3321 & 0.3478
doc affected by violations: 4

Run: 3
0.3651 & 0.3321 & 0.3478
doc affected by violations: 4

Run: 4
0.3651 & 0.3321 & 0.3478
doc affected by violations: 4

total docs: 20


every run should be **approximately**

```
0.3651 & 0.3321 & 0.3478
0.3651 & 0.3309 & 0.3471
doc affected by violations: 4

total docs: 20
```
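
A robustness check of this kind can be sketched with a deliberately simple metric. The `exact_match_f1` function below is a hypothetical stand-in that only compares relation triples as sets; it is not `TemporalEntityAlignmentScorer`, and the relations are made up.

```python
import random


def exact_match_f1(ref_rels, pred_rels):
    # Set-based comparison, so the input order cannot influence the score.
    ref_set, pred_set = set(ref_rels), set(pred_rels)
    matched = len(ref_set & pred_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(ref_set) if ref_set else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical (source_id, target_id, relation) triples.
ref_rels = [("e1", "e2", "BEFORE"), ("e2", "e3", "BEFORE"), ("e1", "t0", "AFTER")]
pred_rels = [("e1", "e2", "BEFORE"), ("e2", "e3", "AFTER")]

# Seeded generator so the shuffles themselves are reproducible.
rng = random.Random(0)
scores = []
for _ in range(5):
    rng.shuffle(ref_rels)
    rng.shuffle(pred_rels)
    scores.append(exact_match_f1(ref_rels, pred_rels))

# An order-invariant metric yields a single distinct score across all runs.
order_invariant = len(set(scores)) == 1
print(scores, order_invariant)
```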

##### Temporal Point Alignment

In [34]:
import random

from temporal_extract.scorer.temporal_alignment import TemporalPointAlignmentScorer


print("Temporal Point Alignment")

for i in range(5):
    tpa_scorer = TemporalPointAlignmentScorer()

    # store evaluation of each document (used for further analysis)
    doc_eval = {}

    for idx, doc in enumerate(te3_ordered.test):
        # iterate over each document

        # reference and predicted tlinks for this document
        ref = doc.tlinks
        pred = prediction[doc.name]

        # convert reference relations
        ref_te_rels = tieval_rels_2_te_rels(ref)
        # convert predicted relations
        pred_te_rels = tieval_rels_2_te_rels(pred)
        
        # shuffle ref and pred to demonstrate our evaluation robustness
        random.shuffle(ref_te_rels)
        random.shuffle(pred_te_rels)

        # evaluate the document
        doc_eval[doc.name] = tpa_scorer.evaluate_relations(ref_te_rels, pred_te_rels)

    # summarize evaluation of the dataset (all documents)
    res = tpa_scorer.summarize()

    precision = res["precision"]
    recall = res["recall"]
    temporal_awareness = res["fscore"]
    
    print("Run:", i)
    print(f"{precision:.4f} & {recall:.4f} & {temporal_awareness:.4f}")
    print(f"potins affected by violations: {tpa_scorer.pt_affected}")
    print(f"total points: {tpa_scorer.pt_total}")
    print()

Temporal Point Alignment
Run: 0
0.3584 & 0.3834 & 0.3705
potins affected by violations: 26
total points: 823

Run: 1
0.3584 & 0.3834 & 0.3705
potins affected by violations: 26
total points: 823

Run: 2
0.3584 & 0.3834 & 0.3705
potins affected by violations: 26
total points: 823

Run: 3
0.3584 & 0.3834 & 0.3705
potins affected by violations: 26
total points: 823

Run: 4
0.3584 & 0.3834 & 0.3705
potins affected by violations: 26
total points: 823



every run should be

```
0.3584 & 0.3834 & 0.3705
potins affected by violations: 26
total points: 823
```

#### Greedy Violation Removal

In [35]:
core_ent_prediction = {}
ent_closure_violation_count = 0
ent_count = 0
core_ent_count = 0

for idx, doc in enumerate(te3.test):
    # filters out closure violation greedily
    
    # convert tieval to our format
    pred_te_rels = tieval_rels_2_te_rels(prediction[doc.name])
    
    # create prediction graph for doc
    ent_count += len(pred_te_rels)
    pred_graph = TmlGraph()
    # safely add relation to graph (in-order) without incurring closure violation
    _, _, closure_violation = pred_graph.safe_add_relations(pred_te_rels)
    # collect violation count
    if len(closure_violation) > 0:
        print("closure violation", len(closure_violation), "in", doc.name)
    ent_closure_violation_count += len(closure_violation)
    
    # create a new set of entity relations that encompasses all relation data of the violation-free graph
    core_ent = pred_graph.compute_core_entity_relations()
    core_ent_prediction[doc.name] = core_ent
    core_ent_count += len(core_ent)
    

print(f"\nTotal closure violation causing entity-relations: {ent_closure_violation_count}")
print(f"Total entity-relations: {ent_count}")
print(f"Total core entity-relations: {core_ent_count}")

closure violation 1 in WSJ_20130321_1145
closure violation 1 in WSJ_20130318_731
closure violation 1 in nyt_20130321_women_senate
closure violation 2 in WSJ_20130322_804

Total closure violation causing entity-relations: 5
Total entity-relations: 908
Total core entity-relations: 808
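
The idea of greedy, in-order violation removal can be sketched on a simplified problem. The code below handles only BEFORE relations and treats a relation as a violation when it would close a cycle; it is an illustrative assumption and not the `TmlGraph.safe_add_relations` implementation. The entity names are made up.

```python
def greedy_add(relations):
    """Add (a, b) pairs meaning "a BEFORE b" in order, rejecting cycle-closing ones."""
    successors = {}
    kept, rejected = [], []

    def reachable(start, goal):
        # Depth-first search over the relations accepted so far.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(successors.get(node, ()))
        return False

    for a, b in relations:
        # "a BEFORE b" contradicts the graph if b already precedes a.
        if a == b or reachable(b, a):
            rejected.append((a, b))
        else:
            kept.append((a, b))
            successors.setdefault(a, set()).add(b)
    return kept, rejected


cyclic = [("a", "b"), ("b", "c"), ("c", "a")]
_, rejected_forward = greedy_add(cyclic)
_, rejected_reversed = greedy_add(list(reversed(cyclic)))
print(rejected_forward, rejected_reversed)
```

Which relation gets rejected depends on the input order, which is why a greedy approach is sensitive to annotation ordering.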


##### Temporal Awareness (Our implementation)

In [43]:
from temporal_extract.scorer.temporal_awareness import TemporalAwarenessScorer

for i in range(5):
    ta_scorer = TemporalAwarenessScorer(suppress_warning=i!=0)

    doc_eval = {}

    for idx, doc in enumerate(te3_ordered.test):
        # iterate over each document

        # convert reference relations
        ref_te_rels = tieval_rels_2_te_rels(doc.tlinks)

        pred_te_rels = core_ent_prediction[doc.name]

        # evaluate the document
        doc_eval[doc.name] = ta_scorer.evaluate_relations(ref_te_rels, pred_te_rels)

    # summarize evaluation of the dataset (all documents)
    res = ta_scorer.summarize()

    precision = res["precision"]
    recall = res["recall"]
    temporal_awareness = res["fscore"]
    print(f"{precision:.4f} & {recall:.4f} & {temporal_awareness:.4f}")


    For accurate evaluation of temporal relations between system-prediction and ref-standard please use TemporalPointAlignmentScorer or TemporalEntityAlignmentScorer instead.

    The TemporalAwarenessScorer computes "Temporal Awarness score" accoring to the paper titled "UzZaman, Naushad. Interpreting the temporal aspects of language. University of Rochester, 2012." and the reference implementation from https://github.com/naushadzaman/tempeval3_toolkit.

    This reimplementation also inherit all of the quirks of `tempeval3_toolkit` not discussed in the paper, while NOT inheriting the issues present in the original closure graph implementation.

    quirks include
     * greedy removal of closure violation
     * matching relations with removed violation causing relations

    This leads to this implementation of TemporalAwareness favoring outputs with closure violation, which is not ideal.
    
0.3923 & 0.3895 & 0.3909
0.3923 & 0.3895 & 0.3909
0.3923 & 0.3895 & 0.3909
0.3923 & 0.389

should be
```
'0.3923 & 0.3895 & 0.3909'
```

##### Temporal Entity Alignment

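We run the evaluation metric 5 times to demonstrate robustness to annotation ordering. Note that the `random.shuffle` calls in the cell below are commented out; uncomment them to evaluate with randomly ordered annotations.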

In [37]:
import random

from temporal_extract.scorer.temporal_alignment import TemporalEntityAlignmentScorer


print("Temporal Entity Alignment")

for i in range(5):
    tea_scorer = TemporalEntityAlignmentScorer()

    # store evaluation of each document (used for further analysis)
    doc_eval = {}

    for idx, doc in enumerate(te3_ordered.test):
        # iterate over each document

        # convert reference relations
        ref_te_rels = tieval_rels_2_te_rels(doc.tlinks)

        pred_te_rels = list(core_ent_prediction[doc.name])
        
        # shuffle ref and pred to demonstrate our evaluation robustness
        # random.shuffle(ref_te_rels)
        # random.shuffle(pred_te_rels)

        # evaluate the document
        doc_eval[doc.name] = tea_scorer.evaluate_relations(ref_te_rels, pred_te_rels)

    # summarize evaluation of the dataset (all documents)
    res = tea_scorer.summarize()

    precision = res["precision"]
    recall = res["recall"]
    temporal_awareness = res["fscore"]
    
    print("Run:", i)
    print(f"{precision:.4f} & {recall:.4f} & {temporal_awareness:.4f}")
    print(f"doc affected by violations: {tea_scorer.model_doc_violation_count}")
    print()
    
print(f"total docs: {len(te3.test)}")

Temporal Entity Alignment
Run: 0
0.3923 & 0.3260 & 0.3561
doc affected by violations: 0

Run: 1
0.3923 & 0.3260 & 0.3561
doc affected by violations: 0

Run: 2
0.3923 & 0.3260 & 0.3561
doc affected by violations: 0

Run: 3
0.3923 & 0.3260 & 0.3561
doc affected by violations: 0

Run: 4
0.3923 & 0.3260 & 0.3561
doc affected by violations: 0

total docs: 20


every run should be **approximately**

```
0.3923 & 0.3260 & 0.3561
0.3923 & 0.3248 & 0.3554
doc affected by violations: 0

total docs: 20
```

##### Temporal Point Alignment

In [38]:
import random

from temporal_extract.scorer.temporal_alignment import TemporalPointAlignmentScorer


print("Temporal Point Alignment")

for i in range(5):
    tpa_scorer = TemporalPointAlignmentScorer()

    # store evaluation of each document (used for further analysis)
    doc_eval = {}

    for idx, doc in enumerate(te3_ordered.test):
        # iterate over each document

        # convert reference relations
        ref_te_rels = tieval_rels_2_te_rels(doc.tlinks)
        
        pred_te_rels = list(core_ent_prediction[doc.name])
        
        # shuffle ref and pred to demonstrate our evaluation robustness
        random.shuffle(ref_te_rels)
        random.shuffle(pred_te_rels)

        # evaluate the document
        doc_eval[doc.name] = tpa_scorer.evaluate_relations(ref_te_rels, pred_te_rels)

    # summarize evaluation of the dataset (all documents)
    res = tpa_scorer.summarize()

    precision = res["precision"]
    recall = res["recall"]
    temporal_awareness = res["fscore"]
    
    print("Run:", i)
    print(f"{precision:.4f} & {recall:.4f} & {temporal_awareness:.4f}")
    print(f"potins affected by violations: {tpa_scorer.pt_affected}")
    print(f"total points: {tpa_scorer.pt_total}")
    print()

Temporal Point Alignment
Run: 0
0.3923 & 0.4197 & 0.4055
potins affected by violations: 0
total points: 808

Run: 1
0.3923 & 0.4197 & 0.4055
potins affected by violations: 0
total points: 808

Run: 2
0.3923 & 0.4197 & 0.4055
potins affected by violations: 0
total points: 808

Run: 3
0.3923 & 0.4197 & 0.4055
potins affected by violations: 0
total points: 808

Run: 4
0.3923 & 0.4197 & 0.4055
potins affected by violations: 0
total points: 808



should be

```
0.3923 & 0.4197 & 0.4055
potins affected by violations: 0
total points: 808
```

## Temporal Awareness Bug

This section covers how to reproduce the bug we found in [UzZaman's Temporal Awareness evaluator](https://github.com/naushadzaman/tempeval3_toolkit).

<img src="bug/closure-bug.png" alt="diagram of closure bug" style="width:50%;"/>

We have prepared three sets of TimeML annotations which contain the relations shown in the image above.

1. `bug/ref-1`
    * the relations r0-r5 are written in order of the relation name (r0, r1, r2, r3, r4, r5)
2. `bug/ref-2`
    * the relations r0-r5 are written in a different order (r0, r1, r2, r3, **r5, r4**)
3. `bug/pred`
    * only the relation r6 is written
    

Ideally, a closure graph should be able to infer that 

> "e2 BEFORE e7" (r6)

since

> "e2 BEFORE e3" (r5) and "e3 BEFORE e7" (r3)

However, [UzZaman's Temporal Awareness evaluator](https://github.com/naushadzaman/tempeval3_toolkit) is unable to infer "r6", as seen in the results below.

When evaluating `bug/ref-1` against `bug/pred`, we get a precision of 0.0, which means that the only relation in `bug/pred` (i.e., r6) is not matched against the relations in `bug/ref-1`. This is incorrect because `bug/ref-1` contains r5 and r3, which together imply r6.

This bug is hard to detect: merely changing the order of the relations (i.e., `bug/ref-2`) changes the result, yielding a precision of 1.0.
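
For comparison, the expected inference can be sketched with a plain transitive closure over BEFORE relations. This is a minimal illustration that uses only r3 and r5 as stated above; the `before_closure` helper is hypothetical and is neither the toolkit's nor our library's implementation.

```python
from itertools import permutations


def before_closure(pairs):
    """Return the transitive closure of (a, b) pairs meaning "a BEFORE b"."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Combine every "a BEFORE b" with every "b BEFORE d" until nothing new appears.
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


r3 = ("e3", "e7")  # e3 BEFORE e7
r5 = ("e2", "e3")  # e2 BEFORE e3
r6 = ("e2", "e7")  # e2 BEFORE e7, the relation that should be inferred

# Check every input order: a correct closure infers r6 regardless of order.
results = [r6 in before_closure(order) for order in permutations([r3, r5])]
print(results)
```

Since the closure is computed to a fixed point, the order in which the relations are supplied does not change the outcome.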

In [39]:
ta_script_path = f"{path_to_tempeval3_toolkit_repo}/evaluation-relations/temporal_evaluation.py"
output_prefix = f"output/tempeval3_toolkit/eval-ta-{infer_pass}"
print(f"""python {ta_script_path} \\
    {path_to_this_repo}/bug/ref-1 \\
    {path_to_this_repo}/bug/pred \\
    1 > {path_to_this_repo}/bug/ref-1-pred-eval.txt

python {ta_script_path} \\
    {path_to_this_repo}/bug/ref-2 \\
    {path_to_this_repo}/bug/pred \\
    1 > {path_to_this_repo}/bug/ref-2-pred-eval.txt
""")

python repos/tempeval3_toolkit/evaluation-relations/temporal_evaluation.py \
    temporal_alignment/bug/ref-1 \
    temporal_alignment/bug/pred \
    1 > temporal_alignment/bug/ref-1-pred-eval.txt

python repos/tempeval3_toolkit/evaluation-relations/temporal_evaluation.py \
    temporal_alignment/bug/ref-2 \
    temporal_alignment/bug/pred \
    1 > temporal_alignment/bug/ref-2-pred-eval.txt



In [40]:
display_scores(f"bug/ref-1-pred-eval.txt")

Temporal Score	F1	P	R
		0.0	0.0	0.0	
Overall Temporal Awareness Score (F1 score): 0.0



'0.0000 & 0.0000 & 0.0000'

In [41]:
display_scores(f"bug/ref-2-pred-eval.txt")

Temporal Score	F1	P	R
		0.0	100.0	0.0	
Overall Temporal Awareness Score (F1 score): 0.0



'1.0000 & 0.0000 & 0.0000'