# Fossil Data Extraction Baselines

This notebook sets up, runs and evaluates the baseline models for the fossil data extraction task.

The entity types and their baseline approaches are as follows:

| **Entity Name**            | **Baseline Approach**                                              |
|:---:|:---|
| Geographic Location - GEOG | Regular Expressions (Goring et al. 2021)                           |
| Site Name - SITE           | spaCy Pre-Trained NER model identifying location entities          |
| Taxa - TAXA                | In-text search for existing taxa already in Neotoma                |
| Age - AGE                  | Regular Expressions (Goring et al. 2021)                           |
| Altitude - ALTI            | Regular Expressions ("above sea level", "a.s.l.")                  |
| Email Address(es) - EMAIL  | Regular Expressions                                                |


In [None]:
import os, sys

import re
import pandas as pd
import json
import numpy as np
import plotly.express as px

# ensure that the parent directory is on the path for relative imports
sys.path.append(os.path.join(os.path.abspath(''), ".."))

from src.entity_extraction.baseline_entity_extraction import (
    extract_geographic_coordinates,
    extract_site_names,
    extract_taxa,
    extract_age,
    extract_altitude,
    extract_email,
    baseline_extract_all
)

from src.entity_extraction.entity_extraction_evaluation import (
    get_token_labels,
    plot_token_classification_report,
    calculate_entity_classification_metrics,
    visualize_mislabelled_entities
)

%load_ext autoreload
%autoreload 2

## Geographic Location - GEOG

The coordinates of a site are reported in the literature in a variety of formats. A few examples are:
- 402646302N 
- 0795855903W
- 40:26:46.302N 
- 079:58:55.903W
- 40°26′46″N
- 40d 26′ 46″ N
- 40.446195N 
- -79.982195
- 40° 26.7717
- N40:26:46.302
- N40°26′46″
- N40d 26′ 46″
- N40.446195
- 52°05.75′N
- 10°50'E

The baseline solution is based on Goring et al. 2021, which uses regular expressions to identify coordinates. We combine multiple regex patterns from that work into a single pattern that is robust to the different coordinate formats and produces fewer false-positive string matches than the earlier patterns:

1. Pattern for geographic coordinates - ```[-]?[NESW\d]+\s?[NESWd.:°o◦'`"″]\s?[NESW]?\d{1,7}\s?[NESWd.:°o◦′'`"″]?\s?\d{1,6}[[NESWd.:°o◦′'`"″]?\s?\d{0,3}[NESW]?```
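
As a rough illustration of how such a pattern can be turned into labelled spans, the sketch below applies it with `re.finditer`. This is not the project's `extract_geographic_coordinates`; the helper name `find_coordinates` is hypothetical, and the doubled `[` in the pattern above is assumed to be a typo and written as a single bracket.

```python
import re

# Pattern copied from the text above. Assumption: the doubled "[" before the
# last separator class is a typo, so it is written here as a single bracket.
GEOG_PATTERN = re.compile(
    r"""[-]?[NESW\d]+\s?[NESWd.:°o◦'`"″]\s?[NESW]?\d{1,7}\s?[NESWd.:°o◦′'`"″]?\s?\d{1,6}[NESWd.:°o◦′'`"″]?\s?\d{0,3}[NESW]?"""
)


def find_coordinates(text):
    """Return one labelled span per regex match (hypothetical helper)."""
    return [
        {"start": m.start(), "end": m.end(), "labels": ["GEOG"], "text": m.group()}
        for m in GEOG_PATTERN.finditer(text)
    ]


geog_example = find_coordinates("The core was taken at 40°26′46″N near the lake.")
print(geog_example)
```

Each match is reported with character offsets into the input string, which is the same span format used by the expected results in the test cells.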

In [None]:
test_sentences = [
    "40:26:46.302N",
    "079:58:55.903W",
    "40°26′46″N",
    "40d 26′ 46″ N",
    "N40:26:46.302",
    "N40°26′46″",
    "N40d 26′ 46″",
    "52°05.75′ N",
    "10°50'E",
    ]

expected_results = [
    [{'start': 0, 'end': 13, 'labels': ['GEOG'], 'text': '40:26:46.302N'}],
    [{'start': 0, 'end': 14, 'labels': ['GEOG'], 'text': '079:58:55.903W'}],
    [{'start': 0, 'end': 10, 'labels': ['GEOG'], 'text': '40°26′46″N'}],
    [{'start': 0, 'end': 13, 'labels': ['GEOG'], 'text': '40d 26′ 46″ N'}],
    [{'start': 0, 'end': 13, 'labels': ['GEOG'], 'text': 'N40:26:46.302'}],
    [{'start': 0, 'end': 10, 'labels': ['GEOG'], 'text': 'N40°26′46″'}],
    [{'start': 0, 'end': 12, 'labels': ['GEOG'], 'text': 'N40d 26′ 46″'}],
    [{'start': 0, 'end': 11, 'labels': ['GEOG'], 'text': '52°05.75′ N'}],
    [{'start': 0, 'end': 7, 'labels': ['GEOG'], 'text': "10°50'E"}]
]

In [None]:
for i, sentence in enumerate(test_sentences):

    extracted_geographic_coordinates = extract_geographic_coordinates(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Got: {extracted_geographic_coordinates}\n")
    assert extracted_geographic_coordinates == expected_results[i]

## Site Name - SITE

We can use pre-trained NER models to extract location names from the literature. For the baseline approach, we do not try to differentiate between `site names` and `region names` and consider all such entities to be of type `SITE`.

In [None]:
test_sentences = [
    "Its relevance to northwestern Europe in the Late Quaternary Period ( H. NICHOLS -)231 Chronology of Postglacial pollen profiles in the Pacific Northwest ( U.S.A. )",
    "The scenery around Garibaldi lake is pristine",
    "This movie was shot in the old towns of Europe",
    "Philosophical Transactions of and tbe pollen record in the British Isles, In : Birks HH, Birks HJb, Kaland PE, Moe D, eds.",
    "Holocene fluctuations of cold climate in the Swiss Alps ( H. ZOLLER -)"
]

expected_results = [
    [{'start': 30, 'end': 36, 'labels': ['SITE'], 'text': 'Europe'}, {'start': 131, 'end': 152, 'labels': ['SITE'], 'text': 'the Pacific Northwest'}],
    [{'start': 19, 'end': 33, 'labels': ['SITE'], 'text': 'Garibaldi lake'}],
    [{'start': 40, 'end': 46, 'labels': ['SITE'], 'text': 'Europe'}],
    [{'start': 55, 'end': 72, 'labels': ['SITE'], 'text': 'the British Isles'}],
    [{'start': 41, 'end': 55, 'labels': ['SITE'], 'text': 'the Swiss Alps'}]
]

In [None]:
for i, sentence in enumerate(test_sentences):

    extracted_site_names = extract_site_names(sentence)
    print(f"Testing sentence: {sentence}")
    print(f"Got: {extracted_site_names}\n")
    assert extracted_site_names == expected_results[i]

## Taxa - TAXA

The baseline approach to extracting taxa from full-text journal articles is string matching. Using an exhaustive list of the taxa present in the Neotoma database, we search the text for exact taxon names.
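
A minimal sketch of this kind of string matching is shown below. It is an illustration only, not the project's `extract_taxa`: the helper `find_taxa` and the short `EXAMPLE_TAXA` list are hypothetical stand-ins for the real function and the full Neotoma taxa list.

```python
import re

# Hypothetical stand-in for the full list of taxon names held in Neotoma.
EXAMPLE_TAXA = ["Betula", "Salix", "Mentha", "Mentha-type"]


def find_taxa(text, taxa_names):
    """Search for every taxon name independently and return labelled spans."""
    spans = []
    for name in taxa_names:
        # re.escape keeps characters such as "-" or "." literal.
        for m in re.finditer(re.escape(name), text):
            spans.append(
                {"start": m.start(), "end": m.end(), "labels": ["TAXA"], "text": m.group()}
            )
    # Order by position, placing the longer span first when two share a start.
    return sorted(spans, key=lambda s: (s["start"], -s["end"]))


taxa_example = find_taxa("Pollen of Betula and Salix was counted.", EXAMPLE_TAXA)
print(taxa_example)
```

Because each name is searched as a plain substring, a taxon name that happens to appear inside an unrelated word or place name will also be reported, which is one source of false positives for this kind of approach.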

In [None]:
test_sentences = [
    "Percentage calculation is based on the terrestrial pollen sum from which Betula was excluded KM/1 KM/2 KM/3 NM/1 NM/2 NM/3 NM/4 NM/5 NM/6 NM/7 NM/8",
    "The palaeoecology of an Early Neolithic waterlogged site in northwestern England ( F. OLovmLo -)A pollen-analytical study of cores from the Outer Silver Pit", #False positive
    "Description Salix 0.57 1.76 0.73 13.3 1.67 8.78 1.50 2.88 Solanum dulcamara 0 0 0.73 0 0 1.58 0 0 Lysimachia vulgaris 0 0 4.90 0 0.84 0.53 0 0 Mentha-type 00 0 1.04 0 0 00 Lemna 00 0 7.44 0 1.58 0 0",
    "The first major impacts upon the vegetation record become eident from about 3610 BP with sharp reductions in arboreal taxa, the appearance of cerealtype pollen in L.A.BI, and marked increases in Calluna, Foaceae and Cyperaceae.",
    "The overlying Sphagnum peat is devoid of clastic elements for a short period during which sediment inorganic content declines.",
    "Abstract ) ( A. T. CROSS, G. G. THOMPSON and J. B. ZAITZEFF ) 3 - 1 1 Gymnospermae, general The gymnospermous affinity of Eucommiidites ERDTMAN, 1948"
]

expected_results = [
    [{'start': 73, 'end': 79, 'labels': ['TAXA'], 'text': 'Betula'}],
    [{'start': 146, 'end': 152, 'labels': ['TAXA'], 'text': 'Silver'}], # False positive
    [
        {'start': 12, 'end': 17, 'labels': ['TAXA'], 'text': 'Salix'}, 
        {'start': 58, 'end': 75, 'labels': ['TAXA'], 'text': 'Solanum dulcamara'},
        {'start': 98, 'end': 117, 'labels': ['TAXA'], 'text': 'Lysimachia vulgaris'}, 
        {'start': 143, 'end': 154, 'labels': ['TAXA'], 'text': 'Mentha-type'},
        {'start': 143, 'end': 149, 'labels': ['TAXA'], 'text': 'Mentha'}, 
        {'start': 172, 'end': 177, 'labels': ['TAXA'], 'text': 'Lemna'}],
    [
        {'start': 195, 'end': 202, 'labels': ['TAXA'], 'text': 'Calluna'}, 
        {'start': 216, 'end': 226, 'labels': ['TAXA'], 'text': 'Cyperaceae'}],
    [{'start': 14, 'end': 22, 'labels': ['TAXA'], 'text': 'Sphagnum'}],
    [{'start': 70, 'end': 81, 'labels': ['TAXA'], 'text': 'Gymnosperma'}]
]

In [None]:
for i, sentence in enumerate(test_sentences):

    extracted_taxas = extract_taxa(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Got: {extracted_taxas}\n")
    assert extracted_taxas == expected_results[i]

## Age - AGE

The age of samples is reported in the literature in a variety of formats. The most common formats are:
- years BP - before present
- kyr BP - thousands of years BP
- ka BP - kilo annum BP
- a BP - annum BP
- Ma BP - million years BP
- YBP - years BP

Neotoma has three age columns: ageold, agetype and ageyoung.

- agetype: Age type or units. Includes the following:
  - Calendar years AD/BC
  - Calendar years BP
  - Calibrated radiocarbon years BP
  - Radiocarbon years BP
  - Varve years BP

The baseline solution, based on Goring et al. 2021, uses regular expressions to:
1. Identify the age entity in the sentence - `" BP "`
2. Determine if it is a range of dates - `"(\\d+(?:[.]\\d+)*) ((?:- {1,2})|(?:to)) (\\d+(?:[.]\\d+)*) ([a-zA-Z]+,BP)"`
3. Extract the age entity from the sentence - `"(\\d+(?:[.]\\d+)*),((?:- {1,2})|(?:to)),(\\d+(?:[.]\\d+)*),([a-zA-Z]+,BP),"`
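
The idea behind these steps can be sketched with a single, deliberately simplified Python pattern. This is an assumption made for illustration, not the regex used by Goring et al. or by `extract_age`: it covers a number, an optional range, an optional one-word unit and a trailing "BP", and the names `AGE_PATTERN` and `find_ages` are hypothetical.

```python
import re

# Simplified, illustrative pattern (an assumption, not the project's regex):
# a number, an optional "-", "--" or "to" range, an optional unit word, then "BP".
AGE_PATTERN = re.compile(
    r"\d+(?:[.]\d+)*(?: (?:-{1,2}|to) \d+(?:[.]\d+)*)? (?:[a-zA-Z]+ )?BP"
)


def find_ages(text):
    """Return one labelled span per age mention found by the pattern."""
    return [
        {"start": m.start(), "end": m.end(), "labels": ["AGE"], "text": m.group()}
        for m in AGE_PATTERN.finditer(text)
    ]


age_example = find_ages("1234 BP and 456 to 789 Ma BP")
print(age_example)
```

This sketch does not handle forms such as "cal yr BP", "YBP" or "14C BP", which need additional alternatives in the pattern.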

In [None]:
test_sentences = [
    "1234 BP",
    "1234 Ma BP",
    "1234 to 1235 BP",
    "1234 - 1235 BP",
    "1234 -- 1235 BP",
    "1234 BP and 456 to 789 BP",
    "1234 BP and 456 to 789 Ma BP",
    "1234 ka BP",
    "1234 a BP",
    "1234 Ma BP",
    "1234 kyr BP",
    "1234 cal yr BP",
    "1234 YBP",
    "1234 14C BP",
]

expected_results = [
    [{'start': 0, 'end': 7, 'labels': ['AGE'], 'text': '1234 BP'}],
    [{'start': 0, 'end': 10, 'labels': ['AGE'], 'text': '1234 Ma BP'}],
    [{'start': 0, 'end': 15, 'labels': ['AGE'], 'text': '1234 to 1235 BP'}],
    [{'start': 0, 'end': 14, 'labels': ['AGE'], 'text': '1234 - 1235 BP'}],
    [{'start': 0, 'end': 15, 'labels': ['AGE'], 'text': '1234 -- 1235 BP'}],
    [
        {'start': 0, 'end': 7, 'labels': ['AGE'], 'text': '1234 BP'}, 
        {'start': 12, 'end': 25, 'labels': ['AGE'], 'text': '456 to 789 BP'}
    ],
    [
        {'start': 0, 'end': 7, 'labels': ['AGE'], 'text': '1234 BP'}, 
        {'start': 12, 'end': 28, 'labels': ['AGE'], 'text': '456 to 789 Ma BP'}
    ],
    [{'start': 0, 'end': 10, 'labels': ['AGE'], 'text': '1234 ka BP'}],
    [{'start': 0, 'end': 9, 'labels': ['AGE'], 'text': '1234 a BP'}],
    [{'start': 0, 'end': 10, 'labels': ['AGE'], 'text': '1234 Ma BP'}],
    [{'start': 0, 'end': 11, 'labels': ['AGE'], 'text': '1234 kyr BP'}],
    [{'start': 0, 'end': 14, 'labels': ['AGE'], 'text': '1234 cal yr BP'}],
    [{'start': 0, 'end': 8, 'labels': ['AGE'], 'text': '1234 YBP'}],
    [{'start': 0, 'end': 11, 'labels': ['AGE'], 'text': '1234 14C BP'}],
]

In [None]:
# test that all the test sentences are extracted correctly
for i, sentence in enumerate(test_sentences):

    extracted_ages = extract_age(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Got: {extracted_ages}\n")
    assert extracted_ages == expected_results[i]

## Altitude - ALTI

To identify altitude descriptions, the primary indicators are:
- "above sea level"
- "a.s.l."
- a single m as the last character after numbers or as a standalone word
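
These indicators could be combined into one pattern along the following lines. This is a sketch under the assumption that an altitude is a number followed by "m" and one of the sea-level phrases; it is not the regex used by `extract_altitude`, and `ALTI_PATTERN` and `find_altitudes` are hypothetical names.

```python
import re

# Illustrative pattern: number, optional space, "m", then a sea-level phrase.
ALTI_PATTERN = re.compile(
    r"\d+(?:[.]\d+)?\s?m\s(?:above sea level|a\.s\.l\.|asl)"
)


def find_altitudes(text):
    """Return one labelled span per altitude mention found by the pattern."""
    return [
        {"start": m.start(), "end": m.end(), "labels": ["ALTI"], "text": m.group()}
        for m in ALTI_PATTERN.finditer(text)
    ]


alti_example = find_altitudes("First site was 120m asl and the second was 300 m asl")
print(alti_example)
```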

In [None]:
test_sentences = [
    "120m above sea level",
    "120m a.s.l.",
    "120 m above sea level",
    "120 m a.s.l.",
    "120m asl",
    "120 m asl",
    "The site was 120m above sea level",
    "The site was 120m a.s.l.",
    "The site was 120 m above sea level",
    "The site was 120 m a.s.l.",
    "First site was 120m asl and the second was 300 m asl",
]

expected_results = [
    [{'start': 0, 'end': 20, 'labels': ['ALTI'], 'text': '120m above sea level'}],
    [{'start': 0, 'end': 11, 'labels': ['ALTI'], 'text': '120m a.s.l.'}],
    [{'start': 0, 'end': 21, 'labels': ['ALTI'], 'text': '120 m above sea level'}],
    [{'start': 0, 'end': 12, 'labels': ['ALTI'], 'text': '120 m a.s.l.'}],
    [{'start': 0, 'end': 8, 'labels': ['ALTI'], 'text': '120m asl'}],
    [{'start': 0, 'end': 9, 'labels': ['ALTI'], 'text': '120 m asl'}],
    [{'start': 13, 'end': 33, 'labels': ['ALTI'], 'text': '120m above sea level'}],
    [{'start': 13, 'end': 24, 'labels': ['ALTI'], 'text': '120m a.s.l.'}],
    [{'start': 13, 'end': 34, 'labels': ['ALTI'], 'text': '120 m above sea level'}],
    [{'start': 13, 'end': 25, 'labels': ['ALTI'], 'text': '120 m a.s.l.'}],
    [
        {'start': 15, 'end': 23, 'labels': ['ALTI'], 'text': '120m asl'},
        {'start': 43, 'end': 52, 'labels': ['ALTI'], 'text': '300 m asl'}
    ]
]

In [None]:
# test that all the test sentences are extracted correctly
for i, sentence in enumerate(test_sentences):

    extracted_altitude = extract_altitude(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Found: {extracted_altitude}\n")
    assert extracted_altitude == expected_results[i]

## Email Addresses - EMAIL

Existing regex patterns have been developed to identify email addresses. The one used below was sourced from this Stack Overflow thread:
- https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression
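
For illustration, the sketch below uses a much simpler pattern than the ones discussed in that thread. It is an assumption chosen for readability, not the pattern used by `extract_email`, and it will miss unusual but valid addresses; `EMAIL_PATTERN` and `find_emails` are hypothetical names.

```python
import re

# Simplified, illustrative email pattern: local part, "@", domain, and a
# top-level domain of at least two letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return one labelled span per email address found by the pattern."""
    return [
        {"start": m.start(), "end": m.end(), "labels": ["EMAIL"], "text": m.group()}
        for m in EMAIL_PATTERN.finditer(text)
    ]


email_example = find_emails(
    "E-mail addresses : carina.hoorn@milne.cc (C. Hoorn -) mauro.cremaschi@libero.it"
)
print(email_example)
```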

In [None]:
test_sentences = [
    "ty.elgin.andrews@gmail.com",
    "john.smith@aol.com",
    "ty.andrews@student.ubc.ca",
    # from GGD 54b4324ae138239d8684a37b segment 0
    "E-mail addresses : carina.hoorn@milne.cc (C. Hoorn -) mauro.cremaschi@libero.it"
]

expected_results = [
    [{'start': 0, 'end': 26, 'labels': ['EMAIL'], 'text': 'ty.elgin.andrews@gmail.com'}],
    [{'start': 0, 'end': 18, 'labels': ['EMAIL'], 'text': 'john.smith@aol.com'}],
    [{'start': 0, 'end': 25, 'labels': ['EMAIL'], 'text': 'ty.andrews@student.ubc.ca'}],
    [
        {'start': 19, 'end': 40, 'labels': ['EMAIL'], 'text': 'carina.hoorn@milne.cc'},
        {'start': 54, 'end': 79, 'labels': ['EMAIL'], 'text': 'mauro.cremaschi@libero.it'}
    ]
]

In [None]:
for i, sentence in enumerate(test_sentences):

    extracted_emails = extract_email(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Found: {extracted_emails}\n")
    assert extracted_emails == expected_results[i]

# Evaluation of Baseline Methods

Several methods are used to evaluate the NER task for a given report:
1. Does the number of instances of each category align with what is present in Neotoma?
   1. For each article, pull the number of instances of each category for the given DOI/GDD ID.
2. Does the number of instances of each category align with what has been labelled?
   1. Using the team-labelled reports, count the number of instances of each category for each report.
3. Do the extracted locations align with the labelled locations?
   1. This can be done with the Python package `seqeval`, which evaluates the performance of sequence labelling tasks. ([`seqeval` GitHub](https://github.com/chakki-works/seqeval))

## Loading Raw & Labelled Data

In [None]:
from collections import defaultdict

labelled_file_path = os.path.join(os.getcwd(), os.pardir, "data", "labelled", "labelling-labelled")

## Comparison to Neotoma Data

To compare the extraction rate, for each article pull the reported taxa, site names, etc. and compare the extracted values to the reported values.

In [None]:
# TODO

## Comparison of Entity Counts to Labelled Data

In [None]:
labelled_files = os.listdir(labelled_file_path)

# count the number of occurrences of each label
annotated_label_counts = defaultdict(int)
baseline_label_counts = defaultdict(int)

for file in labelled_files:
    
    with open(os.path.join(labelled_file_path, file), "r") as f:
        task = json.load(f)

    raw_text = task['task']['data']['text']
    
    # get the baseline annotations
    baseline_result = baseline_extract_all(raw_text)
    for baseline in baseline_result:
        baseline_label_counts[baseline['labels'][0]] += 1

    annotation_result = task['result']
    
    for annotation in annotation_result:
        annotated_label_counts[annotation['value']['labels'][0]] += 1

In [None]:
# plot the percentage of entities extracted by baseline vs annotated
annotated_labels = list(annotated_label_counts.keys())
annotated_counts = list(annotated_label_counts.values())

baseline_counts = [baseline_label_counts[label] for label in annotated_labels]

annotated_counts = np.array(annotated_counts)
baseline_counts = np.array(baseline_counts)

# make into a tidy dataframe with columns Label, Source, Count
annotated_df = pd.DataFrame(
    {
        'Label': annotated_labels + annotated_labels,
        'Source': ['Annotated'] * len(annotated_labels) + ['Baseline'] * len(annotated_labels),
        'Count': np.concatenate([annotated_counts, baseline_counts])
    }
)

fig = px.bar(
    annotated_df,
    x="Label",
    y="Count",
    color="Source",
    barmode='group',
    # labels={'x': 'Labels', 'value': 'No. of Entities'},
    title='Counts of Labels in Annotated and Baseline Results',
    width=800,
).update_layout(
    xaxis={'categoryorder': 'total descending'}, 
    margin={'l': 0, 'r': 0, 't': 50, 'b': 0}, 
)

fig.show()

In [None]:
percentage_extracted = baseline_counts / annotated_counts

fig = px.bar(
    x=annotated_labels, 
    y=percentage_extracted, 
    color=annotated_labels,
    labels={'x': 'Labels', 'y': 'Percentage Extracted'},
    title='Percentage of Entities Extracted by Baseline vs Annotated',
    width=800,
    # format the text to show percent
    text=np.round(percentage_extracted*100, 1),
).update_layout(
    xaxis={'categoryorder': 'total descending'},
    margin={'l': 0, 'r': 0, 't': 50, 'b': 0},
    yaxis={'tickformat': ',.0%'},
    showlegend=False,
    yaxis_range=[0, 1],
)
fig.show()


## Calculating Precision, Recall and F1 Scores with Seqeval

The Python package `seqeval` is used to evaluate the performance of sequence labelling tasks ([`seqeval` GitHub](https://github.com/chakki-works/seqeval)).

It requires the following inputs:
- `y_true` - a list of true labels of each token
- `y_pred` - a list of predicted labels of each token

To convert the labelled data from Label Studio into the required format, we need to:
1. Load the labelled data
2. Convert the labelled data into a list of lists of labels with 'O' for tokens that are not part of an entity
3. Convert the baseline predictions into a list of lists of labels with 'O' for tokens that are not part of an entity
4. Calculate the precision, recall and f1 scores for each entity type
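
Steps 2 and 3 both amount to mapping character-offset spans onto whitespace tokens. The sketch below shows one way this could be done; it is not the project's `get_token_labels`, and the helper name `spans_to_bio` is hypothetical. It assumes tokens are produced by whitespace splitting and that a token belongs to an entity when their character ranges overlap.

```python
import re


def spans_to_bio(entities, text):
    """Label each whitespace token with B-/I-<type>, or 'O' outside entities."""
    labels = []
    previous_entity = None
    for token in re.finditer(r"\S+", text):
        current_entity = None
        for index, entity in enumerate(entities):
            # A token belongs to an entity when their character ranges overlap.
            if token.start() < entity["end"] and token.end() > entity["start"]:
                current_entity = index
                break
        if current_entity is None:
            labels.append("O")
        else:
            # The first token of an entity gets B-, later tokens get I-.
            prefix = "I" if current_entity == previous_entity else "B"
            labels.append(f"{prefix}-{entities[current_entity]['labels'][0]}")
        previous_entity = current_entity
    return labels


bio_text = "The site was 120m above sea level and 1234 BP and found Pediastrum"
bio_entities = [
    {"start": 13, "end": 33, "labels": ["ALTI"], "text": "120m above sea level"},
    {"start": 38, "end": 45, "labels": ["AGE"], "text": "1234 BP"},
    {"start": 56, "end": 66, "labels": ["TAXA"], "text": "Pediastrum"},
]
bio_labels = spans_to_bio(bio_entities, bio_text)
print(list(zip(bio_text.split(), bio_labels)))
```

The resulting list has one label per token, so labelled and predicted sequences built from the same text have the same length and can be compared position by position.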

### Load Labelled Data

In [None]:
labelled_file_path = os.path.join(os.getcwd(), os.pardir, "data", "labelled", "labelling-labelled")

In [None]:
def load_json_label_files(labelled_files: list):
    """
    Load the json files containing the labelled data and combine the text
    into a complete text string.

    Parameters
    ----------
    labelled_files : list
        List of json files containing the labelled data.

    Returns
    -------
    combined_text : str
        The combined text from all the files.
    all_labelled_entities : list
        List of all the labelled entities re-indexed to account for the combined text.

    """

    combined_text = ""
    all_labelled_entities = []
    for file in labelled_files:
        
        with open(os.path.join(labelled_file_path, file), "r") as f:
            task = json.load(f)

        raw_text = task['task']['data']['text']

        annotation_result = task['result']
        labelled_entities = [annotation['value'] for annotation in annotation_result]

        # add the current text length to the start and end indices of labels plus one for the space
        for entity in labelled_entities:
            entity['start'] += len(combined_text)
            entity['end'] += len(combined_text)

        all_labelled_entities += labelled_entities

        # add the current text to the combined text with space in between
        combined_text += raw_text + " "

    return combined_text, all_labelled_entities

### Simplified Test of Evaluation

The following test case checks the evaluation with the `seqeval` package and ensures that the steps to wrangle the data into the required format are correct.

The expected result is precision/accuracy/recall of 1.0 for all entity types.

In [None]:
test_text = "The site was 120m above sea level and 1234 BP and found Pediastrum"
test_labelled_entities = [
    {'start': 13, 'end': 33, 'labels': ['ALTI'], 'text': '120m above sea level'},
    {'start': 38, 'end': 45, 'labels': ['AGE'], 'text': '1234 BP'},
    {'start': 56, 'end': 66, 'labels': ['TAXA'], 'text': 'Pediastrum'}
]

In [None]:
# check the token labels and indexes match
test_labelled_tokens = get_token_labels(test_labelled_entities, test_text)

split_text = test_text.split()

# check that the token labels are correct
for i, token in enumerate(split_text):
    print(f"{token}: {test_labelled_tokens[i]}")

In [None]:
test_baseline_entities = baseline_extract_all(test_text)
test_baseline_tokens = get_token_labels(test_baseline_entities, test_text)

### Entity vs. Token Level Evaluation

The `seqeval` package can be used to evaluate the performance of the NER task at the entity level or the token level. The two approaches are summarized below:

1. **Entity Level Evaluation**
   - The entity level evaluation is the standard approach to evaluating the performance of a NER task. 
   - The evaluation is done at the entity level, meaning that the entire entity must be correctly identified to be considered a true positive.
   - The entity level evaluation is the default evaluation method used by the `seqeval` package.

2. **Token Level Evaluation**
    - The token level evaluation is a more lenient evaluation of the performance of a NER task.
    - The evaluation is done at the token level, meaning that each token must just be of the correct tag (e.g. TAXA) to be considered a true positive.
    - The token level evaluation is not the default evaluation method used by the `seqeval` package; it is calculated by turning all Inside (I) tags into Begin (B) tags for labelled data.
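
The distinction can be sketched in plain Python, without `seqeval`, to make the counting explicit. The helpers below (`extract_entities`, `token_items`, `precision_recall`) are hypothetical and only mirror the idea: entities must match exactly on type and boundaries, while tokens only need the right type at the right position.

```python
def extract_entities(tags):
    """Collect (type, start, end) tuples from a BIO tag sequence."""
    entities, start, current = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        entity_type = tag[2:] if tag != "O" else None
        starts_new = tag.startswith("B-") or entity_type != current
        if current is not None and starts_new:
            entities.append((current, start, i))
            current = None
        if entity_type is not None and current is None:
            start, current = i, entity_type
    return entities


def token_items(tags):
    """Collect (type, position) tuples, ignoring the B-/I- prefix."""
    return [(tag[2:], i) for i, tag in enumerate(tags) if tag != "O"]


def precision_recall(true_items, predicted_items):
    """Precision and recall from exact matches between two collections."""
    true_set, predicted_set = set(true_items), set(predicted_items)
    matched = len(true_set & predicted_set)
    precision = matched / len(predicted_set) if predicted_set else 0.0
    recall = matched / len(true_set) if true_set else 0.0
    return precision, recall


example_labelled = ["O", "O", "O", "B-TAXA", "I-TAXA", "I-TAXA", "O"]
example_predicted = ["O", "O", "B-TAXA", "I-TAXA", "I-TAXA", "I-TAXA", "O"]

entity_scores = precision_recall(
    extract_entities(example_labelled), extract_entities(example_predicted)
)
token_scores = precision_recall(
    token_items(example_labelled), token_items(example_predicted)
)
print(f"Entity level (precision, recall): {entity_scores}")
print(f"Token level (precision, recall): {token_scores}")
```

In this example the predicted entity starts one token early, so it counts as a complete miss at the entity level, while three of the four predicted tokens are still correct at the token level.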

The cells below show the difference between the two evaluation methods:

In [None]:
accuracy, f1, recall, precision = calculate_entity_classification_metrics(
    labelled_tokens = ['O',  'O', 'O',      'B-TAXA', 'I-TAXA', 'I-TAXA', 'O'],  
    predicted_tokens = ['O', 'O', 'B-TAXA', 'I-TAXA', 'I-TAXA', 'I-TAXA', 'O'],
    method = "entities"
)

print(f"Accuracy: {accuracy:.3f}")
print(f"F1: {f1:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Precision: {precision:.3f}")

In [None]:
accuracy, f1, recall, precision = calculate_entity_classification_metrics(
    labelled_tokens = ['O',  'O', 'O',      'B-TAXA', 'I-TAXA', 'I-TAXA', 'O'],  
    predicted_tokens = ['O', 'O', 'B-TAXA', 'I-TAXA', 'I-TAXA', 'I-TAXA', 'O'],
    method = "tokens"
)

print(f"Accuracy: {accuracy:.3f}")
print(f"F1: {f1:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Precision: {precision:.3f}")

#### Evaluation on Simplified Test Data

In [None]:
# calculate the metrics for the test case
accuracy, f1, recall, precision = calculate_entity_classification_metrics(
    test_labelled_tokens, test_baseline_tokens, method="tokens"
)

print(f"Accuracy: {accuracy:.3f}")
print(f"F1: {f1:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Precision: {precision:.3f}")

In [None]:
plot_token_classification_report(
    test_labelled_tokens, 
    test_baseline_tokens, 
    "Test of Entity Extraction Classification Report",
    method="tokens",
)

### Evaluation of All Labelled Files

The following code reads in all labelled files and evaluates the performance of the baseline models across all the text at once. A future improvement would be to analyze individual research papers one at a time.

In [None]:
labelled_files = os.listdir(labelled_file_path)

# load the labelled data
combined_text, labelled_entities = load_json_label_files(labelled_files)

# extract the baseline entities
baseline_entities = baseline_extract_all(combined_text)

In [None]:
labelled_tokens = get_token_labels(labelled_entities, combined_text)
baseline_tokens = get_token_labels(baseline_entities, combined_text)

In [None]:
# check that both lists are the same length as they should be split identically
len(labelled_tokens), len(baseline_tokens)

#### Entity Level Performance

In [None]:
accuracy, f1, recall, precision = calculate_entity_classification_metrics(
    labelled_tokens, baseline_tokens, method="entities"
)

print(f"Accuracy: {accuracy:.3f}")
print(f"F1: {f1:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Precision: {precision:.3f}")

In [None]:
plot_token_classification_report(
    labelled_tokens, 
    baseline_tokens, 
    title="Classification Report\n ENTITY Based Evaluation for Baseline Entity Extraction",
    method="entities",
)

#### Token Level Performance

In [None]:
accuracy, f1, recall, precision = calculate_entity_classification_metrics(
    labelled_tokens, baseline_tokens, method="tokens"
)

print(f"Accuracy: {accuracy:.3f}")
print(f"F1: {f1:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Precision: {precision:.3f}")

In [None]:
plot_token_classification_report(
    labelled_tokens, 
    baseline_tokens, 
    title="Classification Report\n TOKEN Based Evaluation for Baseline Entity Extraction",
    method="tokens",
)

## Inspecting Mislabelled Entities

The `spacy` tool `displacy` is used to visualize the entities identified by the baseline model and the labelled entities. This is done to understand the types of entities that are being mislabelled and to identify patterns in the mislabelled entities.

We distinguish between the following types of mislabelled entities:
1. **<font color='red'>False Negatives (Red)</font>** - entities labelled in the labelled data that are not identified by the baseline model
   - these are most important as they signify missed data
2. **<font color='orange'>False Positives (Orange)</font>** - entities identified by the baseline model that are not labelled in the labelled data
   - these are considered less important as they can be rejected by the data steward
3. **<font color='green'>True Positives (Green)</font>** - entities identified by the baseline model that are also labelled in the labelled data
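
The three categories can be assigned per token with a small comparison function. The sketch below is an illustration only, not the logic inside `visualize_mislabelled_entities`; the helper `classify_tokens` is hypothetical, and it assumes that a token labelled with one type but predicted as another counts as a false negative.

```python
def classify_tokens(actual_labels, predicted_labels):
    """Return the outcome per token, or None when neither side has a label."""
    outcomes = []
    for actual, predicted in zip(actual_labels, predicted_labels):
        actual_type = actual[2:] if actual != "O" else None
        predicted_type = predicted[2:] if predicted != "O" else None
        if actual_type is None and predicted_type is None:
            outcomes.append(None)
        elif actual_type == predicted_type:
            outcomes.append("true_positive")
        elif actual_type is not None:
            # Assumption: a labelled token with a missing or wrong prediction
            # is treated as missed data.
            outcomes.append("false_negative")
        else:
            outcomes.append("false_positive")
    return outcomes


outcome_example = classify_tokens(
    ["O", "O", "O", "O", "O", "B-AGE", "I-AGE", "O", "O", "B-SITE"],
    ["O", "O", "B-TAXA", "O", "O", "B-AGE", "I-AGE", "O", "O", "O"],
)
print(outcome_example)
```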

In [None]:
text = "This is a test age 1234 BP and missed SITE".split(" ")
actual_labels       = ["O", "O", "O",       "O", "O", "B-AGE", "I-AGE", "O", "O", "B-SITE"]
predicted_labels    = ["O", "O", "B-TAXA",  "O", "O", "B-AGE", "I-AGE", "O", "O", "O"]

visualize_mislabelled_entities(actual_labels, predicted_labels, text)

In [None]:
visualize_mislabelled_entities(labelled_tokens, baseline_tokens, combined_text.split())