# Fossil Data Extraction Baselines

This notebook sets up, runs and evaluates the baseline models for the fossil data extraction task.

The entities and their baseline approaches are as follows:

| **Entity Name**            | **Baseline Approach**                                     |
|:---:|:---|
| Geographic Location - GEOG | Regular Expressions (Goring et al. 2021)                  |
| Site Name - SITE           | spaCy Pre-Trained NER model identifying location entities |
| Taxa - TAXA                | In-text search for existing taxa already in Neotoma       |
| Age - AGE                  | Regular Expressions (Goring et al. 2021)                  |
| Altitude - ALTI            | Regular Expressions ("above sea level", "a.s.l.")         |
| Email Address(es) - EMAIL  | Regular Expressions                                       |


In [None]:
import os, sys

import re
import pandas as pd
import json
import numpy as np
import plotly.express as px

# ensure that the parent directory is on the path for relative imports
sys.path.append(os.path.join(os.path.abspath(''), ".."))

from src.entity_extraction.baseline_entity_extraction import (
    extract_geographic_coordinates,
    extract_site_names,
    extract_taxa,
    extract_age,
    extract_altitude,
    extract_email,
    baseline_extract_all
)

from src.entity_extraction.entity_extraction_evaluation import (
    get_token_labels,
    plot_token_classification_report,
    calculate_entity_classification_metrics
)

%load_ext autoreload
%autoreload 2

## Geographic Location - GEOG

The coordinates of a site are reported in the literature in a variety of formats. A few examples are:
- 402646302N 0795855903W
- 40:26:46.302N 079:58:55.903W
- 40°26′46″N 079°58′56″W
- 40d 26′ 46″ N 079d 58′ 56″ W
- 40.446195N 79.982195W
- 40.446195, -79.982195
- 40.446195,-79.982195
- 40° 26.7717, -79° 58.93172
- N40:26:46.302 W079:58:55.903
- N40°26′46″ W079°58′56″
- N40d 26′ 46″ W079d 58′ 56″
- N40.446195 W79.982195
- 52°05.75′N, 131°13.25′W
- 10°50'E, 46°51'N

The baseline solution, based on Goring et al. 2021, uses regular expressions to identify coordinates. We've adapted and updated those regex patterns to reduce the number of false positives generated:
1. Pattern for geographic coordinates - ```[-+]?[NESW\d]+\s?[NESWd\.:°o◦'`"″]\s?[NESW]?\d{1,7}\s?[NESWd\.:°o◦′'`"″]?\s?\d{1,6}[[NESWd\.:°o◦′'`"″]?\s?[NESW]?```
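As a minimal sketch of the regex approach, the snippet below handles only the decimal-degree pair format (e.g. `40.446195, -79.982195`) and returns spans in the same dictionary layout used in the tests later in this notebook. The pattern `DECIMAL_PAIR` and the function `find_decimal_coordinates` are illustrative assumptions; they are neither the pattern listed above nor the implementation of `extract_geographic_coordinates`.

```python
import re

# Hypothetical, simplified pattern: two signed decimal numbers separated by a comma.
DECIMAL_PAIR = re.compile(r"[-+]?\d{1,3}\.\d+\s*,\s*[-+]?\d{1,3}\.\d+")


def find_decimal_coordinates(text):
    """Return GEOG spans for decimal-degree coordinate pairs found in text."""
    return [
        {"start": m.start(), "end": m.end(), "label": ["GEOG"], "text": m.group()}
        for m in DECIMAL_PAIR.finditer(text)
    ]


print(find_decimal_coordinates("The core was taken at 40.446195, -79.982195 in 1998."))
```

Covering the degree/minute/second formats would need additional alternatives in the pattern, which is where false positives tend to creep in.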

## Site Name - SITE

We can use pretrained NER models to extract location names from the literature, but we are not sure whether these entities will correspond to `region_name` or `site_name`.

In [12]:
import spacy
from spacy.pipeline.ner import DEFAULT_NER_MODEL

nlp = spacy.load('en_core_web_lg')

config = {
   "moves": None,
   "update_with_oracle_cut_size": 100,
   "model": DEFAULT_NER_MODEL,
   "incorrect_spans_key": "incorrect_spans",
}
# nlp.add_pipe("ner", config=config)
with open("../data/raw/54b43244e138239d8684933b.txt", 'r') as f:
    text = f.readlines()

doc = nlp(text[0])
labels = []
for ent in doc.ents:
    if ent.label_ == "LOC":
        labels.append({
            "start": ent.start,
            "end": ent.end,
            "label": ["LOC"],
            "text": ent.text
        })
print(labels)

[{'start': 214, 'end': 215, 'label': ['LOC'], 'text': 'Siberia'}, {'start': 217, 'end': 218, 'label': ['LOC'], 'text': 'Sahara'}, {'start': 224, 'end': 226, 'label': ['LOC'], 'text': 'North Pacific'}, {'start': 306, 'end': 307, 'label': ['LOC'], 'text': 'Sahara'}, {'start': 313, 'end': 315, 'label': ['LOC'], 'text': 'B.Plow lake'}, {'start': 334, 'end': 335, 'label': ['LOC'], 'text': 'Sahara'}, {'start': 365, 'end': 366, 'label': ['LOC'], 'text': 'Sahel'}, {'start': 410, 'end': 412, 'label': ['LOC'], 'text': 'Southern Hemisphere'}, {'start': 420, 'end': 422, 'label': ['LOC'], 'text': 'Northern Hemisphere'}, {'start': 757, 'end': 759, 'label': ['LOC'], 'text': 'Southern Africa'}, {'start': 1021, 'end': 1022, 'label': ['LOC'], 'text': 'Sahel'}, {'start': 1030, 'end': 1031, 'label': ['LOC'], 'text': 'Africa'}, {'start': 1033, 'end': 1034, 'label': ['LOC'], 'text': 'Atlantic'}, {'start': 1035, 'end': 1038, 'label': ['LOC'], 'text': 'the Red Sea'}, {'start': 1085, 'end': 1086, 'label': ['LO

In [15]:
type(nlp)

spacy.lang.en.English

In [17]:
spacy.Language

spacy.language.Language

## Taxa - TAXA

The baseline approach to extracting taxa from full-text journal articles is string matching. Using an exhaustive list of all the taxa present in the Neotoma database, we search for the exact taxon names in the literature.
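A rough sketch of exact string matching against a list of names is shown here. `KNOWN_TAXA` is a small hypothetical stand-in for the Neotoma taxa list, and `find_taxa` is an illustrative helper rather than the notebook's `extract_taxa`.

```python
import re

# Hypothetical stand-in for the taxon names exported from Neotoma.
KNOWN_TAXA = ("Pediastrum", "Picea", "Pinus sylvestris")


def find_taxa(text, taxa=KNOWN_TAXA):
    """Return TAXA spans for exact, whole-word matches of known taxon names."""
    # Try longer names first so multi-word taxa win over shorter alternatives.
    ordered = sorted(taxa, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return [
        {"start": m.start(), "end": m.end(), "label": ["TAXA"], "text": m.group()}
        for m in pattern.finditer(text)
    ]


print(find_taxa("Pollen of Picea and Pinus sylvestris was found with Pediastrum."))
```

The match is case-sensitive and exact, so abbreviated forms such as "P. sylvestris" would be missed by this sketch.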

## Age - AGE

The age of samples is often reported in the literature in a variety of formats. The most common formats are:
- years BP - before present
- kyr BP - 1000’s of years BP
- ka BP - kilo annum BP
- a BP - annum BP
- Ma BP - million years BP
- YBP - years BP

Neotoma has three age columns: `ageold`, `agetype` and `ageyoung`.

- agetype: Age type or units. Includes the following:
  - Calendar years AD/BC
  - Calendar years BP
  - Calibrated radiocarbon years BP
  - Radiocarbon years BP
  - Varve years BP

The baseline solution, based on Goring et al. 2021, uses regular expressions to:
1. Identify the age entity in the sentence - `" BP "`
2. Determine if it is a range of dates - `"(\\d+(?:[.]\\d+)*) ((?:- {1,2})|(?:to)) (\\d+(?:[.]\\d+)*) ([a-zA-Z]+,BP)"`
3. Extract the age entity from the sentence - `"(\\d+(?:[.]\\d+)*),((?:- {1,2})|(?:to)),(\\d+(?:[.]\\d+)*),([a-zA-Z]+,BP),"`
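The three steps above can be sketched with a single Python pattern that accepts an optional range and an optional unit word before "BP". `AGE_PATTERN` and `find_ages` are assumptions made for illustration; they are not the Goring et al. patterns and not the notebook's `extract_age`.

```python
import re

# Hypothetical pattern: a number, an optional "-", "--" or "to" range,
# an optional unit word such as "Ma" or "ka", then "BP".
AGE_PATTERN = re.compile(
    r"\d+(?:\.\d+)?"                        # first number
    r"(?:\s(?:-{1,2}|to)\s\d+(?:\.\d+)?)?"  # optional end of a range
    r"\s(?:[a-zA-Z]+\s)?BP\b"               # optional unit word, then BP
)


def find_ages(text):
    """Return AGE spans for ages or age ranges ending in 'BP'."""
    return [
        {"start": m.start(), "end": m.end(), "label": ["AGE"], "text": m.group()}
        for m in AGE_PATTERN.finditer(text)
    ]


print(find_ages("1234 BP and 456 to 789 Ma BP"))
```

Because the unit word is matched loosely as any run of letters, this sketch would also accept phrases like "1234 years BP"; a stricter version could enumerate the allowed units.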

In [None]:
test_sentences = [
    "1234 BP",
    "1234 Ma BP",
    "1234 to 1235 BP",
    "1234 - 1235 BP",
    "1234 -- 1235 BP",
    "1234 BP and 456 to 789 BP",
    "1234 BP and 456 to 789 Ma BP",
]

expected_results = [
    [{'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}],
    [{'start': 0, 'end': 10, 'label': ['AGE'], 'text': '1234 Ma BP'}],
    [{'start': 0, 'end': 15, 'label': ['AGE'], 'text': '1234 to 1235 BP'}],
    [{'start': 0, 'end': 14, 'label': ['AGE'], 'text': '1234 - 1235 BP'}],
    [{'start': 0, 'end': 15, 'label': ['AGE'], 'text': '1234 -- 1235 BP'}],
    [
        {'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}, 
        {'start': 12, 'end': 25, 'label': ['AGE'], 'text': '456 to 789 BP'}
    ],
    [
        {'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}, 
        {'start': 12, 'end': 28, 'label': ['AGE'], 'text': '456 to 789 Ma BP'}
    ],
]

In [None]:
# test that all the test sentences are extracted correctly
for i, sentence in enumerate(test_sentences):

    extracted_ages = extract_age(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Got: {extracted_ages}\n")
    assert extracted_ages == expected_results[i]

## Altitude - ALTI

The primary indicators used to identify altitude descriptions are:
- "above sea level"
- "a.s.l."
- a single `m` immediately after a number or as a standalone word
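One way these indicators could be combined into a single pattern is sketched below. `ALTITUDE_PATTERN` and `find_altitudes` are hypothetical names, and the "asl" variant is included because it appears in the test sentences; the actual `extract_altitude` may be written differently.

```python
import re

# Hypothetical pattern: a number, the unit "m" (attached or standalone),
# then one of the sea-level indicators.
ALTITUDE_PATTERN = re.compile(
    r"\d+(?:\.\d+)?\s?m\s(?:above sea level|a\.s\.l\.|asl\b)"
)


def find_altitudes(text):
    """Return ALTI spans for altitude descriptions found in text."""
    return [
        {"start": m.start(), "end": m.end(), "label": ["ALTI"], "text": m.group()}
        for m in ALTITUDE_PATTERN.finditer(text)
    ]


print(find_altitudes("First site was 120m asl and the second was 300 m asl"))
```

Requiring a sea-level indicator after the `m` keeps plain lengths such as "120 m long" from being labelled as altitudes.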

In [None]:
test_sentences = [
    "120m above sea level",
    "120m a.s.l.",
    "120 m above sea level",
    "120 m a.s.l.",
    "120m asl",
    "120 m asl",
    "The site was 120m above sea level",
    "The site was 120m a.s.l.",
    "The site was 120 m above sea level",
    "The site was 120 m a.s.l.",
    "First site was 120m asl and the second was 300 m asl",
]

expected_results = [
    [{'start': 0, 'end': 20, 'label': ['ALTI'], 'text': '120m above sea level'}],
    [{'start': 0, 'end': 11, 'label': ['ALTI'], 'text': '120m a.s.l.'}],
    [{'start': 0, 'end': 21, 'label': ['ALTI'], 'text': '120 m above sea level'}],
    [{'start': 0, 'end': 12, 'label': ['ALTI'], 'text': '120 m a.s.l.'}],
    [{'start': 0, 'end': 8, 'label': ['ALTI'], 'text': '120m asl'}],
    [{'start': 0, 'end': 9, 'label': ['ALTI'], 'text': '120 m asl'}],
    [{'start': 13, 'end': 33, 'label': ['ALTI'], 'text': '120m above sea level'}],
    [{'start': 13, 'end': 24, 'label': ['ALTI'], 'text': '120m a.s.l.'}],
    [{'start': 13, 'end': 34, 'label': ['ALTI'], 'text': '120 m above sea level'}],
    [{'start': 13, 'end': 25, 'label': ['ALTI'], 'text': '120 m a.s.l.'}],
    [
        {'start': 15, 'end': 23, 'label': ['ALTI'], 'text': '120m asl'},
        {'start': 43, 'end': 52, 'label': ['ALTI'], 'text': '300 m asl'}
    ]
]

In [None]:
# test that all the test sentences are extracted correctly
for i, sentence in enumerate(test_sentences):

    extracted_altitude = extract_altitude(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Found: {extracted_altitude}\n")
    assert extracted_altitude == expected_results[i]

## Email Addresses - EMAIL

Regex patterns for identifying email addresses already exist. The one used below was sourced from this Stack Overflow thread:
- https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression
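For illustration only, a much simpler pattern than the one discussed in that thread is sketched here. `SIMPLE_EMAIL` and `find_emails` are assumptions; the pattern does not cover the full address grammar and is not the one used by `extract_email`.

```python
import re

# Hypothetical, deliberately simple pattern: local part, "@", domain, dot, TLD.
SIMPLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return EMAIL spans for email-like strings found in text."""
    return [
        {"start": m.start(), "end": m.end(), "label": ["EMAIL"], "text": m.group()}
        for m in SIMPLE_EMAIL.finditer(text)
    ]


print(find_emails("Contact: john.smith@aol.com or ty.andrews@student.ubc.ca"))
```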

In [None]:
test_sentences = [
    "ty.elgin.andrews@gmail.com",
    "john.smith@aol.com",
    "ty.andrews@student.ubc.ca",
    # from GDD 54b4324ae138239d8684a37b segment 0
    "E-mail addresses : carina.hoorn@milne.cc (C. Hoorn -) mauro.cremaschi@libero.it"
]

expected_results = [
    [{'start': 0, 'end': 26, 'label': ['EMAIL'], 'text': 'ty.elgin.andrews@gmail.com'}],
    [{'start': 0, 'end': 18, 'label': ['EMAIL'], 'text': 'john.smith@aol.com'}],
    [{'start': 0, 'end': 25, 'label': ['EMAIL'], 'text': 'ty.andrews@student.ubc.ca'}],
    [
        {'start': 19, 'end': 40, 'label': ['EMAIL'], 'text': 'carina.hoorn@milne.cc'},
        {'start': 54, 'end': 79, 'label': ['EMAIL'], 'text': 'mauro.cremaschi@libero.it'}
    ]
]

In [None]:
for i, sentence in enumerate(test_sentences):

    extracted_emails = extract_email(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Found: {extracted_emails}\n")
    assert extracted_emails == expected_results[i]

# Evaluation of Baseline Methods

Several methods are used to evaluate the NER task for a given report:
1. Does the number of instances of each category align with what is present in Neotoma?
   1. For each article, pull out the number of instances of each category for the given DOI/GDD ID.
2. Does the number of instances of each category align with what has been labelled?
   1. Using the team-labelled reports, count the number of instances of each category for each report.
3. Do the extracted locations align with the labelled locations?
   1. This can be done with the Python package `seqeval`, which evaluates the performance of sequence labeling tasks ([`seqeval` GitHub](https://github.com/chakki-works/seqeval)).

## Loading Raw & Labelled Data

In [None]:
from collections import defaultdict

labelled_file_path = os.path.join(os.getcwd(), os.pardir, "data", "labelled", "labelling-labelled")

## Comparison to Neotoma Data



## Comparison of Entity Counts to Labelled Data

In [None]:
labelled_files = os.listdir(labelled_file_path)

# count the number of occurrences of each label
annotated_label_counts = defaultdict(int)
baseline_label_counts = defaultdict(int)

for file in labelled_files:
    
    with open(os.path.join(labelled_file_path, file), "r") as f:
        task = json.load(f)

    raw_text = task['task']['data']['text']
    
    # get the baseline annotations
    baseline_result = baseline_extract_all(raw_text)
    for baseline in baseline_result:
        baseline_label_counts[baseline['label'][0]] += 1

    annotation_result = task['result']
    
    for annotation in annotation_result:
        annotated_label_counts[annotation['value']['labels'][0]] += 1

In [None]:
# plot the percentage of entities extracted by baseline vs annotated
annotated_labels = list(annotated_label_counts.keys())
annotated_counts = list(annotated_label_counts.values())

baseline_counts = [baseline_label_counts[label] for label in annotated_labels]

annotated_counts = np.array(annotated_counts)
baseline_counts = np.array(baseline_counts)

# make into a tidy dataframe with columns Label, Source, Count
annotated_df = pd.DataFrame(
    {
        'Label': annotated_labels + annotated_labels,
        'Source': ['Annotated'] * len(annotated_labels) + ['Baseline'] * len(annotated_labels),
        'Count': np.concatenate([annotated_counts, baseline_counts])
    }
)

fig = px.bar(
    annotated_df,
    x="Label",
    y="Count",
    color="Source",
    barmode='group',
    # labels={'x': 'Labels', 'value': 'No. of Entities'},
    title='Counts of Labels in Annotated and Baseline Results',
    width=800,
).update_layout(
    xaxis={'categoryorder': 'total descending'}, 
    margin={'l': 0, 'r': 0, 't': 50, 'b': 0}, 
)

fig.show()

In [None]:
percentage_extracted = baseline_counts / annotated_counts

fig = px.bar(
    x=annotated_labels, 
    y=percentage_extracted, 
    color=annotated_labels,
    labels={'x': 'Labels', 'y': 'Percentage Extracted'},
    title='Percentage of Entities Extracted by Baseline vs Annotated',
    width=800,
    text=np.round(percentage_extracted, 3)*100,
    hover_name=annotated_labels,
).update_layout(
    xaxis={'categoryorder': 'total descending'},
    margin={'l': 0, 'r': 0, 't': 50, 'b': 0},
    yaxis={'tickformat': ',.0%'},
    showlegend=False,
    yaxis_range=[0, 1],
)
fig.show()


## Calculating Precision, Recall and F1 Scores with Seqeval

The Python package `seqeval` is used to evaluate the performance of sequence labeling tasks ([`seqeval` GitHub](https://github.com/chakki-works/seqeval)).

It requires the following inputs:
- `y_true` - a list of true labels of each token
- `y_pred` - a list of predicted labels of each token
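To make the input shape concrete, the sketch below builds a small `y_true`/`y_pred` pair as lists of per-document label lists and computes token-level precision, recall and F1 by hand. `token_scores` is a hypothetical helper written for this illustration. `seqeval` itself scores whole entities rather than single tokens, so its numbers can differ from this simplified version.

```python
def token_scores(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one entity label."""
    # Flatten the per-document lists and strip any B-/I- prefix.
    true = [t.split("-")[-1] for doc in y_true for t in doc]
    pred = [p.split("-")[-1] for doc in y_pred for p in doc]
    pairs = list(zip(true, pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One document with five tokens; the prediction misses the second AGE token.
y_true = [["O", "B-AGE", "I-AGE", "O", "B-TAXA"]]
y_pred = [["O", "B-AGE", "O", "O", "B-TAXA"]]
print(token_scores(y_true, y_pred, "AGE"))
```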

To take the labelled data from Label Studio and convert it into the required format, we need to:
1. Load the labelled data
2. Convert the labelled data into a list of lists of labels with 'O' for tokens that are not part of an entity
3. Convert the baseline predictions into a list of lists of labels with 'O' for tokens that are not part of an entity
4. Calculate the precision, recall and f1 scores for each entity type
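Steps 2 and 3 can be sketched as a conversion from character-offset spans to one label per whitespace-separated token. `spans_to_token_labels` is a hypothetical helper that assumes the Label Studio style `labels` key and a B-/I- tagging scheme; the notebook's own `get_token_labels` may tokenise or tag differently.

```python
def spans_to_token_labels(entities, text):
    """Label each whitespace-separated token with B-/I-<label> or 'O'."""
    labels = []
    position = 0
    for token in text.split():
        start = text.index(token, position)  # character offset of this token
        end = start + len(token)
        position = end
        tag = "O"
        for entity in entities:
            # A token belongs to an entity when it lies inside the entity span.
            if start >= entity["start"] and end <= entity["end"]:
                prefix = "B" if start == entity["start"] else "I"
                tag = f"{prefix}-{entity['labels'][0]}"
                break
        labels.append(tag)
    return labels


example_text = "The site was 120m above sea level and 1234 BP"
example_entities = [
    {"start": 13, "end": 33, "labels": ["ALTI"], "text": "120m above sea level"},
    {"start": 38, "end": 45, "labels": ["AGE"], "text": "1234 BP"},
]
print(spans_to_token_labels(example_entities, example_text))
```

The result is a flat list for a single text, so it would need to be wrapped in an outer list (one inner list per document) before being passed to `seqeval`.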

### Load Labelled Data

In [None]:
labelled_file_path = os.path.join(os.getcwd(), os.pardir, "data", "labelled", "labelling-labelled")

In [None]:
def load_json_label_files(labelled_files: list):
    """
    Load the JSON files containing the labelled data and combine the text
    into a complete text string.

    Parameters
    ----------
    labelled_files : list
        List of JSON file names containing the labelled data.

    Returns
    -------
    combined_text : str
        The combined text from all the files.
    all_labelled_entities : list
        List of all the labelled entities re-indexed to account for the combined text.

    """

    combined_text = ""
    all_labelled_entities = []
    for file in labelled_files:
        
        with open(os.path.join(labelled_file_path, file), "r") as f:
            task = json.load(f)

        raw_text = task['task']['data']['text']

        annotation_result = task['result']
        labelled_entities = [annotation['value'] for annotation in annotation_result]

        # add the current text length to the start and end indices of labels plus one for the space
        for entity in labelled_entities:
            entity['start'] += len(combined_text)
            entity['end'] += len(combined_text)

        all_labelled_entities += labelled_entities

        # add the current text to the combined text with space in between
        combined_text += raw_text + " "

    return combined_text, all_labelled_entities
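The re-indexing logic can be checked in isolation with in-memory data. The sketch below is a simplified, hypothetical `combine_documents` that takes `(text, entities)` pairs instead of reading JSON files, and verifies that each shifted span still slices out its original text.

```python
def combine_documents(documents):
    """Concatenate (text, entities) pairs and shift entity offsets to match."""
    combined_text = ""
    combined_entities = []
    for text, entities in documents:
        offset = len(combined_text)  # already includes the separating spaces
        for entity in entities:
            # Copy rather than mutate so the input entities stay unchanged.
            shifted = {**entity, "start": entity["start"] + offset, "end": entity["end"] + offset}
            combined_entities.append(shifted)
        combined_text += text + " "
    return combined_text, combined_entities


docs = [
    ("Dated to 1234 BP", [{"start": 9, "end": 16, "labels": ["AGE"], "text": "1234 BP"}]),
    ("Found Pediastrum", [{"start": 6, "end": 16, "labels": ["TAXA"], "text": "Pediastrum"}]),
]
text, entities = combine_documents(docs)
for entity in entities:
    # Every shifted span should still point at its own text.
    assert text[entity["start"]:entity["end"]] == entity["text"]
```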

### Simplified Test of Evaluation

The following test case checks the evaluation with the `seqeval` package and ensures that the steps to wrangle the data into the required format are correct.

The expected result is a precision, recall, F1 and accuracy of 1.0 for all entity types.

In [None]:
test_text = "The site was 120m above sea level and 1234 BP and found Pediastrum"
test_labelled_entities = [
    {'start': 13, 'end': 33, 'labels': ['ALTI'], 'text': '120m above sea level'},
    {'start': 38, 'end': 45, 'labels': ['AGE'], 'text': '1234 BP'},
    {'start': 56, 'end': 66, 'labels': ['TAXA'], 'text': 'Pediastrum'}
]

In [None]:
# check the token labels and indexes match
test_labelled_tokens = get_token_labels(test_labelled_entities, test_text)

split_text = test_text.split()

# check that the token labels are correct
for i, token in enumerate(split_text):
    print(f"{token}: {test_labelled_tokens[i]}")

In [None]:
test_baseline_entities = baseline_extract_all(test_text)
test_baseline_tokens = get_token_labels(test_baseline_entities, test_text)

In [None]:
# calculate the metrics for the test case
accuracy, f1, recall = calculate_entity_classification_metrics(
    test_labelled_tokens, test_baseline_tokens
)

print(f"Accuracy: {accuracy}")
print(f"F1: {f1}")
print(f"Recall: {recall}")

In [None]:
plot_token_classification_report(
    test_labelled_tokens, 
    test_baseline_tokens, 
    "Test of Entity Extraction Classification Report"
)

### Evaluation of All Labelled Files

The following code reads in all labelled files and evaluates the performance of the baseline models across all the text at once. A future improvement would be to analyze individual research papers one at a time.

In [None]:
labelled_files = os.listdir(labelled_file_path)

# load the labelled data
combined_text, labelled_entities = load_json_label_files(labelled_files)

# extract the baseline entities
baseline_entities = baseline_extract_all(combined_text)

In [None]:
labelled_tokens = get_token_labels(labelled_entities, combined_text)
baseline_tokens = get_token_labels(baseline_entities, combined_text)

In [None]:
# check that both lists are the same length as they should be split identically
len(labelled_tokens), len(baseline_tokens)

In [None]:
accuracy, f1, recall = calculate_entity_classification_metrics(
    labelled_tokens, baseline_tokens
)

print(f"Accuracy: {accuracy}")
print(f"F1: {f1}")
print(f"Recall: {recall}")

In [None]:
plot_token_classification_report(
    labelled_tokens, 
    baseline_tokens, 
    title="Classification Report for Baseline Entity Extraction"
)