# Fossil Data Extraction Baselines

This notebook sets up, runs and evaluates the baseline models for the fossil data extraction task.

The data and baseline approaches are as follows:

| **Entity Name**            | **Baseline Approach**                                              |
|:---:|:---|
| Geographic Location - GEOG | Regular Expressions (Goring et al. 2021)                           |
| Site Name - SITE           | spaCy Pre-Trained NER model identifying location entities |
| Taxa - TAXA                | In-text search for existing taxa already in Neotoma                |
| Age - AGE                  | Regular Expressions (Goring et al. 2021)                           |
| Altitude - ALTI            | Regular Expressions ("above sea level", "a.s.l.")                  |
| Email Address(es) - EMAIL  | Regular Expressions                                                |


In [1]:
import os
import re
import sys

# ensure that the parent directory is on the path for relative imports
sys.path.append(os.path.join(os.path.abspath(''), ".."))

from src.entity_extraction.baseline_entity_extraction import (
    extract_geographic_coordinates,
    extract_site_names,
    extract_taxa,
    extract_age,
    extract_altitude,
    extract_email,
)

%load_ext autoreload
%autoreload 2

## Geographic Location - GEOG

## Site Name - SITE

## Taxa - TAXA

## Age - AGE

Sample ages are reported in the literature in a variety of formats. The most common are:
- years BP - before present
- kyr BP - thousands of years BP
- ka BP - kilo annum BP
- a BP - annum BP
- Ma BP - million years BP
- YBP - years BP
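
To make the unit prefixes above concrete, here is a minimal sketch that converts a reported age to plain years BP. The `UNIT_TO_YEARS` table and the `to_years_bp` helper are hypothetical names introduced for illustration; they are not part of the project's `baseline_entity_extraction` module.

```python
# Illustrative multipliers for converting a reported age to years BP.
# This table is an assumption for the sketch, not part of the project code.
UNIT_TO_YEARS = {
    "years": 1,
    "a": 1,
    "kyr": 1_000,
    "ka": 1_000,
    "Ma": 1_000_000,
}


def to_years_bp(value, unit="years"):
    """Convert a numeric age in the given unit to years BP."""
    return value * UNIT_TO_YEARS[unit]
```

For example, `to_years_bp(12.5, "ka")` gives `12500.0`. The lookup is case-sensitive on purpose so that `Ma` (million years) is never confused with a lowercase variant.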

Neotoma has three age columns: ageold, agetype and ageyoung.

- agetype: Age type or units. Includes the following:
  - Calendar years AD/BC
  - Calendar years BP
  - Calibrated radiocarbon years BP
  - Radiocarbon years BP
  - Varve years BP

The baseline solution, based on Goring et al. 2021, uses regular expressions to:
1. Identify the age entity in the sentence - `" BP "`
2. Determine if it is a range of dates - `"(\\d+(?:[.]\\d+)*) ((?:- {1,2})|(?:to)) (\\d+(?:[.]\\d+)*) ([a-zA-Z]+,BP)"`
3. Extract the age entity from the sentence - `"(\\d+(?:[.]\\d+)*),((?:- {1,2})|(?:to)),(\\d+(?:[.]\\d+)*),([a-zA-Z]+,BP),"`
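
The three steps above can be approximated with a single Python pattern. The sketch below is a simplified stand-in, not the implementation behind `extract_age`: `AGE_PATTERN` and `find_ages` are hypothetical names, and the pattern only covers ages that end in a standalone `BP` (so `YBP` is not handled).

```python
import re

# Hypothetical pattern for illustration; not the project's extract_age regex.
AGE_PATTERN = re.compile(
    r"\d+(?:\.\d+)?"                            # first number, e.g. 1234 or 12.5
    r"(?:\s*(?:-{1,2}|to)\s*\d+(?:\.\d+)?)?"    # optional range: "- 1235", "to 1235"
    r"\s*(?:[a-zA-Z]+\s+)?"                     # optional unit word, e.g. "Ma "
    r"BP\b"                                     # the BP marker itself
)


def find_ages(text):
    """Return AGE spans in the same dict format the notebook asserts against."""
    return [
        {"start": m.start(), "end": m.end(), "label": ["AGE"], "text": m.group()}
        for m in AGE_PATTERN.finditer(text)
    ]
```

Using `finditer` lets one sentence yield several ages, and `m.start()` / `m.end()` give the character offsets directly, so no separate "is it a range" pass is needed in this simplified version.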

In [2]:
test_sentences = [
    "1234 BP",
    "1234 Ma BP",
    "1234 to 1235 BP",
    "1234 - 1235 BP",
    "1234 -- 1235 BP",
    "1234 BP and 456 to 789 BP",
    "1234 BP and 456 to 789 Ma BP",
]

expected_results = [
    [{'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}],
    [{'start': 0, 'end': 10, 'label': ['AGE'], 'text': '1234 Ma BP'}],
    [{'start': 0, 'end': 15, 'label': ['AGE'], 'text': '1234 to 1235 BP'}],
    [{'start': 0, 'end': 14, 'label': ['AGE'], 'text': '1234 - 1235 BP'}],
    [{'start': 0, 'end': 15, 'label': ['AGE'], 'text': '1234 -- 1235 BP'}],
    [
        {'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}, 
        {'start': 12, 'end': 25, 'label': ['AGE'], 'text': '456 to 789 BP'}
    ],
    [
        {'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}, 
        {'start': 12, 'end': 28, 'label': ['AGE'], 'text': '456 to 789 Ma BP'}
    ],
]

In [4]:
# test that all the test sentences are extracted correctly
for i, sentence in enumerate(test_sentences):

    extracted_ages = extract_age(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Got: {extracted_ages}\n")
    assert extracted_ages == expected_results[i]

Testing sentence: 1234 BP
Got: [{'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}]

Testing sentence: 1234 Ma BP
Got: [{'start': 0, 'end': 10, 'label': ['AGE'], 'text': '1234 Ma BP'}]

Testing sentence: 1234 to 1235 BP
Got: [{'start': 0, 'end': 15, 'label': ['AGE'], 'text': '1234 to 1235 BP'}]

Testing sentence: 1234 - 1235 BP
Got: [{'start': 0, 'end': 14, 'label': ['AGE'], 'text': '1234 - 1235 BP'}]

Testing sentence: 1234 -- 1235 BP
Got: [{'start': 0, 'end': 15, 'label': ['AGE'], 'text': '1234 -- 1235 BP'}]

Testing sentence: 1234 BP and 456 to 789 BP
Got: [{'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}, {'start': 12, 'end': 25, 'label': ['AGE'], 'text': '456 to 789 BP'}]

Testing sentence: 1234 BP and 456 to 789 Ma BP
Got: [{'start': 0, 'end': 7, 'label': ['AGE'], 'text': '1234 BP'}, {'start': 12, 'end': 28, 'label': ['AGE'], 'text': '456 to 789 Ma BP'}]



## Altitude - ALTI

The primary indicators of an altitude description are:
- "above sea level"
- "a.s.l."
- a single `m` directly after a number or as a standalone word following one
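
As a rough illustration of how these indicators combine, the sketch below matches a number, a metre unit, and a sea-level phrase. `ALTI_PATTERN` and `find_altitudes` are hypothetical names, and this is an assumption about one workable pattern rather than the regex used by `extract_altitude`. It deliberately requires the sea-level phrase so that plain lengths such as "a 5 m core" are not matched.

```python
import re

# Hypothetical pattern for illustration; not the project's extract_altitude regex.
ALTI_PATTERN = re.compile(
    r"\d+(?:\.\d+)?"                                # the altitude value
    r"\s*m\s+"                                      # "m" attached to or after the number
    r"(?:above sea level|a\.s\.l\.|asl\b)"          # sea-level marker
)


def find_altitudes(text):
    """Return ALTI spans with character offsets, one dict per match."""
    return [
        {"start": m.start(), "end": m.end(), "label": ["ALTI"], "text": m.group()}
        for m in ALTI_PATTERN.finditer(text)
    ]
```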

In [5]:
test_sentences = [
    "120m above sea level",
    "120m a.s.l.",
    "120 m above sea level",
    "120 m a.s.l.",
    "120m asl",
    "120 m asl",
    "The site was 120m above sea level",
    "The site was 120m a.s.l.",
    "The site was 120 m above sea level",
    "The site was 120 m a.s.l.",
    "First site was 120m asl and the second was 300 m asl",
]

expected_results = [
    [{'start': 0, 'end': 20, 'label': ['ALTI'], 'text': '120m above sea level'}],
    [{'start': 0, 'end': 11, 'label': ['ALTI'], 'text': '120m a.s.l.'}],
    [{'start': 0, 'end': 21, 'label': ['ALTI'], 'text': '120 m above sea level'}],
    [{'start': 0, 'end': 12, 'label': ['ALTI'], 'text': '120 m a.s.l.'}],
    [{'start': 0, 'end': 8, 'label': ['ALTI'], 'text': '120m asl'}],
    [{'start': 0, 'end': 9, 'label': ['ALTI'], 'text': '120 m asl'}],
    [{'start': 13, 'end': 33, 'label': ['ALTI'], 'text': '120m above sea level'}],
    [{'start': 13, 'end': 24, 'label': ['ALTI'], 'text': '120m a.s.l.'}],
    [{'start': 13, 'end': 34, 'label': ['ALTI'], 'text': '120 m above sea level'}],
    [{'start': 13, 'end': 25, 'label': ['ALTI'], 'text': '120 m a.s.l.'}],
    [
        {'start': 15, 'end': 23, 'label': ['ALTI'], 'text': '120m asl'},
        {'start': 43, 'end': 52, 'label': ['ALTI'], 'text': '300 m asl'}
    ]
]

In [6]:
# test that all the test sentences are extracted correctly
for i, sentence in enumerate(test_sentences):

    extracted_altitude = extract_altitude(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Found: {extracted_altitude}\n")
    assert extracted_altitude == expected_results[i]

Testing sentence: 120m above sea level
Found: [{'start': 0, 'end': 20, 'label': ['ALTI'], 'text': '120m above sea level'}]

Testing sentence: 120m a.s.l.
Found: [{'start': 0, 'end': 11, 'label': ['ALTI'], 'text': '120m a.s.l.'}]

Testing sentence: 120 m above sea level
Found: [{'start': 0, 'end': 21, 'label': ['ALTI'], 'text': '120 m above sea level'}]

Testing sentence: 120 m a.s.l.
Found: [{'start': 0, 'end': 12, 'label': ['ALTI'], 'text': '120 m a.s.l.'}]

Testing sentence: 120m asl
Found: [{'start': 0, 'end': 8, 'label': ['ALTI'], 'text': '120m asl'}]

Testing sentence: 120 m asl
Found: [{'start': 0, 'end': 9, 'label': ['ALTI'], 'text': '120 m asl'}]

Testing sentence: The site was 120m above sea level
Found: [{'start': 13, 'end': 33, 'label': ['ALTI'], 'text': '120m above sea level'}]

Testing sentence: The site was 120m a.s.l.
Found: [{'start': 13, 'end': 24, 'label': ['ALTI'], 'text': '120m a.s.l.'}]

Testing sentence: The site was 120 m above sea level
Found: [{'start': 13, 'en

## Email Addresses - EMAIL

Well-established regex patterns already exist for identifying email addresses. The one used below was sourced from this Stack Overflow thread:
- https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression
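
For readers who want to see the general shape of such a pattern, here is a deliberately simplified sketch. It is not the pattern from the thread above, which follows the address grammar much more closely, and `EMAIL_PATTERN` and `find_emails` are hypothetical names that do not come from the project code.

```python
import re

# Simplified, hypothetical pattern: local part, "@", then dot-separated domain labels.
# It will miss unusual but valid addresses (e.g. quoted local parts).
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+"            # local part
    r"@"
    r"[A-Za-z0-9-]+"                # first domain label
    r"(?:\.[A-Za-z0-9-]+)+"         # one or more further labels, e.g. ".ubc.ca"
)


def find_emails(text):
    """Return EMAIL spans with character offsets, one dict per match."""
    return [
        {"start": m.start(), "end": m.end(), "label": ["EMAIL"], "text": m.group()}
        for m in EMAIL_PATTERN.finditer(text)
    ]
```

Because every domain label after a dot must contain at least one character, a sentence-ending period directly after an address is left out of the match.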

In [7]:
test_sentences = [
    "ty.elgin.andrews@gmail.com",
    "john.smith@aol.com",
    "ty.andrews@student.ubc.ca",
    # from GGD 54b4324ae138239d8684a37b segment 0
    "E-mail addresses : carina.hoorn@milne.cc (C. Hoorn -) mauro.cremaschi@libero.it"
]

expected_results = [
    [{'start': 0, 'end': 26, 'label': ['EMAIL'], 'text': 'ty.elgin.andrews@gmail.com'}],
    [{'start': 0, 'end': 18, 'label': ['EMAIL'], 'text': 'john.smith@aol.com'}],
    [{'start': 0, 'end': 25, 'label': ['EMAIL'], 'text': 'ty.andrews@student.ubc.ca'}],
    [
        {'start': 19, 'end': 40, 'label': ['EMAIL'], 'text': 'carina.hoorn@milne.cc'},
        {'start': 54, 'end': 79, 'label': ['EMAIL'], 'text': 'mauro.cremaschi@libero.it'}
    ]
]

In [8]:
for i, sentence in enumerate(test_sentences):

    extracted_emails = extract_email(sentence)

    print(f"Testing sentence: {sentence}")
    print(f"Found: {extracted_emails}\n")
    assert extracted_emails == expected_results[i]

Testing sentence: ty.elgin.andrews@gmail.com
Found: [{'start': 0, 'end': 26, 'label': ['EMAIL'], 'text': 'ty.elgin.andrews@gmail.com'}]

Testing sentence: john.smith@aol.com
Found: [{'start': 0, 'end': 18, 'label': ['EMAIL'], 'text': 'john.smith@aol.com'}]

Testing sentence: ty.andrews@student.ubc.ca
Found: [{'start': 0, 'end': 25, 'label': ['EMAIL'], 'text': 'ty.andrews@student.ubc.ca'}]

Testing sentence: E-mail addresses : carina.hoorn@milne.cc (C. Hoorn -) mauro.cremaschi@libero.it
Found: [{'start': 19, 'end': 40, 'label': ['EMAIL'], 'text': 'carina.hoorn@milne.cc'}, {'start': 54, 'end': 79, 'label': ['EMAIL'], 'text': 'mauro.cremaschi@libero.it'}]

