In [1]:
# Trait-Mediated Interaction Modification
# Empirical:
# You could look for words like "hypothesis", "experiment", "found", and "discovered". That may point
# towards there being an experiment in the paper. There are also words like "control group", "compared",
# "findings", "results", "study", and more.
# Qualitative vs. Quantitative:
# To infer whether something is quantitative, you could look for numeric tokens and units.
# However, you can only do so much with the abstract. Therefore, this is likely not good enough.
# Yet, you could still take advantage of words like "fewer" and "increased" to show that there is a change.
# However, this would be more suited for the above category.
# Traits:
# There is no NLP tool for traits that I can use or create so I think that I could instead use keywords.
# For example, "snail feeding rates" is a trait. You may be able to spot this by looking for a word like
# "rate". You'd expand that word to include "snail feeding rates". As "snail" is a species you can infer
# that "rates" is a trait. I would be more decisive and use a dependency parser to ensure that the trait
# is a property of the species (like before). However, with all the cases that may exist, I think checking
# to see whether a species can be found by traveling back and/or forward without finding certain tokens could
# work well enough.
# 3 Species or More:
# This is simple. However, I think using a dictionary and TaxoNerd would be beneficial (for higher accuracy).
# To handle the potential differences in tokenization, character offsets should be used.
# Standardization:
# There is a lot of variance in the scores. To squash this issue, I think that we could assign each sentence
# a value from 0 to 1. We would add these values and divide by the number of sentences. This would result in
# a number that is also from 0 to 1. However, there are categories that we would like to inspect. So, we must
# create an overall score in the interval from [0, 1] while also scoring each category. Well, for each sentence
# we could add a point for each category that is observed. The sentence would receive said score divided by the
# number of categories. At the end, we add up all the sentence scores and divide by the number of sentences.
# The aggregate score for each category would also be divided by the number of sentences.

In [2]:
import re
import csv
import sys
import time
import spacy
import numpy as np
import pandas as pd
import random
import pickle
from fastcoref import FCoref, LingMessCoref
from taxonerd import TaxoNERD
from spacy.matcher import Matcher
from spacy.matcher import DependencyMatcher, PhraseMatcher
from spacy.language import Language
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
%run -i "./utils.py"

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
class Help:
    def __init__(self, main):
        self.main = main
        # Zero Plurals
        # The singular and plural versions of the words below are the same.
        self.zero_plurals = [
            "species", 
            "deer", 
            "fish", 
            "moose", 
            "sheep", 
            "swine", 
            "buffalo", 
            "trout", 
            "cattle"
        ]
        # Irregular Nouns
        # There's not a defined conversion method.
        self.irregular_nouns = {
            "ox": "oxen",
            "goose": "geese",
            "mouse": "mice",
            "bacterium": "bacteria"
        }
        self.irregular_nouns_rev = {v: k for k, v in self.irregular_nouns.items()}
        self.irregular_singular_nouns = self.irregular_nouns.keys()
        self.irregular_plural_nouns = self.irregular_nouns.values()

    def remove_extra_spaces(self, string):
        # Remove Duplicate Spaces
        string = re.sub(r"\s+", " ", string)
        # Remove Spaces Before Punctuation
        string = re.sub(r"\s+([?.!,])", r"\1", string)
        # Remove Outside Spaces
        return string.strip()

    def remove_outer_non_alnum(self, string):
        while string:
            start_len = len(string)
            # Remove Leading Non-Alphanumeric Character
            if string and not string[0].isalnum():
                string = string[1:]
            # Remove Trailing Non-Alphanumeric Character
            if string and not string[-1].isalnum():
                string = string[:-1]
            # No Changes Made
            if start_len == len(string):
                break
        return string

    def group_text(self, text, flatten=False):
        # The parenthetical would be the content inside of a pair of
        # matching parentheses, brackets, or braces.
        parentheticals = []
        
        # This contains the text that's not inside of
        # parentheses and co.
        base_text = []
        
        # Used for building groups,
        # handles a nested structure.
        stacks = []
        
        # These are the characters we recognize
        # in terms of grouping.
        pairs = {
            "(": ")",
            "[": "]",
            "{": "}"
        }
        open_chars = pairs.keys()
        close_chars = pairs.values()
        
        # This contains the opening characters
        # of the groups that are currently open
        # (e.g. '(', '['). We use it so that we know
        # whether we open or close a group.
        opened = []
        
        for i, char in enumerate(text):
            # Opening Character
            if char in open_chars:
                stacks.append([])
                opened.append(char)
            # Closing Character
            elif opened and char == pairs.get(opened[-1], ""):
                parentheticals.append(stacks.pop())
                opened.pop()
            # Add Character to Group
            elif opened:
                stacks[-1].append(i)
            # Add Character to Ungrouped Text
            else:
                base_text.append(i)
        
        # If an opening character hasn't been closed,
        # we just close all the remaining opened groups.
        # This is moreso a problem regarding the text.
        while stacks:
            parentheticals.append(stacks.pop())
            
        # Merge
        groups = [*parentheticals, base_text]
        tuple_groups = []
        for group in groups:
            if not group:
                continue
            
            tuples = [[group[0], group[0] + 1]]
            for index in group[1:]:
                if tuples[-1][1] == index:
                    tuples[-1][1] = index + 1
                else:
                    tuples.append([index, index + 1])
            tuple_groups.append(tuples)
            
        if flatten:
            flattened_tuple_groups = []
            for tuple_group in tuple_groups:
                for tuple in tuple_group:
                    flattened_tuple_groups.append(tuple)
            tuple_groups = flattened_tuple_groups
        
        return tuple_groups

    def singularize(self, string):
        string = string.lower()
        
        # The string to singularize should not have any
        # non-alphanumeric characters at the end, or else
        # the algorithm will not work.
        words = re.split(r" ", string)

        if not words:
            return [string]

        # If the last word in the string is a zero plural
        # or a singular irregular noun, there's no changes
        # to make. For example, "red sheep" and "ox" are 
        # already singular.
        if (
            words[-1] in self.zero_plurals or 
            words[-1] in self.irregular_singular_nouns
        ):
            return [string]

        # If the last word in the string is an irregular
        # plural noun, we rely on a dictionary with the
        # corresponding mapping.
        if words[-1] in self.irregular_plural_nouns:
            words[-1] = self.irregular_nouns_rev[words[-1]]
            return [self.remove_extra_spaces(" ".join(words))]
        
        singulars = []

        # We take the singular form of the last word and
        # add it back in to the other words. As there could
        # be multiple forms (due to error), we need to
        # handle them all.
        singular_forms = self.singular_form(words[-1])

        if not singular_forms:
            return [string]
        
        for singular_form in singular_forms:
            singular = self.remove_extra_spaces(" ".join([*words[:-1], singular_form]))
            singulars.append(singular)
            
        return singulars

    def singular_form(self, string):
        versions = []

        # Change -ies to -y
        if re.fullmatch(r".*ies$", string):
            versions.append(f'{string[:-3]}y')
            return versions

        # Change -ves to -f and -fe
        if re.fullmatch(r".*ves$", string):
            versions.append(f'{string[:-3]}f')
            versions.append(f'{string[:-3]}fe')
            return versions

        # Remove -es 
        if re.fullmatch(r".*es$", string):
            versions.append(f'{string[:-2]}')
            return versions

        # Change -i to -us
        if re.fullmatch(r".*i$", string):
            versions.append(f'{string[:-1]}us')
            return versions

        # Remove -s
        if re.fullmatch(r".*s$", string):
            versions.append(f'{string[:-1]}')
            return versions

        return versions

    def pluralize(self, string):
        string = string.lower()
        
        # The string to pluralize should not have any
        # non-alphanumeric characters at the end, or else
        # the algorithm will not work.
        words = re.split(r" ", string)

        if not words:
            return [string]

        # If the last word in the string is a zero plural
        # or a plural irregular noun, there's no changes
        # to make. For example, "red sheep" and "oxen" are 
        # already singular.
        if (
            words[-1] in self.zero_plurals or 
            words[-1] in self.irregular_plural_nouns
        ):
            return [string]

        # If the last word in the string is an irregular
        # singular noun, we rely on a dictionary with the
        # corresponding mapping.
        if words[-1] in self.irregular_singular_nouns:
            words[-1] = self.irregular_nouns[words[-1]]
            return [self.remove_extra_spaces(" ".join(words))]
        
        plurals = []
        
        # We take the singular form of the last word and
        # add it back in to the other words. As there could
        # be multiple forms (due to error), we need to
        # handle them all.
        plural_forms = self.plural_form(words[-1])

        if not plural_forms:
            return [string]
            
        for plural_form in plural_forms:
            plural = self.remove_extra_spaces(" ".join([*words[:-1], plural_form]))
            plurals.append(plural)
            
        return plurals
        
    def plural_form(self, string):
        versions = []

        # Words that end with -us often have
        # two different plural versions: -es and -i.
        # For example, the plural version of cactus 
        # can be cactuses or cacti.
        if re.fullmatch(r".*us$", string):
            versions.append(f'{string}es')
            versions.append(f'{string[:-2]}i')
            return versions

        # The -es ending is added to the words below.
        if re.fullmatch(r".*([^l]s|sh|ch|x|z)$", string):
            versions.append(f'{string}es')
            return versions

        # Words that end with a consonant followed by 'y'
        # are made plural by replacing the 'y' with -ies.
        # For example, the plural version of canary is
        # canaries.
        if re.fullmatch(r".*([^aeiou])(y)$", string):
            versions.append(f'{string[:-1]}ies')
            return versions
            
        # The plural version of words ending with -f
        # and -fe aren't clear. To be safe, I will add
        # both versions.
        if (re.fullmatch(r".*(f)(e?)$", string) and not re.fullmatch(r".*ff$", string)):
            last_clean = re.sub(r"(f)(e?)$", "", string)
            versions.append(f'{last_clean}fs')
            versions.append(f'{last_clean}ves')
            return versions

        # People add -s or -es to words that end with 'o'.
        # To be safe, both versions are added.
        if re.fullmatch(r".*([^aeiou])o$", string):
            versions.append(f'{string}s')
            versions.append(f'{string}es')
            return versions

        # If there's no -s at the end of the string and
        # the other cases didn't run, we add an -s.
        if re.fullmatch(r".*[^s]$", string):
            versions.append(f'{string}s')
        
        return versions

    def expand_unit(self, *, il_unit, ir_unit, il_boundary, ir_boundary, speech=[], literals=[], include=True, direction='BOTH', verbose=False):
        assert il_unit <= ir_unit
        if direction in ['BOTH', 'LEFT']:
            assert il_boundary <= il_unit
        if direction in ['BOTH', 'RIGHT']:
            assert ir_boundary >= ir_unit
        
        # Move Left
        if direction in ['BOTH', 'LEFT']:
            # The indices are inclusive, therefore, when 
            # the condition fails, il_unit will be equal
            # to il_boundary.
            while il_unit > il_boundary:
                # We assume that the current token is allowed,
                # and look to the token to the left.
                l_token = self.main.sp_doc[il_unit-1]

                # If the token is invalid, we stop expanding.
                in_set = l_token.pos_ in speech or l_token.lower_ in literals

                # Case 1: include=False, in_set=True
                # If we're not meant to include the defined tokens, and the
                # current token is in that set, we stop expanding.
                # Case 2: include=True, in_set=False
                # If we're meant to include the defined tokens, and the current
                # token is not in that set, we stop expanding.
                # Case 3: include=in_set
                # If we're meant to include the defined tokens, and the current
                # token is in that set, we continue expanding. If we're not meant
                # to include the defined tokens, and the current token is not
                # in that set, we continue expanding.
                if include ^ in_set:
                    break
                
                # Else, the left token is valid, and
                # we continue to expand.
                il_unit -= 1

        # Move Right
        if direction in ['BOTH', 'RIGHT']:
            # Likewise, when the condition fails,
            # ir_unit will be equal to the ir_boundary.
            # The ir_boundary is also inclusive.
            while ir_unit < ir_boundary:
                # Assuming that the current token is valid,
                # we look to the right to see if we can
                # expand.
                r_token = self.main.sp_doc[ir_unit+1]

                # If the token is invalid, we stop expanding.
                in_set = r_token.pos_ in speech or r_token.lower_ in literals
                if include ^ in_set:
                    break

                # Else, the token is valid and
                # we continue.
                ir_unit += 1

        assert il_unit >= il_boundary and ir_unit <= ir_boundary
        expanded_unit = self.main.sp_doc[il_unit:ir_unit+1]
        return expanded_unit

    def contract_unit(self, *, il_unit, ir_unit, speech=[], literals=[], include=True, direction='BOTH', verbose=False):
        if il_unit > ir_unit:
            print(f"il_unit of {il_unit} greater than ir_unit of {ir_unit}")
            return None
        
        # Move Right
        if direction in ['BOTH', 'LEFT']:
            while il_unit < ir_unit:
                # We must check if the current token
                # is not allowed. If it's not allowed,
                # we contract (remove).
                token = self.main.sp_doc[il_unit]

                # The token is invalid, thus we stop
                # contracting.
                # include = True means that we want the tokens that match
                # the speech and/or literals in the contracted unit.
                # include = False means that we don't want the tokens that
                # match the speech and/or literals in the contracted unit.
                # Case 1: include = True, in_set = True
                # We have a token that's meant to be included in the set.
                # However, we're contracting, which means we would end up
                # removing the token if we continue. Therefore, we break.
                # Case 2: include = False, in_set = False
                # We have a token that's not in the set which defines the
                # tokens that aren't meant to be included. Therefore, we 
                # have a token that is meant to be included. If we continue,
                # we would end up removing this token. Therefore, we break.
                # Default:
                # If we have a token that's in the set (in_set=True) of
                # tokens we're not supposed to include in the contracted 
                # unit (include=False), we need to remove it. Likewise, if
                # we have a token that's not in the set (in_set=False) of
                # tokens to include in the contracted unit (include=True),
                # we need to remove it.
                in_set = token.pos_ in speech or token.lower_ in literals
                if include == in_set:
                    break

                # The token is valid, thus we continue.
                il_unit += 1

        # Move Left      
        if direction in ['BOTH', 'RIGHT']:
            while ir_unit > il_unit:
                token = self.main.sp_doc[ir_unit]

                # The token is invalid and we
                # stop contracting.
                in_set = token.pos_ in speech or token.lower_ in literals
                if include == in_set:
                    break

                # The token is valid and we continue.
                ir_unit -= 1

        assert il_unit <= ir_unit
        contracted_unit = self.main.sp_doc[il_unit:ir_unit+1]
        return contracted_unit

    def find_unit_context(self, *, il_unit, ir_unit, il_boundary, ir_boundary, verbose=False):
        assert il_unit <= ir_unit
        assert il_boundary <= il_unit
        assert ir_boundary >= ir_unit
        
        # Caveat: Parentheticals
        # The context of a unit inside of parentheses should not
        # go farther than the boundaries of those parentheses.
        # However, we need to manually determine whether the unit
        # is in parentheses (or any set of the matching symbols
        # below).
        matching_puncts = {
            "[": "]", 
            "(": ")", 
            "-": "-", 
            "--": "--",
            "{": "}",
            ",": ","
        }
        
        # The opening symbols for group punctuation.
        opening_puncts = list(matching_puncts.keys())

        # The closing symbols for group punctuation.
        closing_puncts = list(matching_puncts.values())

        # Both the opening and closing symbols above.
        puncts = [*closing_puncts, *opening_puncts]

        # Look for Group Punctuation on the Left
        i = il_unit
        l_punct = None
        while i >= il_boundary:
            token = self.main.sp_doc[i]
            if token.lower_ in puncts and token.lower_ != ",":
                l_punct = token
                break
            i -= 1

        # Look for Group Punctuation on the Right
        i = ir_unit + 1 if l_punct and il_unit == ir_unit else ir_unit
        r_punct = None
        while i <= ir_boundary:
            token = self.main.sp_doc[i]
            if token.lower_ in puncts and token.lower_ != ",":
                r_punct = token
                break
            i += 1

        # If there's a group punctuation on the left
        # and right, and they match each other (e.g. '(' and ')'),
        # we return the text between the punctuations.
        parenthetical = l_punct and r_punct and matching_puncts.get(l_punct.lower_, '') == r_punct.text
        if parenthetical:
            return self.main.sp_doc[l_punct.i:r_punct.i+1]

        # As the unit is not a parenthetical, we will expand
        # outwards until we run into a stopping token. The exclude
        # list contains tokens that should be excluded from the
        # context. Currently, it will contain any parentheticals
        # that we run into.
        exclude = []

        # If a token's POS falls into these categories, we will
        # continue. If not, we stop expanding.
        speech = ["ADJ", "NOUN", "ADP", "ADV", "PART", "PROPN", "VERB", "PRON", "DET", "AUX", "PART", "SCONJ"]
        
        # Expand Left
        while il_unit > il_boundary:
            # Assuming that the current token is fine,
            # we look to the left.
            l_token = self.main.sp_doc[il_unit-1]

            # If it's a closing punctuation (e.g. ')', ']'),
            # we need to skip over whatever is contained in
            # that punctuation.
            if l_token.lower_ in closing_puncts:
                i = il_unit - 1
                
                token = self.main.sp_doc[i]
                exclude.append(token)

                # We continue until we reach the boundary or
                # we find the matching opening punctuation.
                opening_punct_found = matching_puncts.get(token.lower_, '') == l_token.lower_
                
                while i > il_boundary and (not opening_punct_found or (opening_punct_found and l_token == token)):
                    i -= 1
                    token = self.main.sp_doc[i]
                    exclude.append(token)
                    opening_punct_found = matching_puncts.get(token.lower_, '') == l_token.lower_
                
                exclude.append(token)

                # After we've gone past the parenthetical,
                # we can jump to the next position.
                il_unit = i
                continue
            # If it's not a closing punctuation, we check
            # whether it's a stopping token
            else:
                if l_token.pos_ not in speech:
                    break
                else:
                    il_unit -= 1

        # Expand Right
        while ir_unit < ir_boundary:
            # We're checking the token to the right
            # to see if we can expand or not.
            r_token = self.main.sp_doc[ir_unit+1]
            
            # If the token to the right is an opening
            # punctuation (e.g. '(', '['), we must skip
            # it, the parenthetical inside, and the
            # closing punctuation.
            if r_token.lower_ in opening_puncts:
                i = ir_unit + 1
                
                token = self.main.sp_doc[i]
                exclude.append(token)

                closing_punct_found = token.lower_ == matching_puncts.get(r_token.lower_, '')
                
                while i < ir_boundary and (not closing_punct_found or (closing_punct_found and r_token == token)):
                    i += 1
                    token = self.main.sp_doc[i]
                    exclude.append(token)
                    closing_punct_found = token.lower_ == matching_puncts.get(r_token.lower_, '')
                
                exclude.append(token)

                ir_unit = i
                continue
            # If it's not an opening punctuation, we check
            # whether we can continue expanding.
            else:
                if r_token.pos_ not in speech:
                    break
                else:
                    ir_unit += 1
        
        # We remove the excluded tokens
        # and return the context.
        context = [t for t in self.main.sp_doc[il_unit:ir_unit+1] if t not in exclude]
        return context

In [4]:
# Used for the Dictionary
@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc) :
    for token in doc :
        token.lemma_ = token.lemma_.lower()
    return doc

class Species:
    def __init__(self, main):
        # Tools
        self.main = main
        self.tn_nlp = TaxoNERD(prefer_gpu=False).load(model="en_ner_eco_biobert", exclude=["tagger", "parser", "attribute_ruler"])
        self.tn_nlp.add_pipe("lower_case_lemmas", after="lemmatizer")
        self.tn_doc = None
        
        # Contains any spans that have been identified
        # as a species.
        self.spans = None
        
        # Contains any tokens that have been identified
        # as a species or being a part of a species.
        self.tokens = None
        
        # Used to quickly access the span that a token
        # belongs to.
        self.token_to_span = None
        
        # Maps a string to an array of strings wherein
        # the strings involved in the key-value pair 
        # have been identified as an alternate name of each other.
        self.alternate_names = None
        
        # Used to increase TaxoNERD's accuracy.
        self.dictionary = None
        self.load_dictionary()

    def load_dictionary(self):
        self.dictionary = ["juvenile", "adult", "prey", "predator", "species", "tree", "cat", "dog"]
        # df = pd.read_csv("VernacularNames.csv")
        # self.dictionary += df.VernacularName.to_list()

        patterns = []
        for name in self.dictionary:
            doc = self.tn_nlp(name)
            patterns.append({"label": "LIVB", "pattern": [{"LEMMA": token.lemma_} for token in doc]})
        ruler = self.tn_nlp.add_pipe("entity_ruler")
        ruler.add_patterns(patterns)
        
    def update(self, text, verbose=False):
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")
        self.tn_doc = self.tn_nlp(text)
        self.spans, self.tokens, self.token_to_span, self.alternate_names = self.load_species(verbose=verbose)

    def load_species(self, verbose=False):
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")

        # We'll search for species in the text.
        text = self.main.sp_doc.text.lower()

        # These three contain the species that have been
        # identified in the text. Tokens that aren't adjectives,
        # nouns, or proper nouns will be stripped.
        spans = []
        tokens = []
        token_to_span = {}

        # It's useful to know if a different name refers to a
        # species we have already seen. For example, in
        # "predatory crab (Carcinus maenas)", "predatory crab"
        # is an alternative name for "Carcinus maenas" and
        # vice versa. This is used so that the species can be
        # properly tracked and redundant points are less
        # likely to be given.
        alternate_names = {}

        # We convert the spans that TaxoNerd has recognized
        # to spans under a different parent document. This is
        # because we're largely using said parent document and
        # there is more functionality in that parent document.
        species_spans = []
        for tn_species_span in self.tn_doc.ents:
            # print(f"TN Species Span: {tn_species_span}")
            
            char_i0 = self.tn_doc[tn_species_span.start].idx
            char_i1 = char_i0 + len(tn_species_span.text) - 1

            sp_token_i0 = self.main.token_at_char(char_i0).i
            sp_token_i1 = self.main.token_at_char(char_i1).i

            sp_species_span = self.main.sp_doc[sp_token_i0:sp_token_i1+1]
            
            # Although they have different parent documents,
            # they should still have the same text.
            if sp_species_span.text.lower() != tn_species_span.text.lower():
                print(sp_species_span.text.lower(), tn_species_span.text.lower())
            assert sp_species_span.text.lower() == tn_species_span.text.lower()

            # Sometimes, TaxoNerd recognizes two names of a species in one span.
            # If they're separated with parentheses, we can handle the case here.
            # The naming is difficult, so I'll just call it species_tuples.
            species_tuples = self.main.help.group_text(sp_species_span.text, flatten=True)

            # print(f"Species Tuples: {species_tuples}")
            
            species_span_chunks = []
            for species_tuple in species_tuples:
                species_span_chunk_text = sp_species_span.text[species_tuple[0]:species_tuple[1]]
                # print(f"Species Tuple Text: {species_span_chunk_text}")
                
                if species_span_chunk_text.isspace():
                    continue
                
                group_char_i0 = char_i0 + species_tuple[0]
                group_char_i1 = char_i0 + species_tuple[1] - 1

                # Update L Index to Exclude Whitespace Characters
                while text[group_char_i0].isspace():
                    group_char_i0 += 1

                # Update R Index to Exclude Whitespace Characters
                while text[group_char_i1].isspace():
                    group_char_i1 -= 1

                group_token_i0 = self.main.token_at_char(group_char_i0).i
                group_token_i1 = self.main.token_at_char(group_char_i1).i

                # print(f"Species Span Chunk Appended: {self.main.sp_doc[group_token_i0:group_token_i1+1]}")
                
                species_span_chunks.append(self.main.sp_doc[group_token_i0:group_token_i1+1])

            for species_span_chunk in species_span_chunks:
                species_spans.append(species_span_chunk)
    
                # TaxoNERD will recognize the full species (i.e. "brown squirrels"),
                # and we can use this to find more instances of a species in the text
                # by extracting the last noun or proper noun from that span 
                # (i.e. "squirrels"). Now, we can find "brown squirrels" and 
                # "squirrels".
                reversed_span = [t for t in species_span_chunk]
                reversed_span.reverse()
                for token in reversed_span:
                    # print(f"Token Reversed Span: {token}")
                    if token.pos_ in ["NOUN", "PROPN"]:
                        # print(f"\tADDED")
                        species_spans.append(self.main.sp_doc[token.i:token.i+1])
                        break

        # TaxoNerd sometimes recognizes one instance of a species
        # and fails to recognize it elsewhere. To fix this, I'll
        # search the text for all the species that TaxoNerd sees.
        # This should resolve that issue. To make this more robust,
        # I'll include the singular and plural versions of the
        # recognized species. Furthermore, the species being used
        # to search for other instances of species in the text will
        # be called search_species. Using a database I downloaded,
        # I've initialized search_species with a set of english
        # vernacular names (e.g., "dog", "cat"). I'm removing it for
        # now because there's seemingly a lot of bogus values.
        # df = pd.read_csv("EnglishVernacularNames-2.csv")
        # search_species = df.Name.to_list()
        search_species = ["juvenile", "adult", "prey", "predator", "predators", "species", "tree", "cat", "dog", "flies", "plants", "plant", "fly"]

        for species_span in species_spans:
            species_text = species_span.text.lower()
            species_text = self.main.help.remove_extra_spaces(self.main.help.remove_outer_non_alnum(species_text))
            
            # print(f"Species Text: {species_text}")

            # not [c for c in species_text if c.isalpha()]
            if not species_text or not [c for c in species_text if c.isalpha()]:
                # print(f"Continued")
                continue
            
            search_species.append(species_text)

            # Add Singular and/or Plural Version
            if species_span[-1].pos_ == "NOUN":
                # Plural
                if species_span[-1].tag_ == "NNS":
                    singular_species = self.main.help.singularize(species_text)
                    search_species.extend(singular_species)
                # Singular
                if species_span[-1].tag_ == "NN":
                    plural_species = self.main.help.pluralize(species_text)
                    search_species.extend(plural_species)

        # Now, we have the species to search for in the text.
        search_species = list(set(search_species))
        # print(f"Search Species: {search_species}")
        
        for species in search_species:
            matches = re.finditer(re.escape(species), text, re.IGNORECASE)
            
            for char_i0, char_i1, matched_text in [(match.start(), match.end(), match.group()) for match in matches]:
                # The full word must match, not just a substring inside of it.
                # So, if the species we're looking for is "ant", only "ant"
                # will match -- not "pants" or "antebellum". Therefore, the
                # characters to the left and right of the matched string must be
                # non-alphanumeric.
                l_char_is_letter = char_i0 > 0 and text[char_i0-1].isalpha()
                r_char_is_letter = char_i1 < len(text) and text[char_i1].isalpha()
                
                if l_char_is_letter or r_char_is_letter or not matched_text:
                    continue

                try:
                    sp_li = self.main.token_at_char(char_i0).i
                    sp_ri = self.main.token_at_char(char_i1-1).i
                except Exception as e:
                    print(f"Matched Text: '{matched_text}'")
                    print(e)
                    continue

                # This is the matched substring (which would be
                # a species) as a span in the parent document.
                species_span = self.main.sp_doc[sp_li:sp_ri+1]
                
                # Expand Species
                # Let's say there's a word like "squirrel". That's a bit ambiguous. 
                # Is it a brown squirrel, a bonobo? If the species is possibly missing
                # information (like an adjective to the left of it), we should expand
                # in order to get a full picture of the species.
                unclear_1 = len(species_span) == 1 and species_span[0].pos_ == "NOUN"
                unclear_2 = species_span.start > 0 and self.main.sp_doc[species_span.start-1].pos_ in ["ADJ"]
                
                if unclear_1 or unclear_2:
                    species_span = self.main.help.expand_unit(
                        il_unit=species_span.start, 
                        ir_unit=species_span.end-1,
                        il_boundary=0,
                        ir_boundary=len(self.main.sp_doc),
                        speech=["ADJ", "PROPN"],
                        literals=["-"],
                        include=True,
                        direction="LEFT",
                        verbose=verbose
                    )
                
                # Remove Outer Symbols
                # There are times where a species is identified with a parenthesis
                # nearby. Here, we remove that parenthesis (and any other symbols).
                species_span = self.main.help.contract_unit(
                    il_unit=species_span.start, 
                    ir_unit=species_span.end-1, 
                    speech=["PUNCT", "SYM", "DET", "PART"],
                    include=False,
                    verbose=verbose
                )

                if not species_span:
                    print(f"Matched Text: '{matched_text}'")
                    print(char_i0)
                    continue
            
                # A species must have a noun or a
                # proper noun. This may help discard
                # bogus results.
                letter_found = False
                for token in species_span:
                    if token.pos_ in ["NOUN", "PROPN"]:
                        letter_found = True
                        break

                if not letter_found:
                    continue

                # Adding Species
                spans.append(species_span)
                for token in species_span:
                    if token in tokens or token.pos_ in ["PUNCT", "SYM", "DET", "PART"]:
                        continue
                    tokens.append(token)
                    token_to_span[token] = species_span

        # Removing Duplicates and Sorting 
        spans = list({span.start: span for span in spans}.values())
        spans.sort(key=lambda span: span.start)
        
        # Finding and Storing Alternative Names
        for i, species_span in enumerate(spans):
            # There's not a next species to
            # evaluate.
            if i + 1 >= len(spans):
                break
            
            next_species_span = spans[i+1]
            
            # If there's one token between the species and the next species,
            # we check if the next species is surrounded by punctuation.
            if next_species_span.start - species_span.end == 1:
                # Token Before and After the Next Species
                before_next = self.main.sp_doc[next_species_span.start-1]
                after_next = self.main.sp_doc[next_species_span.end]

                if before_next.pos_ in ["PUNCT", "SYM"] and after_next.pos_ in ["PUNCT", "SYM"]:
                    sp_1_text = species_span.text.lower()
                    sp_2_text = next_species_span.text.lower()
                    
                    if sp_1_text not in alternate_names:
                        alternate_names[sp_1_text] = []
                    
                    if sp_2_text not in alternate_names:
                        alternate_names[sp_2_text] = []
                    
                    alternate_names[sp_1_text].append(sp_2_text)
                    alternate_names[sp_2_text].append(sp_1_text)
            # If there's no token between the species and the next,
            # species we assume that they refer to the same species.
            elif next_species_span.start - species_span.end == 0:
                sp_1_text = species_span.text.lower()
                sp_2_text = next_species_span.text.lower()
                
                if sp_1_text not in alternate_names:
                    alternate_names[sp_1_text] = []
                
                if sp_2_text not in alternate_names:
                    alternate_names[sp_2_text] = []

                alternate_names[sp_1_text].append(sp_2_text)
                alternate_names[sp_2_text].append(sp_1_text)
       
        return (spans, tokens, token_to_span, alternate_names)

    def span_at_token(self, token):
        if token in self.token_to_span:
            return self.token_to_span[token]
        return None
    
    def is_species(self, token):
        return token in self.tokens
        
    def has_species(self, tokens, verbose=False):
        for token in tokens:
            if token in self.tokens:
                return True
        return False

    def find_same_species(self, sp_A, sp_b, verbose=False):
        # METHOD 1: Check for Literal Matches
        sp_b_text = sp_b.text.lower()
        
        for sp_a in sp_A:
            # Verbatim Text
            sp_a_text = sp_a.text.lower()

            if sp_a_text == sp_b_text:
                return sp_a

            # Singularized Text
            sp_a_singular_texts = sp_a_text if sp_a[-1].tag_ in ["NN", "NNP"] else self.main.help.singularize(sp_a_text)
            sp_b_singular_texts = sp_b_text if sp_b[-1].tag_ in ["NN", "NNP"] else self.main.help.singularize(sp_b_text)

            if set(sp_a_singular_texts).intersection(sp_b_singular_texts):
                return sp_a

        # METHOD 2: Check Alternate Names
        for sp_a in sp_A:
            # Species B is an alternate name for Species A
            if sp_b_text in self.alternate_names.get(sp_a_text, []):
                return sp_a
            # Species A is an alternate name for Species B
            if sp_a_text in self.alternate_names.get(sp_b_text, []):
                return sp_a
        
        # METHOD 3: Check Nouns
        # This is used if one or none of the species being compared
        # has 1 adjective.
        sp_b_0_text = sp_b[0].lower_
        sp_b_is_noun = sp_b[0].pos_ in ["NOUN", "PROPN"]

        sp_b_nouns = []
        sp_b_num_adjectives = 0
        for token in sp_b:
            if not sp_b_nouns and token.pos_ == "ADJ":
                sp_b_num_adjectives += 1
            elif token.pos_ in ["PROPN", "NOUN"]:
                sp_b_nouns.append(token)
        sp_b_nouns_str = [noun.lower_ for noun in sp_b_nouns]
        sp_b_singular_texts = " ".join(sp_b_nouns_str) if sp_b_nouns[-1].tag_ in ["NN", "NNP"] else self.main.help.singularize(" ".join(sp_b_nouns_str))
        
        for sp_a in sp_A:
            sp_a_0_text = sp_a[0].lower_
            sp_a_is_noun = sp_a[0].pos_ in ["NOUN", "PROPN"]

            # Case Example: 'Hyla' v. 'Hyla tadpoles'
            if sp_a_0_text == sp_b_0_text and (sp_a_is_noun or sp_b_is_noun):
                if sp_a_text in sp_b_text or sp_b_text in sp_a_text:
                    return sp_a
            # Case Example: 'dogs' v. 'red dogs'
            else:
                sp_a_nouns = []
                sp_a_num_adjectives = 0
                for token in sp_a:
                    if not sp_a_nouns and token.pos_ == "ADJ":
                        sp_a_num_adjectives += 1
                    elif token.pos_ in ["PROPN", "NOUN"]:
                        sp_a_nouns.append(token)
                sp_a_nouns_str = [noun.lower_ for noun in sp_a_nouns]
                
                if sp_a_nouns and sp_b_nouns and (
                    (sp_a_num_adjectives == 1 and sp_b_num_adjectives == 0) or 
                    (sp_b_num_adjectives == 1 and sp_a_num_adjectives == 0)
                ):
                    sp_a_singular_texts = " ".join(sp_a_nouns_str) if sp_a_nouns[-1].tag_ in ["NN", "NNP"] else self.main.help.singularize(" ".join(sp_a_nouns_str))
                    
                    if set(sp_a_singular_texts).intersection(sp_b_singular_texts):
                        return sp_a

        # METHOD 3: Last Ditch Effort
        # If there's been no matches, we just look for one string inside of
        # another.
        for sp_a in sp_A:
            sp_a_text = sp_a.text.lower()
            if sp_b_text in sp_a_text or sp_a_text in sp_b_text:
                return sp_a
        
        return None

In [5]:
class Keywords:
    def __init__(self, main, *, regexes=[], vocab=[], patterns=[], def_pos=[], def_tag=[], def_threshold=0.7):
        self.main = main
        # When comparing two words, SpaCy returns a value
        # from 0 to 1, representing how similar the two
        # embeddings are. The threshold below determines
        # the minimum number of similarity before two words
        # are considered as being equivalent. This is the
        # default values, specific values can be provided for
        # each word.
        self.def_threshold = def_threshold
        # To decrease our "search space", we can specify the
        # type of tokens we'd want to consider via the
        # tag_ and pos_ attributes. The tag and pos keys for a
        # vocab word is used for more specificity.
        self.def_tag = def_tag
        self.def_pos = def_pos
        # The words are divided into two categories: vocabulary
        # and regex. The vocabulary matches words
        # by definition, whereas the regexes match words by
        # content.
        self.regex = regexes
        self.vocab = []
        for vocab_word in vocab:
            if isinstance(vocab_word, str):
                doc = self.main.sp_nlp(vocab_word)
                self.vocab.append({
                    "doc": doc,
                    "lemma": " ".join([t.lemma_ for t in doc])
                })
            else:
                doc = self.main.sp_nlp(vocab_word["word"])
                self.vocab.append({
                    "doc": doc,
                    "tag": vocab_word.get("tag"),
                    "pos": vocab_word.get("pos"),
                    "threshold": vocab_word.get("threshold"),
                    "lemma": " ".join([t.lemma_ for t in doc])
                })

        # Rule-Based Matching
        self.matcher = Matcher(self.main.sp_nlp.vocab)
        self.matcher.add("Pattern", patterns)

        # Matched Tokens
        self.tokens = []

    def update(self, verbose=False):
        # SpaCy Doc DNE or Indexing Map DNE
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")
        self.tokens = self.match_tokens(verbose=verbose)

    def match_tokens(self, verbose=False):
        verbose=True
        # SpaCy Doc DNE or Indexing Map DNE
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")
        
        matched_tokens = []

        # Match by Regex
        text = self.main.sp_doc.text.lower()
        for regex in self.regex:
            if verbose:
                print(regex, [match.start() for match in re.finditer(regex, text, re.IGNORECASE)])
            for char_index in [match.start() for match in re.finditer(regex, text, re.IGNORECASE)]:
                adj_char_index = char_index
                while text[adj_char_index].isspace():
                    adj_char_index += 1
                token = self.main.token_at_char(adj_char_index)
                if (
                    (self.def_pos and token.pos_ not in self.def_pos) or 
                    (self.def_tag and token.tag_ not in self.def_tag)
                ):
                    continue
                matched_tokens.append(token)

        # Match by Matcher
        matches = self.matcher(self.main.sp_doc)
        for match_id, start, end in matches:
            span = self.main.sp_doc[start:end]  # The matched span
            print(f"Matched Span: {span}")
            matched_tokens.append(span[0])

        # Match by Vocab
        for token in self.main.sp_doc:
            if (
                (self.def_pos and token.pos_ not in self.def_pos) or 
                (self.def_tag and token.tag_ not in self.def_tag) or 
                (token in matched_tokens)
            ):
                continue

            token_doc = self.main.sp_nlp(token.lower_)
            token_lemma = " ".join([t.lemma_ for t in token_doc])
            
            for vocab_word in self.vocab:
                # Ensure Correct Tag
                if vocab_word.get("tag"):
                    if not [t for t in token_doc if t.tag_ in vocab_word.get("tag")]:
                        continue
                
                # Ensure Correct PoS
                if vocab_word.get("pos"):
                    if not [t for t in token_doc if t.pos_ in vocab_word.get("pos")]:
                        continue

                # Check Lemma
                if token.lower_ == "amplified":
                    print(token_lemma, vocab_word["lemma"])
                    
                if token_lemma == vocab_word["lemma"]:
                    matched_tokens.append(token)
                    break

                # Check Similarity
                similarity = vocab_word["doc"].similarity(token_doc)

                if token.lower_ == "amplified":
                    print(token.lower, vocab_word["doc"], similarity)
                    
                if similarity >= vocab_word.get("threshold", self.def_threshold):
                    matched_tokens.append(token)
                    break
                
        return matched_tokens

In [6]:
class ExperimentKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            vocab=[
                "study", 
                "hypothesis", 
                "experiment", 
                "found", 
                "discover", 
                "compare", 
                "finding", 
                "result", 
                "test", 
                "examine", 
                "model",
                "measure",
                "manipulate",
                "assess",
                "conduct",
                "data",
                "analyze",
                "sample",
                "observe",
                "predict",
                "suggest",
                "method",
                "investigation",
                "trial",
                "experimental",
                "evidence",
                "demonstrate",
                "analysis",
                "show",
                "compare",
                "comparable",
                "control group", 
                "independent",
                "dependent",
                "applied",
                "treatment",
                "survery"
            ],
            def_pos=["VERB", "NOUN", "ADJ"], 
            def_threshold=0.8
        )

In [7]:
class NegativeExperimentKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            vocab=[
                "theory",
                "review",
                "analysis",
                "meta-analysis"
            ],
            def_pos=["VERB", "NOUN", "ADJ"], 
            def_threshold=0.8
        )

In [8]:
class NegativeTopicKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            regexes=[
                r"co-?evolution",
                r"evolution",
            ],
            def_pos=["VERB", "NOUN", "ADJ"], 
            def_threshold=0.8
        )

In [9]:
class CauseKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            vocab=[
                "increase", 
                "decrease", 
                "change", 
                "shift", 
                "cause", 
                "produce", 
                "trigger", 
                "suppress", 
                "inhibit",
                "encourage",
                "allow",
                "influence",
                "affect",
                "alter",
                "induce",
                "produce",
                "result in",
                "associated",
                "correlated",
                "contribute",
                "impact",
                "deter",
                "depressed",
                "when",
                "because",
                # "reduce",
                # "killed",
                # "supported"
            ],
            def_pos=["VERB", "SCONJ", "NOUN"],
            # def_tag=["VB", "VBD", "WRB", "IN", "VBG"],
            # def_threshold=0.75
            def_threshold=0.8
        )

    def update(self, verbose=False):
        Keywords.update(self, verbose)
        self.tokens = self.filter_tokens(self.tokens, verbose)

    def filter_tokens(self, tokens, verbose=False):
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")

        filtered = []
        for token in tokens:
            # I'm not sure what cause words should be filtered out, because
            # I haven't seen everything, but this word should be filtered out,
            # it's not really reflective the changes that we're looking for. But,
            # sometimes it is, so it's up in the air. However, I feel like the
            # writer would use more clear language like "decrease" or something.
            if token.lemma_ in ["kill"]:
                continue
            filtered.append(token)
            
        return filtered

In [10]:
class ChangeKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            vocab=[
                "few", 
                "more", 
                "increase", 
                "decrease", 
                "less", 
                "short", 
                "long", 
                "greater"
                "shift",
                "fluctuate",
                "adapt",
                "grow",
                "rise"
                "surge",
                "intensify",
                "amplify",
                "multiply",
                "decline",
                "reduce",
                "drop",
                "diminish",
                "fall",
                "lessen",
                "doubled",
                "tripled",
                "lower",
            ],
            regexes=[
                # Match Examples:
                # 1. "one... as..."
                # 2. "2x than..."
                r"(one|two|three|four|five|six|seven|eight|nine|ten|twice|thrice|([0-9]+|[0-9]+.[0-9]+)(x|%))[\s-]+[^\s]*[\s-]+(as|more|than|likely)([\s-]+|$)"
            ],
            def_pos=["NOUN", "ADJ", "ADV"],
            def_threshold=0.75
        )

    def update(self, verbose=False):
        Keywords.update(self, verbose=True)
        self.tokens = self.filter_tokens(self.tokens, verbose)

    def filter_tokens(self, tokens, verbose=False):
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")

        filtered = []
        for token in self.main.sp_doc:
            # Already Matched
            if token in tokens:
                filtered.append(token)
            
            # Comparative Adjective
            # Looking for words like "bigger" and "better".
            elif token.pos_ == "ADJ" and token.tag_ == "JJR":
                filtered.append(token)
                continue
            
        return filtered

In [11]:
class TraitKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            regexes=[
                r"behaviou?r", 
                r"[^A-Za-z]+rate", 
                "colou?r",
                "biomass",
                r"[^A-Za-z]+mass", 
                r"[^A-Za-z]+size",
                "number",
                "length", 
                "pattern", 
                "weight",
                "shape", 
                "efficiency", 
                "trait",
                "phenotype",
                "demography",
                "population (structure|mechanic)s?",
                "ability", 
                "capacity", 
                "height", 
                "width", 
                "[A-Za-z]+span",
                "diet",
                # "food",
                "feeding",
                "nest",
                "substrate",
                "breeding",
                r"[^A-Za-z]+age[^A-Za-z]+",
                "lifespan",
                "development",
                "output",
                "time",
                "period"
                # "mating",
                # "[^A-Za-z]+fur",
                # "feathers",
                # "scales",
                # "skin",
                # "limb",
                "level",
                "configuration",
                "dimorphism",
                "capability",
                # "appendages",
                # "blood",
                "regulation",
                "excretion",
                "luminescence",
                r"[^A-Za-z]+role",
                # "reproduction",
                # "courtship",
                # "pollination",
                # "mechanism",
                "sensitivity",
                "resistance",
                r"(un|(^|\s)[A-Za-z]*-)infected",
                # "temperature"
            ],
            def_pos=["NOUN", "ADJ"]
        )

    def update(self, verbose=False):
        Keywords.update(self, verbose)
        self.tokens = self.filter_tokens(self.tokens, verbose)

    def filter_tokens(self, tokens, verbose=True):
        verbose = True
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")

        print(f"Tokens to Filter: {tokens}")
        
        filtered = []
        for token in tokens:
            expanded_token = self.main.help.expand_unit(
                il_unit=token.i, 
                ir_unit=token.i, 
                il_boundary=0, 
                ir_boundary=len(self.main.sp_doc) - 1, 
                speech=["PUNCT"],
                include=False,
                verbose=verbose
            )

            print(f"\tToken: {token}\n\tExpanded Token: {expanded_token}")
            if self.main.species.has_species(expanded_token):
                filtered.append(token)

        print(f"Filtered Tokens: {filtered}")
        
        return filtered

In [12]:
class TestKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main,
            vocab=[
                "compare",
                "examine",
                "evaluate",
                "assess",
            ],
            def_pos=["VERB", "NOUN"], 
            def_threshold=0.8
        )

In [13]:
class VariableKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main,
            regexes=[
                r"between",
                r"against",
                r"independen(t|ts|tly|cy)",
                r"dependen(t|ts|tly|cy)",
                r"treatments?",
                r"effect"
            ],
            patterns=[
                [{"LOWER": {"IN": ["with", "without"]}}, {"OP": "*", "IS_PUNCT": False}, {"LOWER": {"IN": ["with", "without"]}}],
                [{"LOWER": {"IN": ["at"]}}, {"POS": "NUM"}],
                [{"LOWER": {"IN": ["at"]}}, {"LOWER": {"IN": ["several", "unique", "multiple", "different"]}}],
            ]
        )

In [14]:
class ResultKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            vocab=[
                "therefore",
                "thus",
                "results",
                "showed",
                "displayed",
                "due",
                "hence"
            ],
            def_pos=["VERB", "NOUN", "ADJ", "SCONJ", "ADV"], 
            def_threshold=0.8
        )

In [15]:
class Main:
    def __init__(self):
        # Tools
        self.sp_nlp = spacy.load("en_core_web_lg")
        self.fcoref = FCoref(enable_progress_bar=False, device='cpu')
        self.sp_doc = None

        # Maps Character Position to Token in Document
        # Used to handle differences between different
        # pipelines and tools.
        self.index_map = None
    
        # Parsers
        self.species = Species(self)
        self.trait = TraitKeywords(self)
        self.cause = CauseKeywords(self)
        self.change = ChangeKeywords(self)
        self.experiment = ExperimentKeywords(self)
        self.neg_experiment = NegativeExperimentKeywords(self)
        self.neg_topic = NegativeTopicKeywords(self)
        self.result = ResultKeywords(self)
        self.variable = VariableKeywords(self)
        self.test = TestKeywords(self)

        # Helper
        self.help = Help(self)

    def update_doc(self, doc, verbose=False):
        self.sp_doc = doc
        self.index_map = self.load_index_map()
        self.species.update(doc.text, verbose=True)
        self.trait.update(verbose=False)
        self.cause.update(verbose=False)
        self.change.update(verbose=False)
        self.experiment.update(verbose=False)
        self.neg_experiment.update(verbose=False)
        self.neg_topic.update(verbose=False)
        self.result.update(verbose=False)
        self.variable.update(verbose=False)
        self.test.update(verbose=False)

    def update_text(self, text, verbose=False):
        self.sp_doc = self.sp_nlp(text)
        self.update_doc(self.sp_doc, verbose=verbose)
        
    def token_at_char(self, char_index):
        # SpaCy Doc or Indexing Map Not Found
        if not self.sp_doc or not self.index_map:
            raise Exception("DNE")

        if char_index in self.index_map:
            return self.index_map[char_index]

        raise Exception(f"Token at Index {char_index} Not Found")
        
    def load_index_map(self):
        # SpaCy Doc Not Found
        if self.sp_doc is None:
            raise Exception("DNE")

        # Map Character Index to Token
        index_map = {}
        for token in self.sp_doc:
            # char_i0 is the index of the token's starting character.
            # char_i1 is the index of the character after the token's ending character.
            char_i0 = token.idx
            char_i1 = token.idx + len(token)

            for i in range(char_i0, char_i1):
                index_map[i] = token

        return index_map

    def score(self, verbose=False):
        NUM_CATEGORIES = 6

        # Requires the mention of a trait and a cause or change word.
        # The cause or change word indicates some variation.
        # Index 0 in Array
        TRAIT = 0

        # Requires the mention of a species and a cause or change word.
        # The cause or change word indicates that the species is being
        # affected or is affecting something else.
        # Index 1 in Array
        SPECIES = 1

        # Requires a word that has been defined as "experiment"-related.
        # Index 2 in Array
        EXPERIMENT = 2

        # Requires the mention of several species (more or less).
        # Index 3 in Array
        INTERACTION = 3

        # There are certain topics that we may not want.
        NEGATIVE_TOPIC = 4

        # I'm going to redo the comments anyway.
        TRAIT_VARIATION = 5

        # Max # of Points of Category per Sentence (MPC)
        # A category can collect points from each sentence. However,
        # there's a maximum number of points it can collect. This is
        # determined by the MPC.
        MPC = [1] * NUM_CATEGORIES

        # TODO: This is no longer needed as the scores are first calculated vertically,
        # rather than horizontally. Previously, we'd add up the points a sentence received
        # across its categories. In order to have a final score in the range [0, 1], the
        # maximum points a sentence received had to be in the range [0, 1]. However, 
        # since we no longer add up all the points a sentence received across all
        # categories (to apply the weights at the end), they don't need to add up
        # to 1. It's hard to explain without an example.
        # assert np.sum(MPC) == 1
        
        # Points per Instance of Category (PIC)
        # Each token is evaluated to check whether a category
        # can be given points. The number of points given, if
        # the token is determined to be satisfactory, is the PIC.
        # The PIC is less than or equal to the MPC for the corresponding
        # category. The idea behind the PIC and MPC is similar to how
        # sets work in tennis: you're not immediately awarded the full points
        # for the set (MPC) if your opponent fails to return the ball,
        # instead you're given a smaller # of points (PIC) that allow you to
        # incrementally win the set (category).
        PIC = [0] * NUM_CATEGORIES
        PIC[TRAIT] = MPC[TRAIT]*1.0
        PIC[SPECIES] = MPC[SPECIES]/3.0
        PIC[EXPERIMENT] = MPC[EXPERIMENT]*0.625
        PIC[INTERACTION] = MPC[INTERACTION]/3.0
        PIC[NEGATIVE_TOPIC] = MPC[NEGATIVE_TOPIC]*1.0

        for i in range(NUM_CATEGORIES):
            assert 0 <= PIC[i] <= MPC[i]

        # Category Weights (CW)
        # It may be helpful to weigh a certain category's fraction of total points
        # more or less than another's. Thus, at the end, we'll take a
        # weighted average of the category's FTP. The weights must add up to 1.
        CW = [0] * NUM_CATEGORIES
        CW[TRAIT] = 0.3
        CW[SPECIES] = 0.1
        CW[EXPERIMENT] = 0.1
        CW[INTERACTION] = 0.1
        CW[NEGATIVE_TOPIC] = 0.1
        CW[TRAIT_VARIATION] = 0.3

        assert round(np.sum(CW)) == 1

        # Leniency
        # There are certain categories that aren't going to be as frequent as others.
        # For example, the trait category. You could try and decrease the influence
        # of said category by lowering its MPC and/or increasing the PIC (so that it's
        # easier to achieve the FTP). However, this could make it harder to meaningfully
        # represent the category. The idea of leniency is to remove (some) sentences that had 0
        # points from the scoring. This increases the FTP as, for example, instead of comparing
        # 0.5 points to a total of 2.5 points, you can compare 0.5 to 2.0 points, and so on.
        # A leniency of 1 means that all sentences that received 0 points will be removed from
        # the scoring. A leniency of 0 means that all the sentences are included in the scoring.
        LEN = [0] * NUM_CATEGORIES
        LEN[TRAIT] = 0
        LEN[SPECIES] = 0
        LEN[EXPERIMENT] = 0
        LEN[INTERACTION] = 0
        LEN[NEGATIVE_TOPIC] = 0

        # Used for Leniency
        zero_pt_sents = [0] * NUM_CATEGORIES
        
        # Points
        points = [0] * NUM_CATEGORIES

        # Extracted Information
        cause_tokens = self.cause.tokens
        change_tokens = self.change.tokens
        trait_tokens = self.trait.tokens
        species_tokens = [self.sp_doc[span.start] for span in self.species.spans]
        experiment_tokens = self.experiment.tokens
        neg_experiment_tokens = self.neg_experiment.tokens
        neg_topic_tokens = self.neg_topic.tokens
        result_tokens = self.result.tokens
        test_tokens = self.test.tokens
        var_tokens = self.variable.tokens

        print(f"Cause Tokens: {cause_tokens}")
        print(f"Change Tokens: {change_tokens}")
        print(f"Experiment Tokens: {experiment_tokens}")
        print(f"Negative Experiment Tokens: {neg_experiment_tokens}")
        print(f"Negative Topic Tokens: {neg_topic_tokens}")
        print(f"Trait Tokens: {trait_tokens}")
        print(f"Species Tokens: {species_tokens}")
         
        # This is used to ensure that at least three species
        # are mentioned.
        seen_species = {}

        for sent in self.sp_doc.sents:
            # This contains the number of points
            # each category has accumulated in the sentence.
            curr_points = [0] * NUM_CATEGORIES

            # Contains the tokens in the sentence.
            sent_tokens = [token for token in sent]

            # This is used for the species (must have a nearby cause and/or
            # change word).
            sent_cause_tokens = set(sent_tokens).intersection(cause_tokens)
            sent_change_tokens = set(sent_tokens).intersection(change_tokens)

            # We don't want to visit the same species more than one
            # in the same sentence as to avoid redundant points.
            sent_seen_species = []
            sent_num_unique_species = 0
            
            print(f"Sentence Tokens: {sent_tokens}")
            print(f"Sentence Cause Tokens: {sent_cause_tokens}")
            print(f"Sentence Change Tokens: {sent_change_tokens}")
            
            for token in sent_tokens:
                # If each category has reached their maximum number of points,
                # we can end the loop early.
                all_maxed = True
                for i in range(NUM_CATEGORIES):
                    if curr_points[i] < MPC[i]:
                        all_maxed = False

                if all_maxed:
                    break

                # NEGATIVE TOPIC CATEGORY
                if curr_points[NEGATIVE_TOPIC] < MPC[NEGATIVE_TOPIC] and token in neg_topic_tokens:
                    curr_points[NEGATIVE_TOPIC] += PIC[NEGATIVE_TOPIC]
                    print(f"Added Points for Negative Topic via Token '{token}'\n")

                # TRAIT CATEGORY
                if curr_points[TRAIT] < MPC[TRAIT] and token in trait_tokens:
                    print("TRAIT CATEGORY")
                    # To get points in the trait category, there must 
                    # be (1) a trait; and (2) a change or cause in the token's
                    # context.
                    token_context = set(self.help.find_unit_context(
                        il_unit=token.i, 
                        ir_unit=token.i, 
                        il_boundary=token.sent.start, 
                        ir_boundary=token.sent.end-1, 
                        verbose=verbose)
                    )
                    cause_tokens_in_context = set(sent_cause_tokens).intersection(token_context)
                    change_tokens_in_context = set(sent_change_tokens).intersection(token_context)

                    print(f"Token ({token}) Context: {token_context}")
                    print(f"Cause Tokens in Context: {cause_tokens_in_context}")
                    print(f"Change Tokens in Context: {change_tokens_in_context}")

                    if cause_tokens_in_context or change_tokens_in_context:
                        curr_points[TRAIT] += PIC[TRAIT]
                        print(f"Added Points for Trait via Token '{token}'")

                    print()

                # EXPERIMENT CATEGORY
                if token in neg_experiment_tokens:
                    curr_points[EXPERIMENT] -= 2 * PIC[EXPERIMENT]
                    print(f"Deducted Points for Experiment via Token '{token}'\n")
                elif curr_points[EXPERIMENT] < MPC[EXPERIMENT] and token in experiment_tokens:
                    curr_points[EXPERIMENT] += PIC[EXPERIMENT]
                    print(f"Added Points for Experiment via Token '{token}'\n")

                # SPECIES CATEGORY
                if token in species_tokens:
                    print("SPECIES CATEGORY")
                    # Find Species Span
                    species_span = self.species.span_at_token(token)           

                    # Updating Seen Species (in Entire Text)
                    past_visits = 0

                    # Find Previous Instance of Species (if Any)
                    print("Seen Species Updated")
                    print(seen_species)
                    print()
                    
                    seen_species_span = self.species.find_same_species(seen_species.keys(), species_span)
                    if seen_species_span:
                        past_visits = seen_species[seen_species_span]
                        seen_species[seen_species_span] += 1
                    
                    if not past_visits:
                        seen_species[species_span] = 1

                    print("Seen Species Updated")
                    print(seen_species)
                    print()
                    
                    # Checking Seen Species (in Sentence)
                    # We only add points if it's a species that has not been seen
                    # in the sentence. This is to avoid redundant points. 
                    # Also, if it species has not been seen at all (is_new_species),
                    # then it cannot be a redundant species (we couldn't have seen it in the sentence
                    # either).
                    redundant_species = False

                    if not past_visits:
                        if self.species.find_same_species(sent_seen_species, species_span):
                            redundant_species = True
                    
                    sent_seen_species.append(species_span)

                    print("Seen Species in Sentence")
                    print(sent_seen_species)
                    print()
                    
                    if redundant_species:
                        continue
                    sent_num_unique_species += 1
                    
                    # INTERACTION CATEGORY
                    # It is helpful to have this category here because (if we've reached here)
                    # we're dealing with a new species in the sentence.
                    if curr_points[INTERACTION] < MPC[INTERACTION] and sent_num_unique_species > 1:
                        curr_points[INTERACTION] += PIC[INTERACTION]
                        print(f"Added Points for Interaction via Token '{token}'\n")
                        
                    if curr_points[SPECIES] < MPC[SPECIES]:
                        # To get points in the species category, there must be 
                        # (1) a species; and (2) a change or cause in the phrase
                        # (or clause) that the token is a part of.
                        token_context = set(self.help.find_unit_context(
                            il_unit=token.i, 
                            ir_unit=token.i, 
                            il_boundary=token.sent.start, 
                            ir_boundary=token.sent.end-1, 
                            verbose=verbose)
                        )
                        cause_tokens_in_context = set(sent_cause_tokens).intersection(token_context)
                        change_tokens_in_context = set(sent_change_tokens).intersection(token_context)

                        print(f"Token ({token}) Context: {token_context}")
                        print(f"Cause Tokens in Context: {cause_tokens_in_context}")
                        print(f"Change Tokens in Context: {change_tokens_in_context}")
                        
                        if cause_tokens_in_context or change_tokens_in_context:
                            curr_points[SPECIES] += PIC[SPECIES]
                            print(f"Added Points for Species via Token '{token}'")

                        print()
         
            # SENTENCE DONE
            # Add Sentence Points to Total Points
            for category in [TRAIT, SPECIES, EXPERIMENT, INTERACTION, NEGATIVE_TOPIC]:
                if round(curr_points[category]) == 0:
                    zero_pt_sents[category] += 1
                points[category] += max(0, min(curr_points[category], MPC[category]))

        # Trait Variation Score
        NUM_SENTENCES = len(list(self.sp_doc.sents))
        
        max_trait_var_points = 0
        for i in range(NUM_SENTENCES):
            sent_i = list(self.sp_doc.sents)[i]
            print(f"Trying Sentence: {sent_i}")
            sent_i_tokens = [token for token in sent_i]

            sent_i_neg_exp_tokens = set(sent_i_tokens).intersection(neg_experiment_tokens)
            sent_i_neg_topic_tokens = set(sent_i_tokens).intersection(neg_topic_tokens)
            sent_i_cause_tokens = set(sent_i_tokens).intersection(cause_tokens)
            sent_i_result_tokens = set(sent_i_tokens).intersection(result_tokens) - sent_i_cause_tokens

            if sent_i_neg_topic_tokens or sent_i_neg_exp_tokens:
                print(sent_i_result_tokens)
                print(sent_i_neg_topic_tokens)
                print(sent_i_neg_exp_tokens)
                print('A')
                continue

            sent_trait_tokens = set(sent_i_tokens).intersection(trait_tokens)
            if not sent_trait_tokens:
                print('B')
                continue

            print("!!!")
            
            trait_var_points = 0
            
            sent_exp_tokens = set(sent_i_tokens).intersection(experiment_tokens)
            sent_test_tokens = set(sent_i_tokens).intersection(test_tokens)

            if sent_exp_tokens:
                trait_var_points = 0.1
            if sent_test_tokens:
                trait_var_points = 0.35

            # PUT BACK
            if trait_var_points == 0:
                continue

            print(sent_exp_tokens)
            print(sent_test_tokens)
            
            sent_var_tokens = set(sent_i_tokens).intersection(var_tokens)

            print(sent_var_tokens)
            
            if sent_var_tokens:
                trait_var_points += 0.15

            assert trait_var_points <= 0.5

            print(f"Trait Variation Points: {trait_var_points}")
            if trait_var_points == 0:
                print('C')
                continue
            
            for j in range(i+1, NUM_SENTENCES):
                sent_j = list(self.sp_doc.sents)[j]
                print(f"Comparing to Sentence: {sent_j}")
                sent_j_tokens = [token for token in sent_j]
            
                sent_species_tokens = set(sent_j_tokens).intersection(species_tokens)
                sent_cause_tokens = set(sent_j_tokens).intersection(cause_tokens)
                sent_change_tokens = set(sent_j_tokens).intersection(change_tokens)
                sent_result_tokens = set(sent_j_tokens).intersection(result_tokens) - sent_cause_tokens

                print(sent_species_tokens)
                print(sent_cause_tokens)
                print(sent_change_tokens)
                print(sent_result_tokens)

                if (
                    not sent_species_tokens
                    or 
                    (
                        not sent_cause_tokens and not sent_change_tokens and not sent_result_tokens
                    )
                ):
                    print('D')
                    continue

                _trait_var_points = trait_var_points + (1 - ((j - i - 1) / (NUM_SENTENCES - 1))) * 0.5
                print(f"Trait Variation Points after Sentence 2: {_trait_var_points}")
                _trait_var_points *= (1 - i/(NUM_SENTENCES - 1))
                print(f"Trait Variation Points after Adjustment: {_trait_var_points}")
                print(f"Sentence 1 ({i}): {list(self.sp_doc.sents)[i]}\nSentence 2 ({j}): {list(self.sp_doc.sents)[j]}\nTrait Variation Points: {_trait_var_points}")
                max_trait_var_points = max(max_trait_var_points, _trait_var_points)

        points[TRAIT_VARIATION] = max_trait_var_points
        # Calculating Score            
        # NUM_SENTENCES = len(list(self.sp_doc.sents))

        print(f"Points B4 Normalization: {points}")

        score = 0
        for i in range(NUM_CATEGORIES):
            # The trait variation category is already "normalized".
            if i != TRAIT_VARIATION:
                # This is the number of sentences that did not receive
                # 0 points (for the category). The number of sentences
                # we divide by must minimally account for these sentences.
                num_non_zero_pt_sents = NUM_SENTENCES - zero_pt_sents[i]
                
                # This is the number of sentences to calculate the
                # FTP with, with leniency applied.
                lenient_num_sentences = max(num_non_zero_pt_sents, (1 - LEN[i]) * NUM_SENTENCES)
    
                # Calculating FTP
                points[i] = points[i] / (MPC[i] * lenient_num_sentences)
    
                # Take the Inverse
                if i == NEGATIVE_TOPIC:
                    points[i] = 1 - points[i]
    
            # Add onto Score
            score += max(0, min(points[i], 1)) * CW[i]

        # Enforcing 3 or More Species            
        if len(seen_species) < 3:
            return 0, points
            
        assert 0.0 <= score <= 1.0
        
        return score, points

In [16]:
def score_dataset(name, save_output=False, version=""):
    # Redirect Print Statements
    # https://stackoverflow.com/questions/7152762/how-to-redirect-print-output-to-a-file
    if save_output:
        initial_stdout = sys.stdout
        f = open(f'./Print{name}{"" if not version else f"-{version}"}.txt', 'w')
        sys.stdout = f
        sys.stdout.reconfigure(encoding='utf-8')

    # Load Dataset
    data = load_preprocessed_dataset(name)

    # We'll be running the points algorithm
    # on the abstracts of these papers.
    texts = list(data['Abstract'].to_numpy())
    
    # The scores for each paper will be stored here,
    # we'll set this as a column of the dataframe.
    scores = []
    points = []
    trait_points = []
    species_points = []
    experiment_points = []
    interaction_points = []
    neg_topic_points = []
    trait_var_points = []
    
    # Scan and Evaluate Documents
    main = Main()
    for i, doc in enumerate(main.sp_nlp.pipe(texts)):
        print(f"{i+1}/{data.shape[0]} - {data.iloc[i]['Title']}\n")
        main.update_doc(doc, verbose=save_output)

        # Empty string literals cause errors, so it's
        # being handled here.
        if not main.sp_doc or not main.species.tn_doc:
            scores.append(0)
        else:
            score, _points = main.score(verbose=save_output)
            scores.append(score)
            points.append(_points)
            trait_points.append(_points[0])
            species_points.append(_points[1])
            experiment_points.append(_points[2])
            interaction_points.append(_points[3])
            neg_topic_points.append(_points[4])
            trait_var_points.append(_points[5])

        if not save_output:
            clear_output(wait=True)

    # Reset Standard Output
    if save_output:
        sys.stdout = initial_stdout
        f.close()

    data["Score"] = scores
    data["Trait Points"] = trait_points
    data["Species Points"] = species_points
    data["Experiment Points"] = experiment_points
    data["Interaction Points"] = interaction_points
    data["Negative Topic Points"] = neg_topic_points
    data["Trait Variation Points"] = trait_var_points
    data.sort_values(by='Score', ascending=False, inplace=True)
    
    return data

In [17]:
# Dataset Names: "Examples", "Baseline-1", "SubA", "SubAFiltered", "SubB", "SubBFiltered", "C", "CFiltered", "D", "DFiltered"]
scored_data = score_dataset("Baseline-1", save_output=False, version='')
store_scored_dataset(scored_data, "Baseline-3", version='9')

27/27 - The effect of copper stress on inter-trophic relationships in a model tri-trophic food chain.

behaviou?r []
[^A-Za-z]+rate [974, 1612]
colou?r []
biomass []
[^A-Za-z]+mass [1068, 1630]
[^A-Za-z]+size []
number []
length [1004, 1509]
pattern []
weight [1082]
shape []
efficiency []
trait []
phenotype []
demography []
population (structure|mechanic)s? []
ability []
capacity []
height []
width []
[A-Za-z]+span []
diet []
feeding []
nest []
substrate []
breeding []
[^A-Za-z]+age[^A-Za-z]+ []
lifespan []
development []
output []
time []
periodlevel []
configuration []
dimorphism []
capability []
regulation []
excretion []
luminescence []
[^A-Za-z]+role []
sensitivity []
resistance []
(un|(^|\s)[A-Za-z]*-)infected []
Tokens to Filter: [rate, rate, mass, mass, length, length, weights]
	Token: rate
	Expanded Token: The rate of growth and flag leaf length were affected by levels of Cu in the soil but total plant mass and ear weights were not
	Token: rate
	Expanded Token: Ladybirds also 

In [18]:
data = load_preprocessed_dataset("Baseline-1")

Data Shape: (28, 4)


In [19]:
data.loc[data['Title'].str.contains('native')]

Unnamed: 0,Title,Abstract,DOI,Score
18,Temperature dependency of intraguild predation...,Environmental factors such as temperature can ...,https://doi.org/10.1002/ecy.2157,0
23,Impact of intraspecific and intraguild predati...,Exotic predators are more likely to replace re...,,0


In [20]:
index = 18

title = data.iloc[index].Title
abstract = data.iloc[index].Abstract

print(f"Title: {title}")
print(f"Abstract: {abstract}")

Title: Temperature dependency of intraguild predation between native and invasive crabs
Abstract: Environmental factors such as temperature can affect the geographical distribution of species directly by exceeding physiological tolerances, or indirectly by altering physiological rates that dictate the sign and strength of species interactions. Although the direct effects of environmental conditions are relatively well studied, the effects of environmentally mediated species interactions have garnered less attention. In this study, we examined the temperature dependency of size-structured intraguild predation (IGP) between native blue crabs (Callinectes sapidus, the IG predator) and invasive green crabs (Carcinus maenas, the IG prey) to evaluate how the effect of temperature on competitive and predatory rates may influence the latitudinal distribution of these species. In outdoor mesocosm experiments, we quantified interactions between blue crabs, green crabs, and shared prey (mussels) 

In [21]:
main = Main()
main.update_text(str(abstract))
print(main.score())

06/28/2025 18:41:30 - INFO - 	 missing_keys: []
06/28/2025 18:41:30 - INFO - 	 unexpected_keys: []
06/28/2025 18:41:30 - INFO - 	 mismatched_keys: []
06/28/2025 18:41:30 - INFO - 	 error_msgs: []
06/28/2025 18:41:30 - INFO - 	 Model Parameters: 90.5M, Transformer: 82.1M, Coref head: 8.4M


behaviou?r []
[^A-Za-z]+rate [181, 715, 1373, 1460, 1477, 1863]
colou?r []
biomass []
[^A-Za-z]+mass []
[^A-Za-z]+size [480, 973, 1128, 1351, 1404, 1535]
number []
length []
pattern []
weight []
shape []
efficiency []
trait []
phenotype []
demography []
population (structure|mechanic)s? []
ability []
capacity [1570]
height []
width []
[A-Za-z]+span []
diet []
feeding []
nest []
substrate []
breeding []
[^A-Za-z]+age[^A-Za-z]+ []
lifespan []
development []
output []
time [1393]
periodlevel []
configuration []
dimorphism []
capability []
regulation []
excretion []
luminescence []
[^A-Za-z]+role [1917]
sensitivity []
resistance []
(un|(^|\s)[A-Za-z]*-)infected []
Tokens to Filter: [rates, rates, rates, rate, rate, rates, size, size, size, size, size, capacity, times, role]
	Token: rates
	Expanded Token: or indirectly by altering physiological rates that dictate the sign and strength of species interactions
	Token: rates
	Expanded Token: to evaluate how the effect of temperature on competi

In [22]:
for token in main.sp_doc:
    if token.lower_ == "uninfected":
        print(token.pos_)