In [1]:
# Trait-Mediated Interaction Modification
# Empirical:
# You could look for words like "hypothesis", "experiment", "found", and "discovered". That may point
# towards there being an experiment in the paper. There are also words like "control group", "compared",
# "findings", "results", "study", and more.
# Qualitative vs. Quantitative:
# To infer whether something is quantitative, you could look for numeric tokens and units.
# However, you can only do so much with the abstract. Therefore, this is likely not good enough.
# Yet, you could still take advantage of words like "fewer" and "increased" to show that there is a change.
# However, this would be more suited for the above category.
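# A minimal sketch of the numeric-token idea above, assuming a plain regular expression is
# enough for abstracts. The function name and the short unit list are illustrative
# assumptions and are not used elsewhere in this notebook.

```python
import re

# Hypothetical pattern: a number, optionally followed by a unit from a small
# illustrative list. The unit list is an assumption, not taken from the data.
NUMBER_WITH_UNIT = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)(?:\s*(%|mm|cm|kg|g|days?|hours?)(?!\w))?",
    re.IGNORECASE,
)

def extract_numeric_mentions(text):
    """Return (number, unit) pairs found in text; unit is '' when absent."""
    return NUMBER_WITH_UNIT.findall(text)

mentions = extract_numeric_mentions("Snails ate 35% fewer barnacles over 12 days.")
print(mentions)
```

# An abstract with no such pairs is not necessarily qualitative, so this would only be a weak signal.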
# Traits:
# There is no NLP tool for traits that I can use or create, so I think that I could use keywords instead.
# For example, "snail feeding rates" is a trait. You may be able to spot this by looking for a word like
# "rate". You'd expand that word to include "snail feeding rates". As "snail" is a species, you can infer
# that "rates" is a trait. A stricter approach would use a dependency parser to ensure that the trait
# is a property of the species (like before). However, given all the cases that may exist, I think checking
# whether a species can be found by traveling backward and/or forward without hitting certain tokens could
# work well enough.
# 3 Species or More:
# This is simple. However, I think using a dictionary and TaxoNERD would be beneficial (for higher accuracy).
# To handle the potential differences in tokenization, character offsets should be used.
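# A minimal sketch of why character offsets help, assuming two tokenizers that split the same
# text differently. The two regex "tokenizers" and both helper names are stand-ins invented
# for illustration; they are not spaCy or TaxoNERD calls.

```python
import re

def tokenize_with_offsets(text, pattern):
    """Return (token_text, start_char, end_char) triples for each regex match."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

def tokens_covering(tokens, start_char, end_char):
    """Return the tokens that overlap the character range [start_char, end_char)."""
    return [t for t in tokens if t[1] < end_char and t[2] > start_char]

text = "The green crab (Carcinus maenas) eats snails."

# Stand-in tokenizers: one splits on whitespace, one also splits off punctuation.
coarse = tokenize_with_offsets(text, r"\S+")
fine = tokenize_with_offsets(text, r"\w+|[^\w\s]")

# A span found by the coarse tokenizer: its 4th and 5th tokens.
start_char, end_char = coarse[3][1], coarse[4][2]

# The same span expressed in the fine tokenizer's tokens, located by characters
# rather than by token index (token indices differ between the two).
aligned = [t[0] for t in tokens_covering(fine, start_char, end_char)]
print(aligned)
```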
# Standardization:
# There is a lot of variance in the scores. To squash this issue, I think that we could assign each sentence
# a value from 0 to 1. We would add these values and divide by the number of sentences. This would result in
# a number that is also from 0 to 1. However, there are categories that we would like to inspect. So, we must
# create an overall score in the interval [0, 1] while also scoring each category. For each sentence,
# we could add a point for each category that is observed. The sentence would receive said score divided by the
# number of categories. At the end, we add up all the sentence scores and divide by the number of sentences.
# The aggregate score for each category would also be divided by the number of sentences.
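# The scheme above can be sketched as follows, assuming each sentence has already been reduced
# to the set of category indices observed in it. The function name and the input format are
# assumptions made for this sketch; the notebook's own scoring lives in Main.score.

```python
def score_abstract(sentence_hits, num_categories):
    """Score an abstract from per-sentence category observations.

    sentence_hits: one set of observed category indices per sentence.
    Returns (overall score, list of per-category scores), all in [0, 1].
    """
    num_sentences = len(sentence_hits)
    if num_sentences == 0:
        return 0.0, [0.0] * num_categories

    category_totals = [0] * num_categories
    sentence_scores = []
    for hits in sentence_hits:
        # One point per category observed in this sentence.
        for category in hits:
            category_totals[category] += 1
        # The sentence's score is its points divided by the number of categories.
        sentence_scores.append(len(hits) / num_categories)

    overall = sum(sentence_scores) / num_sentences
    per_category = [total / num_sentences for total in category_totals]
    return overall, per_category

# Four sentences, three categories (0, 1, 2).
overall, per_category = score_abstract([{0, 1}, {2}, set(), {0, 1, 2}], 3)
print(overall, per_category)
```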

In [2]:
import re
import csv
import sys
import time
import spacy
import numpy as np
import pandas as pd
import random
import pickle
from fastcoref import FCoref, LingMessCoref
from taxonerd import TaxoNERD
from spacy.matcher import Matcher
from spacy.matcher import DependencyMatcher, PhraseMatcher
from spacy.language import Language
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
%run -i "../utils.py"

In [3]:
class References:
    def __init__(self, main):
        self.main = main
        self.predictions = None
        # Token's Index to Token's Cluster
        self.cluster_map = None

    def update(self, text, verbose=False):
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")
        self.predictions = self.main.fcoref.predict(texts=[text])
        self.cluster_map = self.load_cluster_map(self.predictions, verbose=verbose)
        
    def load_cluster_map(self, predictions, verbose=False):
        if verbose:
            print(f"Load Cluster Map:")
        
        cluster_map = {}
        
        for prediction in predictions:
            clusters = prediction.get_clusters(as_strings=False)
            if verbose:
                print(f"Clusters:\n{prediction.get_clusters(as_strings=True)}")

            # A cluster contains spans (segments of the text) that
            # refer to each other (e.g. ['We', 'our'] or [(0, 2), (5, 8)]).
            for cluster in clusters:
                # Each cluster is a list of spans, but here it is
                # converted into a list of tokens, which is easier
                # to work with in code.
                clustered_tokens = []

                if verbose:
                    print(f"Cluster: {cluster}")
                
                for span in cluster:
                    span_words = self.main.sp_doc.text[span[0]:span[1]].split()
                    char_index = span[0]
                    for i in range(len(span_words)):
                        word = span_words[i]
                        clustered_tokens.append(self.main.token_at_char(char_index))
                        char_index += len(word) + 1
                
                # Key by token index so that lookups via token.i succeed.
                for token in clustered_tokens:
                    cluster_map[token.i] = [t for t in clustered_tokens if t != token]
        
        if verbose:
            print("Cluster Map")
            print(cluster_map)
            print()
        
        return cluster_map

    def same_reference(self, token_a, token_b, verbose=False, compare_text=True):
        if verbose:
            print("Same Reference:")
            print(f"Token A: {token_a} v. Token B: {token_b}")

        # Compare Text
        if compare_text and token_a.lower_ == token_b.lower_:
            if verbose:
                print("Same String")
            return True

        # Check if Token B in Token A Cluster
        if token_a.i in self.cluster_map and token_b in self.cluster_map[token_a.i]:
            if verbose:
                print(f"Token A Cluster: {self.cluster_map[token_a.i]}")
                print("\tToken B in Token A Cluster")
            return True

        # Check if Token A in Token B Cluster
        if token_b.i in self.cluster_map and token_a in self.cluster_map[token_b.i]:
            if verbose:
                print(f"Token B Cluster: {self.cluster_map[token_b.i]}")
                print("\tToken A in Token B Cluster")
            return True
        
        return False

    def same_reference_span(self, span_a, span_b, verbose=False):
        # Compare Text
        if span_a.text.lower() == span_b.text.lower():
            if verbose:
                print("Same String")
            return True
        
        for token_a in span_a:
            for token_b in span_b:
                if self.same_reference(token_a, token_b, compare_text=False, verbose=verbose):
                    return True
        return False

In [4]:
# Used for the 'Gazetteer'
@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

class Species:
    def __init__(self, main):
        self.main = main
        self.tn_nlp = TaxoNERD(prefer_gpu=False).load(model="en_ner_eco_biobert", exclude=["tagger", "parser", "attribute_ruler"])
        self.tn_nlp.add_pipe("lower_case_lemmas", after="lemmatizer")
        self.tn_doc = None
        # Contains any spans that have been identified
        # as a species.
        self.spans = None
        # Contains any tokens that have been identified
        # as being a part of a species or a species.
        # Meaning, if "brown squirrel" was in the text,
        # this list would contain ["brown", "squirrel", ...].
        self.tokens = None
        # To simplify later code, each species token is
        # mapped to the span it belongs to.
        self.token_to_span = None
        # Contains pairs of spans that have been identified
        # as alternative names of each other.
        self.alternate_spans = None
        # There are words that TN may not recognize. This
        # can be used to help that.
        self.gazetteer = ["juvenile", "adult", "prey", "predator", "species", "crab", "snail"]
        patterns = []
        for name in self.gazetteer:
            doc = self.tn_nlp(name)
            patterns.append({"label": "LIVB", "pattern": [{"LEMMA": token.lemma_} for token in doc]})
        ruler = self.tn_nlp.add_pipe("entity_ruler")
        ruler.add_patterns(patterns)

    def update(self, text, verbose=False):
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")
        self.tn_doc = self.tn_nlp(text)
        self.spans, self.tokens, self.token_to_span, self.alternate_spans = self.load_species(verbose=verbose)
        
    def load_species(self, verbose=False):
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")

        # These three contain the literal species that have been
        # identified in the text. Tokens that aren't adjectives,
        # nouns, or proper nouns will be stripped.
        spans = []
        tokens = []
        token_to_span = {}

        # It's useful to know if a different name refers to a
        # species we have already seen -- so we pair any alternative
        # names. For example, in "predatory crab (Carcinus maenas)",
        # "predatory crab" is an alternative name for "Carcinus maenas"
        # and vice versa. This is used so that the same species can be
        # properly tracked and won't be given redundant points.
        alternate_spans = {}

        # TaxoNerd
        if verbose:
            print("Recognized TN Entities:")
            print(self.tn_doc.ents)

        for species_span in self.tn_doc.ents:
            if verbose:
                print(f"Species Span: {species_span}")
   
            # Translated to Main Doc
            # species_span.end is exclusive, so the last token is at end - 1.
            sp_li = self.main.token_at_char(self.tn_doc[species_span.start].idx).i
            sp_ri = self.main.token_at_char(self.tn_doc[species_span.end - 1].idx).i + 1
            
            sp_species_span = self.main.sp_doc[sp_li:sp_ri]
            
            if verbose:
                print(f"Species Span (SpaCy): {sp_species_span}")

            # Expand Species if Ambiguous (1st Condition) or Possibly Missing Information (2nd Condition)
            if sp_species_span.text.lower() in self.gazetteer or (sp_species_span.start > 0 and self.main.sp_doc[sp_species_span.start-1].pos_ in ["ADJ"]):
                sp_species_span = self.main.expand_unit(
                    il_unit=sp_species_span.start, 
                    ir_unit=sp_species_span.end - 1,
                    il_boundary=0,
                    ir_boundary=len(self.main.sp_doc),
                    direction='LEFT',
                    allowed_speech=["ADJ", "PROPN"],
                    allowed_literals=["-"],
                    verbose=verbose
                )

                if verbose:
                    print(f"Expanded Species Span (SpaCy): {sp_species_span}")
            
            # Contract Species (Remove Outer Punctuations and/or Symbols)
            sp_species_span = self.main.contract_unit(
                il_unit=sp_species_span.start, 
                ir_unit=sp_species_span.end - 1, 
                allowed_speech=["ADJ", "NOUN", "PROPN"],
                verbose=verbose
            )

            if verbose:
                print(f"Contracted Species Span (SpaCy): {sp_species_span}")

            # Add Span and Tokens
            spans.append(sp_species_span)
            for token in sp_species_span:
                if verbose:
                    print(token, token.pos_)
                if token.pos_ not in ["ADJ", "PROPN", "NOUN"]:
                    continue
                tokens.append(token)
                token_to_span[token] = sp_species_span

        
        # Finding and Storing Alternative Names
        if verbose:
            print("Finding Alternate Names")
        
        for i, species_span in enumerate(spans):
            if i + 1 >= len(spans):
                break
            
            next_species_span = spans[i+1]
            if verbose:
                print(f"SP1: {species_span}, SP2: {next_species_span}, DIST: {next_species_span.start - species_span.end == 1}")
            
            # There's one token in between the two species, and
            # a token exists after the second species.
            if next_species_span.start - species_span.end == 1 and next_species_span.end < len(self.main.sp_doc):
                before_next = self.main.sp_doc[next_species_span.start - 1]
                after_next = self.main.sp_doc[next_species_span.end]

                if verbose:
                    print(f"Token Before SP2: {before_next}, Token After SP2: {after_next}")
                
                # The next species span is surrounded by parentheses (generalized).
                # This likely indicates that this next species is an alternative name
                # for the species before.
                if before_next.pos_ in ["PUNCT", "SYM"] and after_next.pos_ in ["PUNCT", "SYM"]:
                    sp_span_text = species_span.text.lower()
                    next_sp_span_text = next_species_span.text.lower()
                    
                    if sp_span_text not in alternate_spans:
                        alternate_spans[sp_span_text] = []
                    if next_sp_span_text not in alternate_spans:
                        alternate_spans[next_sp_span_text] = []
                    
                    alternate_spans[sp_span_text].append(next_sp_span_text)
                    alternate_spans[next_sp_span_text].append(sp_span_text)
                            
        if verbose:
            print(f"Spans: {spans}")
            print(f"Tokens: {tokens}")
            print(f"Alternate Spans: {alternate_spans}")
        
        return (spans, tokens, token_to_span, alternate_spans)

    def span_at_token(self, token):
        if token in self.token_to_span:
            return self.token_to_span[token]
        return None
    
    def is_species(self, token):
        return token in self.tokens
        
    def has_species(self, tokens, verbose=False):
        if verbose:
            print(f"Given Tokens: {[t.i for t in tokens]}")
            print(f"Species Tokens: {[t.i for t in self.tokens]}")
        for token in tokens:
            if token in self.tokens:
                return True
        return False

    def same_species(self, sp_1, sp_2, verbose=False):
        if verbose:
            print(f"SP 1: {sp_1}")
            print(f"SP 2: {sp_2}")
            
        # Compare
        if verbose:
            print(f"Compare Literals: '{sp_1.text.lower()}' == '{sp_2.text.lower()}'")
        
        if sp_1.text.lower() == sp_2.text.lower():
            return True

        # Alternate Names
        if verbose:
            print("Check Alternate Names")

        sp_1_text = sp_1.text.lower()
        sp_2_text = sp_2.text.lower()

        if verbose:
            print(f"SP 1 TEXT: {sp_1_text}")
            print(f"SP 2 TEXT: {sp_2_text}")
        
        if sp_1_text in self.alternate_spans:
            if verbose:
                print(f"SP 1 Alternate Spans: {self.alternate_spans[sp_1_text]}")
            if sp_2_text in self.alternate_spans[sp_1_text]:
                return True

        
        if sp_2_text in self.alternate_spans:
            if verbose:
                print(f"SP 2 Alternate Spans: {self.alternate_spans[sp_2_text]}")
            if sp_1_text in self.alternate_spans[sp_2_text]:
                return True

        # Singular Version of Phrase (e.g. "fewer crabs" becomes "fewer crab")
        singular_version = lambda tokens : " ".join([*[token.text for token in tokens[:-1]], tokens[-1].lemma_]).lower()

        # Removing Adjectives
        # Given two spans, "fewer crabs" and "crabs", you'd want them to be
        # recognized as the same species. However, you don't want "red crabs" and
        # "blue crabs" to be recognized as the same. So, the adjective is ignored
        # only when one span has exactly one adjective and the other has none.
        # "fewer crabs" is reduced to "crabs" and compared with "crabs", while
        # "red crabs" and "blue crabs" are left as they are since they have the
        # same number of adjectives. Spans with more adjectives than that are
        # probably not the same species anyway.
        sp_1_adjs = []
        sp_1_nouns = []
        for token in sp_1:
            if token.pos_ == "ADJ":
                sp_1_adjs.append(token)
            elif token.pos_ in ["PROPN", "NOUN"]:
                sp_1_nouns.append(token)
        
        sp_2_adjs = []
        sp_2_nouns = []
        for token in sp_2:
            if token.pos_ == "ADJ":
                sp_2_adjs.append(token)
            elif token.pos_ in ["PROPN", "NOUN"]:
                sp_2_nouns.append(token)

        if verbose:
            print(f"Number of Adjectives in 1: {len(sp_1_adjs)}")
            print(f"Number of Adjectives in 2: {len(sp_2_adjs)}")

        if sp_1_nouns and sp_2_nouns and ((len(sp_1_adjs) == 1 and len(sp_2_adjs) == 0) or (len(sp_2_adjs) == 1 and len(sp_1_adjs) == 0)):
            # "fewer crabs" vs. "crabs"
            sp_singular_nouns_1 = singular_version(sp_1_nouns)
            sp_singular_nouns_2 = singular_version(sp_2_nouns)

            if verbose:
                print(f"Compare Singular Nouns: '{sp_singular_nouns_1}' == '{sp_singular_nouns_2}'")
            
            return sp_singular_nouns_1 == sp_singular_nouns_2
            
        # Compare Singular Version
        # These lines are so that spans like "predatory crab" and "predatory crabs"
        # aren't ruled out as different species. The above check may also be handled
        # by this, but the lemma of a token depends on the surrounding context, so it
        # also might not.
        sp_lemma_1 = singular_version(sp_1)
        sp_lemma_2 = singular_version(sp_2)

        if verbose:
            print(f"Compare Singular Spans: '{sp_lemma_1}' == '{sp_lemma_2}'")
        
        if sp_lemma_1 == sp_lemma_2:
            return True

        return False

In [5]:
class Keywords:
    def __init__(self, main, base=None, phrases=None, speech=None, threshold=0.7):
        self.main = main
        self.base = [b.lower() for b in (base or [])]
        self.speech = [s.upper() for s in (speech or [])]
        self.phrases = [p.lower() for p in (phrases or [])]
        self.threshold = threshold
        self.vocab = [self.main.sp_nlp(word) for word in self.base]
        self.tokens = []

    def update(self, verbose=False):
        # SpaCy Doc DNE or Indexing Map DNE
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")
        self.tokens = self.match_tokens(verbose=verbose)

    def match_tokens(self, verbose=False):
        # SpaCy Doc DNE or Indexing Map DNE
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")
        
        matched_tokens = []

        # Words
        for token in self.main.sp_doc:
            if verbose:
                print(f"Potential Keyword: {token, token.pos_} v. Speech: {self.speech}")
            if token.pos_ not in self.speech:
                continue
            # Comparing Literal Text
            if token.lemma_.lower() in self.base or token.lower_ in self.base:
                matched_tokens.append(token)
                continue
            # Comparing Similarity
            lemma = self.main.sp_nlp(token.lemma_)
            for word in self.vocab:
                similarity = word.similarity(lemma)
                if verbose:
                    print(f"{lemma} and {word} Similarity: {similarity}")
                if similarity >= self.threshold:
                    matched_tokens.append(token)
                    break

        # Phrases
        text = self.main.sp_doc.text.lower()
        for phrase in self.phrases:
            for char_index in [match.start() for match in re.finditer(re.escape(phrase), text)]:
                matched_tokens.append(self.main.token_at_char(char_index))
                
        return matched_tokens

class ExperimentKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            base=["study", "hypothesis", "experiment", "found", "discover", "compare", "finding", "result"],
            phrases=["control group", "independent", "dependent"],
            speech=["VERB", "NOUN"], 
            threshold=0.7
        )

class CauseKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            base=["increase", "decrease", "change", "shift", "cause", "produce"], 
            speech=["VERB"], 
            threshold=0.6
        )

class ChangeKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            base=["few", "more", "increase", "decrease", "less", "short", "long"], 
            speech=["NOUN"], 
            threshold=0.6
        )

In [6]:
class TraitKeywords(Keywords):
    def __init__(self, main):
        super().__init__(
            main, 
            base=["behavior", "rate", "color", "mass", "size", "length"], 
            speech=["NOUN", "ADJ"], 
            threshold=0.7
        )

    def update(self, verbose=False):
        super().update(verbose=verbose)
        if verbose:
            print(f"Unfiltered Tokens: {self.tokens}")
        self.tokens = self.filter_tokens(self.tokens, verbose)

    def filter_tokens(self, tokens, verbose=False):
        if not self.main.sp_doc or not self.main.index_map:
            raise Exception("DNE")

        filtered = []
        for token in tokens:
            expanded_token = self.main.expand_unit(
                il_unit=token.i, 
                ir_unit=token.i, 
                il_boundary=0, 
                ir_boundary=len(self.main.sp_doc), 
                allowed_speech=["ADJ", "NOUN", "ADP", "PART"],
                allowed_literals=["-", ","],
                verbose=verbose
            )
            
            if verbose:
                print(f"Token: {token}")
                print(f"Expanded Token: {expanded_token}")

            if self.main.species.has_species(expanded_token):
                if verbose:
                    print(f"\tContains Species")
                filtered.append(token)
        
        return filtered

In [7]:
class Main:
    def __init__(self):
        # Tools
        self.sp_nlp = spacy.load("en_core_web_lg")
        self.fcoref = FCoref(enable_progress_bar=False, device='cpu')
        self.sp_doc = None

        # Maps Character Position to Token in Document
        # Used to handle differences between different
        # pipelines and tools.
        self.index_map = None

        # Parsers
        self.species = Species(self)
        self.traits = TraitKeywords(self)
        self.causes = CauseKeywords(self)
        self.changes = ChangeKeywords(self)
        self.references = References(self)
        self.experiment = ExperimentKeywords(self)

    def update_doc(self, doc, verbose=False):
        self.sp_doc = doc
        self.index_map = self.load_index_map()
        self.references.update(doc.text, verbose=False)
        self.species.update(doc.text, verbose=verbose)
        self.traits.update(verbose=False)
        self.causes.update(verbose=False)
        self.changes.update(verbose=False)
        self.experiment.update(verbose=False)

    def update_text(self, text, verbose=False):
        self.sp_doc = self.sp_nlp(text)
        self.update_doc(self.sp_doc, verbose=verbose)
        
    def token_at_char(self, char_index):
        # SpaCy Doc or Indexing Map Not Found
        if not self.sp_doc or not self.index_map:
            raise Exception("DNE")

        # Index into Map
        if char_index in self.index_map:
            return self.index_map[char_index]

        # Looking in Tokens
        # Depending on the tokenizer, the character being
        # used to find a token may not be the first character
        # of the token.
        for token in self.sp_doc:
            if char_index >= token.idx and char_index < token.idx + len(token):
                return token

        # There must be a token that corresponds to the
        # given character index. If there's not, there's
        # an issue.
        raise Exception("Token Not Found")
        
    def load_index_map(self):
        # SpaCy Doc Not Found
        if self.sp_doc is None:
            raise Exception("DNE")

        # Map Character Index to Token
        index_map = {}
        for token in self.sp_doc:
            index_map[token.idx] = token
        return index_map

    def expand_unit(self, *, il_unit, ir_unit, il_boundary, ir_boundary, allowed_speech=[], allowed_literals=[], direction='BOTH', verbose=False):
        if verbose:
            print("LEFT")
            
        if direction in ['BOTH', 'LEFT']:
            while il_unit > il_boundary:
                prev_token = self.sp_doc[il_unit-1]
                if verbose:
                    print(f"il_unit: {il_unit}, il_boundary: {il_boundary}, prev_token: {prev_token}, prev_token.pos_: {prev_token.pos_}")
                if prev_token.pos_ not in allowed_speech and prev_token.lower_ not in allowed_literals:
                    break
                il_unit -= 1

        if direction in ['BOTH', 'RIGHT']:
            # ir_boundary is exclusive (callers pass len(doc)), so stop one short.
            while ir_unit < ir_boundary - 1:
                next_token = self.sp_doc[ir_unit+1]
                if verbose:
                    print(f"ir_unit: {ir_unit}, ir_boundary: {ir_boundary}, next_token: {next_token}, next_token.pos_: {next_token.pos_}")
                if next_token.pos_ not in allowed_speech and next_token.lower_ not in allowed_literals:
                    break
                ir_unit += 1

        expanded_unit = self.sp_doc[il_unit:ir_unit+1]
        if verbose:
            print(f"Expanded Unit: {expanded_unit}")
        return expanded_unit

    def contract_unit(self, *, il_unit, ir_unit, allowed_speech=[], allowed_literals=[], direction='BOTH', verbose=False):
        # il_unit_0 = il_unit
        # ir_unit_0 = ir_unit

        if verbose:
            print("LEFT")
            
        if direction in ['BOTH', 'LEFT']:
            while il_unit <= ir_unit:
                curr_token = self.sp_doc[il_unit]
                if verbose:
                    print(f"il_unit: {il_unit}, ir_unit: {ir_unit}, curr_token: {curr_token}, curr_token.pos_: {curr_token.pos_}")
                if curr_token.pos_ in allowed_speech or curr_token.lower_ in allowed_literals:
                    break
                il_unit += 1

        if verbose:
            print("RIGHT")
        
        if direction in ['BOTH', 'RIGHT']:
            while ir_unit >= il_unit:
                curr_token = self.sp_doc[ir_unit]
                if verbose:
                    print(f"il_unit: {il_unit}, ir_unit: {ir_unit}, curr_token: {curr_token}, curr_token.pos_: {curr_token.pos_}")
                if curr_token.pos_ in allowed_speech or curr_token.lower_ in allowed_literals:
                    break
                ir_unit -= 1

        # il_unit = min(il_unit, ir_unit_0)
        # ir_unit = max(ir_unit, il_unit_0)
        assert il_unit <= ir_unit
        contracted_unit = self.sp_doc[il_unit:ir_unit+1]
        if verbose:
            print(f"Contracted Unit: {contracted_unit}")
        return contracted_unit

    def score(self, verbose=False):
        # Categories
        SPECIES = 0
        TRAIT = 1
        EXPERIMENT = 2

        # Number Categories
        NUM_CATEGORIES = 3
        
        # Points
        # Each sentence can add only one to each
        # of these categories. An entry represents
        # the points for a category.
        points = [0] * NUM_CATEGORIES

        # Extracted Information
        change_tokens = self.changes.tokens
        cause_tokens = self.causes.tokens
        trait_tokens = self.traits.tokens
        species_tokens = self.species.tokens
        experiment_tokens = self.experiment.tokens

        if verbose:
            print(f"Change Tokens: {self.changes.tokens}")
            print(f"Cause Tokens: {self.causes.tokens}")
            print(f"Trait Tokens: {self.traits.tokens}")
            print(f"Species Tokens: {self.species.tokens}")
            print(f"Experiment Tokens: {self.experiment.tokens}")
        
        # Removing Redundant Species Tokens
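        # Keep only the first token of each species span.
        seen = set()
        unique_species_tokens = []
        for token in species_tokens:
            span_start = self.species.span_at_token(token).start
            if span_start not in seen:
                seen.add(span_start)
                unique_species_tokens.append(token)
        species_tokens = unique_species_tokens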
        
        if verbose:
            print(f"Not-Redundant Species Tokens: {species_tokens}")
        
        # This is used to ensure that at least three species
        # are mentioned.
        seen_species = {}

        for sent in self.sp_doc.sents:
            # This is here so that I don't add more than one point
            # for each category in a single sentence.
            found = [0] * NUM_CATEGORIES

            sent_tokens = [token for token in sent]
            sent_cause_tokens = set(sent_tokens).intersection(cause_tokens)
            sent_change_tokens = set(sent_tokens).intersection(change_tokens)
            sent_seen_species = []
            
            if verbose:
                print(f"Sentence: {sent}")
                print(f"Sentence Tokens: {sent_tokens}")
                print(f"Sentence Cause Tokens: {sent_cause_tokens}")
                print(f"Sentence Change Tokens: {sent_change_tokens}")
            
            for token in sent_tokens:
                # No More Searching Needed
                if found[SPECIES] >= 1 and found[TRAIT] >= 1 and found[EXPERIMENT] >= 1:
                    break
                
                if verbose:
                    print(f"Token: '{token}' ({token.pos_})")

                # Trait
                if found[TRAIT] < 1 and token in trait_tokens:
                    points[TRAIT] += 1
                    found[TRAIT] = True

                    if verbose:
                        print("Points Added for Trait")

                # Experiment
                if found[EXPERIMENT] < 1 and token in experiment_tokens:
                    points[EXPERIMENT] += 1
                    found[EXPERIMENT] = True

                    if verbose:
                        print("Points Added for Experiment")

                # Species
                if token in species_tokens:
                    # Find Species Span
                    species_span = self.species.span_at_token(token)
                    if not species_span:
                        raise Exception("Species Span DNE")

                    if verbose:
                        print(f"Species Span: {species_span}")
    
                    # Updating Seen Species
                    if verbose:
                        print("Seen Species:")
                        print(seen_species)
                    
                    past_visits = 0
                    for seen_species_span in seen_species.keys():
                        if verbose:
                            print(f"Comparing '{species_span}' and '{seen_species_span}'")

                        if self.species.same_species(species_span, seen_species_span, verbose=verbose):
                            past_visits = seen_species[seen_species_span]
                            if verbose:
                                print(f"\t'{species_span}' == '{seen_species_span}'")
                                print(f"\tNumber of Visits: {past_visits}")
                            seen_species[seen_species_span] += 1
                            break

                    if past_visits == 0:
                        seen_species[species_span] = 1
                    
                    if verbose:
                        print("Seen Species Updated:")
                        print(seen_species)

                    # We only add points if it's a species that has not been seen
                    # in the sentence. This is to avoid redundant points.
                    if verbose:
                        print("Seen Species in Sentence:")
                    redundant_species = False
                    for seen_species_span in sent_seen_species:
                        if verbose:
                            print(f"Comparing '{species_span}' and '{seen_species_span}'")
                        if self.species.same_species(species_span, seen_species_span, verbose=verbose):
                            redundant_species = True
                            if verbose:
                                print("\tEqual => Continue")
                            break
                    if redundant_species:
                        continue
                    sent_seen_species.append(species_span)
                    
                    if not found[SPECIES]:
                        # To get points in the species category,
                        # there must be (1) a species; and (2) a change or cause
                        # word within `distance` tokens of it.
                        distance = 5
                        sent_cause_tokens_in_area = [
                            c_token for c_token in sent_cause_tokens
                            if c_token != token and abs(c_token.i - token.i) <= distance
                        ]
                        sent_change_tokens_in_area = [
                            c_token for c_token in sent_change_tokens
                            if c_token != token and abs(c_token.i - token.i) <= distance
                        ]
                        
                        if verbose:
                            print(f"Cause Tokens in Area: {sent_cause_tokens_in_area}")
                            print(f"Change Tokens in Area: {sent_change_tokens_in_area}")
                        
                        if sent_cause_tokens_in_area or sent_change_tokens_in_area:
                            points[SPECIES] += 0.5
                            found[SPECIES] = True

                            if verbose:
                                print("Points Added for Species")

        # 3 Species Required
        if verbose:
            print(f"Seen Species: {seen_species}")
        
        if len(seen_species) < 3:
            return 0.0
        
        # Normalizing Score
        NUM_SENTENCES = len(list(self.sp_doc.sents))
        
        score = (points[TRAIT] + points[SPECIES] + points[EXPERIMENT]) / (NUM_CATEGORIES * NUM_SENTENCES)
        assert 0.0 <= score <= 1.0, f"Score out of range: {score}"

        if verbose:
            print(f"Score: {score}")

        return score
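
The species check above only awards points when a cause or change token sits within `distance` tokens of the species token. Here is a minimal sketch of that window test on plain integer positions. The helper name `tokens_within` and the sample indices are made up for illustration; the method itself compares spaCy tokens through their `.i` attribute.

```python
def tokens_within(anchor_index, candidate_indices, distance=5):
    """Return the candidate positions that lie at most `distance` tokens
    from the anchor, excluding the anchor position itself."""
    return [
        i for i in candidate_indices
        if i != anchor_index and abs(i - anchor_index) <= distance
    ]


# Hypothetical positions: a species token at index 10 and some cause tokens.
species_index = 10
cause_indices = [2, 6, 10, 15, 16]

print(tokens_within(species_index, cause_indices))  # [6, 15]
```

The window is symmetric, so a cause word five tokens before the species counts the same as one five tokens after it.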

In [15]:
df = pd.read_csv("../../Datasets/Baseline-1.csv")
# Use the abstract at index 3 as the working example
text = df.Abstract[3]
print(text)

Replicated experiments in artificial ponds demonstrated that an assemblage of aquatic insects competed with tadpoles of the frogs Hyla andersonii and Bufo woodhousei fowleri. We independently manipulated the presence or absence of aquatic insects, and the abundance of an anuran competitor (O or 150 Bufo w. fowleri per experimental pond), using a completely crossed design for two—factor variance analysis, and observed the responses of initially similar cohorts of Hyla andersonii tadpoles to neither, either, or both insect and anuran competitors. Insects and Bufo significantly depressed the mean individual mass at metamorphosis of Hyla froglets and the cumulative biomass of anurans leaving the ponds at metamorphosis. Neither insects nor Bufo affected the survival or larval period of Hyla. Insects also significantly reduced the mean mass of Bufo, showing that both anurans responded to competition from insects. The intensity of competition between natural densities of insects and Hyla tadp

In [16]:
main = Main()
main.update_text(text, verbose=False)

06/13/2025 12:50:50 - INFO - 	 missing_keys: []
06/13/2025 12:50:50 - INFO - 	 unexpected_keys: []
06/13/2025 12:50:50 - INFO - 	 mismatched_keys: []
06/13/2025 12:50:50 - INFO - 	 error_msgs: []
06/13/2025 12:50:50 - INFO - 	 Model Parameters: 90.5M, Transformer: 82.1M, Coref head: 8.4M
06/13/2025 12:51:15 - INFO - 	 Tokenize 1 inputs...
Map: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 15.77 examples/s]
06/13/2025 12:51:15 - INFO - 	 ***** Running Inference on 1 texts *****


In [17]:
score = main.score(verbose=False)
print(score)

0.19444444444444445
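
The printed value is the sum of the category points divided by `NUM_CATEGORIES * NUM_SENTENCES`, as in the last lines of `score`. Here is a small sketch of that arithmetic with made-up point totals. The function name `normalized_score`, the dictionary keys, and the assumption that each category earns at most one point per sentence are mine; the per-sentence bookkeeping is not shown in this part of the notebook.

```python
def normalized_score(points, num_sentences, num_categories=3):
    """Average the category points over every (category, sentence) slot.

    Assumes each category contributes at most 1 point per sentence,
    which keeps the result inside [0, 1].
    """
    if num_sentences == 0:
        return 0.0
    score = sum(points.values()) / (num_categories * num_sentences)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Score out of range: {score}")
    return score


# Hypothetical totals for a four-sentence abstract.
points = {"trait": 1.0, "species": 1.5, "experiment": 0.5}
print(normalized_score(points, num_sentences=4))  # 0.25
```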


In [11]:
main.species.alternate_spans

{'ceratitis capitata': ['diptera'],
 'diptera': ['ceratitis capitata', 'tephritidae'],
 'tephritidae': ['diptera']}
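
A mapping like this could back a "same species" check by direct lookup in either direction. The sketch below is only an illustration on plain strings: the standalone `same_species` function here is hypothetical, and the notebook's `self.species.same_species` takes spans and a `verbose` flag and may apply other rules.

```python
def same_species(a, b, alternate_spans):
    """Treat two names as the same species if they are equal or if either
    one is listed as an alternate of the other (direct lookup only)."""
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    return b in alternate_spans.get(a, []) or a in alternate_spans.get(b, [])


# The mapping printed above.
alternate_spans = {
    "ceratitis capitata": ["diptera"],
    "diptera": ["ceratitis capitata", "tephritidae"],
    "tephritidae": ["diptera"],
}

print(same_species("Ceratitis capitata", "Diptera", alternate_spans))      # True
print(same_species("ceratitis capitata", "tephritidae", alternate_spans))  # False
```

A direct lookup is not transitive: the last call returns False even though both names link to "diptera", so whether those two should count as one species is a separate design choice.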