# Person.ipynb

Demonstration notebook for Text Extensions for Pandas.

This notebook creates business rules for a person name extractor using the facilities of Text Extensions for Pandas.

Instructions to run:
1. (optional) Use the script `env.sh` at the root of this project to create an Anaconda environment `pd` with required packages. Activate this environment by typing `conda activate pd`.
1. From a shell window at the root of the project, start up JupyterLab by typing `jupyter lab`.
1. Inside JupyterLab, navigate to the `notebooks` directory and open up this notebook. You should now be able to run the code in this notebook.

In [None]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if sys.path[0] != "..":
    # Insert rather than overwrite, so the existing first entry is preserved.
    sys.path.insert(0, "..")
    
# Libraries
import numpy as np
import pandas as pd
import regex
import spacy
spacy_language_model = spacy.load("en_core_web_sm")
import textwrap

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

In [None]:
# Example document text courtesy https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
# License: CC-BY-SA
with open("../resources/holy_grail.txt", "r") as f:
    doc_text = f.read()
 
# Parse the document text with SpaCy, then convert the results to a dataframe
token_features = tp.make_tokens_and_features(doc_text, spacy_language_model)
token_features

In [None]:
# We can extract all unique sentence spans by taking the distinct values of
# the "sentence" column of the above dataframe:
sentences = pd.DataFrame({"sentence": token_features["sentence"].unique()})
sentences.head(10)

In [None]:
# The "ent_iob" and "ent_type" fields contain entity tags in 
# Inside-Outside-Beginning (IOB) format.
# Text Extensions for Pandas has a built-in function to convert 
# IOB tagged data to spans of entities.
entities = tp.iob_to_spans(token_features)
entities
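
---

To make the IOB-to-span conversion concrete, here is a minimal sketch in plain Python. It is not the implementation behind `tp.iob_to_spans`; the function name `iob_to_spans_sketch` and the `(begin_token, end_token, type)` output format are assumptions made for illustration only.

```python
from typing import List, Tuple


def iob_to_spans_sketch(iob: List[str], ent_type: List[str]) -> List[Tuple[int, int, str]]:
    """Hypothetical helper: turn per-token IOB tags into entity spans.

    Returns (begin_token, end_token, entity_type) triples, end exclusive.
    """
    spans = []
    begin = None  # Token index where the currently open entity starts
    for i, tag in enumerate(iob):
        # A "B" or "O" tag closes any entity that is still open.
        if tag in ("B", "O") and begin is not None:
            spans.append((begin, i, ent_type[begin]))
            begin = None
        # A "B" tag opens a new entity; "I" simply extends the open one.
        if tag == "B":
            begin = i
    # Close an entity that runs to the end of the document.
    if begin is not None:
        spans.append((begin, len(iob), ent_type[begin]))
    return spans


iob = ["O", "B", "I", "O", "B"]
types = ["", "PERSON", "PERSON", "", "ORG"]
spans = iob_to_spans_sketch(iob, types)
print(spans)  # [(1, 3, 'PERSON'), (4, 5, 'ORG')]
```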

In [None]:
# Let's look at just the entities tagged "PERSON"
person_entities = entities[entities["ent_type"] == "PERSON"]
person_entities.head()

In [None]:
# Use the TokenSpanArray's built-in HTML rendering to look at these
# PERSON entities in the context of the document.
person_entities["token_span"].values

In [None]:
# Load gold standard labels in IOB format from a CSV file
person_gold_iob = pd.read_csv("../resources/holy_grail_person.csv")

# Pull in token offsets from our token_features dataframe
person_gold_iob["token_span"] = token_features["token_span"].values
person_gold_iob["char_span"] = token_features["char_span"].values
person_gold_iob.iloc[25:35]

In [None]:
# Convert from IOB format to spans of entities
person_gold = tp.iob_to_spans(person_gold_iob, entity_type_col_name=None)
person_gold.head()

In [None]:
# Find all the spans that are in both the extractor's answer set and the gold standard
person_intersection = person_gold.merge(person_entities)
person_intersection.head()

In [None]:
# Compute precision and recall
num_true_positives = len(person_intersection.index)
num_entities = len(person_gold.index)
num_entities_extracted = len(person_entities.index)

precision = num_true_positives / num_entities_extracted
recall = num_true_positives / num_entities
F1 = 2.0 * (precision * recall) / (precision + recall)

print(
"""Number of correct answers: {}
Number of entities identified: {}
Actual number of entities: {}
Precision: {:1.2f}
Recall: {:1.2f}
F1: {:1.2f}""".format(num_true_positives, num_entities_extracted, num_entities, precision, recall, F1))

---
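
As a worked example of the precision, recall and F1 arithmetic, here is a small sketch on made-up spans. The spans are invented for illustration and are not taken from the document; representing each span as a `(begin, end)` tuple in a set is an assumption that stands in for the dataframe merge used in this notebook.

```python
def precision_recall_f1(extracted: set, gold: set):
    """Exact-match scoring: a span counts only if it appears in both sets."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    # Guard against division by zero when nothing matches.
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical spans: 2 of 3 extracted spans are correct, 2 of 4 gold spans are found.
gold = {(0, 2), (5, 7), (10, 11), (20, 22)}
extracted = {(0, 2), (5, 6), (10, 11)}
p, r, f1 = precision_recall_f1(extracted, gold)
print("Precision: {:1.2f}  Recall: {:1.2f}  F1: {:1.2f}".format(p, r, f1))
```

Note that the partial match `(5, 6)` counts against both precision and recall, because only exact span matches are treated as correct.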

Our baseline model produces an **F1 score of 0.52** on this document, 
which leaves a lot of room for improvement.

We could at this point work to retrain the baseline model for this domain, 
but that approach would involve several difficulties. We would need to 
obtain and label additional documents that cover this and similar documents
without introducing skew. We would also need to retrain a deep learning model, 
which is a nontrivial task in its own right.

In the case of this model, those two steps are the *easy* part, because the
model is trained on the OntoNotes corpus. Doing anything with that corpus
for commercial purposes requires purchasing an expensive license from the
Linguistic Data Consortium:

![OntoNotes license terms](../resources/ontonotes_license.png)

So instead, let's leave the model as-is for now and try some easier approaches
to improve our accuracy for this domain.

In [None]:
# The simplest form of domain adaptation is whitelists and blacklists.
# Let's find some candidates for a blacklist by looking for spans that
# the model frequently and incorrectly labels as PERSON entities.
false_positives_mask = ~person_entities["token_span"].isin(person_gold["token_span"])
false_positives = person_entities[false_positives_mask]
false_positives

In [None]:
# "Monty Python" and "Knights" are highly unlikely to be PERSON entities.
# Create a dictionary (gazetteer) to hold these and other blacklisted strings.
!cat ../resources/person_blacklist.dict

In [None]:
# Load the dictionary as a dataframe
blacklist_dict = tp.load_dict("../resources/person_blacklist.dict", spacy_language_model)
blacklist_dict

In [None]:
# Build up a dataframe of all spans that match the dictionary
tokens = token_features["char_span"]
blacklist_matches = tp.extract_dict(tokens, blacklist_dict)
blacklist_matches
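
---

The idea behind dictionary (gazetteer) matching can be sketched in a few lines of plain Python. This is only an illustration, not how `tp.extract_dict` works internally; the helper name `extract_dict_sketch`, the whitespace tokenization, and the case-insensitive comparison are all assumptions.

```python
from typing import List, Tuple


def extract_dict_sketch(tokens: List[str], entries: List[List[str]]) -> List[Tuple[int, int]]:
    """Hypothetical helper: find token ranges that equal a dictionary entry.

    Returns (begin_token, end_token) pairs, end exclusive.
    """
    # Normalize entries to lowercase tuples so lookups are a set membership test.
    normalized = {tuple(t.lower() for t in entry) for entry in entries}
    max_len = max((len(e) for e in normalized), default=0)
    matches = []
    for begin in range(len(tokens)):
        # Try every window length up to the longest dictionary entry.
        for length in range(1, max_len + 1):
            end = begin + length
            if end > len(tokens):
                break
            if tuple(t.lower() for t in tokens[begin:end]) in normalized:
                matches.append((begin, end))
    return matches


tokens = "Monty Python and the Holy Grail follows the Knights".split()
blacklist = [["monty", "python"], ["knights"]]
matches = extract_dict_sketch(tokens, blacklist)
print(matches)  # [(0, 2), (8, 9)]
```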

In [None]:
# Exclude any extracted entities whose span exactly matches a blacklist match.
mask = ~person_entities["token_span"].isin(blacklist_matches["match"].values)
person_entities_2 = person_entities[mask]
person_entities_2.head()

In [None]:
# Redo F1 calculation
def compute_and_print_accuracy(ents: pd.DataFrame):
    person_intersection = person_gold.merge(ents)
    num_true_positives = len(person_intersection.index)
    num_entities = len(person_gold.index)
    num_entities_extracted = len(ents.index)
    precision = num_true_positives / num_entities_extracted
    recall = num_true_positives / num_entities
    F1 = 2.0 * (precision * recall) / (precision + recall)
    print(textwrap.dedent("""    Number of correct answers: {}
    Number of entities identified: {}
    Actual number of entities: {}
    Precision: {:1.2f}
    Recall: {:1.2f}
    F1: {:1.2f}""".format(num_true_positives, num_entities_extracted,
                          num_entities, precision, recall, F1)))
    
compute_and_print_accuracy(person_entities_2)

In [None]:
# The blacklist improved our precision from 0.61 to 0.68. 
# Let's see what we can do to improve recall. 
# Here are the remaining false positives.
false_positives_2 = person_entities_2[~person_entities_2["token_span"].isin(person_gold["token_span"])]
false_positives_2

In [None]:
# Most of these false positives appear to be partial matches of actual Person
# entities.

# TODO: Implement ContainsJoin and use it to identify cases where the SpaCy NER
#  model's output span is part of an entity span from the gold standard.
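
---

Until ContainsJoin exists, the intended semantics can be sketched with a nested loop over `(begin, end)` pairs. This is a guess at what the TODO has in mind, not a library feature: the name `contains_join_sketch` and the tuple representation of spans are assumptions, and a real implementation would want something faster than a quadratic scan.

```python
def contains_join_sketch(outer, inner):
    """Hypothetical helper: pairs (o, i) where span o fully contains span i.

    Spans are (begin, end) tuples with an exclusive end.
    """
    return [
        (o, i)
        for o in outer
        for i in inner
        if o[0] <= i[0] and i[1] <= o[1]
    ]


# Made-up spans: two extracted spans are fragments of the first gold span.
gold = [(0, 3), (10, 12)]
extracted = [(0, 2), (1, 3), (10, 12), (11, 14)]

pairs = contains_join_sketch(gold, extracted)
# Keep only the strict partial matches, dropping exact hits.
partial = [(o, i) for o, i in pairs if o != i]
print(partial)  # [((0, 3), (0, 2)), ((0, 3), (1, 3))]
```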

In [None]:
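# Scratch check: take the first blacklist match ...
mp = blacklist_matches["match"].iloc[0]
mp

In [None]:
# ... and the first PERSON span produced by the model ...
mp2 = person_entities["token_span"].iloc[0]
mp2

In [None]:
# ... and compare the two spans for equality.
mp == mp2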

In [None]:
# TODO: Aggregate these partial matches by sentence to find the sentence (9)
#  that has the most examples of these partial matches.
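
---

One possible shape for this aggregation is to assign each partial match to the sentence that contains it and count per sentence. The offsets below are invented, and the helper `sentence_index` is hypothetical; in the notebook the same grouping would more likely be done with a dataframe `groupby`.

```python
from collections import Counter

# Made-up character ranges for three sentences and five partial matches.
sentences = [(0, 50), (50, 120), (120, 200)]
partial_matches = [(5, 10), (60, 70), (75, 80), (100, 110), (130, 140)]


def sentence_index(span, sentences):
    """Hypothetical helper: index of the sentence that fully contains `span`."""
    for idx, (begin, end) in enumerate(sentences):
        if begin <= span[0] and span[1] <= end:
            return idx
    return None  # Span crosses a sentence boundary or lies outside the text


counts = Counter(sentence_index(s, sentences) for s in partial_matches)
worst_sentence, worst_count = counts.most_common(1)[0]
print(worst_sentence, worst_count)  # 1 3
```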

In [None]:
# Let's look at that one "problem" sentence.
sentence = token_features[token_features["sentence"] == sentences["sentence"][9]]
sentence.head(10)

In [None]:
# Use SpaCy to render the dependency parse of the sentence
tp.render_parse_tree(sentence)

In [None]:
# That's a lot of parse tree! Let's cut that down to the portions of the parse
# that cover entities from the gold standard data.

# TODO: Use ContainsJoin to filter down to the tokens that take part in the 
#  partial results

In [None]:
# TODO: Augment entity spans by following "compound" links in the dependency parse
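
---

A rough sketch of what following "compound" links could look like, using a toy token table instead of real parser output. The dictionary layout (`id`, `dep`, `head`) and the function `expand_with_compounds` are assumptions for illustration; the sketch also assumes that compound modifiers sit next to the span they modify, since it widens a contiguous token range.

```python
# Toy parse of "Sir Lancelot rides": "Sir" is a compound modifier of "Lancelot".
tokens = [
    {"id": 0, "text": "Sir", "dep": "compound", "head": 1},
    {"id": 1, "text": "Lancelot", "dep": "nsubj", "head": 2},
    {"id": 2, "text": "rides", "dep": "ROOT", "head": 2},
]


def expand_with_compounds(begin, end, tokens):
    """Hypothetical helper: widen the token range [begin, end) to include
    tokens attached by a "compound" link to a token already in the range."""
    changed = True
    while changed:  # Repeat so that chains of compounds are followed.
        changed = False
        for tok in tokens:
            inside = begin <= tok["id"] < end
            head_inside = begin <= tok["head"] < end
            if tok["dep"] == "compound" and head_inside and not inside:
                begin = min(begin, tok["id"])
                end = max(end, tok["id"] + 1)
                changed = True
    return begin, end


# The model tagged only "Lancelot" (token 1); the compound link pulls in "Sir".
expanded = expand_with_compounds(1, 2, tokens)
print(expanded)  # (0, 2)
```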

In [None]:
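# NOTE: The cells below this point have not yet been reincorporated into the
# main flow. Run the cell that defines the Document class (the last cell of
# this notebook) before the cells that use it.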

In [None]:
# Use a Gremlin query to find all compound proper nouns in the document
g = tp.token_features_to_traversal(token_features)
compound_nouns = (
    g.V()  # Start with all vertices.
    .has("tag", "NNP")  # Filter out those not tagged NNP (proper noun).
    .has("dep", "compound").as_("src")  # Filter out those without a dependency link of type "compound".
    .out()  # Follow the outgoing link to the parent node.
    .has("tag", "NNP").as_("dest")  # Filter paths where the parent node is not a proper noun.
    .select("src", "dest").by("token_span")  # Return the token spans at both ends of each link.
).toDataFrame()
# Add a third column with the combined span
compound_nouns["phrase"] = tp.combine_spans(compound_nouns["src"], compound_nouns["dest"])
compound_nouns.head()

In [None]:
# Display the locations of those compound nouns
compound_nouns["phrase"].values

In [None]:
# Filter down the example sentence to just the tokens that take part in compound nouns
all_tokens_df = pd.DataFrame({
    "token_span" : pd.concat([compound_nouns[c] for c in compound_nouns]).unique()})
compound_noun_tokens = sentence.merge(all_tokens_df)
compound_noun_tokens = compound_noun_tokens.set_index(compound_noun_tokens["id"])
compound_noun_tokens.head(10)

In [None]:
# Render the partial parse trees of just those tokens
tp.render_parse_tree(compound_noun_tokens)

In [None]:
class Resources:
    """
    Data structures that are loaded once, as opposed to recreated on
    every document. For convenience, we hang all of these data structures
    off of a single Python object.
    """
    def __init__(self):
        self.LanguageModel = spacy.load("en_core_web_sm")
        self.Tokenizer = self.LanguageModel.Defaults.create_tokenizer(self.LanguageModel)
        self.FirstNameDict = tp.load_dict("../resources/first_name.dict", self.Tokenizer)
        self.LastNameDict = tp.load_dict("../resources/last_name.dict", self.Tokenizer)
        self.CapsWordRegex = regex.compile("[A-Z][a-z]*")

        
resources = Resources()
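
---

To see which tokens the `CapsWordRegex` pattern accepts, here is a small check using the standard library `re` module in place of `regex`. Whether `tp.extract_regex_tok` requires the pattern to cover the whole token is an assumption here; the sketch uses `fullmatch` to model that behavior.

```python
import re

# Same pattern as CapsWordRegex, compiled with the standard library.
caps_word = re.compile(r"[A-Z][a-z]*")

tokens = ["Arthur", "of", "NATO", "A", "McDonald", "arthur"]

# Assumption: a token matches only if the entire token fits the pattern.
caps_tokens = [t for t in tokens if caps_word.fullmatch(t)]
print(caps_tokens)  # ['Arthur', 'A']
```

All-caps tokens and names with internal capitals, such as "McDonald", are rejected by this pattern, which may or may not be what a person name rule wants.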

In [None]:
# Build some business rules that define some text features.
# The rules are organized into Python classes.
# The output of each rule is a Pandas DataFrame.

# TEMPORARY until we can use Python 3.8 functools' built-in memoized property
from memoized_property import memoized_property

class Dictionaries:
    """
    Rules that evaluate dictionaries against the document's raw tokens.
    """
    def __init__(self, d: "Document", resources: Resources):
        self._d = d
        self._resources = resources
    
    @memoized_property
    def FirstName(self):
        return tp.extract_dict(self._d.Tokens, self._resources.FirstNameDict)
    
    @memoized_property
    def LastName(self):
        return tp.extract_dict(self._d.Tokens, self._resources.LastNameDict)

class Regexes:
    """
    Rules that evaluate regular expressions against the document's raw tokens.
    """
    def __init__(self, d: "Document", resources: Resources):
        self._d = d
        self._resources = resources
    
    @property
    def CapsWord(self):
        """
        A single token that starts with a capital letter, with subsequent letters not
        capitalized.
        """
        return tp.extract_regex_tok(
            tokens = self._d.Tokens,
            compiled_regex = self._resources.CapsWordRegex)
    

class Morphology:
    """
    Rules that filter tokens according to shallow linguistic features.
    """
    def __init__(self, d: "Document"):
        self._d = d
        
    @property
    def ProperNounToken(self):
        """
        Tokens that the part of speech tagger tagged as proper nouns.
        """
        feats = self._d.TokenFeatures
        return pd.DataFrame({"match": feats["token_span"][feats["tag"] == "NNP"]})
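
---

On Python 3.8 and later, the standard library's `functools.cached_property` offers the memoization that the `memoized_property` package provides in the rules above. The class below is a made-up stand-in for a rule class, shown only to illustrate that the property body runs once per instance.

```python
from functools import cached_property


class ExpensiveRules:
    """Hypothetical stand-in for a rule class with a costly property."""

    def __init__(self):
        self.calls = 0  # Counts how often the property body actually runs

    @cached_property
    def FirstName(self):
        self.calls += 1
        # Stand-in for an expensive dictionary extraction.
        return ["Graham", "John"]


rules = ExpensiveRules()
first = rules.FirstName
second = rules.FirstName
print(rules.calls)  # 1
```

The second access returns the cached list without recomputing it, so downstream rules can share intermediate results.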



In [None]:
# Show the tokens labeled as proper nouns
doc = Document(TEST_TEXT, resources)
morph = Morphology(doc)
morph.ProperNounToken

In [None]:
# Pretty-print the spans in ProperNounToken
morph.ProperNounToken["match"].values

In [None]:
# Write some additional business rules that define a person extractor.
# Note the use of a Python method to avoid duplicate code in the rules.
    
class PersonName:
    """
    Rules that extract potential person name entities.
    """
    def __init__(self, doc: "Document", dicts: Dictionaries, regexes: Regexes,
                 morphology: Morphology):
        self._doc = doc
        self._dicts = dicts
        self._regexes = regexes
        self._morphology = morphology

    @staticmethod
    def first_last_name(first: pd.DataFrame, last: pd.DataFrame):
        """
        Generic <first name> <last name> pattern match. Subroutine of rules below.
        
        :param first: DataFrame of first names, with the name in the column "match".
        
        :param last: DataFrame of last names, with the name in the column "match".
        
        :returns: A DataFrame with all <first name> <last name> matches, including the
            columns "first_name", "last_name", and "name" 
            (span that covers both first and last names)
        """
        ret = tp.adjacent_join(
            first_series = first["match"],
            second_series = last["match"],
            first_name = "first_name",
            second_name = "last_name")
        ret["name"] = tp.combine_spans(ret["first_name"], ret["last_name"])
        return ret
    
    @property
    def Person1(self):
        """
        <match of FirstName dict> <match of LastName dict>
        """
        return PersonName.first_last_name(self._dicts.FirstName, self._dicts.LastName)
    
    @property
    def Person2(self):
        """
        <match of FirstName dict> <capitalized word>
        """
        return PersonName.first_last_name(self._dicts.FirstName, self._regexes.CapsWord)
    
    @property
    def Person3(self):
        """
        <token labeled as proper noun> <match of LastName dict>
        """
        return PersonName.first_last_name(self._morphology.ProperNounToken, self._dicts.LastName)
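
---

The `first_last_name` pattern relies on joining spans that sit next to each other. Here is a plain Python sketch of that idea, assuming spans are `(begin_token, end_token)` pairs with an exclusive end and that "adjacent" means the second span starts exactly where the first one ends. The function `adjacent_join_sketch` is hypothetical and does not reflect how `tp.adjacent_join` is implemented.

```python
def adjacent_join_sketch(first, second):
    """Hypothetical helper: pair each span in `first` with the spans in
    `second` that begin at the token where it ends.

    Returns (first_span, second_span, combined_span) triples.
    """
    # Index the second input by begin offset so each lookup is constant time.
    by_begin = {}
    for s in second:
        by_begin.setdefault(s[0], []).append(s)
    return [
        (f, s, (f[0], s[1]))
        for f in first
        for s in by_begin.get(f[1], [])
    ]


# Made-up token ranges: only the first name at token 0 is followed by a last name.
first_names = [(0, 1), (7, 8)]
last_names = [(1, 2), (4, 5)]
result = adjacent_join_sketch(first_names, last_names)
print(result)  # [((0, 1), (1, 2), (0, 2))]
```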


In [None]:
# Instantiate our rules for a document
doc = Document(TEST_TEXT, resources)
dicts = Dictionaries(doc, resources)
regexes = Regexes(doc, resources)
morph = Morphology(doc)
persons = PersonName(doc, dicts, regexes, morph)

In [None]:
# Show one of the output DataFrames
persons.Person3

In [None]:
# Show a detailed view of the "name" column of the above DataFrame
persons.Person3["name"].values

In [None]:
class Document:
    """
    By convention, we 
    """
    def __init__(self, doc_text: str, resources: Resources):
        self._text = doc_text
        self._resources = resources
        
    @property
    def Text(self):
        return self._text
    
    @memoized_property
    def TokenFeatures(self):
        return tp.make_tokens_and_features(self._text, self._resources.LanguageModel)
    
    @memoized_property
    def Sentence(self):
        return pd.DataFrame({"sentence": self.TokenFeatures["sentence"].unique()})
    
    @property
    def Tokens(self):
        """
        :return: tokens as a `pd.Series` backed by a `CharSpanArray`.
        """
        return self.TokenFeatures["char_span"]