# Person.ipynb

Demonstration notebook for Text Extensions for Pandas.

This notebook evaluates the effectiveness of a person name extractor using the facilities of Text Extensions for Pandas.

Instructions to run:
1. (optional) Use the script `env.sh` at the root of this project to create an Anaconda environment `pd` with required packages. Activate this environment by typing `conda activate pd`.
1. From a shell window at the root of the project, start up JupyterLab by typing `jupyter lab`.
1. Inside JupyterLab, navigate to the `notebooks` directory and open up this notebook. You should now be able to run the code in this notebook.

In [None]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if ".." not in sys.path:
    sys.path.insert(0, "..")

# Libraries
import numpy as np
import pandas as pd
import regex
import spacy
import textwrap

# Load the small English SpaCy model once, after all imports.
spacy_language_model = spacy.load("en_core_web_sm")

# And of course we need the text_extensions_for_pandas library itself.
import text_extensions_for_pandas as tp

In [None]:
# Example document text courtesy https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
# License: CC-BY-SA
with open("../resources/holy_grail.txt", "r") as f:
    doc_text = f.read()

# Parse the document text with SpaCy, then convert the results to a dataframe
token_features = tp.make_tokens_and_features(doc_text, spacy_language_model)
token_features

In [None]:
# We can extract out all unique sentence spans by aggregating the "sentence" 
# column of the above dataframe:
sentences = pd.DataFrame({"sentence": token_features["sentence"].unique()})
sentences.head(10)

In [None]:
# The "ent_iob" and "ent_type" fields contain entity tags in 
# Inside-Outside-Beginning (IOB) format.
# Text Extensions for Pandas has a built-in function to convert 
# IOB tagged data to spans of entities.
entities = tp.iob_to_spans(token_features)
entities
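
To make the IOB conversion concrete, here is a minimal pure-Python sketch of the idea. It is not the implementation behind `tp.iob_to_spans`; the function name `iob_to_token_ranges`, the half-open token ranges, and the toy tags below are all assumptions made for illustration.

```python
def iob_to_token_ranges(iob_tags, ent_types):
    """Collapse IOB tags into (begin_token, end_token, type) triples.

    Token ranges are half-open: end_token is one past the last token.
    This is an illustrative sketch, not the library's code.
    """
    spans = []
    start = None  # Index of the first token of the entity being built
    for i, tag in enumerate(iob_tags):
        # Both "B" and "O" close any entity that is currently open.
        if tag in ("B", "O") and start is not None:
            spans.append((start, i, ent_types[start]))
            start = None
        # "B" opens a new entity. A stray "I" with no open entity is
        # treated as the start of one (a choice made for this sketch).
        if tag == "B" or (tag == "I" and start is None):
            start = i
    if start is not None:  # Entity that runs to the end of the input
        spans.append((start, len(iob_tags), ent_types[start]))
    return spans


# Hand-written toy tags, one per token
tags = ["O", "B", "I", "O", "B", "B", "I"]
types = [None, "PERSON", "PERSON", None, "ORG", "PERSON", "PERSON"]
print(iob_to_token_ranges(tags, types))
```

Note that two adjacent entities (tokens 4 and 5 above) stay separate because the second one starts with its own "B" tag.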

In [None]:
# Let's look at just the entities tagged "PERSON"
person_entities = entities[entities["ent_type"] == "PERSON"]
person_entities.head()

In [None]:
# Use the TokenSpanArray's built-in HTML rendering to look at these
# PERSON entities in the context of the document.
person_entities["token_span"].values

In [None]:
# Load gold standard labels in IOB format from a CSV file
person_gold_iob = pd.read_csv("../resources/holy_grail_person.csv")

# Pull in token offsets from our token_features dataframe
person_gold_iob["token_span"] = token_features["token_span"].values
person_gold_iob["char_span"] = token_features["char_span"].values
person_gold_iob.iloc[25:35]

In [None]:
# Convert from IOB format to spans of entities
person_gold = tp.iob_to_spans(person_gold_iob, entity_type_col_name=None)
person_gold.head()

In [None]:
# Find all the spans that are in both the extractor's answer set and the gold standard
person_intersection = person_gold.merge(person_entities)
person_intersection.head()

In [None]:
# Let's compute precision and recall, just on this document.
# Of course, in a real use case, we would be computing these values on a 
# development holdout set of documents while tuning the model, then
# computing them again on a validation set during final testing.
# We use a single document here to show that it is straightforward 
# to collect the necessary information using Pandas.
num_true_positives = len(person_intersection.index)
num_entities = len(person_gold.index)
num_entities_extracted = len(person_entities.index)

precision = num_true_positives / num_entities_extracted
recall = num_true_positives / num_entities
F1 = 2.0 * (precision * recall) / (precision + recall)

print(
"""Number of correct answers: {}
Number of entities identified: {}
Actual number of entities: {}
Precision: {:1.2f}
Recall: {:1.2f}
F1: {:1.2f}""".format(num_true_positives, num_entities_extracted, num_entities, precision, recall, F1))
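
The same three formulas are recomputed several times in this notebook. One way to avoid repeating them is a small helper, which can also guard against an extractor that returns no entities at all (in which case the plain division for precision raises `ZeroDivisionError`). The name `precision_recall_f1` is hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def precision_recall_f1(num_true_positives, num_extracted, num_gold):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = num_true_positives / num_extracted if num_extracted else 0.0
    recall = num_true_positives / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts, purely for illustration: 6 correct answers out of
# 8 extracted spans, against 12 gold-standard spans.
precision, recall, f1 = precision_recall_f1(6, 8, 12)
print("Precision: {:1.2f}  Recall: {:1.2f}  F1: {:1.2f}".format(precision, recall, f1))
```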

In [None]:
# That seems a bit low. Let's look at the false positives.
false_positives = person_entities[~person_entities["token_span"].isin(person_gold["token_span"])]
false_positives
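
The `isin` filter above finds false positives only. If we also want false negatives in the same pass, one option is an outer merge with `indicator=True`, which labels each row by the side it came from. The sketch below uses toy dataframes with integer `begin` and `end` columns as a stand-in for a span column; those column names and values are assumptions for illustration, not data from this notebook.

```python
import pandas as pd

# Toy stand-ins for spans: (begin, end) token offsets in two integer columns.
gold = pd.DataFrame({"begin": [0, 5, 9], "end": [2, 7, 10]})
extracted = pd.DataFrame({"begin": [0, 5, 12], "end": [2, 6, 14]})

# With no `on` argument, merge() joins on every column the two dataframes
# share, here "begin" and "end". indicator=True adds a "_merge" column.
labeled = gold.merge(extracted, how="outer", indicator=True)

# Translate the merge indicator into evaluation terminology.
outcome = labeled["_merge"].astype(str).map({
    "both": "true positive",
    "left_only": "false negative",
    "right_only": "false positive",
})
print(outcome.value_counts())
```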

In [None]:
# Hmm, aside from the first three, most of these appear to be partial matches.
# Let's recompute precision and recall giving credit for partial matches.
# We start by finding out how many spans in person_entities["token_span"]
# are contained within a span from person_gold["token_span"]
looser_intersection = tp.contain_join(person_gold["token_span"], person_entities["token_span"],
                                      "gold", "extracted")
looser_intersection
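
As used here, the containment join pairs each span from the first argument with every span from the second argument that lies entirely inside it. A minimal pure-Python sketch of that semantics follows. It uses plain `(begin, end)` tuples and a quadratic nested loop for clarity; the names `contains` and `contain_join_sketch` are hypothetical, and this is not how `tp.contain_join` is implemented.

```python
def contains(outer, inner):
    """True if half-open span `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def contain_join_sketch(outer_spans, inner_spans):
    """Return every (outer, inner) pair where outer contains inner."""
    return [(o, i) for o in outer_spans for i in inner_spans if contains(o, i)]


# Toy token ranges, made up for illustration
gold = [(0, 3), (5, 7), (9, 10)]
extracted = [(0, 2), (2, 3), (5, 7), (12, 14)]

pairs = contain_join_sketch(gold, extracted)
print(pairs)

# One gold span can contain several extracted spans, so count distinct
# gold spans to find how many gold entities were at least partly matched.
print(len({g for g, _ in pairs}))
```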

In [None]:
# Note that there are some duplicates (rows 23 and 24, for example).
# Use the number of unique values in the "gold" column to compute
# how many partial or complete matches of an entity we found.
num_unique_matches = len(looser_intersection["gold"].unique())
num_unique_matches

In [None]:
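# Recompute precision, recall, and F1 score on this looser basis.
# Counting unique gold spans gives the number of gold entities that were at
# least partially found, so recall is exact under this looser definition.
# Precision is only approximate, because several extracted spans can fall
# inside the same gold span.
# Again, in a real use case we would be doing this operation on a holdout set of
# multiple documents. The point here is that the core computations all map
# easily into Pandas.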
num_true_positives = num_unique_matches
num_entities = len(person_gold.index)
num_entities_extracted = len(person_entities.index)

precision = num_true_positives / num_entities_extracted
recall = num_true_positives / num_entities
F1 = 2.0 * (precision * recall) / (precision + recall)

print(
"""Number of correct answers: {}
Number of entities identified: {}
Actual number of entities: {}
Precision: {:1.2f}
Recall: {:1.2f}
F1: {:1.2f}""".format(num_true_positives, num_entities_extracted, num_entities, precision, recall, F1))

In [None]:
# Let's drill down on those partial matches to see what's causing them
# (at least on this one document)
is_partial = looser_intersection["gold"].values != looser_intersection["extracted"].values
partial_matches = looser_intersection[is_partial].reset_index(drop=True)
partial_matches

In [None]:
# Hmm, there seems to be some clustering of the matches. Let's see how
# they map onto the sentences of the document.
extracted_sentence = tp.contain_join(sentences["sentence"], partial_matches["extracted"],
                                     first_name="sentence")
partial_matches["sentence"] = extracted_sentence["sentence"].values
partial_matches

In [None]:
# Looks like 1/3 of the partial matches on this document are clustered in a 
# single problem sentence. Let's take a closer look at that sentence.
sentence_span = partial_matches["sentence"].loc[0]
sentence = token_features[token_features["sentence"] == sentence_span]
sentence.head(10)

In [None]:
# Use SpaCy to render the dependency parse of the sentence
tp.render_parse_tree(sentence)

In [None]:
# That's a lot of parse tree! Let's cut that down to the tokens
# that cover entities from the gold standard data.
entity_tokens = tp.contain_join(person_gold["token_span"], sentence["token_span"],
                                "entity", "token_span")
entity_tokens.head(10)

In [None]:
# Extract out and display the part of the dependency parse that covers just those tokens
mask = token_features["token_span"].isin(entity_tokens["token_span"])
partial_parse = token_features[mask]
tp.render_parse_tree(partial_parse)

In [None]:
# With the filtered parse tree, two things pop out:
# 1. The dependency parser model finds information about proper noun phrases
#    that the NER model does not catch.
# 2. The phrase "Sir Not-Appearing-in-this-Film" causes the dependency parser 
#    model to go off the rails.
#
# Let's see if we can combine the results of the two models to get more accurate
# spans.
# First, let's use some Gremlin to extract out the compound proper nouns from
# the parse tree. We'll do this at the document level.
g = tp.token_features_to_traversal(token_features)
compound_proper_nouns = (
    g.V()  # Start with all vertices.
    .has("tag", "NNP")  # Filter out those not tagged NNP (proper noun).
    .has("dep", "compound").as_("src")  # Filter out those without a dependency link of type "compound".
    .out()  # Follow the outgoing link to the parent node.
    .has("tag", "NNP").as_("dest")  # Filter paths where the parent node is not a proper noun.
    .select("src", "dest").by("token_span")  # Return the token spans of each (child, parent) pair.
).toDataFrame()
# Add a third column with the combined span
compound_proper_nouns["phrase"] = compound_proper_nouns["src"] + compound_proper_nouns["dest"]
compound_proper_nouns.head(10)
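
For readers less familiar with Gremlin, the traversal above reads roughly like a self-join of the token table: filter the child tokens, follow the head link, then filter the parents. The sketch below expresses that idea in plain Pandas on a hand-written toy parse. The column names `id` and `head`, and the parse itself, are assumptions for illustration and are not SpaCy output.

```python
import pandas as pd

# Hand-written toy parse: "head" holds the id of each token's parent.
tokens = pd.DataFrame({
    "id":   [0, 1, 2, 3],
    "text": ["Sir", "Lancelot", "rides", "home"],
    "tag":  ["NNP", "NNP", "VBZ", "NN"],
    "dep":  ["compound", "nsubj", "ROOT", "advmod"],
    "head": [1, 2, 2, 2],
})

# Child side: proper nouns attached to their parent by a "compound" link.
src = tokens[(tokens["tag"] == "NNP") & (tokens["dep"] == "compound")]

# Follow the link to the parent by joining child.head against parent.id.
pairs = src.merge(tokens, left_on="head", right_on="id",
                  suffixes=("_src", "_dest"))

# Keep only the pairs whose parent is also a proper noun.
pairs = pairs[pairs["tag_dest"] == "NNP"]
print(pairs[["text_src", "text_dest"]])
```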

In [None]:
# Let's find the cases where a compound proper noun from the dependency parser
# overlaps (but does not exactly match) with a person entity from the
# named entity recognizer.
overlap = tp.overlap_join(compound_proper_nouns["phrase"], person_entities["token_span"],
                          first_name="compound_phrase", second_name="person")
strict_overlap = overlap[~overlap["compound_phrase"].isin(person_entities["token_span"])].reset_index(drop=True)
strict_overlap

In [None]:
# Use these pairs of spans to build up expanded person spans
strict_overlap["expanded_person"] = strict_overlap["compound_phrase"] + strict_overlap["person"]
strict_overlap
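
The last two cells rely on two span operations: an overlap test and span addition. As we understand the library, adding two spans yields the smallest span that covers both. The sketch below shows both operations on plain half-open `(begin, end)` tuples; the function names `overlaps` and `cover` and the offsets are hypothetical.

```python
def overlaps(a, b):
    """True if half-open spans a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def cover(a, b):
    """Smallest span that covers both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


# Made-up token ranges: a compound phrase and a person entity that
# share one token.
compound_phrase = (4, 6)
person = (5, 7)

if overlaps(compound_phrase, person):
    expanded_person = cover(compound_phrase, person)
    print(expanded_person)
```

Spans that merely touch, such as `(4, 6)` and `(6, 7)`, do not overlap under this half-open convention.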

In [None]:
# If we just added these expanded spans back to our original set of 
# entities, we would get overlapping results. Find and filter out the 
# results from the original entities that overlap with our expanded
# person entities.
to_filter = tp.overlap_join(strict_overlap["expanded_person"], person_entities["token_span"],
                            first_name="expanded_person", second_name="token_span")
to_filter

In [None]:
# Remove the contents of to_filter and add the contents of strict_overlap to
# our original set of persons
filtered = person_entities["token_span"][~person_entities["token_span"].isin(to_filter["token_span"])]
person_entities_2 = pd.DataFrame({"token_span": 
                                  pd.concat([filtered, strict_overlap["expanded_person"]])
                                    .sort_values()
                                    .reset_index(drop=True)})
person_entities_2

In [None]:
# Let's see what this correction does to the exact-match precision and recall
person_intersection_2 = person_gold.merge(person_entities_2)
num_true_positives = len(person_intersection_2.index)
num_entities = len(person_gold.index)
num_entities_extracted = len(person_entities_2.index)

precision = num_true_positives / num_entities_extracted
recall = num_true_positives / num_entities
F1 = 2.0 * (precision * recall) / (precision + recall)

print(
"""Number of correct answers: {}
Number of entities identified: {}
Actual number of entities: {}
Precision: {:1.2f}
Recall: {:1.2f}
F1: {:1.2f}""".format(num_true_positives, num_entities_extracted, num_entities, precision, recall, F1))

Here we've just shown that you can quickly combine the results of multiple
models using Pandas and Gremlin.

It's important to note that the improvement in precision may or may not
generalize to the other documents of the corpus. In a real use case, we would
need to validate this approach against a development set of documents. If this
simple hybrid approach works well there, an appropriate next step would be
to retrain the NER model using the dependency parser's "compound" tags as
an additional feature.
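
One common way to extend the single-document numbers to a development set is micro-averaging: sum the per-document counts first, then compute precision, recall, and F1 once over the totals. The sketch below assumes we have already collected one `(true_positives, extracted, gold)` triple per document. The function name `micro_average` is hypothetical and the counts are made up for illustration.

```python
def micro_average(per_document_counts):
    """Compute micro-averaged (precision, recall, F1).

    per_document_counts: iterable of (true_positives, extracted, gold)
    triples, one per document.
    """
    counts = list(per_document_counts)
    tp = sum(c[0] for c in counts)
    extracted = sum(c[1] for c in counts)
    gold = sum(c[2] for c in counts)
    precision = tp / extracted if extracted else 0.0
    recall = tp / gold if gold else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for three documents
dev_counts = [(6, 8, 12), (3, 4, 4), (0, 2, 5)]
dev_precision, dev_recall, dev_f1 = micro_average(dev_counts)
print("Precision: {:1.2f}  Recall: {:1.2f}  F1: {:1.2f}".format(
    dev_precision, dev_recall, dev_f1))
```

Micro-averaging weights every entity equally, so long documents dominate the result. Averaging the per-document scores instead (macro-averaging) would weight every document equally.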

In [None]:
# Now precision is looking pretty good, but recall is kind of low.
# Let's examine the missing results.
missing_results_mask = ~(person_gold["token_span"].isin(looser_intersection["gold"]))
missing_results = person_gold[missing_results_mask]
missing_results