In [1]:
# Import Python libraries
from typing import *
import os
import ibm_watson
import ibm_watson.natural_language_understanding_v1 as nlu
import ibm_cloud_sdk_core
import pandas as pd
import spacy
import sys

# And of course we need the text_extensions_for_pandas library itself.
_PROJECT_ROOT = "../.."
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("market"):
        raise e
    if _PROJECT_ROOT not in sys.path:
        sys.path.insert(0, _PROJECT_ROOT)
    import text_extensions_for_pandas as tp
    
# Download the SpaCy model if necessary
try:
    spacy.load("en_core_web_trf")
except IOError:
    raise IOError("SpaCy dependency parser not found. Please run "
                  "'python -m spacy download en_core_web_trf', then "
                  "restart your Jupyter kernel.")


if "IBM_API_KEY" not in os.environ:
    raise ValueError("IBM_API_KEY environment variable not set. Please create "
                     "a free instance of IBM Watson Natural Language Understanding "
                     "(see https://www.ibm.com/cloud/watson-natural-language-understanding) "
                     "and set the IBM_API_KEY environment variable to your instance's "
                     "API key value.")

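api_key = os.environ.get("IBM_API_KEY")
service_url = os.environ.get("IBM_SERVICE_URL")
if service_url is None:
    # Without this check, a missing URL only fails later, with a less helpful error.
    raise ValueError("IBM_SERVICE_URL environment variable not set. Please set it "
                     "to the service URL of your Watson Natural Language "
                     "Understanding instance.")
natural_language_understanding = ibm_watson.NaturalLanguageUnderstandingV1(
    version="2021-01-01",
    authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key)
)
natural_language_understanding.set_service_url(service_url)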

OSError: SpaCy dependency parser not found. Please run 'python -m spacy download en_core_web_trf', then restart your Jupyter kernel.

# Part 2: Use dependency parsing to extract executives' titles

*Dependency parsing* is a natural language processing technique that identifies the relationships between the words that make up a sentence. We can treat these relationships between a sentence's words as the edges of a graph. This graph is always a tree, so we refer to it as the *dependency-based parse tree* of the sentence. "Dependency-based parse tree" is an awkward phrase, so it's common to refer to this tree as a "dependency parse" or a "parse tree".
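
To make the tree structure concrete, here is a minimal sketch in plain Python. The parse below is written by hand for illustration; it is not actual SpaCy output, and the dependency labels are assumptions. Each token stores the index of its head, and the root points to itself, which is the convention SpaCy follows.

```python
# Hand-written parse of "I like natural language processing." (illustrative only).
tokens = ["I", "like", "natural", "language", "processing", "."]
heads = [1, 1, 4, 4, 1, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "amod", "compound", "dobj", "punct"]


def path_to_root(token_index):
    """Follow head links from a token until we reach the root."""
    path = [token_index]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path


# In a tree there is exactly one root, and every token reaches it.
roots = [i for i, h in enumerate(heads) if h == i]
all_reach_root = all(path_to_root(i)[-1] == roots[0] for i in range(len(tokens)))

print([tokens[i] for i in path_to_root(3)])  # ['language', 'processing', 'like']
```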

In this second part of the series, we'll use *dependency parsing* to break down the phrases from Part 1 into their component parts and extract the executives' titles.



*TODO: Diagram of an example dependency parse.*

The detailed information in the parse tree allows us to quickly create a very general solution to many extraction tasks without needing to create complex rules or train a machine learning model. In this post, we'll use dependency parsing to extract executives' titles from the phrases that our code from Part 1 produces.

We'll use the dependency parser from the open source NLP library SpaCy. Text Extensions for Pandas includes a utility function that turns the output of SpaCy language models into a DataFrame. Here's what we get when we run the SpaCy language model over our example document and convert the output to a DataFrame:

In [None]:
import spacy

spacy_language_model = spacy.load("en_core_web_trf")
all_token_features = tp.io.spacy.make_tokens_and_features("""
I like natural language processing.
""", spacy_language_model)


In [None]:
import spacy

doc_text = step_2_results["subject"].array.document_text

spacy_language_model = spacy.load("en_core_web_trf")
all_token_features = tp.io.spacy.make_tokens_and_features(doc_text, spacy_language_model)
all_token_features.head()

The SpaCy language model output contains many different features for each token position in the document.
We're only interested in the dependency parse, so let's project this language model output down to just the parts 
that are relevant to the parse.

In [None]:
parse_features = all_token_features[["id", "span", "tag", "dep", "head"]]
parse_features

In Part 1, we used Text Extensions for Pandas and Watson Natural Language Understanding to identify locations where IBM press releases quoted a person by name. We walked through this process in great detail, but at a high level, you can think of it as a two-step process:
1. Use IBM Watson Natural Language Understanding to extract semantic roles and person mentions from the press release.
2. Use Text Extensions for Pandas to convert those model outputs to Pandas DataFrames. Then cross-reference the data in those DataFrames to find the places where the press release quoted a person by name.

*TODO: Describe how we've shared the code from Part 1 in `market_intelligence.py`.*

In [None]:
import market_intelligence as mi

Let's quickly recap what the output of those two processing steps looks like. We'll use the same example document as in Part 1.

In [None]:
import textwrap
from IPython.display import HTML, display

example_doc_url = "https://newsroom.ibm.com/2021-01-04-IBM-Study-Majority-of-Surveyed-Companies-are-Not-Prepared-for-IT-Needs-of-the-Future-Say-U-S-and-U-K-Tech-Leaders"
example_doc_html = mi.download_article(example_doc_url)
display(HTML(textwrap.shorten(example_doc_html, 5000)))

The first processing step extracts named entities and semantic roles with IBM Watson Natural Language Understanding.

In [None]:
step_1_results = (
    mi.extract_named_entities_and_semantic_roles(example_doc_html, 
                                                 natural_language_understanding)
)
textwrap.shorten(str(step_1_results), 1000)

The second processing step uses Text Extensions for Pandas to convert these model outputs into DataFrames, then uses these DataFrames to identify persons that the document quotes by name:

In [None]:
step_2_results = mi.identify_persons_quoted_by_name(step_1_results)
step_2_results

As we noted at the end of Part 1, the phrase in the `subject` column of our DataFrame 
contains additional information about each executive's job position:

In [None]:
step_2_results.iloc[0]["subject"].covered_text

SpaCy's dependency parse is based on the [Universal Dependencies](https://universaldependencies.org/) framework. The parser gives each word, or *token*, in the document a part of speech (`tag` in the DataFrame above), a link to its *head* token (`head`), and a dependency type (`dep`).

The parser's output covers all 826 tokens in the document. Let's filter down to just the tokens that fall inside the phrases we've previously identified as describing persons who made statements. We can use Text Extensions for Pandas' `contain_join()` span operation to implement this filtering:

In [None]:
phrase_tokens = (
    tp.spanner.contain_join(step_2_results["subject"], 
                            parse_features["span"], 
                            "subject", "span")
    .merge(parse_features)
    .set_index("id", drop=False)
)
phrase_tokens
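
As a rough mental model of what this join computes, here is a plain-Python sketch (not the library's implementation). It treats each span as a `(begin, end)` pair of character offsets and keeps every pair in which the first span fully contains the second. The offsets and the helper name `contain_pairs` are made up for illustration.

```python
def contain_pairs(outer_spans, inner_spans):
    """Return (outer, inner) pairs where `outer` fully contains `inner`."""
    return [
        (outer, inner)
        for outer in outer_spans
        for inner in inner_spans
        if outer[0] <= inner[0] and inner[1] <= outer[1]
    ]


phrases = [(10, 30)]  # one phrase, as (begin, end) offsets
token_spans = [(5, 9), (10, 14), (15, 30), (28, 35)]

print(contain_pairs(phrases, token_spans))
# [((10, 30), (10, 14)), ((10, 30), (15, 30))]
```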

## Navigating the parse tree

This subtree of the document's parse tree describes the relationships between the words in our target phrase. We can visualize these relationships by rendering the subtree with SpaCy's rendering engine, DisplaCy:

In [None]:
tp.io.spacy.render_parse_tree(phrase_tokens)

We will start out with the parse tree nodes that comprise the person entity and traverse `appos` and `compound` links to build up likely titles.

To facilitate this traversal, let's convert the graph in `phrase_tokens` into DataFrames of nodes and edges.

In [None]:
nodes = phrase_tokens[["id", "span", "tag", "subject"]].reset_index(drop=True)
nodes

In [None]:
edges = phrase_tokens[["id", "head", "dep"]].reset_index(drop=True)
edges

We start with the graph nodes that are parts of target person names. The span operation `overlap_join()` lets us efficiently correlate the spans in the Watson model output with the spans in the SpaCy model output:

In [None]:
person_nodes = (
    tp.spanner.overlap_join(step_2_results["person"], nodes["span"],
                            "person", "span")
    .merge(nodes)
)
person_nodes
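
Overlap is a looser test than containment, which helps when the two models disagree about where a name begins or ends. The sketch below shows the predicate in plain Python, assuming half-open `(begin, end)` character offsets. The helper name `overlap_pairs` and the offsets are hypothetical, and this is not how the library implements the join.

```python
def overlap_pairs(left_spans, right_spans):
    """Return (left, right) pairs of half-open spans that share a character."""
    return [
        (left, right)
        for left in left_spans
        for right in right_spans
        if left[0] < right[1] and right[0] < left[1]
    ]


person_spans = [(12, 24)]
token_spans = [(0, 11), (12, 18), (19, 26), (27, 30)]

# The token (19, 26) extends past the end of the person span, so a
# containment test would miss it, but it still overlaps.
print(overlap_pairs(person_spans, token_spans))
# [((12, 24), (12, 18)), ((12, 24), (19, 26))]
```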

Next, we define the set of edges that we will follow to expand this set of nodes. For this application, we will
follow two types of dependency links:
* [appositional modifier (`appos`)](https://universaldependencies.org/docs/en/dep/appos.html) links that connect
  names to their associated titles; and
* [compound (`compound`)](https://universaldependencies.org/docs/en/dep/compound.html) links that connect the components of
  these titles to each other.

In [None]:
filtered_edges = edges[edges["dep"].isin(["appos", "compound"])]
filtered_edges

Now we have a set of starting nodes and a set of edges to traverse, so we can perform a [transitive closure](https://en.wikipedia.org/wiki/Transitive_closure) operation:
expand our set of nodes by traversing links of the graph, and keep doing so as long as there are additional nodes
to be found.
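
Before expressing this with DataFrames, it may help to see the same idea on plain Python sets. This is a minimal sketch with hypothetical node ids and a made-up helper name, `reachable_from`; edges are `(head, child)` pairs and we only follow them from head to child.

```python
def reachable_from(start_nodes, child_edges):
    """Return every node reachable from `start_nodes` via head -> child edges."""
    selected = set(start_nodes)
    frontier = set(start_nodes)
    while frontier:
        # Children of the current frontier that we have not seen yet.
        frontier = {
            child
            for head, child in child_edges
            if head in frontier and child not in selected
        }
        selected |= frontier
    return selected


# Hypothetical (head, child) pairs: 0 -> 2 -> 1, plus an unrelated edge 7 -> 8.
toy_edges = [(0, 2), (2, 1), (7, 8)]
print(sorted(reachable_from({0}, toy_edges)))  # [0, 1, 2]
```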

We can use the Pandas `merge` function to implement a single step of traversing links:

In [None]:
# Find all nodes that are on the other end of an edge from a node in `person_nodes`.
selected_nodes = person_nodes.drop(columns="person").copy()

addl_nodes = (
    selected_nodes[["id"]]
    .merge(filtered_edges, left_on="id", right_on="head", suffixes=["_head", ""])[["id"]]
    .merge(nodes)
)
addl_nodes
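
The `suffixes` argument in the merge above is easy to overlook. Both inputs have an `id` column, so the merge renames the left-hand one to `id_head` and leaves the right-hand one as `id`, which means the `id` we keep belongs to the child token. The toy example below, with hypothetical token ids, sketches that behavior in isolation.

```python
import pandas as pd

# Toy data: token 5 is part of a person name; token 7 hangs off it.
selected = pd.DataFrame({"id": [5]})
toy_edges = pd.DataFrame({
    "id": [6, 7, 9],    # child token
    "head": [7, 5, 8],  # parent token
    "dep": ["compound", "appos", "compound"],
})

joined = selected.merge(toy_edges, left_on="id", right_on="head",
                        suffixes=("_head", ""))
print(list(joined.columns))   # ['id_head', 'id', 'head', 'dep']
print(joined["id"].tolist())  # [7]
```

Token 6 is a child of token 7, so it only shows up when the step is repeated with token 7 in the selected set.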

If we add these additional nodes to our set of nodes and repeat the traversal step until
the set stops growing, we have a transitive closure operation:

In [None]:
selected_nodes = person_nodes.drop(columns="person").copy()
previous_num_nodes = 0

# Keep going as long as the previous round added nodes to our set.
while len(selected_nodes.index) > previous_num_nodes:
    previous_num_nodes = len(selected_nodes.index)
    
    # Traverse one edge out from all nodes in `selected_nodes`
    addl_nodes = (
        selected_nodes[["id"]]
        .merge(filtered_edges, left_on="id", right_on="head", suffixes=["_head", ""])[["id"]]
        .merge(nodes)
    )
    
    # Add any previously unselected node to `selected_nodes`
    selected_nodes = pd.concat([selected_nodes, addl_nodes]).drop_duplicates()

selected_nodes

Now we have the set of all nodes that are reachable from one of our selected person names by traversing
`appos` and `compound` links in the dependency parse. If we filter out the nodes we started with, we
should get the nodes for the tokens that comprise the title:

In [None]:
title_nodes = selected_nodes[~selected_nodes["id"].isin(person_nodes["id"])]
title_nodes

## Tying it all together

Now we just need to turn these sets of nodes into spans. We can use Pandas' grouping
and aggregation to do so, taking advantage of the fact that the "addition" operation 
for spans is defined as:
```
span1 + span2 = smallest span that contains both span1 and span2
```
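
To see why summing the token spans of a title yields one span that covers the whole title, consider this small stand-in class. `ToySpan` is hypothetical and only mimics the addition rule stated above; the real span type in Text Extensions for Pandas carries more information, such as the target text.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical stand-in for a span: character offsets (begin, end)."""
    begin: int
    end: int

    def __add__(self, other):
        # Smallest span that contains both operands.
        return ToySpan(min(self.begin, other.begin), max(self.end, other.end))


# Three adjacent tokens of a made-up title, combined into one covering span.
title_tokens = [ToySpan(40, 45), ToySpan(46, 55), ToySpan(56, 63)]
combined = reduce(lambda a, b: a + b, title_tokens)
print(combined)  # ToySpan(begin=40, end=63)
```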


In [None]:
titles_df = (
    title_nodes
    .groupby("subject")
    .aggregate({"span": "sum"})
    .reset_index()
    .rename(columns={"span": "title"})
)
# As of Pandas 1.5.1, groupby over extension types downgrades them to object dtype.
# Cast back up to the extension type.
titles_df["subject"] = titles_df["subject"].astype(tp.SpanDtype())

titles_df

Finally, we can join back with our DataFrame of person/company information:

In [None]:
step_2_results

In [None]:
execs_with_titles_df = pd.merge(step_2_results, titles_df)
execs_with_titles_df

Now we have Python code that goes all the way from an HTML document to a DataFrame of names and titles of executives. We're ready to do some data mining! In the next part of this series, we'll apply the NLP code we've developed so far to many IBM press releases at once and extract the names and titles of many different executives.