In [1]:
# Import Python libraries
from typing import *
import os
import ibm_watson
import ibm_watson.natural_language_understanding_v1 as nlu
import ibm_cloud_sdk_core
import pandas as pd
import sys

# And of course we need the text_extensions_for_pandas library itself.
_PROJECT_ROOT = "../.."
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("market"):
        raise e
    if _PROJECT_ROOT not in sys.path:
        sys.path.insert(0, _PROJECT_ROOT)
    import text_extensions_for_pandas as tp


if "IBM_API_KEY" not in os.environ:
    raise ValueError("IBM_API_KEY environment variable not set. Please create "
                     "a free instance of IBM Watson Natural Language Understanding "
                     "(see https://www.ibm.com/cloud/watson-natural-language-understanding) "
                     "and set the IBM_API_KEY environment variable to your instance's "
                     "API key value.")
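api_key = os.environ.get("IBM_API_KEY")
service_url = os.environ.get("IBM_SERVICE_URL")
if service_url is None:
    # Fail early with a clear message instead of passing None to the client.
    raise ValueError("IBM_SERVICE_URL environment variable not set. Please set "
                     "it to the service URL of your Watson Natural Language "
                     "Understanding instance.")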
natural_language_understanding = ibm_watson.NaturalLanguageUnderstandingV1(
    version="2021-01-01",
    authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key)
)
natural_language_understanding.set_service_url(service_url)

# Screenshots of this notebook should be this wide: ----------------------------->

# Market Intelligence with Pandas and IBM Watson

One of the most common applications of natural language processing in the enterprise is *market intelligence*. Market intelligence involves finding useful facts about customers and competitors in news articles.
In this article, we'll show how to perform an example market intelligence task using 
[Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding) and our open source library [Text Extensions for Pandas](https://ibm.biz/text-extensions-for-pandas).

Our target corpus for this task will be a collection of IBM press releases from [the "announcements" section of IBM.com](https://newsroom.ibm.com/announcements). We'll use these documents to find information about the **names and titles of executives** at IBM and its business partners. 

Information about a company's leadership has many uses. You could use it to identify potential points of contact for sales or partnership discussions. Or you could estimate how much of a company's executive team is focused on different strategic areas. Some organizations even use it for recruiting purposes.

Press releases are a good place to find the names and titles of executives, because these articles often feature quotes from company leaders. Here's an example quote from an IBM press release from January 2021:

> "Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations," said Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs. 

In addition to what Valinda Kennedy is quoted as saying here, this snippet contains information about Ms. Kennedy's position within IBM. She is an IBM executive, and her title is *HBCU Program Lead, IBM Global University Programs*.

This snippet is an example of the general pattern that our code will be looking for:
* The article quotes a person by name.
* The article associates the name with a title.

The key challenge that we need to address is the many different ways that the pattern can manifest itself in natural language text. Here are some examples of variations of the quote above that we would like to capture:

<!--* **Original:** *"Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations," said Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs.*-->
* **Present tense:** *"Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations," **says** Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs.*
* **Attribution before quote:** *&nbsp;**Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs**, said, "Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations."*
<!--* **Attribution in middle of quote:** *"Equal access to skills and jobs", **said Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs,** "is the key to unlocking economic opportunity and prosperity for diverse populations."*-->
* **Different form for title:** *"Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations," said Valinda Scarbro Kennedy, **Leader of the HBCU program at IBM Global University Programs.**&nbsp;*

To overcome this challenge, we'll use general-purpose *semantic* models that extract high-level meaning from formal English text. Basically, these models analyze natural language text and identify facts in the text. The text could express a given fact in many different ways, but all of those different *syntactic* forms will map to the same output of our semantic model.
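
To see why purely surface-level rules struggle with these variations, here is a minimal sketch of a regular expression written against the original sentence. The pattern, the `NAIVE_PATTERN` name, and the shortened quote text are illustrative assumptions; none of them are part of the pipeline in this article.

```python
import re

# Hypothetical surface rule: closing quote, the word "said", then "Name, Title."
NAIVE_PATTERN = re.compile(
    r'," said (?P<name>[A-Z]\w*(?: [A-Z]\w*)+), (?P<title>[^"]+)\.$'
)

# Variations of the example quote (quote text shortened for readability).
variants = {
    "original": (
        '"Equal access to skills and jobs is the key," said Valinda Scarbro '
        "Kennedy, HBCU Program Lead, IBM Global University Programs."
    ),
    "present_tense": (
        '"Equal access to skills and jobs is the key," says Valinda Scarbro '
        "Kennedy, HBCU Program Lead, IBM Global University Programs."
    ),
    "attribution_first": (
        "Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University "
        'Programs, said, "Equal access to skills and jobs is the key."'
    ),
}

# Record which variations the rule recognizes.
matched = {
    label: NAIVE_PATTERN.search(text) is not None
    for label, text in variants.items()
}
for label, hit in matched.items():
    print(f"{label}: {'match' if hit else 'no match'}")
```

Only the original phrasing matches. Each additional variation would need its own rule, which is the maintenance burden a semantic model avoids.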

Semantic models can save a lot of work. There's no need to label separate training data or write separate rules for all of the variations of our target pattern. A small amount of business logic can capture all these variations at once.

The main disadvantage of semantic models is that they are computationally expensive to run. We'll use the power of Pandas and the IBM Cloud to manage this issue.

Our end-to-end goal is to create a program that takes as input a stack of press releases and produces as output a Pandas DataFrame full of executive names and titles. We'll break down the process of building this program into four parts:
* **Part 1:** Use IBM Watson Natural Language Understanding's semantic role labeling model to identify people quoted by name.
* **Part 2:** Use spaCy's dependency parser model to identify the titles of people quoted by name.
* **Part 3:** Use Text Extensions for Pandas to make the code from Parts 1 and 2 faster.
* **Part 4:** Use [Ray](https://ray.io) and the IBM Cloud to make the code run even faster.

Let's get started with Part 1!


# Part 1: Use IBM Watson to identify people quoted by name.

Watson Natural Language Understanding integrates multiple state-of-the-art models for analyzing natural language text. Among these models is the `semantic_roles` model, which, as the name suggests, performs [Semantic Role Labeling](https://en.wikipedia.org/wiki/Semantic_role_labeling). You can think of Semantic Role Labeling as the task of identifying *subject-verb-object* triples:
* the *actions* that occurred in the text (the verb),
* *who* performed each action (the subject), and 
* *on whom* the action was performed (the object).

If we take our example snippet:
> "Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations," said Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs. 

and feed it through the `semantic_roles` model, we get the following raw output:



In [2]:
response = natural_language_understanding.analyze(
    text='''"Equal access to skills and jobs is the key to unlocking economic \
opportunity and prosperity for diverse populations," said Valinda Scarbro \
Kennedy, HBCU Program Lead, IBM Global University Programs.''',
    return_analyzed_text=True,
    features=nlu.Features(
        semantic_roles=nlu.SemanticRolesOptions()
    )).get_result()
response

{'usage': {'text_units': 1, 'text_characters': 199, 'features': 1},
 'semantic_roles': [{'subject': {'text': 'Equal access to skills and jobs'},
   'sentence': '"Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations," said Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs.',
   'object': {'text': 'the key to unlocking economic opportunity and prosperity for diverse populations'},
   'action': {'verb': {'text': 'be', 'tense': 'present'},
    'text': 'is',
    'normalized': 'be'}},
  {'subject': {'text': 'Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs'},
   'sentence': '"Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations," said Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs.',
   'object': {'text': 'Equal access to skills and jobs is the key to unlocking economic opportunity and pros

That format is a bit hard to read. Let's use our open-source library, Text Extensions for Pandas, to convert to a DataFrame:

In [3]:
dfs = tp.io.watson.nlu.parse_response(response)
dfs["semantic_roles"]

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
0,Equal access to skills and jobs,"""Equal access to skills and jobs is the key to...",the key to unlocking economic opportunity and ...,be,present,is,be
1,"Valinda Scarbro Kennedy, HBCU Program Lead, IB...","""Equal access to skills and jobs is the key to...",Equal access to skills and jobs is the key to ...,say,past,said,say
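
The dotted column names in this DataFrame mirror the nesting of the raw JSON. As a rough illustration of that relationship, the sketch below flattens one hand-written record into dot-separated keys. The `flatten` helper is hypothetical and is not how `parse_response` is implemented; the long strings in the record are abbreviated.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key prefix.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


# Hand-written record shaped like one entry of response["semantic_roles"].
record = {
    "subject": {"text": "Valinda Scarbro Kennedy, HBCU Program Lead, ..."},
    "sentence": '"Equal access to skills and jobs is the key to ..."',
    "object": {"text": "Equal access to skills and jobs is the key to ..."},
    "action": {
        "verb": {"text": "say", "tense": "past"},
        "text": "said",
        "normalized": "say",
    },
}

row = flatten(record)
for column, value in row.items():
    print(f"{column}: {value}")
```

The keys of `row` line up with the column headers shown above, from `subject.text` through `action.normalized`.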


Now we can see that the `semantic_roles` model has identified two subject-verb-object triples. Each row of this DataFrame contains one such triple. In the first row, the verb is "to be", and in the second row, the verb is "to say".

This second row is where things get interesting for us, because the verb "to say" indicates that *someone made a statement*. And that's exactly the high-level pattern we're looking for. Let's filter the DataFrame down to that row and look at it more closely.

In [4]:
dfs["semantic_roles"][dfs["semantic_roles"]["action.normalized"] == "say"]

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
1,"Valinda Scarbro Kennedy, HBCU Program Lead, IB...","""Equal access to skills and jobs is the key to...",Equal access to skills and jobs is the key to ...,say,past,said,say


The subject in this subject-verb-object triple is "Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs", and the object is the quote from Ms. Kennedy. This model's output has captured the general action of "\[person\] says \[quotation\]", and it has done so in a way that is invariant under changes to the structure of the sentence. 

If we move the attribution to the middle of the quote:
> *"Equal access to skills and jobs", **said Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs,** "is the key to unlocking economic opportunity and prosperity for diverse populations."*

...we get the same result:

In [5]:
response = natural_language_understanding.analyze(
    text='''"Equal access to skills and jobs is the key to unlocking economic \
opportunity and prosperity for diverse populations," said Valinda Scarbro \
Kennedy, HBCU Program Lead, IBM Global University Programs.''',
    # TODO: Truncate this cell after this point for display purposes
    return_analyzed_text=True,
    features=nlu.Features(
        semantic_roles=nlu.SemanticRolesOptions()
    )).get_result()
dfs = tp.io.watson.nlu.parse_response(response)
dfs["semantic_roles"][dfs["semantic_roles"]["action.normalized"] == "say"]

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
1,"Valinda Scarbro Kennedy, HBCU Program Lead, IB...","""Equal access to skills and jobs is the key to...",Equal access to skills and jobs is the key to ...,say,past,said,say


If we change the past-tense verb "said" to the present-tense "says":

> *"Equal access to skills and jobs is the key to unlocking economic opportunity and prosperity for diverse populations," **says** Valinda Scarbro Kennedy, HBCU Program Lead, IBM Global University Programs.*

...we get the same result again:

In [6]:
response = natural_language_understanding.analyze(
    text='''"Equal access to skills and jobs is the key to unlocking economic \
opportunity and prosperity for diverse populations," said Valinda Scarbro \
Kennedy, HBCU Program Lead, IBM Global University Programs.''',
    # TODO: Truncate this cell after this point for display purposes
    return_analyzed_text=True,
    features=nlu.Features(
        semantic_roles=nlu.SemanticRolesOptions()
    )).get_result()
dfs = tp.io.watson.nlu.parse_response(response)
dfs["semantic_roles"][dfs["semantic_roles"]["action.normalized"] == "say"]

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
1,"Valinda Scarbro Kennedy, HBCU Program Lead, IB...","""Equal access to skills and jobs is the key to...",Equal access to skills and jobs is the key to ...,say,past,said,say


All the different variations that we talked about earlier will produce the same result. This model lets us capture them all with very little code. All we need to do is to run the model and filter the outputs down to the verb we're looking for.
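
The "run the model, then filter by verb" step can also be sketched directly against the raw response dictionary, without a DataFrame. The helper name `statements_by_verb` and the abbreviated sample response are assumptions made for illustration; in practice the response would come from `analyze(...).get_result()`.

```python
from typing import Any, Dict, List, Tuple


def statements_by_verb(
    response: Dict[str, Any], verb: str = "say"
) -> List[Tuple[str, str]]:
    """Return (subject, object) text pairs whose normalized action is `verb`."""
    pairs = []
    for role in response.get("semantic_roles", []):
        if role.get("action", {}).get("normalized") == verb:
            subject = role.get("subject", {}).get("text", "")
            obj = role.get("object", {}).get("text", "")
            pairs.append((subject, obj))
    return pairs


# Abbreviated, hand-written stand-in for the service's response.
sample_response = {
    "semantic_roles": [
        {
            "subject": {"text": "Equal access to skills and jobs"},
            "object": {"text": "the key to unlocking economic opportunity ..."},
            "action": {"text": "is", "normalized": "be"},
        },
        {
            "subject": {"text": "Valinda Scarbro Kennedy, HBCU Program Lead, ..."},
            "object": {"text": "Equal access to skills and jobs is the key ..."},
            "action": {"text": "said", "normalized": "say"},
        },
    ]
}

quotes = statements_by_verb(sample_response)
for subject, statement in quotes:
    print(f"{subject!r} said {statement!r}")
```

Filtering on the normalized verb rather than the surface form is what lets "said" and "says" land in the same bucket.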

So far we've been looking at one paragraph. Let's rerun the same process on a full document. We'll use a single example press release about [an IBM study on the future of IT](https://newsroom.ibm.com/2021-01-04-IBM-Study-Majority-of-Surveyed-Companies-are-Not-Prepared-for-IT-Needs-of-the-Future-Say-U-S-and-U-K-Tech-Leaders) as a running example.

As before, we can run the document through Watson Natural Language Understanding's Python interface, specifying that the service should run its `semantic_roles` model. Then we use Text Extensions for Pandas to convert the model results to a DataFrame:

In [7]:
DOC_URL = "https://newsroom.ibm.com/2021-01-04-IBM-Study-Majority-of-Surveyed-Companies-are-Not-Prepared-for-IT-Needs-of-the-Future-Say-U-S-and-U-K-Tech-Leaders"

# Make the request
response = natural_language_understanding.analyze(
    url=DOC_URL,  # NLU will fetch the URL for us.
    return_analyzed_text=True,
    features=nlu.Features(
        semantic_roles=nlu.SemanticRolesOptions()
    )).get_result()

# Convert the output of the `semantic_roles` model to a DataFrame
semantic_roles_df = tp.io.watson.nlu.parse_response(response)["semantic_roles"]
semantic_roles_df.head(3)

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
0,Nearly a quarter of CIOs and CTOs surveyed,- Nearly a quarter of CIOs and CTOs surveyed s...,they are just starting their IT modernization ...,say,present,say,say
1,they,- Nearly a quarter of CIOs and CTOs surveyed s...,,be,present,are,be
2,they,- Nearly a quarter of CIOs and CTOs surveyed s...,yet to begin modernizing,have,present,have,have


If we filter down to the subject-verb-object triples for the verb "to say", we can see that this document has quite a few examples of the "person says statement" pattern:

In [8]:
quotes_df = semantic_roles_df[semantic_roles_df["action.normalized"] == "say"]
quotes_df.loc[16:29]

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
16,Nearly a quarter of CIOs and CTOs (24%) surveyed,Nearly a quarter of CIOs and CTOs (24%) surve...,their company is just starting its IT moderniz...,say,present,say,say
20,more than 95% of IT leaders surveyed,"As a result, more than 95% of IT leaders surv...","they are looking to adopt public, hybrid or pr...",say,past,said,say
28,"Archana Vemulapalli, General Manager, IBM Infr...","""Our clients are looking to accelerate IT mod...",Our clients are looking to accelerate IT moder...,say,past,said,say


The DataFrame `quotes_df` contains all the instances of the "person says statement" pattern that the
Watson service's semantic role labeler has identified. For this use case, we want to filter this
set down to cases where the subject (the person making the statement) is mentioned by name.
Here's an example:

In [9]:
quotes_df.loc[[28]]

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
28,"Archana Vemulapalli, General Manager, IBM Infr...","""Our clients are looking to accelerate IT mod...",Our clients are looking to accelerate IT moder...,say,past,said,say


## Identifying person names

How can we find the matches where the subject contains a person's name? Fortunately for us, Watson Natural Language Understanding has a model for exactly that task. The `entities` model in this Watson service finds *named entity mentions* --- that is, places where the document mentions an entity like a person or company by the entity's name. This model will find all of the person names in our documents with high accuracy. To invoke the model, we use the same `analyze()` Python method as before. We tell the service to run its entities model and retrieve mentions. Then we convert the result to a DataFrame using Text Extensions for Pandas:

In [10]:
response = natural_language_understanding.analyze(
    url=DOC_URL,
    return_analyzed_text=True,
    features=nlu.Features(
        # Ask Watson to find mentions of named entities
        entities=nlu.EntitiesOptions(mentions=True),
        
        # Also divide the document into words. We'll use these in just a moment.
        syntax=nlu.SyntaxOptions(tokens=nlu.SyntaxOptionsTokens()),
    )).get_result()
entity_mentions_df = tp.io.watson.nlu.parse_response(response)["entity_mentions"]
entity_mentions_df.head()

Unnamed: 0,type,text,span,confidence
0,Organization,CTOs,"[31, 35): 'CTOs'",0.941306
1,Organization,CTOs,"[672, 676): 'CTOs'",0.904886
2,Organization,CTOs,"[993, 997): 'CTOs'",0.905264
3,Organization,CTOs,"[3032, 3036): 'CTOs'",0.921493
4,Organization,CTOs,"[3322, 3326): 'CTOs'",0.972899


The `entities` model's output contains mentions of many types of entity. For this application, we need
mentions of person names. Let's filter our DataFrame down to just those types of mentions:

In [11]:
person_mentions_df = entity_mentions_df[entity_mentions_df["type"] == "Person"]
person_mentions_df

Unnamed: 0,type,text,span,confidence
23,Person,Archana Vemulapalli,"[1775, 1794): 'Archana Vemulapalli'",0.702555
39,Person,Clay Helm,"[4331, 4340): 'Clay Helm'",0.370151


## Tying it all together

Now we have two pieces of information that we need to combine:
* Instances of the "person said statement" pattern from the `semantic_roles` model
* Mentions of person names from the `entities` model

We need to align the "subject" part of the semantic role labeler's output with the person mentions. We can use the span manipulation facilities of Text Extensions for Pandas to do this.

*Spans* are a common concept in natural language processing. A span represents a region of the document, usually as begin and end offsets and a reference to the document's text. Text Extensions for Pandas adds a special `SpanDtype` data type to Pandas DataFrames. With this data type, you can define a DataFrame with one or more columns of span data. For example, the column called "span" in the DataFrame above is of the `SpanDtype` data type. The first span in this column, `[1775, 1794): 'Archana Vemulapalli'`, shows that the name "Archana Vemulapalli" occupies the character range from offset 1775 (inclusive) to offset 1794 (exclusive) in the document.
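
For readers new to spans, the idea can be sketched with a small standalone class. This `Span` dataclass is a hypothetical stand-in written for illustration, not the library's `SpanDtype`, and the document text is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A half-open character range [begin, end) over a document's text."""

    text: str   # the full document text
    begin: int  # offset of the first character in the span
    end: int    # offset one past the last character in the span

    @property
    def covered_text(self) -> str:
        # Python slicing is also half-open, so the offsets map directly.
        return self.text[self.begin:self.end]

    def __str__(self) -> str:
        return f"[{self.begin}, {self.end}): {self.covered_text!r}"


doc = "said Archana Vemulapalli, General Manager"
name = Span(doc, 5, 24)
print(name)  # [5, 24): 'Archana Vemulapalli'
```

Because the range is half-open, the length of the covered text is simply `end - begin`.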

The output of the `semantic_roles` model doesn't contain location information. However, strings in the "subject" field are long enough that we can easily find their locations in the document.
We can use the [dictionary-matching facilities](https://text-extensions-for-pandas.readthedocs.io/en/latest/#module-text_extensions_for_pandas.spanner) of Text Extensions for Pandas to find where these strings come from:

In [12]:
# Create a dictionary from the strings in quotes_df["subject.text"].
tokenizer = tp.io.spacy.simple_tokenizer()
dictionary = tp.spanner.extract.create_dict(quotes_df["subject.text"], tokenizer)

# Match the dictionary against the document text.
doc_text = entity_mentions_df["span"].array.document_text
tokens = tp.io.spacy.make_tokens(doc_text, tokenizer)
matches_df = tp.spanner.extract_dict(tokens, dictionary, output_col_name="span")
matches_df["subject.text"] = matches_df["span"].array.covered_text  # Join key

# Merge the dictionary matches back with the original strings.
subjects_df = quotes_df[["subject.text"]].merge(matches_df)
subjects_df

Unnamed: 0,subject.text,span
0,Nearly a quarter of CIOs and CTOs surveyed,"[2, 44): 'Nearly a quarter of CIOs and CTOs su..."
1,nearly 80 percent of leaders surveyed,"[160, 197): 'nearly 80 percent of leaders surv..."
2,Many corporate IT leaders,"[339, 364): 'Many corporate IT leaders'"
3,Nearly a quarter of CIOs and CTOs (24%) surveyed,"[964, 1012): 'Nearly a quarter of CIOs and CTO..."
4,more than 95% of IT leaders surveyed,"[1205, 1241): 'more than 95% of IT leaders sur..."
5,"Archana Vemulapalli, General Manager, IBM Infr...","[1775, 1860): 'Archana Vemulapalli, General Ma..."
6,More than 60% of technology leaders surveyed,"[2317, 2361): 'More than 60% of technology lea..."
7,more than three in four surveyed,"[2891, 2923): 'more than three in four surveyed'"
8,the majority (60%) of CIOs and CTOs surveyed,"[3291, 3335): 'the majority (60%) of CIOs and ..."
9,approximately 56% of U.S. respondents,"[3498, 3535): 'approximately 56% of U.S. respo..."


Now we have a column of span data for the `semantic_roles` model's output, and we can align these spans with the spans of person mentions. Text Extensions for Pandas includes built-in span operations. One of these operations, `contain_join()`, takes two columns of span data and identifies all pairs of spans where the first span contains the second span. We can use this operation to find all the places where the span from the `semantic_roles` model contains a span from the output of the `entities` model:

In [13]:
execs_df = tp.spanner.contain_join(subjects_df["span"], 
                                   person_mentions_df["span"],
                                   "subject", "person")
execs_df

Unnamed: 0,subject,person
0,"[1775, 1860): 'Archana Vemulapalli, General Ma...","[1775, 1794): 'Archana Vemulapalli'"
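
The containment test behind this join is easy to state in plain Python. The sketch below is a naive nested-loop version over `(begin, end)` tuples, using a few of the offsets shown earlier. The `contain_pairs` helper is hypothetical and says nothing about how `contain_join()` is implemented internally.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # half-open [begin, end) character range


def contain_pairs(
    outer: List[Offsets], inner: List[Offsets]
) -> List[Tuple[Offsets, Offsets]]:
    """Return every (outer, inner) pair where the outer span contains the inner."""
    return [
        (o, i)
        for o in outer
        for i in inner
        if o[0] <= i[0] and i[1] <= o[1]
    ]


# A subset of the subject spans and the two person-mention spans from above.
subjects = [(2, 44), (1205, 1241), (1775, 1860)]
persons = [(1775, 1794), (4331, 4340)]

pairs = contain_pairs(subjects, persons)
print(pairs)  # [((1775, 1860), (1775, 1794))]
```

The mention of Clay Helm at `[4331, 4340)` drops out because no subject span contains it.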


To recap: With a few lines of Python code, we've identified places in the article 
where the article quoted a person by name. For each of those quotations, we've 
identified the person name (the `person` column in the DataFrame above) as well 
as the phrase containing that name (the `subject` column in the DataFrame above).

There is more useful information to be found in the phrases. For example, here's 
the full text of the `subject` span from the above DataFrame:

In [14]:
execs_df.iloc[0]["subject"].covered_text

'Archana Vemulapalli, General Manager, IBM Infrastructure Services - Offerings and CTO'

This phrase tells us that Archana Vemulapalli's title is "General Manager, IBM Infrastructure Services - Offerings and CTO."
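
As a rough illustration only, the title in this particular phrase can be peeled off by slicing with the span offsets from the DataFrame above. This naive approach assumes the name comes first and is followed by a comma, which will not hold in general; it is not the dependency-parsing method that Part 2 uses.

```python
# Covered text and offsets copied from the `subject` and `person` spans above.
subject_text = (
    "Archana Vemulapalli, General Manager, "
    "IBM Infrastructure Services - Offerings and CTO"
)
subject_begin = 1775
person_begin, person_end = 1775, 1794

# Convert document offsets to offsets relative to the subject phrase.
name = subject_text[person_begin - subject_begin:person_end - subject_begin]

# Everything after the name, minus the separating comma and space.
title = subject_text[person_end - subject_begin:].lstrip(", ")

print(name)
print(title)
```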

In [Part 2](./Market_Intelligence_Part2.ipynb), we'll show how to use *dependency parsing* (and Pandas!) to break down these phrases
into their component parts so that we can extract the titles of executives.

<!--, as well as 
filtering out instances where the article quotes someone who is not an executive.-->

In [15]:
# Old version of adding spans to subjects: Use exact string matching.

# Retrieve the full document text from the entity mentions output.
doc_text = entity_mentions_df["span"].array.document_text

# Filter down to just the rows and columns we're interested in
subjects_df = quotes_df[["subject.text"]].copy().reset_index(drop=True)

# Use str.index() to find where the strings in "subject.text" begin
subjects_df["begin"] = pd.Series(
    [doc_text.index(s) for s in subjects_df["subject.text"]], dtype=int)

# Compute end offsets and wrap the <begin, end, text> triples in a SpanArray column
subjects_df["end"] = subjects_df["begin"] + subjects_df["subject.text"].str.len()
subjects_df["span"] = tp.SpanArray(doc_text, subjects_df["begin"], subjects_df["end"])
subjects_df

Unnamed: 0,subject.text,begin,end,span
0,Nearly a quarter of CIOs and CTOs surveyed,2,44,"[2, 44): 'Nearly a quarter of CIOs and CTOs su..."
1,nearly 80 percent of leaders surveyed,160,197,"[160, 197): 'nearly 80 percent of leaders surv..."
2,Many corporate IT leaders,339,364,"[339, 364): 'Many corporate IT leaders'"
3,Nearly a quarter of CIOs and CTOs (24%) surveyed,964,1012,"[964, 1012): 'Nearly a quarter of CIOs and CTO..."
4,more than 95% of IT leaders surveyed,1205,1241,"[1205, 1241): 'more than 95% of IT leaders sur..."
5,"Archana Vemulapalli, General Manager, IBM Infr...",1775,1860,"[1775, 1860): 'Archana Vemulapalli, General Ma..."
6,More than 60% of technology leaders surveyed,2317,2361,"[2317, 2361): 'More than 60% of technology lea..."
7,more than three in four surveyed,2891,2923,"[2891, 2923): 'more than three in four surveyed'"
8,the majority (60%) of CIOs and CTOs surveyed,3291,3335,"[3291, 3335): 'the majority (60%) of CIOs and ..."
9,approximately 56% of U.S. respondents,3498,3535,"[3498, 3535): 'approximately 56% of U.S. respo..."
