# Watson NLP Example for Text Extensions for Pandas

## Introduction

This demo shows how to use the `watson` module from Text Extensions for Pandas to
process a Watson NLU response from IBM Cloud into Pandas DataFrames for analysis.
Pandas is the de facto tool for data science and ...
https://github.com/CODAIT/text-extensions-for-pandas

The notebook is broken up into three parts:

**Part 1:** Shows how to authenticate with the IBM Watson SDK and make a request to the
Watson NLU API. The response is then processed by Text Extensions for Pandas to convert
the JSON response into several Pandas DataFrames.

**Part 2:** Goes deeper into the data received from Watson NLU and shows how to do
analytics with the DataFrames from Text Extensions for Pandas.

**Part 3:** Scores the Watson NLU entity recognition results by computing precision and
recall against gold standard labels.


## Authentication

This demo uses the IBM Watson Python SDK to authenticate with IBM Cloud using the
`IAMAuthenticator`. See https://github.com/watson-developer-cloud/python-sdk#iam for more
information. To authenticate, set the environment variable `IBM_API_KEY` to your API key
before making requests with `ibm_watson.NaturalLanguageUnderstandingV1`.

In [1]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if (sys.path[0] != ".."):
    sys.path[0] = ".."

import json
import os
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import Features, CategoriesOptions, ConceptsOptions, EmotionOptions, EntitiesOptions, KeywordsOptions, \
    MetadataOptions, RelationsOptions, SemanticRolesOptions, SentimentOptions, SyntaxOptions, SyntaxOptionsTokens
import pandas as pd
import text_extensions_for_pandas as tp

In [2]:
# Retrieve the APIKEY for authentication
apikey = os.environ.get("IBM_API_KEY")
if apikey is None:
    raise ValueError("Expected apikey in the environment variable 'IBM_API_KEY'")

# Set the service URL for your IBM Cloud instance
ibm_cloud_service_url = 'https://api.us-south.natural-language-understanding.watson.cloud.ibm.com/instances/21b9b875-4ddb-46ad-bb22-d78747622ca7'

In [3]:
# Initialize the authenticator for making requests
authenticator = IAMAuthenticator(apikey)
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2019-07-12',
    authenticator=authenticator
)

natural_language_understanding.set_service_url(ibm_cloud_service_url)

# Part 1: Turning the Watson NLU Response into Pandas DataFrames 

The response should be decoded JSON in the form of a Python dictionary. The following
features will be processed into DataFrames:

* entities
* keywords
* relations
* semantic_roles
* syntax with sentences and tokens

See https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#text-analytics-features
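
Many of the DataFrame columns in this notebook have dotted names such as `sentiment.label`. As a minimal sketch of where such names can come from, the snippet below flattens a nested JSON record into a single-level dictionary with dotted keys. The helper `flatten_record` and the sample record are illustrative assumptions, not the implementation used by Text Extensions for Pandas.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single dict with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical entity record, shaped like one element of response["entities"]
sample_entity = {
    "type": "Person",
    "text": "Arthur",
    "sentiment": {"label": "negative", "score": -0.312834},
    "relevance": 0.956097,
    "count": 12,
}

flat_entity = flatten_record(sample_entity)
print(sorted(flat_entity))
```

Each flattened key then maps naturally to one DataFrame column.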

In [4]:
# Make the request
response = natural_language_understanding.analyze(
    url="https://raw.githubusercontent.com/CODAIT/text-extensions-for-pandas/master/resources/holy_grail.txt",
    return_analyzed_text=True,
    features=Features(
        entities=EntitiesOptions(sentiment=True),
        keywords=KeywordsOptions(sentiment=True,emotion=True),
        relations=RelationsOptions(),
        semantic_roles=SemanticRolesOptions(),
        syntax=SyntaxOptions(sentences=True, tokens=SyntaxOptionsTokens(lemma=True, part_of_speech=True))  # Experimental
    )).get_result()

In [6]:
# View response as JSON
print(json.dumps(response, indent=2))

{
  "usage": {
    "text_units": 1,
    "text_characters": 5338,
    "features": 4
  },
  "syntax": {
    "tokens": [
      {
        "text": "Monty",
        "part_of_speech": "PROPN",
        "location": [
          0,
          5
        ]
      },
      {
        "text": "Python",
        "part_of_speech": "PROPN",
        "location": [
          6,
          12
        ],
        "lemma": "python"
      },
      {
        "text": "and",
        "part_of_speech": "CCONJ",
        "location": [
          13,
          16
        ],
        "lemma": "and"
      },
      {
        "text": "the",
        "part_of_speech": "DET",
        "location": [
          17,
          20
        ],
        "lemma": "the"
      },
      {
        "text": "Holy",
        "part_of_speech": "PROPN",
        "location": [
          21,
          25
        ]
      },
      {
        "text": "Grail",
        "part_of_speech": "PROPN",
        "location": [
          26,
          31
        ]
      },


In [7]:
# Get the response as processed Pandas DataFrames
dfs = tp.watson_nlu_parse_response(response)

In [8]:
# Created DataFrames from the response
dfs.keys()

dict_keys(['syntax', 'entities', 'keywords', 'relations', 'semantic_roles'])

### View the created DataFrames

In [9]:
dfs["entities"].head()

Unnamed: 0,type,text,sentiment.label,sentiment.score,relevance,count,confidence,disambiguation.subtype,disambiguation.name,disambiguation.dbpedia_resource
0,Person,Arthur,negative,-0.312834,0.956097,12,1.0,,,
1,Person,Lancelot,positive,0.835873,0.678523,5,1.0,,,
2,Person,Monty Python,neutral,0.0,0.644313,2,0.977538,,,
3,Person,King Arthur,neutral,0.0,0.561727,2,0.992188,,,
4,Person,Sir Galahad,positive,0.835873,0.540271,2,0.999984,,,


In [10]:
dfs["keywords"].head()

Unnamed: 0,text,sentiment.label,sentiment.score,relevance,emotion.sadness,emotion.joy,emotion.fear,emotion.disgust,emotion.anger,count
0,legend of King Arthur,neutral,0.0,0.746411,0.175057,0.691404,0.058051,0.031335,0.071927,1
1,Sir Lancelot,positive,0.835873,0.642571,0.046902,0.810654,0.01634,0.095661,0.021033,1
2,King Arthur,neutral,0.0,0.642235,0.09149,0.747356,0.043658,0.033299,0.112061,1
3,Holy Grail,positive,0.724846,0.624115,0.125927,0.696048,0.103502,0.153742,0.110257,5
4,British comedy film,neutral,0.0,0.619629,0.056536,0.657384,0.108932,0.048683,0.128826,1


In [11]:
dfs["relations"].head()

Unnamed: 0,type,sentence_span,score,arguments.0.span,arguments.1.span,arguments.0.entities.type,arguments.1.entities.type,arguments.0.entities.text,arguments.1.entities.text,arguments.0.entities.disambiguation.subtype,arguments.1.entities.disambiguation.subtype
0,timeOf,"[0, 273): 'Monty Python and the Holy Grail is ...",0.462615,"[37, 41): '1975'","[57, 61): 'film'",Date,TitleWork,1975,comedy,,
1,locatedAt,"[1489, 1639): 'Arthur leads the men to Camelot...",0.339446,"[1506, 1509): 'men'","[1513, 1520): 'Camelot'",Person,GeopoliticalEntity,men,Camelot,,
2,affectedBy,"[1640, 1756): 'As they turn away, God (an imag...",0.604304,"[1699, 1703): 'them'","[1689, 1695): 'speaks'",Person,EventCommunication,their,speaks,,
3,locatedAt,"[1758, 1935): 'Searching the land for clues to...",0.304596,"[1794, 1799): 'Grail'","[1802, 1810): 'location'",Organization,Location,Grail,location,,
4,employedBy,"[1758, 1935): 'Searching the land for clues to...",0.895035,"[1872, 1880): 'soldiers'","[1865, 1871): 'French'",Person,GeopoliticalEntity,soldiers,French,,[Country]


In [12]:
dfs["semantic_roles"].head()

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
0,Monty Python and the Holy Grail,Monty Python and the Holy Grail is a 1975 Brit...,a 1975 British comedy film concerning the Arth...,be,present,is,be
1,by the Monty Python comedy group of Graham Cha...,Monty Python and the Holy Grail is a 1975 Brit...,Monty Python and the Holy Grail,perform,past,written and performed,write and perform
2,It,It was conceived during the hiatus between th...,,conceive,past,was conceived,be conceive
3,a compilation of sketches,"While the group's first film, And Now for Som...",from the first two television series,be,past,was,be
4,Holy Grail,"While the group's first film, And Now for Som...",a new story that parodies the legend of King A...,be,present,is,be


In [13]:
dfs["syntax"].head()

Unnamed: 0,char_span,token_span,part_of_speech,lemma,sentence
0,"[0, 5): 'Monty'","[0, 5): 'Monty'",PROPN,,"[0, 273): 'Monty Python and the Holy Grail is ..."
1,"[6, 12): 'Python'","[6, 12): 'Python'",PROPN,python,"[0, 273): 'Monty Python and the Holy Grail is ..."
2,"[13, 16): 'and'","[13, 16): 'and'",CCONJ,and,"[0, 273): 'Monty Python and the Holy Grail is ..."
3,"[17, 20): 'the'","[17, 20): 'the'",DET,the,"[0, 273): 'Monty Python and the Holy Grail is ..."
4,"[21, 25): 'Holy'","[21, 25): 'Holy'",PROPN,,"[0, 273): 'Monty Python and the Holy Grail is ..."


# Part 2: Finding all pronouns in each sentence

Now we will analyze the Watson NLU syntax result, which contains part-of-speech
tags for each token, using the Pandas DataFrame produced in Part 1.

In [15]:
syntax = dfs["syntax"]

# Retrieve sentence information from the above dataframe
sentences = pd.DataFrame({"sentence": syntax["sentence"].unique()})
sentences.head()

Unnamed: 0,sentence
0,"[0, 273): 'Monty Python and the Holy Grail is ..."
1,"[274, 405): 'It was conceived during the hiatu..."
2,"[407, 642): 'While the group's first film, And..."
3,"[643, 720): 'Thirty years later, Idle used the..."
4,"[722, 823): 'Monty Python and the Holy Grail g..."


In [16]:
# Find all the pronouns in each sentence, *without* using Pandas.
# NON-scalable traversal of the syntax analysis data structure
# (runs in time proportional to the square of document length).

response_sentences = response["syntax"]["sentences"]
response_tokens = response["syntax"]["tokens"]

pronouns_by_sentence = {s["text"]: [] for s in response_sentences}

# Nested for loops. 
# Running time: O(num_tokens * num_sentences), i.e. O(document_size^2)
for t in response_tokens:
    pos_str = t["part_of_speech"]  # POS tag is already a string, e.g. "PRON"
    if pos_str == "PRON":
        found_sentence = False
        for s in response_sentences:
            if (t["location"][0] >= s["location"][0] 
                    and t["location"][1] <= s["location"][1]):
                found_sentence = True
                pronouns_by_sentence[s["text"]].append(t)
        if not found_sentence:
            raise ValueError(f"Token {t} is not in any sentence")
        
pronouns_by_sentence

{'Monty Python and the Holy Grail is a 1975 British comedy film concerning the Arthurian legend, written and performed by the Monty Python comedy group of Graham Chapman, John Cleese, Terry Gilliam, Eric Idle, Terry Jones and Michael Palin, and directed by Gilliam and Jones.': [],
 "It was conceived during the hiatus between the third and fourth series of their BBC television series Monty Python's Flying Circus.": [{'text': 'It',
   'part_of_speech': 'PRON',
   'location': [274, 276],
   'lemma': 'it'},
  {'text': 'their',
   'part_of_speech': 'PRON',
   'location': [348, 353],
   'lemma': 'their'}],
 "While the group's first film, And Now for Something Completely Different, was a compilation of sketches from the first two television series, Holy Grail is a new story that parodies the legend of King Arthur's quest for the Holy Grail.": [{'text': 'Something',
   'part_of_speech': 'PRON',
   'location': [449, 458],
   'lemma': 'something'},
  {'text': 'that',
   'part_of_speech': 'PRON',
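
The nested loops in the cell above compare every pronoun against every sentence. For comparison, here is a minimal plain-Python sketch that sorts the sentence start offsets once and uses binary search to find the sentence containing each token, for roughly O((tokens + sentences) * log(sentences)) work. It assumes that sentences do not overlap and that each `location` is a `[begin, end]` pair of character offsets, as in the response above. The sample data and the function name are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical miniature of response["syntax"]: two sentences, four tokens
demo_sentences = [
    {"text": "It was conceived.", "location": [0, 17]},
    {"text": "They liked it.", "location": [18, 32]},
]
demo_tokens = [
    {"text": "It", "part_of_speech": "PRON", "location": [0, 2]},
    {"text": "conceived", "part_of_speech": "VERB", "location": [7, 16]},
    {"text": "They", "part_of_speech": "PRON", "location": [18, 22]},
    {"text": "it", "part_of_speech": "PRON", "location": [29, 31]},
]


def group_tokens_by_sentence(tokens, sentences, pos="PRON"):
    """Map each sentence text to the texts of contained tokens with the given POS tag."""
    ordered = sorted(sentences, key=lambda s: s["location"][0])
    starts = [s["location"][0] for s in ordered]
    grouped = {s["text"]: [] for s in ordered}
    for t in tokens:
        if t["part_of_speech"] != pos:
            continue
        begin, end = t["location"]
        # Index of the last sentence that starts at or before the token
        i = bisect_right(starts, begin) - 1
        if i < 0 or end > ordered[i]["location"][1]:
            raise ValueError(f"Token {t} is not in any sentence")
        grouped[ordered[i]["text"]].append(t["text"])
    return grouped


demo_pronouns = group_tokens_by_sentence(demo_tokens, demo_sentences)
print(demo_pronouns)
```

This still requires hand-written bookkeeping, which is the part the DataFrame representation removes.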

In [17]:
# Find all the pronouns in each sentence.
# Pandas version.
pronouns_by_sentence = syntax[syntax["part_of_speech"] == "PRON"][["sentence", "token_span"]]
pronouns_by_sentence

Unnamed: 0,sentence,token_span
52,"[274, 405): 'It was conceived during the hiatu...","[274, 276): 'It'"
65,"[274, 405): 'It was conceived during the hiatu...","[348, 353): 'their'"
85,"[407, 642): 'While the group's first film, And...","[449, 458): 'Something'"
107,"[407, 642): 'While the group's first film, And...","[575, 579): 'that'"
161,"[824, 954): 'In the US, it was selected as the...","[835, 837): 'it'"
185,"[824, 954): 'In the US, it was selected as the...","[945, 948): 'Our'"
200,"[955, 1122): 'In the UK, readers of Total Film...","[1012, 1014): 'it'"
224,"[955, 1122): 'In the UK, readers of Total Film...","[1113, 1115): 'it'"
237,"[1122, 1256): '[5] In AD 932, King Arthur and ...","[1154, 1157): 'his'"
261,"[1257, 1488): 'Along the way, he recruits Sir ...","[1272, 1274): 'he'"
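
If the per-sentence dictionary shape from the plain-Python cell is still wanted, a flat table like the one above can be regrouped with `groupby`. The following is a minimal sketch on a made-up stand-in table that uses plain strings; the real columns hold span objects, so treat the column names and data here as assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the pronoun table above, using plain strings
demo_pronoun_table = pd.DataFrame(
    {
        "sentence": ["It was conceived.", "They liked it.", "They liked it."],
        "token": ["It", "They", "it"],
    }
)

# Collect the pronouns of each sentence into a list, keeping first-seen order
demo_grouped = (
    demo_pronoun_table.groupby("sentence", sort=False)["token"].apply(list).to_dict()
)
print(demo_grouped)
```

Note that sentences without any pronoun do not appear in the grouped result, unlike in the dictionary built by the nested loops.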


In [19]:
# How would the previous cell look if the tokens and sentences weren't pre-joined?
pronouns = syntax[syntax["part_of_speech"] == "PRON"]["token_span"]
pronouns_by_sentence = tp.contain_join(sentences["sentence"], pronouns, "sentence", "token_span")
pronouns_by_sentence.head()

Unnamed: 0,sentence,token_span
0,"[274, 405): 'It was conceived during the hiatu...","[274, 276): 'It'"
1,"[274, 405): 'It was conceived during the hiatu...","[348, 353): 'their'"
2,"[407, 642): 'While the group's first film, And...","[449, 458): 'Something'"
3,"[407, 642): 'While the group's first film, And...","[575, 579): 'that'"
4,"[824, 954): 'In the US, it was selected as the...","[835, 837): 'it'"
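
For intuition about what a containment join computes, here is a minimal sketch on plain integer offsets that uses a pandas cross join followed by a filter. It is not how `tp.contain_join` is implemented, and it is quadratic in the input sizes. The column names and sample spans are assumptions chosen for illustration, and `how="cross"` requires pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical sentence and pronoun spans as half-open [begin, end) offsets
demo_sentence_spans = pd.DataFrame({"sent_begin": [0, 18], "sent_end": [17, 32]})
demo_pronoun_spans = pd.DataFrame(
    {"tok_begin": [0, 18, 29], "tok_end": [2, 22, 31], "text": ["It", "They", "it"]}
)

# Pair every sentence with every pronoun, then keep only the pairs where
# the sentence span fully contains the pronoun span
pairs = demo_sentence_spans.merge(demo_pronoun_spans, how="cross")
contained = pairs[
    (pairs["sent_begin"] <= pairs["tok_begin"]) & (pairs["tok_end"] <= pairs["sent_end"])
].reset_index(drop=True)
print(contained)
```

Each surviving row pairs one sentence with one pronoun it contains, which is the same shape as the join result above.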


In [20]:
# Ask the tokens of the first sentence to render themselves as HTML
sentence_tokens_df = syntax[syntax["sentence"] == sentences["sentence"].loc[0]]
sentence_tokens_df["char_span"].values

Unnamed: 0,begin,end,covered_text
0,0,5,Monty
1,6,12,Python
2,13,16,and
3,17,20,the
4,21,25,Holy
5,26,31,Grail
6,32,34,is
7,35,36,a
8,37,41,1975
9,42,49,British


In [21]:
# TODO - these tokens don't have dependency info REMOVE THIS

# Display the first sentence's dependency parse
#sentence_tokens_df = df[df["sentence"] == sentences["sentence"].loc[0]]
#tp.render_parse_tree(sentence_tokens_df, tag_col=None)

# Part 3: Scoring NLU Entity Recognition with DataFrames

Here we will process the entities DataFrame and compute precision and recall against a set of gold standard labels.

In [22]:
entities = dfs["entities"]

# Display all unique entity types found
entity_types = pd.DataFrame({"unique_types": entities["type"].unique()})
entity_types

Unnamed: 0,unique_types
0,Person
1,PrintMedia
2,Organization
3,Movie
4,TelevisionShow
5,Facility
6,Location
7,Broadcaster
8,Quantity
9,Company


In [None]:
# Let's look at just the entities tagged "Person"
person_entities = entities[entities["type"] == "Person"]
person_entities.head(10)

In [None]:
# Make a token span array from the entities (all types, not only persons)
char_span = dfs['syntax']['char_span'].values

token_span = tp.make_span_from_entities(entities, 'text', char_span)
token_span

In [None]:
# Merge the token spans with the person entities DataFrame, drop duplicate
# texts and the disambiguation columns, then show the most relevant entities
person_entities_span = pd.DataFrame({"token_span": token_span})
person_entities_span['text'] = person_entities_span['token_span'].map(lambda span: span.covered_text)
person_entities_span = person_entities_span.merge(person_entities, on="text", how="left")
person_entities_span = person_entities_span.drop_duplicates(subset=['text'])
person_entities_span = person_entities_span.drop(columns=["text", "disambiguation.dbpedia_resource", "disambiguation.name", "disambiguation.subtype"])
top_relevant = person_entities_span.sort_values(by=['relevance'], ascending=False).head()
top_relevant

In [None]:
# TODO: Maybe get a list of sentences with most relevant people?


## Now let's compute the precision/recall on the person entities

In [None]:
import spacy
spacy_language_model = spacy.load("en_core_web_sm")

# Example document text courtesy https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
# License: CC-BY-SA
with open("../resources/holy_grail.txt", "r") as f:
    doc_text = f.read()
 
# Parse the document text with SpaCy, then convert the results to a dataframe
token_features = tp.make_tokens_and_features(doc_text, spacy_language_model)

In [None]:
# Load gold standard labels in IOB format from a CSV file
person_gold_iob = pd.read_csv("../resources/holy_grail_person.csv")

# Pull in token offsets from our token_features dataframe
person_gold_iob["token_span"] = token_features["token_span"].values
person_gold_iob["char_span"] = token_features["char_span"].values
person_gold_iob.iloc[25:35]

In [None]:
person_gold_iob[50:70]

In [None]:
# Convert from IOB format to spans of entities
person_gold = tp.iob_to_spans(person_gold_iob, entity_type_col_name=None)
person_gold.head()
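
As background on the IOB (inside-outside-beginning) format, here is a minimal sketch of how a tag sequence maps to entity spans: a `B` tag opens an entity, the `I` tags that follow extend it, and an `O` tag closes it. The function `iob_tags_to_spans` and the sample tokens are hypothetical and work on token indices only; this is not the implementation of `tp.iob_to_spans`.

```python
from typing import List, Tuple


def iob_tags_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert IOB tags to half-open (begin, end) token index ranges."""
    spans: List[Tuple[int, int]] = []
    start = None  # Index where the currently open entity began, if any
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # "B" always opens a new entity; a stray "I" is treated like "B"
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        # Close an entity that runs to the end of the sequence
        spans.append((start, len(tags)))
    return spans


iob_tokens = ["King", "Arthur", "and", "Sir", "Lancelot", "ride"]
iob_tags = ["B", "I", "O", "B", "I", "O"]
iob_spans = iob_tags_to_spans(iob_tags)
iob_entities = [" ".join(iob_tokens[b:e]) for b, e in iob_spans]
print(iob_entities)
```

Treating a stray `I` as the start of an entity is one common convention; a stricter parser could raise an error instead.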

In [None]:
# Find all the spans that are in both the extractor's answer set and the gold standard
person_gold['text'] = person_gold['token_span'].map(lambda span: span.covered_text)
person_intersection = person_gold.merge(person_entities)
person_intersection.head()

In [None]:
# Let's compute precision and recall, just on this document.
# Of course, in a real use case, we would be computing these values on a 
# development holdout set of documents while tuning the model, then
# computing them again on a validation set during final testing.
# We use a single document here to show that it is straightforward 
# to collect the necessary information using Pandas.
num_true_positives = len(person_intersection.index)
num_entities = len(person_gold.index)
num_entities_extracted = len(person_entities.index)

precision = num_true_positives / num_entities_extracted
recall = num_true_positives / num_entities
F1 = 2.0 * (precision * recall) / (precision + recall)

print(
"""Number of correct answers: {}
Number of entities identified: {}
Actual number of entities: {}
Precision: {:1.2f}
Recall: {:1.2f}
F1: {:1.2f}""".format(num_true_positives, num_entities_extracted, num_entities, precision, recall, F1))
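
The metric arithmetic in the cell above divides by the number of extracted entities and by the number of gold standard entities, so it fails if either is zero. As a sketch, the same formulas can be wrapped in a small helper that returns 0.0 in those cases. The function name and the counts below are made up for illustration and are not part of Text Extensions for Pandas.

```python
from typing import Tuple


def precision_recall_f1(
    num_true_positives: int, num_extracted: int, num_gold: int
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = num_true_positives / num_extracted if num_extracted else 0.0
    recall = num_true_positives / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts: 8 correct out of 10 extracted, 16 entities in the gold standard
p, r, f1 = precision_recall_f1(8, 10, 16)
print(f"Precision: {p:1.2f}  Recall: {r:1.2f}  F1: {f1:1.2f}")
```

Returning 0.0 for an undefined metric is a convention, not the only option; some evaluation tools report NaN or raise a warning instead.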

In [None]:
false_positives = person_entities[~person_entities["text"].isin(person_gold["text"])]
false_positives

In [None]:
false_negatives = person_gold[~person_gold["text"].isin(person_entities["text"])]
false_negatives

In [None]:
len(false_positives), len(false_negatives)