# Watson NLP Example for Text Extensions for Pandas

## Introduction

This demo shows how to use the `watson` module from Text Extensions for Pandas to
process a Watson NLP response from IBM Cloud into Pandas DataFrames for analysis.
Pandas is the de facto tool for data science and ...
https://github.com/CODAIT/text-extensions-for-pandas

The notebook is broken up into 3 parts:

**Part 1:** Shows how to authenticate with the IBM Watson SDK and make a request to the
Watson NLU API. The response is then processed by Text Extensions for Pandas to convert
the JSON response into several Pandas DataFrames.

**Part 2:** Goes deeper into the data received from Watson NLU and shows how to do
analytics with the DataFrames from Text Extensions for Pandas.

**Part 3:** Scores the Watson NLU entity recognition results by computing precision and
recall with DataFrames.


## Authentication

This demo uses the IBM Watson Python SDK to authenticate with IBM Cloud through the
`IAMAuthenticator`. See https://github.com/watson-developer-cloud/python-sdk#iam for more
information. To authenticate, set the environment variable `IBM_API_KEY` to your API key
before making requests with `ibm_watson.NaturalLanguageUnderstandingV1`.

In [1]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if sys.path[0] != "..":
    sys.path.insert(0, "..")

import json
import os
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import Features, CategoriesOptions, ConceptsOptions, EmotionOptions, EntitiesOptions, KeywordsOptions, \
    MetadataOptions, RelationsOptions, SemanticRolesOptions, SentimentOptions, SyntaxOptions, SyntaxOptionsTokens
import pandas as pd
import text_extensions_for_pandas as tp
#from text_extensions_for_pandas.io.watson import watson_nlu_parse_response

In [2]:
# Retrieve the APIKEY for authentication
apikey = os.environ.get("IBM_API_KEY")
if apikey is None:
    raise ValueError("Expected apikey in the environment variable 'IBM_API_KEY'")

# Set the service URL for your IBM Cloud instance
ibm_cloud_service_url = 'https://api.us-south.natural-language-understanding.watson.cloud.ibm.com/instances/21b9b875-4ddb-46ad-bb22-d78747622ca7'

In [3]:
# Initialize the authenticator for making requests
authenticator = IAMAuthenticator(apikey)
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2019-07-12',
    authenticator=authenticator
)

natural_language_understanding.set_service_url(ibm_cloud_service_url)

# Part 1: Turning the Watson NLU Response into Pandas DataFrames 

The response should be decoded JSON, that is, a Python dictionary. The following features
will be processed into DataFrames:

* entities
* keywords
* relations
* semantic_roles
* syntax with sentences and tokens

See https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#text-analytics-features
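
The column names in the DataFrames shown later (for example `sentiment.score`) suggest that nested JSON fields end up as dotted column names. The sketch below illustrates that idea in plain Python. It is not the library's implementation; `flatten_record` and the sample record are hypothetical, made up for illustration.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into one level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical record shaped like one element of response["entities"]
entity = {
    "type": "Person",
    "text": "Arthur",
    "count": 12,
    "sentiment": {"score": -0.312834, "label": "negative"},
}

row = flatten_record(entity)
print(sorted(row))
```

A list of such flat rows can be passed directly to `pd.DataFrame` to obtain one column per dotted key.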

In [69]:
# Make the request
response = natural_language_understanding.analyze(
    url="https://raw.githubusercontent.com/CODAIT/text-extensions-for-pandas/master/resources/holy_grail.txt",
    features=Features(
        #categories=CategoriesOptions(limit=3), 
        #concepts=ConceptsOptions(limit=3), 
        #emotion=EmotionOptions(targets=['grail']),
        entities=EntitiesOptions(sentiment=True),
        keywords=KeywordsOptions(sentiment=True,emotion=True),
        #metadata=MetadataOptions(),
        relations=RelationsOptions(),
        semantic_roles=SemanticRolesOptions(),
        #sentiment=SentimentOptions(targets=['Arthur']),
        syntax=SyntaxOptions(sentences=True, tokens=SyntaxOptionsTokens(lemma=True, part_of_speech=True))  # Experimental
    )).get_result()

In [None]:
# View response as JSON
print(json.dumps(response, indent=2))

In [70]:
# Get the response as processed Pandas DataFrames
dfs = tp.watson_nlu_parse_response(response)

In [71]:
# Created DataFrames from the response
dfs.keys()

dict_keys(['entities', 'keywords', 'relations', 'semantic_roles', 'syntax'])

### View the created DataFrames

In [72]:
dfs["entities"].head()

Unnamed: 0,confidence,count,disambiguation.dbpedia_resource,disambiguation.name,disambiguation.subtype,relevance,sentiment.label,sentiment.mixed,sentiment.score,text,type
0,1.0,12,,,,0.956097,negative,1.0,-0.312834,Arthur,Person
1,1.0,5,,,,0.678523,positive,,0.835873,Lancelot,Person
2,0.977538,2,,,,0.644313,neutral,,0.0,Monty Python,Person
3,0.992188,2,,,,0.561727,neutral,,0.0,King Arthur,Person
4,0.999984,2,,,,0.540271,positive,,0.835873,Sir Galahad,Person


In [None]:
dfs["keywords"].head()

In [None]:
dfs["relations"].head()

In [None]:
dfs["semantic_roles"].head()

In [16]:
dfs["syntax"].head()

Unnamed: 0,lemma,part_of_speech,char_span,token_span,sentence
0,,PROPN,"[0, 5): 'Monty'","[0, 5): 'Monty'","[0, 273): 'Monty Python and the Holy Grail is ..."
1,python,PROPN,"[6, 12): 'Python'","[6, 12): 'Python'","[0, 273): 'Monty Python and the Holy Grail is ..."
2,and,CCONJ,"[13, 16): 'and'","[13, 16): 'and'","[0, 273): 'Monty Python and the Holy Grail is ..."
3,the,DET,"[17, 20): 'the'","[17, 20): 'the'","[0, 273): 'Monty Python and the Holy Grail is ..."
4,,PROPN,"[21, 25): 'Holy'","[21, 25): 'Holy'","[0, 273): 'Monty Python and the Holy Grail is ..."


# Part 2: Part of Speech Analysis with Pandas

Now we will use Pandas to analyze the Watson NLU syntax result, which contains a
part-of-speech tag for each token.

In [None]:
df = dfs["syntax"]

# Retrieve sentence information from the above dataframe
sentences = pd.DataFrame({"sentence": df["sentence"].unique()})
sentences

In [None]:
# Find all the pronouns in each sentence, *without* using Pandas.
# NON-scalable traversal of the syntax analysis data structure
# (runs in time proportional to the square of document length).

# Use separate names here so that the `sentences` DataFrame from the previous
# cell stays available to later cells.
nlu_sentences = response["syntax"]["sentences"]
nlu_tokens = response["syntax"]["tokens"]

pronouns_by_sentence = {s["text"]: [] for s in nlu_sentences}

# Nested for loops.
# Running time: O(num_tokens * num_sentences), i.e. O(document_size^2)
for t in nlu_tokens:
    pos_str = t["part_of_speech"]  # Part-of-speech tag as a string, e.g. "PRON"
    if pos_str == "PRON":
        found_sentence = False
        for s in nlu_sentences:
            if (t["location"][0] >= s["location"][0]
                    and t["location"][1] <= s["location"][1]):
                found_sentence = True
                pronouns_by_sentence[s["text"]].append(t)
        if not found_sentence:
            raise ValueError(f"Token {t} is not in any sentence")
        
pronouns_by_sentence
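
For comparison, the quadratic scan can be avoided even without Pandas if the sentences are sorted by start offset and do not overlap. That is an assumption about the response, not something this notebook verifies. Under that assumption a binary search finds each token's sentence. The records below are hypothetical toy data shaped like the `syntax` part of the response.

```python
from bisect import bisect_right

# Hypothetical records; "location" is [begin, end) in character offsets
toy_sentences = [
    {"text": "He saw it.", "location": [0, 10]},
    {"text": "Arthur rode on.", "location": [11, 26]},
]
toy_tokens = [
    {"text": "He", "part_of_speech": "PRON", "location": [0, 2]},
    {"text": "saw", "part_of_speech": "VERB", "location": [3, 6]},
    {"text": "it", "part_of_speech": "PRON", "location": [7, 9]},
    {"text": "Arthur", "part_of_speech": "PROPN", "location": [11, 17]},
]

# Assumption: sentences are sorted by begin offset and do not overlap
sentence_begins = [s["location"][0] for s in toy_sentences]

toy_pronouns = {s["text"]: [] for s in toy_sentences}
for t in toy_tokens:
    if t["part_of_speech"] != "PRON":
        continue
    begin, end = t["location"]
    # Index of the last sentence that starts at or before the token
    i = bisect_right(sentence_begins, begin) - 1
    if i < 0 or end > toy_sentences[i]["location"][1]:
        raise ValueError(f"Token {t} is not in any sentence")
    toy_pronouns[toy_sentences[i]["text"]].append(t["text"])

print(toy_pronouns)
```

Each lookup costs O(log num_sentences), so the whole pass is O(num_tokens * log num_sentences) rather than O(num_tokens * num_sentences).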

In [None]:
# Find all the pronouns in each sentence.
# Pandas version.
pronouns_by_sentence = df[df["part_of_speech"] == "PRON"][["sentence", "token_span"]]
pronouns_by_sentence

In [None]:
# How would the previous cell look if the tokens and sentences weren't pre-joined?
pronouns = df[df["part_of_speech"] == "PRON"]["token_span"]
pronouns_by_sentence = tp.contain_join(sentences["sentence"], pronouns, "sentence", "token_span")
pronouns_by_sentence

In [None]:
# Ask the tokens of the first sentence to render themselves as HTML
sentence_tokens_df = df[df["sentence"] == sentences["sentence"].loc[0]]
sentence_tokens_df["char_span"].values

In [None]:
# TODO - these tokens don't have dependency info

# Display the first sentence's dependency parse
sentence_tokens_df = df[df["sentence"] == sentences["sentence"].loc[0]]
#tp.render_parse_tree(sentence_tokens_df, tag_col=None)

# Part 3: Scoring NLU Entity Recognition with DataFrames

Here we will process the entities DataFrame and compute precision and recall.

In [7]:
entities = dfs["entities"]

# Display all unique entity types found
entity_types = pd.DataFrame({"unique_types": entities["type"].unique()})
entity_types

Unnamed: 0,unique_types
0,Person
1,PrintMedia
2,Organization
3,Movie
4,TelevisionShow
5,Facility
6,Location
7,Broadcaster
8,Quantity
9,Company


In [115]:
# Let's look at just the entities tagged "Person"
person_entities = entities[entities["type"] == "Person"]
person_entities.head(10)

Unnamed: 0,confidence,count,disambiguation.dbpedia_resource,disambiguation.name,disambiguation.subtype,relevance,sentiment.label,sentiment.mixed,sentiment.score,text,type
0,1.0,12,,,,0.956097,negative,1.0,-0.312834,Arthur,Person
1,1.0,5,,,,0.678523,positive,,0.835873,Lancelot,Person
2,0.977538,2,,,,0.644313,neutral,,0.0,Monty Python,Person
3,0.992188,2,,,,0.561727,neutral,,0.0,King Arthur,Person
4,0.999984,2,,,,0.540271,positive,,0.835873,Sir Galahad,Person
5,0.999847,2,,,,0.532632,positive,1.0,0.387661,Sir Robin,Person
7,0.999997,4,,,,0.507619,negative,,-0.589501,Bedevere,Person
8,0.975353,1,,,,0.496844,positive,,0.835873,Sir Bedevere,Person
9,0.655069,1,,,,0.484381,positive,,0.835873,Sir Not,Person
10,0.863914,1,http://dbpedia.org/resource/W._G._Grace,W._G._Grace,"[Athlete, Physician, CricketPlayer]",0.455931,positive,,0.721918,W. G. Grace,Person


In [49]:
# Make a token span array from person entities
char_span = dfs['syntax']['char_span'].values

token_span = tp.make_span_from_entities(person_entities, 'text', char_span)
token_span

Unnamed: 0,begin,end,begin_token,end_token,covered_text
0,0,12,0,2,Monty Python
1,124,136,22,24,Monty Python
2,153,167,27,29,Graham Chapman
3,169,180,30,32,John Cleese
4,182,195,33,35,Terry Gilliam
5,197,206,36,38,Eric Idle
6,208,219,39,41,Terry Jones
7,224,237,42,44,Michael Palin
8,255,262,48,49,Gilliam
9,267,272,50,51,Jones
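
The output above pairs each mention's character offsets with token offsets. A rough sketch of how such a lookup could work is shown below: find every occurrence of the entity text, then map the character offsets onto token indices. This is an illustration only and not necessarily how `tp.make_span_from_entities` works. `find_mentions`, the toy text, and the regular-expression tokenizer are all hypothetical.

```python
import re
from bisect import bisect_left, bisect_right

toy_text = "Monty Python made a film. Monty Python liked it."

# Hypothetical tokenizer: words and single punctuation marks, in document order
token_offsets = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", toy_text)]
token_begins = [b for b, _ in token_offsets]
token_ends = [e for _, e in token_offsets]


def find_mentions(text, entity_text, begins, ends):
    """Return (begin, end, begin_token, end_token) for each occurrence."""
    mentions = []
    for m in re.finditer(re.escape(entity_text), text):
        begin, end = m.start(), m.end()
        # First token that starts at or after the match begin
        begin_token = bisect_left(begins, begin)
        # One past the last token that ends at or before the match end
        end_token = bisect_right(ends, end)
        mentions.append((begin, end, begin_token, end_token))
    return mentions


mentions = find_mentions(toy_text, "Monty Python", token_begins, token_ends)
print(mentions)
```

Note that a plain substring search also matches inside longer words, so a real implementation would likely need to check that matches line up with token boundaries.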


In [94]:
# Merge the token spans with the entity dataframe and show the most relevant people
person_entities_span = pd.DataFrame({"token_span": token_span})
person_entities_span['text'] = person_entities_span['token_span'].map(lambda span: span.covered_text)
person_entities_span = person_entities_span.merge(person_entities, on="text", how="left")
person_entities_span = person_entities_span.drop_duplicates(subset=['text'])
person_entities_span = person_entities_span.drop(columns=["text", "disambiguation.dbpedia_resource", "disambiguation.name", "disambiguation.subtype"])
top_relevant = person_entities_span.sort_values(by=['relevance'], ascending=False).head()
top_relevant

Unnamed: 0,token_span,confidence,count,relevance,sentiment.label,sentiment.mixed,sentiment.score,type
21,"[1489, 1495): 'Arthur'",1.0,12,0.956097,negative,1.0,-0.312834,Person
18,"[1393, 1401): 'Lancelot'",1.0,5,0.678523,positive,,0.835873,Person
0,"[0, 12): 'Monty Python'",0.977538,2,0.644313,neutral,,0.0,Person
11,"[603, 614): 'King Arthur'",0.992188,2,0.561727,neutral,,0.0,Person
16,"[1331, 1342): 'Sir Galahad'",0.999984,2,0.540271,positive,,0.835873,Person


In [97]:
# TODO: Maybe get a list of sentences with most relevant people?


## Now let's compute the precision/recall on the person entities

In [98]:
import spacy
spacy_language_model = spacy.load("en_core_web_sm")

# Example document text courtesy https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
# License: CC-BY-SA
with open("../resources/holy_grail.txt", "r") as f:
    doc_text = f.read()
 
# Parse the document text with SpaCy, then convert the results to a dataframe
token_features = tp.make_tokens_and_features(doc_text, spacy_language_model)

In [99]:
# Load gold standard labels in IOB format from a CSV file
person_gold_iob = pd.read_csv("../resources/holy_grail_person.csv")

# Pull in token offsets from our token_features dataframe
person_gold_iob["token_span"] = token_features["token_span"].values
person_gold_iob["char_span"] = token_features["char_span"].values
person_gold_iob.iloc[25:35]

Unnamed: 0,token_text,ent_iob,token_span,char_span
25,group,O,"[144, 149): 'group'","[144, 149): 'group'"
26,of,O,"[150, 152): 'of'","[150, 152): 'of'"
27,Graham,B,"[153, 159): 'Graham'","[153, 159): 'Graham'"
28,Chapman,I,"[160, 167): 'Chapman'","[160, 167): 'Chapman'"
29,",",O,"[167, 168): ','","[167, 168): ','"
30,John,B,"[169, 173): 'John'","[169, 173): 'John'"
31,Cleese,I,"[174, 180): 'Cleese'","[174, 180): 'Cleese'"
32,",",O,"[180, 181): ','","[180, 181): ','"
33,Terry,B,"[182, 187): 'Terry'","[182, 187): 'Terry'"
34,Gilliam,I,"[188, 195): 'Gilliam'","[188, 195): 'Gilliam'"


In [100]:
person_gold_iob[50:70]

Unnamed: 0,token_text,ent_iob,token_span,char_span
50,Jones,B,"[267, 272): 'Jones'","[267, 272): 'Jones'"
51,.,O,"[272, 273): '.'","[272, 273): '.'"
52,It,O,"[274, 276): 'It'","[274, 276): 'It'"
53,was,O,"[277, 280): 'was'","[277, 280): 'was'"
54,conceived,O,"[281, 290): 'conceived'","[281, 290): 'conceived'"
55,during,O,"[291, 297): 'during'","[291, 297): 'during'"
56,the,O,"[298, 301): 'the'","[298, 301): 'the'"
57,hiatus,O,"[302, 308): 'hiatus'","[302, 308): 'hiatus'"
58,between,O,"[309, 316): 'between'","[309, 316): 'between'"
59,the,O,"[317, 320): 'the'","[317, 320): 'the'"


In [101]:
# Convert from IOB format to spans of entities
person_gold = tp.iob_to_spans(person_gold_iob, entity_type_col_name=None)
person_gold.head()

Unnamed: 0,token_span
0,"[153, 167): 'Graham Chapman'"
1,"[169, 180): 'John Cleese'"
2,"[182, 195): 'Terry Gilliam'"
3,"[197, 206): 'Eric Idle'"
4,"[208, 219): 'Terry Jones'"
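
To make the IOB conversion concrete, here is a minimal sketch of the underlying idea: a `B` tag opens an entity, following `I` tags extend it, and anything else closes it. This is a simplified illustration working on plain tuples, not the implementation of `tp.iob_to_spans`. The helper name `iob_to_char_spans` is hypothetical.

```python
def iob_to_char_spans(tagged_tokens):
    """Convert (tag, begin, end) triples into (begin, end) entity spans."""
    spans = []
    current = None  # [begin, end] of the entity being built, if any
    for tag, begin, end in tagged_tokens:
        if tag == "B":
            # A new entity starts; close the previous one first
            if current is not None:
                spans.append(tuple(current))
            current = [begin, end]
        elif tag == "I" and current is not None:
            # Extend the open entity to cover this token
            current[1] = end
        else:
            # "O", or an "I" with no preceding "B": close any open entity
            if current is not None:
                spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


# Rows 27-34 of the gold-standard table shown earlier, as (ent_iob, begin, end)
tagged = [
    ("B", 153, 159), ("I", 160, 167), ("O", 167, 168),
    ("B", 169, 173), ("I", 174, 180), ("O", 180, 181),
    ("B", 182, 187), ("I", 188, 195),
]
gold_spans = iob_to_char_spans(tagged)
print(gold_spans)
```

The resulting offsets agree with the first three spans in the output above.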


In [127]:
# Find all the spans that are in both the extractor's answer set and the gold standard
person_gold['text'] = person_gold['token_span'].map(lambda span: span.covered_text)
person_intersection = person_gold.merge(person_entities)
person_intersection.head()

Unnamed: 0,token_span,text,confidence,count,disambiguation.dbpedia_resource,disambiguation.name,disambiguation.subtype,relevance,sentiment.label,sentiment.mixed,sentiment.score,type
0,"[153, 167): 'Graham Chapman'",Graham Chapman,0.981543,1,http://dbpedia.org/resource/Graham_Chapman,Graham_Chapman,"[Actor, Composer, MusicalArtist, Physician, Fi...",0.439328,neutral,,0.0,Person
1,"[169, 180): 'John Cleese'",John Cleese,0.973674,1,http://dbpedia.org/resource/John_Cleese,John_Cleese,"[MusicalArtist, Politician, AwardNominee, Awar...",0.391066,neutral,,0.0,Person
2,"[182, 195): 'Terry Gilliam'",Terry Gilliam,0.982,1,http://dbpedia.org/resource/Terry_Gilliam,Terry_Gilliam,"[Actor, Composer, MusicalArtist, AwardNominee,...",0.392222,neutral,,0.0,Person
3,"[197, 206): 'Eric Idle'",Eric Idle,0.983174,1,http://dbpedia.org/resource/Eric_Idle,Eric_Idle,"[Actor, Composer, MusicalArtist, FilmActor, Fi...",0.416211,neutral,,0.0,Person
4,"[208, 219): 'Terry Jones'",Terry Jones,0.961858,1,http://dbpedia.org/resource/Terry_Jones,Terry_Jones,"[MusicalArtist, AwardNominee, Celebrity, FilmD...",0.395943,neutral,,0.0,Person


In [124]:
# Let's compute precision and recall, just on this document.
# Of course, in a real use case, we would be computing these values on a 
# development holdout set of documents while tuning the model, then
# computing them again on a validation set during final testing.
# We use a single document here to show that it is straightforward 
# to collect the necessary information using Pandas.
num_true_positives = len(person_intersection.index)
num_entities = len(person_gold.index)
num_entities_extracted = len(person_entities.index)

# Caution: person_intersection and person_gold have one row per mention in the
# text, while person_entities has one row per distinct entity text. The
# precision below therefore mixes two units and can exceed 1.0.
precision = num_true_positives / num_entities_extracted
recall = num_true_positives / num_entities
F1 = 2.0 * (precision * recall) / (precision + recall)

print(
"""Number of correct answers: {}
Number of entities identified: {}
Actual number of entities: {}
Precision: {:1.2f}
Recall: {:1.2f}
F1: {:1.2f}""".format(num_true_positives, num_entities_extracted, num_entities, precision, recall, F1))

Number of correct answers: 44
Number of entities identified: 29
Actual number of entities: 55
Precision: 1.52
Recall: 0.80
F1: 1.05
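
Precision and recall stay within [0, 1] only when the true positives, the extracted items, and the gold items are all counted in the same unit. The sketch below shows one way to guarantee that: represent both answer sets as Python sets of mention spans and take their intersection. The spans and the helper `precision_recall_f1` are hypothetical and serve only to illustrate the arithmetic.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall and F1 from two sets of comparable items."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical mention-level spans as (begin, end) character offsets
toy_gold = {(153, 167), (169, 180), (182, 195), (197, 206)}
toy_extracted = {(153, 167), (169, 180), (0, 12)}

p, r, f1 = precision_recall_f1(toy_extracted, toy_gold)
print(f"Precision: {p:1.2f}  Recall: {r:1.2f}  F1: {f1:1.2f}")
```

Because the intersection is a subset of both sets, neither ratio can exceed 1.0 with this formulation.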


In [119]:
false_positives = person_entities[~person_entities["text"].isin(person_gold["text"])]
false_positives

Unnamed: 0,confidence,count,disambiguation.dbpedia_resource,disambiguation.name,disambiguation.subtype,relevance,sentiment.label,sentiment.mixed,sentiment.score,text,type
2,0.977538,2,,,,0.644313,neutral,,0.0,Monty Python,Person
8,0.975353,1,,,,0.496844,positive,,0.835873,Sir Bedevere,Person
9,0.655069,1,,,,0.484381,positive,,0.835873,Sir Not,Person
21,0.962657,1,http://dbpedia.org/resource/Sir_Lancelot_%28si...,Sir_Lancelot_%28singer%29,"[MusicalArtist, TVActor]",0.342728,positive,,0.835873,Sir Lancelot,Person
28,0.597452,1,,,,0.266359,neutral,,0.0,Camelot,Person
38,0.756888,1,,,,0.146316,negative,,-0.652091,Knight,Person
40,0.359389,1,,,,0.116073,neutral,,0.0,Maynard,Person
41,0.969143,1,,,,0.099675,neutral,,0.0,Tim,Person


In [120]:
false_negatives = person_gold[~person_gold["text"].isin(person_entities["text"])]
false_negatives

Unnamed: 0,token_span,text
9,"[663, 667): 'Idle'",Idle
11,"[1166, 1171): 'Patsy'",Patsy
12,"[1284, 1305): 'Sir Bedevere the Wise'",Sir Bedevere the Wise
13,"[1307, 1329): 'Sir Lancelot the Brave'",Sir Lancelot the Brave
14,"[1331, 1351): 'Sir Galahad the Pure'",Sir Galahad the Pure
15,"[1353, 1401): 'Sir Robin the Not-Quite-So-Brav...",Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot
16,"[1407, 1437): 'Sir Not-Appearing-in-this-Film'",Sir Not-Appearing-in-this-Film
27,"[2684, 2703): 'Three-Headed Knight'",Three-Headed Knight
33,"[3327, 3344): 'Tim the Enchanter'",Tim the Enchanter
34,"[3587, 3596): 'Sirs Bors'",Sirs Bors


In [121]:
len(false_positives), len(false_negatives)

(8, 11)