# Watson NLP Example for Text Extensions for Pandas

## Introduction

This demo shows how to use the `watson` module from Text Extensions for Pandas to
process a Watson NLU response from IBM Cloud into Pandas DataFrames for analysis.
Pandas is the de facto tool for data science and ...
https://github.com/CODAIT/text-extensions-for-pandas

The notebook is divided into two parts:

**Part 1:** Shows how to authenticate with the IBM Watson SDK and make a request with the
Watson NLU API. The response is then processed by Text Extensions for Pandas to convert
the JSON response into several Pandas DataFrames.

**Part 2:** Goes deeper into the data received from Watson NLU and shows how to run
analytics on the DataFrames from Text Extensions for Pandas.


## Authentication

This demo uses the IBM Watson Python SDK to perform authentication on the IBM Cloud with the 
`IAMAuthenticator`. See https://github.com/watson-developer-cloud/python-sdk#iam for more 
information. To authenticate with IBM Cloud, set the environment variable
`IBM_API_KEY` to your API key before making requests with `ibm_watson.NaturalLanguageUnderstandingV1`.

In [44]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if sys.path[0] != "..":
    sys.path.insert(0, "..")

import json
import os
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import (
    Features, CategoriesOptions, ConceptsOptions, EmotionOptions,
    EntitiesOptions, KeywordsOptions, MetadataOptions, RelationsOptions,
    SemanticRolesOptions, SentimentOptions, SyntaxOptions, SyntaxOptionsTokens,
)
import pandas as pd
import text_extensions_for_pandas as tp
from text_extensions_for_pandas.io.watson import watson_nlu_parse_response

In [2]:
# Retrieve the APIKEY for authentication
apikey = os.environ.get("IBM_API_KEY")
if apikey is None:
    raise ValueError("Expected apikey in the environment variable 'IBM_API_KEY'")

# Set the service URL for your IBM Cloud instance
ibm_cloud_service_url = 'https://api.us-south.natural-language-understanding.watson.cloud.ibm.com/instances/21b9b875-4ddb-46ad-bb22-d78747622ca7'

In [3]:
# Initialize the authenticator for making requests
authenticator = IAMAuthenticator(apikey)
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2019-07-12',
    authenticator=authenticator
)

natural_language_understanding.set_service_url(ibm_cloud_service_url)

# Part 1: Processing the Watson NLU Response to Pandas

The response should be decoded JSON, that is, a Python dictionary. The following features
will be processed into DataFrames:

* entities
* keywords
* relations
* semantic_roles
* syntax with sentences and tokens

See https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#text-analytics-features

In [4]:
# Make the request
response = natural_language_understanding.analyze(
    url='https://raw.githubusercontent.com/CODAIT/text-extensions-for-pandas/master/resources/holy_grail.txt',
    features=Features(
        #categories=CategoriesOptions(limit=3), 
        #concepts=ConceptsOptions(limit=3), 
        #emotion=EmotionOptions(targets=['grail']),
        entities=EntitiesOptions(sentiment=True, limit=3),
        keywords=KeywordsOptions(sentiment=True, emotion=True, limit=3),
        #metadata=MetadataOptions(),
        relations=RelationsOptions(),
        semantic_roles=SemanticRolesOptions(limit=3),
        #sentiment=SentimentOptions(targets=['Arthur']),
        syntax=SyntaxOptions(sentences=True, tokens=SyntaxOptionsTokens(lemma=True, part_of_speech=True))  # Experimental
    )).get_result()

In [5]:
# View response as JSON
print(json.dumps(response, indent=2))

{
  "usage": {
    "text_units": 1,
    "text_characters": 5338,
    "features": 4
  },
  "syntax": {
    "tokens": [
      {
        "text": "Monty",
        "part_of_speech": "PROPN",
        "location": [
          0,
          5
        ]
      },
      {
        "text": "Python",
        "part_of_speech": "PROPN",
        "location": [
          6,
          12
        ],
        "lemma": "python"
      },
      {
        "text": "and",
        "part_of_speech": "CCONJ",
        "location": [
          13,
          16
        ],
        "lemma": "and"
      },
      {
        "text": "the",
        "part_of_speech": "DET",
        "location": [
          17,
          20
        ],
        "lemma": "the"
      },
      {
        "text": "Holy",
        "part_of_speech": "PROPN",
        "location": [
          21,
          25
        ]
      },
      {
        "text": "Grail",
        "part_of_speech": "PROPN",
        "location": [
          26,
          31
        ]
      },


In [6]:
# Get the response as processed Pandas DataFrames
dfs = watson_nlu_parse_response(response)

In [7]:
# Created DataFrames from the response
dfs.keys()

dict_keys(['entities', 'keywords', 'relations', 'semantic_roles', 'syntax'])

### View the response as DataFrames

In [8]:
dfs['keywords']

Unnamed: 0,count,emotion.anger,emotion.disgust,emotion.fear,emotion.joy,emotion.sadness,relevance,sentiment.label,sentiment.score,text
0,1,0.071927,0.031335,0.058051,0.691404,0.175057,0.746411,neutral,0.0,legend of King Arthur
1,1,0.021033,0.095661,0.01634,0.810654,0.046902,0.642571,positive,0.835873,Sir Lancelot
2,1,0.112061,0.033299,0.043658,0.747356,0.09149,0.642235,neutral,0.0,King Arthur
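
The dotted column names in the table above, such as `sentiment.label` and `emotion.joy`, suggest that nested fields of each JSON record are flattened into one row. The sketch below illustrates that idea in plain Python. It is not the code used by `watson_nlu_parse_response`; the `flatten` helper is hypothetical, and the nesting of the sample record is an assumption, with values copied from the "Sir Lancelot" row.

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the nested object, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hand-written record: the nesting is assumed, the values come from the
# "Sir Lancelot" row of the keywords table.
keyword = {
    "text": "Sir Lancelot",
    "relevance": 0.642571,
    "count": 1,
    "sentiment": {"label": "positive", "score": 0.835873},
    "emotion": {"joy": 0.810654, "sadness": 0.046902},
}

flat_keyword = flatten(keyword)
print(sorted(flat_keyword))
```

Each flattened record then maps naturally onto one DataFrame row, with one column per dotted key.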


In [9]:
dfs['entities']

Unnamed: 0,confidence,count,relevance,sentiment.label,sentiment.mixed,sentiment.score,text,type
0,1.0,12,0.956097,negative,1.0,-0.312834,Arthur,Person
1,1.0,5,0.678523,positive,,0.835873,Lancelot,Person
2,0.977538,2,0.644313,neutral,,0.0,Monty Python,Person


In [10]:
dfs['relations']

Unnamed: 0,score,sentence,type,arguments.0.entities.disambiguation.subtype,arguments.1.entities.disambiguation.subtype,arguments.0.entities.text,arguments.1.entities.text,arguments.0.entities.type,arguments.1.entities.type,arguments.0.location,arguments.1.location,arguments.0.text,arguments.1.text
0,0.462615,Monty Python and the Holy Grail is a 1975 Brit...,timeOf,,,1975,comedy,Date,TitleWork,"[37, 41]","[57, 61]",1975,film
1,0.339446,"Arthur leads the men to Camelot, but upon furt...",locatedAt,,,men,Camelot,Person,GeopoliticalEntity,"[1506, 1509]","[1513, 1520]",men,Camelot
2,0.604304,"As they turn away, God (an image of W. G. Grac...",affectedBy,,,their,speaks,Person,EventCommunication,"[1699, 1703]","[1689, 1695]",them,speaks
3,0.304596,Searching the land for clues to the Grail's lo...,locatedAt,,,Grail,location,Organization,Location,"[1794, 1799]","[1802, 1810]",Grail,location
4,0.895035,Searching the land for clues to the Grail's lo...,employedBy,,[Country],soldiers,French,Person,GeopoliticalEntity,"[1872, 1880]","[1865, 1871]",soldiers,French
5,0.903545,Arthur and Bedevere eventually reach the Castl...,employedBy,,[Country],soldiers,French,Person,GeopoliticalEntity,"[4952, 4960]","[4945, 4951]",soldiers,French
6,0.945371,Searching the land for clues to the Grail's lo...,agentOf,,,soldiers,claim,Person,EventCommunication,"[1881, 1884]","[1885, 1890]",who,claim
7,0.943395,A modern-day historian filming a documentary d...,agentOf,,,Grail,investigation,Organization,EventLegal,"[2455, 2461]","[2462, 2475]",police,investigation
8,0.600505,Sir Robin avoids a fight with a Three-Headed K...,agentOf,,,Robin,arguing,Person,EventCommunication,"[2656, 2661]","[2740, 2747]",Robin,arguing
9,0.542254,"Lancelot, after receiving an arrow-shot note f...",locatedAt,,,Lancelot,castle,Person,Facility,"[2894, 2902]","[3038, 3044]",Lancelot,castle


In [11]:
dfs['syntax']

Unnamed: 0,lemma,part_of_speech,char_span,token_span,sentence
0,,PROPN,"[0, 5): 'Monty'","[0, 5): 'Monty'","[0, 273): 'Monty Python and the Holy Grail is ..."
1,python,PROPN,"[6, 12): 'Python'","[6, 12): 'Python'","[0, 273): 'Monty Python and the Holy Grail is ..."
2,and,CCONJ,"[13, 16): 'and'","[13, 16): 'and'","[0, 273): 'Monty Python and the Holy Grail is ..."
3,the,DET,"[17, 20): 'the'","[17, 20): 'the'","[0, 273): 'Monty Python and the Holy Grail is ..."
4,,PROPN,"[21, 25): 'Holy'","[21, 25): 'Holy'","[0, 273): 'Monty Python and the Holy Grail is ..."
...,...,...,...,...,...
1076,officer,NOUN,"[5306, 5314): 'officers'","[5306, 5314): 'officers'","[5275, 5335): 'The movie ends with one of the ..."
1077,break,VERB,"[5315, 5323): 'breaking'","[5315, 5323): 'breaking'","[5275, 5335): 'The movie ends with one of the ..."
1078,the,DET,"[5324, 5327): 'the'","[5324, 5327): 'the'","[5275, 5335): 'The movie ends with one of the ..."
1079,camera,NOUN,"[5328, 5334): 'camera'","[5328, 5334): 'camera'","[5275, 5335): 'The movie ends with one of the ..."
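
The span columns above print as `[begin, end): 'covered text'`, which corresponds to the `location` pairs in the JSON response read as half-open character ranges. The sketch below shows that mapping with a small hypothetical `Span` class. It is only an illustration of the notation, not the span type that Text Extensions for Pandas actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class Span:
    """Half-open character range [begin, end) over a target text."""
    begin: int
    end: int
    target_text: str

    @property
    def covered_text(self):
        # Python slicing is half-open as well, so the offsets apply directly.
        return self.target_text[self.begin:self.end]

    def __repr__(self):
        return f"[{self.begin}, {self.end}): {self.covered_text!r}"


# The first words of the document and one token record from the response.
document = "Monty Python and the Holy Grail"
token = {"text": "Python", "part_of_speech": "PROPN", "location": [6, 12]}

span = Span(*token["location"], document)
print(span)
```

Storing offsets together with a reference to the target text, rather than a copy of the covered string, keeps every span tied to its position in the original document.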


# Part 2: NLU Syntax Analysis using DataFrames

Now we will analyze the NLU syntax result using the Pandas DataFrame.

In [31]:
df = dfs['syntax']

# Retrieve sentence information from the above dataframe
sentences = pd.DataFrame({"sentence": df["sentence"].unique()})
sentences

Unnamed: 0,sentence
0,"[0, 273): 'Monty Python and the Holy Grail is ..."
1,"[274, 405): 'It was conceived during the hiatu..."
2,"[407, 642): 'While the group's first film, And..."
3,"[643, 720): 'Thirty years later, Idle used the..."
4,"[722, 823): 'Monty Python and the Holy Grail g..."
5,"[824, 954): 'In the US, it was selected as the..."
6,"[955, 1122): 'In the UK, readers of Total Film..."
7,"[1122, 1256): '[5] In AD 932, King Arthur and ..."
8,"[1257, 1488): 'Along the way, he recruits Sir ..."
9,"[1489, 1639): 'Arthur leads the men to Camelot..."


In [30]:
import pandas as pd

# Same table built in two steps; `u` holds the unique sentence spans.
u = df["sentence"].unique()
s = pd.DataFrame(u)
s

Unnamed: 0,0
0,"[0, 273): 'Monty Python and the Holy Grail is ..."
1,"[274, 405): 'It was conceived during the hiatu..."
2,"[407, 642): 'While the group's first film, And..."
3,"[643, 720): 'Thirty years later, Idle used the..."
4,"[722, 823): 'Monty Python and the Holy Grail g..."
5,"[824, 954): 'In the US, it was selected as the..."
6,"[955, 1122): 'In the UK, readers of Total Film..."
7,"[1122, 1256): '[5] In AD 932, King Arthur and ..."
8,"[1257, 1488): 'Along the way, he recruits Sir ..."
9,"[1489, 1639): 'Arthur leads the men to Camelot..."


In [34]:
syntax_analysis = response['syntax']['sentences']

In [35]:
syntax_analysis

[{'text': 'Monty Python and the Holy Grail is a 1975 British comedy film concerning the Arthurian legend, written and performed by the Monty Python comedy group of Graham Chapman, John Cleese, Terry Gilliam, Eric Idle, Terry Jones and Michael Palin, and directed by Gilliam and Jones.',
  'location': [0, 273]},
 {'text': "It was conceived during the hiatus between the third and fourth series of their BBC television series Monty Python's Flying Circus.",
  'location': [274, 405]},
 {'text': "While the group's first film, And Now for Something Completely Different, was a compilation of sketches from the first two television series, Holy Grail is a new story that parodies the legend of King Arthur's quest for the Holy Grail.",
  'location': [407, 642]},
 {'text': 'Thirty years later, Idle used the film as the basis for the musical Spamalot.',
  'location': [643, 720]},
 {'text': 'Monty Python and the Holy Grail grossed more than any other British film exhibited in the US in 2005.',
  'loca

In [38]:
# TODO - tokens come in a different dictionary than sentences

# Find all the pronouns in each sentence, *without* using Pandas.

# Version 1: Chain together accessor methods of the Python class that wraps 
# syntax analysis results.

'''syntax_analysis = response['syntax']['sentences']
token_spans_by_sentence = {
    s: l
    for s, l in zip(syntax_analysis.text, 
                    syntax_analysis.location)
}
pronouns_by_sentence = {}
for sentence, offsets_list in token_spans_by_sentence.items():
    pronouns_in_sentence = []
    for offsets_tuple in offsets_list:
        token_ix = syntax_analysis.find_token(offsets_tuple[0])
        if token_ix < 0:
            raise ValueError(f"No token found at offset {offsets_tuple[0]}")
        token = syntax_analysis.tokens[token_ix]
        pos_str = token.to_dict()["part_of_speech"]  # Decode numeric POS enum
        if pos_str == "POS_PRON":
            pronouns_in_sentence.append(token)
    pronouns_by_sentence[sentence] = pronouns_in_sentence

pronouns_by_sentence
'''

'syntax_analysis = response[\'syntax\'][\'sentences\']\ntoken_spans_by_sentence = {\n    s: l\n    for s, l in zip(syntax_analysis.text, \n                    syntax_analysis.location)\n}\npronouns_by_sentence = {}\nfor sentence, offsets_list in token_spans_by_sentence.items():\n    pronouns_in_sentence = []\n    for offsets_tuple in offsets_list:\n        token_ix = syntax_analysis.find_token(offsets_tuple[0])\n        if token_ix < 0:\n            raise ValueError(f"No token found at offset {offsets_tuple[0]}")\n        token = syntax_analysis.tokens[token_ix]\n        pos_str = token.to_dict()["part_of_speech"]  # Decode numeric POS enum\n        if pos_str == "POS_PRON":\n            pronouns_in_sentence.append(token)\n    pronouns_by_sentence[sentence] = pronouns_in_sentence\n\npronouns_by_sentence\n'
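
The disabled cell above assumes accessor methods that the decoded JSON response does not have, and its TODO notes that tokens and sentences arrive in separate lists. A minimal sketch of the non-Pandas version, working on plain dictionaries, could look like the following. The `syntax` sample is a hand-written stand-in for `response['syntax']` with only a few records, and the function name `find_pronouns_by_sentence` is hypothetical.

```python
# Hand-written stand-in for response['syntax']; only the fields used below
# are kept, and the sentence texts are shortened.
syntax = {
    "sentences": [
        {"text": "It was conceived during the hiatus ...", "location": [274, 405]},
        {"text": "Thirty years later, Idle used the film ...", "location": [643, 720]},
    ],
    "tokens": [
        {"text": "It", "part_of_speech": "PRON", "location": [274, 276]},
        {"text": "was", "part_of_speech": "AUX", "location": [277, 280]},
        {"text": "their", "part_of_speech": "PRON", "location": [348, 353]},
        {"text": "Idle", "part_of_speech": "PROPN", "location": [663, 667]},
    ],
}


def find_pronouns_by_sentence(syntax):
    """Map each sentence text to the pronoun tokens located inside it."""
    result = {}
    for sentence in syntax["sentences"]:
        sent_begin, sent_end = sentence["location"]
        # A token belongs to a sentence when its offsets fall inside the
        # sentence's offsets.
        result[sentence["text"]] = [
            token["text"]
            for token in syntax["tokens"]
            if token["part_of_speech"] == "PRON"
            and sent_begin <= token["location"][0]
            and token["location"][1] <= sent_end
        ]
    return result


pronouns = find_pronouns_by_sentence(syntax)
print(pronouns)
```

The nested loop compares every token with every sentence, which is acceptable for a single document but is the kind of bookkeeping the DataFrame version avoids.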

In [42]:
# Find all the pronouns in each sentence.
# Pandas version.
pronouns_by_sentence = df[df["part_of_speech"] == "PRON"][["sentence", "token_span"]]
pronouns_by_sentence

Unnamed: 0,sentence,token_span
52,"[274, 405): 'It was conceived during the hiatu...","[274, 276): 'It'"
65,"[274, 405): 'It was conceived during the hiatu...","[348, 353): 'their'"
85,"[407, 642): 'While the group's first film, And...","[449, 458): 'Something'"
107,"[407, 642): 'While the group's first film, And...","[575, 579): 'that'"
161,"[824, 954): 'In the US, it was selected as the...","[835, 837): 'it'"
185,"[824, 954): 'In the US, it was selected as the...","[945, 948): 'Our'"
200,"[955, 1122): 'In the UK, readers of Total Film...","[1012, 1014): 'it'"
224,"[955, 1122): 'In the UK, readers of Total Film...","[1113, 1115): 'it'"
237,"[1122, 1256): '[5] In AD 932, King Arthur and ...","[1154, 1157): 'his'"
261,"[1257, 1488): 'Along the way, he recruits Sir ...","[1272, 1274): 'he'"


In [45]:
# How would the previous cell look if the tokens and sentences weren't pre-joined?
pronouns = df[df["part_of_speech"] == "PRON"]["token_span"]
pronouns_by_sentence = tp.contain_join(sentences["sentence"], pronouns, "sentence", "token_span")
pronouns_by_sentence

Unnamed: 0,sentence,token_span
0,"[274, 405): 'It was conceived during the hiatu...","[274, 276): 'It'"
1,"[274, 405): 'It was conceived during the hiatu...","[348, 353): 'their'"
2,"[407, 642): 'While the group's first film, And...","[449, 458): 'Something'"
3,"[407, 642): 'While the group's first film, And...","[575, 579): 'that'"
4,"[824, 954): 'In the US, it was selected as the...","[835, 837): 'it'"
5,"[824, 954): 'In the US, it was selected as the...","[945, 948): 'Our'"
6,"[955, 1122): 'In the UK, readers of Total Film...","[1012, 1014): 'it'"
7,"[955, 1122): 'In the UK, readers of Total Film...","[1113, 1115): 'it'"
8,"[1122, 1256): '[5] In AD 932, King Arthur and ...","[1154, 1157): 'his'"
9,"[1257, 1488): 'Along the way, he recruits Sir ...","[1272, 1274): 'he'"
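
Conceptually, the join above pairs each sentence span with the pronoun spans that lie inside it. The sketch below shows one way to compute such a containment join on plain `(begin, end)` tuples. It is not the implementation behind `tp.contain_join`; the function `contain_join_sorted` is hypothetical and assumes the outer spans never overlap one another, as sentence boundaries normally do not.

```python
from bisect import bisect_right


def contain_join_sorted(outer_spans, inner_spans):
    """Pair each inner span with the outer span that contains it.

    Assumes the outer spans do not overlap one another, so each inner
    span has at most one containing outer span.
    """
    outer_sorted = sorted(outer_spans)
    begins = [begin for begin, _ in outer_sorted]
    pairs = []
    for inner_begin, inner_end in inner_spans:
        # Rightmost outer span that starts at or before the inner span.
        ix = bisect_right(begins, inner_begin) - 1
        if ix >= 0 and inner_end <= outer_sorted[ix][1]:
            pairs.append((outer_sorted[ix], (inner_begin, inner_end)))
    return pairs


# Offsets copied from the sentence and pronoun tables in this notebook.
sentence_spans = [(0, 273), (274, 405), (407, 642)]
pronoun_spans = [(274, 276), (348, 353), (449, 458), (575, 579)]

joined = contain_join_sorted(sentence_spans, pronoun_spans)
for sentence_span, pronoun_span in joined:
    print(sentence_span, pronoun_span)
```

The binary search avoids comparing every pronoun with every sentence, at the cost of the non-overlap assumption. A general containment join over arbitrary spans cannot rely on it.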


In [46]:
# Ask the tokens of the first sentence to render themselves as HTML
sentence_tokens_df = df[df["sentence"] == sentences["sentence"].loc[0]]
sentence_tokens_df["char_span"].values

Unnamed: 0,begin,end,covered_text
0,0,5,Monty
1,6,12,Python
2,13,16,and
3,17,20,the
4,21,25,Holy
5,26,31,Grail
6,32,34,is
7,35,36,a
8,37,41,1975
9,42,49,British


In [48]:
# Display the first sentence's dependency parse.
# The error below shows that this call needs the `spacy` package to be installed.
sentence_tokens_df = df[df["sentence"] == sentences["sentence"].loc[0]]
tp.render_parse_tree(sentence_tokens_df, tag_col=None)

ModuleNotFoundError: No module named 'spacy'
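
The traceback above indicates that spaCy was not installed in the kernel that ran this cell. A small guard like the following can check for an optional dependency before a cell relies on it. This is a generic sketch: the helper `has_module` is hypothetical, and the `pip install spacy` hint assumes a pip-based environment.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if has_module("spacy"):
    print("spaCy is available; the parse tree can be rendered.")
else:
    print("spaCy is not installed; try `pip install spacy` before rendering.")
```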