# Watson NLU Example for Text Extensions for Pandas

## Introduction

This demo shows how to use the `watson` module from Text Extensions for Pandas to 
process a Watson NLU response from IBM Cloud into Pandas DataFrames for analysis.
Pandas is a de facto standard tool for data analysis in Python, and using Text Extensions for Pandas with
Watson NLU gives a user-friendly way to work with natural language understanding results.
See the following links for details on installation and environment setup:

https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#introduction

https://github.com/CODAIT/text-extensions-for-pandas

This notebook will first walk through how to authenticate with the IBM Watson SDK and 
make a request with the Watson NLU API. The response is then processed by 
Text Extensions for Pandas to convert the JSON response into several Pandas DataFrames.

Next, it will go deeper into the data received from Watson NLU and show how to use the
resulting Pandas DataFrames to easily filter and analyze the data to gain deeper insight.


## Authentication

This demo uses the IBM Watson Python SDK to authenticate with IBM Cloud using the 
`IAMAuthenticator`. See https://github.com/watson-developer-cloud/python-sdk#iam for more 
information. To authenticate, set the environment variable `IBM_API_KEY` to the API key
used for requests to `ibm_watson.NaturalLanguageUnderstandingV1`,
and set `IBM_SERVICE_URL` to the service URL of your IBM Watson instance.

In [1]:
# INITIALIZATION BOILERPLATE

# The Jupyter kernel for this notebook usually starts up inside the notebooks
# directory, but the text_extensions_for_pandas package code is in the parent
# directory. Add that parent directory to the front of the Python include path.
import sys
if sys.path[0] != "..":
    # Insert rather than overwrite, so the existing first entry is preserved
    sys.path.insert(0, "..")

import json
import os
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import (
    Features, CategoriesOptions, ConceptsOptions, EmotionOptions, EntitiesOptions,
    KeywordsOptions, MetadataOptions, RelationsOptions, SemanticRolesOptions,
    SentimentOptions, SyntaxOptions, SyntaxOptionsTokens,
)
import pandas as pd
import text_extensions_for_pandas as tp

In [2]:
# Retrieve the APIKEY for authentication
apikey = os.environ.get("IBM_API_KEY")
if apikey is None:
    raise ValueError("Expected apikey in the environment variable 'IBM_API_KEY'")

# Get the service URL for your IBM Cloud instance
ibm_cloud_service_url = os.environ.get("IBM_SERVICE_URL")
if ibm_cloud_service_url is None:
    raise ValueError("Expected IBM cloud service URL in the environment variable 'IBM_SERVICE_URL'")

In [3]:
# Initialize the authenticator for making requests
authenticator = IAMAuthenticator(apikey)
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2019-07-12',
    authenticator=authenticator
)

natural_language_understanding.set_service_url(ibm_cloud_service_url)

## Process the Watson NLU Response into Pandas DataFrames

The response should be decoded JSON, that is, a Python dictionary. The following features
are processed into DataFrames:

* entities
* keywords
* relations
* semantic_roles
* syntax with sentences and tokens

See https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#text-analytics-features
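
The dotted column names that appear in the DataFrames later in this notebook (for example `sentiment.label`) suggest that nested JSON objects end up as one column per leaf value. The sketch below illustrates that idea in plain Python. `flatten_record` is a hypothetical helper written for this note, not the function Text Extensions for Pandas uses, and the shape of the sample entity is an assumption modeled on the `entities` output shown in this notebook.

```python
def flatten_record(record, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix
            flat.update(flatten_record(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Assumed shape of one entity record; values mirror the first row of dfs["entities"]
entity = {
    "type": "Person",
    "text": "Arthur",
    "sentiment": {"label": "negative", "score": -0.312834},
    "relevance": 0.956097,
    "count": 12,
    "confidence": 1.0,
}

flat_entity = flatten_record(entity)
print(sorted(flat_entity))
```

Each key of `flat_entity` would then correspond to one DataFrame column, with one such dictionary per row.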

In [4]:
# Make the request
response = natural_language_understanding.analyze(
    url="https://raw.githubusercontent.com/CODAIT/text-extensions-for-pandas/master/resources/holy_grail.txt",
    return_analyzed_text=True,
    features=Features(
        entities=EntitiesOptions(sentiment=True),
        keywords=KeywordsOptions(sentiment=True, emotion=True),
        relations=RelationsOptions(),
        semantic_roles=SemanticRolesOptions(),
        syntax=SyntaxOptions(sentences=True, tokens=SyntaxOptionsTokens(lemma=True, part_of_speech=True))
    )).get_result()

In [5]:
# View response as JSON
print(json.dumps(response, indent=2))

{
  "usage": {
    "text_units": 1,
    "text_characters": 5338,
    "features": 4
  },
  "syntax": {
    "tokens": [
      {
        "text": "Monty",
        "part_of_speech": "PROPN",
        "location": [
          0,
          5
        ]
      },
      {
        "text": "Python",
        "part_of_speech": "PROPN",
        "location": [
          6,
          12
        ],
        "lemma": "python"
      },
      {
        "text": "and",
        "part_of_speech": "CCONJ",
        "location": [
          13,
          16
        ],
        "lemma": "and"
      },
      {
        "text": "the",
        "part_of_speech": "DET",
        "location": [
          17,
          20
        ],
        "lemma": "the"
      },
      {
        "text": "Holy",
        "part_of_speech": "PROPN",
        "location": [
          21,
          25
        ]
      },
      {
        "text": "Grail",
        "part_of_speech": "PROPN",
        "location": [
          26,
          31
        ]
      },


In [6]:
# Get the response as processed Pandas DataFrames
dfs = tp.watson_nlu_parse_response(response)

In [7]:
# Created DataFrames from the response
dfs.keys()

dict_keys(['syntax', 'entities', 'keywords', 'relations', 'semantic_roles'])

### View the created DataFrames

In [8]:
dfs["entities"].head()

Unnamed: 0,type,text,sentiment.label,sentiment.score,relevance,count,confidence,disambiguation.subtype,disambiguation.name,disambiguation.dbpedia_resource
0,Person,Arthur,negative,-0.312834,0.956097,12,1.0,,,
1,Person,Lancelot,positive,0.835873,0.678523,5,1.0,,,
2,Person,Monty Python,neutral,0.0,0.644313,2,0.977538,,,
3,Person,King Arthur,neutral,0.0,0.561727,2,0.992188,,,
4,Person,Sir Galahad,positive,0.835873,0.540271,2,0.999984,,,


In [9]:
dfs["keywords"].head()

Unnamed: 0,text,sentiment.label,sentiment.score,relevance,emotion.sadness,emotion.joy,emotion.fear,emotion.disgust,emotion.anger,count
0,legend of King Arthur,neutral,0.0,0.746577,0.175057,0.691404,0.058051,0.031335,0.071927,1
1,Sir Lancelot,positive,0.835873,0.642571,0.046902,0.810654,0.01634,0.095661,0.021033,1
2,King Arthur,neutral,0.0,0.642235,0.09149,0.747356,0.043658,0.033299,0.112061,1
3,Holy Grail,positive,0.724846,0.624115,0.125927,0.696048,0.103502,0.153742,0.110257,5
4,British comedy film,neutral,0.0,0.619836,0.056536,0.657384,0.108932,0.048683,0.128826,1


In [10]:
dfs["relations"].head()

Unnamed: 0,type,sentence_span,score,arguments.0.span,arguments.1.span,arguments.0.entities.type,arguments.1.entities.type,arguments.0.entities.text,arguments.1.entities.text,arguments.0.entities.disambiguation.subtype,arguments.1.entities.disambiguation.subtype
0,timeOf,"[0, 273): 'Monty Python and the Holy Grail is ...",0.462615,"[37, 41): '1975'","[57, 61): 'film'",Date,TitleWork,1975,comedy,,
1,locatedAt,"[1489, 1639): 'Arthur leads the men to Camelot...",0.339446,"[1506, 1509): 'men'","[1513, 1520): 'Camelot'",Person,GeopoliticalEntity,men,Camelot,,
2,affectedBy,"[1640, 1756): 'As they turn away, God (an imag...",0.604304,"[1699, 1703): 'them'","[1689, 1695): 'speaks'",Person,EventCommunication,their,speaks,,
3,locatedAt,"[1758, 1935): 'Searching the land for clues to...",0.304596,"[1794, 1799): 'Grail'","[1802, 1810): 'location'",Organization,Location,Grail,location,,
4,employedBy,"[1758, 1935): 'Searching the land for clues to...",0.895035,"[1872, 1880): 'soldiers'","[1865, 1871): 'French'",Person,GeopoliticalEntity,soldiers,French,,[Country]


In [11]:
dfs["semantic_roles"].head()

Unnamed: 0,subject.text,sentence,object.text,action.verb.text,action.verb.tense,action.text,action.normalized
0,Monty Python and the Holy Grail,Monty Python and the Holy Grail is a 1975 Brit...,a 1975 British comedy film concerning the Arth...,be,present,is,be
1,by the Monty Python comedy group of Graham Cha...,Monty Python and the Holy Grail is a 1975 Brit...,Monty Python and the Holy Grail,perform,past,written and performed,write and perform
2,It,It was conceived during the hiatus between th...,,conceive,past,was conceived,be conceive
3,a compilation of sketches,"While the group's first film, And Now for Som...",from the first two television series,be,past,was,be
4,Holy Grail,"While the group's first film, And Now for Som...",a new story that parodies the legend of King A...,be,present,is,be


In [12]:
dfs["syntax"].head()

Unnamed: 0,char_span,token_span,part_of_speech,lemma,sentence
0,"[0, 5): 'Monty'","[0, 5): 'Monty'",PROPN,,"[0, 273): 'Monty Python and the Holy Grail is ..."
1,"[6, 12): 'Python'","[6, 12): 'Python'",PROPN,python,"[0, 273): 'Monty Python and the Holy Grail is ..."
2,"[13, 16): 'and'","[13, 16): 'and'",CCONJ,and,"[0, 273): 'Monty Python and the Holy Grail is ..."
3,"[17, 20): 'the'","[17, 20): 'the'",DET,the,"[0, 273): 'Monty Python and the Holy Grail is ..."
4,"[21, 25): 'Holy'","[21, 25): 'Holy'",PROPN,,"[0, 273): 'Monty Python and the Holy Grail is ..."


## Using Pandas to find all pronouns in each sentence

Next, we take the syntax data from the Watson NLU response and find all pronouns in each sentence, first using standard Python and then using Pandas.

In [13]:
syntax = dfs["syntax"]

# Retrieve sentence information from the above dataframe
sentences = pd.DataFrame({"sentence": syntax["sentence"].unique()})
sentences.head()

Unnamed: 0,sentence
0,"[0, 273): 'Monty Python and the Holy Grail is ..."
1,"[274, 405): 'It was conceived during the hiatu..."
2,"[407, 642): 'While the group's first film, And..."
3,"[643, 720): 'Thirty years later, Idle used the..."
4,"[722, 823): 'Monty Python and the Holy Grail g..."


In [14]:
# Find all the pronouns in each sentence, *without* using Pandas.
# NON-scalable traversal of the syntax analysis data structure
# (runs in time proportional to the square of document length).

response_sentences = response["syntax"]["sentences"]
response_tokens = response["syntax"]["tokens"]

pronouns_by_sentence = {s["text"]: [] for s in response_sentences}

# Nested for loops. 
# Running time: O(num_tokens * num_sentences), i.e. O(document_size^2)
for t in response_tokens:
    pos_str = t["part_of_speech"]  # POS tag is already a string, e.g. "PRON"
    if pos_str == "PRON":
        found_sentence = False
        for s in response_sentences:
            if (t["location"][0] >= s["location"][0] 
                    and t["location"][1] <= s["location"][1]):
                found_sentence = True
                pronouns_by_sentence[s["text"]].append(t)
        if not found_sentence:
            raise ValueError(f"Token {t} is not in any sentence")
        
pronouns_by_sentence

{'Monty Python and the Holy Grail is a 1975 British comedy film concerning the Arthurian legend, written and performed by the Monty Python comedy group of Graham Chapman, John Cleese, Terry Gilliam, Eric Idle, Terry Jones and Michael Palin, and directed by Gilliam and Jones.': [],
 "It was conceived during the hiatus between the third and fourth series of their BBC television series Monty Python's Flying Circus.": [{'text': 'It',
   'part_of_speech': 'PRON',
   'location': [274, 276],
   'lemma': 'it'},
  {'text': 'their',
   'part_of_speech': 'PRON',
   'location': [348, 353],
   'lemma': 'their'}],
 "While the group's first film, And Now for Something Completely Different, was a compilation of sketches from the first two television series, Holy Grail is a new story that parodies the legend of King Arthur's quest for the Holy Grail.": [{'text': 'Something',
   'part_of_speech': 'PRON',
   'location': [449, 458],
   'lemma': 'something'},
  {'text': 'that',
   'part_of_speech': 'PRON',

In [15]:
# Find all the pronouns in each sentence.
# Pandas version.
pronouns_by_sentence = syntax[syntax["part_of_speech"] == "PRON"][["sentence", "token_span"]]
pronouns_by_sentence

Unnamed: 0,sentence,token_span
52,"[274, 405): 'It was conceived during the hiatu...","[274, 276): 'It'"
65,"[274, 405): 'It was conceived during the hiatu...","[348, 353): 'their'"
85,"[407, 642): 'While the group's first film, And...","[449, 458): 'Something'"
107,"[407, 642): 'While the group's first film, And...","[575, 579): 'that'"
161,"[824, 954): 'In the US, it was selected as the...","[835, 837): 'it'"
185,"[824, 954): 'In the US, it was selected as the...","[945, 948): 'Our'"
200,"[955, 1122): 'In the UK, readers of Total Film...","[1012, 1014): 'it'"
224,"[955, 1122): 'In the UK, readers of Total Film...","[1113, 1115): 'it'"
237,"[1122, 1256): '[5] In AD 932, King Arthur and ...","[1154, 1157): 'his'"
261,"[1257, 1488): 'Along the way, he recruits Sir ...","[1272, 1274): 'he'"
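
The nested loops in `In [14]` compare every pronoun against every sentence. As a point of comparison, here is a sketch of a plain-Python alternative that finds each token's sentence with a binary search. It assumes the sentences are sorted by begin offset and do not overlap. The function name and the short sentence labels are illustrative; the offsets are taken from the outputs above.

```python
from bisect import bisect_right


def pronouns_by_sentence_sorted(sentences, tokens):
    """Group PRON tokens by sentence using one binary search per token."""
    begins = [s["location"][0] for s in sentences]
    grouped = {s["text"]: [] for s in sentences}
    for t in tokens:
        if t["part_of_speech"] != "PRON":
            continue
        # Index of the last sentence that starts at or before the token
        i = bisect_right(begins, t["location"][0]) - 1
        if i < 0 or t["location"][1] > sentences[i]["location"][1]:
            raise ValueError(f"Token {t} is not in any sentence")
        grouped[sentences[i]["text"]].append(t["text"])
    return grouped


# Toy input: sentence texts are abbreviated to labels for readability
toy_sentences = [
    {"text": "S0", "location": [0, 273]},
    {"text": "S1", "location": [274, 405]},
    {"text": "S2", "location": [407, 642]},
]
toy_tokens = [
    {"text": "Monty", "part_of_speech": "PROPN", "location": [0, 5]},
    {"text": "It", "part_of_speech": "PRON", "location": [274, 276]},
    {"text": "their", "part_of_speech": "PRON", "location": [348, 353]},
    {"text": "Something", "part_of_speech": "PRON", "location": [449, 458]},
]

grouped_pronouns = pronouns_by_sentence_sorted(toy_sentences, toy_tokens)
print(grouped_pronouns)
```

The Pandas version remains shorter because the `syntax` DataFrame already carries the sentence for every token.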


In [16]:
# Show all pronouns that occur in sentences containing 'Arthur'
mask = pronouns_by_sentence["sentence"].map(lambda s: s.covered_text).str.contains("Arthur")
pronouns_by_sentence["token_span"][mask].values

Unnamed: 0,begin,end,begin_token,end_token,covered_text
0,449,458,85,86,Something
1,575,579,107,108,that
2,1154,1157,237,238,his
3,1582,1584,334,335,he
4,1617,1619,341,342,it
5,1643,1647,350,351,they
6,1699,1703,365,366,them
7,1823,1826,390,391,his
8,1881,1884,401,402,who
9,2242,2247,472,473,their


In [17]:
# How would the previous cell look if the tokens and sentences weren't pre-joined?
pronouns = syntax[syntax["part_of_speech"] == "PRON"]["token_span"]
pronouns_by_sentence = tp.contain_join(sentences["sentence"], pronouns, "sentence", "token_span")
pronouns_by_sentence

Unnamed: 0,sentence,token_span
0,"[274, 405): 'It was conceived during the hiatu...","[274, 276): 'It'"
1,"[274, 405): 'It was conceived during the hiatu...","[348, 353): 'their'"
2,"[407, 642): 'While the group's first film, And...","[449, 458): 'Something'"
3,"[407, 642): 'While the group's first film, And...","[575, 579): 'that'"
4,"[824, 954): 'In the US, it was selected as the...","[835, 837): 'it'"
5,"[824, 954): 'In the US, it was selected as the...","[945, 948): 'Our'"
6,"[955, 1122): 'In the UK, readers of Total Film...","[1012, 1014): 'it'"
7,"[955, 1122): 'In the UK, readers of Total Film...","[1113, 1115): 'it'"
8,"[1122, 1256): '[5] In AD 932, King Arthur and ...","[1154, 1157): 'his'"
9,"[1257, 1488): 'Along the way, he recruits Sir ...","[1272, 1274): 'he'"


In [18]:
relations = dfs["relations"]
relations

Unnamed: 0,type,sentence_span,score,arguments.0.span,arguments.1.span,arguments.0.entities.type,arguments.1.entities.type,arguments.0.entities.text,arguments.1.entities.text,arguments.0.entities.disambiguation.subtype,arguments.1.entities.disambiguation.subtype
0,timeOf,"[0, 273): 'Monty Python and the Holy Grail is ...",0.462615,"[37, 41): '1975'","[57, 61): 'film'",Date,TitleWork,1975,comedy,,
1,locatedAt,"[1489, 1639): 'Arthur leads the men to Camelot...",0.339446,"[1506, 1509): 'men'","[1513, 1520): 'Camelot'",Person,GeopoliticalEntity,men,Camelot,,
2,affectedBy,"[1640, 1756): 'As they turn away, God (an imag...",0.604304,"[1699, 1703): 'them'","[1689, 1695): 'speaks'",Person,EventCommunication,their,speaks,,
3,locatedAt,"[1758, 1935): 'Searching the land for clues to...",0.304596,"[1794, 1799): 'Grail'","[1802, 1810): 'location'",Organization,Location,Grail,location,,
4,employedBy,"[1758, 1935): 'Searching the land for clues to...",0.895035,"[1872, 1880): 'soldiers'","[1865, 1871): 'French'",Person,GeopoliticalEntity,soldiers,French,,[Country]
5,employedBy,"[4849, 4985): 'Arthur and Bedevere eventually ...",0.903545,"[4952, 4960): 'soldiers'","[4945, 4951): 'French'",Person,GeopoliticalEntity,soldiers,French,,[Country]
6,agentOf,"[1758, 1935): 'Searching the land for clues to...",0.945371,"[1881, 1884): 'who'","[1885, 1890): 'claim'",Person,EventCommunication,soldiers,claim,,
7,agentOf,"[2310, 2476): 'A modern-day historian filming ...",0.943395,"[2455, 2461): 'police'","[2462, 2475): 'investigation'",Organization,EventLegal,Grail,investigation,,
8,agentOf,"[2652, 2748): 'Sir Robin avoids a fight with a...",0.600505,"[2656, 2661): 'Robin'","[2740, 2747): 'arguing'",Person,EventCommunication,Robin,arguing,,
9,locatedAt,"[2894, 3196): 'Lancelot, after receiving an ar...",0.542254,"[2894, 2902): 'Lancelot'","[3038, 3044): 'castle'",Person,Facility,Lancelot,castle,,


In [19]:
arg_0_spans = relations["arguments.0.span"]
arg_1_spans = relations["arguments.1.span"]
arg_0_spans

0                                  [37, 41): '1975'
1                               [1506, 1509): 'men'
2                              [1699, 1703): 'them'
3                             [1794, 1799): 'Grail'
4                          [1872, 1880): 'soldiers'
5                          [4952, 4960): 'soldiers'
6                               [1881, 1884): 'who'
7                            [2455, 2461): 'police'
8                             [2656, 2661): 'Robin'
9                          [2894, 2902): 'Lancelot'
10                           [3163, 3169): 'father'
11                              [3346, 3349): 'who'
12          [124, 149): 'Monty Python comedy group'
13                              [3346, 3349): 'who'
14                            [3373, 3378): 'where'
15                          [3541, 3548): 'knights'
16                           [3560, 3566): 'Rabbit'
17                  [3676, 3691): 'Brother Maynard'
18                  [3967, 3982): 'Brother Maynard'
19          

In [20]:
doc_text = relations["arguments.0.span"].iloc[0].target_text

In [21]:
import spacy
spacy_language_model = spacy.load("en_core_web_sm")
token_features = tp.make_tokens_and_features(doc_text, spacy_language_model)
token_features

Unnamed: 0,id,char_span,token_span,lemma,pos,tag,dep,head,shape,ent_iob,ent_type,is_alpha,is_stop,sentence
0,0,"[0, 5): 'Monty'","[0, 5): 'Monty'",Monty,PROPN,NNP,compound,1,Xxxxx,B,PERSON,True,False,"[0, 273): 'Monty Python and the Holy Grail is ..."
1,1,"[6, 12): 'Python'","[6, 12): 'Python'",Python,PROPN,NNP,nsubj,6,Xxxxx,I,PERSON,True,False,"[0, 273): 'Monty Python and the Holy Grail is ..."
2,2,"[13, 16): 'and'","[13, 16): 'and'",and,CCONJ,CC,cc,1,xxx,O,,True,True,"[0, 273): 'Monty Python and the Holy Grail is ..."
3,3,"[17, 20): 'the'","[17, 20): 'the'",the,DET,DT,det,5,xxx,B,ORG,True,True,"[0, 273): 'Monty Python and the Holy Grail is ..."
4,4,"[21, 25): 'Holy'","[21, 25): 'Holy'",Holy,PROPN,NNP,compound,5,Xxxx,I,ORG,True,False,"[0, 273): 'Monty Python and the Holy Grail is ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1079,1079,"[5315, 5323): 'breaking'","[5315, 5323): 'breaking'",break,VERB,VBG,acl,1078,xxxx,O,,True,False,"[5275, 5338): 'The movie ends with one of the ..."
1080,1080,"[5324, 5327): 'the'","[5324, 5327): 'the'",the,DET,DT,det,1081,xxx,O,,True,True,"[5275, 5338): 'The movie ends with one of the ..."
1081,1081,"[5328, 5334): 'camera'","[5328, 5334): 'camera'",camera,NOUN,NN,dobj,1079,xxxx,O,,True,False,"[5275, 5338): 'The movie ends with one of the ..."
1082,1082,"[5334, 5335): '.'","[5334, 5335): '.'",.,PUNCT,.,punct,1073,.,O,,False,False,"[5275, 5338): 'The movie ends with one of the ..."


In [22]:
g = tp.token_features_to_traversal(token_features)
g

<text_extensions_for_pandas.gremlin.traversal.constant.PrecomputedTraversal at 0x7fce63e5c410>

In [23]:
# Uncomment to restrict analysis to only certain arguments
#some_arg_0_spans = arg_0_spans.iloc[[2]]
some_arg_0_spans = arg_0_spans
#some_arg_0_spans

In [24]:
query = (
    # Start with all the arg0 values
    g.V().has('char_span', tp.within(*some_arg_0_spans)).as_("child0")
    # Enumerate all the parent, grandparent, etc. nodes of the arg0s
    .emit().repeat(tp.__.out("head")).as_("ancestor")
    # Enumerate all children of ancestors...
    .emit().repeat(tp.__.in_('head')).as_("sibling")
    # ...that are in the arg_1_spans list
    .has("char_span", tp.within(*arg_1_spans)).as_("child1")
    
    # Generate output with named columns
    .select("child0", "ancestor", "child1")
    .by("token_span")
).compute()

In [25]:
df = query.toDataFrame()
df

Unnamed: 0,child0,ancestor,child1
0,"[1450, 1455): 'their'","[1450, 1455): 'their'","[1450, 1455): 'their'"
1,"[37, 41): '1975'","[57, 61): 'film'","[57, 61): 'film'"
2,"[1450, 1455): 'their'","[1456, 1463): 'squires'","[1456, 1463): 'squires'"
3,"[1794, 1799): 'Grail'","[1802, 1810): 'location'","[1802, 1810): 'location'"
4,"[1881, 1884): 'who'","[1885, 1890): 'claim'","[1885, 1890): 'claim'"
...,...,...,...
153,"[2894, 2902): 'Lancelot'","[2957, 2965): 'believed'","[3157, 3162): 'whose'"
154,"[3163, 3169): 'father'","[2957, 2965): 'believed'","[3157, 3162): 'whose'"
155,"[4560, 4563): 'his'","[4504, 4506): 'to'","[4585, 4591): 'bridge'"
156,"[4492, 4494): 'he'","[4495, 4503): 'responds'","[4585, 4591): 'bridge'"


In [26]:
filtered_df = (
    df
    # df contains every (child0, child1) pair drawn from the relation arguments,
    # even pairs whose members come from different relations in the same sentence.
    # Keep only the pairs that also appear together in a row of the original
    # relations dataframe.
    .merge(relations[["arguments.0.span", "arguments.1.span"]], left_on=["child0", "child1"],
           right_on=["arguments.0.span", "arguments.1.span"])
    [["arguments.0.span", "arguments.1.span", "ancestor"]]
)
filtered_df

Unnamed: 0,arguments.0.span,arguments.1.span,ancestor
0,"[37, 41): '1975'","[57, 61): 'film'","[57, 61): 'film'"
1,"[37, 41): '1975'","[57, 61): 'film'","[32, 34): 'is'"
2,"[1450, 1455): 'their'","[1456, 1463): 'squires'","[1456, 1463): 'squires'"
3,"[1450, 1455): 'their'","[1456, 1463): 'squires'","[1445, 1449): 'with'"
4,"[1450, 1455): 'their'","[1456, 1463): 'squires'","[1439, 1444): 'along'"
...,...,...,...
93,"[4560, 4563): 'his'","[4585, 4591): 'bridge'","[4532, 4540): 'question'"
94,"[4560, 4563): 'his'","[4585, 4591): 'bridge'","[4504, 4506): 'to'"
95,"[4560, 4563): 'his'","[4585, 4591): 'bridge'","[4495, 4503): 'responds'"
96,"[1393, 1401): 'Lancelot'","[1450, 1455): 'their'","[1393, 1401): 'Lancelot'"
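
The merge in `In [26]` acts as an inner join on the pair (`child0`, `child1`). A rough plain-Python picture of the same filtering step, using a set of known relation pairs, is sketched below. The rows are abbreviated to covered text, and the variable names are hypothetical.

```python
# Candidate rows as produced by the traversal (abbreviated to covered text)
candidate_rows = [
    {"child0": "1975", "ancestor": "film", "child1": "film"},
    {"child0": "1975", "ancestor": "is", "child1": "film"},
    {"child0": "their", "ancestor": "their", "child1": "their"},
]

# Argument pairs that actually occur together in the relations table
relation_pairs = {("1975", "film"), ("their", "squires")}

# Keep a row only if its (child0, child1) pair is a known relation
kept = [
    row for row in candidate_rows
    if (row["child0"], row["child1"]) in relation_pairs
]
print(kept)
```

One difference from a real merge: if the same pair appeared in several relation rows, the join would repeat the matching candidate rows, whereas this set-based filter keeps each row once.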


In [27]:
# Add some token metadata columns
augmented_df = (
    filtered_df
    .merge(token_features, left_on="ancestor", right_on="char_span")
    [["arguments.0.span", "arguments.1.span", "ancestor", "head", "id"]]
    .reset_index()  # Give each row a unique ID
)
augmented_df

Unnamed: 0,index,arguments.0.span,arguments.1.span,ancestor,head,id
0,0,"[37, 41): '1975'","[57, 61): 'film'","[57, 61): 'film'",6,11
1,1,"[37, 41): '1975'","[57, 61): 'film'","[32, 34): 'is'",6,6
2,2,"[1450, 1455): 'their'","[1456, 1463): 'squires'","[1456, 1463): 'squires'",305,307
3,3,"[1450, 1455): 'their'","[1456, 1463): 'squires'","[1445, 1449): 'with'",304,305
4,4,"[1450, 1455): 'their'","[1456, 1463): 'squires'","[1439, 1444): 'along'",290,304
...,...,...,...,...,...,...
93,93,"[4560, 4563): 'his'","[4585, 4591): 'bridge'","[4557, 4559): 'of'",924,925
94,94,"[4560, 4563): 'his'","[4585, 4591): 'bridge'","[4548, 4556): 'question'",922,924
95,95,"[4560, 4563): 'his'","[4585, 4591): 'bridge'","[4541, 4545): 'with'",921,922
96,96,"[4560, 4563): 'his'","[4585, 4591): 'bridge'","[4532, 4540): 'question'",914,921


In [28]:
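# Find every ancestor row that is the head of another ancestor row,
# i.e. every ancestor that is not the least common ancestor; these rows get removed next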
to_remove = augmented_df.merge(augmented_df, 
                               left_on="head", right_on="id")[["arguments.0.span_y", "id_y", "index_y"]]
to_remove

Unnamed: 0,arguments.0.span_y,id_y,index_y
0,"[37, 41): '1975'",6,1
1,"[37, 41): '1975'",6,1
2,"[1450, 1455): 'their'",305,3
3,"[1450, 1455): 'their'",304,4
4,"[1450, 1455): 'their'",290,5
...,...,...,...
194,"[4560, 4563): 'his'",925,93
195,"[4560, 4563): 'his'",924,94
196,"[4560, 4563): 'his'",922,95
197,"[4560, 4563): 'his'",921,96


In [29]:
# Now we can compute the least common ancestor of each pair
lca_df = (
    augmented_df[~augmented_df["index"].isin(to_remove["index_y"])]
    [["arguments.0.span", "arguments.1.span", "ancestor", 'id']]
)
lca_df

Unnamed: 0,arguments.0.span,arguments.1.span,ancestor,id
0,"[37, 41): '1975'","[57, 61): 'film'","[57, 61): 'film'",11
2,"[1450, 1455): 'their'","[1456, 1463): 'squires'","[1456, 1463): 'squires'",307
7,"[1794, 1799): 'Grail'","[1802, 1810): 'location'","[1802, 1810): 'location'",384
15,"[1881, 1884): 'who'","[1885, 1890): 'claim'","[1885, 1890): 'claim'",400
26,"[2455, 2461): 'police'","[2462, 2475): 'investigation'","[2462, 2475): 'investigation'",508
44,"[5100, 5106): 'castle'","[5088, 5095): 'assault'","[5088, 5095): 'assault'",1032
49,"[4830, 4839): 'historian'","[4812, 4825): 'investigating'","[4812, 4825): 'investigating'",981
54,"[5065, 5072): 'knights'","[5057, 5061): 'army'","[5057, 5061): 'army'",1026
55,"[3163, 3169): 'father'","[3157, 3162): 'whose'","[3163, 3169): 'father'",651
64,"[4952, 4960): 'soldiers'","[4945, 4951): 'French'","[4952, 4960): 'soldiers'",1006


In [30]:
# TODO: Replace with something equivalent 
'''
#Use a Gremlin query to find all the children of this LCA
ancestors = lca_df['ancestor']


subtree = (
            #select ancestor Vertex
            g.V().has('char_span' ,tp.within(*ancestors)).as_('LCA')
            .emit().repeat(tp.__.in_('head')).as_('child').select('LCA','child').by('id')
            ).compute()

st_df = subtree.toDataFrame()
st_df
'''

Unnamed: 0,LCA,child
0,11,11
1,307,307
2,384,384
3,400,400
4,508,508
...,...,...
127,11,40
128,11,39
129,11,41
130,11,43


In [31]:
'''
#use pandas to group sets by LCA id

def group_children(series): return [r for _,r in series.items()]

subtree_df = st_df.groupby(['LCA'],as_index=False).aggregate(group_children).rename(columns={'child': 'children'})
subtree_df
'''

Unnamed: 0,LCA,children
0,11,"[11, 7, 8, 9, 10, 12, 15, 13, 14, 16, 17, 18, ..."
1,307,"[307, 306, 308, 311, 309, 310]"
2,384,"[384, 382, 381, 383]"
3,400,"[400, 399, 402, 401, 404, 405, 406, 403, 408, ..."
4,508,"[508, 506, 507]"
5,651,"[651, 650]"
6,701,"[701, 696, 700, 704, 695, 697, 694, 702, 703, ..."
7,927,"[927, 926, 929, 928, 933, 930, 931, 932]"
8,940,"[940, 939, 941, 942, 943]"
9,981,"[981, 985, 983, 982, 984]"


In [32]:
'''
# Choose a row to show
row = 6


print('Displaying local parse tree for the following relation:')
display(lca_df[lca_df["id"] == subtree_df.at[row,'LCA']].drop(columns = ['id']))

#Select the spacy outputs of tokens that are members of the subtree 
selected_df = token_features[token_features['id'].isin(subtree_df.at[row,'children'])]

tp.render_parse_tree(selected_df)
'''

Displaying local parse tree for the following relation:


Unnamed: 0,arguments.0.span,arguments.1.span,ancestor
82,"[3373, 3378): 'where'","[3383, 3391): 'location'","[3408, 3412): 'said'"


In [33]:
'''
# Also display all the elements of the subtree, as extracted from spaCy
selected_df
'''

Unnamed: 0,id,char_span,token_span,lemma,pos,tag,dep,head,shape,ent_iob,ent_type,is_alpha,is_stop,sentence
694,694,"[3373, 3378): 'where'","[3373, 3378): 'where'",where,ADV,WRB,advmod,704,xxxx,O,,True,True,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
695,695,"[3379, 3382): 'the'","[3379, 3382): 'the'",the,DET,DT,det,696,xxx,O,,True,True,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
696,696,"[3383, 3391): 'location'","[3383, 3391): 'location'",location,NOUN,NN,nsubjpass,701,xxxx,O,,True,False,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
697,697,"[3392, 3394): 'of'","[3392, 3394): 'of'",of,ADP,IN,prep,696,xx,O,,True,True,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
698,698,"[3395, 3398): 'the'","[3395, 3398): 'the'",the,DET,DT,det,699,xxx,O,,True,True,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
699,699,"[3399, 3404): 'Grail'","[3399, 3404): 'Grail'",Grail,PROPN,NNP,pobj,697,Xxxxx,B,PERSON,True,False,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
700,700,"[3405, 3407): 'is'","[3405, 3407): 'is'",be,VERB,VBZ,auxpass,701,xx,O,,True,True,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
701,701,"[3408, 3412): 'said'","[3408, 3412): 'said'",say,VERB,VBN,relcl,693,xxxx,O,,True,False,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
702,702,"[3413, 3415): 'to'","[3413, 3415): 'to'",to,PART,TO,aux,704,xx,O,,True,True,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
703,703,"[3416, 3418): 'be'","[3416, 3418): 'be'",be,VERB,VB,auxpass,704,xx,O,,True,True,"[3317, 3427): 'They meet Tim the Enchanter, wh..."
