<h1>Text Extensions for Pandas</h1>
<h2>Interactive Dataframe Widget</h2>
The interactive dataframe widget is part of Text Extensions for Pandas, an open-source Python library from the IBM CODAIT team. The widget aims to give data scientists a meaningful, visual way to inspect and interpret NLP (Natural Language Processing) data.

This demo will walk you through an example session with the widget and the related visualizers provided in the ```jupyter``` sub-module of Text Extensions for Pandas.

In [1]:
import os
import regex
import sys
import numpy as np
import pandas as pd

# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp

This demo uses the CoNLL-2003 dataset, a benchmark dataset for named entity recognition. For our purposes, we are interested in its four entity classes: ```locations (LOC)```, ```persons (PER)```, ```organizations (ORG)``` and ```miscellaneous (MISC)```.

We will use Text Extensions for Pandas to download and parse the CoNLL dataset into dataframes for us to work with.
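
To give a rough idea of what "parsing" means here, the sketch below reads a few lines in the CoNLL-2003 column layout (token, part-of-speech tag, chunk tag, entity tag, with blank lines between sentences) using only the standard library. The sample text, the tag values, and the `parse_conll` helper are illustrative assumptions; this is not how Text Extensions for Pandas implements its parser.

```python
from typing import Dict, List

# Hypothetical sample in CoNLL-2003 column layout: token, POS tag, chunk tag, entity tag.
# B-/I- prefixes are used here for readability; the tag values are made up for illustration.
SAMPLE = """\
-DOCSTART- -X- -X- O

Japan NNP B-NP B-LOC
began VBD B-VP O
the DT B-NP O
defence NN I-NP O

Nadim NNP B-NP B-PER
Ladki NNP I-NP I-PER
"""


def parse_conll(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-style text into sentences, each a list of token records."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence (if one is open).
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, phrase, ent = line.split()
        current.append({"token": token, "pos": pos, "phrase": phrase, "ent": ent})
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
for sentence in sentences:
    print([(record["token"], record["ent"]) for record in sentence])
```

The library call in the next cell goes further than this sketch: it returns one dataframe per document and stores each token as a span over the document text rather than as a bare string.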

In [2]:
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
#  to the terms of the license when using this data set!
data_set_info = tp.io.conll.maybe_download_conll_data("outputs")
data_set_info

{'train': 'outputs/eng.train',
 'dev': 'outputs/eng.testa',
 'test': 'outputs/eng.testb'}

In [3]:
gold_standard = tp.io.conll.conll_2003_to_dataframes(
    data_set_info["test"], ["pos", "phrase", "ent"], [False, True, True])
gold_standard = [
    df.drop(columns=["pos", "phrase_iob", "phrase_type"])
    for df in gold_standard
]


Once we have our dataset downloaded and parsed, we can prepare our dataframe for visualization.

In [4]:
tokens = gold_standard[0]
tokens

Unnamed: 0,span,ent_iob,ent_type,sentence,line_num
0,"[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'",0
1,"[11, 17): 'SOCCER'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",2
2,"[17, 18): '-'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",3
3,"[19, 24): 'JAPAN'",B,LOC,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",4
4,"[25, 28): 'GET'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",5
...,...,...,...,...,...
415,"[2178, 2182): 'each'",O,,"[2138, 2197): 'All four teams are level with o...",437
416,"[2183, 2187): 'from'",O,,"[2138, 2197): 'All four teams are level with o...",438
417,"[2188, 2191): 'one'",O,,"[2138, 2197): 'All four teams are level with o...",439
418,"[2192, 2196): 'game'",O,,"[2138, 2197): 'All four teams are level with o...",440


In [5]:
sentences = tokens["sentence"].unique()
sentences

Unnamed: 0,begin,end,begin token,end token,context
0,0,10,0,1,-DOCSTART-
1,11,65,1,13,"SOCCER- JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT."
2,66,77,13,15,Nadim Ladki
3,78,117,15,21,"AL-AIN, United Arab Emirates 1996-12-06"
4,118,244,21,46,Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday.
5,245,374,46,71,"But China saw their luck desert them in the second match of the group, crashing to a surprise 2-0 defeat to newcomers Uzbekistan."
6,375,617,71,113,China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net.
7,618,735,113,136,"Oleg Shatskiku made sure of the win in injury time, hitting an unstoppable left foot shot from just outside the area."
8,736,821,136,153,The former Soviet republic was playing in an Asian Cup finals tie for the first time.
9,822,917,153,171,"Despite winning the Asian Games title two years ago, Uzbekistan are in the finals as outsiders."


In [6]:
entity_mentions = tp.io.conll.iob_to_spans(tokens)
entity_mentions

Unnamed: 0,span,ent_type
0,"[19, 24): 'JAPAN'",LOC
1,"[40, 45): 'CHINA'",PER
2,"[66, 77): 'Nadim Ladki'",PER
3,"[78, 84): 'AL-AIN'",LOC
4,"[86, 106): 'United Arab Emirates'",LOC
5,"[118, 123): 'Japan'",LOC
6,"[151, 160): 'Asian Cup'",MISC
7,"[196, 201): 'Syria'",LOC
8,"[249, 254): 'China'",LOC
9,"[363, 373): 'Uzbekistan'",LOC
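
The step from per-token ```ent_iob``` / ```ent_type``` tags to entity spans can be pictured with a small pure-Python sketch. It assumes tokens are given as `(begin, end, iob, ent_type)` tuples with offsets relative to a short sample string; the `merge_iob_tokens` helper is hypothetical and only illustrates the idea, not the library's actual implementation of ```iob_to_spans```.

```python
from typing import List, Tuple

Token = Tuple[int, int, str, str]   # (begin, end, iob tag, entity type)
Entity = Tuple[int, int, str, str]  # (begin, end, covered text, entity type)

# Sample sentence from the document; offsets below are relative to this string.
text = "AL-AIN, United Arab Emirates 1996-12-06"
tokens: List[Token] = [
    (0, 6, "B", "LOC"),    # AL-AIN
    (6, 7, "O", ""),       # ,
    (8, 14, "B", "LOC"),   # United
    (15, 19, "I", "LOC"),  # Arab
    (20, 28, "I", "LOC"),  # Emirates
    (29, 39, "O", ""),     # 1996-12-06
]


def merge_iob_tokens(tokens: List[Token], text: str) -> List[Entity]:
    """Merge runs of B/I tokens into one span per entity mention."""
    spans: List[Entity] = []
    current = None  # (begin, end, ent_type) of the entity under construction
    for begin, end, iob, ent_type in tokens:
        if iob == "I" and current is not None and current[2] == ent_type:
            # Continuation of the open entity: extend its end offset.
            current = (current[0], end, ent_type)
            continue
        if current is not None:
            # Any other tag closes the open entity.
            spans.append((current[0], current[1], text[current[0]:current[1]], current[2]))
            current = None
        if iob in ("B", "I"):
            # "B" starts a new entity; a stray "I" is treated as a start too.
            current = (begin, end, ent_type)
    if current is not None:
        spans.append((current[0], current[1], text[current[0]:current[1]], current[2]))
    return spans


entities = merge_iob_tokens(tokens, text)
print(entities)
```

Note that the merged span for the three-token mention covers the spaces between its tokens, which is why the library reports a single span for 'United Arab Emirates' rather than three.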


In [7]:
entity_sentence_pairs = tp.spanner.contain_join(pd.Series(sentences), entity_mentions["span"], "sentence", "span")
entity_sentence_pairs

Unnamed: 0,sentence,span
0,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...","[19, 24): 'JAPAN'"
1,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...","[40, 45): 'CHINA'"
2,"[66, 77): 'Nadim Ladki'","[66, 77): 'Nadim Ladki'"
3,"[78, 117): 'AL-AIN, United Arab Emirates 1996-...","[78, 84): 'AL-AIN'"
4,"[78, 117): 'AL-AIN, United Arab Emirates 1996-...","[86, 106): 'United Arab Emirates'"
5,"[118, 244): 'Japan began the defence of their ...","[118, 123): 'Japan'"
6,"[118, 244): 'Japan began the defence of their ...","[151, 160): 'Asian Cup'"
7,"[118, 244): 'Japan began the defence of their ...","[196, 201): 'Syria'"
8,"[245, 374): 'But China saw their luck desert t...","[249, 254): 'China'"
9,"[245, 374): 'But China saw their luck desert t...","[363, 373): 'Uzbekistan'"
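
Conceptually, the containment join above pairs every sentence span with each entity span that lies entirely inside it. A naive sketch of that idea, using plain `(begin, end)` tuples taken from the tables above, might look like the following. The `naive_contain_join` function is a hypothetical helper with quadratic cost; the library's implementation may work quite differently.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character range [begin, end)

# Offsets copied from the first sentences and entity mentions of the document.
sentence_spans: List[Span] = [(11, 65), (66, 77), (78, 117)]
entity_spans: List[Span] = [(19, 24), (40, 45), (66, 77), (78, 84), (86, 106)]


def naive_contain_join(outer: List[Span], inner: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every (outer, inner) pair where the outer span fully contains the inner one."""
    return [
        (o, i)
        for o in outer
        for i in inner
        if o[0] <= i[0] and i[1] <= o[1]
    ]


pairs = naive_contain_join(sentence_spans, entity_spans)
for sentence, entity in pairs:
    print(sentence, "contains", entity)
```

A span counts as contained even when it is identical to the outer span, which is how the one-mention sentence 'Nadim Ladki' ends up paired with itself in the output above.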


In [8]:
entity_mentions = entity_mentions.merge(entity_sentence_pairs)
entity_mentions

Unnamed: 0,span,ent_type,sentence
0,"[19, 24): 'JAPAN'",LOC,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ..."
1,"[40, 45): 'CHINA'",PER,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ..."
2,"[66, 77): 'Nadim Ladki'",PER,"[66, 77): 'Nadim Ladki'"
3,"[78, 84): 'AL-AIN'",LOC,"[78, 117): 'AL-AIN, United Arab Emirates 1996-..."
4,"[86, 106): 'United Arab Emirates'",LOC,"[78, 117): 'AL-AIN, United Arab Emirates 1996-..."
5,"[118, 123): 'Japan'",LOC,"[118, 244): 'Japan began the defence of their ..."
6,"[151, 160): 'Asian Cup'",MISC,"[118, 244): 'Japan began the defence of their ..."
7,"[196, 201): 'Syria'",LOC,"[118, 244): 'Japan began the defence of their ..."
8,"[249, 254): 'China'",LOC,"[245, 374): 'But China saw their luck desert t..."
9,"[363, 373): 'Uzbekistan'",LOC,"[245, 374): 'But China saw their luck desert t..."


In [9]:
entity_mentions["sentence_id"] = entity_mentions["sentence"].array.begin
entity_mentions

Unnamed: 0,span,ent_type,sentence,sentence_id
0,"[19, 24): 'JAPAN'",LOC,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",11
1,"[40, 45): 'CHINA'",PER,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",11
2,"[66, 77): 'Nadim Ladki'",PER,"[66, 77): 'Nadim Ladki'",66
3,"[78, 84): 'AL-AIN'",LOC,"[78, 117): 'AL-AIN, United Arab Emirates 1996-...",78
4,"[86, 106): 'United Arab Emirates'",LOC,"[78, 117): 'AL-AIN, United Arab Emirates 1996-...",78
5,"[118, 123): 'Japan'",LOC,"[118, 244): 'Japan began the defence of their ...",118
6,"[151, 160): 'Asian Cup'",MISC,"[118, 244): 'Japan began the defence of their ...",118
7,"[196, 201): 'Syria'",LOC,"[118, 244): 'Japan began the defence of their ...",118
8,"[249, 254): 'China'",LOC,"[245, 374): 'But China saw their luck desert t...",245
9,"[363, 373): 'Uzbekistan'",LOC,"[245, 374): 'But China saw their luck desert t...",245


We can take a closer look at what the ```span``` column might look like in context by viewing the column alone as the SpanArray datatype.

In [10]:
entity_mentions["span"].array

Unnamed: 0,begin,end,begin token,end token,context
0,19,24,3,4,JAPAN
1,40,45,8,9,CHINA
2,66,77,13,15,Nadim Ladki
3,78,84,15,16,AL-AIN
4,86,106,17,20,United Arab Emirates
5,118,123,21,22,Japan
6,151,160,27,29,Asian Cup
7,196,201,36,37,Syria
8,249,254,47,48,China
9,363,373,69,70,Uzbekistan
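
The ```begin``` and ```end``` columns are half-open character offsets into the document text, so they behave like Python slice bounds. The short sketch below checks this against the first line of the document. It assumes that ```-DOCSTART-``` and the headline are separated by exactly one character, which is consistent with the offsets printed above; the actual separator in the parsed text may be a newline rather than a space.

```python
# Assumption: one separator character sits between "-DOCSTART-" and the headline.
text = "-DOCSTART- SOCCER- JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT."

# (begin, end) pairs copied from the first two rows of the SpanArray output.
spans = [(19, 24), (40, 45)]

# A half-open span [begin, end) selects the same characters as text[begin:end].
covered = [text[begin:end] for begin, end in spans]
print(covered)

# The length of a span is simply end - begin.
lengths = [end - begin for begin, end in spans]
print(lengths)

# Adjacent tokens such as [11, 17) 'SOCCER' and [17, 18) '-' share a boundary
# without overlapping, because the end offset is exclusive.
print(text[11:17], text[17:18])
```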


We don't need to visualize every column in our dataframe, since we're only interested in viewing the entity classifications. The next step is to drop the columns we don't care about; here, that is the ```sentence``` column.

Now that our data is prepared for analysis, we can load it up in our widget.

In [11]:
widget = tp.jupyter.render_dataframe(entity_mentions.drop(columns=["sentence"]))
#widget = tp.jupyter.render_dataframe(entity_mentions, interactive_columns=["ent_type"])
widget.display()

Output(_dom_classes=('tep--dfwidget--output',))

If we want to interact with this widget, we can pass the additional parameter ```interactive_columns``` with a list of the names of the columns that should become interactive widgets.

One thing you may notice in the above widgets is that the column ```ent_type``` is editable via a text box. This is fine, but there is a more appropriate way to interact with categorical data.

In [13]:
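# `df_doc` is not defined earlier in this notebook. We assume it is meant to be
# the entity mentions dataframe rendered above, without the "sentence" column.
df_doc = entity_mentions.drop(columns=["sentence"]).copy()
categorical = pd.Categorical(df_doc["ent_type"], categories=["PER", "LOC", "ORG", "MISC"])
df_doc["ent_type"] = categorical
df_doc.dtypes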

In [None]:
widget = tp.jupyter.render_dataframe(df_doc, interactive_columns=["ent_type"])
widget.display()

In [None]:
corrected_entities = entity_mentions.copy(True)
new_types = corrected_entities["ent_type"].copy()
new_types[widget._metadata_column] = "ORG"
corrected_entities["new_type"] = new_types
corrected_entities

In [None]:
out_df = widget.to_dataframe()
out_df
