<h1>Text Extensions for Pandas</h1>
<h2>Interactive Dataframe Widget</h2>
The interactive dataframe widget is part of Text Extensions for Pandas, an open source Python library from IBM's CODAIT team. The widget aims to give data scientists a meaningful, visual way to interpret natural language processing (NLP) data.

This demo walks you through an example session with the widget and the related visualizers provided in the ```jupyter``` sub-module of Text Extensions for Pandas.

In [1]:
import os
import regex
import sys
import numpy as np
import pandas as pd

# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp

This demo uses the CoNLL-2003 dataset, a benchmark dataset for named entity recognition. For our purposes, we are interested in its four entity categories: ```locations (LOC)```, ```persons (PER)```, ```organizations (ORG)``` and ```miscellaneous (MISC)```.

We will use Text Extensions for Pandas to download and parse the CoNLL dataset into dataframes for us to work with.

In [2]:
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
#  to the terms of the license when using this data set!
data_set_info = tp.io.conll.maybe_download_conll_data("outputs")
data_set_info

{'train': 'outputs/eng.train',
 'dev': 'outputs/eng.testa',
 'test': 'outputs/eng.testb'}

In [3]:
gold_standard = tp.io.conll.conll_2003_to_dataframes(
    data_set_info["test"], ["pos", "phrase", "ent"], [False, True, True])
gold_standard = [
    df.drop(columns=["pos", "phrase_iob", "phrase_type"])
    for df in gold_standard
]


Once we have our dataset downloaded and parsed, we can prepare our dataframe for visualization.

In [4]:
df_doc = gold_standard[0][:30]
df_doc

Unnamed: 0,span,ent_iob,ent_type,sentence,line_num
0,"[0, 10): '-DOCSTART-'",O,,"[0, 10): '-DOCSTART-'",0
1,"[11, 17): 'SOCCER'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",2
2,"[17, 18): '-'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",3
3,"[19, 24): 'JAPAN'",B,LOC,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",4
4,"[25, 28): 'GET'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",5
5,"[29, 34): 'LUCKY'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",6
6,"[35, 38): 'WIN'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",7
7,"[38, 39): ','",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",8
8,"[40, 45): 'CHINA'",B,PER,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",9
9,"[46, 48): 'IN'",O,,"[11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ...",10


We don't really want to visualize every column in this dataframe as we're only interested in viewing the entity classifications. The next step is to drop any columns we don't care about.

In [5]:
doc_df = gold_standard[0][:30].drop(columns=['sentence'])
doc_df

Unnamed: 0,span,ent_iob,ent_type,line_num
0,"[0, 10): '-DOCSTART-'",O,,0
1,"[11, 17): 'SOCCER'",O,,2
2,"[17, 18): '-'",O,,3
3,"[19, 24): 'JAPAN'",B,LOC,4
4,"[25, 28): 'GET'",O,,5
5,"[29, 34): 'LUCKY'",O,,6
6,"[35, 38): 'WIN'",O,,7
7,"[38, 39): ','",O,,8
8,"[40, 45): 'CHINA'",B,PER,9
9,"[46, 48): 'IN'",O,,10


Now that our data is prepared for analysis, we can load it up in our widget.

In [6]:
# Without interactive_columns, the widget can be rendered with:
#   widget = tp.jupyter.render_dataframe(df_doc)
widget = tp.jupyter.render_dataframe(doc_df, interactive_columns=["ent_type"])
widget.display()

Output(_dom_classes=('tep--dfwidget--output',))

To make columns of the widget interactive, as in the cell above, we pass the additional parameter ```interactive_columns``` with a list of the column names that should become interactive widgets.

One thing you may notice in the widget above is that the column ```ent_type``` is editable via a text box. This works, but there is a more appropriate way to interact with categorical data: convert the column to a pandas categorical type first.

In [7]:
categorical = pd.Categorical(doc_df["ent_type"], categories=["PER", "LOC", "ORG", "MISC"])
doc_df["ent_type"] = categorical
doc_df.dtypes

span        SpanDtype
ent_iob        object
ent_type     category
line_num        int64
dtype: object

In [8]:
[*doc_df["ent_type"].cat.categories.values, 'nan']

['PER', 'LOC', 'ORG', 'MISC', 'nan']
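As a minimal sketch of what this conversion does, the example below applies ```pd.Categorical``` to a small toy series rather than the CoNLL data. The toy values and variable names are assumptions made for illustration. It shows that missing labels stay missing (code ```-1```) and that the category list keeps the order we passed in.

```python
import pandas as pd

# Toy stand-in for the ent_type column (hypothetical values, not CoNLL data).
ent_type = pd.Series(["LOC", None, "PER", None, "ORG"])

# Same category list as in the notebook cell above.
categories = ["PER", "LOC", "ORG", "MISC"]
as_categorical = pd.Series(pd.Categorical(ent_type, categories=categories))

dtype_name = str(as_categorical.dtype)
category_list = list(as_categorical.cat.categories)

# Missing labels are not a category; they are stored as NaN with code -1.
missing_count = int(as_categorical.isna().sum())
codes = list(as_categorical.cat.codes)

print(dtype_name, category_list, missing_count, codes)
```

Because missing values are not part of the category list, a placeholder such as ```'nan'``` has to be appended separately when a complete list of choices is needed, as the cell above does.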

In [9]:
widget = tp.jupyter.render_dataframe(doc_df, interactive_columns=["ent_type"])
display(widget._df["ent_type"].cat.categories)
widget.display()

Index(['PER', 'LOC', 'ORG', 'MISC'], dtype='object')

Output(_dom_classes=('tep--dfwidget--output',))

In [12]:
out_df = widget.to_dataframe()
out_df

Unnamed: 0,span,ent_iob,ent_type,line_num
0,"[0, 10): '-DOCSTART-'",O,,0
1,"[11, 17): 'SOCCER'",O,,2
2,"[17, 18): '-'",O,,3
3,"[19, 24): 'JAPAN'",B,LOC,4
4,"[25, 28): 'GET'",O,,5
5,"[29, 34): 'LUCKY'",O,,6
6,"[35, 38): 'WIN'",O,,7
7,"[38, 39): ','",O,,8
8,"[40, 45): 'CHINA'",B,LOC,9
9,"[46, 48): 'IN'",O,,10
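In the output above, the label for 'CHINA' reads ```LOC```, whereas the input dataframe had ```PER```, presumably because the value was edited in the widget before calling ```to_dataframe()```. One way to list such edits is to compare the original and returned dataframes column by column. The sketch below does this with plain pandas on toy data; the frames ```before``` and ```after``` are hypothetical stand-ins for ```doc_df``` and ```out_df```.

```python
import pandas as pd

# Hypothetical stand-ins for doc_df (before) and out_df (after editing).
before = pd.DataFrame({
    "token": ["JAPAN", "GET", "CHINA"],
    "ent_type": ["LOC", None, "PER"],
})
after = before.copy()
after.loc[2, "ent_type"] = "LOC"  # simulate an edit made in the widget

# Fill missing labels with "" so that two missing values compare as equal.
changed = before["ent_type"].fillna("") != after["ent_type"].fillna("")

# Collect the edited rows with their old and new labels side by side.
edits = pd.DataFrame({
    "token": before.loc[changed, "token"],
    "before": before.loc[changed, "ent_type"],
    "after": after.loc[changed, "ent_type"],
})
print(edits)
```

If the real column is categorical, converting it with ```astype(object)``` before calling ```fillna("")``` avoids an error, since the empty string is not one of the categories.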
