<font size=6>
    <b>Text_Extensions_for_Pandas_Overview.ipynb:</b>
    <p>Overview of the basic functionality and usage of Text Extensions for Pandas.</p>
</font>

## Text Extensions for Pandas

**TODO**

See the following notebooks for in-depth examples that use Text Extensions for Pandas for data analysis, NLP, and model training:
- [Analyze_Model_Outputs](./Analyze_Model_Outputs.ipynb)
- [Analyze_Text](./Analyze_Text.ipynb)
- [Integrate_NLP_Libraries](./Integrate_NLP_Libraries.ipynb)
- [Model_Training_with_TensorArray](./Model_Training_with_TensorArray.ipynb)

## Environment Setup

This notebook requires a Python 3.7 or later environment with NumPy and Pandas.

The notebook also requires the `text_extensions_for_pandas` library. You can satisfy this dependency in two ways:

* Run `pip install text_extensions_for_pandas` before running this notebook. This command adds the library to your Python environment.
* Run this notebook out of your local copy of the Text Extensions for Pandas project's [source tree](https://github.com/CODAIT/text-extensions-for-pandas). In this case, the notebook will use the version of Text Extensions for Pandas in your local source tree **if the package is not installed in your Python environment**.

In [7]:
import os
import sys
import numpy as np
import pandas as pd

# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp

## Pandas Extension Arrays

Text Extensions for Pandas provides several Pandas extension arrays on which much of its functionality is built. These extension arrays **TODO** and integrate with standard Pandas to provide a workflow centered on the easy-to-use and powerful Pandas DataFrame.

This section introduces the extension arrays and shows their basic usage.

### SpanArray

In [29]:
# Sample text input
text = """\
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain \
searching for men to join the Knights of the Round Table. Along the way, \
he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad \
the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir \
Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.\
"""

In [94]:
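# Define a crude whitespace tokenizer, for illustration only
def tokenize_with_offsets(text):
    """Return arrays of begin and end character offsets for the tokens in `text`."""
    splits = text.split(" ")
    # Each token begins one space after the end of the previous split
    begins = np.cumsum([0] + [len(s) + 1 for s in splits[:-1]])
    # Exclude trailing commas and periods from each token
    ends = begins + [len(s.rstrip(",.")) for s in splits]
    return begins, ends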

In [95]:
begins, ends = tokenize_with_offsets(text)
tokens = tp.SpanArray(text, begins, ends)
tokens

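| | begin | end | covered_text |
|---|---|---|---|
| 0 | 0 | 2 | In |
| 1 | 3 | 5 | AD |
| 2 | 6 | 9 | 932 |
| 3 | 11 | 15 | King |
| 4 | 16 | 22 | Arthur |
| 5 | 23 | 26 | and |
| 6 | 27 | 30 | his |
| 7 | 31 | 37 | squire |
| 8 | 39 | 44 | Patsy |
| 9 | 46 | 52 | travel |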
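
To make the meaning of the `begin` and `end` columns concrete, here is a minimal sketch in plain Python that mirrors the crude tokenizer without NumPy or `SpanArray`. It assumes half-open character ranges (`begin` inclusive, `end` exclusive), which is what the output above suggests, and it trims only trailing punctuation with `rstrip`. The `sample` string is a shortened copy of the input text.

```python
from itertools import accumulate

def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (begins, ends) character offsets."""
    splits = text.split(" ")
    # Each token begins after the previous split plus one space.
    begins = list(accumulate([0] + [len(s) + 1 for s in splits[:-1]]))
    # Trailing commas and periods are left outside the span.
    ends = [b + len(s.rstrip(",.")) for b, s in zip(begins, splits)]
    return begins, ends

sample = "In AD 932, King Arthur and his squire, Patsy, travel"
begins, ends = tokenize_with_offsets(sample)

# Slicing the text with each (begin, end) pair recovers the covered text.
covered = [sample[b:e] for b, e in zip(begins, ends)]
print(list(zip(begins, ends, covered))[:4])
# [(0, 2, 'In'), (3, 5, 'AD'), (6, 9, '932'), (11, 15, 'King')]
```

Because the offsets index into the original string, the punctuation that the tokenizer skips stays in the text and simply falls between spans.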


### TokenSpanArray

In [100]:
begin_tokens = [4, 8, 11, 21]
end_tokens = [5, 9, 12, 23]
token_spans = tp.TokenSpanArray(tokens, begin_tokens, end_tokens)
token_spans

| | begin | end | begin_token | end_token | covered_text |
|---|---|---|---|---|---|
| 0 | 16 | 22 | 4 | 5 | Arthur |
| 1 | 39 | 44 | 8 | 9 | Patsy |
| 2 | 64 | 71 | 11 | 12 | Britain |
| 3 | 117 | 128 | 21 | 23 | Round Table |
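
The relationship between the token columns and the character columns in this output can be sketched in plain Python. This is an illustration only, not the library's implementation: `token_span_to_char_span` is a hypothetical helper, and the sketch assumes that `end_token` is exclusive, as the `begin_token`/`end_token` pairs above suggest. The offsets are copied from the first ten tokens of the `SpanArray` output.

```python
# Character offsets of the first ten tokens, copied from the SpanArray output.
token_begins = [0, 3, 6, 11, 16, 23, 27, 31, 39, 46]
token_ends = [2, 5, 9, 15, 22, 26, 30, 37, 44, 52]

def token_span_to_char_span(begin_token, end_token):
    """Map a half-open token range [begin_token, end_token) to character offsets."""
    begin = token_begins[begin_token]
    # end_token is exclusive, so the last covered token is end_token - 1.
    end = token_ends[end_token - 1]
    return begin, end

print(token_span_to_char_span(4, 5))  # (16, 22), i.e. "Arthur"
print(token_span_to_char_span(8, 9))  # (39, 44), i.e. "Patsy"
print(token_span_to_char_span(3, 5))  # (11, 22), i.e. "King Arthur"
```

A span that covers several tokens takes its `begin` from the first token and its `end` from the last one, so any whitespace between the tokens is included in the covered text.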


In [103]:
tp.TokenSpanArray.align_to_tokens(tokens, token_spans)

| | begin | end | begin_token | end_token | covered_text |
|---|---|---|---|---|---|
| 0 | 16 | 22 | 4 | 5 | Arthur |
| 1 | 39 | 44 | 8 | 9 | Patsy |
| 2 | 64 | 71 | 11 | 12 | Britain |
| 3 | 117 | 128 | 21 | 23 | Round Table |
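
Conceptually, aligning spans to tokens means finding the token indexes whose boundaries match each span's character offsets. The sketch below shows that idea in plain Python under a simplifying assumption: every span boundary coincides exactly with a token boundary. `align_char_span_to_tokens` is a hypothetical helper, and how the library's `align_to_tokens` treats spans that do not line up with tokens is not covered here.

```python
# Character offsets of the first ten tokens, copied from the SpanArray output.
token_begins = [0, 3, 6, 11, 16, 23, 27, 31, 39, 46]
token_ends = [2, 5, 9, 15, 22, 26, 30, 37, 44, 52]

# Lookup tables from a character offset to the token that starts or ends there.
begin_to_token = {b: i for i, b in enumerate(token_begins)}
end_to_token = {e: i for i, e in enumerate(token_ends)}

def align_char_span_to_tokens(begin, end):
    """Return (begin_token, end_token) for a span that sits on token boundaries."""
    if begin not in begin_to_token or end not in end_to_token:
        raise ValueError(f"span [{begin}, {end}) is not aligned to token boundaries")
    # end_token is exclusive, hence the + 1.
    return begin_to_token[begin], end_to_token[end] + 1

print(align_char_span_to_tokens(16, 22))  # (4, 5), i.e. "Arthur"
print(align_char_span_to_tokens(11, 22))  # (3, 5), i.e. "King Arthur"
```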


### Spanner

### TensorArray

A `TensorArray` represents an array of tensors in which every element is an N-dimensional tensor of the same shape. If the array holds M elements, the `TensorArray` as a whole has M as its outer dimension, followed by the N dimensions of the element shape. Backing the `TensorArray` is a single `numpy.ndarray` with that same shape. Standard arithmetic and comparison operations are supported and delegated to the backing ndarray. Taking a slice or selecting multiple items produces another `TensorArray`, while selecting a single element produces a `TensorElement` that also wraps a view of the `numpy.ndarray` and supports similar operators.

A `TensorArray` can be constructed with zero copy from a single `numpy.ndarray`, or from a sequence of elements of the same shape. A `TensorArray` can be converted to a `numpy.ndarray` with zero copy by calling `TensorArray.to_numpy()` or by using the provided NumPy array interface, e.g. `numpy.asarray(TensorArray(...))`. The `TensorArray` is a Pandas extension type of type `TensorType`, so it can be wrapped in a `pandas.Series` or used as a column in a `pandas.DataFrame` and take part in standard Pandas operations. A `NULL` or missing value in the `TensorArray` is represented as an N-dimensional `numpy.ndarray` in which all items are `numpy.nan`.
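
The layout described above can be illustrated with plain NumPy. The sketch below uses an ordinary `numpy.ndarray` as a stand-in for the backing array and does not call the `TensorArray` API; the variable names are illustrative only. It shows the outer dimension counting elements, a single element being a view rather than a copy, and the all-NaN convention for a missing element.

```python
import numpy as np

# Five elements, each a tensor of shape (2, 3), stored in one backing ndarray.
backing = np.arange(30, dtype=float).reshape(5, 2, 3)
print(backing.shape)     # (5, 2, 3): the outer dimension is the element count
print(backing[0].shape)  # (2, 3): the shape of a single element

# Basic indexing returns a view, so selecting an element copies no data.
element = backing[1]
print(np.shares_memory(element, backing))  # True

# Missing-value convention: an element whose items are all NaN.
backing[3] = np.nan
is_missing = np.isnan(backing).all(axis=(1, 2))
print(is_missing.tolist())  # [False, False, False, True, False]
```

Keeping all elements in one contiguous array is what allows arithmetic and comparison operations to be delegated to NumPy as a whole instead of being applied element by element.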

In [2]:
# Construct from a numpy.ndarray
arr = tp.TensorArray(np.arange(10).reshape(5, 2))
arr, arr.dtype

(array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]),
 <text_extensions_for_pandas.array.tensor.TensorType at 0x7f48e2cb3d90>)

In [3]:
# Wrap in a Pandas Series
s = pd.Series(arr)
s

0   [0 1]
1   [2 3]
2   [4 5]
3   [6 7]
4   [8 9]
dtype: TensorType

In [4]:
# Convert back to numpy using the provided array interface
np_arr = np.asarray(s)
np_arr, np_arr.dtype

(array([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]]),
 dtype('int64'))

In [5]:
# Apply operations on the Series, result is another Series of type TensorType
thresh = s > 4
thresh

0    [False False]
1    [False False]
2    [False  True]
3    [ True  True]
4    [ True  True]
dtype: TensorType

In [6]:
# Select items using a mix of NumPy and Pandas. Use `.array` to get the Series as a
# `TensorArray`, which can be passed to NumPy operations
s[np.all(thresh.array, axis=1)] * 2

3   [12 14]
4   [16 18]
dtype: TensorType

In [7]:
# TensorArray can also be added to a Pandas DataFrame
df = pd.DataFrame({"time": pd.date_range('2018-01-01', periods=5, freq='H'), "features": arr})
df

| | time | features |
|---|---|---|
| 0 | 2018-01-01 00:00:00 | [0 1] |
| 1 | 2018-01-01 01:00:00 | [2 3] |
| 2 | 2018-01-01 02:00:00 | [4 5] |
| 3 | 2018-01-01 03:00:00 | [6 7] |
| 4 | 2018-01-01 04:00:00 | [8 9] |


In [8]:
# TensorArray columns work with standard DataFrame operations
df.sort_values(by="time", ascending=False)

| | time | features |
|---|---|---|
| 4 | 2018-01-01 04:00:00 | [8 9] |
| 3 | 2018-01-01 03:00:00 | [6 7] |
| 2 | 2018-01-01 02:00:00 | [4 5] |
| 1 | 2018-01-01 01:00:00 | [2 3] |
| 0 | 2018-01-01 00:00:00 | [0 1] |


### Saving Arrays to Disk

In [106]:
s = pd.Series(tokens)

## NLP Library Integration for Text Input/Output

Text Extensions for Pandas also provides integration with other NLP libraries and takes care of processing inputs and outputs with Pandas, fully utilizing the above extension arrays.

### Watson

### SpaCy

### BERT

### CoNLL