Thoughts about text-annotation use case and Pandas Ext. API #78

rtbs-dev · 2023-01-30T16:35:05Z

Hi there, super happy to find portion with such a lovely feature set and support!

I'm considering using this in conjunction with a text-analysis library, where the fundamental unit of observation tends to be "spans" of character positions within a document. However, I'm noticing e.g. from the readme and #44 that it doesn't make much sense for an IntervalDict to talk about (A) multiple, unique ("atomic"?) spans having the same "data", or (B) setting different data on possibly overlapping spans. See e.g. ipymarkup for a good intuition on what this looks like (and what I'd potentially feed IntervalDict to).

A) when tokenizing a document, each token (think "word" or compound-word, etc.) gets a span+text combination. This is good since we need to know that Token(0,2,"the") is distinct from Token(10,12,"the"), letting us count frequencies, build Markov language models, etc.
B) when annotating a document, we need to layer further information on top of these tokens without destroying the underlying spans. Maybe someone says that they want the IntervalDict {(0,2):'ice', (3,7):'cream', (0,7):Annotation('ice-cream', timestamp=..., user=...)}.
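
To make (A) concrete, here is a minimal standard-library sketch of what I mean by span+text tokens. The `Token` type, the `tokenize` helper, and the `\w+` pattern are all made up for illustration and are not tied to any particular text-analysis library; the end index is inclusive to match the `Token(0,2,"the")` notation above.

```python
import re
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    # Hypothetical token type: character span plus the text it covers.
    start: int  # index of the first character (inclusive)
    end: int    # index of the last character (inclusive)
    text: str


def tokenize(document):
    """Split on word characters and keep each match's character span."""
    return [
        Token(m.start(), m.end() - 1, m.group())
        for m in re.finditer(r"\w+", document)
    ]


doc = "the cat saw the dog"
tokens = tokenize(doc)

# Two occurrences of "the" are distinct tokens because their spans differ...
first_the, second_the = [t for t in tokens if t.text == "the"]

# ...yet they still share the same text, so frequency counts work.
counts = Counter(t.text for t in tokens)
```

Here `first_the` is `Token(0, 2, "the")` and `second_the` is `Token(12, 14, "the")`, while `counts["the"]` is 2.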

What's the problem?

If I'm understanding things correctly,

(A) is somewhat doable

since I could manually separate the disjunctions portion creates on duplicate values (like repeated "the"). But...

(B) is problematic?

it looks like adding (0,7):Annotation(...) to an existing dict would overwrite the data at that location, or at least, require some sophisticated machinery to create a lossless "combine" function that doesn't naively join the information the way the orange/banana example does, right?

Why does this matter

I'm hoping to use intervals to create a pandas extension dtype that does more for us than the existing pd.IntervalIndex does. I've considered IntervalTree, which ipymarkup uses internally, but it doesn't use immutable operations the way portion does, so the logic around array manipulation gets... wonky.

Ultimately the text analysis is our use case, but in the meantime I imagine a very small plugin that adds a PortionDType to be used in, say, pd.Series, and a .span or .portion accessor to do special things if a PortionDType is found (e.g. in the __init__ validation step).

One mechanism for this is to make a PortionArray that effectively wraps an IntervalDict, since both are "containers of Interval", but I'm now wondering if that really makes sense given (A) and (B). What are your thoughts? And would such a plugin be of interest to keep inside portion, or in its own separate package (which was my original idea)?


rtbs-dev · 2023-01-30T16:56:44Z

Also, I wanted to point out another group from IBM that did something similar, though their assumptions about what constitutes a "span" and what is necessary to create/manipulate them are a lot stronger/larger.

It's possibly of interest to you as well, since I came across #63 and recalled they have an entire "span algebra" system, and published a paper about it.

EDIT: I thought it might be helpful to visualize the existing portion objects and how I foresee them corresponding to pandas objects

| portion | proposed ext. | via pandas? | note |
|---|---|---|---|
| `P.Interval` | `PortionDType` | `ExtensionDtype` | |
| `P.iterate(...)` | `PortionArray` | `ExtensionArray` | mostly used for indexing (`pd.Index[PortionDType]`) |
| `P.IntervalDict` | `Series.span.<...>` | `register_series_accessor()` | validates the index is `PortionArray` |

AlexandreDecan · 2023-01-30T19:23:32Z

Hi!

Thanks for your interest in portion! This issue seems to involve a lot of different things and to be, to some extent, related to a kinda specific use case ;-)

Give me some time to process all the information you provided and I'll come back "soon", either with some solution or (more likely) with a lot of questions :-D

AlexandreDecan · 2023-01-31T10:12:32Z

> (A) is somewhat doable
> since I could manually separate the disjunctions portion creates on duplicate values (like repeated "the"). But...

Indeed, (A) is doable. For example:

>>> import portion as P
>>> d = P.IntervalDict()
>>> d[P.closed(0, 2) | P.closed(10, 12)] = "the"
>>> d
{[0,2] | [10,12]: 'the'}
>>> list(d.find("the"))
[[0,2], [10,12]]

In the above example, each element of the list is an atomic Interval.

> (B) is problematic?
> it looks like adding (0,7):Annotation(...) to an existing dict would overwrite the data at that location, or at least, require some sophisticated machinery to create a lossless "combine" function that doesn't naively join the information the way the orange/banana example does, right?

You're right, again :-) An IntervalDict can only be used to associate one "field" to ranges. I already considered implementing a kind of IntervalMultiDict where more than one "field" can be associated to a range. However, I haven't had enough time to come up with a solution that performs well (IntervalDict is already quite slow, and supporting multiple fields involves, as you said, some sophisticated machinery).

As a workaround, and depending on your exact use case, you can create one instance of IntervalDict for each field and "query" them in parallel. But if you have many fields, or if you need to traverse them to find related data, this won't be convenient.

Note that having an IntervalMultiDict is still in my backlog. As a first step in this direction, I have a student who is currently working on converting IntervalDict so that it relies on an interval tree structure internally. The expected speed-up might be enough to consider a naive implementation of IntervalMultiDict on top of the new IntervalDict.

> One mechanism for this is to make a PortionArray that effectively wraps an IntervalDict, since both are "containers of Interval", but I'm now wondering if that really makes sense given (A) and (B). What are your thoughts? And would such a plugin be of interest to keep inside portion, or in its own separate package (which was my original idea)?

Having a pandas accessor for Interval or even for IntervalDict is something I already considered (at least for Interval instances) but since we cannot really vectorize the operations involving intervals, it would be mostly syntactic sugar without any benefit in terms of performance, so I gave up :-)

AlexandreDecan · 2023-02-07T19:01:55Z

Any update on this?

AlexandreDecan · 2023-02-13T18:12:24Z

I'll reopen if needed.

AlexandreDecan added the question Issue is a question label Jan 31, 2023

AlexandreDecan closed this as completed Feb 13, 2023
