Thoughts about text-annotation use case and Pandas Ext. API #78

Closed
rtbs-dev opened this issue Jan 30, 2023 · 5 comments
Labels
question Issue is a question


@rtbs-dev

Hi there, super happy to find portion with such a lovely feature set and support!

I'm considering using this in conjunction with a text-analysis library, where the fundamental unit of observation tends to be "spans" of character positions within a document. However, I'm noticing e.g. from the readme and #44 that it doesn't make much sense for an IntervalDict to talk about (A) multiple, unique ("atomic"?) spans having the same "data", or (B) setting different data on possibly overlapping spans. See e.g. ipymarkup for a good intuition on what this looks like (and what I'd potentially feed IntervalDict to).

  • A) when tokenizing a document, each token (think "word" or compound-word, etc.) gets a span+text combination. This is good since we need to know that Token(0,2,"the") is distinct from Token(10,12,"the"), letting us count frequencies, build Markov language models, etc.
  • B) when annotating a document, we need to add further information on top of these, in a way that is non-destructive to the underlying spans. Maybe someone says that they want the IntervalDict {(0,2):'ice', (3,7):'cream', (0,7):Annotation('ice-cream', timestamp=..., user=...)}.
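As a rough illustration of (A) and (B), here is a minimal plain-Python sketch of the data model being described. It does not use portion at all; the `Token` and `Annotation` classes, their fields, and the `covers` helper are hypothetical names chosen for this example, and spans are assumed to be closed ranges of integer character positions.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    start: int  # first character position (inclusive)
    end: int    # last character position (inclusive)
    text: str


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    label: str
    user: str  # hypothetical metadata field


def covers(outer, inner):
    """True if the span of `outer` fully contains the span of `inner`."""
    return outer.start <= inner.start and inner.end <= outer.end


text = "the ice cream, the best"

# (A) Every token keeps its own span, so repeated words stay distinct.
tokens = [Token(m.start(), m.end() - 1, m.group())
          for m in re.finditer(r"\w+", text)]
frequencies = Counter(t.text for t in tokens)

# (B) An annotation overlaps existing tokens without replacing them.
note = Annotation(4, 12, "ice-cream", user="alice")
covered = [t for t in tokens if covers(note, t)]
```

Both occurrences of "the" are counted, yet they compare as different tokens because their spans differ, and the annotation can be related back to the "ice" and "cream" tokens it spans.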

What's the problem?

If I'm understanding things correctly,

(A) is somewhat doable

since I could manually separate the disjunctions portion creates on duplicate values (like repeated "the"). But...

(B) is problematic?

it looks like adding (0,7):Annotation(...) to an existing dict would overwrite the data at that location, or at least, require some sophisticated machinery to create a lossless "combine" function that doesn't naively join the information the way the orange/banana example does, right?

Why does this matter

I'm hoping to use intervals to create a pandas extension dtype that does more for us than the existing pd.IntervalIndex does. I've considered IntervalTree, which ipymarkup uses internally, but it doesn't use immutable operations the way portion does, so the logic around array manipulation gets... wonky.

Ultimately the text analysis is our use case, but in the meantime I imagine a very small plugin that adds a PortionDType to be used in, say, pd.Series, and a .span or .portion accessor to do special things if a PortionDType is found (e.g. in the __init__ validation step).

One mechanism for this is to make a PortionArray that effectively wraps an IntervalDict, since both are "containers of Interval", but I'm now wondering if that really makes sense given (A) and (B). What are your thoughts? And would such a plugin be of interest to keep inside portion, or in its own separate package (which was my original idea)?

@rtbs-dev
Author

rtbs-dev commented Jan 30, 2023

also, wanted to point out another group from IBM that did something similar, though their assertions for what constitutes a "span" and what is necessary to create/manipulate them are a lot stronger/larger.

It's possibly of interest to you as well, since I came across #63 and recalled they have an entire "span algebra" system, and published a paper about it.

EDIT: I thought it might be helpful to visualize the existing portion objects and how I foresee them corresponding to pandas objects

| portion | proposed ext. | via pandas? | note |
|---|---|---|---|
| P.Interval | PortionDType | ExtensionDtype | |
| P.iterate(...) | PortionArray | ExtensionArray | mostly used for indexing (pd.Index[PortionDType]) |
| P.IntervalDict | Series.span.<...> | register_series_accessor() | validates the index is PortionArray |

@AlexandreDecan
Owner

Hi!

Thanks for your interest in portion! This issue seems to involve a lot of different things and to be, to some extent, related to a kinda specific use case ;-)

Give me some time to process all the information you provided and I'll come back "soon", either with some solution or (more likely) with a lot of questions :-D

@AlexandreDecan
Owner

> (A) is somewhat doable
> since I could manually separate the disjunctions portion creates on duplicate values (like repeated "the"). But...

Indeed, (A) is doable. For example:

>>> import portion as P
>>> d = P.IntervalDict()
>>> d[P.closed(0, 2) | P.closed(10, 12)] = "the"
>>> d
{[0,2] | [10,12]: 'the'}
>>> list(d.find("the"))
[[0,2], [10,12]]

In the above example, each element of the list is an atomic Interval.

> (B) is problematic?
> it looks like adding (0,7):Annotation(...) to an existing dict would overwrite the data at that location, or at least, require some sophisticated machinery to create a lossless "combine" function that doesn't naively join the information the way the orange/banana example does, right?

You're right, again :-) An IntervalDict can only be used to associate one "field" with ranges. I already considered implementing a kind of IntervalMultiDict where more than one "field" can be associated with a range. However, I haven't had enough time to come up with a solution that performs well (IntervalDict is already quite slow, and supporting multiple fields involves, as you said, some sophisticated machinery).

As a workaround, and depending on your exact use case, you can create one instance of IntervalDict for each field and "query" them in parallel. But if you have many fields, or if you need to traverse them to find related data, this won't be convenient.
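To make the workaround concrete, here is a minimal sketch of "one mapping per field, queried in parallel". It uses plain dicts keyed by closed integer `(start, end)` tuples as a stand-in for one IntervalDict per field; the helper names `overlaps`, `query` and `query_all` are hypothetical and are not part of portion.

```python
def overlaps(a, b):
    """Closed integer spans overlap if neither ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]


def query(layer, span):
    """Return the {span: value} entries of one layer that overlap `span`."""
    return {s: v for s, v in layer.items() if overlaps(s, span)}


def query_all(layers, span):
    """Run the same query against every field and collect the results."""
    return {field: query(layer, span) for field, layer in layers.items()}


# One mapping per field; with portion, each could be its own IntervalDict.
layers = {
    "token": {(0, 2): "ice", (3, 7): "cream"},
    "annotation": {(0, 7): "ice-cream"},
}

hits = query_all(layers, (3, 5))
```

Here `hits` pairs the "cream" token with the "ice-cream" annotation. Every query scans every layer, which is why this approach becomes inconvenient as the number of fields grows.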

Note that an IntervalMultiDict is still in my backlog. As a first step in this direction, a student of mine is currently working on converting IntervalDict so that it relies on an interval tree structure internally. The expected speed-up might be enough to consider a naive implementation of IntervalMultiDict on top of the new IntervalDict.
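For intuition only, a naive multi-valued mapping could split overlapping spans into disjoint segments and keep every value that covers each segment, instead of letting the last write win. The sketch below is plain Python under that assumption; the `segment` function is a hypothetical name, it handles only closed integer spans, and it says nothing about how an IntervalMultiDict would actually be designed in portion.

```python
def segment(entries):
    """Split possibly overlapping closed integer spans into disjoint segments.

    `entries` is an iterable of (start, end, value). The result is a list of
    (start, end, values), where `values` is the tuple of every value whose
    span covers that segment, in input order.
    """
    entries = list(entries)
    # Every position where the set of covering spans can change.
    cuts = sorted({s for s, _, _ in entries} | {e + 1 for _, e, _ in entries})
    result = []
    for lo, nxt in zip(cuts, cuts[1:]):
        hi = nxt - 1
        values = tuple(v for s, e, v in entries if s <= lo and hi <= e)
        if values:  # skip gaps that no span covers
            result.append((lo, hi, values))
    return result


segments = segment([
    (0, 2, "ice"),
    (3, 7, "cream"),
    (0, 7, "ice-cream"),
])
```

Each segment keeps both the token and the annotation that cover it. The scan is quadratic in the worst case, and the original extent of each span is lost unless the stored values carry it themselves, so this is a starting point rather than a solution.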

> One mechanism to this is to make a PortionArray that effectively wraps an IntervalDict, since, both are "containers of Interval", but I'm now wondering if that really makes sense given (A) and (B). What are your thoughts? And would such a plugin be of interest to keep inside portion, or in its own separate package (which was my original idea)?

Having a pandas accessor for Interval or even for IntervalDict is something I already considered (at least for Interval instances). However, since we cannot really vectorize the operations involving intervals, it would be mostly syntactic sugar without any benefit in terms of performance, so I gave up :-)

AlexandreDecan added the "question" label on Jan 31, 2023
@AlexandreDecan
Owner

Any update on this?

@AlexandreDecan
Owner

I'll reopen if needed.
