Thoughts about text-annotation use case and Pandas Ext. API #78
Comments
also, wanted to point out another group from IBM that did something similar, though their assertions for what constitutes a "span" and what is necessary to create/manipulate them are a lot stronger/larger. It's possibly of interest to you as well, since I came across #63 and recalled they have an entire "span algebra" system, and published a paper about it. EDIT: I thought it might be helpful to visualize the existing
Hi! Thanks for your interest in portion! This issue seems to involve a lot of different things and to be, to some extent, related to a kinda specific use case ;-) Give me some time to process all the information you provided and I'll come back "soon", either with some solution or (more likely) with a lot of questions :-D
Indeed, (A) is doable. For example:
>>> import portion as P
>>> d = P.IntervalDict()
>>> d[P.closed(0, 2) | P.closed(10, 12)] = "the"
>>> d
{[0,2] | [10,12]: 'the'}
>>> list(d.find("the"))
[[0,2], [10,12]]
In the above example, each element of the list is an atomic
You're right, again :-) An As a workaround, and depending on your exact use case, you can create one instance of Notice that having an
Having a
Any update on this?
I'll reopen if needed.
Hi there, super happy to find portion with such a lovely feature set and support!

I'm considering using this in conjunction with a text-analysis library, where the fundamental unit of observation tends to be "spans" of character positions within a document. However, I'm noticing, e.g. from the readme and #44, that it doesn't make much sense to an IntervalDict to talk about (A) multiple, unique ("atomic"?) spans having the same "data", or (B) setting different data on possibly overlapping spans. See e.g. ipymarkup for a good intuition on what this looks like (and what I'd potentially feed IntervalDict to).

Token(0,2,"the") is distinct from Token(10,12,"the"), letting us count frequencies, do Markov language models, etc.

{(0,2):'ice', (3,7):'cream', (0,7):Annotation('ice-cream', timestamp=..., user=...)}.

What's the problem?
If I'm understanding things correctly, (A) is doable, since I could manually separate the disjunctions portion creates on duplicate values (like repeated "the"). But... it looks like adding (0,7):Annotation(...) to an existing dict would overwrite the data at that location, or at least require some sophisticated machinery to create a lossless "combine" function that doesn't naively join the information the way the orange/banana example does, right?

Why does this matter?
I'm hoping to use intervals to create a pandas extension dtype that does more for us than the existing pd.IntervalIndex does. I've considered IntervalTree, which ipymarkup is using internally, but it's not using immutable operations the way portion does, so the logic around array manipulation gets... wonky.

Ultimately the text analysis is our use case, but in the meantime I imagine a very small plugin that adds a PortionDType to be used in, say, pd.Series, and a .span or .portion accessor to do special things if a PortionDtype is found (e.g. in the __init__ validation step).

One mechanism for this is to make a PortionArray that effectively wraps an IntervalDict, since both are "containers of Interval", but I'm now wondering if that really makes sense given (A) and (B). What are your thoughts? And would such a plugin be of interest to keep inside portion, or in its own separate package (which was my original idea)?