Simplify indexing by deprecating questionable features such as linked documents #447

Open
jan-niestadt opened this issue Sep 7, 2023 · 0 comments
Labels
indexing Issues relates to indexing data

jan-niestadt commented Sep 7, 2023

The indexing system has grown complex over the years, and some features aren't very useful, or aren't needed anymore if you use Saxon with XPath 3 support. We should deprecate and eventually remove unnecessary features to reduce complexity. There should be good documentation that makes it clear how to accomplish your goals without relying on deprecated features.

Linked documents

Could we deprecate the "linked documents" feature and look things up with XPath's doc() function instead? This needs investigation, especially for AutoSearch and for uploading (possibly zipped) contents and metadata, but a custom document resolver in Saxon would probably make it possible.
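To make the resolver idea concrete, here is a minimal, language-neutral sketch in Python (not Saxon code). It assumes the uploaded zip is held in memory and that a linked document is referenced by its path inside the archive. `make_zip_resolver` and the sample file name are hypothetical, and none of this reflects BlackLab's actual implementation.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def make_zip_resolver(zip_bytes):
    """Return a function that maps an href to a parsed document from the zip."""
    archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    cache = {}  # parse each linked document at most once

    def resolve(href):
        if href not in cache:
            with archive.open(href) as f:
                cache[href] = ET.parse(f).getroot()
        return cache[href]

    return resolve


# Build a tiny "upload" in memory: one metadata file inside a zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("meta/doc1.xml", "<metadata><title>Example</title></metadata>")

resolve = make_zip_resolver(buffer.getvalue())
title = resolve("meta/doc1.xml").findtext("title")
print(title)
```

With Saxon, the equivalent hook would be whatever resolver the processor consults when an expression requests an external document; which API to use is still an open question here.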

tokenIdPath/tokenRefPath (Standoff annotations)

Standoff annotations that index an inline tag or relation are still useful, as they refer to multiple words. But maybe we can do away with tokenIdPath/tokenRefPath (capturing and referring to token id values), and simply have the standoff annotation's XPaths select the relevant word nodes, then look up the token positions of those nodes.
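A rough sketch of that lookup, in Python rather than BlackLab's Java, and under some assumptions: words are `w` elements, the standoff element is a hypothetical `span` with `from`/`to` attributes, and token positions are simply document order. The point is that positions are keyed by the word node itself, so no id values need to be captured separately.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<doc>
  <s>
    <w id="w1">The</w>
    <w id="w2">quick</w>
    <w id="w3">fox</w>
  </s>
  <span from="w2" to="w3" type="np"/>
</doc>"""


def token_positions(root):
    """Map each word node (not its id) to its 0-based token position."""
    return {w: pos for pos, w in enumerate(root.iter("w"))}


def resolve_span(root, positions, span):
    """Select the word nodes a span refers to, then look up their positions."""
    first = root.find(f".//w[@id='{span.get('from')}']")
    last = root.find(f".//w[@id='{span.get('to')}']")
    return span.get("type"), positions[first], positions[last]


root = ET.fromstring(SAMPLE)
positions = token_positions(root)
spans = [resolve_span(root, positions, s) for s in root.iter("span")]
print(spans)
```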

Standoff annotations tied to a single word can be retrieved while indexing that word using XPath, but it probably doesn't make sense to remove this option from standoffAnnotations. Sometimes doing it this way can be simpler.

Non-XML file formats don't support standoff annotations, so they don't use tokenIdPath/tokenRefPath anyway.

Subannotations

How useful is the concept of subannotations?

  • they automatically inherit the parent's name as a prefix (doesn't seem essential)
  • they can re-use the parent's matching nodes if the valuePaths match (but that optimization could be extended to all annotations that share the same valuePath, and becomes a little trickier when you're doing more work in XPath expressions)
  • they support forEach, unlike top-level annotations, although each subannotation does need to be declared. So there's no reason we couldn't support forEach for top-level annotations in the same way.

Other than that, they're just annotations like any other, so it may not be worth it to treat them differently. This would simplify the code and documentation.
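The node re-use optimization does not depend on the parent/child relationship. As a sketch only (hypothetical names, Python instead of BlackLab's Java, and ElementTree paths standing in for XPath), a small cache keyed on context node and path lets any annotations with the same valuePath share one evaluation:

```python
import xml.etree.ElementTree as ET


class PathCache:
    """Evaluate each distinct (context node, path) pair once and reuse it."""

    def __init__(self):
        self._cache = {}
        self.evaluations = 0  # counts real path evaluations, for illustration

    def findall(self, context, path):
        key = (context, path)
        if key not in self._cache:
            self.evaluations += 1
            self._cache[key] = context.findall(path)
        return self._cache[key]


word = ET.fromstring("<w><lemma>fox</lemma><pos>NOU(num=sg)</pos></w>")

# Three flat annotations; two of them share the same valuePath.
annotations = [
    {"name": "lemma", "valuePath": "lemma", "transform": lambda v: v},
    {"name": "pos", "valuePath": "pos", "transform": lambda v: v},
    {"name": "pos_head", "valuePath": "pos", "transform": lambda v: v.split("(")[0]},
]

cache = PathCache()
values = {}
for ann in annotations:
    nodes = cache.findall(word, ann["valuePath"])
    values[ann["name"]] = [ann["transform"](n.text) for n in nodes]

print(values, cache.evaluations)
```

The cache only helps while the paths are textually identical, which is the "trickier" part once more of the work moves into the XPath expressions themselves.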

Capture value paths, processing steps

XML file formats should be able to replace these with XPath 3.1 expressions. If we're capturing annotations with forEach, but want to perform different processing for different annotations, we can't do that yet. Maybe we need a per-annotation processPath that is applied after the annotation node or value is captured while executing the forEach.
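One way to picture the per-annotation step, sketched in Python with made-up names (`PROCESS`, `capture_for_each`, and the `feat` elements are assumptions, not existing BlackLab configuration): the forEach captures name/value pairs first, then a processing function looked up by annotation name is applied to each captured value.

```python
import xml.etree.ElementTree as ET

word = ET.fromstring(
    '<w><feat name="lemma">Fox</feat><feat name="pos">NOU(num=sg)</feat></w>'
)

# Hypothetical per-annotation processing, keyed by the captured annotation name.
PROCESS = {
    "lemma": str.lower,
    "pos": lambda v: v.split("(")[0],
}


def capture_for_each(context, for_each_path, name_attr):
    """Capture one annotation per matched node, then post-process it by name."""
    result = {}
    for node in context.findall(for_each_path):
        name = node.get(name_attr)
        value = node.text or ""
        process = PROCESS.get(name, lambda v: v)  # unlisted names pass through
        result[name] = process(value)
    return result


captured = capture_for_each(word, "feat", "name")
print(captured)
```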

Non-XML file formats also use processing steps, and there's no clear alternative here. Maybe a few precooked options, or plugins?

Note also that XPath can be a little tricky, so the documentation should give lots of examples of how to accomplish the same functionality in XPath. E.g. a default value is done like this:

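# get 'ana' attribute from link element, or use 'dep' as the default value (XPath 2+)
# (@ana/string() yields an empty sequence when the attribute is missing, whereas
#  string(@ana) would yield "" and the default would never be selected)
valuePath: "./link/(@ana/string(), 'dep')[1]"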

Input format inheritance (NOW DEPRECATED, remove in v5)

It is possible to derive input formats from other input formats, but this isn't very useful in practice. It is rare for an input format to need only a small addition to a base format; more often it needs more extensive changes. Having the whole format in one file is the simplest and most readable solution, even if it sometimes leads to a little duplication.

Removing inheritance would simplify loading formats, since we would no longer need to ensure the base format has been loaded before the derived format. It also reduces BlackLab's overall complexity.

@jan-niestadt jan-niestadt added the indexing Issues relates to indexing data label Sep 7, 2023
@jan-niestadt jan-niestadt added this to the Parallel corpora milestone Sep 7, 2023
@jan-niestadt jan-niestadt changed the title Simplify indexing by deprecating unnecessary features such as inheritance Simplify indexing by deprecating questionable features such as linked documents Sep 8, 2023
@jan-niestadt jan-niestadt removed this from the Parallel corpora milestone Feb 8, 2024