upgrade to Solr 9 #572

mnaydan · 2024-01-10T21:17:16Z

rlskoeser · 2024-01-16T19:21:36Z

refer to commits on Princeton-CDH/geniza#1359

laurejt · 2024-03-05T21:17:31Z

With this upgrade, we'll need to make updates to the solr config file (solrconfig.xml) since there've been major updates to the cache implementations and the existing ones are deprecated as of 9.0.

Note that this also seems to relate to some of the failed tests I'm running into (I'm using Solr 8.11.2).

rlskoeser · 2024-03-08T18:50:14Z

@laurejt I pulled your branch and uploaded the config files to a new configset in my existing docker solr 9.2 instance. I was able to create a new collection with that configset, and had no problem indexing works (python manage.py index -i work).

When I tried to index page content (python manage.py index_pages) I saw an error that sounds like what you described - it looks like the label field is defined as an integer in solr, but not all of our page labels are integers. When I switched the page indexing to index it as a string, it looked like indexing was working ok.

Not sure if we want to commit index field changes yet, so here's what I did:

in ppa.archive.models line 1190 I changed "label" to "label_s"
in ppa.archive.solr line 194 I changed the key "label" to "label_s" (although I didn't really test this change yet)

laurejt · 2024-03-26T15:48:08Z

Potential Issue: deprecation of Trie fields

Is it okay to just convert these instances to their corresponding Point fields? It looks like the "equivalent" declarations do not set the positionIncrementGap nor the precisionStep field properties (when they're set to 0).

rlskoeser · 2024-03-26T16:00:13Z

I found another thing we need to handle, just added to our checklist: we have a couple of copy fields defined in the old version of the schema https://github.com/Princeton-CDH/ppa-django/blob/main/solr_conf/conf/managed-schema#L529-L530

This is so we can index things two different ways: in this case, one version for tokenized text-based search and one for facet/sort.

laurejt · 2024-03-26T16:06:34Z

Is there a reason that managed-schema#L135, doesn't set positionIncrementGap to 100?

rlskoeser · 2024-03-26T16:08:24Z

Is it okay to just convert these instances to their corresponding Point fields? It looks like the "equivalent" declarations do not set the positionIncrementGap nor the precisionStep field properties (when they're set to 0).

Thanks for flagging. I think the only date field we have is the last modified timestamp, which we aren't faceting on; we only use it for last-modified header checks. We use an integer field for the pubdate, which we do facet on. So I think this is ok, but we'll want to check the last-modified header behavior. (Could be formatting issues, maybe)

rlskoeser · 2024-03-26T16:11:44Z

Is there a reason that managed-schema#L135, doesn't set positionIncrementGap to 100?

Probably because I didn't know about this setting! It looks useful for non-contiguous text blocks; I can't think of anywhere this wouldn't be helpful.

laurejt · 2024-03-26T16:13:26Z

Oddly, additional TrieDateField-based fields were defined (tdate, tdates), however they don't appear to be used by anything. They only appear to be used to create dynamic fields (_tdt, _tdts) that don't appear to be used anywhere within the codebase.

I'm going to go ahead and ignore / drop these, since the type is deprecated as it is.

laurejt · 2024-03-26T16:14:44Z

Is it okay to just convert these instances to their corresponding Point fields? It looks like the "equivalent" declarations do not set the positionIncrementGap nor the precisionStep field properties (when they're set to 0).

Thanks for flagging. I think the only date field we have is the last modified timestamp, which we aren't faceting on; we only use it for last-modified header checks. We use an integer field for the pubdate, which we do facet on. So I think this is ok, but we'll want to check the last-modified header behavior. (Could be formatting issues, maybe)

One thing to note is that this holds true for all other basic numeric types (e.g., int, double, long), but maybe this doesn't really matter since it could just be an indexing thing?

laurejt · 2024-03-26T17:05:48Z

For text_string is there a reason lowercase filtering is being invoked given that ICU folding effectively incorporates this behavior?

laurejt · 2024-03-26T17:07:44Z

Worth highlighting that the field type text_gen_sort (dyanamic field: _t_sort) behaves differently thantext_string which is used for sort_title

laurejt · 2024-03-26T17:18:55Z

For work_type is it worth doing the text transformations within Solr rather than within Python (e.g., models.py#L939-41)?

laurejt · 2024-03-26T17:30:20Z

Should dynamic fields of the form *_t be multiValued? I'm guessing this is a mistake but right now this property is set to true because that's the "default" inherited from text_general.

rlskoeser · 2024-03-26T17:47:59Z

For text_string is there a reason lowercase filtering is being invoked given that ICU folding effectively incorporates this behavior?

Sounds like an oversight and/or leftover; let's remove the redundant filter.

rlskoeser · 2024-03-26T17:49:33Z

Worth highlighting that the field type text_gen_sort (dyanamic field: _t_sort) behaves differently thantext_string which is used for sort_title

I think that the facet and sort fields are ones we will want to look at once we get things basically working with solr 9 - I remember customizing those some (like adding unicode folding, I think?) but not the specifics. I do remember we had to use text instead of string because you can't apply filters like unicode folding to string fields.

rlskoeser · 2024-03-26T17:52:22Z

For work_type is it worth doing the text transformations within Solr rather than within Python (e.g., models.py#L939-41)?

get_item_type_display is a django method you get automatically for a field that has choices defined - see https://docs.djangoproject.com/en/5.0/ref/models/instances/#django.db.models.Model.get_FOO_display . It makes more sense to me to keep with the python code than to put in a solr config, because if we change it in python it will automatically propagate without an additional step.

rlskoeser · 2024-03-26T17:53:45Z

Should dynamic fields of the form *_t be multiValued? I'm guessing this is a mistake but right now this property is set to true because that's the "default" inherited from text_general.

I like the idea of making them consistent with the other fields! The inheritance makes that more complicated, doesn't it? So then would *_t be explicitly not multivalued and *_ts (if we need one) would be multi-valued?

laurejt · 2024-03-26T17:54:09Z

For work_type is it worth doing the text transformations within Solr rather than within Python (e.g., models.py#L939-41)?

get_item_type_display is a django method you get automatically for a field that has choices defined - see https://docs.djangoproject.com/en/5.0/ref/models/instances/#django.db.models.Model.get_FOO_display . It makes more sense to me to keep with the python code than to put in a solr config, because if we change it in python it will automatically propagate without an additional step.

To clarify, I don't meant to replace that part; just the lowercasing and removal of hyphens that occurs after.

laurejt · 2024-03-26T17:54:56Z

Should dynamic fields of the form *_t be multiValued? I'm guessing this is a mistake but right now this property is set to true because that's the "default" inherited from text_general.

I like the idea of making them consistent with the other fields! The inheritance makes that more complicated, doesn't it? So then would *_t be explicitly not multivalued and *_ts (if we need one) would be multi-valued?

In the example / default schema *_t is explicitly not multivalued (i.e. multiValued="false"), while the multi-valued version *_txt is implicitly multi-valued.

rlskoeser · 2024-03-26T17:57:11Z

To clarify, I don't meant to replace that part; just the lowercasing and removal of hyphens that occurs after.

Oh! Thank you for clarifying, I didn't read it properly with the code wrapping on GitHub. That sounds smart, although may or may not be worth the overhead! Do we have an existing field that would handle this, or would need to customize (I don't even remember now why we did it this way...)

laurejt · 2024-03-26T18:05:08Z

To clarify, I don't meant to replace that part; just the lowercasing and removal of hyphens that occurs after.

Oh! Thank you for clarifying, I didn't read it properly with the code wrapping on GitHub. That sounds smart, although may or may not be worth the overhead! Do we have an existing field that would handle this, or would need to customize (I don't even remember now why we did it this way...)

We might have to create one. I'm guessing text_en_splitting is too aggressive, although it does something with hyphens. The minimum thing we could do is use the *_s_lower which treats the text as a keyword and lowercases it, which would let us eliminate the lowercasing.

mnaydan · 2024-04-23T13:12:41Z

Tested functionality described in testing notes as well as some advanced Solr syntax and fielded keyword searches. Everything looks good as far as I can tell. @rlskoeser kicking back to you.

rlskoeser · 2024-04-24T13:35:39Z

@mnaydan hyphenation filter is ready for you test now. Do you need help finding examples?

Here's a phrase I just found that works on the test site now but not in production: "ability to pass from one register"

mnaydan · 2024-04-24T14:27:13Z

This is cool! Thanks for the example. Here are some more examples of it working: "slight alterations of manuscript readings" working on test but not on production; "conflicting phenomena" working on test but not on production.

One strange case I found didn't highlight the whole phrase because - I figured out - there's a white space either at the end of the line or at the beginning of the next line. Here is the search without the white space, and the search with the white space.

Since this was a cool bonus feature anyway, I think I'm in agreement with your original impulse to keep it as is and close this issue. We can always revisit later and make a new issue if we want to add "hypen-whitespace" search functionality, after we have more information about the kinds of instances from our NLP work. @rlskoeser what do you think?

rlskoeser · 2024-04-24T14:42:40Z

@mnaydan thanks for the testing and the cool examples! I noticed, and see in your highlighting for the last example, that in a lot of cases the stemming is giving us matches for these hyphenated words, although not always.

I'm inclined to close this and get a release out with the Solr upgrade, and then keep it in mind when we decide how to handle hyphenations for the NLP work.

mnaydan added the chore label Jan 16, 2024

laurejt self-assigned this Mar 5, 2024

rlskoeser self-assigned this Mar 26, 2024

jerielizabeth assigned laurejt and unassigned laurejt Apr 15, 2024

mnaydan added awaiting testing and removed awaiting testing labels Apr 22, 2024

rlskoeser added a commit that referenced this issue Apr 23, 2024

Improve search for hyphenated words wrapped around lines #572

b51ad95

rlskoeser mentioned this issue Apr 23, 2024

Improve search for hyphenated words wrapped around lines #647

Merged

rlskoeser added the awaiting testing label Apr 24, 2024

mnaydan removed the awaiting testing label Apr 24, 2024

mnaydan closed this as completed Apr 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

upgrade to Solr 9 #572

upgrade to Solr 9 #572

mnaydan commented Jan 10, 2024 •

edited

rlskoeser commented Jan 16, 2024

laurejt commented Mar 5, 2024

rlskoeser commented Mar 8, 2024

laurejt commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

laurejt commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024 •

edited

rlskoeser commented Mar 26, 2024

laurejt commented Mar 26, 2024 •

edited

mnaydan commented Apr 23, 2024

rlskoeser commented Apr 24, 2024

mnaydan commented Apr 24, 2024

rlskoeser commented Apr 24, 2024

upgrade to Solr 9 #572

upgrade to Solr 9 #572

Comments

mnaydan commented Jan 10, 2024 • edited

testing notes

dev notes

rlskoeser commented Jan 16, 2024

laurejt commented Mar 5, 2024

rlskoeser commented Mar 8, 2024

laurejt commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

laurejt commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

rlskoeser commented Mar 26, 2024

laurejt commented Mar 26, 2024

laurejt commented Mar 26, 2024 • edited

rlskoeser commented Mar 26, 2024

laurejt commented Mar 26, 2024 • edited

mnaydan commented Apr 23, 2024

rlskoeser commented Apr 24, 2024

mnaydan commented Apr 24, 2024

rlskoeser commented Apr 24, 2024

mnaydan commented Jan 10, 2024 •

edited

laurejt commented Mar 26, 2024 •

edited

laurejt commented Mar 26, 2024 •

edited