Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade to Solr 9 #572

Closed
14 tasks done
mnaydan opened this issue Jan 10, 2024 · 26 comments
Closed
14 tasks done

upgrade to Solr 9 #572

mnaydan opened this issue Jan 10, 2024 · 26 comments
Assignees
Labels

Comments

@mnaydan
Copy link
Contributor

mnaydan commented Jan 10, 2024

testing notes

  • confirm that site search works as expected with solr 9
    • metadata search
    • keyword search
    • filtering
    • sorting
  • confirm that you can search for words hyphenated across line breaks as single words

dev notes

  • update solr config files to work with solr 9
  • update index fields to use dynamic fields
  • update solr queryset to use the new dynamic fields and alias to the old names
  • figure out why unit tests are hanging
  • add support for searching on hyphenated words across line ends

note: use [skip ci] on commits until we get the hanging unit test issue resolved, so we don't have tests hanging on github actions

@mnaydan mnaydan added the chore label Jan 16, 2024
@rlskoeser
Copy link
Contributor

refer to commits on Princeton-CDH/geniza#1359

@laurejt
Copy link
Contributor

laurejt commented Mar 5, 2024

With this upgrade, we'll need to make updates to the solr config file (solrconfig.xml) since there've been major updates to the cache implementations and the existing ones are deprecated as of 9.0.

Note that this also seems to relate to some of the failed tests I'm running into (I'm using Solr 8.11.2).

@laurejt laurejt self-assigned this Mar 5, 2024
@rlskoeser
Copy link
Contributor

@laurejt I pulled your branch and uploaded the config files to a new configset in my existing docker solr 9.2 instance. I was able to create a new collection with that configset, and had no problem indexing works (python manage.py index -i work).

When I tried to index page content (python manage.py index_pages) I saw an error that sounds like what you described - it looks like the label field is defined as an integer in solr, but not all of our page labels are integers. When I switched the page indexing to index it as a string, it looked like indexing was working ok.

Not sure if we want to commit index field changes yet, so here's what I did:

  • in ppa.archive.models line 1190 I changed "label" to "label_s"
  • in ppa.archive.solr line 194 I changed the key "label" to "label_s" (although I didn't really test this change yet)

@rlskoeser rlskoeser self-assigned this Mar 26, 2024
@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

Potential Issue: deprecation of Trie fields

Is it okay to just convert these instances to their corresponding Point fields? It looks like the "equivalent" declarations do not set the positionIncrementGap nor the precisionStep field properties (when they're set to 0).

@rlskoeser
Copy link
Contributor

I found another thing we need to handle, just added to our checklist: we have a couple of copy fields defined in the old version of the schema https://github.com/Princeton-CDH/ppa-django/blob/main/solr_conf/conf/managed-schema#L529-L530

This is so we can index things two different ways: in this case, one version for tokenized text-based search and one for facet/sort.

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

Is there a reason that managed-schema#L135, doesn't set positionIncrementGap to 100?

@rlskoeser
Copy link
Contributor

Is it okay to just convert these instances to their corresponding Point fields? It looks like the "equivalent" declarations do not set the positionIncrementGap nor the precisionStep field properties (when they're set to 0).

Thanks for flagging. I think the only date field we have is the last modified timestamp, which we aren't faceting on; we only use it for last-modified header checks. We use an integer field for the pubdate, which we do facet on. So I think this is ok, but we'll want to check the last-modified header behavior. (Could be formatting issues, maybe)

@rlskoeser
Copy link
Contributor

Is there a reason that managed-schema#L135, doesn't set positionIncrementGap to 100?

Probably because I didn't know about this setting! It looks useful for non-contiguous text blocks; I can't think of anywhere this wouldn't be helpful.

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

Oddly, additional TrieDateField-based fields were defined (tdate, tdates), however they don't appear to be used by anything. They only appear to be used to create dynamic fields (_tdt, _tdts) that don't appear to be used anywhere within the codebase.

I'm going to go ahead and ignore / drop these, since the type is deprecated as it is.

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

Is it okay to just convert these instances to their corresponding Point fields? It looks like the "equivalent" declarations do not set the positionIncrementGap nor the precisionStep field properties (when they're set to 0).

Thanks for flagging. I think the only date field we have is the last modified timestamp, which we aren't faceting on; we only use it for last-modified header checks. We use an integer field for the pubdate, which we do facet on. So I think this is ok, but we'll want to check the last-modified header behavior. (Could be formatting issues, maybe)

One thing to note is that this holds true for all other basic numeric types (e.g., int, double, long), but maybe this doesn't really matter since it could just be an indexing thing?

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

For text_string is there a reason lowercase filtering is being invoked given that ICU folding effectively incorporates this behavior?

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

Worth highlighting that the field type text_gen_sort (dyanamic field: _t_sort) behaves differently thantext_string which is used for sort_title

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

For work_type is it worth doing the text transformations within Solr rather than within Python (e.g., models.py#L939-41)?

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

Should dynamic fields of the form *_t be multiValued? I'm guessing this is a mistake but right now this property is set to true because that's the "default" inherited from text_general.

@rlskoeser
Copy link
Contributor

For text_string is there a reason lowercase filtering is being invoked given that ICU folding effectively incorporates this behavior?

Sounds like an oversight and/or leftover; let's remove the redundant filter.

@rlskoeser
Copy link
Contributor

Worth highlighting that the field type text_gen_sort (dyanamic field: _t_sort) behaves differently thantext_string which is used for sort_title

I think that the facet and sort fields are ones we will want to look at once we get things basically working with solr 9 - I remember customizing those some (like adding unicode folding, I think?) but not the specifics. I do remember we had to use text instead of string because you can't apply filters like unicode folding to string fields.

@rlskoeser
Copy link
Contributor

For work_type is it worth doing the text transformations within Solr rather than within Python (e.g., models.py#L939-41)?

get_item_type_display is a django method you get automatically for a field that has choices defined - see https://docs.djangoproject.com/en/5.0/ref/models/instances/#django.db.models.Model.get_FOO_display . It makes more sense to me to keep with the python code than to put in a solr config, because if we change it in python it will automatically propagate without an additional step.

@rlskoeser
Copy link
Contributor

Should dynamic fields of the form *_t be multiValued? I'm guessing this is a mistake but right now this property is set to true because that's the "default" inherited from text_general.

I like the idea of making them consistent with the other fields! The inheritance makes that more complicated, doesn't it? So then would *_t be explicitly not multivalued and *_ts (if we need one) would be multi-valued?

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

For work_type is it worth doing the text transformations within Solr rather than within Python (e.g., models.py#L939-41)?

get_item_type_display is a django method you get automatically for a field that has choices defined - see https://docs.djangoproject.com/en/5.0/ref/models/instances/#django.db.models.Model.get_FOO_display . It makes more sense to me to keep with the python code than to put in a solr config, because if we change it in python it will automatically propagate without an additional step.

To clarify, I don't meant to replace that part; just the lowercasing and removal of hyphens that occurs after.

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

Should dynamic fields of the form *_t be multiValued? I'm guessing this is a mistake but right now this property is set to true because that's the "default" inherited from text_general.

I like the idea of making them consistent with the other fields! The inheritance makes that more complicated, doesn't it? So then would *_t be explicitly not multivalued and *_ts (if we need one) would be multi-valued?

In the example / default schema *_t is explicitly not multivalued (i.e. multiValued="false"), while the multi-valued version *_txt is implicitly multi-valued.

@rlskoeser
Copy link
Contributor

To clarify, I don't meant to replace that part; just the lowercasing and removal of hyphens that occurs after.

Oh! Thank you for clarifying, I didn't read it properly with the code wrapping on GitHub. That sounds smart, although may or may not be worth the overhead! Do we have an existing field that would handle this, or would need to customize (I don't even remember now why we did it this way...)

@laurejt
Copy link
Contributor

laurejt commented Mar 26, 2024

To clarify, I don't meant to replace that part; just the lowercasing and removal of hyphens that occurs after.

Oh! Thank you for clarifying, I didn't read it properly with the code wrapping on GitHub. That sounds smart, although may or may not be worth the overhead! Do we have an existing field that would handle this, or would need to customize (I don't even remember now why we did it this way...)

We might have to create one. I'm guessing text_en_splitting is too aggressive, although it does something with hyphens. The minimum thing we could do is use the *_s_lower which treats the text as a keyword and lowercases it, which would let us eliminate the lowercasing.

@mnaydan
Copy link
Contributor Author

mnaydan commented Apr 23, 2024

Tested functionality described in testing notes as well as some advanced Solr syntax and fielded keyword searches. Everything looks good as far as I can tell. @rlskoeser kicking back to you.

@rlskoeser
Copy link
Contributor

@mnaydan hyphenation filter is ready for you test now. Do you need help finding examples?

Here's a phrase I just found that works on the test site now but not in production: "ability to pass from one register"

@mnaydan
Copy link
Contributor Author

mnaydan commented Apr 24, 2024

This is cool! Thanks for the example. Here are some more examples of it working: "slight alterations of manuscript readings" working on test but not on production; "conflicting phenomena" working on test but not on production.

One strange case I found didn't highlight the whole phrase because - I figured out - there's a white space either at the end of the line or at the beginning of the next line. Here is the search without the white space, and the search with the white space.

Since this was a cool bonus feature anyway, I think I'm in agreement with your original impulse to keep it as is and close this issue. We can always revisit later and make a new issue if we want to add "hypen-whitespace" search functionality, after we have more information about the kinds of instances from our NLP work. @rlskoeser what do you think?

@rlskoeser
Copy link
Contributor

@mnaydan thanks for the testing and the cool examples! I noticed, and see in your highlighting for the last example, that in a lot of cases the stemming is giving us matches for these hyphenated words, although not always.

I'm inclined to close this and get a release out with the Solr upgrade, and then keep it in mind when we decide how to handle hyphenations for the NLP work.

@mnaydan mnaydan closed this as completed Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

No branches or pull requests

3 participants