Wildcard search in GroupedOr() #383

Closed
bjarnef opened this issue May 3, 2024 · 13 comments

@bjarnef
Contributor

bjarnef commented May 3, 2024

I may be missing something, but in a current project I have something like the following to do a free-text search for bæredygtighed (sustainability) using only the prefix bæredygtig, without hed.

filter.And().GroupedOr(searchFields, words[0].EscapeRegexSpecialCharacters().MultipleCharacterWildcard());

and the raw lucene query:

+(+__IndexType:content) +hideFromSearch:0 +(__NodeTypeAlias:category) +(nodeName:bæredygtig*)

but it returns no results:

image

But the same query in the backoffice (which I think uses NativeQuery()) does return results.

image

Umbraco v12.3.9
Examine v3.1.0

@Shazwazza
Owner

I assume this is down to the analyzer you are using. The backoffice uses a culture-invariant analyzer - see https://github.com/umbraco/Umbraco-CMS/blob/8b878a7aa6302a1ee060f4f601cd6994ca178e3f/src/Umbraco.Examine.Lucene/DependencyInjection/ConfigureIndexOptions.cs#L36

the external one uses a standard analyzer.

Under the hood, the CultureInvariantWhitespaceAnalyzer is this: https://github.com/Shazwazza/Examine/blob/release/3.0/src/Examine.Lucene/Analyzers/CultureInvariantWhitespaceAnalyzer.cs

That is a whitespace analyzer + LowerCaseFilter + ASCIIFoldingFilter (which folds non-ASCII characters to their plain ASCII equivalents).

You could try this for the external index: https://github.com/Shazwazza/Examine/blob/release/3.0/src/Examine.Lucene/Analyzers/CultureInvariantStandardAnalyzer.cs

which is the same as above, but with standard analyzer instead of whitespace.
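
To see what either analyzer actually writes into the index, the produced terms can be listed directly. The following is only a minimal sketch, assuming Lucene.NET 4.8 is referenced; `AnalyzerProbe` and the field name `nodeName` are illustrative choices, not part of Examine.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper (not part of Examine): lists the terms an analyzer
// produces for a piece of text, i.e. what would be stored in the index.
public static class AnalyzerProbe
{
    public static IReadOnlyList<string> Analyze(Analyzer analyzer, string text)
    {
        var terms = new List<string>();

        // The field name only matters for per-field analyzers; "nodeName" is illustrative.
        using (TokenStream stream = analyzer.GetTokenStream("nodeName", new StringReader(text)))
        {
            var term = stream.AddAttribute<ICharTermAttribute>();

            stream.Reset(); // must be called before the first IncrementToken()
            while (stream.IncrementToken())
            {
                terms.Add(term.ToString());
            }
            stream.End();
        }

        return terms;
    }
}
```

Calling `AnalyzerProbe.Analyze(new CultureInvariantWhitespaceAnalyzer(), "bæredygtighed")` and comparing the result with the term in the raw query (`bæredygtig*`) shows whether the two can line up. As far as I know, wildcard terms are not run through the analyzer at query time, so a folded index term and an unfolded wildcard term would not match.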

@bjarnef
Contributor Author

bjarnef commented May 3, 2024

That's for InternalIndex in the backoffice global search (or when searching InternalIndex from the Examine dashboard, I guess).

ExternalIndex is using the standard analyzer
https://github.com/umbraco/Umbraco-CMS/blob/8b878a7aa6302a1ee060f4f601cd6994ca178e3f/src/Umbraco.Examine.Lucene/DependencyInjection/ConfigureIndexOptions.cs#L41

I tried setting Analyzer in configuration, but it didn't seem to make a difference - and I think it would only be necessary if I wanted to change the default in Umbraco :)

image

I will check with CultureInvariantStandardAnalyzer.

@bjarnef
Contributor Author

bjarnef commented May 3, 2024

Actually, it was InternalIndex I was searching for this specific task, since it needs to include unpublished nodes and search category nodes in the backoffice.

image

image

which doesn't return results from Examine dashboard in InternalIndex:
image

but in ExternalIndex:
image

@Shazwazza
Owner

I'm pretty sure this is because of the analyzer. You can test by searching with the ASCII-folded characters instead (for example ae in place of æ).

@Shazwazza
Owner

The backoffice uses the culture-invariant analyzer to try to provide a reasonable all-rounder experience for anyone working in the back office. If you have a very specific language structure in your entire site and all of your editors work in the same language, then you can change the default analyzer to Standard, or whatever suits your team.

@bjarnef
Contributor Author

bjarnef commented May 3, 2024

Yeah, I tried this

private const LuceneVersion _luceneVersion = LuceneVersion.LUCENE_48;

case Umbraco.Cms.Core.Constants.UmbracoIndexes.InternalIndexName:
    //options.Analyzer = new CultureInvariantWhitespaceAnalyzer();
    options.Analyzer = new StandardAnalyzer(_luceneVersion);
    break;

but it didn't seem to return the results with Danish characters.

I found something like this if we want to customize/extend a specific analyzer. Not sure if it has been documented.
https://stackoverflow.com/a/14811453

Will investigate further :)

@Shazwazza
Owner

That link just shows what we already have for the CultureInvariantStandardAnalyzer https://github.com/Shazwazza/Examine/blob/release/3.0/src/Examine.Lucene/Analyzers/CultureInvariantStandardAnalyzer.cs

@bjarnef
Contributor Author

bjarnef commented May 3, 2024

Yes :)

Actually I have this instead:

case Umbraco.Cms.Core.Constants.UmbracoIndexes.InternalIndexName:
    options.Analyzer = new StandardAnalyzer(LuceneInfo.CurrentVersion);
    break;

I would have assumed the search returned the same results as searching ExternalIndex:
https://github.com/umbraco/Umbraco-CMS/blob/8b878a7aa6302a1ee060f4f601cd6994ca178e3f/src/Umbraco.Examine.Lucene/DependencyInjection/ConfigureIndexOptions.cs#L41

but I recall the NativeQuery() sometimes returns different results than e.g. default search using StandardAnalyzer on ExternalIndex.
https://github.com/umbraco/Umbraco-CMS/blob/8b878a7aa6302a1ee060f4f601cd6994ca178e3f/src/Umbraco.Examine.Lucene/BackOfficeExamineSearcher.cs#L152

I tried replacing the analyzer with CultureInvariantStandardAnalyzer instead, but it seems it also returns zero results for the term bæredygtig or bæredygtighed:

case Umbraco.Cms.Core.Constants.UmbracoIndexes.InternalIndexName:
    options.Analyzer = new CultureInvariantStandardAnalyzer();
    break;
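
For context, the `case` fragments above would sit inside a named-options configurator along these lines. This is a sketch under assumptions: the class names `ConfigureInternalIndexAnalyzer` and `InternalIndexAnalyzerComposer` are made up for illustration, and the registration through a composer follows the usual Umbraco pattern rather than anything shown in this thread.

```csharp
using Examine.Lucene;
using Examine.Lucene.Analyzers;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;
using Umbraco.Cms.Core;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;

// Hypothetical class name: overrides the analyzer for InternalIndex only.
public sealed class ConfigureInternalIndexAnalyzer : IConfigureNamedOptions<LuceneDirectoryIndexOptions>
{
    public void Configure(string? name, LuceneDirectoryIndexOptions options)
    {
        // Leave every other index with whatever Umbraco configured for it.
        if (name == Constants.UmbracoIndexes.InternalIndexName)
        {
            options.Analyzer = new CultureInvariantStandardAnalyzer();
        }
    }

    // Unnamed configuration is routed through the named overload.
    public void Configure(LuceneDirectoryIndexOptions options)
        => Configure(Options.DefaultName, options);
}

// Hypothetical composer that registers the configurator with the DI container.
public sealed class InternalIndexAnalyzerComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.Services.ConfigureOptions<ConfigureInternalIndexAnalyzer>();
}
```

One thing that may be worth ruling out: documents already in the index were tokenized with the previous analyzer, so the index probably needs a rebuild (for example from the Examine dashboard) before a changed analyzer can affect what is found.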

@bjarnef
Contributor Author

bjarnef commented May 3, 2024

I couldn't make it work by replacing the analyzer, so for now I have this workaround instead to replace the Danish letters æ, ø and å before passing in term to query:

if (!string.IsNullOrEmpty(term))
{
    var replacement = new Dictionary<string, string>
    {
        { "æ", "ae" },
        { "ø", "o" },
        { "å", "a" }
    };

    term = term.ToLowerInvariant().ReplaceMany(replacement);
}

Then it finds results like bæredygtighed, grøn and affaldshåndtering.

@bjarnef
Contributor Author

bjarnef commented May 3, 2024

I noticed there's a ScandinavianFoldingFilter and ScandinavianNormalizationFilter

The difference is:

ScandinavianFoldingFilter
blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj

ScandinavianNormalizationFilter
blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj

I wonder if it makes sense to be able to use a different filter than ASCIIFoldingFilter in CultureInvariantStandardAnalyzer?

I tried making a copy of that and used ScandinavianFoldingFilter instead, but the raw query still includes the Danish letter, and although I can see the DefaultAnalyzer change based on configuration, it didn't seem to have any effect on the results except when I use the replacements in #383 (comment)
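
A copy of the analyzer with ScandinavianFoldingFilter swapped in could look roughly like this. It is only a sketch, assuming Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package; the class name `ScandinavianFoldingStandardAnalyzer` is hypothetical and the actual copy used is not shown in this thread.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Hypothetical analyzer: standard tokenizer + lower-casing + Scandinavian folding
// (in place of the ASCIIFoldingFilter step).
public sealed class ScandinavianFoldingStandardAnalyzer : Analyzer
{
    private const LuceneVersion Version = LuceneVersion.LUCENE_48;

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = new StandardTokenizer(Version, reader);

        // Lower-case first, then fold the Scandinavian characters.
        TokenStream stream = new LowerCaseFilter(Version, tokenizer);
        stream = new ScandinavianFoldingFilter(stream);

        return new TokenStreamComponents(tokenizer, stream);
    }
}
```

The filter only changes text that passes through the analyzer. A wildcard term such as `bæredygtig*` appears to bypass it, which would be consistent with the raw query still containing the Danish letter. Going by the folding examples above, this filter maps æ to a rather than ae, so a manual replacement on the query side would have to follow the same mapping as whichever filter built the index.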

@bjarnef
Contributor Author

bjarnef commented May 6, 2024

@Shazwazza btw, with the current logic, without any InternalIndex configuration or custom analyzer set, this finds results when searching for the exact word bæredygtighed:

if (string.IsNullOrEmpty(searchTerm))
{
    return filter;
}

searchTerm = searchTerm.Replace("-", string.Empty);

var words = Tokenize(QueryParserBase.Escape(searchTerm)).ToArray();

filter.And().GroupedOr(searchFields, words);

image

but it seems to be related to wildcard search as you mentioned here:
umbraco/Umbraco-CMS#11176 (comment)

@Shazwazza
Owner

@bjarnef Appreciate all the feedback and research here but ultimately this comes down to how analyzers are configured for the various indexes in Umbraco.

My advice to get to the bottom of this is to run simple tests, for example clone the Examine repo and create a test case in FluentApiTests. This is quite easy and will let you iterate more quickly when validating results and expectations. As I don't see this being an Examine bug, I will close this issue, but feel free to comment on it. I'm more than happy to make tweaks to Examine where it makes sense, but in this case I don't think this is Examine-specific; it is mostly down to how Umbraco configures the indexes.
