Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Examinex escape incorrectly value for -1 #100

Open
bielu opened this issue Apr 8, 2024 · 11 comments
Open

Examinex escape incorrectly value for -1 #100

bielu opened this issue Apr 8, 2024 · 11 comments
Labels
question Further information is requested

Comments

@bielu
Copy link

bielu commented Apr 8, 2024

Hi @Shazwazza ,
When working on project where examinex was requested by client, I found similar bug to Novicell/Novicell.Examine.ElasticSearch#17
When passing my field value as -1, it is the lucene query retrieve by examinex:
+(+x__IndexType:content) +x__Published:y -hideFromSearch:1 +spath:1
which cause incorrect results as my spath has indexed -1!
In my elastic search provider i resolved it by escaping negative ints, not sure if that is correct approach for this however.

Edit: Forgot mentioned versions:
Umbraco 13.2.2
Examinex 6.0.0-beta.2

@Shazwazza
Copy link
Contributor

The - is a reserved character in Lucene so the lucene parser will remove it unless you escape it for searching with prefixed backslashes.

Can you please advise on what field type you are searching on and what query you are using to search along with what data is in that field? I need this to try to replicate.

There are several unit tests in ExamineX for testing paths with -1 so need to know specifically what and how you are searching.

Thanks!

@bielu
Copy link
Author

bielu commented Apr 8, 2024

It is example what I do with query:
image
and there is field type:
image
and there is how it is index:
image
it is interesting that it works with normal examine on lucene and generate correctly -1 😂 but as soon it comes here it doesnt

@Shazwazza
Copy link
Contributor

So spath is being indexed as a normal full text field type with a standard analyzer, so all stop words and punctuation, etc... are stripped out - is this what you are expecting?

Similarly, because the field type is full text and the analyzer is standard for that field, when it parses the query, it will strip off reserved words and punctuation (like the minus symbol). If you want to retain the minus symbol in your search query then you would need to escape the value, like request.ListingId.ToString().Escape()

But I don't know what the value for ListingId is, can you please specify? Is it always a specific number or is it a path value, etc... ? Perhaps standard analyzer isn't what you should be using on this field, it really depends on what this field is.

@bielu
Copy link
Author

bielu commented Apr 8, 2024

Listing id is always int, so values would be spath can be index as keyword, but it still not explain why switching to azure search makes difference between how query looks 🤔
i mean here, based on docs, the indexing and querying should be 1:1 with lucene examine. So i would expect both to return same lucene query :)

@Shazwazza Shazwazza added the question Further information is requested label Apr 8, 2024
@Shazwazza
Copy link
Contributor

I am assuming that the ToString() for your query, regardless of whether ExamineX is enabled or not will result in this:

+(+x__IndexType:content) +x__Published:y -hideFromSearch:1 +spath:1 ?

With the - character stripped out? It is the same query parser used in both so I would expect the query to be generated the same way.

I can help in investigating why this would work with Lucene as opposed to Azure Search if I can replicate.

Based on what your requirements are, I would suggest trying to escape the value passed to the query (i.e. request.ListingId.ToString().Escape() and see if that works in both engines, else I would think whitespace analyzer would work well. Another alternative, would be to use a keyword analyzer (i.e. Raw examine field) which is what the normal Umbraco Path field is indexed as and then use path queries that are escaped (that is ... if this field is actually like an Umbraco path).

@bielu
Copy link
Author

bielu commented Apr 8, 2024

@Shazwazza yes I didnt change anything around the code of this, when installing examinex, but I will need play around little mroe to figure out little more or change analyzer :)

@Shazwazza
Copy link
Contributor

What is the exact value that is being sent to the index for spath? I see that it ends up as "-1 -20 1250" in the Azure Search index, but what is the value being sent there? Is it "-1,-20,1250"?

@bielu
Copy link
Author

bielu commented Apr 8, 2024

the exact submited value is "-1 -20 1250", what I need check if it is just a string or collection as I was sure I am submitting collection 🤔

@Shazwazza
Copy link
Contributor

I put a couple of rudimentary tests together and the results are the same between Examine and ExamineX with Azure Search:

Examine Test:

[Test]
public void Can_Search_On_Negative_Numbers_With_Standard_Analyzer()
{
    var analyzer = new StandardAnalyzer(LuceneInfo.CurrentVersion);
    using (var luceneDir = new RandomIdRAMDirectory())
    using (var indexer = GetTestIndex(luceneDir, analyzer))
    {
        indexer.IndexItems(new[] {
            ValueSet.FromObject(1.ToString(), "content",
                new { spath = "-1 -20 1250" }),
            ValueSet.FromObject(2.ToString(), "content",
                new { spath = "-1 -20 1251" }),
            ValueSet.FromObject(3.ToString(), "content",
                new { spath = "-1 -20 1252" }),
            ValueSet.FromObject(4.ToString(), "content",
                new { spath = "-1 -20 1253" }),
        });

        var searcher = (BaseLuceneSearcher)indexer.Searcher;

        var query = searcher
            .CreateQuery(IndexTypes.Content)
            .Field("spath", "-1");

        Console.WriteLine(query);
        var results = query.Execute();

        Assert.AreEqual(4, results.TotalItemCount);
    }
}

ExamineX Test:

[Test]
public void Can_Search_On_Negative_Numbers_With_Standard_Analyzer()
{
    using (AzureIndex.ProcessNonAsync())
    {
        AzureIndex.IndexItems(new[] {
            ValueSet.FromObject(NewId(), "content",
                new { spath = "-1 -20 1250" }),
            ValueSet.FromObject(NewId(), "content",
                new { spath = "-1 -20 1251" }),
            ValueSet.FromObject(NewId(), "content",
                new { spath = "-1 -20 1252" }),
            ValueSet.FromObject(NewId(), "content",
                new { spath = "-1 -20 1253" }),
        });
    }

    var s = GetDelayedSearcher();

    var query = s
        .CreateQuery(IndexTypes.Content)
        .Field("spath", "-1");

    Console.WriteLine(query);
    var results = query.Execute();

    Assert.AreEqual(4, results.TotalItemCount);
}

In both cases, the query is parsed exactly the same: { Category: content, LuceneQuery: +spath:1 }

which is expected because the field is using StandardAnalyzer in both scenarios.

I'm not sure if this helps at all, but might help show what is different between these tests vs how your query/index is put together.

@bielu
Copy link
Author

bielu commented Apr 17, 2024

I debugged it and it is weird one, but found workaround so for now. It seems like for some reason it works when you index values as string, but not when you index them as collection 🤔

@Shazwazza
Copy link
Contributor

Ok thanks, I will try to replicate indexing the values that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants