Elasticsearch feature ✨💥 (Lombiq Technologies: OSOE-83) #11052

Merged
merged 314 commits into from
Sep 20, 2022

Skrypt
Contributor

@Skrypt Skrypt commented Jan 22, 2022

Fixes #4316

How to use:

Install Elasticsearch 7.x with Docker compose

Elasticsearch uses a mmapfs directory by default to store its indices. The default operating system limit on mmap counts is likely to be too low, which may result in out-of-memory exceptions.

https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html

For Docker with WSL2, you will need to persist this setting by using a .wslconfig file.

In your Windows %userprofile% directory (typically C:\Users\<username>) create or edit the file .wslconfig with the following:

[wsl2]
kernelCommandLine = "sysctl.vm.max_map_count=262144"

Then exit any WSL instance, run wsl --shutdown, and restart. You can verify the setting from inside WSL:

> sysctl vm.max_map_count
vm.max_map_count = 262144
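
The check above can also be scripted. The following is a minimal sketch in Python, assuming the sysctl output has the `vm.max_map_count = N` form shown above. The function names are illustrative only and are not part of Elasticsearch or Orchard Core.

```python
import re

# Value set through .wslconfig in the step above.
REQUIRED_MAX_MAP_COUNT = 262144


def parse_max_map_count(sysctl_output: str) -> int:
    """Extract the integer from a line like 'vm.max_map_count = 262144'."""
    match = re.search(r"vm\.max_map_count\s*=\s*(\d+)", sysctl_output)
    if match is None:
        raise ValueError(f"unexpected sysctl output: {sysctl_output!r}")
    return int(match.group(1))


def is_sufficient(sysctl_output: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True when the reported limit is at least the required value."""
    return parse_max_map_count(sysctl_output) >= required
```

Inside WSL, the input could come from `subprocess.run(["sysctl", "vm.max_map_count"], capture_output=True, text=True).stdout`.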

Elasticsearch v7.17.5 Docker Compose file:

elasticsearch.txt

  • Copy this file into a folder named Docker somewhere safe.
  • Change the file extension from .txt to .yml.
  • Open a terminal or command shell in this folder.
  • Execute docker-compose up to deploy the Elasticsearch containers.

Advice: keep this file in its folder if you later want to remove all of its containers at once in Docker Desktop.

You should get this result in the Docker Desktop app:

image

Set up Elasticsearch in Orchard Core

  • Add Elastic Connection in the shell configuration (OrchardCore.Cms.Web appsettings.json file)
"OrchardCore_Elastic": {
    "ConnectionType": "SingleNodeConnectionPool",
    "Url": "http://localhost",
    "Ports": "9200"
}
  • Start an Orchard Core instance with the VS Code debugger.
  • Go to the Orchard Core features and enable Elasticsearch.
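
Before enabling the feature, it can help to see which node address the settings above describe. This is a rough sketch in Python, assuming that the node address is simply the Url joined with each port. How Orchard Core actually combines these values is not stated here, so treat `node_urls` as a hypothetical helper.

```python
from typing import List


def node_urls(settings: dict) -> List[str]:
    """Build 'url:port' strings from a connection settings section.

    Assumption: 'Ports' may be a single value, a comma-separated string,
    or a list; each port is appended to 'Url'.
    """
    url = settings["Url"].rstrip("/")
    ports = settings["Ports"]
    if isinstance(ports, (str, int)):
        ports = str(ports).split(",")
    return [f"{url}:{str(port).strip()}" for port in ports]


# The section from appsettings.json shown above.
elastic_settings = {
    "ConnectionType": "SingleNodeConnectionPool",
    "Url": "http://localhost",
    "Ports": "9200",
}
```

Opening the resulting address in a browser (or requesting it with any HTTP client) should return the Elasticsearch cluster information when the containers are running.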

Implementation details

The Analyzed and Stored properties are not very meaningful in the context of Elasticsearch.

Analyzed

Analyzed is the default for strings in Elasticsearch.
Because of automatic mapping, all string fields are by default indexed twice in Elasticsearch: as a "Text" field and as a "Keyword" field.

So we will have an analyzed field called ContentItemId (Text) and another, automatically created field called ContentItemId.keyword for matching exact values with a TermQuery, for fields like ContentItemId or emails.

Elasticsearch documentation:
https://www.elastic.co/blog/strings-are-dead-long-live-strings
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-index-search-time.html
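
To make the double indexing concrete, here is a small Python sketch of the mapping that Elasticsearch's default dynamic mapping produces for a string field: a `text` field with a `keyword` sub-field. The helper names are illustrative and the `ignore_above` value of 256 is the Elasticsearch default, not something configured by this PR.

```python
def default_string_mapping(ignore_above: int = 256) -> dict:
    """Mapping generated by default dynamic mapping for a string value."""
    return {
        "type": "text",  # analyzed, used for full-text search
        "fields": {
            # not analyzed, used for exact matches, sorting and aggregations
            "keyword": {"type": "keyword", "ignore_above": ignore_above}
        },
    }


def exact_match_field(field: str) -> str:
    """Name of the automatically created sub-field for exact matching."""
    return f"{field}.keyword"
```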

Stored

Stored is really an overhead and is only required if we are processing thousands of large documents.
By default, Elasticsearch stores the entire document in a field called _source and retrieves it from the index itself when asked.

Elasticsearch documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-fields.html

Lucene vs Elasticsearch

Here is a small table to compare Lucene and Elasticsearch (string) types:

| Lucene | Elasticsearch | Description | When Stored | Search Query type |
| --- | --- | --- | --- | --- |
| StringField | Keyword | A field that is indexed but not tokenized: the entire value is indexed as a single token | original value AND indexed | stored fields because indexed as a single token. |
| TextField | Text | A field that is indexed and tokenized, without term vectors | original value AND indexed | analyzed fields. Also known as full-text search |
| StoredField | stored in _source by mapping configuration | A field containing original value (not analyzed) | original value | stored fields |

DSL Query Syntax

It is suggested to always use MatchQuery instead of TermQuery for text fields in Elastic. Where you are fully confident that an exact match is needed, use the .keyword fields with TermQuery (e.g. matching an id, or fields like email, phone number, hostname).

Elasticsearch documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
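
As an illustration of the two query shapes, the sketch below builds the JSON bodies of a match query and a term query as Python dictionaries. The builder functions and the sample values are hypothetical; the field name is taken from the IndexingConstants mentioned in this PR.

```python
import json


def match_query(field: str, text: str) -> dict:
    """Full-text query: the text is analyzed before matching."""
    return {"query": {"match": {field: {"query": text}}}}


def term_query(field: str, value: str) -> dict:
    """Exact-value query: the value is compared as-is, without analysis."""
    return {"query": {"term": {field: {"value": value}}}}


# Full-text search on the analyzed field.
full_text = match_query("Content.ContentItem.DisplayText", "orchard core")

# Exact match on the automatically created keyword sub-field.
exact = term_query("Content.ContentItem.DisplayText.keyword", "Orchard Core")

if __name__ == "__main__":
    print(json.dumps(full_text, indent=2))
    print(json.dumps(exact, indent=2))
```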

NEST CheatSheet:

https://github.com/mjebrahimi/Elasticsearch-NEST-CheatSheet-Tutorials/blob/master/README.md

TODO

  • Refactor Search form to support both Lucene and Elastic. 🚧
  • Restructure project names using OrchardCore.Search.X ♻️ 🚚
  • Add advanced support for multilingual search. (Equivalent of Lucene Analyzers) ✨
  • Refactor UI to hide Analyzed as Indexing Options for Fields as it is now the default for Lucene too. ♻️ 💄
  • Refactor ContentIndexSettings as IContentIndexSettings : See Refactor ContentIndexSettings to IContentIndexSettings ♻️ #10515
  • Move ContentIndexSettings drivers to the OrchardCore.Indexing module so that they are common to every Search provider.
  • Replace Query type with Nest.IQuery in the Elastic Search module. We should not parse the Queries with the OC Lucene Parser. Nest.IQuery is on par with Elasticsearch Query DSL.
  • Recipe backward compatibility refactor. (migration) "not necessary anymore"
  • UserName/Password configurations
  • Fix ElasticContentPickerResultProvider by adding a method in the ElasticIndexManager that allows to delegate the _elasticClient and thus allowing using fluent Queries.
  • Add Lucene "keyword" implementation.
  • Fix Elasticsearch document Id mapping (was not able to remove older document by Id when publishing a new ContentItem).
  • Add advanced Elastic configuration. (Cluster connection strings) - Optional
  • Configure error handling. See : Recommended way of handling errors. elastic/elasticsearch-net#1793
  • Change how we index custom fields like "Normalized, Sanitized, Inherited" by using an underscore instead of a dot as a separator else Elasticsearch fails to bind the data because of already used Mapping.
  • Add an index setting for allowing to store "_source" on the index for Elasticsearch. Defaults to true.
  • Using Elasticsearch deserializer to parse JSON queries
  • Make Analyzers configuration work on each index.
  • Background Task. Indexing state (Last Task Id) persisted in each Elasticsearch index.
  • Secure ElasticIndexManager methods to operate only on current tenant indices.
  • Add SQL migrations.cs file where needed. More likely in the Lucene module.
  • Module documentation. 📝
  • Create a migration for Lucene content index settings. Set to "Keyword" everything that was set as "Included".
  • Create a migration for Elasticsearch content index settings. Get the settings from the Lucene content index settings. Move this migration implementation to a controller action that can be executed by a user at any time instead of having it only as part of a migration.
  • Add Rebuild and Reset indices recipe steps.

Migration

Manual migration to get back Lucene Indices Settings, Deployment plans, and Queries. (Reference only)

  UPDATE Document SET Content = REPLACE(content, '\"$type\":\"OrchardCore.Lucene.Deployment.LuceneIndexDeploymentStep, OrchardCore.Lucene\"', '\"$type\":\"OrchardCore.Search.Lucene.Deployment.LuceneIndexDeploymentStep, OrchardCore.Search.Lucene\"')
  WHERE [Type] = 'OrchardCore.Deployment.DeploymentPlan, OrchardCore.Deployment.Abstractions'

  UPDATE Document SET Content = REPLACE(content, '\"$type\":\"OrchardCore.Lucene.Deployment.LuceneSettingsDeploymentStep, OrchardCore.Lucene\"', '\"$type\":\"OrchardCore.Search.Lucene.Deployment.LuceneSettingsDeploymentStep, OrchardCore.Search.Lucene\"')
  WHERE [Type] = 'OrchardCore.Deployment.DeploymentPlan, OrchardCore.Deployment.Abstractions'

  UPDATE Document SET Content = REPLACE(content, '\"$type\":\"OrchardCore.Lucene.Deployment.LuceneIndexResetDeploymentStep, OrchardCore.Lucene\"', '\"$type\":\"OrchardCore.Search.Lucene.Deployment.LuceneIndexResetDeploymentStep, OrchardCore.Search.Lucene\"')
  WHERE [Type] = 'OrchardCore.Deployment.DeploymentPlan, OrchardCore.Deployment.Abstractions'

  UPDATE Document SET Content = REPLACE(content, '\"$type\":\"OrchardCore.Lucene.Deployment.LuceneIndexRebuildDeploymentStep, OrchardCore.Lucene\"', '\"$type\":\"OrchardCore.Search.Lucene.Deployment.LuceneIndexRebuildDeploymentStep, OrchardCore.Search.Lucene\"')
  WHERE [Type] = 'OrchardCore.Deployment.DeploymentPlan, OrchardCore.Deployment.Abstractions'

  UPDATE Document SET Content = REPLACE(content, '"$type":"OrchardCore.Lucene.LuceneQuery, OrchardCore.Lucene"', '"$type":"OrchardCore.Search.Lucene.LuceneQuery, OrchardCore.Search.Lucene"')
  WHERE  [Type] = 'OrchardCore.Queries.Services.QueriesDocument, OrchardCore.Queries'

  UPDATE Document SET [Type] = 'OrchardCore.Search.Lucene.Model.LuceneIndexSettingsDocument, OrchardCore.Search.Lucene'
  WHERE [Type] = 'OrchardCore.Lucene.Model.LuceneIndexSettingsDocument, OrchardCore.Lucene'
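
To preview what the Content replacements above do before running them against a database, the same renames can be applied to a JSON string in memory. This is only a sketch: `migrate_content` is a hypothetical helper, it assumes the unescaped `"$type":"..."` form (as in the LuceneQuery statement), and it does not cover the last statement, which changes the [Type] column rather than Content.

```python
# Type name suffixes taken from the UPDATE statements above.
RENAMED_TYPES = [
    "Deployment.LuceneIndexDeploymentStep",
    "Deployment.LuceneSettingsDeploymentStep",
    "Deployment.LuceneIndexResetDeploymentStep",
    "Deployment.LuceneIndexRebuildDeploymentStep",
    "LuceneQuery",
]


def migrate_content(content: str) -> str:
    """Rewrite old OrchardCore.Lucene $type discriminators to the new names."""
    for suffix in RENAMED_TYPES:
        old = f'"$type":"OrchardCore.Lucene.{suffix}, OrchardCore.Lucene"'
        new = f'"$type":"OrchardCore.Search.Lucene.{suffix}, OrchardCore.Search.Lucene"'
        content = content.replace(old, new)
    return content
```

Running the function twice gives the same result as running it once, since the new discriminators no longer match the old pattern.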

Breaking Changes

IndexingConstants changes:

| Constant | Before | After |
| --- | --- | --- |
| DisplayTextKey | Content.ContentItem.DisplayText | Content.ContentItem.DisplayText.keyword |
| ContainedPartKey + IdsKey (new) | Content.ContentItem.ContainedPart.ListContentItemId | Content.ContentItem.ContainedPart.Ids |

Taxonomies module indexing

You can now access the term ids of a taxonomy field by using "{ContentTypeName}.{FieldName}.Ids".

Queries migration

Elasticsearch maps the data automatically, which means that Text fields will always be Analyzed. You can now access the "Stored" value of a Text field by adding ".keyword" as a suffix to the field name. This means that you can use a TermQuery on the ".keyword" field and a MatchQuery on the base field name.
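
For existing queries, the change amounts to pointing term clauses at the ".keyword" sub-field. The sketch below walks a query body and does that rewrite. It is a hypothetical helper that assumes every field used in a term clause is a text field; numeric or date fields have no ".keyword" sub-field and would need to be excluded.

```python
def to_keyword_terms(node):
    """Return a copy of a query body where term clauses target '<field>.keyword'."""
    if isinstance(node, list):
        return [to_keyword_terms(item) for item in node]
    if not isinstance(node, dict):
        return node
    result = {}
    for key, value in node.items():
        if key == "term" and isinstance(value, dict):
            # Append the suffix unless the field already has it.
            result[key] = {
                (field if field.endswith(".keyword") else field + ".keyword"): v
                for field, v in value.items()
            }
        else:
            result[key] = to_keyword_terms(value)
    return result
```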

Permissions

ManageIndexes is now ManageLuceneIndexes.

Lucene indexation

| Before | After | Action |
| --- | --- | --- |
| Indexed | Indexed | Indexed meant "Keyword" in Lucene so we migrated these to "Keyword" in the content index settings. |
| Analyzed | Keyword | The Analyzed option is removed. Everything that was set as analyzed doesn't need migration because it is now the default. |
| Stored | Stored | Nothing to do. Should work as before |

Notes

There is a new client for Elasticsearch 8.x.
Nest has been renamed to Elastic.Clients.Elasticsearch.

Also, OpenSearch will not work with this PR.

There is an initiative to move Elastic.Clients.Elasticsearch to OpenSearch here:

https://github.com/opensearch-project/opensearch-net

This means that eventually OpenSearch and Elasticsearch will need to become 2 distinct features.

So to summarize compatibility :

Nest = OpenSearch 1.x, Elasticsearch 7.x, Elasticsearch 8.x (in compatibility mode)
https://github.com/elastic/elasticsearch-net/tree/7.17

Elastic.Clients.Elasticsearch (replacing Nest) = Elasticsearch 8.x
https://github.com/elastic/elasticsearch-net
https://www.nuget.org/packages/Elastic.Clients.Elasticsearch/

OpenSearch.Client (Nuget package not released yet) = OpenSearch 2.x
opensearch-project/opensearch-build#2051

@Skrypt Skrypt added the breaking change 💥 and notready labels Jan 22, 2022
@Skrypt Skrypt closed this Jan 22, 2022
@Skrypt Skrypt deleted the skrypt/elasticsearch branch January 22, 2022 00:29
@Skrypt Skrypt restored the skrypt/elasticsearch branch January 22, 2022 00:33
@Skrypt Skrypt reopened this Jan 22, 2022
@Skrypt Skrypt added this to the 1.3 milestone Jan 22, 2022
@Skrypt Skrypt modified the milestones: 1.3, 1.x Mar 9, 2022
@Skrypt Skrypt changed the title from "Elasticsearch / OpenSearch feature ✨💥" to "Elasticsearch / OpenSearch feature ✨💥 (Lombiq Technologies: OSOE-83)" Mar 25, 2022
@Skrypt
Contributor Author

Skrypt commented Mar 30, 2022

Squashed commits.

@Piedone
Member

Piedone commented May 31, 2022

Any news on this by chance?

@Skrypt
Contributor Author

Skrypt commented May 31, 2022

Ping me on Teams or else when you get time.

@hyzx86
Contributor

hyzx86 commented Jun 18, 2022

Is there any problem with this PR?

@Skrypt
Contributor Author

Skrypt commented Jun 18, 2022

I'm about to start working on it. I'm still on vacation ...

@hyzx86
Contributor

hyzx86 commented Jul 12, 2022

Found some problems:

  • The document total is never greater than 10: fix document total never greate than 10 #11999
  • We need to check the ES connection before operating on an ES index. Currently, when an Elastic index is created before the ES service has started, the operation succeeds, but the index doesn't work and can't be deleted or updated.

@Skrypt
Contributor Author

Skrypt commented Jul 12, 2022

@Piedone I'm starting to work on this PR today.
@hyzx86 Not sure what you mean by this:

We need to check the ES connection before operating on an ES index. Currently, when an Elastic index is created before the ES service has started, the operation succeeds, but the index doesn't work and can't be deleted or updated.

@hyzx86
Contributor

hyzx86 commented Jul 12, 2022

@Skrypt

  1. I stop my ES service.
  2. I create an ES index.
  3. After a long time, the operation succeeds.

But in fact, the check is wrong, for the reasons shown below. So I don't think we should just return a Boolean value here. Maybe we can throw an exception or return the ExistsResponse directly.

image

@hyzx86
Contributor

hyzx86 commented Jul 12, 2022

Should we check the connection status before each ES operation?

Or throw an exception every time the API connection fails?

@Skrypt
Contributor Author

Skrypt commented Jul 12, 2022

We need to prevent indexes from being created if we don't have a working ES connection.
This is more of a UI thing in that case. We need to display a notification that there is no working ES connection when going to the page where we can add indexes. Also, we need to throw an exception as a safety net in these methods.

@Skrypt
Contributor Author

Skrypt commented Jul 22, 2022

Explanation of the Search feature module changes (with sound):

2022-07-22.13-08-22.mp4

@Piedone
Member

Piedone commented Jul 22, 2022

Awesome! Would it be possible to merge the Lucene and Elastic UX, to make Elastic a drop-in replacement for the search backend? Like how you manage Media files the same way regardless of whether they're stored locally or in Azure Blob Storage.

@Skrypt
Contributor Author

Skrypt commented Jul 22, 2022

They are complementary. Both can be used at the same time. But of course, you could use only Elastic Search if that's what you want instead of Lucene.

Right now, there are discrepancies between both implementations. You can't use a Lucene Query and copy/paste it as an Elastic Query because everything is "analyzed" with Elastic Search. So a Term Query becomes a Match Query. But, to remove this discrepancy we would need to make the Lucene implementation work like Elastic Search which would take more time.

I will take some more time to make sure we are doing the proper thing here by setting every document as "Analyzed" with Elastic Search. But the thing is that with Elastic Search everything is also always stored as I documented in the first post.

So, right now, we can't just switch from Lucene to Elastic Search seamlessly unless we would keep the same ContentIndexSettings for both implementations. I removed the "Stored" option for Elastic because it was unnecessary.

Also, another thing to take into consideration: we parse Lucene Queries with our own QueryParser implementation, which follows Lucene 4.0 standards, while Nest targets Elasticsearch 7.x, not to mention that Elasticsearch 8.x will use a newer implementation of Nest, which is being renamed to Elastic.Clients.Elasticsearch https://github.com/elastic/elasticsearch-net

Lots of moving parts with Elastic compared with Lucene. Maybe it is better to keep Lucene and Elastic implementations unique for now.

@Piedone
Member

Piedone commented Jul 22, 2022

From this, it seems to me that if these two are independent features, we'll have two increasingly diverging Lucene-based search implementations. I see this as an issue because, while Elastic offers a lot more than vanilla Lucene, I'd also consider it the "production" version of search: otherwise, you need to employ workarounds in a production app unless a single instance is all there is to your environment (as opposed to staged publishing, blue-green deployment, or multi-node hosting). Contrast this with the Lucene module, which just works without any external dependencies and is thus perfect for local development and testing. Again, it's like local storage vs. Azure Blob Storage: you can use both anywhere in principle, but most likely you want to use local storage while you write the code and simply switch to Blob Storage in production.

I don't have any insights into how, when, or even whether it makes sense to tackle this, but wanted to share this viewpoint.

@Skrypt
Contributor Author

Skrypt commented Sep 16, 2022

@Piedone Does the Pull Request answer all your needs so far as a first iteration? On my side, I think it is ready as it is. There is only one comment on your side about the Migration class name which I won't take action on for now. I need to understand the reasoning behind this. Maybe @jtkech had some thoughts about this because of his tenant removal feature. I'm not sure if we should start having different migration files per module when there is only one feature in that module. One migration file per feature makes more sense.

Else, if we want to get this merged I need to get approvals from the main contributors. So please add your comments/reviews.

@jtkech
Member

jtkech commented Sep 16, 2022

I will soon reply to the related comment, hmm, and to another one I just saw related to an async call, maybe the last one first ;)

@jtkech
Member

jtkech commented Sep 16, 2022

@Skrypt I made 2 comments related to the async delegate and the migration. They don't appear in the main thread; maybe I forgot to submit them as a review, so look at the file changes.

@Skrypt
Contributor Author

Skrypt commented Sep 16, 2022

Can't find it either. There is too much stuff in here.

@Piedone
Member

Piedone commented Sep 16, 2022

Once you've addressed all my comments, please re-request review.

@Skrypt Skrypt requested a review from Piedone September 17, 2022 04:27
//"OrchardCore_Elasticsearch": {
// "ConnectionType": "SingleNodeConnectionPool",
// "Url": "http://localhost",
// "Ports": [ 9200 ]
Member

The comma is still missing here.

Skrypt and others added 5 commits September 18, 2022 18:55
…/Admin/Index.cshtml

Co-authored-by: Zoltán Lehóczky <zoltan.lehoczky@lombiq.com>
Fix search bar and buttons
Lucene Settings for deployment steps.
@sebastienros sebastienros merged commit 87b4271 into main Sep 20, 2022
@sebastienros sebastienros deleted the skrypt/elasticsearch branch September 20, 2022 20:04
@Piedone
Member

Piedone commented Sep 23, 2022

Why did you merge, @sebastienros? As you can see above, I was reviewing, and after our last one I was waiting for Jasmin re-requesting my review.

@Skrypt
Contributor Author

Skrypt commented Sep 23, 2022

This Pull Request has become way too big to make any more reviews. There are 287 items hidden because this page is too big. Also, I think the Elasticsearch module is working well enough to merge it at least into the main branch so that people can try it and report issues. If there are any other issues we can open individual issues about them now.

Also, I think it is ready enough to start using it.

If you prefer @Piedone we can open a single issue about Elasticsearch where we will add a list of tasks to do. And I will open a new Pull Request if needed where we will be able to start fresh.

@Piedone
Member

Piedone commented Sep 23, 2022

I'm talking solely about reviewing changes that I requested last time.

@Skrypt
Contributor Author

Skrypt commented Sep 23, 2022

Yes, they were taken care of.

https://github.com/OrchardCMS/OrchardCore/blob/main/src/OrchardCore.Cms.Web/appsettings.json#L127

I did not merge the "Synchronize Elasticsearch content index settings with Lucene." because somehow you should already know that you are in the Elasticsearch indices list. The other ones have been marked as resolved and are fixed.

As I said, if there are issues remaining, nothing prevents us from creating new issues now. Else, maybe that's because you wanted to keep the OSOE-83 label?

Also, at this point, if there was any major issue it should have been found and fixed.

@Piedone
Member

Piedone commented Sep 23, 2022

OK, but I didn't get a chance to check them before this was merged ;). I've been repeatedly reviewing this PR, including the changes you've made based on my feedback. So, before merging it, I'd have appreciated being able to do that for the last batch of changes too, to make sure that no issues slipped through (like it also happened immediately before that). In short, wait for approval before merge as usual. That is all. It's not about the label, nor about branching off new issues (what we've already done).

@Skrypt
Contributor Author

Skrypt commented Sep 23, 2022

True, we merged this at Tuesday's meeting with @sebastienros's approval. Also, we are merging the PRs that are set to be included in the 1.5 release this week, so it needed to be merged.

branching off new issues (what we've already done)

I'm probably not understanding that last sentence. I did not see any new branch in here for fixing Elasticsearch issues yet?

@Skrypt
Contributor Author

Skrypt commented Sep 23, 2022

Also, I said at Tuesday's meeting that I was going to keep supporting this feature. If you guys have found any other issues and have branched off on your private repository then I'm not being advised about them. If that is the issue then please advise.

I think we can say that the Elasticsearch module is functional and that for the first iteration it is good enough to be merged. We need to ship at some point else this PR becomes a never-ending iteration of things to change in Lucene and in Elasticsearch. If there are fundamental design issues with the Elasticsearch module then I should be advised about it. Never heard about anything related to that yet. So, it is now merged and I'm surprised by your reaction because I thought you would have been happy about it being merged sooner than later.

@Piedone
Member

Piedone commented Sep 25, 2022

What I meant by "branching off new issues (what we've already done)" is that you've already opened issues before to address things that came up in this PR but were out of scope of it, so we agreed not to address it in this one. I.e., not to grow this PR indefinitely.

Again, my issue is only what I described above, nothing more, nothing else, and nothing hidden. Yes, naturally I also wanted this to be merged as soon as possible. But it's not the act of the merge itself but rather being ready to be merged as soon as possible. Waiting those perhaps <24 hours that it'd have taken me to check the new changes after you request review wouldn't have been against this.

Labels
breaking change 💥 Issues or pull requests that introduces breaking change(s)
Development

Successfully merging this pull request may close these issues.

Implement ElasticSearch module
6 participants