Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Media search indexing (Lombiq Technologies: OCORE-55) #9544

Merged
merged 39 commits into from
Jan 10, 2022

Conversation

Piedone
Copy link
Member

@Piedone Piedone commented May 25, 2021

Provides a simple implementation to index media for search. More precisely, it indexes media files related to content items so said content items will turn up in search when one of their media files matches the search query.

The following data can be indexed for each file referenced from a Media Field:

  • Media Text
  • Textual content of PDF files

@Piedone Piedone requested a review from agriffard as a code owner May 25, 2021 18:46
@agriffard
Copy link
Member

@Piedone
Copy link
Member Author

Piedone commented May 25, 2021

Guys, slow down :D.

@Skrypt
Copy link
Contributor

Skrypt commented May 25, 2021

Ok I will try it before adding any comments ;)

}
}

foreach (var path in field.Paths.Where(path => path.EndsWith(".pdf")))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sebastien's idea: Make this extensible, just via options (not DI).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just recognized I don't know what you meant by options @sebastienros. If IOptions<T> then I'm not sure how providing an extension point for services that would hook into this indexing work. I'd rather create an injectable IMediaFieldFileIndexHandler and run through them here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Piedone like the feature, it is the thing that I always recomend to clients not do do, is use too many pdfs, because indexing and seo is dificult with them.

re options: using Options<T> vs providers is essentially the same thing, just potentially easier to configure, and less expensive, in terms of DI lookups etc. There are probably more places we could have used the Options pattern, over a DI pattern.

Consider for example this could just be a dictionary of funcs on an options class.

public Dictionary<string, Func<Stream, string[]>> MediaPaths;

then in the MediaFieldIndexHandler

if (_options.TryGetValue(fileExtension, out var indexer)
{
   var values = indexer(fileStrem);
}

making it trivial to Configure a .txt file indexer etc al.

Of course exactly the same thing could be done with a provider, so as you prefer.

In terms of the FullText index, there are two ways to go here, as you have done, by adding the new fields to the Search Index, or using the Custom Full Text Settings liquid editor on a content type, to return these values into the FullTextIndex. I tend towards doing that more, as it makes it easier to differentiate between indexes that we might use internally for custom lucene queries, and not have to add so many fields to the index. For example with the full_text liquid aspect, we are also able to index into Widgets and BagPart items, and control which fields on these contained items are used for the full text index, from their own content definitions.

For example media text can already be added to a full text index, simply by returning (in a loop) {{ Content.MyPart.MyField.MediaTexts }}

What we also need at some point, is the same feature as the Full Text Index Settings for the BodyAspect to allow Widgets and BagParts to better contribute to a BodyAspect for rss feeds etc.

Whatever you decide in terms of providers vs options, it would be good, if it was reusable (later or now) in a liquid filter, to achieve the same thing as we can with | full_text

and what @Skrypt says about about the key names

Copy link
Member Author

@Piedone Piedone May 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for the clarification on options. I'd prefer the provider-based approach though: While if the configured file content extractor does nothing apart from processing the stream having such delegates in an options class is easier, if it needs to access injected services too it'd become cumbersome for the consumer (or we'd need to pass and IServiceProvider around as IBackgroundTask does).

As for how to handle FullText:

While handling MediaText can be done as follows too:

{% for mediaText in Model.Content.ContentTypeName.FieldName.MediaTexts %}
  {{ mediaText }}
{% endfor %}

I think doing something similar for file contents would complicate things quite a lot (both on the implementation and configuration side), and be an exception to what fields usually do (which is saving fields in a ContentFieldIndexHandler<T>).

In the wider scope I'd probably prefer all fields that are configured to be indexed (and stored and analyzed) to be automatically added to Content.ContentItem.FullText as well, so no further configuration is necessary to make it searchable too. So maybe a FullTextAspect-populating handler for each content field?

Copy link
Contributor

@Skrypt Skrypt Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the wider scope I'd probably prefer all fields that are configured to be indexed

I see your point and it kind of makes sense but not at all times. So, I guess we should have some global recipes for Content Types FullText indexing. This would be a global full-text configuration that would be overriden by the Content Type settings themselves. I'm not sure if it's a good idea to put everything in the full-text column while you can already search in every field with a Lucene Query. I kind of understand that since every content types have different fields it makes it harder to keep all the Content Types full-text settings to be properly configured when people are adding new fields which should be added to the full-text index. Though, it is wrong to assume that every field should be part of it. So this would require some more thinking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a field setting to add it to FullText optionally? Thus have a simple way to use them for search but keep the option open for more involved usage in custom queries.

@Piedone Piedone marked this pull request as draft June 7, 2021 23:37
Copy link
Member

@deanmarcussen deanmarcussen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take it out of draft @Piedone ?

@Piedone
Copy link
Member Author

Piedone commented Jun 23, 2021

@deanmarcussen it is a draft though, not finished at all.

# Conflicts:
#	src/OrchardCore.Build/Dependencies.props
#	src/docs/resources/libraries/README.md
Copy link
Member

@deanmarcussen deanmarcussen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to use an Options class, not DI, such that one filter can be registered per extension (dictionary). This allows flexibility to replace or override existing filters. And finding the filter for an extension is also more straightforward. Can also list then in an options screen (something @agriffard has done for some features to "understand" what's used by the system).

Example we have moved to this pattern: Content Parts and Fields, Liquid template options.

Both @sebastienros and I have suggested an options class for this. It makes much more sense based on how we are trying to do things now - allows much more flexibility to configure easily, and using DI for providers like this is really bad for performance (because the container instantiates everything, even if 9 times out of ten is is not necesary)

The concern about having to pass a service provider is easily resolvable.
Here are some examples of how this can be achieved.

https://github.com/OrchardCMS/OrchardCore/blob/main/src/OrchardCore/OrchardCore.Workflows.Abstractions/Helpers/ServiceCollectionExtensions.cs
https://github.com/OrchardCMS/OrchardCore/blob/main/src/OrchardCore/OrchardCore.Workflows.Abstractions/Helpers/ServiceProviderExtensions.cs

I would imagine an Options Dictionary here could be keyed by extension name, and contain a Type.

The MediaFieldIndexHandler then just injects the options dictionary, a service provider, and instantiates the type of handler as required.

This has a lot of advantages, because the options are able to easily add/remove configurations, the text Type becomes built as and when required, we lose the CanHandle

@Skrypt
Copy link
Contributor

Skrypt commented Nov 21, 2021

Ok, then @deanmarcussen if you can do these changes that would go faster. Else, we wait on these changes.

@Piedone
Copy link
Member Author

Piedone commented Nov 22, 2021

The docs are complete as far as I'm concerned @hishamco, what do you mean?

Well, @deanmarcussen you let me decide and Sebastien most possibly didn't read my prior explanation, so... :) But fine, I'll change it to options.

@hishamco
Copy link
Member

I mean code docs ;)

@Piedone
Copy link
Member Author

Piedone commented Dec 10, 2021

Implemented provider registration with options.

# Conflicts:
#	src/OrchardCore.Build/Dependencies.props
@Piedone
Copy link
Member Author

Piedone commented Jan 8, 2022

@deanmarcussen would you please check? Apart from your approval this is good to go.

Copy link
Member

@deanmarcussen deanmarcussen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah looks good, just a couple of tweaks.

Sorry did not see the original push.

return _mediaFileTextProviderRegistrations.Any();
}

private static string NormalizeFileExtension(string fileExtension)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can go, if you use ordinal ignore case on the dictionary.

Or you could keep it for the . exception, and just remove the ToUpperInvariant()

foreach (var path in field.Paths)
{
var extensionWithoutDot = Path.GetExtension(path).Substring(1);
var providerType = _mediaFileIndexingOptions.GetRegisteredMediaFileTextProvider(extensionWithoutDot);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you just prefixed a . to the registration methods, i.e. something like var extensions = ext.TrimStart('.') + '.' then it would be normalized to being with, and no need to allocate another string here, just for the SubString()`

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the AddMediaFileTextProvider() is more convenient if you don't have to always specify the dot. But probably this is worthwhile.

return null;
}

public bool AnyMediaFileTextProviders()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we normally do is just expose the
private readonly Dictionary<string, Type> _mediaFileTextProviderRegistrations = new Dictionary<string, Type>();

as a readonly dictionary, so we don't need all these extra methods, for Any() etc.

so

public IReadOnlyDictionary<string, Type> MediaFileTextProviders => _mediaFileTextProviders

but up to you

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd hide that implementation detail.

Piedone and others added 3 commits January 9, 2022 13:42
Co-authored-by: Dean Marcussen <dean.marcussen@gmail.com>
Co-authored-by: Dean Marcussen <dean.marcussen@gmail.com>
@Piedone
Copy link
Member Author

Piedone commented Jan 9, 2022

Implemented the suggestions.

}
}

if (_mediaFileIndexingOptions.AnyMediaFileTextProviders())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is actually unecesary, because this is not a feature, so pdf indexing is always enabled, so there will always be a registered pdf indexer.

Unless you implement a Remove method (I don't see that we need it, so I'd just drop the .Any() check)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, though defensive checks like this are useful if the feature is changed later. Anyway, removed it.


public MediaFileIndexingOptions RegisterMediaFileTextProvider(string fileExtension, Type providerType)
{
// Deliberate overwrite behavior so you can override registrations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion here was to add
var extension = fileExtension.TrimStart('.') + '.' during registration so that the . is always prefixed (this is a lazy singleton, so will only happen once), and it has been normalized.

At which point you can remove the NormalizeFileExtension method, because no need to check it? as we know that Path.GetExtension includes the . right?

Otherwise NormalizeFileExtension should prepend the . rather than throwing an exception, or it needs a different name, because it's not normalizing ;)

I'd just drop it myself - keep the whole thing simpler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a published part of the interface that any developer may use, just as GetRegisteredMediaFileTextProvider() (though that perhaps less so, I still see the utility in keeping it open). Thus, we need validation. And while as I mentioned having to always include a dot in AddMediaFileTextProvider() and RegisterMediaFileTextProvider() (something that other developers will use) is inconvenient, perhaps it's acceptable to reduce allocations, as you asked for: Because if we require it in GetRegisteredMediaFileTextProvider() then it would be confusing to not require the leading dot in these. Thus while NormalizeFileExtension()` is now indeed incorrectly named (fixed it) if it would prepend the dot then we're back at having additional allocations, or an inconsistent API. I find both of these unfavorable compared to having the current check.

@Piedone Piedone merged commit b0cc503 into OrchardCMS:main Jan 10, 2022
@LombiqBot LombiqBot deleted the issue/OCORE-55 branch February 3, 2022 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants