Allows to use custom analysers in ES or Solr #223

davidclement90 · 2017-04-19T11:57:05Z

Issue : #222

Signed-off-by: David Clement david.clement90@laposte.net

janusgraph-bot · 2017-04-19T11:58:02Z

Committer of one or more commits is not listed as a CLA signer, either individual or as a member of an organization.

sjudeng · 2017-04-25T02:13:14Z

It doesn't look like support has been added for setting the custom analyzer, just leveraging it in queries. I think you could support setting the custom analyzer by adding a new "analyzer" parameter to ParameterType.java and then updating ElasticSearchIndex.java#register to support same (see example).

If nothing else this would allow a cleaner test implementation. Though it would also probably require some documentation updates to show an example of setting the custom analyzer. But I think adding this would make a more complete implementation. What do you think?

davidclement90 · 2017-04-26T11:41:16Z

You are right, this PR only allows leveraging a custom analyzer in queries for the Equals, NotEquals and Contains predicates. But it is a first step.
To use a custom analyzer, I use #233 (which I just backported from my Titan) and I manage my mapping in Kibana.
I tried to adapt ParameterType, but to declare a new analyzer in ES you usually need to update the index settings.
But updating an analyzer in the settings has to be done on a closed index. So it cannot be registered easily, which is why I use Kibana directly.
I think that to adapt ParameterType, we first need to separate the indexes (instead of having one index with multiple types, which is not really the best use of ES, see elastic/elasticsearch#15613, JanusGraph would have multiple indexes), which would make it easier to close an index. Then adapt ParameterType.

What do you think?

sjudeng · 2017-04-26T13:50:53Z

The idea of supporting an external mapping is interesting, I'll have to look over #233.

Regarding this PR, can the custom analyzer be set during initial mapping creation? If so I think there'd still be value in supporting this using the existing machinery within JanusGraph (e.g. ParameterType through ElasticSearchIndex.java#register). For one, I think this would make the feature more accessible for users who don't want to go deeper into ES mapping details. Also we could eventually look at adding corresponding support to the other indexing backends.

I don't think updating the analyzer needs to be supported (same is already true of field type, etc. defined in ElasticSearchIndex.java#register). But can you explain more on why this would require a separate index instead of type for each property? I'm just seeing something like the following added to the TEXT and TEXTSTRING cases in the above register method. Or am I missing something?

String analyzer = (String) ParameterType.ANALYZER.findParameter(information.getParameters(), null);
if (analyzer != null) mapping.field("analyzer", analyzer);
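For readers without the JanusGraph sources at hand, the optional-parameter pattern in the two lines above can be sketched in plain Java. This is only an illustration under stated assumptions: `ParameterLookupSketch`, its `findParameter` helper and the `Map`-based mapping are hypothetical stand-ins, not the JanusGraph `ParameterType` or Elasticsearch mapping-builder APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the lookup done in ElasticSearchIndex#register.
final class ParameterLookupSketch {

    // Returns the parameter value for the key, or the default when it is absent.
    public static Object findParameter(Map<String, Object> params, String key, Object defaultValue) {
        return params.getOrDefault(key, defaultValue);
    }

    // Builds a field mapping and adds "analyzer" only when one was supplied.
    public static Map<String, Object> buildTextMapping(Map<String, Object> params) {
        Map<String, Object> mapping = new LinkedHashMap<>();
        mapping.put("type", "string");
        String analyzer = (String) findParameter(params, "analyzer", null);
        if (analyzer != null) {
            mapping.put("analyzer", analyzer);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // With an analyzer parameter the mapping carries it; without one it is omitted.
        System.out.println(buildTextMapping(Map.of("analyzer", "english")));
        System.out.println(buildTextMapping(Map.of()));
    }
}
```

Because the default is `null`, a property registered without the parameter gets exactly the mapping it had before the change.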

davidclement90 · 2017-04-27T14:45:58Z

To use a custom analyzer like https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html, you need to declare it in the index settings.
So these settings can be added when the index is created. If you want to add another analyzer after creation, you need to close the index (which makes it unavailable), update the settings, then open it again. As JanusGraph has one index with multiple types, this action makes all mixed indexes unavailable. If there were one ES index per mixed index, this update would not make all mixed indexes unavailable.
But that's another issue.
For this one, I just discovered the ES_CREATE_EXTRAS_NS properties; I can use them to register all my custom analyzers when JanusGraph pushes my index.
So you are right. I am currently working on adding a new ParameterType, and I will also try to implement this feature on the other indexing backends.
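The close, update, open sequence described above can be sketched as the three REST calls it maps to. This is a minimal illustration, not code from the PR: the class name, the `janusgraph` index name and the `my_analyzer` definition (standard tokenizer plus lowercase filter) are assumptions, and the sketch only builds the request lines without sending them.

```java
import java.util.List;

// Hypothetical helper that lists the requests needed to add an analyzer to an existing index.
final class AnalyzerSettingsUpdateSketch {

    // JSON body declaring a custom analyzer under the index "analysis" settings.
    public static String customAnalyzerSettings(String analyzerName) {
        return "{\"analysis\":{\"analyzer\":{\"" + analyzerName + "\":"
                + "{\"type\":\"custom\",\"tokenizer\":\"standard\",\"filter\":[\"lowercase\"]}}}}";
    }

    // The index must be closed before analysis settings change, then reopened.
    public static List<String> updateSteps(String index, String analyzerName) {
        return List.of(
                "POST /" + index + "/_close",
                "PUT /" + index + "/_settings " + customAnalyzerSettings(analyzerName),
                "POST /" + index + "/_open");
    }

    public static void main(String[] args) {
        // The index is unavailable to every mixed index stored in it between the first and last call.
        updateSteps("janusgraph", "my_analyzer").forEach(System.out::println);
    }
}
```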

sjudeng · 2017-04-27T15:22:34Z

How about just adding support for setting the built-in analyzer to use (e.g. english, stop, etc.) and saving the case of defining a truly custom analyzer for a separate feature (or #233)? This is what you're testing anyway (e.g. setting the analyzer to english on a text field), so I think it would be enough just to support setting that through JanusGraph, which would avoid the need for a custom HTTPClient and the index create-delete-recreate steps in the test.

davidclement90 · 2017-04-28T12:20:37Z

I made the change. I can implement it in ES and Solr (it is not possible on Lucene).
So we can set the analyzer for Mapping.STRING and Mapping.TEXT fields, each with a separate ParameterType.
To use a truly custom analyzer, you are right, we should use #233.

sjudeng

Thanks for addressing the questions. I think this is much better. Some more minor feedback is below.

sjudeng · 2017-04-28T13:35:12Z

docs/elasticsearch.txt


 * Please refer to the https://www.elastic.co[Elasticsearch homepage] and available documentation for more information on Elasticsearch and how to setup an Elasticsearch cluster.
+
+=== Custom Analyzer


I wouldn't include this section here since you already provide this in the textsearch page below and this doesn't seem to fit here. Instead how about adding a bullet "Analyzer" or "Text Analyzer" or something, with a sentence or two description, to the list at the top of this page ("Full Text", "Geo", etc.)?

sjudeng · 2017-04-28T13:35:39Z

docs/solr.txt

 ///////////
+
+
+==== Custom Analyzer


Same comment as above ... consider adding a bullet to list at the top instead of this section here.

sjudeng · 2017-04-28T13:37:26Z

docs/textsearch.txt

+
+==== Custom Analyser
+
+By default, JanusGraph will use the default analyzer from the indexing backend for properties with mapping.TEXT, and no analyzer for properties with mapping.STRING. If one wants to use another analyzer, it can be explicitly specified through a parameter : ParameterType.TEXT_ANALYZER for mapping.TEXT and parameterType.STRING_ANALYZER for mapping.STRING.


Capitalization fixes: Mapping.STRING, Mapping.TEXT, ParameterType.STRING_ANALYZER, etc.

sjudeng · 2017-04-28T13:37:56Z

docs/textsearch.txt

+
+===== For Elasticsearch
+
+The name of the analyzer must be set as parameter value.


Include a "more information" link to ES documentation on analyzers?

sjudeng · 2017-04-28T13:38:52Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

            "which typically only happens the first time JanusGraph is started on top of ES. If the index JanusGraph is " +
            "configured to use already exists, then this setting has no effect.", ConfigOption.Type.MASKABLE, 200L);
-
+    


Remove whitespace

sjudeng · 2017-04-28T13:57:06Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

+            try {
+                ((Constructor<Tokenizer>) ClassLoader.getSystemClassLoader().loadClass(analyzer)
+                        .getConstructor()).newInstance();
+            } catch (InstantiationException | IllegalAccessException | IllegalArgumentException


Simplify by using base ReflectiveOperationException (e.g. catch (ReflectiveOperationException | IllegalArgumentException | SecurityException e))?
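As a rough illustration of the suggested simplification, the sketch below instantiates a class by name and handles every reflective failure through the common base type. It is self-contained and uses `java.util.ArrayList` instead of a Lucene `Tokenizer`; the class and method names are hypothetical and not taken from SolrIndex.java.

```java
// Hypothetical helper showing the single multi-catch in place of the longer exception list.
final class ReflectiveLoadSketch {

    // Instantiates a class by name through its public no-argument constructor.
    public static Object instantiate(String className) {
        try {
            return Class.forName(className).getConstructor().newInstance();
        } catch (ReflectiveOperationException | IllegalArgumentException | SecurityException e) {
            // ReflectiveOperationException covers ClassNotFoundException, NoSuchMethodException,
            // InstantiationException, IllegalAccessException and InvocationTargetException.
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(instantiate("java.util.ArrayList").getClass().getName());
        try {
            instantiate("no.such.Tokenizer");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```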

sjudeng · 2017-04-28T13:57:23Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

+            try {
+                ((Constructor<Tokenizer>) ClassLoader.getSystemClassLoader().loadClass(analyzer)
+                        .getConstructor()).newInstance();
+            } catch (InstantiationException | IllegalAccessException | IllegalArgumentException


catch (ReflectiveOperationException | IllegalArgumentException | SecurityException e)

sjudeng · 2017-04-28T14:00:42Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

+                terms.add(termAtt.getBytesRef().utf8ToString());
+            }
+            return terms;
+        } catch (InstantiationException | IllegalAccessException | IllegalArgumentException | InvocationTargetException


catch (ReflectiveOperationException | IllegalArgumentException | SecurityException e)

sjudeng · 2017-04-28T14:02:02Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

            }
        }
        //Since all data types must be defined in the schema.xml, pre-registering a type does not work
+        //But we check Analyse feature


Is this check here necessary since it looks like errors would be handled in your customTokenize method?

The check is there so that we fail during property registration rather than during the search, so the index will never contain a property with a bad configuration.
I think that if the configuration is wrong, JanusGraph should fail as soon as possible.

sjudeng · 2017-04-28T14:27:49Z

janusgraph-test/src/main/java/org/janusgraph/diskstorage/indexing/IndexProviderTest.java

-            put(TEXT,new StandardKeyInformation(String.class, Cardinality.SINGLE, new Parameter("mapping",
-                    indexFeatures.supportsStringMapping(Mapping.TEXT)?Mapping.TEXT:Mapping.TEXTSTRING)));
+            put(TEXT,new StandardKeyInformation(String.class, Cardinality.SINGLE,
+                    indexFeatures.supportsStringMapping(Mapping.TEXT)?Mapping.TEXT.asParameter():Mapping.TEXTSTRING.asParameter()));


I like how you cleaned these statements up. Going a little further how about (indexFeatures.supportsStringMapping(Mapping.TEXT)?Mapping.TEXT:Mapping.TEXTSTRING).asParameter())?

minor, is it OK to glue text to ?: ?
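The two forms of the ternary discussed above are equivalent, which a small stand-alone sketch can show. The `Mapping` enum and its `asParameter()` here are simplified, hypothetical analogues that return a string; they are not the JanusGraph types.

```java
// Hypothetical, simplified analogue of choosing a Mapping and turning it into a parameter.
final class TernaryParameterSketch {

    enum Mapping {
        TEXT, TEXTSTRING;

        // Renders the mapping as a "mapping=<NAME>" string for illustration only.
        String asParameter() {
            return "mapping=" + name();
        }
    }

    // Form used in the PR: asParameter() is called in both branches.
    public static String callInEachBranch(boolean supportsText) {
        return supportsText ? Mapping.TEXT.asParameter() : Mapping.TEXTSTRING.asParameter();
    }

    // Suggested form: pick the enum value first, then call asParameter() once.
    public static String callOnce(boolean supportsText) {
        return (supportsText ? Mapping.TEXT : Mapping.TEXTSTRING).asParameter();
    }

    public static void main(String[] args) {
        System.out.println(callInEachBranch(true).equals(callOnce(true)));
        System.out.println(callInEachBranch(false).equals(callOnce(false)));
    }
}
```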

amcp

minor comments

amcp · 2017-05-17T09:02:09Z

janusgraph-core/src/main/java/org/janusgraph/diskstorage/indexing/IndexFeatures.java

        this.defaultStringMapping = defaultMap;
        this.supportedStringMappings = supportedMap;
        this.wildcardField = wildcardField;
        this.supportedCardinaities = supportedCardinaities;


please fix spelling: supportedCardinalities

amcp · 2017-05-17T09:03:34Z

janusgraph-core/src/main/java/org/janusgraph/diskstorage/indexing/IndexFeatures.java

    private final ImmutableSet<Mapping> supportedStringMappings;
    private final String wildcardField;
    private final boolean supportsNanoseconds;
+    private final boolean supportCustomAnalyser;


two minor comments here

please use the same naming convention as supportsNanoseconds

Elasticsearch spelling here and elsewhere: supportsCustomAnalyzer

amcp · 2017-05-17T09:04:07Z

janusgraph-core/src/main/java/org/janusgraph/diskstorage/indexing/IndexFeatures.java

            return this;
        }
+
+        public Builder supportCustomAnalyser() {


supportsCustomAnalyzer

amcp · 2017-05-17T09:04:14Z

janusgraph-core/src/main/java/org/janusgraph/diskstorage/indexing/IndexFeatures.java

        private Set<Cardinality> supportedCardinalities = Sets.newHashSet();
        private String wildcardField = "*";
        private boolean supportsNanoseconds;
+        private boolean supportCustomAnalyser;


supportsCustomAnalyzer

amcp · 2017-05-17T09:04:29Z

janusgraph-core/src/main/java/org/janusgraph/diskstorage/indexing/IndexFeatures.java

        this.wildcardField = wildcardField;
        this.supportedCardinaities = supportedCardinaities;
        this.supportsNanoseconds = supportsNanoseconds;
+        this.supportCustomAnalyser = supportCustomAnalyser;


supportsCustomAnalyzer

amcp · 2017-05-17T09:06:59Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

                    case STRING:
-                        mapping.field("index","not_analyzed");
+                        if (stringAnalyzer != null) {
+                            mapping.field("analyzer", stringAnalyzer);


use constant

amcp · 2017-05-17T09:07:05Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

                    	break;
                    case TEXTSTRING:
+                        if (textAnalyzer != null) {
+                            mapping.field("analyzer",textAnalyzer);


use constant

amcp · 2017-05-17T09:07:10Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

                        mapping.field("type", "string");
-                        mapping.field("index","not_analyzed");
+                        if (stringAnalyzer != null) {
+                            mapping.field("analyzer", stringAnalyzer);


use constant

amcp · 2017-05-17T09:07:34Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

+                        if (stringAnalyzer != null) {
+                            mapping.field("analyzer", stringAnalyzer);
+                        }else{
+                            mapping.field("index","not_analyzed");


externalize index to a string constant

amcp · 2017-05-17T09:07:46Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

+                        if (stringAnalyzer != null) {
+                            mapping.field("analyzer", stringAnalyzer);
+                        } else {
+                            mapping.field("index","not_analyzed");


use constant
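The "use constant" requests above amount to replacing the repeated string literals with named constants. A minimal sketch of that refactoring follows, assuming a plain `Map` in place of the Elasticsearch mapping builder; `ANALYZER` matches the name that appears in the later revision of the diff, while `INDEX`, `NOT_ANALYZED` and the class itself are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the STRING-mapping branch with literals externalized to constants.
final class MappingConstantsSketch {

    static final String ANALYZER = "analyzer";
    static final String INDEX = "index";            // assumed constant name
    static final String NOT_ANALYZED = "not_analyzed"; // assumed constant name

    // Uses the given analyzer when present, otherwise keeps the field not analyzed.
    public static Map<String, String> stringFieldMapping(String stringAnalyzer) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("type", "string");
        if (stringAnalyzer != null) {
            mapping.put(ANALYZER, stringAnalyzer);
        } else {
            mapping.put(INDEX, NOT_ANALYZED);
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(stringFieldMapping("standard"));
        System.out.println(stringFieldMapping(null));
    }
}
```

With the keys defined once, a misspelled literal in one branch becomes a compile error instead of a silently wrong mapping.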

Custom analyzers can be set through a new ParameterType. Signed-off-by: David Clement <david.clement90@laposte.net>

sjudeng · 2017-05-23T01:56:03Z

@amcp Do the changes look okay here? Once this can get merged we can request #233 be rebased and then I'd like to add my review to it as well.

amcp · 2017-05-26T16:12:35Z

@sjudeng checking now

amcp

a few more minor changes please

amcp · 2017-05-26T16:17:16Z

janusgraph-core/src/main/java/org/janusgraph/diskstorage/indexing/IndexFeatures.java

        }
+
+        public Builder supportsCustomAnalyser() {
+            supportsCustomAnalyser = true;


use ES spelling: supportsCustomAnalyzer

amcp · 2017-05-26T16:17:32Z

janusgraph-core/src/main/java/org/janusgraph/diskstorage/indexing/IndexFeatures.java

            return this;
        }
+
+        public Builder supportsCustomAnalyser() {


use ES spelling: supportsCustomAnalyzer

amcp · 2017-05-26T16:17:36Z

janusgraph-core/src/main/java/org/janusgraph/diskstorage/indexing/IndexFeatures.java

        private Set<Cardinality> supportedCardinalities = Sets.newHashSet();
        private String wildcardField = "*";
        private boolean supportsNanoseconds;
+        private boolean supportsCustomAnalyser;


use ES spelling: supportsCustomAnalyzer

amcp · 2017-05-26T16:18:28Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java


    private static final IndexFeatures ES_FEATURES = new IndexFeatures.Builder()
-            .setDefaultStringMapping(Mapping.TEXT).supportedStringMappings(Mapping.TEXT, Mapping.TEXTSTRING, Mapping.STRING).setWildcardField("_all").supportsCardinality(Cardinality.SINGLE).supportsCardinality(Cardinality.LIST).supportsCardinality(Cardinality.SET).supportsNanoseconds().build();
+            .setDefaultStringMapping(Mapping.TEXT).supportedStringMappings(Mapping.TEXT, Mapping.TEXTSTRING, Mapping.STRING).setWildcardField("_all").supportsCardinality(Cardinality.SINGLE).supportsCardinality(Cardinality.LIST).supportsCardinality(Cardinality.SET).supportsNanoseconds().supportsCustomAnalyser().build();


use ES spelling: supportsCustomAnalyzer

amcp · 2017-05-26T16:18:46Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

-                        mapping.field("index","not_analyzed");
+                        if (stringAnalyzer != null) {
+                            mapping.field(ANALYZER, stringAnalyzer);
+                        }else{


spacing should be } else {

amcp · 2017-05-26T16:20:37Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

        //Since all data types must be defined in the schema.xml, pre-registering a type does not work
+        //But we check Analyse feature
+        String analyzer = (String) ParameterType.STRING_ANALYZER.findParameter(information.getParameters(), null);
+        if (analyzer !=null) {


add a space after !=

amcp · 2017-05-26T16:20:54Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

-                }
-                if (janusgraphPredicate == Text.PREFIX || janusgraphPredicate == Text.CONTAINS_PREFIX) {
+                    return tokenize(informations, value, key, janusgraphPredicate,  (String) ParameterType.TEXT_ANALYZER.findParameter(informations.get(key).getParameters(), null));
+                }else if (janusgraphPredicate == Text.PREFIX || janusgraphPredicate == Text.CONTAINS_PREFIX) {


add a space after the {

amcp · 2017-05-26T16:21:10Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

+                    String tokenizer = (String) ParameterType.STRING_ANALYZER.findParameter(informations.get(key).getParameters(), null);
+                    if(tokenizer != null){
+                        return tokenize(informations, value, key, janusgraphPredicate,tokenizer);
+                    }else{


fix spacing please } else {

amcp · 2017-05-26T16:21:22Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

+        List<String> terms;
+        if(tokenizer != null){
+            terms = customTokenize(tokenizer, (String) value);
+        }else{


fix spacing please } else {

amcp · 2017-05-26T16:21:37Z

janusgraph-solr/src/main/java/org/janusgraph/diskstorage/solr/SolrIndex.java

+            return terms;
+        } catch ( ReflectiveOperationException | IOException e) {
+                throw new IllegalArgumentException(e.getMessage(),e);
+        } finally{


add a space after finally

amcp · 2017-05-26T16:23:20Z

@sjudeng I found a few more things to fix, so one more round. Thanks.

sjudeng · 2017-05-27T12:53:03Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

-                        b.must(QueryBuilders.termQuery(fieldName, term));
-                    }
-                    return b;
+                if (janusgraphPredicate == Text.CONTAINS ||janusgraphPredicate == Cmp.EQUAL ) {


Add space after || and remove space before )

sjudeng · 2017-05-27T12:54:34Z

janusgraph-es/src/main/java/org/janusgraph/diskstorage/es/ElasticSearchIndex.java

-                    return QueryBuilders.termQuery(fieldName, (String) value);
                } else if (janusgraphPredicate == Cmp.NOT_EQUAL) {
-                    return QueryBuilders.boolQuery().mustNot(QueryBuilders.termQuery(fieldName, (String) value));
+                      return QueryBuilders.boolQuery().mustNot(QueryBuilders.matchQuery(fieldName, value).operator(Operator.AND));


Remove extra spaces

Signed-off-by: sjudeng <sjudeng@users.noreply.github.com>

sjudeng · 2017-05-27T21:33:00Z

@davidclement90 I pushed a commit to your branch with the requested code style updates.

@amcp Can you check this over again when you have time? I'm working on a separate update to ElasticSearchIndex which this PR is blocking because of conflicts, so I'd like to get it merged if possible.

amcp · 2017-05-28T16:02:47Z

@sjudeng merged this in, you are good to work on the other PR now.

janusgraph-bot added the cla: no and cla: yes labels and removed the cla: no label Apr 19, 2017

davidclement90 force-pushed the elasticsearch-match branch from ddf240b to 2fc3dd0 Compare April 28, 2017 12:10

davidclement90 changed the title ~~Use match instead of term for Equals, NotEquals and Contains predicates in ES.~~ Allows to use custom analysers in ES or Solr Apr 28, 2017

davidclement90 force-pushed the elasticsearch-match branch from 2fc3dd0 to e07d174 Compare April 28, 2017 12:23

sjudeng requested changes Apr 28, 2017

janusgraph-bot added the cla: yes label May 6, 2017

davidclement90 force-pushed the elasticsearch-match branch 3 times, most recently from 3f01044 to aece87e Compare May 10, 2017 12:00

sjudeng approved these changes May 10, 2017

amcp requested changes May 17, 2017

davidclement90 force-pushed the elasticsearch-match branch from aece87e to 8288d7e Compare May 18, 2017 09:02

This feature allows to use custom analysers in ES and Solr.

8288d7e

Custom analyzers can be set through a new ParameterType. Signed-off-by: David Clement <david.clement90@laposte.net>

amcp requested changes May 26, 2017

sjudeng reviewed May 27, 2017

Minor code style updates.

0f822c7

Signed-off-by: sjudeng <sjudeng@users.noreply.github.com>

sjudeng force-pushed the elasticsearch-match branch from 8f715d1 to 0f822c7 Compare May 27, 2017 21:15

amcp approved these changes May 28, 2017

amcp merged commit 2ec94a3 into JanusGraph:master May 28, 2017

davidclement90 deleted the elasticsearch-match branch June 7, 2017 11:50

bwatson-rti-org pushed a commit to bwatson-rti-org/janusgraph that referenced this pull request Mar 9, 2019

Merge pull request JanusGraph#223 from davidclement90/elasticsearch-m…

385db82

…atch Allows to use custom analysers in ES or Solr

micpod pushed a commit to micpod/janusgraph that referenced this pull request Nov 5, 2019

Merge pull request JanusGraph#223 from davidclement90/elasticsearch-m…

6e6ca7c

…atch Allows to use custom analysers in ES or Solr

