SOLR-12255: Add docs for Nori Korean tokenizer (apache#270)
ctargett committed Sep 6, 2021
1 parent aa5d93d commit 7d75657
Showing 3 changed files with 137 additions and 1 deletion.
2 changes: 1 addition & 1 deletion solr/solr-ref-guide/src/caches-warming.adoc
@@ -187,7 +187,7 @@ Ideally, this number should be as close to 1 as possible.

If you find that you have a low hit ratio but you've set your cache size high, you can optimize by reducing the cache size - there's no need to keep those objects in memory when they are not being used.

Another useful metric is the cache evictions, which measures the ojects removed from the cache.
Another useful metric is the cache evictions, which measures the objects removed from the cache.
A high rate of evictions can indicate that your cache is too small and increasing it may show a higher hit ratio.
Alternatively, if your hit ratio is high but your evictions are low, your cache might be too large and you may benefit from reducing the size.

9 changes: 9 additions & 0 deletions solr/solr-ref-guide/src/filters.adoc
@@ -2670,6 +2670,15 @@ If `true`, then individual tokens will be output if no shingles are possible.
+
The string to use when joining adjacent tokens to form a shingle.

`fillerToken`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `_` (underscore)
|===
+
The string used to fill in for removed stop words in order to preserve position increments.
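As a hypothetical illustration (the field type name and stop word file are invented for this sketch), a shingle filter that marks removed stop words with `*` might be configured as:

[source,xml]
----
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- remove stop words, leaving position gaps -->
    <tokenizer name="standard"/>
    <filter name="stop" words="stopwords.txt"/>
    <!-- fill each gap with "*" instead of the default "_" -->
    <filter name="shingle" fillerToken="*"/>
  </analyzer>
</fieldType>
----

With this configuration, a phrase such as "the quick fox" (with "the" as a stop word) would produce shingles containing `*` in the position where "the" was removed.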

*Example:*

Default behavior.
127 changes: 127 additions & 0 deletions solr/solr-ref-guide/src/language-analysis.adoc
@@ -1079,6 +1079,7 @@ The languages covered here are:
| <<Italian>>
| <<Irish>>
| <<Japanese>>
| <<Korean>>
| <<Latvian>>
| <<Norwegian>>
| <<Persian>>
@@ -1093,6 +1094,8 @@ The languages covered here are:
| <<Thai>>
| <<Turkish>>
| <<Ukrainian>>
|
|
|===

=== Arabic
@@ -2419,6 +2422,130 @@ Example:
====
--

=== Korean

The Korean (nori) analyzer integrates Lucene's nori analysis module into Solr.
It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic] dictionary to perform morphological analysis of Korean text.

The dictionary was built with http://taku910.github.io/mecab/[MeCab] and defines a format for the features adapted for the Korean language.

Nori also has a user dictionary feature that allows overriding the statistical model with your own entries for segmentation, part-of-speech tags, and readings, without needing to specify weights.
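As an illustrative sketch (the file name and entries are hypothetical), each line of a user dictionary lists a surface form, optionally followed by the segments it should decompose into:

[source,text]
----
# userdict_ko.txt: custom nouns and compounds
c++
세종시 세종 시
----

The tokenizer would then reference this file via its `userDictionary` parameter:

[source,xml]
----
<tokenizer name="korean" userDictionary="userdict_ko.txt" userDictionaryEncoding="UTF-8"/>
----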

*Example*:

[.dynamic-tabs]
--
[example.tab-pane#byname-lang-korean]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer name="korean" decompoundMode="discard" outputUnknownUnigrams="false"/>
<filter name="koreanPartOfSpeechStop" />
<filter name="koreanReadingForm" />
<filter name="lowercase" />
</analyzer>
</fieldType>
----
====

[example.tab-pane#byclass-lang-korean]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
<filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
<filter class="solr.KoreanReadingFormFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
----
====
--


==== Korean Tokenizer

*Factory class*: `solr.KoreanTokenizerFactory`

*SPI name*: `korean`

*Arguments*:

`userDictionary`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Path to a user-supplied dictionary to add custom nouns or compound terms to the default dictionary.

`userDictionaryEncoding`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Character encoding of the user dictionary.

`decompoundMode`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `discard`
|===
+
Defines how to handle compound tokens. The options are:

* `none`: No decomposition for tokens.
* `discard`: Tokens are decomposed and the original form is discarded.
* `mixed`: Tokens are decomposed and the original form is retained.
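As a rough illustration (assuming the default dictionary segments the compound 세종시 into 세종 + 시), the three modes would tokenize 세종시 approximately as:

[source,text]
----
none:    세종시
discard: 세종, 시
mixed:   세종시, 세종, 시
----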

`outputUnknownUnigrams`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `false`
|===
+
If `true`, unigrams will be output for unknown words.

`discardPunctuation`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `true`
|===
+
If `true`, punctuation will be discarded.

==== Korean Part of Speech Stop Filter

This filter removes tokens that match a set of part-of-speech tags.

*Factory class*: `solr.KoreanPartOfSpeechStopFilterFactory`

*SPI name*: `koreanPartOfSpeechStop`

*Arguments*: None.

==== Korean Reading Form Filter

This filter replaces term text with the Reading Attribute, the Hangul transcription of Hanja characters.

*Factory class*: `solr.KoreanReadingFormFilterFactory`

*SPI name*: `koreanReadingForm`

*Arguments*: None.

[[hebrew-lao-myanmar-khmer]]
=== Hebrew, Lao, Myanmar, Khmer

