SOLR-12255: Add docs for Nori Korean tokenizer (apache#270)
ctargett committed Sep 6, 2021
1 parent aa5d93d commit 7d75657
Showing 3 changed files with 137 additions and 1 deletion.
2 changes: 1 addition & 1 deletion solr/solr-ref-guide/src/caches-warming.adoc
@@ -187,7 +187,7 @@ Ideally, this number should be as close to 1 as possible.

If you find that you have a low hit ratio but you've set your cache size high, you can optimize by reducing the cache size - there's no need to keep those objects in memory when they are not being used.

Another useful metric is the cache evictions, which measures the ojects removed from the cache.
Another useful metric is the cache evictions, which measures the objects removed from the cache.
A high rate of evictions can indicate that your cache is too small and increasing it may show a higher hit ratio.
Alternatively, if your hit ratio is high but your evictions are low, your cache might be too large and you may benefit from reducing the size.

9 changes: 9 additions & 0 deletions solr/solr-ref-guide/src/filters.adoc
@@ -2670,6 +2670,15 @@ If `true`, then individual tokens will be output if no shingles are possible.
+
The string to use when joining adjacent tokens to form a shingle.

`fillerToken`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `_` (underscore)
|===
+
The string used to fill in for removed stop words in order to preserve position increments.
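As a hypothetical illustration (the field type name and stop word file are invented for this sketch), a shingle filter that marks removed stop words with `*` might be configured as:

[source,xml]
----
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- remove stop words, leaving position gaps -->
    <tokenizer name="standard"/>
    <filter name="stop" words="stopwords.txt"/>
    <!-- fill each gap with "*" instead of the default "_" -->
    <filter name="shingle" fillerToken="*"/>
  </analyzer>
</fieldType>
----

With this configuration, a phrase such as "the quick fox" (with "the" as a stop word) would produce shingles containing `*` in the position where "the" was removed.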

*Example:*

Default behavior.
127 changes: 127 additions & 0 deletions solr/solr-ref-guide/src/language-analysis.adoc
@@ -1079,6 +1079,7 @@ The languages covered here are:
| <<Italian>>
| <<Irish>>
| <<Japanese>>
| <<Korean>>
| <<Latvian>>
| <<Norwegian>>
| <<Persian>>
@@ -1093,6 +1094,8 @@ The languages covered here are:
| <<Thai>>
| <<Turkish>>
| <<Ukrainian>>
|
|
|===

=== Arabic
@@ -2419,6 +2422,130 @@ Example:
====
--

=== Korean

The Korean (nori) analyzer integrates Lucene's nori analysis module into Solr.
It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic] dictionary to perform morphological analysis of Korean text.

The dictionary was built with http://taku910.github.io/mecab/[MeCab] and defines a format for the features adapted for the Korean language.

Nori also has a user dictionary feature that allows overriding the statistical model with your own entries for segmentation, part-of-speech tags, and readings, without needing to specify weights.
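As an illustrative sketch (the file name and entries are hypothetical), each line of a user dictionary lists a surface form, optionally followed by the segments it should decompose into:

[source,text]
----
# userdict_ko.txt: custom nouns and compounds
c++
세종시 세종 시
----

The tokenizer would then reference this file via its `userDictionary` parameter:

[source,xml]
----
<tokenizer name="korean" userDictionary="userdict_ko.txt" userDictionaryEncoding="UTF-8"/>
----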

*Example*:

[.dynamic-tabs]
--
[example.tab-pane#byname-lang-korean]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer name="korean" decompoundMode="discard" outputUnknownUnigrams="false"/>
<filter name="koreanPartOfSpeechStop" />
<filter name="koreanReadingForm" />
<filter name="lowercase" />
</analyzer>
</fieldType>
----
====

[example.tab-pane#byclass-lang-korean]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
<filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
<filter class="solr.KoreanReadingFormFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
----
====
--


==== Korean Tokenizer

*Factory class*: `solr.KoreanTokenizerFactory`

*SPI name*: `korean`

*Arguments*:

`userDictionary`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Path to a user-supplied dictionary to add custom nouns or compound terms to the default dictionary.

`userDictionaryEncoding`::
+
[%autowidth,frame=none]
|===
|Optional |Default: none
|===
+
Character encoding of the user dictionary.

`decompoundMode`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `discard`
|===
+
Defines how to handle compound tokens. The options are:

* `none`: No decomposition for tokens.
* `discard`: Tokens are decomposed and the original form is discarded.
* `mixed`: Tokens are decomposed and the original form is retained.
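As a rough illustration (assuming the default dictionary segments the compound 세종시 into 세종 + 시), the three modes would tokenize 세종시 approximately as:

[source,text]
----
none:    세종시
discard: 세종, 시
mixed:   세종시, 세종, 시
----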

`outputUnknownUnigrams`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `false`
|===
+
If `true`, unigrams will be output for unknown words.

`discardPunctuation`::
+
[%autowidth,frame=none]
|===
|Optional |Default: `true`
|===
+
If `true`, punctuation will be discarded.

==== Korean Part of Speech Stop Filter

This filter removes tokens that match a set of part-of-speech tags.

*Factory class*: `solr.KoreanPartOfSpeechStopFilterFactory`

*SPI name*: `koreanPartOfSpeechStop`

*Arguments*: None.

==== Korean Reading Form Filter

This filter replaces term text with the Reading Attribute, the Hangul transcription of Hanja characters.

*Factory class*: `solr.KoreanReadingFormFilterFactory`

*SPI name*: `koreanReadingForm`

*Arguments*: None.

[[hebrew-lao-myanmar-khmer]]
=== Hebrew, Lao, Myanmar, Khmer

