Commit

Don't attach things to kakarijoshis
Kimtaro committed Jul 17, 2013
1 parent 19d75df commit b294970
Showing 2 changed files with 19 additions and 2 deletions.
7 changes: 6 additions & 1 deletion lib/providers/mecab_ipadic.rb
@@ -163,6 +163,7 @@ def initialize(text, output)
FUHENKAGATA = '不変化型'
JINMEI = '人名'
MEIREI_I = '命令i'
+KAKARIJOSHI = '係助詞'

# Etc
NA = 'な'
@@ -177,6 +178,7 @@ def words
words = []
tokens = @tokens.find_all { |t| t[:type] == :parsed }
tokens = tokens.to_enum
+previous = nil

# This is becoming very big
begin
@@ -272,7 +274,8 @@ def words
when JODOUSHI
pos = Ve::PartOfSpeech::Postposition

-if [TOKUSHU_TA, TOKUSHU_NAI, TOKUSHU_TAI, TOKUSHU_MASU, TOKUSHU_NU].include?(token[:inflection_type])
+if (previous.nil? || (!previous.nil? && previous[:pos2] != KAKARIJOSHI)) &&
+[TOKUSHU_TA, TOKUSHU_NAI, TOKUSHU_TAI, TOKUSHU_MASU, TOKUSHU_NU].include?(token[:inflection_type])
attach_to_previous = true
elsif token[:inflection_type] == FUHENKAGATA && token[:lemma] == NN
attach_to_previous = true
@@ -338,6 +341,8 @@ def words

words << word
end
+
+previous = token
end
rescue StopIteration
end
14 changes: 13 additions & 1 deletion tests/mecab_ipadic_parse_test.rb
@@ -739,11 +739,23 @@ def test_words
:pos => [Ve::PartOfSpeech::Verb, Ve::PartOfSpeech::Verb],
:extra => [{:reading=>"オシエテ", :transcription=>"オシエテ", :grammar=>nil}, {:reading=>"クダサイ", :transcription=>"クダサイ", :grammar=>nil}],
:tokens => [0..1, 2..2]},
-'教えてください', <<-EOR.split("\n"))
+'教えてください', <<-EOR.split("\n"))
教え 動詞,自立,*,*,一段,連用形,教える,オシエ,オシエ,おしえ/教え,
て 助詞,接続助詞,*,*,*,*,て,テ,テ,,
ください 動詞,非自立,*,*,五段・ラ行特殊,命令i,くださる,クダサイ,クダサイ,,
EOS
EOR
+
+# はない
+assert_parses_into_words(Ve::Parse::MecabIpadic, {:words => ["は", "ない"],
+:lemmas => ["は", "ない"],
+:pos => [Ve::PartOfSpeech::Postposition, Ve::PartOfSpeech::Postposition],
+:extra => [{:reading=>"ハ", :transcription=>"ワ", :grammar=>nil}, {:reading=>"ナイ", :transcription=>"ナイ", :grammar=>nil}],
+:tokens => [0..0, 1..1]},
+'はない', <<-EOR.split("\n"))
+は 助詞,係助詞,*,*,*,*,は,ハ,ワ,,
+ない 助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ,,
+EOS
+EOR

# TODO: xした should parse as adjective?
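
The guard added in the JODOUSHI branch can be read in isolation as a small predicate. The sketch below is a simplified stand-in, not the code from `lib/providers/mecab_ipadic.rb`: the method name `attach_to_previous?`, the constant `ATTACHING_INFLECTION_TYPES`, and the sample token hashes are hypothetical, and only the `特殊・ナイ` inflection type (the one visible in the new test) is listed, whereas the real condition checks five `TOKUSHU_*` constants.

```ruby
# Hypothetical, simplified version of the check added in this commit.
# Tokens are plain hashes using the :pos2 and :inflection_type keys
# that appear in the diff; everything else is illustrative.
KAKARIJOSHI = '係助詞'

# Assumption: only the inflection type seen in the new test is listed here.
ATTACHING_INFLECTION_TYPES = ['特殊・ナイ'].freeze

# An auxiliary verb attaches to the previous token only when its
# inflection type qualifies and the previous token is not a kakarijoshi.
def attach_to_previous?(previous, token)
  return false unless ATTACHING_INFLECTION_TYPES.include?(token[:inflection_type])

  previous.nil? || previous[:pos2] != KAKARIJOSHI
end

wa   = { literal: 'は',   pos2: '係助詞' }
tabe = { literal: '食べ', pos2: '自立' }
nai  = { literal: 'ない', inflection_type: '特殊・ナイ' }

puts attach_to_previous?(wa, nai)   # は + ない stays as two words
puts attach_to_previous?(tabe, nai) # 食べ + ない is merged into one word
```

Because `nil` has no `:pos2`, the `previous.nil?` test has to come first. Once it does, the extra `!previous.nil?` in the committed condition is logically redundant, which is why the sketch omits it.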
